One more data revolution should be nipped in the bud
By Toby James
Machine learning is all-pervasive. From the facial recognition in your phone to the most groundbreaking of research astrophysics, a journey through the scientific and technological landscape of 2019 wouldn’t be complete without repeated mentions of machine learning. When used correctly, machine learning can be a powerful tool. But are the researchers in loco parentis shirking their responsibilities? Rice University statistician Dr Genevera Allen believes so. In a speech to the American Association for the Advancement of Science, she argued that scientists are callously leading their machines into bad habits, and science will pay a price for it.
What is machine learning?
Machine learning is, fundamentally, a statistical technique: a mathematical tool for finding patterns in otherwise unmanageably large datasets. Often bundled in conversation with the more abstract concept of ‘artificial intelligence’, machine learning is by contrast well-defined, (theoretically) well-understood, and, arguably, well-overused.
It all starts with data. The hidden currency behind the 21st century, data and the patterns therein hold considerable worth in the right hands. The primary goal of machine learning is to find useful patterns in large datasets. The latest generation of experiments produces more data than can be reasonably handled, and machine learning is a powerful tool that can extract important information from it.
The key word is classification. The objective of many machine learning algorithms is to apply a set of labels to a set of data. To do this, the algorithms must first be ‘trained’ on a dataset of known, labelled data. In this step, the parameters of the algorithm are tuned until it returns the known label for each of the known examples.
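To make the idea of training concrete, here is a minimal sketch in Python. It is an illustration only, not any particular research method: the measurements, the labels ‘small’ and ‘large’, and the single tunable parameter (a threshold) are all invented for the example.

```python
# Hypothetical labelled training data: (measurement, label) pairs.
training_data = [(1.0, "small"), (1.5, "small"), (2.2, "small"),
                 (3.1, "large"), (3.8, "large"), (4.4, "large")]

def classify(x, threshold):
    """Label a measurement by comparing it with one tunable parameter."""
    return "large" if x >= threshold else "small"

def accuracy(threshold, data):
    """Fraction of labelled examples the classifier gets right."""
    correct = sum(classify(x, threshold) == label for x, label in data)
    return correct / len(data)

def train(data):
    """Try each candidate threshold and keep the one that best matches the labels."""
    xs = sorted(x for x, _ in data)
    # Candidate thresholds: midpoints between neighbouring measurements.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda t: accuracy(t, data))

threshold = train(training_data)
print("learned threshold:", threshold)
print("accuracy on training data:", accuracy(threshold, training_data))
print("label for a new measurement of 5.0:", classify(5.0, threshold))
```

Real algorithms tune thousands or millions of parameters rather than one, but the principle is the same: adjust until the known answers come out right, then apply the result to new data.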
A major criticism of machine learning is that it is a black box. Data goes in, results come out, and what happens in between is incomprehensible. It is difficult to argue with this standpoint.
There are several traps lying in wait for any prospective machine learning algorithm. One of the most perilous is ‘over-fitting’. This is when the algorithm relies on features of its training dataset that are particular to that individual dataset, and not shared by the data as a whole. In this way, coincidental similarities in the training data can render an algorithm useless on anything new.
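Over-fitting can be sketched in a few lines of Python. The example below is a toy under stated assumptions: the data are invented, a fifth of the labels are deliberately flipped to mimic noise, and the two models (`memoriser` and `simple_rule`) are hypothetical stand-ins rather than anything used in the studies Allen describes.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def make_data(n, noise=0.2):
    """Points in [0, 1): the true label is x > 0.5, but a fraction `noise` of labels are flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        data.append((x, label))
    return data

train_set = make_data(50)
test_set = make_data(1000)

def memoriser(x):
    """Over-fitted model: copy the label of the nearest training point, noise and all."""
    nearest = min(train_set, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def simple_rule(x):
    """A simple model that ignores the quirks of the training set."""
    return x > 0.5

def accuracy(model, data):
    """Fraction of examples for which the model returns the recorded label."""
    return sum(model(x) == label for x, label in data) / len(data)

print("memoriser   train:", accuracy(memoriser, train_set), "test:", accuracy(memoriser, test_set))
print("simple rule train:", accuracy(simple_rule, train_set), "test:", accuracy(simple_rule, test_set))
```

The memoriser scores perfectly on the data it was trained on because it has learned the noise along with the signal, and it loses that advantage on fresh data. A perfect training score is therefore no evidence that a model has found anything real.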
Another trap lies in the fact that these algorithms are designed to find patterns, and do not have the ability to indicate that there may, in fact, be no underlying pattern in a set of data. As a result, they can be prone to finding patterns where there are none.
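A small Python sketch shows how this can happen. It assumes a deliberately bare-bones clustering routine (a one-dimensional version of the common k-means method, written here as the hypothetical function `two_means`) and feeds it numbers that are random by construction, so there is no pattern to find.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable
# Structureless data: 200 numbers scattered uniformly at random between 0 and 1.
points = [rng.random() for _ in range(200)]

def two_means(data, steps=20):
    """Minimal 1-D k-means with k = 2; it always reports two groups."""
    centres = [min(data), max(data)]
    for _ in range(steps):
        groups = [[], []]
        for x in data:
            # Assign each point to its nearest centre.
            nearest = 0 if abs(x - centres[0]) <= abs(x - centres[1]) else 1
            groups[nearest].append(x)
        # Move each centre to the mean of its group.
        centres = [sum(g) / len(g) for g in groups]
    return centres, groups

centres, groups = two_means(points)
print("cluster centres:", centres)
print("cluster sizes:", [len(g) for g in groups])
```

The routine dutifully returns two tidy ‘clusters’ even though the data contain none. Nothing in its output warns the user that the split is arbitrary.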
Allen argues that this type of error is more common than many scientists realise. Citing studies on cancer data as an example, she discusses how the clustering of data – the classification of data points as similar – is wildly different, and indeed incompatible, between studies. This alarming lack of reproducibility raises questions about the reliability of these studies, and of the many using similar methodology.
A simple fix would surely be the repetition of experiments. Deadlines, funding issues, and agendas all too often get in the way of this: the real world is impinging on scientific best practice. According to Allen, the consequences of this lack of repetition often don’t rear their heads until two studies are later compared by another researcher. Describing this as a “reproducibility crisis,” she is not afraid to apportion a large part of the blame to misuse of machine learning techniques.
All of this reads like just another chapter in the modern day’s struggle with data and how to handle it. The 21st century is in the midst of what has been described as the fourth industrial revolution, the age of big data. Despite its promise, data is running the risk of becoming a dirty word. Malpractice is seeping through every facet of data science, and Allen’s accusations are just the latest in a series of examples.
Since Cambridge Analytica’s Lil Uzi Vert-alike Christopher Wylie revealed the extent of Facebook’s information harvesting activities, data has been at the forefront of the public imagination. But this is not new. Ben Goldacre’s Bad Science was a bestseller in 2009, a decade ago.