28th February 2019

One more data revolution should be nipped in the bud

At the AAAS annual conference, Dr Allen describes the perils of using machine learning to analyse data in experimental science

Categories: Science & Tech

Photo: The Opte Project @ Wikimedia Commons

Machine learning is all-pervasive. From the facial recognition in your phone to the most groundbreaking astrophysics research, a journey through the scientific and technological landscape of 2019 wouldn’t be complete without repeated mentions of machine learning. When used correctly, it can be a powerful tool. But are the researchers in loco parentis shirking their responsibilities? Rice University statistician Dr Genevera Allen believes so. In a speech to the American Association for the Advancement of Science, she argued that scientists are carelessly leading their machines into bad habits, and that science will pay a price for it.

What is machine learning?

Machine learning is, fundamentally, a statistical technique: a mathematical tool for finding patterns in otherwise unmanageably large datasets. Often bundled in conversation with the more abstract concept of ‘artificial intelligence’, machine learning is, by contrast, well-defined, (theoretically) well-understood, and, arguably, well-overused.

It all starts with data. The hidden currency behind the 21st century, data and the patterns therein hold considerable worth in the right hands. The primary goal of machine learning is to find useful patterns in large datasets. The latest generation of experiments produces more data than can reasonably be handled, and machine learning is a powerful tool for extracting the important information from it.

The key word is classification. The objective of many machine learning algorithms is to apply a set of labels to a set of data. To do this, the algorithms must first be ‘trained’ on a dataset of known, labelled data. In this step, the parameters of the algorithm are tuned until it returns the known result for each of the known data points.
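To make the training step concrete, here is a minimal sketch of one very simple classifier, a perceptron, written in plain Python. The six labelled points are invented for illustration, and the perceptron is only one of many possible algorithms; nothing here is drawn from Allen’s talk. The parameters (two weights and a bias) are nudged every time the algorithm mislabels a known point, and training stops once every known point is labelled correctly.

```python
# Hypothetical labelled data: (feature_1, feature_2) -> label 0 or 1.
training_data = [
    ((1.0, 1.0), 0),
    ((2.0, 1.5), 0),
    ((1.5, 0.5), 0),
    ((4.0, 4.5), 1),
    ((5.0, 4.0), 1),
    ((4.5, 5.5), 1),
]


def predict(weights, bias, point):
    """Label a point 1 if the weighted sum of its features is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, point))
    return 1 if total > 0 else 0


def train(data, max_epochs=1000, learning_rate=0.1):
    """Tune the weights and bias until every known point gets its known label."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for point, label in data:
            error = label - predict(weights, bias, point)
            if error != 0:
                # Nudge the parameters towards the correct answer.
                mistakes += 1
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, point)]
                bias += learning_rate * error
        if mistakes == 0:
            break  # every known point is now labelled correctly
    return weights, bias


weights, bias = train(training_data)
training_accuracy = sum(
    predict(weights, bias, point) == label for point, label in training_data
) / len(training_data)
print("accuracy on the training data:", training_accuracy)
```

Note that a perfect score on the training data says nothing, by itself, about how the algorithm will fare on data it has not seen.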

A major criticism of machine learning is that it is a black box. Data goes in, results come out, and what happens in between is incomprehensible. It is difficult to argue with this standpoint.

There are several traps lying in wait for any prospective machine learning algorithm. One of the most perilous is ‘over-fitting’. This is when the algorithm relies on features of its training dataset that are particular to that individual dataset, rather than shared by the data as a whole. Through this, coincidental similarities in the initial data can render an algorithm useless.
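Over-fitting can be illustrated with a small sketch in plain Python. Everything in it is an assumption made for illustration: the data are synthetic, one in five labels is deliberately flipped to act as noise, and the ‘memoriser’ is a nearest-neighbour lookup standing in for any model flexible enough to learn the quirks of its training set.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def make_dataset(n):
    """Points in [0, 1). The true label is 1 when x > 0.5, but 20% are flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x > 0.5 else 0
        if rng.random() < 0.2:
            label = 1 - label  # noise: a coincidental, misleading label
        data.append((x, label))
    return data


train_set = make_dataset(200)
test_set = make_dataset(500)  # fresh data the models have never seen


def memoriser(x):
    """Over-fitted model: copy the label of the closest training point, noise and all."""
    nearest_x, nearest_label = min(train_set, key=lambda pair: abs(pair[0] - x))
    return nearest_label


def simple_rule(x):
    """A simple rule that ignores the noise (threshold assumed known, not learned)."""
    return 1 if x > 0.5 else 0


def accuracy(model, data):
    return sum(model(x) == label for x, label in data) / len(data)


print("memoriser, training data:  ", accuracy(memoriser, train_set))
print("memoriser, unseen data:    ", accuracy(memoriser, test_set))
print("simple rule, training data:", accuracy(simple_rule, train_set))
print("simple rule, unseen data:  ", accuracy(simple_rule, test_set))
```

The memoriser scores perfectly on the data it was trained on, because it has learned the noise along with the signal, and should do noticeably worse on fresh data. The simple rule is never perfect, but it should perform about equally well on both.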

Another trap lies in the fact that these algorithms are designed to find patterns, and do not have the ability to indicate that there may, in fact, be no underlying pattern in a set of data. As a result, they can be prone to finding patterns where there are none.
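A short sketch shows how readily this happens. The example below uses k-means, a common clustering algorithm chosen here purely for illustration, and feeds it 300 numbers drawn at random with no group structure whatsoever. Asked for three clusters, it returns three clusters.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Pure noise: 300 values drawn uniformly from [0, 1), with no groups at all.
points = sorted(rng.random() for _ in range(300))


def k_means_1d(values, k, iterations=50):
    """Plain k-means (Lloyd's algorithm) on a sorted list of numbers."""
    # Start from k evenly spaced data points.
    centroids = [values[(2 * i + 1) * len(values) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assign every value to its nearest centroid.
        labels = [min(range(k), key=lambda j: abs(v - centroids[j]))
                  for v in values]
        # Move each centroid to the mean of the values assigned to it.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ever empties
                centroids[j] = sum(members) / len(members)
    return centroids, labels


centroids, labels = k_means_1d(points, k=3)
cluster_sizes = [labels.count(j) for j in range(3)]
print("cluster centres:", [round(c, 2) for c in centroids])
print("cluster sizes:  ", cluster_sizes)
```

Nothing in the output warns that the ‘clusters’ are an artefact of the request rather than a property of the data. Judging that requires a separate check, such as repeating the analysis on independent data.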

Allen argues that this type of error is more common than many scientists realise. Citing studies on cancer data as an example, she discusses how the clustering of data – the classification of data points as similar – is wildly different, and indeed incompatible, between studies. This alarming lack of reproducibility raises questions about the reliability of these studies, and of the many using similar methodology.

A simple fix would surely be the repetition of experiments. Deadlines, funding issues, and agendas all too often get in the way of this: the real world is impinging on scientific best practice. According to Allen, the consequences of this lack of repetition often don’t rear their heads until two studies are later compared by another researcher. Describing this as a “reproducibility crisis,” she is not afraid to apportion a large part of the blame to the misuse of machine learning techniques.

All of this reads like just another chapter in the modern day’s struggle with data and how to handle it. The 21st century is in the midst of what has been described as the fourth industrial revolution, the age of big data. Despite the promise, data is running the risk of becoming a dirty word. Malpractice is seeping through every facet of data science, and Allen’s accusations are just the latest in a series of examples.

Since Cambridge Analytica’s Lil Uzi Vert-alike Christopher Wylie revealed the extent of Facebook’s information harvesting activities, data has been at the forefront of the public imagination. But this is not new. Ben Goldacre’s Bad Science was a bestseller in 2009, a decade ago.

Toby James

Physics student at the University of Manchester. Science Fan. Follow me on twitter at @tobyjamez.
