Avoiding bad science - guidelines for machine learning validation in biology

September 8, 2021

Gianluca Pollastri is an Associate Professor in the School of Computer Science and Informatics at University College Dublin. He is a coauthor of a community paper recently published in the journal Nature Methods, entitled DOME: recommendations for supervised machine learning validation in biology.


It is perhaps no surprise that biology, the scientific study of life, is drowning in data. 

“In molecular biology, especially, you have humongous amounts of data,” says computer scientist Gianluca Pollastri, of the branch that deals with the structure and function of molecules like DNA, RNA and proteins.

“If you were to write out all of the data just for the DNA that we know the sequence of, you would fill up a skyscraper 200 storeys high, taller than any skyscraper in the world. And that data is actually growing by around 40% or 50% every year so it’s heading for the stratosphere. Essentially, we can't look at it with our own eyes. We need to have something that's much faster than ourselves to make sense of it.”

This is where Machine Learning (ML), Gianluca’s area of expertise, comes in. ML is an artificial intelligence technique that enables software programs to find patterns in data without being explicitly programmed. The resulting ML models, or prediction algorithms, can then make statistical predictions about similar new data. Machine learning could revolutionise the whole field of biology by accelerating the path to efficient drug development - saving time, money and lives.

But this exciting, interdisciplinary area is not without its potential pitfalls. Gianluca and others are devising best-practice guidelines to mitigate these risks - and later we will get to those. 

First, a look at how the problems arise.

Twenty years ago, to work in ML you needed to be “a bit of a wizard”, mastering maths and writing your own highly technical code. Now, with the availability of open-source ML frameworks, “you can almost get it off the shelf”.

While Gianluca is in favour of making code freely available to further science, the problem arises when “some people might not know what they are doing”. 

He explains: “There is this thing that you can do with machine learning where you can repeat the experiments a bunch of times; it’s normal practice. A lot of the models we use are black boxes - you’ve got the data, you’ve got your model - and you throw the model at the data and get the results. And sometimes you don't like the results and you say, ‘Okay, maybe I've done something wrong’. There are a lot of little knobs within machine learning that you can tweak. So you start tweaking and you repeat the experiment.”

If you prefer the next result, you might apply a misguided rationale or bias to explain why it is more accurate than the previous result. The reality may be different.  

“The fact is that machine learning itself is a stochastic variable, which means that it has got a certain amount of randomness that is down to luck. The result of a machine learning experiment is kind of like buying a lottery ticket. If you repeat your experiment enough times - and if you buy a lottery ticket enough times - you're going to win.”
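
The lottery-ticket effect can be illustrated with a small simulation. The sketch below is not taken from Gianluca's work or from the DOME paper; it is a minimal example under stated assumptions. The "model" has no skill at all and simply guesses binary labels at random, and the function names (run_experiment, summarise) and the test-set size of 100 are hypothetical choices made for illustration. It shows how reporting only the best of many repeated runs can make a skill-free model look better than chance.

```python
import random


def run_experiment(seed, n_test=100):
    """One 'experiment': a model with no real skill guesses binary labels.

    Assumption for illustration: both the labels and the guesses are fair
    coin flips, so the true expected accuracy is exactly 0.5.
    """
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_test)]
    guesses = [rng.randint(0, 1) for _ in range(n_test)]
    correct = sum(g == y for g, y in zip(guesses, labels))
    return correct / n_test


def summarise(n_runs, n_test=100):
    """Repeat the experiment n_runs times, once per seed 0..n_runs-1."""
    accuracies = [run_experiment(seed, n_test) for seed in range(n_runs)]
    return {
        "first": accuracies[0],              # what a single honest run reports
        "mean": sum(accuracies) / n_runs,    # average over all runs
        "best": max(accuracies),             # what cherry-picking reports
    }


if __name__ == "__main__":
    for n_runs in (1, 10, 100, 1000):
        s = summarise(n_runs)
        print(
            f"runs={n_runs:5d}  mean={s['mean']:.3f}  best={s['best']:.3f}"
        )
```

In this sketch the mean accuracy stays close to chance, while the best run can only stay the same or rise as more runs are added. Reporting the spread over all runs, rather than the single best one, is one way to keep the luck visible.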

Meanwhile the exponential rise in data has coincided with an explosion in the number of academic papers covering the crossover field of biology and ML. This compounds the problem. 

“We've had more than 9000 machine learning publications within biology in the last year. A lot of them are good but it's very easy to slip in something that's actually bad science. The problem is, you've got machine learning, which is a highly technical science, and then you've got biology. It’s very hard to find a reviewer that understands both sides of it. Chatting with some colleagues we realised that as reviewers and editors of journals, we'd seen a lot of bad papers. And the problems fell into a number of categories. So we decided to write a paper addressing that, which was published in Briefings in Bioinformatics.”

This paper addressed the risk of “introducing unexpected biases” in machine learning bioinformatics, “which may lead to an overestimation of the performance”. 

Gianluca and his colleagues then decided to expand on their subject. Working with a European consortium called Elixir, they formed a fifty-member machine learning focus group. This group liaises with “both industrial entities and academic umbrella groups” to brainstorm best-practice options for ML in biology research.  

“We wanted to come up with some recommendations which would help weed out bad papers and make sure that people know what they're doing when they're reviewing them. But more than that, we want to create standards as well.”

They founded DOME, an acronym for Data, Optimisation, Model and Evaluation in machine learning. DOME is a set of community-wide guidelines, recommendations and checklists spanning these four areas which aim to help establish standards of supervised machine learning validation in biology. 

“The recommendations are formulated as questions to anyone wishing to pursue implementation of a machine learning algorithm. Answers to these questions can be easily included in the supplementary material of published papers.”

This Q&A will help support researchers in curating and designing high quality datasets too. Elixir also advocates for making fully public the data used in experiments.

“In general, we want to open a dialogue about how we are publishing science within biology when we are using machine learning. Essentially, we are setting up a permanent observatory of this field to see whether we can come up with a fully standardised way of dealing with data.”


This article was brought to you by UCD Institute for Discovery - fuelling interdisciplinary collaboration.