Statistics: the key to unlocking the story in data
In our Zoom for Thought on July 27th, 2021, UCD Discovery director Prof. Patricia Maguire spoke with Claire Gormley, full professor at UCD School of Mathematics and Statistics, about “Statistics: the key to unlocking the story in data”. In case you missed it, here are our Top Takeaway Thoughts and a link to the video.
Statistics is the science of data and Claire develops statistical tools and methods that are principled, rigorous and applicable to different data sets. Each data set is different in shape, form and scale. “And there's a story within that. Very often the interdisciplinary researcher wants to unlock or find out what's inside that data set. Very few data sets are the same. You very often need different and bespoke tools - like different keys - to unlock that story in the data. So we use statistics and mathematics to develop those algorithms and those statistical methods to analyse the data sets.”
Statistics is a full-spectrum discipline. Theoretical statistics looks at proofs and theorems, while applied statistics involves applying existing statistical tools. Methodological statistics, which is more Claire’s line of work, is about developing statistical methods. Some data sets are so large and complex that “entry-level” statistical methods, such as the t-test, do not suffice.
“There are so many different varieties of data, and if there is no existing statistical method to analyse a data set, then that is very much what statistics research does; it is the development of those new methods.”
The scale of data now available “was unfathomable, even ten years ago… But that's exactly what stimulates and pushes the boundaries of statistics”.
The development of statistical methodology is “predominantly” inspired by interdisciplinary research. A researcher from any discipline might come to Claire for help analysing a data set.
“You realise that there are no methods to analyse that particular form of data, and that motivates our development of the statistical method - or the key - to unlock the story. You could be working with a biologist, a geneticist, a historian. We literally work with everyone right across the humanities, social sciences, and sciences. You're working with maths, statistics, computer programming, communications. It is very important in statistics to communicate your results through graphics and visualisation. So it calls on a whole range of skills, and is very often stimulated and enriched by that interdisciplinary and collaborative experience.”
Data Science and Statistics
Claire defines data science as encompassing “everything from the designing of how you're going to collect your data to the collection of it, storage and the data warehousing of it, the extraction of it, the analysis, the presentation and the write-up. And that involves huge numbers of different disciplines and contributions”. Statistics is a “fundamental piece of that data science pipeline”. It comes in at three points: at the very beginning of the process, with experimental design, “when you're thinking about how you collect your data”; in the middle, “when you're analysing the data”; and at the end, when taking the story out of the data and presenting and communicating it. If a statistician is not involved from the outset, “then what we get out at the end is somewhat meaningless and we have no rigour or true confidence in the resulting inference. Statistical thinking needs to be involved, not when the data are gathered and you come to a statistician saying, ‘Can you help me?’ You need a statistician before you even think about designing the study. You need a statistician on day minus ten.”
A key part of working with data and being statistically literate is getting the message of uncertainty across. While researchers can make observations, “we are never certain data are concrete. There's uncertainty in perhaps the way they were collected, in the sample from which they were collected”. The way in which you collect your data can allow you to infer causality, but establishing causality “is a hugely difficult thing to do, particularly when you have observational data, which we are working with more and more these days”. In a paper in the Journal of the American Statistical Association, the famous statistician George Box wrote, ‘Since all models are wrong the scientist must be alert to what is importantly wrong’.
Claire “would now replace the word ‘scientist’ with a more broad spectrum of disciplines”. Researchers must ask themselves, “What is the uncertainty in this result based on what may be wrong in my assumptions? Does that really matter in the context of the decision I'm trying to make, or the policy I'm trying to implement? Which of those assumptions being wrong is actually important?”
Cybersecurity and Apples
Claire mentions the work of two “excellent” colleagues in her research group. Former PhD student Damien McParland developed a software package based on an approach to clustering different types of data. He now works at a cybersecurity firm in Dublin. Dr Silvia D'Angelo developed another software package during a collaboration with Lorraine Brennan, full professor of Human Nutrition at UCD. “Lorraine and her research group developed a feeding study that looked at the relationship of consumed quantities of apple with the biomarkers. Our task was to infer the actual quantity of apples given the levels of biomarkers. Silvia created software packages and a web interface that allows a user to come along and input their data and infer these apple quantities”.
This article was brought to you by UCD Institute for Discovery - fuelling interdisciplinary collaboration.
Claire Gormley graduated with a PhD in Statistics from Trinity College Dublin and spent a period of her doctoral studies as a Visiting Scholar in the Department of Statistics and Centre for Statistics and the Social Sciences at the University of Washington, Seattle, USA. She joined UCD in 2006 and is now a full professor in the UCD School of Mathematics and Statistics. Claire is the UCD director of the Science Foundation Ireland (SFI) Centre for Research Training in Foundations of Data Science (www.data-science.ie), a large-scale, cohort-based PhD programme jointly delivered with the University of Limerick and Maynooth University. She is a Principal Investigator in the SFI VistaMilk research centre and a Funded Investigator in the SFI Insight Centre for Data Analytics.
Claire's research develops bespoke statistical methods to analyse complex data, predominantly motivated by problems in applied areas. She has developed a broad array of statistical methods inspired by interdisciplinary research problems in areas such as metabolomics, computational biology, network science and social science. Her open-source R (www.r-project.org) software packages facilitate use of the developed methods by the wider community.
Claire is an Associate Editor for the Annals of Applied Statistics, was awarded Chartered Statistician (CStat) status by the Royal Statistical Society and was an elected member of the Council of the Royal Statistical Society 2017-2020.