Modern statistical methods for non-Mendelian disease prediction and Internet data

Speaker: Donal McMahon (Google, Dublin)

Time: 3:00PM
Date: Thu 31st March 2011

Location: Statistics Seminar Room, L550 Library building

This talk shall comprise two distinct sections. Firstly, I will discuss pre-implantation disease prediction for parents contemplating IVF (in vitro fertilisation) treatment, and the models we developed to combine multiple data sources. The second part of the talk will be a more general discussion of the role of statistics at Google. Here, I will give an overview of various applications, including model fitting for extremely large datasets (hundreds of billions of rows), the Google experimental framework, the early prediction of economic signals from search data, and the role of statistics in developing self-driving cars.

In vitro fertilisation has become increasingly popular in the past decade. It now accounts for over 1% of births in the US. However, there are increased risks of genetic disease. This talk shall concentrate on the problem of combining clinical research for diseases with no direct inheritance pattern (non-Mendelian). For most non-Mendelian diseases there exists no single clinical study which considers all of the relevant risk factors. Generally there are hundreds of published studies which investigate the genetic and environmental variables. Each study measures a particular factor or combination of factors, but is missing data on the combination of all possibly relevant variables, so that taken alone it produces underdetermined results. By synthesising these studies into a single meta-analysis, disease prediction can be carried out across the full set of risk factors. Here I will present two solutions to this problem: a likelihood-based approach using the EM algorithm and loglinear models, and a Bayesian Data Augmentation alternative. These general models will then be extended for data-specific problems, such as retrospective sampling, conditional slicing and multiple perspective linked tables. Variance estimation techniques, model-selection criteria and tests of heterogeneity are also derived.

Between the dawn of civilisation and 2003, the human race created only five exabytes of data; now we're doing that every two days. The role of statistics changes in this "data obese" world, and methods need to be developed both to filter and to understand all of this information. In this talk I will speak about some of the problems we encounter at Google, and the initial attempts we have made at providing adequate solutions. We aim to make data-driven decisions in how we configure the basic search algorithm, in how we choose the ads to show and in all other projects. Statisticians therefore fulfil a pivotal role in many of these endeavours, and I will outline some of the current statistical research.

(This talk is part of the Statistics and Actuarial Science series.)