Analyzing Longitudinal Metabolomic Data

Speaker: Gift Nyamundanda

Time: 1:00PM
Date: Fri 6th May 2011

Location: Statistics Seminar Room - L550 Library building

Metabolomics is the study of small molecules, or metabolites, present in biological samples. Data sets from metabolomic studies are typically high-dimensional and complex. In a longitudinal metabolomic study, multiple metabolites are measured from each subject at multiple time points. Typically the number of samples $n$ in such studies is much smaller than the number of variables $p$, i.e. $n \ll p$.
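One practical consequence of having far fewer samples than variables is that the sample covariance matrix of the variables is rank-deficient: after centering, its rank is at most one less than the number of samples. The short sketch below illustrates this on random data. The sizes (10 samples, 200 variables) and the use of Gaussian noise are purely illustrative assumptions, not values from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only to illustrate n << p.
n, p = 10, 200
X = rng.normal(size=(n, p))  # rows are samples, columns are variables

# Sample covariance of the p variables (a p x p matrix).
S = np.cov(X, rowvar=False)

# Centering removes one degree of freedom, so rank(S) <= n - 1 < p.
rank_S = np.linalg.matrix_rank(S)
print(S.shape, rank_S)
```

Because the covariance matrix cannot be full rank in this setting, methods that summarise the data in a small number of latent dimensions are a natural fit.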

Traditional principal component analysis (PCA) is currently the most widely used statistical technique for analyzing metabolomic data. However, its application to longitudinal metabolomic studies is limited because it assumes that the repeated measurements are independent and it is not based on a statistical model. Probabilistic principal component analysis (PPCA), detailed in Nyamundanda et al. (2010), addresses some of the limitations of PCA. Here, we propose an extension of PPCA, called dynamic PPCA, which models metabolomic data while taking into account the correlation due to repeated measurements. Dynamic PPCA reduces the dimension of the data by defining the $p$-dimensional observation $\underline{x}_{im}$, i.e. the metabolomic spectrum for sample $i$ at time point $m$, as a linear transformation of the lower $q$-dimensional latent variable $\underline{u}_{im}$:

$$\underline{x}_{im} = W_m \underline{u}_{im} + \underline{\mu}_m + \underline{\epsilon}_{im}$$

where $W_m$ and $\underline{\mu}_m$ are the $p \times q$ loadings matrix and the mean of the data at time point $m$, respectively, and $\underline{\epsilon}_{im}$ is multivariate Gaussian noise for sample $i$ at time point $m$, i.e. $p(\underline{\epsilon}_{im}) = \mathrm{MVN}_p(\underline{0}, \sigma^2_m I)$, where $I$ denotes the identity matrix. The dynamic PPCA model accounts for the correlation in repeated measurements by assuming that $\log(\sigma^2_m)$ follows a stationary autoregressive model of order 1, centered around mean $\nu$ with persistence parameter $\phi$:

$$\log(\sigma^2_m) = \nu + \phi\,[\log(\sigma^2_{m-1}) - \nu] + r_m$$

The innovations $r_m$ are assumed to be independent, with $r_m \sim N(0, \upsilon^2)$.
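To make the generative structure concrete, the following is a minimal simulation sketch of the model as described in this abstract: an order-1 autoregressive process on the log noise variance, followed by a linear map from latent scores to observations at each time point. It is not the speaker's implementation. The function name `simulate_dynamic_ppca`, all parameter values, the randomly drawn loadings and means, and the standard-normal distribution for the latent variables (the usual PPCA choice) are assumptions made for illustration.

```python
import numpy as np


def simulate_dynamic_ppca(n=8, p=50, q=2, M=5, nu=-1.0, phi=0.8,
                          upsilon=0.3, seed=0):
    """Simulate data from the model sketched in the abstract.

    All default values are hypothetical. Requires |phi| < 1 (stationarity).
    Returns X with shape (M, n, p), latent scores U with shape (M, n, q),
    and the noise variances sigma2 with shape (M,).
    """
    rng = np.random.default_rng(seed)

    # AR(1) on log(sigma_m^2), centered at nu with persistence phi.
    # The first value is drawn from the stationary distribution,
    # whose standard deviation is upsilon / sqrt(1 - phi^2).
    log_s2 = np.empty(M)
    log_s2[0] = nu + rng.normal(0.0, upsilon / np.sqrt(1.0 - phi**2))
    for m in range(1, M):
        innovation = rng.normal(0.0, upsilon)
        log_s2[m] = nu + phi * (log_s2[m - 1] - nu) + innovation
    sigma2 = np.exp(log_s2)

    # Hypothetical loadings W_m (p x q) and means mu_m for each time point.
    W = rng.normal(size=(M, p, q))
    mu = rng.normal(size=(M, p))

    # Assumption: latent scores are standard normal, as in standard PPCA.
    U = rng.normal(size=(M, n, q))

    # Isotropic Gaussian noise with variance sigma_m^2 at time point m.
    eps = rng.normal(size=(M, n, p)) * np.sqrt(sigma2)[:, None, None]

    # Observation = loadings times latent score, plus mean, plus noise.
    X = np.einsum('mpq,mnq->mnp', W, U) + mu[:, None, :] + eps
    return X, U, sigma2


X, U, sigma2 = simulate_dynamic_ppca()
print(X.shape, U.shape, sigma2.shape)
```

A persistence parameter close to 1 makes the noise variance change slowly between consecutive time points, while a value near 0 makes it fluctuate almost independently around its central level.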

This model allows us to observe how the positions of subjects in the latent principal subspace change over time and to identify the spectral regions responsible for the structure in the data at each time point. The usefulness and applicability of dynamic PPCA are demonstrated on a longitudinal metabolomic study of urine samples taken from animals over 15 days.

(This talk is part of the Working Group on Statistical Learning series.)