General Linear Models (fitting)

A tutorial about data analysis using R (Website Version)

Author: Jon Yearsley

Affiliation: School of Biology and Environmental Science, UCD

Published: January 1, 2024


How to Read this Tutorial

This tutorial is a mixture of R code chunks and explanations of the code. The R code chunks will appear in boxes.

Below is an example of a chunk of R code:

# This is a chunk of R code. All text after a # symbol is a comment
# Set working directory using setwd() function
setwd('Enter the path to my working directory')

# Clear all variables in R's memory
rm(list=ls())    # Standard code to clear R's memory

Sometimes the output from running this R code will be displayed after the chunk of code.

Here is a chunk of code followed by the R output

2 + 4            # Use R to add two numbers
[1] 6

Objectives

The objectives of this tutorial are:

  1. To introduce general linear models
    • As statistical models for describing a population
    • As a tool for estimating quantities of interest and making decisions from data
  2. To introduce the concept of response and explanatory variables
  3. To explain how a general linear model can be divided into deterministic and random parts
  4. To develop statistical models for a hypothesis (H1) and a null-hypothesis (H0)
  5. To introduce the concept of model parameters
  6. To introduce R’s lm() function for fitting a general linear model to data
  7. To demonstrate the use of lm() by fitting models to cortisol data in the WOLF.CSV file

Statistical modelling

Review this online lesson introducing statistical models before continuing.

Online Lesson: Statistical Modelling

This lesson is at https://www.ucd.ie/ecomodel/OnlineLessons/lesson4_statisticalmodelling_Website.html

A story about wolves

Every winter in Canada hunters shoot and trap wolves for their pelts. This hunting reduces the population size of wolves, but it also has many other knock-on consequences. For example, hunting removes individuals from social groups, which disrupts the wolves’ social hierarchy.

Figure: Timber wolves fighting over food (Photo: Martin Cathrae).

Questions about the world

Biologists have many questions about the impact of such hunting upon the wolves:

  • Does hunting change a wolf pack’s behaviour?
  • Do hunted wolves show signs of stress?
  • Does hunting impact a wolf pack’s individual reproductive rate?
  • Are all wolves affected equally by hunting?

And the list goes on.

We’ll focus on one question:

Do hunted wolves show signs of stress?

We happen to have a dataset that is relevant to this question: the WOLF.CSV data. This file contains data on individual wolves from three populations: population 1 experienced almost no hunting, population 2 was heavily hunted, and population 3 was a captive population. The dataset also contains cortisol concentrations measured from hair samples of individual wolves. Cortisol is a physiological marker of stress.

To study this question we have:

  1. A response variable: Quantitative data on the main topic of interest (stress) in the form of individual cortisol concentrations (units of pg/mg).
  2. An explanatory variable: Data that might explain some of the variation in the response. In this case hunting intensity may help explain wolf stress levels. It is a qualitative variable (the population identity).

Hypotheses

We’re going to write our question as a testable statement, which is called a hypothesis. It is very important that this hypothesis is testable so that we can use our data to quantitatively test the hypothesis.

Start with a hypothesis about the effect of hunting on cortisol concentrations, which we’ll call the alternative hypothesis (H1):

  • H1: Wolves in population 2 (heavily hunted) have average cortisol levels that are higher than wolves in population 1 (almost no hunting).

We’ll also write a hypothesis that is a statement of no effect, which we’ll call the null hypothesis (H0):

  • H0: Wolves in population 2 have the same average cortisol levels as wolves in population 1.

Testable predictions

If these hypotheses are fit for purpose we should be able to generate a testable prediction (or several predictions) from each hypothesis.

For example:

  • H1, Prediction 1: The concentration of cortisol in a sample of wolf hair from population 2 will be higher, on average, than in a sample of hair from population 1.
  • H1, Prediction 2: The concentration of cortisol in blood samples from population 2 will be higher, on average, than in blood samples from population 1.
  • H0, Prediction 1: On average, the concentration of cortisol in samples of wolf hair from populations 1 and 2 will be the same.
  • H0, Prediction 2: On average, the concentration of cortisol in blood samples from populations 1 and 2 will be the same.

Importing the data

First we import our data (our sample, which gives partial information about our system). We make sure that the data type of the Population variable is set as qualitative using the function as.factor().

wolf = read.table('WOLF.CSV', header=T, sep=',')  # Import wolf data set
wolf$Population = as.factor(wolf$Population)      # Set Population as a qualitative variable

We will focus on populations 1 and 2 (light and heavy hunting respectively), so let’s subset our data to contain only these two populations.

wolf12 = droplevels(subset(wolf, Population!=3))  # Extract data from population 1 & 2

Notice that we have used the droplevels() function to ‘update’ the levels of the factors in the wolf12 data frame. This droplevels() function will force R to forget about the level for population 3.
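
If you want to check this, the levels() function shows which levels R currently associates with the Population factor (a quick check, assuming the code above has been run):

# Check which levels remain after using droplevels()
levels(wolf12$Population)      # Should now show only levels "1" and "2"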

Exploratory data analysis

After importing the data it’s important to start building a picture of the data using numerical and graphical summaries.

Numerical data exploration

First let’s look at a summary of the data

summary(wolf12)        # Numerical summary of the data
   Individual         Sex            Population    Colour         
 Min.   :  1.00   Length:148         1: 45      Length:148        
 1st Qu.: 37.75   Class :character   2:103      Class :character  
 Median : 74.50   Mode  :character              Mode  :character  
 Mean   : 74.50                                                   
 3rd Qu.:111.25                                                   
 Max.   :148.00                                                   
                                                                  
     Cpgmg           Tpgmg            Ppgmg      
 Min.   : 4.75   Min.   : 3.250   Min.   :12.76  
 1st Qu.:12.16   1st Qu.: 4.378   1st Qu.:19.50  
 Median :15.38   Median : 5.030   Median :25.00  
 Mean   :16.61   Mean   : 5.617   Mean   :25.89  
 3rd Qu.:19.98   3rd Qu.: 6.067   3rd Qu.:30.01  
 Max.   :40.43   Max.   :15.130   Max.   :53.28  
                                  NA's   :79     

Start to get a feeling for the numbers. Note the names of the variables, their range, the central tendency and their spread.

We can calculate the central tendency for each population using the aggregate() function

# Calculate mean Cortisol in each population
aggregate(Cpgmg~Population, data=wolf12, FUN=mean)  
  Population    Cpgmg
1          1 15.56222
2          2 17.07495

Notice that mean cortisol is higher in population 2. Is that the pattern we’d expect?
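
We can also describe the spread within each population by changing the FUN argument of aggregate(). A possible sketch (output not shown):

# Calculate the standard deviation of cortisol in each population
aggregate(Cpgmg~Population, data=wolf12, FUN=sd)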

Graphical data exploration

Exploring the data with graphics is also useful

library(ggplot2)       # Load ggplot2

First we’ll look at the distribution of the cortisol concentrations

# Plot the distribution of the cortisol data
ggplot(data=wolf12,
       aes(x=Cpgmg)) +
  geom_histogram(bins=20) +
  labs(x='Cortisol Concentration [pg/mg]') + 
  theme_bw()

The distribution looks slightly right-skewed.
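
It can also be useful to look at the distribution for each population separately. One possible approach is to add facet_wrap() to the ggplot above (output not shown):

# Plot the distribution of cortisol for each population separately
ggplot(data=wolf12,
       aes(x=Cpgmg)) +
  geom_histogram(bins=20) +
  facet_wrap(~Population) +
  labs(x='Cortisol Concentration [pg/mg]') + 
  theme_bw()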

We can plot the raw data

# Plot the raw cortisol data
ggplot(data=wolf12,
       aes(x=Population,
           y=Cpgmg,
           colour=Population)) +
  geom_point(position='jitter') +
  labs(x='Population',
       y='Cortisol Concentration [pg/mg]') + 
  theme_bw()

or summarise the data with a boxplot. Importantly, the boxplot visualises not only the central tendency of the data from the two populations (the median) but also the spread (in terms of the 25% and 75% quantiles as well as the whiskers).

# Draw a boxplot of the cortisol data from populations 1 and 2
ggplot(data=wolf12,
       aes(x=Population,
           y=Cpgmg)) +
  geom_boxplot() +
  labs(x='Population',
       y='Cortisol Concentration [pg/mg]') + 
  theme_bw()

Median cortisol levels in population 2 are slightly higher than in population 1, but there is a lot of spread about the central tendency.

A violin plot is an alternative to a boxplot

# Draw a violin plot of the cortisol data from populations 1 and 2
ggplot(data=wolf12,
       aes(x=Population,
           y=Cpgmg)) +
  geom_violin() +
  labs(x='Population',
       y='Cortisol Concentration [pg/mg]') + 
  theme_bw()

We may want to look at how several variables are able to explain cortisol values (the response variable). For example, differences between males and females and differences between the two populations.

# Draw a violin plot of the cortisol data from populations 1 and 2
# for males and females
ggplot(data=wolf12,
       aes(x=Population,
           y=Cpgmg,
           fill=Sex)) +
  geom_violin() +
  labs(x='Population',
       y='Cortisol Concentration [pg/mg]') + 
  theme_bw()

Balanced and unbalanced data

Use the table() function to look at how much data we have from each population.

# Look at the balance of the data across
# levels of Population
table(wolf12$Population)

  1   2 
 45 103 

The sample size for population 1 is lower than for population 2. This is called an unbalanced design and is very common when working with data collected in the field. A balanced design is one where the sample sizes are the same for all levels of a factor. Classical ANOVA analysis required the data to be balanced, so these data would need special treatment if analysed with a classical ANOVA. General linear models can be used with unbalanced data because they use maximum likelihood to fit the model. However, general linear model analysis is easiest for balanced data sets, and it should not be used for data that are extremely unbalanced.
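
If you want to check the balance across more than one factor, table() will also cross-tabulate two variables (a quick sketch; output not shown):

# Cross-tabulate sample sizes by Population and Sex
table(wolf12$Population, wolf12$Sex)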

Analyse the data

Modelling the population

Our data are a sample. The data contain partial information about the topic of our hypotheses (i.e. the population). The population is the set of cortisol concentrations from all wolves from the two locations, at any time of the year, and maybe even across more of a wolf’s body than just the hair.

If we are going to investigate our hypotheses then we need a way to describe the population. We will use a statistical model to describe the distribution of data we may expect across the whole population.

A statistical model

The figure below sketches out a possible statistical model. It sketches the distribution of cortisol for populations 1 and 2. The model has three important aspects:

  1. The response variable is cortisol concentration (on the x-axis), and it is a quantitative variable
  2. The average value of cortisol in each population is described by a number (represented as \mu_1 and \mu_2, whose values are not yet known). In other words, the average value in each population is a unique value that could be determined: given complete information about the population we could determine it precisely.
  3. The variation of cortisol values in each population is described by a normal distribution. The variation within a population is described as random because we are not trying to specify a unique cortisol value for every individual wolf. Instead we describe all the possible cortisol values by a normal distribution. This is an important difference from the previous point: given complete information about the population we could not determine the precise cortisol value of an individual wolf, but we could assign a probability to observing a given cortisol concentration.

Figure: A possible statistical model for cortisol concentrations from all wolves in populations 1 (blue line) and 2 (orange line).

The model in the sketch above could be written in equation form as

\text{Cpgmg} = \mu_i + \text{Normal}(0, \sigma)

where \mu_i is the mean cortisol for population i (i is an index that can be 1 or 2) and \sigma is the standard deviation in cortisol values (i.e. the spread in cortisol values).

Note that:

  • the spread is the same in both population 1 and 2.
  • \text{Normal}(0,\sigma) is shorthand for a normal distribution with a mean of zero and standard deviation \sigma.
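
One way to get a feel for this statistical model is to simulate data from it. Below is a minimal sketch that generates cortisol-like values from the equation above; the values of \mu_1, \mu_2 and \sigma are purely illustrative and are not estimates from the wolf data.

# Simulate data from the model Cpgmg = mu_i + Normal(0, sigma)
# The parameter values below are illustrative, not estimated from the data
set.seed(42)                     # Make the simulation reproducible
mu = c(15, 18)                   # Hypothetical means for populations 1 and 2
sigma = 6                        # Hypothetical standard deviation (same in both populations)
pop = rep(c(1, 2), each=50)      # 50 simulated wolves per population
sim_Cpgmg = mu[pop] + rnorm(length(pop), mean=0, sd=sigma)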

General linear models

The sketch above is a representation of a general linear model. General linear models are a hugely important family of statistical models.

A general linear model always uses a normal distribution to describe variation in the population that we will not try to associate with another variable (e.g. variation in cortisol concentrations between wolves in a population). The normal distribution in a general linear model is assumed to have a mean of zero and a constant standard deviation that must be estimated from data.

Special cases of general linear models are the methods of: linear regression, t-tests, analysis of variance (ANOVA) and analysis of covariance (ANCOVA).

General linear models are a statistical tool that allow us to:

  1. Analyse a very wide range of data with one statistical methodology
  2. Always describe randomness using a normal distribution
  3. Estimate means and other parameters of the population (e.g. the difference in mean cortisol between populations)
  4. Estimate the uncertainty in population parameters from point 3.
  5. Quantify the strength of evidence against the null-hypothesis

The work-horse function in R for fitting general linear models to data is the lm() function.

We will use the general linear model framework to estimate mean cortisol values in populations 1 & 2, and then test the hypothesis that these means are the same. We will then compare this approach to a t-test approach and finally to a permutation test approach.

Hypotheses revisited

Each hypothesis (the null-hypothesis, H0 and the alternative hypothesis, H1) can be expressed in terms of our statistical model.

Model for H0

The null-hypothesis states that there is no difference in average cortisol concentrations between populations 1 and 2.

This means that \mu_1 = \mu_2. The model corresponding to this is sketched below.

Figure: A sketch of the statistical model corresponding to the null hypothesis. It is a simplification of the previous sketch.

This model could also be written in equation form as

\text{Cpgmg} = \mu + \text{Normal}(0, \sigma)

where \mu is the mean cortisol in both population 1 and population 2, and \sigma is the standard deviation in cortisol values.

In R the formula for this model is written simply as Cpgmg ~ 1 (for more details on how to specify a model using a formula go to http://www.ucd.ie/ecomodel/Resources/R_formulae_WebVersion.html).

You could also think of this model in terms of trying to predict average cortisol values. To predict average cortisol in either population 1 or population 2, given this null-hypothesis, you would make the same prediction (\mu) for both populations.

The random part (represented by \text{Normal}(0, \sigma)) is always assumed to be a normal distribution when using the lm() function, because lm() fits a general linear model to the data. The deterministic part (sometimes called the systematic part) in the formula Cpgmg ~ 1 is represented by the 1, which means that it is a constant (i.e. there is one mean that is not affected by any other variable). This constant is called the Intercept by R.

Model for H1

The alternative hypothesis states that average cortisol is different in populations 1 and 2.

This means that \mu_1 \ne \mu_2. The model corresponding to this is our first sketch showing two bell shaped normal distributions.

Figure: A sketch of the statistical model for the hypothesis H1

We therefore need two average values (\mu_1 and \mu_2) plus an estimate of spread (\sigma) to specify the model for this hypothesis.

You could also think of this model in terms of trying to predict average cortisol values. To predict average cortisol in either population 1 or population 2, given this alternative hypothesis, you would make the prediction \mu_1 for population 1 and \mu_2 for population 2 (i.e. different predictions).

In R the formula for this model will be written simply as Cpgmg ~ 1 + Population

Formulas in R: see http://www.ucd.ie/ecomodel/Resources/R_formulae_WebVersion.html for more details on how to specify a model using a formula.

Remember:
The random part of a general linear model is always assumed to be a normal distribution when using the lm() function.
The deterministic part now includes the factor Population to show that separate means should be fitted for populations 1 and 2 (each level of the factor). The factor Population is an example of an explanatory variable.

Model parameters

  • The model for H0 has two parameters: \mu and \sigma.
  • The model for H1 has three parameters: \mu_1, \mu_2 and \sigma.

Parameters can also be called coefficients. The term coefficient is commonly used when talking about the deterministic part of a model.

Fitting each of these models to the wolf12 data involves estimating values for these parameters. The lm() function fits each model using a method called maximum likelihood (for a general linear model this gives the same estimates of the coefficients as the method of least squares).

Fitting a model to data

Use the lm() function to fit a general linear model to data.

To use the lm() function we specify the model using a formula (see above) and specify the data frame containing the data.

Our formulas are

  • H0 model: Cpgmg ~ 1
  • H1 model: Cpgmg ~ 1 + Population

The response (Cpgmg) is on the left hand side of the tilde (~) and the deterministic part of the model (the explanatory variables) is on the right hand side of the tilde. The random part is not represented in the formula because it is automatically taken to be a normal distribution. The function lm() always assumes the random part of a model is a normal distribution.

Here is the R code to fit the H0 model (null model) and the H1 model (alternative model)

# Fit a linear model for the null-model 
# (i.e. Population is not part of the model)
m0 = lm(Cpgmg ~ 1, data=wolf12)



# Fit a linear model for the alternative model 
# (i.e. Population affects cortisol)
m1 = lm(Cpgmg ~ 1 + Population, data=wolf12)

Summarising the fitted model

The summary() function can be used to view the fitted models.

summary(m0)        # Summarise the null model

Call:
lm(formula = Cpgmg ~ 1, data = wolf12)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.865  -4.455  -1.240   3.360  23.815 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  16.6150     0.5051    32.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.145 on 147 degrees of freedom

The summary of each model prints:

  • Call: the lm() command used to fit the model
  • Residuals: the quartiles of the residuals (described in later tutorials)
  • Coefficients: the fitted parameters for the deterministic part of the model (called the model coefficients) and their uncertainty (standard errors). For model m0 this is just the mean cortisol (called the Intercept) with an estimated value of 16.6 (standard error=0.51).
  • Residual standard error: the fitted value for the standard deviation of the normal distribution in the statistical model. For the m0 model this is s_{n-1} = 6.14.
    • the residual degrees of freedom. This is the number of data points (=148) minus the number of parameters in the deterministic part of the model. For model m0 there is one parameter in the deterministic part of the model (\mu), therefore the residual degrees of freedom = 147.

If the model is more complex than fitting a single mean (e.g. model m1) then summary() also prints:

  • Multiple R-squared: the R2 of the model. For model m1 the R2=0.013. The R2 can be interpreted as the proportion of the variance in the response that is explained by the deterministic part of the model.
    • an adjusted R2 which we will not be considering.
  • F statistic: an additional F statistic which we will not be considering
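
You can view this extra output yourself by running summary(m1) (output not shown here). As a check on the interpretation above, the residual standard error of the null model m0 is simply the sample standard deviation of the cortisol data:

summary(m1)          # Summarise the alternative model (output not shown)

# For the intercept-only model m0 the residual standard error equals
# the sample standard deviation of the cortisol data
sd(wolf12$Cpgmg)     # Should match the residual standard error of m0 (6.145)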

Interpreting model coefficients

Model H0

The H0 model has one coefficient, the mean cortisol value, \mu (output from summary(m0)). This is estimated to be 16.61 (s.e. 0.51). This information is from the (Intercept) row in the model summary info.

Remember:
a fitted coefficient has an uncertainty because we used a sample to estimate a property of the population.

It is important to give an uncertainty with every estimated value from a model (a standard error or a 95% confidence interval).

We advise you to use a 95% confidence interval (confidence intervals will be discussed in a later tutorial):

  • Use the coef() function to quickly see the fitted values for a model’s coefficients
  • Use the confint() command to obtain 95% confidence intervals for these fitted coefficients

# Model coefficient estimates and uncertainties
coef(m0)      # Coefficients for model m0
(Intercept) 
     16.615 
confint(m0)   # 95% confidence intervals on the coefficients for model m0
               2.5 %   97.5 %
(Intercept) 15.61685 17.61315

We can now report our estimate for the mean cortisol value:

The average cortisol for all wolves from populations 1 and 2 is 16.61 pg/mg (95% CI: 15.62 - 17.61).

Model H1

The H1 model has two coefficients:

coef(m1)      # Coefficients for model m1
(Intercept) Population2 
  15.562222    1.512729 
confint(m1)   # 95% confidence intervals on the coefficients for model m1
                 2.5 %    97.5 %
(Intercept) 13.7575212 17.366923
Population2 -0.6505745  3.676033
  • the mean cortisol value for population 1 (\mu_1). This is estimated to be 15.56 (95% CI: 13.76 - 17.37). This is the coefficient labelled (Intercept).
  • the difference between the means of population 1 and 2 (\mu_2 - \mu_1). This difference is estimated to be 1.51 (95% CI: -0.65 - 3.68). This is the coefficient labelled Population2.
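
Note that the coefficients give the mean for population 2 indirectly: adding the (Intercept) and Population2 coefficients gives \mu_2. A quick check, which should reproduce the population 2 mean from the aggregate() output earlier:

# Estimated mean cortisol for population 2 = Intercept + Population2
coef(m1)[1] + coef(m1)[2]      # Should be roughly 17.07 pg/mg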

Residuals

A residual is the difference between a data point and the prediction from a model. The residual is therefore associated with the random part of the model. It is the part of the response that cannot be explained by the model.

Most of the assumptions in our statistical model concern the residuals. So we will look at residuals in more detail when we discuss if the fitted model obeys the assumptions (model validation is the subject of a later tutorial).

The residuals of a fitted linear model can be obtained with the residuals() function

m1_resid = residuals(m1)   # Extract the residuals for model m1

The distribution of residuals has a mean of zero,

mean(m1_resid)             # Mean of the residuals from model m1
[1] 5.521109e-16

The calculated mean is extremely close to zero. It is not exactly zero because of the limited accuracy of the computer (remember 1.0e-16 is computer shorthand for 1 x 10^-16 = 0.0000000000000001).
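
If you want to convince yourself that the mean residual is zero apart from rounding error, all.equal() compares numbers within a small tolerance:

# Check that the mean residual is zero to within numerical tolerance
all.equal(mean(m1_resid), 0)   # Should return TRUE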

The distribution of residuals can also be visualised by plotting a histogram

# Plot the distribution of residuals from model m1
ggplot(data=as.data.frame(m1_resid),
       aes(x=m1_resid)) +
  geom_histogram(bins=30) +
  theme_bw()

The standard deviation of the residuals is close to being an estimate of \sigma from our statistical model (it is a slight under-estimate).

sd(m1_resid)              # Standard deviation of the residuals
[1] 6.104729

A more precise estimate is given by the residual standard error in the summary() output.
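
In recent versions of R the residual standard error can also be extracted directly with the sigma() function (or equivalently summary(m1)$sigma):

# Extract the residual standard error (the estimate of sigma) directly
sigma(m1)                      # Same value as summary(m1)$sigma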

Long-hand calculation of residuals

Residuals can be calculated explicitly by subtracting the model predictions (using the predict() function) from the original data.

m1_resid2 = wolf12$Cpgmg - predict(m1)  # Calculate residuals by hand

Have a go at comparing the two approaches for calculating residuals to show that they give the same answer.
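
One possible way to make this comparison is with all.equal(), dropping the names that R attaches to the residuals:

# Compare the two sets of residuals (they should agree to numerical tolerance)
all.equal(unname(m1_resid), unname(m1_resid2))   # Should return TRUE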

Summary of the topics covered

  • The statistical model for general linear models revolves around the Normal distribution
  • General linear models always use a normal distribution as the random part
  • The variable of interest is called the response variable
  • Statistical models of data always have a deterministic and a random part
  • Statistical models have parameters
  • The values of a model’s parameters can be estimated by fitting a model to data
  • The lm() function fits general linear models to data

Further Reading

All these books can be found in UCD’s library