Organising Data in R

A tutorial about data analysis using R (Website Version)

Author: Jon Yearsley

Affiliation: School of Biology and Environmental Science, UCD

Published: January 1, 2024

How to Read this Tutorial

This tutorial is a mixture of R code chunks and explanations of the code. The R code chunks will appear in boxes.

Below is an example of a chunk of R code:

# This is a chunk of R code. All text after a # symbol is a comment
# Set working directory using setwd() function
setwd('Enter the path to my working directory')

# Clear all variables in R's memory
rm(list=ls())    # Standard code to clear R's memory

Sometimes the output from running this R code will be displayed after the chunk of code.

Here is a chunk of code followed by the R output

2 + 4            # Use R to add two numbers

[1] 6

Objectives

The objectives of this tutorial are:

Introduce the concept of a data frame
Demonstrate how data frames can be manipulated
Demonstrate how to reformat data and code for missing data
Explain data subsetting in R
Save imported data to a compact binary file

Introduction

This tutorial will show you how to view, subset and manipulate data frames within R. It assumes that the data have already been imported into R (if you have trouble importing data into R, read the data importing worksheet first).

The data we’ll be using have been imported from these files:

WOLF.CSV: This file is a text file of comma-separated values.
INSECT.TXT: This file is a text file of TAB-delimited values.

These data sets are described at http://www.ucd.ie/ecomodel/Resources/datasets_WebVersion.html

Viewing a data frame

Finding variable names

Use the ls() function to print a list of variables in R’s memory

ls()                    # Display the variables in R's memory

[1] "insect" "wolf"

A poor way to view data

Typing the name of a variable will display all the data contained in the variable.

insect                    # Display the entire insect data frame

   Spray.A Spray.B Spray.C Spray.D Spray.E Spray.F  X X.1
1       10      11       0       3       3      11 NA  NA
2        7      17       1       5       5       9 NA  NA
3       20      21       7      12       3      15 NA  NA
4       14      11       2       6       5      22 NA  NA
5       14      16       3       4       3      15 NA  NA
6       12      14       1       3       6      16 NA  NA
7       10      17       2       5       1      13 NA  NA
8       23      17       1       5       1      10 NA  NA
9       17      19       3       5       3      26 NA  NA
10      20      21       0       5       2      26 NA  NA
11      14       7       1       2       6      24 NA  NA
12      13      13       4       4       4      13 NA  NA

BEWARE: Printing out the entire data set is rarely useful, because data sets are often too large to fit on a computer screen (for example, the wolf data frame has 178 rows of data, making it hard to read in one go). There are often better ways to view a data frame than to just print out the entire variable.

Good ways to view data

Here are some options for viewing data frames:

head(wolf)              # Display the first 6 lines of the wolf data frame
tail(wolf, n=10)        # Display the last 10 lines of the wolf data frame
summary(wolf)           # Display an overview of the wolf data frame
str(wolf)               # Display the structure of the wolf data frame

The summary() function is particularly useful. It displays summary statistics for each variable in a data frame. Later we will see how the summary() function has many uses, such as displaying summary results from a data analysis.

The summary output for a data frame depends upon each variable’s data type.

For quantitative data (num and int) the summary shows the minimum, first quartile (25% quantile), the median (50% quantile or second quartile), the mean, the third quartile (75% quantile), the maximum and the number of missing values (missing values are represented as NA in R). Examples of quantitative data in the wolf data frame are Cpgmg, Tpgmg and Ppgmg.
For qualitative data (factor, logi) the summary shows the number of data points in each category. If a variable has many categories only some of them are listed and the remaining categories are lumped together as (Other). The number of missing values is also shown. Examples of qualitative data in the wolf data frame are Sex and Colour.
For plain text data that isn’t qualitative the summary displays the type of data (Class : character).

The data type of a variable (e.g. quantitative, qualitative, character) is displayed in the output from the str() function.
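As a rough illustration of how the output of str() and summary() depends on data type, the sketch below builds a small made-up data frame. The name toy and all of its values are invented for this example and are not part of the wolf or insect data.

```r
# A small, made-up data frame with one variable of each data type
# (illustration only, not part of the tutorial's data sets)
toy = data.frame(Mass   = c(31.2, 28.5, NA, 40.1),        # numeric (num)
                 Count  = c(3L, 5L, 2L, 4L),              # integer (int)
                 Sex    = factor(c('M', 'F', 'F', 'M')),  # factor
                 Tagged = c(TRUE, FALSE, TRUE, TRUE),     # logical (logi)
                 Notes  = c('a', 'b', 'c', 'd'),          # character (chr)
                 stringsAsFactors = FALSE)

str(toy)                          # Data type of each variable
summary(toy)                      # Summary differs between data types
var.classes = sapply(toy, class)  # Just the class of each variable
var.classes
```

Comparing the str() and summary() output side by side shows which summary layout goes with which data type.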

Viewing part of a data frame

Referring to a single column in a data frame using `$`

A single variable (column) in a data frame can be specified by giving the name of the data frame, followed by a $ followed by the name of the variable.

Here is an example that specifies just the cortisol data in the wolf data frame

wolf$Cpgmg     # Display just the cortisol data

The names of the variables can be seen at the top of each column of data (for example, using the head() function)

# Variable names appear above each column of data
head(wolf)     # Display first 6 rows of data.

  Individual Sex Population Colour Cpgmg Tpgmg    Ppgmg
1          1   M          2      W 15.86  5.32       NA
2          2   F          1      D 20.02  3.71 14.37622
3          3   F          2      W  9.95  5.30 21.65902
4          4   F          1      D 25.22  3.71 13.42507
5          5   M          2      D 21.13  5.34       NA
6          6   M          2      W 12.48  4.60       NA
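The variable names can also be listed directly with the names() function. The sketch below uses a made-up three-row data frame called toy (a hypothetical name; the values are copied from the first three rows of the head(wolf) output) so that it runs on its own.

```r
# A small stand-in for the wolf data frame (first three rows only)
toy = data.frame(Individual = 1:3,
                 Sex        = c('M', 'F', 'F'),
                 Cpgmg      = c(15.86, 20.02, 9.95))

var.names = names(toy)   # Names of all the variables in the data frame
var.names

toy$Cpgmg                # One column, specified with $
toy[['Cpgmg']]           # The same column, specified with [[ ]]
```

The [[ ]] form can be handy when the name of the variable is itself stored as text in another variable.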

Adding a variable into a data frame

We can add a variable to a data frame using the $ operator.
Here is an example where we add the variable Replicate (1-12) which codes for each replicate of an experimental treatment

insect$Replicate = c(1:12)   # Add a variable called Replicate to the data frame

head(insect)                 # Display the first 6 rows of the updated data frame

  Spray.A Spray.B Spray.C Spray.D Spray.E Spray.F  X X.1 Replicate
1      10      11       0       3       3      11 NA  NA         1
2       7      17       1       5       5       9 NA  NA         2
3      20      21       7      12       3      15 NA  NA         3
4      14      11       2       6       5      22 NA  NA         4
5      14      16       3       4       3      15 NA  NA         5
6      12      14       1       3       6      16 NA  NA         6
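A possible variation, sketched here on a made-up data frame, is to generate the replicate numbers from the number of rows rather than typing 1:12, and to create a new variable from existing ones. The names toy and Total are hypothetical; the values are the first three rows of Spray.A and Spray.B.

```r
# A small stand-in for the insect data frame (first three rows, two sprays)
toy = data.frame(Spray.A = c(10, 7, 20),
                 Spray.B = c(11, 17, 21))

# Number the rows 1, 2, 3, ... without hard-coding how many rows there are
toy$Replicate = seq_len(nrow(toy))

# Add a variable calculated from two existing variables
toy$Total = toy$Spray.A + toy$Spray.B

toy    # Display the data frame with the two new variables
```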

Changing a variable’s data type

Data in statistical analyses are often one of two basic data types: quantitative or qualitative data.

R calls a continuous quantitative variable numeric (abbreviated to num)
R calls a discrete quantitative variable integer (abbreviated to int)
R calls a qualitative variable a factor

A qualitative variable is a set of labels (e.g. large, medium and small). Each label is called a level of the factor.

R also has other data types. Some examples are:

character data type = plain text (abbreviated to chr)
logical data type = a variable that is TRUE or FALSE (abbreviated to logi)

In the wolf data frame the variables Population, Individual, Sex and Colour are qualitative (the labels from each of these variables identify a data point to a population, an individual, a sex and a coat colour, respectively).

The data types that R has assigned each variable can be seen by looking at the structure of the wolf data frame

str(wolf)                    # Display the structure of the data frame

'data.frame':   178 obs. of  7 variables:
 $ Individual: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Sex       : chr  "M" "F" "F" "F" ...
 $ Population: int  2 1 2 1 2 2 1 1 1 2 ...
 $ Colour    : chr  "W" "D" "W" "D" ...
 $ Cpgmg     : num  15.86 20.02 9.95 25.22 21.13 ...
 $ Tpgmg     : num  5.32 3.71 5.3 3.71 5.34 4.6 4.58 9.27 4.81 5.07 ...
 $ Ppgmg     : num  NA 14.4 21.7 13.4 NA ...

You can see some issues here:

The variables Population and Individual have not been assigned as qualitative variables (R has identified them as integers, int, because the WOLF.CSV file used whole numbers as labels for these two variables).
The variables Sex and Colour have been identified as containing text (chr type), but we want these to be recognised as qualitative nominal data types (R calls this data type a factor). The variable Sex has three levels: ‘F’, ‘M’ and ‘U’ (undetermined sex). The variable Colour has two levels, ‘D’ and ‘W’, plus blank entries that should be explicitly recognised as missing data.

We want to redefine the variables Population, Sex and Colour so that R recognises them as factors (unordered factors). We will also redefine the variable Individual to be plain text (i.e. a character) to demonstrate the as.character() function.

# Convert Population variable from numeric to a factor (a qualitative variable)
wolf$Population = as.factor(wolf$Population)

# Convert Sex variable from character to a factor (a qualitative variable)
wolf$Sex = as.factor(wolf$Sex)

# Convert Colour variable from character to a factor (a qualitative variable)
wolf$Colour = as.factor(wolf$Colour)

# Convert Individual variable from numeric to plain text
wolf$Individual = as.character(wolf$Individual) 

# Display an overview of the data frame
summary(wolf)

  Individual        Sex    Population Colour      Cpgmg           Tpgmg       
 Length:178         F:72   1: 45       : 30   Min.   : 4.75   Min.   : 3.140  
 Class :character   M:76   2:103      D: 37   1st Qu.:12.16   1st Qu.: 4.372  
 Mode  :character   U:30   3: 30      W:111   Median :15.61   Median : 5.070  
                                              Mean   :17.74   Mean   : 6.148  
                                              3rd Qu.:20.35   3rd Qu.: 6.317  
                                              Max.   :73.19   Max.   :61.790  
                                                                              
     Ppgmg      
 Min.   :12.76  
 1st Qu.:19.50  
 Median :25.00  
 Mean   :25.89  
 3rd Qu.:30.01  
 Max.   :53.28  
 NA's   :109

Notice how the summaries of the variables Population, Sex, Colour and Individual have changed now that their data types have changed (Population, Sex and Colour are now factors and Individual is plain text). Also note that missing values, NA’s, are explicitly taken into account when summarising the data (e.g. the variable Ppgmg).

There is a set of related functions for coercing variables into other data types. Here are some examples

as.factor(...)    # Coerces a variable to be a factor (qualitative, nominal)
as.numeric(...)   # Coerces a variable to be numeric (quantitative, continuous)
as.character(...) # Coerces a variable to be a character (plain text)
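One thing to watch when coercing: applying as.numeric() directly to a factor returns the internal codes of the levels, not the numbers shown in the labels. The sketch below uses a made-up factor f to show the difference and one common way around it.

```r
# A made-up factor whose labels happen to be numbers
f = factor(c('10', '20', '10', '30'))

codes  = as.numeric(f)                # Internal level codes: 1 2 1 3
values = as.numeric(as.character(f))  # The numbers in the labels: 10 20 10 30

codes
values
```

Converting to character first and then to numeric recovers the values written in the labels.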

Removing a variable from a data frame

Sometimes we want to remove a variable from a data frame.

The insect data frame has two variables that should not be part of the data set (X and X.1). This is quite common when importing data. In this case the reason is two additional TABs at the end of each line in the text file. These TABs are hard to see, but R recognized them, created two additional variables and named them with default labels.

The columns can be removed by first finding out how many rows and columns the data frame has and then removing the two unwanted columns by their position. X and X.1 are columns 7 and 8 (they are no longer the last two columns, because Replicate was added as column 9). Here is the code

ncol(insect)                # Number of columns in data frame
nrow(insect)                # Number of rows in data frame
dim(insect)                 # Display number of rows and columns

insect = insect[ ,-c(7,8)]  # Remove columns 7 and 8 (X and X.1)
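Removing columns by position depends on the columns staying in the same order. An alternative, sketched below on a made-up data frame, is to remove columns by name. The names toy and unwanted are hypothetical and used only for this illustration.

```r
# A small stand-in for the insect data frame, including the two unwanted columns
toy = data.frame(Spray.A   = c(10, 7),
                 X         = c(NA, NA),
                 X.1       = c(NA, NA),
                 Replicate = 1:2)

unwanted = c('X', 'X.1')                     # Names of the columns to remove

# Keep every column whose name is NOT in the list of unwanted names
toy = toy[ , !(names(toy) %in% unwanted)]

names(toy)                                   # Display the remaining variable names
```

Because the columns are matched by name, this approach gives the same result whether or not extra variables have been added to the data frame.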

Set missing data to `NA`

Always use NA to represent missing data

Data on coat colour is missing for population 3. R explicitly represents missing data as NA, but the WOLF.CSV data file uses a blank space to represent missing data.

Because the blank entries all belong to population 3, the code below sets the coat colour of every observation from population 3 to NA

# Create a logical variable that is TRUE if an observation is from population 3
bool.index = wolf$Population==3 

# Set coat colour variable to be NA for observations from population 3 
wolf$Colour[bool.index] = NA
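Another option, shown here as a sketch on a made-up factor, is to look for the blank entries themselves rather than relying on knowing which population they come from. The name colour and its values are invented for this example.

```r
# A made-up coat colour factor where a blank ('') stands for missing data
colour = factor(c('W', 'D', '', 'W', ''))

colour[colour == ''] = NA        # Set every blank entry to NA
colour = droplevels(colour)      # Remove the now unused '' level

levels(colour)                   # Remaining levels: "D" "W"
n.missing = sum(is.na(colour))   # Count the missing values
n.missing
```

The is.na() function returns TRUE for each missing value, so summing it gives the number of NA entries.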

Subset of a data frame

Selecting observations (rows) from a data frame

To select only particular rows from a data frame using a criterion you can use the subset function.

For example, to make a subset of the data in wolf that contains only females,

wolf.F = subset(wolf, Sex=='F') # Create a subset with data on female wolves

Another way to subset the data frame is to use a logical index:

# Create a logical variable which is TRUE if an observation is for a female
bool.index = wolf$Sex=='F'  

# Create a subset containing only data on female wolves
wolf.F2 = wolf[bool.index, ]

Make a subset using several variables

# Create a subset containing only data on female wolves in Population 1
# method 1: 
wolf.F3 = subset(wolf, Sex=='F' & Population==1)

# Create a subset containing only data on female wolves in Population 1
# method 2:
bool.index = wolf$Sex=='F' & wolf$Population==1
wolf.F4 = wolf[bool.index,]
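The two methods can give different results when the variable used in the criterion contains missing values: subset() drops rows where the criterion is NA, whereas a logical index containing NA returns a row filled with NA. The sketch below demonstrates this on a made-up data frame called toy (the NA in Sex is invented for the illustration).

```r
# A made-up data frame where the sex of one individual is missing
toy = data.frame(Sex   = c('F', 'M', NA, 'F'),
                 Cpgmg = c(20.02, 15.86, 12.48, 9.95))

n.subset = nrow(subset(toy, Sex == 'F'))       # 2 rows: the NA row is dropped
n.index  = nrow(toy[toy$Sex == 'F', ])         # 3 rows: includes a row of NAs
n.which  = nrow(toy[which(toy$Sex == 'F'), ])  # 2 rows: which() ignores NA

c(n.subset, n.index, n.which)
```

Wrapping the logical index in which() is one way to make the second method behave like subset().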

Another example using a logical OR (|)

# Create a subset containing only data on wolves in Population 1 OR Population 2
wolf.F5 = subset(wolf, Population==1 | Population==2)

summary(wolf.F5)

  Individual        Sex    Population Colour      Cpgmg           Tpgmg       
 Length:148         F:72   1: 45       :  0   Min.   : 4.75   Min.   : 3.250  
 Class :character   M:76   2:103      D: 37   1st Qu.:12.16   1st Qu.: 4.378  
 Mode  :character   U: 0   3:  0      W:111   Median :15.38   Median : 5.030  
                                              Mean   :16.61   Mean   : 5.617  
                                              3rd Qu.:19.98   3rd Qu.: 6.067  
                                              Max.   :40.43   Max.   :15.130  
                                                                              
     Ppgmg      
 Min.   :12.76  
 1st Qu.:19.50  
 Median :25.00  
 Mean   :25.89  
 3rd Qu.:30.01  
 Max.   :53.28  
 NA's   :79
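When a criterion compares one variable against several values, a chain of logical ORs can also be written with the %in% operator. The sketch below checks that the two forms select the same rows of a made-up data frame (toy and its values are invented for this example).

```r
# A made-up data frame with three populations
toy = data.frame(Population = factor(c(1, 2, 3, 2, 1)),
                 Cpgmg      = c(20.02, 15.86, 18.30, 9.95, 25.22))

# Method 1: logical OR
set.or = subset(toy, Population == '1' | Population == '2')

# Method 2: %in% tests whether each value is in a list of values
set.in = subset(toy, Population %in% c('1', '2'))

same.rows = identical(set.or, set.in)   # TRUE if both methods agree
same.rows
```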

Dropping unused levels of a factor

The subset wolf.F5 contains no data from population 3, but the factor Population still has 3 levels. To remove unused levels from a factor use the function droplevels()

Using the droplevels() function on the data frame wolf.F5 will remove the level for population 3, as well as any other levels that contain no data (e.g. wolves with an undetermined sex, level U of variable Sex)

wolf.F5 = droplevels(wolf.F5) # Update the levels of factors in wolf.F5
summary(wolf.F5)              # The factor Population now has 2 levels

  Individual        Sex    Population Colour      Cpgmg           Tpgmg       
 Length:148         F:72   1: 45      D: 37   Min.   : 4.75   Min.   : 3.250  
 Class :character   M:76   2:103      W:111   1st Qu.:12.16   1st Qu.: 4.378  
 Mode  :character                             Median :15.38   Median : 5.030  
                                              Mean   :16.61   Mean   : 5.617  
                                              3rd Qu.:19.98   3rd Qu.: 6.067  
                                              Max.   :40.43   Max.   :15.130  
                                                                              
     Ppgmg      
 Min.   :12.76  
 1st Qu.:19.50  
 Median :25.00  
 Mean   :25.89  
 3rd Qu.:30.01  
 Max.   :53.28  
 NA's   :79
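The effect of droplevels() can also be checked on a single factor using levels(), nlevels() and table(). The following is a minimal sketch with a made-up factor pop, not the wolf data.

```r
# A made-up factor with three levels
pop = factor(c(1, 2, 2, 3, 1))

pop12 = pop[pop != '3']           # Keep only observations from levels 1 and 2
n.before = nlevels(pop12)         # Still 3 levels: level 3 is unused but present
table(pop12)                      # Level 3 appears with a count of zero

pop12 = droplevels(pop12)         # Remove the unused level
n.after = nlevels(pop12)          # Now 2 levels
levels(pop12)
```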

Selecting variables (columns) from a data frame

The subset command can be used to extract one or more variables from a data frame. For example, the code below selects only the Population and cortisol (Cpgmg) variables from the wolf data frame (these are the third and fifth columns in the data frame)

# Create a subset of the data containing the variables 'Population' and 'Cpgmg'
wolf.subset1 = subset(wolf, select=c('Population','Cpgmg'))

Other ways to select variables from a data frame

# Create a subset of the data containing the variables 'Population' and 'Cpgmg'
wolf.subset2 = wolf[,c('Population','Cpgmg')]


# Create a subset of the data containing the variables 'Population' and 'Cpgmg' 
# (columns 3 and 5 in the wolf data frame)
wolf.subset3 = wolf[,c(3,5)]


# Extract the single variable 'Population' using the variable name
# (this gives the values of the variable rather than a data frame)
wolf$Population
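Selecting a single variable can return either the values of the variable or a one-column data frame, depending on the method used. The sketch below compares the methods on a made-up data frame; the names toy, pop.vec1, pop.vec2, pop.df1 and pop.df2 are hypothetical.

```r
# A small stand-in for the wolf data frame
toy = data.frame(Population = factor(c(2, 1, 2)),
                 Cpgmg      = c(15.86, 20.02, 9.95))

pop.vec1 = toy$Population                      # The values (a factor)
pop.vec2 = toy[ , 'Population']                # Also just the values
pop.df1  = toy[ , 'Population', drop=FALSE]    # A one-column data frame
pop.df2  = subset(toy, select='Population')    # Also a one-column data frame

is.data.frame(pop.vec2)    # FALSE
is.data.frame(pop.df1)     # TRUE
```

The argument drop=FALSE asks R to keep the result as a data frame even when only one column is selected.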

Variables (columns) and observations (rows) can be selected at the same time. Here is an example selecting data on population identity and cortisol for just female wolves

# Create a subset of the data containing only female wolves and the 
# variables 'Population' and 'Cpgmg'
wolf.subset4 = subset(wolf, Sex=='F', select=c('Population','Cpgmg'))

Saving data

Large data sets can be time consuming to import into R. Once a file has been imported it is a good idea to save the data in R’s native binary format. Data in this format is quick to import and takes up less space on the hard drive. By convention, files containing data in R’s binary format have the suffix .Rdata.

To save the variables wolf and insect to a file use the save() command

# Save wolf and insect to a file called 'sheet2_data.Rdata'
save(wolf, insect, file='sheet2_data.Rdata')

We can verify that the data have been correctly saved by clearing R’s memory and re-importing them using the load() command. Try running the following commands to see if you can reload the data saved in file sheet2_data.Rdata.

rm(list=ls())                           # Clear variables from memory
ls()                                    # Display the variables in R's memory
load(file='sheet2_data.Rdata')          # Import R binary data from a file
ls()                                    # Display the variables in R's memory
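To check that reloaded data are identical to the original, a copy can be kept for comparison. The sketch below does this with a made-up data frame and a temporary file, so it does not touch sheet2_data.Rdata; the names toy, toy.original, tmp.file and loaded.names are hypothetical.

```r
# A made-up data frame to save and reload
toy = data.frame(Spray.A = c(10, 7, 20),
                 Spray.B = c(11, 17, 21))
toy.original = toy                        # Keep a copy for comparison

tmp.file = tempfile(fileext = '.Rdata')   # A temporary file name
save(toy, file = tmp.file)                # Save toy in R's binary format
rm(toy)                                   # Remove toy from R's memory

loaded.names = load(tmp.file)             # load() returns the names it restored
loaded.names

round.trip.ok = identical(toy, toy.original)   # TRUE if nothing changed
round.trip.ok

unlink(tmp.file)                          # Delete the temporary file
```

Note that load() restores variables under the names they had when saved, overwriting any existing variables with the same names.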

Summary of the topics covered

Displaying contents of a data frame
Manipulating data in a data frame
Creating subsets of data
Saving a data frame to a file using R’s binary data file format
Reading data from an R binary data file