Manipulating data in R

Using arrays and data frames

Author
Affiliation

Jon Yearsley

School of Biology and Environmental Science, UCD

Published

January 1, 2024


How to Read this Tutorial

This tutorial is a mixture of R code chunks and explanations of the code. The R code chunks will appear in boxes.

Below is an example of a chunk of R code:

# This is a chunk of R code. All text after a # symbol is a comment
# Set working directory using setwd() function
setwd('Enter the path to my working directory')

# Clear all variables in R's memory
rm(list=ls())    # Standard code to clear R's memory

Sometimes the output from running this R code will be displayed after the chunk of code.

Here is a chunk of code followed by the R output

2 + 4            # Use R to add two numbers
[1] 6

Introduction

This worksheet covers the different ways to manipulate arrays of numbers and data frames in R.

Creating an array of numbers

One-dimensional arrays

Use the array() function to create an array of numbers.

For example, to create an array of 13 zeros we would type

array(0, dim=13)            # Create an array with 13 elements, all set to zero      
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0

This is called a one dimensional array with 13 elements.

We can also create a 1D array of numbers using the concatenate function c()

c(3, 100, 30, 23, -9, 6.3)  # Concatenate six numbers
[1]   3.0 100.0  30.0  23.0  -9.0   6.3

Two-dimensional arrays

A matrix is a two dimensional array. A matrix can also be created using the array() function, but now the dim= argument has two numbers: the first being the number of rows in the matrix and the second the number of columns in the matrix.

# Create a 2D array of zeros with 3 rows and 5 columns
array(0, dim=c(3,5))
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
[3,]    0    0    0    0    0

Using [ ] to extract data from an array

Using [ ] to specify a single element

# Create a 1D array called a1 with 10 elements all set to be zero
a1 = array(0, dim=10)

If we want to set the 3rd element of the array a1 to be the number 8 we use square brackets [] as follows

# Set the 3rd elements of the array a1 to be the number 10
a1[3] = 8
a1                       # Display the modified array a1
 [1] 0 0 8 0 0 0 0 0 0 0

Using [ ] to specify a several elements

We can use [ ] to specify several elements of the a1 array. For example, to display just the 3rd, 5th and 8th elements

# Display the 3rd, 5th and 8th elements of a1
a1[ c(3, 5, 8) ]
[1] 8 0 0

And we can use this approach to set the values of several elements at the same time

# Set the 3rd, 5th and 8th elements of a1 to be 2.5, -76 and 109, respectively
a1[ c(3, 5, 8) ] = c( 2.5, -76, 109)
a1                       # Display the modified array a1
 [1]   0.0   0.0   2.5   0.0 -76.0   0.0   0.0 109.0   0.0   0.0

Using [ ] to specify an element of a 2D array

First create a 2D array with 3 rows and 5 columns (i.e 15 elements in total) that contain random numbers draw from a standard normal distribution (mean=0, standard deviation=1)

# Create a 3 by 5 array containing random data from a normal distribution
a2 = array( rnorm(15), dim=c(3,5))
a2          # Display the array a2
           [,1]      [,2]      [,3]       [,4]      [,5]
[1,] -0.7533743 1.1237166 1.0655912  1.4276775 -0.931480
[2,]  0.8206040 0.4790957 0.6671623 -0.1306744 -1.771047
[3,] -0.9179084 0.4961658 0.2490151  0.7301902 -1.867561

The row and column indices of an element can be specified inside square brackets, separated by a comma

# Display the element in the 2nd row, 4th column of a2
a2[ 2, 4]
[1] -0.1306744

We could reassign specific elements of a 2D array

# Reassign the element in the 2nd row, 4th column of a2 to be zero
a2[ 2, 4] = 0
a2          # Display the array a2
           [,1]      [,2]      [,3]      [,4]      [,5]
[1,] -0.7533743 1.1237166 1.0655912 1.4276775 -0.931480
[2,]  0.8206040 0.4790957 0.6671623 0.0000000 -1.771047
[3,] -0.9179084 0.4961658 0.2490151 0.7301902 -1.867561

Using [ ] to specify an entire row/column of a 2D array

An entire row can be specified by leaving the index blank

# Display all the elements in the 2nd row of a2
a2[ 2, ]
[1]  0.8206040  0.4790957  0.6671623  0.0000000 -1.7710466

and similarly for an entire column

# Display all the elements in the 4th column of a2
a2[ , 4]
[1] 1.4276775 0.0000000 0.7301902

Creating a data frame

Importing data

Data frames are produced when you import data into R using one of the data import functions (e.g. read.csv(),read.table()).

Here is the R code to import the data in file WOLF.CSV. The data frame is called wolf and contains the imported data.

# Import WOLF.CSV file using read.table function
wolf = read.table('WOLF.CSV', header=TRUE, sep=',')

Looking at the first 6 rows of the wolf data frame shows that it contains 7 variables (i.e. 7 columns). The columns are called: Individual, Sex, Population, Colour, Cpgmg, Tpgmg, Ppgmg.

head(wolf)
  Individual Sex Population Colour Cpgmg Tpgmg    Ppgmg
1          1   M          2      W 15.86  5.32       NA
2          2   F          1      D 20.02  3.71 14.37622
3          3   F          2      W  9.95  5.30 21.65902
4          4   F          1      D 25.22  3.71 13.42507
5          5   M          2      D 21.13  5.34       NA
6          6   M          2      W 12.48  4.60       NA

Converting a 2D array into a data frame

Use the as.data.frame() function to convert a 2D array into a data frame. Each column of the array will become a different variable. By default the columns will be labelled V1, V2, V3, etc

# Convert array a2 into a data frame
d2 = as.data.frame(a2)
str(d2)       # Display the structure of the new data frame
'data.frame':   3 obs. of  5 variables:
 $ V1: num  -0.753 0.821 -0.918
 $ V2: num  1.124 0.479 0.496
 $ V3: num  1.066 0.667 0.249
 $ V4: num  1.43 0 0.73
 $ V5: num  -0.931 -1.771 -1.868

Creating a data frame from scratch

The data.frame() function can be used to create a data frame from scratch.

# Create data frame containing information from a food diary
d3 = data.frame(day=c('Mon','Tues','Wed','Thurs','Mon'),
                week=c(1,1,1,2,3),
                meal=c('Breakfast','Lunch','Lunch','Dinner','Dinner'),
                portions = c(2,1,0,0,1))

Above is an example where a data frame with 4 variables (from a food diary) is created (variable names are day, week, meal, portions). Each variable has 5 observations (rows).

Looking at the structure of the data frame shows that the text has been taken as data type character (abbreviated to chr) and numbers are data type numeric

str(d3)     # Display the structure of data frame d3
'data.frame':   5 obs. of  4 variables:
 $ day     : chr  "Mon" "Tues" "Wed" "Thurs" ...
 $ week    : num  1 1 1 2 3
 $ meal    : chr  "Breakfast" "Lunch" "Lunch" "Dinner" ...
 $ portions: num  2 1 0 0 1

If we want day and meal to be factors we could use the as.factor() function.

d3$day = as.factor(d3$day)         # Convert day to be a factor
d3$meal = as.factor(d3$meal)       # Convert meal to be a factor

or we could use the argument stringsAsFactors=TRUE in the original data.frame() function.

Specifying a column

A column of a data frame corresponds to a single variable.

The $ symbol

The dollar symbol ($) allows you to specify individual variables in a data frame. The dollar is preceded by the name of the data frame and followed by the name of the variable: <name of data frame>$<name of variable>

To view the meal and week variables in the data frame d3 (above) type

d3$week         # Display variable week in the d3 data frame
[1] 1 1 1 2 3
d3$meal         # Display variable meal in the d3 data frame
[1] Breakfast Lunch     Lunch     Dinner    Dinner   
Levels: Breakfast Dinner Lunch

Note that for factors (e.g. the meal variable) R displays a list of the levels in the factor (i.e. the unique values of the qualitative variable).

Square brackets [ ]

Just as for arrays, square brackets can be used to specify a column from a data frame. To specify the 3rd column in the d3 data frame type

d3[ ,3]    # Display the 3rd column in data frame d3
[1] Breakfast Lunch     Lunch     Dinner    Dinner   
Levels: Breakfast Dinner Lunch

Note that the row index is blank in order to specify all rows

Specifying a row

An observation (a row in a data frame) can be specified using [ ]

To specify the 2nd observation in the wolf data frame type

wolf[2, ]
  Individual Sex Population Colour Cpgmg Tpgmg    Ppgmg
2          2   F          1      D 20.02  3.71 14.37622

Note that the column index is left blank in order to specify all columns

We can use the same notation to specify a single value. To specify the 2nd observation of the 5th column in the wolf data frame type

wolf[2, 5]
[1] 20.02

Logical indexing

Logical indexing uses a logical variable to specify elements in an array or data frame. A logical variable is either TRUE or FALSE.

A logical index will extract values from a data frame at positions where the logical index is TRUE.

Data frame example

Here is an example using the d3 data frame. We want to extract all observations from week 1.

d3            # Display data in the d3 data frame
    day week      meal portions
1   Mon    1 Breakfast        2
2  Tues    1     Lunch        1
3   Wed    1     Lunch        0
4 Thurs    2    Dinner        0
5   Mon    3    Dinner        1
# Create a logical array that is TRUE if an observation is from week 1
logInd = d3$week==1
str(logInd)   # Display the structure of the logical array
 logi [1:5] TRUE TRUE TRUE FALSE FALSE
# Use the logical array inside square brackets to specify rows corresponding to week 1
d3[logInd, ]
   day week      meal portions
1  Mon    1 Breakfast        2
2 Tues    1     Lunch        1
3  Wed    1     Lunch        0

Rows corresponding the where logInd is TRUE are extracted.

The same result could be obtained using the subset() function

# Use the subset function to specify rows corresponding to week 1
subset(d3, week==1)
   day week      meal portions
1  Mon    1 Breakfast        2
2 Tues    1     Lunch        1
3  Wed    1     Lunch        0

Array example

Here is an example where we want to extract all negative numbers from the 2D array a2

a2       # Display the 2D array a2
           [,1]      [,2]      [,3]      [,4]      [,5]
[1,] -0.7533743 1.1237166 1.0655912 1.4276775 -0.931480
[2,]  0.8206040 0.4790957 0.6671623 0.0000000 -1.771047
[3,] -0.9179084 0.4961658 0.2490151 0.7301902 -1.867561
# Create a logical array that is TRUE when an element of a2 is negative
logInd = a2<0   
# Extract elements of a2 that are negative
a2[logInd]
[1] -0.7533743 -0.9179084 -0.9314800 -1.7710466 -1.8675614

Reassigning values in a data frame using [ ]

Above we used logical indexing to extract all observations from week 1 in the data frame d3. We can use this approach to reassign the values in a data frame. This can be useful when correcting errors in a data frame.

Here is how you reassign the values of portions for week 1 in data frame d3 to be equal to 0

# Create a logical array that is TRUE if an observation is from week 1
loInd = d3$week == 1

# Reassign portions to be 0 in week 1
d3$portions[logInd] = 0

Reorganising a data frame

This section shows how to pivot data. This means combining several columns into one column (i.e. putting data in a longer format) and splitting data from one column across several columns (i.e. putting data into a wider format). We will be using the pivot_longer() and pivot_wider() functions from the package tidyr.

We will use three more data sets (INSECT.TXT, MALIN_HEAD.TXT and BEEKEEPER.TXT). These data sets are explained at http://www.ucd.ie/ecomodel/Resources/datasets_WebVersion.html

First import the data

# TAB delimited variables
insect = read.table('INSECT.TXT', header=T, skip=3, sep='\t')
insect = insect[,-c(7,8)] # Remove the last two empty columns

rainfall = read.table('MALIN_HEAD.TXT', header=T, sep='\t')

# White space delimited variables
bees = read.table('BEEKEEPER.TXT', header=T, skip=6, sep='')

Several columns into one column (using pivot_longer())

Look at the insect data frame

insect
   Spray.A Spray.B Spray.C Spray.D Spray.E Spray.F
1       10      11       0       3       3      11
2        7      17       1       5       5       9
3       20      21       7      12       3      15
4       14      11       2       6       5      22
5       14      16       3       4       3      15
6       12      14       1       3       6      16
7       10      17       2       5       1      13
8       23      17       1       5       1      10
9       17      19       3       5       3      26
10      20      21       0       5       2      26
11      14       7       1       2       6      24
12      13      13       4       4       4      13

The insect data frame is not tidy: it is not organised so that each row is an observation and each column is a variable.

It has two variables: spray name (A-F) and insect count, so ideally we want a data frame with two columns (one column would contain the type of spray used on a plot and the second column would contain the insect count from the plot). As it stands the spray name appears as the names of the columns, and the insect count is spread across six columns. Ideally we need to reorganize the data into a long-format, so that there are just two columns, one for each variable.

To reorganize the data frame so that each column corresponds to a single variable we use the pivot_longer() function in the tidyr package. This function will:

  1. Specify which columns to combine into one column with the cols= argument.
  2. Create a new variable containing the names of the columns from the insect data frame. This column’s name can be set with the names_to= argument of the pivot_longer() function.
  3. Create a second new variable containing the numerical data in the columns. This column’s name can be set with the values_to= argument of the pivot_longer() function.
library('tidyr')        # Load tidyr package

# Tidy the insect data frame
# Gather all the count data into one column 
insect.long = pivot_longer(data=insect, 
                           cols=c(1:6),
                           values_to="Count", 
                           names_to="Spray")

head(insect.long)          # Display the reshaped data frame
# A tibble: 6 × 2
  Spray   Count
  <chr>   <int>
1 Spray.A    10
2 Spray.B    11
3 Spray.C     0
4 Spray.D     3
5 Spray.E     3
6 Spray.F    11

This reorganized data frame is in long-format and is sometimes called a molten data frame.

Example: Reorganising climate data

A common example of using the pivot_longer() function to combine multiple columns of data into a single column occurs with monthly climate data (e.g. Met Eireann’s monthly data; https://www.met.ie/climate/available-data). It is common for raw data to be displayed with columns to represent individual months. To analyse these data we will want all rainfall data in one column, and a second column added giving the month.

rainfall    # Display the data
  Year   Jan   Feb   Mar  Apr   May  Jun   Jul   Aug  Sep   Oct   Nov   Dec
1 2016 166.3 130.7  72.7 52.9  65.6 72.4 109.9  63.4   NA    NA    NA    NA
2 2015 176.0  85.8 123.1 64.7 137.0 56.1 132.7 111.0 29.7  71.9 222.9 272.9
3 2014 162.2 189.9  71.6 33.4  86.8 48.6  86.0  95.3 23.0 131.4 134.4 150.5
4 2013 140.9  74.1  61.7 61.6 102.5 85.5  56.5  92.6 69.7 103.8 116.0 178.6

With these data we will combine the monthly rainfall data into two columns: one with the name of the month and one with the rainfall for the month.

However we do not want to change the Year column, so we instruct R to not use the year column, using the argument cols=-Year (see below).

# Organise the rainfall data
rainfall.long = pivot_longer(data=rainfall, 
                             cols=-Year,
                             names_to="Month", 
                             values_to="Rainfall")

# Display the first 6 lines of the organised data frame
head(rainfall.long)
# A tibble: 6 × 3
   Year Month Rainfall
  <int> <chr>    <dbl>
1  2016 Jan      166. 
2  2016 Feb      131. 
3  2016 Mar       72.7
4  2016 Apr       52.9
5  2016 May       65.6
6  2016 Jun       72.4

One column across several columns (using pivot_wider())

Sometimes you want to take well formatted data and spread one variable across several columns. One option to do this you can use the pivot_wider() function in tidyr.

The bees data frame is well formatted (one variable in one column and each row an observation). In this well formatted data set we have a column called Flowers (density of flowers), and a column called Field (the field where the flowers density was measured). We will spread the Flowers data out so that the density of flowers is in two columns: one columns for field 1 and one column for field 2.

We will use the pivot_wider() function. This function will:

  1. Create column names from the values in variable Field (categorical variable). This is done with the names_from= and names_prefix= arguments of the pivot_wider() function.

  2. Indicate that the Flowers variable will be used to fill the newly created columns. This is done with the values_from= argument of the pivot_wider() function.

# Spread the Field across multiple columns 
# (see below for an explanation of names_prefix="Field_")
bees.wide = pivot_wider(data=bees, 
                        values_from="Flowers", 
                        names_from="Field", 
                        names_prefix="Field_")

head(bees.wide)  # Display the spread out data frame
# A tibble: 6 × 4
  Treatment  Plot Field_1 Field_2
      <int> <int>   <int>   <int>
1         1     1      23      31
2         1     2      26      32
3         1     3      20      35
4         1     4      21      29
5         2     1      20      30
6         2     2      24      32

The names_prefix="Field_" argument is used to create the new column names, by adding Field_ as a prefix to the field number.

Example: Reorganising climate data

We could use the pivot_wider() function to return the monthly rainfall data (see above) from its tidy format (data frame rainfall.long) to a wide format where each month’s rainfall has its own column.

# Reorganise the rainfall data back into a wide format
rainfall.wide = pivot_wider(data=rainfall.long, 
                            values_from="Rainfall", 
                            names_from="Month")

rainfall.wide   # Display the wide-format rainfall data 
# A tibble: 4 × 13
   Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
  <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  2016  166. 131.   72.7  52.9  65.6  72.4 110.   63.4  NA    NA     NA    NA 
2  2015  176   85.8 123.   64.7 137    56.1 133.  111    29.7  71.9  223.  273.
3  2014  162. 190.   71.6  33.4  86.8  48.6  86    95.3  23   131.   134.  150.
4  2013  141.  74.1  61.7  61.6 102.   85.5  56.5  92.6  69.7 104.   116   179.

Other data organisation packages

There are several other data manipulation tools in R. Two common packages for data manipulation are:

  • reshape2
  • dplyr

See also http://www.r-bloggers.com/reshape-and-aggregate-data-with-the-r-package-reshape2/ and http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/

Summarising a variable by group

The summary() function produces a summary of an entire data frame. However, it doesn’t allow you to calculate summaries for groups within the data frame.

If you want to calculate a mean across different grouping (e.g. for each value of a qualitative factor) the aggregate() function is very useful.

To calculate the mean rainfall across years for each month:

# Mean rainfall across years 2013-2016 for each month
aggregate(Rainfall~Month,          # Define aggregation using a formula
          data=rainfall.long,      # Specify data frame to be aggregated
          FUN=mean,                # Specify function to use for aggregation
          na.rm=TRUE,              # Ignore missing values
          na.action=na.pass)       # Pass missing data to the aggregation function
   Month Rainfall
1    Apr  53.1500
2    Aug  90.5750
3    Dec 200.6667
4    Feb 120.1250
5    Jan 161.3500
6    Jul  96.2750
7    Jun  65.6500
8    Mar  82.2750
9    May  97.9750
10   Nov 157.7667
11   Oct 102.3667
12   Sep  40.8000

The order of the months is alphabetical, making the summary harder to read. This could be improved by specifying the Month variable as a factor with the levels in a specified order. Below is the code to do this.

# Define Month as a factor with levels in a specified order
rainfall.long$Month = factor(rainfall.long$Month, 
                             levels = c('Jan','Feb','Mar','Apr','May','Jun',
                                        'Jul','Aug','Sep','Oct','Nov','Dec'))

The aggregate function can be used to produce a summary for each month

# Summary of rainfall across years 2013-2016 for each month
aggregate(Rainfall~Month, 
          data=rainfall.long,
          FUN=summary)
   Month Rainfall.Min. Rainfall.1st Qu. Rainfall.Median Rainfall.Mean
1    Jan      140.9000         156.8750        164.2500      161.3500
2    Feb       74.1000          82.8750        108.2500      120.1250
3    Mar       61.7000          69.1250         72.1500       82.2750
4    Apr       33.4000          48.0250         57.2500       53.1500
5    May       65.6000          81.5000         94.6500       97.9750
6    Jun       48.6000          54.2250         64.2500       65.6500
7    Jul       56.5000          78.6250         97.9500       96.2750
8    Aug       63.4000          85.3000         93.9500       90.5750
9    Sep       23.0000          26.3500         29.7000       40.8000
10   Oct       71.9000          87.8500        103.8000      102.3667
11   Nov      116.0000         125.2000        134.4000      157.7667
12   Dec      150.5000         164.5500        178.6000      200.6667
   Rainfall.3rd Qu. Rainfall.Max.
1          168.7250      176.0000
2          145.5000      189.9000
3           85.3000      123.1000
4           62.3750       64.7000
5          111.1250      137.0000
6           75.6750       85.5000
7          115.6000      132.7000
8           99.2250      111.0000
9           49.7000       69.7000
10         117.6000      131.4000
11         178.6500      222.9000
12         225.7500      272.9000

Summary of topics

  • Creating an array (1D and 2D)
    • array() function
    • c() function
  • Creating a data frame
    • Importing data into a data frame
    • as.data.frame() function
    • data.frame() function
  • Specifying and modifying elements in arrays and data frames
    • Square brackets []
    • Dollar symbol $
    • Logical indices
  • Reorganising a data frame
    • Gathering data into one column
    • Spreading data across several columns
  • Groupwise summary a data frame

Further Reading

All these books can be found in UCD’s library

  • Andrew P. Beckerman and Owen L. Petchey, 2012 Getting Started with R: An introduction for biologists (Oxford University Press, Oxford)
  • Mark Gardner, 2012 Statistics for Ecologists Using R and Excel (Pelagic, Exeter)
  • Tenko Raykov and George A Marcoulides, 2013 Basic statistics: an introduction with R (Rowman and Littlefield, Plymouth)
  • John Verzani, 2005 Using R for introductory statistics (Chapman and Hall, London)
  • Andy P. Field, Jeremy Miles, Zoë Field, 2013 Discovering statistics using R (Sage, London)