Tutorial 2.4 - Data manipulation

27 Mar 2017

I apologize in advance: this tutorial requires quite a bit of explaining and context before it can get into the code. Good data manipulation is an art form that requires the use of many specific tools (functions), and expert data manipulation comes from integrating these tools together. Therefore it is necessary to have an overview of the tool set before investigating any single tool.

Manipulating data sets

Rarely are data transcribed and organized into simple, ready-to-go data sets. More typically, data are spread throughout multiple sources and in a variety of formats (particularly if compiled by multiple workers or instruments). Consequently, prior to any formal statistical analyses, it is necessary to compile very focused, tidy data sets.

Wickham (2014) suggests that there are many ways to organise data, yet tidy data (data that are structured in such a consistent way as to facilitate analyses) must adhere to a fairly strict set of structural rules. Specifically, in tidy data:

  • variables are in columns
  • observations are in rows - that is, for univariate data, there should be a separate row for each response observation.

To achieve tidy data, common data preparations include:

  • Reshaping and rearranging data
  • Merging multiple data sources
  • Aggregating data to appropriate spatial/temporal scales
  • Transforming data and deriving new variables
  • Sorting and reordering data
  • Relabelling data
This workshop will concentrate on these essential data preparation phases.
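As a minimal illustration of these two rules (the tiny data frame below is invented purely for this purpose), compare a wide, untidy layout with its tidy equivalent:

```r
## Untidy: the Count observations for each year are spread across columns
untidy <- data.frame(Site  = c("S1", "S2"),
                     Y2015 = c(3, 5),
                     Y2016 = c(4, 6))

## Tidy: one column per variable (Site, Year, Count),
## one row per observation
tidy <- data.frame(Site  = c("S1", "S2", "S1", "S2"),
                   Year  = c(2015, 2015, 2016, 2016),
                   Count = c(3, 5, 4, 6))
tidy
```

Every preparation step covered below (reshaping, merging, aggregating and so on) is ultimately in service of arriving at a structure like tidy.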

Practical data manipulation will be demonstrated via a series of very small artificial datasets. Each dataset is presented as a table, and the R code required to generate those data is presented in static code boxes either underneath or adjacent to the table. A very basic description of the table and the name of the data frame are displayed above the table. The entire collection of datasets used in this workshop can be obtained by issuing the following command:

  • (if online)
    source(url( "http://www.flutterbys.com.au/stats/downloads/data/manipulationDatasets.R"))
    #OR
    load(url("http://www.flutterbys.com.au/stats/downloads/data/manipulationDatasets.RData"))
    
  • (if offline and running this from a local version, having first downloaded either file)
    source(file="manipulationDatasets.R")
    #OR
    load("manipulationDatasets.RData")
    

The great folks over at RStudio have produced an excellent set of cheatsheets on a range of topics. For this tutorial, the Data Import Cheat Sheet (mainly page 2) and the Data Transformation Cheat Sheet are useful summaries.

tidyverse: a data manipulation ecosystem within R

There are numerous packages and base R routines devoted to data manipulation. Notable packages include data.table, plyr and doBy; indeed, earlier versions of this tutorial featured examples of each of these packages. However, an entire family of packages from Hadley Wickham's group now stands out as a comprehensive, intuitive suite of tools for data manipulation and visualisation.

Importantly, all of these packages are designed to integrate together and complement one another with a consistent interface. Individually, the important relevant packages for this tutorial include:

  • dplyr, for data manipulation
  • tidyr, for data tidying
  • forcats, for factors

To simplify installing an entire data ecosystem, the tidyverse package is available. Installing this package (via install.packages('tidyverse')) will install the following packages:

  • dplyr, for data manipulation
  • tidyr, for data tidying
  • readr, for importing data
  • ggplot2, for visualising data
  • purrr, for functional programming
  • tibble, for a new data.frame-like structure
  • hms, for times
  • lubridate, for dates/times
  • stringr, for strings
  • forcats, for factors
  • DBI, for databases
  • haven, for importing SAS, SPSS and Stata files
  • httr, for web APIs
  • jsonlite, for JSON
  • rvest, for web scraping
  • xml2, for XML
  • modelr, for model piping
  • broom, for converting models and objects into tidy data

Loading the tidyverse package (via library(tidyverse)) will automatically load the following packages:

  • dplyr, for data manipulation
  • tidyr, for data tidying
  • ggplot2, for visualising data
  • tibble, for a new data.frame-like structure
  • readr, for importing data
  • purrr, for functional programming
For this tutorial, we additionally need to load the forcats package.

library(tidyverse)
library(forcats)

Finally, for much of the rest of the tutorial, I will demonstrate multiple ways to perform the tasks. Whilst the tidyverse ecosystem of packages all integrate well with one another, they do conflict with other packages. Consequently, for the alternate routines I demonstrate, namespace-qualified calls will be used rather than loading the associated packages.
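As a sketch of what a namespace-qualified call looks like (stats::filter() is used here because it is a well-known case of masking: attaching dplyr masks it with dplyr::filter()):

```r
## A namespace-qualified call: the package need not be attached,
## and there is no ambiguity with dplyr::filter()
x <- c(1, 2, 3, 4)
stats::filter(x, rep(1, 2), sides = 1)  # running sum of adjacent pairs: NA 3 5 7
```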

The grammar of data manipulation

Hadley and his collaborators argue that there is a grammar of data manipulation and that manipulation comprises the following core verbs:

  • gather(), for melting into long format
  • spread(), for casting into wide format
  • unite(), for combining columns
  • separate(), for splitting columns
  • arrange(), for sorting data
  • select(), for subsetting columns
  • rename(), for renaming columns
  • filter(), for subsetting rows
  • slice(), for selecting rows
  • mutate(), for adding columns
  • summarize(), for summarizing data
  • count(), for tallying data
  • group_by(), for defining groups
  • *_join(), for merging data sets

Piping

Typically, data manipulations/preparations comprise multiple steps and stages in which a series of alterations, additions and so on are performed sequentially, such that the data are progressively molded into a form appropriate for analysis. Traditionally, this would have involved a separate expression for each step, often resulting in the generation of numerous intermediate data sets.

Furthermore, in an attempt to reduce the number of intermediate steps, functions are often nested within other functions such that alterations are made inline within another function. Collectively, these practices can yield code that is very difficult to read and interpret.

As a motivating (if somewhat extreme) example, let's say we wanted to calculate the logSumExp function: $$ \log\left(\sum^{n}_{i=1} e^{x_i}\right) $$

## Generate some data
set.seed(123)
x = rgamma(10,5,1)
## Calculate the logSumExp
log(sum(exp(x)))
[1] 9.316408
## OR
x1=exp(x)
x2=sum(x1)
log(x2)
[1] 9.316408

A long-honoured Unix coding principle is that each application should focus on performing one action and performing that action well. Performing a sequence of actions therefore involves piping (via the Unix pipe character '|') the output of one application into the input of another, and so on in a chain. The grammar of data wrangling adopts the same principle: each tool specializes in one action, and tools are piped together to achieve a sequence of actions.

The piping ('glue') operator in the tidyverse is %>%. An object on the left-hand side of the %>% operator is passed as the first argument to the function on the right-hand side.

x %>% exp %>% sum %>% log
[1] 9.316408

To reiterate, the following two are equivalent:

exp(x)
 [1]   29.653639 1601.918872    5.101634  118.918637 7140.686681  252.361318    9.175114    4.863565 1756.825350
[10]  199.466617
x %>% exp
 [1]   29.653639 1601.918872    5.101634  118.918637 7140.686681  252.361318    9.175114    4.863565 1756.825350
[10]  199.466617
as are the following:
log(x, base=10)
 [1] 0.5301465 0.8679950 0.2120706 0.6792861 0.9480981 0.7427928 0.3456667 0.1991438 0.8733941 0.7239190
x %>% log(base=10)
 [1] 0.5301465 0.8679950 0.2120706 0.6792861 0.9480981 0.7427928 0.3456667 0.1991438 0.8733941 0.7239190

Most of the following examples will demonstrate isolated data manipulation actions (such as filtering, summarizing or joining) as this focuses on the specific uses of these functions without the distractions and complications of other actions. For isolated uses, piping has little (if any) advantages. Nevertheless, in recognition that data manipulations rarely comprise a single action (rather they are a series of linked actions), for all remaining examples demonstrated in the tidyverse (dplyr/tidyr) context, piping will be used.

Reshaping dataframes

Wide to long (gather)

Whilst wide data formats are often more compact and typically easier to manage for data entry (particularly in the field), the data are not in the appropriate format for most analyses (traditional repeated measures and multivariate analyses are two exceptions). Most analyses require that each replicate is in its own row, and thus it is necessary to rearrange or reshape (melt or gather) the data from this wide format into the long (molten) format.

Whilst there are numerous routines in R for reshaping data, we will only explore those that are formally part of the tidyverse ecosystem.

The gather() function (tidyr package) is very useful for converting wide (repeated measures-like) data into long format. The important parameters to specify are:

  • key, a name for the new categorical variable whose levels correspond to the names of the variables being gathered
  • value, a name for the new numeric variable that will contain the observations
By default, gather() assumes that you wish to gather all the variables, producing a new table with just these two columns. In order to restrict the variables to be gathered, after specifying the key and value, we then indicate either which variables to include in the gather or, alternatively, which to exclude.

data.w
Between Plot Time.0 Time.1 Time.2
R1 A1 P1 8 14 14
R2 A1 P2 10 12 11
R3 A2 P3 7 11 8
R4 A2 P4 11 9 2
Code to generate these data:
set.seed(1)
data.w <- expand.grid(Plot=paste("P",1:4,sep=""))
data.w$Between <- gl(2,2,4,lab=paste("A",1:2,sep=""))
data.w <- with(data.w, data.frame(Between,Plot,
            matrix(rpois(12,10),ncol=3,
            dimnames=list(paste("R",1:4,sep=""),
                          paste("Time",0:2,sep=":")))))
Via gather (package:tidyr)
data.w %>% gather(key=Time,value=Count,Time.0,Time.1,Time.2)
#OR
data.w %>% gather(key=Time,value=Count,-Between,-Plot)
   Between Plot   Time Count
1       A1   P1 Time.0     8
2       A1   P2 Time.0    10
3       A2   P3 Time.0     7
4       A2   P4 Time.0    11
5       A1   P1 Time.1    14
6       A1   P2 Time.1    12
7       A2   P3 Time.1    11
8       A2   P4 Time.1     9
9       A1   P1 Time.2    14
10      A1   P2 Time.2    11
11      A2   P3 Time.2     8
12      A2   P4 Time.2     2

When specifying which columns to gather, there are also a number of "Helper" functions that provide convenient ways to select columns:

  • contains(""), columns whose name match the character string (not case sensitive by default)
  • starts_with(""), columns whose names begin with the character string (not case sensitive by default)
  • ends_with(""), columns whose names end with the character string (not case sensitive by default)
  • one_of(c("","")), columns whose names are amongst the character strings
  • matches(""), columns whose names are matched by the character string of a regular expression (not case sensitive by default)
  • everything(), all columns. This is useful in combination with other specifiers
  • num_range("",), select all columns whose names start with the first argument (a character string) followed by a number in the integer sequence provided as the second argument
Selections are negated by prepending with a '-' and columns are ordered according to the order that they are selected.

Via gather (package:tidyr)
data.w %>% gather(key=Time,value=Count, contains('Time'))
#OR
data.w %>% gather(key=Time,value=Count, num_range('Time.',0:2))
#OR
data.w %>% gather(key=Time,value=Count, matches('Time.[0-2]'))
   Between Plot   Time Count
1       A1   P1 Time.0     8
2       A1   P2 Time.0    10
3       A2   P3 Time.0     7
4       A2   P4 Time.0    11
5       A1   P1 Time.1    14
6       A1   P2 Time.1    12
7       A2   P3 Time.1    11
8       A2   P4 Time.1     9
9       A1   P1 Time.2    14
10      A1   P2 Time.2    11
11      A2   P3 Time.2     8
12      A2   P4 Time.2     2
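The remaining helpers behave in the same way. A self-contained sketch using starts_with() (data.w is reconstructed inline here so the snippet runs on its own):

```r
library(tidyr)

## Reconstruct data.w to match the wide table above
data.w <- data.frame(Between = c("A1", "A1", "A2", "A2"),
                     Plot    = paste0("P", 1:4),
                     Time.0  = c(8, 10, 7, 11),
                     Time.1  = c(14, 12, 11, 9),
                     Time.2  = c(14, 11, 8, 2))

## Gather every column whose name starts with 'Time'
data.w %>% gather(key = Time, value = Count, starts_with("Time"))
```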

Long to wide (cast)

The spread() function (tidyr package) is very useful for converting long datasets into wide (repeated measures-like) format. The important parameters to specify are:

  • key, the existing categorical variable whose levels will define the new variable names
  • value, the existing numeric variable whose values will fill the new wide-format columns
Note, it is best that data are already as long as possible.

data.l
Resp1 Between Plot Subplot Within
1 8 A1 P1 S1 B1
2 10 A1 P1 S1 B2
3 7 A1 P1 S2 B1
4 11 A1 P1 S2 B2
5 14 A2 P2 S3 B1
6 12 A2 P2 S3 B2
7 11 A2 P2 S4 B1
8 9 A2 P2 S4 B2
9 14 A3 P3 S5 B1
10 11 A3 P3 S5 B2
11 8 A3 P3 S6 B1
12 2 A3 P3 S6 B2
13 8 A1 P4 S7 B1
14 10 A1 P4 S7 B2
15 7 A1 P4 S8 B1
16 12 A1 P4 S8 B2
17 11 A2 P5 S9 B1
18 12 A2 P5 S9 B2
19 12 A2 P5 S10 B1
20 10 A2 P5 S10 B2
21 3 A3 P6 S11 B1
22 11 A3 P6 S11 B2
23 13 A3 P6 S12 B1
24 7 A3 P6 S12 B2
Code to generate these data:
data.l = data %>% select(-Resp2)
Widen (spread) Resp1 for repeated measures (Within)
Via spread (package:tidyr)
data.l %>% spread(Within,Resp1)
   Between Plot Subplot B1 B2
1       A1   P1      S1  8 10
2       A1   P1      S2  7 11
3       A1   P4      S7  8 10
4       A1   P4      S8  7 12
5       A2   P2      S3 14 12
6       A2   P2      S4 11  9
7       A2   P5      S9 11 12
8       A2   P5     S10 12 10
9       A3   P3      S5 14 11
10      A3   P3      S6  8  2
11      A3   P6     S11  3 11
12      A3   P6     S12 13  7

If the data are not already fully gathered, then it is necessary to gather before spreading. For example, if we return to the data that contain both Resp1 and Resp2...

data
Resp1 Resp2 Between Plot Subplot Within
1 8 17 A1 P1 S1 B1
2 10 18 A1 P1 S1 B2
3 7 17 A1 P1 S2 B1
4 11 21 A1 P1 S2 B2
5 14 19 A2 P2 S3 B1
6 12 13 A2 P2 S3 B2
7 11 24 A2 P2 S4 B1
8 9 18 A2 P2 S4 B2
9 14 25 A3 P3 S5 B1
10 11 18 A3 P3 S5 B2
11 8 27 A3 P3 S6 B1
12 2 22 A3 P3 S6 B2
13 8 17 A1 P4 S7 B1
14 10 22 A1 P4 S7 B2
15 7 16 A1 P4 S8 B1
16 12 13 A1 P4 S8 B2
17 11 23 A2 P5 S9 B1
18 12 19 A2 P5 S9 B2
19 12 23 A2 P5 S10 B1
20 10 21 A2 P5 S10 B2
21 3 17 A3 P6 S11 B1
22 11 16 A3 P6 S11 B2
23 13 26 A3 P6 S12 B1
24 7 28 A3 P6 S12 B2
Code to generate these data:
set.seed(1)
data <- expand.grid(Within=paste("B",1:2,sep=""),
                    Subplot=paste("S",1:2,sep=""),
                    Plot=paste("P",1:6,sep=""))
data$Subplot <- gl(12,2,24,lab=paste("S",1:12,sep=""))
data$Between <- gl(3,4,24,lab=paste("A",1:3,sep=""))
data$Resp1 <- rpois(24,10)
data$Resp2 <- rpois(24,20)
data <- with(data,data.frame(Resp1,Resp2,Between,Plot,Subplot,Within))
Widen (spread) both Resp1 and Resp2 for repeated measures
Via spread (package:tidyr)
data %>% gather(Resp,Count, contains('Resp')) %>% spread(Within,Count)
   Between Plot Subplot  Resp B1 B2
1       A1   P1      S1 Resp1  8 10
2       A1   P1      S1 Resp2 17 18
3       A1   P1      S2 Resp1  7 11
4       A1   P1      S2 Resp2 17 21
5       A1   P4      S7 Resp1  8 10
6       A1   P4      S7 Resp2 17 22
7       A1   P4      S8 Resp1  7 12
8       A1   P4      S8 Resp2 16 13
9       A2   P2      S3 Resp1 14 12
10      A2   P2      S3 Resp2 19 13
11      A2   P2      S4 Resp1 11  9
12      A2   P2      S4 Resp2 24 18
13      A2   P5      S9 Resp1 11 12
14      A2   P5      S9 Resp2 23 19
15      A2   P5     S10 Resp1 12 10
16      A2   P5     S10 Resp2 23 21
17      A3   P3      S5 Resp1 14 11
18      A3   P3      S5 Resp2 25 18
19      A3   P3      S6 Resp1  8  2
20      A3   P3      S6 Resp2 27 22
21      A3   P6     S11 Resp1  3 11
22      A3   P6     S11 Resp2 17 16
23      A3   P6     S12 Resp1 13  7
24      A3   P6     S12 Resp2 26 28

Note, the above data are now untidy.

Combining columns (unite)

If data are recorded with excessive fidelity, it may be useful to combine multiple fields into a single field. For example, if the date was recorded across three fields (year, month and day), we may wish to combine these to form a single date field.

data.d
year month day Resp1
1 2008 07 13 16.00
2 2008 09 28 25.00
3 2009 02 21 52.00
4 2009 10 23 18.00
5 2008 05 26 0.00
6 2009 10 14 45.00
7 2009 11 15 40.00
8 2009 04 23 38.00
9 2009 03 30 9.00
10 2008 02 14 22.00
Code to generate these data:
set.seed(1)
data.d = data.frame(Date=sample(seq(as.Date('2008-01-01'),
                                    as.Date('2009-12-31'), by='day'), size=10),
                    Resp1=rnbinom(10,5,mu=30) * as.numeric(replicate(5,
                          rbinom(2,1,0.8)))) %>%
    separate(Date,into=c('year','month','day'))
Unite year, month and day into a single 'Date' field.
Via unite (package:tidyr)
data.d %>% unite(year,month,day,col='Date',sep='-')
         Date Resp1
1  2008-07-13    16
2  2008-09-28    25
3  2009-02-21    52
4  2009-10-23    18
5  2008-05-26     0
6  2009-10-14    45
7  2009-11-15    40
8  2009-04-23    38
9  2009-03-30     9
10 2008-02-14    22

In the last spread example, we widened Resp1 and Resp2 for each level of Within. However, we may wish to present this table in a more compact form in which the data are further spread into each combination of Resp and Within. Note, doing so yields even more untidy data.

Unite Resp and Within into a single field, then spread.
Via unite (package:tidyr)
data %>% gather(Resp,Count, contains('Resp')) %>% unite(RW, Resp, Within) %>% spread(RW,Count)
   Between Plot Subplot Resp1_B1 Resp1_B2 Resp2_B1 Resp2_B2
1       A1   P1      S1        8       10       17       18
2       A1   P1      S2        7       11       17       21
3       A1   P4      S7        8       10       17       22
4       A1   P4      S8        7       12       16       13
5       A2   P2      S3       14       12       19       13
6       A2   P2      S4       11        9       24       18
7       A2   P5      S9       11       12       23       19
8       A2   P5     S10       12       10       23       21
9       A3   P3      S5       14       11       25       18
10      A3   P3      S6        8        2       27       22
11      A3   P6     S11        3       11       17       16
12      A3   P6     S12       13        7       26       28

Splitting columns apart (separate)

Separating variables is the opposite of uniting them. A field is separated by either indicating the character(s) to use as a separator or providing the position(s) at which to split.

data.c
Year Resp1
1 M25 45.00
2 M28 40.00
3 F29 38.00
4 F43 9.00
5 M34 22.00
6 F25 20.00
7 F26 22.00
8 F28 32.00
9 F27 38.00
10 M29 47.00
Code to generate these data:
set.seed(1)
data.c <- data.frame(Year=paste0(sample(c('M','F'),10,replace=TRUE),rpois(10,30)),
                     Resp1=rnbinom(10,5,mu=30) * as.numeric(replicate(5,rbinom(2,1,0.8))))
Separate the Year field into 'Gender' and 'Age' fields.
Via separate (package:tidyr)
data.c %>% separate(Year,into=c('Gender','Age'),sep=1)
   Gender Age Resp1
1       M  25    45
2       M  28    40
3       F  29    38
4       F  43     9
5       M  34    22
6       F  25    20
7       F  26    22
8       F  28    32
9       F  27    38
10      M  29    47
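To illustrate the character-separator form as well (the data frame below is a hypothetical example rather than one of the workshop datasets):

```r
library(tidyr)

## Hypothetical sample codes of the form Site-Replicate
d <- data.frame(Sample = c("A-1", "A-2", "B-1", "B-2"))
d %>% separate(Sample, into = c("Site", "Rep"), sep = "-")
```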

Summary and Vectorized functions

Many data manipulation actions involve the use of specific auxiliary functions to act on specific parts of the data set. For example, if we wish to summarize a data set, we might apply the mean() function to the numeric vectors (variables) and the levels() or unique() function to character or factor vectors. On the other hand, if we wish to generate log-transformed versions of the numeric vectors, then we would apply the log() function to each of those numeric vectors. Furthermore, we might use other auxiliary functions that return vectors of either booleans (TRUE or FALSE) or integers that can represent row or column indices to determine which observations to perform actions on.

Broadly speaking, summary functions take a vector and return a single value. Familiar examples of summary functions are mean(), var(), min() etc. By contrast, vectorized functions take a vector and return a vector of the same length as the original input vector. rank(), cumsum() and log() are all examples of vectorized functions.
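The distinction is easy to verify directly in base R:

```r
x <- c(2, 7, 1, 9)

## Summary function: collapses the vector to a single value
mean(x)   # length 1
## Vectorized function: one output value per input element
log(x)    # length 4
```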

The dplyr package introduces a number of additional summary and vectorized functions that are deemed useful augmentations of the standard set available in base R. The following tables describe the most useful functions (from base R and dplyr), along with the manipulation functions to which they can be applied. Those defined in dplyr are shown with the dplyr:: namespace prefix.

To demonstrate summary and vector functions, the following vectors will be used.

## Generate some data
set.seed(123)
(x = rgamma(10,5,1))
 [1] 3.389585 7.378957 1.629561 4.778440 8.873564 5.530862 2.216495 1.581772 7.471264 5.295647
(A = sample(letters[1:2], size=10, replace=TRUE))
 [1] "b" "b" "b" "b" "a" "a" "b" "b" "b" "b"
(B = sample(c(TRUE,FALSE), size=10, replace=TRUE))
 [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Summary functions

Function | Description | Applies to | Example
mean(), median() | Arithmetic mean and median | summarize(), mutate(), filter()
mean(x)
[1] 4.814615
median(B)
[1] 1
sum() | Sum | summarize(), mutate(), filter()
sum(x)
[1] 48.14615
sum(B)
[1] 9
quantile() | Quantiles | summarize(), mutate(), filter()
*must only return a single quantile, so prob must be supplied
quantile(x,prob=0.25)
     25% 
2.509767 
quantile(B,prob=0.25)
25% 
  1 
min(), max() | Minimum and maximum | summarize(), mutate(), filter()
min(x)
[1] 1.581772
min(B)
[1] 0
IQR(), mad() | Inter-quartile range and median absolute deviation | summarize(), mutate(), filter()
IQR(x)
[1] 4.407166
mad(B)
[1] 0
var(), sd() | Variance and standard deviation | summarize(), mutate(), filter()
var(x)
[1] 6.692355
sd(B)
[1] 0.3162278
dplyr::first(), dplyr::last(), dplyr::nth() | First, last and nth value | summarize(), mutate(), filter()
first(x)
[1] 3.389585
first(x,order_by=A)
[1] 8.873564
last(B)
[1] TRUE
nth(A,4)
[1] "b"
dplyr::n(), dplyr::n_distinct() | Number of values / distinct values | summarize(), mutate(), filter()
*n() is a special case that cannot be used outside of these functions
n_distinct(A)
[1] 2

Vectorized functions

Function | Description | Applies to | Example
log(), exp(), scale() | Logarithmic, exponential and scaling transformations | mutate(), filter()
log(x)
 [1] 1.2207074 1.9986324 0.4883105 1.5641140 2.1830765
 [6] 1.7103437 0.7959271 0.4585455 2.0110642 1.6668851
exp(x)
 [1]   29.653639 1601.918872    5.101634  118.918637
 [5] 7140.686681  252.361318    9.175114    4.863565
 [9] 1756.825350  199.466617
scale(x)
             [,1]
 [1,] -0.55085136
 [2,]  0.99125776
 [3,] -1.23119621
 [4,] -0.01398362
 [5,]  1.56900441
 [6,]  0.27686846
 [7,] -1.00431433
 [8,] -1.24966932
 [9,]  1.02693911
[10,]  0.18594510
attr(,"scaled:center")
[1] 4.814615
attr(,"scaled:scale")
[1] 2.586959
cumsum(), cumprod() | Cumulative sum and product | mutate(), filter()
cumsum(x)
 [1]  3.389585 10.768542 12.398103 17.176543 26.050107
 [6] 31.580969 33.797464 35.379235 42.850499 48.146146
cumsum(x>3)
 [1] 1 2 2 3 4 5 5 5 6 7
cumsum(B)
 [1] 1 2 2 3 4 5 6 7 8 9
cummin(), cummax() | Cumulative min and max | mutate(), filter()
cummax(x)
 [1] 3.389585 7.378957 7.378957 7.378957 8.873564 8.873564
 [7] 8.873564 8.873564 8.873564 8.873564
cummin(B)
 [1] 1 1 0 0 0 0 0 0 0 0
dplyr::cummean() | Cumulative mean | mutate(), filter()
cummean(x)
 [1] 3.389585 5.384271 4.132701 4.294136 5.210021 5.263495
 [7] 4.828209 4.422404 4.761167 4.814615
cummean(x>2)
 [1] 1.0000000 1.0000000 0.6666667 0.7500000 0.8000000
 [6] 0.8333333 0.8571429 0.7500000 0.7777778 0.8000000
cummean(B)
 [1] 1.0000000 1.0000000 0.6666667 0.7500000 0.8000000
 [6] 0.8333333 0.8571429 0.8750000 0.8888889 0.9000000
dplyr::cumall(), dplyr::cumany() | Cumulative all and any | mutate(), filter()
*these are of most use as part of a filter
cumall(x)
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
cumall(x>2)
 [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10] FALSE
cumany(B)
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
cumall(B)
 [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10] FALSE
dplyr::lead(), dplyr::lag() | Offset items by n or -n | mutate(), filter()
lag(x)
 [1]       NA 3.389585 7.378957 1.629561 4.778440 8.873564
 [7] 5.530862 2.216495 1.581772 7.471264
lag(x,3)
 [1]       NA       NA       NA 3.389585 7.378957 1.629561
 [7] 4.778440 8.873564 5.530862 2.216495
lead(A)
 [1] "b" "b" "b" "a" "a" "b" "b" "b" "b" NA 
rank(), order() | Rank and order of items | mutate(), filter()
rank(x)
 [1]  4  8  2  5 10  7  3  1  9  6
order(x)
 [1]  8  3  7  1  4 10  6  2  9  5
rank(B)
 [1] 6 6 1 6 6 6 6 6 6 6
dplyr::min_rank(), dplyr::percent_rank(), dplyr::dense_rank() | Ranks in which ties = min (min_rank), expressed as a percentage (percent_rank), or without gaps (dense_rank) | mutate(), filter()
min_rank(x)
 [1]  4  8  2  5 10  7  3  1  9  6
percent_rank(x)
 [1] 0.3333333 0.7777778 0.1111111 0.4444444 1.0000000
 [6] 0.6666667 0.2222222 0.0000000 0.8888889 0.5555556
dense_rank(x)
 [1]  4  8  2  5 10  7  3  1  9  6
dplyr::row_number() | Ranks in which ties = first | mutate(), filter()
row_number(x)
 [1]  4  8  2  5 10  7  3  1  9  6
row_number(A)
 [1]  3  4  5  6  1  2  7  8  9 10
row_number(B)
 [1]  2  3  1  4  5  6  7  8  9 10
dplyr::cume_dist() | Proportion of values less than or equal to the current value | mutate(), filter()
cume_dist(x)
 [1] 0.4 0.8 0.2 0.5 1.0 0.7 0.3 0.1 0.9 0.6
cume_dist(A)
 [1] 1.0 1.0 1.0 1.0 0.2 0.2 1.0 1.0 1.0 1.0
dplyr::ntile() | Partition into n bins | mutate(), filter()
ntile(x,3)
 [1] 1 3 1 2 3 2 1 1 3 2
dplyr::between() | Are the values between two numbers? | mutate(), filter()
between(x,3,7)
 [1]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
[10]  TRUE
dplyr::if_else(), dplyr::case_when() | Assign values based on conditions | mutate(), filter()
if_else(x > 3,'H','L')
 [1] "H" "H" "L" "H" "H" "H" "L" "L" "H" "H"
case_when(x<=3 ~ 'L', x>3 & x<=6 ~ 'M', x>6 ~ 'H')
 [1] "M" "H" "L" "M" "H" "M" "L" "L" "H" "M"
dplyr::recode(), dplyr::recode_factor() | Recode items | mutate(), filter()
recode(A,a='apple', b='banana')
 [1] "banana" "banana" "banana" "banana" "apple"  "apple" 
 [7] "banana" "banana" "banana" "banana"
recode_factor(A,a='apple', b='banana')
 [1] banana banana banana banana apple  apple  banana banana
 [9] banana banana
Levels: apple banana

Alterations to existing variables

Sorting data (arrange)

Most statistical analyses are invariant to the data order and thus data reordering is typically only for aesthetics in tables and figures.

Sorting data has the potential to be one of the most dangerous forms of data manipulation, particularly in spreadsheets in which there is no real binding between individual columns or rows. It is far too easy to accidentally sort the data in a single column (or row) without applying the ordering to all the other columns in the data set, thereby resulting in broken data.

Whilst the above apocalypse is still possible in R, the data structures and manipulation interfaces mean that you really have to try to break the data in this way. Furthermore, you are encouraged to store reordered data in a different object from the original data, making it trivial to roll back.

An example data set (data.1)
Plot Cond Between Temp LAT LONG
1 P1 H A1 15.74 17.26 146.24
2 P2 H A1 23.84 14.07 144.89
3 P3 H A1 13.64 20.75 144.69
4 P4 H A2 37.95 18.41 142.06
5 P1 M A2 25.30 18.47 144.04
6 P2 M A2 13.80 20.39 145.84
7 P3 M A3 26.87 20.14 147.72
8 P4 M A3 29.38 19.69 144.79
9 P1 L A3 27.76 20.34 145.78
10 P2 L A4 18.95 20.06 144.89
11 P3 L A4 37.12 18.65 142.25
12 P4 L A4 25.90 14.52 144.17
Code to generate these data:
set.seed(1)
data.1 <- expand.grid(Plot=paste("P",1:4,sep=""),Cond=c("H","M","L"))
data.1$Cond <- factor(as.character(data.1$Cond))
data.1$Between <- gl(4,3,12,lab=paste("A",1:4,sep=""))
data.1$Temp <- rnorm(12,22,10)
data.1$LAT <- rnorm(12,18.5,2)
data.1$LONG <- rnorm(12,145,2)
Sort the data by LAT
Via arrange (package:dplyr)
data.1 %>% arrange(LAT)
   Plot Cond Between     Temp      LAT     LONG
1    P2    H      A1 23.83643 14.07060 144.8877
2    P4    L      A4 25.89843 14.52130 144.1700
3    P1    H      A1 15.73546 17.25752 146.2397
4    P4    H      A2 37.95281 18.41013 142.0585
5    P1    M      A2 25.29508 18.46762 144.0437
6    P3    L      A4 37.11781 18.64913 142.2459
7    P4    M      A3 29.38325 19.68780 144.7944
8    P2    L      A4 18.94612 20.06427 144.8924
9    P3    M      A3 26.87429 20.14244 147.7174
10   P1    L      A3 27.75781 20.33795 145.7753
11   P2    M      A2 13.79532 20.38767 145.8359
12   P3    H      A1 13.64371 20.74986 144.6884
Via order
data.1[order(data.1$LAT),]
   Plot Cond Between     Temp      LAT     LONG
2    P2    H      A1 23.83643 14.07060 144.8877
12   P4    L      A4 25.89843 14.52130 144.1700
1    P1    H      A1 15.73546 17.25752 146.2397
4    P4    H      A2 37.95281 18.41013 142.0585
5    P1    M      A2 25.29508 18.46762 144.0437
11   P3    L      A4 37.11781 18.64913 142.2459
8    P4    M      A3 29.38325 19.68780 144.7944
10   P2    L      A4 18.94612 20.06427 144.8924
7    P3    M      A3 26.87429 20.14244 147.7174
9    P1    L      A3 27.75781 20.33795 145.7753
6    P2    M      A2 13.79532 20.38767 145.8359
3    P3    H      A1 13.64371 20.74986 144.6884
Sort the data by LAT (highest to lowest)
Via arrange (package:dplyr)
data.1 %>% arrange(desc(LAT))
#OR
data.1 %>% arrange(-LAT)
   Plot Cond Between     Temp      LAT     LONG
1    P3    H      A1 13.64371 20.74986 144.6884
2    P2    M      A2 13.79532 20.38767 145.8359
3    P1    L      A3 27.75781 20.33795 145.7753
4    P3    M      A3 26.87429 20.14244 147.7174
5    P2    L      A4 18.94612 20.06427 144.8924
6    P4    M      A3 29.38325 19.68780 144.7944
7    P3    L      A4 37.11781 18.64913 142.2459
8    P1    M      A2 25.29508 18.46762 144.0437
9    P4    H      A2 37.95281 18.41013 142.0585
10   P1    H      A1 15.73546 17.25752 146.2397
11   P4    L      A4 25.89843 14.52130 144.1700
12   P2    H      A1 23.83643 14.07060 144.8877
Via order
data.1[rev(order(data.1$LAT)),]
   Plot Cond Between     Temp      LAT     LONG
3    P3    H      A1 13.64371 20.74986 144.6884
6    P2    M      A2 13.79532 20.38767 145.8359
9    P1    L      A3 27.75781 20.33795 145.7753
7    P3    M      A3 26.87429 20.14244 147.7174
10   P2    L      A4 18.94612 20.06427 144.8924
8    P4    M      A3 29.38325 19.68780 144.7944
11   P3    L      A4 37.11781 18.64913 142.2459
5    P1    M      A2 25.29508 18.46762 144.0437
4    P4    H      A2 37.95281 18.41013 142.0585
1    P1    H      A1 15.73546 17.25752 146.2397
12   P4    L      A4 25.89843 14.52130 144.1700
2    P2    H      A1 23.83643 14.07060 144.8877
Sort the data by Cond and then Temp
Via arrange (package:dplyr)
data.1 %>% arrange(Cond,Temp)
   Plot Cond Between     Temp      LAT     LONG
1    P3    H      A1 13.64371 20.74986 144.6884
2    P1    H      A1 15.73546 17.25752 146.2397
3    P2    H      A1 23.83643 14.07060 144.8877
4    P4    H      A2 37.95281 18.41013 142.0585
5    P2    L      A4 18.94612 20.06427 144.8924
6    P4    L      A4 25.89843 14.52130 144.1700
7    P1    L      A3 27.75781 20.33795 145.7753
8    P3    L      A4 37.11781 18.64913 142.2459
9    P2    M      A2 13.79532 20.38767 145.8359
10   P1    M      A2 25.29508 18.46762 144.0437
11   P3    M      A3 26.87429 20.14244 147.7174
12   P4    M      A3 29.38325 19.68780 144.7944
Via order
data.1[order(data.1$Cond,data.1$Temp),]
   Plot Cond Between     Temp      LAT     LONG
3    P3    H      A1 13.64371 20.74986 144.6884
1    P1    H      A1 15.73546 17.25752 146.2397
2    P2    H      A1 23.83643 14.07060 144.8877
4    P4    H      A2 37.95281 18.41013 142.0585
10   P2    L      A4 18.94612 20.06427 144.8924
12   P4    L      A4 25.89843 14.52130 144.1700
9    P1    L      A3 27.75781 20.33795 145.7753
11   P3    L      A4 37.11781 18.64913 142.2459
6    P2    M      A2 13.79532 20.38767 145.8359
5    P1    M      A2 25.29508 18.46762 144.0437
7    P3    M      A3 26.87429 20.14244 147.7174
8    P4    M      A3 29.38325 19.68780 144.7944
Sort the data by the sum of Temp and LAT
Via arrange (package:dplyr)
data.1 %>% arrange(Temp+LAT)
   Plot Cond Between     Temp      LAT     LONG
1    P1    H      A1 15.73546 17.25752 146.2397
2    P2    M      A2 13.79532 20.38767 145.8359
3    P3    H      A1 13.64371 20.74986 144.6884
4    P2    H      A1 23.83643 14.07060 144.8877
5    P2    L      A4 18.94612 20.06427 144.8924
6    P4    L      A4 25.89843 14.52130 144.1700
7    P1    M      A2 25.29508 18.46762 144.0437
8    P3    M      A3 26.87429 20.14244 147.7174
9    P1    L      A3 27.75781 20.33795 145.7753
10   P4    M      A3 29.38325 19.68780 144.7944
11   P3    L      A4 37.11781 18.64913 142.2459
12   P4    H      A2 37.95281 18.41013 142.0585
Via order
data.1[order(data.1$Temp+data.1$LAT),]
   Plot Cond Between     Temp      LAT     LONG
1    P1    H      A1 15.73546 17.25752 146.2397
6    P2    M      A2 13.79532 20.38767 145.8359
3    P3    H      A1 13.64371 20.74986 144.6884
2    P2    H      A1 23.83643 14.07060 144.8877
10   P2    L      A4 18.94612 20.06427 144.8924
12   P4    L      A4 25.89843 14.52130 144.1700
5    P1    M      A2 25.29508 18.46762 144.0437
7    P3    M      A3 26.87429 20.14244 147.7174
9    P1    L      A3 27.75781 20.33795 145.7753
8    P4    M      A3 29.38325 19.68780 144.7944
11   P3    L      A4 37.11781 18.64913 142.2459
4    P4    H      A2 37.95281 18.41013 142.0585

Re-levelling (sorting) factors

In the previous section, we altered the order of the records in the data set. However, it is important to realize that categorical variables (factors) in a data frame have a property (called levels) that records the order of the levels of the factor, and that this property is completely independent of the order of records in the data structure. By default, factor levels are ordered alphabetically, irrespective of the order in which they appear in the data set. Since the order of factor levels dictates the order in which groups are plotted in figures and tabulated in tables, re-levelling factors is often desirable.
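As a minimal base-R illustration of this point (using a small hypothetical factor rather than data.2), note that sorting the values changes the order of the records but leaves the levels property untouched:

```r
# Hypothetical factor; levels default to alphabetical order
f <- factor(c("M", "H", "L", "H"))
levels(f)               # "H" "L" "M"

# Sorting rearranges the records ...
f_sorted <- sort(f)
as.character(f_sorted)  # "H" "H" "L" "M"

# ... but the levels property is unchanged
levels(f_sorted)        # still "H" "L" "M"
```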

An example of a data set (data.2)
Plot Cond Between Temp
1 P1 H A1 15.74
2 P2 H A1 23.84
3 P3 H A1 13.64
4 P4 H A2 37.95
5 P1 M A2 25.30
6 P2 M A2 13.80
7 P3 M A3 26.87
8 P4 M A3 29.38
9 P1 L A3 27.76
10 P2 L A4 18.95
11 P3 L A4 37.12
12 P4 L A4 25.90
Most commonly this step is performed within a mutate.
View code
set.seed(1)
data.2 <- expand.grid(Plot=paste("P",1:4,sep=""),Cond=c("H","M","L"))
data.2$Cond <- factor(as.character(data.2$Cond))
data.2$Between <- gl(4,3,12,lab=paste("A",1:4,sep=""))
data.2$Temp <- rnorm(12,22,10)
Default alphabetical order
levels(data.2$Cond)
[1] "H" "L" "M"
Re-levelled into L, M, H
data.2$Cond <- fct_relevel(data.2$Cond, 'L','M','H')
levels(data.2$Cond)
[1] "L" "M" "H"
View code
data.2$Cond <- factor(data.2$Cond, levels=c("L","M","H"))
levels(data.2$Cond)
[1] "L" "M" "H"

Sometimes it is useful (particularly when preparing data for graphing) to have the order of the levels of a factor defined by another variable - for example to allow groups to be plotted from highest to lowest response etc. This can be achieved via the fct_reorder() and fct_reorder2() functions from the forcats package.

An example of a data set (data.2)
Plot Cond Between Temp
1 P1 H A1 15.74
2 P2 H A1 23.84
3 P3 H A1 13.64
4 P4 H A2 37.95
5 P1 M A2 25.30
6 P2 M A2 13.80
7 P3 M A3 26.87
8 P4 M A3 29.38
9 P1 L A3 27.76
10 P2 L A4 18.95
11 P3 L A4 37.12
12 P4 L A4 25.90
Most commonly this step is performed within a mutate.
View code
set.seed(1)
data.2 <- expand.grid(Plot=paste("P",1:4,sep=""),Cond=c("H","M","L"))
data.2$Cond <- factor(as.character(data.2$Cond))
data.2$Between <- gl(4,3,12,lab=paste("A",1:4,sep=""))
data.2$Temp <- rnorm(12,22,10)
Default alphabetical order
levels(data.2$Cond)
[1] "H" "L" "M"
Re-levelled according to Temp means
data.2$Cond <- fct_reorder(data.2$Cond, data.2$Temp, .fun=mean)
levels(data.2$Cond)
[1] "H" "M" "L"
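Base R offers a similar facility via reorder(), which sets the levels of a factor according to a summary of another variable. A minimal sketch with hypothetical data (not the tutorial's data.2):

```r
# Hypothetical condition factor and an associated response
cond <- factor(c("H", "H", "M", "M", "L", "L"))
temp <- c(10, 12, 20, 22, 30, 32)

# Order the levels of cond by the mean of temp within each level
cond2 <- reorder(cond, temp, FUN = mean)
levels(cond2)  # "H" "M" "L" (ascending mean temp)
```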

Relabelling a factor

In this case, the L, M, H labels are efficient, yet not as informative as we might like for labels on a graphic. Instead, we would probably prefer something like "Low", "Medium" and "High".

Give the levels of a factor more informative labels
Via fct_recode (package:forcats)
# I will create another data set as I want to retain the original for further 
# demonstrations
data.3 <- data.2
# rename the factor levels
data.3$Cond<-fct_recode(data.3$Cond, Low="L", Medium="M", High="H")
data.3
   Plot   Cond Between     Temp
1    P1   High      A1 15.73546
2    P2   High      A1 23.83643
3    P3   High      A1 13.64371
4    P4   High      A2 37.95281
5    P1 Medium      A2 25.29508
6    P2 Medium      A2 13.79532
7    P3 Medium      A3 26.87429
8    P4 Medium      A3 29.38325
9    P1    Low      A3 27.75781
10   P2    Low      A4 18.94612
11   P3    Low      A4 37.11781
12   P4    Low      A4 25.89843
levels(data.3$Cond)
[1] "High"   "Medium" "Low"   
Via factor
# I will create another data set as I want to retain the original for further 
# demonstrations
data.3 <- data.2
# reorder AND rename the factor levels
data.3$Cond <- factor(data.3$Cond,levels=c("L","M","H"),lab=c("Low","Medium","High"))
data.3
   Plot   Cond Between     Temp
1    P1   High      A1 15.73546
2    P2   High      A1 23.83643
3    P3   High      A1 13.64371
4    P4   High      A2 37.95281
5    P1 Medium      A2 25.29508
6    P2 Medium      A2 13.79532
7    P3 Medium      A3 26.87429
8    P4 Medium      A3 29.38325
9    P1    Low      A3 27.75781
10   P2    Low      A4 18.94612
11   P3    Low      A4 37.11781
12   P4    Low      A4 25.89843
levels(data.3$Cond)
[1] "Low"    "Medium" "High"  

Renaming columns (variables)

Data can be sourced from a variety of locations, and often the column names (= variable or field names) are non-informative, overly long or otherwise inconvenient. It is possible to rename any or all of the column names of a data frame (or matrix).

Rename the column names (variables)
Via rename (package:dplyr)
Note that it is not necessary to rename all the columns via this routine.
data.2 %>% rename(Temperature=Temp,Condition=Cond, Treatment=Between)
   Plot Condition Treatment Temperature
1    P1         H        A1    15.73546
2    P2         H        A1    23.83643
3    P3         H        A1    13.64371
4    P4         H        A2    37.95281
5    P1         M        A2    25.29508
6    P2         M        A2    13.79532
7    P3         M        A3    26.87429
8    P4         M        A3    29.38325
9    P1         L        A3    27.75781
10   P2         L        A4    18.94612
11   P3         L        A4    37.11781
12   P4         L        A4    25.89843
Via colnames
# again restore the data from the original data.2
data.3 <- data.2
colnames(data.3) <- c("Plot","Condition","Treatment","Temperature")
data.3
   Plot Condition Treatment Temperature
1    P1         H        A1    15.73546
2    P2         H        A1    23.83643
3    P3         H        A1    13.64371
4    P4         H        A2    37.95281
5    P1         M        A2    25.29508
6    P2         M        A2    13.79532
7    P3         M        A3    26.87429
8    P4         M        A3    29.38325
9    P1         L        A3    27.75781
10   P2         L        A4    18.94612
11   P3         L        A4    37.11781
12   P4         L        A4    25.89843
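When only one column needs renaming, a base-R sketch (shown here on a small hypothetical data frame) is to index names() by name rather than position, which remains correct even if the column order changes:

```r
# Hypothetical two-column data frame
dd <- data.frame(Plot = c("P1", "P2"), Temp = c(15.7, 23.8))

# Rename the Temp column by matching its name, not its position
names(dd)[names(dd) == "Temp"] <- "Temperature"
names(dd)  # "Plot" "Temperature"
```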

Subsetting

We regularly want to run an analysis or generate a graphic for a subset of the data. Take for example the data used here. We may, for example, wish to explore a subset of the data for which Cond is "H". Warning: following any subsetting, it is crucial that you then instruct R to redefine the factor levels of any factors in the subsetted data set. When subsetting a dataset, all factors simply inherit the levels property of the original factors, and therefore the new factors potentially define more levels than are actually present in the subsetted data set. This, along with the solution, is illustrated below.

Another form of subsetting reduces the number of columns of the data frame, removing excessive and unwanted data fields. The two forms of subsetting are thus:

  • reducing the number of rows - filtering
  • reducing the number of columns - selecting

These two forms can also be performed simultaneously.

Filtering (subsetting by rows)

Filtering selects rows for which a condition is evaluated to be TRUE. Hence, any logical expression or vectorized function that returns Boolean values (TRUE or FALSE) can be used for filtering.

An example of a data set (data.2)
Plot Cond Between Temp
1 P1 H A1 15.74
2 P2 H A1 23.84
3 P3 H A1 13.64
4 P4 H A2 37.95
5 P1 M A2 25.30
6 P2 M A2 13.80
7 P3 M A3 26.87
8 P4 M A3 29.38
9 P1 L A3 27.76
10 P2 L A4 18.95
11 P3 L A4 37.12
12 P4 L A4 25.90
View code
set.seed(1)
data.2 <- expand.grid(Plot=paste("P",1:4,sep=""),Cond=c("H","M","L"))
data.2$Cond <- factor(as.character(data.2$Cond))
data.2$Between <- gl(4,3,12,lab=paste("A",1:4,sep=""))
data.2$Temp <- rnorm(12,22,10)
Keep only the observations where Cond is H
Via filter (package:dplyr)
data.2 %>% filter(Cond=="H")
  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P4    H      A2 37.95281
Via subset
subset(data.2,Cond=="H")
  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P4    H      A2 37.95281
Via indexing
data.2[data.2$Cond=="H",]
  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P4    H      A2 37.95281
Keep only the observations for which Temp is between 15 and 20
Via filter (package:dplyr)
data.2 %>% filter(between(Temp,15,20))
  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    L      A4 18.94612
Via subset
subset(data.2, Temp>15 & Temp<20)
   Plot Cond Between     Temp
1    P1    H      A1 15.73546
10   P2    L      A4 18.94612
Via indexing
data.2[data.2$Temp>15 & data.2$Temp<20,]
   Plot Cond Between     Temp
1    P1    H      A1 15.73546
10   P2    L      A4 18.94612
Keep only the observations after Temp has surpassed 25
Via filter (package:dplyr)
data.2 %>% filter(cumany(Temp>25))
  Plot Cond Between     Temp
1   P4    H      A2 37.95281
2   P1    M      A2 25.29508
3   P2    M      A2 13.79532
4   P3    M      A3 26.87429
5   P4    M      A3 29.38325
6   P1    L      A3 27.75781
7   P2    L      A4 18.94612
8   P3    L      A4 37.11781
9   P4    L      A4 25.89843
Keep only the observations where Cond is either H or M
Via filter (package:dplyr)
filter(data.2,Cond %in% c("H","M"))
#OR
filter(data.2,Cond=="H" | Cond=="M")
  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P4    H      A2 37.95281
5   P1    M      A2 25.29508
6   P2    M      A2 13.79532
7   P3    M      A3 26.87429
8   P4    M      A3 29.38325
Via subset
subset(data.2,Cond %in% c("H","M"))
  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P4    H      A2 37.95281
5   P1    M      A2 25.29508
6   P2    M      A2 13.79532
7   P3    M      A3 26.87429
8   P4    M      A3 29.38325
Via indexing
data.2[data.2$Cond %in% c("H","M"),]
  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P4    H      A2 37.95281
5   P1    M      A2 25.29508
6   P2    M      A2 13.79532
7   P3    M      A3 26.87429
8   P4    M      A3 29.38325
Keep only the observations with H OR M condition AND Temp < 30
Via filter (package:dplyr)
filter(data.2,Cond %in% c("H","M") & Temp<30)
#OR
filter(data.2,Cond %in% c("H","M"), Temp<30)
  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P1    M      A2 25.29508
5   P2    M      A2 13.79532
6   P3    M      A3 26.87429
7   P4    M      A3 29.38325
Via subset
subset(data.2,Cond %in% c("H","M") & Temp<30)
  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
5   P1    M      A2 25.29508
6   P2    M      A2 13.79532
7   P3    M      A3 26.87429
8   P4    M      A3 29.38325
Via indexing
#Note, the following will yield unexpected results when there are missing values!
data.2[data.2$Cond %in% c("H","M") & data.2$Temp<30,]
  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
5   P1    M      A2 25.29508
6   P2    M      A2 13.79532
7   P3    M      A3 26.87429
8   P4    M      A3 29.38325
Keep only the observations with Cond of H and Temp < 30 OR Cond of M and Temp < 26
Such a statement could be interpreted a number of ways (some of which are listed below):
  • (Cond=="H" & Temp<30) | (Cond=='M' & Temp<26)
  • (Cond=="H") & (Temp<30 | Cond=='M') & Temp<26
  • (Cond=="H") & (Temp<30 | Cond=='M' & Temp<26)
  • (Cond=="H" & (Temp<30 | Cond=='M')) & Temp<26
  • ...
Let's assume the first of these...
Via filter (package:dplyr)
filter(data.2,(Cond=="H" & Temp<30) | (Cond=='M' & Temp<26))
  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P1    M      A2 25.29508
5   P2    M      A2 13.79532
Via subset
subset(data.2,(Cond=="H" & Temp<30) | (Cond=='M' & Temp<26))
  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
5   P1    M      A2 25.29508
6   P2    M      A2 13.79532
Via indexing
#Note, the following will yield unexpected results when there are missing values!
data.2[(data.2$Cond=="H" & data.2$Temp<30) | (data.2$Cond=='M' & data.2$Temp<26),]
  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
5   P1    M      A2 25.29508
6   P2    M      A2 13.79532
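The repeated warning about missing values deserves a concrete illustration. In this base-R sketch (hypothetical data, not data.2), logical indexing propagates NA and produces a junk all-NA row, whereas wrapping the condition in which(), or using subset(), drops the NA positions:

```r
# Hypothetical data with a missing Temp value
dd <- data.frame(Cond = c("H", "H", "M"), Temp = c(15, NA, 28))

# Logical indexing: the NA comparison yields NA, producing an all-NA row
nrow(dd[dd$Temp < 20, ])         # 2 (one real row plus one NA row)

# which() returns only the positions that are TRUE, dropping NAs
nrow(dd[which(dd$Temp < 20), ])  # 1

# subset() also treats NA in the condition as FALSE
nrow(subset(dd, Temp < 20))      # 1
```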
Effect of subsetting on factor levels
#examine the levels of the Cond factor
levels(data.2$Cond)
[1] "H" "L" "M"
#subset the dataset to just Cond H
data.3<-filter(data.2,Cond=="H")
#examine subset data
data.3
  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P4    H      A2 37.95281
#examine the levels of the Cond factor
levels(data.3$Cond)
[1] "H" "L" "M"
levels(data.3$Between)
[1] "A1" "A2" "A3" "A4"
Notice that although we created a subset of the data that only includes cases where Cond is "H", each of the factors have retained their original level properties.
Correcting factor levels of all factors after subsetting
#subset the dataset to just Cond H
data.3<-filter(data.2,Cond=="H")
#drop the unused factor levels from all factors
data.3<-droplevels(data.3)
#examine the levels of each factor
levels(data.3$Cond)
[1] "H"
levels(data.3$Between)
[1] "A1" "A2"
Notice this time, the levels of each factor reflect the subset data.
Correcting factor levels of only a single factor after subsetting
#subset the dataset to just Cond H
data.3<-filter(data.2,Cond=="H")
#drop the unused factor levels from Cond
data.3$Cond<-fct_drop(data.3$Cond)
#OR
data.3$Cond<-factor(data.3$Cond)
#examine the levels of each factor
levels(data.3$Cond)
[1] "H"
levels(data.3$Between)
[1] "A1" "A2" "A3" "A4"
Notice this time, the levels of the Cond factor reflect the subset data, whereas the levels of the Between factor reflect the original dataset.

Selecting (subsetting by columns)

Selecting works either by including (or excluding) the column names that you indicate, or via special 'helper' functions that pass a vector of column indices to include in the subset data.

When selecting with the select function from the dplyr package, there are a number of "Helper" functions that provide convenient ways to select columns:

  • contains(""), columns whose name match the character string (not case sensitive by default)
  • starts_with(""), columns whose names begin with the character string (not case sensitive by default)
  • ends_with(""), columns whose names end with the character string (not case sensitive by default)
  • one_of(c("","")), columns whose names are amongst the character strings
  • matches(""), columns whose names are matched by the character string of a regular expression (not case sensitive by default)
  • everything(), all columns. This is useful in combination with other specifiers
  • num_range("",), select all columns whose names start with the first argument (a character string) followed by a number in the integer sequence provided as the second argument
Selections are negated by prepending with a '-' and columns are ordered according to the order that they are selected.
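For comparison, a base-R sketch (with a small hypothetical data frame) of name-pattern selection: grepl() on names() achieves starts_with / ends_with / matches style selection without dplyr:

```r
# Hypothetical data frame
dd <- data.frame(Resp1 = 1:3, Resp2 = 4:6, Plot = c("P1", "P2", "P3"))

# ends_with("1") equivalent: names ending in "1"
names(dd)[grepl("1$", names(dd))]             # "Resp1"

# starts_with("Resp") equivalent, as a column subset
dd2 <- dd[, grepl("^Resp", names(dd)), drop = FALSE]
names(dd2)                                    # "Resp1" "Resp2"
```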

An example of a data set (data)
Resp1 Resp2 Between Plot Subplot Within
1 8 17 A1 P1 S1 B1
2 10 18 A1 P1 S1 B2
3 7 17 A1 P1 S2 B1
4 11 21 A1 P1 S2 B2
5 14 19 A2 P2 S3 B1
6 12 13 A2 P2 S3 B2
7 11 24 A2 P2 S4 B1
8 9 18 A2 P2 S4 B2
9 14 25 A3 P3 S5 B1
10 11 18 A3 P3 S5 B2
11 8 27 A3 P3 S6 B1
12 2 22 A3 P3 S6 B2
13 8 17 A1 P4 S7 B1
14 10 22 A1 P4 S7 B2
15 7 16 A1 P4 S8 B1
16 12 13 A1 P4 S8 B2
17 11 23 A2 P5 S9 B1
18 12 19 A2 P5 S9 B2
19 12 23 A2 P5 S10 B1
20 10 21 A2 P5 S10 B2
21 3 17 A3 P6 S11 B1
22 11 16 A3 P6 S11 B2
23 13 26 A3 P6 S12 B1
24 7 28 A3 P6 S12 B2
View code
set.seed(1)
data <- expand.grid(Within=paste("B",1:2,sep=""),
                    Subplot=paste("S",1:2,sep=""),
                    Plot=paste("P",1:6,sep=""))
data$Subplot <- gl(12,2,24,lab=paste("S",1:12,sep=""))
data$Between <- gl(3,4,24,lab=paste("A",1:3,sep=""))
data$Resp1 <- rpois(24,10)
data$Resp2 <- rpois(24,20)
data <- with(data,data.frame(Resp1,Resp2,Between,Plot,Subplot,Within))
Keep only the Plot and Resp1 columns
Via select (package:dplyr)
data %>% select(Plot,Resp1)
#OR
data %>% select(starts_with('Pl'),ends_with("p1"))
#OR
data %>% select(contains('Plo', ignore.case=FALSE),matches(".*p1"))
#OR
data %>% select(-matches('w.*n'),-starts_with("Sub"),-matches('Resp'),matches('Resp1'))
#OR
data %>% select(one_of(c('Plot','Resp1')))
   Plot Resp1
1    P1     8
2    P1    10
3    P1     7
4    P1    11
5    P2    14
6    P2    12
7    P2    11
8    P2     9
9    P3    14
10   P3    11
11   P3     8
12   P3     2
13   P4     8
14   P4    10
15   P4     7
16   P4    12
17   P5    11
18   P5    12
19   P5    12
20   P5    10
21   P6     3
22   P6    11
23   P6    13
24   P6     7
Via subset
subset(data,select=c(Plot,Resp1))
   Plot Resp1
1    P1     8
2    P1    10
3    P1     7
4    P1    11
5    P2    14
6    P2    12
7    P2    11
8    P2     9
9    P3    14
10   P3    11
11   P3     8
12   P3     2
13   P4     8
14   P4    10
15   P4     7
16   P4    12
17   P5    11
18   P5    12
19   P5    12
20   P5    10
21   P6     3
22   P6    11
23   P6    13
24   P6     7
#OR
subset(data,select=c("Plot","Resp1"))
   Plot Resp1
1    P1     8
2    P1    10
3    P1     7
4    P1    11
5    P2    14
6    P2    12
7    P2    11
8    P2     9
9    P3    14
10   P3    11
11   P3     8
12   P3     2
13   P4     8
14   P4    10
15   P4     7
16   P4    12
17   P5    11
18   P5    12
19   P5    12
20   P5    10
21   P6     3
22   P6    11
23   P6    13
24   P6     7
Via indexing
data[,c("Plot","Resp1")]
   Plot Resp1
1    P1     8
2    P1    10
3    P1     7
4    P1    11
5    P2    14
6    P2    12
7    P2    11
8    P2     9
9    P3    14
10   P3    11
11   P3     8
12   P3     2
13   P4     8
14   P4    10
15   P4     7
16   P4    12
17   P5    11
18   P5    12
19   P5    12
20   P5    10
21   P6     3
22   P6    11
23   P6    13
24   P6     7
Keep the Plot/Subplot and Resp columns only
Via select (package:dplyr)
data %>% select(Plot,Subplot,Resp1,Resp2)
#OR
data %>% select(contains('plot'), matches('^Resp.*'))
#OR
data %>% select(matches('plot$'), num_range('Resp',1:2))
#OR
data %>% select(contains('plot'),everything(),-matches('w.*n'))
   Plot Subplot Resp1 Resp2
1    P1      S1     8    17
2    P1      S1    10    18
3    P1      S2     7    17
4    P1      S2    11    21
5    P2      S3    14    19
6    P2      S3    12    13
7    P2      S4    11    24
8    P2      S4     9    18
9    P3      S5    14    25
10   P3      S5    11    18
11   P3      S6     8    27
12   P3      S6     2    22
13   P4      S7     8    17
14   P4      S7    10    22
15   P4      S8     7    16
16   P4      S8    12    13
17   P5      S9    11    23
18   P5      S9    12    19
19   P5     S10    12    23
20   P5     S10    10    21
21   P6     S11     3    17
22   P6     S11    11    16
23   P6     S12    13    26
24   P6     S12     7    28
Via subset
subset(data,select=c(Plot,Subplot,Resp1,Resp2))
   Plot Subplot Resp1 Resp2
1    P1      S1     8    17
2    P1      S1    10    18
3    P1      S2     7    17
4    P1      S2    11    21
5    P2      S3    14    19
6    P2      S3    12    13
7    P2      S4    11    24
8    P2      S4     9    18
9    P3      S5    14    25
10   P3      S5    11    18
11   P3      S6     8    27
12   P3      S6     2    22
13   P4      S7     8    17
14   P4      S7    10    22
15   P4      S8     7    16
16   P4      S8    12    13
17   P5      S9    11    23
18   P5      S9    12    19
19   P5     S10    12    23
20   P5     S10    10    21
21   P6     S11     3    17
22   P6     S11    11    16
23   P6     S12    13    26
24   P6     S12     7    28
#OR
subset(data,select=c("Plot","Subplot","Resp1","Resp2"))
   Plot Subplot Resp1 Resp2
1    P1      S1     8    17
2    P1      S1    10    18
3    P1      S2     7    17
4    P1      S2    11    21
5    P2      S3    14    19
6    P2      S3    12    13
7    P2      S4    11    24
8    P2      S4     9    18
9    P3      S5    14    25
10   P3      S5    11    18
11   P3      S6     8    27
12   P3      S6     2    22
13   P4      S7     8    17
14   P4      S7    10    22
15   P4      S8     7    16
16   P4      S8    12    13
17   P5      S9    11    23
18   P5      S9    12    19
19   P5     S10    12    23
20   P5     S10    10    21
21   P6     S11     3    17
22   P6     S11    11    16
23   P6     S12    13    26
24   P6     S12     7    28
Via indexing
data[,c("Plot","Subplot","Resp1","Resp2")]
   Plot Subplot Resp1 Resp2
1    P1      S1     8    17
2    P1      S1    10    18
3    P1      S2     7    17
4    P1      S2    11    21
5    P2      S3    14    19
6    P2      S3    12    13
7    P2      S4    11    24
8    P2      S4     9    18
9    P3      S5    14    25
10   P3      S5    11    18
11   P3      S6     8    27
12   P3      S6     2    22
13   P4      S7     8    17
14   P4      S7    10    22
15   P4      S8     7    16
16   P4      S8    12    13
17   P5      S9    11    23
18   P5      S9    12    19
19   P5     S10    12    23
20   P5     S10    10    21
21   P6     S11     3    17
22   P6     S11    11    16
23   P6     S12    13    26
24   P6     S12     7    28

select_if

There is also a special form of select that keeps only those columns satisfying a condition.

Keep only the numeric variables
Via select (package:dplyr)
data %>% select_if(is.numeric)
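An equivalent base-R sketch (hypothetical data frame): apply the predicate over the columns with sapply() and use the resulting logical vector as a column index:

```r
# Hypothetical data frame with one non-numeric column
dd <- data.frame(Plot = c("P1", "P2"), Resp1 = c(8, 10), Resp2 = c(17, 18))

# Keep the columns for which is.numeric() returns TRUE
num <- dd[, sapply(dd, is.numeric), drop = FALSE]
names(num)  # "Resp1" "Resp2"
```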

filter and select

Recall that one of the advantages of working within the grammar of data manipulation framework is that data can be piped from one tool to another to achieve multiple manipulations.

Keep only the Plot and Temp columns and observations with H OR M condition AND Temp < 30
Via filter and select (package:dplyr)
data.2 %>% filter(Cond %in% c("H","M") & Temp<30) %>% select(Plot, Temp)
  Plot     Temp
1   P1 15.73546
2   P2 23.83643
3   P3 13.64371
4   P1 25.29508
5   P2 13.79532
6   P3 26.87429
7   P4 29.38325
Via subset
subset(data.2,Cond %in% c("H","M") & Temp<30, select=c(Plot,Temp))
  Plot     Temp
1   P1 15.73546
2   P2 23.83643
3   P3 13.64371
5   P1 25.29508
6   P2 13.79532
7   P3 26.87429
8   P4 29.38325
#OR
subset(data.2,Cond %in% c("H","M") & Temp<30, select=c(-Cond,-Between))
  Plot     Temp
1   P1 15.73546
2   P2 23.83643
3   P3 13.64371
5   P1 25.29508
6   P2 13.79532
7   P3 26.87429
8   P4 29.38325
Via indexing
#Note, the following will yield unexpected results when there are missing values!
data.2[data.2$Cond %in% c("H","M") & data.2$Temp<30, c("Plot","Temp")]
  Plot     Temp
1   P1 15.73546
2   P2 23.83643
3   P3 13.64371
5   P1 25.29508
6   P2 13.79532
7   P3 26.87429
8   P4 29.38325

Adding new variables (mutating)

To add new variables to (mutate) a data set, a vectorized function is applied across one or more existing columns.

This section will make use of the following data set:

A very simple dataset (data.s)
Between Plot Resp1 Resp2
1 A1 P1 8 13
2 A1 P2 10 22
3 A2 P3 7 23
4 A2 P4 11 22
View code
set.seed(1)
data.s <- expand.grid(Plot=paste("P",1:4,sep=""))
data.s$Between <- gl(2,2,4,lab=paste("A",1:2,sep=""))
data.s <- with(data.s,data.frame(Between,Plot,Resp1=rpois(4,10), Resp2=rpois(4,20)))

Whether transforming/scaling existing variables or generating new variables derived from existing ones, it is advisable to retain the existing variables (unless the data are huge and storage is tight). As with other routines featured in this tutorial, there are numerous ways to derive new variables from existing variables. The main options are:

    data.s$logResp1 <- log(data.s$Resp1)
    data.s$logResp2 <- log(data.s$Resp2)
    data.s
    
      Between Plot Resp1 Resp2 logResp1 logResp2
    1      A1   P1     8    13 2.079442 2.564949
    2      A1   P2    10    22 2.302585 3.091042
    3      A2   P3     7    23 1.945910 3.135494
    4      A2   P4    11    22 2.397895 3.091042
    
    transform(data.s, logResp1=log(Resp1), logRes2=log(Resp2))
    
      Between Plot Resp1 Resp2 logResp1  logRes2
    1      A1   P1     8    13 2.079442 2.564949
    2      A1   P2    10    22 2.302585 3.091042
    3      A2   P3     7    23 1.945910 3.135494
    4      A2   P4    11    22 2.397895 3.091042
    
    data.s
    
      Between Plot Resp1 Resp2
    1      A1   P1     8    13
    2      A1   P2    10    22
    3      A2   P3     7    23
    4      A2   P4    11    22
    
    data.s %>% mutate(logResp1=log(Resp1), logRes2=log(Resp2))
    
      Between Plot Resp1 Resp2 logResp1  logRes2
    1      A1   P1     8    13 2.079442 2.564949
    2      A1   P2    10    22 2.302585 3.091042
    3      A2   P3     7    23 1.945910 3.135494
    4      A2   P4    11    22 2.397895 3.091042
    
    data.s
    
      Between Plot Resp1 Resp2
    1      A1   P1     8    13
    2      A1   P2    10    22
    3      A2   P3     7    23
    4      A2   P4    11    22
    
The difference between transform and mutate is that mutate works with each column sequentially and therefore can derive new columns using the columns created in earlier iterations.

The main features of each of the above functions are compared and contrasted in the following table.

            Transforms at a time   Returns          Notes
manually    Single                 Single vectors
transform   Multiple               New data frame   Somewhat similar to within().
mutate      Multiple               New data frame   Works with each column sequentially and therefore can derive new columns using the columns created in earlier iterations.
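Base R's within() (noted in the table above) shares mutate's useful property that later expressions can use columns created earlier in the same call; transform() would fail here because logx does not yet exist when dlogx is computed. A minimal sketch with hypothetical data:

```r
# Hypothetical one-column data frame
dd <- data.frame(x = c(1, 10))

dd2 <- within(dd, {
  logx  <- log10(x)   # new column
  dlogx <- logx * 2   # uses the column created on the previous line
})
dd2$dlogx  # 0 2
```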
Scale (log) transform a single variable - Resp1
Via mutate
data.s %>% mutate(logResp1=log(Resp1))
  Between Plot Resp1 Resp2 logResp1
1      A1   P1     8    13 2.079442
2      A1   P2    10    22 2.302585
3      A2   P3     7    23 1.945910
4      A2   P4    11    22 2.397895
Via transform
transform(data.s,logResp1=log(Resp1))
  Between Plot Resp1 Resp2 logResp1
1      A1   P1     8    13 2.079442
2      A1   P2    10    22 2.302585
3      A2   P3     7    23 1.945910
4      A2   P4    11    22 2.397895
Scale (log) transform multiple variables (Resp1 and Resp2) and then calculate the difference
Via mutate
data.s %>% mutate(logResp1=log(Resp1), logResp2=log(Resp2), Diff=logResp2-logResp1)
  Between Plot Resp1 Resp2 logResp1 logResp2      Diff
1      A1   P1     8    13 2.079442 2.564949 0.4855078
2      A1   P2    10    22 2.302585 3.091042 0.7884574
3      A2   P3     7    23 1.945910 3.135494 1.1895841
4      A2   P4    11    22 2.397895 3.091042 0.6931472
Via transform
tmp <- transform(data.s,logResp1=log(Resp1), logResp2=log(Resp2))
transform(tmp, Diff=logResp2-logResp1)
  Between Plot Resp1 Resp2 logResp1 logResp2      Diff
1      A1   P1     8    13 2.079442 2.564949 0.4855078
2      A1   P2    10    22 2.302585 3.091042 0.7884574
3      A2   P3     7    23 1.945910 3.135494 1.1895841
4      A2   P4    11    22 2.397895 3.091042 0.6931472

Vectorized functions

Recall that any function that returns either a single value or a vector of length equal to the number of rows in the data can be used. The dplyr package also comes with a number of "vectorized" functions for performing common data manipulation routines:

    data.s %>% mutate(leadResp1=lead(Resp1), lagResp1=lag(Resp1))
    
      Between Plot Resp1 Resp2 leadResp1 lagResp1
    1      A1   P1     8    13        10       NA
    2      A1   P2    10    22         7        8
    3      A2   P3     7    23        11       10
    4      A2   P4    11    22        NA        7
    
    mutate(data.s, rankResp2=rank(Resp2), drnkResp2=dense_rank(Resp2), mrnkResp2=min_rank(Resp2),
        prnkResp2=percent_rank(Resp2))
    
      Between Plot Resp1 Resp2 rankResp2 drnkResp2 mrnkResp2 prnkResp2
    1      A1   P1     8    13       1.0         1         1 0.0000000
    2      A1   P2    10    22       2.5         2         2 0.3333333
    3      A2   P3     7    23       4.0         3         4 1.0000000
    4      A2   P4    11    22       2.5         2         2 0.3333333
    
    mutate(data.s, tileResp1=ntile(Resp1,3), btwnResp1=between(Resp1,9,12))
    
      Between Plot Resp1 Resp2 tileResp1 btwnResp1
    1      A1   P1     8    13         1     FALSE
    2      A1   P2    10    22         2      TRUE
    3      A2   P3     7    23         1     FALSE
    4      A2   P4    11    22         3      TRUE
    
    ## Cumulative distribution 
    data.s %>% mutate(csResp1=cumsum(Resp1), cdResp1=cume_dist(Resp1))
    
      Between Plot Resp1 Resp2 csResp1 cdResp1
    1      A1   P1     8    13       8    0.50
    2      A1   P2    10    22      18    0.75
    3      A2   P3     7    23      25    0.25
    4      A2   P4    11    22      36    1.00
    
    ## Cumulative predicates. 
    ## Accumulate only those that have a Resp1 > 9
    ## Accumulate only if all have a Resp1 > 7 
    data.s %>% mutate(csResp1=cumsum(Resp1), cnyResp1=cumany(Resp1>9), callResp1=cumall(Resp1>7))
    
      Between Plot Resp1 Resp2 csResp1 cnyResp1 callResp1
    1      A1   P1     8    13       8    FALSE      TRUE
    2      A1   P2    10    22      18     TRUE      TRUE
    3      A2   P3     7    23      25     TRUE     FALSE
    4      A2   P4    11    22      36     TRUE     FALSE
    
    ## The above window functions are useful for filtering
    data.s %>% filter(cumall(Resp1>7))
    
      Between Plot Resp1 Resp2
    1      A1   P1     8    13
    2      A1   P2    10    22
    
    ## Cumulative product and mean
    data.s %>% mutate(cpResp1=cumprod(Resp1), cmResp1=cummean(Resp1))
    
      Between Plot Resp1 Resp2 cpResp1  cmResp1
    1      A1   P1     8    13       8 8.000000
    2      A1   P2    10    22      80 9.000000
    3      A2   P3     7    23     560 8.333333
    4      A2   P4    11    22    6160 9.000000
    
    ## Cumulative min and max 
    data.s %>% mutate(cminResp1=cummin(Resp1), cmaxResp1=cummax(Resp1))
    
      Between Plot Resp1 Resp2 cminResp1 cmaxResp1
    1      A1   P1     8    13         8         8
    2      A1   P2    10    22         8        10
    3      A2   P3     7    23         7        10
    4      A2   P4    11    22         7        11
    

The extended mutate_ family

In addition to the mutate() function, the dplyr package also has a number of alternative mutate_ functions that provide convenient ways to select which columns to apply functions to.

  • mutate_all(), apply function(s) to all columns
  • mutate_at(), apply function(s) to selected columns
  • mutate_if(), apply function(s) to columns that satisfy a specific condition
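Of these, only mutate_at() and mutate_if() are demonstrated below. A minimal sketch of mutate_all(), using a small hypothetical data frame (not one of the workshop datasets) in which every column is numeric:

```r
library(dplyr)

## hypothetical data frame with only numeric columns
dat <- data.frame(Resp1 = c(8, 10), Resp2 = c(13, 22))

## mutate_all() applies the function to every column,
## here replacing each column with its log
dat %>% mutate_all(log)
```

Note that mutate_all() touches every column, so it is only appropriate when the function is sensible for all of them (here, all columns are numeric).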

Scale (log) transform multiple variables - Resp1 and Resp2
Via mutate
data.s %>% mutate(Resp1_log=log(Resp1),Resp2_log=log(Resp2))
#OR to account for any number of transformations
data.s %>% mutate_at(vars(Resp1,Resp2), funs(log=log))
#OR
data.s %>% mutate_at(vars(contains("Resp")), funs(log=log))
#OR
data.s %>% mutate_if(is.numeric, funs(log=log))
  Between Plot Resp1 Resp2 Resp1_log Resp2_log
1      A1   P1     8    13  2.079442  2.564949
2      A1   P2    10    22  2.302585  3.091042
3      A2   P3     7    23  1.945910  3.135494
4      A2   P4    11    22  2.397895  3.091042
Via transform
transform(data.s,Resp1_log=log(Resp1),Resp2_log=log(Resp2))
  Between Plot Resp1 Resp2 Resp1_log Resp2_log
1      A1   P1     8    13  2.079442  2.564949
2      A1   P2    10    22  2.302585  3.091042
3      A2   P3     7    23  1.945910  3.135494
4      A2   P4    11    22  2.397895  3.091042
Apply a function to convert all character vectors into lowercase
Via mutate
mutate_if(data.s, is.character, stringr::str_to_lower)
  Between Plot Resp1 Resp2
1      a1   p1     8    13
2      a1   p2    10    22
3      a2   p3     7    23
4      a2   p4    11    22
Applying multiple functions to multiple variables
Via mutate
mutate_at(data.s, vars(contains('Resp')), funs(log=log, c=scale(.,scale=FALSE)))
#OR
mutate_at(data.s, vars(Resp1,Resp2), funs(log=log, c=.-mean(.)))
  Between Plot Resp1 Resp2 Resp1_log Resp2_log Resp1_c Resp2_c
1      A1   P1     8    13  2.079442  2.564949      -1      -7
2      A1   P2    10    22  2.302585  3.091042       1       2
3      A2   P3     7    23  1.945910  3.135494      -2       3
4      A2   P4    11    22  2.397895  3.091042       2       2
Apply a summary function to multiple variables - Resp1 and Resp2
Via mutate
mutate(data.s,Resp1_mean=mean(Resp1, na.rm=TRUE),Resp2_mean=mean(Resp2, na.rm=TRUE))
#OR
mutate_at(data.s, vars(Resp1,Resp2),funs(mean=mean(.,na.rm=TRUE)))
  Between Plot Resp1 Resp2 Resp1_mean Resp2_mean
1      A1   P1     8    13          9         20
2      A1   P2    10    22          9         20
3      A2   P3     7    23          9         20
4      A2   P4    11    22          9         20
Via transform
transform(data.s,Resp1_mean=mean(Resp1, na.rm=TRUE),Resp2_mean=mean(Resp2, na.rm=TRUE))
  Between Plot Resp1 Resp2 Resp1_mean Resp2_mean
1      A1   P1     8    13          9         20
2      A1   P2    10    22          9         20
3      A2   P3     7    23          9         20
4      A2   P4    11    22          9         20

Summarizing (aggregating) data

Data are often collected and transcribed at finer temporal/spatial scales and with greater fidelity than is required for all analyses. Therefore, an important phase of data preparation is to summarize the data to the spatial/temporal scales appropriate for the desired graphical and statistical analyses.

Summarizing involves applying one or more summarizing functions to one or more variables.

An example of a data set (data)
Resp1 Resp2 Between Plot Subplot Within
1 8 17 A1 P1 S1 B1
2 10 18 A1 P1 S1 B2
3 7 17 A1 P1 S2 B1
4 11 21 A1 P1 S2 B2
5 14 19 A2 P2 S3 B1
6 12 13 A2 P2 S3 B2
7 11 24 A2 P2 S4 B1
8 9 18 A2 P2 S4 B2
9 14 25 A3 P3 S5 B1
10 11 18 A3 P3 S5 B2
11 8 27 A3 P3 S6 B1
12 2 22 A3 P3 S6 B2
13 8 17 A1 P4 S7 B1
14 10 22 A1 P4 S7 B2
15 7 16 A1 P4 S8 B1
16 12 13 A1 P4 S8 B2
17 11 23 A2 P5 S9 B1
18 12 19 A2 P5 S9 B2
19 12 23 A2 P5 S10 B1
20 10 21 A2 P5 S10 B2
21 3 17 A3 P6 S11 B1
22 11 16 A3 P6 S11 B2
23 13 26 A3 P6 S12 B1
24 7 28 A3 P6 S12 B2
View code
set.seed(1)
data <- expand.grid(Within=paste("B",1:2,sep=""),
                    Subplot=paste("S",1:2,sep=""),
                    Plot=paste("P",1:6,sep=""))
data$Subplot <- gl(12,2,24,lab=paste("S",1:12,sep=""))
data$Between <- gl(3,4,24,lab=paste("A",1:3,sep=""))
data$Resp1 <- rpois(24,10)
data$Resp2 <- rpois(24,20)
data <- with(data,data.frame(Resp1,Resp2,Between,Plot,Subplot,Within))
Calculate the mean Resp1
Via summarize
data %>% summarize(mean(Resp1))
  mean(Resp1)
1    9.708333
Via code
mean(data$Resp1)
[1] 9.708333

Note, unlike the base R solution (and consistent with all of the grammar of data manipulation verbs), the dplyr pathway always yields a data frame (actually a tibble). This is not only a data type that permits further manipulation; it is also a convenient format for producing tabular and graphical summaries.
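Because the result is itself a data frame, it can be piped straight into further verbs. A minimal sketch (hypothetical data, not one of the workshop datasets):

```r
library(dplyr)

## hypothetical grouped data
dat <- data.frame(Between = c("A1", "A1", "A2", "A2"),
                  Resp1   = c(8, 10, 12, 14))

## the summarized data frame flows straight into arrange()
dat %>%
  group_by(Between) %>%
  summarize(Mean = mean(Resp1)) %>%
  arrange(desc(Mean))
```

Had the summary been a plain numeric vector (as returned by mean() in base R), this kind of chaining would not be possible.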

Calculate the mean and standard deviation of Resp1
Via summarize
data %>% summarize(mean(Resp1), sd(Resp1))
  mean(Resp1) sd(Resp1)
1    9.708333   3.04287
Via code
mean(data$Resp1)
[1] 9.708333
sd(data$Resp1)
[1] 3.04287
Calculate a range of summary statistics
Via summarize
data %>% summarize(mean(Resp1), sd(Resp1), n(), n_distinct(Resp1), first(Resp1), min(Resp1), max(Resp1))
  mean(Resp1) sd(Resp1) n() n_distinct(Resp1) first(Resp1) min(Resp1) max(Resp1)
1    9.708333   3.04287  24                10            8          2         14
Calculate the mean of both Resp1 and Resp2
Via summarize
data %>% summarize(mean(Resp1), mean(Resp2))
  mean(Resp1) mean(Resp2)
1    9.708333          20
Via code
mean(data$Resp1)
[1] 9.708333
mean(data$Resp2)
[1] 20

The extended summary_ family

As with the mutate function, the dplyr package also has a number of alternative summarize_ functions that provide convenient ways to select which columns to apply functions to.

  • summarize_all(), apply function(s) to all columns
  • summarize_at(), apply function(s) to selected columns
  • summarize_if(), apply function(s) to columns that satisfy a specific condition
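Of these, summarize_all() is not demonstrated below. A minimal sketch (hypothetical data frame, not one of the workshop datasets): it applies the summary function to every column, so every column must be amenable to that function.

```r
library(dplyr)

## hypothetical data frame with only numeric columns
dat <- data.frame(Resp1 = c(8, 10), Resp2 = c(13, 22))

## apply mean() to every column
dat %>% summarize_all(mean)
```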

Calculate the mean of both Resp1 and Resp2
Via summarize
data %>% summarize_at(vars(Resp1,Resp2), mean)
     Resp1 Resp2
1 9.708333    20
#OR
data %>% summarize_if(is.numeric, mean)
     Resp1 Resp2
1 9.708333    20
Calculate the mean and standard deviation of both Resp1 and Resp2
Via summarize
data %>% summarize_at(vars(Resp1,Resp2), funs(mean,sd))
  Resp1_mean Resp2_mean Resp1_sd Resp2_sd
1   9.708333         20  3.04287 4.159849
#OR
data %>% summarize_if(is.numeric, funs(mean,sd))
  Resp1_mean Resp2_mean Resp1_sd Resp2_sd
1   9.708333         20  3.04287 4.159849

Grouping data

Base R has a family of apply functions that apply a function (such as mean()) to a continuous variable separately for:

  • apply(), each column or row.
    apply(data[,1:2], MARGIN=2, mean)
    
        Resp1     Resp2 
     9.708333 20.000000 
    
  • tapply(), each level of a categorical vector (factor)
    tapply(data$Resp1, data$Between, mean)
    
        A1     A2     A3 
     9.125 11.375  8.625 
    
The above apply functions involve a sort of split-apply-combine process. This process became the inspiration for the plyr package (also from Hadley Wickham), which built on and super-charged the process. The dplyr package represents the next iteration of the process, focused solely on data frames (the format in which most data are stored).

Central to the modern split/apply/combine process is the idea of groups. Groups are the basis of splitting the data. Functions applied to grouped data are applied to each group (subset) separately before the results are combined back into a single data frame (actually tibble). Hence grouped data are most powerful when combined with the summarize() or mutate() families of functions.

data %>% group_by(Between)
Source: local data frame [24 x 6]
Groups: Between [3]

   Resp1 Resp2 Between   Plot Subplot Within
   <int> <int>  <fctr> <fctr>  <fctr> <fctr>
1      8    17      A1     P1      S1     B1
2     10    18      A1     P1      S1     B2
3      7    17      A1     P1      S2     B1
4     11    21      A1     P1      S2     B2
5     14    19      A2     P2      S3     B1
6     12    13      A2     P2      S3     B2
7     11    24      A2     P2      S4     B1
8      9    18      A2     P2      S4     B2
9     14    25      A3     P3      S5     B1
10    11    18      A3     P3      S5     B2
# ... with 14 more rows

Applying summarize() to grouped data

Calculate the mean and variance for Resp1 for each level of Between

data %>% group_by(Between) %>% summarize(Mean=mean(Resp1), Var=var(Resp1))
# A tibble: 3 × 3
  Between   Mean       Var
   <fctr>  <dbl>     <dbl>
1      A1  9.125  3.553571
2      A2 11.375  2.267857
3      A3  8.625 19.696429

When groups are defined, summarize() and mutate() yield tibbles. A tibble (so named because many dplyr functions generate objects with a class of tbl_df, which is pronounced 'tibble diff') is a data.frame-like object that adheres to a stricter structure (compare the following two):
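The comparison below presupposes a numeric vector x and a character vector A. A sketch of how such objects might be created (hypothetical; the exact values printed below will differ):

```r
## hypothetical generation of the x and A vectors used in the comparison
set.seed(1)
x <- runif(10, 1, 10)                         # 10 random numbers on [1, 10]
A <- sample(c("a", "b"), 10, replace = TRUE)  # 10 random letters
```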

# generate data.frame
data.frame('(x)'=x, A)
       X.x. A
1  3.389585 b
2  7.378957 b
3  1.629561 b
4  4.778440 b
5  8.873564 a
6  5.530862 a
7  2.216495 b
8  1.581772 b
9  7.471264 b
10 5.295647 b
# generate tibble
data_frame('(x)'=x, A)
# A tibble: 10 × 2
      `(x)`     A
      <dbl> <chr>
1  3.389585     b
2  7.378957     b
3  1.629561     b
4  4.778440     b
5  8.873564     a
6  5.530862     a
7  2.216495     b
8  1.581772     b
9  7.471264     b
10 5.295647     b
# generate data.frame
data.frame('(x)'=x[1:2], A)
       X.x. A
1  3.389585 b
2  7.378957 b
3  3.389585 b
4  7.378957 b
5  3.389585 a
6  7.378957 a
7  3.389585 b
8  7.378957 b
9  3.389585 b
10 7.378957 b
# generate tibble
data_frame('(x)'=x[1:2], A)
Error: Variables must be length 1 or 10.
Problem variables: '(x)'
  • variables are never auto-coerced into specific data types (e.g. character vectors are not coerced into factors)
  • there are no row names
  • variable names are never altered (e.g. when special characters are included in names)
  • only objects of length 1 are recycled

Calculate the mean and variance for Resp1 for each Between and Within combination of levels

data %>% group_by(Between,Within) %>% summarize(Mean=mean(Resp1), Var=var(Resp1))
Source: local data frame [6 x 4]
Groups: Between [?]

  Between Within  Mean        Var
   <fctr> <fctr> <dbl>      <dbl>
1      A1     B1  7.50  0.3333333
2      A1     B2 10.75  0.9166667
3      A2     B1 12.00  2.0000000
4      A2     B2 10.75  2.2500000
5      A3     B1  9.50 25.6666667
6      A3     B2  7.75 18.2500000

Calculate the mean and variance for Resp1 and Resp2 for each Between and Within combination of levels

data %>% group_by(Between,Within) %>% summarize_at(vars(Resp1,Resp2), funs(mean,var))
Source: local data frame [6 x 6]
Groups: Between [?]

  Between Within Resp1_mean Resp2_mean  Resp1_var Resp2_var
   <fctr> <fctr>      <dbl>      <dbl>      <dbl>     <dbl>
1      A1     B1       7.50      16.75  0.3333333  0.250000
2      A1     B2      10.75      18.50  0.9166667 16.333333
3      A2     B1      12.00      22.25  2.0000000  4.916667
4      A2     B2      10.75      17.75  2.2500000 11.583333
5      A3     B1       9.50      23.75 25.6666667 20.916667
6      A3     B2       7.75      21.00 18.2500000 28.000000
* Note, arguably a better way to perform the above is to first gather the data so that they are as long as possible... We will revisit this manipulation once we have explored gathering data.

Hierarchical aggregations

It is useful to be able to aggregate the data to a different level of replication. Two common reasons are:
  • exploratory data analysis, such as assessing normality and homogeneity of variance on residuals approximated at the appropriate level of replication for a given test
  • calculating observations (typically means of subreplicates) appropriate for the desired scale of the analysis

Continuing on with the same base dataset from the previous examples, we may wish to aggregate to the level of Plots or even Subplots.

Let's say that we wished to explore the distributional characteristics of Resp1 and Resp2 within each of the Between levels. The data data.frame is set up to resemble a typical hierarchical design. Notice that:

  • a given Plot can only be of one Between level.
  • within each Plot level there are each of the levels of Within.
  • in the context of the design, the Plots are the replicates of the Between factor.
Therefore, to explore the distributional characteristics of the Between factor, we need to aggregate the data to the level of Plot. That is, we need to calculate the mean for each Plot.

Generate Plot means for each of Resp1 and Resp2

data %>% group_by(Plot) %>% summarize_at(vars(Resp1,Resp2), mean)
# A tibble: 6 × 3
    Plot Resp1 Resp2
  <fctr> <dbl> <dbl>
1     P1  9.00 18.25
2     P2 11.50 18.50
3     P3  8.75 23.00
4     P4  9.25 17.00
5     P5 11.25 21.50
6     P6  8.50 21.75

The above is of limited value, however, as the Between variable is missing... We can address this by including it as a grouping variable.
Generate Between/Plot means for each of Resp1 and Resp2

data %>% group_by(Between,Plot) %>% summarize_at(vars(Resp1,Resp2), mean)
Source: local data frame [6 x 4]
Groups: Between [?]

  Between   Plot Resp1 Resp2
   <fctr> <fctr> <dbl> <dbl>
1      A1     P1  9.00 18.25
2      A1     P4  9.25 17.00
3      A2     P2 11.50 18.50
4      A2     P5 11.25 21.50
5      A3     P3  8.75 23.00
6      A3     P6  8.50 21.75
Now it is clear that there are two Plots per level of Between.

The data design also includes a Subplot hierarchical level. These Subplots sit under Plot in the hierarchy.
Generate Between/Plot/Subplot means for each of Resp1 and Resp2

data %>% group_by(Between,Plot,Subplot) %>% summarize_at(vars(Resp1,Resp2), mean)
Source: local data frame [12 x 5]
Groups: Between, Plot [?]

   Between   Plot Subplot Resp1 Resp2
    <fctr> <fctr>  <fctr> <dbl> <dbl>
1       A1     P1      S1   9.0  17.5
2       A1     P1      S2   9.0  19.0
3       A1     P4      S7   9.0  19.5
4       A1     P4      S8   9.5  14.5
5       A2     P2      S3  13.0  16.0
6       A2     P2      S4  10.0  21.0
7       A2     P5      S9  11.5  21.0
8       A2     P5     S10  11.0  22.0
9       A3     P3      S5  12.5  21.5
10      A3     P3      S6   5.0  24.5
11      A3     P6     S11   7.0  16.5
12      A3     P6     S12  10.0  27.0

Applying mutate() to grouped data

Grouped data are also useful in the mutate context. For example, we may wish to center our data separately within each Plot.
Center Resp1 and Resp2 within each plot

data %>% group_by(Plot) %>% mutate_at(vars(Resp1,Resp2), funs(c=scale(.,scale=FALSE)))
Source: local data frame [24 x 8]
Groups: Plot [6]

   Resp1 Resp2 Between   Plot Subplot Within Resp1_c Resp2_c
   <int> <int>  <fctr> <fctr>  <fctr> <fctr>   <dbl>   <dbl>
1      8    17      A1     P1      S1     B1   -1.00   -1.25
2     10    18      A1     P1      S1     B2    1.00   -0.25
3      7    17      A1     P1      S2     B1   -2.00   -1.25
4     11    21      A1     P1      S2     B2    2.00    2.75
5     14    19      A2     P2      S3     B1    2.50    0.50
6     12    13      A2     P2      S3     B2    0.50   -5.50
7     11    24      A2     P2      S4     B1   -0.50    5.50
8      9    18      A2     P2      S4     B2   -2.50   -0.50
9     14    25      A3     P3      S5     B1    5.25    2.00
10    11    18      A3     P3      S5     B2    2.25   -5.00
# ... with 14 more rows
#OR
data %>% group_by(Plot) %>% mutate_at(vars(Resp1,Resp2), funs(c=. - mean(.)))
Source: local data frame [24 x 8]
Groups: Plot [6]

   Resp1 Resp2 Between   Plot Subplot Within Resp1_c Resp2_c
   <int> <int>  <fctr> <fctr>  <fctr> <fctr>   <dbl>   <dbl>
1      8    17      A1     P1      S1     B1   -1.00   -1.25
2     10    18      A1     P1      S1     B2    1.00   -0.25
3      7    17      A1     P1      S2     B1   -2.00   -1.25
4     11    21      A1     P1      S2     B2    2.00    2.75
5     14    19      A2     P2      S3     B1    2.50    0.50
6     12    13      A2     P2      S3     B2    0.50   -5.50
7     11    24      A2     P2      S4     B1   -0.50    5.50
8      9    18      A2     P2      S4     B2   -2.50   -0.50
9     14    25      A3     P3      S5     B1    5.25    2.00
10    11    18      A3     P3      S5     B2    2.25   -5.00
# ... with 14 more rows
In the above two alternatives, note the '.' - this means 'the current variable'. Hence, each of Resp1 and Resp2 will be substituted in place of '.' during calculations. Also note that the print method for tibbles truncates the output so that it displays nicely on the console. Stored versions of such objects are complete.

Again, these examples could arguably be better suited to gathering first.

Center a single variable (Resp2) within each level of the Between variable
Via ddply and transform or mutate
plyr::ddply(data.s,~Between,transform,cResp2=Resp2-mean(Resp2))
  Between Plot Resp1 Resp2 cResp2
1      A1   P1     8    13   -4.5
2      A1   P2    10    22    4.5
3      A2   P3     7    23    0.5
4      A2   P4    11    22   -0.5
#OR
plyr::ddply(data.s,~Between,mutate,cResp2=Resp2-mean(Resp2))
  Between Plot Resp1 Resp2 cResp2
1      A1   P1     8    13   -4.5
2      A1   P2    10    22    4.5
3      A2   P3     7    23    0.5
4      A2   P4    11    22   -0.5
Difference between centered Resp1 and centered Resp2 within Between (a derived variable computed from other derived variables)
Via ddply and mutate
plyr::ddply(data.s,~Between,mutate,cResp1=Resp1-mean(Resp1),cResp2=Resp2-mean(Resp2),cDiff=cResp1-cResp2)
  Between Plot Resp1 Resp2 cResp1 cResp2 cDiff
1      A1   P1     8    13     -1   -4.5   3.5
2      A1   P2    10    22      1    4.5  -3.5
3      A2   P3     7    23     -2    0.5  -2.5
4      A2   P4    11    22      2   -0.5   2.5
Scale (log) transformations of Resp1 and Resp2 within Between
Via ddply and mutate
plyr::ddply(data.s,~Between,mutate,sResp1=Resp1/max(Resp1),sResp2=Resp2/max(Resp2),logsResp1=log(sResp1),logsResp2=log(sResp2))
  Between Plot Resp1 Resp2    sResp1    sResp2  logsResp1  logsResp2
1      A1   P1     8    13 0.8000000 0.5909091 -0.2231436 -0.5260931
2      A1   P2    10    22 1.0000000 1.0000000  0.0000000  0.0000000
3      A2   P3     7    23 0.6363636 1.0000000 -0.4519851  0.0000000
4      A2   P4    11    22 1.0000000 0.9565217  0.0000000 -0.0444518

Merging (joining) data sets

It is common to have data associated with a particular study organized into a number of separate data tables (databases etc). In fact, large data sets are best managed in databases. However, statistical analyses generally require all data to be encapsulated within a single data structure. Therefore, prior to analysis, it is necessary to bring together multiple sources.

This phase of data preparation can be one of the most difficult to get right and verify.

Merging (or joining) involves creating a new data set that comprises information from two data sets. The resulting joined data set contains all fields from both data sets. The data sets are aligned according to fields they have in common: matching records in these common fields are used to select a row from each input data set to be combined.

There are numerous alternative ways of defining what should happen in the event that common records do not occur in both sets. For example, we could specify that only fully matching records be included in the final data set. Alternatively, we could specify that all records be included from both sets and thus the resulting data set will contain missing values. The following describe these various options.

  • left join
    • return all rows and columns from the left data set
    • return all columns from the right data set
    • new columns for unmatched rows from the right data set receive NA values
    • when there are multiple matches, all combinations are included
  • right join
    • return all rows and columns from the right data set
    • return all columns from the left data set
    • new columns for unmatched rows from the left data set receive NA values
    • when there are multiple matches, all combinations are included
  • inner join
    • return all columns from the left and right data sets
    • return only the rows that match between the left and right data sets
    • when there are multiple matches, all combinations are included
  • semi join
    • return all rows from the left data set that have a match in the right data set
    • keep only the columns from the left data set
  • anti join
    • return only the rows from the left data set that do not have a match in the right data set
    • keep only the columns from the left data set
  • full join
    • return all rows and columns from both the left and right data sets
    • unmatched rows from either data set receive NA values in the associated new columns
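The semi and anti joins are not demonstrated in the examples that follow, yet they are particularly useful for verifying a merge before committing to it. A minimal sketch with hypothetical mini data sets sharing a Subplot key:

```r
library(dplyr)

## hypothetical mini data sets sharing a Subplot key
bio  <- data.frame(Subplot = c("S1", "S2", "S4"), Resp1 = c(8, 10, 11))
chem <- data.frame(Subplot = c("S1", "S2", "S3"), Chem1 = c(1.4, 3.3, 1.2))

## rows of bio that have a chem match (columns of bio only)
semi_join(bio, chem, by = "Subplot")

## rows of bio with no chem match - handy for spotting records
## that would silently drop out of an inner join
anti_join(bio, chem, by = "Subplot")
```

Running anti_join() in both directions before an inner join is a quick way to see exactly which records will be lost.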

Biological data set (missing Subplot 3)
(data.bio)
Resp1 Resp2 Between Plot Subplot
1 8 18 A1 P1 S1
2 10 21 A1 P1 S2
4 11 23 A1 P2 S4
5 14 22 A2 P3 S5
6 12 24 A2 P3 S6
7 11 23 A2 P4 S7
8 9 20 A2 P4 S8
9 14 11 A3 P5 S9
10 11 22 A3 P5 S10
11 8 24 A3 P6 S11
12 2 16 A3 P6 S12
View code
set.seed(1)
data.bio <- expand.grid(Subplot=paste("S",1:2,sep=""),Plot=paste("P",1:6,sep=""))
data.bio$Subplot <- gl(12,1,12,lab=paste("S",1:12,sep=""))
data.bio$Between <- gl(3,4,12,lab=paste("A",1:3,sep=""))
data.bio$Resp1 <- rpois(12,10)
data.bio$Resp2 <- rpois(12,20)
data.bio <- with(data.bio,data.frame(Resp1,Resp2,Between,Plot,Subplot))
data.bio<-data.bio[-3,]
Physio-chemical data (missing Subplot 7)
(data.chem)
Chem1 Chem2 Between Plot Subplot
1 1.45 0.89 A1 P1 S1
2 3.27 0.18 A1 P1 S2
3 1.18 5.08 A1 P2 S3
4 13.40 1.58 A1 P2 S4
5 3.78 1.62 A2 P3 S5
6 1.20 4.24 A2 P3 S6
8 5.69 2.99 A2 P4 S8
9 4.83 4.13 A3 P5 S9
10 2.00 3.60 A3 P5 S10
11 12.33 1.78 A3 P6 S11
12 4.01 0.23 A3 P6 S12
View code
set.seed(1)
data.chem <- expand.grid(Subplot=paste("S",1:2,sep=""),Plot=paste("P",1:6,sep=""))
data.chem$Subplot <- gl(12,1,12,lab=paste("S",1:12,sep=""))
data.chem$Between <- gl(3,4,12,lab=paste("A",1:3,sep=""))
data.chem$Chem1 <- rlnorm(12,1)
data.chem$Chem2 <- rlnorm(12,.5)
data.chem <- with(data.chem,data.frame(Chem1,Chem2,Between,Plot,Subplot))
data.chem<-data.chem[-7,]
Join bio and chem data (only keep full matches - an inner join)
Note how both Subplots 3 and 7 are absent.
Via inner_join (package:dplyr)
inner_join(data.bio,data.chem,by=c("Between","Plot","Subplot"))
   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     11    23      A1   P2      S4 13.400350 1.5762780
4     14    22      A2   P3      S5  3.779183 1.6222430
5     12    24      A2   P3      S6  1.196657 4.2369184
6      9    20      A2   P4      S8  5.687807 2.9859003
7     14    11      A3   P5      S9  4.834518 4.1328919
8     11    22      A3   P5     S10  2.002931 3.6043314
9      8    24      A3   P6     S11 12.326867 1.7763576
10     2    16      A3   P6     S12  4.014221 0.2255188
Via merge
merge(data.bio,data.chem,by=c("Between","Plot","Subplot"))
   Between Plot Subplot Resp1 Resp2     Chem1     Chem2
1       A1   P1      S1     8    18  1.452878 0.8858208
2       A1   P1      S2    10    21  3.266253 0.1800177
3       A1   P2      S4    11    23 13.400350 1.5762780
4       A2   P3      S5    14    22  3.779183 1.6222430
5       A2   P3      S6    12    24  1.196657 4.2369184
6       A2   P4      S8     9    20  5.687807 2.9859003
7       A3   P5     S10    11    22  2.002931 3.6043314
8       A3   P5      S9    14    11  4.834518 4.1328919
9       A3   P6     S11     8    24 12.326867 1.7763576
10      A3   P6     S12     2    16  4.014221 0.2255188
Via join (package:plyr)
plyr::join(data.bio,data.chem,by=c("Between","Plot","Subplot"), type="inner")
   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     11    23      A1   P2      S4 13.400350 1.5762780
4     14    22      A2   P3      S5  3.779183 1.6222430
5     12    24      A2   P3      S6  1.196657 4.2369184
6      9    20      A2   P4      S8  5.687807 2.9859003
7     14    11      A3   P5      S9  4.834518 4.1328919
8     11    22      A3   P5     S10  2.002931 3.6043314
9      8    24      A3   P6     S11 12.326867 1.7763576
10     2    16      A3   P6     S12  4.014221 0.2255188
Merge bio and chem data (keep all data - full or outer join)
Note that all Subplots are present, yet there are missing cells that correspond to the missing cases from the BIO and CHEM data sets.
Via full_join (package:dplyr)
full_join(data.bio,data.chem,by=c("Between","Plot","Subplot"))
   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     11    23      A1   P2      S4 13.400350 1.5762780
4     14    22      A2   P3      S5  3.779183 1.6222430
5     12    24      A2   P3      S6  1.196657 4.2369184
6     11    23      A2   P4      S7        NA        NA
7      9    20      A2   P4      S8  5.687807 2.9859003
8     14    11      A3   P5      S9  4.834518 4.1328919
9     11    22      A3   P5     S10  2.002931 3.6043314
10     8    24      A3   P6     S11 12.326867 1.7763576
11     2    16      A3   P6     S12  4.014221 0.2255188
12    NA    NA      A1   P2      S3  1.178652 5.0780682
Via merge
merge(data.bio,data.chem,by=c("Between","Plot","Subplot"),all=T)
   Between Plot Subplot Resp1 Resp2     Chem1     Chem2
1       A1   P1      S1     8    18  1.452878 0.8858208
2       A1   P1      S2    10    21  3.266253 0.1800177
3       A1   P2      S3    NA    NA  1.178652 5.0780682
4       A1   P2      S4    11    23 13.400350 1.5762780
5       A2   P3      S5    14    22  3.779183 1.6222430
6       A2   P3      S6    12    24  1.196657 4.2369184
7       A2   P4      S7    11    23        NA        NA
8       A2   P4      S8     9    20  5.687807 2.9859003
9       A3   P5      S9    14    11  4.834518 4.1328919
10      A3   P5     S10    11    22  2.002931 3.6043314
11      A3   P6     S11     8    24 12.326867 1.7763576
12      A3   P6     S12     2    16  4.014221 0.2255188
Via join (package:plyr)
plyr::join(data.bio,data.chem,by=c("Between","Plot","Subplot"), type="full")
   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     11    23      A1   P2      S4 13.400350 1.5762780
4     14    22      A2   P3      S5  3.779183 1.6222430
5     12    24      A2   P3      S6  1.196657 4.2369184
6     11    23      A2   P4      S7        NA        NA
7      9    20      A2   P4      S8  5.687807 2.9859003
8     14    11      A3   P5      S9  4.834518 4.1328919
9     11    22      A3   P5     S10  2.002931 3.6043314
10     8    24      A3   P6     S11 12.326867 1.7763576
11     2    16      A3   P6     S12  4.014221 0.2255188
12    NA    NA      A1   P2      S3  1.178652 5.0780682
Merge bio and chem data (only keep full BIO matches - left join)
Note how Subplot 3 is absent and Subplot 7 (which was missing from the CHEM data set) is missing CHEM data.
Via left_join (package:dplyr)
left_join(data.bio,data.chem,by=c("Between","Plot","Subplot"))
   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     11    23      A1   P2      S4 13.400350 1.5762780
4     14    22      A2   P3      S5  3.779183 1.6222430
5     12    24      A2   P3      S6  1.196657 4.2369184
6     11    23      A2   P4      S7        NA        NA
7      9    20      A2   P4      S8  5.687807 2.9859003
8     14    11      A3   P5      S9  4.834518 4.1328919
9     11    22      A3   P5     S10  2.002931 3.6043314
10     8    24      A3   P6     S11 12.326867 1.7763576
11     2    16      A3   P6     S12  4.014221 0.2255188
Via merge
merge(data.bio,data.chem,by=c("Between","Plot","Subplot"),all.x=T)
   Between Plot Subplot Resp1 Resp2     Chem1     Chem2
1       A1   P1      S1     8    18  1.452878 0.8858208
2       A1   P1      S2    10    21  3.266253 0.1800177
3       A1   P2      S4    11    23 13.400350 1.5762780
4       A2   P3      S5    14    22  3.779183 1.6222430
5       A2   P3      S6    12    24  1.196657 4.2369184
6       A2   P4      S7    11    23        NA        NA
7       A2   P4      S8     9    20  5.687807 2.9859003
8       A3   P5      S9    14    11  4.834518 4.1328919
9       A3   P5     S10    11    22  2.002931 3.6043314
10      A3   P6     S11     8    24 12.326867 1.7763576
11      A3   P6     S12     2    16  4.014221 0.2255188
Via join (package:plyr)
plyr::join(data.bio,data.chem,by=c("Between","Plot","Subplot"), type="left")
   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     11    23      A1   P2      S4 13.400350 1.5762780
4     14    22      A2   P3      S5  3.779183 1.6222430
5     12    24      A2   P3      S6  1.196657 4.2369184
6     11    23      A2   P4      S7        NA        NA
7      9    20      A2   P4      S8  5.687807 2.9859003
8     14    11      A3   P5      S9  4.834518 4.1328919
9     11    22      A3   P5     S10  2.002931 3.6043314
10     8    24      A3   P6     S11 12.326867 1.7763576
11     2    16      A3   P6     S12  4.014221 0.2255188
Merge bio and chem data (only keep full CHEM matches - right join)
Note how Subplot 7 is absent and Subplot 3 (which was missing from the BIO data set) is missing BIO data.
Via right_join (package:dplyr)
right_join(data.bio,data.chem,by=c("Between","Plot","Subplot"))
   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     NA    NA      A1   P2      S3  1.178652 5.0780682
4     11    23      A1   P2      S4 13.400350 1.5762780
5     14    22      A2   P3      S5  3.779183 1.6222430
6     12    24      A2   P3      S6  1.196657 4.2369184
7      9    20      A2   P4      S8  5.687807 2.9859003
8     14    11      A3   P5      S9  4.834518 4.1328919
9     11    22      A3   P5     S10  2.002931 3.6043314
10     8    24      A3   P6     S11 12.326867 1.7763576
11     2    16      A3   P6     S12  4.014221 0.2255188
Via merge
merge(data.bio,data.chem,by=c("Between","Plot","Subplot"),all.y=T)
   Between Plot Subplot Resp1 Resp2     Chem1     Chem2
1       A1   P1      S1     8    18  1.452878 0.8858208
2       A1   P1      S2    10    21  3.266253 0.1800177
3       A1   P2      S3    NA    NA  1.178652 5.0780682
4       A1   P2      S4    11    23 13.400350 1.5762780
5       A2   P3      S5    14    22  3.779183 1.6222430
6       A2   P3      S6    12    24  1.196657 4.2369184
7       A2   P4      S8     9    20  5.687807 2.9859003
8       A3   P5      S9    14    11  4.834518 4.1328919
9       A3   P5     S10    11    22  2.002931 3.6043314
10      A3   P6     S11     8    24 12.326867 1.7763576
11      A3   P6     S12     2    16  4.014221 0.2255188
Via join (package:plyr)
plyr::join(data.bio,data.chem,by=c("Between","Plot","Subplot"), type="right")
   Between Plot Subplot Resp1 Resp2     Chem1     Chem2
1       A1   P1      S1     8    18  1.452878 0.8858208
2       A1   P1      S2    10    21  3.266253 0.1800177
3       A1   P2      S3    NA    NA  1.178652 5.0780682
4       A1   P2      S4    11    23 13.400350 1.5762780
5       A2   P3      S5    14    22  3.779183 1.6222430
6       A2   P3      S6    12    24  1.196657 4.2369184
7       A2   P4      S8     9    20  5.687807 2.9859003
8       A3   P5      S9    14    11  4.834518 4.1328919
9       A3   P5     S10    11    22  2.002931 3.6043314
10      A3   P6     S11     8    24 12.326867 1.7763576
11      A3   P6     S12     2    16  4.014221 0.2255188

VLOOKUP in R

Lookup tables provide a way of inserting a column of data into a large data set such that the entries in the new column are determined by a relational match within another data set (the lookup table). For example, the main data set might contain data collected from a number of sites (Plots). Elsewhere we may have a data set that just contains the set of sites and their corresponding latitudes and longitudes (a geographical lookup table). We could incorporate these latitudes and longitudes into the main data set by merging against the geographical lookup table. In Excel this is referred to as a VLOOKUP, in a relational database it is referred to as a join, and in R it is a merge.

Biological data set (data.bio1)
Resp1 Resp2 Between Plot Subplot
1 8 18 A1 P1 S1
2 10 21 A1 P1 S2
3 7 16 A1 P2 S3
4 11 23 A1 P2 S4
5 14 22 A2 P3 S5
6 12 24 A2 P3 S6
7 11 23 A2 P4 S7
8 9 20 A2 P4 S8
9 14 11 A3 P5 S9
10 11 22 A3 P5 S10
11 8 24 A3 P6 S11
12 2 16 A3 P6 S12
View code
set.seed(1)
data.bio1<- expand.grid(Subplot=paste("S",1:2,sep=""),Plot=paste("P",1:6,sep=""))
data.bio1$Subplot <- gl(12,1,12,lab=paste("S",1:12,sep=""))
data.bio1$Between <- gl(3,4,12,lab=paste("A",1:3,sep=""))
data.bio1$Resp1 <- rpois(12,10)
data.bio1$Resp2 <- rpois(12,20)
data.bio1<- with(data.bio1,data.frame(Resp1,Resp2,Between,Plot,Subplot))
Geographical data set (lookup table) (data.geo)
Plot LAT LONG
1 P1 17.96 145.43
2 P2 17.52 146.20
3 P3 17.00 146.38
4 P4 18.23 146.79
5 P5 18.98 146.03
View code
set.seed(1)
data.geo <- expand.grid(Plot=paste("P",1:5,sep=""))
data.geo$LAT<-c(17.9605,17.5210,17.0011,18.235,18.9840)
data.geo$LONG<-c(145.4326,146.1983,146.3839,146.7934,146.0345)
Incorporate (merge) the lat/longs into the bio data
Note how the latitudes and longitudes are duplicated so as to match the Plot listings.
Via left_join (package:dplyr)
left_join(data.bio1,data.geo,by="Plot")
   Resp1 Resp2 Between Plot Subplot     LAT     LONG
1      8    18      A1   P1      S1 17.9605 145.4326
2     10    21      A1   P1      S2 17.9605 145.4326
3      7    16      A1   P2      S3 17.5210 146.1983
4     11    23      A1   P2      S4 17.5210 146.1983
5     14    22      A2   P3      S5 17.0011 146.3839
6     12    24      A2   P3      S6 17.0011 146.3839
7     11    23      A2   P4      S7 18.2350 146.7934
8      9    20      A2   P4      S8 18.2350 146.7934
9     14    11      A3   P5      S9 18.9840 146.0345
10    11    22      A3   P5     S10 18.9840 146.0345
11     8    24      A3   P6     S11      NA       NA
12     2    16      A3   P6     S12      NA       NA
Via merge
merge(data.bio1,data.geo,by="Plot", all.x=TRUE)
   Plot Resp1 Resp2 Between Subplot     LAT     LONG
1    P1     8    18      A1      S1 17.9605 145.4326
2    P1    10    21      A1      S2 17.9605 145.4326
3    P2     7    16      A1      S3 17.5210 146.1983
4    P2    11    23      A1      S4 17.5210 146.1983
5    P3    14    22      A2      S5 17.0011 146.3839
6    P3    12    24      A2      S6 17.0011 146.3839
7    P4    11    23      A2      S7 18.2350 146.7934
8    P4     9    20      A2      S8 18.2350 146.7934
9    P5    14    11      A3      S9 18.9840 146.0345
10   P5    11    22      A3     S10 18.9840 146.0345
11   P6     8    24      A3     S11      NA       NA
12   P6     2    16      A3     S12      NA       NA
Via join (package:plyr)
plyr::join(data.bio1,data.geo,by="Plot", type='left')
   Resp1 Resp2 Between Plot Subplot     LAT     LONG
1      8    18      A1   P1      S1 17.9605 145.4326
2     10    21      A1   P1      S2 17.9605 145.4326
3      7    16      A1   P2      S3 17.5210 146.1983
4     11    23      A1   P2      S4 17.5210 146.1983
5     14    22      A2   P3      S5 17.0011 146.3839
6     12    24      A2   P3      S6 17.0011 146.3839
7     11    23      A2   P4      S7 18.2350 146.7934
8      9    20      A2   P4      S8 18.2350 146.7934
9     14    11      A3   P5      S9 18.9840 146.0345
10    11    22      A3   P5     S10 18.9840 146.0345
11     8    24      A3   P6     S11      NA       NA
12     2    16      A3   P6     S12      NA       NA

It might also be interesting to explore which records in the Biological data set did not have matching records in the Geographical data set.

Determine the records from bio data that are not matched with lat/longs
Via anti_join (package:dplyr)
anti_join(data.bio1,data.geo,by="Plot")
  Resp1 Resp2 Between Plot Subplot
1     8    24      A3   P6     S11
2     2    16      A3   P6     S12

Worked examples

Data manipulation of NASA temperature records

As part of their 2006 Data Expo, the ASA offered up a data set from the NASA Langley Research Center Atmospheric Sciences Data Center, along with the intentionally vague challenge to participants to provide a graphical summary of the important features of the data set. The data set comprises geographic and atmospheric measures on a very coarse 24x24 grid (Central America) with observations from Jan 1995 to Dec 2000. These data are included in the dplyr package as an example of a tbl_cube (a compact data storage format for fully factorial designs).

Format of data
lat long month year cloudhigh cloudlow cloudmid ozone pressure surftemp temperature
1 36.20 -113.80 1 1995 26.00 7.50 34.50 304.00 835.00 272.70 272.10
2 33.70 -113.80 1 1995 20.00 11.50 32.50 304.00 940.00 279.50 282.20
3 31.21 -113.80 1 1995 16.00 16.50 26.00 298.00 960.00 284.70 285.20
4 28.71 -113.80 1 1995 13.00 20.50 14.50 276.00 990.00 289.30 290.70
5 26.22 -113.80 1 1995 7.50 26.00 10.50 274.00 1000.00 292.20 292.70
6 23.72 -113.80 1 1995 8.00 30.00 9.50 264.00 1000.00 294.10 293.60

lat, long: latitude and longitude
year, month: year and month
cloudlow, cloudmid, cloudhigh: cloud cover at three heights (monthly averages)
ozone: monthly average ozone level
surftemp, temperature: monthly average temperature (in Kelvin)
pressure: monthly average air pressure

Start by converting the tbl_cube into a data frame.

library(dplyr)
nasa = as.data.frame(nasa)
glimpse(nasa)
Observations: 41,472
Variables: 11
$ lat         <dbl> 36.200000, 33.704348, 31.208696, 28.713043, 26.217391, 23.721739, 21.226087, 18.730435, 16.2347...
$ long        <dbl> -113.8000, -113.8000, -113.8000, -113.8000, -113.8000, -113.8000, -113.8000, -113.8000, -113.80...
$ month       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ year        <int> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995,...
$ cloudhigh   <dbl> 26.0, 20.0, 16.0, 13.0, 7.5, 8.0, 14.5, 19.5, 22.5, 21.0, 19.0, 16.5, 17.5, 5.5, 0.5, 0.0, 0.5,...
$ cloudlow    <dbl> 7.5, 11.5, 16.5, 20.5, 26.0, 30.0, 29.5, 26.5, 27.5, 26.0, 28.5, 28.0, 33.0, 44.5, 43.5, 37.0, ...
$ cloudmid    <dbl> 34.5, 32.5, 26.0, 14.5, 10.5, 9.5, 11.0, 17.5, 18.5, 16.5, 12.5, 13.5, 18.5, 13.0, 4.0, 1.0, 1....
$ ozone       <dbl> 304, 304, 298, 276, 274, 264, 258, 252, 250, 250, 248, 248, 250, 248, 248, 248, 248, 250, 252, ...
$ pressure    <dbl> 835, 940, 960, 990, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 100...
$ surftemp    <dbl> 272.7, 279.5, 284.7, 289.3, 292.2, 294.1, 295.0, 298.3, 300.1, 300.1, 301.0, 301.0, 299.2, 298....
$ temperature <dbl> 272.1, 282.2, 285.2, 290.7, 292.7, 293.6, 294.6, 296.9, 297.8, 298.7, 300.1, 300.1, 300.5, 299....

  1. This is a reasonably large data set (41,472 records). Let's start by generating a subset of the data:
    1. Select only lat, long and the cloud-related columns (in the interest of keeping the output brief, perhaps just display the first few records).
      Show code
      nasa %>% select(lat,long, contains('cloud')) %>% head
      
             lat   long cloudhigh cloudlow cloudmid
      1 36.20000 -113.8      26.0      7.5     34.5
      2 33.70435 -113.8      20.0     11.5     32.5
      3 31.20870 -113.8      16.0     16.5     26.0
      4 28.71304 -113.8      13.0     20.5     14.5
      5 26.21739 -113.8       7.5     26.0     10.5
      6 23.72174 -113.8       8.0     30.0      9.5
      
    2. Select only the most recent year's data (again, just display the first few records).
      Show code
      nasa %>% filter(year==last(year)) %>% head
      
             lat   long month year cloudhigh cloudlow cloudmid ozone pressure surftemp temperature
      1 36.20000 -113.8     1 2000      20.5     11.0     28.0   320      955    277.3       274.4
      2 33.70435 -113.8     1 2000      10.0     13.5     24.0   322      975    283.7       284.7
      3 31.20870 -113.8     1 2000       8.5     17.0     19.0   306      980    287.8       289.3
      4 28.71304 -113.8     1 2000       5.5     18.0     14.5   298      990    289.8       293.2
      5 26.21739 -113.8     1 2000       4.0     18.0     10.5   286      995    291.7       293.6
      6 23.72174 -113.8     1 2000       4.0     20.5      8.0   280     1000    292.7       294.6
      
    3. Select only those records for which temperature was above 309 Kelvin and sort these records from highest to lowest temperature.
      Show code
      nasa %>% filter(temperature>309) %>% arrange(desc(temperature))
      
             lat      long month year cloudhigh cloudlow cloudmid ozone pressure surftemp temperature
      1 33.70435 -98.77391     7 1998       2.5     12.0      5.0   294     1000    309.6       310.0
      2 33.70435 -96.26957     7 1998       2.5     12.0      5.0   294     1000    309.6       310.0
      3 28.71304 -98.77391     7 1998       4.0     17.5      6.0   294      990    309.6       309.6
      4 33.70435 -98.77391     8 1999       4.5     15.5      7.0   290      990    307.5       309.6
      5 33.70435 -96.26957     8 1999       4.5     15.5      7.0   290      990    307.5       309.6
      6 26.21739 -98.77391     6 1998       9.5     21.5     12.5   286      995    309.1       309.1
      7 31.20870 -96.26957     7 1998       5.5     13.5      7.5   292     1000    307.5       309.1
      8 31.20870 -93.76522     7 1998       5.5     13.5      7.5   292     1000    307.5       309.1
      
  2. We might be interested in exploring whether temperature has changed over time. Whilst the time span is relatively short, and there will obviously be seasonal and daily fluctuations, it might still be interesting to calculate annual mean temperatures. Temperature is currently recorded in Kelvin. Let's present the data in Celsius units. Since the conversion from Kelvin to Celsius is a simple additive conversion ($T_C = T_K - 273.15$), it does not matter whether the conversion is applied to the individual temperature records or to the annual averages. We will therefore use both approaches separately as practice.
    1. Calculate the annual means based on temperature data converted to Celsius.
      Show code
      nasa %>% mutate(Temp_c = temperature - 273.15) %>% group_by(year) %>% summarize(Mean=mean(Temp_c))
      
      # A tibble: 6 × 2
         year     Mean
        <int>    <dbl>
      1  1995 24.14457
      2  1996 23.95054
      3  1997 24.80664
      4  1998 25.55278
      5  1999 24.98644
      6  2000 25.18578
      
      Note: if at this point you end up with a single overall mean (not one per year), then the plyr version of the summarize function is masking the dplyr version. As indicated above, these two packages do not play nicely together; I strongly recommend accessing plyr functions via their namespace (e.g. plyr::ddply()) rather than loading the entire package.
    2. Calculate the annual means from the Kelvin temperatures and express the means in Celsius.
      Show code
      nasa %>% group_by(year) %>% summarize(Mean=mean(temperature)-273.15)
      
      # A tibble: 6 × 2
         year     Mean
        <int>    <dbl>
      1  1995 24.14457
      2  1996 23.95054
      3  1997 24.80664
      4  1998 25.55278
      5  1999 24.98644
      6  2000 25.18578
      
  3. Some temperature research focuses on variability and extremes (not just averages).
    1. Calculate the annual mean, standard deviation and maximum temperature.
      Show code
      nasa %>% mutate(Temp_c = temperature - 273.15) %>% group_by(year) %>% summarize_at('temperature', funs(mean,sd,max))
      
      # A tibble: 6 × 4
         year     mean       sd   max
        <int>    <dbl>    <dbl> <dbl>
      1  1995 297.2946 4.937595 307.5
      2  1996 297.1005 4.551722 307.9
      3  1997 297.9566 4.768040 307.9
      4  1998 298.7028 4.654017 310.0
      5  1999 298.1364 4.492859 309.6
      6  2000 298.3358 4.747475 308.7
      
    2. Calculate the number of monthly averages per annum that fall within the top 10 percent of all monthly temperature averages.
      Show code
      nasa %>% mutate(Temp_c = temperature - 273.15) %>% # convert to Celsius
        mutate(Perc=percent_rank(Temp_c)) %>% # calculate the percentile ranks
        filter(Perc>=0.9) %>% # retain only values at or above the 90th percentile
        group_by(year) %>% # group by year
        summarize(N=n()) # count the number of records (per group)
      
      # A tibble: 6 × 2
         year     N
        <int> <int>
      1  1995   254
      2  1996   173
      3  1997   336
      4  1998   908
      5  1999   489
      6  2000   730
      
      Note: the same can be achieved using the count() function in place of the more general summarize() function.
      nasa %>% mutate(Temp_c = temperature - 273.15) %>% # convert to Celsius
        mutate(Perc=percent_rank(Temp_c)) %>% # calculate the percentile ranks
        filter(Perc>=0.9) %>% # retain only values at or above the 90th percentile
        group_by(year) %>% # group by year
        count() # count the number of records (per group)
      
      # A tibble: 6 × 2
         year     n
        <int> <int>
      1  1995   254
      2  1996   173
      3  1997   336
      4  1998   908
      5  1999   489
      6  2000   730
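As a quick check of the earlier claim that an additive conversion can be applied either before or after averaging, here is a small sketch on simulated temperatures (not the nasa data):

```r
set.seed(1)
kelvin <- runif(12, 270, 310)   # simulated monthly temperatures in Kelvin
m1 <- mean(kelvin - 273.15)     # convert each record, then average
m2 <- mean(kelvin) - 273.15     # average, then convert
all.equal(m1, m2)               # TRUE: additive shifts commute with the mean
```

A nonlinear transformation (a log, say) would not commute with the mean in this way, so the order of operations would then matter.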
      

Data manipulation of Thousand Islands (Tikus) coral abundance data

The mvabund package includes a multivariate coral abundance data set (75 coral species) from the Thousand Islands stored in a list format. I have converted this list into a data frame and included it in the manipulationsDatasets.R and manipulationsDatasets.RData downloads. Abundance of a species of coral was measured as the length (cm) of a 10m transect containing that species.

Format of data
time rep Psammocora contigua Psammocora digitata Pocillopora damicornis ...
1 81 1 0 0 79 ...
2 81 2 0 0 51 ...
3 81 3 0 0 42 ...
4 81 4 0 0 15 ...
5 81 5 0 0 9 ...
6 81 6 0 0 72 ...

time: year in which data were collected
rep: an id for each location (site)
Psammocora contigua, ...: abundance (cm per 10m transect)
  1. This is another reasonably large data set. Let's explore specific subsets of the data.
    1. Select rep, time and all Porites species (limit to the first 6 rows for brevity's sake).
      Show code
      tikus %>% select(rep,time,starts_with('Porites')) %>% head
      
         rep time Porites cylindrica Porites lichen Porites lobata Porites lutea Porites nigrescens Porites solida
      V1   1   81                 61              0             36            30                  0              0
      V2   2   81                 24             47              0             0                  0              0
      V3   3   81                  0             49              0             0                  0             10
      V4   4   81                 20              0              0             0                 21              0
      V5   5   81                  0              0              0             0                  0             17
      V6   6   81                  0              0              0             0                  9              0
         Porites stephensoni
      V1                   0
      V2                   0
      V3                   0
      V4                   0
      V5                   0
      V6                   0
      
    2. Select rep, time and all species that do not contain pora (limit to the first 6 rows for brevity's sake).
      Show code
      tikus %>% select(rep,time,everything(),-contains('pora')) %>% head
      
         rep time Psammocora contigua Psammocora digitata Acropera aspera Fungia fungites Fungia paumotensis Fungia concina
      V1   1   81                   0                   0              17               0                  0              0
      V2   2   81                   0                   0              18               0                 33              0
      V3   3   81                   0                   0               9              18                  0              0
      V4   4   81                   0                   0               8              17                  0              0
      V5   5   81                   0                   0              23               0                  0              0
      V6   6   81                   0                   0               0               0                  0              0
         Fungia scutaria Halomitra limax Pavona varians Pavona venosa Pavona cactus Coeloseris mayeri Galaxea fascicularis
      V1               0               0             30             0             0                20                   51
      V2               0               0              0            24             0                 0                   27
      V3               0               0              0             0             0                15                   31
      V4               0               0              0             0             0                 0                   24
      V5               0               0              0             0             0                 9                    0
      V6               0               0              0             0             0                19                   13
         Symphyllia radians Lobophyllia corymbosa Lobophyllia hemprichii Porites cylindrica Porites lichen Porites lobata
      V1                  0                     0                      0                 61              0             36
      V2                  0                     0                      0                 24             47              0
      V3                  0                     0                      0                  0             49              0
      V4                  0                     0                      0                 20              0              0
      V5                  0                     0                      0                  0              0              0
      V6                  0                     0                      0                  0              0              0
         Porites lutea Porites nigrescens Porites solida Porites stephensoni Favia pallida Favia speciosa Favia stelligera
      V1            30                  0              0                   0            10              0                0
      V2             0                  0              0                   0            20              0                0
      V3             0                  0             10                   0             0             30                0
      V4             0                 21              0                   0             0              0                0
      V5             0                  0             17                   0             0              0                0
      V6             0                  9              0                   0             0              0                0
         Favia rotumana Favites abdita Favites chinensis Goniastrea rectiformis Goniastrea pectinata Goniastrea sp
      V1              0             33                 0                      0                    0             0
      V2              0             41                44                      0                    0             0
      V3              0             23                78                      0                    0             0
      V4              0             27                61                      0                    0             0
      V5              0             91                44                      0                    0             0
      V6              0             63                 0                      0                    0             0
         Dulophyllia crispa Platygyra daedalea Platygyra sinensis Leptastrea purpurea Leptastrea pruinosa Cyphastrea serailia
      V1                  0                  0                 47                   0                   0                   0
      V2                  0                 27                 27                   0                   0                   0
      V3                  0                 55                 56                   0                   0                   0
      V4                  0                  0                 26                   0                   0                   0
      V5                  0                 71                  0                   0                   0                   0
      V6                  0                 74                  0                   0                   0                   0
      
  2. Now let's calculate some group summaries for particular taxa.
    1. For each year, calculate the mean abundance of Pocillopora damicornis.
      Show code
      tikus %>% select(time,contains('Pocillopora damicornis')) %>% group_by(time) %>%
           summarize_all(mean)
      
      # A tibble: 6 × 2
          time `Pocillopora damicornis`
        <fctr>                    <dbl>
      1     81                     30.0
      2     83                      0.0
      3     84                      0.0
      4     85                      0.0
      5     87                      1.8
      6     88                      4.0
      
      Show code
      tikus %>% select(time,`Pocillopora damicornis`) %>% group_by(time) %>%
           summarize(MeanAbundance=mean(`Pocillopora damicornis`))
      
      # A tibble: 6 × 2
          time MeanAbundance
        <fctr>         <dbl>
      1     81          30.0
      2     83           0.0
      3     84           0.0
      4     85           0.0
      5     87           1.8
      6     88           4.0
      
      # Note the back ticks in the above
      
      Show code
      tikus %>% select(time,`Pocillopora damicornis`) %>% group_by(time) %>%
          summarize_at(vars(`Pocillopora damicornis`),funs(mean))
      
      # A tibble: 6 × 2
          time `Pocillopora damicornis`
        <fctr>                    <dbl>
      1     81                     30.0
      2     83                      0.0
      3     84                      0.0
      4     85                      0.0
      5     87                      1.8
      6     88                      4.0
      
      # Note the back ticks in the above
      
    2. For each year, calculate the mean abundance of each Pocillopora species.
      Show code
      tikus %>% select(time,starts_with('Pocillopora')) %>% group_by(time) %>%
          summarize_all(funs(mean))
      
      # A tibble: 6 × 3
          time `Pocillopora damicornis` `Pocillopora verrucosa`
        <fctr>                    <dbl>                   <dbl>
      1     81                     30.0                    19.2
      2     83                      0.0                     0.0
      3     84                      0.0                     0.0
      4     85                      0.0                     0.0
      5     87                      1.8                     0.0
      6     88                      4.0                     0.0
      
    3. For each year, calculate the mean and standard deviation abundance of each Pocillopora species.
      Show code
      tikus %>% select(time,starts_with('Pocillopora')) %>% group_by(time) %>%
          summarize_all(funs(mean, sd))
      
      # A tibble: 6 × 5
          time `Pocillopora damicornis_mean` `Pocillopora verrucosa_mean` `Pocillopora damicornis_sd`
        <fctr>                         <dbl>                        <dbl>                       <dbl>
      1     81                          30.0                         19.2                   29.074617
      2     83                           0.0                          0.0                    0.000000
      3     84                           0.0                          0.0                    0.000000
      4     85                           0.0                          0.0                    0.000000
      5     87                           1.8                          0.0                    5.692100
      6     88                           4.0                          0.0                    9.660918
      # ... with 1 more variables: `Pocillopora verrucosa_sd` <dbl>
      
  3. Now let's get a bit more serious.
    1. Calculate the mean cover abundance of Acropora per year. Note:
      • Abundance in cm per 10m will need to be converted to cover abundance
      • There is a typo: one of the Acropora species has been incorrectly listed as Acropera
      Show code
      tikus %>% rename(`Acropora aspera`=`Acropera aspera`) %>%
      gather(Species, Abundance,-time,-rep) %>%
      mutate(Cover=Abundance/10) %>%
      separate(Species,c('Genera','Species')) %>%
      filter(Genera=='Acropora') %>%
      group_by(time,rep) %>%
      summarise(SumCover=sum(Cover))
      
      Source: local data frame [60 x 3]
      Groups: time [?]
      
           time    rep SumCover
         <fctr> <fctr>    <dbl>
      1      81      1     64.7
      2      81      2     39.7
      3      81      3     35.7
      4      81      4     40.4
      5      81      5     28.9
      6      81      6      8.8
      7      81      7     15.6
      8      81      8      8.3
      9      81      9      6.2
      10     81     10      7.5
      # ... with 50 more rows
      

Lagged responses

For this example, we will use another of the fabricated data sets in manipulationsDatasets.R (data.t). This data set comprises two Responses measured in each of four years from each of four Plots. What we want is to derive a new measure that is the change in abundance between a certain year (2008) and all subsequent years - and of course, we want to do this separately for each Plot and each Response.

Format of data
Plot Year Resp1 Resp2
1 P1 2008 0.00 36.00
2 P1 2009 48.00 0.00
3 P1 2010 12.00 0.00
4 P1 2011 0.00 15.00
5 P2 2008 19.00 34.00
6 P2 2009 18.00 0.00

Plot: a random effect with four levels
Year: sampling year
Resp1, Resp2: responses (e.g. counts of two species)
  1. This manipulation requires a number of steps. The tidyverse ecosystem allows us to chain individual changes and manipulations together to form more complex manipulations. Of course, this also means that there are numerous ways to achieve the same result.
    1. Calculate the change (difference in abundance between 2008 and all subsequent years) for each Response (Resp1 and Resp2).
      Show code
      data.t %>%
          arrange(Plot,Year) %>%  #ensure that data are chronologically arranged in each Plot
          group_by(Plot) %>% #group by Plot
          mutate_at(vars(Resp1,Resp2), funs(Delta=. - first(.))) #calculate difference between each Value and the Value in the first Year
      
      Source: local data frame [16 x 6]
      Groups: Plot [4]
      
           Plot  Year Resp1 Resp2 Resp1_Delta Resp2_Delta
         <fctr> <int> <dbl> <dbl>       <dbl>       <dbl>
      1      P1  2008     0    36           0           0
      2      P1  2009    48     0          48         -36
      3      P1  2010    12     0          12         -36
      4      P1  2011     0    15           0         -21
      5      P2  2008    19    34           0           0
      6      P2  2009    18     0          -1         -34
      7      P2  2010    21    36           2           2
      8      P2  2011    45    12          26         -22
      9      P3  2008    40    49           0           0
      10     P3  2009    38     0          -2         -49
      11     P3  2010     9    18         -31         -31
      12     P3  2011    22    42         -18          -7
      13     P4  2008    20     0           0           0
      14     P4  2009    22    14           2          14
      15     P4  2010    32     0          12           0
      16     P4  2011    38    27          18          27
      
    2. As 2008 is the year we are comparing all others to, we don't really want the 2008 data in the final data set, so let's suppress it.
      Show code
      data.t %>%
          arrange(Plot,Year) %>%  #ensure that data are chronologically arranged in each Plot
          group_by(Plot) %>% #group by Plot
          mutate_at(vars(Resp1,Resp2), funs(Delta=. - first(.))) %>% #calculate difference between each Value and the Value in the first Year
          filter(Year>first(Year))
      
      Source: local data frame [12 x 6]
      Groups: Plot [4]
      
           Plot  Year Resp1 Resp2 Resp1_Delta Resp2_Delta
         <fctr> <int> <dbl> <dbl>       <dbl>       <dbl>
      1      P1  2009    48     0          48         -36
      2      P1  2010    12     0          12         -36
      3      P1  2011     0    15           0         -21
      4      P2  2009    18     0          -1         -34
      5      P2  2010    21    36           2           2
      6      P2  2011    45    12          26         -22
      7      P3  2009    38     0          -2         -49
      8      P3  2010     9    18         -31         -31
      9      P3  2011    22    42         -18          -7
      10     P4  2009    22    14           2          14
      11     P4  2010    32     0          12           0
      12     P4  2011    38    27          18          27
      
  2. So far, the metric representing the difference between before and after has been pretty crude. It is simply the raw difference, the magnitude of which is limitless and dependent on the magnitude and scale of the variables - thereby meaningless for comparisons.
    1. A more sophisticated metric would standardize the differences such that they represent the proportional difference (divide by the initial value). $$\delta = [(A_2 - A_1)/A_1] \times 100$$
      Show code
      delta <- function(t2,t1){
        100*((t2-t1)/t1)
      }
      data.t %>%
          arrange(Plot,Year) %>%  #ensure that data are chronologically arranged in each Plot
          group_by(Plot) %>% #group by Plot
          mutate_at(vars(Resp1,Resp2), funs(Delta=delta(.,first(.)))) %>%
              filter(Year>first(Year))
      
      Source: local data frame [12 x 6]
      Groups: Plot [4]
      
           Plot  Year Resp1 Resp2 Resp1_Delta Resp2_Delta
         <fctr> <int> <dbl> <dbl>       <dbl>       <dbl>
      1      P1  2009    48     0         Inf -100.000000
      2      P1  2010    12     0         Inf -100.000000
      3      P1  2011     0    15         NaN  -58.333333
      4      P2  2009    18     0   -5.263158 -100.000000
      5      P2  2010    21    36   10.526316    5.882353
      6      P2  2011    45    12  136.842105  -64.705882
      7      P3  2009    38     0   -5.000000 -100.000000
      8      P3  2010     9    18  -77.500000  -63.265306
      9      P3  2011    22    42  -45.000000  -14.285714
      10     P4  2009    22    14   10.000000         Inf
      11     P4  2010    32     0   60.000000         NaN
      12     P4  2011    38    27   90.000000         Inf
      
    2. Of course this introduces additional issues, such as what happens if the initial value is 0? In such cases, any increase is effectively an infinite increase. However, if the initial value and a subsequent value are both zero, then a sensible value of delta would be 0. (The function below also log-transforms the result so that the capped value of 10,000 does not dominate the scale.) $$ \delta = \begin{cases} [(A_2 - A_1)/A_1] \times 100 & \quad \text{if } A_1\neq 0\\ 10,000 & \quad \text{if } A_1=0\text{ \& }A_2\neq 0\\ 0 & \quad \text{if } A_1=0\text{ \& }A_2=0\\ \end{cases} $$
      Show code
      delta <- function(t2,t1){
        d<-100*((t2-t1)/t1)
        d[t1==0 & t2==0]<-0
        d[t1==0 & t2!=0]<-10000
        log(1+(d/101)) # log-transform to compress the capped and extreme values
      }
      data.t %>%
          arrange(Plot,Year) %>%  #ensure that data are chronologically arranged in each Plot
          group_by(Plot) %>% #group by Plot
          mutate_at(vars(Resp1,Resp2), funs(Delta=delta(.,first(.)))) %>%
              filter(Year>first(Year))
      
      Source: local data frame [12 x 6]
      Groups: Plot [4]
      
           Plot  Year Resp1 Resp2 Resp1_Delta Resp2_Delta
         <fctr> <int> <dbl> <dbl>       <dbl>       <dbl>
      1      P1  2009    48     0  4.60526919 -4.61512052
      2      P1  2010    12     0  4.60526919 -4.61512052
      3      P1  2011     0    15  0.00000000 -0.86170254
      4      P2  2009    18     0 -0.05351732 -4.61512052
      5      P2  2010    21    36  0.09914006  0.05660821
      6      P2  2011    45    12  0.85648651 -1.02346484
      7      P3  2009    38     0 -0.05077233 -4.61512052
      8      P3  2010     9    18 -1.45812010 -0.98454058
      9      P3  2011    22    42 -0.58976883 -0.15250187
      10     P4  2009    22    14  0.09440968  4.60526919
      11     P4  2010    32     0  0.46628385  0.00000000
      12     P4  2011    38    27  0.63715291  4.60526919
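The edge cases described in footnote 2 can be checked directly. The sketch below reproduces the delta() function from the code above so that it runs on its own; the probe values are chosen to match rows of the output table.

```r
#delta: percentage change with special handling of zero initial values,
#followed by a log transform (offset so the minimum of -100 stays defined)
delta <- function(t2, t1){
  d <- 100*((t2-t1)/t1)
  d[t1==0 & t2==0] <- 0      #both zero: no change
  d[t1==0 & t2!=0] <- 10000  #increase from zero: capped surrogate for Inf
  log(1+(d/101))
}
delta(0, 0)    #0: both values zero
delta(48, 0)   #log(1+10000/101), about 4.6053 (cf. Resp1_Delta for P1 in 2009)
delta(0, 12)   #log(1/101), about -4.6151 (a decline to zero)
```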
      
  3. The above procedure is perfectly adequate for dealing with a single ubiquitous event (disturbance), but what happens if there have been multiple different disturbances that begin and end at different times in different plots? Comparing each time period to a single, fixed time (2008 in this case) is then not logical. Let's say that in addition to our data set, we had also compiled a data set that defined the prior and post times of each disturbance for each Plot. This data set could then be used as a lookup table (as in the section above). In this example we will incorporate the post-disturbance date, yet we will not restrict the resulting data beyond this date (that is, we will continue to explore the impacts of the disturbance after the primary condition has ceased).
    #complex difference function
    data.v <- expand.grid(Plot=paste("P",1:4,sep=""))
    data.v$Prior <- c(2008,2008,2009,2008)
    data.v$Post <- c(2011,2010,2011,2011)
    data.v
    
      Plot Prior Post
    1   P1  2008 2011
    2   P2  2008 2010
    3   P3  2009 2011
    4   P4  2008 2011
    
    The following additional steps are now required in order to incorporate the prior and post disturbance dates.
    • Merge the data and the lookup table such that the prior and post dates are included in the data set.
    • Limit the first date of the comparison for each Plot such that the first measure is relative to the defined Prior disturbance time for that Plot.
    delta <- function(t2,t1){
      d<-100*((t2-t1)/t1)
      d[t1==0 & t2==0]<-0
      d[t1==0 & t2!=0]<-10000
      log(1+(d/101))
    }
    data.t %>%
        left_join(data.v) %>%
        filter(Year>=Prior) %>%
        arrange(Plot,Year) %>%  #ensure that data are chronologically arranged in each Plot
        group_by(Plot) %>% #group by Plot
        mutate_at(vars(Resp1,Resp2), funs(Delta=delta(.,first(.)))) %>%
            filter(Year>first(Year))
    
    Source: local data frame [11 x 8]
    Groups: Plot [4]
    
         Plot  Year Resp1 Resp2 Prior  Post Resp1_Delta
       <fctr> <int> <dbl> <dbl> <dbl> <dbl>       <dbl>
    1      P1  2009    48     0  2008  2011  4.60526919
    2      P1  2010    12     0  2008  2011  4.60526919
    3      P1  2011     0    15  2008  2011  0.00000000
    4      P2  2009    18     0  2008  2010 -0.05351732
    5      P2  2010    21    36  2008  2010  0.09914006
    6      P2  2011    45    12  2008  2010  0.85648651
    7      P3  2010     9    18  2009  2011 -1.40895673
    8      P3  2011    22    42  2009  2011 -0.53936879
    9      P4  2009    22    14  2008  2011  0.09440968
    10     P4  2010    32     0  2008  2011  0.46628385
    11     P4  2011    38    27  2008  2011  0.63715291
    # ... with 1 more variables: Resp2_Delta <dbl>
    

References

Wickham, H. (2014). "Tidy data". In: The Journal of Statistical Software 59.10. URL: http://www.jstatsoft.org/v59/i10/.


  R object assignment

Assigning entries is basically the act of defining a new object name and specifying what that object contains (its value). For example, if we wanted to store the number 10.513 as John Howard's IQ, we would instruct R to create a new object called (say) IQ and assign the value 10.513 to it. That is, we instruct R that IQ equals 10.513.
In R, the preferred assignment operator is <- rather than =.

> name <- value

So, to assign IQ the value of 10.513 in R:
IQ <- 10.513

End of instructions

  R object classes

Object classes define how information is stored and displayed. The basic storage unit in R is called a vector. A vector is an array of one or more entries of the same class. The common classes include:
  1. numeric - stores a number, e.g. 1, 2.345 etc
  2. character - stores alphanumeric characters, e.g. 'a', 'fish', 'color1'
  3. logical - stores either TRUE or FALSE
So the entries (1, 2, 3 & 4) might make up a numeric vector, whereas the entries ('Big', 'Small' & 'Tiny') would make up a character vector. To determine the class type of an object, use the following syntax (where name represents the object whose class is to be determined).

> class(name)
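For example (a minimal sketch; the vectors below are invented for illustration):

```r
#determine the class of some example vectors
nums <- c(1, 2, 3, 4)
class(nums)    #"numeric"
words <- c('Big', 'Small', 'Tiny')
class(words)   #"character"
flags <- c(TRUE, FALSE)
class(flags)   #"logical"
```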

End of instructions

  Print contents

In R, print means to output (list) the contents of an object. By default, the contents are output to the screen (the default output device). It is also possible to redirect output to a file or printer if necessary. The contents of an object are 'printed' by using the 'print()' function or by just entering the name of the object. Consider the following;
numbers <- c(1, 4, 6, 7, 4, 345, 36, 78)
print(numbers)
[1]   1   4   6   7   4 345  36  78
numbers
[1]   1   4   6   7   4 345  36  78
The first line of this syntax generates and populates the numeric vector called 'numbers'. The second line uses the print function to tell R to list the contents of the 'numbers' object - the output of which appears on the third line. The fourth and fifth lines illustrate that the same outcome can be achieved by simply entering the name of the object.

End of instructions

  R vectors - variables

In biology, a variable is a collection of observations of the same type. For example, a variable might consist of the observed weights of individuals within a sample of 10 bush rats. Each item (or element) in the variable is of the same type (a weight) and will have been measured comparably (same techniques and units). Biological variables are therefore best represented in R by vectors.
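A minimal sketch of the bush rat example (the weights below are invented for illustration):

```r
#a hypothetical variable: weights (g) of a sample of 10 bush rats
weights <- c(23.5, 21.2, 25.8, 22.0, 24.6, 23.1, 26.3, 22.7, 24.9, 21.8)
length(weights)   #10 elements, one per individual
class(weights)    #"numeric" - all observations are of the same type
```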

End of instructions

  R Factors

There are a number of ways in which this can be done. One way is to use the 'factor' (makes a list into a factor) function in conjunction with the 'c' (concatenation) function.

> name <- factor(c(list of characters/words))

Another way is to use the 'gl' function (which generates factors according to specified patterns of their levels)

> name <- gl(number of levels, number of replicates, length of data set, lab=c(list of level names))

Hence, consider the following alternative solutions;
sex <- factor(c('Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male'))
#OR
sex <- factor(c(rep('Female',6),rep('Male',6)))
#OR
sex <- gl(2,6,12,lab=c('Female','Male'))

The second option uses the 'rep()' function, which in this case repeats each level name (e.g. 'Female') 6 times.
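All three constructions yield the same factor, which can be confirmed with 'levels()' and 'table()':

```r
#construct the factor with gl() and inspect its structure
sex <- gl(2, 6, 12, labels=c('Female', 'Male'))
levels(sex)   #"Female" "Male"
table(sex)    #six observations of each level
```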

End of instructions