Tutorial 2.4 - Data manipulation

27 Mar 2017

I apologize in advance, this tutorial requires quite a bit of explaining and context before it can get into the code.... Good data manipulation is an art form that requires the use of many specific tools (functions) and expert data manipulation comes from the integration of these tools together. Therefore it is necessary to have an overview of the tool set before investigating any single tool.

Manipulating data sets

Rarely are data transcribed and organized into such simple and ready-to-go data sets. More typically, data are spread throughout multiple sources and in a variety of formats (particularly if compiled by multiple workers or instruments). Consequently, prior to any formal statistical analyses , it is necessary to compile very focused, tidy data sets.

Wickham (2014) suggests that there are many ways to organise data, yet tidy data (data that are structured in such a consistent way as to facilitate analyses) must adhere to a fairly strict set of structural rules. Specifically, in tidy data:

variables are in columns
observations are in rows - that is, for univariate data, there should be a separate row for each response observation.

To achieve tidy data, common data preparations include:

Reshaping and rearranging data
Merging multiple data sources
Aggregating data to appropriate spatial/temporal scales
Transforming data and deriving new variables
Sorting and reordering data
Relabelling data

This workshop will concentrate on these essential data preparation phases.

Practical data manipulation will be demonstrated via a series of very small artificial datasets. These datasets are presented in tables with black font and lines and the R code required to generate those data will be presented in static code boxes either underneath or adjacent the table. A very basic description of the table and the name of the data frame are displayed above the table. The entire collection of datasets used in this workshop can be obtained by issuing the following command:

(if online)

source(url( "http://www.flutterbys.com.au/stats/downloads/data/manipulationDatasets.R"))
#OR
load("http://www.flutterbys.com.au/stats/downloads/data/manipulationDatasets.RData")

(if offline and are running this from a local version having first downloaded either file)
source(file="manipulationDatasets.R") #OR load("manipulationDatasets.RData")

The great folks over at Rstudio have produced an excellent set of cheatsheets on a range of topics. For this tutorial, the Data Import Cheat Sheet (mainly page 2) and Data Transformation Cheat Sheet are useful summaries.

`tidyverse` a data manipulation ecosystem within R

There are numerous packages and base R routines devoted to data manipulation. Notable packages include data.tables, plyr and doBy. Indeed, earlier versions of this tutorial featured examples of each of these packages. However, an entire family of packages from Hadley Wickem's group now stands out as a comprehensive, intuitive suite of tools for data manipulation and visualisation.

Importantly, all of these packages are designed to integrate together and complement one another with a consistent interface. Individually, the important relevant packages for this tutorial include:

dplyr, for data manipulation
tidyr, for data tidying
forcats, for factors

To simplify installing an entire data ecosystem, the tidyverse package is available. Installing this package (via install.packages('tidyverse')) will install the following packages:

dplyr, for data manipulation
tidyr, for data tidying
readr, for importing data
ggplot2, for visualising data
purr, for functional programming
tibble, for a new data.frame-like structure
hmms, for times
lubridate, for dates/times
stringr, for strings

forcats, for factors
DBI, for databases
haven, for importing SAS, SPSS and Stata files
httr, fir web apis
jsonlite, for JSON
rvest, for web scraping
xml2, for XML
modelr, for model piping
broom, for converting models and objects into tidy data

Loading the tidyverse() package will automatically load the following packages:

dplyr, for data manipulation
tidyr, for data tidying
ggplot2, for visualising data
tibble, for a new data.frame-like structure
readr, for importing data
purr, for functional programming

For this tutorial, we additionally need to load the forcats package.

library(tidyverse)
library(forcats)

Finally, for much of the rest of the tutorial, I will demonstrate multiple ways to perform the tasks. Whilst the tidyverse ecosystem of packages all integrate well with one another, they do conflict with other packages. Consequently, for the alternate routines I demonstrate, namespaces will be used rather than loading of the associated packages.

The grammar of data manipulation

Hadley and his collaborators argue that there is a grammar of data manipulation and that manipulations comprises the following core verbs:

gather, for melting into long format
spread(), for casting into wide format
unite, for combining colums
separate(), for splitting columns
arrange(), for sorting data
select(), for subsetting columns
rename(), for renaming columns
filter(), for subsetting rows

slice, for selecting rows
mutate, for adding columns
summarize, for summarizing data
count(), for tallying data
group_by(), for defining groups
*_join(), for merging data sets

Piping

Typically, data manipulation/preparations comprise multiple steps and stages in which a series of alterations, additions etc are performed sequentially such that the data are progressively molded into a form appropriate for analysis etc. Traditionally, this would have involved a separate expression for each step often resulting in the generation of numerous intermediate data sets.

Furthermore, in an attempt to reduce the number of intermediate steps, functions are often nested within other functions such that alterations are made inline within another function. Collectively, these practices can yield code that is very difficult to read and interpret.

A a motivating (if not a bit extreme) example, lets say we wanted to calculate the logSumExp function: $$ log(\sum^{n}_{i=1} e^{x_i}) $$

## Generate some data
set.seed(123)
x = rgamma(10,5,1)
## Calculate the logSumExp
log(sum(exp(x)))

[1] 9.316408

## OR
x1=exp(x)
x2=sum(x1)
log(x2)

[1] 9.316408

A long honoured unix coding principle is that each application should focus on performing one action and performing that function well. In order to perform a sequence of of actions therefore involves piping (via the unix pipe character '|') the output of one application to the input of another application and so on in a chain. The grammar of data wrangling also adopts this principle (each tool specializing on one action and tools should be piped together to achieve a sequence of actions).

The piping (glue) operator in tidyverse is %>%. An object on the left hand side of the %>% operator is passed as the first argument of the function on the right hand side.

x %>% exp %>% sum %>% log

[1] 9.316408

To reiterate, the following two are equivalent:

exp(x)

 [1]   29.653639 1601.918872    5.101634  118.918637 7140.686681  252.361318    9.175114    4.863565 1756.825350
[10]  199.466617

x %>% exp

 [1]   29.653639 1601.918872    5.101634  118.918637 7140.686681  252.361318    9.175114    4.863565 1756.825350
[10]  199.466617

as are the following:

log(x, base=10)

 [1] 0.5301465 0.8679950 0.2120706 0.6792861 0.9480981 0.7427928 0.3456667 0.1991438 0.8733941 0.7239190

x %>% log(base=10)

 [1] 0.5301465 0.8679950 0.2120706 0.6792861 0.9480981 0.7427928 0.3456667 0.1991438 0.8733941 0.7239190

Most of the following examples will demonstrate isolated data manipulation actions (such as filtering, summarizing or joining) as this focuses on the specific uses of these functions without the distractions and complications of other actions. For isolated uses, piping has little (if any) advantages. Nevertheless, in recognition that data manipulations rarely comprise a single action (rather they are a series of linked actions), for all remaining examples demonstrated in the tidyverse (dplyr/tidyr) context, piping will be used.

Reshaping dataframes

Wide to long (gather)

Whilst wide data formats are often more compact and typically easier to manage for data entry (particularly in the field), the data are not in the appropriate format for most analyses (traditional repeated measures and multivariate analyses are two exceptions). Most analyses require that each replicate is in its own row and thus it is necessary to be rearrange or reshape (melt or gather) the data from this wide format to the long (molten) format.

Whilst there are numerous routines in R for reshaping data, we will only explore those that are formally part of the tidyverse ecosystem.

The gather() function (dplyr package) is very useful for converting wide (repeated measures-like) into long format. The important parameters to specify are;

key, a name for the new categorical variable that represents the levels that correspond to the variables being gathered (shown in purple in the Wide data above)
value a name for the numeric variable to contain the observations

By default, gather() assumes that you wish to gather all the variables to produce a new table with two columns. In order to restrict the variables to be gathered, after specifying the key and value, we then indicate either which variables to include in the gather or alternatively which to exclude.

data.w

	Between	Plot	Time.0	Time.1	Time.2
R1	A1	P1	8	14	14
R2	A1	P2	10	12	11
R3	A2	P3	7	11	8
R4	A2	P4	11	9	2

View code

set.seed(1)
data.w <- expand.grid(Plot=paste("P",1:4,sep=""))
data.w$Between <- gl(2,2,4,lab=paste("A",1:2,sep=""))
data.w <- with(data.w, data.frame(Between,Plot,
            matrix(rpois(12,10),ncol=3,
            dimnames=list(paste("R",1:4,sep=""),
                          paste("Time",0:2,sep=":")))))

	Between	Plot	Time	Count
1	A1	P1	Time.0	8
2	A1	P2	Time.0	10
3	A2	P3	Time.0	7
4	A2	P4	Time.0	11
5	A1	P1	Time.1	14
6	A1	P2	Time.1	12
7	A2	P3	Time.1	11
8	A2	P4	Time.1	9
9	A1	P1	Time.2	14
10	A1	P2	Time.2	11
11	A2	P3	Time.2	8
12	A2	P4	Time.2	2

Via gather (package:tidyr)

data.w %>% gather(key=Time,value=Count,Time.0,Time.1,Time.2)
#OR
data.w %>% gather(key=Time,value=Count,-Between,-Plot)

   Between Plot   Time Count
1       A1   P1 Time.0     8
2       A1   P2 Time.0    10
3       A2   P3 Time.0     7
4       A2   P4 Time.0    11
5       A1   P1 Time.1    14
6       A1   P2 Time.1    12
7       A2   P3 Time.1    11
8       A2   P4 Time.1     9
9       A1   P1 Time.2    14
10      A1   P2 Time.2    11
11      A2   P3 Time.2     8
12      A2   P4 Time.2     2

When specifying which columns to gather, there are also a number of "Helper" functions that provide convenient ways to select columns:

contains(""), columns whose name match the character string (not case sensitive by default)
starts_with(""), columns whose names begin with the character string (not case sensitive by default)
ends_with(""), columns whose names end with the character string (not case sensitive by default)
one_of(c("","")), columns whose names are amongst the character strings
matches(""), columns whose names are matched by the character string of a regular expression (not case sensitive by default)
everything(), all columns. This is useful in combination with other specifiers
num_range("",), select all columns whose names start with the first argument (a character string) followed by a number in the integer sequence provided as the second argument

Selections are negated by prepending with a '-' and columns are ordered according to the order that they are selected.

	Between	Plot	Time	Count
1	A1	P1	Time.0	8
2	A1	P2	Time.0	10
3	A2	P3	Time.0	7
4	A2	P4	Time.0	11
5	A1	P1	Time.1	14
6	A1	P2	Time.1	12
7	A2	P3	Time.1	11
8	A2	P4	Time.1	9
9	A1	P1	Time.2	14
10	A1	P2	Time.2	11
11	A2	P3	Time.2	8
12	A2	P4	Time.2	2

Via gather (package:tidyr)

data.w %>% gather(key=Time,value=Count, contains('Time'))
#OR
data.w %>% gather(key=Time,value=Count, num_range('Time.',0:2))
#OR
data.w %>% gather(key=Time,value=Count, matches('Time.[0-2]'))

   Between Plot   Time Count
1       A1   P1 Time.0     8
2       A1   P2 Time.0    10
3       A2   P3 Time.0     7
4       A2   P4 Time.0    11
5       A1   P1 Time.1    14
6       A1   P2 Time.1    12
7       A2   P3 Time.1    11
8       A2   P4 Time.1     9
9       A1   P1 Time.2    14
10      A1   P2 Time.2    11
11      A2   P3 Time.2     8
12      A2   P4 Time.2     2

Long to wide (cast)

The spread() function (tidyr package) is very useful for converting long datasets into wide (repeated measures-like). The heart of the function definition is specifying a formula LHS ~ RHS where;

key the existing categorical variable that will define the new variable names
value the existing numeric variable that will be reshaped into wide format

Note, it is best that data are already as long as possible.

data

	Resp1	Between	Plot	Subplot	Within
1	8	A1	P1	S1	B1
2	10	A1	P1	S1	B2
3	7	A1	P1	S2	B1
4	11	A1	P1	S2	B2
5	14	A2	P2	S3	B1
6	12	A2	P2	S3	B2
7	11	A2	P2	S4	B1
8	9	A2	P2	S4	B2
9	14	A3	P3	S5	B1
10	11	A3	P3	S5	B2
11	8	A3	P3	S6	B1
12	2	A3	P3	S6	B2
13	8	A1	P4	S7	B1
14	10	A1	P4	S7	B2
15	7	A1	P4	S8	B1
16	12	A1	P4	S8	B2
17	11	A2	P5	S9	B1
18	12	A2	P5	S9	B2
19	12	A2	P5	S10	B1
20	10	A2	P5	S10	B2
21	3	A3	P6	S11	B1
22	11	A3	P6	S11	B2
23	13	A3	P6	S12	B1
24	7	A3	P6	S12	B2

View code

data.l = data %>% select(-Resp2)

Widen (spread) Resp1 for repeated measures (Within)

	Between	Plot	Subplot	B1	B2
1	A1	P1	S1	8	10
2	A1	P1	S2	7	11
3	A1	P4	S7	8	10
4	A1	P4	S8	7	12
5	A2	P2	S3	14	12
6	A2	P2	S4	11	9
7	A2	P5	S9	11	12
8	A2	P5	S10	12	10
9	A3	P3	S5	14	11
10	A3	P3	S6	8	2
11	A3	P6	S11	3	11
12	A3	P6	S12	13	7

Via spread (package:tidyr)

data.l %>% spread(Within,Resp1)

   Between Plot Subplot B1 B2
1       A1   P1      S1  8 10
2       A1   P1      S2  7 11
3       A1   P4      S7  8 10
4       A1   P4      S8  7 12
5       A2   P2      S3 14 12
6       A2   P2      S4 11  9
7       A2   P5      S9 11 12
8       A2   P5     S10 12 10
9       A3   P3      S5 14 11
10      A3   P3      S6  8  2
11      A3   P6     S11  3 11
12      A3   P6     S12 13  7

If the data are not already fully gathered, then it is necessary to gather before spreading. For example, if we return to the data that contain both Resp1 and Resp2...

data

	Resp1	Resp2	Between	Plot	Subplot	Within
1	8	17	A1	P1	S1	B1
2	10	18	A1	P1	S1	B2
3	7	17	A1	P1	S2	B1
4	11	21	A1	P1	S2	B2
5	14	19	A2	P2	S3	B1
6	12	13	A2	P2	S3	B2
7	11	24	A2	P2	S4	B1
8	9	18	A2	P2	S4	B2
9	14	25	A3	P3	S5	B1
10	11	18	A3	P3	S5	B2
11	8	27	A3	P3	S6	B1
12	2	22	A3	P3	S6	B2
13	8	17	A1	P4	S7	B1
14	10	22	A1	P4	S7	B2
15	7	16	A1	P4	S8	B1
16	12	13	A1	P4	S8	B2
17	11	23	A2	P5	S9	B1
18	12	19	A2	P5	S9	B2
19	12	23	A2	P5	S10	B1
20	10	21	A2	P5	S10	B2
21	3	17	A3	P6	S11	B1
22	11	16	A3	P6	S11	B2
23	13	26	A3	P6	S12	B1
24	7	28	A3	P6	S12	B2

View code

set.seed(1)
data <- expand.grid(Within=paste("B",1:2,sep=""),
                    Subplot=paste("S",1:2,sep=""),
                    Plot=paste("P",1:6,sep=""))
data$Subplot <- gl(12,2,24,lab=paste("S",1:12,sep=""))
data$Between <- gl(3,4,24,lab=paste("A",1:3,sep=""))
data$Resp1 <- rpois(24,10)
data$Resp2 <- rpois(24,20)
data <- with(data,data.frame(Resp1,Resp2,Between,Plot,Subplot,Within))

Widen (spread) both Resp1 and Resp2 for repeated measures

	Between	Plot	Subplot	Resp	B1	B2
1	A1	P1	S1	Resp1	8	10
2	A1	P1	S1	Resp2	17	18
3	A1	P1	S2	Resp1	7	11
4	A1	P1	S2	Resp2	17	21
5	A1	P4	S7	Resp1	8	10
6	A1	P4	S7	Resp2	17	22
7	A1	P4	S8	Resp1	7	12
8	A1	P4	S8	Resp2	16	13
9	A2	P2	S3	Resp1	14	12
10	A2	P2	S3	Resp2	19	13
11	A2	P2	S4	Resp1	11	9
12	A2	P2	S4	Resp2	24	18
13	A2	P5	S9	Resp1	11	12
14	A2	P5	S9	Resp2	23	19
15	A2	P5	S10	Resp1	12	10
16	A2	P5	S10	Resp2	23	21
17	A3	P3	S5	Resp1	14	11
18	A3	P3	S5	Resp2	25	18
19	A3	P3	S6	Resp1	8	2
20	A3	P3	S6	Resp2	27	22
21	A3	P6	S11	Resp1	3	11
22	A3	P6	S11	Resp2	17	16
23	A3	P6	S12	Resp1	13	7
24	A3	P6	S12	Resp2	26	28

Via spread (package:tidyr)

data %>% gather(Resp,Count, contains('Resp')) %>% spread(Within,Count)

   Between Plot Subplot  Resp B1 B2
1       A1   P1      S1 Resp1  8 10
2       A1   P1      S1 Resp2 17 18
3       A1   P1      S2 Resp1  7 11
4       A1   P1      S2 Resp2 17 21
5       A1   P4      S7 Resp1  8 10
6       A1   P4      S7 Resp2 17 22
7       A1   P4      S8 Resp1  7 12
8       A1   P4      S8 Resp2 16 13
9       A2   P2      S3 Resp1 14 12
10      A2   P2      S3 Resp2 19 13
11      A2   P2      S4 Resp1 11  9
12      A2   P2      S4 Resp2 24 18
13      A2   P5      S9 Resp1 11 12
14      A2   P5      S9 Resp2 23 19
15      A2   P5     S10 Resp1 12 10
16      A2   P5     S10 Resp2 23 21
17      A3   P3      S5 Resp1 14 11
18      A3   P3      S5 Resp2 25 18
19      A3   P3      S6 Resp1  8  2
20      A3   P3      S6 Resp2 27 22
21      A3   P6     S11 Resp1  3 11
22      A3   P6     S11 Resp2 17 16
23      A3   P6     S12 Resp1 13  7
24      A3   P6     S12 Resp2 26 28

Note, the above data are not untidy..

Combining columns (unite)

If data are recorded with excessive fidelity, it may be useful to combine multiple fields into a single field. For example, if the date was recorded across three fields (year, month and day, we may like to combine these to form a single date field.

data.d

	year	month	day	Resp1
1	2008	07	13	16.00
2	2008	09	28	25.00
3	2009	02	21	52.00
4	2009	10	23	18.00
5	2008	05	26	0.00
6	2009	10	14	45.00
7	2009	11	15	40.00
8	2009	04	23	38.00
9	2009	03	30	9.00
10	2008	02	14	22.00

View code

set.seed(1)
data.d = data.frame(Date=sample(seq(as.Date('2008-01-01'),
                                    as.Date('2009-12-31'), by='day'), size=10),
                    Resp1=rnbinom(10,5,mu=30) * as.numeric(replicate(5,
                          rbinom(2,1,0.8)))) %>%
    separate(Date,into=c('year','month','day'))

Unite year, month and day into a single 'Date' field.

	Date	Resp1
1	2008-07-13	16.00
2	2008-09-28	25.00
3	2009-02-21	52.00
4	2009-10-23	18.00
5	2008-05-26	0.00
6	2009-10-14	45.00
7	2009-11-15	40.00
8	2009-04-23	38.00
9	2009-03-30	9.00
10	2008-02-14	22.00

Via unite (package:tidyr)

data.d %>% unite(year,month,day,col='Date',sep='-')

         Date Resp1
1  2008-07-13    16
2  2008-09-28    25
3  2009-02-21    52
4  2009-10-23    18
5  2008-05-26     0
6  2009-10-14    45
7  2009-11-15    40
8  2009-04-23    38
9  2009-03-30     9
10 2008-02-14    22

In the last spread example, we widened Resp1 and Resp2 for each level of Within. However, we may wish to present this table in a more compact form in which the data are further spread into each combination of Resp and Within. Note, doing so would even more untidy data.

Unite year, month and day into a single 'Date' field.

	Between	Plot	Subplot	Resp1_B1	Resp1_B2	Resp2_B1	Resp2_B2
1	A1	P1	S1	8	10	17	18
2	A1	P1	S2	7	11	17	21
3	A1	P4	S7	8	10	17	22
4	A1	P4	S8	7	12	16	13
5	A2	P2	S3	14	12	19	13
6	A2	P2	S4	11	9	24	18
7	A2	P5	S9	11	12	23	19
8	A2	P5	S10	12	10	23	21
9	A3	P3	S5	14	11	25	18
10	A3	P3	S6	8	2	27	22
11	A3	P6	S11	3	11	17	16
12	A3	P6	S12	13	7	26	28

Via unite (package:tidyr)

data %>% gather(Resp,Count, contains('Resp')) %>% unite(RW, Resp, Within) %>% spread(RW,Count)

   Between Plot Subplot Resp1_B1 Resp1_B2 Resp2_B1 Resp2_B2
1       A1   P1      S1        8       10       17       18
2       A1   P1      S2        7       11       17       21
3       A1   P4      S7        8       10       17       22
4       A1   P4      S8        7       12       16       13
5       A2   P2      S3       14       12       19       13
6       A2   P2      S4       11        9       24       18
7       A2   P5      S9       11       12       23       19
8       A2   P5     S10       12       10       23       21
9       A3   P3      S5       14       11       25       18
10      A3   P3      S6        8        2       27       22
11      A3   P6     S11        3       11       17       16
12      A3   P6     S12       13        7       26       28

Splitting columns appart (separate)

Separating variables is the opposite of uniting them. A field is separated by either indicating a character(s) to use as a separator, or else providing a fixed width format.

data.c

	Year	Resp1
1	M25	45.00
2	M28	40.00
3	F29	38.00
4	F43	9.00
5	M34	22.00
6	F25	20.00
7	F26	22.00
8	F28	32.00
9	F27	38.00
10	M29	47.00

View code

set.seed(1)
data.c <- data.frame(Year=paste0(sample(c('M','F'),10,replace=TRUE),rpois(10,30)),
                     Resp1=rnbinom(10,5,mu=30) * as.numeric(replicate(5,rbinom(2,1,0.8))))

Separate year, month and day into a single 'Date' field.

	Gender	Age	Resp1
1	M	25	45.00
2	M	28	40.00
3	F	29	38.00
4	F	43	9.00
5	M	34	22.00
6	F	25	20.00
7	F	26	22.00
8	F	28	32.00
9	F	27	38.00
10	M	29	47.00

Via separate (package:tidyr)

data.c %>% separate(Year,into=c('Gender','Age'),sep=1)

   Gender Age Resp1
1       M  25    45
2       M  28    40
3       F  29    38
4       F  43     9
5       M  34    22
6       F  25    20
7       F  26    22
8       F  28    32
9       F  27    38
10      M  29    47

Summary and Vectorized functions

Many data manipulation actions involve the use of specific auxiliary functions to act on specific parts of the data set. For example if we wish to summarize a data set, we might apply a mean() function to the numeric vectors (variables) and levels() or unique function to character or factor vectors. On the other hand, if we wish to generate log-transformed versions of the numeric vectors, the we would apply the log() function to each of those numeric vectors. Furthermore, we might use other auxiliary functions that return vectors of either booleans (TRUE or FALSE) or integers that can represent row or column indices to determine which observations to perform actions on.

Broadly speaking, summary functions take a vector and return a single value. Familiar examples of summary functions are mean(), var(), min() etc. By contrast, vectorized functions take a vector and return a vector of the same length as the original input vector. rank(), cumsum() and log() are all examples of window functions.

The dplyr package introduces a number of additional summary and vectorized functions that are deemed useful augmentations of the standard set of functions available in base R. The following tables describes the most useful functions (from base and dplyr) along with which manipulation functions they can be applied. Those functions defined in dplyr include the dplyr namespace.

To demonstrate summary and vector functions, the following vectors will be used.

## Generate some data
set.seed(123)
(x = rgamma(10,5,1))

 [1] 3.389585 7.378957 1.629561 4.778440 8.873564 5.530862 2.216495 1.581772 7.471264 5.295647

(A = sample(letters[1:2], size=10, replace=TRUE))

 [1] "b" "b" "b" "b" "a" "a" "b" "b" "b" "b"

(B = sample(c(TRUE,FALSE), size=10, replace=TRUE))

 [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Summary functions

Function	Description	Applies to	Example
`mean()`,`median()`	Arithmetic mean and median	`summarize()`, `mutate()`, `filter()`	mean(x) [1] 4.814615 median(B) [1] 1
`sum()`	Sum	`summarize()`, `mutate()`, `filter()`	sum(x) [1] 48.14615 sum(B) [1] 9
`quantile()`	Five number summary	`summarize()`, `mutate()`, `filter()` *must only return a single quantile, so prob must be supplied	quantile(x,prob=0.25) 25% 2.509767 quantile(B,prob=0.25) 25% 1
`min()`,`max()`	Minimum and Maximum	`summarize()`, `mutate()`, `filter()`	min(x) [1] 1.581772 min(B) [1] 0
`IQR()`,`mad()`	Inter-Quartile Range and Mean Absolute Deviation	`summarize()`, `mutate()`, `filter()`	IQR(x) [1] 4.407166 mad(B) [1] 0
`var()`,`sd()`	Variance and Standard Deviation	`summarize()`, `mutate()`, `filter()`	var(x) [1] 6.692355 sd(B) [1] 0.3162278
`dplyr::first()`, `dplyr::last()`, `dplyr::nth()`	First, Last and nth value	`summarize()`, `mutate()`, `filter()`	first(x) [1] 3.389585 first(x,order_by=A) [1] 8.873564 last(B) [1] TRUE nth(A,4) [1] "b"
`dplyr::n()`, `dplyr::n_distinct()`	Number of values/distinct values	`summarize()`, `mutate()`, `filter()` *`n()` is a special case that can not be used outside of these functions	n_distinct(A) [1] 2

Vectorized functions

Function	Description	Applies to	Example
`log()`, `exp()`, `scale()`	Logarithmic, exponential and scale transformations	`mutate()`, `filter()`	log(x) [1] 1.2207074 1.9986324 0.4883105 1.5641140 2.1830765 [6] 1.7103437 0.7959271 0.4585455 2.0110642 1.6668851 exp(x) [1] 29.653639 1601.918872 5.101634 118.918637 [5] 7140.686681 252.361318 9.175114 4.863565 [9] 1756.825350 199.466617 scale(x) [,1] [1,] -0.55085136 [2,] 0.99125776 [3,] -1.23119621 [4,] -0.01398362 [5,] 1.56900441 [6,] 0.27686846 [7,] -1.00431433 [8,] -1.24966932 [9,] 1.02693911 [10,] 0.18594510 attr(,"scaled:center") [1] 4.814615 attr(,"scaled:scale") [1] 2.586959
`cumsum()`, `cumprod()`	Cumulative sum and product	`mutate()`, `filter()`	cumsum(x) [1] 3.389585 10.768542 12.398103 17.176543 26.050107 [6] 31.580969 33.797464 35.379235 42.850499 48.146146 cumsum(x>3) [1] 1 2 2 3 4 5 5 5 6 7 cumsum(B) [1] 1 2 2 3 4 5 6 7 8 9
`cummin()`, `cummax()`	Cumulative min and max	`mutate()`, `filter()`	cummax(x) [1] 3.389585 7.378957 7.378957 7.378957 8.873564 8.873564 [7] 8.873564 8.873564 8.873564 8.873564 cummin(B) [1] 1 1 0 0 0 0 0 0 0 0
`dplyr::cummean()`	Cumulative mean	`mutate()`, `filter()`	cummean(x) [1] 3.389585 5.384271 4.132701 4.294136 5.210021 5.263495 [7] 4.828209 4.422404 4.761167 4.814615 cummean(x>2) [1] 1.0000000 1.0000000 0.6666667 0.7500000 0.8000000 [6] 0.8333333 0.8571429 0.7500000 0.7777778 0.8000000 cummean(B) [1] 1.0000000 1.0000000 0.6666667 0.7500000 0.8000000 [6] 0.8333333 0.8571429 0.8750000 0.8888889 0.9000000
`dplyr::cumall()`, `dplyr::cumany()`	Cumulative all and any these are of most use as part of a filter	`mutate()`, `filter()`	cumall(x) [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE cumall(x>2) [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [10] FALSE cumany(B) [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE cumall(B) [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [10] FALSE
`dplyr::lead()`, `dplyr::lag()`	Offset items by n or -n	`mutate()`, `filter()`	lag(x) [1] NA 3.389585 7.378957 1.629561 4.778440 8.873564 [7] 5.530862 2.216495 1.581772 7.471264 lag(x,3) [1] NA NA NA 3.389585 7.378957 1.629561 [7] 4.778440 8.873564 5.530862 2.216495 lead(A) [1] "b" "b" "b" "a" "a" "b" "b" "b" "b" NA
`rank()`, `order()`	Rank and order of items	`mutate()`, `filter()`	rank(x) [1] 4 8 2 5 10 7 3 1 9 6 order(x) [1] 8 3 7 1 4 10 6 2 9 5 rank(B) [1] 6 6 1 6 6 6 6 6 6 6
`dplyr::min_rank()`, `dplyr::percent_rank()`, `dplyr::dense_rank()`	Ranks in which ties = min, expressed as a percentage (`percent_rank`), without gaps (`dense_rank`)	`mutate()`, `filter()`	min_rank(x) [1] 4 8 2 5 10 7 3 1 9 6 percent_rank(x) [1] 0.3333333 0.7777778 0.1111111 0.4444444 1.0000000 [6] 0.6666667 0.2222222 0.0000000 0.8888889 0.5555556 dense_rank(x) [1] 4 8 2 5 10 7 3 1 9 6
`dplyr::row_number()`	Ranks in which ties = first	`mutate()`, `filter()`	row_number(x) [1] 4 8 2 5 10 7 3 1 9 6 row_number(A) [1] 3 4 5 6 1 2 7 8 9 10 row_number(B) [1] 2 3 1 4 5 6 7 8 9 10
`dplyr::cume_dist()`	Cumulative sum as a proportion	`mutate()`, `filter()`	cume_dist(x) [1] 0.4 0.8 0.2 0.5 1.0 0.7 0.3 0.1 0.9 0.6 cume_dist(A) [1] 1.0 1.0 1.0 1.0 0.2 0.2 1.0 1.0 1.0 1.0
`dplyr::ntile()`	Partition into n bins	`mutate()`, `filter()`	ntile(x,3) [1] 1 3 1 2 3 2 1 1 3 2
`dplyr::between()`	Are the values between two numbers	`mutate()`, `filter()`	between(x,3,7) [1] TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE [10] TRUE
`dplyr::if_else()`, `dplyr::case_when()`	Assign values based on condition	`mutate()`, `filter()`	if_else(x > 3,'H','L') [1] "H" "H" "L" "H" "H" "H" "L" "L" "H" "H" case_when(x<=3 ~ 'L', x>3 & x<=6 ~ 'M', x>6 ~ 'H') [1] "M" "H" "L" "M" "H" "M" "L" "L" "H" "M"
`dplyr::recode()`, `dplyr::recode_factor()`	Recode items	`mutate()`, `filter()`	recode(A,a='apple', b='banana') [1] "banana" "banana" "banana" "banana" "apple" "apple" [7] "banana" "banana" "banana" "banana" recode_factor(A,a='apple', b='banana') [1] banana banana banana banana apple apple banana banana [9] banana banana Levels: apple banana

Alterations to existing variables

Sorting data (arrange)

Most statistical analyses are invariant to the data order and thus data reordering is typically only for aesthetics in tables and figures.

Sorting data has the potential to be one of the most dangerous forms of data manipulation - particularly in spreadsheets in which there is no real binding between individual columns or rows. It is far to easy to accidentally sort the data in a single column (or row) without applying the ordering to all the other columns in the data set thereby resulting in broken data.

Whilst the above apocalypse is still possible in R, the data structures and manipulation interfaces mean that you really have to try to break the data in this way. Furthermore, you are encouraged to store reordered data in a different object to the original data, and hence 'rolling' back is trivial.

An example of an data set (data.1)

	Plot	Cond	Between	Temp	LAT	LONG
1	P1	H	A1	15.74	17.26	146.24
2	P2	H	A1	23.84	14.07	144.89
3	P3	H	A1	13.64	20.75	144.69
4	P4	H	A2	37.95	18.41	142.06
5	P1	M	A2	25.30	18.47	144.04
6	P2	M	A2	13.80	20.39	145.84
7	P3	M	A3	26.87	20.14	147.72
8	P4	M	A3	29.38	19.69	144.79
9	P1	L	A3	27.76	20.34	145.78
10	P2	L	A4	18.95	20.06	144.89
11	P3	L	A4	37.12	18.65	142.25
12	P4	L	A4	25.90	14.52	144.17

View code

set.seed(1)
data.1 <- expand.grid(Plot=paste("P",1:4,sep=""),Cond=c("H","M","L"))
data.1$Cond <- factor(as.character(data.1$Cond))
data.1$Between <- gl(4,3,12,lab=paste("A",1:4,sep=""))
data.1$Temp <- rnorm(12,22,10)
data.1$LAT <- rnorm(12,18.5,2)
data.1$LONG <- rnorm(12,145,2)

Sort the data by LAT

	Plot	Cond	Between	Temp	LAT	LONG
2	P2	H	A1	23.84	14.07	144.89
12	P4	L	A4	25.90	14.52	144.17
1	P1	H	A1	15.74	17.26	146.24
4	P4	H	A2	37.95	18.41	142.06
5	P1	M	A2	25.30	18.47	144.04
11	P3	L	A4	37.12	18.65	142.25
8	P4	M	A3	29.38	19.69	144.79
10	P2	L	A4	18.95	20.06	144.89
7	P3	M	A3	26.87	20.14	147.72
9	P1	L	A3	27.76	20.34	145.78
6	P2	M	A2	13.80	20.39	145.84
3	P3	H	A1	13.64	20.75	144.69

Via arrange (package:dplyr)

data.1 %>% arrange(LAT)

   Plot Cond Between     Temp      LAT     LONG
1    P2    H      A1 23.83643 14.07060 144.8877
2    P4    L      A4 25.89843 14.52130 144.1700
3    P1    H      A1 15.73546 17.25752 146.2397
4    P4    H      A2 37.95281 18.41013 142.0585
5    P1    M      A2 25.29508 18.46762 144.0437
6    P3    L      A4 37.11781 18.64913 142.2459
7    P4    M      A3 29.38325 19.68780 144.7944
8    P2    L      A4 18.94612 20.06427 144.8924
9    P3    M      A3 26.87429 20.14244 147.7174
10   P1    L      A3 27.75781 20.33795 145.7753
11   P2    M      A2 13.79532 20.38767 145.8359
12   P3    H      A1 13.64371 20.74986 144.6884

Via order

data.1[order(data.1$LAT),]

   Plot Cond Between     Temp      LAT     LONG
2    P2    H      A1 23.83643 14.07060 144.8877
12   P4    L      A4 25.89843 14.52130 144.1700
1    P1    H      A1 15.73546 17.25752 146.2397
4    P4    H      A2 37.95281 18.41013 142.0585
5    P1    M      A2 25.29508 18.46762 144.0437
11   P3    L      A4 37.11781 18.64913 142.2459
8    P4    M      A3 29.38325 19.68780 144.7944
10   P2    L      A4 18.94612 20.06427 144.8924
7    P3    M      A3 26.87429 20.14244 147.7174
9    P1    L      A3 27.75781 20.33795 145.7753
6    P2    M      A2 13.79532 20.38767 145.8359
3    P3    H      A1 13.64371 20.74986 144.6884

Sort the data by LAT (highest to lowest)

	Plot	Cond	Between	Temp	LAT	LONG
1	P3	H	A1	13.64	20.75	144.69
2	P2	M	A2	13.80	20.39	145.84
3	P1	L	A3	27.76	20.34	145.78
4	P3	M	A3	26.87	20.14	147.72
5	P2	L	A4	18.95	20.06	144.89
6	P4	M	A3	29.38	19.69	144.79
7	P3	L	A4	37.12	18.65	142.25
8	P1	M	A2	25.30	18.47	144.04
9	P4	H	A2	37.95	18.41	142.06
10	P1	H	A1	15.74	17.26	146.24
11	P4	L	A4	25.90	14.52	144.17
12	P2	H	A1	23.84	14.07	144.89

Via arrange (package:dplyr)

data.1 %>% arrange(desc(LAT))
#OR
data.1 %>% arrange(-LAT)

   Plot Cond Between     Temp      LAT     LONG
1    P3    H      A1 13.64371 20.74986 144.6884
2    P2    M      A2 13.79532 20.38767 145.8359
3    P1    L      A3 27.75781 20.33795 145.7753
4    P3    M      A3 26.87429 20.14244 147.7174
5    P2    L      A4 18.94612 20.06427 144.8924
6    P4    M      A3 29.38325 19.68780 144.7944
7    P3    L      A4 37.11781 18.64913 142.2459
8    P1    M      A2 25.29508 18.46762 144.0437
9    P4    H      A2 37.95281 18.41013 142.0585
10   P1    H      A1 15.73546 17.25752 146.2397
11   P4    L      A4 25.89843 14.52130 144.1700
12   P2    H      A1 23.83643 14.07060 144.8877

Via order

data.1[rev(order(data.1$LAT)),]

   Plot Cond Between     Temp      LAT     LONG
3    P3    H      A1 13.64371 20.74986 144.6884
6    P2    M      A2 13.79532 20.38767 145.8359
9    P1    L      A3 27.75781 20.33795 145.7753
7    P3    M      A3 26.87429 20.14244 147.7174
10   P2    L      A4 18.94612 20.06427 144.8924
8    P4    M      A3 29.38325 19.68780 144.7944
11   P3    L      A4 37.11781 18.64913 142.2459
5    P1    M      A2 25.29508 18.46762 144.0437
4    P4    H      A2 37.95281 18.41013 142.0585
1    P1    H      A1 15.73546 17.25752 146.2397
12   P4    L      A4 25.89843 14.52130 144.1700
2    P2    H      A1 23.83643 14.07060 144.8877

Sort the data by Cond and then Temp

	Plot	Cond	Between	Temp	LAT	LONG
3	P3	H	A1	13.64	20.75	144.69
1	P1	H	A1	15.74	17.26	146.24
2	P2	H	A1	23.84	14.07	144.89
4	P4	H	A2	37.95	18.41	142.06
10	P2	L	A4	18.95	20.06	144.89
12	P4	L	A4	25.90	14.52	144.17
9	P1	L	A3	27.76	20.34	145.78
11	P3	L	A4	37.12	18.65	142.25
6	P2	M	A2	13.80	20.39	145.84
5	P1	M	A2	25.30	18.47	144.04
7	P3	M	A3	26.87	20.14	147.72
8	P4	M	A3	29.38	19.69	144.79

Via arrange (package:dplyr)

data.1 %>% arrange(Cond,Temp)

   Plot Cond Between     Temp      LAT     LONG
1    P3    H      A1 13.64371 20.74986 144.6884
2    P1    H      A1 15.73546 17.25752 146.2397
3    P2    H      A1 23.83643 14.07060 144.8877
4    P4    H      A2 37.95281 18.41013 142.0585
5    P2    L      A4 18.94612 20.06427 144.8924
6    P4    L      A4 25.89843 14.52130 144.1700
7    P1    L      A3 27.75781 20.33795 145.7753
8    P3    L      A4 37.11781 18.64913 142.2459
9    P2    M      A2 13.79532 20.38767 145.8359
10   P1    M      A2 25.29508 18.46762 144.0437
11   P3    M      A3 26.87429 20.14244 147.7174
12   P4    M      A3 29.38325 19.68780 144.7944

Via order

data.1[order(data.1$Cond,data.1$Temp),]

   Plot Cond Between     Temp      LAT     LONG
3    P3    H      A1 13.64371 20.74986 144.6884
1    P1    H      A1 15.73546 17.25752 146.2397
2    P2    H      A1 23.83643 14.07060 144.8877
4    P4    H      A2 37.95281 18.41013 142.0585
10   P2    L      A4 18.94612 20.06427 144.8924
12   P4    L      A4 25.89843 14.52130 144.1700
9    P1    L      A3 27.75781 20.33795 145.7753
11   P3    L      A4 37.11781 18.64913 142.2459
6    P2    M      A2 13.79532 20.38767 145.8359
5    P1    M      A2 25.29508 18.46762 144.0437
7    P3    M      A3 26.87429 20.14244 147.7174
8    P4    M      A3 29.38325 19.68780 144.7944

Sort the data by the sum of Temp and LAT

	Plot	Cond	Between	Temp	LAT	LONG
1	P1	H	A1	15.74	17.26	146.24
6	P2	M	A2	13.80	20.39	145.84
3	P3	H	A1	13.64	20.75	144.69
2	P2	H	A1	23.84	14.07	144.89
10	P2	L	A4	18.95	20.06	144.89
12	P4	L	A4	25.90	14.52	144.17
5	P1	M	A2	25.30	18.47	144.04
7	P3	M	A3	26.87	20.14	147.72
9	P1	L	A3	27.76	20.34	145.78
8	P4	M	A3	29.38	19.69	144.79
11	P3	L	A4	37.12	18.65	142.25
4	P4	H	A2	37.95	18.41	142.06

Via arrange (package:dplyr)

data.1 %>% arrange(Temp+LAT)

   Plot Cond Between     Temp      LAT     LONG
1    P1    H      A1 15.73546 17.25752 146.2397
2    P2    M      A2 13.79532 20.38767 145.8359
3    P3    H      A1 13.64371 20.74986 144.6884
4    P2    H      A1 23.83643 14.07060 144.8877
5    P2    L      A4 18.94612 20.06427 144.8924
6    P4    L      A4 25.89843 14.52130 144.1700
7    P1    M      A2 25.29508 18.46762 144.0437
8    P3    M      A3 26.87429 20.14244 147.7174
9    P1    L      A3 27.75781 20.33795 145.7753
10   P4    M      A3 29.38325 19.68780 144.7944
11   P3    L      A4 37.11781 18.64913 142.2459
12   P4    H      A2 37.95281 18.41013 142.0585

Via order

data.1[order(data.1$Temp+data.1$LAT),]

   Plot Cond Between     Temp      LAT     LONG
1    P1    H      A1 15.73546 17.25752 146.2397
6    P2    M      A2 13.79532 20.38767 145.8359
3    P3    H      A1 13.64371 20.74986 144.6884
2    P2    H      A1 23.83643 14.07060 144.8877
10   P2    L      A4 18.94612 20.06427 144.8924
12   P4    L      A4 25.89843 14.52130 144.1700
5    P1    M      A2 25.29508 18.46762 144.0437
7    P3    M      A3 26.87429 20.14244 147.7174
9    P1    L      A3 27.75781 20.33795 145.7753
8    P4    M      A3 29.38325 19.68780 144.7944
11   P3    L      A4 37.11781 18.64913 142.2459
4    P4    H      A2 37.95281 18.41013 142.0585

Re-levelling (sorting) factors

In the previous section, we altered the order of the records in the data set. However, it is important to realize that categorical variables in a dataframe have a property (called levels) that indicates the order of the levels of the factor and that this property is completely independent of the order of records in the data structure. By default, categorical variables (factors) are ordered alphabetically (as I said, irrespective of their order in the data set). The order of factor levels dictates the order in which groups will be plotted in figures and tabulated in tables. Consequently, re-levelling factors is often desirable

An example of an data set (data.2)

	Plot	Cond	Between	Temp
1	P1	H	A1	15.74
2	P2	H	A1	23.84
3	P3	H	A1	13.64
4	P4	H	A2	37.95
5	P1	M	A2	25.30
6	P2	M	A2	13.80
7	P3	M	A3	26.87
8	P4	M	A3	29.38
9	P1	L	A3	27.76
10	P2	L	A4	18.95
11	P3	L	A4	37.12
12	P4	L	A4	25.90

Most commonly this step is performed within a mutate.

Default alphabetical order
View code set.seed(1) data.2 <- expand.grid(Plot=paste("P",1:4,sep=""),Cond=c("H","M","L")) data.2$Cond <- factor(as.character(data.2$Cond)) data.2$Between <- gl(4,3,12,lab=paste("A",1:4,sep="")) data.2$Temp <- rnorm(12,22,10)
[1] "H" "L" "M" levels(data.2$Cond) [1] "H" "L" "M"
Re-levelled into L, M, H
[1] "L" "M" "H" data.2$Cond <- fct_relevel(data.2$Cond, 'L','M','H') levels(data.2$Cond) [1] "L" "M" "H" #OR data.2$Cond <- fct_relevel(data.2$Cond,'L','M','H') View code data.2$Cond <- factor(data.2$Cond, levels=c("L","M","H")) levels(data.2$Cond) [1] "L" "M" "H"

Sometimes it is useful (particularly when preparing data for graphing) to have the order of the levels of a factor defined by another variable - for example to allow groups to be plotted from highest to lowest response etc. This can be achieved via the fct_reorder() and fct_reorder2() functions from the forcats package.

An example of an data set (data.2)

	Plot	Cond	Between	Temp
1	P1	H	A1	15.74
2	P2	H	A1	23.84
3	P3	H	A1	13.64
4	P4	H	A2	37.95
5	P1	M	A2	25.30
6	P2	M	A2	13.80
7	P3	M	A3	26.87
8	P4	M	A3	29.38
9	P1	L	A3	27.76
10	P2	L	A4	18.95
11	P3	L	A4	37.12
12	P4	L	A4	25.90

Most commonly this step is performed within a mutate.

Default alphabetical order
View code set.seed(1) data.2 <- expand.grid(Plot=paste("P",1:4,sep=""),Cond=c("H","M","L")) data.2$Cond <- factor(as.character(data.2$Cond)) data.2$Between <- gl(4,3,12,lab=paste("A",1:4,sep="")) data.2$Temp <- rnorm(12,22,10)
[1] "H" "L" "M" levels(data.2$Cond) [1] "H" "L" "M"
Re-levelled according to Temp means
[1] "H" "M" "L" data.2$Cond <- fct_reorder(data.2$Cond, data.2$Temp, fun=mean) levels(data.2$Cond) [1] "H" "M" "L"

Relabelling a factor

In this case, the L,M,H labels are efficient, yet not as informative as we would perhaps like for labels on a graphic. Instead, we would probably prefer they were something like, "Low", "Medium" and "High"

Give the levels of a factor more informative labels

	Plot	Cond	Between	Temp
1	P1	High	A1	15.74
2	P2	High	A1	23.84
3	P3	High	A1	13.64
4	P4	High	A2	37.95
5	P1	Medium	A2	25.30
6	P2	Medium	A2	13.80
7	P3	Medium	A3	26.87
8	P4	Medium	A3	29.38
9	P1	Low	A3	27.76
10	P2	Low	A4	18.95
11	P3	Low	A4	37.12
12	P4	Low	A4	25.90

Via revalue (package:plyr)

# I will create another data set as I want to retain the original for further 
# demonstrations
data.3 <- data.2
# reorder AND rename the factor levels
data.3$Cond<-fct_recode(data.3$Cond, Low="L", Medium="M", High="H")
data.3

   Plot   Cond Between     Temp
1    P1   High      A1 15.73546
2    P2   High      A1 23.83643
3    P3   High      A1 13.64371
4    P4   High      A2 37.95281
5    P1 Medium      A2 25.29508
6    P2 Medium      A2 13.79532
7    P3 Medium      A3 26.87429
8    P4 Medium      A3 29.38325
9    P1    Low      A3 27.75781
10   P2    Low      A4 18.94612
11   P3    Low      A4 37.11781
12   P4    Low      A4 25.89843

levels(data.3$Cond)

[1] "High"   "Medium" "Low"

Via factor

# I will create another data set as I want to retain the original for further 
# demonstrations
data.3 <- data.2
# reorder AND rename the factor levels
data.3$Cond <- factor(data.3$Cond,levels=c("L","M","H"),lab=c("Low","Medium","High"))
data.3

   Plot   Cond Between     Temp
1    P1   High      A1 15.73546
2    P2   High      A1 23.83643
3    P3   High      A1 13.64371
4    P4   High      A2 37.95281
5    P1 Medium      A2 25.29508
6    P2 Medium      A2 13.79532
7    P3 Medium      A3 26.87429
8    P4 Medium      A3 29.38325
9    P1    Low      A3 27.75781
10   P2    Low      A4 18.94612
11   P3    Low      A4 37.11781
12   P4    Low      A4 25.89843

levels(data.3$Cond)

[1] "Low"    "Medium" "High"

Renaming columns (variables)

Data can be sourced from a variety of locations and often the column (=variable or field names) are either non-informative, very long or otherwise not convenient. It is possible to rename any or all of the column names of a data frame (or matrix).

Rename the column names (variables)

	Plot	Condition	Treatment	Temperature
1	P1	H	A1	15.74
2	P2	H	A1	23.84
3	P3	H	A1	13.64
4	P4	H	A2	37.95
5	P1	M	A2	25.30
6	P2	M	A2	13.80
7	P3	M	A3	26.87
8	P4	M	A3	29.38
9	P1	L	A3	27.76
10	P2	L	A4	18.95
11	P3	L	A4	37.12
12	P4	L	A4	25.90

Via rename (package:dplyr)

Note that it is not necessary to rename all the columns via this routine.

data.2 %>% rename(Temperature=Temp,Condition=Cond, Treatment=Between)

   Plot Condition Treatment Temperature
1    P1         H        A1    15.73546
2    P2         H        A1    23.83643
3    P3         H        A1    13.64371
4    P4         H        A2    37.95281
5    P1         M        A2    25.29508
6    P2         M        A2    13.79532
7    P3         M        A3    26.87429
8    P4         M        A3    29.38325
9    P1         L        A3    27.75781
10   P2         L        A4    18.94612
11   P3         L        A4    37.11781
12   P4         L        A4    25.89843

Via colnames

# again restore the data from the original data.2
data.3 <- data.2
colnames(data.3) <- c("Plot","Condition","Treatment","Temperature")
data.3

   Plot Condition Treatment Temperature
1    P1         H        A1    15.73546
2    P2         H        A1    23.83643
3    P3         H        A1    13.64371
4    P4         H        A2    37.95281
5    P1         M        A2    25.29508
6    P2         M        A2    13.79532
7    P3         M        A3    26.87429
8    P4         M        A3    29.38325
9    P1         L        A3    27.75781
10   P2         L        A4    18.94612
11   P3         L        A4    37.11781
12   P4         L        A4    25.89843

Subsetting

We regularly want to run an analysis or generate a graphic for a sub-set of the data. Take for example the data used here. We may, for example, wish to explore a subset of the data for which Cond is "H". Warning, following any subsetting, it is crucial that you then instruct R to redefine the factor levels of any factors in the subsetted data set. When subsetting a dataset, all factors simply inherit the original levels property of the original factors, and therefore the new factors potentially define more factor levels than are in the actually present in the subsetted data set. This, along with the solution is illustrated below.

Another form of subsetting is where we wish to reduce the number of columns of the data frame to remove excessive and unwanted data fields. The two forms of subsetting are thus:

reducing the number of rows - filtering
reducing the number of columns - selecting

can be performed simultaneously.

Filtering (subsetting by rows)

Filtering selects rows for which a condition is evaluated to be TRUE. Hence, any logical expression or vectorized function that returns Boolean values (TRUE or FALSE) can be used for filtering.

An example of a data set (data.2)

	Plot	Cond	Between	Temp
1	P1	H	A1	15.74
2	P2	H	A1	23.84
3	P3	H	A1	13.64
4	P4	H	A2	37.95
5	P1	M	A2	25.30
6	P2	M	A2	13.80
7	P3	M	A3	26.87
8	P4	M	A3	29.38
9	P1	L	A3	27.76
10	P2	L	A4	18.95
11	P3	L	A4	37.12
12	P4	L	A4	25.90

View code

set.seed(1)
data.2 <- expand.grid(Plot=paste("P",1:4,sep=""),Cond=c("H","M","L"))
data.2$Cond <- factor(as.character(data.2$Cond))
data.2$Between <- gl(4,3,12,lab=paste("A",1:4,sep=""))
data.2$Temp <- rnorm(12,22,10)

Keep only the observations where Cond is H

	Plot	Cond	Between	Temp
1	P1	H	A1	15.74
2	P2	H	A1	23.84
3	P3	H	A1	13.64
4	P4	H	A2	37.95

Via filter (package:dplyr)

data.2 %>% filter(Cond=="H")

  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P4    H      A2 37.95281

Via subset

subset(data.2,Cond=="H")

  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P4    H      A2 37.95281

Via indexing

data.2[data.2$Cond=="H",]

  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P4    H      A2 37.95281

Keep only the observations for which Temp is between 15 and 20

	Plot	Cond	Between	Temp
1	P1	H	A1	15.74
2	P2	L	A4	18.95

Via filter (package:dplyr)

data.2 %>% filter(between(Temp,15,20))

  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    L      A4 18.94612

Via subset

subset(data.2, Temp>15 & Temp<20)

   Plot Cond Between     Temp
1    P1    H      A1 15.73546
10   P2    L      A4 18.94612

Via indexing

data.2[data.2$Temp>15 & data.2$Temp<20,]

   Plot Cond Between     Temp
1    P1    H      A1 15.73546
10   P2    L      A4 18.94612

Keep only the observations after Temp has surpassed 25

	Plot	Cond	Between	Temp
1	P4	H	A2	37.95
2	P1	M	A2	25.30
3	P2	M	A2	13.80
4	P3	M	A3	26.87
5	P4	M	A3	29.38
6	P1	L	A3	27.76
7	P2	L	A4	18.95
8	P3	L	A4	37.12
9	P4	L	A4	25.90

Via filter (package:dplyr)

data.2 %>% filter(cumany(Temp>25))

  Plot Cond Between     Temp
1   P4    H      A2 37.95281
2   P1    M      A2 25.29508
3   P2    M      A2 13.79532
4   P3    M      A3 26.87429
5   P4    M      A3 29.38325
6   P1    L      A3 27.75781
7   P2    L      A4 18.94612
8   P3    L      A4 37.11781
9   P4    L      A4 25.89843

Keep only the observations where Cond is either H or M

	Plot	Cond	Between	Temp
1	P1	H	A1	15.74
2	P2	H	A1	23.84
3	P3	H	A1	13.64
4	P4	H	A2	37.95
5	P1	M	A2	25.30
6	P2	M	A2	13.80
7	P3	M	A3	26.87
8	P4	M	A3	29.38

Via filter (package:dplyr)

filter(data.2,Cond %in% c("H","M"))
#OR
filter(data.2,Cond=="H" | Cond=="M")

  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P4    H      A2 37.95281
5   P1    M      A2 25.29508
6   P2    M      A2 13.79532
7   P3    M      A3 26.87429
8   P4    M      A3 29.38325

Via subset

subset(data.2,Cond %in% c("H","M"))

  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P4    H      A2 37.95281
5   P1    M      A2 25.29508
6   P2    M      A2 13.79532
7   P3    M      A3 26.87429
8   P4    M      A3 29.38325

Via indexing

data.2[data.2$Cond %in% c("H","M"),]

  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P4    H      A2 37.95281
5   P1    M      A2 25.29508
6   P2    M      A2 13.79532
7   P3    M      A3 26.87429
8   P4    M      A3 29.38325

Keep only the observations with H OR M condition AND Temp < 30

	Plot	Cond	Between	Temp
1	P1	H	A1	15.74
2	P2	H	A1	23.84
3	P3	H	A1	13.64
5	P1	M	A2	25.30
6	P2	M	A2	13.80
7	P3	M	A3	26.87
8	P4	M	A3	29.38

Via filter (package:dplyr)

filter(data.2,Cond %in% c("H","M") & Temp<30)
#OR
filter(data.2,Cond %in% c("H","M"), Temp<30)

  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P1    M      A2 25.29508
5   P2    M      A2 13.79532
6   P3    M      A3 26.87429
7   P4    M      A3 29.38325

Via subset

subset(data.2,Cond %in% c("H","M") & Temp<30)

  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
5   P1    M      A2 25.29508
6   P2    M      A2 13.79532
7   P3    M      A3 26.87429
8   P4    M      A3 29.38325

Via indexing

#Note, the following will yield unexpected results when there are missing values!
data.2[data.2$Cond %in% c("H","M") & data.2$Temp<30,]

  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
5   P1    M      A2 25.29508
6   P2    M      A2 13.79532
7   P3    M      A3 26.87429
8   P4    M      A3 29.38325

Keep only the observations with Cond of H and Temp < 30 OR Cond of M and Temp < 26

Such a statement could be interpreted a number of ways (some of which are listed below):

(Cond=="H" & Temp<30) | (Cond=='M' & Temp<26)
(Cond=="H") & (Temp<30 | Cond=='M') & Temp<26)
(Cond=="H") & (Temp<30 | Cond=='M' & Temp<26)
(Cond=="H" & (Temp<30 | Cond=='M')) & Temp<26
...

Lets assume the first of these...

	Plot	Cond	Between	Temp
1	P1	H	A1	15.74
2	P2	H	A1	23.84
3	P3	H	A1	13.64
4	P1	M	A2	25.30
5	P2	M	A2	13.80

Via filter (package:dplyr)

filter(data.2,(Cond=="H" & Temp<30) | (Cond=='M' & Temp<26))

  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P1    M      A2 25.29508
5   P2    M      A2 13.79532

Via subset

subset(data.2,(Cond=="H" & Temp<30) | (Cond=='M' & Temp<26))

  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
5   P1    M      A2 25.29508
6   P2    M      A2 13.79532

Via indexing

#Note, the following will yield unexpected results when there are missing values!
data.2[(data.2$Cond=="H" & data.2$Temp<30) | (data.2$Cond=='M' & data.2$Temp<26),]

  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
5   P1    M      A2 25.29508
6   P2    M      A2 13.79532

Effect of subsetting on factor levels

#examine the levels of the Cond factor
levels(data.2$Cond)

[1] "H" "L" "M"

#subset the dataset to just Cond H
data.3<-filter(data.2,Cond=="H")
#examine subset data
data.3

  Plot Cond Between     Temp
1   P1    H      A1 15.73546
2   P2    H      A1 23.83643
3   P3    H      A1 13.64371
4   P4    H      A2 37.95281

#examine the levels of the Cond factor
levels(data.3$Cond)

[1] "H" "L" "M"

levels(data.3$Between)

[1] "A1" "A2" "A3" "A4"

Notice that although we created a subset of the data that only includes cases where Cond is "H", each of the factors have retained their original level properties.

Correcting factor levels of all factors after subsetting

#subset the dataset to just Cond H
data.3<-filter(data.2,Cond=="H")
#drop the unused factor levels from all factors
data.3<-droplevels(data.3)
#examine the levels of each factor
levels(data.3$Cond)

[1] "H"

levels(data.3$Between)

[1] "A1" "A2"

Notice this time, the levels of each factor reflect the subset data.

Correcting factor levels of only a single factor after subsetting

#subset the dataset to just Cond H
data.3<-filter(data.2,Cond=="H")
#drop the unused factor levels from Cond
data.3$Cond<-fct_drop(data.3$Cond)
#OR
data.3$Cond<-factor(data.3$Cond)
#examine the levels of each factor
levels(data.3$Cond)

[1] "H"

levels(data.3$Between)

[1] "A1" "A2" "A3" "A4"

Notice this time, the levels of the Cond factor reflect the subset data, whereas the levels of the Between factor reflect the original dataset.

Selecting (subsetting by columns)

Selecting works by either including (or excluding) the column names that you indicate or by special 'Helper' functions that pass a vector of column indices to include in the subset data.

When selecting with the select function from the dplyr package, there are a number of "Helper" functions that provide convenient ways to select columns:

contains(""), columns whose name match the character string (not case sensitive by default)
starts_with(""), columns whose names begin with the character string (not case sensitive by default)
ends_with(""), columns whose names end with the character string (not case sensitive by default)
one_of(c("","")), columns whose names are amongst the character strings
matches(""), columns whose names are matched by the character string of a regular expression (not case sensitive by default)
everything(), all columns. This is useful in combination with other specifiers
num_range("",), select all columns whose names start with the first argument (a character string) followed by a number in the integer sequence provided as the second argument

Selections are negated by prepending with a '-' and columns are ordered according to the order that they are selected.

An example of a data set (data)

	Resp1	Resp2	Between	Plot	Subplot	Within
1	8	17	A1	P1	S1	B1
2	10	18	A1	P1	S1	B2
3	7	17	A1	P1	S2	B1
4	11	21	A1	P1	S2	B2
5	14	19	A2	P2	S3	B1
6	12	13	A2	P2	S3	B2
7	11	24	A2	P2	S4	B1
8	9	18	A2	P2	S4	B2
9	14	25	A3	P3	S5	B1
10	11	18	A3	P3	S5	B2
11	8	27	A3	P3	S6	B1
12	2	22	A3	P3	S6	B2
13	8	17	A1	P4	S7	B1
14	10	22	A1	P4	S7	B2
15	7	16	A1	P4	S8	B1
16	12	13	A1	P4	S8	B2
17	11	23	A2	P5	S9	B1
18	12	19	A2	P5	S9	B2
19	12	23	A2	P5	S10	B1
20	10	21	A2	P5	S10	B2
21	3	17	A3	P6	S11	B1
22	11	16	A3	P6	S11	B2
23	13	26	A3	P6	S12	B1
24	7	28	A3	P6	S12	B2

View code

set.seed(1)
data <- expand.grid(Within=paste("B",1:2,sep=""),
                    Subplot=paste("S",1:2,sep=""),
                    Plot=paste("P",1:6,sep=""))
data$Subplot <- gl(12,2,24,lab=paste("S",1:12,sep=""))
data$Between <- gl(3,4,24,lab=paste("A",1:3,sep=""))
data$Resp1 <- rpois(24,10)
data$Resp2 <- rpois(24,20)
data <- with(data,data.frame(Resp1,Resp2,Between,Plot,Subplot,Within))

Keep only the Plot and Resp1 columns

	Plot	Resp1
1	P1	8
2	P1	10
3	P1	7
4	P1	11
5	P2	14
6	P2	12
7	P2	11
8	P2	9
9	P3	14
10	P3	11
11	P3	8
12	P3	2
13	P4	8
14	P4	10
15	P4	7
16	P4	12
17	P5	11
18	P5	12
19	P5	12
20	P5	10
21	P6	3
22	P6	11
23	P6	13
24	P6	7

Via select (package:dplyr)

data %>% select(Plot,Resp1)
#OR
data %>% select(starts_with('Pl'),ends_with("p1"))
#OR
data %>% select(contains('Plo', ignore.case=FALSE),matches(".*p1"))
#OR
data %>% select(-matches('w.*n'),-starts_with("Sub"),-matches('Resp'),matches('Resp1'))
#OR
data %>% select(one_of(c('Plot','Resp1')))

   Plot Resp1
1    P1     8
2    P1    10
3    P1     7
4    P1    11
5    P2    14
6    P2    12
7    P2    11
8    P2     9
9    P3    14
10   P3    11
11   P3     8
12   P3     2
13   P4     8
14   P4    10
15   P4     7
16   P4    12
17   P5    11
18   P5    12
19   P5    12
20   P5    10
21   P6     3
22   P6    11
23   P6    13
24   P6     7

Via subset

subset(data,select=c(Plot,Resp1))

   Plot Resp1
1    P1     8
2    P1    10
3    P1     7
4    P1    11
5    P2    14
6    P2    12
7    P2    11
8    P2     9
9    P3    14
10   P3    11
11   P3     8
12   P3     2
13   P4     8
14   P4    10
15   P4     7
16   P4    12
17   P5    11
18   P5    12
19   P5    12
20   P5    10
21   P6     3
22   P6    11
23   P6    13
24   P6     7

#OR
subset(data,select=c("Plot","Resp1"))

   Plot Resp1
1    P1     8
2    P1    10
3    P1     7
4    P1    11
5    P2    14
6    P2    12
7    P2    11
8    P2     9
9    P3    14
10   P3    11
11   P3     8
12   P3     2
13   P4     8
14   P4    10
15   P4     7
16   P4    12
17   P5    11
18   P5    12
19   P5    12
20   P5    10
21   P6     3
22   P6    11
23   P6    13
24   P6     7

Via indexing

data[,c("Plot","Resp1")]

   Plot Resp1
1    P1     8
2    P1    10
3    P1     7
4    P1    11
5    P2    14
6    P2    12
7    P2    11
8    P2     9
9    P3    14
10   P3    11
11   P3     8
12   P3     2
13   P4     8
14   P4    10
15   P4     7
16   P4    12
17   P5    11
18   P5    12
19   P5    12
20   P5    10
21   P6     3
22   P6    11
23   P6    13
24   P6     7

Keep the Plot/Subplot and Resp columns only

	Plot	Subplot	Resp1	Resp2
1	P1	S1	8	17
2	P1	S1	10	18
3	P1	S2	7	17
4	P1	S2	11	21
5	P2	S3	14	19
6	P2	S3	12	13
7	P2	S4	11	24
8	P2	S4	9	18
9	P3	S5	14	25
10	P3	S5	11	18
11	P3	S6	8	27
12	P3	S6	2	22
13	P4	S7	8	17
14	P4	S7	10	22
15	P4	S8	7	16
16	P4	S8	12	13
17	P5	S9	11	23
18	P5	S9	12	19
19	P5	S10	12	23
20	P5	S10	10	21
21	P6	S11	3	17
22	P6	S11	11	16
23	P6	S12	13	26
24	P6	S12	7	28

Via select (package:dplyr)

data %>% select(Plot,Subplot,Resp1,Resp2)
#OR
data %>% select(contains('plot'), matches('^Resp.*'))
#OR
data %>% select(matches('plot$'), num_range('Resp',1:2))
#OR
data %>% select(contains('plot'),everything(),-matches('w.*n'))

   Plot Subplot Resp1 Resp2
1    P1      S1     8    17
2    P1      S1    10    18
3    P1      S2     7    17
4    P1      S2    11    21
5    P2      S3    14    19
6    P2      S3    12    13
7    P2      S4    11    24
8    P2      S4     9    18
9    P3      S5    14    25
10   P3      S5    11    18
11   P3      S6     8    27
12   P3      S6     2    22
13   P4      S7     8    17
14   P4      S7    10    22
15   P4      S8     7    16
16   P4      S8    12    13
17   P5      S9    11    23
18   P5      S9    12    19
19   P5     S10    12    23
20   P5     S10    10    21
21   P6     S11     3    17
22   P6     S11    11    16
23   P6     S12    13    26
24   P6     S12     7    28

Via subset

subset(data,select=c(Plot,Subplot,Resp1,Resp2))

   Plot Subplot Resp1 Resp2
1    P1      S1     8    17
2    P1      S1    10    18
3    P1      S2     7    17
4    P1      S2    11    21
5    P2      S3    14    19
6    P2      S3    12    13
7    P2      S4    11    24
8    P2      S4     9    18
9    P3      S5    14    25
10   P3      S5    11    18
11   P3      S6     8    27
12   P3      S6     2    22
13   P4      S7     8    17
14   P4      S7    10    22
15   P4      S8     7    16
16   P4      S8    12    13
17   P5      S9    11    23
18   P5      S9    12    19
19   P5     S10    12    23
20   P5     S10    10    21
21   P6     S11     3    17
22   P6     S11    11    16
23   P6     S12    13    26
24   P6     S12     7    28

#OR
subset(data,select=c("Plot","Subplot","Resp1","Resp2"))

   Plot Subplot Resp1 Resp2
1    P1      S1     8    17
2    P1      S1    10    18
3    P1      S2     7    17
4    P1      S2    11    21
5    P2      S3    14    19
6    P2      S3    12    13
7    P2      S4    11    24
8    P2      S4     9    18
9    P3      S5    14    25
10   P3      S5    11    18
11   P3      S6     8    27
12   P3      S6     2    22
13   P4      S7     8    17
14   P4      S7    10    22
15   P4      S8     7    16
16   P4      S8    12    13
17   P5      S9    11    23
18   P5      S9    12    19
19   P5     S10    12    23
20   P5     S10    10    21
21   P6     S11     3    17
22   P6     S11    11    16
23   P6     S12    13    26
24   P6     S12     7    28

Via indexing

data[,c("Plot","Subplot","Resp1","Resp2")]

   Plot Subplot Resp1 Resp2
1    P1      S1     8    17
2    P1      S1    10    18
3    P1      S2     7    17
4    P1      S2    11    21
5    P2      S3    14    19
6    P2      S3    12    13
7    P2      S4    11    24
8    P2      S4     9    18
9    P3      S5    14    25
10   P3      S5    11    18
11   P3      S6     8    27
12   P3      S6     2    22
13   P4      S7     8    17
14   P4      S7    10    22
15   P4      S8     7    16
16   P4      S8    12    13
17   P5      S9    11    23
18   P5      S9    12    19
19   P5     S10    12    23
20   P5     S10    10    21
21   P6     S11     3    17
22   P6     S11    11    16
23   P6     S12    13    26
24   P6     S12     7    28

`select_if`

There is also a special form of select that selects if a condition is satisfied.

Keep only the numeric variables

	Resp1	Resp2
1	8	17
2	10	18
3	7	17
4	11	21
5	14	19
6	12	13
7	11	24
8	9	18
9	14	25
10	11	18
11	8	27
12	2	22
13	8	17
14	10	22
15	7	16
16	12	13
17	11	23
18	12	19
19	12	23
20	10	21
21	3	17
22	11	16
23	13	26
24	7	28

Via select (package:dplyr)

data %>% select_if(is.numeric)

`filter` and `select`

Recall that one of the advantages of working within the grammar of data manipulation framework is that data can be piped from one tool to another to achieve multiple manipulations.

Keep only the Plot and Temp columns and observations with H OR M condition AND Temp < 30

	Plot	Temp
1	P1	15.74
2	P2	23.84
3	P3	13.64
5	P1	25.30
6	P2	13.80
7	P3	26.87
8	P4	29.38

Via filter and select (package:dplyr)

data.2 %>% filter(Cond %in% c("H","M") & Temp<30) %>% select(Plot, Temp)

  Plot     Temp
1   P1 15.73546
2   P2 23.83643
3   P3 13.64371
4   P1 25.29508
5   P2 13.79532
6   P3 26.87429
7   P4 29.38325

Via subset

subset(data.2,Cond %in% c("H","M") & Temp<30, select=c(Plot,Temp))

  Plot     Temp
1   P1 15.73546
2   P2 23.83643
3   P3 13.64371
5   P1 25.29508
6   P2 13.79532
7   P3 26.87429
8   P4 29.38325

#OR
subset(data.2,Cond %in% c("H","M") & Temp<30, select=c(-Cond,-Between))

  Plot     Temp
1   P1 15.73546
2   P2 23.83643
3   P3 13.64371
5   P1 25.29508
6   P2 13.79532
7   P3 26.87429
8   P4 29.38325

Via indexing

#Note, the following will yield unexpected results when there are missing values!
data.2[data.2$Cond %in% c("H","M") & data.2$Temp<30, c("Plot","Temp")]

  Plot     Temp
1   P1 15.73546
2   P2 23.83643
3   P3 13.64371
5   P1 25.29508
6   P2 13.79532
7   P3 26.87429
8   P4 29.38325

Adding new variables (mutating)

To add to (mutate) a data set, a vectorized function is applied either once or across one or more existing columns.

This section will make use of the following data set:

A very simple dataset (data.s)

	Between	Plot	Resp1	Resp2
1	A1	P1	8	13
2	A1	P2	10	22
3	A2	P3	7	23
4	A2	P4	11	22

View code

set.seed(1)
data.s <- expand.grid(Plot=paste("P",1:4,sep=""))
data.s$Between <- gl(2,2,4,lab=paste("A",1:2,sep=""))
data.s <- with(data.s,data.frame(Between,Plot,Resp1=rpois(4,10), Resp2=rpois(4,20)))

Whether transforming/scaling existing variables or generating new variables derived from existing variables it is advisable that existing variables be retained (unless the data are huge and storage is tight). As with other routines featured in this tutorial, there are numerous ways to derive new variables from existing variables. The main options;

manually add new columns to a dataframe

data.s$logResp1 <- log(data.s$Resp1)
data.s$logResp2 <- log(data.s$Resp2)
data.s

  Between Plot Resp1 Resp2 logResp1 logResp2
1      A1   P1     8    13 2.079442 2.564949
2      A1   P2    10    22 2.302585 3.091042
3      A2   P3     7    23 1.945910 3.135494
4      A2   P4    11    22 2.397895 3.091042

the overloaded transform function

transform(data.s, logResp1=log(Resp1), logRes2=log(Resp2))

  Between Plot Resp1 Resp2 logResp1  logRes2
1      A1   P1     8    13 2.079442 2.564949
2      A1   P2    10    22 2.302585 3.091042
3      A2   P3     7    23 1.945910 3.135494
4      A2   P4    11    22 2.397895 3.091042

data.s

  Between Plot Resp1 Resp2
1      A1   P1     8    13
2      A1   P2    10    22
3      A2   P3     7    23
4      A2   P4    11    22

the mutate function

data.s %>% mutate(logResp1=log(Resp1), logRes2=log(Resp2))

  Between Plot Resp1 Resp2 logResp1  logRes2
1      A1   P1     8    13 2.079442 2.564949
2      A1   P2    10    22 2.302585 3.091042
3      A2   P3     7    23 1.945910 3.135494
4      A2   P4    11    22 2.397895 3.091042

data.s

  Between Plot Resp1 Resp2
1      A1   P1     8    13
2      A1   P2    10    22
3      A2   P3     7    23
4      A2   P4    11    22

The difference between transform and mutate is that mutate works with each column sequentially and therefore can derive new columns using the columns created in earlier iterations.

The main features of each of the above functions are compared and contrasted in the following table.

	Transforms at a time	Returns	Notes
manually	Single	Single vectors
`transform`	Multiple	New dataframe	Somewhat similar to `within()`.
`mutate`	Multiple	New dataframe	Works with each column sequentially and therefore can derive new columns using the columns created in earlier iterations.

Scale (log) transform a single variable - Resp1

	Between	Plot	Resp1	Resp2	logResp1
1	A1	P1	8	13	2.08
2	A1	P2	10	22	2.30
3	A2	P3	7	23	1.95
4	A2	P4	11	22	2.40

Via mutate

data.s %>% mutate(logResp1=log(Resp1))

  Between Plot Resp1 Resp2 logResp1
1      A1   P1     8    13 2.079442
2      A1   P2    10    22 2.302585
3      A2   P3     7    23 1.945910
4      A2   P4    11    22 2.397895

Via transform

transform(data.s,logResp1=log(Resp1))

  Between Plot Resp1 Resp2 logResp1
1      A1   P1     8    13 2.079442
2      A1   P2    10    22 2.302585
3      A2   P3     7    23 1.945910
4      A2   P4    11    22 2.397895

Scale (log) transform a multiple variables (Resp1 and Resp2) and then calculate the difference

	Between	Plot	Resp1	Resp2	logResp1	logResp2	Diff
1	A1	P1	8	13	2.08	2.56	0.49
2	A1	P2	10	22	2.30	3.09	0.79
3	A2	P3	7	23	1.95	3.14	1.19
4	A2	P4	11	22	2.40	3.09	0.69

Via mutate

data.s %>% mutate(logResp1=log(Resp1), logResp2=log(Resp2), Diff=logResp2-logResp1)

  Between Plot Resp1 Resp2 logResp1 logResp2      Diff
1      A1   P1     8    13 2.079442 2.564949 0.4855078
2      A1   P2    10    22 2.302585 3.091042 0.7884574
3      A2   P3     7    23 1.945910 3.135494 1.1895841
4      A2   P4    11    22 2.397895 3.091042 0.6931472

Via transform

tmp=transform(data.s,logResp1=log(Resp1), logResp2=log(Resp2))
transform(tmp, Diff=logResp2-logResp1)

  Between Plot Resp1 Resp2 logResp1 logResp2      Diff
1      A1   P1     8    13 2.079442 2.564949 0.4855078
2      A1   P2    10    22 2.302585 3.091042 0.7884574
3      A2   P3     7    23 1.945910 3.135494 1.1895841
4      A2   P4    11    22 2.397895 3.091042 0.6931472

Vectorized functions

Recall that any function that returns a either a single value or a vector of length equal to the number of rows in the data can be used. The dplyr package also comes with a number of "Vectorized" functions for performing common data manipulation routines:

lead and lag

data.s %>% mutate(leadResp1=lead(Resp1), lagResp1=lag(Resp1))

  Between Plot Resp1 Resp2 leadResp1 lagResp1
1      A1   P1     8    13        10       NA
2      A1   P2    10    22         7        8
3      A2   P3     7    23        11       10
4      A2   P4    11    22        NA        7

dense_rank, min_rank and percent_rank

mutate(data.s, rankResp2=rank(Resp2), drnkResp2=dense_rank(Resp2), mrnkResp2=min_rank(Resp2),
    prnkResp2=percent_rank(Resp2))

  Between Plot Resp1 Resp2 rankResp2 drnkResp2 mrnkResp2 prnkResp2
1      A1   P1     8    13       1.0         1         1 0.0000000
2      A1   P2    10    22       2.5         2         2 0.3333333
3      A2   P3     7    23       4.0         3         4 1.0000000
4      A2   P4    11    22       2.5         2         2 0.3333333

ntile and between

mutate(data.s, tileResp1=ntile(Resp1,3), btwnResp1=between(Resp1,9,12))

  Between Plot Resp1 Resp2 tileResp1 btwnResp1
1      A1   P1     8    13         1     FALSE
2      A1   P2    10    22         2      TRUE
3      A2   P3     7    23         1     FALSE
4      A2   P4    11    22         3      TRUE

cume_dist, cumall, cumany, cummean

## Cummulative distribution 
data.s %>% mutate(csResp1=cumsum(Resp1), cdResp1=cume_dist(Resp1))

  Between Plot Resp1 Resp2 csResp1 cdResp1
1      A1   P1     8    13       8    0.50
2      A1   P2    10    22      18    0.75
3      A2   P3     7    23      25    0.25
4      A2   P4    11    22      36    1.00

## Predicates of cummulatives. 
## Accumulate only those that have a Resp1 > 9
## Accumulate only if all have a Resp1 > 7 
data.s %>% mutate(csResp1=cumsum(Resp1), cnyResp1=cumany(Resp1>9), callResp1=cumall(Resp1>7))

  Between Plot Resp1 Resp2 csResp1 cnyResp1 callResp1
1      A1   P1     8    13       8    FALSE      TRUE
2      A1   P2    10    22      18     TRUE      TRUE
3      A2   P3     7    23      25     TRUE     FALSE
4      A2   P4    11    22      36     TRUE     FALSE

## The above window functions are useful for filtering
data.s %>% filter(cumall(Resp1>7))

  Between Plot Resp1 Resp2
1      A1   P1     8    13
2      A1   P2    10    22

## Cumulative product and mean
data.s %>% mutate(cpResp1=cumprod(Resp1), cmResp1=cummean(Resp1))

  Between Plot Resp1 Resp2 cpResp1  cmResp1
1      A1   P1     8    13       8 8.000000
2      A1   P2    10    22      80 9.000000
3      A2   P3     7    23     560 8.333333
4      A2   P4    11    22    6160 9.000000

## Cumulative min and max 
data.s %>% mutate(cminResp1=cummin(Resp1), cmaxResp1=cummax(Resp1))

  Between Plot Resp1 Resp2 cminResp1 cmaxResp1
1      A1   P1     8    13         8         8
2      A1   P2    10    22         8        10
3      A2   P3     7    23         7        10
4      A2   P4    11    22         7        11

The extended `mutate_` family

In addition to the mutate() function, the dplyr package also has a number of alternative mutate_ functons that provide convenient ways to select which columns to apply functions to.

mutate_all(), apply function(s) to all columns
mutate_at(), apply function(s) to selected columns
mutate_if(), apply function(s) to columns that satisfy a specific condition

Scale (log) transform multiple variables - Resp1 and Resp2

	Between	Plot	Resp1	Resp2	Resp1_log	Resp2_log
1	A1	P1	8	13	2.08	2.56
2	A1	P2	10	22	2.30	3.09
3	A2	P3	7	23	1.95	3.14
4	A2	P4	11	22	2.40	3.09

Via mutate

data.s %>% mutate(Resp1_log=log(Resp1),Resp2_log=log(Resp2))
#OR to account for any number of transformations
data.s %>% mutate_at(vars(Resp1,Resp2), funs(log=log))
#OR
data.s %>% mutate_at(vars(contains("Resp")), funs(log=log))
#OR
data.s %>% mutate_if(is.numeric, funs(log=log))

  Between Plot Resp1 Resp2 Resp1_log Resp2_log
1      A1   P1     8    13  2.079442  2.564949
2      A1   P2    10    22  2.302585  3.091042
3      A2   P3     7    23  1.945910  3.135494
4      A2   P4    11    22  2.397895  3.091042

Via transform

transform(data.s,Resp1_log=log(Resp1),Resp2_log=log(Resp2))

  Between Plot Resp1 Resp2 Resp1_log Resp2_log
1      A1   P1     8    13  2.079442  2.564949
2      A1   P2    10    22  2.302585  3.091042
3      A2   P3     7    23  1.945910  3.135494
4      A2   P4    11    22  2.397895  3.091042

Apply a summary function to convert all character vectors into lowercase

	Between	Plot	Resp1	Resp2
1	a1	p1	8	13
2	a1	p2	10	22
3	a2	p3	7	23
4	a2	p4	11	22

Via mutate

mutate_if(data.s, is.character, stringr:::str_to_lower)

  Between Plot Resp1 Resp2
1      a1   p1     8    13
2      a1   p2    10    22
3      a2   p3     7    23
4      a2   p4    11    22

Applying multiple functions to multiple variables

	Between	Plot	Resp1	Resp2	Resp1_log	Resp2_log	Resp1_c	Resp2_c
1	A1	P1	8	13	2.08	2.56	-1.00	-7.00
2	A1	P2	10	22	2.30	3.09	1.00	2.00
3	A2	P3	7	23	1.95	3.14	-2.00	3.00
4	A2	P4	11	22	2.40	3.09	2.00	2.00

Via mutate

mutate_at(data.s, vars(contains('Resp')), funs(log=log, c=scale(.,scale=FALSE)))
#OR
mutate_at(data.s, vars(Resp1,Resp2), funs(log=log, c=.-mean(.)))

  Between Plot Resp1 Resp2 Resp1_log Resp2_log Resp1_c Resp2_c
1      A1   P1     8    13  2.079442  2.564949      -1      -7
2      A1   P2    10    22  2.302585  3.091042       1       2
3      A2   P3     7    23  1.945910  3.135494      -2       3
4      A2   P4    11    22  2.397895  3.091042       2       2

Apply a summary function to multiple variables - Resp1 and Resp2

	Between	Plot	Resp1	Resp2	Resp1_mean	Resp2_mean
1	A1	P1	8	13	9.00	20.00
2	A1	P2	10	22	9.00	20.00
3	A2	P3	7	23	9.00	20.00
4	A2	P4	11	22	9.00	20.00

Via mutate

mutate(data.s,Resp1_mean=mean(Resp1, na.rm=TRUE),Resp2_mean=mean(Resp2, na.rm=TRUE))
#OR
mutate_at(data.s, vars(Resp1,Resp2),funs(mean=mean(.,na.rm=TRUE)))

  Between Plot Resp1 Resp2 Resp1_mean Resp2_mean
1      A1   P1     8    13          9         20
2      A1   P2    10    22          9         20
3      A2   P3     7    23          9         20
4      A2   P4    11    22          9         20

Via transform

transform(data.s,mutate(data.s,Resp1_mean=mean(Resp1, na.rm=TRUE),Resp2_mean=mean(Resp2, na.rm=TRUE)))

  Between Plot Resp1 Resp2
1      A1   P1     8    13
2      A1   P2    10    22
3      A2   P3     7    23
4      A2   P4    11    22

Summarizing (aggregating) data

Data are often collected and transcribed at finer temporal/spatial scales and with greater fidelity than is required for all analyses. Therefore an important phase of data preparation is also to summarize the data into the spatial/temporal scales appropriate for the desired graphical and statistical analyses.

Summarizing involves applying one or more summarizing functions to one or more variables.

An example of a data set (data)

	Resp1	Resp2	Between	Plot	Subplot	Within
1	8	17	A1	P1	S1	B1
2	10	18	A1	P1	S1	B2
3	7	17	A1	P1	S2	B1
4	11	21	A1	P1	S2	B2
5	14	19	A2	P2	S3	B1
6	12	13	A2	P2	S3	B2
7	11	24	A2	P2	S4	B1
8	9	18	A2	P2	S4	B2
9	14	25	A3	P3	S5	B1
10	11	18	A3	P3	S5	B2
11	8	27	A3	P3	S6	B1
12	2	22	A3	P3	S6	B2
13	8	17	A1	P4	S7	B1
14	10	22	A1	P4	S7	B2
15	7	16	A1	P4	S8	B1
16	12	13	A1	P4	S8	B2
17	11	23	A2	P5	S9	B1
18	12	19	A2	P5	S9	B2
19	12	23	A2	P5	S10	B1
20	10	21	A2	P5	S10	B2
21	3	17	A3	P6	S11	B1
22	11	16	A3	P6	S11	B2
23	13	26	A3	P6	S12	B1
24	7	28	A3	P6	S12	B2

View code

set.seed(1)
data <- expand.grid(Within=paste("B",1:2,sep=""),
                    Subplot=paste("S",1:2,sep=""),
                    Plot=paste("P",1:6,sep=""))
data$Subplot <- gl(12,2,24,lab=paste("S",1:12,sep=""))
data$Between <- gl(3,4,24,lab=paste("A",1:3,sep=""))
data$Resp1 <- rpois(24,10)
data$Resp2 <- rpois(24,20)
data <- with(data,data.frame(Resp1,Resp2,Between,Plot,Subplot,Within))

Calculate the mean Resp1

	mean(Resp1)
1	9.71

Via mutate

data %>% summarize(mean(Resp1))

  mean(Resp1)
1    9.708333

Via code

mean(data$Resp1)

[1] 9.708333

Note, unlike the base R solution (and consistent with all the grammar of data manipulation verbs) the dplyr pathway always yields a data.frame (actually a tibble). This is not only a data type that permits further manipulation, it is also a convenient format for producing tabular and graphical summaries.

Calculate the mean and standard deviation of Resp1

	mean(Resp1)	sd(Resp1)
1	9.71	3.04

Via mutate

data %>% summarize(mean(Resp1), sd(Resp1))

  mean(Resp1) sd(Resp1)
1    9.708333   3.04287

Via code

mean(data$Resp1)

[1] 9.708333

sd(data$Resp1)

[1] 3.04287

Calculate a range of summary statistics

	mean(Resp1)	sd(Resp1)	n()	n_distinct(Resp1)	first(Resp1)	min(Resp1)	max(Resp1)
1	9.71	3.04	24	10	8	2	14

Via mutate

data %>% summarize(mean(Resp1), sd(Resp1), n(), n_distinct(Resp1), first(Resp1), min(Resp1), max(Resp1))

  mean(Resp1) sd(Resp1) n() n_distinct(Resp1) first(Resp1) min(Resp1) max(Resp1)
1    9.708333   3.04287  24                10            8          2         14

Calculate the mean of both Resp1 and Resp2

	mean(Resp1)	mean(Resp2)
1	9.71	20.00

Via mutate

data %>% summarize(mean(Resp1), mean(Resp2))

  mean(Resp1) mean(Resp2)
1    9.708333          20

Via code

mean(data$Resp1)

[1] 9.708333

mean(data$Resp2)

[1] 20

The extended `summary_` family

As with the mutate function, the dplyr package also has a number of alternative summarize_ functons that provide convenient ways to select which columns to apply functions to.

summarize_all(), apply function(s) to all columns
summarize_at(), apply function(s) to selected columns
summarize_if(), apply function(s) to columns that satisfy a specific condition

Calculate the mean of both Resp1 and Resp2

	Resp1	Resp2
1	9.71	20.00

Via mutate

data %>% summarize_at(vars(Resp1,Resp2), mean)

     Resp1 Resp2
1 9.708333    20

#OR
data %>% summarize_if(is.numeric, mean)

     Resp1 Resp2
1 9.708333    20

Calculate the mean and standard deviation of both Resp1 and Resp2

	Resp1_mean	Resp2_mean	Resp1_sd	Resp2_sd
1	9.71	20.00	3.04	4.16

Via mutate

data %>% summarize_at(vars(Resp1,Resp2), funs(mean,sd))

  Resp1_mean Resp2_mean Resp1_sd Resp2_sd
1   9.708333         20  3.04287 4.159849

#OR
data %>% summarize_if(is.numeric, funs(mean,sd))

  Resp1_mean Resp2_mean Resp1_sd Resp2_sd
1   9.708333         20  3.04287 4.159849

Grouping data

Base R has a family of apply functions that apply a function (such as mean()) to a continuous variable separately for:

apply(), each column or row.

apply(data[,1:2], MARGIN=2, mean)

    Resp1     Resp2 
 9.708333 20.000000

tapply(), each level of a categorical vector (factor)

tapply(data$Resp1, data$Between, mean)

    A1     A2     A3 
 9.125 11.375  8.625

The above apply functions involve a sort of split, apply, combine process. This process became the inspiration for the plyr package (also out of the Wickham lab) which built on and super-charged the process. The dplyr package represents the next iteration of the process (for data frames only as most data are in data frame format).

Central to the modern split/apply/combine process is the idea of groups. Groups are the basis of splitting the data. Functions applied to grouped data are applied to each group (subset) separately before the results are combined back into a single data frame (actually tibble). Hence grouped data are most powerful when combined with the summarize() or mutate() families of functions.

data %>% group_by(Between)

Source: local data frame [24 x 6]
Groups: Between [3]

   Resp1 Resp2 Between   Plot Subplot Within
   <int> <int>  <fctr> <fctr>  <fctr> <fctr>
1      8    17      A1     P1      S1     B1
2     10    18      A1     P1      S1     B2
3      7    17      A1     P1      S2     B1
4     11    21      A1     P1      S2     B2
5     14    19      A2     P2      S3     B1
6     12    13      A2     P2      S3     B2
7     11    24      A2     P2      S4     B1
8      9    18      A2     P2      S4     B2
9     14    25      A3     P3      S5     B1
10    11    18      A3     P3      S5     B2
# ... with 14 more rows

Applying `summarize()` to grouped data

Calculate the mean and variance for Resp1 for each level of Between

data %>% group_by(Between) %>% summarize(Mean=mean(Resp1), Var=var(Resp1))

# A tibble: 3 × 3
  Between   Mean       Var
   <fctr>  <dbl>     <dbl>
1      A1  9.125  3.553571
2      A2 11.375  2.267857
3      A3  8.625 19.696429

When groups are defined, summarize() and mutate() yield tibbles. A tibble (so named because many dplyr functions generate objects with a class of tbl_df which is pronounced as 'tibble diff') is a data.frame-like object that adheres to a more strict structure (compare the following two):

# generate data.frame
data.frame('(x)'=x, A)

       X.x. A
1  3.389585 b
2  7.378957 b
3  1.629561 b
4  4.778440 b
5  8.873564 a
6  5.530862 a
7  2.216495 b
8  1.581772 b
9  7.471264 b
10 5.295647 b

# generate tibble
data_frame('(x)'=x, A)

# A tibble: 10 × 2
      `(x)`     A
      <dbl> <chr>
1  3.389585     b
2  7.378957     b
3  1.629561     b
4  4.778440     b
5  8.873564     a
6  5.530862     a
7  2.216495     b
8  1.581772     b
9  7.471264     b
10 5.295647     b

# generate data.frame
data.frame('(x)'=x[1:2], A)

       X.x. A
1  3.389585 b
2  7.378957 b
3  3.389585 b
4  7.378957 b
5  3.389585 a
6  7.378957 a
7  3.389585 b
8  7.378957 b
9  3.389585 b
10 7.378957 b

# generate tibble
data_frame('(x)'=x[1:2], A)

Error: Variables must be length 1 or 10.
Problem variables: '(x)'

variables are never auto-coerced into specific data types (e.g. character vectors are not coerced into factors)
there are row names
variable names are never altered (e.g. when special characters are included in names)
only objects of length 1 are recycled

Calculate the mean and variance for Resp1 for each Between and Within combination of levels

data %>% group_by(Between,Within) %>% summarize(Mean=mean(Resp1), Var=var(Resp1))

Source: local data frame [6 x 4]
Groups: Between [?]

  Between Within  Mean        Var
   <fctr> <fctr> <dbl>      <dbl>
1      A1     B1  7.50  0.3333333
2      A1     B2 10.75  0.9166667
3      A2     B1 12.00  2.0000000
4      A2     B2 10.75  2.2500000
5      A3     B1  9.50 25.6666667
6      A3     B2  7.75 18.2500000

Calculate the mean and variance for Resp1 and Resp2 for each Between and Within combination of levels

data %>% group_by(Between,Within) %>% summarize_at(vars(Resp1,Resp2), funs(mean,var))

Source: local data frame [6 x 6]
Groups: Between [?]

  Between Within Resp1_mean Resp2_mean  Resp1_var Resp2_var
   <fctr> <fctr>      <dbl>      <dbl>      <dbl>     <dbl>
1      A1     B1       7.50      16.75  0.3333333  0.250000
2      A1     B2      10.75      18.50  0.9166667 16.333333
3      A2     B1      12.00      22.25  2.0000000  4.916667
4      A2     B2      10.75      17.75  2.2500000 11.583333
5      A3     B1       9.50      23.75 25.6666667 20.916667
6      A3     B2       7.75      21.00 18.2500000 28.000000

* Note, arguably a better way to perform the above is to first gather the data first so that the data are as long as possible... We will revisit this manipulation once we have explored gathering data.

Hierarchical aggregations

It is useful to be able to aggregate the data to a different level of replication. Two common reasons are:

Exploratory data analysis for assessing normality, homogeneity of variance as applied to the approximate appropriate residuals (appropriate level fo replication for a give test)
Calculating observations (typically means of subreplicates) appropriate for the desired scale of the analysis

Continuing on with the same base dataset from the previous examples, we may wish to aggregate to the level of Plots or even Subplot

Lets say that we wished to explore the distributional characteristics of Resp1 and Resp2 within each of the Between levels. The data data.frame is setup to resemble a typical hierarchical design. Notice that:

a given Plot can only be of one Between level.
within each Plot level there are each of the levels of within<.samp>.
in the context of a design the Plots are the replicates of the Between factor

Therefore, to explore the distributional characteristics of the Between factor, we need to aggregate the data to the level of Plot. That is, we need to calculate the mean for each Plot.

Generate Plot means for each of Resp1 and Resp2

data %>% group_by(Plot) %>% summarize_at(vars(Resp1,Resp2), mean)

# A tibble: 6 × 3
    Plot Resp1 Resp2
  <fctr> <dbl> <dbl>
1     P1  9.00 18.25
2     P2 11.50 18.50
3     P3  8.75 23.00
4     P4  9.25 17.00
5     P5 11.25 21.50
6     P6  8.50 21.75

The above is of limited value however as the Between variable is missing... We can address this by including it as a group_by variable.name.
Generate Between/Plot means for each of Resp1 and Resp2

data %>% group_by(Between,Plot) %>% summarize_at(vars(Resp1,Resp2), mean)

Source: local data frame [6 x 4]
Groups: Between [?]

  Between   Plot Resp1 Resp2
   <fctr> <fctr> <dbl> <dbl>
1      A1     P1  9.00 18.25
2      A1     P4  9.25 17.00
3      A2     P2 11.50 18.50
4      A2     P5 11.25 21.50
5      A3     P3  8.75 23.00
6      A3     P6  8.50 21.75

Now it is clear that there are two Plots per level of Between.

The data design also includes a Subplot hierarchical level. These Subplots sit under Plot in the hierarchy.
Generate Between/Plot/Subplot means for each of Resp1 and Resp2

data %>% group_by(Between,Plot,Subplot) %>% summarize_at(vars(Resp1,Resp2), mean)

Source: local data frame [12 x 5]
Groups: Between, Plot [?]

   Between   Plot Subplot Resp1 Resp2
    <fctr> <fctr>  <fctr> <dbl> <dbl>
1       A1     P1      S1   9.0  17.5
2       A1     P1      S2   9.0  19.0
3       A1     P4      S7   9.0  19.5
4       A1     P4      S8   9.5  14.5
5       A2     P2      S3  13.0  16.0
6       A2     P2      S4  10.0  21.0
7       A2     P5      S9  11.5  21.0
8       A2     P5     S10  11.0  22.0
9       A3     P3      S5  12.5  21.5
10      A3     P3      S6   5.0  24.5
11      A3     P6     S11   7.0  16.5
12      A3     P6     S12  10.0  27.0

Applying `mutate()` to grouped data

Grouped data are also useful in the mutate context. For example, we may wish to center our data separately within each Plot.
Center Resp1 and Resp2 within each plot

data %>% group_by(Plot) %>% mutate_at(vars(Resp1,Resp2), funs(c=scale(.,scale=FALSE)))

Source: local data frame [24 x 8]
Groups: Plot [6]

   Resp1 Resp2 Between   Plot Subplot Within Resp1_c Resp2_c
   <int> <int>  <fctr> <fctr>  <fctr> <fctr>   <dbl>   <dbl>
1      8    17      A1     P1      S1     B1   -1.00   -1.25
2     10    18      A1     P1      S1     B2    1.00   -0.25
3      7    17      A1     P1      S2     B1   -2.00   -1.25
4     11    21      A1     P1      S2     B2    2.00    2.75
5     14    19      A2     P2      S3     B1    2.50    0.50
6     12    13      A2     P2      S3     B2    0.50   -5.50
7     11    24      A2     P2      S4     B1   -0.50    5.50
8      9    18      A2     P2      S4     B2   -2.50   -0.50
9     14    25      A3     P3      S5     B1    5.25    2.00
10    11    18      A3     P3      S5     B2    2.25   -5.00
# ... with 14 more rows

#OR
data %>% group_by(Plot) %>% mutate_at(vars(Resp1,Resp2), funs(c=. - mean(.)))

Source: local data frame [24 x 8]
Groups: Plot [6]

   Resp1 Resp2 Between   Plot Subplot Within Resp1_c Resp2_c
   <int> <int>  <fctr> <fctr>  <fctr> <fctr>   <dbl>   <dbl>
1      8    17      A1     P1      S1     B1   -1.00   -1.25
2     10    18      A1     P1      S1     B2    1.00   -0.25
3      7    17      A1     P1      S2     B1   -2.00   -1.25
4     11    21      A1     P1      S2     B2    2.00    2.75
5     14    19      A2     P2      S3     B1    2.50    0.50
6     12    13      A2     P2      S3     B2    0.50   -5.50
7     11    24      A2     P2      S4     B1   -0.50    5.50
8      9    18      A2     P2      S4     B2   -2.50   -0.50
9     14    25      A3     P3      S5     B1    5.25    2.00
10    11    18      A3     P3      S5     B2    2.25   -5.00
# ... with 14 more rows

In the above two alternatives, note the '.' - this means 'the current variable'. Hence, each of Resp1 and Resp2 will be substituted in place of '.' during calculations. Also note that the print method for the tibble has truncated the printed output so that it is displayed nicely on the console. Stored version of such objects are complete.

Again, these examples could arguably be better suited to gathering first.

Center a single variable (Resp2), centered within each level of the between variable

	Between	Plot	Resp1	Resp2	cResp2
1	A1	P1	8	13	-4.50
2	A1	P2	10	22	4.50
3	A2	P3	7	23	0.50
4	A2	P4	11	22	-0.50

Via ddply and transform or mutate

plyr:::ddply(data.s,~Between,transform,cResp2=Resp2-mean(Resp2))

  Between Plot Resp1 Resp2 cResp2
1      A1   P1     8    13   -4.5
2      A1   P2    10    22    4.5
3      A2   P3     7    23    0.5
4      A2   P4    11    22   -0.5

#OR
ddply(data.s,~Between,mutate,cResp2=Resp2-mean(Resp2))

Error in eval(expr, envir, enclos): could not find function "ddply"

Difference in means centered within Between (derivative of derivatives)

	Between	Plot	Resp1	Resp2	cResp1	cResp2	cDiff
1	A1	P1	8	13	-1.00	-4.50	3.50
2	A1	P2	10	22	1.00	4.50	-3.50
3	A2	P3	7	23	-2.00	0.50	-2.50
4	A2	P4	11	22	2.00	-0.50	2.50

Via ddply and mutate

plyr:::ddply(data.s,~Between,mutate,cResp1=Resp1-mean(Resp1),cResp2=Resp2-mean(Resp2),cDiff=cResp1-cResp2)

  Between Plot Resp1 Resp2 cResp1 cResp2 cDiff
1      A1   P1     8    13     -1   -4.5   3.5
2      A1   P2    10    22      1    4.5  -3.5
3      A2   P3     7    23     -2    0.5  -2.5
4      A2   P4    11    22      2   -0.5   2.5

Scale (log) transformations of Resp1 and Resp2 centered within Between

Error in as.string(y): Cannot coerce argument to a string

Via ddply and mutate

options(width=120)

opts_chunk$set(ts='asis',prompt=FALSE,comment=NA, fig.path='images/graphics-tut2.4', dev='my_png', fig.ext='png',warning=FALSE,message=FALSE  )

## end setup

ddply(data.s,~Between,mutate,sResp1=Resp1/max(Resp1),sResp2=Resp2/max(Resp2),logsResp1=log(sResp1),logsResp2=log(sResp2))

Error in eval(expr, envir, enclos): could not find function "ddply"

Merging (joining) data sets

It is common to have data associated with a particular study organized into a number of separate data tables (databases etc). In fact, large data sets are best managed in databases. However, statistical analyses generally require all data to be encapsulated within a single data structure. Therefore, prior to analysis, it is necessary to bring together multiple sources.

This phase of data preparation can be one of the most difficult to get right and verify.

Merging (or joining) involves creating a new data set that comprises information from two data sets. The resulting joined data set contains all fields from both data sets. The data sets are alignd together according to fields they have in common. Matching records in these common fields are used to select a row from each input data set to be combined.

There are numerous alternative ways of defining what should happen in the event that common records do not occur in both sets. For example, we could specify that only fully matching records be included in the final data set. Alternatively, we could specify that all records be included from both sets and thus the resulting data set will contain missing values. The following describe these various options.

left join
- return all rows and columns from the left data set
- return all columns from the right data set
- new columns for unmatched rows from the right data sets receive NA values
- when there are multiple matches, all combinations included
right join
- return all rows and columns from the right data set
- return all columns from the left data set
- new columns for unmatched rows from the left data sets receive NA values
- when there are multiple matches, all combinations included
inner join
- return all columns from the left and right data set
- return only rows that match from left and right data sets
- when there are multiple matches, all combinations included
semi join
- return all rows from the left data set that match with rows from the right data set
- keep only the columns from the left data set
anti join
- return only the rows from the left data set that do not match with rows from the right data set
- keep only the columns from the left data set
full join
- return all rows and columns from the left and right data set
- unmatched rows from either left data sets receive NA values in the associated new columns

Biological data set (missing Subplot 3)
(data.bio)

	Resp1	Resp2	Between	Plot	Subplot
1	8	18	A1	P1	S1
2	10	21	A1	P1	S2
4	11	23	A1	P2	S4
5	14	22	A2	P3	S5
6	12	24	A2	P3	S6
7	11	23	A2	P4	S7
8	9	20	A2	P4	S8
9	14	11	A3	P5	S9
10	11	22	A3	P5	S10
11	8	24	A3	P6	S11
12	2	16	A3	P6	S12

View code

set.seed(1)
data.bio <- expand.grid(Subplot=paste("S",1:2,sep=""),Plot=paste("P",1:6,sep=""))
data.bio$Subplot <- gl(12,1,12,lab=paste("S",1:12,sep=""))
data.bio$Between <- gl(3,4,12,lab=paste("A",1:3,sep=""))
data.bio$Resp1 <- rpois(12,10)
data.bio$Resp2 <- rpois(12,20)
data.bio <- with(data.bio,data.frame(Resp1,Resp2,Between,Plot,Subplot))
data.bio<-data.bio[-3,]

Physio-chemical data (missing Subplot 7)
(data.chem)

	Chem1	Chem2	Between	Plot	Subplot
1	1.45	0.89	A1	P1	S1
2	3.27	0.18	A1	P1	S2
3	1.18	5.08	A1	P2	S3
4	13.40	1.58	A1	P2	S4
5	3.78	1.62	A2	P3	S5
6	1.20	4.24	A2	P3	S6
8	5.69	2.99	A2	P4	S8
9	4.83	4.13	A3	P5	S9
10	2.00	3.60	A3	P5	S10
11	12.33	1.78	A3	P6	S11
12	4.01	0.23	A3	P6	S12

View code

set.seed(1)
data.chem <- expand.grid(Subplot=paste("S",1:2,sep=""),Plot=paste("P",1:6,sep=""))
data.chem$Subplot <- gl(12,1,12,lab=paste("S",1:12,sep=""))
data.chem$Between <- gl(3,4,12,lab=paste("A",1:3,sep=""))
data.chem$Chem1 <- rlnorm(12,1)
data.chem$Chem2 <- rlnorm(12,.5)
data.chem <- with(data.chem,data.frame(Chem1,Chem2,Between,Plot,Subplot))
data.chem<-data.chem[-7,]

Join bio and chem data (only keep full matches - an inner join)

	Between	Plot	Subplot	Resp1	Resp2	Chem1	Chem2
1	A1	P1	S1	8	18	1.45	0.89
2	A1	P1	S2	10	21	3.27	0.18
3	A1	P2	S4	11	23	13.40	1.58
4	A2	P3	S5	14	22	3.78	1.62
5	A2	P3	S6	12	24	1.20	4.24
6	A2	P4	S8	9	20	5.69	2.99
7	A3	P5	S10	11	22	2.00	3.60
8	A3	P5	S9	14	11	4.83	4.13
9	A3	P6	S11	8	24	12.33	1.78
10	A3	P6	S12	2	16	4.01	0.23

Note how both Subplot 3 and 7 are absent.

Via inner_join (package:dplyr)

inner_join(data.bio,data.chem,by=c("Between","Plot","Subplot"))

   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     11    23      A1   P2      S4 13.400350 1.5762780
4     14    22      A2   P3      S5  3.779183 1.6222430
5     12    24      A2   P3      S6  1.196657 4.2369184
6      9    20      A2   P4      S8  5.687807 2.9859003
7     14    11      A3   P5      S9  4.834518 4.1328919
8     11    22      A3   P5     S10  2.002931 3.6043314
9      8    24      A3   P6     S11 12.326867 1.7763576
10     2    16      A3   P6     S12  4.014221 0.2255188

Via merge

merge(data.bio,data.chem,by=c("Between","Plot","Subplot"))

   Between Plot Subplot Resp1 Resp2     Chem1     Chem2
1       A1   P1      S1     8    18  1.452878 0.8858208
2       A1   P1      S2    10    21  3.266253 0.1800177
3       A1   P2      S4    11    23 13.400350 1.5762780
4       A2   P3      S5    14    22  3.779183 1.6222430
5       A2   P3      S6    12    24  1.196657 4.2369184
6       A2   P4      S8     9    20  5.687807 2.9859003
7       A3   P5     S10    11    22  2.002931 3.6043314
8       A3   P5      S9    14    11  4.834518 4.1328919
9       A3   P6     S11     8    24 12.326867 1.7763576
10      A3   P6     S12     2    16  4.014221 0.2255188

Via join (package:plyr)

plyr:::join(data.bio,data.chem,by=c("Between","Plot","Subplot"), type="inner")

   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     11    23      A1   P2      S4 13.400350 1.5762780
4     14    22      A2   P3      S5  3.779183 1.6222430
5     12    24      A2   P3      S6  1.196657 4.2369184
6      9    20      A2   P4      S8  5.687807 2.9859003
7     14    11      A3   P5      S9  4.834518 4.1328919
8     11    22      A3   P5     S10  2.002931 3.6043314
9      8    24      A3   P6     S11 12.326867 1.7763576
10     2    16      A3   P6     S12  4.014221 0.2255188

Merge bio and chem data (keep all data - full or outer join)

	Between	Plot	Subplot	Resp1	Resp2	Chem1	Chem2
1	A1	P1	S1	8	18	1.45	0.89
2	A1	P1	S2	10	21	3.27	0.18
3	A1	P2	S3			1.18	5.08
4	A1	P2	S4	11	23	13.40	1.58
5	A2	P3	S5	14	22	3.78	1.62
6	A2	P3	S6	12	24	1.20	4.24
7	A2	P4	S7	11	23
8	A2	P4	S8	9	20	5.69	2.99
9	A3	P5	S9	14	11	4.83	4.13
10	A3	P5	S10	11	22	2.00	3.60
11	A3	P6	S11	8	24	12.33	1.78
12	A3	P6	S12	2	16	4.01	0.23

Note that all Subplots are present, yet there are missing cells that correspond to the missing cases from the BIO and CHEM data sets.

Via full_join (package:dplyr)

full_join(data.bio,data.chem,by=c("Between","Plot","Subplot"))

   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     11    23      A1   P2      S4 13.400350 1.5762780
4     14    22      A2   P3      S5  3.779183 1.6222430
5     12    24      A2   P3      S6  1.196657 4.2369184
6     11    23      A2   P4      S7        NA        NA
7      9    20      A2   P4      S8  5.687807 2.9859003
8     14    11      A3   P5      S9  4.834518 4.1328919
9     11    22      A3   P5     S10  2.002931 3.6043314
10     8    24      A3   P6     S11 12.326867 1.7763576
11     2    16      A3   P6     S12  4.014221 0.2255188
12    NA    NA      A1   P2      S3  1.178652 5.0780682

Via merge

merge(data.bio,data.chem,by=c("Between","Plot","Subplot"),all=T)

   Between Plot Subplot Resp1 Resp2     Chem1     Chem2
1       A1   P1      S1     8    18  1.452878 0.8858208
2       A1   P1      S2    10    21  3.266253 0.1800177
3       A1   P2      S3    NA    NA  1.178652 5.0780682
4       A1   P2      S4    11    23 13.400350 1.5762780
5       A2   P3      S5    14    22  3.779183 1.6222430
6       A2   P3      S6    12    24  1.196657 4.2369184
7       A2   P4      S7    11    23        NA        NA
8       A2   P4      S8     9    20  5.687807 2.9859003
9       A3   P5      S9    14    11  4.834518 4.1328919
10      A3   P5     S10    11    22  2.002931 3.6043314
11      A3   P6     S11     8    24 12.326867 1.7763576
12      A3   P6     S12     2    16  4.014221 0.2255188

Via join (package:plyr)

plyr:::join(data.bio,data.chem,by=c("Between","Plot","Subplot"), type="full")

   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     11    23      A1   P2      S4 13.400350 1.5762780
4     14    22      A2   P3      S5  3.779183 1.6222430
5     12    24      A2   P3      S6  1.196657 4.2369184
6     11    23      A2   P4      S7        NA        NA
7      9    20      A2   P4      S8  5.687807 2.9859003
8     14    11      A3   P5      S9  4.834518 4.1328919
9     11    22      A3   P5     S10  2.002931 3.6043314
10     8    24      A3   P6     S11 12.326867 1.7763576
11     2    16      A3   P6     S12  4.014221 0.2255188
12    NA    NA      A1   P2      S3  1.178652 5.0780682

Merge bio and chem data (only keep full BIO matches - left join)

	Between	Plot	Subplot	Resp1	Resp2	Chem1	Chem2
1	A1	P1	S1	8	18	1.45	0.89
2	A1	P1	S2	10	21	3.27	0.18
3	A1	P2	S4	11	23	13.40	1.58
4	A2	P3	S5	14	22	3.78	1.62
5	A2	P3	S6	12	24	1.20	4.24
6	A2	P4	S7	11	23
7	A2	P4	S8	9	20	5.69	2.99
8	A3	P5	S9	14	11	4.83	4.13
9	A3	P5	S10	11	22	2.00	3.60
10	A3	P6	S11	8	24	12.33	1.78
11	A3	P6	S12	2	16	4.01	0.23

Note how Subplot 3 is absent and Subplot 7 (which was missing from the CHEM data set) is missing CHEM data.

Via left_join (package:dplyr)

left_join(data.bio,data.chem,by=c("Between","Plot","Subplot"))

   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     11    23      A1   P2      S4 13.400350 1.5762780
4     14    22      A2   P3      S5  3.779183 1.6222430
5     12    24      A2   P3      S6  1.196657 4.2369184
6     11    23      A2   P4      S7        NA        NA
7      9    20      A2   P4      S8  5.687807 2.9859003
8     14    11      A3   P5      S9  4.834518 4.1328919
9     11    22      A3   P5     S10  2.002931 3.6043314
10     8    24      A3   P6     S11 12.326867 1.7763576
11     2    16      A3   P6     S12  4.014221 0.2255188

Via merge

merge(data.bio,data.chem,by=c("Between","Plot","Subplot"),all.x=T)

   Between Plot Subplot Resp1 Resp2     Chem1     Chem2
1       A1   P1      S1     8    18  1.452878 0.8858208
2       A1   P1      S2    10    21  3.266253 0.1800177
3       A1   P2      S4    11    23 13.400350 1.5762780
4       A2   P3      S5    14    22  3.779183 1.6222430
5       A2   P3      S6    12    24  1.196657 4.2369184
6       A2   P4      S7    11    23        NA        NA
7       A2   P4      S8     9    20  5.687807 2.9859003
8       A3   P5      S9    14    11  4.834518 4.1328919
9       A3   P5     S10    11    22  2.002931 3.6043314
10      A3   P6     S11     8    24 12.326867 1.7763576
11      A3   P6     S12     2    16  4.014221 0.2255188

Via join (package:plyr)

plyr:::join(data.bio,data.chem,by=c("Between","Plot","Subplot"), type="left")

   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     11    23      A1   P2      S4 13.400350 1.5762780
4     14    22      A2   P3      S5  3.779183 1.6222430
5     12    24      A2   P3      S6  1.196657 4.2369184
6     11    23      A2   P4      S7        NA        NA
7      9    20      A2   P4      S8  5.687807 2.9859003
8     14    11      A3   P5      S9  4.834518 4.1328919
9     11    22      A3   P5     S10  2.002931 3.6043314
10     8    24      A3   P6     S11 12.326867 1.7763576
11     2    16      A3   P6     S12  4.014221 0.2255188

Merge bio and chem data (only keep full CHEM matches - right join)

	Between	Plot	Subplot	Resp1	Resp2	Chem1	Chem2
1	A1	P1	S1	8	18	1.45	0.89
2	A1	P1	S2	10	21	3.27	0.18
3	A1	P2	S3			1.18	5.08
4	A1	P2	S4	11	23	13.40	1.58
5	A2	P3	S5	14	22	3.78	1.62
6	A2	P3	S6	12	24	1.20	4.24
7	A2	P4	S8	9	20	5.69	2.99
8	A3	P5	S9	14	11	4.83	4.13
9	A3	P5	S10	11	22	2.00	3.60
10	A3	P6	S11	8	24	12.33	1.78
11	A3	P6	S12	2	16	4.01	0.23

Note how Subplot 7 is absent and Subplot 3 (which was missing from the BIO data set) is missing BIO data.

Via right_join (package:dplyr)

right_join(data.bio,data.chem,by=c("Between","Plot","Subplot"))

   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     NA    NA      A1   P2      S3  1.178652 5.0780682
4     11    23      A1   P2      S4 13.400350 1.5762780
5     14    22      A2   P3      S5  3.779183 1.6222430
6     12    24      A2   P3      S6  1.196657 4.2369184
7      9    20      A2   P4      S8  5.687807 2.9859003
8     14    11      A3   P5      S9  4.834518 4.1328919
9     11    22      A3   P5     S10  2.002931 3.6043314
10     8    24      A3   P6     S11 12.326867 1.7763576
11     2    16      A3   P6     S12  4.014221 0.2255188

Via merge

merge(data.bio,data.chem,by=c("Between","Plot","Subplot"),all.y=T)

   Between Plot Subplot Resp1 Resp2     Chem1     Chem2
1       A1   P1      S1     8    18  1.452878 0.8858208
2       A1   P1      S2    10    21  3.266253 0.1800177
3       A1   P2      S3    NA    NA  1.178652 5.0780682
4       A1   P2      S4    11    23 13.400350 1.5762780
5       A2   P3      S5    14    22  3.779183 1.6222430
6       A2   P3      S6    12    24  1.196657 4.2369184
7       A2   P4      S8     9    20  5.687807 2.9859003
8       A3   P5      S9    14    11  4.834518 4.1328919
9       A3   P5     S10    11    22  2.002931 3.6043314
10      A3   P6     S11     8    24 12.326867 1.7763576
11      A3   P6     S12     2    16  4.014221 0.2255188

Via join (package:plyr)

plyr:::join(data.bio,data.chem,by=c("Between","Plot","Subplot"), type="right")

   Between Plot Subplot Resp1 Resp2     Chem1     Chem2
1       A1   P1      S1     8    18  1.452878 0.8858208
2       A1   P1      S2    10    21  3.266253 0.1800177
3       A1   P2      S3    NA    NA  1.178652 5.0780682
4       A1   P2      S4    11    23 13.400350 1.5762780
5       A2   P3      S5    14    22  3.779183 1.6222430
6       A2   P3      S6    12    24  1.196657 4.2369184
7       A2   P4      S8     9    20  5.687807 2.9859003
8       A3   P5      S9    14    11  4.834518 4.1328919
9       A3   P5     S10    11    22  2.002931 3.6043314
10      A3   P6     S11     8    24 12.326867 1.7763576
11      A3   P6     S12     2    16  4.014221 0.2255188

VLOOKUP in R

Lookup tables provide a way of inserting a column of data into a large data set such that the entries in the new column are determined by a relational match within another data set (the lookup table). For example, the main data set might contain data collected from a number of sites (Plots). Elsewhere we may have a data set that just contains the set of sites and their corresponding latitudes and longitudes (geographical lookup table). We could incorporate these latitudes and longitudes into the main data set by merging against the geographical lookup table. In Excel, this is referred to as vlookup, in a relational database it is referred to as a join, and in R it is a merge.

Biological data set (data.bio1)

	Resp1	Resp2	Between	Plot	Subplot
1	8	18	A1	P1	S1
2	10	21	A1	P1	S2
3	7	16	A1	P2	S3
4	11	23	A1	P2	S4
5	14	22	A2	P3	S5
6	12	24	A2	P3	S6
7	11	23	A2	P4	S7
8	9	20	A2	P4	S8
9	14	11	A3	P5	S9
10	11	22	A3	P5	S10
11	8	24	A3	P6	S11
12	2	16	A3	P6	S12

View code

set.seed(1)
data.bio1<- expand.grid(Subplot=paste("S",1:2,sep=""),Plot=paste("P",1:6,sep=""))
data.bio1$Subplot <- gl(12,1,12,lab=paste("S",1:12,sep=""))
data.bio1$Between <- gl(3,4,12,lab=paste("A",1:3,sep=""))
data.bio1$Resp1 <- rpois(12,10)
data.bio1$Resp2 <- rpois(12,20)
data.bio1<- with(data.bio1,data.frame(Resp1,Resp2,Between,Plot,Subplot))

Geographical data set (lookup table) (data.geo)

	Plot	LAT	LONG
1	P1	17.96	145.43
2	P2	17.52	146.20
3	P3	17.00	146.38
4	P4	18.23	146.79
5	P5	18.98	146.03

View code

set.seed(1)
data.geo <- expand.grid(Plot=paste("P",1:5,sep=""))
data.geo$LAT<-c(17.9605,17.5210,17.0011,18.235,18.9840)
data.geo$LONG<-c(145.4326,146.1983,146.3839,146.7934,146.0345)

Incorporate (merge) the lat/longs into the bio data

	Plot	Resp1	Resp2	Between	Subplot	LAT	LONG
1	P1	8	18	A1	S1	17.96	145.43
2	P1	10	21	A1	S2	17.96	145.43
3	P2	7	16	A1	S3	17.52	146.20
4	P2	11	23	A1	S4	17.52	146.20
5	P3	14	22	A2	S5	17.00	146.38
6	P3	12	24	A2	S6	17.00	146.38
7	P4	11	23	A2	S7	18.23	146.79
8	P4	9	20	A2	S8	18.23	146.79
9	P5	14	11	A3	S9	18.98	146.03
10	P5	11	22	A3	S10	18.98	146.03

Note how the latitudes and longitudes are duplicated so as to match the Plot listings.

Via left_join (package:dplyr)

left_join(data.bio1,data.geo,by="Plot")

   Resp1 Resp2 Between Plot Subplot     LAT     LONG
1      8    18      A1   P1      S1 17.9605 145.4326
2     10    21      A1   P1      S2 17.9605 145.4326
3      7    16      A1   P2      S3 17.5210 146.1983
4     11    23      A1   P2      S4 17.5210 146.1983
5     14    22      A2   P3      S5 17.0011 146.3839
6     12    24      A2   P3      S6 17.0011 146.3839
7     11    23      A2   P4      S7 18.2350 146.7934
8      9    20      A2   P4      S8 18.2350 146.7934
9     14    11      A3   P5      S9 18.9840 146.0345
10    11    22      A3   P5     S10 18.9840 146.0345
11     8    24      A3   P6     S11      NA       NA
12     2    16      A3   P6     S12      NA       NA

Via merge

merge(data.bio1,data.geo,by="Plot", all.T=TRUE)

   Plot Resp1 Resp2 Between Subplot     LAT     LONG
1    P1     8    18      A1      S1 17.9605 145.4326
2    P1    10    21      A1      S2 17.9605 145.4326
3    P2     7    16      A1      S3 17.5210 146.1983
4    P2    11    23      A1      S4 17.5210 146.1983
5    P3    14    22      A2      S5 17.0011 146.3839
6    P3    12    24      A2      S6 17.0011 146.3839
7    P4    11    23      A2      S7 18.2350 146.7934
8    P4     9    20      A2      S8 18.2350 146.7934
9    P5    14    11      A3      S9 18.9840 146.0345
10   P5    11    22      A3     S10 18.9840 146.0345

Via join (package:plyr)

plyr:::join(data.bio1,data.geo,by="Plot", type='left')

   Resp1 Resp2 Between Plot Subplot     LAT     LONG
1      8    18      A1   P1      S1 17.9605 145.4326
2     10    21      A1   P1      S2 17.9605 145.4326
3      7    16      A1   P2      S3 17.5210 146.1983
4     11    23      A1   P2      S4 17.5210 146.1983
5     14    22      A2   P3      S5 17.0011 146.3839
6     12    24      A2   P3      S6 17.0011 146.3839
7     11    23      A2   P4      S7 18.2350 146.7934
8      9    20      A2   P4      S8 18.2350 146.7934
9     14    11      A3   P5      S9 18.9840 146.0345
10    11    22      A3   P5     S10 18.9840 146.0345
11     8    24      A3   P6     S11      NA       NA
12     2    16      A3   P6     S12      NA       NA

It might also be interesting to explore which records in the Biological data set did not have matching records in the Geographical data set.

Determine the records from bio data that are not matched with lat/longs

	Resp1	Resp2	Between	Plot	Subplot
1	8	24	A3	P6	S11
2	2	16	A3	P6	S12

Via anti_join (package:dplyr)

anti_join(data.bio1,data.geo,by="Plot")

  Resp1 Resp2 Between Plot Subplot
1     8    24      A3   P6     S11
2     2    16      A3   P6     S12

Worked examples

Data minipulation of NASA temperature records

As part of their 2006 Data Expo, the ASA offered up a data set from NASA Langley Research Center Atmospheric Sciences Data Center and the intentionally vague challenge to participants to provide a graphical summary of the important features of the data set. The data set comprise geographic and atmospheric measures on a very coarse 24x24 grid (Central America) with observations from Han 1995 to Dec 2000. These data are included in the dplyr package as an example of a tbl_cube (a compact data storage format for fully factorial designs).

Format of data

	lat	long	month	year	cloudhigh	cloudlow	cloudmid	ozone	pressure	surftemp	temperature
1	36.20	-113.80	1	1995	26.00	7.50	34.50	304.00	835.00	272.70	272.10
2	33.70	-113.80	1	1995	20.00	11.50	32.50	304.00	940.00	279.50	282.20
3	31.21	-113.80	1	1995	16.00	16.50	26.00	298.00	960.00	284.70	285.20
4	28.71	-113.80	1	1995	13.00	20.50	14.50	276.00	990.00	289.30	290.70
5	26.22	-113.80	1	1995	7.50	26.00	10.50	274.00	1000.00	292.20	292.70
6	23.72	-113.80	1	1995	8.00	30.00	9.50	264.00	1000.00	294.10	293.60

lat,long	latitude and longitude
year,month	year and month
cloudlow, cloudmed, cloudhigh	cloud cover at three heights (Monthly averages)
ozone	monthly average ozone level
surftemp, temperature	monthly average temperature (in Kelvin)
pressure	monthly average air pressure

Start by converting the tbl_cube into a data frame.

library(dplyr)
nasa = as.data.frame(nasa)
glimpse(nasa)

Observations: 41,472
Variables: 11
$ lat         <dbl> 36.200000, 33.704348, 31.208696, 28.713043, 26.217391, 23.721739, 21.226087, 18.730435, 16.2347...
$ long        <dbl> -113.8000, -113.8000, -113.8000, -113.8000, -113.8000, -113.8000, -113.8000, -113.8000, -113.80...
$ month       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ year        <int> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995,...
$ cloudhigh   <dbl> 26.0, 20.0, 16.0, 13.0, 7.5, 8.0, 14.5, 19.5, 22.5, 21.0, 19.0, 16.5, 17.5, 5.5, 0.5, 0.0, 0.5,...
$ cloudlow    <dbl> 7.5, 11.5, 16.5, 20.5, 26.0, 30.0, 29.5, 26.5, 27.5, 26.0, 28.5, 28.0, 33.0, 44.5, 43.5, 37.0, ...
$ cloudmid    <dbl> 34.5, 32.5, 26.0, 14.5, 10.5, 9.5, 11.0, 17.5, 18.5, 16.5, 12.5, 13.5, 18.5, 13.0, 4.0, 1.0, 1....
$ ozone       <dbl> 304, 304, 298, 276, 274, 264, 258, 252, 250, 250, 248, 248, 250, 248, 248, 248, 248, 250, 252, ...
$ pressure    <dbl> 835, 940, 960, 990, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 100...
$ surftemp    <dbl> 272.7, 279.5, 284.7, 289.3, 292.2, 294.1, 295.0, 298.3, 300.1, 300.1, 301.0, 301.0, 299.2, 298....
$ temperature <dbl> 272.1, 282.2, 285.2, 290.7, 292.7, 293.6, 294.6, 296.9, 297.8, 298.7, 300.1, 300.1, 300.5, 299....

This is a reasonably large data set (41,472 records). Lets start by generating a subset of the data:

Select only lat, long and the cloud.. columns (in the interest keeping the output brief, perhaps just display the first few records).

Show code

nasa %>% select(lat,long, contains('cloud')) %>% head

       lat   long cloudhigh cloudlow cloudmid
1 36.20000 -113.8      26.0      7.5     34.5
2 33.70435 -113.8      20.0     11.5     32.5
3 31.20870 -113.8      16.0     16.5     26.0
4 28.71304 -113.8      13.0     20.5     14.5
5 26.21739 -113.8       7.5     26.0     10.5
6 23.72174 -113.8       8.0     30.0      9.5

Select only the most recent years data (again just display the first few records).

Show code

nasa %>% filter(year==last(year)) %>% head

       lat   long month year cloudhigh cloudlow cloudmid ozone pressure surftemp temperature
1 36.20000 -113.8     1 2000      20.5     11.0     28.0   320      955    277.3       274.4
2 33.70435 -113.8     1 2000      10.0     13.5     24.0   322      975    283.7       284.7
3 31.20870 -113.8     1 2000       8.5     17.0     19.0   306      980    287.8       289.3
4 28.71304 -113.8     1 2000       5.5     18.0     14.5   298      990    289.8       293.2
5 26.21739 -113.8     1 2000       4.0     18.0     10.5   286      995    291.7       293.6
6 23.72174 -113.8     1 2000       4.0     20.5      8.0   280     1000    292.7       294.6

Select only those records for which temperature was above 309 Kelvin and sort these records from highest to lowest temperature.

Show code

nasa %>% filter(temperature>309) %>% arrange(desc(temperature))

       lat      long month year cloudhigh cloudlow cloudmid ozone pressure surftemp temperature
1 33.70435 -98.77391     7 1998       2.5     12.0      5.0   294     1000    309.6       310.0
2 33.70435 -96.26957     7 1998       2.5     12.0      5.0   294     1000    309.6       310.0
3 28.71304 -98.77391     7 1998       4.0     17.5      6.0   294      990    309.6       309.6
4 33.70435 -98.77391     8 1999       4.5     15.5      7.0   290      990    307.5       309.6
5 33.70435 -96.26957     8 1999       4.5     15.5      7.0   290      990    307.5       309.6
6 26.21739 -98.77391     6 1998       9.5     21.5     12.5   286      995    309.1       309.1
7 31.20870 -96.26957     7 1998       5.5     13.5      7.5   292     1000    307.5       309.1
8 31.20870 -93.76522     7 1998       5.5     13.5      7.5   292     1000    307.5       309.1

We might be interested in exploring whether temperature has changed over time. Whilst the time span is relatively short, and there will obviously be seasonal and daily fluctuations, it might still be interesting to calculate annual mean temperatures. Temperature is currently recorded in Kelvin. Lets present the data in Celcius units. Since the conversion from Kelvin to Celcius is a simple additive conversion ($T_C = T_K - 273.15$), it does not matter whether the conversion is applied to the individual temperature records or to the annual averages. We will therefore use both approaches separately as practice.
1. Calculate the annual means based on temperature data converted to Celcius.
  Show code
  nasa %>% mutate(Temp_c = temperature - 273.15) %>% group_by(year) %>% summarize(Mean=mean(Temp_c))
  
  # A tibble: 6 × 2 year Mean <int> <dbl> 1 1995 24.14457 2 1996 23.95054 3 1997 24.80664 4 1998 25.55278 5 1999 24.98644 6 2000 25.18578
  *Note, if at this point you end up with a single mean (not grouped by year), it means that the plyr package version of the summarize function is being used instead of the dplyr package version. As indicated above, these two packages do not play nicely together. I would strongly recommend only using plyr package functions via their namespace rather than loading the entire package.
2. Calculate the annual means on Kelvin temperature and express the means in Celcius.
  Show code
  nasa %>% group_by(year) %>% summarize(Mean=mean(temperature)-273.15)
  
  # A tibble: 6 × 2 year Mean <int> <dbl> 1 1995 24.14457 2 1996 23.95054 3 1997 24.80664 4 1998 25.55278 5 1999 24.98644 6 2000 25.18578

Some temperature research focuses on variability and extremes (not just averages).

Calculate the annual mean, standard deviation as well as maximum temperature

Show code

nasa %>% mutate(Temp_c = temperature - 273.15) %>% group_by(year) %>% summarize_at('temperature', funs(mean,sd,max))

# A tibble: 6 × 4
   year     mean       sd   max
  <int>    <dbl>    <dbl> <dbl>
1  1995 297.2946 4.937595 307.5
2  1996 297.1005 4.551722 307.9
3  1997 297.9566 4.768040 307.9
4  1998 298.7028 4.654017 310.0
5  1999 298.1364 4.492859 309.6
6  2000 298.3358 4.747475 308.7

Calculate the number of monthly averages per annum that exceed the top 10 percent of monthly temperature averages.

Show code

nasa %>% mutate(Temp_c = temperature - 273.15) %>% # convert to Celcius
  mutate(Perc=percent_rank(Temp_c)) %>% # calculate the percentage ranks
  filter(Perc>=0.9) %>% # filter out those values less than 90 %
  group_by(year) %>% # group by year
  summarize(N=n()) # count the number of records (per group)

# A tibble: 6 × 2
   year     N
  <int> <int>
1  1995   254
2  1996   173
3  1997   336
4  1998   908
5  1999   489
6  2000   730

Note, the same can be achieved using the count() function in place of the more generalizable summarize() function.

nasa %>% mutate(Temp_c = temperature - 273.15) %>% # convert to Celcius
  mutate(Perc=percent_rank(Temp_c)) %>% # calculate the percentage ranks
  filter(Perc>=0.9) %>% # filter out those values less than 90 %
  group_by(year) %>% # group by year
  count() # count the number of records (per group)

# A tibble: 6 × 2
   year     n
  <int> <int>
1  1995   254
2  1996   173
3  1997   336
4  1998   908
5  1999   489
6  2000   730

Data minipulation of Thousand Islands (Tikus) coral abundance data

The mvabund package includes a multivariate coral abundance data set (75 coral species) from the Thousand Islands stored in a list format. I have converted this list into a data frame and included it in the manipulationsDatasets.R and manipulationsDatasets.RData downloads. Abundance of a species of coral was measured as the length (cm) of a 10m transect containing that species.

Format of data

	time	rep	Pocillopora damicornis	...
1	81	1	79	...
2	81	2	51	...
3	81	3	42	...
4	81	4	15	...
5	81	5	9	...
6	81	6	72	...

time	year in which data were collected
rep	an id for each location (site)
PSammacora cintigua, ...	abundance (cm per 10m transect)

This is another reasonably large data set. Lets explore specific subsets of the data.

Select rep, time and all Porites genera(limit to the first 6 rows for brevity sake).

Show code

tikus %>% select(rep,time,starts_with('Porites')) %>% head

   rep time Porites cylindrica Porites lichen Porites lobata Porites lutea Porites nigrescens Porites solida
V1   1   81                 61              0             36            30                  0              0
V2   2   81                 24             47              0             0                  0              0
V3   3   81                  0             49              0             0                  0             10
V4   4   81                 20              0              0             0                 21              0
V5   5   81                  0              0              0             0                  0             17
V6   6   81                  0              0              0             0                  9              0
   Porites stephensoni
V1                   0
V2                   0
V3                   0
V4                   0
V5                   0
V6                   0

Select rep, time and species that do not contain pora (limit to the first 6 rows for brevity sake).

Show code

tikus %>% select(rep,time,everything(),-contains('pora')) %>% head

   rep time Psammocora contigua Psammocora digitata Acropera aspera Fungia fungites Fungia paumotensis Fungia concina
V1   1   81                   0                   0              17               0                  0              0
V2   2   81                   0                   0              18               0                 33              0
V3   3   81                   0                   0               9              18                  0              0
V4   4   81                   0                   0               8              17                  0              0
V5   5   81                   0                   0              23               0                  0              0
V6   6   81                   0                   0               0               0                  0              0
   Fungia scutaria Halomitra limax Pavona varians Pavona venosa Pavona cactus Coeloseris mayeri Galaxea fascicularis
V1               0               0             30             0             0                20                   51
V2               0               0              0            24             0                 0                   27
V3               0               0              0             0             0                15                   31
V4               0               0              0             0             0                 0                   24
V5               0               0              0             0             0                 9                    0
V6               0               0              0             0             0                19                   13
   Symphyllia radians Lobophyllia corymbosa Lobophyllia hemprichii Porites cylindrica Porites lichen Porites lobata
V1                  0                     0                      0                 61              0             36
V2                  0                     0                      0                 24             47              0
V3                  0                     0                      0                  0             49              0
V4                  0                     0                      0                 20              0              0
V5                  0                     0                      0                  0              0              0
V6                  0                     0                      0                  0              0              0
   Porites lutea Porites nigrescens Porites solida Porites stephensoni Favia pallida Favia speciosa Favia stelligera
V1            30                  0              0                   0            10              0                0
V2             0                  0              0                   0            20              0                0
V3             0                  0             10                   0             0             30                0
V4             0                 21              0                   0             0              0                0
V5             0                  0             17                   0             0              0                0
V6             0                  9              0                   0             0              0                0
   Favia rotumana Favites abdita Favites chinensis Goniastrea rectiformis Goniastrea pectinata Goniastrea sp
V1              0             33                 0                      0                    0             0
V2              0             41                44                      0                    0             0
V3              0             23                78                      0                    0             0
V4              0             27                61                      0                    0             0
V5              0             91                44                      0                    0             0
V6              0             63                 0                      0                    0             0
   Dulophyllia crispa Platygyra daedalea Platygyra sinensis Leptastrea purpurea Leptastrea pruinosa Cyphastrea serailia
V1                  0                  0                 47                   0                   0                   0
V2                  0                 27                 27                   0                   0                   0
V3                  0                 55                 56                   0                   0                   0
V4                  0                  0                 26                   0                   0                   0
V5                  0                 71                  0                   0                   0                   0
V6                  0                 74                  0                   0                   0                   0

This is another reasonably large data set. Lets explore specific subsets of the data.

For each year, calculate the mean abundance of Pocillopora damicornis.

Show code

tikus %>% select(time,contains('Pocillopora damicornis')) %>% group_by(time) %>%
     summarize_all(mean)

# A tibble: 6 × 2
    time `Pocillopora damicornis`
  <fctr>                    <dbl>
1     81                     30.0
2     83                      0.0
3     84                      0.0
4     85                      0.0
5     87                      1.8
6     88                      4.0

Show code

tikus %>% select(time,`Pocillopora damicornis`) %>% group_by(time) %>%
     summarize(MeanAbundance=mean(`Pocillopora damicornis`))

# A tibble: 6 × 2
    time MeanAbundance
  <fctr>         <dbl>
1     81          30.0
2     83           0.0
3     84           0.0
4     85           0.0
5     87           1.8
6     88           4.0

# Note the back ticks in the above

Show code

tikus %>% select(time,`Pocillopora damicornis`) %>% group_by(time) %>%
    summarize_at(vars(`Pocillopora damicornis`),funs(mean))

# A tibble: 6 × 2
    time `Pocillopora damicornis`
  <fctr>                    <dbl>
1     81                     30.0
2     83                      0.0
3     84                      0.0
4     85                      0.0
5     87                      1.8
6     88                      4.0

# Note the back ticks in the above

For each year, calculate the mean abundance of each Pocillopora species.

Show code

tikus %>% select(time,starts_with('Pocillopora')) %>% group_by(time) %>%
    summarize_all(funs(mean))

# A tibble: 6 × 3
    time `Pocillopora damicornis` `Pocillopora verrucosa`
  <fctr>                    <dbl>                   <dbl>
1     81                     30.0                    19.2
2     83                      0.0                     0.0
3     84                      0.0                     0.0
4     85                      0.0                     0.0
5     87                      1.8                     0.0
6     88                      4.0                     0.0

For each year, calculate the mean and standard deviation abundance of each Pocillopora species.

Show code

tikus %>% select(time,starts_with('Pocillopora')) %>% group_by(time) %>%
    summarize_all(funs(mean, sd))

# A tibble: 6 × 5
    time `Pocillopora damicornis_mean` `Pocillopora verrucosa_mean` `Pocillopora damicornis_sd`
  <fctr>                         <dbl>                        <dbl>                       <dbl>
1     81                          30.0                         19.2                   29.074617
2     83                           0.0                          0.0                    0.000000
3     84                           0.0                          0.0                    0.000000
4     85                           0.0                          0.0                    0.000000
5     87                           1.8                          0.0                    5.692100
6     88                           4.0                          0.0                    9.660918
# ... with 1 more variables: `Pocillopora verrucosa_sd` <dbl>

Now lets get a bit more serious.

Calculate the mean cover abundance of Acropora per year. Note:

Abundance in cm per 10m will need to be converted to cover abundance
There is a typo and one of the Acropora have been incorrectly listed as Acropera

Show code

tikus %>% rename(`Acropora aspera`=`Acropera aspera`) %>%
gather(Species, Abundance,-time,-rep) %>%
mutate(Cover=Abundance/10) %>%
separate(Species,c('Genera','Species')) %>%
filter(Genera=='Acropora') %>%
group_by(time,rep) %>%
summarise(SumCover=sum(Cover))

Source: local data frame [60 x 3]
Groups: time [?]

     time    rep SumCover
   <fctr> <fctr>    <dbl>
1      81      1     64.7
2      81      2     39.7
3      81      3     35.7
4      81      4     40.4
5      81      5     28.9
6      81      6      8.8
7      81      7     15.6
8      81      8      8.3
9      81      9      6.2
10     81     10      7.5
# ... with 50 more rows

Lagged responses

For this example, we will use another of the fabricated data sets in manipulationDatasets.R (data.t). This data set comprises two Responses measured in each of four years from each of four Plots. What we want is to derive a new measure that is the change in abundance between a certain year (2008) and all subsequent years - and of course, we want to do this separately for each Plot and each Response.

Format of data

	Plot	Year	Resp1	Resp2
1	P1	2008	0.00	36.00
2	P1	2009	48.00	0.00
3	P1	2010	12.00	0.00
4	P1	2011	0.00	15.00
5	P2	2008	19.00	34.00
6	P2	2009	18.00	0.00

Plot	a random effect with four levels
Year	sampling year
Resp1, Resp2	Responses (e.g. counts of two species)

This manipulation requires a number of steps. The tidyverse eco-system allows us to link together individual changes and manipulations so as to form the more complex total manipulations. Of course, this also means that there are numerous ways to achieve the same result.

Calculate the change (difference in abundance between 2008 and all subsequent years) for each Response (Resp1 and Resp2))

Show code

data.t %>%
    arrange(Plot,Year) %>%  #ensure that data are chronologically arranged in each Plot
    group_by(Plot) %>% #group by Plot
    mutate_at(vars(Resp1,Resp2), funs(Delta=. - first(.))) #calculate difference between each Value and the Value in the first Year

Source: local data frame [16 x 6]
Groups: Plot [4]

     Plot  Year Resp1 Resp2 Resp1_Delta Resp2_Delta
   <fctr> <int> <dbl> <dbl>       <dbl>       <dbl>
1      P1  2008     0    36           0           0
2      P1  2009    48     0          48         -36
3      P1  2010    12     0          12         -36
4      P1  2011     0    15           0         -21
5      P2  2008    19    34           0           0
6      P2  2009    18     0          -1         -34
7      P2  2010    21    36           2           2
8      P2  2011    45    12          26         -22
9      P3  2008    40    49           0           0
10     P3  2009    38     0          -2         -49
11     P3  2010     9    18         -31         -31
12     P3  2011    22    42         -18          -7
13     P4  2008    20     0           0           0
14     P4  2009    22    14           2          14
15     P4  2010    32     0          12           0
16     P4  2011    38    27          18          27

As 2008 is the year we are comparing all others to, we dont really want the 2008 data in the final data set - so lets suppress it.

Show code

data.t %>%
    arrange(Plot,Year) %>%  #ensure that data are chronologically arranged in each Plot
    group_by(Plot) %>% #group by Plot
    mutate_at(vars(Resp1,Resp2), funs(Delta=. - first(.))) %>% #calculate difference between each Value and the Value in the first Year
    filter(Year>first(Year))

Source: local data frame [12 x 6]
Groups: Plot [4]

     Plot  Year Resp1 Resp2 Resp1_Delta Resp2_Delta
   <fctr> <int> <dbl> <dbl>       <dbl>       <dbl>
1      P1  2009    48     0          48         -36
2      P1  2010    12     0          12         -36
3      P1  2011     0    15           0         -21
4      P2  2009    18     0          -1         -34
5      P2  2010    21    36           2           2
6      P2  2011    45    12          26         -22
7      P3  2009    38     0          -2         -49
8      P3  2010     9    18         -31         -31
9      P3  2011    22    42         -18          -7
10     P4  2009    22    14           2          14
11     P4  2010    32     0          12           0
12     P4  2011    38    27          18          27

So far, the metric representing the difference between before and after has been pretty crude. It is simply the raw difference, the magnitude of which is limitless and dependent on the magnitude and scale of the variables - thereby meaningless for comparisons.

A more sophisticated metric would standardize the differences such that they represent the proportional difference (divide by the first value). $$\delta = [(A_2 - A_1)/A_2] * 100$$

Show code

delta <- function(t2,t1){
  100*((t2-t1)/t1)
}
data.t %>%
    arrange(Plot,Year) %>%  #ensure that data are chronologically arranged in each Plot
    group_by(Plot) %>% #group by Plot
    mutate_at(vars(Resp1,Resp2), funs(Delta=delta(.,first(.)))) %>%
        filter(Year>first(Year))

Source: local data frame [12 x 6]
Groups: Plot [4]

     Plot  Year Resp1 Resp2 Resp1_Delta Resp2_Delta
   <fctr> <int> <dbl> <dbl>       <dbl>       <dbl>
1      P1  2009    48     0         Inf -100.000000
2      P1  2010    12     0         Inf -100.000000
3      P1  2011     0    15         NaN  -58.333333
4      P2  2009    18     0   -5.263158 -100.000000
5      P2  2010    21    36   10.526316    5.882353
6      P2  2011    45    12  136.842105  -64.705882
7      P3  2009    38     0   -5.000000 -100.000000
8      P3  2010     9    18  -77.500000  -63.265306
9      P3  2011    22    42  -45.000000  -14.285714
10     P4  2009    22    14   10.000000         Inf
11     P4  2010    32     0   60.000000         NaN
12     P4  2011    38    27   90.000000         Inf

Of course this introduces additional issues, such as what happens if the initial value is 0? In such cases, any increase is effectively an infinite increase. However, if the initial value and a subsequent value are both zero, then a sensible value of delta would be 0. $$ \delta = \begin{cases} [(A_2 - A_1)/A_2] * 100 & \quad \text{if } A_1\neq 0\\ 10,000 & \quad \text{if } A_1=0\text{ \& }A_2\neq 0\\ 0 & \quad \text{if } A_1=0\text{ \& }A_2=0\\ \end{cases} $$

Show code

delta <- function(t2,t1){
  d<-100*((t2-t1)/t1)
  d[t1==0 & t2==0]<-0
  d[t1==0 & t2!=0]<-10000
  log(1+(d/101))
}
data.t %>%
    arrange(Plot,Year) %>%  #ensure that data are chronologically arranged in each Plot
    group_by(Plot) %>% #group by Plot
    mutate_at(vars(Resp1,Resp2), funs(Delta=delta(.,first(.)))) %>%
        filter(Year>first(Year))

Source: local data frame [12 x 6]
Groups: Plot [4]

     Plot  Year Resp1 Resp2 Resp1_Delta Resp2_Delta
   <fctr> <int> <dbl> <dbl>       <dbl>       <dbl>
1      P1  2009    48     0  4.60526919 -4.61512052
2      P1  2010    12     0  4.60526919 -4.61512052
3      P1  2011     0    15  0.00000000 -0.86170254
4      P2  2009    18     0 -0.05351732 -4.61512052
5      P2  2010    21    36  0.09914006  0.05660821
6      P2  2011    45    12  0.85648651 -1.02346484
7      P3  2009    38     0 -0.05077233 -4.61512052
8      P3  2010     9    18 -1.45812010 -0.98454058
9      P3  2011    22    42 -0.58976883 -0.15250187
10     P4  2009    22    14  0.09440968  4.60526919
11     P4  2010    32     0  0.46628385  0.00000000
12     P4  2011    38    27  0.63715291  4.60526919

The above procedure is perfectly adequate for dealing with a single ubiquitous event (disturbance), but what happens if there have been multiple different disturbances that begin and end at different times in different plots. Comparing each time period to a single, fixed time (2008 in this case), is then not logical. Lets say that in addition to our data set, we had also compiled a data set that defined times prior and post for each disturbance for each Plot. This data set could then be used as a lookup table (as in this section above). In this example we will incorporate the post disturbance date, yet we will not restrict the resulting data beyond this date (that is, we will continue to explore the impacts of the disturbance after the primary condition has ceased).

View code

#complex difference function
data.v <- expand.grid(Plot=paste("P",1:4,sep=""))
data.v$Prior <- c(2008,2008,2009,2008)
data.v$Post <- c(2011,2010,2011,2011)
data.v

  Plot Prior Post
1   P1  2008 2011
2   P2  2008 2010
3   P3  2009 2011
4   P4  2008 2011

The following additional steps are now required in order to incorporate the prior and post disturbance dates.

Merge the data and the lookup table such that the prior and post dates are included in the data set.
Limit the first date of the comparison for each Plot such that the first measure is relative to the defined Prior disturbance time for that Plot.

Show code

delta <- function(t2,t1){
  d<-100*((t2-t1)/t1)
  d[t1==0 & t2==0]<-0
  d[t1==0 & t2!=0]<-10000
  log(1+(d/101))
}
data.t %>%
    left_join(data.v) %>%
    filter(Year>=Prior) %>%
    arrange(Plot,Year) %>%  #ensure that data are chronologically arranged in each Plot
    group_by(Plot) %>% #group by Plot
    mutate_at(vars(Resp1,Resp2), funs(Delta=delta(.,first(.)))) %>%
        filter(Year>first(Year))

Source: local data frame [11 x 8]
Groups: Plot [4]

     Plot  Year Resp1 Resp2 Prior  Post Resp1_Delta
   <fctr> <int> <dbl> <dbl> <dbl> <dbl>       <dbl>
1      P1  2009    48     0  2008  2011  4.60526919
2      P1  2010    12     0  2008  2011  4.60526919
3      P1  2011     0    15  2008  2011  0.00000000
4      P2  2009    18     0  2008  2010 -0.05351732
5      P2  2010    21    36  2008  2010  0.09914006
6      P2  2011    45    12  2008  2010  0.85648651
7      P3  2010     9    18  2009  2011 -1.40895673
8      P3  2011    22    42  2009  2011 -0.53936879
9      P4  2009    22    14  2008  2011  0.09440968
10     P4  2010    32     0  2008  2011  0.46628385
11     P4  2011    38    27  2008  2011  0.63715291
# ... with 1 more variables: Resp2_Delta <dbl>

References

Wickham, H. (2014NA). “Tidy data”. In: The Journal of Statistical Software 59.10. URL: http://www.jstatsoft.org/v59/i10/.

Tutorial 2.4 - Data manipulation

Manipulating data sets

tidyverse a data manipulation ecosystem within R

The grammar of data manipulation

Piping

Reshaping dataframes

Wide to long (gather)

Long to wide (cast)

Combining columns (unite)

Splitting columns appart (separate)

Summary and Vectorized functions

Summary functions

Vectorized functions

Alterations to existing variables

Sorting data (arrange)

Re-levelling (sorting) factors

Relabelling a factor

Renaming columns (variables)

Subsetting

Filtering (subsetting by rows)

Selecting (subsetting by columns)

select_if

filter and select

Adding new variables (mutating)

Vectorized functions

The extended mutate_ family

Summarizing (aggregating) data

The extended summary_ family

Grouping data

Applying summarize() to grouped data

Hierarchical aggregations

Applying mutate() to grouped data

Merging (joining) data sets

VLOOKUP in R

Worked examples

Data minipulation of NASA temperature records

Data minipulation of Thousand Islands (Tikus) coral abundance data

Lagged responses

References

`tidyverse` a data manipulation ecosystem within R

`select_if`

`filter` and `select`

The extended `mutate_` family

The extended `summary_` family

Applying `summarize()` to grouped data

Applying `mutate()` to grouped data