23 April 2011
Constructing dataframes
Importing (reading) data
Exporting (writing) data
Data within dataframes
Manipulating dataframes
Rarely are data transcribed and organized into such simple and ready-to-go data sets. More typically,
data are spread throughout multiple sources and in a variety of formats (particularly if compiled by multiple
workers or instruments). Consequently, prior to any formal statistical analyses, it is necessary to compile very focused
data sets.
Common data preparations include:
- Reshaping and rearranging data
- Merging multiple data sources
- Aggregating data to appropriate spatial/temporal scales
- Transforming data and deriving new variables
- Sorting and reordering data
- Relabelling data
This workshop will concentrate on these essential data preparation phases.
Practical data manipulation will be demonstrated via a series of very small artificial datasets.
These datasets are presented as simple tables, and the R code required to generate each dataset is presented in a static code box underneath or adjacent to the table.
A very basic description of the table and the name of the data frame are displayed above the table. The entire collection of datasets used in this workshop can be obtained by issuing the following
command:
- (if online)
> source(url("http://users.monash.edu.au/~murray/AIMS-R-users/downloads/data/manipulationDatasets.R"))
- (if offline and running this from a local copy, run the following from within the /downloads/data directory)
> source(file = "manipulationDatasets.R")
Reshaping dataframes
Wide to long (melt)
Whilst wide data formats are often more compact and typically easier to manage for data entry (particularly in the field), the data are not
in the appropriate format for most analyses (traditional repeated measures is one exception).
Most analyses require that each replicate is in its own row and thus it is necessary to rearrange or reshape (melt) the data from this wide format
to the long (molten) format.
The melt function (reshape package) is very useful for converting wide (repeated measures-like) data into long format.
The important parameters to specify are:
- id: the set of factors already in long format
- measure.var: the columns that define the repeated measure and that should be melted into a single long column
| Between | Plot | Time.0 | Time.1 | Time.2 |
R1 | A1 | P1 | 8.00 | 14.00 | 14.00 |
R2 | A1 | P2 | 10.00 | 12.00 | 11.00 |
R3 | A2 | P3 | 7.00 | 11.00 | 8.00 |
R4 | A2 | P4 | 11.00 | 9.00 | 2.00 |
View code
> set.seed(1)
> data.w <- expand.grid(Plot = paste("P", 1:4, sep = ""))
> data.w$Between <- gl(2, 2, 4, lab = paste("A", 1:2,
+ sep = ""))
> data.w <- with(data.w, data.frame(Between, Plot, matrix(rpois(12,
+ 10), ncol = 3, dimnames = list(paste("R", 1:4, sep = ""),
+ paste("Time", 0:2, sep = ":")))))
| Between | Plot | Time | value |
1 | A1 | P1 | Time.0 | 8.00 |
2 | A1 | P2 | Time.0 | 10.00 |
3 | A2 | P3 | Time.0 | 7.00 |
4 | A2 | P4 | Time.0 | 11.00 |
5 | A1 | P1 | Time.1 | 14.00 |
6 | A1 | P2 | Time.1 | 12.00 |
7 | A2 | P3 | Time.1 | 11.00 |
8 | A2 | P4 | Time.1 | 9.00 |
9 | A1 | P1 | Time.2 | 14.00 |
10 | A1 | P2 | Time.2 | 11.00 |
11 | A2 | P3 | Time.2 | 8.00 |
12 | A2 | P4 | Time.2 | 2.00 |
Via melt
> library(reshape)
> melt(data.w, id = 1:2, measure.var = 3:5, var = "Time")
Between Plot Time value
1 A1 P1 Time.0 8
2 A1 P2 Time.0 10
3 A2 P3 Time.0 7
4 A2 P4 Time.0 11
5 A1 P1 Time.1 14
6 A1 P2 Time.1 12
7 A2 P3 Time.1 11
8 A2 P4 Time.1 9
9 A1 P1 Time.2 14
10 A1 P2 Time.2 11
11 A2 P3 Time.2 8
12 A2 P4 Time.2 2
>
> library(reshape)
> melt(data.w, id = c("Between", "Plot"), measure.var = c("Time.0",
+ "Time.1", "Time.2"), var = "Time")
Between Plot Time value
1 A1 P1 Time.0 8
2 A1 P2 Time.0 10
3 A2 P3 Time.0 7
4 A2 P4 Time.0 11
5 A1 P1 Time.1 14
6 A1 P2 Time.1 12
7 A2 P3 Time.1 11
8 A2 P4 Time.1 9
9 A1 P1 Time.2 14
10 A1 P2 Time.2 11
11 A2 P3 Time.2 8
12 A2 P4 Time.2 2
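The same wide-to-long conversion can also be performed without any extra packages, via base R's stats::reshape() function. The following is a minimal sketch (the argument names shown belong to stats::reshape, not to the reshape package); data.w is rebuilt from the code box above so the example is self-contained, with sep = "." used directly so that the column names match those produced after make.names().

```r
# Wide-to-long with base R's stats::reshape() -- an alternative to melt().
# data.w is rebuilt as in the code box above (sep = "." used directly).
set.seed(1)
data.w <- expand.grid(Plot = paste("P", 1:4, sep = ""))
data.w$Between <- gl(2, 2, 4, lab = paste("A", 1:2, sep = ""))
data.w <- with(data.w, data.frame(Between, Plot,
    matrix(rpois(12, 10), ncol = 3,
        dimnames = list(paste("R", 1:4, sep = ""),
            paste("Time", 0:2, sep = ".")))))

# direction = "long": stack the Time.* columns into a single value column
data.l <- reshape(data.w, direction = "long",
    varying = c("Time.0", "Time.1", "Time.2"),
    v.names = "value", timevar = "Time",
    times = c("Time.0", "Time.1", "Time.2"),
    idvar = c("Between", "Plot"))
data.l
```

The result contains the same 12 rows as the melt() output, although the row ordering and row names differ.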
Long to wide (cast)
The cast function (reshape package) is very useful for converting long datasets into wide (repeated measures-like) format.
The heart of the function definition is a formula of the form LHS ~ RHS, where:
- LHS is a plus-separated set of the variables to retain in long format
- RHS is the variable to be cast into wide format
| Resp1 | Resp2 | Between | Plot | Subplot | Within |
1 | 8.00 | 17.00 | A1 | P1 | S1 | B1 |
2 | 10.00 | 18.00 | A1 | P1 | S1 | B2 |
3 | 7.00 | 17.00 | A1 | P1 | S2 | B1 |
4 | 11.00 | 21.00 | A1 | P1 | S2 | B2 |
5 | 14.00 | 19.00 | A2 | P2 | S3 | B1 |
6 | 12.00 | 13.00 | A2 | P2 | S3 | B2 |
7 | 11.00 | 24.00 | A2 | P2 | S4 | B1 |
8 | 9.00 | 18.00 | A2 | P2 | S4 | B2 |
9 | 14.00 | 25.00 | A3 | P3 | S5 | B1 |
10 | 11.00 | 18.00 | A3 | P3 | S5 | B2 |
11 | 8.00 | 27.00 | A3 | P3 | S6 | B1 |
12 | 2.00 | 22.00 | A3 | P3 | S6 | B2 |
13 | 8.00 | 17.00 | A1 | P4 | S7 | B1 |
14 | 10.00 | 22.00 | A1 | P4 | S7 | B2 |
15 | 7.00 | 16.00 | A1 | P4 | S8 | B1 |
16 | 12.00 | 13.00 | A1 | P4 | S8 | B2 |
17 | 11.00 | 23.00 | A2 | P5 | S9 | B1 |
18 | 12.00 | 19.00 | A2 | P5 | S9 | B2 |
19 | 12.00 | 23.00 | A2 | P5 | S10 | B1 |
20 | 10.00 | 21.00 | A2 | P5 | S10 | B2 |
21 | 3.00 | 17.00 | A3 | P6 | S11 | B1 |
22 | 11.00 | 16.00 | A3 | P6 | S11 | B2 |
23 | 13.00 | 26.00 | A3 | P6 | S12 | B1 |
24 | 7.00 | 28.00 | A3 | P6 | S12 | B2 |
View code
> set.seed(1)
> data <- expand.grid(Within = paste("B",
+ 1:2, sep = ""), Subplot = paste("S", 1:2,
+ sep = ""), Plot = paste("P", 1:6, sep = ""))
> data$Subplot <- gl(12, 2, 24, lab = paste("S",
+ 1:12, sep = ""))
> data$Between <- gl(3, 4, 24, lab = paste("A",
+ 1:3, sep = ""))
> data$Resp1 <- rpois(24, 10)
> data$Resp2 <- rpois(24, 20)
> data <- with(data, data.frame(Resp1,
+ Resp2, Between, Plot, Subplot, Within))
Widen (cast) Resp1 for repeated measures (Within)
| Between | Plot | Subplot | B1 | B2 |
1 | A1 | P1 | S1 | 8.00 | 10.00 |
2 | A1 | P1 | S2 | 7.00 | 11.00 |
3 | A1 | P4 | S7 | 8.00 | 10.00 |
4 | A1 | P4 | S8 | 7.00 | 12.00 |
5 | A2 | P2 | S3 | 14.00 | 12.00 |
6 | A2 | P2 | S4 | 11.00 | 9.00 |
7 | A2 | P5 | S9 | 11.00 | 12.00 |
8 | A2 | P5 | S10 | 12.00 | 10.00 |
9 | A3 | P3 | S5 | 14.00 | 11.00 |
10 | A3 | P3 | S6 | 8.00 | 2.00 |
11 | A3 | P6 | S11 | 3.00 | 11.00 |
12 | A3 | P6 | S12 | 13.00 | 7.00 |
Via cast
> library(reshape)
> cast(data, Between + Plot + Subplot ~ Within, value = "Resp1")
Between Plot Subplot B1 B2
1 A1 P1 S1 8 10
2 A1 P1 S2 7 11
3 A1 P4 S7 8 10
4 A1 P4 S8 7 12
5 A2 P2 S3 14 12
6 A2 P2 S4 11 9
7 A2 P5 S9 11 12
8 A2 P5 S10 12 10
9 A3 P3 S5 14 11
10 A3 P3 S6 8 2
11 A3 P6 S11 3 11
12 A3 P6 S12 13 7
Widen (cast) both Resp1 and Resp2 for repeated measures
| Between | Plot | Subplot | B1Resp1 | B2Resp1 | B1Resp2 | B2Resp2 |
1 | A1 | P1 | S1 | 8.00 | 10.00 | 17.00 | 18.00 |
2 | A1 | P1 | S2 | 7.00 | 11.00 | 17.00 | 21.00 |
3 | A1 | P4 | S7 | 8.00 | 10.00 | 17.00 | 22.00 |
4 | A1 | P4 | S8 | 7.00 | 12.00 | 16.00 | 13.00 |
5 | A2 | P2 | S3 | 14.00 | 12.00 | 19.00 | 13.00 |
6 | A2 | P2 | S4 | 11.00 | 9.00 | 24.00 | 18.00 |
7 | A2 | P5 | S10 | 12.00 | 10.00 | 23.00 | 21.00 |
8 | A2 | P5 | S9 | 11.00 | 12.00 | 23.00 | 19.00 |
9 | A3 | P3 | S5 | 14.00 | 11.00 | 25.00 | 18.00 |
10 | A3 | P3 | S6 | 8.00 | 2.00 | 27.00 | 22.00 |
11 | A3 | P6 | S11 | 3.00 | 11.00 | 17.00 | 16.00 |
12 | A3 | P6 | S12 | 13.00 | 7.00 | 26.00 | 28.00 |
Via cast
> library(reshape)
> d1 <- cast(data, Between + Plot + Subplot ~ Within, value = "Resp1")
> d2 <- cast(data, Between + Plot + Subplot ~ Within, value = "Resp2")
> merge(d1, d2, by = c("Between", "Plot", "Subplot"), suffixes = c("Resp1", "Resp2"))
Between Plot Subplot B1Resp1 B2Resp1 B1Resp2 B2Resp2
1 A1 P1 S1 8 10 17 18
2 A1 P1 S2 7 11 17 21
3 A1 P4 S7 8 10 17 22
4 A1 P4 S8 7 12 16 13
5 A2 P2 S3 14 12 19 13
6 A2 P2 S4 11 9 24 18
7 A2 P5 S10 12 10 23 21
8 A2 P5 S9 11 12 23 19
9 A3 P3 S5 14 11 25 18
10 A3 P3 S6 8 2 27 22
11 A3 P6 S11 3 11 17 16
12 A3 P6 S12 13 7 26 28
Merging data sets
It is common to have data associated with a particular study organized into a number of separate data tables (databases etc).
In fact, large data sets are best managed in databases. However, statistical analyses generally require all data to be encapsulated
within a single data structure. Therefore, prior to analysis, it is necessary to bring together multiple sources.
This phase of data preparation can be one of the most difficult to get right and verify.
Biological data set (missing Subplot 3) (data.bio)
| Resp1 | Resp2 | Between | Plot | Subplot |
1 | 8.00 | 18.00 | A1 | P1 | S1 |
2 | 10.00 | 21.00 | A1 | P1 | S2 |
4 | 11.00 | 23.00 | A1 | P2 | S4 |
5 | 14.00 | 22.00 | A2 | P3 | S5 |
6 | 12.00 | 24.00 | A2 | P3 | S6 |
7 | 11.00 | 23.00 | A2 | P4 | S7 |
8 | 9.00 | 20.00 | A2 | P4 | S8 |
9 | 14.00 | 11.00 | A3 | P5 | S9 |
10 | 11.00 | 22.00 | A3 | P5 | S10 |
11 | 8.00 | 24.00 | A3 | P6 | S11 |
12 | 2.00 | 16.00 | A3 | P6 | S12 |
View code
> set.seed(1)
> data.bio <- expand.grid(Subplot = paste("S",
+ 1:2, sep = ""), Plot = paste("P", 1:6, sep = ""))
> data.bio$Subplot <- gl(12, 1, 12, lab = paste("S",
+ 1:12, sep = ""))
> data.bio$Between <- gl(3, 4, 12, lab = paste("A",
+ 1:3, sep = ""))
> data.bio$Resp1 <- rpois(12, 10)
> data.bio$Resp2 <- rpois(12, 20)
> data.bio <- with(data.bio, data.frame(Resp1,
+ Resp2, Between, Plot, Subplot))
> data.bio <- data.bio[-3, ]
Physio-chemical data (missing Subplot 7) (data.chem)
| Chem1 | Chem2 | Between | Plot | Subplot |
1 | 1.45 | 0.89 | A1 | P1 | S1 |
2 | 3.27 | 0.18 | A1 | P1 | S2 |
3 | 1.18 | 5.08 | A1 | P2 | S3 |
4 | 13.40 | 1.58 | A1 | P2 | S4 |
5 | 3.78 | 1.62 | A2 | P3 | S5 |
6 | 1.20 | 4.24 | A2 | P3 | S6 |
8 | 5.69 | 2.99 | A2 | P4 | S8 |
9 | 4.83 | 4.13 | A3 | P5 | S9 |
10 | 2.00 | 3.60 | A3 | P5 | S10 |
11 | 12.33 | 1.78 | A3 | P6 | S11 |
12 | 4.01 | 0.23 | A3 | P6 | S12 |
View code
> set.seed(1)
> data.chem <- expand.grid(Subplot = paste("S",
+ 1:2, sep = ""), Plot = paste("P", 1:6, sep = ""))
> data.chem$Subplot <- gl(12, 1, 12,
+ lab = paste("S", 1:12, sep = ""))
> data.chem$Between <- gl(3, 4, 12, lab = paste("A",
+ 1:3, sep = ""))
> data.chem$Chem1 <- rlnorm(12, 1)
> data.chem$Chem2 <- rlnorm(12, 0.5)
> data.chem <- with(data.chem, data.frame(Chem1,
+ Chem2, Between, Plot, Subplot))
> data.chem <- data.chem[-7, ]
Merge bio and chem data (only keep full matches)
| Between | Plot | Subplot | Resp1 | Resp2 | Chem1 | Chem2 |
1 | A1 | P1 | S1 | 8.00 | 18.00 | 1.45 | 0.89 |
2 | A1 | P1 | S2 | 10.00 | 21.00 | 3.27 | 0.18 |
3 | A1 | P2 | S4 | 11.00 | 23.00 | 13.40 | 1.58 |
4 | A2 | P3 | S5 | 14.00 | 22.00 | 3.78 | 1.62 |
5 | A2 | P3 | S6 | 12.00 | 24.00 | 1.20 | 4.24 |
6 | A2 | P4 | S8 | 9.00 | 20.00 | 5.69 | 2.99 |
7 | A3 | P5 | S10 | 11.00 | 22.00 | 2.00 | 3.60 |
8 | A3 | P5 | S9 | 14.00 | 11.00 | 4.83 | 4.13 |
9 | A3 | P6 | S11 | 8.00 | 24.00 | 12.33 | 1.78 |
10 | A3 | P6 | S12 | 2.00 | 16.00 | 4.01 | 0.23 |
Note how both Subplot 3 and 7 are absent.
Via merge
> merge(data.bio, data.chem, by = c("Between", "Plot", "Subplot"))
Between Plot Subplot Resp1 Resp2 Chem1 Chem2
1 A1 P1 S1 8 18 1.453 0.8858
2 A1 P1 S2 10 21 3.266 0.1800
3 A1 P2 S4 11 23 13.400 1.5763
4 A2 P3 S5 14 22 3.779 1.6222
5 A2 P3 S6 12 24 1.197 4.2369
6 A2 P4 S8 9 20 5.688 2.9859
7 A3 P5 S10 11 22 2.003 3.6043
8 A3 P5 S9 14 11 4.835 4.1329
9 A3 P6 S11 8 24 12.327 1.7764
10 A3 P6 S12 2 16 4.014 0.2255
Merge bio and chem data (keep all data)
| Between | Plot | Subplot | Resp1 | Resp2 | Chem1 | Chem2 |
1 | A1 | P1 | S1 | 8.00 | 18.00 | 1.45 | 0.89 |
2 | A1 | P1 | S2 | 10.00 | 21.00 | 3.27 | 0.18 |
3 | A1 | P2 | S3 | | | 1.18 | 5.08 |
4 | A1 | P2 | S4 | 11.00 | 23.00 | 13.40 | 1.58 |
5 | A2 | P3 | S5 | 14.00 | 22.00 | 3.78 | 1.62 |
6 | A2 | P3 | S6 | 12.00 | 24.00 | 1.20 | 4.24 |
7 | A2 | P4 | S7 | 11.00 | 23.00 | | |
8 | A2 | P4 | S8 | 9.00 | 20.00 | 5.69 | 2.99 |
9 | A3 | P5 | S9 | 14.00 | 11.00 | 4.83 | 4.13 |
10 | A3 | P5 | S10 | 11.00 | 22.00 | 2.00 | 3.60 |
11 | A3 | P6 | S11 | 8.00 | 24.00 | 12.33 | 1.78 |
12 | A3 | P6 | S12 | 2.00 | 16.00 | 4.01 | 0.23 |
Note that all Subplots are present, yet there are missing cells that correspond to the missing cases from the BIO and CHEM data sets.
Via merge
> merge(data.bio, data.chem, by = c("Between", "Plot", "Subplot"), all = T)
Between Plot Subplot Resp1 Resp2 Chem1 Chem2
1 A1 P1 S1 8 18 1.453 0.8858
2 A1 P1 S2 10 21 3.266 0.1800
3 A1 P2 S3 NA NA 1.179 5.0781
4 A1 P2 S4 11 23 13.400 1.5763
5 A2 P3 S5 14 22 3.779 1.6222
6 A2 P3 S6 12 24 1.197 4.2369
7 A2 P4 S7 11 23 NA NA
8 A2 P4 S8 9 20 5.688 2.9859
9 A3 P5 S9 14 11 4.835 4.1329
10 A3 P5 S10 11 22 2.003 3.6043
11 A3 P6 S11 8 24 12.327 1.7764
12 A3 P6 S12 2 16 4.014 0.2255
Merge bio and chem data (only keep full BIO matches)
| Between | Plot | Subplot | Resp1 | Resp2 | Chem1 | Chem2 |
1 | A1 | P1 | S1 | 8.00 | 18.00 | 1.45 | 0.89 |
2 | A1 | P1 | S2 | 10.00 | 21.00 | 3.27 | 0.18 |
3 | A1 | P2 | S4 | 11.00 | 23.00 | 13.40 | 1.58 |
4 | A2 | P3 | S5 | 14.00 | 22.00 | 3.78 | 1.62 |
5 | A2 | P3 | S6 | 12.00 | 24.00 | 1.20 | 4.24 |
6 | A2 | P4 | S7 | 11.00 | 23.00 | | |
7 | A2 | P4 | S8 | 9.00 | 20.00 | 5.69 | 2.99 |
8 | A3 | P5 | S9 | 14.00 | 11.00 | 4.83 | 4.13 |
9 | A3 | P5 | S10 | 11.00 | 22.00 | 2.00 | 3.60 |
10 | A3 | P6 | S11 | 8.00 | 24.00 | 12.33 | 1.78 |
11 | A3 | P6 | S12 | 2.00 | 16.00 | 4.01 | 0.23 |
Note how Subplot 3 is absent and Subplot 7 (which was missing from the CHEM data set) is missing CHEM data.
Via merge
> merge(data.bio, data.chem, by = c("Between", "Plot", "Subplot"), all.x = T)
Between Plot Subplot Resp1 Resp2 Chem1 Chem2
1 A1 P1 S1 8 18 1.453 0.8858
2 A1 P1 S2 10 21 3.266 0.1800
3 A1 P2 S4 11 23 13.400 1.5763
4 A2 P3 S5 14 22 3.779 1.6222
5 A2 P3 S6 12 24 1.197 4.2369
6 A2 P4 S7 11 23 NA NA
7 A2 P4 S8 9 20 5.688 2.9859
8 A3 P5 S9 14 11 4.835 4.1329
9 A3 P5 S10 11 22 2.003 3.6043
10 A3 P6 S11 8 24 12.327 1.7764
11 A3 P6 S12 2 16 4.014 0.2255
Merge bio and chem data (only keep full CHEM matches)
| Between | Plot | Subplot | Resp1 | Resp2 | Chem1 | Chem2 |
1 | A1 | P1 | S1 | 8.00 | 18.00 | 1.45 | 0.89 |
2 | A1 | P1 | S2 | 10.00 | 21.00 | 3.27 | 0.18 |
3 | A1 | P2 | S3 | | | 1.18 | 5.08 |
4 | A1 | P2 | S4 | 11.00 | 23.00 | 13.40 | 1.58 |
5 | A2 | P3 | S5 | 14.00 | 22.00 | 3.78 | 1.62 |
6 | A2 | P3 | S6 | 12.00 | 24.00 | 1.20 | 4.24 |
7 | A2 | P4 | S8 | 9.00 | 20.00 | 5.69 | 2.99 |
8 | A3 | P5 | S9 | 14.00 | 11.00 | 4.83 | 4.13 |
9 | A3 | P5 | S10 | 11.00 | 22.00 | 2.00 | 3.60 |
10 | A3 | P6 | S11 | 8.00 | 24.00 | 12.33 | 1.78 |
11 | A3 | P6 | S12 | 2.00 | 16.00 | 4.01 | 0.23 |
Note how Subplot 7 is absent and Subplot 3 (which was missing from the BIO data set) is missing BIO data.
Via merge
> merge(data.bio, data.chem, by = c("Between", "Plot", "Subplot"), all.y = T)
Between Plot Subplot Resp1 Resp2 Chem1 Chem2
1 A1 P1 S1 8 18 1.453 0.8858
2 A1 P1 S2 10 21 3.266 0.1800
3 A1 P2 S3 NA NA 1.179 5.0781
4 A1 P2 S4 11 23 13.400 1.5763
5 A2 P3 S5 14 22 3.779 1.6222
6 A2 P3 S6 12 24 1.197 4.2369
7 A2 P4 S8 9 20 5.688 2.9859
8 A3 P5 S9 14 11 4.835 4.1329
9 A3 P5 S10 11 22 2.003 3.6043
10 A3 P6 S11 8 24 12.327 1.7764
11 A3 P6 S12 2 16 4.014 0.2255
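Given that merging is one of the most difficult phases to get right and verify, it can pay to check the keys before (or after) merging. The sketch below rebuilds only the key columns of data.bio and data.chem (the response variables are omitted for brevity, as they play no role in matching) and uses setdiff() to reveal which Subplots fail to match; as.character() is applied explicitly so the comparison is on labels rather than factor codes.

```r
# Check which keys are present in one data set but absent from the other.
# Only the key columns are rebuilt here; responses are omitted.
data.bio <- expand.grid(Subplot = paste("S", 1:2, sep = ""),
    Plot = paste("P", 1:6, sep = ""))
data.bio$Subplot <- gl(12, 1, 12, lab = paste("S", 1:12, sep = ""))
data.bio <- data.bio[-3, ]    # drop Subplot S3, as in the bio data

data.chem <- expand.grid(Subplot = paste("S", 1:2, sep = ""),
    Plot = paste("P", 1:6, sep = ""))
data.chem$Subplot <- gl(12, 1, 12, lab = paste("S", 1:12, sep = ""))
data.chem <- data.chem[-7, ]  # drop Subplot S7, as in the chem data

# Subplots with chem data but no bio data, and vice versa
setdiff(as.character(data.chem$Subplot), as.character(data.bio$Subplot))  # "S3"
setdiff(as.character(data.bio$Subplot), as.character(data.chem$Subplot))  # "S7"
```

Empty results from both setdiff() calls would confirm that an inner merge loses no rows.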
VLOOKUP in R
Lookup tables provide a way of inserting a column of data into a large data set
such that the entries in the new column are determined by a relational match within another data set (the lookup table).
For example, the main data set might contain data collected from a number of sites (Plots).
Elsewhere we may have a data set that just contains the set of sites and their corresponding latitudes and longitudes (geographical lookup table).
We could incorporate these latitudes and longitudes into the main data set by merging against the geographical lookup table.
In Excel, this is referred to as vlookup, in a relational database it is referred to as a join, and in R it is a merge.
Biological data set (data.bio1)
| Resp1 | Resp2 | Between | Plot | Subplot |
1 | 8.00 | 18.00 | A1 | P1 | S1 |
2 | 10.00 | 21.00 | A1 | P1 | S2 |
3 | 7.00 | 16.00 | A1 | P2 | S3 |
4 | 11.00 | 23.00 | A1 | P2 | S4 |
5 | 14.00 | 22.00 | A2 | P3 | S5 |
6 | 12.00 | 24.00 | A2 | P3 | S6 |
7 | 11.00 | 23.00 | A2 | P4 | S7 |
8 | 9.00 | 20.00 | A2 | P4 | S8 |
9 | 14.00 | 11.00 | A3 | P5 | S9 |
10 | 11.00 | 22.00 | A3 | P5 | S10 |
11 | 8.00 | 24.00 | A3 | P6 | S11 |
12 | 2.00 | 16.00 | A3 | P6 | S12 |
View code
> set.seed(1)
> data.bio1 <- expand.grid(Subplot = paste("S",
+ 1:2, sep = ""), Plot = paste("P", 1:6, sep = ""))
> data.bio1$Subplot <- gl(12, 1, 12,
+ lab = paste("S", 1:12, sep = ""))
> data.bio1$Between <- gl(3, 4, 12, lab = paste("A",
+ 1:3, sep = ""))
> data.bio1$Resp1 <- rpois(12, 10)
> data.bio1$Resp2 <- rpois(12, 20)
> data.bio1 <- with(data.bio1, data.frame(Resp1,
+ Resp2, Between, Plot, Subplot))
Geographical data set (lookup table) (data.geo)
| Plot | LAT | LONG |
1 | P1 | 17.96 | 145.43 |
2 | P2 | 17.52 | 146.20 |
3 | P3 | 17.00 | 146.38 |
4 | P4 | 18.23 | 146.79 |
5 | P5 | 18.98 | 146.03 |
6 | P6 | 20.12 | 146.47 |
View code
> set.seed(1)
> data.geo <- expand.grid(Plot = paste("P",
+ 1:6, sep = ""))
> data.geo$LAT <- c(17.9605, 17.521,
+ 17.0011, 18.235, 18.984, 20.1154)
> data.geo$LONG <- c(145.4326, 146.1983,
+ 146.3839, 146.7934, 146.0345, 146.4672)
Incorporate (merge) the lat/longs into the bio data
| Plot | Resp1 | Resp2 | Between | Subplot | LAT | LONG |
1 | P1 | 8.00 | 18.00 | A1 | S1 | 17.96 | 145.43 |
2 | P1 | 10.00 | 21.00 | A1 | S2 | 17.96 | 145.43 |
3 | P2 | 7.00 | 16.00 | A1 | S3 | 17.52 | 146.20 |
4 | P2 | 11.00 | 23.00 | A1 | S4 | 17.52 | 146.20 |
5 | P3 | 14.00 | 22.00 | A2 | S5 | 17.00 | 146.38 |
6 | P3 | 12.00 | 24.00 | A2 | S6 | 17.00 | 146.38 |
7 | P4 | 11.00 | 23.00 | A2 | S7 | 18.23 | 146.79 |
8 | P4 | 9.00 | 20.00 | A2 | S8 | 18.23 | 146.79 |
9 | P5 | 14.00 | 11.00 | A3 | S9 | 18.98 | 146.03 |
10 | P5 | 11.00 | 22.00 | A3 | S10 | 18.98 | 146.03 |
11 | P6 | 8.00 | 24.00 | A3 | S11 | 20.12 | 146.47 |
12 | P6 | 2.00 | 16.00 | A3 | S12 | 20.12 | 146.47 |
Note how the latitudes and longitudes are duplicated so as to match the Plot listings.
Via merge
> merge(data.bio1, data.geo, by = "Plot")
Plot Resp1 Resp2 Between Subplot LAT LONG
1 P1 8 18 A1 S1 17.96 145.4
2 P1 10 21 A1 S2 17.96 145.4
3 P2 7 16 A1 S3 17.52 146.2
4 P2 11 23 A1 S4 17.52 146.2
5 P3 14 22 A2 S5 17.00 146.4
6 P3 12 24 A2 S6 17.00 146.4
7 P4 11 23 A2 S7 18.23 146.8
8 P4 9 20 A2 S8 18.23 146.8
9 P5 14 11 A3 S9 18.98 146.0
10 P5 11 22 A3 S10 18.98 146.0
11 P6 8 24 A3 S11 20.12 146.5
12 P6 2 16 A3 S12 20.12 146.5
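For simple lookups, match() is an alternative to merge() that mirrors Excel's VLOOKUP most directly and, unlike merge(), preserves the original row order of the main data set. The sketch below rebuilds the key column of data.bio1 and the data.geo lookup table from the code boxes above (response variables omitted for brevity).

```r
# VLOOKUP-style insertion of lat/longs via match() rather than merge().
# Rebuild the relevant pieces of data.bio1 and data.geo from above.
data.bio1 <- expand.grid(Subplot = paste("S", 1:2, sep = ""),
    Plot = paste("P", 1:6, sep = ""))
data.bio1$Subplot <- gl(12, 1, 12, lab = paste("S", 1:12, sep = ""))
data.geo <- data.frame(Plot = paste("P", 1:6, sep = ""),
    LAT = c(17.9605, 17.521, 17.0011, 18.235, 18.984, 20.1154),
    LONG = c(145.4326, 146.1983, 146.3839, 146.7934, 146.0345, 146.4672))

# match() returns, for each Plot in data.bio1, its row index in data.geo;
# indexing LAT/LONG by these positions duplicates the values as required
idx <- match(data.bio1$Plot, data.geo$Plot)
data.bio1$LAT <- data.geo$LAT[idx]
data.bio1$LONG <- data.geo$LONG[idx]
head(data.bio1)
```

The choice between the two comes down to row handling: merge() sorts the result by the key columns, whereas match() leaves the main data set in its original order.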
Aggregating data sets
Data are often collected and transcribed at finer temporal/spatial scales and with greater fidelity than is required for
all analyses. Therefore an important phase of data preparation is also to summarize the data into
the spatial/temporal scales appropriate for the desired graphical and statistical analyses.
Example data set (data.1)
| Resp1 | Resp2 | Between | Plot |
1 | 8.00 | 22.00 | A1 | P1 |
2 | 10.00 | 18.00 | A1 | P2 |
3 | 7.00 | 26.00 | A2 | P3 |
4 | 11.00 | 21.00 | A2 | P4 |
5 | 14.00 | 17.00 | A3 | P5 |
6 | 12.00 | 18.00 | A3 | P6 |
View code
> set.seed(1)
> data.1 <- expand.grid(Plot = paste("P",
+ 1:6, sep = ""))
> data.1$Between <- gl(3, 2, 6, lab = paste("A",
+ 1:3, sep = ""))
> data.1$Resp1 <- rpois(6, 10)
> data.1$Resp2 <- rpois(6, 20)
> data.1 <- with(data.1, data.frame(Resp1,
+ Resp2, Between, Plot))
The major functions useful for summarizing/aggregating are listed below.
To illustrate the differences between these functions, a print statement is issued within the function
so as to explore the type and contents of objects used internally within the function.
The result of a mean is returned by each function, again to enable us to compare the outcome of each function.
> tapply(data.1$Resp1, data.1$Between, function(x) {
+ print(x)
+ return(mean(x))
+ })
[1] 8 10
[1] 7 11
[1] 14 12
> library(reshape)
> cast(data.1, ~Between, value = "Resp1", fun = function(x) {
+ print(x)
+ return(mean(x))
+ })
numeric(0)
[1] 8 10
[1] 7 11
[1] 14 12
value A1 A2 A3
1 (all) 9 9 13
> library(plyr)
> ddply(data.1, ~Between, function(x) {
+ print(x)
+ return(mean(x))
+ })
Resp1 Resp2 Between Plot
1 8 22 A1 P1
2 10 18 A1 P2
Resp1 Resp2 Between Plot
3 7 26 A2 P3
4 11 21 A2 P4
Resp1 Resp2 Between Plot
5 14 17 A3 P5
6 12 18 A3 P6
Resp1 Resp2 Between Plot
1 9 20.0 NA NA
2 9 23.5 NA NA
3 13 17.5 NA NA
> library(doBy)
> f1 <- function(x, ...) {
+ print(x)
+ m = mean(x, ...)
+ }
> summaryBy(Resp1 ~ Between, data.1, FUN = f1)
[1] 8 10
[1] 7 11
[1] 14 12
Between Resp1.f1
1 A1 9
2 A2 9
3 A3 13
The main features of each of the above functions are compared and contrasted in the following table.
| Variables passed | Passed as | Functions | Returns |
tapply | Single specified | Vectors | Single | Matrix |
cast | Single specified | Vectors | Multiples as list | Dataframe |
summaryBy | All specified | Vectors | Multiples as list | Dataframe |
ddply | All | Dataframes | Defined per variable | Dataframe |
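Base R's aggregate() function (not listed in the table above) is another option: like summaryBy, it accepts a formula, receives the variables as vectors, and returns a dataframe, although only a single function can be applied per call. A brief sketch using data.1, rebuilt from the code box above:

```r
# Aggregation with base R's aggregate() -- formula interface,
# one summary function per call, dataframe returned.
set.seed(1)
data.1 <- expand.grid(Plot = paste("P", 1:6, sep = ""))
data.1$Between <- gl(3, 2, 6, lab = paste("A", 1:3, sep = ""))
data.1$Resp1 <- rpois(6, 10)
data.1$Resp2 <- rpois(6, 20)

aggregate(Resp1 ~ Between, data = data.1, FUN = mean)
#   Between Resp1
# 1      A1     9
# 2      A2     9
# 3      A3    13

# cbind() on the left-hand side aggregates several responses at once
aggregate(cbind(Resp1, Resp2) ~ Between, data = data.1, FUN = mean)
```

The group means (9, 9, 13) agree with the tapply() and cast() results above.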
The following data set will be used to illustrate the principles of data aggregation in R.
A typical data set (data)
| Resp1 | Resp2 | Between | Plot | Subplot | Within |
1 | 8.00 | 17.00 | A1 | P1 | S1 | B1 |
2 | 10.00 | 18.00 | A1 | P1 | S1 | B2 |
3 | 7.00 | 17.00 | A1 | P1 | S2 | B1 |
4 | 11.00 | 21.00 | A1 | P1 | S2 | B2 |
5 | 14.00 | 19.00 | A2 | P2 | S3 | B1 |
6 | 12.00 | 13.00 | A2 | P2 | S3 | B2 |
7 | 11.00 | 24.00 | A2 | P2 | S4 | B1 |
8 | 9.00 | 18.00 | A2 | P2 | S4 | B2 |
9 | 14.00 | 25.00 | A3 | P3 | S5 | B1 |
10 | 11.00 | 18.00 | A3 | P3 | S5 | B2 |
11 | 8.00 | 27.00 | A3 | P3 | S6 | B1 |
12 | 2.00 | 22.00 | A3 | P3 | S6 | B2 |
13 | 8.00 | 17.00 | A1 | P4 | S7 | B1 |
14 | 10.00 | 22.00 | A1 | P4 | S7 | B2 |
15 | 7.00 | 16.00 | A1 | P4 | S8 | B1 |
16 | 12.00 | 13.00 | A1 | P4 | S8 | B2 |
17 | 11.00 | 23.00 | A2 | P5 | S9 | B1 |
18 | 12.00 | 19.00 | A2 | P5 | S9 | B2 |
19 | 12.00 | 23.00 | A2 | P5 | S10 | B1 |
20 | 10.00 | 21.00 | A2 | P5 | S10 | B2 |
21 | 3.00 | 17.00 | A3 | P6 | S11 | B1 |
22 | 11.00 | 16.00 | A3 | P6 | S11 | B2 |
23 | 13.00 | 26.00 | A3 | P6 | S12 | B1 |
24 | 7.00 | 28.00 | A3 | P6 | S12 | B2 |
View code
> library(xtable)
> set.seed(1)
> data <- expand.grid(Within = paste("B",
+ 1:2, sep = ""), Subplot = paste("S", 1:2,
+ sep = ""), Plot = paste("P", 1:6, sep = ""))
> data$Subplot <- gl(12, 2, 24, lab = paste("S",
+ 1:12, sep = ""))
> data$Between <- gl(3, 4, 24, lab = paste("A",
+ 1:3, sep = ""))
> data$Resp1 <- rpois(24, 10)
> data$Resp2 <- rpois(24, 20)
> data <- with(data, data.frame(Resp1,
+ Resp2, Between, Plot, Subplot, Within))
Pivot tables, data summaries
Resp1 means (Between)
| A1 | A2 | A3 |
means | 9.12 | 11.38 | 8.62 |
Via tapply
> tapply(data$Resp1, data$Between, mean)
A1 A2 A3
9.125 11.375 8.625
Resp1 means and standard deviations (Between)
| Between | mean | sd |
1 | A1 | 9.12 | 1.89 |
2 | A2 | 11.38 | 1.51 |
3 | A3 | 8.62 | 4.44 |
Via cast
> library(reshape)
> cast(data, Between ~ ., value = "Resp1", fun = c(mean, sd))
Between mean sd
1 A1 9.125 1.885
2 A2 11.375 1.506
3 A3 8.625 4.438
Via summaryBy
> library(doBy)
> summaryBy(Resp1 ~ Between, data, FUN = list(mean, sd))
Between Resp1.mean Resp1.sd
1 A1 9.125 1.885
2 A2 11.375 1.506
3 A3 8.625 4.438
Resp1 means and standard deviations (Between/Within)
| Between | B1_mean | B1_sd | B2_mean | B2_sd |
1 | A1 | 7.50 | 0.58 | 10.75 | 0.96 |
2 | A2 | 12.00 | 1.41 | 10.75 | 1.50 |
3 | A3 | 9.50 | 5.07 | 7.75 | 4.27 |
Via cast
> library(reshape)
> cast(data, Between ~ Within, value = "Resp1",
+ fun = c(mean, sd))
Between B1_mean B1_sd B2_mean B2_sd
1 A1 7.5 0.5774 10.75 0.9574
2 A2 12.0 1.4142 10.75 1.5000
3 A3 9.5 5.0662 7.75 4.2720
Resp1 means for each Between/Within combination
| Between | Within | Resp1_means |
1 | A1 | B1 | 7.50 |
2 | A1 | B2 | 10.75 |
3 | A2 | B1 | 12.00 |
4 | A2 | B2 | 10.75 |
5 | A3 | B1 | 9.50 |
6 | A3 | B2 | 7.75 |
Via ddply
> library(plyr)
> ddply(data, ~Between + Within, summarize, Resp1_means = mean(Resp1))
Between Within Resp1_means
1 A1 B1 7.50
2 A1 B2 10.75
3 A2 B1 12.00
4 A2 B2 10.75
5 A3 B1 9.50
6 A3 B2 7.75
Via summaryBy
> library(doBy)
> summaryBy(Resp1 ~ Between + Within, data = data)
Between Within Resp1.mean
1 A1 B1 7.50
2 A1 B2 10.75
3 A2 B1 12.00
4 A2 B2 10.75
5 A3 B1 9.50
6 A3 B2 7.75
Resp1 and Resp2 means for each Between/Within combination
| Between | Within | Resp1 | Resp2 |
1 | A1 | B1 | 7.50 | 16.75 |
2 | A1 | B2 | 10.75 | 18.50 |
3 | A2 | B1 | 12.00 | 22.25 |
4 | A2 | B2 | 10.75 | 17.75 |
5 | A3 | B1 | 9.50 | 23.75 |
6 | A3 | B2 | 7.75 | 21.00 |
Via ddply
> library(plyr)
> ddply(data, ~Between + Within, summarize, Resp1 = mean(Resp1),
+ Resp2 = mean(Resp2))
Between Within Resp1 Resp2
1 A1 B1 7.50 16.75
2 A1 B2 10.75 18.50
3 A2 B1 12.00 22.25
4 A2 B2 10.75 17.75
5 A3 B1 9.50 23.75
6 A3 B2 7.75 21.00
Via summaryBy
> library(doBy)
> summaryBy(Resp1 + Resp2 ~ Between + Within, data = data)
Between Within Resp1.mean Resp2.mean
1 A1 B1 7.50 16.75
2 A1 B2 10.75 18.50
3 A2 B1 12.00 22.25
4 A2 B2 10.75 17.75
5 A3 B1 9.50 23.75
6 A3 B2 7.75 21.00
Resp1 and Resp2 means for each Between/Within combination in long format
| Between | Within | variable | value |
1 | A1 | B1 | Resp1 | 7.50 |
2 | A1 | B2 | Resp1 | 10.75 |
3 | A2 | B1 | Resp1 | 12.00 |
4 | A2 | B2 | Resp1 | 10.75 |
5 | A3 | B1 | Resp1 | 9.50 |
6 | A3 | B2 | Resp1 | 7.75 |
7 | A1 | B1 | Resp2 | 16.75 |
8 | A1 | B2 | Resp2 | 18.50 |
9 | A2 | B1 | Resp2 | 22.25 |
10 | A2 | B2 | Resp2 | 17.75 |
11 | A3 | B1 | Resp2 | 23.75 |
12 | A3 | B2 | Resp2 | 21.00 |
Via ddply
> library(plyr)
> library(reshape)
> melt(ddply(data, ~Between + Within, summarize, Resp1 = mean(Resp1),
+ Resp2 = mean(Resp2)), measure.vars = c("Resp1", "Resp2"),
+ variable_name = "Resp")
Between Within Resp value
1 A1 B1 Resp1 7.50
2 A1 B2 Resp1 10.75
3 A2 B1 Resp1 12.00
4 A2 B2 Resp1 10.75
5 A3 B1 Resp1 9.50
6 A3 B2 Resp1 7.75
7 A1 B1 Resp2 16.75
8 A1 B2 Resp2 18.50
9 A2 B1 Resp2 22.25
10 A2 B2 Resp2 17.75
11 A3 B1 Resp2 23.75
12 A3 B2 Resp2 21.00
Resp1 mean and standard deviation for each Between/Within combination
| Between | Within | means | sds |
1 | A1 | B1 | 7.50 | 0.58 |
2 | A1 | B2 | 10.75 | 0.96 |
3 | A2 | B1 | 12.00 | 1.41 |
4 | A2 | B2 | 10.75 | 1.50 |
5 | A3 | B1 | 9.50 | 5.07 |
6 | A3 | B2 | 7.75 | 4.27 |
Via ddply
> library(plyr)
> ddply(data, ~Between + Within, summarize, means = mean(Resp1),
+ sds = sd(Resp1))
Between Within means sds
1 A1 B1 7.50 0.5774
2 A1 B2 10.75 0.9574
3 A2 B1 12.00 1.4142
4 A2 B2 10.75 1.5000
5 A3 B1 9.50 5.0662
6 A3 B2 7.75 4.2720
Via summaryBy
> library(doBy)
> summaryBy(Resp1 ~ Between + Within, data = data, FUN = list(mean,
+ sd))
Between Within Resp1.mean Resp1.sd
1 A1 B1 7.50 0.5774
2 A1 B2 10.75 0.9574
3 A2 B1 12.00 1.4142
4 A2 B2 10.75 1.5000
5 A3 B1 9.50 5.0662
6 A3 B2 7.75 4.2720
Resp1 and Resp2 means and standard deviations (Between/Within) wide format
| Between | Within | Resp1 | SD1 | Resp2 | SD2 |
1 | A1 | B1 | 7.50 | 0.58 | 16.75 | 0.50 |
2 | A1 | B2 | 10.75 | 0.96 | 18.50 | 4.04 |
3 | A2 | B1 | 12.00 | 1.41 | 22.25 | 2.22 |
4 | A2 | B2 | 10.75 | 1.50 | 17.75 | 3.40 |
5 | A3 | B1 | 9.50 | 5.07 | 23.75 | 4.57 |
6 | A3 | B2 | 7.75 | 4.27 | 21.00 | 5.29 |
Via ddply
> library(plyr)
> ddply(data, ~Between + Within, function(df) data.frame(Resp1 = mean(df$Resp1),
+ SD1 = sd(df$Resp1), Resp2 = mean(df$Resp2), SD2 = sd(df$Resp2)))
Between Within Resp1 SD1 Resp2 SD2
1 A1 B1 7.50 0.5774 16.75 0.500
2 A1 B2 10.75 0.9574 18.50 4.041
3 A2 B1 12.00 1.4142 22.25 2.217
4 A2 B2 10.75 1.5000 17.75 3.403
5 A3 B1 9.50 5.0662 23.75 4.573
6 A3 B2 7.75 4.2720 21.00 5.292
Via summaryBy
> library(doBy)
> summaryBy(Resp1 + Resp2 ~ Between + Within, data = data,
+ FUN = list(mean, sd))
Between Within Resp1.mean Resp2.mean Resp1.sd Resp2.sd
1 A1 B1 7.50 16.75 0.5774 0.500
2 A1 B2 10.75 18.50 0.9574 4.041
3 A2 B1 12.00 22.25 1.4142 2.217
4 A2 B2 10.75 17.75 1.5000 3.403
5 A3 B1 9.50 23.75 5.0662 4.573
6 A3 B2 7.75 21.00 4.2720 5.292
Resp1 and Resp2 means and standard deviations (Between/Within) long format
| Between | Within | Resp | Mean | Sd |
1 | A1 | B1 | Resp1 | 7.50 | 0.58 |
2 | A1 | B1 | Resp2 | 16.75 | 0.50 |
3 | A1 | B2 | Resp1 | 10.75 | 0.96 |
4 | A1 | B2 | Resp2 | 18.50 | 4.04 |
5 | A2 | B1 | Resp1 | 12.00 | 1.41 |
6 | A2 | B1 | Resp2 | 22.25 | 2.22 |
7 | A2 | B2 | Resp1 | 10.75 | 1.50 |
8 | A2 | B2 | Resp2 | 17.75 | 3.40 |
9 | A3 | B1 | Resp1 | 9.50 | 5.07 |
10 | A3 | B1 | Resp2 | 23.75 | 4.57 |
11 | A3 | B2 | Resp1 | 7.75 | 4.27 |
12 | A3 | B2 | Resp2 | 21.00 | 5.29 |
Via ddply
> library(plyr)
> library(reshape2)
> d1 <- ddply(data, ~Between + Within,
+ function(df) melt(data.frame(Resp1 = mean(df$Resp1),
+ Resp2 = mean(df$Resp2)), var = "Resp",
+ value.name = "Mean"))
> d2 <- ddply(data, ~Between + Within,
+ function(df) melt(data.frame(Resp1 = sd(df$Resp1),
+ Resp2 = sd(df$Resp2)), var = "Resp", value.name = "Sd"))
> d3 <- merge(d1, d2)
> d3
Between Within Resp Mean Sd
1 A1 B1 Resp1 7.50 0.5774
2 A1 B1 Resp2 16.75 0.5000
3 A1 B2 Resp1 10.75 0.9574
4 A1 B2 Resp2 18.50 4.0415
5 A2 B1 Resp1 12.00 1.4142
6 A2 B1 Resp2 22.25 2.2174
7 A2 B2 Resp1 10.75 1.5000
8 A2 B2 Resp2 17.75 3.4034
9 A3 B1 Resp1 9.50 5.0662
10 A3 B1 Resp2 23.75 4.5735
11 A3 B2 Resp1 7.75 4.2720
12 A3 B2 Resp2 21.00 5.2915
Hierarchical aggregations
It is useful to be able to aggregate the data to a different level of replication.
Two common reasons are:
- Exploratory data analysis: assessing normality and homogeneity of variance on residuals at the appropriate level of replication for a given test
- Calculating observations (typically means of subreplicates) appropriate for the desired scale of the analysis
Continuing on with the same base dataset from the previous examples, we may wish to aggregate to the level of Plots or even Subplots.
Warning: gsummary() in the nlme package appears unable to resolve the base scope
if functions such as mean are masked by other packages.
Consequently, when specifying the FUN parameter in gsummary, it is a good idea
to include the namespace, e.g. base:::mean instead of just mean.
Similarly, if the lme4 package has been loaded after the nlme package,
it will mask many of nlme's functions. It is then also necessary to
include the namespace in the call to the function itself (nlme:::gsummary()).
Plot means (ignore the Subplot and Within columns as these are both at a scale within Plots).
| Plot | Between | Resp1 | Resp2 |
1 | P1 | A1 | 9.00 | 18.25 |
2 | P2 | A2 | 11.50 | 18.50 |
3 | P3 | A3 | 8.75 | 23.00 |
4 | P4 | A1 | 9.25 | 17.00 |
5 | P5 | A2 | 11.25 | 21.50 |
6 | P6 | A3 | 8.50 | 21.75 |
Via ddply and summarize
> library(plyr)
> ddply(data, ~Plot, summarize, Between = unique(Between),
+ Resp1 = mean(Resp1), Resp2 = mean(Resp2))
Plot Between Resp1 Resp2
1 P1 A1 9.00 18.25
2 P2 A2 11.50 18.50
3 P3 A3 8.75 23.00
4 P4 A1 9.25 17.00
5 P5 A2 11.25 21.50
6 P6 A3 8.50 21.75
Via summaryBy
> library(doBy)
> summaryBy(Resp1 + Resp2 ~ Between +
+ Plot, data = data, FUN = mean)
Between Plot Resp1.mean Resp2.mean
1 A1 P1 9.00 18.25
2 A1 P4 9.25 17.00
3 A2 P2 11.50 18.50
4 A2 P5 11.25 21.50
5 A3 P3 8.75 23.00
6 A3 P6 8.50 21.75
Via gsummary
> library(nlme)
> nlme:::gsummary(data, form = ~Plot,
+ base:::mean)
Resp1 Resp2 Between Plot Subplot Within
P1 9.00 18.25 A1 P1 S1 B1
P2 11.50 18.50 A2 P2 S3 B1
P3 8.75 23.00 A3 P3 S5 B1
P4 9.25 17.00 A1 P4 S7 B1
P5 11.25 21.50 A2 P5 S10 B1
P6 8.50 21.75 A3 P6 S11 B1
Subplot means (ignore the Within column as this is at a scale within Subplots).
| Plot | Subplot | Between | Resp1 | Resp2 |
1 | P1 | S1 | A1 | 9.00 | 17.50 |
2 | P1 | S2 | A1 | 9.00 | 19.00 |
3 | P2 | S3 | A2 | 13.00 | 16.00 |
4 | P2 | S4 | A2 | 10.00 | 21.00 |
5 | P3 | S5 | A3 | 12.50 | 21.50 |
6 | P3 | S6 | A3 | 5.00 | 24.50 |
7 | P4 | S7 | A1 | 9.00 | 19.50 |
8 | P4 | S8 | A1 | 9.50 | 14.50 |
9 | P5 | S9 | A2 | 11.50 | 21.00 |
10 | P5 | S10 | A2 | 11.00 | 22.00 |
11 | P6 | S11 | A3 | 7.00 | 16.50 |
12 | P6 | S12 | A3 | 10.00 | 27.00 |
|
Via ddply and summarize
> library(plyr)
> ddply(data, ~Plot + Subplot, summarize,
+ Between = unique(Between), Resp1 = mean(Resp1),
+ Resp2 = mean(Resp2))
Plot Subplot Between Resp1 Resp2
1 P1 S1 A1 9.0 17.5
2 P1 S2 A1 9.0 19.0
3 P2 S3 A2 13.0 16.0
4 P2 S4 A2 10.0 21.0
5 P3 S5 A3 12.5 21.5
6 P3 S6 A3 5.0 24.5
7 P4 S7 A1 9.0 19.5
8 P4 S8 A1 9.5 14.5
9 P5 S9 A2 11.5 21.0
10 P5 S10 A2 11.0 22.0
11 P6 S11 A3 7.0 16.5
12 P6 S12 A3 10.0 27.0
Via summaryBy
> library(doBy)
> summaryBy(Resp1 + Resp2 ~ Between +
+ Plot + Subplot, data = data, FUN = mean)
Between Plot Subplot Resp1.mean Resp2.mean
1 A1 P1 S1 9.0 17.5
2 A1 P1 S2 9.0 19.0
3 A1 P4 S7 9.0 19.5
4 A1 P4 S8 9.5 14.5
5 A2 P2 S3 13.0 16.0
6 A2 P2 S4 10.0 21.0
7 A2 P5 S9 11.5 21.0
8 A2 P5 S10 11.0 22.0
9 A3 P3 S5 12.5 21.5
10 A3 P3 S6 5.0 24.5
11 A3 P6 S11 7.0 16.5
12 A3 P6 S12 10.0 27.0
Via gsummary
> library(nlme)
> nlme:::gsummary(data, form = ~Plot/Subplot,
+ base:::mean)
Resp1 Resp2 Between Plot Subplot Within
P1/S1 9.0 17.5 A1 P1 S1 B1
P1/S2 9.0 19.0 A1 P1 S2 B1
P2/S3 13.0 16.0 A2 P2 S3 B1
P2/S4 10.0 21.0 A2 P2 S4 B1
P3/S5 12.5 21.5 A3 P3 S5 B1
P3/S6 5.0 24.5 A3 P3 S6 B1
P4/S7 9.0 19.5 A1 P4 S7 B1
P4/S8 9.5 14.5 A1 P4 S8 B1
P5/S9 11.5 21.0 A2 P5 S9 B1
P5/S10 11.0 22.0 A2 P5 S10 B1
P6/S11 7.0 16.5 A3 P6 S11 B1
P6/S12 10.0 27.0 A3 P6 S12 B1
|
This section will make use of the following data set:
A very simple dataset (data.s) | |
| Between | Plot | Resp1 | Resp2 |
1 | A1 | P1 | 8.00 | 13.00 |
2 | A1 | P2 | 10.00 | 22.00 |
3 | A2 | P3 | 7.00 | 23.00 |
4 | A2 | P4 | 11.00 | 22.00 |
|
View code
> set.seed(1)
> data.s <- expand.grid(Plot = paste("P",
+ 1:4, sep = ""))
> data.s$Between <- gl(2, 2, 4, lab = paste("A",
+ 1:2, sep = ""))
> data.s <- with(data.s, data.frame(Between,
+ Plot, Resp1 = rpois(4, 10), Resp2 = rpois(4,
+ 20)))
|
As far as performing transformations within dataframes goes, there are several main options;
> data.s$logResp1 <- log(data.s$Resp1)
> data.s$logResp2 <- log(data.s$Resp2)
> data.s
Between Plot Resp1 Resp2 logResp1 logResp2
1 A1 P1 8 13 2.079 2.565
2 A1 P2 10 22 2.303 3.091
3 A2 P3 7 23 1.946 3.135
4 A2 P4 11 22 2.398 3.091
> library(plyr)
> mutate(data.s, logResp1 = log(Resp1),
+ logRes2 = log(Resp2))
Between Plot Resp1 Resp2 logResp1 logRes2
1 A1 P1 8 13 2.079 2.565
2 A1 P2 10 22 2.303 3.091
3 A2 P3 7 23 1.946 3.135
4 A2 P4 11 22 2.398 3.091
> library(doBy)
> summaryBy(Resp1 + Resp2 ~ ., data.s,
+ FUN = list(log), id = ~Resp1 + Resp2)
Between Plot Resp1.log Resp2.log Resp1 Resp2
1 A1 P1 2.079 2.565 8 13
2 A1 P2 2.303 3.091 10 22
3 A2 P3 1.946 3.135 7 23
4 A2 P4 2.398 3.091 11 22
The difference between transform and mutate is that mutate builds the new columns sequentially, and therefore
can derive new columns using columns created earlier in the same call.
summaryBy returns a new dataframe rather than building upon the old one.
The main features of each of the above functions are compared and contrasted in the following table.
| Transforms at a time | Returns | Notes |
manually | Single | Single vectors | |
transform | Multiple | New dataframe | Somewhat similar to within(). |
mutate | Multiple | New dataframe | Builds columns sequentially, so new columns can use columns created earlier in the same call. |
summaryBy | Multiple | New dataframe | |
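The sequential behaviour described for mutate can also be sketched in base R with within(), which likewise evaluates its expressions in order (the toy dataframe below is hypothetical, not one of the workshop datasets); transform(), by contrast, only sees the original columns:

```r
d <- data.frame(Resp1 = c(8, 10, 7, 11))

# within() evaluates sequentially, so logResp1 can be reused immediately
d2 <- within(d, {
    logResp1 <- log(Resp1)
    clogResp1 <- logResp1 - mean(logResp1)
})

# transform() evaluates all arguments against the original columns, so the
# reference to the just-created logResp1 fails
res <- try(transform(d, logResp1 = log(Resp1),
    clogResp1 = logResp1 - mean(logResp1)), silent = TRUE)
```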
Scale (log) transform a single variable - Resp1 | |
---|
|
Via transform
> transform(data.s, logResp1 = log(Resp1))
Between Plot Resp1 Resp2 logResp1
1 A1 P1 8 13 2.079
2 A1 P2 10 22 2.303
3 A2 P3 7 23 1.946
4 A2 P4 11 22 2.398
Via mutate
> library(plyr)
> mutate(data.s, logResp1 = log(Resp1))
Between Plot Resp1 Resp2 logResp1
1 A1 P1 8 13 2.079
2 A1 P2 10 22 2.303
3 A2 P3 7 23 1.946
4 A2 P4 11 22 2.398
|
Scale (log) transform multiple variables - Resp1 and Resp2 | |
---|
|
Via transform
> transform(data.s, logResp1 = log(Resp1),
+ logResp2 = log(Resp2))
Between Plot Resp1 Resp2 logResp1 logResp2
1 A1 P1 8 13 2.079 2.565
2 A1 P2 10 22 2.303 3.091
3 A2 P3 7 23 1.946 3.135
4 A2 P4 11 22 2.398 3.091
Via mutate
> library(plyr)
> mutate(data.s, logResp1 = log(Resp1),
+ logResp2 = log(Resp2))
Between Plot Resp1 Resp2 logResp1 logResp2
1 A1 P1 8 13 2.079 2.565
2 A1 P2 10 22 2.303 3.091
3 A2 P3 7 23 1.946 3.135
4 A2 P4 11 22 2.398 3.091
>
>
> VAR <- c("Resp1", "Resp2")
> eval(parse(text = paste("mutate(data.s,",
+ paste("log", VAR, "=log(", VAR, ")", sep = "",
+ collapse = ","), ")", sep = "")))
Between Plot Resp1 Resp2 logResp1 logResp2
1 A1 P1 8 13 2.079 2.565
2 A1 P2 10 22 2.303 3.091
3 A2 P3 7 23 1.946 3.135
4 A2 P4 11 22 2.398 3.091
Via summaryBy
> library(doBy)
> summaryBy(Resp1 + Resp2 ~ ., data.s,
+ FUN = list(log), id = ~Resp1 + Resp2)
Between Plot Resp1.log Resp2.log Resp1 Resp2
1 A1 P1 2.079 2.565 8 13
2 A1 P2 2.303 3.091 10 22
3 A2 P3 1.946 3.135 7 23
4 A2 P4 2.398 3.091 11 22
>
> summaryBy(Resp1 + Resp2 ~ ., data.s,
+ FUN = list(c, log))
Between Plot Resp1.c Resp2.c Resp1.log Resp2.log
1 A1 P1 8 13 2.079 2.565
2 A1 P2 10 22 2.303 3.091
3 A2 P3 7 23 1.946 3.135
4 A2 P4 11 22 2.398 3.091
|
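If the eval(parse()) construction above feels fragile, a plain loop over the variable names achieves the same result in base R (sketched on a hypothetical copy of the data):

```r
d <- data.frame(Resp1 = c(8, 10, 7, 11), Resp2 = c(13, 22, 23, 22))
VAR <- c("Resp1", "Resp2")

# build each log column programmatically, without parsing strings as code
for (v in VAR) d[[paste("log", v, sep = "")]] <- log(d[[v]])

names(d)   # "Resp1" "Resp2" "logResp1" "logResp2"
```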
Deriving new variables
Center a single variable (Resp2), centered within the entire variable | |
---|
|
Via transform
> transform(data.s, cResp2 = Resp2 - mean(Resp2))
Between Plot Resp1 Resp2 cResp2
1 A1 P1 8 13 -7
2 A1 P2 10 22 2
3 A2 P3 7 23 3
4 A2 P4 11 22 2
Via mutate
> library(plyr)
> mutate(data.s, cResp2 = Resp2 - mean(Resp2))
Between Plot Resp1 Resp2 cResp2
1 A1 P1 8 13 -7
2 A1 P2 10 22 2
3 A2 P3 7 23 3
4 A2 P4 11 22 2
|
Center a single variable (Resp2), centered within each level of the between variable | |
---|
| Between | Plot | Resp1 | Resp2 | cResp2 |
1 | A1 | P1 | 8.00 | 13.00 | -4.50 |
2 | A1 | P2 | 10.00 | 22.00 | 4.50 |
3 | A2 | P3 | 7.00 | 23.00 | 0.50 |
4 | A2 | P4 | 11.00 | 22.00 | -0.50 |
|
Via ddply and transform or mutate
> library(plyr)
> ddply(data.s, ~Between, transform,
+ cResp2 = Resp2 - mean(Resp2))
Between Plot Resp1 Resp2 cResp2
1 A1 P1 8 13 -4.5
2 A1 P2 10 22 4.5
3 A2 P3 7 23 0.5
4 A2 P4 11 22 -0.5
>
> ddply(data.s, ~Between, mutate, cResp2 = Resp2 -
+ mean(Resp2))
Between Plot Resp1 Resp2 cResp2
1 A1 P1 8 13 -4.5
2 A1 P2 10 22 4.5
3 A2 P3 7 23 0.5
4 A2 P4 11 22 -0.5
|
Difference in means centered within Between (derivative of derivatives) | |
---|
| Between | Plot | Resp1 | Resp2 | cResp1 | cResp2 | cDiff |
1 | A1 | P1 | 8.00 | 13.00 | -1.00 | -4.50 | 3.50 |
2 | A1 | P2 | 10.00 | 22.00 | 1.00 | 4.50 | -3.50 |
3 | A2 | P3 | 7.00 | 23.00 | -2.00 | 0.50 | -2.50 |
4 | A2 | P4 | 11.00 | 22.00 | 2.00 | -0.50 | 2.50 |
|
Via ddply and mutate
> ddply(data.s, ~Between, mutate, cResp1 = Resp1 -
+ mean(Resp1), cResp2 = Resp2 - mean(Resp2),
+ cDiff = cResp1 - cResp2)
Between Plot Resp1 Resp2 cResp1 cResp2 cDiff
1 A1 P1 8 13 -1 -4.5 3.5
2 A1 P2 10 22 1 4.5 -3.5
3 A2 P3 7 23 -2 0.5 -2.5
4 A2 P4 11 22 2 -0.5 2.5
|
Scale (log) transformations of Resp1 and Resp2 centered within Between | |
---|
| Between | Plot | Resp1 | Resp2 | sResp1 | sResp2 | logsResp1 | logsResp2 |
1 | A1 | P1 | 8.00 | 13.00 | 0.80 | 0.59 | -0.22 | -0.53 |
2 | A1 | P2 | 10.00 | 22.00 | 1.00 | 1.00 | 0.00 | 0.00 |
3 | A2 | P3 | 7.00 | 23.00 | 0.64 | 1.00 | -0.45 | 0.00 |
4 | A2 | P4 | 11.00 | 22.00 | 1.00 | 0.96 | 0.00 | -0.04 |
|
|
Via ddply and mutate
> ddply(data.s, ~Between, mutate, sResp1 = Resp1/max(Resp1), sResp2 = Resp2/max(Resp2),
+ logsResp1 = log(sResp1), logsResp2 = log(sResp2))
Between Plot Resp1 Resp2 sResp1 sResp2 logsResp1 logsResp2
1 A1 P1 8 13 0.8000 0.5909 -0.2231 -0.52609
2 A1 P2 10 22 1.0000 1.0000 0.0000 0.00000
3 A2 P3 7 23 0.6364 1.0000 -0.4520 0.00000
4 A2 P4 11 22 1.0000 0.9565 0.0000 -0.04445
Alterations
Sorting data
Most statistical analyses are invariant to the data order and thus data reordering is typically only for aesthetics in
tables and figures.
Sorting data has the potential to be one of the most dangerous forms of data manipulation - particularly in spreadsheets, in which
there is no real binding between individual columns or rows. It is far too easy to accidentally sort the data in a single
column (or row) without applying the ordering to all the other columns in the data set, thereby resulting in broken data.
Whilst the above apocalypse is still possible in R, the data structures and manipulation interfaces mean that you really
have to try in order to break the data in this way. Furthermore, you are encouraged to store reordered data in a different object
from the original data, and hence 'rolling back' is trivial.
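The binding point can be sketched with a toy dataframe (hypothetical values): order() returns a single permutation of row indices that is applied to every column at once, so rows can never be scrambled independently, and assigning the result to a new object leaves the original untouched.

```r
dd <- data.frame(Plot = c("P1", "P2", "P3"), Resp = c(10, 2, 7))

dd.sorted <- dd[order(dd$Resp), ]                     # ascending
dd.desc <- dd[order(dd$Resp, decreasing = TRUE), ]    # descending

dd   # the original object is unchanged, so rolling back is trivial
```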
An example of a small omnibus dataset (created here) (data.m) | |
| Plot | Resp1 | Resp2 | Between | Subplot | LAT | LONG |
1 | P1 | 8.00 | 18.00 | A1 | S1 | 17.96 | 145.43 |
2 | P1 | 10.00 | 21.00 | A1 | S2 | 17.96 | 145.43 |
3 | P2 | 7.00 | 16.00 | A1 | S3 | 17.52 | 146.20 |
4 | P2 | 11.00 | 23.00 | A1 | S4 | 17.52 | 146.20 |
5 | P3 | 14.00 | 22.00 | A2 | S5 | 17.00 | 146.38 |
6 | P3 | 12.00 | 24.00 | A2 | S6 | 17.00 | 146.38 |
7 | P4 | 11.00 | 23.00 | A2 | S7 | 18.23 | 146.79 |
8 | P4 | 9.00 | 20.00 | A2 | S8 | 18.23 | 146.79 |
9 | P5 | 14.00 | 11.00 | A3 | S9 | 18.98 | 146.03 |
10 | P5 | 11.00 | 22.00 | A3 | S10 | 18.98 | 146.03 |
11 | P6 | 8.00 | 24.00 | A3 | S11 | 20.12 | 146.47 |
12 | P6 | 2.00 | 16.00 | A3 | S12 | 20.12 | 146.47 |
|
View code
> merge(data.bio1, data.geo, by = "Plot")
|
Sort the data by LAT | |
---|
| Plot | Resp1 | Resp2 | Between | Subplot | LAT | LONG |
5 | P3 | 14.00 | 22.00 | A2 | S5 | 17.00 | 146.38 |
6 | P3 | 12.00 | 24.00 | A2 | S6 | 17.00 | 146.38 |
3 | P2 | 7.00 | 16.00 | A1 | S3 | 17.52 | 146.20 |
4 | P2 | 11.00 | 23.00 | A1 | S4 | 17.52 | 146.20 |
1 | P1 | 8.00 | 18.00 | A1 | S1 | 17.96 | 145.43 |
2 | P1 | 10.00 | 21.00 | A1 | S2 | 17.96 | 145.43 |
7 | P4 | 11.00 | 23.00 | A2 | S7 | 18.23 | 146.79 |
8 | P4 | 9.00 | 20.00 | A2 | S8 | 18.23 | 146.79 |
9 | P5 | 14.00 | 11.00 | A3 | S9 | 18.98 | 146.03 |
10 | P5 | 11.00 | 22.00 | A3 | S10 | 18.98 | 146.03 |
11 | P6 | 8.00 | 24.00 | A3 | S11 | 20.12 | 146.47 |
12 | P6 | 2.00 | 16.00 | A3 | S12 | 20.12 | 146.47 |
|
Via order
> data.m[order(data.m$LAT), ]
Plot Resp1 Resp2 Between Subplot LAT LONG
5 P3 14 22 A2 S5 17.00 146.4
6 P3 12 24 A2 S6 17.00 146.4
3 P2 7 16 A1 S3 17.52 146.2
4 P2 11 23 A1 S4 17.52 146.2
1 P1 8 18 A1 S1 17.96 145.4
2 P1 10 21 A1 S2 17.96 145.4
7 P4 11 23 A2 S7 18.23 146.8
8 P4 9 20 A2 S8 18.23 146.8
9 P5 14 11 A3 S9 18.98 146.0
10 P5 11 22 A3 S10 18.98 146.0
11 P6 8 24 A3 S11 20.12 146.5
12 P6 2 16 A3 S12 20.12 146.5
|
Sort the data by Resp1 and then Resp2 | |
---|
| Plot | Resp1 | Resp2 | Between | Subplot | LAT | LONG |
12 | P6 | 2.00 | 16.00 | A3 | S12 | 20.12 | 146.47 |
3 | P2 | 7.00 | 16.00 | A1 | S3 | 17.52 | 146.20 |
1 | P1 | 8.00 | 18.00 | A1 | S1 | 17.96 | 145.43 |
11 | P6 | 8.00 | 24.00 | A3 | S11 | 20.12 | 146.47 |
8 | P4 | 9.00 | 20.00 | A2 | S8 | 18.23 | 146.79 |
2 | P1 | 10.00 | 21.00 | A1 | S2 | 17.96 | 145.43 |
10 | P5 | 11.00 | 22.00 | A3 | S10 | 18.98 | 146.03 |
4 | P2 | 11.00 | 23.00 | A1 | S4 | 17.52 | 146.20 |
7 | P4 | 11.00 | 23.00 | A2 | S7 | 18.23 | 146.79 |
6 | P3 | 12.00 | 24.00 | A2 | S6 | 17.00 | 146.38 |
9 | P5 | 14.00 | 11.00 | A3 | S9 | 18.98 | 146.03 |
5 | P3 | 14.00 | 22.00 | A2 | S5 | 17.00 | 146.38 |
|
Via order
> data.m[order(data.m$Resp1, data.m$Resp2),
+ ]
Plot Resp1 Resp2 Between Subplot LAT LONG
12 P6 2 16 A3 S12 20.12 146.5
3 P2 7 16 A1 S3 17.52 146.2
1 P1 8 18 A1 S1 17.96 145.4
11 P6 8 24 A3 S11 20.12 146.5
8 P4 9 20 A2 S8 18.23 146.8
2 P1 10 21 A1 S2 17.96 145.4
10 P5 11 22 A3 S10 18.98 146.0
4 P2 11 23 A1 S4 17.52 146.2
7 P4 11 23 A2 S7 18.23 146.8
6 P3 12 24 A2 S6 17.00 146.4
9 P5 14 11 A3 S9 18.98 146.0
5 P3 14 22 A2 S5 17.00 146.4
|
Re-levelling (sorting) factors
In the previous section, we altered the order of the records in the data set. However, it is important to realize that categorical
variables in a dataframe have a property (called levels) that indicates the order of the levels of the factor and that this property
is completely independent of the order of records in the data structure. By default, categorical variables (factors) are ordered
alphabetically (as I said, irrespective of their order in the data set). The order of factor levels dictates the order in which
groups will be plotted in figures and tabulated in tables. Consequently, re-levelling factors is often desirable.
An example of a data set (data.2) | |
| Plot | Cond | Between | Temp |
1 | P1 | H | A1 | 15.74 |
2 | P2 | H | A1 | 23.84 |
3 | P3 | H | A1 | 13.64 |
4 | P4 | H | A2 | 37.95 |
5 | P1 | M | A2 | 25.30 |
6 | P2 | M | A2 | 13.80 |
7 | P3 | M | A3 | 26.87 |
8 | P4 | M | A3 | 29.38 |
9 | P1 | L | A3 | 27.76 |
10 | P2 | L | A4 | 18.95 |
11 | P3 | L | A4 | 37.12 |
12 | P4 | L | A4 | 25.90 |
|
View code
> set.seed(1)
> data.2 <- expand.grid(Plot = paste("P", 1:4, sep = ""),
+ Cond = c("H", "M", "L"))
> data.2$Cond <- factor(as.character(data.2$Cond))
> data.2$Between <- gl(4, 3, 12, lab = paste("A", 1:4,
+ sep = ""))
> data.2$Temp <- rnorm(12, 22, 10)
|
Default alphabetical order |
> levels(data.2$Cond)
[1] "H" "L" "M"
|
Re-levelled into L, M, H |
> data.2$Cond <- factor(data.2$Cond, levels = c("L",
+ "M", "H"))
> levels(data.2$Cond)
[1] "L" "M" "H"
|
Relabelling
In this case, the L, M, H labels are efficient, yet not as informative as we would perhaps like for labels on a graphic. Instead,
we would probably prefer something like "Low", "Medium" and "High".
Give the levels of a factor more informative labels | |
---|
| Plot | Cond | Between | Temp |
1 | P1 | High | A1 | 15.74 |
2 | P2 | High | A1 | 23.84 |
3 | P3 | High | A1 | 13.64 |
4 | P4 | High | A2 | 37.95 |
5 | P1 | Medium | A2 | 25.30 |
6 | P2 | Medium | A2 | 13.80 |
7 | P3 | Medium | A3 | 26.87 |
8 | P4 | Medium | A3 | 29.38 |
9 | P1 | Low | A3 | 27.76 |
10 | P2 | Low | A4 | 18.95 |
11 | P3 | Low | A4 | 37.12 |
12 | P4 | Low | A4 | 25.90 |
|
Via factor
>
>
> data.3 <- data.2
> data.3$Cond <- factor(data.3$Cond, levels = c("L",
+ "M", "H"), lab = c("Low", "Medium", "High"))
> data.3
Plot Cond Between Temp
1 P1 High A1 15.74
2 P2 High A1 23.84
3 P3 High A1 13.64
4 P4 High A2 37.95
5 P1 Medium A2 25.30
6 P2 Medium A2 13.80
7 P3 Medium A3 26.87
8 P4 Medium A3 29.38
9 P1 Low A3 27.76
10 P2 Low A4 18.95
11 P3 Low A4 37.12
12 P4 Low A4 25.90
|
Subsetting
We regularly want to run an analysis or generate a graphic for a subset of the data. Take, for example, the data used here.
We may, for example, wish to explore the subset of the data for which Cond is "H".
Warning: following any subsetting, it is crucial that you then instruct R to redefine the factor levels of any factors in the subsetted data set.
When subsetting a dataset, all factors simply inherit the levels property of the original factors, and therefore the new factors potentially define more factor
levels than are actually present in the subsetted data set. This, along with the solution, is illustrated below.
An example of a data set (data.2) | |
| Plot | Cond | Between | Temp |
1 | P1 | H | A1 | 15.74 |
2 | P2 | H | A1 | 23.84 |
3 | P3 | H | A1 | 13.64 |
4 | P4 | H | A2 | 37.95 |
5 | P1 | M | A2 | 25.30 |
6 | P2 | M | A2 | 13.80 |
7 | P3 | M | A3 | 26.87 |
8 | P4 | M | A3 | 29.38 |
9 | P1 | L | A3 | 27.76 |
10 | P2 | L | A4 | 18.95 |
11 | P3 | L | A4 | 37.12 |
12 | P4 | L | A4 | 25.90 |
|
View code
> set.seed(1)
> data.2 <- expand.grid(Plot = paste("P",
+ 1:4, sep = ""), Cond = c("H", "M", "L"))
> data.2$Cond <- factor(as.character(data.2$Cond))
> data.2$Between <- gl(4, 3, 12, lab = paste("A",
+ 1:4, sep = ""))
> data.2$Temp <- rnorm(12, 22, 10)
|
Effect of subsetting on factor levels | |
---|
>
> data.3 <- subset(data.2, Cond == "H")
>
> data.3
Plot Cond Between Temp
1 P1 H A1 15.74
2 P2 H A1 23.84
3 P3 H A1 13.64
4 P4 H A2 37.95
> levels(data.3$Cond)
[1] "H" "L" "M"
|
Notice that although we created a subset of the data that only includes cases where Cond is "H", each of the factors has retained its original levels property.
|
Correcting factor levels of all factors after subsetting | |
---|
>
> data.3 <- subset(data.2, Cond == "H")
>
> data.3 <- droplevels(data.3)
>
> levels(data.3$Cond)
[1] "H"
|
Notice this time, the levels of each factor reflect the subset data.
|
Correcting factor levels of only a single factor after subsetting | |
---|
>
> data.3 <- subset(data.2, Cond == "H")
>
> data.3$Cond <- factor(data.3$Cond)
>
> levels(data.3$Cond)
[1] "H"
|
Notice this time, the levels of the Cond factor reflect the subset data, whereas the levels of the Between factor reflect the original dataset.
|
Keep only the observations where Cond is H | |
---|
| Plot | Cond | Between | Temp |
1 | P1 | H | A1 | 15.74 |
2 | P2 | H | A1 | 23.84 |
3 | P3 | H | A1 | 13.64 |
4 | P4 | H | A2 | 37.95 |
|
Via subset
> subset(data.2, Cond == "H")
Plot Cond Between Temp
1 P1 H A1 15.74
2 P2 H A1 23.84
3 P3 H A1 13.64
4 P4 H A2 37.95
Via indexing
> data.2[data.2$Cond == "H", ]
Plot Cond Between Temp
1 P1 H A1 15.74
2 P2 H A1 23.84
3 P3 H A1 13.64
4 P4 H A2 37.95
|
Keep only the observations where Cond is either H or M | |
---|
| Plot | Cond | Between | Temp |
1 | P1 | H | A1 | 15.74 |
2 | P2 | H | A1 | 23.84 |
3 | P3 | H | A1 | 13.64 |
4 | P4 | H | A2 | 37.95 |
5 | P1 | M | A2 | 25.30 |
6 | P2 | M | A2 | 13.80 |
7 | P3 | M | A3 | 26.87 |
8 | P4 | M | A3 | 29.38 |
|
Via subset
> subset(data.2, Cond %in% c("H", "M"))
Plot Cond Between Temp
1 P1 H A1 15.74
2 P2 H A1 23.84
3 P3 H A1 13.64
4 P4 H A2 37.95
5 P1 M A2 25.30
6 P2 M A2 13.80
7 P3 M A3 26.87
8 P4 M A3 29.38
Via indexing
> data.2[data.2$Cond %in% c("H", "M"), ]
Plot Cond Between Temp
1 P1 H A1 15.74
2 P2 H A1 23.84
3 P3 H A1 13.64
4 P4 H A2 37.95
5 P1 M A2 25.30
6 P2 M A2 13.80
7 P3 M A3 26.87
8 P4 M A3 29.38
|
Keep only the observations with H OR M condition AND Temp < 30 | |
---|
| Plot | Cond | Between | Temp |
1 | P1 | H | A1 | 15.74 |
2 | P2 | H | A1 | 23.84 |
3 | P3 | H | A1 | 13.64 |
5 | P1 | M | A2 | 25.30 |
6 | P2 | M | A2 | 13.80 |
7 | P3 | M | A3 | 26.87 |
8 | P4 | M | A3 | 29.38 |
|
Via subset
> subset(data.2, Cond %in% c("H", "M") & Temp < 30)
Plot Cond Between Temp
1 P1 H A1 15.74
2 P2 H A1 23.84
3 P3 H A1 13.64
5 P1 M A2 25.30
6 P2 M A2 13.80
7 P3 M A3 26.87
8 P4 M A3 29.38
Via indexing
> data.2[data.2$Cond %in% c("H", "M") & data.2$Temp < 30, ]
Plot Cond Between Temp
1 P1 H A1 15.74
2 P2 H A1 23.84
3 P3 H A1 13.64
5 P1 M A2 25.30
6 P2 M A2 13.80
7 P3 M A3 26.87
8 P4 M A3 29.38
|
List manipulations
Data sets can also be represented by lists.
Like dataframes, lists store multiple objects (which could be variables) of different types.
Yet unlike dataframes, there is no implicit link between the objects in a list, and so there is no need for the objects to be of the same length or even the same dimensions.
Indeed, it is often useful to
"package" together all of the information about a study into a list comprising;
- the main data set
- notes on sampling units and/or observations
- site, block etc characteristics (such as latitude and longitude, elevation, ...)
Hence lists provide a level of organization for related and structured data without any necessity for elements to contain the same number of objects/rows.
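For instance, a study might be packaged as follows (all names and values here are hypothetical):

```r
study <- list(
    data = data.frame(Plot = c("P1", "P2"), Resp1 = c(8, 10)),
    notes = "P2 was sampled one day late",
    sites = data.frame(Plot = c("P1", "P2"), LAT = c(17.96, 17.52),
        LONG = c(145.43, 146.20))
)

# elements are retrieved by name and need not share lengths or dimensions
study$notes
nrow(study$sites)
```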
As a motivating example, let us return to a data set generated earlier (data) comprising two responses (Resp1 and Resp2), a between block factor (Between), Plots and Subplots and a Within subplot (Within) factor.
Create a simple list of three data frames | |
---|
|
View code
> set.seed(1)
> mlist1 <- vector("list", 3)
> names(mlist1) <- LETTERS[1:3]
> for (i in 1:3) {
+ mlist1[[i]] <- expand.grid(Plot = paste("P", 1:3, sep = ""),
+ Cond = c("H", "L"))
+ mlist1[[i]]$Temp <- rnorm(6, 20, 5)
+ }
> mlist1
|
Subsetting simple lists
Lists are manipulated a little differently from dataframes.
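The essential difference is between single- and double-bracket indexing, illustrated here on a hypothetical toy list: [ returns a (sub)list, whereas [[ returns the element's contents.

```r
l <- list(A = 1:3, B = letters[1:2])

class(l[1])     # "list"    - a list of length one, still named A
class(l[[1]])   # "integer" - the vector stored in A
```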
Select first list element (return list) | |
---|
|
Via indexing
> mlist1[1]
$A
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
> mlist1["A"]
$A
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
Via ldply
> library(plyr)
> ldply(mlist1[1])
.id Plot Cond Temp
1 A P1 H 16.87
2 A P2 H 20.92
3 A P3 H 15.82
4 A P1 L 27.98
5 A P2 L 21.65
6 A P3 L 15.90
|
Select the first list elements' contents (return dataframe) | |
---|
| Plot | Cond | Temp |
1 | P1 | H | 16.87 |
2 | P2 | H | 20.92 |
3 | P3 | H | 15.82 |
4 | P1 | L | 27.98 |
5 | P2 | L | 21.65 |
6 | P3 | L | 15.90 |
|
Via indexing
> mlist1[[1]]
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
> mlist1[["A"]]
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
> mlist1$A
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
Via ldply
> library(plyr)
> ldply(mlist1[1])
.id Plot Cond Temp
1 A P1 H 16.87
2 A P2 H 20.92
3 A P3 H 15.82
4 A P1 L 27.98
5 A P2 L 21.65
6 A P3 L 15.90
|
Select list elements 1 and 3 (return list) | |
---|
|
Via indexing
> mlist1[c(1, 3)]
$A
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
$C
Plot Cond Temp
1 P1 H 16.894
2 P2 H 8.927
3 P3 H 25.625
4 P1 L 19.775
5 P2 L 19.919
6 P3 L 24.719
> mlist1[c("A", "C")]
$A
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
$C
Plot Cond Temp
1 P1 H 16.894
2 P2 H 8.927
3 P3 H 25.625
4 P1 L 19.775
5 P2 L 19.919
6 P3 L 24.719
> mlist1[-2]
$A
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
$C
Plot Cond Temp
1 P1 H 16.894
2 P2 H 8.927
3 P3 H 25.625
4 P1 L 19.775
5 P2 L 19.919
6 P3 L 24.719
Via lapply
> lapply(mlist1[c(1, 3)], function(x) x)
$A
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
$C
Plot Cond Temp
1 P1 H 16.894
2 P2 H 8.927
3 P3 H 25.625
4 P1 L 19.775
5 P2 L 19.919
6 P3 L 24.719
Via llply
> library(plyr)
> llply(mlist1[c(1, 3)])
$A
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
$C
Plot Cond Temp
1 P1 H 16.894
2 P2 H 8.927
3 P3 H 25.625
4 P1 L 19.775
5 P2 L 19.919
6 P3 L 24.719
|
Select list elements 1 and 3 contents (return combined dataframe) | |
---|
| .id | Plot | Cond | Temp |
1 | A | P1 | H | 16.87 |
2 | A | P2 | H | 20.92 |
3 | A | P3 | H | 15.82 |
4 | A | P1 | L | 27.98 |
5 | A | P2 | L | 21.65 |
6 | A | P3 | L | 15.90 |
7 | C | P1 | H | 16.89 |
8 | C | P2 | H | 8.93 |
9 | C | P3 | H | 25.62 |
10 | C | P1 | L | 19.78 |
11 | C | P2 | L | 19.92 |
12 | C | P3 | L | 24.72 |
|
Via ldply
> library(plyr)
> ldply(mlist1[c(1, 3)])
.id Plot Cond Temp
1 A P1 H 16.868
2 A P2 H 20.918
3 A P3 H 15.822
4 A P1 L 27.976
5 A P2 L 21.648
6 A P3 L 15.898
7 C P1 H 16.894
8 C P2 H 8.927
9 C P3 H 25.625
10 C P1 L 19.775
11 C P2 L 19.919
12 C P3 L 24.719
|
Select the first column (Plot) of each list element (return pure list) | |
---|
|
Via lapply
> lapply(mlist1, "[[", 1)
$A
[1] P1 P2 P3 P1 P2 P3
Levels: P1 P2 P3
$B
[1] P1 P2 P3 P1 P2 P3
Levels: P1 P2 P3
$C
[1] P1 P2 P3 P1 P2 P3
Levels: P1 P2 P3
|
Subsetting complex (nested) lists
Create a nested list of three data frames | |
---|
|
View code
> set.seed(1)
> mlist2 <- vector("list", 3)
> names(mlist2) <- LETTERS[1:3]
> for (i in 1:3) {
+ mlist2[[i]] <- vector("list", 2)
+ names(mlist2[[i]]) <- letters[1:2]
+ for (j in 1:2) {
+ mlist2[[i]][[j]] <- expand.grid(Plot = paste("P", 1:3,
+ sep = ""), Cond = c("H", "L"))
+ mlist2[[i]][[j]]$Temp <- rnorm(6, 20, 5)
+ }
+ }
> mlist2
|
Select first ('A') list element (return nested list) | |
---|
|
Via indexing
> mlist2[1]
$A
$A$a
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
$A$b
Plot Cond Temp
1 P1 H 22.44
2 P2 H 23.69
3 P3 H 22.88
4 P1 L 18.47
5 P2 L 27.56
6 P3 L 21.95
> mlist2["A"]
$A
$A$a
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
$A$b
Plot Cond Temp
1 P1 H 22.44
2 P2 H 23.69
3 P3 H 22.88
4 P1 L 18.47
5 P2 L 27.56
6 P3 L 21.95
|
Select the first ('A') list element (return simple list) | |
---|
|
Via indexing
> mlist2[[1]]
$a
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
$b
Plot Cond Temp
1 P1 H 22.44
2 P2 H 23.69
3 P3 H 22.88
4 P1 L 18.47
5 P2 L 27.56
6 P3 L 21.95
> mlist2$A
$a
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
$b
Plot Cond Temp
1 P1 H 22.44
2 P2 H 23.69
3 P3 H 22.88
4 P1 L 18.47
5 P2 L 27.56
6 P3 L 21.95
|
Select each of the elements of the first and third primary elements (return nested list) | |
---|
|
Via indexing
> mlist2[c(1, 3)]
$A
$A$a
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
$A$b
Plot Cond Temp
1 P1 H 22.44
2 P2 H 23.69
3 P3 H 22.88
4 P1 L 18.47
5 P2 L 27.56
6 P3 L 21.95
$C
$C$a
Plot Cond Temp
1 P1 H 23.10
2 P2 H 19.72
3 P3 H 19.22
4 P1 L 12.65
5 P2 L 17.61
6 P3 L 22.09
$C$b
Plot Cond Temp
1 P1 H 26.79
2 P2 H 19.49
3 P3 H 21.94
4 P1 L 19.73
5 P2 L 13.11
6 P3 L 17.93
> mlist2[c("A", "C")]
$A
$A$a
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
$A$b
Plot Cond Temp
1 P1 H 22.44
2 P2 H 23.69
3 P3 H 22.88
4 P1 L 18.47
5 P2 L 27.56
6 P3 L 21.95
$C
$C$a
Plot Cond Temp
1 P1 H 23.10
2 P2 H 19.72
3 P3 H 19.22
4 P1 L 12.65
5 P2 L 17.61
6 P3 L 22.09
$C$b
Plot Cond Temp
1 P1 H 26.79
2 P2 H 19.49
3 P3 H 21.94
4 P1 L 19.73
5 P2 L 13.11
6 P3 L 17.93
|
Select first child ('a') list element of the first ('A') list element (return list) | |
---|
|
Via indexing
> mlist2[[1]][1]
$a
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
> mlist2[["A"]]["a"]
$a
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
|
Select first child ('a') list element of the first ('A') list element (return dataframe) | |
---|
| Plot | Cond | Temp |
1 | P1 | H | 16.87 |
2 | P2 | H | 20.92 |
3 | P3 | H | 15.82 |
4 | P1 | L | 27.98 |
5 | P2 | L | 21.65 |
6 | P3 | L | 15.90 |
|
Via indexing
> mlist2[[1]][[1]]
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
> mlist2$A$a
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
|
Select first child ('a') list element from each ('A', 'B', 'C') element (return list) | |
---|
|
Via lapply
> lapply(mlist2, function(x) x["a"])
$A
$A$a
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
$B
$B$a
Plot Cond Temp
1 P1 H 16.894
2 P2 H 8.927
3 P3 H 25.625
4 P1 L 19.775
5 P2 L 19.919
6 P3 L 24.719
$C
$C$a
Plot Cond Temp
1 P1 H 23.10
2 P2 H 19.72
3 P3 H 19.22
4 P1 L 12.65
5 P2 L 17.61
6 P3 L 22.09
>
> lapply(mlist2, "[", "a")
$A
$A$a
Plot Cond Temp
1 P1 H 16.87
2 P2 H 20.92
3 P3 H 15.82
4 P1 L 27.98
5 P2 L 21.65
6 P3 L 15.90
$B
$B$a
Plot Cond Temp
1 P1 H 16.894
2 P2 H 8.927
3 P3 H 25.625
4 P1 L 19.775
5 P2 L 19.919
6 P3 L 24.719
$C
$C$a
Plot Cond Temp
1 P1 H 23.10
2 P2 H 19.72
3 P3 H 19.22
4 P1 L 12.65
5 P2 L 17.61
6 P3 L 22.09
|
Converting complex lists to dataframes
Convert complex nested list of data frames into a single data frame (return dataframe) | |
---|
| ABC | abc | Plot | Cond | Temp |
1 | A | a | P1 | H | 16.87 |
2 | A | a | P2 | H | 20.92 |
3 | A | a | P3 | H | 15.82 |
4 | A | a | P1 | L | 27.98 |
5 | A | a | P2 | L | 21.65 |
6 | A | a | P3 | L | 15.90 |
7 | A | b | P1 | H | 22.44 |
8 | A | b | P2 | H | 23.69 |
9 | A | b | P3 | H | 22.88 |
10 | A | b | P1 | L | 18.47 |
11 | A | b | P2 | L | 27.56 |
12 | A | b | P3 | L | 21.95 |
13 | B | a | P1 | H | 16.89 |
14 | B | a | P2 | H | 8.93 |
15 | B | a | P3 | H | 25.62 |
16 | B | a | P1 | L | 19.78 |
17 | B | a | P2 | L | 19.92 |
18 | B | a | P3 | L | 24.72 |
19 | B | b | P1 | H | 24.11 |
20 | B | b | P2 | H | 22.97 |
21 | B | b | P3 | H | 24.59 |
22 | B | b | P1 | L | 23.91 |
23 | B | b | P2 | L | 20.37 |
24 | B | b | P3 | L | 10.05 |
25 | C | a | P1 | H | 23.10 |
26 | C | a | P2 | H | 19.72 |
27 | C | a | P3 | H | 19.22 |
28 | C | a | P1 | L | 12.65 |
29 | C | a | P2 | L | 17.61 |
30 | C | a | P3 | L | 22.09 |
31 | C | b | P1 | H | 26.79 |
32 | C | b | P2 | H | 19.49 |
33 | C | b | P3 | H | 21.94 |
34 | C | b | P1 | L | 19.73 |
35 | C | b | P2 | L | 13.11 |
36 | C | b | P3 | L | 17.93 |
|
Via ldply and llply
> mData2 <- ldply(llply(mlist2, function(x) {
+ out <- ldply(x)
+ names(out)[1] <- "abc"
+ out
+ }))
> names(mData2)[1] <- "ABC"
> mData2
ABC abc Plot Cond Temp
1 A a P1 H 16.868
2 A a P2 H 20.918
3 A a P3 H 15.822
4 A a P1 L 27.976
5 A a P2 L 21.648
6 A a P3 L 15.898
7 A b P1 H 22.437
8 A b P2 H 23.692
9 A b P3 H 22.879
10 A b P1 L 18.473
11 A b P2 L 27.559
12 A b P3 L 21.949
13 B a P1 H 16.894
14 B a P2 H 8.927
15 B a P3 H 25.625
16 B a P1 L 19.775
17 B a P2 L 19.919
18 B a P3 L 24.719
19 B b P1 H 24.106
20 B b P2 H 22.970
21 B b P3 H 24.595
22 B b P1 L 23.911
23 B b P2 L 20.373
24 B b P3 L 10.053
25 C a P1 H 23.099
26 C a P2 H 19.719
27 C a P3 H 19.221
28 C a P1 L 12.646
29 C a P2 L 17.609
30 C a P3 L 22.090
31 C b P1 H 26.793
32 C b P2 H 19.486
33 C b P3 H 21.938
34 C b P1 L 19.731
35 C b P2 L 13.115
36 C b P3 L 17.925
|
More complex manipulations
Here is an example data set that comprises two Responses measured in each of four years from each of four Plots.
What we want is to derive a new measure that is the change in abundance between a certain year (2008) and all subsequent years -
and of course, we want to do this separately for each Plot.
An example data set (data.t) | |
---|
| Plot | Year | Resp1 | Resp2 |
1 | P1 | 2008 | 0.00 | 36.00 |
2 | P1 | 2009 | 48.00 | 0.00 |
3 | P1 | 2010 | 12.00 | 0.00 |
4 | P1 | 2011 | 0.00 | 15.00 |
5 | P2 | 2008 | 19.00 | 34.00 |
6 | P2 | 2009 | 18.00 | 0.00 |
7 | P2 | 2010 | 21.00 | 36.00 |
8 | P2 | 2011 | 45.00 | 12.00 |
9 | P3 | 2008 | 40.00 | 49.00 |
10 | P3 | 2009 | 38.00 | 0.00 |
11 | P3 | 2010 | 9.00 | 18.00 |
12 | P3 | 2011 | 22.00 | 42.00 |
13 | P4 | 2008 | 20.00 | 0.00 |
14 | P4 | 2009 | 22.00 | 14.00 |
15 | P4 | 2010 | 32.00 | 0.00 |
16 | P4 | 2011 | 38.00 | 27.00 |
|
View code
> set.seed(1)
> data.t <- expand.grid(Year = 2008:2011, Plot = paste("P",
+ 1:4, sep = ""))[, c(2, 1)]
> data.t <- data.frame(data.t, Resp1 = rnbinom(16, 5,
+ mu = 30) * as.numeric(replicate(4, rbinom(4, 1, 0.8))), Resp2 = rnbinom(16,
+ 5, mu = 30) * as.numeric(replicate(4, rbinom(4, 1, 0.8))))
> data.t
|
As this is ultimately a demonstration of a complex data manipulation, we will perform the manipulation in increasingly
complex iterations. Should you purely be interested in the final solution, you may wish to skip to the final point.
- Calculate the change (difference in abundance between 2008 and all subsequent years) for a single variable (Resp1)
| Plot | Year | Resp1 | delta |
1 | P1 | 2008 | 0.00 | 0.00 |
2 | P1 | 2009 | 48.00 | 48.00 |
3 | P1 | 2010 | 12.00 | 12.00 |
4 | P1 | 2011 | 0.00 | 0.00 |
5 | P2 | 2008 | 19.00 | 0.00 |
6 | P2 | 2009 | 18.00 | -1.00 |
7 | P2 | 2010 | 21.00 | 2.00 |
8 | P2 | 2011 | 45.00 | 26.00 |
9 | P3 | 2008 | 40.00 | 0.00 |
10 | P3 | 2009 | 38.00 | -2.00 |
11 | P3 | 2010 | 9.00 | -31.00 |
12 | P3 | 2011 | 22.00 | -18.00 |
13 | P4 | 2008 | 20.00 | 0.00 |
14 | P4 | 2009 | 22.00 | 2.00 |
15 | P4 | 2010 | 32.00 | 12.00 |
16 | P4 | 2011 | 38.00 | 18.00 |
|
Via ddply
> ddply(data.t, ~Plot, function(df) {
+ t1 <- df$Resp1[df$Year == 2008]
+ yr <- df$Year
+ t2 <- df$Resp1
+ data.frame(Year = yr, Resp1 = df$Resp1, delta = t2 - t1)
+ })
Plot Year Resp1 delta
1 P1 2008 0 0
2 P1 2009 48 48
3 P1 2010 12 12
4 P1 2011 0 0
5 P2 2008 19 0
6 P2 2009 18 -1
7 P2 2010 21 2
8 P2 2011 45 26
9 P3 2008 40 0
10 P3 2009 38 -2
11 P3 2010 9 -31
12 P3 2011 22 -18
13 P4 2008 20 0
14 P4 2009 22 2
15 P4 2010 32 12
16 P4 2011 38 18
|
- As 2008 is the year we are comparing all others to, we don't really want the 2008 data in the final data set - so let's suppress it.
| Plot | Year | Resp1 | delta |
1 | P1 | 2009 | 48.00 | 48.00 |
2 | P1 | 2010 | 12.00 | 12.00 |
3 | P1 | 2011 | 0.00 | 0.00 |
4 | P2 | 2009 | 18.00 | -1.00 |
5 | P2 | 2010 | 21.00 | 2.00 |
6 | P2 | 2011 | 45.00 | 26.00 |
7 | P3 | 2009 | 38.00 | -2.00 |
8 | P3 | 2010 | 9.00 | -31.00 |
9 | P3 | 2011 | 22.00 | -18.00 |
10 | P4 | 2009 | 22.00 | 2.00 |
11 | P4 | 2010 | 32.00 | 12.00 |
12 | P4 | 2011 | 38.00 | 18.00 |
|
Via ddply
> ddply(data.t, ~Plot, function(df) {
+ t1 <- df$Resp1[df$Year == 2008]
+ df <- df[df$Year > 2008, ]
+ yr <- df$Year
+ t2 <- df$Resp1
+ data.frame(Year = yr, Resp1 = df$Resp1, delta = t2 - t1)
+ })
Plot Year Resp1 delta
1 P1 2009 48 48
2 P1 2010 12 12
3 P1 2011 0 0
4 P2 2009 18 -1
5 P2 2010 21 2
6 P2 2011 45 26
7 P3 2009 38 -2
8 P3 2010 9 -31
9 P3 2011 22 -18
10 P4 2009 22 2
11 P4 2010 32 12
12 P4 2011 38 18
|
- Let's now extend this to multiple variables (Resp1 and Resp2). This is achieved with the following modifications
- The response variables to apply the delta calculations to are specified by their column indices in code rows 2 and 5 (e.g. 3:4)
- t1 and t2 now store a vector and a matrix respectively.
Therefore we need to use the sweep function in order to subtract each item of the t1 vector
from the corresponding columns of the t2 matrix.
The sweep function takes the values in one vector (or matrix) and "sweeps" them (subtracts, by default) through either the
rows (MARGIN = 1) or columns (MARGIN = 2) of another matrix - that is, it performs the action repeatedly along the rows or columns.
- Finally, to ensure that the results of the sweep are correctly packaged into rows and appended to the splitting variable (Plot),
we need to construct and return a dataframe to the ddply function.
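To make the behaviour of sweep concrete in isolation, here is a minimal sketch using a toy matrix and baseline vector (not part of the workshop datasets):

```r
# Toy 3x2 matrix: column 1 is (10, 20, 30), column 2 is (5, 15, 25)
m <- matrix(c(10, 20, 30, 5, 15, 25), nrow = 3)
# One baseline value per column
base <- c(10, 5)
# MARGIN = 2 sweeps 'base' through the columns: each column has its
# corresponding baseline value subtracted (the default FUN is "-")
sweep(m, 2, base)
# each column becomes 0, 10, 20
```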
| Plot | Year | Resp1 | Resp2 | Resp1.1 | Resp2.1 |
1 | P1 | 2009 | 48.00 | 0.00 | 48.00 | -36.00 |
2 | P1 | 2010 | 12.00 | 0.00 | 12.00 | -36.00 |
3 | P1 | 2011 | 0.00 | 15.00 | 0.00 | -21.00 |
4 | P2 | 2009 | 18.00 | 0.00 | -1.00 | -34.00 |
5 | P2 | 2010 | 21.00 | 36.00 | 2.00 | 2.00 |
6 | P2 | 2011 | 45.00 | 12.00 | 26.00 | -22.00 |
7 | P3 | 2009 | 38.00 | 0.00 | -2.00 | -49.00 |
8 | P3 | 2010 | 9.00 | 18.00 | -31.00 | -31.00 |
9 | P3 | 2011 | 22.00 | 42.00 | -18.00 | -7.00 |
10 | P4 | 2009 | 22.00 | 14.00 | 2.00 | 14.00 |
11 | P4 | 2010 | 32.00 | 0.00 | 12.00 | 0.00 |
12 | P4 | 2011 | 38.00 | 27.00 | 18.00 | 27.00 |
|
Via ddply
> ddply(data.t, ~Plot, function(df) {
+ t1 <- as.matrix(df[df$Year == 2008, 3:4])
+ df <- df[df$Year > 2008, ]
+ yr <- df$Year
+ t2 <- as.matrix(df[, 3:4])
+ return(data.frame(Year = yr, Resp1 = df$Resp1,
+ Resp2 = df$Resp2, sweep(t2, 2, t1)))
+ })
Plot Year Resp1 Resp2 Resp1.1 Resp2.1
1 P1 2009 48 0 48 -36
2 P1 2010 12 0 12 -36
3 P1 2011 0 15 0 -21
4 P2 2009 18 0 -1 -34
5 P2 2010 21 36 2 2
6 P2 2011 45 12 26 -22
7 P3 2009 38 0 -2 -49
8 P3 2010 9 18 -31 -31
9 P3 2011 22 42 -18 -7
10 P4 2009 22 14 2 14
11 P4 2010 32 0 12 0
12 P4 2011 38 27 18 27
|
-
So far, the metric representing the difference between before and after has been fairly crude.
It is simply the raw difference, the magnitude of which is unbounded and dependent on the magnitude and scale
of the variables, thereby rendering it meaningless for comparisons.
A more sophisticated metric would standardize the differences such that they represent the proportional difference (divide by the
initial value).
Of course this introduces additional issues, such as what happens if the initial value is 0?
In such cases, any increase is effectively an infinite increase. However, if the initial value and a subsequent value
are both zero, then a sensible value of delta would be 0.
Δ = [(A2 - A1)/A1] × 100   if A1 ≠ 0
  = 10000                  if A1 = 0 and A2 ≠ 0
  = 0                      if A1 = 0 and A2 = 0

(where A1 is the initial value and A2 is the subsequent value)
We can then further scale this metric so that it is bound to a near-symmetric range around zero (approximately -4.615 to 4.605 on a log scale).
*An alternative way of dealing with these issues is to replace the zero with a value
2 standard deviations below the smallest value in the set. This technique is often employed to deal with situations in which zeros arise as a result of the resolution of instruments (having a minimum detection limit
that is higher than the sample they are measuring) or an otherwise inability to detect very small quantities. Similarly, it could be argued that a measured abundance of zero here could also be the result of an inability to detect
something in very low quantities. Hence, we could replace the zeros with estimates of the lowest detection value or similar.*
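The edge-case rules of this metric can be checked directly with the delta function used in the workshop code below; the quoted bounds follow from log(1/101) and log(1 + 10000/101):

```r
# delta as defined in the workshop code
delta <- function(t2, t1) {
    d <- 100 * ((t2 - t1)/t1)
    d[t1 == 0 & t2 == 0] <- 0
    d[t1 == 0 & t2 != 0] <- 10000
    log(1 + (d/101))
}
delta(0, 0)  # both values zero -> 0
delta(5, 0)  # increase from zero -> upper bound, log(1 + 10000/101), about 4.605
delta(0, 5)  # complete loss -> lower bound, log(1/101), about -4.615
```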
The standardized and scaled option is achieved with the following modifications:
- As indicated in the previous version above, the sweep function performs a simple subtraction by default.
We want the sweep function to apply a different function - one that we define in accordance with
the formula and rules above. So we first create a function that can perform this task.
- Have the sweep function call our specifically-defined function.
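As a minimal sketch of this second modification (toy values, not the workshop data), sweep will accept any vectorized binary function via its FUN argument:

```r
# A hypothetical percentage-difference function used in place of the default "-"
pct <- function(x, base) 100 * (x - base)/base
m <- matrix(c(10, 20, 5, 10), nrow = 2)  # columns (10, 20) and (5, 10)
# Each column is swept with its own baseline via pct()
sweep(m, 2, c(10, 5), FUN = pct)
# each column becomes 0, 100
```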
| Plot | Year | Resp1 | Resp2 | Resp1.1 | Resp2.1 |
1 | P1 | 2009 | 48.00 | 0.00 | 4.61 | -4.62 |
2 | P1 | 2010 | 12.00 | 0.00 | 4.61 | -4.62 |
3 | P1 | 2011 | 0.00 | 15.00 | 0.00 | -0.86 |
4 | P2 | 2009 | 18.00 | 0.00 | -0.05 | -4.62 |
5 | P2 | 2010 | 21.00 | 36.00 | 0.10 | 0.06 |
6 | P2 | 2011 | 45.00 | 12.00 | 0.86 | -1.02 |
7 | P3 | 2009 | 38.00 | 0.00 | -0.05 | -4.62 |
8 | P3 | 2010 | 9.00 | 18.00 | -1.46 | -0.98 |
9 | P3 | 2011 | 22.00 | 42.00 | -0.59 | -0.15 |
10 | P4 | 2009 | 22.00 | 14.00 | 0.09 | 4.61 |
11 | P4 | 2010 | 32.00 | 0.00 | 0.47 | 0.00 |
12 | P4 | 2011 | 38.00 | 27.00 | 0.64 | 4.61 |
|
Via ddply
>
> delta <- function(t2, t1) {
+ d <- 100 * ((t2 - t1)/t1)
+ d[t1 == 0 & t2 == 0] <- 0
+ d[t1 == 0 & t2 != 0] <- 10000
+ log(1 + (d/101))
+ }
>
> ddply(data.t, ~Plot, function(df) {
+ t1 <- as.matrix(df[df$Year == 2008, 3:4])
+ df <- df[df$Year > 2008, ]
+ yr <- df$Year
+ t2 <- as.matrix(df[, 3:4])
+ return(data.frame(Year = yr, Resp1 = df$Resp1,
+ Resp2 = df$Resp2, sweep(t2, 2, t1, delta)))
+ })
Plot Year Resp1 Resp2 Resp1.1 Resp2.1
1 P1 2009 48 0 4.60527 -4.61512
2 P1 2010 12 0 4.60527 -4.61512
3 P1 2011 0 15 0.00000 -0.86170
4 P2 2009 18 0 -0.05352 -4.61512
5 P2 2010 21 36 0.09914 0.05661
6 P2 2011 45 12 0.85649 -1.02346
7 P3 2009 38 0 -0.05077 -4.61512
8 P3 2010 9 18 -1.45812 -0.98454
9 P3 2011 22 42 -0.58977 -0.15250
10 P4 2009 22 14 0.09441 4.60527
11 P4 2010 32 0 0.46628 0.00000
12 P4 2011 38 27 0.63715 4.60527
|
-
The above procedure is perfectly adequate for dealing with a single ubiquitous event (disturbance), but what happens if there have been multiple different disturbances that begin and end at different times in different plots?
Comparing each time period to a single, fixed time (2008 in this case) is then not logical. Let's say that in addition to our data set, we had also compiled a data set that defined the times prior to and post each disturbance
for each Plot. This data set could then be used as a lookup table (as in the section above). In this example we will incorporate the post-disturbance date, yet we will not restrict the resulting
data beyond this date (that is, we will continue to explore the impacts of the disturbance after the primary condition has ceased).
| Plot | Prior | Post |
1 | P1 | 2008.00 | 2011.00 |
2 | P2 | 2008.00 | 2010.00 |
3 | P3 | 2009.00 | 2011.00 |
4 | P4 | 2008.00 | 2011.00 |
|
Via ddply
>
> data.v <- expand.grid(Plot = paste("P",
+ 1:4, sep = ""))
> data.v$Prior <- c(2008, 2008, 2009,
+ 2008)
> data.v$Post <- c(2011, 2010, 2011,
+ 2011)
> data.v
Plot Prior Post
1 P1 2008 2011
2 P2 2008 2010
3 P3 2009 2011
4 P4 2008 2011
|
The following additional steps are now required in order to incorporate the prior and post disturbance dates.
- Merge the data and the lookup table such that the prior and post dates are included in the data set.
- Limit the first date of the comparison for each Plot such that the first measure is relative to the defined Prior disturbance time for that Plot.
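How the merge step propagates the lookup values can be sketched with toy frames (hypothetical obs/lookup, standing in for data.t/data.v):

```r
# Two observations per plot
obs <- data.frame(Plot = c("P1", "P1", "P2", "P2"),
                  Year = c(2008, 2009, 2008, 2009))
# One lookup row per plot, defining its disturbance dates
lookup <- data.frame(Plot = c("P1", "P2"),
                     Prior = c(2008, 2008), Post = c(2011, 2010))
# Merging by Plot repeats each plot's Prior/Post against all of its rows
merge(obs, lookup, by = "Plot")
```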
| Plot | Year | Resp1 | Resp2 | Resp1.1 | Resp2.1 |
1 | P1 | 2009 | 48.00 | 0.00 | 4.61 | -4.62 |
2 | P1 | 2010 | 12.00 | 0.00 | 4.61 | -4.62 |
3 | P1 | 2011 | 0.00 | 15.00 | 0.00 | -0.86 |
4 | P2 | 2009 | 18.00 | 0.00 | -0.05 | -4.62 |
5 | P2 | 2010 | 21.00 | 36.00 | 0.10 | 0.06 |
6 | P2 | 2011 | 45.00 | 12.00 | 0.86 | -1.02 |
7 | P3 | 2010 | 9.00 | 18.00 | -1.41 | 4.61 |
8 | P3 | 2011 | 22.00 | 42.00 | -0.54 | 4.61 |
9 | P4 | 2009 | 22.00 | 14.00 | 0.09 | 4.61 |
10 | P4 | 2010 | 32.00 | 0.00 | 0.47 | 0.00 |
11 | P4 | 2011 | 38.00 | 27.00 | 0.64 | 4.61 |
|
Via ddply
>
> data.u <- merge(data.t, data.v, by = "Plot")
>
>
> delta <- function(t2, t1) {
+ d <- 100 * ((t2 - t1)/t1)
+ d[t1 == 0 & t2 == 0] <- 0
+ d[t1 == 0 & t2 != 0] <- 10000
+ log(1 + (d/101))
+ }
>
> ddply(data.u, ~Plot, function(df) {
+ t1 <- as.matrix(df[df$Year == df$Prior, 3:4])
+ df <- df[df$Year > df$Prior, ]
+ yr <- df$Year
+ t2 <- as.matrix(df[, 3:4])
+ return(data.frame(Year = yr, Resp1 = df$Resp1,
+ Resp2 = df$Resp2, sweep(t2, 2, t1, delta)))
+ })
Plot Year Resp1 Resp2 Resp1.1 Resp2.1
1 P1 2009 48 0 4.60527 -4.61512
2 P1 2010 12 0 4.60527 -4.61512
3 P1 2011 0 15 0.00000 -0.86170
4 P2 2009 18 0 -0.05352 -4.61512
5 P2 2010 21 36 0.09914 0.05661
6 P2 2011 45 12 0.85649 -1.02346
7 P3 2010 9 18 -1.40896 4.60527
8 P3 2011 22 42 -0.53937 4.60527
9 P4 2009 22 14 0.09441 4.60527
10 P4 2010 32 0 0.46628 0.00000
11 P4 2011 38 27 0.63715 4.60527
|
Dummy data sets - random data generation