Tutorial 2.3 - Data frame vectors
07 Mar 2017
Please note that the following tutorial describes data.frame manipulation from the classical R perspective. Arguably, more intuitive and satisfying outcomes can be achieved with the dplyr and tidyr packages. These will be explored in the next tutorial.
When a data frame is generated from individual vectors, copies of the original vectors, rather than the original vectors themselves, are amalgamated. Consequently, while the vectors contained in the data frame hold the same information (entries) as the original vectors, they are completely distinct from them.
By way of a motivating example for this tutorial, we will again create an extract of the bird abundance data of Mac Nally (1996).
HABITAT <- factor(c('Mixed','Gipps.Manna','Gipps.Manna','Gipps.Manna','Mixed', 'Mixed','Mixed','Mixed'))
GST <- c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6)
EYR <- c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1, 7.1, 5.3)
MACNALLY <- data.frame(HABITAT,GST,EYR)
The current R workspace will contain the vectors HABITAT, GST and EYR as well as HABITAT, GST and EYR within the MACNALLY data frame. Note that the separate vectors HABITAT, GST and EYR are entirely different objects from those within the data frame: they behave as independent copies, so changing one does not change the other.
ls()
[1] "EYR" "GST" "HABITAT" "MACNALLY" "my_png"
To refer to a vector within a data frame, the name of the vector is preceded by the name of the data frame, and the two names are separated by a $ character. For example, to refer to the GST vector of the MACNALLY data frame:
MACNALLY$GST
[1] 3.4 3.4 8.4 3.0 5.6 8.1 8.3 4.6
MACNALLY$HABITAT
[1] Mixed       Gipps.Manna Gipps.Manna Gipps.Manna Mixed       Mixed       Mixed
[8] Mixed
Levels: Gipps.Manna Mixed
Any modifications made to the original vectors will not affect the vectors within a data frame. Therefore, it is important to remember to use the data frame prefix. To avoid confusion, it is generally recommended that following the successful generation of the data frame from individual vectors, the original vectors should be deleted.
rm(HABITAT,GST,EYR)
Thereafter, any inadvertent reference to the original vector (GST) rather than the vector within the data frame (MACNALLY$GST) will result in an error reporting that the object does not exist.
GST
Error in eval(expr, envir, enclos): object 'GST' not found
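The independence of the free-standing vectors and their data frame copies can be checked directly. The following is a minimal sketch; the names x, g and df are hypothetical and are used only for this illustration.

```r
# Hypothetical vectors, used only for this illustration
x <- c(1, 2, 3)
g <- factor(c('a', 'b', 'a'))
df <- data.frame(x, g)

# Modify the free-standing vector
x[1] <- 100

x       # the free-standing vector has changed
df$x    # the copy inside the data frame has not

# Modify the data frame copy instead
df$x[2] <- -5
x       # unaffected by the change to df$x
```

Because the two objects diverge silently, removing the free-standing vectors with rm() once the data frame exists avoids editing the wrong copy.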
Factor levels
When factors are generated directly using the factor() function to convert character vectors into factors, factor levels are automatically arranged alphabetically. For example, examine the levels of the MACNALLY$HABITAT factor:
levels(MACNALLY$HABITAT)
[1] "Gipps.Manna" "Mixed"
The order of factor levels has no bearing on most statistical procedures, and for many applications alphabetical ordering is as valid as any other arrangement. For some analyses, however (particularly those involving contrasts), it is necessary to know the arrangement of factor levels. Furthermore, for graphical summaries of some data, alphabetical factor levels might not represent the natural trends among groups.
Consider a data set that includes a categorical variable with the levels 'high', 'medium' and 'low'. Presented alphabetically, the levels of the factor would be 'high' 'low' 'medium'. Those data would probably be more effectively presented in the more natural order of 'high' 'medium' 'low' or 'low' 'medium' 'high'.
When creating a factor, the order of factor levels can be specified as a list of labels. For example, consider a factor with the levels 'low','medium' and 'high':
FACTOR <- gl(3,2,6,labels=c('low','medium','high'))
FACTOR
[1] low    low    medium medium high   high
Levels: low medium high
The order of existing factor levels can also be altered by redefining a factor:
# examine the default order of levels
levels(MACNALLY$HABITAT)
[1] "Gipps.Manna" "Mixed"
# redefine the order of levels
MACNALLY$HABITAT <- factor(MACNALLY$HABITAT, levels=c('Montane Forest', 'Foothills Woodland', 'Mixed', 'Gipps.Manna', 'Box-Ironbark', 'River Red Gum'))
# examine the new order of levels
levels(MACNALLY$HABITAT)
[1] "Montane Forest"     "Foothills Woodland" "Mixed"              "Gipps.Manna"
[5] "Box-Ironbark"       "River Red Gum"
Notice that in the above code snippet, not only did I alter the order of the factor levels, I also introduced additional factor levels. This is not generally advisable, as it can result in unexpected behaviors of some summary functions. It does, however, illustrate how to reorder factor levels.
Furthermore, it helps to reinforce the notion that the levels property of a factor is like a key. Internally, categorical vectors (factors) are stored as integer values (1,2,3...). The levels property provides a name for each of these integer codes.
as.numeric(MACNALLY$HABITAT)
[1] 3 4 4 4 3 3 3 3
levels(MACNALLY$HABITAT)
[1] "Montane Forest"     "Foothills Woodland" "Mixed"              "Gipps.Manna"
[5] "Box-Ironbark"       "River Red Gum"
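To make the "key" analogy and the caution about additional levels more concrete, here is a small sketch. It rebuilds the HABITAT factor as a free-standing vector named habitat (a hypothetical name) and uses the base functions table() and droplevels(), neither of which is used elsewhere in this tutorial.

```r
# A stand-in for MACNALLY$HABITAT after the levels were redefined
habitat <- factor(c('Mixed', 'Gipps.Manna', 'Gipps.Manna', 'Gipps.Manna',
                    'Mixed', 'Mixed', 'Mixed', 'Mixed'),
                  levels = c('Montane Forest', 'Foothills Woodland', 'Mixed',
                             'Gipps.Manna', 'Box-Ironbark', 'River Red Gum'))

# Levels with no observations still appear in tabulations, with a count of zero
counts <- table(habitat)
counts

# The integer codes index into levels(), which recovers the original labels
recovered <- levels(habitat)[as.integer(habitat)]
recovered

# droplevels() removes the unused levels while keeping the specified order
habitat_used <- droplevels(habitat)
levels(habitat_used)
```

The zero counts for the four unobserved habitats are one example of the unexpected summary behavior that extra levels can cause.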
In addition, some analyses perform different operations on factors that are defined as 'ordered' compared to 'unordered' factors. Regardless of whether you have altered the ordering of factor levels or not, by default all factors are implicitly considered 'unordered' until otherwise defined using the ordered() function. Alternatively, the argument ordered=TRUE can be supplied to the factor() function when defining a vector as a factor.
# define the factor as ordered
FACTOR <- ordered(FACTOR)
FACTOR
[1] low    low    medium medium high   high
Levels: low < medium < high
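As a brief sketch of what the 'ordered' attribute changes in practice, the following compares the two ways of declaring an ordered factor and shows that ordered factors support comparisons between levels. The names f1 and f2 are hypothetical.

```r
# Recreate the factor from the example above
FACTOR <- gl(3, 2, 6, labels = c('low', 'medium', 'high'))

# Two equivalent ways of declaring the factor as ordered
f1 <- ordered(FACTOR)
f2 <- factor(FACTOR, ordered = TRUE)
identical(f1, f2)

# Ordered factors support comparisons against a level
below_high <- f1 < 'high'
below_high

# Comparisons can therefore be used for indexing
f1[f1 >= 'medium']
```

The same comparison on an unordered factor is not meaningful and yields NA with a warning.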
Subsets of data frames - data frame indexing
Indexing of data frames follows the format dataframe[rows,columns], as illustrated by the following examples.
Indexing by rows (sampling units)
Select the first 5 rows of each of the vectors in the data frame
MACNALLY[1:5,]
      HABITAT GST EYR
1       Mixed 3.4 0.0
2 Gipps.Manna 3.4 9.2
3 Gipps.Manna 8.4 3.8
4 Gipps.Manna 3.0 5.0
5       Mixed 5.6 5.6
Select each of the vectors for the row called 'Pearcedale' from the data frame. Note that for this to work, row names need to be defined.
MACNALLY['Pearcedale',]

Indexing by columns (variables)
Select all rows but just the first and third vector of the data frame
MACNALLY[,c(1,3)]
      HABITAT EYR
1       Mixed 0.0
2 Gipps.Manna 9.2
3 Gipps.Manna 3.8
4 Gipps.Manna 5.0
5       Mixed 5.6
6       Mixed 4.1
7       Mixed 7.1
8       Mixed 5.3
Select the GST and EYR vectors for all sites from the data frame
MACNALLY[,c('GST','EYR')]
  GST EYR
1 3.4 0.0
2 3.4 9.2
3 8.4 3.8
4 3.0 5.0
5 5.6 5.6
6 8.1 4.1
7 8.3 7.1
8 4.6 5.3

Indexing by conditions
Select the data for sites that have GST values greater than 3
MACNALLY[MACNALLY$GST>3,]
      HABITAT GST EYR
1       Mixed 3.4 0.0
2 Gipps.Manna 3.4 9.2
3 Gipps.Manna 8.4 3.8
5       Mixed 5.6 5.6
6       Mixed 8.1 4.1
7       Mixed 8.3 7.1
8       Mixed 4.6 5.3
Select data for 'Mixed' habitat sites that have GST values greater than 3
MACNALLY[MACNALLY$GST>3 & MACNALLY$HABITAT=='Mixed',]
  HABITAT GST EYR
1   Mixed 3.4 0.0
5   Mixed 5.6 5.6
6   Mixed 8.1 4.1
7   Mixed 8.3 7.1
8   Mixed 4.6 5.3
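To see why conditional indexing works, it can help to evaluate the condition on its own. The sketch below rebuilds the MACNALLY extract so that it is self-contained; the intermediate name keep is hypothetical.

```r
# Recreate the example data frame from the top of this tutorial
MACNALLY <- data.frame(
  HABITAT = factor(c('Mixed', 'Gipps.Manna', 'Gipps.Manna', 'Gipps.Manna',
                     'Mixed', 'Mixed', 'Mixed', 'Mixed')),
  GST = c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6),
  EYR = c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1, 7.1, 5.3)
)

# The condition is evaluated first, yielding one TRUE/FALSE per row
keep <- MACNALLY$GST > 3
keep

# which() converts the logical vector into the matching row numbers
which(keep)

# Either form can be used as the row index, combined with a column selection
MACNALLY[keep, c('HABITAT', 'EYR')]
MACNALLY[which(keep), c('HABITAT', 'EYR')]
```

Row 4 (GST of 3.0) is the only row for which the condition is FALSE, which is why it is absent from the conditional selections above.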
The subset() function
As an alternative to data frame indexing, the subset() function can be used:
subset(x, subset, select, drop = FALSE, ...)
- x - is the data frame to be subset
- subset - is a vector of logical values (TRUE and FALSE) resulting from a conditional statement that defines which rows to include
- select - is an expression involving either column indexes or column names (that are converted to column indexes) indicating columns to include
Here are a few more examples:
Select all rows but just the first and third vector of the data frame
subset(MACNALLY, select=c(1,3))
      HABITAT EYR
1       Mixed 0.0
2 Gipps.Manna 9.2
3 Gipps.Manna 3.8
4 Gipps.Manna 5.0
5       Mixed 5.6
6       Mixed 4.1
7       Mixed 7.1
8       Mixed 5.3
Select the GST and EYR vectors for all sites from the data frame
subset(MACNALLY, select=c(GST,EYR))
  GST EYR
1 3.4 0.0
2 3.4 9.2
3 8.4 3.8
4 3.0 5.0
5 5.6 5.6
6 8.1 4.1
7 8.3 7.1
8 4.6 5.3
Select the data for sites that have GST values greater than 3
subset(MACNALLY, GST>3)
      HABITAT GST EYR
1       Mixed 3.4 0.0
2 Gipps.Manna 3.4 9.2
3 Gipps.Manna 8.4 3.8
5       Mixed 5.6 5.6
6       Mixed 8.1 4.1
7       Mixed 8.3 7.1
8       Mixed 4.6 5.3
Select data for 'Mixed' habitat sites that have GST values greater than 3
subset(MACNALLY, GST>3 & HABITAT=='Mixed')
  HABITAT GST EYR
1   Mixed 3.4 0.0
5   Mixed 5.6 5.6
6   Mixed 8.1 4.1
7   Mixed 8.3 7.1
8   Mixed 4.6 5.3
Select the 'HABITAT' and 'EYR' columns of the MACNALLY data for 'Mixed' habitat sites that have GST values greater than 3
subset(MACNALLY, GST>3 & HABITAT=='Mixed', select=c(HABITAT,EYR))
  HABITAT EYR
1   Mixed 0.0
5   Mixed 5.6
6   Mixed 4.1
7   Mixed 7.1
8   Mixed 5.3
The subset() function can be used within many other analysis functions and therefore provides a convenient way of performing data analysis on subsets of larger data sets. Moreover, subset() is more convenient than conditional indexing when there are missing values: it treats a condition that evaluates to NA as FALSE, whereas conditional indexing returns a row of NA values for each such case. Note that neither approach removes factor levels that no longer occur in the selected rows; the droplevels() function can be used for that.
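The different treatment of missing values can be seen with a small sketch. The data frame dat below is hypothetical (it is not part of the Mac Nally data) and contains one missing GST value.

```r
# Hypothetical data with a missing GST value
dat <- data.frame(SITE = c('A', 'B', 'C', 'D'),
                  GST  = c(3.4, NA, 2.1, 8.4))

# Conditional indexing: the NA comparison yields a row filled with NA
by_index <- dat[dat$GST > 3, ]
by_index

# subset(): rows for which the condition is NA are not selected
by_subset <- subset(dat, GST > 3)
by_subset

# Conditional indexing gives the same rows once NA is handled explicitly
by_index_clean <- dat[!is.na(dat$GST) & dat$GST > 3, ]
by_index_clean
```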
The %in% matching operator
It is often desirable to subset according to multiple alternative conditions. The %in% operator checks each entry of the object on the left-hand side for a match with any of the entries of the vector on the right-hand side, returning TRUE or FALSE for each entry.
MACNALLY1 <- read.table('../downloads/data/macnally.csv', header=TRUE, sep=",")
# subset the MACNALLY dataset according to those rows that correspond to
# HABITAT 'Montane Forest' or 'Foothills Woodland'
MACNALLY1[MACNALLY1$HABITAT %in% c("Montane Forest","Foothills Woodland"),]
                         HABITAT  GST EYR
Fern Tree Gum     Montane Forest  3.2 5.2
Sherwin       Foothills Woodland  4.6 1.2
Heathcote Ju      Montane Forest  3.7 2.5
Warburton         Montane Forest  3.8 6.5
Panton Gap        Montane Forest  3.8 3.8
St Andrews    Foothills Woodland  4.7 3.6
Nepean        Foothills Woodland 14.0 5.6
Tallarook     Foothills Woodland  4.3 2.9
Conveniently, the %in% operator can also be used in the subset function.
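A minimal sketch of %in% inside subset() follows. Since the CSV file may not be at hand, it builds a six-row extract of the data by hand (values copied from the output shown in this tutorial); the name selected is hypothetical.

```r
# A small hand-built extract of the Mac Nally data
MACNALLY1 <- data.frame(
  HABITAT = factor(c('Mixed', 'Montane Forest', 'Foothills Woodland',
                     'Gipps.Manna', 'Montane Forest', 'River Red Gum')),
  GST = c(3.4, 3.2, 4.6, 3.4, 3.8, 3.1),
  EYR = c(0.0, 5.2, 1.2, 9.2, 6.5, 0.0),
  row.names = c('Reedy Lake', 'Fern Tree Gum', 'Sherwin',
                'Pearcedale', 'Warburton', 'Arcadia')
)

# %in% returns one TRUE/FALSE per entry of the left-hand side
MACNALLY1$HABITAT %in% c('Montane Forest', 'Foothills Woodland')

# The same condition inside subset(), without the data frame prefix
selected <- subset(MACNALLY1, HABITAT %in% c('Montane Forest', 'Foothills Woodland'))
selected

# Negating the condition selects all other habitats
others <- subset(MACNALLY1, !(HABITAT %in% c('Montane Forest', 'Foothills Woodland')))
others
```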
Pivot tables and aggregating datasets
Sometimes it is necessary to calculate summary statistics of a vector separately for different levels of a factor. One way to achieve this is by specifying the numeric vector, the factor (or list of factors) and the summary statistic function (such as mean) as arguments to the tapply() function.
# calculate the mean GST densities per HABITAT
tapply(MACNALLY1$GST, MACNALLY1$HABITAT, mean)
      Box-Ironbark Foothills Woodland        Gipps.Manna              Mixed
          4.575000           6.900000           5.325000           5.035294
    Montane Forest      River Red Gum
          3.625000           3.300000
# OR
with(MACNALLY1, tapply(GST,HABITAT,mean))
      Box-Ironbark Foothills Woodland        Gipps.Manna              Mixed
          4.575000           6.900000           5.325000           5.035294
    Montane Forest      River Red Gum
          3.625000           3.300000
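The "list of factors" form of tapply() mentioned above can be sketched as follows. The data frame dat is hypothetical, and REGION in particular is an invented second factor that does not exist in the Mac Nally data.

```r
# Hypothetical data: REGION is an invented second factor for illustration only
dat <- data.frame(
  HABITAT = factor(c('Mixed', 'Mixed', 'Mixed',
                     'Gipps.Manna', 'Gipps.Manna', 'Gipps.Manna')),
  REGION  = factor(c('North', 'North', 'South', 'North', 'South', 'South')),
  GST     = c(3.4, 5.6, 8.1, 3.4, 8.4, 3.0)
)

# A single factor gives a named vector of means
with(dat, tapply(GST, HABITAT, mean))

# A list of factors gives a matrix with one cell per combination of levels
means <- with(dat, tapply(GST, list(HABITAT, REGION), mean))
means
```

Rows of the resulting matrix correspond to the levels of the first factor in the list and columns to the levels of the second.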
When it is necessary to calculate the summary statistic for multiple variables at a time, or to retain the dataset (data.frame) format to facilitate subsequent analyses or graphical summaries, a range of other functions are available. Indeed, there is an entire package (plyr) devoted to pivot-table-like functionality and data set aggregation; this will be explored in Tutorial 2.4.
Nevertheless, it is appropriate at this point to showcase a small selection of aggregating functions.
- the ddply() function within the plyr package. This function performs a split (generates subsets of the data), apply (applies a function or set of functions on the subsets) and combine (brings the chunks of aggregated data back together as a data frame).
# calculate the mean GST and EYR densities per habitat
library(plyr)
ddply(MACNALLY1, ~HABITAT, function(df) {
    data.frame(GST=mean(df$GST, na.rm=TRUE), EYR=mean(df$EYR, na.rm=TRUE))
})
             HABITAT      GST      EYR
1       Box-Ironbark 4.575000 1.450000
2 Foothills Woodland 6.900000 3.325000
3        Gipps.Manna 5.325000 6.925000
4              Mixed 5.035294 4.264706
5     Montane Forest 3.625000 4.500000
6      River Red Gum 3.300000 0.000000
# OR if the function you want to apply is the same for each column
ddply(MACNALLY1, ~HABITAT, colwise(mean))
             HABITAT      GST      EYR
1       Box-Ironbark 4.575000 1.450000
2 Foothills Woodland 6.900000 3.325000
3        Gipps.Manna 5.325000 6.925000
4              Mixed 5.035294 4.264706
5     Montane Forest 3.625000 4.500000
6      River Red Gum 3.300000 0.000000
- the aggregate() function
# calculate the mean GST and EYR densities per habitat
aggregate(MACNALLY1[c('GST','EYR')], list(Habitat=MACNALLY1$HABITAT), mean)
             Habitat      GST      EYR
1       Box-Ironbark 4.575000 1.450000
2 Foothills Woodland 6.900000 3.325000
3        Gipps.Manna 5.325000 6.925000
4              Mixed 5.035294 4.264706
5     Montane Forest 3.625000 4.500000
6      River Red Gum 3.300000 0.000000
- alternatively, the gsummary() function within the nlme package performs similarly. The gsummary() function performs more conveniently than aggregate() on grouped data (data containing hierarchical blocking or nesting). Note that due to competing namespaces as well as other technical issues, when using the gsummary() function it is nearly always necessary to explicitly include the namespace (scope) for the summary function. For example:
library(nlme)
gsummary(MACNALLY1[c('GST','EYR')], groups=MACNALLY1$HABITAT, FUN=base::mean)
                        GST      EYR
Box-Ironbark       4.575000 1.450000
Foothills Woodland 6.900000 3.325000
Gipps.Manna        5.325000 6.925000
Mixed              5.035294 4.264706
Montane Forest     3.625000 4.500000
River Red Gum      3.300000 0.000000
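Returning to aggregate(), it also offers a formula interface that some readers may find easier to read than the list-based call shown earlier. The sketch below uses the eight-row extract from the start of this tutorial, stored under the hypothetical name dat, so the means differ from those of the full data set.

```r
# The eight-row extract from the start of this tutorial
dat <- data.frame(
  HABITAT = factor(c('Mixed', 'Gipps.Manna', 'Gipps.Manna', 'Gipps.Manna',
                     'Mixed', 'Mixed', 'Mixed', 'Mixed')),
  GST = c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6),
  EYR = c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1, 7.1, 5.3)
)

# Formula interface: response variables on the left, grouping factor on the right
means <- aggregate(cbind(GST, EYR) ~ HABITAT, data = dat, FUN = mean)
means

# The result is itself a data frame, so it can be indexed like any other
means[means$GST > 5, ]
```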
Sorting datasets
Often it is necessary to rearrange or sort datasets according to one or more variables. This is done by using the order() function to generate the row indices. By default, data are sorted in increasing order; this can be reversed by supplying the decreasing=TRUE argument to the order() function.
It is possible to sort according to multiple variables simply by specifying a comma-separated list of the vector names (see example below), whereby the data are sorted first by the first supplied vector, then the next and so on. Note, however, that when multiple vectors are supplied in this way, all are sorted in the same direction.
MACNALLY1[order(MACNALLY1$HABITAT,MACNALLY1$GST),]
                         HABITAT  GST EYR
Rushworth           Box-Ironbark  2.1 1.1
Sayers              Box-Ironbark  2.6 0.0
Bailieston          Box-Ironbark  6.5 2.5
Costerfield         Box-Ironbark  7.1 2.2
Tallarook     Foothills Woodland  4.3 2.9
Sherwin       Foothills Woodland  4.6 1.2
St Andrews    Foothills Woodland  4.7 3.6
Nepean        Foothills Woodland 14.0 5.6
Cranbourne           Gipps.Manna  3.0 5.0
Pearcedale           Gipps.Manna  3.4 9.2
Bittern              Gipps.Manna  6.5 9.7
Warneet              Gipps.Manna  8.4 3.8
Donna Buang                Mixed  1.5 0.0
Hawke                      Mixed  1.7 2.6
Waranga                    Mixed  3.0 1.6
Ben Cairn                  Mixed  3.1 9.3
Reedy Lake                 Mixed  3.4 0.0
Ghin Ghin                  Mixed  3.4 2.7
Balnarring                 Mixed  4.1 4.9
Olinda                     Mixed  4.6 5.3
Upper Yarra                Mixed  4.7 3.1
Millgrove                  Mixed  5.4 6.5
Lysterfield                Mixed  5.6 5.6
Minto                      Mixed  5.6 3.3
Cape Schanck               Mixed  6.0 4.9
Gembrook                   Mixed  7.5 7.5
Red Hill                   Mixed  8.1 4.1
Devilbend                  Mixed  8.3 7.1
OShannassy                 Mixed  9.6 4.0
Fern Tree Gum     Montane Forest  3.2 5.2
Heathcote Ju      Montane Forest  3.7 2.5
Warburton         Montane Forest  3.8 6.5
Panton Gap        Montane Forest  3.8 3.8
Undera             River Red Gum  2.7 0.0
Toolamba           River Red Gum  3.0 0.0
Arcadia            River Red Gum  3.1 0.0
Coomboona          River Red Gum  4.4 0.0
To appreciate how this works, examine just the order() component:
order(MACNALLY1$HABITAT,MACNALLY1$GST)
 [1] 33 34 25 36 37 10 20 21  4  2 24  3 26 19 35 14  1 17 23  8 27 13  5 18 22 28  6  7
[29] 16  9 11 12 15 30 32 29 31
Hence, when this sequence is applied as row indices to MACNALLY1, it is interpreted as 'display row 33, then row 34, then row 25, etc'.
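The following sketch repeats the idea on a six-row extract (values copied from the output above, stored under the hypothetical name dat) and also shows one common workaround for sorting variables in different directions: negating a numeric variable so that only that variable is reversed. This workaround is an addition to the approach described above and applies to numeric variables only.

```r
# A small hand-built extract of the Mac Nally data
dat <- data.frame(
  HABITAT = factor(c('Mixed', 'Gipps.Manna', 'Gipps.Manna',
                     'Gipps.Manna', 'Mixed', 'Mixed')),
  GST = c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1),
  row.names = c('Reedy Lake', 'Pearcedale', 'Warneet',
                'Cranbourne', 'Lysterfield', 'Red Hill')
)

# order() returns row positions, not sorted values
idx <- order(dat$HABITAT, dat$GST)
idx

# Both variables in decreasing order
dat[order(dat$HABITAT, dat$GST, decreasing = TRUE), ]

# Mixed directions: HABITAT increasing, GST decreasing within each habitat
mixed <- dat[order(dat$HABITAT, -dat$GST), ]
mixed
```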
Accessing and evaluating expressions within the context of a dataframe
For times when you find it necessary to repeatedly include the name of the dataframe within functions and expressions, the with() function is very convenient. This function evaluates an expression (which can include functions) within the context of the dataframe. Hence, the above order() illustration could also be performed as:
with(MACNALLY1, order(HABITAT, GST))
 [1] 33 34 25 36 37 10 20 21  4  2 24  3 26 19 35 14  1 17 23  8 27 13  5 18 22 28  6  7
[29] 16  9 11 12 15 30 32 29 31
Similarly, the within() function can be used to create new variables within the context of a dataset. This is particularly useful for scale transformations. The within() function returns a new instance of the data frame; it does not affect the original data frame.
MACNALLY2 <- within(MACNALLY1, logGST <- log(GST))
MACNALLY2
                         HABITAT  GST EYR    logGST
Reedy Lake                 Mixed  3.4 0.0 1.2237754
Pearcedale           Gipps.Manna  3.4 9.2 1.2237754
Warneet              Gipps.Manna  8.4 3.8 2.1282317
Cranbourne           Gipps.Manna  3.0 5.0 1.0986123
Lysterfield                Mixed  5.6 5.6 1.7227666
Red Hill                   Mixed  8.1 4.1 2.0918641
Devilbend                  Mixed  8.3 7.1 2.1162555
Olinda                     Mixed  4.6 5.3 1.5260563
Fern Tree Gum     Montane Forest  3.2 5.2 1.1631508
Sherwin       Foothills Woodland  4.6 1.2 1.5260563
Heathcote Ju      Montane Forest  3.7 2.5 1.3083328
Warburton         Montane Forest  3.8 6.5 1.3350011
Millgrove                  Mixed  5.4 6.5 1.6863990
Ben Cairn                  Mixed  3.1 9.3 1.1314021
Panton Gap        Montane Forest  3.8 3.8 1.3350011
OShannassy                 Mixed  9.6 4.0 2.2617631
Ghin Ghin                  Mixed  3.4 2.7 1.2237754
Minto                      Mixed  5.6 3.3 1.7227666
Hawke                      Mixed  1.7 2.6 0.5306283
St Andrews    Foothills Woodland  4.7 3.6 1.5475625
Nepean        Foothills Woodland 14.0 5.6 2.6390573
Cape Schanck               Mixed  6.0 4.9 1.7917595
Balnarring                 Mixed  4.1 4.9 1.4109870
Bittern              Gipps.Manna  6.5 9.7 1.8718022
Bailieston          Box-Ironbark  6.5 2.5 1.8718022
Donna Buang                Mixed  1.5 0.0 0.4054651
Upper Yarra                Mixed  4.7 3.1 1.5475625
Gembrook                   Mixed  7.5 7.5 2.0149030
Arcadia            River Red Gum  3.1 0.0 1.1314021
Undera             River Red Gum  2.7 0.0 0.9932518
Coomboona          River Red Gum  4.4 0.0 1.4816045
Toolamba           River Red Gum  3.0 0.0 1.0986123
Rushworth           Box-Ironbark  2.1 1.1 0.7419373
Sayers              Box-Ironbark  2.6 0.0 0.9555114
Waranga                    Mixed  3.0 1.6 1.0986123
Costerfield         Box-Ironbark  7.1 2.2 1.9600948
Tallarook     Foothills Woodland  4.3 2.9 1.4586150
MACNALLY1
                         HABITAT  GST EYR
Reedy Lake                 Mixed  3.4 0.0
Pearcedale           Gipps.Manna  3.4 9.2
Warneet              Gipps.Manna  8.4 3.8
Cranbourne           Gipps.Manna  3.0 5.0
Lysterfield                Mixed  5.6 5.6
Red Hill                   Mixed  8.1 4.1
Devilbend                  Mixed  8.3 7.1
Olinda                     Mixed  4.6 5.3
Fern Tree Gum     Montane Forest  3.2 5.2
Sherwin       Foothills Woodland  4.6 1.2
Heathcote Ju      Montane Forest  3.7 2.5
Warburton         Montane Forest  3.8 6.5
Millgrove                  Mixed  5.4 6.5
Ben Cairn                  Mixed  3.1 9.3
Panton Gap        Montane Forest  3.8 3.8
OShannassy                 Mixed  9.6 4.0
Ghin Ghin                  Mixed  3.4 2.7
Minto                      Mixed  5.6 3.3
Hawke                      Mixed  1.7 2.6
St Andrews    Foothills Woodland  4.7 3.6
Nepean        Foothills Woodland 14.0 5.6
Cape Schanck               Mixed  6.0 4.9
Balnarring                 Mixed  4.1 4.9
Bittern              Gipps.Manna  6.5 9.7
Bailieston          Box-Ironbark  6.5 2.5
Donna Buang                Mixed  1.5 0.0
Upper Yarra                Mixed  4.7 3.1
Gembrook                   Mixed  7.5 7.5
Arcadia            River Red Gum  3.1 0.0
Undera             River Red Gum  2.7 0.0
Coomboona          River Red Gum  4.4 0.0
Toolamba           River Red Gum  3.0 0.0
Rushworth           Box-Ironbark  2.1 1.1
Sayers              Box-Ironbark  2.6 0.0
Waranga                    Mixed  3.0 1.6
Costerfield         Box-Ironbark  7.1 2.2
Tallarook     Foothills Woodland  4.3 2.9
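As a final sketch, within() also accepts a braced block of expressions, so several variables can be created in one call. The four-row extract dat and the derived variable TOTAL are hypothetical and are used only for illustration.

```r
# A small hand-built extract of the Mac Nally data
dat <- data.frame(
  GST = c(3.4, 3.4, 8.4, 3.0),
  EYR = c(0.0, 9.2, 3.8, 5.0),
  row.names = c('Reedy Lake', 'Pearcedale', 'Warneet', 'Cranbourne')
)

# with() evaluates an expression using the columns directly and returns its value
with(dat, mean(GST + EYR))

# within() accepts a braced block, so several variables can be created at once
dat2 <- within(dat, {
  TOTAL  <- GST + EYR      # hypothetical derived variable
  logGST <- log(GST)
})
names(dat2)

# The original data frame is unchanged unless the result is assigned back to it
names(dat)
```

The order in which within() appends new columns should not be relied upon; refer to them by name rather than by position.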