Tutorial 2.1 - Constructing data frames
07 Mar 2017
data.frame
Data frames are generated by combining vectors of the same length. To illustrate the translation of a data set (a collection of variables) into an R data frame (a collection of vectors), we will use a portion of a real data set from Mac Nally (1996), in which bird communities were surveyed at 37 sites across five habitats in southeastern Australia. Although the original data set includes the measured maximum density of 102 bird species at the 37 sites, for simplicity only two bird species (GST: gray shrike thrush, EYR: eastern yellow robin) and the first eight sites are included. The truncated data set comprises a single categorical variable (factor), two continuous variables, and a set of site (row) names, and is as follows:
Site | HABITAT | GST | EYR |
---|---|---|---|
Reedy Lake | Mixed | 3.4 | 0.0 |
Pearcedale | Gipps.Manna | 3.4 | 9.2 |
Warneet | Gipps.Manna | 8.4 | 3.8 |
Cranbourne | Gipps.Manna | 3.0 | 5.0 |
Lysterfield | Mixed | 5.6 | 5.6 |
Red Hill | Mixed | 8.1 | 4.1 |
Devilbend | Mixed | 8.3 | 7.1 |
Olinda | Mixed | 4.6 | 5.3 |
Firstly, we will generate the three variables (excluding the site labels as they are not variables) separately:
HABITAT <- factor(c('Mixed', 'Gipps.Manna', 'Gipps.Manna', 'Gipps.Manna',
                    'Mixed', 'Mixed', 'Mixed', 'Mixed'))
GST <- c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6)
EYR <- c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1, 7.1, 5.3)
Next, list the names of the vectors as arguments to the data.frame() function to combine the three separate variables into a single data frame (data set), which we will call MACNALLY (after the author).
MACNALLY <- data.frame(HABITAT, GST, EYR)
MACNALLY
      HABITAT GST EYR
1       Mixed 3.4 0.0
2 Gipps.Manna 3.4 9.2
3 Gipps.Manna 8.4 3.8
4 Gipps.Manna 3.0 5.0
5       Mixed 5.6 5.6
6       Mixed 8.1 4.1
7       Mixed 8.3 7.1
8       Mixed 4.6 5.3
Notice that each vector (variable) becomes a column in the data frame and that each row represents a single sampling unit (in this case, each row represents a different site). By default, the rows are named with consecutive numbers starting at 1. However, these can be altered to reflect the names of the sampling units by assigning a character vector of alternative names to the row.names() (data frame row names) property of the data frame.
row.names(MACNALLY) <- c('Reedy Lake', 'Pearcedale', 'Warneet', 'Cranbourne',
                         'Lysterfield', 'Red Hill', 'Devilbend', 'Olinda')
MACNALLY
                HABITAT GST EYR
Reedy Lake        Mixed 3.4 0.0
Pearcedale  Gipps.Manna 3.4 9.2
Warneet     Gipps.Manna 8.4 3.8
Cranbourne  Gipps.Manna 3.0 5.0
Lysterfield       Mixed 5.6 5.6
Red Hill          Mixed 8.1 4.1
Devilbend         Mixed 8.3 7.1
Olinda            Mixed 4.6 5.3
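Once the sites are stored as row names, rows can be selected by name. The following is a minimal sketch rather than part of the original tutorial: it rebuilds the data frame so that it runs on its own, it supplies the names through the row.names argument of data.frame() as an alternative to assigning row.names() afterwards, and the object name mac is only a placeholder.

```r
# Rebuild the example data with the site names supplied up front.
# 'mac' is a placeholder name used only for this illustration.
mac <- data.frame(
  HABITAT = factor(c('Mixed', 'Gipps.Manna', 'Gipps.Manna', 'Gipps.Manna',
                     'Mixed', 'Mixed', 'Mixed', 'Mixed')),
  GST = c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6),
  EYR = c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1, 7.1, 5.3),
  row.names = c('Reedy Lake', 'Pearcedale', 'Warneet', 'Cranbourne',
                'Lysterfield', 'Red Hill', 'Devilbend', 'Olinda')
)

# Select a whole row (site) by its row name
mac['Warneet', ]

# Select a single value by row name and column name
mac['Red Hill', 'GST']

# Retrieve the row names themselves
row.names(mac)
```

Indexing by name keeps a script readable and does not depend on the order in which the sites happen to be stored.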
expand.grid
When the data set contains multiple fully crossed categorical variables (factors), the expand.grid() function provides a convenient way to create the factor vectors.
expand.grid(rep=1:4,B=paste("b",1:2,sep=""),A=paste("a",1:3,sep=""))
   rep  B  A
1    1 b1 a1
2    2 b1 a1
3    3 b1 a1
4    4 b1 a1
5    1 b2 a1
6    2 b2 a1
7    3 b2 a1
8    4 b2 a1
9    1 b1 a2
10   2 b1 a2
11   3 b1 a2
12   4 b1 a2
13   1 b2 a2
14   2 b2 a2
15   3 b2 a2
16   4 b2 a2
17   1 b1 a3
18   2 b1 a3
19   3 b1 a3
20   4 b1 a3
21   1 b2 a3
22   2 b2 a3
23   3 b2 a3
24   4 b2 a3
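A quick way to confirm that such a design really is fully crossed is to store the result and inspect it. The sketch below is illustrative only; the object name design is an assumption, and paste0() is used as a shorthand for paste(..., sep = "").

```r
# Store the fully crossed design; 'design' is a placeholder name.
design <- expand.grid(rep = 1:4,
                      B = paste0("b", 1:2),
                      A = paste0("a", 1:3))

# One row per combination: 4 replicates x 2 levels of B x 3 levels of A
nrow(design)

# Character inputs are converted to factors by default
sapply(design, class)

# Every A-by-B cell should contain the same number of replicates
table(design$A, design$B)
```

Note that the first argument (rep) varies fastest and the last argument (A) varies slowest, which matches the row order in the output above.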
Summarizing data frames
For very small and simple data frames like the MACNALLY example above, the whole data frame can be comfortably displayed in the console. However, for much larger data frames, displaying all the data can be overwhelming and not very useful. There are a number of convenient functions that provide overviews of data. To appreciate the particulars of each routine, as well as the differences between the routines, we will add some other data types to our MACNALLY data.
MACNALLY$Bool <- rep(c(TRUE, FALSE), 4)
MACNALLY$Char <- rep(c('Large', 'Small'), 4)
MACNALLY$Date <- seq(as.Date('2000-02-29'), as.Date('2000-05-12'), length.out = 8)
summary()
The summary() function is an overloaded (generic) function whose behaviour depends on the class of the object passed to it. When summary() is called with a data.frame, a summary is provided in which:
- numeric vectors (variables) are summarized by the minimum, first quartile, median, mean, third quartile and maximum (the five-number summary plus the mean), and if there are any missing values, the number of missing values is also provided
- categorical vectors (factors) are tallied, that is, the number of instances of each level is counted
- logical (Boolean) vectors are also tallied
- character vectors are described only by their length, class and mode
- date (and POSIX) vectors are summarized in the same way as numeric vectors
summary(MACNALLY)
        HABITAT       GST            EYR           Bool             Char
 Gipps.Manna:3   Min.   :3.00   Min.   :0.000   Mode :logical   Length:8
 Mixed      :5   1st Qu.:3.40   1st Qu.:4.025   FALSE:4         Class :character
                 Median :5.10   Median :5.150   TRUE :4         Mode  :character
                 Mean   :5.60   Mean   :5.013   NA's :0
                 3rd Qu.:8.15   3rd Qu.:5.975
                 Max.   :8.40   Max.   :9.200
      Date
 Min.   :2000-02-29
 1st Qu.:2000-03-18
 Median :2000-04-05
 Mean   :2000-04-05
 3rd Qu.:2000-04-23
 Max.   :2000-05-12
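To see how the behaviour of summary() depends on the class of its argument, it can also be called on individual vectors. The following is a minimal sketch; the vectors num, fac, lgl and chr are made up for illustration and are not part of the Mac Nally data.

```r
# Small illustrative vectors (hypothetical values, one per data type)
num <- c(3.4, 3.4, 8.4, 3.0, NA)                 # numeric with a missing value
fac <- factor(c('Mixed', 'Gipps.Manna', 'Mixed'))
lgl <- c(TRUE, FALSE, TRUE)
chr <- c('Large', 'Small', 'Large')

summary(num)  # quantiles and mean, plus a count of NA's
summary(fac)  # number of observations in each level
summary(lgl)  # counts of FALSE and TRUE
summary(chr)  # length, class and mode only
```

The same per-column logic is what summary() applies to each column when it is given a whole data frame.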
str()
Like summary(), the str() function is overloaded (generic). It generally produces a compact view of the structure of an object. When str() is called with a data.frame, this compact view comprises a nested list of abbreviated structures, one per column.
str(MACNALLY)
'data.frame':   8 obs. of  6 variables:
 $ HABITAT: Factor w/ 2 levels "Gipps.Manna",..: 2 1 1 1 2 2 2 2
 $ GST    : num  3.4 3.4 8.4 3 5.6 8.1 8.3 4.6
 $ EYR    : num  0 9.2 3.8 5 5.6 4.1 7.1 5.3
 $ Bool   : logi  TRUE FALSE TRUE FALSE TRUE FALSE ...
 $ Char   : chr  "Large" "Small" "Large" "Small" ...
 $ Date   : Date, format: "2000-02-29" "2000-03-10" ...
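The amount of detail that str() prints can be tuned. The sketch below is illustrative only and uses a made-up data frame (df) and nested list (nested); vec.len and max.level are arguments of the default str() method.

```r
# A made-up data frame with more rows than str() will print in full
df <- data.frame(x = 1:20, y = letters[1:20])

str(df)               # default: the first few values of each column
str(df, vec.len = 2)  # show fewer values per column

# For nested objects, max.level limits how deep the display goes
nested <- list(a = 1, b = list(c = "x", d = 2:3))
str(nested, max.level = 1)
```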
glimpse()
The glimpse() function in the tibble package is similar to str(), except that it attempts to display as much of each variable as will fit within the width of the output.
library(tibble)
glimpse(MACNALLY)
Observations: 8
Variables: 6
$ HABITAT <fctr> Mixed, Gipps.Manna, Gipps.Manna, Gipps.Manna, Mixed, Mixed, Mixed, M...
$ GST     <dbl> 3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6
$ EYR     <dbl> 0.0, 9.2, 3.8, 5.0, 5.6, 4.1, 7.1, 5.3
$ Bool    <lgl> TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE
$ Char    <chr> "Large", "Small", "Large", "Small", "Large", "Small", "Large", "Small"
$ Date    <date> 2000-02-29, 2000-03-10, 2000-03-20, 2000-03-31, 2000-04-10, 2000-04-...
Others
There are also numerous graphical methods, including View() and fix(). However, I have focused on the script-friendly routines: as the graphical routines require user input, they are inappropriate to include in scripts.
Within RStudio, a data frame can be viewed like a spreadsheet. Furthermore, when in R Notebook mode, a simple functioning spreadsheet will be embedded within the notebook.
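Because the graphical viewers need an interactive session, one option is to guard them so that a script falls back to a text overview when it is run non-interactively. The following is a minimal sketch of that idea; the helper name inspect() is hypothetical and not part of base R.

```r
# Hypothetical helper: open the spreadsheet-style viewer only when R is
# running interactively; otherwise print a compact text overview.
inspect <- function(df) {
  if (interactive()) {
    View(df)
  } else {
    str(df)
  }
  invisible(df)  # return the data unchanged so calls can be chained
}

inspect(data.frame(x = 1:3, y = c('a', 'b', 'c')))
```

When the script is run in batch mode (for example via Rscript), only the str() branch is used, so no user input is required.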