Workshop 2.2 - Data importation and exploratory data analysis
09 Dec 2014
Basic statistics references
- Logan (2010) - Chpt 1, 2 & 6
- Quinn & Keough (2002) - Chpt 1, 2, 3 & 4
Data sets - Data frames(R)
Rarely is only a single biological variable collected. Data are usually collected in sets of variables reflecting tests of relationships, differences between groups, multiple characterizations etc. Consequently, data sets are best organized into collections of variables (vectors). Such collections are called data frames in R.
Data frames are generated by combining multiple vectors together whereby each vector becomes a separate column in the data frame. In for a data frame to represent the data properly, the sequence in which observations appear in the vectors (variables) must be the same for each vector and each vector should have the same number of observations. For example, the first observations from each of the vectors to be included in the data frame must represent observations collected from the same sampling unit.
To demonstrate the use of dataframes in R, we will use fictitious data representing the areas of leaves of two species of Japanese Boxwood
Format of the fictitious data set | ||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
- First create the categorical (factor) variable containing the listing of B.semp three times and B.micro three times
- Now create the dependent variable (numeric vector) containing the leaf areas
- Combine the two variables (vectors) into a single data set (data frame) called LEAVES
- Print (to the screen) the contents of this new data set called LEAVES
- You will have noticed that the names of the rows are listed as 1 to 6 (this is the default). In the table above, we can see that there is a variable called PLANT that listed unique plant identification labels. These labels are of no use for any statistics, however, they are useful for identifying particular observations. Consequently it would be good to incorporate these labels as row names in the data set. Create a variable called PLANT that contains a listing of the plant identifications
- Use this plant identification label variable to define the row names in the data frame called LEAVES
- In the textbox provided below, list each of the lines of R syntax required to generate the data set
The above syntax forms a list of instructions that R can perform. Such lists are called scripts. Scripts offer the following;
- Enable a sequence of tasks such as data entry, analysis and graphical preparation to be repeated quickly and precisely
- Ensure that the sequence of tasks used to complete an analysis are permanently documented
- Simplify performing many similar analyses
- Simplify sharing of data, analyses and techniques
- close down R
- restart R
- Change the working directory (path) to the location where you saved the script file in Q1-2 above
- Source the script file
- Print (list on screen) the contents of the AREA vector. Note, that this is listing the contents of the AREA vector, this is not the same as asking it to list the contents of the AREA vector within the LEAVES data frame. For example, multiply all of the numbers in the AREA vector by 2. Now print the contents of the AREA vector then the LEAVES data frame. Notice that only the values in the AREA vector have changed - the values within the AREA vector of the LEAVES data frame were not effected.
- To avoid confusion and clutter, it is therefore always best to remove single vectors once a data frame has been created. Remove the PLANTS, SPECIES and AREA vectors.
- Notice what happens when you now try to access the AREA vector.
- To access a variable from within a data frame, we use the $ sign. Print the contents of the LEAVES AREA vector
Access | Syntax |
---|---|
print the LEAVES data set | hint |
print first leaf area in the LEAVES data set | hint |
print the first 3 leaf areas in the LEAVES data set | hint |
print a list of leaf areas that are greater than 20 | hint |
print a list of leaf areas for the B.microphylum species | hint |
print the section of the data set that contains the B.microphylum species | hint |
alter the second leaf area from 22 to 23 | hint |
Importing data and data files
Although it is possible to generate a data set from scratch using the procedures demonstrated in the above demonstration module, often data sets are better managed with spreadsheet software. R is not designed to be a spreadsheet, and thus, it is necessary to import data into R. We will use the following small data set (in which the feeding metabolic rate of stick insects fed two different diets was recorded)to demonstrate how a data set is imported into R.
Format of the fictitious data set | ||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
- Enter the above data set into Excel and save the sheet as a comma delimited text file (CSV)
. Ensure that
e column titles (variable names) are in the first row and that you take note where the file is saved. To see the format of this file, open it in Notepad (the windows accessory program). Notice that it is just a straight text file, there is no encryption or encoding. - Ensure that the current working directory is set to the location of this file
- Read (import) the data set into a data table . Since data exploration and analysis cannot begin until the data is imported into R, the syntax of this step would usually be on the first line in a new script file that is stored with the comma delimited text file version of the data set.
- To ensure that the data have been successfully imported, print the data frame
- Examine the contents of this comma delimited text file using Notepad