Workshop 14.2 - Principle Components Analysis (PCA)
14 Jan 2013
Basic statistics references
- Legendre and Legendre
- Quinn & Keough (2002) - Chpt 17
Principle components analysis
Gittens(1985) measured the abundance of 8 species of plants from 45 sites within 3 habitat types. Essentially, the plant ecologist wanted to be able to compare the sites according to their plant communities.
Download veg data setFormat of veg.csv data file | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
> veg <- read.csv("../downloads/data/veg.csv") > veg
SITE HABITAT SP1 SP2 SP3 SP4 SP5 SP6 SP7 SP8 1 1 A 4 0 0 36 28 24 99 68 2 2 B 92 84 0 8 0 0 84 4 3 3 A 9 0 0 52 4 40 96 68 4 4 A 52 0 0 52 12 28 96 24 5 5 C 99 0 36 88 52 8 72 0 6 6 A 12 0 0 20 40 40 88 68 7 7 C 72 0 20 72 24 20 72 8 8 8 C 80 0 0 48 16 28 92 8 9 9 B 80 76 4 8 12 0 84 0 10 10 C 92 0 40 72 36 12 84 0 11 11 A 28 4 0 16 56 28 96 56 12 12 A 8 0 0 36 68 8 99 28 13 13 C 99 12 4 84 36 12 88 8 14 14 A 40 0 0 68 12 8 88 24 15 15 A 28 0 0 36 64 28 99 56 16 16 A 28 0 0 28 44 20 88 32 17 17 C 80 0 0 52 20 32 96 20 18 18 C 84 0 0 76 44 16 96 0 19 19 B 88 40 12 8 24 8 92 0 20 20 C 99 0 60 88 28 0 80 0 21 21 A 12 0 0 36 16 12 88 76 22 22 A 0 0 0 20 8 0 99 60 23 23 C 88 0 12 72 32 16 88 0 24 24 C 56 0 4 32 56 4 96 16 25 25 C 99 0 40 60 20 4 56 4 26 26 A 12 0 0 28 4 4 99 72 27 27 A 28 0 0 48 64 4 99 28 28 28 B 92 52 0 40 64 8 96 4 29 29 C 80 0 0 68 40 12 80 8 30 30 A 32 0 0 56 28 36 84 24 31 31 A 40 0 0 60 8 36 96 56 32 32 A 44 0 0 44 8 20 96 24 33 33 A 48 0 0 72 20 12 99 32 34 34 A 48 0 0 8 44 8 92 56 35 35 B 99 36 20 56 8 4 24 0 36 36 A 15 0 4 36 4 28 99 44 37 37 A 8 0 0 20 16 12 99 56 38 38 A 28 0 0 24 16 12 99 36 39 39 A 52 0 0 48 12 28 99 32 40 40 C 92 0 4 56 12 16 70 4 41 41 C 92 0 8 52 56 8 99 4 42 42 A 4 0 0 44 24 4 99 60 43 43 A 16 0 0 36 0 24 99 76 44 44 A 76 0 0 48 12 36 96 32 45 45 B 96 36 4 28 28 8 88 4
-
Before leaping into a multivariate analysis, we will begin by
treating the data set as a series of univariate responses (only a single response variable), and
examine the patterns between habitats based on the following single species:
- SP1
Show code
> plot(SP1 ~ SITE, col = as.numeric(HABITAT), data = veg, pch = 16)
- SP2
Show code
> plot(SP2 ~ SITE, col = as.numeric(HABITAT), data = veg, pch = 16)
- SP5
Show code
> plot(SP5 ~ SITE, col = as.numeric(HABITAT), data = veg, pch = 16)
- SP8
Show code
> plot(SP8 ~ SITE, col = as.numeric(HABITAT), data = veg, pch = 16)
- SP1
-
Examine the degree of correlation between each of the species (HINT).
Show code
> cor(veg[, 3:10])
SP1 SP2 SP3 SP4 SP5 SP6 SP7 SP8 SP1 1.00000 0.3993 0.51974 0.42648 0.09499 -0.28461 -0.5248 -0.8927 SP2 0.39933 1.0000 -0.03110 -0.39693 -0.12100 -0.38590 -0.2359 -0.3847 SP3 0.51974 -0.0311 1.00000 0.49179 0.06762 -0.33336 -0.5441 -0.4715 SP4 0.42648 -0.3969 0.49179 1.00000 0.07452 0.05903 -0.3018 -0.4180 SP5 0.09499 -0.1210 0.06762 0.07452 1.00000 -0.14179 0.1307 -0.1895 SP6 -0.28461 -0.3859 -0.33336 0.05903 -0.14179 1.00000 0.2534 0.3706 SP7 -0.52483 -0.2359 -0.54411 -0.30180 0.13070 0.25345 1.0000 0.4704 SP8 -0.89271 -0.3847 -0.47153 -0.41800 -0.18947 0.37061 0.4704 1.0000
-
SP1 and SP8 appeared to be highly negatively correlated.
To examine this correlation, create a scatterplot of SP1 against SP8 (HINT) and fit a smooth line through these data(HINT).
If we were purely trying to combine SP1 and SP2 into a new group, the position of each site (point) along this effectively becomes the sites new value in the new group.
Show code
> plot(SP1 ~ SP8, col = as.numeric(HABITAT), data = veg, pch = 16, asp = 1) > # model II regression slope from eigen analysis it is not critical that you understand the > # following - a simple line of best fit would illustrate just as well > e <- eigen(cor(veg[, c(3, 10)])) > e1 <- e$values[1] * e$vectors[, 1] > e2 <- e$values[2] * e$vectors[, 2] > slope <- (e2[2] - e1[2])/(e2[1] - e1[1]) > int <- mean(veg$SP1) - slope * mean(veg$SP8) > abline(a = int, b = slope)
-
Since a PCA is based either on covariances or correlations (both of which assume linearity), we first need
to explore the nature of the relationships between variables (species).
Show code
> pairs(veg[, 3:10])
-
Use principal components analysis (PCA, HINT)
to generate new groups (components) and explore the trends in plant communities amongst sites (and habitats)
Show code
> library(vegan) > veg.pca <- rda(veg[, c(-1, -2)], scale = TRUE) > summary(veg.pca, display = NULL)
Call: rda(X = veg[, c(-1, -2)], scale = TRUE) Partitioning of correlations: Inertia Proportion Total 8 1 Unconstrained 8 1 Eigenvalues, and their contribution to the correlations Importance of components: PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 Eigenvalue 3.295 1.640 1.130 0.809 0.5023 0.3426 0.1890 0.0923 Proportion Explained 0.412 0.205 0.141 0.101 0.0628 0.0428 0.0236 0.0115 Cumulative Proportion 0.412 0.617 0.758 0.859 0.9220 0.9648 0.9885 1.0000 Scaling 2 for species and site scores * Species are scaled proportional to eigenvalues * Sites are unscaled: weighted dispersion equal on all dimensions * General scaling constant of scores:
- Examine the Eigenvalues for each new component (group). The sum of these values should add up to the number of original variables (species).
If there were absolutely no correlations between the species, then you would expect each new component to represent a single original species and it would have an eigenvalue of 1.
The more correlated the species were, the more they will group together into the first few newly generated components (groups) and thus the higher the eigenvalues of these earlier groups.
What do the eigenvalues indicate in this case?
Show code> veg.pca$CA$eig
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 3.29533 1.63976 1.12956 0.80916 0.50231 0.34259 0.18898 0.09231
> sum(veg.pca$CA$eig)
[1] 8
- Calculate the percentage of total variation is explained by each of the new principal components.
How much of the total original variation is explained by principal component 1 (as a percentage)?
Show code
> 100 * veg.pca$CA$eig/sum(veg.pca$CA$eig)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 41.192 20.497 14.120 10.115 6.279 4.282 2.362 1.154
- Calculate the cumulative sum of these percentages.
How much of the total variation is explained by the first three principal components (as a percentage)?
Show code
> cumsum(100 * veg.pca$CA$eig/sum(veg.pca$CA$eig))
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 41.19 61.69 75.81 85.92 92.20 96.48 98.85 100.00
- Using the eigenvalues and a screeplot, determine how many principal components are necessary to represent the original variables (species)
. How many principal components are necessary?
Show code> screeplot(veg.pca) > abline(a = 1, b = 0)
- Examine the Eigenvalues for each new component (group). The sum of these values should add up to the number of original variables (species).
If there were absolutely no correlations between the species, then you would expect each new component to represent a single original species and it would have an eigenvalue of 1.
The more correlated the species were, the more they will group together into the first few newly generated components (groups) and thus the higher the eigenvalues of these earlier groups.
What do the eigenvalues indicate in this case?
-
Generate a a quick biplot ordination (scatterplot of principal components) with principal component 1 on the x-axis and principal component 2 on the y-axis.
Are the patterns of sites associated with any particular species?
Show code
> biplot(veg.pca, scaling = 2)
- Whilst the above biplot illustrates some of the patterns, it does not allow us to directly see whether the communities change in the different habitats.
So lets instead construct the plot at a lower level.
- Create the base ordination plot and add the sites (colored according to habitat). Since we are more interested in the habitats than
the actual sites, we can just label the points according to their habitat rather than their site names.
Show code
> veg.ord <- ordiplot(veg.pca, type = "n") > text(veg.ord, "sites", lab = veg$HABITAT, col = as.numeric(veg$HABITAT))
- Lets now add the species correlation vectors (component loadings). This will yield a biplot similar to the previous question.
Show code
> veg.ord <- ordiplot(veg.pca, type = "n") > text(veg.ord, "sites", lab = veg$HABITAT, col = as.numeric(veg$HABITAT)) > data.envfit <- envfit(veg.pca, veg[, 3:8]) > plot(data.envfit, col = "grey")
- Now lets fit the habitat vectors onto this ordination.
Before environmental variables can he added to an ordination plot, they must first be numeric representations.
If we wish to display the orientation of each habitat on the ordination plot, then we need to convert the habitat
variable into dummy variables.
Show code
> veg.ord <- ordiplot(veg.pca, type = "n") > text(veg.ord, "sites", lab = veg$HABITAT, col = as.numeric(veg$HABITAT)) > data.envfit <- envfit(veg.pca, veg[, 3:8]) > plot(data.envfit, col = "grey") > # dummy code the habitat factor > habitat <- model.matrix(~-1 + HABITAT, veg) > data.envfit <- envfit(veg.pca, env = habitat) > data.envfit
***VECTORS PC1 PC2 r2 Pr(>r) HABITATA -0.989 0.147 0.74 0.001 *** HABITATB 0.448 -0.894 0.79 0.001 *** HABITATC 0.810 0.586 0.58 0.001 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 P values based on 999 permutations.
> plot(data.envfit, col = "blue")
- Species 1 in primarily associated with principal component (axis)?
- Species 2 in primarily associated with principal component (axis)?
- Species 5 in primarily associated with principal component (axis)?
- Habitat A aligns primarily with the
- Habitat B aligns primarily with the
- Habitat C strongly reflects the abundances of
- Create the base ordination plot and add the sites (colored according to habitat). Since we are more interested in the habitats than
the actual sites, we can just label the points according to their habitat rather than their site names.
- We could take the above exploration one step further.
We would specifically address the question, are the communities in the three different habitats significantly different from one another.
One simple way to achieve this is to take the first couple of principle components and use them as responses in ANOVA's against habitat.
Lets do this separately for the first two principle components.
Since Habitat A is physically a little different to the other two habitats (it is occasionally completely submerged), it is perhaps sensible to compare each of the other habitats to habitat A
- PC1
Show code
> veg.scores <- scores(veg.pca, display = "sites") > summary(lm(veg.scores[, 1] ~ veg$HABITAT))
Call: lm(formula = veg.scores[, 1] ~ veg$HABITAT) Residuals: Min 1Q Median 3Q Max -0.6750 -0.2038 -0.0841 0.2462 0.8236 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.4905 0.0705 -6.96 1.7e-08 *** veg$HABITATB 1.1457 0.1602 7.15 8.8e-09 *** veg$HABITATC 1.0855 0.1176 9.23 1.2e-11 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.352 on 42 degrees of freedom Multiple R-squared: 0.722, Adjusted R-squared: 0.709 F-statistic: 54.5 on 2 and 42 DF, p-value: 2.11e-12
> boxplot(veg.scores[, 1] ~ veg$HABITAT)
- PC2
Show code
> veg.scores <- scores(veg.pca, display = "sites") > summary(lm(veg.scores[, 2] ~ veg$HABITAT))
Call: lm(formula = veg.scores[, 2] ~ veg$HABITAT) Residuals: Min 1Q Median 3Q Max -0.9480 -0.1510 -0.0156 0.2221 0.8079 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.073 0.074 0.99 0.329 veg$HABITATB -1.382 0.168 -8.22 2.8e-10 *** veg$HABITATC 0.358 0.123 2.90 0.006 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.37 on 42 degrees of freedom Multiple R-squared: 0.694, Adjusted R-squared: 0.679 F-statistic: 47.5 on 2 and 42 DF, p-value: 1.63e-11
> boxplot(veg.scores[, 2] ~ veg$HABITAT)
- PC1
The ecologist was not interested in teasing out the patterns based on each individual species in isolation. The ecologist wanted to see patterns between the plant communities, rather than individual species. Hence a multivariate approach was taken. You may have noticed that the patterns between sites (and habitats) based on SP1 and SP8 were very similar. The abundances of SP1 and SP2 appear to be correlated to one another. It is not surprising that different species might be correlated, as they are likely to respond similarly to similar conditions.
If two or more species reveal exactly the same patterns (hypothetically), we could easily combine them into a single group that characterises the sites. Species are never likely to be exactly correlated, however we can still generate new groups that are the combinations of multiple species abundances. If we were to attempt to combine three species, two of which were highly correlated to one another and the other not correlated to either, then the two correlated species will contribute a lot to the new group and the other variable will contribute only little.
In the example, lets say we wanted to reduce the 8 species variables down to just 2 groups. Based on how much each species is correlated to each other species, each species will contribute something to each of the two new groups. So each new group is a linear combination of original species variables. This sort of data combining (or reduction) can be done in a number of ways, however for it to work meaningfully, there must be a reasonable degree of correlation between the species.
Principle components analysis
Peet & Loucks (1977) examined the abundances of 8 species of trees (Bur oak, Black oak, White oak, Red oak, American elm, Basswood, Ironwood, Sugar maple) at 10 forest sites in southern Wisconsin, USA. The data (given below) are the mean measurements of canopy cover for eight species of north American trees in 10 samples (quadrats). For this question we will explore the relationships between the quadrats, in terms of tree species abundances using PCA. That is, which quadrats are most similar/dissimilar to one another.
Download wisc data setFormat of wisc.csv data file | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
> wisc <- read.csv("../downloads/data/wisc.csv") > wisc
QUADRAT BUROAK BLACKOAK WHITEOAK REDOAK ELM BASSWOOD IRONWOOD MAPLE 1 1 9 8 5 3 2 0 0 0 2 2 8 9 4 4 2 0 0 0 3 3 3 8 9 0 4 0 0 0 4 4 5 7 9 6 5 0 0 0 5 5 6 0 7 9 6 2 0 0 6 6 0 0 7 8 5 7 6 5 7 7 5 0 4 7 5 6 7 4 8 8 0 0 6 6 0 6 4 8 9 9 0 0 0 4 2 7 6 8 10 10 0 0 2 3 5 6 5 9
- Since a PCA is based either on covariances or correlations (both of which assume linearity), we first need
to explore the nature of the relationships between variables (species)
Show code
> pairs(wisc[, -1])
-
Use principal components analysis (PCA),
to generate new groups (components) and explore the trends in tree communities amongst quadrats.
Show code
> library(vegan) > wisc.pca <- rda(wisc[, -1], scale = TRUE) > summary(wisc.pca, display = NULL)
Call: rda(X = wisc[, -1], scale = TRUE) Partitioning of correlations: Inertia Proportion Total 8 1 Unconstrained 8 1 Eigenvalues, and their contribution to the correlations Importance of components: PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 Eigenvalue 4.634 1.687 0.7660 0.6153 0.1912 0.06725 0.03576 0.00346 Proportion Explained 0.579 0.211 0.0958 0.0769 0.0239 0.00841 0.00447 0.00043 Cumulative Proportion 0.579 0.790 0.8859 0.9628 0.9867 0.99510 0.99957 1.00000 Scaling 2 for species and site scores * Species are scaled proportional to eigenvalues * Sites are unscaled: weighted dispersion equal on all dimensions * General scaling constant of scores:
- Examine the Eigenvalues for each new component (group). The sum of these values should add up to the number of original variables (species).
If there were absolutely no correlations between the species, then you would expect each new component to represent a single original species and it would have an eigenvalue of 1.
The more correlated the species were, the more they will group together into the first few newly generated components (groups) and thus the higher the eigenvalues of these earlier groups.
What do the eigenvalues indicate in this case?
Show code> wisc.pca$CA$eig
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 4.634205 1.686922 0.765963 0.615270 0.191164 0.067254 0.035764 0.003456
> sum(wisc.pca$CA$eig)
[1] 8
- Calculate the percentage of total variation is explained by each of the new principal components.
How much of the total original variation is explained by principal component 1 (as a percentage)?
Show code
> 100 * wisc.pca$CA$eig/sum(wisc.pca$CA$eig)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 57.9276 21.0865 9.5745 7.6909 2.3896 0.8407 0.4471 0.0432
- Calculate the cumulative sum of these percentages.
How much of the total variation is explained by the first three principal components (as a percentage)?
Show code
> cumsum(100 * wisc.pca$CA$eig/sum(wisc.pca$CA$eig))
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 57.93 79.01 88.59 96.28 98.67 99.51 99.96 100.00
- Using the eigenvalues and a screeplot, determine how many principal components are necessary to represent the original variables (species)
. How many principal components are necessary?
Show code> screeplot(wisc.pca) > abline(a = 1, b = 0)
- Examine the Eigenvalues for each new component (group). The sum of these values should add up to the number of original variables (species).
If there were absolutely no correlations between the species, then you would expect each new component to represent a single original species and it would have an eigenvalue of 1.
The more correlated the species were, the more they will group together into the first few newly generated components (groups) and thus the higher the eigenvalues of these earlier groups.
What do the eigenvalues indicate in this case?
-
Generate a a quick biplot ordination (scatterplot of principal components) with principal component 1 on the x-axis and principal component 2 on the y-axis.
Are the patterns of quadrats associated with any particular tree species?
Show code
> biplot(wisc.pca, scaling = 2)
- Notice the "horseshoe effect". This effect is typically the result of a sampling regime that
spans multiple communities such that the abundances of species are unimodal.
Plot the abundances of the difference species against site number
Show code
> plot(1:10, wisc[, 3], type = "l") > for (i in 3:8) lines(1:10, wisc[, i], type = "l", col = i)