Workshop 14.2 - Principle Components Analysis (PCA)

14 Jan 2013

Basic statistics references

Legendre and Legendre
Quinn & Keough (2002) - Chpt 17

Principle components analysis

Gittens(1985) measured the abundance of 8 species of plants from 45 sites within 3 habitat types. Essentially, the plant ecologist wanted to be able to compare the sites according to their plant communities.

Download veg data set

Format of veg.csv data file

SITE	HABITAT	SP1	...	SP8
1	A	4	..	68
2	B	92	..	4
3	A	9	..	68
4	A	52	..	24
5	C	99	..	0
6	A	12	..	68
7	C	72	..	8
8	C	80	..	8
9	B	80	..	0
10	C	92	..	0

SITE	A number or name given to each quadrat (site)
HABITAT	A letter or name given to each habitat type
SP1, SP2, .., SP8	Number of individuals of each plant species found in each quadrat

Open the veg data set.

Show code

> veg <- read.csv("../downloads/data/veg.csv")
> veg

   SITE HABITAT SP1 SP2 SP3 SP4 SP5 SP6 SP7 SP8
1     1       A   4   0   0  36  28  24  99  68
2     2       B  92  84   0   8   0   0  84   4
3     3       A   9   0   0  52   4  40  96  68
4     4       A  52   0   0  52  12  28  96  24
5     5       C  99   0  36  88  52   8  72   0
6     6       A  12   0   0  20  40  40  88  68
7     7       C  72   0  20  72  24  20  72   8
8     8       C  80   0   0  48  16  28  92   8
9     9       B  80  76   4   8  12   0  84   0
10   10       C  92   0  40  72  36  12  84   0
11   11       A  28   4   0  16  56  28  96  56
12   12       A   8   0   0  36  68   8  99  28
13   13       C  99  12   4  84  36  12  88   8
14   14       A  40   0   0  68  12   8  88  24
15   15       A  28   0   0  36  64  28  99  56
16   16       A  28   0   0  28  44  20  88  32
17   17       C  80   0   0  52  20  32  96  20
18   18       C  84   0   0  76  44  16  96   0
19   19       B  88  40  12   8  24   8  92   0
20   20       C  99   0  60  88  28   0  80   0
21   21       A  12   0   0  36  16  12  88  76
22   22       A   0   0   0  20   8   0  99  60
23   23       C  88   0  12  72  32  16  88   0
24   24       C  56   0   4  32  56   4  96  16
25   25       C  99   0  40  60  20   4  56   4
26   26       A  12   0   0  28   4   4  99  72
27   27       A  28   0   0  48  64   4  99  28
28   28       B  92  52   0  40  64   8  96   4
29   29       C  80   0   0  68  40  12  80   8
30   30       A  32   0   0  56  28  36  84  24
31   31       A  40   0   0  60   8  36  96  56
32   32       A  44   0   0  44   8  20  96  24
33   33       A  48   0   0  72  20  12  99  32
34   34       A  48   0   0   8  44   8  92  56
35   35       B  99  36  20  56   8   4  24   0
36   36       A  15   0   4  36   4  28  99  44
37   37       A   8   0   0  20  16  12  99  56
38   38       A  28   0   0  24  16  12  99  36
39   39       A  52   0   0  48  12  28  99  32
40   40       C  92   0   4  56  12  16  70   4
41   41       C  92   0   8  52  56   8  99   4
42   42       A   4   0   0  44  24   4  99  60
43   43       A  16   0   0  36   0  24  99  76
44   44       A  76   0   0  48  12  36  96  32
45   45       B  96  36   4  28  28   8  88   4

Before leaping into a multivariate analysis, we will begin by treating the data set as a series of univariate responses (only a single response variable), and examine the patterns between habitats based on the following single species:
1. SP1
  Show code
  > plot(SP1 ~ SITE, col = as.numeric(HABITAT), data = veg, pch = 16)
  What does this plot illustrate about the relationships between habitats based on this plant species?
2. SP2
  Show code
  > plot(SP2 ~ SITE, col = as.numeric(HABITAT), data = veg, pch = 16)
  What does this plot illustrate about the relationships between habitats based on this plant species?
3. SP5
  Show code
  > plot(SP5 ~ SITE, col = as.numeric(HABITAT), data = veg, pch = 16)
  What does this plot illustrate about the relationships between habitats based on this plant species?
4. SP8
  Show code
  > plot(SP8 ~ SITE, col = as.numeric(HABITAT), data = veg, pch = 16)
  What does this plot illustrate about the relationships between habitats based on this plant species?

The ecologist was not interested in teasing out the patterns based on each individual species in isolation. The ecologist wanted to see patterns between the plant communities, rather than individual species. Hence a multivariate approach was taken. You may have noticed that the patterns between sites (and habitats) based on SP1 and SP8 were very similar. The abundances of SP1 and SP2 appear to be correlated to one another. It is not surprising that different species might be correlated, as they are likely to respond similarly to similar conditions.

If two or more species reveal exactly the same patterns (hypothetically), we could easily combine them into a single group that characterises the sites. Species are never likely to be exactly correlated, however we can still generate new groups that are the combinations of multiple species abundances. If we were to attempt to combine three species, two of which were highly correlated to one another and the other not correlated to either, then the two correlated species will contribute a lot to the new group and the other variable will contribute only little.

In the example, lets say we wanted to reduce the 8 species variables down to just 2 groups. Based on how much each species is correlated to each other species, each species will contribute something to each of the two new groups. So each new group is a linear combination of original species variables. This sort of data combining (or reduction) can be done in a number of ways, however for it to work meaningfully, there must be a reasonable degree of correlation between the species.

Examine the degree of correlation between each of the species (HINT).

Show code

> cor(veg[, 3:10])

         SP1     SP2      SP3      SP4      SP5      SP6     SP7     SP8
SP1  1.00000  0.3993  0.51974  0.42648  0.09499 -0.28461 -0.5248 -0.8927
SP2  0.39933  1.0000 -0.03110 -0.39693 -0.12100 -0.38590 -0.2359 -0.3847
SP3  0.51974 -0.0311  1.00000  0.49179  0.06762 -0.33336 -0.5441 -0.4715
SP4  0.42648 -0.3969  0.49179  1.00000  0.07452  0.05903 -0.3018 -0.4180
SP5  0.09499 -0.1210  0.06762  0.07452  1.00000 -0.14179  0.1307 -0.1895
SP6 -0.28461 -0.3859 -0.33336  0.05903 -0.14179  1.00000  0.2534  0.3706
SP7 -0.52483 -0.2359 -0.54411 -0.30180  0.13070  0.25345  1.0000  0.4704
SP8 -0.89271 -0.3847 -0.47153 -0.41800 -0.18947  0.37061  0.4704  1.0000

Does it appear that some species are correlated to others, which have the greatest degree of correlation?

SP1 and SP8 appeared to be highly negatively correlated. To examine this correlation, create a scatterplot of SP1 against SP8 (HINT) and fit a smooth line through these data(HINT). If we were purely trying to combine SP1 and SP2 into a new group, the position of each site (point) along this effectively becomes the sites new value in the new group.

Show code

> plot(SP1 ~ SP8, col = as.numeric(HABITAT), data = veg, pch = 16, asp = 1)
> # model II regression slope from eigen analysis it is not critical that you understand the
> # following - a simple line of best fit would illustrate just as well
> e <- eigen(cor(veg[, c(3, 10)]))
> e1 <- e$values[1] * e$vectors[, 1]
> e2 <- e$values[2] * e$vectors[, 2]
> slope <- (e2[2] - e1[2])/(e2[1] - e1[1])
> int <- mean(veg$SP1) - slope * mean(veg$SP8)
> abline(a = int, b = slope)

Since a PCA is based either on covariances or correlations (both of which assume linearity), we first need to explore the nature of the relationships between variables (species).
Show code
> pairs(veg[, 3:10])
What conclusions would you draw from the above?

Use principal components analysis (PCA, HINT) to generate new groups (components) and explore the trends in plant communities amongst sites (and habitats)

Show code

> library(vegan)
> veg.pca <- rda(veg[, c(-1, -2)], scale = TRUE)
> summary(veg.pca, display = NULL)

Call:
rda(X = veg[, c(-1, -2)], scale = TRUE) 

Partitioning of correlations:
              Inertia Proportion
Total               8          1
Unconstrained       8          1

Eigenvalues, and their contribution to the correlations 

Importance of components:
                        PC1   PC2   PC3   PC4    PC5    PC6    PC7    PC8
Eigenvalue            3.295 1.640 1.130 0.809 0.5023 0.3426 0.1890 0.0923
Proportion Explained  0.412 0.205 0.141 0.101 0.0628 0.0428 0.0236 0.0115
Cumulative Proportion 0.412 0.617 0.758 0.859 0.9220 0.9648 0.9885 1.0000

Scaling 2 for species and site scores
* Species are scaled proportional to eigenvalues
* Sites are unscaled: weighted dispersion equal on all dimensions
* General scaling constant of scores:

Examine the Eigenvalues for each new component (group). The sum of these values should add up to the number of original variables (species). If there were absolutely no correlations between the species, then you would expect each new component to represent a single original species and it would have an eigenvalue of 1. The more correlated the species were, the more they will group together into the first few newly generated components (groups) and thus the higher the eigenvalues of these earlier groups. What do the eigenvalues indicate in this case?

Show code
> veg.pca$CA$eig
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 3.29533 1.63976 1.12956 0.80916 0.50231 0.34259 0.18898 0.09231
> sum(veg.pca$CA$eig)
[1] 8
Calculate the percentage of total variation is explained by each of the new principal components. How much of the total original variation is explained by principal component 1 (as a percentage)?
Show code
> 100 * veg.pca$CA$eig/sum(veg.pca$CA$eig)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 41.192 20.497 14.120 10.115 6.279 4.282 2.362 1.154

Calculate the cumulative sum of these percentages. How much of the total variation is explained by the first three principal components (as a percentage)?

Show code

> cumsum(100 * veg.pca$CA$eig/sum(veg.pca$CA$eig))

   PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8 
 41.19  61.69  75.81  85.92  92.20  96.48  98.85 100.00

Using the eigenvalues and a screeplot, determine how many principal components are necessary to represent the original variables (species) . How many principal components are necessary?

Show code
> screeplot(veg.pca) > abline(a = 1, b = 0)

Generate a a quick biplot ordination (scatterplot of principal components) with principal component 1 on the x-axis and principal component 2 on the y-axis. Are the patterns of sites associated with any particular species?
Show code
> biplot(veg.pca, scaling = 2)
Whilst the above biplot illustrates some of the patterns, it does not allow us to directly see whether the communities change in the different habitats. So lets instead construct the plot at a lower level.
1. Create the base ordination plot and add the sites (colored according to habitat). Since we are more interested in the habitats than the actual sites, we can just label the points according to their habitat rather than their site names.
  Show code
  > veg.ord <- ordiplot(veg.pca, type = "n") > text(veg.ord, "sites", lab = veg$HABITAT, col = as.numeric(veg$HABITAT))
2. Lets now add the species correlation vectors (component loadings). This will yield a biplot similar to the previous question.
  Show code
  > veg.ord <- ordiplot(veg.pca, type = "n") > text(veg.ord, "sites", lab = veg$HABITAT, col = as.numeric(veg$HABITAT)) > data.envfit <- envfit(veg.pca, veg[, 3:8]) > plot(data.envfit, col = "grey")
3. Now lets fit the habitat vectors onto this ordination. Before environmental variables can he added to an ordination plot, they must first be numeric representations. If we wish to display the orientation of each habitat on the ordination plot, then we need to convert the habitat variable into dummy variables.
  Show code
  > veg.ord <- ordiplot(veg.pca, type = "n") > text(veg.ord, "sites", lab = veg$HABITAT, col = as.numeric(veg$HABITAT)) > data.envfit <- envfit(veg.pca, veg[, 3:8]) > plot(data.envfit, col = "grey") > # dummy code the habitat factor > habitat <- model.matrix(~-1 + HABITAT, veg) > data.envfit <- envfit(veg.pca, env = habitat) > data.envfit
  ***VECTORS PC1 PC2 r2 Pr(>r) HABITATA -0.989 0.147 0.74 0.001 *** HABITATB 0.448 -0.894 0.79 0.001 *** HABITATC 0.810 0.586 0.58 0.001 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 P values based on 999 permutations.
  > plot(data.envfit, col = "blue")
To ensure you appreciate the patterns displayed in this ordination plot, answer the following questions.
1. Species 1 in primarily associated with principal component (axis)?
2. Species 2 in primarily associated with principal component (axis)?
3. Species 5 in primarily associated with principal component (axis)?
4. Habitat A aligns primarily with the
5. Habitat B aligns primarily with the
6. Habitat C strongly reflects the abundances of

We could take the above exploration one step further. We would specifically address the question, are the communities in the three different habitats significantly different from one another. One simple way to achieve this is to take the first couple of principle components and use them as responses in ANOVA's against habitat. Lets do this separately for the first two principle components.

Since Habitat A is physically a little different to the other two habitats (it is occasionally completely submerged), it is perhaps sensible to compare each of the other habitats to habitat A

PC1

Show code

> veg.scores <- scores(veg.pca, display = "sites")
> summary(lm(veg.scores[, 1] ~ veg$HABITAT))

Call:
lm(formula = veg.scores[, 1] ~ veg$HABITAT)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6750 -0.2038 -0.0841  0.2462  0.8236 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.4905     0.0705   -6.96  1.7e-08 ***
veg$HABITATB   1.1457     0.1602    7.15  8.8e-09 ***
veg$HABITATC   1.0855     0.1176    9.23  1.2e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.352 on 42 degrees of freedom
Multiple R-squared: 0.722,	Adjusted R-squared: 0.709 
F-statistic: 54.5 on 2 and 42 DF,  p-value: 2.11e-12

> boxplot(veg.scores[, 1] ~ veg$HABITAT)

What conclusions would you draw?

PC2

Show code

> veg.scores <- scores(veg.pca, display = "sites")
> summary(lm(veg.scores[, 2] ~ veg$HABITAT))

Call:
lm(formula = veg.scores[, 2] ~ veg$HABITAT)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9480 -0.1510 -0.0156  0.2221  0.8079 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.073      0.074    0.99    0.329    
veg$HABITATB   -1.382      0.168   -8.22  2.8e-10 ***
veg$HABITATC    0.358      0.123    2.90    0.006 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.37 on 42 degrees of freedom
Multiple R-squared: 0.694,	Adjusted R-squared: 0.679 
F-statistic: 47.5 on 2 and 42 DF,  p-value: 1.63e-11

> boxplot(veg.scores[, 2] ~ veg$HABITAT)

What conclusions would you draw?

Principle components analysis

Peet & Loucks (1977) examined the abundances of 8 species of trees (Bur oak, Black oak, White oak, Red oak, American elm, Basswood, Ironwood, Sugar maple) at 10 forest sites in southern Wisconsin, USA. The data (given below) are the mean measurements of canopy cover for eight species of north American trees in 10 samples (quadrats). For this question we will explore the relationships between the quadrats, in terms of tree species abundances using PCA. That is, which quadrats are most similar/dissimilar to one another.

Download wisc data set

Format of wisc.csv data file

QUAD.	BUROAK	BLACKOAK	WHITEOAK	REDOAK	ELM	BASSWOOD	IRONWOOD	MAPLE
1	9	8	5	3	2	0	0	0
2	8	9	4	4	2	0	0	0
3	3	8	9	0	4	0	0	0
4	5	7	9	6	5	0	0	0
5	6	0	7	9	6	2	0	0
6	0	0	7	8	5	7	6	5
7	5	0	4	7	5	6	7	4
8	0	0	6	6	0	6	4	8
9	0	0	0	4	2	7	6	8
10	0	0	2	3	5	6	5	9

QUADRAT	A number or name given to each quadrat
BUROAK, BLACKOAK,....	Number of individuals of each tree species found in each quadrat

Open the wisc data set.

Show code

> wisc <- read.csv("../downloads/data/wisc.csv")
> wisc

   QUADRAT BUROAK BLACKOAK WHITEOAK REDOAK ELM BASSWOOD IRONWOOD MAPLE
1        1      9        8        5      3   2        0        0     0
2        2      8        9        4      4   2        0        0     0
3        3      3        8        9      0   4        0        0     0
4        4      5        7        9      6   5        0        0     0
5        5      6        0        7      9   6        2        0     0
6        6      0        0        7      8   5        7        6     5
7        7      5        0        4      7   5        6        7     4
8        8      0        0        6      6   0        6        4     8
9        9      0        0        0      4   2        7        6     8
10      10      0        0        2      3   5        6        5     9

Since a PCA is based either on covariances or correlations (both of which assume linearity), we first need to explore the nature of the relationships between variables (species)
Show code
> pairs(wisc[, -1])
What conclusions would you draw from the above?

Use principal components analysis (PCA), to generate new groups (components) and explore the trends in tree communities amongst quadrats.

Show code

> library(vegan)
> wisc.pca <- rda(wisc[, -1], scale = TRUE)
> summary(wisc.pca, display = NULL)

Call:
rda(X = wisc[, -1], scale = TRUE) 

Partitioning of correlations:
              Inertia Proportion
Total               8          1
Unconstrained       8          1

Eigenvalues, and their contribution to the correlations 

Importance of components:
                        PC1   PC2    PC3    PC4    PC5     PC6     PC7     PC8
Eigenvalue            4.634 1.687 0.7660 0.6153 0.1912 0.06725 0.03576 0.00346
Proportion Explained  0.579 0.211 0.0958 0.0769 0.0239 0.00841 0.00447 0.00043
Cumulative Proportion 0.579 0.790 0.8859 0.9628 0.9867 0.99510 0.99957 1.00000

Scaling 2 for species and site scores
* Species are scaled proportional to eigenvalues
* Sites are unscaled: weighted dispersion equal on all dimensions
* General scaling constant of scores:

Examine the Eigenvalues for each new component (group). The sum of these values should add up to the number of original variables (species). If there were absolutely no correlations between the species, then you would expect each new component to represent a single original species and it would have an eigenvalue of 1. The more correlated the species were, the more they will group together into the first few newly generated components (groups) and thus the higher the eigenvalues of these earlier groups. What do the eigenvalues indicate in this case?

Show code
> wisc.pca$CA$eig
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 4.634205 1.686922 0.765963 0.615270 0.191164 0.067254 0.035764 0.003456
> sum(wisc.pca$CA$eig)
[1] 8
Calculate the percentage of total variation is explained by each of the new principal components. How much of the total original variation is explained by principal component 1 (as a percentage)?
Show code
> 100 * wisc.pca$CA$eig/sum(wisc.pca$CA$eig)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 57.9276 21.0865 9.5745 7.6909 2.3896 0.8407 0.4471 0.0432

Calculate the cumulative sum of these percentages. How much of the total variation is explained by the first three principal components (as a percentage)?

Show code

> cumsum(100 * wisc.pca$CA$eig/sum(wisc.pca$CA$eig))

   PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8 
 57.93  79.01  88.59  96.28  98.67  99.51  99.96 100.00

Using the eigenvalues and a screeplot, determine how many principal components are necessary to represent the original variables (species) . How many principal components are necessary?

Show code
> screeplot(wisc.pca) > abline(a = 1, b = 0)

Generate a a quick biplot ordination (scatterplot of principal components) with principal component 1 on the x-axis and principal component 2 on the y-axis. Are the patterns of quadrats associated with any particular tree species?
Show code
> biplot(wisc.pca, scaling = 2)
Notice the "horseshoe effect". This effect is typically the result of a sampling regime that spans multiple communities such that the abundances of species are unimodal. Plot the abundances of the difference species against site number
Show code
> plot(1:10, wisc[, 3], type = "l") > for (i in 3:8) lines(1:10, wisc[, i], type = "l", col = i)

Transformation	Syntax
log_e	> new_var <- log(old_var)
log₁₀	> new_var <- log10(old_var)
square root	> new_var <- sqrt(old_var)
arcsin	> new_var <- asin(sqrt(old_var))
scale (mean=0, unit variance)	> new_var <- scale(old_var)

Transformation

Syntax

log_e

> new_var <- log(old_var)

log₁₀

> new_var <- log10(old_var)

square root

> new_var <- sqrt(old_var)

arcsin

> new_var <- asin(sqrt(old_var))

scale (mean=0, unit variance)

> new_var <- scale(old_var)

Sample number	Sample mean
1	12.1
2	12.7
3	12.5
Mean of sample means	12.433
> SD of sample means	0.306