Workshop 9.6a - Factorial Analysis of Variance
5 July 2011
Factorial ANOVA references
- Logan (2010) - Chpt 12
- Quinn & Keough (2002) - Chpt 9
Very basic overview of Two-Factor ANOVA
Exercise 2 - Two factor ANOVA
A biologist studying starlings wanted to know whether the mean mass of starlings differed according to different roosting situations. She was also interested in whether the mean mass of starlings altered over winter (Northern hemisphere) and whether the patterns amongst roosting situations were consistent throughout winter, therefore starlings were captured at the start (November) and end of winter (January). Ten starlings were captured from each roosting situation in each season, so in total, 80 birds were captured and weighed.
Download Starling data set
Format of starling.csv data files | |||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Open the starling data file.
starling <- read.table("../downloads/data/starling.csv", header = T, sep = ",", strip.white = T) head(starling)
## SITUATION MONTH MASS GROUP ## 1 S1 November 78 S1Nov ## 2 S1 November 88 S1Nov ## 3 S1 November 87 S1Nov ## 4 S1 November 88 S1Nov ## 5 S1 November 83 S1Nov ## 6 S1 November 82 S1Nov
boxplot(MASS ~ SITUATION * MONTH, data = starling)
means <- with(starling, tapply(MASS, list(SITUATION, MONTH), mean)) vars <- with(starling, tapply(MASS, list(SITUATION, MONTH), var)) plot(means, vars, pch = 16)
- Is there any evidence that one or more of the assumptions are likely to be violated? (Y or N)
- Is the proposed model balanced? (Y or N)
replications(MASS ~ SITUATION * MONTH, data = starling)
## SITUATION MONTH SITUATION:MONTH ## 20 40 10
!is.list(replications(MASS ~ SITUATION * MONTH, data = starling))
## [1] TRUE
starling.lm <- lm(MASS ~ SITUATION * MONTH, data = starling) # setup a 2x6 plotting device with small margins par(mfrow = c(2, 1), oma = c(0, 0, 2, 0)) plot(starling.lm, ask = F, which = 1:2)
Examine the ANOVA table
summary(starling.lm)
## ## Call: ## lm(formula = MASS ~ SITUATION * MONTH, data = starling) ## ## Residuals: ## Min 1Q Median 3Q Max ## -7.4 -3.2 -0.4 2.9 9.2 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 90.80 1.33 68.26 < 2e-16 *** ## SITUATIONS2 -0.60 1.88 -0.32 0.75069 ## SITUATIONS3 -2.60 1.88 -1.38 0.17121 ## SITUATIONS4 -6.60 1.88 -3.51 0.00078 *** ## MONTHNovember -7.20 1.88 -3.83 0.00027 *** ## SITUATIONS2:MONTHNovember -3.60 2.66 -1.35 0.18023 ## SITUATIONS3:MONTHNovember -2.40 2.66 -0.90 0.37000 ## SITUATIONS4:MONTHNovember -1.60 2.66 -0.60 0.54946 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 4.21 on 72 degrees of freedom ## Multiple R-squared: 0.64, Adjusted R-squared: 0.605 ## F-statistic: 18.3 on 7 and 72 DF, p-value: 9.55e-14
anova(starling.lm)
## Analysis of Variance Table ## ## Response: MASS ## Df Sum Sq Mean Sq F value Pr(>F) ## SITUATION 3 574 191 10.82 6.0e-06 *** ## MONTH 1 1656 1656 93.60 1.2e-14 *** ## SITUATION:MONTH 3 34 11 0.64 0.59 ## Residuals 72 1274 18 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Source of Variation | SS | df | MS | F-ratio | Pvalue |
---|---|---|---|---|---|
SITUATION | |||||
MONTH | |||||
SITUATION : MONTH | |||||
Residual (within groups) |
library(car)
with(starling, interaction.plot(SITUATION, MONTH, MASS))
In the classical frequentist regime, many at this point would advocate dropping the interaction term from the model (p-value for the interaction is greater than 0.25). This term is not soaking up much of the residual and yet is soaking up 3 degrees of freedom. The figure also indicates that situation and month are likely to operate additively.
library(multcomp)
summary(glht(starling.lm, linfct = mcp(SITUATION = "Tukey")))
## Warning: covariate interactions found -- default contrast might be inappropriate
## ## Simultaneous Tests for General Linear Hypotheses ## ## Multiple Comparisons of Means: Tukey Contrasts ## ## ## Fit: lm(formula = MASS ~ SITUATION * MONTH, data = starling) ## ## Linear Hypotheses: ## Estimate Std. Error t value Pr(>|t|) ## S2 - S1 == 0 -0.60 1.88 -0.32 0.9887 ## S3 - S1 == 0 -2.60 1.88 -1.38 0.5146 ## S4 - S1 == 0 -6.60 1.88 -3.51 0.0043 ** ## S3 - S2 == 0 -2.00 1.88 -1.06 0.7129 ## S4 - S2 == 0 -6.00 1.88 -3.19 0.0112 * ## S4 - S3 == 0 -4.00 1.88 -2.13 0.1545 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## (Adjusted p values reported -- single-step method)
library(multcomp) summary(glht(lm(MASS ~ SITUATION + MONTH, data = starling), linfct = mcp(SITUATION = "Tukey")))
## ## Simultaneous Tests for General Linear Hypotheses ## ## Multiple Comparisons of Means: Tukey Contrasts ## ## ## Fit: lm(formula = MASS ~ SITUATION + MONTH, data = starling) ## ## Linear Hypotheses: ## Estimate Std. Error t value Pr(>|t|) ## S2 - S1 == 0 -2.40 1.32 -1.82 0.2735 ## S3 - S1 == 0 -3.80 1.32 -2.88 0.0263 * ## S4 - S1 == 0 -7.40 1.32 -5.60 <0.001 *** ## S3 - S2 == 0 -1.40 1.32 -1.06 0.7147 ## S4 - S2 == 0 -5.00 1.32 -3.79 0.0018 ** ## S4 - S3 == 0 -3.60 1.32 -2.73 0.0388 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## (Adjusted p values reported -- single-step method)
confint(glht(lm(MASS ~ SITUATION + MONTH, data = starling), linfct = mcp(SITUATION = "Tukey")))
## ## Simultaneous Confidence Intervals ## ## Multiple Comparisons of Means: Tukey Contrasts ## ## ## Fit: lm(formula = MASS ~ SITUATION + MONTH, data = starling) ## ## Quantile = 2.628 ## 95% family-wise confidence level ## ## ## Linear Hypotheses: ## Estimate lwr upr ## S2 - S1 == 0 -2.400 -5.870 1.070 ## S3 - S1 == 0 -3.800 -7.270 -0.330 ## S4 - S1 == 0 -7.400 -10.870 -3.930 ## S3 - S2 == 0 -1.400 -4.870 2.070 ## S4 - S2 == 0 -5.000 -8.470 -1.530 ## S4 - S3 == 0 -3.600 -7.070 -0.130
Comparison | Multiplicative | Additive | ||||||
---|---|---|---|---|---|---|---|---|
Est | P | Lwr | Upr | Est | P | Lwr | Upr | |
Situation 2 vs Situation 1 | ||||||||
Situation 3 vs Situation 1 | ||||||||
Situation 4 vs Situation 1 | ||||||||
Situation 3 vs Situation 2 | ||||||||
Situation 4 vs Situation 2 | ||||||||
Situation 4 vs Situation 3 |
starling.lm2 <- lm(MASS ~ SITUATION + MONTH, data = starling) cbind(coef(starling.lm2), confint(starling.lm2))
## 2.5 % 97.5 % ## (Intercept) 91.75 89.670 93.830 ## SITUATIONS2 -2.40 -5.031 0.231 ## SITUATIONS3 -3.80 -6.431 -1.169 ## SITUATIONS4 -7.40 -10.031 -4.769 ## MONTHNovember -9.10 -10.960 -7.240
contrasts(starling$SITUATION) <- contr.sum
Estimate | Mean | Lower 95% CI | Upper 95% CI |
---|---|---|---|
December Situation 1 | |||
Effect size (Dec:Sit2 - Dec:Sit1) | |||
Effect size (Dec:Sit3 - Dec:Sit1) | |||
Effect size (Dec:Sit4 - Dec:Sit1) | |||
Effect size (Nov:Sit1 - Dec:Sit1) |
opar <- par(mar = c(5, 5, 1, 1)) star.means <- with(starling, tapply(MASS, list(MONTH, SITUATION), mean)) star.sds <- with(starling, tapply(MASS, list(MONTH, SITUATION), sd)) star.len <- with(starling, tapply(MASS, list(MONTH, SITUATION), length)) star.ses <- star.sds/sqrt(star.len) xs <- barplot(star.means, ylim = range(starling$MASS), beside = T, axes = F, ann = F, axisnames = F, xpd = F, axis.lty = 2, col = c(0, 1)) arrows(xs, star.means, xs, star.means + star.ses, code = 2, length = 0.05, ang = 90) axis(2, las = 1) axis(1, at = apply(xs, 2, median), lab = c("Situation 1", "Situation 2", "Situation 3", "Situation 4")) mtext(2, text = "Mass (g) of starlings", line = 3, cex = 1.25) legend("topright", leg = c("January", "November"), fill = c(0, 1), col = c(0, 1), bty = "n", cex = 1) box(bty = "l")
par(opar)
Exercise 3 - Two factor ANOVA - Type III SS
Here is a modified example from Quinn and Keough (2002). Stehman and Meredith (1995) present data from an experiment that was set up to test the hypothesis that healthy spruce seedlings break bud sooner than diseased spruce seedlings. There were 2 factors: pH (3 levels: 3, 5.5, 7) and HEALTH (2 levels: healthy, diseased). The dependent variable was the average (from 5 buds) bud emergence rating (BRATING) on each seedling. The sample size varied for each combination of pH and health, ranging from 7 to 23 seedlings. With two factors, this experiment should be analyzed with a 2 factor (2 x 3) ANOVA.
Download Stehman data set
Format of stehman.csv data files | |||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Open the stehman data file.
stehman <- read.table("../downloads/data/stehman.csv", header = T, sep = ",", strip.white = T) head(stehman)
## PH HEALTH GROUP BRATING ## 1 3 D D3 0.0 ## 2 3 D D3 0.8 ## 3 3 D D3 0.8 ## 4 3 D D3 0.8 ## 5 3 D D3 0.8 ## 6 3 D D3 0.8
stehman$PH <- as.factor(stehman$PH)
boxplot(BRATING ~ HEALTH * PH, data = stehman)
means <- with(stehman, tapply(BRATING, list(HEALTH, PH), mean)) vars <- with(stehman, tapply(BRATING, list(HEALTH, PH), var)) plot(means, vars, pch = 16)
- Is there any evidence that one or more of the assumptions are likely to be violated? (Y or N)
- Is the proposed model balanced? (Y or N)
replications(BRATING ~ HEALTH * PH, data = stehman)
## $HEALTH ## HEALTH ## D H ## 67 28 ## ## $PH ## PH ## 3 5.5 7 ## 34 30 31 ## ## $`HEALTH:PH` ## PH ## HEALTH 3 5.5 7 ## D 23 23 21 ## H 11 7 10
!is.list(replications(BRATING ~ HEALTH * PH, data = stehman))
## [1] FALSE
As the model is not balanced, we will likely want to examine the ANOVA table based on Type III (marginal) Sums of Squares. In preparation for doing so, we must define something other than treatment contrasts for the factors.
contrasts(stehman$HEALTH) <- contr.sum contrasts(stehman$PH) <- contr.sum
stehman.lm <- lm(BRATING ~ HEALTH * PH, data = stehman) library(car) Anova(stehman.lm, type = "III")
## Anova Table (Type III tests) ## ## Response: BRATING ## Sum Sq Df F value Pr(>F) ## (Intercept) 114.1 1 444.87 <2e-16 *** ## HEALTH 2.4 1 9.40 0.0029 ** ## PH 1.9 2 3.63 0.0306 * ## HEALTH:PH 0.2 2 0.37 0.6897 ## Residuals 22.8 89 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Source of Variation | SS | df | MS | F-ratio | Pvalue |
---|---|---|---|---|---|
PH | |||||
HEALTH | |||||
PH : HEALTH | |||||
Residual (within groups) |
library(car) with(stehman, interaction.plot(PH, HEALTH, BRATING))
library(multcomp) summary(glht(stehman.lm, linfct = mcp(PH = "Tukey")))
## Warning: covariate interactions found -- default contrast might be inappropriate
## ## Simultaneous Tests for General Linear Hypotheses ## ## Multiple Comparisons of Means: Tukey Contrasts ## ## ## Fit: lm(formula = BRATING ~ HEALTH * PH, data = stehman) ## ## Linear Hypotheses: ## Estimate Std. Error t value Pr(>|t|) ## 5.5 - 3 == 0 -0.386 0.143 -2.69 0.023 * ## 7 - 3 == 0 -0.173 0.134 -1.29 0.407 ## 7 - 5.5 == 0 0.213 0.146 1.46 0.316 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## (Adjusted p values reported -- single-step method)
pH 3 vs pH 5.5
pH 3 vs pH 7
pH 5.5 vs pH 7
The above analysis reflects a very classic approach to investigating the effects of PH and HEALTH via NHST (null hypothesis testing). Many argue that a more modern/valid approach is to:
- Abandon hypothesis testing
- Estimate effect sizes and associated effect sizes
- As PH is an ordered factor, this is arguably better modelled via polynomial contrasts
stehman.lm2 <- lm(BRATING ~ PH * HEALTH, data = stehman, contrasts = list(PH = contr.poly(3, scores = c(3, 5.5, 7)), HEALTH = contr.treatment)) summary(stehman.lm2)
## ## Call: ## lm(formula = BRATING ~ PH * HEALTH, data = stehman, contrasts = list(PH = contr.poly(3, ## scores = c(3, 5.5, 7)), HEALTH = contr.treatment)) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.2286 -0.3238 -0.0087 0.3818 0.9913 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 1.0413 0.0619 16.81 <2e-16 *** ## PH.L -0.0879 0.1077 -0.82 0.4162 ## PH.Q 0.2751 0.1069 2.57 0.0117 * ## HEALTH2 0.3543 0.1155 3.07 0.0029 ** ## PH.L:HEALTH2 -0.1360 0.1899 -0.72 0.4758 ## PH.Q:HEALTH2 -0.1008 0.2099 -0.48 0.6323 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.506 on 89 degrees of freedom ## Multiple R-squared: 0.196, Adjusted R-squared: 0.15 ## F-statistic: 4.33 on 5 and 89 DF, p-value: 0.00143
confint(stehman.lm2)
## 2.5 % 97.5 % ## (Intercept) 0.91821 1.1643 ## PH.L -0.30184 0.1260 ## PH.Q 0.06273 0.4875 ## HEALTH2 0.12475 0.5839 ## PH.L:HEALTH2 -0.51324 0.2413 ## PH.Q:HEALTH2 -0.51773 0.3162
- Healthy spruce trees have a higher bud emergence rating than diseased (ES=0.35 CI=0.12-0.58)
- The bud emergence rating follows a quadratic change with PH
Exercise 4 - Two factor ANOVA
An ecologist studying a rocky shore at Phillip Island, in southeastern Australia, was interested in how clumps of intertidal mussels are maintained. In particular, he wanted to know how densities of adult mussels affected recruitment of young individuals from the plankton. As with most marine invertebrates, recruitment is highly patchy in time, so he expected to find seasonal variation, and the interaction between season and density - whether effects of adult mussel density vary across seasons - was the aspect of most interest.
The data were collected from four seasons, and with two densities of adult mussels. The experiment consisted of clumps of adult mussels attached to the rocks. These clumps were then brought back to the laboratory, and the number of baby mussels recorded. There were 3-6 replicate clumps for each density and season combination.
Download Quinn data set
Format of quinn.csv data files | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Open the quinn data file.
quinn <- read.table("../downloads/data/quinn.csv", header = T, sep = ",", strip.white = T) head(quinn)
## SEASON DENSITY RECRUITS SQRTRECRUITS GROUP ## 1 Spring Low 15 3.873 SpringLow ## 2 Spring Low 10 3.162 SpringLow ## 3 Spring Low 13 3.606 SpringLow ## 4 Spring Low 13 3.606 SpringLow ## 5 Spring Low 5 2.236 SpringLow ## 6 Spring High 11 3.317 SpringHigh
Confirm the need for a square root transformation, by examining boxplots
and mean vs variance plots for both raw and transformed data. Note that square root transformation was selected because the data were counts (count data often includes values of zero - cannot compute log of zero).par(mfrow = c(2, 2)) boxplot(RECRUITS ~ SEASON * DENSITY, data = quinn) means <- with(quinn, tapply(RECRUITS, list(SEASON, DENSITY), mean)) vars <- with(quinn, tapply(RECRUITS, list(SEASON, DENSITY), var)) plot(means, vars, pch = 16) boxplot(SQRTRECRUITS ~ SEASON * DENSITY, data = quinn) means <- with(quinn, tapply(SQRTRECRUITS, list(SEASON, DENSITY), mean)) vars <- with(quinn, tapply(SQRTRECRUITS, list(SEASON, DENSITY), var)) plot(means, vars, pch = 16)
!is.list(replications(sqrt(RECRUITS) ~ SEASON * DENSITY, data = quinn))
## [1] FALSE
replications(sqrt(RECRUITS) ~ SEASON * DENSITY, data = quinn)
## $SEASON ## SEASON ## Autumn Spring Summer Winter ## 10 11 12 9 ## ## $DENSITY ## DENSITY ## High Low ## 24 18 ## ## $`SEASON:DENSITY` ## DENSITY ## SEASON High Low ## Autumn 6 4 ## Spring 6 5 ## Summer 6 6 ## Winter 6 3
contrasts(quinn$SEASON) <- contr.sum contrasts(quinn$DENSITY) <- contr.sum
. Produce an anova table based on Type III SS and fill in the following table:
quinn.lm <- lm(SQRTRECRUITS ~ SEASON * DENSITY, data = quinn) library(car) Anova(quinn.lm, type = "III")
## Anova Table (Type III tests) ## ## Response: SQRTRECRUITS ## Sum Sq Df F value Pr(>F) ## (Intercept) 540 1 529.04 < 2e-16 *** ## SEASON 91 3 29.61 1.3e-09 *** ## DENSITY 6 1 6.35 0.017 * ## SEASON:DENSITY 11 3 3.71 0.021 * ## Residuals 35 34 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Source of Variation | SS | df | MS | F-ratio | Pvalue |
---|---|---|---|---|---|
SEASON | |||||
DENSITY | |||||
SEASON : DENSITY | |||||
Residual (within groups) |
library(car) with(quinn, interaction.plot(SEASON, DENSITY, SQRTRECRUITS))
For the current data set, the effect of density is of greatest interest, and thus the former option is the most interesting. Perform the simple main effects anovas.
Download biology package
library(biology) AnovaM(quinn.lm.summer <- mainEffects(quinn.lm, at = SEASON == "Summer"), type = "III")
## Warning: number of items to replace is not a multiple of replacement length
## Warning: number of items to replace is not a multiple of replacement length
## Warning: number of items to replace is not a multiple of replacement length
## Df Sum Sq Mean Sq F value Pr(>F) ## INT 6 91.2 15.20 14.9 2.8e-08 *** ## DENSITY 1 14.7 14.70 14.4 0.00058 *** ## Residuals 34 34.7 1.02 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(biology) AnovaM(quinn.lm. <- mainEffects(quinn.lm, at = SEASON == "Autumn"), type = "III")
## Warning: number of items to replace is not a multiple of replacement length
## Warning: number of items to replace is not a multiple of replacement length
## Warning: number of items to replace is not a multiple of replacement length
## Df Sum Sq Mean Sq F value Pr(>F) ## INT 6 105.9 17.65 17.3 4.7e-09 *** ## DENSITY 1 0.0 0.00 0.0 0.96 ## Residuals 34 34.7 1.02 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AnovaM(quinn.lm. <- mainEffects(quinn.lm, at = SEASON == "Winter"), type = "III")
## Warning: number of items to replace is not a multiple of replacement length
## Warning: number of items to replace is not a multiple of replacement length
## Warning: number of items to replace is not a multiple of replacement length
## Df Sum Sq Mean Sq F value Pr(>F) ## INT 6 102.4 17.06 16.72 7.1e-09 *** ## DENSITY 1 3.5 3.54 3.47 0.071 . ## Residuals 34 34.7 1.02 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AnovaM(quinn.lm. <- mainEffects(quinn.lm, at = SEASON == "Spring"), type = "III")
## Warning: number of items to replace is not a multiple of replacement length
## Warning: number of items to replace is not a multiple of replacement length
## Warning: number of items to replace is not a multiple of replacement length
## Df Sum Sq Mean Sq F value Pr(>F) ## INT 6 105.7 17.61 17.3 4.8e-09 *** ## DENSITY 1 0.2 0.21 0.2 0.65 ## Residuals 34 34.7 1.02 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Was the effect of DENSITY on recruitment consistent across all levels of SEASON? (Y or N)
- How would you interpret these results?