Jump to main navigation


Workshop 9.9a - Split-plot and complex repeated measures designs (mixed effects models)

4 August 2011

Basic statistics references

  • Logan (2010) - Chpt 12-14
  • Quinn & Keough (2002) - Chpt 9-11
Very basic overview of Nested ANOVA

Very basic overview of Randomized block

Split-plot

Split-plot

In an attempt to understand the effects on marine animals of short-term exposure to toxic substances, such as might occur following a spill, or a major increase in storm water flows, a it was decided to examine the toxicant in question, Copper, as part of a field experiment in Honk Kong. The experiment consisted of small sources of Cu (small, hemispherical plaster blocks, impregnated with copper), which released the metal into sea water over 4 or 5 days. The organism whose response to Cu was being measured was a small, polychaete worm, Hydroides, that attaches to hard surfaces in the sea, and is one of the first species to colonize any surface that is submerged. The biological questions focused on whether the timing of exposure to Cu affects the overall abundance of these worms. The time period of interest was the first or second week after a surface being available.

The experimental setup consisted of sheets of black perspex (settlement plates), which provided good surfaces for these worms. Each plate had a plaster block bolted to its centre, and the dissolving block would create a gradient of [Cu] across the plate. Over the two weeks of the experiment, a given plate would have pl ain plaster blocks (Control) or a block containing copper in the first week, followed by a plain block, or a plain block in the first week, followed by a dose of copper in the second week. After two weeks in the water, plates were removed and counted back in the laboratory. Without a clear idea of how sensitive these worms are to copper, an effect of the treatments might show up as an overall difference in the density of worms across a plate, or it could show up as a gradient in abundance across the plate, with a different gradient in different treatments. Therefore, on each plate, the density of worms (#/cm2) was recorded at each of four distances from the center of the plate.

Download Copper data set

Format of copper.csv data file
COPPERPLATEDISTWORMS
........

COPPERCategorical listing of the copper treatment (control = no copper applied, week 2 = copper treatment applied in second week and week 1= copper treatment applied in first week) applied to whole plates. Factor A (between plot factor).
PLATESubstrate provided for polychaete worm colonization on which copper treatment applied. These are the plots (Factor B). Numbers in this column represent numerical labels given to each plate.
DISTCategorical listing for the four concentric distances from the center of the plate (source of copper treatment) with 1 being the closest and 4 the furthest. Factor C (within plot factor)
WORMSDensity (#/cm2) of worms measured. Response variable.
Saltmarsh

Open the Copper data set
Show code
> copper <- read.table("../downloads/data/copper.csv", header = T,
+     sep = ",", strip.white = T)
> head(copper)
   COPPER PLATE DIST WORMS
1 control   200    4 11.50
2 control   200    3 13.00
3 control   200    2 13.50
4 control   200    1 12.00
5 control    39    4 17.75
6 control    39    3 13.75

Notice that both the PLATE variable and the DIST variable contain only numbers. Make sure that you define both of these as factors (HINT)
Show code
> copper$PLATE <- factor(copper$PLATE)
> copper$DIST <- factor(copper$DIST)
Q3-1. What are the main hypotheses being tested?
  1. H0 Main Effect 1 (Factor A):
  2. H0 Main Effect 2 (Factor C):
  3. H0 Main Effect 3 (A*C):

Q3-2. The usual ANOVA assumptions apply to split-plot designs, and these can be tested by constructing boxplots for each of the main effects. However, it is important to consider what data the boxplots should be based upon. For each of the main hypothesis tests, describe what data should be used to construct boxplots (remember that the assumptions of normality and homogeneity of variance apply to the residuals of each hypothesis test, and therefore the data used in the construction of boxplots for each hypothesis test should represent the respective residuals, or sources of unexplained variation).
  1. H0 Main Effect 1 (Factor A):
  2. H0 Main Effect 2 (Factor C):
  3. H0 Main Effect 3 (A*C):

Q3-3. For each of the hypothesis tests, indicate which Mean Square term should be used as the residual (denominator) in the F-ratio calculation. Note, COPPER and DIST are fixed factors and PLATE is a random factor.
  1. H0 Main Effect 1 (Factor A): F-ratio = MSCOPPER/MS
  2. H0 Main Effect 2 (Factor C): F-ratio = MSDIST/MS
  3. H0 Main Effect 3 (A*C): F-ratio = MSCOPPER:DIST/MS

Q3-4. Construct a boxplot to investigate the assumptions as they apply to the test of H0 Main Effect 1 (Factor A): This is done in two steps

  1. Aggregate the data set by the mean number of WORMS within each plate
    (HINT)
  2. Show code
    > library(plyr)
    > cu.ag <- ddply(copper, ~COPPER + PLATE, function(df) data.frame(WORMS = mean(df$WORMS)))
    > #OR
    > library(plyr)
    > cu.ag <- ddply(copper, ~COPPER + PLATE, summarise, WORMS = mean(WORMS,
    +     na.rm = TRUE))
    > #OR
    > cu.ag <- aggregate(copper, list(COPPER = copper$COPPER, PLATE = copper$PLATE),
    +     mean)
    > #OR
    > library(nlme)
    > cu.ag <- gsummary(copper, form = ~COPPER/PLATE, base:::mean)
    
  3. Construct a boxplot of aggregated mean number of WORMS against COPPER treatment
    (HINT)
  4. Show code
    > boxplot(WORMS ~ COPPER, data = cu.ag)
    
  5. Any evidence of violations of the assumptions (y or n)?


Q3-5. Construct a boxplot to investigate the assumptions as they apply to the test of H0 Main Effect 2 (Factor C): Since Factor C is tested against the overal residual in this case, this is a relatively straight forward procedure. (HINT)
Show code
> boxplot(WORMS ~ DIST, data = copper)
  1. Any evidence of violations of the assumptions (y or n)?


Q3-6. Construct a boxplot to investigate the assumptions as they apply to the test of H0 the main interaction effect (A:C): Since A:C is tested against the overal residual, this is a relatively straight forward procedure.(HINT)
Show code
> boxplot(WORMS ~ COPPER:DIST, data = copper)
  1. Any evidence of violations of the assumptions (y or n)?


Q3-7. In addition to the above assumptions, the test of PLATE assumes that there is no PLATE by DIST interaction as this is the overal residual (the replicates). That is, the test assumes that the effect of DIST is consistent in all PLATES. Construct an interaction plot to examine whether there is any evidence of an interaction between PLATE and DISTANCE (HINT)
Show code
> library(car)
> with(copper, interaction.plot(PLATE, DIST, WORMS))
> library(alr3)
> tukey.nonadd.test(lm(WORMS ~ PLATE + DIST, data = copper))
      Test     Pvalue 
-4.122e+00  3.749e-05 
  1. Any evidence of an interaction (y or n)?
  2. Is the design (model) unbalanced ?
    (HINT) (Yor N)
    Show code
    > !is.list(replications(WORMS ~ COPPER + Error(PLATE) + DIST + COPPER:DIST,
    +     data = copper))
    
    [1] TRUE
    
  3. Is there any evidence of issues with sphericity
    ?(HINT) (Y or N)
  4. Show code
    > library(biology)
    > epsi.GG.HF(aov(WORMS ~ COPPER + Error(PLATE) + DIST + COPPER:DIST,
    +     data = copper))
    
    $GG.eps
    [1] 0.5203
    
    $HF.eps
    [1] 0.5841
    
    $sigma
    [1] 1.807
    
    

Q3-8. Write out the linear model

Q3-9. Perform a split-plot ANOVA
(HINT), and complete the following table (HINT). To obtain the hypothesis test for the random factor (Factor B: PLATE), run the model as if all factors were fixed and thus all terms are tested against the overall residuals, HINT)
Show code
> copper.aov <- aov(WORMS ~ COPPER + DIST + COPPER:DIST + Error(PLATE),
+     data = copper)
> AnovaM(copper.aov, RM = T)
   Sphericity Epsilon Values 
-------------------------------
 Greenhouse.Geisser Huynh.Feldt
             0.5203      0.5841

Anova Table (Type I tests)
Response: WORMS
Error: PLATE
          Df Sum Sq Mean Sq F value  Pr(>F)    
COPPER     2    784     392     128 8.1e-09 ***
Residuals 12     37       3                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Error: Within
            Df Sum Sq Mean Sq F value  Pr(>F)    
DIST         3  153.8    51.3   28.38 1.4e-09 ***
COPPER:DIST  6   53.4     8.9    4.93   9e-04 ***
Residuals   36   65.0     1.8                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


Greenhouse-Geisser corrected ANOVA table
Response: WORMS
Error: PLATE
             Df Sum Sq Mean Sq F value Pr(>F)    
COPPER     1.04    784     392     128  8e-08 ***
Residuals 12.00     37       3                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Error: Within
               Df Sum Sq Mean Sq F value  Pr(>F)    
DIST         1.56  153.8    51.3   28.38 2.7e-07 ***
COPPER:DIST  3.12   53.4     8.9    4.93  0.0052 ** 
Residuals   36.00   65.0     1.8                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


Huynh-Feldt corrected ANOVA table
Response: WORMS
Error: PLATE
             Df Sum Sq Mean Sq F value  Pr(>F)    
COPPER     1.17    784     392     128 5.3e-08 ***
Residuals 12.00     37       3                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Error: Within
               Df Sum Sq Mean Sq F value  Pr(>F)    
DIST         1.75  153.8    51.3   28.38 1.1e-07 ***
COPPER:DIST  3.50   53.4     8.9    4.93   0.004 ** 
Residuals   36.00   65.0     1.8                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
Source of variationdfMean SqF-ratioP-value
COPPER
PLATE   
DIST
COPPER:DIST
Residuals   

Q3-10.Construct an interaction plot
showing the density of worms against distance from Cu source, with each treatment as different lines (or different bars). HINT
Show code
> interaction.plot(copper$DIST, copper$COPPER, copper$WORMS)

Q3-11. What conclusions would you draw from the analysis (and graph)?

Q3-12. In order to tease out the interaction, we could analyse the effects of the main factor (COPPER) separately for each Distance and/or investigate the effects of Distance separately for each of the three copper treatments. Recall that when performing such simple main effects, it is necessary to use the residual terms from the global analyses as these are likely to be better estimates (as they are based on a larger amount of data).
  1. Investigate the effects of copper separately for each of the four distances HINT
    Show code
    > AnovaM(mainEffects(copper.aov, at = DIST == 1), type = "III")
    
    Error: PLATE
              Df Sum Sq Mean Sq F value  Pr(>F)    
    INT        2    784     392     128 8.1e-09 ***
    Residuals 12     37       3                    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    Error: Within
              Df Sum Sq Mean Sq F value  Pr(>F)    
    INT        9    207   23.02    12.7 8.4e-09 ***
    Residuals 36     65    1.81                    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    > AnovaM(mainEffects(copper.aov, at = DIST == 2), type = "III")
    
    Error: PLATE
              Df Sum Sq Mean Sq F value  Pr(>F)    
    INT        2    784     392     128 8.1e-09 ***
    Residuals 12     37       3                    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    Error: Within
              Df Sum Sq Mean Sq F value  Pr(>F)    
    INT        9    207   23.02    12.7 8.4e-09 ***
    Residuals 36     65    1.81                    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    > AnovaM(mainEffects(copper.aov, at = DIST == 3), type = "III")
    
    Error: PLATE
              Df Sum Sq Mean Sq F value  Pr(>F)    
    INT        2    784     392     128 8.1e-09 ***
    Residuals 12     37       3                    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    Error: Within
              Df Sum Sq Mean Sq F value  Pr(>F)    
    INT        9    207   23.02    12.7 8.4e-09 ***
    Residuals 36     65    1.81                    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    

    What conclusions would you draw from these four analyses?

  2. Investigate the effects of distance separately for each of the three copper treatments HINT
    Show code
    > AnovaM(mainEffects(copper.aov, at = COPPER == "control"), type = "III")
    
    Error: PLATE
              Df Sum Sq Mean Sq F value  Pr(>F)    
    INT        2    784     392     128 8.1e-09 ***
    Residuals 12     37       3                    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    Error: Within
              Df Sum Sq Mean Sq F value  Pr(>F)    
    INT        6  188.6   31.43   17.40 2.5e-09 ***
    DIST       3   18.6    6.21    3.44   0.027 *  
    Residuals 36   65.0    1.81                    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    > AnovaM(mainEffects(copper.aov, at = COPPER == "Week 1"), type = "III")
    
    Error: PLATE
              Df Sum Sq Mean Sq F value  Pr(>F)    
    INT        2    784     392     128 8.1e-09 ***
    Residuals 12     37       3                    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    Error: Within
              Df Sum Sq Mean Sq F value  Pr(>F)    
    INT        6  188.1   31.34   17.35 2.6e-09 ***
    DIST       3   19.2    6.39    3.54   0.024 *  
    Residuals 36   65.0    1.81                    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    > AnovaM(mainEffects(copper.aov, at = COPPER == "Week 2"), type = "III")
    
    Error: PLATE
              Df Sum Sq Mean Sq F value  Pr(>F)    
    INT        2    784     392     128 8.1e-09 ***
    Residuals 12     37       3                    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    Error: Within
              Df Sum Sq Mean Sq F value Pr(>F)    
    INT        6   37.8     6.3    3.49 0.0081 ** 
    DIST       3  169.4    56.5   31.26  4e-10 ***
    Residuals 36   65.0     1.8                   
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    

    What conclusions would you draw from these four analyses?

Mixed models

Show code
> copper.lme <- lme(WORMS ~ COPPER * DIST, random = ~1 | PLATE, data = copper,
+     method = "ML")
> copper.lme1 <- lme(WORMS ~ COPPER * DIST, random = ~1 | PLATE, data = copper,
+     method = "ML", correlation = corCompSymm(form = ~1 | PLATE))
> copper.lme2 <- lme(WORMS ~ COPPER * DIST, random = ~1 | PLATE, data = copper,
+     method = "ML", correlation = corAR1(form = ~1 | PLATE))
> anova(copper.lme, copper.lme1)
            Model df   AIC   BIC logLik   Test   L.Ratio p-value
copper.lme      1 14 228.3 257.6 -100.1                         
copper.lme1     2 15 230.3 261.7 -100.1 1 vs 2 1.025e-10       1
> #Check whether a first order autoregressive correlation structure is more
> #   appropriate
> copper.lme2 <- update(copper.lme, correlation = corAR1(form = ~1 |
+     PLATE))
> anova(copper.lme, copper.lme2)
            Model df   AIC   BIC  logLik   Test L.Ratio p-value
copper.lme      1 14 228.3 257.6 -100.14                       
copper.lme2     2 15 222.9 254.3  -96.47 1 vs 2   7.352  0.0067
> # Now fit the 'best'' model with REML
> copper.lme3 <- update(copper.lme2, method = "REML")
> anova.lme(copper.lme3)
            numDF denDF F-value p-value
(Intercept)     1    36   980.6  <.0001
COPPER          2    12    93.1  <.0001
DIST            3    36    25.1  <.0001
COPPER:DIST     6    36     4.3  0.0025
Show code
> detach(package:nlme)
> copper.lmer <- lmer(WORMS ~ COPPER * DIST + (1 | PLATE), data = copper)
> #Examine the fixed effects coefficients
> summary(copper.lmer)@coefs
                   Estimate Std. Error   t value
(Intercept)           10.85     0.6512  16.66170
COPPERWeek 1          -3.60     0.9209  -3.90910
COPPERWeek 2         -10.60     0.9209 -11.51013
DIST2                  1.15     0.8501   1.35275
DIST3                  1.55     0.8501   1.82327
DIST4                  2.70     0.8501   3.17601
COPPERWeek 1:DIST2    -0.05     1.2023  -0.04159
COPPERWeek 2:DIST2     0.05     1.2023   0.04159
COPPERWeek 1:DIST3    -0.30     1.2023  -0.24953
COPPERWeek 2:DIST3     2.20     1.2023   1.82990
COPPERWeek 1:DIST4     0.05     1.2023   0.04159
COPPERWeek 2:DIST4     4.90     1.2023   4.07568
> #OR
> #library(gmodels)
> #ci(copper.lmer)
> library(languageR)
> pvals.fnc(copper.lmer, withMCMC = FALSE, addPlot = F)
$fixed
                   Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
(Intercept)           10.85  10.8445     9.5503     12.152 0.0001   0.0000
COPPERWeek 1          -3.60  -3.5913    -5.4902     -1.794 0.0006   0.0003
COPPERWeek 2         -10.60 -10.5975   -12.4019     -8.745 0.0001   0.0000
DIST2                  1.15   1.1658    -0.5512      2.997 0.1948   0.1825
DIST3                  1.55   1.5584    -0.3118      3.309 0.0950   0.0745
DIST4                  2.70   2.7097     0.8658      4.457 0.0048   0.0026
COPPERWeek 1:DIST2    -0.05  -0.0723    -2.5778      2.417 0.9516   0.9670
COPPERWeek 2:DIST2     0.05   0.0402    -2.5969      2.468 0.9698   0.9670
COPPERWeek 1:DIST3    -0.30  -0.3208    -2.8322      2.186 0.7888   0.8040
COPPERWeek 2:DIST3     2.20   2.1924    -0.3846      4.715 0.0902   0.0735
COPPERWeek 1:DIST4     0.05   0.0372    -2.5428      2.598 0.9712   0.9670
COPPERWeek 2:DIST4     4.90   4.8815     2.2577      7.404 0.0002   0.0002

$random
    Groups        Name Std.Dev. MCMCmedian MCMCmean HPD95lower HPD95upper
1    PLATE (Intercept)   0.5599     0.2525   0.2804      0.000     0.7233
2 Residual               1.3442     1.4191   1.4341      1.149     1.7584

Broad interpretation: In general, compared to the number of worms settling on the control plates, the number of worms settling was less in the Week 1 treatment and even lower in the Week 2 treatment. The increases in number of settled worms in distances progressively further from the source (Dist 1) get progressively larger and the trend between DIST1 (base level) and DIST 4 is significantly greater for the Week 2 treatment compared to the control copper treatment.

Examine the model fit measures
Show code
> summary(copper.lmer)@AICtab
   AIC   BIC logLik deviance REMLdev
 218.3 247.6 -95.13    200.3   190.3
Examine the variance components
Show code
> data.frame(summary(copper.lmer)@REmat)
    Groups        Name Variance Std.Dev.
1    PLATE (Intercept)    0.314     0.56
2 Residual                1.807     1.34
Examine the predicted means and confidence intervals
Show code
> newdata <- expand.grid(COPPER = levels(copper$COPPER), DIST = levels(copper$DIST),
+     WORMS = 0)
> mod.mat <- model.matrix(terms(copper.lmer), newdata)
> newdata$WORMS = mod.mat %*% fixef(copper.lmer)
Error: no applicable method for 'fixef' applied to an object of class "mer"
> pvar1 <- diag(mod.mat %*% tcrossprod(vcov(copper.lmer), mod.mat))
> tvar1 <- pvar1 + VarCorr(copper.lmer)$PLATE[1]
Error: no applicable method for 'VarCorr' applied to an object of class "mer"
> newdata <- data.frame(newdata, plo = newdata$WORMS - 2 * sqrt(pvar1),
+     phi = newdata$WORMS + 2 * sqrt(pvar1), tlo = newdata$WORMS - 2 * sqrt(tvar1),
+     thi = newdata$WORMS + 2 * sqrt(tvar1))
Error: object 'tvar1' not found
> par(mar = c(5, 5, 1, 1))
> plot(WORMS ~ as.numeric(DIST), newdata, type = "n", axes = F, ann = F,
+     ylim = range(newdata$plo, newdata$phi), xlim = c(0.5, 4.5))
Error: need finite 'ylim' values
> ##Control
> with(subset(newdata, COPPER == "control"), arrows(as.numeric(DIST),
+     plo, as.numeric(DIST), phi, ang = 90, length = 0.05, code = 3))
Error: object 'plo' not found
> with(subset(newdata, COPPER == "control"), lines(as.numeric(DIST),
+     WORMS, lty = 1))
> with(subset(newdata, COPPER == "control"), points(as.numeric(DIST),
+     WORMS, pch = 21, bg = "black", col = "black"))
> ##Week 1
> with(subset(newdata, COPPER == "Week 1"), arrows(as.numeric(DIST),
+     plo, as.numeric(DIST), phi, ang = 90, length = 0.05, code = 3))
Error: object 'plo' not found
> with(subset(newdata, COPPER == "Week 1"), lines(as.numeric(DIST),
+     WORMS, lty = 2))
> with(subset(newdata, COPPER == "Week 1"), points(as.numeric(DIST),
+     WORMS, pch = 21, bg = "grey", col = "black"))
> ##Week 2
> with(subset(newdata, COPPER == "Week 2"), arrows(as.numeric(DIST),
+     plo, as.numeric(DIST), phi, ang = 90, length = 0.05, code = 3))
Error: object 'plo' not found
> with(subset(newdata, COPPER == "Week 2"), lines(as.numeric(DIST),
+     WORMS, lty = 3))
> with(subset(newdata, COPPER == "Week 2"), points(as.numeric(DIST),
+     WORMS, pch = 21, bg = "white", col = "black"))
> 
> axis(1, at = as.numeric(newdata$DIST), lab = newdata$DIST)
> axis(2, las = 1)
> mtext("Number of settled worms", 2, line = 3, cex = 1.25)
> mtext("Distance from Copper", 1, line = 3, cex = 1.25)
> legend("bottomright", legend = c("Control", "Week 1", "Week 2"),
+     pch = c(21, 21, 21), pt.bg = c("black", "grey", "white"), lty = c(1, 2,
+         3), bty = "n", title = "Treatment")
> box(bty = "l")
Show code
> library(AICcmodavg)
> data.frame(newdata, predictSE.mer(copper.lmer, newdata, print.matrix = T))
Error: no applicable method for 'fixef' applied to an object of class "mer"
Now modeled against a Poisson distribution with polynomial contrasts for the distance variable.
Show code
> contrasts(copper$DIST) <- contr.poly
> copper.lmer <- lmer(WORMS ~ COPPER * DIST + (1 | PLATE), data = copper,
+     family = "poisson")
Examine the fixed effects coefficients
Show code
> summary(copper.lmer)@coefs
                     Estimate Std. Error  z value  Pr(>|z|)
(Intercept)          2.498289    0.06422 38.90153 0.000e+00
COPPERWeek 1        -0.361811    0.10033 -3.60622 3.107e-04
COPPERWeek 2        -1.890271    0.25974 -7.27768 3.396e-13
DIST.L               0.156402    0.12875  1.21476 2.245e-01
DIST.Q              -0.006025    0.12844 -0.04691 9.626e-01
DIST.C               0.027693    0.12813  0.21613 8.289e-01
COPPERWeek 1:DIST.L  0.063304    0.20091  0.31509 7.527e-01
COPPERWeek 2:DIST.L  2.382689    0.63043  3.77948 1.572e-04
COPPERWeek 1:DIST.Q  0.016657    0.20066  0.08301 9.338e-01
COPPERWeek 2:DIST.Q -0.535793    0.51947 -1.03142 3.023e-01
COPPERWeek 1:DIST.C  0.032273    0.20041  0.16103 8.721e-01
COPPERWeek 2:DIST.C  0.062334    0.37717  0.16527 8.687e-01
> #OR
> #library(gmodels)
> #ci(copper.lmer)
Interpretation: Compared to the number of worms settling on control plates, the number of settling worms is less in Week 1 and even lower in the Week 2 copper treatment. Settlement increases with increasing distance from the copper source and this rate of increase is over twice the rate (significantly greater) in the Week 2 treatment.
Show code - ezAnova
> library(ez)
> ezANOVA(dv = .(WORMS), between = .(COPPER), within = .(DIST), wid = .(PLATE),
+     data = copper, type = 3)
$ANOVA
       Effect DFn DFd       F         p p<.05    ges
2      COPPER   2  12 128.021 8.051e-09     * 0.8851
3        DIST   3  36  28.375 1.356e-09     * 0.6018
4 COPPER:DIST   6  36   4.928 9.028e-04     * 0.3442

$`Mauchly's Test for Sphericity`
       Effect      W        p p<.05
3        DIST 0.2311 0.008005     *
4 COPPER:DIST 0.2311 0.008005     *

$`Sphericity Corrections`
       Effect    GGe     p[GG] p[GG]<.05    HFe     p[HF] p[HF]<.05
3        DIST 0.5203 6.356e-06         * 0.5841 2.048e-06         *
4 COPPER:DIST 0.5203 1.022e-02         * 0.5841 7.346e-03         *

Repeated Measures

Repeated Measures

In an honours thesis from (1992), Mullens was investigating the ways that cane toads ( Bufo marinus ) respond to conditions of hypoxia. Toads show two different kinds of breathing patterns, lung or buccal, requiring them to be treated separately in the experiment. Her aim was to expose toads to a range of O2 concentrations, and record their breathing patterns, including parameters such as the expired volume for individual breaths. It was desirable to have around 8 replicates to compare the responses of the two breathing types, and the complication is that animals are expensive, and different individuals are likely to have different O2 profiles (leading to possibly reduced power). There are two main design options for this experiment;

  • One animal per O2 treatment, 8 concentrations, 2 breathing types. With 8 replicates the experiment would require 128 animals, but that this could be analysed as a completely randomized design
  • One O2 profile per animal, so that each animal would be used 8 times and only 16 animals are required (8 lung and 8 buccal breathers)
Mullens decided to use the second option so as to reduce the number of animals required (on financial and ethical grounds). By selecting this option, she did not have a set of independent measurements for each oxygen concentration, by repeated measurements on each animal across the 8 oxygen concentrations.

Download Mullens data set

Format of mullens.csv data file
BREATHTOADO2LEVELFREQBUCSFREQBUC
lunga010.63.256
lunga518.84.336
lunga1017.44.171
lunga1516.64.074
...............

BREATHCategorical listing of the breathing type treatment (buccal = buccal breathing toads, lung = lung breathing toads). This is the between subjects (plots) effect and applies to the whole toads (since a single toad can only be one breathing type - either lung or buccal). Equivalent to Factor A (between plots effect) in a split-plot design
TOADThese are the subjects (equivalent to the plots in a split-plot design: Factor B). The letters in this variable represent the labels given to each individual toad.
O2LEVEL0 through to 50 represent the the different oxygen concentrations (0% to 50%). The different oxygen concentrations are equivalent to the within plot effects in a split-plot (Factor C).
FREQBUCThe frequency of buccal breathing - the response variable
SFREQBUCSquare root transformed frequency of buccal breathing - the response variable
Saltmarsh

Open the mullens data file. HINT.
Show code
> mullens <- read.table("../downloads/data/mullens.csv", header = T,
+     sep = ",", strip.white = T)
> head(mullens)
  BREATH TOAD O2LEVEL FREQBUC SFREQBUC
1   lung    a       0    10.6    3.256
2   lung    a       5    18.8    4.336
3   lung    a      10    17.4    4.171
4   lung    a      15    16.6    4.074
5   lung    a      20     9.4    3.066
6   lung    a      30    11.4    3.376
Notice that both the O2LEVEL variable contains only numbers. Make sure that you define both of this as a factors (HINT)

Show code
> mullens$O2LEVEL <- factor(mullens$O2LEVEL)
Q4-1. What are the main hypotheses being tested?
  1. H0 Main Effect 1 (Factor A):
  2. H0 Main Effect 2 (Factor C):
  3. H0 Main Effect 3 (A*C):

Q4-2. We will now address all the assumptions.
Construct a boxplot to investigate the assumptions as they apply to the test of H0 Main Effect 1 (Factor A): This is done in two steps

  1. Construct a boxplot to investigate the assumptions as they apply to the test of H0 Main Effect 1 (Factor A): Start by aggregating the data set by TOAD
    (since the toads are the replicates for the between subject effect - BREATH) (HINT). Then Construct a boxplot of aggregated mean FREQBUC against BREATH treatment
    (HINT). Any evidence of violations of the assumptions (y or n)?
    Show code
    > library(plyr)
    > mullens.ag <- ddply(mullens, ~BREATH + TOAD, function(df) data.frame(FREQBUC = mean(df$FREQBUC)))
    > boxplot(FREQBUC ~ BREATH, mullens.ag)
    

    Try a square-root transformation!
  2. Show code
    > library(plyr)
    > mullens.ag <- ddply(mullens, ~BREATH + TOAD, function(df) data.frame(SFREQBUC = mean(sqrt(df$FREQBUC))))
    > boxplot(SFREQBUC ~ BREATH, mullens.ag)
    
  3. Construct a boxplot to investigate the assumptions as they apply to the test of H0 Main Effect 2 (Factor C): Since Factor C is tested against the overall residual in this case, this is a relatively straight forward procedure.(HINT). Any evidence of violations of the assumptions (y or n)?
  4. Show code
    > boxplot(SFREQBUC ~ O2LEVEL, mullens)
    
  5. Construct a boxplot to investigate the assumptions as they apply to the test of H0 the main interaction effect (A:C): Since A:C is tested against the overal residual, this is a relatively straight forward procedure.(HINT). Any evidence of violations of the assumptions (y or n)?
  6. Show code
    > boxplot(SFREQBUC ~ BREATH * O2LEVEL, mullens)
    
  7. In addition to the above assumptions, the test of TOAD assumes that there is no TOAD by O2LEVEL interaction as this is the overal residual (the replicates). That is, the test assumes that the effect of O2LEVEL is consistent in all TOADS. Construct an interaction plot to examine whether there is any evidence of an interaction between TOAD and O2LEVEL (HINT). Any evidence of an interaction (y or n)?
  8. Show code
    > with(mullens, interaction.plot(TOAD, O2LEVEL, SFREQBUC))
    
    > library(alr3)
    > tukey.nonadd.test(lm(SFREQBUC ~ TOAD + O2LEVEL, data = mullens))
    
        Test   Pvalue 
    3.079699 0.002072 
    
  9. You must also check to see whether the proposed model balanced.
    Show code
    > !is.list(replications(SFREQBUC ~ BREATH + Error(TOAD) + O2LEVEL +
    +     BREATH:O2LEVEL, data = mullens))
    
    [1] FALSE
    
    Is it? (Yor N)
  10. Finally, since the design is a repeated measures, there is a good chance that the assumption of sphericity
    has been violated balanced.
    Show code
    > library(biology)
    > epsi.GG.HF(aov(SFREQBUC ~ BREATH + O2LEVEL + BREATH:O2LEVEL + Error(TOAD),
    +     data = mullens))
    
    $GG.eps
    [1] 0.4282
    
    $HF.eps
    [1] 0.5173
    
    $sigma
    [1] 0.7531
    
    
    Has it (HINT)? (Y or N)

Q4-3. Assume that the assumption of compound symmetry/sphericity will be violated and perform a split-plot ANOVA (repeated measures)
(HINT), and complete the following table with corrected p-values (HINT).
Show code
> mullens.aov <- aov(SFREQBUC ~ BREATH * O2LEVEL + Error(TOAD), data = mullens)
> library(biology)
> AnovaM(mullens.aov, RM = T)
   Sphericity Epsilon Values 
-------------------------------
 Greenhouse.Geisser Huynh.Feldt
             0.4282      0.5173

Anova Table (Type I tests)
Response: SFREQBUC
Error: TOAD
          Df Sum Sq Mean Sq F value Pr(>F)  
BREATH     1   39.9    39.9    5.76  0.027 *
Residuals 19  131.6     6.9                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Error: Within
                Df Sum Sq Mean Sq F value  Pr(>F)    
O2LEVEL          7   39.0    5.57    7.39 1.7e-07 ***
BREATH:O2LEVEL   7   56.4    8.05   10.69 1.2e-10 ***
Residuals      133  100.2    0.75                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


Greenhouse-Geisser corrected ANOVA table
Response: SFREQBUC
Error: TOAD
              Df Sum Sq Mean Sq F value Pr(>F)  
BREATH     0.428   39.9    39.9    5.76  0.048 *
Residuals 19.000  131.6     6.9                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Error: Within
                Df Sum Sq Mean Sq F value  Pr(>F)    
O2LEVEL          3   39.0    5.57    7.39 0.00013 ***
BREATH:O2LEVEL   3   56.4    8.05   10.69 2.4e-06 ***
Residuals      133  100.2    0.75                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 


Huynh-Feldt corrected ANOVA table
Response: SFREQBUC
Error: TOAD
              Df Sum Sq Mean Sq F value Pr(>F)  
BREATH     0.517   39.9    39.9    5.76  0.044 *
Residuals 19.000  131.6     6.9                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Error: Within
                   Df Sum Sq Mean Sq F value  Pr(>F)    
O2LEVEL          3.62   39.0    5.57    7.39 4.1e-05 ***
BREATH:O2LEVEL   3.62   56.4    8.05   10.69 4.2e-07 ***
Residuals      133.00  100.2    0.75                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
Source of variationdfMean SqF-ratioP-valueGG.PHF.P
BREATH   
TOAD     
O2LEVEL
BREATH:O2LEVEL
Residuals     

Q4-4. Construct an interaction plot
showing the frequency of buccal breathing against oxygen level, with each breathing type as different lines (or different bars). HINT
Show code
> with(mullens, interaction.plot(O2LEVEL, BREATH, FREQBUC))

Q4-5. What conclusions would you draw from the analysis (and graph)?

Mixed models

Show code
> mullens.lmer <- lmer(SFREQBUC ~ BREATH * O2LEVEL + (1 | TOAD), data = mullens)
Examine the fixed effects coefficients
Show code
> summary(mullens.lmer)@coefs
                     Estimate Std. Error t value
(Intercept)            4.9410     0.3425 14.4264
BREATHlung            -3.3914     0.5549 -6.1117
O2LEVEL5              -0.2174     0.3404 -0.6386
O2LEVEL10             -0.5476     0.3404 -1.6087
O2LEVEL15             -0.8687     0.3404 -2.5521
O2LEVEL20             -1.5444     0.3404 -4.5371
O2LEVEL30             -1.4644     0.3404 -4.3021
O2LEVEL40             -2.2457     0.3404 -6.5973
O2LEVEL50             -2.4600     0.3404 -7.2271
BREATHlung:O2LEVEL5    1.0362     0.5515  1.8789
BREATHlung:O2LEVEL10   2.3327     0.5515  4.2298
BREATHlung:O2LEVEL15   2.3729     0.5515  4.3027
BREATHlung:O2LEVEL20   3.2985     0.5515  5.9809
BREATHlung:O2LEVEL30   2.9767     0.5515  5.3975
BREATHlung:O2LEVEL40   3.6168     0.5515  6.5582
BREATHlung:O2LEVEL50   3.4669     0.5515  6.2864
> #library(gmodels)
> #ci(mullens.lmer)
> library(languageR)
> pvals.fnc(mullens.lmer, withMCMC = FALSE, addPlot = F)
$fixed
                     Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
(Intercept)            4.9410   4.9453     4.3055     5.5567 0.0001   0.0000
BREATHlung            -3.3914  -3.3902    -4.3822    -2.3809 0.0001   0.0000
O2LEVEL5              -0.2174  -0.2215    -0.9507     0.4798 0.5432   0.5240
O2LEVEL10             -0.5476  -0.5560    -1.3045     0.1490 0.1288   0.1097
O2LEVEL15             -0.8687  -0.8698    -1.5810    -0.1290 0.0160   0.0117
O2LEVEL20             -1.5444  -1.5478    -2.2561    -0.8058 0.0001   0.0000
O2LEVEL30             -1.4644  -1.4645    -2.1969    -0.7506 0.0002   0.0000
O2LEVEL40             -2.2457  -2.2462    -3.0119    -1.5349 0.0001   0.0000
O2LEVEL50             -2.4600  -2.4671    -3.1994    -1.7526 0.0001   0.0000
BREATHlung:O2LEVEL5    1.0362   1.0344    -0.1435     2.1923 0.0814   0.0622
BREATHlung:O2LEVEL10   2.3327   2.3378     1.1807     3.5274 0.0002   0.0000
BREATHlung:O2LEVEL15   2.3729   2.3715     1.1653     3.5353 0.0001   0.0000
BREATHlung:O2LEVEL20   3.2985   3.2962     2.1323     4.4427 0.0001   0.0000
BREATHlung:O2LEVEL30   2.9767   2.9748     1.8024     4.1244 0.0001   0.0000
BREATHlung:O2LEVEL40   3.6168   3.6087     2.4778     4.7871 0.0001   0.0000
BREATHlung:O2LEVEL50   3.4669   3.4722     2.2994     4.6385 0.0001   0.0000

$random
    Groups        Name Std.Dev. MCMCmedian MCMCmean HPD95lower HPD95upper
1     TOAD (Intercept)   0.8786     0.5998   0.6079     0.4390     0.7956
2 Residual               0.8678     0.9309   0.9342     0.8194     1.0525


  Reading data into R

Ensure that the working directory is pointing to the path containing the file to be imported before proceeding.
To import data into R, we read the contents of a file into a data frame. The general format of the command for reading data into a data frame is

> name <- read.table('filename.csv', header=T, sep=',', row.names=column, strip.white=T)

where name is a name you wish the data frame to be referred to as, filename.csv is the name of the csv file that you created in excel and column is the number of the column that had row labels (if there were any). The argument header=T indicates that the variable (vector) names will be created from the names supplied in the first row of the file. The argument sep=',' indicates that entries in the file are separated by a comma (hence a comma delimited file). If the data file does not contain row labels, or you are not sure whether it does, it is best to omit the row.names=column argument. The strip.white=T arguement ensures that no leading or trailing spaces are left in character names (these can be particularly nasty in categorical variables!).

As an example
> phasmid <- read.data("phasmid.csv", header = T, sep = ",", row.names = 1,
+     strip.white = T)

End of instructions

  Nested ANOVA

A nested ANOVA is a simple extension of the single factor ANOVA. An additional factor (Factor B) is added that is nested within the main factor of interest (Factor A). Nested factors are incorporated in situations where researches envisage potentially large amounts of variation between different sets of replicates within the main Factor.
For example, a limnologist investigating the effect of flow (Factor A: high, medium and low) on macroinvertebate density, might collect samples from a number of high, medium and low flow streams. However, the limnologist might expect that many other factors (perhaps more important than stream flow) also influence the density of macroinvertebrates and that even samples collected in close proximity to one another might be expected to vary considerably in density of macroinvertebrates. Such variability increases the unexplained variability and therefore has the potential to make it very difficult to pick up the effect of the actual factor of interest (in this case flow).
In an attempt to combat this, the researcher decides to collect multiple samples from within each stream. If some of the unexplained variability can be explained by variation between samples within streams, then the test of flow will be more sensitive.
This is called a nested ANOVA, an additional factor (Factor B: stream) has been introduce within the main factor (Factor A: flow). Factor A is a fixed factor as we are only interested in the specific levels chosen (e.g. high, medium and low flow). Factor B however, is a random as we are not just interested in the specific streams that were sampled, we are interested in all the possible streams that they represent. In a nested ANOVA the categories of the Factor B nested within each level of Factor A are different. E.g. the streams used for each flow type are different. Therefore it is not possible to test for an interaction between Flow type and stream.

There are two null hypotheses being tested in a two factor nested ANOVA. H0 main effect: no effect of Factor A (flow) H0: = &mu1 = &mu2 = ... = &mui = &mu (as for single factor ANOVA). H0 nesting effect: no effect of Factor B (stream) nested within Factor A (flow). Technically it is that there is no added variance due to differences between all the possible levels of Factor B with any level of Factor A. E.g., that there is no added variance due to differences between all possible high flow streams, or between all possible medium flow streams or between all possible low flow streams. Diagram shows layout of nested design with three treatments (different pattern fills).

End of instructions

  Convert numeric variable into a factor

> data$var <- as.factor(data$var)

where data is the name of the data set and var is the name of the variable to be made into a factor. While there is no visible evidence that the data are any different, R will now treat the variable as a factor rather than as a numeric variable.

End of instructions

  Nested ANOVA assumptions

In ANOVA, the assumptions of normality and variance homogeneity apply to the residuals for the test of interest. In this example, the nested factor (Factor B, SITE) is random. Therefore, the normality and heterogeneity of variance assumptions for Factor A (SEASON) must be based on the distribution, mean and variance of the response variable (S4DUGES) for each level of Factor A using the means of Factor B within each Factor A as the observations (since the error term for Factor A (SEASON) is the variation in the response variable between categories of Factor B within each level of Factor A (MSSITE). Calculate the mean from each level of Factor B (SITE), and use these means to calculate the mean, variance and distribution of each level of Factor A (SEASON).

# aggregate the data set by calculating the means for each level of the nesting factor
> name.ag <- aggregate(dataset, list(Name1=FactorA, Name2=FactorB), mean)
#OR perhaps more efficiently
library(lme4)
name.ag <- gsummary(dataset, groups=FactorB)
# use the aggregated means to produce the boxplots
> boxplot(dv~FactorA, name.ag)

where name.ag is a name given to the new aggregated data set, dataset is the name of the original data set, Name1 and Name2 are names to appear in the aggregated data set that represent FactorA (Main within factor) and FactorB (nesting factor) respectively. Finally, dv is the name of the dependent variable.

This data set generates two boxplots, each constructed from only 3 values and thus of questionable value.

End of instructions

  Nested ANOVA assumptions

# fit the heirachical linear model
> name.aov <- aov(dv~FactorA + Error(FactorB), data=data)
# print out the ANOVA strata
> summary(name.aov)

where name.aov is a name provided to store the fitted model, dv is the dependent variable, FactorA is the main fixed factor and FactorB is the random nested factor. data is the name of the data set. The 'Error()' argument defines additional error (residual) strata, hence the above template indicates that Factor should be tested against FactorB.
For example:

# fit the heirachical linear model
> curdies.aov <- aov(S4DUGES~SEASON + Error(SITE), data=curdies)
# print out the ANOVA strata
> summary(curdies.aov)

In recognition that Factor is a random nested factor, and thus of little interest, the F-ratio is not provided. Since the effect of FactorB is tested against the overal residual (error) term, to obtain the F-ratio for the test of FactorB, fit the model without defining the an additional error strata. Note that when you do this, all terms are tested against the residual term and therefore, any fixed factors above the random factors (in the hierarchy) will be incorrectly calculated.

> curdies.aov1 <- aov(S4DUGES~SEASON + SITE, data=curdies)
> summary(curdies.aov)

End of instructions

  Two factor ANOVA

Statistical models that incorporate more than one categorical predictor variable are broadly referred to as multivariate analysis of variance. There are two main reasons for the inclusion of multiple factors:

  1. To attempt to reduce the unexplained (or residual) variation in the response variable and therefore increase the sensitivity of the main effect test.
  2. To examine the interaction between factors. That is, whether the effect of one factor on the response variable is dependent on other factor(s) (consistent across all levels of other factor(s)).

Fully factorial linear models are used when the design incorporates two or more factors (independent, categorical variables) that are crossed with each other. That is, all combinations of the factors are included in the design and every level (group) of each factor occurs in combination with every level of the other factor(s). Furthermore, each combination is replicated.

In fully factorial designs both factors are equally important (potentially), and since all factors are crossed, we can also test whether there is an interaction between the factors (does the effect of one factor depend on the level of the other factor(s)). Graphs above depict a) interaction, b) no interaction.


For example, Quinn (1988) investigated the effects of season (two levels, winter/spring and summer/autumn) and adult density (four levels, 8, 15, 30 and 45 animals per 225cm2) on the number of limpet egg masses. As the design was fully crossed (all four densities were used in each season), he was also able to test for an interaction between season and density. That is, was the effect of density consistent across both seasons and, was the effect of season consistent across all densities.

Diagram shows layout of 2 factor fully crossed design. The two factors (each with two levels) are color (black or gray) and pattern (solid or striped). There are three experimental units (replicates) per treatment combination.


End of instructions

  Two-factor ANOVA

> name.aov <- aov(dv~factor1 * factor2, data=data)

where dv is the name of the numeric vector (dependent variable), factor1 and factor2 are the names of the factors (categorical or factor variables) and data is the name of the data frame (data set).

End of instructions

  Fully factorial ANOVA assumption checking

The assumptions of normality and homogeneity of variance apply to each of the factor level combinations, since it is the replicate observations of these that are the test residuals. If the design has two factors (IV1 and IV2) with three and four levels (groups) respectively within these factors, then a boxplot of each of the 3x4=12 combinations needs to be constructed. It is recommended that a variable (called GROUP) be setup to represent the combination of the two categorical (factor) variables.

Simply construct a boxplot with the dependent variable on the y-axis and GROUP on the x-axis. Visually assess whether the data from each group is normal (or at least that the groups are not all consistently skewed in the same direction), and whether the spread of data is each group is similar (or at least not related to the mean for that group). The GROUP variable can also assist in calculating the mean and variance for each group, to further assess variance homogeneity.

End of instructions

  Factorial boxplots

> boxplot(dv~factor1*factor2,data=data)

where dv is the name of the numeric vector (dependent variable), factor1 and factor2 are the names of the factors (categorical or factor variables) and data is the name of the data frame (data set). The '~' is used in formulas to represent that the left hand side is proportional to the right hand side.

End of instructions

  Plotting mean vs variance

# first calculate the means and variances
> means <- tapply(data$dv, list(data$factor1, data$factor2), mean)
> vars <- tapply(data$dv, list(data$factor1, data$factor2), var)
# then plot the mean against variance
> plot(means,vars)

where dv is the name of the numeric vector (dependent variable), factor1 and factor2 are the names of the factors (categorical or factor variables) and data is the name of the data frame (data set).

End of instructions

  Model balance

Balanced models are those models that have equal sample sizes for each level of a treatment. Unequal sample sizes are generally problematic for statistical analyses as they can be the source of unequal variances. They are additionally problematic for multifactor designs because of the way that total variability is partitioned into the contributions of each of the main effects as well as the contributions of interactions (ie the calculations of sums of squares and degrees of freedom). Sums of squares (and degrees of freedom) are usually (type I SS) partitioned additively. That is:
SSTotal=SSA+SSB+SSAB+SSResidual, and
dfTotal=dfA+dfB+dfAB+dfResidual
. In practice, these formulas are used in reverse. For example, if the model was entered as DV~A+B+AB, first SSResidual is calculated followed by SSAB, then SSB. SSA is thus calculated as the total sums of squares minus what has already been partitioned off to other terms (SSA=SSTotal-SSB+SSAB+SSResidual). If the model is balanced, then it doesn't matter what order the model is constructed (DV~B+A+AB will be equivalent).

However, in unbalanced designs, SS cannot be partitioned additively, that is the actual SS of each of the terms does not add up to SSTotal (SSTotal=SSA+SSB+SSAB+SSResidual). Of course, when SS is partitioned it all must add up. Consequently, the SS of the last term to be calculated will not be correct. Therefore, in unbalanced designs, SS values for the terms are different (and therefore so are the F-ratios etc) depending on the order in which the terms are added - an undesirable situation.

To account for this, there are alternative methods of calculating SS for the main effects (all provide same SS for interaction terms and residuals).

It turns out that it is relatively easy to determine whether or not the model is balanced or not - use the following syntax:

> !is.list(replications(formula,data))

where formula is the model formula (for example, MASS~SITUATION*MONTH) and data is the name of data set.
If this function returns a TRUE, your model is balanced. If it returns a FALSE then your model is not balanced and you should consider using Type II or III SS methods.

End of instructions

  Linear regression diagnostics

> plot(name.lm)

where name.lm is the name of a fitted linear model. You will be prompted to hit return to cycle through a series of four diagnostic plots. The first (and most important of these) is the Residual Plot.

End of instructions

  ANOVA table

#print a summary of the ANOVA table
> anova(name.aov)

where name.aov is the name of a fitted linear model.

End of instructions

  Interaction plot

Interaction plots display the degree of consistency (or lack of) of the effect of one factor across each level of another factor. Interaction plots can be either bar or line graphs, however line graphs are more effective. The x-axis represents the levels of one factor, and a separate line in drawn for each level of the other factor. The following interaction plots represent two factors, A (with levels A1, A2, A3) and B (with levels B1, B2).


The parallel lines of first plot clearly indicate no interaction. The effect of factor A is consistent for both levels of factor B and visa versa. The middle plot demonstrates a moderate interaction and bottom plot illustrates a severe interaction. Clearly, whether or not there is an effect of factor B (e.g. B1 > B2 or B2 > B1) depends on the level of factor A. The trend is not consistent.

End of instructions

  R Interaction plot

> interaction.plot(dataset$Factor1, dataset$Factor2, dataset$dv)

where Factor1 is the name of the categical variable (factor) with the most levels (this will be used for the x-axis), Factor2 is the name of the other factorial variable (this will be used to produce the different lines) and dv is the name of the dependent variable within the dataset data set
For example:

> interaction.plot(copper$DIST, copper$COPPER, dataset$WORMS)

End of instructions

  Specific comparisons of means

Following a significant ANOVA result, it is often desirable to specifically compare group means to determine which groups are significantly different. However, multiple comparisons lead to two statistical problems. Firstly, multiple significance tests increase the Type I errors (&alpha, the probability of falsely rejecting H0). E.g., testing for differences between 5 groups requires ten pairwise comparisons. If the &alpha for each test is 0.05 (5%), then the probability of at least one Type I error for the family of 10 tests is 0.4 (40%). Secondly, the outcome of each test needs to be independent (orthogonality). E.g. if A>B and B>C then we already know the result of A vs. C.

Post-hoc unplanned pairwise comparisons (e.g. Tukey's test) compare all possible pairs of group means and are useful in an exploratory fashion to reveal major differences. Tukey's test control the family-wise Type I error rate to no more that 0.05. However, this reduces the power of each of the pairwise comparisons, and only very large differences are detected (a consequence that exacerbates with an increasing number of groups).

Planned comparisons are specific comparisons that are usually planned during the design stage of the experiment. No more than (p-1, where p is the number of groups) comparisons can be made, however, each comparison (provided it is non-orthogonal) can be tested at &alpha = 0.05. Amongst all possible pairwise comparisons, specific comparisons are selected, while other meaningless (within the biological context of the investigation) are ignored.

Planned comparisons are defined using what are known as contrasts coefficients. These coefficients are a set of numbers that indicate which groups are to be compared, and what the relative contribution of each group is to the comparison. For example, if a factor has four groups (A, B, C and D) and you wish to specifically compare groups A and B, the contrast coefficients for this comparison would be 1, -1, 0,and 0 respectively. This indicates that groups A and B are to be compared (since their signs oppose) and that groups C and D are omitted from the specific comparison (their values are 0).

It is also possible to compare the average of two or more groups with the average of other groups. For example, to compare the average of groups A and B with the average of group C, the contrast coefficients would be 1, 1, -2, and 0 respectively. Note that 0.5, 0.5, 1 and 0 would be equivalent.

The following points relate to the specification of contrast coefficients:

  1. The sum of the coefficients should equal 0
  2. Groups to be compared should have opposing signs

End of instructions

  Tukey's Test

# load the 'multcomp' package
> library(multcomp)
# perform Tukey's test
> summary(simtest(dv~factor1*factor2, whichf='factor', data=data, type="Tukey"))
# OR the more up to date method
> summary(glht(data.aov, linfct=mcp(factor="Tukey")))

where dv is the name of the numeric vector (dependent variable), factor1 and factor2 are the names of the factors (categorical or factor variables), and data is the name of the data frame (data set). The argument tyoe="Tukey" indicates that a Tukeys p-value adjustments should be made. When the linear model contains multiple categorical predictor variables, the arguement whichf='' must be used to indicate which factor the Tukey's test should be performed on.

The coefficients list the specific comparisons made. Each line represents a specific hypothesis test including difference between groups, t-value, raw p value and Tukey's adjusted p value (the other values presented are not important for BIO3011).

End of instructions

  Simple Main Effects

> library(biology)
> quinn.aov1 <- aov(SQRTRECRUITS~DENSITY, data=quinn, subset=c(SEASON=='Summer'))
> AnovaM(lm(quinn.aov1), lm(quinn.aov),type="II")

Subset, type in the name of the other categorical variable followed by == (two equals signs) and the name of one of the levels (surrounded by single quotation marks) of this factor. For example, if you had a categorical variable called TEMPERATURE and it consisted of three levels (High, Medium and Low), to perform a anova using only the High temperature data, the subset expression would be TEMPERATURE=='High'.

End of instructions

  Randomized complete block (RCB) designs

These are designs where one factor is a blocking variable (Factor B) and the other variable (Factor A) is the main treatment factor of interest. These designs are used in situations where the environment is potentially heterogeneous enough to substantially increase the variability in the response variable and thus obscure the effects of the main factor. Experimental units are grouped into blocks (units of space or time that are likely to have less variable background conditions) and then each level of the treatment factor is applied to a single unit within each block. By grouping the experimental units together into blocks, some of the total variability in the response variable that is not explained by the main factor may be explained by the blocks. If so, the residual (unexplained) variation will be reduced resulting in more precise estimates of parameters and more powerful tests of treatments.

A plant ecologist was investigating the effects of leaf damage on plant chemical defenses. Since individual trees vary greatly in the degree of chemical defense, she applied each of three artificial leaf damage treatments (heavy, medium and light damage) to each tree. She applied one of three treatments to three branches within eight randomly selected trees. Trees were the blocks. The diagram shows the layout of a blocking design. The main factor has three treatments (pattern fills), each of which are applied randomly within 3 randomly selected blocks. The decision to include a blocking factor (rather than single factor, completely random) depends on whether the experimental units (replicates) within a block are likely to be more similar to one another than those between blocks and whether the reduction in MSResidual is likely to compensate for the for the loss of residual df.

End of instructions

  Randomized blocks

> name.aov <- aov(dv~factorA + Error(factorB), data=data)

where dv is the name of the numeric vector (dependent variable), factorA and factorB represent the main factor of interest and the blocking factor respectively.

End of instructions

  Split-plot designs

Split-plot designs can be thought of as a combination of a randomized complete block (RCB) and nested designs. In their simplest form, they represent a RCB design, with a second factor applied to the whole blocks. One factor (Factor C) is applied to experimental units (usually randomly positioned) within each block (now referred to as plots, or Factor B). These factors are the Within-plot factors. A second factor (Between-plots factor, or Factor A) is then applied to the whole plots, and these plots are replicated. Hence a nested design applies between Factor A and the plots, and a blocking design applies between Factor C and the plots. Factor A is also crossed with Factor C, and therefore there is a test of the interaction between the two main factors.


Note that if there are no replicate units of Factor C within each plot (that is if each level of Factor C is applied only once to each plot), then the interaction of Factor C by Plots within Factor A represents the overall residual. However, since the plots are usually random factors, the effects of Factor A must be tested against the Plots within Factor A term.
If there there are replicate units of Factor C within each plot, then the Factor A by Plots within Factor A by Factor C is the overal residual.

SourceA,C (Fix); B (Rand)
AMSA/MSB
BMSB/MSResidual
CMSC/MSB:C
A:CMSA:C/MSB:C
B:CMSB:C/MSResidual

For example, a marine biologist investigated the effects of estuary flow on the abundance of mussels and barnacles, and whether these effects varied with tidal height. A splitplot was used whereby randomly selected estuary sites were the plots, six tidal heights (from 0 to 3.6 m) were applied within each site (within-plot effect), and each site was of either high or low flow (between-plots effect). Diagram illustrates layout for split-plot design. There are three treatments (levels of Factor C, fill patterns for squares) applied within each plot (ovals). One of two treatments (levels of Factor A, fill for ovals) was applied to whole plots. Note that treatments for both Factor A and C are replicated and fully crossed.

End of instructions

  Aggregate a data set

> name.ag <- aggregate(dataset, list(NAME=FactorA, NAME1=FactorB), mean)
# OR
> library(lme4)
> name.ag<-gsummary(dataset, groups=dataset$FactorA)

where name.ag is a name you provide for the aggregated data set, dataset is the name of the data set, NAME is for the first factor in the resulting data set, FactorA is the name of the main factor, NAME1 is for the second factor in the resulting data set and FactorA is the name of the plot (blocking) factor.
For example:

> cu.ag <- aggregate(copper, list(COPPER=copper$COPPER, PLATE=copper$PLATE), mean)
# OR
> library(lme4) > cu.ag <- aggregate(copper, groups=copper$PLATE)

End of instructions

  Boxplot from aggregated data set

Having generated an aggregated data set, it is possible to use the aggregated means to test the assumptions of normality and homogeneity of variance.

> boxplot(name.ag$dv~name.ag$FactorA)

where name.ag is a name of the aggregated data set, dv is the name of the dependent variable and FactorA is the name of the main factor.
For example:

> boxplot(cu.ag$WORMS ~ cu.ag$COPPER)

End of instructions

  R Compound symmetry/sphericity

# source Murray's syntax for dealing with sphericity
> library(biology)
# after fitting a randomized block, repeated measures or split-plot
> epsi.GG.HF(name.aov)

where name.aov is the name of the fitted randomized block, repeated measures or split-plot model. The epsi.GG.HF function returns a table of epsilon values corresponding to each of the model terms.

End of instructions

  R Split-plot

# Fit the linear model with the different error strata
> name.aov <- aov(dv ~ FactorA + FactorC + FactorA:FactorC + Error(FactorB), data=dataset))
> AnovaM(name.aov) # alternatively, if sphericity is an issue > AnovaM(name.aov, RM=T)

where name.aov is a name to provide the fitted linear model, dv is the name of the dependent variable, FactorA is the name of the between plots factor (Factor A), FactorB is the name of the block factor (random) and FactorC is the name of the within plots factor. dataset is the name of the data set.
For example:

> copper.aov <- aov(WORMS ~ COPPER + DIST + COPPER:DIST + Error(PLATE), data=copper))
> AnovaM(copper.aov, RM=T)

End of instructions

  Repeated Measures

Simple repeated measures designs and similar to Randomized complete block (RCB) designs. In the same way that we could explain some of the residual variation by blocking, we can also reduce the residual variation by removing the between-replicate variation that exists at the start of the experiment. This is achieved by applying each treatments (repeated measures) to the same subjects, rather than using different subjects for each treatment. In this way it is the same as a RCB. Where it differs from a RCB is that each of the treatments cannot be applied to the blocks (subjects) at the same time. For example, it is not possible to determine the effects of different drugs on the heart rate of rats if each of the drugs is applied at the same time! Each of the treatments should be applied in random order, however, when the repeated factor is time, this is not possible. The subjects (blocks or plots) are always a random factor. If a second factor is applied at the level of the whole subject and over each of the repeated measures (e.g. subjects are of one of two species), the analysis then resembles a splitplot design, where this second factor is the between subjects effect.


A agricultural scientist was investigating the effects of fiber on the digestibility of pastural grasses by cows. Ten randomly selected cows of two different breeds (five of each) were each fed a random sequence of four food types (of different fiber levels). After a week of feeding on a single food type, the digestibility of the food was assessed. Individual cows represented the subjects. Breed of cow was the between subjects effect and food fiber level was the within subjects effect. Diagram illustrates layout of 2 Factor repeated measures. Rectangles represent subjects. Squares represent the within subject treatments (random sequence represents the order in which each treatment is applied). Hence there are four repeated measures. There are two levels of the between subjects effect (pattern fill of rectangles).

End of instructions

  R Compound symmetry/sphericity

# source Murray's syntax for dealing with sphericity
> library(biology)
# after fitting a randomized block, repeated measures or split-plot
> epsi.GG.HF(name.aov)

where name.aov is the name of the fitted randomized block, repeated measures or split-plot model. The epsi.GG.HF function returns a table of epsilon values corresponding to each of the model terms.

End of instructions