

Workshop 9.8a - Randomized block and repeated measures designs (mixed effects models)

4 August 2011

Basic statistics references

  • Logan (2010) - Chpt 12-14
  • Quinn & Keough (2002) - Chpt 9-11
Very basic overview of Nested ANOVA

Very basic overview of Randomized block

Randomized block

A plant pathologist wanted to examine the effects of two different strengths of tobacco virus on the number of lesions on tobacco leaves. She knew from pilot studies that leaves were inherently very variable in response to the virus. In an attempt to account for this leaf to leaf variability, both treatments were applied to each leaf. Eight individual leaves were divided in half, with half of each leaf inoculated with weak strength virus and the other half inoculated with strong virus. So the leaves were blocks and each treatment was represented once in each block. A completely randomised design would have had 16 leaves, with 8 whole leaves randomly allocated to each treatment.

Download Tobacco data set

Format of tobacco.csv data files
LEAF   TREATMENT   NUMBER
1      Strong      35.898
1      Weak        25.02
2      Strong      34.118
2      Weak        23.167
3      Strong      35.702
3      Weak        24.122
.      .           .

LEAF        The blocking factor - Factor B
TREATMENT   Categorical representation of the strength of the tobacco virus - main factor of interest (Factor A)
NUMBER      Number of lesions on that part of the tobacco leaf - response variable

Open the tobacco data file.

Show code
> tobacco <- read.table("../downloads/data/tobacco.csv", header = T,
+     sep = ",", strip.white = T)
> head(tobacco)
  LEAF TREATMENT NUMBER
1   L1    Strong  35.90
2   L1      Weak  25.02
3   L2    Strong  34.12
4   L2      Weak  23.17
5   L3    Strong  35.70
6   L3      Weak  24.12

Since each level of treatment was applied to each leaf, the data represent a randomized block design with leaves as blocks.

The variable LEAF contains a list of leaf identification numbers and is supposed to represent a categorical blocking variable. However, because the contents of this variable are numbers, R initially treats them as numbers, and therefore considers the variable to be numeric rather than categorical. In order to force R to treat this variable as a factor (categorical) it is necessary to first convert this numeric variable into a factor (HINT).
Show code
> tobacco$LEAF <- as.factor(tobacco$LEAF)
Q2-1. What are the main hypotheses being tested?
  1. H0 Factor A:
  2. H0 Factor B:
Q2-2. In the table below, list the assumptions of a randomized block design along with how violations of each assumption are diagnosed and/or the risks of violations are minimized.
  1. Assumption    Diagnostic/Risk Minimization
    I.
    II.
    III.
    IV.
  2. Is the proposed model balanced?
    (Y or N)
  3. Show code
    > replications(NUMBER ~ LEAF + TREATMENT, data = tobacco)
    
         LEAF TREATMENT 
            2         8 
    
    > !is.list(replications(NUMBER ~ LEAF + TREATMENT, data = tobacco))
    
    [1] TRUE
    

Q2-3. There are a number of ways of diagnosing block by within block interactions.
  1. The simplest method is to plot the number of lesions for each treatment and leaf combination (i.e. an interaction plot)
    (HINT). Any evidence of an interaction? Note that we cannot formally test for an interaction because we don't have replicates for each treatment-leaf combination!
  2. Show code
    > library(car)
    > interaction.plot(tobacco$LEAF, tobacco$TREATMENT, tobacco$NUMBER)
    
  3. We can also examine the residual plot from the fitted linear model. A curvilinear pattern in which the residual values switch from positive to negative and back again (or vice versa) over the range of predicted values implies that the scale (magnitude but not polarity) of the main treatment effects differs substantially across the range of blocks. (HINT). Any evidence of an interaction?
  4. Show code
    > residualPlots(lm(NUMBER ~ LEAF + TREATMENT, data = tobacco))
    
               Test stat Pr(>|t|)
    LEAF              NA       NA
    TREATMENT         NA       NA
    Tukey test     -0.18    0.858
    
  5. Perform a Tukey's test for non-additivity evaluated at α = 0.10 or even α = 0.25. This (curvilinear) test formally tests for the presence of a quadratic trend in the relationship between residuals and predicted values. As such, it too is only appropriate for simple interactions of scale. (HINT). Any evidence of an interaction?
  6. Show code
    > library(alr3)
    > tukey.nonadd.test(lm(NUMBER ~ LEAF + TREATMENT, data = tobacco))
    
       Test  Pvalue 
    -0.1795  0.8575 
    
  7. Examine a lattice graphic to determine the consistency of the effect across each of the blocks (HINT).
  8. Show code
    > library(lattice)
    > print(barchart(NUMBER ~ TREATMENT | LEAF, data = tobacco))
    

Q2-4. Analyse these data with a classic randomized block ANOVA
(HINT) to test the H0 that there is no difference in the number of lesions produced by the different strength viruses. Complete the table below.
Show code
> tobacco.aov <- aov(NUMBER ~ TREATMENT + Error(LEAF), data = tobacco)
> summary(tobacco.aov)
Error: LEAF
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals  7    292    41.7               

Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)   
TREATMENT  1    248   248.3    17.2 0.0043 **
Residuals  7    101    14.5                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
Source of variation   df   Mean Sq   F-ratio   P-value
LEAF (block)
TREATMENT
Residuals


Q2-5. Interpret the results.

Q2-6. Now analyse these data with a linear mixed effects model (REML).
Show code
> library(nlme)
> tobacco.lme <- lme(NUMBER ~ TREATMENT, random = ~1 | LEAF, data = tobacco,
+     method = "ML")
> summary(tobacco.lme)
Linear mixed-effects model fit by maximum likelihood
 Data: tobacco 
    AIC   BIC logLik
  102.5 105.6 -47.25

Random effects:
 Formula: ~1 | LEAF
        (Intercept) Residual
StdDev:       3.454    3.557

Fixed effects: NUMBER ~ TREATMENT 
              Value Std.Error DF t-value p-value
(Intercept)   34.94     1.874  7  18.645  0.0000
TREATMENTWeak -7.88     1.901  7  -4.144  0.0043
 Correlation: 
              (Intr)
TREATMENTWeak -0.507

Standardized Within-Group Residuals:
     Min       Q1      Med       Q3      Max 
-1.73025 -0.51798  0.01212  0.43724  1.52636 

Number of Observations: 16
Number of Groups: 8 
> #unique(confint(tobacco.lme,'TREATMENTWeak')[[1]])
> library(gmodels)
> ci(tobacco.lme)
              Estimate CI lower CI upper Std. Error DF   p-value
(Intercept)     34.940    30.51   39.371      1.874  7 3.169e-07
TREATMENTWeak   -7.879   -12.38   -3.383      1.901  7 4.328e-03
> anova(tobacco.lme)
            numDF denDF F-value p-value
(Intercept)     1     7   368.5  <.0001
TREATMENT       1     7    17.2  0.0043
Show code
> library(lme4)
> tobacco.lmer <- lmer(NUMBER ~ TREATMENT + (1 | LEAF), data = tobacco)
> summary(tobacco.lmer)
Linear mixed model fit by REML 
Formula: NUMBER ~ TREATMENT + (1 | LEAF) 
   Data: tobacco 
  AIC  BIC logLik deviance REMLdev
 96.7 99.8  -44.4     94.5    88.7
Random effects:
 Groups   Name        Variance Std.Dev.
 LEAF     (Intercept) 13.6     3.69    
 Residual             14.5     3.80    
Number of obs: 16, groups: LEAF, 8

Fixed effects:
              Estimate Std. Error t value
(Intercept)      34.94       1.87   18.65
TREATMENTWeak    -7.88       1.90   -4.14

Correlation of Fixed Effects:
            (Intr)
TREATMENTWk -0.507
> #Calculate the confidence interval for the effect size of the main effect
> #   of treatment
> library(languageR)
> pvals.fnc(tobacco.lmer, withMCMC = FALSE, addPlot = F)
$fixed
              Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
(Intercept)     34.940   34.919      30.83     38.740 0.0001    0.000
TREATMENTWeak   -7.879   -7.841     -12.86     -2.231 0.0092    0.001

$random
    Groups        Name Std.Dev. MCMCmedian MCMCmean HPD95lower HPD95upper
1     LEAF (Intercept)    3.692      1.243    1.440       0.00      3.917
2 Residual                3.803      5.038    5.189       3.08      7.531

> #Examine the effect of treatment via a likelihood ratio test
> tobacco.lmer1 <- lmer(NUMBER ~ TREATMENT + (1 | LEAF), data = tobacco,
+     REML = F)
> tobacco.lmer2 <- lmer(NUMBER ~ 1 + (1 | LEAF), data = tobacco, REML = F)
> anova(tobacco.lmer1, tobacco.lmer2, test = "F")
Data: tobacco
Models:
tobacco.lmer2: NUMBER ~ 1 + (1 | LEAF)
tobacco.lmer1: NUMBER ~ TREATMENT + (1 | LEAF)
              Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)   
tobacco.lmer2  3 110 113  -52.2                           
tobacco.lmer1  4 102 106  -47.2  9.98      1     0.0016 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
Show code
> library(ez)
> ezANOVA(dv = .(NUMBER), within = .(TREATMENT), wid = .(LEAF), data = tobacco,
+     type = 3)
$ANOVA
     Effect DFn DFd     F        p p<.05   ges
2 TREATMENT   1   7 17.17 0.004328     * 0.387

Plot the predicted mean number of lesions (and confidence interval) for each treatment
Show code
> newdata <- expand.grid(TREATMENT = c("Strong", "Weak"), NUMBER = 0)
> mod.mat <- model.matrix(terms(tobacco.lmer), newdata)
> newdata$NUMBER = mod.mat %*% fixef(tobacco.lmer)
> pvar1 <- (diag(mod.mat %*% tcrossprod(vcov(tobacco.lmer), mod.mat)))
> tvar1 <- pvar1 + VarCorr(tobacco.lmer)$LEAF[1]
> newdata <- data.frame(newdata, plo = newdata$NUMBER - 2 * sqrt(pvar1),
+     phi = newdata$NUMBER + 2 * sqrt(pvar1), tlo = newdata$NUMBER - 2 * sqrt(tvar1),
+     thi = newdata$NUMBER + 2 * sqrt(tvar1))
> par(mar = c(5, 5, 1, 1))
> plot(NUMBER ~ as.numeric(TREATMENT), newdata, type = "n", axes = F,
+     ann = F, ylim = range(newdata$plo, newdata$phi), xlim = c(0.5, 2.5))
> with(newdata, arrows(as.numeric(TREATMENT), plo, as.numeric(TREATMENT),
+     phi, ang = 90, length = 0.05, code = 3))
> with(newdata, points(as.numeric(TREATMENT), NUMBER, pch = 21, bg = "black",
+     col = "black"))
> axis(1, at = c(1, 2), lab = newdata$TREATMENT)
> axis(2, las = 1)
> mtext("Number of lesions", 2, line = 3, cex = 1.25)
> mtext("Season", 1, line = 3, cex = 1.25)
> box(bty = "l")

  Reading data into R

Ensure that the working directory is pointing to the path containing the file to be imported before proceeding.
To import data into R, we read the contents of a file into a data frame. The general format of the command for reading data into a data frame is

> name <- read.table('filename.csv', header=T, sep=',', row.names=column, strip.white=T)

where name is a name you wish the data frame to be referred to as, filename.csv is the name of the csv file that you created in Excel and column is the number of the column that had row labels (if there were any). The argument header=T indicates that the variable (vector) names will be created from the names supplied in the first row of the file. The argument sep=',' indicates that entries in the file are separated by a comma (hence a comma delimited file). If the data file does not contain row labels, or you are not sure whether it does, it is best to omit the row.names=column argument. The strip.white=T argument ensures that no leading or trailing spaces are left in character names (these can be particularly nasty in categorical variables!).

As an example
> phasmid <- read.table("phasmid.csv", header = T, sep = ",", row.names = 1,
+     strip.white = T)

End of instructions

  Nested ANOVA

A nested ANOVA is a simple extension of the single factor ANOVA. An additional factor (Factor B) is added that is nested within the main factor of interest (Factor A). Nested factors are incorporated in situations where researchers envisage potentially large amounts of variation between different sets of replicates within the main factor.
For example, a limnologist investigating the effect of flow (Factor A: high, medium and low) on macroinvertebrate density might collect samples from a number of high, medium and low flow streams. However, the limnologist might expect that many other factors (perhaps more important than stream flow) also influence the density of macroinvertebrates and that even samples collected in close proximity to one another might be expected to vary considerably in density of macroinvertebrates. Such variability increases the unexplained variability and therefore has the potential to make it very difficult to pick up the effect of the actual factor of interest (in this case flow).
In an attempt to combat this, the researcher decides to collect multiple samples from within each stream. If some of the unexplained variability can be explained by variation between samples within streams, then the test of flow will be more sensitive.
This is called a nested ANOVA; an additional factor (Factor B: stream) has been introduced within the main factor (Factor A: flow). Factor A is a fixed factor as we are only interested in the specific levels chosen (e.g. high, medium and low flow). Factor B, however, is a random factor as we are not just interested in the specific streams that were sampled, but in all the possible streams that they represent. In a nested ANOVA the categories of Factor B nested within each level of Factor A are different. E.g. the streams used for each flow type are different. Therefore it is not possible to test for an interaction between flow type and stream.

There are two null hypotheses being tested in a two factor nested ANOVA. H0 main effect: no effect of Factor A (flow): μ1 = μ2 = ... = μi = μ (as for single factor ANOVA). H0 nesting effect: no effect of Factor B (stream) nested within Factor A (flow). Technically, this is the hypothesis that there is no added variance due to differences between all the possible levels of Factor B within any level of Factor A. E.g., there is no added variance due to differences between all possible high flow streams, between all possible medium flow streams, or between all possible low flow streams. Diagram shows layout of nested design with three treatments (different pattern fills).
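
As a minimal sketch of how the flow/stream example above might be fitted in R (the data frame streams and its variables DENSITY, FLOW and STREAM are hypothetical names used for illustration only):

# STREAM is the nested (random) factor and must be treated as categorical
> streams$STREAM <- as.factor(streams$STREAM)
# the Error() stratum ensures FLOW is tested against the variation
#   among streams within flow types
> streams.aov <- aov(DENSITY ~ FLOW + Error(STREAM), data=streams)
> summary(streams.aov)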

End of instructions

  Convert numeric variable into a factor

> data$var <- as.factor(data$var)

where data is the name of the data set and var is the name of the variable to be made into a factor. While there is no visible evidence that the data are any different, R will now treat the variable as a factor rather than as a numeric variable.

End of instructions

  Nested ANOVA assumptions

In ANOVA, the assumptions of normality and variance homogeneity apply to the residuals for the test of interest. In this example, the nested factor (Factor B, SITE) is random. Therefore, the normality and homogeneity of variance assumptions for Factor A (SEASON) must be based on the distribution, mean and variance of the response variable (S4DUGES) for each level of Factor A, using the means of Factor B within each level of Factor A as the observations, since the error term for Factor A (SEASON) is the variation in the response variable between categories of Factor B within each level of Factor A (MSSITE). Calculate the mean for each level of Factor B (SITE), and use these means to calculate the mean, variance and distribution of each level of Factor A (SEASON).

# aggregate the data set by calculating the means for each level of the nesting factor
> name.ag <- aggregate(dataset, list(Name1=FactorA, Name2=FactorB), mean)
# OR perhaps more efficiently
> library(nlme)
> name.ag <- gsummary(dataset, groups=FactorB)
# use the aggregated means to produce the boxplots
> boxplot(dv~FactorA, name.ag)

where name.ag is a name given to the new aggregated data set, dataset is the name of the original data set, Name1 and Name2 are names to appear in the aggregated data set that represent FactorA (main factor) and FactorB (nesting factor) respectively. Finally, dv is the name of the dependent variable.

This data set generates two boxplots, each constructed from only 3 values and thus of questionable value.

End of instructions

  Nested ANOVA assumptions

# fit the hierarchical linear model
> name.aov <- aov(dv~FactorA + Error(FactorB), data=data)
# print out the ANOVA strata
> summary(name.aov)

where name.aov is a name provided to store the fitted model, dv is the dependent variable, FactorA is the main fixed factor and FactorB is the random nested factor. data is the name of the data set. The 'Error()' argument defines additional error (residual) strata; hence the above template indicates that FactorA should be tested against FactorB.
For example:

# fit the hierarchical linear model
> curdies.aov <- aov(S4DUGES~SEASON + Error(SITE), data=curdies)
# print out the ANOVA strata
> summary(curdies.aov)

In recognition that FactorB is a random nested factor, and thus of little interest, the F-ratio is not provided. Since the effect of FactorB is tested against the overall residual (error) term, to obtain the F-ratio for the test of FactorB, fit the model without defining an additional error stratum. Note that when you do this, all terms are tested against the residual term and therefore the tests of any fixed factors above the random factors (in the hierarchy) will be incorrect.

> curdies.aov1 <- aov(S4DUGES~SEASON + SITE, data=curdies)
> summary(curdies.aov1)

End of instructions

  Two factor ANOVA

Statistical models that incorporate more than one categorical predictor variable are broadly referred to as multifactor analysis of variance. There are two main reasons for the inclusion of multiple factors:

  1. To attempt to reduce the unexplained (or residual) variation in the response variable and therefore increase the sensitivity of the main effect test.
  2. To examine the interaction between factors. That is, whether the effect of one factor on the response variable is dependent on the other factor(s) (i.e. whether or not it is consistent across all levels of the other factor(s)).

Fully factorial linear models are used when the design incorporates two or more factors (independent, categorical variables) that are crossed with each other. That is, all combinations of the factors are included in the design and every level (group) of each factor occurs in combination with every level of the other factor(s). Furthermore, each combination is replicated.

In fully factorial designs both factors are equally important (potentially), and since all factors are crossed, we can also test whether there is an interaction between the factors (does the effect of one factor depend on the level of the other factor(s)). Graphs above depict a) interaction, b) no interaction.


For example, Quinn (1988) investigated the effects of season (two levels, winter/spring and summer/autumn) and adult density (four levels: 8, 15, 30 and 45 animals per 225 cm2) on the number of limpet egg masses. As the design was fully crossed (all four densities were used in each season), he was also able to test for an interaction between season and density. That is, was the effect of density consistent across both seasons, and was the effect of season consistent across all densities?
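
A minimal sketch of how such a fully crossed design might be fitted (the data frame quinn and the variable names EGGS, SEASON and DENSITY are assumed for illustration; the actual data set may use different labels):

# fit the fully factorial model (both main effects plus their interaction)
> quinn.aov <- aov(EGGS ~ SEASON * DENSITY, data=quinn)
# the SEASON:DENSITY row tests whether the effect of density depends on season
> anova(quinn.aov)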

Diagram shows layout of 2 factor fully crossed design. The two factors (each with two levels) are color (black or gray) and pattern (solid or striped). There are three experimental units (replicates) per treatment combination.


End of instructions

  Two-factor ANOVA

> name.aov <- aov(dv~factor1 * factor2, data=data)

where dv is the name of the numeric vector (dependent variable), factor1 and factor2 are the names of the factors (categorical or factor variables) and data is the name of the data frame (data set).

End of instructions

  Fully factorial ANOVA assumption checking

The assumptions of normality and homogeneity of variance apply to each of the factor level combinations, since it is the replicate observations of these that are the test residuals. If the design has two factors (IV1 and IV2) with three and four levels (groups) respectively within these factors, then a boxplot of each of the 3x4=12 combinations needs to be constructed. It is recommended that a variable (called GROUP) be setup to represent the combination of the two categorical (factor) variables.

Simply construct a boxplot with the dependent variable on the y-axis and GROUP on the x-axis. Visually assess whether the data from each group are normal (or at least that the groups are not all consistently skewed in the same direction), and whether the spread of data in each group is similar (or at least not related to the mean for that group). The GROUP variable can also assist in calculating the mean and variance for each group, to further assess variance homogeneity.
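
A minimal sketch of this approach (assuming a data frame called data with factors IV1 and IV2 and dependent variable dv):

# create a single factor representing each IV1 by IV2 combination
> data$GROUP <- interaction(data$IV1, data$IV2)
# boxplot of the response for each of the factor level combinations
> boxplot(dv ~ GROUP, data=data)
# group means and variances to further assess variance homogeneity
> tapply(data$dv, data$GROUP, mean)
> tapply(data$dv, data$GROUP, var)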

End of instructions

  Factorial boxplots

> boxplot(dv~factor1*factor2,data=data)

where dv is the name of the numeric vector (dependent variable), factor1 and factor2 are the names of the factors (categorical or factor variables) and data is the name of the data frame (data set). The '~' is used in formulas to indicate that the left hand side is modelled as a function of the right hand side.

End of instructions

  Plotting mean vs variance

# first calculate the means and variances
> means <- tapply(data$dv, list(data$factor1, data$factor2), mean)
> vars <- tapply(data$dv, list(data$factor1, data$factor2), var)
# then plot the mean against variance
> plot(means,vars)

where dv is the name of the numeric vector (dependent variable), factor1 and factor2 are the names of the factors (categorical or factor variables) and data is the name of the data frame (data set).

End of instructions

  Model balance

Balanced models are those models that have equal sample sizes for each level of a treatment. Unequal sample sizes are generally problematic for statistical analyses as they can be the source of unequal variances. They are additionally problematic for multifactor designs because of the way that total variability is partitioned into the contributions of each of the main effects as well as the contributions of interactions (ie the calculations of sums of squares and degrees of freedom). Sums of squares (and degrees of freedom) are usually (type I SS) partitioned additively. That is:
SSTotal = SSA + SSB + SSAB + SSResidual, and
dfTotal = dfA + dfB + dfAB + dfResidual.
In practice, these formulas are used in reverse. For example, if the model was entered as DV~A+B+AB, first SSResidual is calculated, followed by SSAB, then SSB. SSA is thus calculated as the total sums of squares minus what has already been partitioned off to the other terms (SSA = SSTotal - (SSB + SSAB + SSResidual)). If the model is balanced, then it doesn't matter in what order the model is constructed (DV~B+A+AB will be equivalent).

However, in unbalanced designs, SS cannot be partitioned additively; that is, the actual SS of the terms do not add up to SSTotal (SSTotal ≠ SSA + SSB + SSAB + SSResidual). Of course, when SS is partitioned it all must add up. Consequently, the SS of the last term to be calculated will not be correct. Therefore, in unbalanced designs, the SS values for the terms (and therefore the F-ratios etc.) differ depending on the order in which the terms are added - an undesirable situation.

To account for this, there are alternative methods of calculating SS for the main effects (all provide the same SS for the interaction terms and residuals).
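
The order dependence can be demonstrated directly. In the following sketch (which assumes a hypothetical unbalanced data frame dat with factors A and B and dependent variable DV), the two Type I (sequential) tables will report different SS for whichever main effect is entered second, whereas the Type II table from the car package adjusts each main effect for the other and is order independent.

# Type I (sequential) SS: results depend on the order of terms
> anova(lm(DV ~ A + B, data=dat))
> anova(lm(DV ~ B + A, data=dat))
# Type II SS: each main effect adjusted for the other
> library(car)
> Anova(lm(DV ~ A + B, data=dat), type="II")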

It turns out that it is relatively easy to determine whether or not a model is balanced - use the following syntax:

> !is.list(replications(formula,data))

where formula is the model formula (for example, MASS~SITUATION*MONTH) and data is the name of the data set.
If this function returns a TRUE, your model is balanced. If it returns a FALSE then your model is not balanced and you should consider using Type II or III SS methods.

End of instructions

  Linear regression diagnostics

> plot(name.lm)

where name.lm is the name of a fitted linear model. You will be prompted to hit return to cycle through a series of four diagnostic plots. The first (and most important of these) is the Residual Plot.

End of instructions

  ANOVA table

#print a summary of the ANOVA table
> anova(name.aov)

where name.aov is the name of a fitted linear model.

End of instructions

  Interaction plot

Interaction plots display the degree of consistency (or lack thereof) of the effect of one factor across each level of another factor. Interaction plots can be either bar or line graphs; however, line graphs are more effective. The x-axis represents the levels of one factor, and a separate line is drawn for each level of the other factor. The following interaction plots represent two factors, A (with levels A1, A2, A3) and B (with levels B1, B2).


The parallel lines of the first plot clearly indicate no interaction. The effect of factor A is consistent for both levels of factor B and vice versa. The middle plot demonstrates a moderate interaction and the bottom plot illustrates a severe interaction. Clearly, whether or not there is an effect of factor B (e.g. B1 > B2 or B2 > B1) depends on the level of factor A. The trend is not consistent.

End of instructions

  R Interaction plot

> interaction.plot(dataset$Factor1, dataset$Factor2, dataset$dv)

where Factor1 is the name of the categorical variable (factor) with the most levels (this will be used for the x-axis), Factor2 is the name of the other factorial variable (this will be used to produce the different lines) and dv is the name of the dependent variable within the dataset data set.
For example:

> interaction.plot(copper$DIST, copper$COPPER, copper$WORMS)

End of instructions

  Specific comparisons of means

Following a significant ANOVA result, it is often desirable to specifically compare group means to determine which groups are significantly different. However, multiple comparisons lead to two statistical problems. Firstly, multiple significance tests inflate the Type I error rate (α, the probability of falsely rejecting H0). E.g., testing for differences between 5 groups requires ten pairwise comparisons. If the α for each test is 0.05 (5%), then the probability of at least one Type I error for the family of 10 tests is 1 - 0.95^10 ≈ 0.40 (40%). Secondly, the outcome of each test needs to be independent (orthogonality). E.g. if A>B and B>C then we already know the result of A vs. C.
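
The family-wise error rate quoted above is easily verified:

# probability of at least one Type I error in 10 independent tests at alpha = 0.05
> 1 - (1 - 0.05)^10
[1] 0.4012631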

Post-hoc unplanned pairwise comparisons (e.g. Tukey's test) compare all possible pairs of group means and are useful in an exploratory fashion to reveal major differences. Tukey's test controls the family-wise Type I error rate to no more than 0.05. However, this reduces the power of each of the pairwise comparisons, and only very large differences are detected (a problem that is exacerbated as the number of groups increases).

Planned comparisons are specific comparisons that are usually planned during the design stage of the experiment. No more than p-1 comparisons can be made (where p is the number of groups); however, each comparison, provided it is orthogonal (independent) of the others, can be tested at α = 0.05. Amongst all possible pairwise comparisons, specific comparisons are selected, while others that are meaningless (within the biological context of the investigation) are ignored.

Planned comparisons are defined using what are known as contrasts coefficients. These coefficients are a set of numbers that indicate which groups are to be compared, and what the relative contribution of each group is to the comparison. For example, if a factor has four groups (A, B, C and D) and you wish to specifically compare groups A and B, the contrast coefficients for this comparison would be 1, -1, 0,and 0 respectively. This indicates that groups A and B are to be compared (since their signs oppose) and that groups C and D are omitted from the specific comparison (their values are 0).

It is also possible to compare the average of two or more groups with the average of other groups. For example, to compare the average of groups A and B with the average of group C, the contrast coefficients would be 1, 1, -2, and 0 respectively. Note that 0.5, 0.5, -1 and 0 would be equivalent (a short R sketch follows the points listed below).

The following points relate to the specification of contrast coefficients:

  1. The sum of the coefficients should equal 0
  2. Groups to be compared should have opposing signs
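
As a sketch of how such coefficients are supplied in R (assuming a hypothetical factor FACTOR with four levels A, B, C and D in a data frame called data), the A versus B comparison above could be defined and tested as:

# define the planned contrast: compare A with B, omit C and D
> contrasts(data$FACTOR) <- cbind(c(1, -1, 0, 0))
# fit the model and partition the factor SS into the planned comparison
> data.aov <- aov(dv ~ FACTOR, data=data)
> summary(data.aov, split=list(FACTOR=list('A vs B'=1)))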

End of instructions

  Tukey's Test

# load the 'multcomp' package
> library(multcomp)
# perform Tukey's test
> summary(simtest(dv~factor1*factor2, whichf='factor', data=data, type="Tukey"))
# OR the more up to date method
> summary(glht(data.aov, linfct=mcp(factor="Tukey")))

where dv is the name of the numeric vector (dependent variable), factor1 and factor2 are the names of the factors (categorical or factor variables), and data is the name of the data frame (data set). The argument type="Tukey" indicates that Tukey's p-value adjustment should be made. When the linear model contains multiple categorical predictor variables, the argument whichf='' must be used to indicate which factor the Tukey's test should be performed on.

The coefficients list the specific comparisons made. Each line represents a specific hypothesis test including difference between groups, t-value, raw p value and Tukey's adjusted p value (the other values presented are not important for BIO3011).

End of instructions

  Simple Main Effects

> library(biology)
> quinn.aov1 <- aov(SQRTRECRUITS~DENSITY, data=quinn, subset=c(SEASON=='Summer'))
> AnovaM(lm(quinn.aov1), lm(quinn.aov),type="II")

For the subset argument, type the name of the other categorical variable followed by == (two equals signs) and the name of one of the levels (surrounded by single quotation marks) of this factor. For example, if you had a categorical variable called TEMPERATURE consisting of three levels (High, Medium and Low), to perform an ANOVA using only the High temperature data, the subset expression would be TEMPERATURE=='High'.

End of instructions

  Randomized complete block (RCB) designs

These are designs where one factor is a blocking variable (Factor B) and the other variable (Factor A) is the main treatment factor of interest. These designs are used in situations where the environment is potentially heterogeneous enough to substantially increase the variability in the response variable and thus obscure the effects of the main factor. Experimental units are grouped into blocks (units of space or time that are likely to have less variable background conditions) and then each level of the treatment factor is applied to a single unit within each block. By grouping the experimental units together into blocks, some of the total variability in the response variable that is not explained by the main factor may be explained by the blocks. If so, the residual (unexplained) variation will be reduced resulting in more precise estimates of parameters and more powerful tests of treatments.

A plant ecologist was investigating the effects of leaf damage on plant chemical defenses. Since individual trees vary greatly in the degree of chemical defense, she applied each of three artificial leaf damage treatments (heavy, medium and light damage) to each tree. She applied one of the three treatments to each of three branches within eight randomly selected trees. Trees were the blocks. The diagram shows the layout of a blocking design. The main factor has three treatments (pattern fills), each of which is applied randomly within 3 randomly selected blocks. The decision to include a blocking factor (rather than a single factor, completely randomized design) depends on whether the experimental units (replicates) within a block are likely to be more similar to one another than those between blocks and whether the reduction in MSResidual is likely to compensate for the loss of residual df.

End of instructions

  Randomized blocks

> name.aov <- aov(dv~factorA + Error(factorB), data=data)

where dv is the name of the numeric vector (dependent variable), factorA and factorB represent the main factor of interest and the blocking factor respectively.

End of instructions

  Split-plot designs

Split-plot designs can be thought of as a combination of a randomized complete block (RCB) and nested designs. In their simplest form, they represent a RCB design, with a second factor applied to the whole blocks. One factor (Factor C) is applied to experimental units (usually randomly positioned) within each block (now referred to as plots, or Factor B). This is the within-plot factor. A second factor (between-plots factor, or Factor A) is then applied to the whole plots, and these plots are replicated. Hence a nested design applies between Factor A and the plots, and a blocking design applies between Factor C and the plots. Factor A is also crossed with Factor C, and therefore there is a test of the interaction between the two main factors.


Note that if there are no replicate units of Factor C within each plot (that is, if each level of Factor C is applied only once to each plot), then the interaction of Factor C by Plots within Factor A represents the overall residual. However, since the plots are usually random factors, the effects of Factor A must be tested against the Plots within Factor A term.
If there are replicate units of Factor C within each plot, then the Factor A by Plots within Factor A by Factor C term is the overall residual.

Source   F-ratio (A, C fixed; B random)
A        MSA/MSB
B        MSB/MSResidual
C        MSC/MSB:C
A:C      MSA:C/MSB:C
B:C      MSB:C/MSResidual

For example, a marine biologist investigated the effects of estuary flow on the abundance of mussels and barnacles, and whether these effects varied with tidal height. A split-plot design was used whereby randomly selected estuary sites were the plots, six tidal heights (from 0 to 3.6 m) were applied within each site (within-plot effect), and each site was of either high or low flow (between-plots effect). Diagram illustrates layout for split-plot design. There are three treatments (levels of Factor C, fill patterns for squares) applied within each plot (ovals). One of two treatments (levels of Factor A, fill for ovals) was applied to whole plots. Note that treatments for both Factor A and C are replicated and fully crossed.

End of instructions

  Aggregate a data set

> name.ag <- aggregate(dataset, list(NAME=FactorA, NAME1=FactorB), mean)
# OR
> library(nlme)
> name.ag <- gsummary(dataset, groups=dataset$FactorB)

where name.ag is a name you provide for the aggregated data set, dataset is the name of the data set, NAME is for the first factor in the resulting data set, FactorA is the name of the main factor, NAME1 is for the second factor in the resulting data set and FactorB is the name of the plot (blocking) factor.
For example:

> cu.ag <- aggregate(copper, list(COPPER=copper$COPPER, PLATE=copper$PLATE), mean)
# OR
> library(nlme)
> cu.ag <- gsummary(copper, groups=copper$PLATE)

End of instructions

  Boxplot from aggregated data set

Having generated an aggregated data set, it is possible to use the aggregated means to test the assumptions of normality and homogeneity of variance.

> boxplot(name.ag$dv~name.ag$FactorA)

where name.ag is a name of the aggregated data set, dv is the name of the dependent variable and FactorA is the name of the main factor.
For example:

> boxplot(cu.ag$WORMS ~ cu.ag$COPPER)

End of instructions

  R Compound symmetry/sphericity

# source Murray's syntax for dealing with sphericity
> library(biology)
# after fitting a randomized block, repeated measures or split-plot
> epsi.GG.HF(name.aov)

where name.aov is the name of the fitted randomized block, repeated measures or split-plot model. The epsi.GG.HF function returns a table of epsilon values corresponding to each of the model terms.

End of instructions

  R Split-plot

# Fit the linear model with the different error strata
> name.aov <- aov(dv ~ FactorA + FactorC + FactorA:FactorC + Error(FactorB), data=dataset)
> AnovaM(name.aov)
# alternatively, if sphericity is an issue
> AnovaM(name.aov, RM=T)

where name.aov is a name to provide the fitted linear model, dv is the name of the dependent variable, FactorA is the name of the between plots factor (Factor A), FactorB is the name of the block factor (random) and FactorC is the name of the within plots factor. dataset is the name of the data set.
For example:

> copper.aov <- aov(WORMS ~ COPPER + DIST + COPPER:DIST + Error(PLATE), data=copper)
> AnovaM(copper.aov, RM=T)

End of instructions

  Repeated Measures

Simple repeated measures designs are similar to randomized complete block (RCB) designs. In the same way that we could explain some of the residual variation by blocking, we can also reduce the residual variation by removing the between-replicate variation that exists at the start of the experiment. This is achieved by applying each treatment (repeated measures) to the same subjects, rather than using different subjects for each treatment. In this way it is the same as a RCB. Where it differs from a RCB is that the treatments cannot all be applied to the blocks (subjects) at the same time. For example, it is not possible to determine the effects of different drugs on the heart rate of rats if each of the drugs is applied at the same time! Each of the treatments should be applied in random order; however, when the repeated factor is time, this is not possible. The subjects (blocks or plots) are always a random factor. If a second factor is applied at the level of the whole subject and over each of the repeated measures (e.g. subjects are of one of two species), the analysis then resembles a split-plot design, where this second factor is the between-subjects effect.


An agricultural scientist was investigating the effects of fiber on the digestibility of pasture grasses by cows. Ten randomly selected cows of two different breeds (five of each) were each fed a random sequence of four food types (of different fiber levels). After a week of feeding on a single food type, the digestibility of the food was assessed. Individual cows represented the subjects. Breed of cow was the between-subjects effect and food fiber level was the within-subjects effect. Diagram illustrates layout of 2 factor repeated measures design. Rectangles represent subjects. Squares represent the within-subject treatments (random sequence represents the order in which each treatment is applied). Hence there are four repeated measures. There are two levels of the between-subjects effect (pattern fill of rectangles).
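
A minimal sketch of the corresponding model (the data frame cows and the variables DIGEST, FIBER, BREED and COW are hypothetical names for illustration). Since each cow receives every food type, COW is the random subjects (blocking) term:

# COW is the random subjects factor
> cows$COW <- as.factor(cows$COW)
# BREED is the between-subjects effect, FIBER the within-subjects effect
> cows.aov <- aov(DIGEST ~ BREED * FIBER + Error(COW), data=cows)
> summary(cows.aov)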

End of instructions
