Workshop 11.2b - χ² tests (Bayesian)

23 April 2011

Basic χ² references

Logan (2010) - Chpt 16-17
Quinn & Keough (2002) - Chpt 13-14

View R code for preliminaries.

> library(R2jags)
> library(ggplot2)
> library(grid)
> libray(coda)

Error: could not find function "libray"

> 
> #define my common ggplot options
> murray_opts <- opts(panel.grid.major=theme_blank(),
+ 			   panel.grid.minor=theme_blank(),
+ 			   panel.border = theme_blank(),
+ 			   panel.background = theme_blank(),
+ 			   axis.title.y=theme_text(size=15, vjust=0,angle=90),
+ 			   axis.text.y=theme_text(size=12),
+ 			   axis.title.x=theme_text(size=15, vjust=-1),
+ 			   axis.text.x=theme_text(size=12),
+ 			   axis.line = theme_segment(),
+ 			   legend.position=c(1,0.1),legend.justification=c(1,0),
+ 			   legend.key=theme_blank(),
+ 			   plot.margin=unit(c(0.5,0.5,1,2),"lines")
+ )

Goodness of fit test

A fictitious plant ecologist sampled 90 shrubs of a dioecious plant in a forest, and each plant was classified as being either male or female. The ecologist was interested in the sex ratio and whether it differed from 50:50. The observed counts and the predicted (expected) counts based on a theoretical 50:50 sex ratio follow.

Format of fictitious plant sex ratios - note, not a file

Expected and Observed data (50:50 sex ratio).

	Female	Male	Total
Observed	40	50	90
Expected	45	45	90

Note, it is not necessary to open or create a data file for this question.

As the fictitious researcher was interested in exploring sex parity (1:1 frequency ratio) in the dioecious plant, the $\chi^2$ statistic would seem appropriate. Recall that the $\chi^2$ statistic is defined as: $$\chi^2=\sum\frac{(o-e)^2}{e}$$ In effect, the chi-square statistic (which incorporates the variability in the data in to measure of the difference between observed and expected) becomes the input for the likelihood model. Whilst we could simply pass JAGS the chi-square statistic, by parsing the observed and expected values and having the chi-square value calculated within JAGS data, the resulting JAGS code is more complete and able to accommodate other scenarios.
So if, $chisq$ is the chi-square statistic and $k$ is the degrees of freedom (and thus expected value of the $\chi^2$ distribution), then the likelihood model is:
\begin{align*} chisq &\sim \chi^2(k)\\ k &\sim U(0.01, 100) \end{align*} Translate this into JAGS code.
Show code
> modelString=" + data { + for (i in 1:n){ + resid[i] <- pow(obs[i]-exp[i],2)/exp[i] + } + chisq <- sum(resid) + } + model { + #Likelihood + chisq ~ dchisqr(k) + k ~ dunif(0.01,100) + } + " > ## write the model to a text file (I suggest you alter the path to somewhere more relevant > ## to your system!) > writeLines(modelString,con="../downloads/BUGSscripts/ws11.2bQ1.1.txt")
We can now generate the observed and expected data as a list (as is necessary for JAGS). When exploring parity, we are "expecting" the number of Females and Males to be very similar. So from 90 individuals we expect 50% (0.5) to be Female and 50% to be Male. To supply the JAGS code created in the step above, we require the observed counts, expected counts and the number of classification levels ($=2$, Females and Males).
Show code
> #The observed Female and Male plant frequencies > obs <- c(40, 50) > #The expected Female and Make plant frequencies > exp <- rep(sum(obs) * 0.5, 2) > plant.list <- list(obs = c(40, 50), exp = exp, n = 2) > plant.list
$obs [1] 40 50 $exp [1] 45 45 $n [1] 2
Next we should define the behavioural parameters of the MCMC sampling chains. Include the following:
- the nodes (estimated parameters) to monitor (return samples for)
- the number of MCMC chains (3)
- the number of burnin steps (1000)
- the thinning factor (10)
- the number of MCMC iterations - determined by the number of samples to save, the rate of thinning and the number of chains
Show code
> params <- c("chisq", "resid", "k") > nChains = 3 > burnInSteps = 1000 > thinSteps = 10 > numSavedSteps = 5000 > nIter = ceiling((numSavedSteps * thinSteps)/nChains)

Now run the JAGS code via the R2jags interface and view a summary.

Show code

> #Run any random function just to get .Random.seed started
> runif(1)

[1] 0.1886

> plant.r2jags <- jags(data = plant.list, inits = NULL, parameters.to.save = params, 
+     model.file = "../downloads/BUGSscripts/ws11.2bQ1.1.txt", n.chains = 3, n.iter = nIter, 
+     n.burnin = burnInSteps, n.thin = thinSteps)

Compiling data graph
   Resolving undeclared variables
   Allocating nodes
   Initializing
   Reading data back into data table
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 11

Initializing model

> print(plant.r2jags)

Inference for Bugs model at "../downloads/BUGSscripts/ws11.2bQ1.1.txt", fit using jags,
 3 chains, each with 16667 iterations (first 1000 discarded), n.thin = 10
 n.sims = 4701 iterations saved
         mu.vect sd.vect  2.5%   25%   50%   75% 97.5%  Rhat n.eff
chisq      1.111   0.000 1.111 1.111 1.111 1.111 1.111 1.000     1
k          2.880   1.620 0.505 1.663 2.623 3.826 6.609 1.001  4700
resid[1]   0.556   0.000 0.556 0.556 0.556 0.556 0.556 1.000     1
resid[2]   0.556   0.000 0.556 0.556 0.556 0.556 0.556 1.000     1
deviance   3.453   1.379 2.498 2.592 2.921 3.769 7.202 1.002  1700

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 1.0 and DIC = 4.4
DIC is an estimate of expected predictive error (lower deviance is better).

Prior to using the MCMC samples, we should confirm that the chains converged on a stable posterior distribution with adequate mixing (note it is really only k that we are interested in):

Trace and density plots
View code
> plot(as.mcmc(plant.r2jags))
Conclusion:

Raftery diagnostic

View code

> raftery.diag(as.mcmc(plant.r2jags))

[[1]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95 

You need a sample size of at least 3746 with these values of q, r and s

[[2]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95 

You need a sample size of at least 3746 with these values of q, r and s

[[3]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95 

You need a sample size of at least 3746 with these values of q, r and s

Conclusion:

Autocorrelation

View code

> autocorr.diag(as.mcmc(plant.r2jags))

        chisq  deviance         k resid[1] resid[2]
Lag 0     NaN  1.000000  1.000000      NaN      NaN
Lag 10    NaN  0.002296 -0.019663      NaN      NaN
Lag 50    NaN  0.005514  0.002009      NaN      NaN
Lag 100   NaN  0.003508  0.018041      NaN      NaN
Lag 500   NaN -0.007814 -0.011057      NaN      NaN

Conclusion:

What conclusions would you draw?

Lets now extend this fictitious endeavor. Recent studies on a related species of shrub have suggested a 30:70 female:male sex ratio. Knowing that our plant ecologist had similar research interests, the authors contacted her to inquire whether her data contradicted their findings. Fit the appropriate JAGS (BUGS) model. We can reuse the JAGS model from above. We just have to alter the expected values (input data) and rerun the analysis.

Show code

> plant.list2 <- list(obs=c(40,50),exp=c(27,63), n=2)
> plant.r2jags2 <- jags(data=plant.list2, 
+ 	  inits=NULL, #or inits=list(inits,inits,inits) # since there are three chains
+ 	  parameters.to.save=params, 
+ 	  model.file="../downloads/BUGSscripts/ws11.2bQ1.1.txt",
+ 	  n.chains=3,
+ 	  n.iter=nIter, 
+ 	  n.burnin=burnInSteps,
+       n.thin=thinSteps
+ 	  )

Compiling data graph
   Resolving undeclared variables
   Allocating nodes
   Initializing
   Reading data back into data table
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 11

Initializing model

> print(plant.r2jags2)

Inference for Bugs model at "../downloads/BUGSscripts/ws11.2bQ1.1.txt", fit using jags,
 3 chains, each with 16667 iterations (first 1000 discarded), n.thin = 10
 n.sims = 4701 iterations saved
         mu.vect sd.vect  2.5%   25%    50%    75%  97.5%  Rhat n.eff
chisq      8.942   0.000 8.942 8.942  8.942  8.942  8.942 1.000     1
k         11.038   4.333 3.726 7.907 10.688 13.849 20.265 1.002  2000
resid[1]   6.259   0.000 6.259 6.259  6.259  6.259  6.259 1.000     1
resid[2]   2.683   0.000 2.683 2.683  2.683  2.683  2.683 1.000     1
deviance   5.726   1.454 4.704 4.809  5.180  6.077  9.728 1.001  4700

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 1.1 and DIC = 6.8
DIC is an estimate of expected predictive error (lower deviance is better).

What conclusions would you draw from this analysis?
We could alternatively explore these data from the perspective of estimating the population fractions given the observed data. We do this by fitting a binomial model: \begin{align*} obs_i &\sim Bin(p_i,n_i)\\ p_i &\sim B(a_i,b_i)\\ a &\sim gamma(1, 0.01)\\ b &\sim gamma(1, 0.01) \end{align*} where $p_i$ are the probabilities (fractions) of each sex, $n_i$ are the total number of individuals and $a_i$ and $b_i$ are hyper parameters of the prior beta distribution. Have a go at specifying the model in JAGS.
Show code
> modelString=" + model { + #Likelihood + for (i in 1:nGroups) { + obs[i] ~ dbin(p[i],n[i]) + p[i] ~ dbeta(a[i],b[i]) + a[i] ~ dgamma(1,0.01) + b[i] ~ dgamma(1,0.01) + } + } + " > ## write the model to a text file (I suggest you alter the path to somewhere more relevant > ## to your system!) > writeLines(modelString,con="../downloads/BUGSscripts/ws11.2bQ1.9.txt")
Next create a suitable data list incorporating the observed values ($obs$), the total number of species ($n$) and the number of sexes ($nGroups$).
Show code
> obs <- c(40, 50) > plant.list3 <- list(obs = obs, n = c(90, 90), nGroups = 2) > plant.list3
$obs [1] 40 50 $n [1] 90 90 $nGroups [1] 2

Define the MCMC chain parameters (similar to above) along with the nodes to monitor (just the two $p$ parameter values).

Show code

> params <- c("p")
> nChains = 3
> burnInSteps = 1000
> thinSteps = 10
> numSavedSteps = 5000
> nIter = ceiling((numSavedSteps * thinSteps)/nChains)

Now fit the model via the R2jags interface.

Show code

> plant.r2jags3 <- jags(data=plant.list3, 
+ 	  inits=NULL, #or inits=list(inits,inits,inits) # since there are three chains
+ 	  parameters.to.save=params, 
+ 	  model.file="../downloads/BUGSscripts/ws11.2bQ1.9.txt",
+ 	  n.chains=3,
+ 	  n.iter=nIter, 
+ 	  n.burnin=burnInSteps,
+       n.thin=thinSteps
+ 	  )

Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 13

Initializing model

> print(plant.r2jags3)

Inference for Bugs model at "../downloads/BUGSscripts/ws11.2bQ1.9.txt", fit using jags,
 3 chains, each with 16667 iterations (first 1000 discarded), n.thin = 10
 n.sims = 4701 iterations saved
         mu.vect sd.vect  2.5%    25%    50%    75%  97.5%  Rhat n.eff
p[1]       0.445   0.051 0.349  0.411  0.444  0.480  0.547 1.001  4700
p[2]       0.554   0.051 0.452  0.520  0.554  0.588  0.653 1.001  4700
deviance  11.817   1.935 9.948 10.456 11.219 12.628 17.103 1.001  4700

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 1.9 and DIC = 13.7
DIC is an estimate of expected predictive error (lower deviance is better).

Confirm that the MCMC sampler converged and mixed well

Show code

> library(plyr)
> library(coda)
> preds <- c("p[1]", "p[2]")
> #Trace and density plots
> plot(as.mcmc.list(alply(plant.r2jags3$BUGSoutput$sims.array[, , preds], 
+     2, function(x) {
+         as.mcmc(x)
+     })))

> autocorr.diag(as.mcmc.list(alply(plant.r2jags3$BUGSoutput$sims.array[, 
+     , preds], 2, function(x) {
+     as.mcmc(x)
+ })))

            p[1]     p[2]
Lag 0   1.000000  1.00000
Lag 1   0.040793  0.02910
Lag 5  -0.009471  0.01711
Lag 10  0.035296  0.01240
Lag 50  0.008957 -0.01252

> raftery.diag(as.mcmc(plant.r2jags3))

[[1]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95 

You need a sample size of at least 3746 with these values of q, r and s

[[2]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95 

You need a sample size of at least 3746 with these values of q, r and s

[[3]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95 

You need a sample size of at least 3746 with these values of q, r and s

What conclusions would you draw from the above analyses? Comment on whether a 1:1 (0.5:0.5) or 30:70 (0.3:0.7) ratios are likely.

How about a nice summary figure?

Show code

> preds <- c("p[1]", "p[2]")
> plant.mcmc <- as.mcmc(plant.r2jags3$BUGSoutput$sims.matrix[, preds])
> head(plant.mcmc)

Markov Chain Monte Carlo (MCMC) output:
Start = 1 
End = 7 
Thinning interval = 1 
       p[1]   p[2]
[1,] 0.3603 0.5058
[2,] 0.4588 0.5477
[3,] 0.5483 0.5912
[4,] 0.5149 0.5482
[5,] 0.4343 0.5329
[6,] 0.4842 0.6612
[7,] 0.5680 0.5567

> plant.mcmc.sum <- adply(plant.mcmc, 2, function(x) {
+     data.frame(mean = mean(x), median = median(x), coda:::HPDinterval(x, p = c(0.68)), 
+         coda:::HPDinterval(x))
+ })
> plant.mcmc.sum$Sex = c("F", "M")
> library(ggplot2)
> p <- ggplot(plant.mcmc.sum, aes(y = mean, x = Sex)) + geom_hline(yintercept = 0.5, 
+     linetype = "dashed") + geom_hline(yintercept = c(0.3, 0.7), linetype = "dashed") + 
+     geom_errorbar(aes(ymin = lower.1, ymax = upper.1, x = Sex), width = 0) + 
+     geom_errorbar(aes(ymin = lower, ymax = upper, x = Sex), width = 0, size = 3) + 
+     scale_y_continuous("Population fraction") + geom_point() + coord_flip() + 
+     murray_opts
> p

Workshop 11.2b - χ2 tests (Bayesian)

Basic χ2 references

Goodness of fit test

Workshop 11.2b - χ² tests (Bayesian)

Basic χ² references