Tutorial 11.2b - Goodness of fit tests
23 April 2011
> library(R2jags)
> library(plyr)
Scenario and Data
Goodness of fit tests are concerned with comparing observed frequencies with those expected on the basis of a specific null hypothesis. So let's now fabricate a motivating scenario and some data.
We will create a scenario that involves items classified into one of three groups (A, B and C). The number of items in each classification group is then tallied up. Out of a total of 47 items, 15 were of type A, 9 were of type B and 23 were of type C. We could evaluate parity (a 1:1:1 ratio) from these data. In a frequentist context, this might involve testing a null hypothesis that the observed data could have come from a population with a 1:1:1 item ratio. In this case the probability would be the probability of obtaining the observed ratio of frequencies when the null hypothesis is true.
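As a point of reference, the frequentist version of this test is a one-line call in R (a quick sketch using the counts described above; the resulting statistic is approximately 6.3 on 2 degrees of freedom, giving a p-value of roughly 0.043):

# Frequentist goodness of fit test against a 1:1:1 ratio
obs <- c(15, 9, 23)
chisq.test(obs, p = c(1, 1, 1)/3)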
In a Bayesian context, there are numerous ways that we could tackle these data. We could evaluate the evidence for the null hypothesis (1:1:1 item ratio) given the observed data by estimating the degrees of freedom of a chi-square distribution. Alternatively, we could estimate the values of the three population fractions, which are expected to be 1/3, 1/3 and 1/3 under a 1:1:1 ratio. We will explore the population fractions option first and the chi-square approach second.
To extend the example, let's also explore a 1:1:2 ratio.
We start by generating the observed data:
> #the observed frequencies of A, B and C
> obs <- c(15, 9, 23)
> obs
[1] 15 9 23
Estimating population fractions - binomial distribution
The binomial distribution describes the probability of observing a given number of successes out of a total of $n$ independent trials, each with success probability $p$. In this case, it can be used to model the number of items of each group (A, B and C) out of a total of 47 items. The prior distribution for each $p_i$ would be a beta distribution (values range from 0 to 1) with shape parameters $a_i$ and $b_i$, the hyperpriors of which follow vague (flat, imprecise) gamma distributions.
\begin{align*}
obs_i &\sim Bin(p_i,n_i)\\
p_i &\sim B(a_i,b_i)\\
a_i &\sim gamma(1, 0.01)\\
b_i &\sim gamma(1, 0.01)
\end{align*}
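To get a feel for just how diffuse this hierarchical prior is, we can simulate from it directly in R (a sketch mirroring the hyperpriors above; the sample size of 10000 is arbitrary):

# Draw shape parameters from the vague gamma hyperpriors,
# then draw the corresponding fractions from the beta prior
set.seed(1)
a <- rgamma(10000, shape = 1, rate = 0.01)
b <- rgamma(10000, shape = 1, rate = 0.01)
p <- rbeta(10000, a, b)
hist(p, breaks = 50, main = "Draws of p from the hierarchical prior")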
Exploratory data analysis and initial assumption checking
The data should logically follow a binomial distribution (since the observations are counts of positive events out of a total).
Model fitting or statistical analysis
Define the model
We now translate the likelihood model into BUGS/JAGS code and store the code in an external file.
> modelString="
+ model {
+   #Likelihood
+   for (i in 1:nGroups) {
+     obs[i] ~ dbin(p[i],n[i])
+     p[i] ~ dbeta(a[i],b[i])
+     a[i] ~ dgamma(1,0.01)
+     b[i] ~ dgamma(1,0.01)
+   }
+ }
+ "
> ## write the model to a text file (I suggest you alter the path to somewhere more relevant
> ## to your system!)
> writeLines(modelString,con="../downloads/BUGSscripts/tut11.2bQ1.1.txt")
- The likelihood model indicates that the observed counts are modeled by a binomial distribution with a probability of $p$ (fraction) from $n$ trials (items).
- The prior on each $p$ is defined as a beta distribution with shape parameters $a$ and $b$.
- The hyperpriors for each $a$ and $b$ are drawn from imprecise (vague, flat) gamma distributions.
Define the data list
As input, JAGS will need to be supplied with:
- the observed data ($obs$)
- the total number of observed items ($n$)
- the number of classification groups ($nGroups$)
> #The observed item frequencies
> obs <- c(15, 9, 23)
> data.list <- list(obs = obs, n = c(47, 47, 47), nGroups = 3)
> data.list
$obs
[1] 15 9 23

$n
[1] 47 47 47

$nGroups
[1] 3
Define the MCMC chain parameters
Next we should define the behavioural parameters of the MCMC sampling chains. Include the following:
- the nodes (estimated parameters) to monitor (return samples for)
- the number of MCMC chains (3)
- the number of burnin steps (1000)
- the thinning factor (10)
- the number of MCMC iterations - determined by the number of samples to save, the rate of thinning and the number of chains
> params <- c("p") > nChains = 3 > burnInSteps = 1000 > thinSteps = 10 > numSavedSteps = 5000 > nIter = ceiling((numSavedSteps * thinSteps)/nChains)
Fit the model
Now run the JAGS code via the R2jags interface. Note that the first time jags is run after the R2jags package is loaded, it is often necessary to run any kind of randomization function just to initiate the .Random.seed variable.
> #Run any random function just to get .Random.seed started
> runif(1)
[1] 0.01705
> #Fit the model for the 1:1:1 ratio
> data.r2jags <- jags(data = data.list, inits = NULL, parameters.to.save = params,
+     model.file = "../downloads/BUGSscripts/tut11.2bQ1.1.txt", n.chains = 3,
+     n.iter = nIter, n.burnin = burnInSteps, n.thin = thinSteps)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph Size: 18

Initializing model
> print(data.r2jags)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.2bQ1.1.txt", fit using jags,
 3 chains, each with 16667 iterations (first 1000 discarded), n.thin = 10
 n.sims = 4701 iterations saved
         mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
p[1]       0.327   0.066  0.203  0.282  0.325  0.371  0.462 1.002  1500
p[2]       0.204   0.058  0.105  0.162  0.199  0.242  0.330 1.002  2000
p[3]       0.490   0.070  0.355  0.443  0.488  0.538  0.625 1.001  4700
deviance  15.258   2.375 12.530 13.515 14.634 16.370 21.387 1.001  4700

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 2.8 and DIC = 18.1
DIC is an estimate of expected predictive error (lower deviance is better).
Conclusions: Initially, we should focus our attention on the Rhat and n.eff columns. These are the scale reduction and number of effective samples respectively and they provide an indication of the degree of mixing or coverage of the samples. Ideally, the n.eff values should be approximately equal to the number of saved samples (in this case 4701), and the Rhat values should be approximately 1 (complete convergence).
Whilst the actual values are likely to differ substantially from run to run (due to the stochastic nature of the way the chains traverse the posterior distribution), on this occasion, the n.eff values of the first two probability parameters (p[1] and p[2]) are substantially lower than 4700. Hence, the samples of these parameters may not accurately reflect the posterior distribution. We might consider altering one or more of the chain behavioural parameters (such as the thinning rate), or altering the model definition (or priors) itself.
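These diagnostics can also be pulled out of the fitted object programmatically rather than read off the printout (a sketch based on the R2jags output structure):

# Rhat and n.eff columns of the summary matrix
data.r2jags$BUGSoutput$summary[, c("Rhat", "n.eff")]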
Model evaluation
Prior to exploring the model parameters, it is necessary to first evaluate whether the MCMC samples are likely to be representative of a stationary posterior distribution. Specifically, we should explore:
- whether the MCMC chains have converged - Trace plots
- whether the MCMC chains have yielded well mixed (not autocorrelated samples) - Raftery and autocorrelation diagnostics
- whether the MCMC samples are independent of the initial starting configuration (adequate burnin) - Trace plots
- what sort of point estimates (mean, median and mode) are appropriate for the parameters - Density plots
- Trace and density plots
For the Goodness of fit test, it is only the distribution of the $p$'s that we are interested in.
> library(plyr)
> library(coda)
> preds <- c("p[1]", "p[2]", "p[3]")
> plot(as.mcmc.list(alply(data.r2jags$BUGSoutput$sims.array[, , preds],
+     2, function(x) {
+         as.mcmc(x)
+     })))
> plot(as.mcmc(data.r2jags))
- Autocorrelation diagnostic
> preds <- c("p[1]", "p[2]", "p[3]") > autocorr.diag(as.mcmc.list(alply(data.r2jags$BUGSoutput$sims.array[, + , preds], 2, function(x) { + as.mcmc(x) + })))
             p[1]      p[2]       p[3]
Lag 0   1.0000000  1.000000  1.0000000
Lag 1   0.1424401  0.154652  0.1087300
Lag 5   0.0124894  0.016465 -0.0005187
Lag 10  0.0002183 -0.007616 -0.0110928
Lag 50  0.0296613 -0.017395 -0.0295151
> autocorr.diag(as.mcmc(data.r2jags))
         deviance      p[1]      p[2]       p[3]
Lag 0    1.000000 1.0000000  1.000000  1.0000000
Lag 10   0.060629 0.1424401  0.154652  0.1087300
Lag 50   0.012898 0.0124894  0.016465 -0.0005187
Lag 100  0.011000 0.0002183 -0.007616 -0.0110928
Lag 500 -0.006843 0.0296613 -0.017395 -0.0295151
- Raftery diagnostic
> raftery.diag(as.mcmc(data.r2jags))
[[1]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95

You need a sample size of at least 3746 with these values of q, r and s

[[2]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95

You need a sample size of at least 3746 with these values of q, r and s

[[3]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95

You need a sample size of at least 3746 with these values of q, r and s
> params <- c("p") > nChains = 3 > burnInSteps = 1000 > thinSteps = 50 > numSavedSteps = 5000 > nIter = ceiling((numSavedSteps * thinSteps)/nChains) > data.r2jags <- jags(data = data.list, inits = NULL, parameters.to.save = params, + model.file = "../downloads/BUGSscripts/tut11.2bQ1.1.txt", n.chains = 3, + n.iter = nIter, n.burnin = burnInSteps, n.thin = thinSteps)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph Size: 18

Initializing model
> print(data.r2jags)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.2bQ1.1.txt", fit using jags,
 3 chains, each with 83334 iterations (first 1000 discarded), n.thin = 50
 n.sims = 4941 iterations saved
         mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
p[1]       0.328   0.067  0.207  0.281  0.327  0.372  0.466 1.001  4900
p[2]       0.204   0.056  0.104  0.164  0.200  0.239  0.323 1.001  4900
p[3]       0.490   0.071  0.355  0.443  0.491  0.538  0.625 1.001  4900
deviance  15.237   2.330 12.533 13.532 14.659 16.291 21.381 1.002  1600

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 2.7 and DIC = 17.9
DIC is an estimate of expected predictive error (lower deviance is better).
Conclusions: Rhat and n.eff are now much better for the probability parameters. The estimated fractions for A, B and C are:
- A: 0.327 (0.207, 0.466)
- B: 0.200 (0.104, 0.323)
- C: 0.491 (0.355, 0.625)
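These point and interval estimates can be reproduced directly from the saved samples (a sketch; exact values will vary slightly from run to run):

# Medians and 95% credible intervals of the population fractions
p.samples <- data.r2jags$BUGSoutput$sims.matrix[, c("p[1]", "p[2]", "p[3]")]
apply(p.samples, 2, quantile, probs = c(0.025, 0.5, 0.975))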
Chi-square
An appropriate test statistic for comparing an observed ($o$) frequency ratio to an expected ($e$) frequency ratio is the chi-square ($\chi^2$) statistic. $$\chi^2=\sum\frac{(o-e)^2}{e}$$ In effect, the chi-square statistic (which incorporates the variability in the data into a measure of the difference between observed and expected) becomes the input for the likelihood model. Whilst we could simply pass JAGS the chi-square statistic, by passing the observed and expected values and having the chi-square value calculated within the JAGS data block, the resulting JAGS code is more complete and able to accommodate other scenarios.
So if, $chisq$ is the chi-square statistic and $k$ is the degrees of freedom (and thus expected value of the $\chi^2$ distribution), then the likelihood model is:
\begin{align*} chisq &\sim \chi^2(k)\\ k &\sim U(0.01, 100) \end{align*}

Exploratory data analysis and initial assumption checking
- All of the observations are classified independently - this must be addressed at the design and collection stages
- No more than 20% of the expected values should be less than 5. Since the location and spread of a $\chi^2$ distribution are governed by the same parameter ($k$, the degrees of freedom), and the $\chi^2$ distribution is bounded by a lower limit of zero, distributions with expected values less than 5 have a strongly asymmetrical shape and are thus unreliable (for calculating probabilities), as illustrated below.
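The asymmetry can be visualized by overlaying $\chi^2$ densities for a range of degrees of freedom (a quick sketch):

# chi-square densities: strongly asymmetric for small df,
# approaching symmetry as df increases
curve(dchisq(x, df = 2), from = 0, to = 30, ylab = "Density")
curve(dchisq(x, df = 5), add = TRUE, lty = 2)
curve(dchisq(x, df = 15), add = TRUE, lty = 3)
legend("topright", legend = c("df = 2", "df = 5", "df = 15"), lty = 1:3)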
So let's calculate the expected frequencies as a means to evaluate this assumption. The expected values are calculated as: $$e = total~counts \times expected~fraction$$
| | 1:1:1 ratio | 1:1:2 ratio |
|---|---|---|
| Expected fractions | A=1/3=0.33, B=1/3=0.33, C=1/3=0.33 | A=1/4=0.25, B=1/4=0.25, C=2/4=0.5 |
| Expected frequencies | $e_A=(15+9+23)\times 1/3=15.67$, $e_B=(15+9+23)\times 1/3=15.67$, $e_C=(15+9+23)\times 1/3=15.67$ | $e_A=(15+9+23)\times 1/4=11.75$, $e_B=(15+9+23)\times 1/4=11.75$, $e_C=(15+9+23)\times 2/4=23.5$ |
It is clear that in neither case are any of the expected frequencies less than 5. Therefore, we would conclude that probabilities derived from the $\chi^2$ distribution are likely to be reliable.
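We can verify these expected frequencies, and preview the chi-square statistics that the JAGS data block will compute below, with a few lines of R:

obs <- c(15, 9, 23)
# 1:1:1 ratio
e1 <- sum(obs) * c(1, 1, 1)/3
sum((obs - e1)^2/e1)  # approximately 6.30
# 1:1:2 ratio
e2 <- sum(obs) * c(1, 1, 2)/4
sum((obs - e2)^2/e2)  # approximately 1.55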
Model fitting or statistical analysis
Define the model
We now translate the likelihood model into BUGS/JAGS code and store the code in an external file.
> modelString="
+ data {
+   for (i in 1:n){
+     resid[i] <- pow(obs[i]-exp[i],2)/exp[i]
+   }
+   chisq <- sum(resid)
+ }
+ model {
+   #Likelihood
+   chisq ~ dchisqr(k)
+   #Priors
+   k ~ dunif(0.01,100)
+ }
+ "
> ## write the model to a text file (I suggest you alter the path to somewhere more relevant
> ## to your system!)
> writeLines(modelString,con="../downloads/BUGSscripts/tut11.2bQ2.1.txt")
- First of all, the standardized residuals and chi-square statistic are calculated according to the formula listed above.
- The likelihood model indicates that the chi-square statistic can be modeled by a $\chi^2$ distribution with $k$ degrees of freedom.
- The prior on $k$ is defined as a flat (and thus vague) uniform prior whose values could range from 0.01 to 100 (all with equal probability).
Define the data list
As input, JAGS will need to be supplied with:
- the observed data ($obs$)
- the expected frequencies ($exp$) (based on the hypothesis)
- the total number of observed items ($n$)
> #The observed item frequencies
> obs <- c(15, 9, 23)
> #The expected item frequencies (for a 1:1:1 ratio)
> exp <- rep(sum(obs) * 1/3, 3)
> data.list <- list(obs = obs, exp = exp, n = 3)
> data.list
$obs
[1] 15 9 23

$exp
[1] 15.67 15.67 15.67

$n
[1] 3
> #The expected item frequencies (for a 1:1:2 ratio)
> exp <- sum(obs) * c(1/4, 1/4, 2/4)
> data.list1 <- list(obs = obs, exp = exp, n = 3)
> data.list1
$obs
[1] 15 9 23

$exp
[1] 11.75 11.75 23.50

$n
[1] 3
Define the MCMC chain parameters
Next we should define the behavioural parameters of the MCMC sampling chains. Include the following:
- the nodes (estimated parameters) to monitor (return samples for)
- the number of MCMC chains (3)
- the number of burnin steps (1000)
- the thinning factor (10)
- the number of MCMC iterations - determined by the number of samples to save, the rate of thinning and the number of chains
> params <- c("chisq", "resid", "k") > nChains = 3 > burnInSteps = 1000 > thinSteps = 10 > numSavedSteps = 5000 > nIter = ceiling((numSavedSteps * thinSteps)/nChains)
Fit the model
Now run the JAGS code via the R2jags interface. Note that the first time jags is run after the R2jags package is loaded, it is often necessary to run any kind of randomization function just to initiate the .Random.seed variable.
> #Run any random function just to get .Random.seed started
> runif(1)
[1] 0.4075
> #Fit the model for the 1:1:1 ratio
> data.r2jags <- jags(data = data.list, inits = NULL, parameters.to.save = params,
+     model.file = "../downloads/BUGSscripts/tut11.2bQ2.1.txt", n.chains = 3,
+     n.iter = nIter, n.burnin = burnInSteps, n.thin = thinSteps)
Compiling data graph
   Resolving undeclared variables
   Allocating nodes
   Initializing
   Reading data back into data table
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph Size: 14

Initializing model
> #Fit the model for the 1:1:2 ratio
> data.r2jags1 <- jags(data = data.list1, inits = NULL, parameters.to.save = params,
+     model.file = "../downloads/BUGSscripts/tut11.2bQ2.1.txt", n.chains = 3,
+     n.iter = nIter, n.burnin = burnInSteps, n.thin = thinSteps)
Compiling data graph
   Resolving undeclared variables
   Allocating nodes
   Initializing
   Reading data back into data table
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph Size: 14

Initializing model
Model evaluation
Prior to exploring the model parameters, it is necessary to first evaluate whether the MCMC samples are likely to be representative of a stationary posterior distribution. Specifically, we should explore:
- whether the MCMC chains have converged - Trace plots
- whether the MCMC chains have yielded well mixed (not autocorrelated samples) - Raftery and autocorrelation diagnostics
- whether the MCMC samples are independent of the initial starting configuration (adequate burnin) - Trace plots
- what sort of point estimates (mean, median and mode) are appropriate for the parameters - Density plots
- Trace and density plots
For the Goodness of fit test, it is only the distribution of $k$ (degrees of freedom) that we are interested in.
> library(plyr)
> library(coda)
> preds <- c("k")
> plot(as.mcmc.list(alply(data.r2jags$BUGSoutput$sims.array[, , preds],
+     2, function(x) {
+         as.mcmc(x)
+     })))
> plot(as.mcmc(data.r2jags))
- Autocorrelation diagnostic
> preds <- c("k") > autocorr.diag(as.mcmc.list(alply(data.r2jags$BUGSoutput$sims.array[, + , preds], 2, function(x) { + as.mcmc(x) + })))
             [,1]
Lag 0   1.0000000
Lag 1  -0.0009681
Lag 5   0.0203399
Lag 10  0.0040986
Lag 50 -0.0203103
> autocorr.diag(as.mcmc(data.r2jags))
        chisq  deviance          k resid[1] resid[2] resid[3]
Lag 0     NaN  1.000000  1.0000000      NaN      NaN      NaN
Lag 10    NaN  0.005874 -0.0009681      NaN      NaN      NaN
Lag 50    NaN -0.013484  0.0203399      NaN      NaN      NaN
Lag 100   NaN -0.002441  0.0040986      NaN      NaN      NaN
Lag 500   NaN -0.002589 -0.0203103      NaN      NaN      NaN
- Raftery diagnostic
> raftery.diag(as.mcmc(data.r2jags))
[[1]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95

You need a sample size of at least 3746 with these values of q, r and s

[[2]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95

You need a sample size of at least 3746 with these values of q, r and s

[[3]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95

You need a sample size of at least 3746 with these values of q, r and s
Exploring the model parameters, test hypotheses
If there was any evidence that the assumptions had been violated, the model was not an appropriate fit or the MCMC chains had not converged or mixed thoroughly, then we would need to reconsider the model and start the process again. In this case, there is no evidence that the test will be unreliable so we can proceed to explore the test statistics.
> print(data.r2jags)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.2bQ2.1.txt", fit using jags,
 3 chains, each with 16667 iterations (first 1000 discarded), n.thin = 10
 n.sims = 4701 iterations saved
         mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
chisq      6.298   0.000  6.298  6.298  6.298  6.298  6.298 1.000     1
k          8.317   3.571  2.306  5.778  8.002 10.467 16.157 1.001  4700
resid[1]   0.028   0.000  0.028  0.028  0.028  0.028  0.028 1.000     1
resid[2]   2.837   0.000  2.837  2.837  2.837  2.837  2.837 1.000     1
resid[3]   3.433   0.000  3.433  3.433  3.433  3.433  3.433 1.000     1
deviance   5.320   1.436  4.346  4.443  4.772  5.625  9.302 1.002  2300

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 1.0 and DIC = 6.4
DIC is an estimate of expected predictive error (lower deviance is better).
> print(data.r2jags1)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.2bQ2.1.txt", fit using jags,
 3 chains, each with 16667 iterations (first 1000 discarded), n.thin = 10
 n.sims = 4701 iterations saved
         mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
chisq      1.553   0.000  1.553  1.553  1.553  1.553  1.553 1.000     1
k          3.309   1.861  0.566  1.938  3.012  4.385  7.610 1.001  4700
resid[1]   0.899   0.000  0.899  0.899  0.899  0.899  0.899 1.000     1
resid[2]   0.644   0.000  0.644  0.644  0.644  0.644  0.644 1.000     1
resid[3]   0.011   0.000  0.011  0.011  0.011  0.011  0.011 1.000     1
deviance   3.809   1.364  2.870  2.958  3.276  4.129  7.524 1.001  4700

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 0.9 and DIC = 4.7
DIC is an estimate of expected predictive error (lower deviance is better).
Further explorations of the trends
There are a number of avenues we could take in order to explore the data and models further. One thing we could do is calculate, for each hypothesis, the probability that $k$ is greater than 2 (the expected value of $k$ under the null hypothesis, since there are three groups and thus two degrees of freedom). This can be done either by modifying the BUGS/JAGS code to include a derived quantity that uses the step function, or by deriving it within R from the $k$ samples. Let's explore the latter.
> k <- data.r2jags$BUGSoutput$sims.matrix[, "k"]
> pr <- sum(k > 2)/length(k)
> pr
[1] 0.9821
> k <- data.r2jags1$BUGSoutput$sims.matrix[, "k"]
> pr1 <- sum(k > 2)/length(k)
> pr1
[1] 0.7362
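For reference, the first option (calculating the probability inside JAGS) might look something like the following sketch; the derived node P2 is hypothetical and would need to be added to the monitored parameters:

modelString2 <- "
data {
  for (i in 1:n){
    resid[i] <- pow(obs[i]-exp[i],2)/exp[i]
  }
  chisq <- sum(resid)
}
model {
  chisq ~ dchisqr(k)
  k ~ dunif(0.01,100)
  # step() returns 1 when its argument is >= 0, so the posterior
  # mean of P2 estimates the probability that k exceeds 2
  P2 <- step(k - 2)
}
"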
We could also compare the two alternative hypotheses. The 1:1:2 hypothesis has the lower DIC (4.7 vs 6.4) and is therefore considered a better fit. This is a difference in DIC of around 1.7 units. So the data have higher support for a 1:1:2 population ratio than for a 1:1:1 ratio.
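The DIC values quoted above can also be extracted directly from the fitted objects (a sketch based on the R2jags output structure):

# DIC for the 1:1:1 and 1:1:2 hypotheses respectively
data.r2jags$BUGSoutput$DIC
data.r2jags1$BUGSoutput$DIC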