Tutorial 11.5b - Poisson regression
02 April 2014
Overview
Whilst in many instances, count data can be approximated reasonably well by a normal distribution (particularly if the counts are all above zero and the mean count is greater than about 20), more typically, when count data are modelled via normal distribution certain undesirable characteristics arise that are a consequence of the nature of discrete non-negative data.
- Expected (predicted) values and confidence bands less than zero are illogical, yet these are entirely possible from a normal distribution
- The distribution of count data are often skewed when their mean is low (in part because the distribution is truncated to the left by zero) and variance usually increases with increasing mean (variance is typically proportional to mean in count data). By contrast, the Gaussian (normal) distribution assumes that mean and variance are unrelated and thus estimates (particularly of standard error) might well be reasonable inaccurate.
Poisson regression is a type of generalized linear model (GLM) in which a non-negative integer (natural number) response is modelled against a linear predictor via a specific link function. The linear predictor is typically a linear combination of effects parameters (e.g. $\beta_0 + \beta_1x_x$). The role of the link function is to transform the expected values of the response y (which is on the scale of (0,$\infty$), as is the Poisson distribution from which expectations are drawn) into the scale of the linear predictor (which is $-\infty,\infty$).
As implied in the name of this group of analyses, a Poisson rather than Gaussian (normal) distribution is used to represent the errors (residuals). Like count data (number of individuals, species etc), the Poisson distribution encapsulates positive integers and is bound by zero at one end. Consequently, the degree of variability is directly related the expected value (equivalent to the mean of a Gaussian distribution). Put differently, the variance is a function of the mean. Repeated observations from a Poisson distribution located close to zero will yield a much smaller spread of observations than will samples drawn from a Poisson distribution located a greater distance from zero. In the Poisson distribution, the variance has a 1:1 relationship with the mean.
The canonical link function for the Poisson distribution is a log-link function.
Whilst the expectation that the mean=variance ($\mu=\sigma$) is broadly compatible with actual count data (that variance increases at the same rate as the mean), under certain circumstances, this might not be the case. For example, when there are other unmeasured influences on the response variable, the distribution of counts might be somewhat clumped which can result in higher than expected variability (that is $\sigma\gt\mu$). The variance increases more rapidly than does the mean. This is referred to as overdispersion. The degree to which the variability is greater than the mean (and thus the expected degree of variability) is called dispersion. Effectively, the Poisson distribution has a dispersion parameter (or scaling factor) of 1.
It turns out that overdispersion is very common for count data and it typically underestimates variability, standard errors and thus deflated p-values. There are a number of ways of overcoming this limitation, the effectiveness of which depend on the causes of overdispersion.
- Quasi-Poisson models - these introduce the dispersion parameter ($\phi$) into the model.
This approach does not utilize an underlying error distribution to calculate the maximum likelihood (there is no quasi-Poisson distribution).
Instead, if the Newton-Ralphson iterative reweighting least squares algorithm is applied using a direct specification of the relationship between
mean and variance ($var(y)=\phi\mu$), the estimates of the regression coefficients are identical to those of the maximum
likelihood estimates from the Poisson model. This is analogous to fitting ordinary least squares on symmetrical, yet not normally distributed data -
the parameter estimates are the same, however they won't necessarily be as efficient.
The standard errors of the coefficients are then calculated by multiplying the Poisson model coefficient
standard errors by $\sqrt{\phi}$.
Unfortunately, because the quasi-poisson model is not estimated via maximum likelihood, properties such as AIC and log-likelihood cannot be derived. Consequently, quasi-poisson and Poisson model fits cannot be compared via either AIC or likelihood ratio tests (nor can they be compared via deviance as uasi-poisson and Poisson models have the same residual deviance). That said, quasi-likelihood can be obtained by dividing the likelihood from the Poisson model by the dispersion (scale) factor.
- Negative binomial model - technically, the negative binomial distribution is a probability distribution for the number of successes before a specified number of failures.
However, the negative binomial can also be defined (parameterized) in terms of a mean ($\mu$) and scale factor ($\omega$),
$$p(y_i)=\frac{\Gamma(y_i+\omega)}{\Gamma(\omega)y!}\times\frac{\mu_i^{y_i}\omega^\omega}{(\mu_i+\omega)^{\mu_i+\omega}}$$
where the expectected value of the values $y_i$ (the means) are ($\mu_i$) and the variance is $y_i=\frac{\mu_i+\mu_i^2}{\omega}$.
In this way, the negative binomial is a two-stage hierarchical process in which the response is modeled against a Poisson distribution whose expected count is in turn
modeled by a Gamma distribution with a mean of $\mu$ and constant scale parameter ($\omega$).
Strictly, the negative binomial is not an exponential family distribution (unless $\omega$ is fixed as a constant), and thus negative binomial models cannot be fit via the usual GLM iterative reweighting algorithm. Instead estimates of the regression parameters along with the scale factor ($\omega$) are obtained via maximum likelihood.
The negative binomial model is useful for accommodating overdispersal when it is likely caused by clumping (due to the influence of other unmeasured factors) within the response.
- Zero-inflated Poisson model - overdispersion can also be caused by the presence of a greater number of zero's than would otherwise be expected for a Poisson distribution.
There are potentially two sources of zero counts - genuine zeros and false zeros.
Firstly, there may genuinely be no individuals present. This would be the number expected by a Poisson distribution.
Secondly, individuals may have been present yet not detected or may not even been possible.
These are false zero's and lead to zero inflated data (data with more zeros than expected).
For example, the number of joeys accompanying an adult koala could be zero because the koala has no offspring (true zero) or because the koala is male or infertile (both of which would be examples of false zeros). Similarly, zero counts of the number of individual in a transect are due either to the absence of individuals or the inability of the observer to detect them. Whilst in the former example, the latent variable representing false zeros (sex or infertility) can be identified and those individuals removed prior to analysis, this is not the case for the latter example. That is, we cannot easily partition which counts of zero are due to detection issues and which are a true indication of the natural state.
Consistent with these two sources of zeros, zero-inflated models combine a binary logistic regression model (that models count membership according to a latent variable representing observations that can only be zeros - not detectable or male koalas) with a Poisson regression (that models count membership according to a latent variable representing observations whose values could be 0 or any positive integer - fertile female koalas).
Summary of important equations
Many ecologists are repulsed or frightened by statistical formulae. In part this is due to the use of Greek letters to represent concepts, constants and functions with the assumption that the readers are familiar with their meaning. Hence, the issue is largely that many ecologists are familiar with certain statistical concepts, yet are not overly familiar with the notation used to represent those concepts.
As a typed language, R exposes users to a rich variety of statistical opportunities. Whilst many of the common procedures are wrapped into simple to use functions (and so all of the calculations etc underlying a GLM need not be performed by hand by the user), not all procedures have been packaged into functions. For such cases, it is useful to have the formulas handy.
Hence, in this section I am going to summarize and compare equations for common procedures associated with Poisson and Negative Binomial models. Feel free to skip this section until it is useful to you.
Poisson | Negative Binomial (p,r) | |
---|---|---|
Density function | $f(y|\lambda)=\frac{\lambda^{y}e^{-\lambda}}{y!}$ | $f(y|r,p)=\frac{\Gamma(y+r)}{\Gamma(r)\Gamma(y+1)}p^r(1-p)^y$ |
Expected value | $E(Y)=\mu=\lambda$ | $E(Y)=\mu=\frac{r(1-p)}{p}$ |
Variance | $var(Y)=\lambda$ | $var(Y)=\frac{r(1-p)}{p^2}$ |
log-likelihood | $\mathcal{LL}(\lambda\mid y) = \sum\limits^n_{i=1}y_i log(\lambda_i)-\lambda_i - log\Gamma(y_i+1)$ | $\begin{align}\mathcal{LL}(r,p\mid y) = \sum\limits^n_{i=1}&log\Gamma(y_1 + r) - log\Gamma(r) - log\Gamma(y_i + 1) + \\&r.log(p) + y_i log(1-p)\end{align}$ |
Negative Binomial (a,b) | |
---|---|
Density function | $f(y \mid a,b) = \frac{b^{a}y^{b-1}e^{-b y}}{\Gamma(a)}$ |
Expected value | $E(Y)=\mu=\frac{r}{p}$ |
Variance | |
log-likelihood | $\mathcal{LL}(\lambda;y)=\sum\limits^n_{i=1} log\Gamma(y_i + a) - log\Gamma(y_i+1) - log\Gamma(a) + \alpha(log(b_i) - log(b_i+1)) - y_i.log(b_i+1)$ |
Zero-inflated Poisson | |
---|---|
Density function | $p(y \mid \theta,\lambda) = \left\{ \begin{array}{l l} \theta + (1-\theta)\times \text{Pois}(0|\lambda) & \quad \text{if $y_i=0$ and}\\ (1-\theta)\times \text{Pois}(y_i|\lambda) & \quad \text{if $y_i>0$} \end{array} \right.$ |
Expected value | $E(Y)=\mu=\lambda\times(1-theta)$ |
Variance | $var(Y)=\lambda\times(1-\theta)\times(1+\theta\times\lambda^2)$ |
log-likelihood | $\mathcal{LL}(\theta,\lambda\mid y) = \left\{ \begin{array}{l l} \sum\limits^n_{i=1} (1-\theta)\times log(\theta + \lambda_i)\\ \sum\limits^n_{i=1}y_i log(1-\theta)\times y_i\lambda_i - exp(\lambda_i)\\ \end{array} \right. $ |
Procedure | Equation |
---|---|
Residuals | $\varepsilon_i = y_i - \hat{y}$ |
Pearson's Residuals | $\frac{y_i - \hat{y}}{\sqrt{var(y)}}$ |
Residual Sums of Squares | $RSS = \sum \varepsilon_i^2$ |
Dispersion statistic | $\phi=\frac{RSS}{df}$ |
$R^2$ | $R^2 = 1-\frac{RSS_{model}}{RSS_{null}}$ |
McFadden's $R^2$ | $R_{McF}^2 = 1 - \frac{\mathcal{LL}_{model}}{\mathcal{LL}_{null}}$ |
Deviance | $dev = -2*\mathcal{LL}$ |
pD | $pD = $ |
DIC | $DIC = -pD$ |
AIC | $AIC = min(dev+(2pD))$ |
Poisson regression
The following equations are provided since in Bayesian modelling, it is occasionally necessary to directly define the log-likelihood calculations (particularly for zero-inflated models and other mixture models). Feel free initially gloss over these equations until such time when your models require them ;)
Density function: $$ f(y\mid\lambda)=\frac{\lambda^{y}e^{-\lambda}}{y!}\\ E(Y)=Var(Y)=\lambda\\ $$ where $\lambda$ is the mean.Likelihood: $$ \begin{align} \mathcal{L}(\lambda\mid y) &= \prod\limits_{i=1}^n \dfrac{\lambda^{y_i}e^{-\lambda}}{y_i!}\\ %&= \dfrac{\lambda^{\sum\limits^n_{i=1}y_i} e^{-n\lambda}}{y_1!y_2! \cdots y_n!}\\ \end{align} $$ Log-likelihood: $$ \begin{array}{rcl} \mathcal{LL}(\lambda\mid y)&=&\sum\limits^n_{i=1}log(\lambda^{y_i}e^{-\lambda_i})-log(y_i!)\\ \mathcal{LL}(\lambda\mid y)&=&\sum\limits^n_{i=1}log(\lambda^{y_i})+log(e^{-\lambda_i})-log(y_i!)\\ \mathcal{LL}(\lambda\mid y)&=&\sum\limits^n_{i=1}y_i log(\lambda_i)-\lambda_i - log(y_i!) \end{array} $$
Scenario and Data
Lets say we wanted to model the abundance of an item ($y$) against a continuous predictor ($x$). As this section is mainly about the generation of artificial data (and not specifically about what to do with the data), understanding the actual details are optional and can be safely skipped. Consequently, I have folded (toggled) this section away.
- the sample size = 20
- the continuous $x$ variable is a random uniform spread of measurements between 1 and 20
- the rate of change in log $y$ per unit of $x$ (slope) = 0.1.
- the value of $x$ when log$y$ equals 0 (when $y$=1)
- to generate the values of $y$ expected at each $x$ value, we evaluate the linear predictor (created by calculating the outer product of the model matrix and the regression parameters). These expected values are then transformed into a scale mapped by (0,$\infty$) by using the log function $e^{linear~predictor}$
- finally, we generate $y$ values by using the expected $y$ values ($\lambda$) as probabilities when drawing random numbers from a Poisson distribution. This step adds random noise to the expected $y$ values and returns only 0's and positive integers.
> set.seed(8) > # The number of samples > n.x <- 20 > # Create x values that at uniformly distributed > # throughout the rate of 1 to 20 > x <- sort(runif(n = n.x, min = 1, max = 20)) > mm <- model.matrix(~x) > intercept <- 0.6 > slope = 0.1 > # The linear predictor > linpred <- mm %*% c(intercept, slope) > # Predicted y values > lambda <- exp(linpred) > # Add some noise and make binomial > y <- rpois(n = n.x, lambda = lambda) > dat <- data.frame(y, x)
With these sort of data, we are primarily interested in investigating whether there is a relationship between the positive integer response variable and the linear predictor (linear combination of one or more continuous or categorical predictors).
Exploratory data analysis and initial assumption checking
- All of the observations are independent - this must be addressed at the design and collection stages
- The response variable (and thus the residuals) should be matched by an appropriate distribution (in the case of positive integers response - a Poisson is appropriate).
- All observations are equally influential in determining the trends - or at least no observations are overly influential. This is most effectively diagnosed via residuals and other influence indices and is very difficult to diagnose prior to analysis
- the relationship between the linear predictor (right hand side of the regression formula) and the link function should be linear. A scatterplot with smoother can be useful for identifying possible non-linearity.
- The dispersion factor is close to 1
There are at least five main potential models we could consider fitting to these data:
- Ordinary least squares regression (general linear model) - assumes normality of residuals
- Poisson regression - assumes mean=variance (dispersion=1)
- Quasi-poisson regression - a general solution to overdispersion. Assumes variance is a function of mean, dispersion estimated, however likelihood based statistics unavailable
- Negative binomial regression - a specific solution to overdispersion caused by clumping (due to an unmeasured latent variable). Scaling factor ($\omega$) is estimated along with the regression parameters.
- Zero-inflation model - a specific solution to overdispersion caused by excessive zeros (due to an unmeasured latent variable). Mixture of binomial and Poisson models.
Confirm non-normality and explore clumping
When counts are all very large (not close to 0) and their ranges do not span orders of magnitude, they take on very Gaussian properties (symmetrical distribution and variance independent of the mean). Given that models based on the Gaussian distribution are more optimized and recognized than Generalized Linear Models, it can be prudent to adopt Gaussian models for such data. Hence it is a good idea to first explore whether a Poisson model is likely to be more appropriate than a standard Gaussian model.
The potential for overdispersion can be explored by adding a rug to boxplot. The rug is simply tick marks on the inside of an axis at the position corresponding to an observation. As multiple identical values result in tick marks drawn over one another, it is typically a good idea to apply a slight amount of jitter (random displacement) to the values used by the rug.
> hist(dat$y)
> boxplot(dat$y, horizontal = TRUE) > rug(jitter(dat$y), side = 1)
There is definitely signs of non-normality that would warrant Poisson models. The rug applied to the boxplots does not indicate a series degree of clumping and there appears to be few zero. Thus overdispersion is unlikely to be an issue.
Confirm linearity
Lets now explore linearity by creating a histogram of the predictor variable ($x$) and a scatterplot of the relationship between the response ($y$) and the predictor ($x$)
> hist(dat$x)
> # now for the scatterplot > plot(y ~ x, dat, log = "y") > with(dat, lines(lowess(y ~ x)))
Conclusions: the predictor ($x$) does not display any skewness or other issues that might lead to non-linearity. The lowess smoother on the scatterplot does not display major deviations from a straight line and thus linearity is satisfied. Violations of linearity could be addressed by either:
- define a non-linear linear predictor (such as a polynomial, spline or other non-linear function)
- transform the scale of the predictor variables
Explore zero inflation
Although we have already established that there are few zeros in the data (and thus overdispersion is unlikely to be an issue), we can also explore this by comparing the number of zeros in the data to the number of zeros that would be expected from a Poisson distribution with a mean equal to the mean count of the data.
> # proportion of 0's in the data > dat.tab <- table(dat$y == 0) > dat.tab/sum(dat.tab)
FALSE 1
> # proportion of 0's expected from a Poisson > # distribution > mu <- mean(dat$y) > cnts <- rpois(1000, mu) > dat.tab <- table(cnts == 0) > dat.tab/sum(dat.tab)
FALSE TRUE 0.997 0.003
0.003
).
Model fitting or statistical analysis
JAGS
$$ \begin{align} Y_i&\sim{}P(\lambda) & (\text{response distribution})\\ log(\lambda_i)&=\eta_i & (\text{link function})\\ \eta_i&=\beta_0+\beta_1 X_i & (\text{linear predictor})\\ \beta_0, \beta_1&\sim{}\mathcal{N}(0,10000) & (\text{diffuse Bayesian prior})\\ \end{align} $$> dat.list <- with(dat,list(Y=y, X=x,N=nrow(dat))) > modelString=" model { for (i in 1:N) { Y[i] ~ dpois(lambda[i]) log(lambda[i]) <- beta0 + beta1*X[i] } beta0 ~ dnorm(0,1.0E-06) beta1 ~ dnorm(0,1.0E-06) } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS1.4.bug') > library(R2jags) > system.time( + dat.P.jags <- jags(model='../downloads/BUGSscripts/tut11.5bS1.4.bug', data=dat.list, inits=NULL, + param=c('beta0','beta1'), + n.chain=3, + n.iter=20000, n.thin=10, n.burnin=10000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 105 Initializing model
user system elapsed 2.376 0.004 2.397
> Xmat <- model.matrix(~x, dat) > nX <- ncol(Xmat) > dat.list1 <- with(dat,list(Y=y, X=Xmat,N=nrow(dat), nX=nX)) > modelString=" model { for (i in 1:N) { Y[i] ~ dpois(lambda[i]) eta[i] <- inprod(beta[], X[i,]) log(lambda[i]) <- eta[i] } for (i in 1:nX) { beta[i] ~ dnorm(0,1.0E-06) } } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS1.41.bug') > library(R2jags) > system.time( + dat.P.jags1 <- jags(model='../downloads/BUGSscripts/tut11.5bS1.41.bug', data=dat.list1, inits=NULL, + param=c('beta'), + n.chain=3, + n.iter=20000, n.thin=10, n.burnin=10000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 127 Initializing model
user system elapsed 1.816 0.004 1.820
> print(dat.P.jags1)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS1.41.bug", fit using jags, 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10 n.sims = 3000 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta[1] 0.545 0.252 0.032 0.378 0.549 0.728 1.014 1.001 3000 beta[2] 0.112 0.018 0.077 0.099 0.112 0.125 0.149 1.001 3000 deviance 88.413 4.782 86.383 86.897 87.703 89.034 93.484 1.001 3000 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 11.4 and DIC = 99.9 DIC is an estimate of expected predictive error (lower deviance is better).
Or arguably better still, use a multivariate normal prior. If we have a $k$ regression parameters ($\beta_k$), then the multivariate normal priors are defined as: $$ \boldsymbol{\beta}\sim{}\mathcal{N_k}(\boldsymbol{\mu}, \mathbf{\Sigma}) $$ where $$\boldsymbol{\mu}=[E[\beta_1],E[\beta_2],...,E[\beta_k]] = \left(\begin{array}{c}0\\\vdots\\0\end{array}\right)$$ $$ \mathbf{\Sigma}=[Cov[\beta_i, \beta_j]] = \left(\begin{array}{ccc}1000^2&0&0\\0&\ddots&0\\0&0&1000^2\end{array} \right) $$ hence, along with the response and predictor matrix, we need to supply $\boldsymbol{\mu}$ (a vector of zeros) and $\boldsymbol{\Sigma}$ (a covariance matrix with $1000^2$ in the diagonals).
> Xmat <- model.matrix(~x, dat) > nX <- ncol(Xmat) > dat.list2 <- with(dat,list(Y=y, X=Xmat,N=nrow(dat), mu=rep(0,nX),Sigma=diag(1.0E-06,nX))) > modelString=" model { for (i in 1:N) { Y[i] ~ dpois(lambda[i]) eta[i] <- inprod(beta[], X[i,]) log(lambda[i]) <- eta[i] } beta ~ dmnorm(mu[],Sigma[,]) } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS1.42.bug') > library(R2jags) > system.time( + dat.P.jags2 <- jags(model='../downloads/BUGSscripts/tut11.5bS1.42.bug', data=dat.list2, inits=NULL, + param=c('beta'), + n.chain=3, + n.iter=20000, n.thin=10, n.burnin=10000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 130 Initializing model
user system elapsed 0.616 0.000 0.629
> print(dat.P.jags2)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS1.42.bug", fit using jags, 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10 n.sims = 3000 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta[1] 0.560 0.250 0.066 0.386 0.569 0.734 1.027 1.001 3000 beta[2] 0.111 0.018 0.077 0.098 0.111 0.123 0.147 1.001 2900 deviance 88.315 2.004 86.378 86.907 87.706 89.027 93.746 1.002 1400 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 2.0 and DIC = 90.3 DIC is an estimate of expected predictive error (lower deviance is better).
Chain mixing and Model validation
Prior to exploring the model parameters, it is prudent to confirm that the model did indeed fit the assumptions and was an appropriate fit to the data as well as that the MCMC sampling chain was adequately mixed and the retained samples independent.
Whilst I will only demonstrate this for the logit model, the procedure would be identical for exploring the probit and clog-log models.
- We will start by exploring the mixing of the MCMC chains via traceplots.
> plot(as.mcmc(dat.P.jags))
The chains appear well mixed and stable
- Next we will explore correlation amongst MCMC samples.
> autocorr.diag(as.mcmc(dat.P.jags))
beta0 beta1 deviance Lag 0 1.000000 1.000000 1.000000 Lag 10 0.185239 0.149437 0.018579 Lag 50 -0.007436 -0.007997 -0.007498 Lag 100 -0.005343 -0.016451 0.009315 Lag 500 -0.023993 -0.018307 0.005840
The level of auto-correlation at the nominated lag of 10 is higher than we would generally like. It is worth increasing the thinning rate from 10 to 50. Obviously, to support this higher thinning rate, we would also increase the number of iterations.
> library(R2jags) > dat.P.jags <- jags(data=dat.list,model.file='../downloads/BUGSscripts/tut11.5bS1.4.bug', + param=c('beta0','beta1'), + n.chains=3, n.iter=100000, n.burnin=20000, n.thin=50)
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 105 Initializing model
> print(dat.P.jags)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS1.4.bug", fit using jags, 3 chains, each with 1e+05 iterations (first 20000 discarded), n.thin = 50 n.sims = 4800 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta0 0.544 0.258 0.018 0.378 0.546 0.722 1.028 1.001 4600 beta1 0.112 0.019 0.076 0.099 0.112 0.124 0.150 1.001 4800 deviance 88.424 3.821 86.371 86.893 87.687 89.125 93.921 1.002 4800 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 7.3 and DIC = 95.7 DIC is an estimate of expected predictive error (lower deviance is better).
> plot(as.mcmc(dat.P.jags))
> autocorr.diag(as.mcmc(dat.P.jags))
beta0 beta1 deviance Lag 0 1.000000 1.000000 1.000000 Lag 50 -0.012740 -0.007513 0.007106 Lag 250 0.007954 0.013158 -0.006200 Lag 500 -0.001340 -0.007863 -0.007557 Lag 2500 0.010148 0.005154 -0.004615
Conclusions: the samples are now less auto-correlated and the chains are also arguably mixed a little better.
- One very important model validation procedure is to examine a plot of residuals against predicted or fitted values (the residual plot).
Ideally, residual plots should show a random scatter of points without outliers.
That is, there should be no patterns in the residuals. Patterns suggest inappropriate linear predictor (or scale) and/or inappropriate
residual distribution/link function.
The residuals used in such plots should be standardized (particularly if the model incorporated any variance-covariance structures - such as an autoregressive correlation structure)
Pearsons's residuals standardize residuals by division with the square-root of the variance.
We can generate Pearson's residuals within the JAGS model.
Alternatively, we could use the parameters to generate the residuals outside of JAGS.
Pearson's residuals are calculated according to:
$$
\varepsilon = \frac{y_i - \mu}{\sqrt{var(y)}}
$$
where $\mu$ is the expected value of $Y$ ($=\lambda$ for Poisson) and $var(y)$ is the variance of $Y$ ($=\lambda$ for Poisson).
> #extract the samples for the two model parameters > coefs <- dat.P.jags$BUGSoutput$sims.matrix[,1:2] > Xmat <- model.matrix(~x, data=dat) > #expected values on a log scale > eta<-coefs %*% t(Xmat) > #expected value on response scale > lambda <- exp(eta) > #Expected value and variance are both equal to lambda > expY <- varY <- lambda > #sweep across rows and then divide by lambda > Resid <- -1*sweep(expY,2,dat$y,'-')/sqrt(varY) > #plot residuals vs expected values > plot(apply(Resid,2,mean)~apply(eta,2,mean))
There is one residual that is substantially larger in magnitude than all the others. However, there are no other patterns in the residuals.
- Now we will compare the sum of squared residuals to the sum of squares residuals that would be expected from a Poisson distribution
matching that estimated by the model. Essentially this is estimating how well the Poisson distribution, the log-link function and the linear model
approximates the observed data.
> SSres<-apply(Resid^2,1,sum) > > set.seed(2) > #generate a matrix of draws from a binomial distribution > # the matrix is the same dimensions as pi and uses the probabilities of pi > YNew <- matrix(rpois(length(lambda),lambda=lambda),nrow=nrow(lambda)) > > Resid1<-(lambda - YNew)/sqrt(lambda) > SSres.sim<-apply(Resid1^2,1,sum) > mean(SSres.sim>SSres)
[1] 0.505
> dat.list1 <- with(dat,list(Y=y, X=x,N=nrow(dat))) > modelString=" model { for (i in 1:N) { #likelihood function Y[i] ~ dpois(lambda[i]) eta[i] <- beta0+beta1*X[i] #linear predictor log(lambda[i]) <- eta[i] #link function #E(Y) and var(Y) expY[i] <- lambda[i] varY[i] <- lambda[i] # Calculate RSS Resid[i] <- (Y[i] - expY[i])/sqrt(varY[i]) RSS[i] <- pow(Resid[i],2) #Simulate data from a Poisson distribution Y1[i] ~ dpois(lambda[i]) #Calculate RSS for simulated data Resid1[i] <- (Y1[i] - expY[i])/sqrt(varY[i]) RSS1[i] <-pow(Resid1[i],2) } #Priors beta0 ~ dnorm(0,1.0E-06) beta1 ~ dnorm(0,1.0E-06) #Bayesian P-value Pvalue <- mean(sum(RSS1)>sum(RSS)) } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS6.41.bug') > library(R2jags) > system.time( + dat.P.jags1 <- jags(model='../downloads/BUGSscripts/tut11.5bS6.41.bug', data=dat.list1, inits=NULL, + param=c('beta0','beta1','Pvalue'), + n.chain=3, + n.iter=100000, n.thin=50, n.burnin=20000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 272 Initializing model
user system elapsed 16.26 0.00 16.29
> print(dat.P.jags1)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS6.41.bug", fit using jags, 3 chains, each with 1e+05 iterations (first 20000 discarded), n.thin = 50 n.sims = 4800 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff Pvalue 0.471 0.499 0.000 0.000 0.000 1.000 1.000 1.002 1700 beta0 0.548 0.257 0.029 0.380 0.551 0.724 1.037 1.001 4800 beta1 0.112 0.019 0.075 0.099 0.112 0.124 0.149 1.002 4800 deviance 88.429 3.645 86.376 86.905 87.724 89.105 93.923 1.002 3500 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 6.6 and DIC = 95.1 DIC is an estimate of expected predictive error (lower deviance is better).
Conclusions: the Bayesian p-value is approximately 0.5, suggesting that there is a good fit of the model to the data.
-
Recall that the Poisson regression model assumes that variance=mean ($var=\mu\phi$ where $\phi=1$) and thus dispersion ($\phi=\frac{var}{\mu}=1$).
However, we can also calculate approximately what the dispersion factor would be by using sum square of the residuals as a measure of variance and the model
residual degrees of freedom as a measure of the mean (since the expected value of a Poisson distribution is the same as its degrees of freedom).
$$\phi=\frac{RSS}{df}$$ where $df=n-k$ and $k$ is the number of estimated model coefficients.
> Resid <- -1*sweep(lambda,2,dat$y,'-')/sqrt(lambda) > RSS<-apply(Resid^2,1,sum) > (df<-nrow(dat)-ncol(coefs))
[1] 18
> Disp <- RSS/df > data.frame(Median=median(Disp), Mean=mean(Disp), HPDinterval(as.mcmc(Disp)), HPDinterval(as.mcmc(Disp),p=0.5))
Median Mean lower upper lower.1 upper.1 var1 1.042 1.114 0.93 1.447 0.9301 1.042
> dat.list <- with(dat,list(Y=y, X=x,N=nrow(dat))) > modelString=" model { for (i in 1:N) { Y[i] ~ dpois(lambda[i]) eta[i] <- beta0 + beta1*X[i] log(lambda[i]) <- eta[i] expY[i] <- lambda[i] varY[i] <- lambda[i] Resid[i] <- (Y[i] - expY[i])/sqrt(varY[i]) } beta0 ~ dnorm(0,1.0E-06) beta1 ~ dnorm(0,1.0E-06) RSS <- sum(pow(Resid,2)) df <- N-2 phi <- RSS/df } " > writeLines(modelString,con='tut11.5bS1.40.bug') > library(R2jags) > system.time( + dat.P.jags <- jags(model='tut11.5bS1.40.bug', data=dat.list, inits=NULL, + param=c('beta0','beta1','phi'), + n.chain=3, + n.iter=100000, n.thin=50, n.burnin=20000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 171 Initializing model
user system elapsed 18.177 0.008 18.259
> print(dat.P.jags)
Inference for Bugs model at "tut11.5bS1.40.bug", fit using jags, 3 chains, each with 1e+05 iterations (first 20000 discarded), n.thin = 50 n.sims = 4800 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta0 0.548 0.255 0.031 0.377 0.553 0.727 1.043 1.001 4800 beta1 0.112 0.019 0.076 0.099 0.112 0.124 0.149 1.001 4800 phi 1.109 0.413 0.934 0.979 1.048 1.164 1.550 1.001 4400 deviance 88.389 3.915 86.377 86.909 87.689 89.084 93.530 1.001 4800 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 7.7 and DIC = 96.1 DIC is an estimate of expected predictive error (lower deviance is better).
The dispersion statistic $\phi$ is close to 1 and thus there is no evidence that the data were overdispersed. The Poisson distribution was therefore appropriate.
Exploring the model parameters, test hypotheses
If there was any evidence that the assumptions had been violated or the model was not an appropriate fit, then we would need to reconsider the model and start the process again. In this case, there is no evidence that the test will be unreliable so we can proceed to explore the test statistics.
> print(dat.P.jags)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS1.4.bug", fit using jags, 3 chains, each with 50000 iterations (first 10000 discarded), n.thin = 50 n.sims = 2400 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta0 0.554 0.254 0.034 0.388 0.564 0.726 1.037 1.003 780 beta1 0.111 0.018 0.075 0.099 0.111 0.123 0.148 1.003 930 deviance 88.471 5.024 86.361 86.857 87.698 89.021 93.951 1.001 2400 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 12.6 and DIC = 101.1 DIC is an estimate of expected predictive error (lower deviance is better).
> > library(plyr) > adply(dat.P.jags$BUGSoutput$sims.matrix[,1:2], 2, function(x) { + data.frame(Median=median(x), Mean=mean(x), HPDinterval(as.mcmc(x)), HPDinterval(as.mcmc(x),p=0.5)) + })
X1 Median Mean lower upper lower.1 upper.1 1 beta0 0.5638 0.5537 0.02371 1.0238 0.4412 0.7749 2 beta1 0.1114 0.1114 0.07362 0.1473 0.0980 0.1218
Actually, many find it more palatable to express the estimates in the original scale of the observations rather than on a log scale.
> library(plyr) > adply(exp(dat.P.jags$BUGSoutput$sims.matrix[,1:2]), 2, function(x) { + data.frame(Median=median(x), Mean=mean(x), HPDinterval(as.mcmc(x)), HPDinterval(as.mcmc(x),p=0.5)) + })
X1 Median Mean lower upper lower.1 upper.1 1 beta0 1.757 1.796 0.9751 2.688 1.403 1.976 2 beta1 1.118 1.118 1.0764 1.159 1.102 1.128
Conclusions:
We would reject the null hypothesis of no effect of $x$ on $y$.
An increase in x is associated with a significant linear increase (positive slope) in the abundance of $y$.
Every 1 unit increase in $x$ results in a log 0.1117
unit increase in $y$.
We usually express this in terms of abundance rather than log abundance, so
every 1 unit increase in $x$ results in a ($e^{0.1117
}=1.1182
$)
1.1182
unit increase in the abundance of $y$.
Further explorations of the trends
A measure of the strength of the relationship can be obtained according to: $$R^2 = 1 - \frac{RSS_{model}}{RSS_{null}}$$
> Xmat <- model.matrix(~x, data=dat) > #expected values on a log scale > eta<-coefs %*% t(Xmat) > #expected value on response scale > lambda <- exp(eta) > #calculate the raw SS residuals > SSres <- apply((-1*(sweep(lambda,2,dat$y,'-')))^2,1,sum) > SSres.null <- sum((dat$y - mean(dat$y))^2) > #OR > SSres.null <- crossprod(dat$y - mean(dat$y)) > #calculate the model r2 > 1-mean(SSres)/SSres.null
[,1] [1,] 0.6555
Conclusions: 65.55
% of the variation in $y$ abundance can be explained by its relationship with $x$.
> dat.list <- with(dat,list(Y=y, X=x,N=nrow(dat))) > modelString=" model { for (i in 1:N) { Y[i] ~ dpois(lambda[i]) eta[i] <- beta0 + beta1*X[i] log(lambda[i]) <- eta[i] res[i] <- Y[i] - lambda[i] resnull[i] <- Y[i] - meanY } meanY <- mean(Y) beta0 ~ dnorm(0,1.0E-06) beta1 ~ dnorm(0,1.0E-06) RSS <- sum(res^2) RSSnull <- sum(resnull^2) r2 <- 1-RSS/RSSnull } " > writeLines(modelString,con='tut11.5bS1.40.bug') > library(R2jags) > system.time( + dat.P.jags <- jags(model='tut11.5bS1.40.bug', data=dat.list, inits=NULL, + param=c('beta0','beta1','r2'), + n.chain=3, + n.iter=100000, n.thin=50, n.burnin=20000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 150 Initializing model
user system elapsed 15.413 0.004 15.458
> print(dat.P.jags)
Inference for Bugs model at "tut11.5bS1.40.bug", fit using jags, 3 chains, each with 1e+05 iterations (first 20000 discarded), n.thin = 50 n.sims = 4800 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta0 0.548 0.255 0.045 0.374 0.553 0.724 1.043 1.001 4800 beta1 0.112 0.019 0.074 0.099 0.112 0.124 0.148 1.001 4800 r2 0.655 0.065 0.512 0.641 0.673 0.691 0.701 1.003 4800 deviance 88.436 4.089 86.365 86.904 87.726 89.139 93.819 1.005 4800 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 8.4 and DIC = 96.8 DIC is an estimate of expected predictive error (lower deviance is better).
Finally, we will create a summary plot.
> par(mar = c(4, 5, 0, 0)) > plot(y ~ x, data = dat, type = "n", ann = F, axes = F) > points(y ~ x, data = dat, pch = 16) > xs <- seq(min(dat$x, na.rm = TRUE), max(dat$x, na.rm = TRUE), + l = 1000) > Xmat <- model.matrix(~xs) > eta <- coefs %*% t(Xmat) > ys <- exp(eta) > library(plyr) > library(coda) > data.tab <- adply(ys, 2, function(x) { + data.frame(Median = median(x), HPDinterval(as.mcmc(x))) + }) > data.tab <- cbind(x = xs, data.tab) > points(Median ~ x, data = data.tab, col = "black", type = "l") > lines(lower ~ x, data = data.tab, col = "black", type = "l", + lty = 2) > lines(upper ~ x, data = data.tab, col = "black", type = "l", + lty = 2) > > axis(1) > mtext("X", 1, cex = 1.5, line = 3) > axis(2, las = 2) > mtext("Abundance of Y", 2, cex = 1.5, line = 3) > box(bty = "l")
Defining full log-likelihood function
Now lets try it by specifying log-likelihood and the zero trick. When applying this trick, we need to manually calculate the deviance as the inbuilt deviance will be based on the log-likelihood of estimating the zeros (as part of the zero trick) rather than the deviance of the intended model..
The one advantage of the zero trick is that the Deviance and thus DIC, AIC provided by R2jags will be incorrect. Hence, they too need to be manually defined within jags. I suspect that the AIC calculation I have used is incorrect...
> Xmat <- model.matrix(~x, dat) > nX <- ncol(Xmat) > dat.list2 <- with(dat,list(Y=y, X=Xmat,N=nrow(dat), mu=rep(0,nX), + Sigma=diag(1.0E-06,nX), zeros=rep(0,nrow(dat)), C=10000)) > modelString=" model { for (i in 1:N) { zeros[i] ~ dpois(zeros.lambda[i]) zeros.lambda[i] <- -ll[i] + C ll[i] <- Y[i]*log(lambda[i]) - lambda[i] - loggam(Y[i]+1) eta[i] <- inprod(beta[], X[i,]) log(lambda[i]) <- eta[i] llm[i] <- Y[i]*log(meanlambda) - meanlambda - loggam(Y[i]+1) } meanlambda <- mean(lambda) beta ~ dmnorm(mu[],Sigma[,]) dev <- sum(-2*ll) pD <- mean(dev)-sum(-2*llm) AIC <- min(dev+(2*pD)) } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS1.42.bug') > library(R2jags) > system.time( + dat.P.jags3 <- jags(model='../downloads/BUGSscripts/tut11.5bS1.42.bug', data=dat.list2, inits=NULL, + param=c('beta','dev','AIC'), + n.chain=3, + n.iter=50000, n.thin=50, n.burnin=10000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 353 Initializing model
user system elapsed 1.896 0.004 1.900
> print(dat.P.jags3)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS1.42.bug", fit using jags, 3 chains, each with 50000 iterations (first 10000 discarded), n.thin = 50 n.sims = 2400 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff AIC 1.358e+01 4.347 9.638e+00 1.063e+01 1.215e+01 1.502e+01 2.587e+01 1.000 2400 beta[1] 5.310e-01 0.257 2.400e-02 3.550e-01 5.350e-01 7.060e-01 1.017e+00 1.002 1500 beta[2] 1.130e-01 0.019 7.600e-02 1.000e-01 1.130e-01 1.260e-01 1.480e-01 1.002 1500 dev 8.834e+01 1.966 8.638e+01 8.692e+01 8.775e+01 8.912e+01 9.367e+01 1.001 2400 deviance 4.001e+05 1.966 4.001e+05 4.001e+05 4.001e+05 4.001e+05 4.001e+05 1.000 1 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 1.9 and DIC = 400090.3 DIC is an estimate of expected predictive error (lower deviance is better).
Negative binomial
The following equations are provided since in Bayesian modelling, it is occasionally necessary to directly define the log-likelihood calculations (particularly for zero-inflated models and other mixture models). Feel free initially gloss over these equations until such time when your models require them ;)
prob, size | alpha, beta (gamma-poisson) |
---|---|
$$ f(y|r,p)=\frac{\Gamma(y+r)}{\Gamma(r)\Gamma(y+1)}p^r(1-p)^y $$ where $p$ is the probability of $y$ successes until $r$ failures. If, we make $p=\frac{size}{size+\mu}$, then we can define the function in terms of $\mu$ $$ f(y|r,\mu)=\frac{\Gamma(y+r)}{\Gamma(r)\Gamma(y+1)}\left(\frac{r}{\mu+r}\right)^r\left(1-\frac{r}{\mu+r}\right)^y $$ where $p$ is the probability of $y$ successes until $r$ failures. $$\mu = \frac{r(1-p)}{p}\\ E(Y)=\mu, Var(Y)=\mu+\frac{\mu^2}{r} $$ | $$ f(y \mid \alpha, \beta) = \frac{\beta^{\alpha}y^{\alpha-1}e^{-\beta y}}{\Gamma(\alpha)} $$ where $$ l(\lambda;y)=\sum\limits^n_{i=1} log\Gamma(y_i + \alpha) - log\Gamma(y_i+1) - log\Gamma(\alpha) + \alpha(log(\beta_i) - log(\beta_i+1)) - y_i.log(\beta_i+1) $$ |
Scenario and Data
Lets say we wanted to model the abundance of an item ($y$) against a continuous predictor ($x$). As this section is mainly about the generation of artificial data (and not specifically about what to do with the data), understanding the actual details are optional and can be safely skipped. Consequently, I have folded (toggled) this section away.
- the sample size = 20
- the continuous $x$ variable is a random uniform spread of measurements between 1 and 20
- the rate of change in log $y$ per unit of $x$ (slope) = 0.1.
- the value of $x$ when log$y$ equals 0 (when $y$=1)
- to generate the values of $y$ expected at each $x$ value, we evaluate the linear predictor (created by calculating the outer product of the model matrix and the regression parameters). These expected values are then transformed into a scale mapped by (0,$\infty$) by using the log function $e^{linear~predictor}$
- finally, we generate $y$ values by using the expected $y$ values ($\lambda$) as probabilities when drawing random numbers from a Poisson distribution. This step adds random noise to the expected $y$ values and returns only 0's and positive integers.
> set.seed(37) #16 #35 > # The number of samples > n.x <- 20 > # Create x values that at uniformly distributed throughout the rate of 1 to 20 > x <- sort(runif(n = n.x, min = 1, max = 20)) > mm <- model.matrix(~x) > intercept <- 0.6 > slope = 0.1 > # The linear predictor > linpred <- mm %*% c(intercept, slope) > # Predicted y values > lambda <- exp(linpred) > # Add some noise and make binomial > y <- rnbinom(n = n.x, mu = lambda, size = 1) > dat.nb <- data.frame(y, x)
Exploratory data analysis and initial assumption checking
- All of the observations are independent - this must be addressed at the design and collection stages
- The response variable (and thus the residuals) should be matched by an appropriate distribution (in the case of positive integers response - a Poisson is appropriate).
- All observations are equally influential in determining the trends - or at least no observations are overly influential. This is most effectively diagnosed via residuals and other influence indices and is very difficult to diagnose prior to analysis
- the relationship between the linear predictor (right hand side of the regression formula) and the link function should be linear. A scatterplot with smoother can be useful for identifying possible non-linearity.
- Dispersion is either 1 or overdispersion is otherwise accounted for in the model
- The number of zeros is either not excessive or else they are specifically addressed by the model
When counts are all very large (not close to 0) and their ranges do not span orders of magnitude, they take on very Gaussian properties (symmetrical distribution and variance independent of the mean). Given that models based on the Gaussian distribution are more optimized and recognized than Generalized Linear Models, it can be prudent to adopt Gaussian models for such data. Hence it is a good idea to first explore whether a Poisson or Negative Binomial model is likely to be more appropriate than a standard Gaussian model.
Recall from Poisson regression, there are five main potential models that we could consider fitting to these data.
There are five main potential models we could consider fitting to these data:
- Ordinary least squares regression (general linear model) - assumes normality of residuals
- Poisson regression - assumes mean=variance (dispersion=1)
- Quasi-poisson regression - a general solution to overdispersion. Assumes variance is a function of mean, dispersion estimated, however likelihood based statistics unavailable
- Negative binomial regression - a specific solution to overdispersion caused by clumping (due to an unmeasured latent variable). Scaling factor ($\omega$) is estimated along with the regression parameters.
- Zero-inflation model - a specific solution to overdispersion caused by excessive zeros (due to an unmeasured latent variable). Mixture of binomial and Poisson models.
Confirm non-normality and explore clumping
Check the distribution of the $y$ abundances
> hist(dat.nb$y)
> boxplot(dat.nb$y, horizontal = TRUE) > rug(jitter(dat.nb$y))
Confirm linearity
Lets now explore linearity by creating a histogram of the predictor variable ($x$) and a scatterplot of the relationship between the response ($y$) and the predictor ($x$)
> hist(dat.nb$x)
> # now for the scatterplot > plot(y ~ x, dat.nb, log = "y") > with(dat.nb, lines(lowess(y ~ x)))
Conclusions: the predictor ($x$) does not display any skewness or other issues that might lead to non-linearity. The lowess smoother on the scatterplot does not display major deviations from a straight line and thus linearity is satisfied. Violations of linearity could be addressed by either:
- define a non-linear linear predictor (such as a polynomial, spline or other non-linear function)
- transform the scale of the predictor variables
Explore zero inflation
Although we have already established that there are few zeros in the data (and thus overdispersion is unlikely to be an issue), we can also explore this by comparing the number of zeros in the data to the number of zeros that would be expected from a Poisson distribution with a mean equal to the mean count of the data.
> # proportion of 0's in the data > dat.nb.tab <- table(dat.nb$y == 0) > dat.nb.tab/sum(dat.nb.tab)
FALSE TRUE 0.95 0.05
> # proportion of 0's expected from a Poisson distribution > mu <- mean(dat.nb$y) > cnts <- rpois(1000, mu) > dat.nb.tabE <- table(cnts == 0) > dat.nb.tabE/sum(dat.nb.tabE)
FALSE 1
Model fitting or statistical analysis
The boxplot of $y$ with the axis rug suggested that there might be some clumping (possibly due to some other unmeasured influence). It is therefore likely that the data are overdispersed.
> dat.nb.list <- with(dat.nb,list(Y=y, X=x,N=nrow(dat.nb))) > modelString=" model { for (i in 1:N) { Y[i] ~ dpois(lambda[i]) eta[i] <- beta0 + beta1*X[i] log(lambda[i]) <- eta[i] } beta0 ~ dnorm(0,1.0E-06) beta1 ~ dnorm(0,1.0E-06) } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS3.4.bug') > library(R2jags) > system.time( + dat.nb.P.jags <- jags(model='../downloads/BUGSscripts/tut11.5bS3.4.bug', data=dat.nb.list, inits=NULL, + param=c('beta0','beta1'), + n.chain=3, + n.iter=20000, n.thin=10, n.burnin=10000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 105 Initializing model
user system elapsed 1.800 0.000 1.803
> print(dat.nb.P.jags)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS3.4.bug", fit using jags, 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10 n.sims = 3000 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta0 0.557 0.270 0.010 0.381 0.566 0.738 1.067 1.002 1800 beta1 0.111 0.020 0.074 0.098 0.111 0.124 0.150 1.001 2700 deviance 135.376 4.173 133.279 133.806 134.600 136.036 140.662 1.009 3000 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 8.7 and DIC = 144.1 DIC is an estimate of expected predictive error (lower deviance is better).
> > > #extract the samples for the two model parameters > coefs <- dat.nb.P.jags$BUGSoutput$sims.matrix[,1:2] > Xmat <- model.matrix(~x, data=dat) > #expected values on a log scale > eta<-coefs %*% t(Xmat) > #expected value on response scale > lambda <- exp(eta) > > Resid <- -1*sweep(lambda,2,dat.nb$y, '-')/sqrt(lambda) > RSS <- apply(Resid^2, 1, sum) > Disp <- RSS/(nrow(dat.nb)-ncol(coefs)) > data.frame(Median=median(Disp), Mean=mean(Disp), HPDinterval(as.mcmc(Disp)), HPDinterval(as.mcmc(Disp),p=0.5))
Median Mean lower upper lower.1 upper.1 var1 3.131 3.235 2.628 4.018 2.747 3.199
The dispersion parameter was 3.2348
, indicating over three times more variability than would be expected for a Poisson distribution.
The data are thus over-dispersed. Given that this is unlikely to be due to zero inflation and the rug plot did suggest some level of clumping,
negative binomial regression would seem reasonable.
Model fitting or statistical analysis
JAGS
$$ \begin{align} Y_i&\sim{}NB(p_i,size) & (\text{response distribution})\\ p_i&=size/(size+\lambda_i)\\ log(\lambda_i)&=\eta_i & (\text{link function})\\ \eta_i&=\beta_0+\beta_1 X_i & (\text{linear predictor})\\ \beta_0, \beta_1&\sim{}\mathcal{N}(0,10000) & (\text{diffuse Bayesian prior})\\ size &\sim{}\mathcal{U}(0.001,1000)\\ \end{align} $$> dat.nb.list <- with(dat.nb,list(Y=y, X=x,N=nrow(dat.nb))) > modelString=" model { for (i in 1:N) { Y[i] ~ dnegbin(p[i],size) p[i] <- size/(size+lambda[i]) log(lambda[i]) <- beta0 + beta1*X[i] } beta0 ~ dnorm(0,1.0E-06) beta1 ~ dnorm(0,1.0E-06) size ~ dunif(0.001,1000) theta <- pow(1/mean(p),2) scaleparam <- mean((1-p)/p) } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS4.1.bug') > library(R2jags) > system.time( + dat.NB.jags <- jags(model='../downloads/BUGSscripts/tut11.5bS4.1.bug', data=dat.nb.list, inits=NULL, + param=c('beta0','beta1', 'size','theta','scaleparam'), + n.chain=3, + n.iter=20000, n.thin=10, n.burnin=10000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 157 Initializing model
user system elapsed 17.361 0.008 17.391
> Xmat <- model.matrix(~x, dat.nb) > nX <- ncol(Xmat) > dat.nb.list1 <- with(dat.nb,list(Y=y, X=Xmat,N=nrow(dat.nb), nX=nX)) > modelString=" model { for (i in 1:N) { Y[i] ~ dnegbin(p[i],size) p[i] <- size/(size+lambda[i]) eta[i] <- inprod(beta[], X[i,]) log(lambda[i]) <- max(-20,min(20,eta[i])) } for (i in 1:nX) { beta[i] ~ dnorm(0,1.0E-06) } size ~ dunif(0.001,10000) } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS4.11.bug') > library(R2jags) > system.time( + dat.NB.jags1 <- jags(model='../downloads/BUGSscripts/tut11.5bS4.11.bug', data=dat.nb.list1, inits=NULL, + param=c('beta'), + n.chain=3, + n.iter=20000, n.thin=10, n.burnin=10000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 212 Initializing model
user system elapsed 18.509 0.004 18.559
> print(dat.NB.jags1)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS4.11.bug", fit using jags, 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10 n.sims = 3000 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta[1] 0.744 0.411 -0.033 0.464 0.734 1.007 1.565 1.001 3000 beta[2] 0.097 0.033 0.033 0.075 0.097 0.118 0.162 1.001 2200 deviance 113.080 2.608 110.119 111.204 112.391 114.267 119.579 1.001 3000 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 3.4 and DIC = 116.5 DIC is an estimate of expected predictive error (lower deviance is better).
Or arguably better still, use a multivariate normal prior. If we have a $k$ regression parameters ($\beta_k$), then the multivariate normal priors are defined as: $$ \boldsymbol{\beta}\sim{}\mathcal{N_k}(\boldsymbol{\mu}, \mathbf{\Sigma}) $$ where $$\boldsymbol{\mu}=[E[\beta_1],E[\beta_2],...,E[\beta_k]] = \left(\begin{array}{c}0\\\vdots\\0\end{array}\right)$$ $$ \mathbf{\Sigma}=[Cov[\beta_i, \beta_j]] = \left(\begin{array}{ccc}1000^2&0&0\\0&\ddots&0\\0&0&1000^2\end{array} \right) $$ hence, along with the response and predictor matrix, we need to supply $\boldsymbol{\mu}$ (a vector of zeros) and $\boldsymbol{\Sigma}$ (a covariance matrix with $1000^2$ in the diagonals).
> Xmat <- model.matrix(~x, dat.nb) > nX <- ncol(Xmat) > dat.nb.list2 <- with(dat.nb,list(Y=y, X=Xmat,N=nrow(dat.nb), mu=rep(0,nX),Sigma=diag(1.0E-06,nX))) > modelString=" model { for (i in 1:N) { Y[i] ~ dnegbin(p[i],size) p[i] <- size/(size+lambda[i]) eta[i] <- inprod(beta[], X[i,]) log(lambda[i]) <- eta[i] } beta ~ dmnorm(mu[],Sigma[,]) size ~ dunif(0.001,10000) } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS4.12.bug') > library(R2jags) > system.time( + dat.NB.jags2 <- jags(model='../downloads/BUGSscripts/tut11.5bS4.12.bug', data=dat.nb.list2, inits=NULL, + param=c('beta', size), + n.chain=3, + n.iter=20000, n.thin=10, n.burnin=10000) + )
Error: object 'size' not found
Timing stopped at: 0 0 0.001
> print(dat.NB.jags2)
Error: error in evaluating the argument 'x' in selecting a method for function 'print': Error: object 'dat.NB.jags2' not found
Chain mixing and Model validation
Prior to exploring the model parameters, it is prudent to confirm that the model did indeed fit the assumptions and was an appropriate fit to the data as well as that the MCMC sampling chain was adequately mixed and the retained samples independent.
- We will start by exploring the mixing of the MCMC chains via traceplots.
> plot(as.mcmc(dat.NB.jags))
The chains appear well mixed and stable
- Next we will explore correlation amongst MCMC samples.
> autocorr.diag(as.mcmc(dat.NB.jags))
beta0 beta1 deviance scaleparam theta Lag 0 1.000000 1.00000 1.0000000 1.0000000 1.000000 Lag 10 0.319600 0.34384 0.0679443 0.0137109 -0.002878 Lag 50 0.023651 0.02475 -0.0159165 -0.0005774 -0.009582 Lag 100 0.004577 0.01310 -0.0029059 0.0169330 0.035428 Lag 500 0.018486 0.01216 0.0000334 -0.0208298 -0.005064
The level of auto-correlation at the nominated lag of 10 is higher than we would generally like. It is worth increasing the thinning rate from 10 to 50. Obviously, to support this higher thinning rate, we would also increase the number of iterations. Typically for a negative binomial, it is worth having a large burnin (approximately half of the iterations).
> library(R2jags) > dat.NB.jags <- jags(data=dat.nb.list,model.file='../downloads/BUGSscripts/tut11.5bS4.1.bug', + param=c('beta0','beta1','size'), + n.chains=3, n.iter=50000, n.burnin=25000, n.thin=50)
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 157 Initializing model
> print(dat.NB.jags)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS4.1.bug", fit using jags, 3 chains, each with 50000 iterations (first 25000 discarded), n.thin = 50 n.sims = 1500 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta0 0.750 0.414 -0.023 0.470 0.742 1.013 1.584 1.001 1400 beta1 0.095 0.033 0.029 0.073 0.096 0.117 0.159 1.001 1500 size 3.121 2.013 1.005 1.904 2.607 3.749 8.498 1.002 1300 deviance 113.070 2.617 110.124 111.140 112.370 114.250 119.675 1.003 1500 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 3.4 and DIC = 116.5 DIC is an estimate of expected predictive error (lower deviance is better).
> plot(as.mcmc(dat.NB.jags))
> autocorr.diag(as.mcmc(dat.NB.jags))
beta0 beta1 deviance size Lag 0 1.000000 1.000000 1.000000 1.000000 Lag 50 -0.023132 -0.020551 0.007207 -0.031935 Lag 250 0.003990 0.009005 -0.017226 0.037372 Lag 500 0.029642 0.034866 0.052578 0.036684 Lag 2500 0.000988 0.029825 0.006859 -0.001861
Conclusions: the samples are now less auto-correlated. Ideally, we should probably collect even more samples. Whilst the traceplots are reasonably noisy, there is more of a signal or pattern than there ideally should be.
- We now explore the goodness of fit of the models via the residuals and deviance.
We could calculate the Pearsons's residuals within the JAGS model.
Alternatively, we could use the parameters to generate the residuals outside of JAGS.
> #extract the samples for the two model parameters > coefs <- dat.NB.jags$BUGSoutput$sims.matrix[,1:2] > size <- dat.NB.jags$BUGSoutput$sims.matrix[,'size'] > Xmat <- model.matrix(~x, data=dat.nb) > #expected values on a log scale > eta<-coefs %*% t(Xmat) > #expected value on response scale > lambda <- exp(eta) > varY <- lambda + (lambda^2)/size > #sweep across rows and then divide by lambda > Resid <- -1*sweep(lambda,2,dat.nb$y,'-')/sqrt(varY) > #plot residuals vs expected values > plot(apply(Resid,2,mean)~apply(eta,2,mean))
There are no real patterns in the residuals.
- Now we will compare the sum of squared residuals to the sum of squares residuals that would be expected from a Poisson distribution
matching that estimated by the model. Essentially this is estimating how well the Poisson distribution, the log-link function and the linear model
approximates the observed data.
> SSres<-apply(Resid^2,1,sum) > > set.seed(2) > #generate a matrix of draws from a negative binomial distribution > # the matrix is the same dimensions as pi and uses the probabilities of pi > YNew <- matrix(rnbinom(length(lambda),mu=lambda, size=size),nrow=nrow(lambda)) > Resid1<-(lambda - YNew)/sqrt(varY) > SSres.sim<-apply(Resid1^2,1,sum) > mean(SSres.sim>SSres)
[1] 0.408
> Xmat <- model.matrix(~x, dat.nb) > nX <- ncol(Xmat) > dat.nb.list2 <- with(dat.nb,list(Y=y, X=Xmat,N=nrow(dat.nb), mu=rep(0,nX),Sigma=diag(1.0E-06,nX))) > modelString=" model { for (i in 1:N) { Y[i] ~ dnegbin(p[i],size) p[i] <- size/(size+lambda[i]) eta[i] <- inprod(beta[], X[i,]) log(lambda[i]) <- eta[i] Y1[i] ~ dnegbin(p[i],size) varY[i] <- lambda[i] + pow(lambda[i],2)/size Resid[i] <- (Y[i] - lambda[i])/sqrt(varY[i]) Resid1[i] <- (Y1[i] - lambda[i])/sqrt(varY[i]) RSS[i] <- pow(Resid[i],2) RSS1[i] <-pow(Resid1[i],2) } beta ~ dmnorm(mu[],Sigma[,]) size ~ dunif(0.001,10000) Pvalue <- mean(sum(RSS1)>sum(RSS)) } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS4.61.bug') > library(R2jags) > system.time( + dat.NB.jags3 <- jags(model='../downloads/BUGSscripts/tut11.5bS4.61.bug', data=dat.nb.list2, inits=NULL, + param=c('beta','Pvalue'), + n.chain=3, + n.iter=20000, n.thin=10, n.burnin=10000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 400 Initializing model
user system elapsed 7.453 0.004 7.463
> print(dat.NB.jags3)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS4.61.bug", fit using jags, 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10 n.sims = 3000 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff Pvalue 0.419 0.493 0.000 0.000 0.000 1.000 1.000 1.002 1900 beta[1] 0.724 0.418 -0.080 0.448 0.722 1.002 1.563 1.001 3000 beta[2] 0.098 0.033 0.034 0.076 0.097 0.120 0.163 1.001 3000 deviance 113.027 2.565 110.122 111.163 112.395 114.184 119.860 1.001 3000 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 3.3 and DIC = 116.3 DIC is an estimate of expected predictive error (lower deviance is better).
Conclusions: the Bayesian p-value is not far from 0.5, suggesting that there is a good fit of the model to the data.
Exploring the model parameters, test hypotheses
If there was any evidence that the assumptions had been violated or the model was not an appropriate fit, then we would need to reconsider the model and start the process again. In this case, there is no evidence that the test will be unreliable so we can proceed to explore the test statistics.
As with most Bayesian models, it is best to base conclusions on medians rather than means.
> print(dat.NB.jags)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS4.1.bug", fit using jags, 3 chains, each with 50000 iterations (first 25000 discarded), n.thin = 50 n.sims = 1500 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta0 0.750 0.414 -0.023 0.470 0.742 1.013 1.584 1.001 1400 beta1 0.095 0.033 0.029 0.073 0.096 0.117 0.159 1.001 1500 size 3.121 2.013 1.005 1.904 2.607 3.749 8.498 1.002 1300 deviance 113.070 2.617 110.124 111.140 112.370 114.250 119.675 1.003 1500 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 3.4 and DIC = 116.5 DIC is an estimate of expected predictive error (lower deviance is better).
> > library(plyr) > adply(dat.NB.jags$BUGSoutput$sims.matrix, 2, function(x) { + data.frame(Median=median(x), Mean=mean(x), HPDinterval(as.mcmc(x)), HPDinterval(as.mcmc(x),p=0.5)) + })
X1 Median Mean lower upper lower.1 upper.1 1 beta0 0.74245 0.75004 -0.02279 1.5842 0.46918 1.0095 2 beta1 0.09635 0.09519 0.03322 0.1607 0.07262 0.1154 3 deviance 112.36982 113.06951 109.89529 118.1002 110.19631 112.5739 4 size 2.60655 3.12087 0.70273 6.7080 1.42865 2.9849
Actually, many find it more palatable to express the estimates in the original scale of the observations rather than on a log scale.
> library(plyr) > adply(exp(dat.NB.jags$BUGSoutput$sims.matrix[,1:2]), 2, function(x) { + data.frame(Median=median(x), Mean=mean(x), HPDinterval(as.mcmc(x)), HPDinterval(as.mcmc(x),p=0.5)) + })
X1 Median Mean lower upper lower.1 upper.1 1 beta0 2.101 2.312 0.856 4.445 1.373 2.436 2 beta1 1.101 1.100 1.034 1.174 1.075 1.122
Conclusions:
We would reject the null hypothesis of no effect of $x$ on $y$.
An increase in x is associated with a significant linear increase (positive slope) in the abundance of $y$.
Every 1 unit increase in $x$ results in a log 0.0963
unit increase in $y$.
We usually express this in terms of abundance rather than log abundance, so
every 1 unit increase in $x$ results in a ($e^{0.0963
}=1.1011
$)
1.1011
unit increase in the abundance of $y$.
Further explorations of the trends
A measure of the strength of the relationship can be obtained according to: $$R^2 = 1 - \frac{RSS_{model}}{RSS_{null}}$$
> Xmat <- model.matrix(~x, data=dat.nb) > #expected values on a log scale > eta<-coefs %*% t(Xmat) > #expected value on response scale > lambda <- exp(eta) > #calculate the raw SS residuals > SSres <- apply((-1*(sweep(lambda,2,dat.nb$y,'-')))^2,1,sum) > SSres.null <- sum((dat.nb$y - mean(dat.nb$y))^2) > #OR > SSres.null <- crossprod(dat.nb$y - mean(dat.nb$y)) > #calculate the model r2 > 1-mean(SSres)/SSres.null
[,1] [1,] 0.2617
Conclusions: 26.17
% of the variation in $y$ abundance can be explained by its relationship with $x$.
> Xmat <- model.matrix(~x, dat.nb) > nX <- ncol(Xmat) > dat.nb.list2 <- with(dat.nb,list(Y=y, X=Xmat,N=nrow(dat.nb), mu=rep(0,nX),Sigma=diag(1.0E-06,nX))) > modelString=" model { for (i in 1:N) { Y[i] ~ dnegbin(p[i],size) p[i] <- size/(size+lambda[i]) eta[i] <- inprod(beta[], X[i,]) log(lambda[i]) <- eta[i] res[i] <- Y[i] - lambda[i] resnull[i] <- Y[i] - meanY } meanY <- mean(Y) beta ~ dmnorm(mu[],Sigma[,]) size ~ dunif(0.001,10000) RSS <- sum(res^2) RSSnull <- sum(resnull^2) r2 <- 1-RSS/RSSnull } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS4.9.bug') > library(R2jags) > system.time( + dat.NB.jags4 <- jags(model='../downloads/BUGSscripts/tut11.5bS4.9.bug', data=dat.nb.list2, inits=NULL, + param=c('beta','r2'), + n.chain=3, + n.iter=20000, n.thin=10, n.burnin=10000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 217 Initializing model
user system elapsed 6.485 0.004 6.492
> print(dat.NB.jags4)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS4.9.bug", fit using jags, 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10 n.sims = 3000 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta[1] 0.756 0.419 -0.058 0.463 0.759 1.021 1.552 1.002 3000 beta[2] 0.095 0.033 0.032 0.073 0.095 0.117 0.160 1.001 3000 r2 0.262 0.316 -0.055 0.240 0.309 0.350 0.389 1.088 3000 deviance 113.109 2.759 110.081 111.178 112.433 114.202 120.420 1.004 780 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 3.8 and DIC = 116.9 DIC is an estimate of expected predictive error (lower deviance is better).
Finally, we will create a summary plot.
> par(mar = c(4, 5, 0, 0)) > plot(y ~ x, data = dat.nb, type = "n", ann = F, axes = F) > points(y ~ x, data = dat.nb, pch = 16) > xs <- seq(min(dat.nb$x, na.rm = TRUE), max(dat.nb$x, na.rm = TRUE), + l = 1000) > Xmat <- model.matrix(~xs) > eta <- coefs %*% t(Xmat) > ys <- exp(eta) > library(plyr) > library(coda) > data.tab <- adply(ys, 2, function(x) { + data.frame(Median = median(x), HPDinterval(as.mcmc(x))) + }) > data.tab <- cbind(x = xs, data.tab) > points(Median ~ x, data = data.tab, col = "black", type = "l") > lines(lower ~ x, data = data.tab, col = "black", type = "l", + lty = 2) > lines(upper ~ x, data = data.tab, col = "black", type = "l", + lty = 2) > > axis(1) > mtext("X", 1, cex = 1.5, line = 3) > axis(2, las = 2) > mtext("Abundance of Y", 2, cex = 1.5, line = 3) > box(bty = "l")
Defining full log-likelihood function
Now lets try it by specifying log-likelihood and the zero trick. When applying this trick, we need to manually calculate the deviance as the inbuilt deviance will be based on the log-likelihood of estimating the zeros (as part of the zero trick) rather than the deviance of the intended model..
The one advantage of the zero trick is that the Deviance and thus DIC, AIC provided by R2jags will be incorrect. Hence, they too need to be manually defined within jags. I suspect that the AIC calculation I have used is incorrect...
> Xmat <- model.matrix(~x, dat.nb) > nX <- ncol(Xmat) > dat.nb.list2 <- with(dat.nb,list(Y=y, X=Xmat,N=nrow(dat.nb), mu=rep(0,nX), + Sigma=diag(1.0E-06,nX), zeros=rep(0,nrow(dat.nb)), C=10000)) > modelString=" model { for (i in 1:N) { zeros[i] ~ dpois(zeros.lambda[i]) zeros.lambda[i] <- -ll[i] + C ll[i] <- loggam(Y[i]+size) - loggam(Y[i]+1) - loggam(size) + size*(log(p[i]) - log(p[i]+1)) - Y[i]*log(p[i]+1) p[i] <- size/lambda[i] eta[i] <- inprod(beta[], X[i,]) log(lambda[i]) <- eta[i] } beta ~ dmnorm(mu[],Sigma[,]) size ~ dunif(0.001,1000) dev <- sum(-2*ll) } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS4.14.bug') > library(R2jags) > system.time( + dat.NB.jags3 <- jags(model='../downloads/BUGSscripts/tut11.5bS4.14.bug', data=dat.nb.list2, inits=NULL, + param=c('beta','dev'), + n.chain=3, + n.iter=50000, n.thin=50, n.burnin=10000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 453 Initializing model
user system elapsed 10.48 0.00 10.50
> print(dat.NB.jags3)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS4.14.bug", fit using jags, 3 chains, each with 50000 iterations (first 10000 discarded), n.thin = 50 n.sims = 2400 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta[1] 7.320e-01 0.411 -0.079 4.600e-01 7.350e-01 9.960e-01 1.580e+00 1.001 1900 beta[2] 9.700e-02 0.033 0.032 7.500e-02 9.600e-02 1.190e-01 1.610e-01 1.003 930 dev 1.130e+02 2.794 110.054 1.111e+02 1.123e+02 1.141e+02 1.205e+02 1.006 620 deviance 4.001e+05 2.794 400110.054 4.001e+05 4.001e+05 4.001e+05 4.001e+05 1.000 1 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 3.9 and DIC = 400116.9 DIC is an estimate of expected predictive error (lower deviance is better).
Zero-inflated Poisson (ZIP) regression
Zero-Inflation Poisson (ZIP) mixture model is defined as: $$ p(y_i|\theta,\lambda) = \left\{ \begin{array}{l l} \theta + (1-\theta)\times \text{Pois}(0|\lambda) & \quad \text{if $y_i=0$ and}\\ (1-\theta)\times \text{Pois}(y_i|\lambda) & \quad \text{if $y_i>0$} \end{array} \right. $$ where $\theta$ is the probability of false values (zeros).
Hence there is essentially two models coupled together (a mixture model) to yield an overall probability:
- when an observed response is zero ($y_i=0$), it is the probability of getting a false value (zero) plus the probability of a true value multiplied probability of drawing a value of zero from a Poisson distribution of $lambda$
- when an observed response is greater than 0, it is the probability of a true value multiplied probability of drawing that value from a Poisson distribution of $lambda$
The above formulation indicates the same $\lambda$ for both the zeros and non-zeros components. In the model of zero values, we are essentially investigating whether the likelihood of false zeros is related to the linear predictor and then the greater than zero model investigates whether the counts are related to the linear predictor.
However, we are typically less interested in modelling determinants of false zeros. Indeed, it is better that the likelihood of false zeros be unrelated to the linear predictor. For example, if excess (false zeros) are due to issues of detectability (individuals are present, just not detected), it is better that the detectability is not related to experimental treatments. Ideally, any detectability issues should be equal across all treatment levels.
The expected value of $Y$ and the variance in $Y$ for a ZIP model are: $$ E(y_i) = \lambda\times(1-\theta)\\ Var(y_i) = \lambda\times(1-\theta)\times(1+\theta\times\lambda^2) $$
Scenario and Data
Lets say we wanted to model the abundance of an item ($y$) against a continuous predictor ($x$). As this section is mainly about the generation of artificial data (and not specifically about what to do with the data), understanding the actual details are optional and can be safely skipped. Consequently, I have folded (toggled) this section away.
- the sample size = 20
- the continuous $x$ variable is a random uniform spread of measurements between 1 and 20
- the rate of change in log $y$ per unit of $x$ (slope) = 0.1.
- the value of $x$ when log$y$ equals 0 (when $y$=1)
- to generate the values of $y$ expected at each $x$ value, we evaluate the linear predictor (created by calculating the outer product of the model matrix and the regression parameters). These expected values are then transformed into a scale mapped by (0,$\infty$) by using the log function $e^{linear~predictor}$
- finally, we generate $y$ values by using the expected $y$ values ($\lambda$) as probabilities when drawing random numbers from a Poisson distribution. This step adds random noise to the expected $y$ values and returns only 0's and positive integers.
> set.seed(37) #34.5 #4 #10 #16 #17 #26 > # The number of samples > n.x <- 20 > # Create x values that at uniformly distributed throughout the rate of 1 to 20 > x <- sort(runif(n = n.x, min = 1, max = 20)) > mm <- model.matrix(~x) > intercept <- 0.6 > slope = 0.1 > # The linear predictor > linpred <- mm %*% c(intercept, slope) > # Predicted y values > lambda <- exp(linpred) > # Add some noise and make binomial > library(gamlss.dist) > # fixed latent binomial > y <- rZIP(n.x, lambda, 0.4) > # latent binomial influenced by the linear predictor y<- rZIP(n.x,lambda, 1-exp(linpred)/(1+exp(linpred))) > dat.zip <- data.frame(y, x) > > summary(glm(y ~ x, dat.zip, family = "poisson"))
Call: glm(formula = y ~ x, family = "poisson", data = dat.zip) Deviance Residuals: Min 1Q Median 3Q Max -2.541 -2.377 -0.975 1.174 3.638 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.6705 0.3302 2.03 0.042 * x 0.0296 0.0266 1.11 0.266 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 85.469 on 19 degrees of freedom Residual deviance: 84.209 on 18 degrees of freedom AIC: 124 Number of Fisher Scoring iterations: 6
> plot(glm(y ~ x, dat.zip, family = "poisson"))
> > library(pscl) > summary(zeroinfl(y ~ x | 1, dist = "poisson", data = dat.zip))
Call: zeroinfl(formula = y ~ x | 1, data = dat.zip, dist = "poisson") Pearson residuals: Min 1Q Median 3Q Max -1.001 -0.956 -0.393 0.966 1.619 Count model coefficients (poisson with log link): Estimate Std. Error z value Pr(>|z|) (Intercept) 0.7047 0.3196 2.21 0.02745 * x 0.0873 0.0253 3.45 0.00056 *** Zero-inflation model coefficients (binomial with logit link): Estimate Std. Error z value Pr(>|z|) (Intercept) -0.229 0.456 -0.5 0.62 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Number of iterations in BFGS optimization: 13 Log-likelihood: -36.2 on 3 Df
> plot(resid(zeroinfl(y ~ x | 1, dist = "poisson", data = dat.zip)) ~ + fitted(zeroinfl(y ~ x | 1, dist = "poisson", data = dat.zip)))
> > library(gamlss) > summary(gamlss(y ~ x, data = dat.zip, family = ZIP))
GAMLSS-RS iteration 1: Global Deviance = 72.44 GAMLSS-RS iteration 2: Global Deviance = 72.34 GAMLSS-RS iteration 3: Global Deviance = 72.34 ******************************************************************* Family: c("ZIP", "Poisson Zero Inflated") Call: gamlss(formula = y ~ x, family = ZIP, data = dat.zip) Fitting method: RS() ------------------------------------------------------------------- Mu link function: log Mu Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.7058 0.3196 2.21 0.04123 x 0.0872 0.0253 3.44 0.00311 ------------------------------------------------------------------- Sigma link function: logit Sigma Coefficients: Estimate Std. Error t value Pr(>|t|) -0.229 0.456 -0.502 0.622 ------------------------------------------------------------------- No. of observations in the fit: 20 Degrees of Freedom for the fit: 3 Residual Deg. of Freedom: 17 at cycle: 3 Global Deviance: 72.34 AIC: 78.34 SBC: 81.33 *******************************************************************
> > predict(gamlss(y ~ x, data = dat.zip, family = ZIP), se.fit = TRUE, + what = "mu")
GAMLSS-RS iteration 1: Global Deviance = 72.44 GAMLSS-RS iteration 2: Global Deviance = 72.34 GAMLSS-RS iteration 3: Global Deviance = 72.34
$fit 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0.7967 0.9236 1.0264 1.0370 1.1305 1.3763 1.4054 1.4300 1.4951 1.6161 1.7036 1.7630 1.8679 1.9838 1.9952 2.1187 2.1250 2.1432 2.1837 2.3065 $se.fit 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0.3518 0.3089 0.2759 0.2726 0.2446 0.1856 0.1807 0.1771 0.1697 0.1656 0.1710 0.1783 0.1973 0.2252 0.2282 0.2638 0.2657 0.2713 0.2840 0.3242
Exploratory data analysis and initial assumption checking
- All of the observations are independent - this must be addressed at the design and collection stages
- The response variable (and thus the residuals) should be matched by an appropriate distribution (in the case of positive integers response - a Poisson is appropriate).
- All observations are equally influential in determining the trends - or at least no observations are overly influential. This is most effectively diagnosed via residuals and other influence indices and is very difficult to diagnose prior to analysis
- the relationship between the linear predictor (right hand side of the regression formula) and the link function should be linear. A scatterplot with smoother can be useful for identifying possible non-linearity.
- Dispersion is either 1 or overdispersion is otherwise accounted for in the model
- The number of zeros is either not excessive or else they are specifically addressed by the model
Confirm non-normality and explore clumping
Check the distribution of the $y$ abundances
> hist(dat.zip$y)
> boxplot(dat.zip$y, horizontal = TRUE) > rug(jitter(dat.zip$y))
Confirm linearity
Lets now explore linearity by creating a histogram of the predictor variable ($x$). Note, it is difficult to directly assess issues of linearity. Indeed, a scatterplot with lowess smoother will be largely influenced by the presence of zeros. One possible way of doing so is to explore the trend in the non-zero data.
> hist(dat.zip$x)
> # now for the scatterplot > plot(y ~ x, dat.zip, log = "y") > with(subset(dat.zip, y > 0), lines(lowess(y ~ x)))
Conclusions: the predictor ($x$) does not display any skewness or other issues that might lead to non-linearity. The lowess smoother on the non-zero data cloud does not display major deviations from a straight line and thus linearity is likely to be satisfied. Violations of linearity (whilst difficult to be certain about due to the unknown influence of the zeros) could be addressed by either:
- define a non-linear linear predictor (such as a polynomial, spline or other non-linear function)
- transform the scale of the predictor variables
Explore zero inflation
Although we have already established that there are few zeros in the data (and thus overdispersion is unlikely to be an issue), we can also explore this by comparing the number of zeros in the data to the number of zeros that would be expected from a Poisson distribution with a mean equal to the mean count of the data.
> # proportion of 0's in the data > dat.zip.tab <- table(dat.zip$y == 0) > dat.zip.tab/sum(dat.zip.tab)
FALSE TRUE 0.55 0.45
> # proportion of 0's expected from a Poisson distribution > mu <- mean(dat.zip$y) > cnts <- rpois(1000, mu) > dat.zip.tabE <- table(cnts == 0) > dat.zip.tabE/sum(dat.zip.tabE)
FALSE TRUE 0.921 0.079
45
%) far
exceeds that that would have been expected (7.9
%).
Hence it is highly likely that any models will be zero-inflated.
Model fitting or statistical analysis
The exploratory data analyses have suggested that a zero-inflated Poisson model might be the most appropriate for these data.
> dat.zip.list <- with(dat.zip,list(Y=y, X=x,N=nrow(dat.zip))) > modelString=" model { for (i in 1:N) { Y[i] ~ dpois(lambda[i]) eta[i] <- beta0 + beta1*X[i] log(lambda[i]) <- eta[i] } beta0 ~ dnorm(0,1.0E-06) beta1 ~ dnorm(0,1.0E-06) } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS5.4.bug') > library(R2jags) > system.time( + dat.zip.P.jags <- jags(model='../downloads/BUGSscripts/tut11.5bS5.4.bug', data=dat.zip.list, inits=NULL, + param=c('beta0','beta1'), + n.chain=3, + n.iter=20000, n.thin=10, n.burnin=10000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 105 Initializing model
user system elapsed 1.484 0.004 1.490
> print(dat.zip.P.jags)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS5.4.bug", fit using jags, 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10 n.sims = 3000 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta0 0.628 0.324 -0.057 0.423 0.642 0.85 1.225 1.001 3000 beta1 0.032 0.026 -0.016 0.014 0.031 0.05 0.084 1.001 3000 deviance 121.911 2.069 120.051 120.545 121.329 122.61 126.947 1.003 3000 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 2.1 and DIC = 124.1 DIC is an estimate of expected predictive error (lower deviance is better).
> > > #extract the samples for the two model parameters > coefs <- dat.zip.P.jags$BUGSoutput$sims.matrix[,1:2] > Xmat <- model.matrix(~x, data=dat) > #expected values on a log scale > eta<-coefs %*% t(Xmat) > #expected value on response scale > lambda <- exp(eta) > > Resid <- -1*sweep(lambda,2,dat.zip$y, '-')/sqrt(lambda) > RSS <- apply(Resid^2, 1, sum) > Disp <- RSS/(nrow(dat.zip)-ncol(coefs)) > data.frame(Median=median(Disp), Mean=mean(Disp), HPDinterval(as.mcmc(Disp)), HPDinterval(as.mcmc(Disp),p=0.5))
Median Mean lower upper lower.1 upper.1 var1 4.211 4.319 3.136 5.616 3.603 4.427
The dispersion parameter was 4.3186
, indicating over three times more variability than would be expected for a Poisson distribution.
The data are thus over-dispersed. Given the large number of zeros in the response, it would seem likely that the overdispersion is as a result of the
excessive zeros and thus zero-inflated Poisson model would seem reasonable. Note, if this model is still overdispersed (possibly due to clumpiness
in the non-zero values), then a zero-inflated negative binomial model might be worth exploring.
JAGS
$$ Y_i = \left\{ \begin{array}{lrcl} 0~\text{with}~P(y_i) = 1-\mu & z_i&\sim&\text{Binom}(1-\theta)\\ & logit(\theta)&=&\gamma_0\\ &\gamma_0&\sim{}&\mathcal{N}(0,10000)\\ >0 & Y_i&\sim&\text{Pois}(\lambda_i)\\ &\lambda_i&=&z_i + \eta_i\\ &log(\eta_i)&=&\beta_0+\beta_1x_1\\ &\beta_0, \beta_1&\sim{}&\mathcal{N}(0,10000)\\ \end{array} \right. $$ Or in shorthand: $$ \begin{align} Y_i&\sim{}ZIP(\lambda,\theta) & (\text{response distribution})\\ logit(\theta)&=\gamma_0 & (\text{link function/linear predictor - zero component})\\ log(\lambda_i)&=\eta_i & (\text{link function - count component})\\ \eta_i&=\beta_0+\beta_1 X_i & (\text{linear predictor - count component})\\ \beta_0, \beta_1, \gamma_0&\sim{}\mathcal{N}(0,10000) & (\text{diffuse Bayesian priors})\\ \end{align} $$> dat.zip.list <- with(dat.zip,list(Y=y, X=x,N=nrow(dat.nb), z=ifelse(y==0,0,1))) > modelString=" model { for (i in 1:N) { z[i] ~ dbern(one.minus.theta) Y[i] ~ dpois(lambda[i]) lambda[i] <- z[i]*eta[i] log(eta[i]) <- beta0 + beta1*X[i] } one.minus.theta <- 1-theta logit(theta) <- gamma0 beta0 ~ dnorm(0,1.0E-06) beta1 ~ dnorm(0,1.0E-06) gamma0 ~ dnorm(0,1.0E-06) } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS5.5.bug') > library(R2jags) > system.time( + dat.zip.jags <- jags(model='../downloads/BUGSscripts/tut11.5bS5.5.bug', data=dat.zip.list, inits=NULL, + param=c('beta0','beta1', 'gamma0','theta'), + n.chain=3, + n.iter=20000, n.thin=10, n.burnin=10000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 149 Initializing model
user system elapsed 2.689 0.004 2.694
> Xmat <- model.matrix(~x, dat.zip) > nX <- ncol(Xmat) > dat.zip.list1 <- with(dat.zip,list(Y=y, X=Xmat,N=nrow(dat.zip), nX=nX, z=ifelse(y==0,0,1))) > modelString=" model { for (i in 1:N) { z[i] ~ dbern(one.minus.theta) Y[i] ~ dpois(lambda[i]) lambda[i] <- z[i]*eta[i] log(eta[i]) <- inprod(beta[], X[i,]) } one.minus.theta <- 1-theta logit(theta) <- gamma0 gamma0 ~ dnorm(0,1.0E-06) for (i in 1:nX) { beta[i] ~ dnorm(0,1.0E-06) } } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS5.6.bug') > library(R2jags) > system.time( + dat.zip.jags1 <- jags(model='../downloads/BUGSscripts/tut11.5bS5.6.bug', data=dat.zip.list1, inits=NULL, + param=c('beta','gamma0'), + n.chain=3, + n.iter=20000, n.thin=10, n.burnin=10000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 171 Initializing model
user system elapsed 2.401 0.004 2.408
> print(dat.zip.jags1)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS5.6.bug", fit using jags, 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10 n.sims = 3000 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta[1] 0.702 0.320 0.033 0.499 0.706 0.917 1.293 1.003 980 beta[2] 0.087 0.026 0.036 0.069 0.087 0.103 0.138 1.003 850 gamma0 -0.216 0.465 -1.128 -0.525 -0.218 0.107 0.669 1.001 3000 deviance 75.730 2.508 72.857 73.842 75.103 76.921 82.412 1.001 3000 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 3.1 and DIC = 78.9 DIC is an estimate of expected predictive error (lower deviance is better).
Or arguably better still, use a multivariate normal prior. If we have a $k$ regression parameters ($\beta_k$), then the multivariate normal priors are defined as: $$ \boldsymbol{\beta}\sim{}\mathcal{N_k}(\boldsymbol{\mu}, \mathbf{\Sigma}) $$ where $$\boldsymbol{\mu}=[E[\beta_1],E[\beta_2],...,E[\beta_k]] = \left(\begin{array}{c}0\\\vdots\\0\end{array}\right)$$ $$ \mathbf{\Sigma}=[Cov[\beta_i, \beta_j]] = \left(\begin{array}{ccc}1000^2&0&0\\0&\ddots&0\\0&0&1000^2\end{array} \right) $$ hence, along with the response and predictor matrix, we need to supply $\boldsymbol{\mu}$ (a vector of zeros) and $\boldsymbol{\Sigma}$ (a covariance matrix with $1000^2$ in the diagonals).
> Xmat <- model.matrix(~x, dat.zip) > nX <- ncol(Xmat) > dat.zip.list2 <- with(dat.zip,list(Y=y, X=Xmat,N=nrow(dat.zip), nX=nX, + mu=rep(0,nX),Sigma=diag(1.0E-06,nX), z=ifelse(y==0,0,1))) > modelString=" model { for (i in 1:N) { z[i] ~ dbern(one.minus.theta) Y[i] ~ dpois(lambda[i]) lambda[i] <- z[i]*eta[i] log(eta[i]) <- inprod(beta[], X[i,]) } one.minus.theta <- 1-theta logit(theta) <- gamma0 gamma0 ~ dnorm(0,1.0E-06) beta ~ dmnorm(mu[],Sigma[,]) } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS5.7.bug') > library(R2jags) > system.time( + dat.zip.jags2 <- jags(model='../downloads/BUGSscripts/tut11.5bS5.7.bug', data=dat.zip.list2, inits=NULL, + param=c('beta', 'gamma0'), + n.chain=3, + n.iter=20000, n.thin=10, n.burnin=10000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 176 Initializing model
user system elapsed 1.000 0.000 0.999
> print(dat.zip.jags2)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS5.7.bug", fit using jags, 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10 n.sims = 3000 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta[1] 0.700 0.313 0.051 0.492 0.708 0.905 1.307 1.005 430 beta[2] 0.087 0.025 0.036 0.070 0.087 0.104 0.136 1.008 680 gamma0 -0.215 0.472 -1.115 -0.523 -0.229 0.090 0.724 1.001 2400 deviance 75.677 2.517 72.884 73.834 75.002 76.788 82.402 1.001 3000 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 3.2 and DIC = 78.8 DIC is an estimate of expected predictive error (lower deviance is better).
Note, the n.eff indicates that we probably have an issue with chain mixing and/or autocorrelation. We probably should increase the number of iterations and the thinning rate.
Chain mixing and Model validation
Prior to exploring the model parameters, it is prudent to confirm that the model did indeed fit the assumptions and was an appropriate fit to the data as well as that the MCMC sampling chain was adequately mixed and the retained samples independent.
- We will start by exploring the mixing of the MCMC chains via traceplots.
> plot(as.mcmc(dat.zip.jags))
The chains appear well mixed and stable
- Next we will explore correlation amongst MCMC samples.
> autocorr.diag(as.mcmc(dat.zip.jags))
beta0 beta1 deviance gamma0 theta Lag 0 1.000000 1.0000000 1.000000 1.000000 1.000000 Lag 10 0.260592 0.2658804 0.069565 0.056243 0.056195 Lag 50 0.013879 0.0002929 -0.017896 -0.005929 -0.004522 Lag 100 0.017172 0.0200417 -0.034988 -0.024724 -0.023377 Lag 500 0.006818 0.0031960 -0.002762 0.010100 0.012140
The level of auto-correlation at the nominated lag of 10 is higher than we would generally like. It is worth increasing the thinning rate from 10 to 50. Obviously, to support this higher thinning rate, we would also increase the number of iterations.
> library(R2jags) > dat.zip.jags <- jags(data=dat.zip.list,model.file='../downloads/BUGSscripts/tut11.5bS5.5.bug', + param=c('beta0','beta1','gamma0','theta'), + n.chains=3, n.iter=100000, n.burnin=50000, n.thin=50)
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 149 Initializing model
> print(dat.zip.jags)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS5.5.bug", fit using jags, 3 chains, each with 1e+05 iterations (first 50000 discarded), n.thin = 50 n.sims = 3000 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta0 0.718 0.317 0.061 0.519 0.720 0.932 1.318 1.001 3000 beta1 0.085 0.025 0.037 0.068 0.085 0.102 0.136 1.001 3000 gamma0 -0.203 0.465 -1.140 -0.509 -0.195 0.112 0.685 1.001 3000 theta 0.452 0.109 0.242 0.375 0.451 0.528 0.665 1.001 3000 deviance 75.655 2.508 72.846 73.842 75.007 76.704 82.268 1.001 3000 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 3.1 and DIC = 78.8 DIC is an estimate of expected predictive error (lower deviance is better).
> plot(as.mcmc(dat.zip.jags))
> autocorr.diag(as.mcmc(dat.zip.jags))
beta0 beta1 deviance gamma0 theta Lag 0 1.00000 1.000000 1.00000 1.000000 1.000000 Lag 50 -0.02079 -0.022691 0.02611 -0.005516 -0.002252 Lag 250 0.01070 0.016621 -0.02444 -0.010192 -0.010403 Lag 500 -0.01010 -0.008113 0.01343 0.047790 0.049889 Lag 2500 -0.01133 -0.024075 -0.01972 -0.040467 -0.040757
Conclusions: the samples are now less auto-correlated and the chains are all well mixed and appear stable.
- We now explore the goodness of fit of the models via the residuals and deviance.
We could calculate the Pearsons's residuals within the JAGS model.
Alternatively, we could use the parameters to generate the residuals outside of JAGS.
> #extract the samples for the two model parameters > coefs <- dat.zip.jags$BUGSoutput$sims.matrix[,1:2] > theta <- dat.zip.jags$BUGSoutput$sims.matrix[,'theta'] > Xmat <- model.matrix(~x, data=dat.zip) > #expected values on a log scale > lambda<-coefs %*% t(Xmat) > #expected value on response scale > eta <- exp(lambda) > expY <- sweep(eta,1,(1-theta),"*") > varY <- eta+sweep(eta^2,1,theta,"*") > varY <- sweep(varY,1,(1-theta),'*') > #sweep across rows and then divide by lambda > Resid <- -1*sweep(expY,2,dat.zip$y,'-')/sqrt(varY) > #plot residuals vs expected values > plot(apply(Resid,2,mean)~apply(eta,2,mean))
There are no real patterns in the residuals.
- Now we will compare the sum of squared residuals to the sum of squares residuals that would be expected from a Poisson distribution
matching that estimated by the model. Essentially this is estimating how well the Poisson distribution, the log-link function and the linear model
approximates the observed data.
When doing so, we need to consider the expected value and variance of the zero-inflated poisson. $$ E(y_i) = \lambda\times(1-\theta)\\ Var(y_i) = \lambda\times(1-\theta)\times(1+\theta\times\lambda^2) $$
> SSres<-apply(Resid^2,1,sum) > > set.seed(2) > #generate a matrix of draws from a zero-inflated poisson (ZIP) distribution > # the matrix is the same dimensions as lambda > library(gamlss.dist) > #YNew <- matrix(rZIP(length(lambda),eta, theta),nrow=nrow(lambda)) > lambda <- sweep(eta,1,ifelse(dat.zip$y==0,0,1),'*') > YNew <- matrix(rpois(length(lambda),lambda),nrow=nrow(lambda)) > Resid1<-(expY - YNew)/sqrt(varY) > SSres.sim<-apply(Resid1^2,1,sum) > mean(SSres.sim>SSres)
[1] 0.4513
Whilst not ideal (as we would prefer a Bayesian P-value of around 0.5), this value is not wildly bad and does not constitute overwhelming evidence of a lack of fit.
> Xmat <- model.matrix(~x, dat.zip) > nX <- ncol(Xmat) > dat.zip.list2 <- with(dat.zip,list(Y=y, X=Xmat,N=nrow(dat.zip), nX=nX, + mu=rep(0,nX),Sigma=diag(1.0E-06,nX), z=ifelse(y==0,0,1))) > modelString=" model { for (i in 1:N) { z[i] ~ dbern(one.minus.theta) Y[i] ~ dpois(lambda[i]) lambda[i] <- z[i]*eta[i] log(eta[i]) <- max(-20,min(20,inprod(beta[], X[i,]))) expY[i] <- eta[i]*(1-theta) varY[i] <- (1-theta)*(eta[i]*theta*pow(eta[i],2)) Resid[i] <- (Y[i] - expY[i])/sqrt(varY[i]) Y1[i] ~ dpois(lambda[i]) Resid1[i] <- (Y1[i] - expY[i])/sqrt(varY[i]) RSS[i] <- pow(Resid[i],2) RSS1[i] <-pow(Resid1[i],2) } one.minus.theta <- 1-theta logit(theta) <- gamma0 gamma0 ~ dnorm(0,1.0E-06) beta ~ dmnorm(mu[],Sigma[,]) Pvalue <- mean(sum(RSS1)>sum(RSS)) } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS5.13.bug') > library(R2jags) > system.time( + dat.zip.jags3 <- jags(model='../downloads/BUGSscripts/tut11.5bS5.13.bug', data=dat.zip.list2, inits=NULL, + param=c('beta', 'gamma0','Pvalue'), + n.chain=3, + n.iter=20000, n.thin=10, n.burnin=10000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 465 Initializing model
user system elapsed 1.337 0.004 1.342
> print(dat.zip.jags3)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS5.13.bug", fit using jags, 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10 n.sims = 3000 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff Pvalue 0.475 0.499 0.000 0.000 0.000 1.000 1.000 1.001 3000 beta[1] 0.712 0.312 0.062 0.508 0.721 0.928 1.292 1.004 660 beta[2] 0.086 0.025 0.037 0.069 0.086 0.103 0.133 1.004 750 gamma0 -0.213 0.461 -1.129 -0.526 -0.197 0.104 0.668 1.001 3000 deviance 75.628 2.489 72.826 73.799 74.996 76.819 81.974 1.001 2100 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 3.1 and DIC = 78.7 DIC is an estimate of expected predictive error (lower deviance is better).
>
Conclusions: the Bayesian p-value is very close to 0.5, suggesting that there is a good fit of the model to the data.
Exploring the model parameters, test hypotheses
If there was any evidence that the assumptions had been violated or the model was not an appropriate fit, then we would need to reconsider the model and start the process again. In this case, there is no evidence that the test will be unreliable so we can proceed to explore the test statistics.
As with most Bayesian models, it is best to base conclusions on medians rather than means.
> print(dat.zip.jags)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS5.5.bug", fit using jags, 3 chains, each with 1e+05 iterations (first 50000 discarded), n.thin = 50 n.sims = 3000 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta0 0.718 0.317 0.061 0.519 0.720 0.932 1.318 1.001 3000 beta1 0.085 0.025 0.037 0.068 0.085 0.102 0.136 1.001 3000 gamma0 -0.203 0.465 -1.140 -0.509 -0.195 0.112 0.685 1.001 3000 theta 0.452 0.109 0.242 0.375 0.451 0.528 0.665 1.001 3000 deviance 75.655 2.508 72.846 73.842 75.007 76.704 82.268 1.001 3000 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 3.1 and DIC = 78.8 DIC is an estimate of expected predictive error (lower deviance is better).
> > library(plyr) > adply(dat.zip.jags$BUGSoutput$sims.matrix, 2, function(x) { + data.frame(Median=median(x), Mean=mean(x), HPDinterval(as.mcmc(x)), HPDinterval(as.mcmc(x),p=0.5)) + })
X1 Median Mean lower upper lower.1 upper.1 1 beta0 0.72032 0.7182 0.11349 1.3617 0.54977 0.95667 2 beta1 0.08544 0.0853 0.03686 0.1355 0.06821 0.10181 3 deviance 75.00697 75.6548 72.65129 80.7637 72.79881 75.08279 4 gamma0 -0.19535 -0.2028 -1.15237 0.6730 -0.57676 0.03602 5 theta 0.45132 0.4521 0.23186 0.6532 0.35968 0.50900
Actually, many find it more palatable to express the estimates in the original scale of the observations rather than on a log scale.
> library(plyr) > adply(exp(dat.zip.jags$BUGSoutput$sims.matrix[,1:2]), 2, function(x) { + data.frame(Median=median(x), Mean=mean(x), HPDinterval(as.mcmc(x)), HPDinterval(as.mcmc(x),p=0.5)) + })
X1 Median Mean lower upper lower.1 upper.1 1 beta0 2.055 2.155 0.9751 3.546 1.520 2.352 2 beta1 1.089 1.089 1.0376 1.145 1.071 1.107
Conclusions:
We would reject the null hypothesis of no effect of $x$ on $y$.
An increase in x is associated with a significant linear increase (positive slope) in the abundance of $y$.
Every 1 unit increase in $x$ results in a log 0.0854
unit increase in $y$.
We usually express this in terms of abundance rather than log abundance, so
every 1 unit increase in $x$ results in a ($e^{0.0854
}=1.0892
$)
1.0892
unit increase in the abundance of $y$.
Further explorations of the trends
A measure of the strength of the relationship can be obtained according to: $$R^2 = 1 - \frac{RSS_{model}}{RSS_{null}}$$
Alternatively, we could use McFadden's psuedo $$R^2 = 1-\frac{\mathcal{LL}(Model_{full})}{\mathcal{LL}(Model_{reduced})}$$ [[http://www.statisticalhorizons.com/category/uncategorized]] - about a quarter of the way down the page.
> Xmat <- model.matrix(~x, dat=dat.zip) > #expected values on a log scale > neta<-coefs %*% t(Xmat) > #expected value on response scale > eta <- exp(neta) > lambda <- sweep(eta,2,ifelse(dat.zip$y==0,0,1),'*') > expY <- sweep(lambda,2,1-theta,'*') > #calculate the raw SS residuals > SSres <- apply((-1*(sweep(expY,2,dat.zip$y,'-')))^2,1,sum) > SSres.null <- sum((dat.zip$y - mean(dat.zip$y))^2) > #calculate the model r2 > 1-mean(SSres)/SSres.null
[1] 0.4928
Conclusions: 49.28
% of the variation in $y$ abundance can be explained by its relationship with $x$.
> Xmat <- model.matrix(~x, dat.zip) > nX <- ncol(Xmat) > dat.zip.list2 <- with(dat.zip,list(Y=y, X=Xmat,N=nrow(dat.zip), nX=nX, + mu=rep(0,nX),Sigma=diag(1.0E-06,nX), z=ifelse(y==0,0,1))) > modelString=" model { for (i in 1:N) { z[i] ~ dbern(one.minus.theta) Y[i] ~ dpois(lambda[i]) lambda[i] <- z[i]*eta[i] log(eta[i]) <- max(-20,min(20,inprod(beta[], X[i,]))) expY[i] <- lambda[i]*(1-theta) res[i] <- Y[i] - expY[i] resnull[i] <- Y[i] - meanY } one.minus.theta <- 1-theta logit(theta) <- gamma0 gamma0 ~ dnorm(0,1.0E-06) beta ~ dmnorm(mu[],Sigma[,]) meanY <- mean(Y) RSS <- sum(res^2) RSSnull <- sum(resnull^2) r2 <- 1-RSS/RSSnull } " > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS5.17.bug') > library(R2jags) > system.time( + dat.ZIP.jags4 <- jags(model='../downloads/BUGSscripts/tut11.5bS5.17.bug', data=dat.zip.list2, inits=NULL, + param=c('beta','r2'), + n.chain=3, + n.iter=20000, n.thin=10, n.burnin=10000) + )
Compiling model graph Resolving undeclared variables Allocating nodes Graph Size: 277 Initializing model
user system elapsed 1.148 0.004 1.153
> print(dat.ZIP.jags4)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS5.17.bug", fit using jags, 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10 n.sims = 3000 iterations saved mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff beta[1] 0.696 0.320 0.050 0.481 0.707 0.912 1.316 1.002 1500 beta[2] 0.087 0.025 0.036 0.070 0.087 0.104 0.134 1.002 1300 r2 0.492 0.182 0.095 0.369 0.512 0.633 0.788 1.001 3000 deviance 75.740 2.524 72.863 73.908 75.100 76.922 82.234 1.001 3000 For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1). DIC info (using the rule, pD = var(deviance)/2) pD = 3.2 and DIC = 78.9 DIC is an estimate of expected predictive error (lower deviance is better).
Finally, we will create a summary plot.
> par(mar = c(4, 5, 0, 0)) > plot(y ~ x, data = dat.zip, type = "n", ann = F, axes = F) > points(y ~ x, data = dat.zip, pch = 16) > xs <- seq(min(dat.zip$x, na.rm = TRUE), max(dat.zip$x, na.rm = TRUE), + l = 1000) > Xmat <- model.matrix(~xs) > eta <- coefs %*% t(Xmat) > ys <- exp(eta) > library(plyr) > library(coda) > data.tab <- adply(ys, 2, function(x) { + data.frame(Median = median(x), HPDinterval(as.mcmc(x))) + }) > data.tab <- cbind(x = xs, data.tab) > points(Median ~ x, data = data.tab, col = "black", type = "l") > lines(lower ~ x, data = data.tab, col = "black", type = "l", + lty = 2) > lines(upper ~ x, data = data.tab, col = "black", type = "l", + lty = 2) > > axis(1) > mtext("X", 1, cex = 1.5, line = 3) > axis(2, las = 2) > mtext("Abundance of Y", 2, cex = 1.5, line = 3) > box(bty = "l")
Defining full log-likelihood function
Now lets try it by specifying log-likelihood and the zero trick. When applying this trick, we need to manually calculate the deviance as the inbuilt deviance will be based on the log-likelihood of estimating the zeros (as part of the zero trick) rather than the deviance of the intended model..
The one advantage of the zero trick is that the Deviance and thus DIC, AIC provided by R2jags will be incorrect. Hence, they too need to be manually defined within jags. I suspect that the AIC calculation I have used is incorrect...