Jump to main navigation


Tutorial 11.5b - Poisson regression

02 April 2014

Overview

Whilst in many instances, count data can be approximated reasonably well by a normal distribution (particularly if the counts are all above zero and the mean count is greater than about 20), more typically, when count data are modelled via normal distribution certain undesirable characteristics arise that are a consequence of the nature of discrete non-negative data.

  • Expected (predicted) values and confidence bands less than zero are illogical, yet these are entirely possible from a normal distribution
  • The distribution of count data are often skewed when their mean is low (in part because the distribution is truncated to the left by zero) and variance usually increases with increasing mean (variance is typically proportional to mean in count data). By contrast, the Gaussian (normal) distribution assumes that mean and variance are unrelated and thus estimates (particularly of standard error) might well be reasonable inaccurate.

Poisson regression is a type of generalized linear model (GLM) in which a non-negative integer (natural number) response is modelled against a linear predictor via a specific link function. The linear predictor is typically a linear combination of effects parameters (e.g. $\beta_0 + \beta_1x_x$). The role of the link function is to transform the expected values of the response y (which is on the scale of (0,$\infty$), as is the Poisson distribution from which expectations are drawn) into the scale of the linear predictor (which is $-\infty,\infty$).

As implied in the name of this group of analyses, a Poisson rather than Gaussian (normal) distribution is used to represent the errors (residuals). Like count data (number of individuals, species etc), the Poisson distribution encapsulates positive integers and is bound by zero at one end. Consequently, the degree of variability is directly related the expected value (equivalent to the mean of a Gaussian distribution). Put differently, the variance is a function of the mean. Repeated observations from a Poisson distribution located close to zero will yield a much smaller spread of observations than will samples drawn from a Poisson distribution located a greater distance from zero. In the Poisson distribution, the variance has a 1:1 relationship with the mean.

The canonical link function for the Poisson distribution is a log-link function.

Whilst the expectation that the mean=variance ($\mu=\sigma$) is broadly compatible with actual count data (that variance increases at the same rate as the mean), under certain circumstances, this might not be the case. For example, when there are other unmeasured influences on the response variable, the distribution of counts might be somewhat clumped which can result in higher than expected variability (that is $\sigma\gt\mu$). The variance increases more rapidly than does the mean. This is referred to as overdispersion. The degree to which the variability is greater than the mean (and thus the expected degree of variability) is called dispersion. Effectively, the Poisson distribution has a dispersion parameter (or scaling factor) of 1.

It turns out that overdispersion is very common for count data and it typically underestimates variability, standard errors and thus deflated p-values. There are a number of ways of overcoming this limitation, the effectiveness of which depend on the causes of overdispersion.

  • Quasi-Poisson models - these introduce the dispersion parameter ($\phi$) into the model. This approach does not utilize an underlying error distribution to calculate the maximum likelihood (there is no quasi-Poisson distribution). Instead, if the Newton-Ralphson iterative reweighting least squares algorithm is applied using a direct specification of the relationship between mean and variance ($var(y)=\phi\mu$), the estimates of the regression coefficients are identical to those of the maximum likelihood estimates from the Poisson model. This is analogous to fitting ordinary least squares on symmetrical, yet not normally distributed data - the parameter estimates are the same, however they won't necessarily be as efficient. The standard errors of the coefficients are then calculated by multiplying the Poisson model coefficient standard errors by $\sqrt{\phi}$.

    Unfortunately, because the quasi-poisson model is not estimated via maximum likelihood, properties such as AIC and log-likelihood cannot be derived. Consequently, quasi-poisson and Poisson model fits cannot be compared via either AIC or likelihood ratio tests (nor can they be compared via deviance as uasi-poisson and Poisson models have the same residual deviance). That said, quasi-likelihood can be obtained by dividing the likelihood from the Poisson model by the dispersion (scale) factor.

  • Negative binomial model - technically, the negative binomial distribution is a probability distribution for the number of successes before a specified number of failures. However, the negative binomial can also be defined (parameterized) in terms of a mean ($\mu$) and scale factor ($\omega$), $$p(y_i)=\frac{\Gamma(y_i+\omega)}{\Gamma(\omega)y!}\times\frac{\mu_i^{y_i}\omega^\omega}{(\mu_i+\omega)^{\mu_i+\omega}}$$ where the expectected value of the values $y_i$ (the means) are ($\mu_i$) and the variance is $y_i=\frac{\mu_i+\mu_i^2}{\omega}$. In this way, the negative binomial is a two-stage hierarchical process in which the response is modeled against a Poisson distribution whose expected count is in turn modeled by a Gamma distribution with a mean of $\mu$ and constant scale parameter ($\omega$).

    Strictly, the negative binomial is not an exponential family distribution (unless $\omega$ is fixed as a constant), and thus negative binomial models cannot be fit via the usual GLM iterative reweighting algorithm. Instead estimates of the regression parameters along with the scale factor ($\omega$) are obtained via maximum likelihood.

    The negative binomial model is useful for accommodating overdispersal when it is likely caused by clumping (due to the influence of other unmeasured factors) within the response.

  • Zero-inflated Poisson model - overdispersion can also be caused by the presence of a greater number of zero's than would otherwise be expected for a Poisson distribution. There are potentially two sources of zero counts - genuine zeros and false zeros. Firstly, there may genuinely be no individuals present. This would be the number expected by a Poisson distribution. Secondly, individuals may have been present yet not detected or may not even been possible. These are false zero's and lead to zero inflated data (data with more zeros than expected).

    For example, the number of joeys accompanying an adult koala could be zero because the koala has no offspring (true zero) or because the koala is male or infertile (both of which would be examples of false zeros). Similarly, zero counts of the number of individual in a transect are due either to the absence of individuals or the inability of the observer to detect them. Whilst in the former example, the latent variable representing false zeros (sex or infertility) can be identified and those individuals removed prior to analysis, this is not the case for the latter example. That is, we cannot easily partition which counts of zero are due to detection issues and which are a true indication of the natural state.

    Consistent with these two sources of zeros, zero-inflated models combine a binary logistic regression model (that models count membership according to a latent variable representing observations that can only be zeros - not detectable or male koalas) with a Poisson regression (that models count membership according to a latent variable representing observations whose values could be 0 or any positive integer - fertile female koalas).

Summary of important equations

Many ecologists are repulsed or frightened by statistical formulae. In part this is due to the use of Greek letters to represent concepts, constants and functions with the assumption that the readers are familiar with their meaning. Hence, the issue is largely that many ecologists are familiar with certain statistical concepts, yet are not overly familiar with the notation used to represent those concepts.

As a typed language, R exposes users to a rich variety of statistical opportunities. Whilst many of the common procedures are wrapped into simple to use functions (and so all of the calculations etc underlying a GLM need not be performed by hand by the user), not all procedures have been packaged into functions. For such cases, it is useful to have the formulas handy.

Hence, in this section I am going to summarize and compare equations for common procedures associated with Poisson and Negative Binomial models. Feel free to skip this section until it is useful to you.

PoissonNegative Binomial (p,r)
Density function $f(y|\lambda)=\frac{\lambda^{y}e^{-\lambda}}{y!}$ $f(y|r,p)=\frac{\Gamma(y+r)}{\Gamma(r)\Gamma(y+1)}p^r(1-p)^y$
Expected value $E(Y)=\mu=\lambda$ $E(Y)=\mu=\frac{r(1-p)}{p}$
Variance $var(Y)=\lambda$ $var(Y)=\frac{r(1-p)}{p^2}$
log-likelihood $\mathcal{LL}(\lambda\mid y) = \sum\limits^n_{i=1}y_i log(\lambda_i)-\lambda_i - log\Gamma(y_i+1)$ $\begin{align}\mathcal{LL}(r,p\mid y) = \sum\limits^n_{i=1}&log\Gamma(y_1 + r) - log\Gamma(r) - log\Gamma(y_i + 1) + \\&r.log(p) + y_i log(1-p)\end{align}$
Negative Binomial (a,b)
Density function $f(y \mid a,b) = \frac{b^{a}y^{b-1}e^{-b y}}{\Gamma(a)}$
Expected value $E(Y)=\mu=\frac{r}{p}$
Variance
log-likelihood $\mathcal{LL}(\lambda;y)=\sum\limits^n_{i=1} log\Gamma(y_i + a) - log\Gamma(y_i+1) - log\Gamma(a) + \alpha(log(b_i) - log(b_i+1)) - y_i.log(b_i+1)$
Zero-inflated Poisson
Density function $p(y \mid \theta,\lambda) = \left\{ \begin{array}{l l} \theta + (1-\theta)\times \text{Pois}(0|\lambda) & \quad \text{if $y_i=0$ and}\\ (1-\theta)\times \text{Pois}(y_i|\lambda) & \quad \text{if $y_i>0$} \end{array} \right.$
Expected value $E(Y)=\mu=\lambda\times(1-theta)$
Variance $var(Y)=\lambda\times(1-\theta)\times(1+\theta\times\lambda^2)$
log-likelihood $\mathcal{LL}(\theta,\lambda\mid y) = \left\{ \begin{array}{l l} \sum\limits^n_{i=1} (1-\theta)\times log(\theta + \lambda_i)\\ \sum\limits^n_{i=1}y_i log(1-\theta)\times y_i\lambda_i - exp(\lambda_i)\\ \end{array} \right. $
ProcedureEquation
Residuals $\varepsilon_i = y_i - \hat{y}$
Pearson's Residuals $\frac{y_i - \hat{y}}{\sqrt{var(y)}}$
Residual Sums of Squares $RSS = \sum \varepsilon_i^2$
Dispersion statistic $\phi=\frac{RSS}{df}$
$R^2$ $R^2 = 1-\frac{RSS_{model}}{RSS_{null}}$
McFadden's $R^2$ $R_{McF}^2 = 1 - \frac{\mathcal{LL}_{model}}{\mathcal{LL}_{null}}$
Deviance $dev = -2*\mathcal{LL}$
pD $pD = $
DIC $DIC = -pD$
AIC $AIC = min(dev+(2pD))$

Poisson regression

The following equations are provided since in Bayesian modelling, it is occasionally necessary to directly define the log-likelihood calculations (particularly for zero-inflated models and other mixture models). Feel free initially gloss over these equations until such time when your models require them ;)

Density function: $$ f(y\mid\lambda)=\frac{\lambda^{y}e^{-\lambda}}{y!}\\ E(Y)=Var(Y)=\lambda\\ $$ where $\lambda$ is the mean.

Likelihood: $$ \begin{align} \mathcal{L}(\lambda\mid y) &= \prod\limits_{i=1}^n \dfrac{\lambda^{y_i}e^{-\lambda}}{y_i!}\\ %&= \dfrac{\lambda^{\sum\limits^n_{i=1}y_i} e^{-n\lambda}}{y_1!y_2! \cdots y_n!}\\ \end{align} $$ Log-likelihood: $$ \begin{array}{rcl} \mathcal{LL}(\lambda\mid y)&=&\sum\limits^n_{i=1}log(\lambda^{y_i}e^{-\lambda_i})-log(y_i!)\\ \mathcal{LL}(\lambda\mid y)&=&\sum\limits^n_{i=1}log(\lambda^{y_i})+log(e^{-\lambda_i})-log(y_i!)\\ \mathcal{LL}(\lambda\mid y)&=&\sum\limits^n_{i=1}y_i log(\lambda_i)-\lambda_i - log(y_i!) \end{array} $$

Scenario and Data

Lets say we wanted to model the abundance of an item ($y$) against a continuous predictor ($x$). As this section is mainly about the generation of artificial data (and not specifically about what to do with the data), understanding the actual details are optional and can be safely skipped. Consequently, I have folded (toggled) this section away.

Random data incorporating the following trends (effect parameters)
  • the sample size = 20
  • the continuous $x$ variable is a random uniform spread of measurements between 1 and 20
  • the rate of change in log $y$ per unit of $x$ (slope) = 0.1.
  • the value of $x$ when log$y$ equals 0 (when $y$=1)
  • to generate the values of $y$ expected at each $x$ value, we evaluate the linear predictor (created by calculating the outer product of the model matrix and the regression parameters). These expected values are then transformed into a scale mapped by (0,$\infty$) by using the log function $e^{linear~predictor}$
  • finally, we generate $y$ values by using the expected $y$ values ($\lambda$) as probabilities when drawing random numbers from a Poisson distribution. This step adds random noise to the expected $y$ values and returns only 0's and positive integers.
> set.seed(8)
> # The number of samples
> n.x <- 20
> # Create x values that at uniformly distributed
> # throughout the rate of 1 to 20
> x <- sort(runif(n = n.x, min = 1, max = 20))
> mm <- model.matrix(~x)
> intercept <- 0.6
> slope = 0.1
> # The linear predictor
> linpred <- mm %*% c(intercept, slope)
> # Predicted y values
> lambda <- exp(linpred)
> # Add some noise and make binomial
> y <- rpois(n = n.x, lambda = lambda)
> dat <- data.frame(y, x)

With these sort of data, we are primarily interested in investigating whether there is a relationship between the positive integer response variable and the linear predictor (linear combination of one or more continuous or categorical predictors).

Exploratory data analysis and initial assumption checking

The assumptions are:
  1. All of the observations are independent - this must be addressed at the design and collection stages
  2. The response variable (and thus the residuals) should be matched by an appropriate distribution (in the case of positive integers response - a Poisson is appropriate).
  3. All observations are equally influential in determining the trends - or at least no observations are overly influential. This is most effectively diagnosed via residuals and other influence indices and is very difficult to diagnose prior to analysis
  4. the relationship between the linear predictor (right hand side of the regression formula) and the link function should be linear. A scatterplot with smoother can be useful for identifying possible non-linearity.
  5. The dispersion factor is close to 1

There are at least five main potential models we could consider fitting to these data:

  1. Ordinary least squares regression (general linear model) - assumes normality of residuals
  2. Poisson regression - assumes mean=variance (dispersion=1)
  3. Quasi-poisson regression - a general solution to overdispersion. Assumes variance is a function of mean, dispersion estimated, however likelihood based statistics unavailable
  4. Negative binomial regression - a specific solution to overdispersion caused by clumping (due to an unmeasured latent variable). Scaling factor ($\omega$) is estimated along with the regression parameters.
  5. Zero-inflation model - a specific solution to overdispersion caused by excessive zeros (due to an unmeasured latent variable). Mixture of binomial and Poisson models.

Confirm non-normality and explore clumping

When counts are all very large (not close to 0) and their ranges do not span orders of magnitude, they take on very Gaussian properties (symmetrical distribution and variance independent of the mean). Given that models based on the Gaussian distribution are more optimized and recognized than Generalized Linear Models, it can be prudent to adopt Gaussian models for such data. Hence it is a good idea to first explore whether a Poisson model is likely to be more appropriate than a standard Gaussian model.

The potential for overdispersion can be explored by adding a rug to boxplot. The rug is simply tick marks on the inside of an axis at the position corresponding to an observation. As multiple identical values result in tick marks drawn over one another, it is typically a good idea to apply a slight amount of jitter (random displacement) to the values used by the rug.

> hist(dat$y)
plot of chunk tut11.5bS1.2
> boxplot(dat$y, horizontal = TRUE)
> rug(jitter(dat$y), side = 1)
plot of chunk tut11.5bS1.2

There is definitely signs of non-normality that would warrant Poisson models. The rug applied to the boxplots does not indicate a series degree of clumping and there appears to be few zero. Thus overdispersion is unlikely to be an issue.

Confirm linearity

Lets now explore linearity by creating a histogram of the predictor variable ($x$) and a scatterplot of the relationship between the response ($y$) and the predictor ($x$)

> hist(dat$x)
plot of chunk tut11.5bS1.3
> # now for the scatterplot
> plot(y ~ x, dat, log = "y")
> with(dat, lines(lowess(y ~ x)))
plot of chunk tut11.5bS1.3

Conclusions: the predictor ($x$) does not display any skewness or other issues that might lead to non-linearity. The lowess smoother on the scatterplot does not display major deviations from a straight line and thus linearity is satisfied. Violations of linearity could be addressed by either:

  • define a non-linear linear predictor (such as a polynomial, spline or other non-linear function)
  • transform the scale of the predictor variables

Explore zero inflation

Although we have already established that there are few zeros in the data (and thus overdispersion is unlikely to be an issue), we can also explore this by comparing the number of zeros in the data to the number of zeros that would be expected from a Poisson distribution with a mean equal to the mean count of the data.

> # proportion of 0's in the data
> dat.tab <- table(dat$y == 0)
> dat.tab/sum(dat.tab)
FALSE 
    1 
> # proportion of 0's expected from a Poisson
> # distribution
> mu <- mean(dat$y)
> cnts <- rpois(1000, mu)
> dat.tab <- table(cnts == 0)
> dat.tab/sum(dat.tab)
FALSE  TRUE 
0.997 0.003 
In the above, the value under FALSE is the proportion of non-zero values in the data and the value under TRUE is the proportion of zeros in the data. In this example, there are no zeros in the observed data which corresponds closely to the very low proportion expected (0.003).

Model fitting or statistical analysis

JAGS

$$ \begin{align} Y_i&\sim{}P(\lambda) & (\text{response distribution})\\ log(\lambda_i)&=\eta_i & (\text{link function})\\ \eta_i&=\beta_0+\beta_1 X_i & (\text{linear predictor})\\ \beta_0, \beta_1&\sim{}\mathcal{N}(0,10000) & (\text{diffuse Bayesian prior})\\ \end{align} $$
> dat.list <- with(dat,list(Y=y, X=x,N=nrow(dat)))
> modelString="
model {
  for (i in 1:N) {
     Y[i] ~ dpois(lambda[i])
     log(lambda[i]) <- beta0 + beta1*X[i]
  }
  beta0 ~ dnorm(0,1.0E-06)
  beta1 ~ dnorm(0,1.0E-06)
} 
"
> writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS1.4.bug')
> library(R2jags)
> system.time(
+ dat.P.jags <- jags(model='../downloads/BUGSscripts/tut11.5bS1.4.bug', data=dat.list, inits=NULL,
+           param=c('beta0','beta1'),
+           n.chain=3,
+           n.iter=20000, n.thin=10, n.burnin=10000)
+   )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 105

Initializing model
   user  system elapsed 
  2.376   0.004   2.397 
> Xmat <- model.matrix(~x, dat)
> nX <- ncol(Xmat)
> dat.list1 <- with(dat,list(Y=y, X=Xmat,N=nrow(dat), nX=nX))
> modelString="
model {
  for (i in 1:N) {
     Y[i] ~ dpois(lambda[i])
     eta[i] <- inprod(beta[], X[i,])
     log(lambda[i]) <- eta[i]
  }
  for (i in 1:nX) {
    beta[i] ~ dnorm(0,1.0E-06)
  }
} 
"
> writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS1.41.bug')
> library(R2jags)
> system.time(
+ dat.P.jags1 <- jags(model='../downloads/BUGSscripts/tut11.5bS1.41.bug', data=dat.list1, inits=NULL,
+           param=c('beta'),
+           n.chain=3,
+           n.iter=20000, n.thin=10, n.burnin=10000)
+   )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 127

Initializing model
   user  system elapsed 
  1.816   0.004   1.820 
> print(dat.P.jags1)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS1.41.bug", fit using jags,
 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10
 n.sims = 3000 iterations saved
         mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
beta[1]    0.545   0.252  0.032  0.378  0.549  0.728  1.014 1.001  3000
beta[2]    0.112   0.018  0.077  0.099  0.112  0.125  0.149 1.001  3000
deviance  88.413   4.782 86.383 86.897 87.703 89.034 93.484 1.001  3000

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 11.4 and DIC = 99.9
DIC is an estimate of expected predictive error (lower deviance is better).

Or arguably better still, use a multivariate normal prior. If we have a $k$ regression parameters ($\beta_k$), then the multivariate normal priors are defined as: $$ \boldsymbol{\beta}\sim{}\mathcal{N_k}(\boldsymbol{\mu}, \mathbf{\Sigma}) $$ where $$\boldsymbol{\mu}=[E[\beta_1],E[\beta_2],...,E[\beta_k]] = \left(\begin{array}{c}0\\\vdots\\0\end{array}\right)$$ $$ \mathbf{\Sigma}=[Cov[\beta_i, \beta_j]] = \left(\begin{array}{ccc}1000^2&0&0\\0&\ddots&0\\0&0&1000^2\end{array} \right) $$ hence, along with the response and predictor matrix, we need to supply $\boldsymbol{\mu}$ (a vector of zeros) and $\boldsymbol{\Sigma}$ (a covariance matrix with $1000^2$ in the diagonals).

> Xmat <- model.matrix(~x, dat)
> nX <- ncol(Xmat)
> dat.list2 <- with(dat,list(Y=y, X=Xmat,N=nrow(dat), mu=rep(0,nX),Sigma=diag(1.0E-06,nX)))
> modelString="
model {
  for (i in 1:N) {
     Y[i] ~ dpois(lambda[i])
     eta[i] <- inprod(beta[], X[i,])
     log(lambda[i]) <- eta[i]
  }
  beta ~ dmnorm(mu[],Sigma[,])
} 
"
> writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS1.42.bug')
> library(R2jags)
> system.time(
+ dat.P.jags2 <- jags(model='../downloads/BUGSscripts/tut11.5bS1.42.bug', data=dat.list2, inits=NULL,
+           param=c('beta'),
+           n.chain=3,
+           n.iter=20000, n.thin=10, n.burnin=10000)
+   )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 130

Initializing model
   user  system elapsed 
  0.616   0.000   0.629 
> print(dat.P.jags2)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS1.42.bug", fit using jags,
 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10
 n.sims = 3000 iterations saved
         mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
beta[1]    0.560   0.250  0.066  0.386  0.569  0.734  1.027 1.001  3000
beta[2]    0.111   0.018  0.077  0.098  0.111  0.123  0.147 1.001  2900
deviance  88.315   2.004 86.378 86.907 87.706 89.027 93.746 1.002  1400

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 2.0 and DIC = 90.3
DIC is an estimate of expected predictive error (lower deviance is better).

Chain mixing and Model validation

Prior to exploring the model parameters, it is prudent to confirm that the model did indeed fit the assumptions and was an appropriate fit to the data as well as that the MCMC sampling chain was adequately mixed and the retained samples independent.

Whilst I will only demonstrate this for the logit model, the procedure would be identical for exploring the probit and clog-log models.

  • We will start by exploring the mixing of the MCMC chains via traceplots.
    > plot(as.mcmc(dat.P.jags))
    
    plot of chunk tut11.5bS2

    The chains appear well mixed and stable

  • Next we will explore correlation amongst MCMC samples.
    > autocorr.diag(as.mcmc(dat.P.jags))
    
                beta0     beta1  deviance
    Lag 0    1.000000  1.000000  1.000000
    Lag 10   0.185239  0.149437  0.018579
    Lag 50  -0.007436 -0.007997 -0.007498
    Lag 100 -0.005343 -0.016451  0.009315
    Lag 500 -0.023993 -0.018307  0.005840
    

    The level of auto-correlation at the nominated lag of 10 is higher than we would generally like. It is worth increasing the thinning rate from 10 to 50. Obviously, to support this higher thinning rate, we would also increase the number of iterations.

    > library(R2jags)
    > dat.P.jags <- jags(data=dat.list,model.file='../downloads/BUGSscripts/tut11.5bS1.4.bug',
    +                    param=c('beta0','beta1'),
    +                    n.chains=3, n.iter=100000, n.burnin=20000, n.thin=50)
    
    Compiling model graph
       Resolving undeclared variables
       Allocating nodes
       Graph Size: 105
    
    Initializing model
    
    > print(dat.P.jags)
    
    Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS1.4.bug", fit using jags,
     3 chains, each with 1e+05 iterations (first 20000 discarded), n.thin = 50
     n.sims = 4800 iterations saved
             mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
    beta0      0.544   0.258  0.018  0.378  0.546  0.722  1.028 1.001  4600
    beta1      0.112   0.019  0.076  0.099  0.112  0.124  0.150 1.001  4800
    deviance  88.424   3.821 86.371 86.893 87.687 89.125 93.921 1.002  4800
    
    For each parameter, n.eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
    
    DIC info (using the rule, pD = var(deviance)/2)
    pD = 7.3 and DIC = 95.7
    DIC is an estimate of expected predictive error (lower deviance is better).
    
    > plot(as.mcmc(dat.P.jags))
    
    plot of chunk tut11.5bS4
    > autocorr.diag(as.mcmc(dat.P.jags))
    
                 beta0     beta1  deviance
    Lag 0     1.000000  1.000000  1.000000
    Lag 50   -0.012740 -0.007513  0.007106
    Lag 250   0.007954  0.013158 -0.006200
    Lag 500  -0.001340 -0.007863 -0.007557
    Lag 2500  0.010148  0.005154 -0.004615
    

    Conclusions: the samples are now less auto-correlated and the chains are also arguably mixed a little better.

  • One very important model validation procedure is to examine a plot of residuals against predicted or fitted values (the residual plot). Ideally, residual plots should show a random scatter of points without outliers. That is, there should be no patterns in the residuals. Patterns suggest inappropriate linear predictor (or scale) and/or inappropriate residual distribution/link function.

    The residuals used in such plots should be standardized (particularly if the model incorporated any variance-covariance structures - such as an autoregressive correlation structure) Pearsons's residuals standardize residuals by division with the square-root of the variance.

    We can generate Pearson's residuals within the JAGS model. Alternatively, we could use the parameters to generate the residuals outside of JAGS. Pearson's residuals are calculated according to: $$ \varepsilon = \frac{y_i - \mu}{\sqrt{var(y)}} $$ where $\mu$ is the expected value of $Y$ ($=\lambda$ for Poisson) and $var(y)$ is the variance of $Y$ ($=\lambda$ for Poisson).
    > #extract the samples for the two model parameters
    > coefs <- dat.P.jags$BUGSoutput$sims.matrix[,1:2]
    > Xmat <- model.matrix(~x, data=dat)
    > #expected values on a log scale
    > eta<-coefs %*% t(Xmat)
    > #expected value on response scale
    > lambda <- exp(eta)
    > #Expected value and variance are both equal to lambda
    > expY <- varY <- lambda
    > #sweep across rows and then divide by lambda
    > Resid <- -1*sweep(expY,2,dat$y,'-')/sqrt(varY)
    > #plot residuals vs expected values
    > plot(apply(Resid,2,mean)~apply(eta,2,mean))
    
    plot of chunk tut11.5bS5

    There is one residual that is substantially larger in magnitude than all the others. However, there are no other patterns in the residuals.

  • Now we will compare the sum of squared residuals to the sum of squares residuals that would be expected from a Poisson distribution matching that estimated by the model. Essentially this is estimating how well the Poisson distribution, the log-link function and the linear model approximates the observed data.
    > SSres<-apply(Resid^2,1,sum)
    > 
    > set.seed(2)
    > #generate a matrix of draws from a binomial distribution
    > # the matrix is the same dimensions as pi and uses the probabilities of pi
    > YNew <- matrix(rpois(length(lambda),lambda=lambda),nrow=nrow(lambda))
    > 
    > Resid1<-(lambda - YNew)/sqrt(lambda)
    > SSres.sim<-apply(Resid1^2,1,sum)
    > mean(SSres.sim>SSres)
    
    [1] 0.505
    
    > dat.list1 <- with(dat,list(Y=y, X=x,N=nrow(dat)))
    > modelString="
    model {
      for (i in 1:N) {
        #likelihood function
        Y[i] ~ dpois(lambda[i])
        eta[i] <- beta0+beta1*X[i] #linear predictor
        log(lambda[i]) <- eta[i]   #link function
    
        #E(Y) and var(Y)
        expY[i] <- lambda[i]
        varY[i] <- lambda[i]
    
        # Calculate RSS
        Resid[i] <- (Y[i] - expY[i])/sqrt(varY[i])
        RSS[i] <- pow(Resid[i],2)
    
        #Simulate data from a Poisson distribution
        Y1[i] ~ dpois(lambda[i])
        #Calculate RSS for simulated data
        Resid1[i] <- (Y1[i] - expY[i])/sqrt(varY[i])
        RSS1[i] <-pow(Resid1[i],2) 
      }
      #Priors
      beta0 ~ dnorm(0,1.0E-06)
      beta1 ~ dnorm(0,1.0E-06)
      #Bayesian P-value
      Pvalue <- mean(sum(RSS1)>sum(RSS))
    } 
    "
    > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS6.41.bug')
    > library(R2jags)
    > system.time(
    + dat.P.jags1 <- jags(model='../downloads/BUGSscripts/tut11.5bS6.41.bug', data=dat.list1, inits=NULL,
    +           param=c('beta0','beta1','Pvalue'),
    +           n.chain=3,
    +           n.iter=100000, n.thin=50, n.burnin=20000)
    +   )
    
    Compiling model graph
       Resolving undeclared variables
       Allocating nodes
       Graph Size: 272
    
    Initializing model
    
       user  system elapsed 
      16.26    0.00   16.29 
    
    > print(dat.P.jags1)
    
    Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS6.41.bug", fit using jags,
     3 chains, each with 1e+05 iterations (first 20000 discarded), n.thin = 50
     n.sims = 4800 iterations saved
             mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
    Pvalue     0.471   0.499  0.000  0.000  0.000  1.000  1.000 1.002  1700
    beta0      0.548   0.257  0.029  0.380  0.551  0.724  1.037 1.001  4800
    beta1      0.112   0.019  0.075  0.099  0.112  0.124  0.149 1.002  4800
    deviance  88.429   3.645 86.376 86.905 87.724 89.105 93.923 1.002  3500
    
    For each parameter, n.eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
    
    DIC info (using the rule, pD = var(deviance)/2)
    pD = 6.6 and DIC = 95.1
    DIC is an estimate of expected predictive error (lower deviance is better).
    

    Conclusions: the Bayesian p-value is approximately 0.5, suggesting that there is a good fit of the model to the data.

  • Recall that the Poisson regression model assumes that variance=mean ($var=\mu\phi$ where $\phi=1$) and thus dispersion ($\phi=\frac{var}{\mu}=1$). However, we can also calculate approximately what the dispersion factor would be by using sum square of the residuals as a measure of variance and the model residual degrees of freedom as a measure of the mean (since the expected value of a Poisson distribution is the same as its degrees of freedom). $$\phi=\frac{RSS}{df}$$ where $df=n-k$ and $k$ is the number of estimated model coefficients.
    > Resid <- -1*sweep(lambda,2,dat$y,'-')/sqrt(lambda)
    > RSS<-apply(Resid^2,1,sum)
    > (df<-nrow(dat)-ncol(coefs))
    
    [1] 18
    
    > Disp <- RSS/df
    > data.frame(Median=median(Disp), Mean=mean(Disp), HPDinterval(as.mcmc(Disp)), HPDinterval(as.mcmc(Disp),p=0.5))
    
         Median  Mean lower upper lower.1 upper.1
    var1  1.042 1.114  0.93 1.447  0.9301   1.042
    
    > dat.list <- with(dat,list(Y=y, X=x,N=nrow(dat)))
    > modelString="
    model {
      for (i in 1:N) {
         Y[i] ~ dpois(lambda[i])
         eta[i] <- beta0 + beta1*X[i]
         log(lambda[i]) <- eta[i]
         expY[i] <- lambda[i]
         varY[i] <- lambda[i]
    	 Resid[i] <- (Y[i] - expY[i])/sqrt(varY[i]) 
      }
      beta0 ~ dnorm(0,1.0E-06)
      beta1 ~ dnorm(0,1.0E-06)
      RSS <- sum(pow(Resid,2))
      df <- N-2
      phi <- RSS/df
    } 
    "
    > writeLines(modelString,con='tut11.5bS1.40.bug')
    > library(R2jags)
    > system.time(
    + dat.P.jags <- jags(model='tut11.5bS1.40.bug', data=dat.list, inits=NULL,
    +           param=c('beta0','beta1','phi'),
    +           n.chain=3,
    +           n.iter=100000, n.thin=50, n.burnin=20000)
    +   )
    
    Compiling model graph
       Resolving undeclared variables
       Allocating nodes
       Graph Size: 171
    
    Initializing model
    
       user  system elapsed 
     18.177   0.008  18.259 
    
    > print(dat.P.jags)
    
    Inference for Bugs model at "tut11.5bS1.40.bug", fit using jags,
     3 chains, each with 1e+05 iterations (first 20000 discarded), n.thin = 50
     n.sims = 4800 iterations saved
             mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
    beta0      0.548   0.255  0.031  0.377  0.553  0.727  1.043 1.001  4800
    beta1      0.112   0.019  0.076  0.099  0.112  0.124  0.149 1.001  4800
    phi        1.109   0.413  0.934  0.979  1.048  1.164  1.550 1.001  4400
    deviance  88.389   3.915 86.377 86.909 87.689 89.084 93.530 1.001  4800
    
    For each parameter, n.eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
    
    DIC info (using the rule, pD = var(deviance)/2)
    pD = 7.7 and DIC = 96.1
    DIC is an estimate of expected predictive error (lower deviance is better).
    

    The dispersion statistic $\phi$ is close to 1 and thus there is no evidence that the data were overdispersed. The Poisson distribution was therefore appropriate.

Exploring the model parameters, test hypotheses

If there was any evidence that the assumptions had been violated or the model was not an appropriate fit, then we would need to reconsider the model and start the process again. In this case, there is no evidence that the test will be unreliable so we can proceed to explore the test statistics.

> print(dat.P.jags)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS1.4.bug", fit using jags,
 3 chains, each with 50000 iterations (first 10000 discarded), n.thin = 50
 n.sims = 2400 iterations saved
         mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
beta0      0.554   0.254  0.034  0.388  0.564  0.726  1.037 1.003   780
beta1      0.111   0.018  0.075  0.099  0.111  0.123  0.148 1.003   930
deviance  88.471   5.024 86.361 86.857 87.698 89.021 93.951 1.001  2400

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 12.6 and DIC = 101.1
DIC is an estimate of expected predictive error (lower deviance is better).
> 
> library(plyr)
> adply(dat.P.jags$BUGSoutput$sims.matrix[,1:2], 2, function(x) {
+   data.frame(Median=median(x), Mean=mean(x), HPDinterval(as.mcmc(x)), HPDinterval(as.mcmc(x),p=0.5))
+ })
     X1 Median   Mean   lower  upper lower.1 upper.1
1 beta0 0.5638 0.5537 0.02371 1.0238  0.4412  0.7749
2 beta1 0.1114 0.1114 0.07362 0.1473  0.0980  0.1218

Actually, many find it more palatable to express the estimates in the original scale of the observations rather than on a log scale.

> library(plyr)
> adply(exp(dat.P.jags$BUGSoutput$sims.matrix[,1:2]), 2, function(x) {
+   data.frame(Median=median(x), Mean=mean(x), HPDinterval(as.mcmc(x)), HPDinterval(as.mcmc(x),p=0.5))
+ })
     X1 Median  Mean  lower upper lower.1 upper.1
1 beta0  1.757 1.796 0.9751 2.688   1.403   1.976
2 beta1  1.118 1.118 1.0764 1.159   1.102   1.128

Conclusions: We would reject the null hypothesis of no effect of $x$ on $y$. An increase in x is associated with a significant linear increase (positive slope) in the abundance of $y$. Every 1 unit increase in $x$ results in a log 0.1117 unit increase in $y$. We usually express this in terms of abundance rather than log abundance, so every 1 unit increase in $x$ results in a ($e^{0.1117}=1.1182$) 1.1182 unit increase in the abundance of $y$.

Further explorations of the trends

A measure of the strength of the relationship can be obtained according to: $$R^2 = 1 - \frac{RSS_{model}}{RSS_{null}}$$

> Xmat <- model.matrix(~x, data=dat)
> #expected values on a log scale
> eta<-coefs %*% t(Xmat)
> #expected value on response scale
> lambda <- exp(eta)
> #calculate the raw SS residuals
> SSres <- apply((-1*(sweep(lambda,2,dat$y,'-')))^2,1,sum)
> SSres.null <- sum((dat$y - mean(dat$y))^2)
> #OR 
> SSres.null <- crossprod(dat$y - mean(dat$y))
> #calculate the model r2
> 1-mean(SSres)/SSres.null
       [,1]
[1,] 0.6555

Conclusions: 65.55% of the variation in $y$ abundance can be explained by its relationship with $x$.

> dat.list <- with(dat,list(Y=y, X=x,N=nrow(dat)))
> modelString="
model {
  for (i in 1:N) {
     Y[i] ~ dpois(lambda[i])
     eta[i] <- beta0 + beta1*X[i]
     log(lambda[i]) <- eta[i]
     res[i] <- Y[i] - lambda[i]
     resnull[i] <- Y[i] - meanY
  }
  meanY <- mean(Y)
  beta0 ~ dnorm(0,1.0E-06)
  beta1 ~ dnorm(0,1.0E-06)
  RSS <- sum(res^2)
  RSSnull <- sum(resnull^2)
  r2 <- 1-RSS/RSSnull
} 
"
> writeLines(modelString,con='tut11.5bS1.40.bug')
> library(R2jags)
> system.time(
+ dat.P.jags <- jags(model='tut11.5bS1.40.bug', data=dat.list, inits=NULL,
+           param=c('beta0','beta1','r2'),
+           n.chain=3,
+           n.iter=100000, n.thin=50, n.burnin=20000)
+   )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 150

Initializing model
   user  system elapsed 
 15.413   0.004  15.458 
> print(dat.P.jags)
Inference for Bugs model at "tut11.5bS1.40.bug", fit using jags,
 3 chains, each with 1e+05 iterations (first 20000 discarded), n.thin = 50
 n.sims = 4800 iterations saved
         mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
beta0      0.548   0.255  0.045  0.374  0.553  0.724  1.043 1.001  4800
beta1      0.112   0.019  0.074  0.099  0.112  0.124  0.148 1.001  4800
r2         0.655   0.065  0.512  0.641  0.673  0.691  0.701 1.003  4800
deviance  88.436   4.089 86.365 86.904 87.726 89.139 93.819 1.005  4800

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 8.4 and DIC = 96.8
DIC is an estimate of expected predictive error (lower deviance is better).

Finally, we will create a summary plot.

> par(mar = c(4, 5, 0, 0))
> plot(y ~ x, data = dat, type = "n", ann = F, axes = F)
> points(y ~ x, data = dat, pch = 16)
> xs <- seq(min(dat$x, na.rm = TRUE), max(dat$x, na.rm = TRUE),
+     l = 1000)
> Xmat <- model.matrix(~xs)
> eta <- coefs %*% t(Xmat)
> ys <- exp(eta)
> library(plyr)
> library(coda)
> data.tab <- adply(ys, 2, function(x) {
+     data.frame(Median = median(x), HPDinterval(as.mcmc(x)))
+ })
> data.tab <- cbind(x = xs, data.tab)
> points(Median ~ x, data = data.tab, col = "black", type = "l")
> lines(lower ~ x, data = data.tab, col = "black", type = "l",
+     lty = 2)
> lines(upper ~ x, data = data.tab, col = "black", type = "l",
+     lty = 2)
> 
> axis(1)
> mtext("X", 1, cex = 1.5, line = 3)
> axis(2, las = 2)
> mtext("Abundance of Y", 2, cex = 1.5, line = 3)
> box(bty = "l")
plot of chunk tut11.5bS1.10

Defining full log-likelihood function

Now lets try it by specifying log-likelihood and the zero trick. When applying this trick, we need to manually calculate the deviance as the inbuilt deviance will be based on the log-likelihood of estimating the zeros (as part of the zero trick) rather than the deviance of the intended model..

The one advantage of the zero trick is that the Deviance and thus DIC, AIC provided by R2jags will be incorrect. Hence, they too need to be manually defined within jags. I suspect that the AIC calculation I have used is incorrect...

> Xmat <- model.matrix(~x, dat)
> nX <- ncol(Xmat)
> dat.list2 <- with(dat,list(Y=y, X=Xmat,N=nrow(dat), mu=rep(0,nX),
+                   Sigma=diag(1.0E-06,nX), zeros=rep(0,nrow(dat)), C=10000))
> modelString="
model {
  for (i in 1:N) {
     zeros[i] ~ dpois(zeros.lambda[i])
     zeros.lambda[i] <- -ll[i] + C     
     ll[i] <- Y[i]*log(lambda[i]) - lambda[i] - loggam(Y[i]+1)
     eta[i] <- inprod(beta[], X[i,])
     log(lambda[i]) <- eta[i]
    llm[i] <- Y[i]*log(meanlambda) - meanlambda - loggam(Y[i]+1)
  }
  meanlambda <- mean(lambda)
  beta ~ dmnorm(mu[],Sigma[,])
  dev <- sum(-2*ll)
  pD <- mean(dev)-sum(-2*llm)
  AIC <- min(dev+(2*pD))
} 
"
> writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS1.42.bug')
> library(R2jags)
> system.time(
+ dat.P.jags3 <- jags(model='../downloads/BUGSscripts/tut11.5bS1.42.bug', data=dat.list2, inits=NULL,
+           param=c('beta','dev','AIC'),
+           n.chain=3,
+           n.iter=50000, n.thin=50, n.burnin=10000)
+   )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 353

Initializing model
   user  system elapsed 
  1.896   0.004   1.900 
> print(dat.P.jags3)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS1.42.bug", fit using jags,
 3 chains, each with 50000 iterations (first 10000 discarded), n.thin = 50
 n.sims = 2400 iterations saved
           mu.vect sd.vect      2.5%       25%       50%       75%     97.5%  Rhat n.eff
AIC      1.358e+01   4.347 9.638e+00 1.063e+01 1.215e+01 1.502e+01 2.587e+01 1.000  2400
beta[1]  5.310e-01   0.257 2.400e-02 3.550e-01 5.350e-01 7.060e-01 1.017e+00 1.002  1500
beta[2]  1.130e-01   0.019 7.600e-02 1.000e-01 1.130e-01 1.260e-01 1.480e-01 1.002  1500
dev      8.834e+01   1.966 8.638e+01 8.692e+01 8.775e+01 8.912e+01 9.367e+01 1.001  2400
deviance 4.001e+05   1.966 4.001e+05 4.001e+05 4.001e+05 4.001e+05 4.001e+05 1.000     1

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 1.9 and DIC = 400090.3
DIC is an estimate of expected predictive error (lower deviance is better).

Negative binomial

The following equations are provided since in Bayesian modelling, it is occasionally necessary to directly define the log-likelihood calculations (particularly for zero-inflated models and other mixture models). Feel free initially gloss over these equations until such time when your models require them ;)

prob, sizealpha, beta (gamma-poisson)
$$ f(y|r,p)=\frac{\Gamma(y+r)}{\Gamma(r)\Gamma(y+1)}p^r(1-p)^y $$ where $p$ is the probability of $y$ successes until $r$ failures. If, we make $p=\frac{size}{size+\mu}$, then we can define the function in terms of $\mu$ $$ f(y|r,\mu)=\frac{\Gamma(y+r)}{\Gamma(r)\Gamma(y+1)}\left(\frac{r}{\mu+r}\right)^r\left(1-\frac{r}{\mu+r}\right)^y $$ where $p$ is the probability of $y$ successes until $r$ failures. $$\mu = \frac{r(1-p)}{p}\\ E(Y)=\mu, Var(Y)=\mu+\frac{\mu^2}{r} $$ $$ f(y \mid \alpha, \beta) = \frac{\beta^{\alpha}y^{\alpha-1}e^{-\beta y}}{\Gamma(\alpha)} $$ where $$ l(\lambda;y)=\sum\limits^n_{i=1} log\Gamma(y_i + \alpha) - log\Gamma(y_i+1) - log\Gamma(\alpha) + \alpha(log(\beta_i) - log(\beta_i+1)) - y_i.log(\beta_i+1) $$

Scenario and Data

Lets say we wanted to model the abundance of an item ($y$) against a continuous predictor ($x$). As this section is mainly about the generation of artificial data (and not specifically about what to do with the data), understanding the actual details are optional and can be safely skipped. Consequently, I have folded (toggled) this section away.

Random data incorporating the following trends (effect parameters)
  • the sample size = 20
  • the continuous $x$ variable is a random uniform spread of measurements between 1 and 20
  • the rate of change in log $y$ per unit of $x$ (slope) = 0.1.
  • the value of $x$ when log$y$ equals 0 (when $y$=1)
  • to generate the values of $y$ expected at each $x$ value, we evaluate the linear predictor (created by calculating the outer product of the model matrix and the regression parameters). These expected values are then transformed into a scale mapped by (0,$\infty$) by using the log function $e^{linear~predictor}$
  • finally, we generate $y$ values by using the expected $y$ values ($\lambda$) as probabilities when drawing random numbers from a Poisson distribution. This step adds random noise to the expected $y$ values and returns only 0's and positive integers.
> set.seed(37)  #16 #35
> # The number of samples
> n.x <- 20
> # Create x values that at uniformly distributed throughout the rate of 1 to 20
> x <- sort(runif(n = n.x, min = 1, max = 20))
> mm <- model.matrix(~x)
> intercept <- 0.6
> slope = 0.1
> # The linear predictor
> linpred <- mm %*% c(intercept, slope)
> # Predicted y values
> lambda <- exp(linpred)
> # Add some noise and make binomial
> y <- rnbinom(n = n.x, mu = lambda, size = 1)
> dat.nb <- data.frame(y, x)

Exploratory data analysis and initial assumption checking

The assumptions are:
  1. All of the observations are independent - this must be addressed at the design and collection stages
  2. The response variable (and thus the residuals) should be matched by an appropriate distribution (in the case of positive integers response - a Poisson is appropriate).
  3. All observations are equally influential in determining the trends - or at least no observations are overly influential. This is most effectively diagnosed via residuals and other influence indices and is very difficult to diagnose prior to analysis
  4. the relationship between the linear predictor (right hand side of the regression formula) and the link function should be linear. A scatterplot with smoother can be useful for identifying possible non-linearity.
  5. Dispersion is either 1 or overdispersion is otherwise accounted for in the model
  6. The number of zeros is either not excessive or else they are specifically addressed by the model

When counts are all very large (not close to 0) and their ranges do not span orders of magnitude, they take on very Gaussian properties (symmetrical distribution and variance independent of the mean). Given that models based on the Gaussian distribution are more optimized and recognized than Generalized Linear Models, it can be prudent to adopt Gaussian models for such data. Hence it is a good idea to first explore whether a Poisson or Negative Binomial model is likely to be more appropriate than a standard Gaussian model.

Recall from Poisson regression, there are five main potential models that we could consider fitting to these data.

There are five main potential models we could consider fitting to these data:

  1. Ordinary least squares regression (general linear model) - assumes normality of residuals
  2. Poisson regression - assumes mean=variance (dispersion=1)
  3. Quasi-poisson regression - a general solution to overdispersion. Assumes variance is a function of mean, dispersion estimated, however likelihood based statistics unavailable
  4. Negative binomial regression - a specific solution to overdispersion caused by clumping (due to an unmeasured latent variable). Scaling factor ($\omega$) is estimated along with the regression parameters.
  5. Zero-inflation model - a specific solution to overdispersion caused by excessive zeros (due to an unmeasured latent variable). Mixture of binomial and Poisson models.

Confirm non-normality and explore clumping

Check the distribution of the $y$ abundances

> hist(dat.nb$y)
plot of chunk tut11.5bS3.2
> boxplot(dat.nb$y, horizontal = TRUE)
> rug(jitter(dat.nb$y))
plot of chunk tut11.5bS3.2
There is definitely signs of non-normality that would warrant Poisson models. Further to that, the rug on the boxplot is suggestive of clumping (observations are not evenly distributed and well spaced out). Together the, skewness and clumping could point to an overdispersed Poisson. Negative binomial models are one of the most effective means of modeling such data.

Confirm linearity

Lets now explore linearity by creating a histogram of the predictor variable ($x$) and a scatterplot of the relationship between the response ($y$) and the predictor ($x$)

> hist(dat.nb$x)
plot of chunk tut11.5bS3.3
> # now for the scatterplot
> plot(y ~ x, dat.nb, log = "y")
> with(dat.nb, lines(lowess(y ~ x)))
plot of chunk tut11.5bS3.3

Conclusions: the predictor ($x$) does not display any skewness or other issues that might lead to non-linearity. The lowess smoother on the scatterplot does not display major deviations from a straight line and thus linearity is satisfied. Violations of linearity could be addressed by either:

  • define a non-linear linear predictor (such as a polynomial, spline or other non-linear function)
  • transform the scale of the predictor variables

Explore zero inflation

Although we have already established that there are few zeros in the data (and thus overdispersion is unlikely to be an issue), we can also explore this by comparing the number of zeros in the data to the number of zeros that would be expected from a Poisson distribution with a mean equal to the mean count of the data.

> # proportion of 0's in the data
> dat.nb.tab <- table(dat.nb$y == 0)
> dat.nb.tab/sum(dat.nb.tab)
FALSE  TRUE 
 0.95  0.05 
> # proportion of 0's expected from a Poisson distribution
> mu <- mean(dat.nb$y)
> cnts <- rpois(1000, mu)
> dat.nb.tabE <- table(cnts == 0)
> dat.nb.tabE/sum(dat.nb.tabE)
FALSE 
    1 
In the above, the value under FALSE is the proportion of non-zero values in the data and the value under TRUE is the proportion of zeros in the data. In this example, the proportion of zeros observed is similar to the proportion expected. Indeed, there was only a single zero observed. Hence it is likely that if there is overdispersion it is unlikely to be due to excessive zeros.

Model fitting or statistical analysis

The boxplot of $y$ with the axis rug suggested that there might be some clumping (possibly due to some other unmeasured influence). It is therefore likely that the data are overdispersed.

> dat.nb.list <- with(dat.nb,list(Y=y, X=x,N=nrow(dat.nb)))
> modelString="
model {
  for (i in 1:N) {
     Y[i] ~ dpois(lambda[i])
     eta[i] <- beta0 + beta1*X[i]
     log(lambda[i]) <- eta[i]
  }
  beta0 ~ dnorm(0,1.0E-06)
  beta1 ~ dnorm(0,1.0E-06)
} 
"
> writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS3.4.bug')
> library(R2jags)
> system.time(
+ dat.nb.P.jags <- jags(model='../downloads/BUGSscripts/tut11.5bS3.4.bug', data=dat.nb.list, inits=NULL,
+           param=c('beta0','beta1'),
+           n.chain=3,
+           n.iter=20000, n.thin=10, n.burnin=10000)
+   )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 105

Initializing model
   user  system elapsed 
  1.800   0.000   1.803 
> print(dat.nb.P.jags)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS3.4.bug", fit using jags,
 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10
 n.sims = 3000 iterations saved
         mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
beta0      0.557   0.270   0.010   0.381   0.566   0.738   1.067 1.002  1800
beta1      0.111   0.020   0.074   0.098   0.111   0.124   0.150 1.001  2700
deviance 135.376   4.173 133.279 133.806 134.600 136.036 140.662 1.009  3000

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 8.7 and DIC = 144.1
DIC is an estimate of expected predictive error (lower deviance is better).
> 
> 
> #extract the samples for the two model parameters
> coefs <- dat.nb.P.jags$BUGSoutput$sims.matrix[,1:2]
> Xmat <- model.matrix(~x, data=dat)
> #expected values on a log scale
> eta<-coefs %*% t(Xmat)
> #expected value on response scale
> lambda <- exp(eta)
> 
> Resid <- -1*sweep(lambda,2,dat.nb$y, '-')/sqrt(lambda)
> RSS <- apply(Resid^2, 1, sum)
> Disp <- RSS/(nrow(dat.nb)-ncol(coefs))
> data.frame(Median=median(Disp), Mean=mean(Disp), HPDinterval(as.mcmc(Disp)), HPDinterval(as.mcmc(Disp),p=0.5))
     Median  Mean lower upper lower.1 upper.1
var1  3.131 3.235 2.628 4.018   2.747   3.199

The dispersion parameter was 3.2348, indicating over three times more variability than would be expected for a Poisson distribution. The data are thus over-dispersed. Given that this is unlikely to be due to zero inflation and the rug plot did suggest some level of clumping, negative binomial regression would seem reasonable.

Model fitting or statistical analysis

JAGS

$$ \begin{align} Y_i&\sim{}NB(p_i,size) & (\text{response distribution})\\ p_i&=size/(size+\lambda_i)\\ log(\lambda_i)&=\eta_i & (\text{link function})\\ \eta_i&=\beta_0+\beta_1 X_i & (\text{linear predictor})\\ \beta_0, \beta_1&\sim{}\mathcal{N}(0,10000) & (\text{diffuse Bayesian prior})\\ size &\sim{}\mathcal{U}(0.001,1000)\\ \end{align} $$
> dat.nb.list <- with(dat.nb,list(Y=y, X=x,N=nrow(dat.nb)))
> modelString="
model {
  for (i in 1:N) {
     Y[i] ~ dnegbin(p[i],size)
     p[i] <- size/(size+lambda[i])
     log(lambda[i]) <- beta0 + beta1*X[i]
  }
  beta0 ~ dnorm(0,1.0E-06)
  beta1 ~ dnorm(0,1.0E-06)
  size ~ dunif(0.001,1000)
  theta <- pow(1/mean(p),2)
  scaleparam <- mean((1-p)/p) 
} 
"
> writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS4.1.bug')
> library(R2jags)
> system.time(
+ dat.NB.jags <- jags(model='../downloads/BUGSscripts/tut11.5bS4.1.bug', data=dat.nb.list, inits=NULL,
+           param=c('beta0','beta1', 'size','theta','scaleparam'),
+           n.chain=3,
+           n.iter=20000, n.thin=10, n.burnin=10000)
+   )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 157

Initializing model
   user  system elapsed 
 17.361   0.008  17.391 
> Xmat <- model.matrix(~x, dat.nb)
> nX <- ncol(Xmat)
> dat.nb.list1 <- with(dat.nb,list(Y=y, X=Xmat,N=nrow(dat.nb), nX=nX))
> modelString="
model {
  for (i in 1:N) {
     Y[i] ~ dnegbin(p[i],size)
     p[i] <- size/(size+lambda[i])
     eta[i] <- inprod(beta[], X[i,])
     log(lambda[i]) <- max(-20,min(20,eta[i]))
  }
  for (i in 1:nX) {
    beta[i] ~ dnorm(0,1.0E-06)
  }
  size ~ dunif(0.001,10000)
} 
"
> writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS4.11.bug')
> library(R2jags)
> system.time(
+ dat.NB.jags1 <- jags(model='../downloads/BUGSscripts/tut11.5bS4.11.bug', data=dat.nb.list1, inits=NULL,
+           param=c('beta'),
+           n.chain=3,
+           n.iter=20000, n.thin=10, n.burnin=10000)
+   )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 212

Initializing model
   user  system elapsed 
 18.509   0.004  18.559 
> print(dat.NB.jags1)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS4.11.bug", fit using jags,
 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10
 n.sims = 3000 iterations saved
         mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
beta[1]    0.744   0.411  -0.033   0.464   0.734   1.007   1.565 1.001  3000
beta[2]    0.097   0.033   0.033   0.075   0.097   0.118   0.162 1.001  2200
deviance 113.080   2.608 110.119 111.204 112.391 114.267 119.579 1.001  3000

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 3.4 and DIC = 116.5
DIC is an estimate of expected predictive error (lower deviance is better).

Or arguably better still, use a multivariate normal prior. If we have a $k$ regression parameters ($\beta_k$), then the multivariate normal priors are defined as: $$ \boldsymbol{\beta}\sim{}\mathcal{N_k}(\boldsymbol{\mu}, \mathbf{\Sigma}) $$ where $$\boldsymbol{\mu}=[E[\beta_1],E[\beta_2],...,E[\beta_k]] = \left(\begin{array}{c}0\\\vdots\\0\end{array}\right)$$ $$ \mathbf{\Sigma}=[Cov[\beta_i, \beta_j]] = \left(\begin{array}{ccc}1000^2&0&0\\0&\ddots&0\\0&0&1000^2\end{array} \right) $$ hence, along with the response and predictor matrix, we need to supply $\boldsymbol{\mu}$ (a vector of zeros) and $\boldsymbol{\Sigma}$ (a covariance matrix with $1000^2$ in the diagonals).

> Xmat <- model.matrix(~x, dat.nb)
> nX <- ncol(Xmat)
> dat.nb.list2 <- with(dat.nb,list(Y=y, X=Xmat,N=nrow(dat.nb), mu=rep(0,nX),Sigma=diag(1.0E-06,nX)))
> modelString="
model {
  for (i in 1:N) {
     Y[i] ~ dnegbin(p[i],size)
     p[i] <- size/(size+lambda[i])
     eta[i] <- inprod(beta[], X[i,])
     log(lambda[i]) <- eta[i]
  }
  beta ~ dmnorm(mu[],Sigma[,])
  size ~ dunif(0.001,10000)
} 
"
> writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS4.12.bug')
> library(R2jags)
> system.time(
+ dat.NB.jags2 <- jags(model='../downloads/BUGSscripts/tut11.5bS4.12.bug', data=dat.nb.list2, inits=NULL,
+           param=c('beta', size),
+           n.chain=3,
+           n.iter=20000, n.thin=10, n.burnin=10000)
+   )
Error: object 'size' not found
Timing stopped at: 0 0 0.001 
> print(dat.NB.jags2)
Error: error in evaluating the argument 'x' in selecting a method for function 'print': Error: object 'dat.NB.jags2' not found

Chain mixing and Model validation

Prior to exploring the model parameters, it is prudent to confirm that the model did indeed fit the assumptions and was an appropriate fit to the data as well as that the MCMC sampling chain was adequately mixed and the retained samples independent.

  • We will start by exploring the mixing of the MCMC chains via traceplots.
    > plot(as.mcmc(dat.NB.jags))
    
    plot of chunk tut11.5bS4.2
    plot of chunk tut11.5bS4.2

    The chains appear well mixed and stable

  • Next we will explore correlation amongst MCMC samples.
    > autocorr.diag(as.mcmc(dat.NB.jags))
    
               beta0   beta1   deviance scaleparam     theta
    Lag 0   1.000000 1.00000  1.0000000  1.0000000  1.000000
    Lag 10  0.319600 0.34384  0.0679443  0.0137109 -0.002878
    Lag 50  0.023651 0.02475 -0.0159165 -0.0005774 -0.009582
    Lag 100 0.004577 0.01310 -0.0029059  0.0169330  0.035428
    Lag 500 0.018486 0.01216  0.0000334 -0.0208298 -0.005064
    

    The level of auto-correlation at the nominated lag of 10 is higher than we would generally like. It is worth increasing the thinning rate from 10 to 50. Obviously, to support this higher thinning rate, we would also increase the number of iterations. Typically for a negative binomial, it is worth having a large burnin (approximately half of the iterations).

    > library(R2jags)
    > dat.NB.jags <- jags(data=dat.nb.list,model.file='../downloads/BUGSscripts/tut11.5bS4.1.bug',
    +                    param=c('beta0','beta1','size'),
    +                    n.chains=3, n.iter=50000, n.burnin=25000, n.thin=50)
    
    Compiling model graph
       Resolving undeclared variables
       Allocating nodes
       Graph Size: 157
    
    Initializing model
    
    > print(dat.NB.jags)
    
    Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS4.1.bug", fit using jags,
     3 chains, each with 50000 iterations (first 25000 discarded), n.thin = 50
     n.sims = 1500 iterations saved
             mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
    beta0      0.750   0.414  -0.023   0.470   0.742   1.013   1.584 1.001  1400
    beta1      0.095   0.033   0.029   0.073   0.096   0.117   0.159 1.001  1500
    size       3.121   2.013   1.005   1.904   2.607   3.749   8.498 1.002  1300
    deviance 113.070   2.617 110.124 111.140 112.370 114.250 119.675 1.003  1500
    
    For each parameter, n.eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
    
    DIC info (using the rule, pD = var(deviance)/2)
    pD = 3.4 and DIC = 116.5
    DIC is an estimate of expected predictive error (lower deviance is better).
    
    > plot(as.mcmc(dat.NB.jags))
    
    plot of chunk tut11.5bS4.4
    > autocorr.diag(as.mcmc(dat.NB.jags))
    
                 beta0     beta1  deviance      size
    Lag 0     1.000000  1.000000  1.000000  1.000000
    Lag 50   -0.023132 -0.020551  0.007207 -0.031935
    Lag 250   0.003990  0.009005 -0.017226  0.037372
    Lag 500   0.029642  0.034866  0.052578  0.036684
    Lag 2500  0.000988  0.029825  0.006859 -0.001861
    

    Conclusions: the samples are now less auto-correlated. Ideally, we should probably collect even more samples. Whilst the traceplots are reasonably noisy, there is more of a signal or pattern than there ideally should be.

  • We now explore the goodness of fit of the models via the residuals and deviance. We could calculate the Pearsons's residuals within the JAGS model. Alternatively, we could use the parameters to generate the residuals outside of JAGS.
    > #extract the samples for the two model parameters
    > coefs <- dat.NB.jags$BUGSoutput$sims.matrix[,1:2]
    > size <- dat.NB.jags$BUGSoutput$sims.matrix[,'size']
    > Xmat <- model.matrix(~x, data=dat.nb)
    > #expected values on a log scale
    > eta<-coefs %*% t(Xmat)
    > #expected value on response scale
    > lambda <- exp(eta)
    > varY <- lambda + (lambda^2)/size
    > #sweep across rows and then divide by lambda
    > Resid <- -1*sweep(lambda,2,dat.nb$y,'-')/sqrt(varY)
    > #plot residuals vs expected values
    > plot(apply(Resid,2,mean)~apply(eta,2,mean))
    
    plot of chunk tut11.5bS4.5

    There are no real patterns in the residuals.

  • Now we will compare the sum of squared residuals to the sum of squares residuals that would be expected from a Poisson distribution matching that estimated by the model. Essentially this is estimating how well the Poisson distribution, the log-link function and the linear model approximates the observed data.
    > SSres<-apply(Resid^2,1,sum)
    > 
    > set.seed(2)
    > #generate a matrix of draws from a negative binomial distribution
    > # the matrix is the same dimensions as pi and uses the probabilities of pi
    > YNew <- matrix(rnbinom(length(lambda),mu=lambda, size=size),nrow=nrow(lambda))
    > Resid1<-(lambda - YNew)/sqrt(varY)
    > SSres.sim<-apply(Resid1^2,1,sum)
    > mean(SSres.sim>SSres)
    
    [1] 0.408
    
    > Xmat <- model.matrix(~x, dat.nb)
    > nX <- ncol(Xmat)
    > dat.nb.list2 <- with(dat.nb,list(Y=y, X=Xmat,N=nrow(dat.nb), mu=rep(0,nX),Sigma=diag(1.0E-06,nX)))
    > modelString="
    model {
      for (i in 1:N) {
         Y[i] ~ dnegbin(p[i],size)
         p[i] <- size/(size+lambda[i])
         eta[i] <- inprod(beta[], X[i,])
         log(lambda[i]) <- eta[i]
    
         Y1[i] ~ dnegbin(p[i],size)
    	 varY[i] <- lambda[i] + pow(lambda[i],2)/size
    	 Resid[i] <- (Y[i] - lambda[i])/sqrt(varY[i])
         Resid1[i] <- (Y1[i] - lambda[i])/sqrt(varY[i])
         RSS[i] <- pow(Resid[i],2)
         RSS1[i] <-pow(Resid1[i],2) 
      }
      beta ~ dmnorm(mu[],Sigma[,])
      size ~ dunif(0.001,10000)
      Pvalue <- mean(sum(RSS1)>sum(RSS))
    } 
    "
    > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS4.61.bug')
    > library(R2jags)
    > system.time(
    + dat.NB.jags3 <- jags(model='../downloads/BUGSscripts/tut11.5bS4.61.bug', data=dat.nb.list2, inits=NULL,
    +           param=c('beta','Pvalue'),
    +           n.chain=3,
    +           n.iter=20000, n.thin=10, n.burnin=10000)
    +   )
    
    Compiling model graph
       Resolving undeclared variables
       Allocating nodes
       Graph Size: 400
    
    Initializing model
    
       user  system elapsed 
      7.453   0.004   7.463 
    
    > print(dat.NB.jags3)
    
    Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS4.61.bug", fit using jags,
     3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10
     n.sims = 3000 iterations saved
             mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
    Pvalue     0.419   0.493   0.000   0.000   0.000   1.000   1.000 1.002  1900
    beta[1]    0.724   0.418  -0.080   0.448   0.722   1.002   1.563 1.001  3000
    beta[2]    0.098   0.033   0.034   0.076   0.097   0.120   0.163 1.001  3000
    deviance 113.027   2.565 110.122 111.163 112.395 114.184 119.860 1.001  3000
    
    For each parameter, n.eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
    
    DIC info (using the rule, pD = var(deviance)/2)
    pD = 3.3 and DIC = 116.3
    DIC is an estimate of expected predictive error (lower deviance is better).
    

    Conclusions: the Bayesian p-value is not far from 0.5, suggesting that there is a good fit of the model to the data.

Exploring the model parameters, test hypotheses

If there was any evidence that the assumptions had been violated or the model was not an appropriate fit, then we would need to reconsider the model and start the process again. In this case, there is no evidence that the test will be unreliable so we can proceed to explore the test statistics.

As with most Bayesian models, it is best to base conclusions on medians rather than means.

> print(dat.NB.jags)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS4.1.bug", fit using jags,
 3 chains, each with 50000 iterations (first 25000 discarded), n.thin = 50
 n.sims = 1500 iterations saved
         mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
beta0      0.750   0.414  -0.023   0.470   0.742   1.013   1.584 1.001  1400
beta1      0.095   0.033   0.029   0.073   0.096   0.117   0.159 1.001  1500
size       3.121   2.013   1.005   1.904   2.607   3.749   8.498 1.002  1300
deviance 113.070   2.617 110.124 111.140 112.370 114.250 119.675 1.003  1500

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 3.4 and DIC = 116.5
DIC is an estimate of expected predictive error (lower deviance is better).
> 
> library(plyr)
> adply(dat.NB.jags$BUGSoutput$sims.matrix, 2, function(x) {
+   data.frame(Median=median(x), Mean=mean(x), HPDinterval(as.mcmc(x)), HPDinterval(as.mcmc(x),p=0.5))
+ })
        X1    Median      Mean     lower    upper   lower.1  upper.1
1    beta0   0.74245   0.75004  -0.02279   1.5842   0.46918   1.0095
2    beta1   0.09635   0.09519   0.03322   0.1607   0.07262   0.1154
3 deviance 112.36982 113.06951 109.89529 118.1002 110.19631 112.5739
4     size   2.60655   3.12087   0.70273   6.7080   1.42865   2.9849

Actually, many find it more palatable to express the estimates in the original scale of the observations rather than on a log scale.

> library(plyr)
> adply(exp(dat.NB.jags$BUGSoutput$sims.matrix[,1:2]), 2, function(x) {
+   data.frame(Median=median(x), Mean=mean(x), HPDinterval(as.mcmc(x)), HPDinterval(as.mcmc(x),p=0.5))
+ })
     X1 Median  Mean lower upper lower.1 upper.1
1 beta0  2.101 2.312 0.856 4.445   1.373   2.436
2 beta1  1.101 1.100 1.034 1.174   1.075   1.122

Conclusions: We would reject the null hypothesis of no effect of $x$ on $y$. An increase in x is associated with a significant linear increase (positive slope) in the abundance of $y$. Every 1 unit increase in $x$ results in a log 0.0963 unit increase in $y$. We usually express this in terms of abundance rather than log abundance, so every 1 unit increase in $x$ results in a ($e^{0.0963}=1.1011$) 1.1011 unit increase in the abundance of $y$.

Further explorations of the trends

A measure of the strength of the relationship can be obtained according to: $$R^2 = 1 - \frac{RSS_{model}}{RSS_{null}}$$

> Xmat <- model.matrix(~x, data=dat.nb)
> #expected values on a log scale
> eta<-coefs %*% t(Xmat)
> #expected value on response scale
> lambda <- exp(eta)
> #calculate the raw SS residuals
> SSres <- apply((-1*(sweep(lambda,2,dat.nb$y,'-')))^2,1,sum)
> SSres.null <- sum((dat.nb$y - mean(dat.nb$y))^2)
> #OR 
> SSres.null <- crossprod(dat.nb$y - mean(dat.nb$y))
> #calculate the model r2
> 1-mean(SSres)/SSres.null
       [,1]
[1,] 0.2617

Conclusions: 26.17% of the variation in $y$ abundance can be explained by its relationship with $x$.

> Xmat <- model.matrix(~x, dat.nb)
> nX <- ncol(Xmat)
> dat.nb.list2 <- with(dat.nb,list(Y=y, X=Xmat,N=nrow(dat.nb), mu=rep(0,nX),Sigma=diag(1.0E-06,nX)))
> modelString="
model {
  for (i in 1:N) {
     Y[i] ~ dnegbin(p[i],size)
     p[i] <- size/(size+lambda[i])
     eta[i] <- inprod(beta[], X[i,])
     log(lambda[i]) <- eta[i]

     res[i] <- Y[i] - lambda[i]
     resnull[i] <- Y[i] - meanY
  }
  meanY <- mean(Y)
  beta ~ dmnorm(mu[],Sigma[,])
  size ~ dunif(0.001,10000)
  RSS <- sum(res^2)
  RSSnull <- sum(resnull^2)
  r2 <- 1-RSS/RSSnull
} 
"
> writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS4.9.bug')
> library(R2jags)
> system.time(
+ dat.NB.jags4 <- jags(model='../downloads/BUGSscripts/tut11.5bS4.9.bug', data=dat.nb.list2, inits=NULL,
+           param=c('beta','r2'),
+           n.chain=3,
+           n.iter=20000, n.thin=10, n.burnin=10000)
+   )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 217

Initializing model
   user  system elapsed 
  6.485   0.004   6.492 
> print(dat.NB.jags4)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS4.9.bug", fit using jags,
 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10
 n.sims = 3000 iterations saved
         mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
beta[1]    0.756   0.419  -0.058   0.463   0.759   1.021   1.552 1.002  3000
beta[2]    0.095   0.033   0.032   0.073   0.095   0.117   0.160 1.001  3000
r2         0.262   0.316  -0.055   0.240   0.309   0.350   0.389 1.088  3000
deviance 113.109   2.759 110.081 111.178 112.433 114.202 120.420 1.004   780

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 3.8 and DIC = 116.9
DIC is an estimate of expected predictive error (lower deviance is better).

Finally, we will create a summary plot.

> par(mar = c(4, 5, 0, 0))
> plot(y ~ x, data = dat.nb, type = "n", ann = F, axes = F)
> points(y ~ x, data = dat.nb, pch = 16)
> xs <- seq(min(dat.nb$x, na.rm = TRUE), max(dat.nb$x, na.rm = TRUE),
+     l = 1000)
> Xmat <- model.matrix(~xs)
> eta <- coefs %*% t(Xmat)
> ys <- exp(eta)
> library(plyr)
> library(coda)
> data.tab <- adply(ys, 2, function(x) {
+     data.frame(Median = median(x), HPDinterval(as.mcmc(x)))
+ })
> data.tab <- cbind(x = xs, data.tab)
> points(Median ~ x, data = data.tab, col = "black", type = "l")
> lines(lower ~ x, data = data.tab, col = "black", type = "l",
+     lty = 2)
> lines(upper ~ x, data = data.tab, col = "black", type = "l",
+     lty = 2)
> 
> axis(1)
> mtext("X", 1, cex = 1.5, line = 3)
> axis(2, las = 2)
> mtext("Abundance of Y", 2, cex = 1.5, line = 3)
> box(bty = "l")
plot of chunk tut11.5bS4.13

Defining full log-likelihood function

Now lets try it by specifying log-likelihood and the zero trick. When applying this trick, we need to manually calculate the deviance as the inbuilt deviance will be based on the log-likelihood of estimating the zeros (as part of the zero trick) rather than the deviance of the intended model..

The one advantage of the zero trick is that the Deviance and thus DIC, AIC provided by R2jags will be incorrect. Hence, they too need to be manually defined within jags. I suspect that the AIC calculation I have used is incorrect...

> Xmat <- model.matrix(~x, dat.nb)
> nX <- ncol(Xmat)
> dat.nb.list2 <- with(dat.nb,list(Y=y, X=Xmat,N=nrow(dat.nb), mu=rep(0,nX),
+                   Sigma=diag(1.0E-06,nX), zeros=rep(0,nrow(dat.nb)), C=10000))
> modelString="
model {
  for (i in 1:N) {
     zeros[i] ~ dpois(zeros.lambda[i])
     zeros.lambda[i] <- -ll[i] + C     
     ll[i] <- loggam(Y[i]+size) - loggam(Y[i]+1) - loggam(size) + size*(log(p[i]) - log(p[i]+1)) - Y[i]*log(p[i]+1)
     p[i] <- size/lambda[i]
     eta[i] <- inprod(beta[], X[i,])
     log(lambda[i]) <- eta[i]
  }
  beta ~ dmnorm(mu[],Sigma[,])
  size ~ dunif(0.001,1000)
  dev <- sum(-2*ll)
} 
"
> writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS4.14.bug')
> library(R2jags)
> system.time(
+ dat.NB.jags3 <- jags(model='../downloads/BUGSscripts/tut11.5bS4.14.bug', data=dat.nb.list2, inits=NULL,
+           param=c('beta','dev'),
+           n.chain=3,
+           n.iter=50000, n.thin=50, n.burnin=10000)
+   )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 453

Initializing model
   user  system elapsed 
  10.48    0.00   10.50 
> print(dat.NB.jags3)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS4.14.bug", fit using jags,
 3 chains, each with 50000 iterations (first 10000 discarded), n.thin = 50
 n.sims = 2400 iterations saved
           mu.vect sd.vect       2.5%       25%       50%       75%     97.5%  Rhat n.eff
beta[1]  7.320e-01   0.411     -0.079 4.600e-01 7.350e-01 9.960e-01 1.580e+00 1.001  1900
beta[2]  9.700e-02   0.033      0.032 7.500e-02 9.600e-02 1.190e-01 1.610e-01 1.003   930
dev      1.130e+02   2.794    110.054 1.111e+02 1.123e+02 1.141e+02 1.205e+02 1.006   620
deviance 4.001e+05   2.794 400110.054 4.001e+05 4.001e+05 4.001e+05 4.001e+05 1.000     1

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 3.9 and DIC = 400116.9
DIC is an estimate of expected predictive error (lower deviance is better).

Zero-inflated Poisson (ZIP) regression

Zero-Inflation Poisson (ZIP) mixture model is defined as: $$ p(y_i|\theta,\lambda) = \left\{ \begin{array}{l l} \theta + (1-\theta)\times \text{Pois}(0|\lambda) & \quad \text{if $y_i=0$ and}\\ (1-\theta)\times \text{Pois}(y_i|\lambda) & \quad \text{if $y_i>0$} \end{array} \right. $$ where $\theta$ is the probability of false values (zeros).

Hence there is essentially two models coupled together (a mixture model) to yield an overall probability:

  • when an observed response is zero ($y_i=0$), it is the probability of getting a false value (zero) plus the probability of a true value multiplied probability of drawing a value of zero from a Poisson distribution of $lambda$
  • when an observed response is greater than 0, it is the probability of a true value multiplied probability of drawing that value from a Poisson distribution of $lambda$

The above formulation indicates the same $\lambda$ for both the zeros and non-zeros components. In the model of zero values, we are essentially investigating whether the likelihood of false zeros is related to the linear predictor and then the greater than zero model investigates whether the counts are related to the linear predictor.

However, we are typically less interested in modelling determinants of false zeros. Indeed, it is better that the likelihood of false zeros be unrelated to the linear predictor. For example, if excess (false zeros) are due to issues of detectability (individuals are present, just not detected), it is better that the detectability is not related to experimental treatments. Ideally, any detectability issues should be equal across all treatment levels.

The expected value of $Y$ and the variance in $Y$ for a ZIP model are: $$ E(y_i) = \lambda\times(1-\theta)\\ Var(y_i) = \lambda\times(1-\theta)\times(1+\theta\times\lambda^2) $$

Scenario and Data

Lets say we wanted to model the abundance of an item ($y$) against a continuous predictor ($x$). As this section is mainly about the generation of artificial data (and not specifically about what to do with the data), understanding the actual details are optional and can be safely skipped. Consequently, I have folded (toggled) this section away.

Random data incorporating the following trends (effect parameters)
  • the sample size = 20
  • the continuous $x$ variable is a random uniform spread of measurements between 1 and 20
  • the rate of change in log $y$ per unit of $x$ (slope) = 0.1.
  • the value of $x$ when log$y$ equals 0 (when $y$=1)
  • to generate the values of $y$ expected at each $x$ value, we evaluate the linear predictor (created by calculating the outer product of the model matrix and the regression parameters). These expected values are then transformed into a scale mapped by (0,$\infty$) by using the log function $e^{linear~predictor}$
  • finally, we generate $y$ values by using the expected $y$ values ($\lambda$) as probabilities when drawing random numbers from a Poisson distribution. This step adds random noise to the expected $y$ values and returns only 0's and positive integers.
> set.seed(37)  #34.5  #4 #10 #16 #17 #26
> # The number of samples
> n.x <- 20
> # Create x values that at uniformly distributed throughout the rate of 1 to 20
> x <- sort(runif(n = n.x, min = 1, max = 20))
> mm <- model.matrix(~x)
> intercept <- 0.6
> slope = 0.1
> # The linear predictor
> linpred <- mm %*% c(intercept, slope)
> # Predicted y values
> lambda <- exp(linpred)
> # Add some noise and make binomial
> library(gamlss.dist)
> # fixed latent binomial
> y <- rZIP(n.x, lambda, 0.4)
> # latent binomial influenced by the linear predictor y<- rZIP(n.x,lambda, 1-exp(linpred)/(1+exp(linpred)))
> dat.zip <- data.frame(y, x)
> 
> summary(glm(y ~ x, dat.zip, family = "poisson"))
Call:
glm(formula = y ~ x, family = "poisson", data = dat.zip)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-2.541  -2.377  -0.975   1.174   3.638  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   0.6705     0.3302    2.03    0.042 *
x             0.0296     0.0266    1.11    0.266  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 85.469  on 19  degrees of freedom
Residual deviance: 84.209  on 18  degrees of freedom
AIC: 124

Number of Fisher Scoring iterations: 6
> plot(glm(y ~ x, dat.zip, family = "poisson"))
plot of chunk tut11.5bS5.1
plot of chunk tut11.5bS5.1
plot of chunk tut11.5bS5.1
plot of chunk tut11.5bS5.1
> 
> library(pscl)
> summary(zeroinfl(y ~ x | 1, dist = "poisson", data = dat.zip))
Call:
zeroinfl(formula = y ~ x | 1, data = dat.zip, dist = "poisson")

Pearson residuals:
   Min     1Q Median     3Q    Max 
-1.001 -0.956 -0.393  0.966  1.619 

Count model coefficients (poisson with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.7047     0.3196    2.21  0.02745 *  
x             0.0873     0.0253    3.45  0.00056 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.229      0.456    -0.5     0.62
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 13 
Log-likelihood: -36.2 on 3 Df
> plot(resid(zeroinfl(y ~ x | 1, dist = "poisson", data = dat.zip)) ~
+     fitted(zeroinfl(y ~ x | 1, dist = "poisson", data = dat.zip)))
plot of chunk tut11.5bS5.1
> 
> library(gamlss)
> summary(gamlss(y ~ x, data = dat.zip, family = ZIP))
GAMLSS-RS iteration 1: Global Deviance = 72.44 
GAMLSS-RS iteration 2: Global Deviance = 72.34 
GAMLSS-RS iteration 3: Global Deviance = 72.34 
*******************************************************************
Family:  c("ZIP", "Poisson Zero Inflated") 

Call:  gamlss(formula = y ~ x, family = ZIP, data = dat.zip) 

Fitting method: RS() 

-------------------------------------------------------------------
Mu link function:  log
Mu Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    0.7058      0.3196     2.21   0.04123
x              0.0872      0.0253     3.44   0.00311

-------------------------------------------------------------------
Sigma link function:  logit
Sigma Coefficients:
  Estimate  Std. Error     t value    Pr(>|t|)  
    -0.229       0.456      -0.502       0.622  

-------------------------------------------------------------------
No. of observations in the fit:  20 
Degrees of Freedom for the fit:  3
      Residual Deg. of Freedom:  17 
                      at cycle:  3 
 
Global Deviance:     72.34 
            AIC:     78.34 
            SBC:     81.33 
*******************************************************************
> 
> predict(gamlss(y ~ x, data = dat.zip, family = ZIP), se.fit = TRUE,
+     what = "mu")
GAMLSS-RS iteration 1: Global Deviance = 72.44 
GAMLSS-RS iteration 2: Global Deviance = 72.34 
GAMLSS-RS iteration 3: Global Deviance = 72.34 
$fit
     1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17     18     19     20 
0.7967 0.9236 1.0264 1.0370 1.1305 1.3763 1.4054 1.4300 1.4951 1.6161 1.7036 1.7630 1.8679 1.9838 1.9952 2.1187 2.1250 2.1432 2.1837 2.3065 

$se.fit
     1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17     18     19     20 
0.3518 0.3089 0.2759 0.2726 0.2446 0.1856 0.1807 0.1771 0.1697 0.1656 0.1710 0.1783 0.1973 0.2252 0.2282 0.2638 0.2657 0.2713 0.2840 0.3242 

Exploratory data analysis and initial assumption checking

The assumptions are:
  1. All of the observations are independent - this must be addressed at the design and collection stages
  2. The response variable (and thus the residuals) should be matched by an appropriate distribution (in the case of positive integers response - a Poisson is appropriate).
  3. All observations are equally influential in determining the trends - or at least no observations are overly influential. This is most effectively diagnosed via residuals and other influence indices and is very difficult to diagnose prior to analysis
  4. the relationship between the linear predictor (right hand side of the regression formula) and the link function should be linear. A scatterplot with smoother can be useful for identifying possible non-linearity.
  5. Dispersion is either 1 or overdispersion is otherwise accounted for in the model
  6. The number of zeros is either not excessive or else they are specifically addressed by the model

Confirm non-normality and explore clumping

Check the distribution of the $y$ abundances

> hist(dat.zip$y)
plot of chunk tut11.5bS5.2
> boxplot(dat.zip$y, horizontal = TRUE)
> rug(jitter(dat.zip$y))
plot of chunk tut11.5bS5.2
There is definitely signs of non-normality that would warrant Poisson models. Further to that, there appears to be a large number of zeros that are likely to be the cause of overdispersion A zero-inflated Poisson model is likely to be one of the most effective for modeling these data.

Confirm linearity

Lets now explore linearity by creating a histogram of the predictor variable ($x$). Note, it is difficult to directly assess issues of linearity. Indeed, a scatterplot with lowess smoother will be largely influenced by the presence of zeros. One possible way of doing so is to explore the trend in the non-zero data.

> hist(dat.zip$x)
plot of chunk tut11.5bS5.3
> # now for the scatterplot
> plot(y ~ x, dat.zip, log = "y")
> with(subset(dat.zip, y > 0), lines(lowess(y ~ x)))
plot of chunk tut11.5bS5.3

Conclusions: the predictor ($x$) does not display any skewness or other issues that might lead to non-linearity. The lowess smoother on the non-zero data cloud does not display major deviations from a straight line and thus linearity is likely to be satisfied. Violations of linearity (whilst difficult to be certain about due to the unknown influence of the zeros) could be addressed by either:

  • define a non-linear linear predictor (such as a polynomial, spline or other non-linear function)
  • transform the scale of the predictor variables

Explore zero inflation

Although we have already established that there are few zeros in the data (and thus overdispersion is unlikely to be an issue), we can also explore this by comparing the number of zeros in the data to the number of zeros that would be expected from a Poisson distribution with a mean equal to the mean count of the data.

> # proportion of 0's in the data
> dat.zip.tab <- table(dat.zip$y == 0)
> dat.zip.tab/sum(dat.zip.tab)
FALSE  TRUE 
 0.55  0.45 
> # proportion of 0's expected from a Poisson distribution
> mu <- mean(dat.zip$y)
> cnts <- rpois(1000, mu)
> dat.zip.tabE <- table(cnts == 0)
> dat.zip.tabE/sum(dat.zip.tabE)
FALSE  TRUE 
0.921 0.079 
In the above, the value under FALSE is the proportion of non-zero values in the data and the value under TRUE is the proportion of zeros in the data. In this example, the proportion of zeros observed (45%) far exceeds that that would have been expected (7.9%). Hence it is highly likely that any models will be zero-inflated.

Model fitting or statistical analysis

The exploratory data analyses have suggested that a zero-inflated Poisson model might be the most appropriate for these data.

> dat.zip.list <- with(dat.zip,list(Y=y, X=x,N=nrow(dat.zip)))
> modelString="
model {
  for (i in 1:N) {
     Y[i] ~ dpois(lambda[i])
     eta[i] <- beta0 + beta1*X[i]
     log(lambda[i]) <- eta[i]
  }
  beta0 ~ dnorm(0,1.0E-06)
  beta1 ~ dnorm(0,1.0E-06)
} 
"
> writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS5.4.bug')
> library(R2jags)
> system.time(
+ dat.zip.P.jags <- jags(model='../downloads/BUGSscripts/tut11.5bS5.4.bug', data=dat.zip.list, inits=NULL,
+           param=c('beta0','beta1'),
+           n.chain=3,
+           n.iter=20000, n.thin=10, n.burnin=10000)
+   )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 105

Initializing model
   user  system elapsed 
  1.484   0.004   1.490 
> print(dat.zip.P.jags)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS5.4.bug", fit using jags,
 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10
 n.sims = 3000 iterations saved
         mu.vect sd.vect    2.5%     25%     50%    75%   97.5%  Rhat n.eff
beta0      0.628   0.324  -0.057   0.423   0.642   0.85   1.225 1.001  3000
beta1      0.032   0.026  -0.016   0.014   0.031   0.05   0.084 1.001  3000
deviance 121.911   2.069 120.051 120.545 121.329 122.61 126.947 1.003  3000

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 2.1 and DIC = 124.1
DIC is an estimate of expected predictive error (lower deviance is better).
> 
> 
> #extract the samples for the two model parameters
> coefs <- dat.zip.P.jags$BUGSoutput$sims.matrix[,1:2]
> Xmat <- model.matrix(~x, data=dat)
> #expected values on a log scale
> eta<-coefs %*% t(Xmat)
> #expected value on response scale
> lambda <- exp(eta)
> 
> Resid <- -1*sweep(lambda,2,dat.zip$y, '-')/sqrt(lambda)
> RSS <- apply(Resid^2, 1, sum)
> Disp <- RSS/(nrow(dat.zip)-ncol(coefs))
> data.frame(Median=median(Disp), Mean=mean(Disp), HPDinterval(as.mcmc(Disp)), HPDinterval(as.mcmc(Disp),p=0.5))
     Median  Mean lower upper lower.1 upper.1
var1  4.211 4.319 3.136 5.616   3.603   4.427

The dispersion parameter was 4.3186, indicating over three times more variability than would be expected for a Poisson distribution. The data are thus over-dispersed. Given the large number of zeros in the response, it would seem likely that the overdispersion is as a result of the excessive zeros and thus zero-inflated Poisson model would seem reasonable. Note, if this model is still overdispersed (possibly due to clumpiness in the non-zero values), then a zero-inflated negative binomial model might be worth exploring.

JAGS

$$ Y_i = \left\{ \begin{array}{lrcl} 0~\text{with}~P(y_i) = 1-\mu & z_i&\sim&\text{Binom}(1-\theta)\\ & logit(\theta)&=&\gamma_0\\ &\gamma_0&\sim{}&\mathcal{N}(0,10000)\\ >0 & Y_i&\sim&\text{Pois}(\lambda_i)\\ &\lambda_i&=&z_i + \eta_i\\ &log(\eta_i)&=&\beta_0+\beta_1x_1\\ &\beta_0, \beta_1&\sim{}&\mathcal{N}(0,10000)\\ \end{array} \right. $$ Or in shorthand: $$ \begin{align} Y_i&\sim{}ZIP(\lambda,\theta) & (\text{response distribution})\\ logit(\theta)&=\gamma_0 & (\text{link function/linear predictor - zero component})\\ log(\lambda_i)&=\eta_i & (\text{link function - count component})\\ \eta_i&=\beta_0+\beta_1 X_i & (\text{linear predictor - count component})\\ \beta_0, \beta_1, \gamma_0&\sim{}\mathcal{N}(0,10000) & (\text{diffuse Bayesian priors})\\ \end{align} $$
> dat.zip.list <- with(dat.zip,list(Y=y, X=x,N=nrow(dat.nb), z=ifelse(y==0,0,1)))
> modelString="
model {
  for (i in 1:N) {
     z[i] ~ dbern(one.minus.theta)
     Y[i] ~ dpois(lambda[i])
     lambda[i] <- z[i]*eta[i]
     log(eta[i]) <- beta0 + beta1*X[i]
  }
  one.minus.theta <- 1-theta
  logit(theta) <- gamma0
  beta0 ~ dnorm(0,1.0E-06)
  beta1 ~ dnorm(0,1.0E-06)
  gamma0 ~ dnorm(0,1.0E-06)
} 
"
> writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS5.5.bug')
> library(R2jags)
> system.time(
+ dat.zip.jags <- jags(model='../downloads/BUGSscripts/tut11.5bS5.5.bug', data=dat.zip.list, inits=NULL,
+           param=c('beta0','beta1', 'gamma0','theta'),
+           n.chain=3,
+           n.iter=20000, n.thin=10, n.burnin=10000)
+   )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 149

Initializing model
   user  system elapsed 
  2.689   0.004   2.694 
> Xmat <- model.matrix(~x, dat.zip)
> nX <- ncol(Xmat)
> dat.zip.list1 <- with(dat.zip,list(Y=y, X=Xmat,N=nrow(dat.zip), nX=nX, z=ifelse(y==0,0,1)))
> modelString="
model {
  for (i in 1:N) {
     z[i] ~ dbern(one.minus.theta)
     Y[i] ~ dpois(lambda[i])
     lambda[i] <- z[i]*eta[i]
     log(eta[i]) <- inprod(beta[], X[i,])
  }
  one.minus.theta <- 1-theta
  logit(theta) <- gamma0
  gamma0 ~ dnorm(0,1.0E-06)
  for (i in 1:nX) {
    beta[i] ~ dnorm(0,1.0E-06)
  }
} 
"
> writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS5.6.bug')
> library(R2jags)
> system.time(
+ dat.zip.jags1 <- jags(model='../downloads/BUGSscripts/tut11.5bS5.6.bug', data=dat.zip.list1, inits=NULL,
+           param=c('beta','gamma0'),
+           n.chain=3,
+           n.iter=20000, n.thin=10, n.burnin=10000)
+   )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 171

Initializing model
   user  system elapsed 
  2.401   0.004   2.408 
> print(dat.zip.jags1)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS5.6.bug", fit using jags,
 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10
 n.sims = 3000 iterations saved
         mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
beta[1]    0.702   0.320  0.033  0.499  0.706  0.917  1.293 1.003   980
beta[2]    0.087   0.026  0.036  0.069  0.087  0.103  0.138 1.003   850
gamma0    -0.216   0.465 -1.128 -0.525 -0.218  0.107  0.669 1.001  3000
deviance  75.730   2.508 72.857 73.842 75.103 76.921 82.412 1.001  3000

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 3.1 and DIC = 78.9
DIC is an estimate of expected predictive error (lower deviance is better).

Or arguably better still, use a multivariate normal prior. If we have a $k$ regression parameters ($\beta_k$), then the multivariate normal priors are defined as: $$ \boldsymbol{\beta}\sim{}\mathcal{N_k}(\boldsymbol{\mu}, \mathbf{\Sigma}) $$ where $$\boldsymbol{\mu}=[E[\beta_1],E[\beta_2],...,E[\beta_k]] = \left(\begin{array}{c}0\\\vdots\\0\end{array}\right)$$ $$ \mathbf{\Sigma}=[Cov[\beta_i, \beta_j]] = \left(\begin{array}{ccc}1000^2&0&0\\0&\ddots&0\\0&0&1000^2\end{array} \right) $$ hence, along with the response and predictor matrix, we need to supply $\boldsymbol{\mu}$ (a vector of zeros) and $\boldsymbol{\Sigma}$ (a covariance matrix with $1000^2$ in the diagonals).

> Xmat <- model.matrix(~x, dat.zip)
> nX <- ncol(Xmat)
> dat.zip.list2 <- with(dat.zip,list(Y=y, X=Xmat,N=nrow(dat.zip), nX=nX,
+                       mu=rep(0,nX),Sigma=diag(1.0E-06,nX), z=ifelse(y==0,0,1)))
> modelString="
model {
  for (i in 1:N) {
     z[i] ~ dbern(one.minus.theta)
     Y[i] ~ dpois(lambda[i])
     lambda[i] <- z[i]*eta[i]
     log(eta[i]) <- inprod(beta[], X[i,])
  }
  one.minus.theta <- 1-theta
  logit(theta) <- gamma0
  gamma0 ~ dnorm(0,1.0E-06)
  beta ~ dmnorm(mu[],Sigma[,])
} 
"
> writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS5.7.bug')
> library(R2jags)
> system.time(
+ dat.zip.jags2 <- jags(model='../downloads/BUGSscripts/tut11.5bS5.7.bug', data=dat.zip.list2, inits=NULL,
+           param=c('beta', 'gamma0'),
+           n.chain=3,
+           n.iter=20000, n.thin=10, n.burnin=10000)
+   )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 176

Initializing model
   user  system elapsed 
  1.000   0.000   0.999 
> print(dat.zip.jags2)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS5.7.bug", fit using jags,
 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10
 n.sims = 3000 iterations saved
         mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
beta[1]    0.700   0.313  0.051  0.492  0.708  0.905  1.307 1.005   430
beta[2]    0.087   0.025  0.036  0.070  0.087  0.104  0.136 1.008   680
gamma0    -0.215   0.472 -1.115 -0.523 -0.229  0.090  0.724 1.001  2400
deviance  75.677   2.517 72.884 73.834 75.002 76.788 82.402 1.001  3000

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 3.2 and DIC = 78.8
DIC is an estimate of expected predictive error (lower deviance is better).

Note, the n.eff indicates that we probably have an issue with chain mixing and/or autocorrelation. We probably should increase the number of iterations and the thinning rate.

Chain mixing and Model validation

Prior to exploring the model parameters, it is prudent to confirm that the model did indeed fit the assumptions and was an appropriate fit to the data as well as that the MCMC sampling chain was adequately mixed and the retained samples independent.

  • We will start by exploring the mixing of the MCMC chains via traceplots.
    > plot(as.mcmc(dat.zip.jags))
    
    plot of chunk tut11.5bS5.8
    plot of chunk tut11.5bS5.8

    The chains appear well mixed and stable

  • Next we will explore correlation amongst MCMC samples.
    > autocorr.diag(as.mcmc(dat.zip.jags))
    
               beta0     beta1  deviance    gamma0     theta
    Lag 0   1.000000 1.0000000  1.000000  1.000000  1.000000
    Lag 10  0.260592 0.2658804  0.069565  0.056243  0.056195
    Lag 50  0.013879 0.0002929 -0.017896 -0.005929 -0.004522
    Lag 100 0.017172 0.0200417 -0.034988 -0.024724 -0.023377
    Lag 500 0.006818 0.0031960 -0.002762  0.010100  0.012140
    

    The level of auto-correlation at the nominated lag of 10 is higher than we would generally like. It is worth increasing the thinning rate from 10 to 50. Obviously, to support this higher thinning rate, we would also increase the number of iterations.

    > library(R2jags)
    > dat.zip.jags <- jags(data=dat.zip.list,model.file='../downloads/BUGSscripts/tut11.5bS5.5.bug',
    +                    param=c('beta0','beta1','gamma0','theta'),
    +                    n.chains=3, n.iter=100000, n.burnin=50000, n.thin=50)
    
    Compiling model graph
       Resolving undeclared variables
       Allocating nodes
       Graph Size: 149
    
    Initializing model
    
    > print(dat.zip.jags)
    
    Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS5.5.bug", fit using jags,
     3 chains, each with 1e+05 iterations (first 50000 discarded), n.thin = 50
     n.sims = 3000 iterations saved
             mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
    beta0      0.718   0.317  0.061  0.519  0.720  0.932  1.318 1.001  3000
    beta1      0.085   0.025  0.037  0.068  0.085  0.102  0.136 1.001  3000
    gamma0    -0.203   0.465 -1.140 -0.509 -0.195  0.112  0.685 1.001  3000
    theta      0.452   0.109  0.242  0.375  0.451  0.528  0.665 1.001  3000
    deviance  75.655   2.508 72.846 73.842 75.007 76.704 82.268 1.001  3000
    
    For each parameter, n.eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
    
    DIC info (using the rule, pD = var(deviance)/2)
    pD = 3.1 and DIC = 78.8
    DIC is an estimate of expected predictive error (lower deviance is better).
    
    > plot(as.mcmc(dat.zip.jags))
    
    plot of chunk tut11.5bS5.10
    plot of chunk tut11.5bS5.10
    > autocorr.diag(as.mcmc(dat.zip.jags))
    
                beta0     beta1 deviance    gamma0     theta
    Lag 0     1.00000  1.000000  1.00000  1.000000  1.000000
    Lag 50   -0.02079 -0.022691  0.02611 -0.005516 -0.002252
    Lag 250   0.01070  0.016621 -0.02444 -0.010192 -0.010403
    Lag 500  -0.01010 -0.008113  0.01343  0.047790  0.049889
    Lag 2500 -0.01133 -0.024075 -0.01972 -0.040467 -0.040757
    

    Conclusions: the samples are now less auto-correlated and the chains are all well mixed and appear stable.

  • We now explore the goodness of fit of the models via the residuals and deviance. We could calculate the Pearsons's residuals within the JAGS model. Alternatively, we could use the parameters to generate the residuals outside of JAGS.
    > #extract the samples for the two model parameters
    > coefs <- dat.zip.jags$BUGSoutput$sims.matrix[,1:2]
    > theta <- dat.zip.jags$BUGSoutput$sims.matrix[,'theta']
    > Xmat <- model.matrix(~x, data=dat.zip)
    > #expected values on a log scale
    > lambda<-coefs %*% t(Xmat)
    > #expected value on response scale
    > eta <- exp(lambda)
    > expY <- sweep(eta,1,(1-theta),"*")
    > varY <- eta+sweep(eta^2,1,theta,"*")
    > varY <- sweep(varY,1,(1-theta),'*')
    > #sweep across rows and then divide by lambda
    > Resid <- -1*sweep(expY,2,dat.zip$y,'-')/sqrt(varY)
    > #plot residuals vs expected values
    > plot(apply(Resid,2,mean)~apply(eta,2,mean))
    
    plot of chunk tut11.5bS5.11

    There are no real patterns in the residuals.

  • Now we will compare the sum of squared residuals to the sum of squares residuals that would be expected from a Poisson distribution matching that estimated by the model. Essentially this is estimating how well the Poisson distribution, the log-link function and the linear model approximates the observed data.

    When doing so, we need to consider the expected value and variance of the zero-inflated poisson. $$ E(y_i) = \lambda\times(1-\theta)\\ Var(y_i) = \lambda\times(1-\theta)\times(1+\theta\times\lambda^2) $$

    > SSres<-apply(Resid^2,1,sum)
    >  
    > set.seed(2)
    > #generate a matrix of draws from a zero-inflated poisson (ZIP) distribution
    > # the matrix is the same dimensions as lambda
    > library(gamlss.dist)
    > #YNew <- matrix(rZIP(length(lambda),eta, theta),nrow=nrow(lambda))
    > lambda <- sweep(eta,1,ifelse(dat.zip$y==0,0,1),'*')
    > YNew <- matrix(rpois(length(lambda),lambda),nrow=nrow(lambda))
    > Resid1<-(expY - YNew)/sqrt(varY)
    > SSres.sim<-apply(Resid1^2,1,sum)
    > mean(SSres.sim>SSres)
    
    [1] 0.4513
    

    Whilst not ideal (as we would prefer a Bayesian P-value of around 0.5), this value is not wildly bad and does not constitute overwhelming evidence of a lack of fit.

    > Xmat <- model.matrix(~x, dat.zip)
    > nX <- ncol(Xmat)
    > dat.zip.list2 <- with(dat.zip,list(Y=y, X=Xmat,N=nrow(dat.zip), nX=nX,
    +                       mu=rep(0,nX),Sigma=diag(1.0E-06,nX), z=ifelse(y==0,0,1)))
    > modelString="
    model {
      for (i in 1:N) {
         z[i] ~ dbern(one.minus.theta)
         Y[i] ~ dpois(lambda[i])
         lambda[i] <- z[i]*eta[i]
         log(eta[i]) <- max(-20,min(20,inprod(beta[], X[i,])))
    
         expY[i] <- eta[i]*(1-theta)
         varY[i] <- (1-theta)*(eta[i]*theta*pow(eta[i],2))
         Resid[i] <- (Y[i] - expY[i])/sqrt(varY[i])
         Y1[i] ~ dpois(lambda[i])
         Resid1[i] <- (Y1[i] - expY[i])/sqrt(varY[i])
         RSS[i] <- pow(Resid[i],2)
         RSS1[i] <-pow(Resid1[i],2) 
      }
      one.minus.theta <- 1-theta
      logit(theta) <- gamma0
      gamma0 ~ dnorm(0,1.0E-06)
      beta ~ dmnorm(mu[],Sigma[,])
      Pvalue <- mean(sum(RSS1)>sum(RSS))
    } 
    "
    > writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS5.13.bug')
    > library(R2jags)
    > system.time(
    + dat.zip.jags3 <- jags(model='../downloads/BUGSscripts/tut11.5bS5.13.bug', data=dat.zip.list2, inits=NULL,
    +           param=c('beta', 'gamma0','Pvalue'),
    +           n.chain=3,
    +           n.iter=20000, n.thin=10, n.burnin=10000)
    +   )
    
    Compiling model graph
       Resolving undeclared variables
       Allocating nodes
       Graph Size: 465
    
    Initializing model
    
       user  system elapsed 
      1.337   0.004   1.342 
    
    > print(dat.zip.jags3)
    
    Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS5.13.bug", fit using jags,
     3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10
     n.sims = 3000 iterations saved
             mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
    Pvalue     0.475   0.499  0.000  0.000  0.000  1.000  1.000 1.001  3000
    beta[1]    0.712   0.312  0.062  0.508  0.721  0.928  1.292 1.004   660
    beta[2]    0.086   0.025  0.037  0.069  0.086  0.103  0.133 1.004   750
    gamma0    -0.213   0.461 -1.129 -0.526 -0.197  0.104  0.668 1.001  3000
    deviance  75.628   2.489 72.826 73.799 74.996 76.819 81.974 1.001  2100
    
    For each parameter, n.eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
    
    DIC info (using the rule, pD = var(deviance)/2)
    pD = 3.1 and DIC = 78.7
    DIC is an estimate of expected predictive error (lower deviance is better).
    
    > 
    

    Conclusions: the Bayesian p-value is very close to 0.5, suggesting that there is a good fit of the model to the data.

Exploring the model parameters, test hypotheses

If there was any evidence that the assumptions had been violated or the model was not an appropriate fit, then we would need to reconsider the model and start the process again. In this case, there is no evidence that the test will be unreliable so we can proceed to explore the test statistics.

As with most Bayesian models, it is best to base conclusions on medians rather than means.

> print(dat.zip.jags)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS5.5.bug", fit using jags,
 3 chains, each with 1e+05 iterations (first 50000 discarded), n.thin = 50
 n.sims = 3000 iterations saved
         mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
beta0      0.718   0.317  0.061  0.519  0.720  0.932  1.318 1.001  3000
beta1      0.085   0.025  0.037  0.068  0.085  0.102  0.136 1.001  3000
gamma0    -0.203   0.465 -1.140 -0.509 -0.195  0.112  0.685 1.001  3000
theta      0.452   0.109  0.242  0.375  0.451  0.528  0.665 1.001  3000
deviance  75.655   2.508 72.846 73.842 75.007 76.704 82.268 1.001  3000

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 3.1 and DIC = 78.8
DIC is an estimate of expected predictive error (lower deviance is better).
> 
> library(plyr)
> adply(dat.zip.jags$BUGSoutput$sims.matrix, 2, function(x) {
+   data.frame(Median=median(x), Mean=mean(x), HPDinterval(as.mcmc(x)), HPDinterval(as.mcmc(x),p=0.5))
+ })
        X1   Median    Mean    lower   upper  lower.1  upper.1
1    beta0  0.72032  0.7182  0.11349  1.3617  0.54977  0.95667
2    beta1  0.08544  0.0853  0.03686  0.1355  0.06821  0.10181
3 deviance 75.00697 75.6548 72.65129 80.7637 72.79881 75.08279
4   gamma0 -0.19535 -0.2028 -1.15237  0.6730 -0.57676  0.03602
5    theta  0.45132  0.4521  0.23186  0.6532  0.35968  0.50900

Actually, many find it more palatable to express the estimates in the original scale of the observations rather than on a log scale.

> library(plyr)
> adply(exp(dat.zip.jags$BUGSoutput$sims.matrix[,1:2]), 2, function(x) {
+   data.frame(Median=median(x), Mean=mean(x), HPDinterval(as.mcmc(x)), HPDinterval(as.mcmc(x),p=0.5))
+ })
     X1 Median  Mean  lower upper lower.1 upper.1
1 beta0  2.055 2.155 0.9751 3.546   1.520   2.352
2 beta1  1.089 1.089 1.0376 1.145   1.071   1.107

Conclusions: We would reject the null hypothesis of no effect of $x$ on $y$. An increase in x is associated with a significant linear increase (positive slope) in the abundance of $y$. Every 1 unit increase in $x$ results in a log 0.0854 unit increase in $y$. We usually express this in terms of abundance rather than log abundance, so every 1 unit increase in $x$ results in a ($e^{0.0854}=1.0892$) 1.0892 unit increase in the abundance of $y$.

Further explorations of the trends

A measure of the strength of the relationship can be obtained according to: $$R^2 = 1 - \frac{RSS_{model}}{RSS_{null}}$$

Alternatively, we could use McFadden's psuedo $$R^2 = 1-\frac{\mathcal{LL}(Model_{full})}{\mathcal{LL}(Model_{reduced})}$$ [[http://www.statisticalhorizons.com/category/uncategorized]] - about a quarter of the way down the page.

> Xmat <- model.matrix(~x, dat=dat.zip)
> #expected values on a log scale
> neta<-coefs %*% t(Xmat)
> #expected value on response scale
> eta <- exp(neta)
> lambda <- sweep(eta,2,ifelse(dat.zip$y==0,0,1),'*')
> expY <- sweep(lambda,2,1-theta,'*')
> #calculate the raw SS residuals
> SSres <- apply((-1*(sweep(expY,2,dat.zip$y,'-')))^2,1,sum)
> SSres.null <- sum((dat.zip$y - mean(dat.zip$y))^2)
> #calculate the model r2
> 1-mean(SSres)/SSres.null
[1] 0.4928

Conclusions: 49.28% of the variation in $y$ abundance can be explained by its relationship with $x$.

> Xmat <- model.matrix(~x, dat.zip)
> nX <- ncol(Xmat)
> dat.zip.list2 <- with(dat.zip,list(Y=y, X=Xmat,N=nrow(dat.zip), nX=nX,
+                       mu=rep(0,nX),Sigma=diag(1.0E-06,nX), z=ifelse(y==0,0,1)))
> modelString="
model {
  for (i in 1:N) {
     z[i] ~ dbern(one.minus.theta)
     Y[i] ~ dpois(lambda[i])
     lambda[i] <- z[i]*eta[i]
     log(eta[i]) <- max(-20,min(20,inprod(beta[], X[i,])))

	 expY[i] <- lambda[i]*(1-theta)
     res[i] <- Y[i] - expY[i]
     resnull[i] <- Y[i] - meanY
  }
  one.minus.theta <- 1-theta
  logit(theta) <- gamma0
  gamma0 ~ dnorm(0,1.0E-06)
  beta ~ dmnorm(mu[],Sigma[,])

  meanY <- mean(Y)
  RSS <- sum(res^2)
  RSSnull <- sum(resnull^2)
  r2 <- 1-RSS/RSSnull
} 
"
> writeLines(modelString,con='../downloads/BUGSscripts/tut11.5bS5.17.bug')
> library(R2jags)
> system.time(
+ dat.ZIP.jags4 <- jags(model='../downloads/BUGSscripts/tut11.5bS5.17.bug', data=dat.zip.list2, inits=NULL,
+           param=c('beta','r2'),
+           n.chain=3,
+           n.iter=20000, n.thin=10, n.burnin=10000)
+   )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
   Graph Size: 277

Initializing model
   user  system elapsed 
  1.148   0.004   1.153 
> print(dat.ZIP.jags4)
Inference for Bugs model at "../downloads/BUGSscripts/tut11.5bS5.17.bug", fit using jags,
 3 chains, each with 20000 iterations (first 10000 discarded), n.thin = 10
 n.sims = 3000 iterations saved
         mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
beta[1]    0.696   0.320  0.050  0.481  0.707  0.912  1.316 1.002  1500
beta[2]    0.087   0.025  0.036  0.070  0.087  0.104  0.134 1.002  1300
r2         0.492   0.182  0.095  0.369  0.512  0.633  0.788 1.001  3000
deviance  75.740   2.524 72.863 73.908 75.100 76.922 82.234 1.001  3000

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 3.2 and DIC = 78.9
DIC is an estimate of expected predictive error (lower deviance is better).

Finally, we will create a summary plot.

> par(mar = c(4, 5, 0, 0))
> plot(y ~ x, data = dat.zip, type = "n", ann = F, axes = F)
> points(y ~ x, data = dat.zip, pch = 16)
> xs <- seq(min(dat.zip$x, na.rm = TRUE), max(dat.zip$x, na.rm = TRUE),
+     l = 1000)
> Xmat <- model.matrix(~xs)
> eta <- coefs %*% t(Xmat)
> ys <- exp(eta)
> library(plyr)
> library(coda)
> data.tab <- adply(ys, 2, function(x) {
+     data.frame(Median = median(x), HPDinterval(as.mcmc(x)))
+ })
> data.tab <- cbind(x = xs, data.tab)
> points(Median ~ x, data = data.tab, col = "black", type = "l")
> lines(lower ~ x, data = data.tab, col = "black", type = "l",
+     lty = 2)
> lines(upper ~ x, data = data.tab, col = "black", type = "l",
+     lty = 2)
> 
> axis(1)
> mtext("X", 1, cex = 1.5, line = 3)
> axis(2, las = 2)
> mtext("Abundance of Y", 2, cex = 1.5, line = 3)
> box(bty = "l")
plot of chunk tut11.5bS5.18

Defining full log-likelihood function

Now lets try it by specifying log-likelihood and the zero trick. When applying this trick, we need to manually calculate the deviance as the inbuilt deviance will be based on the log-likelihood of estimating the zeros (as part of the zero trick) rather than the deviance of the intended model..

The one advantage of the zero trick is that the Deviance and thus DIC, AIC provided by R2jags will be incorrect. Hence, they too need to be manually defined within jags. I suspect that the AIC calculation I have used is incorrect...


Exponential family of distributions

The exponential distributions are a class of continuous distribution which can be characterized by two parameters. One of these parameters (the location parameter) is a function of the mean and the other (the dispersion parameter) is a function of the variance of the distribution. Note that recent developments have further extended generalized linear models to accommodate other non-exponential residual distributions.

End of instructions