Tutorial 6.2b - Comparing two populations (Bayesian)

24 Jun 2017

This tutorial will focus on the use of Bayesian (MCMC sampling) estimation to explore differences between two populations. Bayesian estimation is supported within R via two main packages (MCMCpack and MCMCglmm). These packages provide relatively simple R-like interfaces (particularly MCMCpack) to Bayesian routines and are therefore reasonably good starting points for those transitioning between traditional and Bayesian approaches.

BUGS (Bayesian inference Using Gibbs Sampling) is an algorithm and supporting language (resembling R) dedicated to Gibbs sampling, one implementation of the Markov chain Monte Carlo (MCMC) method. Dialects of the BUGS language are implemented within two main projects:

  • WinBUGS/OpenBUGS - written in Component Pascal and therefore originally Windows only. A less platform-specific version (OpenBUGS - Windows and Linux via some Component Pascal libraries) is now being developed as the WinBUGS successor. Unlike WinBUGS, OpenBUGS does not have a graphical user interface and was designed to be invoked from other applications (such as R). The major drawback of WinBUGS/OpenBUGS is that it is relatively slow.
  • JAGS (Just Another Gibbs Sampler) - written in C++ and is therefore cross-platform and very fast. It can also be called from within R via various packages.

Whilst the above programs can be used stand-alone, they do not offer the rich data pre-processing and graphical capabilities of R, and thus they are best accessed from within R itself. As such, there are multiple packages devoted to interfacing with the various BUGS implementations:

  • R2WinBUGS - interfaces with WinBUGS/OpenBUGS
  • rjags - interfaces with JAGS
  • R2jags - interfaces with JAGS yet also provides some of the more advanced post-processing routines offered within R2WinBUGS. Actually, it is sort of a hybrid between R2WinBUGS, rjags and coda packages.
  • RSTAN - interfaces with STAN - a dedicated Bayesian modelling framework written in C++ and implementing Hamiltonian MCMC samplers.
  • RSTANARM - provides a convenient series of interfaces to STAN that are reminiscent of Frequentist linear modelling interfaces.
  • BRMS - provides an alternative convenient interface to STAN. It too is specified in a manner similar to Frequentist linear modelling interfaces.

There is an additional package (coda) that provides many routines for model checking and exploring the posterior distributions.

The BUGS language and algorithm is very powerful and flexible. However, the cost of this power and flexibility is complexity and the need for a firm understanding of the model you wish to fit as well as the priors to be used. The BUGS algorithm requires the following:

  • the model, comprising:
    • a likelihood function relating the response to the predictors - this is where good knowledge of the conceptual model that you are seeking to fit is essential
    • definitions of the priors
  • chain properties, such as:
    • the number of chains
    • the length of chains (number of iterations)
    • the burn-in length (number of initial iterations to discard)
    • the thinning rate (the interval between successive retained samples)
  • initial estimates to start an MCMC chain. If there are multiple chains, these starting values can differ between chains
  • a list of model parameters and derivatives to monitor (and return the posterior distributions of)

This tutorial will demonstrate MCMCpack, R2jags, RSTAN, RSTANARM and BRMS. Each of these requires loading associated packages. Unfortunately, some of these packages mask functionality from one another (they don't play nicely together). Fortunately, it is rarely necessary to run multiple versions of the same model (just select one package). It does however mean that this tutorial will load all necessary packages up front.

library(tidyverse)
library(coda)
library(broom)
library(bayesplot)
# Approach specific packages
library(MCMCpack)
library(R2jags)
library(rstan)
library(rstanarm)
library(brms)

We will start by generating a random data set. Note, I am creating two versions of the predictor variable (a numeric version and a factor version).

set.seed(1)
nA <- 60  #sample size from Population A
nB <- 40  #sample size from Population B
muA <- 105  #population mean of Population A
muB <- 77.5  #population mean of Population B
sigma <- 3  #standard deviation of both populations (equally varied)
yA <- rnorm(nA, muA, sigma)  #Population A sample
yB <- rnorm(nB, muB, sigma)  #Population B sample
y <- c(yA, yB)
x <- factor(rep(c("A", "B"), c(nA, nB)))  #categorical listing of the populations
xn <- as.numeric(x)  #numerical version of the population category for means parameterization. 
# Should not start at 0.
data <- data.frame(y, x, xn)  # dataset
head(data)
         y x xn
1 103.1206 A  1
2 105.5509 A  1
3 102.4931 A  1
4 109.7858 A  1
5 105.9885 A  1
6 102.5386 A  1

Let's begin by performing the usual exploratory data analysis - in this case, a boxplot of the response for each level of the predictor.

[Figure: boxplot of y for each level of x]
boxplot(y ~ x, data)

Model fitting

A t-test is essentially just a simple regression model in which the categorical predictor is represented by a binary variable in which one level is coded as 0 and the other 1.

For the model itself, the observed responses ($y_i$) are assumed to be drawn from a normal distribution with a given mean ($\mu$) and standard deviation ($\sigma$). The expected values ($\mu_i$) are themselves determined by the linear predictor ($\beta_0 + \beta x_i$). In this case, $\beta_0$ represents the mean of the first treatment group and $\beta$ represents the difference between the mean of the first group and the mean of the second group (the effect).
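
To see this dummy coding concretely, we can inspect the design matrix that R constructs from the factor (using the simulated data from above; the first column is the intercept and the second is the 0/1 indicator for Population B):

head(model.matrix(~x, data))
  (Intercept) xB
1           1  0
2           1  0
3           1  0
4           1  0
5           1  0
6           1  0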

MCMC sampling requires priors on all parameters. We will employ weakly informative priors. Specifying 'uninformative' priors is always a bit of a balancing act. If the priors are too vague (wide), the MCMC sampler can wander off into nonsense areas of the likelihood rather than concentrate around areas of highest likelihood (desired when wanting the outcomes to be largely driven by the data). On the other hand, if the priors are too strong, they may have an undue influence on the parameters. In such a simple model, this balance is very forgiving - it is for more complex models that prior choice becomes more important.

For this simple model, we will go with zero-centered Gaussian (normal) priors with relatively large standard deviations (1000) for both the intercept and the treatment effect, and a wide half-Cauchy (scale=25) for the standard deviation. $$ \begin{align} y_i &\sim{} N(\mu_i, \sigma)\\ \mu_i &= \beta_0 + \beta x_i\\[1em] \beta_0 &\sim{} N(0,1000)\\ \beta &\sim{} N(0,1000)\\ \sigma &\sim{} Cauchy(0,25)\\ \end{align} $$

library(MCMCpack)
data.mcmcpack <- MCMCregress(y ~ x, data = data)
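
Since MCMCregress() returns a coda mcmc object, a quick summary of the marginal posteriors (means, standard deviations and quantiles) can be obtained directly with coda's summary method (output not shown):

summary(data.mcmcpack)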

As indicated earlier, BUGS and therefore JAGS, requires more careful thought and effort to fit a model. In particular, we need to be very clear on the model being fitted so that we can correctly define the appropriate and corresponding likelihood function.

Broadly, there are two ways of parameterizing a model (expressing the unknown, to-be-estimated components). Either we can estimate the means of each group (Means parameterization) or we can estimate the mean of one group and the difference between this group and the other group(s) (Effects parameterization). The latter is commonly used for frequentist null hypothesis testing as its parameters are more consistent with the null hypothesis of interest (that the difference between the two groups equals zero).

Effects parameterization

yi = α + βj(i)*xi + εi, where ε ∼ N(0,σ²)

Each of the (i) values of y are modelled by an intercept (α - the mean of Group A) plus a difference parameter (β - the difference between the means of Group A and Group B) multiplied by an indicator of which group the observation came from, plus a residual drawn from a normal distribution (N) with a mean of zero and a variance of σ². Actually, there are as many β parameters as there are groups (j), however one of them (typically the first) is set equal to zero (to avoid over-parameterization).

Means parameterization

yi = αj(i) + εi, where ε ∼ N(0,σ²)

Each of the (i) values of y are modelled as the mean (α) of their group (j) plus a residual drawn from a normal distribution (N) with a mean of zero and a variance of σ². Here, α is a set of j coefficients corresponding to the j dummy-coded factor levels.

Alternatively, we can express these as:

Effects parameterization

yi ~ N(α + βj(i)*xi, σ²)

The (i) values of y are modelled assuming they are drawn from a normal distribution whose mean is determined by a linear combination of effects parameters (α + βj(i)*xi) and whose variance (σ²) describes the variability around this mean. There are three parameters:
  • α - the mean of (the first) Group A
  • β2 - the (effect) difference between Group A and Group B
  • σ² - the variance

Means parameterization

yi ~ N(αj(i), σ²)

The (i) values of y are modelled assuming they are drawn from a normal distribution whose mean is determined by a linear combination of means parameters (αj(i)) and whose variance (σ²) describes the variability around this mean. There are three parameters:
  • α1 - the mean of (the first) Group A
  • α2 - the mean of (the second) Group B
  • σ² - the variance

Actually in the various BUGS dialects, distributions are defined by their precision (τ) rather than their variance (σ²). Precision is just the inverse of variance (τ = 1/σ²). Precision is used instead of variance as it permits the gamma distribution to be used as the conjugate prior of the variance of a normal distribution.
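
To make this concrete, the standard deviation of 3 used to simulate these data corresponds to a precision of 1/3² ≈ 0.111:

sigma <- 3
tau <- 1/(sigma^2)  # precision is the inverse of variance
tau
[1] 0.1111111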

Bayesian analyses require that priors be specified for each of the parameters. We will define vague (non-informative) priors for each of the parameters such that the posterior distributions are almost entirely influenced by the likelihood (and thus the data). Hence, appropriate (conjugate) priors for the effects parameterization could be:

  • $\alpha$ - a very flat (imprecise) normal distribution centered around zero, N(0, 1.0E-6). Note, 1.0E-6 is scientific notation for 0.000001
  • $\beta_2$ - a very flat (imprecise) normal distribution centered around zero, N(0, 1.0E-6)
  • $\tau$ - a very imprecise gamma distribution with a shape parameter close to zero (must be greater than 0)

Hence, the above equation becomes:

Effects parameterization

yi ~ N(mui, τ)
mui = α + βj(i)*xi
τ = 1/σ²
α ~ N(0, 1.0E-6)
β1 = 0
β2 ~ N(0, 1.0E-6)
σ ~ U(0, 100)

The first two lines define the model likelihood, the third line defines τ in terms of the variance, and the remaining four lines define the priors for each parameter.

Means parameterization

yi ~ N(mui, τ)
mui = αj(i)
τ = 1/σ²
α1 ~ N(0, 1.0E-6)
α2 ~ N(0, 1.0E-6)
σ ~ U(0, 100)

The first two lines define the model likelihood, the third line defines τ in terms of the variance, and the remaining three lines define the priors for each parameter.

The BUGS language very closely matches the above model and prior definitions - hence the importance of understanding the model you wish to fit. The BUGS language resembles R in many respects. It basically consists of:

  • stochastic nodes - those that appear on the left hand side of ~
  • deterministic nodes - those that appear on the left hand side of <-
  • R-like for loops
  • R-like functions to transform and summarize data
That said, BUGS is a declarative language, which means:
  • the order in which statements appear in the model definition is not important
  • nodes should not be defined more than once (you cannot change a value)

  1. We are now in a good position to define the model (Likelihood function and prior distributions).
    Effects parameterization
    modelString = "  
     model {
      #Likelihood
      for (i in 1:n) {
        y[i]~dnorm(mu[i],tau)
        mu[i] <- alpha+beta[x[i]]
      }
     
      #Priors
      alpha ~ dnorm(0,1.0E-06)
      beta[1] <- 0
      beta[2] ~ dnorm(0,1.0E-06)
      tau <- 1 / (sigma * sigma)
      sigma~dunif(0,100)
      
      #Other Derived parameters 
      # Group means (note, beta is a vector)
      Group.means <-alpha+beta  
     }
     "
    
    Means parameterization

    modelString.means = "  
      model {
       #Likelihood 
       for (i in 1:n) {
         y[i]~dnorm(mu[i],tau)
         mu[i] <- alpha[x[i]]
       }
     
       #Priors
       for (j in min(x):max(x)) {
         alpha[j] ~ dnorm(0,0.001)
       }
     
       tau <- 1 / (sigma * sigma)
       sigma~dunif(0,100)
     
       #Other Derived parameters 
       effect <-alpha[2]-alpha[1]
     }
     "
    
    Effects
    ## write the model to a text file
    writeLines(modelString, con = "../downloads/BUGSscripts/ttestModel.txt")
    
    Means
    ## write the model to a text file
    writeLines(modelString.means, con = "../downloads/BUGSscripts/ttestModelMeans.txt")
    
  2. Arrange the data as a list (as required by BUGS). Note, all variables must be numeric, therefore we use the numeric version of x.
    Furthermore, the first level must be 1.
    Effects
    data.list <- with(data, list(y = y, x = xn, n = nrow(data)))
    print(data.list)
    
    $y
      [1] 103.12064 105.55093 102.49311 109.78584 105.98852 102.53859 106.46229 107.21497 106.72734
     [10] 104.08383 109.53534 106.16953 103.13628  98.35590 108.37479 104.86520 104.95143 107.83151
     [19] 107.46366 106.78170 107.75693 107.34641 105.22369  99.03194 106.85948 104.83161 104.53261
     [28] 100.58774 103.56555 106.25382 109.07604 104.69164 106.16301 104.83858 100.86882 103.75502
     [37] 103.81713 104.82206 108.30008 107.28953 104.50643 104.23991 107.09089 106.66999 102.93373
     [46] 102.87751 106.09375 107.30560 104.66296 107.64332 106.19432 103.16392 106.02336 101.61191
     [55] 109.29907 110.94120 103.89834 101.86760 106.70916 104.59484  84.70485  77.38228  79.56922
     [64]  77.58401  75.27018  78.06638  72.08512  81.89666  77.95976  84.01784  78.92653  75.37016
     [73]  79.33218  74.69771  73.73910  78.37434  76.17012  77.50332  77.72302  75.73144  75.79399
     [82]  77.09446  81.03426  72.92930  79.28184  78.49885  80.68930  76.58745  78.61006  78.30130
     [91]  75.87244  81.12360  80.98121  79.60064  82.26050  79.17546  73.67022  75.78020  73.82616
    [100]  76.07980
    
    $x
      [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
     [48] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
     [95] 2 2 2 2 2 2
    
    $n
    [1] 100
    
    Means
    data.list.means <- with(data, list(y = y, x = xn, n = nrow(data)))
    
  3. Define the initial values (α, β and σ²) for the chain. Reasonable starting points can be gleaned from the data themselves.
    Effects
    inits <- list(alpha = mean(data$y), beta = c(NA, diff(tapply(data$y,
        data$x, mean))), sigma = sd(data$y/2))
    inits
    
    $alpha
    [1] 94.32666
    
    $beta
                      B 
           NA -27.49047 
    
    $sigma
    [1] 6.900491
    
    Means
    inits.means <- list(alpha = tapply(data$y, data$x, mean), sigma = sd(data$y/2))
    inits.means
    
    $alpha
            A         B 
    105.32285  77.83238 
    
    $sigma
    [1] 6.900491
    
  4. Define the nodes (parameters and derivatives) to monitor
    Effects
    params <- c("alpha", "beta", "sigma", "Group.means")
    
    Means
    params.means <- c("alpha", "effect", "sigma")
    
  5. Define the chain parameters
    adaptSteps = 1000  # the number of steps over which to establish a good stepping distance
    burnInSteps = 2000  # the number of initial samples to discard
    nChains = 3  # the number of independent sampling chains to perform 
    numSavedSteps = 50000  # the total number of samples to store
    thinSteps = 1  # the thinning rate
    nIter = ceiling((numSavedSteps * thinSteps)/nChains)
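    
    With these settings, each chain will be run for nIter = ceiling((50000 * 1)/3) = 16667 iterations, which is why the summaries below report a sample size of 16667 per chain.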
    
  6. Start the JAGS model (check the model, load data into the model, specify the number of chains and compile the model).
    Effects
    library(rjags)
    
    jagsModel <- jags.model("../downloads/BUGSscripts/ttestModel.txt",
        data = data.list, inits = inits, n.chains = nChains,
        n.adapt = adaptSteps)
    
    Compiling model graph
       Resolving undeclared variables
       Allocating nodes
    Graph information:
       Observed stochastic nodes: 100
       Unobserved stochastic nodes: 3
       Total graph size: 218
    
    Initializing model
    
    Means
    library(rjags)
    
    jagsModel.means <- jags.model("../downloads/BUGSscripts/ttestModelMeans.txt",
        data = data.list.means, inits = inits.means, n.chains = nChains,
        n.adapt = adaptSteps)
    
    Compiling model graph
       Resolving undeclared variables
       Allocating nodes
    Graph information:
       Observed stochastic nodes: 100
       Unobserved stochastic nodes: 3
       Total graph size: 214
    
    Initializing model
    
  7. Update the model
    Effects
    update(jagsModel, n.iter = burnInSteps)
    
    Means
    update(jagsModel.means, n.iter = burnInSteps)
    
  8. Extract and summarize the samples
    Effects
    data.mcmc <- coda.samples(jagsModel, variable.names = params,
        n.iter = nIter, thin = thinSteps)
    summary(data.mcmc)
    
    Iterations = 3001:19667
    Thinning interval = 1 
    Number of chains = 3 
    Sample size per chain = 16667 
    
    1. Empirical mean and standard deviation for each variable,
       plus standard error of the mean:
    
                      Mean     SD  Naive SE Time-series SE
    Group.means[1] 105.328 0.3565 0.0015943       0.002439
    Group.means[2]  77.833 0.4323 0.0019333       0.001917
    alpha          105.328 0.3565 0.0015943       0.002439
    beta[1]          0.000 0.0000 0.0000000       0.000000
    beta[2]        -27.495 0.5592 0.0025009       0.003819
    sigma            2.744 0.2011 0.0008995       0.001203
    
    2. Quantiles for each variable:
    
                      2.5%     25%     50%     75%   97.5%
    Group.means[1] 104.629 105.088 105.327 105.567 106.028
    Group.means[2]  76.979  77.545  77.836  78.121  78.681
    alpha          104.629 105.088 105.327 105.567 106.028
    beta[1]          0.000   0.000   0.000   0.000   0.000
    beta[2]        -28.589 -27.868 -27.495 -27.119 -26.398
    sigma            2.385   2.604   2.731   2.871   3.174
    
    Means
    data.mcmc.means <- coda.samples(jagsModel.means, variable.names = params.means,
        n.iter = nIter, thin = thinSteps)
    summary(data.mcmc.means)
    
    Iterations = 3001:19667
    Thinning interval = 1 
    Number of chains = 3 
    Sample size per chain = 16667 
    
    1. Empirical mean and standard deviation for each variable,
       plus standard error of the mean:
    
               Mean     SD  Naive SE Time-series SE
    alpha[1] 105.31 0.3554 0.0015893       0.001585
    alpha[2]  77.82 0.4348 0.0019447       0.001945
    effect   -27.49 0.5614 0.0025107       0.002512
    sigma      2.74 0.1958 0.0008757       0.001150
    
    2. Quantiles for each variable:
    
                2.5%     25%     50%     75%   97.5%
    alpha[1] 104.606 105.073 105.311 105.546 106.010
    alpha[2]  76.971  77.526  77.814  78.108  78.686
    effect   -28.591 -27.866 -27.494 -27.117 -26.373
    sigma      2.389   2.605   2.728   2.865   3.158
    

As indicated earlier, we can alternatively access JAGS via functions from the R2jags package. Doing so provides some additional information. Notably, we can obtain estimates of deviance (and therefore DIC) as well as effective sample sizes for each of the parameters.

library(R2jags)

When using the jags() function (R2jags package), it is not necessary to provide initial values. However, if they are to be supplied, the initial values list must be a list the same length as the number of chains.

data.r2jags <- jags(data = data.list,
    inits = NULL,  # or inits = list(inits, inits, inits) since there are three chains
    parameters.to.save = params,
    model.file = "../downloads/BUGSscripts/ttestModelR2JAGS.txt",
    n.chains = 3,
    n.iter = nIter,
    n.burnin = burnInSteps,
    n.thin = 10)
print(data.r2jags)

Notes:

  • If inits=NULL, the jags() function will generate vaguely sensible initial values for each chain based on the data
  • In addition to the mean and quantiles of each of the sampled nodes, the jags() function will calculate:
    • the effective sample size for each sample - if n.eff for a node is substantially less than the number of iterations, then it suggests poor mixing
    • Rhat values for each sample - these are a convergence diagnostic (values of 1 indicate full convergence; values greater than 1.01 are indicative of non-convergence)
    • an information criterion (DIC) for model selection

Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 100
   Unobserved stochastic nodes: 3
   Total graph size: 218

Initializing model
Inference for Bugs model at "../downloads/BUGSscripts/ttestModelR2JAGS.txt", fit using jags,
 3 chains, each with 16667 iterations (first 2000 discarded), n.thin = 10
 n.sims = 4401 iterations saved
               mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
Group.means[1] 105.309   0.353 104.622 105.076 105.304 105.545 106.031 1.001  4400
Group.means[2]  77.839   0.443  76.971  77.534  77.835  78.143  78.711 1.001  4400
alpha          105.309   0.353 104.622 105.076 105.304 105.545 106.031 1.001  4400
beta[1]          0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1
beta[2]        -27.470   0.571 -28.593 -27.846 -27.479 -27.089 -26.356 1.001  4300
sigma            2.746   0.201   2.391   2.607   2.733   2.870   3.177 1.001  2800
deviance       484.149   2.581 481.271 482.256 483.425 485.344 490.827 1.001  4400

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 3.3 and DIC = 487.5
DIC is an estimate of expected predictive error (lower deviance is better).
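
These quantities can also be extracted programmatically from the fitted object. As a brief sketch (using the data.r2jags fit from above), the BUGSoutput component stores the DIC and the full summary table:

data.r2jags$BUGSoutput$DIC      # deviance information criterion
data.r2jags$BUGSoutput$summary  # means, quantiles, Rhat and n.eff for each node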
For completeness, the full code for this analysis is shown below.
modelString = "
model {
  #Likelihood
  for (i in 1:n) {
    y[i] ~ dnorm(mu[i], tau)
    mu[i] <- alpha + beta[x[i]]
  }

  #Priors
  alpha ~ dnorm(0, 0.001)
  beta[1] <- 0
  beta[2] ~ dnorm(0, 0.001)
  tau <- 1 / (sigma * sigma)
  sigma ~ dunif(0, 100)

  #Other Derived parameters
  # Group means (note, beta is a vector)
  Group.means <- alpha + beta
}
"
writeLines(modelString, con = "../downloads/BUGSscripts/ttestModelR2JAGS.txt")

data.list <- with(data, list(y = y, x = xn, n = nrow(data)))

params <- c("alpha", "beta", "sigma", "Group.means")

data.r2jags <- jags(data = data.list,
    inits = NULL,  # or inits = list(inits, inits, inits) since there are three chains
    parameters.to.save = params,
    model.file = "../downloads/BUGSscripts/ttestModelR2JAGS.txt",
    n.chains = 3,
    n.iter = nIter,
    n.burnin = burnInSteps,
    n.thin = 10)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 100
   Unobserved stochastic nodes: 3
   Total graph size: 218

Initializing model
         print(data.r2jags)
Inference for Bugs model at "../downloads/BUGSscripts/ttestModelR2JAGS.txt", fit using jags,
 3 chains, each with 16667 iterations (first 2000 discarded), n.thin = 10
 n.sims = 4401 iterations saved
               mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
Group.means[1] 105.311   0.352 104.621 105.077 105.313 105.547 105.998 1.002  2100
Group.means[2]  77.825   0.439  76.933  77.531  77.832  78.123  78.652 1.001  4400
alpha          105.311   0.352 104.621 105.077 105.313 105.547 105.998 1.002  2100
beta[1]          0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1
beta[2]        -27.486   0.562 -28.599 -27.868 -27.478 -27.090 -26.422 1.001  4400
sigma            2.743   0.200   2.390   2.603   2.729   2.872   3.179 1.001  4400
deviance       484.117   2.504 481.259 482.278 483.502 485.266 490.547 1.001  2900

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 3.1 and DIC = 487.3
DIC is an estimate of expected predictive error (lower deviance is better).

The total number of samples collected is 4401. That is, there are 4401 samples collected from the multidimensional posterior distribution and thus 4401 samples collected from the posterior distributions of each parameter. The effective number of samples column (n.eff) indicates the number of independent samples represented in the total. Here the chains appear to have mixed well for all parameters; where n.eff is noticeably less than the total (as for sigma in one of the runs above), it indicates some residual autocorrelation in the chain.
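
If you wish to apply the full suite of coda diagnostics to this fit, the R2jags object can be converted into a coda mcmc.list; a minimal sketch:

library(coda)
data.mcmc.r2jags <- as.mcmc(data.r2jags)  # convert the R2jags fit for use with coda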

From now on, we will perform all Bayesian analyses via jags() so as to leverage the maximum amount of knowledge available from the models.

Whilst the STAN language broadly resembles BUGS (intentionally so), there are numerous important differences. Some of these differences support translation to C++ for compilation (such as the need to declare variables). Others reflect the leveraging of vectorization to speed up run time. Here are some important notes about STAN:

  • All variables must be declared
  • Variables declared in the parameters block will be collected as samples
  • Anything declared in the transformed parameters block will also be collected as samples, and checks are made at every iteration
  • Variables declared in the model block are local to it and will not be collected
  • Consistent with the recommendations of Gelman (2006), I will use half-Cauchy distributions (scale=25) for variance priors.

Now I will demonstrate fitting the same models with STAN.

Note, I am using the refresh=0 option so as to suppress the larger regular output in the interest of keeping output to what is necessary for this tutorial. When running outside of a tutorial context, the regular verbose output is useful as it provides a way to gauge progress.

  • Cell means model
    stanString = "  
     data {
     int n;
     vector [n] y;
     vector [n] x;
     }
     parameters {
     real <lower=0, upper=100> sigma;
     real alpha;
     real beta;
     }
     transformed parameters {
     }
     model {
     vector [n] mu;
     
     //Priors
     alpha ~ normal(0,1000);
     beta ~ normal(0,1000);
     sigma ~ cauchy(0,25);
     
     mu = alpha + beta*x;
     //Likelihood
     y ~ normal(mu, sigma);
     }
     generated quantities {
     vector [2] Group_means;
     real CohensD;
     //Other Derived parameters 
     //# Group means (note, beta is a vector)
     Group_means[1] = alpha;
     Group_means[2] = alpha+beta;
     
     CohensD = beta /sigma;  
     }
     "
    
    data.list <- with(data, list(y = y, x = (xn - 1), n = nrow(data)))
    
    library(rstan)
    fit = stan(model_code = stanString, data = data.list, iter = 1000,
        warmup = 500, chains = 3, thin = 10, refresh = 0)
    
    
     Elapsed Time: 0.028601 seconds (Warm-up)
                   0.01643 seconds (Sampling)
                   0.045031 seconds (Total)
    
    
     Elapsed Time: 0.027946 seconds (Warm-up)
                   0.01798 seconds (Sampling)
                   0.045926 seconds (Total)
    
    
     Elapsed Time: 0.022885 seconds (Warm-up)
                   0.016029 seconds (Sampling)
                   0.038914 seconds (Total)
    
    print(fit)
    
    Inference for Stan model: f34dfebfd815f38d9fa1a7a3e79c427d.
    3 chains, each with iter=1000; warmup=500; thin=10; 
    post-warmup draws per chain=50, total post-warmup draws=150.
    
                      mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
    sigma             2.73    0.02 0.19    2.38    2.61    2.72    2.86    3.08   150 1.00
    alpha           105.30    0.03 0.34  104.55  105.09  105.28  105.53  106.00   150 0.99
    beta            -27.52    0.04 0.54  -28.52  -27.86  -27.51  -27.14  -26.52   150 1.01
    Group_means[1]  105.30    0.03 0.34  104.55  105.09  105.28  105.53  106.00   150 0.99
    Group_means[2]   77.79    0.03 0.42   77.00   77.48   77.82   78.13   78.51   150 1.01
    CohensD         -10.12    0.06 0.77  -11.67  -10.56  -10.13   -9.62   -8.77   150 0.99
    lp__           -149.10    0.11 1.21 -152.89 -149.49 -148.80 -148.32 -147.76   129 0.99
    
    Samples were drawn using NUTS(diag_e) at Fri May 19 15:28:01 2017.
    For each parameter, n_eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor on split chains (at 
    convergence, Rhat=1).
    
  • Effects model
    stanString = " 
     data {
     int n;
     int nX;
     vector [n] y;
     matrix [n,nX] x;
     }
     parameters {
     real <lower=0, upper=100> sigma;
     vector [nX] beta;
     }
     transformed parameters {
     }
     model {
     vector [n] mu;
     
     //Priors
     beta ~ normal(0,1000);
     sigma ~ cauchy(0,25);
     
     mu = x*beta;
     //Likelihood
     y ~ normal(mu, sigma);
     }
     generated quantities {
     vector [2] Group_means;
     real CohensD;
     
     //Other Derived parameters 
     Group_means[1] = beta[1];
     Group_means[2] = beta[1]+beta[2];
     
     CohensD = beta[2] /sigma;  
     }
     "
    X <- model.matrix(~x, data)
    data.list = with(data, list(y = y, x = X, n = nrow(data), nX = ncol(X)))
    
    library(rstan)
    data.stan = stan(model_code = stanString, data = data.list, iter = 2000,
        warmup = 500, chains = 3, thin = 2, refresh = 0)
    
    
     Elapsed Time: 0.021705 seconds (Warm-up)
                   0.037141 seconds (Sampling)
                   0.058846 seconds (Total)
    
    
     Elapsed Time: 0.020017 seconds (Warm-up)
                   0.042813 seconds (Sampling)
                   0.06283 seconds (Total)
    
    
     Elapsed Time: 0.020333 seconds (Warm-up)
                   0.035865 seconds (Sampling)
                   0.056198 seconds (Total)
    
    print(data.stan)
    
    Inference for Stan model: c31e59bbf2105c3cc486bc357dbbaebc.
    3 chains, each with iter=2000; warmup=500; thin=2; 
    post-warmup draws per chain=750, total post-warmup draws=2250.
    
                      mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
    sigma             2.75    0.00 0.21    2.38    2.61    2.74    2.88    3.19  1966    1
    beta[1]         105.32    0.01 0.37  104.61  105.07  105.31  105.57  106.07  1689    1
    beta[2]         -27.48    0.01 0.56  -28.59  -27.85  -27.46  -27.09  -26.39  1782    1
    Group_means[1]  105.32    0.01 0.37  104.61  105.07  105.31  105.57  106.07  1689    1
    Group_means[2]   77.84    0.01 0.43   77.02   77.55   77.85   78.14   78.64  2017    1
    CohensD         -10.05    0.02 0.77  -11.56  -10.55  -10.04   -9.54   -8.57  1854    1
    lp__           -149.25    0.03 1.27 -152.69 -149.83 -148.90 -148.33 -147.79  1762    1
    
    Samples were drawn using NUTS(diag_e) at Fri May 19 15:28:43 2017.
    For each parameter, n_eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor on split chains (at 
    convergence, Rhat=1).
    
  • Separate variances
    stanString = " 
     data {
     int n;
     vector [n] y;
     vector [n] x;
     int<lower=1,upper=2> xn[n];
     }
     parameters {
     vector <lower=0, upper=100>[2] sigma;
     real alpha;
     real beta;
     }
     transformed parameters {
     }
     model {
     vector [n] mu;
     //Priors
     alpha ~ normal(0,1000);
     beta ~ normal(0,1000);
     sigma ~ cauchy(0,25);
    
     mu = alpha + beta*x;
     //Likelihood
     for (i in 1:n) y[i] ~ normal(mu[i], sigma[xn[i]]);
     }
     generated quantities {
     vector [2] Group_means;
     real CohensD;
     real CLES;
    
     Group_means[1] = alpha;
     Group_means[2] = alpha+beta;
     CohensD = beta / (sum(sigma)/2);
     CLES = normal_cdf(beta / sum(sigma), 0, 1);
     }
     "
    
    data.list <- with(data, list(y = y, x = (xn - 1), xn = xn, n = nrow(data)))
    
    library(rstan)
    fit = stan(model_code = stanString, data = data.list, iter = 1000,
        warmup = 500, chains = 3, thin = 10, refresh = 0)
    
    print(fit)
    
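
As an aside, a compact parameter summary (with highest posterior density intervals) can also be obtained directly from a stanfit object via broom, mirroring the tidyMCMC() calls used for the other interfaces below; a brief sketch using the effects-model fit:

library(broom)
library(coda)
tidyMCMC(data.stan, conf.int = TRUE, conf.method = "HPDinterval")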
    

The STAN team has put together pre-compiled modules (functions) to make specifying and applying STAN models much simpler. Each function offers a consistent interface that is also reminiscent of the major frequentist linear modelling routines in R.

Whilst it is not necessary to specify priors when using rstanarm functions (as defaults will be generated), there is no guarantee that the routines for determining these defaults will persist over time. Furthermore, it is always better to define your own priors, if for no other reason than it forces you to think about what you are doing. Consistent with the pure STAN version, we will employ the following priors:

  • weakly informative Gaussian prior for the intercept $\beta_0 \sim{} N(0, 1000)$
  • weakly informative Gaussian prior for the treatment effect $\beta_1 \sim{} N(0, 1000)$
  • half-cauchy prior for the variance $\sigma \sim{} Cauchy(0, 25)$

Note, I am using the refresh=0 option so as to suppress the larger regular output in the interest of keeping output to what is necessary for this tutorial. When running outside of a tutorial context, the regular verbose output is useful as it provides a way to gauge progress.

library(rstanarm)
data.rstanarm = stan_glm(y ~ x, data = data, iter = 2000, warmup = 200,
    chains = 3, thin = 2, refresh = 0, prior_intercept = normal(0, 1000),
    prior = normal(0, 1000), prior_aux = cauchy(0, 25))
 Elapsed Time: 0.925718 seconds (Warm-up)
               0.835596 seconds (Sampling)
               1.76131 seconds (Total)


 Elapsed Time: 1.16274 seconds (Warm-up)
               0.692525 seconds (Sampling)
               1.85526 seconds (Total)


 Elapsed Time: 1.204 seconds (Warm-up)
               0.845399 seconds (Sampling)
               2.0494 seconds (Total)
print(data.rstanarm)
stan_glm(formula = y ~ x, data = data, iter = 2000, warmup = 200, 
    chains = 3, thin = 2, refresh = 0, prior = normal(0, 1000), 
    prior_intercept = normal(0, 1000), prior_aux = cauchy(0, 
        25))

Estimates:
            Median MAD_SD
(Intercept) 105.3    0.4 
xB          -27.5    0.6 
sigma         2.7    0.2 

Sample avg. posterior predictive 
distribution of y (X = xbar):
         Median MAD_SD
mean_PPD 94.3    0.4  
library(broom)
library(coda)
tidyMCMC(data.rstanarm, conf.int = TRUE, conf.method = "HPDinterval")
         term   estimate std.error   conf.low  conf.high
1 (Intercept) 105.326338 0.3555535 104.659397 106.022743
2          xB -27.484064 0.5670145 -28.619208 -26.373617
3       sigma   2.751499 0.2029175   2.365682   3.141828
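
rstanarm also provides convenience functions of its own; for example, credible intervals can be extracted directly from the fit (a brief sketch):

posterior_interval(data.rstanarm, prob = 0.95)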

The brms package serves a similar goal to the rstanarm package - to provide a simple user interface to STAN. However, unlike the rstanarm implementation, brms simply converts the formula, data, priors and family into STAN model code and data before executing stan with those elements.

Whilst it is not necessary to specify priors when using brms functions (as defaults will be generated), there is no guarantee that the routines for determining these defaults will persist over time. Furthermore, it is always better to define your own priors, if for no other reason than it forces you to think about what you are doing. Consistent with the pure STAN version, we will employ the following priors:

  • weakly informative Gaussian prior for the intercept $\beta_0 \sim{} N(0, 1000)$
  • weakly informative Gaussian prior for the treatment effect $\beta_1 \sim{} N(0, 1000)$
  • half-cauchy prior for the variance $\sigma \sim{} Cauchy(0, 25)$

Note, I am using the refresh=0 option so as to suppress the larger regular output in the interest of keeping output to what is necessary for this tutorial. When running outside of a tutorial context, the regular verbose output is useful as it provides a way to gauge progress.

library(brms)
data.brms = brm(y ~ x, data = data, iter = 2000, warmup = 200, chains = 3,
    thin = 2, refresh = 0, prior = c(prior(normal(0, 1000), class = "Intercept"),
        prior(normal(0, 1000), class = "b"), prior(cauchy(0, 25), class = "sigma")))
 Elapsed Time: 0.013122 seconds (Warm-up)
               0.038267 seconds (Sampling)
               0.051389 seconds (Total)


 Elapsed Time: 0.011992 seconds (Warm-up)
               0.034585 seconds (Sampling)
               0.046577 seconds (Total)


 Elapsed Time: 0.01629 seconds (Warm-up)
               0.039889 seconds (Sampling)
               0.056179 seconds (Total)
print(data.brms)
 Family: gaussian (identity) 
Formula: y ~ x 
   Data: data (Number of observations: 100) 
Samples: 3 chains, each with iter = 2000; warmup = 200; thin = 2; 
         total post-warmup samples = 2700
   WAIC: Not computed
 
Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept   105.34      0.36   104.64   106.03       2444    1
xB          -27.50      0.57   -28.65   -26.40       2435    1

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     2.75      0.19     2.39     3.18       2515    1

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).
library(broom)
library(coda)
tidyMCMC(data.brms$fit, conf.int = TRUE, conf.method = "HPDinterval")
         term   estimate std.error   conf.low  conf.high
1 b_Intercept 105.343811 0.3569517 104.629829 106.010549
2        b_xB -27.503759 0.5672394 -28.687165 -26.453145
3       sigma   2.748028 0.1948715   2.354592   3.111595
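
Since brms generates STAN code and data behind the scenes, both can be inspected directly - a useful way to learn STAN syntax (output not shown):

stancode(data.brms)  # the STAN model code generated from the formula and priors
standata(data.brms)  # the data list passed to STAN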

MCMC diagnostics

In addition to the regular model diagnostic checks (such as residual plots), for Bayesian analyses it is necessary to explore the characteristics of the MCMC chains and the sampler in general. Recall that the purpose of MCMC sampling is to approximate the posterior distribution implied by the model likelihood and priors by drawing a known number of samples from this posterior (thereby characterizing it as a probability distribution). This is only reliable if the MCMC samples accurately reflect the posterior.

Unfortunately, since we only know the posterior in the most trivial of circumstances, it is necessary to rely on indirect measures of how accurately the MCMC samples are likely to reflect the posterior. I will briefly outline the most important diagnostics; however, please refer to Tutorial 4.3, Section 3.1: Markov Chain Monte Carlo sampling for a discussion of these diagnostics.

  • Traceplots for each parameter illustrate the MCMC sample values after each successive iteration along the chain. Bad chain mixing (characterized by any sort of pattern) suggests that the MCMC sampling chains may not have completely traversed all features of the posterior distribution and that more iterations are required to ensure the distribution has been accurately represented.
    [Figure: example trace plots]
  • Autocorrelation plots for each parameter illustrate the degree of correlation between MCMC samples separated by different lags. For example, a lag of 0 represents the degree of correlation between each MCMC sample and itself (obviously this will be a correlation of 1). A lag of 1 represents the degree of correlation between each MCMC sample and the next sample along the chain, and so on. In order to be able to generate unbiased estimates of parameters, the MCMC samples should be independent (uncorrelated). In the figures below, this would be violated in the top autocorrelation plot and met in the bottom autocorrelation plot.
    [Figures: example autocorrelation plots (high and low autocorrelation)]
  • Rhat statistics for each parameter provide a measure of sampling efficiency/effectiveness. Ideally, all values should be less than 1.05. If there are values of 1.05 or greater, it suggests that the sampler was not very efficient or effective. Not only does this mean that the sampler was potentially slower than it could have been, more importantly, it could indicate that the sampler spent time sampling in a region of the likelihood that is less informative. Such a situation can arise from either a misspecified model or overly vague priors that permit sampling in otherwise nonsense parameter space.

Prior to inspecting any summaries of the parameter estimates, it is prudent to inspect a range of chain convergence diagnostics.

  • Trace plots
    plot(data.mcmcpack)
    
    [Figure: MCMCpack trace plots]
    Trace plots show no evidence that the chains have not reasonably traversed the entire multidimensional parameter space.
  • Raftery diagnostic
    raftery.diag(data.mcmcpack)
    
    Quantile (q) = 0.025
    Accuracy (r) = +/- 0.005
    Probability (s) = 0.95 
                                                       
                 Burn-in  Total Lower bound  Dependence
                 (M)      (N)   (Nmin)       factor (I)
     (Intercept) 2        3802  3746         1.010     
     xB          2        3650  3746         0.974     
     sigma2      2        3865  3746         1.030     
    
    The Raftery diagnostics estimate that we would require about 3900 samples to reach the specified level of confidence in convergence. As we have 10,000 samples, we can be confident that convergence has occurred.
  • Autocorrelation diagnostic
    autocorr.diag(data.mcmcpack)
    
             (Intercept)           xB       sigma2
    Lag 0   1.0000000000  1.000000000  1.000000000
    Lag 1   0.0095872277 -0.001080877  0.027941138
    Lag 5   0.0062995065  0.011489312  0.005292982
    Lag 10  0.0003454982  0.011819057 -0.002390377
    Lag 50 -0.0150021264 -0.023501128 -0.005570799
    
    A lag of 1 appears largely sufficient to avoid autocorrelation; even the largest lag-1 correlation (sigma2, approximately 0.03) is trivially small. If this were a concern, a thinning interval of 5 would address it.
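
A complementary check on mixing is coda's effective sample size; values close to the 10,000 retained samples indicate that the draws are essentially independent (a brief sketch):

effectiveSize(data.mcmcpack)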

Again, prior to examining the summaries, we should have explored the convergence diagnostics.

  • Trace plots
    plot(data.mcmc)
    
    [Figure: JAGS trace plots]
    Trace plots show no evidence that the chains have not reasonably traversed the entire multidimensional parameter space.
  • Raftery diagnostic
    raftery.diag(data.mcmc)
    
    [[1]]
    
    Quantile (q) = 0.025
    Accuracy (r) = +/- 0.005
    Probability (s) = 0.95 
                                                          
                    Burn-in  Total Lower bound  Dependence
                    (M)      (N)   (Nmin)       factor (I)
     Group.means[1] 4        4723  3746         1.26      
     Group.means[2] 2        3780  3746         1.01      
     alpha          4        4723  3746         1.26      
     beta[1]        <NA>     <NA>  3746           NA      
     beta[2]        4        4699  3746         1.25      
     sigma          5        5438  3746         1.45      
    
    
    [[2]]
    
    Quantile (q) = 0.025
    Accuracy (r) = +/- 0.005
    Probability (s) = 0.95 
                                                          
                    Burn-in  Total Lower bound  Dependence
                    (M)      (N)   (Nmin)       factor (I)
     Group.means[1] 4        4723  3746         1.260     
     Group.means[2] 2        3725  3746         0.994     
     alpha          4        4723  3746         1.260     
     beta[1]        <NA>     <NA>  3746            NA     
     beta[2]        3        4474  3746         1.190     
     sigma          5        5438  3746         1.450     
    
    
    [[3]]
    
    Quantile (q) = 0.025
    Accuracy (r) = +/- 0.005
    Probability (s) = 0.95 
                                                          
                    Burn-in  Total Lower bound  Dependence
                    (M)      (N)   (Nmin)       factor (I)
     Group.means[1] 4        4770  3746         1.27      
     Group.means[2] 2        3780  3746         1.01      
     alpha          4        4770  3746         1.27      
     beta[1]        <NA>     <NA>  3746           NA      
     beta[2]        4        4630  3746         1.24      
     sigma          4        5301  3746         1.42      
    
    The Raftery diagnostics for each chain estimate that we would require no more than about 5500 samples to reach the specified level of confidence in convergence. As we have 16,667 samples per chain, we can be confident that convergence has occurred.
  • Autocorrelation diagnostic
    autocorr.diag(data.mcmc)
    
           Group.means[1] Group.means[2]         alpha beta[1]       beta[2]        sigma
    Lag 0    1.0000000000    1.000000000  1.0000000000     NaN  1.0000000000  1.000000000
    Lag 1    0.4040691225   -0.006258697  0.4040691225     NaN  0.3962572047  0.266923674
    Lag 5    0.0155316920    0.004603919  0.0155316920     NaN  0.0144527969  0.003480321
    Lag 10   0.0084921354   -0.001196139  0.0084921354     NaN  0.0017424689 -0.002382651
    Lag 50  -0.0002330403   -0.001298037 -0.0002330403     NaN -0.0008950162 -0.006333724
    
    A lag of 1 appears insufficient to avoid autocorrelation (poor mixing). The diagnostic suggests that a thinning factor of 10 would potentially correct this.
  • Re-run the chain with a thinning factor of 10 and confirm better mixing
    jagsModel <- jags.model("../downloads/BUGSscripts/ttestModel.txt", data = data.list,
        inits = inits, n.chains = nChains, n.adapt = adaptSteps)
    
    
    update(jagsModel, n.iter = burnInSteps)
    data.mcmc <- coda.samples(jagsModel, variable.names = params, n.iter = nIter,
        thin = 10)
    autocorr.diag(data.mcmc)
    
            Group.means[1] Group.means[2]        alpha beta[1]      beta[2]        sigma
    Lag 0      1.000000000    1.000000000  1.000000000     NaN  1.000000000  1.000000000
    Lag 10    -0.013364227   -0.010595139 -0.013364227     NaN -0.014814748 -0.009671218
    Lag 50     0.021366873   -0.003207512  0.021366873     NaN  0.025820823 -0.001696005
    Lag 100    0.001445556    0.008257461  0.001445556     NaN  0.001740606 -0.003615878
    Lag 500    0.004417772    0.024929961  0.004417772     NaN  0.022766553  0.012109929
    
    summary(data.mcmc)
    
    Iterations = 3010:19670
    Thinning interval = 10 
    Number of chains = 3 
    Sample size per chain = 1667 
    
    1. Empirical mean and standard deviation for each variable,
       plus standard error of the mean:
    
                      Mean     SD Naive SE Time-series SE
    Group.means[1] 105.320 0.3602 0.005093       0.005016
    Group.means[2]  77.826 0.4370 0.006179       0.006178
    alpha          105.320 0.3602 0.005093       0.005016
    beta[1]          0.000 0.0000 0.000000       0.000000
    beta[2]        -27.494 0.5603 0.007923       0.007923
    sigma            2.745 0.1990 0.002814       0.002838
    
    2. Quantiles for each variable:
    
                      2.5%     25%     50%     75%   97.5%
    Group.means[1] 104.612 105.077 105.315 105.563 106.020
    Group.means[2]  76.970  77.531  77.823  78.120  78.673
    alpha          104.612 105.077 105.315 105.563 106.020
    beta[1]          0.000   0.000   0.000   0.000   0.000
    beta[2]        -28.572 -27.878 -27.494 -27.128 -26.393
    sigma            2.385   2.606   2.731   2.873   3.163
    
    A thinning interval (lag) of 10 seems appropriate, although this has very little impact in this very simple case.

Again, prior to examining the summaries, we should have explored the convergence diagnostics. There are numerous ways of working with STAN model fits (for exploring diagnostics and summarization).

  • extract the mcmc samples and convert them into a mcmc.list to leverage the various coda routines
  • use the numerous routines that come with the rstan package
  • use the routines that come with the bayesplot package
  • explore the diagnostics interactively via shinystan

We will explore all of these:
  • via coda
    • Traceplots
    • library(coda)
      s = as.array(data.stan)
      mcmc <- do.call(mcmc.list, plyr:::alply(s[, , -(length(s[1, 1, ]))], 2, as.mcmc))
      plot(mcmc)
      
      [Figure: trace plots via coda]
      Trace plots show no evidence that the chains have not reasonably traversed the entire multidimensional parameter space.
    • Autocorrelation
    • library(coda)
      s = as.array(data.stan)
      mcmc <- do.call(mcmc.list, plyr:::alply(s[, , -(length(s[1, 1, ]))], 2, as.mcmc))
      autocorr.diag(mcmc)
      
                   sigma      beta[1]       beta[2] Group_means[1] Group_means[2]     CohensD
      Lag 0  1.000000000  1.000000000  1.0000000000    1.000000000    1.000000000 1.000000000
      Lag 1  0.031786529  0.124408655  0.1192348060    0.124408655    0.039941783 0.054265261
      Lag 5  0.003533896  0.003471937  0.0109284026    0.003471937   -0.001485474 0.006464197
      Lag 10 0.022188178  0.036794169 -0.0004697056    0.036794169   -0.002005375 0.008347649
      Lag 50 0.010853153 -0.007082875  0.0016848268   -0.007082875   -0.003883857 0.017268624
      
      Autocorrelation is essentially negligible by a lag of 5 for all parameters.
  • via rstan
    • Traceplots
      stan_trace(data.stan)
      
      [Figure: Stan trace plots]
      Trace plots show no evidence that the chains have not reasonably traversed the entire multidimensional parameter space.
    • Raftery diagnostic
      raftery.diag(data.stan)
      
      Quantile (q) = 0.025
      Accuracy (r) = +/- 0.005
      Probability (s) = 0.95 
      
      You need a sample size of at least 3746 with these values of q, r and s
      
      The Raftery diagnostic indicates that a sample size of at least 3746 is required to estimate the 0.025 quantile to the specified accuracy. As only 2250 post-warmup draws were retained here, the chains would need to be lengthened before this diagnostic could be applied with confidence.
    • Autocorrelation diagnostic
      stan_ac(data.stan)
      
      [Figure: autocorrelation plots]
      A lag of 2 appears broadly sufficient to avoid autocorrelation.
    • Rhat values. These values are a measure of sampling efficiency/effectiveness. Ideally, all values should be less than 1.05. If there are values of 1.05 or greater, it suggests that the sampler was not very efficient or effective. Not only does this mean that the sampler was potentially slower than it could have been, more importantly, it could indicate that the sampler spent time sampling in a region of the likelihood that is less informative. Such a situation can arise from either a misspecified model or overly vague priors that permit sampling in otherwise nonsense parameter space.
      stan_rhat(data.stan)
      
      [Figure: histogram of Rhat values]
      In this instance, all rhat values are well below 1.05 (a good thing).
    • Another measure of sampling efficiency is effective sample size (ESS). ESS indicates the number (or proportion) of samples that the sampling algorithm deemed effective. The sampler rejects samples on the basis of certain criteria and, when it does so, the previous sample value is used. Hence, while the MCMC sampling chain may contain 1000 samples, if there are only 10 effective samples (1%), the estimated properties are not likely to be reliable.
      stan_ess(data.stan)
      
      [Figure: effective sample size plot]
      In this instance, most of the parameters have reasonably high effective sample sizes and thus there is likely to be a good range of values from which to estimate parameter properties.
  • via bayesplot
    • Trace plots and density plots
      library(bayesplot)
      mcmc_trace(as.array(data.stan), regex_pars = "beta|sigma")
      
      [Figure: trace plots (bayesplot)]
      library(bayesplot)
      mcmc_combo(as.array(data.stan))
      
      [Figure: combined trace and density plots]
      Trace plots show no evidence that the chains have not reasonably traversed the entire multidimensional parameter space.
    • Density plots
      library(bayesplot)
      mcmc_dens(as.array(data.stan))
      
      [Figure: posterior density plots]
      Density plots suggest that either the mean or the median would be appropriate to describe the fixed-effect posteriors, and that the median is appropriate for the sigma posterior.
  • via shinystan
    	 library(shinystan)
launch_shinystan(data.stan)
    
  • It is worth exploring the influence of our priors.
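
Beyond the graphical checks above, both the Rhat and effective sample size diagnostics can be extracted numerically. The following is a minimal sketch, assuming data.stan is the stanfit object fitted earlier (the same approach works for the rstanarm and brms fits via data.rstanarm$stanfit and data.brms$fit):

library(rstan)
## Extract the full summary matrix and keep just the convergence diagnostics
diags = summary(data.stan)$summary[, c("n_eff", "Rhat")]
diags
## Flag any parameters whose Rhat exceeds the 1.05 guideline
rownames(diags)[diags[, "Rhat"] > 1.05]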

Again, prior to examining the summaries, we should have explored the convergence diagnostics. There are numerous ways of working with STAN model fits (for exploring diagnostics and summarization).

  • extract the mcmc samples and convert them into a mcmc.list to leverage the various coda routines
  • use the numerous routines that come with the rstan package
  • use the routines that come with the bayesplot package
  • explore the diagnostics interactively via shinystan

We will explore all of these:
  • via coda
    • Traceplots
    • library(coda)
      s = as.array(data.rstanarm)
      mcmc <- do.call(mcmc.list, plyr:::alply(s[, , -(length(s[1, 1, ]))], 2, as.mcmc))
      plot(mcmc)
      
      plot of chunk RSTANARMcodaTraceplots
      Trace plots show no evidence that the chains have not reasonably traversed the entire multidimensional parameter space.
    • Autocorrelation
    • library(coda)
      s = as.array(data.rstanarm)
      mcmc <- do.call(mcmc.list, plyr:::alply(s[, , -(length(s[1, 1, ]))], 2, as.mcmc))
      autocorr.diag(mcmc)
      
              (Intercept)           xB
      Lag 0   1.000000000  1.000000000
      Lag 1   0.042874897 -0.042767296
      Lag 5  -0.028127765 -0.046081256
      Lag 10 -0.007295641  0.004702507
      Lag 50 -0.018160750  0.003582176
      
      Trace plots show no evidence that the chains have not reasonably traversed the entire multidimensional parameter space.
  • via rstan
    • Traceplots
      stan_trace(data.rstanarm)
      
      plot of chunk RSTANARMTrace
      Trace plots show no evidence that the chains have not reasonably traversed the entire multidimensional parameter space.
    • Raftery diagnostic
      raftery.diag(data.rstanarm)
      
      Quantile (q) = 0.025
      Accuracy (r) = +/- 0.005
      Probability (s) = 0.95 
      
      You need a sample size of at least 3746 with these values of q, r and s
      
The Raftery diagnostics for each chain estimate that we would require fewer than 5000 samples to reach the specified level of confidence in convergence. As we have 16,667 samples, we can be confident that convergence has occurred.
    • Autocorrelation diagnostic
      stan_ac(data.rstanarm)
      
      plot of chunk RSTANARMAuto
      A lag of 2 appears broadly sufficient to avoid autocorrelation (poor mixing).
  • Rhat values. These values are a measure of sampling efficiency/effectiveness. Ideally, all values should be less than 1.05. If there are values of 1.05 or greater it suggests that the sampler was not very efficient or effective. Not only does this mean that the sampler was potentially slower than it could have been; more importantly, it could indicate that the sampler spent time sampling in a region of the likelihood that is less informative. Such a situation can arise from either a misspecified model or overly vague priors that permit sampling in otherwise nonsense parameter space.
      stan_rhat(data.rstanarm)
      
      plot of chunk RSTANARMRhat
      In this instance, all rhat values are well below 1.05 (a good thing).
  • Another measure of sampling efficiency is Effective Sample Size (ESS). ESS indicates the number (or proportion) of samples that the sampling algorithm deemed effective. The sampler rejects samples on the basis of certain criteria, and when it does so, the previous sample value is used. Hence, while the MCMC sampling chain may contain 1000 samples, if there are only 10 effective samples (1%), the estimated properties are not likely to be reliable.
      stan_ess(data.rstanarm)
      
      plot of chunk RSTANARMess
In this instance, most of the parameters have reasonably high effective sample sizes and thus there is likely to be a good range of values from which to estimate parameter properties.
  • via bayesplot
    • Trace plots and density plots
      library(bayesplot)
      mcmc_trace(as.array(data.rstanarm), regex_pars = "Intercept|xB|sigma")
      
      plot of chunk RSTANARMMCMCTrace
      library(bayesplot)
      mcmc_combo(as.array(data.rstanarm))
      
      plot of chunk RSTANARMTrace1
      Trace plots show no evidence that the chains have not reasonably traversed the entire multidimensional parameter space.
    • Density plots
      library(bayesplot)
      mcmc_dens(as.array(data.rstanarm))
      
      plot of chunk RSTANARMdens
Density plots suggest the mean or median would be appropriate to describe the fixed posteriors, and the median is appropriate for the sigma posterior.
  • via rstanarm. The rstanarm package provides additional posterior checks.
    • Posterior vs Prior - this compares the posterior estimate for each parameter against the associated prior. If the spread of the priors is small relative to the posterior, then it is likely that the priors are too influential. On the other hand, overly wide priors can lead to computational issues.
      library(rstanarm)
      posterior_vs_prior(data.rstanarm, color_by = "vs", group_by = TRUE,
          facet_args = list(scales = "free_y"))
      
       Elapsed Time: 6.40746 seconds (Warm-up)
                     0.087671 seconds (Sampling)
                     6.49513 seconds (Total)
      
      
       Elapsed Time: 6.42475 seconds (Warm-up)
                     0.094753 seconds (Sampling)
                     6.51951 seconds (Total)
      
      plot of chunk RSTANARMposterorvsprior
  • via shinystan
    	 library(shinystan)
launch_shinystan(data.rstanarm)
    
  • It is worth exploring the influence of our priors; a quick way to inspect the priors that rstanarm actually used is sketched below.
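
Since rstanarm selects (weakly informative) default priors on our behalf when none are specified, a sensible first step is to display exactly which priors were used. A minimal sketch using rstanarm's prior_summary function:

library(rstanarm)
## Display the priors used for this fit (after any internal autoscaling)
prior_summary(data.rstanarm)

If the reported priors appear narrow relative to the posterior estimates, the model could be refit with wider priors to gauge their influence.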

Again, prior to examining the summaries, we should have explored the convergence diagnostics. There are numerous ways of working with STAN model fits (for exploring diagnostics and summarization).

  • extract the mcmc samples and convert them into a mcmc.list to leverage the various coda routines
  • use the numerous routines that come with the rstan package
  • use the routines that come with the bayesplot package
  • explore the diagnostics interactively via shinystan

We will explore all of these:
  • via coda
    • Traceplots
    • library(coda)
      mcmc = as.mcmc(data.brms)
      plot(mcmc)
      
      plot of chunk BRMScodaTraceplots
      Trace plots show no evidence that the chains have not reasonably traversed the entire multidimensional parameter space.
    • Autocorrelation
    • library(coda)
      mcmc = as.mcmc(data.brms)
      autocorr.diag(mcmc)
      
      Error in ts(x, start = start(x), end = end(x), deltat = thin(x)): invalid time series parameters specified
      
The autocorr.diag function fails on the mcmc object returned by as.mcmc for this brms fit; a per-chain workaround is sketched at the end of this list, and autocorrelation is also assessed via stan_ac below.
  • via rstan
    • Traceplots
      stan_trace(data.brms$fit)
      
      plot of chunk BRMSTrace
      Trace plots show no evidence that the chains have not reasonably traversed the entire multidimensional parameter space.
    • Raftery diagnostic
      raftery.diag(data.brms)
      
      Quantile (q) = 0.025
      Accuracy (r) = +/- 0.005
      Probability (s) = 0.95 
      
      You need a sample size of at least 3746 with these values of q, r and s
      
The Raftery diagnostics for each chain estimate that we would require fewer than 5000 samples to reach the specified level of confidence in convergence. As we have 16,667 samples, we can be confident that convergence has occurred.
    • Autocorrelation diagnostic
      stan_ac(data.brms$fit)
      
      plot of chunk BRMSAuto
      A lag of 2 appears broadly sufficient to avoid autocorrelation (poor mixing).
  • Rhat values. These values are a measure of sampling efficiency/effectiveness. Ideally, all values should be less than 1.05. If there are values of 1.05 or greater it suggests that the sampler was not very efficient or effective. Not only does this mean that the sampler was potentially slower than it could have been; more importantly, it could indicate that the sampler spent time sampling in a region of the likelihood that is less informative. Such a situation can arise from either a misspecified model or overly vague priors that permit sampling in otherwise nonsense parameter space.
      stan_rhat(data.brms$fit)
      
      plot of chunk BRMSRhat
      In this instance, all rhat values are well below 1.05 (a good thing).
  • Another measure of sampling efficiency is Effective Sample Size (ESS). ESS indicates the number (or proportion) of samples that the sampling algorithm deemed effective. The sampler rejects samples on the basis of certain criteria, and when it does so, the previous sample value is used. Hence, while the MCMC sampling chain may contain 1000 samples, if there are only 10 effective samples (1%), the estimated properties are not likely to be reliable.
      stan_ess(data.brms$fit)
      
      plot of chunk BRMSess
In this instance, most of the parameters have reasonably high effective sample sizes and thus there is likely to be a good range of values from which to estimate parameter properties.
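
As noted above, passing the brms fit directly to coda's autocorr.diag failed. A workaround (a sketch mirroring the per-chain conversion used for the rstanarm fit earlier) is to extract the chains from the underlying stanfit and assemble an mcmc.list manually:

library(coda)
## iterations x chains x parameters array from the underlying stanfit
s = as.array(data.brms$fit)
## drop the final (lp__) parameter and convert each chain to an mcmc object
mcmc <- do.call(mcmc.list, plyr:::alply(s[, , -(length(s[1, 1, ]))], 2, as.mcmc))
autocorr.diag(mcmc)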

Model validation

Model validation involves exploring the model diagnostics and fit to ensure that the model is broadly appropriate for the data. As such, exploration of the residuals should be routine.

Ideally, a good model should also be able to predict the data used to fit the model.
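
For fits without a posterior_predict method (such as the MCMCpack fit below), replicate data can be simulated manually from the posterior samples. A minimal sketch, assuming the column order reported in the summaries later ((Intercept), xB, then the variance sigma2):

mcmc = as.matrix(data.mcmcpack)
Xmat = model.matrix(~x, data)
## simulate one replicate dataset per posterior draw
yRep = apply(mcmc, 1, function(theta) rnorm(nrow(data), Xmat %*% theta[1:2], sqrt(theta[3])))
## compare the observed mean against the distribution of replicate means
hist(colMeans(yRep))
abline(v = mean(data$y), col = "red")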

Residuals are not computed directly within MCMCpack. However, we can calculate them manually from the posteriors.

mcmc = as.data.frame(data.mcmcpack)
# generate a model matrix
newdata = data.frame(x = data$x)
Xmat = model.matrix(~x, newdata)
## get median parameter estimates
coefs = apply(mcmc[, 1:2], 2, median)
fit = as.vector(coefs %*% t(Xmat))
resid = data$y - fit
ggplot() + geom_point(data = NULL, aes(y = resid, x = fit))
plot of chunk MCMCpackresid

Residuals are not computed directly within R2jags. However, we can calculate them manually from the posteriors.

mcmc = data.r2jags$BUGSoutput$sims.matrix[, c("alpha", "beta[2]")]
# generate a model matrix
newdata = data.frame(x = data$x)
Xmat = model.matrix(~x, newdata)
## get median parameter estimates
coefs = apply(mcmc, 2, median)
fit = as.vector(coefs %*% t(Xmat))
resid = data$y - fit
ggplot() + geom_point(data = NULL, aes(y = resid, x = fit))
plot of chunk R2JAGSresid

Residuals are not computed directly within RSTAN. However, we can calculate them manually from the posteriors.

mcmc = as.matrix(data.stan)[, c("beta[1]", "beta[2]")]
# generate a model matrix
newdata = data.frame(x = data$x)
Xmat = model.matrix(~x, newdata)
## get median parameter estimates
coefs = apply(mcmc, 2, median)
fit = as.vector(coefs %*% t(Xmat))
resid = data$y - fit
ggplot() + geom_point(data = NULL, aes(y = resid, x = fit))
plot of chunk RSTANresid

Residuals can be computed directly within RSTANARM.

resid = resid(data.rstanarm)
fit = fitted(data.rstanarm)
ggplot() + geom_point(data = NULL, aes(y = resid, x = fit))
plot of chunk RSTANARMresid

Let's compare draws (predictions) from the posterior (in blue) with the distributions of observed data (red) via violin plots. Violin plots are similar to boxplots except that they display more visual information about the density distribution.

y_pred = posterior_predict(data.rstanarm)
newdata = data %>% cbind(t(y_pred)) %>% gather(key = "Rep", value = "Value", -y, -x, -xn)
ggplot(newdata, aes(Value, x = x)) + geom_violin(color = "blue", fill = "blue", alpha = 0.5) + geom_violin(data = data,
    aes(y = y, x = x), fill = "red", color = "red", alpha = 0.5)
plot of chunk RSTANARMposteriorpredict
Conclusions - the posterior predictions match the observed data very well.

Residuals can be computed directly within BRMS. By default, the residuals and fitted extractor functions in brms return summarized versions (means, SE and credibility intervals). We are only interested in the mean (Estimate) column.

resid = resid(data.brms)[, "Estimate"]
fit = fitted(data.brms)[, "Estimate"]
ggplot() + geom_point(data = NULL, aes(y = resid, x = fit))
plot of chunk BRMSresid

Let's compare draws (predictions) from the posterior (in blue) with the distributions of observed data (red) via violin plots. Violin plots are similar to boxplots except that they display more visual information about the density distribution.

y_pred = posterior_predict(data.brms)
newdata = data %>% cbind(t(y_pred)) %>% gather(key = "Rep", value = "Value", -y, -x, -xn)
ggplot(newdata, aes(Value, x = x)) + geom_violin(color = "blue", fill = "blue", alpha = 0.5) + geom_violin(data = data,
    aes(y = y, x = x), fill = "red", color = "red", alpha = 0.5)
plot of chunk BRMSposteriorpredict
Conclusions - the posterior predictions match the observed data very well.

Notwithstanding the slight issue of autocorrelation in the sigma2 samples, there is no evidence that the mcmc chain did not converge on a stable posterior distribution. We are now in a position to examine the summaries of the parameters.

Parameter estimates (posterior summaries)

Although all parameters in a Bayesian analysis are considered random and are thus described by a full posterior distribution, rarely would it be useful to present tables of all the samples from each distribution. On the other hand, plots of the posterior distributions do have some use. Nevertheless, most workers prefer to present simple statistical summaries of the posteriors. Popular choices include the median (or mean) and 95% credibility intervals.

summary(data.mcmcpack)
Iterations = 1001:11000
Thinning interval = 1 
Number of chains = 1 
Sample size per chain = 10000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

               Mean     SD Naive SE Time-series SE
(Intercept) 105.325 0.3530 0.003530       0.003530
xB          -27.493 0.5634 0.005634       0.005634
sigma2        7.497 1.1016 0.011016       0.011328

2. Quantiles for each variable:

               2.5%     25%     50%     75%   97.5%
(Intercept) 104.637 105.090 105.329 105.560 106.014
xB          -28.603 -27.865 -27.494 -27.119 -26.374
sigma2        5.647   6.723   7.388   8.167   9.908
# OR
library(broom)
tidyMCMC(data.mcmcpack, conf.int = TRUE, conf.method = "HPDinterval")
         term   estimate std.error   conf.low  conf.high
1 (Intercept) 105.324812 0.3530422 104.631229 106.004451
2          xB -27.492624 0.5634498 -28.584014 -26.359402
3      sigma2   7.497343 1.1015580   5.525662   9.736187
Conclusions: Group A is typically 27.5 units greater than Group B. The 95% credibility interval for the difference between Group A and Group B does not overlap 0, implying a significant difference between the two groups.

We can also get the means and quantiles for each group. Later, we will also add an empirical p-value for the hypothesis that the specified columns of the sample matrix have a mean of zero versus a general multivariate distribution with elliptical contours.

# get the means of each group
newdata = data.frame(x = levels(data$x))
Xmat = model.matrix(~x, newdata)
fit = data.mcmcpack[, 1:2] %*% t(Xmat)
broom:::tidyMCMC(fit, conf.int = TRUE, conf.method = "HPDinterval")
   estimate std.error  conf.low conf.high
1 105.32481 0.3530422 104.63123 106.00445
2  77.83219 0.4348535  76.98842  78.68521
# OR
tab = plyr:::adply(fit, 2, function(x) {
    data.frame(Mean = mean(x), t(quantile(x, c(0.025, 0.25, 0.5, 0.75,
        0.975))))
})
tab
  X1      Mean     X2.5.      X25.      X50.      X75.    X97.5.
1  1 105.32481 104.63721 105.08970 105.32850 105.55960 106.01431
2  2  77.83219  76.98345  77.53885  77.83865  78.12434  78.68148
# OR
tab <- rbind(Group1 = c(mean = mean(data.mcmcpack[, 1]), quantile(data.mcmcpack[,
    1], c(0.025, 0.25, 0.5, 0.75, 0.975))), Group2 = c(mean = mean(data.mcmcpack[,
    1] + data.mcmcpack[, 2]), quantile(data.mcmcpack[, 1] + data.mcmcpack[,
    2], c(0.025, 0.25, 0.5, 0.75, 0.975))))
tab
            mean      2.5%       25%       50%       75%     97.5%
Group1 105.32481 104.63721 105.08970 105.32850 105.55960 106.01431
Group2  77.83219  76.98345  77.53885  77.83865  78.12434  78.68148

Whilst workers are becoming comfortable with a new statistical framework, it is only natural that they evaluate and comprehend new structures and output alongside more familiar concepts. One way to facilitate this is via Bayesian p-values, which are somewhat analogous to the frequentist p-values for investigating the hypothesis that the two populations are identical (t=0).

library(coda)
mcmcpvalue <- function(samp) {
    ## elementary version that creates an empirical p-value for the
    ## hypothesis that the columns of samp have mean zero versus a general
    ## multivariate distribution with elliptical contours.

    ## differences from the mean standardized by the observed
    ## variance-covariance factor

    ## Note, I put in the bit for single terms
    if (length(dim(samp)) == 0) {
        std <- backsolve(chol(var(samp)), cbind(0, t(samp)) - mean(samp),
            transpose = TRUE)
        sqdist <- colSums(std * std)
        sum(sqdist[-1] > sqdist[1])/length(samp)
    } else {
        std <- backsolve(chol(var(samp)), cbind(0, t(samp)) - colMeans(samp),
            transpose = TRUE)
        sqdist <- colSums(std * std)
        sum(sqdist[-1] > sqdist[1])/nrow(samp)
    }

}
mcmcpvalue(data.mcmcpack[, 2])
[1] 0

With an empirical p-value of 0 (zero lies entirely outside the posterior distribution of the effect), we would reject the frequentist null hypothesis of no difference between the two population means.

print(data.r2jags)
Inference for Bugs model at "../downloads/BUGSscripts/ttestModelR2JAGS.txt", fit using jags,
 3 chains, each with 16667 iterations (first 2000 discarded), n.thin = 10
 n.sims = 4401 iterations saved
               mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
Group.means[1] 105.311   0.352 104.621 105.077 105.313 105.547 105.998 1.002  2100
Group.means[2]  77.825   0.439  76.933  77.531  77.832  78.123  78.652 1.001  4400
alpha          105.311   0.352 104.621 105.077 105.313 105.547 105.998 1.002  2100
beta[1]          0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1
beta[2]        -27.486   0.562 -28.599 -27.868 -27.478 -27.090 -26.422 1.001  4400
sigma            2.743   0.200   2.390   2.603   2.729   2.872   3.179 1.001  4400
deviance       484.117   2.504 481.259 482.278 483.502 485.266 490.547 1.001  2900

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 3.1 and DIC = 487.3
DIC is an estimate of expected predictive error (lower deviance is better).
# OR
library(broom)
tidyMCMC(as.mcmc(data.r2jags), conf.int = TRUE, conf.method = "HPDinterval")
            term   estimate std.error   conf.low conf.high
1          alpha 105.310681 0.3518027 104.581038 105.95220
2        beta[1]   0.000000 0.0000000   0.000000   0.00000
3        beta[2] -27.486090 0.5617990 -28.582930 -26.41388
4       deviance 484.117377 2.5043520 481.048414 488.92694
5 Group.means[1] 105.310681 0.3518027 104.581038 105.95220
6 Group.means[2]  77.824591 0.4393389  76.914285  78.62817
7          sigma   2.743263 0.2000124   2.382881   3.16508
Conclusions: Group A is typically 27.5 units greater than Group B. The 95% credibility interval for the difference between Group A and Group B does not overlap 0, implying a significant difference between the two groups.
summary(data.stan)
$summary
                      mean     se_mean        sd        2.5%         25%         50%         75%
sigma             2.748347 0.004643477 0.2058680    2.383985    2.609234    2.738462    2.880936
beta[1]         105.317175 0.008919223 0.3665387  104.610261  105.066507  105.314636  105.565938
beta[2]         -27.475913 0.013290581 0.5611134  -28.587997  -27.854918  -27.461887  -27.093707
Group_means[1]  105.317175 0.008919223 0.3665387  104.610261  105.066507  105.314636  105.565938
Group_means[2]   77.841262 0.009535972 0.4282323   77.023217   77.551173   77.852741   78.139736
CohensD         -16.607974 0.016649447 0.7007208  -17.924413  -17.082406  -16.601810  -16.154380
lp__           -149.248412 0.030348589 1.2739574 -152.689167 -149.830248 -148.897811 -148.325146
                     97.5%    n_eff      Rhat
sigma             3.193274 1965.582 0.9994900
beta[1]         106.071145 1688.828 1.0003437
beta[2]         -26.387636 1782.434 0.9998428
Group_means[1]  106.071145 1688.828 1.0003437
Group_means[2]   78.640961 2016.642 0.9993708
CohensD         -15.234004 1771.293 0.9997254
lp__           -147.786523 1762.109 1.0003114

$c_summary
, , chains = chain:1

                stats
parameter               mean        sd        2.5%         25%        50%         75%       97.5%
  sigma             2.756837 0.2062624    2.391308    2.610123    2.75993    2.901425    3.196558
  beta[1]         105.307325 0.3840697  104.544380  105.037111  105.31178  105.568104  106.021153
  beta[2]         -27.473363 0.5636199  -28.494840  -27.882508  -27.47164  -27.092728  -26.352808
  Group_means[1]  105.307325 0.3840697  104.544380  105.037111  105.31178  105.568104  106.021153
  Group_means[2]   77.833962 0.4258945   77.026439   77.532553   77.84071   78.114926   78.678875
  CohensD         -16.581819 0.7230484  -17.993491  -17.071949  -16.58011  -16.088151  -15.231593
  lp__           -149.308778 1.2434115 -152.596229 -149.846509 -149.04277 -148.421156 -147.816359

, , chains = chain:2

                stats
parameter               mean        sd        2.5%         25%         50%         75%       97.5%
  sigma             2.743866 0.2016057    2.380798    2.614627    2.729257    2.865171    3.177314
  beta[1]         105.340205 0.3711973  104.620377  105.101656  105.323958  105.588067  106.175873
  beta[2]         -27.498025 0.5453297  -28.586943  -27.842411  -27.486752  -27.121462  -26.476353
  Group_means[1]  105.340205 0.3711973  104.620377  105.101656  105.323958  105.588067  106.175873
  Group_means[2]   77.842180 0.4228195   77.040150   77.555705   77.852154   78.147909   78.630306
  CohensD         -16.633250 0.6786964  -17.876431  -17.092905  -16.615442  -16.209320  -15.255389
  lp__           -149.228310 1.3085387 -152.760994 -149.831285 -148.831683 -148.284840 -147.755450

, , chains = chain:3

                stats
parameter               mean        sd        2.5%         25%         50%         75%       97.5%
  sigma             2.744337 0.2096680    2.389003    2.597667    2.734977    2.870101    3.218381
  beta[1]         105.303997 0.3425076  104.667958  105.071350  105.302902  105.528407  105.994129
  beta[2]         -27.456352 0.5739812  -28.673590  -27.835452  -27.424885  -27.063945  -26.356115
  Group_means[1]  105.303997 0.3425076  104.667958  105.071350  105.302902  105.528407  105.994129
  Group_means[2]   77.847644 0.4363265   76.983453   77.562021   77.866951   78.153475   78.637085
  CohensD         -16.608852 0.6997046  -17.898063  -17.069318  -16.608418  -16.164300  -15.231721
  lp__           -149.208149 1.2685478 -152.553299 -149.818403 -148.854260 -148.304893 -147.791214
# OR
library(broom)
tidyMCMC(data.stan, conf.int = TRUE, conf.method = "HPDinterval", rhat = TRUE, ess = TRUE)
            term   estimate std.error   conf.low  conf.high      rhat  ess
1          sigma   2.748347 0.2058680   2.369463   3.168854 0.9994900 1966
2        beta[1] 105.317175 0.3665387 104.578122 106.014422 1.0003437 1689
3        beta[2] -27.475913 0.5611134 -28.575252 -26.374951 0.9998428 1782
4 Group_means[1] 105.317175 0.3665387 104.578122 106.014422 1.0003437 1689
5 Group_means[2]  77.841262 0.4282323  77.012057  78.631688 0.9993708 2017
6        CohensD -16.607974 0.7007208 -17.915138 -15.231072 0.9997254 1771
Conclusions: Group A is typically 27.5 units greater than Group B. The 95% credibility interval for the difference between Group A and Group B does not overlap 0, implying a significant difference between the two groups.
summary(data.rstanarm)
Model Info:

 function: 
 family:    gaussian [identity]
 formula:   y ~ x
 algorithm: sampling
 priors:    see help('prior_summary')
 sample:    2700 (posterior sample size)
 num obs:   100

Estimates:
                mean   sd     2.5%   25%    50%    75%    97.5%
(Intercept)    105.3    0.4  104.6  105.1  105.3  105.6  106.0 
xB             -27.5    0.6  -28.6  -27.9  -27.5  -27.1  -26.4 
sigma            2.8    0.2    2.4    2.6    2.7    2.9    3.2 
mean_PPD        94.3    0.4   93.6   94.1   94.3   94.6   95.1 
log-posterior -257.5    1.3 -260.7 -258.1 -257.2 -256.6 -256.1 

Diagnostics:
              mcse Rhat n_eff
(Intercept)   0.0  1.0  2458 
xB            0.0  1.0  2700 
sigma         0.0  1.0   690 
mean_PPD      0.0  1.0  2520 
log-posterior 0.0  1.0   984 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
# OR
library(broom)
tidyMCMC(data.rstanarm$stanfit, conf.int = TRUE, conf.method = "HPDinterval",
    rhat = TRUE, ess = TRUE)
           term    estimate std.error    conf.low   conf.high     rhat  ess
1   (Intercept)  105.326338 0.3555535  104.659397  106.022743 1.000113 2458
2            xB  -27.484064 0.5670145  -28.619208  -26.373617 1.001115 2700
3         sigma    2.751499 0.2029175    2.365682    3.141828 1.003115  690
4      mean_PPD   94.333290 0.3833119   93.635312   95.165967 1.000227 2520
5 log-posterior -257.508526 1.2547126 -260.061202 -255.970248 1.004789  984
Conclusions: Group A is typically 27.5 units greater than Group B. The Bayesian uncertainty interval indicates that, having seen the data, we believe there is a 0.95 probability that the effect of the treatment is between -28.62 and -26.37. Furthermore, we can say that the probability of an effect is essentially 1.

Alternatively we can generate posterior intervals for each parameter.

posterior_interval(data.rstanarm, prob = 0.95)
                  2.5%      97.5%
(Intercept) 104.625383 105.998683
xB          -28.618485 -26.373933
sigma         2.381654   3.172944
Conclusions: the 95% credibility interval for the difference between Group A and Group B does not overlap 0, implying a significant difference between the two groups.

An alternative way of quantifying the impact of a predictor is to compare models with and without the predictor. In a Bayesian context, this can be achieved by comparing the leave-one-out cross-validation statistics. Leave-one-out (LOO) cross-validation explores how well a series of models can predict withheld values (Vehtari, Gelman, and Gabry, 2016b). The LOO Information Criterion (LOOIC) is analogous to the AIC, except that the LOOIC takes priors into consideration, does not assume that the posterior distribution is drawn from a multivariate normal, and integrates over parameter uncertainty so as to yield a distribution of LOOIC rather than just a point estimate. The LOOIC does however assume that all observations are equally influential (it does not matter which observations are left out). This assumption can be examined via the Pareto k estimates (values greater than 0.5, or more conservatively 0.75, are considered overly influential).

(full = loo(data.rstanarm))
Computed from 2700 by 100 log-likelihood matrix

         Estimate   SE
elpd_loo   -243.6  7.1
p_loo         3.0  0.5
looic       487.3 14.3

All Pareto k estimates are good (k < 0.5)
See help('pareto-k-diagnostic') for details.
(reduced = loo(update(data.rstanarm, formula = . ~ 1)))
 Elapsed Time: 0.018229 seconds (Warm-up)
               0.073999 seconds (Sampling)
               0.092228 seconds (Total)


 Elapsed Time: 0.020266 seconds (Warm-up)
               0.064622 seconds (Sampling)
               0.084888 seconds (Total)


 Elapsed Time: 0.016048 seconds (Warm-up)
               0.064694 seconds (Sampling)
               0.080742 seconds (Total)
Computed from 2700 by 100 log-likelihood matrix

         Estimate  SE
elpd_loo   -405.5 2.8
p_loo         1.2 0.1
looic       811.0 5.5

All Pareto k estimates are good (k < 0.5)
See help('pareto-k-diagnostic') for details.
par(mfrow = 1:2, mar = c(5, 3.8, 1, 0) + 0.1, las = 3)
plot(full, label_points = TRUE)
plot(reduced, label_points = TRUE)
plot of chunk RSTANARMloo
compare_models(full, reduced)
elpd_diff        se 
   -161.8       7.6 
Conclusions: the difference in expected out-of-sample predictive accuracy is substantially lower than 0 suggesting that the full model (model containing the predictor) is better at predicting $y$ than the null model.
summary(data.brms)
 Family: gaussian (identity) 
Formula: y ~ x 
   Data: data (Number of observations: 100) 
Samples: 3 chains, each with iter = 2000; warmup = 200; thin = 2; 
         total post-warmup samples = 2700
   WAIC: Not computed
 
Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept   105.34      0.36   104.64   106.03       2444    1
xB          -27.50      0.57   -28.65   -26.40       2435    1

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     2.75      0.19     2.39     3.18       2515    1

Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
is a crude measure of effective sample size, and Rhat is the potential 
scale reduction factor on split chains (at convergence, Rhat = 1).
# OR
library(broom)
tidyMCMC(data.brms$fit, conf.int = TRUE, conf.method = "HPDinterval", rhat = TRUE,
    ess = TRUE)
         term   estimate std.error   conf.low  conf.high      rhat  ess
1 b_Intercept 105.343811 0.3569517 104.629829 106.010549 0.9991800 2444
2        b_xB -27.503759 0.5672394 -28.687165 -26.453145 0.9992955 2435
3       sigma   2.748028 0.1948715   2.354592   3.111595 1.0009022 2515
Conclusions: Group A is typically 27.5 units greater than Group B. The 95% credibility interval for the difference between Group A and Group B does not overlap 0, implying a significant difference between the two groups.
(full = loo(data.brms))
  LOOIC    SE
 487.19 14.29
(reduced = loo(update(data.brms, formula = . ~ 1)))
(Stan sampling progress output for the refitted intercept-only model omitted: two passes of three chains, each chain running 2000 iterations with the first 1000 as warmup and completing in well under a second.)
  LOOIC   SE
 810.88 5.52
par(mfrow = 1:2, mar = c(5, 3.8, 1, 0) + 0.1, las = 3)
plot(full, label_points = TRUE)
plot(reduced, label_points = TRUE)
plot of chunk BRMSloo

Graphical summaries

A nice graphic is often a great accompaniment to a statistical analysis. Although there are no fixed assumptions associated with graphing (in contrast to statistical analyses), we often want the graphical summaries to reflect the associated statistical analyses. After all, the sample is just one perspective on the population(s). What we are more interested in is being able to estimate and depict likely population parameters/trends.

Thus, whilst we could easily provide a plot displaying the raw data along with simple measures of location and spread, arguably, we should use estimates that reflect the fitted model. In this case, it would be appropriate to plot the credibility interval associated with each group.

mcmc = data.mcmcpack
## Calculate the fitted values
newdata = data.frame(x = levels(data$x))
Xmat = model.matrix(~x, newdata)
coefs = mcmc[, 1:2]
fit = coefs %*% t(Xmat)
newdata = newdata %>% cbind(tidyMCMC(fit, conf.int = TRUE, conf.method = "HPDinterval"))
newdata
  x  estimate std.error  conf.low conf.high
1 A 105.32481 0.3530422 104.63123 106.00445
2 B  77.83219 0.4348535  76.98842  78.68521
ggplot(newdata, aes(y = estimate, x = x)) + geom_pointrange(aes(ymin = conf.low,
    ymax = conf.high)) + scale_y_continuous("Y") + scale_x_discrete("X") +
    theme_classic()
plot of chunk MCMCpackGraphicalSummaries

If you wanted to represent the sample data on the figure in such a simple example (single predictor), you could simply over- (or under-) lay the raw data.

ggplot(newdata, aes(y = estimate, x = x)) + geom_point(data = data, aes(y = y,
    x = x), color = "gray") + geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
    scale_y_continuous("Y") + scale_x_discrete("X") + theme_classic()
plot of chunk MCMCpackGraphicalSummaries1

A more general solution would be to add the partial residuals to the figure. Partial residuals are the fitted values plus the residuals. In this simple case, they are identical to the raw observations, since $resid = obs - fitted$ and the fitted values depend only on the single predictor we are interested in.

## Calculate partial residuals fitted values
fdata = rdata = data
fMat = rMat = model.matrix(~x, fdata)
fit = as.vector(apply(coefs, 2, median) %*% t(fMat))
resid = as.vector(data$y - apply(coefs, 2, median) %*% t(rMat))
rdata = rdata %>% mutate(partial.resid = resid + fit)
ggplot(newdata, aes(y = estimate, x = x)) + geom_point(data = rdata, aes(y = partial.resid),
    color = "gray") + geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
    scale_y_continuous("Y") + scale_x_discrete("X") + theme_classic()
plot of chunk MCMCpackGraphicalSummaries2
mcmc = data.r2jags$BUGSoutput$sims.matrix
## Calculate the fitted values
newdata = data.frame(x = levels(data$x))
Xmat = model.matrix(~x, newdata)
coefs = mcmc[, c("alpha", "beta[2]")]
fit = coefs %*% t(Xmat)
newdata = newdata %>% cbind(tidyMCMC(fit, conf.int = TRUE, conf.method = "HPDinterval"))
newdata
  x  estimate std.error  conf.low conf.high
1 A 105.31068 0.3518027 104.58104 105.95220
2 B  77.82459 0.4393389  76.91429  78.62817
ggplot(newdata, aes(y = estimate, x = x)) + geom_pointrange(aes(ymin = conf.low,
    ymax = conf.high)) + scale_y_continuous("Y") + scale_x_discrete("X") +
    theme_classic()
plot of chunk R2JAGSGraphicalSummaries

If you wanted to represent the sample data on the figure in such a simple example (single predictor), you could simply over- (or under-) lay the raw data.

ggplot(newdata, aes(y = estimate, x = x)) + geom_point(data = data, aes(y = y,
    x = x), color = "gray") + geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
    scale_y_continuous("Y") + scale_x_discrete("X") + theme_classic()
plot of chunk R2JAGSGraphicalSummaries1

A more general solution would be to add the partial residuals to the figure. Partial residuals are the fitted values plus the residuals. In this simple case, they are identical to the raw observations, since $resid = obs - fitted$ and the fitted values depend only on the single predictor we are interested in.

## Calculate partial residuals fitted values
fdata = rdata = data
fMat = rMat = model.matrix(~x, fdata)
fit = as.vector(apply(coefs, 2, median) %*% t(fMat))
resid = as.vector(data$y - apply(coefs, 2, median) %*% t(rMat))
rdata = rdata %>% mutate(partial.resid = resid + fit)
ggplot(newdata, aes(y = estimate, x = x)) + geom_point(data = rdata, aes(y = partial.resid),
    color = "gray") + geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
    scale_y_continuous("Y") + scale_x_discrete("X") + theme_classic()
plot of chunk R2JAGSGraphicalSummaries2
mcmc = as.matrix(data.stan)
## Calculate the fitted values
newdata = data.frame(x = levels(data$x))
Xmat = model.matrix(~x, newdata)
coefs = mcmc[, c("beta[1]", "beta[2]")]
fit = coefs %*% t(Xmat)
newdata = newdata %>% cbind(tidyMCMC(fit, conf.int = TRUE, conf.method = "HPDinterval"))
newdata
  x  estimate std.error  conf.low conf.high
1 A 105.31718 0.3665387 104.57812 106.01442
2 B  77.84126 0.4282323  77.01206  78.63169
ggplot(newdata, aes(y = estimate, x = x)) + geom_pointrange(aes(ymin = conf.low,
    ymax = conf.high)) + scale_y_continuous("Y") + scale_x_discrete("X") +
    theme_classic()
plot of chunk RSTANGraphicalSummaries

If you wanted to represent the sample data on the figure in such a simple example (single predictor), you could simply over- (or under-) lay the raw data.

ggplot(newdata, aes(y = estimate, x = x)) + geom_point(data = data, aes(y = y,
    x = x), color = "gray") + geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
    scale_y_continuous("Y") + scale_x_discrete("X") + theme_classic()
plot of chunk RSTANGraphicalSummaries1

A more general solution would be to add the partial residuals to the figure. Partial residuals are the fitted values plus the residuals. In this simple case, they are identical to the raw observations, since $resid = obs - fitted$ and the fitted values depend only on the single predictor we are interested in.

## Calculate partial residuals fitted values
fdata = rdata = data
fMat = rMat = model.matrix(~x, fdata)
fit = as.vector(apply(coefs, 2, median) %*% t(fMat))
resid = as.vector(data$y - apply(coefs, 2, median) %*% t(rMat))
rdata = rdata %>% mutate(partial.resid = resid + fit)
ggplot(newdata, aes(y = estimate, x = x)) + geom_point(data = rdata, aes(y = partial.resid),
    color = "gray") + geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
    scale_y_continuous("Y") + scale_x_discrete("X") + theme_classic()
plot of chunk RSTANGraphicalSummaries2

Although we could calculate the fitted values via matrix multiplication of the coefficients and the model matrix (as for MCMCpack, R2jags and RSTAN), for more complex models it is more convenient to use the posterior_predict function that comes with rstantools.

## Calculate the fitted values
newdata = data.frame(x = levels(data$x))
fit = posterior_predict(data.rstanarm, newdata = newdata)
newdata = newdata %>% cbind(tidyMCMC(as.mcmc(fit), conf.int = TRUE,
    conf.method = "HPDinterval"))
newdata
  x term  estimate std.error conf.low conf.high
1 A var1 105.31865  2.743785 99.93958 110.67908
2 B var2  77.86514  2.804853 72.85838  83.67873
ggplot(newdata, aes(y = estimate, x = x)) + geom_pointrange(aes(ymin = conf.low,
    ymax = conf.high)) + scale_y_continuous("Y") + scale_x_discrete("X") +
    theme_classic()
plot of chunk RSTANARMGraphicalSummaries

If you wanted to represent the sample data on the figure in such a simple example (single predictor), you could simply over- (or under-) lay the raw data.

ggplot(newdata, aes(y = estimate, x = x)) + geom_point(data = data, aes(y = y,
    x = x), color = "gray") + geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
    scale_y_continuous("Y") + scale_x_discrete("X") + theme_classic()
plot of chunk RSTANARMGraphicalSummaries1

A more general solution would be to add the partial residuals to the figure. Partial residuals are the fitted values plus the residuals. In this simple case, they are identical to the raw observations, since $resid = obs - fitted$ and the fitted values depend only on the single predictor we are interested in.

## Calculate partial residuals fitted values
fdata = rdata = data
pp = posterior_predict(data.rstanarm, newdata = fdata)
fit = as.vector(apply(pp, 2, median))
pp = posterior_predict(data.rstanarm, newdata = rdata)
resid = as.vector(data$y - apply(pp, 2, median))
rdata = rdata %>% mutate(partial.resid = resid + fit)
ggplot(newdata, aes(y = estimate, x = x)) + geom_point(data = rdata, aes(y = partial.resid),
    color = "gray") + geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
    scale_y_continuous("Y") + scale_x_discrete("X") + theme_classic()
plot of chunk RSTANARMGraphicalSummaries2

Although we could calculate the fitted values via matrix multiplication of the coefficients and the model matrix (as for MCMCpack, R2jags and RSTAN), for more complex models it is more convenient to use the posterior_predict function that comes with rstantools.

## Calculate the fitted values
newdata = data.frame(x = levels(data$x))
fit = posterior_predict(data.brms, newdata = newdata)
newdata = newdata %>% cbind(tidyMCMC(as.mcmc(fit), conf.int = TRUE,
    conf.method = "HPDinterval"))
newdata
  x term  estimate std.error conf.low conf.high
1 A var1 105.33190  2.734539 99.87153 110.72338
2 B var2  77.86874  2.787099 72.41702  83.26321
ggplot(newdata, aes(y = estimate, x = x)) + geom_pointrange(aes(ymin = conf.low,
    ymax = conf.high)) + scale_y_continuous("Y") + scale_x_discrete("X") +
    theme_classic()
plot of chunk BRMSGraphicalSummaries

If you wanted to represent the sample data on the figure in such a simple example (single predictor), you could simply over- (or under-) lay the raw data.

ggplot(newdata, aes(y = estimate, x = x)) + geom_point(data = data, aes(y = y,
    x = x), color = "gray") + geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
    scale_y_continuous("Y") + scale_x_discrete("X") + theme_classic()
plot of chunk BRMSGraphicalSummaries1

A more general solution would be to add the partial residuals to the figure. Partial residuals are the fitted values plus the residuals. In this simple case, they are identical to the raw observations, since $resid = obs - fitted$ and the fitted values depend only on the single predictor we are interested in.

## Calculate partial residuals fitted values
fdata = rdata = data
pp = posterior_predict(data.brms, newdata = fdata)
fit = as.vector(apply(pp, 2, median))
pp = posterior_predict(data.brms, newdata = rdata)
resid = as.vector(data$y - apply(pp, 2, median))
rdata = rdata %>% mutate(partial.resid = resid + fit)
ggplot(newdata, aes(y = estimate, x = x)) + geom_point(data = rdata, aes(y = partial.resid),
    color = "gray") + geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
    scale_y_continuous("Y") + scale_x_discrete("X") + theme_classic()
plot of chunk BRMSGraphicalSummaries2

Effect sizes

In addition to deriving the distributions of the group means, we could make use of the Bayesian framework to derive the distribution of the effect size. There are multiple ways of calculating an effect size; the most common are:

Raw effect size
the difference between two groups (as already calculated)
Cohen's D
the effect size standardized by division with the pooled standard deviation $$ D = (\mu_A - \mu_B)/\sigma $$
Percent effect size
expressing the effect size as a percent of the reference group mean

Notes:

  • Calculating the percent effect size involves division by an estimate of α. The very first sample collected for each parameter (including α) is based on the initial values supplied. If inits=NULL, the jags() function appears to generate initial values from the priors. Recall that in the previous model definition, α was deemed to be distributed as a normal with a mean of 0. Hence, α would initially be assigned a value of 0. Division by zero is of course illegal and thus an error would be thrown. There are two ways to overcome this:
    1. modify the prior such that it has a mean close to zero (so that the first α sample is not zero), yet not actually zero (such as 0.0001). There is always a danger, however, that a subsequent sample in the chain will be zero (although this is highly unlikely).
    2. define initial values that are based on the observed data (and not zero). This is the method used here - see the initial values used.

mcmc = data.mcmcpack
## Cohen's D
cohenD = mcmc[, 2]/sqrt(mcmc[, 3])
tidyMCMC(as.mcmc(cohenD), conf.int = TRUE, conf.method = "HPDinterval")
  term  estimate std.error  conf.low conf.high
1 var1 -10.11969 0.7535643 -11.60226 -8.661371
# Percentage change (relative to Group A)
ES = 100 * mcmc[, 2]/mcmc[, 1]
# Probability that the effect is greater than 10% (a decline of >10%)
sum(-1 * ES > 10)/length(ES)
[1] 1
	 modelString=" 
 model { 
 #Likelihood
 for (i in 1:n) {
 y[i]~dnorm(mu[i],tau)
 mu[i] <- alpha+beta[x[i]]
 }
 
 #Priors
 alpha ~ dnorm(0.001,1.0E-6)
 beta[1] <- 0
 beta[2] ~ dnorm(0,1.0E-6)
 
 tau <- 1 / (sigma * sigma)
 sigma~dunif(0,100)
 
 #Other Derived parameters
 Group.means <-alpha+beta  # Group means
 
 cohenD <- beta[2]/sigma
 ES <- 100*beta[2]/alpha
 p10 <- step((-1*ES) - 10)
 }
 "
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 100
   Unobserved stochastic nodes: 3
   Total graph size: 233

Initializing model
Inference for Bugs model at "../downloads/BUGSscripts/R2JAGSEffectSize.txt", fit using jags,
 3 chains, each with 16667 iterations (first 2000 discarded), n.thin = 10
 n.sims = 4401 iterations saved
               mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
ES             -26.097   0.478 -27.060 -26.415 -26.089 -25.782 -25.176 1.001  4400
Group.means[1] 105.318   0.346 104.642 105.086 105.312 105.550 106.009 1.001  4400
Group.means[2]  77.832   0.428  76.984  77.546  77.835  78.118  78.663 1.001  4400
alpha          105.318   0.346 104.642 105.086 105.312 105.550 106.009 1.001  4400
beta[1]          0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1
beta[2]        -27.486   0.556 -28.619 -27.854 -27.480 -27.116 -26.411 1.001  4400
cohenD         -10.069   0.748 -11.552 -10.581 -10.056  -9.555  -8.640 1.001  4000
p10              1.000   0.000   1.000   1.000   1.000   1.000   1.000 1.000     1
sigma            2.744   0.199   2.387   2.603   2.731   2.873   3.156 1.001  3100
deviance       484.017   2.442 481.244 482.228 483.345 485.139 490.431 1.001  4400

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 3.0 and DIC = 487.0
DIC is an estimate of expected predictive error (lower deviance is better).
Full code:
         modelString=" 
 model { 
 #Likelihood
 for (i in 1:n) {
 y[i]~dnorm(mu[i],tau)
 mu[i] <- alpha+beta[x[i]]
 }
 
 #Priors
 alpha ~ dnorm(0.001,1.0E-6)
 beta[1] <- 0
 beta[2] ~ dnorm(0,1.0E-6)
 
 tau <- 1 / (sigma * sigma)
 sigma~dunif(0,100)
 
 #Other Derived parameters
 Group.means <-alpha+beta  # Group means
 
 cohenD <- beta[2]/sigma
 ES <- 100*beta[2]/alpha
 p10 <- step((-1*ES) - 10)
 }
 "
         writeLines(modelString,con="../downloads/BUGSscripts/R2JAGSEffectSize.txt")

         data.list <- with(data,
         list(y=y,
         x=xn,n=nrow(data))
         )
         inits <- list(alpha=mean(data$y),
         beta=c(NA,diff(tapply(data$y,data$x,mean))),
         sigma=sd(data$y/2))
         params <- c("alpha","beta","sigma","Group.means","cohenD","ES","p10")
         adaptSteps = 1000
         burnInSteps = 2000
         nChains = 3
         numSavedSteps = 50000
         thinSteps = 1
         nIter = ceiling((numSavedSteps * thinSteps)/nChains)

         data.r2jagsES <- jags(data=data.list,
         inits=list(inits,inits,inits), # since there are three chains
         parameters.to.save=params,
         model.file="../downloads/BUGSscripts/R2JAGSEffectSize.txt",
         n.chains=3,
         n.iter=nIter,
         n.burnin=burnInSteps,
         n.thin=10
         )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 100
   Unobserved stochastic nodes: 3
   Total graph size: 233

Initializing model
         print(data.r2jagsES)
Inference for Bugs model at "../downloads/BUGSscripts/R2JAGSEffectSize.txt", fit using jags,
 3 chains, each with 16667 iterations (first 2000 discarded), n.thin = 10
 n.sims = 4401 iterations saved
               mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
ES             -26.114   0.485 -27.033 -26.449 -26.115 -25.788 -25.181 1.001  4400
Group.means[1] 105.339   0.357 104.640 105.101 105.335 105.577 106.054 1.002  1500
Group.means[2]  77.830   0.438  76.989  77.525  77.838  78.130  78.655 1.001  4400
alpha          105.339   0.357 104.640 105.101 105.335 105.577 106.054 1.002  1500
beta[1]          0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1
beta[2]        -27.509   0.564 -28.584 -27.889 -27.510 -27.128 -26.425 1.001  3500
cohenD         -10.091   0.742 -11.560 -10.585 -10.074  -9.586  -8.619 1.001  3100
p10              1.000   0.000   1.000   1.000   1.000   1.000   1.000 1.000     1
sigma            2.740   0.196   2.394   2.604   2.728   2.861   3.172 1.002  1700
deviance       484.100   2.517 481.242 482.261 483.465 485.211 490.685 1.001  2600

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 3.2 and DIC = 487.3
DIC is an estimate of expected predictive error (lower deviance is better).

The Cohen's D value is -10.09. This value far exceeds (in magnitude) the nominal 'large effect' guidelines outlined by Cohen, and thus we might proclaim the treatment as having a large negative effect. The effect size expressed as a percentage of the Group A mean is -26.11. Hence the treatment was associated with a 26.1% reduction.

mcmc = as.matrix(data.stan)
## Cohen's D
cohenD = mcmc[, "beta[2]"]/mcmc[, "sigma"]
tidyMCMC(as.mcmc(cohenD), conf.int = TRUE, conf.method = "HPDinterval")
  term  estimate std.error  conf.low conf.high
1 var1 -10.05256 0.7697586 -11.60651 -8.615262
# Percentage change (relative to Group A)
ES = 100 * mcmc[, "beta[2]"]/mcmc[, "beta[1]"]
# Probability that the effect is greater than 10% (a decline of >10%)
sum(-1 * ES > 10)/length(ES)
[1] 1
mcmc = as.matrix(data.rstanarm)
## Cohen's D
cohenD = mcmc[, "xB"]/mcmc[, "sigma"]
tidyMCMC(as.mcmc(cohenD), conf.int = TRUE, conf.method = "HPDinterval")
  term  estimate std.error  conf.low conf.high
1 var1 -10.04273 0.7645315 -11.54822 -8.581521
# Percentage change (relative to Group A)
ES = 100 * mcmc[, "xB"]/mcmc[, "(Intercept)"]
# Probability that the effect is greater than 10% (a decline of >10%)
sum(-1 * ES > 10)/length(ES)
[1] 1
mcmc = as.matrix(data.brms)
## Cohen's D
cohenD = mcmc[, "b_xB"]/mcmc[, "sigma"]
tidyMCMC(as.mcmc(cohenD), conf.int = TRUE, conf.method = "HPDinterval")
  term  estimate std.error  conf.low conf.high
1 var1 -10.05772 0.7273054 -11.50552 -8.636652
# Percentage change (relative to Group A)
ES = 100 * mcmc[, "b_xB"]/mcmc[, "b_Intercept"]
# Probability that the effect is greater than 10% (a decline of >10%)
sum(-1 * ES > 10)/length(ES)
[1] 1

Probability Statements

Bayesian statistics provide a natural means to generate probability statements. For example, we could calculate the probability that there is an effect of the treatment. Moreover, we could calculate the probability that the treatment effect exceeds some threshold (which might be based on a measure of ecologically important difference or other compliance guidelines for example).

mcmc = data.mcmcpack
# Percentage change (relative to Group A)
ES = 100 * mcmc[, 2]/mcmc[, 1]
hist(ES)
plot of chunk MCMCpackProbabilityStatement
# Probability that the effect is greater than 10% (a decline of >10%)
sum(-1 * ES > 10)/length(ES)
[1] 1
# Probability that the effect is greater than 25% (a decline of >25%)
sum(-1 * ES > 25)/length(ES)
[1] 0.9866
mcmc = data.r2jags$BUGSoutput$sims.matrix
# Percentage change (relative to Group A)
ES = 100 * mcmc[, "beta[2]"]/mcmc[, "alpha"]
hist(ES)
plot of chunk R2JAGSProbabilityStatement
# Probability that the effect is greater than 10% (a decline of >10%)
sum(-1 * ES > 10)/length(ES)
[1] 1
# Probability that the effect is greater than 25% (a decline of >25%)
sum(-1 * ES > 25)/length(ES)
[1] 0.9895478
	 modelString=" 
 model { 
 #Likelihood
 for (i in 1:n) {
 y[i]~dnorm(mu[i],tau)
 mu[i] <- alpha+beta[x[i]]
 }
 
 #Priors
 alpha ~ dnorm(0.001,1.0E-6)
 beta[1] <- 0
 beta[2] ~ dnorm(0,1.0E-6)
 
 tau <- 1 / (sigma * sigma)
 sigma~dunif(0,100)
 
 #Other Derived parameters
 Group.means <-alpha+beta  # Group means
 
 cohenD <- beta[2]/sigma
 ES <- 100*beta[2]/alpha
 P0 <- 1-step(beta[2])
 P25 <- 1-step(ES+25)
 }
 "

Notes:

  • We have defined two additional probability derivatives, both of which utilize the step function (which returns 1 where a value is greater than or equal to zero and 0 otherwise). Both can also be computed directly from the saved samples, as sketched after this list:
    1. P0 - the probability (mean of 1-step(beta[2])) that the raw effect is less than zero (that is, that there is a decline).
    2. P25 - the probability (mean of 1-step(ES+25)) that the percentage decline exceeds 25% (ES < -25).
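
The same two probabilities can be computed directly from the posterior samples without modifying the model. A minimal sketch using the earlier R2jags fit:

mcmc = data.r2jags$BUGSoutput$sims.matrix
## P0: probability that the raw effect (beta[2]) is less than zero
mean(mcmc[, "beta[2]"] < 0)
## P25: probability that the percentage decline exceeds 25%
ES = 100 * mcmc[, "beta[2]"]/mcmc[, "alpha"]
mean(ES < -25)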

Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 100
   Unobserved stochastic nodes: 3
   Total graph size: 236

Initializing model
Inference for Bugs model at "../downloads/BUGSscripts/R2JAGSProbability.txt", fit using jags,
 3 chains, each with 16667 iterations (first 2000 discarded), n.thin = 10
 n.sims = 4401 iterations saved
               mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
ES             -26.098   0.480 -27.070 -26.420 -26.101 -25.770 -25.134 1.001  2700
Group.means[1] 105.324   0.355 104.621 105.090 105.329 105.564 106.014 1.003  1000
Group.means[2]  77.836   0.440  76.969  77.547  77.839  78.128  78.696 1.001  4400
P0               1.000   0.000   1.000   1.000   1.000   1.000   1.000 1.000     1
P25              0.988   0.108   1.000   1.000   1.000   1.000   1.000 1.023  1800
alpha          105.324   0.355 104.621 105.090 105.329 105.564 106.014 1.003  1000
beta[1]          0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1
beta[2]        -27.488   0.557 -28.596 -27.858 -27.488 -27.113 -26.377 1.002  2000
cohenD         -10.080   0.754 -11.601 -10.583 -10.072  -9.558  -8.648 1.001  3100
sigma            2.741   0.199   2.377   2.603   2.729   2.870   3.153 1.001  4400
deviance       484.138   2.592 481.252 482.265 483.488 485.286 490.695 1.002  4200

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 3.4 and DIC = 487.5
DIC is an estimate of expected predictive error (lower deviance is better).
View full code
         modelString=" 
 model { 
 #Likelihood
 for (i in 1:n) {
 y[i]~dnorm(mu[i],tau)
 mu[i] <- alpha+beta[x[i]]
 }
 
 #Priors
 alpha ~ dnorm(0.001,1.0E-6)
 beta[1] <- 0
 beta[2] ~ dnorm(0,1.0E-6)
 
 tau <- 1 / (sigma * sigma)
 sigma~dunif(0,100)
 
 #Other Derived parameters
 Group.means <-alpha+beta  # Group means
 
 cohenD <- beta[2]/sigma
 ES <- 100*beta[2]/alpha
 P0 <- 1-step(beta[2])
 P25 <- 1-step(ES+25)
 }
 "
         writeLines(modelString,con="../downloads/BUGSscripts/R2JAGSProbability.txt")

         data.list <- with(data,
         list(y=y,
         x=xn,n=nrow(data))
         )
         inits <- list(alpha=mean(data$y),
         beta=c(NA,diff(tapply(data$y,data$x,mean))),
         sigma=sd(data$y/2))
         params <- c("alpha","beta","sigma","Group.means","cohenD","ES","P0","P25")
         adaptSteps = 1000
         burnInSteps = 2000
         nChains = 3
         numSavedSteps = 5000
         thinSteps = 10
         nIter = ceiling((numSavedSteps * thinSteps)/nChains)

         data.r2jagsProb <- jags(data=data.list,
         inits=list(inits,inits,inits), # since there are three chains
         parameters.to.save=params,
         model.file="../downloads/BUGSscripts/R2JAGSProbability.txt",
         n.chains=3,
         n.iter=nIter,
         n.burnin=burnInSteps,
         n.thin=10
         )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 100
   Unobserved stochastic nodes: 3
   Total graph size: 236

Initializing model
         print(data.r2jagsProb)
Inference for Bugs model at "../downloads/BUGSscripts/R2JAGSProbability.txt", fit using jags,
 3 chains, each with 16667 iterations (first 2000 discarded), n.thin = 10
 n.sims = 4401 iterations saved
               mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
ES             -26.100   0.479 -27.026 -26.422 -26.111 -25.769 -25.164 1.001  4400
Group.means[1] 105.323   0.357 104.629 105.074 105.325 105.565 106.015 1.001  4400
Group.means[2]  77.833   0.433  76.996  77.549  77.829  78.121  78.661 1.001  4400
P0               1.000   0.000   1.000   1.000   1.000   1.000   1.000 1.000     1
P25              0.989   0.103   1.000   1.000   1.000   1.000   1.000 1.017  2600
alpha          105.323   0.357 104.629 105.074 105.325 105.565 106.015 1.001  4400
beta[1]          0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1
beta[2]        -27.490   0.558 -28.583 -27.860 -27.495 -27.107 -26.399 1.001  4400
cohenD         -10.094   0.763 -11.601 -10.598 -10.100  -9.565  -8.557 1.001  4400
sigma            2.738   0.200   2.382   2.595   2.726   2.864   3.171 1.001  4400
deviance       484.116   2.565 481.237 482.300 483.442 485.203 490.783 1.004  1800

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 3.3 and DIC = 487.4
DIC is an estimate of expected predictive error (lower deviance is better).
mcmc = as.matrix(data.stan)
# Percentage change (relative to Group A)
ES = 100 * mcmc[, "beta[2]"]/mcmc[, "beta[1]"]
hist(ES)
plot of chunk RSTANProbabilityStatement
# Probability that the effect is greater than 10% (a decline of >10%)
sum(-1 * ES > 10)/length(ES)
[1] 1
# Probability that the effect is greater than 25% (a decline of >25%)
sum(-1 * ES > 25)/length(ES)
[1] 0.988
mcmc = as.matrix(data.rstanarm)
# Percentage change (relative to Group A)
ES = 100 * mcmc[, "xB"]/mcmc[, "(Intercept)"]
hist(ES)
plot of chunk RSTANARMProbabilityStatement
# Probability that the effect is greater than 10% (a decline of >10%)
sum(-1 * ES > 10)/length(ES)
[1] 1
# Probability that the effect is greater than 25% (a decline of >25%)
sum(-1 * ES > 25)/length(ES)
[1] 0.9848148
mcmc = as.matrix(data.brms)
# Percentage change (relative to Group A)
ES = 100 * mcmc[, "b_xB"]/mcmc[, "b_Intercept"]
hist(ES)
plot of chunk BRMSProbabilityStatement
# Probability that the effect is greater than 10% (a decline of >10%)
sum(-1 * ES > 10)/length(ES)
[1] 1
# Probability that the effect is greater than 25% (a decline of >25%)
sum(-1 * ES > 25)/length(ES)
[1] 0.9896296

Finite Population standard deviations

It is often useful to be able to estimate the relative amount of variability associated with each predictor (or term) in a model. This can provide a sort of relative importance measure for each predictor.

In frequentist statistics, such measures are only available for so-called random factors (factors whose observed levels are randomly selected to represent all possible levels rather than to represent specific treatment levels). For such random factors, the collective variances (or standard deviations) of each factor are known as the variance components. Each component can also be expressed as a percentage of the total so as to provide a percentage breakdown of the relative contributions of each scale of sampling.

Frequentist approaches model random factors according to the variance they add to the model, whereas fixed factors are modelled according to their effects (deviations from reference means). The model does not seek to generalize beyond the observed levels of a given fixed factor (such as control vs treatment) and thus it apparently does not make sense to estimate the population variability between levels (which is what variance components estimate).

The notion of 'fixed' and 'random' factors is somewhat arbitrary and does not really have any meaning within a Bayesian context (as all parameters, and thus factors, are considered random). Instead, the spirit of what many consider to be the difference between fixed and random factors can be captured by conceptualizing whether the levels of a factor are drawn from a finite population (in which the observed factor levels are the only ones possible) or a superpopulation (from which the observed factor levels are just a random selection of an infinite number of possible levels).

Hence, variance components could be defined in terms of either finite-population or superpopulation standard deviations. Superpopulation standard deviations have traditionally been used to describe the relative scale of sampling variation (e.g. where is the greatest source of variability: plots, subplots within plots, individual quadrats within subplots, ... or years, months within years, weeks within months, days within weeks, ...) and are most logically applicable to factors that have a relatively large number of levels (such as spatial or temporal sampling units). On the other hand, finite-population standard deviations can be used to explore the relative impact or effect of a set of (fixed) treatments.

  • Calculate the amount of unexplained (residual) variance absorbed by the factor. This is generated by fitting the model with (full model) and without (reduced model) the term and subtracting the residual standard deviation of the full model from that of the reduced model.
    σA = σReducedResid - σFullResid
    This approach works fine for models that only include fixed factors (indeed it is somewhat analogous to the partitioning of variance employed by an ANOVA table), but cannot be used when the model includes random factors.
    View code if you are interested
    data.lmFull <- lm(y ~ x, data)
    data.lmRed <- lm(y ~ 1, data)
    sd.a <- sd(data.lmRed$resid) - sd(data.lmFull$resid)
    sd.resid <- sd(data.lmFull$resid)
    sds <- c(sd.a, sd.resid)
    100 * sds/sum(sds)
    
    [1] 80.47535 19.52465
    
However, options are somewhat limited if we want to estimate the relative impacts of a mixture of 'fixed' and 'random' terms. For example, we may wish to explore the relative importance of a treatment compared to the spatial and/or temporal sampling heterogeneity.

The Bayesian framework provides a relatively simple way to generate both finite-population and superpopulation standard deviation estimates for all factors.

finite-population
for each Markov Chain Monte Carlo sample, the standard deviation across the parameters associated with a factor (e.g. β1 and β2 in the effects parameterization model) provides a natural estimate of the variability between group levels (and thus the finite-population standard deviation).
superpopulation
the mechanism of defining priors also provides a mechanism for calculating superpopulation standard deviations. Recall that in the cell means model, the prior for α specifies that each of the α values is drawn from a normal distribution with a particular mean and a certain level of precision (reciprocal of variance). We could further parameterize this prior into an estimable mean and precision via hyperpriors
αi ~ N(μ, τ)
μ ~ N(0, 1.0E-6)
τ ~ gamma(0.001, 0.001)
Since the normal distribution in the first line above represents the distribution from which the (infinite) population means are drawn, σ (obtained from the precision as σ = 1/√τ) provides a direct measure of the variability of the population from which the means are drawn.
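
In JAGS syntax, such a hyperprior structure for the cell means model might be sketched as follows (illustrative only; the names mu, tau.alpha and sigma.alpha are introduced here):

 #Priors (hyperpriors on the group means)
 for (j in 1:2) {
 alpha[j] ~ dnorm(mu, tau.alpha)
 }
 mu ~ dnorm(0, 1.0E-6)
 tau.alpha <- 1/(sigma.alpha*sigma.alpha)
 sigma.alpha ~ dunif(0,100)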

When the number of levels of a factor is large, the finite-population and superpopulation standard deviation point estimates will be very similar. However, the finite-population estimates will of course be more precise (and thus, less varied). At the other extreme, when the number of factor levels is small (such as two levels), the finite-population estimate will be very precise whereas the superpopulation standard deviation estimate will be very imprecise (highly varied). For this reason, if the purpose of estimating standard deviations is to compare the relative contributions of various predictors (some of which have small numbers of levels and others large), then it is best to use finite-population standard deviation estimates.

mcmc = data.mcmcpack
sd.x = apply(cbind(0, mcmc[, 2]), 1, sd)
# generate a model matrix
newdata = data.frame(x = data$x)
Xmat = model.matrix(~x, newdata)
## extract the posterior samples of the coefficients
coefs = mcmc[, 1:2]
fit = coefs %*% t(Xmat)
resid = sweep(fit, 2, data$y, "-")
sd.resid = apply(resid, 1, sd)
sd.all = cbind(sd.x, sd.resid)
tidyMCMC(sd.all, conf.int = TRUE, conf.method = "HPDinterval")
      term  estimate  std.error  conf.low conf.high
1     sd.x 19.440221 0.39841919 18.638912 20.211950
2 sd.resid  2.708759 0.02044918  2.694594  2.750023
# OR expressed as a percentage
tidyMCMC(100 * sd.all/rowSums(sd.all), conf.int = TRUE, conf.method = "HPDinterval")
      term estimate std.error conf.low conf.high
1     sd.x 87.76638 0.2373261 87.30265  88.03187
2 sd.resid 12.23362 0.2373261 11.96813  12.69735
	 modelString=" 
 model { 
 #Likelihood
 for (i in 1:n) {
 y[i]~dnorm(mu[i],tau)
 mu[i] <- alpha+beta[x[i]]
 resid[i] <- y[i]-mu[i]
 }
 
 #Priors
 alpha ~ dnorm(0,1.0E-6)
 beta[1] <- 0
 beta[2] ~ dnorm(0,tau.a)
 
 tau <- 1 / (sigma * sigma)
 sigma~dunif(0,100)
 
 tau.a <- 1 / (sigma.a * sigma.a)
 sigma.a~dunif(0,100)
 
 #Finite-population standard deviations
 sd.a <- sd(beta)
 sd.resid <- sd(resid)
 }
 "

Notes:

  • The fixed precision constant of 1.0E-6 in the prior for β2 has been replaced by a parameter representing the modelled amount of precision (and thus standard deviation) of the normal distribution.
  • Finite-population standard deviation derivatives are included for the treatment (sd.a) and the residuals (sd.resid).

Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 100
   Unobserved stochastic nodes: 4
   Total graph size: 325

Initializing model
Inference for Bugs model at "../downloads/BUGSscripts/R2JAGSSD.txt", fit using jags,
 3 chains, each with 16667 iterations (first 2000 discarded), n.thin = 10
 n.sims = 4401 iterations saved
         mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
alpha    105.321   0.355 104.589 105.090 105.329 105.559 105.995 1.001  4400
beta[1]    0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1
beta[2]  -27.485   0.564 -28.604 -27.856 -27.494 -27.106 -26.370 1.001  4400
sd.a      19.435   0.399  18.647  19.167  19.441  19.697  20.226 1.001  4400
sd.resid   2.709   0.020   2.695   2.696   2.701   2.714   2.765 1.001  4400
sigma      2.737   0.199   2.385   2.597   2.726   2.863   3.159 1.001  4400
sigma.a   50.834  24.276  14.372  30.183  47.397  70.224  96.824 1.001  4400
deviance 484.139   2.560 481.250 482.296 483.467 485.242 490.693 1.001  4400

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 3.3 and DIC = 487.4
DIC is an estimate of expected predictive error (lower deviance is better).
View full code
         modelString=" 
 model { 
 #Likelihood
 for (i in 1:n) {
 y[i]~dnorm(mu[i],tau)
 mu[i] <- alpha+beta[x[i]]
 resid[i] <- y[i]-mu[i]
 }
 
 #Priors
 alpha ~ dnorm(0,1.0E-6)
 beta[1] <- 0
 beta[2] ~ dnorm(0,tau.a)
 
 tau <- 1 / (sigma * sigma)
 sigma~dunif(0,100)
 
 tau.a <- 1 / (sigma.a * sigma.a)
 sigma.a~dunif(0,100)
 
 #Finite-population standard deviations
 sd.a <- sd(beta)
 sd.resid <- sd(resid)
 }
 "
         writeLines(modelString,con="../downloads/BUGSscripts/R2JAGSSD.txt")

         data.list <- with(data,
         list(y=y,
         x=xn,n=nrow(data))
         )
         inits <- list(alpha=mean(data$y),
         beta=c(NA,diff(tapply(data$y,data$x,mean))),
         sigma=sd(data$y/2))
         params <- c("alpha","beta","sigma","sd.a","sd.resid","sigma.a")
         adaptSteps = 1000
         burnInSteps = 2000
         nChains = 3
         numSavedSteps = 50000
         thinSteps = 1
         nIter = ceiling((numSavedSteps * thinSteps)/nChains)

         data.r2jagsSD <- jags(data=data.list,
         inits=list(inits,inits,inits), # since there are three chains
         parameters.to.save=params,
         model.file="../downloads/BUGSscripts/R2JAGSSD.txt",
         n.chains=3,
         n.iter=nIter,
         n.burnin=burnInSteps,
         n.thin=10
         )
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 100
   Unobserved stochastic nodes: 4
   Total graph size: 325

Initializing model
         print(data.r2jagsSD)
Inference for Bugs model at "../downloads/BUGSscripts/R2JAGSSD.txt", fit using jags,
 3 chains, each with 16667 iterations (first 2000 discarded), n.thin = 10
 n.sims = 4401 iterations saved
         mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
alpha    105.323   0.359 104.605 105.089 105.329 105.563 106.010 1.001  4400
beta[1]    0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1
beta[2]  -27.487   0.552 -28.590 -27.856 -27.483 -27.117 -26.400 1.002  2100
sd.a      19.437   0.390  18.668  19.174  19.433  19.697  20.216 1.002  2100
sd.resid   2.708   0.020   2.695   2.696   2.701   2.712   2.764 1.001  4400
sigma      2.748   0.202   2.387   2.607   2.733   2.877   3.170 1.001  4400
sigma.a   50.644  23.963  14.600  30.646  47.595  68.988  96.614 1.001  4400
deviance 484.136   2.557 481.268 482.276 483.475 485.234 490.524 1.001  4400

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 3.3 and DIC = 487.4
DIC is an estimate of expected predictive error (lower deviance is better).

The between group (finite population) standard deviation is 19.44 whereas the within group standard deviation is 2.71. These equate to approximately 87.8% and 12.2% of the total variability, respectively.

Compared to the finite-population standard deviation, the superpopulation between group standard deviation estimate (σa) is both very large and highly variable (approximately 50.8 ± 24.3). This is to be expected; whilst the finite-population standard deviation represents the degree of variation between the observed levels, the superpopulation standard deviation seeks to estimate the variability of the population from which the group means of the observed levels AND all other possible levels are drawn. There are only two levels from which to estimate this standard deviation and therefore, its value and variability are going to be higher than those pertaining only to the scope of the current data.

Examination of the quantiles for σa suggests that its samples are not distributed normally. Consequently, the mean is not an appropriate measure of its location. We will instead characterize the superpopulation between group and within group standard deviations via their respective medians (approximately 47.5 and 2.7) and as percent medians (approximately 95% and 5%).

The contrast between finite-population and superpopulation standard deviations is also emphasized by the respective estimates for the residuals. The residuals are of course a 'random' factor with a large number of observed levels. It is therefore not surprising that the point estimates of the residual variance components are very similar. However, also notice that the precision of the finite-population standard deviation estimate is substantially higher (lower standard deviation of the standard deviation estimate) than that of the superpopulation estimate.
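
The medians and relative spreads referred to above can be extracted directly from the posterior samples (a sketch, assuming the data.r2jagsSD fit above):

mcmc = data.r2jagsSD$BUGSoutput$sims.matrix
sds = mcmc[, c("sd.a", "sd.resid", "sigma.a", "sigma")]
## posterior medians (more appropriate than means for the skewed sigma.a)
apply(sds, 2, median)
## standard deviation of each posterior (a measure of each estimate's precision)
apply(sds, 2, sd)
## superpopulation standard deviations expressed as percent medians
apply(100 * sds[, c("sigma.a", "sigma")]/rowSums(sds[, c("sigma.a", "sigma")]), 2, median)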

mcmc = as.matrix(data.stan)
sd.x = apply(cbind(0, mcmc[, "beta[2]"]), 1, sd)
# generate a model matrix
newdata = data.frame(x = data$x)
Xmat = model.matrix(~x, newdata)
## extract the posterior samples of the coefficients
coefs = mcmc[, c("beta[1]", "beta[2]")]
fit = coefs %*% t(Xmat)
resid = sweep(fit, 2, data$y, "-")
sd.resid = apply(resid, 1, sd)
sd.all = cbind(sd.x, sd.resid)
tidyMCMC(sd.all, conf.int = TRUE, conf.method = "HPDinterval")
      term estimate  std.error  conf.low conf.high
1     sd.x 19.42840 0.39676711 18.649907 20.205754
2 sd.resid  2.70865 0.01995268  2.694594  2.748776
# OR expressed as a percentage
tidyMCMC(100 * sd.all/rowSums(sd.all), conf.int = TRUE, conf.method = "HPDinterval")
      term estimate std.error conf.low conf.high
1     sd.x 87.76031 0.2357667 87.29794  88.03187
2 sd.resid 12.23969 0.2357667 11.96813  12.70206
mcmc = as.matrix(data.rstanarm)
sd.x = apply(cbind(0, mcmc[, "xB"]), 1, sd)
# generate a model matrix
newdata = data.frame(x = data$x)
Xmat = model.matrix(~x, newdata)
## extract the posterior samples of the coefficients
coefs = mcmc[, c("(Intercept)", "xB")]
fit = coefs %*% t(Xmat)
resid = sweep(fit, 2, data$y, "-")
sd.resid = apply(resid, 1, sd)
sd.all = cbind(sd.x, sd.resid)
tidyMCMC(sd.all, conf.int = TRUE, conf.method = "HPDinterval")
      term  estimate  std.error  conf.low conf.high
1     sd.x 19.434168 0.40093981 18.648963 20.236836
2 sd.resid  2.708937 0.02048505  2.694594  2.751157
# OR expressed as a percentage
tidyMCMC(100 * sd.all/rowSums(sd.all), conf.int = TRUE, conf.method = "HPDinterval")
      term estimate std.error conf.low conf.high
1     sd.x 87.76221 0.2432714 87.28694  88.03187
2 sd.resid 12.23779 0.2432714 11.96813  12.71306
mcmc = as.matrix(data.brms)
sd.x = apply(cbind(0, mcmc[, "b_xB"]), 1, sd)
# generate a model matrix
newdata = data.frame(x = data$x)
Xmat = model.matrix(~x, newdata)
## extract the posterior samples of the coefficients
coefs = mcmc[, c("b_Intercept", "b_xB")]
fit = coefs %*% t(Xmat)
resid = sweep(fit, 2, data$y, "-")
sd.resid = apply(resid, 1, sd)
sd.all = cbind(sd.x, sd.resid)
tidyMCMC(sd.all, conf.int = TRUE, conf.method = "HPDinterval")
      term  estimate  std.error  conf.low conf.high
1     sd.x 19.448094 0.40109882 18.705198 20.284889
2 sd.resid  2.708953 0.02065129  2.694594  2.751716
# OR expressed as a percentage
tidyMCMC(100 * sd.all/rowSums(sd.all), conf.int = TRUE, conf.method = "HPDinterval")
      term estimate std.error conf.low conf.high
1     sd.x 87.76999 0.2340168 87.30861  88.03187
2 sd.resid 12.23001 0.2340168 11.96813  12.69139

Unequally varied populations

set.seed(1)
n1 <- 60  #sample size from population 1
n2 <- 40  #sample size from population 2
mu1 <- 105  #population mean of population 1
mu2 <- 77.5  #population mean of population 2
sigma1 <- 3  #standard deviation of population 1
sigma2 <- 2  #standard deviation of population 2
n <- n1 + n2  #total sample size
y1 <- rnorm(n1, mu1, sigma1)  #population 1 sample
y2 <- rnorm(n2, mu2, sigma2)  #population 2 sample
y <- c(y1, y2)
x <- factor(rep(c("A", "B"), c(n1, n2)))  #categorical listing of the populations
xn <- rep(c(0, 1), c(n1, n2))  #numerical version of the population category
data2 <- data.frame(y, x, xn)  # dataset
head(data2)  #print out the first six rows of the data set
         y x xn
1 103.1206 A  0
2 105.5509 A  0
3 102.4931 A  0
4 109.7858 A  0
5 105.9885 A  0
6 102.5386 A  0
  • Start by defining the model
    yi = α + β*xi + εi
    which in BUGS becomes;
    yi = α + β*xi + εi
    ε1 ∼ N(0,τ1) for xi=0 (population A)
    ε2 ∼ N(0,τ2) for xi=1 (population B)
    (in the BUGS code below, this is implemented via two separate group means, mu1 and mu2, each with its own precision)
             modelString="
     model {
     #Likelihood
     for (i in 1:n1) {
     y1[i]~dnorm(mu1,tau1)
     }
     for (i in 1:n2) {
     y2[i]~dnorm(mu2,tau2)
     }
     
     #Priors
     mu1 ~ dnorm (0,0.001)
     mu2 ~ dnorm(0,0.001)
     tau1 <- 1 / (sigma1 * sigma1)
     sigma1~dunif(0,100)
     tau2 <- 1 / (sigma2 * sigma2)
     sigma2~dunif(0,100)
     
     #Other Derived parameters 
     delta <- mu2 - mu1
     }
     "
             ## write the model to a text file (I suggest you alter the path to somewhere more relevant 
             ## to your system!)  
             writeLines(modelString,con="../downloads/BUGSscripts/ttestModel2.txt")
    
  • Arrange the data as a list (as required by BUGS). Note, all variables must be numeric, therefore we use the numeric version of x. Define the initial values. We are going to initialize three chains so the list must include three elements. We will also define the other properties.
             data2.list <- with(data2,
             list(y1=y[xn==0], y2=y[xn==1], n1=length(y[xn==0]), n2=length(y[xn==1]))
             )
             inits <- list(list(mu1=rnorm(1), mu2=rnorm(1), sigma1=rlnorm(1), sigma2=rlnorm(1)),
             list(mu1=rnorm(1), mu2=rnorm(1), sigma1=rlnorm(1), sigma2=rlnorm(1)),
             list(mu1=rnorm(1), mu2=rnorm(1), sigma1=rlnorm(1), sigma2=rlnorm(1)))
             params <- c("mu1","mu2","delta","sigma1","sigma2")
             adaptSteps = 1000
             burnInSteps = 2000
             nChains = 3
             numSavedSteps = 50000
             thinSteps = 1
             nIter = ceiling((numSavedSteps * thinSteps)/nChains)
    
  • Engage jags
             library(R2jags)
             data2.r2jags <- jags(data=data2.list,
             inits=NULL, #or inits=list(inits,inits,inits) # since there are three chains
             parameters.to.save=params,
             model.file="../downloads/BUGSscripts/ttestModel2.txt",
             n.chains=3,
             n.iter=nIter,
             n.burnin=burnInSteps,
             n.thin=10
             )
    
    Compiling model graph
       Resolving undeclared variables
       Allocating nodes
    Graph information:
       Observed stochastic nodes: 100
       Unobserved stochastic nodes: 4
       Total graph size: 121
    
    Initializing model
    
             print(data2.r2jags)
    
    Inference for Bugs model at "../downloads/BUGSscripts/ttestModel2.txt", fit using jags,
     3 chains, each with 16667 iterations (first 2000 discarded), n.thin = 10
     n.sims = 4401 iterations saved
             mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
    delta    -27.603   0.469 -28.527 -27.919 -27.609 -27.280 -26.685 1.001  4400
    mu1      105.319   0.344 104.632 105.096 105.315 105.546 105.995 1.001  3100
    mu2       77.716   0.315  77.099  77.501  77.717  77.922  78.346 1.003  1100
    sigma1     2.618   0.248   2.188   2.446   2.603   2.768   3.155 1.001  2700
    sigma2     2.009   0.240   1.602   1.839   1.989   2.155   2.536 1.001  3600
    deviance 452.089   3.023 448.393 449.873 451.383 453.449 459.839 1.001  4400
    
    For each parameter, n.eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
    
    DIC info (using the rule, pD = var(deviance)/2)
    pD = 4.6 and DIC = 456.7
    DIC is an estimate of expected predictive error (lower deviance is better).
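
     As with the equal-variance models, probability statements can be generated directly from the posterior samples of delta (a sketch using the data2.r2jags fit above):

             mcmc = data2.r2jags$BUGSoutput$sims.matrix
             ## Percentage change relative to population 1
             ES = 100*mcmc[,"delta"]/mcmc[,"mu1"]
             ## Probability of a decline of more than 25%
             sum(-1*ES > 25)/length(ES)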
    

Alternatively

  • Start by defining the model
    yi = α + β*xi + εi
    which in BUGS becomes;
    yi ~ N(α + β[xi], τ)
    α ∼ N(0,τ1)
    β2 ∼ N(0,τ2)
    Note that here the two additional precisions (τ1 and τ2) are placed on the priors for the location parameters (α and β2) rather than on group-specific residuals.
             modelString=" 
     model {
     #Likelihood
     for (i in 1:n) {
     y[i]~dnorm(mu[i],tau)
     mu[i] <- alpha+beta[x[i]]
     }
     
     #Priors
     alpha ~ dnorm(0,tau1)
     beta[1] <- 0
     beta[2] ~ dnorm(0,tau2)
     tau <- 1 / (sigma * sigma)
     sigma~dunif(0,100)
     tau1 <- 1 / (sigma1 * sigma1)
     sigma1~dunif(0,100)
     tau2 <- 1 / (sigma2 * sigma2)
     sigma2~dunif(0,100)
     }
     "
             ## write the model to a text file (I suggest you alter the path to somewhere more relevant 
             ## to your system!)  
             writeLines(modelString,con="../downloads/BUGSscripts/ttestModel3.txt")
    
  • Arrange the data as a list (as required by BUGS). Note, all variables must be numeric, therefore we use the numeric version of x. Define the initial values. We are going to initialize three chains so the list must include three elements. We will also define the other properties.
    data2.list <- with(data2, list(y = y, x = xn + 1, n = length(y)))
    data2.list
    
    $y
      [1] 103.12064 105.55093 102.49311 109.78584 105.98852 102.53859 106.46229 107.21497 106.72734
     [10] 104.08383 109.53534 106.16953 103.13628  98.35590 108.37479 104.86520 104.95143 107.83151
     [19] 107.46366 106.78170 107.75693 107.34641 105.22369  99.03194 106.85948 104.83161 104.53261
     [28] 100.58774 103.56555 106.25382 109.07604 104.69164 106.16301 104.83858 100.86882 103.75502
     [37] 103.81713 104.82206 108.30008 107.28953 104.50643 104.23991 107.09089 106.66999 102.93373
     [46] 102.87751 106.09375 107.30560 104.66296 107.64332 106.19432 103.16392 106.02336 101.61191
     [55] 109.29907 110.94120 103.89834 101.86760 106.70916 104.59484  82.30324  77.42152  78.87948
     [64]  77.55600  76.01345  77.87758  73.89008  80.43111  77.80651  81.84522  78.45102  76.08011
     [73]  78.72145  75.63180  74.99273  78.08289  76.61342  77.50221  77.64868  76.32096  76.36266
     [82]  77.22964  79.85617  74.45287  78.68789  78.16590  79.62620  76.89163  78.24004  78.03420
     [91]  76.41496  79.91574  79.82081  78.90043  80.67367  78.61697  74.94682  76.35347  75.05077
    [100]  76.55320
    
    $x
      [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
     [48] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
     [95] 2 2 2 2 2 2
    
    $n
    [1] 100
    
    params <- c("alpha", "beta", "sigma", "sigma1", "sigma2")
    adaptSteps = 1000
    burnInSteps = 2000
    nChains = 3
    numSavedSteps = 50000
    thinSteps = 1
    nIter = ceiling((numSavedSteps * thinSteps)/nChains)
    
  • Engage jags
             library(R2jags)
             data2.r2jags2 <- jags(data=data2.list,
             inits=NULL,
             parameters.to.save=params,
             model.file="../downloads/BUGSscripts/ttestModel3.txt",
             n.chains=3,
             n.iter=nIter,
             n.burnin=burnInSteps,
             n.thin=10
             )
    
    Compiling model graph
       Resolving undeclared variables
       Allocating nodes
    Graph information:
       Observed stochastic nodes: 100
       Unobserved stochastic nodes: 5
       Total graph size: 226
    
    Initializing model
    
             print(data2.r2jags2)
    
    Inference for Bugs model at "../downloads/BUGSscripts/ttestModel3.txt", fit using jags,
     3 chains, each with 16667 iterations (first 2000 discarded), n.thin = 10
     n.sims = 4401 iterations saved
             mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
    alpha    105.315   0.312 104.686 105.107 105.317 105.527 105.925 1.001  4400
    beta[1]    0.000   0.000   0.000   0.000   0.000   0.000   0.000 1.000     1
    beta[2]  -27.594   0.492 -28.571 -27.917 -27.591 -27.264 -26.644 1.001  2900
    sigma      2.366   0.172   2.060   2.246   2.357   2.478   2.731 1.001  2800
    sigma1    75.574  15.798  43.272  64.006  77.211  88.848  98.928 1.001  4400
    sigma2    50.494  23.836  14.902  30.379  47.110  69.033  96.020 1.001  4300
    deviance 454.676   2.555 451.774 452.845 454.026 455.828 461.328 1.001  3600
    
    For each parameter, n.eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
    
    DIC info (using the rule, pD = var(deviance)/2)
    pD = 3.3 and DIC = 457.9
    DIC is an estimate of expected predictive error (lower deviance is better).
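
     In this parameterization, sigma1 and sigma2 are therefore the standard deviations of the priors on alpha and beta[2] (hence their large, diffuse posteriors above); the residual variation is still captured by the single sigma. If separate residual variances per group are required within a single model, the group index can instead be applied to the precision of the likelihood. The following is an illustrative sketch only (not fitted here):

             modelString="
      model {
      #Likelihood
      for (i in 1:n) {
      y[i]~dnorm(mu[i],tau[x[i]])  # group-specific residual precision
      mu[i] <- alpha+beta[x[i]]
      }

      #Priors
      alpha ~ dnorm(0,1.0E-6)
      beta[1] <- 0
      beta[2] ~ dnorm(0,1.0E-6)
      for (j in 1:2) {
      tau[j] <- 1 / (sigma[j] * sigma[j])
      sigma[j]~dunif(0,100)
      }
      }
      "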
    





Worked examples

Basic statistics references
  • Kery (2010) - Chpt 7
  • Kruschke (2010) - Chpt 18
  • McCarthy (2007) - Chpt 6

Modelling the two populations - assuming equal variances

Furness & Bryant (1996) studied the energy budgets of breeding northern fulmars (Fulmarus glacialis) in Shetland. As part of their study, they recorded the body mass and metabolic rate of eight male and six female fulmars.

Download Furness data set
Format of furness.csv data files
SEX      METRATE  BODYMASS
MALE     2950     875
FEMALE   1956     765
MALE     2308     780
MALE     2135     790
MALE     1945     788

SEX       Sex of breeding northern fulmars (Fulmarus glacialis)
METRATE   Metabolic rate (hJ/day)
BODYMASS  Body mass (g)
Northern fulmars

Open the furness data file.

Show code
furness <- read.table("../downloads/data/furness.csv", header = T, sep = ",", strip.white = TRUE)
head(furness)
     SEX METRATE BODYMASS
1   Male  2950.0      875
2 Female  1956.1      635
3   Male  2308.7      765
4   Male  2135.6      780
5   Male  1945.6      790
6 Female  1490.5      635
  1. The researchers were interested in testing whether there is a difference in the metabolic rate of male and female breeding northern fulmars. In light of this, list the following:
    1. The biological hypotheses of interest
    2. The biological null hypotheses
    3. The statistical null hypotheses (H0)
  2. The appropriate statistical test for testing the null hypothesis that the means of two independent populations are equal is a t-test

    A t-test is just a very simple linear model: $$y\sim{}sex+\epsilon \hspace{1cm} \epsilon\sim{}N(0,\sigma^2)$$

    Recall that for linear models comprising a categorical variable, we can parameterize the model either as a means parameterization or as an effects parameterization.

    Effects parameterization
    $\mathsf{Metabolic~rate}_i=\alpha+\beta_{j(i)}\times\mathsf{Sex}_i+\epsilon_i~\mathsf{where}~\epsilon\sim{}N(0,\sigma^2)$
    OR
    $\mathsf{Metabolic~rate}_i\sim{}N(\alpha+\beta_{j(i)}\times\mathsf{Sex}_i,\sigma^2)$

    Means parameterization
    $\mathsf{Metabolic~rate}_i=\alpha_{j(i)}+\epsilon_i~\mathsf{where}~\epsilon\sim{}N(0,\sigma^2)$
    OR
    $\mathsf{Metabolic~rate}_i\sim{}N(\alpha_{j(i)},\sigma^2)$

    We need to define the priors for each of the parameters. As we have no prior expectations, we will define vague, non-informative flat priors for each of the parameters. Location parameters (means) will be drawn from very wide normal distributions with a mean of 0 (although it might be more sensible to use a mean equal to the overall data mean for $\alpha$). The standard deviation $\sigma$ will be drawn from a wide uniform distribution (a vague Gamma distribution on the precision $\tau$ is a common alternative).

    Effects parameterization
    $\mathsf{Metabolic~rate}_i\sim{}N(\mu_i,\tau)$
    $\mu_i=\alpha+\beta_{j(i)}\times\mathsf{Sex}_i$
    $\tau = 1/\sigma^2$

    Priors:

    $\alpha\sim{}N(0, 1.0E-6)$
    $\beta_1=0$
    $\beta_2\sim{}N(0, 1.0E-6)$
    $\sigma\sim{}U(0, 100)$

    Means parameterization
    $\mathsf{Metabolic~rate}_i\sim{}N(\mu_i,\tau)$
    $\mu_i=\alpha_{j(i)}$
    $\tau = 1/\sigma^2$

    Priors:

    $\alpha_1\sim{}N(0, 1.0E-6)$
    $\alpha_2\sim{}N(0, 1.0E-6)$
    $\sigma\sim{}U(0, 100)$

    Since most hypothesis tests follow the same basic procedure, confirm that you understand the basic steps of hypothesis tests

  3. In the table below, list the assumptions of a t-test along with how violations of each assumption are diagnosed and/or the risks of violations are minimized.
    AssumptionDiagnostic/Risk Minimization
    I.
    II.
    III.

    So, we wish to investigate whether or not male and female fulmars have the same metabolic rates, and we intend to use a t-test to test the null hypothesis that the population mean metabolic rate of males is equal to the population mean metabolic rate of females. Having identified the important assumptions of a t-test, use the samples to evaluate whether the assumptions are likely to be violated and thus whether a t-test is likely to be reliable.

  4. Is there any evidence that; HINT
    1. The assumption of normality has been violated?
    2. The assumption of homogeneity of variance has been violated?
    Show code
             boxplot(METRATE~SEX, data=furness)
    
    plot of chunk Q1-5
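    A quick numerical check of the homogeneity of variance assumption is to compare the sample variances of the two groups (a simple sketch):

             with(furness, tapply(METRATE, SEX, var))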
  5. Let's fit the model via both effects and means parameterizations.

  6. Define the JAGS model (likelihood function and prior distributions). We also want to incorporate Cohen's D and percent effect size (therefore we need to ensure that the initial values of alpha are not equal to zero, to avoid division-by-zero issues).
    Effects parameterization
    Show code
    modelString = "  
     model {
     #Likelihood
     for (i in 1:n) {
     metrate[i]~dnorm(mu[i],tau)
     mu[i] <- alpha+beta[sex[i]]
     }
     
     #Priors
     alpha ~ dnorm (meanRate,1.0E-6)
     beta[1] <- 0
     beta[2] ~ dnorm(0,1.0E-6)
     tau <- 1 / (sigma * sigma)
     #sigma~dgamma(0.001,0.001)
     sigma~dunif(0,100)
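     # NB: METRATE is on the order of 1000-3000, so this upper bound of 100
     # is quite restrictive; note how the posterior for sigma piles up
     # against this bound in the output below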
     
     #Other Derived parameters 
     # Group means (note, beta is a vector)
     Group.means <-alpha+beta
     
     cohenD <- beta[2]/sigma
     ES <- 100*beta[2]/alpha  
     }
     "
    
    Means parameterization
    Show code
             modelString.means="  
     model {
     #Likelihood 
     for (i in 1:n) {
     metrate[i]~dnorm(mu[i],tau)
     mu[i] <- alpha[sex[i]]
     }
     
     #Priors
     for (j in min(sex):max(sex)) {
     alpha[j] ~ dnorm(0.01,1.0E-6)
     }
     
     tau <- 1 / (sigma * sigma)
     #sigma~dgamma(0.001,0.001)
     sigma~dunif(0,1000)
     
     #Other Derived parameters 
     effect <-alpha[2]-alpha[1]
     
     cohenD <- effect/sigma
     ES <- 100*effect/alpha[1] 
     }
     "
    
    Effects
    Show code
    ## write the model to a text file
    writeLines(modelString, con = "../downloads/BUGSscripts/ws6.2bttestModel.txt")
    
    Means
    Show code
    ## write the model to a text file
    writeLines(modelString.means, con = "../downloads/BUGSscripts/ws6.2bttestModelMeans.txt")
    
  7. Prior to fitting any Bayesian models (via JAGS), we need to make sure that the data are arranged as a list (as required by BUGS) and that all variables are numeric; therefore we use the numeric version of SEX.
    Furthermore, the first level of the categorical variable must be 1.
    Effects
    Show code
    furness$SEX <- factor(furness$SEX, levels = c("Female",
        "Male"))
    furness.list <- with(furness, list(metrate = METRATE,
        meanRate = mean(METRATE), sex = as.numeric(SEX),
        n = nrow(furness)))
    
    Means
    Show code
    furness$SEX <- factor(furness$SEX, levels = c("Female",
        "Male"))
    furness.means.list <- with(furness, list(metrate = METRATE,
        meanRate = mean(METRATE), sex = as.numeric(SEX),
        n = nrow(furness)))
    
  8. Define the initial values (α, β and σ²) for the chain. Reasonable starting points can be gleaned from the data themselves.
    Effects
    Show code
    inits <- list(alpha = mean(furness$METRATE), beta = c(0,
        diff(tapply(furness$METRATE, furness$SEX, mean))),
        sigma = sd(furness$METRATE/2))
    
    Means
    Show code
    inits.means <- list(alpha = tapply(furness$METRATE,
        furness$SEX, mean), sigma = sd(furness$METRATE/2))
    
  9. Define the nodes (parameters and derivatives) to monitor
    Effects
    Show code
    params <- c("alpha", "beta", "sigma", "Group.means", "ES", "cohenD")
    
    Means
    Show code
    params.means <- c("alpha", "effect", "sigma", "ES", "cohenD")
    
  10. Define the chain parameters
    Show code
    adaptSteps = 1000
    burnInSteps = 2000
    nChains = 3
    numSavedSteps = 50000
    thinSteps = 10
    nIter = ceiling((numSavedSteps * thinSteps)/nChains)
    
  11. We are now ready to fit the models in JAGS via R2jags.
    Show code
             library(R2jags)
             #we also need to get .Random.seed created.  Calling any function that generates a random number will do this.
             rnorm(1)
    
    [1] -0.2989186
    
    1. Effects model
      Show code
               furness.r2jags <- jags(data=furness.list,
               inits=NULL, #or inits=list(inits,inits,inits) # since there are three chains
               parameters.to.save=params,
               model.file="../downloads/BUGSscripts/ws6.2bttestModel.txt",
               n.chains=3,
               n.iter=nIter,
               n.burnin=burnInSteps,
               n.thin=thinSteps
               )
      
      Compiling model graph
         Resolving undeclared variables
         Allocating nodes
      Graph information:
         Observed stochastic nodes: 14
         Unobserved stochastic nodes: 3
         Total graph size: 51
      
      Initializing model
      
               print(furness.r2jags)
      
      Inference for Bugs model at "../downloads/BUGSscripts/ws6.2bttestModel.txt", fit using jags,
       3 chains, each with 166667 iterations (first 2000 discarded), n.thin = 10
       n.sims = 49401 iterations saved
                      mu.vect sd.vect     2.5%      25%      50%      75%    97.5%  Rhat n.eff
      ES               21.660   4.736   12.757   18.413   21.530   24.791   31.254 1.001 49000
      Group.means[1] 1286.511  40.717 1207.054 1259.223 1286.458 1314.134 1365.712 1.001 25000
      Group.means[2] 1563.597  35.223 1494.449 1539.965 1563.579 1587.175 1632.716 1.001 32000
      alpha          1286.511  40.717 1207.054 1259.223 1286.458 1314.134 1365.712 1.001 25000
      beta[1]           0.000   0.000    0.000    0.000    0.000    0.000    0.000 1.000     1
      beta[2]         277.086  53.807  171.913  240.786  276.946  313.377  382.239 1.001 49000
      cohenD            2.775   0.539    1.722    2.411    2.774    3.139    3.830 1.001 49000
      sigma            99.843   0.156   99.424   99.782   99.891   99.955   99.996 1.001 27000
      deviance        807.208   2.831  803.694  805.130  806.550  808.599  814.372 1.001 30000
      
      For each parameter, n.eff is a crude measure of effective sample size,
      and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
      
      DIC info (using the rule, pD = var(deviance)/2)
      pD = 4.0 and DIC = 811.2
      DIC is an estimate of expected predictive error (lower deviance is better).
      
    2. Means model
      Show code
               furness.means.r2jags <- jags(data=furness.means.list,
               inits=NULL, #or inits=list(inits,inits,inits) # since there are three chains
               parameters.to.save=params.means,
               model.file="../downloads/BUGSscripts/ws6.2bttestModelMeans.txt",
               n.chains=3,
               n.iter=nIter,
               n.burnin=burnInSteps,
               n.thin=thinSteps
               )
      
      Compiling model graph
         Resolving undeclared variables
         Allocating nodes
      Graph information:
         Observed stochastic nodes: 14
         Unobserved stochastic nodes: 3
         Total graph size: 48
      
      Initializing model
      
               print(furness.means.r2jags)
      
      Inference for Bugs model at "../downloads/BUGSscripts/ws6.2bttestModelMeans.txt", fit using jags,
       3 chains, each with 166667 iterations (first 2000 discarded), n.thin = 10
       n.sims = 49401 iterations saved
                mu.vect sd.vect     2.5%      25%      50%      75%    97.5%  Rhat n.eff
      ES         35.720 224.801  -32.404    1.792   24.219   54.266  168.055 1.073 22000
      alpha[1] 1170.526 301.514  562.231  974.244 1174.069 1372.845 1750.250 1.001 49000
      alpha[2] 1455.614 264.059  919.551 1285.024 1460.998 1630.656 1965.889 1.001 49000
      cohenD      0.383   0.522   -0.639    0.032    0.382    0.736    1.407 1.001 49000
      effect    285.088 398.697 -508.748   23.816  284.545  546.708 1069.671 1.001 49000
      sigma     762.970 120.525  539.011  672.752  760.227  856.166  979.664 1.001 33000
      deviance  225.417   2.315  222.625  223.737  224.901  226.473  231.397 1.001 49000
      
      For each parameter, n.eff is a crude measure of effective sample size,
      and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
      
      DIC info (using the rule, pD = var(deviance)/2)
      pD = 2.7 and DIC = 228.1
      DIC is an estimate of expected predictive error (lower deviance is better).
      
  12. Of course, prior to looking in depth at the summaries of the posteriors, we should explore the various chain mixing and autocorrelation diagnostics.
    1. Effects model
      Show code
               plot(as.mcmc(furness.r2jags), ask=FALSE)
      
      plot of chunk ws6.2bR2jagsPlots
      plot of chunk ws6.2bR2jagsPlots
      plot of chunk ws6.2bR2jagsPlots
      OR if you want to select specific parameters...
               plot(as.mcmc(furness.r2jags)[,c('alpha','beta[2]','sigma')])
      
      plot of chunk ws6.2bR2jagsPlots1
               autocorr.diag(as.mcmc(furness.r2jags))
      
                      alpha beta[1]       beta[2]        cohenD     deviance            ES Group.means[1]
      Lag 0    1.0000000000     NaN  1.0000000000  1.0000000000  1.000000000  1.0000000000   1.0000000000
      Lag 10  -0.0013698147     NaN  0.0016507538  0.0017345695  0.009979833  0.0006827348  -0.0013698147
      Lag 50   0.0002405321     NaN -0.0027941013 -0.0027822307 -0.001217127 -0.0024960973   0.0002405321
      Lag 100  0.0072343864     NaN -0.0004367811 -0.0004374597 -0.011009084  0.0004630122   0.0072343864
      Lag 500  0.0029755878     NaN  0.0046309719  0.0046543436 -0.005332836  0.0045717913   0.0029755878
              Group.means[2]        sigma
      Lag 0     1.0000000000  1.000000000
      Lag 10    0.0063195341  0.028161165
      Lag 50   -0.0025919739 -0.004029523
      Lag 100  -0.0006259034 -0.008691303
      Lag 500   0.0041661800 -0.001140906
      
               #autocorr.plot(as.mcmc(furness.r2jags))
      
    2. Means model
      Show code
               plot(as.mcmc(furness.means.r2jags), ask=FALSE)
      
      plot of chunk ws6.2bR2jagsPlotsMeans
      plot of chunk ws6.2bR2jagsPlotsMeans
      OR if you want to select specific parameters...
               plot(as.mcmc(furness.means.r2jags)[,c('alpha[1]','alpha[2]','sigma')])
      
      plot of chunk ws6.2bR2jagsPlotsMeans1
               autocorr.diag(as.mcmc(furness.means.r2jags))
      
                  alpha[1]     alpha[2]        cohenD     deviance        effect            ES
      Lag 0    1.000000000  1.000000000  1.0000000000  1.000000000  1.0000000000  1.000000e+00
      Lag 10  -0.001246130  0.002661141 -0.0014899232 -0.001312693 -0.0015190450  1.973619e-03
      Lag 50   0.002867322 -0.000510074 -0.0015067203  0.002405813  0.0007252822  4.386509e-05
      Lag 100 -0.002650157  0.006490269 -0.0004420897 -0.004323199  0.0029550855  1.289820e-03
      Lag 500  0.003693660  0.001730131  0.0051974203 -0.002832293  0.0043317249 -1.327230e-03
                      sigma
      Lag 0    1.0000000000
      Lag 10  -0.0003725095
      Lag 50   0.0058792745
      Lag 100  0.0026821734
      Lag 500  0.0040908885
      
               #autocorr.plot(as.mcmc(furness.means.r2jags)) 
      
  13. Now a summary plot would be nice!
    Show code
             library(ggplot2)
             library(plyr)
             coefs <- furness.r2jags$BUGSoutput$sims.matrix[,c('alpha','beta[2]')]
             Xmat <- model.matrix(~SEX,data.frame(SEX=levels(furness$SEX)))
             fit <- coefs %*% t(Xmat)
             fit <- adply(fit, 2, function(x) {
             data.frame(fit=median(x), HPDinterval(as.mcmc(x)))
             })
             fit <- cbind(SEX=levels(furness$SEX), fit)
             ggplot(fit, aes(y=fit, x=SEX)) +
             geom_point(data=furness, aes(y=METRATE, x=SEX), color='grey')+
             geom_pointrange(aes(ymin=lower, ymax=upper))+
             scale_y_continuous(expression(Metabolic~rate~(HJ/day)))+
             scale_x_discrete('')+
             theme_classic()+theme(axis.title.y=element_text(vjust=2, size=rel(1.25), face='bold'))
    
    plot of chunk ws6.2bR2jagsGgplot

Modelling the two populations - not assuming equal variances

In the previous question we fitted the model assuming that the two populations were equally varied. However, there is some evidence of unequal variance and we might want to allow for this in the modelling. One way to do this is to fit separate models for each population (one for the females and one for the males).

  1. Model the metabolic rate of males and females as separate models:
    Show code
             modelString="   
     model {
     #Likelihood
     for (i in 1:nF) {
     metRateF[i]~dnorm(female.mu,female.tau)
     }
     for (i in 1:nM) {
     metRateM[i]~dnorm(male.mu,male.tau)
     }
     
     #Priors
     female.mu ~ dnorm(meanRate,1.0E-6)
     male.mu ~ dnorm(meanRate,1.0E-6)
     female.tau <- 1 / (female.sigma * female.sigma)
     female.sigma~dunif(0,100)
     male.tau <- 1 / (male.sigma * male.sigma)
     male.sigma~dunif(0,100)
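      # NB: as in the effects model above, these U(0,100) upper bounds are
      # narrow for data on this scale (both sigma posteriors sit near 100
      # in the output below)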
     
     #Other Derived parameters 
     delta <- male.mu - female.mu
     }
     "
    
             writeLines(modelString,con="../downloads/BUGSscripts/ws6.2bSeparateVar.txt")
    
             furness$SEXn <- as.numeric(factor(furness$SEX, levels=c("Female","Male")))
             furness.SV.list <- with(furness,
             list(metRateF=METRATE[SEXn==1], metRateM=METRATE[SEXn==2],
             nF=length(METRATE[SEXn==1]), nM=length(METRATE[SEXn==2]),
             meanRate=mean(METRATE))
             )
    
             params <- c("female.mu","male.mu","delta","female.sigma","male.sigma")
    
             adaptSteps = 1000
             burnInSteps = 2000
             nChains = 3
             numSavedSteps = 50000
             thinSteps = 10
             nIter = ceiling((numSavedSteps * thinSteps)/nChains)
    
             library(R2jags)
             furness.SV.r2jags <- jags(data=furness.SV.list,
             inits=NULL, #or inits=list(inits,inits,inits) # since there are three chains
             parameters.to.save=params,
             model.file="../downloads/BUGSscripts/ws6.2bSeparateVar.txt",
             n.chains=3,
             n.iter=nIter,
             n.burnin=burnInSteps,
             n.thin=thinSteps
             )
    
    Compiling model graph
       Resolving undeclared variables
       Allocating nodes
    Graph information:
       Observed stochastic nodes: 14
       Unobserved stochastic nodes: 4
       Total graph size: 34
    
    Initializing model
    
             print(furness.SV.r2jags)
    
    Inference for Bugs model at "../downloads/BUGSscripts/ws6.2bSeparateVar.txt", fit using jags,
     3 chains, each with 166667 iterations (first 2000 discarded), n.thin = 10
     n.sims = 49401 iterations saved
                  mu.vect sd.vect     2.5%      25%      50%      75%    97.5%  Rhat n.eff
    delta         277.789  53.364  173.782  241.453  277.850  314.119  382.733 1.001 26000
    female.mu    1285.831  40.277 1207.206 1258.673 1285.822 1313.039 1365.347 1.001 49000
    female.sigma   98.879   1.088   95.952   98.436   99.210   99.665   99.970 1.001 49000
    male.mu      1563.620  35.328 1494.021 1539.913 1563.518 1587.538 1632.869 1.001 26000
    male.sigma     99.819   0.181   99.329   99.750   99.874   99.949   99.995 1.001 49000
    deviance      809.135   3.435  804.435  806.604  808.487  810.951  817.579 1.001 49000
    
    For each parameter, n.eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
    
    DIC info (using the rule, pD = var(deviance)/2)
    pD = 5.9 and DIC = 815.0
    DIC is an estimate of expected predictive error (lower deviance is better).
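    
    From this fit, probability statements about the difference follow directly from the posterior samples (a sketch using furness.SV.r2jags):

             mcmc = furness.SV.r2jags$BUGSoutput$sims.matrix
             ## Probability that male metabolic rate exceeds that of females
             mean(mcmc[,"delta"] > 0)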
    

  Frequentist pooled variances t-test

         t.test(y~x, data, var.equal=TRUE)
Two Sample t-test

data:  y by x
t = 49.727, df = 98, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 26.39339 28.58754
sample estimates:
mean in group A mean in group B 
      105.32285        77.83238 
         #OR
         data.lm <- lm(y~x, data)
         summary(data.lm)
Call:
lm(formula = y ~ x, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9669 -1.8080  0.0141  1.7446  6.8725 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 105.3228     0.3496  301.23   <2e-16 ***
xB          -27.4905     0.5528  -49.73   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.708 on 98 degrees of freedom
Multiple R-squared:  0.9619,	Adjusted R-squared:  0.9615 
F-statistic:  2473 on 1 and 98 DF,  p-value: < 2.2e-16

End of instructions
