

Tutorial 7.2a - Simple linear regression

10 Jun 2019

Many biologists and ecologists get a little twitchy and nervous around mathematical and statistical formulae and nomenclature. Whilst it is possible to perform basic statistics without too much regard for the actual equation (model) being employed, as the complexity of the analysis increases, the need to understand the underlying model becomes increasingly important. Consequently, I will always present the linear model formulae along with the analysis. If you start to feel some form of disorder starting to develop, you might like to run through the Tutorials and Workshops twice (the first time ignoring the formulae).


Preliminaries

The following packages will be used in this tutorial:

library(car)
library(effects)
library(emmeans)
library(ggfortify)
library(ggeffects)
library(tidyverse)

Overview

Correlation and regression are techniques used to examine associations and relationships between continuous variables collected on the same set of sampling or experimental units. Specifically, correlation is used to investigate the degree to which variables change or vary together (covary). In correlation, there is no distinction between dependent (response) and independent (predictor) variables and there is no attempt to prescribe or interpret the causality of the association.

For example, there may be an association between arm and leg length in humans, whereby individuals with longer arms generally have longer legs. Neither variable directly causes the change in the other. Rather, they are both influenced by other variables to which they both have similar responses. Hence correlations apply mainly to survey designs where each variable is measured rather than specifically set or manipulated by the investigator.

Regression is used to investigate the nature of a relationship between variables in which the magnitude and changes in one variable (known as the independent or predictor variable) are assumed to be directly responsible for the magnitude and changes in the other variable (dependent or response variable). Regression analyses apply to both survey and experimental designs. Whilst for experimental designs, the direction of causality is established and dictated by the experiment, for surveys the direction of causality is somewhat discretionary and based on prior knowledge.

For example, although it is possible that ambient temperature affects the growth rate of a species of plant, the reverse is not logical. As an example of regression, we could experimentally investigate the relationship between algal cover on rocks and molluscan grazer density by directly manipulating the density of snails in specifically controlled plots and measuring the cover of algae therein. Any established relationship must be driven by snail density, as this was the controlled variable. Alternatively the relationship could be investigated via a field survey in which the density of snails and cover of algae could be measured from random locations across a rock platform. In this case, the direction of causality (or indeed the assumption of causality) may be more difficult to defend.

In addition to examining the strength and significance of a relationship (for which correlation and regression are equivalent), regression analysis also explores the functional nature of the relationship. In particular, it estimates the rate at which a change in an independent variable is reflected in a change in a dependent variable as well as the expected value of the dependent variable when the independent variable is equal to zero. These estimates can be used to construct a predictive model (equation) that relates the magnitude of a dependent variable to the magnitude of an independent variable, and thus permit new responses to be predicted from new values of the independent variable.

Simple linear regression is a linear modelling process that models a continuous response against a single continuous predictor. The linear model is expressed as: $$y_i = \beta_0+ \beta_1 x_i+\varepsilon_i \hspace{1cm}\varepsilon\sim{}\mathcal{N}(0,\sigma^2)$$ where:

  • $y_i$ is the response value for each of the $i$ observations
  • $\beta_0$ is the y-intercept (value of $y$ when $x=0$)
  • $\beta_1$ is the slope (rate of change in $y$ per unit change in $x$)
  • $x_i$ is the predictor value for each of the $i$ observations
  • $\varepsilon_i$ is the residual value of each of the $i$ observations. A residual is the difference between the observed value and the value expected by the model.
  • $\varepsilon\sim\mathcal{N}(0,\sigma^2)$ indicates that the residuals are normally distributed with a constant amount of variance
plot of chunk regression

The parameters of the trendline ($\beta_0, \beta_1$) are determined by Ordinary Least Squares in which the sum of the squared residuals is minimized. A non-zero population slope is indicative of a relationship.

Scenario and Data

Let's say we had set up an experiment in which we applied a continuous treatment ($x$) ranging in magnitude from 1 to 16 to a total of 16 sampling units ($n=16$) and then measured a response ($y$) from each unit. As this section is mainly about the generation of artificial data (and not specifically about what to do with the data), understanding the actual details is optional and can be safely skipped. Consequently, I have folded (toggled) this section away.

Random data incorporating the following trends (effect parameters)
  • the sample size = 16
  • the continuous $x$ variable ranging from 1 to 16
  • when the value of $x$ is 0, $y$ is expected to be 40 ($\beta_0=40$)
  • a 1 unit increase in $x$ is associated with a 1.5 unit decline in $y$ ($\beta_1=-1.5$)
  • the residuals are drawn from a normal distribution with a mean of 0 and a standard deviation of 5 ($\sigma^2=25$)
set.seed(1)
n <- 16
a <- 40  #intercept
b <- -1.5  #slope
sigma2 <- 25  #residual variance (sd=5)
x <- 1:n  #values of the x covariate
eps <- rnorm(n, mean = 0, sd = sqrt(sigma2))  #residuals
y <- a + b * x + eps  #response variable
#OR
y <- (model.matrix(~x) %*% c(a,b))+eps
data <- data.frame(y=round(y,1), x)  #dataset
head(data)  #print out the first six rows of the data set
     y x
1 35.4 1
2 37.9 2
3 31.3 3
4 42.0 4
5 34.1 5
6 26.9 6

With these sorts of data, we are primarily interested in investigating whether there is a relationship between the continuous response variable and the linear predictor (single continuous predictor).

$$y_i = \beta_0+ \beta_1 x_i+\varepsilon_i \hspace{1cm}\varepsilon\sim{}\mathcal{N}(0,\sigma^2)$$ which can be re-written in vector (matrix) notation: $$\mathbf{Y} = \mathbf{X}\times\boldsymbol{\beta}+\boldsymbol{\varepsilon}, \quad\boldsymbol{\varepsilon}\sim{}\mathcal{N}(0,\mathbf{V})$$
$$ \underbrace{\begin{pmatrix} 35.4\\37.9\\31.3\\42.0\\34.1\\26.9\\31.9\\31.7\\29.4\\23.5\\31.1\\23.9\\17.4\\7.9\\23.1\\15.8 \end{pmatrix}}_{\text{Response vector }\mathbf{Y}} = \underbrace{\begin{pmatrix} 1&1\\1&2\\1&3\\1&4\\1&5\\1&6\\1&7\\1&8\\1&9\\1&10\\1&11\\1&12\\1&13\\1&14\\1&15\\1&16 \end{pmatrix}}_{\text{Model matrix }\mathbf{X}} \times \underbrace{\begin{pmatrix} \beta_0\\\beta_1 \end{pmatrix}}_{\text{Parameter vector }\boldsymbol{\beta}} + \underbrace{\begin{pmatrix} \varepsilon_1\\\varepsilon_2\\\varepsilon_3\\\varepsilon_4\\\varepsilon_5\\\varepsilon_6\\\varepsilon_7\\\varepsilon_8\\\varepsilon_9\\\varepsilon_{10}\\\varepsilon_{11}\\\varepsilon_{12}\\\varepsilon_{13}\\\varepsilon_{14}\\\varepsilon_{15}\\\varepsilon_{16} \end{pmatrix}}_{\text{Residual vector }\boldsymbol{\varepsilon}} $$
and $$ \mathbf{V} = \underbrace{\begin{pmatrix} \sigma^2&0&\cdots&0\\ 0&\sigma^2&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\sigma^2 \end{pmatrix}}_{\text{Variance matrix }(16\times 16)} $$

So, the $\boldsymbol{\varepsilon}\sim{}\mathcal{N}(0,\mathbf{V})$ states that the residuals (being the difference between the observed values and those expected/predicted by the model) should be drawn from a normal distribution.
The $\mathbf{V}$ indicates that each of the residuals should be drawn from a distribution with the same variance (all diagonals have the same value of $\sigma^2$) and that each value should be independent (the off-diagonals are all zero).
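To make this matrix notation a little more concrete, here is a minimal sketch (assuming the data frame data generated above) that builds the same components in R. The object names X, beta.true and V are purely illustrative.

## model matrix: a column of 1s and a column of x values
X <- model.matrix(~x, data = data)
head(X)
## the parameter vector used to simulate the data (intercept and slope)
beta.true <- c(40, -1.5)
## a 16x16 variance matrix: sigma^2 on the diagonal, zeros (independence) elsewhere
V <- diag(25, nrow = nrow(data))
## the deterministic part of the model is then simply X %*% beta.true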

Exploratory data analysis and initial assumption checking

Normality

Estimation and inference testing in linear regression assumes that the response is normally distributed in each of the populations. In this case, the populations are all possible measurements that could be collected at each level of $x$ - hence there are 16 populations. Typically however, we only collect a single observation from each population (as is also the case here). How then can we evaluate whether each of these populations is likely to have been normal?

plot of chunk normality
plot of chunk normality1

For a given response, the population distributions should follow much the same distribution shapes. Therefore provided the single samples from each population are unbiased representations of those populations, a boxplot of all observations should reflect the population distributions.

The two figures above show the relationships between the individual population distributions and the overall distribution. The left hand figure shows a distribution drawn from single representatives of each of the 16 populations. Since the 16 individual populations were normally distributed, the distribution of the 16 observations is also normal.

By contrast, the right hand figure shows 16 log-normally distributed populations and the resulting distribution of 16 single observations drawn from these populations. The overall boxplot mirrors each of the individual population distributions.

Homogeneity of variance

Simple linear regression also assumes that each of the populations is equally varied. Actually, it is the prospect of a relationship between the mean and variance of y-values across x-values that is of the greatest concern. Strictly, the assumption is that the distributions of y-values at each x-value are equally varied and that there is no relationship between mean and variance.

However, as we only have a single y-value for each x-value, it is difficult to directly determine whether the assumption of homogeneity of variance is likely to have been violated (the mean of one value is meaningless and variability cannot be assessed from a single value). The figures below depict the ideal (and almost never realised) situation in which (left hand figure) the populations are all equally varied. The middle figure simulates drawing a single observation from each of the populations. When the populations are equally varied, the spread of observed values around the trend line is fairly even - that is, there is no trend in the spread of values along the line.

If we then plot the residuals (difference between observed values and those predicted by the trendline) against the predicted values, there is a definite lack of pattern. This lack of pattern is indicative of a lack of issues with homogeneity of variance.

plot of chunk homogeneity
plot of chunk homogeneity3
plot of chunk homogeneity1

If we now contrast the above to a situation where the population variance is related to the mean (unequal variance), we see that the observations drawn from these populations are not evenly distributed along the trendline (they get more spread out as the mean predicted value increases). This pattern is emphasized in the residual plot, which displays a characteristic "wedge"-shaped pattern.

plot of chunk homogeneity4
plot of chunk homogeneity5
plot of chunk homogeneity6

Hence looking at the spread of values around a trendline on a scatterplot of $y$ against $x$ is a useful way of identifying gross violations of homogeneity of variance. Residual plots provide an even better diagnostic. The presence of a wedge shape is indicative that the population mean and variance are related.

Linearity

Linear regression fits a straight (linear) line through the data. Therefore, prior to fitting such a model, it is necessary to establish whether this really is the most sensible way of describing the relationship. That is, does the relationship appear to be linear, or could some other non-linear function describe the relationship better? Scatterplots and residual plots are useful diagnostics.

Explore assumptions

The assumptions are:
  1. All of the observations are independent - this must be addressed at the design and collection stages
  2. The response variable (and thus the residuals) should be normally distributed
  3. The response variable should be equally varied (variance should not be related to mean as these are supposed to be estimated separately)
  4. The relationship between the response and the linear predictor (right hand side of the regression formula) should be linear. A scatterplot with smoother can be useful for identifying possible non-linearity.

So let's explore normality, homogeneity of variance and linearity by constructing a scatterplot of the relationship between the response ($y$) and the predictor ($x$). We will also include a range of smoothers (linear and lowess) and marginal boxplots on the scatterplot to assist in exploring linearity and normality respectively.

#scatterplot
library(car)
scatterplot(y~x, data)
plot of chunk tut7.2aaS1.2
library(ggplot2)
ggplot(data, aes(y=y, x=x)) + geom_point() + geom_smooth()
plot of chunk tut7.2aaS1.2b
ggplot(data, aes(y=y)) + geom_boxplot(aes(x=1))
plot of chunk tut7.2aaS1.2b

Conclusions:

  • there is no evidence that the response variable is non-normal
  • the spread of values around the trendline seems fairly even (hence there is no evidence of non-homogeneity of variance)
  • the data seem well represented by the linear trendline. Furthermore, the lowess smoother does not suggest any consistent departure from linearity.
Obvious violations could be addressed by:
  • consider a non-linear linear predictor (such as a polynomial, spline or other non-linear function)
  • transform the scale of the response variables (to address normality etc)
  • consider modelling against an alternative distribution (such as Poisson in the case of counts, Binomial in the case of presence/absence, etc)
  • if normality seems reasonable, yet there is evidence of unequal variance, consider applying a variance function as weights (see below).

Model fitting or statistical analysis

In R, simple linear models are fit using the lm() function. The most important (=commonly used) parameters/arguments for simple regression are:

  • formula: the linear model relating the response variable to the linear predictor
  • data: the data frame containing the data

data.lm <- lm(y~x, data)
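As an aside, the Ordinary Least Squares estimates described earlier can be reproduced directly from the normal equations, $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y}$. The following is just a sketch to verify that lm() is doing what we expect (it assumes the data and data.lm objects from above).

X <- model.matrix(~x, data = data)
## solve the normal equations by hand
solve(t(X) %*% X) %*% t(X) %*% data$y
## compare with the estimates returned by lm()
coef(data.lm)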

Model evaluation

Prior to exploring the model parameters, it is prudent to confirm that the model did indeed fit the assumptions and was an appropriate fit to the data.

As part of linear model fitting, a suite of diagnostic measures can be calculated, each of which provides an indication of the appropriateness of the model for the data and of the influence (and outlyingness) of each point on the resulting model.

Leverage

Leverage is a measure of how much of an outlier each point is in x-space (on x-axis) and thus only applies to the predictor variable. Values greater than $2\times p/n$ (where p=number of model parameters ($p = 2$ for simple linear regression), and $n$ is the number of observations) should be investigated as potential outliers.
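For example (a minimal sketch using the data.lm model fitted above), the leverage (hat) values can be extracted with hatvalues() and compared against the $2\times p/n$ threshold.

h <- hatvalues(data.lm)
p <- length(coef(data.lm))  # number of model parameters (2 here)
n <- nrow(data)             # number of observations (16 here)
h[h > 2 * p/n]              # observations exceeding the threshold warrant a closer look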

Residuals

As the residuals are the differences between the observed and predicted values along a vertical plane, they provide a measure of how much of an outlier each point is in y-space (on y-axis). Outliers are identified by relatively large residual values. Residuals can also be standardized and studentized. Standardized residuals are the residuals divided by the standard deviation (they are standardized by the variability of the data) and are useful for exploring the impacts of various variance functions. Studentized residuals can be compared across different models and follow a t distribution, enabling the probability of obtaining a given residual to be determined. The patterns of residuals against predicted y values (residual plot) are also useful diagnostic tools for investigating linearity and homogeneity of variance assumptions.
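As a quick sketch, the three flavours of residuals can be extracted from the fitted model (data.lm from above) for comparison.

cbind(raw = resid(data.lm),
      standardized = rstandard(data.lm),
      studentized = rstudent(data.lm))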

Cook's D

Cook's D statistic is a measure of the influence of each point on the fitted model (estimated slope) and incorporates both leverage and residuals. Values > 1 (or even approaching 1) correspond to highly influential observations.
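Again as a brief sketch, Cook's D values can be extracted with cooks.distance() and checked against the guideline above.

d <- cooks.distance(data.lm)
round(d, 3)
any(d > 1)  # TRUE would flag highly influential observations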

Extractor and moderator functions

There are a number of extractor functions (functions that extract or derive specific information from a model) available including:
  • residuals(): Extracts the residuals from the model
  • rstandard(): Extracts the standardized residuals from the model
  • rstudent(): Extracts the studentized residuals from the model
  • fitted(): Extracts the predicted (expected) response values (on the link scale) at the observed levels of the linear predictor
  • predict(): Extracts the predicted (expected) response values (on either the link, response or terms (linear predictor) scale) as well as standard errors or confidence intervals
  • coef(): Extracts the model coefficients
  • confint(): Calculates confidence intervals for the model coefficients
  • summary(): Summarizes the important output and characteristics of the model
  • anova(): Computes an analysis of variance (variance partitioning) from the model
  • plot(): Generates a series of diagnostic plots from the model
  • influence.measures(): Calculates a range of regression diagnostics including leverage and Cook's D
  • effect(): effects package - estimates the marginal (partial) effects of a factor (useful for plotting)
  • avPlot(): car package - generates partial regression plots
  • AIC(): Extracts the Akaike's Information Criterion

Let's explore the diagnostics - particularly the residuals

plot(data.lm)
plot of chunk tut7.2aS5.1 (four diagnostic plots)
library(ggfortify)
autoplot(data.lm, which=1:6)
plot of chunk tut7.2aS5.1a
Conclusions:
  • there are no obvious patterns in the residuals (first plot), or at least there are no obvious trends remaining that would be indicative of non-linearity or non-homogeneity of variance
  • The data do not appear to deviate substantially from normality according to the Q-Q normal plot
  • None of the observations are considered overly influential (Cook's D values not approaching 1).

Exploring the model parameters, test hypotheses

If there was any evidence that the assumptions had been violated or the model was not an appropriate fit, then we would need to reconsider the model and start the process again. In this case, there is no evidence that the test will be unreliable so we can proceed to explore the test statistics. The main statistic of interest is the t-statistic ($t$) for the slope parameter.

summary(data.lm)
Call:
lm(formula = y ~ x, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.3694  -3.5053   0.6239   2.6522   7.3909 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  40.7450     2.6718  15.250 4.09e-10 ***
x            -1.5340     0.2763  -5.552 7.13e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.095 on 14 degrees of freedom
Multiple R-squared:  0.6876,	Adjusted R-squared:  0.6653 
F-statistic: 30.82 on 1 and 14 DF,  p-value: 7.134e-05

Conclusions:

  • We would reject the null hypothesis ($p<0.05$). An increase in x is associated with a significant linear decline (negative slope) in $y$. Every 1 unit increase in $x$ results in a 1.53 unit decrease in $y$.
  • The $R^2$ value is 0.69 suggesting that approximately 69% of the total variance in $y$ can be explained by its linear relationship with $x$. The $R^2$ value is thus a measure of the strength of the relationship.
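As a small illustrative aside (not part of the summary output above), the same $R^2$ value can be recovered from the variance partitioning produced by anova(), since $R^2$ is the regression sum of squares divided by the total sum of squares.

ss <- anova(data.lm)[["Sum Sq"]]  # sums of squares for x and for the residuals
ss[1]/sum(ss)                     # proportion of total variation explained (R-squared)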

Alternatively, we could calculate the 95% confidence intervals for the parameter estimates.

confint(data.lm)
                2.5 %     97.5 %
(Intercept) 35.014624 46.4753756
x           -2.126592 -0.9413493

Finally, arguably a more convenient summary is provided by the tidy() function within the broom package. This will yield a tibble in a standardized format that can easily be tabulated along with other results.

library(broom)
tidy(data.lm, conf.int=TRUE)
# A tibble: 2 x 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    40.7      2.67      15.3  4.09e-10    35.0     46.5  
2 x              -1.53     0.276     -5.55 7.13e- 5    -2.13    -0.941

Predictions

One of the strengths of a good model is that it can be used to generate predictions - expected responses to new levels of a predictor. As an example, I will predict new values of y corresponding to x values of 9.5, 15.1 and 1.135. Along with the predicted values, we will generate the upper and lower limits of the prediction interval. The standard errors on which these prediction intervals are based incorporate both the uncertainty in the predicted mean and the residual variance ($\sigma^2$) of the fitted model. In so doing, it is assumed that new observations will occur with a similar variance to that present in the fitted model.

Prediction intervals represent the interval within which we are (nominally) 95% confident that the true value of a single new observation will lie.

predict(data.lm, newdata=data.frame(x=c(9.5,15.1, 1.135)), interval="prediction")
       fit       lwr      upr
1 26.17228 14.892997 37.45156
2 17.58204  5.658568 29.50552
3 39.00394 26.924157 51.08373

Confidence intervals (rather than prediction intervals) for predicted values can alternatively be produced. These represent the interval in which we are (nominally) 95% confident that the true value of a mean of multiple observations will be contained. These are based on $\pm$ $Q$ standard errors of the predicted value, where $Q$ is the upper quantile of a t-distribution for the appropriate degrees of freedom for the fitted model.

predict(data.lm, newdata=data.frame(x=c(9.5,15.1,1.135)), interval="confidence")
       fit      lwr      upr
1 26.17228 23.37689 28.96767
2 17.58204 12.81117 22.35292
3 39.00394 33.85484 44.15304
Note, when making predictions from models for which the response variable was transformed, it is necessary to generate the prediction intervals on the modeled scale prior to back-transforming.
predict(data.lm, newdata=data.frame(x=c(9.5,15.1,1.135)), se=TRUE, interval="confidence")
$fit
       fit      lwr      upr
1 26.17228 23.37689 28.96767
2 17.58204 12.81117 22.35292
3 39.00394 33.85484 44.15304

$se.fit
       1        2        3 
1.303341 2.224407 2.400751 

$df
[1] 14

$residual.scale
[1] 5.094864
predict(data.lm, newdata=data.frame(x=c(9.5,15.1,1.135)), se=TRUE, interval="prediction")
$fit
       fit       lwr      upr
1 26.17228 14.892997 37.45156
2 17.58204  5.658568 29.50552
3 39.00394 26.924157 51.08373

$se.fit
       1        2        3 
1.303341 2.224407 2.400751 

$df
[1] 14

$residual.scale
[1] 5.094864
predict(data.lm, newdata=data.frame(x=c(9.5,15.1,1.135)), se=TRUE, interval="prediction")$se.fit*2
       1        2        3 
2.606683 4.448813 4.801503 

Note, this provides the predicted values (fit) and standard errors (se.fit) as separate items of a list. Often it is more convenient to have the predicted values and confidence intervals presented as a data frame.

predict(data.lm, newdata=data.frame(x=c(9.5,15.1,1.135)), interval="confidence")
       fit      lwr      upr
1 26.17228 23.37689 28.96767
2 17.58204 12.81117 22.35292
3 39.00394 33.85484 44.15304
predict(data.lm, newdata=data.frame(x=c(9.5,15.1,1.135)), interval="prediction")
       fit       lwr      upr
1 26.17228 14.892997 37.45156
2 17.58204  5.658568 29.50552
3 39.00394 26.924157 51.08373

The same confidence intervals for predicted values can alternatively be obtained via the emmeans package, which works from a reference grid of predictor values.

library(emmeans)
## establish the prediction (reference) grid
data.grid = ref_grid(data.lm, at=list(x=c(9.5,15.1,1.135)))
confint(data.grid)
     x prediction   SE df lower.CL upper.CL
  9.50       26.2 1.30 14     23.4     29.0
 15.10       17.6 2.22 14     12.8     22.4
  1.14       39.0 2.40 14     33.9     44.2

Confidence level used: 0.95 
## often it is better to pipe this through as.data.frame
confint(data.grid) %>% as.data.frame
       x prediction       SE df lower.CL upper.CL
1  9.500   26.17228 1.303341 14 23.37689 28.96767
2 15.100   17.58204 2.224407 14 12.81117 22.35292
3  1.135   39.00394 2.400751 14 33.85484 44.15304
Fitted/predicted values
Predicting new values of Y essentially becomes a reversal of the matrix multiplications we saw earlier. We simply multiply a new model matrix (comprising a column of 1's and a column of new X values - in this case 9.5, 15.1 and 1.1) by the parameter vector (the modelled intercept and slope). $$ \underbrace{\begin{pmatrix} \hat{y}_1\\\hat{y}_2\\\hat{y}_3 \end{pmatrix}}_{\text{Predicted values }\mathbf{\hat{Y}}} = \underbrace{\begin{pmatrix} 1&9.5\\1&15.1\\1&1.1 \end{pmatrix}}_{\text{Model matrix }\mathbf{X}} \times \underbrace{\begin{pmatrix} \beta_0\\\beta_1 \end{pmatrix}}_{\text{Parameter vector }\boldsymbol{\beta}} $$ $$ \underbrace{\begin{pmatrix} \hat{y}_1\\\hat{y}_2\\\hat{y}_3 \end{pmatrix}}_{\text{Predicted values }\mathbf{\hat{Y}}} = \underbrace{\begin{pmatrix} 1&9.5\\1&15.1\\1&1.1 \end{pmatrix}}_{\text{Model matrix }\mathbf{X}} \times \underbrace{\begin{pmatrix} 40.745\\-1.534 \end{pmatrix}}_{\text{Parameter vector }\boldsymbol{\beta}} $$
xmat <- model.matrix(~c(9.5,15.1, 1.1))
(pred<-xmat %*% coef(data.lm))
      [,1]
1 26.17228
2 17.58204
3 39.05763
Confidence intervals
Confidence intervals represent an interval within which we have a certain degree of confidence that the true value of a fitted value (a mean) falls. That is, they relate not to single newly predicted values, but rather to the mean of a set of predicted values. As confidence intervals represent a distance either side of the fitted values, we must start by calculating the standard errors. The standard error calculation therefore incorporates the model's variance-covariance matrix, which represents the degree of variation between and within the parameter estimates. In effect, this is a way of indicating how much slop there is likely to be in the parameter estimates that are used to solve for Y.
(se<-sqrt(diag(xmat %*% vcov(data.lm) %*% t(xmat))))
       1        2        3 
1.303341 2.224407 2.408954 
One standard error effectively represents a confidence interval of approximately 68.3%. Two standard errors would be approximately 95%. Multiplying the standard error by the critical t-value for the appropriate degrees of freedom is a popular way of estimating the 95% confidence intervals around the fitted values.
as.numeric(pred)+outer(se,qt(df=nrow(data)-2,c(.025,.975)))
      [,1]     [,2]
1 23.37689 28.96767
2 12.81117 22.35292
3 33.89094 44.22433
data.sum<-data.frame(pred, as.numeric(pred)+outer(se,qt(df=nrow(data)-2,c(.025,.975))))
colnames(data.sum) <- c("Fitted", "Lower", "Upper")
data.sum
    Fitted    Lower    Upper
1 26.17228 23.37689 28.96767
2 17.58204 12.81117 22.35292
3 39.05763 33.89094 44.22433
Prediction intervals
In contrast to confidence intervals, prediction intervals pertain to confidence in single predicted values (rather than the mean of numerous values). So the standard errors should also incorporate the residual variance of the fitted model ($\sigma^2$) - this is achieved by adding the residual variance to the squared standard errors of the fitted values before taking the square root.
(se<-sqrt(diag(xmat %*% vcov(data.lm) %*% t(xmat))+summary(data.lm)$sigma^2))
       1        2        3 
5.258930 5.559283 5.635664 
data.sum<-data.frame(pred, as.numeric(pred)+outer(se,qt(df=nrow(data)-2,c(.025,.975))))
colnames(data.sum) <- c("Fitted", "Lower", "Upper")
data.sum
    Fitted     Lower    Upper
1 26.17228 14.892997 37.45156
2 17.58204  5.658568 29.50552
3 39.05763 26.970336 51.14493

Graphical Summaries

Partial effects plots

Effects plots display the modeled effects. Whilst these graphs are not aesthetically pleasing, they allow us to quickly and effectively appreciate the modelled trends.

library(effects)
plot(effect("x",data.lm, partial.residuals=TRUE))
plot of chunk tut7.2aS7.1a
#OR a technically more robust approach..
plot(Effect("x",data.lm, partial.residuals=TRUE))
plot of chunk tut7.2aS7.1a
#OR for an older approach..
library(car)
avPlot(data.lm,"x")
plot of chunk tut7.2aS7.1a

Note, technically the points overlaid onto both the above figures are not the raw data (although again, in this simple case, they do equate to the raw data). Rather, they are calculated by adding the residuals onto the fitted values. The rationale behind such a figure will be more obvious when exploring more complex models in later Topics. Furthermore, note that in the plot directly above, both the Y and X axes are centered around zero. Notice also that the axes titles are 'y|others' and 'x|others'. This is to explicitly remind users that this is a partial plot - although in this simple case there are no 'others'.
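To verify the claim above, the plotted points can be recomputed by hand as the fitted values plus the residuals; for this simple single-predictor model they are identical to the observed y values (a quick sketch assuming data.lm and data from above).

partial <- fitted(data.lm) + resid(data.lm)
all.equal(as.numeric(partial), data$y)  # TRUE: the partial residuals equal the raw data here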

plot(ref_grid(data.lm, cov.reduce=FALSE), horizontal=FALSE)
plot of chunk tut7.2aS7.1b
#OR
plot(ref_grid(data.lm, cov.reduce=function(x) x), horizontal=FALSE)
plot of chunk tut7.2aS7.1b
This option uses emmeans in the backend.
ggemmeans(data.lm, ~x) %>% plot
plot of chunk tut7.2aS7.1c
This option uses effects in the backend.
ggpredict(data.lm, ~x) %>% plot
plot of chunk tut7.2aS7.1d
ggpredict(data.lm, ~x) %>% plot(rawdata=TRUE)
plot of chunk tut7.2aS7.1d

Quality graphical summaries

If we wish to construct a more aesthetically pleasing summary graphic, then it is best to construct the plot manually. Predicted values (and associated confidence envelopes) become the basis of the data to plot.

## define the new prediction values (x)
data.list = with(data, list(x=seq(min(x), max(x), len=100)))
data.grid = ref_grid(data.lm, ~x, cov.reduce=FALSE, at=data.list)
newdata=confint(data.grid)

ggplot(newdata, aes(y=prediction, x=x))+
  geom_point(data=data, aes(y=y, x=x), color='grey')+
  geom_ribbon(aes(ymin=lower.CL, ymax=upper.CL), fill='blue',color=NA, alpha=0.2)+
  geom_line()+
  scale_x_continuous('X')+
  scale_y_continuous('Y')+
  theme_classic()+
  theme(axis.title.y=element_text(vjust=2, size=rel(1.5)),
        axis.title.x=element_text(vjust=-1, size=rel(1.5)))
plot of chunk tut7.2aS7.2a
#OR with partial residuals
partial.grid = ref_grid(data.lm, ~x, cov.reduce=FALSE)
partials=summary(partial.grid) %>% mutate(Partial = prediction + resid(data.lm))

ggplot(newdata, aes(y=prediction, x=x))+
  geom_point(data=partials, aes(y=Partial, x=x), color='grey')+
  geom_ribbon(aes(ymin=lower.CL, ymax=upper.CL), fill='blue',color=NA, alpha=0.2)+
  geom_line()+
  scale_x_continuous('X')+
  scale_y_continuous('Y')+
  theme_classic()+
  theme(axis.title.y=element_text(vjust=2, size=rel(1.5)),
        axis.title.x=element_text(vjust=-1, size=rel(1.5)))
plot of chunk tut7.2aS7.2a
ggemmeans(data.lm, ~x) %>% plot +
  scale_x_continuous('X')+
  scale_y_continuous('Y')+
  theme_classic()+
  theme(axis.title.y=element_text(vjust=2, size=rel(1.5)),
        axis.title.x=element_text(vjust=-1, size=rel(1.5)))
plot of chunk tut7.2aS7.2b
ggpredict(data.lm, ~x) %>%
  plot(rawdata=TRUE,show.title=FALSE, ci.style='dash') +
  scale_x_continuous('X')+
  scale_y_continuous('Y')+
  theme_classic()+
  theme(axis.title.y=element_text(vjust=2, size=rel(1.5)),
        axis.title.x=element_text(vjust=-1, size=rel(1.5)))
plot of chunk tut7.2aS7.2c
#1. extract the model coefficients
coefs <- coef(data.lm)
#2. generate a prediction dataframe
dat <- data.frame(x=seq(min(data$x,na.rm=TRUE), max(data$x,na.rm=TRUE), len=100))
#3. calculate predicted Y-values and confidence interval
fit <- cbind(x=dat$x,data.frame(predict(data.lm, newdata=dat, interval='confidence')))
#4. calculate partial values as fitted + residuals
partials <- data.frame(x=data$x, partial=fitted(data.lm)+resid(data.lm))

library(ggplot2)
ggplot(fit, aes(y=fit, x=x))+
  geom_point(data=partials, aes(y=partial, x=x), color='grey')+
  geom_ribbon(aes(ymin=lwr, ymax=upr), fill='blue',color=NA, alpha=0.2)+
  geom_line()+
  scale_x_continuous('X')+
  scale_y_continuous('Y')+
  theme_classic()+
  theme(axis.title.y=element_text(vjust=2, size=rel(1.5)),
        axis.title.x=element_text(vjust=-1, size=rel(1.5)))
plot of chunk tut7.2aS7.2
par(mar = c(4, 5, 0, 0))
plot(y ~ x, data = data, type = "n", ann = F, axes = F)
points(y ~ x, data = data, pch = 16)
xs <- seq(0, 17, l = 1000)
ys <- data.frame(predict(data.lm, newdata = data.frame(x = xs), interval = "confidence"))
points(y~x,data=data, pch=16)
points(ys$fit ~ xs, col = "black", type = "l")
lines(ys$lwr ~ xs, col = "black", type = "l", lty = 2)
lines(ys$upr ~ xs, col = "black", type = "l", lty = 2)
axis(1)
mtext("X", 1, cex = 1.5, line = 3)
axis(2, las = 2)
mtext("Y", 2, cex = 1.5, line = 3)
box(bty = "l")
plot of chunk tut7.2aS7.1

Extending the linear model

This tutorial has at times stressed the importance of exploratory data analysis and model validation diagnostics to ensure that the data conform to the underlying linear modelling assumptions. In case these assumptions escaped your attention, they are: Normality, Homogeneity of Variance, Linearity and Independence. Collectively, these assumptions form a rather rigid and restrictive set of requirements that to many are somewhat unrealistic. Whilst parameter estimates and, to a lesser degree, hypothesis tests are reasonably robust to violations of these assumptions, this is of course a sliding scale.

It turns out that there are modern options for dealing with violations of each of these assumptions. For example, for circumstances when the residuals are unlikely to follow a normal distribution (such as when the response is binary), the linear model can be 'generalized' to accommodate other residual distributions (Generalized Linear Models (GLM)). When the underlying cause of heteroscedasticity pertains to the distribution of the data, then scale transformations or Generalized Linear Models (GLM) can be useful in addressing specific types of unequal variance.
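Purely as an illustrative sketch of that idea (using made-up data, not part of this tutorial's dataset), a binary response can be modelled with glm() and a binomial family rather than lm().

set.seed(2)
xx <- runif(30, 0, 10)                                    # a hypothetical continuous predictor
pres <- rbinom(30, size = 1, prob = plogis(-2 + 0.5*xx))  # a hypothetical presence/absence response
glm(pres ~ xx, family = binomial)                         # generalized linear model with a logit link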

In yet other circumstances, certain types of variance structures can be incorporated directly into the model fitting process thereby potentially resulting in more useful and reflective models. Similarly, the residual variance structure can be constrained or modified so as to deal with specific forms of autocorrelation or lack of independence that arise from sampling designs that incorporate spatial and or temporal components. Approaches for dealing with heterogeneity and autocorrelation will be the focus of Topic 8 whereas other aspects of independence will be the focus of Topic 9.

Although the word 'linear' in linear models refers to the linear combination of terms in the linear predictor (the model is linear in its parameters - for example, no term is raised to the power of a parameter), most linear models are also designed to explore a linear relationship. Topic 11 and Topic 12 will introduce methods for exploring non-linear relationships.




Worked examples

Basic statistics references
  • Logan (2010) - Chpt 8
  • Quinn & Keough (2002) - Chpt 5
Very basic overview of regression

Simple linear regression

Here is an example from Fowler, Cohen and Jarvis (1998). An agriculturalist was interested in the effects of fertilizer load on the yield of grass. Grass seed was sown uniformly over an area and different quantities of commercial fertilizer were applied to each of ten 1 m2 randomly located plots. Two months later the grass from each plot was harvested, dried and weighed. The data are in the file fertilizer.csv.

Download Fertilizer data set
Format of fertilizer.csv data file

FERTILIZER  YIELD
25          84
50          80
75          90
100         154
125         148
...         ...

FERTILIZER  Mass of fertilizer (g.m-2) - Predictor variable
YIELD       Yield of grass (g.m-2) - Response variable

Open the fertilizer data file.
Show code
fert <- read.table('../downloads/data/fertilizer.csv', header=T, sep=',', strip.white=T)
fert
   FERTILIZER YIELD
1          25    84
2          50    80
3          75    90
4         100   154
5         125   148
6         150   169
7         175   206
8         200   244
9         225   212
10        250   248

  1. List the following:
    1. The biological hypothesis of interest
    2. The biological null hypothesis of interest
    3. The statistical null hypothesis of interest
  2. Test the assumptions of simple linear regression using a scatterplot
    of YIELD against FERTILIZER. Add boxplots for each variable to the margins and fit a lowess smoother through the data. Is there any evidence of violations of the simple linear regression assumptions? (Y or N)
    Show base graphics code
    # get the car package
    library(car)
    scatterplot(YIELD~FERTILIZER, data=fert)
    
    plot of chunk Q1-2
    Show ggplot code
        library(tidyverse)
    ggplot(fert, aes(y=YIELD, x=FERTILIZER)) + geom_point() + geom_smooth()
    
    plot of chunk Q1-2.1
    ggplot(fert, aes(y=YIELD)) + geom_boxplot(aes(x=1))
    
    plot of chunk Q1-2.1
  3. If there is no evidence that the assumptions of simple linear regression have been violated, fit the linear model
    YIELD = intercept + (SLOPE * FERTILIZER). At this stage ignore any output.
    Show code
    fert.lm<-lm(YIELD~FERTILIZER, data=fert)
    
  4. Examine the regression diagnostics
    (particularly the residual plot
    ). Does the residual plot indicate any potential problems with the data? (Y or N)
    Show base graphics code
    # setup a 2x3 plotting device with small margins
    par(mfrow = c(2, 3), oma = c(0, 0, 2, 0))
    plot(fert.lm,ask=F,which=1:6)
    
    plot of chunk Q1-3
    influence.measures(fert.lm)
    
    Influence measures of
     lm(formula = YIELD ~ FERTILIZER, data = fert) :
    
          dfb.1_ dfb.FERT   dffit cov.r  cook.d   hat inf
    1   5.39e-01  -0.4561  0.5411 1.713 0.15503 0.345    
    2  -4.15e-01   0.3277 -0.4239 1.497 0.09527 0.248    
    3  -6.01e-01   0.4237 -0.6454 0.969 0.18608 0.176    
    4   3.80e-01  -0.2145  0.4634 1.022 0.10137 0.127    
    5  -5.77e-02   0.0163 -0.0949 1.424 0.00509 0.103    
    6  -2.50e-02  -0.0141 -0.0821 1.432 0.00382 0.103    
    7   2.09e-17   0.1159  0.2503 1.329 0.03374 0.127    
    8  -2.19e-01   0.6184  0.9419 0.623 0.31796 0.176    
    9   3.29e-01  -0.6486 -0.8390 1.022 0.30844 0.248    
    10  1.51e-01  -0.2559 -0.3035 1.900 0.05137 0.345   *
    
    library(effects)
    plot(Effect('FERTILIZER',fert.lm, partial.residuals=TRUE))
    
    plot of chunk Q1-3
    Show autoplot code
    library(ggfortify)
    autoplot(fert.lm, which=1:6)
    
    plot of chunk Q1-3.1
        influence.measures(fert.lm)
    
    Influence measures of
     lm(formula = YIELD ~ FERTILIZER, data = fert) :
    
          dfb.1_ dfb.FERT   dffit cov.r  cook.d   hat inf
    1   5.39e-01  -0.4561  0.5411 1.713 0.15503 0.345    
    2  -4.15e-01   0.3277 -0.4239 1.497 0.09527 0.248    
    3  -6.01e-01   0.4237 -0.6454 0.969 0.18608 0.176    
    4   3.80e-01  -0.2145  0.4634 1.022 0.10137 0.127    
    5  -5.77e-02   0.0163 -0.0949 1.424 0.00509 0.103    
    6  -2.50e-02  -0.0141 -0.0821 1.432 0.00382 0.103    
    7   2.09e-17   0.1159  0.2503 1.329 0.03374 0.127    
    8  -2.19e-01   0.6184  0.9419 0.623 0.31796 0.176    
    9   3.29e-01  -0.6486 -0.8390 1.022 0.30844 0.248    
    10  1.51e-01  -0.2559 -0.3035 1.900 0.05137 0.345   *
    
  5. Prior to exploring the model output, it is often a good idea to produce a quick partial plot. This helps appreciate the patterns contained within the fitted model as well as interpreting any estimated parameters/effects.
    Show effects code
    library(effects)
    plot(Effect('FERTILIZER',fert.lm, partial.residuals=TRUE))
    
    plot of chunk Q1-3a.1
    Show ggpredict code
    library(ggeffects)
    ggpredict(fert.lm, ~FERTILIZER) %>% plot(rawdata=TRUE)
    
    plot of chunk Q1-3a.2
    Show ggemmeans code
    library(ggeffects)
    ggemmeans(fert.lm, ~FERTILIZER) %>% plot
    
    plot of chunk Q1-3a.3
  6. If there is no evidence that any of the assumptions have been violated, examine the regression output
    . Identify and interpret the following;
    Show code
    summary(fert.lm)
    
    Call:
    lm(formula = YIELD ~ FERTILIZER, data = fert)
    
    Residuals:
       Min     1Q Median     3Q    Max 
    -22.79 -11.07  -5.00  12.00  29.79 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 51.93333   12.97904   4.001  0.00394 ** 
    FERTILIZER   0.81139    0.08367   9.697 1.07e-05 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 19 on 8 degrees of freedom
    Multiple R-squared:  0.9216,	Adjusted R-squared:  0.9118 
    F-statistic: 94.04 on 1 and 8 DF,  p-value: 1.067e-05
    
    Show confint code
    confint(fert.lm)
    
                     2.5 %    97.5 %
    (Intercept) 22.0036246 81.863042
    FERTILIZER   0.6184496  1.004338
    
    Show tidy code
    library(broom)
    tidy(fert.lm, conf.int=TRUE)
    
    # A tibble: 2 x 7
      term        estimate std.error statistic   p.value conf.low conf.high
      <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
    1 (Intercept)   51.9     13.0         4.00 0.00394     22.0       81.9 
    2 FERTILIZER     0.811    0.0837      9.70 0.0000107    0.618      1.00
    
    1. sample y-intercept
    2. sample slope
    3. t value for main H0
    4. R2 value
  7. What conclusions (statistical and biological) would you draw from the analysis?
  8. Significant simple linear regression outcomes are usually accompanied by a scatterplot that summarizes the relationship between the two variables. Construct a scatterplot
    without marginal boxplots.
    Show base graphics code
    #setup the graphics device with wider margins
    opar <- par(mar=c(5,5,0,0))
    #construct the base plot without axes or labels or even points
    plot(YIELD~FERTILIZER, data=fert, ann=F, axes=F, type="n")
    # put the points on as solid circles (point character 16)
    points(YIELD~FERTILIZER, data=fert, pch=16)
    # calculate predicted values
    xs <- seq(min(fert$FERTILIZER), max(fert$FERTILIZER), l = 1000)
    ys <- predict(fert.lm, newdata = data.frame(FERTILIZER = xs), se=T)
    lines(xs,ys$fit-ys$se.fit, lty=2)
    lines(xs,ys$fit+ys$se.fit, lty=2)
    lines(xs, ys$fit)
    # draw the x-axis
    axis(1)
    mtext("Fertilizer concentration",1,line=3)
    # draw the y-axis (labels rotated)
    axis(2,las=1)
    mtext("Grass yield",2,line=3)
    #put an 'L'-shaped box around the plot
    box(bty="l")
    
    plot of chunk Q1-6
    #reset the parameters
    par(opar)
    
    Show ggplot code
    library(ggplot2)
    ggplot(fert, aes(y=YIELD, x=FERTILIZER)) +
      geom_point()+
      geom_smooth(method='lm', fill='blue',alpha=0.2)+
      scale_y_continuous(expression(Grass~yield~(g.m^{-2})))+
      scale_x_continuous(expression(Fertilizer~concentration~(g.m^{-2})))+
      theme_classic()+
      theme(axis.title.y=element_text(vjust=2, size=rel(1.5)),
                    axis.title.x=element_text(vjust=-1, size=rel(1.5)))
    
    plot of chunk Q1-7
    Show manual code
    newdata = with(fert, data.frame(FERTILIZER=seq(min(FERTILIZER), max(FERTILIZER), len=100)))
    Xmat = model.matrix(~FERTILIZER, data=newdata)
    coefs=coef(fert.lm)
        fit = as.vector(coefs %*% t(Xmat))
    SE = sqrt(diag(Xmat %*% vcov(fert.lm) %*% t(Xmat)))
    Q = qt(0.975, df=df.residual(fert.lm))
    newdata = newdata %>% mutate(lower=fit-Q*SE, upper=fit+Q*SE, fit=fit)
    
    partial.Xmat = model.matrix(~FERTILIZER, data=fert)
    partial.resid = fert %>% mutate(Partial=as.vector(coefs %*% t(partial.Xmat)) + resid(fert.lm))
    library(ggplot2)
    ggplot(newdata, aes(y=fit, x=FERTILIZER)) +
          geom_ribbon(aes(ymin=lower, ymax=upper), fill='blue', color=NA, alpha=0.3) +
          geom_line() +
      geom_point(data=partial.resid, aes(y=Partial))+
      scale_y_continuous(expression(Grass~yield~(g.m^{-2})))+
      scale_x_continuous(expression(Fertilizer~concentration~(g.m^{-2})))+
      theme_classic()+
      theme(axis.title.y=element_text(vjust=2, size=rel(1.5)),
                    axis.title.x=element_text(vjust=-1, size=rel(1.5)))
    
    plot of chunk Q1-7.1
    Show emmeans code
    fert.grid = with(fert, list(FERTILIZER=seq(min(FERTILIZER), max(FERTILIZER), len=100)))
    newdata = ref_grid(fert.lm, ~FERTILIZER, at=fert.grid, cov.reduce=FALSE) %>% confint
    partial.resid = ref_grid(fert.lm, ~FERTILIZER, cov.reduce=FALSE) %>% summary %>%
      mutate(Partial=prediction+resid(fert.lm))
    library(ggplot2)
    ggplot(newdata, aes(y=prediction, x=FERTILIZER)) +
          geom_ribbon(aes(ymin=lower.CL, ymax=upper.CL), fill='blue', color=NA, alpha=0.3) +
          geom_line() +
      geom_point(data=partial.resid, aes(y=Partial))+
      scale_y_continuous(expression(Grass~yield~(g.m^{-2})))+
      scale_x_continuous(expression(Fertilizer~concentration~(g.m^{-2})))+
      theme_classic()+
      theme(axis.title.y=element_text(vjust=2, size=rel(1.5)),
                    axis.title.x=element_text(vjust=-1, size=rel(1.5)))
    
    plot of chunk Q1-7.2
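As a final aside for this example (not part of the original exercise), the fitted model can also be used for prediction in the same way as described in the Predictions section above; the fertilizer load of 110 g.m-2 below is purely hypothetical.

predict(fert.lm, newdata = data.frame(FERTILIZER = 110), interval = "confidence")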

Simple linear regression

Christensen et al. (1996) studied the relationships between coarse woody debris (CWD), shoreline vegetation and lake development in a sample of 16 lakes. They defined CWD as debris greater than 5cm in diameter and recorded, for a number of plots on each lake, the basal area (m2.km-1) of CWD in the nearshore water, and the density (no.km-1) of riparian trees along the shore. The data are in the file christ.csv and the relevant variables are the response variable, CWDBASAL (coarse woody debris basal area, m2.km-1), and the predictor variable, RIPDENS (riparian tree density, trees.km-1).

Download Christensen data set
Format of christ.csv data file

LAKE      RIPDENS  CWDBASAL
Bay       1270     121
Bergner   1210     41
Crampton  1800     183
Long      1875     130
Roach     1300     127
...       ...      ...

LAKE      Name of the North American freshwater lake from which the observations were collected
RIPDENS   Density of riparian trees (trees.km-1) - Predictor variable
CWDBASAL  Coarse woody debris basal area (m2.km-1) - Response variable

Open the christ data file.

Show code
christ <- read.table('../downloads/data/christ.csv', header=T, sep=',', strip.white=T)
head(christ)
        LAKE RIPDENS CWDBASAL
1        Bay    1270      121
2    Bergner    1210       41
3   Crampton    1800      183
4       Long    1875      130
5      Roach    1300      127
6 Tenderfoot    2150      134
  1. List the following:
    1. The biological hypothesis of interest
    2. The biological null hypothesis of interest
    3. The statistical null hypothesis of interest
  2. In the table below, list the assumptions of simple linear regression along with how violations of each assumption are diagnosed and/or the risks of violations are minimized.
    Assumption    Diagnostic/Risk Minimization
    I.
    II.
    III.
    IV.
  3. Draw a scatterplot
    of CWDBASAL against RIPDENS. This should include boxplots for each variable to the margins and a fitted lowess smoother through the data.
    Show base graphics code
    library(car)
    scatterplot(CWDBASAL~RIPDENS, data=christ)
    
    plot of chunk Q2-3
    Show ggplot code
    library(tidyverse)
    ggplot(christ, aes(y=CWDBASAL, x=RIPDENS)) + geom_point() + geom_smooth()
    
    plot of chunk Q2-3.1
    ggplot(christ, aes(y=CWDBASAL)) + geom_boxplot(aes(x=1))
    
    plot of chunk Q2-3.1
    ggplot(christ, aes(y=RIPDENS)) + geom_boxplot(aes(x=1))
    
    plot of chunk Q2-3.1
    1. Is there any evidence of nonlinearity? (Y or N)
    2. Is there any evidence of nonnormality? (Y or N)
    3. Is there any evidence of unequal variance? (Y or N)
  4. The main intention of the researchers is to investigate whether there is a linear relationship between the density of riparian vegetation and the size of the logs. They have no intention of using the model equation for further predictions, nor are they particularly interested in the magnitude of the relationship (slope). Is model I or II regression
    appropriate in these circumstances? Explain.

    If there is no evidence that the assumptions of simple linear regression have been violated, fit the linear model

    CWDBASAL = (SLOPE * RIPDENS) + intercept. At this stage ignore any output.

    Show code
    christ.lm<-lm(CWDBASAL~RIPDENS, data=christ)
    
  5. Examine the regression diagnostics
    (particularly the residual plot)
    Does the residual plot indicate any potential problems with the data? (Y or N)
    Show code
    # setup a 2x3 plotting device with small margins
    par(mfrow = c(2, 3), oma = c(0, 0, 2, 0))
    plot(christ.lm,ask=F,which=1:6)
    
    plot of chunk Q2-5
    influence.measures(christ.lm)
    
    Influence measures of
     lm(formula = CWDBASAL ~ RIPDENS, data = christ) :
    
         dfb.1_ dfb.RIPD  dffit cov.r  cook.d    hat inf
    1   0.09523   0.0230  0.396 0.889 0.07148 0.0627    
    2  -0.06051   0.0150 -0.156 1.172 0.01280 0.0631    
    3  -0.50292   0.6733  0.822 0.958 0.29787 0.1896    
    4   0.10209  -0.1323 -0.155 1.480 0.01293 0.2264   *
    5   0.06999   0.0568  0.423 0.857 0.08004 0.0637    
    6   0.85160  -1.0289 -1.120 1.482 0.59032 0.4016   *
    7  -0.00766  -0.0175 -0.084 1.222 0.00377 0.0653    
    8   0.13079  -0.0961  0.163 1.235 0.01401 0.0959    
    9  -0.16369   0.1207 -0.203 1.211 0.02156 0.0967    
    10  0.02294  -0.1138 -0.311 1.042 0.04742 0.0722    
    11 -0.02588  -0.0101 -0.120 1.198 0.00763 0.0629    
    12  0.49223  -0.3571  0.622 0.769 0.16169 0.0932    
    13 -0.12739   0.1065 -0.137 1.355 0.01007 0.1570    
    14 -0.15382   0.1248 -0.171 1.301 0.01550 0.1340    
    15 -0.21860   0.1721 -0.251 1.224 0.03278 0.1178    
    16 -0.22479   0.1666 -0.277 1.156 0.03922 0.0979    
    
    library(effects)
    plot(Effect('RIPDENS', christ.lm, partial.residuals=TRUE))
    
    plot of chunk Q2-5
    Show autoplot code
    library(ggfortify)
    autoplot(christ.lm, which=1:6)
    
    plot of chunk Q2-5.1
    influence.measures(christ.lm)
    
    Influence measures of
     lm(formula = CWDBASAL ~ RIPDENS, data = christ) :
    
         dfb.1_ dfb.RIPD  dffit cov.r  cook.d    hat inf
    1   0.09523   0.0230  0.396 0.889 0.07148 0.0627    
    2  -0.06051   0.0150 -0.156 1.172 0.01280 0.0631    
    3  -0.50292   0.6733  0.822 0.958 0.29787 0.1896    
    4   0.10209  -0.1323 -0.155 1.480 0.01293 0.2264   *
    5   0.06999   0.0568  0.423 0.857 0.08004 0.0637    
    6   0.85160  -1.0289 -1.120 1.482 0.59032 0.4016   *
    7  -0.00766  -0.0175 -0.084 1.222 0.00377 0.0653    
    8   0.13079  -0.0961  0.163 1.235 0.01401 0.0959    
    9  -0.16369   0.1207 -0.203 1.211 0.02156 0.0967    
    10  0.02294  -0.1138 -0.311 1.042 0.04742 0.0722    
    11 -0.02588  -0.0101 -0.120 1.198 0.00763 0.0629    
    12  0.49223  -0.3571  0.622 0.769 0.16169 0.0932    
    13 -0.12739   0.1065 -0.137 1.355 0.01007 0.1570    
    14 -0.15382   0.1248 -0.171 1.301 0.01550 0.1340    
    15 -0.21860   0.1721 -0.251 1.224 0.03278 0.1178    
    16 -0.22479   0.1666 -0.277 1.156 0.03922 0.0979    
    
  6. Prior to exploring the model output, it is often a good idea to produce a quick partial plot. This helps appreciate the patterns contained within the fitted model as well as interpreting any estimated parameters/effects.
    Show effects code
    library(effects)
    plot(Effect('RIPDENS',christ.lm, partial.residuals=TRUE))
    
    plot of chunk Q2-5a.1
    Show ggpredict code
    library(ggeffects)
    ggpredict(christ.lm, ~RIPDENS) %>% plot(rawdata=TRUE)
    
    plot of chunk Q2-5a.2
    Show ggemmeans code
    library(ggeffects)
    ggemmeans(christ.lm, ~RIPDENS) %>% plot
    
    plot of chunk Q2-5a.3
  7. At this point, we have no evidence to suggest that the hypothesis tests will not be reliable. Examine the regression output and identify and interpret the following:
    Show code
    summary(christ.lm)
    
    Call:
    lm(formula = CWDBASAL ~ RIPDENS, data = christ)
    
    Residuals:
       Min     1Q Median     3Q    Max 
    -38.62 -22.41 -13.33  26.16  61.35 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -77.09908   30.60801  -2.519 0.024552 *  
    RIPDENS       0.11552    0.02343   4.930 0.000222 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 36.32 on 14 degrees of freedom
    Multiple R-squared:  0.6345,	Adjusted R-squared:  0.6084 
    F-statistic:  24.3 on 1 and 14 DF,  p-value: 0.0002216
    
    Show confint code
    confint(christ.lm)
    
                        2.5 %      97.5 %
    (Intercept) -142.74672817 -11.4514274
    RIPDENS        0.06525871   0.1657734
    
    Show tidy code
    library(broom)
    tidy(christ.lm, conf.int=TRUE)
    
    # A tibble: 2 x 7
      term        estimate std.error statistic  p.value  conf.low conf.high
      <chr>          <dbl>     <dbl>     <dbl>    <dbl>     <dbl>     <dbl>
    1 (Intercept)  -77.1     30.6        -2.52 0.0246   -143.       -11.5  
    2 RIPDENS        0.116    0.0234      4.93 0.000222    0.0653     0.166
    
    1. sample y-intercept
    2. sample slope
    3. t value for main H0
    4. P-value for main H0
    5. R2 value
  8. What conclusions (statistical and biological) would you draw from the analysis?
  9. Finally, how about a nice summary figure.
    Show base graphics code
    #setup the graphics device with wider margins
    opar <- par(mar=c(5,5,0,0))
    #construct the base plot without axes or labels or even points
    plot(CWDBASAL~RIPDENS, data=christ, ann=F, axes=F, type="n")
    # put the points on as solid circles (point character 16)
    points(CWDBASAL~RIPDENS, data=christ, pch=16)
    # calculate predicted values
    xs <- seq(min(christ$RIPDENS), max(christ$RIPDENS), l = 1000)
    ys <- predict(christ.lm, newdata = data.frame(RIPDENS = xs), se=T)
    lines(xs,ys$fit-ys$se.fit, lty=2)
    lines(xs,ys$fit+ys$se.fit, lty=2)
    lines(xs, ys$fit)
    # draw the x-axis
    axis(1)
    mtext("Density of riparian trees",1,line=3)
    # draw the y-axis (labels rotated)
    axis(2,las=1)
    mtext("Course woody debris basal area",2,line=3)
    #put an 'L'-shaped box around the plot
    box(bty="l")
    
    plot of chunk Q2-8b
    #reset the parameters
    par(opar)
    
    Show ggplot code
    library(ggplot2)
    ggplot(christ, aes(y=CWDBASAL, x=RIPDENS)) +
      geom_point()+
      geom_smooth(method='lm', fill='blue',alpha=0.2)+
      scale_y_continuous(expression(Coarse~woody~debris~basal~area~(m^2~km^{-1})))+
      scale_x_continuous(expression(Density~of~riparian~trees~(trees.km^{-1})))+
      theme_classic()+
      theme(axis.title.y=element_text(vjust=2, size=rel(1.5)),
            axis.title.x=element_text(vjust=-1, size=rel(1.5)))
    
    plot of chunk Q2-8a
    Show manual code
    newdata = with(christ, data.frame(RIPDENS=seq(min(RIPDENS), max(RIPDENS), len=100)))
    Xmat = model.matrix(~RIPDENS, data=newdata)
    coefs=coef(christ.lm)
        fit = as.vector(coefs %*% t(Xmat))
    SE = sqrt(diag(Xmat %*% vcov(christ.lm) %*% t(Xmat)))
    Q = qt(0.975, df=df.residual(christ.lm))
    newdata = newdata %>% mutate(lower=fit-Q*SE, upper=fit+Q*SE, fit=fit)
    
    partial.Xmat = model.matrix(~RIPDENS, data=christ)
    partial.resid = christ %>% mutate(Partial=as.vector(coefs %*% t(partial.Xmat)) + resid(christ.lm))
    library(ggplot2)
    ggplot(newdata, aes(y=fit, x=RIPDENS)) +
          geom_ribbon(aes(ymin=lower, ymax=upper), fill='blue', color=NA, alpha=0.3) +
          geom_line() +
      geom_point(data=partial.resid, aes(y=Partial))+
      scale_y_continuous(expression(Coarse~woody~debris~basal~area~(m^2~km^{-1})))+
      scale_x_continuous(expression(Density~of~riparian~trees~(trees.km^{-1})))+
      theme_classic()+
      theme(axis.title.y=element_text(vjust=2, size=rel(1.5)),
                    axis.title.x=element_text(vjust=-1, size=rel(1.5)))
    
    plot of chunk Q2-8.1

Simple linear regression

Here is a modified example from Quinn and Keough (2002). Peake & Quinn (1993) investigated the relationship between the number of invertebrate individuals living amongst clumps of mussels on a rocky intertidal shore and the area of those mussel clumps.

Download PeakeQuinn data set
Format of peakquinn.csv data files
AREA       INDIV
516.00     18
469.06     60
462.25     57
938.60     100
1357.15    48
......

AREA    Area of mussel clump (mm2) - Predictor variable
INDIV   Number of individuals found within clump - Response variable
clump of mussels

Open the peakquinn data file.
Show code
peakquinn <- read.table('../downloads/data/peakquinn.csv', header=T, sep=',', strip.white=T)
peakquinn
       AREA INDIV
1    516.00    18
2    469.06    60
3    462.25    57
4    938.60   100
5   1357.15    48
6   1773.66   118
7   1686.01   148
8   1786.29   214
9   3090.07   225
10  3980.12   283
11  4424.84   380
12  4451.68   278
13  4982.89   338
14  4450.86   274
15  5490.74   569
16  7476.21   509
17  7138.82   682
18  9149.94   600
19 10133.07   978
20  9287.69   363
21 13729.13  1402
22 20300.77   675
23 24712.72  1159
24 27144.03  1062
25 26117.81   632

The relationship between two continuous variables can be analyzed by simple linear regression. As with question 2, note that the levels of the predictor variable were measured, not fixed, and thus parameter estimates should be based on model II RMA regression. Note, however, that the hypothesis test for the slope is unaffected by whether the predictor variable is fixed or measured.

Before performing the analysis we need to check the assumptions. To evaluate the assumptions of linearity, normality and homogeneity of variance, construct a scatterplot of INDIV against AREA (INDIV on the y-axis, AREA on the x-axis) including a lowess smoother and boxplots on the axes.

Show base graphics code
library(car)
scatterplot(INDIV~AREA,data=peakquinn)
plot of chunk Q3-a
Show ggplot code
    library(tidyverse)
ggplot(peakquinn, aes(y=INDIV, x=AREA)) + geom_point() + geom_smooth()
plot of chunk Q3-1.1
ggplot(peakquinn, aes(y=INDIV)) + geom_boxplot(aes(x=1))
plot of chunk Q3-1.1
ggplot(peakquinn, aes(y=INDIV, x=AREA)) + geom_point() + geom_smooth() + scale_x_log10() + scale_y_log10()
plot of chunk Q3-1.1
ggplot(peakquinn, aes(y=INDIV)) + geom_boxplot(aes(x=1)) + scale_y_log10()
plot of chunk Q3-1.1
ggplot(peakquinn, aes(y=AREA)) + geom_boxplot(aes(x=1)) + scale_y_log10()
plot of chunk Q3-1.1
  1. Consider the assumptions and suitability of the data for simple linear regression:
    1. In this case, the researchers are interested in investigating whether there is a relationship between the number of invertebrate individuals and mussel clump area, as well as in generating a predictive model. However, they are not interested in the specific magnitude of the relationship (slope) and have no intention of comparing their slope to any other non-zero values. Is model I or model II regression appropriate in these circumstances? Explain.
    2. Is there any evidence that the other assumptions are likely to be violated?
  2. To get an appreciation of what a residual plot would look like when there is some evidence that the assumption of homogeneity of variance has been violated, perform the simple linear regression (by fitting a linear model)
    Show code
    peakquinn.lm<-lm(INDIV~AREA, data=peakquinn)
    
  3. purely for the purpose of examining the regression diagnostics
    (particularly the residual plot)
    Show code
    # setup a 2x3 plotting device (for the six diagnostic plots)
    par(mfrow = c(2, 3), oma = c(0, 0, 2, 0))
    plot(peakquinn.lm,ask=F,which=1:6)
    
    plot of chunk Q3-1c
    influence.measures(peakquinn.lm)
    
    Influence measures of
     lm(formula = INDIV ~ AREA, data = peakquinn) :
    
        dfb.1_ dfb.AREA    dffit cov.r   cook.d    hat inf
    1  -0.1999  0.13380 -0.20006 1.125 2.04e-02 0.0724    
    2  -0.1472  0.09887 -0.14731 1.150 1.12e-02 0.0728    
    3  -0.1507  0.10121 -0.15073 1.148 1.17e-02 0.0728    
    4  -0.1153  0.07466 -0.11549 1.155 6.92e-03 0.0687    
    5  -0.1881  0.11765 -0.18896 1.117 1.82e-02 0.0653    
    6  -0.1214  0.07306 -0.12237 1.142 7.75e-03 0.0622    
    7  -0.0858  0.05206 -0.08639 1.155 3.88e-03 0.0628    
    8  -0.0176  0.01059 -0.01776 1.165 1.65e-04 0.0621    
    9  -0.0509  0.02633 -0.05236 1.150 1.43e-03 0.0535    
    10 -0.0237  0.01069 -0.02504 1.148 3.28e-04 0.0489    
    11  0.0475 -0.01961  0.05096 1.141 1.35e-03 0.0470    
    12 -0.0418  0.01717 -0.04492 1.142 1.05e-03 0.0468    
    13 -0.0061  0.00221 -0.00672 1.144 2.36e-05 0.0448    
    14 -0.0453  0.01859 -0.04862 1.142 1.23e-03 0.0468    
    15  0.1641 -0.05100  0.18584 1.067 1.74e-02 0.0433    
    16  0.0472 -0.00254  0.06317 1.129 2.08e-03 0.0401    
    17  0.1764 -0.01859  0.22781 1.021 2.57e-02 0.0403    
    18  0.0542  0.01493  0.09093 1.120 4.28e-03 0.0411    
    19  0.2173  0.12011  0.43430 0.808 8.29e-02 0.0433    
    20 -0.0701 -0.02168 -0.12016 1.106 7.43e-03 0.0413    
    21  0.1849  0.63635  1.07759 0.357 3.36e-01 0.0614   *
    22  0.0752 -0.33126 -0.39474 1.157 7.79e-02 0.1352    
    23 -0.0789  0.22611  0.25071 1.362 3.25e-02 0.2144   *
    24  0.0853 -0.21744 -0.23573 1.473 2.88e-02 0.2681   *
    25  0.4958 -1.32133 -1.44477 0.865 8.44e-01 0.2445   *
    
    library(effects)
    plot(Effect('AREA', peakquinn.lm, partial.residuals=TRUE))
    
    plot of chunk Q3-1c
    Show autoplot code
    library(ggfortify)
    autoplot(peakquinn.lm, which=1:6)
    
    plot of chunk Q3-1c.1
    
  4. How would you describe the residual plot?
  5. What could be done to the data to address the problems highlighted by the scatterplot, boxplots and residuals?
  6. Describe how the scatterplot, axial boxplots and residual plot might appear following successful data transformation.
  7. Transform both variables to logs (base 10), replot the scatterplot using the transformed data, refit the linear model (again using transformed data) and examine the residual plot. HINT
    Show code
    library(car)
    scatterplot(log10(INDIV)~log10(AREA), data=peakquinn)
    
    plot of chunk Q3-4a
    Show ggplot code
        library(tidyverse)
    ggplot(peakquinn, aes(y=INDIV, x=AREA)) + geom_point() + geom_smooth() + scale_x_log10() + scale_y_log10()
    
    plot of chunk Q3-4.1
    ggplot(peakquinn, aes(y=INDIV)) + geom_boxplot(aes(x=1)) + scale_y_log10()
    
    plot of chunk Q3-4.1
    ggplot(peakquinn, aes(y=AREA)) + geom_boxplot(aes(x=1)) + scale_y_log10()
    
    plot of chunk Q3-4.1
  8. Would you consider the transformation as successful? (Y or N)
  9. Note, rather than transform the response so that it conforms to the assumptions of normality, it would arguably be more appropriate to model these data against a more suitable distribution. In this case, since the response represents count data, a Poisson distribution would be more appropriate than a Gaussian (normal) distribution. Indeed, this is the approach that we will take in Tutorial 9.4a (a minimal sketch is given below). For this tutorial, however, we will fit the log-transformed data against a normal distribution.
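    As a taster of that alternative approach, a minimal sketch might look like the following (the full treatment is left to Tutorial 9.4a, and the choice of log(AREA) as the predictor here is simply an assumption that mirrors the log-log model used below):
    ## minimal sketch only - model the counts against a Poisson distribution with a log link
    peakquinn.glm <- glm(INDIV ~ log(AREA), data=peakquinn, family=poisson(link='log'))
    summary(peakquinn.glm)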

  10. If you are satisfied that the assumptions of the analysis are likely to have been met, perform the linear regression analysis (refit the linear model) HINT, examine the output.
    Show model fitting code
    peakquinn.lm<-lm(log(INDIV)~log(AREA), data=peakquinn)
    
    Show base graphics diagnostics code
    par(mfrow = c(2, 3), oma = c(0, 0, 2, 0))
    plot(peakquinn.lm,ask=F,which=1:6)
    
    plot of chunk Q3-6.2
    Show autoplot code
    autoplot(peakquinn.lm, which=1:6)
    
    plot of chunk Q3-6.2a
    Show effects code
    library(effects)
    plot(Effect('AREA', peakquinn.lm, partial.residuals=TRUE))
    
    plot of chunk Q3-6.3
    plot(allEffects(peakquinn.lm), partial.residuals=TRUE,
     axes=list(x=list(AREA=list(transform=list(trans=log, inverse=exp),
                                ticks=list(at=c(500,1000,2000,5000,10000, 20000)))),
               y=list(transform=list(transform=exp))))
    
    Error in get(as.character(FUN), mode = "function", envir = envir): object 'inverse' of mode 'function' was not found
    
    Show ggpredict code
    library(ggeffects)
    ggpredict(peakquinn.lm, term='AREA [exp]') %>% plot() + scale_y_log10() + scale_x_log10()
    
    plot of chunk Q3-6.4
    ## The following should work, but there is a bug
    ggpredict(peakquinn.lm, term='AREA',full.data=TRUE) %>% plot(rawdata=TRUE)
    
    plot of chunk Q3-6.4
    Show ggemmeans code
    library(ggeffects)
    ggemmeans(peakquinn.lm, 'AREA [exp]') %>% plot()
    
    plot of chunk Q3-6.6
    ggemmeans(peakquinn.lm, 'AREA [exp]') %>% plot() + scale_x_log10() + scale_y_log10()
    
    plot of chunk Q3-6.6
    Show summary code
    summary(peakquinn.lm)
    
    Call:
    lm(formula = log(INDIV) ~ log(AREA), data = peakquinn)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -0.9983 -0.1488  0.0511  0.2574  0.6175 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -1.32632    0.59645  -2.224   0.0363 *  
    log(AREA)    0.83492    0.07066  11.816 3.01e-11 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.4274 on 23 degrees of freedom
    Multiple R-squared:  0.8586,	Adjusted R-squared:  0.8524 
    F-statistic: 139.6 on 1 and 23 DF,  p-value: 3.007e-11
    
    Show confint code
    confint(peakquinn.lm)
    
                     2.5 %      97.5 %
    (Intercept) -2.5601791 -0.09246238
    log(AREA)    0.6887433  0.98108798
    
    Show tidy code
    tidy(peakquinn.lm, conf.int=TRUE)
    
    # A tibble: 2 x 5
      term        estimate std.error statistic  p.value
      <chr>          <dbl>     <dbl>     <dbl>    <dbl>
    1 (Intercept)   -1.33     0.596      -2.22 3.63e- 2
    2 log(AREA)      0.835    0.0707     11.8  3.01e-11
    
    1. Test the null hypothesis that the population slope of the regression line between log number of individuals and log clump area is zero - use either the t-test or ANOVA F-test regression output. What are your conclusions (statistical and biological)? HINT
    2. If the relationship is significant, construct the regression equation relating the number of individuals in a clump to the clump size. Remember that parameter estimates should be based on RMA regression, not OLS!
      DV = intercept + slope x IV
      loge(Individuals) = ______ + ______ x loge(Area)
  11. Write the results out as though you were writing a research paper/thesis. For example (select the phrase that applies and fill in gaps with your results): 

    A linear regression of log number of individuals against log clump area showed (choose correct option)
    (b = , t = , df = , P = )

  12. How much of the variation in log individual number is explained by the linear relationship with log clump area? That is, what is the R2 value?
  13. What number of individuals would you predict
    for a new clump with an area of 8000 mm2? HINT
    Show predict code
    #include prediction intervals
    exp(predict(peakquinn.lm, newdata=data.frame(AREA=8000), interval="predict"))
    
           fit      lwr      upr
    1 481.6561 194.5975 1192.167
    
    #include 95% confidence intervals
    exp(predict(peakquinn.lm, newdata=data.frame(AREA=8000), interval="confidence"))
    
           fit      lwr      upr
    1 481.6561 394.5162 588.0434
    
    Show emmeans code
    peakquinn.grid = ref_grid(peakquinn.lm, at=list(AREA=8000))
    confint(peakquinn.grid, type='response') %>% as.data.frame
    
      AREA response       SE df lower.CL upper.CL
    1 8000 481.6561 46.46698 23 394.5162 588.0434
    
  14. Given that in this case both response and predictor variables were measured (the levels of the predictor variable were not specifically set by the researchers), it might be worth presenting the less biased model parameters (y-intercept and slope) from RMA model II regression. Perform the RMA model II regression
    and examine the slope and intercept
    1. b (slope):
    2. c (y-intercept):
    Show code
    library(lmodel2)
    
    Error in library(lmodel2): there is no package called 'lmodel2'
    
    peakquinn.lm2 <- lmodel2(log10(INDIV)~log10(AREA), data=peakquinn)
    
    Error in lmodel2(log10(INDIV) ~ log10(AREA), data = peakquinn): could not find function "lmodel2"
    
    peakquinn.lm2
    
    Error in eval(expr, envir, enclos): object 'peakquinn.lm2' not found
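    Since lmodel2 could not be loaded in this session (as the errors above show), the following is a minimal sketch of computing reduced major axis (geometric mean regression) estimates directly from the sample statistics. These hand-computed values are in the spirit of, but not a substitute for, the full lmodel2 output (whose 'RMA' method is the ranged major axis).
    ## reduced major axis (geometric mean) estimates computed by hand - a sketch only
    x <- log10(peakquinn$AREA)
    y <- log10(peakquinn$INDIV)
    b.rma <- sign(cor(x, y)) * sd(y)/sd(x)  # RMA slope
    a.rma <- mean(y) - b.rma * mean(x)      # RMA y-intercept
    c(intercept = a.rma, slope = b.rma)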
    
  15. Significant simple linear regression outcomes are usually accompanied by a scatterplot that summarizes the relationship between the two variables. Construct a scatterplot without a smoother or marginal boxplots. Consider whether or not transformed or untransformed data should be used in this graph.
    Show base graphics code
    #setup the graphics device with wider margins
    opar <- par(mar=c(6,6,0,0))
    #construct the base plot with log scale axes 
    plot(INDIV~AREA, data=peakquinn, log='xy', ann=F, axes=F, type="n")
    # put the points on as solid circles (point character 16)
    points(INDIV~AREA, data=peakquinn, pch=16)
    # draw the x-axis
    axis(1)
    mtext(expression(paste("Mussel clump area ",(mm^2))),1,line=4,cex=2)
    # draw the y-axis (labels rotated)
    axis(2,las=1)
    mtext("Number of individuals per clump",2,line=4,cex=2)
    #put in the fitted regression trendline
    xs <- seq(min(peakquinn$AREA), max(peakquinn$AREA), l = 1000)
    ys <- exp(predict(peakquinn.lm, newdata=data.frame(AREA=xs), interval='c'))
    lines(ys[,'fit']~xs)
    matlines(xs,ys, lty=2,col=1)
    # include the equation
    text(max(peakquinn$AREA), 20, expression(paste(log[10], "INDIV = 0.835.",log[10], "AREA - 0.576")),
      pos = 2)
    text(max(peakquinn$AREA), 17, expression(paste(R^2 == 0.857)), pos = 2)
    #put an 'L'-shaped box around the plot
    box(bty="l")
    
    plot of chunk Q3-11
    #reset the parameters
    par(opar)
    
    Show ggplot code
    library(ggplot2)
    library(scales)
    ggplot(peakquinn, aes(y=INDIV, x=AREA)) +
      geom_point()+
      geom_smooth(method='lm', fill='blue',alpha=0.2)+
      scale_y_log10(expression(Number~of~individuals), breaks=as.vector(c(1,2,5,10) %o% 10^(-1:2)))+
      scale_x_log10(expression(Mussel~clump~area~(mm^2)), breaks=as.vector(c(1,2,5,10) %o% 10^(-1:5)))+
      theme_classic()+
      theme(axis.title.y=element_text(vjust=2, size=rel(1.5)),
            axis.title.x=element_text(vjust=-1, size=rel(1.5)))
    
    plot of chunk Q3-12
    Show manual code
    newdata = with(peakquinn, data.frame(AREA=seq(min(AREA), max(AREA), len=100)))
    Xmat = model.matrix(~log(AREA), data=newdata)
    coefs=coef(peakquinn.lm)
       fit = as.vector(coefs %*% t(Xmat))
    SE = sqrt(diag(Xmat %*% vcov(peakquinn.lm) %*% t(Xmat)))
    Q = qt(0.975, df=df.residual(peakquinn.lm))
    newdata = newdata %>% mutate(lower=exp(fit-Q*SE), upper=exp(fit+Q*SE), fit=exp(fit))
    
    partial.Xmat = model.matrix(~log(AREA), data=peakquinn)
    partial.resid = peakquinn %>% mutate(Partial=exp(as.vector(coefs %*% t(partial.Xmat)) + resid(peakquinn.lm)))
    library(ggplot2)
    ggplot(newdata, aes(y=fit, x=AREA)) +
      geom_ribbon(aes(ymin=lower, ymax=upper), fill='blue', color=NA, alpha=0.3) +
      geom_line() +
      geom_point(data=partial.resid, aes(y=Partial))+
      scale_y_log10(expression(Number~of~individuals))+
      scale_x_log10(expression(Mussel~clump~area~(mm^2)))+
      theme_classic()+
      theme(axis.title.y=element_text(vjust=2, size=rel(1.5)),
                    axis.title.x=element_text(vjust=-1, size=rel(1.5)))
    
    plot of chunk Q3-12.1
    Show emmeans code
    peakquinn.grid = with(peakquinn, list(AREA=seq(min(AREA), max(AREA), len=100)))
    newdata = ref_grid(peakquinn.lm, ~AREA, at=peakquinn.grid, cov.reduce=FALSE) %>% confint(type='response')
    partial.resid = ref_grid(peakquinn.lm, ~AREA, cov.reduce=FALSE) %>% summary %>%
      mutate(Partial=exp(prediction+resid(peakquinn.lm)))
    library(ggplot2)
    ggplot(newdata, aes(y=response, x=AREA)) +
      geom_ribbon(aes(ymin=lower.CL, ymax=upper.CL), fill='blue', color=NA, alpha=0.3) +
      geom_line() +
      geom_point(data=partial.resid, aes(y=Partial))+
      scale_y_log10(expression(Number~of~individuals))+
      scale_x_log10(expression(Mussel~clump~area~(mm^2)))+
      theme_classic()+
      theme(axis.title.y=element_text(vjust=2, size=rel(1.5)),
                    axis.title.x=element_text(vjust=-1, size=rel(1.5)))
    
    plot of chunk Q3-12.2

Model II RMA regression

Nagy, Girard & Brown (1999) investigated the allometric scaling relationships for mammals (79 species), reptiles (55 species) and birds (95 species). The observed relationships between body size and metabolic rates of organisms have recently attracted a great deal of discussion amongst scientists from a wide range of disciplines. Whilst some have sought to explore explanations for the apparently 'universal' patterns, Nagy et al. (1999) were interested in determining whether scaling relationships differed between taxonomic, dietary and habitat groupings.

Download Nagy data set
Format of nagy.csv data file
Species                    Common                  Mass  FMR   Taxon  Habitat  Diet  Class
Pipistrellus pipistrellus  Pipistrelle             7.3   29.3  Ch     ND       I     Mammal
Plecotus auritus           Brown long-eared bat    8.5   27.6  Ch     ND       I     Mammal
Myotis lucifugus           Little brown bat        9.0   29.9  Ch     ND       I     Mammal
Gerbillus henleyi          Northern pygmy gerbil   9.3   26.5  Ro     D        G     Mammal
Tarsipes rostratus         Honey possum            9.9   34.4  Tr     ND       N     Mammal
......

Open the nagy data file.
Show code
nagy <- read.table('../downloads/data/nagy.csv', header=T, sep=',', strip.white=T)
head(nagy)
                    Species                Common Mass  FMR Taxon Habitat Diet  Class   MASS
1 Pipistrellus pipistrellus           Pipistrelle  7.3 29.3    Ch      ND    I Mammal 0.0073
2          Plecotus auritus  Brown long-eared bat  8.5 27.6    Ch      ND    I Mammal 0.0085
3          Myotis lucifugus      Little brown bat  9.0 29.9    Ch      ND    I Mammal 0.0090
4         Gerbillus henleyi Northern pygmy gerbil  9.3 26.5    Ro       D    G Mammal 0.0093
5        Tarsipes rostratus          Honey possum  9.9 34.4    Tr      ND    N Mammal 0.0099
6           Anoura caudifer   Flower-visiting bat 11.5 51.9    Ch      ND    N Mammal 0.0115

For this example, we will explore the relationships between field metabolic rate (FMR) and body mass (Mass) in grams for the entire data set and then separately for each of the three classes (mammals, reptiles and aves).

As with the previous examples, both the predictor and response variables were measured (not set), and so parameter estimates should again be based on model II RMA regression. Unlike the previous examples, however, the primary focus here is not hypothesis testing about whether there is a relationship or not, nor is prediction a consideration. Instead, the researchers are interested in establishing (and comparing) the allometric scaling factors (slopes) of the metabolic rate - body mass relationships. Hence, in this case, model II regression is indeed appropriate.
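Recall that an allometric power relationship of the form FMR = a.Mass^b becomes linear on a log-log scale,
log10(FMR) = log10(a) + b*log10(Mass)
so the slope of the log-log regression is the scaling exponent (b) that the researchers wish to estimate and compare.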

  1. Before performing the analysis we need to check the assumptions. To evaluate the assumptions of linearity, normality and homogeneity of variance, construct a scatterplot
    of FMR against Mass including a lowess smoother and boxplots on the axes.
    Show code
    library(car)
    scatterplot(FMR~Mass,data=nagy)
    
    plot of chunk Q4-1
    1. Is there any evidence of non-normality?
    2. Is there any evidence of non-linearity?
  2. Typically, allometric scaling data are treated by a log-log transformation. That is, both predictor and response variables are log10 transformed. This is achieved graphically on a scatterplot by plotting the data on log transformed axes. Produce such a scatterplot (HINT). Does this improve linearity?
    Show code
    library(car)
    scatterplot(FMR~Mass,data=nagy,log='xy')
    
    plot of chunk Q4-2
  3. Fit the linear models relating log-log transformed FMR and Mass using both the Ordinary Least Squares and Reduced Major Axis methods separately for each of the classes (mammals, reptiles and aves).
    Show code
    nagy.olsM <- lm(log10(FMR)~log10(Mass),data=subset(nagy,nagy$Class=="Mammal"))
    summary(nagy.olsM)
    
    Call:
    lm(formula = log10(FMR) ~ log10(Mass), data = subset(nagy, nagy$Class == 
        "Mammal"))
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.60271 -0.13172 -0.00811  0.14440  0.41794 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  0.68301    0.05336   12.80   <2e-16 ***
    log10(Mass)  0.73412    0.01924   38.15   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.2119 on 77 degrees of freedom
    Multiple R-squared:  0.9498,	Adjusted R-squared:  0.9491 
    F-statistic:  1456 on 1 and 77 DF,  p-value: < 2.2e-16
    
    nagy.olsR <- lm(log10(FMR)~log10(Mass),data=subset(nagy,nagy$Class=="Reptiles"))
    summary(nagy.olsR)
    
    Call:
    lm(formula = log10(FMR) ~ log10(Mass), data = subset(nagy, nagy$Class == 
        "Reptiles"))
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.61438 -0.07813 -0.00782  0.17257  0.39794 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -0.70658    0.05941  -11.89   <2e-16 ***
    log10(Mass)  0.88789    0.02941   30.19   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.2289 on 53 degrees of freedom
    Multiple R-squared:  0.945,	Adjusted R-squared:  0.944 
    F-statistic: 911.5 on 1 and 53 DF,  p-value: < 2.2e-16
    
    nagy.olsA <- lm(log10(FMR)~log10(Mass),data=subset(nagy,nagy$Class=="Aves"))
    summary(nagy.olsA)
    
    Call:
    lm(formula = log10(FMR) ~ log10(Mass), data = subset(nagy, nagy$Class == 
        "Aves"))
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.53281 -0.07884  0.00453  0.09973  0.41202 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  1.01945    0.03929   25.95   <2e-16 ***
    log10(Mass)  0.68079    0.01817   37.46   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.1653 on 93 degrees of freedom
    Multiple R-squared:  0.9378,	Adjusted R-squared:  0.9372 
    F-statistic:  1403 on 1 and 93 DF,  p-value: < 2.2e-16
    
    library(lmodel2)
    
    Error in library(lmodel2): there is no package called 'lmodel2'
    
    nagy.rmaM <- lmodel2(log10(FMR)~log10(Mass),data=subset(nagy,nagy$Class=="Mammal"), nperm=99, range.y='interval', range.x='interval')
    
    Error in lmodel2(log10(FMR) ~ log10(Mass), data = subset(nagy, nagy$Class == : could not find function "lmodel2"
    
    nagy.rmaM
    
    Error in eval(expr, envir, enclos): object 'nagy.rmaM' not found
    
    nagy.rmaR <- lmodel2(log10(FMR)~log10(Mass),data=subset(nagy,nagy$Class=="Reptiles"), nperm=99, range.y='interval', range.x='interval')
    
    Error in lmodel2(log10(FMR) ~ log10(Mass), data = subset(nagy, nagy$Class == : could not find function "lmodel2"
    
    nagy.rmaR
    
    Error in eval(expr, envir, enclos): object 'nagy.rmaR' not found
    
    nagy.rmaA <- lmodel2(log10(FMR)~log10(Mass),data=subset(nagy,nagy$Class=="Aves"), nperm=99, range.y='interval', range.x='interval')
    
    Error in lmodel2(log10(FMR) ~ log10(Mass), data = subset(nagy, nagy$Class == : could not find function "lmodel2"
    
    nagy.rmaA
    
    Error in eval(expr, envir, enclos): object 'nagy.rmaA' not found
    

    Indicate the following:
                          OLS                     RMA
    Class                 Slope     Intercept     Slope     Intercept     R2
    Mammals (HINT and HINT)
    Reptiles (HINT and HINT)
    Aves (HINT and HINT)
  4. Produce a scatterplot that depicts the relationship between FMR and Mass just for the mammals (HINT)
    1. Fit the OLS regression line to this plot (HINT)
    2. Fit the RMA regression line (in red) to this plot (HINT)
    Show code
    #setup the graphics device with wider margins
    opar <- par(mar=c(6,6,0,0))
    #construct the base plot with log scale axes
    plot(FMR~Mass,data=subset(nagy,nagy$Class=="Mammal"), log='xy', ann=F, axes=F, type="n")
    # put the points on as solid circles (point character 16)
    points(FMR~Mass,data=subset(nagy,nagy$Class=="Mammal"), pch=16)
    # draw the x-axis
    axis(1)
    mtext(expression(paste("Mass ",(g))),1,line=4,cex=2)
    # draw the y-axis (labels rotated)
    axis(2,las=1)
    mtext("Field metabolic rate",2,line=4,cex=2)
    #put in the fitted regression trendline
    xs <- with(subset(nagy,nagy$Class=="Mammal"),seq(min(Mass), max(Mass), l = 1000))
    ys <- 10^predict(nagy.olsM, newdata=data.frame(Mass=xs), interval='c')
    lines(ys[,'fit']~xs)
    matlines(xs,ys, lty=2,col=1)
    #now include the RMA trendline
    centr.y <- mean(nagy.rmaM$y)
    
    Error in mean(nagy.rmaM$y): object 'nagy.rmaM' not found
    
    centr.x <- mean(nagy.rmaM$x)
    
    Error in mean(nagy.rmaM$x): object 'nagy.rmaM' not found
    
    A <- nagy.rmaM$regression.results[4,2]
    
    Error in eval(expr, envir, enclos): object 'nagy.rmaM' not found
    
    B <- nagy.rmaM$regression.results[4,3]
    
    Error in eval(expr, envir, enclos): object 'nagy.rmaM' not found
    
    B1 <- nagy.rmaM$confidence.intervals[4, 4]
    
    Error in eval(expr, envir, enclos): object 'nagy.rmaM' not found
    
    A1 <- centr.y - B1 * centr.x
    
    Error in eval(expr, envir, enclos): object 'centr.y' not found
    
    B2 <- nagy.rmaM$confidence.intervals[4, 5]
    
    Error in eval(expr, envir, enclos): object 'nagy.rmaM' not found
    
    A2 <- centr.y - B2 * centr.x
    
    Error in eval(expr, envir, enclos): object 'centr.y' not found
    
    lines(10^(A+(log10(xs)*B))~xs, col="red")
    
    Error in eval(predvars, data, env): object 'A' not found
    
    lines(10^(A1+(log10(xs)*B1))~xs, col="red",lty=2)
    
    Error in eval(predvars, data, env): object 'A1' not found
    
    lines(10^(A2+(log10(xs)*B2))~xs, col="red",lty=2)
    
    Error in eval(predvars, data, env): object 'A2' not found
    
    # include the equation
    #text(max(peakquinn$AREA), 20, expression(paste(log[10], "INDIV = 0.835.",log[10], "AREA - 0.576")), pos = 2)
    #text(max(peakquinn$AREA), 17, expression(paste(R^2 == 0.857)), pos = 2)
    ##put an 'L'-shaped box around the plot
    box(bty="l")
    
    plot of chunk Q4-4
    #reset the parameters
    par(opar)
    
  5. Compare and contrast the OLS and RMA parameter estimates. Explain which estimates are most appropriate and why, in this case, the two methods produce very similar estimates.
  6. To see how the degree of correlation can impact on the difference between OLS and RMA estimates, fit the relationship between FMR and Mass for the entire data set.
    Show code
    library(lmodel2)
    
    Error in library(lmodel2): there is no package called 'lmodel2'
    
    nagy.rma <- lmodel2(log10(FMR)~log10(Mass),data=nagy, nperm=99, range.y='interval', range.x='interval')
    
    Error in lmodel2(log10(FMR) ~ log10(Mass), data = nagy, nperm = 99, range.y = "interval", : could not find function "lmodel2"
    
    nagy.rma
    
    Error in eval(expr, envir, enclos): object 'nagy.rma' not found
    
    1. Complete the following table
                             OLS                     RMA
       Class                 Slope     Intercept     Slope     Intercept     R2
       Entire data set (HINT and HINT)
    2. Produce a scatterplot that depicts the relationship between FMR and Mass for the entire data set (HINT)
    3. Fit the OLS regression line to this plot (HINT)
    4. Fit the RMA regression line (in red) to this plot (HINT)
    5. Compare and contrast the OLS and RMA parameter estimates. Explain which estimates are most appropriate and why, in this case, the two methods produce less similar estimates.
    Show code
    #setup the graphics device with wider margins
    opar <- par(mar=c(6,6,0,0))
    #construct the base plot with log scale axes
    plot(FMR~Mass,data=nagy, log='xy', ann=F, axes=F, type="n")
    # put the points on as solid circles (point character 16)
    points(FMR~Mass,data=nagy, pch=16)
    # draw the x-axis
    axis(1)
    mtext(expression(paste("Mass ",(g))),1,line=4,cex=2)
    # draw the y-axis (labels rotated)
    axis(2,las=1)
    mtext("Field metabolic rate",2,line=4,cex=2)
    #put in the fitted regression trendline
    xs <- with(nagy,seq(min(Mass), max(Mass), l = 1000))
    nagy.lm<-lm(log10(FMR)~log10(Mass), data=nagy)
    ys <- 10^predict(nagy.lm, newdata=data.frame(Mass=xs), interval='c')
    lines(ys[,'fit']~xs)
    matlines(xs,ys, lty=2,col=1)
    #now include the RMA trendline
    centr.y <- mean(nagy.rma$y)
    
    Error in mean(nagy.rma$y): object 'nagy.rma' not found
    
    centr.x <- mean(nagy.rma$x)
    
    Error in mean(nagy.rma$x): object 'nagy.rma' not found
    
    A <- nagy.rma$regression.results[4,2]
    
    Error in eval(expr, envir, enclos): object 'nagy.rma' not found
    
    B <- nagy.rma$regression.results[4,3]
    
    Error in eval(expr, envir, enclos): object 'nagy.rma' not found
    
    B1 <- nagy.rma$confidence.intervals[4, 4]
    
    Error in eval(expr, envir, enclos): object 'nagy.rma' not found
    
    A1 <- centr.y - B1 * centr.x
    
    Error in eval(expr, envir, enclos): object 'centr.y' not found
    
    B2 <- nagy.rma$confidence.intervals[4, 5]
    
    Error in eval(expr, envir, enclos): object 'nagy.rma' not found
    
    A2 <- centr.y - B2 * centr.x
    
    Error in eval(expr, envir, enclos): object 'centr.y' not found
    
    lines(10^(A+(log10(xs)*B))~xs, col="red")
    
    Error in eval(predvars, data, env): object 'A' not found
    
    lines(10^(A1+(log10(xs)*B1))~xs, col="red",lty=2)
    
    Error in eval(predvars, data, env): object 'A1' not found
    
    lines(10^(A2+(log10(xs)*B2))~xs, col="red",lty=2)
    
    Error in eval(predvars, data, env): object 'A2' not found
    
    #put an 'L'-shaped box around the plot
    box(bty="l")
    
    plot of chunk Q4-6a
    #reset the parameters
    par(opar)
    
    Show code
    library(ggplot2)
    library(lmodel2)
    
    Error in library(lmodel2): there is no package called 'lmodel2'
    
    dat <- nagy
    dat$FMR <- log10(dat$FMR)
    dat$Mass <- log10(dat$MASS)
    mod.mam <- lmodel2(FMR ~ Mass, data=subset(dat, Class=="Mammal"),"interval", "interval", 99)
    
    Error in lmodel2(FMR ~ Mass, data = subset(dat, Class == "Mammal"), "interval", : could not find function "lmodel2"
    
    mod.rep <- lmodel2(FMR ~ Mass, data=subset(dat, Class=="Reptiles"),"interval", "interval", 99)
    
    Error in lmodel2(FMR ~ Mass, data = subset(dat, Class == "Reptiles"), : could not find function "lmodel2"
    
    mod.ave <- lmodel2(FMR ~ Mass, data=subset(dat, Class=="Aves"),"interval", "interval", 99)
    
    Error in lmodel2(FMR ~ Mass, data = subset(dat, Class == "Aves"), "interval", : could not find function "lmodel2"
    
    #Extract parameters and confidence bands
    lm2.params <- function(mod,method='RMA') {
       y <- mod$y
       x <- mod$x
       centr.y <- mean(y)
       centr.x <- mean(x)
       row <- which(mod$regression.results == method)
       a <- mod$regression.results[row, 2]
       b <- mod$regression.results[row, 3]
       b1 <- mod$confidence.intervals[row, 4]
       a1 <- centr.y - b1 * centr.x
       b2 <- mod$confidence.intervals[row, 5]
       a2 <- centr.y - b2 * centr.x
       data.frame(a,b,a1,b1,a2,b2)
    }
    
    reg <- data.frame(Taxa=c('Mammal','Reptiles','Aves'),
           rbind(lm2.params(mod.mam),
                 lm2.params(mod.rep),
                 lm2.params(mod.ave)))
    
    Error in lm2.params(mod.mam): object 'mod.mam' not found
    
    ggplot(nagy) +
      geom_point(aes(y=FMR, x=MASS, color=Class)) +
      geom_abline(data = reg, aes(intercept = a, slope = b, colour = Taxa))+
      geom_abline(data = reg, aes(intercept = a1, slope = b1, colour = Taxa), linetype='dashed')+
      geom_abline(data = reg, aes(intercept = a2, slope = b2, colour = Taxa), linetype='dashed')+
      scale_y_log10(expression(Field~Metabolic~Rate), labels=math_format(10^.x))+
      scale_x_log10(expression(Mass~(g)), labels=math_format(10^.x))+
      annotation_logticks()+
      theme_classic()+
      theme(legend.position=c(1,0), legend.justification=c(1,0))
    
    Error in fortify(data): object 'reg' not found
    



  Simple linear regression

Linear regression analysis explores the linear relationship between a continuous response variable (DV) and a single continuous predictor variable (IV). A line of best fit (one that minimizes the sum of squares deviations between the observed y values and the predicted y values at each x) is fitted to the data to estimate the linear model.

As plot a) shows, a slope of zero would indicate no relationship - an increase in IV is not associated with a consistent change in DV.
Plot b) shows a positive relationship (positive slope) - an increase in IV is associated with an increase in DV.
Plot c) shows a negative relationship.

Testing whether the population slope (as estimated from the sample slope) differs from zero is one way of evaluating the relationship. Regression analysis also determines how much of the total variability in the response variable can be explained by its linear relationship with IV, and how much remains unexplained.

The line of best fit (relating DV to IV) takes on the form of
y = mx + c
Response variable = (slope x predictor variable) + y-intercept
If all points fall exactly on the line, then this model (right-hand part of equation) explains all of the variability in the response variable. That is, all variation in DV can be explained by differences in IV. Usually, however, the model does not explain all of the variability in the response variable (due to natural variation), some is left unexplained (referred to as error).

Therefore the linear model takes the form of;
Response variable = model + error.
A high proportion of variability explained relative to unexplained (error) is suggestive of a relationship between response and predictor variables.
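To make the 'model + error' decomposition concrete, the following sketch simulates data from a known linear model (the parameter values 2 and 0.5 are made up for illustration), adds normally distributed error, and then recovers the parameters with lm():

> set.seed(1)
> IV <- 1:20
> DV <- 2 + 0.5*IV + rnorm(20, sd=1)  # response = model (2 + 0.5*IV) + error
> coef(lm(DV ~ IV))                   # estimates should be close to 2 and 0.5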

For example, a mammalogist was investigating the effects of tooth wear on the amount of time free ranging koalas spend feeding per 24 h. Therefore, the amount of time spent feeding per 24 h was measured for a number of koalas that varied in their degree of tooth wear (ranging from unworn through to worn). Time spent feeding was the response variable and degree of tooth wear was the predictor variable.

End of instructions

  Reading data into R

Ensure that the working directory is pointing to the path containing the file to be imported before proceeding.
To import data into R, we read the contents of a file into a data frame. The general format of the command for reading data into a data frame is

> name <- read.table('filename.csv', header=T, sep=',', row.names=column, strip.white=T)

where name is a name you wish the data frame to be referred to as, filename.csv is the name of the csv file that you created in Excel and column is the number of the column that had row labels (if there were any). The argument header=T indicates that the variable (vector) names will be created from the names supplied in the first row of the file. The argument sep=',' indicates that entries in the file are separated by a comma (hence a comma delimited file). If the data file does not contain row labels, or you are not sure whether it does, it is best to omit the row.names=column argument. The strip.white=T argument ensures that no leading or trailing spaces are left in character names (these can be particularly nasty in categorical variables!).

As an example
phasmid <- read.table('phasmid.csv', header=T, sep=',', row.names=1, strip.white=T)

End of instructions

  Scatterplots

# load the 'car' library
> library(car)
# generate a scatterplot
> scatterplot(dv~iv,data=data)

where dv is the dependent variable, iv is the independent variable and data is the data frame (data set).

End of instructions

  Linear model fitting

> name.lm <- lm(dv~iv, data=data)

where name.lm is a name you provide for the output, dv is the name of the dependent variable, iv is the name of the independent variable and data is the name of the data frame (data set).

End of instructions

  Linear regression diagnostics

> plot(name.lm)

where name.lm is the name of a fitted linear model. You will be prompted to hit return to cycle through a series of four (or six) diagnostic plots.
  1. The first (and most important) of these is the Residual Plot. The residual plot graphs the residuals (differences between observed and expected values) against the fitted (expected) values. Residual plots assist in identifying issues with heterogeneity as well as non-linearity
  2. The normal Q-Q plot compares the standardized residuals to normal quantiles. This plot helps identify issues with normality
  3. The scale-location (standardized/normalized residuals) plot is similar to the residual plot described above except that it is on a standard scale that can permit automatic outlier detection
  4. The Residuals vs Leverage plot. Whilst the residuals are a measure of deviation in y-space, leverage is a measure of deviation in x-space (and thus each is a measure of outlierness). Combined together on this graph, along with contours of Cook's D (a measure of outlierness in both x- and y-space), this allows us to identify overly influential observations. Observations approaching the Cook's D contour of 1 are considered highly influential

> resid(name.lm)
> influence.measures(name.lm)

Collectively, these two functions provide information about the relative influence of each observation on the final model fit. Ideally, all observations should have an equal influence on the fit. Outliers are observations that have an over-representative impact on fitted models by pulling trends towards themselves to a greater degree than other observations (and thus biasing outcomes).

The first function lists the residuals corresponding to each of the observations in the data set (from which the model was fit). A residual measures the difference between an observation's observed value and the value that would be expected from the trendline (fitted model). Therefore, the residual value of any observation is a measure of how much of an outlier the observation is in the y-axis direction.

The second function produces a matrix whose rows correspond to each of the observations in the data set (from which the model was fit). The two important columns are the ones labelled 'hat' (the leverage values) and 'cook.d' (Cook's D values). Leverage measures how influential each observation was in the model fit in terms of its deviation in x-space (how unusual is the observation with respect to the other values of the predictor variable?). As with residual values, any leverage values that are much larger than the general set of leverage values should be further explored as they indicate potential outliers. Cook's D combines residual and leverage values into a single value that measures how influential each observation is in both x-space and y-space. As a rule of thumb, Cook's D values approaching 1 should be examined as they are having a larger influence on the model fitting.
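The leverage and Cook's D values can also be extracted on their own, which makes screening against these rules of thumb easier (a sketch, assuming a fitted model called name.lm):

> hatvalues(name.lm)       # leverage for each observation
> cooks.distance(name.lm)  # Cook's D for each observation
# flag any observations whose Cook's D approaches or exceeds 1
> which(cooks.distance(name.lm) > 1)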

End of instructions

  Residual plot

There is a residual for each point. The residuals are the differences between the predicted Y-values (point on the least squares regression line) for each X-value and the observed Y-value for each X-value. A residual plot, plots the residuals (Y-axis) against the predicted Y-values (X-axis) Plot a) shows a uniform spread of data around the least squares regression line and thus there is an even scatter of points in the residual plot (Plot b). In Plot c) the deviations of points from the predicted Y-values become greater with increasing X-values and thus there is a wedge-shape pattern in the residuals. Such a wedge suggests a violation of the homogeneity of variance assumption, and thus the regression may be unreliable.
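Such a residual plot can be constructed directly from any fitted linear model (a sketch, assuming a fitted model called name.lm):

> plot(fitted(name.lm), resid(name.lm))
> abline(h=0, lty=2)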



End of instructions

  Regression output

#print a summary of the linear model coefficients
> summary(name.lm)
#print a summary of the ANOVA table
> anova(name.lm)

where name.lm is the name of a fitted linear model.

End of instructions

  Scatterplots

# construct the base plot without axes or labels or even points
> plot(dv~iv,data=data, ann=F, axes=F, type="n")
# put the points on as solid circles (point character 16)
> points(dv~iv,data=data, pch=16)
# draw the x-axis
> axis(1)
> mtext("X-axis label",1)
# draw the y-axis (labels rotated)
> axis(2,las=1)
> mtext("Y-axis label",2)
#put an 'L'-shaped box around the plot
> box(bty="l")

where dv is the dependent variable, iv is the independent variable and data is the data frame (data set).

End of instructions

  EDA - Linearity

The most common methods for analysing the relationship or association between variables assume that the variables are linearly related (or more correctly, that they do not covary according to some function other than linear). For example, to examine for the presence and strength of the relationship between two variables (such as those depicted below), it is a requirement of the more basic statistical tests that the variables not be related by any function other than a linear (straight line) function, if any function at all.

There is no evidence that the data represented in Figure (a) are related (covary) according to any function other than linear. Likewise, there is no evidence that the data represented in Figure (b) are related (covary) according to any function other than linear (if at all). However, there is strong evidence that the data represented in Figure (c) are not related linearly. Hence Figure (c) depicts evidence for a violation in the statistical necessity of linearity.


End of instructions

  EDA - Normal distribution

Parametric statistical hypothesis tests assume that the population measurements follow a specific distribution. Most commonly, statistical hypothesis tests assume that the population measurements are normally distributed (Exercise 4 highlights the specific reasoning for this).

While it is usually not possible to directly examine the distribution of measurements in the whole population to determine whether or not this requirement is satisfied (for the same reasons that it is not possible to directly measure population parameters such as the population mean), if the sampling is adequate (unbiased and sufficiently replicated), the distribution of measurements in a sample should reflect the population distribution. That is, a sample of measurements taken from a normally distributed population should itself be normally distributed.

Tests on data that do not meet the distributional requirements may not be reliable, and thus, it is essential that the distribution of data be examined routinely prior to any formal statistical analysis.
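The sample distribution can be examined quickly with a histogram or a normal quantile-quantile plot (a sketch, assuming a numeric vector of measurements called y):

> hist(y)
> qqnorm(y); qqline(y)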


End of instructions

  EDA - Homogeneity of variances for regression

The assumption of homogeneity of variance is equally important for regression analysis and, in particular, it is the prospect of a relationship between the mean and variance of the y-values across x-values that is of greatest concern. Strictly, the assumption is that the distribution of y-values at each x-value is equally varied and that there is no relationship between mean and variance. However, as we only have a single y-value for each x-value, it is difficult to determine whether the assumption of homogeneity of variance is likely to have been violated (the mean of one value is meaningless and variability cannot be assessed from a single value). The figure below depicts the ideal (and almost never realistic) situation in which there are multiple response variable (DV) observations for each of the levels of the predictor variable (IV). Boxplots are included that represent normality and variability.

Inferential statistics is based around repeatability and the likelihood of data re-occurrence. Consider the following scatterplot that has a regression line (linear smoother - line of best fit) fitted through the data (data points symbolized by letters).

Points that are close to the line (points A and B) of best fit (ie those points that are close to the predicted values) are likely to be highly repeatable (a repeat sampling effort is likely to yield a similar value). Conversely, points that are far from the line of best fit (such as points F and G) are likely to take on a larger variety of values.
The distance between a point (observed value) and the line of best fit (predicted value) is called a residual. A residual plot plots each of the residuals against the expected y-values. When there is a trend of increasing (or decreasing) spread of data along a regression line, the residuals will form a wedge-shaped pattern. Thus a wedge-shaped pattern in the residual plot suggests that the assumption of homogeneity of variance may have been violated.

Of course, it is also possible to assess the assumption of homogeneity of variance by simply viewing a scatterplot with a linear smoother (linear regression line) and determining whether there is a general increase or decrease in the distance that the values are from the line of best fit.


End of instructions

  Model I vs Model II Regression

Model I regression (fixed X regression)

The purpose of simple linear regression analysis is to estimate the slope and y-intercept of a straight line through a data set. The line represents the expected (and predicted) average values of X and Y in the population, based on the collected sample. The line (and thus the slope and intercept) is typically the line that minimizes the vertical spread of the observations from the line. This process of fitting the line of best fit is called Ordinary Least Squares (OLS), and is depicted in figure OLS Y vs X (top left) below. This fitting process assumes that there is no uncertainty in the predictor (X) variable and therefore that the observed values of X are exactly as they are in the population (there is no measurement error nor natural variation).

Model I regression is the classical regression analysis and the form of analysis most commonly used. However, it is only a valid method of estimating the model parameters (slope and intercept) when the levels of the predictor variable are not measured, but rather are set by the researcher. For example, if we were interested in the effects of temperature on the growth rate of plant seedlings, we could analyse the data using model I regression provided the different temperatures were specifically set by us (eg. controlling temperatures in green houses or growth cabinets). If on the other hand, the plants had been grown under a range of temperatures that were not specifically set by us and we had simply measured what the temperatures were, model I regression would not be appropriate.

Note that if instead of regressing Y against X we regressed X against Y, the slope of the line of best fit would be different since the line is now the one that minimizes the horizontal spread of the observations from the line (see figure OLS X vs Y - top right). The OLS regressions of Y on X and X on Y represent the two extremes of the population slope estimates as they depict the situations in which all of the variation is in the vertical and horizontal axes respectively. Typically, we might expect some uncertainty in both X and Y variables, and thus the true trend line should lie somewhere between these two extremes (see the bottom right figure below).
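The two OLS extremes can be demonstrated in a couple of lines; note that the slope from regressing X on Y must be inverted to express it on the Y-versus-X scale (a sketch using simulated data):

> set.seed(2)
> x <- rnorm(50)
> y <- 0.6*x + rnorm(50)
> coef(lm(y ~ x))['x']    # OLS slope of Y on X
> 1/coef(lm(x ~ y))['y']  # OLS slope of X on Y, re-expressed on the Y vs X scale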


Model II regression (random X regression)

Model II regression estimates the line of best fit by incorporating uncertainty in both Y and X variables, and is therefore more suitable for estimating model parameters (slope and intercept) from data in which both variables have been measured. There are at least three forms of model II regression and these differ by what assumptions they place on the relative degree of uncertainty in Y and X variables

The bottom right figure depicts each of the regression lines through a data set. Note the following:
  1. All estimated lines pass through a central point. This point is the mean value of Y and the mean value of X. The different regression fitting procedures essentially just rotate the line of best fit (major axis) through the data set. They differ in the degree to which the major axis is rotated.
  2. That the OLS lines form the extreme estimated lines of best fit.
  3. The RMA line is exactly half way between the two OLS lines
  4. Since for this example the Y and X variables were measured on substantially different scales (where the degree of uncertainty in X is far greater than the degree of uncertainty in Y), the MA regression fitting procedure is equivalent to OLS X vs Y (it assumes all uncertainty is in X and none in Y)

Implications

Given that most instances of regression in biology are actually more suitable for model II regression, why are examples of model II in the literature so rare and why do most statistical software packages default to model I (OLS) regression procedures (most do not even offer model II procedures)? Is it that biologists are ignorant or slack? It turns out that as far as the main hypothesis of interest is concerned (that the population slope equals zero), it does not matter - RMA and OLS procedures give the same p-values (p-values cannot easily be computed for MA and ranged MA procedures).

If the purpose of the regression is to generate a model that can be used for the purpose of prediction (predicting new values of the dependent variable from new values of the independent variable), then only OLS fitted models are valid. The reason for this is that the new independent value represents a theoretical fixed value with no uncertainty. This new value must be used in a model that assumed no uncertainty in the independent (predictor) variable. The OLS procedure is the only one that minimizes the vertical spread of the observations from the line.

Therefore, if regression analysis is being performed for the purpose of investigating whether there is a relationship between two continuous variables and/or to generate a predictive model, then model I (OLS) regression is perfectly valid. Note, however, that if slopes and intercepts are reported, it is probably worth presenting both model I and model II slopes.

If on the other hand the nature of the relationship is important, and therefore it is important to get estimates of the slope and y-intercept that best represent the population relationship, then it is necessary to make the model II corrections. Good examples of where this is important are: As a final point, if the R2 value is very high, then estimated slopes will be very similar, irrespective of which line fitting procedure is used!
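That final point can be checked directly, since the reduced major axis (geometric mean) slope is simply the OLS slope divided by the correlation coefficient; when the correlation (and hence R2) is very high the two estimates all but coincide (a sketch using simulated data):

> set.seed(3)
> x <- rnorm(100)
> y <- x + rnorm(100, sd=0.1)             # very tight relationship, R2 close to 1
> b.ols <- coef(lm(y ~ x))['x']           # OLS slope
> b.rma <- sign(cor(x, y)) * sd(y)/sd(x)  # RMA (geometric mean) slope
> c(OLS = b.ols, RMA = b.rma)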

End of instructions

  R transformations

The following table illustrates the use of common data transformation functions on a single numeric vector (old_var). In each case a new vector is created (new_var).

Transformation    Syntax
loge

> new_var <- log(old_var)

log10

> new_var <- log10(old_var)

square root

> new_var <- sqrt(old_var)

arcsin

> new_var <- asin(sqrt(old_var))

scale (mean=0, unit variance)

> new_var <- scale(old_var)

where old_var is the name of the original variable that requires transformation and new_var is the name of the new variable (vector) that is to contain the transformed data. Note that the original vector (variable) is not altered.

End of instructions

  R2 value

The R2 value is the proportion of the total variability in the dependent variable that can be explained by its linear relationship with the independent variable, and thus a measure of the strength of the relationship. The R2 value ranges from 0 (no linear relationship) to 1 (a perfect linear relationship). The R2 value for the data represented in the following relationship would be 1 - that is, all of the variation in DV can be explained by a change in IV.

The R2 value for the data represented in the next relationship, however, would be substantially less than 1 - whilst some of the variation in DV can be explained by a change in IV, changes in IV alone do not account for all the changes in DV.
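In R, the R2 value is reported by summary() and can also be extracted directly (assuming a fitted model called name.lm):

> summary(name.lm)$r.squared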


End of instructions

  Predicting new values

Essentially you need to reconstruct the linear equation (y = mx + c) that relates the dependent variable to the independent variable (including the estimated slope and y-intercept), substitute the new x value, and solve for y. Remember that if any transformations were used, then the transformation functions (e.g. log) need to be incorporated into the linear equation.

Along with calculating a predicted value, it is often useful to obtain 'prediction' or 'confidence' intervals around this prediction. Whilst the terms 'prediction' and 'confidence' are often thought of as synonymous, there are important differences.
'Confidence' intervals (around a prediction) pertain to predicted future POPULATION means. That is, for a given X, if we were to collect a sample of observations, 95% of intervals of the stated size are likely to contain the mean of this sample.
'Prediction' intervals (around a prediction) pertain to predicted future single observations. For a given X, if we collect a single observation, 95% of intervals of the stated size are likely to contain this observation.

Prediction intervals are thus larger than confidence intervals. As we are usually interested in population trends, we are typically interested in 'confidence' intervals. It should be noted that, technically, these confidence and prediction intervals assume that there is no uncertainty in the X (predictor) variable - an assumption that is rarely satisfied.
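Both types of interval are available from predict(); remember to back-transform the results if the model was fitted to transformed data (a sketch, assuming a fitted model called name.lm, a predictor called iv and a new predictor value newX):

> predict(name.lm, newdata=data.frame(iv=newX), interval='confidence')
> predict(name.lm, newdata=data.frame(iv=newX), interval='prediction')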

End of instructions

  Model II regression analysis

# make sure that the biology library is loaded
> library(biology)
# fit the model II regression
> summary(lm.II(log10(INDIV)~log10(AREA), data, type="RMA"))

where log10(INDIV) is the name of the dependent (response) variable, log10(AREA) is the name of the independent (predictor) variable, data is the name of the dataset and type="RMA" indicates the form of regression to fit. Other types possible are; "MA" and "OLS".
Or alternatively,

# make sure that the lmodel2 library is loaded
> library(lmodel2)
# fit the model II regression
> lmodel2(log10(INDIV)~log10(AREA), data)


End of instructions

  Raw or transformed data?

Generally it is better to plot original (untransformed) data in any graphical summaries. Firstly, this allows people to examine the results without having to make any mental back-transformations. Secondly, transformations are usually only performed so as to satisfy the assumptions of a statistical test. A graph is not a statistical test. Rather, it is a summary, and as such has no such assumptions.

A good compromise is to plot the raw data on transformed axes. This enables the magnitude of values to be examined, and yet also reflects the nature of the statistical model fitted.


End of instructions