Workshop 11.4a - Logistic regression

23 April 2011

Basic logistic regression references

  • Logan (2010) - Chpt 16-17
  • Quinn & Keough (2002) - Chpt 13-14

Logistic regression

Polis et al. (1998) were interested in modelling the presence/absence of lizards (Uta sp.) against the perimeter to area ratio of 19 islands in the Gulf of California.

Download Polis data set
Format of polis.csv data file
ISLAND      RATIO  PA
Bota        15.41  1
Cabeza       5.63  1
Cerraja     25.92  1
Coronadito  15.17  0
...           ...  ...

ISLAND  Categorical listing of the name of the 19 islands used - variable not used in analysis.
RATIO   Ratio of perimeter to area of the island.
PA      Presence (1) or absence (0) of Uta lizards on island.

Uta lizard

Open the polis data file (HINT).
Show code
> polis <- read.table("../downloads/data/polis.csv", header = T, sep = ",", 
+     strip.white = T)
> head(polis)
      ISLAND RATIO PA
1       Bota 15.41  1
2     Cabeza  5.63  1
3    Cerraja 25.92  1
4 Coronadito 15.17  0
5     Flecha 13.04  1
6   Gemelose 18.85  0

Presence/absence data are clearly binary, as an observation can only be in one of two states (1 or 0, present or absent). Modelling the linear (systematic) component against a normal stochastic component (that is, expecting the residuals to follow a normal distribution) is clearly inappropriate. Instead, we could use a binomial distribution with either a logit link or a probit link. For this example, we will use the logit link and thus model on the log odds scale. In a later example, we will use the probit link.
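
To make the link function concrete, here is a minimal sketch (base R only; output omitted) of the logit transform and its inverse:
Show code
> # logit: probability -> log odds
> logit <- function(p) log(p/(1 - p))
> # inverse logit: log odds -> probability
> inv.logit <- function(eta) exp(eta)/(1 + exp(eta))
> logit(0.5)  # 0: a probability of 0.5 corresponds to even odds
> inv.logit(0)  # 0.5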

  1. What is the null hypothesis of interest?
  2. Prior to fitting the model, we need to also be sure of the structure of the systematic linear component. For example, is the relationship linear on the log odds scale?
    Show code
    > plot(PA ~ RATIO, data = polis)
    > with(polis, lines(lowess(PA ~ RATIO)))
    > 
    > #check by fitting cubic spline
    > library(splines)
    > polis.glmS <- glm(PA ~ ns(RATIO), data = polis, family = "binomial")
    > summary(polis.glmS)
    
    Call:
    glm(formula = PA ~ ns(RATIO), family = "binomial", data = polis)
    
    Deviance Residuals: 
       Min      1Q  Median      3Q     Max  
    -1.607  -0.638   0.237   0.433   2.099  
    
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)  
    (Intercept)     3.56       1.68    2.12    0.034 *
    ns(RATIO)     -17.24       7.89   -2.18    0.029 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 26.287  on 18  degrees of freedom
    Residual deviance: 14.221  on 17  degrees of freedom
    AIC: 18.22
    
    Number of Fisher Scoring iterations: 6
    
    
    > xs <- with(polis, seq(min(RATIO), max(RATIO), l = 100))
    > ys <- predict(polis.glmS, newdata = data.frame(RATIO = xs), type = "response")
    > lines(xs, ys, col = "red")
    
    > 
    > # The trend is more or less linear. If anything it is slightly
    > #   sigmoidal, but so long as it is linear within the bulk of the
    > #   data, linearity is a reasonable assumption.
    
  3. Fitting the generalized linear model with a binomial error distribution (logit link)
    (HINT).
    Show code
    > polis.glm <- glm(PA ~ RATIO, family = binomial, data = polis)
    
    1. Check the model for lack of fit via:
      • Pearson χ2
        Show code
        > polis.resid <- sum(resid(polis.glm, type = "pearson")^2)
        > 1 - pchisq(polis.resid, polis.glm$df.resid)
        
        [1] 0.5715
        
        > #No evidence of a lack of fit
        
      • Deviance (G2)
        Show code
        > 1 - pchisq(polis.glm$deviance, polis.glm$df.resid)
        
        [1] 0.6514
        
        > #No evidence of a lack of fit
        
    2. Estimate the dispersion parameter to evaluate over dispersion in the fitted model
      • Via Pearson residuals
        Show code
        > polis.resid/polis.glm$df.resid
        
        [1] 0.9019
        
        > #No evidence of over dispersion
        
      • Via deviance
        Show code
        > polis.glm$deviance/polis.glm$df.resid
        
        [1] 0.8365
        
        > #No evidence of over dispersion
        
    3. Compare the expected number of zeros to the actual number of zeros to determine whether the model might be zero inflated.
      • Calculate the proportion of zeros in the data
        Show code
        > polis.tab <- table(polis$PA == 0)
        > polis.tab/sum(polis.tab)
        
         FALSE   TRUE 
        0.5263 0.4737 
        
      • Assuming a probability of presence of about 0.5 for each island, you would expect roughly 50:50 (50%) zeros and ones, which matches the observed proportions. So the data are not zero inflated (a model-based check is sketched in the next bullet).
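      • A more direct, model-based check (a sketch using the model fitted in step 3; output omitted) compares the number of zeros the model expects against the number observed:
        Show code
        > # expected number of zeros under the fitted model
        > sum(1 - fitted(polis.glm))
        > # observed number of zeros
        > sum(polis$PA == 0)
        > # similar values indicate no zero inflation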
  4. As the diagnostics do not indicate any issues with the data or the fitted model, identify and interpret the following (HINT);
    Show code
    > polis.glm <- glm(PA ~ RATIO, family = binomial, data = polis)
    > summary(polis.glm)
    
    Call:
    glm(formula = PA ~ RATIO, family = binomial, data = polis)
    
    Deviance Residuals: 
       Min      1Q  Median      3Q     Max  
    -1.607  -0.638   0.237   0.433   2.099  
    
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)  
    (Intercept)    3.606      1.695    2.13    0.033 *
    RATIO         -0.220      0.101   -2.18    0.029 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 26.287  on 18  degrees of freedom
    Residual deviance: 14.221  on 17  degrees of freedom
    AIC: 18.22
    
    Number of Fisher Scoring iterations: 6
    
    
    1. sample constant (β0)
    2. sample slope (β1)
    3. Wald statistic (z value) for main H0
    4. P-value for main H0
    5. r2 value (HINT)
      Show code
      > 1 - (polis.glm$deviance/polis.glm$null.deviance)
      
      [1] 0.459
      
  5. Another way to test the fit of the model, and thus test the H0 that β1 = 0, is to compare the fit of the full model to that of the reduced model via analysis of deviance (ANOVA). Perform this ANOVA (HINT) and identify the following
    Show code
    > anova(polis.glm, test = "Chisq")
    
    Analysis of Deviance Table
    
    Model: binomial, link: logit
    
    Response: PA
    
    Terms added sequentially (first to last)
    
    
          Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
    NULL                     18       26.3             
    RATIO  1     12.1        17       14.2  0.00051 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    1. G2 statistic
    2. df
    3. P value
  6. Construct a scatterplot of the presence/absence of Uta lizards against perimeter to area ratio for the 19 islands (HINT). Add to this plot the predicted probabilities of occurrence from the logistic regression, along with a ±1 standard error band (an approximate 68% confidence interval). (HINT)
    Show code
    > par(mar = c(4, 5, 0, 0))
    > plot(PA ~ RATIO, data = polis, type = "n", ann = F, axes = F)
    > points(PA ~ RATIO, data = polis, pch = 16)
    > xs <- seq(0, 70, l = 1000)
    > ys <- predict(polis.glm, newdata = data.frame(RATIO = xs), type = "response", 
    +     se = T)
    > points(ys$fit ~ xs, col = "black", type = "l")
    > lines(ys$fit - 1 * ys$se.fit ~ xs, col = "black", type = "l", lty = 2)
    > lines(ys$fit + 1 * ys$se.fit ~ xs, col = "black", type = "l", lty = 2)
    > axis(1)
    > mtext("Island perimeter to area ratio", 1, cex = 1.5, line = 3)
    > axis(2, las = 2)
    > mtext(expression(paste(italic(Uta), " lizard presence/absence")), 
    +     2, cex = 1.5, line = 3)
    > box(bty = "l")
    
  7. Calculate the LD50 (in this case, the perimeter to area ratio with a predicted probability of 0.5) from the fitted model (HINT).
    Show code
    > -polis.glm$coef[1]/polis.glm$coef[2]
    
    (Intercept) 
          16.42 
    
    Islands with a perimeter to area ratio above this value are predicted to lack lizards, and islands with a ratio below it are expected to have lizards. (A cross-check using MASS is sketched below.)
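    As a cross-check (a sketch assuming the MASS package is available; output omitted), dose.p() recovers the same LD50 together with a standard error:
    Show code
    > library(MASS)
    > # RATIO at which the predicted probability of presence is 0.5, with SE
    > dose.p(polis.glm, p = 0.5)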
  8. Examine the odds ratio for the occurrence of Uta lizards.
    Show code
    > library(biology)
    > odds.ratio(polis.glm)
    
          Odds ratio Lower 95% CI Upper 95% CI
    RATIO     0.8029       0.6593       0.9777
    
    > #the chances of Uta lizards being present on an island declines by
    > #   approximately 20% (a change factor of 0.803) for every unit increase
    > #   in perimeter to area ratio.
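    If the biology package is unavailable, the same quantities can be approximated in base R (a sketch; confint() profiles the likelihood for a glm, so the bounds may differ slightly from those above):
    Show code
    > # odds ratio for RATIO and a profile-likelihood 95% CI
    > exp(coef(polis.glm)["RATIO"])
    > exp(confint(polis.glm, "RATIO"))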
    

Logistic regression - grouped binary data

Bliss (1935) examined the toxic effects of gaseous carbon disulphide on the survival of flour beetles. He presented a number of datasets and associated analyses as a means to demonstrate models for calculating dosage curves.

In the first dataset, rather than indicating whether any individual beetle was dead or alive, Bliss (1935) recorded the number of dead and alive beetles grouped by gaseous carbon disulphide concentration.

Download the Bliss1 data set
Format of bliss1.csv data file
DEAD  ALIVE  CONC
2     28     0
8     22     1
15    15     2
23     7     3
27     3     4

DEAD   The number of dead flour beetles after 5 hours exposure to carbon disulphide gas.
ALIVE  The number of live flour beetles after 5 hours exposure to carbon disulphide gas.
CONC   The concentration of gaseous carbon disulphide (in log2 units).

Flour beetle

Open the bliss1 data file (HINT).
Show code
> bliss <- read.table("../downloads/data/bliss1.csv", header = T, sep = ",", 
+     strip.white = T)
> head(bliss)
  dead alive conc
1    2    28    0
2    8    22    1
3   15    15    2
4   23     7    3
5   27     3    4
  1. This is another example of clearly binary data (albeit in a different form). There are three obvious link functions worth trying
    1. logit link (the default)
      Show code
      > bliss.glm <- glm(cbind(dead, alive) ~ conc, data = bliss, family = "binomial")
      > summary(bliss.glm)
      
      Call:
      glm(formula = cbind(dead, alive) ~ conc, family = "binomial", 
          data = bliss)
      
      Deviance Residuals: 
            1        2        3        4        5  
      -0.4510   0.3597   0.0000   0.0643  -0.2045  
      
      Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
      (Intercept)   -2.324      0.418   -5.56  2.7e-08 ***
      conc           1.162      0.181    6.40  1.5e-10 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
      
      (Dispersion parameter for binomial family taken to be 1)
      
          Null deviance: 64.76327  on 4  degrees of freedom
      Residual deviance:  0.37875  on 3  degrees of freedom
      AIC: 20.85
      
      Number of Fisher Scoring iterations: 4
      
      
      > par(mfrow = c(2, 2))
      > plot(bliss.glm)
      
    2. probit
      Show code
      > bliss.glmP <- glm(cbind(dead, alive) ~ conc, data = bliss, family = binomial(link = "probit"))
      > summary(bliss.glmP)
      
      Call:
      glm(formula = cbind(dead, alive) ~ conc, family = binomial(link = "probit"), 
          data = bliss)
      
      Deviance Residuals: 
            1        2        3        4        5  
      -0.3586   0.2749   0.0189   0.1823  -0.2754  
      
      Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
      (Intercept)  -1.3771     0.2278   -6.05  1.5e-09 ***
      conc          0.6864     0.0968    7.09  1.3e-12 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
      
      (Dispersion parameter for binomial family taken to be 1)
      
          Null deviance: 64.76327  on 4  degrees of freedom
      Residual deviance:  0.31367  on 3  degrees of freedom
      AIC: 20.79
      
      Number of Fisher Scoring iterations: 4
      
      
      > par(mfrow = c(2, 2))
      > plot(bliss.glmP)
      
    3. complementary log-log
      Show code
      > bliss.glmC <- glm(cbind(dead, alive) ~ conc, data = bliss, family = binomial(link = "cloglog"))
      > summary(bliss.glmC)
      
      Call:
      glm(formula = cbind(dead, alive) ~ conc, family = binomial(link = "cloglog"), 
          data = bliss)
      
      Deviance Residuals: 
           1       2       3       4       5  
      -1.083   0.213   0.498   0.559  -0.672  
      
      Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
      (Intercept)   -1.994      0.313   -6.38  1.8e-10 ***
      conc           0.747      0.109    6.82  8.9e-12 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
      
      (Dispersion parameter for binomial family taken to be 1)
      
          Null deviance: 64.7633  on 4  degrees of freedom
      Residual deviance:  2.2305  on 3  degrees of freedom
      AIC: 22.71
      
      Number of Fisher Scoring iterations: 5
      
      
      > par(mfrow = c(2, 2))
      > plot(bliss.glmC)
      
  2. On the basis of AIC, as well as the residuals (which for the complementary log-log model imply non-linearity), a logit link would seem appropriate. (The AICs are tabulated directly in the sketch below.)
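    The three AICs quoted in the summaries above can also be tabulated directly:
    Show code
    > # side-by-side AIC comparison of the three candidate links
    > AIC(bliss.glm, bliss.glmP, bliss.glmC)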

  3. To confirm linearity on the log odds scale we can plot the empirical logit against concentration
    Show code
    > #examine linearity on the log odds scale
    > # we add 1/2 (a small value) in case all individuals are dead or all alive.
    > # This is called an empirical logit
    > plot(bliss$conc, log((bliss$dead + 1/2)/(bliss$alive + 1/2)), xlab = "concentration", 
    +     ylab = "logit")
    > abline(coef(bliss.glm), col = 2)
    
    Alternatively, we could superimpose a logistic curve onto the scatterplot of proportion dead against concentration
    Show code
    > plot(bliss$conc, bliss$dead/(bliss$dead + bliss$alive), xlab = "Concentration", 
    +     ylab = "probability")
    > #create an inverse logit function
    > inv.logit <- function(x) exp(coef(bliss.glm)[1] + coef(bliss.glm)[2] * 
    +     x)/(1 + exp(coef(bliss.glm)[1] + coef(bliss.glm)[2] * x))
    > #plot this function as a curve
    > curve(inv.logit, add = T, col = 2)
    
    > #seems to fit well
    
  4. Given that we have elected to use the logit-link model, calculate the odds ratio
    Show code
    > exp(bliss.glm$coef[2])
    
     conc 
    3.196 
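    A profile-likelihood interval can be attached in the same way as in the polis example (a sketch; output omitted):
    Show code
    > # 95% CI for the odds ratio on conc
    > exp(confint(bliss.glm, "conc"))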
    
  5. Let's also include a summary figure
    Show code
    > plot(dead/(dead + alive) ~ conc, data = bliss, type = "n", ann = F, 
    +     axes = F)
    > points(dead/(dead + alive) ~ conc, data = bliss, pch = 16)
    > xs <- seq(0, 4, l = 1000)
    > ys <- predict(bliss.glm, newdata = data.frame(conc = xs), type = "response", 
    +     se = T)
    > points(ys$fit ~ xs, col = "black", type = "l")
    > lines(ys$fit - 1 * ys$se.fit ~ xs, col = "black", type = "l", lty = 2)
    > lines(ys$fit + 1 * ys$se.fit ~ xs, col = "black", type = "l", lty = 2)
    > axis(1)
    > mtext("Carbon disulphide concentration", 1, cex = 1.5, line = 3)
    > axis(2, las = 2)
    > mtext("Probability of death", 2, cex = 1.5, line = 3)
    > box(bty = "l")
    

Logistic (cloglog) regression - grouped binary data

As a further demonstration of dosage curves, Bliss (1935) presented yet another dataset on the toxic effects of gaseous carbon disulphide on flour beetle mortality. This time, Bliss (1935) documented the number of individuals exposed and dead following five hours at each of 8 dosage concentrations.

Download the Bliss2 data set
Format of bliss2.csv data file
Dose    Exposed  Mortality
1.6907  59        6
1.7242  60       13
1.7552  62       18
1.7842  56       28
1.8113  63       52
1.8369  59       53
1.8610  62       61
1.8839  60       60

Dose       The concentration of gaseous carbon disulphide.
Exposed    The number of flour beetles exposed to carbon disulphide gas.
Mortality  The number of dead flour beetles after 5 hours exposure to gaseous carbon disulphide.

Flour beetle

Open the bliss2 data file (HINT).
Show code
> bliss2 <- read.table("../downloads/data/bliss2.csv", header = T, 
+     sep = ",", strip.white = T)
> head(bliss2)
   Dose Exposed Mortality
1 1.691      59         6
2 1.724      60        13
3 1.755      62        18
4 1.784      56        28
5 1.811      63        52
6 1.837      59        53
  1. Start by exploring a range of possible binary link functions
    1. logit - logistic (log odds) model
      \begin{align*} log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1Dose \end{align*}
      Show code
      > bliss2.glmL <- glm((Mortality/Exposed) ~ Dose, data = bliss2, family = "binomial", 
      +     weights = Exposed)
      
    2. probit - probability unit model - $\Phi$ is the cumulative distribution function (cdf) of the standard normal distribution, so the probit link is its inverse
      \begin{align*} \Phi^{-1}(p) = \beta_0 + \beta_1Dose, \quad\text{where } \Phi(z) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{z} exp\left(-\frac{1}{2}u^2\right)du \end{align*}
      Show code
      > bliss2.glmP <- glm((Mortality/Exposed) ~ Dose, data = bliss2, family = binomial(link = "probit"), 
      +     weights = Exposed)
      
    3. cloglog - complementary log-log model
      \begin{align*} log(-log(1-p))= \beta_0 + \beta_1Dose \end{align*}
      Show code
      > bliss2.glmC <- glm((Mortality/Exposed) ~ Dose, data = bliss2, family = binomial(link = "cloglog"), 
      +     weights = Exposed)
      
  2. Evaluate which of the above models fits the data best on the basis of
    1. Plotting the predicted trends
      Show code
      > #create the base scatterplot
      > plot((Mortality/Exposed) ~ Dose, data = bliss2)
      > #first plot a lowess smoother in black
      > with(bliss2, lines(lowess((Mortality/Exposed) ~ Dose)))
      > #create a sequence of new Dose values from which to predict mortality
      > xs <- with(bliss2, seq(min(Dose), max(Dose), l = 100))
      > 
      > # predict and plot mortality from the logit model
      > ys <- predict(bliss2.glmL, newdata = data.frame(Dose = xs), type = "response")
      > lines(xs, ys, col = "blue")
      > 
      > # predict and plot mortality from the probit model
      > ys <- predict(bliss2.glmP, newdata = data.frame(Dose = xs), type = "response")
      > lines(xs, ys, col = "red")
      > 
      > # predict and plot mortality from the complementary log-log model
      > ys <- predict(bliss2.glmC, newdata = data.frame(Dose = xs), type = "response")
      > lines(xs, ys, col = "green")
      
      > 
      > # Whilst either the logit or probit models are OK (and largely
      > #   indistinguishable with respect to fit), the cloglog is clearly a
      > #   better fit
      
    2. AIC
      Show code
      > #logit model
      > AIC(bliss2.glmL)
      
      [1] 15.23
      
      > library(AICcmodavg)
      > AICc(bliss2.glmL)
      
      [1] 43.83
      
      > #probit model
      > AIC(bliss2.glmP)
      
      [1] 14.12
      
      > AICc(bliss2.glmP)
      
      [1] 42.72
      
      > #complementary log-log model
      > AIC(bliss2.glmC)
      
      [1] 7.446
      
      > AICc(bliss2.glmC)
      
      [1] 36.04
      
      > # Irrespective of whether we adopt AIC or AICc (corrected for small sample
      > #   sizes), the cloglog is considered a better fit.
      
    Which link function is the better choice on this occasion?
  3. Explore the parameter estimates for the cloglog model
    Show code
    > summary(bliss2.glmC)
    
    Call:
    glm(formula = (Mortality/Exposed) ~ Dose, family = binomial(link = "cloglog"), 
        data = bliss2, weights = Exposed)
    
    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -0.8033  -0.5513   0.0309   0.3832   1.2888  
    
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
    (Intercept)   -39.57       3.24   -12.2   <2e-16 ***
    Dose           22.04       1.80    12.2   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 284.2024  on 7  degrees of freedom
    Residual deviance:   3.4464  on 6  degrees of freedom
    AIC: 33.64
    
    Number of Fisher Scoring iterations: 4
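    As with the polis LD50, the fitted dosage curve can be inverted (a sketch assuming the MASS package is available; dose.p() inverts the model's own link function, so it applies to the cloglog fit as readily as to a logit fit):
    Show code
    > library(MASS)
    > # dose at which predicted mortality is 0.5, with standard error
    > dose.p(bliss2.glmC, p = 0.5)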
    
    

Logistic regression - polynomial

As part of a Masters thesis on the distribution and habitat suitability of the endangered Corroboree frog, Hunter (2000) recorded the presence and absence of frogs in relation to a range of climatic factors (including mean minimum spring temperature).

Show code
> library(DAAG)
> head(frogs)
  pres.abs northing easting altitude distance NoOfPools NoOfSites avrain meanmin meanmax
2        1      115    1047     1500      500       232         3  155.0   3.567   14.00
3        1      110    1042     1520      250        66         5  157.7   3.467   13.80
4        1      112    1040     1540      250        32         5  159.7   3.400   13.60
5        1      109    1033     1590      250         9         5  165.0   3.200   13.17
6        1      109    1032     1590      250        67         5  165.0   3.200   13.17
7        1      106    1018     1600      500        12         4  167.3   3.133   13.07
Show code
> plot(pres.abs ~ meanmin, data = frogs, ylab = "Presence-absence", 
+     xlab = "Mean minimum Spring temperature")
> lines(lowess(frogs$pres.abs ~ frogs$meanmin), col = 2)
> #suggests possible quadratic - optimum temperature?
> 
> #we can also look at the structure of the model by plotting on a logit
> #   scale
> #after grouping the data first
> #so we need to create categories on the x-axis
> #Obviously, at the tails of a distribution, the counts diminish
> #   enormously.
> #it is recommended that categories at the tails are pooled
> meanmin.cuts <- cut(frogs$meanmin, quantile(frogs$meanmin, seq(0, 
+     1, 0.1)), include.lowest = TRUE)
> yi <- tapply(frogs$pres.abs, meanmin.cuts, sum)
> #calculate the empirical logit
> emp.logit <- log((yi + 1/2)/(table(meanmin.cuts) - yi + 1/2))
> #Get the interval midpoints
> #midpt of interval (a,b) is a+(b-a)/2
> midpt <- quantile(frogs$meanmin, seq(0, 1, 0.1))[1:10] + (quantile(frogs$meanmin, 
+     seq(0, 1, 0.1))[2:11] - quantile(frogs$meanmin, seq(0, 1, 0.1))[1:10])/2
> 
> plot(midpt, emp.logit, xlab = "Mean minimum Spring temperature", 
+     ylab = "empirical logit")
> lines(lowess(emp.logit ~ midpt), col = 2)
> rug(midpt)
> 
> #fit the polynomial model
> frog.glm <- glm(pres.abs ~ meanmin + I(meanmin^2), data = frogs, 
+     family = binomial)
> summary(frog.glm)
Call:
glm(formula = pres.abs ~ meanmin + I(meanmin^2), family = binomial, 
    data = frogs)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.366  -1.003  -0.463   1.058   2.341  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -21.175      5.281   -4.01  6.1e-05 ***
meanmin        11.665      3.190    3.66  0.00026 ***
I(meanmin^2)   -1.574      0.472   -3.34  0.00085 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 279.99  on 211  degrees of freedom
Residual deviance: 241.76  on 209  degrees of freedom
AIC: 247.8

Number of Fisher Scoring iterations: 5

> plot(frogs$meanmin, frogs$pres.abs, xlab = "Mean minimum Spring temperature", 
+     ylab = "presence/absence")
> #add linear model
> #phat<-function(x) exp(coef(out1)[1] + coef(out1)[2]*x)/
> #   (1+exp(coef(out1)[1] + coef(out1)[2]*x))
> #curve(phat, add=T, col=2)
> #add quadratic
> phat2 <- function(x) exp(coef(frog.glm)[1] + coef(frog.glm)[2] * 
+     x + coef(frog.glm)[3] * x^2)/(1 + exp(coef(frog.glm)[1] + coef(frog.glm)[2] * 
+     x + coef(frog.glm)[3] * x^2))
> curve(phat2, add = T, col = 4)
> #alternatively
> points(frogs$meanmin, predict(frog.glm, type = "response"), col = "red")
> #add lowess
> lines(lowess(frogs$pres.abs ~ frogs$meanmin), col = 1, lty = 2)
> legend(2.2, 0.9, c("linear", "quadratic", "lowess"), col = c(2, 4, 
+     1), lty = c(1, 1, 2), bty = "n", cex = 0.9)
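
Following up the "optimum temperature" suggestion from the lowess plot: on the logit scale the fitted quadratic peaks where its derivative is zero, so the temperature of maximum predicted presence can be sketched as:
Show code
> # vertex of the quadratic on the logit scale: -b1/(2*b2)
> -coef(frog.glm)[2]/(2 * coef(frog.glm)[3])
> # roughly 3.7, given the coefficient estimates above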

Probit regression

Same data as above
Show code
> polis.glmP <- glm(PA ~ RATIO, data = polis, family = binomial(link = probit))
> summary(polis.glmP)
Call:
glm(formula = PA ~ RATIO, family = binomial(link = probit), data = polis)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.611  -0.677   0.188   0.415   2.055  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   2.1345     0.8913    2.39    0.017 *
RATIO        -0.1275     0.0522   -2.44    0.015 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26.287  on 18  degrees of freedom
Residual deviance: 14.164  on 17  degrees of freedom
AIC: 18.16

Number of Fisher Scoring iterations: 7

> 
> 
> polis.glm <- glm(PA ~ RATIO, family = binomial, data = polis)
> 
> 
> par(mar = c(4, 5, 0, 0))
> plot(PA ~ RATIO, data = polis, type = "n", ann = F, axes = F)
> points(PA ~ RATIO, data = polis, pch = 16)
> xs <- seq(0, 70, l = 1000)
> ys <- predict(polis.glmP, newdata = data.frame(RATIO = xs), type = "response", 
+     se = T)
> points(ys$fit ~ xs, col = "black", type = "l")
> lines(ys$fit - 1 * ys$se.fit ~ xs, col = "black", type = "l", lty = 2)
> lines(ys$fit + 1 * ys$se.fit ~ xs, col = "black", type = "l", lty = 2)
> 
> ys <- predict(polis.glm, newdata = data.frame(RATIO = xs), type = "response", 
+     se = T)
> points(ys$fit ~ xs, col = "red", type = "l")
> lines(ys$fit - 1 * ys$se.fit ~ xs, col = "red", type = "l", lty = 2)
> lines(ys$fit + 1 * ys$se.fit ~ xs, col = "red", type = "l", lty = 2)
> 
> axis(1)
> mtext("Island perimeter to area ratio", 1, cex = 1.5, line = 3)
> axis(2, las = 2)
> mtext(expression(paste(italic(Uta), " lizard presence/absence")), 
+     2, cex = 1.5, line = 3)
> box(bty = "l")
> 
> #on the probit (z-score) scale, the predicted presence of Uta lizards
> #   declines by 0.128 for every 1 unit increase in perimeter to area
> #   ratio; the corresponding change in probability varies along the curve
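
One way to express that slope on the probability scale (a common convention, though not part of the original workshop output) is the average marginal effect: the coefficient multiplied by the standard normal density at each fitted linear predictor, averaged across observations.
Show code
> # average marginal effect of RATIO on the probability scale
> eta <- predict(polis.glmP, type = "link")
> mean(dnorm(eta)) * coef(polis.glmP)["RATIO"]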

  Reading data into R

Ensure that the working directory is pointing to the path containing the file to be imported before proceeding.
To import data into R, we read the contents of a file into a data frame. The general format of the command for reading data into a data frame is

> name <- read.table('filename.csv', header=T, sep=',', row.names=column, strip.white=T)

where name is a name you wish the data frame to be referred to as, filename.csv is the name of the csv file that you created in Excel, and column is the number of the column that had row labels (if there were any). The argument header=T indicates that the variable (vector) names will be created from the names supplied in the first row of the file. The argument sep=',' indicates that entries in the file are separated by a comma (hence a comma delimited file). If the data file does not contain row labels, or you are not sure whether it does, it is best to omit the row.names=column argument. The strip.white=T argument ensures that no leading or trailing spaces are left in character names (these can be particularly nasty in categorical variables!).

As an example
> phasmid <- read.table("phasmid.csv", header = T, sep = ",", row.names = 1, 
+     strip.white = T)

End of instructions