Workshop 11.3a - Contingency tables

23 April 2011

Basic χ2 references

  • Logan (2010) - Chpt 16-17
  • Quinn & Keough (2002) - Chpt 13-14

Contingency tables

Here is a modified example from Quinn and Keough (2002). Following fire, French and Westoby (1996) cross-classified plant species by two variables: whether they regenerated by seed only or vegetatively and whether they were dispersed by ant or vertebrate vector. The two variables could not be distinguished as response or predictor since regeneration mechanisms could just as conceivably affect dispersal mode as vice versa.

Download French data set
Format of french.csv data file
REGEN  DISP  COUNT
seed   ant   25
seed   vert  6
veg    ant   36
veg    vert  21

REGEN: Categorical listing of the plant's regeneration mode.
DISP: Categorical listing of the plant's dispersal mode.
COUNT: The observed number of plant species in each category combination.
                   DISPERSAL MODE
REGENERATION MODE  Ant   Vertebrate
Seed only          25    6
Vegetative         36    21

Ant carrying a seed

Open the french data file. HINT.

> french <- read.table("../downloads/data/french.csv", header = T, 
+     sep = ",", strip.white = T)
> head(french)
  regen disp count
1  seed  ant    25
2  seed vert     6
3   veg  ant    36
4   veg vert    21
  1. What null hypothesis is being tested by this test?
  2. Generate a cross table out of the dataset in preparation for frequency analysis (HINT).
    > french.tab <- xtabs(count ~ regen + disp, data = french)
    > french.tab
    
          disp
    regen  ant vert
      seed  25    6
      veg   36   21
    
  3. Fit a 2 x 2 (two way) contingency table (HINT), and explore the main assumption of the test by examining the expected frequencies (HINT).
    > french.x2 <- chisq.test(french.tab, correct = F)
    > french.x2$exp
    
          disp
    regen    ant   vert
      seed 21.49  9.511
      veg  39.51 17.489
    
  4. If the assumption is OK, test this null hypothesis and identify the following.
    > french.x2
    
    	Pearson's Chi-squared test
    
    data:  french.tab 
    X-squared = 2.887, df = 1, p-value = 0.08929
    
    
    1. X2 statistic
    2. df
    3. P value
  5. Calculate the odds ratio (odds of vegetative rather than seed-only regeneration for vertebrate-dispersed vs ant-dispersed species)
    > library(epitools)
    > oddsratio(french.tab)
    
    $data
           disp
    regen   ant vert Total
      seed   25    6    31
      veg    36   21    57
      Total  61   27    88
    
    $measure
          odds ratio with 95% C.I.
    regen  estimate  lower upper
      seed    1.000     NA    NA
      veg     2.375 0.8667 7.367
    
    $p.value
          two-sided
    regen  midp.exact fisher.exact chi.square
      seed         NA           NA         NA
      veg     0.09439      0.09832    0.08929
    
    $correction
    [1] FALSE
    
    attr(,"method")
    [1] "median-unbiased estimate & mid-p exact CI"
    
  6. What are your conclusions (statistical and biological)?
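
The expected frequencies examined in step 3 and the X2 statistic in step 4 can also be worked out by hand, which may help when interpreting the output. The following is a minimal sketch, assuming the counts are typed in directly rather than read from french.csv; the object names (obs, expd, x2) are illustrative only.

```r
# Observed counts, entered by hand (same values as french.tab above)
obs <- matrix(c(25, 36, 6, 21), nrow = 2,
              dimnames = list(regen = c("seed", "veg"),
                              disp = c("ant", "vert")))

# Expected frequency under independence: row total x column total / grand total
expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-square statistic and its degrees of freedom
x2 <- sum((obs - expd)^2 / expd)
df <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Upper-tail probability from the chi-square distribution
p <- pchisq(x2, df = df, lower.tail = FALSE)

round(expd, 2)
round(c(X2 = x2, df = df, P = p), 4)
```

These values should agree with french.x2$exp and with the printed test (X-squared = 2.887, df = 1).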

Contingency table

Arrington et al. (2002) examined the frequency with which African, Neotropical and North American fishes have empty stomachs and found that the mean percentage of empty stomachs was around 16.2%. As part of the investigation they were interested in whether the frequency of empty stomachs was related to dietary items. The data were separated into four major trophic classifications (detritivores, omnivores, invertivores, and piscivores) and whether the fish species had greater or less than 16.2% of individuals with empty stomachs. The number of fish species in each category combination was calculated and a subset of that (just the diurnal fish) is provided.

Download Arrington data set
Format of arrington.csv data file
STOMACH  TROPHIC
< 16.2   DET
..       ..
< 16.2   OMN
..       ..
< 16.2   PISC
..       ..
< 16.2   INV
..       ..

STOMACH: Categorical listing of the proportion of individuals in the species with empty stomachs (< 16.2% or > 16.2%).
TROPHIC: Categorical listing of the trophic classification (DET = detritivore, OMN = omnivore, INV = invertivore, PISC = piscivore).
                        % Stomachs empty
Trophic classification  < 16.2  > 16.2
DET                     18      4
OMN                     45      8
INV                     58      15
PISC                    16      34

Fish

Open the arrington data file (HINT).
> arrington <- read.table("../downloads/data/arrington.csv", header = T, 
+     sep = ",", strip.white = T)
> head(arrington)
  STOMACH TROPHIC
1   <16.2     DET
2   <16.2     DET
3   <16.2     DET
4   <16.2     DET
5   <16.2     DET
6   <16.2     DET

Note the format of the data file. Rather than including a compilation of the observed counts, this data file lists the categories for each individual. This example will demonstrate how to analyse two-way contingency tables from such data files. Each row of the data set represents a separate species of fish that is then cross categorised according to whether the proportion of individuals of that species with empty stomachs was higher or lower than the overall average (16.2%) and to what trophic group they belonged.

  1. Generate a cross table out of the raw data file in preparation for the contingency table (HINT).
    > arrington.tab <- table(arrington)
    > arrington.tab
    
           TROPHIC
    STOMACH DET INV OMN PISC
      <16.2  18  58  45   16
      >16.2   4  15   8   34
    
  2. Fit the model (HINT), test the assumptions (HINT) and, using a two-way contingency table, test the null hypothesis that the percentage of empty stomachs was independent of trophic classification (HINT). What would you conclude from the analysis?
    > arrington.x2 <- chisq.test(arrington.tab)
    > arrington.x2$exp
    
           TROPHIC
    STOMACH    DET   INV   OMN PISC
      <16.2 15.222 50.51 36.67 34.6
      >16.2  6.778 22.49 16.33 15.4
    
    > arrington.x2
    
    	Pearson's Chi-squared test
    
    data:  arrington.tab 
    X-squared = 43.83, df = 3, p-value = 1.636e-09
    
    
  3. Write the results out as though you were writing a research paper/thesis. For example (select the phrase that applies and fill in gaps with your results):
    The percentage of empty stomachs was (choose the correct option) trophic classification (X2 = , df = , P = ).
  4. Generate the residuals (HINT) associated with the above contingency test and complete the following table of standardized residuals.
    > arrington.x2$res
    
           TROPHIC
    STOMACH    DET    INV    OMN   PISC
      <16.2  0.712  1.054  1.375 -3.162
      >16.2 -1.067 -1.579 -2.061  4.738
    
             < 16.2%   > 16.2%
    DET
    OMN
    INV
    PISC
  5. Calculate the odds ratios for the different trophic levels
    > library(biology)
    > or <- NULL
    > nms <- colnames(arrington.tab)
    > for (i in 1:ncol(arrington.tab)) {
    +     for (j in 1:ncol(arrington.tab)) {
    +         if (i == j) 
    +             next
    +         or <- rbind(or, cbind(Comp1 = nms[i], Comp2 = nms[j], oddsratios(arrington.tab[, 
    +             c(i, j)])))
    +     }
    + }
    > or$Comp2s <- as.numeric(factor(as.character(or$Comp2)))
    > opar <- par(mar = c(5, 6, 1, 1))
    > plot(estimate ~ Comp2s, data = or, axes = F, ann = F, type = "n", 
    +     log = "y", ylim = c(min(lower), max(upper)), xlim = c(0, 5))
    > abline(h = 1, lty = 2)
    > with(subset(or, Comp1 == "DET"), arrows(Comp2s - 0.1, lower, Comp2s - 
    +     0.1, upper, code = 3, length = 0.1, ang = 90))
    > points(estimate ~ I(Comp2s - 0.1), data = subset(or, Comp1 == "DET"), 
    +     type = "p", pch = 22, bg = "white")
    > with(subset(or, Comp1 == "INV"), arrows(Comp2s - 0.05, lower, Comp2s - 
    +     0.05, upper, code = 3, length = 0.1, ang = 90))
    > points(estimate ~ I(Comp2s - 0.05), data = subset(or, Comp1 == "INV"), 
    +     type = "p", pch = 21, bg = "grey90")
    > with(subset(or, Comp1 == "OMN"), arrows(Comp2s + 0.05, lower, Comp2s + 
    +     0.05, upper, code = 3, length = 0.1, ang = 90))
    > points(estimate ~ I(Comp2s + 0.05), data = subset(or, Comp1 == "OMN"), 
    +     type = "p", pch = 21, bg = "grey50")
    > with(subset(or, Comp1 == "PISC"), arrows(Comp2s + 0.1, lower, Comp2s + 
    +     0.1, upper, code = 3, length = 0.1, ang = 90))
    > points(estimate ~ I(Comp2s + 0.1), data = subset(or, Comp1 == "PISC"), 
    +     type = "p", pch = 21, bg = "black")
    > axis(1, at = 1:4, labels = nms)
    > axis(2, las = 1)
    > mtext("Trophic level", 1, line = 3, cex = 1.5)
    > mtext("Odds ratio of empty stomachs by trophic level", 2, line = 3.5, 
    +     cex = 1.5)
    > legend("topleft", legend = nms, pch = c(22, 21, 21, 21), pt.bg = c("white", 
    +     "grey90", "grey50", "black"), bty = "n")
    > box(bty = "l")
    
    > par(opar)
    
  6. What further conclusions would you draw from the standardized residuals?
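
When working with the residuals in steps 4 and 6, it may help to know that chisq.test returns two residual components: `residuals` (Pearson residuals, (observed - expected)/sqrt(expected), which is what the abbreviation `$res` picks up) and `stdres` (residuals additionally scaled for the row and column margins). The following is a minimal sketch, assuming the counts are typed in by hand; the object names are illustrative.

```r
# Observed counts typed in by hand (same values as arrington.tab above)
obs <- matrix(c(18, 4, 58, 15, 45, 8, 16, 34), nrow = 2,
              dimnames = list(STOMACH = c("<16.2", ">16.2"),
                              TROPHIC = c("DET", "INV", "OMN", "PISC")))

fit <- chisq.test(obs)

# Pearson residuals by hand: (observed - expected) / sqrt(expected)
pearson <- (obs - fit$expected) / sqrt(fit$expected)

# These match the component that fit$res abbreviates (fit$residuals)
round(pearson, 3)
round(fit$residuals, 3)

# Residuals that also account for the row and column margins
round(fit$stdres, 3)
```

Cells with large positive values contain more species than expected under independence, and cells with large negative values contain fewer.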

Contingency tables

Here is an example (13.5) from Fowler, Cohen and Jarvis (1998). A field biologist collected leaf litter from 1 m² quadrats randomly located on the ground at night in two locations: one on clay soil and the other on chalk soil. Woodlice of two different species (Oniscus and Armadillidium, spelled armadilidium in the data file) were counted, and it is assumed that all woodlice undertake their nocturnal activities independently. The counts are given in the following contingency table.

Download Woodlice data set
Format of Woodlice data set
           WOODLICE SPECIES
SOIL TYPE  Oniscus  Armadilidium
Clay       14       6
Chalk      22       46

Woodlice

Open the woodlice data file. HINT.
> woodlice <- read.table("../downloads/data/woodlice.csv", header = T, 
+     sep = ",", strip.white = T)
> head(woodlice)
   SOIL      SPECIES COUNTS
1  Clay      oniscus     14
2  Clay armadilidium      6
3 Chalk      oniscus     22
4 Chalk armadilidium     46
  1. What null hypothesis is being tested by this test?
  2. Generate a cross table out of the dataset in preparation for frequency analysis (HINT).
    > woodlice.tab <- xtabs(COUNTS ~ SOIL + SPECIES, data = woodlice)
    > woodlice.tab
    
           SPECIES
    SOIL    armadilidium oniscus
      Chalk           46      22
      Clay             6      14
    
  3. Fit a 2 x 2 (two way) contingency table (HINT), and explore the main assumption of the test by examining the expected frequencies (HINT).
    > woodlice.x2 <- chisq.test(woodlice.tab, correct = F)
    > woodlice.x2$exp
    
           SPECIES
    SOIL    armadilidium oniscus
      Chalk        40.18  27.818
      Clay         11.82   8.182
    
  4. If the assumption is OK, test this null hypothesis (HINT) and identify the following.
    > woodlice.x2
    
    	Pearson's Chi-squared test
    
    data:  woodlice.tab 
    X-squared = 9.061, df = 1, p-value = 0.002611
    
    
    1. X2 statistic
    2. df
    3. P value
  5. Generate the residuals (HINT) associated with the above contingency test and complete the following table of standardized residuals.
    > woodlice.x2$res
    
           SPECIES
    SOIL    armadilidium oniscus
      Chalk       0.9179 -1.1031
      Clay       -1.6924  2.0341
    
             oniscus   armadilidium
    CLAY
    CHALK
  6. Calculate the odds ratio (of species presence) of clay vs chalk
    > oddsratio(woodlice.tab)
    
    $data
           SPECIES
    SOIL    armadilidium oniscus Total
      Chalk           46      22    68
      Clay             6      14    20
      Total           52      36    88
    
    $measure
           odds ratio with 95% C.I.
    SOIL    estimate lower upper
      Chalk    1.000    NA    NA
      Clay     4.725 1.642 15.22
    
    $p.value
           two-sided
    SOIL    midp.exact fisher.exact chi.square
      Chalk         NA           NA         NA
      Clay    0.003597     0.004024   0.002611
    
    $correction
    [1] FALSE
    
    attr(,"method")
    [1] "median-unbiased estimate & mid-p exact CI"
    
    > #the odds of a woodlouse being oniscus rather than armadilidium are about
    > #   4.7 times higher on clay than on chalk
    
  7. What are your conclusions (statistical and biological)?
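
The estimate printed by oddsratio() in step 6 is a median-unbiased estimate (see the "method" attribute in the output), so it differs slightly from the simple cross-product odds ratio. As a rough cross-check, the following sketch computes the cross-product estimate with a Wald interval on the log scale. This is a simpler large-sample approximation swapped in for illustration, not the method epitools used above, and the object names are illustrative.

```r
# Observed counts typed in by hand (same layout as woodlice.tab above)
obs <- matrix(c(46, 6, 22, 14), nrow = 2,
              dimnames = list(SOIL = c("Chalk", "Clay"),
                              SPECIES = c("armadilidium", "oniscus")))

# Odds of a woodlouse being oniscus rather than armadilidium, per soil type
odds <- obs[, "oniscus"] / obs[, "armadilidium"]

# Cross-product odds ratio: clay relative to chalk
or <- unname(odds["Clay"] / odds["Chalk"])

# Wald 95% interval on the log scale (a large-sample approximation)
se <- sqrt(sum(1 / obs))
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se)

round(c(estimate = or, lower = ci[1], upper = ci[2]), 3)
```

The cross-product estimate is (46 x 14)/(22 x 6), or about 4.88, which is close to but not identical to the 4.725 reported above.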

  Reading data into R

Ensure that the working directory is pointing to the path containing the file to be imported before proceeding.
To import data into R, we read the contents of a file into a data frame. The general format of the command for reading data into a data frame is

> name <- read.table('filename.csv', header=T, sep=',', row.names=column, strip.white=T)

where name is the name you wish to give the data frame, filename.csv is the name of the csv file that you created in Excel and column is the number of the column containing row labels (if there are any). The argument header=T indicates that the variable (vector) names will be created from the names supplied in the first row of the file. The argument sep=',' indicates that entries in the file are separated by commas (hence a comma delimited file). If the data file does not contain row labels, or you are not sure whether it does, it is best to omit the row.names=column argument. The strip.white=T argument ensures that no leading or trailing spaces are left in character values (these can be particularly nasty in categorical variables!).

As an example
> phasmid <- read.table("phasmid.csv", header = T, sep = ",", row.names = 1, 
+     strip.white = T)
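
To see what strip.white does in practice, here is a small self-contained sketch. It assumes a made-up temporary file whose contents mirror the french.csv listing, with stray spaces added deliberately; the file path and object names are illustrative. It also spells out TRUE rather than T, since T is an ordinary variable that can be reassigned.

```r
# Write a small comma-delimited file to a temporary location
# (contents mirror the french.csv listing, with stray spaces added on purpose)
csv.file <- tempfile(fileext = ".csv")
writeLines(c("regen,disp,count",
             "seed , ant,25",
             "seed,vert,6",
             "veg,ant,36",
             "veg,vert,21"), csv.file)

# Read it back; strip.white = TRUE trims the spaces around "seed " and " ant"
french <- read.table(csv.file, header = TRUE, sep = ",", strip.white = TRUE)

str(french)
unique(french$regen)
```

Without strip.white, "seed " and "seed" would be treated as two different categories.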

End of instructions

  Analysing frequencies

Analysis of frequencies is similar to Analysis of Variance (ANOVA) in some ways. Variables contain two or more classes that are defined from either natural categories or from a set of arbitrary class limits in a continuous variable. For example, the classes could be sexes (male and female) or color classes derived by splitting the light scale into a set of wavelength bands. Unlike ANOVA, in which an attribute (e.g. length) is measured for a set number of replicates and the means of different classes (categories) are compared, when analyzing frequencies, the number of replicates (observed) that fall into each of the defined classes are counted and these frequencies are compared to predicted (expected) frequencies.

Analysis of frequencies tests whether a sample of observations came from a population where the observed frequencies match some expected or theoretical frequencies. It is based on the chi-squared (X2) statistic, which approximately follows a chi-square distribution. A chi-square distribution is the distribution of a sum of squared values drawn from a standard normal distribution, so it is bounded at zero and has a long right tail.
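
For reference, the statistic can be written out as follows. The symbols are assumptions introduced here for illustration: o_i and e_i are the observed and expected frequencies in class i, and for an r x c contingency table R_i, C_j and N are the row totals, column totals and grand total.

```latex
% Chi-square statistic summed over all classes (or cells)
\chi^2 = \sum_{i} \frac{(o_i - e_i)^2}{e_i}

% Expected frequency of cell (i, j) under independence, and degrees of freedom
e_{ij} = \frac{R_i \, C_j}{N}, \qquad df = (r - 1)(c - 1)
```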

When there is only one categorical variable, expected frequencies are calculated from theoretical ratios. When there is more than one categorical variable, the data are arranged in a contingency table that reflects the cross-classification of sampling or experimental units into the classes of the two or more variables. The most common form of contingency table analysis (model I) tests a null hypothesis of independence between the categorical variables and is analogous to the test of an interaction in multifactorial ANOVA. Hence, frequency analysis provides hypothesis tests for solely categorical data. Analysis of frequencies provides a way to analyse data in which the predictor and response variables are both categorical. However, because the analysis itself does not distinguish variables as predictor or response, causality matters only for interpretation.



  Goodness of fit test

The goodness-of-fit test compares observed frequencies of each class within a single categorical variable to the frequencies expected of each of the classes on a theoretical basis. It tests the null hypothesis that the sample came from a population in which the observed frequencies match the expected frequencies.

For example, an ecologist investigating factors that might lead to deviations from a 1:1 offspring sex ratio, hypothesized that when the investment in one sex is considerably greater than in the other, the offspring sex ratio should be biased towards the less costly sex. He studied two species of wasps, one of which had males that were considerably larger (and thus more costly to produce) than females. For each species, he compared the offspring sex ratio to a 1:1 ratio using a goodness-of-fit test.


  R Goodness of fit test

> chisq.test(c(counts))
#OR
> chisq.test(c(counts),p=c(.5,.5))

where counts is a comma separated list of observed counts or frequencies and .5,.5 is a comma separated list of expected proportions (which must sum to 1). If p is omitted, all classes are assumed to be equally likely. For example

> chisq.test(c(40,50))
#OR
> chisq.test(c(40,50),p=c(.5,.5))
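
To make the calculation behind this call explicit, the following minimal sketch reproduces the statistic by hand for the same counts (40 and 50) under a 1:1 expectation. The object names are illustrative.

```r
# Observed counts from the example above
obs <- c(40, 50)

# Expected counts under a 1:1 ratio
p <- c(0.5, 0.5)
expd <- sum(obs) * p

# Chi-square statistic by hand, with (number of classes - 1) degrees of freedom
x2 <- sum((obs - expd)^2 / expd)
df <- length(obs) - 1

# The same test via chisq.test
fit <- chisq.test(obs, p = p)

c(by.hand = x2, chisq.test = unname(fit$statistic))
pchisq(x2, df = df, lower.tail = FALSE)
```

Here the expected counts are 45 and 45, so the statistic is (25 + 25)/45, or about 1.11, on 1 degree of freedom.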


  R Cross tables

> name.tab <- xtabs(counts ~ cat1 + cat2, data=data)

where name.tab is a name for the resulting cross table, counts are the observed counts/frequencies, cat1 and cat2 are the categorical variables and data is the name of the data frame (dataset)


  Contingency tables

Contingency tables are the cross-classification of two (or more) categorical variables. A 2 x 2 (two way) table takes on the following form:

             Variable 2
Variable 1   Class 1  Class 2
  Class 1    f11      f12
  Class 2    f21      f22

where f12 etc. are the frequencies in each cell (Variable 1 x Variable 2 combination).

Contingency table analysis tests the null hypothesis that the data came from a population in which variable 1 is independent of variable 2 and vice versa. That is, it is a test of independence.

For example, a plant ecologist examined the effect of heat on seed germination. A contingency test was used to determine whether germination (2 categories: yes or no) was independent of the heat treatment (2 categories: heated or not heated).


  R Contingency tables

> name.x2 <- chisq.test(name.tab, correct=F)

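where name.x2 is a name to provide for the fitted model and name.tab is the name of a cross table. The correct=F argument turns off Yates' continuity correction, which chisq.test would otherwise apply to 2 x 2 tables.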
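
The effect of the correct argument can be seen by fitting the same 2 x 2 table both ways. This is a minimal sketch, assuming the French and Westoby counts are typed in by hand; the object names (plain, yates) are illustrative.

```r
# 2 x 2 table of counts typed in by hand (same values as french.tab)
obs <- matrix(c(25, 36, 6, 21), nrow = 2,
              dimnames = list(regen = c("seed", "veg"),
                              disp = c("ant", "vert")))

# Without and with Yates' continuity correction
plain <- chisq.test(obs, correct = FALSE)
yates <- chisq.test(obs, correct = TRUE)

# The correction shrinks each |observed - expected| before squaring,
# so the corrected statistic is smaller and its P value larger
c(uncorrected = unname(plain$statistic), corrected = unname(yates$statistic))
c(uncorrected = plain$p.value, corrected = yates$p.value)
```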


  Contingency tables from raw data sets

> name.tab <- table(data)

where name.tab is a name to give the resulting cross table and data is the name of the data set that contains the raw data.
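
The two data layouts (one row per category combination with a count, versus one row per sampling unit) lead to the same cross table. The following sketch illustrates this with a small subset of the Arrington counts typed in by hand; expanding the counts with rep() is only a device for building a raw-format example, and the object names are illustrative.

```r
# Count-format data: one row per category combination
counts <- data.frame(STOMACH = c("<16.2", ">16.2", "<16.2", ">16.2"),
                     TROPHIC = c("DET", "DET", "OMN", "OMN"),
                     COUNT = c(18, 4, 45, 8))

# Raw-format data: repeat each row COUNT times, one row per species
raw <- counts[rep(seq_len(nrow(counts)), counts$COUNT),
              c("STOMACH", "TROPHIC")]

# table() tabulates the raw rows; xtabs() sums the supplied counts
tab.raw <- table(raw)
tab.counts <- xtabs(COUNT ~ STOMACH + TROPHIC, data = counts)

tab.raw
tab.counts
```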
