Workshop 14.3 - Correspondence Analysis (CA)

14 Jan 2013

Basic statistics references

Legendre and Legendre
Quinn & Keough (2002) - Chpt 17

Correspondence Analysis (CA)

In Workshop 14.2 we introduced a dataset of Gittins (1985) in which the abundances of 8 species of plants were measured at 45 sites within 3 habitat types. Essentially, the plant ecologist wanted to compare the sites according to their plant communities. In Workshop 14.2 we performed PCA on these data.

In the current workshop, we will instead start by assuming that the sampling spans multiple communities (the species of which are likely to display unimodal abundance distributions) and there are strong environmental gradients operating across the landscape that are likely to drive strong associations between species abundances and sites.

Correspondence analysis suits this situation: it quantifies the contributions of the relative frequencies (of each species at each site) to a $\chi^2$ statistic.
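
For reference, the statistic being partitioned can be written out as follows. The notation is introduced here for illustration and is not taken from the workshop: $O_{ij}$ is the observed abundance of species $j$ at site $i$, $r_i$ and $c_j$ are the row (site) and column (species) totals, $N$ is the grand total, and $\lambda_k$ is the eigenvalue of correspondence component $k$.

$$
E_{ij} = \frac{r_i c_j}{N}, \qquad
\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\frac{\chi^2}{N} = \sum_k \lambda_k
$$

The quantity $\chi^2/N$ is the total inertia, which vegan labels the mean squared contingency coefficient.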

Download veg data set

Format of veg.csv data file

SITE	HABITAT	SP1	...	SP8
1	A	4	..	68
2	B	92	..	4
3	A	9	..	68
4	A	52	..	24
5	C	99	..	0
6	A	12	..	68
7	C	72	..	8
8	C	80	..	8
9	B	80	..	0
10	C	92	..	0

SITE	A number or name given to each quadrat (site)
HABITAT	A letter or name given to each habitat type
SP1, SP2, .., SP8	Number of individuals of each plant species found in each quadrat

Open the veg data set.


> veg <- read.csv("../downloads/data/veg.csv")
> veg

   SITE HABITAT SP1 SP2 SP3 SP4 SP5 SP6 SP7 SP8
1     1       A   4   0   0  36  28  24  99  68
2     2       B  92  84   0   8   0   0  84   4
3     3       A   9   0   0  52   4  40  96  68
4     4       A  52   0   0  52  12  28  96  24
5     5       C  99   0  36  88  52   8  72   0
6     6       A  12   0   0  20  40  40  88  68
7     7       C  72   0  20  72  24  20  72   8
8     8       C  80   0   0  48  16  28  92   8
9     9       B  80  76   4   8  12   0  84   0
10   10       C  92   0  40  72  36  12  84   0
11   11       A  28   4   0  16  56  28  96  56
12   12       A   8   0   0  36  68   8  99  28
13   13       C  99  12   4  84  36  12  88   8
14   14       A  40   0   0  68  12   8  88  24
15   15       A  28   0   0  36  64  28  99  56
16   16       A  28   0   0  28  44  20  88  32
17   17       C  80   0   0  52  20  32  96  20
18   18       C  84   0   0  76  44  16  96   0
19   19       B  88  40  12   8  24   8  92   0
20   20       C  99   0  60  88  28   0  80   0
21   21       A  12   0   0  36  16  12  88  76
22   22       A   0   0   0  20   8   0  99  60
23   23       C  88   0  12  72  32  16  88   0
24   24       C  56   0   4  32  56   4  96  16
25   25       C  99   0  40  60  20   4  56   4
26   26       A  12   0   0  28   4   4  99  72
27   27       A  28   0   0  48  64   4  99  28
28   28       B  92  52   0  40  64   8  96   4
29   29       C  80   0   0  68  40  12  80   8
30   30       A  32   0   0  56  28  36  84  24
31   31       A  40   0   0  60   8  36  96  56
32   32       A  44   0   0  44   8  20  96  24
33   33       A  48   0   0  72  20  12  99  32
34   34       A  48   0   0   8  44   8  92  56
35   35       B  99  36  20  56   8   4  24   0
36   36       A  15   0   4  36   4  28  99  44
37   37       A   8   0   0  20  16  12  99  56
38   38       A  28   0   0  24  16  12  99  36
39   39       A  52   0   0  48  12  28  99  32
40   40       C  92   0   4  56  12  16  70   4
41   41       C  92   0   8  52  56   8  99   4
42   42       A   4   0   0  44  24   4  99  60
43   43       A  16   0   0  36   0  24  99  76
44   44       A  76   0   0  48  12  36  96  32
45   45       B  96  36   4  28  28   8  88   4

Use correspondence analysis (CA) to explore the trends in plant communities amongst sites (and habitats).


> library(vegan)
> veg.ca <- cca(veg[, c(-1, -2)])
> summary(veg.ca, display = NULL)

Call:
cca(X = veg[, c(-1, -2)]) 

Partitioning of mean squared contingency coefficient:
              Inertia Proportion
Total            0.55          1
Unconstrained    0.55          1

Eigenvalues, and their contribution to the mean squared contingency coefficient 

Importance of components:
                        CA1   CA2    CA3    CA4    CA5    CA6     CA7
Eigenvalue            0.260 0.156 0.0532 0.0456 0.0179 0.0109 0.00652
Proportion Explained  0.473 0.283 0.0968 0.0829 0.0326 0.0198 0.01185
Cumulative Proportion 0.473 0.756 0.8528 0.9358 0.9684 0.9881 1.00000

Scaling 2 for species and site scores
* Species are scaled proportional to eigenvalues
* Sites are unscaled: weighted dispersion equal on all dimensions

Examine the eigenvalues for each new component (group). They represent the contribution of each new component to the overall $\chi^2$. The eigenvalues should sum to the total inertia (the overall $\chi^2$ divided by the grand total of the table, reported above as the mean squared contingency coefficient). If there were absolutely no associations between the species and sites, then you would expect each new component to have an eigenvalue of $inertia/n$, where $n$ is the number of components. What do the eigenvalues indicate in this case?

> veg.ca$CA$eig

     CA1      CA2      CA3      CA4      CA5      CA6      CA7 
0.260083 0.155846 0.053240 0.045632 0.017940 0.010865 0.006518 

> sum(veg.ca$CA$eig)

[1] 0.5501

Calculate the percentage of total $\chi^2$ explained by each of the new correspondence components. How much of the total $\chi^2$ is explained by correspondence component 1 (as a percentage)?

> 100 * veg.ca$CA$eig/sum(veg.ca$CA$eig)

   CA1    CA2    CA3    CA4    CA5    CA6    CA7 
47.277 28.329  9.678  8.295  3.261  1.975  1.185 

Calculate the cumulative sum of these percentages. How much of the total $\chi^2$ is explained by the first three correspondence components (as a percentage)?

> cumsum(100 * veg.ca$CA$eig/sum(veg.ca$CA$eig))

   CA1    CA2    CA3    CA4    CA5    CA6    CA7 
 47.28  75.61  85.28  93.58  96.84  98.82 100.00 

Using the eigenvalues and a screeplot, determine how many correspondence components are necessary to represent the original variables (species). How many correspondence components are necessary?

> screeplot(veg.ca)
> int <- veg.ca$tot.chi/length(veg.ca$CA$eig)
> abline(a = int, b = 0)
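
The reference line on the screeplot is simply the average eigenvalue, so the same decision can be checked numerically. The following is a minimal sketch in base R that retypes the eigenvalues printed above, so it runs without vegan or the data file; the names `eig`, `threshold` and `retained` are illustrative only.

```r
# Eigenvalues copied from the veg.ca output above
eig <- c(CA1 = 0.260083, CA2 = 0.155846, CA3 = 0.053240, CA4 = 0.045632,
         CA5 = 0.017940, CA6 = 0.010865, CA7 = 0.006518)

# Average eigenvalue: the height of the horizontal reference line
threshold <- sum(eig) / length(eig)

# Components whose eigenvalue exceeds the average
retained <- names(eig)[eig > threshold]

print(round(threshold, 4))
print(retained)
```

Under this average-eigenvalue rule only CA1 and CA2 lie above the line. It is one of several possible retention rules, so treat it as a guide rather than a verdict.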

Generate a quick biplot ordination (scatterplot of correspondence components) with correspondence component 1 on the x-axis and correspondence component 2 on the y-axis. Are the patterns of sites associated with any particular species?

> plot(veg.ca, scaling = 2)

Whilst the above biplot illustrates some of the patterns, it does not allow us to directly see whether the communities differ between habitats. So let's instead construct the plot at a lower level.

1. Create the base ordination plot and add the sites (colored according to habitat). Since we are more interested in the habitats than the actual sites, we can just label the points according to their habitat rather than their site names.

  > veg.ord <- ordiplot(veg.ca, type = "n")
  > # factor() guards against HABITAT having been read in as character
  > text(veg.ord, "sites", lab = veg$HABITAT, col = as.numeric(factor(veg$HABITAT)))

2. Let's now add the species correlation vectors (component loadings). This will yield a biplot similar to that of the previous question.

  > veg.ord <- ordiplot(veg.ca, type = "n")
  > text(veg.ord, "sites", lab = veg$HABITAT, col = as.numeric(factor(veg$HABITAT)))
  > # columns 3 to 10 hold the eight species (SP1 to SP8)
  > data.envfit <- envfit(veg.ca, veg[, 3:10])
  > plot(data.envfit, col = "grey")

3. Now let's fit the habitat vectors onto this ordination. Before environmental variables can be added to an ordination plot, they must first be converted to numeric representations. If we wish to display the orientation of each habitat on the ordination plot, then we need to convert the habitat variable into dummy variables.

  > veg.ord <- ordiplot(veg.ca, type = "n")
  > text(veg.ord, "sites", lab = veg$HABITAT, col = as.numeric(factor(veg$HABITAT)))
  > data.envfit <- envfit(veg.ca, veg[, 3:10])
  > plot(data.envfit, col = "grey")
  > # dummy code the habitat factor
  > habitat <- model.matrix(~-1 + HABITAT, veg)
  > data.envfit <- envfit(veg.ca, env = habitat)
  > data.envfit

  ***VECTORS

              CA1    CA2   r2 Pr(>r)    
  HABITATA -0.904  0.427 0.75  0.001 ***
  HABITATB  0.841  0.541 0.84  0.001 ***
  HABITATC  0.334 -0.942 0.63  0.001 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
  P values based on 999 permutations.

  > plot(data.envfit, col = "blue")

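
To see what the dummy coding step produces, here is a small sketch on a hypothetical five-site data frame (`toy` is invented for illustration and is not part of the veg data). It uses the same `model.matrix()` formula as above.

```r
# Hypothetical five-site example; the habitat labels mirror those in veg
toy <- data.frame(HABITAT = factor(c("A", "B", "A", "C", "C")))

# -1 drops the intercept so that every habitat level gets its own 0/1 column
dummies <- model.matrix(~ -1 + HABITAT, data = toy)

print(dummies)
```

Each row contains a single 1 marking that site's habitat, and the columns are named `HABITATA`, `HABITATB` and `HABITATC`, which is where the row labels in the `envfit()` output come from.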
To ensure you appreciate the patterns displayed in this ordination plot, answer the following questions.

1. Species 1 is primarily associated with correspondence component (axis)?
2. Species 2 is primarily associated with correspondence component (axis)?
3. Species 5 is primarily associated with correspondence component (axis)?
4. Habitat A aligns primarily with the
5. Habitat C strongly reflects the abundances of

It is also interesting to note that the sites predominantly line up along very narrow trajectories.

The environmental fit procedure above included a permutation test that explored the relationship between each of the habitat types and the communities in the reduced ordination space (as defined by CA1 and CA2). What conclusions would you draw from this analysis?

> data.envfit <- envfit(veg.ca, env = habitat)
> data.envfit

***VECTORS

            CA1    CA2   r2 Pr(>r)    
HABITATA -0.904  0.427 0.75  0.001 ***
HABITATB  0.841  0.541 0.84  0.001 ***
HABITATC  0.334 -0.942 0.63  0.001 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
P values based on 999 permutations.

1. Habitat A is?
2. Habitat B is?
3. Habitat C is?
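
The logic of the permutation test can be sketched in base R. This is a simplified illustration on simulated data, not a reimplementation of `envfit()`: the scores and the habitat indicator are hypothetical, and the regression here is unweighted, whereas `envfit()` may apply site weights for a CA result.

```r
set.seed(1)  # reproducible simulated data (not the veg results)

# Hypothetical scores of 30 sites on two ordination axes
scores <- cbind(ax1 = rnorm(30), ax2 = rnorm(30))
# Hypothetical 0/1 habitat indicator that depends on the first axis
habitat_a <- as.numeric(scores[, "ax1"] + rnorm(30, sd = 0.5) > 0)

# r^2 from regressing the variable on both axes
r2 <- function(y, x) summary(lm(y ~ x))$r.squared

obs <- r2(habitat_a, scores)

# Null distribution: shuffle the variable across sites and refit each time
nperm <- 999
perm <- replicate(nperm, r2(sample(habitat_a), scores))

# Permutation P value, counting the observed statistic as one permutation
p_value <- (sum(perm >= obs) + 1) / (nperm + 1)

print(c(r2 = obs, p = p_value))
```

Because the observed statistic is counted among the permutations, the smallest P value attainable with 999 permutations is 0.001. A value of 0.001 therefore means that no permuted $r^2$ reached the observed one.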

Correspondence Analysis (CA)

We will also return to the data of Peet & Loucks (1977) that examined the abundances of 8 species of trees (Bur oak, Black oak, White oak, Red oak, American elm, Basswood, Ironwood, Sugar maple) at 10 forest sites in southern Wisconsin, USA. The data (given below) are the mean measurements of canopy cover for eight species of North American trees in 10 samples (quadrats).

Download wisc data set

Format of wisc.csv data file

QUADRAT	BUROAK	BLACKOAK	WHITEOAK	REDOAK	ELM	BASSWOOD	IRONWOOD	MAPLE
1	9	8	5	3	2	0	0	0
2	8	9	4	4	2	0	0	0
3	3	8	9	0	4	0	0	0
4	5	7	9	6	5	0	0	0
5	6	0	7	9	6	2	0	0
6	0	0	7	8	5	7	6	5
7	5	0	4	7	5	6	7	4
8	0	0	6	6	0	6	4	8
9	0	0	0	4	2	7	6	8
10	0	0	2	3	5	6	5	9

QUADRAT	A number or name given to each quadrat
BUROAK, BLACKOAK, .., MAPLE	Mean canopy cover of each tree species in each quadrat

Open the wisc data set.


> wisc <- read.csv("../downloads/data/wisc.csv")
> wisc

   QUADRAT BUROAK BLACKOAK WHITEOAK REDOAK ELM BASSWOOD IRONWOOD MAPLE
1        1      9        8        5      3   2        0        0     0
2        2      8        9        4      4   2        0        0     0
3        3      3        8        9      0   4        0        0     0
4        4      5        7        9      6   5        0        0     0
5        5      6        0        7      9   6        2        0     0
6        6      0        0        7      8   5        7        6     5
7        7      5        0        4      7   5        6        7     4
8        8      0        0        6      6   0        6        4     8
9        9      0        0        0      4   2        7        6     8
10      10      0        0        2      3   5        6        5     9

Use correspondence analysis (CA) to generate new groups (components) and explore the trends in tree communities amongst quadrats.


> library(vegan)
> wisc.ca <- cca(wisc[, -1])
> summary(wisc.ca, display = NULL)

Call:
cca(X = wisc[, -1]) 

Partitioning of mean squared contingency coefficient:
              Inertia Proportion
Total           0.716          1
Unconstrained   0.716          1

Eigenvalues, and their contribution to the mean squared contingency coefficient 

Importance of components:
                        CA1    CA2    CA3    CA4    CA5     CA6      CA7
Eigenvalue            0.532 0.0858 0.0553 0.0237 0.0125 0.00519 0.000869
Proportion Explained  0.744 0.1199 0.0773 0.0332 0.0174 0.00725 0.001210
Cumulative Proportion 0.744 0.8637 0.9409 0.9741 0.9915 0.99879 1.000000

Scaling 2 for species and site scores
* Species are scaled proportional to eigenvalues
* Sites are unscaled: weighted dispersion equal on all dimensions

Examine the eigenvalues for each new component (group). What do the eigenvalues indicate in this case?


> wisc.ca$CA$eig

      CA1       CA2       CA3       CA4       CA5       CA6       CA7 
0.5323741 0.0858155 0.0553136 0.0237462 0.0124801 0.0051924 0.0008694

> sum(wisc.ca$CA$eig)

[1] 0.7158
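
To make the link between these eigenvalues and the $\chi^2$ statistic concrete, the following sketch computes a correspondence analysis by hand in base R, using a singular value decomposition of the standardised residuals. The data are retyped from the table above so that the block runs without vegan or the csv file; the object names (`wisc_mat`, `P`, `E`, `Z`) are illustrative. The eigenvalues and their sum should agree with the `cca()` output above up to rounding.

```r
# wisc data retyped from the table above (quadrats in rows, species in columns)
wisc_mat <- matrix(c(
  9, 8, 5, 3, 2, 0, 0, 0,
  8, 9, 4, 4, 2, 0, 0, 0,
  3, 8, 9, 0, 4, 0, 0, 0,
  5, 7, 9, 6, 5, 0, 0, 0,
  6, 0, 7, 9, 6, 2, 0, 0,
  0, 0, 7, 8, 5, 7, 6, 5,
  5, 0, 4, 7, 5, 6, 7, 4,
  0, 0, 6, 6, 0, 6, 4, 8,
  0, 0, 0, 4, 2, 7, 6, 8,
  0, 0, 2, 3, 5, 6, 5, 9),
  nrow = 10, byrow = TRUE)

P  <- wisc_mat / sum(wisc_mat)  # relative frequencies
r  <- rowSums(P)                # quadrat (row) masses
cm <- colSums(P)                # species (column) masses
E  <- outer(r, cm)              # expected proportions under independence

# Standardised residuals; their squares are the chi-square contributions / N
Z <- (P - E) / sqrt(E)

# Squared singular values of Z are the CA eigenvalues
eig <- svd(Z)$d^2
eig <- eig[eig > 1e-12]         # drop the trivial (zero) dimension

inertia <- sum(Z^2)             # total inertia = chi-square / N

print(round(eig, 4))
print(round(inertia, 4))
```

An 8-species table yields at most 7 non-trivial components, which is why the output above stops at CA7.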

Calculate the percentage of total $\chi^2$ explained by each of the new correspondence components. How much of the total $\chi^2$ is explained by correspondence component 1 (as a percentage)?

> 100 * wisc.ca$CA$eig/sum(wisc.ca$CA$eig)

    CA1     CA2     CA3     CA4     CA5     CA6     CA7 
74.3756 11.9889  7.7276  3.3175  1.7435  0.7254  0.1215 

Calculate the cumulative sum of these percentages. How much of the total $\chi^2$ is explained by the first three correspondence components (as a percentage)?

> cumsum(100 * wisc.ca$CA$eig/sum(wisc.ca$CA$eig))

   CA1    CA2    CA3    CA4    CA5    CA6    CA7 
 74.38  86.36  94.09  97.41  99.15  99.88 100.00 

Using the eigenvalues and a screeplot, determine how many correspondence components are necessary to represent the original variables (species). How many correspondence components are necessary?

> screeplot(wisc.ca)
> # reference line at the average eigenvalue of the wisc analysis
> int <- wisc.ca$tot.chi/length(wisc.ca$CA$eig)
> abline(a = int, b = 0)

Generate a quick biplot ordination (scatterplot of correspondence components) with correspondence component 1 on the x-axis and correspondence component 2 on the y-axis. Are the patterns of quadrats associated with any particular tree species?

> plot(wisc.ca, scaling = 2)

Transformation	Syntax
log_e	> new_var <- log(old_var)
log₁₀	> new_var <- log10(old_var)
square root	> new_var <- sqrt(old_var)
arcsin	> new_var <- asin(sqrt(old_var))
scale (mean=0, unit variance)	> new_var <- scale(old_var)

Sample number	Sample mean
1	12.1
2	12.7
3	12.5
Mean of sample means	12.433
SD of sample means	0.306