

Tutorial 10.2a - Goodness of fit tests

22 Jul 2018

Scenario and Data

Goodness of fit tests are concerned with comparing the observed frequencies with those expected on the basis of a specific null hypothesis. So let's now fabricate a motivating scenario and some data.

We will create a scenario that involves items classified into one of three groups (A, B and C). The number of items in each classification group is then tallied up. Out of a total of 47 items, 15 were of type A, 9 were of type B and 23 were of type C. We could evaluate parity (a 1:1:1 ratio) from these data. In a frequentist context, this might involve testing a null hypothesis that the observed data could have come from a population with a 1:1:1 item ratio. In this case the probability would be the probability of obtaining the observed ratio of frequencies when the null hypothesis is true.

To extend the example, let's also explore a 1:1:2 ratio.

We start by generating the observed data:

# the observed frequencies of A, B and C
obs <- c(15, 9, 23)
obs
[1] 15  9 23

An appropriate test statistic for comparing an observed ($o$) frequency ratio to an expected ($e$) frequency ratio is the chi-square $\chi^2$ statistic. $$\chi^2=\sum\frac{(o-e)^2}{e}$$

When the null hypothesis is true, and specific assumptions hold, the $\chi^2$ statistic should follow a $\chi^2$ probability distribution with degrees of freedom equal to the number of categories minus 1 ($df=p-1$).
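To make the formula concrete, the statistic can be assembled by hand in R. This is just a restatement of the equation above applied to the observed data; the same numbers reappear in the chisq.test() output further down.

```r
# hand-calculate the chi-square statistic for the observed data
# against a 1:1:1 expectation
obs <- c(15, 9, 23)
e <- sum(obs) * c(1/3, 1/3, 1/3)   # expected frequencies
chi2 <- sum((obs - e)^2 / e)       # the chi-square test statistic
df <- length(obs) - 1              # degrees of freedom (p - 1)
pval <- pchisq(chi2, df, lower.tail = FALSE)
round(c(chi2 = chi2, df = df, p = pval), 4)
```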

Exploratory data analysis and initial assumption checking

The assumptions are:
  1. All of the observations are classified independently - this must be addressed at the design and collection stages
  2. No more than 20% of the expected values should be less than 5. Since counts are assumed to follow a Poisson distribution, in which the location and spread are governed by the same parameter ($\lambda$), and since counts are bounded by a lower limit of zero, distributions with expected values less than 5 have an asymmetrical shape and are thus unreliable (for calculating probabilities).

So let's calculate the expected frequencies as a means to evaluate this assumption. The expected values are calculated as: $$e = total~counts \times expected~fraction$$

                      1:1:1 ratio                          1:1:2 ratio
Expected fractions    A=1/3=0.33, B=1/3=0.33, C=1/3=0.33   A=1/4=0.25, B=1/4=0.25, C=2/4=0.5
Expected frequencies  $e_A=(15+9+23)\times 1/3=15.67$      $e_A=(15+9+23)\times 1/4=11.75$
                      $e_B=(15+9+23)\times 1/3=15.67$      $e_B=(15+9+23)\times 1/4=11.75$
                      $e_C=(15+9+23)\times 1/3=15.67$      $e_C=(15+9+23)\times 2/4=23.5$

It is clear that in neither case are any of the expected frequencies less than 5. Therefore, we would conclude that probabilities derived from the $\chi^2$ distribution are likely to be reliable.
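The same check can be done directly in R: the expected frequencies under each null ratio are simply the total count multiplied by the expected fractions.

```r
obs <- c(15, 9, 23)
# expected frequencies under each null hypothesis
e1 <- sum(obs) * c(1/3, 1/3, 1/3)  # 1:1:1 ratio
e2 <- sum(obs) * c(1/4, 1/4, 2/4)  # 1:1:2 ratio
e1
e2
# assumption check: no expected value should be less than 5
all(e1 >= 5)
all(e2 >= 5)
```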

Model fitting or statistical analysis

We perform the Goodness of fit $\chi^2$ test with the chisq.test() function. There are only two relevant parameters for a Goodness of fit test:

  • x: the set (vector or matrix) of observed frequencies
  • p: the set (vector or matrix) of expected probabilities (defaults to equal probabilities, 1/length(x))

1:1:1 ratio:
data.chisq <- chisq.test(obs)

1:1:2 ratio:
data.chisq1 <- chisq.test(obs, p = c(1/4, 1/4, 2/4))

Model evaluation

Prior to exploring the model parameters, it is prudent to confirm that the model did indeed fit the assumptions and was an appropriate fit to the data. For the $\chi^2$ test, this just means confirming the expected values.

1:1:1 ratio:
data.chisq$exp
[1] 15.66667 15.66667 15.66667

1:1:2 ratio:
data.chisq1$exp
[1] 11.75 11.75 23.50

Exploring the model parameters, test hypotheses

If there was any evidence that the assumptions had been violated or the model was not an appropriate fit, then we would need to reconsider the model and start the process again. In this case, there is no evidence that the test will be unreliable so we can proceed to explore the test statistics.

1:1:1 ratio:
data.chisq

	Chi-squared test for given probabilities

data:  obs
X-squared = 6.2979, df = 2, p-value = 0.0429

1:1:2 ratio:
data.chisq1

	Chi-squared test for given probabilities

data:  obs
X-squared = 1.5532, df = 2, p-value = 0.46

Conclusions:

  • There is inferential evidence to reject the null hypothesis that the observed item frequencies could have come from a population with a 1:1:1 ratio.
  • There is insufficient inferential evidence to reject the null hypothesis that the observed item frequencies could have come from a population with a 1:1:2 ratio.

Further explorations of the trends

When significant overall deviations have been identified (and there are more than 2 groups), it is often useful to explore the patterns of residuals. Since the residuals are a (standardized) measure of the differences between observed and expected frequencies for each classification group (A, B or C in this case), they provide an indication of which group(s) deviate most from the expected values and are therefore the main driver(s) of the "effect".

In interpreting the residuals, we are looking for large (substantially larger in magnitude than 1) positive and negative values, which represent higher and lower observed frequencies than would have been expected under the null hypothesis, respectively.
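The standardized (Pearson) residuals reported by chisq.test() are simply $(o-e)/\sqrt{e}$ for each group, which can be verified by hand:

```r
obs <- c(15, 9, 23)
e <- sum(obs) * c(1/3, 1/3, 1/3)   # expected under a 1:1:1 ratio
res <- (obs - e) / sqrt(e)         # Pearson (standardized) residuals
round(res, 7)
```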

1:1:1 ratio:
data.chisq$res
[1] -0.1684304 -1.6843038  1.8527342

1:1:2 ratio:
data.chisq1$res
[1]  0.9481224 -0.8022575 -0.1031421

Conclusions:

  • There were fewer observed B's and more observed C's than would have been expected from a 1:1:1 population ratio.
  • As the observed data were not found to differ significantly from a 1:1:2 ratio, the residuals are not going to offer much additional insight and probably would not have been generated.




Worked Examples

Basic χ2 references
  • Logan (2010) - Chpt 16-17
  • Quinn & Keough (2002) - Chpt 13-14

Goodness of fit test

A fictitious plant ecologist sampled 90 shrubs of a dioecious plant in a forest, and each plant was classified as being either male or female. The ecologist was interested in the sex ratio and whether it differed from 50:50. The observed counts and the predicted (expected) counts based on a theoretical 50:50 sex ratio follow.

Format of fictitious plant sex ratios - note, not a file
Expected and Observed data (50:50 sex ratio).

           Female   Male   Total
Observed     40      50      90
Expected     45      45      90

[Image: Tasmannia bush]

Note, it is not necessary to open or create a data file for this question.

  1. First, what is the appropriate test
    to examine the sex ratio of these plants?
  2. What null hypothesis is being tested by this test?
  3. What degrees of freedom are associated with these data for this test?
  4. Perform a Goodness-of-fit test
    to test the null hypothesis that these data came from a population with a 50:50 sex ratio (hint). Identify the following:
    Show code
    chisq.test(c(40, 50))
    
    	Chi-squared test for given probabilities
    
    data:  c(40, 50)
    X-squared = 1.1111, df = 1, p-value = 0.2918
    
    1. X2 statistic
    2. df
    3. P value
  5. What are your conclusions (statistical and biological)?
  6. Let's now extend this fictitious endeavor. Recent studies on a related species of shrub have suggested a 30:70 female:male sex ratio. Knowing that our plant ecologist had similar research interests, the authors contacted her to inquire whether her data contradicted their findings.

  7. Using the same observed data, test the null hypothesis
    that these data came from a population with a 30:70 sex ratio (hint). From a 30:70 female:male sex ratio, what are the expected frequency counts of females and males from 90 individuals and what is the X2 statistic?
    Show code
    chisq.test(c(40, 50), p = c(0.3, 0.7))
    
    	Chi-squared test for given probabilities
    
    data:  c(40, 50)
    X-squared = 8.9418, df = 1, p-value = 0.002787
    
    chisq.test(c(40, 50), p = c(0.3, 0.7))$exp
    
    [1] 27 63
    
    1. Expected number of females
    2. Expected number of males
    3. X2 statistic
  8. Do the plant ecologist's data dispute the findings of the other studies? (y or n)

Exponential family of distributions

The exponential family of distributions is a class of distributions (both continuous and discrete) that can be characterized by two parameters. One of these parameters (the location parameter) is a function of the mean and the other (the dispersion parameter) is a function of the variance of the distribution. Note that recent developments have further extended generalized linear models to accommodate other, non-exponential residual distributions.

End of instructions

  Reading data into R

Ensure that the working directory is pointing to the path containing the file to be imported before proceeding.
To import data into R, we read the contents of a file into a data frame. The general format of the command for reading data into a data frame is

> name <- read.table('filename.csv', header=T, sep=',', row.names=column, strip.white=T)

where name is a name you wish the data frame to be referred to as, filename.csv is the name of the csv file that you created in Excel and column is the number of the column that had row labels (if there were any). The argument header=T indicates that the variable (vector) names will be created from the names supplied in the first row of the file. The argument sep=',' indicates that entries in the file are separated by a comma (hence a comma delimited file). If the data file does not contain row labels, or you are not sure whether it does, it is best to omit the row.names=column argument. The strip.white=T argument ensures that no leading or trailing spaces are left in character names (these can be particularly nasty in categorical variables!).

As an example
        phasmid <- read.table('phasmid.csv', header=T, sep=',', row.names=1, strip.white=T)

End of instructions

  Analysing frequencies

Analysis of frequencies is similar to Analysis of Variance (ANOVA) in some ways. Variables contain two or more classes that are defined from either natural categories or from a set of arbitrary class limits in a continuous variable. For example, the classes could be sexes (male and female) or color classes derived by splitting the light scale into a set of wavelength bands. Unlike ANOVA, in which an attribute (e.g. length) is measured for a set number of replicates and the means of different classes (categories) are compared, when analyzing frequencies, the number of replicates (observed) that fall into each of the defined classes are counted and these frequencies are compared to predicted (expected) frequencies.

Analysis of frequencies tests whether a sample of observations came from a population where the observed frequencies match some expected or theoretical frequencies. Analysis of frequencies is based on the chi-squared (X2) statistic, which follows a chi-square distribution (the distribution of summed squared values drawn from a standard normal distribution, hence its long right tail).
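This relationship is easy to demonstrate by simulation (a quick sketch; the sample size of 10000 is arbitrary). Summing the squares of two standard normal deviates yields draws from a chi-square distribution with 2 degrees of freedom, whose mean equals its degrees of freedom:

```r
set.seed(1)
# sum of squares of two standard normal deviates ~ chi-square with df = 2
z1 <- rnorm(10000)
z2 <- rnorm(10000)
x2 <- z1^2 + z2^2
mean(x2)   # should be close to the degrees of freedom, 2
```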

When there is only one categorical variable, expected frequencies are calculated from theoretical ratios. When there is more than one categorical variable, the data are arranged in a contingency table that reflects the cross-classification of sampling or experimental units into the classes of the two or more variables. The most common form of contingency table analysis (model I) tests a null hypothesis of independence between the categorical variables and is analogous to the test of an interaction in multifactorial ANOVA. Hence, frequency analysis provides hypothesis tests for solely categorical data. Although analysis of frequencies provides a way to analyse data in which both the predictor and the response variable are categorical, the variables are not distinguished as predictor or response in the analysis itself, so the establishment of causality is only of importance for interpretation.


End of instructions

  Goodness of fit test

The goodness-of-fit test compares observed frequencies of each class within a single categorical variable to the frequencies expected of each of the classes on a theoretical basis. It tests the null hypothesis that the sample came from a population in which the observed frequencies match the expected frequencies.

For example, an ecologist investigating factors that might lead to deviations from a 1:1 offspring sex ratio, hypothesized that when the investment in one sex is considerably greater than in the other, the offspring sex ratio should be biased towards the less costly sex. He studied two species of wasps, one of which had males that were considerably larger (and thus more costly to produce) than females. For each species, he compared the offspring sex ratio to a 1:1 ratio using a goodness-of-fit test.

End of instructions

  R Goodness of fit test

> chisq.test(c(counts))
#OR
> chisq.test(c(counts),p=c(.5,.5))

where counts is a comma separated list of observed counts or frequencies and .5,.5 is a comma separated list of expected frequencies. For example

> chisq.test(c(40,50))
#OR
> chisq.test(c(40,50),p=c(.5,.5))

End of instructions

  R Cross tables

> name.tab <- xtabs(counts ~ cat1 + cat2, data=data)

where name.tab is a name for the resulting cross table, counts are the observed counts/frequencies, cat1 and cat2 are the categorical variables and data is the name of the data frame (dataset)
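As a sketch (the data frame and variable names below are made up for illustration), a long-format data frame of counts can be converted into a cross table like so:

```r
# hypothetical germination counts in long format
seeds <- data.frame(
  counts = c(30, 10, 12, 28),
  heat   = c("heated", "heated", "unheated", "unheated"),
  germ   = c("yes", "no", "yes", "no")
)
# cross-classify the counts by the two categorical variables
seeds.tab <- xtabs(counts ~ heat + germ, data = seeds)
seeds.tab
```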

End of instructions

  Contingency tables

Contingency tables are the cross-classification of two (or more) categorical variables. A 2 x 2 (two way) table takes on the following form:

              Variable 2
Variable 1   Level 1  Level 2
   Level 1     f11      f12
   Level 2     f21      f22

Where f12 etc are the frequencies in each cell (Variable 1 x Variable 2 combination).

Contingency tables test the null hypothesis that the data came from a population in which variable 1 is independent of variable 2 and vice-versa. That is, it is a test of independence.

For example a plant ecologist examined the effect of heat on seed germination. Contingency test was used to determine whether germination (2 categories - yes or no) was independent of the heat treatment (2 categories heated or not heated).
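Using the germination example (with made-up counts purely for illustration), the test of independence looks like this:

```r
# hypothetical 2 x 2 table: germination (yes/no) by heat treatment
germ.tab <- matrix(c(30, 12, 10, 28), nrow = 2,
                   dimnames = list(heat = c("heated", "unheated"),
                                   germination = c("yes", "no")))
# correct = FALSE suppresses the Yates continuity correction
germ.x2 <- chisq.test(germ.tab, correct = FALSE)
germ.x2
```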

End of instructions

  R Contingency tables

> name.x2 <- chisq.test(name.tab, correct=F)

where name.x2 is a name to provide for the fitted model and name.tab is a name of a cross table.

End of instructions

  Contingency tables from raw data sets

> name.tab <- table(data)

where name.tab is a name to give the resulting cross table and data is the name of the data set that contains the raw data.

End of instructions

  Logistic Regression

> name.glm <- glm(dv~iv, family=binomial, data=data)

where name.glm is a name you provide for the fitted model, dv is the name of the dependent variable, iv is the name of the independent variable and data is the name of the data frame (data set).
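A minimal sketch with a fabricated data set (the data frame germ.dat, its variables and the dose gradient are all invented for illustration):

```r
set.seed(2)
# fabricated binary response (germinated 1/0) along a dose gradient
germ.dat <- data.frame(dose = rep(1:10, each = 5))
germ.dat$germinated <- rbinom(50, 1, plogis(-3 + 0.6 * germ.dat$dose))
# logistic regression of germination against dose
germ.glm <- glm(germinated ~ dose, family = binomial, data = germ.dat)
summary(germ.glm)$coefficients
```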

End of instructions

  Further exploration of three way log-linear interactions

Start with the female data. The female data can be accessed through the object sinclair.split[["FEMALE"]]. This is the "FEMALE" data of the sinclair.split object.

> sinclair.glmR <- glm(COUNT~DEATH+MARROW, family=poisson, data=sinclair.split[["FEMALE"]])
> sinclair.glmF <- glm(COUNT~DEATH*MARROW, family=poisson, data=sinclair.split[["FEMALE"]])
> anova(sinclair.glmR, sinclair.glmF, test="Chisq")

The first line fits the reduced log-linear model. This is the model that does not contain the two way interaction term. The second line fits the full log-linear model - the model that does contain the two way interaction. The third line generates the analysis of deviance table. This is essentially presenting the results of the hypothesis test that tests whether there is a two way interaction or not.
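Since the sinclair data set itself is not shown here, a self-contained sketch with fabricated counts (the COUNT values and factor levels below are invented for illustration) shows the reduced-versus-full comparison:

```r
# fabricated female subset: cause of death cross-classified by marrow type
female.dat <- expand.grid(DEATH  = c("PRED", "NPRED"),
                          MARROW = c("SWF", "OG", "TG"))
female.dat$COUNT <- c(38, 12, 17, 23, 8, 26)
# reduced model (no interaction) versus full model (with interaction)
glmR <- glm(COUNT ~ DEATH + MARROW, family = poisson, data = female.dat)
glmF <- glm(COUNT ~ DEATH * MARROW, family = poisson, data = female.dat)
dev.tab <- anova(glmR, glmF, test = "Chisq")  # tests the DEATH:MARROW interaction
dev.tab
```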

This same procedure can then be repeated for the "MALE" data.


End of instructions

  Odds ratios for specific two-way interactions

Start with the female data.

> library(epitools)
> female.tab <- xtabs(COUNT ~ DEATH + MARROW, data=sinclair.split[["FEMALE"]])
> oddsratio.wald(t(female.tab))$measure
> oddsratio.wald(t(female.tab),rev="b")$measure

  1. The first line loads a package called 'epitools'. This package contains the function (oddsratio.wald()) that is necessary to calculate the odds ratios.
  2. The second line converts the female data set into a cross table.
  3. The third line calculates the odds ratios and 95% confidence intervals (after transposing the table). Transposing is done with the 't()' function and is necessary because oddsratio.wald expects an r x 2 table (that is, a table that only has two columns) and the odds ratios are calculated for the rows. Odds ratios are calculated by contrasting each level of the row variable against a reference level. As a result, the odds ratios etc are not calculated for the first level.
  4. The final line performs the odds ratio calculations again, except this time a different level is chosen as the reference level so that all pairwise comparisons have been covered.
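If epitools is not available, the Wald odds ratio and its 95% confidence interval for a single 2 x 2 comparison can be computed by hand (the cell counts below are made up for illustration):

```r
# hypothetical 2 x 2 cell counts
n11 <- 10; n12 <- 20   # row 1: event, no event
n21 <- 15; n22 <- 30   # row 2: event, no event
or <- (n11 * n22) / (n12 * n21)              # Wald odds ratio
se <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)    # SE of the log odds ratio
ci <- exp(log(or) + c(-1, 1) * 1.96 * se)    # 95% confidence interval
c(OR = or, lower = ci[1], upper = ci[2])
```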


This same procedure can then be repeated for the "MALE" data.


End of instructions