

Tutorial 4.1 - Basic statistical principles

27 Mar 2017

Statistics

Statistics is a branch of mathematical sciences that relates to the collection, analysis, presentation and interpretation of data and is therefore central to most scientific pursuits. Fundamental to statistics is the concept that samples are collected and statistics are calculated to estimate populations and their parameters.

Statistical populations can represent natural biological populations (such as the Victorian koala population), although more typically they reflect somewhat artificial constructs (e.g. Victorian male koalas). A statistical population strictly refers to all the possible observations from which a sample (a subset) can be drawn and is the entity about which you wish to make conclusions.

The population parameters are the characteristics (such as population mean, variability etc) of the population that we are interested in drawing conclusions about. Since it is usually not possible to observe an entire population, the population parameters must be estimated from corresponding statistics calculated from a subset of the population known as a sample (e.g. sample mean, variability etc). Provided the sample adequately represents the population (is sufficiently large and unbiased), the sample statistics should be reliable estimates of the population parameters of interest.

It is primarily for this reason that most statistical procedures impose certain sampling and distributional assumptions on the collected data. For example, most statistical tests assume that the observations have been drawn randomly from populations (to maximize the likelihood that the sample will truly represent the population). Additional terminology fundamental to the study of biometry is listed in the following table (in which the examples pertain to a hypothetical research investigation into estimating the protein content of koala milk).

Term | Definition | Example
Measurement | A single piece of recorded information reflecting a characteristic of interest (e.g. length of a leaf, pH of a water aliquot, mass of an individual, number of individuals per quadrat etc) | Protein content of the milk of a single female koala
Observation | A single measured sampling or experimental unit (such as an individual, a quadrat, a site etc) | A small quantity of milk from a single koala
Population | All the possible observations that could be measured and the unit about which we wish to draw conclusions (note that a statistical population need not be a viable biological population) | The milk of all female koalas
Sample | The (representative) subset of the population that is observed | A small quantity of milk collected from 15 captive female koalas. Note that such a sample may not actually reflect the defined population. Rather, it could be argued that such a sample reflects captive populations. Nevertheless, such extrapolations are common when field samples are difficult to obtain.
Variable | A set of measurements of the same type that comprise the sample; the characteristic that differs (varies) from observation to observation | The protein content of koala milk

In addition to estimating population parameters, various statistical functions (or statistics) are often calculated to express the relative magnitude of trends within and between populations. For example, the degree of difference between two populations is usually described by a t-statistic (see introductory hypothesis testing tutorial).

Another important concept in statistics is the idea of probability. The frequentist view of the probability of an event or outcome is the proportion of times that the event or outcome is expected to occur in the long-run (after a large number of repeated procedures). For many statistical analyses, probabilities of occurrence are used as the basis for conclusions, inferences and predictions.

Consider the vague research question "How much do Victorian male koalas weigh?". This could be interpreted as:

  • How much do each of the Victorian male koalas weigh individually?
  • What is the total mass of all Victorian male koalas added together?
  • What is the mass of the typical Victorian male koala?
Arguably, it is the last of these questions that is of most interest. We might also be interested in the degree to which these weights differ from individual to individual and the frequency of individuals in different weight classes.

Probability theory

Probability (the chance of a particular outcome per event) can be considered from two different perspectives:

  • as an objective representation of the relative frequency of times that the outcome occurs over a long (theoretically infinite) series of events. Hence, it can be calculated by counting the number of times that the outcome occurs (the frequency) divided (normalized) by the total number of events (the sample space) in which it could have occurred. In order to relate this back to a hypothesis, we typically estimate the expected frequencies of outcomes when the null hypothesis is true. We will return to why there is a focus on a null hypothesis rather than a hypothesis a little later.
  • as a somewhat subjective representation of the uncertainty of an outcome. That is, how reasonable is an outcome given our previous understandings and the newly observed data.

Whilst these two approaches differ substantially in their interpretation of probability (long-run chances of outcomes under certain conditions vs degree of belief), both can be represented diagrammatically. In simple probability, the probability of an outcome (e.g. $P(A)$) is expressed relative to a broad sample space.

The probability of outcome A is the frequency of times outcome A occurs divided by the total number of times the outcome could occur (the sample space). The open symbols represent alternative outcomes. \begin{align*} P(A) &= \frac{freq(A)}{freq(Total)}\\ P(A) &= \frac{5}{22}\\ &= 0.227 \end{align*}
[Figure: plot of chunk probA]
The probability of outcome B is the frequency of times outcome B occurs divided by the total number of times the outcome could occur (the sample space). \begin{align*} P(B) &= \frac{freq(B)}{freq(Total)}\\ P(B) &= \frac{7}{22}\\ &= 0.318 \end{align*}
[Figure: plot of chunk probB]
The probability of both outcome A AND outcome B is the frequency of times outcome A AND outcome B both occur together divided by the total number of times the outcome could occur (the sample space). \begin{align*} P(AB) &= \frac{freq(A\&B)}{freq(Total)}\\ P(AB) &= \frac{2}{22}\\ &= 0.091 \end{align*}
[Figure: plot of chunk probAUB]

Conditional probability on the other hand, establishes the probability of a particular event conditional to (given the occurrence of) another event and therefore alters the divisor sample space. The sample space is restricted to the occurrence of the unconditional outcome.

The probability of outcome A occurring given that outcome B also occurs (or has occurred) is the frequency of times that outcome A AND outcome B both occur divided by the frequency of times that outcome B occurs. The frequency of outcome B occurrences becomes the divisor. \begin{align*} P(A|B) &= \frac{freq(A\&B)}{freq(B)}\\ P(A|B) &= \frac{2}{7}\\ &= 0.286 \end{align*}
[Figure: plot of chunk probA|B]

The above representation of conditional probability can be expressed completely in terms of probability \begin{align*} P(A|B) &= \frac{freq(A\&B)}{freq(B)}\Leftrightarrow \frac{P(AB)\times freq(Total)}{P(B)\times freq(Total)}\\ &= \frac{P(AB)}{P(B)} \end{align*}
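
These calculations are easily reproduced in R. The following minimal sketch simply re-uses the frequencies quoted in the figures above (5 occurrences of A, 7 of B, 2 of both, out of 22 events); the object names are arbitrary.

    # frequencies taken from the example figures above
    freq.A     <- 5    # outcome A
    freq.B     <- 7    # outcome B
    freq.AB    <- 2    # both A and B together
    freq.Total <- 22   # the sample space

    p.A  <- freq.A/freq.Total     # P(A)  = 0.227
    p.B  <- freq.B/freq.Total     # P(B)  = 0.318
    p.AB <- freq.AB/freq.Total    # P(AB) = 0.091

    # conditional probability: the sample space is restricted to B
    freq.AB/freq.B                # P(A|B) = 0.286
    p.AB/p.B                      # identical, expressed purely in terms of probabilities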

Most probability statements take place in the context of a hypothesis (or null hypothesis). For example, frequentist probability is the probability of the data given the null hypothesis. Hence, most inferential statistics involve conditional probability.

Distributions

The set of observations in a sample can be represented by a sampling or frequency distribution. A frequency distribution (or just distribution) represents how often observations in certain ranges occur. For example, how many male koalas in the sample weigh between 10 and 11kg, or how many weigh more than 12kg. Such a sampling distribution can also be expressed in terms of the probability (long-run likelihood or chance) of encountering observations within certain ranges.

Probability distributions are also known as density distributions and their mathematical representations are known as density functions. For discrete outcomes (integers, such as the number of eggs laid by female silver gulls [range from 0-8]), the density represents the frequency of a certain outcome (clutch size) divided by the total number of observations (examined clutches). The following figures represent the frequency (left) and density (right) of clutch sizes from 100 nests.

Frequency distribution | Density distribution
[Figure: plot of chunk discreteDist] | [Figure: plot of chunk discreteDist1]

For continuous outcomes, all the sample values are likely to be unique (at least in theory) and therefore every outcome has a frequency of exactly 1. As the sample size approaches infinity, the probability of any single point value therefore approaches zero.

So we instead break the continuum into small equal-sized chunks and calculate the frequency of values within each chunk (akin to a histogram). To normalize these data such that the histogram encloses an area of exactly one, we divide each frequency by the total number of observations and by the chunk width.

Clearly the accuracy of the density (probability) will depend on the size of the chunk selected. The smaller the chunk, the greater the accuracy. Alternatively, integrating the density function produces an exact solution. Probability from continuous distributions is thence based on areas under the density function and is undefined for a single point along the curve.

For example, the probability of encountering a male koala weighing more than 12kg is equal to the proportion of male koalas in the sample that weighed greater than 12kg. A frequency distribution expressed in this way is referred to as a probability distribution.

When a frequency distribution can be described by a mathematical function, the probability distribution is a curve. The total area under this curve is defined as 1 and thus the area under sections of the curve represents the probability of values falling in the associated interval. Note that it is not possible to determine the probability of a single exact value (such as the probability of encountering a koala weighing exactly 12.183kg), only of ranges of values.
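
To make the area-under-the-curve idea concrete, the following sketch uses the koala example with purely illustrative parameter values (a mean of 10kg and a standard deviation of 1.5kg are assumptions, not estimates from real data).

    # illustrative parameters only
    mu    <- 10     # assumed mean male koala weight (kg)
    sigma <- 1.5    # assumed standard deviation (kg)

    # P(weight > 12kg): the area under the normal density curve to the right of 12
    pnorm(12, mean = mu, sd = sigma, lower.tail = FALSE)

    # P(10kg < weight < 11kg): the area between 10 and 11
    pnorm(11, mean = mu, sd = sigma) - pnorm(10, mean = mu, sd = sigma)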

Continuous distributions

The normal (Gaussian) distribution

It has long been observed that the accumulation of a very large set of independent random influences tends to converge upon a central value (the central limit theorem) and that the distribution of such accumulated values follows a specific `bell shaped' curve called a normal or Gaussian distribution. The normal distribution is a symmetrical distribution in which values close to the center of the distribution are more likely and progressively larger and smaller values are less commonly encountered.

$$f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$ At first, this might appear to be a very daunting formula. It essentially defines the density (relative frequency) of any value of $x$. The exact shape of the distribution is determined by just two parameters:
  • $\mu$ - the mean. This defines the center of the distribution, the location of the peak.
  • $\sigma^2$ - the variance (or $\sigma$, the standard deviation) which defines the variability or spread of values around the mean.
[Figure: plot of chunk GaussianDistribution]

Important properties of the Gaussian distribution:

  • There is no relationship between the distribution's mean (location) and variance - they are independent of one another.
  • It is symmetric and unbounded and thus defined for all real numbers in the range of ($-\infty$,$\infty$).
  • Governed by the central limit theorem
    • averages tend to converge to a central limit (see the simulation sketch below)
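
The central limit theorem is easy to demonstrate by simulation. The following sketch draws values from a decidedly non-normal (exponential) distribution, yet the averages of those values are approximately normally distributed; the sample sizes are arbitrary.

    set.seed(1)
    # 10000 means, each calculated from 30 values drawn from a skewed (exponential) distribution
    sample.means <- replicate(10000, mean(rexp(30, rate = 1)))
    hist(sample.means, breaks = 50)    # approximately bell-shaped
    mean(sample.means)                 # close to the population mean of 1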

Many biological measurements (such as weights, lengths etc) are likewise influenced by an almost infinite number of factors (many of which can be considered independent and random), and thus many biological variables also follow a normal distribution. The Gaussian distribution is particularly well suited to representing the distribution of variables whose values are either

  • considerably larger (or smaller) than zero (e.g. koala mass) or
  • have no theoretical limits (e.g. difference in masses between sibling fledglings)

Even discrete responses (such as counts that can only logically be positive integers) can occasionally be approximately described by a Gaussian distribution, particularly if the samples are very large and the values are free from boundary conditions (such as being close to the lower limit of 0), or if we are dealing with average counts.

Since many scientific variables behave according to the central limit theorem, many of the common statistical procedures have been derived specifically under (and thus assume) the condition that the underlying distribution from which the data are drawn is normal. Specifically, inference and hypothesis tests from simple parametric tests (regression, ANOVA etc) assume that the residuals (the stochastic, unexplained components of the data) are normally distributed around a mean of zero. The reliability of such tests depends on the degree of conformity to this assumption of normality. Likewise, many other statistical elements rely on normal distributions, and thus the normal distribution (or variants thereof) is one of the most important mathematical distributions.

Log-normal distribution

Many biological variables have a lower limit of zero (at least in theory). For example, a koala cannot weigh less than 0kg or there cannot be less than 0mm of rain in a month. Such circumstances can result in asymmetrical distributions that are highly truncated towards the left with a long right tail.

In such cases, the mean and median present different values (the latter arguably more reflective of the 'typical' value). These distributions can often be described by a log-normal distribution. Furthermore, some variables do not naturally vary on a linear scale. For example, growth rates or chemical concentrations might naturally operate on logarithmic or exponential scales. Consequently, when such data are collected on a linear scale, they might be expected to follow a non-normal (perhaps log-normal) distribution.

$$f(x;\mu,\sigma) = \frac{1}{x\sigma\sqrt{2\pi}} e^{-\frac{(\ln x-\mu)^2}{2\sigma^2}}$$ As with the Gaussian distribution, the exact shape of the log-normal distribution is determined by just two parameters:
  • $\mu$ - the mean. This defines the center of the distribution, the location of the peak.
  • $\sigma^2$ - the variance (or $\sigma$, the standard deviation) which defines the variability or spread of values around the mean.
However, $\mu$ and $\sigma^2$ are the mean and variance of $ln(x)$ rather than $x$.
[Figure: plot of chunk Log_normalDistribution]

Important properties of the log-normal distribution:

  • The variance is related (proportional) to the mean ($\sigma^2 \sim \mu^2$)
  • The log-normal distribution is skewed to the right as a result of being bounded at 0, yet unbounded to the right ($0$, $\infty$)
  • Also governed by the central limit theorem, except that it describes the distribution of values that are the product (rather than the sum) of a large number of independent random factors.
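
Since $\mu$ and $\sigma^2$ pertain to $ln(x)$, the logarithm of log-normal values should itself be normally distributed. The following sketch (with arbitrary parameter values) checks this with simulated data.

    set.seed(1)
    # meanlog and sdlog are the mean and standard deviation of ln(x), not of x
    x <- rlnorm(10000, meanlog = 2, sdlog = 0.5)
    hist(x, breaks = 50)          # right-skewed and bounded at zero
    hist(log(x), breaks = 50)     # approximately normal
    mean(log(x)); sd(log(x))      # approximately 2 and 0.5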

Gamma distribution

The Gamma distribution describes the distribution of waiting times until a specific number of independent events (typically deaths) have occurred. For example, if the average mortality rate is one individual per five days (rate=1/5 or scale=5), then a Gamma distribution could be used to describe the distribution of expected waiting time before 10 individuals were dead.

There are two parameterizations of the Gamma distribution:
  • in terms of shape ($k$) and scale ($\theta$) $$f(x;k,\theta) = \frac{1}{\theta^k}\frac{1}{\Gamma(k)}x^{k-1}e^{-\frac{x}{\theta}}\\ \text{for}~x\gt 0~\text{and}~k,\theta\gt 0 $$
  • in terms of shape ($\alpha$) and rate ($\beta$) $$f(x;\alpha,\beta) = \beta^\alpha\frac{1}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}\\ \text{for}~x\gt 0~\text{and}~\alpha,\beta\gt 0$$
[Figure: plot of chunk gammaDistribution]
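
The two parameterizations are interchangeable (rate = 1/scale), which can be checked with R's dgamma() (it accepts either). The shape and scale values below simply mirror the mortality example above (10 deaths at a rate of one per five days).

    # waiting time (days) until 10 deaths, at an average of one death per five days
    x <- seq(1, 150, by = 0.5)
    d.scale <- dgamma(x, shape = 10, scale = 5)
    d.rate  <- dgamma(x, shape = 10, rate = 1/5)
    all.equal(d.scale, d.rate)          # TRUE - the two parameterizations agree
    plot(x, d.scale, type = "l")        # right-skewed distribution of waiting times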

In addition to being used to describe the distribution of waiting times, the gamma distribution can also be used as an alternative to the normal distribution when data (residuals) are skewed with a long right tail, such as when there is a relationship between mean and variance. When such data are modeled with a normal distribution, illogical negative predicted values can occur. Such values are not possible from a Gamma distribution.

The Gamma distribution is also an important conjugate prior for the precision (the inverse of the variance) of a normal distribution in Bayesian modeling.

Important properties of the Gamma distribution:

  • The shape parameter defines the number of events (for example, 10 deaths) and can technically be any positive number.
    • for shape values less than 1, the gamma distribution has a mode of 0
    • for a shape value equal to 1, the gamma distribution is equivalent to the exponential distribution
    • for shape values greater than 1, the distribution becomes increasingly more symmetrical and approaches a normal distribution when the shape parameter is large.
  • The scale or rate (rate=1/scale) parameter defines how long, on average, we expect to wait between events (scale) or the rate at which events are expected to occur (rate)
  • The variance is related to the mean ($\text{variance}=\text{scale}\times\text{mean}$, $\text{variance}=\frac{\text{mean}^2}{\text{shape}}$)

Uniform distribution

The uniform distribution describes a flat (rectangular) distribution within a specific range: every value within the range is equally likely.

$$f(x;a,b) = \begin{cases} \frac{1}{b-a} & \text{for } a \leq x \leq b,\\[1em] 0 & \text{for } x \lt a \text{ or } x \gt b \end{cases}$$
[Figure: plot of chunk uniformDistribution]

Important properties of the uniform distribution:

  • Has a constant probability density of $\frac{1}{b-a}$ within the range $a\le x\le b$ and zero outside of this range
  • Whilst this distribution is rarely employed in frequentist statistics, it is an important vague prior distribution for precision (variance) in Bayesian modeling.

Exponential distribution

The exponential distribution describes the distribution of waiting times for the occurrence of a single discrete event (such as an individual death) given a constant rate (probability of occurrence per unit of time) - for example, describing longevity or the time elapsed between events (such as whale sightings). It is also useful for describing the distribution of measurements that naturally attenuate (exponentially), such as light levels penetrating to increasing water depths.

$$f(x;\lambda) = \lambda e^{-\lambda x}$$ The exponential distribution is defined by a single parameter:
  • $\lambda$ - the rate. The rate at which the event is expected to occur. The larger the rate, the steeper the curve.
[Figure: plot of chunk exponentialDistribution]

Important properties of the exponential distribution:

  • It is bounded by 0 on the left and limitless on the right ($0$, $\infty$).
  • The mean and variance are both related to the rate ($\text{mean}=\frac{1}{\lambda}$, $\text{variance}=\frac{1}{\lambda^2}$).
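
These properties can be checked by simulation; the rate below (one event per five days) is an arbitrary illustrative value.

    set.seed(1)
    lambda <- 1/5                      # e.g. one event per five days
    x <- rexp(100000, rate = lambda)
    mean(x); 1/lambda                  # both approximately 5
    var(x);  1/lambda^2                # both approximately 25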

Beta distribution

The beta distribution describes the probability of success in a binomial trial and is a continuous distribution defined on a range that is bounded at both ends ($0$-$1$). As it operates in the range of $0$-$1$, it is ideal for modeling proportions and percentages. However, it is also useful for modeling other continuous quantities measured on a finite scale. The values are transformed (see Transformations) from the arbitrary finite scale to the $0$-$1$ scale, modeled with a beta distribution, and finally the parameters are back-transformed onto the original scale.

$$f(x;a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}x^{a-1}(1-x)^{b-1}$$ The beta distribution is defined by two shape parameters:
  • $a$ - shape parameter 1, interpretable as the number of successes in a binomial trial plus one ($a-1$ successes)
  • $b$ - shape parameter 2, interpretable as the number of failures in a binomial trial plus one ($b-1$ failures)
[Figure: plot of chunk betaDistribution]

The beta distribution is also a conjugate prior for binomial, Bernoulli and geometric distributions.

Important properties of the beta distribution:

  • It is bounded by 0 on the left and 1 on the right ($0$, $1$).
  • When $a=b$, the distribution is symmetric about $x=0.5$
  • When $a=b=1$, the distribution is the uniform distribution on the interval ($0$, $1$).
  • The location of the peak shifts towards 0 when $a\lt b$ and towards 1 when $a\gt b$.
  • The variance of the distribution is inversely proportional to the total of $a+b$ (the number of trials).
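
The listed properties are easy to verify with dbeta(); the shape values below are arbitrary.

    x <- seq(0.01, 0.99, by = 0.01)
    plot(x, dbeta(x, 3, 3), type = "l", ylim = c(0, 2.6))   # a = b: symmetric about 0.5
    all.equal(dbeta(x, 1, 1), dunif(x))                     # TRUE: a = b = 1 is the uniform on (0, 1)
    lines(x, dbeta(x, 2, 5), lty = 2)                       # a < b: peak shifted towards 0
    lines(x, dbeta(x, 5, 2), lty = 3)                       # a > b: peak shifted towards 1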

Discrete distributions

Binomial distribution

The binomial distribution describes the number of 'successes' out of a total of $n$ independent trials, each with a set probability of success. On any given trial, only two (binary) outcomes are possible (0 and 1) - that is, each trial is a Bernoulli trial. Importantly, the binomial distribution is bounded at both ends - at zero on the left and at the trial size on the right. Typical binomial examples include:

  • the number of surviving individuals from a pool of individuals
  • the number of infected individuals from a pool of individuals
  • the number of items of a particular class (e.g. males) from a pool of items

$$f(x;n,p) = \left(\begin{array}{c} n\\x \end{array}\right)p^{x}(1-p)^{n-x}$$ The binomial distribution is defined by two parameters:
  • $n$ - the total number of trials
  • $p$ - the probability of success on any given trial, defined as any real number between 0 and 1.
The $\left(\begin{array}{c} n\\x \end{array}\right)$ component is a normalizing constant that defines the number of ways of drawing $x$ items out of $n$ trials and also ensures that all probabilities add up to 1.
[Figure: plot of chunk binomialDistribution]
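
The formula can be applied directly, or via R's built-in dbinom(); the values of $n$, $p$ and $x$ below are arbitrary.

    n <- 10; p <- 0.3; x <- 3
    choose(n, x) * p^x * (1 - p)^(n - x)    # the formula written out directly
    dbinom(x, size = n, prob = p)           # identical, via the built-in density
    sum(dbinom(0:n, size = n, prob = p))    # probabilities over all outcomes sum to 1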

Important properties of the binomial distribution:

  • It is bounded by 0 on the left and by $n$ (the number of trials/individuals/quadrats etc) on the right ($0$, $n$).
  • Variance is proportional to $n$ and related to the mean in that the larger the sample size, the larger the variance.
  • Variance is greatest when $p=0.5$ and decreases as $p$ approaches 0 or 1.
  • When $n$ is large and $p$ is away from 0 or 1, the binomial distribution approaches a normal distribution
  • When $n$ is large and $p$ is small, the binomial distribution approaches a Poisson distribution

Poisson distribution

The Poisson distribution describes the number (counts) of independent discrete items or events (individuals, times, deaths) recorded for a given effort. The Poisson distribution is defined by a single parameter ($\lambda$) that describes the expected count (mean) as well as the variance of the count. The Poisson distribution is bounded at the lower end by zero, yet theoretically unbounded at the upper end ($0$,$\infty$).

The Poisson distribution is particularly appropriate for modeling count data as they are always truncated at zero, have no upper limit and tend to get more variable with increasing mean.

$$f(x;\lambda) = \frac{e^{-\lambda}\lambda^x}{x!}$$ The Poisson distribution is defined by a single parameter:
  • $\lambda$ - the expected value (count)
[Figure: plot of chunk poissonDistribution]
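
A brief sketch with an arbitrary expected count illustrates the density and the equality of the mean and variance.

    lambda <- 3                     # arbitrary expected count
    dpois(0:10, lambda = lambda)    # P(x) for counts 0 to 10

    set.seed(1)
    x <- rpois(100000, lambda)
    mean(x); var(x)                 # both approximately equal to lambda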

Important properties of the Poisson distribution:

  • It is bounded by 0 on the left and unbounded on the right ($0$, $\infty$).
  • Mean and variance are both equal to $\lambda$.
  • When $\lambda$ is large, the Poisson distribution approaches a normal distribution

Negative binomial distribution

The negative binomial distribution describes the number of failures in a sequence of independent Bernoulli trials (each with a set probability of success) before a given number of successes is obtained. The negative binomial is a useful alternative to the Poisson distribution for modeling count data for which the variance is greater than the mean (particularly when caused by a heterogeneous/patchy/clumped response). The negative binomial distribution is bounded at the lower end by zero, yet theoretically unbounded at the upper end ($0$,$\infty$).

There are two parameterizations of the negative binomial distribution:
  • in terms of the size ($n$) and probability ($p$) $$f(x;n,p) = \frac{(n+x-1)!}{(n-1)!\,x!}p^{n}(1-p)^x$$
    • $n$ - the number of successes to occur before stopping the count of failures. $n$ acts as a stopping point in that failures are counted until $n$ successes are encountered.
    • $p$ - the probability of success on any single trial.
  • in terms of the mean ($\mu=n(1-p)/p$) and an overdispersion parameter or scaling factor ($\omega$). This parameterization is more meaningful in ecology. $$f(x;\mu,\omega) = \frac{\Gamma(\omega+x)}{\Gamma(\omega)x!}\frac{\mu^x\omega^\omega}{(\mu+\omega)^{\omega+x}}$$
    • $\mu$ - the mean (expected number of failures).
    • $\omega$ - the dispersion or scaling factor.
[Figure: plot of chunk negativebinomialDistribution]
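
In R, dnbinom() accepts both parameterizations: size and prob correspond to $n$ and $p$, while size and mu correspond to $\omega$ and $\mu$ above. The values below are arbitrary.

    # size/prob parameterization: failures before 'size' successes
    dnbinom(0:5, size = 5, prob = 0.5)

    # mean/dispersion parameterization: mu is the expected count, size is omega
    mu <- 5; omega <- 2
    set.seed(1)
    x <- rnbinom(100000, size = omega, mu = mu)
    mean(x)                      # approximately mu
    var(x); mu + mu^2/omega      # both approximately 17.5 (variance > mean)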

Important properties of the negative binomial distribution:

  • It is bounded by 0 on the left and unbounded on the right ($0$, $\infty$).
  • The variance is related to the mean ($\sigma^2=\mu+\mu^2/\omega$) - variance increases with increasing mean.

Scale transformations

The above section on distributions illustrates the main distributions that are useful in ecology. Provided data have been collected in an unbiased manner and from well-defined populations, they usually follow one of the above distributions. When data do not comply well with one of the above distributions, it is often possible to transform the scale of those data so that they are better approximated by one of these distributions. For example, data measured on a percentage scale of 0 to 100 can easily be transformed onto a scale of 0-1 (for a beta distribution) by dividing the observations by 100.

Essentially, data transformation is the process of converting the scale on which the observations were measured into another scale. I will demonstrate the principles of data transformation with two simple examples. Firstly, to illustrate the legitimacy and commonness of data transformations, imagine you had measured water temperature in a large number of streams. Let's assume that you measured the temperature in $\,^{\circ}\mathrm{C}$. Suppose you later required the temperatures to be in $\,^{\circ}\mathrm{F}$. You would not need to re-measure the stream temperatures. Rather, each of the temperatures could be converted from one scale ($\,^{\circ}\mathrm{C}$) to the other ($\,^{\circ}\mathrm{F}$). Such transformations are very common.

Imagine now that a botanist wanted to examine the leaf size of a particular species. The botanist decides to measure the length of a random selection of leaves using a standard linear, metric ruler and the distribution of sample observations is illustrated in the upper left hand figure of the following.

[Figure: plot of chunk Transforms1]

The growth rate of leaves might be expected to be greatest in small leaves and decelerate with increasing leaf size. That is, the growth rate of leaves might be expected to be logarithmic rather than linear. As a result, the distribution of leaf sizes using a linear scale might also be expected to be non-normal (log-normal). If, instead of using a linear scale, the botanist had used a logarithmic ruler, the distribution of leaf sizes may have been more like that depicted in the figure in the upper right corner.

If the distribution of observations is determined by the scale used to measure the observations, and the choice of scale (in this case the ruler) is somewhat arbitrary (a linear scale is commonly used because we find it easier to understand), then it is justifiable to convert the data from one scale to another after the data have been collected and explored. It is not necessary to re-measure the data on a different scale. Therefore, to normalize the data, the botanist can simply convert the data to logarithms.

The important points in the process of transformation are:

  • The order of the data has not been altered (a large leaf measured on a linear scale is still a large leaf on a logarithmic scale), only the spacing of the data has changed
  • Since the spacing of the data is purely dependent on the scale of the measuring device, there is no reason why one scale is more correct than any other scale
  • For the purpose of normalization, data can be converted from one scale to another

The purpose of scale transformation is purely to normalize the data so as to satisfy the underlying assumptions of a statistical analysis. As such, it is possible to apply any function to the data. Nevertheless, certain data types respond more favourably to certain transformations due to characteristics of those data types. Common transformations into an approximate normal distribution as well as the R syntax are provided in the following table.

Nature of the data | Transformation | R syntax
Measurements (lengths, weights, etc) | $\log_e$ (natural log) | log(x)
 | $\log_{10}$ (log base 10) | log(x, 10) or log10(x)
 | $\log(x+1)$ | log(x+1)
Counts (number of individuals etc) | $\sqrt{~}$ (square root) | sqrt(x)
Percentages (must be proportions) | $\arcsin\sqrt{~}$ (angular) | asin(sqrt(x))*180/pi
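
The following sketch applies each of the tabulated transformations to simulated data (the simulated variables are illustrative only, not real measurements).

    set.seed(1)
    leaf.length <- rlnorm(200, meanlog = 2, sdlog = 0.6)    # right-skewed measurements
    hist(leaf.length)                  # skewed on the original (linear) scale
    hist(log(leaf.length))             # approximately normal on a log scale

    counts <- rpois(200, lambda = 4)   # count data
    hist(sqrt(counts))                 # square-root transformation

    props <- runif(200)                # proportions (0-1 scale)
    hist(asin(sqrt(props))*180/pi)     # arcsine (angular) transformation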

Estimates

Measures of location

Measures of location describe the center of a distribution and thus characterize the typical value of a population. There are many different measures of location (see Table below), all of which yield identical values (in the center of the distribution) when the population (and sample) follows an exactly symmetrical distribution. Whilst the mean is highly influenced by unusually large or small values (outliers) and skewed distributions, the median is more robust. The greater the degree of asymmetry and outliers, the more disparate the different measures of location.

Parameter | Description | R syntax
Estimates of location
  Arithmetic mean ($\mu$) | the sum of the values divided by the number of values ($n$) | mean(x)
  Trimmed mean | the arithmetic mean calculated after a fraction (typically 0.05 or $5\%$) of the lower and upper values has been discarded | mean(x, trim=0.05)
  Winsorized mean | the arithmetic mean calculated after the trimmed values are replaced by the upper and lower trimmed quantiles | library(psych); winsor(x, trim=0.05)
  Median | the middle value | median(x)
  Minimum, maximum | the smallest and largest values | min(x), max(x)
Estimates of spread
  Variance ($\sigma^2$) | the average squared deviation of observations from the mean | var(x)
  Standard deviation ($\sigma$) | the square-root of the variance | sd(x)
  Median absolute deviation | the median difference of observations from the median value | mad(x)
  Inter-quartile range | the difference between the 75% and 25% ranked observations | IQR(x)
Precision and confidence
  Standard error of the mean ($s_\bar{y}$) | the precision of the estimate $\bar{y}$ | sd(x)/sqrt(length(x))
  95% confidence interval of $\mu$ | the interval with a 95% probability of containing the true mean | library(gmodels); ci(x)

Measures of dispersion and variability

In addition to having an estimate of the typical value (center of a distribution), it is often desirable to have an estimate of the spread of the values in the population. That is, do all Victorian male koalas weigh the same or do the weights differ substantially?

In its simplest form, the variability, or spread, of a population can be characterized by its range (difference between maximum and minimum values). However, as ranges can only increase with increasing sample size, sample ranges are likely to be a poor estimate of population spread.

Variance ($s^2$) describes the typical deviation of values from the typical (mean) value: $$s^2=\sum{\frac{(y_i-\bar{y})^2}{n-1}}$$ Note that, by definition, the mean value must be in the center of all the values, and thus the sum of the positive and negative deviations will always be zero. Consequently, the deviations are squared prior to summing. Unfortunately, this results in the units of the spread estimate being different to the units of location. The standard deviation (the square-root of the variance) rectifies this issue.

Note also that population variance (and standard deviation) estimates are calculated with a denominator of $n-1$ rather than $n$. The reason for this is that, since the sample values are likely to be more similar to the sample mean (which is of course derived from those values) than to the fixed, yet unknown, population mean, a sample variance calculated with a denominator of $n$ will tend to underestimate the population variance. That is, such sample variances and standard deviations are biased estimates of the population parameters. Ideally, the mean and variance would be estimated from two different independent samples; however, this is not practical in most situations. Division by $n-1$ rather than $n$ is an attempt to partly offset this bias.
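
This bias is simple to demonstrate by simulation: dividing by $n$ systematically underestimates the population variance, whereas R's var() (which divides by $n-1$) does not. The population values below are arbitrary.

    set.seed(1)
    n <- 5
    # population: mean 10, standard deviation 2, so variance = 4
    v.n  <- replicate(10000, {x <- rnorm(n, 10, 2); sum((x - mean(x))^2)/n})
    v.n1 <- replicate(10000, var(rnorm(n, 10, 2)))    # var() divides by n-1
    mean(v.n)     # noticeably less than 4 (biased)
    mean(v.n1)    # close to 4 (approximately unbiased)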

There are more robust (less sensitive to outliers) measures of spread including the inter-quartile range (difference between 75% and 25% ranked observations) and the median absolute deviation (MAD: the median difference of observations from the median value).

Measures of the precision of estimates - standard errors and confidence intervals

Since sample statistics are used to estimate population parameters, it is also desirable to have a measure of how good the estimates are likely to be. For example, how well the sample mean is likely to represent the true population mean. The proximity of an estimated value to the true population value is its accuracy.

Clearly, as the true value of the population parameter is never known (hence the need for statistics), it is not possible to determine the accuracy of an estimate. Instead, we measure the precision (repeatability, consistency) of the estimate. Provided an estimate is repeatable (likely to be obtained from repeated samples) and that the sample is a good, unbiased representative of the population, a precise estimate should also be accurate.

Strictly, precision is measured as the degree of spread (standard deviation) in a set of sample statistics (e.g. means) calculated from multiple samples and is called the standard error. The standard error can be estimated from a single sample by dividing the sample standard deviation by the square-root of the sample size ($\frac{s}{\sqrt{n}}$). The smaller the standard error of an estimate, the more precise the estimate is and thus the closer it is likely to approximate the true population parameter.

The central limit theorem (which states that a set of averaged values drawn from the same population will converge towards being normally distributed) suggests that the distribution of repeated sample means should follow a normal distribution and can thus be described by its overall mean and standard deviation (= standard error). In fact, since the standard error of the mean is estimated from the same single sample as the mean, its distribution follows a special type of normal distribution called a t-distribution.

In accordance with the properties of a normal distribution (and thus a t-distribution with infinite degrees of freedom), 68.27% of the repeated means fall within one sample standard error of the true mean (see Figure below). Put differently, we are 68.27% confident that the interval bounded by the sample mean plus and minus one standard error will contain the true population mean. Of course, the smaller the sample size (the lower the degrees of freedom), the flatter the t-distribution and thus the lower the level of confidence for a given span of values (interval).

This concept can easily be extended to produce intervals associated with other degrees of confidence (such as 95%) by determining the percentiles (and thus the number of standard errors away from the mean) between which the nominated percentage (e.g. 95%) of the values lie. The 95% confidence interval is thus defined as: $$P\{\bar{y}-t_{0.05(n-1)}s_{\bar{y}}\le\mu\le\bar{y}+t_{0.05(n-1)}s_{\bar{y}}\}$$ where $\bar{y}$ is the sample mean, $s_{\bar{y}}$ is the standard error, $t_{0.05(n-1)}$ is the two-tailed critical value of a t distribution with $n-1$ degrees of freedom at $\alpha=0.05$, and $\mu$ is the unknown population mean.

For a 95% confidence interval, there is a 95% probability that the interval will contain the true mean. Note, this interpretation is about the interval, not the true population value, which remains fixed (albeit unknown). The smaller the interval, the more confidence is placed in inferences about the estimated parameter.
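
A minimal sketch of calculating the standard error and a 95% confidence interval by hand (using qt() for the critical t value), alongside the ci() helper from the gmodels package listed in the table above. The sample here is simulated purely for illustration.

    set.seed(1)
    x <- rnorm(30, mean = 12, sd = 2)     # an illustrative sample (n = 30)

    n  <- length(x)
    se <- sd(x)/sqrt(n)                   # standard error of the mean
    t.crit <- qt(0.975, df = n - 1)       # two-tailed 5% critical t value
    mean(x) + c(-1, 1) * t.crit * se      # 95% confidence interval

    # library(gmodels); ci(x)             # the same interval via gmodels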

[Figure: plot of chunk NormalDistribution]

The left hand figure above illustrates a Normal distribution displaying percentage quantiles (grey) and probabilities (areas under the curve) associated with a range of standard deviations beyond the mean. The right hand figure displays 20 possible 95% confidence intervals from 20 samples ($n=30$) drawn from the one population. Bold intervals are those that do not include the true population mean. In the long run, 5% of such intervals will not include the population mean ($\mu$).

Degrees of freedom

The concept of degrees of freedom is sufficiently abstract and foreign to those new to statistical principles that it warrants special attention. The degrees of freedom refers to how many observations in a sample are `free to vary' (theoretically take on any value) when calculating independent estimates of population parameters (such as population variance and standard deviation).

In order for any inferences about a population to be reliable, each population parameter estimate (such as the mean and the variance) must be independent of one another. Yet they are usually all obtained from a single sample and to estimate variance, a prior estimate of the mean is required. Consequently, mean and variance estimated from the same sample cannot strictly be independent of one another.

When estimating the population variance (and thus standard deviation) from sample observations, not all of the observations can be considered independent of the estimate of population mean. The value of at least one of the observations in the sample is constrained (not free to vary).

If, for example, there were four observations in a sample with a mean of 5, then the first three of these could theoretically take on any value, yet the fourth value must be such that the sum of the values is still 20.

The degrees of freedom therefore indicate how many independent observations are involved in the estimation of a population parameter. A `cost' of a single degree of freedom is incurred for each prior estimate required in the calculation of a population parameter.

The shape of the probability distributions of coefficients (such as those in linear models etc) and statistics depend on the number of degrees of freedom associated with the estimates. The greater the degrees of freedom, the narrower the probability distribution and thus the greater the statistical power. Power is the probability of detecting an effect if an effect genuinely occurs.

Degrees of freedom (and thus power) are positively related to sample size (the greater the number of replicates, the greater the degrees of freedom and power) and negatively related to the number of variables and prior required parameters (the greater the number of parameters and variables, the lower the degrees of freedom and power).


Exponential family of distributions

The exponential family is a class of distributions (both continuous and discrete) that can be characterized by two parameters. One of these parameters (the location parameter) is a function of the mean and the other (the dispersion parameter) is a function of the variance of the distribution. Note that recent developments have further extended generalized linear models to accommodate other, non-exponential residual distributions.

End of instructions