Workshop 2.2 - Data importation and exploratory data analysis

09 Dec 2014

Working with data sets
- Exercise 2
- Exercise 3
Exploratory data analysis
Basic hypothesis testing
Basic power analysis
- Exercise 11

Basic statistics references

Logan (2010) - Chpt 1, 2 & 6
Quinn & Keough (2002) - Chpt 1, 2, 3 & 4

Data sets - Data frames (R)

Rarely is only a single biological variable collected. Data are usually collected in sets of variables reflecting tests of relationships, differences between groups, multiple characterizations etc. Consequently, data sets are best organized into collections of variables (vectors). Such collections are called data frames in R.

Data frames are generated by combining multiple vectors, with each vector becoming a separate column in the data frame. In order for a data frame to represent the data properly, the sequence in which observations appear must be the same for each vector (variable), and each vector must have the same number of observations. For example, the first observation in each of the vectors to be included in the data frame must have been collected from the same sampling unit.

To demonstrate the use of data frames in R, we will use fictitious data representing the areas of leaves of two species of Japanese Boxwood.

Format of the fictitious data set

PLANT	SPECIES	AREA
P1	B.semp	25
P2	B.semp	22
P3	B.semp	29
P4	B.micro	15
P5	B.micro	17
P6	B.micro	20

PLANT	An identifier for each individual plant that was measured (a single leaf was measured from each individual plant)
SPECIES	Categorical listing of whether the individual plant was Buxus sempervirens (B.semp) or Buxus microphylla (B.micro)
AREA	The surface area (mm²) of the leaf measured - Response variable

Q1-1. Let's create the data set in a series of steps. Use the textbox provided in part g below to record the R syntax used in each step.

a. First create the categorical (factor) variable containing the listing of B.semp three times and B.micro three times.
b. Now create the dependent variable (numeric vector) containing the leaf areas.
c. Combine the two variables (vectors) into a single data set (data frame) called LEAVES.
d. Print (to the screen) the contents of this new data set called LEAVES.
e. You will have noticed that the names of the rows are listed as 1 to 6 (this is the default). In the table above, there is a variable called PLANT that lists unique plant identification labels. These labels are of no use for any statistics; however, they are useful for identifying particular observations. Consequently, it would be good to incorporate these labels as row names in the data set. Create a variable called PLANT that contains a listing of the plant identifications.
f. Use this plant identification label variable to define the row names in the data frame called LEAVES.

g. In the textbox provided below, list each of the lines of R syntax required to generate the data set.
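
One possible set of commands for these steps is sketched below. It is only one of several equivalent ways of doing this: gl() is used here to generate the factor, but factor(c("B.semp", ...)) would work equally well, and the object names simply follow those suggested in the steps.

```r
# Factor listing B.semp three times, then B.micro three times
SPECIES <- gl(2, 3, labels = c("B.semp", "B.micro"))

# Numeric vector of leaf areas (mm^2), in the same plant order
AREA <- c(25, 22, 29, 15, 17, 20)

# Combine the two vectors into a data frame and print it
LEAVES <- data.frame(SPECIES, AREA)
print(LEAVES)

# Plant identification labels P1 to P6
PLANT <- paste("P", 1:6, sep = "")

# Use the labels as row names, then print again to check
row.names(LEAVES) <- PLANT
print(LEAVES)
```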

The above syntax forms a list of instructions that R can perform. Such lists are called scripts. Scripts offer the following advantages:

Enable a sequence of tasks such as data entry, analysis and graphical preparation to be repeated quickly and precisely
Ensure that the sequence of tasks used to complete an analysis is permanently documented
Simplify performing many similar analyses
Simplify sharing of data, analyses and techniques

Q1-2. To see how to use a script:

Close down R
Restart R
Change the working directory (path) to the location where you saved the script file in Q1-2 above
Source the script file
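
The last two steps might look something like the sketch below. The file name leaves.R is hypothetical, and tempdir() merely stands in for whichever folder you saved your own script to. So that the example runs on its own, it first writes a tiny script file to that folder.

```r
# Stand-in for your saved script: write a few lines of R to a file.
# 'leaves.R' is a hypothetical name; tempdir() stands in for your own folder.
script_dir <- tempdir()
writeLines(c(
  "SPECIES <- gl(2, 3, labels = c('B.semp', 'B.micro'))",
  "AREA <- c(25, 22, 29, 15, 17, 20)",
  "LEAVES <- data.frame(SPECIES, AREA)"
), file.path(script_dir, "leaves.R"))

# Change the working directory, remembering the previous one
old_dir <- setwd(script_dir)
getwd()

# Run every line of the script, as if it had been typed at the prompt
source("leaves.R")

# List the objects the script created
ls()

# Return to the original working directory
setwd(old_dir)
```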

Q1-3. There are now at least four objects in the R workspace.

These should be LEAVES (the data frame - data set), PLANT (the list of plant IDs), SPECIES (the categorical vector of plant species) and AREA (the numeric vector of leaf areas).

Print (list on screen) the contents of the AREA vector. Note that this lists the free-standing AREA vector, which is not the same as the AREA variable within the LEAVES data frame. To see the difference, multiply all of the numbers in the AREA vector by 2, then print the contents of the AREA vector followed by the LEAVES data frame. Notice that only the values in the free-standing AREA vector have changed - the AREA values within the LEAVES data frame were not affected.
To avoid confusion and clutter, it is therefore always best to remove single vectors once a data frame has been created. Remove the PLANT, SPECIES and AREA vectors.
Notice what happens when you now try to access the AREA vector.
To access a variable from within a data frame, we use the $ sign. Print the contents of the AREA variable within the LEAVES data frame.
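
The following sketch walks through these steps. It rebuilds the objects first so that it can be run on its own, and it wraps the deliberate mistake in try() so that the remaining lines still run.

```r
# Rebuild the objects from Q1-1
SPECIES <- gl(2, 3, labels = c("B.semp", "B.micro"))
AREA <- c(25, 22, 29, 15, 17, 20)
PLANT <- paste("P", 1:6, sep = "")
LEAVES <- data.frame(SPECIES, AREA, row.names = PLANT)

# Doubling the free-standing vector leaves the data frame untouched
AREA <- AREA * 2
print(AREA)
print(LEAVES)

# Remove the free-standing vectors now that the data frame exists
rm(PLANT, SPECIES, AREA)

# AREA no longer exists on its own, so this produces an error;
# try() reports the error without stopping the script
try(print(AREA))

# The copy inside the data frame is reached with the $ sign
print(LEAVES$AREA)
```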

Q1-4. Since data are stored in vectors, it is possible to access single entries or specific groups of entries. A specific entry is accessed via its index.

To investigate the range of options, complete the following table.

Access	Syntax
print the LEAVES data set
print the first leaf area in the LEAVES data set
print the first 3 leaf areas in the LEAVES data set
print a list of leaf areas that are greater than 20
print a list of leaf areas for the B.micro species
print the section of the data set that contains the B.micro species
alter the second leaf area from 22 to 23
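
For reference, one way of filling in the table is sketched below. Most rows can be answered in more than one way (for example, LEAVES[1, "AREA"] also returns the first leaf area), so treat these as suggestions rather than the only correct answers.

```r
# Rebuild the LEAVES data frame so the example runs on its own
LEAVES <- data.frame(
  SPECIES = gl(2, 3, labels = c("B.semp", "B.micro")),
  AREA = c(25, 22, 29, 15, 17, 20),
  row.names = paste("P", 1:6, sep = "")
)

# Print the whole data set
LEAVES

# First leaf area, then the first three leaf areas
LEAVES$AREA[1]
LEAVES$AREA[1:3]

# Leaf areas greater than 20 (logical indexing)
LEAVES$AREA[LEAVES$AREA > 20]

# Leaf areas for the B.micro species
LEAVES$AREA[LEAVES$SPECIES == "B.micro"]

# All rows (and all columns) for the B.micro species
LEAVES[LEAVES$SPECIES == "B.micro", ]

# Alter the second leaf area from 22 to 23
LEAVES$AREA[2] <- 23
```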

Q1-5. Although it is possible to do some data editing this way, for more major editing procedures it is better to either return to Excel or use the 'fix()' function.

Use the 'fix()' function to make a number of changes to the data frame (data set) including adding another column of data (that might represent another variable).

Q1-6. Sometimes it is necessary to transform a variable from one scale to another. While it is possible to modify an existing variable (vector), it is safer to create a new variable that contains the altered values. Examine the use of R for common transformations.

Transform the leaf areas to log (base 10).
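
A minimal sketch of this transformation follows. The new column name logAREA is just an illustrative choice; storing the result as a new column keeps the original AREA values intact, as recommended above.

```r
# Rebuild the LEAVES data frame so the example runs on its own
LEAVES <- data.frame(
  SPECIES = gl(2, 3, labels = c("B.semp", "B.micro")),
  AREA = c(25, 22, 29, 15, 17, 20),
  row.names = paste("P", 1:6, sep = "")
)

# Store the log (base 10) leaf areas in a NEW column
# ('logAREA' is a hypothetical name); AREA itself is unchanged
LEAVES$logAREA <- log10(LEAVES$AREA)
LEAVES
```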

Importing data and data files

Although it is possible to generate a data set from scratch using the procedures demonstrated above, data sets are often better managed with spreadsheet software. R is not designed to be a spreadsheet, and thus it is necessary to import data into R. We will use the following small data set (in which the feeding metabolic rate of stick insects fed two different diets was recorded) to demonstrate how a data set is imported into R.

Format of the fictitious data set

PHASMID	DIET	MET.RATE
P1	tough	1.25
P2	tough	1.22
P3	tough	1.29
P4	soft	1.51
P5	soft	1.55
P6	soft	1.48

PHASMID	An identifier for each individual stick insect (Phasmid) that was measured
DIET	Categorical listing of whether the food consumed was considered to be tough or soft
MET.RATE	The feeding metabolic rate (mg O₂/min/g) of phasmids - Response variable

Q2-1. Importing data into R from Excel is a multistage process.

Enter the above data set into Excel and save the sheet as a comma delimited text file (CSV). Ensure that the column titles (variable names) are in the first row and that you take note of where the file is saved. To see the format of this file, open it in Notepad (the Windows accessory program). Notice that it is just a plain text file, with no encryption or special encoding.
Ensure that the current working directory is set to the location of this file.
Read (import) the data set into a data frame. Since data exploration and analysis cannot begin until the data are imported into R, the syntax of this step would usually be on the first line of a new script file that is stored with the comma delimited text file version of the data set.
To ensure that the data have been successfully imported, print the data frame.
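
The import step could be sketched as follows. The file name phasmid.csv and the data frame name PHASMID_DATA are hypothetical, and so that the example runs without Excel, the CSV file is first written from within R to a temporary folder. In practice you would set the working directory to your own folder and read the file that Excel saved.

```r
# Stand-in for the file saved from Excel ('phasmid.csv' is a hypothetical name)
csv_file <- file.path(tempdir(), "phasmid.csv")
writeLines(c(
  "PHASMID,DIET,MET.RATE",
  "P1,tough,1.25",
  "P2,tough,1.22",
  "P3,tough,1.29",
  "P4,soft,1.51",
  "P5,soft,1.55",
  "P6,soft,1.48"
), csv_file)

# Read the CSV file: the first row holds the variable names and the
# first column (PHASMID) is used for the row names
PHASMID_DATA <- read.csv(csv_file, header = TRUE, row.names = 1,
                         stringsAsFactors = TRUE)

# Confirm that the import worked
PHASMID_DATA
str(PHASMID_DATA)
```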

Q2-2. As well as importing files, it is often necessary to save a data set (data frame) - particularly if it has been modified and you wish to retain the changes. To demonstrate how to export a data set, we need a data frame (data set) to export. If the LEAVES data frame (from Exercise 2 above) is no longer present, regenerate the LEAVES data set using the script file that was generated in Q1-2. To export an R data frame to a text file, you need to write the data frame to a file.

Examine the contents of this comma delimited text file using Notepad.
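
A sketch of the export step is given below. The output file name leaves.csv is hypothetical and the file is written to a temporary folder; readLines() is used here only as a stand-in for opening the file in Notepad.

```r
# Rebuild the LEAVES data frame so the example runs on its own
LEAVES <- data.frame(
  SPECIES = gl(2, 3, labels = c("B.semp", "B.micro")),
  AREA = c(25, 22, 29, 15, 17, 20),
  row.names = paste("P", 1:6, sep = "")
)

# Write the data frame to a comma delimited text file
# ('leaves.csv' is a hypothetical name); row names are kept as the first column
out_file <- file.path(tempdir(), "leaves.csv")
write.csv(LEAVES, file = out_file, quote = FALSE, row.names = TRUE)

# Inspect the raw text of the file (a stand-in for opening it in Notepad)
readLines(out_file)
```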

Q2-3. Alternatively, it is also possible to copy and paste data from Excel into R (via the clipboard). Although this method is quicker, there is no record in an R script file as to which Excel file the data originally came from. Furthermore, subsequent changes to the Excel data sheet will not be carried through to R. Read the data in from the clipboard.

Q2-4. Since there is no link between the data and the script when data are imported via the clipboard, it is recommended that the data be stored as a structure within your R script above any commands that use these data. Place a copy of the data within the R script file that you generated earlier.
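
One way to obtain such a structure is with dput(), which prints an R expression that recreates an object. The sketch below assumes the phasmid data have already been imported under the hypothetical name PHASMID_DATA (built directly here so the example runs on its own); the structure() call is the kind of text you would paste near the top of your script.

```r
# Stand-in for data read from the clipboard ('PHASMID_DATA' is a hypothetical name)
PHASMID_DATA <- data.frame(
  DIET = factor(rep(c("tough", "soft"), each = 3)),
  MET.RATE = c(1.25, 1.22, 1.29, 1.51, 1.55, 1.48),
  row.names = paste("P", 1:6, sep = "")
)

# Print an R expression that recreates the data frame
dput(PHASMID_DATA)

# A hand-tidied version of that expression, as it might be pasted into a script.
# Factor levels are stored alphabetically, so soft = 1 and tough = 2.
PHASMID_COPY <- structure(
  list(
    DIET = structure(c(2L, 2L, 2L, 1L, 1L, 1L),
                     levels = c("soft", "tough"), class = "factor"),
    MET.RATE = c(1.25, 1.22, 1.29, 1.51, 1.55, 1.48)
  ),
  class = "data.frame",
  row.names = c("P1", "P2", "P3", "P4", "P5", "P6")
)

# Check that the stored copy matches the imported data
all.equal(PHASMID_DATA, PHASMID_COPY)
```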

Transformation	Syntax
log_e	> new_var <- log(old_var)
log₁₀	> new_var <- log10(old_var)
square root	> new_var <- sqrt(old_var)
arcsin	> new_var <- asin(sqrt(old_var))
scale (mean=0, unit variance)	> new_var <- scale(old_var)

Sample number	Sample mean
1	12.1
2	12.7
3	12.5
Mean of sample means	12.433
SD of sample means	0.306