Tutorial 2.1 - Constructing data frames

07 Mar 2017

data.frame

Data frames are generated by amalgamating vectors of the same length. To illustrate the translation of a data set (a collection of variables) into an R data frame (a collection of vectors), we will use a portion of a real data set from Mac Nally (1996), who investigated the bird communities at 37 sites across five habitats in southeastern Australia. Although the original data set includes the measured maximum density of 102 bird species from the 37 sites, for simplicity's sake only two bird species (GST: gray shrike thrush, EYR: eastern yellow robin) and the first eight sites will be included. The truncated data set comprises a single factorial (or categorical) variable, two continuous variables, and a set of site (row) names, and is as follows:

Site         HABITAT      GST  EYR
Reedy Lake   Mixed        3.4  0.0
Pearcedale   Gipps.Manna  3.4  9.2
Warneet      Gipps.Manna  8.4  3.8
Cranbourne   Gipps.Manna  3.0  5.0
Lysterfield  Mixed        5.6  5.6
Red Hill     Mixed        8.1  4.1
Devilbend    Mixed        8.3  7.1
Olinda       Mixed        4.6  5.3

Firstly, we will generate the three variables (excluding the site labels as they are not variables) separately:

HABITAT <- factor(c('Mixed','Gipps.Manna','Gipps.Manna','Gipps.Manna','Mixed',
'Mixed','Mixed','Mixed'))
GST <- c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6)
EYR <- c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1, 7.1, 5.3)

Next, list the names of the vectors as arguments to the data.frame() function to amalgamate the three separate variables into a single data frame (data set), which we will call MACNALLY (after the author).

MACNALLY <- data.frame(HABITAT, GST, EYR)
MACNALLY
      HABITAT GST EYR
1       Mixed 3.4 0.0
2 Gipps.Manna 3.4 9.2
3 Gipps.Manna 8.4 3.8
4 Gipps.Manna 3.0 5.0
5       Mixed 5.6 5.6
6       Mixed 8.1 4.1
7       Mixed 8.3 7.1
8       Mixed 4.6 5.3
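
The equal-length requirement described above can be checked directly. The following is a minimal sketch, using two small hypothetical vectors (x and y, which are not part of the Mac Nally data), of what happens when vectors of incompatible lengths are passed to data.frame(). The exact wording of the error message may differ between R versions, so it is captured rather than quoted.

```r
# Hypothetical vectors of incompatible lengths (3 and 2)
x <- c(1.2, 3.4, 5.6)
y <- c('a', 'b')

# data.frame() cannot combine them, so capture the error message
# instead of letting the script stop
msg <- tryCatch(
  {
    data.frame(x, y)
    'no error'
  },
  error = function(e) conditionMessage(e)
)
msg

# Vectors of equal length combine without complaint
ok <- data.frame(x, y = c('a', 'b', 'c'))
dim(ok)
```

Note that data.frame() will recycle a shorter vector when its length divides the longer one exactly, so it is safest to construct all vectors at their full length beforehand.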

Notice that each vector (variable) becomes a column in the data frame and that each row represents a single sampling unit (in this case, each row represents a different site). By default, the rows are numbered sequentially from 1 to the number of rows in the data frame. However, these can be altered to reflect the names of the sampling units by assigning a character vector of alternative names to the row.names() (data frame row names) property of the data frame.

row.names(MACNALLY) <- c('Reedy Lake', 'Pearcedale', 'Warneet', 'Cranbourne',
'Lysterfield', 'Red Hill', 'Devilbend', 'Olinda')
MACNALLY
                HABITAT GST EYR
Reedy Lake        Mixed 3.4 0.0
Pearcedale  Gipps.Manna 3.4 9.2
Warneet     Gipps.Manna 8.4 3.8
Cranbourne  Gipps.Manna 3.0 5.0
Lysterfield       Mixed 5.6 5.6
Red Hill          Mixed 8.1 4.1
Devilbend         Mixed 8.3 7.1
Olinda            Mixed 4.6 5.3
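
As an alternative sketch, the row names can also be supplied when the data frame is created, via the row.names argument of data.frame(). Once rows are named, they can be extracted by name as well as by position. The vector name SITES is a hypothetical helper introduced only for this illustration.

```r
# Rebuild the three variables so that the example is self-contained
HABITAT <- factor(c('Mixed','Gipps.Manna','Gipps.Manna','Gipps.Manna','Mixed',
                    'Mixed','Mixed','Mixed'))
GST <- c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6)
EYR <- c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1, 7.1, 5.3)

# SITES is a hypothetical name for the vector of site labels
SITES <- c('Reedy Lake', 'Pearcedale', 'Warneet', 'Cranbourne',
           'Lysterfield', 'Red Hill', 'Devilbend', 'Olinda')

# Supply the row names at construction time
MACNALLY <- data.frame(HABITAT, GST, EYR, row.names = SITES)

# Extract a whole row by site name
MACNALLY['Warneet', ]

# Extract a single value by site name and variable name
MACNALLY['Red Hill', 'GST']

# Extract several sites and variables at once
MACNALLY[c('Olinda', 'Devilbend'), c('GST', 'EYR')]
```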

expand.grid

When the data set contains multiple fully crossed categorical variables (factors), the expand.grid() function provides a convenient way to create the factor vectors.

expand.grid(rep=1:4,B=paste("b",1:2,sep=""),A=paste("a",1:3,sep=""))
   rep  B  A
1    1 b1 a1
2    2 b1 a1
3    3 b1 a1
4    4 b1 a1
5    1 b2 a1
6    2 b2 a1
7    3 b2 a1
8    4 b2 a1
9    1 b1 a2
10   2 b1 a2
11   3 b1 a2
12   4 b1 a2
13   1 b2 a2
14   2 b2 a2
15   3 b2 a2
16   4 b2 a2
17   1 b1 a3
18   2 b1 a3
19   3 b1 a3
20   4 b1 a3
21   1 b2 a3
22   2 b2 a3
23   3 b2 a3
24   4 b2 a3
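
To confirm that the resulting design really is fully crossed, the output of expand.grid() can be stored and inspected. This is a minimal sketch; the object name design is a hypothetical choice. It relies on the default behaviour of expand.grid(), in which character vectors are converted to factors.

```r
# Store the grid rather than just printing it
design <- expand.grid(rep = 1:4,
                      B = paste("b", 1:2, sep = ""),
                      A = paste("a", 1:3, sep = ""))

# One row per combination: 4 replicates x 2 levels of B x 3 levels of A
nrow(design)

# rep stays an integer vector, while B and A become factors by default
sapply(design, class)

# Every A-by-B combination should contain the same number of replicates
table(design$A, design$B)
```

Note that the first argument varies fastest and the last argument slowest, which is why rep cycles through 1 to 4 within each combination of B and A.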

Summarizing data frames

For very small and simple data frames like the MACNALLY example above, the whole data frame can be comfortably displayed in the console. However, for much larger data frames, displaying all the data can be overwhelming and not very useful. There are a number of convenient functions that provide overviews of data. To appreciate the particulars of each routine, as well as the differences between the routines, we will add some other data types to our MACNALLY data.

MACNALLY$Bool <- rep(c(TRUE,FALSE),4)
MACNALLY$Char <- rep(c('Large','Small'),4)
MACNALLY$Date <- seq(as.Date('2000-02-29'),as.Date('2000-05-12'), length.out=8)

summary()

The summary() function is a generic (overloaded) function whose behaviour depends on the class of the object passed to it. When summary() is called with a data.frame, a summary is provided in which:

  • numeric vectors (variables) are summarized by the five-number summary plus the mean and, if there are any missing values, the number of missing values
  • categorical vectors (factors) are tallied, that is, the number of instances of each level is counted
  • boolean (logical) vectors are also tallied
  • character vectors are described only by their length, class and mode
  • date (and POSIX) vectors are summarized in the same way as numeric vectors (five-number summary plus the mean)

summary(MACNALLY)
        HABITAT       GST            EYR           Bool             Char          
 Gipps.Manna:3   Min.   :3.00   Min.   :0.000   Mode :logical   Length:8          
 Mixed      :5   1st Qu.:3.40   1st Qu.:4.025   FALSE:4         Class :character  
                 Median :5.10   Median :5.150   TRUE :4         Mode  :character  
                 Mean   :5.60   Mean   :5.013   NA's :0                           
                 3rd Qu.:8.15   3rd Qu.:5.975                                     
                 Max.   :8.40   Max.   :9.200                                     
      Date           
 Min.   :2000-02-29  
 1st Qu.:2000-03-18  
 Median :2000-04-05  
 Mean   :2000-04-05  
 3rd Qu.:2000-04-23  
 Max.   :2000-05-12  
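
The MACNALLY data contain no missing values, so the extra count of missing values does not appear in the numeric columns above. The following minimal sketch uses a hypothetical vector (GST_NA, a copy of GST with two values replaced by NA) to show how summary() reports missing values, and how it tallies a factor on its own.

```r
# Hypothetical copy of GST with two missing values
GST_NA <- c(3.4, NA, 8.4, 3.0, 5.6, 8.1, NA, 4.6)

# The numeric summary gains an extra "NA's" entry
s <- summary(GST_NA)
s
names(s)

# A factor is tallied by level
f <- summary(factor(c('Mixed', 'Gipps.Manna', 'Mixed')))
f
```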

str()

Like summary(), str() is a generic (overloaded) function. It generally produces a compact view of the structure of an object. When str() is called with a data.frame, this compact view comprises a nested list of abbreviated structures, one per variable.

str(MACNALLY)
'data.frame':	8 obs. of  6 variables:
 $ HABITAT: Factor w/ 2 levels "Gipps.Manna",..: 2 1 1 1 2 2 2 2
 $ GST    : num  3.4 3.4 8.4 3 5.6 8.1 8.3 4.6
 $ EYR    : num  0 9.2 3.8 5 5.6 4.1 7.1 5.3
 $ Bool   : logi  TRUE FALSE TRUE FALSE TRUE FALSE ...
 $ Char   : chr  "Large" "Small" "Large" "Small" ...
 $ Date   : Date, format: "2000-02-29" "2000-03-10" ...
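
The amount of detail that str() prints can be adjusted. The sketch below uses a small hypothetical data frame (df) and the vec.len argument of str(), which controls roughly how many elements of each vector are shown. It also shows that str() can be applied to a single column.

```r
# Small hypothetical data frame for illustration
df <- data.frame(GST = c(3.4, 3.4, 8.4, 3.0),
                 HABITAT = factor(c('Mixed', 'Gipps.Manna',
                                    'Gipps.Manna', 'Gipps.Manna')))

# Default compact view: a header line plus one line per variable
str(df)

# Show fewer elements per variable
str(df, vec.len = 1)

# str() also works on an individual vector
str(df$GST)
```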

glimpse()

The glimpse() function in the tibble package is similar to str() except that it attempts to maximize the amount of data displayed according to the dimensions of the output.

library(tibble)
glimpse(MACNALLY)
Observations: 8
Variables: 6
$ HABITAT <fctr> Mixed, Gipps.Manna, Gipps.Manna, Gipps.Manna, Mixed, Mixed, Mixed, M...
$ GST     <dbl> 3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6
$ EYR     <dbl> 0.0, 9.2, 3.8, 5.0, 5.6, 4.1, 7.1, 5.3
$ Bool    <lgl> TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE
$ Char    <chr> "Large", "Small", "Large", "Small", "Large", "Small", "Large", "Small"
$ Date    <date> 2000-02-29, 2000-03-10, 2000-03-20, 2000-03-31, 2000-04-10, 2000-04-...
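
The number of values glimpse() shows per variable depends on the available width. As a hedged sketch, the width argument of glimpse() can be set explicitly to see this effect. The data frame df is hypothetical, and the call is guarded so that the script still runs when the tibble package is not installed.

```r
# Hypothetical data frame with a long numeric column
df <- data.frame(ID = 1:50, VALUE = round(sqrt(1:50), 2))

# Only call glimpse() when the tibble package is available
if (requireNamespace("tibble", quietly = TRUE)) {
  # Narrow output: fewer values are shown per variable
  tibble::glimpse(df, width = 40)

  # Wider output: more values are shown per variable
  tibble::glimpse(df, width = 80)
}
```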

Others

There are also numerous graphical methods, including View() and fix(). However, I have focused on the script-friendly routines: because the graphical routines require user input, they are inappropriate to include in scripts.
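
A few other base R functions are also script friendly, although they are not covered in detail here. The following minimal sketch applies dim(), names(), head() and tail() to a hypothetical data frame (df) that is too long to print in full comfortably.

```r
# Hypothetical data frame with 100 rows
df <- data.frame(ID = 1:100, VALUE = sqrt(1:100))

# Number of rows and columns
dim(df)

# Variable (column) names
names(df)

# First and last three rows only
head(df, n = 3)
tail(df, n = 3)
```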

Within RStudio, a data frame can be viewed like a spreadsheet. Furthermore, when in R Notebook mode, a simple functioning spreadsheet will be embedded within the notebook.


