Tutorial 2.3 - Data frame vectors

07 Mar 2017

Please note that the following tutorial describes data.frame manipulation from the classical R perspective. Arguably, more intuitive and satisfying outcomes can be achieved with the dplyr and tidyr packages. These will be explored in the next tutorial.

When a data frame is generated from individual vectors (as in the example below), copies of the original vectors, rather than the original vectors themselves, are amalgamated. Consequently, while the vectors contained in the data frame hold the same information (entries) as the original vectors, they are completely distinct from the original vectors.

By way of a motivating example for this tutorial, we will again create an extract of some bird abundances data of Mac Nally (1996).

HABITAT <- factor(c('Mixed','Gipps.Manna','Gipps.Manna','Gipps.Manna','Mixed',
'Mixed','Mixed','Mixed'))
GST <- c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6)
EYR <- c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1, 7.1, 5.3)
MACNALLY <- data.frame(HABITAT,GST,EYR)

The current R workspace will contain the vectors HABITAT, GST and EYR as well as HABITAT, GST and EYR within the MACNALLY data frame. Note that the separate vectors HABITAT, GST and EYR are entirely distinct objects from those within the data frame: modifying one has no effect on the other.

ls()
[1] "EYR"      "GST"      "HABITAT"  "MACNALLY" "my_png"  

To refer to a vector within a data frame, the name of the vector is preceded by the name of the data frame and the two names are separated by a $ character. For example, to refer to the GST and HABITAT vectors of the MACNALLY data frame:

MACNALLY$GST
[1] 3.4 3.4 8.4 3.0 5.6 8.1 8.3 4.6
MACNALLY$HABITAT
[1] Mixed       Gipps.Manna Gipps.Manna Gipps.Manna Mixed       Mixed       Mixed      
[8] Mixed      
Levels: Gipps.Manna Mixed

Any modifications made to the original vectors will not affect the vectors within a data frame. Therefore, it is important to remember to use the data frame prefix. To avoid confusion, it is generally recommended that following the successful generation of the data frame from individual vectors, the original vectors should be deleted.

rm(HABITAT,GST,EYR)

Thereafter, any inadvertent reference to the original vector (GST) rather than the vector within the data frame (MACNALLY$GST) will result in an error indicating that the object does not exist.

GST
Error in eval(expr, envir, enclos): object 'GST' not found
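The independence of a data frame's vectors from the vectors used to build it can be checked directly. The following is a minimal sketch using a hypothetical vector x and data frame dat (neither is part of the Mac Nally example); it modifies each copy in turn and confirms that the other is left untouched.

```r
# Hypothetical vector and data frame, for illustration only
x <- c(1, 2, 3)
dat <- data.frame(x)

# modify the free-standing vector; the column in the data frame is unchanged
x[1] <- 100
x        # now 100, 2, 3
dat$x    # still 1, 2, 3

# modify the column in the data frame; the free-standing vector is unchanged
dat$x[2] <- 50
x        # still 100, 2, 3
dat$x    # now 1, 50, 3
```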

Factor levels

When factors are generated directly using the factor() function to convert character vectors into factors, factor levels are automatically arranged alphabetically. For example, examine the levels of the MACNALLY$HABITAT factor:

levels(MACNALLY$HABITAT)
[1] "Gipps.Manna" "Mixed"      

Although the order of factor levels has no bearing on most statistical procedures and for many applications, alphabetical ordering is as valid as any other arrangement, for some analyses (particularly those involving contrasts) it is necessary to know the arrangement of factor levels. Furthermore, for graphical summaries of some data, alphabetical factor levels might not represent the natural trends among groups.

Consider a data set that includes a factorial variable with the levels 'high', 'medium' and 'low'. Presented alphabetically, the levels of the factor would be 'high' 'low' 'medium'. Those data would probably be more effectively presented in the more natural order of 'high' 'medium' 'low' or 'low' 'medium' 'high'.

When creating a factor, the order of factor levels can be specified as a list of labels. For example, consider a factor with the levels 'low','medium' and 'high':

FACTOR <- gl(3,2,6,labels=c('low','medium','high'))
FACTOR
[1] low    low    medium medium high   high  
Levels: low medium high
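The same idea applies when a factor is built from existing character data with factor() rather than gl(). The sketch below uses a hypothetical character vector called treatment and compares the default alphabetical levels with levels supplied explicitly through the levels argument.

```r
# Hypothetical character data, for illustration only
treatment <- c('high', 'low', 'medium', 'low', 'high', 'medium')

# default behaviour: levels are arranged alphabetically
levels(factor(treatment))     # "high" "low" "medium"

# levels supplied explicitly in their natural order
TREATMENT <- factor(treatment, levels = c('low', 'medium', 'high'))
levels(TREATMENT)             # "low" "medium" "high"
```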

The order of existing factor levels can also be altered by redefining a factor:

# examine the default order of levels
levels(MACNALLY$HABITAT)
[1] "Gipps.Manna" "Mixed"      
# redefine the order of levels
MACNALLY$HABITAT<-factor(MACNALLY$HABITAT, levels=c(
'Montane Forest', 'Foothills Woodland','Mixed', 'Gipps.Manna',
'Box-Ironbark','River Red Gum'))
# examine the new order of levels
levels(MACNALLY$HABITAT)
[1] "Montane Forest"     "Foothills Woodland" "Mixed"              "Gipps.Manna"       
[5] "Box-Ironbark"       "River Red Gum"     

Notice that in the above code snippet, not only did I alter the order of the factor levels, I also introduced additional factor levels. This is not generally advisable (as it can result in unexpected behavior in some summary functions). However, it does illustrate how to reorder factor levels.

Furthermore, it also helps to reinforce the notion that the levels property of a factor is like a key. Internally, categorical vectors (factors) are stored as integer values (1,2,3...). The levels property supplies a name for each of these integer values.

as.numeric(MACNALLY$HABITAT)
[1] 3 4 4 4 3 3 3 3
levels(MACNALLY$HABITAT)
[1] "Montane Forest"     "Foothills Woodland" "Mixed"              "Gipps.Manna"       
[5] "Box-Ironbark"       "River Red Gum"     
So in the above example, the levels property (key) indicates that any entry with a value of 3 should be called Mixed. Altering the factor levels merely alters this key.
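The relationship between the integer codes and the key can be seen by reordering levels without adding any new ones. The following sketch uses a small hypothetical factor named hab (not the MACNALLY data frame): after redefining the level order, the integer codes change while the labels attached to each entry stay the same.

```r
# Hypothetical factor, for illustration only
hab <- factor(c('Mixed', 'Gipps.Manna', 'Gipps.Manna', 'Mixed'))
levels(hab)         # "Gipps.Manna" "Mixed"  (alphabetical)
as.integer(hab)     # 2 1 1 2

# redefine the order of the levels (no new levels introduced)
hab2 <- factor(hab, levels = c('Mixed', 'Gipps.Manna'))
levels(hab2)        # "Mixed" "Gipps.Manna"
as.integer(hab2)    # 1 2 2 1

# the entries themselves still carry the same labels
all(as.character(hab) == as.character(hab2))   # TRUE
```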

In addition, some analyses perform different operations on factors that are defined as 'ordered' compared to 'unordered' factors. Regardless of whether you have altered the ordering of factor levels or not, by default all factors are implicitly considered 'unordered' until otherwise defined using the ordered() function. Alternatively, the argument ordered=TRUE can be supplied to the factor() function when defining a vector as a factor.

# define the factor as ordered
FACTOR <- ordered(FACTOR)
FACTOR
[1] low    low    medium medium high   high  
Levels: low < medium < high
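As a sketch of the alternative mentioned above, the ordered=TRUE argument can be passed to factor() directly. The vector sizes below is hypothetical; the example also shows that ordered factors support comparisons between levels, which unordered factors do not.

```r
# Hypothetical character data, for illustration only
sizes <- c('low', 'high', 'medium', 'low')

# define the levels and declare the factor as ordered in one step
SIZE <- factor(sizes, levels = c('low', 'medium', 'high'), ordered = TRUE)
is.ordered(SIZE)     # TRUE

# comparisons respect the declared order of the levels
SIZE < 'high'        # TRUE FALSE TRUE TRUE
max(SIZE)            # the highest level present, 'high'
```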

Subsets of data frames - data frame indexing

Indexing of data frames follows the format of dataframe[rows,columns], see the following table.

Action | Example index syntax
Indexing by rows (sampling units)
Select the first 5 rows of each of the vectors in the data frame
MACNALLY[1:5,]
      HABITAT GST EYR
1       Mixed 3.4 0.0
2 Gipps.Manna 3.4 9.2
3 Gipps.Manna 8.4 3.8
4 Gipps.Manna 3.0 5.0
5       Mixed 5.6 5.6
Select each of the vectors for the row called 'Pearcedale' from the data frame. Note for this to work row names need to be defined.
MACNALLY['Pearcedale',]
Indexing by columns
(variables)
Select all rows but just the first and third vector of the data frame
MACNALLY[,c(1,3)]
      HABITAT EYR
1       Mixed 0.0
2 Gipps.Manna 9.2
3 Gipps.Manna 3.8
4 Gipps.Manna 5.0
5       Mixed 5.6
6       Mixed 4.1
7       Mixed 7.1
8       Mixed 5.3
Select the GST and EYR vectors for all sites from the dataframe
MACNALLY[,c('GST','EYR')]
  GST EYR
1 3.4 0.0
2 3.4 9.2
3 8.4 3.8
4 3.0 5.0
5 5.6 5.6
6 8.1 4.1
7 8.3 7.1
8 4.6 5.3
Indexing by conditions
Select the data for sites that have GST values greater than 3
MACNALLY[MACNALLY$GST>3,]
      HABITAT GST EYR
1       Mixed 3.4 0.0
2 Gipps.Manna 3.4 9.2
3 Gipps.Manna 8.4 3.8
5       Mixed 5.6 5.6
6       Mixed 8.1 4.1
7       Mixed 8.3 7.1
8       Mixed 4.6 5.3
Select data for 'Mixed' habitat sites that have GST values greater than 3
MACNALLY[MACNALLY$GST>3 & MACNALLY$HABITAT=='Mixed',]
  HABITAT GST EYR
1   Mixed 3.4 0.0
5   Mixed 5.6 5.6
6   Mixed 8.1 4.1
7   Mixed 8.3 7.1
8   Mixed 4.6 5.3
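Indexing by row name only works once row names have been defined. The sketch below builds a small hypothetical data frame dat from three of the Mac Nally sites, assigns site names as row names with rownames(), and then indexes by those names.

```r
# Small illustrative data frame (three sites only)
dat <- data.frame(GST = c(3.4, 3.4, 8.4), EYR = c(0.0, 9.2, 3.8))

# define the row names
rownames(dat) <- c('Reedy Lake', 'Pearcedale', 'Warneet')

# select all vectors for the row called 'Pearcedale'
dat['Pearcedale', ]

# select the GST values for two named rows, in the order requested
dat[c('Warneet', 'Reedy Lake'), 'GST']    # 8.4 3.4
```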

The subset() function

As an alternative to data frame indexing, the subset() function can be used:

subset(x, subset, select, drop = FALSE, ...)
  • x - is the data frame to be subset
  • subset - is a vector of logical values (TRUE and FALSE) resulting from a conditional statement that defines which rows to include
  • select - is an expression involving either column indexes or column names (that are converted to column indexes) indicating columns to include

Here are a few more examples:

Example subset syntax
Select all rows but just the first and third vector of the data frame
subset(MACNALLY, select=c(1,3))
      HABITAT EYR
1       Mixed 0.0
2 Gipps.Manna 9.2
3 Gipps.Manna 3.8
4 Gipps.Manna 5.0
5       Mixed 5.6
6       Mixed 4.1
7       Mixed 7.1
8       Mixed 5.3
Select the GST and EYR vectors for all sites from the dataframe
subset(MACNALLY, select=c(GST,EYR))
  GST EYR
1 3.4 0.0
2 3.4 9.2
3 8.4 3.8
4 3.0 5.0
5 5.6 5.6
6 8.1 4.1
7 8.3 7.1
8 4.6 5.3
Select the data for sites that have GST values greater than 3
subset(MACNALLY, GST>3)
      HABITAT GST EYR
1       Mixed 3.4 0.0
2 Gipps.Manna 3.4 9.2
3 Gipps.Manna 8.4 3.8
5       Mixed 5.6 5.6
6       Mixed 8.1 4.1
7       Mixed 8.3 7.1
8       Mixed 4.6 5.3
Select data for 'Mixed' habitat sites that have GST values greater than 3
subset(MACNALLY, GST>3 & HABITAT=='Mixed')
  HABITAT GST EYR
1   Mixed 3.4 0.0
5   Mixed 5.6 5.6
6   Mixed 8.1 4.1
7   Mixed 8.3 7.1
8   Mixed 4.6 5.3
Select the 'HABITAT' and 'EYR' columns of the MACNALLY data for 'Mixed' habitat sites that have GST values greater than 3
subset(MACNALLY, GST>3 & HABITAT=='Mixed', select=c(HABITAT,EYR))
  HABITAT EYR
1   Mixed 0.0
5   Mixed 5.6
6   Mixed 4.1
7   Mixed 7.1
8   Mixed 5.3

The subset() function can be used within many other analysis functions and therefore provides a convenient way of performing data analysis on subsets of larger data sets. Moreover, the subset() function should be used in preference to the above conditional indexing techniques when there are missing values or more defined factor levels than actual levels in the data.
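The difference in behavior with missing values can be illustrated with a minimal sketch. The data frame dat below is hypothetical and contains one missing GST value; conditional indexing returns an extra row of NA values for that entry, whereas subset() drops it. Wrapping the condition in which() is another common way to avoid the NA row.

```r
# Hypothetical data with one missing value, for illustration only
dat <- data.frame(SITE = c('A', 'B', 'C', 'D'),
                  GST  = c(3.4, NA, 8.4, 2.0))

# conditional indexing: the NA in the condition yields a row of NAs
by_index <- dat[dat$GST > 3, ]
nrow(by_index)     # 3 (sites A and C, plus a row of NAs)

# subset(): rows where the condition is NA are dropped
by_subset <- subset(dat, GST > 3)
nrow(by_subset)    # 2 (sites A and C)

# which() keeps only the positions where the condition is TRUE
by_which <- dat[which(dat$GST > 3), ]
nrow(by_which)     # 2
```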

The %in% matching operator

It is often desirable to subset according to multiple alternative conditions. The %in% operator checks each entry in the object on the left-hand side for a match with any of the entries within the vector on the right-hand side, returning TRUE or FALSE for each entry.

We now want to use more of the Mac Nally (1996) data set. The following call imports the data set, assuming that the file is located at the given path relative to the current working directory.

MACNALLY1 <- read.table('../downloads/data/macnally.csv', header=TRUE, sep=",")
#subset the MACNALLY dataset according to those rows that correspond to
#HABITAT 'Montane Forest' or 'Foothills Woodland'
MACNALLY1[MACNALLY1$HABITAT %in% c("Montane Forest","Foothills Woodland"),]
                         HABITAT  GST EYR
Fern Tree Gum     Montane Forest  3.2 5.2
Sherwin       Foothills Woodland  4.6 1.2
Heathcote Ju      Montane Forest  3.7 2.5
Warburton         Montane Forest  3.8 6.5
Panton Gap        Montane Forest  3.8 3.8
St Andrews    Foothills Woodland  4.7 3.6
Nepean        Foothills Woodland 14.0 5.6
Tallarook     Foothills Woodland  4.3 2.9

Conveniently, the %in% operator can also be used in the subset function.
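As a sketch of this, the example below uses a small hypothetical data frame dat (four sites rather than the full imported data set) and passes an %in% condition as the subset argument.

```r
# Small illustrative data frame (four sites only)
dat <- data.frame(
  HABITAT = c('Mixed', 'Montane Forest', 'Foothills Woodland', 'Gipps.Manna'),
  GST     = c(3.4, 3.2, 4.6, 3.4)
)

# %in% returns one logical value per entry on the left-hand side
c('Mixed', 'River Red Gum') %in% dat$HABITAT    # TRUE FALSE

# use %in% as the condition within subset()
sub <- subset(dat, HABITAT %in% c('Montane Forest', 'Foothills Woodland'))
sub
```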

Pivot tables and aggregating datasets

Sometimes it is necessary to calculate summary statistics of a vector separately for different levels of a factor. One way to achieve this is by specifying the numeric vector, the factor (or list of factors) and the summary statistic function (such as mean) as arguments in the tapply() function.

#calculate the mean GST densities per HABITAT
tapply(MACNALLY1$GST, MACNALLY1$HABITAT, mean)
      Box-Ironbark Foothills Woodland        Gipps.Manna              Mixed 
          4.575000           6.900000           5.325000           5.035294 
    Montane Forest      River Red Gum 
          3.625000           3.300000 
#OR
with(MACNALLY1, tapply(GST,HABITAT,mean))
      Box-Ironbark Foothills Woodland        Gipps.Manna              Mixed 
          4.575000           6.900000           5.325000           5.035294 
    Montane Forest      River Red Gum 
          3.625000           3.300000 
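When a list of factors is supplied, tapply() returns a table with one dimension per factor. The sketch below uses a hypothetical data frame dat in which SEASON is an invented second factor (it is not part of the Mac Nally data) and the GST values are made up so that the cell means are easy to verify by hand.

```r
# Hypothetical data: SEASON and the GST values are invented for illustration
dat <- data.frame(
  HABITAT = rep(c('Mixed', 'Gipps.Manna'), each = 4),
  SEASON  = rep(c('wet', 'wet', 'dry', 'dry'), times = 2),
  GST     = c(2, 4, 6, 8, 1, 3, 5, 7)
)

# mean GST for each combination of HABITAT and SEASON
res <- with(dat, tapply(GST, list(HABITAT, SEASON), mean))
res
res['Mixed', 'wet']          # 3
res['Gipps.Manna', 'dry']    # 6
```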

When it is necessary to calculate the summary statistic for multiple variables at a time, or to retain the dataset (data.frame) format to facilitate subsequent analyses or graphical summaries, a range of other functions are available. Indeed, there is an entire package (called plyr) devoted to pivot-table-like functionality and data set aggregations; this will be explored in Tutorial 2.4.

Nevertheless, it is appropriate at this point to showcase a small selection of aggregating functions.

  • the ddply function within the plyr package. This function performs a split (generates subsets of the data), apply (applies a function or set of functions on the subsets) and combine (bring the chunks of aggregated data back together as a data frame).
    #calculate the mean GST and EYR densities per habitat
    library(plyr)
    ddply(MACNALLY1, ~HABITAT, function(df) {
      data.frame(GST=mean(df$GST, na.rm=TRUE), EYR=mean(df$EYR, na.rm=TRUE))
    })
    
                 HABITAT      GST      EYR
    1       Box-Ironbark 4.575000 1.450000
    2 Foothills Woodland 6.900000 3.325000
    3        Gipps.Manna 5.325000 6.925000
    4              Mixed 5.035294 4.264706
    5     Montane Forest 3.625000 4.500000
    6      River Red Gum 3.300000 0.000000
    
    #OR if the function you want to apply is the same for each column
    ddply(MACNALLY1, ~HABITAT, colwise(mean))
    
                 HABITAT      GST      EYR
    1       Box-Ironbark 4.575000 1.450000
    2 Foothills Woodland 6.900000 3.325000
    3        Gipps.Manna 5.325000 6.925000
    4              Mixed 5.035294 4.264706
    5     Montane Forest 3.625000 4.500000
    6      River Red Gum 3.300000 0.000000
    
  • the aggregate() function
    #calculate the mean GST and EYR densities per habitat
    aggregate(MACNALLY1[c('GST','EYR')], list(Habitat=MACNALLY1$HABITAT),
    mean)
    
                 Habitat      GST      EYR
    1       Box-Ironbark 4.575000 1.450000
    2 Foothills Woodland 6.900000 3.325000
    3        Gipps.Manna 5.325000 6.925000
    4              Mixed 5.035294 4.264706
    5     Montane Forest 3.625000 4.500000
    6      River Red Gum 3.300000 0.000000
    
  • alternatively, the gsummary() function within the nlme package performs similarly. The gsummary() function is more convenient than aggregate() for grouped data (data containing hierarchical blocking or nesting). Note that due to competing namespaces as well as other technical issues, when using the gsummary() function, it is nearly always necessary to explicitly include the namespace (scope) for the summary function. For example:
    library(nlme)
    gsummary(MACNALLY1[c('GST','EYR')],groups=MACNALLY1$HABITAT, FUN=base::mean)
    
                            GST      EYR
    Box-Ironbark       4.575000 1.450000
    Foothills Woodland 6.900000 3.325000
    Gipps.Manna        5.325000 6.925000
    Mixed              5.035294 4.264706
    Montane Forest     3.625000 4.500000
    River Red Gum      3.300000 0.000000
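The aggregate() function in base R also accepts a formula, which some readers may find easier to read than the list-based call. The following sketch uses a small hypothetical data frame dat (four sites only) rather than the imported data set; cbind() on the left-hand side names the variables to summarise and the right-hand side names the grouping factor.

```r
# Small illustrative data frame (four sites only)
dat <- data.frame(
  HABITAT = c('Mixed', 'Mixed', 'Gipps.Manna', 'Gipps.Manna'),
  GST     = c(3.4, 5.6, 3.4, 8.4),
  EYR     = c(0.0, 5.6, 9.2, 3.8)
)

# mean GST and EYR per habitat, using the formula interface
res <- aggregate(cbind(GST, EYR) ~ HABITAT, data = dat, FUN = mean)
res
```

Note that the formula interface omits rows with missing values by default, so results can differ from the list-based call when the data contain NAs.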
    

Sorting datasets

Often it is necessary to rearrange or sort datasets according to one or more variables. This is done by using the order() function to generate the row indices. By default, data are sorted in increasing order; however, this can be reversed by supplying the decreasing=TRUE argument to the order() function.

It is possible to sort according to multiple variables simply by specifying a comma-separated list of the vector names (see example below), whereby the data are sorted first by the first supplied vector, then the next and so on. Note however, when multiple vectors are supplied, all are sorted in the same direction.

MACNALLY1[order(MACNALLY1$HABITAT,MACNALLY1$GST),]
                         HABITAT  GST EYR
Rushworth           Box-Ironbark  2.1 1.1
Sayers              Box-Ironbark  2.6 0.0
Bailieston          Box-Ironbark  6.5 2.5
Costerfield         Box-Ironbark  7.1 2.2
Tallarook     Foothills Woodland  4.3 2.9
Sherwin       Foothills Woodland  4.6 1.2
St Andrews    Foothills Woodland  4.7 3.6
Nepean        Foothills Woodland 14.0 5.6
Cranbourne           Gipps.Manna  3.0 5.0
Pearcedale           Gipps.Manna  3.4 9.2
Bittern              Gipps.Manna  6.5 9.7
Warneet              Gipps.Manna  8.4 3.8
Donna Buang                Mixed  1.5 0.0
Hawke                      Mixed  1.7 2.6
Waranga                    Mixed  3.0 1.6
Ben Cairn                  Mixed  3.1 9.3
Reedy Lake                 Mixed  3.4 0.0
Ghin Ghin                  Mixed  3.4 2.7
Balnarring                 Mixed  4.1 4.9
Olinda                     Mixed  4.6 5.3
Upper Yarra                Mixed  4.7 3.1
Millgrove                  Mixed  5.4 6.5
Lysterfield                Mixed  5.6 5.6
Minto                      Mixed  5.6 3.3
Cape Schanck               Mixed  6.0 4.9
Gembrook                   Mixed  7.5 7.5
Red Hill                   Mixed  8.1 4.1
Devilbend                  Mixed  8.3 7.1
OShannassy                 Mixed  9.6 4.0
Fern Tree Gum     Montane Forest  3.2 5.2
Heathcote Ju      Montane Forest  3.7 2.5
Warburton         Montane Forest  3.8 6.5
Panton Gap        Montane Forest  3.8 3.8
Undera             River Red Gum  2.7 0.0
Toolamba           River Red Gum  3.0 0.0
Arcadia            River Red Gum  3.1 0.0
Coomboona          River Red Gum  4.4 0.0

To appreciate how this is working, examine just the order component

order(MACNALLY1$HABITAT,MACNALLY1$GST)
 [1] 33 34 25 36 37 10 20 21  4  2 24  3 26 19 35 14  1 17 23  8 27 13  5 18 22 28  6  7
[29] 16  9 11 12 15 30 32 29 31

Hence, when this sequence is applied as row indices to MACNALLY1, it is interpreted as 'display row 33, then row 34, then row 25, etc.'.
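The effect of the decreasing argument can be sketched on a small hypothetical data frame dat (four sites only). The second call illustrates a common workaround, not described above, for sorting keys in different directions: negating a numeric key reverses its direction while the other keys remain in increasing order.

```r
# Small illustrative data frame (four sites only)
dat <- data.frame(
  HABITAT = c('Mixed', 'Gipps.Manna', 'Mixed', 'Gipps.Manna'),
  GST     = c(3.4, 3.4, 5.6, 8.4)
)

# both keys in decreasing order
o1 <- order(dat$HABITAT, dat$GST, decreasing = TRUE)
o1             # 3 1 4 2
dat[o1, ]

# HABITAT increasing, GST decreasing (the numeric key is negated)
o2 <- order(dat$HABITAT, -dat$GST)
o2             # 4 2 3 1
dat[o2, ]
```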

Accessing and evaluating expressions within the context of a dataframe

For times when you find it necessary to repeatedly include the name of the dataframe within functions and expressions, the with() function is very convenient. This function evaluates an expression (which can include functions) within the context of the dataframe. Hence, the above order() illustration could also be performed as:

with(MACNALLY1, order(HABITAT, GST))
 [1] 33 34 25 36 37 10 20 21  4  2 24  3 26 19 35 14  1 17 23  8 27 13  5 18 22 28  6  7
[29] 16  9 11 12 15 30 32 29 31

Similarly, the within() function can be used to create new variables within the context of a dataset. This is particularly useful for scale transformations. The within() function returns a new instance of the data frame; it does not affect the original data frame.

MACNALLY2 <- within(MACNALLY1, logGST <- log(GST))
MACNALLY2
                         HABITAT  GST EYR    logGST
Reedy Lake                 Mixed  3.4 0.0 1.2237754
Pearcedale           Gipps.Manna  3.4 9.2 1.2237754
Warneet              Gipps.Manna  8.4 3.8 2.1282317
Cranbourne           Gipps.Manna  3.0 5.0 1.0986123
Lysterfield                Mixed  5.6 5.6 1.7227666
Red Hill                   Mixed  8.1 4.1 2.0918641
Devilbend                  Mixed  8.3 7.1 2.1162555
Olinda                     Mixed  4.6 5.3 1.5260563
Fern Tree Gum     Montane Forest  3.2 5.2 1.1631508
Sherwin       Foothills Woodland  4.6 1.2 1.5260563
Heathcote Ju      Montane Forest  3.7 2.5 1.3083328
Warburton         Montane Forest  3.8 6.5 1.3350011
Millgrove                  Mixed  5.4 6.5 1.6863990
Ben Cairn                  Mixed  3.1 9.3 1.1314021
Panton Gap        Montane Forest  3.8 3.8 1.3350011
OShannassy                 Mixed  9.6 4.0 2.2617631
Ghin Ghin                  Mixed  3.4 2.7 1.2237754
Minto                      Mixed  5.6 3.3 1.7227666
Hawke                      Mixed  1.7 2.6 0.5306283
St Andrews    Foothills Woodland  4.7 3.6 1.5475625
Nepean        Foothills Woodland 14.0 5.6 2.6390573
Cape Schanck               Mixed  6.0 4.9 1.7917595
Balnarring                 Mixed  4.1 4.9 1.4109870
Bittern              Gipps.Manna  6.5 9.7 1.8718022
Bailieston          Box-Ironbark  6.5 2.5 1.8718022
Donna Buang                Mixed  1.5 0.0 0.4054651
Upper Yarra                Mixed  4.7 3.1 1.5475625
Gembrook                   Mixed  7.5 7.5 2.0149030
Arcadia            River Red Gum  3.1 0.0 1.1314021
Undera             River Red Gum  2.7 0.0 0.9932518
Coomboona          River Red Gum  4.4 0.0 1.4816045
Toolamba           River Red Gum  3.0 0.0 1.0986123
Rushworth           Box-Ironbark  2.1 1.1 0.7419373
Sayers              Box-Ironbark  2.6 0.0 0.9555114
Waranga                    Mixed  3.0 1.6 1.0986123
Costerfield         Box-Ironbark  7.1 2.2 1.9600948
Tallarook     Foothills Woodland  4.3 2.9 1.4586150
MACNALLY1
                         HABITAT  GST EYR
Reedy Lake                 Mixed  3.4 0.0
Pearcedale           Gipps.Manna  3.4 9.2
Warneet              Gipps.Manna  8.4 3.8
Cranbourne           Gipps.Manna  3.0 5.0
Lysterfield                Mixed  5.6 5.6
Red Hill                   Mixed  8.1 4.1
Devilbend                  Mixed  8.3 7.1
Olinda                     Mixed  4.6 5.3
Fern Tree Gum     Montane Forest  3.2 5.2
Sherwin       Foothills Woodland  4.6 1.2
Heathcote Ju      Montane Forest  3.7 2.5
Warburton         Montane Forest  3.8 6.5
Millgrove                  Mixed  5.4 6.5
Ben Cairn                  Mixed  3.1 9.3
Panton Gap        Montane Forest  3.8 3.8
OShannassy                 Mixed  9.6 4.0
Ghin Ghin                  Mixed  3.4 2.7
Minto                      Mixed  5.6 3.3
Hawke                      Mixed  1.7 2.6
St Andrews    Foothills Woodland  4.7 3.6
Nepean        Foothills Woodland 14.0 5.6
Cape Schanck               Mixed  6.0 4.9
Balnarring                 Mixed  4.1 4.9
Bittern              Gipps.Manna  6.5 9.7
Bailieston          Box-Ironbark  6.5 2.5
Donna Buang                Mixed  1.5 0.0
Upper Yarra                Mixed  4.7 3.1
Gembrook                   Mixed  7.5 7.5
Arcadia            River Red Gum  3.1 0.0
Undera             River Red Gum  2.7 0.0
Coomboona          River Red Gum  4.4 0.0
Toolamba           River Red Gum  3.0 0.0
Rushworth           Box-Ironbark  2.1 1.1
Sayers              Box-Ironbark  2.6 0.0
Waranga                    Mixed  3.0 1.6
Costerfield         Box-Ironbark  7.1 2.2
Tallarook     Foothills Woodland  4.3 2.9
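Several new variables can be created in a single within() call by enclosing the assignments in braces. The sketch below uses a small hypothetical data frame dat (three sites only); sqrtEYR is an invented variable name used purely for illustration. It also confirms that the original data frame is left unchanged.

```r
# Small illustrative data frame (three sites only)
dat <- data.frame(GST = c(3.4, 8.4, 3.0), EYR = c(0.0, 3.8, 5.0))

# create two transformed variables in one call
dat2 <- within(dat, {
  logGST  <- log(GST)     # natural log of GST
  sqrtEYR <- sqrt(EYR)    # square root of EYR
})

names(dat)     # original is unchanged: "GST" "EYR"
names(dat2)    # includes the two new variables
```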

