Lab session 4

Exploratory Analysis of Multivariate Data

Matt Moores

13 September 2017


You will need to have some additional packages installed for this laboratory exercise:

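library(factoextra)  # ggplot2-based plots for PCA results
library(FactoMineR)  # PCA()
library(GGally)      # ggpairs()
library(knitr)       # kable()
library(MASS)        # Cars93 dataset
attach(Cars93)       # make the columns accessible by name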

Overview of the Dataset

dplyr provides a function glimpse to view the columns in a data frame:
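For the Cars93 data, a call along these lines prints one line per column, with its type and first few values. This is a sketch: it calls glimpse through the dplyr namespace so that dplyr does not need to be attached, and the fallback to base R's str is an addition for the case where dplyr is not installed.

```r
# Load the Cars93 data frame from MASS without attaching the package
data(Cars93, package = "MASS")

if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(Cars93)   # one line per column: name, type, first values
} else {
  str(Cars93)              # base R alternative with similar information
}
```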


It is very important to check that the data type of each column matches what you expect. Problems can arise if, for example, you have accidentally used stringsAsFactors=TRUE when calling read.table or read.csv.
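The effect of stringsAsFactors is easy to see on a small example. The sketch below reads the same two-row CSV text (made up for illustration, not taken from Cars93) both ways and compares the resulting column classes.

```r
# Hypothetical CSV content, used only to illustrate the argument
csv_text <- "make,price\nAcura,15.9\nAudi,29.1"

as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_string <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The text column becomes a factor in one case, a character vector in the other
sapply(as_factor, class)
sapply(as_string, class)
```

The numeric column is unaffected; only columns containing text change type.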

We can visualise this information using the R package visdat.
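A call along these lines draws the whole data frame as a grid, one column per variable and one row per observation, coloured by data type. It is a sketch that assumes vis_dat with default arguments is the plot wanted; the requireNamespace guard is an addition so that the block still runs when visdat is not installed.

```r
data(Cars93, package = "MASS")

# vis_dat() returns a ggplot object; cells are coloured by column type
# and missing values are marked separately
if (requireNamespace("visdat", quietly = TRUE)) {
  print(visdat::vis_dat(Cars93))
}
```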


Note that there are several NA values in the columns Luggage.room and Rear.seat.room:

## [1] 11    
## [1] 2    
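Counts like these can be obtained for every column at once. A minimal sketch in base R, where the name na_counts is chosen here for illustration:

```r
data(Cars93, package = "MASS")

# Number of missing values in each column
na_counts <- colSums(is.na(Cars93))

# Show only the columns that contain at least one NA
na_counts[na_counts > 0]
```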

In this case, rear seat room is missing for sports cars, which do not have a back seat:

seat_miss <- which(is.na(Rear.seat.room))
## [1] "Chevrolet Corvette" "Mazda RX-7"    
## [1] Sporty
## Levels: Compact Large Midsize Small Sporty Van    
## [1] 2    
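Output of this kind comes from indexing other columns with seat_miss. A minimal sketch, assuming the columns of interest are Make, Type and Passengers (that choice is an assumption, not something the listing above states):

```r
data(Cars93, package = "MASS")

# Rows where rear seat room was not recorded
seat_miss <- which(is.na(Cars93$Rear.seat.room))

as.character(Cars93$Make[seat_miss])   # which cars are affected
unique(Cars93$Type[seat_miss])         # which type(s) of car they are
unique(Cars93$Passengers[seat_miss])   # how many passengers they seat
```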

There is an important distinction between data that is missing at random and data that is unavailable due to some underlying reason. The column Rear.seat.room is truly not applicable for these makes of car, so these rows can be safely excluded from any statistical model.

Exercise: can you figure out if there is a reason why Luggage.room is missing, or whether it is missing completely at random?

Pairwise Correlation

We can use a pairs plot¹ to explore relationships between pairs of columns in our data frame. For example:

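# Type, Price, MPG.city, Cylinders, EngineSize, Horsepower
ggpairs(Cars93, columns = c(3,5,7,11:13),
        lower=list(combo=wrap("facethist", binwidth=0.8)))
# Fuel.tank.capacity, Passengers, Wheelbase, Width, Turn.circle, Weight
ggpairs(Cars93, columns = c(17,18,20:22,25),
        lower=list(combo=wrap("facethist", binwidth=0.8)))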

A pairs plot is an example of small multiples²: we look at selected subgroups of columns, rather than plotting all 351 possible combinations at once. Otherwise, it is too difficult to glean any useful information from this style of visualisation.
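The figure of 351 is the number of unordered pairs among the 27 columns of Cars93. It can be checked directly (n_pairs is a name introduced here for illustration):

```r
data(Cars93, package = "MASS")

# Number of distinct pairs of columns: 27 * 26 / 2
n_pairs <- choose(ncol(Cars93), 2)
n_pairs
```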

1. Emerson et al. (2012). The Generalized Pairs Plot. J. Comput. Graph. Stat. 22(1): 79–91.

2. Tufte, Edward (1983, 2nd ed. 2001). The Visual Display of Quantitative Information.

Principal Components Analysis

Principal components analysis (PCA) is a method for dimension reduction. We can use it to explore covariance relationships between all of the columns simultaneously. PCA can be computed using the functions stats::prcomp or stats::princomp, but instead we will be using the R package FactoMineR³. This is mainly because it provides plotting functions using ggplot instead of base graphics, via the R package factoextra.

For now, we will exclude any rows with missing values. We will also exclude any columns containing categorical data (factors) from the computation, by passing them to PCA as supplementary variables. It is possible to handle these types of data using generalised PCA, but that is beyond the scope of this tutorial.

cars_cat <- c(1:3,9:11,16,26,27) # categorical (factor) columns
cars_pca <- PCA(na.omit(Cars93), graph=FALSE, quali.sup=cars_cat)
kable(cars_pca$eig, digits=2, col.names = c("eigen","% var","cum %"))    
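For comparison, a similar decomposition can be sketched with stats::prcomp, mentioned above. This is not the FactoMineR code used in this lab: it assumes that the numeric columns are the continuous variables of interest and that they are scaled to unit variance (scale. = TRUE), which is also what PCA does by default. The object names are chosen here for illustration.

```r
data(Cars93, package = "MASS")

# Keep complete rows, then only the numeric (non-factor) columns
cars_complete <- na.omit(Cars93)
cars_num <- cars_complete[, sapply(cars_complete, is.numeric)]

# PCA on centred, unit-variance columns
cars_prcomp <- prcomp(cars_num, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eig <- cars_prcomp$sdev^2
pct_var <- 100 * eig / sum(eig)
round(cbind(eigen = eig, pct = pct_var, cum = cumsum(pct_var)), 2)
```

Because the columns are standardised, the eigenvalues sum to the number of variables, so each percentage is simply the eigenvalue divided by that count.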

There are 18 principal components, since we have 18 continuous variables in our dataset. A scree plot shows how much of the variance in the data is explained by each component:

fviz_screeplot(cars_pca, ncp=18)    

We can see that 63.7% of the variance is explained by the first principal component, with an additional 13.1% explained by the second component. We can plot all of our observations according to their coordinates on these two components.

fviz_pca_ind(cars_pca, axes = c(1, 2), habillage=3)    

Instead of plotting the rows (observations) according to their principal components, we can also plot the columns (variables).
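A variable plot of this kind can be drawn with fviz_pca_var from factoextra. The sketch below repeats the PCA fit so that it runs on its own; the choice of fviz_pca_var with the first two axes is an assumption about the intended plot.

```r
library(FactoMineR)
library(factoextra)
data(Cars93, package = "MASS")

cars_cat <- c(1:3, 9:11, 16, 26, 27)   # categorical (factor) columns
cars_pca <- PCA(na.omit(Cars93), graph = FALSE, quali.sup = cars_cat)

# Correlation circle: each arrow is a variable, positioned by its
# correlation with the first two principal components
fviz_pca_var(cars_pca, axes = c(1, 2))
```

Variables whose arrows point in similar directions are positively correlated, and arrows close to the unit circle are well represented by these two components.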


3. Lê, Josse & Husson (2008). FactoMineR: An R Package for Multivariate Analysis. J. Stat. Soft. 25(1): 1–18.

Extra time (optional)

Try these steps for exploratory analysis of another dataset. You could choose one of your own, or download cereal.dat from the link below.