Exploratory Analysis of Multivariate Data
13 September 2017
You will need to have some additional packages installed for this laboratory exercise:
library(dplyr)       # glimpse
library(knitr)       # kable
library(visdat)
library(GGally)
library(FactoMineR)
library(factoextra)
library(MASS)        # provides the Cars93 dataset
attach(Cars93)
Overview of the Dataset
dplyr provides a function, glimpse, to view the columns in a data frame:
It is very important to check that the data type of each column matches what you expect. Problems can arise if, for example, you have accidentally used stringsAsFactors=TRUE when reading the data into R.
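As a minimal sketch of this check, the column classes can also be listed with base R alone. The small data frames `raw` and `fixed` below are made-up illustrations, not part of the lab data; they only show how the `stringsAsFactors` argument changes the type of a text column.

```r
library(MASS)  # provides the Cars93 data frame

# Base R alternative to glimpse: the (first) class of every column
col_classes <- vapply(Cars93, function(x) class(x)[1], character(1))
print(col_classes)

# Hypothetical toy data: the same text column stored two different ways
raw   <- data.frame(make = c("A", "B"), price = c(10.5, 20.1),
                    stringsAsFactors = TRUE)
fixed <- data.frame(make = c("A", "B"), price = c(10.5, 20.1),
                    stringsAsFactors = FALSE)

class(raw$make)    # stored as a factor
class(fixed$make)  # stored as character
```

If a column that should be numeric or character shows up as a factor, it is usually easiest to fix this when the data are first read in.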
We can visualise this information using the R package visdat:
Note that there are several NA values in the columns Luggage.room and Rear.seat.room:
## [1] 11
## [1] 2
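One way to obtain missing-value counts like these for every column at once is sketched below. It uses only base R; the name `na_counts` is introduced here for illustration.

```r
library(MASS)  # provides the Cars93 data frame

# Count the NA values in each column
na_counts <- colSums(is.na(Cars93))

# Show only the columns that contain at least one NA
na_counts[na_counts > 0]
```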
In this case, rear seat room is missing for sports cars, which do not have a back seat:
seat_miss <- which(is.na(Rear.seat.room))
as.character(Make[seat_miss])
## [1] "Chevrolet Corvette" "Mazda RX-7"
## [1] Sporty
## Levels: Compact Large Midsize Small Sporty Van
## [1] 2
There is an important distinction between data that is missing at random and data that is unavailable due to some underlying reason. The column Rear.seat.room is truly not applicable for these makes of car, so these rows can be safely excluded from any statistical model.
Exercise: can you figure out whether there is a reason why Luggage.room is missing, or whether it is missing completely at random?
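One possible starting point for this exercise, offered as a sketch rather than a full answer, is to cross-tabulate the missingness indicator against another column. Type is used here as an example; other columns could be tried in the same way.

```r
library(MASS)  # provides the Cars93 data frame

# TRUE where Luggage.room is missing
luggage_missing <- is.na(Cars93$Luggage.room)

# Cross-tabulate missingness against the type of car
missing_by_type <- table(Type = Cars93$Type, Missing = luggage_missing)
print(missing_by_type)
```

If the missing values are concentrated in particular rows of this table, that suggests an underlying reason rather than missingness completely at random.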
We can use a pairs plot¹ to explore relationships between pairs of columns in our data frame. For example:
ggpairs(Cars93, columns = c(3,5,7,11:13), lower=list(combo=wrap("facethist", binwidth=0.8)))
ggpairs(Cars93, columns = c(17,18,20:22,25), lower=list(combo=wrap("facethist", binwidth=0.8)))
A pairs plot is an example of small multiples²: we look at selected subgroups of columns, rather than plotting all 351 possible combinations at once. Otherwise, it is too difficult to glean any useful information from this style of visualisation.
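The figure of 351 is the number of unordered pairs that can be formed from the 27 columns of Cars93. It can be checked with a one-line calculation:

```r
library(MASS)  # provides the Cars93 data frame

n_cols  <- ncol(Cars93)       # 27 columns
n_pairs <- choose(n_cols, 2)  # 27 * 26 / 2
n_pairs
```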
1. Emerson et al. (2012). The Generalized Pairs Plot. J. Comput. Graph. Stat. 22(1): 79–91.
2. Tufte, Edward (1983, 2nd ed. 2001). The Visual Display of Quantitative Information.
Principal Components Analysis
Principal components analysis (PCA) is a method for dimension reduction. We can use it to explore covariance relationships between all of the columns simultaneously. PCA can be computed using base R functions such as stats::princomp, but instead we will be using the R package FactoMineR³. This is mainly because it provides plotting functions using ggplot2 instead of base graphics, via the R package factoextra.
For now, we will exclude any rows with missing values. We will also treat any columns containing categorical data (factors) as supplementary, so that they do not contribute to the principal components. It is possible to handle these types of data using generalised PCA, but that is beyond the scope of this tutorial.
cars_cat <- c(1:3,9:11,16,26,27) # categorical (factor) columns
cars_pca <- PCA(na.omit(Cars93), graph=FALSE, quali.sup=cars_cat)
kable(cars_pca$eig, digits=2, col.names = c("eigen","% var","cum %"))
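The column indices of the factors are hard-coded above. As a sketch of an alternative, they could be derived from the data frame itself; the name `cars_cat_auto` is introduced here for illustration only.

```r
library(MASS)  # provides the Cars93 data frame

# Positions of all factor columns in Cars93
cars_cat_auto <- unname(which(sapply(Cars93, is.factor)))
cars_cat_auto

# Number of remaining (continuous) columns
ncol(Cars93) - length(cars_cat_auto)
```

Deriving the indices this way avoids silent errors if the column order of the data frame ever changes.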
There are 18 principal components, since we have 18 continuous variables in our dataset. A scree plot shows how much of the variance in the data is explained by each component:
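The quantities shown in a scree plot can be sketched with base R as a cross-check. This example swaps in stats::prcomp for FactoMineR and assumes the variables are standardised to unit variance; the names `pca_base` and `scree` are illustrative.

```r
library(MASS)  # provides the Cars93 data frame

# Complete rows only, numeric (continuous) columns only
cars_complete <- na.omit(Cars93)
is_num <- sapply(cars_complete, is.numeric)

# PCA on centred, unit-variance variables
pca_base <- prcomp(cars_complete[, is_num], center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eigenvalues <- pca_base$sdev^2
var_pct <- 100 * eigenvalues / sum(eigenvalues)

scree <- data.frame(component = seq_along(var_pct),
                    pct_var   = round(var_pct, 2),
                    cum_pct   = round(cumsum(var_pct), 2))
print(scree)
```

Each row gives the height of one bar in the scree plot, and the cumulative column shows how quickly the total variance is accounted for.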
We can see that 63.7% of the variance is explained by the first principal component, with an additional 13.1% explained by the second component. We can plot all of our observations according to their 2D coordinates on these first two components.
fviz_pca_ind(cars_pca, axes = c(1, 2), habillage=3)
Instead of plotting the rows (observations) according to their principal components, we can also plot the columns (variables).
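A plot of the variables is typically based on the correlation between each original variable and each principal component. The sketch below computes those correlations for the first two components with base R. It uses stats::prcomp on standardised variables rather than FactoMineR, and the names `var_cor` and `var_cor_alt` are illustrative.

```r
library(MASS)  # provides the Cars93 data frame

# Complete rows only, numeric (continuous) columns only
cars_complete <- na.omit(Cars93)
X <- cars_complete[, sapply(cars_complete, is.numeric)]

pca_base <- prcomp(X, center = TRUE, scale. = TRUE)

# Correlation of each variable with the first two component scores
var_cor <- cor(X, pca_base$x[, 1:2])
round(var_cor, 2)

# For standardised variables the same values come from the loadings,
# each column scaled by the standard deviation of its component
var_cor_alt <- sweep(pca_base$rotation[, 1:2], 2, pca_base$sdev[1:2], `*`)
```

Variables with correlations close to 1 or -1 on a component are well represented by it, while variables with similar coordinates tend to be positively correlated with each other.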
3. Lê, Josse & Husson (2008). FactoMineR: An R Package for Multivariate Analysis. J. Stat. Soft. 25(1): 1–18.
Extra time (optional)
Try these steps for exploratory analysis of another dataset. You could choose one of your own, or download cereal.dat from the link below.