Lecture 4

Linear model as a basic exemplar

A template for lots of models for dependence of a variable (the response variable) upon other variables (the explanatory or predictor variables): multiple linear regression.

mymodel <- lm(my.response ~ my.predictor1 + my.predictor2, data = mydata)

The ingredients here are:

mymodel: the name I am giving to the resulting model object that will be stored in my current R workspace
mydata: the data frame where lm() should look first for the variables used in my model
my.response: the ‘response’ or ‘outcome’ variable. (What SPSS would often call the dependent variable.)
my.predictor1 etc.: the explanatory or predictor variables for the model
~: means roughly is modelled as a function of
+: takes the effects of my.predictor1 and my.predictor2 to be additive
lm(): calls the R function needed to specify and fit a linear model by least squares.

In mathematical notation, this model looks like \[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + e_i \qquad(i=1,...n) \] where \(e_i\) is the ‘error’. (By default, the errors are assumed to have constant variance.)

Linear model: Other commonly used arguments to `lm()`

The above template can be modified by further arguments to lm(). The most commonly used of these are:

weights: specify a vector of case-weights. Useful for situations where error variances are known to be unequal but their ratios are known.
- e.g., measurements come from 2 different instruments, one of which is known to have 10% more variance than the other: weights = 1/(1 + (0.1 * (instrument == "less_accurate")). The weights should be proportional to 1/variance.
subset: specify a subset of the data to be used (the remainder being ignored).
- For example, subset = (gender == "female")

A model formula can include all sorts of predictor!

Examples are:

factors: categorical predictors such as gender, race, …
interactions: if the effect of p1 might be different at different levels of p2, consider the interaction p1 : p2
polynomials: for example, to fit a quadratic \(y = x + x^2 + e\), use as model formula y ~ x + I(x^2). The wrapper I() is needed here because otherwise R would interpret x^2 to mean the same thing as x:x (which by convention is the same as just x).
spline functions: various forms of spline function are implemented. Among the simplest to use are B-splines and natural cubic splines. With library(splines) loaded, such spline functions can be used in a model formula via terms such as bs(x) and ns(x).

Standard operations on a model object

What we should want to know first, after fitting a model to data, is:

does the model fit the data?
can the model be improved?

The key here is to look (in whatever ways we can think of) at the model’s residuals. The plot() function applied to a model object is a good place to start with this. If the model is a good model and cannot be improved, then it should not be possible to find any systematic patterns in the residuals.

Once a model is found that fits the data well, interest then shifts to:

What does the model say about my research question(s)? This typically is answered through:
- parameter estimates, and confidence intervals for the parameter(s) of interest
- hypothesis tests

Part of the art of modelling is to try to encompass your research question in a model parameter, or at most a small number of model parameters.

Beyond linear models

Linear models will not be appropriate for some commonly encountered types of response variable, including:

categorical response variables (use logistic regression, probit regression, etc.; and multi-category versions of these)
counts (typically use log-linear models based on the Poisson or negative binomial distributions)
failure-time or ‘survival’ data (use proportional hazards models or accelerated failure-time models).

And sometimes linearity just won’t do the job, even on another scale such as logit or probit, because the effects being studied are inherently non-linear.

R provides a huge array of alternative modelling possibilities.

These include:

glm() for all generalized linear models (includes logistic regression, log-linear models, etc.)
gam() (from the recommended mgcv package) which fits generalized additive models, very flexible models based on spline functions
the survival package (for time-to-event data, typically with censoring)
the nls() function for least-squares fitting of nonlinear regression models, and the gnm package for generalized nonlinear models

— and many, many more!

A small (artificial) example

Just to demonstrate that even seemingly simple linear models need care, in formulation and interpretation.

Here are some data:

with(mydata, plot(x, y))

We are interested in how \(y\) depends on \(x\).

Let’s fit a linear model:

model1 <- lm(y ~ x, data = mydata)
with(mydata, plot(x, y))
abline(model1)

Let’s look next at the residuals:

plot(model1)

That doesn’t look too bad. There is some indication that a straight-line fit isn’t quite right; but it’s not very strong, so let’s continue with this model for now.

Let’s look at the parameter estimates:

summary(model1)

## 
## Call:
## lm(formula = y ~ x, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4254 -1.5659 -0.1224  1.5091  5.2079 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.3897     0.3183  -1.224    0.223
## x            -0.5291     0.1079  -4.904 2.44e-06
## 
## Residual standard error: 2.344 on 148 degrees of freedom
## Multiple R-squared:  0.1398, Adjusted R-squared:  0.134 
## F-statistic: 24.05 on 1 and 148 DF,  p-value: 2.439e-06

That looks like pretty strong evidence of a negative relationship between \(y\) and \(x\).

But wait. Let’s not conclude just yet.

Can the model be improved?

This dataset actually has more data than just \(x\) and \(y\):

head(mydata)

##        x     y group
## 96  1.42 -2.02     B
## 4   1.30  0.19     A
## 95  1.64 -3.38     B
## 7   0.24 -1.10     A
## 105 4.33 -5.90     C
## 89  1.73 -3.09     B

The data are actually in 3 groups.

So let’s re-plot \(y\) against \(x\), identifying the groups:

colours <- c("red", "green", "blue")
##  Incidental note that the "RColorBrewer" package provides better colours 
##  than this for use in graphs -- but we won't use it here.
with(mydata, plot(x, y, col = colours[group]))
abline(model1)

There’s now a very clearly discernible structure in the residuals from model1!

This is an example of the so-called ecological fallacy: We see quite different regression relationships between \(y\) and \(x\) when we do and do not control for one or more additional variable(s) — in this instance the group variable.

Let’s re-formulate our model to include group:

model2 <- lm(y ~ group + group:x, data = mydata)
summary(model2)

## 
## Call:
## lm(formula = y ~ group + group:x, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0748 -1.0288  0.0162  0.8897  3.8793 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.1504     0.2619   0.574    0.567
## groupB       -4.0897     0.5087  -8.039 3.01e-13
## groupC       -7.2413     0.8080  -8.961 1.52e-15
## groupA:x      0.8308     0.2033   4.086 7.24e-05
## groupB:x      1.0114     0.2020   5.007 1.59e-06
## groupC:x      0.7715     0.1718   4.490 1.45e-05
## 
## Residual standard error: 1.394 on 144 degrees of freedom
## Multiple R-squared:  0.704,  Adjusted R-squared:  0.6937 
## F-statistic: 68.48 on 5 and 144 DF,  p-value: < 2.2e-16

Now we see that (as we could see plainly in the graph) the relationship between \(y\) and \(x\) is actually a positive one. In every one of the three groups, the values of \(y\) tend to increase as \(x\) increases.

The two regression models model1 and model2 here have quite different meanings.

with(mydata, plot(x, y, col = colours[group]))         ## same as before
abline(model1)
print(coefs2 <- coef(model2))

## (Intercept)      groupB      groupC    groupA:x    groupB:x    groupC:x 
##   0.1504497  -4.0896702  -7.2412522   0.8308377   1.0114357   0.7714611

abline(coefs2[1], coefs2[4], col = "red")
abline(coefs2[1] + coefs2[2], coefs2[4], col = "green")
abline(coefs2[1] + coefs2[3], coefs2[6], col = "blue")
legend("topright", legend = c("A", "B", "C"), fill = c("red", "green", "blue"))
title(main = "black is model1; rgb lines are model2")

The improvement given by model2 is clear.

If it was less clear, and we wanted a formal test of the hypothesis that a single regression line is the truth, then since model1 and model2 are nested models, we could use the anova function to perform a test.

anova(model1, model2)

## Analysis of Variance Table
## 
## Model 1: y ~ x
## Model 2: y ~ group + group:x
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
## 1    148 813.38                              
## 2    144 279.93  4    533.45 68.603 < 2.2e-16

The fit of model2 is much better than the fit of model1 — the improvement in fit is much better than can be attributed to random variation — so the computed \(F\) statistic is very large indeed. We have clear evidence here against the single-line hypothesis.

Warwick Data Science Institute

Lecture 4

Statistical models in R

David Firth

September 2016

Linear model as a basic exemplar

Linear model: Other commonly used arguments to `lm()`

A model formula can include all sorts of predictor!

Standard operations on a model object

Beyond linear models

A small (artificial) example

Statistical models in R

David Firth

September 2016

Linear model as a basic exemplar

Linear model: Other commonly used arguments to lm()

A model formula can include all sorts of predictor!

Standard operations on a model object

Beyond linear models

A small (artificial) example

Linear model: Other commonly used arguments to `lm()`