Lecture 5
Statistical models in R
David Firth
13 September 2017
Linear model as a basic exemplar
A template for lots of models for dependence of a variable (the response variable) upon other variables (the explanatory or predictor variables): multiple linear regression.
mymodel < lm(my.response ~ my.predictor1 + my.predictor2, data = mydata)
The ingredients here are:
mymodel
: the name I am giving to the resulting model object that will be stored in my current R workspacemydata
: the data frame wherelm()
should look first for the variables used in my modelmy.response
: the ‘response’ or ‘outcome’ variable. (What SPSS would often call the dependent variable.)my.predictor1
etc.: the explanatory or predictor variables for the model ~: means roughly is modelled as a function of
+
: takes the effects ofmy.predictor1
andmy.predictor2
to be additive
lm()
: calls the R function needed to specify and fit a linear model by least squares.
In mathematical notation, this model looks like \[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + e_i \qquad(i=1,...n) \] where \(e_i\) is the ‘error’. (By default, the errors are assumed to have constant variance.)
Linear model: Other commonly used arguments to lm()
The above template can be modified by further arguments to lm()
. The most commonly used of these are:
weights
: specify a vector of caseweights. Useful for situations where error variances are known to be unequal but their ratios are known. e.g., measurements come from 2 different instruments, one of which is known to have 10% more variance than the other:
weights = 1/(1 + (0.1 * (instrument == "less_accurate"))
. The weights should be proportional to 1/variance.
 e.g., measurements come from 2 different instruments, one of which is known to have 10% more variance than the other:
subset
: specify a subset of the data to be used (the remainder being ignored). For example,
subset = (gender == "female")
 For example,
A model formula can include all sorts of predictor!
Examples are:

factors: categorical predictors such as gender, race, …

interactions: if the effect of
p1
might be different at different levels ofp2
, consider the interactionp1 : p2

polynomials: for example, to fit a quadratic \(y = x + x^2 + e\), use as model formula
y ~ x + I(x^2)
. The wrapperI()
is needed here because otherwise R would interpretx^2
to mean the same thing asx:x
(which by convention is the same as justx
). 
spline functions: various forms of spline function are implemented. Among the simplest to use are Bsplines and natural cubic splines. With
library(splines)
loaded, such spline functions can be used in a model formula via terms such asbs(x)
andns(x)
.
Standard operations on a model object
What we should want to know first, after fitting a model to data, is:
 does the model fit the data?
 can the model be improved?
The key here is to look (in whatever ways we can think of) at the model’s residuals. The plot()
function applied to a model object is a good place to start with this. If the model is a good model and cannot be improved, then it should not be possible to find any systematic patterns in the residuals.
Once a model is found that fits the data well, interest then shifts to:
 What does the model say about my research question(s)? This typically is answered through:
 parameter estimates, and confidence intervals for the parameter(s) of interest
 hypothesis tests
Part of the art of modelling is to try to encompass your research question in a model parameter, or at most a small number of model parameters.
Beyond linear models
Linear models will not be appropriate for some commonly encountered types of response variable, including:
 categorical response variables (use logistic regression, probit regression, etc.; and multicategory versions of these)
 counts (typically use loglinear models based on the Poisson or negative binomial distributions)
 failuretime or ‘survival’ data (use proportional hazards models or accelerated failuretime models).
And sometimes linearity just won’t do the job, even on another scale such as logit or probit, because the effects being studied are inherently nonlinear.
R provides a huge array of alternative modelling possibilities.
These include:
glm()
for all generalized linear models (includes logistic regression, loglinear models, etc.)gam()
(from the recommendedmgcv
package) which fits generalized additive models, very flexible models based on spline functions the
survival
package (for timetoevent data, typically with censoring)  the
nls()
function for leastsquares fitting of nonlinear regression models, and thegnm
package for generalized nonlinear models
— and many, many more!
A small (artificial) example
Just to demonstrate that even seemingly simple linear models need care, in formulation and interpretation.
Here are some data:
with(mydata, plot(x, y))
We are interested in how \(y\) depends on \(x\).
Let’s fit a linear model:
model1 < lm(y ~ x, data = mydata)
with(mydata, plot(x, y))
abline(model1)
Let’s look next at the residuals:
plot(model1)