Lab session 5

Logistic regression and an example of ‘Simpson’s paradox’

Simpson’s paradox (the first exposition of which was actually due to G. U. Yule) is not really a paradox at all, just a manifestation of the fact that the effect of a variable depends crucially on what other variables are controlled for.

The name ‘Simpson’s paradox’ usually relates to discrete variables, and hence data in contingency tables (tables of counts). The ‘ecological fallacy’ phenomenon that we saw in the lecture is the same thing as Simpson’s paradox.

The file berkeley.txt contains data on admissions to UC Berkeley, for 6 academic departments in Fall 1973. [A famous, real dataset – often used for teaching purposes!]

Let’s read the data into R:

berkeley <- read.table("berkeley.txt", header = TRUE)
attach(berkeley)      
##  that's so that R will look for variables here automatically without berkeley$...

Let’s first just take a look at the data:

berkeley

Now let’s fit a binomial logistic regression in which the only predictor is Sex:

admissions <- cbind(Admitted, NotAdmitted)      ## the binomial response counts
model0 <- glm(admissions ~ 1, family = binomial)
model1 <- update(model0, . ~ . + Sex)
anova(model0, model1)
summary(model1)

The update() device is a useful shortcut when we want to tweak some aspect of a model we have already fitted. The . ~ . specification means that the starting point for the model formula is to use the same respone variable and the same linear predictor as the model that’s being updated..

The family = binomial specification results in a logistic regression, by default; alternatives include probit, and others.

You should find that the apparent effect is positive for males, and clearly significant. On aggregate, 30% of females and 45% of males were admitted to Berkeley: e.g.,

prop.table(colSums(admissions[Sex == "Male", ]))

But notice the widely differing relative application rates of males and females to the six different departments. Could it be that females tend to apply to ‘difficult’ departments? Answer this by estimating the sex effect controlling for (i.e., within) department:

model2 <- update(model1, . ~ . + Department)
summary(model2)

This model still shows some lack of fit (note the fairly high deviance, relative to the residual degrees of freedom), but it now shows a slight negative effect of being male (though not significantly different from zero: see the standard errors by using summary(model2)).

Lack of fit is evident only in Department I, where females in fact have a rather high rate of admission (look at residuals(model2)). With Department I omitted, we achieve a simple model that fits the data well, and in which there is no significant sex effect:

model3 <- update(model2, subset = (Department != "I")) 
summary(model3)

A simple ‘graphical’ representation of this model is Sex — Department — Admission

i.e., admission is independent of sex, within any given department.

What was the ‘paradox’ here?

When we looked at the data aggregated over departments (ie, we ignored the Department variable), we found (via model1) a seemingly clear, and positive, association between admission to Berkeley and being male.

On the other hand, the department-by-department models (model2 and model3) show that in most departments there is no apparent association; while in Department I it’s the female applicants who have the higher admission rate.

So, is there a relationship between admission rate and M/F? The answer is: it depends!

(That is, the answer depends on the precise question being asked. In the Berkeley admissions context, the department-by-department analysis is probably more relevant to admissions policy discussions, perhaps?)

Warwick Data Science Institute

Lab session 5

David Firth

13 September 2017

Logistic regression and an example of ‘Simpson’s paradox’

What was the ‘paradox’ here?