# Lab session 5

*David Firth*

*13 September 2017*

## Logistic regression and an example of ‘Simpson’s paradox’

Simpson’s paradox (the first exposition of which was actually due to G. U. Yule) is not really a paradox at all, just a manifestation of the fact that the effect of a variable depends crucially on what other variables are controlled for.

The name ‘Simpson’s paradox’ usually relates to discrete variables, and hence data in contingency tables (tables of counts). The ‘ecological fallacy’ phenomenon that we saw in the lecture is the same thing as Simpson’s paradox.

The file `berkeley.txt`

contains data on admissions to UC Berkeley, for 6 academic departments in Fall 1973. [A famous, real dataset – often used for teaching purposes!]

Let’s read the data into R:

```
berkeley <- read.table("berkeley.txt", header = TRUE)
attach(berkeley)
## that's so that R will look for variables here automatically without berkeley$...
```

Let’s first just take a look at the data:

`berkeley`

Now let’s fit a binomial logistic regression in which the only predictor is `Sex`

:

```
admissions <- cbind(Admitted, NotAdmitted) ## the binomial response counts
model0 <- glm(admissions ~ 1, family = binomial)
model1 <- update(model0, . ~ . + Sex)
anova(model0, model1)
summary(model1)
```

The `update()`

device is a useful shortcut when we want to tweak some aspect of a model we have already fitted. The `. ~ .`

specification means that the starting point for the model formula is to use the *same* respone variable and the *same* linear predictor as the model that’s being updated..

The `family = binomial`

specification results in a logistic regression, by default; alternatives include probit, and others.

You should find that the apparent effect is positive for males, and clearly significant. On aggregate, 30% of females and 45% of males were admitted to Berkeley: e.g.,

`prop.table(colSums(admissions[Sex == "Male", ]))`

But notice the widely differing relative application rates of males and females to the six different departments. Could it be that females tend to apply to ‘difficult’ departments? Answer this by estimating the sex effect *controlling for* (i.e., within) *department*:

```
model2 <- update(model1, . ~ . + Department)
summary(model2)
```

This model still shows some lack of fit (note the fairly high deviance, relative to the residual degrees of freedom), but it now shows a slight *negative* effect of being male (though not significantly different from zero: see the standard errors by using `summary(model2)`

).

Lack of fit is evident only in Department I, where females in fact have a rather *high* rate of admission (look at `residuals(model2)`

). With Department I omitted, we achieve a simple model that fits the data well, and in which there is no significant sex effect:

```
model3 <- update(model2, subset = (Department != "I"))
summary(model3)
```

A simple ‘graphical’ representation of this model is i.e., admission is independent of sex, within any given department.

## What was the ‘paradox’ here?

When we looked at the data aggregated over departments (ie, we ignored the `Department`

variable), we found (via `model1`

) a seemingly clear, and positive, association between *admission to Berkeley* and *being male*.

On the other hand, the department-by-department models (`model2`

and `model3`

) show that in most departments there is *no* apparent association; while in Department I it’s the *female* applicants who have the higher admission rate.

So, is there a relationship between admission rate and M/F? The answer is: **it depends!**

(That is, the answer depends on the precise question being asked. In the Berkeley admissions context, the department-by-department analysis is probably more relevant to admissions policy discussions, perhaps?)