Linear Regressions and Linear Models using the Iris Data

Have a look at this page where I introduce and plot the Iris data before diving into this topic. To summarise, the data set consists of four measurements (length and width of the petals and sepals) of one hundred and fifty Iris flowers from three species:

Linear Regressions

You will have noticed on the previous page (or the plot above), that petal length and petal width are highly correlated over all species. How about running a linear regression? First of all, using the "least squares fit" function lsfit gives this:

> lsfit(iris$Petal.Length, iris$Petal.Width)$coefficients Intercept X -0.3630755 0.4157554 > plot(iris$Petal.Length, iris$Petal.Width, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Edgar Anderson's Iris Data", xlab="Petal length", ylab="Petal width") > abline(lsfit(iris$Petal.Length, iris$Petal.Width)$coefficients, col="black")

The function lsfit is a bit of a "one trick pony" and its a lot more flexible to use a linear model instead (function lm). For this example you get exactly the same thing when we model petal width depending on petal length (written as Petal.Width ~ Petal.Length in R's model syntax):

> lm(Petal.Width ~ Petal.Length, data=iris)$coefficients (Intercept) Petal.Length -0.3630755 0.4157554 > plot(iris$Petal.Length, iris$Petal.Width, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Edgar Anderson's Iris Data", xlab="Petal length", ylab="Petal width") > abline(lm(Petal.Width ~ Petal.Length, data=iris)$coefficients, col="black")
(same graph again)

You get more than just that with a linear model:

> summary(lm(Petal.Width ~ Petal.Length, data=iris))

Call:
lm(formula = Petal.Width ~ Petal.Length, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.56515 -0.12358 -0.01898  0.13288  0.64272 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.363076   0.039762  -9.131  4.7e-16 ***
Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.2065 on 148 degrees of freedom
Multiple R-Squared: 0.9271,     Adjusted R-squared: 0.9266 
F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

Looking at those p-values, this simple model seems to fit the data very well.

Linear Models

The main point about using a linear model is we can consider more complicated examples. What about the sepal length as a function of the sepal width?

> plot(iris$Sepal.Width, iris$Sepal.Length, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Edgar Anderson's Iris Data", xlab="Sepal Width", ylab="Sepal Length") > abline(lm(Sepal.Length ~ Sepal.Width, data=iris)$coefficients, col="black")

It very clear that the linear model Sepal.Length ~ Sepal.Width (black line) is not doing a very good job, even without looking at the statistics:

> summary(lm(Sepal.Length ~ Sepal.Width, data=iris))

Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = iris)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5561 -0.6333 -0.1120  0.5579  2.2225 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.5262     0.4789   13.63   <2e-16 ***
Sepal.Width  -0.2234     0.1551   -1.44    0.152    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.8251 on 148 degrees of freedom
Multiple R-Squared: 0.01382,    Adjusted R-squared: 0.007159 
F-statistic: 2.074 on 1 and 148 DF,  p-value: 0.1519

Look at those lousy p-values.

Linear Models - Adding terms

What happens if we divide the data up by species, and run three separate linear regressions?

> plot(iris$Sepal.Width, iris$Sepal.Length, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Edgar Anderson's Iris Data", xlab="Petal length", ylab="Sepal length") > abline(lm(Sepal.Length ~ Sepal.Width, data=iris)$coefficients, col="black") > abline(lm(Sepal.Length ~ Sepal.Width, data=iris[which(iris$Species=="setosa"),])$coefficients, col="red") > abline(lm(Sepal.Length ~ Sepal.Width, data=iris[which(iris$Species=="versicolor"),])$coefficients, col="green3") > abline(lm(Sepal.Length ~ Sepal.Width, data=iris[which(iris$Species=="virginica"),])$coefficients, col="blue")

The coefficients doing separate per species regressions of Sepal.Length ~ Sepal.Width are:

> lm(Sepal.Length ~ Sepal.Width, data=iris[which(iris$Species=="setosa"),])$coefficients
(Intercept) Sepal.Width 
  2.6390012   0.6904897 
> lm(Sepal.Length ~ Sepal.Width, data=iris[which(iris$Species=="versicolor"),])$coefficients
(Intercept) Sepal.Width 
  3.5397347   0.8650777
> lm(Sepal.Length ~ Sepal.Width, data=iris[which(iris$Species=="virginica"),])$coefficients
(Intercept) Sepal.Width 
  3.9068365   0.9015345

The equivalent linear model would be something like Sepal.Length ~ Petal.Length:Species + Species - 1, which gives identical coefficients (see later for why I did this):

> lm(Sepal.Length ~ Sepal.Width:Species + Species - 1, data=iris)$coefficients
                Speciessetosa             Speciesversicolor              Speciesvirginica 
                    2.6390012                     3.5397347                     3.9068365 
    Sepal.Width:Speciessetosa Sepal.Width:Speciesversicolor  Sepal.Width:Speciesvirginica 
                    0.6904897                     0.8650777                     0.9015345

What are these new terms? Because Species is a categorical input variable (a factor in R's terminology) it can't be used directly in a linear model as they need actual numbers (a linear model is basically a matrix equation). So, the following "dummy variables" have been invented for each data point (which are just numbers)

Speciessetosa = 1 if Species is "setosa", 0 otherwise
Speciesversicolor = 1 if Species is "versicolor", 0 otherwise
Speciesvirginica = 1 if Species is "virginica", 0 otherwise
Sepal.Width:Speciessetosa = Sepal.Width if Species is "setosa", 0 otherwise
Sepal.Width:Speciesversicolor = Sepal.Width if Species is "versicolor", 0 otherwise
Sepal.Width:Speciesvirginica = Sepal.Width if Species is "virginica", 0 otherwise

Using the summary command on the linear model object gives:

> summary(lm(Sepal.Length ~ Sepal.Width:Species + Species - 1, data=iris))

Call:
lm(formula = Sepal.Length ~ Sepal.Width:Species + Species - 1, 
    data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.26067 -0.25861 -0.03305  0.18929  1.44917 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
Speciessetosa                   2.6390     0.5715   4.618 8.53e-06 ***
Speciesversicolor               3.5397     0.5580   6.343 2.74e-09 ***
Speciesvirginica                3.9068     0.5827   6.705 4.25e-10 ***
Sepal.Width:Speciessetosa       0.6905     0.1657   4.166 5.31e-05 ***
Sepal.Width:Speciesversicolor   0.8651     0.2002   4.321 2.88e-05 ***
Sepal.Width:Speciesvirginica    0.9015     0.1948   4.628 8.16e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.4397 on 144 degrees of freedom
Multiple R-Squared: 0.9947,     Adjusted R-squared: 0.9944 
F-statistic:  4478 on 6 and 144 DF,  p-value: < 2.2e-16

Just look at those p-values! Every single term has an excellent p-value, as does the model as a whole. And the residual standard error has also been halved.

In this case, the Sepal.Length ~ Sepal.Width:Species + Species - 1 model is clearly much better than just Sepal.Length ~ Sepal.Width.

Linear Models - Simplify with AIC

On the other hand, what about this choice instead: Sepal.Length ~ Sepal.Width + Species. In fact, this is what the AIC (Akaike Information Criterion) step function gives you if you start with all possible interactions between sepal width and species, which is written Sepal.Length ~ Sepal.Width * Species (using a asterix instead of a plus or colon) in R:

> summary(step(lm(Sepal.Length ~ Sepal.Width * Species, data=iris)))
...
lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.30711 -0.25713 -0.05325  0.19542  1.41253 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.2514     0.3698   6.089 9.57e-09 ***
Sepal.Width         0.8036     0.1063   7.557 4.19e-12 ***
Speciesversicolor   1.4587     0.1121  13.012  < 2e-16 ***
Speciesvirginica    1.9468     0.1000  19.465  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.438 on 146 degrees of freedom
Multiple R-Squared: 0.7259,     Adjusted R-squared: 0.7203 
F-statistic: 128.9 on 3 and 146 DF,  p-value: < 2.2e-16

How can we interpret this? Recall we did three species specific regressions? If you imagine applying a shift of 1.4587 to the Iris versicolor (green) points, and a shift of 1.9468 to the Iris virginica (blue) points then their individual best fit lines would more of less agree (with a gradient of say 0.8036):
[Scatter Plot with several Linear Regressions]

Or, to put that another way, this model predicts that Sepal.Length = (0.8036 * Sepal.Width) + 2.2514 plus an additional 1.4587 if the species is Iris versicolor or 1.9468 if the species is Iris virginica.

Note that using summary(step(lm(Sepal.Length ~ Sepal.Width * Species - 1, data=iris))) gives an equivalent model, but like the case discussed below would use a dummy variable for each of the three species, rather than an intercept term and two dummy variables.

See help(lm), help(step) for more information, and perhaps also help(glm) too.

Note on Intercept Coefficients

Recall that I just introduced a model of the form Sepal.Length ~ Sepal.Width:Species + Species - 1, which gave identical coefficients to those found doing species specific regressions:

> lm(Sepal.Length ~ Sepal.Width:Species + Species - 1, data=iris)$coefficients
                Speciessetosa             Speciesversicolor              Speciesvirginica 
                    2.6390012                     3.5397347                     3.9068365 
    Sepal.Width:Speciessetosa Sepal.Width:Speciesversicolor  Sepal.Width:Speciesvirginica 
                    0.6904897                     0.8650777                     0.9015345

The use of the "- 1" in the model above told R not to automatically include a default intercept term. The alternative is the following:

> lm(Sepal.Length ~ Sepal.Width:Species + Species, data=iris)$coefficients
                  (Intercept)             Speciesversicolor              Speciesvirginica 
                    2.6390012                     0.9007335                     1.2678352 
    Sepal.Width:Speciessetosa Sepal.Width:Speciesversicolor  Sepal.Width:Speciesvirginica 
                    0.6904897                     0.8650777                     0.9015345

This time the dummy variables are:

(Intercept) = 1, regardless of Species
Speciesversicolor = 1 if Species is "versicolor", 0 otherwise
Speciesvirginica = 1 if Species is "virginica", 0 otherwise
Sepal.Width:Speciessetosa as before
Sepal.Width:Speciesversicolor as before
Sepal.Width:Speciesvirginica as before

Notice that the new (Intercept) term and the old Speciessetosa have the same coefficient, 2.6390012. Similarly note that for Speciesversicolor the old value of 3.5397347 equals 2.6390012 + 0.9007335 (the intercept plus the new coefficient), and likewise for Speciesvirginica, the old value of 3.9068365 equals 2.6390012 + 1.2678352. This all makes sense - these are basically equivalent models but with different dummy variables. Anyway, I found the first case slightly easier to explain.

The R Project