Skip to content Skip to navigation
University of Warwick
  • Study
  • |
  • Research
  • |
  • Business
  • |
  • Alumni
  • |
  • News
  • |
  • About
  • Text only
  • |
  • Sign in
  • Search MOAC
  • Search University of Warwick
  • Search for people at Warwick
  • Search Warwick Blogs
  • Search past exam papers
  • Search video
  • More…

    Molecular Organisation and Assembly in Cells

    • About the DTC
    • Research
    • People
    • Degrees
    • Study at MOAC
    • News & Events
    • MOAC Students »
    • Peter Cock »
    • R Programming »
    • Linear Models
    University of Warwick

    Linear Regressions and Linear Models using the Iris Data

    Have a look at this page where I introduce and plot the Iris data before diving into this topic. To summarise, the data set consists of four measurements (length and width of the petals and sepals) of one hundred and fifty Iris flowers from three species:
    [Pairs Scatter Plot, or Draftsman's display using the Iris data in R]

    Linear Regressions

    You will have noticed on the previous page (or the plot above), that petal length and petal width are highly correlated over all species. How about running a linear regression? First of all, using the "least squares fit" function lsfit gives this:

    > lsfit(iris$Petal.Length, iris$Petal.Width)$coefficients
    Intercept X
    -0.3630755 0.4157554
    > plot(iris$Petal.Length, iris$Petal.Width, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Edgar Anderson's Iris Data", xlab="Petal length", ylab="Petal width")
    > abline(lsfit(iris$Petal.Length, iris$Petal.Width)$coefficients, col="black")
    [Scatter Plot with Linear Regression best fit shown]

    The function lsfit is a bit of a "one trick pony" and its a lot more flexible to use a linear model instead (function lm). For this example you get exactly the same thing when we model petal width depending on petal length (written as Petal.Width ~ Petal.Length in R's model syntax):

    > lm(Petal.Width ~ Petal.Length, data=iris)$coefficients
    (Intercept) Petal.Length
    -0.3630755 0.4157554
    > plot(iris$Petal.Length, iris$Petal.Width, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Edgar Anderson's Iris Data", xlab="Petal length", ylab="Petal width")
    > abline(lm(Petal.Width ~ Petal.Length, data=iris)$coefficients, col="black")

    (same graph again)

    You get more than just that with a linear model:

    > summary(lm(Petal.Width ~ Petal.Length, data=iris))
    
    Call:
    lm(formula = Petal.Width ~ Petal.Length, data = iris)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.56515 -0.12358 -0.01898  0.13288  0.64272 
    
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  -0.363076   0.039762  -9.131  4.7e-16 ***
    Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    Residual standard error: 0.2065 on 148 degrees of freedom
    Multiple R-Squared: 0.9271,     Adjusted R-squared: 0.9266 
    F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16 

    Looking at those p-values, this simple model seems to fit the data very well.

    Linear Models

    The main point about using a linear model is we can consider more complicated examples. What about the sepal length as a function of the sepal width?

    > plot(iris$Sepal.Width, iris$Sepal.Length, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Edgar Anderson's Iris Data", xlab="Sepal Width", ylab="Sepal Length")
    > abline(lm(Sepal.Length ~ Sepal.Width, data=iris)$coefficients, col="black")
    [Scatter Plot with Linear Regression best fit shown]

    It very clear that the linear model Sepal.Length ~ Sepal.Width (black line) is not doing a very good job, even without looking at the statistics:

    > summary(lm(Sepal.Length ~ Sepal.Width, data=iris))
    
    Call:
    lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -1.5561 -0.6333 -0.1120  0.5579  2.2225 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)   6.5262     0.4789   13.63   <2e-16 ***
    Sepal.Width  -0.2234     0.1551   -1.44    0.152    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    Residual standard error: 0.8251 on 148 degrees of freedom
    Multiple R-Squared: 0.01382,    Adjusted R-squared: 0.007159 
    F-statistic: 2.074 on 1 and 148 DF,  p-value: 0.1519 

    Look at those lousy p-values.

    Linear Models - Adding terms

    What happens if we divide the data up by species, and run three separate linear regressions?

    > plot(iris$Sepal.Width, iris$Sepal.Length, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Edgar Anderson's Iris Data", xlab="Petal length", ylab="Sepal length")
    > abline(lm(Sepal.Length ~ Sepal.Width, data=iris)$coefficients, col="black")
    > abline(lm(Sepal.Length ~ Sepal.Width, data=iris[which(iris$Species=="setosa"),])$coefficients, col="red")
    > abline(lm(Sepal.Length ~ Sepal.Width, data=iris[which(iris$Species=="versicolor"),])$coefficients, col="green3")
    > abline(lm(Sepal.Length ~ Sepal.Width, data=iris[which(iris$Species=="virginica"),])$coefficients, col="blue")
    [Scatter Plot with several Linear Regressions]

    The coefficients doing separate per species regressions of Sepal.Length ~ Sepal.Width are:

    > lm(Sepal.Length ~ Sepal.Width, data=iris[which(iris$Species=="setosa"),])$coefficients
    (Intercept) Sepal.Width 
      2.6390012   0.6904897 
    > lm(Sepal.Length ~ Sepal.Width, data=iris[which(iris$Species=="versicolor"),])$coefficients
    (Intercept) Sepal.Width 
      3.5397347   0.8650777
    > lm(Sepal.Length ~ Sepal.Width, data=iris[which(iris$Species=="virginica"),])$coefficients
    (Intercept) Sepal.Width 
      3.9068365   0.9015345

    The equivalent linear model would be something like Sepal.Length ~ Petal.Length:Species + Species - 1, which gives identical coefficients (see later for why I did this):

    > lm(Sepal.Length ~ Sepal.Width:Species + Species - 1, data=iris)$coefficients
                    Speciessetosa             Speciesversicolor              Speciesvirginica 
                        2.6390012                     3.5397347                     3.9068365 
        Sepal.Width:Speciessetosa Sepal.Width:Speciesversicolor  Sepal.Width:Speciesvirginica 
                        0.6904897                     0.8650777                     0.9015345

    What are these new terms? Because Species is a categorical input variable (a factor in R's terminology) it can't be used directly in a linear model as they need actual numbers (a linear model is basically a matrix equation). So, the following "dummy variables" have been invented for each data point (which are just numbers)

    • Speciessetosa = 1 if Species is "setosa", 0 otherwise
    • Speciesversicolor = 1 if Species is "versicolor", 0 otherwise
    • Speciesvirginica = 1 if Species is "virginica", 0 otherwise
    • Sepal.Width:Speciessetosa = Sepal.Width if Species is "setosa", 0 otherwise
    • Sepal.Width:Speciesversicolor = Sepal.Width if Species is "versicolor", 0 otherwise
    • Sepal.Width:Speciesvirginica = Sepal.Width if Species is "virginica", 0 otherwise

    Using the summary command on the linear model object gives:

    > summary(lm(Sepal.Length ~ Sepal.Width:Species + Species - 1, data=iris))
    
    Call:
    lm(formula = Sepal.Length ~ Sepal.Width:Species + Species - 1, 
        data = iris)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -1.26067 -0.25861 -0.03305  0.18929  1.44917 
    
    Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
    Speciessetosa                   2.6390     0.5715   4.618 8.53e-06 ***
    Speciesversicolor               3.5397     0.5580   6.343 2.74e-09 ***
    Speciesvirginica                3.9068     0.5827   6.705 4.25e-10 ***
    Sepal.Width:Speciessetosa       0.6905     0.1657   4.166 5.31e-05 ***
    Sepal.Width:Speciesversicolor   0.8651     0.2002   4.321 2.88e-05 ***
    Sepal.Width:Speciesvirginica    0.9015     0.1948   4.628 8.16e-06 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    Residual standard error: 0.4397 on 144 degrees of freedom
    Multiple R-Squared: 0.9947,     Adjusted R-squared: 0.9944 
    F-statistic:  4478 on 6 and 144 DF,  p-value: < 2.2e-16 

    Just look at those p-values! Every single term has an excellent p-value, as does the model as a whole. And the residual standard error has also been halved.

    In this case, the Sepal.Length ~ Sepal.Width:Species + Species - 1 model is clearly much better than just Sepal.Length ~ Sepal.Width.

    Linear Models - Simplify with AIC

    On the other hand, what about this choice instead: Sepal.Length ~ Sepal.Width + Species. In fact, this is what the AIC (Akaike Information Criterion) step function gives you if you start with all possible interactions between sepal width and species, which is written Sepal.Length ~ Sepal.Width * Species (using a asterix instead of a plus or colon) in R:

    > summary(step(lm(Sepal.Length ~ Sepal.Width * Species, data=iris)))
    ...
    lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -1.30711 -0.25713 -0.05325  0.19542  1.41253 
    
    Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
    (Intercept)         2.2514     0.3698   6.089 9.57e-09 ***
    Sepal.Width         0.8036     0.1063   7.557 4.19e-12 ***
    Speciesversicolor   1.4587     0.1121  13.012  < 2e-16 ***
    Speciesvirginica    1.9468     0.1000  19.465  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    Residual standard error: 0.438 on 146 degrees of freedom
    Multiple R-Squared: 0.7259,     Adjusted R-squared: 0.7203 
    F-statistic: 128.9 on 3 and 146 DF,  p-value: < 2.2e-16 

    How can we interpret this? Recall we did three species specific regressions? If you imagine applying a shift of 1.4587 to the Iris versicolor (green) points, and a shift of 1.9468 to the Iris virginica (blue) points then their individual best fit lines would more of less agree (with a gradient of say 0.8036):
    [Scatter Plot with several Linear Regressions]

    Or, to put that another way, this model predicts that Sepal.Length = (0.8036 * Sepal.Width) + 2.2514 plus an additional 1.4587 if the species is Iris versicolor or 1.9468 if the species is Iris virginica.

    Note that using summary(step(lm(Sepal.Length ~ Sepal.Width * Species - 1, data=iris))) gives an equivalent model, but like the case discussed below would use a dummy variable for each of the three species, rather than an intercept term and two dummy variables.

    See help(lm), help(step) for more information, and perhaps also help(glm) too.

    Note on Intercept Coefficients

    Recall that I just introduced a model of the form Sepal.Length ~ Sepal.Width:Species + Species - 1, which gave identical coefficients to those found doing species specific regressions:

    > lm(Sepal.Length ~ Sepal.Width:Species + Species - 1, data=iris)$coefficients
                    Speciessetosa             Speciesversicolor              Speciesvirginica 
                        2.6390012                     3.5397347                     3.9068365 
        Sepal.Width:Speciessetosa Sepal.Width:Speciesversicolor  Sepal.Width:Speciesvirginica 
                        0.6904897                     0.8650777                     0.9015345
    

    The use of the "- 1" in the model above told R not to automatically include a default intercept term. The alternative is the following:

    > lm(Sepal.Length ~ Sepal.Width:Species + Species, data=iris)$coefficients
                      (Intercept)             Speciesversicolor              Speciesvirginica 
                        2.6390012                     0.9007335                     1.2678352 
        Sepal.Width:Speciessetosa Sepal.Width:Speciesversicolor  Sepal.Width:Speciesvirginica 
                        0.6904897                     0.8650777                     0.9015345

    This time the dummy variables are:

    • (Intercept) = 1, regardless of Species
    • Speciesversicolor = 1 if Species is "versicolor", 0 otherwise
    • Speciesvirginica = 1 if Species is "virginica", 0 otherwise
    • Sepal.Width:Speciessetosa as before
    • Sepal.Width:Speciesversicolor as before
    • Sepal.Width:Speciesvirginica as before

    Notice that the new (Intercept) term and the old Speciessetosa have the same coefficient, 2.6390012. Similarly note that for Speciesversicolor the old value of 3.5397347 equals 2.6390012 + 0.9007335 (the intercept plus the new coefficient), and likewise for Speciesvirginica, the old value of 3.9068365 equals 2.6390012 + 1.2678352. This all makes sense - these are basically equivalent models but with different dummy variables. Anyway, I found the first case slightly easier to explain.

    [R logo]
    The R Project

    MOAC DTC, Coventry House, University of Warwick, Gibbet Hill Road, Coventry, CV4 7AL
    Tel. 024 765 75808 moac2 at warwick dot ac dot uk

    How to find us

    MOAC Intranet

    EPSRC logo

    Close this email form
    Page contact: Peter Cock Last revised: Tue 16 Jan 2007
    • Sign in
    • |
    • Powered by Sitebuilder
    • |
    • © MMXII
    • |
    • Privacy
    • |
    • Accessibility