Using Python (and R) to calculate Linear Regressions

You might also be interested in my page on doing Rank Correlations with Python and/or R.

This page demonstrates three different ways to calculate a linear regression from python:

Pure Python - Gary Strangman's linregress function
R from Python - R's lsfit function (Least Squares Fit)
R from Python - R's lm function (Linear Model)

Pure Python - Gary Strangman's linregress function

In Python, Gary Strangman's library (available in the SciPy library) can be used to do a simple linear regression as follows:-

>>> from scipy import stats >>> x = [5.05, 6.75, 3.21, 2.66] >>> y = [1.65, 26.5, -5.93, 7.96] >>> gradient, intercept, r_value, p_value, std_err = stats.linregress(x,y) >>> print "Gradient and intercept", gradient, intercept Gradient and intercept (5.3935773611970186, -16.281127993087829) >>> print "R-squared", r_value**2 R-squared 0.524806275136 >>> print "p-value", p_value p-value 0.27556485788150242

Typing help(stats.linregress) will tell you about the return values (gradient, y-axis intercept, r, two-tailed probability, and the standard error of the estimate).

R from Python - R's lsfit function (Least Squares Fit)

A simple way to do this in the R language is to use the lsfit function (Least Squares Fit):

> x <- c(5.05, 6.75, 3.21, 2.66)
> y <- c(1.65, 26.5, -5.93, 7.96)
> lsfit(x, y)$coefficients
 Intercept          X 
-16.281128   5.393577

From Python using RPy (R from Python), this is just:-

>>> from rpy import r >>> x = [5.05, 6.75, 3.21, 2.66] >>> y = [1.65, 26.5, -5.93, 7.96] >>> print r.lsfit(x,y)['coefficients'] {'X': 5.3935773611970212, 'Intercept': -16.281127993087839}

Plotting the Regression line from R's lsfit function (Least Squares Fit)

If you are using R, its very easy to do an x-y scatter plot with the linear model regression line:

x <- c(5.05, 6.75, 3.21, 2.66)
y <- c(1.65, 26.5, -5.93, 7.96)
plot(x, y, main="Example Scatter with regression")
abline(lsfit(x, y)$coefficients, col="red")

Again, we can do this with Python - to try and reduce the confusion between all the different meanings of x and y, I am calling the python list variables my_x and my_y here:

from rpy import r
my_x = [5.05, 6.75, 3.21, 2.66]
my_y = [1.65, 26.5, -5.93, 7.96]
ls_fit = r.lsfit(my_x,my_y)
gradient = ls_fit['coefficients']['X']
yintercept= ls_fit['coefficients']['Intercept']

r.png("scatter_regression.png", width=400, height=350)
r.plot(x=my_x, y=my_y, xlab="x", ylab="y", xlim=(0,7), ylim=(-16,27),
       main="Example Scatter with regression")
r.abline(a=yintercept, b=gradient, col="red")
r.dev_off()

I have explicitly chosen the axes limits to show the y-axis intersect at -16.28 as previously determined:
[Figure]

Note that we have explicitly pulled out the gradient and y-intercept from the coefficients (which is a python dictionary) and given them to the abline function as named arguments. If we simply pass the coefficients dictionary back to the abline function we have no guarantee that its elements will be in the "right" order.

Alternatively, we could turn off rpy's conversion and keep everything as R objects.

Also notice we have told R to write the graphics to a file, instead of showing them on screen. This was firstly to make getting an image for this webpage easier, and secondly this approach is usually more reliable on Windows.

R from Python - R's lm function (Linear Model)

This third method is much more complicated (especially from python) but offers more information than just the linear regression coefficient: R's linear model fitting:

> x <- c(5.05, 6.75, 3.21, 2.66)
> y <- c(1.65, 26.5, -5.93, 7.96)
> lm(y ~ x)$coefficients
(Intercept)           x 
 -16.281128    5.393577

The syntax y ~ x tells the lm function to use y depends on x as its model.

Its hard work, but we can still do this from within Python using RPy. Part of the problem is python doesn't have a built in equivalent to the ~ operator, so we have to use rpy's "evaluate a string" functionality.

>>> from rpy import r >>> my_x = [5.05, 6.75, 3.21, 2.66] >>> my_y = [1.65, 26.5, -5.93, 7.96] >>> print r.lm(r("y ~ x"), data = r.data_frame(x=my_x, y=my_y))['coefficients'] {'x': 5.3935773611970212, '(Intercept)': -16.281127993087839}

Plotting the Regression line from R's Linear Model

If you are using R, its very easy to do an x-y scatter plot with the linear model regression line:

> x <- c(5.05, 6.75, 3.21, 2.66)
> y <- c(1.65, 26.5, -5.93, 7.96)
> plot(x, y, main="Example Scatter with regression")
> abline(coef(lm(y ~ x)), col="red")

Again, we can do this with Python:

from rpy import r
my_x = [5.05, 6.75, 3.21, 2.66]
my_y = [1.65, 26.5, -5.93, 7.96]
linear_model = r.lm(r("y ~ x"), data = r.data_frame(x=my_x, y=my_y))
gradient = linear_model['coefficients']['x']
yintercept= linear_model['coefficients']['(Intercept)']

r.png("scatter_regression.png", width=400, height=350)
r.plot(x=my_x, y=my_y, xlab="x", ylab="y", xlim=(0,7), ylim=(-16,27),
           main="Example Scatter with regression")
r.abline(a=yintercept, b=gradient, col="red")
r.dev_off()

As above, this is the exciting result - I have explicitly chosen the axes limits to show the y-axis intersect at -16.28 as previously determined:
[Figure]

Standard errors etc from R's Linear Model

Finally, as a slight aside following a question from a Derek Bandler, here is a handy bit of R code to get the standard errors, p-values etc from a regression model using the R summary command:

> x <- c(5.05, 6.75, 3.21, 2.66)
> y <- c(1.65, 26.5, -5.93, 7.96)
> lm(y ~ x)$coefficients
(Intercept)           x 
 -16.281128    5.393577
> summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
     1      2      3      4 
-9.306  6.374 -6.962  9.894 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -16.281     17.065  -0.954    0.441
x              5.394      3.629   1.486    0.276

Residual standard error: 11.7 on 2 degrees of freedom
Multiple R-Squared: 0.5248,     Adjusted R-squared: 0.2872 
F-statistic: 2.209 on 1 and 2 DF,  p-value: 0.2756

The next question is how to do this in Python with rpy. Well, using rpy's default object conversion mode would mean the lm results would be converted from an R dataframe into a Python dictionary. The python dictionary isn't quite good enough to hold all the information R stores in a dataframe, so if rpy tries to convert it back again, the R summary command can't understand it

One solution is to keep the linear model as an R object (by turning off rpy's conversion temporarily). We can then pass this to back the R summary command without problem, and we can still treat the linear model as a Python dictionary using the as_py method - to access its coefficients for example:

>>> import rpy >>> x = [5.05, 6.75, 3.21, 2.66] >>> y = [1.65, 26.5, -5.93, 7.96] >>> rpy.set_default_mode(rpy.NO_CONVERSION) >>> linear_model = rpy.r.lm(rpy.r("y ~ x"), data = rpy.r.data_frame(x=x, y=y)) >>> rpy.set_default_mode(rpy.BASIC_CONVERSION) >>> print linear_model.as_py()['coefficients'] {'x': 5.3935773611970212, '(Intercept)': -16.281127993087839} >>> print rpy.r.summary(linear_model) {'terms': <Robj object at 0x0089E240>, 'fstatistic': {'dendf': 2.0, 'value': 2.2088097871524752, 'numdf': 1.0}, 'aliased': {'x': False, '(Intercept)': False}, 'df': [2, 2, 2], 'call': <Robj object at 0x0089E340>, 'residuals': {'1': -9.3064376809571137, '3': -6.9622553363545983, '2': 6.3744808050079511, '4': 9.8942122123037599}, 'adj.r.squared': 0.28720941270437206, 'cov.unscaled': array([[ 2.1286381 , -0.42527178], [-0.42527178, 0.09626979]]), 'r.squared': 0.524806275136248, 'sigma': 11.696414461570097, 'coefficients': array([[-16.28112799, 17.06489672, -0.95407129, 0.44073772], [ 5.39357736, 3.62909012, 1.48620651, 0.27556486]])}

See also Linear Models using the Iris data in R, which takes this a little further.

Python

The R Project

SciPy
For Gary Strangman's stats.py

RPy (R from Python)