Skip to content Skip to navigation
University of Warwick
  • Study
  • |
  • Research
  • |
  • Business
  • |
  • Alumni
  • |
  • News
  • |
  • About
  • Text only
  • |
  • Sign in
  • Search MOAC
  • Search University of Warwick
  • Search for people at Warwick
  • Search Warwick Blogs
  • Search past exam papers
  • Search video
  • More…

    Molecular Organisation and Assembly in Cells

    • About the DTC
    • Research
    • People
    • Degrees
    • Study at MOAC
    • News & Events
    • MOAC Students »
    • Peter Cock »
    • Python Programming »
    • Linear Regressions
    University of Warwick

    Using Python (and R) to calculate Linear Regressions

    You might also be interested in my page on doing Rank Correlations with Python and/or R.

    This page demonstrates three different ways to calculate a linear regression from python:

    • Pure Python - Gary Strangman's linregress function
    • R from Python - R's lsfit function (Least Squares Fit)
    • R from Python - R's lm function (Linear Model)

    Pure Python - Gary Strangman's linregress function

    In Python, Gary Strangman's library (available in the SciPy library) can be used to do a simple linear regression as follows:-

    >>> from scipy import stats
    >>> x = [5.05, 6.75, 3.21, 2.66]
    >>> y = [1.65, 26.5, -5.93, 7.96]
    >>> gradient, intercept, r_value, p_value, std_err = stats.linregress(x,y)
    >>> print "Gradient and intercept", gradient, intercept
    Gradient and intercept (5.3935773611970186, -16.281127993087829)
    >>> print "R-squared", r_value**2
    R-squared 0.524806275136
    >>> print "p-value", p_value
    p-value 0.27556485788150242

    Typing help(stats.linregress) will tell you about the return values (gradient, y-axis intercept, r, two-tailed probability, and the standard error of the estimate).

    R from Python - R's lsfit function (Least Squares Fit)

    A simple way to do this in the R language is to use the lsfit function (Least Squares Fit):

    > x <- c(5.05, 6.75, 3.21, 2.66)
    > y <- c(1.65, 26.5, -5.93, 7.96)
    > lsfit(x, y)$coefficients
     Intercept          X 
    -16.281128   5.393577

    From Python using RPy (R from Python), this is just:-

    >>> from rpy import r
    >>> x = [5.05, 6.75, 3.21, 2.66]
    >>> y = [1.65, 26.5, -5.93, 7.96]
    >>> print r.lsfit(x,y)['coefficients']
    {'X': 5.3935773611970212, 'Intercept': -16.281127993087839}

    Plotting the Regression line from R's lsfit function (Least Squares Fit)

    If you are using R, its very easy to do an x-y scatter plot with the linear model regression line:

    x <- c(5.05, 6.75, 3.21, 2.66)
    y <- c(1.65, 26.5, -5.93, 7.96)
    plot(x, y, main="Example Scatter with regression")
    abline(lsfit(x, y)$coefficients, col="red")

    Again, we can do this with Python - to try and reduce the confusion between all the different meanings of x and y, I am calling the python list variables my_x and my_y here:

    from rpy import r
    my_x = [5.05, 6.75, 3.21, 2.66]
    my_y = [1.65, 26.5, -5.93, 7.96]
    ls_fit = r.lsfit(my_x,my_y)
    gradient = ls_fit['coefficients']['X']
    yintercept= ls_fit['coefficients']['Intercept']
    
    r.png("scatter_regression.png", width=400, height=350)
    r.plot(x=my_x, y=my_y, xlab="x", ylab="y", xlim=(0,7), ylim=(-16,27),
           main="Example Scatter with regression")
    r.abline(a=yintercept, b=gradient, col="red")
    r.dev_off()

    I have explicitly chosen the axes limits to show the y-axis intersect at -16.28 as previously determined:
    [Figure]

    Note that we have explicitly pulled out the gradient and y-intercept from the coefficients (which is a python dictionary) and given them to the abline function as named arguments. If we simply pass the coefficients dictionary back to the abline function we have no guarantee that its elements will be in the "right" order.

    Alternatively, we could turn off rpy's conversion and keep everything as R objects.

    Also notice we have told R to write the graphics to a file, instead of showing them on screen. This was firstly to make getting an image for this webpage easier, and secondly this approach is usually more reliable on Windows.

    R from Python - R's lm function (Linear Model)

    This third method is much more complicated (especially from python) but offers more information than just the linear regression coefficient: R's linear model fitting:

    > x <- c(5.05, 6.75, 3.21, 2.66)
    > y <- c(1.65, 26.5, -5.93, 7.96)
    > lm(y ~ x)$coefficients
    (Intercept)           x 
     -16.281128    5.393577

    The syntax y ~ x tells the lm function to use y depends on x as its model.

    Its hard work, but we can still do this from within Python using RPy. Part of the problem is python doesn't have a built in equivalent to the ~ operator, so we have to use rpy's "evaluate a string" functionality.

    >>> from rpy import r
    >>> my_x = [5.05, 6.75, 3.21, 2.66]
    >>> my_y = [1.65, 26.5, -5.93, 7.96]
    >>> print r.lm(r("y ~ x"), data = r.data_frame(x=my_x, y=my_y))['coefficients']
    {'x': 5.3935773611970212, '(Intercept)': -16.281127993087839}

    Plotting the Regression line from R's Linear Model

    If you are using R, its very easy to do an x-y scatter plot with the linear model regression line:

    > x <- c(5.05, 6.75, 3.21, 2.66)
    > y <- c(1.65, 26.5, -5.93, 7.96)
    > plot(x, y, main="Example Scatter with regression")
    > abline(coef(lm(y ~ x)), col="red")

    Again, we can do this with Python:

    from rpy import r
    my_x = [5.05, 6.75, 3.21, 2.66]
    my_y = [1.65, 26.5, -5.93, 7.96]
    linear_model = r.lm(r("y ~ x"), data = r.data_frame(x=my_x, y=my_y))
    gradient = linear_model['coefficients']['x']
    yintercept= linear_model['coefficients']['(Intercept)']
    
    r.png("scatter_regression.png", width=400, height=350)
    r.plot(x=my_x, y=my_y, xlab="x", ylab="y", xlim=(0,7), ylim=(-16,27),
               main="Example Scatter with regression")
    r.abline(a=yintercept, b=gradient, col="red")
    r.dev_off()

    As above, this is the exciting result - I have explicitly chosen the axes limits to show the y-axis intersect at -16.28 as previously determined:
    [Figure]

    Standard errors etc from R's Linear Model

    Finally, as a slight aside following a question from a Derek Bandler, here is a handy bit of R code to get the standard errors, p-values etc from a regression model using the R summary command:

    > x <- c(5.05, 6.75, 3.21, 2.66)
    > y <- c(1.65, 26.5, -5.93, 7.96)
    > lm(y ~ x)$coefficients
    (Intercept)           x 
     -16.281128    5.393577
    > summary(lm(y~x))
    
    Call:
    lm(formula = y ~ x)
    
    Residuals:
         1      2      3      4 
    -9.306  6.374 -6.962  9.894 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  -16.281     17.065  -0.954    0.441
    x              5.394      3.629   1.486    0.276
    
    Residual standard error: 11.7 on 2 degrees of freedom
    Multiple R-Squared: 0.5248,     Adjusted R-squared: 0.2872 
    F-statistic: 2.209 on 1 and 2 DF,  p-value: 0.2756

    The next question is how to do this in Python with rpy. Well, using rpy's default object conversion mode would mean the lm results would be converted from an R dataframe into a Python dictionary. The python dictionary isn't quite good enough to hold all the information R stores in a dataframe, so if rpy tries to convert it back again, the R summary command can't understand it

    One solution is to keep the linear model as an R object (by turning off rpy's conversion temporarily). We can then pass this to back the R summary command without problem, and we can still treat the linear model as a Python dictionary using the as_py method - to access its coefficients for example:

    >>> import rpy
    >>> x = [5.05, 6.75, 3.21, 2.66]
    >>> y = [1.65, 26.5, -5.93, 7.96]
    >>> rpy.set_default_mode(rpy.NO_CONVERSION)
    >>> linear_model = rpy.r.lm(rpy.r("y ~ x"), data = rpy.r.data_frame(x=x, y=y))
    >>> rpy.set_default_mode(rpy.BASIC_CONVERSION)
    >>> print linear_model.as_py()['coefficients']
    {'x': 5.3935773611970212, '(Intercept)': -16.281127993087839}
    >>> print rpy.r.summary(linear_model)
    {'terms': <Robj object at 0x0089E240>, 'fstatistic': {'dendf': 2.0, 'value': 2.2088097871524752, 'numdf': 1.0}, 'aliased': {'x': False, '(Intercept)': False}, 'df': [2, 2, 2], 'call': <Robj object at 0x0089E340>, 'residuals': {'1': -9.3064376809571137, '3': -6.9622553363545983, '2': 6.3744808050079511, '4': 9.8942122123037599}, 'adj.r.squared': 0.28720941270437206, 'cov.unscaled': array([[ 2.1286381 , -0.42527178], [-0.42527178, 0.09626979]]), 'r.squared': 0.524806275136248, 'sigma': 11.696414461570097, 'coefficients': array([[-16.28112799, 17.06489672, -0.95407129, 0.44073772], [ 5.39357736, 3.62909012, 1.48620651, 0.27556486]])}

    See also Linear Models using the Iris data in R, which takes this a little further.

    [Python logo]
    Python

    [R logo]
    The R Project

    [SciPy logo]
    SciPy
    For Gary Strangman's stats.py

    [RPy logo]
    RPy (R from Python)

    MOAC DTC, Coventry House, University of Warwick, Gibbet Hill Road, Coventry, CV4 7AL
    Tel. 024 765 75808 moac2 at warwick dot ac dot uk

    How to find us

    MOAC Intranet

    EPSRC logo

    Close this email form
    Page contact: Peter Cock Last revised: Thu 4 Jan 2007
    • Sign in
    • |
    • Powered by Sitebuilder
    • |
    • © MMXII
    • |
    • Privacy
    • |
    • Accessibility