    Molecular Organisation and Assembly in Cells

    Using Python (and R) to calculate Rank Correlations

    You might also be interested in my pages on doing Linear Regressions with Python and/or R.

    This page covers:

    • Ranking data
    • Rank based Correlations
    • Spearman's Rho (ρ)
    • Kendall's Tau (τ)

    Ranking data

    Rank correlations are calculated on the ranks of the data rather than on the raw values. This is particularly useful for data containing outliers because, however extreme a value is, it contributes only its rank.

    For example, given two sets of data, say x = [5.05, 6.75, 3.21, 2.66] and y = [1.65, 26.5, -5.93, 7.96], with some ordering (here numerical) we can give them the ranks [3, 4, 2, 1] and [2, 4, 1, 3] respectively.

    Tied ranks are usually assigned using the midrank method, whereby those entries receive the mean of the ranks they would have received had they not been tied. Thus z = [1.65, 2.64, 2.64, 6.95] would yield ranks [1, 2.5, 2.5, 4] using the midrank method.
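    The midrank method described above can be sketched in plain Python as follows. This is only an illustration of the idea, not the code used by any of the libraries on this page, and the function name midranks is my own.

```python
def midranks(values):
    """Return 1-based ranks, giving tied entries the mean of their ranks."""
    # Indices of the entries, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the sorted values stay equal (a tie group).
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end would have received ranks start+1..end+1.
        mean_rank = (start + 1 + end + 1) / 2.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


x = [5.05, 6.75, 3.21, 2.66]
y = [1.65, 26.5, -5.93, 7.96]
z = [1.65, 2.64, 2.64, 6.95]
print(midranks(x))  # [3.0, 4.0, 2.0, 1.0]
print(midranks(y))  # [2.0, 4.0, 1.0, 3.0]
print(midranks(z))  # [1.0, 2.5, 2.5, 4.0]
```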

    We can do this in Python using Gary Strangman's rankdata function from the stats library in SciPy:-

    >>> import scipy
    >>> x = [5.05, 6.75, 3.21, 2.66]
    >>> print scipy.stats.stats.rankdata(x)
    [ 3. 4. 2. 1.]

    >>> y = [1.65, 26.5, -5.93, 7.96]
    >>> print scipy.stats.stats.rankdata(y)
    [ 2. 4. 1. 3.]

    >>> z = [1.65, 2.64, 2.64, 6.95]
    >>> print scipy.stats.stats.rankdata(z)
    [ 1. 2.5 2.5 4.]

    This functionality is built into the R language:-

    > x <- c(5.05, 6.75, 3.21, 2.66)
    > rank(x)
    [1] 3 4 2 1

    > y <- c(1.65, 26.5, -5.93, 7.96)
    > rank(y)
    [1] 2 4 1 3

    > z <- c(1.65, 2.64, 2.64, 6.95)
    > rank(z)
    [1] 1.0 2.5 2.5 4.0

    Rank based Correlations

    The two main correlations used for comparing such ranked data are known as the Spearman Rank Correlation (Spearman's ρ or Spearman's Rho) and Kendall's Tau (τ).

    Both have several variants (e.g. r_s, r_sa and r_sb for Spearman's ρ) which handle tied data in different ways.

    Taking Spearman's ρ as an example: x and y contain no ties, so r_s(x,y) and r_sb(x,y) are both 0.40 (2dp). However, z does contain ties, so r_s(x,z) = -0.55 (2dp, no tie correction) differs from r_sb(x,z) = -0.63 (2dp, with a tie correction).
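    The figures quoted above can be reproduced with a short sketch in plain Python. It assumes that the uncorrected value comes from the usual sum of squared rank differences applied to midranks, and that the tie-corrected value is Pearson's correlation of the midranks; the function names are my own, not part of any library.

```python
import math


def midranks(values):
    # Rank = (entries strictly below v) + (entries equal to v, plus 1) / 2.
    return [sum(w < v for w in values) + (sum(w == v for w in values) + 1) / 2.0
            for v in values]


def spearman_no_tie_correction(a, b):
    """1 - 6*sum(d^2)/(n^3 - n), applied to midranks without any correction."""
    ra, rb = midranks(a), midranks(b)
    n = len(a)
    d2 = sum((p - q) ** 2 for p, q in zip(ra, rb))
    return 1 - 6 * d2 / (n ** 3 - n)


def spearman_tie_corrected(a, b):
    """Pearson correlation of the midranks (assumed tie-corrected form)."""
    ra, rb = midranks(a), midranks(b)
    n = len(a)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(ra, rb))
    var_a = sum((p - mean_a) ** 2 for p in ra)
    var_b = sum((q - mean_b) ** 2 for q in rb)
    return cov / math.sqrt(var_a * var_b)


x = [5.05, 6.75, 3.21, 2.66]
y = [1.65, 26.5, -5.93, 7.96]
z = [1.65, 2.64, 2.64, 6.95]
print(round(spearman_no_tie_correction(x, y), 2))  # 0.4
print(round(spearman_tie_corrected(x, y), 2))      # 0.4
print(round(spearman_no_tie_correction(x, z), 2))  # -0.55
print(round(spearman_tie_corrected(x, z), 2))      # -0.63
```

    Without ties the two functions agree; they only diverge once one of the inputs contains tied values, as z does here.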

    The notation I am using is from the 5th edition (published 1990) of "Rank Correlation Methods", by Maurice Kendall and Jean Dickinson Gibbons (ISBN 0-85264-305-5, first published in 1948).

    To date, I have found two existing Python libraries with support for these correlations (Spearman and Kendall):

    • Gary Strangman's stats.py (last updated in 2003, includes Linear Regression). Travis Oliphant incorporated an earlier version of this into SciPy - Scientific tools for Python in 2002.
    • Michiel de Hoon's PyCluster module (which is also included as Bio.Cluster in BioPython).

    I have also used the R language (for statistical computing and graphics) from within Python using the package RPy (R from Python) to calculate these rank correlations.

    Spearman's Rho

    r_s = 1 - \frac{6 \sum d^2}{n^3 - n}

    where d is the difference between the two ranks of each item and n is the number of items.

    [Insert formula for r_s, r_sa and r_sb here]
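    As a partial sketch only: one tie-corrected form is algebraically the same as Pearson's correlation calculated on the midranks, and it reproduces the tie-corrected value quoted on this page. I am assuming this is what r_sb denotes; the symbols T_x, T_y and t are my own labels, and I make no claim here about r_sa.

```latex
\[
r_{sb} = \frac{\tfrac{1}{6}(n^3 - n) - \sum d^2 - T_x - T_y}
              {\sqrt{\left[\tfrac{1}{6}(n^3 - n) - 2T_x\right]
                     \left[\tfrac{1}{6}(n^3 - n) - 2T_y\right]}},
\qquad
T = \tfrac{1}{12} \sum (t^3 - t)
\]
% t is the number of entries in each group of tied values, summed over all
% tie groups in the variable concerned. With no ties, T_x = T_y = 0 and the
% expression reduces to r_s = 1 - 6 \sum d^2 / (n^3 - n).
%
% Check with x and z from this page: n = 4, \sum d^2 = 15.5, T_x = 0,
% T_z = (2^3 - 2)/12 = 0.5, so
\[
r_{sb}(x, z) = \frac{10 - 15.5 - 0 - 0.5}{\sqrt{10 \times 9}}
             = \frac{-6}{\sqrt{90}} \approx -0.632
\]
```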

    Gary Strangman's library in SciPy gives r_s, which has NO TIE CORRECTION included (it also calculates the two-tailed p-value):-

    >>> import scipy
    >>> x = [5.05, 6.75, 3.21, 2.66]
    >>> y = [1.65, 26.5, -5.93, 7.96]
    >>> z = [1.65, 2.64, 2.64, 6.95]
    >>> print scipy.stats.stats.spearmanr(x, y)[0]
    0.4

    >>> print scipy.stats.stats.spearmanr(x, z)[0]
    -0.55

    On the other hand, Michiel de Hoon's library (available in BioPython or standalone as PyCluster) returns Spearman's r_sb, which does include a tie correction:-

    >>> import Bio.Cluster
    >>> x = [5.05, 6.75, 3.21, 2.66]
    >>> y = [1.65, 26.5, -5.93, 7.96]
    >>> z = [1.65, 2.64, 2.64, 6.95]
    >>> print 1 - Bio.Cluster.distancematrix((x,y), dist="s")[1][0]
    0.4

    >>> print 1 - Bio.Cluster.distancematrix((x,z), dist="s")[1][0]
    -0.632455532034

    The distancematrix function takes a "matrix" and returns the distance between each pair of rows (in this case, just x and y); subtracting that distance from 1, as above, gives the correlation. This information could be stored as a symmetric matrix (with zeroes on the diagonal), but for efficiency it isn't - see help(Bio.Cluster.distancematrix) for more information.
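    To illustrate why the examples above index the result with [1][0], here is a plain Python sketch of a "ragged" lower-triangular layout. I am assuming this mirrors how Bio.Cluster lays out its result (check the help text to confirm); the helper names and the stand-in distance function are hypothetical and are not part of Bio.Cluster.

```python
def ragged_distances(rows, dist):
    """Row i holds the distances from rows[i] to rows[0..i-1]; row 0 is empty."""
    return [[dist(rows[i], rows[j]) for j in range(i)] for i in range(len(rows))]


def lookup(matrix, i, j):
    """Read entry (i, j) as if the full symmetric matrix had been stored."""
    if i == j:
        return 0.0          # the diagonal is implied, never stored
    if i < j:
        i, j = j, i         # symmetry: only entries below the diagonal exist
    return matrix[i][j]


def manhattan(a, b):
    # Stand-in distance used purely for illustration.
    return sum(abs(p - q) for p, q in zip(a, b))


rows = [[0, 0], [3, 4], [6, 8]]
m = ragged_distances(rows, manhattan)
print(m)                # [[], [7], [14, 7]]
print(lookup(m, 0, 2))  # 14
print(lookup(m, 1, 1))  # 0.0
```

    With only two rows the structure holds a single number, at position [1][0].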

    We can also access R's Spearman correlation from within Python. Again, this gives Spearman's r_sb, which does include a tie correction:-

    >>> import rpy
    >>> x = [5.05, 6.75, 3.21, 2.66]
    >>> y = [1.65, 26.5, -5.93, 7.96]
    >>> z = [1.65, 2.64, 2.64, 6.95]
    >>> print rpy.r.cor(x, y, method="spearman")
    0.4

    >>> print rpy.r.cor(x, z, method="spearman")
    -0.632455532034

    This could be done in R as follows:

    > x <- c(5.05, 6.75, 3.21, 2.66)
    > y <- c(1.65, 26.5, -5.93, 7.96)
    > z <- c(1.65, 2.64, 2.64, 6.95)
    > cor(x, y, method="spearman")
    [1] 0.4

    > cor(x, z, method="spearman")
    [1] -0.6324555

    Kendall's Tau

    [Insert formula for t_a and t_b here]
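    In the meantime, here is a pair-counting sketch in plain Python. It assumes t_a is the uncorrected statistic (concordant minus discordant pairs, divided by the total number of pairs) and t_b is the standard tie-corrected version, where the denominator drops pairs tied in each variable. The function name is my own, and this simple O(n²) loop is not how the libraries below are implemented.

```python
import math
from itertools import combinations


def kendall_tau(a, b):
    """Return (t_a, t_b) by counting concordant and discordant pairs."""
    concordant = discordant = ties_a = ties_b = 0
    for (a1, b1), (a2, b2) in combinations(zip(a, b), 2):
        da, db = a1 - a2, b1 - b2
        if da == 0:
            ties_a += 1          # pair tied in the first variable
        if db == 0:
            ties_b += 1          # pair tied in the second variable
        if da * db > 0:
            concordant += 1      # both variables move in the same direction
        elif da * db < 0:
            discordant += 1      # the variables move in opposite directions
    n_pairs = len(a) * (len(a) - 1) // 2
    s = concordant - discordant
    t_a = s / n_pairs
    t_b = s / math.sqrt((n_pairs - ties_a) * (n_pairs - ties_b))
    return t_a, t_b


x = [5.05, 6.75, 3.21, 2.66]
y = [1.65, 26.5, -5.93, 7.96]
z = [1.65, 2.64, 2.64, 6.95]
print([round(t, 4) for t in kendall_tau(x, y)])  # [0.3333, 0.3333]
print([round(t, 4) for t in kendall_tau(x, z)])  # [-0.5, -0.5477]
```

    As with Spearman's ρ, the two variants coincide when there are no ties and differ only for inputs such as z.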

    Gary Strangman's library in SciPy gives Kendall's t_b, which has the standard tie correction included (and it calculates the two-tailed p-value):-

    >>> import scipy
    >>> x = [5.05, 6.75, 3.21, 2.66]
    >>> y = [1.65, 26.5, -5.93, 7.96]
    >>> z = [1.65, 2.64, 2.64, 6.95]
    >>> print scipy.stats.stats.kendalltau(x, y)[0]
    0.333333333333

    >>> print scipy.stats.stats.kendalltau(x, z)[0]
    -0.547722557505

    Michiel de Hoon's library in BioPython is faster according to my tests (using large lists with multiple ties), and also gives Kendall's t_b (standard tie correction included):-

    >>> import Bio.Cluster
    >>> x = [5.05, 6.75, 3.21, 2.66]
    >>> y = [1.65, 26.5, -5.93, 7.96]
    >>> z = [1.65, 2.64, 2.64, 6.95]
    >>> print 1 - Bio.Cluster.distancematrix((x,y), dist="k")[1][0]
    0.333333333333

    >>> print 1 - Bio.Cluster.distancematrix((x,z), dist="k")[1][0]
    -0.547722557505

    We can also access R's Kendall correlation from within Python. Again, this returns Kendall's t_b (standard tie correction included):-

    >>> import rpy
    >>> x = [5.05, 6.75, 3.21, 2.66]
    >>> y = [1.65, 26.5, -5.93, 7.96]
    >>> z = [1.65, 2.64, 2.64, 6.95]
    >>> print rpy.r.cor(x, y, method="kendall")
    0.333333333333

    >>> print rpy.r.cor(x, z, method="kendall")
    -0.547722557505

    The version in R would be simply:

    > x <- c(5.05, 6.75, 3.21, 2.66)
    > y <- c(1.65, 26.5, -5.93, 7.96)
    > z <- c(1.65, 2.64, 2.64, 6.95)
    > cor(x, y, method="kendall")
    [1] 0.3333333

    > cor(x, z, method="kendall")
    [1] -0.5477226

    • Python
    • The R Project
    • SciPy (for Gary Strangman's stats.py)
    • RPy (R from Python)
    • Biopython (for Michiel de Hoon's PyCluster)

    MOAC DTC, Coventry House, University of Warwick, Gibbet Hill Road, Coventry, CV4 7AL
    Tel. 024 765 75808 moac2 at warwick dot ac dot uk

    Page contact: Peter Cock. Last revised: Tue 6 Jul 2010.