# Using Python (and R) to calculate Rank Correlations

You might also be interested in my pages on doing Linear Regressions with Python and/or R.

## Ranking data

Rank correlations are calculated on the ranks of the data rather than on the raw values. This makes them far less sensitive to outliers, because an extreme value contributes only its rank and not its magnitude.

For example, take two sets of data, say x = [5.05, 6.75, 3.21, 2.66] and y = [1.65, 26.5, -5.93, 7.96]. Ordering each set numerically, from smallest to largest, gives the ranks [3, 4, 2, 1] and [2, 4, 1, 3] respectively.

Tied ranks are usually assigned using the midrank method, whereby those entries receive the mean of the ranks they would have received had they not been tied. Thus z = [1.65, 2.64, 2.64, 6.95] would yield ranks [1, 2.5, 2.5, 4] using the midrank method.

We can do this in Python using Gary Strangman's rankdata function from the stats library in SciPy:-

>>> import scipy
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> print scipy.stats.stats.rankdata(x)
[ 3. 4. 2. 1.]

>>> y = [1.65, 26.5, -5.93, 7.96]
>>> print scipy.stats.stats.rankdata(y)
[ 2. 4. 1. 3.]

>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print scipy.stats.stats.rankdata(z)
[ 1. 2.5 2.5 4.]

This functionality is built into the R language:-

> x <- c(5.05, 6.75, 3.21, 2.66)
> rank(x)
[1] 3 4 2 1

> y <- c(1.65, 26.5, -5.93, 7.96)
> rank(y)
[1] 2 4 1 3

> z <- c(1.65, 2.64, 2.64, 6.95)
> rank(z)
[1] 1.0 2.5 2.5 4.0
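
For readers who want to see the midrank rule spelled out, here is a minimal sketch in plain Python 3. The function name `midrank` is made up for this illustration and is not part of SciPy or R; it is meant only to mirror the description above, not to replace the library routines.

```python
def midrank(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    # Indices of the values, sorted from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend 'end' over the run of values equal to the one at 'start'.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end would have had ranks start+1..end+1.
        mean_rank = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


print(midrank([5.05, 6.75, 3.21, 2.66]))   # [3.0, 4.0, 2.0, 1.0]
print(midrank([1.65, 26.5, -5.93, 7.96]))  # [2.0, 4.0, 1.0, 3.0]
print(midrank([1.65, 2.64, 2.64, 6.95]))   # [1.0, 2.5, 2.5, 4.0]
```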

## Rank-based Correlations

The two main correlations used for comparing such ranked data are known as the Spearman Rank Correlation (Spearman's ρ or Spearman's Rho) and Kendall's Tau (τ).

Both have several variants (e.g. rs, rsa and rsb for Spearman's ρ) which deal with the situation of tied data in different ways.

Using Spearman's ρ as an example, there are no ties in x and y, thus rs(x,y) and rsb(x,y) are both 0.40 (2dp). However, z does have ties so rs(x,z) = -0.55 (2dp) (no tie correction) and does not equal rsb(x,z) = -0.63 (2dp) (with a tie correction).
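
These figures can be reproduced with a short plain Python 3 sketch. The helper names (`midrank`, `spearman_rs`, `spearman_rsb`) are hypothetical and chosen for this example. I am assuming here that rsb can be computed as the ordinary Pearson correlation of the midranks; that assumption does reproduce the values quoted above.

```python
from math import sqrt


def midrank(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rs(a, b):
    """rs = 1 - 6*sum(d^2)/(n^3 - n) on midranks, with no tie correction."""
    ra, rb = midrank(a), midrank(b)
    n = len(a)
    d2 = sum((p - q) ** 2 for p, q in zip(ra, rb))
    return 1 - 6 * d2 / (n ** 3 - n)


def spearman_rsb(a, b):
    """Pearson correlation of the midranks (assumed equivalent to rsb)."""
    ra, rb = midrank(a), midrank(b)
    n = len(a)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(ra, rb))
    var_a = sum((p - mean_a) ** 2 for p in ra)
    var_b = sum((q - mean_b) ** 2 for q in rb)
    return cov / sqrt(var_a * var_b)


x = [5.05, 6.75, 3.21, 2.66]
y = [1.65, 26.5, -5.93, 7.96]
z = [1.65, 2.64, 2.64, 6.95]
print(round(spearman_rs(x, y), 2), round(spearman_rsb(x, y), 2))  # 0.4 0.4
print(round(spearman_rs(x, z), 2), round(spearman_rsb(x, z), 2))  # -0.55 -0.63
```

Note that `spearman_rsb` divides by zero if every value in one list is identical, so a real implementation would need to guard against that case.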

The notation I am using is from the 5th edition (published 1990) of "Rank Correlation Methods", by Maurice Kendall and Jean Dickinson Gibbons (ISBN 0-85264-305-5, first published in 1948).

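To date, I have found two existing Python libraries with support for these correlations (Spearman and Kendall): Gary Strangman's stats.py, which is included in SciPy, and Michiel de Hoon's PyCluster, which is included in Biopython.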

I have also used the R language (for statistical computing and graphics) from within Python using the package RPy (R from Python) to calculate these rank correlations.

## Spearman's Rho

$r_s = 1 - \frac{6\sum d^2}{n^3 - n}$

Here d is the difference between the two ranks in each pair, and n is the number of pairs.
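
As a worked check of this formula, take the ranks of x and y given earlier, [3, 4, 2, 1] and [2, 4, 1, 3]. The rank differences are 1, 0, 1 and -2, so the sum of their squares is 6, and with n = 4 pairs:

$r_s = 1 - \frac{6 \times 6}{4^3 - 4} = 1 - \frac{36}{60} = 0.4$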

[Insert formula for rs, rsa and rsb here]

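In the version I tested, Gary Strangman's library in SciPy gives rs, which has NO TIE CORRECTION included (it also calculates the two-tailed p-value). Later SciPy releases compute the Pearson correlation of the ranks instead, which does account for ties, so the second result below depends on the SciPy version:-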

>>> import scipy
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print scipy.stats.stats.spearmanr(x, y)[0]
0.4

>>> print scipy.stats.stats.spearmanr(x, z)[0]
-0.55

On the other hand, Michiel de Hoon's library (available in Biopython or standalone as PyCluster) returns Spearman rsb, which does include a tie correction:-

>>> import Bio.Cluster
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print 1 - Bio.Cluster.distancematrix((x,y), dist="s")[1][0]
0.4

>>> print 1 - Bio.Cluster.distancematrix((x,z), dist="s")[1][0]
-0.632455532034

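The distancematrix function takes a "matrix" and returns the distance between every pair of rows (in this case, just x and y). The distance is one minus the correlation, hence the subtraction from 1 above. This information could be stored as a full symmetric matrix (with zeroes on the diagonal), but for efficiency it isn't, which is why the result is indexed as [1][0] - see help(Bio.Cluster.distancematrix) for more information.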
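
To illustrate the storage idea, here is a small plain Python 3 sketch. It does not call Bio.Cluster. It assumes the layout is a lower triangle in which row i holds the distances to rows 0 to i-1, as the [1][0] indexing above suggests. The function names and the Euclidean distance are hypothetical choices made only for this example.

```python
from math import sqrt


def euclidean(a, b):
    """Toy distance used only for this illustration."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def lower_triangle(rows, distance):
    """Row i of the result holds the distances from rows[i] to rows[0..i-1]."""
    return [[distance(rows[i], rows[j]) for j in range(i)]
            for i in range(len(rows))]


def full_matrix(triangle):
    """Expand the ragged lower triangle into a symmetric square matrix."""
    n = len(triangle)
    full = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            full[i][j] = full[j][i] = triangle[i][j]
    return full


rows = [(0, 0), (3, 4), (6, 8)]
tri = lower_triangle(rows, euclidean)
print(tri)        # [[], [5.0], [10.0, 5.0]]
print(tri[1][0])  # 5.0, the distance between rows 1 and 0
print(full_matrix(tri))
```

The ragged form stores n(n-1)/2 numbers instead of n squared, which is the saving referred to above.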

We can also access R's Spearman correlation from within Python using RPy. Like the Biopython version, this gives Spearman rsb, which does include a tie correction:-

>>> import rpy
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print rpy.r.cor(x, y, method="spearman")
0.4

>>> print rpy.r.cor(x, z, method="spearman")
-0.632455532034

This could be done in R as follows:

> x <- c(5.05, 6.75, 3.21, 2.66)
> y <- c(1.65, 26.5, -5.93, 7.96)
> z <- c(1.65, 2.64, 2.64, 6.95)
> cor(x, y, method="spearman")
[1] 0.4

> cor(x, z, method="spearman")
[1] -0.6324555

## Kendall's Tau

[Insert formula for ta and tb here]

Gary Strangman's library in SciPy gives Kendall's tb which has the standard tie correction included (and it calculates the two-tailed p-value):-

>>> import scipy
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print scipy.stats.stats.kendalltau(x, y)[0]
0.333333333333

>>> print scipy.stats.stats.kendalltau(x, z)[0]
-0.547722557505

Michiel de Hoon's library in Biopython is faster according to my tests (using large lists with multiple ties), and also gives Kendall's tb (standard tie correction included):-

>>> import Bio.Cluster
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print 1 - Bio.Cluster.distancematrix((x,y), dist="k")[1][0]
0.333333333333

>>> print 1 - Bio.Cluster.distancematrix((x,z), dist="k")[1][0]
-0.547722557505

We can also access R's Kendall correlation from within Python using RPy. Again, this returns Kendall's tb (standard tie correction included):-

>>> import rpy
>>> x = [5.05, 6.75, 3.21, 2.66]
>>> y = [1.65, 26.5, -5.93, 7.96]
>>> z = [1.65, 2.64, 2.64, 6.95]
>>> print rpy.r.cor(x, y, method="kendall")
0.333333333333

>>> print rpy.r.cor(x, z, method="kendall")
-0.547722557505

The version in R would be simply:

> x <- c(5.05, 6.75, 3.21, 2.66)
> y <- c(1.65, 26.5, -5.93, 7.96)
> z <- c(1.65, 2.64, 2.64, 6.95)
> cor(x, y, method="kendall")
[1] 0.3333333

> cor(x, z, method="kendall")
[1] -0.5477226
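
The values above can be cross-checked in plain Python 3 by counting concordant and discordant pairs directly. This is only a sketch: the function name `kendall_tau` is my own, it uses the usual textbook definitions of ta (no tie correction) and tb (tie correction), and it is a simple quadratic-time loop rather than anything the libraries above are known to use.

```python
from math import sqrt


def kendall_tau(a, b):
    """Return (ta, tb) by comparing every pair of observations."""
    n = len(a)
    concordant = discordant = ties_a = ties_b = 0
    for i in range(n):
        for j in range(i + 1, n):
            da = a[i] - a[j]
            db = b[i] - b[j]
            if da == 0:
                ties_a += 1      # pair tied in the first list
            if db == 0:
                ties_b += 1      # pair tied in the second list
            if da * db > 0:
                concordant += 1  # both lists order the pair the same way
            elif da * db < 0:
                discordant += 1  # the lists order the pair in opposite ways
    pairs = n * (n - 1) // 2
    ta = (concordant - discordant) / pairs
    tb = (concordant - discordant) / sqrt((pairs - ties_a) * (pairs - ties_b))
    return ta, tb


x = [5.05, 6.75, 3.21, 2.66]
y = [1.65, 26.5, -5.93, 7.96]
z = [1.65, 2.64, 2.64, 6.95]
print([round(t, 4) for t in kendall_tau(x, y)])  # [0.3333, 0.3333]
print([round(t, 4) for t in kendall_tau(x, z)])  # [-0.5, -0.5477]
```

With no ties the two variants agree, as for x and y. The tie in z shrinks the denominator of tb, which is why it differs from ta for x and z. The sketch divides by zero if every pair in one list is tied.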

SciPy: for Gary Strangman's stats.py

Biopython: for Michiel de Hoon's PyCluster