Comparison of Perl and C++ efficiency

Summary

I have done five speed comparisons of equivalent Perl and C++ code for randomising strings. The speed improvements shown by using C++ varied between a factor of 1.3 and 150, depending on the test code and the length of the string.

Mind you, if you want your Perl to be really slow, use MooseX and signature checking, when you make it 5 times slower again:
http://blogs.perl.org/users/aevar_arnfjor_bjarmason/2010/02/moosexmethodsignatures-is-really-slow.html

In some ways it is suprising that it is as little as this, as stepping through the following method call

my $ortholog_map = $ortholog_mapping_maker->create_ortholog_mappings($genome_a, $genome_b)

takes you through over 200 lines of MooseX code that validates and verifies the parameters, and this is typical of what happens on every single method call. This is probably not a problem if you have lots of processing power to throw at the problem or are doing anything that involves lots of data, such as processing DNA sequence data.

Introduction

I have been concerned about potential speed penalties of writing processor intensive code in Perl, so I have compared equivalent implementations of the same algorithm in Perl and C++. The equivalence of the code can be seen in part in that the C++ code reproduces the bug in the original Perl. I have used std::vectors and std::maps to be as equivalent as possible. In many cases a more efficient implementation could have been achieved with ‘char *’ and ‘int *’ vectors.

Method

The original Perl code was is a scrapbook example from the BioPerl website, which randomises a string.

http://www.bioperl.org/wiki/A_quick_string_randomizer

While not the most efficient way of randomising a string, it appears to be a good example of fairly typical Perl code, and representative of the sorting copying and moving of data associated with many gene sequence related operations.

I then modified the Perl code twice to improve the efficiency of the algorithm, and modified the C++ in each case so that it implemented the equivalent algorithm.

In each test the routine was used to sort a string of either 1000 or 500000 nucleotides, repeated sufficiently often for timings to be made using the second hand of a watch. Both tests were performed on my Dell laptop. The Perl was activePerl 5.10 and was executed from the command line. The C++ was compiled using the Visual C++ 2005 compiler in Release mode.

Results

	Sequence length	Perl	C++	Improvement
Original Algorithm	1000	26s	170ms	150
Improved algorithm	1000	4ms	0.4ms	10
	500000	70s	55s	1.3
Further improved algorithm	1000	3.2ms	0.2ms	16
	500000	3.8s	0.1s	38

Appendix: Source code

Original Perl code

sub perm {
my @a = split('',shift);
@i = (0..$#a);
my (@r, %i);
@i{@i} = @i;
map {
push @r, delete $i{ int($_*rand())};
@i{(0..scalar(values %i)-1)} = sort {$a<=>$b} values %i;
} reverse @i;
return join('',@a[@r]);
}

'IMPROVED' PERL CODE

sub perm2 {
my @a = split('',shift);
my @i = (0..$#a);
my @j = @i;
my @r;
map {
push @r, splice(@i, rand($_+1),1);
} reverse @j;
return join('',@a[@r]);
}

'FURTHER IMPROVED' PERL CODE

sub perm3 {
my @a = split('',shift);
my @i = (0..$#a);
my @r;
for ($j = $#a; $j >= 0; $j--){
my $entry = rand($j+1);
push @r, $i[$entry];
$i[$entry]=$i[$j];
};
return join('',@a[@r]);
}

C++ equivalent of original PERL

I have rewritten this in C++, keeping as close as possible to the original algorithm in order that this is a fair comparison. I have not created methods or subroutines to match many of the bundled functions provided by C++, so there is quite a lot of longwinded code to reproduce some short sections of Perl.

std::string perm(std::string seq)
{
nbsp; int size = seq.size();
// my @i = (0..$#a);
intvec i(size);
for (int j = 0;j < size;j++)
i[j] = j;
// my (@r, %i);
// imap used instead of %i as C++ does not allow two
// identically named variables.
intvec r;
intmap imap;
// @i {@i} = @i;
for (int j = 0;j < size;j++)
imap[j] = j;
// map {} reverse @i;
// map simply iterates through a set of values
for (int j = size-1;j >=0; j--)
{
// int($_*rand())
int entry = (j * rand())/RAND_MAX;
int v = imap[entry];
//delete $i {...}
imap.erase(entry);
// push @r ...
r.push_back(v);
// values %i
// This returns a vector of values
// which we do explicitly in C++
intvec y(0);
for (intmap::const_iterator im = imap.begin(); im != imap.end();im++)
y.push_back((*im).second);
// sort {$a<=>$b}
sort(y.begin(),y.end());
//@i {(0..scalar(values %i)-1)}
//Place the sorted values into the original array
for (int j = 0;j < imap.size()-1; j++)
imap[j] = y[j];
}

//Use the randomised vector to reorder the string
std::string res;
for (int j = 0;j < r.size();j++)
res.push_back(seq[r[j]]);
return res;
}

The equivalent C++ for the improved algorithm is very similar, and uses the 'erase' method to remove entries in the vector.