Calling Perl from C++

Perl provides a powerful way of processing and accessing Bioinformatics data, but there are times when the inefficiency of Perl code means that data processing needs to be written in C or C++ if the results are to be available within the time-frame of a PhD.

Perl provides an API that allows Perl methods and functions to be called from within a C++ program. It is documented at:

http://perldoc.perl.org/perlembed.html

I have written a short program that uses these principles to access data from the Ensembl database through its Perl API, so that the data can be used within a C++ program.

Source, Visual C++ 2005 project file, and Jamfile for Boost.bjam build

The headers

Perl provides header files declaring the functions needed to embed it. When building, the directory containing these files should be added to the include search path (e.g. /System/Library/Perl/5.8.6/darwin-thread-multi-2level/CORE on XServe).

#include <EXTERN.h> /* from the Perl distribution */
#include <perl.h> /* from the Perl distribution */

When linking, the associated library file (/System/Library/Perl/5.8.6/darwin-thread-multi-2level/CORE/libperl.dylib) will need to be linked in.

Initialising Linked modules

When running Perl normally, the Perl environment will initialise modules as required. This does not happen automatically when Perl is embedded, so has to be done by the calling program. If this is not done then an error message such as:

D:\Dev\Test\PerlTest\Debug>EmbeddedPerlTest.exe
Can't load module DBI, dynamic loading not available in this perl.
(You may need to build a new perl executable which either supports
dynamic loading or has the DBI module statically linked into it.)

will appear. The following code provides the initialisation required. Initialising DynaLoader should ensure that all required modules are initialised. The commented out boot_DBI code is an example of how specific modules can be initialised.

EXTERN_C void boot_DynaLoader (pTHX_ CV* cv);
//EXTERN_C void boot_DBI (pTHX_ CV* cv);
EXTERN_C void xs_init(pTHX)
{
char *file = __FILE__;
/* DynaLoader is a special case */
newXS("DynaLoader::boot_DynaLoader", boot_DynaLoader, file);
//newXS("DBI::bootstrap", boot_DBI, file);
}

When linking, the library containing the initialising code (/System/Library/Perl/5.8.6/darwin-thread-multi-2level/auto/DynaLoader/DynaLoader.a on XServe) will need to be linked in.

Initialising the Embedded Perl

The following code creates, initialises and runs a single instance of the Perl interpreter. The body of this function continues through the sections that follow.

static PerlInterpreter *my_perl;

int _tmain(int argc, char ** argv,char ** env)
{
STRLEN str_length;
char *embedding[] = { "", "-e", "0" };
PERL_SYS_INIT3(&argc,&argv,&env);
my_perl = perl_alloc();
perl_construct( my_perl );
perl_parse(my_perl, xs_init, 3, embedding, NULL);
PL_exit_flags |= PERL_EXIT_DESTRUCT_END;
perl_run(my_perl);

Ensuring internal data is freed

The Perl documentation describes in detail what needs to be done to ensure that memory and local variables allocated within Perl are freed. This would need to be studied carefully if these techniques were to be used extensively, e.g. in a server application. The simplest technique is to wrap the code in "ENTER;SAVETMPS;" and "FREETMPS;LEAVE;". The opening pair is shown here; the closing pair appears in the final section.

ENTER;
SAVETMPS;
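As an aside, the pairing of "ENTER;SAVETMPS;" with "FREETMPS;LEAVE;" could be tied to a C++ scope so that the closing half is not forgotten on an early return. The following is only a sketch of that idea and is not part of the original program: ScopeGuard and demo_guard_order are hypothetical names, and the demonstration records events in a string rather than calling Perl. In the embedded program the two callbacks would presumably be lambdas containing the Perl macros.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical helper: runs on_enter when constructed and on_exit when
// destroyed, so the closing half of a paired operation always executes.
class ScopeGuard {
public:
    ScopeGuard(const std::function<void()>& on_enter,
               std::function<void()> on_exit)
        : on_exit_(std::move(on_exit)) {
        assert(on_exit_ != nullptr);  // an empty exit callback would throw later
        on_enter();
    }
    ~ScopeGuard() { on_exit_(); }

    // Copying would run the exit callback twice, so it is disabled.
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    std::function<void()> on_exit_;
};

// Stand-in demonstration that logs the order of events instead of using Perl.
// With the Perl headers available, the callbacks would be (assumption):
//   [] { ENTER; SAVETMPS; }   and   [] { FREETMPS; LEAVE; }
std::string demo_guard_order() {
    std::string log;
    {
        ScopeGuard guard([&log] { log += "enter;"; },
                         [&log] { log += "leave;"; });
        log += "work;";  // the eval_pv calls would go here
    }  // guard is destroyed here and the exit callback runs
    return log;
}
```

If this approach were used, the guard would need to live in an inner block that ends before perl_destruct is called, since its destructor must run while the interpreter still exists.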

Performing a one-off command

This is done by creating a string containing the command. The following connects to EnsEMBL and retrieves the DNA sequence of a 1,001-base region (positions 1,000,000 to 1,001,000) of human chromosome 20.

const char * init =
"use Bio::EnsEMBL::Registry;"
"$registry = 'Bio::EnsEMBL::Registry';"

"$registry->load_registry_from_db("
"-host => 'ensembldb.ensembl.org',"
"-user => 'anonymous');"

"$slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );"
"$slice = $slice_adaptor->fetch_by_region( 'chromosome', '20', 1e6, 1e6 + 1000 );"
"$a = $slice -> seq();";

After the string has been created it is evaluated with eval_pv, and the values of any variables are retrieved by passing the result of get_sv() to the SvPV macro.

eval_pv(init, TRUE);
const char * seq = SvPV(get_sv("a", FALSE), str_length);
printf("a = %s\n", seq);
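The init string above relies on the compiler joining adjacent string literals, which inserts nothing between the fragments. The sketch below illustrates that behaviour and shows a C++11 raw string literal as a possible alternative. It assumes a compiler newer than Visual C++ 2005, and the variable and function names are illustrative only.

```cpp
#include <cassert>
#include <string>

// Adjacent literals are concatenated with no separator, so every fragment
// must end in ';' or whitespace. A Perl '#' comment in any fragment would
// swallow the rest of the script, because there are no newlines.
const char* const concatenated_script =
    "$x = 1;"
    "$y = $x + 1;";

// A raw string literal keeps its newlines, so the Perl code can be laid out
// normally and may contain '#' comments and double quotes without escaping.
const char* const raw_script = R"PERL(
    $x = 1;        # this comment ends at the newline
    $y = $x + 1;
)PERL";

// True if the given script contains at least one newline character.
bool has_newline(const std::string& script) {
    return script.find('\n') != std::string::npos;
}

// Asserts the two properties described in the comments above.
void check_script_literals() {
    assert(std::string(concatenated_script) == "$x = 1;$y = $x + 1;");
    assert(!has_newline(concatenated_script));
    assert(has_newline(raw_script));
}
```

Either form is a const char * and could be passed to eval_pv in the same way as init.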

Performing subsequent commands

Variables previously created (e.g. $slice_adaptor) can then be reused in subsequent Perl commands, in this case to get a different region of the same chromosome. Note that in order to be persistent, the original variable must be declared without using 'my'. If 'my' is used then the variable is local to that 'eval' command and is not available in subsequent 'eval's.

const char * command = "$b = $slice_adaptor->fetch_by_region( 'chromosome', '20', 2e6, 2e6 + 10 ) -> seq();";
eval_pv(command, TRUE);
printf("sequence b = %s\n", SvPV(get_sv("b", FALSE), str_length));
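Because each command is just a string, the region to fetch can be chosen at run time by building the string in C++. The following is a minimal sketch of that idea. make_region_command is a hypothetical helper, not part of the original program, and it assumes that $slice_adaptor has already been created by an earlier eval_pv call as described above.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: builds a Perl statement that stores the sequence of a
// chromosome region in the global Perl variable named result_var (given
// without the leading '$').
std::string make_region_command(const std::string& result_var,
                                const std::string& chromosome,
                                long start, long end) {
    assert(start >= 1 && start <= end);
    // Reject quotes and backslashes so the chromosome name cannot break out
    // of the single-quoted Perl literal it is placed in.
    assert(chromosome.find_first_of("'\\") == std::string::npos);

    std::ostringstream command;
    command << "$" << result_var
            << " = $slice_adaptor->fetch_by_region( 'chromosome', '"
            << chromosome << "', " << start << ", " << end << " )->seq();";
    return command.str();
}
```

The result could then be evaluated with, for example, eval_pv(make_region_command("b", "20", 2000000, 2000010).c_str(), TRUE); and read back with get_sv("b", FALSE) as before.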

The following command gets a DNA slice:

command = "$slice = $slice_adaptor->fetch_by_region( 'chromosome', '20', 3e6, 3e6 + 10 );";
eval_pv(command, TRUE);

Its sequence can then be output. Note the use of the str_length parameter to obtain the length of the returned string. SvPV is called in a separate statement so that str_length has been set before printf reads it:

eval_pv("$a = $slice -> seq();", TRUE);
seq = SvPV(get_sv("a", FALSE), str_length);
printf("sequence = %s, length = %lu\n", seq, (unsigned long)str_length);

The start and end coordinates can be retrieved in the same way. Note the use of SvIV for obtaining integer values, and the way in which multiple values can be returned by assigning each to its own variable.

eval_pv("$a = $slice -> start();$b = $slice -> end();", TRUE);
printf("start = %ld, finish = %ld\n", (long)SvIV(get_sv("a", FALSE)), (long)SvIV(get_sv("b", FALSE)));

Finishing off

And finally the code for clearing up the Perl interpreter.

FREETMPS;
LEAVE;
perl_destruct(my_perl);
perl_free(my_perl);
PERL_SYS_TERM();
return 0;
}