Using RPS-BLAST with Biopython

This page is a work in progress!

This page introduces BLAST and RPS-BLAST then how to:

Build a small RPS-BLAST database
Run RPS-BLAST at the command line
Parse RPS-BLAST's XML output with Biopython 1.43 or later
Call RPS-BLAST and analyze the output from within Biopython

This should all work on Windows, Linux and Mac OS X, although you may need to adjust path or file names accordingly.

What is BLAST?

The NCBI's Basic Local Alignment Search Tool (BLAST) is a suite of programs used to find regions of local similarity between biological sequences. The program usually compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.

One of the tool's many variants is RPS-BLAST (reversed position specific blast). In this case you submit one or more query protein sequences for comparison against a database of Hidden Markov Models (HMM) representing many different protein domains or families. The NCBI have a web version, but you can also install the BLAST programs on your own computer.

In this article I'll go quickly go through installing standalone BLAST, building an RPS-BLAST database, and then using the rpsblast program with Biopython to search bacterial genomes for particular protein motifs.

Installing Standalone Blast

On their downloads page you can get standalone BLAST for all the major operating systems including Windows and Linux.

Once you have the program installed, you also need a database or two. For "normal" BLAST you can download blast sequence databases or make your own using the supplied formatdb program.

The NCBI also make available ready made RPS-BLAST databases for PFAM, SMART, COG, KOG and their own meta-domain database, CDD. You can download these rpsblast databases ready to use on your own machine.

Preparing a reduced RPS-BLAST database

One nice trick if you are running RPS-BLAST on your own machine is you can build a sub-database containing only selected protein domains. This has two major advantages: you need less RAM, and it will run much faster - which means even my old laptop computer can manage!

To make your own sub-database, you'll first need to download the all the raw HMM models as position specific scoring matrix (PSSM) text files in this archive cdd.tar.gz - while the archive is only 250MB, be warned that it unzips to about 2GB! If you are using Windows, then WinZip will cope with this file (eventually).

The idea is to select particular domains of interest, get their *.smp files from the archive, and then use the formatrpsdb program to build a little database the rpsblast program can use.

For example, suppose you are interested in searching for sigma factors, and had decided that you only want to look for the following PFAM domains, typically found in Sigma 70 proteins:

You would start by making a simple text file called Sigma.pn containing the names of the associated PSSM files (the order doesn't matter):

pfam00140.smp pfam03979.smp pfam04539.smp pfam04542.smp pfam04545.smp pfam04546.smp

Next, extract just these *.smp files from the large archive (cdd.tar.gz).

Then run the formatrpsdb tool to build a database:

formatrpsdb -t Sigma.v001 -i Sigma.pn -o T -f 9.82 -n Sigma -S 100.0

The above assumed you are in the same directory as the Sigma.pn file and all the *.smp files. You may need to provide an explicit path for the formatrpsdb program, e.g. C:\Blast\bin\formatrpsdb.exe

These options are based on the examples in the README file. What this does is creates the eight files: Sigma.aux, Sigma.loo, Sigma.phr, Sigma.pin, Sigma.psd, Sigma.psi, Sigma.psq and Sigma.rps which together make up the database.

It will also create (or add to) the file formatrpsdb.log, why not open this and check what it says. For example confirm it wrote the expected number of models in the database.

Running RPS-BLAST at the command line

We're going to use standalone BLAST, but before that for an idea of what the results should look like, goto the online RPS-BLAST and enter the search term "39546362" (the GI number for rpoD, a sigma factor in Salmonella typhimurium) and pick the PFAM database. You should get strong hits to the six PFAM domains listed above.

Now we'll run the same search with standalone blast. You'll need to prepare an input file in FASTA format - prepare a text file called rpoD.faa and paste in this:

>gi|39546362|ref|NP_462126.2| rpoD from Salmonella typhimurium MEQNPQSQLKLLVTRGKEQGYLTYAEVNDHLPEDIVDSDQIEDIIQMINDMGIQVMEEAPDADDLLLAENTTST DEDAEEAAAQVLSSVESEIGRTTDPVRMYMREMGTVELLTREGEIDIAKRIEDGINQVQCSVAEYPEAITYLLE QYDRVEAEEARLSDLITGFVDPNAEEEMAPTATHVGSELSQEDLDDDEDEDEEDGDDDAADDDNSIDPELAREK FAELRAQYVVTRDTIKAKGRSHAAAQEEILKLSEVFKQFRLVPKQFDYLVNSMRVMMDRVRTQERLIMKLCVEQ CKMPKKNFITLFTGNETSETWFNAAIAMNKPWSEKLHDVAEEVQRCLQKLRQIEEETGLTIEQVKDINRRMSIG EAKARRAKKEMVEANLRLVISIAKKYTNRGLQFLDLIQEGNIGLMKAVDKFEYRRGYKFSTYATWWIRQAITRS IADQARTIRIPVHMIETINKLNRISRQMLQEMGREPTPEELAERMLMPEDKIRKVLKIAKEPISMETPIGDDED SHLGDFIEDTTLELPLDSATTESLRAATHDVLAGLTAREAKVLRMRFGIDMNTDHTLEEVGKQFDVTRERIRQI EAKALRKLRHPSRSEVLRSFLDD

You can now run rpsblast with this input file (-i rpoD.faa) using the Sigma factor database we prepared above (-d Sigma) and a harsh expectation cut-off (-e 0.00001) with the command:

rpsblast -i rpoD.faa -d Sigma -e 0.00001

This should print lots of text on screen which should agree pretty well with what the web version told you earlier. I have assumed in this example that all the files are in the current directory - you may need to provide explicit paths depending on where you have put them, e.g.

C:\Blast\bin\rpsblast.exe -i rpoD.faa -d c:\Blast\cdd\Sigma -e 0.00001

You might want to save the BLAST output as a plain text file, for example rpoD.txt, and to do that we can use the -o option to specify the output file:

rpsblast -i rpoD.faa -d Sigma -e 0.00001 -o rpoD.txt

And let's also save the output as an XML file called rpoD.xml (using output format 7), ready for the next step where we'll read this data in with Biopython:

rpsblast -i rpoD.faa -d Sigma -e 0.00001 -m 7 -o rpoD.xml

If instead of using the Sigma mini-database, you wanted to use the full PFAM database (see download links above - you'll need quite a lot of RAM), simply change the database argument:

rpsblast -i rpoD.faa -d Pfam -e 0.00001

For this particular gene, the results should be the same (baring slight variations in the estimated expectation p-values). However, it will take a lot longer because this time rpsblast is searching thousands of domain motifs instead of just the handful in our little database.

Next let's download an entire genomes worth of protein sequences, Salmonella typhimurium - FASTA file NC_003197.faa (1MB), and search all of them for Sigma 70 domains:

rpsblast -i NC_003197.faa -d Sigma -e 0.00001 -o NC_003197.txt

or:

rpsblast -i NC_003197.faa -d Sigma -e 0.00001 -m 7 -o NC_003197.xml

This will of course take longer than just searching one gene, but as the Sigma database is very small it still takes less than a minute on my laptop.

Analyzing RPS-BLAST output with Biopython

The BLAST tools' default output is a simple plain text format designed for people to read (see the files rpoD.txt and NC_003197.txt created above as an example). While you can write a computer program to read this in, the NCBI have a habit of tweaking the file format fairly often and 'breaking' things - in fact Biopython probably won't like these files.

Instead, we asked BLAST to produce XML output (see files rpoD.xml and NC_003197.xml created above) which is intended to be read by another program (such as Biopython). With a bit of practice you can actually read XML files by hand, but its not pleasant.

Lets start with the small example rpoD.xml - you can read in this file, and print out some of its information in Biopython like this:

from Bio.Blast import NCBIXML for record in NCBIXML.parse(open("rpoD.xml")) : print "QUERY: %s" % record.query for align in record.alignments : print " MATCH: %s..." % align.title[:60] for hsp in align.hsps : print " HSP, e=%f, from position %i to %i" \ % (hsp.expect, hsp.query_start, hsp.query_end) if hsp.align_length < 60 : print " Query: %s" % hsp.query print " Match: %s" % hsp.match print " Sbjct: %s" % hsp.sbjct else : print " Query: %s..." % hsp.query[:57] print " Match: %s..." % hsp.match[:57] print " Sbjct: %s..." % hsp.sbjct[:57] print "Done"

That should give you the following output - note there is only one result object, because the input file contained only one FASTA sequence:

QUERY: gi|39546362|ref|NP_462126.2| rpoD from Salmonella typhimurium
 MATCH: gnl|CDD|68130 pfam04546, Sigma70_ner, Sigma-70, non-essentia...
 HSP, e=0.000000, from position 139 to 350
  Query: YPEAITYLLEQYDRVEAEEARLSDLITGFVDPNAEE--------EMAPTATHVGSEL...
  Match: +P  + Y+LE+YDRVE EE RLSD+I+GF+DPNAEE          A   +    + ...
  Sbjct: FPGTVDYILEEYDRVETEEGRLSDIISGFIDPNAEEAEATHIGSAAAEEPSSEKDDA...
 MATCH: gnl|CDD|68124 pfam04539, Sigma70_r3, Sigma-70 region 3. Regi...
 HSP, e=0.000000, from position 455 to 537
  Query: PVHMIETINKLNRISRQMLQEMGREPTPEELAERMLMPEDKIRKVLKIAKEPISMET...
  Match: P H++E +NK+ R  R++ QE+GREPTPEE+AE + + E+K+R+VL+ A+EP+S++ ...
  Sbjct: PRHVVELLNKIKRAQRELEQELGREPTPEEIAEELGISEEKVREVLEAAREPVSLDL...
 MATCH: gnl|CDD|67588 pfam03979, Sigma70_r1_1, Sigma-70 factor, regi...
 HSP, e=0.000000, from position 1 to 82
  Query: MEQNPQSQLKLLVTRGKEQGYLTYAEVNDHLPEDIVDSDQIEDIIQMINDMGIQVME...
  Match: +E   Q+Q+K L+ +GKE+GYLTY E+ND LP D VDS+QI+DII M+ DMGI V+E...
  Sbjct: LEDLSQAQVKKLIEQGKERGYLTYDEINDALPPDDVDSEQIDDIISMLEDMGINVVE...
 MATCH: gnl|CDD|68127 pfam04542, Sigma70_r2, Sigma-70 region 2. Regi...
 HSP, e=0.000000, from position 381 to 451
  Query: MVEANLRLVISIAKKYTNRGLQFLDLIQEGNIGLMKAVDKFEYRRGYKFSTYATWWI...
  Match: +VE  L LV S+A++Y   G    DL+QEG +GL++AV++F+  RG  FST+    I...
  Sbjct: LVERYLPLVYSLARRYLGSGADAEDLVQEGFLGLLRAVERFDPERGVSFSTWLYTII...
 MATCH: gnl|CDD|68129 pfam04545, Sigma70_r4, Sigma-70, region 4. Reg...
 HSP, e=0.000000, from position 549 to 602
  Query: VLAGLTAREAKVLRMRFGIDMNTDHTLEEVGKQFDVTRERIRQIEAKALRKLRH
  Match: L  L  RE +VL +R+G       TLEE+G++  ++RER+RQIE +ALRKLR
  Sbjct: ALDSLPEREREVLVLRYG----EGLTLEEIGERLGISRERVRQIEKRALRKLRK
 MATCH: gnl|CDD|64025 pfam00140, Sigma70_r1_2, Sigma-70 factor, regi...
 HSP, e=0.000000, from position 97 to 128
  Query: TDPVRMYMREMGTVELLTREGEIDIAKRIEDG
  Match: DPVRMY+RE+G V LLT E E+++A+RIE G
  Sbjct: DDPVRMYLREIGRVPLLTAEEEVELARRIEKG
Done

Great. Now lets do the larger example, NC_003197.xml, but this time we'll print out less information on screen:

from Bio.Blast import NCBIXML
for record in NCBIXML.parse(open("NC_003197.xml")) :
    #We want to ignore any queries with no search results:
    if record.alignments :
        print "QUERY: %s..." % record.query[:60]
        for align in record.alignments :
            for hsp in align.hsps :
                print " %s HSP, e=%f, from position %i to %i" \
                % (align.hit_id, hsp.expect, hsp.query_start, hsp.query_end)
print "Done"

You should then get the following output, showing only eight Sigma70 related genes were found:

QUERY: gi|16765108|ref|NP_460723.1| response regulator [Salmonella ...
 gnl|CDD|68129 HSP, e=0.000001, from position 155 to 199
QUERY: gi|16765294|ref|NP_460909.1| flagellar biosynthesis sigma fa...
 gnl|CDD|68124 HSP, e=0.000000, from position 90 to 165
 gnl|CDD|68127 HSP, e=0.000000, from position 16 to 89
 gnl|CDD|68129 HSP, e=0.000000, from position 185 to 233
QUERY: gi|16765597|ref|NP_461212.1| response regulator [Salmonella ...
 gnl|CDD|68129 HSP, e=0.000000, from position 149 to 193
QUERY: gi|16765960|ref|NP_461575.1| RNA polymerase sigma-70 factor ...
 gnl|CDD|68127 HSP, e=0.000000, from position 25 to 92
 gnl|CDD|68129 HSP, e=0.000000, from position 135 to 181
QUERY: gi|16766230|ref|NP_461845.1| RNA polymerase sigma factor [Sa...
 gnl|CDD|68127 HSP, e=0.000000, from position 94 to 164
 gnl|CDD|68124 HSP, e=0.000000, from position 168 to 250
 gnl|CDD|68129 HSP, e=0.000000, from position 262 to 315
 gnl|CDD|64025 HSP, e=0.000000, from position 56 to 91
QUERY: gi|39546362|ref|NP_462126.2| RNA polymerase sigma factor [Sa...
 gnl|CDD|68130 HSP, e=0.000000, from position 139 to 350
 gnl|CDD|68124 HSP, e=0.000000, from position 455 to 537
 gnl|CDD|67588 HSP, e=0.000000, from position 1 to 82
 gnl|CDD|68127 HSP, e=0.000000, from position 381 to 451
 gnl|CDD|68129 HSP, e=0.000000, from position 549 to 602
 gnl|CDD|64025 HSP, e=0.000000, from position 97 to 128
QUERY: gi|16766854|ref|NP_462469.1| RNA polymerase sigma factor [Sa...
 gnl|CDD|68127 HSP, e=0.000000, from position 53 to 123
 gnl|CDD|68129 HSP, e=0.000000, from position 228 to 280
QUERY: gi|16766892|ref|NP_462507.1| putative transcriptional regula...
 gnl|CDD|68129 HSP, e=0.000007, from position 139 to 184
Done

Running RPS-BLAST from Biopython

Its great that we can read the BLAST XML output from within Biopython, but having to type cryptic command lines to run the BLAST program isn't very nice. Instead, we can get Biopython to do this for us too!

#Adjust the file locations to match your own: rpsblast_db = "C:\\Blast\\cdd\\Sigma" rpsblast_exe = "C:\\Blast\\bin\\rpsblast.exe" query_filename = "rpoD.faa" #query_filename = "NC_003197.faa" E_VALUE_THRESH = 0.00001 #Adjust the expectation cut-off here from Bio.Blast import NCBIStandalone output_handle, error_handle = NCBIStandalone.rpsblast(rpsblast_exe, \ rpsblast_db, query_filename, expectation=E_VALUE_THRESH)

The NCBIStandalone.rpsblast() function runs RPS-BLAST for us (using XML output by default), and captures its standard output (and any error output) as two handle objects. We can treat these like handles to files opened for reading, so we can re-use the code from before:

from Bio.Blast import NCBIXML
for record in NCBIXML.parse(output_handle) :
    #We want to ignore any queries with no search results:
    if record.alignments :
        print "QUERY: %s..." % record.query[:60]
        for align in record.alignments :
            for hsp in align.hsps :
                print " %s HSP, e=%f, from position %i to %i" \
                % (align.hit_id, hsp.expect, hsp.query_start, hsp.query_end)
                assert hsp.expect <= E_VALUE_THRESH
print "Done"

That should give you this:

QUERY: gi|39546362|ref|NP_462126.2| rpoD from Salmonella typhimurium
 gnl|CDD|68130 HSP, e=0.000000, from position 139 to 350
 gnl|CDD|68124 HSP, e=0.000000, from position 455 to 537
 gnl|CDD|68127 HSP, e=0.000000, from position 381 to 451
 gnl|CDD|68129 HSP, e=0.000000, from position 549 to 602
 gnl|CDD|64025 HSP, e=0.000000, from position 97 to 128
Done

You'll notice that in this example we don't ever save the BLAST's XML output as a file on disc. You might want to do this while debugging your code (e.g. use blast_out.read() to get all the data as a string, write this to a file, and then open the file). If anything fails, its often worth checking the for any error mesages printed by rpsblast using:

print error_handle.read()

So, searching all the genes in a Bactera for Sigma70 PFAM domains was quick and easy. You can of course go one step further, and download FASTA files for all the sequenced Bacteria, all.faa.tar.gz (nearly 400MB compressed), and then search them all!

Notes on versions

You will need BioPython 1.43 or later, but any recent version of python and the NCBI's standalone BLAST tools should work too.

I have tested this using:

NCBI Standalone BLAST 2.2.15, Biopython 1.43 and Python 2.3.5 on Windows XP
NCBI Standalone BLAST 2.2.14, Biopython 1.43 and Python 2.3.5 on Windows XP
NCBI Standalone BLAST 2.2.10, Biopython 1.43 and Python 2.4.3 on Ubuntu "Dapper Drake" Linux
(Note that this version of rpsblast has some 'issues' with multi-query XMl output)

Python

Biopython

NCBI BLAST