Skip to content Skip to navigation
University of Warwick
  • Study
  • |
  • Research
  • |
  • Business
  • |
  • Alumni
  • |
  • News
  • Text only
  • |
  • Sign in
  • Search MOAC
  • Search University of Warwick
  • Search for people at Warwick
  • Search Warwick Blogs
  • Search past exam papers
  • Search video
  • More…

    Molecular Organisation and Assembly in Cells

    • About the DTC
    • Research
    • People
    • Degrees
    • Study at MOAC
    • News & Events
    • MOAC Students »
    • Peter Cock »
    • Python Programming »
    • RPS-BLAST
    University of Warwick

    Using RPS-BLAST with Biopython

    This page is a work in progress!

    This page introduces BLAST and RPS-BLAST then how to:

    • Build a small RPS-BLAST database
    • Run RPS-BLAST at the command line
    • Parse RPS-BLAST's XML output with Biopython 1.43 or later
    • Call RPS-BLAST and analyze the output from within Biopython

    This should all work on Windows, Linux and Mac OS X, although you may need to adjust path or file names accordingly.

    What is BLAST?

    The NCBI's Basic Local Alignment Search Tool (BLAST) is a suite of programs used to find regions of local similarity between biological sequences. The program usually compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.

    One of the tool's many variants is RPS-BLAST (reversed position specific blast). In this case you submit one or more query protein sequences for comparison against a database of Hidden Markov Models (HMM) representing many different protein domains or families. The NCBI have a web version, but you can also install the BLAST programs on your own computer.

    In this article I'll go quickly go through installing standalone BLAST, building an RPS-BLAST database, and then using the rpsblast program with Biopython to search bacterial genomes for particular protein motifs.

    Installing Standalone Blast

    On their downloads page you can get standalone BLAST for all the major operating systems including Windows and Linux.

    Once you have the program installed, you also need a database or two. For "normal" BLAST you can download blast sequence databases or make your own using the supplied formatdb program.

    The NCBI also make available ready made RPS-BLAST databases for PFAM, SMART, COG, KOG and their own meta-domain database, CDD. You can download these rpsblast databases ready to use on your own machine.

    Preparing a reduced RPS-BLAST database

    One nice trick if you are running RPS-BLAST on your own machine is you can build a sub-database containing only selected protein domains. This has two major advantages: you need less RAM, and it will run much faster - which means even my old laptop computer can manage!

    To make your own sub-database, you'll first need to download the all the raw HMM models as position specific scoring matrix (PSSM) text files in this archive cdd.tar.gz - while the archive is only 250MB, be warned that it unzips to about 2GB! If you are using Windows, then WinZip will cope with this file (eventually).

    The idea is to select particular domains of interest, get their *.smp files from the archive, and then use the formatrpsdb program to build a little database the rpsblast program can use.

    For example, suppose you are interested in searching for sigma factors, and had decided that you only want to look for the following PFAM domains, typically found in Sigma 70 proteins:

    • pfam03979, Sigma70_r1_1, Sigma-70 region 1.1
    • pfam00140, Sigma70_r1_2, Sigma-70 region 1.2
    • pfam04546, Sigma70_ner, Sigma-70 non-essential region
    • pfam04542, Sigma70_r2, Sigma-70 region 2
    • pfam04539, Sigma70_r3, Sigma-70 region 3
    • pfam04545, Sigma70_r4, Sigma-70 region 4

    You would start by making a simple text file called Sigma.pn containing the names of the associated PSSM files (the order doesn't matter):

    pfam00140.smp
    pfam03979.smp
    pfam04539.smp
    pfam04542.smp
    pfam04545.smp
    pfam04546.smp

    Next, extract just these *.smp files from the large archive (cdd.tar.gz).

    Then run the formatrpsdb tool to build a database:

    formatrpsdb -t Sigma.v001 -i Sigma.pn -o T -f 9.82 -n Sigma -S 100.0

    The above assumed you are in the same directory as the Sigma.pn file and all the *.smp files. You may need to provide an explicit path for the formatrpsdb program, e.g. C:\Blast\bin\formatrpsdb.exe

    These options are based on the examples in the README file. What this does is creates the eight files: Sigma.aux, Sigma.loo, Sigma.phr, Sigma.pin, Sigma.psd, Sigma.psi, Sigma.psq and Sigma.rps which together make up the database.

    It will also create (or add to) the file formatrpsdb.log, why not open this and check what it says. For example confirm it wrote the expected number of models in the database.

    Running RPS-BLAST at the command line

    We're going to use standalone BLAST, but before that for an idea of what the results should look like, goto the online RPS-BLAST and enter the search term "39546362" (the GI number for rpoD, a sigma factor in Salmonella typhimurium) and pick the PFAM database. You should get strong hits to the six PFAM domains listed above.

    Now we'll run the same search with standalone blast. You'll need to prepare an input file in FASTA format - prepare a text file called rpoD.faa and paste in this:

    >gi|39546362|ref|NP_462126.2| rpoD from Salmonella typhimurium
    MEQNPQSQLKLLVTRGKEQGYLTYAEVNDHLPEDIVDSDQIEDIIQMINDMGIQVMEEAPDADDLLLAENTTST
    DEDAEEAAAQVLSSVESEIGRTTDPVRMYMREMGTVELLTREGEIDIAKRIEDGINQVQCSVAEYPEAITYLLE
    QYDRVEAEEARLSDLITGFVDPNAEEEMAPTATHVGSELSQEDLDDDEDEDEEDGDDDAADDDNSIDPELAREK
    FAELRAQYVVTRDTIKAKGRSHAAAQEEILKLSEVFKQFRLVPKQFDYLVNSMRVMMDRVRTQERLIMKLCVEQ
    CKMPKKNFITLFTGNETSETWFNAAIAMNKPWSEKLHDVAEEVQRCLQKLRQIEEETGLTIEQVKDINRRMSIG
    EAKARRAKKEMVEANLRLVISIAKKYTNRGLQFLDLIQEGNIGLMKAVDKFEYRRGYKFSTYATWWIRQAITRS
    IADQARTIRIPVHMIETINKLNRISRQMLQEMGREPTPEELAERMLMPEDKIRKVLKIAKEPISMETPIGDDED
    SHLGDFIEDTTLELPLDSATTESLRAATHDVLAGLTAREAKVLRMRFGIDMNTDHTLEEVGKQFDVTRERIRQI
    EAKALRKLRHPSRSEVLRSFLDD

    You can now run rpsblast with this input file (-i rpoD.faa) using the Sigma factor database we prepared above (-d Sigma) and a harsh expectation cut-off (-e 0.00001) with the command:

    rpsblast -i rpoD.faa -d Sigma -e 0.00001

    This should print lots of text on screen which should agree pretty well with what the web version told you earlier. I have assumed in this example that all the files are in the current directory - you may need to provide explicit paths depending on where you have put them, e.g.

    C:\Blast\bin\rpsblast.exe -i rpoD.faa -d c:\Blast\cdd\Sigma -e 0.00001

    You might want to save the BLAST output as a plain text file, for example rpoD.txt, and to do that we can use the -o option to specify the output file:

    rpsblast -i rpoD.faa -d Sigma -e 0.00001 -o rpoD.txt

    And let's also save the output as an XML file called rpoD.xml (using output format 7), ready for the next step where we'll read this data in with Biopython:

    rpsblast -i rpoD.faa -d Sigma -e 0.00001 -m 7 -o rpoD.xml

    If instead of using the Sigma mini-database, you wanted to use the full PFAM database (see download links above - you'll need quite a lot of RAM), simply change the database argument:

    rpsblast -i rpoD.faa -d Pfam -e 0.00001

    For this particular gene, the results should be the same (baring slight variations in the estimated expectation p-values). However, it will take a lot longer because this time rpsblast is searching thousands of domain motifs instead of just the handful in our little database.

    Next let's download an entire genomes worth of protein sequences, Salmonella typhimurium - FASTA file NC_003197.faa (1MB), and search all of them for Sigma 70 domains:

    rpsblast -i NC_003197.faa -d Sigma -e 0.00001 -o NC_003197.txt

    or:

    rpsblast -i NC_003197.faa -d Sigma -e 0.00001 -m 7 -o NC_003197.xml

    This will of course take longer than just searching one gene, but as the Sigma database is very small it still takes less than a minute on my laptop.

    Analyzing RPS-BLAST output with Biopython

    The BLAST tools' default output is a simple plain text format designed for people to read (see the files rpoD.txt and NC_003197.txt created above as an example). While you can write a computer program to read this in, the NCBI have a habit of tweaking the file format fairly often and 'breaking' things - in fact Biopython probably won't like these files.

    Instead, we asked BLAST to produce XML output (see files rpoD.xml and NC_003197.xml created above) which is intended to be read by another program (such as Biopython). With a bit of practice you can actually read XML files by hand, but its not pleasant.

    Lets start with the small example rpoD.xml - you can read in this file, and print out some of its information in Biopython like this:

    from Bio.Blast import NCBIXML
    for record in NCBIXML.parse(open("rpoD.xml")) :
    print "QUERY: %s" % record.query
    for align in record.alignments :
    print " MATCH: %s..." % align.title[:60]
    for hsp in align.hsps :
    print " HSP, e=%f, from position %i to %i" \
    % (hsp.expect, hsp.query_start, hsp.query_end)
    if hsp.align_length < 60 :
    print " Query: %s" % hsp.query
    print " Match: %s" % hsp.match
    print " Sbjct: %s" % hsp.sbjct
    else :
    print " Query: %s..." % hsp.query[:57]
    print " Match: %s..." % hsp.match[:57]
    print " Sbjct: %s..." % hsp.sbjct[:57]
    print "Done"

    That should give you the following output - note there is only one result object, because the input file contained only one FASTA sequence:

    QUERY: gi|39546362|ref|NP_462126.2| rpoD from Salmonella typhimurium
     MATCH: gnl|CDD|68130 pfam04546, Sigma70_ner, Sigma-70, non-essentia...
     HSP, e=0.000000, from position 139 to 350
      Query: YPEAITYLLEQYDRVEAEEARLSDLITGFVDPNAEE--------EMAPTATHVGSEL...
      Match: +P  + Y+LE+YDRVE EE RLSD+I+GF+DPNAEE          A   +    + ...
      Sbjct: FPGTVDYILEEYDRVETEEGRLSDIISGFIDPNAEEAEATHIGSAAAEEPSSEKDDA...
     MATCH: gnl|CDD|68124 pfam04539, Sigma70_r3, Sigma-70 region 3. Regi...
     HSP, e=0.000000, from position 455 to 537
      Query: PVHMIETINKLNRISRQMLQEMGREPTPEELAERMLMPEDKIRKVLKIAKEPISMET...
      Match: P H++E +NK+ R  R++ QE+GREPTPEE+AE + + E+K+R+VL+ A+EP+S++ ...
      Sbjct: PRHVVELLNKIKRAQRELEQELGREPTPEEIAEELGISEEKVREVLEAAREPVSLDL...
     MATCH: gnl|CDD|67588 pfam03979, Sigma70_r1_1, Sigma-70 factor, regi...
     HSP, e=0.000000, from position 1 to 82
      Query: MEQNPQSQLKLLVTRGKEQGYLTYAEVNDHLPEDIVDSDQIEDIIQMINDMGIQVME...
      Match: +E   Q+Q+K L+ +GKE+GYLTY E+ND LP D VDS+QI+DII M+ DMGI V+E...
      Sbjct: LEDLSQAQVKKLIEQGKERGYLTYDEINDALPPDDVDSEQIDDIISMLEDMGINVVE...
     MATCH: gnl|CDD|68127 pfam04542, Sigma70_r2, Sigma-70 region 2. Regi...
     HSP, e=0.000000, from position 381 to 451
      Query: MVEANLRLVISIAKKYTNRGLQFLDLIQEGNIGLMKAVDKFEYRRGYKFSTYATWWI...
      Match: +VE  L LV S+A++Y   G    DL+QEG +GL++AV++F+  RG  FST+    I...
      Sbjct: LVERYLPLVYSLARRYLGSGADAEDLVQEGFLGLLRAVERFDPERGVSFSTWLYTII...
     MATCH: gnl|CDD|68129 pfam04545, Sigma70_r4, Sigma-70, region 4. Reg...
     HSP, e=0.000000, from position 549 to 602
      Query: VLAGLTAREAKVLRMRFGIDMNTDHTLEEVGKQFDVTRERIRQIEAKALRKLRH
      Match: L  L  RE +VL +R+G       TLEE+G++  ++RER+RQIE +ALRKLR
      Sbjct: ALDSLPEREREVLVLRYG----EGLTLEEIGERLGISRERVRQIEKRALRKLRK
     MATCH: gnl|CDD|64025 pfam00140, Sigma70_r1_2, Sigma-70 factor, regi...
     HSP, e=0.000000, from position 97 to 128
      Query: TDPVRMYMREMGTVELLTREGEIDIAKRIEDG
      Match: DPVRMY+RE+G V LLT E E+++A+RIE G
      Sbjct: DDPVRMYLREIGRVPLLTAEEEVELARRIEKG
    Done

    Great. Now lets do the larger example, NC_003197.xml, but this time we'll print out less information on screen:

    from Bio.Blast import NCBIXML
    for record in NCBIXML.parse(open("NC_003197.xml")) :
        #We want to ignore any queries with no search results:
        if record.alignments :
            print "QUERY: %s..." % record.query[:60]
            for align in record.alignments :
                for hsp in align.hsps :
                    print " %s HSP, e=%f, from position %i to %i" \
                    % (align.hit_id, hsp.expect, hsp.query_start, hsp.query_end)
    print "Done"

    You should then get the following output, showing only eight Sigma70 related genes were found:

    QUERY: gi|16765108|ref|NP_460723.1| response regulator [Salmonella ...
     gnl|CDD|68129 HSP, e=0.000001, from position 155 to 199
    QUERY: gi|16765294|ref|NP_460909.1| flagellar biosynthesis sigma fa...
     gnl|CDD|68124 HSP, e=0.000000, from position 90 to 165
     gnl|CDD|68127 HSP, e=0.000000, from position 16 to 89
     gnl|CDD|68129 HSP, e=0.000000, from position 185 to 233
    QUERY: gi|16765597|ref|NP_461212.1| response regulator [Salmonella ...
     gnl|CDD|68129 HSP, e=0.000000, from position 149 to 193
    QUERY: gi|16765960|ref|NP_461575.1| RNA polymerase sigma-70 factor ...
     gnl|CDD|68127 HSP, e=0.000000, from position 25 to 92
     gnl|CDD|68129 HSP, e=0.000000, from position 135 to 181
    QUERY: gi|16766230|ref|NP_461845.1| RNA polymerase sigma factor [Sa...
     gnl|CDD|68127 HSP, e=0.000000, from position 94 to 164
     gnl|CDD|68124 HSP, e=0.000000, from position 168 to 250
     gnl|CDD|68129 HSP, e=0.000000, from position 262 to 315
     gnl|CDD|64025 HSP, e=0.000000, from position 56 to 91
    QUERY: gi|39546362|ref|NP_462126.2| RNA polymerase sigma factor [Sa...
     gnl|CDD|68130 HSP, e=0.000000, from position 139 to 350
     gnl|CDD|68124 HSP, e=0.000000, from position 455 to 537
     gnl|CDD|67588 HSP, e=0.000000, from position 1 to 82
     gnl|CDD|68127 HSP, e=0.000000, from position 381 to 451
     gnl|CDD|68129 HSP, e=0.000000, from position 549 to 602
     gnl|CDD|64025 HSP, e=0.000000, from position 97 to 128
    QUERY: gi|16766854|ref|NP_462469.1| RNA polymerase sigma factor [Sa...
     gnl|CDD|68127 HSP, e=0.000000, from position 53 to 123
     gnl|CDD|68129 HSP, e=0.000000, from position 228 to 280
    QUERY: gi|16766892|ref|NP_462507.1| putative transcriptional regula...
     gnl|CDD|68129 HSP, e=0.000007, from position 139 to 184
    Done

    Running RPS-BLAST from Biopython

    Its great that we can read the BLAST XML output from within Biopython, but having to type cryptic command lines to run the BLAST program isn't very nice. Instead, we can get Biopython to do this for us too!

    #Adjust the file locations to match your own:
    rpsblast_db = "C:\\Blast\\cdd\\Sigma"
    rpsblast_exe = "C:\\Blast\\bin\\rpsblast.exe"

    query_filename = "rpoD.faa"
    #query_filename = "NC_003197.faa"

    E_VALUE_THRESH = 0.00001 #Adjust the expectation cut-off here

    from Bio.Blast import NCBIStandalone
    output_handle, error_handle = NCBIStandalone.rpsblast(rpsblast_exe, \
    rpsblast_db, query_filename, expectation=E_VALUE_THRESH)

    The NCBIStandalone.rpsblast() function runs RPS-BLAST for us (using XML output by default), and captures its standard output (and any error output) as two handle objects. We can treat these like handles to files opened for reading, so we can re-use the code from before:

    from Bio.Blast import NCBIXML
    for record in NCBIXML.parse(output_handle) :
        #We want to ignore any queries with no search results:
        if record.alignments :
            print "QUERY: %s..." % record.query[:60]
            for align in record.alignments :
                for hsp in align.hsps :
                    print " %s HSP, e=%f, from position %i to %i" \
                    % (align.hit_id, hsp.expect, hsp.query_start, hsp.query_end)
                    assert hsp.expect <= E_VALUE_THRESH
    print "Done"

     

    That should give you this:

    QUERY: gi|39546362|ref|NP_462126.2| rpoD from Salmonella typhimurium
     gnl|CDD|68130 HSP, e=0.000000, from position 139 to 350
     gnl|CDD|68124 HSP, e=0.000000, from position 455 to 537
     gnl|CDD|68127 HSP, e=0.000000, from position 381 to 451
     gnl|CDD|68129 HSP, e=0.000000, from position 549 to 602
     gnl|CDD|64025 HSP, e=0.000000, from position 97 to 128
    Done

     

    You'll notice that in this example we don't ever save the BLAST's XML output as a file on disc. You might want to do this while debugging your code (e.g. use blast_out.read() to get all the data as a string, write this to a file, and then open the file). If anything fails, its often worth checking the for any error mesages printed by rpsblast using:

    print error_handle.read()

    So, searching all the genes in a Bactera for Sigma70 PFAM domains was quick and easy. You can of course go one step further, and download FASTA files for all the sequenced Bacteria, all.faa.tar.gz (nearly 400MB compressed), and then search them all!

    Notes on versions

    You will need BioPython 1.43 or later, but any recent version of python and the NCBI's standalone BLAST tools should work too.

    I have tested this using:

    • NCBI Standalone BLAST 2.2.15, Biopython 1.43 and Python 2.3.5 on Windows XP
    • NCBI Standalone BLAST 2.2.14, Biopython 1.43 and Python 2.3.5 on Windows XP
    • NCBI Standalone BLAST 2.2.10, Biopython 1.43 and Python 2.4.3 on Ubuntu "Dapper Drake" Linux
      (Note that this version of rpsblast has some 'issues' with multi-query XMl output)

    [Python logo]
    Python

     

    [Biopython logo]
    Biopython

     

    [NCBI logo]
    NCBI BLAST

    MOAC DTC, Coventry House, University of Warwick, Gibbet Hill Road, Coventry, CV4 7AL
    Tel. 024 765 75808 moac2 at warwick dot ac dot uk

    How to find us

    MOAC Intranet

    EPSRC logo

    Close this email form
    Page contact: Peter Cock Last revised: Thu 2 Aug 2007
    • Sign in
    • |
    • Powered by Sitebuilder
    • |
    • © MMXII
    • |
    • Privacy
    • |
    • Accessibility