Skip to content Skip to navigation
University of Warwick
  • Study
  • |
  • Research
  • |
  • Business
  • |
  • Alumni
  • |
  • News
  • Text only
  • |
  • Sign in
  • Search MOAC
  • Search University of Warwick
  • Search for people at Warwick
  • Search Warwick Blogs
  • Search past exam papers
  • Search video
  • More…

    Molecular Organisation and Assembly in Cells

    • About the DTC
    • Research
    • People
    • Degrees
    • Study at MOAC
    • News & Events
    • MOAC Students »
    • Peter Cock »
    • Python Programming »
    • GenBank to FASTA
    University of Warwick

    Converting GenBank files to FASTA format with Biopython

    This page follows on from dealing with GenBank files in BioPython and shows how to use the GenBank parser to convert a GenBank file into a FASTA format file. See also this example of dealing with Fasta Nucelotide files.

    As before, I'm going to use a small bacterial genome, Nanoarchaeum equitans Kin4-M (RefSeq NC_005213, GI:38349555, GenBank AE017199) which can be downloaded from the NCBI here:

    • NC_005213.gbk (1190 KB) - GenBank file
    • NC_005213.fna (487 KB) - FASTA Nucleic Acids - entire DNA nucleotide sequence as one record, see gbk -> fna
    • NC_005213.faa (198 KB) - FASTA Amino Acids - amino acid sequences for each gene, see gbk -> faa
    • NC_005213.ffn (487 KB) - FASTA Feature Nucleotides - nucleotide sequences for each gene, see gbk -> ffn

    As you can see, the NCBI provide this file in GenBank format, and preconverted into assorted FASTA formats. This document shows how to do this for yourself, which has the advantage that you can choose what to put into the FASTA title lines.

    Note that the NCBI's genomic GenBank files (like the one above) only contain a single large record. However, for maximum flexibility, the example code should cope with multi-record GenBank files.

    Record by Record : GenBank to FASTA Nucleotides (*.gbk to *.fna)

    Simple sequence file format between supported file formats is very easy using Bio.SeqIO - assuming you are happy with its default choices! This bit of code will record the full DNA nucleotide sequence for each record in the GenBank file as a fasta record:

    from Bio import SeqIO
    SeqIO.convert("NC_005213.gbk", "genbank", "NC_005213_converted.fna", "fasta")

    For comparison, in this next version (gbk_to_fna.py [Python Code]) we construct the FASTA file "by hand" giving full control:

    from Bio import SeqIO
    gbk_filename = "NC_005213.gbk"
    faa_filename = "NC_005213_converted.fna"
    input_handle  = open(gbk_filename, "r")
    output_handle = open(faa_filename, "w")
    
    for seq_record in SeqIO.parse(input_handle, "genbank") :
        print "Dealing with GenBank record %s" % seq_record.id
        output_handle.write(">%s %s\n%s\n" % (
               seq_record.id,
               seq_record.description,
               seq_record.seq))
    
    output_handle.close()
    input_handle.close()

    These examples should all give basically the same output, with a single FASTA record for each GenBank record. In this case NC_005213.gbk only has one very large record, so you only get one FASTA record. Note that I have added the line breaks shown here:

    >NC_005213 Nanoarchaeum equitans Kin4-M, complete genome.
    TCTCGCAGAGTTCTTTTTTGTATTAACAAACCCAAAACCCATAGAATTTAATGAACCCAAACCGCAATCGTACAAAAATT
    TGTAAAATTCTCTTTCTTCTTTGTCTAATTTTCTATAAACATTTAACTCTTTCCATAATGTGCCTATATATACTGCTTCC
    CCTCTGTTAATTCTTATTCTTATTGATACTGTTTTATAGAAAGTAAAACCTTCGAATATTGCTTCTTCAAAATAAAAGTT
    ...

    Gene by Gene : GenBank to FASTA Amino Acids (*.gbk to *.faa)

    As before, the suggested input file only has a single GenBank record (as a SeqRecord object), but the code below will cope with multiple GenBank records - each with multiple features (as SeqFeature objects). You can download this example too: gbk_to_faa.py [Python Code].

    from Bio import SeqIO
    gbk_filename = "NC_005213.gbk"
    faa_filename = "NC_005213_converted.faa"
    input_handle  = open(gbk_filename, "r")
    output_handle = open(faa_filename, "w")
    
    for seq_record in SeqIO.parse(input_handle, "genbank") :
        print "Dealing with GenBank record %s" % seq_record.id
        for seq_feature in seq_record.features :
            if seq_feature.type=="CDS" :
                assert len(seq_feature.qualifiers['translation'])==1
                output_handle.write(">%s from %s\n%s\n" % (
                       seq_feature.qualifiers['locus_tag'][0],
                       seq_record.name,
                       seq_feature.qualifiers['translation'][0]))
    
    output_handle.close()
    input_handle.close()

    This results in one FASTA record for each CDS feature in each GenBank record. I have added line breaks to the sequence below to display this better:

    >NEQ001 from NC_005213
    MRLLLELKALNSIDKKQLSNYLIQGFIYNILKNTEYSWLHNWKKEKYFNFTLIPKKDIIENKRYYLIISSPDKRFIEVLH
    NKIKDLDIITIGLAQFQLRKTKKFDPKLRFPWVTITPIVLREGKIVILKGDKYYKVFVKRLEELKKYNLIKKKEPILEEP
    IEISLNQIKDGWKIIDVKDRYYDFRNKSFSAFSNWLRDLKEQSLRKYNNFCGKNFYFEEAIFEGFTFYKTVSIRIRINRG
    EAVYIGTLWKELNVYRKLDKEEREFYKFLYDCGLGSLNSMGFGFVNTKKNSAR
    >NEQ003 from NC_005213
    MKKPQPYKDEEIYSILEEPVKQWFKEKYKTFTPPQRYAIMEIHKRNNVLISSPTGSGKTLAAFLAIINELIKLSHKGKLE
    NRVYAIYVSPLRSLNNDVKKNLETPLKEIKEKAKELNYYIGDIRIAVRTSDTKESEKAKMLKQPPHILITTPESLAIILS
    ...
    >NEQ550 from NC_005213
    MLELLAGFKQSILYVLAQFKKPEYATSYTIKLVNPFYYISDSLNVITSTKEDKVNYKVSLSDIAFDFPFKFPIVAIVEGK
    ANREFTFIIDRQNKKLSYDLKKGIIYIQDATIIPNGIKITVNGLAELKNIKINPNDPSITVQKVVGEQNTYIIKTSKDSV
    KITISADFVVKAEKWLFIQ

    Its easy enough to adapt the code if you wanted to output the protein names for example, or GI numbers.

    Note that not all GenBank CDS features include the translation. In a related example, for viruses, the mat_peptide (mature peptide) features don't include the protein sequence either. In such cases, you can first extract the nucleotide sequence (see below) and then translate it to get the amino acids.

    Gene by Gene : GenBank to FASTA Nucleotides (*.gbk to *.ffn)

    I've saved this one till last, because it was the hardest. However, as described in the preceding document, Biopython 1.53 adds a new extract method to the SeqFeature object. This means you don't have to deal with anything complicated like "joins" or "fuzzy" positions explicitly.

    I still need to find the time to sit down and expand this example to use the new method...

    [Python logo]
    Python

    [Biopython logo]
    Biopython

    [NCBI logo]
    NCBI

    MOAC DTC, Coventry House, University of Warwick, Gibbet Hill Road, Coventry, CV4 7AL
    Tel. 024 765 75808 moac2 at warwick dot ac dot uk

    How to find us

    MOAC Intranet

    EPSRC logo

    Close this email form
    Page contact: Peter Cock Last revised: Thu 6 Jan 2011
    • Sign in
    • |
    • Powered by Sitebuilder
    • |
    • © MMXII
    • |
    • Privacy
    • |
    • Accessibility