Skip to main content

Using FASTA nucleotide files in BioPython

I've added a simplified version of this example to the "Biopython Tutorial and Cookbook" which will be included as of Biopython 1.49 onwards (HTML, PDF).
This new version uses the Bio.SeqUtils.GC() function for the GC percentage calculation, and a python library called matplotlib (pylab) for plotting the graph.

This is an example python program to calculate GC percentages for each gene in an nucleotide FASTA file - using Bio.SeqIO, the new Biopython sequence input/output module I've been working on which is available in Biopython 1.43 onwards.

Why is the CG percentage interesting? In double stranded DNA, paired G and C nucleotides are joined by three hydrogen bonds, while the A and T have only two. This means that CG pairs are "stronger" than AT pairs, and thus the ratio of GC to AT for a DNA sequence has a marked effect on its physical properties (such as its "melting point"). Biologists will often use the terms "GC rich" or "AT rich" in this context.

Its well documented that different organisms have different GC percentages for their genomes, which are in some way tuned to the environment they live in. In bacteria, when a new gene is aquired by horizontal gene transfer, its GC% will reflect that of the donor bacteria - and thus may be very different to that of the recipient. By gradual mutation (assuming the gene is preserved) the GC% of such new genes will tend to approach that of the recipient bacteria.

What this means is that for any bacteria, genes with unusually high or low GC% may have been recently aquired by horizontal gene transfer.

You can download the script here, fasta_n3.py [Python Code] (older versions still available, see historical note).

It calculates GC percentages for each gene in a FASTA nucleotide file, writing the output to a tab separated file for use in a spreadsheet.

It has been tested with BioPython 1.43 and Python 2.3, and is suitable for Windows, Linux etc.

An example FASTA file

The suggested input file 'NC_005213.ffn' is available from the NCBI from here:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans/

It was chosen because it is a small genome, so the example is very quick to run. This is how this file starts out, showing the first gene and some of the second entry:

>ref|NC_005213.1|:c879-490883
ATGCGATTGCTATTAGAACTTAAAGCCCTAAATAGCATAGATAAAAAACAATTATCTAACTATCTAATAC
AAGGTTTCATTTATAATATATTAAAAAATACCGAATACTCTTGGTTACATAATTGGAAGAAAGAGAAATA
TTTTAATTTTACCTTAATCCCAAAAAAAGACATAATAGAGAATAAGAGGTATTATTTAATCATATCTTCG
CCCGATAAAAGGTTTATAGAGGTTTTGCATAATAAAATAAAAGATTTAGATATAATAACTATTGGTTTGG
CTCAATTCCAATTAAGGAAAACAAAAAAATTCGATCCAAAATTAAGATTTCCTTGGGTTACTATAACTCC
TATAGTATTAAGGGAAGGCAAAATAGTAATATTAAAAGGGGACAAATACTATAAGGTATTTGTTAAGCGA
TTGGAAGAATTGAAAAAGTATAATCTAATAAAAAAGAAAGAGCCCATTTTAGAAGAACCCATAGAGATTA
GTTTAAACCAAATCAAAGATGGATGGAAAATTATAGATGTAAAAGATAGGTATTACGATTTTAGAAATAA
GAGTTTTAGCGCTTTTTCTAATTGGTTGCGAGACCTAAAAGAGCAAAGCTTAAGAAAATATAATAATTTC
TGTGGGAAGAACTTTTATTTTGAAGAAGCAATATTCGAAGGTTTTACTTTCTATAAAACAGTATCAATAA
GAATAAGAATTAACAGAGGGGAAGCAGTATATATAGGCACATTATGGAAAGAGTTAAATGTTTATAGAAA
ATTAGACAAAGAAGAAAGAGAATTTTACAAATTTTTGTACGATTGCGGTTTGGGTTCATTAAATTCTATG
GGTTTTGGGTTTGTTAATACAAAAAAGAACTCTGCGAGATAA
>ref|NC_005213.1|:883-2691
ATGAAAAAGCCCCAACCCTATAAAGATGAAGAGATATATTCTATTTTAGAAGAGCCCGTAAAACAATGGT
TTAAAGAGAAATACAAAACATTCACTCCCCCACAAAGGTATGCAATAATGGAAATACATAAAAGGAACAA
TGTTTTAATTTCTTCCCCCACAGGTTCGGGAAAAACGTTAGCAGCGTTTTTAGCTATAATAAATGAATTA
ATAAAGTTATCTCATAAAGGAAAATTAGAAAATAGAGTTTATGCCATTTATGTTTCTCCATTAAGAAGTT
TAAATAACGATGTAAAGAAAAACTTAGAAACTCCATTAAAAGAAATAAAAGAAAAAGCGAAAGAGCTTAA
...

Most FASTA files will have additional (optional) information about each gene as part of the title line (the single lines starting > before the sequence).

The corresponding output file is here, nucleotide_counts.tsv, part of which is displayed later in this document.

The Program

In total (ignoring the comments) my program [Python Code] has under 20 lines (which I'll explain below):

input_file = open('NC_005213.ffn', 'r')
output_file = open('nucleotide_counts.tsv','w')
output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n')
from Bio import SeqIO
for cur_record in SeqIO.parse(input_file, "fasta") :
#count nucleotides in this record...
gene_name = cur_record.name
A_count = cur_record.seq.count('A')
C_count = cur_record.seq.count('C')
G_count = cur_record.seq.count('G')
T_count = cur_record.seq.count('T')
length = len(cur_record.seq)
cg_percentage = float(C_count + G_count) / length
output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % \
(gene_name, A_count, C_count, G_count, T_count, length, cg_percentage)
output_file.write(output_line)
output_file.close()
input_file.close()

So, going through this bit by bit...

We open a FASTA input file of gene nucleotides sequences ('r' = read):

input_file = open('NC_005213.ffn', 'r')

We next open an output file to record the counts in ('w' = write):

output_file = open('nucleotide_counts.tsv','w')

The file extension "tsv" is short for "Tab Separated Variables", also known as "Tab Delimited Format". This is a universal format, you can read it with any text editor or spreadsheet - Microsoft Excel is also a good choice.

We will now write a header line to our output file, so that each column has a caption in the spreadsheet:

output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n')

In Python (and also the C and C++ programming languages) we must write \t to mean a tab, and \n to mean an end of line (new line) character. i.e. that means:

Gene (tab) A (tab) C (tab) G (tab) T (tab) Length (tab) CG%

In a moment we are going to need BioPython's sequence input/output library, Bio.SeqIO, so we must tell Python to load this ready for us:

from Bio import SeqIO

Now we create a sequence record iterator, using the Bio.SeqIO parse function. Biopython uses the term iterator to mean an lump of code we can use to step through the records in a file (i.e. iterate over the records).

This next bit is a big loop... remember that the indentation means something in python:

for cur_record in SeqIO.parse(input_file, "fasta") :
#count nucleotides in this record...

The records created by Bio.SeqIO are SeqRecord objects, which in general will have an id, name, descrption and sequence. For some file formats additional annotation of sub-features may also be present.

In the Bio.SeqIO parser, the first word of each FASTA record is used as the record's id and name.

gene_name = cur_record.name

Just like a normal string in python, sequence objects also have a 'count' method which we can use to find the number of times nucleotide is present:

A_count = cur_record.seq.count('A')
C_count = cur_record.seq.count('C')
G_count = cur_record.seq.count('G')
T_count = cur_record.seq.count('T')

We would also like to know the number of nucleotides in this gene (which should add up to the four base counts, if there are no unknown bases, N):

length = len(cur_record.sequence)

Now we work out the CG percentage for this gene. We must switch from integers into floating point (non-integer) because integer division will just give 0 or 1 as the answer:

cg_percentage = float(C_count + G_count) / length

Finally, we are going to save this information as a single tab separated line in our output file.

As before (when we wrote the header line), we must write \t to mean a tab, and \n to mean an end of line (new line) character.

We are going to use the string formatting (or interpolation) operator % (which is like the printf command in C) where %s means insert a string, while %i means insert an integer, and %f a floating point (non-integer) number:

output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % \
(gene_name, A_count, C_count, G_count, T_count, length, cg_percentage)

The \ character at the end of that line above means this command continues onto next line - it makes things slightly easier to read.

Then we save this line of text to the output file:

output_file.write(output_line)

Now we have finished all the genes, we can close the output file:

output_file.close()

and close the input file:

input_file.close()

The Results

If you run this program and then open the output file 'nucleotide_counts.tsv' in your spreadsheet (e.g. Microsoft Excel) you should get something like this:

Gene A C G T Length CG%
ref|NC_005213.1|:c879-490883 371 91 151 269 882 0.274376
ref|NC_005213.1|:883-2691 742 234 335 498 1809 0.314538
ref|NC_005213.1|:c3189-2668 224 45 107 146 522 0.291188
ref|NC_005213.1|:c3748-3290 177 42 100 140 459 0.309368
... ... ... ... ... ... ...
ref|NC_005213.1|:c486962-486423 217 70 80 173 540 0.277778

Using the spreadsheet tools you can do things like sort the genes by their GC percentage:

Gene A C G T Length CG%
ref|NC_005213.1|:c879-490883 371 91 151 269 882 0.274376
... ... ... ... ... ... ...
ref|NC_005213.1|:c486962-486423 217 70 80 173 540 0.433333

As you can see above, the GC% ranges from 22.8% up to 43.3% (1 d.p.) for 'NC_005213.ffn'

By summing up the C and G columns and dividing by the sum of the gene lengths, the overall coding sequence GC percentage comes out at 31.2%. You should also be able to do fun things like graph the sorted GC percentages:

[Graph]

You can see there are a few genes with abnormally high or low GC percentages - perhaps these are the result of recent horizontal gene transfer from another bacteria?

You can also download the complete genome's nucleotide sequence, file 'NC_005213.fna'.

If you run this program on that instead, you will get a single line output containing the data for the whole genome (genes and all the DNA inbetween). The GC% for this is 31.6% (1dp) which is almost identical to the figure of 31.2% obtained for all the genes.

Historical Note

The first versions of this example used the Bio.Fasta library which predates the more recent and more general Bio.SeqIO module.

My original example [Python Code] used a while loop with cur_record = iterator.next() as was suggested in the Biopython documentation at the time:

iterator = Fasta.Iterator(input_file, Fasta.RecordParser())
while True :
cur_record = iterator.next()
if cur_record is None: break
#count nucleotides in this record...

For people familiar with database programming, the above code is very much like dealing with a forward only recordset, advancing until the end is reached.

The revised code [Python Code] shows a much nicer way, using the for ... in ... syntax:

for cur_record in Fasta.Iterator(input_file, Fasta.RecordParser()) :
#count nucleotides in this record...

The third version [Python Code] also uses the for ... in ... syntax, but with Bio.SeqIO instead of Bio.Fasta