Skip to main content

Downloading Genomes with an FTP script

I am looking at Two Component Systems (TCS) in bacteria as part of my PhD research, and one of the first tasks was to find these genes in bacterial genomes using BLAST.

While BLAST can be used over the internet (even automatically in a Python script), this is comparatively slow - especially for the number of searches I have been doing. Instead, you can run the BLAST program locally, using a local database (described here).

Of course, you first will need to download the genome(s) you are interested in. I wanted to search ALL the currently sequenced bacteria, and so came up with the following script to download all the bacteria genomes from the NCBI's FTP site: ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/

Update: The NCBI now provide a one gigabyte file, all.gbk.tar.gz and downloading this would be much more efficient for a one-off operation.

When I wrote this script, the NCBI had just over 200 bacterial genomes (many for different strains of a given bacteria), and storing just the GenBank files (*.gbk) took almost 1.5GB of disk space.

To handle the actual FTP access, I used Stefan Schwarzer's Python module ftputil, which he describes as a high-level interface to the ftplib module. Importantly for me, this allows many operations similar to those of python's os and os.path modules, which I already knew how to use for manipulating local files.

The script (which requires you first install the ftputil module):

import ftputil
import string
import os
import sys

#Where to put the genomes (script will create sub directories
#using the same names as the NCBI use).  This directory must
#exist already:
base_path="C:\\genomes\\Bacteria\\"

#People on linux try something like:
#base_path="~/genomes/Bacteria/"

host = ftputil.FTPHost('ftp.ncbi.nlm.nih.gov', 'anonymous', 'password')
host.chdir('/genomes/Bacteria/')

dir_list = host.listdir(host.curdir)
for dir_name in dir_list :
    host.chdir('/genomes/Bacteria/')
    if host.path.isdir(dir_name):
        print dir_name
        host.chdir('/genomes/Bacteria/' + dir_name + '/')
        file_list = host.listdir(host.curdir)
        for file_name in file_list :
            if file_name[-4:]==".gbk" :
                print "File " + file_name
                if not os.path.isdir(os.path.join(base_path,dir_name)) :
                    print "Making directory " + os.path.join(base_path,dir_name)
                    os.chdir(base_path)
                    os.mkdir(os.path.join(base_path,dir_name))
                if os.path.isfile(os.path.join(base_path,dir_name,file_name)) :
                    print "Skiping file " \
                          + os.path.join(base_path,dir_name,file_name)
                elif host.path.isfile(file_name) :
                    print "Downloading file " \
                          + os.path.join(base_path,dir_name,file_name)
                    host.download(file_name, \
                          os.path.join(base_path,dir_name,file_name), 't')
                    #Download arguments: remote filename, local filename, mode
                else :
                    print "ERROR - Not a file " + dir_name + "/" + file_name

What this does is log into the NCBI's FTP site, and ask for a list of the files/directories. For each directory, it then asks for a list of files/directories and looks for any GenBank files (*.gbk), which are then downloaded.

GenBank files are great because they contain more information than the simple FASTA files, which the NCBI also provide. You could easily change the script to download the FASTA amino acid sequences (*.faa) or FASTA nucleotide sequences (*.fna) which may be handy...

Notice that the directory listings from the listdir function are cached (as dir_list and file_list) because otherwise the listing seems to be re-fetched on each loop iteration, and this does take a long time.

The script can probably still be improved, for example the isdir() operations take a surprisingly long time - maybe it would be better to try and change directory and catch an exception error. The isfile() operation is also slow, and it would be fairly safe to assume that anything called *.gbk was a file. However, this speed impact is minor and to me didn't matter, as once I had the files the only reason to use the script again would be to look for new genomes automatically.