Calculating phi/psi angles for the "Top 500" PDB files

Alternatively, things that go wrong with PDB files.

Note - You should read calculating (ϕ,ψ) angles first. Also, for more discussion of PDB oddites, see Roger Sayle's PDB: Cruft to Content.

When I wrote this page, I was trying to calculate all the (ϕ,ψ) angles from the "Top 500 PDB files" used by Lovell et al. 2003 (see references). I managed to produce a lot of angle pairs, but it didn't tally up that well with their results. I got in touch with the group and they sent me a ZIP file which includes:

their lists of angles (see top500-angles/srcdata/rama500-*.tab)
their smoothed distributions (see top500-angles/pct/rama/rama500-*.data)
and associated occupancies (see top500-angles/pct/rama/rama500-*.data)

It was their lists of angles that I was initially trying to recreate. I also had a go at the next step, generating smoothed density distributions (using R) but with a much simpler technique than that in Lovell et al. 2003.

The "Top 500" PDB files

In Lovell et al. 2003 (see references), a list of 500 high resolution PDB files was compiled, and used to calculate a list of about 100,000 known (ψ,ϕ) angles. These angles were then used to define Preferred, Allowed but Disfavoured, and Forbidden regions - a refinement of the earlier core/non core regions of Kleywegt and Jones (1996).

On their website you can download these 500 PDB files (top500H.tgz, 41 MB, unix format), however they have been pre-processed by a tool called Reduce to generate the hydrogen atoms. MMTK had "issues" with some of their files, so in general I think it is better work with the original files direct from the Protein Data Bank:

top500_list.tsv - List of the "Top 500" PDB files, second column is which chains are used (blank for all).
make_list.py - Python script to create the list (using top500H.tgz files as input).
download_list.py - Python script to download PDB files on the list. Uses the urllib.urlretrieve function.

As a point of interest, while their file does contain 500 PDB files, it only covers 494 PDB identifiers. Why? Because they have split six PDB files into two:

1FLT has separate files for chains V and Y
1H2R has separate files for chains L and S
1MRO has separate files for chains B and C
1QH8 has separate files for chains A and B
2AHJ has separate files for chains C and D
2BBK has separate files for chains H and L

BioPython and the "Top 500" PDB files

We can use the BioPython PDB parser to check the validity of these "Top 500" PDB files. I found it will load all of them in permissive mode (while issuing copious warnings about disordered atoms found with blank altloc entries, discontinuous chains, and redefined residues).

However, some of these PDB files are so bad they require permissive mode (and will not load in strict mode), for example 1BTY:

pdb = "1BTY"
filename = "top500/"+pdb+".pdb"

from Bio.PDB import PDBParser
from Bio.PDB.PDBParser import PDBConstructionException
strict_parser = PDBParser(PERMISSIVE=False)
permissive_parser = PDBParser(PERMISSIVE=True)

try :
    configuration = strict_parser.get_structure(pdb, filename)
except PDBConstructionException :
    print "WARNING - Errors in PDB file: "+pdb
    print "==================================="
    #This will print warnings to screen:
    configuration = permissive_parser.get_structure(pdb, filename)

If you run that code, you'll see that BioPython thinks there are erroneous multiply defined atoms in residues Ser61, Ser127, Ser166, Ser170, Ser236 and Asn245. Looking at the raw PDB file, the altloc field seems to have been misused.

Going through all the "Top 500" files in this way shows sixteen files with oddities: 1A73, 1AAC, 1BTY, 1EDG, 1EJG, 1FVK, 1JHG, 1LUC, 1MCT, 1MOF, 1OAA, 1PDO, 1YVE, 2NAC, 3SDH and 5NUL:

check_list_bio.py - Python script to check the PDB entries on the list using BioPython.

This script also searches each chain for residues which share the same backbone coordinates, which I discuss next.

Fractional Occupancy

Using check_list_bio.py , I discovered two files (1DIN and 2HMZ) which appear to load fine but contain a much more insidious problem: Two different residues listed with exactly the same backbone coordinates (but fractional occupancy).

In the first case (1DIN) residues Csd123 (60% occupancy) and Cys123 (40%) coincide:

There is microheterogeneity at residue 123. One of the residues is cysteine and the other is oxidized cysteine which is presented as residue Csd. Only Csd is listed on the seqres records below; both Cys and Csd are listed on seqadv records. Coordinates are presented for cys and csd.

The second case (2HMZ) is similar, residues Val21 and Ile21A coincide with 50% occupancy each:

Residue 21 has been modeled both as Val and as Ile. It is presented in the atom records below as Val 21 followed by Ile 21A. All atoms of this residue have been assigned occupancies of 0.5.

Using BioPython's PDB parser, both residues are present in the structures (with appropriate occupancies). Using MMTK however, we only see the Val21/Ile21A problem in 2HMZ. The reason 1DIN appears fine is that MMTK ignores modified amino acid residues encoded as HETATM records, so it never sees residue Csd123 - only Cys123 is present.

So, fractional occupancy does make life interesting... and editing the PDB files is a quick and dirty solution:

1DIN_edited.pdb - With Csd123 removed
2HMZ_edited.pdb - With Ile21A removed

These both pass my little test with flying colours. So far, I have only resorted to editing two out the 494 PDB files for use with BioPython... but with MMTK its a different story.

MMTK and the "Top 500" PDB files

I found that even with a few tricks, MMTK still won't load all of these PDB files correctly. The problem is that MMTK is designed for molecular dynamic simulations, and therefore has to fully understand the structure and bonds of every peptide - and this makes it fussy.

The following script attempts to load each of the "Top 500" PDB files using MMTK, and if successful, searches each chain for residues with exactly overlapping backbone atoms (reporting 2HMZ, but as discussed above, not 1DIN):

check_list_mmtk.py - Python script to check the PDB entries on the list using MMTK.

Three PDB files fail due to ambiguous side chain atoms ("AE1", "AE2" atoms on Gly residues in 1ECO and 5CYT; "AD1", "AD2" atoms on Asn9 in 451C). These can be fixed by editing the PDB files to delete the lines, or doing this in code as a special case.

1VNS fails as MMTK is not expecting the atom "CB" in residue Gly126.

1BTY (and others?) fail at the deleteHydrogens() call, due to a problem with multiply defined atoms. I have suggested a fix to MMTK for this.

1AAC loads, but silently ignores Ser9 - unless MMTK is told to use altloc field "1" rather than its default of "A". There may be other "failures" of this type.

Quite a lot of these PDB files include modified amino acid residues, which are held in PDB files using HETATM records. For example: selenomethionine (MSE) in 1PYM, or an assortment including methylated histidines, arginines and glutamines in 1MRO.

MMTK does not support these and will break the peptide chains at each non-standard residue. This means MMTK will not include them in the phi/psi angles. The suggested way to cope with these is to create definitions for the residues in the MMTK database, then edit the configuration in code (or the PDB file by hand) to change the HETATM entries into ATOM entries. Messy.

Finally, older versions of MMTK did not support what it considered to be length one peptide chains (e.g. chain C in 1ATL), but in MMTK 2.5.13 at least these are defined in its database. Such short chains don't have ϕ-ψ angles anyway, so this can be avoided by only building MMTK peptide chains for the longer chains (rather than for the whole PDB file).

Python

The Molecular Modelling Toolkit (MMTK)

Biopython

Protein Data Bank (PDB)