
Nucleotide and Amino Acid Sequences in BioPython

In day-to-day coding, like many Biopython users, I used to work with plain Python strings rather than Biopython Seq objects, which are essentially strings with an associated alphabet. I think that recent releases of Biopython have made the Seq object much more useful, especially in combination with the SeqIO system.

In an effort to get to grips with BioPython's current "alphabet" system, several years ago I started this page. So far it just summarises the IUPAC/IUBMB standards for nucleotide and amino acid "letter" names.

Still under construction...

Nucleotide Alphabet

The nucleotides making up RNA or DNA sequences, taken from IUBMB - Nucleotides:

Symbol Meaning
G Guanine
A Adenine
T Thymine (in DNA)
C Cytosine
U Uracil (in RNA)

Then there are the ambiguous nucleotide letters:

Symbol Meaning Origin of designation
R G or A puRine
Y T or C pYrimidine
M A or C aMino
K G or T Keto
S G or C Strong interaction (3 H bonds)
W A or T Weak interaction (2 H bonds)
H A, C or T not-G, H follows G in the alphabet
B G, T or C not-A, B follows A in the alphabet
V G, C or A not-T (not-U), V follows U in the alphabet
D G, A or T not-C, D follows C in the alphabet
N G, A, T or C aNy
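
The table above can be written down directly as a lookup table. The following is a minimal pure-Python sketch, not Biopython's own API: the names AMBIGUOUS_DNA and expand_ambiguous are made up for illustration, and the mapping uses the DNA form of the alphabet (T rather than U).

```python
from itertools import product

# Each IUPAC letter mapped to the unambiguous bases it can stand for.
# Hypothetical helper table for illustration; DNA form (T, not U).
AMBIGUOUS_DNA = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT",
    "N": "GATC",
}


def expand_ambiguous(seq):
    """Return every unambiguous DNA sequence that `seq` could represent.

    Raises KeyError if `seq` contains a letter outside the table above.
    """
    choices = [AMBIGUOUS_DNA[letter] for letter in seq.upper()]
    return ["".join(combo) for combo in product(*choices)]


print(expand_ambiguous("ARC"))  # ['AGC', 'AAC']
```

The number of expansions grows multiplicatively with each ambiguous letter (two N's already give 16 sequences), so this approach is only sensible for short sequences.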

Nucleotide Sequences in BioPython

Right then... DNA and RNA... unambiguous and ambiguous...

Amino Acid Alphabet

The standard twenty amino acids have one-letter and three-letter codes as follows, taken from the IUPAC/IUBMB - Amino Acids:

One Three Meaning One Three Meaning
A Ala Alanine M Met Methionine
C Cys Cysteine N Asn Asparagine
D Asp Aspartic acid P Pro Proline
E Glu Glutamic acid Q Gln Glutamine
F Phe Phenylalanine R Arg Arginine
G Gly Glycine S Ser Serine
H His Histidine T Thr Threonine
I Ile Isoleucine V Val Valine
K Lys Lysine W Trp Tryptophan
L Leu Leucine Y Tyr Tyrosine

There are, of course, some special cases:

One Three Meaning
X Xaa Unknown or 'other' amino acid
U Sec Selenocysteine (see IUBMB recommendations)
O Pyl Pyrrolysine
B Asx Aspartic acid (D) or Asparagine (N)
Z Glx Glutamic acid (E) or Glutamine (Q), or substances such as 4-carboxyglutamic acid and 5-oxoproline that yield glutamic acid on acid hydrolysis of peptides
J   Sometimes used in NMR work as designation for signals assigned either to leucine (L) or to isoleucine (I) which cannot be distinguished from each other
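
As a rough illustration of how the ambiguous amino acid letters might be handled in code, here is a small pure-Python sketch. The names AMBIGUOUS_PROTEIN and residue_matches are hypothetical and not part of Biopython, and the table is a simplification: it treats Z as just E or Q, ignoring the other substances mentioned above.

```python
# Ambiguous one-letter codes mapped to the residues they may represent.
# Hypothetical table for illustration; Z is simplified to E or Q only.
AMBIGUOUS_PROTEIN = {
    "B": "DN",  # Asx: aspartic acid or asparagine
    "Z": "EQ",  # Glx: glutamic acid or glutamine
    "J": "LI",  # leucine or isoleucine
}


def residue_matches(code, residue):
    """Return True if one-letter `code` could stand for the concrete `residue`."""
    code, residue = code.upper(), residue.upper()
    if code == "X":
        return True  # X is unknown or 'other', so it matches anything
    # Unambiguous codes fall back to matching only themselves.
    return residue in AMBIGUOUS_PROTEIN.get(code, code)
```

For example, residue_matches("B", "D") and residue_matches("B", "N") are both true, while residue_matches("B", "E") is false.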

Amino Acid Sequences in BioPython

The first point is that BioPython uses the one-letter codes almost exclusively - they are simply much more convenient to manipulate on a computer than the three-letter codes.
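
To show why one-letter codes are the more convenient form, here is a minimal sketch of converting between the two notations using the tables above. It is plain Python with hypothetical names (THREE_TO_ONE, three_to_one, one_to_three), not Biopython's own implementation. J is left out because the table above gives no three-letter code for it.

```python
# Three-letter to one-letter codes, copied from the tables above.
# Hypothetical lookup table for illustration only.
THREE_TO_ONE = {
    "Ala": "A", "Cys": "C", "Asp": "D", "Glu": "E", "Phe": "F",
    "Gly": "G", "His": "H", "Ile": "I", "Lys": "K", "Leu": "L",
    "Met": "M", "Asn": "N", "Pro": "P", "Gln": "Q", "Arg": "R",
    "Ser": "S", "Thr": "T", "Val": "V", "Trp": "W", "Tyr": "Y",
    # Special cases
    "Xaa": "X", "Sec": "U", "Pyl": "O", "Asx": "B", "Glx": "Z",
}

# Reverse lookup, built from the table above.
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def three_to_one(codes):
    """Join a list of three-letter codes into a one-letter string."""
    return "".join(THREE_TO_ONE[code.capitalize()] for code in codes)


def one_to_three(seq):
    """Split a one-letter string into a list of three-letter codes."""
    return [ONE_TO_THREE[letter] for letter in seq.upper()]


print(three_to_one(["Met", "Ala", "Sec"]))  # MAU
print(one_to_three("MAU"))  # ['Met', 'Ala', 'Sec']
```

With one-letter codes a sequence is an ordinary string, so slicing, searching and counting work with the usual string methods; the three-letter form needs a list or some delimiter convention.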