Nucleotide and Amino Acid Sequences in BioPython

In day-to-day coding, like many Biopython users, I used to work with plain Python strings rather than Biopython Seq objects, which are strings with an associated alphabet. I think that recent releases of Biopython have made the Seq object much more useful, especially in combination with the SeqIO system.

I started this page several years ago in an effort to get to grips with Biopython's "alphabet" system. So far it just summarises the IUPAC/IUBMB standards for nucleotide and amino acid "letter" names.

Still under construction...

Nucleotide Alphabet

The nucleotides making up RNA or DNA sequences, taken from IUBMB - Nucleotides:

Symbol	Meaning
G	Guanine
A	Adenine
T	Thymine (in DNA)
C	Cytosine
U	Uracil (in RNA)

Then there are the ambiguous nucleotide letters:

Symbol	Meaning	Origin of designation
R	G or A	puRine
Y	T or C	pYrimidine
M	A or C	aMino
K	G or T	Keto
S	G or C	Strong interaction (3 H bonds)
W	A or T	Weak interaction (2 H bonds)
H	A, C or T	not-G, H follows G in the alphabet
B	G, T or C	not-A, B follows A in the alphabet
V	G, C or A	not-T (not-U), V follows U in the alphabet
D	G, A or T	not-C, D follows C
N	G, A, T or C	aNy
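To make the table above concrete, here is a minimal plain-Python sketch that stores the ambiguity codes in a dictionary and uses them to expand or match sequences. The names `AMBIGUOUS_DNA`, `expand` and `matches` are my own, chosen for illustration; this is not Biopython's API, and it covers the DNA letters only (swap T for U to get the RNA equivalent).

```python
# IUPAC nucleotide letters, transcribed from the two tables above (DNA only).
AMBIGUOUS_DNA = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT",
    "N": "GATC",
}


def expand(sequence):
    """Return every unambiguous DNA string an ambiguous sequence could stand for."""
    results = [""]
    for letter in sequence.upper():
        options = AMBIGUOUS_DNA[letter]  # KeyError for letters outside the table
        results = [prefix + base for prefix in results for base in options]
    return results


def matches(pattern, sequence):
    """True if an unambiguous sequence is consistent with an ambiguous pattern."""
    if len(pattern) != len(sequence):
        return False
    return all(
        base in AMBIGUOUS_DNA[code]
        for code, base in zip(pattern.upper(), sequence.upper())
    )


print(expand("ARY"))              # ['AGT', 'AGC', 'AAT', 'AAC']
print(matches("GANTC", "GATTC"))  # True
```

Note that `expand` grows exponentially with the number of ambiguous letters, so it is only sensible for short motifs such as restriction sites.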

Nucleotide Sequences in BioPython

Right then... DNA and RNA... unambiguous and ambiguous...
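As a starting point, the sketch below shows in plain Python how the ambiguous letters behave under two common operations: reverse complement and transcription. The complement table follows directly from the definitions above (for example R, meaning G or A, pairs with Y, meaning C or T). The function names are illustrative assumptions; Biopython's Seq object provides its own `reverse_complement()` and `transcribe()` methods, which are what I would use in practice.

```python
# Complement of each IUPAC DNA letter. An ambiguity code maps to the code
# that covers the complements of all its members.
COMPLEMENT = {
    "A": "T", "T": "A", "G": "C", "C": "G",
    "R": "Y", "Y": "R",   # purine <-> pyrimidine
    "M": "K", "K": "M",   # amino <-> keto
    "S": "S", "W": "W",   # strong and weak are their own complements
    "H": "D", "D": "H",   # not-G <-> not-C
    "B": "V", "V": "B",   # not-A <-> not-T
    "N": "N",
}


def reverse_complement(dna):
    """Reverse complement of a DNA string, ambiguity codes included."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe(dna):
    """Coding-strand DNA to RNA: replace T with U."""
    return dna.upper().replace("T", "U")


print(reverse_complement("GATNRC"))  # GYNATC
print(transcribe("GATTACA"))         # GAUUACA
```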

Amino Acid Alphabet

The standard twenty amino acids have one-letter and three-letter codes as follows, taken from IUPAC/IUBMB - Amino Acids:

One	Three	Meaning	One	Three	Meaning
A	Ala	Alanine	M	Met	Methionine
C	Cys	Cysteine	N	Asn	Asparagine
D	Asp	Aspartic acid	P	Pro	Proline
E	Glu	Glutamic acid	Q	Gln	Glutamine
F	Phe	Phenylalanine	R	Arg	Arginine
G	Gly	Glycine	S	Ser	Serine
H	His	Histidine	T	Thr	Threonine
I	Ile	Isoleucine	V	Val	Valine
K	Lys	Lysine	W	Trp	Tryptophan
L	Leu	Leucine	Y	Tyr	Tyrosine

There are, of course, some special cases:

One	Three	Meaning
X	Xaa	Unknown or 'other' amino acid
U	Sec	Selenocysteine (see IUBMB recommendations)
O	Pyl	Pyrrolysine
B	Asx	Aspartic acid (D) or Asparagine (N)
Z	Glx	Glutamic acid (E) or Glutamine (Q), or substances such as 4-carboxyglutamic acid and 5-oxoproline that yield glutamic acid on acid hydrolysis of peptides
J		Sometimes used in NMR work as designation for signals assigned either to leucine (L) or to isoleucine (I) which cannot be distinguished from each other
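The ambiguous protein letters can be handled much like the ambiguous nucleotides. The following is a small sketch, not Biopython code: `AMBIGUOUS_PROTEIN` and `could_be` are hypothetical names, and treating X as matching any residue is my own simplifying assumption.

```python
# Ambiguous one-letter amino acid codes from the special-cases table above.
AMBIGUOUS_PROTEIN = {
    "B": "DN",  # Asx: aspartic acid or asparagine
    "Z": "EQ",  # Glx: glutamic acid or glutamine
    "J": "LI",  # leucine or isoleucine
}


def could_be(code, residue):
    """True if the one-letter `code` is consistent with a specific `residue`."""
    code, residue = code.upper(), residue.upper()
    if code == "X":
        # Assumption: unknown or 'other' is treated as matching anything.
        return True
    # Unambiguous letters (including U and O) only match themselves.
    return residue in set(AMBIGUOUS_PROTEIN.get(code, code))


print(could_be("B", "D"))  # True
print(could_be("B", "E"))  # False
print(could_be("J", "I"))  # True
```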

Amino Acid Sequences in BioPython

The first point is that Biopython uses the one-letter codes almost exclusively: they are simply much more convenient to manipulate on a computer than the three-letter codes.
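When data does arrive in three-letter form, converting it is straightforward. Here is a minimal plain-Python sketch built from the tables above; `THREE_TO_ONE`, `to_one_letter` and `to_three_letter` are names I have made up for illustration, and J is left out because the table gives it no three-letter code. Biopython has comparable helpers of its own in `Bio.SeqUtils` (`seq1` and `seq3`), which would be the natural choice in real code.

```python
# Three-letter to one-letter codes, transcribed from the tables above.
THREE_TO_ONE = {
    "Ala": "A", "Cys": "C", "Asp": "D", "Glu": "E", "Phe": "F",
    "Gly": "G", "His": "H", "Ile": "I", "Lys": "K", "Leu": "L",
    "Met": "M", "Asn": "N", "Pro": "P", "Gln": "Q", "Arg": "R",
    "Ser": "S", "Thr": "T", "Val": "V", "Trp": "W", "Tyr": "Y",
    # Special cases
    "Xaa": "X", "Sec": "U", "Pyl": "O", "Asx": "B", "Glx": "Z",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(three_letter_sequence):
    """Convert a run of three-letter codes, e.g. 'MetAlaIle', to 'MAI'."""
    if len(three_letter_sequence) % 3 != 0:
        raise ValueError("length must be a multiple of three")
    codes = [
        three_letter_sequence[i:i + 3]
        for i in range(0, len(three_letter_sequence), 3)
    ]
    # capitalize() normalises 'ALA' or 'ala' to 'Ala' before the lookup.
    return "".join(THREE_TO_ONE[code.capitalize()] for code in codes)


def to_three_letter(one_letter_sequence):
    """Convert one-letter codes, e.g. 'MAI', to 'MetAlaIle'."""
    return "".join(ONE_TO_THREE[letter] for letter in one_letter_sequence.upper())


print(to_one_letter("MetAlaIleValSec"))  # MAIVU
print(to_three_letter("MAIV"))           # MetAlaIleVal
```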
