    Nucleotide and Amino Acid Sequences in Biopython

    In day-to-day coding, like many Biopython users, I used to work with plain Python strings rather than Biopython Seq objects, which are strings with an associated alphabet. I think that recent releases of Biopython have made the Seq object much more useful, especially in combination with the SeqIO system.

    I started this page several years ago in an effort to get to grips with Biopython's "alphabet" system. So far it only summarises the IUPAC/IUBMB standards for nucleotide and amino acid "letter" names.

    Still under construction...

    Nucleotide Alphabet

    The nucleotides making up RNA or DNA sequences, taken from IUBMB - Nucleotides:

    Symbol  Meaning
    G       Guanine
    A       Adenine
    T       Thymine (in DNA)
    C       Cytosine
    U       Uracil (in RNA)

    Then there are the ambiguous nucleotide letters:

    Symbol  Meaning        Origin of designation
    R       G or A         puRine
    Y       T or C         pYrimidine
    M       A or C         aMino
    K       G or T         Keto
    S       G or C         Strong interaction (3 H bonds)
    W       A or T         Weak interaction (2 H bonds)
    H       A, C or T      not-G, H follows G in the alphabet
    B       G, T or C      not-A, B follows A in the alphabet
    V       G, C or A      not-T (not-U), V follows U in the alphabet
    D       G, A or T      not-C, D follows C
    N       G, A, T or C   aNy
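
    As a rough illustration, the ambiguity table can be written down as a plain Python dictionary and used to expand an ambiguous sequence into every unambiguous sequence it could stand for. This is only a sketch using plain strings: the names AMBIGUOUS_DNA and expand are made up for this page and are not part of Biopython.

```python
from itertools import product

# IUPAC nucleotide codes for DNA, transcribed from the tables above.
# Each symbol maps to the unambiguous bases it may represent.
# (The name AMBIGUOUS_DNA is hypothetical, not a Biopython identifier.)
AMBIGUOUS_DNA = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT",
    "N": "GATC",
}


def expand(seq):
    """List every unambiguous DNA sequence that `seq` could represent."""
    choices = [AMBIGUOUS_DNA[symbol] for symbol in seq.upper()]
    return ["".join(bases) for bases in product(*choices)]
```

    For example, expand("AR") gives ["AG", "AA"], since R stands for G or A. The number of expansions grows quickly with the number of ambiguous letters, so this is only sensible for short sequences such as primers.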

    Nucleotide Sequences in Biopython

    Right then... DNA and RNA... unambiguous and ambiguous...
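
    To make the four cases concrete (DNA or RNA, unambiguous or ambiguous), here is a minimal sketch that classifies a plain Python string by the letters it contains. It does not use Biopython's alphabet objects; the function name classify and the set names are assumptions made for this example.

```python
# Letter sets taken from the nucleotide tables earlier on this page.
UNAMBIGUOUS_DNA = set("GATC")
UNAMBIGUOUS_RNA = set("GAUC")
AMBIGUITY_LETTERS = set("RYMKSWHBVDN")


def classify(seq):
    """Return a rough description of which nucleotide alphabet `seq` fits.

    A sequence with neither T nor U fits both DNA and RNA; this sketch
    reports it as DNA because the DNA checks come first.
    """
    letters = set(seq.upper())
    if letters <= UNAMBIGUOUS_DNA:
        return "unambiguous DNA"
    if letters <= UNAMBIGUOUS_RNA:
        return "unambiguous RNA"
    if letters <= UNAMBIGUOUS_DNA | AMBIGUITY_LETTERS:
        return "ambiguous DNA"
    if letters <= UNAMBIGUOUS_RNA | AMBIGUITY_LETTERS:
        return "ambiguous RNA"
    return "not a nucleotide sequence"
```

    A string mixing T and U, such as "GATUACA", falls through to the last case, because no single alphabet above contains both letters.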

    Amino Acid Alphabet

    The standard twenty amino acids have one-letter and three-letter codes as follows, taken from IUPAC/IUBMB - Amino Acids:

    One  Three  Meaning          One  Three  Meaning
    A    Ala    Alanine          M    Met    Methionine
    C    Cys    Cysteine         N    Asn    Asparagine
    D    Asp    Aspartic acid    P    Pro    Proline
    E    Glu    Glutamic acid    Q    Gln    Glutamine
    F    Phe    Phenylalanine    R    Arg    Arginine
    G    Gly    Glycine          S    Ser    Serine
    H    His    Histidine        T    Thr    Threonine
    I    Ile    Isoleucine       V    Val    Valine
    K    Lys    Lysine           W    Trp    Tryptophan
    L    Leu    Leucine          Y    Tyr    Tyrosine

    There are, of course, some special cases:

    One  Three  Meaning
    X    Xaa    Unknown or 'other' amino acid
    U    Sec    Selenocysteine (see IUBMB recommendations)
    O    Pyl    Pyrrolysine
    B    Asx    Aspartic acid (D) or Asparagine (N)
    Z    Glx    Glutamic acid (E) or Glutamine (Q), or substances such as 4-carboxyglutamic acid and 5-oxoproline that yield glutamic acid on acid hydrolysis of peptides
    J           Sometimes used in NMR work as designation for signals assigned either to leucine (L) or to isoleucine (I) which cannot be distinguished from each other
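
    The ambiguous amino acid letters can be handled in the same spirit as the ambiguous nucleotides. The sketch below maps B, Z and J to the standard residues they may stand for, using the one-letter codes: B is aspartic acid (D) or asparagine (N), Z is glutamic acid (E) or glutamine (Q), and J is leucine (L) or isoleucine (I). The names are hypothetical, and the non-standard substances covered by Z are deliberately left out.

```python
# Ambiguous one-letter amino acid codes mapped to the standard residues
# they may represent. (Hypothetical name, not a Biopython identifier.)
AMBIGUOUS_AMINO_ACIDS = {
    "B": "DN",  # Asx: aspartic acid or asparagine
    "Z": "EQ",  # Glx: glutamic acid or glutamine
    "J": "LI",  # leucine or isoleucine
}


def possible_residues(letter):
    """Return the residues a one-letter code may represent.

    Letters that are not ambiguous in the table (including X, which is
    simply 'unknown') are returned unchanged in upper case.
    """
    letter = letter.upper()
    return AMBIGUOUS_AMINO_ACIDS.get(letter, letter)
```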

    Amino Acid Sequences in Biopython

    The first point is that Biopython uses the one-letter codes almost exclusively: they are simply much more convenient to manipulate on a computer than the three-letter codes.
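
    When three-letter codes are needed, for instance for display, converting between the two forms is a simple table lookup. The following is a minimal sketch in plain Python built from the tables on this page; the function names are invented for this example. J is omitted because no three-letter code is listed for it above. Biopython itself provides comparable helpers (seq1 and seq3 in Bio.SeqUtils), which are not used here.

```python
# One-letter to three-letter codes, transcribed from the tables above.
ONE_TO_THREE = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
    # Special cases
    "X": "Xaa", "U": "Sec", "O": "Pyl", "B": "Asx", "Z": "Glx",
}
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def to_three_letter(seq):
    """Convert a one-letter protein string, e.g. 'MKV' to 'MetLysVal'."""
    return "".join(ONE_TO_THREE[letter] for letter in seq.upper())


def to_one_letter(seq):
    """Convert concatenated three-letter codes, e.g. 'MetLysVal' to 'MKV'."""
    codes = [seq[i:i + 3].capitalize() for i in range(0, len(seq), 3)]
    return "".join(THREE_TO_ONE[code] for code in codes)
```

    Both functions raise KeyError on a letter or code that is not in the table, which is usually preferable to silently passing bad data through.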

    Page contact: Peter Cock Last revised: Sun 20 Jul 2008