
Poster Abstracts

Vladislav Vyshemirsky and Mark Girolami, "Approximate Inference on Stochastic Process Algebra models"
An application of Approximate Bayesian Computation methods to models defined using Process Algebras will be demonstrated. This work enables learning quantitative model parameters from experimental data in a consistent probabilistic setting. Approximate Markov Chain Monte Carlo algorithms are adapted to work with biochemical models defined in the modelling language of the PRISM model checker. Model calibration of stochastic biochemical models is now possible, and structural models defined using Process Algebras can therefore be put into a realistic quantitative context to produce relevant and calibrated predictions. The issues of selecting suitable descriptive statistics, interpreting the model mismatch error terms, and defining a distance measure on model predictions will be discussed in detail. A case study will be demonstrated in which all of the parameters of a stochastic model are inferred from a limited data set. The learnt parameters are then confirmed to reproduce known values from the literature. Stochastic model predictions can then be drawn to demonstrate how the stability of the model depends on the certainty of the parameter choice.
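The inference scheme described above can be illustrated with a minimal ABC rejection sketch (a simpler relative of the ABC-MCMC algorithms the abstract adapts). The toy simulator, summary statistic and tolerance below are illustrative assumptions, not the authors' model:

```python
import random

def simulate(theta, n=50, seed=None):
    """Toy stochastic simulator (a stand-in for a Process Algebra /
    PRISM model run): n replicates of a Binomial(100, theta) count."""
    rng = random.Random(seed)
    return [sum(rng.random() < theta for _ in range(100)) for _ in range(n)]

def summary(data):
    """Descriptive statistic of a data set (here: the sample mean).
    Choosing such statistics well is one issue the abstract discusses."""
    return sum(data) / len(data)

def abc_rejection(observed, prior_draw, n_samples=2000, eps=2.0, seed=0):
    """Basic ABC rejection: draw parameters from the prior and keep
    those whose simulated summary lies within eps of the observed one."""
    rng = random.Random(seed)
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_samples):
        theta = prior_draw(rng)
        s_sim = summary(simulate(theta, seed=rng.randrange(1 << 30)))
        if abs(s_sim - s_obs) < eps:  # distance measure on predictions
            accepted.append(theta)
    return accepted

observed = simulate(0.3, seed=42)  # "experimental" data, true theta = 0.3
post = abc_rejection(observed, prior_draw=lambda r: r.random())
est = sum(post) / len(post)        # posterior mean, close to 0.3
```

Replacing the rejection step with a Metropolis-Hastings accept/reject move over parameter space yields the approximate MCMC variants referred to in the abstract.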

Ari Loytynoja, "Progressive alignment of sequence graphs with insertion-deletion history"
Sequence alignment is an inference of evolutionary homology among the characters in different sequences. The process of alignment comprises the reconstruction of ancestral sequences, with their character states and the presence or absence of sites. We have shown that an algorithm that distinguishes insertions from deletions makes more realistic inferences about the sites truly present in ancestral sequences and has superior accuracy in reconstructing the correct site homology. However, the method is greedy and sensitive to errors in the sequence phylogeny. I describe a novel approach based on a graph representation of sequences that allows more flexible modelling of insertion-deletion history and is less dependent on the input phylogeny. The framework can be extended to capture the uncertainty of alignment solutions, avoid greedy errors and provide an efficient approach for sampling suboptimal solutions. I present an application of the graph algorithm to the alignment of short-read sequence data, with uncertainty about the true site content, against a reference alignment of related sequences.

Charlotte Soneson, Henrik Lilljebjörn, Thoas Fioretos and Magnus Fontes, "Integrative analysis of gene expression and copy number alterations using canonical correlation analysis"
With the rapid development of new genetic measurement methods, several types of genetic alterations can be quantified in a high-throughput manner. While the initial focus has been on investigating each data set separately, there is increasing interest in studying the correlation structure between two or more data sets, and several multivariate methods have recently been proposed for this purpose. The high dimensionality of microarray data generally imposes computational difficulties, which have been addressed, for example, by studying the covariance structure instead, or by restricting attention to small data sets. In this work, we implement a new method for integrating large genetic data sets by translating regularized canonical correlation analysis (CCA) into its dual form. This allows us to emphasize the correlation structure, while the computational complexity depends mainly on the number of samples rather than the number of variables.
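A minimal sketch may make the dual trick concrete: with n samples and p variables, regularized CCA can be posed on the n x n Gram matrices rather than the p x p covariance matrices. The linear-kernel implementation and regularization value below are illustrative assumptions, not the authors' code:

```python
import numpy as np

def dual_rcca(X, Y, reg=1.0):
    """Leading canonical correlation via regularized CCA in dual form.
    X is n x p, Y is n x q; all heavy linear algebra is on n x n Gram
    matrices, so the cost scales with the sample count n, not p or q."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Kx = Xc @ Xc.T          # n x n Gram matrix of the first view
    Ky = Yc @ Yc.T          # n x n Gram matrix of the second view
    n = X.shape[0]
    I = np.eye(n)
    # Dual eigenproblem: (Kx + reg I)^-1 Ky (Ky + reg I)^-1 Kx a = rho^2 a
    M = np.linalg.solve(Kx + reg * I, Ky) @ np.linalg.solve(Ky + reg * I, Kx)
    rho2 = float(np.max(np.linalg.eigvals(M).real))
    return min(max(rho2, 0.0), 1.0) ** 0.5

rng = np.random.default_rng(0)
z = rng.normal(size=(100, 1))            # shared latent signal
X = z + 0.1 * rng.normal(size=(100, 3))  # view 1, correlated with view 2
Y = z + 0.1 * rng.normal(size=(100, 3))  # view 2
rho = dual_rcca(X, Y)                    # close to 1: strong shared structure
```

In the paired expression/copy-number setting, p and q run to tens or hundreds of thousands of probes while n = 173 patients, which is exactly the regime where working with the n x n matrices pays off.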

We apply the regularized dual CCA to a large paired data set of gene expression changes, measured with Affymetrix HG-U133A arrays, and copy number alterations, measured with Affymetrix Human Mapping 250K Sty SNP arrays, in 173 leukemia patients. Using this method, we are able to extract two well-known leukemia subtypes with strong copy number and gene expression profiles. Furthermore, the regularized dual CCA is shown to yield results that are more biologically interpretable than those from a covariance-maximizing method, and to highlight different features than those found when each variable set is studied separately with principal component analysis (PCA).

Kenneth J Evans, "Most transcription factor binding sites are in a few mosaic classes of the human genome"
Many algorithms for finding transcription factor binding sites have concentrated on characterising the binding site itself, and these algorithms produce a large number of false positive sites. The DNA sequence that does not bind has been modelled only to the extent necessary to complement this formulation.
Results: We find that the human genome may be described by 19 pairs of mosaic classes, each defined by its base frequencies (or, more precisely, by the frequencies of doublets), so that typically a run of 10 to 100 bases belongs to the same class. Most experimentally verified binding sites fall in the same four pairs of classes. In our sample of fourteen transcription factors, drawn from different families, the average proportion of sites in this subset of classes was 71%, with values for individual factors ranging from 46% to 98%. By contrast, these same classes contain only 26% of the bases of the genome and only 31% of occurrences of the motifs of these factors, that is, places where one might expect the factors to bind. These results are not a consequence of the class composition in promoter regions.
Conclusions: This method of analysis will help to find transcription factor binding sites and assist with the problem of false positives. These results also imply a profound difference between the mosaic classes.


Irina Abnizova, Tom Skelly, Steven Leonard, Nava Whiteford, Clive Brown and Tony Cox, "Statistical comparison of methods to estimate error probability in short read Illumina sequencing"
As was the case in the beginning of the sequencing era, the new generation of short read sequencing technologies still requires both accuracy of data processing methods and reliable measures of that accuracy.
Inspired by a classic of the genre, the Phred method, we generalized its findings to the problem of base quality value calibration. We introduce a simple, statistically grounded way to measure the performance of a calibrator and to assess its reliability. We illustrate the method by assessing the performance of several calibrators/predictors for Illumina Genome Analyzer 2 (GA2) data. The choice of the best predictor is based on optimization of the validity, discriminative ability and discrimination power of several candidate predictors. We applied the method to data from one experimental run on the genome of the phage phiX, and found the best predictor out of ten candidates to be ‘Purity’, a statistic derived from corrected cluster intensities.

William Kelly, "Error and Interactome Complexity: Capture-Recapture and Protein Interaction Networks"
Quantifying the false positive rate within reported protein interactions is of crucial importance when estimating the size and organization of interactomes. Existing methods have relied upon the use of a gold standard set of true interactions. The construction of these sets is problematic, and prone to experimenters’ biases. Results relying on such data may inadvertently increase the estimated error rate for certain experimental techniques.

Capture-recapture methods are commonly used in ecology to estimate population sizes. We develop a statistical model of the collection of protein interaction data that, from the outset, considers observations as falling into two distinct populations: false positives and true positives. Assuming uniform sampling across the available proteome, we model the interactome size and error rate given the experimental data, treated as the combined outcome of multiple capture-recapture studies.
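The capture-recapture idea can be made concrete with the textbook two-sample Lincoln-Petersen estimator; the interaction counts below are hypothetical, and the abstract's model generalizes this to multiple studies with an explicit false-positive population:

```python
def lincoln_petersen(n1, n2, m):
    """Two-sample capture-recapture estimate of population size:
    n1 items seen in study 1, n2 seen in study 2, m seen in both.
    Estimated total = n1 * n2 / m, valid under uniform sampling."""
    if m == 0:
        raise ValueError("no recaptures: population size is not estimable")
    return n1 * n2 / m

# Hypothetical example: two interaction screens report 1200 and 900
# protein pairs respectively, sharing 150 pairs.
interactome_size = lincoln_petersen(n1=1200, n2=900, m=150)  # -> 7200.0
```

Intuitively, the rarer the overlap between independent screens, the larger the unobserved remainder of the interactome must be.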

Our results give an overview of how complete the current protein interaction network (PIN) data are in a variety of species, and also enable a novel means of comparing error rates and biases between experiment types without the need for a gold standard set of interactions.

Susann Stjernqvist, Tobias Rydén and Chris D. Greenman, "Modelling of allelic copy number variations in cancer cells using hidden Markov models"
Cancer tumour cells can contain abnormalities in the form of copy number alterations, i.e. segments with losses or gains of one or several copies of DNA. The positions and forms of these alterations are essential both for detecting and for improving knowledge of various sorts of cancer, and methods that localize them are therefore of great importance. Both changes in the total copy number and changes in the number of copies of each of the two alleles are of interest. SNP array data consist of measured allelic intensities at hundreds of thousands of positions across the genome. We present a method to estimate the number of copies of each allele using hidden Markov models. The method is especially suited to cancer data, since two well-known features of such data are included in the model. One is aneuploidy, i.e. the mean number of DNA copies in the genome diverges from two. The other is contamination with normal DNA, due to a mixture of cancer tissue and surrounding normal tissue in the sample.

Since the SNPs are unevenly spread over the genome, we assume a continuous-index Markov chain, where each state corresponds to one class of genotypes, for example {AA, AB, BB} or {AAA, AAB, ABB, BBB}. The conditional density of the two allelic intensities, given the state of the Markov chain, is assumed to follow a mixture of bivariate Normal distributions. The parameters of the Normal densities are estimated from normal samples in Greenman et al. The EM algorithm is used to estimate the ploidy, the transition rates and the fraction of normal contamination, and the Markov chain is reconstructed using the Viterbi algorithm. We evaluate the performance of our method by applying it to both simulated and clinical data. For the simulated data we can compute the proportion of SNPs assigned to the correct state: above 99% of the estimates are correct when the fraction of normal tissue is 30% or 50%. With 70% normal contamination, 97% of the SNPs are correctly estimated, which is higher than for other methods in this field.
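The Viterbi reconstruction step can be sketched on a toy discrete HMM; the two states and the transition and emission tables below are invented stand-ins for the genotype classes and bivariate Normal mixtures of the actual model:

```python
import math

def viterbi(obs, states, pi, trans, emit):
    """Most probable hidden-state path for a discrete HMM, computed in
    the log domain. A small discrete emission table stands in for the
    mixture-of-bivariate-Normal intensity densities of the abstract."""
    lg = math.log
    V = [{s: lg(pi[s]) + lg(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + lg(trans[p][s]))
            col[s] = V[-1][prev] + lg(trans[prev][s]) + lg(emit[s][o])
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):        # trace the stored back-pointers
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy two-state example: "gain" segments emit high intensity more often.
states = ["normal", "gain"]
pi = {"normal": 0.5, "gain": 0.5}
trans = {"normal": {"normal": 0.8, "gain": 0.2},
         "gain":   {"normal": 0.2, "gain": 0.8}}
emit = {"normal": {"lo": 0.9, "hi": 0.1},
        "gain":   {"lo": 0.1, "hi": 0.9}}
path = viterbi(["lo", "lo", "hi", "hi", "hi", "lo"], states, pi, trans, emit)
```

In the actual model the chain additionally has a continuous index, to account for uneven SNP spacing, and the ploidy, transition rates and contamination fraction are fitted by EM before decoding.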

Taoyang Wu, "Species, Clusters and the ‘Tree of Life’"
A hierarchical structure describing the inter-relationships of species has long been a fundamental concept in systematic biology, from Linnean classification through to the more recent quest for a ‘Tree of Life.’ In this talk we introduce an approach based on discrete mathematics to address a basic question: Could one delineate this hierarchical structure in nature purely by reference to the ‘genealogy’ of present-day individuals, which describes how they are related to one another by ancestry through a continuous line of descent? We describe several mathematically precise ways by which one can naturally define collections of subsets of present-day individuals so that these subsets are nested (and so form a tree), based purely on the directed graph that describes the ancestry of these individuals. We also explore the relationship between these and related clustering constructions.

Lisa Hopcroft, Keith Harris, Martin McBride and Mark Girolami, "Predictive response-relevant clustering provides insights into disease processes"
To overcome the problem of high dimensionality, whilst defining potentially biologically relevant structure in the data, microarray probes are commonly clustered according to gene expression. It is then possible to perform subsequent prediction using some 'average' of these 'co-expression' clusters. Here, we present a single-step method in which the clustering and prediction steps are combined to generate 'meta-covariate' representations of co-expressed, response-relevant gene clusters. We first illustrate the method by analysing a well-known leukaemia dataset, before focusing closely on the analysis of a renal gene expression dataset in a rat model of salt-sensitive hypertension. We explore the biological insights provided by our analysis of these data. In particular, we identify a highly influential cluster of thirteen genes, including three transcription factors (Arntl, Bhlhe41 and Npas2), that is implicated as protective against hypertension in response to increased dietary sodium. Furthermore, functional and canonical pathway analysis of this cluster using Ingenuity Pathway Analysis implicated transcriptional activation and circadian rhythm signaling, respectively.

Rene te Boekhorst, Irina Abnizova, Imrana Sabir, Sandeep Brar and Sylvia Beka, "Identification of Sources of Error Affecting Base Calling in Next Generation Illumina/Solexa Sequencing"
The Genome Analyzer (Illumina/Solexa) is a pioneering high-throughput sequencing platform able to produce millions of short (up to about 100 bases) “reads” of sequenced DNA fragments. In Illumina/Solexa sequencing, single-stranded DNA fragments are attached in “lanes” (subdivided into “tiles”) on glass plates termed flow cells. The fragments are amplified into clusters containing about 1000 clones. The clusters are “sequenced by synthesis”: fluorescently labelled nucleotides are attached position by position to the complementary base in the template DNA strands in a series of chemistry steps, or cycles (each cycle corresponds to a position in the read DNA fragment). Following laser excitation, the fluorescence of the clusters is captured in approximately 100 images (“tiles”) per lane at each cycle. This is done four times, using a different wavelength for each of the four nucleotides. Ideally, at each cycle a cluster displays a single fluorescence signal of maximal intensity, leading to the unambiguous identification of a nucleotide (a “base call”) at the corresponding position. In practice, however, base-calling accuracy is degraded by ‘signal noise’ in the images, artefacts in the chemistry, and the spatial locations of the clusters on the flow cell.

One source of error in the chemistry leading to misreading is phasing. This occurs when the chemistry of a previous cycle is incompletely blocked, inhibiting the incorporation of the next base. As a result, all bases of subsequent cycles are shifted by one position, and the resulting low fluorescent intensities of the blocked base manifest themselves as apparent “leaks” in the signals. Incomplete removal of the fluorescent labels also interferes with signal interpretation, as it may produce more than one signal in the subsequent cycle. Consequently, more than one signal is usually emitted during a cycle, and the competing signals are sometimes of almost equal intensity.
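A toy numerical model makes the “leak” concrete: if a fraction of the clones in a cluster lag one cycle behind, the observed channel intensities mix the current base with the previous one. The lag fraction and template below are hypothetical:

```python
def phased_signal(template, cycle, phased_fraction=0.1):
    """Toy phasing model: phased_fraction of a cluster's clones lag one
    cycle behind, so their fluorescence reports the previous base rather
    than the current one, producing an apparent leak in the signals."""
    signal = {base: 0.0 for base in "ACGT"}
    signal[template[cycle]] += 1.0 - phased_fraction  # in-phase clones
    lagged = template[cycle - 1] if cycle > 0 else template[cycle]
    signal[lagged] += phased_fraction                 # lagging clones
    return signal

sig = phased_signal("ACGT", cycle=1)
# Mostly C (the true base at this cycle), with a leak from A (the previous base).
```

Even this crude model shows why a single clean maximal signal is the exception rather than the rule once a cluster starts to dephase.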

To assess the dominance of the strongest signal, the four intensities Ii (i = 1 … 4) are combined in the metric “Purity”: the highest of the four intensities divided by their sum. The values of this index range between 0.25 (all signals of equal intensity, no identification possible) and 1.00 (only one of the four signals is emitted, unambiguous base-calling). Note that “purity” is identical to the Berger-Parker dominance index, an often-used measure of diversity in ecological studies. A second source of misreading is the decay in fluorescent signal intensity with cycle number, probably caused by loss of reactants during sequencing. The way fluorescence is measured may also influence the results. In Illumina’s Genome Analyzer, two laser channels (green for G and T, red for A and C) are employed to distinguish the signals. Overlap in the dye emission frequencies causes A and C not to be separated well, and the same is true for G and T, despite methods implemented to correct for this “cross-talk”.
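The purity metric itself is a one-liner; the intensity vectors below are made-up examples spanning the two extremes of its range:

```python
def purity(intensities):
    """Purity of a base call: the largest of the four channel
    intensities divided by their sum (the Berger-Parker dominance
    index). 0.25 = all channels equal, 1.0 = one unambiguous signal."""
    return max(intensities) / sum(intensities)

purity([100.0, 0.0, 0.0, 0.0])    # -> 1.0 : unambiguous call
purity([25.0, 25.0, 25.0, 25.0])  # -> 0.25: no identification possible
```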

All the noise factors mentioned above may affect signal intensities and hence the ambiguity of base-calling as measured by purity. To assess the importance of these sources of error we carried out three quality confirmation investigations (the data were from the genome of the phage FX174, obtained at the Wellcome Trust Sanger Institute and sequenced on Illumina’s Genome Analyzer GA2, release 1.4, run 3259). For the first study (I. Sabir), a program was written to read Illumina data files, preprocess the data and compute a four-way ANOVA with replication to test for differences in purity between tiles within lanes and between lanes, while also accounting for the type of nucleotide and the cycle number. A second ANOVA (S. Brar) was applied to unravel the effects of the neighbouring bases on the purity of the middle nucleotide of a trimer, given its identity and cycle number. The analysis was performed on triplets of consecutive base calls, randomly sampled from a large number of reads.

A check on the validity of the method is to verify whether or not a called base is recovered after aligning the sequenced fragment to a reference genome. The method is reliable if purity “predicts” the proportion of correctly back-aligned nucleotides well. In that case the procedure of back-alignment can be side-stepped, as a table can be constructed from which error rates can be looked up from purity values alone. However, there are other well-defined diversity measures besides purity that may give a finer distinction between signals and therefore correlate better with the proportion of true base calls. The topic of the third investigation (S. Beka) was therefore to find out which of a series of indices (purity, chastity, Shannon entropy, and the indices of Simpson, Hill and Margalef) showed the strongest logistic regression with the probability of a correct base call.
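For comparison with purity, one alternative candidate predictor, the Shannon entropy of the normalized channel intensities, can be computed as follows; the two intensity vectors are invented examples, and a dominant signal should score high on purity but low on entropy:

```python
import math

def purity(intensities):
    """Highest of the four channel intensities divided by their sum."""
    return max(intensities) / sum(intensities)

def shannon_entropy(intensities):
    """Shannon entropy (in bits) of the normalized four-channel signal:
    0 for a single clean signal, 2 bits for four equal signals."""
    total = sum(intensities)
    return -sum((i / total) * math.log2(i / total)
                for i in intensities if i > 0)

clean = [90.0, 4.0, 3.0, 3.0]     # dominant signal: confident base call
noisy = [30.0, 28.0, 22.0, 20.0]  # competing signals: unreliable call
```

Fitting a logistic regression of the back-alignment outcome (correct/incorrect) on each candidate index, as in the third investigation, then identifies which index discriminates best.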

Results of the ANOVA showed an effect of lanes, cycles and bases on purity values. These factors explained between 5% and 9% of the total variance of purity. However, the most significant effect was the interaction between all four main factors, contributing 11% to the overall variance. This poses a problem, as it demonstrates that all these factors in combination significantly affect the accuracy with which DNA is sequenced. A plot of mean purity against cycle number demonstrated a decreasing trend and a significant drop in purity for later cycles. Bases A and T have higher mean purity values than G and C, indicating an effect of the identity of the base called. It was also found that when T is the middle base of a trimer and is preceded by G, average purity is lower than for all other combinations. Finally, purity and Margalef’s index correlated best with the proportion of back-aligned base pairs and therefore appear to be the signal diversity measures best suited for calibration purposes.