    Mathematics Institute

    University of Warwick

    EPSRC Symposium Workshop on Information extraction from complex data sets (INF)

    14-17 September 2009

    Organisers: D Wild, S Mukherjee, Z Ghahramani (Cambridge)

    ABSTRACTS


    Ziv Bar-Joseph

    Cross species analysis of functional genomics data

    Recent advances in genomics are enabling researchers to accumulate large
    datasets in multiple species. These include sequence data as well as
    functional information such as the level of gene expression and various
    types of interactions. However, while the sequence and function of genes
    are highly conserved between close species, expression and interaction
    data appear to be much less conserved. For example, even though there
    is 90% sequence similarity between human and mouse, both their expression
    and interaction similarities are lower than 20%.

    In this talk I will present methods that utilize graphical models and
    constrained clustering for integrating sequence and functional data from
    multiple species. We used these methods to study two biological systems:
    cell cycle and immune response. As we show, using these methods we can
    improve on the sets of genes recovered for each species independently.
    More importantly, these methods allow us to recover the core set of
    genes for specific biological systems indicating that data integration
    across species can overcome problems associated with the analysis of
    genomics data.

    Mark Girolami

    Riemann Manifold MCMC for very high dimensional data

    Information extraction from complex data sets such as those
    produced from functional genomic and proteomic technologies is typically
    model-based. Statistical models of high-dimensional observations or
    multiple sources themselves have complex and often high-dimensional
    parameterisations, bringing with them challenges in performing inference.
    In such a setting performing Markov Chain Monte Carlo based inference
    efficiently is an ongoing theme of methodological research. This talk
    presents a Riemannian Manifold Hamiltonian Monte Carlo sampler to
    resolve the shortcomings of existing Monte Carlo algorithms when
    sampling from target densities that may be high dimensional and exhibit
    strong correlations. The method provides a fully automated adaptation
    mechanism that circumvents the costly pilot runs required to tune
    proposal densities for Metropolis-Hastings or indeed Hybrid Monte Carlo
    and Metropolis Adjusted Langevin Algorithms. This allows for highly
    efficient sampling even in very high dimensions where different scalings
    may be required for the transient and stationary phases of the Markov
    chain. The proposed method exploits the Riemannian structure of the
    parameter space of statistical models and thus automatically adapts to
    the local manifold structure at each step based on the metric tensor. A
    semi-explicit second order symplectic integrator for non-separable
    Hamiltonians is derived for simulating paths across this manifold which
    provides highly efficient convergence and exploration of the target
    density. The performance of the Riemannian Manifold Hamiltonian Monte
    Carlo method is assessed by performing posterior inference on logistic
    regression models, log-Gaussian Cox point processes, stochastic
    volatility models, and Bayesian estimation of parameter posteriors of
    dynamical systems described by nonlinear differential equations.
    Substantial improvements in the time-normalised Effective Sample Size
    are reported when compared to alternative sampling approaches.
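
    For orientation, the Hamiltonian usually associated with this kind of
    sampler can be written as below. This form comes from the general
    literature on Riemannian manifold HMC, not from the abstract itself.
    The symbols are notation assumed here: \theta (model parameters),
    p (auxiliary momentum), \mathcal{L}(\theta) (log target density),
    G(\theta) (metric tensor) and D (dimension of \theta).

```latex
H(\theta, p) = -\mathcal{L}(\theta)
  + \tfrac{1}{2}\log\!\left\{(2\pi)^{D}\,\lvert G(\theta)\rvert\right\}
  + \tfrac{1}{2}\, p^{\top} G(\theta)^{-1} p
```

    Because the kinetic term depends on \theta through G(\theta), the
    Hamiltonian is non-separable, which is why a standard leapfrog scheme
    does not apply directly and an integrator for non-separable
    Hamiltonians is needed.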


    Christopher Holmes

    Some issues in robust Bayesian inference for functional genomics

    Experiments in functional genomics typically produce highly structured
    data sets, with thousands of measurements on tens to hundreds of
    individuals. The nature of the assays and the sheer number of
    measurements taken makes analysis of such data prone to influence by
    outliers that arise from bad samples or bad measurements. This
    influence is especially problematic within discovery driven studies
    which often apply simple statistical models to multiple subsets of the
    data with the resulting findings ranked in some fashion, such as when
    using microarrays to test for differential gene expression under two
    treatments. In these scenarios, semi-automated robust Bayesian
    inference provides an attractive inferential framework. We will
    discuss our experience in the analysis of complex genomic data sets
    using robust Bayesian methods, both parametric (robust Bayesian
    ANOVA) and non-parametric (Bayesian hidden Markov models with mixture
    of Dirichlet process state sampling distributions, i.e. likelihoods),
    and show that these lead to substantial gains in inference and in the
    resulting findings.

    Dirk Husmeier

    Joint work with Marco Grzegorczyk

    Learning gene regulatory networks from gene expression time series with
    non-linear/non-stationary dynamic Bayesian networks

    Feedback loops and recurrent structures are essential to the
    regulation and stable control of complex biological systems. The
    application of dynamic as opposed to static Bayesian networks is
    promising in that, in principle, these feedback loops can be learned
    from gene expression time series. However, we will show that the
    widely applied BGe model is susceptible to learning spurious feedback
    loops, which are a consequence of non-linear regulation and
    autocorrelation in the data. We propose a non-linear/non-stationary
    generalisation of the BGe model, based on a mixture model and
    change-point process. We demonstrate that this approach has the
    potential to successfully avoid spurious feedback loops that BGe is
    susceptible to, which leads to a more accurate network reconstruction.


    Neil Lawrence

    Efficient Multiple Output Convolution Processes for Multiple Task Learning

    Learning multiple correlated outputs with a Gaussian process
    presents problems both in specifying the covariance (kernel) function
    and efficiently inverting it. We consider the convolution process route
    to generating covariance functions over structured outputs. We will show
    how sparse approximations based on conditional independence assumptions
    and variational methods can be used to make inference and learning
    efficient. Given time we will give examples from multi-task learning,
    computational biology, financial time series and human motion modeling.

    Jure Leskovec

    Meme-tracking and the Dynamics of the News Cycle

    Tracking new topics, ideas, and "memes" across the Web has been an issue
    of considerable interest. Recent work has developed methods for tracking
    topic shifts over long time scales, as well as abrupt spikes in the
    appearance of particular named entities. However, these approaches are
    less well suited to the identification of content that spreads widely and
    then fades over time scales on the order of days - the time scale at
    which we perceive news and events.

    We develop a framework for tracking short, distinctive phrases that travel
    relatively intact through on-line text; developing scalable algorithms for
    clustering textual variants of such phrases, we identify a broad class of
    memes that exhibit wide spread and rich variation on a daily basis. As our
    principal domain of study, we show how such a meme-tracking approach can
    provide a coherent representation of the news cycle: the daily
    rhythms in the news media that have long been the subject of qualitative
    interpretation but have never been captured accurately enough to permit
    actual quantitative analysis. We tracked 1.6 million mainstream media
    sites and blogs over a period of three months, for a total of 90 million
    articles, and we find a set of novel and persistent temporal patterns in
    the news cycle. In particular, we observe a typical lag of 2.5 hours
    between the peaks of attention to a phrase in the news media and in blogs
    respectively, with divergent behavior around the overall peak and a
    "heartbeat"-like pattern in the handoff between news and blogs. We also
    develop and analyze a mathematical model for the kinds of temporal
    variation that the system exhibits.


    Guido Sanguinetti

    Approximate inference for Markov Jump Processes with applications
    in systems and developmental biology


    Markov Jump Processes represent a convenient mathematical
    model of many chemical reactions involving low numbers of molecular
    species. Inference in these models is hampered by the necessity to solve
    very large systems of ODEs giving the forward-backward relations. In
    this talk, I will present some work (in collaboration with Manfred
    Opper) on using a variational mean field approach to reduce the
    inference problem to a more tractable size. I will give some examples of
    applications, including an application to reaction-diffusion systems in
    morphogenesis of Drosophila embryos (joint work with M. Dewar, M.
    Opper, V. Kadirkamanathan).
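
    To make the model class concrete, here is a minimal sketch of forward
    simulation of a Markov jump process using the Gillespie algorithm. It
    shows only the generative model, not the variational inference
    discussed in the talk. The birth-death reaction system, the rate
    constants and the function names are all assumptions chosen for
    illustration.

```python
import random


def gillespie_birth_death(k_birth, k_death, x0, t_end, seed=0):
    """Simulate a birth-death jump process (hypothetical toy system).

    Reactions: 0 -> X at rate k_birth, and X -> 0 at rate k_death * x.
    Returns the jump times and the state entered at each time.
    """
    rng = random.Random(seed)
    t, x = 0.0, x0
    times, states = [t], [x]
    while True:
        rate_birth = k_birth
        rate_death = k_death * x
        total = rate_birth + rate_death
        if total == 0.0:
            break  # no reaction can fire; the state stays fixed
        # The waiting time to the next reaction is exponential in the total rate.
        t += rng.expovariate(total)
        if t >= t_end:
            break
        # Pick which reaction fires, in proportion to its rate.
        if rng.random() * total < rate_birth:
            x += 1
        else:
            x -= 1
        times.append(t)
        states.append(x)
    return times, states


def time_average(times, states, t_end):
    """Time-weighted mean of the piecewise-constant trajectory on [0, t_end]."""
    total = 0.0
    for i, x in enumerate(states):
        t_next = times[i + 1] if i + 1 < len(times) else t_end
        total += x * (t_next - times[i])
    return total / t_end


if __name__ == "__main__":
    times, states = gillespie_birth_death(10.0, 1.0, 0, 200.0)
    print("time-averaged copy number:", time_average(times, states, 200.0))
```

    For this toy system the stationary distribution is Poisson with mean
    k_birth / k_death, so the time average should settle near that ratio.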

    Eric Schadt

    Networks as the Sensors and Drivers of Disease

    Common human diseases and drug response are complex traits
    that involve entire networks of changes at the molecular level driven
    by genetic and environmental perturbations. Efforts to elucidate
    disease and drug response traits have focused on single dimensions of
    the system. Studies focused on identifying changes in DNA that
    correlate with changes in disease or drug response traits, changes in
    gene expression that correlate with disease or drug response traits,
    or changes in other molecular traits (e.g., metabolite, methylation
    status, protein phosphorylation status, and so on) that correlate with
    disease or drug response are fairly routine and have met with great
    success in many cases. However, to further our understanding of the
    complex network of molecular and cellular changes that impact disease
    risk, disease progression, severity, and drug response, these multiple
    dimensions must be considered together. Here I present an approach
    for integrating a diversity of molecular and clinical trait data to
    uncover models that predict complex system behavior. By integrating
    diverse types of data on a large scale I demonstrate that some forms
    of common human diseases are most likely the result of perturbations
    to specific gene networks that in turn cause changes in the states of
    other gene networks both within and between tissues that drive
    biological processes associated with disease. These models elucidate
    not only primary drivers of disease and drug response, but they
    provide a context within which to interpret biological function,
    beyond what could be achieved by looking at one dimension alone. That
    some forms of common human diseases are the result of complex
    interactions among networks has significant implications for drug
    discovery: designing drugs or drug combinations to impact entire
    network states rather than designing drugs that target specific
    disease associated genes.


    Ricardo Silva

    Joint work with Katherine Heller, Zoubin Ghahramani and Edoardo Airoldi

    Ranking Relations Using Analogies

    We develop an approach to relational learning which, given a
    set of pairs of objects S = {A_1:B_1, A_2:B_2, ..., A_N:B_N}, measures
    how well other pairs A:B fit in with the set S. Our work addresses
    the question: is the relation between objects A and B analogous to
    those relations found in S? Such questions are particularly relevant
    in information retrieval, where an investigator might want to search
    for analogous pairs of objects that match the query set of interest.
    Analogical reasoning depends fundamentally on the ability to learn
    and generalize about relations between objects. There are many ways
    in which objects can be related, making the task very challenging.
    We recast this classical problem as a problem of Bayesian analysis
    of relational data and function spaces, and illustrate its potential in
    the domain of text analysis. A detailed application on searching for
    protein-protein interactions is discussed.


    John Skilling

    The Nested Sampling Algorithm


    The "Nested Sampling" algorithm is designed for probabilistic inference,
    where a function of arbitrary complexity is to be both integrated (for
    model selection) and sampled (for inferred parameters). It is an
    iterative Monte Carlo scheme based on sampling within a progressive
    constraint on function value. This constraint compresses the remaining
    available volume smoothly and systematically, so that exploration is (at
    least in principle) independent of quirks of function behaviour. This
    property is particularly valuable in multi-modal problems, where peaks
    of different heights and volumes need to be correctly balanced.

    Early applications are in cosmological model selection and the modelling
    of nano-materials.
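
    The iteration described above can be sketched as follows. This is a
    minimal illustration under strong assumptions, not the speaker's
    implementation. The target (a standard 2-D Gaussian likelihood with a
    uniform prior on a square), the function names and the settings are
    hypothetical. The replacement point is drawn by brute-force rejection
    from the prior, which is only workable for toy problems, and the
    enclosed prior volume after i steps is approximated by exp(-i / n_live).

```python
import math
import random


def log_likelihood(x, y):
    # Toy likelihood: standard 2-D Gaussian density (assumed for illustration).
    return -0.5 * (x * x + y * y) - math.log(2.0 * math.pi)


def sample_prior(rng, half_width=5.0):
    # Uniform prior on the square [-half_width, half_width]^2.
    return (rng.uniform(-half_width, half_width),
            rng.uniform(-half_width, half_width))


def nested_sampling(n_live=200, n_iter=1200, seed=0):
    """Return an estimate of log Z, where Z is the prior-weighted integral of L."""
    rng = random.Random(seed)
    live = [sample_prior(rng) for _ in range(n_live)]
    live_logl = [log_likelihood(*p) for p in live]
    z = 0.0
    x_prev = 1.0  # fraction of prior volume still enclosed by the constraint
    for i in range(1, n_iter + 1):
        # The worst live point defines the new likelihood constraint.
        worst = min(range(n_live), key=live_logl.__getitem__)
        logl_star = live_logl[worst]
        x_curr = math.exp(-i / n_live)  # expected volume shrinkage
        z += math.exp(logl_star) * (x_prev - x_curr)
        x_prev = x_curr
        # Replace the worst point with a prior draw satisfying L > L*.
        while True:
            cand = sample_prior(rng)
            cand_logl = log_likelihood(*cand)
            if cand_logl > logl_star:
                break
        live[worst] = cand
        live_logl[worst] = cand_logl
    # Add the contribution of the volume still held by the live points.
    z += x_prev * sum(math.exp(l) for l in live_logl) / n_live
    return math.log(z)


if __name__ == "__main__":
    print("estimated log Z:", nested_sampling())
```

    For this toy target nearly all of the Gaussian mass lies inside the
    square of area 100, so the true value is close to log(1/100). The
    discarded points, weighted by their contributions to Z, could also be
    kept as posterior samples; that step is omitted here.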

    Michael Stumpf

    Model selection from single cell data

    For the vast majority of biological systems we lack reliable models, let
    alone model parameters. Using well-defined simulation models and real
    biological data collected for a range of biological signalling systems,
    we explore how much can be learned about biological systems from
    temporally resolved transcriptomic or proteomic data. We pay particular
    attention to qualitative properties of the underlying dynamical system
    and their impact on our ability to infer the system's dynamics. We then
    illustrate how approximate Bayesian computation approaches can be
    employed to gain insights into the inferability of model parameters, and
    for model selection in the context of dynamical systems of signalling
    networks in systems biology. We will pay particular attention to the
    analysis of single-cell data and discuss the relative advantages of
    different experimental setups to study cellular variability.
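
    As a rough illustration of approximate Bayesian computation, the
    sketch below applies plain ABC rejection sampling to a one-parameter
    decay model. Everything in it is an assumption made for the example
    and is not taken from the talk: the model x(t) = exp(-k t), the
    uniform prior on k, the Euclidean distance, the tolerance eps and
    all function names.

```python
import math
import random


def simulate(k, times, noise_sd, rng):
    # Hypothetical model: exponential decay observed with Gaussian noise.
    return [math.exp(-k * t) + rng.gauss(0.0, noise_sd) for t in times]


def distance(a, b):
    # Euclidean distance between two time courses.
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def abc_rejection(observed, times, noise_sd, eps, n_draws, rng):
    """Keep prior draws whose simulated data fall within eps of the observations."""
    accepted = []
    for _ in range(n_draws):
        k = rng.uniform(0.0, 2.0)  # draw from the assumed prior U(0, 2)
        if distance(simulate(k, times, noise_sd, rng), observed) < eps:
            accepted.append(k)
    return accepted


def run_demo(seed=1):
    rng = random.Random(seed)
    times = [0.5 * i for i in range(1, 11)]
    true_k, noise_sd = 0.7, 0.05
    # Synthetic "observed" data generated from the same model.
    observed = simulate(true_k, times, noise_sd, rng)
    accepted = abc_rejection(observed, times, noise_sd,
                             eps=0.3, n_draws=5000, rng=rng)
    return true_k, accepted


if __name__ == "__main__":
    true_k, accepted = run_demo()
    print("accepted draws:", len(accepted))
    print("approximate posterior mean of k:", sum(accepted) / len(accepted))
```

    The spread of the accepted values gives a crude picture of how well
    the parameter can be inferred from the data. Running the same scheme
    over several candidate models and comparing acceptance rates is one
    simple route to model selection.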

    Simon Tavaré

    Joint work with Christiana Spyrou, Rory Stark, Andy Lynch

    Some statistical issues in the analysis of Illumina sequencing experiments

    High-throughput sequencing technologies have become popular for the
    study of genome organization, gene expression, methylation and
    protein-DNA interactions. For example, chromatin immunoprecipitation
    followed by sequencing of the resulting samples produces large amounts
    of data that can be used to map transcription factor binding sites,
    histone modifications and origins of replication.

    In this talk I will discuss some of the statistical issues arising from such
    data, focussing primarily on ChIP-seq experiments. I will describe
    some research from the CRI in which ChIP-seq has proved invaluable,
    and illustrate a statistical method for calling enriched
    regions. BayesPeak uses a fully Bayesian hidden Markov model to detect
    enriched locations in the genome. The structure accommodates the
    natural features of Illumina sequencing data and allows for
    overdispersion in the abundance of reads in different
    regions. Moreover, a control sample can be incorporated in the
    analysis to account for experimental and sequence biases. Markov chain
    Monte Carlo algorithms are applied to estimate the posterior
    distributions of the model parameters, and posterior probabilities are
    used to identify the sites of interest. I will give some comparisons
    with existing approaches, and describe related applications, such as
    mapping origins of replication using BrdU-IP-seq, for which novel
    statistical problems arise.

    John Winn

    Modelling complex disease phenotype data with Infer.NET

    When trying to understand the genetic basis of disease, a common
    approach is to treat presence or absence of a disease as a binary
    target. Because many diseases involve multiple, complex systems,
    disease symptoms may be due to a failure in a subset of a large
    number of relevant cellular mechanisms across multiple systems. For
    example, asthma symptoms may arise from problems with the immune
    system, bronchial hyper-sensitivity or difficulties during lung
    development - or some combination of these with varying severity.
    Hence, before we can understand the genetic basis of a disease, it is
    important to identify and decompose the system-level basis of the
    disease. Genetic associations to these underlying system-level
    factors can then be found, instead of to the disease label, making it
    possible to detect associations that were previously lost.

    Our approach to understanding the system-level basis of disease is to
    construct a graphical model of rich disease phenotype data. This
    approach allows us to combine physiological, clinical, environmental
    and sociological variables relevant to the disease whilst also taking
    into account expert clinical knowledge. To construct these rich
    models, we use the Infer.NET graphical modelling and inference tool
    developed at Microsoft Research Cambridge. Infer.NET allows very
    rapid development, testing and refinement of the model, whilst also
    scaling to very large datasets. I illustrate the talk with a
    detailed example of how Infer.NET was used to model asthma phenotype
    data as part of a project undertaken with the University of
    Manchester.

    Eric Xing

    Time (and Space)-Varying Networks: Reverse engineering rewiring genetic interactions

    A plausible representation of the relational information among entities
    in dynamic systems such as a living cell is a stochastic network which
    is topologically rewiring and semantically evolving over time (or
    space). While there is a rich literature in modeling static or
    temporally invariant networks, until recently, little has been done
    toward modeling the dynamic processes underlying rewiring networks, and
    on recovering such networks when they are not observable. In this talk,
    I will present a new formalism for modeling network evolution over time
    based on temporal exponential random graphs, and several new algorithms
    based on temporal extensions of the sparse graphical logistic
    regression, for reverse-engineering the latent time/space varying
    networks. These algorithms can be cast as standard convex-optimization
    problems and solved efficiently using generic solvers scalable to large
    networks. I will show some promising results on recovering the latent
    sequence of evolving gene networks over more than 4000 genes during the
    life cycle of Drosophila melanogaster from a microarray time course, at a
    time resolution limited only by sample frequency (i.e., the approach works
    even when only a single snapshot of node values from each time-specific
    network is available). I will also sketch some theoretical results on asymptotic
    sparsistency of the proposed methods, which differ significantly from
    traditional sparsistency analysis of static structure estimation based
    on iid samples because of the temporal relatedness of samples.

    Contact:
    Mathematics Research Centre
    Zeeman Building
    University of Warwick
    Coventry CV4 7AL - UK
    E-mail:
    mrc@maths.warwick.ac.uk

    Page contact: David Wild
    Last revised: Mon 7 Sep 2009