Prof. David L. Wild
Department of Statistics and Systems Biology Centre
University of Warwick
My research interests lie in statistical bioinformatics, in particular the application of Bayesian statistical machine learning techniques to problems in systems biology, functional genomics and proteomics.
In the forthcoming era of genomic medicine and agriculture, genome sequences and other forms of high-throughput data such as gene expression, alternative splicing, DNA methylation, histone acetylation, and protein abundances will be routinely measured for large numbers of people and crop species. New approaches to synthetic protein design offer the long-term potential for novel biotechnological and therapeutic applications. Statistical and computational methodology will be essential to realizing the promise of these technological developments.
My research is funded by the U.K. Biotechnology and Biological Sciences Research Council (SABR and SYSMO initiatives), the U.K. Engineering and Physical Sciences Research Council, the European Union, the Leverhulme Trust, and the U.S. National Science Foundation and National Institutes of Health.
Modelling gene regulatory networks
Over 50 years ago the developmental biologist C.H. Waddington laid the conceptual foundations of modern systems biology in his book "The Strategy of the Genes", in which he envisaged an epigenetic landscape as the potential surface of a multidimensional state-space of cellular metabolism, underpinned by a network of interacting genes. With the advent of high-throughput post-genomic technologies, we are now beginning to investigate the topology of these networks. Increasingly, studies are being carried out that measure the expression profiles of large sets of genes, proteins or metabolites over a time course rather than a single static snapshot. This research aims to combine functional genomics and computation into a novel integrative systems approach, based on probabilistic modeling techniques, to identify key components of the regulatory networks involved in cell physiology. We aim to learn networks integrating transcriptional data with the production of proteins and metabolites with well-defined biological activity.
My group was the first to propose applying state-space models (a simple class of probabilistic graphical model used for time series analysis) to the problem of inferring gene regulatory networks from high-throughput time series, such as microarray, proteomics and metabolomics data (Rangel et al., 2001). In the context of genetic regulatory networks, the hidden states of a state-space model can represent unmeasured factors, such as genes that have not been included in the microarray, levels of regulatory proteins, and the effects of mRNA and protein degradation. In collaboration with Lorantis Ltd., we have used state-space models to reverse engineer transcriptional networks from highly replicated gene expression profiling time series data obtained from a well-established biological model of T cell activation. The resulting networks reflect many of the dynamics of an activated T cell and provide a methodology for the development of rational and experimentally testable hypotheses. In particular, they reveal the integrated activation of cytokines, proliferation, and adhesion following activation and place JunB and JunD at the center of the mechanisms that control apoptosis and proliferation (Rangel et al., 2004; Beal et al., 2005). More recently, in collaboration with scientists at Warwick School of Life Sciences, we have been applying these methods to elucidate key transcriptional networks and underlying regulatory mechanisms controlling plant responses to pathogens, high light and drought. With the University of Birmingham, we also aim to develop a computational framework to reconstruct transcriptional and metabolic networks representative of the response of E. coli to acid stress.
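As a minimal sketch of the general idea, and not the specific model fitted in the papers cited above, a linear-Gaussian state-space model can be simulated as follows. The hidden state x_t evolves as x_t = A x_{t-1} + w_t and the observed expression vector is y_t = C x_t + v_t. The function name `simulate_lgssm`, the matrices, the dimensions and the noise levels are all illustrative assumptions.

```python
import numpy as np

def simulate_lgssm(A, C, q_std, r_std, x0, n_steps, rng):
    """Simulate x_t = A x_{t-1} + w_t and y_t = C x_t + v_t with Gaussian noise.

    A: (k, k) hidden-state dynamics; C: (p, k) map from hidden factors to
    the p observed genes. q_std and r_std are assumed noise standard deviations.
    """
    k, p = A.shape[0], C.shape[0]
    xs = np.zeros((n_steps, k))
    ys = np.zeros((n_steps, p))
    x = np.asarray(x0, dtype=float)
    for t in range(n_steps):
        # Hidden factors (e.g. unmeasured regulators) evolve linearly.
        x = A @ x + q_std * rng.standard_normal(k)
        xs[t] = x
        # Observed expression levels are a noisy linear readout of the state.
        ys[t] = C @ x + r_std * rng.standard_normal(p)
    return xs, ys

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [-0.1, 0.9]])   # hypothetical 2-factor dynamics
C = rng.standard_normal((5, 2))           # hypothetical loadings for 5 genes
hidden, expression = simulate_lgssm(A, C, 0.05, 0.1, [1.0, 0.0], 20, rng)
```

In network inference the direction is reversed: only the observed series is available, and the dynamics, the loadings and the hidden trajectory have to be estimated from it.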
To date most “top-down” approaches to inferring dynamic networks from high-throughput time series data, including our own previous work as described above, have been based on the assumption that the topology of the network is homogeneous across time. However, most biological time series aim to capture information about processes which vary over time, and temporal changes in the transcription program are often apparent. There is little published work on inferring the structure of evolving biological networks with time-varying topologies. My current and future work focuses on developing flexible nonparametric Bayesian models, which will allow the discovery of the variables such as gene expression levels, protein concentrations, and experimental perturbations that cause structural changes in the network topology over time (Penfold et al., 2012).
Fast Bayesian computational methods for post-genomic data analysis
A key feature that distinguishes the modern approach to Systems Biology is the aim of linking modeling of the interactions of system components with the huge volume and diversity of contemporary cellular and molecular data, such as that coming from high-throughput, genome-wide and imaging technologies. One of the most important challenges facing modern biology, medicine and agriculture is to understand how the genetic variation between individuals (the genotype) translates into the type of variation we can see or measure (the phenotype), and how environment influences this relationship. Although considerable progress has been made in recent years in identifying regulatory genes and modules in various organisms, there is still limited knowledge about downstream gene regulatory networks, and about how variation in these networks results in phenotypic differences, and is, in turn, affected by the environment.
We are currently developing new approaches to understanding the general principles that underlie the genotype-phenotype link through the integrated statistical modeling of post-genomic data from very disparate sources - for example, data from different experimental platforms, such as gene expression data with genomic or clinical indicators, integrated with proteomic or metabolomic measurements. Additional challenges posed by this problem include different sources of measurement noise, uncertain correspondence between measurements, varying patterns of missing data, and vastly different numbers of measurements for different experimental platforms. Our current work focuses on the development of nonparametric Bayesian models for the fusion of such heterogeneous data sources, based on inference in hierarchical graphical models. Although Bayesian methods provide a powerful framework for modeling, representing uncertainty, and combining prior knowledge with data, simple parametric models (for example, linear models) are often inadequate for modeling the complexity of real-world biological processes. In the quest for more flexible modeling tools that maintain predictive reliability, recent research has turned to the limit of Bayesian models with infinitely many latent variables and parameters - these are also known as nonparametric Bayesian models. The term nonparametric is used because the model cannot be represented in terms of its parameters, since there would be an infinite number of them to learn and store, but instead is represented in terms of other quantities derived from the data that encapsulate the effect of this infinite number of parameters.
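One standard example of such a representation is the Chinese restaurant process, the partition distribution induced by a Dirichlet process. The sketch below is a generic textbook illustration and not the specific models used in our work. The function name `sample_crp` and the concentration value `alpha` are assumptions chosen for the example.

```python
import numpy as np

def sample_crp(n_items, alpha, rng):
    """Draw a partition of n_items (n_items >= 1) from a Chinese restaurant process.

    Item i joins an existing cluster with probability proportional to that
    cluster's size, or opens a new cluster with probability proportional to alpha.
    """
    assignments = [0]   # the first item always starts cluster 0
    counts = [1]
    for _ in range(1, n_items):
        weights = np.array(counts + [alpha], dtype=float)
        probs = weights / weights.sum()
        k = int(rng.choice(len(probs), p=probs))
        if k == len(counts):
            counts.append(1)      # open a new cluster
        else:
            counts[k] += 1        # join an existing cluster
        assignments.append(k)
    return assignments, counts

rng = np.random.default_rng(0)
labels, sizes = sample_crp(50, alpha=1.0, rng=rng)
```

The number of clusters is not fixed in advance: it grows with the data, and the model is described through cluster sizes rather than through an explicit, infinite parameter vector.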
This type of modeling has already proven its utility in discovering transcriptional modules, identifying key protein complexes whose genes are co-regulated during the cell cycle, and revealing prognostic cancer subtypes, through the integration of gene expression data with transcription factor binding (ChIP-chip), protein-protein interaction and copy number variation data, respectively (Savage et al., 2010; Kirk et al., 2012; Yuan et al., 2011). In future work I intend to apply these approaches to the scientific goal of understanding the role that molecular phenotypes (such as gene expression levels or chromatin state) play in the overall genotype-phenotype map, and how environment influences this map.
We are also investigating methods for exploiting parallelism and scaling computation to very large data sets, particularly through general-purpose Graphics Processing Unit (GPU) computation, an exciting modern direction in the high-performance computing community. Our initial results indicate that this approach holds the promise of reducing computational time by several orders of magnitude.
Analysing Protein Energetics with Stochastic Computation
Also over 50 years ago, the Nobel laureate Christian Anfinsen and colleagues demonstrated that protein molecules can fold into their three-dimensional ‘native state’ reversibly, leading to the view that these structures represented the global minimum of a rugged funnel-like ‘energy landscape’. Since this seminal work, computer simulations have continued to shed light on the phenomena of protein folding and function. However, protein modeling and structure prediction face two major challenges if progress is to be made in the development of more precise models, which quantitatively describe experimental observations. The first is the difficulty of efficient sampling in the enormous conformational space accessible to protein molecules, whilst the second is the development of the energy function describing molecular interactions for the problem at hand. The microscopic size of protein molecules makes it impossible to measure these interactions directly, and so known protein structures themselves have become the best available experimental evidence. Traditionally, empirical so-called ‘knowledge-based statistical potentials’ have been used to describe such interactions from analysis of a collection of known protein structures. This project aims to address both of these challenges. The overall goal of this research is to advance knowledge of protein energetics and improve on established modeling techniques that utilize these empirical knowledge-based potentials. We are using an approach, based on a novel statistical machine learning methodology known as ‘Contrastive Divergence’, to infer interaction potentials from a subset of known protein structures. 
We also utilize a novel Bayesian computational method for sampling the conformational space of molecular systems, known as ‘Nested Sampling’, which allows us to directly investigate the macroscopic states of the protein folding pathway and evaluate the associated absolute free energies without the need for thermodynamic integration (Burkoff et al., 2012).
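To illustrate the principle on a toy problem (not a protein model), the sketch below uses nested sampling to estimate the evidence Z for a one-dimensional Gaussian likelihood under a uniform prior on [-5, 5]. In a thermodynamic setting Z plays the role of the partition function, from which a free energy follows as F = -kT ln Z. The function names, the rejection-sampling replacement step, and the settings `n_live` and `n_iter` are assumptions made for the example. Practical implementations replace rejection sampling with constrained Monte Carlo moves.

```python
import numpy as np

def log_likelihood(x):
    """Log density of a standard normal, standing in for a Boltzmann factor."""
    return -0.5 * x**2 - 0.5 * np.log(2.0 * np.pi)

def nested_sampling(n_live=100, n_iter=600, lo=-5.0, hi=5.0, seed=0):
    """Estimate log Z = log of the prior-averaged likelihood on [lo, hi]."""
    rng = np.random.default_rng(seed)
    live = rng.uniform(lo, hi, size=n_live)
    live_logl = log_likelihood(live)
    log_z = -np.inf
    log_x_prev = 0.0  # log of enclosed prior mass, starting at log(1)
    for i in range(1, n_iter + 1):
        worst = int(np.argmin(live_logl))
        log_x = -i / n_live  # expected shrinkage of the prior mass
        # Width of the shell between successive likelihood contours.
        log_w = log_x_prev + np.log1p(-np.exp(log_x - log_x_prev))
        log_z = np.logaddexp(log_z, live_logl[worst] + log_w)
        log_x_prev = log_x
        # Replace the worst point by a prior draw above the likelihood threshold.
        threshold = live_logl[worst]
        while True:
            cand = rng.uniform(lo, hi)
            if log_likelihood(cand) > threshold:
                break
        live[worst] = cand
        live_logl[worst] = log_likelihood(cand)
    # The surviving live points share the remaining prior mass.
    log_mean_l = np.logaddexp.reduce(live_logl) - np.log(n_live)
    return np.logaddexp(log_z, log_x_prev + log_mean_l)

log_z_estimate = nested_sampling()
```

For this toy problem the exact value is very close to log(0.1), since almost all of the Gaussian mass lies inside the prior range and the prior density is 1/10. The estimate fluctuates around that value from run to run.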
Whilst systems biology attempts to understand the design principles underpinning biological processes, synthetic biology attempts to apply this understanding to the design and construction of novel biological functions and systems not found in nature. One facet of synthetic biology is protein design, in which our increasing understanding of the principles underlying protein structure and function is being applied in the redesign of existing proteins, or the design of novel proteins. In future work I intend to explore the application of this toolkit of techniques to the ‘inverse’ problem of protein design.
1. Beal, M.J., Falciani, F., Ghahramani, Z., Rangel C. and Wild, D.L. A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics, 21: 349-356, (2005).
2. Burkoff, N.S., Várnai, C., Wells, S.A. and Wild, D.L. Exploring the Energy Landscapes of Protein Folding Simulations with Bayesian Computation. Biophysical Journal, 102, 878-886 (2012).
3. Várnai C., Burkoff N.S., and Wild D.L. Efficient Parameter Estimation of Generalizable Coarse-Grained Protein Force Fields Using Contrastive Divergence: A Maximum Likelihood Approach. J. Chem. Theory Comput. 9(12):5718-5733, (2013).
4. Kirk, P., Griffin, J.E., Savage, R.S., Ghahramani, Z. and Wild, D.L. Bayesian correlated clustering to integrate multiple datasets. Bioinformatics, 28(24): 3290-3297, (2012). doi: 10.1093/bioinformatics/bts595
5. Penfold, C.A., Buchanan-Wollaston, V., Denby, K.J. and Wild, D.L. Nonparametric Bayesian Inference for Perturbed and Orthologous Gene Regulatory Networks. Bioinformatics, 28, i233-i241 (2012).
6. Rangel, C., Wild, D.L., Falciani, F. and Ghahramani, Z. (2001) Modelling biological responses using gene expression profiling and linear dynamical systems. In: The 2nd International Conference on Systems Biology: The Future of Biology in the 21st Century (ICSB), 4-11-2001 to 7-11-2001, Omnipress, California, US pp. 248-256.
7. Rangel, C., Angus, J., Ghahramani, Z., Lioumi, M., Sotheran, E., Gaiba, A., Wild, D.L. and Falciani, F. Modeling T-cell activation using gene expression profiling and state space models. Bioinformatics, 20(9):1361-1372, (2004).
8. Savage, R.S., Ghahramani, Z., Griffin, J.E., de la Cruz, B.J. and Wild, D.L. Discovering Transcriptional Modules by Bayesian Data Integration, Bioinformatics, 26, i158-i167, (2010).
9. Yuan, Y., Savage, R. S., & Markowetz, F. (2011). Patient-specific data fusion defines prognostic cancer subtypes. PLoS computational biology, 7(10), e1002227.
Várnai C., Burkoff N.S., and Wild D.L. Efficient Parameter Estimation of Generalizable Coarse-Grained Protein Force Fields Using Contrastive Divergence: A Maximum Likelihood Approach. J. Chem. Theory Comput. 9(12):5718-5733, (2013).
Burkoff, N.S., Várnai, C., and Wild, D.L. Predicting protein β-sheet contacts using a maximum entropy-based correlated mutation measure. Bioinformatics 29(5): 580-587, (2013).
Hickman R., Penfold C.A., Breeze E., Bowden, L., Moore J.D., Zhang P., Jackson A., Cooke E., Bewicke-Copley F., Mead A., Beynon J., Wild D.L., Denby K.J., Ott S., and Buchanan-Wollaston V. A local regulatory network around three NAC transcription factors in stress responses and senescence in Arabidopsis leaves. The Plant Journal, 75(1), 26-39, (2013).
Darkins, R., Cooke, E.J., Ghahramani, Z., Kirk, P.D.W., Wild, D.L., Savage, R.S. Accelerating Bayesian hierarchical clustering of time series data with a randomised algorithm, PLoS One, 8(4): e59795, (2013).
Penfold, C.A., Buchanan-Wollaston, V., Denby, K.J. and Wild, D.L. Nonparametric Bayesian Inference for Perturbed and Orthologous Gene Regulatory Networks. Bioinformatics, 28, i233-i241 (2012).
Kirk, P., Griffin, J.E., Savage, R.S., Ghahramani, Z. and Wild, D.L. Bayesian correlated clustering to integrate multiple datasets. Bioinformatics 28 (24): 3290-3297, (2012).
Windram, O., Madhou, P., McHattie, S., Hill, C., Hickman, R., Cooke, E., Jenkins, D.J., Penfold, C.A., Baxter, L., Breeze, E., Kiddle, S.J., Rhodes, J., Atwell, S., Kliebenstein, D.J., Kim, Y-S., Stegle, O., Borgwardt, K., Zhang, C., Tabrett, A., Legaie, R., Moore, J., Finkenstadt, B., Wild, D.L., Mead, A., Rand, D., Beynon, J., Ott, S., Buchanan-Wollaston, V., Denby, K.J. Arabidopsis Defense against Botrytis cinerea: Chronology and Regulation Deciphered by High-Resolution Temporal Transcriptomic Analysis. Plant Cell, 24, 3530-3557, (2012).
Burkoff, N.S., Várnai, C., Wells, S.A. and Wild, D.L. Exploring the Energy Landscapes of Protein Folding Simulations with Bayesian Computation. Biophysical Journal (2012), 102, 878-886.
Podtelezhnikov, A.A. and Wild, D.L. Inferring knowledge based potentials using contrastive divergence. In Hamelryck T., Mardia K.V. and Ferkinghoff-Borg J. (Eds.), Bayesian Methods in Structural Bioinformatics, pp 135-155. Springer (2012).
Penfold C.A. and Wild D.L. How to infer gene networks from expression profiles, revisited. Journal of the Royal Society Interface Focus (2011), 1, 857-870 (Invited Review).
Cooke E.J., Savage R.S., Kirk P.D.W., Darkins R., Wild, D.L. Bayesian hierarchical clustering for microarray time series data with replicates and outlier measurements. BMC Bioinformatics (2011), 12:399 (Highly Accessed Paper).
Breeze, E., Harrison, E., McHattie, S., Hughes, L., Hickman, R., Hill, C., Kiddle, S., Kim, Y-S., Penfold, C., Jenkins, D., Zhang, C., Morris, K., Jenner, C., Jackson, S., Thomas, B., Tabrett, A., Legaie, R., Moore, J.D., Wild, D.L., Ott, S., Rand, D., Beynon, J., Denby, K., Mead, A., Buchanan-Wollaston, V. High resolution temporal profiling of transcripts during Arabidopsis leaf senescence reveals a distinct chronology of processes and regulation. Plant Cell (2011), 23, 873-894.