Principal Supervisor: Dr Daniel Hebenstreit - School of Life Sciences
PhD project title: Improving single-cell RNA-sequencing
University of Registration: Warwick
Background
The major goal of RNA-seq is to accurately quantify the expression levels of mRNAs in a cell. Several steps in RNA-seq library preparation lead to over- and/or under-representation of sequences relative to the starting material, both across different transcripts and within transcripts. These biases have been recognized as a problem because they affect quantification accuracy [1]. This issue is particularly relevant for single-cell RNA-seq (scRNA-seq) [2, 3], where a decrease in precision automatically affects sensitivity. While losses at each step of a standard RNA-seq protocol are not critical because starting material is plentiful, they limit the chances of detection in scRNA-seq. Ideally, the mass of every single original mRNA should be harnessed as completely as possible for the next-generation sequencing step at the end of an scRNA-seq protocol.
Comparatively little attention has been paid to biases linked to cDNA conversion. Reverse transcription of mRNA into cDNA is required for virtually all RNA-seq protocols currently in use, as it is not possible to (PCR-) amplify RNA itself. cDNA production for RNA-seq is normally initiated from primers that are designed either to bind random positions along the mRNA or to target the 3’ poly-A tail. Depending on the protocol, synthesis of the second, reverse-complementary cDNA strand may once again start from a random position or from the end of the first strand. These positional dependencies cause further non-trivial biases in how the cDNA represents the original mRNA.
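The positional dependence described above can be illustrated with a toy calculation. The sketch below is not the project's model. It assumes, purely for illustration, that first-strand synthesis starts at the 3’ poly-A tail and that the reverse transcriptase terminates with a constant probability at every base. The function name and the parameter values are hypothetical.

```python
def expected_coverage(transcript_length, dropoff_per_base):
    """Expected fraction of cDNA molecules that reach each mRNA position.

    Positions are indexed 5' -> 3'. Synthesis is assumed to start at the
    3' end (poly-A priming) and to stop with a fixed probability per base,
    which is an illustrative assumption and not a measured quantity.
    """
    last = transcript_length - 1
    return [
        (1.0 - dropoff_per_base) ** (last - pos)
        for pos in range(transcript_length)
    ]


# Hypothetical values: a 2,000-base transcript, 0.1% termination per base.
coverage = expected_coverage(transcript_length=2000, dropoff_per_base=0.001)

# Coverage is complete at the 3' end and decays towards the 5' end.
print("3' end:", coverage[-1])
print("middle:", coverage[len(coverage) // 2])
print("5' end:", coverage[0])
```

Even this simple assumption produces a within-transcript gradient that depends on transcript length, which is why such effects have to be modelled before expression levels can be compared across transcripts.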
Several online databases contain vast numbers of RNA-seq datasets that can be mined and analysed for general bias trends. Combining this with advanced statistical bias models will allow us to better understand the mechanistic causes of the various types of bias and to distinguish biological biases from experimental ones.
Outline of project
The project has a strong computational component, with a focus on programming and exploratory data analysis and, to a lesser degree, statistics. An experimental verification part will follow at the end of the project. The project will be organized into three main stages:
Stage 1. The initial project phase will encompass a review of the existing literature on RNA-seq and a compilation of the protocols in use and their characteristics. We will then write a script that automatically generates a census of RNA-seq datasets in online resources and classifies them according to protocol, cell type, treatment, and similar parameters. Using an existing probabilistic framework, we will study the biases that are expected from the various protocols and decide, depending on the available data, which datasets to use for further analysis.
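As a rough illustration of the classification step in Stage 1, the sketch below sorts dataset records into priming classes by keyword matching on a free-text description. Everything in it is an assumption made for illustration: the record layout, the field name library_description, the keyword table, and the example records are hypothetical and do not correspond to any particular database.

```python
from collections import Counter

# Hypothetical keyword table; a real census would need a curated vocabulary.
PROTOCOL_KEYWORDS = {
    "poly-A primed": ("oligo-dt", "oligo(dt)", "poly-a", "polya"),
    "random primed": ("random hexamer", "random primer", "random-primed"),
}


def classify_protocol(description):
    """Return the first protocol class whose keywords occur in the text."""
    text = description.lower()
    for label, keywords in PROTOCOL_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return label
    return "unclassified"


def census(records):
    """Count records per protocol class."""
    return Counter(
        classify_protocol(record["library_description"]) for record in records
    )


# Made-up records standing in for metadata retrieved from an online resource.
example_records = [
    {"id": "A", "library_description": "cDNA synthesised with oligo-dT primers"},
    {"id": "B", "library_description": "Random hexamer priming after rRNA depletion"},
    {"id": "C", "library_description": "Total RNA, protocol not stated"},
]

counts = census(example_records)
for label, n in counts.items():
    print(label, n)
```

Records that end up as "unclassified" would be candidates for manual inspection or for extending the keyword table.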
Stage 2. In the second stage, we will write scripts to automatically download the chosen datasets and prepare them for further analysis (e.g. reformatting the data and mapping sequencing reads to a reference genome). The goal will be to harness a vast resource, ideally consisting of more than a hundred datasets, and to automate as many steps as possible. We will then use a Bayesian framework to link the data with the theoretical models obtained in Stage 1. Markov Chain Monte Carlo algorithms will be employed to explore the validity of the models and their parameter space.
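To give a flavour of the Markov Chain Monte Carlo step, the sketch below runs a random-walk Metropolis sampler on a deliberately simple toy model. It is not the framework the project will use. The assumptions are all illustrative: each observation is the distance from the 3’ end at which a cDNA molecule terminated, distances follow a geometric distribution with per-base termination probability p, and p has a uniform prior on (0, 1). The data are simulated, and all function names and settings are hypothetical.

```python
import math
import random


def simulate_distances(n, p, seed=0):
    """Draw n geometric termination distances (toy data, not real reads)."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], so the logarithm is always defined.
    return [int(math.log(1.0 - rng.random()) / math.log(1.0 - p)) for _ in range(n)]


def log_posterior(p, n, total):
    """Log posterior of p up to a constant, with a uniform prior on (0, 1)."""
    if not 0.0 < p < 1.0:
        return float("-inf")
    return n * math.log(p) + total * math.log(1.0 - p)


def metropolis(distances, n_iter=5000, step=0.005, start=0.1, seed=1):
    """Random-walk Metropolis sampler for the termination probability p."""
    rng = random.Random(seed)
    n, total = len(distances), sum(distances)
    current = start
    current_lp = log_posterior(current, n, total)
    samples = []
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, n, total)
        # Accept with probability min(1, posterior ratio), on the log scale.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


data = simulate_distances(n=200, p=0.05)
kept = metropolis(data)[1000:]  # discard the first 1,000 draws as burn-in
posterior_mean = sum(kept) / len(kept)
print("posterior mean of p:", posterior_mean)
```

For this toy model the posterior is available in closed form as a Beta distribution, so the sampler can be checked against the analytic mean. The models of interest in the project are not expected to have such closed forms, which is the reason for using sampling.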
Stage 3. In the third project stage, we will analyse the results obtained in the previous stage and investigate the most interesting aspects further. We will potentially conduct a series of RNA-seq-based experiments to verify some of the findings. Finally, we will start writing an article that describes our discoveries.
1. Zheng, W., L.M. Chung, and H. Zhao, Bias detection and correction in RNA-Sequencing data. BMC Bioinformatics, 2011. 12: p. 290.
2. Tang, F., K. Lao, and M.A. Surani, Development and applications of single-cell transcriptome analysis. Nat Methods, 2011. 8(4 Suppl): p. S6-11.
3. Shapiro, E., T. Biezuner, and S. Linnarsson, Single-cell sequencing-based technologies will revolutionize whole-organism science. Nat Rev Genet, 2013. 14(9): p. 618-30.
BBSRC Strategic Research Priority: Molecules, cells and systems
Techniques that will be undertaken during the project:
Programming, next-generation-sequencing, PCR, standard molecular biology
Contact: Dr Daniel Hebenstreit, University of Warwick