Analysis and Evaluation of Classification Models for Disease Detection Using Human Gut Metagenomic Data

The gut microbiota is an important contributor to human health and disease. Our gastrointestinal tracts harbor a very large number of microorganisms: by a commonly cited estimate, about ten times as many cells as the human body itself contains. The gut microbiota modulates the immune system and acts as a metabolic organ. In several diseases, its taxonomic and functional composition is altered compared to that of a healthy microbiota. Metagenomics, the sequencing of pooled DNA extracted from microbiota samples, is an excellent tool for studying this composition. The aim of this project is to develop and evaluate statistical classification models using taxonomic and functional data derived from metagenomic sequences. Such a model could, for example, predict the risk of colon cancer from gut metagenome data. We obtained a publicly available human gut metagenomic data set. The data has far more attributes than instances. First, we followed the pipeline of the original paper and reproduced its results using lasso. Next, we explored ways to improve classification performance and evaluated variable importance to determine which bacteria can serve as colorectal cancer (CRC) markers. Using filter feature selection with SVMs, we raised the AUC on the test set from 0.85 (lasso) to 0.89. A Random Forest classifier gives an AUC of 0.87 and confirms the importance of several bacterial species, namely Fusobacterium nucleatum and Peptostreptococcus stomatis, which were highlighted in several studies and by correlation-based and entropy-based feature selection methods. We also tested for the influence of potential confounders such as age, gender, and body mass index and found no significant effect of these factors in this study.
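To make two of the ideas above concrete, filter feature selection and AUC, here is a minimal sketch in plain Python. It is not the project's pipeline: the data is synthetic, the helper names (`roc_auc`, `pearson`, `rank_features`) are hypothetical, and a simple signed sum of the selected features stands in for the SVM, lasso, and Random Forest classifiers used in the study. It only illustrates ranking features by absolute correlation with the label on the training set and then scoring held-out samples by AUC.

```python
import math
import random


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample gets the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_features(X, y, k):
    """Filter selection: indices of the k features whose absolute
    correlation with the label is largest."""
    n_features = len(X[0])
    strength = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: strength[j], reverse=True)[:k]


# Synthetic stand-in for abundance data: 60 samples, 200 features,
# of which only features 0 and 1 carry signal (an assumption for the demo).
rng = random.Random(0)


def make_sample(label):
    row = [rng.gauss(0.0, 1.0) for _ in range(200)]
    if label == 1:
        row[0] += 3.0
        row[1] += 3.0
    return row


y = [i % 2 for i in range(60)]
X = [make_sample(label) for label in y]
X_train, y_train = X[:40], y[:40]
X_test, y_test = X[40:], y[40:]

# Select features on the training set only, to avoid leaking test labels.
selected = rank_features(X_train, y_train, k=5)
signs = {
    j: 1.0 if pearson([row[j] for row in X_train], y_train) >= 0 else -1.0
    for j in selected
}

# Toy scorer: signed sum of the selected features (not an SVM).
test_scores = [sum(signs[j] * row[j] for j in selected) for row in X_test]
test_auc = roc_auc(y_test, test_scores)
print("selected features:", selected)
print("test AUC:", test_auc)
```

Selecting features on the training split alone matters when attributes far outnumber instances, because ranking on the full data would let noise features that happen to correlate with the test labels inflate the reported AUC.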

Report: (PDF Document)

Presentation: (PDF Document)