Bayesian methods in bacterial population genomics

Title: Bayesian methods in bacterial population genomics
Author: Cheng, Lu
Contributor: University of Helsinki, Faculty of Science, Department of Mathematics and Statistics
Publisher: Helsingin yliopisto
Date: 2013-10-04
Thesis level: Doctoral dissertation (article-based)
Abstract: Vast amounts of molecular data are being generated every day. However, properly harnessing these data often remains a challenge for many biologists. Firstly, because molecular data are typically high-dimensional, analyses can require excessive amounts of computer memory, be very time-consuming, or both. Secondly, biological problems often have their own special features, which call for specially designed software that yields meaningful results from statistical analyses without placing excessive demands on the available computing resources. Finally, the general complexity of many biological research questions necessitates the joint use of many different methods, which requires considerable expertise in understanding the possibilities and limitations of the analysis tools. In the first part of this thesis, we discuss three general Bayesian classification/clustering frameworks, which in the considered applications are targeted towards clustering DNA sequence data, in particular in the context of bacterial population genomics and evolutionary epidemiology. Based on more generic Bayesian concepts, we have developed several statistical tools for analyzing DNA sequence data in bacterial metagenomics and population genomics. In the second part of this thesis, we discuss how to reconstruct bacterial evolutionary history from a combination of whole genome sequences and a number of core genes for which a large set of samples is available. A major problem is that in many bacterial species horizontal transfer of DNA, often termed recombination, is relatively frequent, and the recombined fragments within genome sequences tend to severely distort phylogenetic inferences.
To obtain computationally viable solutions in practice for a majority of currently emerging genome data sets, it is necessary to divide the problem into parts and use different approaches in combination to perform the whole analysis. We demonstrate this strategy by applying it to two challenging data sets in the context of evolutionary epidemiology and show that biologically significant conclusions can be drawn by shedding light on the complex patterns of relatedness among strains of bacteria. Both studied organisms (Escherichia coli and Campylobacter jejuni) are major human pathogens, and understanding the mechanisms behind the evolution of their populations is of vital importance for human health.
Although bacteria are everywhere on Earth, we still do not understand these small organisms very well. Thanks to current sequencing technologies, we are able to use a DNA sequence to represent a bacterium. With many sequences collected, we can understand the relationships between different bacteria. The first part of the thesis discusses three general frameworks for classifying DNA sequences in different scenarios. For example, suppose we have DNA sequences collected from bacteria in the environment. We want to know which bacteria were collected and whether any of them are novel. This part provides a solution to this problem. In the second part of the thesis, we try to find out from their sequences how related bacteria evolve. For example, a pathogen gains a DNA fragment that enables it to cause a serious disease, although it is not very infectious. This pathogen then interacts with other bacteria over a period of time. After that, a strong pathogen appears, which is very infectious and causes a serious disease. It spreads all over the world very quickly. Now we obtain many DNA sequences of this pathogen from different countries. We want to use these DNA sequences to find out where the pathogen originated. This is a difficult task because of recombination between the pathogens.
Subject: Bayesian statistics
Rights: This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.

Files in this item

File: cheng_dissertation.pdf
Size: 961.1 KB
Format: PDF
