Title: | Statistical data analysis for biomarker discovery and Type 1 Diabetes prediction |
Author: | Danmei, Huang |
Other contributor: |
Helsingin yliopisto, Matemaattis-luonnontieteellinen tiedekunta
University of Helsinki, Faculty of Science
Helsingfors universitet, Matematisk-naturvetenskapliga fakulteten |
Publisher: | Helsingin yliopisto |
Date: | 2018 |
Language: | eng |
URI: |
http://urn.fi/URN:NBN:fi:hulib-202002251399
http://hdl.handle.net/10138/312286 |
Thesis level: | master's thesis |
Discipline: | Statistics |
Abstract: | Type 1 diabetes is a disease with a genetic component. The immune system attacks the pancreas so that no insulin can be secreted to regulate the blood glucose level. The cause of the disease is still unknown. To study Type 1 diabetes, researchers have collected time series microarray data for thousands of genes from individuals divided into case and control groups. By analyzing these data, we aim to detect genes that show significant differences between cases and controls. These genes may be used as biomarkers for Type 1 diabetes prediction in the future. We present four statistical methods for analyzing this Type 1 diabetes gene expression data, each based on different considerations. We provide detailed introductions to the methods used in the analysis. In particular, we show that Gaussian process regression is an extension of linear regression. The first method, standard linear regression, assumes that cases and controls follow the same linear model, except that the cases exhibit large variation at some time points. Time points with such large variation are treated as outliers. We can estimate their predictive distribution and calculate their p-values to check their significance. The second method, Bayesian linear regression, accounts for the uncertainty in the point estimates (maximum likelihood) used in standard linear regression. We place priors on the parameters so that the parameter uncertainty can be integrated out. The resulting estimates are generally more robust than those of standard linear regression. The third method, Gaussian process regression, assumes that cases and controls follow the same non-linear model, in contrast to the linear model of the previous two methods. A Gaussian process is a non-parametric model that is very flexible. The squared exponential kernel used in this thesis is able to model almost all smooth functions. 
After fitting the data, we can calculate the predictive distribution of the data points of the cases and then detect outliers by checking their p-values. The fourth method, Gaussian process model comparison, models the difference between cases and controls as a whole. Cases may or may not be systematically different from controls. We use a shared model to model them jointly and an independent model to model them separately, and then calculate the Bayes factor between the two models. If cases and controls are very similar, the shared model will have the higher marginal likelihood. If they differ a lot, the independent model is preferred. We apply the above four methods to the microarray data, which contains 49386 genes for 6 case-control pairs. We find 4956, 661 and 2797 significant genes using the first three methods with Bonferroni correction of the p-values. The numbers are 43276, 3584 and 25149 with the Benjamini-Hochberg correction. The fourth method suggests 722 significant genes with a log Bayes factor less than -5. We present some example significant genes, which clearly show the expected differences between cases and controls. The example results suggest that, in general, Gaussian process models fit the data better than linear regression models. The top hits (genes) provided by the methods remain to be validated by more biological experiments. |
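The fourth method in the abstract, Gaussian process model comparison, can be illustrated with a minimal sketch. This is not the thesis code: the function names, the zero-mean assumption, and the fixed hyperparameters (`noise`, `variance`, `lengthscale`) are assumptions made for illustration, whereas a real analysis would typically estimate the hyperparameters from the data. The sketch computes the log Bayes factor as the log marginal likelihood of the shared model minus that of the independent model, so that strongly negative values favour the independent model, in line with the "less than -5" threshold quoted above.

```python
import numpy as np

def sq_exp_kernel(t1, t2, variance=1.0, lengthscale=1.0):
    """Squared exponential covariance between two vectors of time points."""
    diff = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def log_marginal_likelihood(t, y, noise=0.1, variance=1.0, lengthscale=1.0):
    """Log marginal likelihood of a zero-mean GP with Gaussian noise.

    `noise` is the noise variance; all hyperparameters are fixed here
    (an assumption of this sketch, not a statement about the thesis).
    """
    n = len(t)
    K = sq_exp_kernel(t, t, variance, lengthscale) + noise * np.eye(n)
    L = np.linalg.cholesky(K)                      # K = L @ L.T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    # -0.5 * log|K| equals -sum(log(diag(L)))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2.0 * np.pi))

def log_bayes_factor(t_case, y_case, t_ctrl, y_ctrl, **hyper):
    """log p(data | shared model) - log p(data | independent model)."""
    # Shared model: one GP explains cases and controls jointly.
    shared = log_marginal_likelihood(
        np.concatenate([t_case, t_ctrl]),
        np.concatenate([y_case, y_ctrl]),
        **hyper,
    )
    # Independent model: a separate GP for each group.
    independent = (log_marginal_likelihood(t_case, y_case, **hyper)
                   + log_marginal_likelihood(t_ctrl, y_ctrl, **hyper))
    return shared - independent

# Toy data (hypothetical, not from the thesis): six time points per group.
t = np.linspace(0.0, 5.0, 6)
control = np.sin(t)
similar_case = np.sin(t)           # same trajectory as the control
shifted_case = np.sin(t) + 3.0     # systematically different trajectory

lbf_similar = log_bayes_factor(t, similar_case, t, control)
lbf_shifted = log_bayes_factor(t, shifted_case, t, control)
print("similar:", lbf_similar, "shifted:", lbf_shifted)
```

With these toy inputs the identical trajectories give a positive log Bayes factor (shared model preferred), while the shifted trajectory gives a strongly negative one (independent model preferred).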
Files | Size | Format |
---|---|---|
danmei_thesis.pdf | 2.673Mb | PDF |