Browsing by Subject "Algorithmic Bioinformatics"

Now showing items 1-3 of 3
  • Hämäläinen, Kreetta (Helsingin yliopisto, 2021)
    Personalized medicine tailors therapies to the patient based on predicted risk factors. Genetics and metabolomics are among the tools used to predict the safety and efficacy of drugs. This thesis focuses on identifying biomarkers for the activity level of the drug transporter organic anion transporting polypeptide 1B1 (OATP1B1) from data acquired from untargeted metabolite profiling. OATP1B1 is a genetically polymorphic influx transporter expressed in human hepatocytes. It transports various drugs, such as statins, from portal blood into the hepatocytes. Statins are low-density lipoprotein cholesterol-lowering drugs, and decreased or poor OATP1B1 function has been shown to be associated with statin-induced myopathy. Based on genetic variability, individuals can be classified as having normal, decreased or poor OATP1B1 function. These activity classes were employed to identify metabolomic biomarkers for OATP1B1. To find the most efficient way to predict the activity level and to find the biomarkers associated with it, five different machine learning models were tested with a dataset that consisted of 356 fasting blood samples with 9152 metabolite features. The models included a Random Forest regressor and classifier, a Gradient Boosted Decision Tree regressor and classifier, and a Deep Neural Network regressor. Challenges specific to this type of data were the collinearity between the features and the large number of features compared to the number of samples, which led to issues in determining the important features of the neural network model. To address this, the data was clustered according to Spearman's rank-order correlation. Feature importances were calculated using two methods. For the neural network, they were calculated with permutation feature importance using mean squared error; for the random forest and gradient boosted decision trees, they were calculated using Gini impurity. 
The performance of each model was measured, and all classifiers had a poor ability to predict the decreased and poor function classes. All regressors performed very similarly to each other. The gradient boosted decision tree regressor performed the best by a slight margin, but the random forest regressor and the neural network regressor performed nearly as well. The best features from all three models were cross-referenced with the features found by y-aware PCA. The y-aware PCA indicated that the 14 best features cover 95% of the explained variance, so 14 features were picked from each model and cross-referenced with each other. Cross-referencing the highest-scoring features reported by the best models revealed multiple features that were ranked as important by many models. Taken together, machine learning methods provide powerful tools to identify potential biomarkers from untargeted metabolomics data.
  • Maljanen, Katri (Helsingin yliopisto, 2021)
    Cancer is a leading cause of death worldwide. Contrary to what its name suggests, cancer is not a single disease. It is a group of diseases that arise from the expansion of a somatic cell clone. This expansion is thought to be a result of mutations that confer a selective advantage to the cell clone. Mutations that give cells such an advantage, leading to their proliferation and escape from normal cell constraints, are called driver mutations. The genes that contain driver mutations are known as driver genes. Studying these mutations and genes is important for understanding how cancer forms and evolves. Various methods have been developed to discover these mutations and genes. This thesis focuses on a method called Deep Mutation Modelling, a deep learning based approach to predicting the probability of mutations. Deep Mutation Modelling's output probabilities offer the possibility of creating sample-specific and cancer-type-specific probability scores that reflect the pathogenicity of the mutations. Most earlier methods produce scores that are the same for all cancer types, whereas Deep Mutation Modelling offers the opportunity to make a more personalised score. The main objectives of this thesis were to examine the Deep Mutation Modelling output, as it was unknown what kind of features it has, to see how the output compares against other scoring methods, and to see how the probabilities behave in mutation hotspots. A final objective was to test whether the probabilities could be used in a common driver gene discovery method. Overall, the goal was to see if Deep Mutation Modelling works and if it is competitive with other known methods. The findings indicate that Deep Mutation Modelling works in predicting driver mutations, but that it does not have sufficient power to do this reliably and requires further improvements.
  • Leinonen, Miika (Helsingin yliopisto, 2019)
    Since the introduction of DNA sequencing over 40 years ago, we have been able to take a peek at our genetic material. Even though we have had a long time to develop sequencing strategies further, we are still unable to read the whole genome in one go. Instead, we are able to gather smaller pieces of the genetic material, which we can then use to reconstruct the original genome with a process called genome assembly. As a result of the genome assembly we often obtain multiple long sequences representing different regions of the genome, which are called contigs. Even though a genome often consists of a few separate DNA molecules (chromosomes), the number of obtained contigs exceeds the number of chromosomes substantially, meaning our reconstruction of the genome is not perfect. The resulting contigs can afterwards be refined by ordering, orienting and scaffolding them using additional information about the genome, which is often done manually. The assembly process can also be guided automatically with the additional information, and in this thesis we introduce a method that utilizes optical maps to help us assemble the genome more accurately. A noticeable improvement of this method is the unification of the contigs, i.e. we are left with fewer but longer contigs. We use an existing genome assembler called Kermit, which is designed to accept genetic maps as auxiliary long-range information. Our contribution is the development of an assembly pipeline that provides Kermit with a similar kind of information via optical maps. The initial results of our experiments show that the proposed genome assembly scheme can take advantage of optical maps effectively already during the assembly process to guide the reconstruction of a genome.
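
The first abstract in this listing mentions permutation feature importance measured with mean squared error. As a rough illustration of that general idea, and not the thesis author's implementation, the following minimal Python sketch shuffles one feature column at a time and records how much the error grows. The function names, the synthetic data and the fixed linear `predict` rule (a stand-in for a trained regressor) are all assumptions made for this example.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance_mse(predict, rows, y, n_repeats=10, seed=0):
    """Average increase in MSE when each feature column is shuffled.

    predict: callable mapping one row (list of floats) to a prediction.
    rows:    list of samples, each a list of feature values.
    Returns one score per feature; a larger score means the model
    relies more on that feature.
    """
    rng = random.Random(seed)
    baseline = mse(y, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        total = 0.0
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other column intact.
            column = [r[j] for r in rows]
            rng.shuffle(column)
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            total += mse(y, [predict(r) for r in permuted]) - baseline
        importances.append(total / n_repeats)
    return importances


# Synthetic toy data (not the thesis data): only feature 0 drives the target.
data_rng = random.Random(42)
rows = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [3.0 * r[0] + data_rng.gauss(0, 0.1) for r in rows]


# Hypothetical stand-in for a trained regressor: a fixed linear rule.
def predict(row):
    return 3.0 * row[0]


importances = permutation_importance_mse(predict, rows, y)
best_feature = max(range(len(importances)), key=importances.__getitem__)
print("Most important feature index:", best_feature)
```

Because the score depends only on model predictions, the same function could in principle be applied to any regressor. With strongly collinear features, as the abstract notes, such scores can become hard to interpret, which is one reason to group correlated features before ranking them.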