Browsing by Subject "Life Science Informatics -maisteriohjelma"


Now showing items 1-14 of 14
  • Pohjonen, Joona (Helsingin yliopisto, 2020)
    Prediction of the pathological T-stage (pT) in men undergoing radical prostatectomy (RP) is crucial for disease management as curative treatment is most likely when prostate cancer (PCa) is organ-confined (OC). Although multiparametric magnetic resonance imaging (MRI) has been shown to predict pT findings and the risk of biochemical recurrence (BCR), none of the currently used nomograms allow the inclusion of MRI variables. This study aims to assess the possible added benefit of MRI when compared to the Memorial Sloan Kettering, Partin table and CAPRA nomograms and a model built from available preoperative clinical variables. Logistic regression is used to assess the added benefit of MRI in the prediction of non-OC disease and Kaplan-Meier survival curves and Cox proportional hazards in the prediction of BCR. For the prediction of non-OC disease, all models with the MRI variables had significantly higher discrimination and net benefit than the models without the MRI variables. For the prediction of BCR, MRI prediction of non-OC disease separated the high-risk group of all nomograms into two groups with significantly different survival curves but in the Cox proportional hazards models the variable was not significantly associated with BCR. Based on the results, it can be concluded that MRI does offer added value to predicting non-OC disease and BCR, although the results for BCR are not as clear as for non-OC disease.
  • Hämäläinen, Kreetta (Helsingin yliopisto, 2021)
    Personalized medicine tailors therapies for the patient based on predicted risk factors. Genetics and metabolomics are among the tools used to predict the safety and efficacy of drugs. This thesis focuses on identifying biomarkers for the activity level of the drug transporter organic anion transporting polypeptide 1B1 (OATP1B1) from data acquired from untargeted metabolite profiling. OATP1B1 is a genetically polymorphic influx transporter that is expressed in human hepatocytes and transports various drugs, such as statins, from portal blood into the hepatocytes. Statins are low-density lipoprotein cholesterol-lowering drugs, and decreased or poor OATP1B1 function has been shown to be associated with statin-induced myopathy. Based on genetic variability, individuals can be classified into those with normal, decreased or poor OATP1B1 function. These activity classes were employed to identify metabolomic biomarkers for OATP1B1. To find the most efficient way to predict the activity level and to find the biomarkers associated with it, five machine learning models were tested with a dataset that consisted of 356 fasting blood samples with 9152 metabolite features. The models included a Random Forest regressor and classifier, a Gradient Boosted Decision Tree regressor and classifier, and a Deep Neural Network regressor. Hindrances specific to this type of data were the collinearity between the features and the large number of features compared to the number of samples, which led to issues in determining the important features of the neural network model. To address this, the data were clustered according to Spearman’s rank-order correlation. Feature importances were calculated using two methods: for the neural network, permutation feature importance based on mean squared error, and for the random forest and gradient boosted decision trees, Gini impurity.
The performance of each model was measured, and all classifiers had a poor ability to predict the decreased and poor function classes. All regressors performed very similarly to each other. The gradient boosted decision tree regressor performed the best by a slight margin, but the random forest regressor and the neural network regressor performed nearly as well. The best features from all three models were cross-referenced with the features found from y-aware PCA analysis. The y-aware PCA analysis indicated that the 14 best features cover 95% of the explained variance, so 14 features were picked from each model and cross-referenced with each other. Cross-referencing the highest-scoring features reported by the best models found multiple features that showed up as important in many models. Taken together, machine learning methods provide powerful tools to identify potential biomarkers from untargeted metabolomics data.
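The abstract above mentions permutation feature importance measured with mean squared error. As a rough illustration of that general technique only (not the thesis's own code), the sketch below shuffles one feature column at a time and records how much the error of a fixed model increases. The names `permutation_importance`, `toy_model`, and the toy data are assumptions made for this example.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average increase in MSE when each feature column is shuffled.

    `predict` is any callable mapping one row (a list) to a prediction.
    This is an illustrative helper, not a library API.
    """
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mse(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical toy model: it uses feature 0 and ignores feature 1.
def toy_model(row):
    return 3.0 * row[0]


X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [3.0 * row[0] for row in X]
scores = permutation_importance(toy_model, X, y)
```

In this toy setup the ignored feature receives an importance of zero, while the feature the model depends on receives a positive score. With strongly collinear features, as described in the abstract, such scores can be spread across correlated columns, which is one reason to group features before ranking them.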
  • Kinnula, Ville (Helsingin yliopisto, 2021)
    In inductive inference, phenomena from the past are modeled in order to make predictions of the future. The mathematical concept of exchangeability for random sequences provides a mathematical justification for the assumption that observations are independently and identically distributed given some underlying parameters estimable from the empirical distribution of the observations. The theory of exchangeability contains basic elements for inductive inference, such as the de Finetti representation theorem for the probability of a general exchangeable sequence, prior probability distributions for the parameters in the representation theorem, as well as the predictive probabilities, or rule of succession, for new observations from the random sequence under consideration. However, entirely unanticipated observations pose a problem for inductive inference. How can one assign a probability to an event that has never been seen before? This is called the sampling of species problem. Under exchangeability, the number of possible different events t has to be known beforehand to be able to assign an equal prior probability 1/t to each event. In the sampling of species problem an assumption of infinitely many possible events has to be made, leading to the prior probability 1/∞ for each event, which is impossible. Exchangeability is thus inadequate to handle probability distributions for infinitely many possible events. It turns out that a solution to the sampling of species problem arises from partition exchangeability. Exchangeable random sequences have the same probability of occurring if the observations in the sequences have identical frequencies. Under partition exchangeability, the sequences have the same probability of occurring when they share identical frequencies of frequencies. In this thesis, partition exchangeability is introduced as a framework of inductive inference by juxtaposing it with the more familiar type of exchangeability for random sequences.
Partition exchangeability has parallel elements to exchangeability, in the Kingman representation theorem, the Poisson-Dirichlet distribution for the prior probability distribution, and a corresponding rule of succession. The rules of succession are required in the problem of supervised classification to provide product predictive probabilities to be maximized by assigning the test data into pre-defined classes based on training data. A Bayesian construction of supervised classification is discussed in this thesis. In theory, the best classification performance is gained when assigning the class labels to the test data simultaneously, but because of computational complexity, an assumption is often made where the test data points are i.i.d. with regards to each other. In the case of a known set of possible events these simultaneous and marginal classifiers converge in their test data predictive probabilities as the amount of training data tends to infinity, justifying the use of the simpler marginal classifier with enough training data. These two classifiers are implemented in this thesis under partition exchangeability, and it is shown in theory and in practice with a simulation study that the same asymptotic convergence between the simultaneous and marginal classifiers applies with partition exchangeable data as well. Finally, a small application in single cell RNA expression is explored.
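The distinction between frequencies and frequencies of frequencies in the abstract above can be made concrete with a small sketch. The helper names and the toy sequences are assumptions for illustration; the code only computes the two summary statistics and does not implement the classifiers described in the thesis.

```python
from collections import Counter


def frequencies(sequence):
    """How often each distinct event occurs (the statistic under exchangeability)."""
    return Counter(sequence)


def frequencies_of_frequencies(sequence):
    """How many distinct events occur once, twice, and so on
    (the statistic under partition exchangeability)."""
    return Counter(Counter(sequence).values())


a = ["x", "y", "x", "z"]
a_reordered = ["y", "x", "z", "x"]   # same events, different order
b = ["q", "q", "r", "s"]             # different events, same partition shape

same_frequencies = frequencies(a) == frequencies(a_reordered)
same_labels = frequencies(a) == frequencies(b)
same_partition = frequencies_of_frequencies(a) == frequencies_of_frequencies(b)
```

Here `a` and `a_reordered` share the same frequencies, so an exchangeable model gives them equal probability. `a` and `b` use entirely different labels, yet both contain one event seen twice and two events seen once, so they agree in their frequencies of frequencies. This is why the labels of previously unseen events do not need to be known in advance under partition exchangeability.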
  • Koski, Jessica (Helsingin yliopisto, 2021)
    Acute lymphoblastic leukemia (ALL) is a hematological malignancy that is characterized by uncontrolled proliferation and blocked maturation of lymphoid progenitor cells. It is divided into B- and T-cell types, both of which have multiple subtypes defined by different somatic genetic changes. Germline predisposition has also been found to play an important role in multiple hematological malignancies, and several germline variants that contribute to ALL risk have already been identified in pediatric and familial settings. Only a few studies have included adult ALL patients, but findings in acute myeloid leukemia, where germline predisposition was found to concern adult patients as well, have increased interest in studying adults. The prognosis of adult ALL patients is much worse than that of pediatric patients, and many still lack clear genetic markers for diagnosis. Thus, identifying genetic lesions affecting ALL development is important in order to improve treatments and prognosis. Germline studies can provide additional insight into the predisposition to and development of ALL when there are no clear somatic biomarkers. Single nucleotide variants are usually of interest when identifying biomarkers from the genome, but structural variants can also be studied. Their coverage of the genome is higher than that of single nucleotide variants, which makes them suitable candidates for exploring association with prognosis. Copy number changes can be detected from next generation sequencing data, although the detection specificity and sensitivity vary a lot between different software. The current approach is to identify the most likely regions with copy number change by using multiple tools and to later validate the findings experimentally. In this thesis the copy number changes in germline samples of 41 adult ALL patients were analyzed using ExomeDepth, CODEX2 and CNVkit.
  • Malmsten, Kim (Helsingin yliopisto, 2021)
    Genomic structural variants are large events that change the structure of the genome. These can cause changes in the functions of cells by breaking genes and genomic regulatory regions. Multiple factors are known to affect the formation of structural variants, and previous studies have shown that the sequence content in a genomic region often plays a role in their formation. This study aims to characterize the sequence content around the breakpoints of structural variants detected in human tissue samples that were whole genome sequenced with nanopore sequencing. The characterization was done by looking at the genomic repetitive elements found around the breakpoints, by analyzing the GC-content around the breakpoints, and by studying which enriched DNA motifs were found in the sequences around the breakpoints and how they were located in these sequences. Multiple different repetitive elements were seen to occur near the breakpoint regions, and the kinds of repetitive elements differed between types of structural variants. The sequences around different kinds of structural variants also had distinctly different GC-content profiles. In addition, various enriched motifs were found in the sequences, and many of these showed distinct variation in how they were located around the breakpoints. These results support previous findings that the sequence content plays a role in the formation of structural variants, although not all of the results could be directly explained by previous studies. In these results, the GC-content was higher in sequences that had been affected by an event that causes structural variant formation.
Also, many of the found DNA motifs were distinctly skewed around the breakpoint sequences, possibly hinting that the sequences containing these motifs would be prone to the formation of structural variants.
  • Ottensmann, Linda (Helsingin yliopisto, 2020)
    It is challenging to identify causal genes and pathways explaining the associations with diseases and traits found by genome-wide association studies (GWASs). To solve this problem, a variety of methods that prioritize genes based on the variants identified by GWASs have been developed. In this thesis, the methods Data-driven Expression Prioritized Integration for Complex Traits (DEPICT) and Multi-marker Analysis of GenoMic Annotation (MAGMA) are used to prioritize causal genes based on the most recently published publicly available schizophrenia GWAS summary statistics. The two methods are compared using the Benchmarker framework, which allows an unbiased comparison of gene prioritization methods. The study has four aims. Firstly, to explain what are the differences between the gene prioritization methods DEPICT and MAGMA and how the two methods work. Secondly, to explain how the Benchmarker framework can be used to compare gene prioritization methods in an unbiased way. Thirdly, to compare the performance of DEPICT and MAGMA in prioritizing genes based on the latest schizophrenia summary statistics from 2018 using the Benchmarker framework. Lastly, to compare the performance of DEPICT and MAGMA on a schizophrenia GWAS with a smaller sample size by using Benchmarker. Firstly, the published results of the Benchmarker analyses using schizophrenia GWAS from 2014 were replicated to make sure that the framework is run correctly. The results were very similar and both the original and the replicated results show that DEPICT and MAGMA do not perform significantly differently. Furthermore, they show that the intersection of genes prioritized by DEPICT and MAGMA outperforms the outersection, which is defined as genes prioritized by only one of these methods. Secondly, Benchmarker was used to compare the performance of DEPICT and MAGMA on prioritizing genes using the schizophrenia GWAS from 2018. 
The results of the Benchmarker analyses suggest that DEPICT and MAGMA perform similarly with the GWAS from 2018 compared to the GWAS from 2014. Furthermore, an earlier schizophrenia GWAS from 2011 was used to check if the performance of DEPICT and MAGMA differs when a GWAS with lower statistical power is used. The results of the Benchmarker analyses make clear that MAGMA performs better than DEPICT in prioritizing genes using this smaller data set. Furthermore, for the schizophrenia GWAS from 2011 the outersection of genes prioritized by DEPICT and MAGMA outperforms the intersection. To conclude, the Benchmarker framework is a useful tool for comparing gene prioritization methods in an unbiased way. For the most recently published schizophrenia GWAS from 2018 there is no significant difference between the performance of DEPICT and MAGMA in prioritizing genes according to Benchmarker. For the smaller schizophrenia GWAS from 2011, however, MAGMA outperformed DEPICT.
  • Maljanen, Katri (Helsingin yliopisto, 2021)
    Cancer is a leading cause of death worldwide. Unlike its name would suggest, cancer is not a single disease. It is a group of diseases that arises from the expansion of a somatic cell clone. This expansion is thought to be a result of mutations that confer a selective advantage to the cell clone. These mutations that are advantageous to cells that result in their proliferation and escape of normal cell constraints are called driver mutations. The genes that contain driver mutations are known as driver genes. Studying these mutations and genes is important for understanding how cancer forms and evolves. Various methods have been developed that can discover these mutations and genes. This thesis focuses on a method called Deep Mutation Modelling, a deep learning based approach to predicting the probability of mutations. Deep Mutation Modelling’s output probabilities offer the possibility of creating sample and cancer type specific probability scores for mutations that reflect the pathogenicity of the mutations. Most methods in the past have made scores that are the same for all cancer types. Deep Mutation Modelling offers the opportunity to make a more personalised score. The main objectives of this thesis were to examine the Deep Mutation Modelling output as it was unknown what kind of features it has, see how the output compares against other scoring methods and how the probabilities work in mutation hotspots. Lastly, could the probabilities be used in a common driver gene discovery method. Overall, the goal was to see if Deep Mutation Modelling works and if it is competitive with other known methods. The findings indicate that Deep Mutation Modelling works in predicting driver mutations, but that it does not have sufficient power to do this reliably and requires further improvements.
  • Kuosmanen, Teemu (Helsingin yliopisto, 2020)
    Cancer is a dynamic and complex microevolutionary process. All attempts at curing cancer thus also rely on successfully controlling the evolving future cancer cell population. Since the emergence of drug resistance severely limits the success of many anti-cancer therapies, especially in the case of the promising targeted therapies, we urgently need better ways of controlling cancer evolution with our treatments to avoid resistance. This thesis characterizes acquired drug resistance as an evolutionary rescue and uses optimal control theory to critically investigate the rationale of aggressive maximum tolerated dose (MTD) therapies that represent the standard of care for first line treatment. Unlike the previous models of drug resistance, which mainly concentrate on minimizing the tumor volume, herein the optimal control problem is reformulated to explicitly minimize the probability of evolutionary rescue, or equivalently, to maximize the extinction probability of the cancer cells. Furthermore, I investigate the effects of drug-induced resistance, where the rate of gaining new resistant cells increases with the dose due to increased genome-wide mutation rate and non-genetic adaptations (such as epigenetic regulation and phenotypic plasticity). This approach not only reflects biological realism, but also allows the cost of control to be modeled in a quantifiable manner instead of using some ambiguous and incomparable penalty parameter for the cost of treatment. The major finding presented in this thesis is that MTD-style therapies may actually increase the likelihood of an evolutionary rescue even when only modest drug-induced effects are present. This suggests that significant improvements to treatment outcomes may be accomplished at least in some cases by treatment optimization. The resistance promoting properties of different anti-cancer therapies should therefore be properly investigated in experimental and clinical settings.
  • Holopainen, Ida (Helsingin yliopisto, 2021)
    Traditional parametric statistical inference methods, such as maximum likelihood and Bayesian inference, cannot be used to learn parameter estimates if the likelihood is intractable, for example due to the complexity of the studied phenomenon. This can be overcome by using likelihood-free inference, which is used with simulator-based models to learn parameter estimates. Traditional methods for estimating the uncertainty of parameter estimates also typically require a likelihood function, which is why they cannot be applied in likelihood-free inference. In this thesis, we present a novel way to compute confidence sets for parameter estimates obtained from likelihood-free inference using Jensen-Shannon divergence. We consider two test statistics that are based on mean Jensen-Shannon divergence and propose hypothesised asymptotic distributions for them. We test whether these hypothesised distributions can be used in the computation of confidence sets for parameter estimates obtained from likelihood-free inference, and we evaluate the produced confidence sets by studying their frequentist behaviour, summarised with coverage probabilities. Using three different models, we compare this frequentist behaviour between Jensen-Shannon divergence estimates and confidence sets obtained from grid evaluation of Monte Carlo estimates and from Bayesian optimisation for likelihood-free inference (BOLFI), and those obtained from maximum likelihood inference with Wald’s and log likelihood-ratio confidence sets. We also use a simulator-based model with intractable likelihood to study the proposed confidence sets with BOLFI. In order to study the influence of observations on the parameter estimates and their confidence sets, we conducted these experiments with a varying number of observations. We show that Jensen-Shannon divergence based confidence sets meet the expected frequentist behaviour.
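For readers unfamiliar with the quantity that the abstract above builds on, the following is a minimal sketch of the Jensen-Shannon divergence between two discrete distributions. It shows only the standard definition; the test statistics and confidence sets proposed in the thesis are not reproduced, and the function names are chosen for this example.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats for discrete distributions.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the average of p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


identical = jensen_shannon_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0])
```

Unlike the Kullback-Leibler divergence, this quantity is symmetric and bounded: it is zero for identical distributions and reaches ln 2 (in nats) for distributions with disjoint support. That makes it usable when comparing simulated and observed data without a tractable likelihood.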
  • Dovydas, Kičiatovas (Helsingin yliopisto, 2021)
    Cancer cells accumulate somatic mutations in their DNA throughout their lifetime. The advances in cancer prevention and treatment methods call for a deeper understanding of carcinogenesis on the genetic sequence level. Mutational signatures present a novel and promising way to capture somatic mutation patterns and define their causes, allowing to summarize the mutational landscape of cancer as a combination of distinct mutagenic processes acting with different levels of strength. While the majority of previous studies assume an additive relationship between the mutational processes, this Master’s thesis provides tentative evidence that contemporary methods with additivity constraints, e.g. non-negative matrix factorization (NMF), are not sufficient to comprehensively explain the observed mutations in cancer genomes and the observed deviations are not random. To quantify these residues, two metrics are defined – additive and multiplicative residues – and hierarchical clustering algorithms are used to identify cancer subsets with similar residual profiles. It is shown that in certain cancer sample subsets there is a systematic mutational burden overestimation that can only be solved by a multiplicatively acting process, as well as non-random underestimation, requiring additional mutational signatures. Here an extension to the additive mutational signature model is proposed – a probabilistic model that incorporates a selectively active modulatory mutational process that is able to act in a multiplicative manner together with the known mutational signatures, reducing systematic variability.
  • Gu, Chunhao (Helsingin yliopisto, 2021)
    Along with the rapid scale-up of biological knowledge bases, mechanistic models, especially metabolic network models, are becoming more accurate. At the same time, machine learning has been widely applied in biomedical research as large amounts of omics data have become available in recent years. It is therefore worthwhile to study the integration of metabolic network models and machine learning, as the combination may lead to biological discoveries. In 2019, MIT researchers proposed an approach called 'White-Box Machine Learning': they used fluxomics data derived from in silico simulation of a genome-scale metabolic (GEM) model and experimental antibiotic lethality measurements (IC50 values) of E. coli under hundreds of screening conditions to train a linear regression-based machine learning model, and they extracted the coefficients of the model to discover metabolic mechanisms involved in antibiotic lethality. In this thesis, we propose a new approach based on the 'White-Box Machine Learning' framework. We replace the GEM model with another state-of-the-art metabolic network model, the expression and thermodynamics flux (ETFL) formulation. We also replace the linear regression-based machine learning model with a novel nonlinear regression model, the multi-task elastic net multilayer perceptron (MTENMLP). We apply the approach to the same experimental antibiotic lethality measurements (IC50 values) of E. coli from the 'White-Box Machine Learning' study. Finally, we validate their conclusions and make some new discoveries. Specifically, our results show that ppGpp metabolism is active under antibiotic stress, which is supported by some literature. This implies that our approach has the potential to make biological discoveries even when no candidate conclusion is known in advance.
  • Lehtonen, Leevi (Helsingin yliopisto, 2021)
    Sex differences can be found in most human phenotypes, and they play an important role in human health and disease. Females and males have different sex chromosomes, which are known to cause sex differences, as are differences in the concentration of sex hormones such as testosterone, estradiol and progesterone. However, the role of the autosomes has remained more debated. The primary aim of this thesis is to assess the magnitude and relevance of human sex-specific genetic architecture in the autosomes. This is done by calculating sex-specific heritability estimates and genetic correlation estimates between females and males, as well as comparing these to sex differences on the phenotype level. Additionally, the heritability and genetic correlation estimates are compared between two populations, in order to assess the magnitude of sex differences compared to differences between populations. The analyses in this thesis are based on sex-stratified genome-wide association study (GWAS) data from 48 phenotypes in the UK Biobank (UKB), which contains genotype data from approximately 500 000 individuals as well as thousands of phenotype measurements. A replication of the analyses using three phenotypes was also made on data from the FinnGen project, with a dataset from approximately 175 000 individuals. The 48 phenotypes used in this study range from biomarkers such as serum testosterone and albumin levels to general traits such as height and blood pressure. The heritability and genetic correlation estimates were calculated using linkage disequilibrium score regression (LDSC). LDSC fits a linear regression model between test statistic values of GWAS variants and linkage disequilibrium (LD) scores calculated from a reference population. For most phenotypes, the heritability and genetic correlation results show little evidence of sex differences. 
Serum testosterone level and waist-to-hip ratio are exceptions to this, showing strong evidence of sex differences both on the genetic and the phenotype level. However, the overall correlation between phenotype level sex differences and sex differences in heritability or genetic correlation estimates is low. The replication in the FinnGen dataset for height, weight and body mass index (BMI), showed that for these traits the differences in heritability estimates and genetic correlations between the Finnish and UK populations are comparable or larger than the differences found between males and females.
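The abstract above describes LDSC as a linear regression of GWAS test statistics on LD scores. A simplified sketch of that idea follows, assuming the commonly cited relation in which the expected chi-square statistic of a variant is roughly `1 + N * h2 * ld_score / M` (N samples, M variants), so heritability can be read from the slope. The function names and toy numbers are assumptions for illustration. Actual LDSC analyses use weighted regression and resampling-based standard errors, which are omitted here.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ldsc_heritability(chi2, ld_scores, n_samples, n_variants):
    """Unweighted toy version: heritability estimate from the regression slope.

    Assumes slope = n_samples * h2 / n_variants (simplified model, no weighting).
    """
    slope, intercept = fit_line(ld_scores, chi2)
    return slope * n_variants / n_samples, intercept


# Hypothetical noise-free toy data generated from h2 = 0.4.
N, M, true_h2 = 100_000, 1_000_000, 0.4
ld_scores = [10.0, 50.0, 100.0, 200.0]
chi2 = [1.0 + N * true_h2 * l / M for l in ld_scores]

h2_estimate, intercept = ldsc_heritability(chi2, ld_scores, N, M)
```

Running the same fit separately on female-only and male-only summary statistics would give sex-specific estimates of the kind compared in the thesis. An intercept close to 1 indicates little inflation from confounding under this simplified model.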
  • Ba, Yue (Helsingin yliopisto, 2021)
    Ringed seals (Pusa hispida) and grey seals (Halichoerus grypus) are known to have hybridized in captivity despite belonging to different taxonomic genera. Earlier genetic analyses have indicated hybridization in the wild, and the resulting introgression of genetic material across species boundaries could potentially explain the intermediate phenotypes observed, e.g., in their dentition. Introgression can be detected using genome data, but existing inference methods typically require phased genotype data or cannot separate heterozygous and homozygous introgression tracts. In my thesis, I will present a method based on Hidden Markov Models (HMM) to identify genomic regions with a high density of single nucleotide variants (SNVs) of foreign ancestry. Unlike other methods, my method can use unphased genotype data and can separate heterozygous and homozygous introgression tracts. I will compare the method to an alternative method and assess it with simulated data in terms of precision and recall. Then, I will apply it to genome data from Baltic ringed seals and grey seals to search for introgression. Finally, I will discuss future directions for improving the method.
  • Niinikoski, Eerik (Helsingin yliopisto, 2020)
    The aim of this thesis is to predict the total career racing performance of Finnish trotter horses by using the trotters’ early career racing performance and other early career variables. This thesis presents a brief introduction to harness racing and the horses used in Finnish trotting sport. The data is presented and prepared for prediction, with descriptive statistics shown in tables and figures. The machine learning method of random forests for regression is introduced and used in the predictions. After training the model, this thesis presents the prediction accuracy and the most important variables in predicting total career racing performance for both the Finnhorse and the Finnish Standardbred trotter populations. Finally, the author discusses the shortcomings and possible improvements for future research. The data for this thesis was provided by the Finnish trotting and breeding association (Suomen Hippos ry) and included all information on harness races run in Finland from 1984 to the end of 2019. From almost three million rows, the data was summarised into a table of 46704 rows of trotters that started their careers in the three earliest allowed age groups. A total of 37 independent variables were used to predict three outcomes, total career earnings, total number of career starts and total number of career first placings, as separate models. The predictors are derived from other studies that estimate the environmental and genetic factors of a trotter's racing performance. The three models performed poorly to moderately, with total earnings having the highest prediction accuracy. The model predicted larger amounts of earnings quite well, but tended to predict some earnings when there were in fact none. Prediction accuracy for the total number of starts was poor, especially when the true number of starts was low. The model that predicted the total number of career first placings performed the worst.
This can partially be explained by the fact that winning is a rare event for a trotter in general. The models fit better for Finnish Standardbred trotters than for Finnhorse trotters. This thesis works as a good basis for similar future research, where massive amounts of data and machine learning are used to predict a trotter’s career, racing performance or other factors. The results show that predicting total career racing performance as a classification problem could be a better fit than regression. Suitable classes, as well as possibly better predictors and suitable imputation of missing values, should be determined in consultation with experts in harness racing.