Browsing by Subject "Datatieteen maisteriohjelma"


Now showing items 1-20 of 43
  • Rämö, Miia (Helsingin yliopisto, 2020)
    In news agencies, there is a growing interest in automated journalism. The majority of the systems applied are template- or rule-based, as they are expected to produce accurate and fluent output transparently. However, this approach often leads to output that lacks variety. To overcome this issue, I propose two approaches. In the lexicalization approach, new words are included in the sentences, and in the relexicalization approach, some existing words are replaced with synonyms. Both approaches utilize contextual word embeddings for finding suitable words. Furthermore, the above approaches require linguistic resources, which are only available for high-resource languages. Thus, I present variants of the (re)lexicalization approaches that allow their utilization for low-resource languages. These variants utilize cross-lingual word embeddings to access the linguistic resources of a high-resource language. The high-resource variants achieved promising results, although the sampling of words should be further enhanced to improve reliability. The low-resource variants showed some promising results, but the quality suffered from the complex morphology of the example language. This is the clear next issue to address, and resolving it is expected to significantly improve the results.
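    A minimal sketch of the relexicalization idea, using static word vectors and cosine similarity for brevity (the thesis uses contextual embeddings; the toy vocabulary and vectors below are invented for illustration):

```python
import numpy as np

# Hypothetical pre-loaded embeddings: word -> vector.
embeddings = {
    "rose": np.array([0.9, 0.1, 0.3]),
    "climbed": np.array([0.8, 0.2, 0.4]),
    "increased": np.array([0.85, 0.15, 0.35]),
    "fell": np.array([-0.7, 0.3, 0.1]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def relexicalize(word, vocab, k=1):
    """Propose the k nearest neighbours of `word` in embedding space
    as replacement candidates."""
    target = vocab[word]
    candidates = [(w, cosine(target, vec)) for w, vec in vocab.items() if w != word]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in candidates[:k]]

print(relexicalize("rose", embeddings))  # ['increased']
```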
  • Tiittanen, Henri (Helsingin yliopisto, 2019)
    Estimating the error level of models is an important task in machine learning. If the data used is independent and identically distributed, as is usually assumed, there exist standard methods to estimate the error level. However, if the data distribution changes, a phenomenon known as concept drift, those methods may not work properly anymore. Most existing methods for detecting concept drift focus on the case in which the ground truth values are immediately known. In practice, that is often not the case. Even when the ground truth is unknown, a certain type of concept drift called virtual concept drift can be detected. In this thesis we present a method called drifter for estimating the error level of arbitrary regression functions when the ground truth is not known. Concept drift detection is a straightforward application of error level estimation. Error-level-based concept drift detection can be more useful than traditional approaches based on direct distribution comparison, since only changes that affect the error level are detected. In this work we describe the drifter algorithm in detail, including its theoretical basis, and present an experimental evaluation of its performance in virtual concept drift detection on multiple synthetic and real-world datasets and multiple regression functions. Our experiments show that the drifter algorithm can be used to detect virtual concept drift with reasonable accuracy.
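    The drifter algorithm itself is not reproduced here; as a crude stand-in, the sketch below flags unlabelled inputs that fall far from the training data, illustrating why virtual drift is detectable without ground truth. Note that this is a direct data comparison, exactly the kind of approach the thesis improves on with error-level estimation; all names and thresholds are invented:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 3))          # inputs the model was trained on

nn = NearestNeighbors(n_neighbors=5).fit(X_train)

def drift_score(X):
    """Mean distance to the 5 nearest training points."""
    dist, _ = nn.kneighbors(X)
    return dist.mean(axis=1)

# Calibrate a threshold on the training data itself.
threshold = np.quantile(drift_score(X_train), 0.95)

X_new = rng.normal(loc=2.0, size=(200, 3))   # shifted inputs, labels unknown
flagged = drift_score(X_new) > threshold
print(f"{flagged.mean():.0%} of new points flagged")
```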
  • Huertas, Andres (Helsingin yliopisto, 2020)
    Investment funds are continuously looking for new technologies and ideas to enhance their results. Lately, with the success observed in other fields, wealth managers are taking a closer look at machine learning methods. Even if the use of ML is not entirely new in finance, leveraging new techniques has proved to be challenging and few funds succeed in doing so. The present work explores the use of reinforcement learning algorithms for portfolio management in the stock market. The stochastic nature of stocks is well known, and aiming to predict the market is unrealistic; nevertheless, the question of how to use machine learning to find useful patterns in the data that enable small market edges remains open. Based on the ideas of reinforcement learning, a portfolio optimization approach is proposed. RL agents are trained to trade in a stock exchange, using portfolio returns as rewards for their RL optimization problem, thus seeking optimal resource allocation. For this purpose, a set of 68 stock tickers on the Frankfurt exchange was selected, and two RL methods were applied, namely Advantage Actor-Critic (A2C) and Proximal Policy Optimization (PPO). Their performance was compared against three commonly traded ETFs (exchange-traded funds) to assess the algorithms' ability to generate returns compared to real-life investments. Both algorithms were able to achieve positive returns in a year of testing (5.4% and 9.3% for A2C and PPO respectively; a European ETF (VGK, Vanguard FTSE Europe Index Fund) reported 9.0% returns for the same period) as well as healthy risk-to-return ratios. The results do not aim to be financial advice or trading strategies, but rather to explore the potential of RL for studying small to medium size stock portfolios.
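    A toy sketch of the reward structure described above, with simulated returns standing in for real Frankfurt tickers (this is not the thesis's A2C/PPO setup, just the portfolio-return-as-reward idea):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated daily returns for 4 hypothetical tickers over 250 trading days.
returns = rng.normal(loc=0.0003, scale=0.01, size=(250, 4))

def step_reward(weights, day_returns):
    """One environment step: the agent's action is a portfolio allocation,
    and the reward is the log portfolio return realised that day."""
    weights = np.clip(weights, 0, None)
    weights = weights / weights.sum()          # long-only, fully invested
    return np.log1p(weights @ day_returns)

# A uniform-allocation agent as a baseline policy.
log_wealth = sum(step_reward(np.ones(4), r) for r in returns)
print(f"annual return: {np.expm1(log_wealth):.1%}")
```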
  • Comănescu, Andrei-Daniel (Helsingin yliopisto, 2020)
    Social networks represent a public forum of discussion for various topics, some of them controversial. Twitter is such a social network; it acts as a public space where discourse occurs. In recent years the role of social networks in information spreading has increased, as have the fears regarding the increasingly polarised discourse on social networks, caused by the tendency of users to avoid exposure to opposing opinions while increasingly interacting only with like-minded individuals. This work looks at controversial topics on Twitter, over a long period of time, through the prism of political polarisation. We use the daily interactions, and the underlying structure of the whole conversation, to create daily graphs that are then used to obtain daily graph embeddings. We estimate the political ideologies of the users that are represented in the graph embeddings. By using the political ideologies of users and the daily graph embeddings, we offer a series of methods that allow us to detect and analyse changes in the political polarisation of the conversation. This enables us to conclude that, during our analysed time period, the overall polarisation levels for our examined controversial topics have stagnated. We also explore the effects of topic-related controversial events on the conversation, thus revealing their short-term effect on the conversation as a whole. Additionally, the linkage between increased interest in a topic and increased political polarisation is explored. Our findings reveal that as the interest in the controversial topic increases, so does the political polarisation.
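    One simple polarisation proxy consistent with the description above is the share of interactions that cross ideological lines; the toy graph and ideology labels below are invented:

```python
import networkx as nx

# Toy daily interaction graph; the "ideology" labels would in practice
# come from the estimation step described above.
G = nx.Graph()
ideology = {"a": 1, "b": 1, "c": -1, "d": -1, "e": 1}
G.add_edges_from([("a", "b"), ("a", "e"), ("c", "d"), ("b", "c")])

def cross_edge_fraction(G, ideology):
    """Share of interactions that cross ideological lines; a low value
    suggests users mostly interact with like-minded users."""
    cross = sum(1 for u, v in G.edges if ideology[u] != ideology[v])
    return cross / G.number_of_edges()

print(cross_edge_fraction(G, ideology))  # 0.25 here
```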
  • Kotola, Mikko Markus (Helsingin yliopisto, 2021)
    Image captioning is the task of generating a natural language description of an image. The task requires techniques from two research areas, computer vision and natural language generation. This thesis investigates the architectures of leading image captioning systems. The research question is: What components and architectures are used in state-of-the-art image captioning systems, and how could image captioning systems be further improved by utilizing improved components and architectures? Five openly reported leading image captioning systems are investigated in detail: Attention on Attention, the Meshed-Memory Transformer, the X-Linear Attention Network, the Show, Edit and Tell method, and Prophet Attention. The investigated leading image captioners all rely on the same object detector, the Faster R-CNN based Bottom-Up object detection network. Four out of five also rely on the same backbone convolutional neural network, ResNet-101. Both the backbone and the object detector could be improved by using newer approaches. The best choice among CNN-based object detectors is currently EfficientDet with an EfficientNet backbone. A completely transformer-based approach with a Vision Transformer backbone and a Detection Transformer object detector is a fast-developing alternative. The main area of variation between the leading image captioners is in the types of attention blocks used in the high-level image encoder, the type of natural language decoder, and the connections between these components. The best architectures and attention approaches to implement these components are currently the Meshed-Memory Transformer and the bilinear pooling approach of the X-Linear Attention Network. Implementing the Prophet Attention approach of using the future words available in the supervised training phase to guide the decoder attention further improves performance. Pretraining the backbone using large image datasets is essential to reach semantically correct object detections and object features. The feature richness and dense annotation of data are equally important in training the object detector.
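    All of the attention variants discussed above elaborate on the same primitive, scaled dot-product attention; a minimal NumPy sketch for reference:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, the building block that the surveyed
    architectures refine (attention-on-attention, meshed memory, and
    bilinear/X-linear pooling are all elaborations of this)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

# Toy example: 3 decoder queries attending over 5 image-region features.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
print(attention(Q, K, V).shape)  # (3, 8)
```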
  • Ilse, Tse (Helsingin yliopisto, 2019)
    Background: Electroencephalography (EEG) depicts electrical activity in the brain, and can be used in clinical practice to monitor brain function. In neonatal care, physicians can use continuous bedside EEG monitoring to determine the cerebral recovery of newborns who have suffered birth asphyxia, which creates a need for frequent, accurate interpretation of the signals over a period of monitoring. An automated grading system can aid physicians in the Neonatal Intensive Care Unit by automatically distinguishing between different grades of abnormality in the neonatal EEG background activity patterns. Methods: This thesis describes using a support vector machine as a base classifier to classify seven grades of EEG background pattern abnormality in data provided by the BAby Brain Activity (BABA) Center in Helsinki. We are particularly interested in reconciling the manual grading of EEG signals by independent graders, and we analyze the inter-rater variability of EEG graders by comparing a classifier built using selected epochs graded in consensus to a classifier using full-duration recordings. Results: The inter-rater agreement score between the two graders was κ=0.45, which indicated moderate agreement between the EEG grades. The most common grade of EEG abnormality was grade 0 (continuous), which made up 63% of the epochs graded in consensus. We first trained two baseline reference models using the full-duration recordings and the labels of the two graders, which achieved 71% and 57% accuracy. We achieved 82% overall accuracy in classifying selected patterns graded in consensus into seven grades using a multi-class classifier, though this model did not outperform the two baseline models when evaluated with the respective graders' labels. In addition, we achieved 67% accuracy in classifying all patterns from the full-duration recordings using a multilabel classifier.
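    A hedged sketch of the classification setup: a multi-class SVM evaluated with accuracy and Cohen's kappa, the same statistic used above for inter-rater agreement (synthetic stand-in data, not the BABA Center EEG data):

```python
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score, accuracy_score
from sklearn.datasets import make_classification

# Stand-in features; real inputs would be EEG epoch features, labels 0-6.
X, y = make_classification(n_samples=700, n_features=20, n_informative=10,
                           n_classes=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf").fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred))
# The same kappa statistic quantifies agreement between two human graders.
print("kappa vs. labels:", cohen_kappa_score(y_te, pred))
```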
  • Kovanen, Veikko (Helsingin yliopisto, 2020)
    Real estate appraisal, or property valuation, requires strong expertise in order to be performed successfully, which makes it a costly process. However, with structured data on historical transactions, the use of machine learning (ML) enables automated, data-driven valuation which is instant, virtually costless and potentially more objective compared to traditional methods. Yet, fully ML-based appraisal is not widely used in real business applications, as the existing solutions are not sufficiently accurate and reliable. In this study, we introduce an interpretable ML model for real estate appraisal using hierarchical linear modelling (HLM). The model is learned and tested with an empirical dataset of apartment transactions in the Helsinki area, collected during the past decade. As a result, we introduce a model which has competitive predictive performance, while simultaneously being explainable and reliable. The main outcome of this study is the observation that hierarchical linear modelling is a highly promising approach for automated real estate appraisal. The key advantage of HLM over alternative learning algorithms is its balance of performance and simplicity: the algorithm is complex enough to avoid underfitting but simple enough to be interpretable and easy to productize. In particular, the ability of these models to output complete probability distributions quantifying the uncertainty of the estimates makes them suitable for actual business use cases where high reliability is required.
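    A minimal sketch of a hierarchical linear model for apartment prices using statsmodels, with a random intercept and slope per district; the districts, effect sizes and noise levels below are invented for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Toy transactions: price per m2 varies by district (the grouping level).
districts = rng.choice(["Kallio", "Töölö", "Vuosaari"], size=300)
effect = {"Kallio": 5500, "Töölö": 7500, "Vuosaari": 3500}
sqm = rng.uniform(25, 120, size=300)
price = np.array([effect[d] for d in districts]) * sqm + rng.normal(0, 20000, 300)

df = pd.DataFrame({"price": price, "sqm": sqm, "district": districts})

# Random intercept and slope on sqm per district: a hierarchical linear
# model in the spirit of the appraisal model described above.
model = smf.mixedlm("price ~ sqm", df, groups=df["district"],
                    re_formula="~sqm").fit()
print(model.summary())
```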
  • Anttila, Jesse (Helsingin yliopisto, 2020)
    Visual simultaneous localization and mapping (visual SLAM) is a method for consistent self-contained localization using visual observations. Visual SLAM can produce very precise pose estimates without any specialized hardware, enabling applications such as AR navigation. The use of visual SLAM in very large areas and over long distances is not presently possible due to a number of significant scalability issues. In this thesis, these issues are discussed and solutions for them explored, culminating in a concept for a real-time city-scale visual SLAM system. A number of avenues for future work towards a practical implementation are also described.
  • Laaksonen, Jenniina (Helsingin yliopisto, 2021)
    Understanding customer behavior is one of the key elements in any thriving business. Dividing customers into different groups based on their distinct characteristics can help significantly when designing the service. Understanding the unique needs of customer groups is also the basis for modern marketing. The aim of this study is to explore what types of customer groups exist in an entertainment service business. In this study, customer segmentation is conducted with k-prototypes, a variation of k-means clustering. K-prototypes is a machine learning approach that partitions a group of observations into subgroups which have little variation within the group and clear differences when compared to other subgroups. The advantage of k-prototypes is that it can process both categorical and numeric data efficiently. The results show that there are significant and meaningful differences between the customer groups emerging from k-prototypes clustering. These customer groups can be targeted based on their unique characteristics, and their reactions to different types of marketing actions vary. The unique characteristics of the customer groups can be utilized to target marketing actions better. Other ways to benefit from customer segmentation include personalized views, recommendations, and support for strategy-level decision making when designing the service. Many of these require further technical development or a deeper understanding of the segments. Data selection, as well as the quality of the data, has an impact on the results, and both should be considered carefully when deciding future actions on customer segmentation.
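    A small sketch of k-prototypes on mixed-type customer records, assuming the third-party kmodes package; the records and column roles are invented:

```python
import numpy as np
from kmodes.kprototypes import KPrototypes  # pip install kmodes

# Toy customer records: [monthly_spend, visits, subscription_type, region]
X = np.array([
    [25.0, 2, "basic",   "north"],
    [80.0, 9, "premium", "south"],
    [22.0, 1, "basic",   "north"],
    [95.0, 8, "premium", "east"],
], dtype=object)

# Columns 2 and 3 are categorical; k-prototypes mixes k-means distance
# on numeric columns with simple matching distance on categorical ones.
kp = KPrototypes(n_clusters=2, random_state=0)
labels = kp.fit_predict(X, categorical=[2, 3])
print(labels)  # e.g. [0 1 0 1]
```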
  • Koivisto, Teemu (Helsingin yliopisto, 2021)
    Programming courses often receive large quantities of program code submissions to exercises which, due to their large number, are graded automatically, with students receiving automatic feedback as well. Teachers might never review these submissions, therefore losing a valuable source of insight into student programming patterns. This thesis researches how these submissions could be reviewed efficiently using a software system, and a prototype, CodeClusters, was developed as an additional contribution of this thesis. CodeClusters' design goals are to allow the exploration of the submissions and specifically the finding of higher-level patterns that could be used to provide feedback to students. Its main features are full-text search and an n-gram similarity detection model that can be used to cluster the submissions. Design science research is applied to evaluate CodeClusters' design and to guide the next iteration of the artifact, while qualitative analysis, namely thematic synthesis, is used to evaluate the problem context as well as the ideas of using software for reviewing and providing clustered feedback. The study method was interviews conducted with teachers who had experience teaching programming courses. Teachers were intrigued by the ability to review submitted student code and to provide more tailored feedback to students. The system, while still a prototype, is considered worthwhile to experiment with on programming courses. A tool for analyzing and exploring submissions seems important for enabling teachers to better understand how students have solved the exercises. Providing additional feedback can be beneficial to students, yet the feedback should be valuable and the students incentivized to read it.
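    A hedged sketch of n-gram-based similarity clustering of submissions; CodeClusters' actual model may differ in details such as tokenization and the clustering algorithm:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

submissions = [
    "for i in range(n): total += i",
    "for j in range(n): total = total + j",
    "total = sum(range(n))",
]

# Character n-grams are robust to renamed identifiers in student code.
vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))
X = vec.fit_transform(submissions).toarray()

labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
print(labels)  # the two loop-based solutions should share a cluster
```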
  • Nissilä, Viivi (Helsingin yliopisto, 2020)
    Origin-Destination (OD) data is a crucial part of price estimation in the aviation industry; an OD flight is any number of flights a passenger takes in a single journey. OD data is a complex set of data that is both a flow and a multidimensional type of data. In this work, the focus is on designing interactive visualization techniques to support user exploration of OD data. The thesis aims to find which of two menu designs suits OD data visualization better: a breadth-first or a depth-first menu design. The two menus follow Shneiderman's Task by Data Type Taxonomy, a broader version of the Information Seeking Mantra. The first menu design is a parallel, breadth-first menu layout. The layout shows the variables in an open layout and is closer to the original data matrix. The second menu design is a hierarchical, depth-first layout. This layout is derived from the semantics of the data and is more compact in terms of screen space. The two menu designs are compared in an online survey study conducted with potential end users. The results of the online survey study are inconclusive, and are therefore complemented with an expert review. Both the survey study and the expert review show that the Sankey graph is a good visualization type for this work, but the interaction of the two menu designs requires further improvement. Both menu designs received positive and negative feedback in the expert review. For future work, a solution that combines the positives of the two designs could be considered. ACM Computing Classification System (CCS): Human-centered computing → Visualization → Empirical studies in visualization; Human-centered computing → Interaction design → Interaction design process and methods → Interface design prototyping
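    For reference, a minimal Sankey diagram of toy OD flows with Plotly; the airports and passenger counts are invented:

```python
import plotly.graph_objects as go

# Toy OD flows: passengers from origin airports to destinations.
labels = ["HEL", "ARN", "CDG", "JFK"]
fig = go.Figure(go.Sankey(
    node=dict(label=labels, pad=15),
    link=dict(
        source=[0, 0, 1],    # indices into `labels` (origins)
        target=[2, 3, 3],    # destinations
        value=[120, 80, 60], # passenger counts
    ),
))
fig.show()
```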
  • Mäkinen, Sasu (Helsingin yliopisto, 2021)
    Deploying machine learning models is found to be a massive issue in the field. DevOps and Continuous Integration and Continuous Delivery (CI/CD) have proven to streamline and accelerate deployments in the field of software development. Creating CI/CD pipelines for software that includes elements of Machine Learning (MLOps) poses unique problems, and trail-blazers in the field solve them with the use of proprietary tooling, often offered by cloud providers. In this thesis, we describe the elements of MLOps. We study the requirements for automating the CI/CD of Machine Learning systems in the MLOps methodology. We study whether it is feasible to create a state-of-the-art MLOps pipeline with existing open-source and cloud-native tooling in a cloud-provider-agnostic way. We designed an extendable and cloud-native pipeline covering most of the CI/CD needs of a Machine Learning system. We motivated why Machine Learning systems should be included in the DevOps methodology. We studied what unique challenges machine learning brings to CI/CD pipelines, production environments and monitoring. We analyzed the pipeline's design, architecture, and implementation details and its applicability and value to Machine Learning projects. We evaluate our solution as a promising MLOps pipeline that manages to solve many issues of automating a reproducible Machine Learning project and its delivery to production. We designed it as a fully open-source solution that is relatively cloud-provider agnostic. Configuring the pipeline to fit client needs relies on easy-to-use declarative configuration languages (YAML, JSON) that require minimal learning overhead.
  • Rannisto, Meeri (Helsingin yliopisto, 2020)
    Bat monitoring is commonly based on audio analysis. By collecting audio recordings from large areas and analysing their content, it is possible to estimate the distributions of bat species and changes in them. It is easy to collect a large number of audio recordings by leaving automatic recording units in nature and collecting them later. However, it takes a lot of time and effort to analyse these recordings, so there is a great need for automatic tools. We developed a program for detecting bat calls automatically from audio recordings. The program is designed for recordings that are collected in Finland with the AudioMoth recording device. Our method is based on a median clipping method that has previously shown promising results in the field of bird song detection. We add several modifications to the basic method in order to make it work well for our purpose. We use real-world field recordings that we have annotated to evaluate the performance of the detector and compare it to two other freely available programs (Kaleidoscope and Bat Detective). Our method showed good results and achieved the best F2-score in the comparison.
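    A minimal sketch of the median clipping idea on a synthetic spectrogram: keep only cells that stand out against both their frequency-band and time-frame medians (the thresholds and test signal are invented; the thesis adds several modifications on top of this baseline):

```python
import numpy as np
from scipy.signal import spectrogram

rng = np.random.default_rng(0)
fs = 250_000                         # AudioMoth-style high sample rate
t = np.arange(0, 0.5, 1 / fs)
# Toy recording: noise plus a short 45 kHz pulse standing in for a call.
audio = rng.normal(scale=0.1, size=t.size)
pulse = (t > 0.2) & (t < 0.21)
audio[pulse] += np.sin(2 * np.pi * 45_000 * t[pulse])

f, times, S = spectrogram(audio, fs=fs)

# Median clipping: keep cells exceeding 3x both the row (frequency band)
# and column (time frame) medians.
mask = (S > 3 * np.median(S, axis=1, keepdims=True)) & \
       (S > 3 * np.median(S, axis=0, keepdims=True))
print("spectrogram cells kept:", mask.mean())
```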
  • Räisä, Ossi (Helsingin yliopisto, 2021)
    Differential privacy has over the past decade become a widely used framework for privacy-preserving machine learning. At the same time, Markov chain Monte Carlo (MCMC) algorithms, particularly Metropolis-Hastings (MH) algorithms, have become an increasingly popular method of performing Bayesian inference. Surprisingly, their combination has not received much attention in the literature. This thesis introduces the existing research on differentially private MH algorithms, proves tighter privacy bounds for them using recent developments in differential privacy, and develops two new differentially private MH algorithms: an algorithm using subsampling to lower privacy costs, and a differentially private variant of the Hamiltonian Monte Carlo algorithm. The privacy bounds of both new algorithms are proved, and convergence to the exact posterior is proven for the latter. The performance of both the old and the new algorithms is compared on several Bayesian inference problems, revealing that none of the algorithms is clearly better than the others, but subsampling is likely only useful for lowering computational costs.
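    For orientation, a plain (non-private) Metropolis-Hastings sampler for a Gaussian mean; the acceptance test marked below is the quantity that differentially private variants must privatise, for example by subsampling and noising the log-likelihood ratio. The noise calibration that makes this private is exactly the hard part and is not shown:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, size=100)

def log_posterior(theta):
    # N(0, 10^2) prior, Gaussian likelihood with known unit variance.
    return -theta**2 / 200 - 0.5 * np.sum((data - theta) ** 2)

samples, theta = [], 0.0
for _ in range(5000):
    prop = theta + rng.normal(scale=0.3)
    # In DP variants, this acceptance test is what gets privatised.
    if np.log(rng.uniform()) < log_posterior(prop) - log_posterior(theta):
        theta = prop
    samples.append(theta)

print("posterior mean ~", np.mean(samples[1000:]))
```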
  • Joosten, Rick (Helsingin yliopisto, 2020)
    In the past two decades, an increasing number of discussions have been held via online platforms such as Facebook or Reddit. The most common disruptors of these discussions are trolls. Traditional trolls try to derail the discussion into a nonconstructive argument. One strategy to achieve this is to give asymmetric responses, responses that don't follow the conventional patterns. In this thesis we propose a modern machine learning NLP method called ULMFiT to automatically detect the discourse acts of online forum posts in order to detect these conversational patterns. ULMFiT fine-tunes the language model before training its classifier in order to create a more accurate language representation of the domain language. This task of discourse act recognition is unique, since it attempts to classify the pragmatic role of each post within a conversation, in contrast to the functional role targeted by tasks such as question-answer retrieval, sentiment analysis, or sarcasm detection. Furthermore, most discourse act recognition research has focused on synchronous conversations where all parties can directly interact with each other, while this thesis looks at asynchronous online conversations. Trained on a dataset of Reddit discussions, the proposed model achieves a Matthews correlation coefficient of 0.605 and an F1-score of 0.69 in predicting the discourse acts. Other experiments also show that this model is effective at question-answer classification, and that language model fine-tuning has a positive effect both on classification performance and on the required size of the training data. These results could be beneficial for current trolling detection systems.
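    The reported metrics can be computed as below; the toy labels are invented, and the multi-class F1 averaging scheme is an assumption since the abstract does not specify it:

```python
from sklearn.metrics import matthews_corrcoef, f1_score

# Toy predictions over discourse-act labels (question, answer, ...).
y_true = ["question", "answer", "elaboration", "answer", "question", "agreement"]
y_pred = ["question", "answer", "answer",      "answer", "question", "agreement"]

print("MCC:", matthews_corrcoef(y_true, y_pred))
# Macro averaging is just an assumption here; weighted is also common.
print("F1 :", f1_score(y_true, y_pred, average="macro"))
```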
  • Lange, Moritz Johannes (Helsingin yliopisto, 2020)
    In the context of data science and machine learning, feature selection is a widely used technique that focuses on reducing the dimensionality of a dataset. It is commonly used to improve model accuracy by preventing data redundancy and over-fitting, but can also be beneficial in applications such as data compression. The majority of feature selection techniques rely on labelled data. In many real-world scenarios, however, data is only partially labelled and thus requires so-called semi-supervised techniques, which can utilise both labelled and unlabelled data. While unlabelled data is often obtainable in abundance, labelled datasets are smaller and potentially biased. This thesis presents a method called distribution matching, which offers a way to do feature selection in a semi-supervised setup. Distribution matching is a wrapper method, which trains models to select features that best affect model accuracy. It addresses the problem of biased labelled data directly by incorporating unlabelled data into a cost function which approximates expected loss on unseen data. In experiments, the method is shown to successfully minimise the expected loss transparently on a synthetic dataset. Additionally, a comparison with related methods is performed on a more complex EMNIST dataset.
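    For contrast with the semi-supervised method above, a plain supervised wrapper selector in scikit-learn; distribution matching differs precisely in folding unlabelled data into the selection cost, which this sketch does not do:

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)

# A wrapper method scores candidate feature subsets by actually training
# the model on them, rather than ranking features by a filter statistic.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000), n_features_to_select=5, cv=3
)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```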
  • Säkkinen, Niko (Helsingin yliopisto, 2020)
    Predicting patient deterioration in an Intensive Care Unit (ICU) effectively is a critical health care task serving patient health and resource allocation. At times, the task may be highly complex for a physician, yet high-stakes and time-critical decisions need to be made based on it. In this work, we investigate the ability of a set of machine learning models to algorithmically predict future occurrence of in-hospital death based on Electronic Health Record (EHR) data of ICU patients. For one, we assess the generalizability of the models. We do this by evaluating the models on hospitals whose data has not been considered when training the models. For another, we consider the case in which we have access to some EHR data for the patients treated at a hospital of interest. In this setting, we assess how EHR data from other hospitals can be used in the optimal way to improve the prediction accuracy. This study is important for the deployment and integration of such predictive models in practice, e.g., for real-time algorithmic deterioration prediction for clinical decision support. In order to address these questions, we use the eICU collaborative research database, a database containing EHRs of patients treated at a heterogeneous collection of hospitals in the United States. In this work, we use the patient demographics, vital signs and Glasgow coma score as the predictors. We devise and describe three computational experiments to test the generalization in different ways. The models used are the random forest, gradient boosted trees and the long short-term memory network. In our first experiment concerning the generalization, we show that, with the chosen limited set of predictors, the models generalize reasonably across hospitals and only a small data mismatch is observed. Moreover, with this setting, our second experiment shows that the model performance does not significantly improve when increasing the heterogeneity of the training set. Given these observations, our third experiment shows that
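    A sketch of the leave-one-hospital-out evaluation idea using grouped cross-validation (synthetic stand-in data, not the eICU database):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GroupKFold
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))             # stand-in vitals/demographics
y = (X[:, 0] + rng.normal(size=1000)) > 1  # stand-in mortality label
hospital = rng.integers(0, 5, size=1000)   # which hospital each stay is from

# Each fold trains on four hospitals and tests on the held-out fifth,
# mimicking deployment at a hospital whose data was never seen.
for train, test in GroupKFold(n_splits=5).split(X, y, groups=hospital):
    clf = GradientBoostingClassifier().fit(X[train], y[train])
    auc = roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1])
    print(f"held-out hospital AUC: {auc:.2f}")
```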
  • Lavikka, Kari (Helsingin yliopisto, 2020)
    Visualization is an indispensable method in the exploration of genomic data. However, the current state of the art in genome browsers – a class of interactive visualization tools – limits the exploration by coupling the visual representations with specific file formats. Because the tools do not support the exploration of the visualization design space, they are difficult to adapt to atypical data. Moreover, although the tools provide interactivity, the implementations are often rudimentary, encumbering the exploration of the data. This thesis introduces GenomeSpy, an interactive genome visualization tool that improves upon the current state of the art by providing better support for exploration. The tool uses a visualization grammar that allows for implementing novel visualization designs, which can display the underlying data more effectively. Moreover, the tool implements GPU-accelerated interactions that better support navigation in the genomic space. For instance, smoothly animated transitions between loci or sample sets improve the perception of causality and help users stay in the flow of exploration. The expressivity of the visualization grammar and the benefit of fluid interactions are validated with two case studies. The case studies demonstrate visualization of high-grade serous ovarian cancer data at different analysis phases. First, GenomeSpy is used to create a tool for scrutinizing raw copy-number variation data along with segmentation results. Second, the segmentations, along with point mutations, are used in a GenomeSpy-based multi-sample visualization that allows for exploring and comparing both multiple data dimensions and samples at the same time. Although the focus has been on cancer research, the tool could be applied to other domains as well.
  • Lauha, Patrik (Helsingin yliopisto, 2021)
    Automatic bird sound recognition has been studied by computer scientists since the late 1990s. Various techniques have been exploited, but no general method that could come even close to matching the performance of a human expert has been developed yet. In this thesis, the subject is approached by reviewing alternative methods for cross-correlation as a similarity measure between two signals in template-based bird sound recognition models. Template-specific binary classification models are fit with different methods and their performance is compared. The methods considered are template averaging and processing before applying cross-correlation, the use of texture features as additional predictors, and feature extraction through transfer learning with convolutional neural networks. It is shown that the classification performance of template-specific models can be improved by template refinement and by utilizing neural networks' ability to automatically extract relevant features from bird sound spectrograms.
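    The cross-correlation baseline that the thesis compares against, as a minimal sketch on synthetic spectrograms:

```python
import numpy as np
from scipy.signal import correlate2d

def template_match(spec, template):
    """Peak normalised cross-correlation between a spectrogram and a
    template, the baseline similarity measure discussed above."""
    s = (spec - spec.mean()) / spec.std()
    t = (template - template.mean()) / template.std()
    corr = correlate2d(s, t, mode="valid") / t.size
    return corr.max()

rng = np.random.default_rng(0)
spec = rng.normal(size=(64, 200))
template = spec[20:30, 50:80].copy()   # a call excised from the recording
print(template_match(spec, template))  # high: the template occurs verbatim
```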
  • Barin Pacela, Vitória (Helsingin yliopisto, 2021)
    Independent Component Analysis (ICA) aims to separate the observed signals into their underlying independent components responsible for generating the observations. Most research in ICA has focused on continuous signals, while the methodology for binary and discrete signals is less developed. Yet, binary observations are equally present in various fields and applications, such as causal discovery, signal processing, and bioinformatics. In the last decade, Boolean OR and XOR mixtures have been shown to be identifiable by ICA, but such models suffer from limited expressivity, calling for new methods to solve the problem. In this thesis, "Independent Component Analysis for Binary Data", we estimate the mixing matrix of ICA from binary observations and an additionally observed auxiliary variable by employing a linear model inspired by the Identifiable Variational Autoencoder (iVAE), which exploits the non-stationarity of the data. The model is optimized with a gradient-based algorithm that uses second-order optimization with limited memory, resulting in training times on the order of seconds for the particular cases studied. We investigate which conditions can lead to the reconstruction of the mixing matrix, concluding that the method is able to identify the mixing matrix when the number of observed variables is greater than the number of sources. In such cases, the linear binary iVAE can reconstruct the mixing matrix up to order and scale indeterminacies, which are considered in the evaluation with the Mean Cosine Similarity Score. Furthermore, the model can reconstruct the mixing matrix even with a limited sample size. Therefore, this work demonstrates the potential for applications in real-world data and also offers a possibility to study and formalize identifiability in future work. In summary, the most important contributions of this thesis are the empirical study of the conditions that enable the mixing matrix reconstruction using the binary iVAE, and the empirical results on the performance and efficiency of the model. The latter was achieved through a new combination of existing methods, including modifications and simplifications of a linear binary iVAE model and the optimization of such a model under limited computational resources.
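    A sketch of scoring a reconstructed mixing matrix up to the order and scale (sign) indeterminacies mentioned above, via an optimal column matching; the exact Mean Cosine Similarity Score in the thesis may differ in details:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mean_cosine_similarity(A_true, A_est):
    """Score an estimated mixing matrix against the ground truth,
    allowing for the column order and sign indeterminacies of ICA."""
    A1 = A_true / np.linalg.norm(A_true, axis=0)
    A2 = A_est / np.linalg.norm(A_est, axis=0)
    cos = np.abs(A1.T @ A2)                  # |cosine| of all column pairs
    rows, cols = linear_sum_assignment(-cos) # best column matching
    return cos[rows, cols].mean()

# 3 observed variables, 2 sources (more observations than sources).
A = np.array([[1.0, 0.2], [0.1, 1.0], [0.5, 0.3]])
A_hat = -A[:, ::-1] * 2.0                    # permuted, flipped, rescaled
print(mean_cosine_similarity(A, A_hat))      # 1.0
```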