Browsing by Subject "Algoritmer"

Sort by: Order: Results:

Now showing items 1-20 of 24
  • Avikainen, Jari (Helsingin yliopisto, 2019)
    This thesis presents a wavelet-based method for detecting moments of fast change in the textual contents of historical newspapers. The method works by generating time series of the relative frequencies of different words in the newspaper contents over time, and calculating their wavelet transforms. Wavelet transform is essentially a group of transformations describing the changes happening in the original time series at different time scales, and can therefore be used to pinpoint moments of fast change in the data. The produced wavelet transforms are then used to detect fast changes in word frequencies by examining products of multiple scales of the transform. The properties of the wavelet transform and the related multi-scale product are evaluated in relation to detecting various kinds of steps and spikes in different noise environments. The suitability of the method for analysing historical newspaper archives is examined using an example corpus consisting of 487 issues of Uusi Suometar from 1869–1918 and 250 issues of Wiipuri from 1893–1918. Two problematic features in the newspaper data, noise caused by OCR (optical character recognition) errors and uneven temporal distribution of the data, are identified and their effects on the results of the presented method are evaluated using synthetic data. Finally, the method is tested using the example corpus, and the results are examined briefly. The method is found to be adversely affected especially by the uneven temporal distribution of the newspaper data. Without additional processing, or improving the quality of the examined data, a significant amount of the detected steps are due to the noise in the data. Various ways of alleviating the effect are proposed, among other suggested improvements on the system.
  • Wargelin, Matias (Helsingin yliopisto, 2021)
    Musical pattern discovery refers to the automated discovery of important repeated patterns, such as melodies and themes, from music data. Several algorithms have been developed to solve this problem, but evaluating the algorithms has been difficult without proper visualisations of the output of the algorithms. To address this issue a web application named Mupadie was built. Mupadie accepts MIDI music files as input and visualises the outputs of musical pattern discovery algorithms, with implementations of SIATEC and TTWIA built in the application. Other algorithms can be visualised if the algorithm output is uploaded to Mupadie as a JSON file that follows a specified data structure. Using Mupadie, an evaluation of SIATEC and TTWIA was conducted. Mupadie was found to be a useful tool in the qualitative evaluation of these musical pattern discovery algorithms; it helped reveal systematically recurring issues with the discovered patterns, some previously known and some previously undocumented. The findings were then used to suggest improvements to the algorithms.
  • Lehtonen, Leevi (Helsingin yliopisto, 2021)
    Quantum computing has an enormous potential in machine learning, where problems can quickly scale to be intractable for classical computation. A Boltzmann machine is a well-known energy-based graphical model suitable for various machine learning tasks. Plenty of work has already been conducted for realizing Boltzmann machines in quantum computing, all of which have somewhat different characteristics. In this thesis, we conduct a survey of the state-of-the-art in quantum Boltzmann machines and their training approaches. Primarily, we examine variational quantum Boltzmann machine, a specific variant of quantum Boltzmann machine suitable for the near-term quantum hardware. Moreover, as variational quantum Boltzmann machine heavily relies on variational quantum imaginary time evolution, we effectively analyze variational quantum imaginary time evolution to a great extent. Compared to the previous work, we evaluate the execution of variational quantum imaginary time evolution with a more comprehensive collection of hyperparameters. Furthermore, we train variational quantum Boltzmann machines using a toy problem of bars and stripes, representing more multimodal probability distribution than the Bell states and the Greenberger-Horne-Zeilinger states considered in the earlier studies.
  • Harviainen, Juha (Helsingin yliopisto, 2021)
    Computing the permanent of a matrix is a famous #P-hard problem with a wide range of applications. The fastest known exact algorithms for the problem require an exponential number of operations, and all known fully polynomial randomized approximation schemes are rather complicated to implement and have impractical time complexities. The most promising recent advancements on approximating the permanent are based on rejection sampling and upper bounds for the permanent. In this thesis, we improve the current state of the art by developing the deep rejection sampling method, which combines an exact algorithm with the rejection sampling method. The algorithm precomputes a dynamic programming table that tightens the initial upper bound used by the rejection sampling method. In a sense, the table is used to jump-start the sampling process. We give a high probability upper bound for the time complexity of the deep rejection sampling method for random (0, 1)-matrices in which each entry is 1 with probability p. For matrices with p < 1/5, our high probability bound is stronger than in previous work. In addition to that, we empirically observe that our algorithm outperforms earlier rejection sampling methods by testing it with different parameters against other algorithms on multiple classes of matrices. The improvements in sampling times are especially notable in cases in which the ratios of the permanental upper bounds and the exact value of the permanent are huge.
  • Liu, Yang Jr (Helsingin yliopisto, 2020)
    Automatic readability assessment is considered as a challenging task in NLP due to its high degree of subjectivity. The majority prior work in assessing readability has focused on identifying the level of education necessary for comprehension without the consideration of text quality, i.e., how naturally the text flows from the perspective of a native speaker. Therefore, in this thesis, we aim to use language models, trained on well-written prose, to measure not only text readability in terms of comprehension but text quality. In this thesis, we developed two word-level metrics based on the concordance of article text with predictions made using language models to assess text readability and quality. We evaluate both metrics on a set of corpora used for readability assessment or automated essay scoring (AES) by measuring the correlation between scores assigned by our metrics and human raters. According to the experimental results, our metrics are strongly correlated with text quality, which achieve 0.4-0.6 correlations on 7 out of 9 datasets. We demonstrate that GPT-2 surpasses other language models, including the bigram model, LSTM, and bidirectional LSTM, on the task of estimating text quality in a zero-shot setting, and GPT-2 perplexity-based measure is a reasonable indicator for text quality evaluation.
  • Liu, Yang (Helsingin yliopisto, 2020)
    Automatic readability assessment is considered as a challenging task in NLP due to its high degree of subjectivity. The majority prior work in assessing readability has focused on identifying the level of education necessary for comprehension without the consideration of text quality, i.e., how naturally the text flows from the perspective of a native speaker. Therefore, in this thesis, we aim to use language models, trained on well-written prose, to measure not only text readability in terms of comprehension but text quality. In this thesis, we developed two word-level metrics based on the concordance of article text with predictions made using language models to assess text readability and quality. We evaluate both metrics on a set of corpora used for readability assessment or automated essay scoring (AES) by measuring the correlation between scores assigned by our metrics and human raters. According to the experimental results, our metrics are strongly correlated with text quality, which achieve 0.4-0.6 correlations on 7 out of 9 datasets. We demonstrate that GPT-2 surpasses other language models, including the bigram model, LSTM, and bidirectional LSTM, on the task of estimating text quality in a zero-shot setting, and GPT-2 perplexity-based measure is a reasonable indicator for text quality evaluation.
  • Thapa Magar, Purushottam (Helsingin yliopisto, 2021)
    Rapid growth and advancement of next generation sequencing (NGS) technologies have changed the landscape of genomic medicine. Today, clinical laboratories perform DNA sequencing on a regular basis, which is an error prone process. Erroneous data affects downstream analysis and produces fallacious result. Therefore, external quality assessment (EQA) of laboratories working with NGS data is crucial. Validation of variations such as single nucleotide polymor- phism (SNP) and InDels (<50 bp) is fairly accurate these days. However, detection and quality assessment of large changes such as the copy number variation (CNV) continues to be a concern. In this work, we aimed to study the feasibility of an automated CNV concordance analysis for the laboratory EQA services. We benchmarked variants reported by 25 laboratories against the highly curated gold standard for the son (HG002/NA24385) of the askenazim trio from the Personal Genome Project published by the Genome in a Bottle Consortium (GIAB). We employed two methods to conduct concordance of CNVs, the sequence based comparison with Truvari and the in-house exome-based comparison. For deletion calls of two whole genome sequencing (WGS) submissions, Truvari gained a value greater than 88% and 68% for precision and recall respectively. Conversely, the in-house method’s precision and recall score peaked at 39% and 7.9% respectively for one WGS submission for both deletion and duplication calls. The results indicate that automated CNV concordance analysis of the deletion calls for the WGS-based callset might be feasible with Truvari. On the other hand, results for panel-based targeted sequencing for the deletion calls showed precision and recall rates ranging from 0-80% and 0-5.6% respectively with Truvari. The result suggests that automated concordance analysis of CNVs for targeted sequencing remains a challenge. In conclusion, CNV concordance analysis depends on how the sequence data is generated.
  • Porttinen, Peter (Helsingin yliopisto, 2020)
    Computing an edit distance between strings is one of the central problems in both string processing and bioinformatics. Optimal solutions to edit distance are quadratic to the lengths of the input strings. The goal of this thesis is to study a new approach to approximate edit distance. We use a chaining algorithm presented by Mäkinen and Sahlin in "Chaining with overlaps revisited" CPM 2020 implemented verbatim. Building on the chaining algorithm, our focus is on efficiently finding a good set of anchors for the chaining algorithm. We present three approaches to computing the anchors as maximal exact matches: Bi-Directional Burrows-Wheeler Transform, Minimizers, and lastly, a hybrid implementation of the two. Using the maximal exact matches as anchors, we can efficiently compute an optimal chaining alignment for the strings. The chaining alignment further allows us to determine all such intervals where mismatches occur by looking at which sequences are not in the chain. Using these smaller intervals lets us approximate edit distance with a high degree of accuracy and a significant speed improvement. The methods described present a way to approximate edit distance in time complexity bounded by the number of maximal exact matches.
  • Ma, Jun (Helsingin yliopisto, 2021)
    Sequence alignment by exact or approximate string matching is one of the fundamental problems in bioinformatics. As the volume of sequenced genomes grows rapidly, pairwise sequence alignment becomes inefficient for pan-genomic analyses involving multiple sequences. The graph representation of multiple genomes has been an increasingly useful tool in pan-genomics research. Therefore, sequence-to-graph alignment becomes an important and challenging problem. For pairwise approximate sequence alignment under Levenshtein (edit) distance, subquadratic algorithms for finding an optimal solution are unknown. As a result, aligning sequences of millions of characters optimally is too challenging and impractical. Thus, many heuristics and techniques are developed for possibly suboptimal alignments. Among them, co-linear chaining (CLC) is a powerful and popular technique that approximates the alignment by finding a chain of short aligned fragments that may come from exact matching. The optimal solution to CLC on sequences can be found efficiently in subquadratic time. For sequence-to-graph alignment, the CLC problem has been solved theoretically on a special class of graphs that are narrow and have no cycles, i.e. directed acyclic graphs (DAGs) with small width, by Mäkinen et al. (ACM Transactions on Algorithms, 2019). Pan-genome graphs such as variation graphs satisfy these restrictions but allowing cycles may enable more general applications of the algorithm. In this thesis, we introduce an efficient algorithm to solve the CLC problem on general graphs with small width that may have cycles, by reducing it to a slightly modified CLC problem on DAGs. We implemented an initial version of the new algorithm on DAGs as a sequence-to-graph aligner GraphChainer. The aligner is evaluated and compared to an existing state-of-the-art aligner GraphAligner (Genome Biology, 2020) in experiments using both simulated and real genome assembly data on variation graphs. Our method improves the quality of alignments significantly in the task of aligning real human PacBio data. GraphChainer is freely available as an open source tool at
  • Länsman, Olá-Mihkku (Helsingin yliopisto, 2020)
    Demand forecasts are required for optimizing multiple challenges in the retail industry, and they can be used to reduce spoilage and excess inventory sizes. The classical forecasting methods provide point forecasts and do not quantify the uncertainty of the process. We evaluate multiple predictive posterior approximation methods with a Bayesian generalized linear model that captures weekly and yearly seasonality, changing trends and promotional effects. The model uses negative binomial as the sampling distribution because of the ability to scale the variance as a quadratic function of the mean. The forecasting methods provide highest posterior density intervals in different credible levels ranging from 50% to 95%. They are evaluated with proper scoring function and calculation of hit rates. We also measure the duration of the calculations as an important result due to the scalability requirements of the retail industry. The forecasting methods are Laplace approximation, Monte Carlo Markov Chain method, Automatic Differentiation Variational Inference, and maximum a posteriori inference. Our results show that the Markov Chain Monte Carlo method is too slow for practical use, while the rest of the approximation methods can be considered for practical use. We found out that Laplace approximation and Automatic Differentiation Variational Inference have results closer to the method with best analytical quarantees, the Markov Chain Monte Carlo method, suggesting that they were better approximations of the model. The model faced difficulties with highly promotional, slow selling, and intermittent data. Best fit was provided with high selling SKUs, for which the model provided intervals with hit rates that matched the levels of the credible intervals.
  • Laitala, Julius (Helsingin yliopisto, 2021)
    Arranging products in stores according to planograms, optimized product arrangement maps, is important for keeping up with the highly competitive modern retail market. The planograms are realized into product arrangements by humans, a process which is prone to mistakes. Therefore, for optimal merchandising performance, the planogram compliance of the arrangements needs to be evaluated from time to time. We investigate utilizing a computer vision problem setting – retail product detection – to automate planogram compliance evaluation. We introduce the relevant problems, the state-of- the-art approaches for solving them and background information necessary for understanding them. We then propose a computer vision based planogram compliance evaluation pipeline based on the current state of the art. We build our proposed models and algorithms using PyTorch, and run tests against public datasets and an internal dataset collected from a large Nordic retailer. We find that while the retail product detection performance of our proposed approach is quite good, the planogram compliance evaluation performance of our whole pipeline leaves a lot of room for improvement. Still, our approach seems promising, and we propose multiple ways for improving the performance enough to enable possible real world utility. The code used for our experiments and the weights for our models are available at
  • Hertweck, Corinna (Helsingin yliopisto, 2020)
    In this work, we seek robust methods for designing affirmative action policies for university admissions. Specifically, we study university admissions under a real centralized system that uses grades and standardized test scores to match applicants to university programs. For the purposes of affirmative action, we consider policies that assign bonus points to applicants from underrepresented groups with the goal of preventing large gaps in admission rates across groups, while ensuring that the admitted students are for the most part those with the highest scores. Since such policies have to be announced before the start of the application period, there is uncertainty about which students will apply to which programs. This poses a difficult challenge for policy-makers. Hence, we introduce a strategy to design policies for the upcoming round of applications that can either address a single or multiple demographic groups. Our strategy is based on application data from previous years and a predictive model trained on this data. By comparing this predictive strategy to simpler strategies based only on application data from, e.g., the previous year, we show that the predictive strategy is generally more conservative in its policy suggestions. As a result, policies suggested by the predictive strategy lead to more robust effects and fewer cases where the gap in admission rates is inadvertently increased through the suggested policy intervention. Our findings imply that universities can employ predictive methods to increase the reliability of the effects expected from the implementation of an affirmative action policy.
  • Ikkala, Tapio (Helsingin yliopisto, 2020)
    This thesis presents a scalable method for identifying anomalous periods of non-activity in short periodic event sequences. The method is tested with real world point-of-sale (POS) data from grocery retail setting. However, the method can be applied also to other problem domains which produce similar sequential data. The proposed method models the underlying event sequence as a non-homogeneous Poisson process with a piecewise constant rate function. The rate function for the piecewise homogeneous Poisson process can be estimated with a change point detection algorithm that minimises a cost function consisting of the negative Poisson log-likelihood and a penalty term that is linear to the number of change points. The resulting model can be queried for anomalously long periods of time with no events, i.e., waiting times, by defining a threshold below which the waiting time observations are deemed anomalies. The first experimental part of the thesis focuses on model selection, i.e., in finding a penalty value that results in the change point detection algorithm detecting the true changes in the intensity of the arrivals of the events while not reacting to random fluctuations in the data. In the second experimental part the performance of the anomaly detection methodology is measured against stock-out data, which gives an approximate ground truth for the termination of a POS event sequence. The performance of the anomaly detector is found to be subpar in terms of precision and recall, i.e., the true positive rate and the positive predictive value. The number of false positives remains high even with small threshold values. This needs to be taken into account when considering applying the anomaly detection procedure in practice. Nevertheless, the methodology may have practical value in the retail setting, e.g., in guiding the store personnel where to focus their resources in ensuring the availability of the products.
  • Duong, Quoc Quan (Helsingin yliopisto, 2021)
    Discourse dynamics is one of the important fields in digital humanities research. Over time, the perspectives and concerns of society on particular topics or events might change. Based on the changing in popularity of a certain theme different patterns are formed, increasing or decreasing the prominence of the theme in news. Tracking these changes is a challenging task. In a large text collection discourse themes are intertwined and uncategorized, which makes it hard to analyse them manually. The thesis tackles a novel task of automatic extraction of discourse trends from large text corpora. The main motivation for this work lies in the need in digital humanities to track discourse dynamics in diachronic corpora. Machine learning is a potential method to automate this task by learning patterns from the data. However, in many real use-cases ground truth is not available and annotating discourses on a corpus-level is incredibly difficult and time-consuming. This study proposes a novel procedure to generate synthetic datasets for this task, a quantitative evaluation method and a set of benchmarking models. Large-scale experiments are run using these synthetic datasets. The thesis demonstrates that a neural network model trained on such datasets can obtain meaningful results when applied to a real dataset, without any adjustments of the model.
  • Aula, Kasimir (Helsingin yliopisto, 2019)
    Air pollution is considered to be one of the biggest environmental risks to health, causing symptoms from headache to lung diseases, cardiovascular diseases and cancer. To improve awareness of pollutants, air quality needs to be measured more densely. Low-cost air quality sensors offer one solution to increase the number of air quality monitors. However, they suffer from low accuracy of measurements compared to professional-grade monitoring stations. This thesis applies machine learning techniques to calibrate the values of a low-cost air quality sensor against a reference monitoring station. The calibrated values are then compared to a reference station’s values to compute error after calibration. In the past, the evaluation phase has been carried out very lightly. A novel method of selecting data is presented in this thesis to ensure diverse conditions in training and evaluation data, that would yield a more realistic impression about the capabilities of a calibration model. To better understand the level of performance, selected calibration models were trained with data corresponding to different levels of air pollution and meteorological conditions. Regarding pollution level, using homogeneous training and evaluation data, the error of a calibration model was found to be even 85% lower than when using diverse training and evaluation pollution environment. Also, using diverse meteorological training data instead of more homogeneous data was shown to reduce the size of the error and provide stability on the behavior of calibration models.
  • Jylhä-Ollila, Pekka (Helsingin yliopisto, 2020)
    K-mer counting is the process of building a histogram of all substrings of length k for an input string S. The problem itself is quite simple, but counting k-mers efficiently for a very large input string is a difficult task that has been researched extensively. In recent years the performance of k-mer counting algorithms have improved significantly, and there have been efforts to use graphics processing units (GPUs) in k-mer counting. The goal for this thesis was to design, implement and benchmark a GPU accelerated k-mer counting algorithm SNCGPU. The results showed that SNCGPU compares reasonably well to the Gerbil k-mer counting algorithm on a mid-range desktop computer, but does not utilize the resources of a high-end computing platform as efficiently. The implementation of SNCGPU is available as open-source software.
  • Kemppainen, Esa (Helsingin yliopisto, 2020)
    NP-hard optimization problems can be found in various real-world settings such as scheduling, planning and data analysis. Coming up with algorithms that can efficiently solve these problems can save various rescources. Instead of developing problem domain specific algorithms we can encode a problem instance as an instance of maximum satisfiability (MaxSAT), which is an optimization extension of Boolean satisfiability (SAT). We can then solve instances resulting from this encoding using MaxSAT specific algorithms. This way we can solve instances in various different problem domains by focusing on developing algorithms to solve MaxSAT instances. Computing an optimal solution and proving optimality of the found solution can be time-consuming in real-world settings. Finding an optimal solution for problems in these settings is often not feasible. Instead we are only interested in finding a good quality solution fast. Incomplete solvers trade guaranteed optimality for better scalability. In this thesis, we study an incomplete solution approach for solving MaxSAT based on linear programming relaxation and rounding. Linear programming (LP) relaxation and rounding has been used for obtaining approximation algorithms on various NP-hard optimization problems. As such we are interested in investigating the effectiveness of this approach on MaxSAT. We describe multiple rounding heuristics that are empirically evaluated on random, crafted and industrial MaxSAT instances from yearly MaxSAT Evaluations. We compare rounding approaches against each other and to state-of-the-art incomplete solvers SATLike and Loandra. The LP relaxation based rounding approaches are not competitive in general against either SATLike or Loandra However, for some problem domains our approach manages to be competitive against SATLike and Loandra.
  • Vu, Anh-Duc (Helsingin yliopisto, 2020)
    Artificial intelligence (AI) is being increasingly applied in the field of intelligent tutoring systems (ITS). Knowledge space theory (KST) aims to model the main features of the process of learning new skills. Two basic components of ITS are the domain model and the student model. The student model provides an estimation of the state of the student’s knowledge or proficiency, based on the student’s performance on exercises. The domain model provides a model of relations between the concepts/skills in the domain. To learn the student model from data, some ITSs use the Bayesian Knowledge Tracing (BKT) algorithm, which is based on hidden Markov models (HMM). This thesis investigates the applicability of KST to constructing these models. The contribution of the thesis is twofold. Firstly, we learn the student model by a modified BKT algorithm, which models forgetting of skills (which the standard BKT model does not do). We build one BKT model for each concept. However, rather than treating a single question as a step in the HMM, we treat an entire practice session as one step, on which the student receives a score between 0 and 1, which we assume to be normally distributed. Secondly, we propose algorithms to learn the “surmise” graph—the prerequisite relation between concepts—from “mastery data,” estimated by the student model. The mastery data tells us the knowledge state of a student on a given concept. The learned graph is a representation of the knowledge domain. We use the student model to track the advancement of students, and use the domain model to propose the optimal study plan for students based on their current proficiency and targets of study.
  • Lång, John (Helsingin yliopisto, 2021)
    This thesis discusses the formal modelling and verification of certain non-real-time aspects of correctness of a mission-critical distributed software system known as the ALICE Data Point Service (ADAPOS). The domain of this distributed system is data acquisition from a particle detector control system in experimental high energy particle physics research. ADAPOS is part of the upgrade effort of A Large Ion Collider Experiment (ALICE) at the European Organisation for Nuclear Research (CERN), near Geneva in France/Switzerland, for the third run of the Large Hadron Collider (LHC). ADAPOS is based on the publicly available ALICE Data Point Processing (ADAPRO) C++14 framework and works within the free and open source GNU/Linux ecosystem. The model checker Spin was chosen for modelling and verifying ADAPOS. The model focuses on the general specification of ADAPOS. It includes ADAPOS processes, a load generator process, and rudimentary interpretations for the network protocols used between the processes. For experimenting with different interpretations of the underlying network protocols and also for coping with the state space explosion problem, eight variants of the model were developed and studied. Nine Linear Temporal Logic (LTL) properties were defined for all those variants. Large numbers of states were covered during model checking even though the model turned out to have a reachable state space too large to fully exhaust. No counter-examples were found to safety properties. A significant amount of evidence hinting that ADAPOS seems to be safe, was obtained. Liveness properties and implementation-level verification among other possible research directions remain open.
  • Leinonen, Matti (Helsingin yliopisto, 2021)
    3D Object detection and tracking are computer vision methods used in many applications. It is necessary for autonomous vehicles and robots to be able to reliably extract 3D localization information about objects in their environment to operate safely. Currently most 3D object detection and tracking algorithms use high quality LiDAR-sensors which are very expensive. This is why research into methods that use cheap monocular camera images as inputs is an active field in computer vision research. Most current research into monocular 3D object detection and tracking is focused in autonomous driving. This thesis investigates how well current monocular methods are suited for use in industrial settings where the environment and especially the camera perspective can be very different compared to what it is in an automobile. This thesis introduces some of the most used 3D object detection and tracking methods and techniques and tests one detection method on a dataset where the environment and the point of view differs from what it would be in autonomous driving. This thesis also analyzes the technical requirements for a detection and tracking system that could be be used for autonomous robots in an industrial setting and what future research would be necessary to develop such a system.