Browsing by Subject "Magisterprogrammet i datavetenskap"

Now showing items 1-20 of 79
  • Heikkinen, Juuso (Helsingin yliopisto, 2021)
    Telecommunication companies are moving towards even more digitalized and agile ways of working. They are expanding their business into other fields, such as television, thus moving further away from the traditional telecommunications model. Recently, Telia has become the largest television company in the Nordics. One of their main products in the field of television is channel packages, which allow customers to access specific television content. In this study, a benefit analysis for Telia Finland Oyj was conducted to inspect the benefits that test automation brings to the channel package testing process. In total, eight interviews were conducted with Telia employees with knowledge of channel packages. To obtain both a business and a technical perspective, the interviewees were divided into two groups according to their expertise. In general, test automation was seen as a useful tool. The main business-related benefits of test automation mentioned were a faster and cheaper testing process and a faster time-to-market. It was also seen that test automation could help achieve a more efficient testing process and increase confidence in test automation. Based on the interview results, an epic was defined and analyzed according to the principles of the Scaled Agile Framework (SAFe). This included describing the solution in detail and defining a Minimum Viable Product (MVP). By using example variables and generalized values, several calculations were made to present a framework for the costs of implementing the MVP and the estimated reduction in channel package testing costs. When the MVP was used as a part of the channel package testing process, the return on investment (ROI) was not as desirable as expected. With more automated tests relative to the number of test cases, combined with regular use of test automation, the investment would pay itself back and start generating additional savings faster. Based on the epic analysis, a Lean Business Case was defined.
  • Avikainen, Jari (Helsingin yliopisto, 2019)
    This thesis presents a wavelet-based method for detecting moments of fast change in the textual contents of historical newspapers. The method works by generating time series of the relative frequencies of different words in the newspaper contents over time, and calculating their wavelet transforms. The wavelet transform is essentially a group of transformations describing the changes happening in the original time series at different time scales, and can therefore be used to pinpoint moments of fast change in the data. The produced wavelet transforms are then used to detect fast changes in word frequencies by examining products of multiple scales of the transform. The properties of the wavelet transform and the related multi-scale product are evaluated in relation to detecting various kinds of steps and spikes in different noise environments. The suitability of the method for analysing historical newspaper archives is examined using an example corpus consisting of 487 issues of Uusi Suometar from 1869–1918 and 250 issues of Wiipuri from 1893–1918. Two problematic features in the newspaper data, noise caused by OCR (optical character recognition) errors and uneven temporal distribution of the data, are identified and their effects on the results of the presented method are evaluated using synthetic data. Finally, the method is tested using the example corpus, and the results are examined briefly. The method is found to be adversely affected especially by the uneven temporal distribution of the newspaper data. Without additional processing, or improving the quality of the examined data, a significant number of the detected steps are due to noise in the data. Various ways of alleviating the effect are proposed, among other suggested improvements to the system.
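The multi-scale product idea in the abstract above can be illustrated with a minimal sketch. It assumes a Haar-like detail signal (the difference between the means of adjacent windows) as a stand-in for the wavelet transform; the thesis does not state which wavelet it uses, and the names `haar_detail` and `multiscale_product` are hypothetical.

```python
def haar_detail(series, scale):
    """Haar-like detail at one scale (assumed stand-in for a wavelet transform).

    For each position t, returns the mean of the next `scale` points minus
    the mean of the previous `scale` points; positions too close to the
    edges are left at 0.0.
    """
    n = len(series)
    out = [0.0] * n
    for t in range(scale, n - scale + 1):
        left = sum(series[t - scale:t]) / scale
        right = sum(series[t:t + scale]) / scale
        out[t] = right - left
    return out


def multiscale_product(series, scales=(1, 2, 4)):
    """Pointwise product of the detail signals over several scales."""
    product = [1.0] * len(series)
    for s in scales:
        detail = haar_detail(series, s)
        product = [p * d for p, d in zip(product, detail)]
    return product


# Synthetic relative word frequency with a single step at index 10.
freq = [0.0] * 10 + [1.0] * 10
mp = multiscale_product(freq)
step_at = max(range(len(mp)), key=lambda i: abs(mp[i]))
print(step_at)
```

A genuine step shows up at every scale, so the product stays large there, while noise that appears at only one scale is damped by the near-zero values at the other scales.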
  • Ollila, Risto (Helsingin yliopisto, 2021)
    The Web has become the world's most important application distribution platform, with web pages increasingly containing not static documents, but dynamic, script-driven content. Script-based rendering relies on imperative browser APIs which become unwieldy to use as an application's complexity grows. An increasingly common solution is to use libraries and frameworks which provide an abstraction over rendering and enable a less error-prone declarative programming model. The details of how web frontend frameworks implement rendering vary widely and can potentially have significant consequences for application performance. Frameworks' rendering strategies are typically invisible to the application developer, and may consequently be poorly understood despite their potential impact. In this thesis, we review rendering strategies used in a number of influential and popular web frontend frameworks. By studying their implementation details, we discover ways to categorize and estimate rendering strategies' performance based on input sizes in update loops. To verify and measure the effects of these differences, we implement a number of benchmarks that measure different aspects of rendering. In our benchmarks, we discover significant performance differences ranging up to an order of magnitude under some conditions. Additionally, we confirm that categorizing rendering strategies based on input sizes of update loops is an effective way to estimate their relative performance. The best performing rendering strategies are found to be ones which minimize input sizes in update loops using techniques such as compile-time optimization and reactive programming models.
  • Xue, Jiayue (Helsingin yliopisto, 2021)
    Semantic shift in natural language is a well-established phenomenon and has been studied for many years. Similarly, the meanings of scientific publications may also change as time goes by. In other words, the same publication may be cited in distinct contexts. To investigate whether the meanings of citations change in different scenarios, which we refer to as semantic shifts in citations, we followed the same ideas researchers have used to study semantic shifts in language. More specifically, we combined the temporal referencing model and the Word2Vec model to explore the semantic shifts of scientific citations in two aspects: their usage over time and their usage across different domains. By observing how citations themselves changed over time and comparing the closest neighbors of citations, we concluded that the semantics of scientific publications did shift in terms of cosine distances.
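The cosine distance used as the shift measure above can be sketched as follows. The two vectors are made-up stand-ins for the embeddings of one citation token in two time periods; they are not data from the thesis, and `cosine_distance` is a hypothetical helper name.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical embeddings of the same citation in an early and a late period.
early = [0.9, 0.1, 0.0]
late = [0.2, 0.8, 0.1]
print(cosine_distance(early, late))
```

A distance near 0 means the citation is used in similar contexts in both periods; larger values suggest a shift.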
  • Wargelin, Matias (Helsingin yliopisto, 2021)
    Musical pattern discovery refers to the automated discovery of important repeated patterns, such as melodies and themes, from music data. Several algorithms have been developed to solve this problem, but evaluating the algorithms has been difficult without proper visualisations of the output of the algorithms. To address this issue a web application named Mupadie was built. Mupadie accepts MIDI music files as input and visualises the outputs of musical pattern discovery algorithms, with implementations of SIATEC and TTWIA built in the application. Other algorithms can be visualised if the algorithm output is uploaded to Mupadie as a JSON file that follows a specified data structure. Using Mupadie, an evaluation of SIATEC and TTWIA was conducted. Mupadie was found to be a useful tool in the qualitative evaluation of these musical pattern discovery algorithms; it helped reveal systematically recurring issues with the discovered patterns, some previously known and some previously undocumented. The findings were then used to suggest improvements to the algorithms.
  • Lehtonen, Leevi (Helsingin yliopisto, 2021)
    Quantum computing has enormous potential in machine learning, where problems can quickly scale to be intractable for classical computation. A Boltzmann machine is a well-known energy-based graphical model suitable for various machine learning tasks. Plenty of work has already been conducted on realizing Boltzmann machines in quantum computing, resulting in approaches with somewhat different characteristics. In this thesis, we conduct a survey of the state of the art in quantum Boltzmann machines and their training approaches. Primarily, we examine the variational quantum Boltzmann machine, a specific variant of the quantum Boltzmann machine suitable for near-term quantum hardware. Moreover, as the variational quantum Boltzmann machine relies heavily on variational quantum imaginary time evolution, we analyze variational quantum imaginary time evolution in depth. Compared to previous work, we evaluate the execution of variational quantum imaginary time evolution with a more comprehensive collection of hyperparameters. Furthermore, we train variational quantum Boltzmann machines using a toy problem of bars and stripes, representing a more multimodal probability distribution than the Bell states and the Greenberger-Horne-Zeilinger states considered in earlier studies.
  • Meriläinen, Roosa (Helsingin yliopisto, 2020)
    In a world of constantly growing data masses, efficiently extracting, storing and accessing data for business intelligence and analytics has become increasingly important to businesses. Analytics and business intelligence software is offered by many providers on the market for organizations of all sizes, and there are multiple ways to build an analytics system, or pipeline, either from scratch or by integrating tools available on the market. In this case study we explore and redesign the analytics pipeline solution of a medium-sized software product company by utilizing the design science research methodology. We discuss the current technologies and tools on the market for business intelligence and analytics and consider how they fit into our case study context. As design science suggests, we design, implement and evaluate two prototypes of an analytics pipeline with an Extract, Transform and Load (ETL) solution and data warehouse. The prototypes represent two different approaches to building an analytics pipeline: an in-house approach and a partially outsourced approach. Our study brings out typical challenges similar businesses may face when designing and building their own business intelligence and analytics software. In our case we lean towards an analytics pipeline with an outsourced ETL process, which lets us pass various types of event data with a consistent data schema into our data warehouse with minimal maintenance work. However, we also show the value of near-real-time analytics with an in-house solution, and offer some ideas on how such a pipeline may be built.
  • Törnroos, Topi (Helsingin yliopisto, 2021)
    Application Performance Management (APM) is a growing field, and APM tools on the market tend to be complex enterprise solutions with features ranging from traffic analysis and error reporting to real-user monitoring and business transaction management. This thesis is a study done on behalf of Veikkaus Oy, a Finnish government-owned game company and betting agency. It serves as a look into the current state of the art of leading APM tools as well as a requirements analysis done from the perspective of the company’s IT personnel. A list of requirements was gathered and scored based on perceived importance, and four APM tools on the market (Datadog APM, Dynatrace, New Relic and AppDynamics) were compared to each other and scored based on the gathered requirements. In addition, open-source alternatives were considered and investigated. Our results suggest that the leading APM vendors have products very similar to each other, with only marginal differences in features. In general, APMs were deemed useful and valuable to the company, able to assist in the work of a wide variety of IT personnel, as well as able to replace many tools currently in use by Veikkaus Oy and simplify their application ecosystem.
  • Harviainen, Juha (Helsingin yliopisto, 2021)
    Computing the permanent of a matrix is a famous #P-hard problem with a wide range of applications. The fastest known exact algorithms for the problem require an exponential number of operations, and all known fully polynomial randomized approximation schemes are rather complicated to implement and have impractical time complexities. The most promising recent advancements on approximating the permanent are based on rejection sampling and upper bounds for the permanent. In this thesis, we improve the current state of the art by developing the deep rejection sampling method, which combines an exact algorithm with the rejection sampling method. The algorithm precomputes a dynamic programming table that tightens the initial upper bound used by the rejection sampling method. In a sense, the table is used to jump-start the sampling process. We give a high probability upper bound for the time complexity of the deep rejection sampling method for random (0, 1)-matrices in which each entry is 1 with probability p. For matrices with p < 1/5, our high probability bound is stronger than in previous work. In addition to that, we empirically observe that our algorithm outperforms earlier rejection sampling methods by testing it with different parameters against other algorithms on multiple classes of matrices. The improvements in sampling times are especially notable in cases in which the ratios of the permanental upper bounds and the exact value of the permanent are huge.
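As a point of reference for the exact component mentioned above, here is a minimal sketch of computing the permanent exactly with a dynamic programming table over column subsets. It only illustrates an exponential-time exact computation; it is not the deep rejection sampling method of the thesis, and the table layout and the name `permanent` are assumptions made for illustration.

```python
def permanent(matrix):
    """Exact permanent of a square matrix via DP over column subsets.

    dp[mask] is the sum, over all ways of assigning the first k rows to the
    k columns in `mask`, of the product of the selected entries, where k is
    the number of set bits in `mask`. Runs in O(2^n * n) time.
    """
    n = len(matrix)
    dp = [0] * (1 << n)
    dp[0] = 1
    for mask in range(1 << n):
        if dp[mask] == 0:
            continue
        row = bin(mask).count("1")  # next row to assign
        if row == n:
            continue
        for col in range(n):
            if not mask & (1 << col):
                dp[mask | (1 << col)] += dp[mask] * matrix[row][col]
    return dp[(1 << n) - 1]


print(permanent([[1, 1, 1], [1, 1, 1], [1, 1, 1]]))  # 3! = 6
```

For a (0, 1)-matrix the permanent counts the perfect matchings of the corresponding bipartite graph, which is why the all-ones 3×3 matrix gives 3! = 6.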
  • Liu, Yang Jr (Helsingin yliopisto, 2020)
    Automatic readability assessment is considered a challenging task in NLP due to its high degree of subjectivity. The majority of prior work in assessing readability has focused on identifying the level of education necessary for comprehension without considering text quality, i.e., how naturally the text flows from the perspective of a native speaker. Therefore, in this thesis, we aim to use language models, trained on well-written prose, to measure not only text readability in terms of comprehension but also text quality. We developed two word-level metrics based on the concordance of article text with predictions made using language models to assess text readability and quality. We evaluate both metrics on a set of corpora used for readability assessment or automated essay scoring (AES) by measuring the correlation between scores assigned by our metrics and by human raters. According to the experimental results, our metrics are strongly correlated with text quality, achieving correlations of 0.4-0.6 on 7 out of 9 datasets. We demonstrate that GPT-2 surpasses other language models, including the bigram model, LSTM, and bidirectional LSTM, on the task of estimating text quality in a zero-shot setting, and that a GPT-2 perplexity-based measure is a reasonable indicator for text quality evaluation.
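The perplexity-based measure mentioned above rests on a standard definition, sketched here under the assumption that a language model has already assigned a probability to each token of a text. The probabilities below are invented for illustration, and no claim is made about how the thesis computes its two word-level metrics.

```python
import math


def perplexity(token_probs):
    """Perplexity of a text given the model probability of each token.

    Defined as exp of the negative mean log-probability; lower values mean
    the model finds the text more predictable.
    """
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


# Hypothetical per-token probabilities for a fluent and a clumsy sentence.
fluent = [0.5, 0.4, 0.6, 0.5]
clumsy = [0.05, 0.1, 0.02, 0.2]
print(perplexity(fluent) < perplexity(clumsy))
```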
  • Thapa Magar, Purushottam (Helsingin yliopisto, 2021)
    Rapid growth and advancement of next generation sequencing (NGS) technologies have changed the landscape of genomic medicine. Today, clinical laboratories perform DNA sequencing on a regular basis, which is an error-prone process. Erroneous data affects downstream analysis and produces fallacious results. Therefore, external quality assessment (EQA) of laboratories working with NGS data is crucial. Validation of variations such as single nucleotide polymorphisms (SNPs) and InDels (<50 bp) is fairly accurate these days. However, detection and quality assessment of large changes such as copy number variation (CNV) continues to be a concern. In this work, we aimed to study the feasibility of an automated CNV concordance analysis for laboratory EQA services. We benchmarked variants reported by 25 laboratories against the highly curated gold standard for the son (HG002/NA24385) of the Ashkenazim trio from the Personal Genome Project published by the Genome in a Bottle Consortium (GIAB). We employed two methods to assess the concordance of CNVs: the sequence-based comparison with Truvari and the in-house exome-based comparison. For deletion calls of two whole genome sequencing (WGS) submissions, Truvari achieved values greater than 88% and 68% for precision and recall, respectively. Conversely, the in-house method’s precision and recall peaked at 39% and 7.9%, respectively, for one WGS submission for both deletion and duplication calls. The results indicate that automated CNV concordance analysis of the deletion calls for the WGS-based callset might be feasible with Truvari. On the other hand, results for panel-based targeted sequencing for the deletion calls showed precision and recall rates ranging from 0-80% and 0-5.6%, respectively, with Truvari. This suggests that automated concordance analysis of CNVs for targeted sequencing remains a challenge. In conclusion, CNV concordance analysis depends on how the sequence data is generated.
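The precision and recall figures quoted above follow the usual definitions, sketched below. The counts are invented for illustration and do not come from the thesis; deciding which calls count as matches (the job of a tool such as Truvari) is outside the scope of this sketch.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision and recall from match counts against a truth set.

    precision = TP / (TP + FP): share of reported calls that are correct.
    recall    = TP / (TP + FN): share of truth-set variants that were found.
    Returns 0.0 for a ratio whose denominator is zero.
    """
    called = true_positives + false_positives
    truth = true_positives + false_negatives
    precision = true_positives / called if called else 0.0
    recall = true_positives / truth if truth else 0.0
    return precision, recall


# Hypothetical counts: 8 matching calls, 2 spurious calls, 8 missed variants.
print(precision_recall(8, 2, 8))  # (0.8, 0.5)
```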
  • Huotala, Aleksi (Helsingin yliopisto, 2021)
    Isomorphic web applications combine the best parts of static Hypertext Markup Language (HTML) pages and single-page applications. An isomorphic web application shares code between the server and the client. However, there is not much existing research on isomorphic web applications. Improving the performance, user experience and development experience of web applications are popular research topics in computer science. This thesis studies the benefits and challenges of isomorphism in single-page applications. To this end, a gray literature review and a case study were conducted. The articles used in the gray literature review were retrieved from four different websites. To make sure the gray literature could be used in this study, a quality assessment process was conducted. The case study was conducted as a developer survey, where developers familiar with isomorphic web applications were interviewed. The results and key findings of both studies are then compared. The results of this study show that isomorphism in single-page applications brings benefits to both the developers and the end-users. Isomorphism in single-page applications is challenging to implement and has some downsides, but they mostly affect developers. The performance and search engine optimization of the application are improved. Implementing isomorphism makes it possible to share code between the server and the client, but it increases the complexity of the application. Framework and library compatibility are issues that must be addressed by the developers. The findings of this thesis give motivation for developers to implement isomorphism when starting a new project or transforming existing single-page applications to use isomorphism.
  • Porttinen, Peter (Helsingin yliopisto, 2020)
    Computing an edit distance between strings is one of the central problems in both string processing and bioinformatics. Optimal solutions to edit distance take time quadratic in the lengths of the input strings. The goal of this thesis is to study a new approach to approximating edit distance. We use a chaining algorithm presented by Mäkinen and Sahlin in "Chaining with overlaps revisited" (CPM 2020), implemented verbatim. Building on the chaining algorithm, our focus is on efficiently finding a good set of anchors for the chaining algorithm. We present three approaches to computing the anchors as maximal exact matches: Bi-Directional Burrows-Wheeler Transform, Minimizers, and lastly, a hybrid implementation of the two. Using the maximal exact matches as anchors, we can efficiently compute an optimal chaining alignment for the strings. The chaining alignment further allows us to determine all intervals where mismatches occur by looking at which sequences are not in the chain. Using these smaller intervals lets us approximate edit distance with a high degree of accuracy and a significant speed improvement. The methods described present a way to approximate edit distance in time complexity bounded by the number of maximal exact matches.
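For context on the quadratic baseline mentioned above, here is a minimal sketch of the classic dynamic programming algorithm for unit-cost (Levenshtein) edit distance. It is the textbook exact method, not the chaining-based approximation studied in the thesis, and the function name is chosen for illustration.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between strings a and b.

    Uses the standard DP recurrence, keeping only the previous row, so it
    takes O(len(a) * len(b)) time and O(len(b)) space.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Every cell of the len(a) × len(b) table is visited once, which is the quadratic cost that anchor-based chaining aims to avoid on long inputs.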
  • Ritala, Susanna (Helsingin yliopisto, 2021)
    Chatbots have been developed for decades, but current interest in them has grown with advances in technology. Chatbots serve people for various purposes, and their operation is based on conversation with a human. Chatbots offer personal service at every hour of the day, which is why the need for them has increased in many fields, such as online sales and healthcare. In chatbot development it is important to consider how they are implemented. Many users still prefer other sources of information for solving their problems. One way to measure the quality of chatbot systems is to study their user experience. This thesis empirically examines the user experience of chatbot applications. The empirical part consists of a qualitative study that aims to answer the following research question: How could the user experience of chatbots be improved? The study was organized with the Osaamisbotti service, which provided a test environment for conducting the study. Eight people took part in the study and completed a given task by conversing with the chatbot. The research data was obtained through protocol analysis and a subsequent interview. The results indicate that human-like conversational abilities, longer answers and an efficient flow of conversation improve the user experience of chatbots. In addition, providing sufficient information guides the conversation and helps avoid error situations. Good availability and ease of use increase the acceptance and adoption of chatbots. The results of the thesis can be used in future research and in chatbot development work.
  • Ma, Jun (Helsingin yliopisto, 2021)
    Sequence alignment by exact or approximate string matching is one of the fundamental problems in bioinformatics. As the volume of sequenced genomes grows rapidly, pairwise sequence alignment becomes inefficient for pan-genomic analyses involving multiple sequences. The graph representation of multiple genomes has been an increasingly useful tool in pan-genomics research. Therefore, sequence-to-graph alignment becomes an important and challenging problem. For pairwise approximate sequence alignment under Levenshtein (edit) distance, subquadratic algorithms for finding an optimal solution are unknown. As a result, aligning sequences of millions of characters optimally is too challenging and impractical. Thus, many heuristics and techniques are developed for possibly suboptimal alignments. Among them, co-linear chaining (CLC) is a powerful and popular technique that approximates the alignment by finding a chain of short aligned fragments that may come from exact matching. The optimal solution to CLC on sequences can be found efficiently in subquadratic time. For sequence-to-graph alignment, the CLC problem has been solved theoretically on a special class of graphs that are narrow and have no cycles, i.e. directed acyclic graphs (DAGs) with small width, by Mäkinen et al. (ACM Transactions on Algorithms, 2019). Pan-genome graphs such as variation graphs satisfy these restrictions but allowing cycles may enable more general applications of the algorithm. In this thesis, we introduce an efficient algorithm to solve the CLC problem on general graphs with small width that may have cycles, by reducing it to a slightly modified CLC problem on DAGs. We implemented an initial version of the new algorithm on DAGs as a sequence-to-graph aligner GraphChainer. The aligner is evaluated and compared to an existing state-of-the-art aligner GraphAligner (Genome Biology, 2020) in experiments using both simulated and real genome assembly data on variation graphs. 
    Our method improves the quality of alignments significantly in the task of aligning real human PacBio data. GraphChainer is freely available as an open source tool at
  • Martesuo, Kim (Helsingin yliopisto, 2019)
    Creating a user interface (UI) is often a part of software development. In the software industry, designated UI designers work side by side with the developers in agile software development teams. While agile software processes have been researched, there is no general consensus on how UI designers should be integrated with the development team. The existing research points towards the industry favoring tight collaboration between developers and UI designers by having them work together in the same team. The subject is gathering interest, and different ways of integration are emerging in the industry. In this thesis we researched the collaboration between developers and UI designers in agile software development. The goal was to understand the teamwork between the UI designers and developers working in the same agile software teams. The research was conducted through semi-structured theme interviews with UI designers and developers individually. The interviewees were from consulting firms located in the Helsinki metropolitan area in Finland. The subjects reported on a recent project where they worked in an agile software team consisting of UI designers and developers. The data from the interviews was compared to the literature. The results of the interviews were for the most part similar to the findings from the literature. Finding a suitable process for the teamwork, co-location, good social relations and an atmosphere of trust were factors present in both the literature and the interviews. The importance of good software tools for communicating designs, and of developers taking part in the UI design process, stood out from the interviews.
  • Länsman, Olá-Mihkku (Helsingin yliopisto, 2020)
    Demand forecasts are required for solving multiple optimization challenges in the retail industry, and they can be used to reduce spoilage and excess inventory sizes. The classical forecasting methods provide point forecasts and do not quantify the uncertainty of the process. We evaluate multiple predictive posterior approximation methods with a Bayesian generalized linear model that captures weekly and yearly seasonality, changing trends and promotional effects. The model uses the negative binomial as the sampling distribution because of its ability to scale the variance as a quadratic function of the mean. The forecasting methods provide highest posterior density intervals at different credible levels ranging from 50% to 95%. They are evaluated with a proper scoring function and the calculation of hit rates. We also measure the duration of the calculations as an important result due to the scalability requirements of the retail industry. The forecasting methods are Laplace approximation, the Markov Chain Monte Carlo method, Automatic Differentiation Variational Inference, and maximum a posteriori inference. Our results show that the Markov Chain Monte Carlo method is too slow for practical use, while the rest of the approximation methods can be considered for practical use. We found that Laplace approximation and Automatic Differentiation Variational Inference have results closer to the method with the best analytical guarantees, the Markov Chain Monte Carlo method, suggesting that they were better approximations of the model. The model faced difficulties with highly promotional, slow-selling, and intermittent data. The best fit was obtained with high-selling SKUs, for which the model provided intervals with hit rates that matched the levels of the credible intervals.
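Two quantities from the abstract above can be sketched in a few lines: the quadratic mean-variance relationship of the negative binomial, and the hit rate of forecast intervals. The parameterization variance = mean + dispersion · mean² is a common one and is assumed here, since the abstract does not state which form the thesis uses; the sales figures and intervals are invented, and both function names are hypothetical.

```python
def negative_binomial_variance(mean, dispersion):
    """Variance under the assumed parameterization var = mean + dispersion * mean**2."""
    return mean + dispersion * mean ** 2


def hit_rate(observations, intervals):
    """Share of observations that fall inside their (lower, upper) interval."""
    hits = sum(1 for y, (lo, hi) in zip(observations, intervals) if lo <= y <= hi)
    return hits / len(observations)


# Hypothetical daily sales and forecast intervals for one SKU.
sales = [3, 7, 12, 0]
intervals = [(2, 5), (1, 6), (10, 15), (0, 1)]
print(negative_binomial_variance(10, 0.5))  # 10 + 0.5 * 100 = 60.0
print(hit_rate(sales, intervals))           # 3 of 4 inside: 0.75
```

For well-calibrated intervals at a given credible level, the hit rate over many forecasts should be close to that level, for example about 0.9 for 90% intervals.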
  • Laitala, Julius (Helsingin yliopisto, 2021)
    Arranging products in stores according to planograms, optimized product arrangement maps, is important for keeping up with the highly competitive modern retail market. The planograms are realized into product arrangements by humans, a process which is prone to mistakes. Therefore, for optimal merchandising performance, the planogram compliance of the arrangements needs to be evaluated from time to time. We investigate utilizing a computer vision problem setting, retail product detection, to automate planogram compliance evaluation. We introduce the relevant problems, the state-of-the-art approaches for solving them and the background information necessary for understanding them. We then propose a computer vision based planogram compliance evaluation pipeline based on the current state of the art. We build our proposed models and algorithms using PyTorch, and run tests against public datasets and an internal dataset collected from a large Nordic retailer. We find that while the retail product detection performance of our proposed approach is quite good, the planogram compliance evaluation performance of our whole pipeline leaves a lot of room for improvement. Still, our approach seems promising, and we propose multiple ways of improving the performance enough to enable possible real-world utility. The code used for our experiments and the weights for our models are available at
  • Ahonen, Heikki (Helsingin yliopisto, 2020)
    The research group dLearn.Helsinki has created software for defining the work-life competence skills of a person working as part of a group. The software is a research tool for developing these skills, and its users can be of any age, from school children to employees in a company. As the users can be of different age groups, the data privacy of the different groups has to be considered from different perspectives. Children are more vulnerable than adults, and may not understand all the risks posed to them. Thus, in the European Union the General Data Protection Regulation (GDPR) stipulates that the privacy and data of children are more strongly protected, and this has to be taken into account when designing software which uses such data. For dLearn.Helsinki this caused changes not only in the data handling of children, but also of other users. To tackle this problem, existing and future use cases needed to be planned and possibly implemented. Another solution was to implement different versions of the software, where the organizations would be separate. One option would be determining organizational differences in the existing SaaS solution. The other option would be creating on-premise versions, where organizations would be locked according to the customer type. This thesis introduces these use cases, as well as installation options for both SaaS and on-premise. With these, broader views of data privacy and the different approaches are investigated, and it can be concluded that no matter the approach, the data privacy of children will always prove a challenge.