# Co-linear Chaining on Graphs With Cycles

 dc.contributor Helsingin yliopisto, Matemaattis-luonnontieteellinen tiedekunta fi dc.contributor University of Helsinki, Faculty of Science en dc.contributor Helsingfors universitet, Matematisk-naturvetenskapliga fakulteten sv dc.contributor.author Ma, Jun dc.date.issued 2021 dc.identifier.uri URN:NBN:fi:hulib-202106092584 dc.identifier.uri http://hdl.handle.net/10138/330781 dc.description.abstract Sequence alignment by exact or approximate string matching is one of the fundamental problems in bioinformatics. As the volume of sequenced genomes grows rapidly, pairwise sequence alignment becomes inefficient for pan-genomic analyses involving multiple sequences. The graph representation of multiple genomes has been an increasingly useful tool in pan-genomics research. Therefore, sequence-to-graph alignment becomes an important and challenging problem. en For pairwise approximate sequence alignment under Levenshtein (edit) distance, subquadratic algorithms for finding an optimal solution are unknown. As a result, aligning sequences of millions of characters optimally is too challenging and impractical. Thus, many heuristics and techniques are developed for possibly suboptimal alignments. Among them, co-linear chaining (CLC) is a powerful and popular technique that approximates the alignment by finding a chain of short aligned fragments that may come from exact matching. The optimal solution to CLC on sequences can be found efficiently in subquadratic time. For sequence-to-graph alignment, the CLC problem has been solved theoretically on a special class of graphs that are narrow and have no cycles, i.e. directed acyclic graphs (DAGs) with small width, by Mäkinen et al. (ACM Transactions on Algorithms, 2019). Pan-genome graphs such as variation graphs satisfy these restrictions but allowing cycles may enable more general applications of the algorithm. In this thesis, we introduce an efficient algorithm to solve the CLC problem on general graphs with small width that may have cycles, by reducing it to a slightly modified CLC problem on DAGs. We implemented an initial version of the new algorithm on DAGs as a sequence-to-graph aligner GraphChainer. The aligner is evaluated and compared to an existing state-of-the-art aligner GraphAligner (Genome Biology, 2020) in experiments using both simulated and real genome assembly data on variation graphs. Our method improves the quality of alignments significantly in the task of aligning real human PacBio data. GraphChainer is freely available as an open source tool at https://github.com/algbio/GraphChainer. dc.language.iso eng dc.publisher Helsingin yliopisto fi dc.publisher University of Helsinki en dc.publisher Helsingfors universitet sv dc.title Co-linear Chaining on Graphs With Cycles en dc.type.ontasot pro gradu -tutkielmat fi dc.type.ontasot master's thesis en dc.type.ontasot pro gradu-avhandlingar sv dct.identifier.urn URN:NBN:fi:hulib-202106092584 dc.subject.specialization Algoritmit fi dc.subject.specialization Algorithms en dc.subject.specialization Algoritmer sv dc.subject.degreeprogram Tietojenkäsittelytieteen maisteriohjelma fi dc.subject.degreeprogram Master's Programme in Computer Science en dc.subject.degreeprogram Magisterprogrammet i datavetenskap sv