Co-linear Chaining on Graphs With Cycles

Show simple item record

dc.contributor Helsingin yliopisto, Matemaattis-luonnontieteellinen tiedekunta fi
dc.contributor University of Helsinki, Faculty of Science en
dc.contributor Helsingfors universitet, Matematisk-naturvetenskapliga fakulteten sv
dc.contributor.author Ma, Jun
dc.date.issued 2021
dc.identifier.uri URN:NBN:fi:hulib-202106092584
dc.identifier.uri http://hdl.handle.net/10138/330781
dc.description.abstract Sequence alignment by exact or approximate string matching is one of the fundamental problems in bioinformatics. As the volume of sequenced genomes grows rapidly, pairwise sequence alignment becomes inefficient for pan-genomic analyses involving multiple sequences. The graph representation of multiple genomes has been an increasingly useful tool in pan-genomics research. Therefore, sequence-to-graph alignment becomes an important and challenging problem. For pairwise approximate sequence alignment under Levenshtein (edit) distance, subquadratic algorithms for finding an optimal solution are unknown. As a result, aligning sequences of millions of characters optimally is too challenging and impractical. Thus, many heuristics and techniques are developed for possibly suboptimal alignments. Among them, co-linear chaining (CLC) is a powerful and popular technique that approximates the alignment by finding a chain of short aligned fragments that may come from exact matching. The optimal solution to CLC on sequences can be found efficiently in subquadratic time. For sequence-to-graph alignment, the CLC problem has been solved theoretically on a special class of graphs that are narrow and have no cycles, i.e. directed acyclic graphs (DAGs) with small width, by Mäkinen et al. (ACM Transactions on Algorithms, 2019). Pan-genome graphs such as variation graphs satisfy these restrictions but allowing cycles may enable more general applications of the algorithm. In this thesis, we introduce an efficient algorithm to solve the CLC problem on general graphs with small width that may have cycles, by reducing it to a slightly modified CLC problem on DAGs. We implemented an initial version of the new algorithm on DAGs as a sequence-to-graph aligner GraphChainer. The aligner is evaluated and compared to an existing state-of-the-art aligner GraphAligner (Genome Biology, 2020) in experiments using both simulated and real genome assembly data on variation graphs. Our method improves the quality of alignments significantly in the task of aligning real human PacBio data. GraphChainer is freely available as an open source tool at https://github.com/algbio/GraphChainer. en
dc.language.iso eng
dc.publisher Helsingin yliopisto fi
dc.publisher University of Helsinki en
dc.publisher Helsingfors universitet sv
dc.title Co-linear Chaining on Graphs With Cycles en
dc.type.ontasot pro gradu -tutkielmat fi
dc.type.ontasot master's thesis en
dc.type.ontasot pro gradu-avhandlingar sv
dct.identifier.urn URN:NBN:fi:hulib-202106092584
dc.subject.specialization Algoritmit fi
dc.subject.specialization Algorithms en
dc.subject.specialization Algoritmer sv
dc.subject.degreeprogram Tietojenkäsittelytieteen maisteriohjelma fi
dc.subject.degreeprogram Master's Programme in Computer Science en
dc.subject.degreeprogram Magisterprogrammet i datavetenskap sv

Files in this item

Total number of downloads: Loading...

Files Size Format View
Ma_Jun_thesis_2021.pdf 946.7Kb PDF View/Open

This item appears in the following Collection(s)

Show simple item record