Co-linear Chaining on Graphs With Cycles

Show full item record



Permalink

http://urn.fi/URN:NBN:fi:hulib-202106092584
Title: Co-linear Chaining on Graphs With Cycles
Author: Ma, Jun
Contributor: University of Helsinki, Faculty of Science
Publisher: Helsingin yliopisto
Date: 2021
Language: eng
URI: http://urn.fi/URN:NBN:fi:hulib-202106092584
http://hdl.handle.net/10138/330781
Thesis level: master's thesis
Degree program: Tietojenkäsittelytieteen maisteriohjelma
Master's Programme in Computer Science
Magisterprogrammet i datavetenskap
Specialisation: Algoritmit
Algorithms
Algoritmer
Abstract: Sequence alignment by exact or approximate string matching is one of the fundamental problems in bioinformatics. As the volume of sequenced genomes grows rapidly, pairwise sequence alignment becomes inefficient for pan-genomic analyses involving multiple sequences. The graph representation of multiple genomes has been an increasingly useful tool in pan-genomics research. Therefore, sequence-to-graph alignment becomes an important and challenging problem. For pairwise approximate sequence alignment under Levenshtein (edit) distance, subquadratic algorithms for finding an optimal solution are unknown. As a result, aligning sequences of millions of characters optimally is too challenging and impractical. Thus, many heuristics and techniques are developed for possibly suboptimal alignments. Among them, co-linear chaining (CLC) is a powerful and popular technique that approximates the alignment by finding a chain of short aligned fragments that may come from exact matching. The optimal solution to CLC on sequences can be found efficiently in subquadratic time. For sequence-to-graph alignment, the CLC problem has been solved theoretically on a special class of graphs that are narrow and have no cycles, i.e. directed acyclic graphs (DAGs) with small width, by Mäkinen et al. (ACM Transactions on Algorithms, 2019). Pan-genome graphs such as variation graphs satisfy these restrictions but allowing cycles may enable more general applications of the algorithm. In this thesis, we introduce an efficient algorithm to solve the CLC problem on general graphs with small width that may have cycles, by reducing it to a slightly modified CLC problem on DAGs. We implemented an initial version of the new algorithm on DAGs as a sequence-to-graph aligner GraphChainer. The aligner is evaluated and compared to an existing state-of-the-art aligner GraphAligner (Genome Biology, 2020) in experiments using both simulated and real genome assembly data on variation graphs. Our method improves the quality of alignments significantly in the task of aligning real human PacBio data. GraphChainer is freely available as an open source tool at https://github.com/algbio/GraphChainer.


Files in this item

Total number of downloads: Loading...

Files Size Format View
Ma_Jun_thesis_2021.pdf 946.7Kb PDF View/Open

This item appears in the following Collection(s)

Show full item record