Linear time minimum segmentation enables scalable founder reconstruction

Permalink

http://hdl.handle.net/10138/301860

Citation

Algorithms for Molecular Biology. 2019 May 17;14(1):12

Title: Linear time minimum segmentation enables scalable founder reconstruction
Author: Norri, Tuukka; Cazaux, Bastien; Kosolobov, Dmitry; Mäkinen, Veli
Publisher: BioMed Central
Date: 2019-05-17
URI: http://hdl.handle.net/10138/301860
Abstract: Background: We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold $L$ and a set $\mathcal{R} = \{R_1, \ldots, R_m\}$ of $m$ strings (haplotype sequences), each having length $n$, the minimum segmentation problem for founder reconstruction is to partition $[1, n]$ into a set $P$ of disjoint segments such that each segment $[a,b] \in P$ has length at least $L$ and the maximum, over all $[a,b] \in P$, of the number $d(a,b) = |\{R_i[a,b] : 1 \le i \le m\}|$ of distinct substrings at segment $[a,b]$ is minimized. The distinct substrings in the segments represent founder blocks that can be concatenated to form $\max\{d(a,b) : [a,b] \in P\}$ founder sequences representing the original $\mathcal{R}$ such that crossovers happen only at segment boundaries. Results: We give an $O(mn)$ time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier $O(mn^2)$ algorithm. Conclusions: Our improvement makes it possible to apply the formulation to an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence of its practicality. The implementation is available at https://github.com/tsnorri/founder-sequences.
Subject: Pan-genome indexing
Founder reconstruction
Dynamic programming
Positional Burrows–Wheeler transform
Range minimum query
Rights: The Author(s)
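
To make the problem statement in the abstract concrete, the following is a minimal sketch of a naive dynamic program for the minimum segmentation problem. It is not the authors' O(mn) algorithm (nor the earlier O(mn^2) one); it recomputes d(a,b) from scratch for every candidate segment and is meant only to illustrate the objective. The function name `min_segmentation` and the use of 0-based, half-open segments [i, j) are assumptions made for this illustration.

```python
from math import inf


def min_segmentation(rows, L):
    """Naive DP for the minimum segmentation problem (illustrative only).

    rows: list of m equal-length strings (aligned haplotypes).
    L:    minimum allowed segment length.
    Returns (best, segments), where best is the minimized maximum number of
    distinct substrings over all segments and segments is a list of
    half-open, 0-based intervals (i, j). Returns None if no valid
    segmentation exists (for example, when n < L).
    """
    n = len(rows[0])
    assert all(len(r) == n for r in rows), "rows must have equal length"

    # opt[j]: best achievable maximum d over segmentations of the prefix [0, j)
    opt = [inf] * (n + 1)
    prev = [-1] * (n + 1)  # start of the last segment in an optimal solution
    opt[0] = 0

    for j in range(L, n + 1):
        # The last segment is [i, j) and must have length at least L.
        for i in range(0, j - L + 1):
            if opt[i] == inf:
                continue  # the prefix [0, i) cannot be segmented
            d = len({r[i:j] for r in rows})  # distinct substrings in [i, j)
            cost = max(opt[i], d)
            if cost < opt[j]:
                opt[j] = cost
                prev[j] = i

    if opt[n] == inf:
        return None

    # Backtrack to recover the segment boundaries.
    segments = []
    j = n
    while j > 0:
        i = prev[j]
        segments.append((i, j))
        j = i
    segments.reverse()
    return opt[n], segments
```

For example, with the four rows `AACC`, `AATT`, `GGCC`, `GGTT` and `L = 2`, splitting the columns into two segments of length 2 leaves two distinct substrings per segment, so two founder sequences suffice; with `L = 3` the only valid segmentation is the whole range, which has four distinct rows.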


Files in this item

Files Size Format View
13015_2019_Article_147.pdf 2.875Mb PDF View/Open
