Mäkinen, Veli; Navarro, Gonzalo; Sirén, Jouni; Välimäki, Niko
(2008)
A repetitive sequence collection is one where portions of a base sequence of length n are repeated
many times with small variations, forming a collection of total length N. Examples
of such collections are version control data and genome sequences of individuals, where
the differences can be expressed by lists of basic edit operations. Flexible and efficient
data analysis on a such typically huge collection is plausible using suffix trees. However,
suffix tree occupies O(N logN) bits, which very soon inhibits in-memory analyses. Recent
advances in full-text indexing reduce the space of suffix tree to NHk + o(N log σ) bits at
the cost of running times of its operations increasing by polylog(N) factor. Here Hk is the
k-th order entropy of the collection and σ is the alphabet size. Notice that for r identical
copies of an incompressible base sequence, the bound simplifies to N log σ(1 + o(1)) bits.
We develop new static/dynamic full-text self-indexes based on the run-length encoding
whose space-requirements are much less dependent on N. For example, we obtain an index
occupying Rlog σ(1+o(1))+R log N
R (1+o(1))+r log n+O((s+r) log(s+r)) bits, where
s is the total number of basic edit operations to convert the r repeats into substrings of the
base sequence, and R ≤ min(n, nHk)+O((s+r) logσ N), where the O() term holds in the
expected case. The new indexes can be plugged into a recent dynamic fully-compressed
suffix tree using an additional O((N/δ) log N) bits of space for any δ = polylog(N), and
retaining the polylog(N) time slowdown on operations.