Fischer, Johannes; Mäkinen, Veli; Välimäki, Niko
(IEEE, 2008)
Let D1 and D2 be two databases (i.e. multisets) of d
strings, over an alphabet S, with overall length n. We study
the problem of mining discriminative patterns between D1
and D2, e.g., patterns that are frequent in one database
but not in the other, emerging patterns, or patterns satisfying
other frequency-related constraints. Using the algorithmic
framework by Hui (CPM 1992), one can solve
several variants of this problem in the optimal linear time
with the aid of suffix trees or suffix arrays. This stands
in high contrast to other pattern domains such as itemsets
or subgraphs, where super-linear lower bounds are
known. However, the space requirement of existing solutions
is O(n log n) bits, which is not optimal for |S| << n
(in particular for constant |S|), as the databases themselves
occupy only n log |S| bits.
Because in many real-life applications space is a more
critical resource than time, the aim of this article is to reduce
the space, at the cost of an increased running time. In
particular, we give a solution for the above problems that
uses O(n log n+d log n) bits, while the time requirement
is increased from the optimal linear time to O(n log n). Our
new method is tested extensively on a biologically relevant
datasets and shown to be usable even on a genome-scale
data.