Browsing by Organization "Department of Computer Science"


Now showing items 1-10 of 10
  • Fischer, Johannes; Mäkinen, Veli; Navarro, Gonzalo (Springer-Verlag, 2008)
    Suffix trees are one of the most important data structures in stringology, with myriad applications in flourishing areas like bioinformatics. As their main problem is space usage, recent efforts have focused on compressed suffix tree representations, which obtain large space reductions in exchange for moderate slowdowns. Such a smaller suffix tree can fit in a faster memory, outweighing by far the theoretical slowdown. We present a novel compressed suffix tree. Compared to the current compressed suffix trees, it is the first to achieve, at the same time, sublogarithmic complexity for the operations and space usage that goes to zero as the entropy of the text does. Our development contains several novel ideas, such as compressing the longest common prefix information and entirely discarding the suffix tree topology, expressing all the suffix tree operations using range minimum queries and a new primitive called next/previous smaller value in a sequence.
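
    As a concrete illustration of the next/previous smaller value primitive mentioned above, the sketch below computes next-smaller-value positions over an array (such as an LCP array) with the standard monotone-stack technique; it is only an illustrative baseline, not the compressed representation developed in the paper.

```python
def next_smaller_values(a):
    # For each position i, find the smallest j > i with a[j] < a[i]
    # (len(a) if no such position exists), using a monotone stack.
    # Primitives of this kind, together with range minimum queries, are
    # what the abstract proposes for expressing suffix tree operations
    # over an LCP array instead of storing the tree topology.
    n = len(a)
    nsv = [n] * n
    stack = []
    for i, v in enumerate(a):
        while stack and a[stack[-1]] > v:
            nsv[stack.pop()] = i
        stack.append(i)
    return nsv
```
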
  • Polishchuk, Valentin; Suomela, Jukka (Elsevier, 2009)
    We present a local algorithm (constant-time distributed algorithm) for finding a 3-approximate vertex cover in bounded-degree graphs. The algorithm is deterministic, and no auxiliary information besides port numbering is required.
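
    For contrast with the local algorithm described above, the sketch below is the classical centralized 2-approximation that takes both endpoints of a greedily built maximal matching. It is not the paper's constant-time, port-numbering-based algorithm; it only makes the notion of an approximate vertex cover concrete.

```python
def vertex_cover_2approx(edges):
    # Classical 2-approximation: scan the edges and, whenever an edge is
    # not yet covered, add both of its endpoints to the cover. The chosen
    # edges form a maximal matching, so the cover is at most twice optimal.
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.add(u)
            cover.add(v)
    return cover
```
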
  • Böcker, Sebastian; Mäkinen, Veli (IEEE/ACM, 2008)
    Mass spectrometry has become one of the most popular analysis techniques in proteomics and systems biology. With the creation of larger datasets, the automated recalibration of mass spectra becomes important to ensure that every peak in the sample spectrum is correctly assigned to some peptide and protein. Algorithms for recalibrating mass spectra have to be robust with respect to wrongly assigned peaks, as well as efficient given the amount of mass spectrometry data. The recalibration of mass spectra leads us to the problem of finding an optimal matching between mass spectra under measurement errors. We have developed two deterministic methods that allow robust computation of such a matching: the first approach uses a computational geometry interpretation of the problem and tries to find two parallel lines with constant distance that stab a maximal number of points in the plane. The second approach is based on finding a maximal common approximate subsequence, and improves existing algorithms by one order of magnitude by exploiting the sequential nature of the matching problem. We compare our results to a computational geometry algorithm using a topological line-sweep.
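
    A minimal sketch of the "common approximate subsequence" idea, assuming a fixed calibration and a tolerance eps on mass differences: a quadratic longest-common-subsequence DP in which two peaks match if their masses differ by at most eps. The paper's algorithms are faster and also handle the recalibration itself; this only illustrates the matching notion.

```python
def max_common_approximate_subsequence(peaks_a, peaks_b, eps):
    # peaks_a and peaks_b are sorted lists of peak masses; two peaks match
    # if they differ by at most eps. Standard O(n*m) LCS-style DP.
    n, m = len(peaks_a), len(peaks_b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(peaks_a[i - 1] - peaks_b[j - 1]) <= eps:
                dp[i][j] = dp[i - 1][j - 1] + 1
            dp[i][j] = max(dp[i][j], dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]
```
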
  • Sirén, Jouni (2009)
    We present a fast space-efficient algorithm for constructing compressed suffix arrays (CSA). The algorithm requires O(n log n) time in the worst case, and only O(n) bits of extra space in addition to the CSA. As the basic step, we describe an algorithm for merging two CSAs. We show that the construction algorithm can be parallelized in a symmetric multiprocessor system, and discuss the possibility of a distributed implementation. We also describe a parallel implementation of the algorithm, capable of indexing several gigabytes per hour.
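
    The merging step can be pictured on plain, uncompressed suffix arrays: two already sorted suffix lists are merged by direct lexicographic comparison. This toy version uses naive construction and ignores compression entirely; it is only meant to illustrate what merging two suffix-sorted texts means, not the O(n)-extra-bits CSA merging of the paper.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes directly; fine for illustration.
    return sorted(range(len(text)), key=lambda i: text[i:])

def merge_suffix_arrays(a, sa_a, b, sa_b):
    # Merge the suffix arrays of two texts into one lexicographic ordering
    # of all suffixes of the collection {a, b}.
    merged, i, j = [], 0, 0
    while i < len(sa_a) and j < len(sa_b):
        if a[sa_a[i]:] <= b[sa_b[j]:]:
            merged.append(("a", sa_a[i])); i += 1
        else:
            merged.append(("b", sa_b[j])); j += 1
    merged.extend(("a", p) for p in sa_a[i:])
    merged.extend(("b", p) for p in sa_b[j:])
    return merged
```
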
  • Kaski, Petteri; Penttinen, Aleksi; Suomela, Jukka (Old City Publishing, 2008)
    We study the algorithmic problem of coordinating transmissions in a wireless network where radio interference constrains concurrent transmissions on wireless links. We focus on pairwise conflicts between the links; these can be described as a conflict graph. Associated with the conflict graph are two fundamental network coordination tasks: (a) finding a nonconflicting set of links with the maximum total weight, and (b) finding a link schedule with the minimum total length. Our work shows that two assumptions on the geometric structure of conflict graphs suffice to achieve polynomial-time constant-factor approximations: (i) bounded density of devices, and (ii) bounded range of interference. We also show that these assumptions are not sufficient to obtain a polynomial-time approximation scheme (PTAS) for either coordination task. There exists a PTAS if we make an additional assumption: (iii) bounded range of radio transmissions.
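
    A small sketch of the setting, under an assumed geometric interference model in which two links conflict if any of their endpoints lie within a given interference range: build the conflict graph, then produce a simple link schedule by greedy colouring (links with the same colour share a time slot). The paper's constant-factor approximation algorithms are more involved; this only fixes the objects involved.

```python
import math

def conflict_graph(links, interference_range):
    # links: list of ((x1, y1), (x2, y2)) endpoint pairs (assumed model).
    def close(p, q):
        return math.dist(p, q) <= interference_range
    n = len(links)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if any(close(p, q) for p in links[i] for q in links[j]):
                edges.add((i, j))
    return edges

def greedy_schedule(n, edges):
    # Greedy colouring of the conflict graph; the number of colours used
    # is the length of the resulting link schedule.
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        adj[i].add(j); adj[j].add(i)
    colour = {}
    for v in range(n):
        used = {colour[u] for u in adj[v] if u in colour}
        colour[v] = next(c for c in range(n) if c not in used)
    return colour
```
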
  • Lemström, Kjell; Mikkilä, Niko; Mäkinen, Veli (2008)
    We consider two content-based music retrieval problems where the music is modeled as sets of points in the Euclidean plane, formed by the (on-set time, pitch) pairs. We introduce fast filtering methods based on indexing the underlying database. The filters run in sublinear time in the length of the database, and they are lossless if quadratic space may be used. By taking into account the application, the search space can be narrowed down, obtaining practically lossless filters using linear-size index structures. For the checking phase, which dominates the overall running time, we exploit previously designed algorithms suitable for local checking. In our experiments on a music database, our best filter-based methods performed several orders of magnitude faster than previous solutions.
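
    The point-set model above can be made concrete with a quadratic-time baseline that, instead of the paper's sublinear filters, counts translation vectors between query and database points and reports the translation that matches the most query points (the idea behind SIA/P2-style matching):

```python
from collections import Counter

def best_translation(query, database):
    # query and database are lists of (onset_time, pitch) pairs.
    # For every pair of points, vote for the translation mapping the query
    # point onto the database point; the most common translation gives the
    # largest partial occurrence of the query in the database.
    votes = Counter()
    for qt, qp in query:
        for dt, dp in database:
            votes[(dt - qt, dp - qp)] += 1
    translation, matched = votes.most_common(1)[0]
    return translation, matched
```
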
  • Mäkinen, Veli; Navarro, Gonzalo (IEEE Computer Society, 2008)
    Recent advances in compressed data structures have led to the new concept of self-indexing; it is possible to represent a sequence of symbols compressed in a form that enables fast queries on the content of the sequence. This paper studies different analogies of self-indexing on images. First, we show that a key ingredient of many self-indexes for sequences, namely the wavelet tree, can be used to obtain both lossless and lossy compression with random access to pixel values. Second, we show how to use self-indexes for sequences as a black-box to provide self-indexes for images with filtering-type query capabilities. Third, we develop a tailor-made self-index for images by showing how to compress two-dimensional suffix arrays. Experimental results are provided to compare the compressibility to standard compression methods.
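
    The wavelet tree mentioned above can be sketched as follows: split the alphabet into two halves, store one bit per symbol recording which half it belongs to, and recurse on each half; access and rank then follow the bits downward. This is a naive pointer-based teaching version with linear-scan rank, not a succinct implementation.

```python
class WaveletTree:
    """Pointer-based wavelet tree supporting access(i) and rank(symbol, i)."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        if len(self.alphabet) == 1:
            self.bits = None          # leaf: a single symbol
            return
        mid = len(self.alphabet) // 2
        left_set = set(self.alphabet[:mid])
        # One bit per symbol: 0 = left half of the alphabet, 1 = right half.
        self.bits = [0 if s in left_set else 1 for s in seq]
        self.left = WaveletTree([s for s in seq if s in left_set],
                                self.alphabet[:mid])
        self.right = WaveletTree([s for s in seq if s not in left_set],
                                 self.alphabet[mid:])

    def access(self, i):
        if self.bits is None:
            return self.alphabet[0]
        b = self.bits[i]
        r = sum(1 for x in self.bits[:i] if x == b)   # rank within same bit
        return (self.left if b == 0 else self.right).access(r)

    def rank(self, symbol, i):
        # Number of occurrences of symbol in seq[:i].
        if self.bits is None:
            return i
        mid = len(self.alphabet) // 2
        b = 0 if symbol in self.alphabet[:mid] else 1
        r = sum(1 for x in self.bits[:i] if x == b)
        return (self.left if b == 0 else self.right).rank(symbol, r)

wt = WaveletTree("mississippi")
assert wt.access(4) == "i"
assert wt.rank("s", 7) == 4
```
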
  • Sirén, Jouni; Välimäki, Niko; Mäkinen, Veli; Navarro, Gonzalo (Springer-Verlag, 2008)
    A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. This paper is devoted to studying ways to store massive sets of highly repetitive sequence collections in a space-efficient manner so that both retrieval of the content and queries on the content of the sequences can be supported time-efficiently. We show that the state-of-the-art entropy-bound full-text self-indexes do not yet provide satisfactory space bounds for this specific task. We engineer some new structures that use run-length encoding and give empirical evidence that these structures are superior to the current structures.
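
    Run-length encoding, the key tool named above, can be sketched in a few lines: collapse maximal runs of equal symbols into (symbol, length) pairs. On the Burrows-Wheeler transform of a highly repetitive collection the number of runs is typically far smaller than the sequence length, which is what such engineered structures exploit.

```python
def run_length_encode(seq):
    # Collapse maximal runs of equal symbols into (symbol, run_length) pairs.
    runs = []
    for s in seq:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1
        else:
            runs.append([s, 1])
    return [(s, length) for s, length in runs]

assert run_length_encode("aaabccaa") == [("a", 3), ("b", 1), ("c", 2), ("a", 2)]
```
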
  • Fischer, Johannes; Mäkinen, Veli; Välimäki, Niko (IEEE, 2008)
    Let D1 and D2 be two databases (i.e. multisets) of d strings, over an alphabet S, with overall length n. We study the problem of mining discriminative patterns between D1 and D2, e.g., patterns that are frequent in one database but not in the other, emerging patterns, or patterns satisfying other frequency-related constraints. Using the algorithmic framework by Hui (CPM 1992), one can solve several variants of this problem in optimal linear time with the aid of suffix trees or suffix arrays. This stands in high contrast to other pattern domains such as itemsets or subgraphs, where super-linear lower bounds are known. However, the space requirement of existing solutions is O(n log n) bits, which is not optimal for |S| << n (in particular for constant |S|), as the databases themselves occupy only n log |S| bits. Because in many real-life applications space is a more critical resource than time, the aim of this article is to reduce the space, at the cost of an increased running time. In particular, we give a solution for the above problems that uses O(n log |S| + d log n) bits, while the time requirement is increased from the optimal linear time to O(n log n). Our new method is tested extensively on biologically relevant datasets and shown to be usable even on genome-scale data.
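
    To fix the problem statement, here is a naive baseline (nowhere near the linear-time suffix-structure framework referenced above) that enumerates bounded-length substrings of the first database and keeps those whose document frequencies satisfy given constraints in the two databases:

```python
def discriminative_substrings(db1, db2, min_freq1, max_freq2, max_len=10):
    # db1, db2: lists of strings. A pattern is reported if it occurs in at
    # least min_freq1 strings of db1 and at most max_freq2 strings of db2.
    def doc_freq(pattern, db):
        return sum(1 for s in db if pattern in s)
    candidates = {s[i:i + l] for s in db1
                  for i in range(len(s))
                  for l in range(1, min(max_len, len(s) - i) + 1)}
    return sorted(p for p in candidates
                  if doc_freq(p, db1) >= min_freq1
                  and doc_freq(p, db2) <= max_freq2)
```
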
  • Kivinen, Jyrki; Warmuth, Manfred K.; Hassibi, Babak (IEEE, 2006)
    Recently much work has been done analyzing online machine learning algorithms in a worst case setting, where no probabilistic assumptions are made about the data. This is analogous to the H-infinity setting used in adaptive linear filtering. Bregman divergences have become a standard tool for analyzing online machine learning algorithms. Using these divergences, we motivate a generalization of the Least Mean Squares (LMS) algorithm. The loss bounds for these so-called p-norm algorithms involve norms other than the standard 2-norm. The bounds can be significantly better if a large proportion of the input variables are irrelevant, i.e., if the weight vector we are trying to learn is sparse. We also prove results for nonstationary targets. We only know how to apply kernel methods to the standard LMS algorithm (i.e., p = 2). However, even in the general p-norm case we can handle generalized linear models where the output of the system is a linear function combined with a nonlinear transfer function (e.g., the logistic sigmoid).
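
    The p = 2 case discussed above is the classical LMS update, sketched below; the p-norm generalizations replace the implicit 2-norm with other norms, which can give much better bounds when the target weight vector is sparse.

```python
import numpy as np

def lms_update(w, x, y, eta=0.01):
    # One step of the classical Least Mean Squares (LMS) update, i.e. the
    # p = 2 case: move the weights against the gradient of the squared
    # prediction error on the current example (x, y).
    error = y - np.dot(w, x)
    return w + eta * error * x

w = np.zeros(3)
w = lms_update(w, np.array([1.0, 0.0, 2.0]), y=1.0)
```
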