emMAW : computing minimal absent words in external memory

Show simple item record

dc.contributor University of Helsinki, Department of Computer Science en
dc.contributor.author Héliou, Alice
dc.contributor.author Pissis, Solon P.
dc.contributor.author Puglisi, Simon J.
dc.date.accessioned 2019-12-16T11:32:01Z
dc.date.available 2021-01-23T03:47:18Z
dc.date.issued 2017-09-01
dc.identifier.citation Héliou , A , Pissis , S P & Puglisi , S J 2017 , ' emMAW : computing minimal absent words in external memory ' , Bioinformatics , vol. 33 , no. 17 , pp. 2746-2749 . https://doi.org/10.1093/bioinformatics/btx209 en
dc.identifier.issn 1367-4803
dc.identifier.other PURE: 89829473
dc.identifier.other PURE UUID: 69484372-2c58-4dc1-9451-a0ae1f7dd9c5
dc.identifier.other WOS: 000408772700020
dc.identifier.other Scopus: 85029405893
dc.identifier.other ORCID: /0000-0001-7668-7636/work/38834904
dc.identifier.uri http://hdl.handle.net/10138/308355
dc.description.abstract Motivation: The biological significance of minimal absent words has been investigated in genomes of organisms from all domains of life. For instance, three minimal absent words of the human genome were found in Ebola virus genomes. There exists an O(n)-time and O(n)-space algorithm for computing all minimal absent words of a sequence of length n on a fixed-sized alphabet based on suffix arrays. A standard implementation of this algorithm, when applied to a large sequence of length n, requires more than 20n bytes of RAM. Such memory requirements are a significant hurdle to the computation of minimal absent words in large datasets. Results: We present emMAW, the first external-memory algorithm for computing minimal absent words. A free open-source implementation of our algorithm is made available. This allows for computation of minimal absent words on far bigger data sets than was previously possible. Our implementation requires less than 3 h on a standard workstation to process the full human genome when as little as 1GB of RAM is made available. We stress that our implementation, despite making use of external memory, is fast; indeed, even on relatively smaller datasets when enough RAM is available to hold all necessary data structures, it is less than two times slower than state-of-theart internal-memory implementations. en
dc.format.extent 4
dc.language.iso eng
dc.relation.ispartof Bioinformatics
dc.rights en
dc.subject FORBIDDEN WORDS en
dc.subject 1182 Biochemistry, cell and molecular biology en
dc.subject 113 Computer and information sciences en
dc.subject 111 Mathematics en
dc.title emMAW : computing minimal absent words in external memory en
dc.type Article
dc.description.version Peer reviewed
dc.identifier.doi https://doi.org/10.1093/bioinformatics/btx209
dc.type.uri info:eu-repo/semantics/other
dc.type.uri info:eu-repo/semantics/acceptedVersion

Files in this item

Total number of downloads: Loading...

Files Size Format View
main_appnote.pdf 228.4Kb PDF View/Open

This item appears in the following Collection(s)

Show simple item record