AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models

Permalink

http://hdl.handle.net/10138/335771

Citation

Silva, M, Pratas, D & Pinho, AJ 2021, 'AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models', Entropy, vol. 23, no. 5. https://doi.org/10.3390/e23050530

Title: AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models
Author: Silva, M; Pratas, D; Pinho, AJ
Other contributor: University of Helsinki, Department of Virology

Date: 2021-05
Language: eng
Belongs to series: Entropy
ISSN: 1099-4300
DOI: https://doi.org/10.3390/e23050530
URI: http://hdl.handle.net/10138/335771
Abstract: Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, raising increasingly important challenges, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, the number of compressors in the literature designed specifically for protein sequences is relatively low. Moreover, these specialized compressors only marginally improve the compression ratio over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach, and individual cache-hash memory models for the highest context orders. Compared to the previous compressor (AC), we show gains of 2-9% and 6-7% in reference-free and reference-based modes, respectively. These gains come at the cost of computations that are three times slower. AC2 also uses about seven times less memory than AC, and its memory requirements are not affected by the size of the input sequences. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence and each viral protein sequence in the whole UniProt database. The results consistently show the highest similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing critical results to a currently controversial subject. AC2 is available for free download under the GPLv3 license.
Subject: lossless data compression
protein sequence compression
context mixing
neural networks
mixture of experts
coronavirus
RESPIRATORY SYNDROME CORONAVIRUS
SARS-COV-2 VARIANTS
COMPLEXITY
OUTBREAK
3111 Biomedicine

Files in this item

Files Size Format View
AC2_An_Efficien ... _and_Cache_Hash_Models.pdf 1.462Mb PDF View/Open
