E2FM: an encrypted and compressed full-text index for collections of genomic sequences
Motivation:
Next Generation Sequencing (NGS) platforms and, more generally, high-throughput technologies are driving exponential growth in the size of nucleotide sequence databases. Moreover, many emerging applications of nucleotide datasets, such as those related to personalized medicine, require compliance with regulations on the storage and processing of sensitive data.
Results:
We have designed and carefully engineered E2FM-index, a new full-text index in minute space, optimized for compressing and encrypting nucleotide sequence collections in FASTA format and for performing fast pattern-search queries. E2FM-index builds self-indexes that occupy as little as 1/20 of the storage required by the input FASTA file, saving about 95% of storage when indexing collections of highly similar sequences; moreover, it can perform exact pattern searches on the built indexes in times ranging from a few milliseconds to a few hundred milliseconds, depending on pattern length.
Availability and implementation:
Source code is available at https://github.com/montecuollo/E2FM.
Supplementary data are available at Bioinformatics online.
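The exact pattern search mentioned in the Results can be illustrated with the backward-search procedure common to FM-index-style full-text indexes. The following is only a minimal sketch of that general technique, not the E2FM-index implementation: it omits compression and encryption entirely, stores the suffix array and occurrence table uncompressed, and the names `build_fm_index` and `search` are hypothetical.

```python
def build_fm_index(text):
    """Build a toy, uncompressed FM-index for a nucleotide string (illustrative only)."""
    text += "$"  # sentinel; sorts before A, C, G, T
    # Suffix array by naive sorting: fine for a demo, far too slow for real genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[ch]: number of characters in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i]: number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def search(pattern, sa, c_table, occ):
    """Return the sorted start positions of all exact occurrences of pattern."""
    lo, hi = 0, len(sa)  # half-open interval of matching suffix-array rows
    for ch in reversed(pattern):  # backward search: last character first
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, c_table, occ = build_fm_index("ACGTACGTTACG")
print(search("ACG", sa, c_table, occ))  # [0, 4, 9]
```

In this sketch the number of search steps depends on the pattern length rather than on the size of the indexed text, which is consistent with the statement above that query times depend on pattern length.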