Constructing Antidictionaries in Output-Sensitive Space

Ayad, Lorraine A. K.; Badkobeh, Golnaz; Fici, Gabriele; Heliou, Alice and Pissis, Solon P.. 2019. Constructing Antidictionaries in Output-Sensitive Space. pp. 538-547. ISSN 2375-0359 [Article]

[img]
Preview
Text
DCC_Computing_Minimal_Absent_Words_in_Output_Sensitive_Space (1).pdf - Accepted Version
Available under License Creative Commons Attribution Non-commercial.

Download (365kB) | Preview

Abstract or Description

A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y_1,y_2,...,y_k over an alphabet Σ, we are asked to compute the set M^ℓ_y_1#...#y_k of minimal absent words of length at most ℓ of word y=y_1#y_2#...#y_k, #∉Σ. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. This computation generally requires Ω(n) space for n=|y| using any of the plenty available O(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when ||M^ℓ_y_1#...#y_N||=o(n), for all N∈[1,k]. For instance, in the human genome, n ≈ 3× 10^9 but ||M^12_y_1#...#y_k|| ≈ 10^6. We consider a constant-sized alphabet for stating our results. We show that all M^ℓ_y_1,...,M^ℓ_y_1#...#y_k can be computed in O(kn+∑^k_N=1||M^ℓ_y_1#...#y_N||) total time using O(MaxIn+MaxOut) space, where MaxIn is the length of the longest word in {y_1,...,y_k} and MaxOut={||M^ℓ_y_1#...#y_N||:N∈[1,k]}. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.

Item Type:

Article

Identification Number (DOI):

https://doi.org/10.1109/DCC.2019.00062

Keywords:

Bioinformatics, Gold, Data structures, Data compression, Genomics, Biological cells, Indexes; absent words, antidictionaries, string algorithms, output sensitive algorithms, data compression.

Related URLs:

Departments, Centres and Research Units:

Computing

Dates:

DateEvent
27 December 2018Accepted
29 March 2019Completed
13 May 2019Published

Event Location:

Snowbird, Utah, United States

Date range:

26-29 March 2019

Item ID:

25916

Date Deposited:

28 Feb 2019 14:57

Last Modified:

12 Jun 2021 15:47

URI:

https://research.gold.ac.uk/id/eprint/25916

View statistics for this item...

Edit Record Edit Record (login required)