String Indexing for Patterns with Wildcards |
| Abstract | We consider the problem of indexing a string t of length n to report the occurrences of a query pattern p containing m characters and j wildcards. Let occ be the number of occurrences of p in t, and a the size of the alphabet. We obtain the following results.
- A linear space index with query time O(m + aj log log n + occ). This significantly improves the previously best known linear space index described by Lam et al. [ISAAC 2007], which requires query time (-)(jn) in the worst case.
- An index with optimal query time O(m+j +occ) using space O(ak2n logk log n), where k is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time.
- A time-space trade-off for the problem which generalizes the index described by Cole et al. [STOC 2004].
The Longest Common Prefix (LCP) data structure introduced by Cole et al. is a key component in our results. We give a detailed explanation and show several new properties of the LCP data structure. Most importantly, we show that not only suffixes, but arbitrary sets of substrings of t, can be queried, and that unrooted LCP queries can be performed in linear space. Our results are obtained by combining the new properties of the LCP data structure with well-known and new techniques. In particular, we introduce a generalization of the heavy-path decomposition, which could be of independent interest. Finally, we extend our results to allow variable length gaps or optional wildcards in the query pattern, improving upon the only previously known index for this problem by Lam et al. [ISAAC 2007]. | Type | Master's thesis [Academic thesis] | Year | 2011 | Publisher | Technical University of Denmark, DTU Informatics, E-mail: reception@imm.dtu.dk | Address | Asmussens Alle, Building 305, DK-2800 Kgs. Lyngby, Denmark | Series | IMM-M.Sc.-2011-74 | Note | | Electronic version(s) | [pdf] | Publication link | http://www.imm.dtu.dk/English.aspx | BibTeX data | [bibtex] | IMM Group(s) | Computer Science & Engineering |
|