@MASTERSTHESIS\{IMM2011-06138, author = "H. W. Vildh{\o}j and S. Vind", title = "String Indexing for Patterns with Wildcards", year = "2011", school = "Technical University of Denmark, {DTU} Informatics, {E-}mail: reception@imm.dtu.dk", address = "Asmussens Alle, Building 305, {DK-}2800 Kgs. Lyngby, Denmark", type = "", note = "Supervised by Associate Professor Philip Bille, phbi@imm.dtu.dk, and Associate Professor Inge Li G{\o}rtz, ilg@imm.dtu.dk, {DTU} Informatics", url = "http://www.imm.dtu.dk/English.aspx", abstract = "We consider the problem of indexing a string t of length n to report the occurrences of a query pattern p containing m characters and j wildcards. Let occ be the number of occurrences of p in t, and a the size of the alphabet. We obtain the following results. - A linear space index with query time O(m + aj log log n + occ). This significantly improves the previously best known linear space index described by Lam et al. [{ISAAC} 2007], which requires query time (-)(jn) in the worst case. - An index with optimal query time O(m+j +occ) using space O(ak2n logk log n), where k is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time. - A time-space trade-off for the problem which generalizes the index described by Cole et al. [{STOC} 2004]. The Longest Common Prefix (LCP) data structure introduced by Cole et al. is a key component in our results. We give a detailed explanation and show several new properties of the {LCP} data structure. Most importantly, we show that not only suffixes, but arbitrary sets of substrings of t, can be queried, and that unrooted {LCP} queries can be performed in linear space. Our results are obtained by combining the new properties of the {LCP} data structure with well-known and new techniques. In particular, we introduce a generalization of the heavy-path decomposition, which could be of independent interest. Finally, we extend our results to allow variable length gaps or optional wildcards in the query pattern, improving upon the only previously known index for this problem by Lam et al. [{ISAAC} 2007]." }