Algorithms for Web Scraping

Patrick Hagge Cording

AbstractWeb scraping is the process of extracting and creating a structured representation of data from a web site. HTML, the markup language used to structure data on webpages, is subject to change when for instance the look-and-feel is updated. Since current techniques for web scraping are based on the markup, a change may lead to the extraction of incorrect data. In this thesis we investigate the potential of using approximate tree pattern matching based on the tree edit distance and constrained derivatives for web scraping. We argue that algorithms for constrained tree edit distances are not suited for web scraping. To address the high time complexity of optimal tree edit distance algorithms, we present the lower bound pruning algorithm which, based on the data tree T{...} and the pattern tree T{...} , will attempt to remove branches of T{,,,} that are not part of an optimal mapping. Its running time is {.....}, where {.....} is the running time of the lower bound method used. Although it asymptotically is close to the approximate tree pattern matching algorithms, we show that in practice the total execution time is reduced in some cases. We develop several methods for determining a lower bound on the tree edit distance used for approximate tree pattern matching, and we see that our generalization of the q-gram distance from strings is the most effective with our algorithm. We also present a similar algorithm that use the HTML grammar to prune T{D}, and some heuristics to guide the approximate tree pattern matching algorithm.
TypeMaster's thesis [Academic thesis]
PublisherTechnical University of Denmark, DTU Informatics, E-mail:
AddressAsmussens Alle, Building 305, DK-2800 Kgs. Lyngby, Denmark
Electronic version(s)[pdf]
Publication link
BibTeX data [bibtex]
IMM Group(s)Computer Science & Engineering