Statistical Text Analysis Using English Dependency Structure

Jacobo Rouces

Abstract: In this thesis we design, implement and test a system for statistical text analysis that, unlike the traditional bag-of-words model, takes natural language syntax into account. Our working hypothesis is that the syntactic structure of a text holds information that is permanently lost under the bag-of-words model and that may be relevant when judging the semantic similarity between texts.
Therefore, we attempt to extract and use the actual syntactic structure of text. Specifically, we use the English dependency structure of a large corpus, obtained with an unsupervised parser and a coreference resolution system. We translate the labelled dependency relations and the coreference information into a more semantically oriented language, which we call Entity-Property Language (EPL), and from this we build both a term space and a document space, each with its own metric. Different versions of the inter-document metric, some making use of the inter-term metric, are used in an information retrieval task, and their performance is compared to that of the bag-of-words model. For testing we use a corpus developed from the English and Simple English Wikipedias, and a topic-based relevance measure.
We obtain a slight but statistically consistent improvement over the bag-of-words model, especially for long queries, and we suggest lines of research that may lead to further improvements, both in the inter-document metric for information retrieval and in the inter-term metric for automatic thesaurus construction.
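The working hypothesis can be illustrated with a toy sketch. This is not the thesis's EPL translation or either of its metrics: the dependency triples below are written by hand rather than produced by a parser, the relation labels (`nsubj`, `dobj`) and the helper names are assumptions made for illustration, and plain cosine similarity stands in for the inter-document metric. It only shows that two texts with identical bags of words can differ completely once head-dependent relations are used as features.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bag_of_words(text):
    """Word counts only; word order and syntax are discarded."""
    return Counter(text.lower().split())


def dependency_features(triples):
    """Counts of (head, relation, dependent) triples."""
    return Counter(triples)


# Hypothetical, hand-written dependency triples for two short sentences.
# A real system would obtain these from a dependency parser.
deps_1 = [("bites", "nsubj", "dog"), ("bites", "dobj", "man")]  # "dog bites man"
deps_2 = [("bites", "nsubj", "man"), ("bites", "dobj", "dog")]  # "man bites dog"

bow_sim = cosine(bag_of_words("dog bites man"), bag_of_words("man bites dog"))
dep_sim = cosine(dependency_features(deps_1), dependency_features(deps_2))

print(f"bag-of-words similarity: {bow_sim:.2f}")
print(f"dependency similarity:   {dep_sim:.2f}")
```

Under the bag-of-words features the two sentences are indistinguishable, while under the triple features they share nothing. A practical metric would presumably sit between these extremes, for example by also crediting shared terms.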
Type: Master's thesis [Academic thesis]
Year: 2012
Publisher: Technical University of Denmark, DTU Informatics, E-mail: reception@imm.dtu.dk
Address: Asmussens Alle, Building 305, DK-2800 Kgs. Lyngby, Denmark
Series: IMM-M.Sc.-2012-56
Note: Supervised by Professor Lars Kai Hansen, lkh@imm.dtu.dk, DTU Informatics
Electronic version(s): [pdf]
Publication link: http://www.imm.dtu.dk/English.aspx
IMM Group(s): Intelligent Signal Processing