Statistical Text Analysis Using English Dependency Structure
|Abstract||In this thesis we design, implement and test a system for statistical text analysis that, unlike the traditional bag-of-words model, takes natural language syntax into account. Our working hypothesis is that the syntactic structure of a text holds information that is permanently lost when using the bag-of-words model, and that may be relevant when judging the semantic similarity of different texts.
Therefore, we attempt to extract and use the actual syntactic structure of text. Specifically, we use the English dependency structure of a large corpus, obtained with an unsupervised parser and a coreference resolution system. We translate the labelled dependency relations and the coreference information into a more semantically oriented language, which we call Entity-Property Language (EPL), and from this we build both a term space and a document space over which respective metrics are defined. Different versions of the inter-document metric, some making use of the inter-term metric, are used in an information retrieval task, and their performance is compared to that of the bag-of-words model. For testing we use a corpus developed from the English and Simple English Wikipedias, together with a topic-based relevance measure.
We obtain a slight but statistically consistent improvement over the bag-of-words model, especially for long queries, and we suggest lines of research that may lead to further improvements, both in the inter-document metric for information retrieval and in the inter-term metric for automatic thesaurus construction.|
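The following is a minimal, self-contained sketch (not the thesis code) of the contrast the abstract describes: two documents with identical word content receive identical bag-of-words vectors, while a representation built from dependency relations keeps them apart. The hand-written triples, the role mapping inside epl_terms, and the "entity:property" term encoding are illustrative assumptions only; the actual EPL construction and metrics are defined in the thesis.

```python
# Toy contrast between bag-of-words and a dependency-pair representation.
# The triples below stand in for parser output; the "entity:property"
# encoding is only a hypothetical EPL-like format, not the thesis's EPL.

import math
from collections import Counter

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Two "documents" with the same words but different syntax.
doc_a = "the dog bit the man"
doc_b = "the man bit the dog"

# Bag-of-words vectors: the two documents become indistinguishable.
bow_a = Counter(doc_a.split())
bow_b = Counter(doc_b.split())

# Hand-written dependency triples (head, relation, dependent).
deps_a = [("bit", "nsubj", "dog"), ("bit", "dobj", "man")]
deps_b = [("bit", "nsubj", "man"), ("bit", "dobj", "dog")]

def epl_terms(triples):
    # e.g. ("bit", "nsubj", "dog") -> "dog:agent-of-bit" (illustrative only)
    role = {"nsubj": "agent-of", "dobj": "patient-of"}
    return Counter(f"{dep}:{role[rel]}-{head}" for head, rel, dep in triples)

epl_a, epl_b = epl_terms(deps_a), epl_terms(deps_b)

print("bag-of-words cosine:", round(cosine(bow_a, bow_b), 2))   # 1.0
print("dependency-term cosine:", round(cosine(epl_a, epl_b), 2))  # 0.0
```

Under these assumptions, the bag-of-words metric judges the two sentences identical, whereas the dependency-derived terms give them no overlap; the thesis's inter-document and inter-term metrics are more elaborate, but rest on the same intuition.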
|Type||Master's thesis [Academic thesis]|
|Publisher||Technical University of Denmark, DTU Informatics, E-mail: firstname.lastname@example.org|
|Address||Asmussens Alle, Building 305, DK-2800 Kgs. Lyngby, Denmark|
|Note||Supervised by Professor Lars Kai Hansen, email@example.com, DTU Informatics|
|IMM Group(s)||Intelligent Signal Processing|