Information retrieval in document spaces using clustering

Kenneth Lolk Vester, Moses Claus Martiny

AbstractToday, information retrieval plays a large part of our everyday lives - especially with the advent of the World Wide Web. During the last 10 years, the amount of information available in electronic form on the Web has grown exponentially. However, this development has introduced problems of its own; finding useful information is increasingly becoming a hit-or-miss experience that often ends in information overload. In this thesis, we propose document clustering as a possible solution for improving information retrieval on the Web.

The primary objective of this project was to assist the software company Mondosoft in evaluating the feasibility of using document clustering to improve their information retrieval products. To achieve this end, we have designed and implemented a clustering toolkit that allows experiments with various clustering algorithms in connection with real websites.

The construction of the toolkit was based on a comprehensive analysis of current research within the area. The toolkit encompasses the entire clustering process, including data extraction, various preprocessing steps, the actual clustering and postprocessing. The aim of the document clustering is finding similar pages and, to a lesser degree, search result clustering of webpages. The toolkit is fully integrated with Mondosoft's search engine and utilises a two-stage approach to document clustering, where keywords are first extracted and then clustering is performed using these keywords.

The toolkit includes prototype implementations of several promising algorithms, including several novel ideas/approaches of our own. The toolkit implements the following 5 clustering algorithms: K-Means, CURE, PDDP, GALOIS and a novel extended version of Apriori. In addition to this, we introduce two novel approaches for extracting n-grams and a novel keyword extraction scheme based on Latent Semantic Analysis.

To test the capabilities of the implemented algorithms, we have subjected them to extensive performance tests, both in terms of memory and computational requirements. Our tests clearly show that CURE and GALOIS become infeasible in connection with larger websites (10,000+ pages). To evaluate the quality of the remaining three algorithms and the toolkit in general, we have also performed a user test based on the similar pages found by the algorithms.

The user test shows with statistic significance that the quality of the algorithms for this task can be ranked in the following order: Apriori, K-Means and PDDP. Furthermore, the test provided evidence that both K-Means and Apriori were as good as or better than a search-based approach for finding similar pages. Finally, we have found strong evidence that the LSA-based method for keyword extraction is better for subsequent clustering, than pure truncation of terms based on their local weight.
TypeMaster's thesis [Industrial collaboration]
PublisherInformatics and Mathematical Modelling, Technical University of Denmark, DTU
AddressRichard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby
NoteSupervised by Assoc. Prof. Jan Larsen
Electronic version(s)[pdf]
BibTeX data [bibtex]
IMM Group(s)Intelligent Signal Processing