Textmining and Organization in Large Corpus

Wei Ning

Abstract: Nowadays a common document corpus may contain more than 5000 documents. It is almost impossible for a reader to read through all documents in the corpus and find the relevant information within a couple of minutes. In this master's thesis project we propose text clustering as a potential solution for organizing a large document corpus.

As a sub-field of data mining, text mining aims to discover useful information in written resources. Text clustering is one of the topics in text mining; its goal is to identify groups within a collection of text documents and to assign each document to the most relevant group automatically. Representing the document corpus as a term-document matrix is the prevalent preprocessing step in text clustering. If each unique term is taken as a dimension, a corpus of common size may contain more than ten thousand unique terms, which results in extremely high dimensionality. Finding good dimensionality reduction algorithms and suitable clustering methods are the main concerns of this thesis project.
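The abstract does not name a particular toolkit; as an illustration only, the following minimal sketch (assuming Python and scikit-learn, neither of which is mentioned in the thesis record) builds such a term-document representation from a tiny hypothetical corpus.

```python
# Illustrative sketch only (not from the thesis): building a document-term
# matrix with scikit-learn. The toy corpus below is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "text mining discovers useful information in written resources",
    "text clustering assigns documents to the most relevant groups",
    "dimensionality reduction makes clustering of large corpora tractable",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # sparse matrix: documents x unique terms
print(X.shape)                         # each unique term is one dimension
```

For a realistic corpus the number of columns quickly exceeds ten thousand, which is why the matrix is kept sparse and reduced before clustering.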

We mainly compare two dimensionality reduction methods, Singular Value Decomposition (SVD) and Random Projection (RP), and three selected clustering algorithms: K-means, Non-negative Matrix Factorization (NMF) and Frequent Itemset. The selected methods and algorithms are compared in terms of clustering performance and time consumption.
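As a rough illustration of how such a comparison could be set up (the record gives no implementation details), the sketch below times scikit-learn's TruncatedSVD and SparseRandomProjection for dimensionality reduction, and K-means and NMF for clustering, on a synthetic non-negative stand-in for a TF-IDF matrix. The matrix size, target dimensionality, cluster count and library choice are all assumptions, and Frequent Itemset clustering is omitted because it has no standard scikit-learn implementation.

```python
# Illustrative sketch only (not from the thesis): timing SVD vs. Random
# Projection and K-means vs. NMF on a synthetic stand-in for a TF-IDF matrix.
import time

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.random_projection import SparseRandomProjection


def timed(name, fn):
    """Run fn(), print the elapsed wall-clock time, and return the result."""
    start = time.perf_counter()
    result = fn()
    print(f"{name}: {time.perf_counter() - start:.2f} s")
    return result


rng = np.random.default_rng(0)
# Synthetic non-negative, mostly sparse matrix: 500 "documents" x 5000 "terms".
X = rng.random((500, 5000)) * (rng.random((500, 5000)) < 0.02)

k_dims, k_clusters = 100, 10   # assumed target dimensionality and cluster count

# Dimensionality reduction: SVD vs. Random Projection.
X_svd = timed("SVD", lambda: TruncatedSVD(n_components=k_dims).fit_transform(X))
X_rp = timed("RP", lambda: SparseRandomProjection(n_components=k_dims).fit_transform(X))

# Clustering: K-means on the SVD-reduced data; NMF applied directly to the
# non-negative matrix, taking each document's dominant factor as its cluster.
labels_km = timed("K-means",
                  lambda: KMeans(n_clusters=k_clusters, n_init=10).fit_predict(X_svd))
W = timed("NMF",
          lambda: NMF(n_components=k_clusters, init="nndsvd", max_iter=200).fit_transform(X))
labels_nmf = W.argmax(axis=1)
```

In such a setup the printed timings make the trade-off visible: Random Projection is typically much cheaper than SVD, while NMF's iterative updates dominate the clustering cost.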

This thesis project shows that K-means and Frequent Itemset can be applied to a large corpus. NMF may require further research to speed up its convergence.
Type: Master's thesis [Academic thesis]
Year: 2006
Publisher: Informatics and Mathematical Modelling, Technical University of Denmark, DTU
Address: Richard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby
Series: IMM-Thesis-2006-04
Note: Supervised by Jan Larsen, IMM
Electronic version(s): [pdf]
BibTeX data: [bibtex]
IMM Group(s): Intelligent Signal Processing