Enhanced Context Recognition by Sensitivity Pruned Vocabularies |
Rasmus Elsborg Madsen, Sigurdur Sigurdsson, Lars Kai Hansen
|
Abstract | Language-independent 'bag-of-words' representations are
surprisingly effective for text classification. The generic
bag-of-words (BOW) approach is based on a high-dimensional
vocabulary, which may reduce the generalization performance of
subsequent classifiers, e.g., those based on ill-posed principal
component transformations. In
this communication our aim is to study the effect of
sensitivity-based pruning of the bag-of-words representation. We
consider neural-network-based sensitivity maps for determining
term relevance when pruning the vocabularies. With reduced
vocabularies, documents are classified using a latent semantic
indexing representation and a probabilistic neural network
classifier. Pruning the vocabularies to approximately 20% of the
original size, we find consistent context recognition enhancement
for two mid-size data sets over a range of training set sizes. We
also study the applicability of the sensitivity measure for
automated keyword generation. |
Keywords | sensitivity, neural networks, text, classification, dimensionality |
Type | Conference paper [With referee] |
Conference | Proceedings of 17th International Conference on Pattern Recognition (ICPR 2004) |
Year | 2004 |
Month | August |
Vol. | 2 |
pp. | 483-486 |
Address | Cambridge UK |
IMM Group(s) | Intelligent Signal Processing |
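
The pipeline outlined in the abstract (train a classifier on the full vocabulary, rank terms by sensitivity, keep roughly 20% of them, then build a latent semantic indexing representation) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: a linear softmax model stands in for the neural network, the sensitivity of a term is assumed to be the mean absolute derivative of the class posteriors with respect to that term, the data are synthetic, and all function names are hypothetical. The probabilistic neural network classifier is not reproduced; the sketch stops at the LSI features.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(X, y, n_classes, lr=1.0, epochs=500):
    """Gradient descent on a linear softmax model (assumed stand-in for the network)."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[y]              # one-hot targets
    for _ in range(epochs):
        P = softmax(X @ W)
        W -= lr * X.T @ (P - Y) / n       # cross-entropy gradient
    return W

def sensitivity_map(X, W):
    """Assumed sensitivity: mean over documents and classes of |dp_k/dx_j|."""
    P = softmax(X @ W)                    # (n_docs, n_classes)
    mean_w = P @ W.T                      # (n_docs, n_terms): sum_c p_c * W[j, c]
    # For a linear softmax model: dp_k/dx_j = p_k * (W[j, k] - sum_c p_c * W[j, c])
    grads = P[:, None, :] * (W[None, :, :] - mean_w[:, :, None])
    return np.abs(grads).mean(axis=(0, 2))  # one score per term

def prune_vocabulary(X, scores, keep_fraction=0.2):
    """Keep the highest-scoring fraction of terms (columns of X)."""
    k = max(1, int(round(keep_fraction * X.shape[1])))
    keep = np.sort(np.argsort(scores)[::-1][:k])
    return X[:, keep], keep

def lsi_project(X, n_components):
    """Latent semantic indexing: project documents onto the leading right singular vectors."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T

# Synthetic term-document counts: 40 documents, 50 terms, 2 classes.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(40, 50)).astype(float)
y = np.arange(40) % 2
X[y == 0, :5] += 3.0                      # terms 0-4 are made indicative of class 0
X[y == 1, 5:10] += 3.0                    # terms 5-9 are made indicative of class 1
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-length documents

W = train_softmax(X, y, n_classes=2)
scores = sensitivity_map(X, W)
X_pruned, kept_terms = prune_vocabulary(X, scores, keep_fraction=0.2)
Z = lsi_project(X_pruned, n_components=3)  # features for a downstream classifier

print("kept term indices:", kept_terms)
print("LSI feature matrix shape:", Z.shape)
```

Pruning before the SVD means the latent semantic indexing step operates on a much smaller term-document matrix, which is the setting the abstract describes; the 20% keep fraction mirrors the figure quoted there, while the number of LSI components is an arbitrary choice for this toy example.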