Modeling Word Burstiness Using the Dirichlet Distribution |
Rasmus Elsborg Madsen, David Kauchak, Charles Elkan
Abstract | Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose the Dirichlet compound multinomial model (DCM) as an alternative to the multinomial. The DCM model has one additional degree of freedom, which allows it to capture burstiness. We show experimentally that the DCM is substantially better than the multinomial at modeling text data, measured by perplexity. We also show using three standard document collections that the DCM leads to better classification than the multinomial model. DCM performance is comparable to that obtained with multiple heuristic changes to the multinomial model. |
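The burstiness claim in the abstract can be illustrated with a minimal sketch. It compares the standard multinomial likelihood with the standard Dirichlet compound multinomial (Polya) likelihood on two toy documents. The four-word vocabulary, the uniform parameters `theta` and `alpha`, and all function names are assumptions made for illustration; they are not taken from the paper, and this is not the authors' implementation or their fitting procedure.

```python
from math import lgamma, log, exp


def log_multinomial_coef(counts):
    """Log of n! / prod(x_w!), the number of orderings of the word counts."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)


def multinomial_log_prob(counts, theta):
    """Log-probability of a count vector under a multinomial with word probabilities theta."""
    return log_multinomial_coef(counts) + sum(
        x * log(t) for x, t in zip(counts, theta) if x > 0
    )


def dcm_log_prob(counts, alpha):
    """Log-probability of a count vector under the DCM (Polya) distribution.

    Standard form: coef * Gamma(A) / Gamma(n + A) * prod Gamma(x_w + alpha_w) / Gamma(alpha_w),
    where A is the sum of alpha and n is the document length.
    """
    n = sum(counts)
    a = sum(alpha)
    return (
        log_multinomial_coef(counts)
        + lgamma(a)
        - lgamma(n + a)
        + sum(lgamma(x + al) - lgamma(al) for x, al in zip(counts, alpha))
    )


# Hypothetical toy data: two documents of length 4 over a 4-word vocabulary.
bursty = [4, 0, 0, 0]   # one word repeated four times
spread = [1, 1, 1, 1]   # four different words, once each

theta = [0.25] * 4      # assumed uniform multinomial parameters
alpha = [0.1] * 4       # assumed small Dirichlet parameters (small values favor bursts)

for name, doc in (("bursty", bursty), ("spread", spread)):
    print(name,
          "multinomial:", exp(multinomial_log_prob(doc, theta)),
          "DCM:", exp(dcm_log_prob(doc, alpha)))
```

With these assumed parameters the multinomial assigns the bursty document a lower probability than the spread one, while the DCM reverses that ordering. The overall scale of `alpha` is what the multinomial has no counterpart for: making all entries smaller makes repeated words more likely without changing their relative proportions.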
Keywords | Text mining, DCM, Polya, Multinomial, Categorization, Supervised |
Type | Conference paper [With referee] |
Conference | International Conference on Machine Learning |
Year | 2005 | Month | June | pp. 489-498 |
Electronic version(s) | [pdf] |
BibTeX data | [bibtex] |
IMM Group(s) | Intelligent Signal Processing |