Modeling Word Burstiness Using the Dirichlet Distribution

Rasmus Elsborg Madsen, David Kauchak, Charles Elkan

Abstract: Multinomial distributions are often used to
model text documents. However, they do
not capture well the phenomenon that words
in a document tend to appear in bursts: if
a word appears once, it is more likely to
appear again. In this paper, we propose
the Dirichlet compound multinomial model
(DCM) as an alternative to the multinomial.
The DCM model has one additional degree
of freedom, which allows it to capture burstiness.
We show experimentally that the DCM
is substantially better than the multinomial
at modeling text data, measured by perplexity.
We also show using three standard document
collections that the DCM leads to better
classification than the multinomial model.
DCM performance is comparable to that obtained
with multiple heuristic changes to the
multinomial model.
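
To illustrate the burstiness claim in the abstract, here is a minimal Python sketch (not taken from the paper) that compares the log-probability of a "bursty" document and a "spread-out" document under a multinomial and under a DCM. It uses the standard Polya (DCM) probability mass function. The four-word vocabulary, the parameter values alpha and theta, and all function names are assumptions chosen purely for illustration.

```python
from math import lgamma, log

def log_multinomial_coefficient(counts):
    # log( n! / prod_w x_w! ), computed with log-gamma for numerical stability
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)

def multinomial_log_prob(counts, theta):
    # log p(x | theta) = log coeff + sum_w x_w * log(theta_w)
    return log_multinomial_coefficient(counts) + sum(
        x * log(t) for x, t in zip(counts, theta) if x > 0
    )

def dcm_log_prob(counts, alpha):
    # log p(x | alpha) = log coeff + log G(A) - log G(A + n)
    #                    + sum_w [ log G(x_w + alpha_w) - log G(alpha_w) ]
    # where G is the gamma function and A = sum_w alpha_w
    n = sum(counts)
    a = sum(alpha)
    return (
        log_multinomial_coefficient(counts)
        + lgamma(a) - lgamma(a + n)
        + sum(lgamma(x + al) - lgamma(al) for x, al in zip(counts, alpha))
    )

# Illustrative (assumed) parameters: small alpha values make the DCM bursty.
alpha = [0.1, 0.1, 0.1, 0.1]
# Multinomial with the same mean word proportions as the DCM above.
theta = [al / sum(alpha) for al in alpha]

bursty = [4, 0, 0, 0]   # one word repeated four times
spread = [1, 1, 1, 1]   # four different words, once each

for name, doc in (("bursty", bursty), ("spread", spread)):
    print(name,
          "multinomial:", multinomial_log_prob(doc, theta),
          "DCM:", dcm_log_prob(doc, alpha))
```

With these assumed parameters, the DCM assigns a higher probability to the bursty document than to the spread-out one, while the multinomial with the same mean word proportions prefers the spread-out document.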
Keywords: Text mining, DCM, Polya, Multinomial, Categorization, Supervised
Type: Conference paper [With referee]
Conference: International Conference on Machine Learning
Year: 2005    Month: June    pp. 489-498
IMM Group(s): Intelligent Signal Processing