Combining Semantic and Acoustic Features for Valence and Arousal Recognition in Speech



Abstract: The recognition of affect in speech has attracted considerable interest recently, especially in the cognitive and computer sciences. Most previous studies focused on the recognition of basic emotions (such as happiness, sadness and anger) using a categorical approach. Recently, the focus has been shifting towards dimensional affect recognition, based on the idea that emotional states are not independent of one another but related in a systematic manner. In this paper, we design a continuous dimensional speech affect recognition model that combines acoustic and semantic features. We construct our own corpus, consisting of 59 short movie clips with audio and text in subtitle format, rated by human subjects on the arousal and valence (A-V) dimensions. For the acoustic part, we combine a large set of features, apply correlation-based feature selection, and use support vector regression. For the semantic part, we use the affective norms for English words (ANEW), which are also rated on the A-V dimensions, as keywords and apply latent semantic analysis (LSA) to those words and the words in the clips to estimate the clips' A-V values. Finally, the results of the acoustic and semantic parts are combined. We show that combining semantic and acoustic information for dimensional speech affect recognition improves the results. Moreover, we show that valence is better estimated using semantic features, while arousal is better estimated using acoustic features.
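
Purely as an illustration of the pipeline described in the abstract, the Python snippet below shows one way the two branches and their fusion could be wired together with scikit-learn. It is a minimal sketch under stated assumptions: the feature matrix, ratings, subtitle texts and ANEW-style values are placeholder data, SelectKBest with an F-test stands in for the paper's correlation-based feature selection, and the fusion weights are a guess, not the authors' method.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.svm import SVR
from sklearn.metrics.pairwise import cosine_similarity

# Acoustic branch: feature selection followed by support vector regression.
X_acoustic = np.random.rand(59, 300)        # placeholder acoustic features, one row per clip
y_arousal = np.random.rand(59)              # placeholder human arousal ratings
selector = SelectKBest(f_regression, k=50)  # stand-in for correlation-based feature selection
X_sel = selector.fit_transform(X_acoustic, y_arousal)
arousal_acoustic = SVR(kernel="rbf").fit(X_sel, y_arousal).predict(X_sel)

# Semantic branch: LSA over clip subtitles and ANEW keywords, then
# similarity-weighted averaging of the keywords' A-V ratings.
subtitles = ["placeholder subtitle text for clip one",
             "placeholder dialogue for clip two"]
anew_words = ["happy", "sad", "angry"]       # example ANEW keywords
anew_valence = np.array([8.2, 1.6, 2.9])     # illustrative ANEW-style valence values
tdm = TfidfVectorizer().fit_transform(subtitles + anew_words)
lsa = TruncatedSVD(n_components=2).fit_transform(tdm)
clip_vecs, word_vecs = lsa[:len(subtitles)], lsa[len(subtitles):]
sim = np.clip(cosine_similarity(clip_vecs, word_vecs), 0.0, None)
valence_semantic = (sim @ anew_valence) / (sim.sum(axis=1) + 1e-9)  # arousal is analogous

# Fusion: combine the two estimates per dimension, e.g. a weighted average
# (an acoustic valence model would be trained the same way as the arousal one above):
# final_valence = 0.7 * valence_semantic + 0.3 * valence_acoustic
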
Keywords: emotion recognition, valence, arousal, speech
Type: Conference paper [with referee]
Conference: Cognitive Information Processing (CIP 2012)
Year: 2012    Month: May
Publisher: IEEE Press
Note: Associated presentation: http://www.imm.dtu.dk/pubdb/p.php?6326
Electronic version(s): [pdf]
Publication link: http://cip2012.tsc.uc3m.es/
BibTeX data: [bibtex]
IMM Group(s): Intelligent Signal Processing