Learning Combinations of Multiple Feature Representations for Music Emotion Prediction
|Jens Madsen, Bjørn Sand Jensen, Jan Larsen|
|Abstract||Music consists of several structures and patterns evolving through time, which greatly influence how humans decode higher-level cognitive aspects of music, such as the emotions it expresses.|
For tasks such as genre, tag, and emotion recognition, these structures have often been identified and used as individual, non-temporal features and representations.
In this work, we test the hypothesis that using multiple temporal and non-temporal representations of different features is beneficial for modeling musical structure, with the aim of predicting the emotions expressed in music.
We test this hypothesis by representing temporal and non-temporal structures using generative models of multiple audio features. The representations are then used in a discriminative setting via the Probability Product Kernel and the Gaussian Process model, enabling Multiple Kernel Learning, i.e., finding optimized combinations of both features and temporal/non-temporal representations.
We show increased predictive performance when combining different features and representations, along with the strong interpretive prospects of this approach.
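The kernel combination described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, each track's feature representation is simplified to a diagonal Gaussian, and the MKL weights are fixed here rather than optimized as in the paper.

```python
import numpy as np

def ppk_gaussian(mu1, var1, mu2, var2):
    # Probability product kernel (rho = 1, "expected likelihood" form)
    # between two diagonal Gaussians: the integral of
    # N(x; mu1, var1) * N(x; mu2, var2) dx has the closed form
    # N(mu1; mu2, var1 + var2).
    s = var1 + var2
    d = mu1 - mu2
    return np.exp(-0.5 * np.sum(d * d / s)) / np.sqrt(np.prod(2.0 * np.pi * s))

def combined_kernel(reps, weights):
    # reps: list of (means, variances) pairs, one per feature
    # representation, each array of shape (n_tracks, dim).
    # Returns the MKL-style weighted sum of the per-representation
    # PPK Gram matrices, which can serve as a GP covariance.
    n = reps[0][0].shape[0]
    K = np.zeros((n, n))
    for (mus, vars_), w in zip(reps, weights):
        for i in range(n):
            for j in range(n):
                K[i, j] += w * ppk_gaussian(mus[i], vars_[i],
                                            mus[j], vars_[j])
    return K
```

In the paper's setting the weights would be learned jointly with the Gaussian Process hyperparameters; the sketch above only shows how multiple generative feature representations enter a single combined kernel.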
|Keywords||Music emotion prediction; expressed emotions; pairwise comparisons; multiple kernel learning; Gaussian process|
|Type||Conference paper [With referee]|
|Conference||Affect and Sentiment in Multimedia (ASM) - an ACM MM'15 workshop|
|Year||2015|
|Month||October|
|IMM Group(s)||Intelligent Signal Processing|