Are deep neural networks really learning relevant features?

Corey Mose Kereliuk, Bob Sturm, Jan Larsen

Abstract: In recent years deep neural networks (DNNs) have become a popular choice for audio content analysis. This may be attributed to various factors including advancements in training algorithms, computational power, and the potential for DNNs to implicitly learn a set of feature detectors. We have recently re-examined two works that consider DNNs for the task of music genre recognition (MGR). These papers conclude that frame-level features learned by DNNs offer an improvement over traditional, hand-crafted features such as Mel-frequency cepstrum coefficients (MFCCs). However, these conclusions were drawn based on training/testing using the GTZAN dataset, which is now known to contain several flaws, including replicated observations and artists. We illustrate how accounting for these flaws dramatically changes the results, which leads one to question the degree to which the learned frame-level features are actually useful for MGR. We make available a reproducible software package allowing other researchers to completely duplicate our figures and results.
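The GTZAN flaw the abstract refers to (the same artist, or even a replicated recording, appearing in both the training and test partitions) can be illustrated with an artist-conditional split. The sketch below is illustrative only: the track schema, artist names, and function name are hypothetical and not taken from the authors' actual evaluation pipeline.

```python
def artist_filtered_split(tracks, held_out_artists):
    """Split tracks so no artist spans both train and test.

    tracks: list of (track_id, artist, genre) tuples (hypothetical schema).
    held_out_artists: set of artist names reserved for the test set.
    """
    train, test = [], []
    for track in tracks:
        # Route every excerpt by its artist, never by the excerpt itself,
        # so replicated recordings by one artist cannot leak across the split.
        (test if track[1] in held_out_artists else train).append(track)
    return train, test

# Toy GTZAN-style data: two excerpts by the same artist would inflate
# test accuracy if a random per-excerpt split separated them.
tracks = [
    ("blues.00001", "John Lee Hooker", "blues"),
    ("blues.00002", "John Lee Hooker", "blues"),
    ("rock.00001", "Queen", "rock"),
    ("rock.00002", "Sting", "rock"),
]
train, test = artist_filtered_split(tracks, {"John Lee Hooker"})
```

With a naive random split, a classifier can score well simply by recognizing an artist's production style; the artist-filtered split removes that shortcut, which is why accounting for it changes the reported results.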
Keywords: Deep neural networks, audio, feature learning, music information retrieval, genre recognition
Type: Conference paper [without referee]
Conference: DMRN+9: Digital Music Research Network One-day Workshop 2014
Year: 2014    Month: December
Note: Queen Mary University of London, Tuesday 16th December 2014
IMM Group(s): Scientific Computing, Intelligent Signal Processing