Tools for Automatic Audio Indexing

Abstract	Current web search engines generally do not enable searches into audio files. Informative metadata would allow searches into audio files, but producing such metadata is a tedious manual task. Tools for automatic production of metadata are therefore needed. This project investigates methods for audio segmentation and speech recognition, which can be used for this metadata extraction. Classification models for discriminating between speech and music are investigated. A feature set consisting of zero-crossing rate, short-time energy, spectrum flux, and mel-frequency cepstral coefficients is integrated over a 1-second window to yield a 60-dimensional feature vector. A number of classifiers are compared, including artificial neural networks and a linear discriminant. The results obtained using the linear discriminant are comparable with the performance of more complex classifiers. The dimensionality of the feature vectors is reduced from 60 to 14 features using a selection scheme based on the linear discriminant. The resulting model, using 14 features with the linear discriminant, yields a test misclassification rate of 2.2%. A speaker change detection algorithm based on a vector quantization distortion (VQD) measure is proposed. The algorithm works in two steps: the first step finds potential change-points, and the second step compensates for the false alarms produced by the first step. The VQD metric is compared with two other frequently used metrics, Kullback-Leibler divergence (KL2) and Divergence Shape Distance (DSD), and is found to yield better results. An overall F-measure of 85.4% is obtained. The false alarm compensation shows a relative improvement in precision of 59.7% with a relative loss of 7.2% in recall for the detected change-points. Parameters chosen on one data set generalize well to other independent data sets. The open source speech recognition system SPHINX-4 is used to produce transcripts of the speech segments. The system shows an overall word accuracy of approximately 75%.
Type	Master's thesis [Academic thesis]
Year	2006
Publisher	Informatics and Mathematical Modelling, Technical University of Denmark, DTU
Address	Richard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby
Series	IMM-Thesis-2006-31
Note	Supervised by Prof. Lars Kai Hansen, IMM.
Electronic version(s)	[pdf]
BibTeX data	[bibtex]
IMM Group(s)	Intelligent Signal Processing
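
Illustrative sketch	The abstract mentions frame-level features such as zero-crossing rate and short-time energy that are integrated over a longer window. The Python sketch below shows one plausible way to do this for those two features only. It is not taken from the thesis: the function names, the frame length and hop, and the choice of mean and variance as the integration statistics are all assumptions made for illustration, and the thesis's actual 60-dimensional feature set is not reproduced here.

```python
def frames(signal, frame_len, hop):
    """Split a sample sequence into overlapping frames; a trailing partial frame is dropped."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def mean_and_variance(values):
    """Summarize a list of per-frame values (assumed integration statistics)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def window_features(signal, frame_len, hop):
    """Return [zcr_mean, zcr_var, energy_mean, energy_var] for one window of samples."""
    fs = frames(signal, frame_len, hop)
    zcr = [zero_crossing_rate(f) for f in fs]
    ste = [short_time_energy(f) for f in fs]
    return [*mean_and_variance(zcr), *mean_and_variance(ste)]
```

For a 1-second window sampled at a hypothetical 8 kHz, one might call `window_features(samples, frame_len=160, hop=80)`, which corresponds to 20 ms frames with 50% overlap. Speech tends to alternate between voiced and unvoiced frames, so the variance of these features over the window is often more informative than the mean alone, which is the motivation for including both statistics in this sketch.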