@MASTERSTHESIS{IMM2006-04472, author = "K. W. J{\o}rgensen and L. L. M{\o}lgaard", title = "Tools for Automatic Audio Indexing", year = "2006", school = "Informatics and Mathematical Modelling, Technical University of Denmark, {DTU}", address = "Richard Petersens Plads, Building 321, {DK-}2800 Kgs. Lyngby", note = "Supervised by Prof. Lars Kai Hansen, {IMM}.", url = "http://www2.compute.dtu.dk/pubdb/pubs/4472-full.html", abstract = "Current web search engines generally do not enable searches into audio files. Informative metadata would allow such searches, but producing this metadata manually is a tedious task, so tools for automatic metadata production are needed. This project investigates methods for audio segmentation and speech recognition that can be used for this metadata extraction. Classification models for discriminating speech from music are investigated. A feature set consisting of zero-crossing rate, short-time energy, spectrum flux, and mel-frequency cepstral coefficients is integrated over a 1-second window to yield a 60-dimensional feature vector. A number of classifiers are compared, including artificial neural networks and a linear discriminant. The results obtained with the linear discriminant are comparable to those of more complex classifiers. The dimensionality of the feature vectors is reduced from 60 to 14 features using a selection scheme based on the linear discriminant. The resulting model, using 14 features with the linear discriminant, yields a test misclassification rate of 2.2\%. A speaker change detection algorithm based on a vector quantization distortion ({VQD}) measure is proposed. The algorithm works in two steps: the first step finds potential change-points, and the second step compensates for the false alarms produced by the first. 
The {VQD} metric is compared with two other frequently used metrics, the Kullback-Leibler divergence (KL2) and the Divergence Shape Distance (DSD), and is found to yield better results. An overall {F-}measure of 85.4\% is obtained. The false alarm compensation shows a relative improvement in precision of 59.7\%, with a relative loss of 7.2\% in recall of the found change-points. The choice of parameters based on one data set generalizes well to other, independent data sets. The open-source speech recognition system {SPHINX-}4 is used to produce transcripts of the speech segments. The system shows an overall word accuracy of approximately 75\%." }