Efficient Recursive Speaker Segmentation for Unsupervised Audio Editing

Thor Bundgaard Nielsen

Abstract: Today nearly everyone carries a microphone every waking moment. The world, and particularly the internet, is awash with digital audio. This excess generates demand for machine-learning tools capable of organising and interpreting it, thereby enriching the audio and creating actionable information.
This thesis tackles the problem of speaker diarisation, answering the question "Who spoke when?" without the need for human intervention. This is achieved through the design of a custom algorithm that, when given data, automatically designs an algorithm capable of solving this problem optimally.
Initially, this thesis surveys the field of change detection in general. A diverse variety of methods is studied, compared, contrasted, combined and improved. A subgroup of these methods is selected and optimised further through a recursive design. Beyond this, the raw audio is processed using a model of the speech production system to generate a sequence of highly descriptive features. This process deconvolves an auditory fingerprint from the literal information carried by speech.
Given data from normal conversation between an arbitrary number of people, the generated algorithm is capable of identifying almost 19 out of 20 speaker changes with very few false alarms. The algorithm operates 5 times faster than real time on a contemporary PC and subsequently answers the "who" by comparing the speaker turns and assigning labels.
The work carried out in this thesis is of particular practical use in the field of audio editing.
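The abstract does not state which change-detection criterion or features the thesis settles on, so the following is only a minimal sketch of the general idea of recursive segmentation, not the author's method. It assumes each audio frame has already been converted to a feature vector, and it scores a candidate split by the squared distance between the mean feature vectors on either side. The names `find_changes`, `min_len` and `threshold` are hypothetical and chosen purely for illustration.

```python
from typing import List, Optional, Sequence

Frame = Sequence[float]


def _mean(frames: Sequence[Frame]) -> List[float]:
    """Per-dimension mean of a non-empty list of feature vectors."""
    dims = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dims)]


def _split_score(frames: Sequence[Frame], k: int) -> float:
    """Squared Euclidean distance between the means left and right of index k."""
    left, right = _mean(frames[:k]), _mean(frames[k:])
    return sum((a - b) ** 2 for a, b in zip(left, right))


def find_changes(
    frames: Sequence[Frame],
    start: int = 0,
    end: Optional[int] = None,
    min_len: int = 5,
    threshold: float = 1.0,
) -> List[int]:
    """Return sorted frame indices where the feature statistics appear to change.

    Assumption: a simple mean-difference score stands in for whatever
    criterion a real system would use.
    """
    if end is None:
        end = len(frames)
    # Stop when the segment is too short to hold two minimum-length halves.
    if end - start < 2 * min_len:
        return []

    segment = frames[start:end]
    best_k, best_score = None, threshold
    for k in range(min_len, len(segment) - min_len + 1):
        score = _split_score(segment, k)
        if score > best_score:
            best_k, best_score = k, score

    # No candidate beat the threshold: treat the segment as one speaker turn.
    if best_k is None:
        return []

    split = start + best_k
    # Recurse on both halves to look for further changes.
    return (
        find_changes(frames, start, split, min_len, threshold)
        + [split]
        + find_changes(frames, split, end, min_len, threshold)
    )
```

On a toy one-dimensional sequence of 20 frames at 0.0, 20 frames at 5.0 and 20 frames at 0.0, the sketch reports changes at frames 20 and 40. Real speech features are noisy, so a practical system would need a more robust score and a tuned threshold.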
Type: Master's thesis [Academic thesis]
Year: 2013
Publisher: Technical University of Denmark, Department of Applied Mathematics and Computer Science / DTU Co
Address: Matematiktorvet, Building 303B, DK-2800 Kgs. Lyngby, Denmark, compute@compute.dtu.dk
Series: M.Sc.-2013-62
Electronic version(s): [pdf]
Publication link: http://www.compute.dtu.dk/English.aspx
BibTeX data: [bibtex]
IMM Group(s): Intelligent Signal Processing