Speaker Identification for Hearing Instruments

Maia E.M. Weddin

Abstract: This thesis proposes a speaker identification system that can differentiate between members of a small set of speakers and can also detect an impostor sound and classify it accordingly. The identification system is text-independent: no specific words or sounds have to be uttered for the identification to work. In cooperation with GN ReSound, the ultimate implementation of this system would be in hearing aids, more specifically those designed for children, as children have more difficulty adjusting a hearing instrument when such an adjustment becomes necessary. A variety of speech feature sets are extracted, including fundamental frequency estimates, LPCC, warped LPCC, PLPCC, MFCC and the LPC residual. Three classifiers are used to establish which combination of feature set and classifier is optimal: Mixture of Gaussians models, k-Nearest Neighbour and the nonlinear Neural Network. Classification results are obtained for each frame of a test sentence, and the performance of each system setup is measured both by the identification rate over the small set of speakers, calculated by consensus over the individually classified frames of each sentence, and by the percentage of correctly classified frames. The Neural Network classifier proves more robust than the Mixture of Gaussians classifier and achieves a 100% correct identification rate with the 8MFCC feature set.
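The per-sentence consensus described above can be sketched as a majority vote over per-frame decisions. This is a hypothetical illustration, not the thesis code: the frame classifier itself (GMM, k-NN or neural network) is assumed to have already produced one speaker label per frame.

```python
from collections import Counter

def identify_sentence(frame_labels):
    """Consensus (majority vote) over per-frame speaker labels.

    Returns the winning speaker for the sentence and the fraction of
    frames that voted for that speaker (the 'correct frame rate' when
    the winner is the true speaker).
    """
    counts = Counter(frame_labels)
    speaker, votes = counts.most_common(1)[0]
    return speaker, votes / len(frame_labels)

# Example: 10 frames of one test sentence, 7 classified as speaker "A"
labels = ["A", "A", "B", "A", "A", "C", "A", "A", "B", "A"]
speaker, frame_rate = identify_sentence(labels)
# speaker == "A", frame_rate == 0.7
```

The sentence-level identification rate can thus exceed the per-frame rate, since occasional misclassified frames are outvoted.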

As the ultimate aim of this research is the implementation of a speaker identification system in a hearing instrument, a method for detecting impostors is implemented. This is done by density modelling with the Mixture of Gaussians classifier, and a 90% impostor detection rate is obtained for the 12MFCC feature set.
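The density-modelling idea can be sketched as follows: a test feature is flagged as an impostor when no enrolled speaker's Mixture of Gaussians assigns it a likelihood above a threshold. This is a minimal one-dimensional sketch with assumed parameters and an assumed threshold, not the thesis's trained models.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of x under a one-dimensional Mixture of Gaussians."""
    return math.log(sum(w * math.exp(gaussian_logpdf(x, m, v))
                        for w, m, v in zip(weights, means, variances)))

def is_impostor(x, speaker_gmms, threshold):
    """Impostor decision: True if even the best-matching enrolled
    speaker's model scores x below the log-likelihood threshold."""
    best = max(gmm_loglik(x, *gmm) for gmm in speaker_gmms)
    return best < threshold

# Two enrolled speakers, each modelled by a single-component mixture
# (weights, means, variances); threshold chosen for illustration only.
gmms = [([1.0], [0.0], [1.0]),
        ([1.0], [5.0], [1.0])]
print(is_impostor(0.2, gmms, -5.0))   # near speaker 1: accepted
print(is_impostor(20.0, gmms, -5.0))  # far from both: impostor
```

In practice the features are multidimensional (e.g. 12MFCC) and the threshold would be tuned to trade off impostor detection against false rejection of enrolled speakers.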

Finally, the small set of speakers is divided into a group of female speakers and a group of male speakers based on fundamental frequency estimates. The feature sets are also divided into subsets based on whether a frame is voiced, unvoiced, voiced preceded by an unvoiced frame, or unvoiced preceded by a voiced frame. For the 12MFCC feature set used with the Neural Network classifier, correct identification of all speakers from a limited amount of data is only obtained when using the voiced-preceded-by-unvoiced and unvoiced-preceded-by-voiced feature subsets, and the correct frame rate using these subsets combined with gender separation increases by up to 23%.
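The frame-subset division can be sketched from a sequence of voiced/unvoiced flags (here assumed precomputed, e.g. from whether a fundamental frequency estimate was found for the frame); the actual feature extraction in the thesis is not reproduced.

```python
def split_subsets(voiced_flags):
    """Assign frame indices to the four subsets: voiced (V),
    unvoiced (U), voiced preceded by an unvoiced frame (V_after_U),
    and unvoiced preceded by a voiced frame (U_after_V)."""
    subsets = {"V": [], "U": [], "V_after_U": [], "U_after_V": []}
    for i, voiced in enumerate(voiced_flags):
        subsets["V" if voiced else "U"].append(i)
        if i > 0:
            prev = voiced_flags[i - 1]
            if voiced and not prev:
                subsets["V_after_U"].append(i)   # transition into voicing
            elif not voiced and prev:
                subsets["U_after_V"].append(i)   # transition out of voicing
    return subsets

# Example: five frames, alternating voicing
subsets = split_subsets([False, True, True, False, True])
# {'V': [1, 2, 4], 'U': [0, 3], 'V_after_U': [1, 4], 'U_after_V': [3]}
```

The transition subsets are small, which is consistent with the observation that a limited amount of data suffices when these subsets are used.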
Keywords: Fundamental frequency estimation, MFCC, LPCC, PLPCC, Mixture of Gaussians, impostor detection, nonlinear neural network, voiced/unvoiced speech
Type: Master's thesis [Industrial collaboration]
Publisher: Informatics and Mathematical Modelling, Technical University of Denmark, DTU
Address: Richard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby
Note: Supervised by Assoc. Prof. Ole Winther, IMM, and Brian Pedersen, GN ReSound
Electronic version(s): [pdf]
BibTeX data: [bibtex]
IMM Group(s): Intelligent Signal Processing
