Temporal visual cues aid speech recognition

Xiang Zhou, Lars Ross, Tue Lehn-Schiøler, Lucas Parra

Abstract: BACKGROUND: It is well known that under noisy conditions, viewing a
speaker's articulatory movement aids the recognition of spoken words.
Conventionally it is thought that the visual input disambiguates
otherwise confusing auditory input. HYPOTHESIS: In contrast, we
hypothesize that it is the temporal synchronicity of the visual input
that aids parsing of the auditory stream. More specifically, we expected
that purely temporal information, which does not convey information such
as place of articulation, may facilitate word recognition. METHODS: To test
this prediction we used temporal features of audio to generate an
artificial talking-face video and measured word recognition performance
on simple monosyllabic words. RESULTS: When presenting words together
with the artificial video, we find that word recognition is improved over
purely auditory presentation. The effect is significant (p < 0.01) at SNRs
at or above -12 dB. At lower SNRs, the visual temporal information
does not improve recognition, confirming that our visual input does not
contain useful lip-reading information in itself. CONCLUSION: Thus, we
argue that temporal information is used in addition to articulatory
features. This finding supports the notion that synchronous visual input
aids auditory processing at an early parsing stage.
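
The abstract does not say which temporal features were extracted or how the artificial face was rendered. As a minimal sketch only, assuming the feature is a short-time RMS energy envelope sampled once per video frame, and assuming noise is scaled to reach a target SNR, the two steps might look like this. The function names, the 25 fps frame rate, and the idea of using the envelope as a mouth-opening parameter are illustrative assumptions, not the authors' method.

```python
import math


def frame_envelope(samples, sample_rate, fps=25):
    """Short-time RMS energy per video frame, normalized to [0, 1].

    Hypothetical helper: each value could drive how far an artificial
    mouth opens in the corresponding frame. It carries timing only,
    not place of articulation.
    """
    hop = sample_rate // fps  # audio samples per video frame
    rms = []
    for start in range(0, len(samples) - hop + 1, hop):
        frame = samples[start:start + hop]
        rms.append(math.sqrt(sum(x * x for x in frame) / hop))
    peak = max(rms, default=0.0)
    return [r / peak if peak > 0 else 0.0 for r in rms]


def noise_gain_for_snr(speech, noise, snr_db):
    """Gain to apply to `noise` so the speech-to-noise power ratio is snr_db."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    return math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
```

With a one-second signal at 8000 Hz and the assumed 25 fps, `frame_envelope` returns 25 values, one per frame. Multiplying a noise recording by `noise_gain_for_snr(speech, noise, -12)` before adding it to the speech gives a mixture at -12 dB SNR, the lowest level at which the abstract reports a significant benefit.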
Type: Conference paper [Abstract]
Conference: 7th Annual Meeting of the International Multisensory Research Forum
IMM Group(s): Intelligent Signal Processing

