A Framework for Evaluating Multimodal Processing and A Role for Embodied Conversational Agents
One of the implicit assumptions of multimodal interfaces is that human-computer interaction is significantly facilitated by providing multiple input and output modalities. Surprisingly, however, there is very little theoretical and empirical research testing this assumption in terms of the presentation of multimodal displays to the user. The goal of this talk is to provide both a theoretical and empirical framework for addressing this important issue. Two contrasting classes of models of human information processing are formulated and compared in experimental tests. According to integration models, multiple sensory influences are continuously combined during categorization, leading to perceptual experience and action. The Fuzzy Logical Model of Perception (FLMP) assumes that processing occurs in three successive but overlapping stages: evaluation, integration, and decision (Massaro, 1998). According to nonintegration models, any perceptual experience and action results from only a single sensory influence. These models are tested in expanded factorial designs, in which two input modalities are varied independently of one another and each modality is also presented alone. Results from a variety of experiments on speech, emotion, and gesture support the predictions of the FLMP. Baldi, an embodied conversational agent, is described, and implications for applications of multimodal interfaces are discussed.
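A minimal sketch may make the integration idea concrete. In the two-alternative case, the FLMP combines the degree of support from each modality multiplicatively and normalizes by the total support for both alternatives; the numbers below are illustrative only, not fitted values from the experiments:

```python
def flmp_response(a, v):
    """FLMP integration for a two-alternative task: multiplicative
    combination of auditory support a and visual support v for one
    alternative, normalized by the summed support for both."""
    num = a * v
    return num / (num + (1.0 - a) * (1.0 - v))

# In an expanded factorial design, a modality presented alone can be
# modeled by setting the absent modality's support to 0.5 (uninformative).
print(flmp_response(0.9, 0.5))  # audio-alone trial -> 0.9
print(flmp_response(0.8, 0.7))  # bimodal support exceeds either alone
```

Note the characteristic super-additivity: two moderately supportive modalities (0.8 and 0.7) yield a combined response probability above either alone, which is the kind of pattern that distinguishes integration from nonintegration models.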
This talk presents a novel representation for auditory environments that can be used for classifying events of interest, such as speech and cars, and potentially for classifying the environments themselves. We propose a novel discriminative framework based on the audio epitome, an audio extension of the image representation developed by Jojic et al. We also develop an informative patch sampling procedure to train the epitomes, which reduces the computational complexity and increases the quality of the epitome. For classification, the training data is used to learn distributions over the epitomes to model the different classes; the distributions for new inputs are then compared to these models. On a task of distinguishing between four auditory classes of environmental sounds (car, speech, birds, utensils), our method outperforms the conventional approaches of nearest neighbor and mixture of Gaussians on three of the four classes.
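The classification step can be illustrated with a much-simplified stand-in: here a k-means codebook plays the role of the epitome (the real epitome is a generative patch model, so this substitution is purely for illustration), each class is modeled by its distribution over codebook entries, and a new input is assigned to the class whose distribution is closest:

```python
import numpy as np

rng = np.random.default_rng(0)

def codebook(patches, k=4, iters=10):
    """Stand-in for the trained epitome: a k-means codebook of patches."""
    centers = patches[rng.choice(len(patches), k, replace=False)]
    for _ in range(iters):
        d = ((patches[:, None] - centers[None]) ** 2).sum(-1)
        idx = d.argmin(1)
        for j in range(k):
            if (idx == j).any():
                centers[j] = patches[idx == j].mean(0)
    return centers

def histogram(patches, centers):
    """Distribution of a patch set over the codebook entries."""
    d = ((patches[:, None] - centers[None]) ** 2).sum(-1)
    h = np.bincount(d.argmin(1), minlength=len(centers)).astype(float)
    return h / h.sum()

# Two toy classes of 1-D "audio patches" (synthetic, for illustration).
car = rng.normal(0.0, 0.3, (200, 8))
speech = rng.normal(1.0, 0.3, (200, 8))
centers = codebook(np.vstack([car, speech]))
models = {"car": histogram(car, centers), "speech": histogram(speech, centers)}

def classify(patches):
    """Compare the input's distribution to each class model (L1 distance)."""
    h = histogram(patches, centers)
    return min(models, key=lambda c: np.abs(models[c] - h).sum())

print(classify(rng.normal(1.0, 0.3, (50, 8))))  # -> speech
```

The structure mirrors the abstract's pipeline (representation, per-class distributions, comparison for new inputs), while the representation itself is deliberately trivialized.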
The use of multiple sensors, such as ubiquitous and wearable sensors, not only opens new ways of interaction but may also lead to a paradigm shift in human-computer interaction. For example, the use of multiple sensors and elaborate fusion techniques may play an important role in deriving interesting, high-level context. Such a scenario is quite plausible, since it is widely believed that many sensors are becoming so cheap that they can easily be added to computing and wearable devices, whether distributed in the environment or worn by a human.
In this talk I will present several wearable multi-sensor platforms with their respective applications for human-computer interaction. I will also discuss resulting research challenges not only in the area of perception but also in the area of human-computer interaction.
I will describe some of our work in capturing human behavior in natural settings and learning to imitate it as a discriminative prediction problem. Training data of real human behavior is obtained by watching humans interact with each other using multiple perceptual modalities. These modalities include head and hand tracking, face modeling, generic visual scene descriptors, and auditory representations. They can also be placed in wearable configurations to permit longer-term data collection. The real-time modalities allow us to treat behavior learning as a time-series prediction problem, where hours of interaction data are used to learn predictive distributions using mixtures and input-output hidden Markov models. The models mimic and react to simple human activities after monitoring real interactions in an unsupervised setting.
T. Jebara and A. Pentland. "Statistical Imitative Learning from Perceptual Data". In 2nd International Conference on Development and Learning (ICDL'02), June 2002.
T. Jebara and A. Pentland. "Action Reaction Learning: Automatic Visual Analysis and Synthesis of Interactive Behaviour". In International Conference on Vision Systems (ICVS), January 1999.
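As a toy illustration of the time-series prediction framing, the sketch below learns to predict one person's reaction from a short window of the other person's signal. A plain least-squares predictor stands in for the mixture and input-output HMM models of the work above, and the interaction data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy interaction: person B imitates person A's tracked feature with a lag.
t = np.arange(500)
a = np.sin(0.1 * t)                                  # person A's signal
b = np.roll(a, 5) + 0.05 * rng.normal(size=t.size)   # B reacts, delayed, noisy

# Behavior learning as prediction: predict B's next value from a window
# of A's recent past (least-squares stand-in for the learned predictors).
W = 10
X = np.stack([a[i - W:i] for i in range(W, len(t) - 1)])
y = b[W + 1:len(t)]
w, *_ = np.linalg.lstsq(X, y, rcond=None)

pred = X @ w
err = float(np.sqrt(np.mean((pred - y) ** 2)))
print(round(err, 3))  # close to the injected noise level
```

Because B's reaction is a lagged copy of A's signal, the correct mapping lies inside the window and the residual error drops to roughly the noise floor, which is the sense in which "hours of interaction" supervise the predictor without manual labels.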
To map from speech to video sequences of a face, features are extracted from both sound and images. For the speech, a representation is used that can capture non-phonetic sounds such as yawning as well as phonetic sounds. By using a generative model of the face, it is possible both to extract features and to synthesize photorealistic images of facial expressions. By adjusting a small set of parameters, it is possible to create all emotions and lip movements seen in the training data. The mapping from speech to facial movements is performed by adopting a state-space approach. The system provides smooth facial movements matching the speech and makes it possible to create a face that mimics the sound.
Tue Lehn-Schiøler. "Multimedia Mapping using Continuous State Space Models". In Proceedings of the Multimedia Signal Processing Workshop, Siena, 2004.
Tue Lehn-Schiøler, Lars Kai Hansen, Jan Larsen. "Mapping from Speech to Images Using Continuous State Space Models". In Proceedings of the Joint AMI/PASCAL/IM2/M4 Workshop on Multimodal Interaction and Related Machine Learning Algorithms, Martigny, 2004.
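To make the state-space idea concrete, here is a scalar linear-Gaussian sketch: a speech feature enters as the control input, a Kalman filter tracks a latent facial state (say, mouth opening), and the filtered trajectory is smooth by construction. All parameters below are hand-set for illustration; in the papers above the model is learned from data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hand-set scalar state-space model (illustrative, not learned).
A, B, C = 0.9, 0.5, 1.0   # state transition, speech input gain, readout
q, r = 0.01, 0.1          # process / observation noise variances

speech = (np.sin(0.2 * np.arange(200)) > 0).astype(float)  # toy envelope

x, P = 0.0, 1.0
opened = []
for u in speech:
    # predict: speech drives the latent facial state
    x, P = A * x + B * u, A * A * P + q
    # simulated noisy mouth-opening measurement from the same model
    z = C * x + rng.normal(0, r ** 0.5)
    # Kalman update
    K = P * C / (C * P * C + r)
    x, P = x + K * (z - C * x), (1 - K * C) * P
    opened.append(x)

opened = np.array(opened)
on = float(opened[speech == 1].mean())
off = float(opened[speech == 0].mean())
print(on > off)  # filtered mouth opening follows the speech envelope
```

The smoothing behavior the abstract describes falls out of the filter: the state cannot jump arbitrarily between frames, so facial movements track the speech input without jitter.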
Sensors are not only shrinking to a size where they can be attached to almost anything; new methods allow sensors to be made so small that they can be used to saturate entire environments. The vision of intelligent fabrics, for instance, embeds sensors and actuators in the very fibres of a textile, enabling unobtrusive and heavily distributed sensing.
While engineers are currently working on issues such as how all of this information could be communicated to a processing unit, this might be a good time to contemplate what such systems might produce in terms of data, and how that could affect sensor fusion algorithms. So far, few have regarded high input dimensionality as a good thing for learning algorithms, since the Curse of Dimensionality usually leads to poor accuracies and long execution times.
This talk offers an alternative vision, with scenarios where having a massive number of sensors could be a good thing, exploiting their distribution while not expecting high accuracy from any single sensor.
Fusion of image and sound streams for automatic audio-visual speech recognition offers challenges to multi-stream integration models. It is well known that visemes and phonemes can evolve asynchronously in time, especially in continuous, co-articulated speech. Recently, some multi-stream models allow the viseme and phoneme streams to evolve independently in order to account for the observed asynchrony, but these approaches fail to capture the fact that certain components of the underlying production process are coupled. In this talk, we explore an alternative A/V fusion model inspired by the human speech production system. Rather than modeling two underlying processes, one of which produces visemes and the other phonemes, we model the underlying physical articulatory processes directly in a multi-stream Dynamic Bayesian Network. We use N state streams, each of which corresponds to a particular "articulatory feature" (AF), such as the opening of the lips, or the constriction between the tongue and teeth. While some articulatory features are recognized using both audio and visual observations, others, such as the vibration of the vocal cords, are recognized using only the auditory modality, as they produce no visual effects. The modalities are fused at the articulatory feature level, and the A/V asynchrony is handled by allowing the AF streams to de-synchronize from each other. We will present initial results of evaluating the model on a large audio-visual speech corpus consisting of TIMIT sentences.
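The de-synchronization constraint can be sketched with a toy joint Viterbi pass over two feature streams: each stream moves left-to-right through its own states, the joint state is the pair of stream indices, and transitions are pruned whenever the streams drift more than one state apart. The emission scores below are random placeholders, not real articulatory-feature likelihoods:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two AF streams (e.g., lip opening and voicing), each with S states,
# observed for T frames; emis[t, i, j] is an arbitrary per-frame score
# (a stand-in for summed per-stream log-likelihoods).
S, T = 3, 6
emis = rng.random((T, S, S))

INF = -1e9
score = np.full((S, S), INF)
score[0, 0] = emis[0, 0, 0]          # both streams start in state 0
for t in range(1, T):
    new = np.full((S, S), INF)
    for i in range(S):
        for j in range(S):
            if abs(i - j) > 1:       # asynchrony bound between the streams
                continue
            # each stream either stays or advances one state (left-to-right)
            cands = [score[pi, pj]
                     for pi in (i - 1, i) for pj in (j - 1, j)
                     if pi >= 0 and pj >= 0 and abs(pi - pj) <= 1]
            new[i, j] = max(cands) + emis[t, i, j]
    score = new

print(score[S - 1, S - 1] > INF / 2)  # a path reaching the joint end exists
```

The real model fuses more than two streams inside a Dynamic Bayesian Network and ties some streams to audio-only observations, but the mechanism shown, a bounded lag between otherwise independent state chains, is the same one that absorbs viseme/phoneme asynchrony.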
In recent years people have started to explore sensor modalities different from audio and video for recognizing and modeling human behavior and activity. Often the choice of sensors and the features derived from them are driven by human intuition and what is easily available, rather than performance and/or practicality. Selecting the most useful features and computing them on an appropriate time-scale (relative to the duration of the activity) is crucial for recognition. We are working towards developing a framework that allows us to systematically identify modalities and features that are well suited for machine recognition and discrimination of natural human activities. We will present our work on multi-modal activity recognition in a real world setting. We have built a wearable sensing platform that was used to collect thirty-five hours of annotated natural activity data from eight different sensor modalities - accelerometer, audio, IR/visible light, high-frequency light, barometric pressure, humidity, temperature, and compass. Activities were collected from two subjects and included walking/jogging, driving a car, working, riding an elevator, riding a bus, biking, etc. We calculate about 400 different features from these eight modalities. To obtain a subset of reliable features and identify the most discriminatory modalities for classification, we apply confidence-rated boosting, using simple regression stumps as our weak static classifiers. We are currently extending the feature selection method to include time-series classifiers. We will present results on the stability of the features chosen and how they affect the recognition accuracy of our system.
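The feature-selection idea can be sketched with a generic AdaBoost loop over decision stumps (a simplified stand-in for the confidence-rated regression stumps of the talk). The data here are synthetic, with only one informative feature out of twenty, so the features the loop picks reveal the discriminative one:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in for the ~400 sensor features: 20 features,
# only feature 3 carries information about the activity label.
X = rng.normal(size=(200, 20))
y = np.sign(X[:, 3] + 0.1 * rng.normal(size=200))

# Boosting with decision stumps: each round picks the (feature, threshold,
# polarity) minimizing weighted error, so the sequence of chosen features
# doubles as a feature-selection / modality ranking.
w = np.ones(200) / 200
chosen = []
for _ in range(5):
    best = None
    for f in range(20):
        for thr in np.quantile(X[:, f], [0.25, 0.5, 0.75]):
            for sgn in (1, -1):
                pred = sgn * np.sign(X[:, f] - thr)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, f, thr, sgn)
    err, f, thr, sgn = best
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
    pred = sgn * np.sign(X[:, f] - thr)
    w = w * np.exp(-alpha * y * pred)   # upweight misclassified samples
    w = w / w.sum()
    chosen.append(f)

print(chosen)  # the informative feature dominates the selection
```

With real sensor data the same loop, run over features from all eight modalities, surfaces which modalities contribute discriminative features and which can be dropped.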
Localization and tracking of speakers in multi-party conversations are necessary tasks for automatic meeting analysis. In this talk, we will describe a probabilistic approach to tracking multiple speakers in meetings, which integrates audio-visual data generated by multiple cameras and microphones via particle filters. Our approach relies on a number of principles. First, a joint state-space formulation, where the configurations of each person are modeled concurrently, constitutes a formal choice that allows us to represent interactions. We build on recent work that combines Markov Chain Monte Carlo and particle filtering to handle the dimensionality of the joint state-space and perform inference in the model. Second, observation models are derived from the audio and visual modalities and readily integrated into the particle filter. Audio observations are derived from a speaker localization algorithm. Visual observations are based on models of the shape and spatial structure of human heads. A procedure to relate measurements from each modality is also proposed. Using data from a room set up to record small-group discussions, we show that our algorithm is capable of tracking participants and their speaking turns with good accuracy.
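A minimal single-target sketch shows how per-modality likelihoods are fused inside a particle filter. The Gaussian observation models and noise levels below are invented placeholders for the audio-localization and head-shape likelihoods, and the full method additionally handles multiple targets with MCMC moves:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated 1-D speaker position with two noisy observation streams.
T, N = 50, 500
truth = np.cumsum(rng.normal(0, 0.1, T))        # speaker trajectory
z_audio = truth + rng.normal(0, 0.5, T)          # coarse audio localization
z_video = truth + rng.normal(0, 0.2, T)          # sharper visual cue

parts = rng.normal(0, 1, N)
est = []
for t in range(T):
    parts = parts + rng.normal(0, 0.1, N)        # dynamics (random walk)
    # fusion: multiply the per-modality likelihoods of each particle
    w = (np.exp(-0.5 * ((z_audio[t] - parts) / 0.5) ** 2)
         * np.exp(-0.5 * ((z_video[t] - parts) / 0.2) ** 2))
    w = w / w.sum()
    est.append(float(np.sum(w * parts)))         # posterior-mean estimate
    parts = parts[rng.choice(N, N, p=w)]         # resample

rmse = float(np.sqrt(np.mean((np.array(est) - truth) ** 2)))
print(round(rmse, 3))  # below either modality's individual noise level
```

Because the two likelihoods multiply, the fused posterior is sharper than either modality alone, which is the basic payoff of integrating audio and video inside the filter rather than post hoc.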
We address the problem of modeling multimodal meeting actions using a multi-layer HMM framework. Meetings are structured as sequences of meeting actions, which are inherently group-based (defined by the individual actions of meeting participants, and their interactions), and multimodal (as captured by cameras and microphones). In the proposed multi-layer HMM framework, the first layer models typical actions of individuals in meetings using supervised HMM learning and low-level audio-visual features. The second layer models group actions using both supervised and unsupervised HMM learning. The two layers are linked by a set of probability-based features. We illustrate the layered HMM framework with a set of eight group actions using a public five-hour meeting corpus. Experiments show that the use of multiple modalities and the layered framework are advantageous, compared with a single-layer HMM based system.
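The two-layer linkage can be sketched as follows: tiny first-layer HMMs (with made-up parameters and hypothetical action labels) score each participant's observations for individual actions, the resulting posteriors serve as the probability-based features, and a trivial second-layer rule maps them to a group action. In the actual framework, the second layer is itself an HMM over group-action sequences:

```python
import numpy as np

rng = np.random.default_rng(6)

def forward_loglik(obs, means, trans):
    """Log-likelihood of a 1-D sequence under a tiny Gaussian-emission
    HMM (unit variance), computed with the forward algorithm."""
    S = len(means)
    alpha = np.full(S, np.log(1.0 / S)) - 0.5 * (obs[0] - means) ** 2
    for o in obs[1:]:
        alpha = (np.logaddexp.reduce(alpha[:, None] + np.log(trans), axis=0)
                 - 0.5 * (o - means) ** 2)
    return np.logaddexp.reduce(alpha)

# First layer: hypothetical individual-action HMMs ("speaking", "writing").
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
models = {"speaking": np.array([2.0, 3.0]), "writing": np.array([-2.0, -3.0])}

def action_posteriors(obs):
    ll = np.array([forward_loglik(obs, m, trans) for m in models.values()])
    p = np.exp(ll - ll.max())
    return p / p.sum()

# Second layer: per-person posteriors are the probability-based features;
# a trivial rule stands in for the group-action HMM.
people = [rng.normal(2.5, 1, 20), rng.normal(2.5, 1, 20),
          rng.normal(-2.5, 1, 20)]
feats = np.stack([action_posteriors(o) for o in people])
group = "discussion" if feats[:, 0].mean() > 0.5 else "note-taking"
print(group)  # -> discussion (two of three participants are speaking)
```

The key structural point survives the simplification: the second layer never sees raw audio-visual features, only the first layer's probabilistic summaries, which is what decouples individual-action modeling from group-action modeling.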