We present a bimodal information analysis system for
automatic emotion recognition. Our approach analyzes
video sequences, combining visually observed facial
expressions with acoustic features to recognize five
universal emotion classes: Anger, Disgust, Happiness,
Sadness, and Surprise. We
address the challenges posed by the temporal analysis
of bimodal data and introduce a novel technique that
combines the best features of instantaneous and temporal
visual recognition systems. We obtain robust
appearance-based visual features, which we classify
instantaneously and then aggregate temporally to improve
recognition rates over single-frame instantaneous
classification. The performance of the
system is further boosted by using the complementary
audio information for the bimodal emotion recognition.
We combine the two modalities at both the feature level
and the score level and compare the resulting joint emotion
recognition rates. The emotions are instantaneously classified using a
Support Vector Machine and sequentially aggregated
based on their classification probabilities. This approach
is validated on a posed audio-visual database and a
natural interactive database. Experiments on these
databases yield encouraging results, with a best
combined recognition rate of 82%.
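
To illustrate the sequential aggregation and score-level combination described above, the following is a minimal sketch and not the implementation used in this work. It assumes that each frame has already been assigned class probabilities by a classifier, that aggregation is a simple average over frames, and that the two modalities are fused by a weighted sum. The function names, the averaging rule, and the weight `visual_weight` are all hypothetical choices made for illustration.

```python
from typing import List, Sequence

EMOTIONS = ["Anger", "Disgust", "Happiness", "Sadness", "Surprise"]


def aggregate_frames(frame_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame class probabilities over a sequence.

    Assumption: plain averaging; the aggregation rule is not specified here.
    """
    if not frame_probs:
        raise ValueError("need at least one frame")
    n_classes = len(frame_probs[0])
    totals = [0.0] * n_classes
    for probs in frame_probs:
        if len(probs) != n_classes:
            raise ValueError("all frames must have the same number of classes")
        for i, p in enumerate(probs):
            totals[i] += p
    return [t / len(frame_probs) for t in totals]


def fuse_scores(visual: Sequence[float], audio: Sequence[float],
                visual_weight: float = 0.5) -> List[float]:
    """Score-level fusion as a weighted sum (hypothetical weighting)."""
    if len(visual) != len(audio):
        raise ValueError("modalities must score the same classes")
    if not 0.0 <= visual_weight <= 1.0:
        raise ValueError("visual_weight must lie in [0, 1]")
    return [visual_weight * v + (1.0 - visual_weight) * a
            for v, a in zip(visual, audio)]


def predict(probs: Sequence[float]) -> str:
    """Return the emotion label with the highest score."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return EMOTIONS[best]


# Made-up per-frame visual probabilities for a three-frame sequence.
frames = [
    [0.5, 0.1, 0.2, 0.1, 0.1],
    [0.2, 0.1, 0.5, 0.1, 0.1],
    [0.6, 0.1, 0.1, 0.1, 0.1],
]
# Made-up audio probabilities for the same sequence.
audio = [0.1, 0.1, 0.6, 0.1, 0.1]

visual = aggregate_frames(frames)
visual_label = predict(visual)
fused_label = predict(fuse_scores(visual, audio))
```

In this toy input the second frame alone would be labeled Happiness, while the averaged sequence favors Anger, which shows how aggregation can smooth over a single misleading frame. Adding the audio scores then shifts the fused decision, which is the kind of complementary effect that score-level fusion is meant to capture.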