We present a bimodal information analysis system for automatic emotion recognition. Our approach analyzes video sequences, combining visually observed facial expressions with acoustic features to automatically recognize five universal emotion classes: Anger, Disgust, Happiness, Sadness and Surprise. We address the challenges that arise in the temporal analysis of bimodal data and introduce a novel technique that combines the strengths of instantaneous and temporal visual recognition systems. We extract robust appearance-based visual features, classify them frame by frame, and aggregate the per-frame results over time, which improves recognition rates compared with single-frame instantaneous classification. Complementary audio information further boosts the performance of the system. We combine the two modalities at both the feature level and the score level and compare the resulting joint emotion recognition rates. Emotions are classified instantaneously using a Support Vector Machine and aggregated sequentially based on their classification probabilities. The approach is validated on a posed audio-visual database and a natural interactive database. Experiments on these databases give encouraging results, with a best combined recognition rate of 82%.
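
The probability-based temporal aggregation and score-level fusion described above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the aggregation or fusion rule, so the sketch assumes a product rule over per-frame class probabilities (summed log-probabilities, renormalized) and a weighted sum across modalities. The function names, the `visual_weight` parameter, and all probability values are hypothetical. In practice the per-frame probabilities would come from a probabilistic classifier such as an SVM with calibrated outputs.

```python
import math

EMOTIONS = ["Anger", "Disgust", "Happiness", "Sadness", "Surprise"]


def aggregate_frames(frame_probs):
    """Combine per-frame class probabilities into one sequence-level distribution.

    Assumption: frames are combined by summing log-probabilities (a product
    rule) and renormalizing; the source does not state which rule it uses.
    """
    n_classes = len(frame_probs[0])
    eps = 1e-12  # guards against log(0)
    log_scores = [
        sum(math.log(max(frame[c], eps)) for frame in frame_probs)
        for c in range(n_classes)
    ]
    # Softmax over the summed log-probabilities, shifted for numerical stability.
    peak = max(log_scores)
    weights = [math.exp(s - peak) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


def fuse_scores(visual_probs, audio_probs, visual_weight=0.5):
    """Score-level fusion (assumed): weighted sum of the two modalities' probabilities."""
    fused = [
        visual_weight * v + (1.0 - visual_weight) * a
        for v, a in zip(visual_probs, audio_probs)
    ]
    total = sum(fused)
    return [f / total for f in fused]


def predict(probs):
    """Return the emotion label with the highest probability."""
    return EMOTIONS[max(range(len(probs)), key=probs.__getitem__)]


# Hypothetical per-frame visual probabilities for a three-frame sequence.
visual_frames = [
    [0.10, 0.10, 0.60, 0.10, 0.10],
    [0.30, 0.10, 0.40, 0.10, 0.10],
    [0.15, 0.05, 0.50, 0.10, 0.20],
]
# Hypothetical utterance-level audio probabilities.
audio = [0.20, 0.10, 0.30, 0.10, 0.30]

visual_sequence = aggregate_frames(visual_frames)
fused = fuse_scores(visual_sequence, audio, visual_weight=0.6)
print(predict(fused))
```

Feature-level fusion would instead concatenate the visual and acoustic feature vectors before a single classifier is trained, so it has no counterpart in this score-level sketch.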