Abstract: Emotion recognition from multimodal data presents challenges such as variations in speech tone, facial expressions, and real-world noise. Most recent systems rely on transformer ...