See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
Paper • 2512.02231 • Published • 7
Contamination Detection for VLMs using Multi-Modal Semantic Perturbation
UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios