---
datasets:
- SurplusDeficit/MultiHop-EgoQA
---
# Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos
## GeLM Model
We propose a novel architecture, termed <b><u>GeLM</u></b>, for *MH-VidQA*. It leverages the world-knowledge reasoning capabilities of multi-modal large language models (LLMs) while incorporating a grounding module that retrieves temporal evidence from the video through flexible grounding tokens.
<div align="center">
<img src="./assets/architecture_v3.jpeg" style="width: 80%;">
</div>
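
To make the grounding-token idea concrete, below is a minimal, hypothetical sketch (not the released GeLM implementation) of how a grounding head might map the LLM hidden states at special grounding-token positions to normalized (start, end) timestamps. The class name `GroundingHead`, the hidden dimension, and the (center, width) parameterization are all assumptions for illustration only.

```python
import torch
import torch.nn as nn


class GroundingHead(nn.Module):
    """Hypothetical sketch: regress a temporal segment from each grounding token."""

    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        # Small MLP mapping one token embedding to (center, width) in [0, 1].
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, 2),
        )

    def forward(self, hidden_states: torch.Tensor, ground_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the multi-modal LLM.
        # ground_mask:   (batch, seq_len) boolean mask of grounding-token positions.
        token_feats = hidden_states[ground_mask]              # (num_ground_tokens, hidden_dim)
        center, width = self.mlp(token_feats).sigmoid().unbind(-1)
        start = (center - width / 2).clamp(0.0, 1.0)
        end = (center + width / 2).clamp(0.0, 1.0)
        return torch.stack([start, end], dim=-1)              # normalized video timestamps


if __name__ == "__main__":
    head = GroundingHead(hidden_dim=4096)
    hidden = torch.randn(1, 16, 4096)                         # dummy LLM outputs
    mask = torch.zeros(1, 16, dtype=torch.bool)
    mask[0, [5, 11]] = True                                   # two grounding tokens in the answer
    print(head(hidden, mask))                                 # two (start, end) segments
```

Under this sketch, each grounding token emitted in the answer yields one temporal segment, so a multi-hop answer can point to several pieces of evidence in the long-form video.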