CodeClash: Benchmarking Goal-Oriented Software Engineering Paper • 2511.00839 • Published Nov 2 • 9
VideoGameBench: Can Vision-Language Models complete popular video games? Paper • 2505.18134 • Published May 23 • 6 • 3
SciCode: A Research Coding Benchmark Curated by Scientists Paper • 2407.13168 • Published Jul 18, 2024 • 16