Abstract
Agentic Rubrics enable efficient and scalable verification for software engineering agents: an expert agent builds context-grounded checklists whose scores outperform traditional verification methods while remaining interpretable.
Verification is critical for improving agents: it provides the reward signal for Reinforcement Learning and enables inference-time gains through Test-Time Scaling (TTS). Despite its importance, verification in software engineering (SWE) agent settings often relies on code execution, which can be difficult to scale due to environment setup overhead. Scalable alternatives such as patch classifiers and heuristic methods exist, but they are less grounded in codebase context and harder to interpret. To this end, we explore Agentic Rubrics: an expert agent interacts with the repository to create a context-grounded rubric checklist, and candidate patches are then scored against it without requiring test execution. On SWE-Bench Verified under parallel TTS evaluation, Agentic Rubrics achieve a score of 54.2% on Qwen3-Coder-30B-A3B and 40.6% on Qwen3-32B, with at least a +3.5 percentage-point gain over the strongest baseline in our comparison set. We further analyze rubric behavior, showing that rubric scores are consistent with ground-truth tests while also flagging issues that tests do not capture. Our ablations show that agentic context gathering is essential for producing codebase-specific, unambiguous criteria. Together, these results suggest that Agentic Rubrics provide an efficient, scalable, and granular verification signal for SWE agents.
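To make the execution-free verification step concrete, here is a minimal sketch in Python, assuming the context-grounded rubric has already been produced by the exploration agent. The `RubricItem` structure, the YES/NO judging prompt, and the generic `llm` callable are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for any LLM call: takes a prompt string, returns raw model text.
LLM = Callable[[str], str]

@dataclass
class RubricItem:
    criterion: str        # e.g. "The fix handles the empty-input edge case described in the issue"
    weight: float = 1.0

def score_patch(llm: LLM, rubric: List[RubricItem], issue: str, patch: str) -> float:
    """Score one candidate patch against the rubric without running any tests."""
    earned, possible = 0.0, 0.0
    for item in rubric:
        prompt = (
            "You are reviewing a candidate patch for the issue below.\n\n"
            f"Issue:\n{issue}\n\nPatch (unified diff):\n{patch}\n\n"
            f"Criterion: {item.criterion}\n"
            "Answer strictly YES or NO: does the patch satisfy this criterion?"
        )
        verdict = llm(prompt).strip().upper()
        earned += item.weight * (1.0 if verdict.startswith("YES") else 0.0)
        possible += item.weight
    return earned / possible if possible else 0.0

def select_best(llm: LLM, rubric: List[RubricItem], issue: str, candidates: List[str]) -> str:
    """Parallel test-time scaling: keep the candidate with the highest rubric score."""
    return max(candidates, key=lambda p: score_patch(llm, rubric, issue, p))
```

Because scoring only requires model calls over the diff and the rubric, candidates can be verified in parallel with no repository environment setup.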
Community
Agentic Rubrics for verifying SWE agent patches WITHOUT running tests!
An agent explores the codebase to generate context-grounded checklists, then scores patches execution-free.
Rubrics provide dense, interpretable reward signals that could scale RL training for coding agents.
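As a rough illustration of why rubric scores are a denser training signal than binary test outcomes (the criterion names below are made up for the example, not taken from the paper):

```python
from typing import Dict

def rubric_reward(criterion_results: Dict[str, bool]) -> float:
    """Dense reward: fraction of rubric criteria the patch satisfies (partial credit)."""
    return sum(criterion_results.values()) / len(criterion_results) if criterion_results else 0.0

def test_reward(all_tests_pass: bool) -> float:
    """Sparse execution-based reward: all-or-nothing."""
    return 1.0 if all_tests_pass else 0.0

# A partially correct patch earns graded credit under the rubric,
# but nothing under the binary test signal.
results = {
    "fixes the reported crash": True,
    "preserves existing behavior": True,
    "adds the documented flag": False,
}
print(rubric_reward(results))  # ~0.67
print(test_reward(False))      # 0.0
```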
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- SWE-RM: Execution-free Feedback For Software Engineering Agents (2025)
- SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving (2026)
- SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios (2025)
- AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following (2025)
- ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment (2025)
- AInsteinBench: Benchmarking Coding Agents on Scientific Repositories (2025)
- Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases (2025)