CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies
Paper • 2606.16613 • Published • 9
Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests
How Can I Publish My LLM Benchmark Without Giving the True Answers Away?