ArxivRollBench: Is Your Model Cheating on the Test?

1. Background: The Trust Crisis of Benchmarks

There are more and more LLM benchmarks, and scores keep going up. But here is the fundamental problem: how much of that score reflects real capability, and how much is because the model "saw" the test questions during training?

This is benchmark contamination. Existing benchmarks are static—the questions are fixed. Given enough time, models will memorize them.

2. Our Solution: ArxivRollBench

Drawing inspiration from the One-Time-Pad in cryptography, we designed a dynamic benchmark.

ArxivRollBench automatically extracts content from newly published arXiv papers every day and generates fresh test cases. Since the papers were published that day, the model cannot possibly have seen them during training—so test results are contamination-free.

We also proposed a quantitative "cheating ratio" evaluation framework—you can precisely measure the performance gap between "seen" and "unseen" test conditions for any model.

3. Leaderboard

The leaderboard is at: https://arxivroll.moreoverai.com. I update it every six months. Already we see some interesting patterns—certain models that score highly on static benchmarks drop significantly on ArxivRollBench.

4. Paper Info

screenshot_20250927_202632.png


Author: Zi Liang (liangzi20163933@qq.com) Create Date: 2026-05-27 Last modified: 2026-05-27 Wed 21:41 Creator: Emacs 30.2 (Org mode 9.7.11)