Eval-Arena: noise and errors measured by leaderboard showdowns

Doc/Code

We ran pairwise match-ups across thousands of model pairs on various LLM benchmarks. Below is a summary of the main results, with links to details aggregated by model or by example.
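
As a concrete illustration, a single match-up between two models can be scored from their per-example correctness vectors. This is a minimal sketch under assumed inputs: the `matchup` helper is hypothetical, and the choice of a two-sided sign test on discordant examples is one reasonable paired test, not necessarily the exact test used here.

```python
import numpy as np
from scipy.stats import binomtest

def matchup(correct_a: np.ndarray, correct_b: np.ndarray) -> tuple[float, float]:
    """Return (accuracy difference, p-value) for model A vs. model B.

    correct_a, correct_b: boolean per-example correctness on the same benchmark.
    """
    assert correct_a.shape == correct_b.shape
    diff = correct_a.mean() - correct_b.mean()
    # Only examples where exactly one model succeeds carry signal for a paired test.
    a_only = int(np.sum(correct_a & ~correct_b))
    b_only = int(np.sum(~correct_a & correct_b))
    n_discordant = a_only + b_only
    if n_discordant == 0:
        return diff, 1.0
    # Two-sided sign test: under the null, A and B win discordant examples equally often.
    p_value = binomtest(a_only, n_discordant, p=0.5).pvalue
    return diff, p_value

# p5_min / p5_max summarize these results over all model pairs on a benchmark:
# p5_min is the smallest |diff| with p < 0.05, p5_max the largest with p > 0.05.
```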

  • size: number of examples in the benchmark
  • p5_min: the smallest accuracy difference between two models that is still statistically significant (p-value < 0.05)
  • p5_max: the largest accuracy difference that is not statistically significant (p-value > 0.05)
  • no_solve: the percentage of examples not solved by any model
  • tau-: the percentage of examples whose outcomes are negatively correlated with overall model quality, as measured by Kendall's tau (see the sketch after this list)
  • sig_noise: the median signal-to-noise ratio for a doubling of model size
  • details: links to per-benchmark details, aggregated by model or by example, together with a plot of all models and examples sorted by difficulty
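
The per-example columns (no_solve and tau-) can be derived from a model-by-example correctness matrix. A hedged sketch follows, assuming a boolean `results` matrix of shape (n_models, n_examples); the function name is hypothetical.

```python
import numpy as np
from scipy.stats import kendalltau

def example_stats(results: np.ndarray) -> tuple[float, float]:
    """Return (no_solve, tau-) as fractions, given boolean results[model, example]."""
    n_models, n_examples = results.shape
    model_acc = results.mean(axis=1)              # overall quality per model
    no_solve = float(np.mean(results.sum(axis=0) == 0))
    neg = 0
    for j in range(n_examples):
        col = results[:, j]
        if col.min() == col.max():                # constant column: tau undefined
            continue
        # Correlate this example's outcomes with overall model accuracy.
        tau, _ = kendalltau(col, model_acc)
        if tau < 0:
            neg += 1
    return no_solve, neg / n_examples
```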

benchmark_id        size  p5_min  p5_max  no_solve   tau-  sig_noise  details
CRUXEval-input       800    3.0%    3.9%     2.1%   10.2%       1.93  by models | examples | data
CRUXEval-output      800    2.6%    3.1%     3.0%    4.4%       3.21  by models | examples | data
DS1000              1000    1.5%    3.4%    11.0%    1.2%       3.00  by models | examples | data
agi_english         2546    2.0%    2.3%     1.0%   20.5%       5.17  by models | examples | data
arc_challenge       1165    2.3%    2.6%    14.2%   11.2%       2.77  by models | examples | data
gsm8k               1319    1.5%    2.6%     2.1%    0.5%       7.09  by models | examples | data
hellaswag          10042    0.4%    0.5%     6.1%    1.6%       6.41  by models | examples | data
humaneval            164    4.9%    9.8%     3.7%    1.2%       1.12  by models | examples | data
humaneval+           164    6.7%    9.8%     4.3%    1.8%       0.50  by models | examples | data
lcb_codegen          400    2.5%    4.8%    24.2%    1.5%       1.67  by models | examples | data
mbpp                 378    3.7%    5.8%     2.4%    4.0%       1.98  by models | examples | data
mbpp+                378    4.2%    5.6%     9.5%    5.8%       1.66  by models | examples | data
mmlu               14042    0.9%    0.8%     0.6%   12.4%      13.24  by models | examples | data
nq                  3610    1.1%    1.3%    31.5%    5.5%       5.98  by models | examples | data
piqa                1838    1.3%    1.5%     5.3%    8.4%       1.76  by models | examples | data
safim              18340    0.6%    0.8%    15.6%    7.4%       8.47  by models | examples | data
siqa                1954    1.5%    2.1%    14.5%   19.0%       0.93  by models | examples | data
swebench-lite        300    2.3%    5.7%    29.3%    1.7%        NaN  by models | examples | data
swebench-test       2294    0.5%    1.6%    61.5%    1.4%        NaN  by models | examples | data
swebench-verified    500    2.4%    4.0%    22.0%    1.8%        NaN  by models | examples | data
tqa                11313    0.5%    0.7%    11.8%    2.9%      12.59  by models | examples | data

Code datasets: