We ran pairwise match-ups across thousands of model pairs on a range of LLM benchmarks. The table below summarizes the main results, with links to details aggregated by model or by example.

benchmark_id | size | p5_min | p5_max | no_solve | tau- | sig_noise | details (by models) | details (examples) | details (data) |
---|---|---|---|---|---|---|---|---|---|
CRUXEval-input | 800 | 3.0% | 3.9% | 2.1% | 10.2% | 1.93 | by models | examples | data |
CRUXEval-output | 800 | 2.6% | 3.1% | 3.0% | 4.4% | 3.21 | by models | examples | data |
DS1000 | 1000 | 1.5% | 3.4% | 11.0% | 1.2% | 3.00 | by models | examples | data |
agi_english | 2546 | 2.0% | 2.3% | 1.0% | 20.5% | 5.17 | by models | examples | data |
arc_challenge | 1165 | 2.3% | 2.6% | 14.2% | 11.2% | 2.77 | by models | examples | data |
gsm8k | 1319 | 1.5% | 2.6% | 2.1% | 0.5% | 7.09 | by models | examples | data |
hellaswag | 10042 | 0.4% | 0.5% | 6.1% | 1.6% | 6.41 | by models | examples | data |
humaneval | 164 | 4.9% | 9.8% | 3.7% | 1.2% | 1.12 | by models | examples | data |
humaneval+ | 164 | 6.7% | 9.8% | 4.3% | 1.8% | 0.50 | by models | examples | data |
lcb_codegen | 400 | 2.5% | 4.8% | 24.2% | 1.5% | 1.67 | by models | examples | data |
mbpp | 378 | 3.7% | 5.8% | 2.4% | 4.0% | 1.98 | by models | examples | data |
mbpp+ | 378 | 4.2% | 5.6% | 9.5% | 5.8% | 1.66 | by models | examples | data |
mmlu | 14042 | 0.9% | 0.8% | 0.6% | 12.4% | 13.24 | by models | examples | data |
nq | 3610 | 1.1% | 1.3% | 31.5% | 5.5% | 5.98 | by models | examples | data |
piqa | 1838 | 1.3% | 1.5% | 5.3% | 8.4% | 1.76 | by models | examples | data |
safim | 18340 | 0.6% | 0.8% | 15.6% | 7.4% | 8.47 | by models | examples | data |
siqa | 1954 | 1.5% | 2.1% | 14.5% | 19.0% | 0.93 | by models | examples | data |
swebench-lite | 300 | 2.3% | 5.7% | 29.3% | 1.7% | NaN | by models | examples | data |
swebench-test | 2294 | 0.5% | 1.6% | 61.5% | 1.4% | NaN | by models | examples | data |
swebench-verified | 500 | 2.4% | 4.0% | 22.0% | 1.8% | NaN | by models | examples | data |
tqa | 11313 | 0.5% | 0.7% | 11.8% | 2.9% | 12.59 | by models | examples | data |
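
For readers unfamiliar with the idea, a pairwise match-up between two models evaluated on the same examples can be sketched roughly as follows. This is a minimal illustration, not the code behind the table: the function names `sign_test_p_value` and `pairwise_matchup` are hypothetical, the exact sign test is one common choice for paired comparisons and is an assumption here, and none of the table's columns (`p5_min`, `tau-`, `sig_noise`, and so on) are computed by this sketch.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test on the examples where the models disagree.

    Under the null hypothesis, each disagreement is equally likely to
    favor either model, so the smaller win count follows Binomial(n, 0.5).
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no disagreements, so no evidence either way
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def pairwise_matchup(correct_a, correct_b):
    """Compare two models from per-example correctness flags (hypothetical helper).

    correct_a[i] and correct_b[i] must refer to the same benchmark example.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same examples")
    wins_a = sum(bool(a) and not bool(b) for a, b in zip(correct_a, correct_b))
    wins_b = sum(bool(b) and not bool(a) for a, b in zip(correct_a, correct_b))
    return {
        "wins_a": wins_a,  # examples only model A solved
        "wins_b": wins_b,  # examples only model B solved
        "ties": len(correct_a) - wins_a - wins_b,  # both right or both wrong
        "p_value": sign_test_p_value(wins_a, wins_b),
    }


if __name__ == "__main__":
    # Toy data: 1 means the model solved the example, 0 means it did not.
    model_a = [1, 1, 1, 0, 1, 0, 1, 1]
    model_b = [0, 1, 0, 0, 1, 0, 0, 1]
    print(pairwise_matchup(model_a, model_b))
```

Only the examples where the two models disagree carry information in this kind of test, which is one reason small benchmarks can struggle to separate models that are close in accuracy.
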
Code datasets: