Eval-Arena: Noise and errors of LLM evals

We ran pairwise match-ups for thousands of model pairs across a range of LLM benchmarks. The table below summarizes the main results, with links to detail pages that aggregate by model or by example.
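
For intuition about what a single match-up involves, here is a minimal sketch: given per-example correctness for two models on the same benchmark, count the examples where they disagree and run a sign (binomial) test on those discordant examples. The correctness arrays, function name, and choice of test are illustrative assumptions, not necessarily Eval-Arena's exact procedure.

```python
# Minimal sketch of a pairwise match-up between two models on one benchmark.
# Assumes per-example correctness (0/1) is already available for each model;
# the sign test is an illustrative choice, not necessarily Eval-Arena's exact method.
import numpy as np
from scipy.stats import binomtest

def pairwise_matchup(correct_a: np.ndarray, correct_b: np.ndarray) -> dict:
    """Compare two models from 0/1 per-example correctness vectors."""
    wins = int(np.sum((correct_a == 1) & (correct_b == 0)))    # A right, B wrong
    losses = int(np.sum((correct_a == 0) & (correct_b == 1)))  # B right, A wrong
    ties = len(correct_a) - wins - losses
    # Sign test on discordant examples: under the null, A and B are equally
    # likely to win each example on which they disagree.
    n_discordant = wins + losses
    p_value = binomtest(wins, n_discordant, 0.5).pvalue if n_discordant else 1.0
    return {
        "acc_diff": (wins - losses) / len(correct_a),  # accuracy(A) - accuracy(B)
        "wins": wins, "losses": losses, "ties": ties,
        "p_value": p_value,
    }

# Example with synthetic outcomes standing in for real eval results.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=800)
b = rng.integers(0, 2, size=800)
print(pairwise_matchup(a, b))
```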

Raw data: summary.csv
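
For programmatic use, the summary can be loaded directly; a small sketch is below, assuming the CSV columns carry the same names as the table that follows.

```python
# Sketch: load summary.csv and rank benchmarks by the sig_noise column.
# Assumes the CSV column names match the table shown below.
import pandas as pd

summary = pd.read_csv("summary.csv")
cols = ["benchmark_id", "size", "models", "std(A-B)", "sig_noise"]
print(summary[cols].sort_values("sig_noise", ascending=False).to_string(index=False))
```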

| benchmark_id | size | models | std(A) | std(E(A)) | std(A-B) | corr(A,B) | no_solve | tau- | sig_noise | details |
|---|---|---|---|---|---|---|---|---|---|---|
| CRUXEval-input-T0.2 | 800 | 37 | 1.7 | 1.5 | 1.8 | 57 | 2.4 | 9.2 | 1.7 | models \| examples \| data \| raw |
| CRUXEval-input-T0.8 | 800 | 31 | 1.6 | 1.2 | 1.8 | 70 | 2.1 | 11 | 2.1 | models \| examples \| data \| raw |
| CRUXEval-output-T0.2 | 800 | 37 | 1.7 | 1.6 | 1.6 | 66 | 3 | 4.5 | 2.7 | models \| examples \| data \| raw |
| CRUXEval-output-T0.8 | 800 | 31 | 1.7 | 1.4 | 1.6 | 76 | 4.4 | 11 | 2.6 | models \| examples \| data \| raw |
| DS1000 | 1000 | 105 | 1.3 | 1.3 | 1.4 | 44 | 11 | 1.2 | 3 | models \| examples \| data \| raw |
| agi_english | 2546 | 35 | 0.93 | 0.93 | 1.1 | 26 | 0.98 | 21 | 5.2 | models \| examples \| data \| raw |
| arc_challenge | 1165 | 38 | 1.4 | 1.4 | 1.3 | 62 | 14 | 11 | 2.8 | models \| examples \| data \| raw |
| gsm8k | 1319 | 37 | 1.1 | 1.1 | 1.2 | 32 | 2.1 | 0.53 | 7.1 | models \| examples \| data \| raw |
| hellaswag | 10042 | 36 | 0.41 | 0.41 | 0.26 | 78 | 6.1 | 1.6 | 6.4 | models \| examples \| data \| raw |
| humaneval | 164 | 78 | 3.5 | 3.5 | 3.8 | 46 | 3.7 | 1.2 | 1.1 | models \| examples \| data \| raw |
| humaneval+ | 164 | 49 | 3.7 | 3.7 | 3.9 | 45 | 4.3 | 1.8 | 0.5 | models \| examples \| data \| raw |
| lcb_codegen | 400 | 58 | 1.9 | 1.8 | 1.8 | 62 | 24 | 1.5 | 1.5 | models \| examples \| data \| raw |
| lcb_codegen_v5 | 880 | 24 | 1.5 | 1.5 | 1.3 | 67 | 6.4 | 4.3 | NaN | models \| examples \| data \| raw |
| lcb_codegen_v6 | 1055 | 30 | 1.3 | 1.3 | 1.1 | 64 | 3.5 | 2.1 | NaN | models \| examples \| data \| raw |
| lcb_codegen_v6_080124 | 454 | 30 | 2.2 | 2.2 | 1.9 | 63 | 5.5 | 2.6 | NaN | models \| examples \| data \| raw |
| mbpp | 378 | 59 | 2.4 | 2.4 | 2.4 | 48 | 2.4 | 4 | 2 | models \| examples \| data \| raw |
| mbpp+ | 378 | 59 | 2.5 | 2.5 | 2.4 | 55 | 9.5 | 5.8 | 1.7 | models \| examples \| data \| raw |
| mmlu | 14042 | 36 | 0.39 | 0.39 | 0.44 | 38 | 0.6 | 12 | 13 | models \| examples \| data \| raw |
| nq | 3610 | 36 | 0.67 | 0.67 | 0.63 | 61 | 32 | 5.5 | 6 | models \| examples \| data \| raw |
| piqa | 1838 | 36 | 0.93 | 0.93 | 0.69 | 72 | 5.3 | 8.4 | 1.8 | models \| examples \| data \| raw |
| safim | 18340 | 22 | 0.36 | 0.36 | 0.36 | 48 | 16 | 7.4 | 8.6 | models \| examples \| data \| raw |
| siqa | 1954 | 36 | 1.1 | 1.1 | 0.83 | 72 | 15 | 19 | 0.93 | models \| examples \| data \| raw |
| swebench-lite | 300 | 75 | 2.5 | 2.5 | 2.6 | 52 | 15 | 0.67 | NaN | models \| examples \| data \| raw |
| swebench-test | 2294 | 22 | 0.67 | 0.67 | 0.6 | 40 | 47 | 1.1 | NaN | models \| examples \| data \| raw |
| swebench-verified | 500 | 93 | 2 | 2 | 2 | 57 | 9.8 | 1.8 | NaN | models \| examples \| data \| raw |
| tqa | 11313 | 36 | 0.43 | 0.43 | 0.34 | 68 | 12 | 2.9 | 13 | models \| examples \| data \| raw |
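
To make columns such as std(A) and std(A-B) concrete, the sketch below estimates them for one benchmark by bootstrapping over examples: resample the example set with replacement and measure how much a model's accuracy, and the gap between two models, vary. This is a standard estimator offered as an assumption, not necessarily the exact computation behind the table, and the percentage-point scaling is likewise assumed.

```python
# Sketch (assumed procedure, not necessarily Eval-Arena's): bootstrap over examples
# to estimate how much accuracy A, and the gap A - B between two models, fluctuate
# because the benchmark has finitely many examples.
import numpy as np

def bootstrap_stds(correct_a, correct_b, n_boot: int = 2000, seed: int = 0):
    rng = np.random.default_rng(seed)
    correct_a = np.asarray(correct_a, dtype=float)
    correct_b = np.asarray(correct_b, dtype=float)
    n = len(correct_a)
    acc_a, acc_diff = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample examples with replacement
        acc_a.append(correct_a[idx].mean())
        acc_diff.append(correct_a[idx].mean() - correct_b[idx].mean())
    # Scale to percentage points (assumed to match the table's units).
    return 100 * np.std(acc_a), 100 * np.std(acc_diff)

# Example with synthetic correctness vectors standing in for real eval results.
rng = np.random.default_rng(1)
a = rng.integers(0, 2, size=800)
b = rng.integers(0, 2, size=800)
std_A, std_A_minus_B = bootstrap_stds(a, b)
print(f"std(A) ~ {std_A:.2f}, std(A-B) ~ {std_A_minus_B:.2f}")
```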

Datasets: