Raw data: summary.csv
| benchmark_id | size | models | std(A) | std(E(A)) | std(A-B) | corr(A,B) | no_solve | tau- | sig_noise | details | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CRUXEval-input-T0.2 | 800 | 37 | 1.7 | 1.5 | 1.8 | 57 | 2.4 | 9.2 | 1.7 | models | examples | data | raw |
| CRUXEval-input-T0.8 | 800 | 31 | 1.6 | 1.2 | 1.8 | 70 | 2.1 | 11 | 2.1 | models | examples | data | raw |
| CRUXEval-output-T0.2 | 800 | 37 | 1.7 | 1.6 | 1.6 | 66 | 3 | 4.5 | 2.7 | models | examples | data | raw |
| CRUXEval-output-T0.8 | 800 | 31 | 1.7 | 1.4 | 1.6 | 76 | 4.4 | 11 | 2.6 | models | examples | data | raw |
| DS1000 | 1000 | 105 | 1.3 | 1.3 | 1.4 | 44 | 11 | 1.2 | 3 | models | examples | data | raw |
| agi_english | 2546 | 35 | 0.93 | 0.93 | 1.1 | 26 | 0.98 | 21 | 5.2 | models | examples | data | raw |
| arc_challenge | 1165 | 38 | 1.4 | 1.4 | 1.3 | 62 | 14 | 11 | 2.8 | models | examples | data | raw |
| gsm8k | 1319 | 37 | 1.1 | 1.1 | 1.2 | 32 | 2.1 | 0.53 | 7.1 | models | examples | data | raw |
| hellaswag | 10042 | 36 | 0.41 | 0.41 | 0.26 | 78 | 6.1 | 1.6 | 6.4 | models | examples | data | raw |
| humaneval | 164 | 78 | 3.5 | 3.5 | 3.8 | 46 | 3.7 | 1.2 | 1.1 | models | examples | data | raw |
| humaneval+ | 164 | 49 | 3.7 | 3.7 | 3.9 | 45 | 4.3 | 1.8 | 0.5 | models | examples | data | raw |
| lcb_codegen | 400 | 58 | 1.9 | 1.8 | 1.8 | 62 | 24 | 1.5 | 1.5 | models | examples | data | raw |
| lcb_codegen_v5 | 880 | 24 | 1.5 | 1.5 | 1.3 | 67 | 6.4 | 4.3 | NaN | models | examples | data | raw |
| lcb_codegen_v6 | 1055 | 30 | 1.3 | 1.3 | 1.1 | 64 | 3.5 | 2.1 | NaN | models | examples | data | raw |
| lcb_codegen_v6_080124 | 454 | 30 | 2.2 | 2.2 | 1.9 | 63 | 5.5 | 2.6 | NaN | models | examples | data | raw |
| mbpp | 378 | 59 | 2.4 | 2.4 | 2.4 | 48 | 2.4 | 4 | 2 | models | examples | data | raw |
| mbpp+ | 378 | 59 | 2.5 | 2.5 | 2.4 | 55 | 9.5 | 5.8 | 1.7 | models | examples | data | raw |
| mmlu | 14042 | 36 | 0.39 | 0.39 | 0.44 | 38 | 0.6 | 12 | 13 | models | examples | data | raw |
| nq | 3610 | 36 | 0.67 | 0.67 | 0.63 | 61 | 32 | 5.5 | 6 | models | examples | data | raw |
| piqa | 1838 | 36 | 0.93 | 0.93 | 0.69 | 72 | 5.3 | 8.4 | 1.8 | models | examples | data | raw |
| safim | 18340 | 22 | 0.36 | 0.36 | 0.36 | 48 | 16 | 7.4 | 8.6 | models | examples | data | raw |
| siqa | 1954 | 36 | 1.1 | 1.1 | 0.83 | 72 | 15 | 19 | 0.93 | models | examples | data | raw |
| swebench-lite | 300 | 75 | 2.5 | 2.5 | 2.6 | 52 | 15 | 0.67 | NaN | models | examples | data | raw |
| swebench-test | 2294 | 22 | 0.67 | 0.67 | 0.6 | 40 | 47 | 1.1 | NaN | models | examples | data | raw |
| swebench-verified | 500 | 93 | 2 | 2 | 2 | 57 | 9.8 | 1.8 | NaN | models | examples | data | raw |
| tqa | 11313 | 36 | 0.43 | 0.43 | 0.34 | 68 | 12 | 2.9 | 13 | models | examples | data | raw |
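
The table can also be processed programmatically. The following is a minimal sketch, not the project's own tooling: it assumes `summary.csv` uses the same column names as the table header, and it parses three rows copied from the table above from an inline string so that it runs without the file. The helpers `load_summary` and `rank_by_sig_noise` are hypothetical names introduced for illustration.

```python
import csv
import io
import math

# Three rows copied from the table above. The real summary.csv is assumed
# (not confirmed) to use the same column names. The corr(A,B) header
# contains a comma, so it has to be quoted to be valid CSV.
SAMPLE = """benchmark_id,size,models,std(A),std(E(A)),std(A-B),"corr(A,B)",no_solve,tau-,sig_noise
humaneval,164,78,3.5,3.5,3.8,46,3.7,1.2,1.1
mmlu,14042,36,0.39,0.39,0.44,38,0.6,12,13
lcb_codegen_v5,880,24,1.5,1.5,1.3,67,6.4,4.3,NaN
"""


def load_summary(handle):
    """Parse rows into dicts, converting every column except benchmark_id to float."""
    rows = []
    for record in csv.DictReader(handle):
        row = {"benchmark_id": record["benchmark_id"]}
        for key, value in record.items():
            if key != "benchmark_id":
                row[key] = float(value)  # float("NaN") yields nan
        rows.append(row)
    return rows


def rank_by_sig_noise(rows):
    """Drop rows whose sig_noise is NaN, then sort from highest to lowest."""
    known = [r for r in rows if not math.isnan(r["sig_noise"])]
    return sorted(known, key=lambda r: r["sig_noise"], reverse=True)


rows = load_summary(io.StringIO(SAMPLE))
ranked = rank_by_sig_noise(rows)
for r in ranked:
    print(r["benchmark_id"], r["sig_noise"])
```

To read the actual file instead of the inline sample, pass an open handle, for example `with open("summary.csv", newline="") as f: rows = load_summary(f)`. Rows with a `NaN` in `sig_noise` are filtered out explicitly because comparisons against NaN would otherwise make the sort order unreliable.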