hellaswag: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance to win whenever they disagree; ties are ignored. The p-value is the probability, under this null hypothesis, of observing a difference as extreme as the one actually observed. For all model pairs, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
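
Concretely, this amounts to a two-sided binomial (sign) test on the examples where the two models disagree. A minimal sketch, with made-up win counts for illustration:

```python
from scipy.stats import binomtest

# Hypothetical counts: examples where exactly one of A, B is correct (ties are dropped).
a_wins, b_wins = 530, 470

# Under the null hypothesis each disagreement is a fair coin flip, so we test p = 1/2.
result = binomtest(a_wins, n=a_wins + b_wins, p=0.5, alternative="two-sided")
print(result.pvalue)
```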

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs inconsistencies

Here is a more informative figure showing the source data used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given level. The plot shows a fairly sharp transition, since there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
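
To make the parabola concrete: under the normal approximation to the sign test, a pair is significant at level alpha roughly when |#A_win - #B_win| >= z_(alpha/2) * sqrt(#A_win + #B_win), so the boundary in the (difference, total-disagreements) plane is a parabola. A minimal sketch of this check, with made-up counts:

```python
import math
from scipy.stats import norm

def is_significant(a_wins: int, b_wins: int, alpha: float = 0.05) -> bool:
    """Normal-approximation sign test: the pair is significant when it lies to the
    right of the parabola |a_wins - b_wins| = z * sqrt(a_wins + b_wins)."""
    n = a_wins + b_wins
    z = norm.ppf(1 - alpha / 2)  # two-sided critical value, ~1.96 for alpha = 0.05
    return abs(a_wins - b_wins) >= z * math.sqrt(n)

# Hypothetical pair: 600 disagreements, A wins 330 of them.
print(is_significant(330, 270))  # |330 - 270| = 60 >= 1.96 * sqrt(600) ~ 48 -> True
```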

Results table by model

We show three methods currently used for evaluating models: raw accuracy (as reported by the benchmark), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate always correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000.
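
For reference, here is a minimal sketch of how average win rate and Bradley-Terry (Elo-style) coefficients can be computed from a pairwise win matrix. The tiny three-model matrix, the MM fitting loop, and the anchoring of the mean rating to 1000 are assumptions for illustration, not the exact pipeline behind this table:

```python
import numpy as np

# Hypothetical pairwise counts: wins[i, j] = #examples model i gets right and model j gets wrong.
models = ["model_x", "model_y", "model_z"]
wins = np.array([[0., 60., 80.],
                 [40., 0., 70.],
                 [20., 30., 0.]])
n = len(models)

# Average win rate over all other models.
avg_win_rate = np.array([
    np.mean([wins[i, j] / (wins[i, j] + wins[j, i]) for j in range(n) if j != i])
    for i in range(n)
])

# Bradley-Terry strengths via the standard MM (minorize-maximize) iteration.
games = wins + wins.T
p = np.ones(n)
for _ in range(200):
    for i in range(n):
        denom = sum(games[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
        p[i] = wins[i].sum() / denom
    p /= p.sum()

# Convert to an Elo-like scale and anchor so the average rating is 1000.
elo = 400 * np.log10(p)
elo += 1000 - elo.mean()

for m, w, e in zip(models, avg_win_rate, elo):
    print(f"{m}: win_rate={w:.1%}, elo={e:.1f}")
```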

model pass1 win_rate elo
dbrx-base 88.7% 90.3% 1399.6
Mixtral-8x22B-v0.1 86.8% 90.7% 1386.8
Qwen1.5-110B 86.5% 90.6% 1380.1
Meta-Llama-3-70B 85.9% 88.6% 1337.2
deepseek-llm-67b-base 85.5% 88.8% 1339.5
Qwen1.5-72B 85.3% 86.0% 1296.6
llama_65B 85.3% 87.6% 1320.2
falcon-40b 85.1% 86.4% 1301.6
Mixtral-8x7B-v0.1 84.5% 84.5% 1271.5
Qwen1.5-32B 84.1% 82.7% 1244.8
llama_33B 84.0% 83.0% 1249.8
llama2_70B 83.0% 75.7% 1180.5
Mistral-7B-v0.1 81.7% 73.8% 1140.4
gemma-7b 80.8% 69.5% 1101.5
mpt-30b 80.8% 69.3% 1096.6
Meta-Llama-3-8B 80.5% 68.3% 1084.4
llama_13B 80.4% 67.7% 1086.5
llama2_13B 80.3% 65.0% 1078.7
Qwen1.5-14B 80.0% 65.1% 1063.5
deepseek-moe-16b-base 78.6% 59.1% 1010.6
falcon-7b 78.3% 58.1% 1002.8
Qwen1.5-7B 77.3% 53.1% 970.5
deepseek-llm-7b-base 77.2% 53.0% 961.2
llama_07B 77.1% 52.7% 955.4
llama2_07B 76.2% 48.8% 939.5
stablelm-base-alpha-7b-v2 75.5% 45.3% 899.3
stablelm-3b-4e1t 75.2% 44.3% 888.6
gemma-2b 71.7% 31.5% 782.9
Qwen1.5-4B 71.6% 31.7% 789.5
pythia-12b-deduped-v0 69.5% 25.3% 730.3
pythia-6.9b-deduped-v0 66.1% 17.7% 648.0
Qwen1.5-1.8B 61.0% 10.8% 540.9
pythia-2.8b-deduped 60.3% 9.5% 515.8
pythia-1.4b-deduped-v0 52.0% 4.8% 379.7
pythia-1b-deduped 49.6% 3.2% 305.5
Qwen1.5-0.5B 49.4% 3.5% 319.8