gsm8k: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance to win whenever they disagree; ties are ignored. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one actually observed. For all pairs of models, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
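As a concrete illustration, the test can be carried out as a two-sided sign (binomial) test over the problems where the two models disagree. The sketch below assumes per-problem correctness lists and uses scipy; it is an assumed setup, not necessarily the exact code behind this page.

```python
# Hedged sketch of the pairwise test described above: ignore ties, then treat
# each disagreement as a fair coin flip under the null hypothesis and compute
# a two-sided binomial p-value.
from scipy.stats import binomtest

def pairwise_p_value(a_correct, b_correct):
    """a_correct, b_correct: per-problem booleans for models A and B."""
    a_wins = sum(a and not b for a, b in zip(a_correct, b_correct))
    b_wins = sum(b and not a for a, b in zip(a_correct, b_correct))
    n = a_wins + b_wins                       # ties are ignored
    if n == 0:
        return 1.0                            # the models never disagree
    # Under the null, A wins each disagreement with probability 1/2.
    return binomtest(a_wins, n, 0.5, alternative="two-sided").pvalue
```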

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different from the other at the given level. The plot shows a fairly sharp transition, since there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
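To see where the parabola comes from: with n = #A_win + #B_win disagreements, the normal approximation to the binomial says the gap |#A_win - #B_win| must exceed roughly z * sqrt(n) to be significant, which traces a parabola in this plot. A minimal sketch, assuming the normal approximation (the page's exact computation may differ):

```python
# Where the significance boundary ("parabola") comes from, under the normal
# approximation to the binomial.
import math
from scipy.stats import norm

def min_significant_gap(n_disagreements, alpha=0.05):
    """Smallest |#A_win - #B_win| that is significant at level alpha,
    given n_disagreements = #A_win + #B_win."""
    z = norm.ppf(1 - alpha / 2)               # two-sided critical value
    return z * math.sqrt(n_disagreements)     # boundary: |gap| = z * sqrt(n)

# Example: with 100 disagreements the gap must be about 1.96 * 10 ~= 20 wins
# to reach p < 0.05.
```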

Results table by model

We show three methods currently used for evaluating models: raw accuracy as reported by benchmarks, average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate always correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when it is available; otherwise the average Elo is set to 1000. A sketch of how these scores can be computed follows the table.

model pass@1 win_rate elo
Qwen1.5-110B 84.1% 95.3% 1608.0
Meta-Llama-3-70B 82.7% 95.0% 1596.2
Mixtral-8x22B-v0.1 80.2% 93.5% 1543.4
Qwen1.5-72B 78.4% 92.1% 1504.8
DeepSeek-V2 77.3% 91.8% 1498.5
Qwen1.5-32B 76.3% 91.0% 1476.3
Qwen1.5-14B 69.5% 86.0% 1380.0
dbrx-base 69.5% 85.7% 1379.0
deepseek-llm-67b-base 62.9% 80.2% 1305.6
Mixtral-8x7B-v0.1 60.3% 77.9% 1276.1
Qwen1.5-7B 59.1% 75.5% 1248.6
gemma-7b 56.8% 73.2% 1234.1
llama2_70B 56.7% 73.3% 1232.6
Meta-Llama-3-8B 55.4% 72.0% 1220.7
Qwen1.5-4B 55.0% 71.3% 1208.1
llama_65B 50.8% 66.3% 1171.6
Mistral-7B-v0.1 41.2% 53.4% 1062.8
llama2_13B 38.0% 48.7% 1025.4
Qwen1.5-1.8B 36.9% 47.2% 1011.9
llama_33B 34.6% 43.8% 992.4
falcon-40b 27.1% 32.8% 902.1
llama2_07B 22.5% 25.8% 834.5
mpt-30b 21.8% 24.9% 826.8
Qwen1.5-0.5B 20.9% 24.4% 821.6
gemma-2b 18.8% 21.5% 794.5
deepseek-moe-16b-base 18.6% 20.4% 782.4
llama_13B 17.6% 19.6% 770.2
deepseek-llm-7b-base 14.1% 15.2% 716.9
llama_07B 10.9% 11.0% 648.8
stablelm-3b-4e1t 10.8% 10.7% 644.7
falcon-7b 7.9% 7.9% 587.0
stablelm-base-alpha-7b-v2 7.4% 7.7% 584.9
pythia-12b-deduped-v0 3.6% 3.8% 453.7
pythia-2.8b-deduped 3.0% 3.5% 437.0
pythia-6.9b-deduped-v0 2.9% 3.4% 429.2
pythia-1.4b-deduped-v0 2.0% 2.9% 408.7
pythia-1b-deduped 2.0% 2.6% 380.9
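
For reference, here is a hedged sketch of how the three columns above can be computed from pairwise win counts. The win-count layout, the MM fitting loop, and the "gpt-3.5" anchor name are assumptions for illustration; the leaderboard's own code may differ.

```python
# Hedged sketch of the three scores in the table: pass@1 is raw accuracy,
# win_rate is the average win rate over all other models, and elo is a
# Bradley-Terry fit rescaled to Elo-like points. Assumed data layout:
# wins[a][b] = number of problems model a solved and model b did not.
import numpy as np

def average_win_rate(wins):
    """Average, over all opponents b, of wins[a][b] / (wins[a][b] + wins[b][a])."""
    models = list(wins)
    rates = {}
    for a in models:
        per_opp = []
        for b in models:
            if b == a:
                continue
            total = wins[a][b] + wins[b][a]
            per_opp.append(wins[a][b] / total if total else 0.5)
        rates[a] = sum(per_opp) / len(per_opp)
    return rates

def bradley_terry_elo(wins, anchor="gpt-3.5", iters=1000):
    """Fit Bradley-Terry strengths with simple MM updates, then convert to an
    Elo-like scale (400 / ln 10 points per unit of log-strength). The anchor
    model is pinned at 1000 if present, otherwise the mean is set to 1000."""
    models = list(wins)
    p = {m: 1.0 for m in models}                      # BT strengths
    for _ in range(iters):
        for a in models:
            num = sum(wins[a][b] for b in models if b != a)
            den = sum((wins[a][b] + wins[b][a]) / (p[a] + p[b])
                      for b in models if b != a)
            p[a] = max(num, 1e-9) / den if den > 0 else p[a]
        geo_mean = np.exp(np.mean([np.log(v) for v in p.values()]))
        p = {m: v / geo_mean for m, v in p.items()}   # fix the overall scale
    scale = 400 / np.log(10)
    elo = {m: scale * np.log(v) for m, v in p.items()}
    shift = 1000 - (elo[anchor] if anchor in elo
                    else float(np.mean(list(elo.values()))))
    return {m: e + shift for m, e in elo.items()}
```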