gsm8k: by models



std predicted by accuracy

The typical standard deviation of the accuracy difference between pairs of models on this dataset, as a function of absolute accuracy.
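For orientation, the per-model term behind this curve is just the binomial standard error of an accuracy estimate. A minimal sketch, assuming independent errors and the standard 1,319-problem gsm8k test split (the helper name is mine, not from the page):

```python
import math

def accuracy_std(accuracy: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy estimated from n_examples problems."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)

# A model at 80% accuracy on the 1319-problem gsm8k test split:
std_single = accuracy_std(0.80, 1319)      # ~0.011, i.e. about 1.1 points
# Std of the accuracy *difference* for two independent models near 80%:
std_pair = math.sqrt(2) * std_single       # an upper bound; correlated errors shrink it
print(std_single, std_pair)
```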

Differences vs inconsistencies

Here is a more informative view of the raw counts used to compute the p-value. Any model pair to the right of the parabola is statistically different at the given significance level. The plot shows a fairly sharp transition because no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see doc; a rough sketch of the boundary check follows below.
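Under a normal approximation to this sign test, the parabola is the boundary |#A_win - #B_win| = z * sqrt(#A_win + #B_win). A rough sketch of that check (the helper name and the plain normal critical value are assumptions, not the page's exact computation):

```python
import math
from statistics import NormalDist

def outside_parabola(a_wins: int, b_wins: int, alpha: float = 0.05) -> bool:
    """Normal-approximation sign test on the examples where A and B disagree."""
    n = a_wins + b_wins
    if n == 0:
        return False
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value, ~1.96 at alpha=0.05
    return abs(a_wins - b_wins) > z * math.sqrt(n)
```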

p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever they disagree; ties are ignored. The p-value is the probability, under the null hypothesis, of a difference at least as extreme as the one observed. For all pairs of models, the significance level mainly depends on the accuracy difference, as shown here. Hover over each model pair for detailed information.
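The exact form of this test is a two-sided binomial (sign) test on the disagreements. A minimal sketch using scipy, assuming the per-pair win counts are available:

```python
from scipy.stats import binomtest

def sign_test_pvalue(a_wins: int, b_wins: int) -> float:
    """Probability, under a fair coin on every disagreement, of a split at least this lopsided."""
    return binomtest(a_wins, a_wins + b_wins, p=0.5, alternative="two-sided").pvalue

# e.g. A beats B on 60 examples, B beats A on 40, ties ignored:
print(sign_test_pvalue(60, 40))   # ~0.057
```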

Results table by model

We show three methods currently used to evaluate models: raw accuracy (as reported by the benchmark), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when it is present; otherwise the average Elo is set to 1000. std (shown as std(E(A))): the standard deviation due to drawing examples from the population; this is the dominant term. std_i (shown as E(std(A))): the standard deviation due to drawing samples from the model on each example. std_total (shown as std(A)): the total standard deviation, satisfying std_total^2 = std^2 + std_i^2. A sketch of the win-rate and Elo bookkeeping follows the table.

model pass1 std(E(A)) E(std(A)) std(A) N win_rate elo
Qwen1.5-110B 84.1 1 0 1 NaN 48.9 1.18e+03
Meta-Llama-3-70B 82.7 1 0 1 NaN 47.6 1.18e+03
Mixtral-8x22B-v0.1 80.2 1.1 0 1.1 NaN 45.7 1.17e+03
Qwen1.5-72B 78.4 1.1 0 1.1 NaN 44.5 1.16e+03
DeepSeek-V2 77.3 1.2 0 1.2 NaN 43.5 1.15e+03
Qwen1.5-32B 76.3 1.2 0 1.2 NaN 42.7 1.15e+03
Qwen1.5-14B 69.5 1.3 0 1.3 NaN 37.7 1.12e+03
dbrx-base 69.5 1.3 0 1.3 NaN 37.8 1.12e+03
deepseek-llm-67b-base 62.9 1.3 0 1.3 NaN 32.8 1.09e+03
Mixtral-8x7B-v0.1 60.3 1.3 0 1.3 NaN 30.9 1.08e+03
Qwen1.5-7B 59.1 1.4 0 1.4 NaN 30.7 1.08e+03
gemma-7b 56.8 1.4 0 1.4 NaN 29.1 1.07e+03
llama2_70B 56.7 1.4 0 1.4 NaN 28.8 1.07e+03
Meta-Llama-3-8B 55.4 1.4 0 1.4 NaN 27.8 1.06e+03
Qwen1.5-4B 55 1.4 0 1.4 NaN 27.8 1.06e+03
llama_65B 50.8 1.4 0 1.4 NaN 24.9 1.05e+03
Mistral-7B-v0.1 41.2 1.4 0 1.4 NaN 19 1.01e+03
llama2_13B 38 1.3 0 1.3 NaN 17 996
Qwen1.5-1.8B 36.9 1.3 0 1.3 NaN 16.8 992
llama_33B 34.6 1.3 0 1.3 NaN 15.4 983
falcon-40b 27.1 1.2 0 1.2 NaN 11.5 955
llama2_07B 22.5 1.2 0 1.2 NaN 8.99 937
mpt-30b 21.8 1.1 0 1.1 NaN 8.71 934
Qwen1.5-0.5B 20.9 1.1 0 1.1 NaN 8.75 930
gemma-2b 18.8 1.1 0 1.1 NaN 7.76 922
deepseek-moe-16b-base 18.6 1.1 0 1.1 NaN 7.21 921
llama_13B 17.6 1 0 1 NaN 7.06 917
deepseek-llm-7b-base 14.1 0.96 0 0.96 NaN 5.53 903
llama_07B 10.9 0.86 0 0.86 NaN 4.04 890
stablelm-3b-4e1t 10.8 0.85 0 0.85 NaN 3.95 890
falcon-7b 7.88 0.74 0 0.74 NaN 2.99 878
stablelm-base-alpha-7b-v2 7.35 0.72 0 0.72 NaN 2.93 876
pythia-12b-deduped-v0 3.56 0.51 0 0.51 NaN 1.5 860
pythia-2.8b-deduped 2.96 0.47 0 0.47 NaN 1.4 857
pythia-6.9b-deduped-v0 2.88 0.46 0 0.46 NaN 1.33 857
pythia-1.4b-deduped-v0 2.05 0.39 0 0.39 NaN 1.19 853
pythia-1b-deduped 1.97 0.38 0 0.38 NaN 1.04 853
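For reference, a minimal sketch of how the win-rate and Elo columns could be derived from per-example correctness. The data layout, the strict-win convention, the MM-style Bradley-Terry fit, and anchoring the average Elo at 1000 are illustrative assumptions; the page's actual computation is described in the linked doc.

```python
import numpy as np

def avg_win_rate(correct: np.ndarray) -> np.ndarray:
    """correct: (n_models, n_examples) array of 0/1 per-example correctness.
    Fraction of (opponent, example) pairs where this model is right and the opponent is wrong."""
    n_models = correct.shape[0]
    rates = np.zeros(n_models)
    for i in range(n_models):
        others = [j for j in range(n_models) if j != i]
        rates[i] = (correct[i][None, :] > correct[others]).mean()   # ties earn no credit
    return rates

def bradley_terry_elo(correct: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths with simple MM iterations on pairwise win counts,
    then rescale to an Elo-like scale whose average is anchored at 1000."""
    n = correct.shape[0]
    wins = np.zeros((n, n))                      # wins[i, j] = #examples where i right, j wrong
    for i in range(n):
        for j in range(n):
            if i != j:
                wins[i, j] = (correct[i] > correct[j]).sum()
    theta = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            comparisons = sum((wins[i, j] + wins[j, i]) / (theta[i] + theta[j])
                              for j in range(n) if j != i)
            theta[i] = wins[i].sum() / max(comparisons, 1e-12)
        theta /= theta.mean()                    # keep the scale bounded between sweeps
    elo = 1000.0 + 400.0 * np.log10(theta)
    return elo - elo.mean() + 1000.0             # anchor the average Elo at 1000
```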