mbpp: by models



std predicted by accuracy

The typical standard deviation between pairs of models on this dataset, as a function of the absolute accuracy.
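As a rough check (a minimal sketch assuming each problem is an independent pass/fail draw, which may differ from the exact model behind the figure), the standard deviation of a measured accuracy, and of the accuracy gap between two independent models, can be predicted from the accuracy and the number of problems:

```python
import numpy as np

def predicted_std(accuracy: float, n_examples: int) -> float:
    """Binomial estimate of the std of a measured accuracy: sqrt(p*(1-p)/N).
    Assumes each of the N examples is an independent pass/fail draw."""
    return np.sqrt(accuracy * (1.0 - accuracy) / n_examples)

def predicted_pair_std(acc_a: float, acc_b: float, n_examples: int) -> float:
    """Std of the accuracy difference between two models, assuming their
    errors are independent (a paired comparison is usually tighter)."""
    return np.hypot(predicted_std(acc_a, n_examples),
                    predicted_std(acc_b, n_examples))

# e.g. two models near 80% accuracy on ~400 problems
print(predicted_std(0.80, 400))          # ~0.02, i.e. about 2 accuracy points
print(predicted_pair_std(0.80, 0.75, 400))
```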

Differences vs inconsistencies

Here is a more informative figure of the source information used to compute the p-values. Any model pair to the right of the parabola is statistically significantly different at the given level. The plot shows a fairly sharp transition, since there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
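As an illustration (a sketch using a normal approximation to the sign test; the exact test behind the figure may differ), the parabola corresponds to the minimum win gap |#A_win - #B_win| needed for significance at a given number of decisive examples n = #A_win + #B_win:

```python
import numpy as np
from scipy.stats import norm

def parabola_boundary(n_decisive: int, alpha: float = 0.05) -> float:
    """Approximate significance boundary: with n = #A_win + #B_win decisive
    examples, |#A_win - #B_win| must exceed roughly z * sqrt(n) to be
    significant at level alpha, i.e. diff^2 = z^2 * n traces a parabola."""
    z = norm.ppf(1.0 - alpha / 2.0)
    return z * np.sqrt(n_decisive)

# A pair with few decisive examples needs a relatively larger win gap:
for n in (25, 100, 400):
    print(n, round(parabola_boundary(n), 1))   # 25 -> 9.8, 100 -> 19.6, 400 -> 39.2
```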

p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning on every example where they differ; ties are ignored. The p-value is the probability, under the null hypothesis, of obtaining a difference at least as extreme as the one observed. For all pairs of models, the significance level mainly depends on the accuracy difference, as shown here. Hover over each model pair for detailed information.
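A minimal sketch of this test (a two-sided sign test via scipy; the win counts below are hypothetical, not taken from the table):

```python
from scipy.stats import binomtest

def pairwise_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided p-value under the null that A and B are equally likely to
    win each example on which they disagree; tied examples are ignored."""
    n_decisive = a_wins + b_wins
    return binomtest(a_wins, n_decisive, p=0.5, alternative="two-sided").pvalue

# Hypothetical pair: A solves 40 problems that B fails, B solves 22 that A fails.
print(pairwise_p_value(40, 22))   # ~0.03, significant at the 5% level
```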

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (as reported by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate consistently correlates well with Elo. GPT-3.5 is assigned an Elo of 1000 when available; otherwise the average Elo is set to 1000.

std (std(E(A)) in the table): the standard deviation due to drawing examples from a population; this is the dominant term.
std_i (E(std(A)) in the table): the standard deviation due to drawing samples from the model on each example.
std_total (std(A) in the table): the total standard deviation, satisfying std_total^2 = std^2 + std_i^2.
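For illustration, here is a sketch of how the last two metrics can be computed from a matrix of pairwise win counts (the exact fitting procedure used for the table may differ; the anchoring below simply pins one reference model at Elo 1000):

```python
import numpy as np

def average_win_rate(wins: np.ndarray) -> np.ndarray:
    """Mean win-rate of each model against every other model, in %.
    wins[i, j] = number of examples where model i beats model j (ties excluded)."""
    n = wins + wins.T
    with np.errstate(divide="ignore", invalid="ignore"):
        rate = wins / n                      # nan where a pair has no decisive examples
    np.fill_diagonal(rate, np.nan)
    return 100.0 * np.nanmean(rate, axis=1)

def bradley_terry_elo(wins: np.ndarray, n_iter: int = 200, anchor: int = 0) -> np.ndarray:
    """Fit Bradley-Terry strengths with the standard MM updates and convert
    them to an Elo-like scale (400 * log10), anchoring one model at 1000.
    Assumes every model wins at least one decisive comparison."""
    n = wins + wins.T                        # decisive comparisons per pair
    w = wins.sum(axis=1)                     # total wins per model
    p = np.ones(wins.shape[0])
    for _ in range(n_iter):
        denom = (n / (p[:, None] + p[None, :])).sum(axis=1)
        p = w / denom
        p /= p.mean()                        # fix the overall scale
    elo = 400.0 * np.log10(p)
    return elo - elo[anchor] + 1000.0

# Tiny hypothetical example with 3 models:
wins = np.array([[0, 30, 45],
                 [20, 0, 40],
                 [10, 15, 0]])
print(average_win_rate(wins))
print(bradley_terry_elo(wins, anchor=0).round())
```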

model pass@1 std(E(A)) E(std(A)) std(A) N win_rate elo
claude-3-opus-20240229 89.4 1.6 0 1.6 NaN 26.5 1.09e+03
gpt-4-1106-preview 85.7 1.8 0 1.8 NaN 24.4 1.07e+03
claude-3-sonnet-20240229 83.6 1.9 0 1.9 NaN 22.5 1.07e+03
meta-llama-3-70b-instruct 82.3 2 0 2 NaN 21.3 1.06e+03
deepseek-coder-33b-instruct 80.4 2 0 2 NaN 21.3 1.05e+03
claude-3-haiku-20240307 80.2 2.1 0 2.1 NaN 21.3 1.05e+03
opencodeinterpreter-ds-33b 80.2 2.1 0 2.1 NaN 20.6 1.05e+03
white-rabbit-neo-33b-v1 79.4 2.1 0 2.1 NaN 19.8 1.05e+03
bigcode--starcoder2-15b-instruct-v0.1 78 2.1 0 2.1 NaN 19.1 1.05e+03
xwincoder-34b 77 2.2 0 2.2 NaN 18.1 1.04e+03
opencodeinterpreter-ds-6.7b 76.5 2.2 0 2.2 NaN 17.7 1.04e+03
code-millenials-34b 76.2 2.2 0 2.2 NaN 17.5 1.04e+03
wizardcoder-34b 75.1 2.2 0 2.2 NaN 17.5 1.04e+03
deepseek-coder-6.7b-instruct 74.9 2.2 0 2.2 NaN 17 1.03e+03
HuggingFaceH4--starchat2-15b-v0.1 74.9 2.2 0 2.2 NaN 17.4 1.03e+03
starcoder2-15b-oci 74.3 2.2 0 2.2 NaN 16.8 1.03e+03
CohereForAI--c4ai-command-r-plus 74.3 2.2 0 2.2 NaN 17.7 1.03e+03
mixtral-8x22b-instruct-v0.1 73.8 2.3 0 2.3 NaN 17.1 1.03e+03
speechless-codellama-34b 73.8 2.3 0 2.3 NaN 16.6 1.03e+03
speechless-starcoder2-15b 73.5 2.3 0 2.3 NaN 16.1 1.03e+03
mistral-large-latest 72.8 2.3 0 2.3 NaN 17.9 1.03e+03
Qwen--Qwen1.5-72B-Chat 72.5 2.3 0 2.3 NaN 15.5 1.03e+03
deepseek-coder-6.7b-base 72 2.3 0 2.3 NaN 15.5 1.02e+03
dolphin-2.6 70.6 2.3 0 2.3 NaN 14.9 1.02e+03
codegemma-7b-it 70.4 2.3 0 2.3 NaN 15.1 1.02e+03
code-llama-34b 69.3 2.4 0 2.4 NaN 14.6 1.02e+03
databricks--dbrx-instruct 67.2 2.4 0 2.4 NaN 14.3 1.01e+03
speechless-starcoder2-7b 66.7 2.4 0 2.4 NaN 13.6 1.01e+03
code-llama-multi-34b 66.7 2.4 0 2.4 NaN 13.2 1.01e+03
microsoft--Phi-3-mini-4k-instruct 65.9 2.4 0 2.4 NaN 14.7 1e+03
codegemma-7b 65.1 2.5 0 2.5 NaN 13.2 1e+03
wizardcoder-15b 64.3 2.5 0 2.5 NaN 12 997
phi-2 64 2.5 0 2.5 NaN 12.4 996
openchat 63.8 2.5 0 2.5 NaN 12 996
code-llama-13b 63.5 2.5 0 2.5 NaN 12 995
code-llama-multi-13b 63 2.5 0 2.5 NaN 11.5 993
mixtral-8x7b-instruct 59.5 2.5 0 2.5 NaN 12.3 981
octocoder 59.3 2.5 0 2.5 NaN 9.86 980
wizardcoder-7b 58.5 2.5 0 2.5 NaN 10 977
speechless-mistral-7b 57.4 2.5 0 2.5 NaN 10.3 973
gemma-1.1-7b-it 57.1 2.5 0 2.5 NaN 10.6 972
codet5p-16b 56.6 2.5 0 2.5 NaN 8.68 970
codegemma-2b 55.6 2.6 0 2.6 NaN 9 967
stable-code-3b 54.8 2.6 0 2.6 NaN 8.35 964
codegen-16b 54.2 2.6 0 2.6 NaN 8.82 962
code-llama-multi-7b 53.7 2.6 0 2.6 NaN 8.69 960
starcoder2-3b 53.4 2.6 0 2.6 NaN 9.68 959
codet5p-6b 52.9 2.6 0 2.6 NaN 8.88 957
gemma-7b 52.6 2.6 0 2.6 NaN 8.61 956
open-hermes-2.5-code-290k-13b 52.4 2.6 0 2.6 NaN 9.1 955
mistral-7b 51.9 2.6 0 2.6 NaN 7.37 953
codegen-6b 50.8 2.6 0 2.6 NaN 7.68 949
xdan-l1-chat 50.3 2.6 0 2.6 NaN 8.23 948
codet5p-2b 48.4 2.6 0 2.6 NaN 7.5 941
codegen-2b 46.3 2.6 0 2.6 NaN 6.97 933
mistralai--Mistral-7B-Instruct-v0.2 44.7 2.6 0 2.6 NaN 6.68 927
solar-10.7b-instruct 43.9 2.6 0 2.6 NaN 6.21 924
gemma-2b 41.8 2.5 0 2.5 NaN 5.29 917
gemma-7b-it 39.7 2.5 0 2.5 NaN 6.51 909