mbpp+: by models

std predicted by accuracy

The typical standard deviation between pairs of models on this dataset, as a function of the absolute accuracy.
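As a rough sketch of this relationship (a minimal approximation assuming the spread is dominated by binomial sampling over the benchmark's tasks; the task count of 378 is an assumed value for illustration, not taken from this page):

```python
import math

def accuracy_std(p: float, n_tasks: int) -> float:
    """Binomial standard deviation of an accuracy estimate over n_tasks problems."""
    return math.sqrt(p * (1 - p) / n_tasks)

# A model at 50% accuracy on an assumed 378 tasks:
print(100 * accuracy_std(0.50, 378))  # ~2.6 percentage points, roughly matching the std column below
```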

Differences vs inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given significance level. The plot shows a fairly sharp transition: since there are no model pairs with a small #A_win + #B_win, significant results at a small |#A_win - #B_win| are ruled out. For more explanation, see the doc.
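A minimal sketch of that boundary (a normal approximation to the sign test; the exact curve in the figure may differ slightly):

```python
import math
from scipy.stats import norm

def critical_difference(n_disagreements: int, alpha: float = 0.05) -> float:
    """Approximate sign-test boundary: |#A_win - #B_win| must exceed roughly
    z_{1-alpha/2} * sqrt(#A_win + #B_win) to be significant at level alpha."""
    z = norm.ppf(1 - alpha / 2)
    return z * math.sqrt(n_disagreements)

# With 100 disagreements, a gap of about 20 wins is needed at alpha = 0.05:
print(critical_difference(100))  # ~19.6
```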

p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance to win whenever they disagree; ties are ignored. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one observed. For all pairs of models, the significance level depends mainly on the accuracy difference, as shown here. Hover over each model pair for detailed information.
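A minimal sketch of this test as a two-sided binomial sign test (the win counts below are hypothetical):

```python
from scipy.stats import binomtest

def pair_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided sign test: under the null, each model wins a disagreement with prob 1/2."""
    n = a_wins + b_wins  # ties are ignored
    return binomtest(a_wins, n, p=0.5, alternative="two-sided").pvalue

# Hypothetical counts for one model pair:
print(pair_p_value(a_wins=62, b_wins=38))  # ~0.02
```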

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (as reported by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000. std: the standard deviation due to drawing examples from a population (std(E(A)) in the table); this is the dominant term. std_i: the standard deviation due to drawing samples from the model on each example (E(std(A))). std_total: the total standard deviation (std(A)), satisfying std_total^2 = std^2 + std_i^2.
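A minimal sketch of this variance decomposition (illustrative values; with greedy decoding E(std(A)) is 0, which is why std(A) equals std(E(A)) in the table below):

```python
import math

def total_std(std_population: float, std_sampling: float) -> float:
    """Combine the two independent error sources in quadrature:
    std_total^2 = std^2 + std_i^2."""
    return math.sqrt(std_population**2 + std_sampling**2)

# Example: population std of 2.3 points and zero sampling std (greedy decoding):
print(total_std(2.3, 0.0))  # 2.3
```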

model pass@1 (%) std(E(A)) E(std(A)) std(A) N win_rate elo
gpt-4-1106-preview 74.1 2.3 0 2.3 NaN 23.9 1.07e+03
claude-3-opus-20240229 73.5 2.3 0 2.3 NaN 23.1 1.07e+03
deepseek-coder-33b-instruct 70.4 2.3 0 2.3 NaN 21.2 1.05e+03
claude-3-sonnet-20240229 69.8 2.4 0 2.4 NaN 20.2 1.05e+03
meta-llama-3-70b-instruct 69.6 2.4 0 2.4 NaN 19.3 1.05e+03
claude-3-haiku-20240307 69.3 2.4 0 2.4 NaN 20.1 1.05e+03
opencodeinterpreter-ds-33b 68.8 2.4 0 2.4 NaN 19.2 1.05e+03
white-rabbit-neo-33b-v1 67.5 2.4 0 2.4 NaN 18.5 1.04e+03
opencodeinterpreter-ds-6.7b 66.9 2.4 0 2.4 NaN 17.6 1.04e+03
xwincoder-34b 66.1 2.4 0 2.4 NaN 16.9 1.04e+03
deepseek-coder-6.7b-instruct 66.1 2.4 0 2.4 NaN 17.6 1.04e+03
bigcode--starcoder2-15b-instruct-v0.1 65.1 2.5 0 2.5 NaN 17.6 1.04e+03
HuggingFaceH4--starchat2-15b-v0.1 64.8 2.5 0 2.5 NaN 16.9 1.03e+03
code-millenials-34b 64.6 2.5 0 2.5 NaN 16.7 1.03e+03
mixtral-8x22b-instruct-v0.1 64.6 2.5 0 2.5 NaN 16.9 1.03e+03
wizardcoder-34b 63.8 2.5 0 2.5 NaN 15.9 1.03e+03
CohereForAI--c4ai-command-r-plus 63.8 2.5 0 2.5 NaN 17.1 1.03e+03
starcoder2-15b-oci 63.8 2.5 0 2.5 NaN 16.2 1.03e+03
speechless-starcoder2-15b 63 2.5 0 2.5 NaN 15.2 1.03e+03
Qwen--Qwen1.5-72B-Chat 62.4 2.5 0 2.5 NaN 15.4 1.03e+03
speechless-codellama-34b 61.4 2.5 0 2.5 NaN 14.5 1.02e+03
dolphin-2.6 60.1 2.5 0 2.5 NaN 13.9 1.02e+03
mistral-large-latest 59.8 2.5 0 2.5 NaN 15.8 1.02e+03
deepseek-coder-6.7b-base 59.5 2.5 0 2.5 NaN 13.7 1.02e+03
codegemma-7b-it 57.4 2.5 0 2.5 NaN 12.9 1.01e+03
speechless-starcoder2-7b 57.1 2.5 0 2.5 NaN 12.6 1.01e+03
code-llama-34b 56.9 2.5 0 2.5 NaN 12.9 1.01e+03
databricks--dbrx-instruct 56.3 2.6 0 2.6 NaN 13.8 1e+03
openchat 56.1 2.6 0 2.6 NaN 12.2 1e+03
phi-2 55.3 2.6 0 2.6 NaN 12 1e+03
code-llama-multi-34b 55 2.6 0 2.6 NaN 11.6 1e+03
wizardcoder-15b 54.8 2.6 0 2.6 NaN 11.6 999
microsoft--Phi-3-mini-4k-instruct 54.5 2.6 0 2.6 NaN 13.5 998
code-llama-multi-13b 54.5 2.6 0 2.6 NaN 11.4 998
code-llama-13b 53.2 2.6 0 2.6 NaN 11.1 993
codegemma-7b 52.4 2.6 0 2.6 NaN 11.4 991
octocoder 51.3 2.6 0 2.6 NaN 10 987
mixtral-8x7b-instruct 50.3 2.6 0 2.6 NaN 11.9 983
wizardcoder-7b 50 2.6 0 2.6 NaN 9.48 982
speechless-mistral-7b 49.2 2.6 0 2.6 NaN 10.2 979
codet5p-16b 48.1 2.6 0 2.6 NaN 8.65 976
codegemma-2b 47.9 2.6 0 2.6 NaN 9.53 975
stable-code-3b 46.8 2.6 0 2.6 NaN 8.47 971
open-hermes-2.5-code-290k-13b 46.8 2.6 0 2.6 NaN 10.1 971
gemma-1.1-7b-it 46.6 2.6 0 2.6 NaN 8.85 970
codegen-16b 46.3 2.6 0 2.6 NaN 8.51 969
gemma-7b 45 2.6 0 2.6 NaN 8.73 964
starcoder2-3b 44.4 2.6 0 2.6 NaN 8.26 963
code-llama-multi-7b 44.2 2.6 0 2.6 NaN 7.61 962
codegen-6b 43.7 2.6 0 2.6 NaN 7.7 960
mistral-7b 42.9 2.5 0 2.5 NaN 7.01 957
codet5p-6b 42.6 2.5 0 2.5 NaN 7.82 956
xdan-l1-chat 41.8 2.5 0 2.5 NaN 7.6 953
codet5p-2b 38.9 2.5 0 2.5 NaN 6.16 943
mistralai--Mistral-7B-Instruct-v0.2 37.6 2.5 0 2.5 NaN 7 938
solar-10.7b-instruct 37.6 2.5 0 2.5 NaN 6.41 938
codegen-2b 37.3 2.5 0 2.5 NaN 5.65 937
gemma-2b 35.4 2.5 0 2.5 NaN 5.42 930
gemma-7b-it 33.1 2.4 0 2.4 NaN 6.01 921