MBPP+: by model


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever their results differ; ties are ignored. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one actually observed. For all pairs of models, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
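Concretely, the p-value for one model pair can be reproduced with a two-sided binomial (sign) test over the non-tied tasks. A minimal sketch in Python; the win counts here are made up for illustration:

```python
from scipy.stats import binomtest

# Hypothetical counts over tasks where models A and B disagree
# (tasks that both solve or both fail are ignored as ties).
a_wins, b_wins = 30, 14
n = a_wins + b_wins

# Under the null hypothesis, each non-tied task is a fair coin flip
# (p = 1/2); the two-sided p-value asks how extreme this split is.
result = binomtest(a_wins, n=n, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.4f}")
```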

p-values vs. differences

The range of possible p-values plotted against the difference in accuracy, over all model pairs.

Differences vs. inconsistencies

Here is a more informative view of the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given level. The plot shows a fairly sharp transition: no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
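To see where the parabola comes from, we can trace, for each total #A_win + #B_win, the smallest gap |#A_win - #B_win| that reaches significance. A small sketch, assuming the same two-sided binomial test at the 5% level:

```python
from scipy.stats import binomtest

# For each total number of non-tied tasks n = #A_win + #B_win, find
# the smallest gap d = |#A_win - #B_win| with a p-value below 0.05.
for n in [10, 20, 50, 100, 200, 400]:
    for d in range(n % 2, n + 1, 2):  # d must share the parity of n
        wins = (n + d) // 2
        if binomtest(wins, n=n, p=0.5).pvalue < 0.05:
            print(f"n = {n:3d}: need |#A_win - #B_win| >= {d}")
            break
```

The required gap grows roughly like the square root of the total, which is exactly the parabolic boundary in the figure.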

Results table by model

We show three methods currently used to evaluate code models: raw accuracy as reported by benchmarks (pass@1), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena; see the sketch after the table). Average win rate consistently correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000.

model pass@1 win_rate Elo
gpt-4-1106-preview 74.1% 83.8% 1281.9
claude-3-opus-20240229 73.5% 84.3% 1283.1
deepseek-coder-33b-instruct 70.4% 78.9% 1221.5
claude-3-sonnet-20240229 69.8% 79.7% 1225.8
meta-llama-3-70b-instruct 69.6% 80.9% 1239.8
claude-3-haiku-20240307 69.3% 78.2% 1210.2
opencodeinterpreter-ds-33b 68.8% 78.5% 1216.4
white-rabbit-neo-33b-v1 67.5% 75.7% 1185.7
opencodeinterpreter-ds-6.7b 66.9% 76.0% 1192.5
xwincoder-34b 66.1% 74.9% 1175.6
deepseek-coder-6.7b-instruct 66.1% 73.6% 1165.2
bigcode--starcoder2-15b-instruct-v0.1 65.1% 70.4% 1138.9
HuggingFaceH4--starchat2-15b-v0.1 64.8% 70.7% 1139.6
code-millenials-34b 64.6% 70.3% 1140.4
mixtral-8x22b-instruct-v0.1 64.6% 69.9% 1131.7
wizardcoder-34b 63.8% 69.2% 1130.7
CohereForAI--c4ai-command-r-plus 63.8% 67.3% 1115.4
starcoder2-15b-oci 63.8% 68.8% 1128.5
speechless-starcoder2-15b 63.0% 67.9% 1116.3
Qwen--Qwen1.5-72B-Chat 62.4% 66.0% 1105.0
speechless-codellama-34b 61.4% 64.1% 1090.9
dolphin-2.6 60.1% 61.1% 1067.5
mistral-large-latest 59.8% 58.9% 1048.5
deepseek-coder-6.7b-base 59.5% 59.9% 1059.7
codegemma-7b-it 57.4% 55.1% 1027.3
speechless-starcoder2-7b 57.1% 54.5% 1013.5
code-llama-34b 56.9% 53.8% 1017.6
databricks--dbrx-instruct 56.3% 52.4% 1001.5
openchat 56.1% 52.2% 1001.3
phi-2 55.3% 50.4% 994.5
code-llama-multi-34b 55.0% 49.9% 998.0
wizardcoder-15b 54.8% 49.3% 989.0
microsoft--Phi-3-mini-4k-instruct 54.5% 48.9% 975.9
code-llama-multi-13b 54.5% 48.7% 990.1
code-llama-13b 53.2% 46.0% 966.1
codegemma-7b 52.4% 44.6% 954.0
octocoder 51.3% 42.0% 938.9
mixtral-8x7b-instruct 50.3% 41.5% 924.7
wizardcoder-7b 50.0% 39.3% 913.0
speechless-mistral-7b 49.2% 38.7% 907.6
codet5p-16b 48.1% 35.5% 892.7
codegemma-2b 47.9% 36.1% 896.0
stable-code-3b 46.8% 33.4% 873.4
open-hermes-2.5-code-290k-13b 46.8% 35.3% 883.1
gemma-1.1-7b-it 46.6% 33.6% 872.8
codegen-16b 46.3% 32.8% 868.6
gemma-7b 45.0% 31.5% 857.1
starcoder2-3b 44.4% 30.2% 843.5
code-llama-multi-7b 44.2% 28.9% 836.4
codegen-6b 43.7% 28.5% 834.3
mistral-7b 42.9% 26.5% 817.6
codet5p-6b 42.6% 27.6% 830.7
xdan-l1-chat 41.8% 26.5% 807.1
codet5p-2b 38.9% 21.4% 768.7
mistralai--Mistral-7B-Instruct-v0.2 37.6% 22.0% 763.1
solar-10.7b-instruct 37.6% 20.9% 753.5
codegen-2b 37.3% 19.2% 743.3
gemma-2b 35.4% 17.6% 723.0
gemma-7b-it 33.1% 17.5% 711.4
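As referenced above, the Elo column follows the Chatbot Arena recipe: fit Bradley-Terry coefficients by logistic regression over pairwise outcomes, then rescale to the familiar Elo range. A minimal sketch, assuming a list of (winner, loser) outcomes; the function name, input format, and anchor handling are illustrative, with scikit-learn standing in for whatever solver is actually used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bradley_terry_elo(models, outcomes, anchor="gpt-3.5", anchor_elo=1000.0):
    """Fit Bradley-Terry coefficients on an Elo-like scale.

    `outcomes` is a list of (winner, loser) model names, one entry per
    non-tied task of each model pair.
    """
    idx = {m: i for i, m in enumerate(models)}
    X = np.zeros((len(outcomes), len(models)))
    for row, (winner, loser) in enumerate(outcomes):
        X[row, idx[winner]] = 1.0
        X[row, idx[loser]] = -1.0
    # Mirror every outcome so the labels contain both classes.
    X = np.vstack([X, -X])
    y = np.concatenate([np.ones(len(outcomes)), np.zeros(len(outcomes))])
    clf = LogisticRegression(fit_intercept=False, C=1e6).fit(X, y)
    # Rescale logits to the conventional Elo scale: 400 per log10 unit.
    scores = clf.coef_[0] * 400.0 / np.log(10.0)
    if anchor in idx:
        scores += anchor_elo - scores[idx[anchor]]  # pin GPT-3.5 at 1000
    else:
        scores += anchor_elo - scores.mean()        # else average is 1000
    return dict(zip(models, scores))
```

Mirroring each outcome keeps the design matrix symmetric and gives the solver both label classes; the large C effectively turns off regularization, so the fit stays a plain maximum-likelihood Bradley-Terry model.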