mbpp: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever they differ; ties are ignored. The p-value is the probability, under the null hypothesis, of a difference as extreme as the one observed. For all pairs of models, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
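
As a rough illustration, here is a minimal sketch (not this site's actual code) of how such a p-value can be computed with a two-sided sign test; the function name and the example counts are made up.

```python
# Sketch of the pairwise p-value under the null hypothesis above:
# on every task where A and B differ, each wins with probability 1/2;
# ties (tasks where they agree) are ignored.
from scipy.stats import binomtest

def pair_p_value(a_wins: int, b_wins: int) -> float:
    n = a_wins + b_wins          # number of disagreements
    if n == 0:
        return 1.0               # the models never disagree: no evidence either way
    return binomtest(a_wins, n=n, p=0.5, alternative="two-sided").pvalue

# Example: A wins 30 disagreements, B wins 12 -> p is roughly 0.01
print(pair_p_value(30, 12))
```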

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different from the other at the given significance level. The plot shows a fairly sharp transition, since there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
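
To make the parabola concrete, here is a small sketch of the level-alpha boundary using the normal approximation to the sign test; it assumes the figure plots |#A_win - #B_win| against the total number of disagreements #A_win + #B_win.

```python
# Under the null, #A_win - #B_win has mean 0 and variance #A_win + #B_win,
# so a pair is significant at level alpha roughly when
#   |#A_win - #B_win| >= z_{alpha/2} * sqrt(#A_win + #B_win),
# which traces a parabola in the (difference, total) plane.
import numpy as np
from scipy.stats import norm

def significance_boundary(total_disagreements, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    return z * np.sqrt(np.asarray(total_disagreements, dtype=float))

# Example: with 40 disagreements, a difference of about 12 or more
# is needed for significance at the 5% level.
print(significance_boundary(40))   # ~12.4
```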

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (as reported by the benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate correlates well with Elo. GPT-3.5 is anchored to an Elo of 1000 when available; otherwise the average Elo is set to 1000.
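
For reference, here is a minimal sketch of how the three columns could be computed from per-task pass/fail results. It assumes a hypothetical `results[model][task]` mapping of 0/1 scores, splits ties as half a win for each side, and fits the Bradley-Terry strengths with the standard MM updates; this is an illustration under those assumptions, not the exact pipeline used here.

```python
import itertools
import numpy as np

def pass_at_1(results, model):
    # Raw accuracy: fraction of tasks solved.
    scores = list(results[model].values())
    return sum(scores) / len(scores)

def average_win_rate(results, model):
    # Mean over all other models of the fraction of tasks won, ties counted as 1/2.
    rates = []
    for other in results:
        if other == model:
            continue
        wins = sum((results[model][t] > results[other][t]) +
                   0.5 * (results[model][t] == results[other][t])
                   for t in results[model])
        rates.append(wins / len(results[model]))
    return float(np.mean(rates))

def bradley_terry_elo(results, anchor="gpt-3.5", iters=200):
    # Bradley-Terry strengths via the classic MM algorithm, mapped to an Elo
    # scale (400 * log10(strength)) and shifted so the anchor model sits at
    # 1000, or so the average is 1000 if the anchor is absent.
    # Assumes every model wins at least one comparison.
    models = list(results)
    idx = {m: i for i, m in enumerate(models)}
    wins = np.zeros((len(models), len(models)))
    for a, b in itertools.combinations(models, 2):
        for t in results[a]:
            sa, sb = results[a][t], results[b][t]
            wins[idx[a], idx[b]] += (sa > sb) + 0.5 * (sa == sb)
            wins[idx[b], idx[a]] += (sb > sa) + 0.5 * (sa == sb)
    n_pair = wins + wins.T                   # comparisons per model pair
    theta = np.ones(len(models))
    for _ in range(iters):
        denom = n_pair / (theta[:, None] + theta[None, :])
        theta = wins.sum(axis=1) / denom.sum(axis=1)
        theta /= theta.mean()
    elo = 400 * np.log10(theta)
    shift = 1000 - (elo[idx[anchor]] if anchor in idx else elo.mean())
    return {m: round(elo[idx[m]] + shift, 1) for m in models}
```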

model pass@1 win_rate elo
claude-3-opus-20240229 89.4% 94.1% 1472.8
gpt-4-1106-preview 85.7% 87.8% 1335.5
claude-3-sonnet-20240229 83.6% 86.2% 1305.6
meta-llama-3-70b-instruct 82.3% 85.1% 1289.9
deepseek-coder-33b-instruct 80.4% 79.1% 1215.8
claude-3-haiku-20240307 80.2% 78.2% 1203.1
opencodeinterpreter-ds-33b 80.2% 79.7% 1222.6
white-rabbit-neo-33b-v1 79.4% 79.1% 1214.6
bigcode--starcoder2-15b-instruct-v0.1 78.0% 76.6% 1190.8
xwincoder-34b 77.0% 75.3% 1175.9
opencodeinterpreter-ds-6.7b 76.5% 74.4% 1163.8
code-millenials-34b 76.2% 73.9% 1165.1
wizardcoder-34b 75.1% 70.9% 1136.8
deepseek-coder-6.7b-instruct 74.9% 70.8% 1130.0
HuggingFaceH4--starchat2-15b-v0.1 74.9% 70.2% 1128.7
starcoder2-15b-oci 74.3% 69.6% 1128.5
CohereForAI--c4ai-command-r-plus 74.3% 68.3% 1115.0
mixtral-8x22b-instruct-v0.1 73.8% 67.7% 1104.7
speechless-codellama-34b 73.8% 68.4% 1119.9
speechless-starcoder2-15b 73.5% 68.3% 1114.0
mistral-large-latest 72.8% 64.0% 1079.0
Qwen--Qwen1.5-72B-Chat 72.5% 66.2% 1098.0
deepseek-coder-6.7b-base 72.0% 64.7% 1090.9
dolphin-2.6 70.6% 61.8% 1065.2
codegemma-7b-it 70.4% 60.9% 1067.3
code-llama-34b 69.3% 58.7% 1046.4
databricks--dbrx-instruct 67.2% 54.2% 1004.2
speechless-starcoder2-7b 66.7% 53.3% 1002.0
code-llama-multi-34b 66.7% 53.4% 1011.1
microsoft--Phi-3-mini-4k-instruct 65.9% 51.5% 985.7
codegemma-7b 65.1% 50.1% 985.8
wizardcoder-15b 64.3% 48.5% 972.2
phi-2 64.0% 48.0% 968.6
openchat 63.8% 47.4% 960.3
code-llama-13b 63.5% 46.9% 963.6
code-llama-multi-13b 63.0% 45.8% 959.2
mixtral-8x7b-instruct 59.5% 40.7% 908.0
octocoder 59.3% 38.5% 905.9
wizardcoder-7b 58.5% 37.5% 890.3
speechless-mistral-7b 57.4% 36.3% 880.2
gemma-1.1-7b-it 57.1% 36.2% 884.9
codet5p-16b 56.6% 33.5% 867.3
codegemma-2b 55.6% 32.6% 861.8
stable-code-3b 54.8% 30.8% 842.1
codegen-16b 54.2% 30.8% 843.2
code-llama-multi-7b 53.7% 30.1% 839.0
starcoder2-3b 53.4% 31.1% 841.7
codet5p-6b 52.9% 29.5% 836.6
gemma-7b 52.6% 28.9% 823.8
open-hermes-2.5-code-290k-13b 52.4% 29.3% 825.7
mistral-7b 51.9% 26.2% 804.5
codegen-6b 50.8% 25.7% 799.7
xdan-l1-chat 50.3% 26.1% 793.7
codet5p-2b 48.4% 23.5% 778.3
codegen-2b 46.3% 21.1% 756.5
mistralai--Mistral-7B-Instruct-v0.2 44.7% 19.6% 730.6
solar-10.7b-instruct 43.9% 18.3% 714.1
gemma-2b 41.8% 15.5% 684.6
gemma-7b-it 39.7% 16.8% 694.8