HumanEval: by model


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever they disagree; ties are ignored. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one actually observed. Across all model pairs, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
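
To make the test concrete, here is a minimal sketch of the sign test behind each pairwise p-value. The win counts are hypothetical placeholders for the per-problem comparison of models A and B; they are not taken from the table below.

```python
# Minimal sketch of the pairwise sign test, with hypothetical win counts.
# On the real benchmark, a_wins / b_wins count the problems that exactly one
# of the two models solves; problems both solve or both fail (ties) are ignored.
from scipy.stats import binomtest

a_wins = 14  # hypothetical: problems A solves but B does not
b_wins = 5   # hypothetical: problems B solves but A does not

# Under the null hypothesis each discordant problem is a fair coin flip,
# so the p-value is a two-sided binomial test at p = 1/2.
result = binomtest(a_wins, n=a_wins + b_wins, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.4f}")
```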

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different from the other at the given level. The plot shows a fairly sharp transition, since there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc. A rough illustration of this boundary follows below.
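
The sketch below (illustrative, not the exact code behind the plot) finds, for a given number of discordant problems n = #A_win + #B_win, the smallest gap |#A_win - #B_win| that reaches significance at level alpha. The required gap grows roughly like the square root of n, which is the parabola in the figure.

```python
# Illustrative sketch: smallest significant gap |#A_win - #B_win| for a given
# total number of discordant problems n = #A_win + #B_win.
from scipy.stats import binomtest

def min_significant_gap(n: int, alpha: float = 0.05):
    """Smallest gap whose two-sided sign-test p-value is below alpha, or None."""
    for gap in range(n % 2, n + 1, 2):   # gap must have the same parity as n
        wins = (n + gap) // 2            # the larger of the two win counts
        if binomtest(wins, n=n, p=0.5).pvalue < alpha:
            return gap
    return None

for n in (10, 20, 40, 80, 160):
    print(n, min_significant_gap(n))
```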

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (pass@1, as reported by the benchmark), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000.

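For reference, here is a hedged sketch of how the three columns in the table can be computed from per-problem results. The `passed` dict and the gradient-ascent Bradley-Terry fit are illustrative assumptions, not necessarily the leaderboard's exact procedure.

```python
# Hedged sketch of the three summary statistics shown in the table, assuming a
# hypothetical results dict: passed[model] is a boolean array, one per problem.
import numpy as np

def summarize(passed: dict, anchor: str = "gpt-3.5-turbo"):
    models = list(passed)
    n_problems = len(next(iter(passed.values())))

    # pass@1: fraction of problems solved.
    acc = {m: passed[m].mean() for m in models}

    # Win fraction of a over b, counting ties (both solve / both fail) as half.
    def win_frac(a, b):
        a_w = np.sum(passed[a] & ~passed[b])
        b_w = np.sum(passed[b] & ~passed[a])
        return (a_w + 0.5 * (n_problems - a_w - b_w)) / n_problems

    # Average win-rate over all other models.
    avg_wr = {m: np.mean([win_frac(m, o) for o in models if o != m])
              for m in models}

    # Bradley-Terry log-strengths by gradient ascent on the pairwise likelihood
    # (illustrative; the exact fitting procedure used here may differ).
    W = {(a, b): win_frac(a, b) for a in models for b in models if a != b}
    theta = {m: 0.0 for m in models}
    for _ in range(2000):
        grad = {m: 0.0 for m in models}
        for a in models:
            for b in models:
                if a != b:
                    p = 1.0 / (1.0 + np.exp(theta[b] - theta[a]))
                    grad[a] += W[(a, b)] - p
        for m in models:
            theta[m] += 0.1 * grad[m] / (len(models) - 1)

    # Rescale to Elo (400 / ln 10 per unit of log-strength); anchor gpt-3.5-turbo
    # at 1000 when present, otherwise center the average at 1000.
    scale = 400.0 / np.log(10.0)
    offset = theta[anchor] if anchor in theta else np.mean(list(theta.values()))
    elo = {m: 1000.0 + scale * (theta[m] - offset) for m in models}
    return acc, avg_wr, elo
```
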
model pass@1 win_rate elo
claude-3-opus-20240229 82.9% 93.5% 1501.4
deepseek-coder-33b-instruct 81.7% 91.7% 1454.2
opencodeinterpreter-ds-33b 77.4% 88.3% 1386.3
speechless-codellama-34b 77.4% 89.2% 1404.8
meta-llama-3-70b-instruct 77.4% 89.1% 1400.0
claude-3-haiku-20240307 76.8% 86.8% 1365.2
gpt-3.5-turbo 76.8% 87.3% 1371.6
mixtral-8x22b-instruct-v0.1 76.2% 86.1% 1352.0
deepseek-coder-7b-instruct-v1.5 75.6% 87.0% 1362.7
xwincoder-34b 75.6% 87.0% 1368.4
deepseek-coder-6.7b-instruct 74.4% 82.5% 1292.6
code-millenials-34b 74.4% 83.8% 1317.2
opencodeinterpreter-ds-6.7b 74.4% 86.2% 1347.2
HuggingFaceH4--starchat2-15b-v0.1 73.8% 84.3% 1319.4
openchat 72.6% 83.5% 1315.0
white-rabbit-neo-33b-v1 72.0% 82.0% 1294.5
code-llama-70b-instruct 72.0% 82.3% 1304.3
codebooga-34b 71.3% 82.7% 1292.7
speechless-coder-ds-6.7b 71.3% 83.7% 1311.1
claude-3-sonnet-20240229 70.7% 79.7% 1264.6
mistral-large-latest 69.5% 78.8% 1257.3
Qwen--Qwen1.5-72B-Chat 68.3% 78.2% 1247.4
bigcode--starcoder2-15b-instruct-v0.1 67.7% 77.4% 1240.7
speechless-starcoder2-15b 67.1% 75.7% 1209.8
deepseek-coder-1.3b-instruct 65.9% 75.4% 1217.3
microsoft--Phi-3-mini-4k-instruct 64.6% 72.5% 1203.7
codegemma-7b-it 60.4% 68.1% 1160.0
wizardcoder-15b 56.7% 61.9% 1109.0
code-13b 56.1% 60.4% 1094.7
speechless-starcoder2-7b 56.1% 60.6% 1095.0
speechless-coding-7b-16k-tora 54.9% 58.9% 1069.5
code-33b 54.9% 58.2% 1070.1
Qwen1.5-110B 54.3% 57.9% 1086.6
open-hermes-2.5-code-290k-13b 54.3% 57.9% 1078.2
deepseek-coder-33b 51.2% 52.8% 1040.5
wizardcoder-7b 50.6% 51.9% 1034.3
phi-2 49.4% 50.0% 1031.1
code-llama-multi-34b 48.2% 47.9% 1012.1
mistral-7b-codealpaca 48.2% 48.1% 997.8
speechless-mistral-7b 48.2% 47.9% 995.4
dbrx-base 47.0% 46.0% 994.5
starcoder2-15b-oci 47.0% 45.9% 991.1
mixtral-8x7b-instruct 45.1% 43.7% 956.4
codegemma-7b 44.5% 43.6% 940.1
Qwen1.5-72B 44.5% 41.9% 967.1
solar-10.7b-instruct 43.3% 40.7% 936.3
gemma-1.1-7b-it 42.7% 39.0% 937.7
deepseek-llm-67b-base 42.7% 38.9% 943.9
mistralai--Mistral-7B-Instruct-v0.2 42.1% 38.7% 933.9
Meta-Llama-3-70B 41.5% 37.9% 928.3
Qwen1.5-14B 40.2% 35.0% 908.9
Mixtral-8x22B-v0.1 40.2% 35.2% 916.7
Qwen1.5-32B 40.2% 34.9% 907.0
xdan-l1-chat 40.2% 35.4% 914.5
code-llama-multi-13b 37.8% 31.5% 871.1
octocoder 37.2% 29.7% 872.3
Qwen1.5-7B 36.6% 28.4% 863.5
Meta-Llama-3-8B 35.4% 27.2% 842.2
gemma-7b 34.8% 27.2% 851.6
Mixtral-8x7B-v0.1 33.5% 23.0% 798.8
python-code-13b 32.9% 25.1% 803.6
llama2_70B 32.3% 25.2% 817.1
Mistral-7B-v0.1 27.4% 14.2% 688.9
Qwen1.5-4B 25.6% 11.7% 653.4
mpt-30b 25.6% 15.1% 704.4
llama_65B 25.6% 12.8% 668.2
deepseek-llm-7b-base 24.4% 10.7% 629.4
gemma-2b 23.2% 9.9% 618.4
deepseek-moe-16b-base 23.2% 11.3% 637.8
Qwen1.5-1.8B 21.3% 10.1% 614.6
llama_33B 20.7% 9.5% 603.6
llama2_13B 18.9% 8.8% 589.6
llama_13B 16.5% 4.2% 447.8
stablelm-3b-4e1t 15.9% 4.9% 479.1
stablelm-base-alpha-7b-v2 15.2% 4.5% 451.6
llama2_07B 14.0% 3.2% 387.3
llama_07B 12.8% 3.1% 367.8
Qwen1.5-0.5B 11.6% 1.9% 285.7