humaneval+: by models

p-values for model pairs

The null hypothesis is that models A and B are each equally likely (probability 1/2) to win on any task where their results differ; ties are ignored. The p-value is the probability, under this null hypothesis, of a difference at least as extreme as the one observed. Across all model pairs, it mainly depends on the difference in accuracy. Hover over each model pair for detailed information.
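
Concretely, this amounts to a sign test over the tasks where the two models disagree. Below is a minimal sketch, assuming per-pair win counts are available and that scipy may be used; the leaderboard's own implementation may differ.

```python
# Two-sided sign test: under the null, each non-tied task is a fair coin flip,
# so the p-value is a two-sided binomial test on A's wins among the non-ties.
from scipy.stats import binomtest

def pairwise_p_value(a_wins: int, b_wins: int) -> float:
    """p-value for 'A and B are equally likely to win'; tied tasks are ignored."""
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence either way
    return binomtest(a_wins, n=n, p=0.5, alternative="two-sided").pvalue

# Example: A wins 30 tasks, B wins 14, and they tie on the rest.
print(pairwise_p_value(30, 14))  # ~0.02
```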

p-values vs. differences

The range of possible p-values as a function of the difference in accuracy, over all model pairs.

Differences vs. inconsistencies

This figure gives a more informative view of the source data used to compute the p-values. Any model pair to the right of the parabola is statistically significantly different at the given level. The plot shows a fairly sharp transition because there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
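
The parabola can be read off from a normal approximation to the sign test above: with n = #A_win + #B_win and d = |#A_win - #B_win|, a pair is significant at level α roughly when d ≥ z_{α/2}·√n, i.e. the boundary n = (d / z_{α/2})² is a parabola. A small sketch of this approximate rule (the figure's exact axes and construction are assumptions here):

```python
# Normal approximation to the two-sided sign test: significant roughly when
# |#A_win - #B_win| >= z_{alpha/2} * sqrt(#A_win + #B_win).
from scipy.stats import norm

def roughly_significant(a_wins: int, b_wins: int, alpha: float = 0.05) -> bool:
    n = a_wins + b_wins
    d = abs(a_wins - b_wins)
    z = norm.ppf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    return n > 0 and d >= z * n ** 0.5

print(roughly_significant(30, 14))  # True: 16 >= 1.96 * sqrt(44) ~= 13.0
```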

Results table by model

We show three metrics currently used for evaluating code models: raw accuracy (pass@1) as reported by benchmarks, average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate correlates closely with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000. A sketch of how these three columns can be computed follows the table.

model pass@1 win_rate elo
claude-3-opus-20240229 77.4% 86.0% 1309.1
deepseek-coder-33b-instruct 76.2% 85.1% 1297.0
opencodeinterpreter-ds-33b 74.4% 80.5% 1246.2
mixtral-8x22b-instruct-v0.1 73.8% 79.8% 1235.1
speechless-codellama-34b 72.6% 79.1% 1225.3
HuggingFaceH4--starchat2-15b-v0.1 72.0% 77.5% 1208.4
code-millenials-34b 72.0% 76.9% 1200.5
deepseek-coder-6.7b-instruct 72.0% 75.2% 1185.0
meta-llama-3-70b-instruct 72.0% 77.5% 1207.7
deepseek-coder-7b-instruct-v1.5 71.3% 76.4% 1198.0
gpt-3.5-turbo 70.7% 76.2% 1195.4
opencodeinterpreter-ds-6.7b 70.7% 75.2% 1184.7
xwincoder-34b 70.1% 73.8% 1170.4
claude-3-haiku-20240307 68.9% 71.8% 1155.8
openchat 68.9% 71.8% 1153.9
speechless-coder-ds-6.7b 66.5% 69.3% 1127.1
code-llama-70b-instruct 66.5% 67.1% 1116.1
white-rabbit-neo-33b-v1 65.9% 66.3% 1107.3
codebooga-34b 65.9% 68.0% 1113.4
claude-3-sonnet-20240229 64.6% 63.8% 1083.7
mistral-large-latest 63.4% 61.7% 1069.8
speechless-starcoder2-15b 63.4% 62.8% 1076.8
deepseek-coder-1.3b-instruct 61.6% 59.1% 1052.9
bigcode--starcoder2-15b-instruct-v0.1 61.0% 58.0% 1041.8
Qwen--Qwen1.5-72B-Chat 59.8% 55.5% 1022.6
microsoft--Phi-3-mini-4k-instruct 59.8% 55.4% 1034.4
code-13b 53.7% 44.7% 953.3
codegemma-7b-it 53.0% 43.1% 938.8
speechless-coding-7b-16k-tora 52.4% 42.4% 933.7
speechless-starcoder2-7b 51.8% 41.3% 925.3
wizardcoder-15b 50.6% 39.0% 913.8
open-hermes-2.5-code-290k-13b 50.6% 38.9% 905.4
code-33b 50.0% 38.8% 903.9
phi-2 45.7% 32.7% 861.6
wizardcoder-7b 45.7% 31.7% 856.5
code-llama-multi-34b 43.9% 28.7% 830.2
deepseek-coder-33b 43.9% 30.9% 838.4
mistral-7b-codealpaca 43.3% 29.0% 831.0
starcoder2-15b-oci 43.3% 28.3% 819.6
speechless-mistral-7b 42.7% 26.1% 804.4
codegemma-7b 42.1% 30.2% 825.2
mixtral-8x7b-instruct 40.9% 26.5% 799.5
solar-10.7b-instruct 37.8% 21.2% 755.7
mistralai--Mistral-7B-Instruct-v0.2 36.6% 20.3% 749.2
gemma-1.1-7b-it 36.0% 17.8% 716.5
code-llama-multi-13b 34.8% 17.7% 713.1
octocoder 33.5% 17.6% 718.6
xdan-l1-chat 32.9% 16.4% 701.3
python-code-13b 31.7% 15.5% 686.4
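
Here is a minimal sketch of how the three columns above can be computed from per-task pass/fail results. The exact leaderboard code may differ; the tiny task set, "model-a", and "model-b" are made up for illustration, and the Bradley-Terry fit assumes every model wins at least one pairwise comparison.

```python
# Each model's results are assumed to be a 0/1 vector over the same task set (1 = pass).
import math
import numpy as np

def pass1(results):
    """Raw accuracy: fraction of tasks passed."""
    return {m: float(v.mean()) for m, v in results.items()}

def average_win_rate(results):
    """Mean, over all opponents, of the fraction of disagreeing tasks a model wins."""
    names, out = list(results), {}
    for a in names:
        rates = []
        for b in names:
            if a == b:
                continue
            a_win = int(((results[a] == 1) & (results[b] == 0)).sum())
            b_win = int(((results[b] == 1) & (results[a] == 0)).sum())
            rates.append(a_win / (a_win + b_win) if a_win + b_win else 0.5)
        out[a] = float(np.mean(rates))
    return out

def bradley_terry_elo(results, anchor="gpt-3.5-turbo", iters=500):
    """Bradley-Terry strengths via MM iterations, rescaled so the anchor sits at Elo 1000."""
    names = list(results)
    wins = {(a, b): int(((results[a] == 1) & (results[b] == 0)).sum())
            for a in names for b in names if a != b}
    p = {m: 1.0 for m in names}
    for _ in range(iters):
        new = {}
        for a in names:
            num = sum(wins[a, b] for b in names if b != a)
            den = sum((wins[a, b] + wins[b, a]) / (p[a] + p[b]) for b in names if b != a)
            new[a] = num / den
        s = sum(new.values())  # renormalize to keep the scale bounded
        p = {m: v / s for m, v in new.items()}
    elo = {m: 400 * math.log10(p[m]) for m in names}
    # Anchor GPT-3.5 at 1000 if present; otherwise set the average Elo to 1000.
    shift = 1000 - (elo[anchor] if anchor in elo else float(np.mean(list(elo.values()))))
    return {m: e + shift for m, e in elo.items()}

# Tiny made-up example: three models on six tasks.
results = {"gpt-3.5-turbo": np.array([1, 1, 1, 0, 0, 1]),
           "model-a":       np.array([1, 0, 1, 1, 0, 0]),
           "model-b":       np.array([0, 1, 1, 1, 1, 0])}
print(pass1(results), average_win_rate(results), bradley_terry_elo(results))
```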