HumanEval+: by models

std predicted by accuracy

The typical stddev between pairs of models on this dataset as a function of the absolute accuracy.
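As a sanity check, the per-model std values in the results table are consistent with a simple binomial estimate. Below is a minimal sketch of that prediction; it assumes N = 164 problems (the HumanEval+ problem count) and ignores the correlation between paired models, so it is an illustration rather than the code behind this figure.

```python
import math

def predicted_std(accuracy_pct, n_examples=164):
    """Binomial standard deviation of the mean accuracy, in percentage points.
    n_examples=164 is an assumed HumanEval+ problem count, used only for illustration."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_examples)

# e.g. predicted_std(77.4) ~ 3.3 and predicted_std(50.6) ~ 3.9, matching the std column in the table below
```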

Differences vs inconsistencies

Here is a more informative view of the underlying counts used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given significance level. The plot shows a fairly sharp transition, since there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
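The boundary can be traced with a sign test: for a given number of disagreements #A_win + #B_win, find the smallest gap |#A_win - #B_win| that reaches significance. The sketch below is illustrative only and assumes the p-value is an exact two-sided binomial test, as described in the next section.

```python
from scipy.stats import binomtest

def min_significant_gap(n_disagreements, alpha=0.05):
    """Smallest |#A_win - #B_win| that is significant under a two-sided sign test,
    where n_disagreements = #A_win + #B_win. Returns None if no gap is significant."""
    for gap in range(n_disagreements % 2, n_disagreements + 1, 2):
        a_wins = (n_disagreements + gap) // 2
        if binomtest(a_wins, n_disagreements, 0.5).pvalue < alpha:
            return gap
    return None

# e.g. min_significant_gap(30) == 12: with 30 disagreements, at least 21 vs 9 wins is needed at the 5% level
```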

p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever they disagree; ties are ignored. The p-value is the probability, under the null hypothesis, of a difference at least as extreme as the one observed. Across all model pairs, the significance level mainly depends on the accuracy difference, as shown here. Hover over each model pair for detailed information.
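This is an exact sign test. A minimal sketch of the computation, assuming scipy's exact two-sided binomial test matches the procedure used here:

```python
from scipy.stats import binomtest

def pair_pvalue(a_wins, b_wins):
    """Two-sided sign-test p-value: under the null hypothesis, A beats B with
    probability 1/2 on every example where the two models disagree; ties are dropped."""
    return binomtest(a_wins, a_wins + b_wins, 0.5).pvalue

# e.g. pair_pvalue(21, 9) ~ 0.043, significant at the 5% level
```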

Results table by model

We show 3 methods currently used for evaluating code models: raw accuracy as reported by benchmarks, average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate always correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when it is available; otherwise the average Elo is set to 1000. In the table below, pass1 and win_rate are percentages. std (the std(E(A)) column): the standard deviation due to drawing examples from a population; this is the dominant term. std_i (E(std(A))): the standard deviation due to drawing samples from the model on each example. std_total (std(A)): the total standard deviation, satisfying std_total^2 = std^2 + std_i^2.
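For illustration, here is a rough sketch of how average win-rate and Bradley-Terry/Elo scores can be computed from per-example pass/fail results. The function name, the tie handling, and the fitting procedure are assumptions made for this sketch, not the exact code behind the table.

```python
import numpy as np

def rank_models(results: dict, anchor: str = "gpt-3.5-turbo"):
    """Illustrative sketch: `results` maps model name -> boolean pass/fail vector
    over the same examples. Returns ({model: win_rate %}, {model: Elo}).
    Tie handling and fitting details may differ from the code behind this page."""
    names = list(results)
    n = len(names)
    n_examples = len(next(iter(results.values())))

    # wins[i, j] = number of examples where model i passes and model j fails
    wins = np.zeros((n, n))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            if i != j:
                wins[i, j] = np.sum(results[a] & ~results[b])
    total = wins + wins.T  # comparisons where exactly one of the two models passes

    # average win-rate: wins averaged over all other models and all examples (assumed definition)
    win_rate = {a: 100 * wins[i].sum() / (n_examples * (n - 1)) for i, a in enumerate(names)}

    # Bradley-Terry strengths via the standard MM iteration, then convert to an Elo scale
    pi = np.ones(n)
    for _ in range(200):
        denom = (total / (pi[:, None] + pi[None, :])).sum(axis=1)
        pi = np.maximum(wins.sum(axis=1), 1e-9) / np.maximum(denom, 1e-9)
        pi /= np.exp(np.mean(np.log(pi)))  # fix the arbitrary overall scale each iteration
    elo = 400 * np.log10(pi)
    # anchor GPT-3.5 at 1000 if present, otherwise center the average at 1000
    elo += 1000 - (elo[names.index(anchor)] if anchor in names else elo.mean())
    return win_rate, dict(zip(names, elo))
```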

model pass1 std(E(A)) E(std(A)) std(A) N win_rate elo
claude-3-opus-20240229 77.4 3.3 0 3.3 NaN 25.3 1.07e+03
deepseek-coder-33b-instruct 76.2 3.3 0 3.3 NaN 24.1 1.07e+03
opencodeinterpreter-ds-33b 74.4 3.4 0 3.4 NaN 23.8 1.06e+03
mixtral-8x22b-instruct-v0.1 73.8 3.4 0 3.4 NaN 23.4 1.06e+03
speechless-codellama-34b 72.6 3.5 0 3.5 NaN 22 1.06e+03
HuggingFaceH4--starchat2-15b-v0.1 72 3.5 0 3.5 NaN 21.9 1.05e+03
code-millenials-34b 72 3.5 0 3.5 NaN 22.3 1.05e+03
deepseek-coder-6.7b-instruct 72 3.5 0 3.5 NaN 23.2 1.05e+03
meta-llama-3-70b-instruct 72 3.5 0 3.5 NaN 21.9 1.05e+03
deepseek-coder-7b-instruct-v1.5 71.3 3.5 0 3.5 NaN 21.6 1.05e+03
gpt-3.5-turbo 70.7 3.6 0 3.6 NaN 20.8 1.05e+03
opencodeinterpreter-ds-6.7b 70.7 3.6 0 3.6 NaN 21.4 1.05e+03
xwincoder-34b 70.1 3.6 0 3.6 NaN 21.2 1.05e+03
claude-3-haiku-20240307 68.9 3.6 0 3.6 NaN 20.5 1.04e+03
openchat 68.9 3.6 0 3.6 NaN 20.5 1.04e+03
speechless-coder-ds-6.7b 66.5 3.7 0 3.7 NaN 17.9 1.03e+03
code-llama-70b-instruct 66.5 3.7 0 3.7 NaN 19.6 1.03e+03
white-rabbit-neo-33b-v1 65.9 3.7 0 3.7 NaN 19 1.03e+03
codebooga-34b 65.9 3.7 0 3.7 NaN 17.7 1.03e+03
claude-3-sonnet-20240229 64.6 3.7 0 3.7 NaN 18.8 1.03e+03
mistral-large-latest 63.4 3.8 0 3.8 NaN 18.1 1.02e+03
speechless-starcoder2-15b 63.4 3.8 0 3.8 NaN 16.8 1.02e+03
deepseek-coder-1.3b-instruct 61.6 3.8 0 3.8 NaN 16.2 1.02e+03
bigcode--starcoder2-15b-instruct-v0.1 61 3.8 0 3.8 NaN 15.8 1.02e+03
Qwen--Qwen1.5-72B-Chat 59.8 3.8 0 3.8 NaN 15.8 1.01e+03
microsoft--Phi-3-mini-4k-instruct 59.8 3.8 0 3.8 NaN 16 1.01e+03
code-13b 53.7 3.9 0 3.9 NaN 13.1 989
codegemma-7b-it 53 3.9 0 3.9 NaN 11.7 987
speechless-coding-7b-16k-tora 52.4 3.9 0 3.9 NaN 12.2 985
speechless-starcoder2-7b 51.8 3.9 0 3.9 NaN 11.8 983
wizardcoder-15b 50.6 3.9 0 3.9 NaN 11 978
open-hermes-2.5-code-290k-13b 50.6 3.9 0 3.9 NaN 10.8 978
code-33b 50 3.9 0 3.9 NaN 11.8 976
phi-2 45.7 3.9 0 3.9 NaN 10.6 961
wizardcoder-7b 45.7 3.9 0 3.9 NaN 9.72 961
code-llama-multi-34b 43.9 3.9 0 3.9 NaN 8.78 954
deepseek-coder-33b 43.9 3.9 0 3.9 NaN 10.6 954
mistral-7b-codealpaca 43.3 3.9 0 3.9 NaN 9.46 952
starcoder2-15b-oci 43.3 3.9 0 3.9 NaN 8.89 952
speechless-mistral-7b 42.7 3.9 0 3.9 NaN 7.81 950
codegemma-7b 42.1 3.9 0 3.9 NaN 11.4 948
mixtral-8x7b-instruct 40.9 3.8 0 3.8 NaN 9.08 943
solar-10.7b-instruct 37.8 3.8 0 3.8 NaN 7.08 932
mistralai--Mistral-7B-Instruct-v0.2 36.6 3.8 0 3.8 NaN 6.99 928
gemma-1.1-7b-it 36 3.7 0 3.7 NaN 5.83 926
code-llama-multi-13b 34.8 3.7 0 3.7 NaN 6.12 921
octocoder 33.5 3.7 0 3.7 NaN 6.44 916
xdan-l1-chat 32.9 3.7 0 3.7 NaN 5.92 914
python-code-13b 31.7 3.6 0 3.6 NaN 5.74 910