humaneval: by models

std predicted by accuracy

The typical standard deviation between pairs of models on this dataset, as a function of the absolute accuracy.
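
To make this concrete: the per-model standard deviation in the table below is well approximated by the binomial standard error of the accuracy, and the spread between a pair of models combines two such terms. A minimal sketch, assuming the 164 problems of HumanEval (the N = 164 constant and the function name are illustrative assumptions):

```python
import numpy as np

# Number of HumanEval problems (assumed; adjust for other datasets).
N = 164

def accuracy_std(acc_percent):
    """Binomial standard error (in percentage points) of a pass@1 estimate
    at accuracy `acc_percent`, computed from N independent examples."""
    p = acc_percent / 100.0
    return 100.0 * np.sqrt(p * (1.0 - p) / N)

# e.g. accuracy_std(82.9) ~= 2.9, in line with the std(E(A)) column in the table below.
```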

Differences vs inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given level. The plot shows a fairly sharp transition: no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
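
As a rough sketch of where that parabola comes from: under the sign test described in the next section, the smallest significant |#A_win - #B_win| grows roughly like the square root of #A_win + #B_win. The function below computes the exact boundary with a two-sided binomial test (the alpha = 0.05 default and the function name are illustrative assumptions):

```python
from scipy.stats import binom

def min_significant_diff(n_discordant, alpha=0.05):
    """Smallest |#A_win - #B_win| that is significant at level `alpha`,
    given n_discordant = #A_win + #B_win examples where the models differ."""
    for diff in range(n_discordant % 2, n_discordant + 1, 2):
        losses = (n_discordant - diff) // 2                      # wins of the weaker model
        p = min(1.0, 2 * binom.cdf(losses, n_discordant, 0.5))   # two-sided sign-test p-value
        if p <= alpha:
            return diff
    return None  # no split of n_discordant is significant at this level
```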

p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning on every example where they differ; ties are ignored. The p-value is the probability, under this null hypothesis, of observing a difference at least as extreme as the one actually observed. For all pairs of models, the significance level mainly depends on the accuracy difference, as shown here. Hover over each model pair for detailed information.
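
A minimal sketch of this test, assuming the per-pair win counts are available as plain integers (scipy's binomtest is used here; the page's own implementation may differ):

```python
from scipy.stats import binomtest

def pairwise_p_value(a_wins, b_wins):
    """Two-sided p-value under the null that A and B are equally likely
    to win on each example where their results differ (ties dropped)."""
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # the models never differ; no evidence either way
    return binomtest(a_wins, n, p=0.5, alternative='two-sided').pvalue
```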

Results table by model

We show 3 methods currently used for evaluating code models: raw accuracy (as reported by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena); a sketch of the latter two appears after the table. Average win-rate consistently correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000. In the table, std(E(A)) is the standard deviation due to drawing examples from a population (the dominant term), E(std(A)) is the standard deviation due to drawing samples from the model on each example, and std(A) is the total standard deviation, satisfying std(A)^2 = std(E(A))^2 + E(std(A))^2.

model pass@1 std(E(A)) E(std(A)) std(A) N win_rate elo
claude-3-opus-20240229 82.9 2.9 0 2.9 NaN 36.5 1.13e+03
deepseek-coder-33b-instruct 81.7 3 0 3 NaN 36 1.12e+03
opencodeinterpreter-ds-33b 77.4 3.3 0 3.3 NaN 32.7 1.1e+03
speechless-codellama-34b 77.4 3.3 0 3.3 NaN 32.3 1.1e+03
meta-llama-3-70b-instruct 77.4 3.3 0 3.3 NaN 32.4 1.1e+03
claude-3-haiku-20240307 76.8 3.3 0 3.3 NaN 32.8 1.1e+03
gpt-3.5-turbo 76.8 3.3 0 3.3 NaN 32.5 1.1e+03
mixtral-8x22b-instruct-v0.1 76.2 3.3 0 3.3 NaN 32.3 1.1e+03
deepseek-coder-7b-instruct-v1.5 75.6 3.4 0 3.4 NaN 31.2 1.1e+03
xwincoder-34b 75.6 3.4 0 3.4 NaN 31.2 1.1e+03
deepseek-coder-6.7b-instruct 74.4 3.4 0 3.4 NaN 32.1 1.09e+03
code-millenials-34b 74.4 3.4 0 3.4 NaN 31.4 1.09e+03
opencodeinterpreter-ds-6.7b 74.4 3.4 0 3.4 NaN 30.1 1.09e+03
HuggingFaceH4--starchat2-15b-v0.1 73.8 3.4 0 3.4 NaN 30.3 1.09e+03
openchat 72.6 3.5 0 3.5 NaN 29.2 1.09e+03
white-rabbit-neo-33b-v1 72 3.5 0 3.5 NaN 29.2 1.08e+03
code-llama-70b-instruct 72 3.5 0 3.5 NaN 29.1 1.08e+03
codebooga-34b 71.3 3.5 0 3.5 NaN 28.1 1.08e+03
speechless-coder-ds-6.7b 71.3 3.5 0 3.5 NaN 27.6 1.08e+03
claude-3-sonnet-20240229 70.7 3.6 0 3.6 NaN 29 1.08e+03
mistral-large-latest 69.5 3.6 0 3.6 NaN 27.8 1.07e+03
Qwen--Qwen1.5-72B-Chat 68.3 3.6 0 3.6 NaN 26.5 1.07e+03
bigcode--starcoder2-15b-instruct-v0.1 67.7 3.7 0 3.7 NaN 26.1 1.07e+03
speechless-starcoder2-15b 67.1 3.7 0 3.7 NaN 26.4 1.07e+03
deepseek-coder-1.3b-instruct 65.9 3.7 0 3.7 NaN 24.7 1.06e+03
microsoft--Phi-3-mini-4k-instruct 64.6 3.7 0 3.7 NaN 24.8 1.06e+03
codegemma-7b-it 60.4 3.8 0 3.8 NaN 20.9 1.04e+03
wizardcoder-15b 56.7 3.9 0 3.9 NaN 19.2 1.03e+03
code-13b 56.1 3.9 0 3.9 NaN 19.6 1.02e+03
speechless-starcoder2-7b 56.1 3.9 0 3.9 NaN 19.4 1.02e+03
speechless-coding-7b-16k-tora 54.9 3.9 0 3.9 NaN 18.4 1.02e+03
code-33b 54.9 3.9 0 3.9 NaN 19.6 1.02e+03
Qwen1.5-110B 54.3 3.9 0 3.9 NaN 18 1.02e+03
open-hermes-2.5-code-290k-13b 54.3 3.9 0 3.9 NaN 18 1.02e+03
deepseek-coder-33b 51.2 3.9 0 3.9 NaN 17.2 1.01e+03
wizardcoder-7b 50.6 3.9 0 3.9 NaN 16.3 1e+03
phi-2 49.4 3.9 0 3.9 NaN 15.8 1e+03
code-llama-multi-34b 48.2 3.9 0 3.9 NaN 14.6 996
mistral-7b-codealpaca 48.2 3.9 0 3.9 NaN 16.2 996
speechless-mistral-7b 48.2 3.9 0 3.9 NaN 14.3 996
dbrx-base 47 3.9 0 3.9 NaN 14.6 991
starcoder2-15b-oci 47 3.9 0 3.9 NaN 13.9 991
mixtral-8x7b-instruct 45.1 3.9 0 3.9 NaN 15.1 984
codegemma-7b 44.5 3.9 0 3.9 NaN 16.9 982
Qwen1.5-72B 44.5 3.9 0 3.9 NaN 12.8 982
solar-10.7b-instruct 43.3 3.9 0 3.9 NaN 13.5 978
gemma-1.1-7b-it 42.7 3.9 0 3.9 NaN 12.1 976
deepseek-llm-67b-base 42.7 3.9 0 3.9 NaN 12 976
mistralai--Mistral-7B-Instruct-v0.2 42.1 3.9 0 3.9 NaN 12.8 973
Meta-Llama-3-70B 41.5 3.8 0 3.8 NaN 12.7 971
Qwen1.5-14B 40.2 3.8 0 3.8 NaN 10.8 967
Mixtral-8x22B-v0.1 40.2 3.8 0 3.8 NaN 11 967
Qwen1.5-32B 40.2 3.8 0 3.8 NaN 10.7 967
xdan-l1-chat 40.2 3.8 0 3.8 NaN 11.3 967
code-llama-multi-13b 37.8 3.8 0 3.8 NaN 10 958
octocoder 37.2 3.8 0 3.8 NaN 9.08 955
Qwen1.5-7B 36.6 3.8 0 3.8 NaN 8.52 953
Meta-Llama-3-8B 35.4 3.7 0 3.7 NaN 8.51 949
gemma-7b 34.8 3.7 0 3.7 NaN 8.85 946
Mixtral-8x7B-v0.1 33.5 3.7 0 3.7 NaN 6.85 942
python-code-13b 32.9 3.7 0 3.7 NaN 8.44 940
llama2_70B 32.3 3.7 0 3.7 NaN 8.81 937
Mistral-7B-v0.1 27.4 3.5 0 3.5 NaN 4.43 919
Qwen1.5-4B 25.6 3.4 0 3.4 NaN 3.67 912
mpt-30b 25.6 3.4 0 3.4 NaN 5.19 912
llama_65B 25.6 3.4 0 3.4 NaN 4.16 912
deepseek-llm-7b-base 24.4 3.4 0 3.4 NaN 3.44 907
gemma-2b 23.2 3.3 0 3.3 NaN 3.29 902
deepseek-moe-16b-base 23.2 3.3 0 3.3 NaN 3.87 902
Qwen1.5-1.8B 21.3 3.2 0 3.2 NaN 3.6 895
llama_33B 20.7 3.2 0 3.2 NaN 3.41 893
llama2_13B 18.9 3.1 0 3.1 NaN 3.29 886
llama_13B 16.5 2.9 0 2.9 NaN 1.51 876
stablelm-3b-4e1t 15.9 2.9 0 2.9 NaN 1.83 873
stablelm-base-alpha-7b-v2 15.2 2.8 0 2.8 NaN 1.71 871
llama2_07B 14 2.7 0 2.7 NaN 1.23 866
llama_07B 12.8 2.6 0 2.6 NaN 1.23 861
Qwen1.5-0.5B 11.6 2.5 0 2.5 NaN 0.76 856
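
For reference, below is a minimal sketch of how the win_rate and elo columns could be computed from a matrix of pairwise win counts. The matrix layout, the exact win-rate definition, and the MM-style Bradley-Terry fit are assumptions for illustration and may not exactly reproduce the numbers above; the anchoring of gpt-3.5-turbo at 1000 follows the description above the table.

```python
import numpy as np

def average_win_rate(wins):
    """Mean over opponents of the fraction of discordant examples won, in percent.
    Assumed layout: wins[i, j] = #examples model i solves and model j fails (diagonal zero)."""
    n = wins.shape[0]
    rates = []
    for i in range(n):
        per_opponent = [wins[i, j] / (wins[i, j] + wins[j, i])
                        for j in range(n)
                        if j != i and wins[i, j] + wins[j, i] > 0]
        rates.append(100.0 * np.mean(per_opponent))
    return np.array(rates)

def bradley_terry_elo(wins, models, anchor="gpt-3.5-turbo", anchor_elo=1000.0,
                      n_iter=500):
    """Fit Bradley-Terry strengths with the standard MM updates and map them to an
    Elo-like scale, anchoring `anchor` at `anchor_elo` (or the mean if absent)."""
    n = len(models)
    strength = np.ones(n)
    for _ in range(n_iter):
        for i in range(n):
            total_wins = max(wins[i].sum(), 1e-9)
            denom = sum((wins[i, j] + wins[j, i]) / (strength[i] + strength[j])
                        for j in range(n) if j != i)
            strength[i] = total_wins / max(denom, 1e-9)
        strength /= strength.mean()  # Bradley-Terry is scale-invariant; fix the scale
    elo = 400.0 / np.log(10.0) * np.log(strength)  # usual Elo scaling of log-strengths
    if anchor in models:
        elo += anchor_elo - elo[list(models).index(anchor)]
    else:
        elo += anchor_elo - elo.mean()
    return dict(zip(models, elo))
```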