HumanEval: by model


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever they disagree; ties are ignored. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one actually observed. Across all model pairs, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
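
To make the test concrete, here is a minimal sketch of the sign test behind each pairwise p-value. The win counts are hypothetical placeholders for the per-problem comparison of models A and B; they are not taken from the table below.

```python
# Minimal sketch of the pairwise sign test, with hypothetical win counts.
# On the real benchmark, a_wins / b_wins count the problems that exactly one
# of the two models solves; problems both solve or both fail (ties) are ignored.
from scipy.stats import binomtest

a_wins = 14  # hypothetical: problems A solves but B does not
b_wins = 5   # hypothetical: problems B solves but A does not

# Under the null hypothesis each discordant problem is a fair coin flip,
# so the p-value is a two-sided binomial test at p = 1/2.
result = binomtest(a_wins, n=a_wins + b_wins, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.4f}")
```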

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different from the other at the given level. The plot shows a fairly sharp transition, since there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc. A rough illustration of this boundary follows below.
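
The sketch below (illustrative, not the exact code behind the plot) finds, for a given number of discordant problems n = #A_win + #B_win, the smallest gap |#A_win - #B_win| that reaches significance at level alpha. The required gap grows roughly like the square root of n, which is the parabola in the figure.

```python
# Illustrative sketch: smallest significant gap |#A_win - #B_win| for a given
# total number of discordant problems n = #A_win + #B_win.
from scipy.stats import binomtest

def min_significant_gap(n: int, alpha: float = 0.05):
    """Smallest gap whose two-sided sign-test p-value is below alpha, or None."""
    for gap in range(n % 2, n + 1, 2):   # gap must have the same parity as n
        wins = (n + gap) // 2            # the larger of the two win counts
        if binomtest(wins, n=n, p=0.5).pvalue < alpha:
            return gap
    return None

for n in (10, 20, 40, 80, 160):
    print(n, min_significant_gap(n))
```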

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (pass@1, as reported by the benchmark), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000.

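For reference, here is a hedged sketch of how the three columns in the table can be computed from per-problem results. The `passed` dict and the gradient-ascent Bradley-Terry fit are illustrative assumptions, not necessarily the leaderboard's exact procedure.

```python
# Hedged sketch of the three summary statistics shown in the table, assuming a
# hypothetical results dict: passed[model] is a boolean array, one per problem.
import numpy as np

def summarize(passed: dict, anchor: str = "gpt-3.5-turbo"):
    models = list(passed)
    n_problems = len(next(iter(passed.values())))

    # pass@1: fraction of problems solved.
    acc = {m: passed[m].mean() for m in models}

    # Win fraction of a over b, counting ties (both solve / both fail) as half.
    def win_frac(a, b):
        a_w = np.sum(passed[a] & ~passed[b])
        b_w = np.sum(passed[b] & ~passed[a])
        return (a_w + 0.5 * (n_problems - a_w - b_w)) / n_problems

    # Average win-rate over all other models.
    avg_wr = {m: np.mean([win_frac(m, o) for o in models if o != m])
              for m in models}

    # Bradley-Terry log-strengths by gradient ascent on the pairwise likelihood
    # (illustrative; the exact fitting procedure used here may differ).
    W = {(a, b): win_frac(a, b) for a in models for b in models if a != b}
    theta = {m: 0.0 for m in models}
    for _ in range(2000):
        grad = {m: 0.0 for m in models}
        for a in models:
            for b in models:
                if a != b:
                    p = 1.0 / (1.0 + np.exp(theta[b] - theta[a]))
                    grad[a] += W[(a, b)] - p
        for m in models:
            theta[m] += 0.1 * grad[m] / (len(models) - 1)

    # Rescale to Elo (400 / ln 10 per unit of log-strength); anchor gpt-3.5-turbo
    # at 1000 when present, otherwise center the average at 1000.
    scale = 400.0 / np.log(10.0)
    offset = theta[anchor] if anchor in theta else np.mean(list(theta.values()))
    elo = {m: 1000.0 + scale * (theta[m] - offset) for m in models}
    return acc, avg_wr, elo
```
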
model pass@1 win_rate elo
claude-3-opus-20240229 82.9% 93.5% 1501.4
deepseek-coder-33b-instruct 81.7% 91.7% 1454.2
opencodeinterpreter-ds-33b 77.4% 88.3% 1386.3
speechless-codellama-34b 77.4% 89.2% 1404.8
meta-llama-3-70b-instruct 77.4% 89.1% 1400.0
claude-3-haiku-20240307 76.8% 86.8% 1365.2
gpt-3.5-turbo 76.8% 87.3% 1371.6
mixtral-8x22b-instruct-v0.1 76.2% 86.1% 1352.0
deepseek-coder-7b-instruct-v1.5 75.6% 87.0% 1362.7
xwincoder-34b 75.6% 87.0% 1368.4
deepseek-coder-6.7b-instruct 74.4% 82.5% 1292.6
code-millenials-34b 74.4% 83.8% 1317.2
opencodeinterpreter-ds-6.7b 74.4% 86.2% 1347.2
HuggingFaceH4--starchat2-15b-v0.1 73.8% 84.3% 1319.4
openchat 72.6% 83.5% 1315.0
white-rabbit-neo-33b-v1 72.0% 82.0% 1294.5
code-llama-70b-instruct 72.0% 82.3% 1304.3
codebooga-34b 71.3% 82.7% 1292.7
speechless-coder-ds-6.7b 71.3% 83.7% 1311.1
claude-3-sonnet-20240229 70.7% 79.7% 1264.6
mistral-large-latest 69.5% 78.8% 1257.3
Qwen--Qwen1.5-72B-Chat 68.3% 78.2% 1247.4
bigcode--starcoder2-15b-instruct-v0.1 67.7% 77.4% 1240.7
speechless-starcoder2-15b 67.1% 75.7% 1209.8
deepseek-coder-1.3b-instruct 65.9% 75.4% 1217.3
microsoft--Phi-3-mini-4k-instruct 64.6% 72.5% 1203.7
codegemma-7b-it 60.4% 68.1% 1160.0
wizardcoder-15b 56.7% 61.9% 1109.0
code-13b 56.1% 60.4% 1094.7
speechless-starcoder2-7b 56.1% 60.6% 1095.0
speechless-coding-7b-16k-tora 54.9% 58.9% 1069.5
code-33b 54.9% 58.2% 1070.1
Qwen1.5-110B 54.3% 57.9% 1086.6
open-hermes-2.5-code-290k-13b 54.3% 57.9% 1078.2
deepseek-coder-33b 51.2% 52.8% 1040.5
wizardcoder-7b 50.6% 51.9% 1034.3
phi-2 49.4% 50.0% 1031.1
code-llama-multi-34b 48.2% 47.9% 1012.1
mistral-7b-codealpaca 48.2% 48.1% 997.8
speechless-mistral-7b 48.2% 47.9% 995.4
dbrx-base 47.0% 46.0% 994.5
starcoder2-15b-oci 47.0% 45.9% 991.1
mixtral-8x7b-instruct 45.1% 43.7% 956.4
codegemma-7b 44.5% 43.6% 940.1
Qwen1.5-72B 44.5% 41.9% 967.1
solar-10.7b-instruct 43.3% 40.7% 936.3
gemma-1.1-7b-it 42.7% 39.0% 937.7
deepseek-llm-67b-base 42.7% 38.9% 943.9
mistralai--Mistral-7B-Instruct-v0.2 42.1% 38.7% 933.9
Meta-Llama-3-70B 41.5% 37.9% 928.3
Qwen1.5-14B 40.2% 35.0% 908.9
Mixtral-8x22B-v0.1 40.2% 35.2% 916.7
Qwen1.5-32B 40.2% 34.9% 907.0
xdan-l1-chat 40.2% 35.4% 914.5
code-llama-multi-13b 37.8% 31.5% 871.1
octocoder 37.2% 29.7% 872.3
Qwen1.5-7B 36.6% 28.4% 863.5
Meta-Llama-3-8B 35.4% 27.2% 842.2
gemma-7b 34.8% 27.2% 851.6
Mixtral-8x7B-v0.1 33.5% 23.0% 798.8
python-code-13b 32.9% 25.1% 803.6
llama2_70B 32.3% 25.2% 817.1
Mistral-7B-v0.1 27.4% 14.2% 688.9
Qwen1.5-4B 25.6% 11.7% 653.4
mpt-30b 25.6% 15.1% 704.4
llama_65B 25.6% 12.8% 668.2
deepseek-llm-7b-base 24.4% 10.7% 629.4
gemma-2b 23.2% 9.9% 618.4
deepseek-moe-16b-base 23.2% 11.3% 637.8
Qwen1.5-1.8B 21.3% 10.1% 614.6
llama_33B 20.7% 9.5% 603.6
llama2_13B 18.9% 8.8% 589.6
llama_13B 16.5% 4.2% 447.8
stablelm-3b-4e1t 15.9% 4.9% 479.1
stablelm-base-alpha-7b-v2 15.2% 4.5% 451.6
llama2_07B 14.0% 3.2% 387.3
llama_07B 12.8% 3.1% 367.8
Qwen1.5-0.5B 11.6% 1.9% 285.7