humaneval+: by models

p-values for model pairs

The null hypothesis is that models A and B are each equally likely (probability 1/2) to win on any task where their results differ; ties are ignored. The p-value is the probability, under this null hypothesis, of a difference at least as extreme as the one observed. Across all model pairs, it mainly depends on the difference in accuracy. Hover over each model pair for detailed information.
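
Concretely, this amounts to a sign test over the tasks where the two models disagree. Below is a minimal sketch, assuming per-pair win counts are available and that scipy may be used; the leaderboard's own implementation may differ.

```python
# Two-sided sign test: under the null, each non-tied task is a fair coin flip,
# so the p-value is a two-sided binomial test on A's wins among the non-ties.
from scipy.stats import binomtest

def pairwise_p_value(a_wins: int, b_wins: int) -> float:
    """p-value for 'A and B are equally likely to win'; tied tasks are ignored."""
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence either way
    return binomtest(a_wins, n=n, p=0.5, alternative="two-sided").pvalue

# Example: A wins 30 tasks, B wins 14, and they tie on the rest.
print(pairwise_p_value(30, 14))  # ~0.02
```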

p-values vs. differences

The range of possible p-values as a function of the difference in accuracy, over all model pairs.

Differences vs. inconsistencies

This figure gives a more informative view of the source data used to compute the p-values. Any model pair to the right of the parabola is statistically significantly different at the given level. The plot shows a fairly sharp transition because there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
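
The parabola can be read off from a normal approximation to the sign test above: with n = #A_win + #B_win and d = |#A_win - #B_win|, a pair is significant at level α roughly when d ≥ z_{α/2}·√n, i.e. the boundary n = (d / z_{α/2})² is a parabola. A small sketch of this approximate rule (the figure's exact axes and construction are assumptions here):

```python
# Normal approximation to the two-sided sign test: significant roughly when
# |#A_win - #B_win| >= z_{alpha/2} * sqrt(#A_win + #B_win).
from scipy.stats import norm

def roughly_significant(a_wins: int, b_wins: int, alpha: float = 0.05) -> bool:
    n = a_wins + b_wins
    d = abs(a_wins - b_wins)
    z = norm.ppf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    return n > 0 and d >= z * n ** 0.5

print(roughly_significant(30, 14))  # True: 16 >= 1.96 * sqrt(44) ~= 13.0
```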

Results table by model

We show three metrics currently used for evaluating code models: raw accuracy (pass@1) as reported by benchmarks, average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate correlates closely with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000. A sketch of how these three columns can be computed follows the table.

model pass@1 win_rate elo
claude-3-opus-20240229 77.4% 86.0% 1309.1
deepseek-coder-33b-instruct 76.2% 85.1% 1297.0
opencodeinterpreter-ds-33b 74.4% 80.5% 1246.2
mixtral-8x22b-instruct-v0.1 73.8% 79.8% 1235.1
speechless-codellama-34b 72.6% 79.1% 1225.3
HuggingFaceH4--starchat2-15b-v0.1 72.0% 77.5% 1208.4
code-millenials-34b 72.0% 76.9% 1200.5
deepseek-coder-6.7b-instruct 72.0% 75.2% 1185.0
meta-llama-3-70b-instruct 72.0% 77.5% 1207.7
deepseek-coder-7b-instruct-v1.5 71.3% 76.4% 1198.0
gpt-3.5-turbo 70.7% 76.2% 1195.4
opencodeinterpreter-ds-6.7b 70.7% 75.2% 1184.7
xwincoder-34b 70.1% 73.8% 1170.4
claude-3-haiku-20240307 68.9% 71.8% 1155.8
openchat 68.9% 71.8% 1153.9
speechless-coder-ds-6.7b 66.5% 69.3% 1127.1
code-llama-70b-instruct 66.5% 67.1% 1116.1
white-rabbit-neo-33b-v1 65.9% 66.3% 1107.3
codebooga-34b 65.9% 68.0% 1113.4
claude-3-sonnet-20240229 64.6% 63.8% 1083.7
mistral-large-latest 63.4% 61.7% 1069.8
speechless-starcoder2-15b 63.4% 62.8% 1076.8
deepseek-coder-1.3b-instruct 61.6% 59.1% 1052.9
bigcode--starcoder2-15b-instruct-v0.1 61.0% 58.0% 1041.8
Qwen--Qwen1.5-72B-Chat 59.8% 55.5% 1022.6
microsoft--Phi-3-mini-4k-instruct 59.8% 55.4% 1034.4
code-13b 53.7% 44.7% 953.3
codegemma-7b-it 53.0% 43.1% 938.8
speechless-coding-7b-16k-tora 52.4% 42.4% 933.7
speechless-starcoder2-7b 51.8% 41.3% 925.3
wizardcoder-15b 50.6% 39.0% 913.8
open-hermes-2.5-code-290k-13b 50.6% 38.9% 905.4
code-33b 50.0% 38.8% 903.9
phi-2 45.7% 32.7% 861.6
wizardcoder-7b 45.7% 31.7% 856.5
code-llama-multi-34b 43.9% 28.7% 830.2
deepseek-coder-33b 43.9% 30.9% 838.4
mistral-7b-codealpaca 43.3% 29.0% 831.0
starcoder2-15b-oci 43.3% 28.3% 819.6
speechless-mistral-7b 42.7% 26.1% 804.4
codegemma-7b 42.1% 30.2% 825.2
mixtral-8x7b-instruct 40.9% 26.5% 799.5
solar-10.7b-instruct 37.8% 21.2% 755.7
mistralai--Mistral-7B-Instruct-v0.2 36.6% 20.3% 749.2
gemma-1.1-7b-it 36.0% 17.8% 716.5
code-llama-multi-13b 34.8% 17.7% 713.1
octocoder 33.5% 17.6% 718.6
xdan-l1-chat 32.9% 16.4% 701.3
python-code-13b 31.7% 15.5% 686.4
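
Here is a minimal sketch of how the three columns above can be computed from per-task pass/fail results. The exact leaderboard code may differ; the tiny task set, "model-a", and "model-b" are made up for illustration, and the Bradley-Terry fit assumes every model wins at least one pairwise comparison.

```python
# Each model's results are assumed to be a 0/1 vector over the same task set (1 = pass).
import math
import numpy as np

def pass1(results):
    """Raw accuracy: fraction of tasks passed."""
    return {m: float(v.mean()) for m, v in results.items()}

def average_win_rate(results):
    """Mean, over all opponents, of the fraction of disagreeing tasks a model wins."""
    names, out = list(results), {}
    for a in names:
        rates = []
        for b in names:
            if a == b:
                continue
            a_win = int(((results[a] == 1) & (results[b] == 0)).sum())
            b_win = int(((results[b] == 1) & (results[a] == 0)).sum())
            rates.append(a_win / (a_win + b_win) if a_win + b_win else 0.5)
        out[a] = float(np.mean(rates))
    return out

def bradley_terry_elo(results, anchor="gpt-3.5-turbo", iters=500):
    """Bradley-Terry strengths via MM iterations, rescaled so the anchor sits at Elo 1000."""
    names = list(results)
    wins = {(a, b): int(((results[a] == 1) & (results[b] == 0)).sum())
            for a in names for b in names if a != b}
    p = {m: 1.0 for m in names}
    for _ in range(iters):
        new = {}
        for a in names:
            num = sum(wins[a, b] for b in names if b != a)
            den = sum((wins[a, b] + wins[b, a]) / (p[a] + p[b]) for b in names if b != a)
            new[a] = num / den
        s = sum(new.values())  # renormalize to keep the scale bounded
        p = {m: v / s for m, v in new.items()}
    elo = {m: 400 * math.log10(p[m]) for m in names}
    # Anchor GPT-3.5 at 1000 if present; otherwise set the average Elo to 1000.
    shift = 1000 - (elo[anchor] if anchor in elo else float(np.mean(list(elo.values()))))
    return {m: e + shift for m, e in elo.items()}

# Tiny made-up example: three models on six tasks.
results = {"gpt-3.5-turbo": np.array([1, 1, 1, 0, 0, 1]),
           "model-a":       np.array([1, 0, 1, 1, 0, 0]),
           "model-b":       np.array([0, 1, 1, 1, 1, 0])}
print(pass1(results), average_win_rate(results), bradley_terry_elo(results))
```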