mbpp: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever they differ; ties are ignored. The p-value is the probability, under the null hypothesis, of a difference as extreme as the one observed. For all pairs of models, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
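
As a rough illustration, here is a minimal sketch (not this site's actual code) of how such a p-value can be computed with a two-sided sign test; the function name and the example counts are made up.

```python
# Sketch of the pairwise p-value under the null hypothesis above:
# on every task where A and B differ, each wins with probability 1/2;
# ties (tasks where they agree) are ignored.
from scipy.stats import binomtest

def pair_p_value(a_wins: int, b_wins: int) -> float:
    n = a_wins + b_wins          # number of disagreements
    if n == 0:
        return 1.0               # the models never disagree: no evidence either way
    return binomtest(a_wins, n=n, p=0.5, alternative="two-sided").pvalue

# Example: A wins 30 disagreements, B wins 12 -> p is roughly 0.01
print(pair_p_value(30, 12))
```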

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different from the other at the given significance level. The plot shows a fairly sharp transition, since there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
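
To make the parabola concrete, here is a small sketch of the level-alpha boundary using the normal approximation to the sign test; it assumes the figure plots |#A_win - #B_win| against the total number of disagreements #A_win + #B_win.

```python
# Under the null, #A_win - #B_win has mean 0 and variance #A_win + #B_win,
# so a pair is significant at level alpha roughly when
#   |#A_win - #B_win| >= z_{alpha/2} * sqrt(#A_win + #B_win),
# which traces a parabola in the (difference, total) plane.
import numpy as np
from scipy.stats import norm

def significance_boundary(total_disagreements, alpha=0.05):
    z = norm.ppf(1 - alpha / 2)
    return z * np.sqrt(np.asarray(total_disagreements, dtype=float))

# Example: with 40 disagreements, a difference of about 12 or more
# is needed for significance at the 5% level.
print(significance_boundary(40))   # ~12.4
```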

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (as reported by the benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate correlates well with Elo. GPT-3.5 is anchored to an Elo of 1000 when available; otherwise the average Elo is set to 1000.
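
For reference, here is a minimal sketch of how the three columns could be computed from per-task pass/fail results. It assumes a hypothetical `results[model][task]` mapping of 0/1 scores, splits ties as half a win for each side, and fits the Bradley-Terry strengths with the standard MM updates; this is an illustration under those assumptions, not the exact pipeline used here.

```python
import itertools
import numpy as np

def pass_at_1(results, model):
    # Raw accuracy: fraction of tasks solved.
    scores = list(results[model].values())
    return sum(scores) / len(scores)

def average_win_rate(results, model):
    # Mean over all other models of the fraction of tasks won, ties counted as 1/2.
    rates = []
    for other in results:
        if other == model:
            continue
        wins = sum((results[model][t] > results[other][t]) +
                   0.5 * (results[model][t] == results[other][t])
                   for t in results[model])
        rates.append(wins / len(results[model]))
    return float(np.mean(rates))

def bradley_terry_elo(results, anchor="gpt-3.5", iters=200):
    # Bradley-Terry strengths via the classic MM algorithm, mapped to an Elo
    # scale (400 * log10(strength)) and shifted so the anchor model sits at
    # 1000, or so the average is 1000 if the anchor is absent.
    # Assumes every model wins at least one comparison.
    models = list(results)
    idx = {m: i for i, m in enumerate(models)}
    wins = np.zeros((len(models), len(models)))
    for a, b in itertools.combinations(models, 2):
        for t in results[a]:
            sa, sb = results[a][t], results[b][t]
            wins[idx[a], idx[b]] += (sa > sb) + 0.5 * (sa == sb)
            wins[idx[b], idx[a]] += (sb > sa) + 0.5 * (sa == sb)
    n_pair = wins + wins.T                   # comparisons per model pair
    theta = np.ones(len(models))
    for _ in range(iters):
        denom = n_pair / (theta[:, None] + theta[None, :])
        theta = wins.sum(axis=1) / denom.sum(axis=1)
        theta /= theta.mean()
    elo = 400 * np.log10(theta)
    shift = 1000 - (elo[idx[anchor]] if anchor in idx else elo.mean())
    return {m: round(elo[idx[m]] + shift, 1) for m in models}
```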

model pass@1 win_rate elo
claude-3-opus-20240229 89.4% 94.1% 1472.8
gpt-4-1106-preview 85.7% 87.8% 1335.5
claude-3-sonnet-20240229 83.6% 86.2% 1305.6
meta-llama-3-70b-instruct 82.3% 85.1% 1289.9
deepseek-coder-33b-instruct 80.4% 79.1% 1215.8
claude-3-haiku-20240307 80.2% 78.2% 1203.1
opencodeinterpreter-ds-33b 80.2% 79.7% 1222.6
white-rabbit-neo-33b-v1 79.4% 79.1% 1214.6
bigcode--starcoder2-15b-instruct-v0.1 78.0% 76.6% 1190.8
xwincoder-34b 77.0% 75.3% 1175.9
opencodeinterpreter-ds-6.7b 76.5% 74.4% 1163.8
code-millenials-34b 76.2% 73.9% 1165.1
wizardcoder-34b 75.1% 70.9% 1136.8
deepseek-coder-6.7b-instruct 74.9% 70.8% 1130.0
HuggingFaceH4--starchat2-15b-v0.1 74.9% 70.2% 1128.7
starcoder2-15b-oci 74.3% 69.6% 1128.5
CohereForAI--c4ai-command-r-plus 74.3% 68.3% 1115.0
mixtral-8x22b-instruct-v0.1 73.8% 67.7% 1104.7
speechless-codellama-34b 73.8% 68.4% 1119.9
speechless-starcoder2-15b 73.5% 68.3% 1114.0
mistral-large-latest 72.8% 64.0% 1079.0
Qwen--Qwen1.5-72B-Chat 72.5% 66.2% 1098.0
deepseek-coder-6.7b-base 72.0% 64.7% 1090.9
dolphin-2.6 70.6% 61.8% 1065.2
codegemma-7b-it 70.4% 60.9% 1067.3
code-llama-34b 69.3% 58.7% 1046.4
databricks--dbrx-instruct 67.2% 54.2% 1004.2
speechless-starcoder2-7b 66.7% 53.3% 1002.0
code-llama-multi-34b 66.7% 53.4% 1011.1
microsoft--Phi-3-mini-4k-instruct 65.9% 51.5% 985.7
codegemma-7b 65.1% 50.1% 985.8
wizardcoder-15b 64.3% 48.5% 972.2
phi-2 64.0% 48.0% 968.6
openchat 63.8% 47.4% 960.3
code-llama-13b 63.5% 46.9% 963.6
code-llama-multi-13b 63.0% 45.8% 959.2
mixtral-8x7b-instruct 59.5% 40.7% 908.0
octocoder 59.3% 38.5% 905.9
wizardcoder-7b 58.5% 37.5% 890.3
speechless-mistral-7b 57.4% 36.3% 880.2
gemma-1.1-7b-it 57.1% 36.2% 884.9
codet5p-16b 56.6% 33.5% 867.3
codegemma-2b 55.6% 32.6% 861.8
stable-code-3b 54.8% 30.8% 842.1
codegen-16b 54.2% 30.8% 843.2
code-llama-multi-7b 53.7% 30.1% 839.0
starcoder2-3b 53.4% 31.1% 841.7
codet5p-6b 52.9% 29.5% 836.6
gemma-7b 52.6% 28.9% 823.8
open-hermes-2.5-code-290k-13b 52.4% 29.3% 825.7
mistral-7b 51.9% 26.2% 804.5
codegen-6b 50.8% 25.7% 799.7
xdan-l1-chat 50.3% 26.1% 793.7
codet5p-2b 48.4% 23.5% 778.3
codegen-2b 46.3% 21.1% 756.5
mistralai--Mistral-7B-Instruct-v0.2 44.7% 19.6% 730.6
solar-10.7b-instruct 43.9% 18.3% 714.1
gemma-2b 41.8% 15.5% 684.6
gemma-7b-it 39.7% 16.8% 694.8