MBPP+: by model


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever their results differ; ties are ignored. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one actually observed. For all pairs of models, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
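Concretely, the p-value for one model pair can be reproduced with a two-sided binomial (sign) test over the non-tied tasks. A minimal sketch in Python; the win counts here are made up for illustration:

```python
from scipy.stats import binomtest

# Hypothetical counts over tasks where models A and B disagree
# (tasks that both solve or both fail are ignored as ties).
a_wins, b_wins = 30, 14
n = a_wins + b_wins

# Under the null hypothesis, each non-tied task is a fair coin flip
# (p = 1/2); the two-sided p-value asks how extreme this split is.
result = binomtest(a_wins, n=n, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.4f}")
```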

p-values vs. differences

The range of possible p-values plotted against the difference in accuracy, over all model pairs.

Differences vs. inconsistencies

Here is a more informative view of the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given level. The plot shows a fairly sharp transition: no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
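To see where the parabola comes from, we can trace, for each total #A_win + #B_win, the smallest gap |#A_win - #B_win| that reaches significance. A small sketch, assuming the same two-sided binomial test at the 5% level:

```python
from scipy.stats import binomtest

# For each total number of non-tied tasks n = #A_win + #B_win, find
# the smallest gap d = |#A_win - #B_win| with a p-value below 0.05.
for n in [10, 20, 50, 100, 200, 400]:
    for d in range(n % 2, n + 1, 2):  # d must share the parity of n
        wins = (n + d) // 2
        if binomtest(wins, n=n, p=0.5).pvalue < 0.05:
            print(f"n = {n:3d}: need |#A_win - #B_win| >= {d}")
            break
```

The required gap grows roughly like the square root of the total, which is exactly the parabolic boundary in the figure.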

Results table by model

We show three methods currently used to evaluate code models: raw accuracy as reported by benchmarks (pass@1), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena; see the sketch after the table). Average win rate consistently correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000.

model pass@1 win_rate Elo
gpt-4-1106-preview 74.1% 83.8% 1281.9
claude-3-opus-20240229 73.5% 84.3% 1283.1
deepseek-coder-33b-instruct 70.4% 78.9% 1221.5
claude-3-sonnet-20240229 69.8% 79.7% 1225.8
meta-llama-3-70b-instruct 69.6% 80.9% 1239.8
claude-3-haiku-20240307 69.3% 78.2% 1210.2
opencodeinterpreter-ds-33b 68.8% 78.5% 1216.4
white-rabbit-neo-33b-v1 67.5% 75.7% 1185.7
opencodeinterpreter-ds-6.7b 66.9% 76.0% 1192.5
xwincoder-34b 66.1% 74.9% 1175.6
deepseek-coder-6.7b-instruct 66.1% 73.6% 1165.2
bigcode--starcoder2-15b-instruct-v0.1 65.1% 70.4% 1138.9
HuggingFaceH4--starchat2-15b-v0.1 64.8% 70.7% 1139.6
code-millenials-34b 64.6% 70.3% 1140.4
mixtral-8x22b-instruct-v0.1 64.6% 69.9% 1131.7
wizardcoder-34b 63.8% 69.2% 1130.7
CohereForAI--c4ai-command-r-plus 63.8% 67.3% 1115.4
starcoder2-15b-oci 63.8% 68.8% 1128.5
speechless-starcoder2-15b 63.0% 67.9% 1116.3
Qwen--Qwen1.5-72B-Chat 62.4% 66.0% 1105.0
speechless-codellama-34b 61.4% 64.1% 1090.9
dolphin-2.6 60.1% 61.1% 1067.5
mistral-large-latest 59.8% 58.9% 1048.5
deepseek-coder-6.7b-base 59.5% 59.9% 1059.7
codegemma-7b-it 57.4% 55.1% 1027.3
speechless-starcoder2-7b 57.1% 54.5% 1013.5
code-llama-34b 56.9% 53.8% 1017.6
databricks--dbrx-instruct 56.3% 52.4% 1001.5
openchat 56.1% 52.2% 1001.3
phi-2 55.3% 50.4% 994.5
code-llama-multi-34b 55.0% 49.9% 998.0
wizardcoder-15b 54.8% 49.3% 989.0
microsoft--Phi-3-mini-4k-instruct 54.5% 48.9% 975.9
code-llama-multi-13b 54.5% 48.7% 990.1
code-llama-13b 53.2% 46.0% 966.1
codegemma-7b 52.4% 44.6% 954.0
octocoder 51.3% 42.0% 938.9
mixtral-8x7b-instruct 50.3% 41.5% 924.7
wizardcoder-7b 50.0% 39.3% 913.0
speechless-mistral-7b 49.2% 38.7% 907.6
codet5p-16b 48.1% 35.5% 892.7
codegemma-2b 47.9% 36.1% 896.0
stable-code-3b 46.8% 33.4% 873.4
open-hermes-2.5-code-290k-13b 46.8% 35.3% 883.1
gemma-1.1-7b-it 46.6% 33.6% 872.8
codegen-16b 46.3% 32.8% 868.6
gemma-7b 45.0% 31.5% 857.1
starcoder2-3b 44.4% 30.2% 843.5
code-llama-multi-7b 44.2% 28.9% 836.4
codegen-6b 43.7% 28.5% 834.3
mistral-7b 42.9% 26.5% 817.6
codet5p-6b 42.6% 27.6% 830.7
xdan-l1-chat 41.8% 26.5% 807.1
codet5p-2b 38.9% 21.4% 768.7
mistralai--Mistral-7B-Instruct-v0.2 37.6% 22.0% 763.1
solar-10.7b-instruct 37.6% 20.9% 753.5
codegen-2b 37.3% 19.2% 743.3
gemma-2b 35.4% 17.6% 723.0
gemma-7b-it 33.1% 17.5% 711.4
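As referenced above, the Elo column follows the Chatbot Arena recipe: fit Bradley-Terry coefficients by logistic regression over pairwise outcomes, then rescale to the familiar Elo range. A minimal sketch, assuming a list of (winner, loser) outcomes; the function name, input format, and anchor handling are illustrative, with scikit-learn standing in for whatever solver is actually used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bradley_terry_elo(models, outcomes, anchor="gpt-3.5", anchor_elo=1000.0):
    """Fit Bradley-Terry coefficients on an Elo-like scale.

    `outcomes` is a list of (winner, loser) model names, one entry per
    non-tied task of each model pair.
    """
    idx = {m: i for i, m in enumerate(models)}
    X = np.zeros((len(outcomes), len(models)))
    for row, (winner, loser) in enumerate(outcomes):
        X[row, idx[winner]] = 1.0
        X[row, idx[loser]] = -1.0
    # Mirror every outcome so the labels contain both classes.
    X = np.vstack([X, -X])
    y = np.concatenate([np.ones(len(outcomes)), np.zeros(len(outcomes))])
    clf = LogisticRegression(fit_intercept=False, C=1e6).fit(X, y)
    # Rescale logits to the conventional Elo scale: 400 per log10 unit.
    scores = clf.coef_[0] * 400.0 / np.log(10.0)
    if anchor in idx:
        scores += anchor_elo - scores[idx[anchor]]  # pin GPT-3.5 at 1000
    else:
        scores += anchor_elo - scores.mean()        # else average is 1000
    return dict(zip(models, scores))
```

Mirroring each outcome keeps the design matrix symmetric and gives the solver both label classes; the large C effectively turns off regularization, so the fit stays a plain maximum-likelihood Bradley-Terry model.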