lcb_codegen: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance to win whenever they disagree; ties are ignored. The p-value is the probability, under this null hypothesis, of observing a difference at least as extreme as the one actually observed. For all pairs of models, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
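Under this null, the number of wins for A among the decisive (non-tied) problems follows a Binomial(n, 1/2) distribution, so the p-value can be obtained from a two-sided sign test. Below is a minimal sketch of that computation; the helper name pair_p_value and the example counts are illustrative, and the leaderboard's exact implementation may differ.

```python
from scipy.stats import binomtest

def pair_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided sign-test p-value for the null that A and B are equally
    likely to win whenever they disagree (ties are dropped)."""
    n = a_wins + b_wins                # number of decisive problems
    if n == 0:
        return 1.0                     # no decisive problems, no evidence either way
    return binomtest(a_wins, n=n, p=0.5, alternative="two-sided").pvalue

# Example (hypothetical counts): A wins 60 decisive problems, B wins 40.
print(pair_p_value(60, 40))            # ~0.057
```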

p-values vs. differences

The range of possible p-values plotted against the difference in accuracy, over all model pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given significance level. The plot shows a fairly sharp transition: there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
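As a rough guide, the shape of the parabola follows from the normal approximation to the sign test: a pair is significant at level alpha roughly when |#A_win - #B_win| >= z_{alpha/2} * sqrt(#A_win + #B_win). The sketch below illustrates this boundary; the function name and the choice of alpha = 0.05 are illustrative assumptions, not the page's exact construction.

```python
from math import sqrt
from scipy.stats import norm

def min_significant_gap(n_decisive: int, alpha: float = 0.05) -> float:
    """Approximate smallest |#A_win - #B_win| that is significant at level
    alpha for a given number of decisive (non-tied) problems, via the
    normal approximation to the two-sided sign test."""
    z = norm.ppf(1 - alpha / 2)        # ~1.96 for alpha = 0.05
    return z * sqrt(n_decisive)        # the parabola: gap grows like sqrt(n)

for n in (50, 100, 200, 400):
    print(n, round(min_significant_gap(n), 1))
# 50 -> 13.9, 100 -> 19.6, 200 -> 27.7, 400 -> 39.2
```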

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (pass@1, as reported by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000.
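For concreteness, here is a minimal sketch of how the average win-rate and Bradley-Terry Elo columns could be computed from a matrix of pairwise win counts. The MM fitting loop, the 400*log10 mapping to the Elo scale, and the anchoring step are standard Bradley-Terry/Elo conventions, but the function names are illustrative and the leaderboard's exact fitting procedure (following Chatbot Arena) may differ in detail.

```python
import numpy as np

def bradley_terry_elo(wins: np.ndarray, anchor: int = 0,
                      anchor_elo: float = 1000.0, iters: int = 1000) -> np.ndarray:
    """Fit Bradley-Terry strengths from a pairwise win-count matrix
    (wins[i, j] = #problems model i solved but model j did not) using the
    standard MM updates, then map to the Elo scale and shift so that the
    anchor model (e.g. GPT-3.5) sits at anchor_elo."""
    m = wins.shape[0]
    p = np.ones(m)                                   # strengths
    total = wins + wins.T                            # decisive comparisons per pair
    for _ in range(iters):
        for i in range(m):
            denom = sum(total[i, j] / (p[i] + p[j]) for j in range(m) if j != i)
            if denom > 0:
                # small smoothing keeps strengths positive for winless models
                p[i] = (wins[i].sum() + 1e-6) / denom
        p /= p.sum()                                 # fix the overall scale
    elo = 400.0 * np.log10(p)                        # Bradley-Terry -> Elo scale
    return elo - elo[anchor] + anchor_elo            # anchor at anchor_elo

def average_win_rate(wins: np.ndarray) -> np.ndarray:
    """Win rate of each model against every other model, averaged over opponents;
    pairs with no decisive problems count as 0.5."""
    total = wins + wins.T
    with np.errstate(invalid="ignore"):
        rates = np.where(total > 0, wins / total, 0.5)
    np.fill_diagonal(rates, np.nan)
    return np.nanmean(rates, axis=1)
```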

model   pass@1   std   win_rate   elo
GPT-4O-2024-05-13 51.3% 0.83% 96.7% 1667.6
GPT-4-Turbo-2024-04-09 43.5% 0.98% 90.7% 1466.4
GPT-4-Turbo-1106 38.3% 0.92% 87.9% 1411.0
Gemini-Pro-1.5 (May) 38.2% 0.78% 88.6% 1435.5
Claude-3-Opus 35.4% 0.63% 85.4% 1373.7
GPT-4-0613 34.8% 0.82% 85.5% 1372.9
WCoder-33B-V1.1 30.4% 0.91% 76.8% 1276.7
DSCoder-33b-Ins 30.3% 0.90% 78.4% 1294.5
Gemini-Pro-1.5 (April) (n=1) 29.5% 0.00% 76.4% 1266.0
LLama3-70b-Ins 29.3% 0.61% 76.8% 1272.6
Eurux-8x22b-NCA (n=1) 27.3% 0.00% 73.4% 1234.2
OC-DS-33B 26.6% 0.78% 68.5% 1186.5
Claude-3-Sonnet 25.9% 0.76% 68.9% 1193.9
Mistral-Large 25.7% 0.58% 67.8% 1192.8
Mixtral-8x22B-Ins 25.6% 0.77% 67.3% 1178.2
CodeQwen15-7B-Chat 25.0% 0.99% 58.4% 1113.6
Claude-3-Haiku 24.6% 0.64% 65.1% 1159.0
DSCoder-33b-Base 23.7% 0.81% 61.0% 1136.0
Claude-2 23.6% 0.53% 63.5% 1144.6
Eurus-70B-SFT (n=1) 23.0% 0.00% 62.3% 1143.5
MagiCoderS-DS-6.7B 22.6% 0.84% 57.5% 1101.8
GPT-3.5-Turbo-0125 22.5% 0.66% 59.4% 1110.8
OC-DS-6.7B 21.9% 0.79% 54.8% 1085.4
CodeQwen15-7B 21.8% 0.87% 53.8% 1083.1
LLama3-70b-Base 21.8% 0.73% 56.6% 1091.5
Claude-Instant-1 21.7% 0.46% 57.2% 1095.3
DSCoder-6.7b-Ins 21.6% 0.84% 53.2% 1068.6
GPT-3.5-Turbo-0301 21.2% 0.82% 53.9% 1069.1
Phind-34B-V2 20.4% 0.66% 52.5% 1062.3
Command-R+ 20.4% 0.43% 52.4% 1060.4
DSCoder-6.7b-Base 19.1% 0.77% 45.9% 1018.3
Gemini-Pro 18.8% 0.81% 47.2% 1018.7
Smaug-2-72B 18.4% 0.72% 42.0% 978.4
DBRX-Ins 17.5% 0.75% 42.2% 981.6
LLama3-8b-Ins 17.3% 0.62% 41.9% 980.2
WCoder-34B-V1 16.2% 0.82% 32.6% 890.5
Qwen-1.5-72B-Chat 15.9% 0.50% 37.1% 935.0
StarCoder2-15b 15.4% 0.83% 33.3% 912.0
CodeGemma-7b-Base 14.0% 0.95% 26.6% 854.1
Command-R 14.0% 0.57% 31.3% 883.9
CodeLlama-34b-Base 12.3% 0.80% 21.3% 797.9
Mixtral-8x7B-Ins 12.3% 0.64% 24.2% 818.3
Cllama-13b-Ins 12.0% 0.65% 23.7% 813.9
LLama3-8b-Base 11.9% 0.69% 20.9% 789.1
Cllama-34b-Ins 11.6% 0.58% 24.2% 816.4
MagiCoderS-CL-7B 11.4% 0.74% 20.1% 772.8
StarCoder2-7b 11.3% 0.70% 18.0% 757.6
DSCoder-1.3b-Ins 10.8% 0.68% 21.5% 793.6
Cllama-7b-Ins 10.6% 0.55% 19.0% 761.4
Gemma-7b-Base 9.9% 0.79% 15.5% 727.4
CodeLlama-13b-Base 8.6% 0.59% 11.9% 669.0
StarCoder2-3b 8.6% 0.65% 10.8% 643.0
DSCoder-1.3b-Base 8.0% 0.67% 13.3% 686.5
CodeGemma-2b-Base 7.0% 0.64% 8.4% 592.3
CodeLlama-7b-Base 6.5% 0.63% 6.8% 546.3
StableCode-3B 5.7% 0.59% 7.5% 562.7
OC-DS-1.3B 5.0% 0.57% 4.1% 439.2
Gemma-2b-Base 2.5% 0.50% 1.2% 212.8