nq: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever their answers differ; ties are ignored. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one actually observed. For all pairs of models, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
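To make the computation concrete, here is a minimal sketch of the p-value for one model pair, assuming a two-sided exact binomial (sign) test on the questions where the two models disagree; the win counts below are hypothetical placeholders, not values from this page.

    # Two-sided sign-test p-value for a model pair, under the null
    # hypothesis that A and B each win a disagreement with probability 1/2.
    from scipy.stats import binomtest

    a_wins = 120   # questions A solved and B did not (hypothetical)
    b_wins = 90    # questions B solved and A did not (hypothetical)

    n = a_wins + b_wins                       # ties are ignored
    result = binomtest(a_wins, n, p=0.5, alternative="two-sided")
    print(f"p-value = {result.pvalue:.4f}")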

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola differs significantly from each other at the given level. The plot shows a fairly sharp transition because there are no model pairs with a small #A_win + #B_win, which rules out significant results at small |#A_win - #B_win|. For more explanation see the doc.
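Under a normal approximation to the same sign test, the significance boundary is roughly |#A_win - #B_win| > z * sqrt(#A_win + #B_win), which traces out a parabola. The sketch below computes that boundary; the significance level and the exact form of the curve are illustrative assumptions, not necessarily the curve drawn in the figure.

    # Approximate significance boundary for the sign test: a pair with
    # |#A_win - #B_win| above this value (for its total number of
    # disagreements) differs significantly at level alpha.
    import numpy as np
    from scipy.stats import norm

    alpha = 0.05
    z = norm.ppf(1 - alpha / 2)          # ~1.96 for alpha = 0.05

    totals = np.arange(1, 501)           # #A_win + #B_win
    boundary = z * np.sqrt(totals)       # minimum |#A_win - #B_win|

    for t, b in zip(totals[::100], boundary[::100]):
        print(f"total disagreements {t:4d}: need |diff| > {b:.1f}")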

Results table by model

We show three methods currently used for evaluating code models: raw accuracy as reported by benchmarks, average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the scores are shifted so that the average is 1000. A sketch of the win-rate and Bradley-Terry computations appears after the table.

model  pass@1  win_rate  Elo
dbrx-base 48.8% 88.2% 1384.1
Meta-Llama-3-70B 43.2% 86.2% 1334.5
Mixtral-8x22B-v0.1 42.2% 85.8% 1326.1
Qwen1.5-110B 41.6% 85.1% 1315.7
llama_65B 38.2% 79.9% 1250.3
deepseek-llm-67b-base 37.7% 79.7% 1244.7
Mixtral-8x7B-v0.1 36.9% 78.5% 1235.0
Qwen1.5-72B 35.9% 75.7% 1210.0
llama_33B 34.8% 74.7% 1197.1
llama2_70B 33.3% 69.8% 1155.6
falcon-40b 33.3% 71.5% 1173.1
Qwen1.5-32B 30.7% 64.9% 1118.4
Meta-Llama-3-8B 29.9% 63.5% 1114.3
Mistral-7B-v0.1 29.2% 61.8% 1097.9
llama_13B 28.6% 60.1% 1088.1
llama2_13B 27.0% 55.6% 1053.7
deepseek-moe-16b-base 26.8% 55.3% 1057.7
mpt-30b 26.1% 53.6% 1045.3
gemma-7b 24.8% 49.9% 1021.6
Qwen1.5-14B 23.6% 46.7% 999.2
falcon-7b 22.6% 44.0% 982.7
llama_07B 22.5% 43.7% 973.9
llama2_07B 22.3% 43.4% 970.8
deepseek-llm-7b-base 22.1% 42.6% 973.6
Qwen1.5-7B 19.1% 35.1% 918.2
stablelm-3b-4e1t 17.6% 31.0% 884.3
stablelm-base-alpha-7b-v2 16.8% 29.3% 875.6
Qwen1.5-4B 15.8% 26.5% 848.9
gemma-2b 14.4% 23.7% 827.5
pythia-12b-deduped-v0 10.4% 15.1% 734.5
Qwen1.5-1.8B 10.1% 14.6% 725.6
pythia-6.9b-deduped-v0 8.8% 12.9% 704.1
pythia-2.8b-deduped 6.5% 8.2% 610.8
Qwen1.5-0.5B 5.4% 7.3% 588.4
pythia-1b-deduped 3.7% 5.5% 534.7
pythia-1.4b-deduped-v0 2.3% 3.1% 424.1
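Below is a rough sketch of how the last two columns can be derived from pairwise win counts: average win-rate as the mean fraction of won disagreements against each opponent, and Elo as rescaled Bradley-Terry coefficients fit with the standard MM iteration. The model names, win counts, and the anchoring at a mean of 1000 are illustrative assumptions, not the exact pipeline behind this table.

    # Average win-rate and Bradley-Terry ("Elo") scores from pairwise wins.
    # wins[i][j] = number of questions model i solved and model j did not.
    import numpy as np

    models = ["model_a", "model_b", "model_c"]        # hypothetical names
    wins = np.array([[0, 30, 50],
                     [20, 0, 40],
                     [10, 15, 0]], dtype=float)       # hypothetical counts

    # Average win-rate: mean over opponents of wins / (wins + losses).
    with np.errstate(invalid="ignore"):
        pairwise = wins / (wins + wins.T)
    np.fill_diagonal(pairwise, np.nan)
    avg_win_rate = np.nanmean(pairwise, axis=1)

    # Bradley-Terry strengths via the standard MM iteration:
    # p_i <- W_i / sum_j n_ij / (p_i + p_j), then renormalize.
    p = np.ones(len(models))
    games = wins + wins.T
    for _ in range(1000):
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p_new = wins.sum(axis=1) / denom.sum(axis=1)
        p_new /= p_new.mean()
        if np.allclose(p, p_new, rtol=1e-9):
            p = p_new
            break
        p = p_new

    # Convert to an Elo-like scale, shifted so the average is 1000.
    elo = 400 * np.log10(p)
    elo += 1000 - elo.mean()

    for name, wr, e in zip(models, avg_win_rate, elo):
        print(f"{name:10s}  win-rate {wr:.1%}  Elo {e:6.1f}")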