agi_english: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance to win whenever they disagree; ties are ignored. The p-value is the probability, under this null hypothesis, of a difference at least as extreme as the one observed. For all pairs of models it depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
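
Concretely, this is a two-sided sign test on the disagreements between the two models. A minimal sketch in Python (the function name and the example counts are illustrative, not taken from this page):

from math import comb

def sign_test_pvalue(a_wins: int, b_wins: int) -> float:
    """Two-sided p-value under the null that A and B each win a
    disagreement with probability 1/2 (ties are ignored)."""
    n = a_wins + b_wins                  # number of disagreements
    gap = abs(a_wins - b_wins)           # observed win-count gap
    # Probability of a gap at least as extreme under Binomial(n, 1/2).
    p = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= gap) / 2 ** n
    return min(p, 1.0)

# Example: A beats B on 60 disagreements, B beats A on 40 -> p is roughly 0.057.
print(sign_test_pvalue(60, 40))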

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given level. The plot shows a fairly sharp transition: there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
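
The parabola follows from the sign test itself: under the null, #A_win - #B_win has standard deviation sqrt(n) for n = #A_win + #B_win, so significance at level alpha roughly requires |#A_win - #B_win| >= z * sqrt(n), i.e. n >= (gap / z)^2. A small sketch using the exact test rather than the normal approximation (the function name and the printed values of n are illustrative):

from math import comb
from typing import Optional

def min_significant_gap(n: int, alpha: float = 0.05) -> Optional[int]:
    """Smallest |#A_win - #B_win| that is significant at level `alpha`
    for n = #A_win + #B_win disagreements (two-sided sign test)."""
    for gap in range(n % 2, n + 1, 2):   # the gap has the same parity as n
        p = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= gap) / 2 ** n
        if p <= alpha:
            return gap
    return None                          # not attainable for very small n

# Normal approximation: gap >= 1.96 * sqrt(n), i.e. the parabola n ~ (gap / 1.96)**2.
for n in (10, 50, 100, 500):
    print(n, min_significant_gap(n))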

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (as used by most benchmarks), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate consistently correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when it is available; otherwise the average Elo is set to 1000. A sketch of both computations follows the table below.

model  pass@1  win_rate  Elo
Qwen1.5-110B 65.2% 80.6% 1225.4
Meta-Llama-3-70B 63.7% 79.3% 1211.3
Qwen1.5-72B 63.2% 78.9% 1207.3
Qwen1.5-32B 61.4% 76.7% 1185.2
Mixtral-8x22B-v0.1 61.2% 76.6% 1184.6
dbrx-base 55.9% 70.9% 1133.8
deepseek-llm-67b-base 55.5% 70.3% 1129.9
Qwen1.5-14B 54.7% 69.6% 1125.4
Mixtral-8x7B-v0.1 50.4% 63.8% 1079.9
llama2_70B 48.9% 61.8% 1066.5
llama_65B 48.4% 61.6% 1070.4
Qwen1.5-7B 48.2% 61.2% 1064.3
Meta-Llama-3-8B 47.4% 59.9% 1055.3
gemma-7b 45.3% 56.9% 1037.8
Mistral-7B-v0.1 44.0% 55.2% 1025.1
Qwen1.5-4B 42.9% 53.5% 1015.2
llama_33B 41.4% 51.4% 999.7
llama2_13B 38.0% 46.5% 971.2
llama2_07B 34.8% 42.3% 945.2
deepseek-llm-7b-base 34.3% 41.4% 942.3
Qwen1.5-1.8B 34.1% 41.5% 942.0
mpt-30b 34.1% 41.3% 939.6
llama_13B 31.6% 38.0% 921.0
stablelm-base-alpha-7b-v2 31.3% 37.7% 915.2
stablelm-3b-4e1t 29.9% 35.6% 902.7
deepseek-moe-16b-base 29.7% 35.8% 904.4
Qwen1.5-0.5B 29.4% 35.5% 900.6
gemma-2b 27.3% 34.5% 891.9
llama_07B 24.6% 30.4% 866.8
pythia-12b-deduped-v0 24.5% 30.8% 869.5
pythia-2.8b-deduped 23.5% 30.1% 861.3
pythia-6.9b-deduped-v0 23.4% 29.8% 861.7
falcon-7b 22.9% 29.0% 852.5
pythia-1b-deduped 22.3% 28.6% 848.3
pythia-1.4b-deduped-v0 22.0% 28.0% 846.7
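
The scoring code is not shown on this page, so the following is a minimal sketch of both aggregate metrics under common conventions: win rate counts a tie on a task as half a win, and Bradley-Terry strengths are fit with the classic MM updates and mapped to the Elo scale (400 * log10), anchored so that a chosen reference model (GPT-3.5 when available) sits at 1000. The function names and the tie convention are assumptions, not the site's exact implementation.

import numpy as np

def average_win_rate(passed: np.ndarray) -> np.ndarray:
    """passed[i, t] = 1 if model i solves task t, else 0.
    Returns, for each model, its win rate averaged over all opponents,
    counting a tie on a task as half a win (one common convention)."""
    m = passed.shape[0]
    rates = np.zeros(m)
    for i in range(m):
        per_opponent = [
            (passed[i] > passed[j]).mean() + 0.5 * (passed[i] == passed[j]).mean()
            for j in range(m) if j != i
        ]
        rates[i] = np.mean(per_opponent)
    return rates

def bradley_terry_elo(wins: np.ndarray, anchor: int = 0,
                      anchor_elo: float = 1000.0, iters: int = 200) -> np.ndarray:
    """wins[i, j] = number of tasks where model i beats model j.
    Fits Bradley-Terry strengths with the standard MM updates, then maps
    them to an Elo-like scale anchored so model `anchor` has `anchor_elo`."""
    m = wins.shape[0]
    p = np.ones(m)
    for _ in range(iters):
        for i in range(m):
            num = wins[i].sum() + 1e-9       # smoothing avoids log(0) for winless models
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(m) if j != i)
            p[i] = num / den if den > 0 else p[i]
        p /= p.sum()
    elo = 400.0 * np.log10(p)                # Elo and Bradley-Terry share a logistic form
    return elo - elo[anchor] + anchor_elo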