mmlu: by models

p-values for model pairs

The null hypothesis is that models A and B are each equally likely to win on any example where their answers differ; ties are ignored. The p-value is the probability, under this null hypothesis, of a difference at least as extreme as the one observed. For all pairs of models, it depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
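
The exact computation is not shown on this page; the following is a minimal sketch of a two-sided sign test consistent with the description above (the function name and the example counts are illustrative, not the site's actual code).

```python
# Minimal sign-test sketch: ties are dropped, and under the null hypothesis
# #A_win ~ Binomial(n, 1/2) with n = #A_win + #B_win.
from scipy.stats import binomtest

def pair_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided p-value for 'A and B differ', ignoring tied examples."""
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence either way
    return binomtest(a_wins, n, p=0.5, alternative="two-sided").pvalue

# Illustrative counts: A wins 620 of the 1160 examples where the models disagree.
print(pair_p_value(620, 540))  # roughly 0.02
```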

p-values vs. differences

The range of possible p-values vs. the difference in accuracy, over all model pairs.
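
A single accuracy difference corresponds to a range of p-values because the accuracy gap fixes #A_win - #B_win but not #A_win + #B_win. The sketch below illustrates this; the question count and accuracy gap are assumptions chosen for illustration, not numbers taken from this page.

```python
# For a fixed accuracy gap, #A_win - #B_win is fixed, but the number of
# decisive examples #A_win + #B_win varies from pair to pair, so the
# p-value can range from clearly significant to clearly not.
from scipy.stats import binomtest

N = 14042                 # MMLU test-set size (assumed here for illustration)
gap = 0.005               # a 0.5-point accuracy difference
diff = round(gap * N)     # #A_win - #B_win implied by the gap

for decisive in (diff, 10 * diff, 100 * diff):   # possible values of #A_win + #B_win
    a_wins = (decisive + diff) // 2
    p = binomtest(a_wins, decisive, p=0.5, alternative="two-sided").pvalue
    print(f"decisive={decisive:5d}  p={p:.3g}")
```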

Differences vs. inconsistencies

Here is a more informative figure showing the underlying counts used to compute the p-values. Any model pair to the right of the parabola is statistically significantly different at the given level. The plot shows a fairly sharp transition: no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
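
Under a normal approximation to the same sign test, the boundary can be sketched directly; this is an approximation written for illustration, not the exact curve drawn in the figure.

```python
# Approximate significance boundary: a pair is significant at level alpha
# roughly when |#A_win - #B_win| > z * sqrt(#A_win + #B_win), where
# z = norm.ppf(1 - alpha / 2). Plotted against |#A_win - #B_win|, this is
# the parabola  #A_win + #B_win < (|#A_win - #B_win| / z)^2.
from scipy.stats import norm

def min_significant_gap(decisive: int, alpha: float = 0.05) -> float:
    """Smallest |#A_win - #B_win| that is (approximately) significant."""
    z = norm.ppf(1 - alpha / 2)
    return z * decisive ** 0.5

for decisive in (100, 1000, 10000):
    print(decisive, round(min_significant_gap(decisive), 1))
```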

Results table by model

We show three methods currently used for evaluating models: raw accuracy as reported by the benchmark (the pass1 column), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate consistently correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the ratings are shifted so that the average is 1000. A sketch of the win-rate and Elo computation follows the table.

model pass1 win_rate elo
Qwen1.5-110B 81.1% 87.4% 1305.0
Meta-Llama-3-70B 78.7% 84.9% 1266.7
Mixtral-8x22B-v0.1 77.6% 83.9% 1251.0
Qwen1.5-72B 77.2% 83.7% 1248.8
dbrx-base 74.3% 79.9% 1204.5
Qwen1.5-32B 73.6% 79.4% 1198.3
deepseek-llm-67b-base 71.4% 77.2% 1174.9
Mixtral-8x7B-v0.1 70.3% 75.5% 1159.0
Qwen1.5-14B 67.8% 71.9% 1131.0
Meta-Llama-3-8B 65.3% 68.7% 1103.8
llama2_70B 63.2% 65.5% 1079.4
gemma-7b 62.6% 64.8% 1076.5
Mistral-7B-v0.1 62.5% 64.8% 1074.5
llama_65B 62.2% 64.1% 1071.9
Qwen1.5-7B 60.5% 61.5% 1052.2
llama_33B 57.0% 56.3% 1020.6
falcon-40b 55.4% 53.8% 1000.9
Qwen1.5-4B 55.2% 53.4% 1002.2
llama2_13B 53.7% 51.2% 985.6
deepseek-llm-7b-base 48.1% 43.3% 938.3
llama2_07B 47.3% 42.1% 929.0
mpt-30b 47.0% 41.8% 928.6
Qwen1.5-1.8B 45.6% 40.0% 917.2
llama_13B 45.6% 39.7% 915.6
deepseek-moe-16b-base 44.9% 39.0% 912.8
stablelm-3b-4e1t 44.4% 38.4% 906.2
stablelm-base-alpha-7b-v2 44.4% 38.3% 906.4
gemma-2b 41.0% 35.1% 883.5
Qwen1.5-0.5B 38.4% 31.8% 860.9
llama_07B 35.1% 28.9% 842.6
falcon-7b 27.2% 22.9% 790.4
pythia-2.8b-deduped 26.4% 23.0% 787.4
pythia-12b-deduped-v0 24.7% 20.8% 771.5
pythia-6.9b-deduped-v0 24.7% 20.7% 768.7
pythia-1b-deduped 24.6% 21.3% 772.3
pythia-1.4b-deduped-v0 23.3% 20.1% 761.4
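
The win-rate and Elo columns above can be reproduced from pairwise win counts. Here is a hedged sketch: the `wins` structure, iteration count, and anchoring rule are assumptions based on the description above, not the leaderboard's exact code.

```python
# Sketch: average win rate over all opponents, and Bradley-Terry strengths
# fit with the classic MM update, then mapped to an Elo-like scale.
# wins[a][b] = number of examples model a gets right while model b gets wrong.
import math

def average_win_rate(wins: dict[str, dict[str, int]]) -> dict[str, float]:
    models = list(wins)
    out = {}
    for a in models:
        rates = []
        for b in models:
            if a == b:
                continue
            n = wins[a][b] + wins[b][a]
            rates.append(wins[a][b] / n if n else 0.5)
        out[a] = sum(rates) / len(rates)
    return out

def bradley_terry_elo(wins, iters=500, anchor=None):
    """MM iterations for Bradley-Terry; ratings reported on the Elo scale.

    Ratings are shifted so `anchor` (e.g. GPT-3.5) sits at 1000, or so the
    mean is 1000 when no anchor is given. Models with zero wins would need
    special handling; this sketch assumes every model wins at least once.
    """
    models = list(wins)
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for a in models:
            total_wins = sum(wins[a][b] for b in models if b != a)
            denom = sum(
                (wins[a][b] + wins[b][a]) / (p[a] + p[b])
                for b in models if b != a
            )
            new_p[a] = total_wins / denom if denom else p[a]
        mean_log = sum(math.log(v) for v in new_p.values()) / len(new_p)
        p = {m: v / math.exp(mean_log) for m, v in new_p.items()}  # fix the scale
    scale = 400 / math.log(10)  # standard Elo logistic scale
    elo = {m: scale * math.log(v) for m, v in p.items()}
    shift = 1000 - (elo[anchor] if anchor in elo else sum(elo.values()) / len(elo))
    return {m: e + shift for m, e in elo.items()}
```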