hellaswag: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance to win whenever they disagree; ties are ignored. The p-value is the probability, under this null hypothesis, of observing a difference as extreme as the one actually observed. For all model pairs, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
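
Concretely, this amounts to a two-sided binomial (sign) test on the examples where the two models disagree. A minimal sketch, with made-up win counts for illustration:

```python
from scipy.stats import binomtest

# Hypothetical counts: examples where exactly one of A, B is correct (ties are dropped).
a_wins, b_wins = 530, 470

# Under the null hypothesis each disagreement is a fair coin flip, so we test p = 1/2.
result = binomtest(a_wins, n=a_wins + b_wins, p=0.5, alternative="two-sided")
print(result.pvalue)
```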

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs inconsistencies

Here is a more informative figure showing the source data used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given level. The plot shows a fairly sharp transition, since there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
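
To make the parabola concrete: under the normal approximation to the sign test, a pair is significant at level alpha roughly when |#A_win - #B_win| >= z_(alpha/2) * sqrt(#A_win + #B_win), so the boundary in the (difference, total-disagreements) plane is a parabola. A minimal sketch of this check, with made-up counts:

```python
import math
from scipy.stats import norm

def is_significant(a_wins: int, b_wins: int, alpha: float = 0.05) -> bool:
    """Normal-approximation sign test: the pair is significant when it lies to the
    right of the parabola |a_wins - b_wins| = z * sqrt(a_wins + b_wins)."""
    n = a_wins + b_wins
    z = norm.ppf(1 - alpha / 2)  # two-sided critical value, ~1.96 for alpha = 0.05
    return abs(a_wins - b_wins) >= z * math.sqrt(n)

# Hypothetical pair: 600 disagreements, A wins 330 of them.
print(is_significant(330, 270))  # |330 - 270| = 60 >= 1.96 * sqrt(600) ~ 48 -> True
```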

Results table by model

We show three methods currently used for evaluating models: raw accuracy (as reported by the benchmark), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate always correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000.
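
For reference, here is a minimal sketch of how average win rate and Bradley-Terry (Elo-style) coefficients can be computed from a pairwise win matrix. The tiny three-model matrix, the MM fitting loop, and the anchoring of the mean rating to 1000 are assumptions for illustration, not the exact pipeline behind this table:

```python
import numpy as np

# Hypothetical pairwise counts: wins[i, j] = #examples model i gets right and model j gets wrong.
models = ["model_x", "model_y", "model_z"]
wins = np.array([[0., 60., 80.],
                 [40., 0., 70.],
                 [20., 30., 0.]])
n = len(models)

# Average win rate over all other models.
avg_win_rate = np.array([
    np.mean([wins[i, j] / (wins[i, j] + wins[j, i]) for j in range(n) if j != i])
    for i in range(n)
])

# Bradley-Terry strengths via the standard MM (minorize-maximize) iteration.
games = wins + wins.T
p = np.ones(n)
for _ in range(200):
    for i in range(n):
        denom = sum(games[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
        p[i] = wins[i].sum() / denom
    p /= p.sum()

# Convert to an Elo-like scale and anchor so the average rating is 1000.
elo = 400 * np.log10(p)
elo += 1000 - elo.mean()

for m, w, e in zip(models, avg_win_rate, elo):
    print(f"{m}: win_rate={w:.1%}, elo={e:.1f}")
```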

model pass1 win_rate elo
dbrx-base 88.7% 90.3% 1399.6
Mixtral-8x22B-v0.1 86.8% 90.7% 1386.8
Qwen1.5-110B 86.5% 90.6% 1380.1
Meta-Llama-3-70B 85.9% 88.6% 1337.2
deepseek-llm-67b-base 85.5% 88.8% 1339.5
Qwen1.5-72B 85.3% 86.0% 1296.6
llama_65B 85.3% 87.6% 1320.2
falcon-40b 85.1% 86.4% 1301.6
Mixtral-8x7B-v0.1 84.5% 84.5% 1271.5
Qwen1.5-32B 84.1% 82.7% 1244.8
llama_33B 84.0% 83.0% 1249.8
llama2_70B 83.0% 75.7% 1180.5
Mistral-7B-v0.1 81.7% 73.8% 1140.4
gemma-7b 80.8% 69.5% 1101.5
mpt-30b 80.8% 69.3% 1096.6
Meta-Llama-3-8B 80.5% 68.3% 1084.4
llama_13B 80.4% 67.7% 1086.5
llama2_13B 80.3% 65.0% 1078.7
Qwen1.5-14B 80.0% 65.1% 1063.5
deepseek-moe-16b-base 78.6% 59.1% 1010.6
falcon-7b 78.3% 58.1% 1002.8
Qwen1.5-7B 77.3% 53.1% 970.5
deepseek-llm-7b-base 77.2% 53.0% 961.2
llama_07B 77.1% 52.7% 955.4
llama2_07B 76.2% 48.8% 939.5
stablelm-base-alpha-7b-v2 75.5% 45.3% 899.3
stablelm-3b-4e1t 75.2% 44.3% 888.6
gemma-2b 71.7% 31.5% 782.9
Qwen1.5-4B 71.6% 31.7% 789.5
pythia-12b-deduped-v0 69.5% 25.3% 730.3
pythia-6.9b-deduped-v0 66.1% 17.7% 648.0
Qwen1.5-1.8B 61.0% 10.8% 540.9
pythia-2.8b-deduped 60.3% 9.5% 515.8
pythia-1.4b-deduped-v0 52.0% 4.8% 379.7
pythia-1b-deduped 49.6% 3.2% 305.5
Qwen1.5-0.5B 49.4% 3.5% 319.8