tqa: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance to win whenever they disagree; ties are ignored. The p-value is the probability, under this null hypothesis, of getting a difference at least as extreme as the one observed. Across all model pairs, it mainly depends on the difference in accuracy. Hover over each model pair for detailed information.
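Under these assumptions the test is a two-sided sign (binomial) test on the disagreements. Below is a minimal sketch, assuming per-question correctness vectors for the two models; the function name and inputs are hypothetical, not taken from this site's code.

```python
# Minimal sketch of the pairwise sign test described above (hypothetical helper).
from scipy.stats import binomtest

def pairwise_p_value(a_correct, b_correct):
    """Two-sided p-value under the null that A and B each win a disagreement
    with probability 1/2; questions where both models agree (ties) are ignored."""
    a_wins = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    b_wins = sum(1 for a, b in zip(a_correct, b_correct) if not a and b)
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # the two models never disagree
    return binomtest(a_wins, n, p=0.5, alternative="two-sided").pvalue

# Example: A wins 30 of the disagreements, B wins 12.
print(pairwise_p_value([1] * 30 + [0] * 12, [0] * 30 + [1] * 12))
```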

p-values vs. differences

The range of possible p-values versus the difference in accuracy, across all model pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically significantly different at the given level. The plot shows a fairly sharp transition: there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
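For reference, the shape of that boundary follows from the normal approximation to the binomial: with n = #A_win + #B_win disagreements, a pair is significant at level alpha roughly when |#A_win - #B_win| > z_(alpha/2) * sqrt(n), which traces a parabola. The sketch below is my reading of the plot under that approximation; the exact test uses the binomial distribution as above.

```python
# Rough sketch of the significance boundary under the normal approximation
# (the exact computation uses the binomial CDF, as in the sign test above).
import numpy as np
from scipy.stats import norm

alpha = 0.05
z = norm.ppf(1 - alpha / 2)   # two-sided critical value, ~1.96 for alpha = 0.05
n = np.arange(1, 401)         # total disagreements, #A_win + #B_win
min_diff = z * np.sqrt(n)     # smallest |#A_win - #B_win| that is significant
```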

Results table by model

We show three methods currently used for evaluating models: raw accuracy (as reported by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena); a sketch of the win-rate and Elo computations follows the table. Average win-rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when it is available; otherwise the average Elo is set to 1000.

model pass1 win_rate elo
dbrx-base 78.2% 92.6% 1442.6
Meta-Llama-3-70B 77.6% 92.8% 1442.6
Mixtral-8x22B-v0.1 77.0% 92.2% 1423.8
Qwen1.5-110B 74.6% 90.0% 1371.2
llama_65B 73.3% 87.7% 1328.6
Mixtral-8x7B-v0.1 73.1% 88.0% 1332.8
deepseek-llm-67b-base 72.9% 87.9% 1331.7
llama_33B 70.7% 84.2% 1272.2
Qwen1.5-72B 70.7% 83.3% 1264.3
llama2_70B 68.7% 79.0% 1217.8
falcon-40b 67.5% 78.5% 1204.5
Qwen1.5-32B 65.5% 74.0% 1164.4
Meta-Llama-3-8B 65.4% 74.3% 1161.5
Mistral-7B-v0.1 64.2% 72.1% 1142.9
llama_13B 63.6% 70.1% 1121.2
mpt-30b 60.8% 64.1% 1076.3
llama2_13B 60.4% 62.5% 1062.8
gemma-7b 60.3% 62.7% 1063.7
deepseek-moe-16b-base 59.1% 60.5% 1049.5
llama_07B 56.4% 54.0% 994.9
deepseek-llm-7b-base 54.4% 49.7% 970.5
Qwen1.5-14B 54.0% 48.8% 961.7
llama2_07B 52.6% 46.0% 935.5
falcon-7b 52.2% 44.8% 928.3
stablelm-base-alpha-7b-v2 49.6% 39.4% 888.1
stablelm-3b-4e1t 48.7% 37.8% 877.3
Qwen1.5-7B 48.1% 36.9% 873.3
gemma-2b 42.8% 27.7% 794.3
Qwen1.5-4B 39.4% 23.0% 748.9
pythia-12b-deduped-v0 37.8% 20.5% 724.0
pythia-6.9b-deduped-v0 33.2% 15.2% 658.3
Qwen1.5-1.8B 26.2% 9.4% 557.0
pythia-2.8b-deduped 24.1% 8.1% 531.8
pythia-1b-deduped 14.7% 3.9% 391.1
Qwen1.5-0.5B 13.4% 3.3% 359.5
pythia-1.4b-deduped-v0 12.7% 2.9% 330.9
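For concreteness, here is a minimal sketch of how the win-rate and Elo columns can be computed from pairwise win counts. The data structure `wins[a][b]` (number of questions model a got right and model b got wrong) and the function names are assumptions for illustration, not this site's actual code.

```python
# Hypothetical sketch: average win-rate and Bradley-Terry-based Elo from
# pairwise win counts wins[a][b] (ties between two models are excluded).
import numpy as np

def average_win_rate(wins, models):
    """Mean over opponents of #wins / (#wins + #losses)."""
    rates = {}
    for a in models:
        per_opponent = [wins[a][b] / (wins[a][b] + wins[b][a])
                        for b in models if b != a and wins[a][b] + wins[b][a] > 0]
        rates[a] = float(np.mean(per_opponent))
    return rates

def bradley_terry_elo(wins, models, iters=1000, anchor=None, anchor_elo=1000.0):
    """Fit Bradley-Terry strengths by iterative (MM) scaling, then map them to
    an Elo-like scale; anchor one model at anchor_elo, or set the average to
    anchor_elo if no anchor is given."""
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        for a in models:
            num = sum(wins[a][b] for b in models if b != a)
            den = sum((wins[a][b] + wins[b][a]) / (strength[a] + strength[b])
                      for b in models if b != a)
            if den > 0:
                strength[a] = max(num, 1e-9) / den  # epsilon avoids log(0) below
        # renormalize to geometric mean 1 to keep the scale fixed
        g = np.exp(np.mean([np.log(s) for s in strength.values()]))
        strength = {m: s / g for m, s in strength.items()}
    # Bradley-Terry strength p maps to Elo via R = (400 / ln 10) * ln p,
    # so that P(a beats b) = 1 / (1 + 10^{-(Ra - Rb)/400}).
    elo = {m: 400.0 / np.log(10.0) * np.log(s) for m, s in strength.items()}
    if anchor in elo:
        shift = anchor_elo - elo[anchor]
    else:
        shift = anchor_elo - np.mean(list(elo.values()))
    return {m: e + shift for m, e in elo.items()}
```

The Elo mapping uses the standard correspondence between Bradley-Terry strengths and Elo ratings (400 / ln 10 points per unit of log-strength), with the scale anchored at 1000 as described above.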