piqa: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever they disagree; ties are ignored. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one actually observed. Across all model pairs, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
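
As a concrete illustration, here is a minimal sketch of this sign test (an assumption about the computation, not necessarily the page's exact code): given how many of the disagreed-on examples each model wins, it returns the exact two-sided binomial p-value under the 1/2-chance null.

```python
from math import comb

def sign_test_pvalue(a_wins: int, b_wins: int) -> float:
    """Exact two-sided binomial (sign) test for a model pair.

    Under the null hypothesis, whenever A and B disagree on an example each
    one wins with probability 1/2; ties are ignored, so only the
    n = a_wins + b_wins discordant examples matter.
    """
    n = a_wins + b_wins
    if n == 0:
        return 1.0
    # Probability of every possible number of A-wins under Binomial(n, 1/2).
    pmf = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    # Sum the probability of all outcomes at least as unlikely as the observed one.
    return min(1.0, sum(p for p in pmf if p <= pmf[a_wins] * (1 + 1e-12)))

# e.g. sign_test_pvalue(60, 40) is roughly 0.057
```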

p-values vs. differences

The range of possible p-values as a function of the difference in accuracy, over all model pairs.

Differences vs inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given significance level. The plot shows a fairly sharp transition, since there are no model pairs with a small #A_win + #B_win, which rules out significant results at small values of |#A_win - #B_win|. For more explanation see the doc.
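
To make the parabola concrete, here is a small sketch of the boundary using the normal approximation to the Binomial(n, 1/2) null (an approximation for illustration, not the exact curve drawn in the figure):

```python
import math
from statistics import NormalDist

def min_significant_gap(n_disagreements: int, alpha: float = 0.05) -> float:
    """Approximate smallest |#A_win - #B_win| that is significant at level alpha.

    Under the null, the win difference over n = #A_win + #B_win disagreements
    has mean 0 and standard deviation sqrt(n), so the significance boundary is
    roughly |difference| = z_{alpha/2} * sqrt(n), i.e. the parabola
    n ~ (difference / z)^2 referred to above.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    return z * math.sqrt(n_disagreements)

# e.g. with 400 disagreements, a gap of about 40 wins (~1.96 * 20) is needed.
```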

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (as reported by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000. A sketch of the win-rate and Elo computations appears after the table.

model pass1 win_rate elo
Mixtral-8x22B-v0.1 85.4% 77.1% 1190.3
dbrx-base 85.4% 74.0% 1162.7
Meta-Llama-3-70B 84.4% 74.2% 1162.4
Qwen1.5-110B 84.3% 73.5% 1154.6
Mixtral-8x7B-v0.1 83.7% 71.1% 1132.3
deepseek-llm-67b-base 83.1% 68.5% 1110.6
falcon-40b 83.1% 67.9% 1108.6
Mistral-7B-v0.1 82.8% 66.7% 1098.8
Qwen1.5-32B 82.7% 65.5% 1091.4
Qwen1.5-72B 82.7% 66.1% 1093.7
llama_65B 82.6% 66.2% 1093.6
llama_33B 82.2% 63.5% 1073.4
mpt-30b 81.2% 58.1% 1036.5
Meta-Llama-3-8B 81.1% 57.3% 1031.9
gemma-7b 81.1% 57.2% 1030.0
llama2_70B 80.8% 54.9% 1019.0
falcon-7b 80.6% 55.0% 1016.5
deepseek-moe-16b-base 80.0% 51.9% 995.4
stablelm-base-alpha-7b-v2 80.0% 51.9% 1001.1
Qwen1.5-14B 79.9% 51.3% 991.3
llama_13B 79.9% 51.4% 990.6
stablelm-3b-4e1t 79.8% 50.5% 993.5
llama2_13B 79.7% 50.2% 987.2
llama_07B 79.5% 49.4% 981.8
Qwen1.5-7B 79.4% 48.7% 979.8
deepseek-llm-7b-base 79.4% 48.5% 973.9
gemma-2b 78.2% 42.8% 940.4
Qwen1.5-4B 77.3% 39.2% 915.0
pythia-12b-deduped-v0 77.0% 37.4% 901.9
llama2_07B 76.9% 39.8% 919.3
pythia-6.9b-deduped-v0 76.1% 34.0% 879.5
Qwen1.5-1.8B 74.4% 29.7% 849.7
pythia-2.8b-deduped 73.7% 26.8% 823.0
pythia-1b-deduped 70.1% 18.7% 744.0
pythia-1.4b-deduped-v0 69.6% 21.6% 772.6
Qwen1.5-0.5B 69.5% 19.6% 753.9
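
For reference, here is a minimal sketch of how the win_rate and elo columns could be derived from a pairwise win-count matrix; the tie handling, the MM fitting loop, and the anchoring convention are assumptions rather than this page's exact implementation.

```python
import numpy as np

def average_win_rate(wins: np.ndarray) -> np.ndarray:
    """wins[i, j] = number of examples where model i beats model j (ties excluded).
    Returns, for each model, the mean over all other models of
    wins_ij / (wins_ij + wins_ji)."""
    with np.errstate(invalid="ignore"):
        frac = wins / (wins + wins.T)
    np.fill_diagonal(frac, np.nan)          # a model is not compared with itself
    return np.nanmean(frac, axis=1)

def bradley_terry_elo(wins: np.ndarray, anchor=None, iters=500) -> np.ndarray:
    """Fit Bradley-Terry strengths with the standard MM update
        pi_i <- W_i / sum_{j != i} n_ij / (pi_i + pi_j)
    and report them on an Elo-like scale (400 * log10(pi) plus an offset).
    The anchor model is pinned to 1000; without an anchor the mean is 1000.
    Assumes every model wins at least one comparison."""
    n_games = wins + wins.T                 # comparisons per pair
    total_wins = wins.sum(axis=1)
    pi = np.ones(len(wins))
    for _ in range(iters):
        denom = n_games / (pi[:, None] + pi[None, :])
        np.fill_diagonal(denom, 0.0)
        pi = total_wins / denom.sum(axis=1)
        pi = pi / pi.sum()
    elo = 400.0 * np.log10(pi)
    offset = 1000.0 - (elo[anchor] if anchor is not None else elo.mean())
    return elo + offset
```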