gsm8k: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance to win whenever they disagree; ties are ignored. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one actually observed. For all pairs of models, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
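As a concrete illustration, the test can be carried out as a two-sided sign (binomial) test over the problems where the two models disagree. The sketch below assumes per-problem correctness lists and uses scipy; it is an assumed setup, not necessarily the exact code behind this page.

```python
# Hedged sketch of the pairwise test described above: ignore ties, then treat
# each disagreement as a fair coin flip under the null hypothesis and compute
# a two-sided binomial p-value.
from scipy.stats import binomtest

def pairwise_p_value(a_correct, b_correct):
    """a_correct, b_correct: per-problem booleans for models A and B."""
    a_wins = sum(a and not b for a, b in zip(a_correct, b_correct))
    b_wins = sum(b and not a for a, b in zip(a_correct, b_correct))
    n = a_wins + b_wins                       # ties are ignored
    if n == 0:
        return 1.0                            # the models never disagree
    # Under the null, A wins each disagreement with probability 1/2.
    return binomtest(a_wins, n, 0.5, alternative="two-sided").pvalue
```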

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different from the other at the given level. The plot shows a fairly sharp transition, since there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
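To see where the parabola comes from: with n = #A_win + #B_win disagreements, the normal approximation to the binomial says the gap |#A_win - #B_win| must exceed roughly z * sqrt(n) to be significant, which traces a parabola in this plot. A minimal sketch, assuming the normal approximation (the page's exact computation may differ):

```python
# Where the significance boundary ("parabola") comes from, under the normal
# approximation to the binomial.
import math
from scipy.stats import norm

def min_significant_gap(n_disagreements, alpha=0.05):
    """Smallest |#A_win - #B_win| that is significant at level alpha,
    given n_disagreements = #A_win + #B_win."""
    z = norm.ppf(1 - alpha / 2)               # two-sided critical value
    return z * math.sqrt(n_disagreements)     # boundary: |gap| = z * sqrt(n)

# Example: with 100 disagreements the gap must be about 1.96 * 10 ~= 20 wins
# to reach p < 0.05.
```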

Results table by model

We show three methods currently used for evaluating models: raw accuracy as reported by benchmarks, average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate always correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when it is available; otherwise the average Elo is set to 1000. A sketch of how these scores can be computed follows the table.

model pass@1 win_rate elo
Qwen1.5-110B 84.1% 95.3% 1608.0
Meta-Llama-3-70B 82.7% 95.0% 1596.2
Mixtral-8x22B-v0.1 80.2% 93.5% 1543.4
Qwen1.5-72B 78.4% 92.1% 1504.8
DeepSeek-V2 77.3% 91.8% 1498.5
Qwen1.5-32B 76.3% 91.0% 1476.3
Qwen1.5-14B 69.5% 86.0% 1380.0
dbrx-base 69.5% 85.7% 1379.0
deepseek-llm-67b-base 62.9% 80.2% 1305.6
Mixtral-8x7B-v0.1 60.3% 77.9% 1276.1
Qwen1.5-7B 59.1% 75.5% 1248.6
gemma-7b 56.8% 73.2% 1234.1
llama2_70B 56.7% 73.3% 1232.6
Meta-Llama-3-8B 55.4% 72.0% 1220.7
Qwen1.5-4B 55.0% 71.3% 1208.1
llama_65B 50.8% 66.3% 1171.6
Mistral-7B-v0.1 41.2% 53.4% 1062.8
llama2_13B 38.0% 48.7% 1025.4
Qwen1.5-1.8B 36.9% 47.2% 1011.9
llama_33B 34.6% 43.8% 992.4
falcon-40b 27.1% 32.8% 902.1
llama2_07B 22.5% 25.8% 834.5
mpt-30b 21.8% 24.9% 826.8
Qwen1.5-0.5B 20.9% 24.4% 821.6
gemma-2b 18.8% 21.5% 794.5
deepseek-moe-16b-base 18.6% 20.4% 782.4
llama_13B 17.6% 19.6% 770.2
deepseek-llm-7b-base 14.1% 15.2% 716.9
llama_07B 10.9% 11.0% 648.8
stablelm-3b-4e1t 10.8% 10.7% 644.7
falcon-7b 7.9% 7.9% 587.0
stablelm-base-alpha-7b-v2 7.4% 7.7% 584.9
pythia-12b-deduped-v0 3.6% 3.8% 453.7
pythia-2.8b-deduped 3.0% 3.5% 437.0
pythia-6.9b-deduped-v0 2.9% 3.4% 429.2
pythia-1.4b-deduped-v0 2.0% 2.9% 408.7
pythia-1b-deduped 2.0% 2.6% 380.9
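
For reference, here is a hedged sketch of how the three columns above can be computed from pairwise win counts. The win-count layout, the MM fitting loop, and the "gpt-3.5" anchor name are assumptions for illustration; the leaderboard's own code may differ.

```python
# Hedged sketch of the three scores in the table: pass@1 is raw accuracy,
# win_rate is the average win rate over all other models, and elo is a
# Bradley-Terry fit rescaled to Elo-like points. Assumed data layout:
# wins[a][b] = number of problems model a solved and model b did not.
import numpy as np

def average_win_rate(wins):
    """Average, over all opponents b, of wins[a][b] / (wins[a][b] + wins[b][a])."""
    models = list(wins)
    rates = {}
    for a in models:
        per_opp = []
        for b in models:
            if b == a:
                continue
            total = wins[a][b] + wins[b][a]
            per_opp.append(wins[a][b] / total if total else 0.5)
        rates[a] = sum(per_opp) / len(per_opp)
    return rates

def bradley_terry_elo(wins, anchor="gpt-3.5", iters=1000):
    """Fit Bradley-Terry strengths with simple MM updates, then convert to an
    Elo-like scale (400 / ln 10 points per unit of log-strength). The anchor
    model is pinned at 1000 if present, otherwise the mean is set to 1000."""
    models = list(wins)
    p = {m: 1.0 for m in models}                      # BT strengths
    for _ in range(iters):
        for a in models:
            num = sum(wins[a][b] for b in models if b != a)
            den = sum((wins[a][b] + wins[b][a]) / (p[a] + p[b])
                      for b in models if b != a)
            p[a] = max(num, 1e-9) / den if den > 0 else p[a]
        geo_mean = np.exp(np.mean([np.log(v) for v in p.values()]))
        p = {m: v / geo_mean for m, v in p.items()}   # fix the overall scale
    scale = 400 / np.log(10)
    elo = {m: scale * np.log(v) for m, v in p.items()}
    shift = 1000 - (elo[anchor] if anchor in elo
                    else float(np.mean(list(elo.values()))))
    return {m: e + shift for m, e in elo.items()}
```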