arc_challenge: by models

p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning on any example where their outputs differ; ties are ignored. The p-value is the probability, under this null hypothesis, of a difference at least as extreme as the one actually observed. Across model pairs it depends mainly on the difference in accuracy.
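
As a concrete illustration, here is a minimal sketch of that test for a single model pair, assuming per-example correctness vectors for both models (the arrays below are made up for the example). It drops the ties and runs a two-sided binomial (sign) test with SciPy.

```python
# Minimal sketch of the pairwise test: ignore ties and apply a two-sided
# sign test to the examples where the two models disagree.
from scipy.stats import binomtest

# Hypothetical per-example correctness (1 = correct, 0 = wrong) for models A and B.
a_correct = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
b_correct = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]

a_wins = sum(a > b for a, b in zip(a_correct, b_correct))  # A right, B wrong
b_wins = sum(a < b for a, b in zip(a_correct, b_correct))  # B right, A wrong

# Under the null hypothesis, each non-tied example is a fair coin flip.
result = binomtest(a_wins, a_wins + b_wins, p=0.5, alternative="two-sided")
print(a_wins, b_wins, result.pvalue)
```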

p-values vs. differences

The range of possible p-values plotted against the difference in accuracy, over all model pairs.

Differences vs. inconsistencies

Here is a more informative view of the source data used to compute the p-values. Any model pair to the right of the parabola is statistically distinguishable at the given significance level. The transition is fairly sharp because no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
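
To see where that boundary comes from, the sketch below (assuming the pairwise test is the two-sided sign test at level 0.05 described above) computes, for each value of #A_win + #B_win, the smallest |#A_win - #B_win| that reaches significance; the resulting curve grows roughly like the square root of the sum, which is the parabola in the plot.

```python
# Trace the significance boundary in the (#A_win + #B_win, |#A_win - #B_win|) plane
# for a two-sided sign test at alpha = 0.05.
from scipy.stats import binomtest

alpha = 0.05
for total in range(10, 501, 10):                 # total = #A_win + #B_win (ties ignored)
    for diff in range(total % 2, total + 1, 2):  # diff must have the same parity as total
        k = (total + diff) // 2                  # wins of the stronger model
        if binomtest(k, total, p=0.5, alternative="two-sided").pvalue < alpha:
            print(total, diff)                   # smallest significant difference
            break
```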

Results table by model

We show three methods currently used for evaluating models: raw accuracy (the metric reported by the benchmark), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when it is available; otherwise the ratings are centered so that the average is 1000.
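
As a rough sketch (not the code behind this page) of how the three metrics can be computed, the snippet below starts from a hypothetical pairwise win matrix, computes the average win rate, and fits Bradley-Terry coefficients that are then placed on an Elo-like scale; here the ratings are simply centered at 1000 instead of being anchored to GPT-3.5.

```python
import numpy as np

# Hypothetical pairwise win matrix: wins[i, j] = number of examples where
# model i is correct and model j is not (ties, where both agree, are excluded).
wins = np.array([[ 0., 30., 45.],
                 [20.,  0., 35.],
                 [10., 15.,  0.]])
games = wins + wins.T                          # non-tied comparisons per pair
n_models = wins.shape[0]

# Average win rate over all other models.
with np.errstate(invalid="ignore", divide="ignore"):
    per_opponent = wins / games                # NaN on the diagonal (no self-games)
win_rate = np.nanmean(per_opponent, axis=1)

# Bradley-Terry strengths via the standard minorization-maximization iteration.
p = np.ones(n_models)
for _ in range(500):
    denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
    p = wins.sum(axis=1) / denom
    p /= np.exp(np.log(p).mean())              # fix the scale (geometric mean = 1)

# Put strengths on an Elo-like scale; here the ratings are centered at 1000,
# whereas the page anchors GPT-3.5 at 1000 when it is available.
elo = 1000 + 400 * np.log10(p)
print(win_rate, elo)
```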

model  pass@1  win_rate  Elo
dbrx-base 65.9% 81.3% 1251.7
Meta-Llama-3-70B 65.0% 84.5% 1284.9
Mixtral-8x22B-v0.1 61.9% 81.1% 1241.0
DeepSeek-V2 60.2% 78.0% 1209.1
Mixtral-8x7B-v0.1 60.2% 78.6% 1214.0
deepseek-llm-67b-base 57.3% 73.4% 1163.3
llama_65B 55.2% 69.1% 1125.0
Qwen1.5-110B 55.0% 66.9% 1111.9
llama2_70B 54.6% 65.3% 1098.1
falcon-40b 54.4% 67.2% 1113.8
Mistral-7B-v0.1 54.2% 66.2% 1103.5
llama_33B 53.8% 65.7% 1100.5
Meta-Llama-3-8B 53.6% 64.6% 1095.9
gemma-7b 53.4% 64.2% 1089.9
Qwen1.5-72B 52.4% 61.8% 1073.0
llama2_13B 50.2% 55.8% 1036.2
Qwen1.5-32B 50.1% 56.3% 1036.0
mpt-30b 49.4% 55.2% 1030.1
llama_13B 48.6% 53.2% 1014.9
deepseek-moe-16b-base 47.6% 50.8% 1001.5
Qwen1.5-14B 45.6% 46.1% 965.9
llama_07B 44.9% 44.2% 959.0
deepseek-llm-7b-base 44.6% 43.4% 955.7
falcon-7b 44.1% 42.2% 947.3
llama2_07B 43.5% 42.0% 937.4
mpt-7b 42.5% 38.8% 926.3
Qwen1.5-7B 42.1% 38.7% 916.9
gemma-2b 41.7% 36.7% 906.6
stablelm-base-alpha-7b-v2 40.7% 34.3% 891.9
stablelm-3b-4e1t 39.7% 32.5% 878.0
Qwen1.5-4B 39.5% 33.5% 883.6
pythia-12b-deduped-v0 38.1% 29.4% 857.4
pythia-6.9b-deduped-v0 35.8% 25.3% 823.6
Qwen1.5-1.8B 34.3% 24.1% 806.2
pythia-2.8b-deduped 32.8% 21.5% 787.9
Qwen1.5-0.5B 29.4% 16.8% 727.7
pythia-1.4b-deduped-v0 27.9% 16.5% 725.7
pythia-1b-deduped 27.2% 15.0% 708.5