siqa: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning on every example where they differ; ties are ignored. The p-value is the probability, under the null hypothesis, of a difference at least as extreme as the one observed. For all pairs of models this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
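As a concrete illustration, this pairwise test can be run as a two-sided binomial (sign) test on the examples where the two models disagree. The following is a minimal sketch assuming the counts #A_win and #B_win are already available; the function name and the example counts are hypothetical, not the site's actual code.

```python
# A minimal sketch of the paired sign test described above: ties are dropped,
# and under the null each remaining example is a fair coin flip between A and B.
from scipy.stats import binomtest

def pair_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided p-value for H0: P(A beats B | the models disagree) = 1/2."""
    n = a_wins + b_wins          # number of examples where the models disagree
    if n == 0:
        return 1.0               # the models never disagree; nothing to test
    return binomtest(a_wins, n, p=0.5, alternative="two-sided").pvalue

# Hypothetical counts: A wins 120 of the disagreements, B wins 90.
print(pair_p_value(120, 90))
```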

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of a parabola is statistically different at the corresponding significance level. The plot shows a fairly sharp transition, since there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
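The parabolas follow from the normal approximation to the sign test: writing n = #A_win + #B_win and d = |#A_win - #B_win|, a pair is significant at level alpha roughly when d > z_{alpha/2} * sqrt(n), so the boundary in the plot is the parabola n = (d / z)^2. The sketch below computes that approximate boundary; names are illustrative, and it is not the exact computation behind the figure.

```python
# Approximate significance boundary for the plot above: for a given difference
# d = |#A_win - #B_win|, how many disagreements n = #A_win + #B_win can a pair
# have and still be (roughly) significant at level alpha?
import numpy as np
from scipy.stats import norm

def boundary_disagreements(diff: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    """Largest n at which a difference of `diff` is still significant (normal approx.)."""
    z = norm.ppf(1 - alpha / 2)   # two-sided critical value, ~1.96 for alpha=0.05
    return (diff / z) ** 2        # the parabola n = (d / z)^2

diffs = np.arange(0, 101)
print(boundary_disagreements(diffs, alpha=0.05)[:5])
```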

Results table by model

We show three methods currently used for evaluating models: raw accuracy (as reported by benchmarks), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000.
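For concreteness, here is one way the latter two summaries could be computed from a matrix of pairwise wins: average win rate is the mean of per-opponent win rates, and Elo comes from fitting Bradley-Terry strengths (standard MM updates) and mapping them to the Elo scale, anchoring the average at 1000 as described above. The matrix and numbers below are hypothetical; this is a sketch under those assumptions, not the leaderboard's implementation.

```python
# wins[i][j] = number of examples where model i beats model j (ties ignored).
import numpy as np

def average_win_rate(wins: np.ndarray) -> np.ndarray:
    """Win rate of each model against each opponent, averaged over opponents."""
    games = wins + wins.T                          # disagreements per pair
    with np.errstate(invalid="ignore", divide="ignore"):
        rate = np.where(games > 0, wins / games, 0.5)  # 0.5 if a pair never disagrees
    np.fill_diagonal(rate, np.nan)                 # a model does not play itself
    return np.nanmean(rate, axis=1)

def bradley_terry_elo(wins: np.ndarray, iters: int = 1000) -> np.ndarray:
    """Fit Bradley-Terry strengths by MM updates, then map them to the Elo scale."""
    games = wins + wins.T
    p = np.ones(wins.shape[0])
    for _ in range(iters):
        # MM update: p_i <- W_i / sum_j games_ij / (p_i + p_j)
        denom = (games / (p[:, None] + p[None, :] + 1e-12)).sum(axis=1)
        p = wins.sum(axis=1) / np.maximum(denom, 1e-12)
        p /= p.mean()
    elo = 400.0 * np.log10(p)                      # P(i beats j) = 1/(1+10^(-(Ri-Rj)/400))
    return elo - elo.mean() + 1000.0               # anchor the average Elo at 1000

# Tiny hypothetical example with three models.
wins = np.array([[0, 60, 70],
                 [40, 0, 55],
                 [30, 45, 0]], dtype=float)
print(average_win_rate(wins), bradley_terry_elo(wins))
```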

model pass1 win_rate elo
dbrx-base 66.2% 77.9% 1213.4
Qwen1.5-110B 58.8% 68.5% 1128.8
Qwen1.5-72B 57.2% 65.6% 1107.4
Qwen1.5-14B 56.9% 62.4% 1082.2
Qwen1.5-32B 56.9% 64.5% 1098.3
llama2_13B 53.5% 59.0% 1065.9
Qwen1.5-7B 53.5% 57.7% 1054.0
llama2_70B 52.5% 56.4% 1048.5
Meta-Llama-3-70B 52.3% 56.5% 1053.4
llama_65B 52.1% 55.8% 1050.5
gemma-7b 51.6% 53.8% 1034.2
Mixtral-8x22B-v0.1 51.4% 53.5% 1033.3
falcon-40b 51.3% 52.9% 1029.6
deepseek-llm-67b-base 50.8% 51.2% 1018.8
llama_13B 50.6% 50.7% 1017.0
Mixtral-8x7B-v0.1 50.4% 49.8% 1009.4
llama_33B 50.2% 49.1% 1003.5
llama2_07B 50.0% 48.8% 997.4
Mistral-7B-v0.1 49.4% 46.4% 987.2
deepseek-llm-7b-base 49.0% 45.1% 978.1
Qwen1.5-4B 49.0% 46.5% 977.4
Meta-Llama-3-8B 48.8% 44.3% 973.6
llama_07B 48.8% 44.4% 974.8
falcon-7b 48.7% 43.8% 972.3
mpt-30b 48.5% 43.5% 966.3
gemma-2b 47.6% 40.8% 948.1
Qwen1.5-1.8B 47.2% 42.6% 951.5
stablelm-base-alpha-7b-v2 47.0% 38.1% 932.7
pythia-12b-deduped-v0 46.7% 37.3% 928.5
deepseek-moe-16b-base 46.6% 37.7% 928.9
stablelm-3b-4e1t 46.5% 36.7% 920.7
Qwen1.5-0.5B 45.9% 39.2% 930.2
pythia-6.9b-deduped-v0 45.5% 34.0% 904.1
pythia-2.8b-deduped 45.3% 34.6% 907.9
pythia-1b-deduped 44.3% 32.3% 888.8
pythia-1.4b-deduped-v0 43.9% 31.9% 883.1