nq: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever their answers differ; ties are ignored. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one actually observed. For all pairs of models, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
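To make the computation concrete, here is a minimal sketch of the p-value for one model pair, assuming a two-sided exact binomial (sign) test on the questions where the two models disagree; the win counts below are hypothetical placeholders, not values from this page.

    # Two-sided sign-test p-value for a model pair, under the null
    # hypothesis that A and B each win a disagreement with probability 1/2.
    from scipy.stats import binomtest

    a_wins = 120   # questions A solved and B did not (hypothetical)
    b_wins = 90    # questions B solved and A did not (hypothetical)

    n = a_wins + b_wins                       # ties are ignored
    result = binomtest(a_wins, n, p=0.5, alternative="two-sided")
    print(f"p-value = {result.pvalue:.4f}")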

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola differs significantly from each other at the given level. The plot shows a fairly sharp transition because there are no model pairs with a small #A_win + #B_win, which rules out significant results at small |#A_win - #B_win|. For more explanation see the doc.
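Under a normal approximation to the same sign test, the significance boundary is roughly |#A_win - #B_win| > z * sqrt(#A_win + #B_win), which traces out a parabola. The sketch below computes that boundary; the significance level and the exact form of the curve are illustrative assumptions, not necessarily the curve drawn in the figure.

    # Approximate significance boundary for the sign test: a pair with
    # |#A_win - #B_win| above this value (for its total number of
    # disagreements) differs significantly at level alpha.
    import numpy as np
    from scipy.stats import norm

    alpha = 0.05
    z = norm.ppf(1 - alpha / 2)          # ~1.96 for alpha = 0.05

    totals = np.arange(1, 501)           # #A_win + #B_win
    boundary = z * np.sqrt(totals)       # minimum |#A_win - #B_win|

    for t, b in zip(totals[::100], boundary[::100]):
        print(f"total disagreements {t:4d}: need |diff| > {b:.1f}")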

Results table by model

We show three methods currently used for evaluating code models: raw accuracy as reported by benchmarks, average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the scores are shifted so that the average is 1000. A sketch of the win-rate and Bradley-Terry computations appears after the table.

model  pass@1  win_rate  Elo
dbrx-base 48.8% 88.2% 1384.1
Meta-Llama-3-70B 43.2% 86.2% 1334.5
Mixtral-8x22B-v0.1 42.2% 85.8% 1326.1
Qwen1.5-110B 41.6% 85.1% 1315.7
llama_65B 38.2% 79.9% 1250.3
deepseek-llm-67b-base 37.7% 79.7% 1244.7
Mixtral-8x7B-v0.1 36.9% 78.5% 1235.0
Qwen1.5-72B 35.9% 75.7% 1210.0
llama_33B 34.8% 74.7% 1197.1
llama2_70B 33.3% 69.8% 1155.6
falcon-40b 33.3% 71.5% 1173.1
Qwen1.5-32B 30.7% 64.9% 1118.4
Meta-Llama-3-8B 29.9% 63.5% 1114.3
Mistral-7B-v0.1 29.2% 61.8% 1097.9
llama_13B 28.6% 60.1% 1088.1
llama2_13B 27.0% 55.6% 1053.7
deepseek-moe-16b-base 26.8% 55.3% 1057.7
mpt-30b 26.1% 53.6% 1045.3
gemma-7b 24.8% 49.9% 1021.6
Qwen1.5-14B 23.6% 46.7% 999.2
falcon-7b 22.6% 44.0% 982.7
llama_07B 22.5% 43.7% 973.9
llama2_07B 22.3% 43.4% 970.8
deepseek-llm-7b-base 22.1% 42.6% 973.6
Qwen1.5-7B 19.1% 35.1% 918.2
stablelm-3b-4e1t 17.6% 31.0% 884.3
stablelm-base-alpha-7b-v2 16.8% 29.3% 875.6
Qwen1.5-4B 15.8% 26.5% 848.9
gemma-2b 14.4% 23.7% 827.5
pythia-12b-deduped-v0 10.4% 15.1% 734.5
Qwen1.5-1.8B 10.1% 14.6% 725.6
pythia-6.9b-deduped-v0 8.8% 12.9% 704.1
pythia-2.8b-deduped 6.5% 8.2% 610.8
Qwen1.5-0.5B 5.4% 7.3% 588.4
pythia-1b-deduped 3.7% 5.5% 534.7
pythia-1.4b-deduped-v0 2.3% 3.1% 424.1
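Below is a rough sketch of how the last two columns can be derived from pairwise win counts: average win-rate as the mean fraction of won disagreements against each opponent, and Elo as rescaled Bradley-Terry coefficients fit with the standard MM iteration. The model names, win counts, and the anchoring at a mean of 1000 are illustrative assumptions, not the exact pipeline behind this table.

    # Average win-rate and Bradley-Terry ("Elo") scores from pairwise wins.
    # wins[i][j] = number of questions model i solved and model j did not.
    import numpy as np

    models = ["model_a", "model_b", "model_c"]        # hypothetical names
    wins = np.array([[0, 30, 50],
                     [20, 0, 40],
                     [10, 15, 0]], dtype=float)       # hypothetical counts

    # Average win-rate: mean over opponents of wins / (wins + losses).
    with np.errstate(invalid="ignore"):
        pairwise = wins / (wins + wins.T)
    np.fill_diagonal(pairwise, np.nan)
    avg_win_rate = np.nanmean(pairwise, axis=1)

    # Bradley-Terry strengths via the standard MM iteration:
    # p_i <- W_i / sum_j n_ij / (p_i + p_j), then renormalize.
    p = np.ones(len(models))
    games = wins + wins.T
    for _ in range(1000):
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p_new = wins.sum(axis=1) / denom.sum(axis=1)
        p_new /= p_new.mean()
        if np.allclose(p, p_new, rtol=1e-9):
            p = p_new
            break
        p = p_new

    # Convert to an Elo-like scale, shifted so the average is 1000.
    elo = 400 * np.log10(p)
    elo += 1000 - elo.mean()

    for name, wr, e in zip(models, avg_win_rate, elo):
        print(f"{name:10s}  win-rate {wr:.1%}  Elo {e:6.1f}")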