piqa: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever they disagree; ties are ignored. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one actually observed. Across all model pairs, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
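
As a concrete illustration, here is a minimal sketch of this sign test (an assumption about the computation, not necessarily the page's exact code): given how many of the disagreed-on examples each model wins, it returns the exact two-sided binomial p-value under the 1/2-chance null.

```python
from math import comb

def sign_test_pvalue(a_wins: int, b_wins: int) -> float:
    """Exact two-sided binomial (sign) test for a model pair.

    Under the null hypothesis, whenever A and B disagree on an example each
    one wins with probability 1/2; ties are ignored, so only the
    n = a_wins + b_wins discordant examples matter.
    """
    n = a_wins + b_wins
    if n == 0:
        return 1.0
    # Probability of every possible number of A-wins under Binomial(n, 1/2).
    pmf = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    # Sum the probability of all outcomes at least as unlikely as the observed one.
    return min(1.0, sum(p for p in pmf if p <= pmf[a_wins] * (1 + 1e-12)))

# e.g. sign_test_pvalue(60, 40) is roughly 0.057
```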

p-values vs. differences

The range of possible p-values as a function of the difference in accuracy, over all model pairs.

Differences vs inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given significance level. The plot shows a fairly sharp transition, since there are no model pairs with a small #A_win + #B_win, which rules out significant results at small values of |#A_win - #B_win|. For more explanation see the doc.
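
To make the parabola concrete, here is a small sketch of the boundary using the normal approximation to the Binomial(n, 1/2) null (an approximation for illustration, not the exact curve drawn in the figure):

```python
import math
from statistics import NormalDist

def min_significant_gap(n_disagreements: int, alpha: float = 0.05) -> float:
    """Approximate smallest |#A_win - #B_win| that is significant at level alpha.

    Under the null, the win difference over n = #A_win + #B_win disagreements
    has mean 0 and standard deviation sqrt(n), so the significance boundary is
    roughly |difference| = z_{alpha/2} * sqrt(n), i.e. the parabola
    n ~ (difference / z)^2 referred to above.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    return z * math.sqrt(n_disagreements)

# e.g. with 400 disagreements, a gap of about 40 wins (~1.96 * 20) is needed.
```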

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (as reported by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000. A sketch of the win-rate and Elo computations appears after the table.

model pass1 win_rate elo
Mixtral-8x22B-v0.1 85.4% 77.1% 1190.3
dbrx-base 85.4% 74.0% 1162.7
Meta-Llama-3-70B 84.4% 74.2% 1162.4
Qwen1.5-110B 84.3% 73.5% 1154.6
Mixtral-8x7B-v0.1 83.7% 71.1% 1132.3
deepseek-llm-67b-base 83.1% 68.5% 1110.6
falcon-40b 83.1% 67.9% 1108.6
Mistral-7B-v0.1 82.8% 66.7% 1098.8
Qwen1.5-32B 82.7% 65.5% 1091.4
Qwen1.5-72B 82.7% 66.1% 1093.7
llama_65B 82.6% 66.2% 1093.6
llama_33B 82.2% 63.5% 1073.4
mpt-30b 81.2% 58.1% 1036.5
Meta-Llama-3-8B 81.1% 57.3% 1031.9
gemma-7b 81.1% 57.2% 1030.0
llama2_70B 80.8% 54.9% 1019.0
falcon-7b 80.6% 55.0% 1016.5
deepseek-moe-16b-base 80.0% 51.9% 995.4
stablelm-base-alpha-7b-v2 80.0% 51.9% 1001.1
Qwen1.5-14B 79.9% 51.3% 991.3
llama_13B 79.9% 51.4% 990.6
stablelm-3b-4e1t 79.8% 50.5% 993.5
llama2_13B 79.7% 50.2% 987.2
llama_07B 79.5% 49.4% 981.8
Qwen1.5-7B 79.4% 48.7% 979.8
deepseek-llm-7b-base 79.4% 48.5% 973.9
gemma-2b 78.2% 42.8% 940.4
Qwen1.5-4B 77.3% 39.2% 915.0
pythia-12b-deduped-v0 77.0% 37.4% 901.9
llama2_07B 76.9% 39.8% 919.3
pythia-6.9b-deduped-v0 76.1% 34.0% 879.5
Qwen1.5-1.8B 74.4% 29.7% 849.7
pythia-2.8b-deduped 73.7% 26.8% 823.0
pythia-1b-deduped 70.1% 18.7% 744.0
pythia-1.4b-deduped-v0 69.6% 21.6% 772.6
Qwen1.5-0.5B 69.5% 19.6% 753.9
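
For reference, here is a minimal sketch of how the win_rate and elo columns could be derived from a pairwise win-count matrix; the tie handling, the MM fitting loop, and the anchoring convention are assumptions rather than this page's exact implementation.

```python
import numpy as np

def average_win_rate(wins: np.ndarray) -> np.ndarray:
    """wins[i, j] = number of examples where model i beats model j (ties excluded).
    Returns, for each model, the mean over all other models of
    wins_ij / (wins_ij + wins_ji)."""
    with np.errstate(invalid="ignore"):
        frac = wins / (wins + wins.T)
    np.fill_diagonal(frac, np.nan)          # a model is not compared with itself
    return np.nanmean(frac, axis=1)

def bradley_terry_elo(wins: np.ndarray, anchor=None, iters=500) -> np.ndarray:
    """Fit Bradley-Terry strengths with the standard MM update
        pi_i <- W_i / sum_{j != i} n_ij / (pi_i + pi_j)
    and report them on an Elo-like scale (400 * log10(pi) plus an offset).
    The anchor model is pinned to 1000; without an anchor the mean is 1000.
    Assumes every model wins at least one comparison."""
    n_games = wins + wins.T                 # comparisons per pair
    total_wins = wins.sum(axis=1)
    pi = np.ones(len(wins))
    for _ in range(iters):
        denom = n_games / (pi[:, None] + pi[None, :])
        np.fill_diagonal(denom, 0.0)
        pi = total_wins / denom.sum(axis=1)
        pi = pi / pi.sum()
    elo = 400.0 * np.log10(pi)
    offset = 1000.0 - (elo[anchor] if anchor is not None else elo.mean())
    return elo + offset
```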