siqa: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning on every example where they differ; ties are ignored. The p-value is the probability, under the null hypothesis, of a difference at least as extreme as the one observed. For all pairs of models this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
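As a concrete illustration, this pairwise test can be run as a two-sided binomial (sign) test on the examples where the two models disagree. The following is a minimal sketch assuming the counts #A_win and #B_win are already available; the function name and the example counts are hypothetical, not the site's actual code.

```python
# A minimal sketch of the paired sign test described above: ties are dropped,
# and under the null each remaining example is a fair coin flip between A and B.
from scipy.stats import binomtest

def pair_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided p-value for H0: P(A beats B | the models disagree) = 1/2."""
    n = a_wins + b_wins          # number of examples where the models disagree
    if n == 0:
        return 1.0               # the models never disagree; nothing to test
    return binomtest(a_wins, n, p=0.5, alternative="two-sided").pvalue

# Hypothetical counts: A wins 120 of the disagreements, B wins 90.
print(pair_p_value(120, 90))
```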

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of a parabola is statistically different at the corresponding significance level. The plot shows a fairly sharp transition, since there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
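The parabolas follow from the normal approximation to the sign test: writing n = #A_win + #B_win and d = |#A_win - #B_win|, a pair is significant at level alpha roughly when d > z_{alpha/2} * sqrt(n), so the boundary in the plot is the parabola n = (d / z)^2. The sketch below computes that approximate boundary; names are illustrative, and it is not the exact computation behind the figure.

```python
# Approximate significance boundary for the plot above: for a given difference
# d = |#A_win - #B_win|, how many disagreements n = #A_win + #B_win can a pair
# have and still be (roughly) significant at level alpha?
import numpy as np
from scipy.stats import norm

def boundary_disagreements(diff: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    """Largest n at which a difference of `diff` is still significant (normal approx.)."""
    z = norm.ppf(1 - alpha / 2)   # two-sided critical value, ~1.96 for alpha=0.05
    return (diff / z) ** 2        # the parabola n = (d / z)^2

diffs = np.arange(0, 101)
print(boundary_disagreements(diffs, alpha=0.05)[:5])
```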

Results table by model

We show three methods currently used for evaluating models: raw accuracy (as reported by benchmarks), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000.
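For concreteness, here is one way the latter two summaries could be computed from a matrix of pairwise wins: average win rate is the mean of per-opponent win rates, and Elo comes from fitting Bradley-Terry strengths (standard MM updates) and mapping them to the Elo scale, anchoring the average at 1000 as described above. The matrix and numbers below are hypothetical; this is a sketch under those assumptions, not the leaderboard's implementation.

```python
# wins[i][j] = number of examples where model i beats model j (ties ignored).
import numpy as np

def average_win_rate(wins: np.ndarray) -> np.ndarray:
    """Win rate of each model against each opponent, averaged over opponents."""
    games = wins + wins.T                          # disagreements per pair
    with np.errstate(invalid="ignore", divide="ignore"):
        rate = np.where(games > 0, wins / games, 0.5)  # 0.5 if a pair never disagrees
    np.fill_diagonal(rate, np.nan)                 # a model does not play itself
    return np.nanmean(rate, axis=1)

def bradley_terry_elo(wins: np.ndarray, iters: int = 1000) -> np.ndarray:
    """Fit Bradley-Terry strengths by MM updates, then map them to the Elo scale."""
    games = wins + wins.T
    p = np.ones(wins.shape[0])
    for _ in range(iters):
        # MM update: p_i <- W_i / sum_j games_ij / (p_i + p_j)
        denom = (games / (p[:, None] + p[None, :] + 1e-12)).sum(axis=1)
        p = wins.sum(axis=1) / np.maximum(denom, 1e-12)
        p /= p.mean()
    elo = 400.0 * np.log10(p)                      # P(i beats j) = 1/(1+10^(-(Ri-Rj)/400))
    return elo - elo.mean() + 1000.0               # anchor the average Elo at 1000

# Tiny hypothetical example with three models.
wins = np.array([[0, 60, 70],
                 [40, 0, 55],
                 [30, 45, 0]], dtype=float)
print(average_win_rate(wins), bradley_terry_elo(wins))
```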

model pass1 win_rate elo
dbrx-base 66.2% 77.9% 1213.4
Qwen1.5-110B 58.8% 68.5% 1128.8
Qwen1.5-72B 57.2% 65.6% 1107.4
Qwen1.5-14B 56.9% 62.4% 1082.2
Qwen1.5-32B 56.9% 64.5% 1098.3
llama2_13B 53.5% 59.0% 1065.9
Qwen1.5-7B 53.5% 57.7% 1054.0
llama2_70B 52.5% 56.4% 1048.5
Meta-Llama-3-70B 52.3% 56.5% 1053.4
llama_65B 52.1% 55.8% 1050.5
gemma-7b 51.6% 53.8% 1034.2
Mixtral-8x22B-v0.1 51.4% 53.5% 1033.3
falcon-40b 51.3% 52.9% 1029.6
deepseek-llm-67b-base 50.8% 51.2% 1018.8
llama_13B 50.6% 50.7% 1017.0
Mixtral-8x7B-v0.1 50.4% 49.8% 1009.4
llama_33B 50.2% 49.1% 1003.5
llama2_07B 50.0% 48.8% 997.4
Mistral-7B-v0.1 49.4% 46.4% 987.2
deepseek-llm-7b-base 49.0% 45.1% 978.1
Qwen1.5-4B 49.0% 46.5% 977.4
Meta-Llama-3-8B 48.8% 44.3% 973.6
llama_07B 48.8% 44.4% 974.8
falcon-7b 48.7% 43.8% 972.3
mpt-30b 48.5% 43.5% 966.3
gemma-2b 47.6% 40.8% 948.1
Qwen1.5-1.8B 47.2% 42.6% 951.5
stablelm-base-alpha-7b-v2 47.0% 38.1% 932.7
pythia-12b-deduped-v0 46.7% 37.3% 928.5
deepseek-moe-16b-base 46.6% 37.7% 928.9
stablelm-3b-4e1t 46.5% 36.7% 920.7
Qwen1.5-0.5B 45.9% 39.2% 930.2
pythia-6.9b-deduped-v0 45.5% 34.0% 904.1
pythia-2.8b-deduped 45.3% 34.6% 907.9
pythia-1b-deduped 44.3% 32.3% 888.8
pythia-1.4b-deduped-v0 43.9% 31.9% 883.1