agi_english: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance to win whenever they disagree; ties are ignored. The p-value is the probability, under this null hypothesis, of a difference at least as extreme as the one observed. For all pairs of models it depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
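
Concretely, this is a two-sided sign test on the disagreements between the two models. A minimal sketch in Python (the function name and the example counts are illustrative, not taken from this page):

from math import comb

def sign_test_pvalue(a_wins: int, b_wins: int) -> float:
    """Two-sided p-value under the null that A and B each win a
    disagreement with probability 1/2 (ties are ignored)."""
    n = a_wins + b_wins                  # number of disagreements
    gap = abs(a_wins - b_wins)           # observed win-count gap
    # Probability of a gap at least as extreme under Binomial(n, 1/2).
    p = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= gap) / 2 ** n
    return min(p, 1.0)

# Example: A beats B on 60 disagreements, B beats A on 40 -> p is roughly 0.057.
print(sign_test_pvalue(60, 40))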

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given level. The plot shows a fairly sharp transition: there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
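
The parabola follows from the sign test itself: under the null, #A_win - #B_win has standard deviation sqrt(n) for n = #A_win + #B_win, so significance at level alpha roughly requires |#A_win - #B_win| >= z * sqrt(n), i.e. n >= (gap / z)^2. A small sketch using the exact test rather than the normal approximation (the function name and the printed values of n are illustrative):

from math import comb
from typing import Optional

def min_significant_gap(n: int, alpha: float = 0.05) -> Optional[int]:
    """Smallest |#A_win - #B_win| that is significant at level `alpha`
    for n = #A_win + #B_win disagreements (two-sided sign test)."""
    for gap in range(n % 2, n + 1, 2):   # the gap has the same parity as n
        p = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= gap) / 2 ** n
        if p <= alpha:
            return gap
    return None                          # not attainable for very small n

# Normal approximation: gap >= 1.96 * sqrt(n), i.e. the parabola n ~ (gap / 1.96)**2.
for n in (10, 50, 100, 500):
    print(n, min_significant_gap(n))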

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (as used by most benchmarks), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate consistently correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when it is available; otherwise the average Elo is set to 1000. A sketch of both computations follows the table below.

model  pass@1  win_rate  Elo
Qwen1.5-110B 65.2% 80.6% 1225.4
Meta-Llama-3-70B 63.7% 79.3% 1211.3
Qwen1.5-72B 63.2% 78.9% 1207.3
Qwen1.5-32B 61.4% 76.7% 1185.2
Mixtral-8x22B-v0.1 61.2% 76.6% 1184.6
dbrx-base 55.9% 70.9% 1133.8
deepseek-llm-67b-base 55.5% 70.3% 1129.9
Qwen1.5-14B 54.7% 69.6% 1125.4
Mixtral-8x7B-v0.1 50.4% 63.8% 1079.9
llama2_70B 48.9% 61.8% 1066.5
llama_65B 48.4% 61.6% 1070.4
Qwen1.5-7B 48.2% 61.2% 1064.3
Meta-Llama-3-8B 47.4% 59.9% 1055.3
gemma-7b 45.3% 56.9% 1037.8
Mistral-7B-v0.1 44.0% 55.2% 1025.1
Qwen1.5-4B 42.9% 53.5% 1015.2
llama_33B 41.4% 51.4% 999.7
llama2_13B 38.0% 46.5% 971.2
llama2_07B 34.8% 42.3% 945.2
deepseek-llm-7b-base 34.3% 41.4% 942.3
Qwen1.5-1.8B 34.1% 41.5% 942.0
mpt-30b 34.1% 41.3% 939.6
llama_13B 31.6% 38.0% 921.0
stablelm-base-alpha-7b-v2 31.3% 37.7% 915.2
stablelm-3b-4e1t 29.9% 35.6% 902.7
deepseek-moe-16b-base 29.7% 35.8% 904.4
Qwen1.5-0.5B 29.4% 35.5% 900.6
gemma-2b 27.3% 34.5% 891.9
llama_07B 24.6% 30.4% 866.8
pythia-12b-deduped-v0 24.5% 30.8% 869.5
pythia-2.8b-deduped 23.5% 30.1% 861.3
pythia-6.9b-deduped-v0 23.4% 29.8% 861.7
falcon-7b 22.9% 29.0% 852.5
pythia-1b-deduped 22.3% 28.6% 848.3
pythia-1.4b-deduped-v0 22.0% 28.0% 846.7
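
The scoring code is not shown on this page, so the following is a minimal sketch of both aggregate metrics under common conventions: win rate counts a tie on a task as half a win, and Bradley-Terry strengths are fit with the classic MM updates and mapped to the Elo scale (400 * log10), anchored so that a chosen reference model (GPT-3.5 when available) sits at 1000. The function names and the tie convention are assumptions, not the site's exact implementation.

import numpy as np

def average_win_rate(passed: np.ndarray) -> np.ndarray:
    """passed[i, t] = 1 if model i solves task t, else 0.
    Returns, for each model, its win rate averaged over all opponents,
    counting a tie on a task as half a win (one common convention)."""
    m = passed.shape[0]
    rates = np.zeros(m)
    for i in range(m):
        per_opponent = [
            (passed[i] > passed[j]).mean() + 0.5 * (passed[i] == passed[j]).mean()
            for j in range(m) if j != i
        ]
        rates[i] = np.mean(per_opponent)
    return rates

def bradley_terry_elo(wins: np.ndarray, anchor: int = 0,
                      anchor_elo: float = 1000.0, iters: int = 200) -> np.ndarray:
    """wins[i, j] = number of tasks where model i beats model j.
    Fits Bradley-Terry strengths with the standard MM updates, then maps
    them to an Elo-like scale anchored so model `anchor` has `anchor_elo`."""
    m = wins.shape[0]
    p = np.ones(m)
    for _ in range(iters):
        for i in range(m):
            num = wins[i].sum() + 1e-9       # smoothing avoids log(0) for winless models
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(m) if j != i)
            p[i] = num / den if den > 0 else p[i]
        p /= p.sum()
    elo = 400.0 * np.log10(p)                # Elo and Bradley-Terry share a logistic form
    return elo - elo[anchor] + anchor_elo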