mmlu: by models

p-values for model pairs

The null hypothesis is that models A and B are each equally likely to win on any example where their answers differ; ties are ignored. The p-value is the probability, under this null hypothesis, of a difference at least as extreme as the one observed. For all pairs of models, it depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
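
The exact computation is not shown on this page; the following is a minimal sketch of a two-sided sign test consistent with the description above (the function name and the example counts are illustrative, not the site's actual code).

```python
# Minimal sign-test sketch: ties are dropped, and under the null hypothesis
# #A_win ~ Binomial(n, 1/2) with n = #A_win + #B_win.
from scipy.stats import binomtest

def pair_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided p-value for 'A and B differ', ignoring tied examples."""
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence either way
    return binomtest(a_wins, n, p=0.5, alternative="two-sided").pvalue

# Illustrative counts: A wins 620 of the 1160 examples where the models disagree.
print(pair_p_value(620, 540))  # roughly 0.02
```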

p-values vs. differences

The range of possible p-values vs. the difference in accuracy, over all model pairs.
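
A single accuracy difference corresponds to a range of p-values because the accuracy gap fixes #A_win - #B_win but not #A_win + #B_win. The sketch below illustrates this; the question count and accuracy gap are assumptions chosen for illustration, not numbers taken from this page.

```python
# For a fixed accuracy gap, #A_win - #B_win is fixed, but the number of
# decisive examples #A_win + #B_win varies from pair to pair, so the
# p-value can range from clearly significant to clearly not.
from scipy.stats import binomtest

N = 14042                 # MMLU test-set size (assumed here for illustration)
gap = 0.005               # a 0.5-point accuracy difference
diff = round(gap * N)     # #A_win - #B_win implied by the gap

for decisive in (diff, 10 * diff, 100 * diff):   # possible values of #A_win + #B_win
    a_wins = (decisive + diff) // 2
    p = binomtest(a_wins, decisive, p=0.5, alternative="two-sided").pvalue
    print(f"decisive={decisive:5d}  p={p:.3g}")
```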

Differences vs. inconsistencies

Here is a more informative figure showing the underlying counts used to compute the p-values. Any model pair to the right of the parabola is statistically significantly different at the given level. The plot shows a fairly sharp transition: no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
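
Under a normal approximation to the same sign test, the boundary can be sketched directly; this is an approximation written for illustration, not the exact curve drawn in the figure.

```python
# Approximate significance boundary: a pair is significant at level alpha
# roughly when |#A_win - #B_win| > z * sqrt(#A_win + #B_win), where
# z = norm.ppf(1 - alpha / 2). Plotted against |#A_win - #B_win|, this is
# the parabola  #A_win + #B_win < (|#A_win - #B_win| / z)^2.
from scipy.stats import norm

def min_significant_gap(decisive: int, alpha: float = 0.05) -> float:
    """Smallest |#A_win - #B_win| that is (approximately) significant."""
    z = norm.ppf(1 - alpha / 2)
    return z * decisive ** 0.5

for decisive in (100, 1000, 10000):
    print(decisive, round(min_significant_gap(decisive), 1))
```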

Results table by model

We show three methods currently used for evaluating models: raw accuracy as reported by the benchmark (the pass1 column), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate consistently correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the ratings are shifted so that the average is 1000. A sketch of the win-rate and Elo computation follows the table.

model pass1 win_rate elo
Qwen1.5-110B 81.1% 87.4% 1305.0
Meta-Llama-3-70B 78.7% 84.9% 1266.7
Mixtral-8x22B-v0.1 77.6% 83.9% 1251.0
Qwen1.5-72B 77.2% 83.7% 1248.8
dbrx-base 74.3% 79.9% 1204.5
Qwen1.5-32B 73.6% 79.4% 1198.3
deepseek-llm-67b-base 71.4% 77.2% 1174.9
Mixtral-8x7B-v0.1 70.3% 75.5% 1159.0
Qwen1.5-14B 67.8% 71.9% 1131.0
Meta-Llama-3-8B 65.3% 68.7% 1103.8
llama2_70B 63.2% 65.5% 1079.4
gemma-7b 62.6% 64.8% 1076.5
Mistral-7B-v0.1 62.5% 64.8% 1074.5
llama_65B 62.2% 64.1% 1071.9
Qwen1.5-7B 60.5% 61.5% 1052.2
llama_33B 57.0% 56.3% 1020.6
falcon-40b 55.4% 53.8% 1000.9
Qwen1.5-4B 55.2% 53.4% 1002.2
llama2_13B 53.7% 51.2% 985.6
deepseek-llm-7b-base 48.1% 43.3% 938.3
llama2_07B 47.3% 42.1% 929.0
mpt-30b 47.0% 41.8% 928.6
Qwen1.5-1.8B 45.6% 40.0% 917.2
llama_13B 45.6% 39.7% 915.6
deepseek-moe-16b-base 44.9% 39.0% 912.8
stablelm-3b-4e1t 44.4% 38.4% 906.2
stablelm-base-alpha-7b-v2 44.4% 38.3% 906.4
gemma-2b 41.0% 35.1% 883.5
Qwen1.5-0.5B 38.4% 31.8% 860.9
llama_07B 35.1% 28.9% 842.6
falcon-7b 27.2% 22.9% 790.4
pythia-2.8b-deduped 26.4% 23.0% 787.4
pythia-12b-deduped-v0 24.7% 20.8% 771.5
pythia-6.9b-deduped-v0 24.7% 20.7% 768.7
pythia-1b-deduped 24.6% 21.3% 772.3
pythia-1.4b-deduped-v0 23.3% 20.1% 761.4
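
The win-rate and Elo columns above can be reproduced from pairwise win counts. Here is a hedged sketch: the `wins` structure, iteration count, and anchoring rule are assumptions based on the description above, not the leaderboard's exact code.

```python
# Sketch: average win rate over all opponents, and Bradley-Terry strengths
# fit with the classic MM update, then mapped to an Elo-like scale.
# wins[a][b] = number of examples model a gets right while model b gets wrong.
import math

def average_win_rate(wins: dict[str, dict[str, int]]) -> dict[str, float]:
    models = list(wins)
    out = {}
    for a in models:
        rates = []
        for b in models:
            if a == b:
                continue
            n = wins[a][b] + wins[b][a]
            rates.append(wins[a][b] / n if n else 0.5)
        out[a] = sum(rates) / len(rates)
    return out

def bradley_terry_elo(wins, iters=500, anchor=None):
    """MM iterations for Bradley-Terry; ratings reported on the Elo scale.

    Ratings are shifted so `anchor` (e.g. GPT-3.5) sits at 1000, or so the
    mean is 1000 when no anchor is given. Models with zero wins would need
    special handling; this sketch assumes every model wins at least once.
    """
    models = list(wins)
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for a in models:
            total_wins = sum(wins[a][b] for b in models if b != a)
            denom = sum(
                (wins[a][b] + wins[b][a]) / (p[a] + p[b])
                for b in models if b != a
            )
            new_p[a] = total_wins / denom if denom else p[a]
        mean_log = sum(math.log(v) for v in new_p.values()) / len(new_p)
        p = {m: v / math.exp(mean_log) for m, v in new_p.items()}  # fix the scale
    scale = 400 / math.log(10)  # standard Elo logistic scale
    elo = {m: scale * math.log(v) for m, v in p.items()}
    shift = 1000 - (elo[anchor] if anchor in elo else sum(elo.values()) / len(elo))
    return {m: e + shift for m, e in elo.items()}
```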