tqa: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance to win whenever they disagree; ties are ignored. The p-value is the probability, under this null hypothesis, of getting a difference at least as extreme as the one observed. Across all model pairs, it mainly depends on the difference in accuracy. Hover over each model pair for detailed information.
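Under these assumptions the test is a two-sided sign (binomial) test on the disagreements. Below is a minimal sketch, assuming per-question correctness vectors for the two models; the function name and inputs are hypothetical, not taken from this site's code.

```python
# Minimal sketch of the pairwise sign test described above (hypothetical helper).
from scipy.stats import binomtest

def pairwise_p_value(a_correct, b_correct):
    """Two-sided p-value under the null that A and B each win a disagreement
    with probability 1/2; questions where both models agree (ties) are ignored."""
    a_wins = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    b_wins = sum(1 for a, b in zip(a_correct, b_correct) if not a and b)
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # the two models never disagree
    return binomtest(a_wins, n, p=0.5, alternative="two-sided").pvalue

# Example: A wins 30 of the disagreements, B wins 12.
print(pairwise_p_value([1] * 30 + [0] * 12, [0] * 30 + [1] * 12))
```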

p-values vs. differences

The range of possible p-values versus the difference in accuracy, across all model pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically significantly different at the given level. The plot shows a fairly sharp transition: there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
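For reference, the shape of that boundary follows from the normal approximation to the binomial: with n = #A_win + #B_win disagreements, a pair is significant at level alpha roughly when |#A_win - #B_win| > z_(alpha/2) * sqrt(n), which traces a parabola. The sketch below is my reading of the plot under that approximation; the exact test uses the binomial distribution as above.

```python
# Rough sketch of the significance boundary under the normal approximation
# (the exact computation uses the binomial CDF, as in the sign test above).
import numpy as np
from scipy.stats import norm

alpha = 0.05
z = norm.ppf(1 - alpha / 2)   # two-sided critical value, ~1.96 for alpha = 0.05
n = np.arange(1, 401)         # total disagreements, #A_win + #B_win
min_diff = z * np.sqrt(n)     # smallest |#A_win - #B_win| that is significant
```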

Results table by model

We show three methods currently used for evaluating models: raw accuracy (as reported by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena); a sketch of the win-rate and Elo computations follows the table. Average win-rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when it is available; otherwise the average Elo is set to 1000.

model pass1 win_rate elo
dbrx-base 78.2% 92.6% 1442.6
Meta-Llama-3-70B 77.6% 92.8% 1442.6
Mixtral-8x22B-v0.1 77.0% 92.2% 1423.8
Qwen1.5-110B 74.6% 90.0% 1371.2
llama_65B 73.3% 87.7% 1328.6
Mixtral-8x7B-v0.1 73.1% 88.0% 1332.8
deepseek-llm-67b-base 72.9% 87.9% 1331.7
llama_33B 70.7% 84.2% 1272.2
Qwen1.5-72B 70.7% 83.3% 1264.3
llama2_70B 68.7% 79.0% 1217.8
falcon-40b 67.5% 78.5% 1204.5
Qwen1.5-32B 65.5% 74.0% 1164.4
Meta-Llama-3-8B 65.4% 74.3% 1161.5
Mistral-7B-v0.1 64.2% 72.1% 1142.9
llama_13B 63.6% 70.1% 1121.2
mpt-30b 60.8% 64.1% 1076.3
llama2_13B 60.4% 62.5% 1062.8
gemma-7b 60.3% 62.7% 1063.7
deepseek-moe-16b-base 59.1% 60.5% 1049.5
llama_07B 56.4% 54.0% 994.9
deepseek-llm-7b-base 54.4% 49.7% 970.5
Qwen1.5-14B 54.0% 48.8% 961.7
llama2_07B 52.6% 46.0% 935.5
falcon-7b 52.2% 44.8% 928.3
stablelm-base-alpha-7b-v2 49.6% 39.4% 888.1
stablelm-3b-4e1t 48.7% 37.8% 877.3
Qwen1.5-7B 48.1% 36.9% 873.3
gemma-2b 42.8% 27.7% 794.3
Qwen1.5-4B 39.4% 23.0% 748.9
pythia-12b-deduped-v0 37.8% 20.5% 724.0
pythia-6.9b-deduped-v0 33.2% 15.2% 658.3
Qwen1.5-1.8B 26.2% 9.4% 557.0
pythia-2.8b-deduped 24.1% 8.1% 531.8
pythia-1b-deduped 14.7% 3.9% 391.1
Qwen1.5-0.5B 13.4% 3.3% 359.5
pythia-1.4b-deduped-v0 12.7% 2.9% 330.9
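For concreteness, here is a minimal sketch of how the win-rate and Elo columns can be computed from pairwise win counts. The data structure `wins[a][b]` (number of questions model a got right and model b got wrong) and the function names are assumptions for illustration, not this site's actual code.

```python
# Hypothetical sketch: average win-rate and Bradley-Terry-based Elo from
# pairwise win counts wins[a][b] (ties between two models are excluded).
import numpy as np

def average_win_rate(wins, models):
    """Mean over opponents of #wins / (#wins + #losses)."""
    rates = {}
    for a in models:
        per_opponent = [wins[a][b] / (wins[a][b] + wins[b][a])
                        for b in models if b != a and wins[a][b] + wins[b][a] > 0]
        rates[a] = float(np.mean(per_opponent))
    return rates

def bradley_terry_elo(wins, models, iters=1000, anchor=None, anchor_elo=1000.0):
    """Fit Bradley-Terry strengths by iterative (MM) scaling, then map them to
    an Elo-like scale; anchor one model at anchor_elo, or set the average to
    anchor_elo if no anchor is given."""
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        for a in models:
            num = sum(wins[a][b] for b in models if b != a)
            den = sum((wins[a][b] + wins[b][a]) / (strength[a] + strength[b])
                      for b in models if b != a)
            if den > 0:
                strength[a] = max(num, 1e-9) / den  # epsilon avoids log(0) below
        # renormalize to geometric mean 1 to keep the scale fixed
        g = np.exp(np.mean([np.log(s) for s in strength.values()]))
        strength = {m: s / g for m, s in strength.items()}
    # Bradley-Terry strength p maps to Elo via R = (400 / ln 10) * ln p,
    # so that P(a beats b) = 1 / (1 + 10^{-(Ra - Rb)/400}).
    elo = {m: 400.0 / np.log(10.0) * np.log(s) for m, s in strength.items()}
    if anchor in elo:
        shift = anchor_elo - elo[anchor]
    else:
        shift = anchor_elo - np.mean(list(elo.values()))
    return {m: e + shift for m, e in elo.items()}
```

The Elo mapping uses the standard correspondence between Bradley-Terry strengths and Elo ratings (400 / ln 10 points per unit of log-strength), with the scale anchored at 1000 as described above.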