tqa: by models



std predicted by accuracy

The typical standard deviation between pairs of models on this dataset, shown as a function of absolute accuracy.
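As a rough illustration of why the standard deviation can be predicted from accuracy, the sketch below uses the binomial standard error of a single model's accuracy. This is an assumption about the shape of the curve, not the page's own computation; the function name `accuracy_std` and the example count `n_examples` are hypothetical, and the dataset size is not stated here.

```python
import math


def accuracy_std(accuracy: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy estimated on n_examples.

    Assumes each example is an independent pass/fail draw, so the
    variance of the mean is p * (1 - p) / n.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


# Illustrative only: 10_000 is a made-up dataset size.
for acc in (0.1, 0.5, 0.9):
    print(f"accuracy={acc:.1f} std={accuracy_std(acc, 10_000):.4f}")
```

Under this assumption the standard deviation peaks at 50% accuracy and shrinks toward either extreme, which is the general shape to expect in the figure.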

Differences vs inconsistencies

This figure gives a more informative view of the data used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given level. The plot shows a fairly sharp transition because no model pair has a small #A_win + #B_win, which rules out significant results at a small difference |#A_win - #B_win|. For more explanation, see the doc.
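One way to see where a parabola comes from is the normal approximation to the sign test: under the null, #A_win - #B_win has standard deviation sqrt(#A_win + #B_win), so the significance boundary is #A_win + #B_win = (|#A_win - #B_win| / z)^2. The sketch below assumes this approximation and a two-sided 5% level (z = 1.96); the page may use an exact test or a different level, and the function names are hypothetical.

```python
import math


def boundary_total(diff: float, z: float = 1.96) -> float:
    """Largest #A_win + #B_win at which a given |#A_win - #B_win|
    is still significant under the normal approximation."""
    return (diff / z) ** 2


def is_significant(a_wins: int, b_wins: int, z: float = 1.96) -> bool:
    """True if the pair lies on the significant side of the parabola."""
    total = a_wins + b_wins
    if total == 0:
        return False  # the models never disagree, so there is no evidence
    return abs(a_wins - b_wins) / math.sqrt(total) > z


print(is_significant(60, 40))  # z-score is 20 / 10 = 2.0
print(is_significant(55, 45))  # z-score is 10 / 10 = 1.0
```

The same difference of 20 wins would not be significant if the two models disagreed on 1,000 examples instead of 100, which is why the total number of disagreements matters as much as the difference.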

p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever their results differ; ties are ignored. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one observed. For all pairs of models, the significance level depends mainly on the accuracy difference, as shown here. Hover over each model pair for detailed information.
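The null hypothesis described above is that of a sign test, which can be sketched as an exact two-sided binomial test. The function name `sign_test_p_value` is hypothetical, and whether the page uses an exact or an approximate computation is not stated, so treat this as one reasonable reading.

```python
from math import comb


def sign_test_p_value(a_wins: int, b_wins: int) -> float:
    """Exact two-sided sign test; ties are assumed already dropped.

    Under the null, the number of A wins is Binomial(n, 1/2) with
    n = a_wins + b_wins. The distribution is symmetric, so the
    two-sided p-value is twice the upper tail at max(a_wins, b_wins),
    capped at 1.
    """
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # the models never disagree
    k = max(a_wins, b_wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


print(sign_test_p_value(10, 0))  # 2 / 1024
print(sign_test_p_value(5, 5))   # 1.0
```

Only the examples where exactly one of the two models is correct enter the test, so two models with very different accuracies but few disagreements can still yield a large p-value.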

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (used by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate always correlates well with Elo. GPT-3.5 gets an Elo of 1000 when available; otherwise the average is 1000. std: the standard deviation due to drawing examples from a population; this is the dominant term. std_i: the standard deviation due to drawing samples from the model on each example. std_total: the total standard deviation, satisfying std_total^2 = std^2 + std_i^2. In the table below these appear as the columns std(E(A)), E(std(A)), and std(A), respectively.

model                      pass1  std(E(A))  E(std(A))  std(A)  N    win_rate  elo
dbrx-base                  78.2   0.39       0          0.39    NaN  26.4      1.09e+03
Meta-Llama-3-70B           77.6   0.39       0          0.39    NaN  25.7      1.09e+03
Mixtral-8x22B-v0.1         77     0.4        0          0.4     NaN  25.3      1.08e+03
Qwen1.5-110B               74.6   0.41       0          0.41    NaN  23.2      1.07e+03
llama_65B                  73.3   0.42       0          0.42    NaN  22.5      1.07e+03
Mixtral-8x7B-v0.1          73.1   0.42       0          0.42    NaN  22.2      1.07e+03
deepseek-llm-67b-base      72.9   0.42       0          0.42    NaN  21.9      1.07e+03
llama_33B                  70.7   0.43       0          0.43    NaN  20.5      1.06e+03
Qwen1.5-72B                70.7   0.43       0          0.43    NaN  20.8      1.06e+03
llama2_70B                 68.7   0.44       0          0.44    NaN  19.8      1.05e+03
falcon-40b                 67.5   0.44       0          0.44    NaN  18.4      1.05e+03
Qwen1.5-32B                65.5   0.45       0          0.45    NaN  17.4      1.04e+03
Meta-Llama-3-8B            65.4   0.45       0          0.45    NaN  17        1.04e+03
Mistral-7B-v0.1            64.2   0.45       0          0.45    NaN  16.2      1.04e+03
llama_13B                  63.6   0.45       0          0.45    NaN  16.1      1.03e+03
mpt-30b                    60.8   0.46       0          0.46    NaN  14.6      1.02e+03
llama2_13B                 60.4   0.46       0          0.46    NaN  14.9      1.02e+03
gemma-7b                   60.3   0.46       0          0.46    NaN  14.7      1.02e+03
deepseek-moe-16b-base      59.1   0.46       0          0.46    NaN  13.7      1.02e+03
llama_07B                  56.4   0.47       0          0.47    NaN  12.5      1.01e+03
deepseek-llm-7b-base       54.4   0.47       0          0.47    NaN  11.5      1e+03
Qwen1.5-14B                54     0.47       0          0.47    NaN  11.7      999
llama2_07B                 52.6   0.47       0          0.47    NaN  11.3      994
falcon-7b                  52.2   0.47       0          0.47    NaN  10.5      992
stablelm-base-alpha-7b-v2  49.6   0.47       0          0.47    NaN  9.56      983
stablelm-3b-4e1t           48.7   0.47       0          0.47    NaN  9.34      980
Qwen1.5-7B                 48.1   0.47       0          0.47    NaN  9.38      977
gemma-2b                   42.8   0.47       0          0.47    NaN  7.5       958
Qwen1.5-4B                 39.4   0.46       0          0.46    NaN  6.68      945
pythia-12b-deduped-v0      37.8   0.46       0          0.46    NaN  6.01      940
pythia-6.9b-deduped-v0     33.2   0.44       0          0.44    NaN  4.82      922
Qwen1.5-1.8B               26.2   0.41       0          0.41    NaN  3.37      896
pythia-2.8b-deduped        24.1   0.4        0          0.4     NaN  3.02      887
pythia-1b-deduped          14.7   0.33       0          0.33    NaN  1.74      849
Qwen1.5-0.5B               13.4   0.32       0          0.32    NaN  1.51      843
pythia-1.4b-deduped-v0     12.7   0.31       0          0.31    NaN  1.32      840
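
For readers who want to see how an Elo column of this kind can be derived from pairwise wins, here is a minimal sketch that fits Bradley-Terry strengths with the standard minorization-maximization update and maps them to an Elo-like scale. Several details are assumptions rather than the page's method: the 400 * log10 scale, centering the average at 1000 (the page anchors GPT-3.5 at 1000 when it is available), the fixed iteration count, and the names `bradley_terry_elo` and `wins`.

```python
import math


def bradley_terry_elo(wins, iters=200):
    """Fit Bradley-Terry strengths and return Elo-like ratings.

    wins[i][j] is the number of examples where model i beat model j
    (ties dropped). Assumes every model has at least one win, since a
    zero strength has no finite log rating.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        # Strengths are only defined up to a common factor; rescale
        # so they stay bounded across iterations.
        scale = n / sum(new_p)
        p = [x * scale for x in new_p]
    elo = [400.0 * math.log10(x) for x in p]
    shift = 1000.0 - sum(elo) / n  # assumed anchor: average rating is 1000
    return [e + shift for e in elo]


# Toy example with two hypothetical models: model 0 wins 3 of 4 disagreements.
print(bradley_terry_elo([[0, 3], [1, 0]]))
```

With only two models the fitted strength ratio equals the win ratio, so the rating gap is 400 * log10(3); with more models the fit balances all pairwise comparisons at once.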