safim: by models


p-values for model pairs

The null hypothesis is that, on every example where models A and B give different results, each has a 1/2 chance of being the winner; ties (examples where both give the same result) are ignored. The p-value is the probability, under the null hypothesis, of a difference at least as extreme as the one observed. Across model pairs, the p-value depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
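As a minimal sketch of the test described above, assuming it is an exact two-sided sign test (a binomial test with p = 1/2 over the examples where the two models disagree). The function name and its arguments are illustrative and are not taken from the project's code.

```python
from math import comb


def sign_test_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided exact sign test p-value.

    a_wins: examples where A is right and B is wrong.
    b_wins: examples where B is right and A is wrong.
    Ties are not passed in at all, matching "ties are ignored".
    """
    n = a_wins + b_wins
    if n == 0:
        # The models never disagree, so there is no evidence either way.
        return 1.0
    k = max(a_wins, b_wins)
    # Probability that a fair coin gives at least k heads in n flips.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    # Double the tail for a two-sided test; cap at 1.
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print(sign_test_p_value(60, 40))
```

With equal win counts the function returns 1.0, and the p-value shrinks as the two counts move apart.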

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs. inconsistencies

This figure gives a more informative view of the quantities used to compute the p-value. Any model pair to the right of the parabola differs significantly at the given level. The transition is fairly sharp because no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
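One way to read the parabola, sketched under a normal approximation to the sign test (an assumption; the page does not say how the curve is drawn). Under the null, #A_win - #B_win has mean 0 and variance #A_win + #B_win, so a pair is significant when |#A_win - #B_win| > z * sqrt(#A_win + #B_win), which is equivalent to #A_win + #B_win < (|#A_win - #B_win| / z)^2. The function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def critical_z(alpha: float = 0.05) -> float:
    """Two-sided critical value of the standard normal at level alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def max_disagreements(diff: float, alpha: float = 0.05) -> float:
    """Parabola: the value of #A_win + #B_win below which a gap of
    |#A_win - #B_win| = diff is significant (normal approximation)."""
    return (diff / critical_z(alpha)) ** 2


def is_significant(a_wins: int, b_wins: int, alpha: float = 0.05) -> bool:
    """True if the pair lies on the significant side of the parabola."""
    return abs(a_wins - b_wins) > critical_z(alpha) * sqrt(a_wins + b_wins)


if __name__ == "__main__":
    print(max_disagreements(20))   # boundary for a gap of 20 examples
    print(is_significant(60, 40))  # gap 20 out of 100 disagreements
```

Because the boundary grows with the square of the gap, pairs with many disagreements need a proportionally larger gap, which is why small gaps cannot be significant when #A_win + #B_win is large.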

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (used by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate always correlates well with Elo. GPT-3.5 gets an Elo of 1000 when available; otherwise the average is 1000.
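The two pairwise metrics can be sketched as follows. This is not the project's implementation, and several details are assumptions: the win-rate against one opponent is taken as #A_win / (#A_win + #B_win) with ties ignored, the Bradley-Terry strengths are fitted with a simple fixed-point (minorization-maximization) iteration, and ratings use the 400 * log10 Elo scale. The `wins` layout and the function names are hypothetical.

```python
import math


def average_win_rate(wins):
    """wins[a][b] = number of examples where a is right and b is wrong."""
    rates = {}
    for a in wins:
        per_opponent = []
        for b in wins:
            if a == b:
                continue
            decided = wins[a][b] + wins[b][a]
            # Assumption: a pair that never disagrees counts as 0.5.
            per_opponent.append(wins[a][b] / decided if decided else 0.5)
        rates[a] = sum(per_opponent) / len(per_opponent)
    return rates


def bradley_terry_elo(wins, anchor=None, iterations=200):
    """Fit Bradley-Terry strengths and report them on an Elo-like scale."""
    models = list(wins)
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for a in models:
            others = [b for b in models if b != a]
            total_wins = sum(wins[a][b] for b in others)
            denom = sum((wins[a][b] + wins[b][a]) / (strength[a] + strength[b])
                        for b in others)
            # Floor the win count so a model with zero wins keeps a finite rating.
            updated[a] = max(total_wins, 1e-9) / denom if denom > 0 else strength[a]
        # Rescale so the geometric mean of the strengths stays at 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    elo = {m: 400 * math.log10(s) for m, s in strength.items()}
    if anchor in elo:
        shift = 1000 - elo[anchor]                    # anchor model gets 1000
    else:
        shift = 1000 - sum(elo.values()) / len(elo)   # otherwise the mean is 1000
    return {m: r + shift for m, r in elo.items()}


if __name__ == "__main__":
    toy = {"a": {"b": 30}, "b": {"a": 10}}  # toy counts, not benchmark data
    print(average_win_rate(toy))
    print(bradley_terry_elo(toy, anchor="b"))
```

In the toy example, model `a` wins 30 of the 40 decided examples, so its win-rate is 0.75 and its fitted strength is three times that of `b`, a gap of 400 * log10(3) rating points.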

model                 pass1   win_rate  elo
deepseek-coder-33b    66.1%   89.6%     1363.7
deepseek-coder-6.7b   60.4%   84.2%     1281.6
wizardcoder-33b       58.1%   79.0%     1224.6
starcoderbase-16b     49.6%   69.1%     1140.3
codellama-13b         49.4%   67.3%     1123.0
gpt-4-1106-preview    48.8%   62.6%     1089.1
deepseek-coder-1.3b   47.9%   65.4%     1109.4
wizardcoder-15b       47.2%   64.5%     1104.7
codellama-34b         46.4%   61.8%     1084.6
codellama-7b          44.1%   57.1%     1047.1
mixtral-8x7b          42.2%   54.1%     1030.2
wizardcoder-3b        40.7%   51.7%     1018.4
gpt-3.5-turbo-0301    34.7%   41.5%     938.7
wizardcoder-1b        34.4%   38.9%     933.6
codegen-16b           30.8%   30.5%     872.5
codegen-6b            29.1%   27.3%     846.7
phi-2                 29.1%   27.1%     843.4
codegen-2b            28.2%   25.9%     835.1
incoder-6b            27.0%   27.4%     838.0
phi-1.5               24.4%   19.8%     774.0
incoder-1b            22.3%   20.8%     775.2
codegen-350m          21.2%   16.0%     726.1