arc_challenge: by models

p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning on any example where their outputs differ; ties are ignored. The p-value is the probability, under this null hypothesis, of a difference at least as extreme as the one actually observed. Across model pairs it depends mainly on the difference in accuracy.
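
As a concrete illustration, here is a minimal sketch of that test for a single model pair, assuming per-example correctness vectors for both models (the arrays below are made up for the example). It drops the ties and runs a two-sided binomial (sign) test with SciPy.

```python
# Minimal sketch of the pairwise test: ignore ties and apply a two-sided
# sign test to the examples where the two models disagree.
from scipy.stats import binomtest

# Hypothetical per-example correctness (1 = correct, 0 = wrong) for models A and B.
a_correct = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
b_correct = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]

a_wins = sum(a > b for a, b in zip(a_correct, b_correct))  # A right, B wrong
b_wins = sum(a < b for a, b in zip(a_correct, b_correct))  # B right, A wrong

# Under the null hypothesis, each non-tied example is a fair coin flip.
result = binomtest(a_wins, a_wins + b_wins, p=0.5, alternative="two-sided")
print(a_wins, b_wins, result.pvalue)
```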

p-values vs. differences

The range of possible p-values plotted against the difference in accuracy, over all model pairs.

Differences vs. inconsistencies

Here is a more informative view of the source data used to compute the p-values. Any model pair to the right of the parabola is statistically distinguishable at the given significance level. The transition is fairly sharp because no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
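
To see where that boundary comes from, the sketch below (assuming the pairwise test is the two-sided sign test at level 0.05 described above) computes, for each value of #A_win + #B_win, the smallest |#A_win - #B_win| that reaches significance; the resulting curve grows roughly like the square root of the sum, which is the parabola in the plot.

```python
# Trace the significance boundary in the (#A_win + #B_win, |#A_win - #B_win|) plane
# for a two-sided sign test at alpha = 0.05.
from scipy.stats import binomtest

alpha = 0.05
for total in range(10, 501, 10):                 # total = #A_win + #B_win (ties ignored)
    for diff in range(total % 2, total + 1, 2):  # diff must have the same parity as total
        k = (total + diff) // 2                  # wins of the stronger model
        if binomtest(k, total, p=0.5, alternative="two-sided").pvalue < alpha:
            print(total, diff)                   # smallest significant difference
            break
```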

Results table by model

We show three methods currently used for evaluating models: raw accuracy (the metric reported by the benchmark), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when it is available; otherwise the ratings are centered so that the average is 1000.
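
As a rough sketch (not the code behind this page) of how the three metrics can be computed, the snippet below starts from a hypothetical pairwise win matrix, computes the average win rate, and fits Bradley-Terry coefficients that are then placed on an Elo-like scale; here the ratings are simply centered at 1000 instead of being anchored to GPT-3.5.

```python
import numpy as np

# Hypothetical pairwise win matrix: wins[i, j] = number of examples where
# model i is correct and model j is not (ties, where both agree, are excluded).
wins = np.array([[ 0., 30., 45.],
                 [20.,  0., 35.],
                 [10., 15.,  0.]])
games = wins + wins.T                          # non-tied comparisons per pair
n_models = wins.shape[0]

# Average win rate over all other models.
with np.errstate(invalid="ignore", divide="ignore"):
    per_opponent = wins / games                # NaN on the diagonal (no self-games)
win_rate = np.nanmean(per_opponent, axis=1)

# Bradley-Terry strengths via the standard minorization-maximization iteration.
p = np.ones(n_models)
for _ in range(500):
    denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
    p = wins.sum(axis=1) / denom
    p /= np.exp(np.log(p).mean())              # fix the scale (geometric mean = 1)

# Put strengths on an Elo-like scale; here the ratings are centered at 1000,
# whereas the page anchors GPT-3.5 at 1000 when it is available.
elo = 1000 + 400 * np.log10(p)
print(win_rate, elo)
```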

model  pass@1  win_rate  Elo
dbrx-base 65.9% 81.3% 1251.7
Meta-Llama-3-70B 65.0% 84.5% 1284.9
Mixtral-8x22B-v0.1 61.9% 81.1% 1241.0
DeepSeek-V2 60.2% 78.0% 1209.1
Mixtral-8x7B-v0.1 60.2% 78.6% 1214.0
deepseek-llm-67b-base 57.3% 73.4% 1163.3
llama_65B 55.2% 69.1% 1125.0
Qwen1.5-110B 55.0% 66.9% 1111.9
llama2_70B 54.6% 65.3% 1098.1
falcon-40b 54.4% 67.2% 1113.8
Mistral-7B-v0.1 54.2% 66.2% 1103.5
llama_33B 53.8% 65.7% 1100.5
Meta-Llama-3-8B 53.6% 64.6% 1095.9
gemma-7b 53.4% 64.2% 1089.9
Qwen1.5-72B 52.4% 61.8% 1073.0
llama2_13B 50.2% 55.8% 1036.2
Qwen1.5-32B 50.1% 56.3% 1036.0
mpt-30b 49.4% 55.2% 1030.1
llama_13B 48.6% 53.2% 1014.9
deepseek-moe-16b-base 47.6% 50.8% 1001.5
Qwen1.5-14B 45.6% 46.1% 965.9
llama_07B 44.9% 44.2% 959.0
deepseek-llm-7b-base 44.6% 43.4% 955.7
falcon-7b 44.1% 42.2% 947.3
llama2_07B 43.5% 42.0% 937.4
mpt-7b 42.5% 38.8% 926.3
Qwen1.5-7B 42.1% 38.7% 916.9
gemma-2b 41.7% 36.7% 906.6
stablelm-base-alpha-7b-v2 40.7% 34.3% 891.9
stablelm-3b-4e1t 39.7% 32.5% 878.0
Qwen1.5-4B 39.5% 33.5% 883.6
pythia-12b-deduped-v0 38.1% 29.4% 857.4
pythia-6.9b-deduped-v0 35.8% 25.3% 823.6
Qwen1.5-1.8B 34.3% 24.1% 806.2
pythia-2.8b-deduped 32.8% 21.5% 787.9
Qwen1.5-0.5B 29.4% 16.8% 727.7
pythia-1.4b-deduped-v0 27.9% 16.5% 725.7
pythia-1b-deduped 27.2% 15.0% 708.5