lcb_codegen: by models



std predicted by accuracy

The typical standard deviation between pairs of models on this dataset, as a function of absolute accuracy.
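One plausible reading of this curve, sketched below as an assumption rather than the site's actual computation: if the plotted quantity is the binomial noise in the accuracy gap between two models evaluated on N examples, it peaks near 50% accuracy and shrinks toward 0% and 100%. Paired evaluation on the same problems correlates the two outcomes, so the figure's real computation may differ.

```python
# Hedged sketch (not the site's code): binomial noise in the accuracy gap between
# two models with accuracies p_a and p_b, assuming independent outcomes on
# n_examples problems.
import numpy as np

def pairwise_std(p_a, p_b, n_examples):
    """Std of the accuracy difference between two independently evaluated models."""
    return np.sqrt(p_a * (1 - p_a) / n_examples + p_b * (1 - p_b) / n_examples)

# Example: two ~25%-accuracy models on 400 examples carry ~3 points of noise.
print(pairwise_std(0.25, 0.25, 400))  # ~0.031
```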

Differences vs inconsistencies

Here is a more informative view of the underlying counts used to compute the p-values. Any model pair to the right of the parabola is statistically distinguishable at the given significance level. The transition is fairly sharp because no model pair has a small #A_win + #B_win, so a significant result always requires a sizable difference |#A_win - #B_win|; the sketch below traces this boundary. For more explanation, see the doc.
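To make the parabola concrete, here is a small sketch (assuming the two-sided sign test described in the next section; the site's exact computation may differ) that, for each number of disagreements, finds the smallest win-count difference that reaches significance.

```python
# For each number of disagreements n = #A_win + #B_win, find the smallest
# difference d = |#A_win - #B_win| that a two-sided sign test calls significant
# at level alpha. Plotting this minimal d against n traces the parabola-like
# boundary in the figure. (Assumed reconstruction, not the site's actual code.)
from scipy.stats import binomtest

def min_significant_diff(n, alpha=0.05):
    for d in range(n % 2, n + 1, 2):       # d must have the same parity as n
        k = (n + d) // 2                   # win count of the better model
        if binomtest(k, n, p=0.5, alternative="two-sided").pvalue < alpha:
            return d
    return None                            # n too small to reach significance

for n in (25, 50, 100, 200, 400):
    print(n, min_significant_diff(n))
```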

p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning on every example where they differ; ties are ignored. The p-value is the probability, under this null hypothesis, of a difference at least as extreme as the one observed. Across all model pairs, the significance level depends mainly on the accuracy difference, as shown here. Hover over each model pair for detailed information.
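A minimal sketch of this test as described, assuming per-example binary pass/fail outcomes (the arrays a_correct and b_correct are illustrative, not the site's data format):

```python
# Two-sided sign test for a model pair: drop ties (examples where A and B both
# pass or both fail), then ask how surprising the observed win-count difference
# is when each disagreement is a fair coin flip under the null hypothesis.
import numpy as np
from scipy.stats import binomtest

def pairwise_p_value(a_correct, b_correct):
    a_correct = np.asarray(a_correct, dtype=bool)
    b_correct = np.asarray(b_correct, dtype=bool)
    a_wins = int(np.sum(a_correct & ~b_correct))
    b_wins = int(np.sum(~a_correct & b_correct))
    n = a_wins + b_wins                    # disagreements only; ties are ignored
    if n == 0:
        return 1.0                         # the models never disagree
    return binomtest(a_wins, n, p=0.5, alternative="two-sided").pvalue
```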

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (used by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate consistently correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000. std: the standard deviation due to drawing examples from a population; this is the dominant term. std_i: the standard deviation due to drawing samples from the model on each example. std_total: the total standard deviation, satisfying std_total^2 = std^2 + std_i^2. In the table these appear as std(E(A)), E(std(A)), and std(A), respectively.
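For concreteness, here is a minimal sketch of how the three scores can be computed from a binary outcome matrix (one row per model, one column per example). This is an illustrative reconstruction, not the leaderboard's code: counting only outright wins in the win-rate and anchoring the Bradley-Terry ratings to a mean of 1000 are simplifying assumptions.

```python
# Illustrative sketch (assumptions noted above): accuracy, average win-rate over
# all other models, and Bradley-Terry coefficients mapped to an Elo-like scale.
import numpy as np
from scipy.optimize import minimize

def summarize(correct):
    """correct: bool array of shape (num_models, num_examples)."""
    m, n = correct.shape
    accuracy = correct.mean(axis=1)

    # wins[i, j] = number of examples model i solves but model j does not.
    wins = np.array([[np.sum(correct[i] & ~correct[j]) for j in range(m)]
                     for i in range(m)], dtype=float)

    # Average outright win-rate of model i against every other model
    # (ties are not credited here -- an assumption).
    win_rate = np.array([np.mean(np.delete(wins[i], i)) / n for i in range(m)])

    # Bradley-Terry: maximize the likelihood of the pairwise wins in strengths s,
    # where P(i beats j) = sigmoid(s_i - s_j).
    def neg_log_likelihood(s):
        diff = s[:, None] - s[None, :]
        return np.sum(wins * np.logaddexp(0.0, -diff))

    s = minimize(neg_log_likelihood, np.zeros(m), method="L-BFGS-B").x
    s -= s.mean()                           # simplified anchor: mean rating 1000
    elo = 1000 + s * 400 / np.log(10)       # convert log-odds to the Elo scale
    return accuracy, win_rate, elo
```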

model pass1 std(E(A)) E(std(A)) std(A) N win_rate elo
GPT-4O-2024-05-13 51.3 2.4 0.83 2.5 NaN 32.9 1110
GPT-4-Turbo-2024-04-09 43.5 2.3 0.98 2.5 NaN 25.8 1080
GPT-4-Turbo-1106 38.3 2.2 0.92 2.4 NaN 20.3 1060
Gemini-Pro-1.5 (May) 38.2 2.3 0.78 2.4 NaN 22.4 1070
Claude-3-Opus 35.4 2.3 0.63 2.4 NaN 18 1050
GPT-4-0613 34.8 2.2 0.82 2.4 NaN 18 1050
WCoder-33B-V1.1 30.5 2.1 0.91 2.3 NaN 14.5 1030
DSCoder-33b-Ins 30.3 2.1 0.9 2.3 NaN 15.1 1040
Gemini-Pro-1.5 (April) (n=1) 29.5 2.3 0 2.3 NaN 15 1040
LLama3-70b-Ins 29.3 2.2 0.61 2.3 NaN 14.5 1030
Eurux-8x22b-NCA (n=1) 27.3 2.2 0 2.2 NaN 12.7 1030
OC-DS-33B 26.6 2.1 0.78 2.2 NaN 11.2 1020
Claude-3-Sonnet 25.9 2.1 0.76 2.2 NaN 11.5 1020
Mistral-Large 25.7 2.1 0.58 2.2 NaN 11.5 1020
Mixtral-8x22B-Ins 25.5 2 0.77 2.2 NaN 11.3 1020
CodeQwen15-7B-Chat 25 1.9 0.99 2.2 NaN 14 1010
Claude-3-Haiku 24.6 2.1 0.64 2.2 NaN 10.4 1020
DSCoder-33b-Base 23.7 2 0.81 2.1 NaN 11.2 1010
Claude-2 23.6 2.1 0.53 2.1 NaN 10.1 1010
Eurus-70B-SFT (n=1) 23 2.1 0 2.1 NaN 9.56 1010
MagiCoderS-DS-6.7B 22.6 1.9 0.84 2.1 NaN 8.65 1010
GPT-3.5-Turbo-0125 22.5 2 0.66 2.1 NaN 9.48 1010
OC-DS-6.7B 21.9 1.9 0.79 2.1 NaN 8.45 1010
CodeQwen15-7B 21.8 1.9 0.87 2.1 NaN 8.79 1000
LLama3-70b-Base 21.8 1.9 0.73 2.1 NaN 8.55 1010
Claude-Instant-1 21.7 2 0.46 2.1 NaN 8.91 1010
DSCoder-6.7b-Ins 21.6 1.9 0.84 2.1 NaN 8.09 1000
GPT-3.5-Turbo-0301 21.2 1.9 0.82 2 NaN 8.59 1000
Phind-34B-V2 20.4 1.9 0.66 2 NaN 7.55 1000
Command-R+ 20.4 2 0.43 2 NaN 8 1000
DSCoder-6.7b-Base 19.1 1.8 0.77 2 NaN 7.28 995
Gemini-Pro 18.8 1.8 0.81 2 NaN 6.79 997
Smaug-2-72B 18.4 1.8 0.72 1.9 NaN 6.1 992
DBRX-Ins 17.5 1.7 0.75 1.9 NaN 6.27 992
LLama3-8b-Ins 17.3 1.8 0.62 1.9 NaN 6.05 992
WCoder-34B-V1 16.2 1.6 0.82 1.8 NaN 4.8 982
Qwen-1.5-72B-Chat 15.9 1.8 0.5 1.8 NaN 5.54 987
StarCoder2-15b 15.4 1.6 0.83 1.8 NaN 4.61 984
CodeGemma-7b-Base 14 1.5 0.95 1.7 NaN 3.64 978
Command-R 14 1.6 0.57 1.7 NaN 4.71 980
CodeLlama-34b-Base 12.3 1.4 0.8 1.6 NaN 3.14 971
Mixtral-8x7B-Ins 12.3 1.5 0.64 1.6 NaN 3.83 972
Cllama-13b-Ins 12 1.5 0.65 1.6 NaN 3.46 973
LLama3-8b-Base 11.9 1.5 0.69 1.6 NaN 2.93 972
Cllama-34b-Ins 11.6 1.5 0.58 1.6 NaN 3.95 971
MagiCoderS-CL-7B 11.4 1.4 0.74 1.6 NaN 2.83 971
StarCoder2-7b 11.3 1.4 0.7 1.6 NaN 2.65 967
DSCoder-1.3b-Ins 10.8 1.4 0.68 1.6 NaN 3.37 969
Cllama-7b-Ins 10.6 1.4 0.55 1.5 NaN 2.89 967
Gemma-7b-Base 9.95 1.3 0.79 1.5 NaN 2.29 965
CodeLlama-13b-Base 8.62 1.3 0.59 1.4 NaN 1.83 959
StarCoder2-3b 8.55 1.2 0.65 1.4 NaN 1.61 959
DSCoder-1.3b-Base 7.95 1.2 0.67 1.4 NaN 2.13 959
CodeGemma-2b-Base 6.97 1.1 0.64 1.3 NaN 1.34 954
CodeLlama-7b-Base 6.53 1.1 0.63 1.2 NaN 1.11 951
StableCode-3B 5.68 1 0.59 1.2 NaN 1.26 950
OC-DS-1.3B 5.05 0.94 0.57 1.1 NaN 0.693 946
Gemma-2b-Base 2.48 0.6 0.5 0.78 NaN 0.219 937