lcb_codegen: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance to win whenever they disagree; ties are ignored. The p-value is the probability, under this null hypothesis, of observing a difference at least as extreme as the one actually observed. For all pairs of models, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
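Under this null, the number of wins for A among the decisive (non-tied) problems follows a Binomial(n, 1/2) distribution, so the p-value can be obtained from a two-sided sign test. Below is a minimal sketch of that computation; the helper name pair_p_value and the example counts are illustrative, and the leaderboard's exact implementation may differ.

```python
from scipy.stats import binomtest

def pair_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided sign-test p-value for the null that A and B are equally
    likely to win whenever they disagree (ties are dropped)."""
    n = a_wins + b_wins                # number of decisive problems
    if n == 0:
        return 1.0                     # no decisive problems, no evidence either way
    return binomtest(a_wins, n=n, p=0.5, alternative="two-sided").pvalue

# Example (hypothetical counts): A wins 60 decisive problems, B wins 40.
print(pair_p_value(60, 40))            # ~0.057
```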

p-values vs. differences

The range of possible p-values plotted against the difference in accuracy, over all model pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given significance level. The plot shows a fairly sharp transition: there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
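As a rough guide, the shape of the parabola follows from the normal approximation to the sign test: a pair is significant at level alpha roughly when |#A_win - #B_win| >= z_{alpha/2} * sqrt(#A_win + #B_win). The sketch below illustrates this boundary; the function name and the choice of alpha = 0.05 are illustrative assumptions, not the page's exact construction.

```python
from math import sqrt
from scipy.stats import norm

def min_significant_gap(n_decisive: int, alpha: float = 0.05) -> float:
    """Approximate smallest |#A_win - #B_win| that is significant at level
    alpha for a given number of decisive (non-tied) problems, via the
    normal approximation to the two-sided sign test."""
    z = norm.ppf(1 - alpha / 2)        # ~1.96 for alpha = 0.05
    return z * sqrt(n_decisive)        # the parabola: gap grows like sqrt(n)

for n in (50, 100, 200, 400):
    print(n, round(min_significant_gap(n), 1))
# 50 -> 13.9, 100 -> 19.6, 200 -> 27.7, 400 -> 39.2
```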

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (pass@1, as reported by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000.
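For concreteness, here is a minimal sketch of how the average win-rate and Bradley-Terry Elo columns could be computed from a matrix of pairwise win counts. The MM fitting loop, the 400*log10 mapping to the Elo scale, and the anchoring step are standard Bradley-Terry/Elo conventions, but the function names are illustrative and the leaderboard's exact fitting procedure (following Chatbot Arena) may differ in detail.

```python
import numpy as np

def bradley_terry_elo(wins: np.ndarray, anchor: int = 0,
                      anchor_elo: float = 1000.0, iters: int = 1000) -> np.ndarray:
    """Fit Bradley-Terry strengths from a pairwise win-count matrix
    (wins[i, j] = #problems model i solved but model j did not) using the
    standard MM updates, then map to the Elo scale and shift so that the
    anchor model (e.g. GPT-3.5) sits at anchor_elo."""
    m = wins.shape[0]
    p = np.ones(m)                                   # strengths
    total = wins + wins.T                            # decisive comparisons per pair
    for _ in range(iters):
        for i in range(m):
            denom = sum(total[i, j] / (p[i] + p[j]) for j in range(m) if j != i)
            if denom > 0:
                # small smoothing keeps strengths positive for winless models
                p[i] = (wins[i].sum() + 1e-6) / denom
        p /= p.sum()                                 # fix the overall scale
    elo = 400.0 * np.log10(p)                        # Bradley-Terry -> Elo scale
    return elo - elo[anchor] + anchor_elo            # anchor at anchor_elo

def average_win_rate(wins: np.ndarray) -> np.ndarray:
    """Win rate of each model against every other model, averaged over opponents;
    pairs with no decisive problems count as 0.5."""
    total = wins + wins.T
    with np.errstate(invalid="ignore"):
        rates = np.where(total > 0, wins / total, 0.5)
    np.fill_diagonal(rates, np.nan)
    return np.nanmean(rates, axis=1)
```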

model   pass@1   std   win_rate   elo
GPT-4O-2024-05-13 51.3% 0.83% 96.7% 1667.6
GPT-4-Turbo-2024-04-09 43.5% 0.98% 90.7% 1466.4
GPT-4-Turbo-1106 38.3% 0.92% 87.9% 1411.0
Gemini-Pro-1.5 (May) 38.2% 0.78% 88.6% 1435.5
Claude-3-Opus 35.4% 0.63% 85.4% 1373.7
GPT-4-0613 34.8% 0.82% 85.5% 1372.9
WCoder-33B-V1.1 30.4% 0.91% 76.8% 1276.7
DSCoder-33b-Ins 30.3% 0.90% 78.4% 1294.5
Gemini-Pro-1.5 (April) (n=1) 29.5% 0.00% 76.4% 1266.0
LLama3-70b-Ins 29.3% 0.61% 76.8% 1272.6
Eurux-8x22b-NCA (n=1) 27.3% 0.00% 73.4% 1234.2
OC-DS-33B 26.6% 0.78% 68.5% 1186.5
Claude-3-Sonnet 25.9% 0.76% 68.9% 1193.9
Mistral-Large 25.7% 0.58% 67.8% 1192.8
Mixtral-8x22B-Ins 25.6% 0.77% 67.3% 1178.2
CodeQwen15-7B-Chat 25.0% 0.99% 58.4% 1113.6
Claude-3-Haiku 24.6% 0.64% 65.1% 1159.0
DSCoder-33b-Base 23.7% 0.81% 61.0% 1136.0
Claude-2 23.6% 0.53% 63.5% 1144.6
Eurus-70B-SFT (n=1) 23.0% 0.00% 62.3% 1143.5
MagiCoderS-DS-6.7B 22.6% 0.84% 57.5% 1101.8
GPT-3.5-Turbo-0125 22.5% 0.66% 59.4% 1110.8
OC-DS-6.7B 21.9% 0.79% 54.8% 1085.4
CodeQwen15-7B 21.8% 0.87% 53.8% 1083.1
LLama3-70b-Base 21.8% 0.73% 56.6% 1091.5
Claude-Instant-1 21.7% 0.46% 57.2% 1095.3
DSCoder-6.7b-Ins 21.6% 0.84% 53.2% 1068.6
GPT-3.5-Turbo-0301 21.2% 0.82% 53.9% 1069.1
Phind-34B-V2 20.4% 0.66% 52.5% 1062.3
Command-R+ 20.4% 0.43% 52.4% 1060.4
DSCoder-6.7b-Base 19.1% 0.77% 45.9% 1018.3
Gemini-Pro 18.8% 0.81% 47.2% 1018.7
Smaug-2-72B 18.4% 0.72% 42.0% 978.4
DBRX-Ins 17.5% 0.75% 42.2% 981.6
LLama3-8b-Ins 17.3% 0.62% 41.9% 980.2
WCoder-34B-V1 16.2% 0.82% 32.6% 890.5
Qwen-1.5-72B-Chat 15.9% 0.50% 37.1% 935.0
StarCoder2-15b 15.4% 0.83% 33.3% 912.0
CodeGemma-7b-Base 14.0% 0.95% 26.6% 854.1
Command-R 14.0% 0.57% 31.3% 883.9
CodeLlama-34b-Base 12.3% 0.80% 21.3% 797.9
Mixtral-8x7B-Ins 12.3% 0.64% 24.2% 818.3
Cllama-13b-Ins 12.0% 0.65% 23.7% 813.9
LLama3-8b-Base 11.9% 0.69% 20.9% 789.1
Cllama-34b-Ins 11.6% 0.58% 24.2% 816.4
MagiCoderS-CL-7B 11.4% 0.74% 20.1% 772.8
StarCoder2-7b 11.3% 0.70% 18.0% 757.6
DSCoder-1.3b-Ins 10.8% 0.68% 21.5% 793.6
Cllama-7b-Ins 10.6% 0.55% 19.0% 761.4
Gemma-7b-Base 9.9% 0.79% 15.5% 727.4
CodeLlama-13b-Base 8.6% 0.59% 11.9% 669.0
StarCoder2-3b 8.6% 0.65% 10.8% 643.0
DSCoder-1.3b-Base 8.0% 0.67% 13.3% 686.5
CodeGemma-2b-Base 7.0% 0.64% 8.4% 592.3
CodeLlama-7b-Base 6.5% 0.63% 6.8% 546.3
StableCode-3B 5.7% 0.59% 7.5% 562.7
OC-DS-1.3B 5.0% 0.57% 4.1% 439.2
Gemma-2b-Base 2.5% 0.50% 1.2% 212.8