DS1000: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning on any problem where their outcomes differ; ties are ignored. The p-value is the probability, under this null hypothesis, of observing a difference at least as extreme as the one actually observed. For all pairs of models, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
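The null above is the classic exact sign test: conditioned on the problems where the two models disagree, each win is a fair coin flip. A minimal sketch of that computation, using hypothetical win counts (the function name and counts are illustrative, not from the leaderboard code):

```python
from math import comb

def sign_test_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided exact binomial (sign) test.

    Under the null, each of the n = a_wins + b_wins decisive
    problems is a fair coin flip; ties are already excluded.
    """
    n = a_wins + b_wins
    k = max(a_wins, b_wins)
    # P(X >= k) for X ~ Binomial(n, 1/2)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # two-sided: double the tail, capped at 1 (the binomial is symmetric)
    return min(2 * tail, 1.0)

# hypothetical counts: A beats B on 60 problems, loses on 40
print(sign_test_p_value(60, 40))  # ~0.057, not significant at 0.05
```

Note that only the decisive (non-tied) problems enter the test, which is why the total #A_win + #B_win, not the benchmark size, drives significance.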

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically distinguishable at the given significance level. The plot shows a fairly sharp transition: there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
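The parabola comes from the normal approximation to the sign test: with n = #A_win + #B_win decisive problems, the gap |#A_win - #B_win| must exceed roughly z·√n to be significant, so the boundary n = d²/z² is a parabola in the (difference, total) plane. A small sketch under that assumption (the helper name is illustrative):

```python
from math import sqrt

def min_significant_diff(total_decisive: int, z: float = 1.96) -> float:
    """Normal approximation to the two-sided sign test.

    At significance level ~0.05 (z = 1.96), the win gap
    |#A_win - #B_win| must exceed z * sqrt(n), where
    n = #A_win + #B_win is the number of decisive problems.
    """
    return z * sqrt(total_decisive)

# hypothetical: with 100 decisive problems, the gap must exceed ~19.6
print(min_significant_diff(100))
```

This is why a pair with many disagreements can be significant at a modest accuracy gap, while a pair with few disagreements cannot.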

Results table by model

We show three methods currently used for evaluating code models: raw accuracy as reported by benchmarks, average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the ratings are shifted so that their average is 1000.
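Under the Bradley-Terry model on the Elo scale used by Chatbot Arena, a rating gap translates directly into a predicted head-to-head win probability. A minimal sketch of that relation, applied to two ratings from the table below (the function name is illustrative):

```python
def elo_win_prob(elo_a: float, elo_b: float) -> float:
    """Bradley-Terry win probability on the Elo scale:
    P(A beats B) = 1 / (1 + 10 ** ((elo_b - elo_a) / 400)).
    A 400-point gap corresponds to 10:1 odds."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))

# claude-3-5-sonnet (1179.5) vs gpt-3.5-turbo-0613 (1000.0)
print(elo_win_prob(1179.5, 1000.0))  # ~0.74
```

So a ~180-point Elo gap in this table corresponds to roughly a 74% chance of winning on a problem where the two models differ.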

model pass@1 win_rate Elo
claude-3-5-sonnet-20240620 54.3% 85.7% 1179.5
gpt-4-turbo-2024-04-09 54.0% 87.2% 1199.4
deepseek-ai-deepseek-coder-V2-SFT 53.2% 86.5% 1186.2
Qwen-Qwen2-72B-Instruct 52.8% 86.1% 1182.0
mistralai-Codestral-22B-v0.1 51.2% 85.1% 1165.2
gpt-4-0613 51.0% 84.0% 1151.4
meta-llama-Llama-3-70b-chat-hf 48.6% 81.0% 1115.7
deepseek-ai-deepseek-coder-V2-Base 46.7% 80.4% 1105.9
microsoft-wavecoder-ultra-6.7b 46.0% 79.2% 1090.8
deepseek-ai-deepseek-coder-33b-instruct 45.4% 77.4% 1071.2
m-a-p-OpenCodeInterpreter-DS-6.7B 42.0% 74.3% 1042.7
deepseek-ai-deepseek-coder-33b-base 41.7% 75.5% 1054.4
meta-llama-Llama-3-70B 40.9% 73.2% 1033.5
deepseek-ai-deepseek-llm-67b-chat 40.7% 72.6% 1025.9
microsoft-Phi-3-medium-4k-instruct 40.6% 73.5% 1034.9
Phind-Phind-CodeLlama-34B-v2 40.4% 71.9% 1022.5
Qwen-Qwen1.5-110B-Chat 40.2% 72.8% 1026.4
mistralai-Mixtral-8x22B 40.0% 72.8% 1030.9
codellama-CodeLlama-70b-hf 39.8% 71.8% 1022.5
m-a-p-OpenCodeInterpreter-CL-7B 39.5% 69.7% 1001.8
gpt-3.5-turbo-0125 39.4% 69.8% 1005.7
m-a-p-OpenCodeInterpreter-SC2-7B 38.9% 67.1% 975.3
codellama-CodeLlama-34b-Python-hf 38.9% 70.5% 1005.3
codellama-CodeLlama-70b-Python-hf 38.9% 70.3% 1004.8
gpt-3.5-turbo-0613 38.6% 69.2% 1000.0
codex002 38.6% 71.1% 1010.8
m-a-p-OpenCodeInterpreter-SC2-3B 38.6% 68.2% 989.2
deepseek-ai-deepseek-V2-chat 38.5% 69.0% 996.9
microsoft-Phi-3-small-8k-instruct 37.7% 68.6% 993.4
bigcode-starcoder2-15b 37.0% 67.8% 983.1
WizardLM-WizardCoder-Python-34B-V1.0 36.7% 66.6% 973.1
Qwen-Qwen1.5-72B-Chat 35.5% 65.5% 965.4
google-codegemma-7b 34.8% 64.6% 960.3
ibm-granite-granite-34b-code-base 34.8% 63.9% 953.0
codellama-CodeLlama-34b-hf 34.6% 64.0% 949.4
Qwen-Qwen1.5-72B 34.3% 63.1% 950.2
deepseek-ai-deepseek-coder-7b-base-v1.5 34.2% 63.5% 950.4
ibm-granite-granite-8b-code-base 33.8% 62.4% 939.2
Qwen-Qwen1.5-32B-Chat 32.8% 60.0% 923.3
microsoft-wavecoder-ds-6.7b 32.8% 59.9% 923.5
microsoft-Phi-3-mini-4k-instruct 32.1% 58.7% 916.0
meta-llama-Llama-3-8B 31.5% 57.7% 909.1
bigcode-starcoder2-7b 31.4% 57.8% 906.7
microsoft-Phi-3-mini-128k-instruct 31.3% 57.0% 905.4
microsoft-wavecoder-pro-6.7b 31.2% 56.7% 900.3
deepseek-ai-deepseek-coder-6.7b-base 31.1% 57.3% 900.0
Qwen-Qwen2-7B 31.0% 57.1% 900.6
codellama-CodeLlama-13b-Python-hf 31.0% 57.0% 900.3
deepseek-ai-deepseek-coder-V2-Lite-Base 30.5% 55.9% 889.8
openchat-openchat-3.5-0106 30.3% 55.2% 885.7
ibm-granite-granite-20b-code-base 30.0% 54.7% 885.4
google-codegemma-1.1-7b-it 29.7% 53.9% 880.9
Doubao-pro-4k 29.1% 52.6% 872.1
mistralai-Mixtral-8x7B-v0.1 28.8% 52.3% 868.2
Qwen-Qwen1.5-32B 28.5% 51.7% 861.2
codellama-CodeLlama-13b-hf 27.8% 50.2% 852.9
Qwen-CodeQwen1.5-7B 27.6% 49.7% 847.1
bigcode-starcoder2-3b 27.3% 49.1% 842.0
google-codegemma-7b-it 26.2% 46.9% 830.8
google-gemma-7b 26.1% 46.4% 822.6
codellama-CodeLlama-7b-Python-hf 26.0% 46.2% 823.0
stabilityai-stable-code-3b 25.6% 45.4% 819.1
meta-llama-Llama-2-70b-hf 25.2% 44.4% 807.2
m-a-p-OpenCodeInterpreter-DS-1.3B 25.0% 44.6% 810.4
Qwen-Qwen1.5-14B 24.8% 43.5% 803.5
THUDM-codegeex2-6b 24.1% 42.1% 790.6
deepseek-ai-deepseek-coder-V2-Instruct 23.3% 41.5% 798.6
claude-3-sonnet-20240229 23.2% 40.9% 790.4
codellama-CodeLlama-7b-hf 22.9% 39.2% 770.1
ibm-granite-granite-3b-code-base 22.8% 38.8% 764.6
claude-3-opus-20240229 21.6% 37.9% 769.4
microsoft-phi-2 21.5% 36.3% 748.8
Qwen-Qwen1.5-14B-Chat 21.4% 36.8% 754.8
Qwen-Qwen1.5-7B 20.1% 32.9% 720.1
gpt-4o-2024-05-13 20.1% 35.6% 755.7
mistralai-Mixtral-8x22B-Instruct-v0.1 19.9% 36.6% 762.4
mistralai-Mistral-7B-v0.3 19.7% 32.4% 712.8
google-gemma-1.1-7b-it 18.3% 30.5% 699.2
meta-llama-Llama-3-8b-chat-hf 17.8% 30.4% 708.2
deepseek-ai-deepseek-coder-1.3b-base 17.5% 27.1% 667.4
deepseek-ai-deepseek-V2-Lite 16.9% 26.4% 664.3
google-codegemma-1.1-2b 16.6% 26.0% 661.0
claude-3-haiku-20240307 16.3% 26.7% 673.7
Doubao-lite-4k 15.7% 25.1% 654.8
Salesforce-codegen25-7b-mono_P 15.6% 24.6% 651.5
google-codegemma-2b 13.3% 19.3% 591.5
Qwen-Qwen2-1.5B 11.8% 17.8% 576.0
meta-llama-Llama-2-13b-hf 11.6% 16.3% 553.1
google-gemma-7b-it 11.4% 16.2% 553.3
google-gemma-2b 10.3% 13.1% 508.9
microsoft-phi-1 9.1% 11.1% 471.6
ERNIE-Speed-8K 8.8% 11.8% 489.0
codellama-CodeLlama-70b-Instruct-hf 8.7% 12.7% 509.2
google-gemma-1.1-2b-it 8.5% 11.7% 487.0
microsoft-phi-1_5 8.3% 10.9% 471.6
codellama-CodeLlama-13b-Instruct-hf 7.9% 11.1% 468.9
meta-llama-Llama-2-7b-hf 6.9% 8.3% 414.7
mistralai-Mistral-7B-Instruct-v0.3 6.9% 7.4% 397.8
meta-llama-Llama-2-7b-chat-hf 6.4% 7.6% 399.5
google-gemma-2b-it 6.0% 7.4% 392.7
smallcloudai-Refact-1_6B-fim 5.7% 7.5% 394.9
codellama-CodeLlama-34b-Instruct-hf 5.2% 5.9% 352.7
Qwen-Qwen2-0.5B 3.9% 3.1% 232.8
mistralai-Mixtral-8x7B-Instruct-v0.1 3.7% 4.4% 292.2
meta-llama-Llama-2-70b-chat-hf 3.7% 4.5% 302.3