DS-1000: by models


p-values for model pairs

The null hypothesis is that models A and B each win with probability 1/2 on every problem where they disagree; ties are ignored. The p-value is the probability, under this null hypothesis, of a difference at least as extreme as the one observed. Across all model pairs it depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
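Concretely, this null corresponds to an exact two-sided sign test over the decided problems. A minimal sketch (the function name is ours; #A_win and #B_win count problems solved by exactly one model of the pair):

```python
from math import comb

def sign_test_p_value(a_wins: int, b_wins: int) -> float:
    """Exact two-sided sign-test p-value: under the null, each decided
    problem is won by A with probability 1/2; ties are ignored entirely."""
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # the models never disagree: no evidence either way
    k = max(a_wins, b_wins)
    # Probability that the leading model wins at least k of n fair coin
    # flips, doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, `sign_test_p_value(9, 1)` is about 0.021, so a 9-vs-1 split on ten decided problems is significant at the 5% level.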

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs. inconsistencies

Here is a more informative view of the raw counts used to compute each p-value. Any model pair to the right of the parabola is statistically distinguishable at the given significance level. The plot shows a fairly sharp transition because no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
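The parabola can be read as a decision boundary: with n = #A_win + #B_win decided problems, the normal approximation to the two-sided sign test declares a pair different when |#A_win - #B_win| exceeds roughly z·√n. A sketch of that boundary (the function name and the use of the normal approximation are our assumptions, not necessarily how the plot was generated):

```python
import math
from statistics import NormalDist

def min_significant_gap(n_decided: int, alpha: float = 0.05) -> float:
    """Smallest |#A_win - #B_win| significant at level alpha, using the
    normal approximation to the two-sided sign test on n decided problems."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    return z * math.sqrt(n_decided)
```

With 100 decided problems the gap must be about 20 wins; with only 16 decided problems, about 8 — which is why pairs with few inconsistencies cannot reach significance.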

Results table by model

We show three methods currently used to evaluate code models: raw accuracy (as reported by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000.
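As a rough sketch of how an Elo column can be derived from pairwise win counts, here is a minorization-maximization fit of Bradley-Terry strengths mapped onto the Elo scale (400/ln 10 per unit of log-strength). The function name and the anchoring convention are our assumptions, not the site's exact pipeline:

```python
import math

def bradley_terry_elo(wins, iters=2000, anchor=0, anchor_elo=1000.0):
    """wins[i][j] = number of problems model i solves that model j fails.
    Fits Bradley-Terry strengths by minorization-maximization, converts
    them to an Elo-like scale, and shifts so model `anchor` (standing in
    for GPT-3.5) scores `anchor_elo`. Assumes every model wins at least
    one comparison, so all strengths stay positive."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins of model i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new.append(w_i / denom if denom > 0 else p[i])
        s = sum(new)
        p = [x * n / s for x in new]  # renormalize to prevent drift
    elo = [400 / math.log(10) * math.log(x) for x in p]
    shift = anchor_elo - elo[anchor]
    return [e + shift for e in elo]
```

On a toy two-model example where A beats B on 3 problems and loses 1, the fitted gap is 400·log10(3) ≈ 191 Elo points.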

model  pass@1  win_rate  Elo
claude-3-5-sonnet-20240620 54.3% 85.5% 1179.6
gpt-4-turbo-2024-04-09 54.0% 86.9% 1199.1
deepseek-ai-deepseek-coder-V2-SFT 53.2% 86.2% 1186.0
llama3-405 53.1% 84.7% 1166.1
Qwen-Qwen2-72B-Instruct 52.8% 85.8% 1181.8
mistralai-Codestral-22B-v0.1 51.2% 84.8% 1165.0
gpt-4-0613 51.0% 83.7% 1151.4
meta-llama-Llama-3-70b-chat-hf 48.6% 80.7% 1115.8
deepseek-ai-deepseek-coder-V2-Base 46.7% 80.0% 1105.8
microsoft-wavecoder-ultra-6.7b 46.0% 78.8% 1090.8
deepseek-ai-deepseek-coder-33b-instruct 45.4% 77.0% 1071.5
m-a-p-OpenCodeInterpreter-DS-6.7B 42.0% 73.9% 1042.7
deepseek-ai-deepseek-coder-33b-base 41.7% 75.0% 1054.2
meta-llama-Llama-3-70B 40.9% 72.7% 1033.5
deepseek-ai-deepseek-llm-67b-chat 40.7% 72.2% 1026.2
microsoft-Phi-3-medium-4k-instruct 40.6% 73.1% 1034.7
Phind-Phind-CodeLlama-34B-v2 40.4% 71.5% 1022.5
Qwen-Qwen1.5-110B-Chat 40.2% 72.3% 1026.2
mistralai-Mixtral-8x22B 40.0% 72.3% 1030.7
codellama-CodeLlama-70b-hf 39.8% 71.3% 1022.3
m-a-p-OpenCodeInterpreter-CL-7B 39.5% 69.3% 1002.0
gpt-3.5-turbo-0125 39.4% 69.3% 1005.8
codellama-CodeLlama-70b-Python-hf 38.9% 69.9% 1004.6
m-a-p-OpenCodeInterpreter-SC2-7B 38.9% 66.7% 975.8
codellama-CodeLlama-34b-Python-hf 38.9% 70.0% 1005.3
m-a-p-OpenCodeInterpreter-SC2-3B 38.6% 67.8% 989.5
gpt-3.5-turbo-0613 38.6% 68.8% 1000.0
codex002 38.6% 70.6% 1010.5
deepseek-ai-deepseek-V2-chat 38.5% 68.6% 997.0
microsoft-Phi-3-small-8k-instruct 37.7% 68.2% 993.4
bigcode-starcoder2-15b 37.0% 67.3% 983.1
WizardLM-WizardCoder-Python-34B-V1.0 36.7% 66.1% 973.2
Qwen-Qwen1.5-72B-Chat 35.5% 65.0% 965.5
google-codegemma-7b 34.8% 64.1% 960.0
ibm-granite-granite-34b-code-base 34.8% 63.4% 952.9
codellama-CodeLlama-34b-hf 34.6% 63.4% 949.3
Qwen-Qwen1.5-72B 34.3% 62.6% 950.3
deepseek-ai-deepseek-coder-7b-base-v1.5 34.2% 63.0% 950.1
ibm-granite-granite-8b-code-base 33.8% 61.9% 939.1
Qwen-Qwen1.5-32B-Chat 32.8% 59.5% 923.4
microsoft-wavecoder-ds-6.7b 32.8% 59.4% 923.9
microsoft-Phi-3-mini-4k-instruct 32.1% 58.1% 916.4
meta-llama-Llama-3-8B 31.5% 57.2% 909.2
bigcode-starcoder2-7b 31.4% 57.3% 906.9
microsoft-Phi-3-mini-128k-instruct 31.3% 56.5% 905.6
microsoft-wavecoder-pro-6.7b 31.2% 56.2% 900.6
deepseek-ai-deepseek-coder-6.7b-base 31.1% 56.7% 900.1
Qwen-Qwen2-7B 31.0% 56.6% 900.6
codellama-CodeLlama-13b-Python-hf 31.0% 56.5% 900.3
deepseek-ai-deepseek-coder-V2-Lite-Base 30.5% 55.4% 889.8
openchat-openchat-3.5-0106 30.3% 54.7% 886.0
ibm-granite-granite-20b-code-base 30.0% 54.1% 885.7
google-codegemma-1.1-7b-it 29.7% 53.4% 881.1
Doubao-pro-4k 29.1% 52.2% 872.5
mistralai-Mixtral-8x7B-v0.1 28.8% 51.8% 868.4
Qwen-Qwen1.5-32B 28.5% 51.1% 861.2
codellama-CodeLlama-13b-hf 27.8% 49.7% 853.1
Qwen-CodeQwen1.5-7B 27.6% 49.2% 847.2
bigcode-starcoder2-3b 27.3% 48.6% 842.6
google-codegemma-7b-it 26.2% 46.4% 831.1
google-gemma-7b 26.1% 45.9% 823.0
codellama-CodeLlama-7b-Python-hf 26.0% 45.7% 823.3
stabilityai-stable-code-3b 25.6% 45.0% 819.5
meta-llama-Llama-2-70b-hf 25.2% 43.9% 807.6
m-a-p-OpenCodeInterpreter-DS-1.3B 25.0% 44.1% 810.8
Qwen-Qwen1.5-14B 24.8% 43.0% 803.8
THUDM-codegeex2-6b 24.1% 41.6% 791.0
deepseek-ai-deepseek-coder-V2-Instruct 23.3% 41.1% 799.0
claude-3-sonnet-20240229 23.2% 40.5% 790.9
codellama-CodeLlama-7b-hf 22.9% 38.8% 770.4
ibm-granite-granite-3b-code-base 22.8% 38.3% 765.1
claude-3-opus-20240229 21.6% 37.5% 769.7
microsoft-phi-2 21.5% 35.9% 749.1
Qwen-Qwen1.5-14B-Chat 21.4% 36.4% 755.4
Qwen-Qwen1.5-7B 20.1% 32.5% 720.8
gpt-4o-2024-05-13 20.1% 35.2% 756.1
mistralai-Mixtral-8x22B-Instruct-v0.1 19.9% 36.2% 762.9
mistralai-Mistral-7B-v0.3 19.7% 32.0% 713.3
google-gemma-1.1-7b-it 18.3% 30.2% 699.6
meta-llama-Llama-3-8b-chat-hf 17.8% 30.1% 708.7
deepseek-ai-deepseek-coder-1.3b-base 17.5% 26.8% 668.2
deepseek-ai-deepseek-V2-Lite 16.9% 26.0% 664.9
google-codegemma-1.1-2b 16.6% 25.7% 662.0
claude-3-haiku-20240307 16.3% 26.4% 674.3
Doubao-lite-4k 15.7% 24.8% 655.5
Salesforce-codegen25-7b-mono_P 15.6% 24.3% 652.2
google-codegemma-2b 13.3% 19.1% 592.1
Qwen-Qwen2-1.5B 11.8% 17.5% 576.9
meta-llama-Llama-2-13b-hf 11.6% 16.1% 553.8
google-gemma-7b-it 11.4% 16.0% 553.8
google-gemma-2b 10.3% 12.9% 509.3
microsoft-phi-1 9.1% 11.0% 472.4
ERNIE-Speed-8K 8.8% 11.7% 490.0
codellama-CodeLlama-70b-Instruct-hf 8.7% 12.6% 509.7
google-gemma-1.1-2b-it 8.5% 11.6% 487.6
microsoft-phi-1_5 8.3% 10.8% 472.1
codellama-CodeLlama-13b-Instruct-hf 7.9% 11.0% 469.5
mistralai-Mistral-7B-Instruct-v0.3 6.9% 7.3% 398.8
meta-llama-Llama-2-7b-hf 6.9% 8.2% 415.4
meta-llama-Llama-2-7b-chat-hf 6.4% 7.5% 400.7
google-gemma-2b-it 6.0% 7.3% 393.2
smallcloudai-Refact-1_6B-fim 5.7% 7.4% 396.0
codellama-CodeLlama-34b-Instruct-hf 5.2% 5.8% 353.2
Qwen-Qwen2-0.5B 3.9% 3.1% 234.8
meta-llama-Llama-2-70b-chat-hf 3.7% 4.4% 302.8
mistralai-Mixtral-8x7B-Instruct-v0.1 3.7% 4.3% 292.8