DS1000: by models



std predicted by accuracy

The typical standard deviation between pairs of models on this dataset, as a function of absolute accuracy.
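Under a simple binomial model (our assumption; the page does not state the formula), the standard deviation of an accuracy estimate p measured on N independent problems is sqrt(p(1-p)/N). With DS-1000's 1,000 problems this reproduces the ~1.6 percentage-point std shown in the table below for models near 50% accuracy. A minimal sketch:

```python
import math

def binomial_std(p: float, n: int) -> float:
    """Std of an accuracy estimate p measured on n independent problems."""
    return math.sqrt(p * (1.0 - p) / n)

# With N = 1000 problems, a model at 54.3% accuracy has a std of
# roughly 1.6 percentage points, consistent with the table below.
print(round(100 * binomial_std(0.543, 1000), 2))  # → 1.58
```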

Differences vs inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given significance level. The plot shows a fairly sharp transition, since there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.

p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever they differ; ties are ignored. The p-value is the probability, under the null hypothesis, of a difference at least as extreme as the one observed. Across all model pairs, the significance level depends mainly on the accuracy difference, as shown here. Hover over each model pair for detailed information.
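The test described above is a paired sign test: each non-tied example is a fair coin flip under the null. A minimal sketch of the p-value computation (the function name is ours, not from this page):

```python
from math import comb

def sign_test_pvalue(a_wins: int, b_wins: int) -> float:
    """Two-sided sign-test p-value under the null that models A and B
    each win a non-tied example with probability 1/2."""
    n = a_wins + b_wins
    k = max(a_wins, b_wins)
    # P(the leading side wins at least k of n) under Binomial(n, 1/2),
    # doubled for the two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(sign_test_pvalue(8, 2))  # → 0.109375
```

A pair with 8 wins against 2 is not significant at the 5% level, which matches the figure's point that small win counts cannot produce small p-values.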

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (as used by benchmarks), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000. Column key: std(E(A)) is the standard deviation due to drawing examples from a population, the dominant term; E(std(A)) is the standard deviation due to drawing samples from the model on each example; std(A) is the total standard deviation, satisfying std(A)^2 = std(E(A))^2 + E(std(A))^2.
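As a sketch of how the Elo column could be produced (the exact fitting procedure is not given on this page; this assumes a standard Bradley-Terry fit via minorization-maximization, with an anchor model pinned at 1000):

```python
import math

def bradley_terry(wins: dict) -> dict:
    """Fit Bradley-Terry strengths by minorization-maximization.
    wins[(a, b)] = number of examples where model a beats model b."""
    models = sorted({m for pair in wins for m in pair})
    p = {m: 1.0 for m in models}
    for _ in range(200):
        new = {}
        for i in models:
            w_i = sum(w for (a, b), w in wins.items() if a == i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in models if j != i
            )
            new[i] = w_i / denom if denom else p[i]
        total = sum(new.values())  # normalize so strengths stay bounded
        p = {m: v * len(models) / total for m, v in new.items()}
    return p

def elo_scale(p: dict, anchor: str, base: float = 1000.0) -> dict:
    """Map strengths to an Elo-like scale, pinning `anchor` at `base`."""
    return {m: base + 400 * math.log10(v / p[anchor]) for m, v in p.items()}

# Hypothetical pairwise counts: A beats B on 75 examples, B beats A on 25.
p = bradley_terry({("A", "B"): 75, ("B", "A"): 25})
print(elo_scale(p, anchor="B"))  # A lands ~191 Elo points above B
```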

model pass@1 std(E(A)) E(std(A)) std(A) N win_rate elo
claude-3-5-sonnet-20240620 54.3 1.6 0 1.6 NaN 32.2 1.06e+03
gpt-4-turbo-2024-04-09 54 1.6 0 1.6 NaN 31.1 1.06e+03
deepseek-ai-deepseek-coder-V2-SFT 53.2 1.6 0 1.6 NaN 30.5 1.05e+03
Qwen-Qwen2-72B-Instruct 52.8 1.6 0 1.6 NaN 30.2 1.05e+03
mistralai-Codestral-22B-v0.1 51.2 1.6 0 1.6 NaN 28.7 1.05e+03
gpt-4-0613 51 1.6 0 1.6 NaN 29.1 1.05e+03
meta-llama-Llama-3-70b-chat-hf 48.6 1.6 0 1.6 NaN 27.6 1.04e+03
deepseek-ai-deepseek-coder-V2-Base 46.7 1.6 0 1.6 NaN 25.3 1.03e+03
microsoft-wavecoder-ultra-6.7b 46 1.6 0 1.6 NaN 25 1.03e+03
deepseek-ai-deepseek-coder-33b-instruct 45.4 1.6 0 1.6 NaN 25.2 1.02e+03
m-a-p-OpenCodeInterpreter-DS-6.7B 42 1.6 0 1.6 NaN 22.1 1.01e+03
deepseek-ai-deepseek-coder-33b-base 41.7 1.6 0 1.6 NaN 20.9 1.01e+03
meta-llama-Llama-3-70B 40.9 1.6 0 1.6 NaN 21 1.01e+03
deepseek-ai-deepseek-llm-67b-chat 40.7 1.6 0 1.6 NaN 21 1.01e+03
microsoft-Phi-3-medium-4k-instruct 40.6 1.6 0 1.6 NaN 20.3 1.01e+03
Phind-Phind-CodeLlama-34B-v2 40.4 1.6 0 1.6 NaN 21 1.01e+03
Qwen-Qwen1.5-110B-Chat 40.2 1.6 0 1.6 NaN 20.1 1.01e+03
mistralai-Mixtral-8x22B 40 1.5 0 1.5 NaN 19.8 1.01e+03
codellama-CodeLlama-70b-hf 39.8 1.5 0 1.5 NaN 20.1 1e+03
m-a-p-OpenCodeInterpreter-CL-7B 39.5 1.5 0 1.5 NaN 21 1e+03
gpt-3.5-turbo-0125 39.4 1.5 0 1.5 NaN 20.8 1e+03
m-a-p-OpenCodeInterpreter-SC2-7B 38.9 1.5 0 1.5 NaN 22.1 1e+03
codellama-CodeLlama-34b-Python-hf 38.9 1.5 0 1.5 NaN 19.4 1e+03
codellama-CodeLlama-70b-Python-hf 38.9 1.5 0 1.5 NaN 19.5 1e+03
gpt-3.5-turbo-0613 38.6 1.5 0 1.5 NaN 19.8 1e+03
codex002 38.6 1.5 0 1.5 NaN 18.5 1e+03
m-a-p-OpenCodeInterpreter-SC2-3B 38.6 1.5 0 1.5 NaN 20.5 1e+03
deepseek-ai-deepseek-V2-chat 38.5 1.5 0 1.5 NaN 19.7 1e+03
microsoft-Phi-3-small-8k-instruct 37.7 1.5 0 1.5 NaN 18.5 997
bigcode-starcoder2-15b 37 1.5 0 1.5 NaN 17.9 994
WizardLM-WizardCoder-Python-34B-V1.0 36.7 1.5 0 1.5 NaN 18.2 993
Qwen-Qwen1.5-72B-Chat 35.5 1.5 0 1.5 NaN 16.6 989
google-codegemma-7b 34.8 1.5 0 1.5 NaN 15.8 986
ibm-granite-granite-34b-code-base 34.8 1.5 0 1.5 NaN 16.4 986
codellama-CodeLlama-34b-hf 34.6 1.5 0 1.5 NaN 15.9 986
Qwen-Qwen1.5-72B 34.3 1.5 0 1.5 NaN 16 985
deepseek-ai-deepseek-coder-7b-base-v1.5 34.2 1.5 0 1.5 NaN 15.4 984
ibm-granite-granite-8b-code-base 33.8 1.5 0 1.5 NaN 15.5 983
Qwen-Qwen1.5-32B-Chat 32.8 1.5 0 1.5 NaN 15.3 979
microsoft-wavecoder-ds-6.7b 32.8 1.5 0 1.5 NaN 15.5 979
microsoft-Phi-3-mini-4k-instruct 32.1 1.5 0 1.5 NaN 15 977
meta-llama-Llama-3-8B 31.5 1.5 0 1.5 NaN 14.3 975
bigcode-starcoder2-7b 31.4 1.5 0 1.5 NaN 13.7 974
microsoft-Phi-3-mini-128k-instruct 31.3 1.5 0 1.5 NaN 14.7 974
microsoft-wavecoder-pro-6.7b 31.2 1.5 0 1.5 NaN 14.8 974
deepseek-ai-deepseek-coder-6.7b-base 31.1 1.5 0 1.5 NaN 13.4 973
Qwen-Qwen2-7B 31 1.5 0 1.5 NaN 13.2 973
codellama-CodeLlama-13b-Python-hf 31 1.5 0 1.5 NaN 13.5 973
deepseek-ai-deepseek-coder-V2-Lite-Base 30.5 1.5 0 1.5 NaN 13.3 971
openchat-openchat-3.5-0106 30.3 1.5 0 1.5 NaN 13.9 970
ibm-granite-granite-20b-code-base 30 1.4 0 1.4 NaN 13.5 969
google-codegemma-1.1-7b-it 29.7 1.4 0 1.4 NaN 13.7 968
Doubao-pro-4k 29.1 1.4 0 1.4 NaN 14 966
mistralai-Mixtral-8x7B-v0.1 28.8 1.4 0 1.4 NaN 12.4 965
Qwen-Qwen1.5-32B 28.5 1.4 0 1.4 NaN 12.2 964
codellama-CodeLlama-13b-hf 27.8 1.4 0 1.4 NaN 11.8 962
Qwen-CodeQwen1.5-7B 27.6 1.4 0 1.4 NaN 12 961
bigcode-starcoder2-3b 27.3 1.4 0 1.4 NaN 11.5 960
google-codegemma-7b-it 26.2 1.4 0 1.4 NaN 11.6 956
google-gemma-7b 26.1 1.4 0 1.4 NaN 10.7 956
codellama-CodeLlama-7b-Python-hf 26 1.4 0 1.4 NaN 10.7 955
stabilityai-stable-code-3b 25.6 1.4 0 1.4 NaN 10.7 954
meta-llama-Llama-2-70b-hf 25.2 1.4 0 1.4 NaN 10.1 952
m-a-p-OpenCodeInterpreter-DS-1.3B 25 1.4 0 1.4 NaN 11.3 952
Qwen-Qwen1.5-14B 24.8 1.4 0 1.4 NaN 9.89 951
THUDM-codegeex2-6b 24.1 1.4 0 1.4 NaN 9.74 949
deepseek-ai-deepseek-coder-V2-Instruct 23.3 1.3 0 1.3 NaN 10.9 946
claude-3-sonnet-20240229 23.2 1.3 0 1.3 NaN 10.3 945
codellama-CodeLlama-7b-hf 22.9 1.3 0 1.3 NaN 8.85 944
ibm-granite-granite-3b-code-base 22.8 1.3 0 1.3 NaN 8.61 944
claude-3-opus-20240229 21.6 1.3 0 1.3 NaN 9.72 940
microsoft-phi-2 21.5 1.3 0 1.3 NaN 8.36 939
Qwen-Qwen1.5-14B-Chat 21.4 1.3 0 1.3 NaN 8.92 939
Qwen-Qwen1.5-7B 20.1 1.3 0 1.3 NaN 7.42 934
gpt-4o-2024-05-13 20.1 1.3 0 1.3 NaN 9.52 934
mistralai-Mixtral-8x22B-Instruct-v0.1 19.9 1.3 0 1.3 NaN 10.8 934
mistralai-Mistral-7B-v0.3 19.7 1.3 0 1.3 NaN 7.48 933
google-gemma-1.1-7b-it 18.3 1.2 0 1.2 NaN 7.47 928
meta-llama-Llama-3-8b-chat-hf 17.8 1.2 0 1.2 NaN 7.77 926
deepseek-ai-deepseek-coder-1.3b-base 17.5 1.2 0 1.2 NaN 6.11 925
deepseek-ai-deepseek-V2-Lite 16.9 1.2 0 1.2 NaN 6.1 923
google-codegemma-1.1-2b 16.6 1.2 0 1.2 NaN 6.08 922
claude-3-haiku-20240307 16.3 1.2 0 1.2 NaN 6.6 921
Doubao-lite-4k 15.7 1.2 0 1.2 NaN 6.11 919
Salesforce-codegen25-7b-mono_P 15.6 1.1 0 1.1 NaN 5.94 918
google-codegemma-2b 13.3 1.1 0 1.1 NaN 4.58 910
Qwen-Qwen2-1.5B 11.8 1 0 1 NaN 4.42 905
meta-llama-Llama-2-13b-hf 11.6 1 0 1 NaN 3.95 904
google-gemma-7b-it 11.4 1 0 1 NaN 3.96 903
google-gemma-2b 10.3 0.96 0 0.96 NaN 3.11 899
microsoft-phi-1 9.1 0.91 0 0.91 NaN 2.7 895
ERNIE-Speed-8K 8.8 0.9 0 0.9 NaN 2.96 894
codellama-CodeLlama-70b-Instruct-hf 8.7 0.89 0 0.89 NaN 3.28 893
google-gemma-1.1-2b-it 8.5 0.88 0 0.88 NaN 2.98 893
microsoft-phi-1_5 8.3 0.87 0 0.87 NaN 2.75 892
codellama-CodeLlama-13b-Instruct-hf 7.9 0.85 0 0.85 NaN 2.86 890
meta-llama-Llama-2-7b-hf 6.9 0.8 0 0.8 NaN 2.11 887
mistralai-Mistral-7B-Instruct-v0.3 6.9 0.8 0 0.8 NaN 1.82 887
meta-llama-Llama-2-7b-chat-hf 6.4 0.77 0 0.77 NaN 1.92 885
google-gemma-2b-it 6 0.75 0 0.75 NaN 1.89 883
smallcloudai-Refact-1_6B-fim 5.7 0.73 0 0.73 NaN 1.97 882
codellama-CodeLlama-34b-Instruct-hf 5.2 0.7 0 0.7 NaN 1.53 880
Qwen-Qwen2-0.5B 3.9 0.61 0 0.61 NaN 0.808 875
mistralai-Mixtral-8x7B-Instruct-v0.1 3.7 0.6 0 0.6 NaN 1.16 875
meta-llama-Llama-2-70b-chat-hf 3.7 0.6 0 0.6 NaN 1.2 875