DS1000: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning on any problem where their outcomes differ; ties are ignored. The p-value is the probability, under this null hypothesis, of observing a difference at least as extreme as the one actually observed. For all pairs of models, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
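The null above is the classic exact sign test: conditioned on the problems where the two models disagree, each win is a fair coin flip. A minimal sketch of that computation, using hypothetical win counts (the function name and counts are illustrative, not from the leaderboard code):

```python
from math import comb

def sign_test_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided exact binomial (sign) test.

    Under the null, each of the n = a_wins + b_wins decisive
    problems is a fair coin flip; ties are already excluded.
    """
    n = a_wins + b_wins
    k = max(a_wins, b_wins)
    # P(X >= k) for X ~ Binomial(n, 1/2)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # two-sided: double the tail, capped at 1 (the binomial is symmetric)
    return min(2 * tail, 1.0)

# hypothetical counts: A beats B on 60 problems, loses on 40
print(sign_test_p_value(60, 40))  # ~0.057, not significant at 0.05
```

Note that only the decisive (non-tied) problems enter the test, which is why the total #A_win + #B_win, not the benchmark size, drives significance.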

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically distinguishable at the given significance level. The plot shows a fairly sharp transition: there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
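The parabola comes from the normal approximation to the sign test: with n = #A_win + #B_win decisive problems, the gap |#A_win - #B_win| must exceed roughly z·√n to be significant, so the boundary n = d²/z² is a parabola in the (difference, total) plane. A small sketch under that assumption (the helper name is illustrative):

```python
from math import sqrt

def min_significant_diff(total_decisive: int, z: float = 1.96) -> float:
    """Normal approximation to the two-sided sign test.

    At significance level ~0.05 (z = 1.96), the win gap
    |#A_win - #B_win| must exceed z * sqrt(n), where
    n = #A_win + #B_win is the number of decisive problems.
    """
    return z * sqrt(total_decisive)

# hypothetical: with 100 decisive problems, the gap must exceed ~19.6
print(min_significant_diff(100))
```

This is why a pair with many disagreements can be significant at a modest accuracy gap, while a pair with few disagreements cannot.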

Results table by model

We show three methods currently used for evaluating code models: raw accuracy as reported by benchmarks, average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the ratings are shifted so that their average is 1000.
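Under the Bradley-Terry model on the Elo scale used by Chatbot Arena, a rating gap translates directly into a predicted head-to-head win probability. A minimal sketch of that relation, applied to two ratings from the table below (the function name is illustrative):

```python
def elo_win_prob(elo_a: float, elo_b: float) -> float:
    """Bradley-Terry win probability on the Elo scale:
    P(A beats B) = 1 / (1 + 10 ** ((elo_b - elo_a) / 400)).
    A 400-point gap corresponds to 10:1 odds."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))

# claude-3-5-sonnet (1179.5) vs gpt-3.5-turbo-0613 (1000.0)
print(elo_win_prob(1179.5, 1000.0))  # ~0.74
```

So a ~180-point Elo gap in this table corresponds to roughly a 74% chance of winning on a problem where the two models differ.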

model pass@1 win_rate Elo
claude-3-5-sonnet-20240620 54.3% 85.7% 1179.5
gpt-4-turbo-2024-04-09 54.0% 87.2% 1199.4
deepseek-ai-deepseek-coder-V2-SFT 53.2% 86.5% 1186.2
Qwen-Qwen2-72B-Instruct 52.8% 86.1% 1182.0
mistralai-Codestral-22B-v0.1 51.2% 85.1% 1165.2
gpt-4-0613 51.0% 84.0% 1151.4
meta-llama-Llama-3-70b-chat-hf 48.6% 81.0% 1115.7
deepseek-ai-deepseek-coder-V2-Base 46.7% 80.4% 1105.9
microsoft-wavecoder-ultra-6.7b 46.0% 79.2% 1090.8
deepseek-ai-deepseek-coder-33b-instruct 45.4% 77.4% 1071.2
m-a-p-OpenCodeInterpreter-DS-6.7B 42.0% 74.3% 1042.7
deepseek-ai-deepseek-coder-33b-base 41.7% 75.5% 1054.4
meta-llama-Llama-3-70B 40.9% 73.2% 1033.5
deepseek-ai-deepseek-llm-67b-chat 40.7% 72.6% 1025.9
microsoft-Phi-3-medium-4k-instruct 40.6% 73.5% 1034.9
Phind-Phind-CodeLlama-34B-v2 40.4% 71.9% 1022.5
Qwen-Qwen1.5-110B-Chat 40.2% 72.8% 1026.4
mistralai-Mixtral-8x22B 40.0% 72.8% 1030.9
codellama-CodeLlama-70b-hf 39.8% 71.8% 1022.5
m-a-p-OpenCodeInterpreter-CL-7B 39.5% 69.7% 1001.8
gpt-3.5-turbo-0125 39.4% 69.8% 1005.7
m-a-p-OpenCodeInterpreter-SC2-7B 38.9% 67.1% 975.3
codellama-CodeLlama-34b-Python-hf 38.9% 70.5% 1005.3
codellama-CodeLlama-70b-Python-hf 38.9% 70.3% 1004.8
gpt-3.5-turbo-0613 38.6% 69.2% 1000.0
codex002 38.6% 71.1% 1010.8
m-a-p-OpenCodeInterpreter-SC2-3B 38.6% 68.2% 989.2
deepseek-ai-deepseek-V2-chat 38.5% 69.0% 996.9
microsoft-Phi-3-small-8k-instruct 37.7% 68.6% 993.4
bigcode-starcoder2-15b 37.0% 67.8% 983.1
WizardLM-WizardCoder-Python-34B-V1.0 36.7% 66.6% 973.1
Qwen-Qwen1.5-72B-Chat 35.5% 65.5% 965.4
google-codegemma-7b 34.8% 64.6% 960.3
ibm-granite-granite-34b-code-base 34.8% 63.9% 953.0
codellama-CodeLlama-34b-hf 34.6% 64.0% 949.4
Qwen-Qwen1.5-72B 34.3% 63.1% 950.2
deepseek-ai-deepseek-coder-7b-base-v1.5 34.2% 63.5% 950.4
ibm-granite-granite-8b-code-base 33.8% 62.4% 939.2
Qwen-Qwen1.5-32B-Chat 32.8% 60.0% 923.3
microsoft-wavecoder-ds-6.7b 32.8% 59.9% 923.5
microsoft-Phi-3-mini-4k-instruct 32.1% 58.7% 916.0
meta-llama-Llama-3-8B 31.5% 57.7% 909.1
bigcode-starcoder2-7b 31.4% 57.8% 906.7
microsoft-Phi-3-mini-128k-instruct 31.3% 57.0% 905.4
microsoft-wavecoder-pro-6.7b 31.2% 56.7% 900.3
deepseek-ai-deepseek-coder-6.7b-base 31.1% 57.3% 900.0
Qwen-Qwen2-7B 31.0% 57.1% 900.6
codellama-CodeLlama-13b-Python-hf 31.0% 57.0% 900.3
deepseek-ai-deepseek-coder-V2-Lite-Base 30.5% 55.9% 889.8
openchat-openchat-3.5-0106 30.3% 55.2% 885.7
ibm-granite-granite-20b-code-base 30.0% 54.7% 885.4
google-codegemma-1.1-7b-it 29.7% 53.9% 880.9
Doubao-pro-4k 29.1% 52.6% 872.1
mistralai-Mixtral-8x7B-v0.1 28.8% 52.3% 868.2
Qwen-Qwen1.5-32B 28.5% 51.7% 861.2
codellama-CodeLlama-13b-hf 27.8% 50.2% 852.9
Qwen-CodeQwen1.5-7B 27.6% 49.7% 847.1
bigcode-starcoder2-3b 27.3% 49.1% 842.0
google-codegemma-7b-it 26.2% 46.9% 830.8
google-gemma-7b 26.1% 46.4% 822.6
codellama-CodeLlama-7b-Python-hf 26.0% 46.2% 823.0
stabilityai-stable-code-3b 25.6% 45.4% 819.1
meta-llama-Llama-2-70b-hf 25.2% 44.4% 807.2
m-a-p-OpenCodeInterpreter-DS-1.3B 25.0% 44.6% 810.4
Qwen-Qwen1.5-14B 24.8% 43.5% 803.5
THUDM-codegeex2-6b 24.1% 42.1% 790.6
deepseek-ai-deepseek-coder-V2-Instruct 23.3% 41.5% 798.6
claude-3-sonnet-20240229 23.2% 40.9% 790.4
codellama-CodeLlama-7b-hf 22.9% 39.2% 770.1
ibm-granite-granite-3b-code-base 22.8% 38.8% 764.6
claude-3-opus-20240229 21.6% 37.9% 769.4
microsoft-phi-2 21.5% 36.3% 748.8
Qwen-Qwen1.5-14B-Chat 21.4% 36.8% 754.8
Qwen-Qwen1.5-7B 20.1% 32.9% 720.1
gpt-4o-2024-05-13 20.1% 35.6% 755.7
mistralai-Mixtral-8x22B-Instruct-v0.1 19.9% 36.6% 762.4
mistralai-Mistral-7B-v0.3 19.7% 32.4% 712.8
google-gemma-1.1-7b-it 18.3% 30.5% 699.2
meta-llama-Llama-3-8b-chat-hf 17.8% 30.4% 708.2
deepseek-ai-deepseek-coder-1.3b-base 17.5% 27.1% 667.4
deepseek-ai-deepseek-V2-Lite 16.9% 26.4% 664.3
google-codegemma-1.1-2b 16.6% 26.0% 661.0
claude-3-haiku-20240307 16.3% 26.7% 673.7
Doubao-lite-4k 15.7% 25.1% 654.8
Salesforce-codegen25-7b-mono_P 15.6% 24.6% 651.5
google-codegemma-2b 13.3% 19.3% 591.5
Qwen-Qwen2-1.5B 11.8% 17.8% 576.0
meta-llama-Llama-2-13b-hf 11.6% 16.3% 553.1
google-gemma-7b-it 11.4% 16.2% 553.3
google-gemma-2b 10.3% 13.1% 508.9
microsoft-phi-1 9.1% 11.1% 471.6
ERNIE-Speed-8K 8.8% 11.8% 489.0
codellama-CodeLlama-70b-Instruct-hf 8.7% 12.7% 509.2
google-gemma-1.1-2b-it 8.5% 11.7% 487.0
microsoft-phi-1_5 8.3% 10.9% 471.6
codellama-CodeLlama-13b-Instruct-hf 7.9% 11.1% 468.9
meta-llama-Llama-2-7b-hf 6.9% 8.3% 414.7
mistralai-Mistral-7B-Instruct-v0.3 6.9% 7.4% 397.8
meta-llama-Llama-2-7b-chat-hf 6.4% 7.6% 399.5
google-gemma-2b-it 6.0% 7.4% 392.7
smallcloudai-Refact-1_6B-fim 5.7% 7.5% 394.9
codellama-CodeLlama-34b-Instruct-hf 5.2% 5.9% 352.7
Qwen-Qwen2-0.5B 3.9% 3.1% 232.8
mistralai-Mixtral-8x7B-Instruct-v0.1 3.7% 4.4% 292.2
meta-llama-Llama-2-70b-chat-hf 3.7% 4.5% 302.3