p-values for model pairs

The null hypothesis is that model A and B each have a 1/2 chance to win whenever they are different, ties are ignored. The p-value is the chance under the null-hypothesis to get a difference as extreme as the one observed. For all pairs of models, this mainly depends on the difference in accuracy. Hover over each model pair for detailed information.

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs inconsistencies

Here is a more informative figure of the source information used to compute p-value. Any model pair to the right of the parabola is statistically different from each other at the given level. This plot shows a pretty sharp transition since there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small difference in |#A_win-#B_win|. For more explanation see doc.

Results table by model

We show 3 methods currently used for evaluating code models, raw accuracy used by benchmarks, average win-rate over all other models (used by BigCode), and Elo (Bradly-Terry coefficients following Chatbot Arena). Average win-rate always have good correlation with Elo. GPT-3.5 gets an ELO of 1000 when available, otherwise the average is 1000.

model	pass1	std	win_rate	elo
gpt-4-turbo-2024-04-09+cot	75.7%	0.78%	87.1%	1293.2
gpt-4o+cot	75.6%	0.71%	87.1%	1286.6
gpt-4-0613+cot	75.5%	0.88%	88.7%	1314.0
claude-3-opus-20240229+cot	73.4%	0.00%	84.9%	1259.3
llama3-405-cot	72.0%	0.00%	84.9%	1252.8
gpt-4-0613	69.8%	0.43%	83.8%	1239.9
gpt-4-turbo-2024-04-09	68.5%	0.43%	83.0%	1230.0
llama3-405	67.6%	0.00%	81.5%	1212.4
gpt-4o	65.1%	0.42%	78.9%	1186.6
claude-3-opus-20240229	64.2%	0.00%	76.0%	1153.1
gpt-3.5-turbo-0613+cot	50.3%	1.10%	52.6%	986.5
codellama-34b+cot	50.1%	0.93%	53.5%	997.0
codetulu-2-34b	49.2%	0.69%	53.8%	999.1
gpt-3.5-turbo-0613	49.0%	0.55%	54.1%	1000.0
codellama-13b+cot	47.4%	0.85%	50.1%	982.1
codellama-34b	47.2%	0.71%	49.5%	975.0
phind	47.2%	0.61%	50.2%	973.5
deepseek-base-33b	46.5%	0.71%	48.3%	966.0
deepseek-instruct-33b	46.5%	0.65%	49.5%	967.8
codellama-python-34b	43.9%	0.70%	46.0%	947.2
wizard-34b	42.7%	0.60%	42.3%	918.3
codellama-13b	42.5%	0.76%	40.9%	931.5
deepseek-base-6.7b	41.9%	0.70%	38.2%	904.5
magicoder-ds-7b	41.7%	0.63%	40.3%	913.5
codellama-7b+cot	40.4%	0.95%	36.6%	882.9
codellama-python-13b	39.7%	0.75%	36.1%	885.1
mixtral-8x7b	39.3%	0.75%	35.0%	872.0
deepseek-instruct-6.7b	37.4%	0.60%	32.9%	856.3
codellama-python-7b	37.3%	0.65%	33.3%	871.0
wizard-13b	36.5%	0.60%	32.0%	846.7
codellama-7b	36.0%	0.69%	29.2%	840.7
mistral-7b	35.0%	0.69%	30.4%	846.6
phi-2	31.6%	0.70%	25.4%	802.0
starcoderbase-16b	31.3%	0.70%	23.2%	780.2
starcoderbase-7b	29.7%	0.65%	21.5%	764.1
deepseek-base-1.3b	27.8%	0.60%	19.3%	739.8
deepseek-instruct-1.3b	27.2%	0.55%	21.3%	758.7
phi-1.5	23.2%	0.70%	17.4%	709.9
phi-1	13.1%	0.41%	8.4%	554.2

CRUXEval-input: by models

Home Doc/Code

p-values for model pairs

p-values vs. differences

Differences vs inconsistencies

Results table by model