CRUXEval-output: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning on every example where they differ; ties are ignored. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one measured. Across all model pairs, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
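Concretely, this corresponds to a two-sided sign test on the examples where the two models disagree. Below is a minimal sketch of that computation, assuming the per-pair win counts are available; the exact procedure behind the hover tooltips may differ.

```python
from scipy.stats import binomtest

def pair_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided sign-test p-value for a model pair.

    Under the null hypothesis, A beats B with probability 1/2 on every
    example where the two models differ; ties are ignored, so only the
    win counts matter.
    """
    n = a_wins + b_wins          # examples where the models disagree
    if n == 0:
        return 1.0               # the models never disagree
    return binomtest(a_wins, n, p=0.5, alternative="two-sided").pvalue

# e.g. 60 wins vs. 40 wins over 100 disagreements
print(pair_p_value(60, 40))      # ~0.057
```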

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given level. The plot shows a fairly sharp transition: there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
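To see why the boundary is a parabola: under the null hypothesis the gap #A_win - #B_win has standard deviation sqrt(#A_win + #B_win), so significance at level alpha requires roughly |#A_win - #B_win| >= z * sqrt(#A_win + #B_win). Here is a small sketch of this boundary using the normal approximation; the exact figure may use the exact binomial tail instead.

```python
import numpy as np
from scipy.stats import norm

def min_gap_for_significance(n_disagreements: int, alpha: float = 0.05) -> float:
    """Approximate win gap |#A_win - #B_win| needed for significance.

    Normal approximation to the two-sided sign test: under the null the
    gap has standard deviation sqrt(n), so we need |gap| >= z * sqrt(n).
    Squaring both sides gives the parabola that separates significant
    pairs from non-significant ones in the figure above.
    """
    z = norm.ppf(1 - alpha / 2)
    return z * np.sqrt(n_disagreements)

# e.g. with 200 disagreements the gap must be about 28 wins at alpha = 0.05
print(min_gap_for_significance(200))   # ~27.7
```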

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (used by benchmarks), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate consistently correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when it is available; otherwise, the average Elo is set to 1000. A sketch of how the win-rate and Elo columns can be computed follows the table.

model pass@1 std win_rate Elo
gpt-4-turbo-2024-04-09+cot 82.0% 0.57% 95.4% 1498.9
claude-3-opus-20240229+cot 82.0% 0.00% 94.8% 1479.9
gpt-4-0613+cot 77.1% 0.66% 91.9% 1384.8
gpt-4o+cot 76.0% 0.94% 86.8% 1291.0
gpt-4o 70.0% 0.29% 86.7% 1291.0
gpt-4-0613 68.7% 0.32% 85.6% 1275.4
gpt-4-turbo-2024-04-09 67.7% 0.30% 84.9% 1260.2
claude-3-opus-20240229 65.8% 0.00% 82.8% 1239.8
gpt-3.5-turbo-0613+cot 59.0% 0.89% 70.8% 1115.1
deepseek-instruct-33b 49.9% 0.47% 58.2% 1022.9
gpt-3.5-turbo-0613 49.4% 0.40% 55.9% 1000.0
deepseek-base-33b 48.6% 0.52% 55.3% 1005.0
codetulu-2-34b 45.8% 0.50% 48.4% 955.0
magicoder-ds-7b 44.4% 0.51% 45.3% 933.5
codellama-34b+cot 43.6% 1.02% 41.2% 888.5
deepseek-base-6.7b 43.5% 0.57% 43.7% 921.6
wizard-34b 43.4% 0.44% 44.5% 921.2
codellama-34b 42.4% 0.56% 40.6% 891.7
codellama-python-34b 41.4% 0.48% 38.6% 875.3
wizard-13b 41.3% 0.50% 40.0% 890.4
deepseek-instruct-6.7b 41.2% 0.41% 39.7% 885.8
mixtral-8x7b 40.5% 0.58% 35.8% 858.5
codellama-python-13b 39.8% 0.52% 36.5% 863.8
codellama-13b 39.7% 0.56% 35.4% 861.4
phind 39.7% 0.46% 35.5% 853.7
codellama-13b+cot 36.0% 1.07% 26.8% 769.9
codellama-python-7b 35.9% 0.54% 29.9% 809.7
mistral-7b 34.3% 0.56% 24.9% 761.3
codellama-7b 34.2% 0.57% 24.1% 760.1
starcoderbase-16b 34.2% 0.55% 26.8% 782.5
phi-2 33.5% 0.55% 25.3% 761.4
starcoderbase-7b 32.2% 0.47% 21.8% 734.5
deepseek-base-1.3b 31.0% 0.57% 22.8% 744.7
codellama-7b+cot 29.9% 1.05% 16.2% 650.4
deepseek-instruct-1.3b 28.7% 0.48% 19.3% 696.8
phi-1.5 27.5% 0.56% 19.2% 695.2
phi-1 21.7% 0.45% 13.8% 616.7
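As a rough illustration of how the win_rate and Elo columns relate to per-example results, here is a sketch assuming a boolean correctness matrix (one row per model, one column per benchmark example). The tie handling (half credit for examples both solve or both fail) and the Bradley-Terry fitting procedure shown here are assumptions following the descriptions above, not necessarily the exact code behind this table.

```python
import numpy as np

def average_win_rate(correct: np.ndarray) -> np.ndarray:
    """Mean win rate of each model against every other model.

    correct: (n_models, n_examples) boolean matrix of per-example correctness.
    A head-to-head win is an example one model solves and the other does not;
    examples solved by both or neither count as half a win (assumption).
    """
    n_models, n_examples = correct.shape
    rates = np.zeros(n_models)
    for i in range(n_models):
        vs = []
        for j in range(n_models):
            if i == j:
                continue
            wins = np.sum(correct[i] & ~correct[j])
            losses = np.sum(~correct[i] & correct[j])
            vs.append((wins + 0.5 * (n_examples - wins - losses)) / n_examples)
        rates[i] = np.mean(vs)
    return rates

def bradley_terry_elo(correct: np.ndarray, anchor: int,
                      anchor_elo: float = 1000.0, iters: int = 2000) -> np.ndarray:
    """Bradley-Terry strengths from head-to-head wins, mapped to an Elo scale.

    Uses the standard minorization-maximization update; the Elo scale is
    400 * log10(strength), shifted so that the anchor model (e.g. GPT-3.5)
    sits at anchor_elo.
    """
    n_models = len(correct)
    wins = np.array([[np.sum(correct[i] & ~correct[j]) for j in range(n_models)]
                     for i in range(n_models)], dtype=float)
    games = wins + wins.T                     # decisive comparisons per pair
    total_wins = wins.sum(axis=1)
    p = np.ones(n_models)
    for _ in range(iters):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p = total_wins / denom
        p = p / p[anchor]                     # pin the anchor's strength at 1
    return anchor_elo + 400.0 * np.log10(p)
```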