CRUXEval-input-T0.2: by models



std predicted by accuracy

The typical standard deviation between pairs of models on this dataset, as a function of absolute accuracy.
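For intuition about how the standard deviation scales with accuracy, here is a minimal binomial sketch; the 800-example benchmark size and the independence assumption are ours and only approximate the plotted curve:

```python
import math

def predicted_std(accuracy: float, n_examples: int = 800) -> float:
    """Binomial approximation of the std of a single model's measured accuracy.

    Treats each example as an independent Bernoulli(accuracy) draw;
    n_examples=800 is the CRUXEval size, assumed here for illustration.
    """
    return math.sqrt(accuracy * (1 - accuracy) / n_examples)

# A model at 50% accuracy: std of roughly 1.8 percentage points,
# in line with the std(A) column of the table below.
print(100 * predicted_std(0.5))  # ~1.77
```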

Differences vs inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of a parabola is statistically different at the corresponding significance level. The plot shows a fairly sharp transition: there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
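As a rough illustration of the parabola boundary, here is a normal-approximation sketch of the sign test with our own z threshold; it is not necessarily the exact computation behind the plot:

```python
import math

def right_of_parabola(a_wins: int, b_wins: int, z: float = 1.96) -> bool:
    """Normal-approximation significance check for one model pair.

    Under the null hypothesis, each of the n = #A_win + #B_win discordant
    examples is a fair coin flip, so #A_win - #B_win has std sqrt(n).
    The pair lies to the right of the (roughly 5%) parabola when the
    observed gap exceeds z * sqrt(n).
    """
    n = a_wins + b_wins
    return n > 0 and abs(a_wins - b_wins) > z * math.sqrt(n)
```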

p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever they disagree; ties are ignored. The p-value is the probability, under the null hypothesis, of a difference at least as extreme as the one observed. For all pairs of models, the significance level mainly depends on the accuracy difference, as shown here. Hover over each model pair for detailed information.
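A minimal sketch of this computation, assuming an exact two-sided sign test over the discordant examples (the leaderboard may use a slightly different procedure):

```python
from math import comb

def sign_test_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided exact sign-test p-value under the fair-coin null.

    a_wins / b_wins count the examples where exactly one of the two models
    succeeds; ties are already excluded.
    """
    n = a_wins + b_wins
    k = max(a_wins, b_wins)
    # P(at least k wins for either side) = 2 * P(X >= k), X ~ Binomial(n, 1/2)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: A beats B on 120 discordant examples, B beats A on 80.
print(sign_test_p_value(120, 80))  # ~0.006
```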

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (used by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate always correlates well with Elo. GPT-3.5 gets an Elo of 1000 when available; otherwise the average Elo is 1000. std (the std(E(A)) column): the standard deviation due to drawing examples from a population; this is the dominant term. std_i (the E(std(A)) column): the standard deviation due to drawing samples from the model on each example. std_total (the std(A) column): the total standard deviation, satisfying std_total^2 = std^2 + std_i^2.
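To make the decomposition concrete, a quick check of std_total^2 = std^2 + std_i^2 using the numbers from the first row of the table below:

```python
import math

# std(E(A)) -> std, E(std(A)) -> std_i, std(A) -> std_total
std, std_i = 1.3, 0.78                  # gpt-4-turbo-2024-04-09+cot row
std_total = math.sqrt(std**2 + std_i**2)
print(round(std_total, 1))              # 1.5, matching the std(A) column
```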

model pass1 std(E(A)) E(std(A)) std(A) N win_rate elo
gpt-4-turbo-2024-04-09+cot 75.7 1.3 0.78 1.5 2.7 36.2 1.1e+03
gpt-4o+cot 75.6 1.3 0.71 1.5 2.7 35.5 1.1e+03
gpt-4-0613+cot 75.5 1.2 0.88 1.5 10 35.2 1.1e+03
claude-3-opus-20240229+cot 73.4 1.6 0 1.6 1 35.2 1.09e+03
gpt-4-0613 69.8 1.6 0.43 1.6 10 31.2 1.08e+03
gpt-4-turbo-2024-04-09 68.5 1.6 0.43 1.6 3 30 1.07e+03
gpt-4o 65.1 1.6 0.42 1.7 3 28.2 1.06e+03
claude-3-opus-20240229 64.2 1.7 0 1.7 1 28.1 1.06e+03
gpt-3.5-turbo-0613+cot 50.3 1.4 1.1 1.8 10 17.3 997
codellama-34b+cot 50.1 1.5 0.93 1.8 10 17.7 999
codetulu-2-34b 49.2 1.6 0.69 1.8 10 15.9 999
gpt-3.5-turbo-0613 49 1.7 0.55 1.8 10 17.4 1e+03
codellama-13b+cot 47.4 1.5 0.85 1.8 10 16.1 992
codellama-34b 47.2 1.6 0.71 1.8 10 14.5 990
phind 47.2 1.7 0.61 1.8 10 15.6 992
deepseek-base-33b 46.5 1.6 0.71 1.8 10 14 988
deepseek-instruct-33b 46.5 1.6 0.65 1.8 10 15.1 990
codellama-python-34b 43.9 1.6 0.7 1.8 10 13.8 983
wizard-34b 42.7 1.6 0.6 1.7 10 13.3 975
codellama-13b 42.5 1.6 0.76 1.7 10 11.9 973
deepseek-base-6.7b 41.9 1.6 0.7 1.7 10 11.4 967
magicoder-ds-7b 41.7 1.6 0.63 1.7 10 12.1 971
codellama-7b+cot 40.4 1.5 0.95 1.7 10 12.1 961
codellama-python-13b 39.7 1.6 0.75 1.7 10 10.8 963
mixtral-8x7b 39.3 1.6 0.75 1.7 10 11 959
deepseek-instruct-6.7b 37.4 1.6 0.6 1.7 10 10.4 955
codellama-python-7b 37.3 1.6 0.65 1.7 10 10.3 956
wizard-13b 36.5 1.6 0.6 1.7 10 10 953
codellama-7b 36 1.6 0.69 1.7 10 8.99 948
mistral-7b 35 1.5 0.69 1.7 10 9.55 949
phi-2 31.6 1.5 0.7 1.6 10 8.65 934
starcoderbase-16b 31.3 1.5 0.7 1.6 10 7.44 932
starcoderbase-7b 29.7 1.5 0.65 1.6 10 7 928
deepseek-base-1.3b 27.8 1.5 0.6 1.6 10 6.58 919
deepseek-instruct-1.3b 27.2 1.5 0.55 1.6 10 7.48 922
phi-1.5 23.2 1.3 0.7 1.5 10 6.81 902
phi-1 13.1 1.1 0.41 1.2 10 3.57 867