🏆 CRUXEval Leaderboard 🏆

CRUXEval is a benchmark complementary to HumanEval and MBPP measuring code reasoning, understanding, and execution capabilities!

Homepage Paper Code HF Dataset Sample Explorer

CRUXEval-I

#	Model	pass@1	pass@5
1	🥇 gpt-4-turbo-2024-04-09+cot (n=3)	75.7	-
2	🥈 gpt-4o+cot (n=3)	75.6	-
3	🥉 gpt-4-0613+cot	75.5	88.9
4	claude-3-opus+cot (n=1)	73.4	-
5	gpt-4-0613	69.8	76.8
6	gpt-4-turbo-2024-04-09 (n=3)	68.5	-
7	gpt-4o (n=3)	65.1	-
8	claude-3-opus (n=1)	64.2	-
9	semcoder-s-6.7b+cot (under verification)	63.1	81.8
10	semcoder-6.7b+cot (under verification)	62.5	81.4
11	gpt-3.5-turbo-0613+cot	50.3	74.9
12	codellama-34b+cot	50.1	73.8
13	codetulu-2-34b	49.3	68.0
14	gpt-3.5-turbo-0613	49.0	63.2
15	starcoder2-15b	48.1	66.9
16	codellama-13b+cot	47.4	68.4
17	codellama-34b	47.2	66.6
18	phind	47.2	63.9
19	deepseek-base-33b	46.5	64.9
20	deepseek-instruct-33b	46.5	63.2
21	codellama-python-34b	43.9	59.5
22	wizard-34b	42.7	57.5
23	codellama-13b	42.5	62.0
24	deepseek-base-6.7b	41.9	62.7
25	magicoder-ds-6.7b	41.7	62.4
26	codellama-7b+cot	40.4	62.8
27	codellama-python-13b	39.7	56.9
28	mixtral-8x7b	39.3	59.1
29	deepseek-instruct-6.7b	37.4	53.3
30	codellama-python-7b	37.3	57.0
31	wizard-13b	36.5	51.6
32	codellama-7b	35.9	52.9
33	mistral-7b	35.0	52.3
34	starcoder2-7b	34.6	53.5
35	stablecode-3b	33.5	53.3
36	starcoder2-3b	32.7	50.1
37	phi-2	31.6	51.1
38	starcoderbase-16b	31.3	49.2
39	starcoderbase-7b	29.7	47.3
40	deepseek-base-1.3b	27.8	44.7
41	deepseek-instruct-1.3b	27.2	40.1
42	phi-1.5	23.2	37.7
43	phi-1	13.1	21.1

CRUXEval-O

#	Model	pass@1	pass@5
1	🥇 gpt-4-turbo-2024-04-09+cot (n=3)	82.0	-
2	🥈 claude-3-opus+cot (n=1)	82.0	-
3	🥉 gpt-4-0613+cot	77.1	88.2
4	gpt-4o+cot (n=3)	76.0	-
5	gpt-4o (n=3)	70.0	-
6	gpt-4-0613	68.7	73.0
7	gpt-4-turbo-2024-04-09 (n=3)	67.7	-
8	claude-3-opus (n=1)	65.8	-
9	semcoder-6.7b+cot (under verification)	64.4	79.1
10	semcoder-s-6.7b+cot (under verification)	64.1	78.3
11	gpt-3.5-turbo-0613+cot	59.0	76.7
12	deepseek-instruct-33b	49.9	61.8
13	gpt-3.5-turbo-0613	49.4	59.3
14	deepseek-base-33b	48.6	61.6
15	starcoder2-15b	47.1	59.5
16	codetulu-2-34b	45.8	58.9
17	magicoder-ds-6.7b	44.4	57.5
18	codellama-34b+cot	43.6	69.4
19	deepseek-base-6.7b	43.5	54.8
20	wizard-34b	43.4	53.8
21	codellama-34b	42.4	55.9
22	codellama-python-34b	41.4	52.9
23	wizard-13b	41.3	52.4
24	deepseek-instruct-6.7b	41.2	52.8
25	mixtral-8x7b	40.5	54.0
26	codellama-python-13b	39.8	52.5
27	codellama-13b	39.7	53.9
28	phind	39.7	52.8
29	codellama-13b+cot	36.0	61.8
30	starcoder2-7b	36.0	52.0
31	codellama-python-7b	35.9	48.8
32	mistral-7b	34.3	48.6
33	starcoderbase-16b	34.2	47.1
34	codellama-7b	34.2	48.4
35	starcoder2-3b	34.2	48.4
36	phi-2	33.5	46.6
37	starcoderbase-7b	32.2	44.9
38	deepseek-base-1.3b	31.0	43.4
39	codellama-7b+cot	29.9	55.4
40	deepseek-instruct-1.3b	28.7	40.0
41	phi-1.5	27.5	39.1
42	stablecode-3b	26.7	43.5
43	phi-1	21.7	32.0

📝 Notes

1. All samples are generated from scratch using our codebase, where the raw generations can also be found.
2. The pass@1 scores are reported with T=0.2 and the pass@5 scores with T=0.8, using 10 samples each (unless otherwise indicated). Models are ranked by pass@1.
3. Prompts can be found here. Generations that do not follow the format are counted as incorrect. A different set of prompts was used for the paper, which accounts for the slight difference in numbers.
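
For readers unfamiliar with the pass@k metric in note 2, the sketch below shows how a score could be computed from 10 samples per problem. It assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples and c the number of correct ones. The notes above do not state which estimator the codebase uses, and the function names `pass_at_k` and `benchmark_score` are hypothetical.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of them correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average pass@k over all problems, expressed as a percentage."""
    if not correct_counts:
        raise ValueError("correct_counts must not be empty")
    total = sum(pass_at_k(n, c, k) for c in correct_counts)
    return 100.0 * total / len(correct_counts)


if __name__ == "__main__":
    # Illustrative counts only, not real leaderboard data:
    # three problems with 10, 0, and 5 correct generations out of 10.
    counts = [10, 0, 5]
    print(f"pass@1 = {benchmark_score(counts, n=10, k=1):.1f}")
    print(f"pass@5 = {benchmark_score(counts, n=10, k=5):.1f}")
```

With k=1 the estimator reduces to the fraction of correct samples, c/n. A generation that does not follow the prompt format simply counts toward the incorrect samples.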

🤗 Acknowledgement and More Leaderboards

We thank the authors of the EvalPlus leaderboard for allowing us to borrow their leaderboard code! We also recommend the following leaderboards for measuring code LM ability on various coding tasks: EvalPlus Leaderboard, Chatbot Arena Leaderboard, BigCode Models Leaderboard, InfiCoder-Eval, and TabbyML Leaderboard.