🏆 CRUXEval Leaderboard 🏆

CRUXEval is a benchmark complementary to HumanEval and MBPP that measures code reasoning, understanding, and execution capabilities!

📝 Notes

1. All samples are generated from scratch using our codebase, where the raw generations can also be found.
2. The pass@1 scores are reported with T=0.2, and the pass@5 scores are reported with T=0.8, with 10 samples each (except gpt-4-turbo-2024-04-09, which uses 3 samples). Models are ranked by pass@1.
3. Prompts can be found here, and generations not following the format are considered incorrect. A different set of prompts was used for the paper, causing the slight differences in numbers.
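For reference, pass@k scores of this kind are commonly computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021), given n samples per problem of which c pass; a minimal sketch, assuming CRUXEval follows this convention:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = total samples generated,
    c = samples that pass, k = evaluation budget.
    Returns 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 3 passing: pass@1 = 3/10, pass@5 ≈ 0.917
print(pass_at_k(10, 3, 1))  # → 0.3
print(pass_at_k(10, 3, 5))  # → 0.9166...
```

The per-problem estimates are then averaged over the benchmark to give the leaderboard score.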

🤗 Acknowledgement and More Leaderboards

We greatly thank the authors of the EvalPlus leaderboard for allowing us to borrow their leaderboard code! We also recommend the following leaderboards for measuring code LM ability on various coding tasks: EvalPlus Leaderboard, Chatbot Arena Leaderboard, BigCode Models Leaderboard, InfiCoder-Eval, and TabbyML Leaderboard.