🏆 CRUXEval Leaderboard 🏆
CRUXEval is a benchmark, complementary to HumanEval and MBPP, that measures code reasoning, understanding, and execution capabilities!
📝 Notes
1. All samples are generated from scratch using our codebase, where the raw generations can also be found.
2. The pass@1 scores are reported with T=0.2, and the pass@5 scores are reported with T=0.8, with 10 samples each (unless otherwise indicated). Models are ranked by pass@1 (see the estimator sketch after this list).
3. Prompts can be found here, and generations not following the format are considered incorrect. A different set of prompts was used for the paper, which accounts for the slight difference in numbers.
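For readers unfamiliar with how pass@k is estimated from a fixed number of samples: the leaderboard does not spell out its scoring code here, but pass@k with n samples per problem is conventionally computed with the unbiased estimator from the Codex paper. The sketch below assumes that convention (the function name `pass_at_k` is ours, not from the CRUXEval codebase).

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a problem (10 here)
    c: number of those samples that are correct
    k: the k in pass@k (1 or 5 here)
    """
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    # 1 - probability that a random size-k subset contains no correct sample.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 4 of 10 samples correct gives pass@1 = 0.4, pass@5 ≈ 0.976.
print(pass_at_k(10, 4, 1), pass_at_k(10, 4, 5))
```

The benchmark score is then the mean of this quantity over all problems.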
🤗 Acknowledgement and More Leaderboards
We greatly thank the authors of the EvalPlus leaderboard for allowing us to borrow their leaderboard code!
We also recommend the following leaderboards for measuring code LM ability on various coding tasks, such as
EvalPlus Leaderboard,
Chatbot Arena Leaderboard,
BigCode Models Leaderboard,
InfiCoder-Eval, and
TabbyML Leaderboard.