CRUXEval is a benchmark complementary to HumanEval and MBPP that measures code reasoning, understanding, and execution capabilities!

📝 Notes

1. All samples are generated from scratch using our codebase, where the raw generations can also be found.
2. The pass@1 scores are reported with T=0.2, and the pass@5 scores are reported with T=0.8, with 10 samples each (except gpt-4-turbo-2024-04-09, which uses 3 samples). Models are ranked by pass@1.
3. Prompts can be found here, and generations not following the format are considered incorrect. A different set of prompts was used for the paper, causing the slight differences in the reported numbers.
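
To make the pass@k numbers in note 2 concrete, here is a minimal sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for a problem with n samples of which c are correct. This is an assumption for illustration: the notes above do not state which estimator the codebase uses, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of them correct.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    Assumption: this is an illustrative sketch, not necessarily the exact
    computation used to produce the leaderboard.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset
    # contains at least one correct sample.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples for a problem, 3 of them correct.
p1 = pass_at_k(10, 3, 1)  # fraction of correct samples
p5 = pass_at_k(10, 3, 5)  # chance that a random 5-subset has a correct sample
```

A benchmark-level score would then be the mean of this per-problem value over all problems. For k=1 the estimator reduces to c/n, the fraction of correct samples.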

