There are 17 examples not solved by any model.
If these are well-posed problems, solving some of them is a good signal that your model is genuinely better than the leading models.
CRUXEval-input/112, CRUXEval-input/113, CRUXEval-input/128, CRUXEval-input/129, CRUXEval-input/177, CRUXEval-input/179, CRUXEval-input/185, CRUXEval-input/218, CRUXEval-input/220, CRUXEval-input/236, CRUXEval-input/259, CRUXEval-input/413, CRUXEval-input/423, CRUXEval-input/444, CRUXEval-input/501, CRUXEval-input/545, CRUXEval-input/581
example_link | model | min_elo |
---|---|---|
CRUXEval-input/250 | gpt-4-0613+cot | 1313.953 |
CRUXEval-input/729 | llama3-405-cot | 1252.838 |
CRUXEval-input/229 | llama3-405-cot | 1252.838 |
CRUXEval-input/474 | gpt-4-0613 | 1239.906 |
CRUXEval-input/770 | phind | 973.455 |
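
As a rough illustration of how a `min_elo` column like the one above could be derived, here is a minimal sketch. It assumes `min_elo` is the Elo rating of the lowest-rated model that solves an example, and that an example no model solves has no value. The function name `min_elo_to_solve`, the input layout, and the toy model names and ratings are all hypothetical and are not taken from the report.

```python
from typing import Dict, Optional, Tuple


def min_elo_to_solve(
    results: Dict[str, Dict[str, bool]],
    elo: Dict[str, float],
) -> Dict[str, Optional[Tuple[str, float]]]:
    """Map each example to (model, elo) of the lowest-rated solver, or None if unsolved.

    results[model][example] is True when that model solved that example.
    elo[model] is the model's overall rating (assumed to be given).
    """
    examples = sorted({ex for per_model in results.values() for ex in per_model})
    table: Dict[str, Optional[Tuple[str, float]]] = {}
    for ex in examples:
        solvers = [m for m, per_model in results.items() if per_model.get(ex, False)]
        if solvers:
            weakest = min(solvers, key=lambda m: elo[m])
            table[ex] = (weakest, elo[weakest])
        else:
            table[ex] = None  # not solved by any model
    return table


# Toy data (hypothetical models and ratings, for illustration only).
toy_results = {
    "model_a": {"ex/1": True, "ex/2": False, "ex/3": False},
    "model_b": {"ex/1": True, "ex/2": True, "ex/3": False},
}
toy_elo = {"model_a": 1000.0, "model_b": 1200.0}

toy_table = min_elo_to_solve(toy_results, toy_elo)
unsolved = [ex for ex, entry in toy_table.items() if entry is None]
```

Under this reading, a high `min_elo` marks a problem that only strong models solve, and the `unsolved` list corresponds to the examples not solved by any model.
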
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
example_link | acc | tau |
---|---|---|
CRUXEval-input/531 | 0.590 | -0.506 |
CRUXEval-input/222 | 0.641 | -0.495 |
CRUXEval-input/660 | 0.718 | -0.469 |
CRUXEval-input/242 | 0.821 | -0.447 |
CRUXEval-input/233 | 0.513 | -0.433 |
CRUXEval-input/28 | 0.769 | -0.429 |
CRUXEval-input/373 | 0.590 | -0.410 |
CRUXEval-input/598 | 0.821 | -0.344 |
CRUXEval-input/199 | 0.564 | -0.327 |
CRUXEval-input/745 | 0.513 | -0.286 |
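
To make the `tau` column above more concrete, the sketch below computes a rank correlation between per-model correctness on one problem and each model's overall accuracy. It assumes `tau` is Kendall's tau and uses the tau-b variant to handle the many ties in 0/1 outcomes. The report does not state which variant it uses, so treat that choice, the function name `kendall_tau_b`, and the toy numbers as assumptions.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equal-length sequences.

    Returns 0.0 when either sequence is constant (the statistic is undefined there).
    """
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both: contributes to neither factor of the denominator
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + tied_x_only)
        * (concordant + discordant + tied_y_only)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: 1 means the model solved the problem, 0 means it did not.
# Here the two weakest models solve it and the two strongest do not.
solved = [1, 1, 0, 0]
overall_acc = [0.2, 0.3, 0.6, 0.7]
tau = kendall_tau_b(solved, overall_acc)  # negative: better models do worse
```

A negative value, as in every row of the table, means success on the problem tends to go with lower overall accuracy, which is why such problems are worth inspecting for ambiguity or labeling issues.
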
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.