There are 17 examples that no model solved.
If these are sound problems, solving some of them is a good signal that your model is indeed better than the leading models.
CRUXEval-input/112, CRUXEval-input/113, CRUXEval-input/128, CRUXEval-input/129, CRUXEval-input/177, CRUXEval-input/185, CRUXEval-input/218, CRUXEval-input/220, CRUXEval-input/259, CRUXEval-input/314, CRUXEval-input/322, CRUXEval-input/413, CRUXEval-input/444, CRUXEval-input/469, CRUXEval-input/501, CRUXEval-input/556, CRUXEval-input/729
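
A list like the one above can be derived from per-model pass/fail records. The following is a minimal sketch, not the report's actual code: the `results` layout (model name mapped to a dict of example id to pass flag), the model names, and the example ids are all hypothetical.

```python
# Hypothetical layout: model name -> {example id -> did the model solve it?}
results = {
    "model_a": {"ex/1": True, "ex/2": False, "ex/3": False},
    "model_b": {"ex/1": True, "ex/2": True, "ex/3": False},
}


def unsolved_examples(results):
    """Return the sorted ids of examples that no model solved."""
    all_ids = set()
    for per_example in results.values():
        all_ids.update(per_example)
    # An example counts as unsolved if every model failed it (or has no record).
    return sorted(
        ex
        for ex in all_ids
        if not any(per_example.get(ex, False) for per_example in results.values())
    )


unsolved = unsolved_examples(results)
print(len(unsolved), unsolved)
```
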
| example_link | model | min_elo |
|---|---|---|
| CRUXEval-input/620 | gpt-4-0613+cot | 1120.432 |
| CRUXEval-input/53 | gpt-4-0613+cot | 1120.432 |
| CRUXEval-input/75 | gpt-4-0613+cot | 1120.432 |
| CRUXEval-input/295 | gpt-4-0613+cot | 1120.432 |
| CRUXEval-input/754 | gpt-4-0613+cot | 1120.432 |
| CRUXEval-input/687 | gpt-4-0613 | 1091.477 |
| CRUXEval-input/229 | gpt-3.5-turbo-0613 | 1000.000 |
| CRUXEval-input/375 | gpt-3.5-turbo-0613 | 1000.000 |
| CRUXEval-input/491 | gpt-3.5-turbo-0613+cot | 987.900 |
| CRUXEval-input/581 | codellama-13b+cot | 951.045 |
| CRUXEval-input/179 | starcoderbase-7b | 916.341 |
| CRUXEval-input/474 | phi-1.5 | 874.853 |
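
One plausible reading of the `min_elo` column is the Elo rating of the weakest model that solves a given example. The sketch below computes that quantity under this assumption; the `results` and `elo` dictionaries, the model names, and the ratings are illustrative placeholders rather than data from this report.

```python
# Hypothetical inputs: per-model pass/fail records and per-model Elo ratings.
results = {
    "weak": {"ex/1": True, "ex/2": False, "ex/3": False},
    "mid": {"ex/1": True, "ex/2": True, "ex/3": False},
    "strong": {"ex/1": False, "ex/2": True, "ex/3": False},
}
elo = {"weak": 900.0, "mid": 1000.0, "strong": 1100.0}


def min_elo_to_solve(results, elo):
    """Map each solved example to (model, elo) of the lowest-rated model that solved it."""
    best = {}
    for model, per_example in results.items():
        for ex, passed in per_example.items():
            if not passed:
                continue
            if ex not in best or elo[model] < best[ex][1]:
                best[ex] = (model, elo[model])
    return best


for ex, (model, rating) in sorted(min_elo_to_solve(results, elo).items()):
    print(f"| {ex} | {model} | {rating:.3f} |")
```

Examples that no model solves (here `ex/3`) have no entry, which matches listing them separately.
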
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
| example_link | acc | tau |
|---|---|---|
| CRUXEval-input/233 | 0.774 | -0.522 |
| CRUXEval-input/373 | 0.839 | -0.464 |
| CRUXEval-input/534 | 0.097 | -0.425 |
| CRUXEval-input/395 | 0.742 | -0.369 |
| CRUXEval-input/673 | 0.710 | -0.369 |
| CRUXEval-input/531 | 0.710 | -0.349 |
| CRUXEval-input/190 | 0.452 | -0.331 |
| CRUXEval-input/433 | 0.774 | -0.308 |
| CRUXEval-input/274 | 0.710 | -0.297 |
| CRUXEval-input/438 | 0.387 | -0.295 |
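
To illustrate what a negative `tau` means, the sketch below computes a Kendall rank correlation between each model's overall score and its pass/fail outcome on a single problem. It assumes the tau-b variant, which corrects for ties; the report does not say which variant it uses, and the scores and outcomes here are made up.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences, with tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical data: four models ordered by overall accuracy, and whether each
# one solved a given problem (1) or not (0). The weaker models solve it and the
# stronger ones do not, so the correlation comes out negative.
overall_accuracy = [0.2, 0.4, 0.6, 0.8]
solved_this_problem = [1, 1, 0, 0]

tau = kendall_tau_b(overall_accuracy, solved_this_problem)
acc = sum(solved_this_problem) / len(solved_this_problem)
print(f"acc={acc:.3f} tau={tau:.3f}")
```

Pass/fail outcomes produce many ties, which is why a tie-corrected variant is a reasonable choice here.
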
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.