The 35 examples listed below are not solved by any model.
If these are good problems, solving some of them is a good signal that your model is indeed better than the leading models.
CRUXEval-output/112, CRUXEval-output/113, CRUXEval-output/125, CRUXEval-output/129, CRUXEval-output/163, CRUXEval-output/177, CRUXEval-output/211, CRUXEval-output/218, CRUXEval-output/220, CRUXEval-output/229, CRUXEval-output/250, CRUXEval-output/254, CRUXEval-output/272, CRUXEval-output/280, CRUXEval-output/301, CRUXEval-output/307, CRUXEval-output/310, CRUXEval-output/33, CRUXEval-output/347, CRUXEval-output/375, CRUXEval-output/44, CRUXEval-output/444, CRUXEval-output/445, CRUXEval-output/469, CRUXEval-output/484, CRUXEval-output/488, CRUXEval-output/501, CRUXEval-output/556, CRUXEval-output/581, CRUXEval-output/591, CRUXEval-output/599, CRUXEval-output/622, CRUXEval-output/671, CRUXEval-output/698, CRUXEval-output/726
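A list like the one above could be derived from per-model pass/fail records. The sketch below is a minimal illustration, assuming a hypothetical `results` mapping of model name to `{example_id: passed}`; the function name and the toy data are illustrative assumptions and do not come from the evaluation itself.

```python
def unsolved_examples(results):
    """Return the example ids that no model solved, sorted as strings.

    `results` is a hypothetical mapping: model name -> {example_id: bool}.
    """
    all_examples = set()
    solved = set()
    for per_example in results.values():
        for example_id, passed in per_example.items():
            all_examples.add(example_id)
            if passed:
                solved.add(example_id)
    # String sorting places ".../33" between ".../310" and ".../347",
    # which matches the ordering of the list above.
    return sorted(all_examples - solved)


# Toy data (made up for illustration only).
toy_results = {
    "model-a": {"ex/1": True, "ex/2": False, "ex/3": False},
    "model-b": {"ex/1": False, "ex/2": False, "ex/3": True},
}
print(unsolved_examples(toy_results))
```

On the toy data, only `ex/2` is failed by both models, so it is the only id returned.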
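For each example below, the table gives the lowest-rated model that solves it and that model's Elo rating (min_elo).

| example_link | model | min_elo |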
|---|---|---|
| CRUXEval-output/126 | gpt-4-0613+cot | 1121.772 |
| CRUXEval-output/268 | gpt-4-0613+cot | 1121.772 |
| CRUXEval-output/128 | gpt-4-0613+cot | 1121.772 |
| CRUXEval-output/393 | gpt-4-0613+cot | 1121.772 |
| CRUXEval-output/340 | gpt-4-0613+cot | 1121.772 |
| CRUXEval-output/568 | gpt-4-0613+cot | 1121.772 |
| CRUXEval-output/491 | gpt-4-0613+cot | 1121.772 |
| CRUXEval-output/35 | gpt-4-0613+cot | 1121.772 |
| CRUXEval-output/5 | gpt-4-0613+cot | 1121.772 |
| CRUXEval-output/259 | gpt-4-0613+cot | 1121.772 |
| CRUXEval-output/458 | gpt-4-0613+cot | 1121.772 |
| CRUXEval-output/179 | gpt-4-0613+cot | 1121.772 |
| CRUXEval-output/149 | gpt-4-0613+cot | 1121.772 |
| CRUXEval-output/438 | gpt-4-0613 | 1080.832 |
| CRUXEval-output/155 | gpt-3.5-turbo-0613+cot | 1033.916 |
| CRUXEval-output/543 | gpt-3.5-turbo-0613+cot | 1033.916 |
| CRUXEval-output/613 | gpt-3.5-turbo-0613+cot | 1033.916 |
| CRUXEval-output/317 | gpt-3.5-turbo-0613+cot | 1033.916 |
| CRUXEval-output/236 | gpt-3.5-turbo-0613+cot | 1033.916 |
| CRUXEval-output/391 | gpt-3.5-turbo-0613+cot | 1033.916 |
| CRUXEval-output/23 | codellama-python-13b | 958.334 |
| CRUXEval-output/499 | mixtral-8x7b | 955.682 |
| CRUXEval-output/175 | mistral-7b | 930.847 |
| CRUXEval-output/514 | phi-1.5 | 898.411 |
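The `min_elo` column can be read as the Elo rating of the lowest-rated model that solves each example. That reading is an interpretation of the column name, not something the report states. Under that assumption, a minimal sketch of the computation follows; `results`, `elo`, and the toy values are hypothetical.

```python
def min_elo_to_solve(results, elo):
    """Map each solved example to (model, elo) of its lowest-rated solver.

    `results`: hypothetical mapping model -> {example_id: bool}.
    `elo`: hypothetical mapping model -> Elo rating.
    Examples that no model solves are absent from the output.
    """
    best = {}
    for model, per_example in results.items():
        for example_id, passed in per_example.items():
            if not passed:
                continue
            if example_id not in best or elo[model] < best[example_id][1]:
                best[example_id] = (model, elo[model])
    return best


# Toy data (made up for illustration only).
toy_elo = {"strong": 1100.0, "weak": 900.0}
toy_results = {
    "strong": {"ex/a": True, "ex/b": True, "ex/c": False},
    "weak": {"ex/a": True, "ex/b": False, "ex/c": False},
}
for example_id, (model, rating) in sorted(min_elo_to_solve(toy_results, toy_elo).items()):
    print(f"| {example_id} | {model} | {rating:.3f} |")
```

A high `min_elo` means that only strong models solve the example, which is why rows at the top of such a table are the most discriminating.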
These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. better models tend to do worse on these.
| example_link | acc | tau |
|---|---|---|
| CRUXEval-output/132 | 0.774 | -0.379 |
| CRUXEval-output/45 | 0.774 | -0.358 |
| CRUXEval-output/571 | 0.065 | -0.329 |
| CRUXEval-output/356 | 0.677 | -0.294 |
| CRUXEval-output/329 | 0.903 | -0.283 |
| CRUXEval-output/209 | 0.065 | -0.280 |
| CRUXEval-output/456 | 0.806 | -0.273 |
| CRUXEval-output/373 | 0.161 | -0.268 |
| CRUXEval-output/403 | 0.871 | -0.268 |
| CRUXEval-output/715 | 0.903 | -0.263 |
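The report does not say how `tau` is computed. One plausible reading is Kendall's tau between each model's overall score and its pass/fail outcome on a single problem. The sketch below implements the tau-b variant, which corrects for ties (pass/fail is binary, so ties are common). The choice of tau-b, the function name, and the toy data are all assumptions.

```python
import math
from itertools import combinations


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length numeric sequences."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    # If one sequence is constant, the correlation is undefined.
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data (made up): four models ordered by overall score,
# where only the two weakest models pass this problem.
overall_score = [0.2, 0.4, 0.6, 0.8]
passed = [1, 1, 0, 0]
tau = kendall_tau_b(overall_score, passed)
print(round(tau, 3))
```

A negative value, as in the toy case, means that higher-scoring models fail the problem more often, which is the pattern the table above highlights.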
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.
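As a rough illustration of how the accuracy histogram could be tabulated, the sketch below computes per-problem accuracy (the fraction of models that solve a problem) and counts problems per equal-width bin. The `results` layout, the function name, and the bin count are assumptions rather than details of the original plots.

```python
from collections import Counter


def accuracy_histogram(results, n_bins=10):
    """Count problems per accuracy bin; bin i covers [i/n_bins, (i+1)/n_bins).

    `results`: hypothetical mapping model -> {example_id: bool}.
    An accuracy of exactly 1.0 is placed in the last bin.
    """
    correct = Counter()
    total = Counter()
    for per_example in results.values():
        for example_id, passed in per_example.items():
            total[example_id] += 1
            correct[example_id] += int(bool(passed))
    counts = [0] * n_bins
    for example_id, n in total.items():
        # Integer arithmetic avoids floating-point errors at bin edges.
        bin_index = min(correct[example_id] * n_bins // n, n_bins - 1)
        counts[bin_index] += 1
    return counts


# Toy data (made up for illustration only).
toy_results = {
    "model-a": {"ex/1": True, "ex/2": True, "ex/3": False},
    "model-b": {"ex/1": True, "ex/2": False, "ex/3": False},
}
print(accuracy_histogram(toy_results, n_bins=4))
```

With four bins, the toy problems with accuracies 1.0, 0.5, and 0.0 fall into the last, third, and first bins respectively.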