There are 36 examples not solved by any model.
If these are good problems, solving some of them is a good signal that your model is indeed better than the leading models.
Mbpp/103, Mbpp/113, Mbpp/119, Mbpp/126, Mbpp/138, Mbpp/235, Mbpp/244, Mbpp/255, Mbpp/260, Mbpp/267, Mbpp/278, Mbpp/300, Mbpp/305, Mbpp/306, Mbpp/310, Mbpp/311, Mbpp/398, Mbpp/415, Mbpp/427, Mbpp/430, Mbpp/448, Mbpp/462, Mbpp/468, Mbpp/577, Mbpp/589, Mbpp/590, Mbpp/603, Mbpp/630, Mbpp/639, Mbpp/739, Mbpp/748, Mbpp/765, Mbpp/771, Mbpp/790, Mbpp/806, Mbpp/92
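
As a minimal sketch of how such a list could be derived, the snippet below finds problems that no model solved. The data layout (`solved_by_model`, `all_problems`) and the model names are hypothetical and only illustrate the idea; they are not the report's actual data or pipeline.

```python
# Hypothetical pass/fail results: model name -> set of problem IDs it solved.
solved_by_model = {
    "model_a": {"Mbpp/72", "Mbpp/74"},
    "model_b": {"Mbpp/74", "Mbpp/239"},
}
# Hypothetical benchmark subset, in report order.
all_problems = ["Mbpp/72", "Mbpp/74", "Mbpp/92", "Mbpp/103", "Mbpp/239"]


def unsolved_problems(all_problems, solved_by_model):
    """Return the problems that no model solved, preserving input order."""
    solved_by_any = set()
    for problems in solved_by_model.values():
        solved_by_any |= problems
    return [p for p in all_problems if p not in solved_by_any]


unsolved = unsolved_problems(all_problems, solved_by_model)
print(f"There are {len(unsolved)} examples not solved by any model.")
print(", ".join(unsolved))
```

To check whether a new model solves any of these, intersect its solved set with `unsolved`.
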
| example_link | model | min_elo |
|---|---|---|
| Mbpp/72 | gpt-4-1106-preview | 1067.593 |
| Mbpp/780 | gpt-4-1106-preview | 1067.593 |
| Mbpp/74 | mixtral-8x22b-instruct-v0.1 | 1033.408 |
| Mbpp/239 | mistral-large-latest | 1016.585 |
| Mbpp/593 | codegemma-7b | 990.535 |
| Mbpp/124 | octocoder | 986.812 |
| Mbpp/759 | mixtral-8x7b-instruct | 983.086 |
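
The `min_elo` column can be read as the lowest Elo rating among the models that solve a given problem. The sketch below computes it under that assumption; the ratings, model names, and problem IDs are made up for illustration, and the report's exact definition may differ.

```python
# Hypothetical Elo ratings and per-model solved sets (not the report's data).
elo = {"strong": 1100.0, "medium": 1000.0, "weak": 900.0}
solved_by_model = {
    "strong": {"p1", "p2", "p3"},
    "medium": {"p1", "p2"},
    "weak": {"p1"},
}


def min_elo_to_solve(elo, solved_by_model):
    """Map each solved problem to (lowest Elo among its solvers, that model)."""
    best = {}
    for model, problems in solved_by_model.items():
        for problem in problems:
            if problem not in best or elo[model] < best[problem][0]:
                best[problem] = (elo[model], model)
    return best


result = min_elo_to_solve(elo, solved_by_model)
# List the hardest problems first, i.e. those needing the highest minimum Elo.
for problem, (rating, model) in sorted(result.items(), key=lambda kv: -kv[1][0]):
    print(f"| {problem} | {model} | {rating:.3f} |")
```

Problems that no model solves never appear in `result`, which matches their separate listing in the report.
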

These are the 10 problems with the lowest correlation (`tau`) with the overall evaluation, i.e., better models tend to do worse on these. The `acc` column is the accuracy on each problem.

| example_link | acc | tau |
|---|---|---|
| Mbpp/77 | 0.254 | -0.267 |
| Mbpp/559 | 0.102 | -0.242 |
| Mbpp/615 | 0.085 | -0.183 |
| Mbpp/87 | 0.983 | -0.172 |
| Mbpp/581 | 0.119 | -0.157 |
| Mbpp/459 | 0.458 | -0.145 |
| Mbpp/404 | 0.983 | -0.124 |
| Mbpp/102 | 0.034 | -0.100 |
| Mbpp/558 | 0.119 | -0.090 |
| Mbpp/261 | 0.864 | -0.085 |
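
A negative `tau` can be illustrated with a rank correlation between each model's overall score and its pass/fail outcome on one problem. The report does not say which variant it uses, so the sketch below assumes Kendall's tau-b (which handles the many ties in binary outcomes) and uses invented scores.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences, with tie correction."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: counts toward neither term
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied_x = concordant + discordant + ties_y_only  # pairs not tied in x
    untied_y = concordant + discordant + ties_x_only  # pairs not tied in y
    denom = sqrt(untied_x * untied_y)
    # Convention chosen here: return 0.0 when tau is undefined (e.g. all pass).
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical overall scores for four models, best first.
overall = [0.9, 0.8, 0.6, 0.4]
# Hypothetical outcomes on one problem: the two weaker models pass, the stronger fail.
outcome = [0, 0, 1, 1]

tau = kendall_tau_b(overall, outcome)
print(f"tau = {tau:.3f}")  # negative: better models do worse on this problem
```

An O(n²) loop is used for clarity; with only a few dozen models per problem that cost is negligible.
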

Histogram of problems by the accuracy on each problem.

Histogram of problems by the minimum Elo to solve each problem.