There are 36 examples that no model solved.
If these are sound problems, solving some of them is a good signal that your model really is better than the leading models.
Mbpp/103, Mbpp/113, Mbpp/119, Mbpp/126, Mbpp/138, Mbpp/235, Mbpp/244, Mbpp/255, Mbpp/260, Mbpp/267, Mbpp/278, Mbpp/300, Mbpp/305, Mbpp/306, Mbpp/310, Mbpp/311, Mbpp/398, Mbpp/415, Mbpp/427, Mbpp/430, Mbpp/448, Mbpp/462, Mbpp/468, Mbpp/577, Mbpp/589, Mbpp/590, Mbpp/603, Mbpp/630, Mbpp/639, Mbpp/739, Mbpp/748, Mbpp/765, Mbpp/771, Mbpp/790, Mbpp/806, Mbpp/92
example_link | model | min_elo |
---|---|---|
Mbpp/72 | gpt-4-1106-preview | 1281.933 |
Mbpp/780 | gpt-4-1106-preview | 1281.933 |
Mbpp/74 | mixtral-8x22b-instruct-v0.1 | 1131.674 |
Mbpp/239 | mistral-large-latest | 1048.500 |
Mbpp/593 | codegemma-7b | 954.016 |
Mbpp/124 | octocoder | 938.886 |
Mbpp/759 | mixtral-8x7b-instruct | 924.741 |
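
As a rough illustration of how a `min_elo` column like the one above could be derived, the sketch below assumes that `min_elo` is the Elo rating of the lowest-rated model that solves a problem, and that a problem with no solver counts as unsolved. The report does not spell out this definition, so treat it as an assumption. The function name `min_elo_to_solve`, the model and problem names, and the ratings are all hypothetical.

```python
def min_elo_to_solve(problems, elo, solved):
    """For each problem, return (model, elo) of the lowest-rated solver, or None.

    problems: iterable of problem ids
    elo:      dict mapping model name -> Elo rating
    solved:   dict mapping model name -> set of problem ids that model solved
    """
    result = {}
    for problem in problems:
        solvers = ((m, elo[m]) for m in elo if problem in solved.get(m, set()))
        # default=None marks problems that no model solved
        result[problem] = min(solvers, key=lambda pair: pair[1], default=None)
    return result


# Made-up ratings and results, for illustration only.
elo = {"model_a": 1200.0, "model_b": 1000.0, "model_c": 900.0}
solved = {"model_a": {"p1", "p2"}, "model_b": {"p2"}, "model_c": set()}

table = min_elo_to_solve(["p1", "p2", "p3"], elo, solved)
unsolved = [p for p, best in table.items() if best is None]
```

Under this reading, a high `min_elo` means that only strong models solve the problem, and the unsolved list is the set of problems with no entry at all.
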
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them).
example_link | acc | tau |
---|---|---|
Mbpp/77 | 0.254 | -0.267 |
Mbpp/559 | 0.102 | -0.242 |
Mbpp/615 | 0.085 | -0.183 |
Mbpp/87 | 0.983 | -0.172 |
Mbpp/581 | 0.119 | -0.157 |
Mbpp/459 | 0.458 | -0.145 |
Mbpp/404 | 0.983 | -0.124 |
Mbpp/102 | 0.034 | -0.100 |
Mbpp/558 | 0.119 | -0.090 |
Mbpp/261 | 0.864 | -0.085 |
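
To make the `tau` column concrete, here is a minimal sketch that assumes `tau` is a Kendall rank correlation (the tau-b variant, which corrects for ties) between each model's overall score and whether that model solved the problem. The report only labels the column `tau`, so the exact variant is an assumption, and the scores and outcomes below are made up.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length numeric sequences."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from both factors
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical data: five models ordered from strongest to weakest overall,
# where only the weaker models solve the problem (1 = solved, 0 = failed).
overall_score = [0.9, 0.8, 0.6, 0.4, 0.2]
solved_problem = [0, 0, 1, 1, 1]

tau = kendall_tau_b(overall_score, solved_problem)
```

A negative value, as in this made-up case, is the pattern the table highlights: the stronger a model is overall, the less likely it is to solve the problem.
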
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.