There are 9 examples not solved by any model.
If these problems are sound, solving some of them is a good signal that your model really is better than the leading models.
Mbpp/235, Mbpp/260, Mbpp/306, Mbpp/311, Mbpp/398, Mbpp/430, Mbpp/462, Mbpp/590, Mbpp/603
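
A list like the one above can be derived from per-model pass/fail results. The following is a minimal sketch, not the code that produced this report: the `results` layout (model name mapped to per-example booleans) and the model names are hypothetical.

```python
def unsolved_examples(results):
    """Return the sorted example ids that no model solved.

    `results` is assumed to map model name -> {example_id: passed}.
    """
    all_examples = set()
    solved = set()
    for per_example in results.values():
        for example_id, passed in per_example.items():
            all_examples.add(example_id)
            if passed:
                solved.add(example_id)
    return sorted(all_examples - solved)


# Toy data with hypothetical models; not taken from the report.
toy_results = {
    "model_a": {"Mbpp/235": False, "Mbpp/87": True, "Mbpp/260": False},
    "model_b": {"Mbpp/235": False, "Mbpp/87": True, "Mbpp/260": True},
}
unsolved = unsolved_examples(toy_results)
```

Whether an unsolved example is a hard problem or a flawed one (for instance, an ambiguous specification or a wrong test) cannot be decided from this count alone, which is why the list is worth inspecting by hand.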
| example_link | model | min_elo |
|---|---|---|
| Mbpp/780 | gpt-4-1106-preview | 1074.055 |
| Mbpp/310 | meta-llama-3-70b-instruct | 1061.476 |
| Mbpp/448 | bigcode--starcoder2-15b-instruct-v0.1 | 1046.197 |
| Mbpp/765 | mistral-large-latest | 1027.328 |
| Mbpp/103 | databricks--dbrx-instruct | 1007.678 |
| Mbpp/468 | microsoft--Phi-3-mini-4k-instruct | 1003.010 |
| Mbpp/124 | octocoder | 979.655 |
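
One plausible reading of the `min_elo` column is the Elo rating of the lowest-rated model that solved the problem, with `model` naming that model. The sketch below computes it under that assumption; the `results` and `elo` inputs, the model names, and the ratings are all hypothetical.

```python
def min_elo_to_solve(results, elo):
    """For each solved example, find the lowest-rated model that solved it.

    `results` is assumed to map model name -> {example_id: passed};
    `elo` is assumed to map model name -> Elo rating.
    Returns {example_id: (model, elo_rating)}; unsolved examples are omitted.
    """
    best = {}
    for model, per_example in results.items():
        for example_id, passed in per_example.items():
            if not passed:
                continue
            if example_id not in best or elo[model] < best[example_id][1]:
                best[example_id] = (model, elo[model])
    return best


# Toy data with hypothetical models and ratings; not taken from the report.
toy_results = {
    "strong_model": {"p1": True, "p2": True, "p3": False},
    "weak_model": {"p1": True, "p2": False, "p3": False},
}
toy_elo = {"strong_model": 1100.0, "weak_model": 950.0}
table = min_elo_to_solve(toy_results, toy_elo)
```

Under this reading, a high `min_elo` marks a problem that only strong models solve.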
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
| example_link | acc | tau |
|---|---|---|
| Mbpp/87 | 0.983 | -0.159 |
| Mbpp/581 | 0.119 | -0.156 |
| Mbpp/615 | 0.085 | -0.155 |
| Mbpp/142 | 0.983 | -0.149 |
| Mbpp/77 | 0.339 | -0.132 |
| Mbpp/567 | 0.915 | -0.130 |
| Mbpp/404 | 0.983 | -0.099 |
| Mbpp/138 | 0.034 | -0.063 |
| Mbpp/126 | 0.085 | -0.062 |
| Mbpp/20 | 0.186 | -0.060 |
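
The report does not say how `tau` is computed. Assuming it is Kendall's tau between each model's pass/fail outcome on a problem and that model's overall score, a minimal pure-Python sketch of the tau-b variant (which handles the many ties that binary outcomes produce) looks like this. The function name and the toy inputs are illustrative only.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences, O(n^2) pairs."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: excluded from both denominator terms
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data: one entry per model. The two strongest models fail the problem
# and the two weakest pass, so tau is negative.
passed = [0, 0, 1, 1]
overall = [0.9, 0.8, 0.3, 0.2]
tau = kendall_tau_b(passed, overall)
```

A strongly negative tau can point to a problem whose reference tests reward an unintended solution. Values near zero on problems with very high or very low `acc` carry little information, because almost all models then share the same outcome.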
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.