There are 9 examples not solved by any model.
If these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
Mbpp/235, Mbpp/260, Mbpp/306, Mbpp/311, Mbpp/398, Mbpp/430, Mbpp/462, Mbpp/590, Mbpp/603
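
As a minimal sketch of how such a list could be derived, assume per-model results are stored as a mapping from model name to `{example_id: passed}`. The function name `unsolved_examples`, the data layout, and the toy data are illustrative assumptions, not the report's actual pipeline.

```python
def unsolved_examples(results):
    """Return the examples that no model solved.

    `results` maps a model name to a dict of {example_id: passed}.
    This layout is an assumption made for illustration.
    """
    all_examples = set()
    for per_example in results.values():
        all_examples.update(per_example)
    # An example is unsolved if no model has a passing entry for it.
    return sorted(
        ex for ex in all_examples
        if not any(per_example.get(ex, False) for per_example in results.values())
    )


# Toy data, not taken from the benchmark.
toy_results = {
    "model_a": {"ex/1": True, "ex/2": False, "ex/3": False},
    "model_b": {"ex/1": False, "ex/2": True, "ex/3": False},
}
print(unsolved_examples(toy_results))
```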
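These are the hardest problems solved by at least one model, ranked by the minimum Elo needed to solve each one (min_elo).
example_link | model | min_elo |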
---|---|---|
Mbpp/780 | gpt-4-1106-preview | 1335.544 |
Mbpp/310 | meta-llama-3-70b-instruct | 1289.918 |
Mbpp/448 | bigcode--starcoder2-15b-instruct-v0.1 | 1190.764 |
Mbpp/765 | mistral-large-latest | 1078.995 |
Mbpp/103 | databricks--dbrx-instruct | 1004.171 |
Mbpp/468 | microsoft--Phi-3-mini-4k-instruct | 985.727 |
Mbpp/124 | octocoder | 905.939 |
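
One plausible reading of `min_elo` is the rating of the lowest-rated model that solved the problem. Under that assumption, the table could be built roughly as follows. The function name `min_elo_to_solve`, the data layout, and the toy ratings are hypothetical.

```python
def min_elo_to_solve(results, elo):
    """For each solved example, find the lowest-rated model that solved it.

    `results` maps model -> {example_id: passed}; `elo` maps model -> rating.
    Both layouts are assumptions made for illustration.
    """
    weakest = {}
    for model, per_example in results.items():
        for ex, passed in per_example.items():
            if passed and (ex not in weakest or elo[model] < weakest[ex][1]):
                weakest[ex] = (model, elo[model])
    return weakest


# Toy data, not taken from the benchmark.
toy_elo = {"model_a": 1200.0, "model_b": 1000.0, "model_c": 900.0}
toy_results = {
    "model_a": {"ex/1": True, "ex/2": True},
    "model_b": {"ex/1": False, "ex/2": True},
    "model_c": {"ex/1": False, "ex/2": False},
}

# Sort so that problems requiring the highest rating come first.
hardest_first = sorted(
    min_elo_to_solve(toy_results, toy_elo).items(),
    key=lambda item: item[1][1],
    reverse=True,
)
for example, (model, rating) in hardest_first:
    print(example, model, rating)
```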
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them).
example_link | acc | tau |
---|---|---|
Mbpp/87 | 0.983 | -0.159 |
Mbpp/581 | 0.119 | -0.156 |
Mbpp/615 | 0.085 | -0.155 |
Mbpp/142 | 0.983 | -0.149 |
Mbpp/77 | 0.339 | -0.132 |
Mbpp/567 | 0.915 | -0.130 |
Mbpp/404 | 0.983 | -0.099 |
Mbpp/138 | 0.034 | -0.063 |
Mbpp/126 | 0.085 | -0.062 |
Mbpp/20 | 0.186 | -0.060 |
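
The sketch below assumes the `tau` column is a Kendall rank correlation between each model's overall rating and whether it solved the problem. The report does not state which variant is used, so the tau-b form (which handles the many ties in pass/fail data) is an assumption here, as are the function name and the toy data.

```python
from itertools import combinations
from math import nan, sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) version)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    # Undefined when one sequence is constant, e.g. every model passes.
    return (concordant - discordant) / denom if denom else nan


# Toy data: the two weakest models pass, the two strongest fail.
ratings = [900.0, 1000.0, 1100.0, 1200.0]
passed = [1, 1, 0, 0]
tau = kendall_tau_b(ratings, passed)
print(round(tau, 3))  # negative: better models do worse on this problem
```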
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.