mbpp+: by examples


Not solved by any model

No model solves the following 36 examples. Provided the problems themselves are sound, solving some of them is a good signal that your model is indeed better than the leading models.
Mbpp/103, Mbpp/113, Mbpp/119, Mbpp/126, Mbpp/138, Mbpp/235, Mbpp/244, Mbpp/255, Mbpp/260, Mbpp/267, Mbpp/278, Mbpp/300, Mbpp/305, Mbpp/306, Mbpp/310, Mbpp/311, Mbpp/398, Mbpp/415, Mbpp/427, Mbpp/430, Mbpp/448, Mbpp/462, Mbpp/468, Mbpp/577, Mbpp/589, Mbpp/590, Mbpp/603, Mbpp/630, Mbpp/639, Mbpp/739, Mbpp/748, Mbpp/765, Mbpp/771, Mbpp/790, Mbpp/806, Mbpp/92
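
A list like the one above can be derived from per-model pass/fail records. The following is a minimal sketch, assuming the outcomes are stored as a mapping from model name to per-problem booleans; the `results` layout, the model names, and the `unsolved_problems` helper are hypothetical and not taken from the page's own code.

```python
# Hypothetical results: model name -> {problem id -> passed?}.
# These values are illustrative, not the page's data.
results = {
    "model_a": {"Mbpp/1": True, "Mbpp/2": False, "Mbpp/3": False},
    "model_b": {"Mbpp/1": True, "Mbpp/2": True, "Mbpp/3": False},
}


def unsolved_problems(results):
    """Return the sorted ids of problems that no model passed."""
    all_problems = set()
    solved = set()
    for outcomes in results.values():
        for problem, passed in outcomes.items():
            all_problems.add(problem)
            if passed:
                solved.add(problem)
    return sorted(all_problems - solved)


print(unsolved_problems(results))  # ['Mbpp/3']
```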

Problems solved by 1 model only

example_link model min_elo
Mbpp/780 gpt-4-1106-preview 1281.933
Mbpp/72 gpt-4-1106-preview 1281.933
Mbpp/74 mixtral-8x22b-instruct-v0.1 1131.674
Mbpp/239 mistral-large-latest 1048.500
Mbpp/593 codegemma-7b 954.016
Mbpp/124 octocoder 938.886
Mbpp/759 mixtral-8x7b-instruct 924.741
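
The min_elo column is the Elo of the weakest model that solves a problem; when only one model solves it, that is simply the solver's own rating. Below is a rough sketch of how such a table might be built, assuming a hypothetical `elo` mapping and the same hypothetical pass/fail layout; the ratings and outcomes are made up for illustration.

```python
# Hypothetical Elo ratings and pass/fail outcomes (not the page's data).
elo = {"model_a": 1200.0, "model_b": 1000.0, "model_c": 900.0}
results = {
    "model_a": {"p1": True, "p2": True, "p3": False},
    "model_b": {"p1": False, "p2": True, "p3": False},
    "model_c": {"p1": False, "p2": True, "p3": False},
}


def solvers(results):
    """Map each problem id to the list of models that passed it."""
    by_problem = {}
    for model, outcomes in results.items():
        for problem, passed in outcomes.items():
            by_problem.setdefault(problem, [])
            if passed:
                by_problem[problem].append(model)
    return by_problem


def min_elo(results, elo):
    """Lowest Elo among the solvers of each problem that has a solver."""
    return {
        problem: min(elo[m] for m in models)
        for problem, models in solvers(results).items()
        if models
    }


def solved_by_one(results, elo):
    """Rows (problem, model, min_elo) for problems exactly one model solved,
    sorted by Elo in descending order like the table above."""
    rows = [
        (problem, models[0], elo[models[0]])
        for problem, models in solvers(results).items()
        if len(models) == 1
    ]
    return sorted(rows, key=lambda row: -row[2])


print(solved_by_one(results, elo))  # [('p1', 'model_a', 1200.0)]
```

Problems with no solver (here `p3`) get no min_elo entry at all, which matches treating them as a separate "not solved by any model" group.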

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e., better models tend to do worse on them).

example_link acc tau
Mbpp/77 0.254 -0.267
Mbpp/559 0.102 -0.242
Mbpp/615 0.085 -0.183
Mbpp/87 0.983 -0.172
Mbpp/581 0.119 -0.157
Mbpp/459 0.458 -0.145
Mbpp/404 0.983 -0.124
Mbpp/102 0.034 -0.100
Mbpp/558 0.119 -0.090
Mbpp/261 0.864 -0.085
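
The page does not state how tau is computed. One plausible reading, sketched below, is Kendall's tau-b between each model's overall score and its pass/fail outcome on a single problem; tau-b is chosen here because binary outcomes produce many ties. The function name and the sample scores are assumptions made for illustration.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences, with tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both factors
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical data: overall score per model, and whether each model
# passed one particular problem (1) or failed it (0).
overall = [0.9, 0.8, 0.6, 0.4]
passed = [0, 0, 1, 1]

acc = sum(passed) / len(passed)       # fraction of models that pass
tau = kendall_tau_b(overall, passed)  # negative: stronger models fail here
print(acc, round(tau, 3))
```

A negative value, as in every row of the table above, means the models that pass the problem tend to be the weaker ones overall, which is why such problems are worth inspecting for faulty tests or ambiguous specifications.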

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
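
As a small illustration of the binning behind such a histogram, the sketch below counts values into equal-width bins over [0, 1]. The `histogram` helper and the choice of five bins are assumptions; the input is only the ten acc values from the suspect-problems table, not the full benchmark.

```python
def histogram(values, bins=10, lo=0.0, hi=1.0):
    """Count values into `bins` equal-width bins spanning [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp so that v == hi falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# acc values of the ten suspect problems listed above.
accuracies = [0.254, 0.102, 0.085, 0.983, 0.119,
              0.458, 0.983, 0.034, 0.119, 0.864]

counts = histogram(accuracies, bins=5)
for i, count in enumerate(counts):
    # One text row per bin, e.g. "0.0-0.2 #####".
    print(f"{i * 0.2:.1f}-{(i + 1) * 0.2:.1f} {'#' * count}")
```

The same helper could bin min_elo values by passing a different `lo` and `hi`.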

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.