mbpp+: by examples

Not solved by any model

36 examples are not solved by any model. If these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
Mbpp/103, Mbpp/113, Mbpp/119, Mbpp/126, Mbpp/138, Mbpp/235, Mbpp/244, Mbpp/255, Mbpp/260, Mbpp/267, Mbpp/278, Mbpp/300, Mbpp/305, Mbpp/306, Mbpp/310, Mbpp/311, Mbpp/398, Mbpp/415, Mbpp/427, Mbpp/430, Mbpp/448, Mbpp/462, Mbpp/468, Mbpp/577, Mbpp/589, Mbpp/590, Mbpp/603, Mbpp/630, Mbpp/639, Mbpp/739, Mbpp/748, Mbpp/765, Mbpp/771, Mbpp/790, Mbpp/806, Mbpp/92
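
A list like the one above could be derived along the following lines. This is a minimal sketch, not the page's actual code: the `results` mapping (model name to the set of problem IDs it solved), the model names, and the problem IDs are all hypothetical and illustrative only.

```python
def unsolved_by_all(problems, results):
    """Return the problems that appear in no model's solved set, in input order."""
    # Union of every model's solved set; empty if there are no models at all.
    solved_by_any = set().union(*results.values()) if results else set()
    return [p for p in problems if p not in solved_by_any]


# Illustrative data only (hypothetical models and outcomes).
problems = ["Mbpp/1", "Mbpp/2", "Mbpp/3"]
results = {
    "model_a": {"Mbpp/1"},
    "model_b": {"Mbpp/1", "Mbpp/3"},
}

unsolved = unsolved_by_all(problems, results)
print(unsolved)
```

With this toy data only `Mbpp/2` is left unsolved by every model.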

Problems solved by 1 model only

example_link	model	min_elo
Mbpp/72	gpt-4-1106-preview	1281.933
Mbpp/780	gpt-4-1106-preview	1281.933
Mbpp/74	mixtral-8x22b-instruct-v0.1	1131.674
Mbpp/239	mistral-large-latest	1048.500
Mbpp/593	codegemma-7b	954.016
Mbpp/124	octocoder	938.886
Mbpp/759	mixtral-8x7b-instruct	924.741
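
A table of this shape could be built as sketched below. The sketch assumes `min_elo` means the lowest Elo rating among the models that solve a problem, so that for a problem with a single solver it is simply that model's rating. The `solved_by` and `elo` mappings, the model names, and the ratings are hypothetical.

```python
def min_elo(models, elo):
    """Lowest Elo among the solving models, or None if nobody solved it."""
    return min(elo[m] for m in models) if models else None


def single_solver_rows(solved_by, elo):
    """Rows (problem, model, min_elo) for problems solved by exactly one model."""
    rows = []
    for problem, models in solved_by.items():
        if len(models) == 1:
            (model,) = models  # unpack the only solver
            rows.append((problem, model, elo[model]))
    # Hardest first: highest required Elo at the top.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Illustrative data only (hypothetical models, ratings, and outcomes).
elo = {"model_x": 1200.0, "model_y": 950.0}
solved_by = {
    "Mbpp/A": {"model_x"},
    "Mbpp/B": {"model_x", "model_y"},
    "Mbpp/C": {"model_y"},
    "Mbpp/D": set(),
}

rows = single_solver_rows(solved_by, elo)
for problem, model, rating in rows:
    print(f"{problem}\t{model}\t{rating:.3f}")
```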

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).

example_link	acc	tau
Mbpp/77	0.254	-0.267
Mbpp/559	0.102	-0.242
Mbpp/615	0.085	-0.183
Mbpp/87	0.983	-0.172
Mbpp/581	0.119	-0.157
Mbpp/459	0.458	-0.145
Mbpp/404	0.983	-0.124
Mbpp/102	0.034	-0.100
Mbpp/558	0.119	-0.090
Mbpp/261	0.864	-0.085
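
The page does not say how `tau` is computed. One plausible reading is a Kendall rank correlation between each model's overall rating and whether it passed the problem. The sketch below implements the tau-b variant (which corrects for ties) in plain Python under that assumption; the ratings and pass/fail values are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences of numbers."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    # Undefined when one variable is constant (e.g. every model passes).
    return (concordant - discordant) / denom if denom else float("nan")


# Illustrative data only: four hypothetical models, weakest to strongest.
ratings = [900, 1000, 1100, 1200]
passed = [1, 1, 0, 0]  # the two weaker models pass, the two stronger fail

tau = kendall_tau_b(ratings, passed)
print(round(tau, 3))
```

A negative value, as in this toy case, is what would flag a problem as suspect: stronger models fail it more often than weaker ones.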

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
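
The binning behind such a histogram can be sketched as follows, assuming that a problem's accuracy is the fraction of evaluated models that solve it. The `solve_counts` mapping, the model count, and the bin count are hypothetical choices for illustration.

```python
def accuracy_histogram(solve_counts, n_models, n_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = [0] * n_bins
    for solved in solve_counts.values():
        acc = solved / n_models
        # An accuracy of exactly 1.0 is placed in the last bin.
        idx = min(int(acc * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Illustrative data only: number of models (out of 4) solving each problem.
solve_counts = {"Mbpp/A": 0, "Mbpp/B": 1, "Mbpp/C": 2, "Mbpp/D": 4, "Mbpp/E": 4}

hist = accuracy_histogram(solve_counts, n_models=4, n_bins=4)
print(hist)
```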

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.