mbpp: by examples

Not solved by any model

There are 9 examples that no model solved. If these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
Mbpp/235, Mbpp/260, Mbpp/306, Mbpp/311, Mbpp/398, Mbpp/430, Mbpp/462, Mbpp/590, Mbpp/603
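
A minimal sketch of how such a list could be derived, assuming per-model results are stored as a mapping from model name to the set of problem IDs that model solved. The `results` layout, the model names, and the `unsolved_by_all` helper are hypothetical and not taken from the page's own code.

```python
def unsolved_by_all(results, all_problems):
    """Return the sorted problem IDs that no model solved.

    results: dict mapping model name -> set of solved problem IDs (assumed layout).
    all_problems: iterable of every problem ID in the benchmark.
    """
    # Union of everything any model solved; empty if there are no models.
    solved_by_someone = set().union(*results.values())
    return sorted(set(all_problems) - solved_by_someone)


# Toy, made-up data for illustration only.
results = {
    "model_a": {"Mbpp/1", "Mbpp/2"},
    "model_b": {"Mbpp/2", "Mbpp/3"},
}
all_problems = ["Mbpp/1", "Mbpp/2", "Mbpp/3", "Mbpp/4"]
print(unsolved_by_all(results, all_problems))
```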

Problems solved by 1 model only

example_link  model                                  min_elo
Mbpp/780      gpt-4-1106-preview                     1335.544
Mbpp/310      meta-llama-3-70b-instruct              1289.918
Mbpp/448      bigcode--starcoder2-15b-instruct-v0.1  1190.764
Mbpp/765      mistral-large-latest                   1078.995
Mbpp/103      databricks--dbrx-instruct              1004.171
Mbpp/468      microsoft--Phi-3-mini-4k-instruct       985.727
Mbpp/124      octocoder                               905.939
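
For a problem with a single solver, the minimum Elo needed to solve it is just that model's rating. The sketch below shows one way such a table could be built, assuming the same kind of per-model solved sets plus a separate Elo rating per model. The function name, data layout, and toy values are illustrative assumptions.

```python
def single_solver_problems(results, elo):
    """List (problem, model, elo) rows for problems solved by exactly one model.

    results: dict mapping model name -> set of solved problem IDs (assumed layout).
    elo: dict mapping model name -> Elo rating (assumed to come from a separate ranking).
    """
    # Invert the mapping: problem -> list of models that solved it.
    solvers = {}
    for model, solved in results.items():
        for problem in solved:
            solvers.setdefault(problem, []).append(model)

    rows = [
        (problem, models[0], elo[models[0]])
        for problem, models in solvers.items()
        if len(models) == 1
    ]
    # Highest rating first, matching the ordering of the table above.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Toy, made-up data for illustration only.
toy_results = {"strong": {"p1", "p2"}, "weak": {"p2", "p3"}}
toy_elo = {"strong": 1300.0, "weak": 900.0}
for row in single_solver_problems(toy_results, toy_elo):
    print(*row)
```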

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e., better models tend to do worse on them).

example_link  acc    tau
Mbpp/87       0.983  -0.159
Mbpp/581      0.119  -0.156
Mbpp/615      0.085  -0.155
Mbpp/142      0.983  -0.149
Mbpp/77       0.339  -0.132
Mbpp/567      0.915  -0.130
Mbpp/404      0.983  -0.099
Mbpp/138      0.034  -0.063
Mbpp/126      0.085  -0.062
Mbpp/20       0.186  -0.060
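
The page does not say how `tau` is computed. One plausible reading, treated here purely as an assumption, is Kendall's tau between each model's pass/fail outcome on the problem and that model's overall score. Under that assumption, a plain-Python sketch of the tau-b variant (which accounts for the many ties that binary outcomes produce) could look like this. The function name and the toy numbers are illustrative.

```python
import math


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pairwise version)."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables; counts toward neither term
            elif dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif (dx > 0) == (dy > 0):
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + tied_x_only)
        * (concordant + discordant + tied_y_only)
    )
    # Undefined when one variable is constant; report NaN rather than raise.
    return (concordant - discordant) / denom if denom else float("nan")


# Toy, made-up data: four models, ordered from strongest to weakest overall.
overall_score = [0.9, 0.8, 0.6, 0.4]
solved_problem = [0, 0, 1, 1]  # only the two weaker models solve this problem
tau = kendall_tau_b(overall_score, solved_problem)
print(round(tau, 3))  # negative: better models do worse on this problem
```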

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
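
As a rough illustration of how such a histogram is binned, the sketch below counts problems per equal-width accuracy bin. The `accuracy_histogram` helper and the choice of 10 bins are assumptions. The sample values are only the ten `acc` entries from the suspect-problems table, not the full benchmark, so the resulting shape says nothing about the real distribution.

```python
from collections import Counter


def accuracy_histogram(accuracies, num_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = Counter()
    for acc in accuracies:
        # Clamp so that acc == 1.0 lands in the last bin instead of a new one.
        index = min(int(acc * num_bins), num_bins - 1)
        counts[index] += 1
    return [counts[i] for i in range(num_bins)]


# The ten accuracies listed under "Suspect problems" (a small subset only).
sample = [0.983, 0.119, 0.085, 0.983, 0.339, 0.915, 0.983, 0.034, 0.085, 0.186]
bins = accuracy_histogram(sample)
for i, count in enumerate(bins):
    # Text rendering: one '#' per problem in the bin.
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * count}")
```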

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.