There are 97 examples not solved by any model.
If these are good problems, solving some of them is a good signal that your model is indeed better than the leading models.
1899_D, 2849, 2879, 2921, 2952, 3017, 3024, 3025, 3046, 3047, 3091, 3171, 3184, 3190, 3192, 3200, 3211, 3212, 3219, 3223, 3224, 3228, 3233, 3240, 3243, 3261, 3265, 3297, 3298, 3299, 3308, 3317, abc301_d, abc301_e, abc301_f, abc302_f, abc303_e, abc304_d, abc305_d, abc305_e, abc306_d, abc306_e, abc307_c, abc307_d, abc307_e, abc308_e, abc309_d, abc309_e, abc310_e, abc310_f, abc311_c, abc311_d, abc312_e, abc312_f, abc314_d, abc314_e, abc314_f, abc315_d, abc315_e, abc315_f, abc318_e, abc319_c, abc320_c, abc321_d, abc321_e, abc322_e, abc323_d, abc323_e, abc324_d, abc324_e, abc325_d, abc325_f, abc326_d, abc326_e, abc327_e, abc329_c, abc329_e, abc329_f, abc330_e, abc331_d, abc331_e, abc333_d, abc333_e, abc334_c, abc336_d, abc337_d, abc337_e, abc338_d, abc338_f, abc340_c, abc340_e, abc341_e, abc341_f, abc342_d, abc342_e, abc343_a, abc343_e
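
A list like the one above can be derived from a table of per-model pass/fail results. The following is a minimal sketch, assuming a hypothetical `results` mapping from model name to per-example outcomes; the function name, data layout, and toy values are illustrative and are not taken from the report's actual pipeline.

```python
def unsolved_examples(results):
    """Return the sorted ids of examples that no model solved.

    `results` (hypothetical layout) maps model name -> {example id -> bool},
    where True means the model passed that example.
    """
    all_examples = set()
    solved = set()
    for per_example in results.values():
        all_examples.update(per_example)
        solved.update(ex for ex, passed in per_example.items() if passed)
    return sorted(all_examples - solved)


# Toy data, not real benchmark results.
results = {
    "model_a": {"p1": True, "p2": False, "p3": False},
    "model_b": {"p1": False, "p2": True, "p3": False},
}
unsolved = unsolved_examples(results)
print(f"There are {len(unsolved)} examples not solved by any model: {unsolved}")
```

In this toy case only `p3` is returned, because every other example is passed by at least one model.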
example_link | model | min_elo |
---|---|---|
abc308_f | GPT-4O-2024-05-13 | 1667.579 |
3080 | GPT-4O-2024-05-13 | 1667.579 |
3032 | GPT-4O-2024-05-13 | 1667.579 |
abc310_d | GPT-4O-2024-05-13 | 1667.579 |
abc320_e | GPT-4O-2024-05-13 | 1667.579 |
abc332_c | GPT-4O-2024-05-13 | 1667.579 |
3141 | GPT-4O-2024-05-13 | 1667.579 |
abc312_b | GPT-4O-2024-05-13 | 1667.579 |
3209 | GPT-4-Turbo-2024-04-09 | 1466.443 |
abc342_c | GPT-4-Turbo-2024-04-09 | 1466.443 |
abc321_b | GPT-4-Turbo-2024-04-09 | 1466.443 |
2757 | Gemini-Pro-1.5 (May) | 1435.522 |
1883_C | Gemini-Pro-1.5 (May) | 1435.522 |
3292 | GPT-4-0613 | 1372.862 |
abc319_e | GPT-4-0613 | 1372.862 |
2893 | WCoder-33B-V1.1 | 1276.673 |
3166 | CodeQwen15-7B-Chat | 1113.607 |
3196 | CodeQwen15-7B-Chat | 1113.607 |
3244 | CodeQwen15-7B-Chat | 1113.607 |
2833 | CodeQwen15-7B-Chat | 1113.607 |
3262 | CodeQwen15-7B-Chat | 1113.607 |
3033 | CodeQwen15-7B-Chat | 1113.607 |
abc336_c | Command-R+ | 1060.359 |
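
The report does not define `min_elo`. One plausible reading, treated here as an assumption, is that it is the Elo rating of the lowest-rated model that solved the example. The sketch below computes that quantity from hypothetical `results` and `elo` mappings; all names and values are illustrative.

```python
def min_elo_to_solve(results, elo):
    """For each solved example, return (model, elo) of the lowest-rated solver.

    `results` (hypothetical) maps model name -> {example id -> bool}.
    `elo` (hypothetical) maps model name -> Elo rating.
    Examples that no model solved are absent from the returned dict.
    """
    best = {}
    for model, per_example in results.items():
        for ex, passed in per_example.items():
            if passed and (ex not in best or elo[model] < best[ex][1]):
                best[ex] = (model, elo[model])
    return best


# Toy data, not real benchmark results.
results = {
    "strong": {"p1": True, "p2": True, "p3": False},
    "weak": {"p1": True, "p2": False, "p3": False},
}
elo = {"strong": 1600.0, "weak": 1100.0}

for ex, (model, rating) in sorted(min_elo_to_solve(results, elo).items()):
    print(f"{ex} | {model} | {rating:.3f} |")
```

Under this reading, an example with a high `min_elo` is one that only highly rated models solve.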
These are the 10 problems with the lowest correlation (tau) with the overall evaluation; where tau is negative, better models tend to do worse on the problem.
example_link | acc | tau |
---|---|---|
abc338_a | 0.431 | -0.200 |
2819 | 0.914 | -0.080 |
3203 | 0.034 | -0.051 |
2886 | 0.828 | -0.011 |
abc336_c | 0.017 | -0.003 |
3195 | 0.103 | -0.003 |
abc324_f | 0.034 | 0.000 |
abc340_b | 0.966 | 0.014 |
2847 | 0.776 | 0.034 |
abc339_c | 0.414 | 0.043 |
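
The report does not say how `acc` and `tau` are computed. As a rough sketch, assume `acc` is the fraction of models that solve the problem and `tau` is Kendall's tau-b between each model's pass/fail outcome on the problem and that model's overall score. The function and the toy numbers below are illustrative only.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences, with tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1  # pair tied in x (possibly also in y)
        if dy == 0:
            ties_y += 1  # pair tied in y (possibly also in x)
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    # If every model has the same outcome, the correlation is undefined;
    # this sketch returns 0.0 in that case.
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: 1 = solved, 0 = failed, one entry per model (not real results).
passed = [0, 0, 1, 1]
overall = [0.9, 0.8, 0.3, 0.2]  # hypothetical overall accuracy per model

acc = sum(passed) / len(passed)
tau = kendall_tau_b(passed, overall)
print(f"acc={acc:.3f} tau={tau:.3f}")
```

Here the two strongest models fail and the two weakest pass, so `tau` comes out negative, which is the pattern the table is meant to surface.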
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.