There are 56 examples that no model solved.
If these are sound problems, solving some of them is a good signal that your model is indeed better than the leading models.
atcoder.abc301_f, atcoder.abc311_c, atcoder.abc314_e, atcoder.abc315_e, atcoder.abc315_f, atcoder.abc319_c, atcoder.abc324_f, atcoder.abc327_e, atcoder.abc333_e, atcoder.abc337_e, atcoder.abc338_f, atcoder.abc343_a, atcoder.abc343_e, atcoder.abc350_c, atcoder.abc350_e, atcoder.abc355_e, atcoder.abc359_c, atcoder.abc359_e, atcoder.abc362_c, atcoder.abc363_f, atcoder.abc371_f, atcoder.abc372_f, atcoder.abc373_g, atcoder.abc374_d, atcoder.abc374_g, atcoder.abc375_b, atcoder.abc375_f, atcoder.abc376_f, atcoder.abc376_g, atcoder.abc378_g, atcoder.abc382_g, atcoder.abc385_f, atcoder.arc181_a, atcoder.arc181_c, atcoder.arc181_d, atcoder.arc182_d, atcoder.arc182_e, atcoder.arc183_b, atcoder.arc183_c, atcoder.arc183_d, atcoder.arc184_c, atcoder.arc184_d, atcoder.arc185_c, atcoder.arc186_a, atcoder.arc186_b, atcoder.arc186_c, atcoder.arc186_d, atcoder.arc186_e, atcoder.arc187_b, atcoder.arc188_c, atcoder.arc189_a, atcoder.arc189_b, leetcode.3211, leetcode.3327, leetcode.3584, leetcode.3638
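
A list like the one above can be derived from per-model pass/fail results. The sketch below is only an illustration: the `results` layout (model name to example id to solved flag) and the function name `unsolved_by_all` are assumptions, not the format this report was generated from.

```python
# Hypothetical layout: results[model][example_id] is True if that model solved it.
results = {
    "model_a": {"p1": True, "p2": False, "p3": False},
    "model_b": {"p1": False, "p2": True, "p3": False},
}

def unsolved_by_all(results):
    """Return the sorted example ids that no model solved."""
    examples = set()
    for per_example in results.values():
        examples.update(per_example)
    return sorted(
        ex
        for ex in examples
        # A missing entry is treated as "not solved" (an assumption).
        if not any(per_example.get(ex, False) for per_example in results.values())
    )

print(unsolved_by_all(results))  # ['p3']
```
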
| example_link | model | min_elo |
|---|---|---|
| atcoder.arc183_a | Kimi-k1.6-IOI-high | 1096.025 |
| atcoder.arc182_a | O1-2024-12-17 (High) | 1085.166 |
| leetcode.3688 | O1-2024-12-17 (High) | 1085.166 |
| atcoder.abc325_d | O1-2024-12-17 (High) | 1085.166 |
| atcoder.abc354_d | O1-2024-12-17 (High) | 1085.166 |
| atcoder.arc185_d | O1-2024-12-17 (High) | 1085.166 |
| atcoder.arc184_e | DeepSeek-R1-Preview | 1066.811 |
| leetcode.3551 | DeepSeek-R1-Preview | 1066.811 |
| atcoder.abc364_f | Llama-3_1-Nemotron-Ultra-253B-v1 | 1064.699 |
| atcoder.abc368_g | DeepCoder-14B-Preview | 1048.360 |
| leetcode.3478 | DeepCoder-14B-Preview | 1048.360 |
| atcoder.abc366_g | DeepCoder-14B-Preview | 1048.360 |
| atcoder.abc373_f | DeepCoder-14B-Preview | 1048.360 |
| atcoder.abc373_e | DeepCoder-14B-Preview | 1048.360 |
| atcoder.abc372_g | DeepCoder-14B-Preview | 1048.360 |
| atcoder.abc370_f | DeepCoder-14B-Preview | 1048.360 |
| atcoder.abc370_g | DeepCoder-14B-Preview | 1048.360 |
| atcoder.abc367_g | DeepCoder-14B-Preview | 1048.360 |
| leetcode.3562 | O1-Preview-2024-09-12 | 984.201 |
| leetcode.3344 | DeepSeek-V3 copy | 980.094 |
| atcoder.arc188_d | DeepSeek-V3 copy | 980.094 |
| leetcode.3233 | Claude-3.5-Sonnet-20240620 | 957.799 |
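
The `model` and `min_elo` columns above read as the lowest-rated model that solved each problem, together with that model's Elo. Under that reading, the columns could be computed roughly as follows. The `elo` and `solved_by` structures, the model names, and the ratings are hypothetical placeholders.

```python
# Hypothetical inputs: Elo per model, and the models that solved each example.
elo = {"model_a": 1085.0, "model_b": 980.0, "model_c": 950.0}
solved_by = {
    "p1": ["model_a"],
    "p2": ["model_a", "model_b"],
    "p3": [],  # unsolved examples have no minimum Elo
}

def min_elo_solver(solved_by, elo):
    """Map each solved example to (lowest-rated solving model, its Elo)."""
    out = {}
    for example, models in solved_by.items():
        if not models:
            continue  # skip examples that no model solved
        weakest = min(models, key=lambda m: elo[m])
        out[example] = (weakest, elo[weakest])
    return out

# Sort descending by min Elo, so the hardest solved examples come first.
rows = sorted(min_elo_solver(solved_by, elo).items(), key=lambda kv: -kv[1][1])
for example, (model, rating) in rows:
    print(f"| {example} | {model} | {rating:.3f} |")
```
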
These are the 10 problems whose results correlate least with the overall evaluation (a negative tau means that better models tend to do worse on the problem).
| example_link | acc | tau |
|---|---|---|
| codeforces.1883_B | 0.417 | -0.387 |
| leetcode.2834 | 0.958 | -0.264 |
| atcoder.abc379_f | 0.333 | -0.255 |
| leetcode.2857 | 0.792 | -0.216 |
| leetcode.2919 | 0.750 | -0.174 |
| leetcode.3230 | 0.792 | -0.142 |
| leetcode.3347 | 0.958 | -0.138 |
| leetcode.3233 | 0.042 | -0.138 |
| atcoder.abc303_a | 0.917 | -0.127 |
| atcoder.abc384_f | 0.667 | -0.096 |
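
To make the `tau` column concrete, here is a minimal sketch of a rank correlation between each model's overall score and whether it solved one problem. It assumes Kendall's tau-b, which handles the many ties that pass/fail data produces. The report does not say which variant it uses, and the scores below are made up.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pairs)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: counted nowhere
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    # Undefined when one variable is constant (e.g. every model solved it).
    return (concordant - discordant) / denom if denom else float("nan")

# Hypothetical data: overall score per model, and pass/fail on one problem.
overall = [0.9, 0.8, 0.5, 0.3]
solved = [0, 0, 1, 1]  # only the weaker models solved this problem
tau = kendall_tau_b(overall, solved)
print(round(tau, 3))  # -0.816
```

A problem that only weaker models solve gets a negative value, which matches the pattern in the table above.
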
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.