There are 25 examples that no model solved.
If these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
agi_english/1, agi_english/133, agi_english/1424, agi_english/1435, agi_english/145, agi_english/1560, agi_english/1642, agi_english/165, agi_english/1796, agi_english/2095, agi_english/210, agi_english/2386, agi_english/250, agi_english/2529, agi_english/291, agi_english/326, agi_english/357, agi_english/41, agi_english/424, agi_english/447, agi_english/576, agi_english/607, agi_english/620, agi_english/89, agi_english/97
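
A list like the one above can be derived from per-model results with a set difference. The following is a minimal sketch, not the code that produced this report: the `solved` mapping (model name to the set of example IDs it answered correctly), the model names, and the example IDs are all hypothetical toy data.

```python
def unsolved_examples(all_examples, solved):
    """Return the sorted example IDs that no model answered correctly.

    all_examples: iterable of example IDs in the benchmark.
    solved: dict mapping model name -> set of example IDs it solved
            (hypothetical structure, assumed for illustration).
    """
    # Union of everything solved by at least one model.
    solved_by_any = set().union(*solved.values())
    return sorted(set(all_examples) - solved_by_any)


# Toy data, not the real evaluation results.
toy_examples = ["ex/1", "ex/2", "ex/3", "ex/4"]
toy_solved = {
    "model_a": {"ex/1", "ex/2"},
    "model_b": {"ex/2"},
}
toy_unsolved = unsolved_examples(toy_examples, toy_solved)
print(toy_unsolved)
```
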
| example_link | model | min_elo |
|---|---|---|
| agi_english/455 | Qwen1.5-72B | 1082.103 |
| agi_english/170 | Qwen1.5-32B | 1075.093 |
| agi_english/155 | dbrx-base | 1054.947 |
| agi_english/140 | dbrx-base | 1054.947 |
| agi_english/2336 | deepseek-llm-67b-base | 1053.665 |
| agi_english/1382 | deepseek-llm-67b-base | 1053.665 |
| agi_english/22 | deepseek-llm-67b-base | 1053.665 |
| agi_english/425 | Qwen1.5-14B | 1050.821 |
| agi_english/516 | Mixtral-8x7B-v0.1 | 1035.144 |
| agi_english/1284 | Mixtral-8x7B-v0.1 | 1035.144 |
| agi_english/218 | Qwen1.5-7B | 1027.571 |
| agi_english/597 | llama_33B | 1003.303 |
| agi_english/965 | llama_33B | 1003.303 |
| agi_english/919 | llama_33B | 1003.303 |
| agi_english/109 | llama2_07B | 979.909 |
| agi_english/190 | llama2_07B | 979.909 |
| agi_english/143 | deepseek-llm-7b-base | 978.233 |
| agi_english/8 | deepseek-llm-7b-base | 978.233 |
| agi_english/172 | deepseek-llm-7b-base | 978.233 |
| agi_english/139 | deepseek-llm-7b-base | 978.233 |
| agi_english/124 | deepseek-llm-7b-base | 978.233 |
| agi_english/344 | mpt-30b | 977.535 |
| agi_english/247 | mpt-30b | 977.535 |
| agi_english/556 | Qwen1.5-1.8B | 977.535 |
| agi_english/245 | mpt-30b | 977.535 |
| agi_english/327 | Qwen1.5-1.8B | 977.535 |
| agi_english/2033 | Qwen1.5-1.8B | 977.535 |
| agi_english/2396 | Qwen1.5-1.8B | 977.535 |
| agi_english/46 | llama_13B | 968.580 |
| agi_english/114 | stablelm-base-alpha-7b-v2 | 967.318 |
| agi_english/1057 | deepseek-moe-16b-base | 961.558 |
| agi_english/365 | Qwen1.5-0.5B | 960.572 |
| agi_english/171 | Qwen1.5-0.5B | 960.572 |
| agi_english/714 | Qwen1.5-0.5B | 960.572 |
| agi_english/1120 | gemma-2b | 953.233 |
| agi_english/748 | gemma-2b | 953.233 |
| agi_english/716 | gemma-2b | 953.233 |
| agi_english/850 | gemma-2b | 953.233 |
| agi_english/605 | gemma-2b | 953.233 |
| agi_english/451 | gemma-2b | 953.233 |
| agi_english/2190 | pythia-12b-deduped-v0 | 942.861 |
| agi_english/994 | pythia-12b-deduped-v0 | 942.861 |
| agi_english/120 | pythia-12b-deduped-v0 | 942.861 |
| agi_english/996 | pythia-12b-deduped-v0 | 942.861 |
| agi_english/1780 | pythia-2.8b-deduped | 939.288 |
| agi_english/1095 | pythia-2.8b-deduped | 939.288 |
| agi_english/465 | pythia-6.9b-deduped-v0 | 939.002 |
| agi_english/2456 | pythia-1b-deduped | 934.985 |
| agi_english/193 | pythia-1b-deduped | 934.985 |
| agi_english/1538 | pythia-1b-deduped | 934.985 |
| agi_english/1924 | pythia-1b-deduped | 934.985 |
| agi_english/214 | pythia-1.4b-deduped-v0 | 933.834 |
| agi_english/989 | pythia-1.4b-deduped-v0 | 933.834 |
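
One plausible reading of the table above is that `min_elo` is the Elo rating of the lowest-rated model that solves the example, and `model` is that model. Under that assumption, the columns could be computed as in the sketch below; the `elo` and `solved` structures and all values in them are hypothetical toy data.

```python
def min_elo_to_solve(elo, solved):
    """For each solved example, find the lowest-rated model that solves it.

    elo: dict mapping model name -> Elo rating (assumed structure).
    solved: dict mapping model name -> set of example IDs it solved.
    Returns a dict: example ID -> (model name, Elo rating).
    """
    best = {}
    for model, examples in solved.items():
        for example in examples:
            # Keep the model with the smallest rating seen so far.
            if example not in best or elo[model] < best[example][1]:
                best[example] = (model, elo[model])
    return best


# Toy data, not the real evaluation results.
toy_elo = {"model_a": 1050.0, "model_b": 950.0}
toy_solved = {
    "model_a": {"ex/1", "ex/2"},
    "model_b": {"ex/2"},
}
toy_min_elo = min_elo_to_solve(toy_elo, toy_solved)

# Print rows sorted from hardest (highest min Elo) to easiest.
for example, (model, rating) in sorted(
    toy_min_elo.items(), key=lambda item: -item[1][1]
):
    print(f"| {example} | {model} | {rating:.3f} |")
```

Examples that appear only near the top of such a ranking are solved only by highly rated models, which is what makes them useful for separating strong models.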
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
| example_link | acc | tau |
|---|---|---|
| agi_english/152 | 0.486 | -0.483 |
| agi_english/2432 | 0.143 | -0.469 |
| agi_english/851 | 0.457 | -0.461 |
| agi_english/112 | 0.171 | -0.460 |
| agi_english/2142 | 0.143 | -0.442 |
| agi_english/1201 | 0.343 | -0.435 |
| agi_english/1885 | 0.143 | -0.429 |
| agi_english/1951 | 0.114 | -0.427 |
| agi_english/894 | 0.514 | -0.427 |
| agi_english/2442 | 0.314 | -0.424 |
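
The report does not say how `tau` is computed. A common choice for this kind of statistic is Kendall's tau-b between each model's overall rating and whether it solved the example, with `acc` as the fraction of models that solved it. The sketch below illustrates that assumed definition in pure Python; the ratings and correctness values are toy data.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b for two equal-length sequences (handles ties)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n = len(xs)
    n_pairs = n * (n - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    # tau is undefined when one sequence is constant.
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data: four models sorted by rating; only the two weakest solve the example.
toy_ratings = [900.0, 950.0, 1000.0, 1050.0]
toy_correct = [1, 1, 0, 0]

toy_acc = sum(toy_correct) / len(toy_correct)
toy_tau = kendall_tau_b(toy_ratings, toy_correct)
print(f"acc={toy_acc:.3f} tau={toy_tau:.3f}")
```

A strongly negative value, as in this toy case, means the example is solved mostly by weaker models, which can point to a mislabeled or ambiguous problem.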
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.