There are 25 examples not solved by any model.
If these problems are well-posed, solving some of them is a good signal that your model is genuinely better than the leading models.
agi_english/1, agi_english/133, agi_english/1424, agi_english/1435, agi_english/145, agi_english/1560, agi_english/1642, agi_english/165, agi_english/1796, agi_english/2095, agi_english/210, agi_english/2386, agi_english/250, agi_english/2529, agi_english/291, agi_english/326, agi_english/357, agi_english/41, agi_english/424, agi_english/447, agi_english/576, agi_english/607, agi_english/620, agi_english/89, agi_english/97
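As a rough illustration of how a list like the one above could be derived, the following sketch collects the examples that no model answers correctly. The `results` mapping (model name to per-example correctness) and the toy identifiers are hypothetical; the report does not describe its actual data format.

```python
from typing import Dict, List


def unsolved_examples(results: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return the example ids that no model answered correctly.

    `results` is an assumed layout: model name -> {example id -> correct?}.
    """
    # Gather every example id that appears for at least one model.
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)

    # Keep only the examples where every model is wrong (or has no entry).
    return sorted(
        ex
        for ex in example_ids
        if not any(per_example.get(ex, False) for per_example in results.values())
    )


# Toy data, not taken from the report.
results = {
    "model_a": {"ex/1": False, "ex/2": True, "ex/3": False},
    "model_b": {"ex/1": False, "ex/2": False, "ex/3": True},
}
print(unsolved_examples(results))
```

An example missing from a model's results is treated here as unsolved by that model, which is a design choice of this sketch rather than something the report specifies.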
example_link | model | min_elo |
---|---|---|
agi_english/455 | Qwen1.5-72B | 1207.278 |
agi_english/170 | Qwen1.5-32B | 1185.204 |
agi_english/140 | dbrx-base | 1133.819 |
agi_english/155 | dbrx-base | 1133.819 |
agi_english/2336 | deepseek-llm-67b-base | 1129.888 |
agi_english/22 | deepseek-llm-67b-base | 1129.888 |
agi_english/1382 | deepseek-llm-67b-base | 1129.888 |
agi_english/425 | Qwen1.5-14B | 1125.421 |
agi_english/1284 | Mixtral-8x7B-v0.1 | 1079.921 |
agi_english/516 | Mixtral-8x7B-v0.1 | 1079.921 |
agi_english/218 | Qwen1.5-7B | 1064.266 |
agi_english/919 | llama_33B | 999.688 |
agi_english/597 | llama_33B | 999.688 |
agi_english/965 | llama_33B | 999.688 |
agi_english/190 | llama2_07B | 945.229 |
agi_english/109 | llama2_07B | 945.229 |
agi_english/143 | deepseek-llm-7b-base | 942.256 |
agi_english/8 | deepseek-llm-7b-base | 942.256 |
agi_english/172 | deepseek-llm-7b-base | 942.256 |
agi_english/139 | deepseek-llm-7b-base | 942.256 |
agi_english/124 | deepseek-llm-7b-base | 942.256 |
agi_english/556 | Qwen1.5-1.8B | 941.950 |
agi_english/2396 | Qwen1.5-1.8B | 941.950 |
agi_english/327 | Qwen1.5-1.8B | 941.950 |
agi_english/2033 | Qwen1.5-1.8B | 941.950 |
agi_english/247 | mpt-30b | 939.598 |
agi_english/245 | mpt-30b | 939.598 |
agi_english/344 | mpt-30b | 939.598 |
agi_english/46 | llama_13B | 920.999 |
agi_english/114 | stablelm-base-alpha-7b-v2 | 915.194 |
agi_english/1057 | deepseek-moe-16b-base | 904.370 |
agi_english/365 | Qwen1.5-0.5B | 900.606 |
agi_english/171 | Qwen1.5-0.5B | 900.606 |
agi_english/714 | Qwen1.5-0.5B | 900.606 |
agi_english/716 | gemma-2b | 891.944 |
agi_english/748 | gemma-2b | 891.944 |
agi_english/605 | gemma-2b | 891.944 |
agi_english/451 | gemma-2b | 891.944 |
agi_english/1120 | gemma-2b | 891.944 |
agi_english/850 | gemma-2b | 891.944 |
agi_english/994 | pythia-12b-deduped-v0 | 869.467 |
agi_english/2190 | pythia-12b-deduped-v0 | 869.467 |
agi_english/120 | pythia-12b-deduped-v0 | 869.467 |
agi_english/996 | pythia-12b-deduped-v0 | 869.467 |
agi_english/465 | pythia-6.9b-deduped-v0 | 861.694 |
agi_english/1095 | pythia-2.8b-deduped | 861.302 |
agi_english/1780 | pythia-2.8b-deduped | 861.302 |
agi_english/2456 | pythia-1b-deduped | 848.328 |
agi_english/1538 | pythia-1b-deduped | 848.328 |
agi_english/193 | pythia-1b-deduped | 848.328 |
agi_english/1924 | pythia-1b-deduped | 848.328 |
agi_english/989 | pythia-1.4b-deduped-v0 | 846.722 |
agi_english/214 | pythia-1.4b-deduped-v0 | 846.722 |
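One plausible reading of the table above is that each row pairs an example with the lowest-rated model that solves it. Under that assumption, a minimal sketch of the computation looks like this; the `results` and `elo` mappings and the toy model names are hypothetical.

```python
from typing import Dict, Tuple


def min_elo_to_solve(
    results: Dict[str, Dict[str, bool]],
    elo: Dict[str, float],
) -> Dict[str, Tuple[str, float]]:
    """For each solved example, return (model, elo) of the lowest-Elo solver.

    Assumed layout: `results` maps model -> {example id -> correct?},
    and `elo` maps model -> Elo rating.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for model, per_example in results.items():
        for ex, correct in per_example.items():
            if not correct:
                continue
            # Keep the solver with the smallest Elo seen so far.
            if ex not in best or elo[model] < best[ex][1]:
                best[ex] = (model, elo[model])
    return best


# Toy data, not taken from the report.
elo = {"small": 900.0, "large": 1200.0}
results = {
    "small": {"ex/1": True, "ex/2": False},
    "large": {"ex/1": True, "ex/2": True},
}
table = min_elo_to_solve(results, elo)
# Print rows sorted by descending min_elo, mirroring the table's ordering.
for ex, (model, rating) in sorted(table.items(), key=lambda kv: -kv[1][1]):
    print(f"{ex} | {model} | {rating:.3f} |")
```

Examples that no model solves never enter `best`, so they would be reported separately.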
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
example_link | acc | tau |
---|---|---|
agi_english/152 | 0.486 | -0.483 |
agi_english/2432 | 0.143 | -0.469 |
agi_english/851 | 0.457 | -0.461 |
agi_english/112 | 0.171 | -0.460 |
agi_english/2142 | 0.143 | -0.442 |
agi_english/1201 | 0.343 | -0.435 |
agi_english/1885 | 0.143 | -0.429 |
agi_english/1951 | 0.114 | -0.427 |
agi_english/894 | 0.514 | -0.427 |
agi_english/2442 | 0.314 | -0.424 |
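The `tau` column above is not defined in the report. If it is a Kendall rank correlation between each model's overall score and whether that model solves the problem, it could be computed along the following lines. This is a sketch under that assumption, using the tau-b variant to handle the many ties in binary correctness; the function names and toy numbers are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equal-length sequences (O(n^2) pairs)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    n0 = len(x) * (len(x) - 1) / 2
    denom = math.sqrt((n0 - ties_x) * (n0 - ties_y))
    # Undefined when one of the sequences is constant.
    return (concordant - discordant) / denom if denom else float("nan")


def accuracy(correct: Sequence[int]) -> float:
    """Fraction of models that solve the problem."""
    return sum(correct) / len(correct)


# Toy data: four models ordered by overall score; only the weaker two
# solve the problem, so the correlation comes out negative.
scores = [900.0, 1000.0, 1100.0, 1200.0]
correct = [1, 1, 0, 0]
print(accuracy(correct), kendall_tau_b(scores, correct))
```

A negative value means higher-scoring models are less likely to solve the problem, which matches the interpretation given for the table.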
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.