agi_english: by examples


Not solved by any model

No model solves the following 25 examples. Provided these are valid problems, solving some of them is a good signal that your model is better than the leading models.
agi_english/1, agi_english/133, agi_english/1424, agi_english/1435, agi_english/145, agi_english/1560, agi_english/1642, agi_english/165, agi_english/1796, agi_english/2095, agi_english/210, agi_english/2386, agi_english/250, agi_english/2529, agi_english/291, agi_english/326, agi_english/357, agi_english/41, agi_english/424, agi_english/447, agi_english/576, agi_english/607, agi_english/620, agi_english/89, agi_english/97

Problems solved by 1 model only

example_link model min_elo
agi_english/455 Qwen1.5-72B 1082.103
agi_english/170 Qwen1.5-32B 1075.093
agi_english/155 dbrx-base 1054.947
agi_english/140 dbrx-base 1054.947
agi_english/2336 deepseek-llm-67b-base 1053.665
agi_english/1382 deepseek-llm-67b-base 1053.665
agi_english/22 deepseek-llm-67b-base 1053.665
agi_english/425 Qwen1.5-14B 1050.821
agi_english/516 Mixtral-8x7B-v0.1 1035.144
agi_english/1284 Mixtral-8x7B-v0.1 1035.144
agi_english/218 Qwen1.5-7B 1027.571
agi_english/597 llama_33B 1003.303
agi_english/965 llama_33B 1003.303
agi_english/919 llama_33B 1003.303
agi_english/109 llama2_07B 979.909
agi_english/190 llama2_07B 979.909
agi_english/143 deepseek-llm-7b-base 978.233
agi_english/8 deepseek-llm-7b-base 978.233
agi_english/172 deepseek-llm-7b-base 978.233
agi_english/139 deepseek-llm-7b-base 978.233
agi_english/124 deepseek-llm-7b-base 978.233
agi_english/344 mpt-30b 977.535
agi_english/247 mpt-30b 977.535
agi_english/556 Qwen1.5-1.8B 977.535
agi_english/245 mpt-30b 977.535
agi_english/327 Qwen1.5-1.8B 977.535
agi_english/2033 Qwen1.5-1.8B 977.535
agi_english/2396 Qwen1.5-1.8B 977.535
agi_english/46 llama_13B 968.580
agi_english/114 stablelm-base-alpha-7b-v2 967.318
agi_english/1057 deepseek-moe-16b-base 961.558
agi_english/365 Qwen1.5-0.5B 960.572
agi_english/171 Qwen1.5-0.5B 960.572
agi_english/714 Qwen1.5-0.5B 960.572
agi_english/1120 gemma-2b 953.233
agi_english/748 gemma-2b 953.233
agi_english/716 gemma-2b 953.233
agi_english/850 gemma-2b 953.233
agi_english/605 gemma-2b 953.233
agi_english/451 gemma-2b 953.233
agi_english/2190 pythia-12b-deduped-v0 942.861
agi_english/994 pythia-12b-deduped-v0 942.861
agi_english/120 pythia-12b-deduped-v0 942.861
agi_english/996 pythia-12b-deduped-v0 942.861
agi_english/1780 pythia-2.8b-deduped 939.288
agi_english/1095 pythia-2.8b-deduped 939.288
agi_english/465 pythia-6.9b-deduped-v0 939.002
agi_english/2456 pythia-1b-deduped 934.985
agi_english/193 pythia-1b-deduped 934.985
agi_english/1538 pythia-1b-deduped 934.985
agi_english/1924 pythia-1b-deduped 934.985
agi_english/214 pythia-1.4b-deduped-v0 933.834
agi_english/989 pythia-1.4b-deduped-v0 933.834
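
The grouping used in the sections above (unsolved, solved by exactly one model, minimum Elo to solve) can be reproduced from a per-model results table. The following is a minimal sketch, not the page's actual code. It assumes that min_elo is the lowest Elo among the models that solve a problem, which the page does not state explicitly. The function name, the input layout, and the toy model names and ratings are all hypothetical.

```python
from collections import defaultdict


def summarize_solvers(results, elo):
    """Group problems by how many models solve them.

    results: {model_name: {problem_id: bool}}  (hypothetical layout)
    elo:     {model_name: float}               (hypothetical layout)
    Returns (unsolved, solved_by_one, min_elo).
    """
    solvers = defaultdict(list)
    problems = set()
    for model, per_problem in results.items():
        for problem, solved in per_problem.items():
            problems.add(problem)
            if solved:
                solvers[problem].append(model)

    # Problems that no model solves.
    unsolved = sorted(p for p in problems if not solvers.get(p))
    # Problems solved by exactly one model, mapped to that model.
    solved_by_one = {p: ms[0] for p, ms in solvers.items() if len(ms) == 1}
    # Assumption: min_elo is the lowest Elo among the solving models.
    min_elo = {p: min(elo[m] for m in ms) for p, ms in solvers.items() if ms}
    return unsolved, solved_by_one, min_elo


# Toy data, invented for illustration only.
toy_results = {
    "model_a": {"p1": True, "p2": False, "p3": False},
    "model_b": {"p1": True, "p2": True, "p3": False},
}
toy_elo = {"model_a": 1000.0, "model_b": 950.0}

unsolved, solved_by_one, min_elo = summarize_solvers(toy_results, toy_elo)
print(unsolved, solved_by_one, min_elo)
```

Under this assumption, a problem solved by exactly one model has that model's Elo as its min_elo.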

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. better models tend to do worse on them.

example_link acc tau
agi_english/152 0.486 -0.483
agi_english/2432 0.143 -0.469
agi_english/851 0.457 -0.461
agi_english/112 0.171 -0.460
agi_english/2142 0.143 -0.442
agi_english/1201 0.343 -0.435
agi_english/1885 0.143 -0.429
agi_english/1951 0.114 -0.427
agi_english/894 0.514 -0.427
agi_english/2442 0.314 -0.424
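
The page does not say which correlation statistic the tau column holds. The sketch below assumes Kendall's tau-b between each model's overall score and whether that model solves the problem, and takes acc to be the fraction of models that solve it. Both are assumptions, and the function names and toy numbers are hypothetical.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences, handling ties."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: counted nowhere
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x_only)
        * (concordant + discordant + ties_y_only)
    )
    if denom == 0:
        return float("nan")  # undefined when one variable is constant
    return (concordant - discordant) / denom


def problem_stats(overall_scores, solved):
    """Return (acc, tau) for one problem.

    overall_scores: one overall score per model (hypothetical input)
    solved:         1 if that model solves the problem, else 0
    """
    acc = sum(solved) / len(solved)
    return acc, kendall_tau_b(overall_scores, solved)


# Toy data: only the two weakest of four models solve the problem,
# so the correlation with overall score is strongly negative.
acc, tau = problem_stats([1, 2, 3, 4], [1, 1, 0, 0])
print(acc, tau)
```

A strongly negative tau means the weaker models solve the problem more often than the stronger ones, which is what makes a problem suspect.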

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
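
As a rough sketch of how such a histogram can be computed, the helper below counts values in equal-width bins. The function name and the choice of five bins are assumptions, not the page's settings. The sample input is only the acc column of the 10 suspect problems listed on this page, not the full dataset.

```python
def histogram(values, lo, hi, bins):
    """Count values in `bins` equal-width bins over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if not lo <= v <= hi:
            continue  # ignore out-of-range values
        # The top edge (v == hi) goes into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# acc values of the 10 suspect problems from this page.
suspect_acc = [0.486, 0.143, 0.457, 0.171, 0.143,
               0.343, 0.143, 0.114, 0.514, 0.314]
counts = histogram(suspect_acc, 0.0, 1.0, 5)
print(counts)  # [5, 2, 3, 0, 0]
```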

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.