agi_english: by examples


Not solved by any model

There are 25 examples that no model solved. If these problems are well-posed, solving some of them is a good signal that your model is indeed better than the leading models.
agi_english/1, agi_english/133, agi_english/1424, agi_english/1435, agi_english/145, agi_english/1560, agi_english/1642, agi_english/165, agi_english/1796, agi_english/2095, agi_english/210, agi_english/2386, agi_english/250, agi_english/2529, agi_english/291, agi_english/326, agi_english/357, agi_english/41, agi_english/424, agi_english/447, agi_english/576, agi_english/607, agi_english/620, agi_english/89, agi_english/97

Problems solved by 1 model only

example_link model min_elo
agi_english/455 Qwen1.5-72B 1207.278
agi_english/170 Qwen1.5-32B 1185.204
agi_english/155 dbrx-base 1133.819
agi_english/140 dbrx-base 1133.819
agi_english/2336 deepseek-llm-67b-base 1129.888
agi_english/22 deepseek-llm-67b-base 1129.888
agi_english/1382 deepseek-llm-67b-base 1129.888
agi_english/425 Qwen1.5-14B 1125.421
agi_english/1284 Mixtral-8x7B-v0.1 1079.921
agi_english/516 Mixtral-8x7B-v0.1 1079.921
agi_english/218 Qwen1.5-7B 1064.266
agi_english/597 llama_33B 999.688
agi_english/965 llama_33B 999.688
agi_english/919 llama_33B 999.688
agi_english/190 llama2_07B 945.229
agi_english/109 llama2_07B 945.229
agi_english/143 deepseek-llm-7b-base 942.256
agi_english/139 deepseek-llm-7b-base 942.256
agi_english/8 deepseek-llm-7b-base 942.256
agi_english/124 deepseek-llm-7b-base 942.256
agi_english/172 deepseek-llm-7b-base 942.256
agi_english/327 Qwen1.5-1.8B 941.950
agi_english/2396 Qwen1.5-1.8B 941.950
agi_english/2033 Qwen1.5-1.8B 941.950
agi_english/556 Qwen1.5-1.8B 941.950
agi_english/344 mpt-30b 939.598
agi_english/245 mpt-30b 939.598
agi_english/247 mpt-30b 939.598
agi_english/46 llama_13B 920.999
agi_english/114 stablelm-base-alpha-7b-v2 915.194
agi_english/1057 deepseek-moe-16b-base 904.370
agi_english/365 Qwen1.5-0.5B 900.606
agi_english/714 Qwen1.5-0.5B 900.606
agi_english/171 Qwen1.5-0.5B 900.606
agi_english/605 gemma-2b 891.944
agi_english/1120 gemma-2b 891.944
agi_english/748 gemma-2b 891.944
agi_english/451 gemma-2b 891.944
agi_english/850 gemma-2b 891.944
agi_english/716 gemma-2b 891.944
agi_english/120 pythia-12b-deduped-v0 869.467
agi_english/2190 pythia-12b-deduped-v0 869.467
agi_english/996 pythia-12b-deduped-v0 869.467
agi_english/994 pythia-12b-deduped-v0 869.467
agi_english/465 pythia-6.9b-deduped-v0 861.694
agi_english/1095 pythia-2.8b-deduped 861.302
agi_english/1780 pythia-2.8b-deduped 861.302
agi_english/1924 pythia-1b-deduped 848.328
agi_english/1538 pythia-1b-deduped 848.328
agi_english/193 pythia-1b-deduped 848.328
agi_english/2456 pythia-1b-deduped 848.328
agi_english/989 pythia-1.4b-deduped-v0 846.722
agi_english/214 pythia-1.4b-deduped-v0 846.722
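The two lists above can be derived from a per-problem record of which models answered correctly. The following is a minimal sketch of that bookkeeping, not the report's actual code. The model names, ratings, and problem ids are made up for illustration, and it assumes min_elo means the lowest Elo rating among the models that solved a problem.

```python
# Hypothetical inputs: model -> Elo rating, and problem -> set of models that solved it.
elo = {"model_a": 1200.0, "model_b": 1000.0, "model_c": 900.0}
solved_by = {
    "p1": set(),                     # solved by no model
    "p2": {"model_b"},               # solved by exactly one model
    "p3": {"model_a", "model_c"},    # solved by several models
}


def min_elo(solvers, ratings):
    """Lowest rating among the solvers, or None if nobody solved the problem."""
    return min((ratings[m] for m in solvers), default=None)


# Problems that no model solved.
unsolved = sorted(p for p, solvers in solved_by.items() if not solvers)

# Problems solved by exactly one model, as (problem, model, min_elo) rows,
# sorted by rating in descending order like the table above.
single = []
for problem, solvers in solved_by.items():
    if len(solvers) == 1:
        (model,) = solvers
        single.append((problem, model, min_elo(solvers, elo)))
single.sort(key=lambda row: row[2], reverse=True)

print(unsolved)
print(single)
```

When only one model solves a problem, min_elo is simply that model's rating, which is why rows for the same model share the same value.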

Suspect problems

These are the 10 problems whose results correlate most negatively with the overall evaluation (i.e., better models tend to do worse on them).

example_link acc tau
agi_english/152 0.486 -0.483
agi_english/2432 0.143 -0.469
agi_english/851 0.457 -0.461
agi_english/112 0.171 -0.460
agi_english/2142 0.143 -0.442
agi_english/1201 0.343 -0.435
agi_english/1885 0.143 -0.429
agi_english/1951 0.114 -0.427
agi_english/894 0.514 -0.427
agi_english/2442 0.314 -0.424
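The report does not say how tau is computed. One plausible reading, assumed here, is Kendall's tau between each model's overall rating and whether that model solved the problem, with acc being the fraction of models that solved it. The sketch below implements the tau-b variant, which handles the many ties that binary correctness produces. The ratings and outcomes are invented for illustration.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences, by direct pair counting."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both factors
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical problem on which only the two weakest models answer correctly.
ratings = [1200.0, 1100.0, 1000.0, 900.0]
correct = [0, 0, 1, 1]

acc = sum(correct) / len(correct)
tau = kendall_tau_b(ratings, correct)
print(acc, round(tau, 3))
```

A strongly negative tau under this reading means that weaker models solve the problem more often than stronger ones, which can indicate a mislabeled answer or an ambiguous question.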

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
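As a rough illustration of how such a histogram can be tabulated, the sketch below bins per-problem accuracies into equal-width buckets. The bin count and the accuracy values are assumptions chosen for the example and are not taken from the report.

```python
def accuracy_histogram(accuracies, n_bins=10):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = [0] * n_bins
    for acc in accuracies:
        # An accuracy of exactly 1.0 goes into the last bin.
        index = min(int(acc * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical per-problem accuracies (fraction of models that solved each problem).
example_accuracies = [0.0, 0.05, 0.5, 0.95, 1.0]
hist = accuracy_histogram(example_accuracies)
print(hist)
```

The same function could be applied to min_elo values after rescaling them to the [0, 1] range, or adapted to use fixed-width rating bins instead.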

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.