There are 84 examples that no model in this comparison solves.
If these are well-posed problems, solving some of them is a good signal that your model genuinely improves on the leading models.
mmlu/10112, mmlu/10177, mmlu/10356, mmlu/10370, mmlu/1047, mmlu/10549, mmlu/10576, mmlu/10582, mmlu/10621, mmlu/10752, mmlu/10802, mmlu/11046, mmlu/11059, mmlu/11064, mmlu/11109, mmlu/11302, mmlu/11632, mmlu/11725, mmlu/11789, mmlu/11829, mmlu/11880, mmlu/12161, mmlu/1269, mmlu/12698, mmlu/12820, mmlu/1298, mmlu/13190, mmlu/13218, mmlu/13722, mmlu/13734, mmlu/13740, mmlu/13747, mmlu/13751, mmlu/13767, mmlu/13768, mmlu/13788, mmlu/13795, mmlu/1380, mmlu/13825, mmlu/1416, mmlu/1457, mmlu/1481, mmlu/1676, mmlu/168, mmlu/1720, mmlu/1748, mmlu/1931, mmlu/2025, mmlu/2131, mmlu/2392, mmlu/2440, mmlu/2951, mmlu/3072, mmlu/3077, mmlu/3295, mmlu/3356, mmlu/3405, mmlu/3507, mmlu/419, mmlu/4222, mmlu/4329, mmlu/4377, mmlu/4393, mmlu/4448, mmlu/479, mmlu/516, mmlu/552, mmlu/6156, mmlu/6514, mmlu/6864, mmlu/6908, mmlu/7467, mmlu/8177, mmlu/8195, mmlu/8235, mmlu/8280, mmlu/8388, mmlu/8445, mmlu/8815, mmlu/9057, mmlu/9080, mmlu/9607, mmlu/9801, mmlu/9917
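A set like this can be derived from a per-model correctness table. Below is a minimal sketch, assuming a dict mapping each model to the set of example IDs it answers correctly (all names and data here are hypothetical stand-ins for the real evaluation results):

```python
# Sketch: find examples that no model solves.
# `all_examples` is the full evaluation set; `results` maps a
# (hypothetical) model name to the set of example IDs it got right.
all_examples = {"mmlu/1", "mmlu/2", "mmlu/3", "mmlu/4"}
results = {
    "model_a": {"mmlu/1", "mmlu/2"},
    "model_b": {"mmlu/2", "mmlu/3"},
}

# An example is "unsolved" if it appears in no model's correct set.
solved = set().union(*results.values())
unsolved = sorted(all_examples - solved)
print(unsolved)  # -> ['mmlu/4']
```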
Each remaining example is listed below with the weakest model that solves it; min_elo is that model's Elo rating, i.e. the minimum Elo at which the example gets solved.

example_link | model | min_elo |
---|---|---|
mmlu/2211 | Qwen1.5-110B | 1305.025 |
mmlu/2416 | Qwen1.5-110B | 1305.025 |
mmlu/2255 | Qwen1.5-110B | 1305.025 |
mmlu/10778 | Qwen1.5-110B | 1305.025 |
mmlu/2092 | Qwen1.5-110B | 1305.025 |
mmlu/10402 | Qwen1.5-110B | 1305.025 |
mmlu/1390 | Qwen1.5-110B | 1305.025 |
mmlu/2250 | Qwen1.5-110B | 1305.025 |
mmlu/12903 | Qwen1.5-110B | 1305.025 |
mmlu/10411 | Meta-Llama-3-70B | 1266.741 |
mmlu/1033 | Meta-Llama-3-70B | 1266.741 |
mmlu/4842 | Meta-Llama-3-70B | 1266.741 |
mmlu/3990 | Meta-Llama-3-70B | 1266.741 |
mmlu/90 | Meta-Llama-3-70B | 1266.741 |
mmlu/1738 | Meta-Llama-3-70B | 1266.741 |
mmlu/3008 | Meta-Llama-3-70B | 1266.741 |
mmlu/2464 | Mixtral-8x22B-v0.1 | 1250.972 |
mmlu/4315 | Mixtral-8x22B-v0.1 | 1250.972 |
mmlu/8175 | Mixtral-8x22B-v0.1 | 1250.972 |
mmlu/1403 | Mixtral-8x22B-v0.1 | 1250.972 |
mmlu/6273 | Mixtral-8x22B-v0.1 | 1250.972 |
mmlu/12040 | Mixtral-8x22B-v0.1 | 1250.972 |
mmlu/2140 | Mixtral-8x22B-v0.1 | 1250.972 |
mmlu/2508 | Mixtral-8x22B-v0.1 | 1250.972 |
mmlu/8503 | Mixtral-8x22B-v0.1 | 1250.972 |
mmlu/1398 | Mixtral-8x22B-v0.1 | 1250.972 |
mmlu/5876 | Qwen1.5-72B | 1248.779 |
mmlu/6935 | Qwen1.5-72B | 1248.779 |
mmlu/11492 | Qwen1.5-72B | 1248.779 |
mmlu/3214 | Qwen1.5-72B | 1248.779 |
mmlu/10705 | Qwen1.5-72B | 1248.779 |
mmlu/1194 | dbrx-base | 1204.504 |
mmlu/8862 | dbrx-base | 1204.504 |
mmlu/12654 | dbrx-base | 1204.504 |
mmlu/965 | dbrx-base | 1204.504 |
mmlu/2579 | dbrx-base | 1204.504 |
mmlu/1447 | dbrx-base | 1204.504 |
mmlu/10517 | dbrx-base | 1204.504 |
mmlu/4647 | dbrx-base | 1204.504 |
mmlu/1404 | Qwen1.5-32B | 1198.264 |
mmlu/10939 | Qwen1.5-32B | 1198.264 |
mmlu/4276 | Qwen1.5-32B | 1198.264 |
mmlu/1260 | Qwen1.5-32B | 1198.264 |
mmlu/2160 | Qwen1.5-32B | 1198.264 |
mmlu/3723 | deepseek-llm-67b-base | 1174.936 |
mmlu/6148 | deepseek-llm-67b-base | 1174.936 |
mmlu/3997 | deepseek-llm-67b-base | 1174.936 |
mmlu/12674 | Mixtral-8x7B-v0.1 | 1159.020 |
mmlu/2734 | Mixtral-8x7B-v0.1 | 1159.020 |
mmlu/1431 | Qwen1.5-14B | 1131.033 |
mmlu/9161 | Meta-Llama-3-8B | 1103.817 |
mmlu/2668 | llama2_70B | 1079.359 |
mmlu/12497 | gemma-7b | 1076.537 |
mmlu/4342 | Mistral-7B-v0.1 | 1074.545 |
mmlu/3140 | Mistral-7B-v0.1 | 1074.545 |
mmlu/3037 | llama_65B | 1071.934 |
mmlu/10736 | llama_65B | 1071.934 |
mmlu/4755 | llama_65B | 1071.934 |
mmlu/2475 | llama_33B | 1020.632 |
mmlu/11254 | Qwen1.5-4B | 1002.201 |
mmlu/3440 | falcon-40b | 1000.936 |
mmlu/923 | falcon-40b | 1000.936 |
mmlu/978 | falcon-40b | 1000.936 |
mmlu/4864 | falcon-40b | 1000.936 |
mmlu/940 | falcon-40b | 1000.936 |
mmlu/3299 | falcon-40b | 1000.936 |
mmlu/72 | falcon-40b | 1000.936 |
mmlu/10583 | falcon-40b | 1000.936 |
mmlu/1418 | falcon-40b | 1000.936 |
mmlu/1493 | deepseek-llm-7b-base | 938.317 |
mmlu/6835 | deepseek-llm-7b-base | 938.317 |
mmlu/1097 | deepseek-llm-7b-base | 938.317 |
mmlu/6914 | deepseek-llm-7b-base | 938.317 |
mmlu/2319 | deepseek-llm-7b-base | 938.317 |
mmlu/10355 | deepseek-llm-7b-base | 938.317 |
mmlu/7889 | llama2_07B | 929.025 |
mmlu/12213 | Qwen1.5-1.8B | 917.249 |
mmlu/13785 | Qwen1.5-1.8B | 917.249 |
mmlu/11664 | llama_13B | 915.612 |
mmlu/914 | deepseek-moe-16b-base | 912.847 |
mmlu/2597 | deepseek-moe-16b-base | 912.847 |
mmlu/13253 | deepseek-moe-16b-base | 912.847 |
mmlu/4122 | stablelm-base-alpha-7b-v2 | 906.386 |
mmlu/7 | stablelm-base-alpha-7b-v2 | 906.386 |
mmlu/3474 | stablelm-base-alpha-7b-v2 | 906.386 |
mmlu/2129 | stablelm-base-alpha-7b-v2 | 906.386 |
mmlu/4863 | stablelm-3b-4e1t | 906.237 |
mmlu/225 | stablelm-3b-4e1t | 906.237 |
mmlu/11510 | gemma-2b | 883.543 |
mmlu/13842 | gemma-2b | 883.543 |
mmlu/13044 | gemma-2b | 883.543 |
mmlu/13841 | gemma-2b | 883.543 |
mmlu/1841 | gemma-2b | 883.543 |
mmlu/6651 | gemma-2b | 883.543 |
mmlu/10699 | gemma-2b | 883.543 |
mmlu/2491 | Qwen1.5-0.5B | 860.942 |
mmlu/11661 | Qwen1.5-0.5B | 860.942 |
mmlu/4484 | Qwen1.5-0.5B | 860.942 |
mmlu/4462 | Qwen1.5-0.5B | 860.942 |
mmlu/11057 | Qwen1.5-0.5B | 860.942 |
mmlu/4405 | llama_07B | 842.554 |
mmlu/12469 | llama_07B | 842.554 |
mmlu/982 | llama_07B | 842.554 |
mmlu/13739 | llama_07B | 842.554 |
mmlu/8379 | llama_07B | 842.554 |
mmlu/4559 | llama_07B | 842.554 |
mmlu/937 | llama_07B | 842.554 |
mmlu/6755 | llama_07B | 842.554 |
mmlu/10368 | falcon-7b | 790.445 |
mmlu/165 | falcon-7b | 790.445 |
mmlu/6221 | falcon-7b | 790.445 |
mmlu/5654 | falcon-7b | 790.445 |
mmlu/10523 | falcon-7b | 790.445 |
mmlu/1326 | falcon-7b | 790.445 |
mmlu/3916 | pythia-2.8b-deduped | 787.422 |
mmlu/11520 | pythia-2.8b-deduped | 787.422 |
mmlu/988 | pythia-2.8b-deduped | 787.422 |
mmlu/8314 | pythia-2.8b-deduped | 787.422 |
mmlu/2647 | pythia-2.8b-deduped | 787.422 |
mmlu/11540 | pythia-2.8b-deduped | 787.422 |
mmlu/1003 | pythia-2.8b-deduped | 787.422 |
mmlu/4761 | pythia-2.8b-deduped | 787.422 |
mmlu/155 | pythia-2.8b-deduped | 787.422 |
mmlu/4779 | pythia-2.8b-deduped | 787.422 |
mmlu/4106 | pythia-2.8b-deduped | 787.422 |
mmlu/4176 | pythia-2.8b-deduped | 787.422 |
mmlu/4829 | pythia-2.8b-deduped | 787.422 |
mmlu/8490 | pythia-2.8b-deduped | 787.422 |
mmlu/207 | pythia-2.8b-deduped | 787.422 |
mmlu/8218 | pythia-2.8b-deduped | 787.422 |
mmlu/1979 | pythia-1b-deduped | 772.264 |
mmlu/11089 | pythia-1b-deduped | 772.264 |
mmlu/4816 | pythia-1b-deduped | 772.264 |
mmlu/133 | pythia-1b-deduped | 772.264 |
mmlu/5420 | pythia-1b-deduped | 772.264 |
mmlu/5839 | pythia-1b-deduped | 772.264 |
mmlu/11054 | pythia-12b-deduped-v0 | 771.460 |
mmlu/6101 | pythia-12b-deduped-v0 | 771.460 |
mmlu/8501 | pythia-12b-deduped-v0 | 771.460 |
mmlu/10668 | pythia-12b-deduped-v0 | 771.460 |
mmlu/8465 | pythia-6.9b-deduped-v0 | 768.682 |
mmlu/9819 | pythia-6.9b-deduped-v0 | 768.682 |
mmlu/8342 | pythia-6.9b-deduped-v0 | 768.682 |
mmlu/2961 | pythia-6.9b-deduped-v0 | 768.682 |
mmlu/203 | pythia-6.9b-deduped-v0 | 768.682 |
mmlu/1876 | pythia-6.9b-deduped-v0 | 768.682 |
mmlu/9810 | pythia-6.9b-deduped-v0 | 768.682 |
mmlu/10665 | pythia-1.4b-deduped-v0 | 761.364 |
mmlu/11886 | pythia-1.4b-deduped-v0 | 761.364 |
mmlu/11741 | pythia-1.4b-deduped-v0 | 761.364 |
mmlu/12092 | pythia-1.4b-deduped-v0 | 761.364 |
mmlu/13833 | pythia-1.4b-deduped-v0 | 761.364 |
mmlu/13474 | pythia-1.4b-deduped-v0 | 761.364 |
mmlu/12500 | pythia-1.4b-deduped-v0 | 761.364 |
mmlu/1877 | pythia-1.4b-deduped-v0 | 761.364 |
mmlu/9926 | pythia-1.4b-deduped-v0 | 761.364 |
mmlu/11742 | pythia-1.4b-deduped-v0 | 761.364 |
mmlu/11410 | pythia-1.4b-deduped-v0 | 761.364 |
mmlu/10993 | pythia-1.4b-deduped-v0 | 761.364 |
mmlu/10980 | pythia-1.4b-deduped-v0 | 761.364 |
mmlu/10911 | pythia-1.4b-deduped-v0 | 761.364 |
mmlu/11810 | pythia-1.4b-deduped-v0 | 761.364 |
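The min_elo column can be reproduced by taking, for each example, the lowest-rated model that still solves it. A minimal sketch under that assumption, with hypothetical ratings and results:

```python
# Sketch: for each example, find the weakest (lowest-Elo) model that solves it.
# `elo` and `correct` are hypothetical stand-ins for the real leaderboard data.
elo = {"strong": 1305.0, "mid": 1200.0, "weak": 900.0}
correct = {
    "strong": {"mmlu/a", "mmlu/b"},
    "mid": {"mmlu/a"},
    "weak": {"mmlu/b"},
}

min_elo = {}
for model, solved in correct.items():
    for ex in solved:
        # Keep the lowest Elo seen among models that solve this example.
        if ex not in min_elo or elo[model] < min_elo[ex][1]:
            min_elo[ex] = (model, elo[model])

for ex, (model, rating) in sorted(min_elo.items()):
    print(f"{ex} | {model} | {rating:.3f}")
```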
These are the 10 problems whose results correlate least with overall model strength (i.e. better models tend to do worse on these). acc is the fraction of models that answer the problem correctly; tau is the rank correlation between model strength and correctness on the problem, so a negative value means stronger models fail it more often.

example_link | acc | tau |
---|---|---|
mmlu/2534 | 0.194 | -0.551 |
mmlu/805 | 0.306 | -0.550 |
mmlu/24 | 0.333 | -0.526 |
mmlu/1461 | 0.167 | -0.505 |
mmlu/4699 | 0.194 | -0.501 |
mmlu/329 | 0.194 | -0.495 |
mmlu/6255 | 0.250 | -0.493 |
mmlu/8429 | 0.139 | -0.490 |
mmlu/1467 | 0.222 | -0.484 |
mmlu/5196 | 0.278 | -0.484 |
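A per-problem tau like the column above can be computed as a rank correlation between model strength (e.g. Elo) and 0/1 correctness on that problem. A minimal sketch using Kendall's tau-a in pure Python (the exact tau variant used in the table is not specified here, and the data below is hypothetical):

```python
def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical: model Elo ratings and 0/1 correctness on one problem,
# where only the weaker models get it right -> negative tau.
elos = [1300, 1250, 1100, 950, 800]
correct = [0, 0, 1, 1, 1]
print(kendall_tau_a(elos, correct))  # -> -0.6
```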
Histogram of problems by per-problem accuracy (fraction of models answering correctly).
Histogram of problems by the minimum Elo needed to solve each problem.
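Either histogram can be rebuilt by bucketing the per-problem statistic into fixed-width bins. A minimal sketch over hypothetical minimum-Elo values (bin width chosen arbitrarily for illustration):

```python
from collections import Counter

# Hypothetical per-problem minimum-Elo values.
min_elos = [1305.0, 1266.7, 1250.9, 917.2, 883.5, 860.9, 790.4]

BIN_WIDTH = 100  # Elo points per histogram bucket
bins = Counter((int(e) // BIN_WIDTH) * BIN_WIDTH for e in min_elos)

# Print a simple text histogram, one row per non-empty bucket.
for lo in sorted(bins):
    print(f"[{lo}, {lo + BIN_WIDTH}): {'#' * bins[lo]}")
```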