There are 84 examples not solved by any model.
If these are well-posed problems, solving some of them is a strong signal that your model genuinely outperforms the leading models.
mmlu/10112, mmlu/10177, mmlu/10356, mmlu/10370, mmlu/1047, mmlu/10549, mmlu/10576, mmlu/10582, mmlu/10621, mmlu/10752, mmlu/10802, mmlu/11046, mmlu/11059, mmlu/11064, mmlu/11109, mmlu/11302, mmlu/11632, mmlu/11725, mmlu/11789, mmlu/11829, mmlu/11880, mmlu/12161, mmlu/1269, mmlu/12698, mmlu/12820, mmlu/1298, mmlu/13190, mmlu/13218, mmlu/13722, mmlu/13734, mmlu/13740, mmlu/13747, mmlu/13751, mmlu/13767, mmlu/13768, mmlu/13788, mmlu/13795, mmlu/1380, mmlu/13825, mmlu/1416, mmlu/1457, mmlu/1481, mmlu/1676, mmlu/168, mmlu/1720, mmlu/1748, mmlu/1931, mmlu/2025, mmlu/2131, mmlu/2392, mmlu/2440, mmlu/2951, mmlu/3072, mmlu/3077, mmlu/3295, mmlu/3356, mmlu/3405, mmlu/3507, mmlu/419, mmlu/4222, mmlu/4329, mmlu/4377, mmlu/4393, mmlu/4448, mmlu/479, mmlu/516, mmlu/552, mmlu/6156, mmlu/6514, mmlu/6864, mmlu/6908, mmlu/7467, mmlu/8177, mmlu/8195, mmlu/8235, mmlu/8280, mmlu/8388, mmlu/8445, mmlu/8815, mmlu/9057, mmlu/9080, mmlu/9607, mmlu/9801, mmlu/9917
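A minimal sketch of how one might check whether a new model cracks any of these never-solved examples (the set below is a small subset of the full list above, and `newly_cracked` is a hypothetical helper name):

```python
# Subset of the 84 never-solved example IDs listed above.
NEVER_SOLVED = {"mmlu/10112", "mmlu/10177", "mmlu/10356", "mmlu/10370"}

def newly_cracked(solved_by_new_model: set[str]) -> set[str]:
    """Return the never-solved examples that the new model got right."""
    return NEVER_SOLVED & solved_by_new_model

# Example: a model that solves mmlu/10112 plus an unrelated example.
print(newly_cracked({"mmlu/10112", "mmlu/9999"}))
```

A non-empty result means the new model solved at least one example that no reference model did.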
The table below lists, for each example, the lowest-rated model that solved it; `min_elo` is that model's Elo rating. Examples solvable only by the strongest models appear first.

| example_link | model | min_elo |
|---|---|---|
| mmlu/10778 | Qwen1.5-110B | 1104.205 |
| mmlu/2255 | Qwen1.5-110B | 1104.205 |
| mmlu/2092 | Qwen1.5-110B | 1104.205 |
| mmlu/2250 | Qwen1.5-110B | 1104.205 |
| mmlu/12903 | Qwen1.5-110B | 1104.205 |
| mmlu/2416 | Qwen1.5-110B | 1104.205 |
| mmlu/2211 | Qwen1.5-110B | 1104.205 |
| mmlu/1390 | Qwen1.5-110B | 1104.205 |
| mmlu/10402 | Qwen1.5-110B | 1104.205 |
| mmlu/3008 | Meta-Llama-3-70B | 1094.830 |
| mmlu/1738 | Meta-Llama-3-70B | 1094.830 |
| mmlu/3990 | Meta-Llama-3-70B | 1094.830 |
| mmlu/4842 | Meta-Llama-3-70B | 1094.830 |
| mmlu/90 | Meta-Llama-3-70B | 1094.830 |
| mmlu/10411 | Meta-Llama-3-70B | 1094.830 |
| mmlu/1033 | Meta-Llama-3-70B | 1094.830 |
| mmlu/6273 | Mixtral-8x22B-v0.1 | 1090.909 |
| mmlu/1398 | Mixtral-8x22B-v0.1 | 1090.909 |
| mmlu/2140 | Mixtral-8x22B-v0.1 | 1090.909 |
| mmlu/12040 | Mixtral-8x22B-v0.1 | 1090.909 |
| mmlu/4315 | Mixtral-8x22B-v0.1 | 1090.909 |
| mmlu/2508 | Mixtral-8x22B-v0.1 | 1090.909 |
| mmlu/1403 | Mixtral-8x22B-v0.1 | 1090.909 |
| mmlu/8503 | Mixtral-8x22B-v0.1 | 1090.909 |
| mmlu/2464 | Mixtral-8x22B-v0.1 | 1090.909 |
| mmlu/8175 | Mixtral-8x22B-v0.1 | 1090.909 |
| mmlu/11492 | Qwen1.5-72B | 1089.172 |
| mmlu/6935 | Qwen1.5-72B | 1089.172 |
| mmlu/5876 | Qwen1.5-72B | 1089.172 |
| mmlu/3214 | Qwen1.5-72B | 1089.172 |
| mmlu/10705 | Qwen1.5-72B | 1089.172 |
| mmlu/12654 | dbrx-base | 1078.166 |
| mmlu/2579 | dbrx-base | 1078.166 |
| mmlu/4647 | dbrx-base | 1078.166 |
| mmlu/1194 | dbrx-base | 1078.166 |
| mmlu/1447 | dbrx-base | 1078.166 |
| mmlu/8862 | dbrx-base | 1078.166 |
| mmlu/965 | dbrx-base | 1078.166 |
| mmlu/10517 | dbrx-base | 1078.166 |
| mmlu/10939 | Qwen1.5-32B | 1075.657 |
| mmlu/1260 | Qwen1.5-32B | 1075.657 |
| mmlu/1404 | Qwen1.5-32B | 1075.657 |
| mmlu/4276 | Qwen1.5-32B | 1075.657 |
| mmlu/2160 | Qwen1.5-32B | 1075.657 |
| mmlu/3723 | deepseek-llm-67b-base | 1067.350 |
| mmlu/6148 | deepseek-llm-67b-base | 1067.350 |
| mmlu/3997 | deepseek-llm-67b-base | 1067.350 |
| mmlu/2734 | Mixtral-8x7B-v0.1 | 1063.500 |
| mmlu/12674 | Mixtral-8x7B-v0.1 | 1063.500 |
| mmlu/1431 | Qwen1.5-14B | 1054.246 |
| mmlu/9161 | Meta-Llama-3-8B | 1045.086 |
| mmlu/2668 | llama2_70B | 1037.246 |
| mmlu/12497 | gemma-7b | 1035.388 |
| mmlu/4342 | Mistral-7B-v0.1 | 1034.899 |
| mmlu/3140 | Mistral-7B-v0.1 | 1034.899 |
| mmlu/10736 | llama_65B | 1033.919 |
| mmlu/4755 | llama_65B | 1033.919 |
| mmlu/3037 | llama_65B | 1033.919 |
| mmlu/2475 | llama_33B | 1015.181 |
| mmlu/4864 | falcon-40b | 1009.477 |
| mmlu/978 | falcon-40b | 1009.477 |
| mmlu/72 | falcon-40b | 1009.477 |
| mmlu/10583 | falcon-40b | 1009.477 |
| mmlu/1418 | falcon-40b | 1009.477 |
| mmlu/3299 | falcon-40b | 1009.477 |
| mmlu/3440 | falcon-40b | 1009.477 |
| mmlu/940 | falcon-40b | 1009.477 |
| mmlu/923 | falcon-40b | 1009.477 |
| mmlu/11254 | Qwen1.5-4B | 1008.429 |
| mmlu/6914 | deepseek-llm-7b-base | 983.016 |
| mmlu/2319 | deepseek-llm-7b-base | 983.016 |
| mmlu/1493 | deepseek-llm-7b-base | 983.016 |
| mmlu/1097 | deepseek-llm-7b-base | 983.016 |
| mmlu/6835 | deepseek-llm-7b-base | 983.016 |
| mmlu/10355 | deepseek-llm-7b-base | 983.016 |
| mmlu/7889 | llama2_07B | 980.275 |
| mmlu/12213 | Qwen1.5-1.8B | 973.988 |
| mmlu/13785 | Qwen1.5-1.8B | 973.988 |
| mmlu/11664 | llama_13B | 973.963 |
| mmlu/914 | deepseek-moe-16b-base | 971.650 |
| mmlu/13253 | deepseek-moe-16b-base | 971.650 |
| mmlu/2597 | deepseek-moe-16b-base | 971.650 |
| mmlu/225 | stablelm-3b-4e1t | 969.875 |
| mmlu/4863 | stablelm-3b-4e1t | 969.875 |
| mmlu/4122 | stablelm-base-alpha-7b-v2 | 969.617 |
| mmlu/3474 | stablelm-base-alpha-7b-v2 | 969.617 |
| mmlu/2129 | stablelm-base-alpha-7b-v2 | 969.617 |
| mmlu/7 | stablelm-base-alpha-7b-v2 | 969.617 |
| mmlu/13044 | gemma-2b | 957.509 |
| mmlu/6651 | gemma-2b | 957.509 |
| mmlu/1841 | gemma-2b | 957.509 |
| mmlu/13842 | gemma-2b | 957.509 |
| mmlu/10699 | gemma-2b | 957.509 |
| mmlu/13841 | gemma-2b | 957.509 |
| mmlu/11510 | gemma-2b | 957.509 |
| mmlu/11057 | Qwen1.5-0.5B | 947.916 |
| mmlu/4462 | Qwen1.5-0.5B | 947.916 |
| mmlu/2491 | Qwen1.5-0.5B | 947.916 |
| mmlu/4484 | Qwen1.5-0.5B | 947.916 |
| mmlu/11661 | Qwen1.5-0.5B | 947.916 |
| mmlu/13739 | llama_07B | 935.643 |
| mmlu/6755 | llama_07B | 935.643 |
| mmlu/982 | llama_07B | 935.643 |
| mmlu/4559 | llama_07B | 935.643 |
| mmlu/12469 | llama_07B | 935.643 |
| mmlu/4405 | llama_07B | 935.643 |
| mmlu/8379 | llama_07B | 935.643 |
| mmlu/937 | llama_07B | 935.643 |
| mmlu/6221 | falcon-7b | 906.104 |
| mmlu/1326 | falcon-7b | 906.104 |
| mmlu/10368 | falcon-7b | 906.104 |
| mmlu/10523 | falcon-7b | 906.104 |
| mmlu/165 | falcon-7b | 906.104 |
| mmlu/5654 | falcon-7b | 906.104 |
| mmlu/8490 | pythia-2.8b-deduped | 903.094 |
| mmlu/4779 | pythia-2.8b-deduped | 903.094 |
| mmlu/11520 | pythia-2.8b-deduped | 903.094 |
| mmlu/11540 | pythia-2.8b-deduped | 903.094 |
| mmlu/4829 | pythia-2.8b-deduped | 903.094 |
| mmlu/1003 | pythia-2.8b-deduped | 903.094 |
| mmlu/207 | pythia-2.8b-deduped | 903.094 |
| mmlu/4106 | pythia-2.8b-deduped | 903.094 |
| mmlu/4761 | pythia-2.8b-deduped | 903.094 |
| mmlu/8218 | pythia-2.8b-deduped | 903.094 |
| mmlu/3916 | pythia-2.8b-deduped | 903.094 |
| mmlu/2647 | pythia-2.8b-deduped | 903.094 |
| mmlu/4176 | pythia-2.8b-deduped | 903.094 |
| mmlu/155 | pythia-2.8b-deduped | 903.094 |
| mmlu/8314 | pythia-2.8b-deduped | 903.094 |
| mmlu/988 | pythia-2.8b-deduped | 903.094 |
| mmlu/11054 | pythia-12b-deduped-v0 | 896.287 |
| mmlu/8501 | pythia-12b-deduped-v0 | 896.287 |
| mmlu/10668 | pythia-12b-deduped-v0 | 896.287 |
| mmlu/6101 | pythia-12b-deduped-v0 | 896.287 |
| mmlu/9810 | pythia-6.9b-deduped-v0 | 896.260 |
| mmlu/2961 | pythia-6.9b-deduped-v0 | 896.260 |
| mmlu/203 | pythia-6.9b-deduped-v0 | 896.260 |
| mmlu/8465 | pythia-6.9b-deduped-v0 | 896.260 |
| mmlu/8342 | pythia-6.9b-deduped-v0 | 896.260 |
| mmlu/1876 | pythia-6.9b-deduped-v0 | 896.260 |
| mmlu/9819 | pythia-6.9b-deduped-v0 | 896.260 |
| mmlu/1979 | pythia-1b-deduped | 896.176 |
| mmlu/133 | pythia-1b-deduped | 896.176 |
| mmlu/5839 | pythia-1b-deduped | 896.176 |
| mmlu/5420 | pythia-1b-deduped | 896.176 |
| mmlu/11089 | pythia-1b-deduped | 896.176 |
| mmlu/4816 | pythia-1b-deduped | 896.176 |
| mmlu/11410 | pythia-1.4b-deduped-v0 | 890.863 |
| mmlu/11742 | pythia-1.4b-deduped-v0 | 890.863 |
| mmlu/10665 | pythia-1.4b-deduped-v0 | 890.863 |
| mmlu/12092 | pythia-1.4b-deduped-v0 | 890.863 |
| mmlu/13474 | pythia-1.4b-deduped-v0 | 890.863 |
| mmlu/11810 | pythia-1.4b-deduped-v0 | 890.863 |
| mmlu/11886 | pythia-1.4b-deduped-v0 | 890.863 |
| mmlu/9926 | pythia-1.4b-deduped-v0 | 890.863 |
| mmlu/1877 | pythia-1.4b-deduped-v0 | 890.863 |
| mmlu/10993 | pythia-1.4b-deduped-v0 | 890.863 |
| mmlu/10911 | pythia-1.4b-deduped-v0 | 890.863 |
| mmlu/11741 | pythia-1.4b-deduped-v0 | 890.863 |
| mmlu/13833 | pythia-1.4b-deduped-v0 | 890.863 |
| mmlu/10980 | pythia-1.4b-deduped-v0 | 890.863 |
| mmlu/12500 | pythia-1.4b-deduped-v0 | 890.863 |
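The `min_elo` column above can be derived from per-model results. A sketch under assumed data shapes (the `ELO` and `SOLVED_BY` dictionaries and the `min_elo_solver` helper are illustrative, with values taken from the table):

```python
# Elo ratings per model (sample values from the table above).
ELO = {"Qwen1.5-110B": 1104.205, "falcon-40b": 1009.477, "gemma-2b": 957.509}

# For each example, the set of models that answered it correctly.
SOLVED_BY = {
    "mmlu/10778": {"Qwen1.5-110B"},
    "mmlu/4864": {"Qwen1.5-110B", "falcon-40b"},
}

def min_elo_solver(example: str) -> tuple[str, float]:
    """Return the weakest (lowest-Elo) model that solved the example."""
    model = min(SOLVED_BY[example], key=lambda m: ELO[m])
    return model, ELO[model]
```

For example, `min_elo_solver("mmlu/4864")` returns `falcon-40b` with its rating, since it is the weakest model in that example's solver set.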
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
| example_link | acc | tau |
|---|---|---|
| mmlu/2534 | 0.194 | -0.551 |
| mmlu/805 | 0.306 | -0.550 |
| mmlu/24 | 0.333 | -0.526 |
| mmlu/1461 | 0.167 | -0.505 |
| mmlu/4699 | 0.194 | -0.501 |
| mmlu/329 | 0.194 | -0.495 |
| mmlu/6255 | 0.250 | -0.493 |
| mmlu/8429 | 0.139 | -0.490 |
| mmlu/1467 | 0.222 | -0.484 |
| mmlu/5196 | 0.278 | -0.484 |
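A per-problem `tau` like the one above can be computed as a Kendall rank correlation between model strength (Elo) and per-model correctness on the problem. A minimal sketch using tau-a (the report may use a tie-corrected variant such as tau-b, so exact values can differ):

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall tau-a between two equal-length sequences.

    Pairs tied in either sequence count as neither concordant
    nor discordant.
    """
    n = len(xs)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Four models by Elo, with 0/1 correctness on one problem:
# the two weaker models solve it, the two stronger ones do not,
# so the correlation is negative.
elos = [900, 950, 1000, 1100]
correct = [1, 1, 0, 0]
print(kendall_tau(elos, correct))
```

A strongly negative tau flags a problem where stronger models systematically do worse, which often indicates a mislabeled or ambiguous question.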
Histogram of problems by per-problem accuracy.
Histogram of problems by the minimum Elo needed to solve each problem.
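The Elo histogram can be built by binning the `min_elo` values. A sketch assuming a fixed bin width (50 here, an arbitrary illustrative choice) and a few sample values from the table:

```python
from collections import Counter

# Sample min_elo values from the table above.
min_elos = [1104.205, 1094.830, 903.094, 896.287]

# Map each value to the lower edge of its 50-point bin and count.
hist = Counter(int(e // 50) * 50 for e in min_elos)
print(hist)
```

Each key is a bin's lower edge and each count is the number of problems whose minimum solving Elo falls in that bin.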