mmlu: by examples

Not solved by any model

There are 84 examples that no model solves. If these are well-posed problems, solving some of them is a good signal that your model is genuinely better than the leading models.
mmlu/10112, mmlu/10177, mmlu/10356, mmlu/10370, mmlu/1047, mmlu/10549, mmlu/10576, mmlu/10582, mmlu/10621, mmlu/10752, mmlu/10802, mmlu/11046, mmlu/11059, mmlu/11064, mmlu/11109, mmlu/11302, mmlu/11632, mmlu/11725, mmlu/11789, mmlu/11829, mmlu/11880, mmlu/12161, mmlu/1269, mmlu/12698, mmlu/12820, mmlu/1298, mmlu/13190, mmlu/13218, mmlu/13722, mmlu/13734, mmlu/13740, mmlu/13747, mmlu/13751, mmlu/13767, mmlu/13768, mmlu/13788, mmlu/13795, mmlu/1380, mmlu/13825, mmlu/1416, mmlu/1457, mmlu/1481, mmlu/1676, mmlu/168, mmlu/1720, mmlu/1748, mmlu/1931, mmlu/2025, mmlu/2131, mmlu/2392, mmlu/2440, mmlu/2951, mmlu/3072, mmlu/3077, mmlu/3295, mmlu/3356, mmlu/3405, mmlu/3507, mmlu/419, mmlu/4222, mmlu/4329, mmlu/4377, mmlu/4393, mmlu/4448, mmlu/479, mmlu/516, mmlu/552, mmlu/6156, mmlu/6514, mmlu/6864, mmlu/6908, mmlu/7467, mmlu/8177, mmlu/8195, mmlu/8235, mmlu/8280, mmlu/8388, mmlu/8445, mmlu/8815, mmlu/9057, mmlu/9080, mmlu/9607, mmlu/9801, mmlu/9917
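
As a rough illustration of how such a list could be derived, the sketch below assumes a hypothetical `results` mapping from model name to per-example pass/fail flags. The data layout, the function name, and the toy values are assumptions for illustration, not the page's actual pipeline.

```python
def unsolved_examples(results):
    """Return the example ids that no model solves.

    results: hypothetical mapping of model name -> {example id -> True if solved}.
    """
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    return sorted(
        ex for ex in example_ids
        if not any(per_example.get(ex, False) for per_example in results.values())
    )


# Toy data, made up for the example.
toy_results = {
    "model_a": {"mmlu/1": True, "mmlu/2": False, "mmlu/3": False},
    "model_b": {"mmlu/1": False, "mmlu/2": False, "mmlu/3": True},
}
print(unsolved_examples(toy_results))  # ['mmlu/2']
```

An example missing from a model's results is treated as unsolved here; a real pipeline might prefer to raise an error instead.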

Problems solved by 1 model only

example_link model min_elo
mmlu/2211 Qwen1.5-110B 1305.025
mmlu/2416 Qwen1.5-110B 1305.025
mmlu/2255 Qwen1.5-110B 1305.025
mmlu/10778 Qwen1.5-110B 1305.025
mmlu/2092 Qwen1.5-110B 1305.025
mmlu/10402 Qwen1.5-110B 1305.025
mmlu/1390 Qwen1.5-110B 1305.025
mmlu/2250 Qwen1.5-110B 1305.025
mmlu/12903 Qwen1.5-110B 1305.025
mmlu/10411 Meta-Llama-3-70B 1266.741
mmlu/1033 Meta-Llama-3-70B 1266.741
mmlu/4842 Meta-Llama-3-70B 1266.741
mmlu/3990 Meta-Llama-3-70B 1266.741
mmlu/90 Meta-Llama-3-70B 1266.741
mmlu/1738 Meta-Llama-3-70B 1266.741
mmlu/3008 Meta-Llama-3-70B 1266.741
mmlu/2464 Mixtral-8x22B-v0.1 1250.972
mmlu/4315 Mixtral-8x22B-v0.1 1250.972
mmlu/8175 Mixtral-8x22B-v0.1 1250.972
mmlu/1403 Mixtral-8x22B-v0.1 1250.972
mmlu/6273 Mixtral-8x22B-v0.1 1250.972
mmlu/12040 Mixtral-8x22B-v0.1 1250.972
mmlu/2140 Mixtral-8x22B-v0.1 1250.972
mmlu/2508 Mixtral-8x22B-v0.1 1250.972
mmlu/8503 Mixtral-8x22B-v0.1 1250.972
mmlu/1398 Mixtral-8x22B-v0.1 1250.972
mmlu/5876 Qwen1.5-72B 1248.779
mmlu/6935 Qwen1.5-72B 1248.779
mmlu/11492 Qwen1.5-72B 1248.779
mmlu/3214 Qwen1.5-72B 1248.779
mmlu/10705 Qwen1.5-72B 1248.779
mmlu/1194 dbrx-base 1204.504
mmlu/8862 dbrx-base 1204.504
mmlu/12654 dbrx-base 1204.504
mmlu/965 dbrx-base 1204.504
mmlu/2579 dbrx-base 1204.504
mmlu/1447 dbrx-base 1204.504
mmlu/10517 dbrx-base 1204.504
mmlu/4647 dbrx-base 1204.504
mmlu/1404 Qwen1.5-32B 1198.264
mmlu/10939 Qwen1.5-32B 1198.264
mmlu/4276 Qwen1.5-32B 1198.264
mmlu/1260 Qwen1.5-32B 1198.264
mmlu/2160 Qwen1.5-32B 1198.264
mmlu/3723 deepseek-llm-67b-base 1174.936
mmlu/6148 deepseek-llm-67b-base 1174.936
mmlu/3997 deepseek-llm-67b-base 1174.936
mmlu/12674 Mixtral-8x7B-v0.1 1159.020
mmlu/2734 Mixtral-8x7B-v0.1 1159.020
mmlu/1431 Qwen1.5-14B 1131.033
mmlu/9161 Meta-Llama-3-8B 1103.817
mmlu/2668 llama2_70B 1079.359
mmlu/12497 gemma-7b 1076.537
mmlu/4342 Mistral-7B-v0.1 1074.545
mmlu/3140 Mistral-7B-v0.1 1074.545
mmlu/3037 llama_65B 1071.934
mmlu/10736 llama_65B 1071.934
mmlu/4755 llama_65B 1071.934
mmlu/2475 llama_33B 1020.632
mmlu/11254 Qwen1.5-4B 1002.201
mmlu/3440 falcon-40b 1000.936
mmlu/923 falcon-40b 1000.936
mmlu/978 falcon-40b 1000.936
mmlu/4864 falcon-40b 1000.936
mmlu/940 falcon-40b 1000.936
mmlu/3299 falcon-40b 1000.936
mmlu/72 falcon-40b 1000.936
mmlu/10583 falcon-40b 1000.936
mmlu/1418 falcon-40b 1000.936
mmlu/1493 deepseek-llm-7b-base 938.317
mmlu/6835 deepseek-llm-7b-base 938.317
mmlu/1097 deepseek-llm-7b-base 938.317
mmlu/6914 deepseek-llm-7b-base 938.317
mmlu/2319 deepseek-llm-7b-base 938.317
mmlu/10355 deepseek-llm-7b-base 938.317
mmlu/7889 llama2_07B 929.025
mmlu/12213 Qwen1.5-1.8B 917.249
mmlu/13785 Qwen1.5-1.8B 917.249
mmlu/11664 llama_13B 915.612
mmlu/914 deepseek-moe-16b-base 912.847
mmlu/2597 deepseek-moe-16b-base 912.847
mmlu/13253 deepseek-moe-16b-base 912.847
mmlu/4122 stablelm-base-alpha-7b-v2 906.386
mmlu/7 stablelm-base-alpha-7b-v2 906.386
mmlu/3474 stablelm-base-alpha-7b-v2 906.386
mmlu/2129 stablelm-base-alpha-7b-v2 906.386
mmlu/4863 stablelm-3b-4e1t 906.237
mmlu/225 stablelm-3b-4e1t 906.237
mmlu/11510 gemma-2b 883.543
mmlu/13842 gemma-2b 883.543
mmlu/13044 gemma-2b 883.543
mmlu/13841 gemma-2b 883.543
mmlu/1841 gemma-2b 883.543
mmlu/6651 gemma-2b 883.543
mmlu/10699 gemma-2b 883.543
mmlu/2491 Qwen1.5-0.5B 860.942
mmlu/11661 Qwen1.5-0.5B 860.942
mmlu/4484 Qwen1.5-0.5B 860.942
mmlu/4462 Qwen1.5-0.5B 860.942
mmlu/11057 Qwen1.5-0.5B 860.942
mmlu/4405 llama_07B 842.554
mmlu/12469 llama_07B 842.554
mmlu/982 llama_07B 842.554
mmlu/13739 llama_07B 842.554
mmlu/8379 llama_07B 842.554
mmlu/4559 llama_07B 842.554
mmlu/937 llama_07B 842.554
mmlu/6755 llama_07B 842.554
mmlu/10368 falcon-7b 790.445
mmlu/165 falcon-7b 790.445
mmlu/6221 falcon-7b 790.445
mmlu/5654 falcon-7b 790.445
mmlu/10523 falcon-7b 790.445
mmlu/1326 falcon-7b 790.445
mmlu/3916 pythia-2.8b-deduped 787.422
mmlu/11520 pythia-2.8b-deduped 787.422
mmlu/988 pythia-2.8b-deduped 787.422
mmlu/8314 pythia-2.8b-deduped 787.422
mmlu/2647 pythia-2.8b-deduped 787.422
mmlu/11540 pythia-2.8b-deduped 787.422
mmlu/1003 pythia-2.8b-deduped 787.422
mmlu/4761 pythia-2.8b-deduped 787.422
mmlu/155 pythia-2.8b-deduped 787.422
mmlu/4779 pythia-2.8b-deduped 787.422
mmlu/4106 pythia-2.8b-deduped 787.422
mmlu/4176 pythia-2.8b-deduped 787.422
mmlu/4829 pythia-2.8b-deduped 787.422
mmlu/8490 pythia-2.8b-deduped 787.422
mmlu/207 pythia-2.8b-deduped 787.422
mmlu/8218 pythia-2.8b-deduped 787.422
mmlu/1979 pythia-1b-deduped 772.264
mmlu/11089 pythia-1b-deduped 772.264
mmlu/4816 pythia-1b-deduped 772.264
mmlu/133 pythia-1b-deduped 772.264
mmlu/5420 pythia-1b-deduped 772.264
mmlu/5839 pythia-1b-deduped 772.264
mmlu/11054 pythia-12b-deduped-v0 771.460
mmlu/6101 pythia-12b-deduped-v0 771.460
mmlu/8501 pythia-12b-deduped-v0 771.460
mmlu/10668 pythia-12b-deduped-v0 771.460
mmlu/8465 pythia-6.9b-deduped-v0 768.682
mmlu/9819 pythia-6.9b-deduped-v0 768.682
mmlu/8342 pythia-6.9b-deduped-v0 768.682
mmlu/2961 pythia-6.9b-deduped-v0 768.682
mmlu/203 pythia-6.9b-deduped-v0 768.682
mmlu/1876 pythia-6.9b-deduped-v0 768.682
mmlu/9810 pythia-6.9b-deduped-v0 768.682
mmlu/10665 pythia-1.4b-deduped-v0 761.364
mmlu/11886 pythia-1.4b-deduped-v0 761.364
mmlu/11741 pythia-1.4b-deduped-v0 761.364
mmlu/12092 pythia-1.4b-deduped-v0 761.364
mmlu/13833 pythia-1.4b-deduped-v0 761.364
mmlu/13474 pythia-1.4b-deduped-v0 761.364
mmlu/12500 pythia-1.4b-deduped-v0 761.364
mmlu/1877 pythia-1.4b-deduped-v0 761.364
mmlu/9926 pythia-1.4b-deduped-v0 761.364
mmlu/11742 pythia-1.4b-deduped-v0 761.364
mmlu/11410 pythia-1.4b-deduped-v0 761.364
mmlu/10993 pythia-1.4b-deduped-v0 761.364
mmlu/10980 pythia-1.4b-deduped-v0 761.364
mmlu/10911 pythia-1.4b-deduped-v0 761.364
mmlu/11810 pythia-1.4b-deduped-v0 761.364
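
A table of this shape could be produced along the following lines. This is a minimal sketch that assumes a hypothetical `results` mapping (model name to per-example pass/fail flags) and a hypothetical `elo` mapping (model name to rating); for a problem with a single solver, `min_elo` is taken to be that solver's rating, which matches how the column repeats per model above.

```python
def solved_by_one_model(results, elo):
    """Return (example id, model, min_elo) rows for examples with exactly one solver.

    results: hypothetical mapping of model name -> {example id -> True if solved}.
    elo: hypothetical mapping of model name -> Elo rating.
    """
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)

    rows = []
    for ex in sorted(example_ids):
        solvers = [m for m, per_example in results.items() if per_example.get(ex, False)]
        if len(solvers) == 1:
            rows.append((ex, solvers[0], elo[solvers[0]]))

    # Strongest sole solver first, as in the table above.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Toy data, made up for the example.
toy_elo = {"strong": 1300.0, "weak": 800.0}
toy_results = {
    "strong": {"e1": True, "e2": True, "e3": False},
    "weak": {"e1": True, "e2": False, "e3": True},
}
for row in solved_by_one_model(toy_results, toy_elo):
    print(*row)
```

Rows credited to low-rated models are the interesting ones: a problem that only a weak model gets right is more likely a lucky guess or a mislabeled answer than a sign of capability.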

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e., better models tend to do worse on them). Here acc is the accuracy on the problem and tau is the correlation.

example_link acc tau
mmlu/2534 0.194 -0.551
mmlu/805 0.306 -0.550
mmlu/24 0.333 -0.526
mmlu/1461 0.167 -0.505
mmlu/4699 0.194 -0.501
mmlu/329 0.194 -0.495
mmlu/6255 0.250 -0.493
mmlu/8429 0.139 -0.490
mmlu/1467 0.222 -0.484
mmlu/5196 0.278 -0.484
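
The page does not say how tau is computed. The name suggests Kendall's tau, so the sketch below assumes Kendall's tau-b between a model-strength score (for example an Elo rating) and per-model correctness on a single problem. The function name and the toy numbers are illustrative assumptions.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences, by direct pair counting."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither factor
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data: four models ordered by rating; only the two weakest solve the problem.
ratings = [900.0, 1000.0, 1100.0, 1200.0]
correct = [1, 1, 0, 0]
print(round(kendall_tau_b(ratings, correct), 3))  # -0.816
```

A strongly negative value like this means correctness runs against model strength, which is the pattern that marks a problem as suspect.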

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
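
The counts behind such a histogram could be computed as sketched below, assuming per-problem accuracy means the fraction of models that solve the problem. The `results` layout, the function name, and the bin count are hypothetical.

```python
from collections import Counter


def accuracy_histogram(results, num_bins=10):
    """Count problems per accuracy bin; bin i covers [i/num_bins, (i+1)/num_bins).

    results: hypothetical mapping of model name -> {example id -> True if solved}.
    A problem solved by every model (accuracy 1.0) is placed in the last bin.
    """
    models = list(results)
    example_ids = set().union(*(results[m] for m in models))
    counts = Counter()
    for ex in example_ids:
        solved = sum(bool(results[m].get(ex, False)) for m in models)
        # Integer arithmetic avoids floating-point surprises at bin edges.
        bin_index = min(solved * num_bins // len(models), num_bins - 1)
        counts[bin_index] += 1
    return [counts[i] for i in range(num_bins)]


# Toy data: "a" is solved by 0 of 4 models, "b" by 4 of 4, "c" by 2 of 4.
toy_results = {
    "m1": {"a": False, "b": True, "c": True},
    "m2": {"a": False, "b": True, "c": True},
    "m3": {"a": False, "b": True, "c": False},
    "m4": {"a": False, "b": True, "c": False},
}
print(accuracy_histogram(toy_results))  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 1]
```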

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.
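
The difficulty measure described here can be sketched as the lowest rating among the models that solve a problem, then grouped into fixed-width bins. The `results` and `elo` mappings, the function names, and the bin width of 100 are assumptions for illustration; the two ratings in the toy data are taken from the table above, while the pass/fail flags are made up.

```python
from collections import Counter


def min_elo_to_solve(results, elo):
    """Map each solved example id to the lowest Elo among models that solve it.

    Examples that no model solves are omitted, since they have no minimum.
    """
    min_elo = {}
    for model, per_example in results.items():
        for ex, solved in per_example.items():
            if solved:
                min_elo[ex] = min(min_elo.get(ex, float("inf")), elo[model])
    return min_elo


def bin_by_width(values, width=100):
    """Count values per bin, keyed by the lower edge of each bin."""
    counts = Counter(int(v // width) * width for v in values)
    return dict(sorted(counts.items()))


toy_elo = {"Qwen1.5-110B": 1305.025, "falcon-7b": 790.445}
toy_results = {
    "Qwen1.5-110B": {"x": True, "y": True},
    "falcon-7b": {"x": True, "y": False, "z": False},
}
difficulty = min_elo_to_solve(toy_results, toy_elo)
print(difficulty)
print(bin_by_width(difficulty.values()))
```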