mmlu: by examples

Not solved by any model

There are 84 examples that no model solved. If these are sound problems, solving some of them is a good signal that your model really is better than the leading models.
mmlu/10112, mmlu/10177, mmlu/10356, mmlu/10370, mmlu/1047, mmlu/10549, mmlu/10576, mmlu/10582, mmlu/10621, mmlu/10752, mmlu/10802, mmlu/11046, mmlu/11059, mmlu/11064, mmlu/11109, mmlu/11302, mmlu/11632, mmlu/11725, mmlu/11789, mmlu/11829, mmlu/11880, mmlu/12161, mmlu/1269, mmlu/12698, mmlu/12820, mmlu/1298, mmlu/13190, mmlu/13218, mmlu/13722, mmlu/13734, mmlu/13740, mmlu/13747, mmlu/13751, mmlu/13767, mmlu/13768, mmlu/13788, mmlu/13795, mmlu/1380, mmlu/13825, mmlu/1416, mmlu/1457, mmlu/1481, mmlu/1676, mmlu/168, mmlu/1720, mmlu/1748, mmlu/1931, mmlu/2025, mmlu/2131, mmlu/2392, mmlu/2440, mmlu/2951, mmlu/3072, mmlu/3077, mmlu/3295, mmlu/3356, mmlu/3405, mmlu/3507, mmlu/419, mmlu/4222, mmlu/4329, mmlu/4377, mmlu/4393, mmlu/4448, mmlu/479, mmlu/516, mmlu/552, mmlu/6156, mmlu/6514, mmlu/6864, mmlu/6908, mmlu/7467, mmlu/8177, mmlu/8195, mmlu/8235, mmlu/8280, mmlu/8388, mmlu/8445, mmlu/8815, mmlu/9057, mmlu/9080, mmlu/9607, mmlu/9801, mmlu/9917
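
A list like the one above can be derived from per-model correctness records. The following is a minimal sketch, assuming the results are available as a mapping from model name to per-example correctness; the `results` layout, the function name `unsolved_examples`, and the toy data are all hypothetical and not taken from the page's actual code.

```python
from typing import Dict, List


def unsolved_examples(results: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return the example ids that no model answered correctly.

    `results` maps model name -> {example id -> answered correctly?}.
    This layout is an assumption made for illustration.
    """
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    return sorted(
        ex
        for ex in example_ids
        if not any(per_example.get(ex, False) for per_example in results.values())
    )


# Toy data (hypothetical, not from the real evaluation).
toy_results = {
    "model_a": {"mmlu/1": True, "mmlu/2": False, "mmlu/3": False},
    "model_b": {"mmlu/1": False, "mmlu/2": False, "mmlu/3": True},
}
print(unsolved_examples(toy_results))
```

An example missing from a model's record is treated here as unsolved by that model, which is one possible convention among several.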

Problems solved by only one model

example_link model min_elo
mmlu/10778 Qwen1.5-110B 1104.205
mmlu/2255 Qwen1.5-110B 1104.205
mmlu/2092 Qwen1.5-110B 1104.205
mmlu/2250 Qwen1.5-110B 1104.205
mmlu/12903 Qwen1.5-110B 1104.205
mmlu/2416 Qwen1.5-110B 1104.205
mmlu/2211 Qwen1.5-110B 1104.205
mmlu/1390 Qwen1.5-110B 1104.205
mmlu/10402 Qwen1.5-110B 1104.205
mmlu/3008 Meta-Llama-3-70B 1094.830
mmlu/1738 Meta-Llama-3-70B 1094.830
mmlu/3990 Meta-Llama-3-70B 1094.830
mmlu/4842 Meta-Llama-3-70B 1094.830
mmlu/90 Meta-Llama-3-70B 1094.830
mmlu/10411 Meta-Llama-3-70B 1094.830
mmlu/1033 Meta-Llama-3-70B 1094.830
mmlu/6273 Mixtral-8x22B-v0.1 1090.909
mmlu/1398 Mixtral-8x22B-v0.1 1090.909
mmlu/2140 Mixtral-8x22B-v0.1 1090.909
mmlu/12040 Mixtral-8x22B-v0.1 1090.909
mmlu/4315 Mixtral-8x22B-v0.1 1090.909
mmlu/2508 Mixtral-8x22B-v0.1 1090.909
mmlu/1403 Mixtral-8x22B-v0.1 1090.909
mmlu/8503 Mixtral-8x22B-v0.1 1090.909
mmlu/2464 Mixtral-8x22B-v0.1 1090.909
mmlu/8175 Mixtral-8x22B-v0.1 1090.909
mmlu/11492 Qwen1.5-72B 1089.172
mmlu/6935 Qwen1.5-72B 1089.172
mmlu/5876 Qwen1.5-72B 1089.172
mmlu/3214 Qwen1.5-72B 1089.172
mmlu/10705 Qwen1.5-72B 1089.172
mmlu/12654 dbrx-base 1078.166
mmlu/2579 dbrx-base 1078.166
mmlu/4647 dbrx-base 1078.166
mmlu/1194 dbrx-base 1078.166
mmlu/1447 dbrx-base 1078.166
mmlu/8862 dbrx-base 1078.166
mmlu/965 dbrx-base 1078.166
mmlu/10517 dbrx-base 1078.166
mmlu/10939 Qwen1.5-32B 1075.657
mmlu/1260 Qwen1.5-32B 1075.657
mmlu/1404 Qwen1.5-32B 1075.657
mmlu/4276 Qwen1.5-32B 1075.657
mmlu/2160 Qwen1.5-32B 1075.657
mmlu/3723 deepseek-llm-67b-base 1067.350
mmlu/6148 deepseek-llm-67b-base 1067.350
mmlu/3997 deepseek-llm-67b-base 1067.350
mmlu/2734 Mixtral-8x7B-v0.1 1063.500
mmlu/12674 Mixtral-8x7B-v0.1 1063.500
mmlu/1431 Qwen1.5-14B 1054.246
mmlu/9161 Meta-Llama-3-8B 1045.086
mmlu/2668 llama2_70B 1037.246
mmlu/12497 gemma-7b 1035.388
mmlu/4342 Mistral-7B-v0.1 1034.899
mmlu/3140 Mistral-7B-v0.1 1034.899
mmlu/10736 llama_65B 1033.919
mmlu/4755 llama_65B 1033.919
mmlu/3037 llama_65B 1033.919
mmlu/2475 llama_33B 1015.181
mmlu/4864 falcon-40b 1009.477
mmlu/978 falcon-40b 1009.477
mmlu/72 falcon-40b 1009.477
mmlu/10583 falcon-40b 1009.477
mmlu/1418 falcon-40b 1009.477
mmlu/3299 falcon-40b 1009.477
mmlu/3440 falcon-40b 1009.477
mmlu/940 falcon-40b 1009.477
mmlu/923 falcon-40b 1009.477
mmlu/11254 Qwen1.5-4B 1008.429
mmlu/6914 deepseek-llm-7b-base 983.016
mmlu/2319 deepseek-llm-7b-base 983.016
mmlu/1493 deepseek-llm-7b-base 983.016
mmlu/1097 deepseek-llm-7b-base 983.016
mmlu/6835 deepseek-llm-7b-base 983.016
mmlu/10355 deepseek-llm-7b-base 983.016
mmlu/7889 llama2_07B 980.275
mmlu/12213 Qwen1.5-1.8B 973.988
mmlu/13785 Qwen1.5-1.8B 973.988
mmlu/11664 llama_13B 973.963
mmlu/914 deepseek-moe-16b-base 971.650
mmlu/13253 deepseek-moe-16b-base 971.650
mmlu/2597 deepseek-moe-16b-base 971.650
mmlu/225 stablelm-3b-4e1t 969.875
mmlu/4863 stablelm-3b-4e1t 969.875
mmlu/4122 stablelm-base-alpha-7b-v2 969.617
mmlu/3474 stablelm-base-alpha-7b-v2 969.617
mmlu/2129 stablelm-base-alpha-7b-v2 969.617
mmlu/7 stablelm-base-alpha-7b-v2 969.617
mmlu/13044 gemma-2b 957.509
mmlu/6651 gemma-2b 957.509
mmlu/1841 gemma-2b 957.509
mmlu/13842 gemma-2b 957.509
mmlu/10699 gemma-2b 957.509
mmlu/13841 gemma-2b 957.509
mmlu/11510 gemma-2b 957.509
mmlu/11057 Qwen1.5-0.5B 947.916
mmlu/4462 Qwen1.5-0.5B 947.916
mmlu/2491 Qwen1.5-0.5B 947.916
mmlu/4484 Qwen1.5-0.5B 947.916
mmlu/11661 Qwen1.5-0.5B 947.916
mmlu/13739 llama_07B 935.643
mmlu/6755 llama_07B 935.643
mmlu/982 llama_07B 935.643
mmlu/4559 llama_07B 935.643
mmlu/12469 llama_07B 935.643
mmlu/4405 llama_07B 935.643
mmlu/8379 llama_07B 935.643
mmlu/937 llama_07B 935.643
mmlu/6221 falcon-7b 906.104
mmlu/1326 falcon-7b 906.104
mmlu/10368 falcon-7b 906.104
mmlu/10523 falcon-7b 906.104
mmlu/165 falcon-7b 906.104
mmlu/5654 falcon-7b 906.104
mmlu/8490 pythia-2.8b-deduped 903.094
mmlu/4779 pythia-2.8b-deduped 903.094
mmlu/11520 pythia-2.8b-deduped 903.094
mmlu/11540 pythia-2.8b-deduped 903.094
mmlu/4829 pythia-2.8b-deduped 903.094
mmlu/1003 pythia-2.8b-deduped 903.094
mmlu/207 pythia-2.8b-deduped 903.094
mmlu/4106 pythia-2.8b-deduped 903.094
mmlu/4761 pythia-2.8b-deduped 903.094
mmlu/8218 pythia-2.8b-deduped 903.094
mmlu/3916 pythia-2.8b-deduped 903.094
mmlu/2647 pythia-2.8b-deduped 903.094
mmlu/4176 pythia-2.8b-deduped 903.094
mmlu/155 pythia-2.8b-deduped 903.094
mmlu/8314 pythia-2.8b-deduped 903.094
mmlu/988 pythia-2.8b-deduped 903.094
mmlu/11054 pythia-12b-deduped-v0 896.287
mmlu/8501 pythia-12b-deduped-v0 896.287
mmlu/10668 pythia-12b-deduped-v0 896.287
mmlu/6101 pythia-12b-deduped-v0 896.287
mmlu/9810 pythia-6.9b-deduped-v0 896.260
mmlu/2961 pythia-6.9b-deduped-v0 896.260
mmlu/203 pythia-6.9b-deduped-v0 896.260
mmlu/8465 pythia-6.9b-deduped-v0 896.260
mmlu/8342 pythia-6.9b-deduped-v0 896.260
mmlu/1876 pythia-6.9b-deduped-v0 896.260
mmlu/9819 pythia-6.9b-deduped-v0 896.260
mmlu/1979 pythia-1b-deduped 896.176
mmlu/133 pythia-1b-deduped 896.176
mmlu/5839 pythia-1b-deduped 896.176
mmlu/5420 pythia-1b-deduped 896.176
mmlu/11089 pythia-1b-deduped 896.176
mmlu/4816 pythia-1b-deduped 896.176
mmlu/11410 pythia-1.4b-deduped-v0 890.863
mmlu/11742 pythia-1.4b-deduped-v0 890.863
mmlu/10665 pythia-1.4b-deduped-v0 890.863
mmlu/12092 pythia-1.4b-deduped-v0 890.863
mmlu/13474 pythia-1.4b-deduped-v0 890.863
mmlu/11810 pythia-1.4b-deduped-v0 890.863
mmlu/11886 pythia-1.4b-deduped-v0 890.863
mmlu/9926 pythia-1.4b-deduped-v0 890.863
mmlu/1877 pythia-1.4b-deduped-v0 890.863
mmlu/10993 pythia-1.4b-deduped-v0 890.863
mmlu/10911 pythia-1.4b-deduped-v0 890.863
mmlu/11741 pythia-1.4b-deduped-v0 890.863
mmlu/13833 pythia-1.4b-deduped-v0 890.863
mmlu/10980 pythia-1.4b-deduped-v0 890.863
mmlu/12500 pythia-1.4b-deduped-v0 890.863
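
A table of this shape could be built by collecting the solvers of each example and keeping the examples with exactly one. The sketch below assumes the same hypothetical `results` mapping (model name to per-example correctness) plus a hypothetical `elo` mapping from model name to rating; with a single solver, `min_elo` is simply that model's rating. Rows are sorted by rating in descending order to mirror the table above.

```python
from typing import Dict, List, Tuple


def solved_by_one(
    results: Dict[str, Dict[str, bool]],
    elo: Dict[str, float],
) -> List[Tuple[str, str, float]]:
    """Return (example id, model, min_elo) rows for examples with one solver."""
    solvers: Dict[str, List[str]] = {}
    for model, per_example in results.items():
        for ex, correct in per_example.items():
            if correct:
                solvers.setdefault(ex, []).append(model)
    rows = [
        (ex, models[0], elo[models[0]])
        for ex, models in solvers.items()
        if len(models) == 1
    ]
    # Highest rating first, then example id for a stable order.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


# Toy data (hypothetical model names and ratings).
toy_elo = {"strong": 1100.0, "weak": 900.0}
toy_results = {
    "strong": {"mmlu/1": True, "mmlu/2": True, "mmlu/3": False},
    "weak": {"mmlu/1": True, "mmlu/2": False, "mmlu/3": True},
}
for row in solved_by_one(toy_results, toy_elo):
    print(*row)
```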

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).

example_link acc tau
mmlu/2534 0.194 -0.551
mmlu/805 0.306 -0.550
mmlu/24 0.333 -0.526
mmlu/1461 0.167 -0.505
mmlu/4699 0.194 -0.501
mmlu/329 0.194 -0.495
mmlu/6255 0.250 -0.493
mmlu/8429 0.139 -0.490
mmlu/1467 0.222 -0.484
mmlu/5196 0.278 -0.484
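
To make the `tau` column concrete, here is a minimal sketch of a rank correlation between a model's overall rating and whether it solved a given problem. It assumes `tau` is a Kendall rank correlation and uses the tau-b variant, which handles the many ties that binary correctness produces; the page does not say which variant it uses, so treat that choice, the function name, and the toy data as assumptions.

```python
import math
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-b between two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: counted in neither term
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Toy data (hypothetical): only the two weakest models solve the problem.
toy_ratings = [900.0, 950.0, 1000.0, 1050.0, 1100.0]
toy_correct = [1, 1, 0, 0, 0]
tau = kendall_tau_b(toy_ratings, toy_correct)
print(f"tau = {tau:.3f}")  # negative: stronger models do worse here
```

A strongly negative value on a real problem is a hint to check it by hand, for example for a wrong answer key or an ambiguous question.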

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
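
The counts behind such a histogram can be sketched as follows, assuming per-problem accuracy means the fraction of models that solve the problem. The `results` layout, the function name, and the bin count are hypothetical choices for illustration.

```python
from collections import Counter
from typing import Dict, List


def accuracy_histogram(
    results: Dict[str, Dict[str, bool]], n_bins: int = 10
) -> List[int]:
    """Count problems per equal-width accuracy bin over [0, 1]."""
    example_ids = sorted({ex for per in results.values() for ex in per})
    n_models = len(results)
    counts: Counter = Counter()
    for ex in example_ids:
        acc = sum(per.get(ex, False) for per in results.values()) / n_models
        # An accuracy of exactly 1.0 goes into the last bin.
        counts[min(int(acc * n_bins), n_bins - 1)] += 1
    return [counts[b] for b in range(n_bins)]


# Toy data (hypothetical): accuracies are 1.0, 0.5 and 0.0.
toy_results = {
    "model_a": {"mmlu/1": True, "mmlu/2": True, "mmlu/3": False},
    "model_b": {"mmlu/1": True, "mmlu/2": False, "mmlu/3": False},
}
print(accuracy_histogram(toy_results, n_bins=4))
```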

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.
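
One way to read "minimum Elo to solve" is the lowest rating among the models that answer a problem correctly. The sketch below computes that value per problem and buckets it into fixed-width bins. The `results` and `elo` mappings, the function names, and the bin width are assumptions; problems that no model solves have no minimum and are left out.

```python
import math
from typing import Dict, Iterable


def min_elo_to_solve(
    results: Dict[str, Dict[str, bool]], elo: Dict[str, float]
) -> Dict[str, float]:
    """Map each solved example to the lowest rating among its solvers."""
    best: Dict[str, float] = {}
    for model, per_example in results.items():
        for ex, correct in per_example.items():
            if correct:
                best[ex] = min(best.get(ex, math.inf), elo[model])
    return best


def bucket_counts(values: Iterable[float], width: float = 50.0) -> Dict[float, int]:
    """Count values per bin, keyed by the bin's lower edge."""
    counts: Dict[float, int] = {}
    for value in values:
        lower_edge = math.floor(value / width) * width
        counts[lower_edge] = counts.get(lower_edge, 0) + 1
    return dict(sorted(counts.items()))


# Toy data (hypothetical model names and ratings).
toy_elo = {"strong": 1100.0, "weak": 900.0}
toy_results = {
    "strong": {"mmlu/1": True, "mmlu/2": True, "mmlu/3": False},
    "weak": {"mmlu/1": True, "mmlu/2": False, "mmlu/3": True},
}
difficulty = min_elo_to_solve(toy_results, toy_elo)
print(bucket_counts(difficulty.values(), width=100.0))
```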