There are 165 examples not solved by any model.
Provided these are good problems, solving some of them can be a good signal that your model is indeed better than the leading models.
arc_challenge/1013, arc_challenge/1020, arc_challenge/1025, arc_challenge/1030, arc_challenge/1045, arc_challenge/1049, arc_challenge/1062, arc_challenge/1064, arc_challenge/1071, arc_challenge/1074, arc_challenge/1085, arc_challenge/109, arc_challenge/1095, arc_challenge/1096, arc_challenge/1097, arc_challenge/110, arc_challenge/1109, arc_challenge/1111, arc_challenge/1115, arc_challenge/1132, arc_challenge/1137, arc_challenge/1140, arc_challenge/1156, arc_challenge/1157, arc_challenge/123, arc_challenge/129, arc_challenge/133, arc_challenge/144, arc_challenge/145, arc_challenge/154, arc_challenge/156, arc_challenge/158, arc_challenge/171, arc_challenge/172, arc_challenge/173, arc_challenge/185, arc_challenge/186, arc_challenge/19, arc_challenge/208, arc_challenge/221, arc_challenge/222, arc_challenge/230, arc_challenge/235, arc_challenge/247, arc_challenge/249, arc_challenge/250, arc_challenge/253, arc_challenge/257, arc_challenge/271, arc_challenge/281, arc_challenge/293, arc_challenge/298, arc_challenge/299, arc_challenge/310, arc_challenge/316, arc_challenge/318, arc_challenge/324, arc_challenge/326, arc_challenge/328, arc_challenge/329, arc_challenge/333, arc_challenge/35, arc_challenge/361, arc_challenge/366, arc_challenge/369, arc_challenge/37, arc_challenge/376, arc_challenge/383, arc_challenge/389, arc_challenge/390, arc_challenge/4, arc_challenge/41, arc_challenge/417, arc_challenge/419, arc_challenge/435, arc_challenge/438, arc_challenge/446, arc_challenge/447, arc_challenge/45, arc_challenge/450, arc_challenge/461, arc_challenge/462, arc_challenge/474, arc_challenge/489, arc_challenge/5, arc_challenge/51, arc_challenge/517, arc_challenge/531, arc_challenge/547, arc_challenge/55, arc_challenge/560, arc_challenge/569, arc_challenge/574, arc_challenge/577, arc_challenge/594, arc_challenge/599, arc_challenge/604, arc_challenge/613, arc_challenge/622, arc_challenge/626, arc_challenge/627, arc_challenge/64, arc_challenge/654, arc_challenge/662, 
arc_challenge/67, arc_challenge/672, arc_challenge/679, arc_challenge/688, arc_challenge/690, arc_challenge/691, arc_challenge/692, arc_challenge/708, arc_challenge/709, arc_challenge/717, arc_challenge/726, arc_challenge/728, arc_challenge/734, arc_challenge/740, arc_challenge/746, arc_challenge/747, arc_challenge/75, arc_challenge/752, arc_challenge/755, arc_challenge/756, arc_challenge/757, arc_challenge/765, arc_challenge/774, arc_challenge/78, arc_challenge/789, arc_challenge/79, arc_challenge/790, arc_challenge/791, arc_challenge/792, arc_challenge/797, arc_challenge/798, arc_challenge/803, arc_challenge/808, arc_challenge/813, arc_challenge/816, arc_challenge/820, arc_challenge/828, arc_challenge/835, arc_challenge/838, arc_challenge/852, arc_challenge/872, arc_challenge/882, arc_challenge/89, arc_challenge/890, arc_challenge/897, arc_challenge/903, arc_challenge/905, arc_challenge/908, arc_challenge/919, arc_challenge/926, arc_challenge/957, arc_challenge/959, arc_challenge/960, arc_challenge/961, arc_challenge/962, arc_challenge/967, arc_challenge/971, arc_challenge/974, arc_challenge/982, arc_challenge/983, arc_challenge/989
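A list like the one above can be derived from per-model results. The following is a minimal sketch, assuming the results are available as a mapping from model name to the set of example IDs that model answered correctly; the names `results`, `all_examples`, and `unsolved_by_all`, along with the toy IDs, are hypothetical and not taken from the report.

```python
# Hypothetical toy data: model name -> set of example IDs answered correctly.
results = {
    "model_a": {"ex/1", "ex/2"},
    "model_b": {"ex/2", "ex/3"},
}
all_examples = ["ex/1", "ex/2", "ex/3", "ex/4"]


def unsolved_by_all(all_examples, results):
    """Return the examples that no model in `results` solved, in input order."""
    solved_by_any = set().union(*results.values())
    return [ex for ex in all_examples if ex not in solved_by_any]


unsolved = unsolved_by_all(all_examples, results)
print(f"There are {len(unsolved)} examples not solved by any model.")
print(", ".join(unsolved))
```

In this toy data only `ex/4` is missed by every model, so it is the single entry reported.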
example_link | model | min_elo |
---|---|---|
arc_challenge/508 | Meta-Llama-3-70B | 1284.923 |
arc_challenge/1034 | Meta-Llama-3-70B | 1284.923 |
arc_challenge/259 | dbrx-base | 1251.719 |
arc_challenge/796 | dbrx-base | 1251.719 |
arc_challenge/53 | dbrx-base | 1251.719 |
arc_challenge/492 | dbrx-base | 1251.719 |
arc_challenge/570 | dbrx-base | 1251.719 |
arc_challenge/20 | dbrx-base | 1251.719 |
arc_challenge/2 | dbrx-base | 1251.719 |
arc_challenge/783 | dbrx-base | 1251.719 |
arc_challenge/212 | dbrx-base | 1251.719 |
arc_challenge/278 | dbrx-base | 1251.719 |
arc_challenge/674 | dbrx-base | 1251.719 |
arc_challenge/1065 | dbrx-base | 1251.719 |
arc_challenge/597 | dbrx-base | 1251.719 |
arc_challenge/345 | dbrx-base | 1251.719 |
arc_challenge/285 | dbrx-base | 1251.719 |
arc_challenge/1075 | dbrx-base | 1251.719 |
arc_challenge/162 | dbrx-base | 1251.719 |
arc_challenge/602 | dbrx-base | 1251.719 |
arc_challenge/314 | dbrx-base | 1251.719 |
arc_challenge/922 | dbrx-base | 1251.719 |
arc_challenge/892 | dbrx-base | 1251.719 |
arc_challenge/125 | dbrx-base | 1251.719 |
arc_challenge/176 | dbrx-base | 1251.719 |
arc_challenge/93 | dbrx-base | 1251.719 |
arc_challenge/315 | dbrx-base | 1251.719 |
arc_challenge/965 | dbrx-base | 1251.719 |
arc_challenge/181 | dbrx-base | 1251.719 |
arc_challenge/652 | dbrx-base | 1251.719 |
arc_challenge/1104 | Mixtral-8x22B-v0.1 | 1240.987 |
arc_challenge/26 | Mixtral-8x7B-v0.1 | 1214.019 |
arc_challenge/1086 | DeepSeek-V2 | 1209.083 |
arc_challenge/34 | DeepSeek-V2 | 1209.083 |
arc_challenge/317 | DeepSeek-V2 | 1209.083 |
arc_challenge/623 | deepseek-llm-67b-base | 1163.263 |
arc_challenge/1133 | falcon-40b | 1113.751 |
arc_challenge/456 | Qwen1.5-110B | 1111.923 |
arc_challenge/1011 | Mistral-7B-v0.1 | 1103.452 |
arc_challenge/682 | llama_33B | 1100.508 |
arc_challenge/513 | llama2_70B | 1098.129 |
arc_challenge/138 | llama2_70B | 1098.129 |
arc_challenge/54 | llama2_70B | 1098.129 |
arc_challenge/598 | Meta-Llama-3-8B | 1095.915 |
arc_challenge/1024 | Meta-Llama-3-8B | 1095.915 |
arc_challenge/1116 | llama2_13B | 1036.237 |
arc_challenge/805 | llama2_13B | 1036.237 |
arc_challenge/716 | llama2_13B | 1036.237 |
arc_challenge/241 | llama2_13B | 1036.237 |
arc_challenge/270 | llama2_13B | 1036.237 |
arc_challenge/901 | llama2_13B | 1036.237 |
arc_challenge/468 | Qwen1.5-14B | 965.943 |
arc_challenge/584 | deepseek-llm-7b-base | 955.690 |
arc_challenge/311 | falcon-7b | 947.325 |
arc_challenge/994 | llama2_07B | 937.411 |
arc_challenge/925 | mpt-7b | 926.285 |
arc_challenge/991 | Qwen1.5-4B | 883.591 |
arc_challenge/860 | pythia-2.8b-deduped | 787.950 |
arc_challenge/49 | Qwen1.5-0.5B | 727.733 |
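The table above pairs each example with a model and a `min_elo` value. A minimal sketch of how such a value could be computed follows, assuming `min_elo` means the lowest Elo rating among the models that solve the example. That reading is an assumption, and the ratings, model names, and the helper `min_elo_to_solve` below are hypothetical.

```python
# Hypothetical Elo ratings and per-model solved sets (not from the report).
elo = {"strong": 1250.0, "mid": 1100.0, "weak": 900.0}
solved = {
    "strong": {"ex/1", "ex/2"},
    "mid": {"ex/2"},
    "weak": {"ex/2", "ex/3"},
}


def min_elo_to_solve(elo, solved):
    """Map each solved example to (model, rating) of its lowest-rated solver."""
    best = {}
    for model, examples in solved.items():
        for ex in examples:
            if ex not in best or elo[model] < best[ex][1]:
                best[ex] = (model, elo[model])
    return best


table = min_elo_to_solve(elo, solved)

# Print rows in the same layout as the table, highest min_elo first.
for ex, (model, rating) in sorted(table.items(), key=lambda kv: -kv[1][1]):
    print(f"{ex} | {model} | {rating:.3f} |")
```

Examples that no model solves never appear in `table`, which matches their being listed separately.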
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
example_link | acc | tau |
---|---|---|
arc_challenge/256 | 0.342 | -0.530 |
arc_challenge/878 | 0.105 | -0.408 |
arc_challenge/521 | 0.132 | -0.397 |
arc_challenge/987 | 0.079 | -0.387 |
arc_challenge/801 | 0.289 | -0.370 |
arc_challenge/825 | 0.605 | -0.350 |
arc_challenge/1055 | 0.474 | -0.330 |
arc_challenge/189 | 0.079 | -0.306 |
arc_challenge/309 | 0.158 | -0.300 |
arc_challenge/29 | 0.842 | -0.289 |
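To make the `acc` and `tau` columns concrete, here is a minimal sketch that assumes `tau` is a Kendall rank correlation (the tau-b variant, which corrects for ties) between each model's overall score and whether it solved the problem. The report does not state which correlation is used, so this is an assumption, and the data and function name are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b for two equal-length sequences, computed over all pairs."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from every count
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical data: overall score per model, and whether each solved one problem.
overall_score = [0.9, 0.8, 0.6, 0.4]
solved_problem = [0, 0, 1, 1]  # only the weaker models get it right

acc = sum(solved_problem) / len(solved_problem)
tau = kendall_tau_b(overall_score, solved_problem)
print(f"acc={acc:.3f} tau={tau:.3f}")
```

A negative `tau`, as in this toy case, is what flags a problem where stronger models tend to fail.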
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.
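As an illustration of how the accuracy histogram could be binned, the sketch below counts problems in ten equal-width accuracy bins. The bin count and the helper name `histogram` are assumptions. The sample values are only the ten `acc` entries from the low-correlation table, so the result does not reproduce the full histogram.

```python
def histogram(values, n_bins=10, lo=0.0, hi=1.0):
    """Count values in n_bins equal-width bins over [lo, hi]; hi joins the last bin."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# The ten acc values from the low-correlation table (illustrative subset only).
accs = [0.342, 0.105, 0.132, 0.079, 0.289, 0.605, 0.474, 0.079, 0.158, 0.842]
counts = histogram(accs)

# Render a simple text histogram, one row per bin.
for i, c in enumerate(counts):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} | {'#' * c}")
```

The same function could bin `min_elo` values by passing a different `lo` and `hi`.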