arc_challenge: by examples


Not solved by any model

There are 165 examples that no model solved. If these are sound problems, solving some of them is a good signal that your model is indeed better than the leading models.
arc_challenge/1013, arc_challenge/1020, arc_challenge/1025, arc_challenge/1030, arc_challenge/1045, arc_challenge/1049, arc_challenge/1062, arc_challenge/1064, arc_challenge/1071, arc_challenge/1074, arc_challenge/1085, arc_challenge/109, arc_challenge/1095, arc_challenge/1096, arc_challenge/1097, arc_challenge/110, arc_challenge/1109, arc_challenge/1111, arc_challenge/1115, arc_challenge/1132, arc_challenge/1137, arc_challenge/1140, arc_challenge/1156, arc_challenge/1157, arc_challenge/123, arc_challenge/129, arc_challenge/133, arc_challenge/144, arc_challenge/145, arc_challenge/154, arc_challenge/156, arc_challenge/158, arc_challenge/171, arc_challenge/172, arc_challenge/173, arc_challenge/185, arc_challenge/186, arc_challenge/19, arc_challenge/208, arc_challenge/221, arc_challenge/222, arc_challenge/230, arc_challenge/235, arc_challenge/247, arc_challenge/249, arc_challenge/250, arc_challenge/253, arc_challenge/257, arc_challenge/271, arc_challenge/281, arc_challenge/293, arc_challenge/298, arc_challenge/299, arc_challenge/310, arc_challenge/316, arc_challenge/318, arc_challenge/324, arc_challenge/326, arc_challenge/328, arc_challenge/329, arc_challenge/333, arc_challenge/35, arc_challenge/361, arc_challenge/366, arc_challenge/369, arc_challenge/37, arc_challenge/376, arc_challenge/383, arc_challenge/389, arc_challenge/390, arc_challenge/4, arc_challenge/41, arc_challenge/417, arc_challenge/419, arc_challenge/435, arc_challenge/438, arc_challenge/446, arc_challenge/447, arc_challenge/45, arc_challenge/450, arc_challenge/461, arc_challenge/462, arc_challenge/474, arc_challenge/489, arc_challenge/5, arc_challenge/51, arc_challenge/517, arc_challenge/531, arc_challenge/547, arc_challenge/55, arc_challenge/560, arc_challenge/569, arc_challenge/574, arc_challenge/577, arc_challenge/594, arc_challenge/599, arc_challenge/604, arc_challenge/613, arc_challenge/622, arc_challenge/626, arc_challenge/627, arc_challenge/64, arc_challenge/654, arc_challenge/662, 
arc_challenge/67, arc_challenge/672, arc_challenge/679, arc_challenge/688, arc_challenge/690, arc_challenge/691, arc_challenge/692, arc_challenge/708, arc_challenge/709, arc_challenge/717, arc_challenge/726, arc_challenge/728, arc_challenge/734, arc_challenge/740, arc_challenge/746, arc_challenge/747, arc_challenge/75, arc_challenge/752, arc_challenge/755, arc_challenge/756, arc_challenge/757, arc_challenge/765, arc_challenge/774, arc_challenge/78, arc_challenge/789, arc_challenge/79, arc_challenge/790, arc_challenge/791, arc_challenge/792, arc_challenge/797, arc_challenge/798, arc_challenge/803, arc_challenge/808, arc_challenge/813, arc_challenge/816, arc_challenge/820, arc_challenge/828, arc_challenge/835, arc_challenge/838, arc_challenge/852, arc_challenge/872, arc_challenge/882, arc_challenge/89, arc_challenge/890, arc_challenge/897, arc_challenge/903, arc_challenge/905, arc_challenge/908, arc_challenge/919, arc_challenge/926, arc_challenge/957, arc_challenge/959, arc_challenge/960, arc_challenge/961, arc_challenge/962, arc_challenge/967, arc_challenge/971, arc_challenge/974, arc_challenge/982, arc_challenge/983, arc_challenge/989

Problems solved by 1 model only

example_link model min_elo
arc_challenge/508 Meta-Llama-3-70B 1284.923
arc_challenge/1034 Meta-Llama-3-70B 1284.923
arc_challenge/259 dbrx-base 1251.719
arc_challenge/796 dbrx-base 1251.719
arc_challenge/53 dbrx-base 1251.719
arc_challenge/492 dbrx-base 1251.719
arc_challenge/570 dbrx-base 1251.719
arc_challenge/20 dbrx-base 1251.719
arc_challenge/2 dbrx-base 1251.719
arc_challenge/783 dbrx-base 1251.719
arc_challenge/212 dbrx-base 1251.719
arc_challenge/278 dbrx-base 1251.719
arc_challenge/674 dbrx-base 1251.719
arc_challenge/1065 dbrx-base 1251.719
arc_challenge/597 dbrx-base 1251.719
arc_challenge/345 dbrx-base 1251.719
arc_challenge/285 dbrx-base 1251.719
arc_challenge/1075 dbrx-base 1251.719
arc_challenge/162 dbrx-base 1251.719
arc_challenge/602 dbrx-base 1251.719
arc_challenge/314 dbrx-base 1251.719
arc_challenge/922 dbrx-base 1251.719
arc_challenge/892 dbrx-base 1251.719
arc_challenge/125 dbrx-base 1251.719
arc_challenge/176 dbrx-base 1251.719
arc_challenge/93 dbrx-base 1251.719
arc_challenge/315 dbrx-base 1251.719
arc_challenge/965 dbrx-base 1251.719
arc_challenge/181 dbrx-base 1251.719
arc_challenge/652 dbrx-base 1251.719
arc_challenge/1104 Mixtral-8x22B-v0.1 1240.987
arc_challenge/26 Mixtral-8x7B-v0.1 1214.019
arc_challenge/1086 DeepSeek-V2 1209.083
arc_challenge/34 DeepSeek-V2 1209.083
arc_challenge/317 DeepSeek-V2 1209.083
arc_challenge/623 deepseek-llm-67b-base 1163.263
arc_challenge/1133 falcon-40b 1113.751
arc_challenge/456 Qwen1.5-110B 1111.923
arc_challenge/1011 Mistral-7B-v0.1 1103.452
arc_challenge/682 llama_33B 1100.508
arc_challenge/513 llama2_70B 1098.129
arc_challenge/138 llama2_70B 1098.129
arc_challenge/54 llama2_70B 1098.129
arc_challenge/598 Meta-Llama-3-8B 1095.915
arc_challenge/1024 Meta-Llama-3-8B 1095.915
arc_challenge/1116 llama2_13B 1036.237
arc_challenge/805 llama2_13B 1036.237
arc_challenge/716 llama2_13B 1036.237
arc_challenge/241 llama2_13B 1036.237
arc_challenge/270 llama2_13B 1036.237
arc_challenge/901 llama2_13B 1036.237
arc_challenge/468 Qwen1.5-14B 965.943
arc_challenge/584 deepseek-llm-7b-base 955.690
arc_challenge/311 falcon-7b 947.325
arc_challenge/994 llama2_07B 937.411
arc_challenge/925 mpt-7b 926.285
arc_challenge/991 Qwen1.5-4B 883.591
arc_challenge/860 pythia-2.8b-deduped 787.950
arc_challenge/49 Qwen1.5-0.5B 727.733
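
As a rough illustration of how the lists above could be derived, here is a minimal sketch. It assumes a hypothetical `results` mapping from each problem to per-model pass/fail flags and a hypothetical `elo` mapping from model name to rating; neither name nor the sample values come from the page, and the actual pipeline may differ.

```python
def summarize_solvers(results, elo):
    """Split problems into 'unsolved' and 'solved by exactly one model'.

    results: {problem_id: {model_name: bool}}  (hypothetical layout)
    elo:     {model_name: float}               (hypothetical layout)
    """
    unsolved, single_solver = [], []
    for problem, per_model in results.items():
        solvers = [m for m, ok in per_model.items() if ok]
        if not solvers:
            unsolved.append(problem)
        elif len(solvers) == 1:
            model = solvers[0]
            # With a single solver, min_elo is simply that model's Elo.
            single_solver.append((problem, model, elo[model]))
    # Highest-rated sole solvers first, matching the table's ordering.
    single_solver.sort(key=lambda row: row[2], reverse=True)
    return unsolved, single_solver


def min_elo_to_solve(per_model, elo):
    """Lowest Elo among the models that solved a problem, or None if unsolved."""
    solved_by = [elo[m] for m, ok in per_model.items() if ok]
    return min(solved_by) if solved_by else None


# Illustrative data only; these are not real model names or results.
elo = {"model_a": 1200.0, "model_b": 900.0}
results = {
    "p1": {"model_a": False, "model_b": False},
    "p2": {"model_a": True, "model_b": False},
    "p3": {"model_a": True, "model_b": True},
}
unsolved, single_solver = summarize_solvers(results, elo)
```

In this toy data, `p1` lands in the unsolved list, `p2` is attributed to `model_a` alone, and `p3` has a minimum Elo equal to the weaker of its two solvers.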

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them).

example_link acc tau
arc_challenge/256 0.342 -0.530
arc_challenge/878 0.105 -0.408
arc_challenge/521 0.132 -0.397
arc_challenge/987 0.079 -0.387
arc_challenge/801 0.289 -0.370
arc_challenge/825 0.605 -0.350
arc_challenge/1055 0.474 -0.330
arc_challenge/189 0.079 -0.306
arc_challenge/309 0.158 -0.300
arc_challenge/29 0.842 -0.289
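
The page does not say which tau statistic is reported. Assuming it is a Kendall rank correlation between each model's overall score and whether that model solved the problem, a minimal pure-Python sketch of the tau-b variant (which handles the many ties in 0/1 outcomes) could look like this. The function name and the sample numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall tau-b between two equal-length sequences, allowing ties."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from both factors
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Illustrative data: stronger models (higher score) fail, weaker ones pass.
overall_score = [5, 4, 3, 2, 1]
solved = [0, 0, 1, 1, 1]
tau = kendall_tau_b(overall_score, solved)
```

A negative value, as in this toy case, is the pattern the table flags: the problem is solved more often by weaker models than by stronger ones.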

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
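
For readers who want to reproduce a histogram like this from raw results, here is a small sketch. It assumes per-problem accuracy means the fraction of evaluated models that solved the problem, and it uses a hypothetical `results` mapping and bin count that are not taken from the page.

```python
from collections import Counter


def accuracy_histogram(results, n_bins=10):
    """Count problems per accuracy bin.

    results: {problem_id: {model_name: bool}}  (hypothetical layout)
    Returns a list of n_bins counts covering the range [0, 1].
    """
    counts = Counter()
    for per_model in results.values():
        acc = sum(per_model.values()) / len(per_model)
        # Put acc == 1.0 in the last bin instead of an extra one.
        bin_index = min(int(acc * n_bins), n_bins - 1)
        counts[bin_index] += 1
    return [counts[b] for b in range(n_bins)]


# Illustrative data only.
toy_results = {
    "p1": {"model_a": False, "model_b": False},  # accuracy 0.0
    "p2": {"model_a": True, "model_b": False},   # accuracy 0.5
    "p3": {"model_a": True, "model_b": True},    # accuracy 1.0
}
hist = accuracy_histogram(toy_results)
```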

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.