piqa: by examples


Not solved by any model

There are 98 examples that no model solved. If these are well-posed problems, solving some of them is a good signal that your model is genuinely better than the leading models.
piqa/1, piqa/1070, piqa/1092, piqa/1101, piqa/1105, piqa/1121, piqa/1149, piqa/1150, piqa/1182, piqa/1192, piqa/1206, piqa/1225, piqa/1244, piqa/1246, piqa/1258, piqa/1306, piqa/1332, piqa/134, piqa/1341, piqa/1350, piqa/1352, piqa/1366, piqa/1372, piqa/1383, piqa/1394, piqa/1415, piqa/1417, piqa/1421, piqa/1427, piqa/1469, piqa/1493, piqa/1515, piqa/1517, piqa/156, piqa/1587, piqa/1647, piqa/1654, piqa/1663, piqa/173, piqa/1775, piqa/179, piqa/1790, piqa/1815, piqa/184, piqa/186, piqa/189, piqa/209, piqa/244, piqa/256, piqa/272, piqa/288, piqa/308, piqa/329, piqa/33, piqa/332, piqa/402, piqa/409, piqa/413, piqa/443, piqa/455, piqa/471, piqa/490, piqa/500, piqa/505, piqa/54, piqa/554, piqa/566, piqa/570, piqa/572, piqa/576, piqa/579, piqa/589, piqa/601, piqa/611, piqa/613, piqa/617, piqa/620, piqa/631, piqa/638, piqa/656, piqa/666, piqa/681, piqa/706, piqa/711, piqa/719, piqa/730, piqa/770, piqa/774, piqa/79, piqa/814, piqa/823, piqa/848, piqa/855, piqa/876, piqa/899, piqa/939, piqa/949, piqa/984
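
A list like the one above could be derived roughly as follows. This is a minimal sketch, assuming per-model results are stored as a mapping from model name to the set of example IDs that model solved. The `results` layout, the function name `unsolved_examples`, and the toy IDs are hypothetical and are not taken from the page's own code.

```python
def unsolved_examples(results, all_examples):
    """Return the examples that no model solved, in sorted order.

    results: dict mapping model name -> set of solved example IDs (assumed layout).
    all_examples: iterable of every example ID in the benchmark.
    """
    solved_by_any = set()
    for solved in results.values():
        solved_by_any |= solved
    return sorted(set(all_examples) - solved_by_any)


# Toy data for illustration only, not real benchmark results.
results = {
    "model_a": {"toy/1", "toy/2"},
    "model_b": {"toy/2", "toy/3"},
}
all_examples = ["toy/1", "toy/2", "toy/3", "toy/4"]
unsolved = unsolved_examples(results, all_examples)
print(unsolved)
```

Here only `toy/4` is reported, because every other toy example is solved by at least one model.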

Problems solved by 1 model only

example_link model min_elo
piqa/1485 Mixtral-8x22B-v0.1 1190.300
piqa/1264 Mixtral-8x22B-v0.1 1190.300
piqa/493 Mixtral-8x22B-v0.1 1190.300
piqa/1689 Mixtral-8x22B-v0.1 1190.300
piqa/1335 Mixtral-8x22B-v0.1 1190.300
piqa/1169 dbrx-base 1162.667
piqa/1318 dbrx-base 1162.667
piqa/957 dbrx-base 1162.667
piqa/146 dbrx-base 1162.667
piqa/689 dbrx-base 1162.667
piqa/122 dbrx-base 1162.667
piqa/369 dbrx-base 1162.667
piqa/492 Meta-Llama-3-70B 1162.373
piqa/1410 Qwen1.5-110B 1154.604
piqa/993 Mixtral-8x7B-v0.1 1132.347
piqa/1282 Qwen1.5-72B 1093.669
piqa/1728 llama2_70B 1018.974
piqa/1714 llama2_70B 1018.974
piqa/616 llama2_70B 1018.974
piqa/573 llama2_70B 1018.974
piqa/1641 llama2_70B 1018.974
piqa/497 falcon-7b 1016.467
piqa/926 stablelm-base-alpha-7b-v2 1001.142
piqa/444 stablelm-base-alpha-7b-v2 1001.142
piqa/971 Qwen1.5-14B 991.346
piqa/1071 llama2_13B 987.217
piqa/1561 llama2_13B 987.217
piqa/1346 llama2_07B 919.291
piqa/115 llama2_07B 919.291
piqa/1836 llama2_07B 919.291
piqa/1323 llama2_07B 919.291
piqa/997 llama2_07B 919.291
piqa/1763 pythia-1.4b-deduped-v0 772.584
piqa/1553 pythia-1.4b-deduped-v0 772.584
piqa/1183 pythia-1.4b-deduped-v0 772.584
piqa/1308 pythia-1.4b-deduped-v0 772.584
piqa/234 Qwen1.5-0.5B 753.907
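
The table above can be reproduced in spirit with a short sketch. It assumes the same hypothetical `results` mapping (model name to set of solved example IDs) plus an `elo` mapping from model name to rating; neither name comes from the page. For a problem with a single solver, `min_elo` is simply that solver's rating, which matches the repeated values in the table.

```python
def solved_by_one(results, elo):
    """Return (example, model, min_elo) rows for examples with exactly one solver.

    Rows are sorted by rating, highest first, then by example ID.
    """
    solvers = {}
    for model, solved in results.items():
        for example in solved:
            solvers.setdefault(example, []).append(model)

    rows = [
        (example, models[0], elo[models[0]])
        for example, models in solvers.items()
        if len(models) == 1
    ]
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


# Toy data for illustration only, not real benchmark results.
results = {
    "model_a": {"toy/1", "toy/2"},
    "model_b": {"toy/2", "toy/3"},
}
elo = {"model_a": 1100.0, "model_b": 900.0}
rows = solved_by_one(results, elo)
for example, model, rating in rows:
    print(example, model, f"{rating:.3f}")
```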

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e., better models tend to do worse on them).

example_link acc tau
piqa/895 0.694 -0.548
piqa/868 0.750 -0.467
piqa/1793 0.194 -0.438
piqa/1227 0.833 -0.435
piqa/1164 0.278 -0.427
piqa/1072 0.333 -0.420
piqa/632 0.417 -0.394
piqa/377 0.083 -0.382
piqa/60 0.889 -0.375
piqa/1209 0.083 -0.358
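
The two columns above could be computed per problem along these lines. This sketch assumes that `acc` is the fraction of models that solved the problem and that `tau` is a Kendall rank correlation between model rating and per-model correctness; the page does not state which tau variant it uses, so the tau-b form (which accounts for ties) is an assumption here. The function name and the toy numbers are hypothetical.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (pure Python, O(n^2))."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both denominator terms
        elif dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_pairs = concordant + discordant
    denom = math.sqrt((untied_pairs + tied_x_only) * (untied_pairs + tied_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data for illustration only: four models, one problem.
ratings = [800, 900, 1000, 1100]  # hypothetical model ratings
correct = [1, 1, 0, 0]            # 1 if that model solved the problem

acc = sum(correct) / len(correct)
tau = kendall_tau_b(ratings, correct)
print(f"acc={acc:.3f} tau={tau:.3f}")
```

In this toy case only the weaker models solve the problem, so tau is negative, which is the pattern that flags a problem as suspect.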

Histogram of accuracies

Histogram of problems by the accuracy on each problem.

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.
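
As a rough illustration of the quantity being binned, the sketch below computes the minimum rating among the models that solve each problem and counts problems per bin. The `results` and `elo` mappings, the function names, and the bin edges are all hypothetical; the page's actual binning is not given.

```python
def min_elo_per_problem(results, elo):
    """Map each solved example to the lowest rating among models that solved it."""
    min_elo = {}
    for model, solved in results.items():
        for example in solved:
            min_elo[example] = min(elo[model], min_elo.get(example, float("inf")))
    return min_elo


def histogram(values, lo, hi, n_bins):
    """Count values in n_bins equal-width bins over [lo, hi]; out-of-range values are clamped."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for value in values:
        index = int((value - lo) / width)
        counts[max(0, min(index, n_bins - 1))] += 1
    return counts


# Toy data for illustration only, not real benchmark results.
results = {
    "model_a": {"toy/1", "toy/2"},
    "model_b": {"toy/2", "toy/3"},
}
elo = {"model_a": 1100.0, "model_b": 900.0}

difficulty = min_elo_per_problem(results, elo)
counts = histogram(difficulty.values(), lo=800, hi=1200, n_bins=4)
print(counts)
```

Problems that no model solves have no minimum rating and are left out of `difficulty` in this sketch.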