There are 98 examples not solved by any model.
If these are good problems, solving some of them is a good signal that your model is indeed better than the leading models.
piqa/1, piqa/1070, piqa/1092, piqa/1101, piqa/1105, piqa/1121, piqa/1149, piqa/1150, piqa/1182, piqa/1192, piqa/1206, piqa/1225, piqa/1244, piqa/1246, piqa/1258, piqa/1306, piqa/1332, piqa/134, piqa/1341, piqa/1350, piqa/1352, piqa/1366, piqa/1372, piqa/1383, piqa/1394, piqa/1415, piqa/1417, piqa/1421, piqa/1427, piqa/1469, piqa/1493, piqa/1515, piqa/1517, piqa/156, piqa/1587, piqa/1647, piqa/1654, piqa/1663, piqa/173, piqa/1775, piqa/179, piqa/1790, piqa/1815, piqa/184, piqa/186, piqa/189, piqa/209, piqa/244, piqa/256, piqa/272, piqa/288, piqa/308, piqa/329, piqa/33, piqa/332, piqa/402, piqa/409, piqa/413, piqa/443, piqa/455, piqa/471, piqa/490, piqa/500, piqa/505, piqa/54, piqa/554, piqa/566, piqa/570, piqa/572, piqa/576, piqa/579, piqa/589, piqa/601, piqa/611, piqa/613, piqa/617, piqa/620, piqa/631, piqa/638, piqa/656, piqa/666, piqa/681, piqa/706, piqa/711, piqa/719, piqa/730, piqa/770, piqa/774, piqa/79, piqa/814, piqa/823, piqa/848, piqa/855, piqa/876, piqa/899, piqa/939, piqa/949, piqa/984
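A minimal sketch of how a list like the one above could be derived. The `results` mapping (model name to per-example correctness) and the example ids are hypothetical and serve only as illustration; the report's actual data format is not shown here.

```python
from typing import Dict, List


def unsolved_examples(results: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return the ids of examples that no model answered correctly.

    `results` maps model name -> {example id -> answered correctly?}.
    This layout is an assumption made for the sketch.
    """
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    return sorted(
        ex
        for ex in example_ids
        # A missing entry is treated as "not solved" by that model.
        if not any(per_example.get(ex, False) for per_example in results.values())
    )


# Toy data (hypothetical, not taken from the report).
toy_results = {
    "model_a": {"ex/1": False, "ex/2": True, "ex/3": False},
    "model_b": {"ex/1": False, "ex/2": False, "ex/3": True},
}
unsolved = unsolved_examples(toy_results)
print(f"There are {len(unsolved)} examples not solved by any model.")
print(", ".join(unsolved))
```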
example_link | model | min_elo |
---|---|---|
piqa/1689 | Mixtral-8x22B-v0.1 | 1190.300 |
piqa/1335 | Mixtral-8x22B-v0.1 | 1190.300 |
piqa/1485 | Mixtral-8x22B-v0.1 | 1190.300 |
piqa/1264 | Mixtral-8x22B-v0.1 | 1190.300 |
piqa/493 | Mixtral-8x22B-v0.1 | 1190.300 |
piqa/957 | dbrx-base | 1162.667 |
piqa/122 | dbrx-base | 1162.667 |
piqa/1318 | dbrx-base | 1162.667 |
piqa/146 | dbrx-base | 1162.667 |
piqa/1169 | dbrx-base | 1162.667 |
piqa/689 | dbrx-base | 1162.667 |
piqa/369 | dbrx-base | 1162.667 |
piqa/492 | Meta-Llama-3-70B | 1162.373 |
piqa/1410 | Qwen1.5-110B | 1154.604 |
piqa/993 | Mixtral-8x7B-v0.1 | 1132.347 |
piqa/1282 | Qwen1.5-72B | 1093.669 |
piqa/616 | llama2_70B | 1018.974 |
piqa/1728 | llama2_70B | 1018.974 |
piqa/573 | llama2_70B | 1018.974 |
piqa/1641 | llama2_70B | 1018.974 |
piqa/1714 | llama2_70B | 1018.974 |
piqa/497 | falcon-7b | 1016.467 |
piqa/926 | stablelm-base-alpha-7b-v2 | 1001.142 |
piqa/444 | stablelm-base-alpha-7b-v2 | 1001.142 |
piqa/971 | Qwen1.5-14B | 991.346 |
piqa/1071 | llama2_13B | 987.217 |
piqa/1561 | llama2_13B | 987.217 |
piqa/1323 | llama2_07B | 919.291 |
piqa/115 | llama2_07B | 919.291 |
piqa/997 | llama2_07B | 919.291 |
piqa/1346 | llama2_07B | 919.291 |
piqa/1836 | llama2_07B | 919.291 |
piqa/1763 | pythia-1.4b-deduped-v0 | 772.584 |
piqa/1183 | pythia-1.4b-deduped-v0 | 772.584 |
piqa/1308 | pythia-1.4b-deduped-v0 | 772.584 |
piqa/1553 | pythia-1.4b-deduped-v0 | 772.584 |
piqa/234 | Qwen1.5-0.5B | 753.907 |
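One plausible reading of the table above is that `min_elo` is the Elo rating of the lowest-rated model that solves the problem. The sketch below computes that quantity under this assumption; the `results` and `elo` mappings, the model names, and the ratings are all hypothetical.

```python
from typing import Dict, Optional, Tuple


def min_elo_to_solve(
    results: Dict[str, Dict[str, bool]],
    elo: Dict[str, float],
) -> Dict[str, Optional[Tuple[str, float]]]:
    """For each example, find the lowest-rated model that solves it.

    Returns example id -> (model, elo), or None if no model solves it.
    """
    weakest: Dict[str, Optional[Tuple[str, float]]] = {}
    for model, per_example in results.items():
        for ex, correct in per_example.items():
            best = weakest.setdefault(ex, None)
            if correct and (best is None or elo[model] < best[1]):
                weakest[ex] = (model, elo[model])
    return weakest


# Toy data (hypothetical, not taken from the report).
toy_elo = {"strong": 1200.0, "weak": 800.0}
toy_results = {
    "strong": {"ex/1": True, "ex/2": True, "ex/3": False},
    "weak": {"ex/1": False, "ex/2": True, "ex/3": False},
}
table = min_elo_to_solve(toy_results, toy_elo)

# Highest minimum Elo first; examples solved by no model are skipped.
rows = sorted(
    [(ex, hit[0], hit[1]) for ex, hit in table.items() if hit is not None],
    key=lambda row: row[2],
    reverse=True,
)
for ex, model, rating in rows:
    print(f"{ex} | {model} | {rating:.3f} |")
```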
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
example_link | acc | tau |
---|---|---|
piqa/895 | 0.694 | -0.548 |
piqa/868 | 0.750 | -0.467 |
piqa/1793 | 0.194 | -0.438 |
piqa/1227 | 0.833 | -0.435 |
piqa/1164 | 0.278 | -0.427 |
piqa/1072 | 0.333 | -0.420 |
piqa/632 | 0.417 | -0.394 |
piqa/377 | 0.083 | -0.382 |
piqa/60 | 0.889 | -0.375 |
piqa/1209 | 0.083 | -0.358 |
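The report does not say how `tau` is computed. The sketch below assumes it is a Kendall rank correlation (the tau-b variant, which handles ties) between each model's correctness on one problem and that model's overall score. The function name and the toy numbers are hypothetical.

```python
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b, computed over all pairs in O(n^2) time."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: contributes to neither count
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data (hypothetical): one entry per model.
correct_on_problem = [1, 1, 0, 0]      # the two weaker models solve it
overall_score = [0.5, 0.6, 0.8, 0.9]   # overall benchmark accuracy per model
acc = sum(correct_on_problem) / len(correct_on_problem)
tau = kendall_tau_b(correct_on_problem, overall_score)
print(f"acc={acc:.3f} tau={tau:.3f}")
```

A negative value here means the models that solve the problem tend to be the weaker ones overall, which matches the interpretation given above the table.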
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.