There are 284 examples not solved by any model.
Solving some of these can be a good signal that your model is indeed better than the leading models, provided these are good problems.
siqa/1008, siqa/102, siqa/1020, siqa/1042, siqa/1047, siqa/1058, siqa/106, siqa/1062, siqa/1076, siqa/1077, siqa/1081, siqa/1083, siqa/1091, siqa/1097, siqa/11, siqa/1111, siqa/1114, siqa/1119, siqa/113, siqa/1131, siqa/1134, siqa/1139, siqa/1140, siqa/1141, siqa/1161, siqa/1164, siqa/1186, siqa/1198, siqa/1199, siqa/1201, siqa/1204, siqa/1211, siqa/1221, siqa/1231, siqa/1271, siqa/1273, siqa/1282, siqa/1287, siqa/1321, siqa/1334, siqa/1348, siqa/1355, siqa/1357, siqa/1365, siqa/1380, siqa/1381, siqa/1385, siqa/1386, siqa/1390, siqa/1391, siqa/1405, siqa/1406, siqa/1413, siqa/1426, siqa/1430, siqa/1431, siqa/1432, siqa/1437, siqa/1438, siqa/1441, siqa/1448, siqa/145, siqa/1454, siqa/146, siqa/1460, siqa/1461, siqa/1472, siqa/1488, siqa/1490, siqa/1495, siqa/1499, siqa/1509, siqa/1513, siqa/1518, siqa/153, siqa/1532, siqa/1564, siqa/1566, siqa/1580, siqa/1589, siqa/1593, siqa/1594, siqa/1595, siqa/1605, siqa/1612, siqa/162, siqa/1628, siqa/163, siqa/1632, siqa/1641, siqa/1649, siqa/1663, siqa/167, siqa/1676, siqa/1692, siqa/1694, siqa/1695, siqa/1701, siqa/1703, siqa/1705, siqa/1725, siqa/1726, siqa/1735, siqa/174, siqa/1740, siqa/1744, siqa/1749, siqa/1751, siqa/1761, siqa/1762, siqa/1767, siqa/1777, siqa/1778, siqa/1784, siqa/1786, siqa/1790, siqa/1792, siqa/1800, siqa/1801, siqa/1808, siqa/1810, siqa/1811, siqa/1815, siqa/1816, siqa/1820, siqa/1822, siqa/1824, siqa/1830, siqa/1837, siqa/1847, siqa/1852, siqa/1861, siqa/1879, siqa/1883, siqa/1886, siqa/1891, siqa/1894, siqa/1895, siqa/190, siqa/1905, siqa/1913, siqa/1921, siqa/1923, siqa/1933, siqa/1945, siqa/195, siqa/198, siqa/211, siqa/219, siqa/229, siqa/23, siqa/240, siqa/243, siqa/244, siqa/253, siqa/259, siqa/260, siqa/262, siqa/263, siqa/272, siqa/286, siqa/296, siqa/315, siqa/318, siqa/32, siqa/335, siqa/339, siqa/344, siqa/349, siqa/352, siqa/354, siqa/367, siqa/382, siqa/388, siqa/405, siqa/421, siqa/424, siqa/43, siqa/437, siqa/439, siqa/44, siqa/441, siqa/443, siqa/444, siqa/448, siqa/461, siqa/463, 
siqa/465, siqa/480, siqa/487, siqa/489, siqa/497, siqa/503, siqa/506, siqa/511, siqa/519, siqa/52, siqa/532, siqa/536, siqa/539, siqa/54, siqa/550, siqa/567, siqa/568, siqa/573, siqa/575, siqa/578, siqa/583, siqa/585, siqa/587, siqa/588, siqa/593, siqa/597, siqa/60, siqa/605, siqa/609, siqa/619, siqa/62, siqa/625, siqa/627, siqa/635, siqa/636, siqa/637, siqa/646, siqa/656, siqa/665, siqa/684, siqa/699, siqa/702, siqa/712, siqa/713, siqa/715, siqa/728, siqa/729, siqa/750, siqa/751, siqa/765, siqa/77, siqa/777, siqa/778, siqa/785, siqa/791, siqa/799, siqa/806, siqa/81, siqa/812, siqa/813, siqa/817, siqa/818, siqa/830, siqa/836, siqa/837, siqa/838, siqa/84, siqa/846, siqa/852, siqa/853, siqa/86, siqa/868, siqa/869, siqa/87, siqa/870, siqa/879, siqa/881, siqa/882, siqa/884, siqa/89, siqa/898, siqa/90, siqa/908, siqa/915, siqa/920, siqa/922, siqa/927, siqa/935, siqa/942, siqa/958, siqa/962, siqa/971, siqa/977, siqa/981, siqa/982, siqa/994, siqa/997
example_link | model | min_elo |
---|---|---|
siqa/675 | dbrx-base | 1213.446 |
siqa/1909 | dbrx-base | 1213.446 |
siqa/1269 | dbrx-base | 1213.446 |
siqa/1291 | dbrx-base | 1213.446 |
siqa/1055 | dbrx-base | 1213.446 |
siqa/670 | dbrx-base | 1213.446 |
siqa/1944 | dbrx-base | 1213.446 |
siqa/1511 | dbrx-base | 1213.446 |
siqa/1435 | dbrx-base | 1213.446 |
siqa/1337 | dbrx-base | 1213.446 |
siqa/705 | dbrx-base | 1213.446 |
siqa/1869 | dbrx-base | 1213.446 |
siqa/1527 | dbrx-base | 1213.446 |
siqa/1085 | dbrx-base | 1213.446 |
siqa/1489 | dbrx-base | 1213.446 |
siqa/249 | dbrx-base | 1213.446 |
siqa/1565 | dbrx-base | 1213.446 |
siqa/1539 | dbrx-base | 1213.446 |
siqa/1733 | dbrx-base | 1213.446 |
siqa/85 | dbrx-base | 1213.446 |
siqa/1440 | dbrx-base | 1213.446 |
siqa/1729 | dbrx-base | 1213.446 |
siqa/492 | dbrx-base | 1213.446 |
siqa/293 | dbrx-base | 1213.446 |
siqa/940 | dbrx-base | 1213.446 |
siqa/302 | dbrx-base | 1213.446 |
siqa/1825 | dbrx-base | 1213.446 |
siqa/134 | dbrx-base | 1213.446 |
siqa/1741 | dbrx-base | 1213.446 |
siqa/1401 | dbrx-base | 1213.446 |
siqa/1567 | dbrx-base | 1213.446 |
siqa/904 | dbrx-base | 1213.446 |
siqa/1698 | dbrx-base | 1213.446 |
siqa/1807 | dbrx-base | 1213.446 |
siqa/501 | dbrx-base | 1213.446 |
siqa/1241 | dbrx-base | 1213.446 |
siqa/1843 | dbrx-base | 1213.446 |
siqa/1447 | dbrx-base | 1213.446 |
siqa/1002 | dbrx-base | 1213.446 |
siqa/711 | dbrx-base | 1213.446 |
siqa/691 | dbrx-base | 1213.446 |
siqa/1265 | dbrx-base | 1213.446 |
siqa/1012 | dbrx-base | 1213.446 |
siqa/1910 | dbrx-base | 1213.446 |
siqa/743 | dbrx-base | 1213.446 |
siqa/376 | dbrx-base | 1213.446 |
siqa/742 | dbrx-base | 1213.446 |
siqa/38 | dbrx-base | 1213.446 |
siqa/589 | dbrx-base | 1213.446 |
siqa/1655 | dbrx-base | 1213.446 |
siqa/658 | dbrx-base | 1213.446 |
siqa/726 | dbrx-base | 1213.446 |
siqa/1072 | dbrx-base | 1213.446 |
siqa/34 | dbrx-base | 1213.446 |
siqa/409 | dbrx-base | 1213.446 |
siqa/756 | dbrx-base | 1213.446 |
siqa/767 | Qwen1.5-110B | 1128.771 |
siqa/1120 | Qwen1.5-110B | 1128.771 |
siqa/1250 | Qwen1.5-110B | 1128.771 |
siqa/552 | Qwen1.5-110B | 1128.771 |
siqa/289 | Qwen1.5-110B | 1128.771 |
siqa/547 | Qwen1.5-110B | 1128.771 |
siqa/1936 | Qwen1.5-110B | 1128.771 |
siqa/725 | Qwen1.5-110B | 1128.771 |
siqa/1517 | Qwen1.5-110B | 1128.771 |
siqa/577 | Qwen1.5-72B | 1107.432 |
siqa/826 | Qwen1.5-72B | 1107.432 |
siqa/374 | Qwen1.5-72B | 1107.432 |
siqa/1408 | Qwen1.5-72B | 1107.432 |
siqa/965 | Qwen1.5-72B | 1107.432 |
siqa/1443 | Qwen1.5-72B | 1107.432 |
siqa/1619 | Qwen1.5-72B | 1107.432 |
siqa/1520 | Qwen1.5-32B | 1098.262 |
siqa/523 | Qwen1.5-32B | 1098.262 |
siqa/1710 | Qwen1.5-32B | 1098.262 |
siqa/1238 | Qwen1.5-32B | 1098.262 |
siqa/926 | Qwen1.5-32B | 1098.262 |
siqa/1919 | Qwen1.5-32B | 1098.262 |
siqa/147 | Qwen1.5-32B | 1098.262 |
siqa/1073 | Qwen1.5-32B | 1098.262 |
siqa/565 | Qwen1.5-32B | 1098.262 |
siqa/469 | Qwen1.5-32B | 1098.262 |
siqa/462 | Qwen1.5-32B | 1098.262 |
siqa/1146 | Qwen1.5-14B | 1082.195 |
siqa/285 | Qwen1.5-14B | 1082.195 |
siqa/1171 | Qwen1.5-14B | 1082.195 |
siqa/1325 | Qwen1.5-14B | 1082.195 |
siqa/875 | Qwen1.5-14B | 1082.195 |
siqa/1524 | Qwen1.5-14B | 1082.195 |
siqa/807 | Qwen1.5-14B | 1082.195 |
siqa/1078 | Qwen1.5-14B | 1082.195 |
siqa/1766 | Qwen1.5-14B | 1082.195 |
siqa/989 | Qwen1.5-14B | 1082.195 |
siqa/1535 | Qwen1.5-14B | 1082.195 |
siqa/1646 | Qwen1.5-14B | 1082.195 |
siqa/596 | Qwen1.5-14B | 1082.195 |
siqa/1736 | Qwen1.5-14B | 1082.195 |
siqa/505 | Qwen1.5-14B | 1082.195 |
siqa/82 | Qwen1.5-14B | 1082.195 |
siqa/789 | llama2_13B | 1065.922 |
siqa/1428 | llama2_13B | 1065.922 |
siqa/746 | llama2_13B | 1065.922 |
siqa/667 | Qwen1.5-7B | 1054.034 |
siqa/5 | Qwen1.5-7B | 1054.034 |
siqa/866 | Qwen1.5-7B | 1054.034 |
siqa/237 | Qwen1.5-7B | 1054.034 |
siqa/1343 | Meta-Llama-3-70B | 1053.440 |
siqa/1082 | llama2_70B | 1048.477 |
siqa/738 | gemma-7b | 1034.219 |
siqa/1804 | deepseek-llm-67b-base | 1018.791 |
siqa/1371 | llama_13B | 1016.981 |
siqa/1615 | llama2_07B | 997.440 |
siqa/166 | Qwen1.5-4B | 977.390 |
siqa/1549 | Qwen1.5-4B | 977.390 |
siqa/709 | Qwen1.5-4B | 977.390 |
siqa/1360 | Qwen1.5-4B | 977.390 |
siqa/613 | Qwen1.5-4B | 977.390 |
siqa/1674 | Qwen1.5-4B | 977.390 |
siqa/1256 | Qwen1.5-4B | 977.390 |
siqa/744 | Qwen1.5-1.8B | 951.498 |
siqa/629 | Qwen1.5-1.8B | 951.498 |
siqa/1205 | Qwen1.5-1.8B | 951.498 |
siqa/379 | gemma-2b | 948.135 |
siqa/959 | Qwen1.5-0.5B | 930.224 |
siqa/1180 | Qwen1.5-0.5B | 930.224 |
siqa/384 | Qwen1.5-0.5B | 930.224 |
siqa/1481 | Qwen1.5-0.5B | 930.224 |
siqa/1571 | Qwen1.5-0.5B | 930.224 |
siqa/1776 | pythia-2.8b-deduped | 907.898 |
siqa/954 | pythia-1b-deduped | 888.846 |
siqa/1285 | pythia-1b-deduped | 888.846 |
siqa/499 | pythia-1b-deduped | 888.846 |
siqa/39 | pythia-1.4b-deduped-v0 | 883.123 |
siqa/1651 | pythia-1.4b-deduped-v0 | 883.123 |
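
A table like the one above, together with the list of unsolved examples, could be derived from per-model results roughly as follows. This is only a minimal sketch: it assumes `min_elo` means the rating of the lowest-rated model that answers the example correctly, and the names `min_elo_to_solve`, `correct`, and `elo` are hypothetical, not taken from the report's own tooling.

```python
from typing import Dict, Optional, Tuple


def min_elo_to_solve(
    correct: Dict[str, Dict[str, bool]],
    elo: Dict[str, float],
) -> Dict[str, Optional[Tuple[str, float]]]:
    """Map each example to the (model, elo) of its lowest-rated solver.

    correct: model name -> {example id -> answered correctly?} (assumed layout)
    elo:     model name -> Elo rating (assumed layout)
    Examples that no model solves map to None.
    """
    # Collect every example id seen in any model's results.
    examples = set()
    for per_model in correct.values():
        examples.update(per_model)

    result: Dict[str, Optional[Tuple[str, float]]] = {}
    for ex in examples:
        # All models that solved this example, as (rating, name) pairs.
        solvers = [
            (elo[model], model)
            for model, outcomes in correct.items()
            if outcomes.get(ex, False)
        ]
        if solvers:
            rating, model = min(solvers)  # lowest rating wins
            result[ex] = (model, rating)
        else:
            result[ex] = None
    return result


# Toy data, invented purely for illustration.
toy_correct = {
    "model-a": {"ex/1": True, "ex/2": False, "ex/3": False},
    "model-b": {"ex/1": True, "ex/2": True, "ex/3": False},
}
toy_elo = {"model-a": 900.0, "model-b": 1200.0}

toy_result = min_elo_to_solve(toy_correct, toy_elo)
toy_unsolved = sorted(ex for ex, hit in toy_result.items() if hit is None)
```

In the toy data, `ex/3` ends up in `toy_unsolved` because neither model answers it, which corresponds to the "not solved by any model" list at the top of this section.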
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
example_link | acc | tau |
---|---|---|
siqa/15 | 0.556 | -0.535 |
siqa/1341 | 0.861 | -0.496 |
siqa/914 | 0.778 | -0.480 |
siqa/1409 | 0.722 | -0.475 |
siqa/1552 | 0.556 | -0.459 |
siqa/1000 | 0.833 | -0.446 |
siqa/944 | 0.417 | -0.443 |
siqa/1829 | 0.250 | -0.437 |
siqa/56 | 0.889 | -0.430 |
siqa/570 | 0.472 | -0.428 |
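
For intuition, a per-problem `tau` can be sketched as a Kendall rank correlation between each model's overall score and whether that model solved the problem. This is an assumption about how the column is computed: the report does not say which variant it uses, so the tie-corrected tau-b below is one plausible choice, and the function name and toy inputs are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b between two equal-length sequences (O(n^2) pairs)."""
    concordant = discordant = 0
    untied_x = untied_y = 0  # pairs that are not tied in x / in y
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx != 0:
            untied_x += 1
        if dy != 0:
            untied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    denom = math.sqrt(untied_x * untied_y)
    # If one sequence is constant the correlation is undefined; return 0.0 here.
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: four models ordered by overall score (invented values).
overall_score = [0.9, 0.8, 0.7, 0.6]
solved_by_weak_only = [0, 0, 1, 1]    # only the weaker models solve it
solved_by_strong_only = [1, 1, 0, 0]  # only the stronger models solve it

tau_negative = kendall_tau_b(overall_score, solved_by_weak_only)
tau_positive = kendall_tau_b(overall_score, solved_by_strong_only)
```

A negative value, as in `tau_negative`, is the pattern the table highlights: models that rank higher overall are the ones that fail the problem.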
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.