siqa: by examples


Not solved by any model

284 examples were not solved by any model. If these are well-posed problems, solving some of them is a good signal that your model is genuinely better than the leading models.
siqa/1008, siqa/102, siqa/1020, siqa/1042, siqa/1047, siqa/1058, siqa/106, siqa/1062, siqa/1076, siqa/1077, siqa/1081, siqa/1083, siqa/1091, siqa/1097, siqa/11, siqa/1111, siqa/1114, siqa/1119, siqa/113, siqa/1131, siqa/1134, siqa/1139, siqa/1140, siqa/1141, siqa/1161, siqa/1164, siqa/1186, siqa/1198, siqa/1199, siqa/1201, siqa/1204, siqa/1211, siqa/1221, siqa/1231, siqa/1271, siqa/1273, siqa/1282, siqa/1287, siqa/1321, siqa/1334, siqa/1348, siqa/1355, siqa/1357, siqa/1365, siqa/1380, siqa/1381, siqa/1385, siqa/1386, siqa/1390, siqa/1391, siqa/1405, siqa/1406, siqa/1413, siqa/1426, siqa/1430, siqa/1431, siqa/1432, siqa/1437, siqa/1438, siqa/1441, siqa/1448, siqa/145, siqa/1454, siqa/146, siqa/1460, siqa/1461, siqa/1472, siqa/1488, siqa/1490, siqa/1495, siqa/1499, siqa/1509, siqa/1513, siqa/1518, siqa/153, siqa/1532, siqa/1564, siqa/1566, siqa/1580, siqa/1589, siqa/1593, siqa/1594, siqa/1595, siqa/1605, siqa/1612, siqa/162, siqa/1628, siqa/163, siqa/1632, siqa/1641, siqa/1649, siqa/1663, siqa/167, siqa/1676, siqa/1692, siqa/1694, siqa/1695, siqa/1701, siqa/1703, siqa/1705, siqa/1725, siqa/1726, siqa/1735, siqa/174, siqa/1740, siqa/1744, siqa/1749, siqa/1751, siqa/1761, siqa/1762, siqa/1767, siqa/1777, siqa/1778, siqa/1784, siqa/1786, siqa/1790, siqa/1792, siqa/1800, siqa/1801, siqa/1808, siqa/1810, siqa/1811, siqa/1815, siqa/1816, siqa/1820, siqa/1822, siqa/1824, siqa/1830, siqa/1837, siqa/1847, siqa/1852, siqa/1861, siqa/1879, siqa/1883, siqa/1886, siqa/1891, siqa/1894, siqa/1895, siqa/190, siqa/1905, siqa/1913, siqa/1921, siqa/1923, siqa/1933, siqa/1945, siqa/195, siqa/198, siqa/211, siqa/219, siqa/229, siqa/23, siqa/240, siqa/243, siqa/244, siqa/253, siqa/259, siqa/260, siqa/262, siqa/263, siqa/272, siqa/286, siqa/296, siqa/315, siqa/318, siqa/32, siqa/335, siqa/339, siqa/344, siqa/349, siqa/352, siqa/354, siqa/367, siqa/382, siqa/388, siqa/405, siqa/421, siqa/424, siqa/43, siqa/437, siqa/439, siqa/44, siqa/441, siqa/443, siqa/444, siqa/448, siqa/461, siqa/463, 
siqa/465, siqa/480, siqa/487, siqa/489, siqa/497, siqa/503, siqa/506, siqa/511, siqa/519, siqa/52, siqa/532, siqa/536, siqa/539, siqa/54, siqa/550, siqa/567, siqa/568, siqa/573, siqa/575, siqa/578, siqa/583, siqa/585, siqa/587, siqa/588, siqa/593, siqa/597, siqa/60, siqa/605, siqa/609, siqa/619, siqa/62, siqa/625, siqa/627, siqa/635, siqa/636, siqa/637, siqa/646, siqa/656, siqa/665, siqa/684, siqa/699, siqa/702, siqa/712, siqa/713, siqa/715, siqa/728, siqa/729, siqa/750, siqa/751, siqa/765, siqa/77, siqa/777, siqa/778, siqa/785, siqa/791, siqa/799, siqa/806, siqa/81, siqa/812, siqa/813, siqa/817, siqa/818, siqa/830, siqa/836, siqa/837, siqa/838, siqa/84, siqa/846, siqa/852, siqa/853, siqa/86, siqa/868, siqa/869, siqa/87, siqa/870, siqa/879, siqa/881, siqa/882, siqa/884, siqa/89, siqa/898, siqa/90, siqa/908, siqa/915, siqa/920, siqa/922, siqa/927, siqa/935, siqa/942, siqa/958, siqa/962, siqa/971, siqa/977, siqa/981, siqa/982, siqa/994, siqa/997
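
A list like the one above could be derived from per-model results roughly as follows. This is a minimal sketch, not the code behind this page: the `results` layout (model name mapped to a dictionary of example id to a solved flag), the model names, and the example ids are all hypothetical. Sorting the ids as strings happens to give the same kind of ordering seen in the list (siqa/1008 before siqa/102).

```python
# Hypothetical per-model results: model name -> {example id -> solved?}.
# The layout, model names, and ids are assumptions for illustration only.
results = {
    "model_a": {"siqa/1": True, "siqa/2": False, "siqa/3": False},
    "model_b": {"siqa/1": True, "siqa/2": False, "siqa/3": True},
}


def unsolved_by_all(results):
    """Return the sorted ids of examples that no model answered correctly."""
    example_ids = set()
    for per_example in results.values():
        example_ids.update(per_example)
    return sorted(
        ex
        for ex in example_ids
        # A missing entry is treated as "not solved" (an assumption).
        if not any(per_example.get(ex, False) for per_example in results.values())
    )


unsolved = unsolved_by_all(results)
print(len(unsolved), unsolved)
```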

Problems solved by 1 model only

example_link model min_elo
siqa/675 dbrx-base 1213.446
siqa/1909 dbrx-base 1213.446
siqa/1269 dbrx-base 1213.446
siqa/1291 dbrx-base 1213.446
siqa/1055 dbrx-base 1213.446
siqa/670 dbrx-base 1213.446
siqa/1944 dbrx-base 1213.446
siqa/1511 dbrx-base 1213.446
siqa/1435 dbrx-base 1213.446
siqa/1337 dbrx-base 1213.446
siqa/705 dbrx-base 1213.446
siqa/1869 dbrx-base 1213.446
siqa/1527 dbrx-base 1213.446
siqa/1085 dbrx-base 1213.446
siqa/1489 dbrx-base 1213.446
siqa/249 dbrx-base 1213.446
siqa/1565 dbrx-base 1213.446
siqa/1539 dbrx-base 1213.446
siqa/1733 dbrx-base 1213.446
siqa/85 dbrx-base 1213.446
siqa/1440 dbrx-base 1213.446
siqa/1729 dbrx-base 1213.446
siqa/492 dbrx-base 1213.446
siqa/293 dbrx-base 1213.446
siqa/940 dbrx-base 1213.446
siqa/302 dbrx-base 1213.446
siqa/1825 dbrx-base 1213.446
siqa/134 dbrx-base 1213.446
siqa/1741 dbrx-base 1213.446
siqa/1401 dbrx-base 1213.446
siqa/1567 dbrx-base 1213.446
siqa/904 dbrx-base 1213.446
siqa/1698 dbrx-base 1213.446
siqa/1807 dbrx-base 1213.446
siqa/501 dbrx-base 1213.446
siqa/1241 dbrx-base 1213.446
siqa/1843 dbrx-base 1213.446
siqa/1447 dbrx-base 1213.446
siqa/1002 dbrx-base 1213.446
siqa/711 dbrx-base 1213.446
siqa/691 dbrx-base 1213.446
siqa/1265 dbrx-base 1213.446
siqa/1012 dbrx-base 1213.446
siqa/1910 dbrx-base 1213.446
siqa/743 dbrx-base 1213.446
siqa/376 dbrx-base 1213.446
siqa/742 dbrx-base 1213.446
siqa/38 dbrx-base 1213.446
siqa/589 dbrx-base 1213.446
siqa/1655 dbrx-base 1213.446
siqa/658 dbrx-base 1213.446
siqa/726 dbrx-base 1213.446
siqa/1072 dbrx-base 1213.446
siqa/34 dbrx-base 1213.446
siqa/409 dbrx-base 1213.446
siqa/756 dbrx-base 1213.446
siqa/767 Qwen1.5-110B 1128.771
siqa/1120 Qwen1.5-110B 1128.771
siqa/1250 Qwen1.5-110B 1128.771
siqa/552 Qwen1.5-110B 1128.771
siqa/289 Qwen1.5-110B 1128.771
siqa/547 Qwen1.5-110B 1128.771
siqa/1936 Qwen1.5-110B 1128.771
siqa/725 Qwen1.5-110B 1128.771
siqa/1517 Qwen1.5-110B 1128.771
siqa/577 Qwen1.5-72B 1107.432
siqa/826 Qwen1.5-72B 1107.432
siqa/374 Qwen1.5-72B 1107.432
siqa/1408 Qwen1.5-72B 1107.432
siqa/965 Qwen1.5-72B 1107.432
siqa/1443 Qwen1.5-72B 1107.432
siqa/1619 Qwen1.5-72B 1107.432
siqa/1520 Qwen1.5-32B 1098.262
siqa/523 Qwen1.5-32B 1098.262
siqa/1710 Qwen1.5-32B 1098.262
siqa/1238 Qwen1.5-32B 1098.262
siqa/926 Qwen1.5-32B 1098.262
siqa/1919 Qwen1.5-32B 1098.262
siqa/147 Qwen1.5-32B 1098.262
siqa/1073 Qwen1.5-32B 1098.262
siqa/565 Qwen1.5-32B 1098.262
siqa/469 Qwen1.5-32B 1098.262
siqa/462 Qwen1.5-32B 1098.262
siqa/1146 Qwen1.5-14B 1082.195
siqa/285 Qwen1.5-14B 1082.195
siqa/1171 Qwen1.5-14B 1082.195
siqa/1325 Qwen1.5-14B 1082.195
siqa/875 Qwen1.5-14B 1082.195
siqa/1524 Qwen1.5-14B 1082.195
siqa/807 Qwen1.5-14B 1082.195
siqa/1078 Qwen1.5-14B 1082.195
siqa/1766 Qwen1.5-14B 1082.195
siqa/989 Qwen1.5-14B 1082.195
siqa/1535 Qwen1.5-14B 1082.195
siqa/1646 Qwen1.5-14B 1082.195
siqa/596 Qwen1.5-14B 1082.195
siqa/1736 Qwen1.5-14B 1082.195
siqa/505 Qwen1.5-14B 1082.195
siqa/82 Qwen1.5-14B 1082.195
siqa/789 llama2_13B 1065.922
siqa/1428 llama2_13B 1065.922
siqa/746 llama2_13B 1065.922
siqa/667 Qwen1.5-7B 1054.034
siqa/5 Qwen1.5-7B 1054.034
siqa/866 Qwen1.5-7B 1054.034
siqa/237 Qwen1.5-7B 1054.034
siqa/1343 Meta-Llama-3-70B 1053.440
siqa/1082 llama2_70B 1048.477
siqa/738 gemma-7b 1034.219
siqa/1804 deepseek-llm-67b-base 1018.791
siqa/1371 llama_13B 1016.981
siqa/1615 llama2_07B 997.440
siqa/166 Qwen1.5-4B 977.390
siqa/1549 Qwen1.5-4B 977.390
siqa/709 Qwen1.5-4B 977.390
siqa/1360 Qwen1.5-4B 977.390
siqa/613 Qwen1.5-4B 977.390
siqa/1674 Qwen1.5-4B 977.390
siqa/1256 Qwen1.5-4B 977.390
siqa/744 Qwen1.5-1.8B 951.498
siqa/629 Qwen1.5-1.8B 951.498
siqa/1205 Qwen1.5-1.8B 951.498
siqa/379 gemma-2b 948.135
siqa/959 Qwen1.5-0.5B 930.224
siqa/1180 Qwen1.5-0.5B 930.224
siqa/384 Qwen1.5-0.5B 930.224
siqa/1481 Qwen1.5-0.5B 930.224
siqa/1571 Qwen1.5-0.5B 930.224
siqa/1776 pythia-2.8b-deduped 907.898
siqa/954 pythia-1b-deduped 888.846
siqa/1285 pythia-1b-deduped 888.846
siqa/499 pythia-1b-deduped 888.846
siqa/39 pythia-1.4b-deduped-v0 883.123
siqa/1651 pythia-1.4b-deduped-v0 883.123
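
The table above can be read as: for each example with exactly one solver, report that solver and its Elo, strongest solver first. A minimal sketch under assumptions follows. The three Elo values are taken from the table, but the `results` layout and the example ids (`ex/a` and so on) are hypothetical, and this is not necessarily how the page was generated.

```python
# Elo ratings copied from the table above; everything else is hypothetical.
elo = {"dbrx-base": 1213.446, "Qwen1.5-110B": 1128.771, "gemma-2b": 948.135}

# Hypothetical results: model name -> {example id -> solved?}.
results = {
    "dbrx-base": {"ex/a": True, "ex/b": False, "ex/c": True},
    "Qwen1.5-110B": {"ex/a": False, "ex/b": True, "ex/c": True},
    "gemma-2b": {"ex/a": False, "ex/b": False, "ex/c": True},
}


def solved_by_one(results, elo):
    """Return (example, model, elo) rows for examples with a single solver."""
    example_ids = sorted({ex for per_example in results.values() for ex in per_example})
    rows = []
    for ex in example_ids:
        solvers = [m for m, per_example in results.items() if per_example.get(ex, False)]
        if len(solvers) == 1:
            # With one solver, the minimum Elo to solve is that solver's Elo.
            rows.append((ex, solvers[0], elo[solvers[0]]))
    # Highest Elo first, matching the ordering of the table.
    rows.sort(key=lambda row: -row[2])
    return rows


for example, model, min_elo in solved_by_one(results, elo):
    print(example, model, f"{min_elo:.3f}")
```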

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. better models tend to do worse on them.

example_link acc tau
siqa/15 0.556 -0.535
siqa/1341 0.861 -0.496
siqa/914 0.778 -0.480
siqa/1409 0.722 -0.475
siqa/1552 0.556 -0.459
siqa/1000 0.833 -0.446
siqa/944 0.417 -0.443
siqa/1829 0.250 -0.437
siqa/56 0.889 -0.430
siqa/570 0.472 -0.428
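
The page does not define the `acc` and `tau` columns. One plausible reading, assumed here, is that `acc` is the fraction of models that solved the example and `tau` is Kendall's tau between each model's correctness on the example and its overall score. The sketch below implements the tau-b variant in plain Python (the variant is also an assumption), using made-up scores for four hypothetical models.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
    total = n * (n - 1) // 2
    denom = math.sqrt((total - ties_x) * (total - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Made-up data: overall score per model, and whether each model solved
# one particular example (1 = solved, 0 = not solved).
overall_score = [0.2, 0.4, 0.6, 0.8]
solved = [1, 1, 0, 0]  # the two weaker models solve it, the stronger ones do not

acc = sum(solved) / len(solved)
tau = kendall_tau_b(overall_score, solved)
print(f"acc={acc:.3f} tau={tau:.3f}")
```

A strongly negative tau, as in this toy case, is what flags an example as suspect: it may be mislabeled or ambiguous rather than merely hard.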

Histogram of accuracies

Histogram of problems by the accuracy on each problem.

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.
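
As a rough illustration of the quantity behind this histogram, the sketch below takes, for each example, the lowest Elo among the models that solved it, then counts examples per Elo bin. The Elo values come from the tables on this page. The `results` layout, the example ids, and the bin width of 100 are assumptions. Examples that no model solved have no minimum Elo and are left out.

```python
from collections import Counter

# Elo ratings copied from this page; results and ids are hypothetical.
elo = {"dbrx-base": 1213.446, "Qwen1.5-110B": 1128.771, "gemma-2b": 948.135}
results = {
    "dbrx-base": {"ex/a": True, "ex/b": True, "ex/c": True, "ex/d": False},
    "Qwen1.5-110B": {"ex/a": False, "ex/b": False, "ex/c": True, "ex/d": False},
    "gemma-2b": {"ex/a": False, "ex/b": True, "ex/c": False, "ex/d": False},
}


def min_elo_to_solve(results, elo):
    """Map each solved example to the lowest Elo among the models that solved it."""
    best = {}
    for model, per_example in results.items():
        for ex, solved in per_example.items():
            if solved:
                best[ex] = min(best.get(ex, float("inf")), elo[model])
    return best


def histogram(values, bin_width):
    """Count values per bin, keyed by the lower edge of each bin."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


difficulty = min_elo_to_solve(results, elo)
bins = histogram(difficulty.values(), bin_width=100)
for lower_edge, count in bins.items():
    print(f"{lower_edge}-{lower_edge + 100}: {'#' * count}")
```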