There are 110 examples not solved by any model.
Solving some of these is a good signal that your model is better than the leading models, provided the problems themselves are sound.
astropy__astropy-13033, astropy__astropy-13398, astropy__astropy-13977, astropy__astropy-14182, astropy__astropy-14365, django__django-10554, django__django-10999, django__django-11087, django__django-11138, django__django-11400, django__django-11433, django__django-11477, django__django-11728, django__django-11734, django__django-11820, django__django-11885, django__django-12406, django__django-12965, django__django-13195, django__django-13212, django__django-13344, django__django-13449, django__django-13512, django__django-13513, django__django-14011, django__django-14034, django__django-14155, django__django-14170, django__django-14315, django__django-14404, django__django-14534, django__django-14725, django__django-14792, django__django-15098, django__django-15252, django__django-15280, django__django-15375, django__django-15503, django__django-15554, django__django-15563, django__django-15629, django__django-15957, django__django-15973, django__django-16256, django__django-16263, django__django-16502, django__django-16631, django__django-16667, django__django-16950, matplotlib__matplotlib-20676, matplotlib__matplotlib-21568, matplotlib__matplotlib-22871, matplotlib__matplotlib-24177, matplotlib__matplotlib-24870, matplotlib__matplotlib-25479, matplotlib__matplotlib-25960, matplotlib__matplotlib-26208, matplotlib__matplotlib-26466, mwaskom__seaborn-3187, psf__requests-6028, pydata__xarray-6938, pydata__xarray-6992, pydata__xarray-7229, pylint-dev__pylint-4551, pylint-dev__pylint-4604, pylint-dev__pylint-4661, pytest-dev__pytest-10356, pytest-dev__pytest-5787, pytest-dev__pytest-5840, pytest-dev__pytest-6197, pytest-dev__pytest-7324, scikit-learn__scikit-learn-12682, scikit-learn__scikit-learn-14629, scikit-learn__scikit-learn-26194, sphinx-doc__sphinx-10435, sphinx-doc__sphinx-10614, sphinx-doc__sphinx-11445, sphinx-doc__sphinx-11510, sphinx-doc__sphinx-7462, sphinx-doc__sphinx-7590, sphinx-doc__sphinx-7748, sphinx-doc__sphinx-8056, sphinx-doc__sphinx-8265, 
sphinx-doc__sphinx-8548, sphinx-doc__sphinx-8551, sphinx-doc__sphinx-8621, sphinx-doc__sphinx-8638, sphinx-doc__sphinx-9229, sphinx-doc__sphinx-9461, sphinx-doc__sphinx-9602, sphinx-doc__sphinx-9658, sympy__sympy-12489, sympy__sympy-13091, sympy__sympy-13551, sympy__sympy-13852, sympy__sympy-14248, sympy__sympy-15976, sympy__sympy-16597, sympy__sympy-17630, sympy__sympy-18199, sympy__sympy-18698, sympy__sympy-20428, sympy__sympy-20438, sympy__sympy-20916, sympy__sympy-21596, sympy__sympy-21612, sympy__sympy-21930, sympy__sympy-22080, sympy__sympy-23413, sympy__sympy-24562
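A list like the one above can be derived from per-model results by taking every example ID and removing anything that at least one model resolved. The sketch below is a minimal illustration, not the code used to produce this report; the function name `unsolved_examples` and the input layout (a mapping from model name to the set of example IDs it resolved) are assumptions.

```python
def unsolved_examples(all_ids, resolved_by_model):
    """Return the example IDs that no model resolved, sorted by ID.

    all_ids: iterable of every example ID in the benchmark.
    resolved_by_model: dict mapping a model name to the set of example IDs
        that model resolved (hypothetical layout, assumed for this sketch).
    """
    # Union of everything solved by at least one model.
    solved_by_any = set().union(*resolved_by_model.values())
    return sorted(set(all_ids) - solved_by_any)
```

Applied to the full result set, the length of the returned list would be the count quoted above.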
example_link | model | min_elo |
---|---|---|
sphinx-doc__sphinx-9258 | 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 | 1413.988 |
sphinx-doc__sphinx-8593 | 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 | 1413.988 |
sphinx-doc__sphinx-9230 | 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 | 1413.988 |
astropy__astropy-14598 | 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 | 1413.988 |
astropy__astropy-14369 | 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 | 1413.988 |
django__django-15916 | 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 | 1413.988 |
django__django-13406 | 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 | 1413.988 |
pydata__xarray-4094 | 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 | 1413.988 |
matplotlib__matplotlib-23476 | 20241028_solver | 1367.719 |
scikit-learn__scikit-learn-25747 | 20241025_composio_swekit | 1363.368 |
django__django-12273 | 20241025_composio_swekit | 1363.368 |
django__django-11299 | 20241025_composio_swekit | 1363.368 |
django__django-13112 | 20241022_tools_claude-3-5-sonnet-updated | 1350.121 |
pydata__xarray-4695 | 20241022_tools_claude-3-5-sonnet-updated | 1350.121 |
sphinx-doc__sphinx-7985 | 20241022_tools_claude-3-5-sonnet-updated | 1350.121 |
django__django-15695 | 20241023_emergent | 1327.793 |
django__django-16938 | 20241023_emergent | 1327.793 |
django__django-15732 | 20241023_emergent | 1327.793 |
mwaskom__seaborn-3069 | 20240924_solver | 1309.500 |
astropy__astropy-13236 | 20240824_gru | 1307.738 |
sphinx-doc__sphinx-10323 | 20240824_gru | 1307.738 |
django__django-12774 | 20240824_gru | 1307.738 |
matplotlib__matplotlib-14623 | 20240920_solver | 1275.450 |
sympy__sympy-13615 | 20240820_honeycomb | 1224.060 |
django__django-16560 | 20241022_tools_claude-3-5-haiku | 1223.089 |
django__django-15037 | 20241022_tools_claude-3-5-haiku | 1223.089 |
pylint-dev__pylint-8898 | 20241029_epam-ai-run-claude-3-5-sonnet | 1210.978 |
sympy__sympy-19040 | 20241028_agentless-1.5_gpt4o | 1202.577 |
matplotlib__matplotlib-23299 | 20240721_amazon-q-developer-agent-20240719-dev | 1199.734 |
sympy__sympy-13974 | 20240617_factory_code_droid | 1176.917 |
sympy__sympy-13798 | 20240402_sweagent_gpt4 | 932.067 |
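The `min_elo` column can be read as the Elo of the weakest model that still solved the example. A minimal sketch of that reading follows, assuming Elo ratings are already available per model (how they are computed is not described in this section); the function name `min_elo_table` and both input layouts are hypothetical.

```python
def min_elo_table(elo_by_model, resolved_by_model):
    """Build (example, model, min_elo) rows, hardest examples first.

    elo_by_model: dict mapping a model name to its Elo rating (assumed given).
    resolved_by_model: dict mapping a model name to the set of example IDs
        it resolved (hypothetical layout, assumed for this sketch).
    """
    weakest_solver = {}  # example ID -> (model, elo) of the lowest-rated solver
    for model, solved in resolved_by_model.items():
        elo = elo_by_model[model]
        for example in solved:
            current = weakest_solver.get(example)
            if current is None or elo < current[1]:
                weakest_solver[example] = (model, elo)

    rows = [(ex, model, elo) for ex, (model, elo) in weakest_solver.items()]
    # Highest minimum Elo first; break ties by example ID for a stable order.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows
```

Examples that no model solved never enter `weakest_solver`, so they do not appear in the output.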
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
example_link | acc | tau |
---|---|---|
sympy__sympy-13798 | 0.028 | -0.115 |
pydata__xarray-7393 | 0.111 | -0.081 |
django__django-13794 | 0.056 | -0.078 |
astropy__astropy-7606 | 0.167 | -0.074 |
matplotlib__matplotlib-20488 | 0.083 | -0.052 |
sympy__sympy-13031 | 0.056 | -0.039 |
django__django-13568 | 0.056 | -0.029 |
django__django-14631 | 0.056 | -0.010 |
sympy__sympy-13877 | 0.139 | -0.006 |
django__django-15161 | 0.111 | 0.000 |
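One way to obtain per-problem columns like `acc` and `tau` is to take, for each problem, the fraction of models that solved it and a rank correlation between the per-model solved indicator and each model's overall score. The sketch below assumes `tau` is Kendall's tau and uses the tau-b variant, which corrects for the many ties a 0/1 indicator produces; the source does not say which variant was used, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length numeric sequences."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    # Undefined when one sequence is constant; report 0.0 in that case.
    return (concordant - discordant) / denom if denom else 0.0


def problem_stats(solved_flags, overall_scores):
    """Return (acc, tau) for one problem.

    solved_flags: 1 if the model solved the problem, else 0, one per model.
    overall_scores: each model's overall benchmark score, in the same order.
    """
    acc = sum(solved_flags) / len(solved_flags)
    return acc, kendall_tau_b(solved_flags, overall_scores)
```

Under this reading, a negative `tau` means the problem is solved mostly by models with lower overall scores.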
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.