swebench-verified: by examples


Not solved by any model

110 examples are not solved by any model. Provided these are sound problems, solving some of them is a good signal that your model is genuinely better than the leading models.
astropy__astropy-13033, astropy__astropy-13398, astropy__astropy-13977, astropy__astropy-14182, astropy__astropy-14365, django__django-10554, django__django-10999, django__django-11087, django__django-11138, django__django-11400, django__django-11433, django__django-11477, django__django-11728, django__django-11734, django__django-11820, django__django-11885, django__django-12406, django__django-12965, django__django-13195, django__django-13212, django__django-13344, django__django-13449, django__django-13512, django__django-13513, django__django-14011, django__django-14034, django__django-14155, django__django-14170, django__django-14315, django__django-14404, django__django-14534, django__django-14725, django__django-14792, django__django-15098, django__django-15252, django__django-15280, django__django-15375, django__django-15503, django__django-15554, django__django-15563, django__django-15629, django__django-15957, django__django-15973, django__django-16256, django__django-16263, django__django-16502, django__django-16631, django__django-16667, django__django-16950, matplotlib__matplotlib-20676, matplotlib__matplotlib-21568, matplotlib__matplotlib-22871, matplotlib__matplotlib-24177, matplotlib__matplotlib-24870, matplotlib__matplotlib-25479, matplotlib__matplotlib-25960, matplotlib__matplotlib-26208, matplotlib__matplotlib-26466, mwaskom__seaborn-3187, psf__requests-6028, pydata__xarray-6938, pydata__xarray-6992, pydata__xarray-7229, pylint-dev__pylint-4551, pylint-dev__pylint-4604, pylint-dev__pylint-4661, pytest-dev__pytest-10356, pytest-dev__pytest-5787, pytest-dev__pytest-5840, pytest-dev__pytest-6197, pytest-dev__pytest-7324, scikit-learn__scikit-learn-12682, scikit-learn__scikit-learn-14629, scikit-learn__scikit-learn-26194, sphinx-doc__sphinx-10435, sphinx-doc__sphinx-10614, sphinx-doc__sphinx-11445, sphinx-doc__sphinx-11510, sphinx-doc__sphinx-7462, sphinx-doc__sphinx-7590, sphinx-doc__sphinx-7748, sphinx-doc__sphinx-8056, sphinx-doc__sphinx-8265, 
sphinx-doc__sphinx-8548, sphinx-doc__sphinx-8551, sphinx-doc__sphinx-8621, sphinx-doc__sphinx-8638, sphinx-doc__sphinx-9229, sphinx-doc__sphinx-9461, sphinx-doc__sphinx-9602, sphinx-doc__sphinx-9658, sympy__sympy-12489, sympy__sympy-13091, sympy__sympy-13551, sympy__sympy-13852, sympy__sympy-14248, sympy__sympy-15976, sympy__sympy-16597, sympy__sympy-17630, sympy__sympy-18199, sympy__sympy-18698, sympy__sympy-20428, sympy__sympy-20438, sympy__sympy-20916, sympy__sympy-21596, sympy__sympy-21612, sympy__sympy-21930, sympy__sympy-22080, sympy__sympy-23413, sympy__sympy-24562
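
A list like the one above can be derived by subtracting the union of every model's resolved instances from the full benchmark. The sketch below is only an illustration: the layout of `resolved_by_model`, the model names, and the toy instance sets are assumptions, not the format this page was generated from.

```python
# Hypothetical input: model name -> set of instance IDs that model resolved.
resolved_by_model = {
    "model_a": {"django__django-11299", "sympy__sympy-13798"},
    "model_b": {"django__django-11299"},
}
# Hypothetical benchmark: every instance ID under evaluation.
all_instances = {
    "django__django-11299",
    "sympy__sympy-13798",
    "django__django-10554",
}


def unsolved_instances(all_instances, resolved_by_model):
    """Return the instance IDs that no model resolved, sorted for stable output."""
    solved_by_someone = set().union(*resolved_by_model.values())
    return sorted(all_instances - solved_by_someone)


unsolved = unsolved_instances(all_instances, resolved_by_model)
print(f"{len(unsolved)} examples not solved by any model: {', '.join(unsolved)}")
```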

Problems solved by 1 model only

example_link model min_elo
sphinx-doc__sphinx-9258 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 1413.988
sphinx-doc__sphinx-8593 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 1413.988
sphinx-doc__sphinx-9230 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 1413.988
astropy__astropy-14598 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 1413.988
astropy__astropy-14369 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 1413.988
django__django-15916 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 1413.988
django__django-13406 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 1413.988
pydata__xarray-4094 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 1413.988
matplotlib__matplotlib-23476 20241028_solver 1367.719
scikit-learn__scikit-learn-25747 20241025_composio_swekit 1363.368
django__django-12273 20241025_composio_swekit 1363.368
django__django-11299 20241025_composio_swekit 1363.368
django__django-13112 20241022_tools_claude-3-5-sonnet-updated 1350.121
pydata__xarray-4695 20241022_tools_claude-3-5-sonnet-updated 1350.121
sphinx-doc__sphinx-7985 20241022_tools_claude-3-5-sonnet-updated 1350.121
django__django-15695 20241023_emergent 1327.793
django__django-16938 20241023_emergent 1327.793
django__django-15732 20241023_emergent 1327.793
mwaskom__seaborn-3069 20240924_solver 1309.500
astropy__astropy-13236 20240824_gru 1307.738
sphinx-doc__sphinx-10323 20240824_gru 1307.738
django__django-12774 20240824_gru 1307.738
matplotlib__matplotlib-14623 20240920_solver 1275.450
sympy__sympy-13615 20240820_honeycomb 1224.060
django__django-16560 20241022_tools_claude-3-5-haiku 1223.089
django__django-15037 20241022_tools_claude-3-5-haiku 1223.089
pylint-dev__pylint-8898 20241029_epam-ai-run-claude-3-5-sonnet 1210.978
sympy__sympy-19040 20241028_agentless-1.5_gpt4o 1202.577
matplotlib__matplotlib-23299 20240721_amazon-q-developer-agent-20240719-dev 1199.734
sympy__sympy-13974 20240617_factory_code_droid 1176.917
sympy__sympy-13798 20240402_sweagent_gpt4 932.067
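
The table above can be reproduced by inverting the model-to-instances mapping and keeping the instances with exactly one solver. With a single solver, min_elo is simply that model's Elo. This is a sketch under assumptions: the input layout and the shared instance ID `example__shared-0001` are hypothetical, while the two model names, their Elo values, and the instances they alone solved are taken from the table.

```python
# Hypothetical input layout: model name -> set of resolved instance IDs.
resolved_by_model = {
    "20241029_OpenHands-CodeAct-2.1-sonnet-20241022": {
        "sphinx-doc__sphinx-9258",
        "example__shared-0001",  # hypothetical instance solved by both models
    },
    "20240402_sweagent_gpt4": {
        "sympy__sympy-13798",
        "example__shared-0001",
    },
}
# Elo per model, copied from the min_elo column above.
elo_by_model = {
    "20241029_OpenHands-CodeAct-2.1-sonnet-20241022": 1413.988,
    "20240402_sweagent_gpt4": 932.067,
}


def single_solver_rows(resolved_by_model, elo_by_model):
    """Return (instance, model, min_elo) for instances solved by exactly one model."""
    solvers = {}
    for model, instances in resolved_by_model.items():
        for instance in instances:
            solvers.setdefault(instance, []).append(model)
    rows = [
        (instance, models[0], elo_by_model[models[0]])
        for instance, models in solvers.items()
        if len(models) == 1
    ]
    # Highest Elo first, matching the ordering of the table; ties broken by ID.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


rows = single_solver_rows(resolved_by_model, elo_by_model)
for instance, model, min_elo in rows:
    print(f"{instance} {model} {min_elo:.3f}")
```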

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e., better models tend to do no better, or even worse, on these).

example_link acc tau
sympy__sympy-13798 0.028 -0.115
pydata__xarray-7393 0.111 -0.081
django__django-13794 0.056 -0.078
astropy__astropy-7606 0.167 -0.074
matplotlib__matplotlib-20488 0.083 -0.052
sympy__sympy-13031 0.056 -0.039
django__django-13568 0.056 -0.029
django__django-14631 0.056 -0.010
sympy__sympy-13877 0.139 -0.006
django__django-15161 0.111 0.000
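
One way to obtain a per-problem correlation like the tau column is to rank-correlate each model's overall score with whether it solved the problem. The page does not say which statistic it uses, so the sketch below assumes Kendall's tau-b (which handles the many ties in a 0/1 outcome). The scores and outcomes are toy values, not data from this report.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences, by pairwise comparison."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: ignored by tau-b
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Toy data (assumed): overall score of four models, and whether each solved one problem.
model_scores = [0.2, 0.3, 0.4, 0.5]
solved = [1, 0, 0, 0]  # only the weakest model solved it

acc = sum(solved) / len(solved)             # fraction of models solving the problem
tau = kendall_tau_b(model_scores, solved)   # negative: better models do worse here
print(f"acc={acc:.3f} tau={tau:.3f}")
```

A problem solved only by weak models gets a negative tau, which is the pattern the table flags as suspect.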

Histogram of accuracies

Histogram of problems by the accuracy on each problem.

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.
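
Both histograms amount to counting problems per fixed-width bin of a per-problem value (accuracy or minimum Elo). The sketch below shows one way to do the binning. The bin widths are assumptions, and the inputs are just a few values copied from the tables above, not the full data set.

```python
from collections import Counter
from math import floor


def histogram(values, bin_width):
    """Count values per bin; keys are bin indices, so the lower edge is index * bin_width."""
    return Counter(floor(value / bin_width) for value in values)


# A few min_elo values from the "solved by 1 model only" table; bin width 100 is assumed.
min_elos = [1413.988, 1367.719, 1363.368, 932.067]
elo_hist = histogram(min_elos, bin_width=100)

# A few acc values from the "suspect problems" table; bin width 0.05 is assumed.
accuracies = [0.028, 0.111, 0.056, 0.167]
acc_hist = histogram(accuracies, bin_width=0.05)

for index in sorted(elo_hist):
    print(f"[{index * 100}, {(index + 1) * 100}): {elo_hist[index]}")
```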