swebench-verified: by examples


Not solved by any model

There are 49 examples that no model has solved. Provided these problems are well-posed, solving some of them is a good signal that your model is indeed better than the leading models.
astropy__astropy-13033, astropy__astropy-13398, astropy__astropy-13977, django__django-10554, django__django-10999, django__django-11087, django__django-11400, django__django-11820, django__django-12406, django__django-13195, django__django-13212, django__django-13344, django__django-13513, django__django-14011, django__django-14034, django__django-14155, django__django-14792, django__django-15252, django__django-15629, django__django-16256, django__django-16263, django__django-16502, django__django-16631, django__django-16667, matplotlib__matplotlib-21568, matplotlib__matplotlib-24870, matplotlib__matplotlib-25479, matplotlib__matplotlib-26466, pydata__xarray-6992, pydata__xarray-7229, pylint-dev__pylint-4551, pylint-dev__pylint-4604, pylint-dev__pylint-4661, pytest-dev__pytest-10356, sphinx-doc__sphinx-11510, sphinx-doc__sphinx-7462, sphinx-doc__sphinx-7590, sphinx-doc__sphinx-7748, sphinx-doc__sphinx-9229, sphinx-doc__sphinx-9461, sympy__sympy-13852, sympy__sympy-16597, sympy__sympy-17630, sympy__sympy-18199, sympy__sympy-20428, sympy__sympy-20438, sympy__sympy-21596, sympy__sympy-21930, sympy__sympy-22080
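A list like the one above can be derived from per-model results with a set difference. The following is a minimal sketch, not the page's actual code: the input layout (a mapping from model name to the set of instance IDs it resolved) and all names and IDs in it are hypothetical.

```python
# Hypothetical results: model name -> set of instance IDs that model resolved.
resolved_by_model = {
    "model_a": {"repo__proj-1", "repo__proj-2"},
    "model_b": {"repo__proj-2"},
}
# Hypothetical benchmark: every instance ID that was evaluated.
all_instances = {"repo__proj-1", "repo__proj-2", "repo__proj-3"}


def unsolved_instances(all_instances, resolved_by_model):
    """Return the sorted instance IDs that no model resolved."""
    # Union of everything any model resolved (empty set if there are no models).
    solved_by_any = set().union(*resolved_by_model.values())
    return sorted(all_instances - solved_by_any)


unsolved = unsolved_instances(all_instances, resolved_by_model)
print(f"{len(unsolved)} unsolved: {', '.join(unsolved)}")
```

Sorting the result keeps the output grouped by repository prefix, as in the list above.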

Problems solved by 1 model only

example_link model min_elo
sphinx-doc__sphinx-9602 20250612_trae 1509.157
django__django-15098 20250603_Refact_Agent_claude-4-sonnet 1507.013
django__django-15957 20250522_tools_claude-4-opus 1422.448
sympy__sympy-21612 20250522_tools_claude-4-opus 1422.448
django__django-14725 20250522_tools_claude-4-opus 1422.448
matplotlib__matplotlib-24177 20250522_tools_claude-4-sonnet 1417.074
pytest-dev__pytest-5840 20250522_tools_claude-4-sonnet 1417.074
django__django-15973 20250430_zencoder_ai 1400.777
sphinx-doc__sphinx-10435 20250524_openhands_claude_4_sonnet 1386.557
matplotlib__matplotlib-26208 20250524_openhands_claude_4_sonnet 1386.557
django__django-11138 20250415_openhands 1312.280
django__django-14170 20250206_agentscope 1308.513
psf__requests-6028 20250503_patchpilot-v1.1-o4-mini 1301.343
sphinx-doc__sphinx-10614 20250110_blackboxai_agent_v1.1 1263.185
django__django-14534 20241202_agentless-1.5_claude-3.5-sonnet-20241022 1086.794
matplotlib__matplotlib-23476 20241028_solver 1062.014
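The table above could be produced along the following lines. This is a sketch under assumptions: it takes min_elo to be the lowest Elo among the models that solved a problem (which, for a problem with a single solver, is that model's Elo), and the model names, ratings, and instance IDs are hypothetical.

```python
from collections import defaultdict

# Hypothetical inputs: Elo rating per model and the instances each model resolved.
model_elo = {"model_a": 1400.0, "model_b": 1100.0}
resolved_by_model = {
    "model_a": {"repo__proj-1", "repo__proj-2"},
    "model_b": {"repo__proj-2", "repo__proj-3"},
}


def single_solver_rows(resolved_by_model, model_elo):
    """Return (instance, model, min_elo) rows for instances solved by exactly one model."""
    solvers = defaultdict(list)
    for model, instances in resolved_by_model.items():
        for instance in instances:
            solvers[instance].append(model)
    rows = [
        (instance, models[0], model_elo[models[0]])
        for instance, models in solvers.items()
        if len(models) == 1
    ]
    # Highest Elo first, matching the order of the table above.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


rows = single_solver_rows(resolved_by_model, model_elo)
for instance, model, elo in rows:
    print(f"{instance} {model} {elo:.3f}")
```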

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation. A negative tau means that better models tend to do worse on that problem.

example_link acc tau
astropy__astropy-7606 0.075 -0.197
astropy__astropy-8707 0.054 -0.185
matplotlib__matplotlib-20488 0.032 -0.164
django__django-11790 0.258 -0.116
astropy__astropy-8872 0.054 -0.110
sympy__sympy-18763 0.484 -0.106
django__django-13794 0.043 -0.071
sympy__sympy-17318 0.075 -0.066
pylint-dev__pylint-4970 0.129 -0.008
sympy__sympy-13877 0.215 0.002
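The page does not say how tau is computed. One plausible reading is Kendall's rank correlation between each model's overall score and whether that model solved the problem. The sketch below implements the tie-corrected variant (tau-b) in plain Python on hypothetical data; treat the choice of statistic and all inputs as assumptions.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both factors
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    # If either variable is constant, tau is undefined; report 0.0 here.
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical data: overall score per model, and 1/0 for solved/unsolved on one problem.
overall_score = [0.3, 0.5, 0.6, 0.7]
solved = [1, 1, 0, 0]  # only the two weakest models solved it

tau = kendall_tau_b(overall_score, solved)
print(f"tau = {tau:.3f}")  # negative: weaker models did better on this problem
```

Because the per-problem outcome is binary, most pairs are tied in that variable, which is why a tie-corrected statistic is used in this sketch.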

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
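As a rough sketch of how such a histogram could be built, assume the accuracy of a problem is the fraction of evaluated models that solved it. The input layout, names, and bin count below are hypothetical.

```python
# Hypothetical inputs: instances evaluated and the set each model resolved.
all_instances = ["repo__proj-1", "repo__proj-2", "repo__proj-3"]
resolved_by_model = {
    "model_a": {"repo__proj-1", "repo__proj-2"},
    "model_b": {"repo__proj-2"},
}


def accuracy_per_instance(all_instances, resolved_by_model):
    """Fraction of models that resolved each instance (needs at least one model)."""
    n_models = len(resolved_by_model)
    return {
        instance: sum(instance in solved for solved in resolved_by_model.values()) / n_models
        for instance in all_instances
    }


def histogram(values, n_bins=10, lo=0.0, hi=1.0):
    """Count values in n_bins equal-width bins over [lo, hi]; hi lands in the last bin."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for value in values:
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


accuracy = accuracy_per_instance(all_instances, resolved_by_model)
counts = histogram(accuracy.values(), n_bins=4)
print(counts)
```

The same `histogram` helper would apply to per-problem minimum Elo values by passing a suitable `lo` and `hi`.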

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.