swebench-lite: by examples


Not solved by any model

There are 44 examples that no model has solved. Provided these are well-posed problems, solving some of them is a good signal that your model really is better than the leading models.
astropy__astropy-7746, django__django-11019, django__django-11564, django__django-11630, django__django-11905, django__django-14667, django__django-14730, django__django-14997, django__django-15695, django__django-16816, django__django-16820, matplotlib__matplotlib-22835, matplotlib__matplotlib-25433, pallets__flask-5063, pydata__xarray-4493, pylint-dev__pylint-7228, pytest-dev__pytest-5221, scikit-learn__scikit-learn-11040, scikit-learn__scikit-learn-25638, sphinx-doc__sphinx-7686, sphinx-doc__sphinx-7738, sphinx-doc__sphinx-8282, sympy__sympy-11400, sympy__sympy-11870, sympy__sympy-11897, sympy__sympy-12171, sympy__sympy-13146, sympy__sympy-13773, sympy__sympy-13895, sympy__sympy-14024, sympy__sympy-14308, sympy__sympy-14317, sympy__sympy-15308, sympy__sympy-16106, sympy__sympy-16281, sympy__sympy-17630, sympy__sympy-18087, sympy__sympy-18199, sympy__sympy-19254, sympy__sympy-20322, sympy__sympy-20639, sympy__sympy-21171, sympy__sympy-23191, sympy__sympy-24102
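
A list like the one above can be derived from per-model results with a set difference. The following is a minimal sketch, not the site's actual code: the function name `unsolved_by_all`, the `solved_by_model` mapping, and the toy instance IDs and model names are all assumptions made for illustration.

```python
def unsolved_by_all(all_ids, solved_by_model):
    """Return the instance IDs that no model resolved, sorted by name.

    all_ids: iterable of every instance ID in the benchmark.
    solved_by_model: hypothetical mapping {model_name: set of resolved IDs}.
    """
    # Union of everything that at least one model resolved.
    solved_anywhere = set().union(*solved_by_model.values())
    return sorted(set(all_ids) - solved_anywhere)


# Toy data (made up, not taken from the leaderboard).
all_ids = ["proj__a-1", "proj__a-2", "proj__b-1", "proj__b-2"]
solved_by_model = {
    "model_x": {"proj__a-1", "proj__b-1"},
    "model_y": {"proj__a-1"},
}
print(unsolved_by_all(all_ids, solved_by_model))
# prints ['proj__a-2', 'proj__b-2']
```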

Problems solved by 1 model only

example_link                      model                                           min_elo
django__django-16229              20250526_sweagent_claude-4-sonnet-20250514      1442.593
sympy__sympy-12236                20250526_sweagent_claude-4-sonnet-20250514      1442.593
sympy__sympy-13915                20250526_sweagent_claude-4-sonnet-20250514      1442.593
astropy__astropy-14182            20240702_codestory_aide_mixed                   1271.640
django__django-15738              20241207_kodu_sonnet_v1                         1248.416
django__django-15252              20241207_kodu_sonnet_v1                         1248.416
sympy__sympy-13437                20241207_kodu_sonnet_v1                         1248.416
pallets__flask-4045               20241207_kodu_sonnet_v1                         1248.416
pydata__xarray-4248               20241207_kodu_sonnet_v1                         1248.416
sympy__sympy-13043                20241207_kodu_sonnet_v1                         1248.416
sphinx-doc__sphinx-8273           20241207_kodu_sonnet_v1                         1248.416
django__django-13265              20241025_OpenHands-CodeAct-2.1-sonnet-20241022  1230.583
scikit-learn__scikit-learn-10949  20250515_codartai                               1212.476
sphinx-doc__sphinx-8474           20250515_codartai                               1212.476
django__django-13220              20250515_codartai                               1212.476
matplotlib__matplotlib-25079      20250515_codartai                               1212.476
scikit-learn__scikit-learn-10508  20250515_codartai                               1212.476
matplotlib__matplotlib-18869      20250515_codartai                               1212.476
django__django-15996              20250113_OpenCSG-Starship-Agentic-Coder_gpt4o   1211.143
matplotlib__matplotlib-22711      20240627_abanteai_mentatbot_gpt4o               1179.907
django__django-11742              20240627_abanteai_mentatbot_gpt4o               1179.907
pydata__xarray-3364               20250104_codefuse-aais                          1134.838
sympy__sympy-19007                20250207_Lingma_Agent_placeholder               1100.960
django__django-11910              20250207_aegis_o3mini                           1060.372
django__django-13768              20240523_aider                                  986.147
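
The table above can be read as follows: for each problem with exactly one solver, `min_elo` is that solver's Elo, since it is the lowest Elo among the models that solve the problem. This reading is an assumption based on each model carrying a single value in the table. A minimal sketch under that assumption, with a hypothetical function name and made-up data:

```python
def solved_by_one(solved_by_model, elo):
    """Return (instance_id, model, min_elo) rows for single-solver problems.

    solved_by_model: hypothetical mapping {model_name: set of resolved IDs}.
    elo: hypothetical mapping {model_name: Elo rating}.
    """
    # Invert the mapping: instance ID -> list of models that resolved it.
    solvers = {}
    for model, ids in solved_by_model.items():
        for instance_id in ids:
            solvers.setdefault(instance_id, []).append(model)

    rows = [
        (instance_id, models[0], elo[models[0]])
        for instance_id, models in solvers.items()
        if len(models) == 1
    ]
    # Highest Elo first, then by ID so the order is deterministic.
    return sorted(rows, key=lambda row: (-row[2], row[0]))


# Toy data (made up, not taken from the leaderboard).
toy_solved = {
    "model_x": {"proj__a-1", "proj__b-1"},
    "model_y": {"proj__a-1", "proj__a-2"},
}
toy_elo = {"model_x": 1200.0, "model_y": 1000.0}
for row in solved_by_one(toy_solved, toy_elo):
    print(*row)
# prints:
# proj__b-1 model_x 1200.0
# proj__a-2 model_y 1000.0
```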

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. problems on which better models do not tend to do better and, where tau is negative, tend to do worse.

example_link                  acc    tau
django__django-13768          0.013  -0.071
django__django-11910          0.013  -0.013
sympy__sympy-24909            0.120  0.010
pytest-dev__pytest-8365       0.040  0.013
sympy__sympy-19007            0.013  0.013
matplotlib__matplotlib-23299  0.027  0.027
pydata__xarray-3364           0.013  0.033
sympy__sympy-18835            0.040  0.045
pytest-dev__pytest-7220       0.093  0.049
django__django-11742          0.013  0.053
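
The page does not say how tau is computed. One plausible reading is a Kendall rank correlation between a problem's per-model outcome (solved or not) and each model's overall score. The sketch below implements the tau-b variant, which corrects for the many ties that binary outcomes produce. The choice of tau-b, the function name, and the toy numbers are all assumptions.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length numeric sequences."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: counted in neither term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + tied_x_only)
                 * (concordant + discordant + tied_y_only))
    # Undefined when one sequence is constant.
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data: four hypothetical models ordered from weakest to strongest.
overall_score = [0.10, 0.20, 0.30, 0.40]
only_weakest_solves = [1, 0, 0, 0]
only_strongest_solves = [0, 0, 0, 1]

print(round(kendall_tau_b(only_weakest_solves, overall_score), 3))    # -0.707
print(round(kendall_tau_b(only_strongest_solves, overall_score), 3))  # 0.707
```

A problem solved only by a weak model gets a negative value, which is the pattern the first rows of the table show.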

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
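
As a rough illustration of what such a histogram counts, the sketch below takes a problem's accuracy to be the fraction of models that solve it (an assumption, though it matches the `acc` column above) and buckets the values into equal-width bins. The function names, bin count, and toy data are hypothetical.

```python
def problem_accuracies(all_ids, solved_by_model):
    """Map each instance ID to the fraction of models that resolved it.

    Assumes solved_by_model is a non-empty {model_name: set of IDs} mapping.
    """
    n_models = len(solved_by_model)
    return {
        instance_id: sum(instance_id in ids for ids in solved_by_model.values())
        / n_models
        for instance_id in all_ids
    }


def histogram(values, n_bins=10):
    """Count values in [0, 1] per equal-width bin."""
    counts = [0] * n_bins
    for value in values:
        # Clamp so that value == 1.0 falls into the last bin.
        counts[min(int(value * n_bins), n_bins - 1)] += 1
    return counts


# Toy data (made up): four models, four problems.
toy_ids = ["p0", "p1", "p2", "p3"]
toy_results = {
    "m1": {"p1", "p2", "p3"},
    "m2": {"p1", "p2"},
    "m3": {"p1"},
    "m4": {"p1"},
}
acc = problem_accuracies(toy_ids, toy_results)
print(acc)                                # {'p0': 0.0, 'p1': 1.0, 'p2': 0.5, 'p3': 0.25}
print(histogram(acc.values(), n_bins=4))  # [1, 1, 1, 1]
```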

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.