Not solved by any model

There are 110 examples not solved by any model. If these are sound problems, solving some of them is a good signal that your model is indeed better than the leading models.
astropy__astropy-13033, astropy__astropy-13398, astropy__astropy-13977, astropy__astropy-14182, astropy__astropy-14365, django__django-10554, django__django-10999, django__django-11087, django__django-11138, django__django-11400, django__django-11433, django__django-11477, django__django-11728, django__django-11734, django__django-11820, django__django-11885, django__django-12406, django__django-12965, django__django-13195, django__django-13212, django__django-13344, django__django-13449, django__django-13512, django__django-13513, django__django-14011, django__django-14034, django__django-14155, django__django-14170, django__django-14315, django__django-14404, django__django-14534, django__django-14725, django__django-14792, django__django-15098, django__django-15252, django__django-15280, django__django-15375, django__django-15503, django__django-15554, django__django-15563, django__django-15629, django__django-15957, django__django-15973, django__django-16256, django__django-16263, django__django-16502, django__django-16631, django__django-16667, django__django-16950, matplotlib__matplotlib-20676, matplotlib__matplotlib-21568, matplotlib__matplotlib-22871, matplotlib__matplotlib-24177, matplotlib__matplotlib-24870, matplotlib__matplotlib-25479, matplotlib__matplotlib-25960, matplotlib__matplotlib-26208, matplotlib__matplotlib-26466, mwaskom__seaborn-3187, psf__requests-6028, pydata__xarray-6938, pydata__xarray-6992, pydata__xarray-7229, pylint-dev__pylint-4551, pylint-dev__pylint-4604, pylint-dev__pylint-4661, pytest-dev__pytest-10356, pytest-dev__pytest-5787, pytest-dev__pytest-5840, pytest-dev__pytest-6197, pytest-dev__pytest-7324, scikit-learn__scikit-learn-12682, scikit-learn__scikit-learn-14629, scikit-learn__scikit-learn-26194, sphinx-doc__sphinx-10435, sphinx-doc__sphinx-10614, sphinx-doc__sphinx-11445, sphinx-doc__sphinx-11510, sphinx-doc__sphinx-7462, sphinx-doc__sphinx-7590, sphinx-doc__sphinx-7748, sphinx-doc__sphinx-8056, sphinx-doc__sphinx-8265, 
sphinx-doc__sphinx-8548, sphinx-doc__sphinx-8551, sphinx-doc__sphinx-8621, sphinx-doc__sphinx-8638, sphinx-doc__sphinx-9229, sphinx-doc__sphinx-9461, sphinx-doc__sphinx-9602, sphinx-doc__sphinx-9658, sympy__sympy-12489, sympy__sympy-13091, sympy__sympy-13551, sympy__sympy-13852, sympy__sympy-14248, sympy__sympy-15976, sympy__sympy-16597, sympy__sympy-17630, sympy__sympy-18199, sympy__sympy-18698, sympy__sympy-20428, sympy__sympy-20438, sympy__sympy-20916, sympy__sympy-21596, sympy__sympy-21612, sympy__sympy-21930, sympy__sympy-22080, sympy__sympy-23413, sympy__sympy-24562
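A minimal sketch of how a list like the one above could be derived, assuming a hypothetical mapping from each model name to the set of instance IDs it resolved. The names `all_ids`, `results`, and `unsolved_by_all` are illustrative and the data is a toy example, not the leaderboard's real code or results.

```python
def unsolved_by_all(all_ids, results):
    """Return the sorted instance IDs that no model resolved.

    all_ids: iterable of every instance ID in the benchmark (assumed input).
    results: dict mapping model name -> set of instance IDs it resolved
             (assumed input format).
    """
    # Union of everything solved by at least one model.
    solved_by_any = set().union(*results.values())
    return sorted(set(all_ids) - solved_by_any)


# Toy data (hypothetical), not real leaderboard results.
all_ids = ["repo__repo-1", "repo__repo-2", "repo__repo-3"]
results = {
    "model_x": {"repo__repo-1"},
    "model_y": {"repo__repo-1", "repo__repo-3"},
}
print(unsolved_by_all(all_ids, results))
```

Here only `repo__repo-2` is left over, because neither toy model resolved it.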

Problems solved by 1 model only

example_link	model	min_elo
sphinx-doc__sphinx-9258	20241029_OpenHands-CodeAct-2.1-sonnet-20241022	1413.988
sphinx-doc__sphinx-8593	20241029_OpenHands-CodeAct-2.1-sonnet-20241022	1413.988
sphinx-doc__sphinx-9230	20241029_OpenHands-CodeAct-2.1-sonnet-20241022	1413.988
astropy__astropy-14598	20241029_OpenHands-CodeAct-2.1-sonnet-20241022	1413.988
astropy__astropy-14369	20241029_OpenHands-CodeAct-2.1-sonnet-20241022	1413.988
django__django-15916	20241029_OpenHands-CodeAct-2.1-sonnet-20241022	1413.988
django__django-13406	20241029_OpenHands-CodeAct-2.1-sonnet-20241022	1413.988
pydata__xarray-4094	20241029_OpenHands-CodeAct-2.1-sonnet-20241022	1413.988
matplotlib__matplotlib-23476	20241028_solver	1367.719
scikit-learn__scikit-learn-25747	20241025_composio_swekit	1363.368
django__django-12273	20241025_composio_swekit	1363.368
django__django-11299	20241025_composio_swekit	1363.368
django__django-13112	20241022_tools_claude-3-5-sonnet-updated	1350.121
pydata__xarray-4695	20241022_tools_claude-3-5-sonnet-updated	1350.121
sphinx-doc__sphinx-7985	20241022_tools_claude-3-5-sonnet-updated	1350.121
django__django-15695	20241023_emergent	1327.793
django__django-16938	20241023_emergent	1327.793
django__django-15732	20241023_emergent	1327.793
mwaskom__seaborn-3069	20240924_solver	1309.500
astropy__astropy-13236	20240824_gru	1307.738
sphinx-doc__sphinx-10323	20240824_gru	1307.738
django__django-12774	20240824_gru	1307.738
matplotlib__matplotlib-14623	20240920_solver	1275.450
sympy__sympy-13615	20240820_honeycomb	1224.060
django__django-16560	20241022_tools_claude-3-5-haiku	1223.089
django__django-15037	20241022_tools_claude-3-5-haiku	1223.089
pylint-dev__pylint-8898	20241029_epam-ai-run-claude-3-5-sonnet	1210.978
sympy__sympy-19040	20241028_agentless-1.5_gpt4o	1202.577
matplotlib__matplotlib-23299	20240721_amazon-q-developer-agent-20240719-dev	1199.734
sympy__sympy-13974	20240617_factory_code_droid	1176.917
sympy__sympy-13798	20240402_sweagent_gpt4	932.067
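The table above can be sketched as follows, assuming `min_elo` means the Elo rating of the lowest-rated model that solved a problem, which for a single solver is simply that model's rating. The function name, the input format, and the toy ratings are assumptions made for illustration.

```python
def solved_by_one_model(results, elo):
    """List (instance_id, model, elo) for instances with exactly one solver.

    results: dict mapping model name -> set of resolved instance IDs (assumed).
    elo:     dict mapping model name -> Elo rating (assumed).
    """
    solvers = {}
    for model, solved_ids in results.items():
        for instance_id in solved_ids:
            solvers.setdefault(instance_id, []).append(model)

    rows = [
        (instance_id, models[0], elo[models[0]])
        for instance_id, models in solvers.items()
        if len(models) == 1
    ]
    # Highest Elo first, matching the ordering of the table above.
    return sorted(rows, key=lambda row: (-row[2], row[0]))


# Toy data (hypothetical), not real leaderboard results.
results = {
    "model_a": {"p1", "p2"},
    "model_b": {"p1", "p3"},
    "model_c": {"p1"},
}
elo = {"model_a": 1400.0, "model_b": 1200.0, "model_c": 900.0}
rows = solved_by_one_model(results, elo)
for instance_id, model, rating in rows:
    print(f"{instance_id}\t{model}\t{rating:.3f}")
```

Problem `p1` is dropped because all three toy models solved it; `p2` and `p3` each have a single solver.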

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation, i.e. better models tend to do worse on them.

example_link	acc	tau
sympy__sympy-13798	0.028	-0.115
pydata__xarray-7393	0.111	-0.081
django__django-13794	0.056	-0.078
astropy__astropy-7606	0.167	-0.074
matplotlib__matplotlib-20488	0.083	-0.052
sympy__sympy-13031	0.056	-0.039
django__django-13568	0.056	-0.029
django__django-14631	0.056	-0.010
sympy__sympy-13877	0.139	-0.006
django__django-15161	0.111	0.000
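The page does not say which correlation statistic `tau` is. As one plausible reading, the sketch below computes Kendall's tau-b between each model's overall score and whether it solved a given problem. The choice of tau-b, the function name, and the toy scores are all assumptions.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length numeric sequences (handles ties)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n = len(x)
    n_pairs = n * (n - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    # If either variable is constant, the correlation is undefined; report 0.
    return (concordant - discordant) / denom if denom else 0.0


# Toy data (hypothetical): overall score of four models, and whether each
# one solved a single problem. Only the weakest model solved it.
overall_score = [0.10, 0.20, 0.30, 0.40]
solved = [1, 0, 0, 0]
tau = kendall_tau_b(overall_score, solved)
print(round(tau, 3))
```

A negative value means that stronger models were less likely to solve the problem, which is the pattern this section flags as suspect.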

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
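A minimal sketch of the binning behind such a histogram, assuming per-problem accuracy is the fraction of models that solved the problem. The sample values are the `acc` column of the suspect-problems table above; the bin count of 20 and the function name are arbitrary choices for illustration.

```python
from collections import Counter


def bin_accuracies(accuracies, bins=20):
    """Count problems per equal-width accuracy bin over [0, 1]."""
    counts = Counter()
    for acc in accuracies:
        # Clamp so that an accuracy of exactly 1.0 lands in the last bin.
        index = min(int(acc * bins), bins - 1)
        counts[index] += 1
    return dict(sorted(counts.items()))


# acc values copied from the suspect-problems table.
accuracies = [0.028, 0.111, 0.056, 0.167, 0.083,
              0.056, 0.056, 0.056, 0.139, 0.111]
counts = bin_accuracies(accuracies)
for index, count in counts.items():
    low, high = index / 20, (index + 1) / 20
    print(f"[{low:.2f}, {high:.2f}): {'#' * count}")
```

The same function could be reused for the difficulty histogram after rescaling minimum Elo values into a fixed range.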

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.

swebench-verified: by examples
