There are 49 examples not solved by any model.
If these problems are well-posed, solving some of them is a good signal that your model is genuinely better than the current leading models.
astropy__astropy-13033, astropy__astropy-13398, astropy__astropy-13977, django__django-10554, django__django-10999, django__django-11087, django__django-11400, django__django-11820, django__django-12406, django__django-13195, django__django-13212, django__django-13344, django__django-13513, django__django-14011, django__django-14034, django__django-14155, django__django-14792, django__django-15252, django__django-15629, django__django-16256, django__django-16263, django__django-16502, django__django-16631, django__django-16667, matplotlib__matplotlib-21568, matplotlib__matplotlib-24870, matplotlib__matplotlib-25479, matplotlib__matplotlib-26466, pydata__xarray-6992, pydata__xarray-7229, pylint-dev__pylint-4551, pylint-dev__pylint-4604, pylint-dev__pylint-4661, pytest-dev__pytest-10356, sphinx-doc__sphinx-11510, sphinx-doc__sphinx-7462, sphinx-doc__sphinx-7590, sphinx-doc__sphinx-7748, sphinx-doc__sphinx-9229, sphinx-doc__sphinx-9461, sympy__sympy-13852, sympy__sympy-16597, sympy__sympy-17630, sympy__sympy-18199, sympy__sympy-20428, sympy__sympy-20438, sympy__sympy-21596, sympy__sympy-21930, sympy__sympy-22080
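A list like the one above can be derived by taking the full set of example IDs and removing everything that at least one model resolved. The following is a minimal sketch under that assumption; the function name, the model names, and the example IDs are hypothetical toy values, not the real evaluation data.

```python
def unsolved_examples(all_ids, resolved_by_model):
    """Return the sorted example IDs that no model resolved."""
    # Union of every model's resolved set (empty if there are no models).
    solved_by_any = set().union(*resolved_by_model.values())
    return sorted(set(all_ids) - solved_by_any)


# Hypothetical toy data: four examples, three models.
all_ids = ["proj__proj-1", "proj__proj-2", "proj__proj-3", "proj__proj-4"]
resolved_by_model = {
    "model_a": {"proj__proj-1", "proj__proj-3"},
    "model_b": {"proj__proj-3"},
    "model_c": set(),
}

unsolved = unsolved_examples(all_ids, resolved_by_model)
print(f"There are {len(unsolved)} examples not solved by any model.")
print(", ".join(unsolved))
```

Sorting the result keeps the output stable across runs, which makes it easier to compare the unsolved set between evaluation snapshots.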
example_link | model | min_elo |
---|---|---|
sphinx-doc__sphinx-9602 | 20250612_trae | 1509.157 |
django__django-15098 | 20250603_Refact_Agent_claude-4-sonnet | 1507.013 |
django__django-15957 | 20250522_tools_claude-4-opus | 1422.448 |
sympy__sympy-21612 | 20250522_tools_claude-4-opus | 1422.448 |
django__django-14725 | 20250522_tools_claude-4-opus | 1422.448 |
matplotlib__matplotlib-24177 | 20250522_tools_claude-4-sonnet | 1417.074 |
pytest-dev__pytest-5840 | 20250522_tools_claude-4-sonnet | 1417.074 |
django__django-15973 | 20250430_zencoder_ai | 1400.777 |
sphinx-doc__sphinx-10435 | 20250524_openhands_claude_4_sonnet | 1386.557 |
matplotlib__matplotlib-26208 | 20250524_openhands_claude_4_sonnet | 1386.557 |
django__django-11138 | 20250415_openhands | 1312.280 |
django__django-14170 | 20250206_agentscope | 1308.513 |
psf__requests-6028 | 20250503_patchpilot-v1.1-o4-mini | 1301.343 |
sphinx-doc__sphinx-10614 | 20250110_blackboxai_agent_v1.1 | 1263.185 |
django__django-14534 | 20241202_agentless-1.5_claude-3.5-sonnet-20241022 | 1086.794 |
matplotlib__matplotlib-23476 | 20241028_solver | 1062.014 |
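The `min_elo` column appears to be the Elo rating of the lowest-rated model that solved each example; rows that share a model also share a value, which is consistent with that reading. The sketch below assumes that definition. The ratings, model names, and example IDs are hypothetical.

```python
def min_elo_to_solve(elo_by_model, resolved_by_model):
    """Map each solved example to (lowest-rated solver, that solver's Elo)."""
    best = {}
    for model, solved in resolved_by_model.items():
        elo = elo_by_model[model]
        for example in solved:
            # Keep the solver with the smallest Elo seen so far.
            if example not in best or elo < best[example][1]:
                best[example] = (model, elo)
    return best


# Hypothetical toy data: three models with assumed Elo ratings.
elo_by_model = {"strong": 1500.0, "mid": 1300.0, "weak": 1100.0}
resolved_by_model = {
    "strong": {"ex-1", "ex-2", "ex-3"},
    "mid": {"ex-2", "ex-3"},
    "weak": {"ex-3"},
}

min_elo = min_elo_to_solve(elo_by_model, resolved_by_model)

# Print the hardest examples first (highest minimum Elo).
for example, (model, elo) in sorted(min_elo.items(), key=lambda kv: -kv[1][1]):
    print(f"{example} | {model} | {elo:.3f} |")
```

Examples that no model solved never enter the mapping, so they would need to be reported separately.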
These are the 10 problems whose results correlate least with the overall evaluation (a negative tau means that better models tend to do worse on that problem).
example_link | acc | tau |
---|---|---|
astropy__astropy-7606 | 0.075 | -0.197 |
astropy__astropy-8707 | 0.054 | -0.185 |
matplotlib__matplotlib-20488 | 0.032 | -0.164 |
django__django-11790 | 0.258 | -0.116 |
astropy__astropy-8872 | 0.054 | -0.110 |
sympy__sympy-18763 | 0.484 | -0.106 |
django__django-13794 | 0.043 | -0.071 |
sympy__sympy-17318 | 0.075 | -0.066 |
pylint-dev__pylint-4970 | 0.129 | -0.008 |
sympy__sympy-13877 | 0.215 | 0.002 |
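The page does not say how `acc` and `tau` are computed. One plausible reading is that `acc` is the fraction of models that solved the problem and `tau` is a Kendall rank correlation between each model's outcome on that problem and its overall score. The sketch below implements Kendall's tau-b in pure Python under those assumptions; the choice of tau-b, the use of overall resolve rate as the score, and the toy numbers are all assumptions.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (handles ties)."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    # Undefined when one variable is constant (every pair is tied).
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical toy data: four models, ordered from weakest to strongest.
overall_score = [0.2, 0.4, 0.6, 0.8]  # assumed overall resolve rate per model
solved_problem = [1, 1, 0, 0]         # 1 if that model solved this problem

acc = sum(solved_problem) / len(solved_problem)
tau = kendall_tau_b(overall_score, solved_problem)
print(f"acc={acc:.3f} tau={tau:.3f}")
```

In this toy case only the two weakest models solve the problem, so tau is negative, which matches the pattern the table is meant to surface.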
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.