There are 88 examples not solved by any model.
Provided these are good problems, solving some of them is a good signal that your model is indeed better than the leading models.
astropy__astropy-7746, django__django-11019, django__django-11564, django__django-11630, django__django-11797, django__django-11905, django__django-11910, django__django-12589, django__django-12908, django__django-13220, django__django-13448, django__django-14667, django__django-14730, django__django-14997, django__django-15252, django__django-15320, django__django-15695, django__django-15738, django__django-15819, django__django-15996, django__django-16229, django__django-16408, django__django-16816, django__django-16820, django__django-16910, matplotlib__matplotlib-18869, matplotlib__matplotlib-22835, matplotlib__matplotlib-23476, matplotlib__matplotlib-24265, matplotlib__matplotlib-25079, matplotlib__matplotlib-25433, pallets__flask-4045, pallets__flask-5063, psf__requests-2148, pydata__xarray-3364, pydata__xarray-4248, pydata__xarray-4493, pylint-dev__pylint-7228, pytest-dev__pytest-5103, pytest-dev__pytest-5221, pytest-dev__pytest-5413, pytest-dev__pytest-6116, pytest-dev__pytest-8906, pytest-dev__pytest-9359, scikit-learn__scikit-learn-10508, scikit-learn__scikit-learn-10949, scikit-learn__scikit-learn-11040, scikit-learn__scikit-learn-25638, sphinx-doc__sphinx-10451, sphinx-doc__sphinx-11445, sphinx-doc__sphinx-7686, sphinx-doc__sphinx-7738, sphinx-doc__sphinx-8273, sphinx-doc__sphinx-8282, sphinx-doc__sphinx-8474, sympy__sympy-11400, sympy__sympy-11870, sympy__sympy-11897, sympy__sympy-12171, sympy__sympy-12236, sympy__sympy-12454, sympy__sympy-13043, sympy__sympy-13146, sympy__sympy-13177, sympy__sympy-13437, sympy__sympy-13773, sympy__sympy-13895, sympy__sympy-13915, sympy__sympy-14024, sympy__sympy-14308, sympy__sympy-14317, sympy__sympy-15308, sympy__sympy-16106, sympy__sympy-16281, sympy__sympy-16503, sympy__sympy-17630, sympy__sympy-18087, sympy__sympy-18199, sympy__sympy-18698, sympy__sympy-19254, sympy__sympy-20049, sympy__sympy-20322, sympy__sympy-20639, sympy__sympy-21171, sympy__sympy-21612, sympy__sympy-21627, sympy__sympy-23191, 
sympy__sympy-24102
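A list like the one above can be derived from a table of per-model pass/fail results. The following is a minimal sketch, assuming a hypothetical `results` mapping from model name to per-example outcomes; the model names, example ids, and data layout are illustrative and not taken from the report.

```python
# Hypothetical input: results[model][example_id] is True when the model solved it.
results = {
    "model_a": {"ex1": True, "ex2": False, "ex3": False},
    "model_b": {"ex1": True, "ex2": True, "ex3": False},
}


def unsolved_examples(results):
    """Return the sorted ids of examples that no model solved."""
    all_ids = set()
    solved = set()
    for per_example in results.values():
        for example_id, passed in per_example.items():
            all_ids.add(example_id)
            if passed:
                solved.add(example_id)
    return sorted(all_ids - solved)


print(unsolved_examples(results))  # ['ex3']
```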
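For each problem in the following table, model is the lowest-rated model that solves it and min_elo is that model's Elo rating.
example_link | model | min_elo |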
---|---|---|
astropy__astropy-14182 | 20240702_codestory_aide_mixed | 1384.954 |
django__django-12856 | 20240702_codestory_aide_mixed | 1384.954 |
django__django-12113 | 20240702_codestory_aide_mixed | 1384.954 |
django__django-12470 | 20241025_OpenHands-CodeAct-2.1-sonnet-20241022 | 1329.636 |
django__django-13033 | 20241025_OpenHands-CodeAct-2.1-sonnet-20241022 | 1329.636 |
django__django-16400 | 20241025_OpenHands-CodeAct-2.1-sonnet-20241022 | 1329.636 |
scikit-learn__scikit-learn-25747 | 20241025_OpenHands-CodeAct-2.1-sonnet-20241022 | 1329.636 |
sphinx-doc__sphinx-8801 | 20241025_OpenHands-CodeAct-2.1-sonnet-20241022 | 1329.636 |
matplotlib__matplotlib-25498 | 20241025_OpenHands-CodeAct-2.1-sonnet-20241022 | 1329.636 |
django__django-13265 | 20241025_OpenHands-CodeAct-2.1-sonnet-20241022 | 1329.636 |
sympy__sympy-15346 | 20240912_marscode-agent-dev | 1304.833 |
django__django-14155 | 20240627_abanteai_mentatbot_gpt4o | 1271.732 |
matplotlib__matplotlib-22711 | 20240627_abanteai_mentatbot_gpt4o | 1271.732 |
django__django-15388 | 20240627_abanteai_mentatbot_gpt4o | 1271.732 |
sympy__sympy-19487 | 20240627_abanteai_mentatbot_gpt4o | 1271.732 |
django__django-11742 | 20240627_abanteai_mentatbot_gpt4o | 1271.732 |
mwaskom__seaborn-2848 | 20240723_marscode-agent-dev | 1210.347 |
sympy__sympy-12419 | 20240723_marscode-agent-dev | 1210.347 |
sympy__sympy-19007 | 20240622_Lingma_Agent | 1206.959 |
django__django-13321 | 20240806_SuperCoder2.0 | 1206.002 |
pallets__flask-4992 | 20240806_SuperCoder2.0 | 1206.002 |
django__django-11283 | 20240806_SuperCoder2.0 | 1206.002 |
mwaskom__seaborn-3407 | 20240806_SuperCoder2.0 | 1206.002 |
django__django-15781 | 20240806_SuperCoder2.0 | 1206.002 |
matplotlib__matplotlib-23299 | 20240721_amazon-q-developer-agent-20240719-dev | 1140.952 |
django__django-15202 | 20240725_opendevin_codeact_v1.8_claude35sonnet | 1089.109 |
django__django-13768 | 20240523_aider | 1080.243 |
django__django-14534 | 20241016_IBM-SWE-1.0 | 1025.575 |
sympy__sympy-18835 | 20241016_IBM-SWE-1.0 | 1025.575 |
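The min_elo column can be reproduced with a short sketch, assuming it is the lowest Elo rating among the models that solve a problem. The Elo ratings are taken as given here (how they are computed is not covered), and the `elo` and `results` dictionaries are hypothetical stand-ins for the real data.

```python
# Hypothetical Elo ratings and pass/fail results; names are illustrative only.
elo = {"model_a": 1300.0, "model_b": 1100.0, "model_c": 1000.0}
results = {
    "model_a": {"ex1": True, "ex2": True},
    "model_b": {"ex1": True, "ex2": False},
    "model_c": {"ex1": False, "ex2": False},
}


def min_elo_to_solve(results, elo):
    """Map each solved example to (lowest-rated solver, that solver's Elo)."""
    best = {}
    for model, per_example in results.items():
        for example_id, passed in per_example.items():
            if not passed:
                continue
            if example_id not in best or elo[model] < best[example_id][1]:
                best[example_id] = (model, elo[model])
    return best


table = min_elo_to_solve(results, elo)
# Hardest problems first: highest min_elo at the top, as in the table above.
for example_id, (model, rating) in sorted(table.items(), key=lambda kv: -kv[1][1]):
    print(f"{example_id} | {model} | {rating:.3f} |")
```

Problems that no model solves never enter `best`, which is why they are reported separately.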
These are the 10 problems with the lowest correlation (tau) with the overall evaluation; a negative tau means that better models tend to do worse on that problem. The acc column is the accuracy on the problem.
example_link | acc | tau |
---|---|---|
django__django-14534 | 0.024 | -0.064 |
sympy__sympy-18835 | 0.024 | -0.064 |
django__django-15213 | 0.190 | -0.035 |
django__django-13768 | 0.024 | -0.027 |
django__django-15202 | 0.024 | -0.005 |
django__django-13757 | 0.071 | 0.000 |
django__django-13660 | 0.071 | 0.003 |
sphinx-doc__sphinx-7975 | 0.095 | 0.006 |
django__django-15061 | 0.048 | 0.054 |
sphinx-doc__sphinx-8435 | 0.071 | 0.054 |
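As a rough illustration of how acc and tau could be computed for a single problem, the sketch below assumes that tau is Kendall's tau-b between each model's overall score and its pass/fail outcome on the problem. The report does not state which correlation is used, so this choice is an assumption, and the input vectors are made up.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (handles ties)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both: contributes to neither term
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    denom = math.sqrt((pairs + only_x_tied) * (pairs + only_y_tied))
    return (concordant - discordant) / denom if denom else math.nan


# Hypothetical data: one entry per model, ordered the same way in both lists.
overall = [0.50, 0.40, 0.30, 0.20]  # assumed overall score of each model
solved = [0, 0, 1, 1]               # 1 if that model solved this problem

acc = sum(solved) / len(solved)
tau = kendall_tau_b(overall, solved)
print(f"acc={acc:.3f} tau={tau:.3f}")  # acc=0.500 tau=-0.816
```

In this toy case only the two weakest models solve the problem, so tau is strongly negative; the values in the table are much closer to zero.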
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.