swebench-lite: by models



std predicted by accuracy

The typical standard deviation (stddev) between pairs of models on this dataset, shown as a function of absolute accuracy.
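As a rough illustration of why the standard deviation depends on accuracy, the sketch below computes the binomial standard error of a single model's pass rate. It assumes each example is an independent pass/fail draw and that the dataset has 300 examples (the size of SWE-bench Lite; the N column in the table on this page is not populated). The function name `binomial_std` is hypothetical, and this is not necessarily the exact computation behind the figure.

```python
import math


def binomial_std(accuracy: float, n_examples: int) -> float:
    """Standard error of an accuracy estimated from n_examples pass/fail examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


N_EXAMPLES = 300  # assumption: number of examples in the benchmark

for acc in (0.60, 0.30, 0.03):
    std_points = 100.0 * binomial_std(acc, N_EXAMPLES)
    print(f"accuracy={acc:.2f}  std={std_points:.2f} percentage points")
```

Under these assumptions the standard error peaks at 50% accuracy and shrinks toward 0% and 100%.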

Differences vs inconsistencies

This figure gives a more informative view of the source data used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given level. The plot shows a fairly sharp transition because no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
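One way to see where a parabola-shaped boundary can come from is a normal approximation. If each disagreement is a fair coin flip, then #A_win - #B_win has mean 0 and variance #A_win + #B_win, so a pair is significant roughly when |#A_win - #B_win| exceeds z times the square root of #A_win + #B_win. The sketch below is an approximation under that assumption, not necessarily the page's exact method, and the function name `min_significant_diff` is hypothetical.

```python
import math
from statistics import NormalDist


def min_significant_diff(n_disagree: int, alpha: float = 0.05) -> float:
    """Approximate smallest |A_win - B_win| that is significant (two-sided)
    when the two models disagree on n_disagree examples."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for alpha = 0.05
    return z * math.sqrt(n_disagree)


for n in (25, 100, 200):
    print(f"disagreements={n:3d}  threshold={min_significant_diff(n):.1f}")
```

Because the threshold grows with the square root of the number of disagreements, pairs that disagree on many examples need a larger win gap to be significant.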

p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever their results differ; ties are ignored. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one observed. For all pairs of models, the significance level depends mainly on the accuracy difference, as shown here. Hover over each model pair for detailed information.
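The null hypothesis described above corresponds to an exact sign test, which can be sketched as follows. This is a minimal illustration that assumes a two-sided test (the page does not say whether its p-values are one-sided or two-sided); the function name `sign_test_p_value` and the example counts are hypothetical.

```python
from math import comb


def sign_test_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided exact p-value under the null that each disagreement
    is won by A or B with probability 1/2. Ties are excluded beforehand."""
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # no disagreements, so no evidence either way
    k = max(a_wins, b_wins)
    # Upper tail P(X >= k) for X ~ Binomial(n, 1/2)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    # Double the tail for a two-sided test, capped at 1
    return min(1.0, 2.0 * tail)


# Hypothetical counts: A wins on 40 examples, B wins on 20, the rest are ties.
print(sign_test_p_value(40, 20))
```

Only the examples where the two models disagree enter the test, which is why the number of disagreements, and not only the accuracy gap, determines significance.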

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (used by benchmarks), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate always correlates well with Elo. GPT-3.5 is assigned an Elo of 1000 when available; otherwise the average Elo is set to 1000. std: the standard deviation due to drawing examples from a population; this is the dominant term. std_i: the standard deviation due to drawing samples from the model on each example. std_total: the total standard deviation, satisfying std_total^2 = std^2 + std_i^2.

model pass1 std(E(A)) E(std(A)) std(A) N win_rate elo
20250425_Refact_Agent 60 2.8 0 2.8 NaN 32.1 1.1e+03
20250526_sweagent_claude-4-sonnet-20250514 56.7 2.9 0 2.9 NaN 29.7 1.09e+03
20250114_Isoform 55 2.9 0 2.9 NaN 27.5 1.09e+03
20241220_blackboxai_agent_v1 49 2.9 0 2.9 NaN 22.8 1.06e+03
20241208_gru 48.7 2.9 0 2.9 NaN 21.8 1.06e+03
20241127_globant_codefixer_agent 48.3 2.9 0 2.9 NaN 22.9 1.06e+03
20250613_ExpeRepair-v1.0 48.3 2.9 0 2.9 NaN 22.3 1.06e+03
20250226_sweagent_claude-3-7-sonnet-20250219 48 2.9 0 2.9 NaN 21.5 1.06e+03
20241122_devlo 47.3 2.9 0 2.9 NaN 21.1 1.06e+03
20250205_dars_agent_claude_3.5_sonnet_deepseek_r1 47 2.9 0 2.9 NaN 21.4 1.06e+03
20241207_kodu_sonnet_v1 44.7 2.9 0 2.9 NaN 23.2 1.05e+03
20250310_codefuse-cgm 44 2.9 0 2.9 NaN 18.1 1.05e+03
20240702_codestory_aide_mixed 43 2.9 0 2.9 NaN 18.1 1.04e+03
20250509_Lingxi_claude-3-5-sonnet-20241022 42.7 2.9 0 2.9 NaN 18.3 1.04e+03
20250515_codartai 41.7 2.8 0 2.8 NaN 21 1.04e+03
20241025_OpenHands-CodeAct-2.1-sonnet-20241022 41.7 2.8 0 2.8 NaN 18.1 1.04e+03
20241220_PatchKitty-0.9_claude-3.5-sonnet-20241022 41.3 2.8 0 2.8 NaN 16.6 1.04e+03
20241030_composio_swekit 41 2.8 0 2.8 NaN 16.4 1.03e+03
20250113_OrcaLoca 41 2.8 0 2.8 NaN 16.7 1.03e+03
20241202_agentless-1.5_claude-3.5-sonnet-20241022 40.7 2.8 0 2.8 NaN 16.4 1.03e+03
20250113_OpenCSG-Starship-Agentic-Coder_gpt4o 39.7 2.8 0 2.8 NaN 16 1.03e+03
20240912_marscode-agent-dev 39.3 2.8 0 2.8 NaN 16.1 1.03e+03
20250114_moatless_claude-3.5-sonnet-20241022 39 2.8 0 2.8 NaN 16.3 1.03e+03
20240820_honeycomb 38.3 2.8 0 2.8 NaN 15.4 1.03e+03
20241117_moatless_claude-3.5-sonnet-20241022 38.3 2.8 0 2.8 NaN 14.4 1.03e+03
20240627_abanteai_mentatbot_gpt4o 38 2.8 0 2.8 NaN 16.3 1.02e+03
20250104_patched_codes_claude-3.5-sonnet-20241022 37 2.8 0 2.8 NaN 14.7 1.02e+03
20250609_KGCompass_deepseek-v3 36.7 2.8 0 2.8 NaN 16.5 1.02e+03
20241113_navie-2-gpt4o-sonnet 36 2.8 0 2.8 NaN 14 1.02e+03
20240811_gru 35.7 2.8 0 2.8 NaN 13.6 1.02e+03
20250104_codefuse-aais 35.7 2.8 0 2.8 NaN 13.4 1.02e+03
20240829_Isoform 35 2.8 0 2.8 NaN 12.8 1.01e+03
20240806_SuperCoder2.0 34 2.7 0 2.7 NaN 14.3 1.01e+03
20240723_marscode-agent-dev 34 2.7 0 2.7 NaN 13.1 1.01e+03
20240622_Lingma_Agent 33 2.7 0 2.7 NaN 11.8 1.01e+03
20250214_agentless_lite_o3_mini 32.3 2.7 0 2.7 NaN 13.4 1e+03
20241028_agentless-1.5_gpt4o 32 2.7 0 2.7 NaN 11.5 1e+03
20241111_codeshelltester_gpt4o 31.3 2.7 0 2.7 NaN 11.8 1e+03
20240617_factory_code_droid 31.3 2.7 0 2.7 NaN 12.1 1e+03
20240621_autocoderover-v20240620 30.7 2.7 0 2.7 NaN 10.7 998
20250207_aegis_o3mini 30.3 2.7 0 2.7 NaN 12.2 997
20241203_KortixAI-AgentPress-sonnet-20241022 30 2.6 0 2.6 NaN 11.5 996
20240908_infant_gpt4o 30 2.6 0 2.6 NaN 11 996
20240808_RepoGraph_gpt4o 29.7 2.6 0 2.6 NaN 10.4 995
20240721_amazon-q-developer-agent-20240719-dev 29.7 2.6 0 2.6 NaN 10.9 995
20240604_CodeR 28.3 2.6 0 2.6 NaN 9.85 990
20241117_reproducedRG_gpt4o 28 2.6 0 2.6 NaN 9.16 989
20240706_sima_gpt4o 27.7 2.6 0 2.6 NaN 9.13 987
20240612_MASAI_gpt4o 27.3 2.6 0 2.6 NaN 9.18 986
20240630_agentless_gpt4o 27.3 2.6 0 2.6 NaN 9.2 986
20240725_opendevin_codeact_v1.8_claude35sonnet 26.7 2.6 0 2.6 NaN 9.69 984
20240612_IBM_Research_Agent101 26.7 2.6 0 2.6 NaN 8.48 984
20240623_moatless_claude35sonnet 26.7 2.6 0 2.6 NaN 8.44 984
20240523_aider 26.3 2.5 0 2.5 NaN 8.94 983
20240925_hyperagent_lite1 25.3 2.5 0 2.5 NaN 8.69 979
20250306_SWE-Fixer_Qwen2.5-7b-retriever_Qwen2.5-72b-editor 24.7 2.5 0 2.5 NaN 7.97 977
20240617_moatless_gpt4o 24.7 2.5 0 2.5 NaN 7.82 977
20240524_opencsg_starship_gpt4 23.7 2.5 0 2.5 NaN 7.55 973
20241016_IBM-SWE-1.0 23.7 2.5 0 2.5 NaN 7.08 973
20241128_SWE-Fixer_Qwen2.5-7b-retriever_Qwen2.5-72b-editor_20241128 23.3 2.4 0 2.4 NaN 7.55 972
20240620_sweagent_claude3.5sonnet 23 2.4 0 2.4 NaN 8.18 971
20240828_autose_mixed 21.7 2.4 0 2.4 NaN 6.22 966
20240615_appmap-navie_gpt4o 21.7 2.4 0 2.4 NaN 7.04 966
20240509_amazon-q-developer-agent-20240430-dev 20.3 2.3 0 2.3 NaN 6.95 961
20240530_autocoderover-v20240408 19 2.3 0 2.3 NaN 5.86 957
20240728_sweagent_gpt4o 18.3 2.2 0 2.2 NaN 5.32 954
20240402_sweagent_gpt4 18 2.2 0 2.2 NaN 5.05 953
20240402_sweagent_claude3opus 11.7 1.9 0 1.9 NaN 2.89 930
20240402_rag_claude3opus 4.33 1.2 0 1.2 NaN 0.689 902
20231010_rag_claude2 3 0.98 0 0.98 NaN 0.541 897
20240402_rag_gpt4 2.67 0.93 0 0.93 NaN 0.554 896
20231010_rag_swellama7b 1.33 0.66 0 0.66 NaN 0.437 891
20231010_rag_swellama13b 1 0.57 0 0.57 NaN 0.297 890
20231010_rag_gpt35 0.333 0.33 0 0.33 NaN 0.027 887
20250111_moatless_deepseek_v3 0 0 0 0 NaN 0 886
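For readers unfamiliar with how Bradley-Terry coefficients map to an Elo-style score, the sketch below fits Bradley-Terry strengths with a simple iterative (minorization-maximization) update and rescales them so the average Elo is 1000. It is an illustration only: the page's actual fitting procedure is not shown here, the function name `bradley_terry_elo` and the toy win matrix are hypothetical, and the sketch assumes every model has at least one win and one loss.

```python
import math


def bradley_terry_elo(wins, iters=200, mean_elo=1000.0):
    """wins[i][j] = number of examples where model i passes and model j fails.
    Returns Elo-style scores whose average equals mean_elo."""
    m = len(wins)
    strength = [1.0] * m
    for _ in range(iters):
        updated = []
        for i in range(m):
            total_wins = sum(wins[i][j] for j in range(m) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(m)
                if j != i
            )
            updated.append(total_wins / denom)
        # Fix the overall scale so the geometric mean of strengths is 1
        log_mean = sum(math.log(s) for s in updated) / m
        strength = [s / math.exp(log_mean) for s in updated]
    # Elo convention: a 400-point gap corresponds to 10:1 odds
    return [mean_elo + 400.0 * math.log10(s) for s in strength]


# Toy win counts for three hypothetical models
toy_wins = [
    [0, 30, 45],
    [10, 0, 25],
    [5, 15, 0],
]
print([round(e) for e in bradley_terry_elo(toy_wins)])
```

Anchoring a specific model at 1000 instead of the average would only shift all scores by a constant, so the gaps between models are unchanged.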