swebench-verified: by models



std predicted by accuracy

The typical standard deviation between pairs of models on this dataset, as a function of absolute accuracy.

Differences vs inconsistencies

Here is a more informative view of the source information used to compute the p-values. Any model pair to the right of the parabola is statistically significantly different at the given level. The plot shows a fairly sharp transition because there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
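A minimal sketch of the boundary described above, assuming the parabola corresponds to the normal approximation of the sign test (the function name and alpha default are illustrative, not taken from the site's code):

```python
# Under the 50/50 null, #A_win - #B_win has standard deviation sqrt(#A_win + #B_win),
# so a pair is (approximately) significant when |#A_win - #B_win| > z_{alpha/2} * sqrt(n).
from scipy.stats import norm

def roughly_significant(a_wins: int, b_wins: int, alpha: float = 0.05) -> bool:
    """Normal-approximation check; examples where both models agree are ignored."""
    n = a_wins + b_wins          # number of examples where the two models disagree
    if n == 0:
        return False
    z = norm.ppf(1 - alpha / 2)  # two-sided critical value, ~1.96 for alpha = 0.05
    return abs(a_wins - b_wins) > z * (n ** 0.5)

# Example: 40 vs 20 wins on 60 disagreements -> |diff| = 20 > 1.96 * sqrt(60) ≈ 15.2
print(roughly_significant(40, 20))  # True
```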

p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever they disagree; ties are ignored. The p-value is the probability, under the null hypothesis, of a difference at least as extreme as the one observed. Across all pairs of models, the significance level depends mainly on the accuracy difference, as shown here. Hover over each model pair for detailed information.
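A hedged sketch of how such a p-value can be computed as an exact two-sided sign test (the helper name is illustrative; the site may compute it differently):

```python
# Under the null, each disagreement is a fair coin flip; ties (examples both
# models solve or both fail) are ignored, matching the description above.
from scipy.stats import binomtest

def sign_test_p_value(a_wins: int, b_wins: int) -> float:
    """Probability of a difference at least as extreme as observed under the 50/50 null."""
    n = a_wins + b_wins
    if n == 0:
        return 1.0
    return binomtest(a_wins, n=n, p=0.5, alternative="two-sided").pvalue

# Example: A wins 40 of the 60 disagreements, B wins 20 -> p on the order of 0.01
print(sign_test_p_value(40, 20))
```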

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (used by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate consistently correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000. std: the standard deviation due to drawing examples from a population; this is the dominant term. std_i: the standard deviation due to drawing samples from the model on each example. std_total: the total standard deviation, satisfying std_total^2 = std^2 + std_i^2. In the table these appear as std(E(A)), E(std(A)), and std(A), respectively.
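A hedged sketch of the std columns, assuming N = 500 examples (SWE-bench Verified) and treating per-example outcomes as independent Bernoulli draws; the site may estimate these quantities differently (e.g. via bootstrap):

```python
import math

N = 500  # examples in SWE-bench Verified (assumption about what the table uses)

def std_from_accuracy(pass1_percent: float, n: int = N) -> float:
    """Population-sampling std of the accuracy estimate, in percentage points."""
    p = pass1_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

def std_total(std: float, std_i: float) -> float:
    """Total std combining population sampling and per-example model sampling."""
    return math.sqrt(std ** 2 + std_i ** 2)

# Example: at 75.2% accuracy the population term is ~1.9 points; with a single
# run per example std_i = 0, so std_total equals std, as in the table.
s = std_from_accuracy(75.2)
print(round(s, 1), round(std_total(s, 0.0), 1))  # 1.9 1.9
```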

model pass1 std(E(A)) E(std(A)) std(A) N win_rate elo
20250612_trae 75.2 1.9 0 1.9 NaN 31.8 1.11e+03
20250603_Refact_Agent_claude-4-sonnet 74.4 2 0 2 NaN 31 1.11e+03
20250522_tools_claude-4-opus 73.2 2 0 2 NaN 31.1 1.1e+03
20250522_tools_claude-4-sonnet 72.4 2 0 2 NaN 30.1 1.1e+03
20250611_moatless_claude-4-sonnet-20250514 70.8 2 0 2 NaN 28 1.09e+03
20250519_trae 70.6 2 0 2 NaN 27.9 1.09e+03
20250524_openhands_claude_4_sonnet 70.4 2 0 2 NaN 28.5 1.09e+03
20250515_Refact_Agent 70.4 2 0 2 NaN 27.6 1.09e+03
20250610_augment_agent_v1 70.4 2 0 2 NaN 28.6 1.09e+03
20250519_devlo 70.2 2 0 2 NaN 27.6 1.09e+03
20250430_zencoder_ai 70 2 0 2 NaN 27.8 1.09e+03
20250516_cortexa_o3 68.2 2.1 0 2.1 NaN 26.6 1.08e+03
20250522_sweagent_claude-4-sonnet-20250514 66.6 2.1 0 2.1 NaN 25.9 1.08e+03
20250514_aime_coder 66.4 2.1 0 2.1 NaN 25.2 1.08e+03
20250415_openhands 65.8 2.1 0 2.1 NaN 24.8 1.07e+03
20250405_amazon-q-developer-agent-20250405-dev 65.4 2.1 0 2.1 NaN 24.2 1.07e+03
20250316_augment_agent_v0 65.4 2.1 0 2.1 NaN 24.1 1.07e+03
20250117_wandb_programmer_o1_crosscheck5 64.6 2.1 0 2.1 NaN 23.8 1.07e+03
20250503_patchpilot-v1.1-o4-mini 64.6 2.1 0 2.1 NaN 24 1.07e+03
20250206_agentscope 63.4 2.2 0 2.2 NaN 22.2 1.06e+03
20250224_tools_claude-3-7-sonnet 63.2 2.2 0 2.2 NaN 23 1.06e+03
20250110_blackboxai_agent_v1.1 62.8 2.2 0 2.2 NaN 23.3 1.06e+03
20250228_epam-ai-run-claude-3-5-sonnet 62.8 2.2 0 2.2 NaN 22.5 1.06e+03
20250225_sweagent_claude-3-7-sonnet 62.4 2.2 0 2.2 NaN 22.1 1.06e+03
20241221_codestory_midwit_claude-3-5-sonnet_swe-search 62.2 2.2 0 2.2 NaN 21.9 1.06e+03
20250203_openhands_4x_scaled 60.8 2.2 0 2.2 NaN 21 1.05e+03
20250110_learn_by_interact_claude3.5 60.2 2.2 0 2.2 NaN 23.5 1.05e+03
20250410_cortexa 58.2 2.2 0 2.2 NaN 19.5 1.05e+03
20241213_devlo 58.2 2.2 0 2.2 NaN 19.5 1.05e+03
20241223_emergent 57.2 2.2 0 2.2 NaN 18.4 1.04e+03
20241208_gru 57 2.2 0 2.2 NaN 18.6 1.04e+03
20241212_epam-ai-run-claude-3-5-sonnet 55.4 2.2 0 2.2 NaN 17.4 1.04e+03
20241202_amazon-q-developer-agent-20241202-dev 55 2.2 0 2.2 NaN 17.5 1.03e+03
20241108_devlo 54.2 2.2 0 2.2 NaN 17.1 1.03e+03
20250120_Bracket 53.2 2.2 0 2.2 NaN 18.2 1.03e+03
20241029_OpenHands-CodeAct-2.1-sonnet-20241022 53 2.2 0 2.2 NaN 16.8 1.03e+03
20241212_google_jules_gemini_2.0_flash_experimental 52.2 2.2 0 2.2 NaN 16.7 1.02e+03
20241125_enginelabs 51.8 2.2 0 2.2 NaN 16.7 1.02e+03
20250122_autocoderover-v2.1-claude-3-5-sonnet-20241022 51.6 2.2 0 2.2 NaN 16 1.02e+03
20241202_agentless-1.5_claude-3.5-sonnet-20241022 50.8 2.2 0 2.2 NaN 15.9 1.02e+03
20241028_solver 50 2.2 0 2.2 NaN 15 1.02e+03
20241125_marscode-agent-dev 50 2.2 0 2.2 NaN 15.4 1.02e+03
20241105_nfactorial 49.2 2.2 0 2.2 NaN 14.6 1.01e+03
20241022_tools_claude-3-5-sonnet-updated 49 2.2 0 2.2 NaN 14.7 1.01e+03
20241025_composio_swekit 48.6 2.2 0 2.2 NaN 14.1 1.01e+03
20241106_navie-2-gpt4o-sonnet 47.2 2.2 0 2.2 NaN 14.6 1.01e+03
20250520_openhands_devstral_small 46.8 2.2 0 2.2 NaN 13.9 1e+03
20241023_emergent 46.6 2.2 0 2.2 NaN 13.5 1e+03
20241108_autocoderover-v2.0-claude-3-5-sonnet-20241022 46.2 2.2 0 2.2 NaN 13.1 1e+03
20250528_patchpilot_Co-PatcheR 46 2.2 0 2.2 NaN 13.3 1e+03
20240924_solver 45.4 2.2 0 2.2 NaN 12.5 999
20240824_gru 45.2 2.2 0 2.2 NaN 12.8 998
20250118_codeshellagent_gemini_2.0_flash_experimental 44.2 2.2 0 2.2 NaN 12.7 995
20240920_solver 43.6 2.2 0 2.2 NaN 12 992
20250527_amazon.nova-premier-v1.0 42.4 2.2 0 2.2 NaN 12.8 988
20250214_agentless_lite_o3_mini 42.4 2.2 0 2.2 NaN 12.8 988
20250112_ugaiforge 41.6 2.2 0 2.2 NaN 10.9 985
20241030_nfactorial 41.6 2.2 0 2.2 NaN 11.8 985
20250226_swerl_llama3_70b 41.2 2.2 0 2.2 NaN 11.8 984
20241016_composio_swekit 40.6 2.2 0 2.2 NaN 10.6 982
20241022_tools_claude-3-5-haiku 40.6 2.2 0 2.2 NaN 10.9 982
20240820_honeycomb 40.6 2.2 0 2.2 NaN 11.4 982
20241113_nebius-search-open-weight-models-11-24 40.6 2.2 0 2.2 NaN 10.7 982
20250511_sweagent_lm_32b 40.2 2.2 0 2.2 NaN 10.5 980
20241029_epam-ai-run-claude-3-5-sonnet 39.6 2.2 0 2.2 NaN 10.6 978
20240721_amazon-q-developer-agent-20240719-dev 38.8 2.2 0 2.2 NaN 10.6 975
20241028_agentless-1.5_gpt4o 38.8 2.2 0 2.2 NaN 10.3 975
20240628_autocoderover-v20240620 38.4 2.2 0 2.2 NaN 10.6 974
20240617_factory_code_droid 37 2.2 0 2.2 NaN 10.1 968
20240620_sweagent_claude3.5sonnet 33.6 2.1 0 2.1 NaN 8.62 956
20250306_SWE-Fixer_Qwen2.5-7b-retriever_Qwen2.5-72b-editor 32.8 2.1 0 2.1 NaN 8.21 953
20240612_MASAI_gpt4o 32.6 2.1 0 2.1 NaN 8.26 952
20241120_artemis_agent 32 2.1 0 2.1 NaN 8.11 950
20241007_nfactorial 31.6 2.1 0 2.1 NaN 7.49 949
20241128_SWE-Fixer_Qwen2.5-7b-retriever_Qwen2.5-72b-editor_20241128 30.2 2.1 0 2.1 NaN 7.33 944
20241002_lingma-agent_lingma-swe-gpt-72b 28.8 2 0 2 NaN 6.88 938
20241016_epam-ai-run-gpt-4o 27 2 0 2 NaN 6.44 932
20240615_appmap-navie_gpt4o 26.2 2 0 2 NaN 6.05 929
20241001_nfactorial 25.8 2 0 2 NaN 6.02 927
20240509_amazon-q-developer-agent-20240430-dev 25.6 2 0 2 NaN 6.22 926
20240918_lingma-agent_lingma-swe-gpt-72b 25 1.9 0 1.9 NaN 5.2 924
20240820_epam-ai-run-gpt-4o 24 1.9 0 1.9 NaN 5.03 920
20240728_sweagent_gpt4o 23.2 1.9 0 1.9 NaN 5.02 917
20240402_sweagent_gpt4 22.4 1.9 0 1.9 NaN 4.67 914
20241002_lingma-agent_lingma-swe-gpt-7b 18.2 1.7 0 1.7 NaN 3.45 898
20240402_sweagent_claude3opus 15.8 1.6 0 1.6 NaN 2.81 888
20240918_lingma-agent_lingma-swe-gpt-7b 10.2 1.4 0 1.4 NaN 1.56 866
20240402_rag_claude3opus 7 1.1 0 1.1 NaN 1.1 852
20231010_rag_claude2 4.4 0.92 0 0.92 NaN 0.709 841
20240402_rag_gpt4 2.8 0.74 0 0.74 NaN 0.409 834
20231010_rag_swellama7b 1.4 0.53 0 0.53 NaN 0.461 828
20231010_rag_swellama13b 1.2 0.49 0 0.49 NaN 0.315 827
20231010_rag_gpt35 0.4 0.28 0 0.28 NaN 0.0717 823