swebench-lite: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance to win on any task where their outcomes differ; ties are ignored. The p-value is the probability, under this null hypothesis, of a difference at least as extreme as the one observed. Across all model pairs it depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
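Concretely, this is a two-sided sign test over the tasks where the two models disagree. Below is a minimal sketch, assuming ties have already been dropped so that the win count is Binomial(n, 1/2) under the null; `a_wins` and `b_wins` are illustrative names, not the site's actual code.

```python
from math import comb

def sign_test_pvalue(a_wins: int, b_wins: int) -> float:
    """Two-sided sign-test p-value for a model pair.

    Ties are ignored, so under the null each of the
    n = a_wins + b_wins discordant tasks is a fair coin flip and the
    win count is Binomial(n, 1/2).  The p-value is the probability of
    a split at least as lopsided as the observed one.
    """
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # the models never disagree
    m = max(a_wins, b_wins)
    # One tail: P(X >= m) under Binomial(n, 1/2).  By symmetry the
    # other tail has the same mass, so double it (capped at 1).
    tail = sum(comb(n, k) for k in range(m, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Example: A wins 40 discordant tasks, B wins 20 -> p ≈ 0.014.
print(sign_test_pvalue(40, 20))
```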

p-values vs. differences

The range of possible p-values as a function of the difference in accuracy, across all model pairs.
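The p-value is not a function of the accuracy gap alone: the same gap can arise from few or many discordant tasks. The sketch below enumerates the feasible splits for a fixed win-count difference, assuming the 300 tasks of SWE-bench Lite and repeating the sign-test helper so the snippet is self-contained.

```python
from math import comb

def sign_test_pvalue(a_wins: int, b_wins: int) -> float:
    n, m = a_wins + b_wins, max(a_wins, b_wins)
    if n == 0:
        return 1.0
    return min(1.0, 2 * sum(comb(n, k) for k in range(m, n + 1)) / 2 ** n)

def pvalue_range(diff: int, n_tasks: int = 300):
    """Min and max p-value over all splits (b + diff, b) that fit in
    n_tasks tasks, i.e. all ways to realize the same win-count gap."""
    ps = [sign_test_pvalue(b + diff, b)
          for b in range((n_tasks - diff) // 2 + 1)]
    return min(ps), max(ps)

# A 5-point accuracy gap (15 of 300 tasks) is compatible with both
# clearly significant and clearly insignificant p-values:
print(pvalue_range(15))
```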

Differences vs. inconsistencies

Here is a more informative figure showing the source data used to compute the p-values. Any model pair to the right of the parabola is statistically distinguishable at the given level. The plot shows a fairly sharp transition: no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
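The parabola follows from the normal approximation to the sign test: a pair is significant at level alpha roughly when |#A_win - #B_win| exceeds z * sqrt(#A_win + #B_win), so the squared difference grows linearly in the number of inconsistencies. A sketch of that boundary, assuming alpha = 0.05:

```python
from statistics import NormalDist

def min_significant_diff(n_discordant: int, alpha: float = 0.05) -> float:
    """Smallest |#A_win - #B_win| significant at level alpha.

    Under Binomial(n, 1/2) the difference 2X - n has variance n, so
    the normal-approximation boundary is |diff| > z * sqrt(n).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96 for alpha = 0.05
    return z * n_discordant ** 0.5

for n in (25, 50, 100):
    print(n, round(min_significant_diff(n), 1))  # 9.8, 13.9, 19.6
```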

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (used by benchmarks), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when it is available; otherwise the scores are shifted so that the average is 1000.
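For reference, here is a minimal sketch of how the two pairwise scores could be computed from a win matrix. The MM fixed-point update and the 400·log10 mapping are the standard Bradley-Terry/Elo conventions; the tie handling, anchoring details, and all names here are assumptions, not the leaderboard's actual code.

```python
import numpy as np

def avg_win_rate(wins):
    """Mean over opponents of wins / (wins + losses); pairs with no
    discordant tasks count as 0.5 (an assumed tie convention)."""
    wins = np.asarray(wins, dtype=float)
    games = wins + wins.T
    frac = np.where(games > 0, wins / np.where(games > 0, games, 1), 0.5)
    np.fill_diagonal(frac, np.nan)
    return np.nanmean(frac, axis=1)

def bradley_terry_elo(wins, anchor_idx=None, iters=500):
    """Fit Bradley-Terry strengths by the MM algorithm and map them to
    the Elo scale: P(i beats j) = 1 / (1 + 10**((elo_j - elo_i)/400)).
    Anchors model anchor_idx at 1000, else shifts the mean to 1000."""
    wins = np.asarray(wins, dtype=float)
    games = wins + wins.T
    p = np.ones(len(wins))
    for _ in range(iters):
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)          # no self-play
        p = wins.sum(axis=1) / denom.sum(axis=1)
        p = np.maximum(p, 1e-12)              # keep log finite for all-losers
        p /= p.mean()                         # fix the arbitrary scale
    elo = 400.0 * np.log10(p)
    shift = 1000.0 - (elo[anchor_idx] if anchor_idx is not None else elo.mean())
    return elo + shift

# Toy example: wins[i][j] = tasks model i solved but model j did not.
wins = [[0, 30, 40],
        [20, 0, 35],
        [10, 15, 0]]
print(avg_win_rate(wins))
print(bradley_terry_elo(wins, anchor_idx=2))  # anchor model 2 at 1000
```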

model  pass@1  win_rate  Elo
20240702_codestory_aide_mixed 43.0% 86.4% 1385.0
20241025_OpenHands-CodeAct-2.1-sonnet-20241022 41.7% 81.4% 1329.6
20240912_marscode-agent-dev 39.3% 79.8% 1304.8
20240820_honeycomb 38.3% 79.4% 1298.5
20240627_abanteai_mentatbot_gpt4o 38.0% 76.7% 1271.7
20240811_gru 35.7% 75.1% 1252.3
20240829_Isoform 35.0% 75.0% 1248.4
20240723_marscode-agent-dev 34.0% 70.4% 1210.3
20240806_SuperCoder2.0 34.0% 69.2% 1206.0
20240622_Lingma_Agent 33.0% 70.5% 1207.0
20241028_agentless-1.5_gpt4o 32.0% 67.5% 1181.4
20240617_factory_code_droid 31.3% 64.6% 1162.7
20240621_autocoderover-v20240620 30.7% 65.1% 1158.3
20240908_infant_gpt4o 30.0% 62.7% 1148.3
20240721_amazon-q-developer-agent-20240719-dev 29.7% 61.2% 1141.0
20240808_RepoGraph_gpt4o 29.7% 62.3% 1141.0
20240604_CodeR 28.3% 58.7% 1117.9
20240612_MASAI_gpt4o 28.0% 58.6% 1109.9
20240706_sima_gpt4o 27.7% 57.8% 1106.0
20240630_agentless_gpt4o 27.3% 56.7% 1097.6
20240612_IBM_Research_Agent101 26.7% 55.2% 1087.4
20240623_moatless_claude35sonnet 26.7% 55.2% 1087.5
20240725_opendevin_codeact_v1.8_claude35sonnet 26.7% 54.6% 1089.1
20240523_aider 26.3% 54.1% 1080.2
20240925_hyperagent_lite1 25.3% 51.6% 1063.9
20240617_moatless_gpt4o 24.7% 50.0% 1048.1
20240524_opencsg_starship_gpt4 23.7% 47.4% 1028.5
20241016_IBM-SWE-1.0 23.7% 47.3% 1025.6
20240620_sweagent_claude3.5sonnet 23.0% 46.1% 1028.5
20240828_autose_mixed 21.7% 41.7% 985.2
20240615_appmap-navie_gpt4o 21.7% 42.6% 994.8
20240509_amazon-q-developer-agent-20240430-dev 20.3% 40.0% 983.4
20240530_autocoderover-v20240408 19.0% 35.7% 942.6
20240728_sweagent_gpt4o 18.3% 33.3% 915.5
20240402_sweagent_gpt4 18.0% 32.4% 912.4
20240402_sweagent_claude3opus 11.7% 18.0% 765.6
20240402_rag_claude3opus 4.3% 4.3% 475.0
20231010_rag_claude2 3.0% 3.1% 417.7
20240402_rag_gpt4 2.7% 3.2% 428.0
20231010_rag_swellama7b 1.3% 2.1% 355.3
20231010_rag_swellama13b 1.0% 1.6% 308.5
20231010_rag_gpt35 0.3% 0.2% -100.4