swebench-verified: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning on any task where they disagree; ties are ignored. The p-value is the probability, under this null hypothesis, of observing a difference at least as extreme as the one actually observed. Across all pairs of models this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
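As a concrete illustration, here is a minimal sketch of this sign test for one model pair. The input format (dicts of task id -> resolved flag) and the function name are assumptions for illustration, not the leaderboard's actual code.

```python
# Minimal sketch of the pairwise sign test described above (assumed setup).
from scipy.stats import binomtest

def pairwise_p_value(resolved_a, resolved_b):
    """Two-sided sign test on tasks where exactly one model succeeds."""
    a_wins = sum(resolved_a[t] and not resolved_b[t] for t in resolved_a)
    b_wins = sum(resolved_b[t] and not resolved_a[t] for t in resolved_a)
    n = a_wins + b_wins              # ties (both solve / both fail) are dropped
    if n == 0:
        return 1.0                   # the two models agree on every task
    return binomtest(a_wins, n, p=0.5, alternative="two-sided").pvalue
```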

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically distinguishable at the given significance level. The plot shows a fairly sharp transition because there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win-#B_win|. For more explanation, see the doc.
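To make the parabola concrete, the sketch below computes, for a given number of discordant tasks n = #A_win + #B_win, the smallest win gap |#A_win-#B_win| that a two-sided sign test rejects at level alpha. The required gap grows roughly like the square root of n, which traces out the parabola-shaped boundary in the figure. This is an illustrative helper, not the site's code.

```python
# Illustrative helper (assumed, not the site's code): smallest |#A_win - #B_win|
# that is significant at level alpha when #A_win + #B_win = n.
from scipy.stats import binomtest

def min_significant_gap(n, alpha=0.05):
    for gap in range(n % 2, n + 1, 2):   # gap must have the same parity as n
        k = (n + gap) // 2               # wins of the better model
        if binomtest(k, n, p=0.5).pvalue < alpha:
            return gap
    return None                          # never significant, even if one model wins all

for n in (10, 50, 100, 200):
    print(n, min_significant_gap(n))
```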

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (used by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate consistently correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000. A minimal sketch of how all three numbers can be computed follows the table.

model  pass@1  win_rate  Elo
20241029_OpenHands-CodeAct-2.1-sonnet-20241022 53.0% 87.5% 1414.0
20241028_solver 50.0% 85.0% 1367.7
20241022_tools_claude-3-5-sonnet-updated 49.0% 83.3% 1350.1
20241025_composio_swekit 48.6% 84.6% 1363.4
20241023_emergent 46.6% 81.9% 1327.8
20240924_solver 45.4% 80.9% 1309.5
20240824_gru 45.2% 80.1% 1307.7
20240920_solver 43.6% 77.5% 1275.5
20241016_composio_swekit 40.6% 72.9% 1227.8
20240820_honeycomb 40.6% 71.6% 1224.1
20241022_tools_claude-3-5-haiku 40.6% 72.2% 1223.1
20241029_epam-ai-run-claude-3-5-sonnet 39.6% 70.0% 1211.0
20240721_amazon-q-developer-agent-20240719-dev 38.8% 68.9% 1199.7
20241028_agentless-1.5_gpt4o 38.8% 68.9% 1202.6
20240628_autocoderover-v20240620 38.4% 68.2% 1196.9
20240617_factory_code_droid 37.0% 65.8% 1176.9
20240620_sweagent_claude3.5sonnet 33.6% 58.9% 1118.7
20241007_nfactorial 31.6% 55.1% 1084.1
20241002_lingma-agent_lingma-swe-gpt-72b 28.8% 49.2% 1046.2
20241016_epam-ai-run-gpt-4o 27.0% 45.4% 1016.5
20240615_appmap-navie_gpt4o 26.2% 43.5% 1002.8
20241001_nfactorial 25.8% 42.9% 1000.9
20240509_amazon-q-developer-agent-20240430-dev 25.6% 42.5% 1001.7
20240918_lingma-agent_lingma-swe-gpt-72b 25.0% 40.8% 974.1
20240820_epam-ai-run-gpt-4o 24.0% 38.6% 963.6
20240728_sweagent_gpt4o 23.2% 37.4% 951.2
20240402_sweagent_gpt4 22.4% 35.0% 932.1
20241002_lingma-agent_lingma-swe-gpt-7b 18.2% 26.3% 851.7
20240402_sweagent_claude3opus 15.8% 21.5% 792.3
20240918_lingma-agent_lingma-swe-gpt-7b 10.2% 11.2% 645.9
20240402_rag_claude3opus 7.0% 7.9% 595.9
20231010_rag_claude2 4.4% 4.8% 499.2
20240402_rag_gpt4 2.8% 2.7% 392.0
20231010_rag_swellama7b 1.4% 2.3% 381.7
20231010_rag_swellama13b 1.2% 1.7% 322.4
20231010_rag_gpt35 0.4% 0.4% 49.2
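The sketch below is my own illustrative version of the three columns above, not the leaderboard's code. It computes pass@1 as the raw fraction of resolved tasks, average win-rate over all other models with ties counted as half a win, and Bradley-Terry strengths fit with a simple MM iteration, rescaled to an Elo scale anchored at the `anchor` model (1000) or at the average (1000) if no anchor is given. The input format, tie handling in the Bradley-Terry fit, and anchor argument are assumptions.

```python
# Assumed input: resolved[model][task] -> bool (task resolved). Illustrative only.
import numpy as np

def leaderboard_metrics(resolved, anchor=None, anchor_elo=1000.0):
    models = sorted(resolved)
    tasks = sorted(next(iter(resolved.values())))
    m, n = len(models), len(tasks)
    solved = np.array([[resolved[a][t] for t in tasks] for a in models], dtype=float)

    # 1) raw accuracy (pass@1)
    pass1 = solved.mean(axis=1)

    # 2) pairwise win counts: wins[i, j] = tasks solved by i but not by j
    wins = solved @ (1.0 - solved.T)
    ties = n - wins - wins.T                      # both solved or both failed
    win_rate = np.array([
        np.mean([(wins[i, j] + 0.5 * ties[i, j]) / n for j in range(m) if j != i])
        for i in range(m)
    ])

    # 3) Bradley-Terry strengths via a simple MM iteration (ties dropped),
    #    then scaled to Elo with the usual 400 * log10 convention
    p = np.ones(m)
    for _ in range(500):
        for i in range(m):
            total = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(m) if j != i)
            p[i] = max(wins[i].sum(), 1e-9) / max(total, 1e-9)
        p /= np.exp(np.log(p).mean())             # fix the overall scale
    base = p[models.index(anchor)] if anchor in models else np.exp(np.log(p).mean())
    elo = anchor_elo + 400.0 * np.log10(p / base)
    return dict(zip(models, zip(pass1, win_rate, elo)))
```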