swebench-test: by models



std predicted by accuracy

The typical standard deviation of the accuracy difference between pairs of models on this dataset, as a function of absolute accuracy.
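Under a simple binomial model (each example an independent pass/fail draw), the std of a measured accuracy can be sketched as follows. The 2294-instance test-set size used in the comment is an assumption, not taken from this page:

```python
import math

def pass1_std(accuracy: float, n_examples: int) -> float:
    """Std of a measured accuracy (in %) under a binomial model,
    assuming each example is an independent Bernoulli trial."""
    p = accuracy / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_examples)

# Assuming ~2294 test instances, a 42% accuracy gives a std of about 1,
# the same order as the std column in the results table below.
print(pass1_std(42.0, 2294))
```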

Differences vs inconsistencies

Here is a more informative figure of the source information used to compute the p-values. Any model pair to the right of the parabola is statistically significantly different at the given level. The plot shows a fairly sharp transition: there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small difference |#A_win - #B_win|. For more explanation, see the doc.
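The shape of the parabola can be sketched with the normal approximation to the sign test (an approximation for illustration, not necessarily the exact curve used in the plot):

```python
import math

def min_win_diff(n_decisive: int, z_alpha: float = 1.96) -> float:
    """Normal-approximation boundary of the sign test: with n decisive
    examples (#A_win + #B_win), the pair is significant at the ~5%
    two-sided level when |#A_win - #B_win| >= z_alpha * sqrt(n)."""
    return z_alpha * math.sqrt(n_decisive)

# With 100 decisive examples, a win-count gap of about 20 is needed.
print(min_win_diff(100))
```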

p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning on every example where they differ; ties are ignored. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one measured. Across all pairs of models, the significance level depends mainly on the accuracy difference, as shown here. Hover over each model pair for detailed information.
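This null hypothesis is the classic sign test, and the exact two-sided p-value can be computed directly from the two win counts (a self-contained sketch using only the standard library):

```python
import math

def sign_test_pvalue(a_wins: int, b_wins: int) -> float:
    """Exact two-sided p-value under the null that A and B each win a
    decisive example with probability 1/2 (ties already excluded).
    Sums the binomial probability of every outcome at least as
    lopsided as the observed one."""
    n = a_wins + b_wins
    d = abs(a_wins - b_wins)
    total = 0
    for k in range(n + 1):
        if abs(2 * k - n) >= d:  # |#A_win - #B_win| for outcome k
            total += math.comb(n, k)
    return total / 2 ** n

# A 10-0 sweep is significant; a 5-5 split is not.
print(sign_test_pvalue(10, 0), sign_test_pvalue(5, 5))
```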

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (as used by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000. std: the standard deviation due to drawing examples from a population; this is the dominant term. std_i: the standard deviation due to drawing samples from the model on each example. std_total: the total standard deviation, satisfying std_total^2 = std^2 + std_i^2. In the table these appear as std(E(A)), E(std(A)), and std(A) respectively.

model pass@1 std(E(A)) E(std(A)) std(A) N win_rate elo
20250605_atlassian-rovo-dev 42 1 0 1 NaN 28.1 1.09e+03
20250522_amazon-q-developer-agent-20250405-dev 37.1 1 0 1 NaN 23.3 1.07e+03
20250227_sweagent-claude-3-7-20250219 33.8 0.99 0 0.99 NaN 20.8 1.06e+03
20250131_amazon-q-developer-agent-20241202-dev 30 0.96 0 0.96 NaN 17.4 1.05e+03
20241103_OpenHands-CodeAct-2.1-sonnet-20241022 29.4 0.95 0 0.95 NaN 17.2 1.05e+03
20241121_autocoderover-v2.0-claude-3-5-sonnet-20241022 24.9 0.9 0 0.9 NaN 13.8 1.03e+03
20240820_honeycomb 22.1 0.87 0 0.87 NaN 11.9 1.02e+03
20240721_amazon-q-developer-agent-20240719-dev 19.7 0.83 0 0.83 NaN 10.3 1.01e+03
20240617_factory_code_droid 19.3 0.82 0 0.82 NaN 9.79 1.01e+03
20240628_autocoderover-v20240620 18.8 0.82 0 0.82 NaN 9.63 1.01e+03
20240620_sweagent_claude3.5sonnet 18.1 0.8 0 0.8 NaN 9.37 1.01e+03
20240615_appmap-navie_gpt4o 14.6 0.74 0 0.74 NaN 6.91 993
20240509_amazon-q-developer-agent-20240430-dev 13.8 0.72 0 0.72 NaN 6.79 990
20240402_sweagent_gpt4 12.5 0.69 0 0.69 NaN 5.68 985
20240728_sweagent_gpt4o 12 0.68 0 0.68 NaN 5.63 984
20240402_sweagent_claude3opus 9.29 0.61 0 0.61 NaN 4.07 974
20240402_rag_claude3opus 3.79 0.4 0 0.4 NaN 1.57 954
20231010_rag_claude2 1.96 0.29 0 0.29 NaN 0.845 948
20240402_rag_gpt4 1.31 0.24 0 0.24 NaN 0.502 945
20231010_rag_swellama7b 0.697 0.17 0 0.17 NaN 0.359 943
20231010_rag_swellama13b 0.697 0.17 0 0.17 NaN 0.266 943
20231010_rag_gpt35 0.174 0.087 0 0.087 NaN 0.0623 941
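The average win-rate column can be reproduced from per-example scores along these lines. This is a sketch: the model names and scores are made up, and counting a tie as half a win is an assumed convention, not confirmed by this page:

```python
from itertools import combinations

def average_win_rate(scores: dict[str, list[int]]) -> dict[str, float]:
    """Average win-rate of each model against all other models,
    counting a tie on an example as half a win (assumption)."""
    models = list(scores)
    rates: dict[str, list[float]] = {m: [] for m in models}
    for a, b in combinations(models, 2):
        n = len(scores[a])
        # Fraction of examples where a beats b, ties worth 0.5.
        wa = sum((x > y) + 0.5 * (x == y)
                 for x, y in zip(scores[a], scores[b])) / n
        rates[a].append(wa)
        rates[b].append(1.0 - wa)
    return {m: sum(v) / len(v) for m, v in rates.items()}

# Hypothetical 0/1 resolved-or-not scores on three examples:
print(average_win_rate({"model_x": [1, 1, 0], "model_y": [0, 1, 0]}))
```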