p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning on any example where their results differ; ties are ignored. The p-value is the probability, under the null hypothesis, of getting a difference at least as extreme as the one observed. Across all pairs of models, the p-value depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
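As a rough illustration of the test described above, here is a minimal sketch of an exact two-sided sign test. The function name `sign_test_p_value` and its arguments are made up for this example, and the exact binomial form is an assumption; the page's own computation may use a different variant (for example, a normal approximation).

```python
from math import comb


def sign_test_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided exact binomial p-value under the null P(A wins) = 1/2.

    a_wins / b_wins count only the examples where the two models
    disagree; ties are assumed to have been dropped already.
    """
    n = a_wins + b_wins
    if n == 0:
        # No disagreements at all: nothing to distinguish the models.
        return 1.0
    k = min(a_wins, b_wins)
    # P(X <= k) for X ~ Binomial(n, 1/2), computed with exact integers.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the smaller tail for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)
```

For instance, `sign_test_p_value(10, 0)` doubles the one-sided tail 1/1024, while an even 5-to-5 split is capped at 1.0.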

p-values vs. differences

The range of possible p-values as a function of the difference in accuracy, over all model pairs.

Differences vs inconsistencies

Here is a more informative figure showing the source information used to compute the p-value. Any model pair to the right of the parabola is significantly different at the given level. The plot shows a fairly sharp transition because no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
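One way to see where a parabola comes from is the normal approximation to the sign test: under the null hypothesis, #A_win - #B_win is a sum of n = #A_win + #B_win independent ±1 steps, so its variance is n, and a pair is significant roughly when |#A_win - #B_win|² > z² · n. The sketch below assumes this approximation; the helper names are hypothetical and the page's figure may be drawn from a different formula.

```python
from math import sqrt
from statistics import NormalDist


def min_significant_difference(n_disagreements: int, alpha: float = 0.05) -> float:
    """Smallest |#A_win - #B_win| that is significant at level alpha,
    given n_disagreements = #A_win + #B_win (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    return z * sqrt(n_disagreements)


def is_significant(a_wins: int, b_wins: int, alpha: float = 0.05) -> bool:
    """True if the pair lies beyond the boundary d^2 = z^2 * n."""
    n = a_wins + b_wins
    d = abs(a_wins - b_wins)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return d * d > z * z * n
```

With 100 disagreements at the 0.05 level, the boundary sits at a difference of about 19.6, so a 60-to-40 split counts as significant under this approximation while 55-to-45 does not.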

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (used by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate always has good correlation with Elo. GPT-3.5 gets an Elo of 1000 when available; otherwise the average is 1000.

model	pass1	win_rate	elo
20240820_honeycomb	22.1%	84.5%	1392.5
20240721_amazon-q-developer-agent-20240719-dev	19.7%	80.3%	1345.7
20240617_factory_code_droid	19.3%	79.7%	1336.4
20240628_autocoderover-v20240620	18.8%	78.2%	1323.2
20240620_sweagent_claude3.5sonnet	18.1%	75.6%	1302.0
20240615_appmap-navie_gpt4o	14.6%	66.1%	1215.7
20240509_amazon-q-developer-agent-20240430-dev	13.8%	62.7%	1197.8
20240402_sweagent_gpt4	12.5%	58.1%	1158.6
20240728_sweagent_gpt4o	12.0%	55.9%	1145.6
20240402_sweagent_claude3opus	9.3%	44.3%	1060.8
20240402_rag_claude3opus	3.8%	17.6%	845.9
20231010_rag_claude2	2.0%	9.3%	724.1
20240402_rag_gpt4	1.3%	5.7%	622.4
20231010_rag_swellama13b	0.7%	3.1%	520.9
20231010_rag_swellama7b	0.7%	3.8%	562.8
20231010_rag_gpt35	0.2%	0.7%	245.7
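To make the win_rate and elo columns above more concrete, here is a minimal sketch of how such numbers could be derived from pairwise win counts. It rests on several assumptions that the page does not confirm: ties are excluded from the win-rate, the Bradley-Terry strengths are fitted with a simple fixed-point (MM) iteration, and strengths are mapped to the Elo scale as 400 · log10(strength). The function names and the `wins` layout are hypothetical.

```python
import math


def average_win_rate(wins: dict, model: str) -> float:
    """Mean over opponents of wins / (wins + losses), ties excluded.

    wins[a][b] is the number of examples that a solves and b does not.
    """
    rates = []
    for other in wins:
        if other == model:
            continue
        w = wins[model].get(other, 0)
        l = wins[other].get(model, 0)
        if w + l > 0:
            rates.append(w / (w + l))
    return sum(rates) / len(rates) if rates else 0.5


def bradley_terry_elo(wins: dict, anchor=None, iters: int = 200) -> dict:
    """Fit Bradley-Terry strengths and report them on an Elo-like scale.

    If `anchor` names a model, that model is pinned to 1000;
    otherwise the mean rating is set to 1000.
    """
    models = list(wins)
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for i in models:
            total = sum(wins[i].get(j, 0) for j in models if j != i)
            denom = sum(
                (wins[i].get(j, 0) + wins[j].get(i, 0)) / (strength[i] + strength[j])
                for j in models if j != i
            )
            # Floor keeps log10 finite for a model that never wins.
            new[i] = max(total / denom, 1e-12) if denom > 0 else strength[i]
        mean = sum(new.values()) / len(new)
        strength = {m: s / mean for m, s in new.items()}  # fix the scale
    elo = {m: 400 * math.log10(s) for m, s in strength.items()}
    ref = elo[anchor] if anchor in elo else sum(elo.values()) / len(elo)
    return {m: e - ref + 1000 for m, e in elo.items()}
```

With two models where A wins 3 disagreements and B wins 1, the fitted strength ratio is 3, so A ends up 400 · log10(3) points above B regardless of which anchoring rule is used.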

swebench-test: by models
