swebench-lite: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance to win on any task where their outcomes differ; ties are ignored. The p-value is the probability, under this null hypothesis, of a difference at least as extreme as the one observed. Across all model pairs it depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
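Concretely, this is a two-sided sign test over the tasks where the two models disagree. Below is a minimal sketch, assuming ties have already been dropped so that the win count is Binomial(n, 1/2) under the null; `a_wins` and `b_wins` are illustrative names, not the site's actual code.

```python
from math import comb

def sign_test_pvalue(a_wins: int, b_wins: int) -> float:
    """Two-sided sign-test p-value for a model pair.

    Ties are ignored, so under the null each of the
    n = a_wins + b_wins discordant tasks is a fair coin flip and the
    win count is Binomial(n, 1/2).  The p-value is the probability of
    a split at least as lopsided as the observed one.
    """
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # the models never disagree
    m = max(a_wins, b_wins)
    # One tail: P(X >= m) under Binomial(n, 1/2).  By symmetry the
    # other tail has the same mass, so double it (capped at 1).
    tail = sum(comb(n, k) for k in range(m, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Example: A wins 40 discordant tasks, B wins 20 -> p ≈ 0.014.
print(sign_test_pvalue(40, 20))
```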

p-values vs. differences

The range of possible p-values as a function of the difference in accuracy, across all model pairs.
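The p-value is not a function of the accuracy gap alone: the same gap can arise from few or many discordant tasks. The sketch below enumerates the feasible splits for a fixed win-count difference, assuming the 300 tasks of SWE-bench Lite and repeating the sign-test helper so the snippet is self-contained.

```python
from math import comb

def sign_test_pvalue(a_wins: int, b_wins: int) -> float:
    n, m = a_wins + b_wins, max(a_wins, b_wins)
    if n == 0:
        return 1.0
    return min(1.0, 2 * sum(comb(n, k) for k in range(m, n + 1)) / 2 ** n)

def pvalue_range(diff: int, n_tasks: int = 300):
    """Min and max p-value over all splits (b + diff, b) that fit in
    n_tasks tasks, i.e. all ways to realize the same win-count gap."""
    ps = [sign_test_pvalue(b + diff, b)
          for b in range((n_tasks - diff) // 2 + 1)]
    return min(ps), max(ps)

# A 5-point accuracy gap (15 of 300 tasks) is compatible with both
# clearly significant and clearly insignificant p-values:
print(pvalue_range(15))
```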

Differences vs. inconsistencies

Here is a more informative figure showing the source data used to compute the p-values. Any model pair to the right of the parabola is statistically distinguishable at the given level. The plot shows a fairly sharp transition: no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
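The parabola follows from the normal approximation to the sign test: a pair is significant at level alpha roughly when |#A_win - #B_win| exceeds z * sqrt(#A_win + #B_win), so the squared difference grows linearly in the number of inconsistencies. A sketch of that boundary, assuming alpha = 0.05:

```python
from statistics import NormalDist

def min_significant_diff(n_discordant: int, alpha: float = 0.05) -> float:
    """Smallest |#A_win - #B_win| significant at level alpha.

    Under Binomial(n, 1/2) the difference 2X - n has variance n, so
    the normal-approximation boundary is |diff| > z * sqrt(n).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96 for alpha = 0.05
    return z * n_discordant ** 0.5

for n in (25, 50, 100):
    print(n, round(min_significant_diff(n), 1))  # 9.8, 13.9, 19.6
```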

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (used by benchmarks), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when it is available; otherwise the scores are shifted so that the average is 1000.
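For reference, here is a minimal sketch of how the two pairwise scores could be computed from a win matrix. The MM fixed-point update and the 400·log10 mapping are the standard Bradley-Terry/Elo conventions; the tie handling, anchoring details, and all names here are assumptions, not the leaderboard's actual code.

```python
import numpy as np

def avg_win_rate(wins):
    """Mean over opponents of wins / (wins + losses); pairs with no
    discordant tasks count as 0.5 (an assumed tie convention)."""
    wins = np.asarray(wins, dtype=float)
    games = wins + wins.T
    frac = np.where(games > 0, wins / np.where(games > 0, games, 1), 0.5)
    np.fill_diagonal(frac, np.nan)
    return np.nanmean(frac, axis=1)

def bradley_terry_elo(wins, anchor_idx=None, iters=500):
    """Fit Bradley-Terry strengths by the MM algorithm and map them to
    the Elo scale: P(i beats j) = 1 / (1 + 10**((elo_j - elo_i)/400)).
    Anchors model anchor_idx at 1000, else shifts the mean to 1000."""
    wins = np.asarray(wins, dtype=float)
    games = wins + wins.T
    p = np.ones(len(wins))
    for _ in range(iters):
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)          # no self-play
        p = wins.sum(axis=1) / denom.sum(axis=1)
        p = np.maximum(p, 1e-12)              # keep log finite for all-losers
        p /= p.mean()                         # fix the arbitrary scale
    elo = 400.0 * np.log10(p)
    shift = 1000.0 - (elo[anchor_idx] if anchor_idx is not None else elo.mean())
    return elo + shift

# Toy example: wins[i][j] = tasks model i solved but model j did not.
wins = [[0, 30, 40],
        [20, 0, 35],
        [10, 15, 0]]
print(avg_win_rate(wins))
print(bradley_terry_elo(wins, anchor_idx=2))  # anchor model 2 at 1000
```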

model  pass@1  win_rate  Elo
20240702_codestory_aide_mixed 43.0% 86.4% 1385.0
20241025_OpenHands-CodeAct-2.1-sonnet-20241022 41.7% 81.4% 1329.6
20240912_marscode-agent-dev 39.3% 79.8% 1304.8
20240820_honeycomb 38.3% 79.4% 1298.5
20240627_abanteai_mentatbot_gpt4o 38.0% 76.7% 1271.7
20240811_gru 35.7% 75.1% 1252.3
20240829_Isoform 35.0% 75.0% 1248.4
20240723_marscode-agent-dev 34.0% 70.4% 1210.3
20240806_SuperCoder2.0 34.0% 69.2% 1206.0
20240622_Lingma_Agent 33.0% 70.5% 1207.0
20241028_agentless-1.5_gpt4o 32.0% 67.5% 1181.4
20240617_factory_code_droid 31.3% 64.6% 1162.7
20240621_autocoderover-v20240620 30.7% 65.1% 1158.3
20240908_infant_gpt4o 30.0% 62.7% 1148.3
20240721_amazon-q-developer-agent-20240719-dev 29.7% 61.2% 1141.0
20240808_RepoGraph_gpt4o 29.7% 62.3% 1141.0
20240604_CodeR 28.3% 58.7% 1117.9
20240612_MASAI_gpt4o 28.0% 58.6% 1109.9
20240706_sima_gpt4o 27.7% 57.8% 1106.0
20240630_agentless_gpt4o 27.3% 56.7% 1097.6
20240612_IBM_Research_Agent101 26.7% 55.2% 1087.4
20240623_moatless_claude35sonnet 26.7% 55.2% 1087.5
20240725_opendevin_codeact_v1.8_claude35sonnet 26.7% 54.6% 1089.1
20240523_aider 26.3% 54.1% 1080.2
20240925_hyperagent_lite1 25.3% 51.6% 1063.9
20240617_moatless_gpt4o 24.7% 50.0% 1048.1
20240524_opencsg_starship_gpt4 23.7% 47.4% 1028.5
20241016_IBM-SWE-1.0 23.7% 47.3% 1025.6
20240620_sweagent_claude3.5sonnet 23.0% 46.1% 1028.5
20240828_autose_mixed 21.7% 41.7% 985.2
20240615_appmap-navie_gpt4o 21.7% 42.6% 994.8
20240509_amazon-q-developer-agent-20240430-dev 20.3% 40.0% 983.4
20240530_autocoderover-v20240408 19.0% 35.7% 942.6
20240728_sweagent_gpt4o 18.3% 33.3% 915.5
20240402_sweagent_gpt4 18.0% 32.4% 912.4
20240402_sweagent_claude3opus 11.7% 18.0% 765.6
20240402_rag_claude3opus 4.3% 4.3% 475.0
20231010_rag_claude2 3.0% 3.1% 417.7
20240402_rag_gpt4 2.7% 3.2% 428.0
20231010_rag_swellama7b 1.3% 2.1% 355.3
20231010_rag_swellama13b 1.0% 1.6% 308.5
20231010_rag_gpt35 0.3% 0.2% -100.4