swebench-verified: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning on any task where they disagree; ties are ignored. The p-value is the probability, under this null hypothesis, of observing a difference at least as extreme as the one actually observed. Across all pairs of models this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
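As a concrete illustration, here is a minimal sketch of this sign test for one model pair. The input format (dicts of task id -> resolved flag) and the function name are assumptions for illustration, not the leaderboard's actual code.

```python
# Minimal sketch of the pairwise sign test described above (assumed setup).
from scipy.stats import binomtest

def pairwise_p_value(resolved_a, resolved_b):
    """Two-sided sign test on tasks where exactly one model succeeds."""
    a_wins = sum(resolved_a[t] and not resolved_b[t] for t in resolved_a)
    b_wins = sum(resolved_b[t] and not resolved_a[t] for t in resolved_a)
    n = a_wins + b_wins              # ties (both solve / both fail) are dropped
    if n == 0:
        return 1.0                   # the two models agree on every task
    return binomtest(a_wins, n, p=0.5, alternative="two-sided").pvalue
```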

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically distinguishable at the given significance level. The plot shows a fairly sharp transition because there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win-#B_win|. For more explanation, see the doc.
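To make the parabola concrete, the sketch below computes, for a given number of discordant tasks n = #A_win + #B_win, the smallest win gap |#A_win-#B_win| that a two-sided sign test rejects at level alpha. The required gap grows roughly like the square root of n, which traces out the parabola-shaped boundary in the figure. This is an illustrative helper, not the site's code.

```python
# Illustrative helper (assumed, not the site's code): smallest |#A_win - #B_win|
# that is significant at level alpha when #A_win + #B_win = n.
from scipy.stats import binomtest

def min_significant_gap(n, alpha=0.05):
    for gap in range(n % 2, n + 1, 2):   # gap must have the same parity as n
        k = (n + gap) // 2               # wins of the better model
        if binomtest(k, n, p=0.5).pvalue < alpha:
            return gap
    return None                          # never significant, even if one model wins all

for n in (10, 50, 100, 200):
    print(n, min_significant_gap(n))
```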

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (used by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate consistently correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when available; otherwise the average Elo is set to 1000. A minimal sketch of how all three numbers can be computed follows the table.

model  pass@1  win_rate  Elo
20241029_OpenHands-CodeAct-2.1-sonnet-20241022 53.0% 87.5% 1414.0
20241028_solver 50.0% 85.0% 1367.7
20241022_tools_claude-3-5-sonnet-updated 49.0% 83.3% 1350.1
20241025_composio_swekit 48.6% 84.6% 1363.4
20241023_emergent 46.6% 81.9% 1327.8
20240924_solver 45.4% 80.9% 1309.5
20240824_gru 45.2% 80.1% 1307.7
20240920_solver 43.6% 77.5% 1275.5
20241016_composio_swekit 40.6% 72.9% 1227.8
20240820_honeycomb 40.6% 71.6% 1224.1
20241022_tools_claude-3-5-haiku 40.6% 72.2% 1223.1
20241029_epam-ai-run-claude-3-5-sonnet 39.6% 70.0% 1211.0
20240721_amazon-q-developer-agent-20240719-dev 38.8% 68.9% 1199.7
20241028_agentless-1.5_gpt4o 38.8% 68.9% 1202.6
20240628_autocoderover-v20240620 38.4% 68.2% 1196.9
20240617_factory_code_droid 37.0% 65.8% 1176.9
20240620_sweagent_claude3.5sonnet 33.6% 58.9% 1118.7
20241007_nfactorial 31.6% 55.1% 1084.1
20241002_lingma-agent_lingma-swe-gpt-72b 28.8% 49.2% 1046.2
20241016_epam-ai-run-gpt-4o 27.0% 45.4% 1016.5
20240615_appmap-navie_gpt4o 26.2% 43.5% 1002.8
20241001_nfactorial 25.8% 42.9% 1000.9
20240509_amazon-q-developer-agent-20240430-dev 25.6% 42.5% 1001.7
20240918_lingma-agent_lingma-swe-gpt-72b 25.0% 40.8% 974.1
20240820_epam-ai-run-gpt-4o 24.0% 38.6% 963.6
20240728_sweagent_gpt4o 23.2% 37.4% 951.2
20240402_sweagent_gpt4 22.4% 35.0% 932.1
20241002_lingma-agent_lingma-swe-gpt-7b 18.2% 26.3% 851.7
20240402_sweagent_claude3opus 15.8% 21.5% 792.3
20240918_lingma-agent_lingma-swe-gpt-7b 10.2% 11.2% 645.9
20240402_rag_claude3opus 7.0% 7.9% 595.9
20231010_rag_claude2 4.4% 4.8% 499.2
20240402_rag_gpt4 2.8% 2.7% 392.0
20231010_rag_swellama7b 1.4% 2.3% 381.7
20231010_rag_swellama13b 1.2% 1.7% 322.4
20231010_rag_gpt35 0.4% 0.4% 49.2
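The sketch below is my own illustrative version of the three columns above, not the leaderboard's code. It computes pass@1 as the raw fraction of resolved tasks, average win-rate over all other models with ties counted as half a win, and Bradley-Terry strengths fit with a simple MM iteration, rescaled to an Elo scale anchored at the `anchor` model (1000) or at the average (1000) if no anchor is given. The input format, tie handling in the Bradley-Terry fit, and anchor argument are assumptions.

```python
# Assumed input: resolved[model][task] -> bool (task resolved). Illustrative only.
import numpy as np

def leaderboard_metrics(resolved, anchor=None, anchor_elo=1000.0):
    models = sorted(resolved)
    tasks = sorted(next(iter(resolved.values())))
    m, n = len(models), len(tasks)
    solved = np.array([[resolved[a][t] for t in tasks] for a in models], dtype=float)

    # 1) raw accuracy (pass@1)
    pass1 = solved.mean(axis=1)

    # 2) pairwise win counts: wins[i, j] = tasks solved by i but not by j
    wins = solved @ (1.0 - solved.T)
    ties = n - wins - wins.T                      # both solved or both failed
    win_rate = np.array([
        np.mean([(wins[i, j] + 0.5 * ties[i, j]) / n for j in range(m) if j != i])
        for i in range(m)
    ])

    # 3) Bradley-Terry strengths via a simple MM iteration (ties dropped),
    #    then scaled to Elo with the usual 400 * log10 convention
    p = np.ones(m)
    for _ in range(500):
        for i in range(m):
            total = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(m) if j != i)
            p[i] = max(wins[i].sum(), 1e-9) / max(total, 1e-9)
        p /= np.exp(np.log(p).mean())             # fix the overall scale
    base = p[models.index(anchor)] if anchor in models else np.exp(np.log(p).mean())
    elo = anchor_elo + 400.0 * np.log10(p / base)
    return dict(zip(models, zip(pass1, win_rate, elo)))
```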