The null hypothesis is that models A and B each have a 1/2 chance of winning whenever they disagree on a task; ties are ignored. The p-value is the probability, under the null hypothesis, of a difference at least as extreme as the one observed. For all pairs of models, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
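This is a sign test on the non-tied tasks. A minimal sketch, assuming SciPy is available and using hypothetical win counts:

```python
from scipy.stats import binomtest

def sign_test_pvalue(a_wins: int, b_wins: int) -> float:
    """Two-sided p-value under H0: each non-tied task is a fair coin flip."""
    n = a_wins + b_wins          # ties (both solve or both fail) are already excluded
    if n == 0:
        return 1.0               # no informative tasks, cannot reject the null
    return binomtest(a_wins, n=n, p=0.5, alternative="two-sided").pvalue

# Hypothetical example: A wins 30 tasks, B wins 15, the rest are ties.
print(sign_test_pvalue(30, 15))  # roughly 0.04
```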
The range of possible p-values vs. the difference in accuracy, over all model pairs.
Here is a more informative figure of the underlying counts used to compute the p-value. Any model pair to the right of the parabola is statistically distinguishable at the given level. The plot shows a fairly sharp transition, since there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
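The parabola can be reproduced from the sign test itself: with n = #A_win + #B_win non-tied tasks, significance at level 0.05 roughly requires |#A_win - #B_win| to exceed about 1.96·√n. A hedged sketch that finds the exact boundary, reusing the sign test above:

```python
from scipy.stats import binomtest

def min_significant_gap(n: int, alpha: float = 0.05):
    """Smallest |#A_win - #B_win| (same parity as n) with a two-sided p-value below alpha."""
    for wins in range((n + 1) // 2, n + 1):
        if binomtest(wins, n=n, p=0.5).pvalue < alpha:
            return 2 * wins - n   # gap = wins - (n - wins)
    return None                   # even a clean sweep is not significant for this n

for n in (10, 50, 100, 200):
    print(n, min_significant_gap(n))
```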
We show three methods currently used for evaluating code models: raw accuracy (as reported by benchmarks), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate consistently correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when it is available; otherwise the average Elo is set to 1000.
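For concreteness, here is a minimal sketch of how the two comparison-based metrics could be computed from per-task pass/fail results. The `results` structure (model name → list of 0/1 outcomes on a shared task set), the MM fitting loop, and anchoring the average Elo at 1000 are illustrative assumptions, not the leaderboard's actual code (which anchors GPT-3.5 at 1000 when available).

```python
import numpy as np

def pairwise_wins(results: dict[str, list[int]]):
    """wins[i, j] = number of tasks that model i solves and model j does not."""
    names = list(results)
    wins = np.zeros((len(names), len(names)))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            if i != j:
                wins[i, j] = sum(x > y for x, y in zip(results[a], results[b]))
    return names, wins

def average_win_rate(results):
    """For each model, mean over opponents of wins / (wins + losses); ties ignored."""
    names, w = pairwise_wins(results)
    with np.errstate(invalid="ignore", divide="ignore"):
        rates = w / (w + w.T)          # NaN on the diagonal and for all-tied pairs
    return dict(zip(names, np.nanmean(rates, axis=1)))

def bradley_terry_elo(results, iters=500):
    """Fit Bradley-Terry strengths with the standard MM iteration, map to Elo units."""
    names, w = pairwise_wins(results)
    n = w + w.T                        # non-tied comparisons per pair
    p = np.ones(len(names))
    for _ in range(iters):
        denom = (n / (p[:, None] + p[None, :])).sum(axis=1)
        p = (w.sum(axis=1) + 1e-6) / np.maximum(denom, 1e-12)
        p /= p.sum()
    elo = 400 * np.log10(p)            # Elo scale: P(i beats j) = 1 / (1 + 10^((R_j - R_i)/400))
    return dict(zip(names, elo - elo.mean() + 1000))  # simplification: average anchored at 1000
```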
model | pass@1 | win rate | Elo |
---|---|---|---|
20241029_OpenHands-CodeAct-2.1-sonnet-20241022 | 53.0% | 87.5% | 1414.0 |
20241028_solver | 50.0% | 85.0% | 1367.7 |
20241022_tools_claude-3-5-sonnet-updated | 49.0% | 83.3% | 1350.1 |
20241025_composio_swekit | 48.6% | 84.6% | 1363.4 |
20241023_emergent | 46.6% | 81.9% | 1327.8 |
20240924_solver | 45.4% | 80.9% | 1309.5 |
20240824_gru | 45.2% | 80.1% | 1307.7 |
20240920_solver | 43.6% | 77.5% | 1275.5 |
20241016_composio_swekit | 40.6% | 72.9% | 1227.8 |
20240820_honeycomb | 40.6% | 71.6% | 1224.1 |
20241022_tools_claude-3-5-haiku | 40.6% | 72.2% | 1223.1 |
20241029_epam-ai-run-claude-3-5-sonnet | 39.6% | 70.0% | 1211.0 |
20240721_amazon-q-developer-agent-20240719-dev | 38.8% | 68.9% | 1199.7 |
20241028_agentless-1.5_gpt4o | 38.8% | 68.9% | 1202.6 |
20240628_autocoderover-v20240620 | 38.4% | 68.2% | 1196.9 |
20240617_factory_code_droid | 37.0% | 65.8% | 1176.9 |
20240620_sweagent_claude3.5sonnet | 33.6% | 58.9% | 1118.7 |
20241007_nfactorial | 31.6% | 55.1% | 1084.1 |
20241002_lingma-agent_lingma-swe-gpt-72b | 28.8% | 49.2% | 1046.2 |
20241016_epam-ai-run-gpt-4o | 27.0% | 45.4% | 1016.5 |
20240615_appmap-navie_gpt4o | 26.2% | 43.5% | 1002.8 |
20241001_nfactorial | 25.8% | 42.9% | 1000.9 |
20240509_amazon-q-developer-agent-20240430-dev | 25.6% | 42.5% | 1001.7 |
20240918_lingma-agent_lingma-swe-gpt-72b | 25.0% | 40.8% | 974.1 |
20240820_epam-ai-run-gpt-4o | 24.0% | 38.6% | 963.6 |
20240728_sweagent_gpt4o | 23.2% | 37.4% | 951.2 |
20240402_sweagent_gpt4 | 22.4% | 35.0% | 932.1 |
20241002_lingma-agent_lingma-swe-gpt-7b | 18.2% | 26.3% | 851.7 |
20240402_sweagent_claude3opus | 15.8% | 21.5% | 792.3 |
20240918_lingma-agent_lingma-swe-gpt-7b | 10.2% | 11.2% | 645.9 |
20240402_rag_claude3opus | 7.0% | 7.9% | 595.9 |
20231010_rag_claude2 | 4.4% | 4.8% | 499.2 |
20240402_rag_gpt4 | 2.8% | 2.7% | 392.0 |
20231010_rag_swellama7b | 1.4% | 2.3% | 381.7 |
20231010_rag_swellama13b | 1.2% | 1.7% | 322.4 |
20231010_rag_gpt35 | 0.4% | 0.4% | 49.2 |