CRUXEval-output: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning on every example where they differ; ties are ignored. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one measured. Across all model pairs, this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
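Concretely, this corresponds to a two-sided sign test on the examples where the two models disagree. Below is a minimal sketch of that computation, assuming the per-pair win counts are available; the exact procedure behind the hover tooltips may differ.

```python
from scipy.stats import binomtest

def pair_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided sign-test p-value for a model pair.

    Under the null hypothesis, A beats B with probability 1/2 on every
    example where the two models differ; ties are ignored, so only the
    win counts matter.
    """
    n = a_wins + b_wins          # examples where the models disagree
    if n == 0:
        return 1.0               # the models never disagree
    return binomtest(a_wins, n, p=0.5, alternative="two-sided").pvalue

# e.g. 60 wins vs. 40 wins over 100 disagreements
print(pair_p_value(60, 40))      # ~0.057
```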

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given level. The plot shows a fairly sharp transition: there are no model pairs with a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
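To see why the boundary is a parabola: under the null hypothesis the gap #A_win - #B_win has standard deviation sqrt(#A_win + #B_win), so significance at level alpha requires roughly |#A_win - #B_win| >= z * sqrt(#A_win + #B_win). Here is a small sketch of this boundary using the normal approximation; the exact figure may use the exact binomial tail instead.

```python
import numpy as np
from scipy.stats import norm

def min_gap_for_significance(n_disagreements: int, alpha: float = 0.05) -> float:
    """Approximate win gap |#A_win - #B_win| needed for significance.

    Normal approximation to the two-sided sign test: under the null the
    gap has standard deviation sqrt(n), so we need |gap| >= z * sqrt(n).
    Squaring both sides gives the parabola that separates significant
    pairs from non-significant ones in the figure above.
    """
    z = norm.ppf(1 - alpha / 2)
    return z * np.sqrt(n_disagreements)

# e.g. with 200 disagreements the gap must be about 28 wins at alpha = 0.05
print(min_gap_for_significance(200))   # ~27.7
```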

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (used by benchmarks), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate consistently correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when it is available; otherwise, the average Elo is set to 1000. A sketch of how the win-rate and Elo columns can be computed follows the table.

model pass@1 std win_rate Elo
gpt-4-turbo-2024-04-09+cot 82.0% 0.57% 95.4% 1498.9
claude-3-opus-20240229+cot 82.0% 0.00% 94.8% 1479.9
gpt-4-0613+cot 77.1% 0.66% 91.9% 1384.8
gpt-4o+cot 76.0% 0.94% 86.8% 1291.0
gpt-4o 70.0% 0.29% 86.7% 1291.0
gpt-4-0613 68.7% 0.32% 85.6% 1275.4
gpt-4-turbo-2024-04-09 67.7% 0.30% 84.9% 1260.2
claude-3-opus-20240229 65.8% 0.00% 82.8% 1239.8
gpt-3.5-turbo-0613+cot 59.0% 0.89% 70.8% 1115.1
deepseek-instruct-33b 49.9% 0.47% 58.2% 1022.9
gpt-3.5-turbo-0613 49.4% 0.40% 55.9% 1000.0
deepseek-base-33b 48.6% 0.52% 55.3% 1005.0
codetulu-2-34b 45.8% 0.50% 48.4% 955.0
magicoder-ds-7b 44.4% 0.51% 45.3% 933.5
codellama-34b+cot 43.6% 1.02% 41.2% 888.5
deepseek-base-6.7b 43.5% 0.57% 43.7% 921.6
wizard-34b 43.4% 0.44% 44.5% 921.2
codellama-34b 42.4% 0.56% 40.6% 891.7
codellama-python-34b 41.4% 0.48% 38.6% 875.3
wizard-13b 41.3% 0.50% 40.0% 890.4
deepseek-instruct-6.7b 41.2% 0.41% 39.7% 885.8
mixtral-8x7b 40.5% 0.58% 35.8% 858.5
codellama-python-13b 39.8% 0.52% 36.5% 863.8
codellama-13b 39.7% 0.56% 35.4% 861.4
phind 39.7% 0.46% 35.5% 853.7
codellama-13b+cot 36.0% 1.07% 26.8% 769.9
codellama-python-7b 35.9% 0.54% 29.9% 809.7
mistral-7b 34.3% 0.56% 24.9% 761.3
codellama-7b 34.2% 0.57% 24.1% 760.1
starcoderbase-16b 34.2% 0.55% 26.8% 782.5
phi-2 33.5% 0.55% 25.3% 761.4
starcoderbase-7b 32.2% 0.47% 21.8% 734.5
deepseek-base-1.3b 31.0% 0.57% 22.8% 744.7
codellama-7b+cot 29.9% 1.05% 16.2% 650.4
deepseek-instruct-1.3b 28.7% 0.48% 19.3% 696.8
phi-1.5 27.5% 0.56% 19.2% 695.2
phi-1 21.7% 0.45% 13.8% 616.7
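As a rough illustration of how the win_rate and Elo columns relate to per-example results, here is a sketch assuming a boolean correctness matrix (one row per model, one column per benchmark example). The tie handling (half credit for examples both solve or both fail) and the Bradley-Terry fitting procedure shown here are assumptions following the descriptions above, not necessarily the exact code behind this table.

```python
import numpy as np

def average_win_rate(correct: np.ndarray) -> np.ndarray:
    """Mean win rate of each model against every other model.

    correct: (n_models, n_examples) boolean matrix of per-example correctness.
    A head-to-head win is an example one model solves and the other does not;
    examples solved by both or neither count as half a win (assumption).
    """
    n_models, n_examples = correct.shape
    rates = np.zeros(n_models)
    for i in range(n_models):
        vs = []
        for j in range(n_models):
            if i == j:
                continue
            wins = np.sum(correct[i] & ~correct[j])
            losses = np.sum(~correct[i] & correct[j])
            vs.append((wins + 0.5 * (n_examples - wins - losses)) / n_examples)
        rates[i] = np.mean(vs)
    return rates

def bradley_terry_elo(correct: np.ndarray, anchor: int,
                      anchor_elo: float = 1000.0, iters: int = 2000) -> np.ndarray:
    """Bradley-Terry strengths from head-to-head wins, mapped to an Elo scale.

    Uses the standard minorization-maximization update; the Elo scale is
    400 * log10(strength), shifted so that the anchor model (e.g. GPT-3.5)
    sits at anchor_elo.
    """
    n_models = len(correct)
    wins = np.array([[np.sum(correct[i] & ~correct[j]) for j in range(n_models)]
                     for i in range(n_models)], dtype=float)
    games = wins + wins.T                     # decisive comparisons per pair
    total_wins = wins.sum(axis=1)
    p = np.ones(n_models)
    for _ in range(iters):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p = total_wins / denom
        p = p / p[anchor]                     # pin the anchor's strength at 1
    return anchor_elo + 400.0 * np.log10(p)
```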