CRUXEval-output: by models


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever their answers differ; ties are ignored. The p-value is the probability, under this null hypothesis, of a difference at least as extreme as the one observed. Across all pairs of models, it depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
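
As a concrete illustration, such a p-value can be computed with a two-sided binomial sign test over only the examples on which the two models disagree. The helper below is a minimal sketch under that assumption, not necessarily the exact test used for the plot.

```python
from scipy.stats import binomtest

def pairwise_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided sign-test p-value for a model pair.

    Under the null hypothesis, A and B each win a disagreement with
    probability 1/2; examples where both are right or both wrong are ignored.
    """
    n = a_wins + b_wins                 # examples where A and B differ
    if n == 0:
        return 1.0                      # the models never disagree
    return binomtest(a_wins, n, p=0.5, alternative="two-sided").pvalue

print(pairwise_p_value(60, 40))         # ~0.057: not quite significant at the 5% level
```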

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs. inconsistencies

Here is a more informative figure showing the raw counts used to compute the p-values. Any model pair to the right of the parabola is statistically significantly different at the given level. The plot shows a fairly sharp transition because no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
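
As a rough sketch of where such a parabola comes from (a normal approximation to the binomial under the same null hypothesis, assuming the difference |#A_win - #B_win| is on the horizontal axis; this is not necessarily the exact curve in the plot), a pair is significant at roughly the 5% level when the difference exceeds 1.96 * sqrt(#A_win + #B_win):

```python
import math

def significant(a_wins: int, b_wins: int, z: float = 1.96) -> bool:
    """Normal-approximation significance check at roughly the 5% level.

    Points to the right of the parabola n = (d / z)**2, where
    d = |a_wins - b_wins| and n = a_wins + b_wins, pass this test.
    """
    return abs(a_wins - b_wins) > z * math.sqrt(a_wins + b_wins)

print(significant(65, 35))   # True:  |30| > 1.96 * 10
print(significant(52, 48))   # False:  |4| < 1.96 * 10
```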

Results table by model

We show three metrics currently used for evaluating code models: raw accuracy (as used by benchmarks), average win rate over all other models (as used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate always correlates well with Elo. GPT-3.5 gets an Elo of 1000 when available; otherwise the average Elo is set to 1000.
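
Below is a minimal sketch of how the win_rate and elo columns could be produced from a matrix of pairwise win counts. The wins matrix, the minorization-maximization fitting loop, and the anchoring step are illustrative assumptions rather than the exact BigCode or Chatbot Arena code.

```python
import numpy as np

def average_win_rate(wins: np.ndarray) -> np.ndarray:
    """wins[i, j] = number of examples model i solves and model j does not.
    Returns each model's win rate averaged over all other models."""
    games = wins + wins.T                              # decisive comparisons per pair
    rate = np.divide(wins, games,
                     out=np.full(wins.shape, 0.5), where=games > 0)
    np.fill_diagonal(rate, np.nan)                     # skip self-comparisons
    return np.nanmean(rate, axis=1)

def bradley_terry_elo(wins: np.ndarray, anchor: int,
                      anchor_elo: float = 1000.0, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths by minorization-maximization and report them
    on an Elo-like scale (400 * log10), shifted so model `anchor` gets anchor_elo."""
    games = wins + wins.T
    p = np.ones(wins.shape[0])                         # Bradley-Terry strengths
    for _ in range(iters):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p = np.maximum(wins.sum(axis=1) / np.maximum(denom, 1e-12), 1e-12)
        p /= p.mean()                                  # fix the arbitrary overall scale
    elo = 400.0 * np.log10(p)
    return elo + (anchor_elo - elo[anchor])

# Hypothetical usage, anchoring on gpt-3.5-turbo-0613 as in the table below:
# elo = bradley_terry_elo(wins, anchor=model_names.index("gpt-3.5-turbo-0613"))
```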

model  pass@1  std  win_rate  elo
gpt-4-turbo-2024-04-09+cot 82.0% 0.57% 95.1% 1508.2
claude-3-opus-20240229+cot 82.0% 0.00% 94.4% 1488.5
gpt-4-0613+cot 77.1% 0.66% 91.3% 1391.6
gpt-4o+cot 76.0% 0.94% 86.0% 1301.8
llama3-405-cot 75.9% 0.00% 91.3% 1396.5
gpt-4o 70.0% 0.29% 85.7% 1294.5
gpt-4-0613 68.7% 0.32% 84.4% 1277.4
gpt-4-turbo-2024-04-09 67.7% 0.30% 83.6% 1262.3
claude-3-opus-20240229 65.8% 0.00% 81.3% 1241.2
llama3-405 64.9% 0.00% 81.0% 1235.1
gpt-3.5-turbo-0613+cot 59.0% 0.89% 68.8% 1116.6
deepseek-instruct-33b 49.9% 0.47% 55.6% 1021.1
gpt-3.5-turbo-0613 49.4% 0.40% 53.4% 1000.0
deepseek-base-33b 48.6% 0.52% 52.6% 1003.4
codetulu-2-34b 45.8% 0.50% 45.8% 953.9
magicoder-ds-7b 44.4% 0.51% 42.8% 933.1
codellama-34b+cot 43.6% 1.02% 38.9% 888.3
deepseek-base-6.7b 43.5% 0.57% 41.2% 921.7
wizard-34b 43.4% 0.44% 42.0% 921.7
codellama-34b 42.4% 0.56% 38.1% 891.3
codellama-python-34b 41.4% 0.48% 36.2% 875.0
wizard-13b 41.3% 0.50% 37.6% 889.6
deepseek-instruct-6.7b 41.2% 0.41% 37.2% 884.9
mixtral-8x7b 40.5% 0.58% 33.5% 858.1
codellama-python-13b 39.8% 0.52% 34.2% 863.1
codellama-13b 39.7% 0.56% 33.2% 862.4
phind 39.7% 0.46% 33.2% 853.1
codellama-13b+cot 36.0% 1.07% 25.1% 770.9
codellama-python-7b 35.9% 0.54% 28.1% 811.5
mistral-7b 34.3% 0.56% 23.2% 763.2
codellama-7b 34.2% 0.57% 22.5% 761.3
starcoderbase-16b 34.2% 0.55% 25.0% 783.0
phi-2 33.5% 0.55% 23.7% 763.2
starcoderbase-7b 32.2% 0.47% 20.4% 737.0
deepseek-base-1.3b 31.0% 0.57% 21.4% 747.1
codellama-7b+cot 29.9% 1.05% 15.1% 651.3
deepseek-instruct-1.3b 28.7% 0.48% 18.2% 700.0
phi-1.5 27.5% 0.56% 18.1% 698.2
phi-1 21.7% 0.45% 13.0% 619.5