CRUXEval-input: results by model


p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning whenever they differ on an example; ties are ignored. The p-value is the probability, under this null hypothesis, of a difference at least as extreme as the one observed. For all pairs of models this depends mainly on the difference in accuracy. Hover over each model pair for detailed information.
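Concretely, this is a paired sign test over the examples where the two models disagree. A minimal sketch in pure Python (the function name sign_test_pvalue is ours, not part of the benchmark code):

```python
from math import comb

def sign_test_pvalue(a_wins: int, b_wins: int) -> float:
    """Two-sided sign-test p-value for one model pair.

    Null hypothesis: on every example where A and B disagree, each is the
    winner with probability 1/2; examples where both are right or both are
    wrong (ties) are ignored, so n = a_wins + b_wins.
    """
    n = a_wins + b_wins
    if n == 0:
        return 1.0
    observed = abs(a_wins - n / 2)
    # Probability under Binomial(n, 1/2) of a win split at least as
    # lopsided as the one observed.
    tail = sum(comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= observed)
    return tail / 2 ** n

print(sign_test_pvalue(60, 40))   # ~0.057: a 60-40 split is not significant at 0.05
```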

p-values vs. differences

The range of possible p-values vs. the difference in accuracy over all pairs.

Differences vs inconsistencies

Here is a more informative figure showing the underlying counts used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given significance level. The plot shows a fairly sharp transition because no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation see the doc.
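To see where the parabola comes from, the sketch below (reusing the sign test from above; the helper name min_significant_gap is ours) computes, for a given total #A_win + #B_win, the smallest gap |#A_win - #B_win| that is significant at a chosen level. Under a normal approximation this boundary grows like the square root of the total, which traces a parabola in the plot's coordinates.

```python
from math import comb

def sign_test_pvalue(a_wins: int, b_wins: int) -> float:
    # Same two-sided sign test as in the sketch above.
    n = a_wins + b_wins
    if n == 0:
        return 1.0
    obs = abs(a_wins - n / 2)
    return sum(comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= obs) / 2 ** n

def min_significant_gap(n_decisive: int, alpha: float = 0.05):
    """Smallest |#A_win - #B_win| with p-value below alpha, given
    n_decisive = #A_win + #B_win.  Returns None if no gap qualifies."""
    for gap in range(n_decisive % 2, n_decisive + 1, 2):  # gap has the parity of n
        a_wins = (n_decisive + gap) // 2
        if sign_test_pvalue(a_wins, n_decisive - a_wins) < alpha:
            return gap
    return None

# Trace a few points of the significance boundary.
for n in (25, 100, 400, 800):
    print(n, min_significant_gap(n))
```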

Results table by model

We show three metrics currently used for evaluating code models: raw accuracy (pass@1), as reported by most benchmarks; average win rate over all other models (used by BigCode); and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate correlates well with Elo. GPT-3.5 is anchored at an Elo of 1000 when it is present; otherwise the average Elo is set to 1000. A sketch of how the win-rate and Elo columns can be computed from per-example results is given after the table.

model pass@1 std win_rate Elo
gpt-4-turbo-2024-04-09+cot 75.7% 0.78% 87.1% 1293.2
gpt-4o+cot 75.6% 0.71% 87.1% 1286.6
gpt-4-0613+cot 75.5% 0.88% 88.7% 1314.0
claude-3-opus-20240229+cot 73.4% 0.00% 84.9% 1259.3
llama3-405-cot 72.0% 0.00% 84.9% 1252.8
gpt-4-0613 69.8% 0.43% 83.8% 1239.9
gpt-4-turbo-2024-04-09 68.5% 0.43% 83.0% 1230.0
llama3-405 67.6% 0.00% 81.5% 1212.4
gpt-4o 65.1% 0.42% 78.9% 1186.6
claude-3-opus-20240229 64.2% 0.00% 76.0% 1153.1
gpt-3.5-turbo-0613+cot 50.3% 1.10% 52.6% 986.5
codellama-34b+cot 50.1% 0.93% 53.5% 997.0
codetulu-2-34b 49.2% 0.69% 53.8% 999.1
gpt-3.5-turbo-0613 49.0% 0.55% 54.1% 1000.0
codellama-13b+cot 47.4% 0.85% 50.1% 982.1
codellama-34b 47.2% 0.71% 49.5% 975.0
phind 47.2% 0.61% 50.2% 973.5
deepseek-base-33b 46.5% 0.71% 48.3% 966.0
deepseek-instruct-33b 46.5% 0.65% 49.5% 967.8
codellama-python-34b 43.9% 0.70% 46.0% 947.2
wizard-34b 42.7% 0.60% 42.3% 918.3
codellama-13b 42.5% 0.76% 40.9% 931.5
deepseek-base-6.7b 41.9% 0.70% 38.2% 904.5
magicoder-ds-7b 41.7% 0.63% 40.3% 913.5
codellama-7b+cot 40.4% 0.95% 36.6% 882.9
codellama-python-13b 39.7% 0.75% 36.1% 885.1
mixtral-8x7b 39.3% 0.75% 35.0% 872.0
deepseek-instruct-6.7b 37.4% 0.60% 32.9% 856.3
codellama-python-7b 37.3% 0.65% 33.3% 871.0
wizard-13b 36.5% 0.60% 32.0% 846.7
codellama-7b 36.0% 0.69% 29.2% 840.7
mistral-7b 35.0% 0.69% 30.4% 846.6
phi-2 31.6% 0.70% 25.4% 802.0
starcoderbase-16b 31.3% 0.70% 23.2% 780.2
starcoderbase-7b 29.7% 0.65% 21.5% 764.1
deepseek-base-1.3b 27.8% 0.60% 19.3% 739.8
deepseek-instruct-1.3b 27.2% 0.55% 21.3% 758.7
phi-1.5 23.2% 0.70% 17.4% 709.9
phi-1 13.1% 0.41% 8.4% 554.2
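Below is a minimal sketch of the kind of computation behind the win_rate and Elo columns, assuming a per-example correctness vector for each model. The helper name leaderboard_metrics, the tie handling (half a win to each side), and the fixed number of Bradley-Terry iterations are our assumptions, not the leaderboard's exact implementation.

```python
import numpy as np

def leaderboard_metrics(correct: dict[str, np.ndarray],
                        anchor: str = "gpt-3.5-turbo-0613"):
    """Average win rate and Bradley-Terry Elo from per-example results.

    correct[m][k] is 1 if model m solves example k, else 0.  Ties (both
    solve or both fail an example) count as half a win for each side.
    """
    models = list(correct)
    n = len(models)
    mat = np.array([correct[m] for m in models], dtype=float)
    num_ex = mat.shape[1]

    # wins[i, j]: soft win count of model i over model j across examples.
    wins = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                wins[i, j] = np.sum(mat[i] > mat[j]) + 0.5 * np.sum(mat[i] == mat[j])

    # Average win rate of each model over all other models.
    win_rate = np.array([wins[i, [j for j in range(n) if j != i]].mean() / num_ex
                         for i in range(n)])

    # Bradley-Terry strengths via the standard MM (minorization-maximization) updates.
    p = np.ones(n)
    for _ in range(200):
        total_wins = wins.sum(axis=1)
        denom = np.array([sum(num_ex / (p[i] + p[j]) for j in range(n) if j != i)
                          for i in range(n)])
        p = total_wins / denom
        p /= p.mean()

    # Convert strengths to an Elo-like scale; anchor GPT-3.5 at 1000 if present,
    # otherwise set the average Elo to 1000.
    elo = 400 * np.log10(p)
    elo += 1000 - elo[models.index(anchor)] if anchor in models else 1000 - elo.mean()

    return {m: (win_rate[i], elo[i]) for i, m in enumerate(models)}
```

On this scale a 400-point Elo gap corresponds to a 10:1 Bradley-Terry strength ratio, matching the usual Elo convention.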