CRUXEval-output-T0.2: by models

std predicted by accuracy

The typical standard deviation (stddev) between pairs of models on this dataset, as a function of absolute accuracy.

Differences vs inconsistencies

This figure shows, in more detail, the source information used to compute the p-values. Any model pair to the right of the parabola is statistically different at the given significance level. The transition is fairly sharp because no model pair has a small #A_win + #B_win, which rules out significant results at a small |#A_win - #B_win|. For more explanation, see the doc.
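The parabola can be reasoned about with a normal approximation. The following is a minimal sketch, assuming the figure's axes are |#A_win - #B_win| and #A_win + #B_win and that each decisive example is a fair coin flip under the null. The function name `min_significant_diff` and the default `z = 1.96` (two-sided 5% level) are illustrative choices, not taken from the page's own code.

```python
import math

def min_significant_diff(n_decisive: int, z: float = 1.96) -> float:
    """Smallest |#A_win - #B_win| that is significant under a normal approximation.

    Under the null, #A_win ~ Binomial(n_decisive, 1/2), so the difference
    #A_win - #B_win = 2 * #A_win - n_decisive has mean 0 and variance n_decisive.
    """
    return z * math.sqrt(n_decisive)

# The boundary |diff| = z * sqrt(n) is a parabola in the (diff, n) plane:
# n = (diff / z) ** 2.
for n in (25, 100, 400):
    print(n, round(min_significant_diff(n), 1))
```

Under this approximation the required difference grows only with the square root of the number of decisive examples, so pairs with many decisive examples need a proportionally smaller win margin to be significant.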

p-values for model pairs

The null hypothesis is that models A and B each have a 1/2 chance of winning on any example where their results differ; ties are ignored. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one observed. For all pairs of models, the significance level depends mainly on the accuracy difference, as shown here. Hover over each model pair for detailed information.
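The null hypothesis above is the one used by the classical sign test. As a minimal sketch, assuming a two-sided exact binomial test (the page may instead use a one-sided test or a normal approximation), the p-value can be computed from the two win counts alone. The function name `sign_test_p_value` is hypothetical.

```python
from math import comb

def sign_test_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided exact sign test; ties are assumed to be excluded already."""
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # no decisive examples, so no evidence either way
    k = max(a_wins, b_wins)
    # Upper tail P(X >= k) for X ~ Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double the tail for a two-sided test; cap at 1.
    return min(1.0, 2 * tail)

print(sign_test_p_value(5, 5))   # 1.0
print(sign_test_p_value(60, 40))
```

Only the examples where the two models disagree enter the computation, which is why the significance depends on both the win difference and the number of decisive examples.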

Results table by model

We show three methods currently used for evaluating code models: raw accuracy (used by benchmarks), average win rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win rate always correlates well with Elo. GPT-3.5 is assigned an Elo of 1000 when available; otherwise the average Elo is 1000. std: the standard deviation due to drawing examples from a population; this is the dominant term. std_i: the standard deviation due to drawing samples from the model on each example. std_total: the total standard deviation, satisfying std_total^2 = std^2 + std_i^2.

model                       pass1  std(E(A))  E(std(A))  std(A)  N   win_rate  elo
gpt-4-turbo-2024-04-09+cot  82     1.2        0.57       1.4     3   39.5      1.13e+03
claude-3-opus-20240229+cot  82     1.4        0          1.4     1   39.4      1.13e+03
gpt-4-0613+cot              77.1   1.3        0.66       1.5     10  34.8      1.1e+03
gpt-4o+cot                  76     1.2        0.94       1.5     3   38.3      1.11e+03
gpt-4o                      70     1.6        0.29       1.6     3   29.6      1.08e+03
gpt-4-0613                  68.7   1.6        0.32       1.6     10  28        1.07e+03
gpt-4-turbo-2024-04-09      67.7   1.6        0.3        1.7     3   27.4      1.07e+03
claude-3-opus-20240229      65.8   1.7        0          1.7     1   26        1.06e+03
gpt-3.5-turbo-0613+cot      59     1.5        0.89       1.7     10  21.4      1.03e+03
deepseek-instruct-33b       49.9   1.7        0.47       1.8     10  14.6      1e+03
gpt-3.5-turbo-0613          49.4   1.7        0.4        1.8     10  14.6      1e+03
deepseek-base-33b           48.6   1.7        0.52       1.8     10  13.4      998
codetulu-2-34b              45.8   1.7        0.5        1.8     10  11.6      987
magicoder-ds-7b             44.4   1.7        0.51       1.8     10  11.2      981
codellama-34b+cot           43.6   1.4        1          1.8     10  11.1      973
deepseek-base-6.7b          43.5   1.7        0.57       1.8     10  10.7      979
wizard-34b                  43.4   1.7        0.44       1.8     10  10.9      980
codellama-34b               42.4   1.7        0.56       1.7     10  9.68      974
codellama-python-34b        41.4   1.7        0.48       1.7     10  9.32      970
wizard-13b                  41.3   1.7        0.5        1.7     10  9.73      972
deepseek-instruct-6.7b      41.2   1.7        0.41       1.7     10  9.59      972
mixtral-8x7b                40.5   1.6        0.58       1.7     10  8.58      966
codellama-python-13b        39.8   1.7        0.52       1.7     10  8.84      966
codellama-13b               39.7   1.6        0.56       1.7     10  8.72      964
phind                       39.7   1.7        0.46       1.7     10  8.63      965
codellama-13b+cot           36     1.3        1.1        1.7     10  7.84      942
codellama-python-7b         35.9   1.6        0.54       1.7     10  7.95      952
mistral-7b                  34.3   1.6        0.56       1.7     10  6.48      943
codellama-7b                34.2   1.6        0.57       1.7     10  6.16      943
starcoderbase-16b           34.2   1.6        0.55       1.7     10  7.11      946
phi-2                       33.5   1.6        0.55       1.7     10  6.98      942
starcoderbase-7b            32.2   1.6        0.47       1.7     10  5.79      937
deepseek-base-1.3b          31     1.5        0.57       1.6     10  6.52      935
codellama-7b+cot            29.9   1.2        1          1.6     10  4.94      916
deepseek-instruct-1.3b      28.7   1.5        0.48       1.6     10  5.75      925
phi-1.5                     27.5   1.5        0.56       1.6     10  6.36      917
phi-1                       21.7   1.4        0.45       1.5     10  4.81      899
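The relation std_total^2 = std^2 + std_i^2 follows from the law of total variance. The sketch below illustrates it with plug-in estimates on a toy pass/fail matrix. It assumes that the columns std(E(A)), E(std(A)) and std(A) correspond to std, std_i and std_total respectively, which the table values are consistent with but the page does not state. The function name, dictionary keys and toy data are hypothetical, and the page's own estimators may differ.

```python
import math

def std_decomposition(correct):
    """Split the standard error of accuracy into example and sample terms.

    correct[i][j] is 1 if sample j on example i passed, else 0.
    """
    m = len(correct)
    # Per-example pass rate, estimated from that example's samples.
    p = [sum(row) / len(row) for row in correct]
    acc = sum(p) / m
    # Variance of the mean accuracy from drawing examples (between-example term).
    var_examples = sum((pi - acc) ** 2 for pi in p) / m / m
    # Variance of the mean accuracy from sampling the model (within-example term).
    var_samples = sum(pi * (1 - pi) for pi in p) / m / m
    return {
        "pass1": acc,
        "std_examples": math.sqrt(var_examples),
        "std_samples": math.sqrt(var_samples),
        "std_total": math.sqrt(var_examples + var_samples),
    }

# Toy data: 4 examples, 2 samples each (illustrative only).
toy = [[1, 1], [1, 0], [0, 0], [1, 1]]
result = std_decomposition(toy)
print(result)
```

With these plug-in estimates the two terms sum exactly to acc * (1 - acc) / m, the usual binomial variance of the mean. With a single sample per example every per-example pass rate is 0 or 1, so the within-example term vanishes, which matches the zeros reported for the N = 1 rows.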