humaneval+: by examples


Not solved by any model

There are 7 examples that no model solves. If these problems are sound, solving some of them is a good signal that your model really is better than the leading models.
HumanEval/129, HumanEval/130, HumanEval/132, HumanEval/145, HumanEval/163, HumanEval/32, HumanEval/91
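
As a rough illustration of how such a list could be derived, the sketch below scans a pass/fail table and keeps the problems that no model passes. The `results` layout (model name mapped to per-problem booleans), the function name, and the toy data are assumptions made for this example, not the page's actual data format.

```python
def unsolved_problems(results):
    """Return the problems that no model solved.

    `results` maps model name -> {problem id -> passed (bool)}.
    This layout is hypothetical, chosen only for the sketch.
    """
    all_problems = set()
    for per_model in results.values():
        all_problems.update(per_model)
    # Keep a problem only if every model failed it (or did not report it).
    return sorted(
        problem
        for problem in all_problems
        if not any(per_model.get(problem, False) for per_model in results.values())
    )


# Toy data, not taken from the real leaderboard.
toy_results = {
    "model_a": {"HumanEval/0": True, "HumanEval/1": False, "HumanEval/2": False},
    "model_b": {"HumanEval/0": True, "HumanEval/1": True, "HumanEval/2": False},
}
print(unsolved_problems(toy_results))
```

The result is sorted as plain strings, which matches the ordering of the list above, where HumanEval/32 and HumanEval/91 come after HumanEval/163.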

Problems solved by 1 model only

example_link model min_elo
HumanEval/140 speechless-codellama-34b 1225.298
HumanEval/124 code-millenials-34b 1200.529
HumanEval/93 xwincoder-34b 1170.432
HumanEval/76 openchat 1153.939
HumanEval/108 claude-3-sonnet-20240229 1083.657
HumanEval/137 Qwen--Qwen1.5-72B-Chat 1022.575
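
A table like this one could be produced along the following lines. The sketch assumes that `min_elo` is the Elo rating of the lowest-rated model that solves the problem, which for a problem with a single solver is simply that model's rating. The `results` and `elo` structures, the function name, and the toy ratings are all hypothetical.

```python
def solved_by_one_model(results, elo):
    """List (problem, model, elo) for problems with exactly one solver.

    `results` maps model name -> {problem id -> passed (bool)};
    `elo` maps model name -> rating. Both layouts are assumptions.
    """
    problems = {p for per_model in results.values() for p in per_model}
    rows = []
    for problem in problems:
        solvers = [
            model
            for model, per_model in results.items()
            if per_model.get(problem, False)
        ]
        if len(solvers) == 1:
            rows.append((problem, solvers[0], elo[solvers[0]]))
    # Highest rating first, as in the table above.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Toy data, not taken from the real leaderboard.
toy_results = {
    "model_a": {"P0": True, "P1": True, "P2": False},
    "model_b": {"P0": True, "P1": False, "P2": True},
    "model_c": {"P0": True, "P1": False, "P2": False},
}
toy_elo = {"model_a": 1200.0, "model_b": 1100.0, "model_c": 1000.0}
for row in solved_by_one_model(toy_results, toy_elo):
    print(*row)
```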

Suspect problems

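These are the 10 problems whose results correlate least with the overall evaluation. A negative tau means that better models tend to do worse on the problem; a tau near zero means the problem barely separates strong models from weak ones.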

example_link acc tau
HumanEval/54 0.163 -0.135
HumanEval/154 0.224 -0.046
HumanEval/137 0.020 -0.004
HumanEval/122 0.061 0.010
HumanEval/83 0.041 0.024
HumanEval/47 0.939 0.025
HumanEval/108 0.020 0.042
HumanEval/126 0.041 0.051
HumanEval/11 0.837 0.070
HumanEval/65 0.408 0.078
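
The page does not say how tau is computed. The column name suggests Kendall's tau, so the sketch below assumes the tau-b variant, which handles the many ties that arise when one variable is a pass/fail flag. For a given problem, `passed` would hold one 0/1 outcome per model and `overall` that model's overall score; both names and the sample numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences, O(n^2) pairs."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    # If either variable is constant, the correlation is undefined; return 0.
    return (concordant - discordant) / denominator if denominator else 0.0


# Toy data: the two weakest models pass, the two strongest fail.
overall = [0.9, 0.8, 0.4, 0.3]  # overall score per model (made up)
passed = [0, 0, 1, 1]           # pass/fail on one problem (made up)
print(kendall_tau_b(passed, overall))  # negative: better models do worse
```

A problem that every model passes, or every model fails, has no variation to correlate, which is why the sketch returns 0.0 in that case rather than dividing by zero.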

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
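
A histogram of this kind can be sketched by bucketing per-problem accuracies into equal-width bins. The example assumes that accuracy lies in [0, 1] and uses ten bins; the bin count and the function name are choices made for this illustration. The sample input is the `acc` column of the suspect-problems table above, so it covers only those ten problems, not the full benchmark.

```python
def accuracy_histogram(accuracies, bins=10):
    """Count values in [0, 1] per equal-width bin; 1.0 goes in the last bin."""
    counts = [0] * bins
    for acc in accuracies:
        index = min(int(acc * bins), bins - 1)
        counts[index] += 1
    return counts


# The ten acc values from the suspect-problems table.
suspect_acc = [0.163, 0.224, 0.020, 0.061, 0.041, 0.939, 0.020, 0.041, 0.837, 0.408]
for i, count in enumerate(accuracy_histogram(suspect_acc)):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * count}")
```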

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.
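
One way to build this view is to take, for each problem, the lowest Elo among the models that solve it, then bucket those values. The sketch below makes several assumptions: the `results` and `elo` layouts are hypothetical, the bucket width of 100 is arbitrary, and problems that no model solves are left out because they have no minimum Elo.

```python
from collections import Counter


def min_elo_histogram(results, elo, bin_width=100):
    """Bucket problems by the lowest Elo among the models that solve them."""
    min_elo = {}
    for model, per_model in results.items():
        for problem, passed in per_model.items():
            if passed:
                current = min_elo.get(problem, float("inf"))
                min_elo[problem] = min(current, elo[model])
    # Map each minimum Elo to the lower edge of its bucket.
    buckets = Counter(int(value // bin_width) * bin_width for value in min_elo.values())
    return dict(sorted(buckets.items()))


# Toy data, not taken from the real leaderboard.
toy_results = {
    "model_a": {"P0": True, "P1": True, "P2": False, "P3": False},
    "model_b": {"P0": True, "P1": False, "P2": False, "P3": True},
    "model_c": {"P0": True, "P1": False, "P2": False, "P3": False},
}
toy_elo = {"model_a": 1225.3, "model_b": 1083.7, "model_c": 950.0}
print(min_elo_histogram(toy_results, toy_elo))
```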