humaneval: by examples


Not solved by any model

Six examples are not solved by any model. Provided these are sound problems, solving some of them is a good signal that your model is indeed better than the leading models.
HumanEval/129, HumanEval/130, HumanEval/132, HumanEval/145, HumanEval/163, HumanEval/32

Problems solved by 1 model only

example_link    model                      min_elo
HumanEval/93    xwincoder-34b              1368.434
HumanEval/108   claude-3-sonnet-20240229   1264.638
HumanEval/137   Qwen--Qwen1.5-72B-Chat     1247.401
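The lists above can be derived from a per-model record of solved problems. The sketch below is a minimal illustration, not the page's actual code: the model names, Elo values, and problem ids are made up, and it assumes min_elo means the lowest Elo among the models that solve a problem.

```python
# Hypothetical inputs: model -> Elo rating, and model -> set of solved problem ids.
elo = {"model_a": 1300.0, "model_b": 1200.0, "model_c": 1100.0}
solved = {
    "model_a": {"problem/1", "problem/2"},
    "model_b": {"problem/1", "problem/3"},
    "model_c": {"problem/1"},
}
problems = ["problem/1", "problem/2", "problem/3", "problem/4"]


def solvers(problem):
    """Return the models that solve the given problem."""
    return [m for m in solved if problem in solved[m]]


def min_elo(problem):
    """Lowest Elo among the solvers (assumed meaning of min_elo); None if unsolved."""
    ratings = [elo[m] for m in solvers(problem)]
    return min(ratings) if ratings else None


# Problems that no model solves.
unsolved = [p for p in problems if not solvers(p)]

# Problems solved by exactly one model, mapped to that model.
single = {p: solvers(p)[0] for p in problems if len(solvers(p)) == 1}
```

For a problem solved by a single model, min_elo is simply that model's Elo, which is why the table shows one model and one rating per row.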

Suspect problems

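These are the 10 problems whose results correlate least with the overall evaluation. A negative tau means that better models tend to do worse on the problem; a tau near zero means the problem barely separates strong models from weak ones.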

example_link    acc      tau
HumanEval/55    0.923   -0.209
HumanEval/54    0.167   -0.081
HumanEval/53    0.987    0.021
HumanEval/22    0.987    0.021
HumanEval/23    0.987    0.021
HumanEval/35    0.974    0.042
HumanEval/126   0.038    0.070
HumanEval/137   0.013    0.073
HumanEval/26    0.154    0.074
HumanEval/108   0.013    0.081
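One way to obtain a per-problem correlation like the tau column is a rank correlation between each model's overall rating and whether it passed the problem. The sketch below assumes tau is Kendall's tau-b (which handles the many ties in pass/fail data) and that the overall evaluation is a single rating per model; the page does not state either, and the ratings and outcomes shown are made up.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length sequences, with tie correction."""
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither term
        elif dx == 0:
            tied_x += 1
        elif dy == 0:
            tied_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x) * (untied + tied_y))
    # With no variation in one variable the correlation is undefined; report 0.0.
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical ratings for four models, and pass (1) / fail (0) on one problem.
ratings = [1000, 1100, 1200, 1300]
weak_models_pass = [1, 1, 0, 0]    # only the weaker models pass
strong_models_pass = [0, 0, 1, 1]  # only the stronger models pass

tau_suspect = kendall_tau_b(ratings, weak_models_pass)
tau_healthy = kendall_tau_b(ratings, strong_models_pass)
```

Here tau_suspect is negative and tau_healthy is positive with the same magnitude. A problem that almost every model passes or almost every model fails has little variation to correlate, which is one plausible reason problems with acc near 0 or 1 appear in the table with tau close to zero.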

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
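As a rough sketch of how such a histogram can be binned, the snippet below counts problems in ten equal-width accuracy buckets. It uses only the ten acc values from the suspect-problems table as sample input, so it does not reproduce the full histogram; the bin count and the helper name are assumptions.

```python
from collections import Counter


def histogram(values, bins=10):
    """Count values in [0, 1] per equal-width bin; 1.0 falls into the last bin."""
    counts = Counter(min(int(v * bins), bins - 1) for v in values)
    return [counts.get(i, 0) for i in range(bins)]


# acc values copied from the suspect-problems table (a sample, not all problems).
sample_acc = [0.923, 0.167, 0.987, 0.987, 0.987, 0.974, 0.038, 0.013, 0.154, 0.013]

bin_counts = histogram(sample_acc)
```

For this sample the counts cluster at both ends: three problems fall in the 0.0 to 0.1 bucket, two in 0.1 to 0.2, and five in 0.9 to 1.0.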

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.