There are 6 examples not solved by any model.
If these are good problems, solving some of them is a strong signal that your model really is better than the leading models.
HumanEval/129, HumanEval/130, HumanEval/132, HumanEval/145, HumanEval/163, HumanEval/32
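
As a rough illustration of how a list like the one above could be derived, the sketch below scans a pass/fail table for problems that no model solved. The `results` layout, the model names, and the problem IDs are hypothetical placeholders, not the report's actual data or code.

```python
# Hypothetical pass/fail table: results[model][problem] is True if the model solved it.
results = {
    "model_a": {"HumanEval/1": True, "HumanEval/2": False, "HumanEval/3": False},
    "model_b": {"HumanEval/1": True, "HumanEval/2": True, "HumanEval/3": False},
}


def unsolved_problems(results):
    """Return the sorted list of problems that no model solved."""
    problems = set()
    for per_problem in results.values():
        problems.update(per_problem)
    return sorted(
        p
        for p in problems
        # A problem missing from a model's record is treated as not solved.
        if not any(per_problem.get(p, False) for per_problem in results.values())
    )


print(unsolved_problems(results))
```
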
example_link | model | min_elo |
---|---|---|
HumanEval/93 | xwincoder-34b | 1368.434 |
HumanEval/108 | claude-3-sonnet-20240229 | 1264.638 |
HumanEval/137 | Qwen--Qwen1.5-72B-Chat | 1247.401 |
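
One plausible reading of the `min_elo` column is the Elo rating of the weakest model that still solves the problem. The sketch below computes that under this assumption; the `results` and `elo` dictionaries and all values in them are made up for illustration.

```python
# Hypothetical inputs: pass/fail per model and one Elo rating per model.
results = {
    "model_a": {"HumanEval/1": True, "HumanEval/2": False, "HumanEval/3": False},
    "model_b": {"HumanEval/1": True, "HumanEval/2": True, "HumanEval/3": False},
}
elo = {"model_a": 1000.0, "model_b": 1200.0}


def min_elo_to_solve(results, elo):
    """Map each solved problem to (lowest-rated solving model, that model's Elo)."""
    best = {}
    for model, per_problem in results.items():
        for problem, passed in per_problem.items():
            if not passed:
                continue
            # Keep the solver with the lowest rating seen so far.
            if problem not in best or elo[model] < best[problem][1]:
                best[problem] = (model, elo[model])
    return best


for problem, (model, rating) in sorted(min_elo_to_solve(results, elo).items()):
    print(f"{problem} | {model} | {rating:.3f} |")
```

Problems that no model solves simply do not appear in the result, which matches listing them separately.
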
These are the 10 problems with the lowest correlation with the overall evaluation (i.e. better models are not more likely to solve these, and where tau is negative they tend to do worse).
example_link | acc | tau |
---|---|---|
HumanEval/55 | 0.923 | -0.209 |
HumanEval/54 | 0.167 | -0.081 |
HumanEval/53 | 0.987 | 0.021 |
HumanEval/22 | 0.987 | 0.021 |
HumanEval/23 | 0.987 | 0.021 |
HumanEval/35 | 0.974 | 0.042 |
HumanEval/126 | 0.038 | 0.070 |
HumanEval/137 | 0.013 | 0.073 |
HumanEval/26 | 0.154 | 0.074 |
HumanEval/108 | 0.013 | 0.081 |
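
The `tau` column looks like a rank correlation between per-problem outcomes and overall model quality. The sketch below assumes it is Kendall's tau-b, computed between each model's overall score and its 0/1 result on one problem; the report's exact variant and inputs are not stated, so treat this as an illustration with hypothetical numbers.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences, with tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither count
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical data: overall score per model and pass (1) / fail (0) on one problem.
overall = [0.2, 0.5, 0.8, 0.9]
passed = [1, 1, 0, 0]  # the weaker models pass, the stronger ones fail

acc = sum(passed) / len(passed)       # fraction of models solving the problem
tau = kendall_tau_b(overall, passed)  # negative here: stronger models do worse
print(f"acc={acc:.3f} tau={tau:.3f}")
```

Because the pass/fail variable is binary, ties are common, which is why a tie-corrected variant is used in this sketch. A problem that almost every model passes or fails gives a tau near zero regardless of model quality.
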
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.