There are 6 examples not solved by any model:
HumanEval/129, HumanEval/130, HumanEval/132, HumanEval/145, HumanEval/163, HumanEval/32
If these are sound problems, solving some of them is a good signal that your model is better than the leading models.
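A list like the one above can be derived from a table of per-model pass/fail results. The following is a minimal sketch, not the report's actual code; the `results` layout, the model names, and the `task/N` problem ids are hypothetical toy data.

```python
# Hypothetical layout: results[model][problem] is True if the model solved it.
results = {
    "model_a": {"task/1": True, "task/2": False, "task/3": False},
    "model_b": {"task/1": True, "task/2": True, "task/3": False},
}


def unsolved_problems(results):
    """Return the sorted list of problems that no model solved."""
    problems = set()
    for per_problem in results.values():
        problems.update(per_problem)
    return sorted(
        p
        for p in problems
        if not any(per_problem.get(p, False) for per_problem in results.values())
    )


unsolved = unsolved_problems(results)
print(unsolved)
```

A problem missing from a model's results is treated as unsolved by that model, which is a design choice of this sketch.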
| example_link | model | min_elo |
|---|---|---|
| HumanEval/93 | xwincoder-34b | 1097.560 |
| HumanEval/108 | claude-3-sonnet-20240229 | 1078.808 |
| HumanEval/137 | Qwen--Qwen1.5-72B-Chat | 1069.583 |
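The `min_elo` column can be read as the rating of the lowest-rated model that solved each problem. That reading is an assumption, and the sketch below illustrates it with hypothetical toy data (`results`, `elo`, and the `task/N` ids are not from the report).

```python
# Hypothetical toy data: per-model pass/fail results and Elo-style ratings.
results = {
    "model_a": {"task/1": True, "task/2": False},
    "model_b": {"task/1": True, "task/2": True},
}
elo = {"model_a": 1000.0, "model_b": 1100.0}


def min_elo_to_solve(results, elo):
    """Map each solved problem to (model, rating) of its lowest-rated solver."""
    best = {}
    for model, per_problem in results.items():
        for problem, solved in per_problem.items():
            if not solved:
                continue
            if problem not in best or elo[model] < best[problem][1]:
                best[problem] = (model, elo[model])
    return best


min_elo = min_elo_to_solve(results, elo)

# List the hardest problems first, i.e. those needing the highest minimum rating.
for problem, (model, rating) in sorted(
    min_elo.items(), key=lambda item: item[1][1], reverse=True
):
    print(f"| {problem} | {model} | {rating:.3f} |")
```

Problems that no model solved do not appear in the result, since they have no minimum rating.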
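These are the 10 problems whose results correlate least with the overall evaluation. A negative tau means better models tend to do worse on that problem; a tau near zero means the problem barely separates stronger models from weaker ones.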
| example_link | acc | tau |
|---|---|---|
| HumanEval/55 | 0.923 | -0.209 |
| HumanEval/54 | 0.167 | -0.081 |
| HumanEval/23 | 0.987 | 0.021 |
| HumanEval/53 | 0.987 | 0.021 |
| HumanEval/22 | 0.987 | 0.021 |
| HumanEval/35 | 0.974 | 0.042 |
| HumanEval/126 | 0.038 | 0.070 |
| HumanEval/137 | 0.013 | 0.073 |
| HumanEval/26 | 0.154 | 0.074 |
| HumanEval/108 | 0.013 | 0.081 |
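The report does not say how `tau` is computed. One plausible reading, assumed here, is Kendall's tau-b between each model's pass/fail outcome on a problem and that model's overall score. The sketch below implements tau-b in plain Python; the scores and outcomes are hypothetical toy values.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation, which accounts for ties in x and y."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: contributes to neither count
            elif dx == 0:
                ties_x += 1  # tied only in x
            elif dy == 0:
                ties_y += 1  # tied only in y
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical toy data: four models ordered by overall score, and whether
# each one solved a given problem (1 = solved, 0 = failed).
overall_score = [10, 20, 30, 40]
solved = [1, 1, 0, 0]  # only the weaker models solve this problem

tau = kendall_tau_b(overall_score, solved)
print(f"tau = {tau:.3f}")
```

Because pass/fail outcomes are binary, they produce many ties, which is why the sketch uses the tie-corrected tau-b variant rather than tau-a.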
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.