There are 7 examples not solved by any model.
If these are good problems, solving some of them is a strong signal that your model is genuinely better than the leading models.
HumanEval/129, HumanEval/130, HumanEval/132, HumanEval/145, HumanEval/163, HumanEval/32, HumanEval/91
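As a concrete starting point, here is a minimal sketch of how such a list can be found from raw results, assuming pass/fail outcomes are available as a dict mapping each model name to the set of HumanEval task IDs it solves (the structure and names are hypothetical, not the evaluation's actual format):

```python
# Hypothetical structure: results[model_name] = set of HumanEval task IDs solved.
ALL_TASKS = {f"HumanEval/{i}" for i in range(164)}

def unsolved_problems(results: dict[str, set[str]]) -> set[str]:
    """Return the task IDs that no model in `results` solves."""
    solved_by_any = set().union(*results.values()) if results else set()
    return ALL_TASKS - solved_by_any
```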
For the hardest problems that are solved by at least one model, the weakest model that solved each problem and that model's Elo rating (min Elo):

problem | model | min Elo |
---|---|---|
HumanEval/140 | speechless-codellama-34b | 1225.298 |
HumanEval/124 | code-millenials-34b | 1200.529 |
HumanEval/93 | xwincoder-34b | 1170.432 |
HumanEval/76 | openchat | 1153.939 |
HumanEval/108 | claude-3-sonnet-20240229 | 1083.657 |
HumanEval/137 | Qwen--Qwen1.5-72B-Chat | 1022.575 |
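A sketch of how the min Elo column could be derived, assuming the same hypothetical `results` dict plus a per-model Elo dict (both names are assumptions): for each solved problem, keep the weakest model that solves it.

```python
def min_elo_to_solve(
    results: dict[str, set[str]], elo: dict[str, float]
) -> dict[str, tuple[str, float]]:
    """Map each solved task ID to (weakest solving model, its Elo rating)."""
    weakest: dict[str, tuple[str, float]] = {}
    for model, solved in results.items():
        for task in solved:
            # Keep this model if it is the lowest-rated solver seen so far.
            if task not in weakest or elo[model] < weakest[task][1]:
                weakest[task] = (model, elo[model])
    return weakest
```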
These are the 10 problems with the lowest correlation with the overall evaluation (i.e. better models tend to do worse on these). Here acc is the fraction of models that solve the problem, and tau is the rank correlation between solving the problem and a model's overall score.

problem | acc | tau |
---|---|---|
HumanEval/54 | 0.163 | -0.135 |
HumanEval/154 | 0.224 | -0.046 |
HumanEval/137 | 0.020 | -0.004 |
HumanEval/122 | 0.061 | 0.010 |
HumanEval/83 | 0.041 | 0.024 |
HumanEval/47 | 0.939 | 0.025 |
HumanEval/108 | 0.020 | 0.042 |
HumanEval/126 | 0.041 | 0.051 |
HumanEval/11 | 0.837 | 0.070 |
HumanEval/65 | 0.408 | 0.078 |
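A sketch of how acc and tau could be computed per problem, again under the hypothetical `results` structure and reusing `ALL_TASKS` from the first sketch; Kendall's tau via scipy is used here as one plausible choice of rank correlation.

```python
from scipy.stats import kendalltau

def problem_stats(results: dict[str, set[str]], task: str) -> tuple[float, float]:
    """Return (acc, tau) for one task: the fraction of models solving it, and the
    rank correlation between solving it and each model's overall accuracy."""
    models = sorted(results)
    correct = [1.0 if task in results[m] else 0.0 for m in models]
    overall = [len(results[m]) / len(ALL_TASKS) for m in models]
    acc = sum(correct) / len(models)
    tau, _ = kendalltau(correct, overall)
    return acc, tau
```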
Histogram of problems by per-problem accuracy.
Histogram of problems by the minimum Elo required to solve each problem.
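The two histograms could be produced with something like the following, reusing the hypothetical helpers above; the bin count and figure size are arbitrary choices.

```python
import matplotlib.pyplot as plt

def plot_histograms(results: dict[str, set[str]], elo: dict[str, float]) -> None:
    accs = [problem_stats(results, t)[0] for t in sorted(ALL_TASKS)]
    min_elos = [rating for _, rating in min_elo_to_solve(results, elo).values()]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.hist(accs, bins=20)
    ax1.set(xlabel="per-problem accuracy", ylabel="number of problems")
    ax2.hist(min_elos, bins=20)
    ax2.set(xlabel="minimum Elo to solve", ylabel="number of problems")
    fig.tight_layout()
    plt.show()
```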