There are 7 examples not solved by any model.
If these are good problems, solving some of them is a strong signal that your model is genuinely better than the leading models.
HumanEval/129, HumanEval/130, HumanEval/132, HumanEval/145, HumanEval/163, HumanEval/32, HumanEval/91
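As a concrete starting point, here is a minimal sketch of how such a list can be found from raw results, assuming pass/fail outcomes are available as a dict mapping each model name to the set of HumanEval task IDs it solves (the structure and names are hypothetical, not the evaluation's actual format):

```python
# Hypothetical structure: results[model_name] = set of HumanEval task IDs solved.
ALL_TASKS = {f"HumanEval/{i}" for i in range(164)}

def unsolved_problems(results: dict[str, set[str]]) -> set[str]:
    """Return the task IDs that no model in `results` solves."""
    solved_by_any = set().union(*results.values()) if results else set()
    return ALL_TASKS - solved_by_any
```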
For the hardest problems that are solved by at least one model, the weakest model that solved each problem and that model's Elo rating (min Elo):

problem | model | min Elo |
---|---|---|
HumanEval/140 | speechless-codellama-34b | 1225.298 |
HumanEval/124 | code-millenials-34b | 1200.529 |
HumanEval/93 | xwincoder-34b | 1170.432 |
HumanEval/76 | openchat | 1153.939 |
HumanEval/108 | claude-3-sonnet-20240229 | 1083.657 |
HumanEval/137 | Qwen--Qwen1.5-72B-Chat | 1022.575 |
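A sketch of how the min Elo column could be derived, assuming the same hypothetical `results` dict plus a per-model Elo dict (both names are assumptions): for each solved problem, keep the weakest model that solves it.

```python
def min_elo_to_solve(
    results: dict[str, set[str]], elo: dict[str, float]
) -> dict[str, tuple[str, float]]:
    """Map each solved task ID to (weakest solving model, its Elo rating)."""
    weakest: dict[str, tuple[str, float]] = {}
    for model, solved in results.items():
        for task in solved:
            # Keep this model if it is the lowest-rated solver seen so far.
            if task not in weakest or elo[model] < weakest[task][1]:
                weakest[task] = (model, elo[model])
    return weakest
```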
These are the 10 problems with the lowest correlation with the overall evaluation (i.e. better models tend to do worse on these). Here acc is the fraction of models that solve the problem, and tau is the rank correlation between solving the problem and a model's overall score.

problem | acc | tau |
---|---|---|
HumanEval/54 | 0.163 | -0.135 |
HumanEval/154 | 0.224 | -0.046 |
HumanEval/137 | 0.020 | -0.004 |
HumanEval/122 | 0.061 | 0.010 |
HumanEval/83 | 0.041 | 0.024 |
HumanEval/47 | 0.939 | 0.025 |
HumanEval/108 | 0.020 | 0.042 |
HumanEval/126 | 0.041 | 0.051 |
HumanEval/11 | 0.837 | 0.070 |
HumanEval/65 | 0.408 | 0.078 |
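A sketch of how acc and tau could be computed per problem, again under the hypothetical `results` structure and reusing `ALL_TASKS` from the first sketch; Kendall's tau via scipy is used here as one plausible choice of rank correlation.

```python
from scipy.stats import kendalltau

def problem_stats(results: dict[str, set[str]], task: str) -> tuple[float, float]:
    """Return (acc, tau) for one task: the fraction of models solving it, and the
    rank correlation between solving it and each model's overall accuracy."""
    models = sorted(results)
    correct = [1.0 if task in results[m] else 0.0 for m in models]
    overall = [len(results[m]) / len(ALL_TASKS) for m in models]
    acc = sum(correct) / len(models)
    tau, _ = kendalltau(correct, overall)
    return acc, tau
```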
Histogram of problems by per-problem accuracy.
Histogram of problems by the minimum Elo required to solve each problem.
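The two histograms could be produced with something like the following, reusing the hypothetical helpers above; the bin count and figure size are arbitrary choices.

```python
import matplotlib.pyplot as plt

def plot_histograms(results: dict[str, set[str]], elo: dict[str, float]) -> None:
    accs = [problem_stats(results, t)[0] for t in sorted(ALL_TASKS)]
    min_elos = [rating for _, rating in min_elo_to_solve(results, elo).values()]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.hist(accs, bins=20)
    ax1.set(xlabel="per-problem accuracy", ylabel="number of problems")
    ax2.hist(min_elos, bins=20)
    ax2.set(xlabel="minimum Elo to solve", ylabel="number of problems")
    fig.tight_layout()
    plt.show()
```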