CRUXEval-input: by examples

Not solved by any model

There are 17 examples that no model solves. If these problems are sound, solving some of them is a good signal that your model is indeed better than the leading models.
CRUXEval-input/112, CRUXEval-input/113, CRUXEval-input/128, CRUXEval-input/129, CRUXEval-input/177, CRUXEval-input/179, CRUXEval-input/185, CRUXEval-input/218, CRUXEval-input/220, CRUXEval-input/236, CRUXEval-input/259, CRUXEval-input/413, CRUXEval-input/423, CRUXEval-input/444, CRUXEval-input/501, CRUXEval-input/545, CRUXEval-input/581
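
A list like the one above can be derived from per-model pass/fail records. The following is a minimal sketch, assuming the results are available as a mapping from model name to per-example booleans; the `results` layout, the function name, and the toy IDs are hypothetical, not the report's actual data format.

```python
def unsolved_examples(results):
    """Return the example IDs that no model solves.

    `results` is assumed to look like {model_name: {example_id: bool}}.
    """
    all_examples = set()
    solved = set()
    for per_example in results.values():
        for example_id, passed in per_example.items():
            all_examples.add(example_id)
            if passed:
                solved.add(example_id)
    return sorted(all_examples - solved)


# Toy data for illustration only.
results = {
    "model_a": {"ex/1": True, "ex/2": False, "ex/3": False},
    "model_b": {"ex/1": False, "ex/2": False, "ex/3": True},
}
unsolved = unsolved_examples(results)
print(len(unsolved), unsolved)
```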

Problems solved by 1 model only

example_link	model	min_elo
CRUXEval-input/250	gpt-4-0613+cot	1313.953
CRUXEval-input/729	llama3-405-cot	1252.838
CRUXEval-input/229	llama3-405-cot	1252.838
CRUXEval-input/474	gpt-4-0613	1239.906
CRUXEval-input/770	phind	973.455
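
The table above could be produced along these lines. This is a sketch under the assumption that `min_elo` is the lowest Elo rating among the models that solve a problem, which for a single solver is that model's own rating. The data layout, function name, and Elo values below are made up for illustration.

```python
def solved_by_one_model(results, elo):
    """List (example_id, model, elo) for examples with exactly one solver.

    `results` is assumed to be {model: {example_id: bool}} and
    `elo` a {model: rating} mapping; both are hypothetical layouts.
    """
    solvers = {}
    for model, per_example in results.items():
        for example_id, passed in per_example.items():
            if passed:
                solvers.setdefault(example_id, []).append(model)
    rows = [
        (example_id, models[0], elo[models[0]])
        for example_id, models in solvers.items()
        if len(models) == 1
    ]
    # Highest rating first, matching the ordering of the table.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Toy data for illustration only.
results = {
    "model_a": {"ex/1": True, "ex/2": True, "ex/3": False},
    "model_b": {"ex/1": True, "ex/2": False, "ex/3": False},
    "model_c": {"ex/1": False, "ex/2": False, "ex/3": True},
}
elo = {"model_a": 1300.0, "model_b": 1100.0, "model_c": 1000.0}
rows = solved_by_one_model(results, elo)
for row in rows:
    print(*row, sep="\t")
```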

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them).

example_link	acc	tau
CRUXEval-input/531	0.590	-0.506
CRUXEval-input/222	0.641	-0.495
CRUXEval-input/660	0.718	-0.469
CRUXEval-input/242	0.821	-0.447
CRUXEval-input/233	0.513	-0.433
CRUXEval-input/28	0.769	-0.429
CRUXEval-input/373	0.590	-0.410
CRUXEval-input/598	0.821	-0.344
CRUXEval-input/199	0.564	-0.327
CRUXEval-input/745	0.513	-0.286
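
One plausible way to obtain a `tau` column like this is a rank correlation between each model's pass/fail result on a problem and that model's overall score. The sketch below assumes Kendall's tau-b, which handles the many ties that binary outcomes produce; the report does not say which variant it uses, and the function name and toy numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (handles ties)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes nothing
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + tied_x_only)
        * (concordant + discordant + tied_y_only)
    )
    return (concordant - discordant) / denom if denom else 0.0


# Toy data: overall score per model, and pass (1) / fail (0) on one problem.
overall = [0.9, 0.8, 0.7, 0.6]
passed = [0, 0, 1, 1]  # the stronger models fail, the weaker ones pass
tau = kendall_tau_b(overall, passed)
print(f"tau = {tau:.3f}")  # negative: better models do worse here
```

A strongly negative value flags a problem as suspect because it ranks models in the opposite order from the benchmark as a whole.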

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
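
As a rough sketch of the data behind this histogram, per-problem accuracy can be taken as the fraction of models that solve the problem and then counted into equal-width bins. The `results` layout, function names, and bin count are assumptions chosen for illustration.

```python
from collections import Counter


def accuracy_per_example(results):
    """Fraction of models that solve each example.

    `results` is assumed to be {model: {example_id: bool}}.
    """
    attempts, correct = Counter(), Counter()
    for per_example in results.values():
        for example_id, passed in per_example.items():
            attempts[example_id] += 1
            correct[example_id] += 1 if passed else 0
    return {ex: correct[ex] / attempts[ex] for ex in attempts}


def histogram(values, n_bins=10):
    """Count values in [0, 1] into n_bins equal-width bins."""
    counts = [0] * n_bins
    for value in values:
        # A value of exactly 1.0 goes into the last bin.
        counts[min(int(value * n_bins), n_bins - 1)] += 1
    return counts


# Toy data for illustration only.
results = {
    "model_a": {"ex/1": True, "ex/2": True, "ex/3": False},
    "model_b": {"ex/1": True, "ex/2": False, "ex/3": False},
}
acc = accuracy_per_example(results)
bins = histogram(acc.values(), n_bins=4)
print(acc, bins)
```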

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.
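
The quantity being binned here can be sketched as follows, assuming that the minimum Elo to solve a problem is the lowest rating among the models that solve it, and that problems no model solves are left out. The names, the 100-point bin width, and the ratings are illustrative assumptions.

```python
from collections import Counter


def min_elo_to_solve(results, elo):
    """Lowest rating among the models that solve each example.

    `results` is assumed to be {model: {example_id: bool}} and
    `elo` a {model: rating} mapping. Unsolved examples are omitted.
    """
    lowest = {}
    for model, per_example in results.items():
        for example_id, passed in per_example.items():
            if passed:
                current = lowest.get(example_id, float("inf"))
                lowest[example_id] = min(current, elo[model])
    return lowest


def bin_by_width(values, width=100):
    """Count values into bins keyed by each bin's lower edge."""
    counts = Counter(int(value // width) * width for value in values)
    return dict(sorted(counts.items()))


# Toy data for illustration only.
results = {
    "weak": {"ex/1": True, "ex/2": False, "ex/3": False},
    "strong": {"ex/1": True, "ex/2": True, "ex/3": False},
}
elo = {"weak": 950.0, "strong": 1250.0}
difficulty = min_elo_to_solve(results, elo)
elo_bins = bin_by_width(difficulty.values())
print(difficulty, elo_bins)
```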