Not solved by any model

There are 110 examples not solved by any model. Provided these are sound problems, solving some of them is a good signal that your model is genuinely better than the leading models.
DS/106, DS/107, DS/108, DS/121, DS/122, DS/142, DS/15, DS/159, DS/165, DS/173, DS/174, DS/197, DS/202, DS/203, DS/204, DS/205, DS/208, DS/209, DS/210, DS/211, DS/216, DS/225, DS/228, DS/240, DS/242, DS/245, DS/269, DS/270, DS/272, DS/284, DS/285, DS/286, DS/29, DS/318, DS/319, DS/328, DS/339, DS/345, DS/354, DS/372, DS/375, DS/385, DS/387, DS/389, DS/390, DS/394, DS/40, DS/407, DS/408, DS/410, DS/42, DS/420, DS/421, DS/43, DS/439, DS/44, DS/45, DS/46, DS/468, DS/509, DS/516, DS/521, DS/526, DS/54, DS/57, DS/58, DS/59, DS/60, DS/612, DS/626, DS/638, DS/65, DS/672, DS/699, DS/701, DS/726, DS/73, DS/74, DS/747, DS/749, DS/75, DS/750, DS/751, DS/755, DS/773, DS/779, DS/780, DS/789, DS/790, DS/798, DS/80, DS/808, DS/809, DS/81, DS/86, DS/877, DS/879, DS/88, DS/883, DS/884, DS/885, DS/9, DS/900, DS/901, DS/904, DS/905, DS/927, DS/96, DS/987, DS/993
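
A minimal sketch of how a list like the one above could be derived, assuming per-model results are available as a mapping from model name to the set of problems that model solved. The `solved_by` mapping, the function name, and the toy data are hypothetical and are not taken from the actual evaluation.

```python
def unsolved_problems(all_problems, solved_by):
    """Return the problems that no model solved, sorted as strings."""
    # Union of everything solved by at least one model.
    solved_anywhere = set().union(*solved_by.values())
    return sorted(set(all_problems) - solved_anywhere)


# Hypothetical toy data, not real benchmark results.
all_problems = ["DS/1", "DS/2", "DS/3", "DS/4"]
solved_by = {
    "model-a": {"DS/1", "DS/2"},
    "model-b": {"DS/2"},
}

print(unsolved_problems(all_problems, solved_by))  # ['DS/3', 'DS/4']
```

The plain string sort gives lexicographic order, which is the order the list above appears to use (for example, DS/15 comes after DS/142).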

Problems solved by 1 model only

example_link	model	min_elo
DS/253	gpt-4-turbo-2024-04-09	1199.141
DS/304	gpt-4-turbo-2024-04-09	1199.141
DS/505	gpt-4-turbo-2024-04-09	1199.141
DS/280	deepseek-ai-deepseek-coder-V2-SFT	1186.012
DS/56	deepseek-ai-deepseek-coder-V2-SFT	1186.012
DS/131	Qwen-Qwen2-72B-Instruct	1181.847
DS/7	Qwen-Qwen2-72B-Instruct	1181.847
DS/244	Qwen-Qwen2-72B-Instruct	1181.847
DS/744	claude-3-5-sonnet-20240620	1179.604
DS/6	claude-3-5-sonnet-20240620	1179.604
DS/922	claude-3-5-sonnet-20240620	1179.604
DS/105	claude-3-5-sonnet-20240620	1179.604
DS/926	claude-3-5-sonnet-20240620	1179.604
DS/488	claude-3-5-sonnet-20240620	1179.604
DS/458	claude-3-5-sonnet-20240620	1179.604
DS/984	claude-3-5-sonnet-20240620	1179.604
DS/132	llama3-405	1166.067
DS/596	llama3-405	1166.067
DS/418	mistralai-Codestral-22B-v0.1	1164.963
DS/784	gpt-4-0613	1151.388
DS/765	gpt-4-0613	1151.388
DS/154	gpt-4-0613	1151.388
DS/807	gpt-4-0613	1151.388
DS/39	gpt-4-0613	1151.388
DS/362	meta-llama-Llama-3-70b-chat-hf	1115.769
DS/772	meta-llama-Llama-3-70b-chat-hf	1115.769
DS/347	meta-llama-Llama-3-70b-chat-hf	1115.769
DS/346	meta-llama-Llama-3-70b-chat-hf	1115.769
DS/776	meta-llama-Llama-3-70b-chat-hf	1115.769
DS/781	meta-llama-Llama-3-70b-chat-hf	1115.769
DS/373	deepseek-ai-deepseek-coder-V2-Base	1105.770
DS/679	microsoft-wavecoder-ultra-6.7b	1090.762
DS/515	meta-llama-Llama-3-70B	1033.482
DS/26	deepseek-ai-deepseek-llm-67b-chat	1026.171
DS/799	Phind-Phind-CodeLlama-34B-v2	1022.473
DS/997	codellama-CodeLlama-34b-Python-hf	1005.280
DS/998	codellama-CodeLlama-34b-Python-hf	1005.280
DS/411	codellama-CodeLlama-70b-Python-hf	1004.570
DS/447	m-a-p-OpenCodeInterpreter-CL-7B	1002.023
DS/813	gpt-3.5-turbo-0613	1000.000
DS/172	deepseek-ai-deepseek-V2-chat	997.043
DS/953	microsoft-Phi-3-small-8k-instruct	993.383
DS/766	m-a-p-OpenCodeInterpreter-SC2-7B	975.822
DS/604	m-a-p-OpenCodeInterpreter-SC2-7B	975.822
DS/775	WizardLM-WizardCoder-Python-34B-V1.0	973.230
DS/67	Qwen-Qwen1.5-72B-Chat	965.496
DS/90	Qwen-Qwen1.5-72B-Chat	965.496
DS/263	ibm-granite-granite-34b-code-base	952.891
DS/899	meta-llama-Llama-3-8B	909.209
DS/161	codellama-CodeLlama-7b-hf	770.446
DS/64	ERNIE-Speed-8K	489.968
DS/523	google-gemma-1.1-2b-it	487.580
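
The table above pairs each problem with its only solver and that solver's Elo rating. The following is a sketch of one way such rows, and the more general "minimum Elo to solve" value, could be computed. It assumes a hypothetical `solved_by` mapping (model name to set of solved problems) and a hypothetical `elo` mapping (model name to rating); the toy values are illustrative only.

```python
def solvers_per_problem(solved_by):
    """Invert model -> problems into problem -> list of solving models."""
    solvers = {}
    for model, problems in solved_by.items():
        for problem in problems:
            solvers.setdefault(problem, []).append(model)
    return solvers


def single_solver_rows(solved_by, elo):
    """Rows (problem, model, elo) for problems solved by exactly one model,
    ordered by descending Elo of the solving model."""
    rows = [
        (problem, models[0], elo[models[0]])
        for problem, models in solvers_per_problem(solved_by).items()
        if len(models) == 1
    ]
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


def min_elo_to_solve(solved_by, elo):
    """Lowest Elo among the models that solved each problem."""
    return {
        problem: min(elo[m] for m in models)
        for problem, models in solvers_per_problem(solved_by).items()
    }


# Hypothetical toy data, not real benchmark results.
solved_by = {
    "model-a": {"DS/1", "DS/2"},
    "model-b": {"DS/2", "DS/3"},
    "model-c": {"DS/4"},
}
elo = {"model-a": 1200.0, "model-b": 1000.0, "model-c": 900.0}

for problem, model, rating in single_solver_rows(solved_by, elo):
    print(f"{problem}\t{model}\t{rating:.3f}")
```

For a problem with a single solver, the minimum Elo is simply that model's rating, which is why every row for the same model in the table shares one `min_elo` value.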

Suspect problems

These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them).

example_link	acc	tau
DS/523	0.009	-0.109
DS/64	0.009	-0.103
DS/585	0.019	-0.102
DS/880	0.208	-0.102
DS/611	0.311	-0.101
DS/762	0.189	-0.094
DS/250	0.019	-0.076
DS/161	0.009	-0.043
DS/882	0.113	-0.034
DS/632	0.123	-0.031
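
To illustrate what a negative `tau` means here, the sketch below computes a Kendall rank correlation between model strength and per-problem correctness. The source does not say which variant or implementation was used; the tau-b form (which corrects for ties), the function name, and the toy ratings are assumptions made for illustration.

```python
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b between two equal-length numeric sequences (O(n^2))."""
    n = len(x)
    concordant = discordant = tied_x = tied_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
    total_pairs = n * (n - 1) // 2
    denom = sqrt((total_pairs - tied_x) * (total_pairs - tied_y))
    # Returning 0.0 when one sequence is constant is a choice made for this
    # sketch; the statistic is undefined in that case.
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical toy data: four models, strongest first.
model_elo = [1200.0, 1100.0, 1000.0, 900.0]
# 1 = solved, 0 = not solved. Only the weakest model solves this problem.
solved = [0, 0, 0, 1]

tau = kendall_tau_b(model_elo, solved)
print(f"tau = {tau:.3f}")  # negative: weaker models do better on this problem
```

A problem solved only by a low-rated model produces a negative value, which matches the pattern in the table: DS/523, DS/64, and DS/161 also appear in the single-solver table with low `min_elo`.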

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
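
As a rough sketch of the data behind such a histogram: per-problem accuracy is taken here to be the fraction of evaluated models that solved the problem, and the values are then counted into equal-width bins. The definition of accuracy, the `solved_by` mapping, and the bin count are assumptions for illustration.

```python
def problem_accuracy(all_problems, solved_by):
    """Fraction of models that solved each problem (assumes >= 1 model)."""
    n_models = len(solved_by)
    return {
        problem: sum(problem in solved for solved in solved_by.values()) / n_models
        for problem in all_problems
    }


def histogram_counts(values, n_bins=10, lo=0.0, hi=1.0):
    """Count values into n_bins equal-width bins over [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for value in values:
        # Values equal to hi fall into the last bin.
        index = min(int((value - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical toy data, not real benchmark results.
all_problems = ["DS/1", "DS/2", "DS/3", "DS/4"]
solved_by = {
    "model-a": {"DS/1", "DS/2"},
    "model-b": {"DS/2"},
}

accuracy = problem_accuracy(all_problems, solved_by)
print(histogram_counts(accuracy.values(), n_bins=4))  # [2, 0, 1, 1]
```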

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.

DS1000: by examples
