DS1000: by examples


Not solved by any model

No model solved the following 110 examples. If these problems are sound, solving some of them is a good signal that your model really is better than the leading models.
DS/106, DS/107, DS/108, DS/121, DS/122, DS/142, DS/15, DS/159, DS/165, DS/173, DS/174, DS/197, DS/202, DS/203, DS/204, DS/205, DS/208, DS/209, DS/210, DS/211, DS/216, DS/225, DS/228, DS/240, DS/242, DS/245, DS/269, DS/270, DS/272, DS/284, DS/285, DS/286, DS/29, DS/318, DS/319, DS/328, DS/339, DS/345, DS/354, DS/372, DS/375, DS/385, DS/387, DS/389, DS/390, DS/394, DS/40, DS/407, DS/408, DS/410, DS/42, DS/420, DS/421, DS/43, DS/439, DS/44, DS/45, DS/46, DS/468, DS/509, DS/516, DS/521, DS/526, DS/54, DS/57, DS/58, DS/59, DS/60, DS/612, DS/626, DS/638, DS/65, DS/672, DS/699, DS/701, DS/726, DS/73, DS/74, DS/747, DS/749, DS/75, DS/750, DS/751, DS/755, DS/773, DS/779, DS/780, DS/789, DS/790, DS/798, DS/80, DS/808, DS/809, DS/81, DS/86, DS/877, DS/879, DS/88, DS/883, DS/884, DS/885, DS/9, DS/900, DS/901, DS/904, DS/905, DS/927, DS/96, DS/987, DS/993

Problems solved by only one model

example_link model min_elo
DS/253 gpt-4-turbo-2024-04-09 1199.141
DS/304 gpt-4-turbo-2024-04-09 1199.141
DS/505 gpt-4-turbo-2024-04-09 1199.141
DS/280 deepseek-ai-deepseek-coder-V2-SFT 1186.012
DS/56 deepseek-ai-deepseek-coder-V2-SFT 1186.012
DS/131 Qwen-Qwen2-72B-Instruct 1181.847
DS/7 Qwen-Qwen2-72B-Instruct 1181.847
DS/244 Qwen-Qwen2-72B-Instruct 1181.847
DS/744 claude-3-5-sonnet-20240620 1179.604
DS/6 claude-3-5-sonnet-20240620 1179.604
DS/922 claude-3-5-sonnet-20240620 1179.604
DS/105 claude-3-5-sonnet-20240620 1179.604
DS/926 claude-3-5-sonnet-20240620 1179.604
DS/488 claude-3-5-sonnet-20240620 1179.604
DS/458 claude-3-5-sonnet-20240620 1179.604
DS/984 claude-3-5-sonnet-20240620 1179.604
DS/132 llama3-405 1166.067
DS/596 llama3-405 1166.067
DS/418 mistralai-Codestral-22B-v0.1 1164.963
DS/784 gpt-4-0613 1151.388
DS/765 gpt-4-0613 1151.388
DS/154 gpt-4-0613 1151.388
DS/807 gpt-4-0613 1151.388
DS/39 gpt-4-0613 1151.388
DS/362 meta-llama-Llama-3-70b-chat-hf 1115.769
DS/772 meta-llama-Llama-3-70b-chat-hf 1115.769
DS/347 meta-llama-Llama-3-70b-chat-hf 1115.769
DS/346 meta-llama-Llama-3-70b-chat-hf 1115.769
DS/776 meta-llama-Llama-3-70b-chat-hf 1115.769
DS/781 meta-llama-Llama-3-70b-chat-hf 1115.769
DS/373 deepseek-ai-deepseek-coder-V2-Base 1105.770
DS/679 microsoft-wavecoder-ultra-6.7b 1090.762
DS/515 meta-llama-Llama-3-70B 1033.482
DS/26 deepseek-ai-deepseek-llm-67b-chat 1026.171
DS/799 Phind-Phind-CodeLlama-34B-v2 1022.473
DS/997 codellama-CodeLlama-34b-Python-hf 1005.280
DS/998 codellama-CodeLlama-34b-Python-hf 1005.280
DS/411 codellama-CodeLlama-70b-Python-hf 1004.570
DS/447 m-a-p-OpenCodeInterpreter-CL-7B 1002.023
DS/813 gpt-3.5-turbo-0613 1000.000
DS/172 deepseek-ai-deepseek-V2-chat 997.043
DS/953 microsoft-Phi-3-small-8k-instruct 993.383
DS/766 m-a-p-OpenCodeInterpreter-SC2-7B 975.822
DS/604 m-a-p-OpenCodeInterpreter-SC2-7B 975.822
DS/775 WizardLM-WizardCoder-Python-34B-V1.0 973.230
DS/67 Qwen-Qwen1.5-72B-Chat 965.496
DS/90 Qwen-Qwen1.5-72B-Chat 965.496
DS/263 ibm-granite-granite-34b-code-base 952.891
DS/899 meta-llama-Llama-3-8B 909.209
DS/161 codellama-CodeLlama-7b-hf 770.446
DS/64 ERNIE-Speed-8K 489.968
DS/523 google-gemma-1.1-2b-it 487.580
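
The two lists above can be derived from a per-model pass/fail table. The following is a minimal sketch, not the page's actual code: the function name `summarize`, the input layout (model -> problem -> solved), and the toy models and ratings are all assumptions made for illustration. It takes `min_elo` to mean the Elo of the lowest-rated model that solves a problem.

```python
from typing import Dict, List, Tuple


def summarize(
    passed: Dict[str, Dict[str, bool]],  # hypothetical layout: model -> problem -> solved?
    elo: Dict[str, float],               # hypothetical layout: model -> Elo rating
) -> Tuple[List[str], Dict[str, Tuple[str, float]], Dict[str, float]]:
    """Return (unsolved problems, single-solver problems, min Elo per solved problem)."""
    problems = sorted({p for results in passed.values() for p in results})
    unsolved: List[str] = []
    single_solver: Dict[str, Tuple[str, float]] = {}
    min_elo: Dict[str, float] = {}
    for p in problems:
        solvers = [m for m, results in passed.items() if results.get(p, False)]
        if not solvers:
            unsolved.append(p)
            continue
        # Lowest-rated model among those that solve the problem.
        weakest = min(solvers, key=lambda m: elo[m])
        min_elo[p] = elo[weakest]
        if len(solvers) == 1:
            single_solver[p] = (weakest, elo[weakest])
    return unsolved, single_solver, min_elo


# Toy data, invented for this example only.
toy_elo = {"model_a": 1100.0, "model_b": 1000.0}
toy_passed = {
    "model_a": {"P/1": True, "P/2": True, "P/3": False},
    "model_b": {"P/1": True, "P/2": False, "P/3": False},
}
unsolved, single_solver, min_elo = summarize(toy_passed, toy_elo)
```

In this toy case `P/3` is unsolved, `P/2` is solved only by `model_a`, and `P/1` has a minimum Elo of 1000.0 because the lower-rated `model_b` also solves it.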

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation. A negative tau means that better models tend to do worse on the problem.

example_link acc tau
DS/523 0.009 -0.109
DS/64 0.009 -0.103
DS/585 0.019 -0.102
DS/880 0.208 -0.102
DS/611 0.311 -0.101
DS/762 0.189 -0.094
DS/250 0.019 -0.076
DS/161 0.009 -0.043
DS/882 0.113 -0.034
DS/632 0.123 -0.031
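
One way to compute such a per-problem correlation is sketched below. The page only labels the column `tau`; this sketch assumes it is Kendall's tau-b between each model's pass/fail result on the problem and that model's overall score, which may differ from what the page actually uses. The function name and the toy numbers are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau_b(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kendall's tau-b; returns NaN when either variable is constant."""
    concordant = discordant = ties_x = ties_y = n_pairs = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        n_pairs += 1
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1  # pair tied in x (a pair can be tied in both)
        if dy == 0:
            ties_y += 1  # pair tied in y
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data, invented for this example: four models ordered by overall score.
overall_score = [1.0, 2.0, 3.0, 4.0]
solved_by_strong = [0, 0, 1, 1]  # only the stronger models solve it
solved_by_weak = [1, 1, 0, 0]    # only the weaker models solve it

tau_healthy = kendall_tau_b(solved_by_strong, overall_score)
tau_suspect = kendall_tau_b(solved_by_weak, overall_score)
```

Here `tau_healthy` is positive and `tau_suspect` is its negative, matching the reading that a negative tau flags problems on which weaker models outperform stronger ones. Pass/fail results are heavily tied, which is why the tie-corrected tau-b variant is used in this sketch.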

Histogram of accuracies

Histogram of problems by the accuracy on each problem.

Histogram of difficulties

Histogram of problems by the minimum Elo needed to solve each problem.