DS1000: by examples


Not solved by any model

There are 112 examples not solved by any model. If these problems are sound, solving some of them is a good signal that your model is genuinely better than the leading models.
DS/106, DS/107, DS/108, DS/121, DS/122, DS/132, DS/142, DS/15, DS/159, DS/165, DS/173, DS/174, DS/197, DS/202, DS/203, DS/204, DS/205, DS/208, DS/209, DS/210, DS/211, DS/216, DS/225, DS/228, DS/240, DS/242, DS/245, DS/269, DS/270, DS/272, DS/284, DS/285, DS/286, DS/29, DS/318, DS/319, DS/328, DS/339, DS/345, DS/354, DS/372, DS/375, DS/385, DS/387, DS/389, DS/390, DS/394, DS/40, DS/407, DS/408, DS/410, DS/42, DS/420, DS/421, DS/43, DS/439, DS/44, DS/45, DS/46, DS/468, DS/509, DS/516, DS/521, DS/526, DS/54, DS/57, DS/58, DS/59, DS/596, DS/60, DS/612, DS/626, DS/638, DS/65, DS/672, DS/699, DS/701, DS/726, DS/73, DS/74, DS/747, DS/749, DS/75, DS/750, DS/751, DS/755, DS/773, DS/779, DS/780, DS/789, DS/790, DS/798, DS/80, DS/808, DS/809, DS/81, DS/86, DS/877, DS/879, DS/88, DS/883, DS/884, DS/885, DS/9, DS/900, DS/901, DS/904, DS/905, DS/927, DS/96, DS/987, DS/993
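
A list like the one above can be derived from per-model results. The following is a minimal sketch, assuming results are available as a mapping from model name to the set of problem IDs that model solved. The `unsolved_problems` helper and the toy data are hypothetical and are not taken from the actual evaluation.

```python
def unsolved_problems(all_problems, results):
    """Return the sorted problem IDs that appear in no model's solved set.

    all_problems: iterable of problem IDs (hypothetical input format)
    results: dict mapping model name -> set of solved problem IDs
    """
    solved_by_any = set().union(*results.values())
    return sorted(set(all_problems) - solved_by_any)


# Hypothetical toy data, not actual DS-1000 results.
all_problems = ["DS/1", "DS/2", "DS/3", "DS/4"]
results = {
    "model-a": {"DS/1", "DS/2"},
    "model-b": {"DS/2"},
}

unsolved = unsolved_problems(all_problems, results)
print(f"{len(unsolved)} examples not solved by any model: {', '.join(unsolved)}")
```

Note that the IDs are sorted as strings, which matches the ordering of the list above (for example, DS/15 sorts between DS/142 and DS/159).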

Problems solved by 1 model only

example_link model min_elo
DS/304 gpt-4-turbo-2024-04-09 1199.368
DS/253 gpt-4-turbo-2024-04-09 1199.368
DS/505 gpt-4-turbo-2024-04-09 1199.368
DS/280 deepseek-ai-deepseek-coder-V2-SFT 1186.180
DS/56 deepseek-ai-deepseek-coder-V2-SFT 1186.180
DS/131 Qwen-Qwen2-72B-Instruct 1182.027
DS/7 Qwen-Qwen2-72B-Instruct 1182.027
DS/244 Qwen-Qwen2-72B-Instruct 1182.027
DS/922 claude-3-5-sonnet-20240620 1179.459
DS/926 claude-3-5-sonnet-20240620 1179.459
DS/105 claude-3-5-sonnet-20240620 1179.459
DS/744 claude-3-5-sonnet-20240620 1179.459
DS/79 claude-3-5-sonnet-20240620 1179.459
DS/6 claude-3-5-sonnet-20240620 1179.459
DS/458 claude-3-5-sonnet-20240620 1179.459
DS/488 claude-3-5-sonnet-20240620 1179.459
DS/984 claude-3-5-sonnet-20240620 1179.459
DS/418 mistralai-Codestral-22B-v0.1 1165.213
DS/784 gpt-4-0613 1151.426
DS/39 gpt-4-0613 1151.426
DS/154 gpt-4-0613 1151.426
DS/807 gpt-4-0613 1151.426
DS/765 gpt-4-0613 1151.426
DS/362 meta-llama-Llama-3-70b-chat-hf 1115.748
DS/781 meta-llama-Llama-3-70b-chat-hf 1115.748
DS/347 meta-llama-Llama-3-70b-chat-hf 1115.748
DS/346 meta-llama-Llama-3-70b-chat-hf 1115.748
DS/776 meta-llama-Llama-3-70b-chat-hf 1115.748
DS/772 meta-llama-Llama-3-70b-chat-hf 1115.748
DS/373 deepseek-ai-deepseek-coder-V2-Base 1105.874
DS/679 microsoft-wavecoder-ultra-6.7b 1090.820
DS/515 meta-llama-Llama-3-70B 1033.511
DS/26 deepseek-ai-deepseek-llm-67b-chat 1025.877
DS/799 Phind-Phind-CodeLlama-34B-v2 1022.515
DS/87 gpt-3.5-turbo-0125 1005.712
DS/998 codellama-CodeLlama-34b-Python-hf 1005.304
DS/997 codellama-CodeLlama-34b-Python-hf 1005.304
DS/411 codellama-CodeLlama-70b-Python-hf 1004.769
DS/447 m-a-p-OpenCodeInterpreter-CL-7B 1001.759
DS/813 gpt-3.5-turbo-0613 1000.000
DS/172 deepseek-ai-deepseek-V2-chat 996.937
DS/953 microsoft-Phi-3-small-8k-instruct 993.444
DS/604 m-a-p-OpenCodeInterpreter-SC2-7B 975.298
DS/766 m-a-p-OpenCodeInterpreter-SC2-7B 975.298
DS/775 WizardLM-WizardCoder-Python-34B-V1.0 973.105
DS/90 Qwen-Qwen1.5-72B-Chat 965.425
DS/67 Qwen-Qwen1.5-72B-Chat 965.425
DS/263 ibm-granite-granite-34b-code-base 952.973
DS/899 meta-llama-Llama-3-8B 909.073
DS/161 codellama-CodeLlama-7b-hf 770.087
DS/64 ERNIE-Speed-8K 489.014
DS/523 google-gemma-1.1-2b-it 487.001
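
The table above could be produced along the following lines. This is a sketch under assumptions: results are a mapping from model name to solved problem IDs, Elo ratings are a mapping from model name to a float, and `min_elo` for a problem with a single solver is simply that solver's rating. The `single_solver_rows` helper and the data are hypothetical.

```python
def single_solver_rows(results, elo):
    """Return (problem, model, min_elo) rows for problems solved by exactly one model.

    results: dict mapping model name -> set of solved problem IDs
    elo: dict mapping model name -> Elo rating
    """
    # Invert the mapping: problem ID -> list of models that solved it.
    solvers = {}
    for model, solved in results.items():
        for pid in solved:
            solvers.setdefault(pid, []).append(model)

    rows = [
        (pid, models[0], elo[models[0]])
        for pid, models in solvers.items()
        if len(models) == 1
    ]
    # Highest Elo first, then by problem ID for a stable order.
    rows.sort(key=lambda row: (-row[2], row[0]))
    return rows


# Hypothetical toy data, not actual DS-1000 results.
results = {
    "strong-model": {"DS/1", "DS/2"},
    "weak-model": {"DS/2", "DS/3"},
}
elo = {"strong-model": 1200.0, "weak-model": 900.0}

rows = single_solver_rows(results, elo)
for pid, model, min_elo in rows:
    print(f"{pid} {model} {min_elo:.3f}")
```

Sorting by descending Elo mirrors the table: rows near the top are problems that only a strong model solved, while rows near the bottom are problems that only a weak model solved and may deserve a closer look.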

Suspect problems

These are the 10 problems whose results correlate least with the overall evaluation (i.e., better models tend to do worse on them).

example_link acc tau
DS/523 0.010 -0.109
DS/64 0.010 -0.104
DS/585 0.019 -0.102
DS/880 0.210 -0.096
DS/611 0.314 -0.094
DS/762 0.190 -0.089
DS/250 0.019 -0.076
DS/161 0.010 -0.043
DS/514 0.076 -0.039
DS/882 0.114 -0.030
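
The page does not state how `tau` is computed. One plausible reading, sketched below, is Kendall's tau-b between each model's overall score and its pass/fail outcome on the problem, with `acc` being the fraction of models that solved it. The function names and the data are hypothetical, and the actual computation behind the table may differ.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation, with the standard correction for ties."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to no term
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x_only)
        * (concordant + discordant + ties_y_only)
    )
    # tau is undefined when one variable is constant; 0.0 is a choice made here.
    return (concordant - discordant) / denom if denom else 0.0


def problem_stats(overall_scores, solved_flags):
    """Return (acc, tau) for one problem across models (assumed definitions)."""
    acc = sum(solved_flags) / len(solved_flags)
    tau = kendall_tau_b(overall_scores, solved_flags)
    return acc, tau


# Hypothetical data: four models ordered from strongest to weakest,
# where only the two weakest models solve the problem.
overall_scores = [0.9, 0.7, 0.5, 0.3]
solved_flags = [0, 0, 1, 1]

acc, tau = problem_stats(overall_scores, solved_flags)
print(f"acc={acc:.3f} tau={tau:.3f}")
```

A negative `tau` under this definition means that weaker models solve the problem more often than stronger ones, which is what makes a problem suspect.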

Histogram of accuracies

Histogram of problems by the accuracy on each problem.

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.
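
As a rough illustration of how such a histogram can be built, the sketch below bins a few `min_elo` values into fixed-width buckets and prints a text bar chart. The `histogram` helper and the bin width of 100 are assumptions made for this example; the sample values are a small subset of the single-solver table, not the full data behind the plot.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per fixed-width bin; returns sorted (bin_left_edge, count) pairs."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    return sorted(counts.items())


# A few min_elo values from the single-solver table (illustrative subset only).
min_elos = [1199.368, 1186.180, 1179.459, 1005.712, 489.014, 487.001]
bin_width = 100  # assumed bin width

bins = histogram(min_elos, bin_width)
for left, count in bins:
    print(f"{left:>5}-{left + bin_width:<5} {'#' * count}")
```

The same helper works for the accuracy histogram by passing per-problem accuracies and a fractional bin width such as 0.1, although floating-point bin edges may then need rounding for display.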