There are 110 examples not solved by any model.
If these problems are sound, solving some of them is a good signal that your model really is better than the leading models.
DS/106, DS/107, DS/108, DS/121, DS/122, DS/142, DS/15, DS/159, DS/165, DS/173, DS/174, DS/197, DS/202, DS/203, DS/204, DS/205, DS/208, DS/209, DS/210, DS/211, DS/216, DS/225, DS/228, DS/240, DS/242, DS/245, DS/269, DS/270, DS/272, DS/284, DS/285, DS/286, DS/29, DS/318, DS/319, DS/328, DS/339, DS/345, DS/354, DS/372, DS/375, DS/385, DS/387, DS/389, DS/390, DS/394, DS/40, DS/407, DS/408, DS/410, DS/42, DS/420, DS/421, DS/43, DS/439, DS/44, DS/45, DS/46, DS/468, DS/509, DS/516, DS/521, DS/526, DS/54, DS/57, DS/58, DS/59, DS/60, DS/612, DS/626, DS/638, DS/65, DS/672, DS/699, DS/701, DS/726, DS/73, DS/74, DS/747, DS/749, DS/75, DS/750, DS/751, DS/755, DS/773, DS/779, DS/780, DS/789, DS/790, DS/798, DS/80, DS/808, DS/809, DS/81, DS/86, DS/877, DS/879, DS/88, DS/883, DS/884, DS/885, DS/9, DS/900, DS/901, DS/904, DS/905, DS/927, DS/96, DS/987, DS/993
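
A list like the one above can be derived from per-model pass/fail results. The following is a minimal sketch, assuming the results are available as a mapping from model name to per-example outcomes; that layout, the function name `unsolved_examples`, and the toy data are illustrative assumptions, not the report's actual pipeline.

```python
def unsolved_examples(results):
    """Return the example IDs that no model solved.

    `results` maps model name -> {example_id: passed (bool)}.
    This data layout is an assumption made for illustration.
    """
    all_examples = set()
    solved = set()
    for per_example in results.values():
        for example_id, passed in per_example.items():
            all_examples.add(example_id)
            if passed:
                solved.add(example_id)
    # Anything seen but never passed by any model is unsolved.
    return sorted(all_examples - solved)


# Toy data: hypothetical model names and outcomes.
toy_results = {
    "model_a": {"DS/1": True, "DS/2": False, "DS/3": False},
    "model_b": {"DS/1": False, "DS/2": True, "DS/3": False},
}
print(unsolved_examples(toy_results))
```

Note that `sorted` on string IDs orders them lexicographically, which would match the ordering seen in the list above (for example, `DS/15` after `DS/142`).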
example_link | model | min_elo |
---|---|---|
DS/253 | gpt-4-turbo-2024-04-09 | 1199.141 |
DS/304 | gpt-4-turbo-2024-04-09 | 1199.141 |
DS/505 | gpt-4-turbo-2024-04-09 | 1199.141 |
DS/280 | deepseek-ai-deepseek-coder-V2-SFT | 1186.012 |
DS/56 | deepseek-ai-deepseek-coder-V2-SFT | 1186.012 |
DS/131 | Qwen-Qwen2-72B-Instruct | 1181.847 |
DS/7 | Qwen-Qwen2-72B-Instruct | 1181.847 |
DS/244 | Qwen-Qwen2-72B-Instruct | 1181.847 |
DS/744 | claude-3-5-sonnet-20240620 | 1179.604 |
DS/6 | claude-3-5-sonnet-20240620 | 1179.604 |
DS/922 | claude-3-5-sonnet-20240620 | 1179.604 |
DS/105 | claude-3-5-sonnet-20240620 | 1179.604 |
DS/926 | claude-3-5-sonnet-20240620 | 1179.604 |
DS/488 | claude-3-5-sonnet-20240620 | 1179.604 |
DS/458 | claude-3-5-sonnet-20240620 | 1179.604 |
DS/984 | claude-3-5-sonnet-20240620 | 1179.604 |
DS/132 | llama3-405 | 1166.067 |
DS/596 | llama3-405 | 1166.067 |
DS/418 | mistralai-Codestral-22B-v0.1 | 1164.963 |
DS/784 | gpt-4-0613 | 1151.388 |
DS/765 | gpt-4-0613 | 1151.388 |
DS/154 | gpt-4-0613 | 1151.388 |
DS/807 | gpt-4-0613 | 1151.388 |
DS/39 | gpt-4-0613 | 1151.388 |
DS/362 | meta-llama-Llama-3-70b-chat-hf | 1115.769 |
DS/772 | meta-llama-Llama-3-70b-chat-hf | 1115.769 |
DS/347 | meta-llama-Llama-3-70b-chat-hf | 1115.769 |
DS/346 | meta-llama-Llama-3-70b-chat-hf | 1115.769 |
DS/776 | meta-llama-Llama-3-70b-chat-hf | 1115.769 |
DS/781 | meta-llama-Llama-3-70b-chat-hf | 1115.769 |
DS/373 | deepseek-ai-deepseek-coder-V2-Base | 1105.770 |
DS/679 | microsoft-wavecoder-ultra-6.7b | 1090.762 |
DS/515 | meta-llama-Llama-3-70B | 1033.482 |
DS/26 | deepseek-ai-deepseek-llm-67b-chat | 1026.171 |
DS/799 | Phind-Phind-CodeLlama-34B-v2 | 1022.473 |
DS/997 | codellama-CodeLlama-34b-Python-hf | 1005.280 |
DS/998 | codellama-CodeLlama-34b-Python-hf | 1005.280 |
DS/411 | codellama-CodeLlama-70b-Python-hf | 1004.570 |
DS/447 | m-a-p-OpenCodeInterpreter-CL-7B | 1002.023 |
DS/813 | gpt-3.5-turbo-0613 | 1000.000 |
DS/172 | deepseek-ai-deepseek-V2-chat | 997.043 |
DS/953 | microsoft-Phi-3-small-8k-instruct | 993.383 |
DS/766 | m-a-p-OpenCodeInterpreter-SC2-7B | 975.822 |
DS/604 | m-a-p-OpenCodeInterpreter-SC2-7B | 975.822 |
DS/775 | WizardLM-WizardCoder-Python-34B-V1.0 | 973.230 |
DS/67 | Qwen-Qwen1.5-72B-Chat | 965.496 |
DS/90 | Qwen-Qwen1.5-72B-Chat | 965.496 |
DS/263 | ibm-granite-granite-34b-code-base | 952.891 |
DS/899 | meta-llama-Llama-3-8B | 909.209 |
DS/161 | codellama-CodeLlama-7b-hf | 770.446 |
DS/64 | ERNIE-Speed-8K | 489.968 |
DS/523 | google-gemma-1.1-2b-it | 487.580 |
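
The `min_elo` column appears to be the Elo rating of the lowest-rated model that solves each problem. A minimal sketch of that computation follows, assuming a per-model results mapping and a separate model-to-Elo mapping; the function name `min_elo_to_solve`, the data layout, and the toy ratings are hypothetical.

```python
def min_elo_to_solve(results, elo):
    """For each solved example, find the lowest-rated model that solves it.

    `results` maps model name -> {example_id: passed (bool)} and
    `elo` maps model name -> Elo rating; both layouts are assumptions.
    Returns {example_id: (model, elo_rating)}.
    """
    best = {}
    for model, per_example in results.items():
        for example_id, passed in per_example.items():
            if not passed:
                continue
            current = best.get(example_id)
            if current is None or elo[model] < current[1]:
                best[example_id] = (model, elo[model])
    return best


# Toy data: hypothetical models and ratings, not taken from the table.
toy_elo = {"strong": 1200.0, "weak": 900.0}
toy_results = {
    "strong": {"DS/1": True, "DS/2": True, "DS/3": False},
    "weak": {"DS/1": True, "DS/2": False, "DS/3": False},
}
for example_id, (model, rating) in sorted(min_elo_to_solve(toy_results, toy_elo).items()):
    print(f"{example_id} | {model} | {rating:.3f} |")
```

Examples that no model solves are simply absent from the result, so they would need to be reported separately.
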
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on them).
example_link | acc | tau |
---|---|---|
DS/523 | 0.009 | -0.109 |
DS/64 | 0.009 | -0.103 |
DS/585 | 0.019 | -0.102 |
DS/880 | 0.208 | -0.102 |
DS/611 | 0.311 | -0.101 |
DS/762 | 0.189 | -0.094 |
DS/250 | 0.019 | -0.076 |
DS/161 | 0.009 | -0.043 |
DS/882 | 0.113 | -0.034 |
DS/632 | 0.123 | -0.031 |
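
The `tau` column reads like a Kendall rank correlation between each model's pass/fail outcome on a problem and its overall score. The report does not say which variant is used; the sketch below assumes tau-b, which corrects for ties (pass/fail outcomes are heavily tied). The function name and toy numbers are illustrative only.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pairs)."""
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
    total_pairs = n * (n - 1) // 2
    denom = math.sqrt((total_pairs - ties_x) * (total_pairs - ties_y))
    # Undefined when one sequence is constant (e.g. nobody solves the problem).
    return (concordant - discordant) / denom if denom else float("nan")


# Toy data: four hypothetical models, weakest first.
overall_score = [0.2, 0.4, 0.6, 0.8]
passed = [1, 1, 0, 0]  # only the two weakest models solve this problem
tau = kendall_tau_b(overall_score, passed)
print(f"tau = {tau:.3f}")
```

A negative value, as in this toy case, means the weaker models are the ones solving the problem, which is the pattern the table above flags.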
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.