There are 112 examples that no model solved.
If these are good problems, solving some of them is a strong signal that your model is indeed better than the leading models.
DS/106, DS/107, DS/108, DS/121, DS/122, DS/132, DS/142, DS/15, DS/159, DS/165, DS/173, DS/174, DS/197, DS/202, DS/203, DS/204, DS/205, DS/208, DS/209, DS/210, DS/211, DS/216, DS/225, DS/228, DS/240, DS/242, DS/245, DS/269, DS/270, DS/272, DS/284, DS/285, DS/286, DS/29, DS/318, DS/319, DS/328, DS/339, DS/345, DS/354, DS/372, DS/375, DS/385, DS/387, DS/389, DS/390, DS/394, DS/40, DS/407, DS/408, DS/410, DS/42, DS/420, DS/421, DS/43, DS/439, DS/44, DS/45, DS/46, DS/468, DS/509, DS/516, DS/521, DS/526, DS/54, DS/57, DS/58, DS/59, DS/596, DS/60, DS/612, DS/626, DS/638, DS/65, DS/672, DS/699, DS/701, DS/726, DS/73, DS/74, DS/747, DS/749, DS/75, DS/750, DS/751, DS/755, DS/773, DS/779, DS/780, DS/789, DS/790, DS/798, DS/80, DS/808, DS/809, DS/81, DS/86, DS/877, DS/879, DS/88, DS/883, DS/884, DS/885, DS/9, DS/900, DS/901, DS/904, DS/905, DS/927, DS/96, DS/987, DS/993
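A list like the one above can be derived from a per-model table of pass/fail outcomes. The following is a minimal sketch, assuming a hypothetical `results[model][example]` mapping of booleans; the function name and the toy data are illustrative and are not taken from the report's actual pipeline.

```python
from typing import Dict, List


def unsolved_examples(results: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return the ids of examples that no model solved.

    `results[model][example]` is assumed to be True when `model`
    solved `example`; a missing entry is treated as not solved.
    """
    examples = set()
    for per_model in results.values():
        examples.update(per_model)
    return sorted(
        ex
        for ex in examples
        if not any(per_model.get(ex, False) for per_model in results.values())
    )


# Toy data for illustration only, not taken from the report.
toy_results = {
    "model_a": {"DS/1": True, "DS/2": False, "DS/3": False},
    "model_b": {"DS/1": False, "DS/2": False, "DS/3": True},
}
print(unsolved_examples(toy_results))  # ['DS/2']
```

Sorting the ids as plain strings gives a lexicographic order (for example, `DS/15` sorts between `DS/142` and `DS/159`), which is the order the list above appears to follow.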
| example_link | model | min_elo |
|---|---|---|
| DS/926 | claude-3-5-sonnet-20240620 | 1057.655 |
| DS/488 | claude-3-5-sonnet-20240620 | 1057.655 |
| DS/105 | claude-3-5-sonnet-20240620 | 1057.655 |
| DS/922 | claude-3-5-sonnet-20240620 | 1057.655 |
| DS/6 | claude-3-5-sonnet-20240620 | 1057.655 |
| DS/744 | claude-3-5-sonnet-20240620 | 1057.655 |
| DS/79 | claude-3-5-sonnet-20240620 | 1057.655 |
| DS/984 | claude-3-5-sonnet-20240620 | 1057.655 |
| DS/458 | claude-3-5-sonnet-20240620 | 1057.655 |
| DS/304 | gpt-4-turbo-2024-04-09 | 1056.516 |
| DS/253 | gpt-4-turbo-2024-04-09 | 1056.516 |
| DS/505 | gpt-4-turbo-2024-04-09 | 1056.516 |
| DS/280 | deepseek-ai-deepseek-coder-V2-SFT | 1053.486 |
| DS/56 | deepseek-ai-deepseek-coder-V2-SFT | 1053.486 |
| DS/7 | Qwen-Qwen2-72B-Instruct | 1051.977 |
| DS/244 | Qwen-Qwen2-72B-Instruct | 1051.977 |
| DS/131 | Qwen-Qwen2-72B-Instruct | 1051.977 |
| DS/418 | mistralai-Codestral-22B-v0.1 | 1045.969 |
| DS/784 | gpt-4-0613 | 1045.221 |
| DS/807 | gpt-4-0613 | 1045.221 |
| DS/39 | gpt-4-0613 | 1045.221 |
| DS/154 | gpt-4-0613 | 1045.221 |
| DS/765 | gpt-4-0613 | 1045.221 |
| DS/772 | meta-llama-Llama-3-70b-chat-hf | 1036.304 |
| DS/346 | meta-llama-Llama-3-70b-chat-hf | 1036.304 |
| DS/362 | meta-llama-Llama-3-70b-chat-hf | 1036.304 |
| DS/776 | meta-llama-Llama-3-70b-chat-hf | 1036.304 |
| DS/781 | meta-llama-Llama-3-70b-chat-hf | 1036.304 |
| DS/347 | meta-llama-Llama-3-70b-chat-hf | 1036.304 |
| DS/373 | deepseek-ai-deepseek-coder-V2-Base | 1029.310 |
| DS/679 | microsoft-wavecoder-ultra-6.7b | 1026.747 |
| DS/515 | meta-llama-Llama-3-70B | 1008.253 |
| DS/26 | deepseek-ai-deepseek-llm-67b-chat | 1007.534 |
| DS/799 | Phind-Phind-CodeLlama-34B-v2 | 1006.455 |
| DS/447 | m-a-p-OpenCodeInterpreter-CL-7B | 1003.224 |
| DS/87 | gpt-3.5-turbo-0125 | 1002.866 |
| DS/998 | codellama-CodeLlama-34b-Python-hf | 1001.074 |
| DS/997 | codellama-CodeLlama-34b-Python-hf | 1001.074 |
| DS/766 | m-a-p-OpenCodeInterpreter-SC2-7B | 1001.074 |
| DS/411 | codellama-CodeLlama-70b-Python-hf | 1001.074 |
| DS/604 | m-a-p-OpenCodeInterpreter-SC2-7B | 1001.074 |
| DS/813 | gpt-3.5-turbo-0613 | 1000.000 |
| DS/172 | deepseek-ai-deepseek-V2-chat | 999.642 |
| DS/953 | microsoft-Phi-3-small-8k-instruct | 996.782 |
| DS/775 | WizardLM-WizardCoder-Python-34B-V1.0 | 993.213 |
| DS/67 | Qwen-Qwen1.5-72B-Chat | 988.939 |
| DS/90 | Qwen-Qwen1.5-72B-Chat | 988.939 |
| DS/263 | ibm-granite-granite-34b-code-base | 986.449 |
| DS/899 | meta-llama-Llama-3-8B | 974.741 |
| DS/161 | codellama-CodeLlama-7b-hf | 944.299 |
| DS/64 | ERNIE-Speed-8K | 893.631 |
| DS/523 | google-gemma-1.1-2b-it | 892.530 |
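
The `min_elo` column can plausibly be read as the Elo of the lowest-rated model that solved each example. The sketch below computes that quantity under this assumption; the `elo` mapping, the function name, and the toy data are hypothetical, and the report does not state how its own values were produced.

```python
from typing import Dict, Tuple


def min_elo_to_solve(
    results: Dict[str, Dict[str, bool]],
    elo: Dict[str, float],
) -> Dict[str, Tuple[str, float]]:
    """Map each solved example to (model, elo) of its lowest-rated solver.

    Examples that no model solved are omitted from the result.
    """
    weakest: Dict[str, Tuple[str, float]] = {}
    for model, per_example in results.items():
        for example, solved in per_example.items():
            if not solved:
                continue
            if example not in weakest or elo[model] < weakest[example][1]:
                weakest[example] = (model, elo[model])
    return weakest


# Toy data for illustration only, not taken from the report.
toy_elo = {"strong": 1050.0, "weak": 950.0}
toy_solved = {
    "strong": {"DS/1": True, "DS/2": True, "DS/3": False},
    "weak": {"DS/1": True, "DS/2": False, "DS/3": False},
}
min_elo = min_elo_to_solve(toy_solved, toy_elo)

# List the hardest examples first, i.e. highest minimum Elo at the top.
for example, (model, rating) in sorted(
    min_elo.items(), key=lambda item: item[1][1], reverse=True
):
    print(f"| {example} | {model} | {rating:.3f} |")
```

Under this reading, a high `min_elo` means that only highly rated models solved the example, which is why such rows sit at the top of a table sorted in descending order.
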
These are the 10 problems with the lowest correlation with the overall evaluation (i.e., better models tend to do worse on these).
| example_link | acc | tau |
|---|---|---|
| DS/523 | 0.010 | -0.109 |
| DS/64 | 0.010 | -0.104 |
| DS/585 | 0.019 | -0.102 |
| DS/880 | 0.210 | -0.096 |
| DS/611 | 0.314 | -0.094 |
| DS/762 | 0.190 | -0.089 |
| DS/250 | 0.019 | -0.076 |
| DS/161 | 0.010 | -0.043 |
| DS/514 | 0.076 | -0.039 |
| DS/882 | 0.114 | -0.030 |
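
The report does not say which statistic `tau` is. One plausible reading is Kendall's tau-b between each model's overall accuracy and whether that model solved the problem. The sketch below implements tau-b in plain Python under that assumption; the function name and the toy data are illustrative.

```python
import math
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b with tie correction, computed over all pairs.

    Returns 0.0 when either input is constant (a choice made for this
    sketch; other implementations may return NaN in that case).
    """
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else 0.0


# Toy data for illustration only: four models ordered by overall accuracy,
# where only the two weakest models solve the problem.
overall_acc = [0.2, 0.4, 0.6, 0.8]
solved = [1, 1, 0, 0]
tau = kendall_tau_b(overall_acc, solved)
print(round(tau, 3))  # -0.816
```

A negative value, as in the toy case, corresponds to the pattern described above: the models that score better overall are the ones that fail this problem.
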
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.
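
The bin counts behind a histogram like these can be computed with a few lines of plain Python. This is a minimal sketch with equal-width bins; the function name and the bin edges are assumptions, and the sample input is just the ten `acc` values from the low-correlation table, so it is not representative of the full benchmark.

```python
from typing import Iterable, List


def bin_counts(values: Iterable[float], lo: float, hi: float, n_bins: int) -> List[int]:
    """Count values in `n_bins` equal-width bins covering [lo, hi].

    Values outside the range are skipped; a value equal to `hi`
    is placed in the last bin.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if not lo <= v <= hi:
            continue
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# The ten `acc` values from the low-correlation table above.
acc_values = [0.010, 0.010, 0.019, 0.210, 0.314, 0.190, 0.019, 0.010, 0.076, 0.114]
counts = bin_counts(acc_values, lo=0.0, hi=1.0, n_bins=10)

# Print a simple text histogram, one row per bin.
for i, count in enumerate(counts):
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f}) {'#' * count}")
```

The same function would apply to minimum Elo values by choosing `lo` and `hi` to span the observed ratings.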