There are 28 examples not solved by any model.
If these are sound problems, solving some of them is a good signal that your model is indeed better than the leading models.
gsm8k/1035, gsm8k/1042, gsm8k/1048, gsm8k/1105, gsm8k/1137, gsm8k/1190, gsm8k/1309, gsm8k/153, gsm8k/205, gsm8k/236, gsm8k/252, gsm8k/298, gsm8k/359, gsm8k/406, gsm8k/454, gsm8k/530, gsm8k/532, gsm8k/541, gsm8k/589, gsm8k/590, gsm8k/754, gsm8k/782, gsm8k/802, gsm8k/823, gsm8k/87, gsm8k/942, gsm8k/979, gsm8k/999
| example_link | model | min_elo |
|---|---|---|
| gsm8k/687 | Qwen1.5-110B | 1184.228 |
| gsm8k/763 | Qwen1.5-110B | 1184.228 |
| gsm8k/603 | Qwen1.5-110B | 1184.228 |
| gsm8k/293 | Qwen1.5-110B | 1184.228 |
| gsm8k/102 | Qwen1.5-110B | 1184.228 |
| gsm8k/652 | Qwen1.5-110B | 1184.228 |
| gsm8k/314 | Qwen1.5-110B | 1184.228 |
| gsm8k/1288 | Qwen1.5-110B | 1184.228 |
| gsm8k/806 | Meta-Llama-3-70B | 1177.840 |
| gsm8k/494 | Meta-Llama-3-70B | 1177.840 |
| gsm8k/1075 | Meta-Llama-3-70B | 1177.840 |
| gsm8k/780 | Mixtral-8x22B-v0.1 | 1166.351 |
| gsm8k/1181 | Qwen1.5-72B | 1158.162 |
| gsm8k/854 | Qwen1.5-72B | 1158.162 |
| gsm8k/20 | Qwen1.5-72B | 1158.162 |
| gsm8k/122 | DeepSeek-V2 | 1153.446 |
| gsm8k/141 | DeepSeek-V2 | 1153.446 |
| gsm8k/2 | DeepSeek-V2 | 1153.446 |
| gsm8k/796 | Qwen1.5-32B | 1148.770 |
| gsm8k/611 | Qwen1.5-14B | 1119.906 |
| gsm8k/510 | dbrx-base | 1119.906 |
| gsm8k/580 | deepseek-llm-67b-base | 1092.850 |
| gsm8k/958 | gemma-7b | 1068.416 |
| gsm8k/672 | gemma-7b | 1068.416 |
| gsm8k/539 | llama2_70B | 1068.118 |
| gsm8k/675 | Meta-Llama-3-8B | 1063.063 |
| gsm8k/119 | Mistral-7B-v0.1 | 1008.460 |
| gsm8k/357 | llama2_07B | 936.649 |
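The unsolved list and the `min_elo` column can be derived from per-example results roughly as sketched below. This is a minimal sketch, not the report's actual code: it assumes `min_elo` means the Elo rating of the lowest-rated model that solves an example, and the function name `min_elo_to_solve`, the model names, the ratings, and the example ids are all hypothetical.

```python
def min_elo_to_solve(solved_by, elo):
    """Split examples into unsolved ones and those with a weakest solver.

    solved_by: dict mapping example id -> set of model names that solve it
    elo:       dict mapping model name -> Elo rating
    Returns (sorted list of unsolved ids,
             dict of example id -> (lowest-rated solving model, its rating)).
    """
    unsolved, min_elo = [], {}
    for example, models in solved_by.items():
        if not models:
            # No model solves this example.
            unsolved.append(example)
            continue
        # Assumption: min_elo is the rating of the weakest model that succeeds.
        weakest = min(models, key=lambda m: elo[m])
        min_elo[example] = (weakest, elo[weakest])
    return sorted(unsolved), min_elo


# Hypothetical toy data, not taken from the report.
elo = {"model_a": 1200.0, "model_b": 1100.0, "model_c": 950.0}
solved_by = {
    "ex/1": set(),
    "ex/2": {"model_a"},
    "ex/3": {"model_a", "model_c"},
}

unsolved, min_elo = min_elo_to_solve(solved_by, elo)
print(unsolved)
print(min_elo)
```

Under this reading, an example solved by exactly one model simply inherits that model's rating.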
These are the 10 problems whose results correlate least with the overall evaluation (i.e., where tau is negative, better models tend to do worse on the problem).
| example_link | acc | tau |
|---|---|---|
| gsm8k/1310 | 0.108 | -0.159 |
| gsm8k/12 | 0.054 | -0.111 |
| gsm8k/466 | 0.297 | -0.073 |
| gsm8k/901 | 0.108 | -0.061 |
| gsm8k/357 | 0.027 | -0.039 |
| gsm8k/749 | 0.081 | -0.031 |
| gsm8k/584 | 0.297 | -0.002 |
| gsm8k/175 | 0.054 | 0.000 |
| gsm8k/1094 | 0.108 | 0.020 |
| gsm8k/119 | 0.027 | 0.026 |
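The report does not say how `tau` is computed. One plausible reading, sketched below, is Kendall's tau-b between each model's overall score and whether that model solved the problem (1 or 0). The function name `kendall_tau_b` and the toy numbers are hypothetical, and the actual report may use a different correlation or tie handling.

```python
from itertools import combinations
from math import sqrt


def sign(a, b):
    """Return 1 if a > b, -1 if a < b, 0 if equal."""
    return (a > b) - (a < b)


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences, with tie correction."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sx, sy = sign(x1, x2), sign(y1, y2)
        if sx == 0 and sy == 0:
            continue  # tied in both variables: excluded from both factors
        if sx == 0:
            tied_x_only += 1
        elif sy == 0:
            tied_y_only += 1
        elif sx == sy:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    # (pairs not tied in x) * (pairs not tied in y)
    denom = sqrt((untied + tied_y_only) * (untied + tied_x_only))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical data: four models sorted by overall accuracy,
# and whether each one solved a single problem.
overall = [0.9, 0.8, 0.7, 0.6]
solved = [0, 0, 1, 1]  # only the two weaker models solve it

tau = kendall_tau_b(overall, solved)
print(round(tau, 3))
```

In this toy case the stronger models fail and the weaker ones succeed, so tau is negative. A problem that every model solves, or that no model solves, has no untied pairs, and this sketch returns 0.0 for it.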
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.