There are 28 examples not solved by any model.
If these are good problems, solving some of them is a good signal that your model is indeed better than the leading models.
gsm8k/1035, gsm8k/1042, gsm8k/1048, gsm8k/1105, gsm8k/1137, gsm8k/1190, gsm8k/1309, gsm8k/153, gsm8k/205, gsm8k/236, gsm8k/252, gsm8k/298, gsm8k/359, gsm8k/406, gsm8k/454, gsm8k/530, gsm8k/532, gsm8k/541, gsm8k/589, gsm8k/590, gsm8k/754, gsm8k/782, gsm8k/802, gsm8k/823, gsm8k/87, gsm8k/942, gsm8k/979, gsm8k/999
example_link | model | min_elo |
---|---|---|
gsm8k/763 | Qwen1.5-110B | 1607.968 |
gsm8k/603 | Qwen1.5-110B | 1607.968 |
gsm8k/102 | Qwen1.5-110B | 1607.968 |
gsm8k/293 | Qwen1.5-110B | 1607.968 |
gsm8k/687 | Qwen1.5-110B | 1607.968 |
gsm8k/1288 | Qwen1.5-110B | 1607.968 |
gsm8k/314 | Qwen1.5-110B | 1607.968 |
gsm8k/652 | Qwen1.5-110B | 1607.968 |
gsm8k/1075 | Meta-Llama-3-70B | 1596.226 |
gsm8k/806 | Meta-Llama-3-70B | 1596.226 |
gsm8k/494 | Meta-Llama-3-70B | 1596.226 |
gsm8k/780 | Mixtral-8x22B-v0.1 | 1543.408 |
gsm8k/854 | Qwen1.5-72B | 1504.785 |
gsm8k/20 | Qwen1.5-72B | 1504.785 |
gsm8k/1181 | Qwen1.5-72B | 1504.785 |
gsm8k/122 | DeepSeek-V2 | 1498.478 |
gsm8k/2 | DeepSeek-V2 | 1498.478 |
gsm8k/141 | DeepSeek-V2 | 1498.478 |
gsm8k/796 | Qwen1.5-32B | 1476.260 |
gsm8k/611 | Qwen1.5-14B | 1380.032 |
gsm8k/510 | dbrx-base | 1379.046 |
gsm8k/580 | deepseek-llm-67b-base | 1305.619 |
gsm8k/958 | gemma-7b | 1234.147 |
gsm8k/672 | gemma-7b | 1234.147 |
gsm8k/539 | llama2_70B | 1232.620 |
gsm8k/675 | Meta-Llama-3-8B | 1220.710 |
gsm8k/119 | Mistral-7B-v0.1 | 1062.761 |
gsm8k/357 | llama2_07B | 834.538 |
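As a minimal sketch of how a table like the one above and the list of unsolved examples might be derived, the following assumes `min_elo` means the rating of the lowest-rated model that solves a problem. That reading is an assumption, as are the function names, the input layout, and the toy ratings, none of which come from the report.

```python
from typing import Dict, Iterable, List, Set, Tuple


def min_elo_to_solve(
    elo: Dict[str, float], solved: Dict[str, Set[str]]
) -> Dict[str, Tuple[str, float]]:
    """Map each solved problem to the (model, elo) of its lowest-rated solver."""
    best: Dict[str, Tuple[str, float]] = {}
    for model, problems in solved.items():
        for problem in problems:
            # Keep the solver with the smallest Elo seen so far.
            if problem not in best or elo[model] < best[problem][1]:
                best[problem] = (model, elo[model])
    return best


def unsolved_problems(
    all_problems: Iterable[str], solved: Dict[str, Set[str]]
) -> List[str]:
    """Return the problems that no model solves, in sorted order."""
    solved_by_any: Set[str] = set().union(*solved.values())
    return sorted(set(all_problems) - solved_by_any)


# Toy data (hypothetical models and ratings, for illustration only).
elo = {"model_a": 1600.0, "model_b": 1200.0, "model_c": 900.0}
solved = {"model_a": {"p1", "p2"}, "model_b": {"p2"}, "model_c": set()}
all_problems = ["p1", "p2", "p3"]

table = min_elo_to_solve(elo, solved)
missing = unsolved_problems(all_problems, solved)

# Print rows in the same descending-Elo order the report uses.
for problem, (model, rating) in sorted(table.items(), key=lambda kv: -kv[1][1]):
    print(f"{problem} | {model} | {rating:.3f} |")
print("unsolved:", missing)
```

Sorting by descending rating puts the problems that only strong models solve at the top, which matches the ordering of the table.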
These are the 10 problems with the lowest correlation (tau) with the overall evaluation. A negative tau means that better models tend to do worse on the problem.
example_link | acc | tau |
---|---|---|
gsm8k/1310 | 0.108 | -0.159 |
gsm8k/12 | 0.054 | -0.111 |
gsm8k/466 | 0.297 | -0.073 |
gsm8k/901 | 0.108 | -0.061 |
gsm8k/357 | 0.027 | -0.039 |
gsm8k/749 | 0.081 | -0.031 |
gsm8k/584 | 0.297 | -0.002 |
gsm8k/175 | 0.054 | 0.000 |
gsm8k/1094 | 0.108 | 0.020 |
gsm8k/119 | 0.027 | 0.026 |
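The report does not say how `tau` is computed. As a rough sketch, assuming it is a Kendall-style rank correlation between each model's overall score and whether that model solved the problem, a per-problem value could be obtained as below. The tau-a variant (no tie correction), the function name, and the toy inputs are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant, so binary
    outcomes with many ties produce values of small magnitude.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    score = 0
    for i, j in combinations(range(n), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            score += 1  # concordant pair
        elif product < 0:
            score -= 1  # discordant pair
    return score / (n * (n - 1) / 2)


# Toy data: overall score per model, and 0/1 correctness on one problem.
overall = [0.9, 0.7, 0.5, 0.3]
solved_by_weak_models = [0, 0, 1, 1]
solved_by_strong_models = [1, 1, 0, 0]

tau_negative = kendall_tau_a(overall, solved_by_weak_models)
tau_positive = kendall_tau_a(overall, solved_by_strong_models)
print(tau_negative, tau_positive)
```

A problem solved only by the weaker models yields a negative value, which is the pattern this table is meant to surface.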
Histogram of problems by the accuracy on each problem.
Histogram of problems by the minimum Elo to solve each problem.