gsm8k: by examples


Not solved by any model

There are 28 examples that no model solved. If these are well-posed problems, solving some of them is a good signal that your model is indeed better than the leading models.
gsm8k/1035, gsm8k/1042, gsm8k/1048, gsm8k/1105, gsm8k/1137, gsm8k/1190, gsm8k/1309, gsm8k/153, gsm8k/205, gsm8k/236, gsm8k/252, gsm8k/298, gsm8k/359, gsm8k/406, gsm8k/454, gsm8k/530, gsm8k/532, gsm8k/541, gsm8k/589, gsm8k/590, gsm8k/754, gsm8k/782, gsm8k/802, gsm8k/823, gsm8k/87, gsm8k/942, gsm8k/979, gsm8k/999

Problems solved by 1 model only

example_link model min_elo
gsm8k/1288 Qwen1.5-110B 1607.968
gsm8k/603 Qwen1.5-110B 1607.968
gsm8k/102 Qwen1.5-110B 1607.968
gsm8k/763 Qwen1.5-110B 1607.968
gsm8k/687 Qwen1.5-110B 1607.968
gsm8k/314 Qwen1.5-110B 1607.968
gsm8k/652 Qwen1.5-110B 1607.968
gsm8k/293 Qwen1.5-110B 1607.968
gsm8k/494 Meta-Llama-3-70B 1596.226
gsm8k/806 Meta-Llama-3-70B 1596.226
gsm8k/1075 Meta-Llama-3-70B 1596.226
gsm8k/780 Mixtral-8x22B-v0.1 1543.408
gsm8k/20 Qwen1.5-72B 1504.785
gsm8k/854 Qwen1.5-72B 1504.785
gsm8k/1181 Qwen1.5-72B 1504.785
gsm8k/141 DeepSeek-V2 1498.478
gsm8k/122 DeepSeek-V2 1498.478
gsm8k/2 DeepSeek-V2 1498.478
gsm8k/796 Qwen1.5-32B 1476.260
gsm8k/611 Qwen1.5-14B 1380.032
gsm8k/510 dbrx-base 1379.046
gsm8k/580 deepseek-llm-67b-base 1305.619
gsm8k/672 gemma-7b 1234.147
gsm8k/958 gemma-7b 1234.147
gsm8k/539 llama2_70B 1232.620
gsm8k/675 Meta-Llama-3-8B 1220.710
gsm8k/119 Mistral-7B-v0.1 1062.761
gsm8k/357 llama2_07B 834.538
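
The two lists above can be derived from a per-model table of pass/fail results. The following is a minimal sketch, not the page's actual code: the `results` and `elo` dictionaries, the model names, and the example ids are all made up for illustration, and `min_elo` is assumed to mean the lowest Elo rating among the models that solved a problem.

```python
# Hypothetical data: results[model][example] is True when the model solved it.
results = {
    "model_a": {"ex/1": True, "ex/2": False, "ex/3": False},
    "model_b": {"ex/1": True, "ex/2": True, "ex/3": False},
}
# Made-up Elo ratings for the two hypothetical models.
elo = {"model_a": 1200.0, "model_b": 1500.0}


def solvers_by_example(results):
    """Map each example to the list of models that solved it."""
    solvers = {}
    for model, outcomes in results.items():
        for example, solved in outcomes.items():
            solvers.setdefault(example, [])
            if solved:
                solvers[example].append(model)
    return solvers


def summarize(results, elo):
    """Return (unsolved examples, single-solver examples, min Elo per solved example)."""
    solvers = solvers_by_example(results)
    unsolved = sorted(ex for ex, ms in solvers.items() if not ms)
    solved_by_one = {ex: ms[0] for ex, ms in solvers.items() if len(ms) == 1}
    # Assumed definition: the lowest rating among the models that solved the example.
    min_elo = {ex: min(elo[m] for m in ms) for ex, ms in solvers.items() if ms}
    return unsolved, solved_by_one, min_elo


unsolved, solved_by_one, min_elo = summarize(results, elo)
print(unsolved)        # examples no model solved
print(solved_by_one)   # example -> its only solver
print(min_elo)         # example -> lowest Elo that solved it
```

Under this definition, a problem solved by exactly one model has a `min_elo` equal to that model's rating, which matches the table above, where every row for the same model carries the same value.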

Suspect problems

These are the 10 problems with the lowest correlation (tau) with the overall evaluation. A negative tau means that better models tend to do worse on the problem.

example_link acc tau
gsm8k/1310 0.108 -0.159
gsm8k/12 0.054 -0.111
gsm8k/466 0.297 -0.073
gsm8k/901 0.108 -0.061
gsm8k/357 0.027 -0.039
gsm8k/749 0.081 -0.031
gsm8k/584 0.297 -0.002
gsm8k/175 0.054 0.000
gsm8k/1094 0.108 0.020
gsm8k/119 0.027 0.026
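
To make the `acc` and `tau` columns concrete, here is a minimal sketch of how such values could be computed for one problem. It assumes that `tau` is a Kendall-style rank correlation (tau-b, which handles ties) between each model's overall score and whether it solved the problem; the page does not state which correlation it uses. The function name and the sample scores are hypothetical.

```python
import math


def kendall_tau_b(xs, ys):
    """Kendall tau-b rank correlation between two equal-length sequences."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both sequences: excluded from every count
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return 0.0 if denom == 0 else (concordant - discordant) / denom


# Made-up overall scores for four models, ordered weakest to strongest.
overall = [0.2, 0.4, 0.6, 0.8]
# Made-up outcomes on one problem: only the two weakest models solve it.
solved = [1, 1, 0, 0]

acc = sum(solved) / len(solved)      # fraction of models that solved the problem
tau = kendall_tau_b(overall, solved)
print(acc, round(tau, 3))
```

In this toy case the weaker models are the ones that succeed, so `tau` comes out negative, which is the pattern the table flags as suspect.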

Histogram of accuracies

Histogram of problems by the accuracy on each problem.
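
As a rough illustration of how problems can be bucketed by accuracy, the sketch below bins the ten `acc` values from the suspect-problems table. The function name, the bin count, and the bin range are assumptions; the page's real histogram covers all problems, not just these ten.

```python
from collections import Counter


def histogram(values, num_bins=10, lo=0.0, hi=1.0):
    """Count values in equal-width bins over [lo, hi]; hi falls in the last bin."""
    width = (hi - lo) / num_bins
    counts = Counter()
    for v in values:
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return [counts.get(i, 0) for i in range(num_bins)]


# The acc column of the ten suspect problems listed earlier on this page.
suspect_acc = [0.108, 0.054, 0.297, 0.108, 0.027, 0.081, 0.297, 0.054, 0.108, 0.027]

hist = histogram(suspect_acc)
for i, count in enumerate(hist):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f}: {'#' * count}")
```

The same function could bin the minimum-Elo values by passing a different `lo` and `hi`.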

Histogram of difficulties

Histogram of problems by the minimum Elo to solve each problem.