A large-scale Tool-Integrated Reasoning (TIR) dataset for mathematical problem solving, designed for SFT and RL training targeting competitive mathematics (AIMO3).
| Field | Value |
|---|---|
| Models | DeepSeek-V3.2, GPT-OSS-120B, Step-3.5-Flash, GLM-5, Kimi-K2.5, Qwen3.5-397B-A17B |
| Answer Format | Cleaned verifiable numeric answers |
| File Format | JSONL (training) / CSV (benchmark) |
| Stage 1 | 340,411 TIR traces — SFT training or RL warmup |
| Stage 2 | 5,656 TIR traces from 3,304 hardest questions — RL finetuning |
Each unique question is solved independently by multiple frontier models, producing diverse solution trajectories for the same problem.
AstralMath-v1 is built from curated questions across multiple high-quality mathematical datasets, deduplicated to remove overlapping problems. A significant portion is synthetically transformed into AIMO3-competition-style questions.
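The deduplication rule is not specified above; a minimal sketch, assuming exact matching on whitespace- and case-normalized problem text (the function names here are illustrative, not part of the actual pipeline):

```python
import hashlib
import re

def normalize(problem: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not defeat deduplication.
    return re.sub(r"\s+", " ", problem.lower()).strip()

def dedupe(problems: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized problem."""
    seen: set[str] = set()
    unique: list[str] = []
    for p in problems:
        key = hashlib.sha256(normalize(p).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

print(len(dedupe(["Find  x+1.", "find x+1.", "Find x+2."])))  # 2
```

Real pipelines often add fuzzy matching (e.g. n-gram overlap) on top of exact hashing; the sketch shows only the exact-match layer.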
| Source | Selected | Transformed | Share |
|---|---|---|---|
| Nemotron-Math-v2 | 89,344 | 70,596 | 51.9% |
| AI-MO/NuminaMath-1.5 | 28,363 | 0 | 20.8% |
| ScaleQuest-Math | 10,139 | 0 | 7.4% |
| DeepScaleR-Preview-Dataset | 7,580 | 7,580 | 5.0% |
| DeepMath-103K | 540 | 540 | 0.4% |
| Project Euler | 199 | 0 | 0.2% |
| IMO AnswerBench | 24 | 0 | — |
| Total | 136,151 | 78,716 | 100% |
Nemotron-Math-v2 is the largest contributor, with over half the questions undergoing AIMO3-style transformation. NuminaMath and ScaleQuest provide diverse untransformed problems from competition archives.
We apply aggressive filtering to retain only challenging, well-formed problems.
A key differentiator: existing math problems are rewritten into AIMO3-competition-style questions. All transformed answers are integers in [0, 99999].
| Transform | Rule | Example |
|---|---|---|
| Modular Arithmetic | "Find X" → "Find the remainder when X is divided by M", with M drawn from primes/powers: 99991, 10^5, 5^7, 77781, etc. | "Find the sum of..." → "Find the remainder when the sum is divided by 99991" |
| Power Transforms | For small answers k ≤ 10: compute base^k mod M, with base drawn from [2, 7]. For k in (10, 20], only base 2 or 3. | answer = 7 → "Find 5^7 mod 99991" = 78125 |
| Answer-Power | For answers 20 < k ≤ 50: compute k^n mod M, with n drawn from [2, 6]. | answer = 42 → "Find 42^3 mod 99991" = 74088 |
| Symbol Substitution | Replace numeric parameters with algebraic symbols for parametric variants | Creates new problem families from a single seed by varying constants |
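The arithmetic behind the first three transforms can be checked directly with Python's three-argument `pow`; a minimal sketch using values from the table above:

```python
# Modular arithmetic: reduce any large answer into the [0, 99999] range.
M = 99991
print((10**18 + 7) % M < 100000)  # True

# Power transform: a small answer k <= 10 becomes base**k mod M.
k, base = 7, 5
print(pow(base, k, M))  # 78125

# Answer-power transform: an answer 20 < k <= 50 becomes k**n mod M.
k, n = 42, 3
print(pow(k, n, M))  # 74088
```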
Each transformation is verified through 12-run consensus: the same prompt with fixed random parameters is sent to gpt-oss-120b 12 times, and the transform is accepted only if all 12 runs return the same answer.
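The consensus rule can be sketched as follows; `ask_model` is a hypothetical stand-in for a call to gpt-oss-120b, not a real API:

```python
def consensus_verify(prompt: str, ask_model, runs: int = 12):
    """Accept a transformed answer only if all `runs` attempts agree exactly."""
    answers = [ask_model(prompt) for _ in range(runs)]
    if len(set(answers)) == 1:   # unanimous: accept
        return answers[0]
    return None                  # any disagreement: reject the transform

# Deterministic stand-ins for a model call, for demonstration only:
print(consensus_verify("Find 5^7 mod 99991", lambda p: 78125))        # 78125
flaky = iter([78125] * 11 + [78120])
print(consensus_verify("Find 5^7 mod 99991", lambda p: next(flaky)))  # None
```

Requiring unanimity rather than a majority trades recall for precision: a transform with even one disagreeing run is discarded, which keeps computationally incorrect answers out of the dataset.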
Solutions are generated by 6 frontier models in a Kaggle-identical Docker environment. After the first correct solution is found, improvement iterations produce shorter solutions with lower total tool runtime.
| Model | Datapoints | After Deadline |
|---|---|---|
| DeepSeek-V3.2 | 116,719 | |
| Step-3.5-Flash | 74,248 | |
| GPT-OSS-120B | 49,046 | |
| GLM-5 | — | +9,042 |
| Kimi-K2.5 | — | +65,977 |
| Qwen3.5-397B-A17B | — | +31,035 |
| Total | 240,013 | +106,054 |
Token length distributions for Stage 1 and Stage 2 (estimated using the gpt-oss-120b harmony encoding). Stage 2 contains the hardest questions, resulting in longer and more complex solution traces.
Stage 1 — Majority of traces are under 10k tokens, with a long tail of complex problems reaching 40k+. The distribution shows a healthy spread across difficulty levels.
Stage 2 — Shifted right compared to Stage 1, reflecting the higher difficulty. These are the hardest questions that require extended multi-step reasoning and multiple tool calls.
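A rough way to reproduce such a histogram without the exact tokenizer, assuming the common ~4-characters-per-token heuristic as a stand-in for the harmony encoding (all names here are illustrative):

```python
def approx_tokens(text: str) -> int:
    # Crude proxy: roughly 4 characters per token for English/math text.
    return max(1, len(text) // 4)

def bucket(traces: list[str], width: int = 10_000) -> dict[str, int]:
    """Histogram of approximate token lengths in `width`-token buckets."""
    hist: dict[str, int] = {}
    for t in traces:
        n = approx_tokens(t)
        lo = (n // width) * width
        label = f"{lo // 1000}k-{(lo + width) // 1000}k"
        hist[label] = hist.get(label, 0) + 1
    return hist

print(bucket(["x" * 8_000, "y" * 50_000, "z" * 200_000]))
# {'0k-10k': 1, '10k-20k': 1, '50k-60k': 1}
```

For exact counts, the actual harmony tokenizer should be used; the heuristic only approximates the shape of the distribution.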
Topic distribution comparison between the full AstralMath-v1 dataset and the AstralBench benchmark subset.
AstralMath-v1 — Broad coverage across algebra, number theory, combinatorics, geometry, and analysis. The hierarchical topic classification supports fine-grained curriculum design.
AstralBench — 50 curated problems spanning the hardest topics, selected from IMOBench and Project Euler with current model accuracy between 5% and 30%.
AstralBench is a carefully curated subset of 50 high-quality problems, selected for benchmarking model performance. Problems are sourced from IMOBench and Project Euler, with non-integer answers manually transformed into numeric-answer problems.
| Source | Count | Transformed |
|---|---|---|
| IMOBench | 46 | 20 |
| Project Euler | 4 | 4 |
| Total | 50 | 24 |
Model Performance on AstralBench:
| Model | Accuracy |
|---|---|
| GPT-OSS-120B (public notebook) | 9/50, 14/50, 13/50 (three runs) |
| GPT-OSS-120B (10h timeout) | 15/50 (30%) |
All models run in a Kaggle-identical Docker environment with no time limit. Problems that originally had symbolic or fraction answers are transformed via modular arithmetic and parameter changes while preserving their original complexity.
We validated AstralMath-v1 by fine-tuning Qwen3-4B and gpt-oss-20b using supervised fine-tuning (SFT):
| Model | Before | After | Setting |
|---|---|---|---|
| Qwen/Qwen3-4B-Thinking-2507 | 0 | 22, 23 | 1050 SFT steps, max length 40960, batch size 8, ~11h on 1x H100 |
| openai/gpt-oss-20b | 30 | 34 | 400 LoRA (rank 1, alpha 32) fine-tuning steps, max length 50176, batch size 12, ~11.5h on 1x H100 |
| More results coming soon | |||
Each question is solved independently by multiple frontier models, capturing diverse reasoning strategies and code patterns. This is not a single-model distillation dataset.
AIMO3-style transformations are validated through 12-run consensus, ensuring all transformed answers are computationally correct — not just plausible.
Every solution carries detailed metadata: generation config, execution timing, quality metrics — enabling difficulty-aware curriculum learning.
Solutions generated and verified in a Kaggle-identical Docker environment, ensuring code execution compatibility at competition time.
340,411 TIR traces from 6 frontier models, intended for SFT training or RL warmup
(The actual number of data points is ~340k. The HuggingFace API shows an incorrect number.)
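Since Stage 1 ships as JSONL (one trace per line), a minimal reader sketch; the field names `question`, `answer`, and `model` are illustrative, not the actual schema:

```python
import io
import json

# Two-line stand-in for the Stage 1 file; real traces carry more metadata
# (generation config, execution timing, quality metrics).
sample = io.StringIO(
    '{"question": "Find 5^7 mod 99991", "answer": 78125, "model": "gpt-oss-120b"}\n'
    '{"question": "Find 42^3 mod 99991", "answer": 74088, "model": "deepseek-v3.2"}\n'
)

# Parse one JSON object per non-empty line.
traces = [json.loads(line) for line in sample if line.strip()]
print(len(traces), traces[0]["answer"])  # 2 78125
```

Streaming line by line like this keeps memory flat even for the full ~340k-trace file; check the dataset card for the real field names before relying on them.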
5,656 TIR traces from the 3,304 hardest questions, intended for RL finetuning