Dataset at a Glance

| Field | Value |
|---|---|
| Models | DeepSeek-V3.2, GPT-OSS-120B, Step-3.5-Flash, GLM-5, Kimi-K2.5, Qwen3.5-397B-A17B |
| Answer Format | Cleaned verifiable numeric answers |
| File Format | JSONL (training) / CSV (benchmark) |
| Stage 1 | 340,411 TIR traces — SFT training or RL warmup |
| Stage 2 | 5,656 TIR traces from 3,304 hardest questions — RL finetuning |

Each unique question is solved independently by multiple frontier models, producing diverse solution trajectories for the same problem.
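Since the training split ships as JSONL and the benchmark as CSV, loading is a few lines of standard library code. A minimal sketch — the file paths and any field names here are illustrative, not the dataset's actual schema:

```python
import csv
import json

def load_jsonl(path):
    """Read one JSON record per line (the training-trace format)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def load_benchmark_csv(path):
    """Read benchmark rows as dicts keyed by the CSV header."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))
```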

Data Sources

AstralMath-v1 is built from curated questions across multiple high-quality mathematical datasets, deduplicated to remove overlapping problems. A significant portion is synthetically transformed into AIMO3-competition-style questions.

| Source | Selected | Transformed | Share |
|---|---:|---:|---:|
| Nemotron-Math-v2 | 89,344 | 70,596 | 51.9% |
| AI-MO/NuminaMath-1.5 | 28,363 | 0 | 20.8% |
| ScaleQuest-Math | 10,139 | 0 | 7.4% |
| DeepScaleR-Preview-Dataset | 7,580 | 7,580 | 5.0% |
| DeepMath-103K | 540 | 540 | 0.4% |
| Project Euler | 199 | 0 | 0.2% |
| IMO AnswerBench | 24 | 0 | |
| **Total** | **136,151** | **78,716** | **100%** |

Nemotron-Math-v2 is the largest contributor, with over half the questions undergoing AIMO3-style transformation. NuminaMath and ScaleQuest provide diverse untransformed problems from competition archives.

Multi-Stage Filtering Pipeline

We apply aggressive filtering to retain only challenging, well-formed problems:

1. **Question length:** keep only questions longer than 100 characters and shorter than 50 lines
2. **Contains image:** remove if the question contains both "![]" and "figure"
3. **Trivial answer:** remove any question whose answer is in [0, 1]
4. **Multiple choice:** remove if the question contains multiple-choice patterns such as "\nA. ", "\nB. "
5. **Double question:** remove questions that contain two unrelated questions (detected via LLM)
6. **Pre-Filter 1:** filter by LLM solution metadata (solution length, pass rate)
7. **Deduplication:** deduplicate across all sources using MinHash + LSH
8. **Pre-Filter 2:** remove easy questions that gpt-oss-120b solves in one try (no tool, 28k tokens)
9. **AIMO3 Transform:** apply synthetic transforms with 12-run consensus verification
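The cheap string-level filters (steps 1–4) can be sketched directly; thresholds follow the list above, the pattern set extends the two examples given, and the LLM-based steps, deduplication, and the transform stage are omitted:

```python
# Multiple-choice markers; "\nA. " and "\nB. " come from the pipeline
# description, the rest are an illustrative extension.
MC_PATTERNS = ("\nA. ", "\nB. ", "\nC. ", "\nD. ")

def passes_basic_filters(question: str, answer: int) -> bool:
    """Apply the string-level filters (steps 1-4 above)."""
    # 1: keep only questions longer than 100 chars and shorter than 50 lines
    if len(question) <= 100 or question.count("\n") >= 50:
        return False
    # 2: drop questions that embed an image reference
    if "![]" in question and "figure" in question:
        return False
    # 3: drop trivial answers in [0, 1]
    if answer in (0, 1):
        return False
    # 4: drop multiple-choice questions
    if any(p in question for p in MC_PATTERNS):
        return False
    return True
```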

Synthetic AIMO3-Style Transforms

A key differentiator: existing math problems are rewritten into AIMO3-competition-style questions. All transformed answers are integers in [0, 99999].

| Transform | Rule | Example |
|---|---|---|
| Modular Arithmetic | "Find X" → "Find the remainder when X is divided by M"; M from primes/powers: 99991, 105, 57, 77781, etc. | "Find the sum of..." → "Find the remainder when the sum is divided by 99991" |
| Power Transforms | For small answers k ≤ 10: compute base^k mod M, with base from [2, 7]; for k in (10, 20], only base 2 or 3 | answer = 7 → "Find 5^7 mod 99991" = 78125 |
| Answer-Power | For answers 20 < k ≤ 50: compute k^n mod M, with n from [2, 6] | answer = 42 → "Find 42^3 mod 99991" = 74088 |
| Symbol Substitution | Replace numeric parameters with algebraic symbols for parametric variants | Creates new problem families from a single seed by varying constants |
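The power transforms reduce to modular exponentiation, i.e. Python's three-argument `pow`. A minimal sketch using the modulus 99991 (the parameter choices here are illustrative):

```python
M = 99991  # one of the transform moduli; results stay inside [0, 99999]

def power_transform(k: int, base: int) -> int:
    """Small answers k <= 10: replace the answer with base**k mod M."""
    return pow(base, k, M)

def answer_power_transform(k: int, n: int) -> int:
    """Answers 20 < k <= 50: replace the answer with k**n mod M."""
    return pow(k, n, M)

print(power_transform(7, 5))          # 5**7 mod 99991 = 78125
print(answer_power_transform(42, 3))  # 42**3 mod 99991 = 74088
```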

Each transformation is verified through 12-run consensus: the same prompt with fixed random parameters is sent to gpt-oss-120b 12 times, and the result is only accepted if all runs return the identical answer.
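The acceptance rule amounts to an all-agree check over the runs; a minimal sketch (the function name and return convention are illustrative):

```python
def consensus_verify(run_answers, required_runs=12):
    """Accept a transformed answer only if every run returned the same value.

    Returns the agreed answer, or None if any run is missing or disagrees.
    """
    if len(run_answers) != required_runs or len(set(run_answers)) != 1:
        return None
    return run_answers[0]
```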

Reasoning Trace Generation

Solutions are generated using 6 frontier models in a Kaggle-identical Docker environment. After the first correct solution, improvement iterations produce shorter solutions with less total tool run time.

| Model | Datapoints | After Deadline |
|---|---:|---:|
| DeepSeek-V3.2 | 116,719 | |
| Step-3.5-Flash | 74,248 | |
| GPT-OSS-120B | 49,046 | |
| GLM-5 | | +9,042 |
| Kimi-K2.5 | | +65,977 |
| Qwen3.5-397B-A17B | | +31,035 |
| **Total** | **240,013** | **+106,054** |
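The improvement loop described above amounts to keeping the cheapest correct trace. A minimal sketch, assuming each trace records correctness, token count, and tool run time — these field names are hypothetical, not the dataset's actual metadata keys:

```python
def best_trace(traces):
    """Pick the correct trace with the fewest tokens, breaking ties by tool time.

    Each trace is assumed to be a dict with hypothetical keys
    'correct', 'num_tokens', and 'tool_seconds'.
    """
    correct = [t for t in traces if t["correct"]]
    if not correct:
        return None
    return min(correct, key=lambda t: (t["num_tokens"], t["tool_seconds"]))
```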

Token Length Distribution

Token length distributions for Stage 1 and Stage 2 (estimated with the gpt-oss-120b harmony encoding). Stage 2 contains the hardest questions, resulting in longer and more complex solution traces.

Stage 1 token length distribution

Stage 1 — Majority of traces are under 10k tokens, with a long tail of complex problems reaching 40k+. The distribution shows a healthy spread across difficulty levels.

Stage 2 token length distribution

Stage 2 — Shifted right compared to Stage 1, reflecting the higher difficulty. These are the hardest questions that require extended multi-step reasoning and multiple tool calls.

Topic Distribution

Topic distribution comparison between the full AstralMath-v1 dataset and the AstralBench benchmark subset.

AstralMath-v1 topic distribution

AstralMath-v1 — Broad coverage across algebra, number theory, combinatorics, geometry, and analysis. The hierarchical topic classification supports fine-grained curriculum design.

AstralBench topic distribution

AstralBench — 50 curated problems spanning the hardest topics. Selected from IMOBench and Project Euler with current model accuracy between 5% and 30%.

AstralBench

AstralBench is a carefully curated subset of 50 high-quality problems, selected for benchmarking model performance. Problems are sourced from IMOBench and Project Euler, with non-integer answers manually transformed into numeric-answer problems.

| Source | Count | Transformed |
|---|---:|---:|
| IMOBench | 46 | 20 |
| Project Euler | 4 | 4 |
| **Total** | **50** | **24** |

Model Performance on AstralBench:

| Model | Accuracy |
|---|---|
| GPT-OSS-120B (public notebook) | 9/50, 14/50, 13/50 |
| GPT-OSS-120B (10h timeout) | 15/50 (30%) |

All models run in a Kaggle-identical Docker environment with no time limit. Problems that originally had symbolic or fraction answers are transformed via modular arithmetic and parameter changes while preserving their original complexity.

Training Results

We validated AstralMath-v1 by fine-tuning Qwen3-4B and gpt-oss-20b using supervised fine-tuning (SFT):

| Model | Before | After | Setting |
|---|---:|---:|---|
| Qwen/Qwen3-4B-Thinking-2507 | 0 | 22, 23 | 1050 SFT steps, max length 40960, batch size 8, ~11h on 1x H100 |
| openai/gpt-oss-20b | 30 | 34 | 400 LoRA (rank 1, alpha 32) finetune steps, max length 50176, batch size 12, ~11.5h on 1x H100 |
More results coming soon

What Makes It Novel

Multi-Model TIR Solutions

Each question is solved independently by multiple frontier models, capturing diverse reasoning strategies and code patterns. This is not a single-model distillation dataset.

Consensus-Verified Transforms

AIMO3-style transformations are validated through 12-run consensus, ensuring all transformed answers are computationally correct — not just plausible.

Rich Per-Datapoint Metadata

Every solution carries detailed metadata: generation config, execution timing, quality metrics — enabling difficulty-aware curriculum learning.

Execution Environment Parity

Solutions generated and verified in a Kaggle-identical Docker environment, ensuring code execution compatibility at competition time.
