A large-scale Tool-Integrated Reasoning (TIR) dataset for mathematical problem solving, designed for SFT and RL training targeting competitive mathematics (AIMO3).
| Field | Value |
|---|---|
| Models | DeepSeek-V3.2, GPT-OSS-120B, Step-3.5-Flash, GLM-5, Kimi-K2.5, Qwen3.5-397B-A17B |
| Answer Format | Cleaned verifiable numeric answers |
| File Format | JSONL (training) / CSV (benchmark) |
| Stage 1 | 340,411 TIR traces — SFT training or RL warmup |
| Stage 2 | 5,656 TIR traces from 3,304 hardest questions — RL finetuning |
Each unique question is solved independently by multiple frontier models, producing diverse solution trajectories for the same problem.
AstralMath-v1 is built from curated questions across multiple high-quality mathematical datasets, deduplicated to remove overlapping problems. A significant portion is synthetically transformed into AIMO3-competition-style questions.
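The deduplication rule is not specified above; a minimal sketch, assuming exact matching on whitespace- and case-normalized problem text (the function names here are illustrative, not part of the actual pipeline):

```python
import hashlib
import re

def normalize(problem: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not defeat deduplication.
    return re.sub(r"\s+", " ", problem.lower()).strip()

def dedupe(problems: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized problem."""
    seen: set[str] = set()
    unique: list[str] = []
    for p in problems:
        key = hashlib.sha256(normalize(p).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

print(len(dedupe(["Find  x+1.", "find x+1.", "Find x+2."])))  # 2
```

Real pipelines often add fuzzy matching (e.g. n-gram overlap) on top of exact hashing; the sketch shows only the exact-match layer.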
| Source | Selected | Transformed | Share |
|---|---|---|---|
| Nemotron-Math-v2 | 89,344 | 70,596 | 51.9% |
| AI-MO/NuminaMath-1.5 | 28,363 | 0 | 20.8% |
| ScaleQuest-Math | 10,139 | 0 | 7.4% |
| DeepScaleR-Preview-Dataset | 7,580 | 7,580 | 5.0% |
| DeepMath-103K | 540 | 540 | 0.4% |
| Project Euler | 199 | 0 | 0.2% |
| IMO AnswerBench | 24 | 0 | — |
| Total | 136,151 | 78,716 | 100% |
Nemotron-Math-v2 is the largest contributor, with over half the questions undergoing AIMO3-style transformation. NuminaMath and ScaleQuest provide diverse untransformed problems from competition archives.
We apply aggressive filtering to retain only challenging, well-formed problems.
A key differentiator: existing math problems are rewritten into AIMO3-competition-style questions. All transformed answers are integers in [0, 99999].
| Transform | Rule | Example |
|---|---|---|
| Modular Arithmetic | "Find X" → "Find the remainder when X is divided by M", with M drawn from primes/powers: 99991, 10^5, 5^7, 77781, etc. | "Find the sum of..." → "Find the remainder when the sum is divided by 99991" |
| Power Transforms | For small answers k ≤ 10: compute base^k mod M, with base drawn from [2, 7]. For k in (10, 20], only base 2 or 3. | answer = 7 → "Find 5^7 mod 99991" = 78125 |
| Answer-Power | For answers 20 < k ≤ 50: compute k^n mod M, with n drawn from [2, 6]. | answer = 42 → "Find 42^3 mod 99991" = 74088 |
| Symbol Substitution | Replace numeric parameters with algebraic symbols for parametric variants | Creates new problem families from a single seed by varying constants |
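The arithmetic behind the first three transforms can be checked directly with Python's three-argument `pow`; a minimal sketch using values from the table above:

```python
# Modular arithmetic: reduce any large answer into the [0, 99999] range.
M = 99991
print((10**18 + 7) % M < 100000)  # True

# Power transform: a small answer k <= 10 becomes base**k mod M.
k, base = 7, 5
print(pow(base, k, M))  # 78125

# Answer-power transform: an answer 20 < k <= 50 becomes k**n mod M.
k, n = 42, 3
print(pow(k, n, M))  # 74088
```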
Each transformation is verified through 12-run consensus: the same prompt with fixed random parameters is sent to gpt-oss-120b 12 times, and the transform is accepted only if all 12 runs return the same answer.
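The consensus rule can be sketched as follows; `ask_model` is a hypothetical stand-in for a call to gpt-oss-120b, not a real API:

```python
def consensus_verify(prompt: str, ask_model, runs: int = 12):
    """Accept a transformed answer only if all `runs` attempts agree exactly."""
    answers = [ask_model(prompt) for _ in range(runs)]
    if len(set(answers)) == 1:   # unanimous: accept
        return answers[0]
    return None                  # any disagreement: reject the transform

# Deterministic stand-ins for a model call, for demonstration only:
print(consensus_verify("Find 5^7 mod 99991", lambda p: 78125))        # 78125
flaky = iter([78125] * 11 + [78120])
print(consensus_verify("Find 5^7 mod 99991", lambda p: next(flaky)))  # None
```

Requiring unanimity rather than a majority trades recall for precision: a transform with even one disagreeing run is discarded, which keeps computationally incorrect answers out of the dataset.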
Solutions are generated by 6 frontier models in a Kaggle-identical Docker environment. After the first correct solution is found, improvement iterations produce shorter solutions with lower total tool runtime.
| Model | Datapoints | After Deadline |
|---|---|---|
| DeepSeek-V3.2 | 116,719 | |
| Step-3.5-Flash | 74,248 | |
| GPT-OSS-120B | 49,046 | |
| GLM-5 | — | +9,042 |
| Kimi-K2.5 | — | +65,977 |
| Qwen3.5-397B-A17B | — | +31,035 |
| Total | 240,013 | +106,054 |
Token length distributions for Stage 1 and Stage 2 (estimated using the gpt-oss-120b harmony encoding). Stage 2 contains the hardest questions, resulting in longer and more complex solution traces.
Stage 1 — Majority of traces are under 10k tokens, with a long tail of complex problems reaching 40k+. The distribution shows a healthy spread across difficulty levels.
Stage 2 — Shifted right compared to Stage 1, reflecting the higher difficulty. These are the hardest questions that require extended multi-step reasoning and multiple tool calls.
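A rough way to reproduce such a histogram without the exact tokenizer, assuming the common ~4-characters-per-token heuristic as a stand-in for the harmony encoding (all names here are illustrative):

```python
def approx_tokens(text: str) -> int:
    # Crude proxy: roughly 4 characters per token for English/math text.
    return max(1, len(text) // 4)

def bucket(traces: list[str], width: int = 10_000) -> dict[str, int]:
    """Histogram of approximate token lengths in `width`-token buckets."""
    hist: dict[str, int] = {}
    for t in traces:
        n = approx_tokens(t)
        lo = (n // width) * width
        label = f"{lo // 1000}k-{(lo + width) // 1000}k"
        hist[label] = hist.get(label, 0) + 1
    return hist

print(bucket(["x" * 8_000, "y" * 50_000, "z" * 200_000]))
# {'0k-10k': 1, '10k-20k': 1, '50k-60k': 1}
```

For exact counts, the actual harmony tokenizer should be used; the heuristic only approximates the shape of the distribution.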
Topic distribution comparison between the full AstralMath-v1 dataset and the AstralBench benchmark subset.
AstralMath-v1 — Broad coverage across algebra, number theory, combinatorics, geometry, and analysis. The hierarchical topic classification supports fine-grained curriculum design.
AstralBench — 50 curated problems spanning the hardest topics, selected from IMOBench and Project Euler with current model accuracy between 5% and 30%.
AstralBench is a carefully curated subset of 50 high-quality problems, selected for benchmarking model performance. Problems are sourced from IMOBench and Project Euler, with non-integer answers manually transformed into numeric-answer problems.
| Source | Count | Transformed |
|---|---|---|
| IMOBench | 46 | 20 |
| Project Euler | 4 | 4 |
| Total | 50 | 24 |
Model Performance on AstralBench:
| Model | Accuracy |
|---|---|
| GPT-OSS-120B (public notebook) | 9/50, 14/50, 13/50 (three runs) |
| GPT-OSS-120B (10h timeout) | 15/50 (30%) |
All models run in a Kaggle-identical Docker environment with no time limit. Problems that originally had symbolic or fraction answers are transformed via modular arithmetic and parameter changes while preserving their original complexity.
We validated AstralMath-v1 by fine-tuning Qwen3-4B and gpt-oss-20b using supervised fine-tuning (SFT):
| Model | Before | After | Setting |
|---|---|---|---|
| Qwen/Qwen3-4B-Thinking-2507 | 0 | 22, 23 | 1050 SFT steps, max length 40960, batch size 8, ~11h on 1x H100 |
| openai/gpt-oss-20b | 30 | 34 | 400 LoRA (rank 1, alpha 32) fine-tuning steps, max length 50176, batch size 12, ~11.5h on 1x H100 |
| More results coming soon | |||
Each question is solved independently by multiple frontier models, capturing diverse reasoning strategies and code patterns. This is not a single-model distillation dataset.
AIMO3-style transformations are validated through 12-run consensus, ensuring all transformed answers are computationally correct — not just plausible.
Every solution carries detailed metadata: generation config, execution timing, quality metrics — enabling difficulty-aware curriculum learning.
Solutions generated and verified in a Kaggle-identical Docker environment, ensuring code execution compatibility at competition time.
340,411 TIR traces from 6 frontier models, intended for SFT training or RL warmup
(The actual number of data points is ~340k. The HuggingFace API shows an incorrect number.)
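Since Stage 1 ships as JSONL (one trace per line), a minimal reader sketch; the field names `question`, `answer`, and `model` are illustrative, not the actual schema:

```python
import io
import json

# Two-line stand-in for the Stage 1 file; real traces carry more metadata
# (generation config, execution timing, quality metrics).
sample = io.StringIO(
    '{"question": "Find 5^7 mod 99991", "answer": 78125, "model": "gpt-oss-120b"}\n'
    '{"question": "Find 42^3 mod 99991", "answer": 74088, "model": "deepseek-v3.2"}\n'
)

# Parse one JSON object per non-empty line.
traces = [json.loads(line) for line in sample if line.strip()]
print(len(traces), traces[0]["answer"])  # 2 78125
```

Streaming line by line like this keeps memory flat even for the full ~340k-trace file; check the dataset card for the real field names before relying on them.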
5,656 TIR traces from the 3,304 hardest questions, intended for RL finetuning