DeepSeek-R1: Training Recipe and Data
For easier understanding, the DeepSeek-R1 training pipeline is presented here in six stages. The official technical report describes it in four stages; this article separates the intermediate data-generation and fine-tuning steps, giving a more detailed six-stage breakdown.
Note: Since the data used to train DeepSeek-R1 is not open-sourced, this article highlights and expands on the information about data quantity, quality, diversity, and collection methodology given in the DeepSeek (R1, V3, V2, MoE) technical reports.
📜 Table of Contents
- 📌 Short Description of the Training Recipe
- 🟢 Stage 1: Cold-Start Data Collection for Supervised Fine-Tuning (SFT)
- 🎯 Stage 2: Supervised Fine-Tuning (SFT) on DeepSeek-V3-Base
- 🔄 Stage 3: Reinforcement Learning (GRPO) on the Stage 2 Finetuned Model
- 📊 Stage 4: 800K SFT Data Generation and Sourcing
- 🚀 Stage 5: Supervised Fine-Tuning (SFT) on DeepSeek-V3-Base (Not from Stage 2 or 3)
- 🎛️ Stage 6: Reinforcement Learning (GRPO) on the Stage 5 Finetuned Model to produce DeepSeek-R1
- 📚 References
1. 📌 Short Description of the Training Recipe
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
In short, the technical report describes the pipeline as follows: (1) collect a small set of cold-start long-CoT data, (2) supervised fine-tune DeepSeek-V3-Base on it, (3) run large-scale reasoning-oriented RL (GRPO), (4) use the resulting checkpoint to generate roughly 800K SFT samples via rejection sampling, (5) fine-tune a fresh DeepSeek-V3-Base on those samples, and (6) run a final RL stage for helpfulness, harmlessness, and reasoning to produce DeepSeek-R1.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
2. 🟢 Stage 1: Cold-Start Data Collection for Supervised Fine-Tuning (SFT)
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Thousands of cold-start samples, consisting of long-chain-of-thought (CoT) data, are collected to fine-tune DeepSeek-V3-Base.
Why collect cold-start data?
In the development of DeepSeek-R1, the incorporation of cold-start data served specific purposes:
I. Enhancing Readability and Output Structure
- The preceding model, DeepSeek-R1-Zero, trained solely through reinforcement learning (RL), exhibited issues such as poor readability and language mixing.
- To address these challenges, the team fine-tuned the base model using a small set of cold-start data, meticulously designed to promote clear and coherent outputs.
II. Accelerating Convergence in Reinforcement Learning
- Starting RL directly from the raw base model (as in DeepSeek-R1-Zero) can lead to instability and slow convergence in the early phase.
- By fine-tuning the model on cold-start data before initiating RL, the model achieved a more stable and efficient learning process.
How Was the Data Collected?
Several approaches were used to generate high-quality cold-start reasoning data:
- Few-shot prompting with long CoT examples: The model was given structured reasoning examples and prompted to generate similar responses (see the sketch after this list). [Note: The technical report does not make clear which "model" this refers to — DeepSeek-V3-Base, DeepSeek-R1-Zero, or another model. It is most likely DeepSeek-R1-Zero, given its stronger reasoning capabilities.]
- Direct prompting with reflection and verification: The model was encouraged to explain and verify its own reasoning step by step.
- Reformatting DeepSeek-R1-Zero outputs: Outputs from a previous version were collected and structured for improved readability.
- Human annotation and post-processing: AI-generated responses were reviewed and refined to ensure clarity, coherence, and logical correctness.
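To make the few-shot prompting approach above concrete, here is a minimal sketch that assembles a prompt from a handful of long-CoT exemplars. The exemplar texts, field names, and delimiters are illustrative assumptions, not the actual prompts used to build DeepSeek-R1's cold-start data.

```python
# Hypothetical sketch of few-shot prompting with long-CoT exemplars.
# Exemplar contents and formatting are assumptions for illustration only.

FEW_SHOT_EXEMPLARS = [
    {
        "question": "If 3x + 5 = 20, what is x?",
        "cot": "Subtract 5 from both sides: 3x = 15. Divide both sides by 3: x = 5.",
        "answer": "x = 5",
    },
    # ... more long-CoT exemplars would go here ...
]

def build_cold_start_prompt(new_question: str) -> str:
    """Concatenate a few worked long-CoT examples, then append the new question."""
    parts = []
    for ex in FEW_SHOT_EXEMPLARS:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Reasoning: {ex['cot']}\n"
            f"Answer: {ex['answer']}\n"
        )
    parts.append(f"Question: {new_question}\nReasoning:")
    return "\n".join(parts)

if __name__ == "__main__":
    print(build_cold_start_prompt("A train travels 120 km in 2 hours. What is its average speed?"))
```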
How Was the Data Structured?
To maintain consistency and readability, responses were formatted using a structured output pattern of the form |special_token|&lt;reasoning_process&gt;|special_token|&lt;summary&gt;, where the reasoning process is the CoT for the query and the summary condenses the reasoning results.
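Below is a minimal sketch of how a cold-start sample might be serialized into this pattern; the literal special-token string is an assumption, since the report does not spell it out.

```python
# Hypothetical serialization of a cold-start sample into the
# |special_token|<reasoning_process>|special_token|<summary> pattern.
# The literal delimiter string is an assumption for illustration.

SPECIAL_TOKEN = "|special_token|"  # placeholder; the real delimiter is not published

def format_cold_start_sample(reasoning_process: str, summary: str) -> str:
    """Wrap the CoT and its summary in the structured, readable pattern."""
    return f"{SPECIAL_TOKEN}{reasoning_process}{SPECIAL_TOKEN}{summary}"

sample = format_cold_start_sample(
    reasoning_process="First compute 12 * 7 = 84, then subtract 4 to get 80.",
    summary="The answer is 80.",
)
print(sample)
```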
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
3. 🎯 Stage 2: Supervised Fine-Tuning (SFT) on DeepSeek-V3-Base
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
In this stage, the DeepSeek-V3-Base model is trained on Stage-1 cold-start data.
What is the DeepSeek-V3-Base model?
DeepSeek-V3 is an open-source Mixture-of-Experts (MoE) large language model (LLM) with 671B total parameters (37B activated per token). Trained on 14.8T multilingual tokens (mostly English and Chinese), it excels at reasoning, coding, and math. Despite its massive scale, it is cost-effective to train and rivals top proprietary models.
The “Base” designation in DeepSeek-V3-Base signifies that it is a pretrained-only foundation model, trained on a diverse, large-scale corpus without extensive task-specific fine-tuning, making it a versatile starting point for further adaptation.
The DeepSeek-V3-Base model architecture features a 128K-token input context length. Instead of standard multi-head attention, it uses Multi-Head Latent Attention (MLA) to reduce KV-cache memory usage and accelerate inference. An MoE layer replaces the standard feed-forward network in all layers except the first three. Additionally, the model is trained with a Multi-Token Prediction (MTP) objective, which predicts several future tokens at once to densify the training signal.
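To make the MoE description more concrete, here is a minimal sketch of a DeepSeekMoE-style layer with shared plus routed experts and top-k token routing. The hidden sizes, expert counts, top-k value, and softmax routing are illustrative assumptions; the real implementation (with MLA, auxiliary-loss-free load balancing, etc.) is far more involved.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Illustrative shared + routed expert layer (not DeepSeek-V3's actual code).

    Each token always passes through the shared expert and is additionally
    routed to its top-k scoring experts; all dimensions are toy values."""

    def __init__(self, d_model=64, d_ff=128, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                                    # x: (batch, seq, d_model)
        out = self.shared_expert(x)                          # shared expert sees every token
        scores = torch.softmax(self.router(x), dim=-1)       # routing weights per expert
        top_w, top_idx = scores.topk(self.top_k, dim=-1)     # pick top-k experts per token
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (top_idx[..., k] == e).unsqueeze(-1)  # tokens routed to expert e
                out = out + mask * top_w[..., k:k + 1] * expert(x)
        return out

tokens = torch.randn(2, 5, 64)
print(TinyMoELayer()(tokens).shape)   # torch.Size([2, 5, 64])
```

Note that this toy version runs every expert on every token for simplicity; a production MoE dispatches each token only to its selected experts.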
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
4. 🔄 Stage 3: Reinforcement Learning (GRPO) on the Stage 2 Finetuned Model
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
After fine-tuning DeepSeek-V3-Base on the cold-start data, the same large-scale reinforcement learning (RL) process used for DeepSeek-R1-Zero is applied.
For understanding Reinforcement Learning (GRPO) training in DeepSeek-R1-Zero, please refer to my article GRPO and DeepSeek-R1-Zero.
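As a quick refresher, GRPO samples a group of responses per prompt and normalizes each response's reward within the group to obtain its advantage, with no separate value (critic) network. A minimal sketch of that normalization step, with toy reward values:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: standardize each reward within its group.

    advantage_i = (r_i - mean(r)) / std(r), computed over the G responses
    sampled for the same prompt."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0   # avoid division by zero
    return [(r - mean_r) / std_r for r in rewards]

# Example: 4 sampled responses to one prompt, scored by the reward function.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))   # [1.0, -1.0, -1.0, 1.0]
```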
I. Enhancing Reasoning Capabilities
This training phase strengthens the model’s ability to handle:
- Coding
- Mathematics
- Science
- Logical reasoning
These tasks involve well-defined problems with clear solutions.
II. Addressing Language Mixing in Chain-of-Thought (CoT) Reasoning
During training, Chain-of-Thought (CoT) reasoning often exhibits language mixing, especially when the RL prompts involve multiple languages.
- To mitigate this, a language consistency reward is introduced.
Ablation studies show:
- A slight performance trade-off
- Improved readability and alignment with human preferences
III. Final Reward Optimization
The final reward is computed by summing:
- Reasoning accuracy reward
- Language consistency reward
This ensures a balance between precision in reasoning tasks and linguistic coherence.
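A minimal sketch of how these two signals could be summed into the final RL reward is shown below. The language-consistency proxy (share of target-language characters in the CoT) and the unit weights are assumptions, since the report does not give the exact formula.

```python
def accuracy_reward(predicted_answer: str, ground_truth: str) -> float:
    """Rule-based check: 1.0 if the extracted answer matches the reference."""
    return 1.0 if predicted_answer.strip() == ground_truth.strip() else 0.0

def language_consistency_reward(cot: str) -> float:
    """Assumed proxy: share of alphabetic characters in the CoT that are ASCII,
    i.e. written in the target language's (English) script."""
    letters = [c for c in cot if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isascii() for c in letters) / len(letters)

def final_reward(pred: str, truth: str, cot: str) -> float:
    """Final reward = reasoning accuracy reward + language consistency reward."""
    return accuracy_reward(pred, truth) + language_consistency_reward(cot)

print(final_reward("42", "42", "The answer follows by adding 40 and 2."))  # 2.0
```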
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
5. 📊 Stage 4: 800K SFT Data Generation and Sourcing
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
I. Expanding Data Beyond Cold-Start
Unlike the initial cold-start data, which primarily focused on reasoning, this stage integrates diverse domains to enhance the model’s capabilities in:
- Writing
- Role-playing
- General-purpose tasks
II. Reasoning Data Collection
To improve reasoning ability, a structured data collection process is applied:
Curating Reasoning Prompts:
- Prompts are selected based on their complexity and relevance to logical reasoning.
Generating and Filtering Responses [Rejection Sampling]:
- Multiple responses are sampled for each prompt using the stage-3 RL checkpoint (stage-3 final model).
- Only correct responses are retained through rejection sampling (see the sketch after this list).
Dataset Size:
- Approximately 600K reasoning-related training samples are collected.
Data Diversity:
- Previously, only rule-based rewards were used for evaluation.
- This stage expands the dataset by also using generative reward models, where DeepSeek-V3 judges the RL checkpoint's predictions against ground-truth answers.
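The rejection-sampling step above can be pictured as in the sketch below; sample_from_rl_checkpoint and is_correct are hypothetical stand-ins for the Stage-3 checkpoint's sampler and for the rule-based / DeepSeek-V3-based verifier.

```python
import random

def sample_from_rl_checkpoint(prompt: str) -> dict:
    """Hypothetical stand-in for sampling one response from the Stage-3 RL checkpoint."""
    answer = random.choice(["4", "5"])
    return {"prompt": prompt, "cot": f"Reasoning about '{prompt}'...", "answer": answer}

def is_correct(response: dict, ground_truth: str) -> bool:
    """Hypothetical verifier: rule-based check or DeepSeek-V3 judging against the reference."""
    return response["answer"] == ground_truth

def rejection_sample(prompt: str, ground_truth: str, n_samples: int = 8) -> list[dict]:
    """Sample several responses per prompt and keep only the verified-correct ones."""
    candidates = [sample_from_rl_checkpoint(prompt) for _ in range(n_samples)]
    return [c for c in candidates if is_correct(c, ground_truth)]

kept = rejection_sample("What is 2 + 2?", ground_truth="4")
print(f"kept {len(kept)} of 8 sampled responses")
```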
III. Non-Reasoning Data Collection
To strengthen performance in non-reasoning tasks, additional data sources are incorporated:
Categories Included:
- Writing, factual QA, self-cognition, and translation
Data Sources:
- Portions of the DeepSeek-V3 SFT dataset are reused.
- For some tasks, DeepSeek-V3 generates a potential Chain-of-Thought (CoT) before answering.
Dataset Size:
- Approximately 200K non-reasoning training samples are collected.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
6. 🚀 Stage 5: Supervised Fine-Tuning (SFT) on DeepSeek-V3-Base (Not from Stage 2 or 3)
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
The checkpoint produced in Stages 2 and 3 is NOT carried over. Instead, a fresh DeepSeek-V3-Base model is used for this phase.
The DeepSeek-V3-Base model is fine-tuned for two epochs on the curated SFT dataset of approximately 800K samples.
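A schematic view of this stage is sketched below: the two curated subsets from Stage 4 are combined and the model is fine-tuned for two epochs. The placeholder data, batch size, and training step are simplified assumptions, not DeepSeek's actual distributed training code.

```python
import random

# Hypothetical placeholders for the two curated subsets from Stage 4.
reasoning_samples = [{"text": f"reasoning sample {i}"} for i in range(600)]           # ~600K in reality
non_reasoning_samples = [{"text": f"non-reasoning sample {i}"} for i in range(200)]   # ~200K in reality

sft_dataset = reasoning_samples + non_reasoning_samples   # ~800K combined SFT set
EPOCHS = 2                                                # fine-tuned for two epochs

def sft_step(batch):
    """Stand-in for one supervised next-token-prediction update on DeepSeek-V3-Base."""
    pass  # forward pass, cross-entropy loss on response tokens, optimizer step

for epoch in range(EPOCHS):
    random.shuffle(sft_dataset)
    for start in range(0, len(sft_dataset), 32):          # toy batch size of 32
        sft_step(sft_dataset[start:start + 32])
print("finished", EPOCHS, "epochs over", len(sft_dataset), "samples")
```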
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
7. 🎛️ Stage 6: Reinforcement Learning (GRPO) on the Stage 5 Finetuned Model to Produce DeepSeek-R1
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
This final reinforcement learning stage enhances the model’s alignment by refining its helpfulness, harmlessness, and reasoning abilities through carefully designed reward signals and diverse prompts.
For understanding Reinforcement Learning (GRPO) training, please refer to my article GRPO and DeepSeek-R1-Zero.
Where does this reward in the reinforcement training come from?
“For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains. For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios.” — DeepSeek-R1 Technical Report
Improving Reasoning Capabilities:
To improve reasoning capabilities, this step follows the DeepSeek-R1-Zero RL training methodology:
- Rule-based rewards to guide the model in breaking down problems logically.
- Training includes reasoning-heavy tasks such as coding, mathematics, and logical puzzles.
Capturing Nuanced Human Preferences:
Complex and nuanced human preferences are captured using reward models.
- Reward models help refine outputs in subjective or multi-layered tasks.
- Training incorporates diverse pairs of prompts and preferences to cover complex use cases.
Evaluating Helpfulness in Output:
The model’s helpfulness is assessed based on the quality of the final summary to avoid interfering with the underlying reasoning.
- Evaluations emphasize the utility and clarity of the final answer.
- Intermediate steps are left unchanged to preserve logical reasoning.
- Reward models are used to achieve this.
Reducing Harmful Content and Bias:
Harmlessness is evaluated by reviewing the entire response, including both the reasoning process and the final summary, to address potential risks or biases.
- Responses are checked for harmful content, biases, or risks.
- Evaluation covers reasoning steps and final conclusions to ensure safe outputs.
- Reward models are used to achieve this.
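Putting the Stage-6 reward design together, the sketch below routes reasoning prompts to rule-based rewards and general prompts to reward models, scoring helpfulness on the final summary only and harmlessness on the full response. All function bodies, heuristics, and weights are illustrative assumptions, not DeepSeek's actual reward models.

```python
def rule_based_reward(answer: str, ground_truth: str) -> float:
    """Verifiable reward for math/code/logic prompts (exact-match placeholder)."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def helpfulness_rm(summary: str) -> float:
    """Hypothetical reward model scoring only the final summary for utility and clarity."""
    return min(len(summary.split()) / 50.0, 1.0)   # toy heuristic, not a real model

def harmlessness_rm(full_response: str) -> float:
    """Hypothetical reward model scoring the whole response (CoT + summary) for safety."""
    return 0.0 if "unsafe" in full_response.lower() else 1.0

def stage6_reward(prompt_type: str, cot: str, summary: str, ground_truth: str = "") -> float:
    full_response = cot + "\n" + summary
    if prompt_type == "reasoning":                 # math, code, logical reasoning
        task_reward = rule_based_reward(summary, ground_truth)
    else:                                          # general / open-ended prompts
        task_reward = helpfulness_rm(summary)      # helpfulness judged on the summary only
    return task_reward + harmlessness_rm(full_response)   # harmlessness on the full response

print(stage6_reward("reasoning", cot="3 * 4 = 12", summary="12", ground_truth="12"))  # 2.0
```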
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
8. 📚 References
YouTube Video: https://www.youtube.com/watch?v=QdEuh2UVbu0