DeepSeek-R1: Training Recipe and Data
For easier understanding, the DeepSeek-R1 training pipeline is presented here in six stages. The official technical report describes it in four stages; this article separates the intermediate data-generation and fine-tuning steps, giving a more detailed six-stage breakdown.
Note: Since the data used to train DeepSeek-R1 is not open-sourced, this article highlights and expands on the information about data quantity, quality, diversity, and collection methodology given in the DeepSeek (R1, V3, V2, MoE) technical reports.
📜 Table of Contents
- 📌 Short Description of the Training Recipe
- 🟢 Stage 1: Cold-Start Data Collection for Supervised Fine-Tuning (SFT)
- 🎯 Stage 2: Supervised Fine-Tuning (SFT) on DeepSeek-V3-Base
- 🔄 Stage 3: Reinforcement Learning (GRPO) on the Stage 2 Finetuned Model
- 📊 Stage 4: 800K SFT Data Generation and Sourcing
- 🚀 Stage 5: Supervised Fine-Tuning (SFT) on DeepSeek-V3-Base (Not from Stage 2 or 3)
- 🎛️ Stage 6: Reinforcement Learning (GRPO) on the Stage 5 Finetuned Model to produce DeepSeek-R1
- 📚 References
1. 📌 Short Description of the Training Recipe
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
In short, the technical report describes the pipeline as follows: (1) collect a small set of cold-start long-CoT data, (2) supervised fine-tune DeepSeek-V3-Base on it, (3) run large-scale reasoning-oriented RL (GRPO), (4) use the resulting checkpoint to generate roughly 800K SFT samples via rejection sampling, (5) fine-tune a fresh DeepSeek-V3-Base on those samples, and (6) run a final RL stage for helpfulness, harmlessness, and reasoning to produce DeepSeek-R1.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
2. 🟢 Stage 1: Cold-Start Data Collection for Supervised Fine-Tuning (SFT)
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Thousands of cold-start samples, consisting of long-chain-of-thought (CoT) data, are collected to fine-tune DeepSeek-V3-Base.
Why collect cold-start data?
In the development of DeepSeek-R1, the incorporation of cold-start data served specific purposes:
I. Enhancing Readability and Output Structure
- The preceding model, DeepSeek-R1-Zero, trained solely through reinforcement learning (RL), exhibited issues such as poor readability and language mixing.
- To address these challenges, the team fine-tuned the base model using a small set of cold-start data, meticulously designed to promote clear and coherent outputs.
II. Accelerating Convergence in Reinforcement Learning
- Starting RL directly from the raw base model (as in DeepSeek-R1-Zero) can lead to instability and slow convergence in the early phase.
- By fine-tuning the model on cold-start data before initiating RL, the model achieved a more stable and efficient learning process.
How Was the Data Collected?
Several approaches were used to generate high-quality cold-start reasoning data:
- Few-shot prompting with long CoT examples: The model was given structured reasoning examples and prompted to generate similar responses (see the sketch after this list). [Note: The technical report does not make clear which "model" this refers to — DeepSeek-V3-Base, DeepSeek-R1-Zero, or another model. It is most likely DeepSeek-R1-Zero, given its stronger reasoning capabilities.]
- Direct prompting with reflection and verification: The model was encouraged to explain and verify its own reasoning step by step.
- Reformatting DeepSeek-R1-Zero outputs: Outputs from a previous version were collected and structured for improved readability.
- Human annotation and post-processing: AI-generated responses were reviewed and refined to ensure clarity, coherence, and logical correctness.
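To make the few-shot prompting approach above concrete, here is a minimal sketch that assembles a prompt from a handful of long-CoT exemplars. The exemplar texts, field names, and delimiters are illustrative assumptions, not the actual prompts used to build DeepSeek-R1's cold-start data.

```python
# Hypothetical sketch of few-shot prompting with long-CoT exemplars.
# Exemplar contents and formatting are assumptions for illustration only.

FEW_SHOT_EXEMPLARS = [
    {
        "question": "If 3x + 5 = 20, what is x?",
        "cot": "Subtract 5 from both sides: 3x = 15. Divide both sides by 3: x = 5.",
        "answer": "x = 5",
    },
    # ... more long-CoT exemplars would go here ...
]

def build_cold_start_prompt(new_question: str) -> str:
    """Concatenate a few worked long-CoT examples, then append the new question."""
    parts = []
    for ex in FEW_SHOT_EXEMPLARS:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Reasoning: {ex['cot']}\n"
            f"Answer: {ex['answer']}\n"
        )
    parts.append(f"Question: {new_question}\nReasoning:")
    return "\n".join(parts)

if __name__ == "__main__":
    print(build_cold_start_prompt("A train travels 120 km in 2 hours. What is its average speed?"))
```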
How Was the Data Structured?
To maintain consistency and readability, responses were formatted using a structured output pattern of the form |special_token|&lt;reasoning_process&gt;|special_token|&lt;summary&gt;, where the reasoning process is the CoT for the query and the summary condenses the reasoning results.
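Below is a minimal sketch of how a cold-start sample might be serialized into this pattern; the literal special-token string is an assumption, since the report does not spell it out.

```python
# Hypothetical serialization of a cold-start sample into the
# |special_token|<reasoning_process>|special_token|<summary> pattern.
# The literal delimiter string is an assumption for illustration.

SPECIAL_TOKEN = "|special_token|"  # placeholder; the real delimiter is not published

def format_cold_start_sample(reasoning_process: str, summary: str) -> str:
    """Wrap the CoT and its summary in the structured, readable pattern."""
    return f"{SPECIAL_TOKEN}{reasoning_process}{SPECIAL_TOKEN}{summary}"

sample = format_cold_start_sample(
    reasoning_process="First compute 12 * 7 = 84, then subtract 4 to get 80.",
    summary="The answer is 80.",
)
print(sample)
```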
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
3. 🎯 Stage 2: Supervised Fine-Tuning (SFT) on DeepSeek-V3-Base
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
In this stage, the DeepSeek-V3-Base model is trained on Stage-1 cold-start data.
What is the DeepSeek-V3-Base model?
DeepSeek-V3 is an open-source Mixture-of-Experts (MoE) large language model (LLM) with 671B total parameters (37B activated per token). Trained on 14.8T multilingual tokens (mostly English and Chinese), it excels at reasoning, coding, and math. Despite its massive scale, it is cost-effective to train and rivals top proprietary models.
The “Base” designation in DeepSeek-V3-Base signifies that it is a pretrained-only foundation model, trained on a diverse, large-scale corpus without extensive task-specific fine-tuning, making it a versatile starting point for further adaptation.
The DeepSeek-V3-Base model architecture features a 128K-token input context length. Instead of standard multi-head attention, it uses Multi-Head Latent Attention (MLA) to reduce KV-cache memory usage and accelerate inference. An MoE layer replaces the standard feed-forward network in all layers except the first three. Additionally, the model is trained with a Multi-Token Prediction (MTP) objective, which predicts several future tokens at once to densify the training signal.
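To make the MoE description more concrete, here is a minimal sketch of a DeepSeekMoE-style layer with shared plus routed experts and top-k token routing. The hidden sizes, expert counts, top-k value, and softmax routing are illustrative assumptions; the real implementation (with MLA, auxiliary-loss-free load balancing, etc.) is far more involved.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Illustrative shared + routed expert layer (not DeepSeek-V3's actual code).

    Each token always passes through the shared expert and is additionally
    routed to its top-k scoring experts; all dimensions are toy values."""

    def __init__(self, d_model=64, d_ff=128, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                                    # x: (batch, seq, d_model)
        out = self.shared_expert(x)                          # shared expert sees every token
        scores = torch.softmax(self.router(x), dim=-1)       # routing weights per expert
        top_w, top_idx = scores.topk(self.top_k, dim=-1)     # pick top-k experts per token
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (top_idx[..., k] == e).unsqueeze(-1)  # tokens routed to expert e
                out = out + mask * top_w[..., k:k + 1] * expert(x)
        return out

tokens = torch.randn(2, 5, 64)
print(TinyMoELayer()(tokens).shape)   # torch.Size([2, 5, 64])
```

Note that this toy version runs every expert on every token for simplicity; a production MoE dispatches each token only to its selected experts.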
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
4. 🔄 Stage 3: Reinforcement Learning (GRPO) on the Stage 2 Finetuned Model
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
After fine-tuning DeepSeek-V3-Base on the cold-start data, the same large-scale reinforcement learning (RL) process used for DeepSeek-R1-Zero is applied.
For understanding Reinforcement Learning (GRPO) training in DeepSeek-R1-Zero, please refer to my article GRPO and DeepSeek-R1-Zero.
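As a quick refresher, GRPO samples a group of responses per prompt and normalizes each response's reward within the group to obtain its advantage, with no separate value (critic) network. A minimal sketch of that normalization step, with toy reward values:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: standardize each reward within its group.

    advantage_i = (r_i - mean(r)) / std(r), computed over the G responses
    sampled for the same prompt."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0   # avoid division by zero
    return [(r - mean_r) / std_r for r in rewards]

# Example: 4 sampled responses to one prompt, scored by the reward function.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))   # [1.0, -1.0, -1.0, 1.0]
```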
I. Enhancing Reasoning Capabilities
This training phase strengthens the model’s ability to handle:
- Coding
- Mathematics
- Science
- Logical reasoning
These tasks involve well-defined problems with clear solutions.
II. Addressing Language Mixing in Chain-of-Thought (CoT) Reasoning
During training, Chain-of-Thought (CoT) reasoning often exhibits language mixing, especially when the RL prompts involve multiple languages.
- To mitigate this, a language consistency reward is introduced.
Ablation studies show:
- A slight performance trade-off
- Improved readability and alignment with human preferences
III. Final Reward Optimization
The final reward is computed by summing:
- Reasoning accuracy reward
- Language consistency reward
This ensures a balance between precision in reasoning tasks and linguistic coherence.
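A minimal sketch of how these two signals could be summed into the final RL reward is shown below. The language-consistency proxy (share of target-language characters in the CoT) and the unit weights are assumptions, since the report does not give the exact formula.

```python
def accuracy_reward(predicted_answer: str, ground_truth: str) -> float:
    """Rule-based check: 1.0 if the extracted answer matches the reference."""
    return 1.0 if predicted_answer.strip() == ground_truth.strip() else 0.0

def language_consistency_reward(cot: str) -> float:
    """Assumed proxy: share of alphabetic characters in the CoT that are ASCII,
    i.e. written in the target language's (English) script."""
    letters = [c for c in cot if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isascii() for c in letters) / len(letters)

def final_reward(pred: str, truth: str, cot: str) -> float:
    """Final reward = reasoning accuracy reward + language consistency reward."""
    return accuracy_reward(pred, truth) + language_consistency_reward(cot)

print(final_reward("42", "42", "The answer follows by adding 40 and 2."))  # 2.0
```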
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
5. 📊 Stage 4: 800K SFT Data Generation and Sourcing
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
I. Expanding Data Beyond Cold-Start
Unlike the initial cold-start data, which primarily focused on reasoning, this stage integrates diverse domains to enhance the model’s capabilities in:
- Writing
- Role-playing
- General-purpose tasks
II. Reasoning Data Collection
To improve reasoning ability, a structured data collection process is applied:
Curating Reasoning Prompts:
- Prompts are selected based on their complexity and relevance to logical reasoning.
Generating and Filtering Responses [Rejection Sampling]:
- Multiple responses are sampled for each prompt using the stage-3 RL checkpoint (stage-3 final model).
- Only correct responses are retained through rejection sampling (see the sketch after this list).
Dataset Size:
- Approximately 600K reasoning-related training samples are collected.
Data Diversity:
- Previously, only rule-based rewards were used for evaluation.
- This stage expands the dataset by also using generative reward models, where DeepSeek-V3 judges the RL checkpoint's predictions against ground-truth answers.
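The rejection-sampling step above can be pictured as in the sketch below; sample_from_rl_checkpoint and is_correct are hypothetical stand-ins for the Stage-3 checkpoint's sampler and for the rule-based / DeepSeek-V3-based verifier.

```python
import random

def sample_from_rl_checkpoint(prompt: str) -> dict:
    """Hypothetical stand-in for sampling one response from the Stage-3 RL checkpoint."""
    answer = random.choice(["4", "5"])
    return {"prompt": prompt, "cot": f"Reasoning about '{prompt}'...", "answer": answer}

def is_correct(response: dict, ground_truth: str) -> bool:
    """Hypothetical verifier: rule-based check or DeepSeek-V3 judging against the reference."""
    return response["answer"] == ground_truth

def rejection_sample(prompt: str, ground_truth: str, n_samples: int = 8) -> list[dict]:
    """Sample several responses per prompt and keep only the verified-correct ones."""
    candidates = [sample_from_rl_checkpoint(prompt) for _ in range(n_samples)]
    return [c for c in candidates if is_correct(c, ground_truth)]

kept = rejection_sample("What is 2 + 2?", ground_truth="4")
print(f"kept {len(kept)} of 8 sampled responses")
```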
III. Non-Reasoning Data Collection
To strengthen performance in non-reasoning tasks, additional data sources are incorporated:
Categories Included:
- Writing, factual QA, self-cognition, and translation
Data Sources:
- Portions of the DeepSeek-V3 SFT dataset are reused.
- For some tasks, DeepSeek-V3 generates a potential Chain-of-Thought (CoT) before answering.
Dataset Size:
- Approximately 200K non-reasoning training samples are collected.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
6. 🚀 Stage 5: Supervised Fine-Tuning (SFT) on DeepSeek-V3-Base (Not from Stage 2 or 3)
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
The checkpoint produced in Stages 2 and 3 is NOT carried over. Instead, a fresh DeepSeek-V3-Base model is used for this phase.
The DeepSeek-V3-Base model is fine-tuned for two epochs on the curated SFT dataset of approximately 800K samples.
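A schematic view of this stage is sketched below: the two curated subsets from Stage 4 are combined and the model is fine-tuned for two epochs. The placeholder data, batch size, and training step are simplified assumptions, not DeepSeek's actual distributed training code.

```python
import random

# Hypothetical placeholders for the two curated subsets from Stage 4.
reasoning_samples = [{"text": f"reasoning sample {i}"} for i in range(600)]           # ~600K in reality
non_reasoning_samples = [{"text": f"non-reasoning sample {i}"} for i in range(200)]   # ~200K in reality

sft_dataset = reasoning_samples + non_reasoning_samples   # ~800K combined SFT set
EPOCHS = 2                                                # fine-tuned for two epochs

def sft_step(batch):
    """Stand-in for one supervised next-token-prediction update on DeepSeek-V3-Base."""
    pass  # forward pass, cross-entropy loss on response tokens, optimizer step

for epoch in range(EPOCHS):
    random.shuffle(sft_dataset)
    for start in range(0, len(sft_dataset), 32):          # toy batch size of 32
        sft_step(sft_dataset[start:start + 32])
print("finished", EPOCHS, "epochs over", len(sft_dataset), "samples")
```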
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
7. 🎛️ Stage 6: Reinforcement Learning (GRPO) on the Stage 5 Finetuned Model to Produce DeepSeek-R1
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
This final reinforcement learning stage enhances the model’s alignment by refining its helpfulness, harmlessness, and reasoning abilities through carefully designed reward signals and diverse prompts.
For understanding Reinforcement Learning (GRPO) training, please refer to my article GRPO and DeepSeek-R1-Zero.
Where does this reward in the reinforcement training come from?
“For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains. For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios.” — DeepSeek-R1 Technical Report
Improving Reasoning Capabilities:
To improve reasoning capabilities, this step follows the DeepSeek-R1-Zero RL training methodology:
- Rule-based rewards to guide the model in breaking down problems logically.
- Training includes reasoning-heavy tasks such as coding, mathematics, and logical puzzles.
Capturing Nuanced Human Preferences:
Complex and nuanced human preferences are captured using reward models.
- Reward models help refine outputs in subjective or multi-layered tasks.
- Training incorporates diverse pairs of prompts and preferences to cover complex use cases.
Evaluating Helpfulness in Output:
The model’s helpfulness is assessed based on the quality of the final summary to avoid interfering with the underlying reasoning.
- Evaluations emphasize the utility and clarity of the final answer.
- Intermediate steps are left unchanged to preserve logical reasoning.
- Reward models are used to achieve this.
Reducing Harmful Content and Bias:
Harmlessness is evaluated by reviewing the entire response, including both the reasoning process and the final summary, to address potential risks or biases.
- Responses are checked for harmful content, biases, or risks.
- Evaluation covers reasoning steps and final conclusions to ensure safe outputs.
- Reward models are used to achieve this.
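Putting the Stage-6 reward design together, the sketch below routes reasoning prompts to rule-based rewards and general prompts to reward models, scoring helpfulness on the final summary only and harmlessness on the full response. All function bodies, heuristics, and weights are illustrative assumptions, not DeepSeek's actual reward models.

```python
def rule_based_reward(answer: str, ground_truth: str) -> float:
    """Verifiable reward for math/code/logic prompts (exact-match placeholder)."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def helpfulness_rm(summary: str) -> float:
    """Hypothetical reward model scoring only the final summary for utility and clarity."""
    return min(len(summary.split()) / 50.0, 1.0)   # toy heuristic, not a real model

def harmlessness_rm(full_response: str) -> float:
    """Hypothetical reward model scoring the whole response (CoT + summary) for safety."""
    return 0.0 if "unsafe" in full_response.lower() else 1.0

def stage6_reward(prompt_type: str, cot: str, summary: str, ground_truth: str = "") -> float:
    full_response = cot + "\n" + summary
    if prompt_type == "reasoning":                 # math, code, logical reasoning
        task_reward = rule_based_reward(summary, ground_truth)
    else:                                          # general / open-ended prompts
        task_reward = helpfulness_rm(summary)      # helpfulness judged on the summary only
    return task_reward + harmlessness_rm(full_response)   # harmlessness on the full response

print(stage6_reward("reasoning", cot="3 * 4 = 12", summary="12", ground_truth="12"))  # 2.0
```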
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
8. 📚 References
YouTube Video: https://www.youtube.com/watch?v=QdEuh2UVbu0