DeepSeek-R1: Model Architecture

Shakti Wadekar
18 min read · Feb 5, 2025


This article provides an in-depth exploration of the DeepSeek-R1 model architecture. We trace the DeepSeek-R1 model from input to output to find the new developments and critical parts of the architecture. DeepSeek-R1 is based on the DeepSeek-V3-Base model architecture, and this article aims to cover all essential aspects of its design.

📖 Table of Contents:

📝 1. Input Context Length
🏛 2. Total Layers
🔬 3. First 3 DeepSeek-R1 Layers
🧩 4. Layers 4 to 61 of DeepSeek-R1
🧠 5. Multi-Head Latent Attention (MLA)
🎭 6. Mixture-of-Experts (MoE)
🔢 7. Multi Token Prediction (MTP)

📝 1. Input Context Length

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

DeepSeek-R1’s input context length is 128K.

DeepSeek-R1 inherits its 128K context length from its base model, DeepSeek-V3-Base. Initially, DeepSeek-V3 is pretrained with a 4K context length. Then, a two-stage context length extension increases it first to 32K and then to 128K, utilizing the YaRN technique.

YaRN (Yet another RoPE extensioN method) is a technique aimed at efficiently extending the context window of large language models (LLMs) that use Rotary Position Embeddings (RoPE). RoPE encodes positional information using a rotation matrix, and YaRN modifies how these rotational frequencies scale. Instead of simply extrapolating frequencies (which often leads to performance drops), it smoothly interpolates and adjusts these frequencies, enabling better generalization to longer contexts. This makes it a computationally efficient and practical way to extend a model’s context length without massive retraining.
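To make the frequency-scaling idea concrete, below is a minimal NumPy sketch of YaRN-style inverse-frequency blending. It is a simplified illustration rather than DeepSeek’s actual implementation: the head dimension, RoPE base, and the beta_fast/beta_slow cut-offs are assumed values, and the real method additionally rescales the attention temperature.

```python
import math
import numpy as np

def yarn_inv_freq(dim=64, base=10000.0, orig_ctx=4096, scale=32.0,
                  beta_fast=32.0, beta_slow=1.0):
    """Blend extrapolated and interpolated RoPE inverse frequencies per dimension."""
    # Standard RoPE inverse frequencies (one per pair of hidden dimensions).
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))

    # Dimension indices where the wavelength completes beta_fast / beta_slow
    # full rotations over the original context (the "NTK-by-parts" boundaries).
    def correction_dim(num_rotations):
        return dim * math.log(orig_ctx / (num_rotations * 2 * math.pi)) / (2 * math.log(base))

    low, high = correction_dim(beta_fast), correction_dim(beta_slow)

    # Ramp from 0 (keep the original frequency) to 1 (fully interpolate).
    ramp = np.clip((np.arange(dim // 2) - low) / max(high - low, 1e-3), 0.0, 1.0)

    # Position interpolation squeezes frequencies by the context-extension factor.
    interpolated = inv_freq / scale
    return interpolated * ramp + inv_freq * (1.0 - ramp)

# Example: frequencies for extending a 4K-trained model by a factor of 32 (to 128K).
print(yarn_inv_freq()[:4])
```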

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

🏛 2. Total Layers

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

DeepSeek-R1 consists of an embedding layer, followed by 61 transformer layers, and multiple prediction heads at the output stage.

DeepSeek-R1 employs Multi-Head Latent Attention (MLA) layers instead of standard multi-head attention across all transformer layers. The first three transformer layers differ from the rest, using a standard Feed-Forward Network (FFN) layer. From layer 4 to 61, a Mixture-of-Experts (MoE) layer replaces the FFN layer. The details of MLA and MoE will be explored in the following sections.

Transformer Layers in DeepSeek-R1 (Source 1 and Source 2)

Full model architecture description with dimensions.

DeepSeek-R1 architecture details

DeepSeek-V3 predicts the next 2 tokens with the Multi-Token Prediction (MTP) technique, using its final prediction heads. The acceptance rate of the second predicted token ranges between 85% and 90%, demonstrating high reliability across various generation topics. DeepSeek-R1 (DeepSeek-V3) comprises 671B total parameters, of which 37B are activated for each token.
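For quick reference, the key dimensions can be collected into a small Python dictionary. The values below are my reading of the DeepSeek-V3 technical report; treat them as notes to be verified against the official config.json rather than an authoritative spec.

```python
# My summary of key DeepSeek-V3/R1 dimensions from the technical report;
# verify against the official config.json before relying on these numbers.
deepseek_v3_dims = {
    "num_hidden_layers": 61,          # 3 dense layers + 58 MoE layers
    "hidden_size": 7168,              # model (embedding) dimension
    "num_attention_heads": 128,
    "kv_lora_rank": 512,              # MLA latent dimension for keys/values (d_c)
    "q_lora_rank": 1536,              # MLA latent dimension for queries (d'_c)
    "qk_rope_head_dim": 64,           # decoupled RoPE dimension per head (d_h^R)
    "n_shared_experts": 1,
    "n_routed_experts": 256,
    "num_experts_per_tok": 8,         # routed experts activated per token
    "context_length": "128K",         # after the two-stage YaRN extension
    "total_params": "671B",
    "activated_params_per_token": "37B",
}
```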

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

🔬 3. First 3 DeepSeek-R1 Layers

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

The first 3 layers consist of Multi-Head Latent Attention (MLA) and a standard FFN layer. These are typically referred to as ‘dense LLM layers’, since the FFN layer is not replaced with an MoE layer, which is sparser in comparison.

First 3 Transformer Layers in DeepSeek-R1

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

🧩 4. Layers 4 to 61 of DeepSeek-R1

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

These layers consist of MLA and MoE layers. We will look at what MLA and MoE are and how they work in the coming sections.

MoE Transformer Layer

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

🧠 5. Multi-Head Latent Attention (MLA)

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Let’s now understand what MLA is.

MLA was first introduced in DeepSeek-V2 and has been carried over to DeepSeek-V3 and DeepSeek-R1.

Why was MLA developed?

Below is a statement from the DeepSeek-V2 technical report that clearly establishes the reason behind the development of MLA.

“Conventional Transformer models usually adopts Multi-Head Attention (MHA), but during generation, its heavy Key-Value (KV) cache will become the bottleneck that limit the inference efficiency. In order to reduce the KV cache, Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) are proposed. They require a smaller magnitude of KV cache, but their performance does not match MHA.

For DeepSeek-V2, we design an innovative attention mechanism called Multi-head Latent Attention (MLA). Equipped with low-rank key-value joint compression, MLA achieves better performance than MHA, but requires a significantly smaller amount of KV cache.”

Comparison of MLA with MHA, GQA and MQA : Diagram from DeepSeek-V2

How does MLA achieve reduced KV cache for faster inference?

“The core of MLA is the low-rank joint compression for attention keys and values to reduce Key-Value (KV) cache during inference.” — DeepSeek-V2

Multi-head Latent Attention (MLA) used in DeepSeek-V2, DeepSeek-V3 and Deepseek-R1

Let’s understand this diagram step by step:

Step 1: Down projection of Q, K and V

Down-Projection of input to get Q, K and V

The input to the MLA layer is h_t​. For clarity, let’s assume h_t has a shape of (input_sequence_length×2000).

In a traditional transformer layer, weight matrices are used to project h_t​ into query (Q), key (K), and value (V) representations. Each of these typically retains the same hidden dimension as the input, resulting in Q,K,V having a shape of (input_sequence_length×2000).

However, in a transformer layer with Multi-Head Latent Attention (MLA), the weight matrices generate Q, K, and V with significantly smaller dimensions compared to the input. Instead of retaining the full hidden dimension, MLA reduces their size. For example, if the input h_t​ has a shape of (sequence length × 2000), the resulting Q, K, and V might have a shape of (sequence length × 100).

During implementation, the weight matrices for Q, K, and V are typically fused for better compute and memory efficiency on GPUs. Instead of applying separate projections, a combined weight matrix is used to optimize operations. In MLA, the generation of K and V follows this principle. Specifically, a single weight matrix, denoted as W_DKV​, is used in the equation. Here, the ‘D’ in W_DKV​ stands for the down-projection weight matrix, reflecting its role in reducing dimensionality for efficient attention computation.

Latent K and V embeddings

The output of this projection is a compressed latent representation that jointly encodes both K and V. In our example it has a shape of (sequence length × 200). Rather than being sliced into separate K and V halves, this single shared latent is used to reconstruct both K and V through dedicated up-projections in the next step.

This compressed latent is what gets cached during inference, significantly reducing the memory footprint of the KV cache.

Similarly, Q is also compressed in MLA. The resulting shape of Q is (sequence length × 100).

Latent Q embeddings
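Here is a minimal PyTorch sketch of Step 1 using the toy dimensions from the running example (hidden size 2000, latent sizes 200 and 100). The shapes, and the W_DQ name for the query down-projection, are illustrative assumptions, not DeepSeek-R1’s real configuration.

```python
import torch
import torch.nn as nn

hidden_dim, kv_latent_dim, q_latent_dim = 2000, 200, 100   # toy sizes from the example

W_DKV = nn.Linear(hidden_dim, kv_latent_dim, bias=False)   # fused down-projection for K and V
W_DQ  = nn.Linear(hidden_dim, q_latent_dim, bias=False)    # down-projection for Q

h_t = torch.randn(8, hidden_dim)        # (sequence_length=8, hidden_dim)

c_t_KV = W_DKV(h_t)                     # (8, 200): the latent that gets cached at inference
c_t_Q  = W_DQ(h_t)                      # (8, 100): compressed query latent
print(c_t_KV.shape, c_t_Q.shape)
```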

Step 2: Up-projection of Q, K and V

Up-Projection of Q, K and V

After compression, Q, K, and V are up-projected back to a larger size for attention computation. This larger size can either match the original input h_t or follow a structure based on the attention head configuration.

For example, the up-projected shape can be:

  • (sequence length×2000), matching the input size.
  • (sequence length×3200), where 3200 is derived from 64×50 (with 64 attention heads and 50 dimensions per head).
K and V up-projection
Q up-projection

The up-projection of Q, K, and V is performed using dedicated weight matrices:

  • W_UK​ for K up-projection
  • W_UV​ for V up-projection
  • W_UQ​ for Q up-projection

Here, ‘U’ stands for up-projection, indicating the expansion of the compressed representations back to a larger dimensional space for attention computation.

Note: The per-attention-head input dimension will be adjusted to accommodate Rotary Positional Embeddings (RoPE). This adjustment will become clearer in the upcoming sections.
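Continuing in the same spirit, here is a self-contained sketch of Step 2. The layout of 4 heads × 50 dimensions per head (200 total) is an assumed toy configuration chosen to match the example above, not the real model’s.

```python
import torch
import torch.nn as nn

seq, kv_latent_dim, q_latent_dim = 8, 200, 100
num_heads, head_dim = 4, 50                     # 4 x 50 = 200, an assumed layout

c_t_KV = torch.randn(seq, kv_latent_dim)        # cached joint K/V latent from Step 1
c_t_Q  = torch.randn(seq, q_latent_dim)         # compressed query latent from Step 1

W_UK = nn.Linear(kv_latent_dim, num_heads * head_dim, bias=False)  # 'U' = up-projection
W_UV = nn.Linear(kv_latent_dim, num_heads * head_dim, bias=False)
W_UQ = nn.Linear(q_latent_dim,  num_heads * head_dim, bias=False)

k_t_C = W_UK(c_t_KV).view(seq, num_heads, head_dim)  # content (non-RoPE) keys per head
v_t_C = W_UV(c_t_KV).view(seq, num_heads, head_dim)  # values per head
q_t_C = W_UQ(c_t_Q).view(seq, num_heads, head_dim)   # content queries per head
print(q_t_C.shape)                                   # torch.Size([8, 4, 50])
```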

Step 3: RoPE embeddings in Q and K to encode positional information

This step calculates the RoPE embeddings that encode positional information.

Incorporation of Rotary Positional Embeddings (RoPE):

  • Decoupled RoPE Strategy: To integrate positional information, DeepSeek-V2 (consequently DeepSeek-V3 and DeepSeek-R1) employs a decoupled RoPE approach. This involves creating additional query (Q) and key (K) vectors specifically designed to carry positional information.
  • Concatenation: These RoPE-enhanced Q and K vectors are concatenated with the up-projected Q and K vectors.

This is the trickier part of MLA.

I will try to explain it the way I understood it from DeepSeek’s technical reports.

In traditional transformer layers, the RoPE operation is applied directly to Q and K. It does not change their dimensions, but it changes the numerical values in Q and K to encode positional information, so the resulting Q and K carry both semantic and positional information.

But in transformer layers with MLA, RoPE is applied to separate, newly generated query (Q) and key (K) embeddings, which are then concatenated to the up-projected Q and K.

Step 3.1: Generating RoPE embeddings for Q

Traditionally, RoPE (Rotary Positional Embeddings) applies a rotation matrix to the query (Q) and key (K) vectors based on their positions in the sequence. This transformation encodes relative positional information directly within Q and K, eliminating the need for explicit positional embeddings like sinusoidal or absolute encodings.

But in MLA, instead of applying RoPE to the up-projected Q (q_t_C), new Q embeddings (q_t_R) are generated from c_t_Q and RoPE is applied to them.

Completely separate query embeddings are generated from c_t_Q by multiplying it with the weight matrix W_QR. These new query embeddings then go through the RoPE transformation, giving us position-encoded query embeddings (q_t_R).

q_t_R is generated such that it can be concatenated to each attention head’s input query embeddings, so that every attention head has the positional information. [From the equation this statement seems to be true, but needs additional verification.]

Step 3.2: Generating RoPE embeddings for K

Similarly, instead of applying RoPE to the up-projected K, new K embeddings are generated and RoPE is applied to them.

But there are two critical differences from the RoPE-embedded q_t_R:

  1. The new K embeddings are generated from h_t (the input embeddings) instead of from the down-projected K (c_t_K).
  2. The same RoPE-embedded K (keys) is concatenated to the input of every attention head, whereas separate RoPE-embedded Q (queries) are calculated and concatenated for each attention head, as seen in Step 3.1. [From the equation this statement seems to be true, but needs additional verification.]

Why not generate it from the up-projected K, i.e., k_t_C?

Reasoning in the DeepSeek-V2 report:

“If we apply RoPE for the keys k_C, W_UK will be coupled with a position-sensitive RoPE matrix. In this way, W_UK cannot be absorbed into W_Q any more during inference, since a RoPE matrix related to the currently generating token will lie between W_Q and W_UK and matrix multiplication does not obey a commutative law.”

This can be better understood from the below explanation screenshots:

RoPE embeddings for K: Part 1
RoPE embeddings for K: Part 2
RoPE embeddings for K: Part 3

Hence, for efficiency in inference, the position embedded K (key) embeddings are generated from the input embeddings h_t.
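The sketch below illustrates this decoupled RoPE path under the same toy dimensions: per-head RoPE queries q_t_R are produced from the query latent c_t_Q via W_QR, while a single shared RoPE key k_t_R is produced directly from h_t via W_KR. The rope_dim of 16 and the simplified apply_rope helper are assumptions for illustration, not DeepSeek’s exact implementation.

```python
import torch
import torch.nn as nn

seq, hidden_dim, q_latent_dim, num_heads, rope_dim = 8, 2000, 100, 4, 16

def apply_rope(x, pos):
    """Minimal RoPE: rotate the two halves of the last dim by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (10000.0 ** (torch.arange(half, dtype=torch.float32) / half))
    angles = pos[:, None] * freqs                    # (seq, half)
    while angles.dim() < x.dim():                    # broadcast over any head dimension
        angles = angles.unsqueeze(1)
    x1, x2 = x[..., :half], x[..., half:]
    cos, sin = angles.cos(), angles.sin()
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

W_QR = nn.Linear(q_latent_dim, num_heads * rope_dim, bias=False)  # per-head RoPE queries
W_KR = nn.Linear(hidden_dim, rope_dim, bias=False)                # one shared RoPE key

c_t_Q = torch.randn(seq, q_latent_dim)               # query latent from Step 1
h_t   = torch.randn(seq, hidden_dim)                 # original layer input
pos   = torch.arange(seq, dtype=torch.float32)

q_t_R = apply_rope(W_QR(c_t_Q).view(seq, num_heads, rope_dim), pos)  # (8, 4, 16)
k_t_R = apply_rope(W_KR(h_t), pos)                                   # (8, 16), shared by heads

# These would then be concatenated with the content parts from Step 2, e.g.:
# q = torch.cat([q_t_C, q_t_R], dim=-1)
# k = torch.cat([k_t_C, k_t_R.unsqueeze(1).expand(-1, num_heads, -1)], dim=-1)
```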

Wouldn’t the introduction of additional weight matrices in MLA cause memory and compute inefficiencies?

To address these overheads, the DeepSeek-V2 reports:

“In addition, during inference, since W_UK can be absorbed into W_Q, and W_UV can be absorbed into W_O, we even do not need to compute keys and values out for attention.”

To further reduce memory consumption:

“Moreover, in order to reduce the activation memory during training, we also perform low-rank compression for the queries, even if it cannot reduce the KV cache”

Step 4: Calculating Attention output

The concatenation process increases the dimensionality of the Q and K vectors. To manage this increased dimensionality, the model can either:

  • Increase the Number of Attention Heads: This would maintain the original per-head dimensionality but requires more computational resources.
  • Adjust Per-Head Dimensionality: Keep the number of heads constant but increase the dimensionality of each head to accommodate the concatenated vectors.

The attention output is calculated using the standard attention equations:

o_t_i is the attention output of the i-th head, and u_t is the final attention output. W_o represents the output projection weight matrix. The output is projected back to the same dimensions as the input (in our example, this shape is input_sequence_length x 2000).
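A small sketch of Step 4 with the toy shapes used so far: q and k are assumed to already hold the concatenated content + RoPE parts (50 + 16 = 66 dims per head), attention is computed per head with a causal mask, and W_O projects the concatenated head outputs back to the model dimension.

```python
import torch
import torch.nn as nn

seq, num_heads, qk_dim, v_dim, hidden_dim = 8, 4, 66, 50, 2000

q = torch.randn(seq, num_heads, qk_dim)   # per-head queries (content + RoPE parts)
k = torch.randn(seq, num_heads, qk_dim)   # per-head keys (content + shared RoPE part)
v = torch.randn(seq, num_heads, v_dim)    # per-head values

# Scaled dot-product attention per head, with a causal mask for generation.
scores = torch.einsum("qhd,khd->hqk", q, k) / qk_dim ** 0.5
mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))
o = torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), v)   # (seq, heads, v_dim)

# Concatenate heads and project back to the model dimension with W_O.
W_O = nn.Linear(num_heads * v_dim, hidden_dim, bias=False)
u_t = W_O(o.reshape(seq, num_heads * v_dim))                  # (seq, 2000)
print(u_t.shape)
```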

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

🎭 6. Mixture-of-Experts (MoE)

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

What is Mixture-of-Experts (MoE)?

To clearly understand what MoE is, let’s first see where exactly it is used in the transformer and briefly look at its architecture. The FFN in a standard transformer layer is replaced with an MoE layer.

MoE in a Transformer layer

At its core, MoE follows the standard transformer design but modifies the feed-forward layer by introducing multiple parallel expert networks (FFNs) instead of a single dense FFN. Here’s how it functions:

1. Multiple FFNs Instead of One

  • Instead of a single shared FFN, MoE uses multiple FFN layers (experts) trained in parallel.

2. Input Processing & Token Routing

  • Each token passes through the transformer self-attention layer as usual.
  • Instead of being processed by a single FFN, it is sent to a router that decides which experts should handle it.

3. Expert Selection via Router

  • A small, trainable router determines which subset of experts (FFNs) should process each token.
  • Typically, only 1 or 2 experts are selected per token to maintain efficiency (e.g., top-1 or top-2 gating). DeepSeek-V3 (and hence DeepSeek-R1) activates 9 experts per token, where 1 is a shared expert and the other 8 are routed experts.
  • Selection is often based on a softmax scoring mechanism, where the router assigns probabilities to each expert. DeepSeek-V3 (DeepSeek-R1), specifically, uses a sigmoid instead of a softmax.

4. Sparse Computation with Experts

  • Only the chosen experts process the token, while others remain inactive.
  • The expert outputs are combined using weighted summation and passed to the next transformer layer. In DeepSeek-V3/R1, the weights are the normalized sigmoid outputs.
  • This sparse activation ensures that only a fraction of the model is used at any time, keeping computation manageable.

Why Replace a Single FFN with MoE?

Scalability — MoE allows models to scale with more parameters without linearly increasing computation.
Efficient Learning — Experts specialize in different aspects of data, improving generalization.
Computation Savings — Since only a subset of experts is used per token, MoE models are cheaper to run compared to dense models of the same size. DeepSeek-V3/R1 has 671 billion total parameters, of which 37 billion are activated for each token.

How MoE works in DeepSeek-R1?

The equations below, from the DeepSeek-V3 technical report, show the computations in each MoE layer. Within the DeepSeek family of models, this MoE architecture was first introduced in the DeepSeekMoE model, and it is used in DeepSeek-V2, DeepSeek-V3 and DeepSeek-R1.

Router computations:

In DeepSeek-V3, DeepSeek-R1 and some other modern Mixture-of-Experts (MoE) models, e_i​ represents a learned centroid that helps in routing inputs to the right experts. Unlike traditional MoE architectures where an FFN-based router computes gating scores, this approach pre-defines a set of learnable vectors e_i​, each corresponding to an expert.

Key Idea:

  • Each expert i has an associated centroid vector e_i​.
  • Instead of passing the input u_t​ through an FFN to get expert probabilities, we compute the similarity between u_t​ and each e_i​ via a dot product:
Router computations Part 1: Token-to-expert affinity (s) computation
  • This score determines how relevant an expert is for the given input.
  • Only the Top-K experts with the highest s_{i,t}​ values are activated for processing.
Router computations Part 2: Top-K selection
  • A bias term is added to the sigmoid outputs in order to achieve auxiliary-loss-free MoE load balancing.

This snippet from the DeepSeek-V3 paper further clarifies its purpose and how it is computed during training:

  • The output values are normalized using the selected top-k values.
Router computations Part 3: Normalization
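A minimal sketch of these router computations, with 16 routed experts instead of 256 for readability: sigmoid affinities against the centroids e_i, bias-adjusted Top-K selection, and normalization of the selected scores. Note that, as described in the DeepSeek-V3 report, the bias only influences which experts are selected; the gating weights themselves use the raw sigmoid scores.

```python
import torch

num_experts, hidden_dim, top_k = 16, 2000, 8   # toy sizes (real model: 256 routed experts)

u_t = torch.randn(hidden_dim)                  # one token's hidden state
e = torch.randn(num_experts, hidden_dim)       # learned expert centroids e_i
b = torch.zeros(num_experts)                   # load-balancing bias, adjusted during training

s = torch.sigmoid(e @ u_t)                     # token-to-expert affinity s_{i,t}
topk_idx = torch.topk(s + b, top_k).indices    # bias affects selection only...

g = torch.zeros(num_experts)
g[topk_idx] = s[topk_idx] / s[topk_idx].sum()  # ...gating weights use the raw s, normalized
print(topk_idx, g[topk_idx])
```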

Expert computations:

Router computations Part 4: Expert computations

u_t is the input to the MoE layer. The second term in the equation represents the input being processed by the shared experts. Each expert is a Feed-Forward Network, hence the ‘FFN’ notation. In DeepSeek-R1 there is only 1 shared expert, so Ns=1. Similarly, the third term represents the input being processed by the routed experts. DeepSeek-R1 has 256 routed experts in total (Nr=256), but only 8 are active per token: the gating weight g_{i,t} from equation 13 is zero for every non-selected expert, so only the 8 active experts contribute to the sum.

Output computations:

Router computations Part 5: MoE Output computations

h_t represents the output of the MoE layer, and u_t is its input. The expert outputs are added to the input u_t (a residual connection), giving the output of the MoE layer.
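Putting it together, here is a compact sketch of the full MoE output: the residual input, plus the shared expert, plus the gated sum over the selected routed experts. The expert sizes are toy values, and the gating weights and indices are stand-ins for the router output sketched above.

```python
import torch
import torch.nn as nn

hidden, ffn_dim, n_routed = 64, 128, 16             # toy sizes (real model: 256 routed experts)

def make_ffn():
    return nn.Sequential(nn.Linear(hidden, ffn_dim), nn.SiLU(), nn.Linear(ffn_dim, hidden))

shared_expert = make_ffn()                          # N_s = 1 shared expert
routed_experts = nn.ModuleList(make_ffn() for _ in range(n_routed))

u_t = torch.randn(hidden)                           # input to the MoE layer
idx = torch.tensor([0, 3, 5, 7, 8, 9, 12, 15])      # 8 experts picked by the router
g = torch.softmax(torch.randn(len(idx)), dim=0)     # stand-in for the normalized gates

h_t = u_t + shared_expert(u_t)                      # residual + always-on shared expert
for w, i in zip(g, idx):                            # only the 8 active routed experts run
    h_t = h_t + w * routed_experts[int(i)](u_t)
print(h_t.shape)                                    # torch.Size([64])
```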

🔢 7. Multi Token Prediction (MTP)

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

What is multi-token prediction?

Multi-token prediction is an advanced approach in language modeling where, instead of predicting the next word in a sequence one at a time, the model forecasts multiple future tokens simultaneously. This method enhances learning efficiency and accelerates text generation by enabling the model to anticipate several upcoming words in parallel.

Left: Meta’s MTP model, Right: DeepSeek’s MTP prediction heads

Meta introduced a multi-token prediction architecture that trains language models to predict several future tokens concurrently, resulting in higher sample efficiency and faster inference. Building upon this concept, DeepSeek-V3 incorporated a Multi-Token Prediction (MTP) objective, allowing the model to predict multiple tokens at once. This approach densifies training signals and enables better pre-planning of token representations, boosting performance on complex benchmarks.

Two critical differences between DeepSeek-V3/R1’s and Meta’s multi-token prediction:

“Different from Gloeckle et al. (2024) [Meta Research], which parallelly predicts 𝐷 additional tokens using independent output heads, we sequentially predict additional tokens and keep the complete causal chain at each prediction depth.” — DeepSeek-V3

  1. Meta’s model predicts 4 tokens, while DeepSeek-V3 predicts 2 tokens.
  2. The prediction heads in Meta’s model are independent, while DeepSeek-V3’s prediction heads are sequentially connected.

How does MTP work in DeepSeek-R1?

Let’s go through the diagram step by step.

During training, the input tokens (located in the bottom-left corner) pass through the embedding layer and then propagate through all the transformer blocks/layers.

The first prediction head (which includes the output head) is directly connected to the final transformer layer of the main model. The output head is typically a feed-forward network (FFN) with output dimension matching the model’s vocabulary size. This head is responsible for predicting the next token in sequence. Given input tokens t₁, t₂, t₃, t₄, it predicts t₂, t₃, t₄, t₅. However, during inference, only the final token t₅ is computed.

The second prediction head extends this approach by adding additional learnable layers. It takes the output from the main model’s final transformer layer, applies RMSNorm for normalization, and concatenates it with the input embeddings. These input embeddings are obtained from the same embedding layer used in the main model. Unlike the first prediction head, this head processes input tokens starting from t₂ instead of t₁. The concatenated output is then projected to a suitable embedding size using a linear projection layer, followed by a learnable transformer block/layer for further processing. During training, this head predicts t₃ to t₆, but in inference, only t₆ is computed.

Similarly, the third prediction head takes input from the transformer block/layer of the second prediction head along with the corresponding input embeddings, now starting from t₃ to t₆. It follows the same structure as the previous heads, predicting t₄ to t₇ during training, but computing only t₇ during inference.
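A hedged sketch of one such MTP module, following the description above: normalize the previous depth’s hidden states and the shifted token embeddings with RMSNorm, concatenate them, project back to the model width with a linear layer, and run one transformer block before the shared output head. The transformer block here is PyTorch’s TransformerEncoderLayer used as a stand-in, and all sizes are assumed toy values; the real module uses the model’s own causal transformer block.

```python
import torch
import torch.nn as nn

hidden, vocab = 64, 1000                          # toy sizes

class MTPModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm_prev = nn.RMSNorm(hidden)       # nn.RMSNorm needs a recent PyTorch (>= 2.4)
        self.norm_emb = nn.RMSNorm(hidden)
        self.proj = nn.Linear(2 * hidden, hidden) # project concatenation back to model width
        # Stand-in transformer block; the real module uses the model's own causal block.
        self.block = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)

    def forward(self, prev_hidden, shifted_emb):
        x = torch.cat([self.norm_prev(prev_hidden), self.norm_emb(shifted_emb)], dim=-1)
        return self.block(self.proj(x))           # hidden states for this prediction depth

embed = nn.Embedding(vocab, hidden)               # embedding layer shared with the main model
out_head = nn.Linear(hidden, vocab)               # output head, also shared in DeepSeek-V3

prev_hidden = torch.randn(1, 8, hidden)           # from the main model's final layer
shifted_tokens = torch.randint(0, vocab, (1, 8))  # input tokens shifted by one position
logits = out_head(MTPModule()(prev_hidden, embed(shifted_tokens)))
print(logits.shape)                               # torch.Size([1, 8, 1000])
```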

Each prediction head computes a loss using cross-entropy. These losses are then weighted by a factor λ, and their average is taken as the final loss value.

Individual prediction head loss function
Final loss function for MTP
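And a small sketch of how the MTP losses could be combined, following the description above: each head gets a cross-entropy loss against its own shifted targets, the per-head losses are averaged, and the average is scaled by λ. The depth of 2 extra heads and λ = 0.3 are illustrative assumptions here.

```python
import torch
import torch.nn.functional as F

vocab, seq, depth, lam = 1000, 8, 2, 0.3          # depth = number of MTP heads, lam assumed

tokens = torch.randint(0, vocab, (seq + depth + 1,))           # t1 ... beyond the input window
head_logits = [torch.randn(seq, vocab) for _ in range(depth)]  # stand-in MTP head outputs

mtp_losses = []
for k, logits in enumerate(head_logits, start=1):
    # Head k predicts tokens shifted k+1 positions past each input token.
    targets = tokens[1 + k : 1 + k + seq]
    mtp_losses.append(F.cross_entropy(logits, targets))

loss_mtp = lam * torch.stack(mtp_losses).mean()   # lambda * average over the MTP heads
print(loss_mtp.item())
```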

In DeepSeek-V3 and R1, MTP is used only during training and not during inference:

“MTP in Inference: Our MTP strategy mainly aims to improve the performance of the main model, so during inference, we can directly discard the MTP modules and the main model can function independently and normally.” — DeepSeek-V3

Note: It is not clear to me how MTP is used with reinforcement learning in DeepSeek-R1. It is clear how MTP is used in pretraining and SFT; I am assuming they don’t use MTP in the RL stage.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

References:

YouTube Video: https://www.youtube.com/watch?v=QdEuh2UVbu0
