DeepSeek-R1 Paper Explained — A New Reinforcement Learning Era for LLMs?

“A haunting fusion of childlike innocence and robotic technology, reflecting the uncanny side of AI’s rapid evolution.”


Over the past few years, the field of artificial intelligence has progressed at a remarkable pace, driven in part by Large Language Models (LLMs) that are inching closer to artificial general intelligence (AGI). One standout model, OpenAI’s o1, introduced inference-time scaling methods that significantly boosted its reasoning capabilities, although it remains closed-source.

In response, the team at DeepSeek has unveiled groundbreaking research on DeepSeek-R1, detailed in the paper titled “DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning.” This publication introduces an open-source reasoning model and a comprehensive guide for using large-scale reinforcement learning to enhance LLMs.


A Recap of How LLMs Are Typically Trained

LLMs generally progress through three stages of training:


  1. Pre-training
    Models first learn on vast text and code datasets, acquiring broad knowledge. By predicting the next token in a sequence, an LLM can handle basic tasks such as completing the phrase “write a bedtime _.” However, these pre-trained models often struggle to follow human instructions directly.
  2. Supervised Fine-tuning (SFT)
    Next, the model is fine-tuned on an instruction dataset. Each sample includes an instruction and a corresponding response, which the model uses as a labeled target. After this phase, the LLM becomes much better at handling explicit instructions.
  3. Reinforcement Learning
    Finally, LLMs receive feedback for further refinement, often via Reinforcement Learning from Human Feedback (RLHF), which relies on high-quality human annotations. Because human feedback for complex tasks can be expensive and time-consuming, another approach, Reinforcement Learning from AI Feedback (RLAIF), may be employed. RLAIF requires a robust feedback model to ensure accuracy and reliability.
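
To make these stages concrete, here is a minimal PyTorch sketch of the next-token-prediction loss that underlies both pre-training and SFT. The tensor sizes, random logits, and response mask below are toy placeholders for illustration, not anything from DeepSeek’s actual training code.

```python
# Toy sketch of the next-token-prediction loss used in pre-training and SFT.
# Model outputs are faked with random logits; in SFT the loss is typically
# restricted to the response tokens via a mask (assumed layout below).
import torch
import torch.nn.functional as F

vocab_size = 100
tokens = torch.randint(0, vocab_size, (1, 8))   # [instruction tokens ... response tokens]
logits = torch.randn(1, 8, vocab_size)          # stand-in for model(tokens)

# Shift so that position t predicts token t+1.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = tokens[:, 1:].reshape(-1)

# Pretend the last four tokens are the response; only they contribute to the SFT loss.
response_mask = torch.tensor([0, 0, 0, 1, 1, 1, 1], dtype=torch.bool)
loss = F.cross_entropy(pred[response_mask], target[response_mask])
print(loss.item())
```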

Introducing the DeepSeek-R1-Zero Model

The DeepSeek-R1 paper takes an unusual step by removing or significantly reducing the supervised fine-tuning stage. Specifically, the model called DeepSeek-R1-Zero begins with DeepSeek-V3-Base, a 671-billion-parameter pretrained model, and skips SFT entirely. Large-scale reinforcement learning is then applied without relying on conventional human or AI-based feedback; instead, a rule-based reinforcement learning strategy is used.


Rule-Based Reinforcement Learning with GRPO

The authors employ Group Relative Policy Optimization (GRPO), an in-house DeepSeek technique. Given an input, the model generates multiple candidate outputs (each containing both reasoning and an answer). Predefined rules then determine a reward for each output, guiding the model to favor higher-reward sequences.

  1. Accuracy Rules
    For math problems with concrete answers, correctness can be checked directly. For coding tasks with test cases, compilers provide feedback. These rules enable automatic validation of the model’s final answers.
  2. Format Rules
    The DeepSeek-R1-Zero model is instructed to produce reasoning within <think> tags and the final answer within <answer> tags. Format rewards ensure strict compliance with this structure.
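
As a rough illustration of how such rules can be automated, here is a minimal Python sketch of a format reward and an exact-match accuracy reward. The <think>/<answer> tag names follow the paper, but the regular expressions, reward values, and string-matching logic are my own simplifications; the real rules (for instance, compiler-based checks for coding tasks) are richer.

```python
# Minimal sketch of rule-based rewards: a format check on <think>/<answer> tags
# plus an exact-match accuracy check. Reward values and matching logic are
# simplified assumptions, not DeepSeek's actual implementation.
import re

FORMAT_RE = re.compile(r"^<think>.+</think>\s*<answer>.+</answer>$", re.DOTALL)

def format_reward(output: str) -> float:
    # 1.0 if the output follows the required <think>...</think><answer>...</answer> layout.
    return 1.0 if FORMAT_RE.match(output.strip()) else 0.0

def accuracy_reward(output: str, gold_answer: str) -> float:
    # 1.0 if the extracted answer matches the reference answer exactly.
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == gold_answer.strip() else 0.0

output = "<think>2 + 2 is 4 because ...</think><answer>4</answer>"
print(format_reward(output), accuracy_reward(output, "4"))   # 1.0 1.0
```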

Because no separate neural network is used to generate rewards, large-scale training becomes more cost-effective. The approach also helps prevent “reward hacking,” where a model exploits loopholes to inflate its score without truly achieving the intended goals.
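
For intuition, here is a minimal sketch of the group-relative advantage at the core of GRPO: sample a group of outputs for one prompt, score them with the rule-based rewards, and normalize each reward against the group’s mean and standard deviation. The reward values are made up, and the rest of the GRPO objective (the clipped policy ratio and the KL penalty against a reference model) is omitted.

```python
# Group-relative advantages: outputs scored above the group average are reinforced,
# outputs below it are pushed down. Rewards here are illustrative placeholders.
import torch

rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0])  # G = 8 sampled outputs for one prompt
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)
```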


DeepSeek-R1-Zero: Performance Highlights

Comparisons with OpenAI’s o1 on various reasoning benchmarks show that DeepSeek-R1-Zero can match or even exceed o1 in some cases. On the AIME dataset, for instance, the model’s pass@1 score soared from 15.6% to 71.0% during training—very close to o1’s performance level.
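
For context, a pass@1 figure like the 71.0% above is typically estimated by sampling several responses per problem and averaging the per-sample correctness; the toy calculation below uses illustrative numbers, not the paper’s actual evaluation data.

```python
# Estimate pass@1 by averaging per-sample correctness over k samples per problem.
# The correctness values here are made up for illustration.
results = {                      # problem id -> 0/1 correctness of each of k sampled answers
    "p1": [1, 1, 0, 1],
    "p2": [0, 0, 1, 0],
}
pass_at_1 = sum(sum(v) / len(v) for v in results.values()) / len(results)
print(f"pass@1 = {pass_at_1:.2%}")   # 50.00%
```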

Self-Evolution in Reasoning

One revealing chart in the paper tracks the self-evolution of DeepSeek-R1-Zero: as training progresses, the model organically lengthens its thought process for complex queries. This behavior emerges purely from reinforcement learning, indicating that the model learns to allocate more “thinking” steps as tasks grow more difficult.

The ‘Aha Moment’

A particularly intriguing aspect is the “Aha moment.” For a challenging math problem, DeepSeek-R1-Zero may start with an initial line of reasoning, then pause to reexamine its steps, make necessary corrections, and finalize a more accurate solution. This habit of self-correction develops naturally under reinforcement learning.


Why Introduce DeepSeek-R1?

Although DeepSeek-R1-Zero delivers impressive accuracy, two key issues prompted the creation of a second model:

  1. Readability
    DeepSeek-R1-Zero’s outputs can be hard to read, with cluttered or verbose text.
  2. Language Consistency
    The model occasionally mixes multiple languages in a single response.

Ablation studies show that forcing the model to stick to just one language slightly reduces performance. Interestingly, DeepSeek-R1-Zero seems to handle certain tasks better by blending languages—an approach not typically seen in human problem-solving.
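
To curb the mixing, the paper introduces a language-consistency reward during DeepSeek-R1’s reinforcement learning, roughly the proportion of target-language words in the chain of thought. The sketch below uses a crude ASCII heuristic as a stand-in for real language detection; it illustrates the idea rather than reproducing the paper’s implementation.

```python
# Toy language-consistency reward: the fraction of words in the chain of thought
# that look like the target language (here, plain ASCII English). The ASCII check
# is a simplification standing in for proper language identification.
def language_consistency_reward(cot: str) -> float:
    words = cot.split()
    if not words:
        return 0.0
    target_language = sum(1 for w in words if w.isascii())
    return target_language / len(words)

print(language_consistency_reward("First, 先 compute the derivative of x squared"))  # 0.875
```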


The Full Training Pipeline of DeepSeek-R1

The complete paper details a four-phase training pipeline that refines both the performance and the readability of DeepSeek-R1-Zero: (1) a cold-start supervised fine-tune of DeepSeek-V3-Base on a small set of curated long chain-of-thought examples; (2) large-scale reasoning-oriented reinforcement learning, augmented with the language-consistency reward; (3) a second supervised fine-tune on new reasoning data gathered via rejection sampling from the RL checkpoint, mixed with general-purpose data; and (4) a final reinforcement learning stage that also optimizes for helpfulness and harmlessness. By incrementally addressing formatting, language consistency, and interpretability, the final DeepSeek-R1 emerges as a more user-friendly yet equally powerful LLM.


Impressive Results of DeepSeek-R1

In direct comparisons against OpenAI’s o1, DeepSeek-R1 matches it on many reasoning benchmarks and surpasses it on several. The fact that DeepSeek-R1’s weights are publicly available further underscores its significance as an open-source alternative to proprietary models.


Conclusion

DeepSeek-R1 and DeepSeek-R1-Zero represent a leap forward in training LLMs via large-scale reinforcement learning—particularly by either reducing or removing the supervised fine-tuning stage. Through rule-based reinforcement signals, these models exhibit strong reasoning capabilities, the ability to self-correct, and, in DeepSeek-R1’s case, more polished, readable outputs. As AI research continues to evolve, these advancements illustrate how open-source innovation can stand toe-to-toe with closed-source giants in the race toward ever more capable language models.


