DeepSeek-R1 and V3 - Advancing AI Reasoning with Reinforcement Learning


Paper: https://arxiv.org/abs/2501.12948

1. Introduction to the Model(s) - High level overviews

1.1 Evolution of Reasoning Models


The landscape of model training is undergoing a significant transformation, and the development of reasoning models has reached a pivotal point. Initially spearheaded by OpenAI’s o1 series, the field has seen rapid advancements as new models emerge.

DeepSeek’s latest release rivals OpenAI’s offerings and is fully open-sourced under an MIT license. Despite the limited detail in the publication, the model’s performance is noteworthy and suggests considerable room for further improvement.

1.2 Objective and Methodology

DeepSeek’s primary goal is to develop reasoning capabilities in Large Language Models (LLMs) without relying on supervised fine-tuning data, using a pure Reinforcement Learning (RL) process. With RL alone, they found that models can reach high performance through self-evolution and reflection rather than imitation of labeled examples.

1.3 Training Approach

The training methodology for DeepSeek’s models is innovative and multi-staged:

  1. Post-Training with Group Relative Policy Optimization (GRPO): The base model, DeepSeek-V3, undergoes pure RL post-training with GRPO, yielding DeepSeek-R1-Zero and its emergent reasoning capabilities.
  2. Four-Stage Training for R1:
    • Cold Start: Initial phase to stabilize the model.
    • RL for Reasoning: Focused training to enhance reasoning skills.
    • Rejection Sampling: Applied for general-purpose refinement.
    • RL Polishing: Final stage to fine-tune the model’s capabilities.

1.4 Model Variants

DeepSeek introduces two primary models:

  • DeepSeek-R1-Zero: trained with pure RL on top of the V3 base model, with no supervised fine-tuning data.
  • DeepSeek-R1: built with a multi-stage pipeline (cold start SFT, reasoning RL, rejection sampling, and a final RL pass) to fix R1-Zero’s readability issues and serve as a general-purpose chat model.

1.5 Distillation and Performance

Outputs from R1 are distilled into Qwen and Llama models, which then show strong reasoning performance even though no RL is applied to them directly. For these smaller models, distillation from R1 proves more effective than running RL on them from scratch, offering a far cheaper path to comparable capability.

1.6 Open Source and Accessibility

DeepSeek’s models are fully open-sourced under the MIT license, though the training data and training code remain undisclosed. The models are also served through DeepSeek’s own API, which is claimed to be 3 to 10 times faster and cheaper than comparable providers, albeit with data-privacy considerations.

2. DeepSeek-V3 Overview


2.1 Introduction

DeepSeek-V3 represents a significant advancement in the field of large language models (LLMs). As a foundational model, it serves as the base for subsequent reasoning models, setting a new standard in terms of efficiency, scalability, and performance. This section provides a detailed overview of DeepSeek-V3, highlighting its architecture, training methodology, and key features.

2.2 Model Architecture

DeepSeek-V3 is a large-scale Mixture-of-Experts (MoE) language model with 671 billion total parameters, of which roughly 37 billion are active for any given token at inference. This sparse design lets the model retain the capacity of a very large network while keeping per-token compute close to that of a 37-billion-parameter dense model, significantly reducing computational demands without sacrificing capability.
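DeepSeek-V3’s actual routing scheme (DeepSeekMoE with shared and routed experts) is described in its technical report; as a rough illustration of how a mixture-of-experts layer activates only a fraction of its parameters per token, here is a minimal top-k router in PyTorch. All names and sizes are illustrative, not the model’s configuration:

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative only).

    Each token is routed to `top_k` of `num_experts` small FFNs, so only a
    fraction of the layer's total parameters participate in any forward pass --
    the same principle that lets a 671B-parameter model run with ~37B active.
    """

    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)  # produces routing logits per token
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.router(x).softmax(dim=-1)          # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)   # pick top-k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens assigned to expert e at slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

# Usage: route a batch of 10 token embeddings through the sparse layer.
layer = TinyMoELayer()
y = layer(torch.randn(10, 64))
print(y.shape)  # torch.Size([10, 64])
```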

2.3 Training Methodology

The training of DeepSeek-V3 involved several strategic steps to enhance its performance and efficiency:

  1. Base Model Pre-Training: The model was pre-trained on approximately 15 trillion tokens, establishing its foundational capabilities.
  2. Context Length Extension: Two stages of context-length extension followed, first to 32k and then to 128k tokens, allowing the model to handle long sequences effectively.
  3. Post-Training: Standard supervised fine-tuning (SFT) and reinforcement learning (RL) were then applied to align the model for general use.

2.4 Key Features

DeepSeek-V3 introduces several innovative features that set it apart from other models:

  • Multi-head Latent Attention (MLA), which compresses the attention key-value cache for efficient inference.
  • DeepSeekMoE with an auxiliary-loss-free load-balancing strategy for its experts.
  • A multi-token prediction training objective.
  • FP8 mixed-precision training for reduced memory and compute cost.

2.5 Cost and Efficiency

One of the most notable claims about DeepSeek-V3 is its cost-effectiveness. The final training run reportedly cost about $5.5 million in GPU time, a figure that has generated considerable attention and discussion within the AI community. This efficiency is attributed to several factors, including FP8 mixed-precision training and the auxiliary-loss-free load-balancing strategy, which avoids the extra loss term that MoE models typically need to keep their experts balanced.
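The exact FP8 recipe is specific to DeepSeek’s infrastructure and is not reproduced here; as a generic sketch of what mixed-precision training means in practice, a standard PyTorch autocast loop looks roughly like the following (fp16 rather than FP8, and the model, data, and optimizer are placeholders assuming a CUDA device):

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # rescales gradients to avoid fp16 underflow

for step in range(100):
    x = torch.randn(32, 1024, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()      # forward pass runs in reduced precision
    scaler.scale(loss).backward()          # backward pass with scaled loss
    scaler.step(opt)                       # unscale gradients, then optimizer step
    scaler.update()
    opt.zero_grad(set_to_none=True)
```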

2.6 Comparison and Impact

DeepSeek-V3 is often compared to models like GPT-4o and Mistral, positioning itself as a competitive alternative in the LLM landscape. Its ability to deliver high performance with only a small fraction of its parameters active makes it a desirable choice for applications where speed and cost are critical considerations.

3. DeepSeek-R1-Zero: From Scratch to Emergence

3.1 Overview

DeepSeek-R1-Zero utilizes pure reinforcement learning (RL) without any supervised fine-tuning (SFT) data. This section explores the innovative methodologies and emergent capabilities of DeepSeek-R1-Zero, highlighting its unique training processes and performance metrics.

3.2 Training Methodology


3.2.1 Pure Reinforcement Learning

DeepSeek-R1-Zero applies pure RL directly to the V3 base model, bypassing the traditional SFT stage. This approach uses the GRPO (Group Relative Policy Optimization) algorithm, introduced in the DeepSeekMath paper, which drops the separate value network used by PPO and instead estimates advantages from groups of sampled completions.
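Concretely, GRPO samples a group of completions per prompt, normalizes each completion’s reward against the group’s mean and standard deviation to get its advantage, and updates the policy with a clipped surrogate objective (plus a KL penalty toward a reference model, omitted here). Below is a minimal sketch of those two core pieces with illustrative tensors only, not the full training loop:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize each completion's reward
    against the mean/std of the G completions sampled for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate, applied per completion (no value network)."""
    ratio = (logp_new - logp_old).exp()                  # importance ratio pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()         # negate to maximize the surrogate

# Toy usage: a group of G=4 completions for one prompt.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])             # e.g. rule-based correctness rewards
adv = grpo_advantages(rewards)
logp_old = torch.tensor([-12.0, -15.0, -14.0, -11.0])    # sequence log-probs under the old policy
logp_new = logp_old + torch.randn(4) * 0.1               # stand-in for the current policy
print(grpo_surrogate_loss(logp_new, logp_old, adv))
```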


3.2.2 Training Template

The training process uses a simple conversational template in which the assistant first reasons through the problem inside <think> tags and then gives its final answer inside <answer> tags. This structure elicits explicit, inspectable reasoning without prescribing any particular problem-solving strategy.
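The paper’s template wraps each training prompt roughly as follows; this is a close paraphrase rather than an exact quote, rendered here as a Python string with {question} as a placeholder for the user’s prompt:

```python
# Approximate R1-Zero training template (paraphrased from the paper's Table 1).
TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, and the "
    "Assistant solves it. The assistant first thinks about the reasoning process in "
    "the mind and then provides the user with the answer. The reasoning process and "
    "answer are enclosed within <think> </think> and <answer> </answer> tags, "
    "respectively, i.e., <think> reasoning process here </think> "
    "<answer> answer here </answer>. User: {question}. Assistant:"
)

print(TEMPLATE.format(question="What is 7 * 12?"))
```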

3.3 Performance and Evaluation

3.3.1 Benchmark Comparisons

DeepSeek-R1-Zero exhibits exceptional reasoning capabilities without relying on labeled SFT data. It performs well across math and coding benchmarks, reaching performance comparable to OpenAI’s o1 on several of them and surpassing it on some when majority voting is used.


3.3.2 Emergent Capabilities

As training progresses, DeepSeek-R1-Zero demonstrates emergent behaviors, such as:

  • Increasingly long chains of thought, with more “thinking tokens” allocated to harder problems.
  • Reflection, where the model revisits and re-evaluates its earlier steps.
  • Self-verification and exploration of alternative approaches, including the much-discussed “aha moment” in which the model interrupts itself to rework a solution.


3.4 Inference and Efficiency

The paper’s training curves show that response length and evaluation performance rise together: as the model undergoes more RL steps, its reasoning traces grow longer and deeper, improving its ability to tackle complex tasks at the cost of additional inference time.


3.5 Emergent Behaviors

DeepSeek-R1-Zero naturally evolves its problem-solving abilities by extending its test-time compute, generating anywhere from hundreds to thousands of reasoning tokens per problem, and increasingly sophisticated behaviors emerge as that compute grows.

DeepSeek-R1-Zero exemplifies the potential of RL to unlock new levels of intelligence in artificial systems. By providing the right incentives, the model autonomously develops advanced problem-solving strategies, paving the way for more autonomous and adaptive models in the future.

4. DeepSeek-R1: A Multi-Stage Evolution

4.1 Overview

DeepSeek-R1 addresses key issues present in its predecessor, R1-Zero. This section describes the multi-stage process used to improve readability, language consistency, and overall usability, turning DeepSeek-R1 into a robust reasoning and chat model.

4.2 Key Challenges with R1-Zero

R1-Zero faced several challenges, including:

  • Poor readability of its long reasoning traces.
  • Language mixing, with English and Chinese often interleaved within a single chain of thought.
  • A narrow focus on reasoning benchmarks, leaving it weak as a general-purpose chat model.

These issues were partially attributed to the reinforcement learning (RL) approach used, which lacked specific objectives related to safety, conciseness, and user engagement.

4.3 Development Stages of DeepSeek-R1


4.3.1 Stage 1: Cold Start Training

The initial stage focused on stabilizing the model through strong supervised fine-tuning (SFT):

  • A small set (on the order of thousands) of long chain-of-thought examples was curated, combining few-shot prompting, cleaned R1-Zero outputs, and human post-processing.
  • DeepSeek-V3-Base was fine-tuned on this data, giving the subsequent RL phase a readable, stable starting point.

4.3.2 Stage 2: Reasoning-Based Reinforcement Learning

Building on the cold start, the second stage applied reasoning-focused RL:

  • The same large-scale GRPO training used for R1-Zero was run on math, coding, science, and logic tasks with rule-based rewards.
  • A language-consistency reward was added to suppress the language mixing seen in R1-Zero, at a small cost in raw benchmark performance.

4.3.3 Stage 3: Rejection Sampling

Rejection sampling was introduced to further refine the model:

  • Once the reasoning RL converged, the checkpoint was used to generate many candidate completions, and only correct, readable ones were kept.
  • This reasoning data was combined with general-purpose SFT data (writing, factual QA, and similar tasks) from the V3 pipeline, and the base model was retrained on the combined set of roughly 800k samples.

4.3.4 Stage 4: Final Reinforcement Learning

The final stage aimed to optimize the model for general use:

  • A second RL pass covered prompts from all scenarios, not just reasoning.
  • Rule-based rewards were kept for math and code, while reward models scored helpfulness and harmlessness for open-ended prompts, aligning the model with human preferences.
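Taken together, the four stages can be summarized in the sketch below. Every function here is a trivial placeholder standing in for a large training job; only the ordering and the data handed between stages reflect the paper:

```python
# Hypothetical sketch of the four-stage R1 pipeline (placeholders, not real APIs).
def sft(base, data):               return f"SFT({base}, {len(data)} examples)"
def reasoning_rl(model, rewards):  return f"RL({model}, rewards={rewards})"
def rejection_sample(model, n):    return [f"{model}-sample-{i}" for i in range(n)]
def general_rl(model, rewards):    return f"RL({model}, rewards={rewards})"

# Stage 1: cold start -- SFT on a small curated set of long chain-of-thought examples.
model = sft("DeepSeek-V3-Base", ["cot_example"] * 1000)

# Stage 2: large-scale reasoning RL (GRPO) with rule-based accuracy/format rewards,
# plus a language-consistency reward to curb language mixing.
model = reasoning_rl(model, ["accuracy", "format", "language_consistency"])

# Stage 3: rejection sampling -- filter the RL checkpoint's outputs, mix with
# general-purpose SFT data, and retrain from the base model.
reasoning_data = rejection_sample(model, n=10)
model = sft("DeepSeek-V3-Base", reasoning_data + ["general_sft_example"] * 5)

# Stage 4: a final RL pass over all scenarios, combining reasoning rewards with
# reward models for helpfulness and harmlessness.
model = general_rl(model, ["reasoning", "helpfulness", "harmlessness"])
print(model)
```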

4.4 Emergent Capabilities

DeepSeek-R1 successfully overcame the limitations of its predecessor, achieving:

  • Readable, consistently formatted chain-of-thought output without language mixing.
  • Reasoning performance on par with OpenAI’s o1 series on math and coding benchmarks.
  • Strong general chat behavior, making it usable as an everyday assistant rather than a reasoning-only model.

5. Evaluating Performance & Benchmarks

DeepSeek-R1 has undergone rigorous testing to evaluate its effectiveness as a chat model. Per the paper’s reported results, the key findings are:

  • On reasoning-heavy benchmarks (e.g., AIME, MATH-500, and Codeforces-style coding tasks), R1 performs on par with OpenAI-o1-1217.
  • On knowledge benchmarks such as MMLU and GPQA Diamond, it scores well above DeepSeek-V3, though slightly below o1-1217.
  • On open-ended generation and writing evaluations, the pipeline also improves over V3, showing that reasoning training does not come at the expense of general chat quality.


6. Knowledge Distillation: Enhancing Reasoning in Llama 3.3 & Qwen Models


The distillation process played a crucial role in transferring R1’s reasoning ability to Llama and Qwen models. Here’s how it was implemented:

  • The roughly 800k SFT samples curated with DeepSeek-R1 (reasoning plus general-purpose data) were used as the training set.
  • Open models from the Qwen2.5 and Llama 3 families, ranging from 1.5B to 70B parameters, were fine-tuned directly on this data with plain SFT.
  • No RL was applied to the distilled models, yet they outperform the same models trained with RL from scratch, making distillation the more compute-efficient route.
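Mechanically, this distillation is plain supervised fine-tuning on R1-generated completions rather than logit matching. A minimal sketch using Hugging Face APIs is shown below; the dataset file, student model choice, and hyperparameters are illustrative assumptions, not the paper’s exact setup:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Assumptions: the student is a Qwen2.5 base model, and "r1_distillation_data.jsonl"
# is a local file of R1-generated prompt + reasoning + answer strings in a "text" field.
model_name = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("json", data_files="r1_distillation_data.jsonl")["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=4096),
    batched=True, remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qwen-r1-distill", per_device_train_batch_size=1,
                           num_train_epochs=2, bf16=True),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal-LM labels = inputs
)
trainer.train()
```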

7. Future Directions & Ongoing Research

7.1 Current Limitations

Despite its strengths, DeepSeek-R1 has several areas where it underperforms compared to V3:

  • General capabilities such as function calling, multi-turn dialogue, complex role-playing, and structured JSON output.
  • Language mixing when handling queries in languages other than English and Chinese.
  • Sensitivity to prompting: few-shot prompts tend to degrade performance, so zero-shot prompting is recommended.
  • Software-engineering tasks, where long RL evaluation times limited the amount of task-specific training.

7.2 Future Research Directions

To address these limitations, ongoing research is focusing on:

  • Extending long chain-of-thought reasoning to general capabilities such as function calling, multi-turn dialogue, role-playing, and structured output.
  • Mitigating language mixing for queries outside English and Chinese.
  • Reducing prompt sensitivity so that few-shot prompting no longer hurts performance.
  • Applying RL more extensively to software-engineering tasks, for example via rejection sampling or asynchronous evaluation to speed up reward computation.
