Qwen3 技术报告的 Strong-to-Weak Distillation 强到弱蒸馏和代码实现

flyfish

代码在文末

技术报告就是不一定经过严格的学术期刊同行评审，但具有较强的专业性和实用性。
在这里插入图片描述

The post-training pipeline of Qwen3 is strategically designed with two core objectives:
(1) Thinking Control: This involves the integration of two distinct modes, namely the “non-thinking”
and “thinking” modes, providing users with the flexibility to choose whether the model should
engage in reasoning or not, and to control the depth of thinking by specifying a token budget for
the thinking process.
(2) Strong-to-Weak Distillation: This aims to streamline and optimize the post-training process
for lightweight models. By leveraging the knowledge from large-scale models, we substantially
reduce both the computational costs and the development efforts required for building smaller-
scale models.

post-training 后训练

post-training强调的是在模型完成预训练（pre-training）之后、正式部署或应用之前进行的一系列针对性训练、优化步骤（比如 “思考控制”“强到弱蒸馏” 等）

“后训练” 体现这一阶段与 “预训练” 的承接关系 —— 它属于模型训练生命周期中的一个特定阶段（预训练→后训练→部署）

后训练包括但不限于：微调（fine-tuning）、对齐（alignment）符合人类价值观或指令、蒸馏（distillation）、能力增强等。

Qwen3 的后训练流程

Qwen3 的后训练流程经过精心设计，围绕两个核心目标展开：

（1）思考控制：整合 “非思考” 和 “思考” 两种不同模式，为用户提供灵活选择的空间 —— 既可以决定模型是否进行推理，也能通过指定思考过程的 token 预算来控制思考深度。

（2）强到弱蒸馏：旨在简化和优化轻量级模型的训练后流程。借助大规模模型的知识，大幅降低构建小规模模型所需的计算成本和开发精力。

通过让小模型直接学习大模型输出的 logits，既能提升小模型性能，又能保留对其推理过程的精准调控，同时不用给每个小模型重复走复杂的四阶段训练流程，效率更高。

Strong-to-Weak Distillation 强到弱蒸馏

Strong-to-Weak Distillation
The Strong-to-Weak Distillation pipeline is specifically designed to optimize lightweight models, encompassing 5 dense models (Qwen3-0.6B, 1.7B, 4B, 8B, and 14B) and one MoE model (Qwen3-30B-A3B). This approach enhances model performance while effectively imparting robust mode-switching capabilities.The distillation process is divided into two primary phases:
(1) Off-policy Distillation: At this initial phase, we combine the outputs of teacher models generated
with both /think and /no think modes for response distillation. This helps lightweight student
models develop basic reasoning skills and the ability to switch between different modes of
thinking, laying a solid foundation for the next on-policy training phase.(2) On-policy Distillation: In this phase, the student model generates on-policy sequences for
fine-tuning. Specifically, prompts are sampled, and the student model produces responses in
either /think or /no think mode. The student model is then fine-tuned by aligning its logits
with those of a teacher model (Qwen3-32B or Qwen3-235B-A22B) to minimize the KL divergence.

Strong-to-Weak Distillation（强到弱蒸馏）是一种知识迁移策略，其含义是：利用大规模高性能模型（强模型，即教师模型）的知识，通过系统性方法优化轻量级模型（弱模型，即学生模型）的训练流程

强到弱蒸馏流程专为优化轻量级模型而设计，涵盖 5 个密集型模型（Qwen3-0.6B、1.7B、4B、8B 和 14B）以及 1 个混合专家模型（Qwen3-30B-A3B）。该方法在提升模型性能的同时，能有效赋予其强大的模式切换能力。

蒸馏过程分为两个主要阶段：
（1）离线策略蒸馏：在这一初始阶段，我们结合教师模型在 /think 模式和 /no think 模式下生成的输出进行响应蒸馏。这有助于轻量级学生模型培养基本推理能力以及在不同思考模式间切换的能力，为下一阶段的在线策略训练奠定坚实基础。

（2）在线策略蒸馏：在这一阶段，学生模型生成在线策略序列以进行微调。具体而言，先对提示词进行抽样，再让学生模型以 /think 模式或 /no think 模式生成响应。随后，通过将学生模型的 logits 与教师模型（Qwen3-32B 或 Qwen3-235B-A22B）的 logits 对齐，以最小化 KL 散度，完成对学生模型的微调。