Core Idea and Definitions

The core idea of diffusion models is to learn a denoising process that reverses a fixed noising process.

Forward process (fixed): define a Markov chain that gradually adds Gaussian noise to data $\mathbf{x}_0 \sim q(\mathbf{x}_0)$, producing a sequence of increasingly noisy latent variables $\mathbf{x}_1, \dots, \mathbf{x}_T$. In the end, $\mathbf{x}_T$ is approximately distributed as a standard Gaussian.
$q(\mathbf{x}_{1:T} | \mathbf{x}_0) = \prod_{t=1}^T q(\mathbf{x}_t | \mathbf{x}_{t-1}), \quad \text{where} \quad q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I})$
Here $\{\beta_t\}_{t=1}^T$ is a predefined variance schedule.
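To make the forward chain concrete, here is a minimal NumPy sketch. The linear schedule, its endpoints, and the helper names `linear_beta_schedule` and `forward_step` are illustrative assumptions, not something the text prescribes.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (endpoints are assumed values)."""
    return np.linspace(beta_start, beta_end, T)

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
betas = linear_beta_schedule(T=1000)
x = np.full((10000,), 3.0)   # toy "data": many copies of the scalar 3.0
for beta_t in betas:         # run the whole chain x_0 -> x_T
    x = forward_step(x, beta_t, rng)
print(x.mean(), x.var())     # should be close to 0 and 1, respectively
```

Under this schedule the original value 3.0 is almost entirely washed out after 1000 steps, which is what "approximately a standard Gaussian" means in practice.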
Reverse process (learnable): we want to learn a parameterized reverse Markov chain $p_\theta$ that starts from noise $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and denoises step by step to generate data.
$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t), \quad \text{where} \quad p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mathbf{\mu}_\theta(\mathbf{x}_t, t), \mathbf{\Sigma}_\theta(\mathbf{x}_t, t))$
Our goal is to make $p_\theta(\mathbf{x}_0)$ as close as possible to the true data distribution $q(\mathbf{x}_0)$.
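Sampling from the reverse chain can be sketched as follows. This assumes an isotropic, fixed covariance $\sigma_t^2 \mathbf{I}$ in place of the general $\mathbf{\Sigma}_\theta$, and `toy_mean_fn` is a hypothetical stand-in for a trained $\mathbf{\mu}_\theta$, chosen only so the example runs.

```python
import numpy as np

def sample_reverse_chain(mean_fn, sigmas, shape, rng):
    """Run x_T -> x_0 with p(x_{t-1} | x_t) = N(mean_fn(x_t, t), sigmas[t-1]^2 * I)."""
    T = len(sigmas)
    x = rng.standard_normal(shape)        # x_T ~ N(0, I)
    for t in range(T, 0, -1):             # t = T, T-1, ..., 1
        z = rng.standard_normal(shape)
        x = mean_fn(x, t) + sigmas[t - 1] * z
    return x

def toy_mean_fn(x_t, t):
    """Hypothetical stand-in for mu_theta: pull x_t slightly toward 2.0."""
    return 0.9 * x_t + 0.1 * 2.0

rng = np.random.default_rng(0)
sigmas = np.full(200, 0.01)               # assumed per-step standard deviations
x0 = sample_reverse_chain(toy_mean_fn, sigmas, shape=(5,), rng=rng)
print(x0)                                 # values concentrated near 2.0
```

In a real model, `mean_fn` would be a neural network; the loop structure is the part that mirrors the factorization of $p_\theta(\mathbf{x}_{0:T})$.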
Closed form of the forward process: because a sum of independent Gaussians is again Gaussian, we can sample $\mathbf{x}_t$ at any time step $t$ directly from $\mathbf{x}_0$:
$q(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I})$
where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$. Using the reparameterization trick, this can be written as:
$\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \mathbf{\epsilon}, \quad \text{where} \quad \mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
This formula is essential: it lets us sample a time step $t$ at random and compute the training loss efficiently, without simulating the chain step by step.
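A minimal sketch of the closed-form sampler, assuming a linear schedule for illustration. The name `q_sample` and the 1-indexed convention for `t` are choices made here, not part of the text.

```python
import numpy as np

def q_sample(x0, t, alpha_bars, eps):
    """Closed-form sample of x_t given x_0; t is 1-indexed (int or int array)."""
    a_bar = alpha_bars[np.asarray(t) - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

betas = np.linspace(1e-4, 0.02, 1000)     # assumed schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)           # alpha_bar_t = prod_{i<=t} alpha_i

rng = np.random.default_rng(0)
x0 = np.full(20000, 3.0)                  # toy data: constant 3.0
t = 300
x_t = q_sample(x0, t, alpha_bars, rng.standard_normal(x0.shape))

# Empirical moments should match sqrt(alpha_bar_t) * x0 and 1 - alpha_bar_t.
print(x_t.mean(), np.sqrt(alpha_bars[t - 1]) * 3.0)
print(x_t.var(), 1.0 - alpha_bars[t - 1])
```

One call replaces 300 sequential noising steps, which is why training can pick $t$ at random for every sample.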

Optimization Objective: the Variational Lower Bound (VLB/ELBO)

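Our goal is to maximize the log-likelihood that the model assigns to real data, $\log p_\theta(\mathbf{x}_0)$. Because this quantity is intractable to compute directly, we instead maximize its variational lower bound (VLB), also known as the evidence lower bound (ELBO).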

$\begin{aligned} \log p_\theta(\mathbf{x}_0) &\geq \mathbb{E}_{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \left[ \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} | \mathbf{x}_0)} \right] \\ &= \mathbb{E}_{q} \left[ \log \frac{ p(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) }{ \prod_{t=1}^T q(\mathbf{x}_t | \mathbf{x}_{t-1}) } \right] \\ &\triangleq -L_{\text{VLB}} \end{aligned}$
Therefore, we minimize $L_{\text{VLB}}$.

Expanding $L_{\text{VLB}}$ (using the Markov property and Bayes' theorem) decomposes it into the following terms:

$L_{\text{VLB}} = \mathbb{E}_q [\underbrace{D_{\text{KL}}(q(\mathbf{x}_T | \mathbf{x}_0) \parallel p(\mathbf{x}_T))}_{L_T} + \sum_{t=2}^T \underbrace{D_{\text{KL}}(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t))}_{L_{t-1}} \underbrace{- \log p_\theta(\mathbf{x}_0 | \mathbf{x}_1)}_{L_0} ]$

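$L_T$: measures the difference between the final noise distribution and the prior $\mathcal{N}(\mathbf{0}, \mathbf{I})$. This term has no learnable parameters and is close to 0, so it can be ignored.
$L_0$: the reconstruction term, which measures the difference between the image generated at the final step and the real image. The original DDPM handles this term with a discretized decoder; in practice its influence has been found to be small.
$L_{t-1}$ ($2 \le t \le T$): the most important term. For each denoising step, it measures the KL divergence between the true denoising distribution $q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)$ and the learned denoising distribution $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)$.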
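As a sketch of the step the decomposition skips over: for $t \ge 2$, the Markov property gives $q(\mathbf{x}_t | \mathbf{x}_{t-1}) = q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_0)$, and Bayes' theorem rewrites each forward transition as

$q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \frac{q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \, q(\mathbf{x}_t | \mathbf{x}_0)}{q(\mathbf{x}_{t-1} | \mathbf{x}_0)}$

Substituting this into the denominator of the bound, the ratios $q(\mathbf{x}_t | \mathbf{x}_0) / q(\mathbf{x}_{t-1} | \mathbf{x}_0)$ telescope to $q(\mathbf{x}_T | \mathbf{x}_0) / q(\mathbf{x}_1 | \mathbf{x}_0)$, which gives

$\begin{aligned} L_{\text{VLB}} &= \mathbb{E}_q \left[ -\log p(\mathbf{x}_T) - \sum_{t=2}^T \log \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)}{q(\mathbf{x}_t | \mathbf{x}_{t-1})} - \log \frac{p_\theta(\mathbf{x}_0 | \mathbf{x}_1)}{q(\mathbf{x}_1 | \mathbf{x}_0)} \right] \\ &= \mathbb{E}_q \left[ -\log \frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T | \mathbf{x}_0)} - \sum_{t=2}^T \log \frac{p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)}{q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)} - \log p_\theta(\mathbf{x}_0 | \mathbf{x}_1) \right] \end{aligned}$

Taking the expectation of the first two groups of log-ratios yields the KL terms $L_T$ and $L_{t-1}$, and the last term is the reconstruction term.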

Core Derivation: the True Posterior $q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0)$

Using Bayes' theorem and the Markov property, we can derive this true posterior. It is also Gaussian, which means we can match it with another Gaussian, $p_\theta$.

$\begin{aligned} q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) &= \frac{q(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{x}_0) q(\mathbf{x}_{t-1} | \mathbf{x}_0)}{q(\mathbf{x}_t | \mathbf{x}_0)} \\ &\propto \mathcal{N}(\mathbf{x}_t; \sqrt{\alpha_t} \mathbf{x}_{t-1}, (1 - \alpha_t)\mathbf{I}) \cdot \mathcal{N}(\mathbf{x}_{t-1}; \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0, (1 - \bar{\alpha}_{t-1})\mathbf{I}) \end{aligned}$

Multiplying the Gaussian densities and completing the square gives its mean and variance:

$q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \mathbf{\tilde{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I})$

$\text{where} \quad \mathbf{\tilde{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \mathbf{\epsilon} \right), \quad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$

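Note: here $\mathbf{\epsilon}$ is the noise added to $\mathbf{x}_0$ in the forward process to produce $\mathbf{x}_t$. The mean takes this form after substituting $\mathbf{x}_0 = (\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \mathbf{\epsilon}) / \sqrt{\bar{\alpha}_t}$, obtained by inverting the reparameterized forward process. This expression for $\mathbf{\tilde{\mu}}_t$ is crucial.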
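The following sketch checks the posterior formulas numerically. The form of the mean written in terms of $\mathbf{x}_0$ and $\mathbf{x}_t$ is not given in the text; it is the standard result of completing the square and is included here so the two forms can be compared. The schedule and function names are illustrative assumptions, and $\bar{\alpha}_0 = 1$ is used as a convention.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)     # assumed schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)
# alpha_bar_{t-1}, with the convention alpha_bar_0 = 1
alpha_bars_prev = np.concatenate(([1.0], alpha_bars[:-1]))

# beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t
posterior_var = (1.0 - alpha_bars_prev) / (1.0 - alpha_bars) * betas

def posterior_mean_from_x0(x0, x_t, t):
    """Mean of q(x_{t-1} | x_t, x_0) in terms of x_0 and x_t (t is 1-indexed)."""
    i = t - 1
    coef_x0 = np.sqrt(alpha_bars_prev[i]) * betas[i] / (1.0 - alpha_bars[i])
    coef_xt = np.sqrt(alphas[i]) * (1.0 - alpha_bars_prev[i]) / (1.0 - alpha_bars[i])
    return coef_x0 * x0 + coef_xt * x_t

def posterior_mean_from_eps(x_t, eps, t):
    """Same mean after substituting x_0 = (x_t - sqrt(1 - a_bar) * eps) / sqrt(a_bar)."""
    i = t - 1
    return (x_t - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps) / np.sqrt(alphas[i])

rng = np.random.default_rng(0)
x0, eps, t = rng.standard_normal(4), rng.standard_normal(4), 500
x_t = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps

# The two parameterizations agree up to floating-point error.
print(np.allclose(posterior_mean_from_x0(x0, x_t, t),
                  posterior_mean_from_eps(x_t, eps, t)))
```

Note that $\tilde{\beta}_t \le \beta_t$ for every $t$, and $\tilde{\beta}_1 = 0$ under the $\bar{\alpha}_0 = 1$ convention.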

Simplifying the Loss: from Mean Prediction to Noise Prediction

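Now consider the term $L_{t-1}$ that we want to minimize, which is a KL divergence between two Gaussians. If we fix the covariance of $p_\theta$ to a constant, $\mathbf{\Sigma}_\theta(\mathbf{x}_t, t) = \sigma_t^2 \mathbf{I}$, the KL divergence depends on $\theta$ only through the squared distance between the two means, and everything else collapses into a constant $C$ that does not depend on $\theta$.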

$\begin{aligned} L_{t-1} &= \mathbb{E}_q \left[ D_{\text{KL}}(q(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)) \right] \\ &= \mathbb{E}_q \left[ \frac{1}{2\sigma_t^2} \| \mathbf{\tilde{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \mathbf{\mu}_\theta(\mathbf{x}_t, t) \|^2 \right] + C \end{aligned}$
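The reduction to a squared distance between means can be checked numerically with the closed-form KL divergence between diagonal Gaussians. In this sketch the variances 0.3 and 0.5 are arbitrary placeholders for $\tilde{\beta}_t$ and $\sigma_t^2$, and `kl_diag_gaussians` is a helper written for this example.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dimensions."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

rng = np.random.default_rng(0)
d = 8
mu_q, mu_p_a, mu_p_b = rng.standard_normal((3, d))
var_q = np.full(d, 0.3)      # plays the role of beta_tilde_t (fixed)
sigma2 = 0.5                 # plays the role of sigma_t^2 (fixed)
var_p = np.full(d, sigma2)

kl_a = kl_diag_gaussians(mu_q, var_q, mu_p_a, var_p)
kl_b = kl_diag_gaussians(mu_q, var_q, mu_p_b, var_p)
mse_a = np.sum((mu_q - mu_p_a) ** 2) / (2.0 * sigma2)
mse_b = np.sum((mu_q - mu_p_b) ** 2) / (2.0 * sigma2)

# The constant C is the same for both candidate means, so it cancels:
print(np.isclose(kl_a - kl_b, mse_a - mse_b))
# C depends only on the two fixed variances:
C = 0.5 * d * (np.log(sigma2 / 0.3) + 0.3 / sigma2 - 1.0)
print(np.isclose(kl_a - mse_a, C))
```

Because $C$ does not change with the predicted mean, minimizing the KL divergence over $\theta$ is equivalent to minimizing the scaled squared error.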

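We now have two options:

1. Have the network $\mathbf{\mu}_\theta$ predict the mean $\mathbf{\tilde{\mu}}_t$ directly.
2. Reparameterize the model according to the expression for $\mathbf{\tilde{\mu}}_t$.

DDPM chose the second option because it was found to work better empirically. Substituting the expression for $\mathbf{\tilde{\mu}}_t$ gives: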

$\mathbf{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \mathbf{\epsilon}_\theta(\mathbf{x}_t, t) \right)$

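Here, instead of having the network predict the mean, we have it predict the noise $\mathbf{\epsilon}$, i.e., $\mathbf{\epsilon}_\theta(\mathbf{x}_t, t)$. Substituting this expression into the loss above turns the squared distance between means into a squared distance between noises, scaled by the weight $\frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)}$. Dropping this weight yields the final, remarkably simple loss: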

$L_{\text{simple}} = \mathbb{E}_{\mathbf{x}_0, t, \mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left[ \| \mathbf{\epsilon} - \mathbf{\epsilon}_\theta( \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \mathbf{\epsilon}, t ) \|^2 \right]$

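The intuition behind this loss is as follows: take a real image $\mathbf{x}_0$, pick a random time step $t$, sample random noise $\mathbf{\epsilon}$, and construct the noisy image $\mathbf{x}_t$. We then train a network $\mathbf{\epsilon}_\theta$ to predict, from $\mathbf{x}_t$ and $t$, the noise $\mathbf{\epsilon}$ that was added. The loss is simply the mean squared error between the predicted noise and the true noise.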
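A batch estimate of this loss might look like the sketch below. The schedule, the batch layout `(batch, dim)`, and the two placeholder predictors are assumptions for illustration; `eps_model` stands for any callable taking `(x_t, t)`.

```python
import numpy as np

def simple_loss(eps_model, x0, t, eps, alpha_bars):
    """Monte Carlo estimate of L_simple for a batch; t holds 1-indexed time steps."""
    a_bar = alpha_bars[t - 1][:, None]                  # shape (batch, 1)
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    pred = eps_model(x_t, t)
    return np.mean(np.sum((eps - pred) ** 2, axis=1))

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule

rng = np.random.default_rng(0)
batch, dim, T = 64, 2, len(alpha_bars)
x0 = rng.standard_normal((batch, dim))
t = rng.integers(1, T + 1, size=batch)                  # uniform on {1, ..., T}
eps = rng.standard_normal((batch, dim))

def zero_model(x_t, t):
    """Hypothetical untrained predictor that always outputs zero noise."""
    return np.zeros_like(x_t)

def oracle_model(x_t, t):
    """Hypothetical perfect predictor that returns the true noise (for checking only)."""
    return eps

print(simple_loss(zero_model, x0, t, eps, alpha_bars))    # mean of ||eps||^2
print(simple_loss(oracle_model, x0, t, eps, alpha_bars))  # 0.0
```

The two placeholder predictors bracket the range a real network moves through during training: from roughly $\mathbb{E}\|\mathbf{\epsilon}\|^2$ toward, but never reaching, zero.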

Summary: the Optimization Procedure

1. Input: sample a real image $\mathbf{x}_0$ from the training set.
2. Add noise:
- Sample a time step uniformly, $t \sim \text{Uniform}(\{1, \dots, T\})$.
- Sample noise from a standard Gaussian, $\mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
- Compute $\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \mathbf{\epsilon}$.
3. Predict: feed $\mathbf{x}_t$ and $t$ into the neural network $\mathbf{\epsilon}_\theta$ to obtain its noise prediction $\mathbf{\epsilon}_\theta(\mathbf{x}_t, t)$.
4. Optimize: compute the loss $\| \mathbf{\epsilon} - \mathbf{\epsilon}_\theta \|^2$ and update the network parameters $\theta$ by gradient descent.
5. Repeat: repeat steps 1-4 until convergence.
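The whole procedure can be sketched end to end on a toy problem. Everything specific here is an assumption made so the example stays small and checkable: scalar Gaussian data in place of images, a ten-step schedule, and a per-step linear predictor $\mathbf{\epsilon}_\theta(x_t, t) = w_t x_t$ in place of a neural network, trained with hand-written SGD.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10
betas = np.linspace(0.05, 0.5, T)                 # assumed toy schedule
alpha_bars = np.cumprod(1.0 - betas)

data_std = 2.0                                    # toy data: x_0 ~ N(0, data_std^2)
w = np.zeros(T)                                   # parameters theta: one weight per step
lr, batch = 0.05, 256

def eps_model(x_t, t):
    """Hypothetical stand-in for a neural network: eps_theta(x_t, t) = w[t-1] * x_t."""
    return w[t - 1] * x_t

losses = []
for step in range(3000):
    x0 = data_std * rng.standard_normal(batch)    # input: sample data
    t = rng.integers(1, T + 1, size=batch)        # add noise: sample t and eps, build x_t
    eps = rng.standard_normal(batch)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    residual = eps - eps_model(x_t, t)            # predict: compare with the true noise
    losses.append(np.mean(residual ** 2))         # optimize: loss and SGD update
    grad = np.zeros(T)
    np.add.at(grad, t - 1, -2.0 * residual * x_t / batch)
    w -= lr * grad

# For Gaussian data the best linear predictor is known in closed form,
# which gives a reference to compare the learned weights against.
w_opt = np.sqrt(1.0 - alpha_bars) / (alpha_bars * data_std**2 + 1.0 - alpha_bars)
print(np.round(w, 2))
print(np.round(w_opt, 2))
```

With a real network, the manual gradient would be replaced by automatic differentiation and an optimizer, but the order of operations in the loop body stays the same.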