Series Table of Contents

Fundamental Tools

RL【1】：Basic Concepts
RL【2】：Bellman Equation
RL【3】：Bellman Optimality Equation

Algorithm

RL【4】：Value Iteration and Policy Iteration
RL【5】：Monte Carlo Learning
RL【6】：Stochastic Approximation and Stochastic Gradient Descent

Method

RL【7-1】：Temporal-difference Learning
RL【7-2】：Temporal-difference Learning
RL【8】：Value Function Approximation
RL【9】：Policy Gradient
RL【10-1】：Actor - Critic
RL【10-2】：Actor - Critic

Contents

Series Table of Contents
- Fundamental Tools
- Algorithm
- Method
Preface
Off-policy actor-critic
- Importance sampling
- The theorem of off-policy policy gradient
- The algorithm of off-policy actor-critic
Deterministic actor-critic (DPG)
- Introduction
- The theorem of deterministic policy gradient
- The algorithm of deterministic actor-critic
Summary

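Preface

This series collects my study notes on Prof. Shiyu Zhao's (赵世钰) Bilibili course 【强化学习的数学原理】 (Mathematical Foundations of Reinforcement Learning). For the course itself, see:
Bilibili video: 【【强化学习的数学原理】课程：从零开始到透彻理解（完结）】
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning

Off-policy actor-critic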

Importance sampling

Definition

Note that

$\mathbb{E}_{X \sim p_0}[X] = \sum_x p_0(x)x = \sum_x p_1(x) \frac{p_0(x)}{p_1(x)} x = \mathbb{E}_{X \sim p_1}[f(X)], \quad \text{where } f(x) \doteq \frac{p_0(x)}{p_1(x)} x$

Thus, we can estimate $\mathbb{E}_{X \sim p_1}[f(X)]$ in order to estimate $\mathbb{E}_{X \sim p_0}[X]$.
How can we estimate $\mathbb{E}_{X \sim p_1}[f(X)]$? With a sample mean.
- Let
  
  $\bar{f} \doteq \frac{1}{n}\sum_{i=1}^n f(x_i), \quad \text{where } x_i \sim p_1 \text{ are i.i.d.}$
- Then,
  
  $\mathbb{E}_{X \sim p_1}[\bar{f}] = \mathbb{E}_{X \sim p_1}[f(X)]$
  
  $\text{var}_{X \sim p_1}[\bar{f}] = \frac{1}{n}\text{var}_{X \sim p_1}[f(X)]$

Derivation

Basic idea

We often want to estimate an expectation:

$\mathbb{E}_{X \sim p_0}[X]$

Sometimes sampling directly from $p_0$ is hard or expensive. We then sample from a more convenient distribution $p_1$ and apply a correction factor (the importance weight) so that the estimate stays unbiased.

Steps

Start from the definition:

$\mathbb{E}_{X \sim p_0}[X] = \sum_x p_0(x)x$

Multiply and divide by $p_1(x)$ (this requires $p_1(x) > 0$ wherever $p_0(x) > 0$):

$\sum_x p_0(x)x = \sum_x p_1(x)\frac{p_0(x)}{p_1(x)}x = \mathbb{E}_{X \sim p_1}[f(X)]$

where

$f(x) = \frac{p_0(x)}{p_1(x)} x$

This turns the expectation under $p_0$ into an expectation under $p_1$.

Monte Carlo approximation
Since $\mathbb{E}_{X \sim p_1}[f(X)]$ usually cannot be computed analytically, we approximate it by sampling:

Draw $n$ i.i.d. samples $x_i \sim p_1$ and define

$\bar{f} = \frac{1}{n}\sum_{i=1}^n f(x_i) = \frac{1}{n}\sum_{i=1}^n \frac{p_0(x_i)}{p_1(x_i)} x_i$

Properties:

Unbiasedness: $\mathbb{E}_{X \sim p_1}[\bar{f}] = \mathbb{E}_{X \sim p_0}[X]$
Variance: $\text{Var}[\bar{f}] = \frac{1}{n}\text{Var}[f(X)]$

So the more samples we draw, the more stable the estimate.
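As a quick numerical check of the estimator above, here is a minimal NumPy sketch. The two-point distributions `p0` and `p1`, the sample size, and the seed are illustrative assumptions, not values taken from the course notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: X takes the values +1 and -1.
values = np.array([1.0, -1.0])
p0 = np.array([0.5, 0.5])   # target distribution, E_{p0}[X] = 0
p1 = np.array([0.8, 0.2])   # sampling distribution, E_{p1}[X] = 0.6

n = 100_000
idx = rng.choice(len(values), size=n, p=p1)   # samples are drawn from p1 only
x = values[idx]
weights = p0[idx] / p1[idx]                   # importance weights p0(x_i) / p1(x_i)

x_bar = x.mean()               # plain sample mean, estimates E_{p1}[X]
f_bar = (weights * x).mean()   # weighted sample mean, estimates E_{p0}[X]

print(f"x_bar = {x_bar:.3f}, f_bar = {f_bar:.3f}")
```

With this setup the plain mean `x_bar` tracks $\mathbb{E}_{X \sim p_1}[X] = 0.6$, while the weighted mean `f_bar` tracks $\mathbb{E}_{X \sim p_0}[X] = 0$, even though no sample was drawn from $p_0$.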

Importance Weight

Therefore, $\bar{f}$ is a good approximation of $\mathbb{E}_{X \sim p_1}[f(X)] = \mathbb{E}_{X \sim p_0}[X]$.

$\mathbb{E}_{X \sim p_0}[X] \approx \bar{f} = \frac{1}{n}\sum_{i=1}^n f(x_i) = \frac{1}{n}\sum_{i=1}^n \frac{p_0(x_i)}{p_1(x_i)} x_i$

$\frac{p_0(x_i)}{p_1(x_i)}$ is called the importance weight.
- If $p_1(x_i) = p_0(x_i)$, the importance weight is one and $\bar{f}$ becomes $\bar{x}$.
- If $p_0(x_i) > p_1(x_i)$, $x_i$ is sampled more often under $p_0$ than under $p_1$.
  The importance weight ($> 1$) then emphasizes this sample.

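Meaning of the importance weight

Correction factor:

$w(x) = \frac{p_0(x)}{p_1(x)}$

Interpretation:

If $p_1(x) = p_0(x)$, then $w(x) = 1$ and the estimator reduces to ordinary Monte Carlo.
If some $x$ is more likely under $p_0$ than under $p_1$ (that is, $p_0(x) > p_1(x)$), then $w(x) > 1$, and the sample is amplified whenever it is drawn.
Conversely, if $p_0(x) < p_1(x)$, then $w(x) < 1$, and the influence of the sample is reduced.

Intuitively, the importance weight reweights data drawn from the "biased" distribution $p_1$ so that the weighted average recovers the true expectation under the target distribution $p_0$.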

Q & A

Question
- The estimator $\bar{f} = \frac{1}{n}\sum_{i=1}^n \frac{p_0(x_i)}{p_1(x_i)}x_i$ requires $p_0(x)$. If I know $p_0(x)$, why not calculate the expectation directly?
Answer
- It is applicable to the case where it is easy to calculate $p_0(x)$ given an $x$ , but difficult to calculate the expectation.
  - For example, continuous case, complex expression of $p_0$ , or no expression of $p_0$ (e.g., $p_0$ represented by a neural network).

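Why not compute the expectation directly?

Since the estimator needs $p_0(x)$ anyway, why not compute $\mathbb{E}_{X \sim p_0}[X]$ directly?
Reasons:
We may be able to evaluate $p_0(x)$ at any given $x$, yet the sum or integral that defines the expectation is too complex to compute.
For example, in continuous spaces, high-dimensional integrals rarely have a closed form.
$p_0(x)$ may be represented by a neural network (no analytic form).

Sampling from $p_1$, by contrast, is easy, for instance when $p_1$ is chosen to be a simple distribution.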
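To make the continuous case concrete, the sketch below only ever evaluates $p_0$ pointwise and never integrates it. The two Gaussian densities, the sample size, and the helper `normal_pdf` are assumptions chosen for illustration; a Gaussian $p_0$ is of course easy to integrate, which is what lets us check the answer.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
n = 200_000

# Assumed target p0 = N(1, 1) and sampling distribution p1 = N(0, 2^2).
# p1 is wider than p0, which keeps the importance weights well behaved.
x = rng.normal(loc=0.0, scale=2.0, size=n)              # samples from p1
w = normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 2.0)   # p0(x_i) / p1(x_i)

estimate = np.mean(w * x)   # importance sampling estimate of E_{p0}[X] = 1
mean_weight = np.mean(w)    # should be close to 1, since E_{p1}[p0/p1] = 1

print(f"estimate = {estimate:.3f}, mean weight = {mean_weight:.3f}")
```

Checking that the average weight is near one is a cheap sanity test: if it is far off, $p_1$ probably covers $p_0$ poorly.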

Equation Summary

If $\{x_i\}_{i=1}^n \sim p_1$, then as $n \to \infty$,

$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \to \mathbb{E}_{X \sim p_1}[X]$

$\bar{f} = \frac{1}{n}\sum_{i=1}^n \frac{p_0(x_i)}{p_1(x_i)} x_i \to \mathbb{E}_{X \sim p_0}[X]$

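Application in RL

In reinforcement learning:

Behavior policy: the distribution that generates the data, $p_1 = \pi_b$
Target policy: the distribution we want to optimize, $p_0 = \pi$

Learning directly from behavior-policy data suffers from a distribution mismatch. Importance sampling corrects it with the weight

$w_t = \frac{\pi(a_t|s_t)}{\pi_b(a_t|s_t)}$

so that, in expectation, the update is the one the target policy would produce.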

The theorem of off-policy policy gradient

Like the previous on-policy case, we need to derive the policy gradient in the off-policy case.

Suppose $\beta$ is the behavior policy that generates experience samples.
Our aim is to use these samples to update a target policy $\pi$ that maximizes the metric

$J(\theta) = \sum_{s \in \mathcal{S}} d_\beta(s) v_\pi(s) = \mathbb{E}_{S \sim d_\beta}[v_\pi(S)]$
- where $d_\beta$ is the stationary distribution under policy $\beta$.

Theorem (Off-policy policy gradient theorem)

In the discounted case where $\gamma \in (0,1)$, the gradient of $J(\theta)$ is

$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \rho, A \sim \beta} \left[ \frac{\pi(A|S,\theta)}{\beta(A|S)} \nabla_\theta \ln \pi(A|S,\theta) q_\pi(S,A) \right]$
- where $\beta$ is the behavior policy and $\rho$ is a state distribution.

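Intuition and derivation outline of the off-policy policy gradient theorem

Background: why off-policy?

In the earlier on-policy setting,
we update $\pi$ using trajectories generated by the target policy $\pi$ itself.
In practice, however, we often face the following situations:
the target policy $\pi$ has not converged yet, so the samples it generates are of limited quality;
we already have a large amount of data generated by other policies (for example an older policy, a random policy, or an experience buffer);
we want to reuse existing samples to improve data efficiency.

This calls for off-policy learning:
use samples generated by the behavior policy $\beta$ to update the target policy $\pi$.

Performance objective

Define the objective function:

$J(\theta) = \sum_{s \in \mathcal{S}} d_\beta(s) v_\pi(s) = \mathbb{E}_{S \sim d_\beta}[v_\pi(S)]$

$d_\beta(s)$: the stationary distribution of state $s$ under the behavior policy $\beta$;
$v_\pi(s)$: the expected return starting from state $s$ under the target policy $\pi$.

In other words, we maximize the value function of $\pi$ averaged over the state distribution induced by $\beta$.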

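Off-policy Policy Gradient Theorem

The theorem gives the gradient in the form:

$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \rho, A \sim \beta} \left[ \frac{\pi(A|S,\theta)}{\beta(A|S)} \nabla_\theta \ln \pi(A|S,\theta) q_\pi(S,A) \right]$

where:

$\rho$: a state distribution;
$\beta$: the behavior policy;
$\pi$: the target policy;
$\frac{\pi(A|S,\theta)}{\beta(A|S)}$: the importance sampling weight.

Why the importance sampling ratio?

In the on-policy case, $A$ is sampled from $\pi$, so no correction is needed.

In the off-policy case, $A$ is sampled from $\beta$, which does not match $\pi$.

To correct this distribution mismatch, we introduce the factor:

$\frac{\pi(A|S,\theta)}{\beta(A|S)}$

Although we sample actions from $\beta$, the reweighted expectation over actions equals the one obtained by sampling from $\pi$.

Intuition
Distribution correction: the ratio $\frac{\pi}{\beta}$ reweights actions sampled under $\beta$ as if they had been sampled under $\pi$;
Unbiasedness: the expectation of the weighted sample equals the gradient given by the theorem;
Reusability: old experience generated by $\beta$ can be used to update $\pi$, without resampling after every update.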

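Practical issues
Although the correction term $\frac{\pi}{\beta}$ keeps the estimate unbiased, it can make the variance very large, especially when $\beta(a|s)$ is small for actions that $\pi$ favors.
Practical off-policy algorithms therefore often:
truncate or clip the importance weights (as in ACER, Retrace, or V-trace), trading some bias for lower variance;
or rely on a learned critic for $q_\pi(s,a)$ (as in DDPG and SAC, which train on replayed data without per-sample importance weights).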
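The bias and variance trade-off of clipping can be seen exactly on a tiny example, without any sampling. In the sketch below the two-action policies `pi` and `beta`, the action values `q`, and the clipping threshold `2.0` are all made-up numbers for illustration, and `mean_and_var` is a hypothetical helper.

```python
import numpy as np

# Assumed single-state example with two actions.
pi = np.array([0.9, 0.1])     # target policy
beta = np.array([0.1, 0.9])   # behavior policy (rarely picks the action pi prefers)
q = np.array([1.0, 0.0])      # action values under pi

def mean_and_var(weights):
    """Exact mean and variance of weights[A] * q[A] when A is drawn from beta."""
    samples = weights * q
    mean = np.sum(beta * samples)
    var = np.sum(beta * samples ** 2) - mean ** 2
    return mean, var

rho = pi / beta                                   # importance weights: [9, 1/9]
plain_mean, plain_var = mean_and_var(rho)
clip_mean, clip_var = mean_and_var(np.minimum(rho, 2.0))

true_value = np.sum(pi * q)                       # E_{A ~ pi}[q(A)] = 0.9
print(true_value, plain_mean, plain_var, clip_mean, clip_var)
```

The unclipped estimator matches the true value 0.9 but has variance 7.29. Clipping the weights at 2 lowers the variance to 0.36 at the price of a biased mean of 0.2, which is why practical methods that clip usually add some form of bias correction.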

The algorithm of off-policy actor-critic

Off-policy policy gradient

The off-policy policy gradient is also invariant to a baseline $b(s)$.
In particular, we have

$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \rho, A \sim \beta} \left[ \frac{\pi(A|S,\theta)}{\beta(A|S)} \nabla_\theta \ln \pi(A|S,\theta) (q_\pi(S,A) - b(S)) \right]$

To reduce the estimation variance, we can select the baseline as $v_\pi(S)$ and obtain

$\nabla_\theta J(\theta) = \mathbb{E} \left[ \frac{\pi(A|S,\theta)}{\beta(A|S)} \nabla_\theta \ln \pi(A|S,\theta) (q_\pi(S,A) - v_\pi(S)) \right]$

Why off-policy?

In the on-policy case, both the actor and the critic are updated with data generated by the current policy $\pi$. In practice this has two drawbacks:

The policy keeps changing, so new samples are needed after every update, which makes sampling inefficient.
Old data (generated by earlier policies) cannot be used directly.

We would therefore like to collect experience under a behavior policy $\beta$ and use it to update a target policy $\pi$. This requires importance sampling as a correction.

Off-policy Policy Gradient with Baseline

In the off-policy setting, the policy gradient becomes:

$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \rho, A \sim \beta} \left[ \frac{\pi(A|S,\theta)}{\beta(A|S)} \nabla_\theta \ln \pi(A|S,\theta) (q_\pi(S,A) - b(S)) \right]$

Key points:

$\frac{\pi(A|S,\theta)}{\beta(A|S)}$ is the importance weight, which corrects for the mismatch between the behavior policy and the target policy.

The baseline $b(S)$ reduces variance. The most common choice is $v_\pi(S)$, which yields the advantage function:

$q_\pi(S,A) - v_\pi(S) = \delta_\pi(S,A)$

Stochastic gradient-ascent algorithm

The corresponding stochastic gradient-ascent algorithm is

$\theta_{t+1} = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t,\theta_t)}{\beta(a_t|s_t)} \nabla_\theta \ln \pi(a_t|s_t,\theta_t) (q_t(s_t,a_t) - v_t(s_t))$

Similar to the on-policy case,

$q_t(s_t,a_t) - v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t) \doteq \delta_t(s_t,a_t)$

Then, the algorithm becomes

$\theta_{t+1} = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t,\theta_t)}{\beta(a_t|s_t)} \nabla_\theta \ln \pi(a_t|s_t,\theta_t) \delta_t(s_t,a_t)$

and hence

$\theta_{t+1} = \theta_t + \alpha_\theta \left( \frac{\delta_t(s_t,a_t)}{\beta(a_t|s_t)} \right) \nabla_\theta \pi(a_t|s_t,\theta_t)$
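The final rewriting relies on the log-derivative identity. Spelled out as a short worked step (my own addition, not from the course slides):

$\pi(a_t|s_t,\theta_t) \, \nabla_\theta \ln \pi(a_t|s_t,\theta_t) = \pi(a_t|s_t,\theta_t) \, \frac{\nabla_\theta \pi(a_t|s_t,\theta_t)}{\pi(a_t|s_t,\theta_t)} = \nabla_\theta \pi(a_t|s_t,\theta_t)$

The numerator $\pi$ of the importance weight therefore cancels against the denominator of $\nabla_\theta \ln \pi$, leaving the step size scaled by $\delta_t / \beta(a_t|s_t)$.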

Off-policy actor-critic based on importance sampling

Initialization:
- A given behavior policy $\beta(a|s)$.
- A target policy $\pi(a|s,\theta_0)$ where $\theta_0$ is the initial parameter vector.
- A value function $v(s,w_0)$ where $w_0$ is the initial parameter vector.
Aim: Search for an optimal policy by maximizing $J(\theta)$.
At time step $t$ in each episode, do
- Generate $a_t$ following $\beta(\cdot|s_t)$ and then observe $r_{t+1}, s_{t+1}$.
- TD error (advantage estimate):
  
  $\delta_t = r_{t+1} + \gamma v(s_{t+1},w_t) - v(s_t,w_t)$
- Critic (value update):
  
  $w_{t+1} = w_t + \alpha_w \frac{\pi(a_t|s_t,\theta_t)}{\beta(a_t|s_t)} \delta_t \nabla_w v(s_t,w_t)$
- Actor (policy update):
  
  $\theta_{t+1} = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t,\theta_t)}{\beta(a_t|s_t)} \delta_t \nabla_\theta \ln \pi(a_t|s_t,\theta_t)$

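A closer look at the off-policy actor-critic algorithm

Initialization

Behavior policy $\beta(a|s)$:

This is the policy used to collect data. It can be an older version of the policy or a policy designed for exploration. Its samples can be produced in bulk and stored, for example in an experience replay buffer, for later reuse, which improves sample utilization.

Target policy $\pi(a|s,\theta)$:

This is the policy we actually want to optimize (the actor). Its parameter $\theta$ determines the final decision-making performance.

Value function $v(s, w)$:

The critic approximates the state value with parameter $w$ and gives the actor a signal of how good an action was.

Sampling

At time step $t$,
select the action $a_t$ according to the behavior policy $\beta(\cdot|s_t)$.
The environment returns the reward $r_{t+1}$ and the next state $s_{t+1}$.

Note: the action is sampled from $\beta$, not from $\pi$.
This is the key to off-policy learning: the data come from a different policy.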

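Computing the TD error (advantage estimate)

$\delta_t = r_{t+1} + \gamma v(s_{t+1},w_t) - v(s_t,w_t)$

Intuition:
If the reward plus the discounted value of the next state exceeds the value of the current state, the action turned out better than expected and $\delta_t > 0$.
If it falls short, the action turned out worse than expected and $\delta_t < 0$.

Critic update (value function approximation)

$w_{t+1} = w_t + \alpha_w \frac{\pi(a_t|s_t,\theta_t)}{\beta(a_t|s_t)} \delta_t \nabla_w v(s_t,w_t)$

The critic's goal is to bring the value function closer to the TD target.
The factor $\frac{\pi}{\beta}$ appears because the data are sampled from the behavior policy $\beta$ rather than the target policy $\pi$. The importance sampling weight corrects for this so that, in expectation, the update matches what sampling from $\pi$ would give.
Without this correction, the critic would learn the value of the wrong policy and mislead the actor.

Actor update (policy improvement)

$\theta_{t+1} = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t,\theta_t)}{\beta(a_t|s_t)} \delta_t \nabla_\theta \ln \pi(a_t|s_t,\theta_t)$

Intuition:
If $\delta_t > 0$, the action was better than expected, so the actor should increase its probability;
if $\delta_t < 0$, the action was worse than expected, so the actor should decrease its probability.

Again, $\frac{\pi}{\beta}$ corrects for the data distribution.
Over time, the actor moves toward policies that yield a larger long-term return.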

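Key takeaways

$\beta$ samples, $\pi$ learns
This lets us reuse old data repeatedly (experience replay), which greatly improves sample efficiency.

Importance sampling keeps the updates unbiased
It corrects the distribution mismatch caused by $\beta \neq \pi$. Without it, the update directions of both the actor and the critic would be biased.

The TD error acts as the advantage
The $\delta_t$ supplied by the critic tells the actor how much better the action was than average, which guides policy improvement.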
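The update rules above can be sketched on a toy problem. Everything about the environment below is an assumption made for illustration: a two-state, two-action MDP with a hypothetical reward table `R`, next states drawn uniformly at random, a uniform behavior policy `beta`, a tabular softmax actor, a tabular critic, and hand-picked step sizes. It is a minimal sketch of the algorithm, not the course's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 2, 2, 0.9

# Assumed toy MDP: reward table r(s, a); the next state is uniform at random.
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
beta = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform behavior policy

theta = np.zeros((n_states, n_actions))  # actor: softmax preferences per state
v = np.zeros(n_states)                   # critic: tabular state values
alpha_theta, alpha_w = 0.1, 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

s = 0
for t in range(5000):
    # Generate a_t following beta, then observe r_{t+1} and s_{t+1}.
    a = rng.choice(n_actions, p=beta[s])
    r = R[s, a]
    s_next = rng.integers(n_states)

    pi_s = softmax(theta[s])
    rho = pi_s[a] / beta[s, a]                 # importance weight pi / beta
    delta = r + gamma * v[s_next] - v[s]       # TD error

    # Critic: for a tabular v, the gradient w.r.t. v[s] is 1.
    v[s] += alpha_w * rho * delta

    # Actor: for a softmax policy, grad of ln pi(a|s) w.r.t. theta[s]
    # is onehot(a) - pi(.|s).
    grad_log_pi = -pi_s
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * rho * delta * grad_log_pi

    s = s_next

pi = np.array([softmax(theta[i]) for i in range(n_states)])
print(np.round(pi, 3))
```

In this toy MDP the rewarding action is action 0 in state 0 and action 1 in state 1, so the learned `pi` should put most of its probability on those actions even though every action was chosen by the uniform policy `beta`.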

Deterministic actor-critic (DPG)

Introduction

The ways to represent a policy:

Up to now, a general policy is denoted as

$\pi(a|s,\theta) \in [0,1],$
- which can be either stochastic or deterministic.
Now, the deterministic policy is specifically denoted as

$\mu(s,\theta) \doteq \mu(s)$
- $\mu$ is a mapping from $\mathcal{S}$ to $\mathcal{A}$.
- $\mu$ can be represented by, for example, a neural network with the input as $s$, the output as $a$, and the parameter as $\theta$.
- We may write $\mu(s,\theta)$ in short as $\mu(s)$.

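Representing a deterministic policy

General policy: $\pi(a|s, \theta)$, which may be a distribution over actions or a deterministic function.

Deterministic policy:

$\mu(s, \theta)$

It is a mapping $\mathcal{S} \to \mathcal{A}$.

Implementation: use a neural network that takes the state $s$ as input and outputs a single action $a$.

This means the action is no longer sampled but computed directly.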

The theorem of deterministic policy gradient

Introduction

The policy gradient theorems introduced so far are valid only for stochastic policies.
If the policy must be deterministic, we need to derive a new policy gradient theorem.
The ideas and procedures are similar.

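Why move from stochastic to deterministic policies?

In the stochastic policy gradient (SPG) methods covered so far, we assumed:

$\pi(a|s, \theta) > 0, \quad \forall (s,a)$

This means that in every state, each action has a nonzero probability of being selected.

Benefit: in theory, the whole action space is covered.

Problems:

in continuous action spaces, a finite number of samples covers the action distribution poorly, so learning is inefficient;
gradient estimates have high variance, which makes updates unstable.

So we ask:

can we directly learn a function $\mu(s, \theta)$ so that the policy outputs the action deterministically?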

Definition

Consider the metric of average state value in the discounted case:

$J(\theta) = \mathbb{E}[v_\mu(s)] = \sum_{s \in \mathcal{S}} d_0(s) v_\mu(s)$
- where $d_0(s)$ is a probability distribution satisfying $\sum_{s \in \mathcal{S}} d_0(s) = 1$.
- $d_0$ is selected to be independent of $\mu$. The gradient in this case is easier to calculate.
There are two special yet important cases of selecting $d_0$.
- The first special case is that $d_0(s_0) = 1$ and $d_0(s \neq s_0) = 0$, where $s_0$ is a specific starting state of interest.
- The second special case is that $d_0$ is the stationary distribution of a behavior policy that is different from $\mu$.

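The DPG objective

We again define the objective function (expected return):

$J(\theta) = \mathbb{E}[v_\mu(s)] = \sum_{s \in \mathcal{S}} d_0(s) v_\mu(s)$

where:
$v_\mu(s)$: the value of state $s$ under policy $\mu$.
$d_0(s)$: a state distribution, which can be chosen in different ways:
If $d_0(s_0)=1$, we start from a specific initial state.
If $d_0$ is the stationary distribution of a behavior policy, we are in the off-policy setting.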

Theorem (Deterministic policy gradient theorem in the discounted case)

In the discounted case where $\gamma \in (0,1)$, the gradient of $J(\theta)$ is

$\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} \rho_\mu(s) \nabla_\theta \mu(s) \big(\nabla_a q_\mu(s,a)\big)\big|_{a=\mu(s)} = \mathbb{E}_{S \sim \rho_\mu} \Big[ \nabla_\theta \mu(S) \big(\nabla_a q_\mu(S,a)\big)\big|_{a=\mu(S)} \Big]$

Here, $\rho_\mu$ is a state distribution.
One important difference from the stochastic case:
- The gradient does not involve the distribution of the action $A$, because the action is fixed to $a = \mu(s)$ instead of being sampled.
- As a result, the deterministic policy gradient method is naturally off-policy.

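Deterministic policy gradient theorem

The theorem states:

$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \rho_\mu} \Big[ \nabla_\theta \mu(S) \, \nabla_a q_\mu(S,a) \big|_{a=\mu(S)} \Big]$

Key points:
The gradient depends on:
The policy gradient $\nabla_\theta \mu(S)$: the sensitivity of the action $a$ to the parameter $\theta$ (how the action changes with the parameter).
The action-value gradient $\nabla_a q_\mu(S,a)$: the sensitivity of the value to the action $a$ (the marginal effect of the action on the value function).
Combined: we move the action in the direction in which the value increases fastest.

Unlike the stochastic case, there is no distribution term $\pi(a|s)$ here.
In the stochastic policy gradient, the update contains $\nabla_\theta \ln \pi(a|s,\theta)$ because the sampling probability must be taken into account.
In DPG, the action is uniquely determined and no probability distribution is needed, so this term disappears.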

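Why does it differ from the stochastic case?

In the stochastic policy gradient (REINFORCE):

$\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \ln \pi(a|s,\theta) \, q_\pi(s,a)]$

Explanation:
Because the action is sampled from a distribution, the derivative of $\ln \pi(a|s,\theta)$ is needed to capture how the sampling probability changes with the parameter.

In the deterministic case:

The action is determined and no probability distribution is needed.
The $\ln \pi$ term therefore vanishes and $\nabla_\theta \mu(s)$ is used directly.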

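Why is DPG off-policy?

With a stochastic policy, the gradient is an expectation over actions $A \sim \pi$, so we must either sample actions from the current policy (on-policy) or correct with importance weights.

In DPG, the formula depends only on $\mu(s)$ and $q_\mu(s,a)$ and contains no expectation over actions, so it does not depend on the action-sampling distribution $\beta$.

→ We can collect data with any behavior policy (for example $\mu$ plus exploration noise) and then use the data to update $\mu$.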

The algorithm of deterministic actor-critic

Gradient-ascent algorithm for deterministic policy gradient

Based on the policy gradient, the gradient-ascent algorithm for maximizing $J(\theta)$ is:

$\theta_{t+1} = \theta_t + \alpha_\theta \mathbb{E}_{S \sim \rho_\mu} \left[ \nabla_\theta \mu(S) (\nabla_a q_\mu(S, a)) \big|_{a=\mu(S)} \right]$

The corresponding stochastic gradient-ascent algorithm is

$\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \mu(s_t) (\nabla_a q_\mu(s_t, a)) \big|_{a=\mu(s_t)}$

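Gradient-ascent algorithm

Deterministic policy gradient

$\theta_{t+1} = \theta_t + \alpha_\theta \mathbb{E}_{S \sim \rho_\mu} \left[ \nabla_\theta \mu(S) (\nabla_a q_\mu(S, a)) \big|_{a=\mu(S)} \right]$

Meaning:
Under a deterministic policy, the action $a$ is no longer sampled from a probability distribution but given directly by the function $\mu(s, \theta)$.
The update direction of the policy parameter $\theta$ follows from the chain rule:
$\nabla_\theta \mu(S)$: the sensitivity of the action to the policy parameter (how a parameter change affects the output action);
$\nabla_a q_\mu(S, a)$: the sensitivity of the value function to the action (how an action change affects the future return).

Together, the update direction is (sensitivity of the return to the action) × (sensitivity of the action to the parameter).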
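The chain-rule structure can be checked numerically in a simplified setting. The sketch below assumes a scalar linear policy `mu(s, theta) = theta * s` and a fixed, hypothetical action-value function `q(s, a) = -(a - 2s)^2` that does not depend on $\theta$. That is a one-step simplification: in the theorem, the dependence of $q_\mu$ on $\theta$ is absorbed by the state distribution $\rho_\mu$.

```python
def mu(s, theta):
    """Assumed deterministic policy: a = theta * s."""
    return theta * s

def q(s, a):
    """Assumed fixed action-value function, maximized at a = 2 * s."""
    return -(a - 2.0 * s) ** 2

def dpg_gradient(s, theta):
    """grad_theta mu(s) * grad_a q(s, a) evaluated at a = mu(s)."""
    a = mu(s, theta)
    grad_theta_mu = s                  # d(theta * s) / d theta
    grad_a_q = -2.0 * (a - 2.0 * s)    # d q / d a
    return grad_theta_mu * grad_a_q

s, theta, eps = 0.5, 1.0, 1e-6
analytic = dpg_gradient(s, theta)

# Central finite difference of theta -> q(s, mu(s, theta)).
numeric = (q(s, mu(s, theta + eps)) - q(s, mu(s, theta - eps))) / (2.0 * eps)

print(analytic, numeric)
```

Both values agree (0.5 for these numbers), and the gradient is positive because increasing `theta` from 1 moves the action toward the maximizer $a = 2s$.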

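Stochastic gradient-ascent algorithm

$\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \mu(s_t)(\nabla_a q_\mu(s_t, a)) \big|_{a=\mu(s_t)}$

Here the expectation is approximated with a single sampled state $s_t$.
This is the idea behind stochastic gradient descent (with a batch size of one): samples stand in for the expectation over the whole distribution.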

Deterministic actor-critic algorithm

Initialization: A given behavior policy $\beta(a|s)$. A deterministic target policy $\mu(s, \theta_0)$ where $\theta_0$ is the initial parameter vector. An action-value function $q(s, a, w_0)$ where $w_0$ is the initial parameter vector.
Aim: Search for an optimal policy by maximizing $J(\theta)$.
At time step $t$ in each episode, do
- Generate $a_t$ following $\beta$ and then observe $r_{t+1}, s_{t+1}$.
- TD error:
  
  $\delta_t = r_{t+1} + \gamma q(s_{t+1}, \mu(s_{t+1}, \theta_t), w_t) - q(s_t, a_t, w_t)$
- Critic (value update):
  
  $w_{t+1} = w_t + \alpha_w \delta_t \nabla_w q(s_t, a_t, w_t)$
- Actor (policy update):
  
  $\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \mu(s_t, \theta_t)(\nabla_a q(s_t, a, w_{t+1})) \big|_{a=\mu(s_t)}$

Deterministic Actor-Critic Algorithm (DAC)

Algorithm walk-through:

Initialization:

Behavior policy $\beta(a|s)$: used to collect data (it can be $\mu + \text{noise}$).
Target policy $\mu(s, \theta)$: outputs actions deterministically.
Value function $q(s, a, w)$: approximated with parameter $w$.

Execution:

In state $s_t$, generate the action $a_t$ with the behavior policy $\beta$.
Execute $a_t$ and observe the reward $r_{t+1}$ and the next state $s_{t+1}$.

TD error:

$\delta_t = r_{t+1} + \gamma q(s_{t+1}, \mu(s_{t+1}, \theta_t), w_t) - q(s_t, a_t, w_t)$

It measures the gap between the current estimate $q(s_t, a_t, w_t)$ and the TD target $r_{t+1} + \gamma q(s_{t+1}, \mu(s_{t+1}, \theta_t), w_t)$.
If $\delta_t > 0$, the TD target is higher than the estimate, so the critic raises its estimate of $q(s_t, a_t)$.

Critic update:

$w_{t+1} = w_t + \alpha_w \delta_t \nabla_w q(s_t, a_t, w_t)$

As in TD learning, this moves $q(s, a, w)$ toward the true $q_\mu$.

Actor update:

$\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \mu(s_t, \theta_t)(\nabla_a q(s_t, a, w_{t+1}))\big|_{a=\mu(s_t)}$

The actor updates the deterministic policy using the gradient signal provided by the critic.
Note that it uses the gradient $\nabla_a q$ directly rather than the log-likelihood trick.
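The three updates can be put together in a small sketch. All specifics below are assumptions for illustration: a scalar state and action, a linear policy `mu(s, theta) = theta * s`, a linear critic on the hand-picked features `[a^2, a*s, s^2, 1]`, uniform exploration noise, hand-picked step sizes, and a toy task in which every episode lasts one step with reward `-(a - 2s)^2` (so `done` is always true and the bootstrap term is zero). The function `dac_update` is hypothetical and keeps the general form of the TD error.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
alpha_w, alpha_theta = 0.01, 0.005   # assumed step sizes

def mu(s, theta):
    """Assumed deterministic target policy: a = theta * s."""
    return theta * s

def features(s, a):
    """Feature vector phi(s, a) of the linear critic q(s, a, w) = phi^T w."""
    return np.array([a * a, a * s, s * s, 1.0])

def grad_a_q(s, a, w):
    """Derivative of phi(s, a)^T w with respect to the action a."""
    return 2.0 * w[0] * a + w[1] * s

def dac_update(theta, w, s, a, r, s_next, done):
    # TD error: the next action comes from the target policy mu, not from beta.
    q_next = 0.0 if done else features(s_next, mu(s_next, theta)) @ w
    delta = r + gamma * q_next - features(s, a) @ w
    # Critic: for a linear q, grad_w q(s, a, w) = phi(s, a).
    w_new = w + alpha_w * delta * features(s, a)
    # Actor: grad_theta mu(s, theta) = s, and grad_a q is evaluated at a = mu(s).
    theta_new = theta + alpha_theta * s * grad_a_q(s, mu(s, theta), w_new)
    return theta_new, w_new

theta, w = 0.0, np.zeros(4)
for _ in range(50_000):
    s = rng.uniform(-1.0, 1.0)
    a = mu(s, theta) + rng.uniform(-1.0, 1.0)   # behavior policy: mu + noise
    r = -(a - 2.0 * s) ** 2                     # assumed reward, best action is 2 * s
    theta, w = dac_update(theta, w, s, a, r, s_next=0.0, done=True)

print(f"theta = {theta:.3f}")
```

Since the reward is maximized by $a = 2s$, one would expect `theta` to drift toward 2. The actor never sees the reward directly; it only follows $\nabla_a q$ from the critic, which is why exploration noise in the behavior policy is needed for the critic to learn how the value changes with the action.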

Remarks

This is an off-policy implementation where the behavior policy $\beta$ may be different from $\mu$.
$\beta$ can also be replaced by $\mu + \text{noise}$.
How to select the function to represent $q(s, a, w)$?
- Linear function:
  
  $q(s,a,w) = \phi^T(s,a) w$ where $\phi(s,a)$ is the feature vector. Details can be found in the DPG paper.
- Neural networks: deep deterministic policy gradient (DDPG) method.

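What the remarks mean

Off-policy:
The behavior policy $\beta$ can differ from the target policy $\mu$.
Benefit: data collection is more flexible, and past experience can be reused (for example through a replay buffer).

Behavior policy = target policy + noise:
Common practice: add noise (such as Ornstein-Uhlenbeck noise) to the output of $\mu$ to encourage exploration.
This is how the original DDPG explores.

Representing the Q function:

Linear function approximation:

$\phi^T(s,a) w$

Simple and efficient, but with limited expressive power.

Neural networks:

Better suited to continuous action spaces (this is the DDPG method).
The neural network fits the complex $q$ function.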

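Comparison with the stochastic policy gradient

Stochastic policy gradient:

The update depends on $\nabla_\theta \log \pi_\theta(a|s) \, q_\pi(s,a)$.

Advantage: the policy itself explores a range of actions.
Drawback: sampling is inefficient in continuous action spaces.

Deterministic policy gradient:

The update depends on $\nabla_\theta \mu(s) \nabla_a q(s,a)$.

Advantage: it learns the best action directly and avoids sampling actions, which suits high-dimensional continuous control.
Drawback: exploration relies on an additional noise mechanism.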

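Summary

Actor-critic methods use the critic's value estimates to guide the actor's policy updates. In the deterministic policy gradient (DPG/DDPG), the actor outputs the action directly and is updated with $\nabla_\theta \mu(s)\nabla_a q(s,a)$. This avoids sampling from an action distribution and works efficiently in continuous action spaces, but exploration must come from additional noise.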