A Detailed Derivation of the Policy Gradient Theorem
Below is a complete derivation of the policy gradient theorem, from the basic probability formulas to its final form, to help clarify every step along the way.
1. The Objective of the Policy Gradient
We want to maximize the expected cumulative reward $J(\theta)$, defined as:

$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[ R_t \right]$$

By the definition of expectation, $J(\theta)$ can be written in integral form:

$$J(\theta) = \int_{\tau} P(\tau; \theta)\, R_t \, d\tau$$

where:

- $\tau = (s_0, a_0, s_1, a_1, \dots)$ denotes a trajectory;
- $P(\tau; \theta)$ is the probability distribution over trajectories.
Next, we take the gradient of the objective $J(\theta)$:

$$\nabla_\theta J(\theta) = \nabla_\theta \int_{\tau} P(\tau; \theta)\, R_t \, d\tau$$

Under the usual regularity conditions that permit differentiation under the integral sign, the gradient and the integral can be exchanged:

$$\nabla_\theta J(\theta) = \int_{\tau} \nabla_\theta \left[ P(\tau; \theta)\, R_t \right] d\tau$$

Since $R_t$ does not depend on the parameters $\theta$, it can be pulled outside the gradient:

$$\nabla_\theta J(\theta) = \int_{\tau} R_t\, \nabla_\theta P(\tau; \theta) \, d\tau$$
2. Introducing the Log-Derivative Trick
To simplify $\nabla_\theta P(\tau; \theta)$, we apply the log-derivative trick:

$$\nabla_\theta P(\tau; \theta) = P(\tau; \theta) \cdot \nabla_\theta \log P(\tau; \theta)$$

Substituting this into the gradient formula:

$$\nabla_\theta J(\theta) = \int_{\tau} R_t \cdot P(\tau; \theta) \cdot \nabla_\theta \log P(\tau; \theta) \, d\tau$$

Because $P(\tau; \theta)$ is a probability distribution, this integral is exactly an expectation:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ R_t \cdot \nabla_\theta \log P(\tau; \theta) \right]$$

The importance of this step is that it converts the integral into an expectation under the policy $\pi_\theta$, so that the gradient can subsequently be estimated by sampling.
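As a sanity check on this sampling view, here is a minimal pure-Python sketch. It uses a hypothetical three-outcome softmax distribution and fixed rewards (none of this setup is from the text): the exact gradient of $\mathbb{E}[R]$ computed by enumeration is compared against the score-function Monte Carlo estimate $\frac{1}{N}\sum_i R(x_i)\,\nabla_\theta \log P(x_i;\theta)$.

```python
import math
import random

random.seed(0)

# Hypothetical "trajectory" space with 3 outcomes and fixed rewards.
rewards = [1.0, 0.0, 2.0]

def probs(theta):
    """Softmax distribution P(x; theta) over the 3 outcomes."""
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_prob(theta, x):
    """grad_theta log P(x; theta) for a softmax: one-hot(x) - probs."""
    p = probs(theta)
    return [(1.0 if i == x else 0.0) - p[i] for i in range(len(theta))]

theta = [0.1, -0.2, 0.3]
p = probs(theta)

# Exact gradient of J(theta) = E[R] by enumerating all outcomes:
# dJ/dtheta_i = sum_x R(x) * P(x) * d log P(x) / dtheta_i
exact = [sum(rewards[x] * p[x] * grad_log_prob(theta, x)[i]
             for x in range(3)) for i in range(3)]

# Score-function Monte Carlo estimate: average of R(x) * grad log P(x).
n = 200_000
est = [0.0, 0.0, 0.0]
for _ in range(n):
    x = random.choices(range(3), weights=p)[0]
    g = grad_log_prob(theta, x)
    for i in range(3):
        est[i] += rewards[x] * g[i] / n

print(exact)
print(est)  # agrees with `exact` up to Monte Carlo noise
```

The estimator never differentiates through the sampling itself; it only needs $\nabla_\theta \log P$, which is exactly what the log-derivative trick buys us.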
3. Decomposing the Trajectory Probability
The probability $P(\tau; \theta)$ of a trajectory $\tau$ factorizes as:

$$P(\tau; \theta) = P(s_0) \prod_{t=0}^{\infty} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t)$$

where:

- $P(s_0)$ is the initial-state probability;
- $\pi_\theta(a_t \mid s_t)$ is the policy distribution, i.e. the probability of taking action $a_t$ in state $s_t$;
- $P(s_{t+1} \mid s_t, a_t)$ is the environment's state-transition probability.

When we differentiate $\log P(\tau; \theta)$, only the policy factors $\pi_\theta(a_t \mid s_t)$ depend on the parameters $\theta$; the initial-state and transition terms drop out, leaving:

$$\nabla_\theta \log P(\tau; \theta) = \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Substituting this result into the gradient formula:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ R_t \cdot \sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$
4. Simplifying to the Final Formula
By linearity of expectation, the sum can be moved outside the expectation:

$$\nabla_\theta J(\theta) = \sum_{t=0}^{\infty} \mathbb{E}_{\pi_\theta} \left[ R_t \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$

At each time step $t$, only the log-gradient of the current action $a_t$ in the current state $s_t$ needs to be computed, which yields the commonly quoted per-step form:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ R_t \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$

This is the final form of the policy gradient theorem.
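The final formula translates directly into the REINFORCE update $\theta \leftarrow \theta + \alpha \, R_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$. A minimal sketch on a hypothetical 2-armed bandit (a single-state problem, chosen so the estimator is easy to see; the rewards and learning rate are illustrative):

```python
import math
import random

random.seed(1)

# Hypothetical 2-armed bandit: arm 1 pays 1, arm 0 pays nothing.
arm_reward = [0.0, 1.0]

def policy(theta):
    """Softmax policy over the two arms."""
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

theta = [0.0, 0.0]
alpha = 0.1

for _ in range(2000):
    p = policy(theta)
    a = random.choices([0, 1], weights=p)[0]
    r = arm_reward[a]
    # Stochastic gradient ascent: theta += alpha * R * grad log pi(a);
    # for a softmax, d log pi(a)/d theta[b] = 1[a=b] - pi(b).
    for b in range(2):
        grad_log = (1.0 if a == b else 0.0) - p[b]
        theta[b] += alpha * r * grad_log

print(policy(theta))  # probability mass shifts toward the rewarding arm 1
```

Each update nudges $\theta$ in the direction that makes the sampled action more likely, scaled by the reward it earned, which is exactly what the expectation formula prescribes.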
5. Verifying the Log-Derivative Identity
The policy gradient formula hinges on the following log-derivative identity:

$$\nabla_\theta \pi_\theta(a_t \mid s_t) = \pi_\theta(a_t \mid s_t) \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Proof:

- By definition, the derivative of $\log x$ is $\frac{1}{x}$;
- So, by the chain rule, taking the gradient of $\log \pi_\theta(a_t \mid s_t)$ gives:

$$\nabla_\theta \log \pi_\theta(a_t \mid s_t) = \frac{1}{\pi_\theta(a_t \mid s_t)} \cdot \nabla_\theta \pi_\theta(a_t \mid s_t)$$

Multiplying both sides by $\pi_\theta(a_t \mid s_t)$:

$$\nabla_\theta \pi_\theta(a_t \mid s_t) = \pi_\theta(a_t \mid s_t) \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
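The identity can also be confirmed numerically. The sketch below uses a sigmoid-parametrized Bernoulli policy (an illustrative choice, not from the text) and checks by central finite differences that $\nabla_\theta \pi_\theta = \pi_\theta \cdot \nabla_\theta \log \pi_\theta$:

```python
import math

# Numeric check of:  d pi / d theta  ==  pi * d log(pi) / d theta
def pi_a(theta):
    """Probability of action a=1 under pi_theta: sigmoid(theta)."""
    return 1.0 / (1.0 + math.exp(-theta))

theta = 0.7
eps = 1e-6

# Left side: finite-difference gradient of pi itself.
d_pi = (pi_a(theta + eps) - pi_a(theta - eps)) / (2 * eps)

# Right side: pi times the finite-difference gradient of log pi.
d_log_pi = (math.log(pi_a(theta + eps))
            - math.log(pi_a(theta - eps))) / (2 * eps)
rhs = pi_a(theta) * d_log_pi

print(d_pi, rhs)  # the two sides agree
```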
Substituting this identity into the derivation, the probability factor $\pi_\theta(a_t \mid s_t)$ cancels against the $\frac{1}{\pi_\theta(a_t \mid s_t)}$ term, recovering:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ R_t \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$
Summary
This step-by-step derivation shows that the policy gradient theorem rests on two key ideas:

- Introducing the log-derivative identity, which turns the gradient of a probability into the more tractable gradient of a log-probability;
- Decomposing the trajectory probability, which simplifies the gradient so that all the computation concentrates on the policy term $\pi_\theta(a_t \mid s_t)$.

The final policy gradient formula is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ R_t \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$

This formula is both concise and efficient, and it is the theoretical foundation of policy gradient methods.
Postscript
Completed in Shanghai on December 12, 2024 at 17:00, with the assistance of the GPT-4o large language model.