
Policy Optimization Methods in Reinforcement Learning

updated 2026-02-12

Unlike value-based methods, policy optimization methods search directly over policy parameters, without building explicit value representations for states (or state-action pairs), in order to find parameters that maximize a policy objective function.

Let U(θ) be any policy objective function. Then the general structure would follow something like

  1. Initialize policy parameters $\theta$
  2. Sample trajectories $\tau_i = \{s_t^i, a_t^i\}_{t=0}^{T}$ by deploying the current policy $\pi_\theta(a_t \mid s_t)$
  3. Compute the gradient vector $\nabla_\theta U(\theta)$. Typically this is estimated from the collected data.
  4. Apply a gradient ascent update $\theta \leftarrow \theta + \alpha \nabla_\theta U(\theta)$ (see the sketch below)
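
To make the loop concrete, here is a minimal sketch of this generic structure in Python. Everything here is hypothetical scaffolding (the `rollout`, `estimate_gradient`, `policy`, and environment objects are placeholders, not from any particular library):

```python
import numpy as np

def policy_optimization_loop(env, policy, estimate_gradient,
                             n_iters=100, n_trajs=16, lr=1e-2):
    """Generic policy optimization loop (sketch).

    `policy` holds a parameter vector `policy.theta`; `estimate_gradient`
    is any estimator of grad_theta U(theta) built from sampled rollouts
    (evolutionary, finite-difference, policy gradient, ...).
    """
    for _ in range(n_iters):
        # 2. sample trajectories tau_i = {(s_t, a_t, r_t)} with the current policy
        trajectories = [rollout(env, policy) for _ in range(n_trajs)]
        # 3. estimate grad_theta U(theta) from the collected data
        grad = estimate_gradient(policy, trajectories)
        # 4. gradient ascent step on the policy objective
        policy.theta = policy.theta + lr * grad
    return policy

def rollout(env, policy):
    """Run one episode and return a list of (state, action, reward)."""
    traj, s, done = [], env.reset(), False
    while not done:
        a = policy.act(s)                 # sample a_t ~ pi_theta(. | s_t)
        s_next, r, done = env.step(a)
        traj.append((s, a, r))
        s = s_next
    return traj
```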

In comparison to value-based methods, policy-based methods are

  1. more effective in high-dimensional + continuous action spaces. Value-based methods generally need to select actions via something like $\arg\max_a Q(s,a)$, which becomes problematic when the action space is high-dimensional or continuous.

  2. better at learning stochastic policies. Policy-based methods are naturally stochastic: the policy parameters $\theta$ parameterize a distribution from which actions are sampled. Remember that value-based methods have to rely on heuristics like epsilon-greedy to embed exploration in their policies.

    Why is stochasticity a good thing?

    • exploration is essential to training any policy
    • in partially observable settings/environments, a deterministic mapping from observations to actions may not be optimal; a stochastic policy can be strictly better

Evolutionary Methods for Policy Search

These are called evolutionary methods because they closely follow the evolution of a population!

  1. Initialize a population of parameter vectors (genotypes)
  2. Apply random perturbations to each parameter vector (simulating mutations in offspring)
  3. Evaluate each perturbed parameter vector
  4. Update the policy parameters to favor the best-performing parameter vectors (survival of the fittest)

These are examples of black-box policy optimization: we treat the policy and environment as a “black box” that we can query (run rollouts), and we update our policy based purely on the performance we observe.

CEM: Cross-Entropy Method

In this method, we

  1. sample $n$ policy parameter vectors $\theta_i$ from a multivariate Gaussian distribution $p_\phi(\theta)$
  2. evaluate each of the $n$ parameter vectors to obtain a scalar return $F(\theta_i)$
  3. select the proportion $\rho$ of those parameters with the highest scores as our elite samples
  4. refit the sampling distribution to the elites, as sketched below: set the new mean to the mean of those top $\rho n$ parameter vectors, and set the new variance to simply the variance of the selected elite samples!
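
Here is a minimal sketch of CEM under the assumptions above (a diagonal Gaussian search distribution and a black-box return function `F`; all names are illustrative):

```python
import numpy as np

def cem(F, dim, n_samples=100, elite_frac=0.2, n_iters=50, init_std=1.0):
    """Cross-Entropy Method over a diagonal Gaussian search distribution.

    F: callable mapping a parameter vector theta -> scalar return F(theta).
    """
    mu = np.zeros(dim)
    sigma = np.full(dim, init_std)
    n_elite = max(1, int(elite_frac * n_samples))

    for _ in range(n_iters):
        # 1. sample n parameter vectors theta_i ~ N(mu, diag(sigma^2))
        thetas = mu + sigma * np.random.randn(n_samples, dim)
        # 2. evaluate each sample to get a scalar return
        returns = np.array([F(theta) for theta in thetas])
        # 3. keep the top rho fraction as elites
        elite_idx = np.argsort(returns)[-n_elite:]
        elites = thetas[elite_idx]
        # 4. refit the Gaussian to the elites
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-6  # small floor to avoid collapse
    return mu
```

In the Tetris example below, `F` would run games with the 22-dimensional weight vector of the linear value function and return the resulting score.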

This worked really well up to the 2010s in low-dimensional search spaces; it was shown to work well in Tetris (Szita, 2006), where we can craft a state-value function that is a linear combination of 22 basis functions $\phi_i(s)$ (individual column heights, height differences, etc.)

$$V_w(s) = \sum_{i=1}^{22} w_i \phi_i(s)$$

These evolutionary search methods weren’t a threat to DQN implementations at the time because they couldn’t scale to large non-linear neural nets with thousands of parameters.

CMA-ES: Covariance Matrix Adaptation

Instead of limiting ourselves to a diagonal Gaussian, we learn a full covariance matrix: rather than updating just the mean and per-dimension variances, we update the entire covariance matrix.

Visually, if the objective function is best maximized by samples lying in a rotated 2D ellipse, then using all the entries of a full covariance matrix lets us rotate and stretch a standard diagonal Gaussian (an axis-aligned ellipse) to cover that region efficiently.

NES - Natural Evolutionary Strategies

NES considers every offspring when updating our policy parameters: it optimizes the expected fitness objective by updating the parameters of the search distribution with a natural gradient, in this case updating the learned mean $\mu$ while keeping the covariance fixed.

Consider policy parameters $\theta \in \mathbb{R}^d$ sampled from a Gaussian distribution with learned mean $\mu \in \mathbb{R}^d$ and fixed diagonal covariance matrix $\sigma^2 I$ (which is not being learned). We denote this distribution as $P_\mu(\theta)$:

$$\theta \sim P_\mu(\theta) = \mathcal{N}(\mu, \sigma^2 I)$$

Our goal is to find the best possible search distribution, parameterized by $\mu$, from which our policy parameters $\theta$ are sampled:

$$\max_\mu \; \mathbb{E}_{\theta \sim P_\mu(\theta)}\big[F(\theta)\big]$$

based on a fitness score which is the expectation of reward over entire trajectories

$$F(\theta) = \mathbb{E}_{\tau \sim \pi_\theta,\; s_0 \sim \mu_0(s)}\big[R(\tau)\big]$$

Deriving the Natural Gradient

Computing the update for our mean μ through this objective is as follows:

$$
\begin{aligned}
\nabla_\mu \mathbb{E}_{\theta \sim P_\mu(\theta)}[F(\theta)]
&= \nabla_\mu \int P_\mu(\theta)\, F(\theta)\, d\theta && \text{(write the expectation as an integral)}\\
&= \int \nabla_\mu P_\mu(\theta)\, F(\theta)\, d\theta \\
&= \int P_\mu(\theta)\, \frac{\nabla_\mu P_\mu(\theta)}{P_\mu(\theta)}\, F(\theta)\, d\theta \\
&= \int P_\mu(\theta)\, \nabla_\mu \log P_\mu(\theta)\, F(\theta)\, d\theta && \text{(derivative of log trick!)}\\
&= \mathbb{E}_{\theta \sim P_\mu(\theta)}\big[\nabla_\mu \log P_\mu(\theta)\, F(\theta)\big] \\
&\approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\mu \log P_\mu(\theta_i)\, F(\theta_i) && \text{(Monte Carlo sampling)}\\
&= \frac{1}{N}\sum_{i=1}^{N} \frac{\theta_i - \mu}{\sigma^2}\, F(\theta_i) && \Big(\log P_\mu(\theta) = -\tfrac{\|\theta - \mu\|^2}{2\sigma^2} + C \text{ for Gaussians}\Big)\\
&= \frac{1}{N}\sum_{i=1}^{N} \frac{\epsilon_i}{\sigma}\, F(\theta_i) && \text{(reparameterization trick for } \theta\text{)}
\end{aligned}
$$

From this derivation, we have shown that this gradient can be estimated by

  1. sampling $N$ parameters $\theta_i$, running trajectories for each parameter, and obtaining a scalar fitness score $F(\theta_i)$ for each sample
  2. scaling each $F(\theta_i)$ by its sampled noise term $\epsilon_i$ divided by the standard deviation $\sigma$, and averaging all the scaled terms!

To expand on that last step: in order to backpropagate through the Gaussian sampling to $\mu$, the sampling $\theta_i \sim \mathcal{N}(\mu, \sigma^2 I)$ is rewritten via the reparameterization trick as $\theta_i = \mu + \sigma \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, I)$.

Based on this derivation we can now apply gradient ascent to iteratively update our μ! (based on learning rate α)

$$\mu_{t+1} = \mu_t + \alpha \left[\frac{1}{n\sigma}\sum_{i=1}^{n} \epsilon_i\, F(\theta_i)\right]$$
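
A minimal sketch of this NES-style update, assuming a black-box return function `F` and a fixed $\sigma$ (names are illustrative):

```python
import numpy as np

def nes_update(F, mu, sigma=0.1, n_samples=50, lr=1e-2):
    """One NES-style gradient ascent step on the search-distribution mean.

    F: callable mapping a parameter vector theta -> scalar fitness F(theta).
    mu: current mean of the Gaussian search distribution (1D array).
    """
    d = mu.shape[0]
    # reparameterized samples: theta_i = mu + sigma * eps_i, eps_i ~ N(0, I)
    eps = np.random.randn(n_samples, d)
    thetas = mu + sigma * eps
    returns = np.array([F(theta) for theta in thetas])
    # gradient estimate: (1 / (n * sigma)) * sum_i eps_i * F(theta_i)
    grad = (eps * returns[:, None]).sum(axis=0) / (n_samples * sigma)
    return mu + lr * grad
```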

Black-Box Optimization

To clarify, this is still black-box optimization:

  1. We do not need to know anything about how the fitness score is computed; we simply use the raw returns produced by our samples $\theta_i$.
  2. We are not computing analytic gradients of $F(\theta)$ w.r.t. $\theta$ in order to update the policy parameters directly; instead, we compute gradients w.r.t. a different, known object that we set ourselves: the search distribution $P_\mu(\theta)$.

Scalability + Parallelization of ES

The reason this scales well to high-dimensional $\theta$ (large neural networks) is how the natural gradient computation parallelizes across multiple worker processes: each worker only needs to compute its own term $\epsilon_i F(\theta_i)$. Naively, this means shipping the large vector $\theta_i = \mu_t + \sigma \epsilon_i$ back and forth between the coordinator and all $n$ workers:

  1. The coordinator broadcasts $\mu_t$ once per update step to all workers.
  2. The coordinator sends each $\epsilon_i$ to its worker individually, which lets every worker form its own $\theta_i$.
  3. Each worker runs trajectories and sends $F(\theta_i)$ back to the coordinator.

Because of reparameterization, only $\epsilon_i$ needs to be sent back and forth, but that is still a very large vector. So instead we use a pseudo-random number generator: the coordinator picks $n$ (tiny) seeds and sends these small seeds to the $n$ workers, which can then reconstruct their high-dimensional $\epsilon_i$ locally and return $F(\theta_i)$, which is just a scalar! Communication time is cut a lot. A sketch of this seed trick follows below.
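
A minimal sketch of the seed trick (single-process for clarity; the coordinator/worker split here is illustrative, not the API of any particular ES library):

```python
import numpy as np

def worker_evaluate(F, mu, sigma, seed, dim):
    """What each worker does: rebuild eps_i from a tiny seed, then evaluate F."""
    eps = np.random.default_rng(seed).standard_normal(dim)   # reconstruct eps_i locally
    return F(mu + sigma * eps)                                # scalar F(theta_i)

def es_step_with_seeds(F, mu, sigma=0.1, n_workers=8, lr=1e-2):
    dim = mu.shape[0]
    seeds = np.random.randint(0, 2**31 - 1, size=n_workers)  # coordinator picks tiny seeds
    # in a real system each call below runs on a separate worker process
    returns = np.array([worker_evaluate(F, mu, sigma, s, dim) for s in seeds])
    # coordinator re-derives every eps_i from the same seeds to form the gradient
    eps = np.stack([np.random.default_rng(s).standard_normal(dim) for s in seeds])
    grad = (eps * returns[:, None]).sum(axis=0) / (n_workers * sigma)
    return mu + lr * grad
```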

Local Maxima Issue

To prevent ES methods from getting stuck in local optima, we can average the search over multiple related tasks and environments, which improves robustness and makes the objective more generalizable.

Policy Gradient

We now leave black-box optimization methods. Instead of updating a search distribution that our policy parameters are sampled from, we update the policy parameters directly, which requires computing analytic gradients of the objective with respect to $\theta$.

Policy Objective

One reasonable policy objective is to maximize the expected trajectory reward over the distribution of trajectories induced by our policy parameters $\theta$ (assuming a discrete trajectory space).

$$\max_\theta\; U(\theta) = \mathbb{E}_{\tau \sim P_\theta(\tau)}\big[R(\tau)\big] = \sum_\tau P_\theta(\tau)\, R(\tau)$$

Remember that $P_\theta(\tau)$ is the probability of seeing that entire trajectory when we run $\pi_\theta$ in our environment, and it abstracts three key ingredients:

  1. the initial state being sampled from an initial state distribution
  2. the stochasticity of the policy that actions are sampled from - this is the only piece our $\theta$ actually parameterizes
  3. the dynamics of the environment, resulting in stochastic next states $s_{t+1}$
$$P_\theta(\tau) = \underbrace{\rho_0(s_0)}_{\text{initial state}} \prod_{t=0}^{T} \underbrace{P(s_{t+1} \mid s_t, a_t)}_{\text{dynamics}}\; \underbrace{\pi_\theta(a_t \mid s_t)}_{\text{action sampling}}$$

It’s assumed that these probability densities are continuous and differentiable - in particular that $\pi_\theta$ is differentiable with respect to $\theta$ - which is necessary to propagate our gradient, as we will see in the derivation.

We now need to figure out how to compute this gradient in order to find optimal θ:

Finite-Difference Methods

One way to approximate the policy gradient of $\pi_\theta$ is to nudge $\theta$ by a small amount in every dimension and approximate the partial derivatives:

$$\frac{\partial U(\theta)}{\partial \theta_k} \approx \frac{U(\theta + \epsilon u_k) - U(\theta - \epsilon u_k)}{2\epsilon}$$

This was used to train AIBO robots to walk, but it is really not feasible in high dimensions (it costs two rollouts per parameter dimension per gradient estimate).
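
A minimal sketch of the central-difference estimator (the `U` callable is a stand-in for "average return of rollouts with these parameters"):

```python
import numpy as np

def finite_difference_gradient(U, theta, eps=1e-2):
    """Central-difference estimate of grad_theta U(theta).

    Requires 2 * len(theta) evaluations of U, which is why this
    does not scale to high-dimensional policies.
    """
    grad = np.zeros_like(theta)
    for k in range(len(theta)):
        u_k = np.zeros_like(theta)
        u_k[k] = 1.0                       # unit vector along dimension k
        grad[k] = (U(theta + eps * u_k) - U(theta - eps * u_k)) / (2 * eps)
    return grad
```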

Derivatives of the Policy Objective

Policy gradients aim to exploit the factorization $P_\theta(\tau) = \prod_{t=0}^{T} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$ to compute an approximate gradient estimate for

$$\nabla_\theta U(\theta) = \nabla_\theta\, \mathbb{E}_{\tau \sim P_\theta(\tau)}\big[R(\tau)\big]$$

In comparison to evolutionary methods, the challenge here is to compute derivatives w.r.t. the variables that parameterize the distribution the expectation is taken over. The derivation uses the same log-probability trick as for evolutionary methods; we also assume a discrete trajectory space to sum over - if the space is continuous, the derivation is largely the same but the justification for moving gradients inside the integral is different and above my math knowledge.

$$
\begin{aligned}
\nabla_\theta\, \mathbb{E}_{\tau \sim P_\theta(\tau)}[R(\tau)]
&= \nabla_\theta \sum_\tau P_\theta(\tau)\, R(\tau) && \text{(expand expectation using the p.m.f.)}\\
&= \sum_\tau \nabla_\theta P_\theta(\tau)\, R(\tau) && \text{(sum rule)}\\
&= \sum_\tau P_\theta(\tau)\, \frac{\nabla_\theta P_\theta(\tau)}{P_\theta(\tau)}\, R(\tau) && \text{(prep for log-prob trick)}\\
&= \sum_\tau P_\theta(\tau)\, \big[\nabla_\theta \log P_\theta(\tau)\big]\, R(\tau) && \text{(chain rule)}\\
&= \mathbb{E}_{\tau}\big[\nabla_\theta \log P_\theta(\tau)\, R(\tau)\big]
\end{aligned}
$$

I think this part is intuitively explainable: the policy objective gradient increases the log probability of trajectories that give positive reward and decreases the log probability of trajectories that give negative reward.

The key observation is that this expectation can be simplified much further: the trajectory probability does encapsulate the dynamics of the environment, but the dynamics are not parametrized by our policy parameters, so the derivative of the trajectory log-probability reduces to derivatives of the action log-probabilities under our policy.

$$
\begin{aligned}
\nabla_\theta \log P_\theta(\tau)
&= \nabla_\theta \log \Big[\rho_0(s_0) \prod_{t=0}^{T} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)\Big] && \text{(factorizing our trajectory)}\\
&= \nabla_\theta \Big[\log \rho_0(s_0) + \sum_{t=0}^{T} \log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)\Big] && \text{(using logs to turn the product into a sum!)}\\
&= \nabla_\theta \Big[\sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t)\Big] && \text{(dynamics in the env. are independent of the policy parameters!)}\\
&= \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) && \text{(sum rule)}
\end{aligned}
$$

Then completing our derivation

$$
\begin{aligned}
\nabla_\theta\, \mathbb{E}_{\tau \sim P_\theta(\tau)}[R(\tau)]
&= \mathbb{E}_{\tau \sim P_\theta(\tau)}\big[\nabla_\theta \log P_\theta(\tau)\, R(\tau)\big] \\
&= \mathbb{E}_{\tau \sim P_\theta(\tau)}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big] \\
&\approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, R(\tau_i) && \text{(Monte Carlo estimation!)}
\end{aligned}
$$

To approximate the gradient, we use an empirical estimate from $N$ sampled trajectories!

$$\nabla_\theta U(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, R(\tau_i)$$

Computing Policy Gradient

And the natural question is whether the $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ term is computable - it is.

  1. If our action space is continuous, the policy network can be Gaussian, outputting a mean and standard deviation.
  2. If our action space is discrete, we apply a final softmax layer to output a probability distribution over the finite action space, then sample from the resulting categorical distribution for stochasticity. A sketch of both policy heads follows below.
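
A minimal sketch of the two policy heads in PyTorch (layer sizes and module names are illustrative); `log_prob` gives exactly the $\log \pi_\theta(a \mid s)$ term the gradient estimator needs:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Continuous actions: the network outputs a mean; a learned log-std is kept separately."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mean_net(obs), self.log_std.exp())

class CategoricalPolicy(nn.Module):
    """Discrete actions: the network outputs logits, actions come from a categorical."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.logits_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                        nn.Linear(hidden, n_actions))

    def dist(self, obs):
        return torch.distributions.Categorical(logits=self.logits_net(obs))

# usage: sample an action and get log pi_theta(a | s) for the policy gradient
# policy = CategoricalPolicy(obs_dim=4, n_actions=2)
# d = policy.dist(torch.randn(4)); a = d.sample(); logp = d.log_prob(a)
```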

Temporal Structures

Can we do better than weighting every gradient update by the full-trajectory return $R(\tau)$? One problem with the scalar $R(\tau)$: why should the action an agent takes at time step $t$ be scaled by the rewards earned at time steps $[0, t-1]$, before that action was even taken?

Instead, we should emphasize causality: Only future rewards should be attributed to the action taken at timestep t

$$G_t = \sum_{k=t}^{T} \gamma^{k}\, R(s_k, a_k)$$

REINFORCE - Monte Carlo Policy Gradient

The above discussion concludes REINFORCE - the simplest policy gradient also referred to as “vanilla” policy gradient.

  1. Initialize policy parameters $\theta$
  2. Sample trajectories $\{\tau_i = \{s_t^i, a_t^i\}_{t=0}^{T}\}$ by deploying the current policy $\pi_\theta(a_t \mid s_t)$
  3. Compute the gradient vector $\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\, G_t^{(i)}$
  4. Perform gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta U(\theta)$ (a PyTorch sketch follows below)
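
A minimal REINFORCE sketch in PyTorch, assuming a Gymnasium-style `env` and the `CategoricalPolicy` head sketched above (both are assumptions, not part of these notes):

```python
import torch

def reinforce_update(env, policy, optimizer, n_trajs=10, gamma=0.99):
    """One REINFORCE update: sample trajectories, weight log-probs by reward-to-go."""
    loss_terms = []
    for _ in range(n_trajs):
        obs, _ = env.reset()
        log_probs, rewards, done = [], [], False
        while not done:
            d = policy.dist(torch.as_tensor(obs, dtype=torch.float32))
            action = d.sample()
            log_probs.append(d.log_prob(action))          # log pi_theta(a_t | s_t)
            obs, r, terminated, truncated, _ = env.step(action.item())
            rewards.append(r)
            done = terminated or truncated
        # reward-to-go computed backwards (discounted from time t onward)
        G, returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.as_tensor(returns)
        loss_terms.append(-(torch.stack(log_probs) * returns).sum())
    # minimizing this averaged loss is gradient ascent on U(theta)
    loss = torch.stack(loss_terms).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```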

Baselines with Advantages

Our gradient estimator is unbiased, but it can still have high variance

$$\hat{g} = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$$

One issue with weighting our gradient updates by $G_t$: if all sampled returns are positive (as they often are), every sampled action has its probability pushed up, just by different amounts, so which actions get reinforced depends heavily on which trajectories happened to be sampled.

To counteract this we should only consider the return above a constant baseline $b$ - the advantage - at that state!

$$\hat{g} = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big[G_t - b\big]
= \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \;-\; \underbrace{\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b}_{\text{how does this affect our gradient estimate?}}$$

Actually, this new $\hat{g}$ still equals the original $\hat{g}$ in expectation, because the baseline term has zero expectation:

$$\mathbb{E}_{\tau \sim P_\theta(\tau)}\big[\nabla_\theta \log P_\theta(\tau)\, b\big]
= b \sum_\tau P_\theta(\tau)\, \nabla_\theta \log P_\theta(\tau)
= b \sum_\tau P_\theta(\tau)\, \frac{\nabla_\theta P_\theta(\tau)}{P_\theta(\tau)}
= b\, \nabla_\theta \sum_\tau P_\theta(\tau)
= b\, \nabla_\theta 1 = 0$$

And subtracting a baseline to measure relative reward effectively shrinks the scale of the per-sample gradient terms, which reduces the overall variance without biasing the estimate!

Baseline Choices

  1. Constant baselines, using the average return of the policy: $b = \mathbb{E}[R(\tau)]$
  2. Time-dependent baselines: $b_t = \frac{1}{N}\sum_{i=1}^{N} G_t^{(i)}$, where we average the return-to-go at time $t$ over all sampled trajectories (a small sketch follows below)
  3. State-dependent baselines, i.e. the value function $b(s_t) = V^\pi(s_t)$
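
A small sketch of the time-dependent baseline: average the reward-to-go at each time step across sampled trajectories and subtract it (this assumes all trajectories have the same length, for simplicity):

```python
import numpy as np

def time_dependent_baseline_advantages(returns):
    """returns: array of shape (N, T) with G_t for each trajectory i and step t.

    b_t is the mean return-to-go at step t across the N trajectories;
    the advantage estimate is G_t^(i) - b_t.
    """
    baseline = returns.mean(axis=0)        # b_t, shape (T,)
    return returns - baseline              # broadcasts over trajectories
```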

Actor-Critic Methods

Actor-critic methods build on the state-dependent baseline used in REINFORCE-with-baseline, where the action advantage is $A^\pi(s_t^i, a_t^i) = G_t^{(i)} - V_\phi^\pi(s_t^i)$. But the $G_t^{(i)}$ term can have high variance: it is a single-rollout Monte Carlo return from $s_t, a_t$ and fluctuates with the stochasticity of the environment; but doesn’t this sound familiar?

Our returns $G_t = \sum_{k=t}^{T} R(s_k, a_k)$ are exactly what Q-functions estimate: $Q^\pi(s,a) = \mathbb{E}[G_t \mid s_t, a_t]$ by definition. Furthermore, if our critic already estimates the value function $V_\phi^\pi(s)$, we can expand the Bellman equation to express Q-functions in terms of value functions: $Q^\pi(s,a) = \mathbb{E}[G_t \mid s_t, a_t] = \mathbb{E}[R_t + \gamma G_{t+1} \mid s_t, a_t] = \mathbb{E}[R_t + \gamma V^\pi(s_{t+1}) \mid s_t, a_t]$.

Then our action advantages can be simplified as

$$A^\pi(s_t^i, a_t^i) = G_t^{(i)} - V_\phi^\pi(s_t^i) \approx R(s_t^i, a_t^i) + \gamma V_\phi^\pi(s_{t+1}^i) - V_\phi^\pi(s_t^i)$$
  1. Initialize actor policy parameters $\theta$ and critic parameters $\phi$
  2. Sample trajectories $\{\tau_i = \{s_t^i, a_t^i\}_{t=0}^{T}\}$ by deploying our current policy $\pi_\theta(a_t \mid s_t)$
  3. Fit the critic $V_\phi^\pi(s)$ through MC or TD estimation
  4. Compute action advantages $A^\pi(s_t^i, a_t^i) = G_t^{(i)} - V_\phi^\pi(s_t^i)$
  5. $\nabla_\theta U(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\pi(s_t, a_t)$
  6. $\theta \leftarrow \theta + \alpha \nabla_\theta U(\theta)$ (a PyTorch sketch appears after the summary below)

In some sense the actor-critic is just “policy iteration” written in gradient form.

  1. We run the policy and collect a series of N trajectories.
  2. Based on the performance, we compute advantages for each time step during each trajectory
  3. Then we update our policy parameters directly using a policy gradient that is computed through these advantages.
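
A minimal sketch of one advantage actor-critic update in PyTorch, assuming `policy` exposes a `.dist(obs)` head as sketched earlier and `value_net` maps states to scalar values (both names are illustrative):

```python
import torch
import torch.nn.functional as F

def actor_critic_update(batch, policy, value_net, pi_opt, v_opt, gamma=0.99):
    """batch: dict of tensors with keys obs, actions, rewards, next_obs, dones."""
    obs, actions = batch["obs"], batch["actions"]
    rewards, next_obs, dones = batch["rewards"], batch["next_obs"], batch["dones"]

    # critic targets: one-step TD, r + gamma * V(s') on non-terminal transitions
    with torch.no_grad():
        targets = rewards + gamma * (1 - dones) * value_net(next_obs).squeeze(-1)
    values = value_net(obs).squeeze(-1)
    v_loss = F.mse_loss(values, targets)
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()

    # actor: policy gradient weighted by the TD advantage A = r + gamma V(s') - V(s)
    advantages = (targets - values).detach()
    log_probs = policy.dist(obs).log_prob(actions)
    pi_loss = -(log_probs * advantages).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```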

A2C - Advantage Actor-Critic

The trajectories we collect arrive sequentially, while stable training of our policy networks requires the gradient updates to be decorrelated - a problem we fixed in Q-learning using replay buffers. A solution here is to parallelize experience collection across multiple agents (the A2C/A3C family) in order to stabilize training.

Workers run in parallel, computing gradients from their own rollouts, and we batch the results: once every worker has finished its rollout, the gradients are combined into a single synchronized update step, ensuring consistent, stable gradient updates.

A3C - Asynchronous Advantage Actor-Critic

The natural performance optimization is to let the workers act asynchronously: each worker computes and applies its gradient update without waiting for all the other workers each iteration.

PPO - Proximal Policy Optimization

PPO is derived from policy improvement logic and is more an approximate policy iteration method than a policy gradient method.

High UTD

$$\text{Updates to Data (UTD)} = \frac{\text{number of gradient updates}}{\text{number of env. steps (samples)}}$$

A high UTD makes efficient use of collected data - and in complex environments the bottleneck in RL is exactly data collection - so we want methods that remain stable at high UTD.

Here’s the issue:

$$\theta \leftarrow \theta + \alpha \nabla_\theta U(\theta)$$

If we apply one gradient update step, we land on a new policy $\pi'$ parameterized by $\theta'$. We cannot compute the policy gradient estimate $\nabla_{\theta'} U(\theta')$ by reusing the past rollouts (to compute the advantages), because those rollouts were collected under the old policy.

This means that forcing a high UTD gives noisy estimates based on limited, stale experience and can result in policy drift. This motivates the PPO and TRPO methods we discuss next, which aim to fix this issue by constraining how far each update can move the policy.

Policy Improvement

Performance of Policy

If we quantify the performance of a policy as its expected return over trajectories

$$J(\pi) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\Big] = \mathbb{E}_{s_0 \sim \rho}\big[V^\pi(s_0)\big]$$

and define a discounted state visitation distribution that weights time spent in a specific state s over all trajectories given a policy π

$$d^\pi(s) = \sum_{t=0}^{\infty} \gamma^t\, \Pr{}^\pi(s_t = s \mid s_0 \sim \rho)$$

where

$$\sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}_\pi[f(S_t)] = \sum_{t=0}^{\infty} \gamma^t \sum_s f(s)\, \Pr{}^\pi(S_t = s) = \sum_s f(s) \underbrace{\sum_{t=0}^{\infty} \gamma^t\, \Pr{}^\pi(S_t = s)}_{d^\pi(s)}$$

Performance Difference Lemma

Then we can show that the policy improvement from $\pi$ to $\pi'$ can be written as an expected advantage, with states drawn from the new policy's visitation distribution and actions sampled from the new policy.

$$J(\pi') - J(\pi) = \mathbb{E}_{s \sim d^{\pi'},\, a \sim \pi'(\cdot \mid s)}\big[A^\pi(s,a)\big]$$

Policy Improvement Formulation

We aim to find the new policy $\pi'$ that maximizes this improvement, which corresponds to

$$\max_{\pi'} J(\pi') = \max_{\pi'} \big(J(\pi') - J(\pi)\big) = \max_{\pi'} \mathbb{E}_{s \sim d^{\pi'}(s),\, a \sim \pi'(\cdot \mid s)}\big[A^\pi(s,a)\big]$$

Importance Sampling

But the whole point is that we do not want to sample the $s, a$ pairs from $\pi'$, since $\pi'$ is exactly what we are still optimizing. Importance sampling addresses this: if we want the expectation of a function $f(z)$ under a distribution $p(z)$ that we can't (or don't want to) sample from, we can sample from an easier proposal/behavior distribution $q(z)$ instead, and rescale the function values by a term called the importance weight:

$$\mathbb{E}_{z \sim p(z)}[f(z)] = \int f(z)\, p(z)\, dz = \int q(z)\, f(z)\, \underbrace{\frac{p(z)}{q(z)}}_{\text{weight}}\, dz = \mathbb{E}_{z \sim q(z)}\Big[f(z)\, \frac{p(z)}{q(z)}\Big]$$

which works as long as $q(z)$ is nonzero wherever $p(z)$ is nonzero.
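
A tiny numeric sanity check of this identity (the distributions here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda z: z ** 2

# target p = N(1, 1), proposal q = N(0, 2); both densities known in closed form
p_pdf = lambda z: np.exp(-0.5 * (z - 1) ** 2) / np.sqrt(2 * np.pi)
q_pdf = lambda z: np.exp(-0.5 * (z / 2) ** 2) / (2 * np.sqrt(2 * np.pi))

z_q = rng.normal(0.0, 2.0, size=200_000)          # sample from the proposal q
is_estimate = np.mean(f(z_q) * p_pdf(z_q) / q_pdf(z_q))

z_p = rng.normal(1.0, 1.0, size=200_000)          # direct Monte Carlo under p
direct = np.mean(f(z_p))

print(is_estimate, direct)   # both approach E_p[z^2] = 2
```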

Derivation

Then to apply this trick to our formulation, we aim to express the expectation entirely in terms of the old policy $\pi$. A first attempt at re-expressing the state visitation distribution in terms of $\pi$ would be

$$\max_{\pi'} \mathbb{E}_{s \sim d^{\pi'}(s),\, a \sim \pi'(\cdot \mid s)}\big[A^\pi(s,a)\big] = \max_{\pi'} \mathbb{E}_{s \sim d^{\pi}(s),\, a \sim \pi'(\cdot \mid s)}\Big[\frac{d^{\pi'}(s)}{d^{\pi}(s)}\, A^\pi(s,a)\Big]$$

but calculating the state-visitation ratio $\frac{d^{\pi'}(s)}{d^{\pi}(s)}$ is hard in itself - the discounted visitation for $\pi'$ is unknown - and even if we tried to estimate it with a large amount of sampling, it is a high-variance term that can explode in certain states, leading to instability in the computation.

PPO fixes this by simply keeping $\pi'$ close to $\pi$, so that the state-visitation distributions are approximately equal, $d^{\pi'} \approx d^{\pi}$, and the ratio can be treated as 1. We then apply the same importance sampling trick to the action distribution; note that the ratio $\frac{\pi'(a \mid s)}{\pi(a \mid s)}$ is easy to deal with - it is computable from our policy networks in a single forward pass!

$$
\begin{aligned}
\max_{\pi'} \mathbb{E}_{s \sim d^{\pi'}(s),\, a \sim \pi'(\cdot \mid s)}\big[A^\pi(s,a)\big]
&\approx \max_{\pi'} \mathbb{E}_{s \sim d^{\pi}(s),\, a \sim \pi'(\cdot \mid s)}\Big[\underbrace{\frac{d^{\pi'}(s)}{d^{\pi}(s)}}_{\approx\, 1}\, A^\pi(s,a)\Big] && \text{(PPO assumption)}\\
&= \max_{\pi'} \mathbb{E}_{s \sim d^{\pi}(s),\, a \sim \pi(\cdot \mid s)}\Big[\frac{\pi'(a \mid s)}{\pi(a \mid s)}\, A^\pi(s,a)\Big]
\end{aligned}
$$

Now, to optimize this objective, we take gradient ascent steps with respect to the new policy's parameters and rewrite it in the same policy gradient form:

$$
\begin{aligned}
\nabla_{\theta'}\, \mathbb{E}_{s \sim d^{\pi}(s)}\, \mathbb{E}_{a \sim \pi(\cdot \mid s)}\Big[\frac{\pi'(a \mid s)}{\pi(a \mid s)}\, A^\pi(s,a)\Big]
&= \mathbb{E}_{s \sim d^{\pi}(s)}\, \mathbb{E}_{a \sim \pi(\cdot \mid s)}\Big[\frac{\nabla_{\theta'} \pi'(a \mid s)}{\pi(a \mid s)}\, A^\pi(s,a)\Big] \\
&= \mathbb{E}_{s \sim d^{\pi}(s)}\, \mathbb{E}_{a \sim \pi(\cdot \mid s)}\Big[\frac{\pi'(a \mid s)}{\pi(a \mid s)}\, \nabla_{\theta'} \log \pi'(a \mid s)\, A^\pi(s,a)\Big]
\end{aligned}
$$

But we still have the issue that a high UTD leads to policy drift. PPO therefore enforces a closeness penalty on how far the new policy $\pi'$ can drift from $\pi$ on each gradient update - a regularization term:

$$\mathbb{E}_{s \sim d^{\pi}(s)}\, \mathbb{E}_{a \sim \pi(\cdot \mid s)}\Big[\frac{\pi'(a \mid s)}{\pi(a \mid s)}\, \nabla_{\theta'} \log \pi'(a \mid s)\, A^\pi(s,a)\Big] \;-\; \lambda\, \nabla_{\theta'}\, \mathbb{E}_{s}\big[D\big(\pi(\cdot \mid s),\, \pi'(\cdot \mid s)\big)\big]$$

Clipped Ratio Objectives

What PPO actually does is use a soft approximation: ratio clipping that keeps $\frac{\pi'(a \mid s)}{\pi(a \mid s)}$ close to 1, instead of an explicit distance metric (KL divergence) as in TRPO. Remember, the whole purpose is to keep the old batch representative of the new policy so we can get a high UTD.

This can be expressed as

$$\max_{\pi'}\, \mathbb{E}_{s \sim d^{\pi}(s)}\, \mathbb{E}_{a \sim \pi(\cdot \mid s)}\Big[\min\Big(\frac{\pi'(a \mid s)}{\pi(a \mid s)}\, A^\pi(s,a),\; \operatorname{clip}\Big(\frac{\pi'(a \mid s)}{\pi(a \mid s)},\, 1-\epsilon,\, 1+\epsilon\Big)\, A^\pi(s,a)\Big)\Big]$$

Note that a naive clip objective clips our ratio on both sides of the curve regardless of the sign of the advantage. This has two issues:

  1. if our advantage $A > 0$ and our ratio $r < 1-\epsilon$: because the clipped value is flat on this side of the naive clip objective, the gradient zeros out, and we don't get to use this $(s, a)$ experience for the gradient update, even though the advantage is trying to push $\pi'(a \mid s)$ up.
  2. if our advantage $A < 0$ and our ratio $r > 1+\epsilon$: our gradient zeros out once again, and we miss using this $(s, a)$ experience for the parameter update, even though the advantage is telling us to push $\pi'(a \mid s)$ down.

The full clipped objective therefore adds the min term so that we only flatten the ratio on the side we care about (a code sketch follows the figure below)!

[Figure: Clipped Objective]
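
A minimal sketch of the clipped surrogate loss in PyTorch (sign flipped so it can be minimized; tensor names are illustrative):

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """PPO clipped surrogate: -E[min(r * A, clip(r, 1-eps, 1+eps) * A)].

    new_log_probs: log pi'(a|s) under the policy being optimized (requires grad).
    old_log_probs: log pi(a|s) under the policy that collected the batch (fixed).
    """
    ratio = torch.exp(new_log_probs - old_log_probs)          # pi'(a|s) / pi(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()              # negate for gradient descent
```

This loss is what lets PPO take many gradient steps (high UTD) on the same batch of rollouts before collecting new data.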

Asymmetric clipping

In reality we need to emphasize good actions when exploring (for example when training language models), meaning our clip term should look something like

$$\operatorname{clip}\Big(\frac{\pi'(a \mid s)}{\pi(a \mid s)},\, 1-\epsilon_{-},\, 1+\epsilon_{+}\Big)$$

This means we want $\epsilon_{+} > \epsilon_{-}$, so that we don't clip the large ratios associated with high positive advantages that we would have clipped with a symmetric, fixed $\epsilon$ (i.e. $\epsilon_{+} = \epsilon_{-}$)!

TRPO - Trust-Region Policy Optimization

If we keep an explicit KL constraint in the objective instead of PPO's soft ratio-clipping, the formulation changes; this is TRPO.

Our surrogate objective is now

$$
\begin{aligned}
\max_\theta\;& \mathcal{A}_{\pi_{\text{old}}}(\pi_\theta) = \sum_{t=1}^{T} \mathbb{E}_{s_t \sim p_{\theta_{\text{old}}}(s_t)}\, \mathbb{E}_{a_t \sim \pi_{\theta_{\text{old}}}(\cdot \mid s_t)}\Big[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, A^{\pi_{\text{old}}}(s_t, a_t)\Big] \\
\text{s.t.}\;& \mathbb{E}_t\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)\big] \le \delta
\end{aligned}
$$
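
TRPO's actual update uses a natural-gradient / conjugate-gradient step; as a much simpler illustration of the constraint itself, here is a sketch that evaluates the surrogate and a sample-based KL estimate and rejects a proposed step if the KL exceeds delta (all names are illustrative, not the real TRPO algorithm):

```python
import torch

def surrogate_and_kl(new_log_probs, old_log_probs, advantages):
    """Surrogate objective E[r * A] and mean KL(pi_old || pi_new), estimated from samples."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    surrogate = (ratio * advantages).mean()
    # sample-based KL estimate under pi_old: E_old[log pi_old - log pi_new]
    kl = (old_log_probs - new_log_probs).mean()
    return surrogate, kl

def accept_step(new_log_probs, old_log_probs, advantages, old_surrogate, delta=0.01):
    """Trust-region check: accept the step only if the surrogate improved and KL <= delta."""
    surrogate, kl = surrogate_and_kl(new_log_probs, old_log_probs, advantages)
    return bool(surrogate > old_surrogate and kl <= delta)
```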