Introduction
So far, we have studied supervised learning and unsupervised learning. We now turn to a third paradigm.
Reinforcement Learning (RL) is a machine learning framework in which an agent interacts
with an environment to learn a behavior that maximizes a scalar reward signal over time.
Formally, the environment is typically modeled as a Markov Decision Process (MDP), defined by a tuple
\[
(\mathcal{S}, \mathcal{A}, \mathcal{T}, r, \gamma)
\]
where \(\mathcal{S}\) is the state space, \(\mathcal{A}\) is the action space, \(\mathcal{T}\) is the transition kernel
(a conditional distribution \(\mathcal{T}(s' \mid s, a)\)), \(r\) is the reward function, and \(\gamma\) is the discount factor.
We give a more detailed probabilistic formulation, separating the stochastic reward kernel from the expected reward function
and including the initial state distribution explicitly, in the MDP definition below.
At each time step, the agent observes a state \(s_t \in \mathcal{S}\) and selects an action \(a_t \in \mathcal{A}\);
the environment then transitions to a new state \(s_{t+1}\) according to the transition dynamics \(\mathcal{T}(s_{t+1} \mid s_t, a_t)\)
and emits a reward \(r_t\).
The goal of the agent is to learn a policy \(\pi(a \mid s)\) that maximizes the expected cumulative reward:
\[
\mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t r_t \right].
\]
Reinforcement learning differs fundamentally from other machine learning paradigms:
- Supervised learning trains models on ground-truth input-output pairs with explicit correct labels for each input.
- Unsupervised learning discovers hidden structure or representations in unlabeled data without any feedback signal.
- Reinforcement learning uses only a scalar reward signal that provides evaluative feedback based on the agent's interaction with an environment.
In RL, learning is typically driven by trial-and-error interaction rather than labeled examples (the offline / batch RL setting,
where an agent learns from a fixed pre-collected dataset, is a notable exception). Unlike supervised learning, there are no correct labels for each input,
and unlike unsupervised learning, the objective is not to model data structure but to learn a policy that maximizes expected cumulative reward over time.
RL has been successfully applied in domains such as robotics, recommendation systems, and game-playing.
More recently, RL has also become integral to the training of modern large language models (LLMs).
In particular, a method called Reinforcement Learning from Human Feedback (RLHF) is used to fine-tune LLMs
to follow human preferences, aiming to make outputs more helpful, safe, and better aligned with user intent.
The diversity of RL applications has led to a rich classification of algorithms, each designed to address different aspects of
the learning problem. Understanding this classification is crucial for selecting appropriate methods and comprehending the
relationships between different approaches.
Classification of RL Algorithms
Model-Based vs Model-Free Methods
The most fundamental distinction in RL concerns whether the agent learns an explicit model of the environment.
Model-based RL methods learn an explicit model of the environment from data, including:
- the transition dynamics \(p_T(s' \mid s, a)\), and
- the reward model \(p_R(r \mid s, a, s')\).
Once learned, the agent uses planning algorithms to compute an optimal policy \(\pi_*\).
Classical planning is based on dynamic programming such as value iteration (VI) and policy iteration (PI).
The main advantage is environment-sample efficiency — the agent can simulate many trajectories using the learned model without additional
environment interaction. However, model learning introduces model bias: if the learned model is inaccurate, the agent
may perform well in simulation but poorly in the real environment.
Model-free RL methods bypass explicit environment modeling and directly learn value functions or policies
from experience. These methods are generally simpler to implement and more robust to model errors, but typically require
more environment interactions compared to model-based approaches.
Value-Based vs Policy-Based
This taxonomy classifies algorithms based on what they learn in the model-free RL.
Value-based methods learn the optimal action-value function \(Q_*\) (or its approximator \(Q_w\),
typically a neural network trained iteratively) from experience, and then act greedily with respect to it:
\(\pi_*(s) = \arg\max_a Q_*(s, a)\). Given a transition \((s, a, r, s')\), we define the
temporal difference (or TD error) as
\[
r + \gamma \max_{a'} Q_w(s', a') - Q_w(s, a),
\]
whose expectation is the Bellman optimality residual evaluated at \((s, a)\); when \(Q_w = Q_*\), the TD error
is zero on average by Bellman's optimality equation, so a non-zero expected TD error provides a learning signal.
Examples include Q-learning, SARSA, and Deep Q-Networks (DQN);
these methods differ in their bootstrap target — for instance, SARSA replaces the \(\max_{a'}\) above with the
actually-taken next action \(a'\) — and the precise update rules are given in the TD-learning sections below.
Policy-based methods directly optimize the expected-return objective
\[
J(\pi_{\boldsymbol{\theta}}) := \mathbb{E}_{\pi_{\boldsymbol{\theta}}} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]
\]
with respect to the policy parameter \(\boldsymbol{\theta}\) using policy gradient, and can naturally handle
continuous action spaces and stochastic policies.
Examples include REINFORCE, Trust Region Policy Optimization (TRPO), and actor-critic methods
such as Advantage Actor-Critic (A2C), Deep Deterministic Policy Gradient (DDPG), Soft Actor-Critic (SAC),
and Proximal Policy Optimization (PPO).
On-Policy vs Off-Policy Methods
This taxonomy concerns the data usage strategy during learning.
On-policy methods learn the Q-function from data generated by the current policy that is being improved.
The agent explores the environment and makes updates to its policy based on the actions it has taken.
(e.g., SARSA, REINFORCE, A2C, and PPO.)
Off-policy methods learn from data generated by a different policy than the one being
improved. This allows the agent to reuse old data or learn from data collected by other agents. The policy used to
collect data is called the behavioral policy, and the policy being learned is called the target policy.
They offer higher sample efficiency than on-policy methods, but require handling distribution
mismatch (typically via importance-sampling corrections).
(e.g., Q-learning, DQN, DDPG, and SAC.)
Exploration vs. Exploitation
The exploration-exploitation tradeoff is fundamental to reinforcement learning. An agent must
balance between:
- Exploitation: Choosing actions that yield high rewards based on current knowledge
- Exploration: Trying new actions to potentially discover better strategies
All strategies in this section assume access to action-value estimates \(Q(s, a)\), the action-value function
formally defined in the Value Functions section below; here we use it operationally as "the agent's current estimate
of how good action \(a\) is in state \(s\)."
\(\varepsilon\)-Greedy Strategy
The simplest exploration strategy chooses:
\[
a_t = \begin{cases}
\arg\max_a Q(s_t, a) & \text{with probability } 1-\varepsilon \\
\text{uniformly random action over } \mathcal{A} & \text{with probability } \varepsilon
\end{cases}
\]
where \(\varepsilon \in [0,1]\) controls the exploration rate. Often \(\varepsilon\) is decayed over time.
Note that the random branch samples uniformly from all actions (including the greedy one),
so the greedy action is selected with total probability \(1 - \varepsilon + \varepsilon / |\mathcal{A}|\).
Softmax Action Selection
A more sophisticated approach uses the Boltzmann (softmax) distribution:
\[
P(a_t = a \mid s_t) = \frac{\exp(Q(s_t, a)/\tau)}{\sum_{a'} \exp(Q(s_t, a')/\tau)}
\]
where \(\tau > 0\) is the temperature parameter. Higher temperatures lead to more exploration.
Optimistic Initialization
Initializing Q-values optimistically (higher than realistic values) encourages exploration
of all actions early in learning, as the agent will be "disappointed" by actual rewards
and try other actions. This technique is most effective in the tabular setting; with neural-network
function approximators, the notion of a uniform "optimistic" initialization is harder to enforce,
and the trick does not transfer cleanly to deep RL.
The choice of exploration strategy significantly affects learning performance and is often
problem-dependent. More advanced methods like UCB (Upper Confidence Bound) and Thompson
sampling provide principled approaches to this tradeoff.
Markov Decision Process (MDP)
Before diving into the details of RL algorithms, this section presents a detailed probabilistic formulation
of Markov Decision Processes (MDPs), where both transitions and rewards are modeled as random variables.
This perspective extends the classical deterministic view and is particularly useful for analyzing trajectory
distributions and gradient-based learning methods used in modern reinforcement learning.
An agent sequentially interacts with an initially unknown environment to obtain a trajectory
or multiple trajectories. A trajectory of length \(T\) (i.e., consisting of \(T\) transitions) is defined as:
\[
\boldsymbol{\tau} = (s_0, a_0, r_0, s_1, a_1, r_1, s_2, \ldots, a_{T-1}, r_{T-1}, s_T),
\]
where \(s_t\) is a state, \(a_t\) is an action, and \(r_t\) is a reward.
The objective is to optimize the agent's action-selection policy so that the expected discounted cumulative reward
is maximized:
\[
G_0 = \sum_{t = 0}^{T -1} \gamma^t r_t,
\]
where \(\gamma \in [0, 1]\) is the discount factor. We assume the environment follows a
Markov Decision Process (MDP), where the trajectory distribution can be factored into
single-step transition and reward models. The process of estimating an optimal policy from trajectories
is referred to as learning.
Definition: Markov Decision Process (MDP)
An MDP is a tuple
\(\left\langle \mathcal{S}, \mathcal{A}, p_T, p_R, p_0 \right\rangle\)
where:
- \(\mathcal{S}\): set of environment states
- \(\mathcal{A}\): set of available actions
- \(p_T(s' \mid s, a)\): transition model (next-state distribution)
- \(p_R(r \mid s, a, s')\): reward model (stochastic reward distribution)
- \(p_0(s_0)\): initial state distribution
At time \(t = 0\), the initial state is sampled as \(s_0 \sim p_0\). At each step \(t \geq 0\), the agent observes
state \(s_t \in \mathcal{S}\), selects action \(a_t \sim \pi(a_t \mid s_t)\), and receives reward
\(r_t \sim p_R(r \mid s_t, a_t, s_{t+1})\), where the next state is drawn from \(s_{t+1} \sim p_T(s_{t+1} \mid s_t, a_t)\).
The agent's decision-making is governed by a stochastic policy \(\pi(a \mid s)\).
This interaction at each step is called a transition, represented as the tuple:
\[
(s_t, a_t, r_t, s_{t+1}),
\]
where:
- \(a_t \sim \pi(a \mid s_t)\)
- \(s_{t+1} \sim p_T(s_{t+1} \mid s_t, a_t)\)
- \(r_t \sim p_R(r \mid s_t, a_t, s_{t+1})\)
The transition kernel \(p_T(s_{t+1} \mid s_t, a_t)\) depends only on the current state and action, not on prior history;
this is the first-order Markov property
in its state-action form, and it is what makes the trajectory distribution factorize cleanly.
Under policy \(\pi\), the joint distribution of a trajectory \(\boldsymbol{\tau}\) of length \(T\) is given by:
\[
p(\boldsymbol{\tau}) = p_0(s_0) \prod_{t = 0}^{T -1}
\pi(a_t \mid s_t) \, p_T(s_{t+1} \mid s_t, a_t) \, p_R(r_t \mid s_t, a_t, s_{t+1}).
\]
We define the expected reward function from the reward model \(p_R\) as the marginal average immediate reward
for taking action \(a\) in state \(s\), integrating over possible next states:
\[
R(s, a) = \mathbb{E}_{p_T(s' \mid s, a)} \left[
\mathbb{E}_{p_R(r \mid s, a, s')}[r]
\right].
\]
While classical RL literature often starts with a deterministic reward function \(r(s, a)\) as a primitive,
our probabilistic formulation derives \(R(s, a)\) by taking the expectation under the stochastic reward kernel \(p_R\).
This explicit modeling of reward stochasticity is particularly useful for gradient-based RL methods, where trajectory
likelihoods must be modeled explicitly.
Value Functions and Bellman Equations
Definition: Return
Let \(\boldsymbol{\tau}\) be a trajectory. The return at time \(t\) is the total accumulated reward
from that point forward, discounted by a factor \(\gamma \in [0, 1]\). For an episodic task of finite horizon \(T\),
\[
\begin{align*}
G_t &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T - t - 1} r_{T - 1} \\\\
&= \sum_{k = 0}^{T - t - 1} \gamma^k r_{t + k} \\\\
&= \sum_{j = t}^{T - 1} \gamma^{j - t} r_j.
\end{align*}
\]
For a continuing task (\(T = \infty\)), the upper limit becomes \(\infty\):
\[
G_t = \sum_{k = 0}^{\infty} \gamma^k r_{t + k},
\]
which converges absolutely whenever \(\gamma \in [0, 1)\) and the rewards are bounded.
Convention. Throughout this section, the subscript on \(\mathbb{E}\) names the distribution being averaged over:
\(\mathbb{E}_\pi[\,\cdot\,]\) abbreviates expectation over full trajectories generated by \(\pi\),
while \(\mathbb{E}_{\pi(a \mid s)}[\,\cdot\,]\) is expectation over the action variable alone, drawn from \(\pi(\cdot \mid s)\).
Definition: State-Value Function
Given a stochastic policy \(\pi(a \mid s)\), the state-value function
(or value function) under the policy \(\pi\) is defined as:
\[
\begin{align*}
V_{\pi}(s) &= \mathbb{E}_{\pi}\left[G_0 \mid s_0 = s\right] \\\\
&= \mathbb{E}_{\pi}\left[\sum_{t = 0}^{\infty} \gamma^t r_t \mid s_0 = s\right],
\end{align*}
\]
where the expectation is over trajectories induced by the policy \(\pi\) starting from state \(s\).
Definition: Action-Value Function (Q-Function)
The action-value function (or Q-function) under the policy \(\pi\) is defined as
\[
\begin{align*}
Q_{\pi}(s, a) &= \mathbb{E}_{\pi} \left[ G_0 \mid s_0 = s, \, a_0 = a \right] \\\\
&= \mathbb{E}_{\pi} \left[ \sum_{t = 0}^{\infty} \gamma^t r_t \mid s_0 = s, \, a_0 = a \right],
\end{align*}
\]
the expected return when starting in state \(s\), taking action \(a\), and thereafter following \(\pi\).
Definition: Advantage Function
The advantage function is the difference between the action-value and state-value functions:
\[
A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s).
\]
This measures how much better action \(a\) is in state \(s\) compared to the action that would be taken on average under \(\pi\).
The discount factor \(\gamma\) ensures the return remains finite even for infinite-horizon problems and gives higher weight
to short-term rewards, thereby encouraging the agent to achieve goals sooner.
A policy \(\pi_*\) is called an optimal policy if it yields the highest value for every state:
\[
\forall s \in \mathcal{S}, \quad V_{\pi_*}(s) \geq V_{\pi}(s), \quad \forall \pi.
\]
Although multiple optimal policies may exist, their value functions are the same: \(V_*(s) = V_{\pi_*}(s)\) and \(Q_*(s, a) = Q_{\pi_*}(s, a)\).
Moreover, when the action space \(\mathcal{A}\) is finite, the maximum in the Bellman optimality equation (stated immediately below)
is attained at every state, so selecting any maximizer
\[
\pi_*(s) \in \arg\max_a \left[ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ V_*(s') \right] \right]
\]
yields a deterministic optimal policy.
Theorem: Bellman's Optimality Equations
\[
\begin{align*}
V_*(s) &= \max_a \left[ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ V_*(s') \right] \right] \\\\
Q_*(s, a) &= R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ \max_{a'} Q_*(s', a') \right]
\end{align*}
\]
The optimal value functions \(V_*\) and \(Q_*\) are the unique fixed points of these equations;
the right-hand side defines a \(\gamma\)-contraction in the supremum norm, and uniqueness follows from the
Banach Fixed-Point Theorem (used explicitly in the value-iteration analysis below).
The optimal policy can then be derived by:
\[
\begin{align*}
\pi_*(s) &= \arg \max_a Q_*(s, a) \\\\
&= \arg \max_a \left[ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ V_*(s') \right] \right].
\end{align*}
\]
Solving for \(V_*\), \(Q_*\), or \(\pi_*\) is known as policy optimization, while computing
\(V_{\pi}\) or \(Q_{\pi}\) for a given policy \(\pi\) is called policy evaluation.
Both Bellman optimality and Bellman expectation equations follow the same recursive structure: the value of a state
equals the immediate reward plus the discounted value of successor states. They differ only in how the action is chosen —
by maximization, \(\max_a[\,\cdot\,]\), in the optimality equations (giving the value of acting optimally),
or by sampling from \(\pi\), \(\mathbb{E}_{\pi(a \mid s)}[\,\cdot\,]\), in the expectation equations
(giving the value of following the fixed policy \(\pi\)).
Theorem: Bellman's Expectation Equations
For a fixed policy \(\pi\), the state-value function satisfies:
\[
V_{\pi}(s) = \mathbb{E}_{\pi(a \mid s)} \left[ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ V_{\pi}(s') \right] \right]
\]
The action-value function satisfies:
\[
Q_{\pi}(s, a) = R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ \mathbb{E}_{\pi(a' \mid s')} \left[ Q_{\pi}(s', a') \right] \right]
\]
Dynamic Programming Algorithms for RL
Dynamic programming (DP) is a technique for solving optimization problems by breaking them down into
simpler subproblems and storing the results to avoid redundant computations. In the context of reinforcement learning, DP methods
solve MDPs exactly when the complete model (transition probabilities and rewards) is known.
The key insight of DP for MDPs is that optimal policies have the property that whatever the initial state and initial decision are,
the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
This is known as Bellman's principle of optimality, which leads to the recursive Bellman equations.
Value Iteration (VI) and Policy Iteration (PI) are the two classical dynamic programming methods for solving MDPs.
These algorithms are fundamental to model-based reinforcement learning, where an agent first learns the MDP model
from experience and then applies these DP methods to compute optimal policies.
Value Iteration (VI)
Let the initial estimate be \(V_0\). We update it as follows:
\[
V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} p_T(s' \mid s, a) V_k (s') \right].
\]
This is called a Bellman backup. Each iteration is a \(\gamma\)-contraction mapping
in the \(\ell_\infty\) norm, satisfying:
\[
\| V_{k+1} - V_* \|_\infty \leq \gamma \| V_k - V_* \|_\infty
\]
where \( \| V \|_\infty = \max_{s \in \mathcal{S}} |V(s)| \).
Proof of contraction.
Fix any \(s \in \mathcal{S}\). Let \(F_k(s, a) := R(s, a) + \gamma \sum_{s'} p_T(s'\mid s, a) V_k(s')\) and
\(F_*(s, a) := R(s, a) + \gamma \sum_{s'} p_T(s'\mid s, a) V_*(s')\). Then \(V_{k+1}(s) = \max_a F_k(s, a)\) and
\(V_*(s) = \max_a F_*(s, a)\) (the latter by Bellman optimality). Using the elementary inequality
\(|\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)|\),
\[
\begin{align*}
|V_{k+1}(s) - V_*(s)| &\leq \max_a |F_k(s, a) - F_*(s, a)| \\\\
&= \max_a \gamma \left| \sum_{s'} p_T(s'\mid s, a)\, [V_k(s') - V_*(s')] \right| \\\\
&\leq \gamma \max_a \sum_{s'} p_T(s'\mid s, a)\, |V_k(s') - V_*(s')| \\\\
&\leq \gamma \|V_k - V_*\|_\infty,
\end{align*}
\]
where the last step uses \(\sum_{s'} p_T(s'\mid s, a) = 1\) and \(|V_k(s') - V_*(s')| \leq \|V_k - V_*\|_\infty\).
Taking \(\max_s\) on the left gives the claim.
Because \( 0 \leq \gamma < 1 \),
the Banach Fixed-Point Theorem guarantees that
\(V_k\) converges to the unique optimal value function \(V_*\) as \(k \to \infty\).
For all possible states \(s\), value iteration computes \(V_*(s)\) and \(\pi_*(s)\), averaging over all possible
next states \(s'\) at each iteration. (Note: If we need the value and policy for only certain starting states, other methods can be used such as
real-time dynamic programming).
VALUE_ITERATION
Input: \(M = \left\langle \mathcal{S}, \mathcal{A}, p_T(s' | s, a), R(s, a), \gamma \right\rangle\)
Output: \(V_*\), \(\pi_*\)
begin
Initialize \(V(s)\) arbitrarily for all \(s \in \mathcal{S}\) (typically \(V(s) = 0\))
repeat:
\(\Delta \leftarrow 0\)
for each \(s \in \mathcal{S}\):
\(V^{\text{old}}(s) \leftarrow V(s)\)
\(V(s) \leftarrow \max_a \left[R(s,a) + \gamma \sum_{s'} p_T(s'|s,a) V(s')\right]\)
\(\Delta \leftarrow \max(\Delta, |V^{\text{old}}(s) - V(s)|)\)
until \(\Delta < \theta\) (small threshold)
// Extract optimal policy
for each \(s \in \mathcal{S}\):
\(\pi_*(s) \leftarrow \arg\max_a \left[R(s,a) + \gamma \sum_{s'} p_T(s'|s,a) V(s')\right]\)
end
Policy Iteration (PI)
Let \(\mathbf{v} \in \mathbb{R}^{|\mathcal{S}|}\) be the value function vector where \(v_i = V_{\pi}(s_i)\).
We define the expected reward vector \(\mathbf{r}\) and the state transition matrix \(\mathbf{T}\) as:
\[
\mathbf{r}(s) = \sum_a \pi(a \mid s) R(s, a), \quad \mathbf{T}(s' \mid s) = \sum_a \pi(a \mid s) p_T(s' \mid s, a).
\]
The Bellman expectation equation for policy evaluation then forms a linear system:
\[
\mathbf{v} = \mathbf{r} + \gamma \mathbf{T}\mathbf{v}.
\]
Theoretically, the exact solution is \(\mathbf{v} = (\mathbf{I} - \gamma \mathbf{T})^{-1} \mathbf{r}\).
Since \(\mathbf{T}\) is a row-stochastic matrix, all of its eigenvalues lie in the closed unit disk, so the
spectral radius \(\rho(\gamma \mathbf{T}) = \gamma \rho(\mathbf{T}) \leq \gamma < 1\). This guarantees that
\(\mathbf{I} - \gamma \mathbf{T}\) is non-singular and that the Neumann series
\(\sum_{k=0}^{\infty} (\gamma \mathbf{T})^k = (\mathbf{I} - \gamma \mathbf{T})^{-1}\) converges.
In practice, we solve this iteratively using the update \(\mathbf{v}_{k+1} = \mathbf{r} + \gamma \mathbf{T} \mathbf{v}_k\).
Once \(V_{\pi}\) is evaluated, we proceed to policy improvement to find a superior policy \(\pi'\).
Now we move on to the policy improvement step:
\[
\pi'(s) = \arg \max_a \left\{ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} [V_{\pi}(s')] \right\}
\]
This guarantees \(V_{\pi'} \geq V_{\pi}\) componentwise. To see this, write \(\mathbf{r}', \mathbf{T}'\) for the
reward vector and transition matrix induced by \(\pi'\) (analogous to \(\mathbf{r}, \mathbf{T}\) for \(\pi\)).
By the definition of \(\pi'\) as the greedy improvement,
\(\mathbf{r}' + \gamma \mathbf{T}' \mathbf{v} \geq \mathbf{r} + \gamma \mathbf{T} \mathbf{v} = \mathbf{v}\)
componentwise, where the equality is the Bellman expectation equation for \(\pi\). Iterating this inequality,
using that \(\mathbf{T}'\) has non-negative entries (so it preserves componentwise inequalities), gives
\[
\begin{align*}
\mathbf{v} &\leq \mathbf{r}' + \gamma \mathbf{T}'\mathbf{v} \\\\
&\leq \mathbf{r}' + \gamma \mathbf{T}'(\mathbf{r}' + \gamma \mathbf{T}'\mathbf{v}) \\\\
&\leq \cdots \leq \sum_{k=0}^{n-1} (\gamma \mathbf{T}')^k \mathbf{r}' + (\gamma \mathbf{T}')^n \mathbf{v}.
\end{align*}
\]
Letting \(n \to \infty\): the partial sum converges to \((\mathbf{I} - \gamma \mathbf{T}')^{-1}\mathbf{r}' = \mathbf{v}'\)
by the Neumann-series argument above (since \(\rho(\gamma \mathbf{T}') < 1\)), and the residual \((\gamma \mathbf{T}')^n \mathbf{v} \to \mathbf{0}\)
for the same reason. Therefore \(\mathbf{v} \leq \mathbf{v}'\), i.e., \(V_{\pi} \leq V_{\pi'}\) componentwise.
POLICY_ITERATION
Input: \(M = \left\langle \mathcal{S}, \mathcal{A}, p_T(s' | s, a), R(s, a), \gamma \right\rangle\)
Output: \(\pi_*\)
begin
Initialize \(V_{\pi}(s)\) arbitrarily for all \(s \in \mathcal{S}\) (typically \(V_{\pi}(s) = 0\))
Initialize \(\pi(s)\) arbitrarily for all \(s \in \mathcal{S}\)
repeat:
// Policy Evaluation
repeat:
\(\Delta \leftarrow 0\)
for each \(s \in \mathcal{S}\):
\(V_{\pi}^{\text{old}}(s) \leftarrow V_{\pi}(s)\)
\(V_{\pi}(s) \leftarrow \sum_a \pi(a|s) \left[R(s,a) + \gamma \sum_{s'} p_T(s'|s,a) V_{\pi}(s')\right]\)
\(\Delta \leftarrow \max(\Delta, |V_{\pi}^{\text{old}}(s) - V_{\pi}(s)|)\)
until \(\Delta < \theta\) (small threshold)
// Policy Improvement
\(\text{policy-stable} \leftarrow \text{true}\)
for each \(s \in \mathcal{S}\):
\(\text{old-action} \leftarrow \pi(s)\)
\(\pi(s) \leftarrow \arg\max_a \left[R(s,a) + \gamma \sum_{s'} p_T(s'|s,a) V_{\pi}(s')\right]\)
if \(\text{old-action} \neq \pi(s)\) then \(\text{policy-stable} \leftarrow \text{false}\)
until \(\text{policy-stable} = \text{true}\)
end
In policy iteration algorithm, starting from an initial policy, we alternate between policy evaluation step and
policy improvement step. Since there are at most \(|\mathcal{A}|^{|\mathcal{S}|}\) deterministic policies (assuming finite \(\mathcal{S}\) and \(\mathcal{A}\)),
and every iteration either leaves the policy unchanged — in which case Bellman optimality holds at every state and the policy is optimal —
or strictly improves it, the algorithm must converge in a finite number of iterations.
Monte Carlo Control
Having established the theoretical foundations of MDPs and dynamic programming, we now turn to model-free methods that
learn directly from experience. We begin with Monte Carlo (MC) control, a fundamental value-based method
that estimates value functions by sampling complete episodes and using actual returns. The averaging of sampled returns
is an instance of the general Monte Carlo estimator,
and its convergence is guaranteed by
Monte Carlo consistency
(an application of the strong law of large numbers).
Monte Carlo control implements generalized policy iteration (GPI) — a unifying name for any scheme
that interleaves policy evaluation and policy improvement at any granularity (also called
generalized policy improvement in some texts). At iteration \(k\), with current policy \(\pi_k\), the three steps are:
- Rollout: starting from state \(s\), take action \(a\), and sample the rest of the trajectory by following
policy \(\pi_k\) until a terminal state is reached.
- MC Q-estimate: compute the realized return \(G_0 = \sum_{t=0}^{T-1} \gamma^t r_t\) from this rollout and
average it (over many such rollouts) into the running estimate \(\hat{Q}_{\pi_k}(s, a)\). Steps 1 and 2 together perform
policy evaluation.
- Policy improvement: update the policy greedily with respect to the current Q-estimate,
\[
\pi_{k+1}(s) = \arg \max_{a} Q_k (s, a),
\]
where \(Q_k\) is the MC estimate of \(Q_{\pi_k}\) accumulated so far.
To achieve convergence to the optimal policy while maintaining adequate exploration,
we can use an \(\varepsilon\)-greedy policy that balances exploitation of current
knowledge with exploration of potentially better actions.
Monte Carlo methods provide unbiased estimates of \(Q_\pi\) since they use actual returns. However, MC
methods are not efficient for value-based RL because they require unrolling complete trajectories whose returns
are sums of random rewards generated by stochastic state transitions, leading to high variance
in value estimates. Moreover, MC methods are most natural for episodic tasks (or for continuing tasks via finite-horizon
truncation), since they need a complete trajectory to form each update; this makes them unsuitable for step-by-step online updates.
To address these limitations, we now turn to a more efficient approach that uses bootstrapping — updating
estimates based on other estimates rather than waiting for final outcomes. Note that this concept of bootstrapping
in reinforcement learning is distinct from statistical bootstrapping used in other areas of machine learning.
Temporal Difference Learning
Temporal Difference (TD) learning methods address the limitations of Monte Carlo methods through
bootstrapping — updating value estimates using single-step transitions plus current estimates
of successor states rather than waiting for complete returns. In other words, TD methods incrementally reduce the
Bellman expectation residual for sampled states by learning from individual transitions rather than complete trajectories.
Theoretical Foundation
The TD target \(r_t + \gamma V(s_{t+1})\) is a single-sample approximation of the right-hand side of the
Bellman expectation equation introduced above,
\[
V_{\pi}(s) = \mathbb{E}_{\pi(a \mid s)} \left[ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ V_{\pi}(s') \right] \right],
\]
obtained by replacing the two expectations with a single observed transition \((s_t, a_t, r_t, s_{t+1})\) drawn from the agent's actual interaction.
The TD update then adjusts the current estimate \(V(s_t)\) toward this target. This establishes TD learning as a method
for solving the Bellman equations through successive approximation, without requiring knowledge of
transition probabilities.
Under appropriate conditions — bounded rewards, finite state space, and a learning-rate schedule \(\{\alpha_t\}\) satisfying the
Robbins-Monro conditions \(\sum_{t=0}^{\infty} \alpha_t = \infty\) and \(\sum_{t=0}^{\infty} \alpha_t^2 < \infty\) —
tabular TD(0) converges to the true value function \(V_{\pi}(s)\) for the policy being followed.
TD(0): One-Step Temporal Difference
The simplest TD method, TD(0), updates value estimates after each single step. Suppose the agent learns
the value function \(V_\pi\) for a fixed policy \(\pi\).
Given a state transition \((s_t, a_t, r_t, s_{t+1})\) where \(a_t \sim \pi(s_t)\), we update the estimate \(V(s_t)\) as follows:
TD(0) Update
\[
V(s_t) \leftarrow V(s_t) + \alpha\,\delta_t
\]
where:
- \(\alpha \in (0,1]\) is the learning rate,
- \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\) is the TD error,
- \(r_t + \gamma V(s_{t+1})\) is the TD target.
n-Step TD Methods
TD methods can be generalized to look ahead multiple steps. The n-step return is defined as:
\[
G_{t:t+n} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})
\]
The corresponding n-step TD update becomes:
\[
V(s_t) \leftarrow V(s_t) + \alpha [G_{t:t+n} - V(s_t)]
\]
This provides a spectrum of methods: \(n=1\) recovers TD(0) (with target \(G_{t:t+1} = r_t + \gamma V(s_{t+1})\)),
and as \(n\) increases toward the episode length, the method approaches Monte Carlo behavior by using longer actual return sequences.
TD(\(\lambda\)) and \(\lambda\)-returns
Rather than using a fixed n-step lookahead, TD(\(\lambda\)) methods use a weighted average of all n-step returns.
The λ-return is defined as:
\[
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}
\]
where \(\lambda \in [0,1]\) controls the weighting between different n-step returns.
The parameter \(\lambda\) creates a different type of interpolation than n-step methods:
- When \(\lambda = 0\): Only the 1-step return is used, reducing to TD(0)
- When \(\lambda = 1\): For episodic tasks, this gives Monte Carlo behavior through full credit assignment
- For \(0 < \lambda < 1\): Creates a weighted blend that emphasizes shorter-term returns but includes longer-term effects
Note that TD(λ) with \(\lambda = 1\) achieves Monte Carlo-equivalent behavior through a different mechanism than
n-step TD with large n. While n-step methods extend the lookahead horizon, TD(λ) uses eligibility traces to
distribute credit backward through time, allowing for online updates even in continuing tasks.
The \(\lambda\)-return provides a principled way to interpolate between the low variance but biased estimates of TD(0)
and the high variance but unbiased estimates of Monte Carlo methods. TD(0) is thus a special case
of the more general TD(\(\lambda\)) framework. Importantly, both n-step TD and TD(\(\lambda\)) can achieve Monte Carlo behavior,
but through different parameterizations: n-step methods by extending the lookahead horizon (large n),
and TD(λ) methods through the weighting parameter (\(\lambda = 1\)).
Function Approximation
More generally, for function approximation where \(V_\mathbf{w}(s)\) is parameterized by weights \(\mathbf{w}\),
the TD(0) update becomes:
\[
\mathbf{w} \leftarrow \mathbf{w} + \alpha [ r_t + \gamma V_{\mathbf{w}}(s_{t+1}) - V_{\mathbf{w}}(s_t)]\nabla_{\mathbf{w}}V_{\mathbf{w}}(s_t)
\]
Note that with function approximation, TD methods may not always converge, unlike the tabular case.
Key Advantages
TD methods offer several key advantages that make them central to modern reinforcement learning:
- Online learning: Updates occur after each time step, enabling continuous learning without waiting for episode completion.
- Computational efficiency: No need to store or process entire episode returns.
- Lower variance: TD(0) has lower variance due to bootstrapping, though this comes at the cost of introducing bias.
- Applicability to continuing tasks: Can learn in non-episodic environments where episodes never terminate.
TD Control Methods: Q-Learning & SARSA
Having established the theoretical foundations of temporal difference learning, we now turn to specific
TD control algorithms that learn optimal policies by estimating action-value functions.
While the previous section focused on TD methods for policy evaluation (learning \(V_\pi\)), these methods
extend TD learning to the control problem by learning Q-functions and deriving policies from them.
Both Q-learning and SARSA are temporal difference methods that bootstrap using single-step transitions,
but they differ fundamentally in their approach to policy learning: one learns the optimal Q-function directly
(off-policy), while the other learns the Q-function of the policy it follows (on-policy).
SARSA (On-Policy TD Control)
The agent follows policy \(\pi\) at every step to select actions, and learns its Q-function from the resulting transitions.
Under a transition \((s, a, r, s')\), the TD update rule is:
SARSA Update
\[
Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma Q(s', a') - Q(s, a)\right]
\]
where \(a' \sim \pi(s')\) is the action that the agent will take in state \(s'\). This makes SARSA on-policy.
Note: SARSA is named after the transition tuple \((s, a, r, s', a')\).
Q-Learning (Off-Policy TD Control)
Instead of the sampled next action \(a' \sim \pi(s')\) in SARSA, Q-learning uses the greedy action in \(s'\):
Q-Learning Update
\[
Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_b Q(s', b) - Q(s, a)\right]
\]
The max operator \(\max_b Q(s', b)\) makes this off-policy:
the update targets the optimal Q-function regardless of the behavioral policy.
Key Differences
- Convergence: Q-learning provably converges to the optimal action-value function \(Q_*\) in the tabular case,
provided every state-action pair is visited infinitely often and the learning rates satisfy the Robbins-Monro conditions.
SARSA, viewed as pure policy evaluation, converges to \(Q_\pi\) for the policy \(\pi\) being followed; when combined with a
GLIE (Greedy in the Limit with Infinite Exploration) policy — for example, \(\varepsilon\)-greedy with
\(\varepsilon \to 0\) — SARSA also converges to \(Q_*\) and the optimal policy \(\pi_*\).
- Policy dependence: SARSA accounts for the actual exploration policy in its updates, while Q-learning's target
\(\max_b Q(s', b)\) ignores the behavioral policy entirely.
- Behavioral differences: In some environments, SARSA may learn different policies than Q-learning because it considers the exploration policy's actual behavior, while Q-learning optimizes assuming greedy action selection.
Both algorithms require exploration to discover good actions, leading us to the fundamental challenge
of balancing exploration and exploitation.
Policy Gradient Methods
Unlike value-based methods that learn value functions and derive policies implicitly,
policy-based methods directly parameterize and optimize policies without necessarily
learning value functions. These methods address fundamental limitations of value-based approaches,
particularly in continuous action spaces and when stochastic policies are desired.
Our objective is the expected return of a policy:
\[
\begin{align*}
J(\pi) &= \mathbb{E}_{p_0(s_0)}[V_{\pi}(s_0)] \\\\
&= \mathbb{E}_{p_0(s_0)\pi(a_0 \mid s_0)}[Q_{\pi}(s_0, a_0)].
\end{align*}
\]
Let \(\pi_{\boldsymbol{\theta}}\) be parameterized by \(\boldsymbol{\theta}\). We compute the gradient of the objective with respect to \(\boldsymbol{\theta}\).
\[
\begin{align*}
\nabla_{\boldsymbol{\theta}} J(\pi_{\boldsymbol{\theta}})
&= \mathbb{E}_{p_0(s_0)} \left[\nabla_{\boldsymbol{\theta}} \left(\sum_{a_0}\pi_{\boldsymbol{\theta}}(a_0 \mid s_0)Q_{\pi_{\boldsymbol{\theta}}}(s_0, a_0) \right) \right] \\\\
&= \mathbb{E}_{p_0(s_0)} \left[\sum_{a_0} \nabla_{\boldsymbol{\theta}} \pi_{\boldsymbol{\theta}}(a_0 \mid s_0)Q_{\pi_{\boldsymbol{\theta}}}(s_0, a_0) \right]
+ \mathbb{E}_{p_0(s_0)\pi_{\boldsymbol{\theta}}(a_0 \mid s_0)} [\nabla_{\boldsymbol{\theta}}Q_{\pi_{\boldsymbol{\theta}}}(s_0, a_0)].
\end{align*}
\]
Here, since \(R(s_0, a_0)\) does not depend on \(\boldsymbol{\theta}\),
\[
\begin{align*}
\nabla_{\boldsymbol{\theta}}Q_{\pi_{\boldsymbol{\theta}}}(s_0, a_0)
&= \nabla_{\boldsymbol{\theta}}\left[R(s_0, a_0) + \gamma \mathbb{E}_{p_T(s_1 \mid s_0, a_0)} [V_{\pi_{\boldsymbol{\theta}}}(s_1)]\right] \\\\
&= \gamma \nabla_{\boldsymbol{\theta}} \mathbb{E}_{p_T(s_1 \mid s_0, a_0)} [V_{\pi_{\boldsymbol{\theta}}}(s_1)].
\end{align*}
\]
Substituting this back and unrolling the recursion across all timesteps — where each level of recursion contributes a factor of \(\gamma\)
and shifts the state distribution one step forward under \(\pi_{\boldsymbol{\theta}}\) — and then collapsing the resulting infinite sum
using the discounted state-visitation distribution \(p_{\pi_{\boldsymbol{\theta}}}^\infty\), we obtain the policy gradient theorem:
Theorem: Policy Gradient Theorem
For a parameterized policy \(\pi_{\boldsymbol{\theta}}\), the gradient of the expected return is:
\[
\begin{align*}
\nabla_{\boldsymbol{\theta}}J(\pi_{\boldsymbol{\theta}})
&= \sum_{t = 0}^{\infty} \gamma^t \mathbb{E}_{p_t(s)} \left[\sum_a \nabla_{\boldsymbol{\theta}}\pi_{\boldsymbol{\theta}}(a \mid s) Q_{\pi_{\boldsymbol{\theta}}}(s, a)\right] \\\\
&= \frac{1}{1 -\gamma} \mathbb{E}_{p_{\pi_{\boldsymbol{\theta}}}^{\infty}(s)} \left[ \sum_a \nabla_{\boldsymbol{\theta}}\pi_{\boldsymbol{\theta}}(a \mid s) Q_{\pi_{\boldsymbol{\theta}}}(s, a) \right] \\\\
&= \frac{1}{1 -\gamma} \mathbb{E}_{p_{\pi_{\boldsymbol{\theta}}}^{\infty}(s) \pi_{\boldsymbol{\theta}}(a \mid s)} \left[ \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a \mid s) Q_{\pi_{\boldsymbol{\theta}}}(s, a) \right]
\end{align*}
\]
where \(p_t(s)\) is the probability of visiting state \(s\) at time \(t\) if the agent starts with \(s_0 \sim p_0\) following \(\pi_{\boldsymbol{\theta}}\), and
\[
p_{\pi_{\boldsymbol{\theta}}}^\infty(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t p_t(s)
\]
is the normalized discounted state visitation distribution.
The third equality uses the log-derivative trick
\(\nabla_{\boldsymbol{\theta}} \pi_{\boldsymbol{\theta}}(a \mid s) = \pi_{\boldsymbol{\theta}}(a \mid s) \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a \mid s)\),
converting the gradient of a probability into an expectation under that probability. The quantity
\(\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a \mid s)\) is precisely the conditional
score function
of the policy distribution, which makes the policy gradient an instance of score-function estimation
(and connects directly to the Fisher-information geometry discussed below).
To reduce the high variance of policy-gradient estimates, we can subtract a state-dependent baseline \(b(s)\)
from \(Q_{\pi_{\boldsymbol{\theta}}}(s, a)\) inside the expectation. A key invariance property is that this subtraction does not change the gradient in expectation,
because \(\mathbb{E}_{a \sim \pi_{\boldsymbol{\theta}}(\cdot \mid s)}[\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a \mid s)] = \mathbf{0}\) for any \(s\),
so the baseline term integrates to zero. The policy gradient theorem then becomes:
\[
\nabla_{\boldsymbol{\theta}}J(\pi_{\boldsymbol{\theta}})
= \frac{1}{1 -\gamma} \mathbb{E}_{p_{\pi_{\boldsymbol{\theta}}}^{\infty}(s) \pi_{\boldsymbol{\theta}}(a \mid s)} \left[ \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a \mid s) (Q_{\pi_{\boldsymbol{\theta}}}(s, a) - b(s)) \right].
\]
A common choice is \(b(s) = V_{\pi_{\boldsymbol{\theta}}}(s)\), which makes the multiplier the
advantage function
\(A_{\pi_{\boldsymbol{\theta}}}(s, a) = Q_{\pi_{\boldsymbol{\theta}}}(s, a) - V_{\pi_{\boldsymbol{\theta}}}(s)\).
Intuitively, this re-centers each action's value relative to the policy's average performance at that state,
sharply reducing variance whenever the policy already concentrates on near-optimal actions.
REINFORCE is the fundamental policy gradient algorithm that uses Monte Carlo estimation:
\[
\begin{align*}
\nabla_{\boldsymbol{\theta}} J(\pi_{\boldsymbol{\theta}})
&= \frac{1}{1 -\gamma} \mathbb{E}_{p_{\pi_{\boldsymbol{\theta}}}^{\infty}(s) \pi_{\boldsymbol{\theta}}(a \mid s)} \left[ \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a \mid s) Q_{\pi_{\boldsymbol{\theta}}}(s, a) \right]\\\\
&\approx \sum_{t=0}^{T-1} \gamma^t G_t \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t \mid s_t)
\end{align*}
\]
where \(G_t\) is the return from time \(t\):
\[
G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t-1} r_{T-1}.
\]
Here, using a baseline in the gradient estimate, we obtain the REINFORCE update rule:
\[
\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \sum_{t = 0}^{T -1} \gamma^t (G_t - b(s_t)) \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}} (a_t \mid s_t).
\]
REINFORCE
begin
Set initial policy parameters \(\boldsymbol{\theta}\), and baseline parameters \(\mathbf{w}\)
repeat:
Sample an episode \(\tau = (s_0, a_0, r_0, s_1, \cdots, s_T)\) using the policy \(\pi_{\boldsymbol{\theta}}\)
Compute \(G_t\) for all \(t \in \{0, 1, \cdots, T-1\}\)
for \(t = 0, 1, \cdots, T-1\) do
\(\delta = G_t - V_\mathbf{w}(s_t)\)
\(\mathbf{w} \leftarrow \mathbf{w} + \alpha_{\mathbf{w}} \delta \nabla_{\mathbf{w}}V_{\mathbf{w}}(s_t)\)
\(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha_{\boldsymbol{\theta}} \gamma^t \delta \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t \mid s_t) \)
until converged
end
Actor-Critic Methods
Actor-critic methods are hybrid algorithms that combine both policy-based and value-based approaches
to address the high variance problem of pure policy gradient methods like REINFORCE. These methods maintain
two separate function approximators that work together:
- Actor \(\pi_{\boldsymbol{\theta}}(a \mid s)\): A parameterized policy that selects actions, where \(\boldsymbol{\theta}\) are the policy parameters.
- Critic \(V_{\mathbf{w}}(s)\): A parameterized state-value function that evaluates states, where \(\mathbf{w}\) are the value function parameters.
The actor updates the policy parameters \(\boldsymbol{\theta}\) using gradients weighted by the advantage estimates
provided by the critic, while the critic updates the value function parameters \(\mathbf{w}\) to more accurately
estimate expected returns through temporal difference learning.
Consider the use of the one-step TD(0) method to estimate the return in the episodic case.
Specifically, we replace the Monte Carlo return \(G_t\) (used in REINFORCE) with the 1-step TD target
\[
G_{t:t+1} = r_t + \gamma V_{\mathbf{w}} (s_{t+1}),
\]
and use \(V_{\mathbf{w}}(s_t)\) as the baseline. The REINFORCE update from the previous section then becomes:
\[
\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \sum_{t=0}^{T-1} \gamma^t (G_{t:t+1} - V_{\mathbf{w}}(s_t))\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t \mid s_t).
\]
Here, the increment \(r_t + \gamma V_{\mathbf{w}}(s_{t+1}) - V_{\mathbf{w}}(s_t)\) is a single-sample approximation to the
advantage function
\[
A_{\pi_{\boldsymbol{\theta}}}(s_t, a_t) = Q_{\pi_{\boldsymbol{\theta}}}(s_t, a_t) - V_{\pi_{\boldsymbol{\theta}}}(s_t),
\]
so this method is called the advantage actor-critic (or A2C).
A2C(Advantage Actor Critic)
begin
Set initial actor parameters \(\boldsymbol{\theta}\), and critic parameters \(\mathbf{w}\)
repeat:
Sample starting state \(s_0\) of a new episode
for \(t = 0, 1, \cdots\) do
Sample action \(a_t \sim \pi_{\boldsymbol{\theta}}(\cdot \mid s_t)\)
Observe next state \(s_{t+1}\) and reward \(r_t\)
\(\delta = r_t + \gamma V_\mathbf{w}(s_{t+1}) - V_\mathbf{w}(s_t)\)
\(\mathbf{w} \leftarrow \mathbf{w} + \alpha_{\mathbf{w}} \delta \nabla_{\mathbf{w}}V_{\mathbf{w}}(s_t)\)
\(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha_{\boldsymbol{\theta}} \gamma^t \delta \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t \mid s_t) \)
until converged
end
Note that the \(n\)-step advantage estimate can be expressed as:
\[
A_{\pi_{\boldsymbol{\theta}}}^{(n)}(s_t, a_t) = G_{t:t+n} - V_{\mathbf{w}}(s_t)
\]
where
\[
G_{t:t+n} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V_{\mathbf{w}}(s_{t+n}).
\]
We can control the bias-variance tradeoff by adjusting \(n\).
Actor-critic methods offer several key advantages:
- Reduced variance: Lower variance than pure policy gradient methods like REINFORCE through value function baselines
- Sample efficiency: More efficient learning through bootstrapping compared to Monte Carlo methods
- Broad applicability: Can handle both discrete and continuous action spaces with online learning capability
Among the many extensions of A2C developed in the deep reinforcement learning literature,
Proximal Policy Optimization (PPO) deserves special mention: by clipping the policy update at each step to keep
the new policy within a trust region around the old one, PPO substantially improves training stability while remaining
simple to implement. As of the mid-2020s, PPO has become the de facto standard on-policy algorithm in continuous-control
benchmarks and, notably, in reinforcement learning from human feedback (RLHF) used to align large language models.
Standard policy gradient methods optimize \(J(\pi_{\boldsymbol{\theta}})\) using the Euclidean gradient \(\nabla_{\boldsymbol{\theta}} J\),
which assumes an isotropic parameter space. However, the parameter space of a stochastic policy is
actually a statistical manifold, where Euclidean steps can be poorly scaled if the
curvature of the distribution family varies across directions. In our
natural gradient descent page,
we introduce the Fisher information matrix
as the Riemannian metric on this space,
demonstrating how natural gradients enable more stable and efficient optimization by accounting for
the underlying geometry of the policy distribution.