Intro to Reinforcement Learning

Introduction

So far, we have studied supervised learning and unsupervised learning. We now turn to a third paradigm. Reinforcement Learning (RL) is a machine learning framework in which an agent interacts with an environment to learn a behavior that maximizes a scalar reward signal over time. Formally, the environment is typically modeled as a Markov Decision Process (MDP), defined by a tuple \[ (\mathcal{S}, \mathcal{A}, \mathcal{T}, r, \gamma) \] where \(\mathcal{S}\) is the state space, \(\mathcal{A}\) is the action space, \(\mathcal{T}\) is the transition kernel (a conditional distribution \(\mathcal{T}(s' \mid s, a)\)), \(r\) is the reward function, and \(\gamma\) is the discount factor. We give a more detailed probabilistic formulation, separating the stochastic reward kernel from the expected reward function and including the initial state distribution explicitly, in the MDP definition below.

At each time step, the agent observes a state \(s_t \in \mathcal{S}\) and selects an action \(a_t \in \mathcal{A}\); the environment then transitions to a new state \(s_{t+1}\) according to the transition dynamics \(\mathcal{T}(s_{t+1} \mid s_t, a_t)\) and emits a reward \(r_t\). The goal of the agent is to learn a policy \(\pi(a \mid s)\) that maximizes the expected cumulative reward: \[ \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t r_t \right]. \]

Reinforcement learning differs fundamentally from other machine learning paradigms:

Supervised learning trains models on ground-truth input-output pairs with explicit correct labels for each input.
Unsupervised learning discovers hidden structure or representations in unlabeled data without any feedback signal.
Reinforcement learning uses only a scalar reward signal that provides evaluative feedback based on the agent's interaction with an environment.

In RL, learning is typically driven by trial-and-error interaction rather than labeled examples (the offline / batch RL setting, where an agent learns from a fixed pre-collected dataset, is a notable exception). Unlike supervised learning, there are no correct labels for each input, and unlike unsupervised learning, the objective is not to model data structure but to learn a policy that maximizes expected cumulative reward over time.

RL has been successfully applied in domains such as robotics, recommendation systems, and game-playing. More recently, RL has also become integral to the training of modern large language models (LLMs). In particular, a method called Reinforcement Learning from Human Feedback (RLHF) is used to fine-tune LLMs to follow human preferences, aiming to make outputs more helpful, safe, and better aligned with user intent.

The diversity of RL applications has led to a rich classification of algorithms, each designed to address different aspects of the learning problem. Understanding this classification is crucial for selecting appropriate methods and comprehending the relationships between different approaches.

Classification of RL Algorithms

Model-Based vs Model-Free Methods

The most fundamental distinction in RL concerns whether the agent learns an explicit model of the environment.

Model-based RL methods learn an explicit model of the environment from data, including:

the transition dynamics \(p_T(s' \mid s, a)\), and
the reward model \(p_R(r \mid s, a, s')\).

Once learned, the agent uses planning algorithms to compute an optimal policy \(\pi_*\). Classical planning is based on dynamic programming such as value iteration (VI) and policy iteration (PI). The main advantage is environment-sample efficiency — the agent can simulate many trajectories using the learned model without additional environment interaction. However, model learning introduces model bias: if the learned model is inaccurate, the agent may perform well in simulation but poorly in the real environment.

Model-free RL methods bypass explicit environment modeling and directly learn value functions or policies from experience. These methods are generally simpler to implement and more robust to model errors, but typically require more environment interactions compared to model-based approaches.

Value-Based vs Policy-Based

This taxonomy classifies algorithms based on what they learn in the model-free RL.

Value-based methods learn the optimal action-value function \(Q_*\) (or its approximator \(Q_w\), typically a neural network trained iteratively) from experience, and then act greedily with respect to it: \(\pi_*(s) = \arg\max_a Q_*(s, a)\). Given a transition \((s, a, r, s')\), we define the temporal difference (or TD error) as \[ r + \gamma \max_{a'} Q_w(s', a') - Q_w(s, a), \] whose expectation is the Bellman optimality residual evaluated at \((s, a)\); when \(Q_w = Q_*\), the TD error is zero on average by Bellman's optimality equation, so a non-zero expected TD error provides a learning signal. Examples include Q-learning, SARSA, and Deep Q-Networks (DQN); these methods differ in their bootstrap target — for instance, SARSA replaces the \(\max_{a'}\) above with the actually-taken next action \(a'\) — and the precise update rules are given in the TD-learning sections below.

Policy-based methods directly optimize the expected-return objective \[ J(\pi_{\boldsymbol{\theta}}) := \mathbb{E}_{\pi_{\boldsymbol{\theta}}} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right] \] with respect to the policy parameter \(\boldsymbol{\theta}\) using policy gradient, and can naturally handle continuous action spaces and stochastic policies.

Examples include REINFORCE, Trust Region Policy Optimization (TRPO), and actor-critic methods such as Advantage Actor-Critic (A2C), Deep Deterministic Policy Gradient (DDPG), Soft Actor-Critic (SAC), and Proximal Policy Optimization (PPO).

On-Policy vs Off-Policy Methods

This taxonomy concerns the data usage strategy during learning.

On-policy methods learn the Q-function from data generated by the current policy that is being improved. The agent explores the environment and makes updates to its policy based on the actions it has taken.
(e.g., SARSA, REINFORCE, A2C, and PPO.)

Off-policy methods learn from data generated by a different policy than the one being improved. This allows the agent to reuse old data or learn from data collected by other agents. The policy used to collect data is called the behavioral policy, and the policy being learned is called the target policy. They offer higher sample efficiency than on-policy methods, but require handling distribution mismatch (typically via importance-sampling corrections).
(e.g., Q-learning, DQN, DDPG, and SAC.)

Exploration vs. Exploitation

The exploration-exploitation tradeoff is fundamental to reinforcement learning. An agent must balance between:

Exploitation: Choosing actions that yield high rewards based on current knowledge
Exploration: Trying new actions to potentially discover better strategies

All strategies in this section assume access to action-value estimates \(Q(s, a)\), the action-value function formally defined in the Value Functions section below; here we use it operationally as "the agent's current estimate of how good action \(a\) is in state \(s\)."

\(\varepsilon\)-Greedy Strategy

The simplest exploration strategy chooses: \[ a_t = \begin{cases} \arg\max_a Q(s_t, a) & \text{with probability } 1-\varepsilon \\ \text{uniformly random action over } \mathcal{A} & \text{with probability } \varepsilon \end{cases} \] where \(\varepsilon \in [0,1]\) controls the exploration rate. Often \(\varepsilon\) is decayed over time. Note that the random branch samples uniformly from all actions (including the greedy one), so the greedy action is selected with total probability \(1 - \varepsilon + \varepsilon / |\mathcal{A}|\).

Softmax Action Selection

A more sophisticated approach uses the Boltzmann (softmax) distribution: \[ P(a_t = a \mid s_t) = \frac{\exp(Q(s_t, a)/\tau)}{\sum_{a'} \exp(Q(s_t, a')/\tau)} \] where \(\tau > 0\) is the temperature parameter. Higher temperatures lead to more exploration.

Optimistic Initialization

Initializing Q-values optimistically (higher than realistic values) encourages exploration of all actions early in learning, as the agent will be "disappointed" by actual rewards and try other actions. This technique is most effective in the tabular setting; with neural-network function approximators, the notion of a uniform "optimistic" initialization is harder to enforce, and the trick does not transfer cleanly to deep RL.

The choice of exploration strategy significantly affects learning performance and is often problem-dependent. More advanced methods like UCB (Upper Confidence Bound) and Thompson sampling provide principled approaches to this tradeoff.

Markov Decision Process (MDP)

Before diving into the details of RL algorithms, this section presents a detailed probabilistic formulation of Markov Decision Processes (MDPs), where both transitions and rewards are modeled as random variables. This perspective extends the classical deterministic view and is particularly useful for analyzing trajectory distributions and gradient-based learning methods used in modern reinforcement learning.

An agent sequentially interacts with an initially unknown environment to obtain a trajectory or multiple trajectories. A trajectory of length \(T\) (i.e., consisting of \(T\) transitions) is defined as: \[ \boldsymbol{\tau} = (s_0, a_0, r_0, s_1, a_1, r_1, s_2, \ldots, a_{T-1}, r_{T-1}, s_T), \] where \(s_t\) is a state, \(a_t\) is an action, and \(r_t\) is a reward.

The objective is to optimize the agent's action-selection policy so that the expected discounted cumulative reward is maximized: \[ G_0 = \sum_{t = 0}^{T -1} \gamma^t r_t, \] where \(\gamma \in [0, 1]\) is the discount factor. We assume the environment follows a Markov Decision Process (MDP), where the trajectory distribution can be factored into single-step transition and reward models. The process of estimating an optimal policy from trajectories is referred to as learning.

Definition: Markov Decision Process (MDP)

An MDP is a tuple \(\left\langle \mathcal{S}, \mathcal{A}, p_T, p_R, p_0 \right\rangle\) where:

\(\mathcal{S}\): set of environment states
\(\mathcal{A}\): set of available actions
\(p_T(s' \mid s, a)\): transition model (next-state distribution)
\(p_R(r \mid s, a, s')\): reward model (stochastic reward distribution)
\(p_0(s_0)\): initial state distribution

At time \(t = 0\), the initial state is sampled as \(s_0 \sim p_0\). At each step \(t \geq 0\), the agent observes state \(s_t \in \mathcal{S}\), selects action \(a_t \sim \pi(a_t \mid s_t)\), and receives reward \(r_t \sim p_R(r \mid s_t, a_t, s_{t+1})\), where the next state is drawn from \(s_{t+1} \sim p_T(s_{t+1} \mid s_t, a_t)\). The agent's decision-making is governed by a stochastic policy \(\pi(a \mid s)\).

This interaction at each step is called a transition, represented as the tuple: \[ (s_t, a_t, r_t, s_{t+1}), \] where:

\(a_t \sim \pi(a \mid s_t)\)
\(s_{t+1} \sim p_T(s_{t+1} \mid s_t, a_t)\)
\(r_t \sim p_R(r \mid s_t, a_t, s_{t+1})\)

The transition kernel \(p_T(s_{t+1} \mid s_t, a_t)\) depends only on the current state and action, not on prior history; this is the first-order Markov property in its state-action form, and it is what makes the trajectory distribution factorize cleanly. Under policy \(\pi\), the joint distribution of a trajectory \(\boldsymbol{\tau}\) of length \(T\) is given by: \[ p(\boldsymbol{\tau}) = p_0(s_0) \prod_{t = 0}^{T -1} \pi(a_t \mid s_t) \, p_T(s_{t+1} \mid s_t, a_t) \, p_R(r_t \mid s_t, a_t, s_{t+1}). \]

We define the expected reward function from the reward model \(p_R\) as the marginal average immediate reward for taking action \(a\) in state \(s\), integrating over possible next states: \[ R(s, a) = \mathbb{E}_{p_T(s' \mid s, a)} \left[ \mathbb{E}_{p_R(r \mid s, a, s')}[r] \right]. \]

While classical RL literature often starts with a deterministic reward function \(r(s, a)\) as a primitive, our probabilistic formulation derives \(R(s, a)\) by taking the expectation under the stochastic reward kernel \(p_R\). This explicit modeling of reward stochasticity is particularly useful for gradient-based RL methods, where trajectory likelihoods must be modeled explicitly.

Value Functions and Bellman Equations

Definition: Return

Let \(\boldsymbol{\tau}\) be a trajectory. The return at time \(t\) is the total accumulated reward from that point forward, discounted by a factor \(\gamma \in [0, 1]\). For an episodic task of finite horizon \(T\), \[ \begin{align*} G_t &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T - t - 1} r_{T - 1} \\\\ &= \sum_{k = 0}^{T - t - 1} \gamma^k r_{t + k} \\\\ &= \sum_{j = t}^{T - 1} \gamma^{j - t} r_j. \end{align*} \] For a continuing task (\(T = \infty\)), the upper limit becomes \(\infty\): \[ G_t = \sum_{k = 0}^{\infty} \gamma^k r_{t + k}, \] which converges absolutely whenever \(\gamma \in [0, 1)\) and the rewards are bounded.

Convention. Throughout this section, the subscript on \(\mathbb{E}\) names the distribution being averaged over: \(\mathbb{E}_\pi[\,\cdot\,]\) abbreviates expectation over full trajectories generated by \(\pi\), while \(\mathbb{E}_{\pi(a \mid s)}[\,\cdot\,]\) is expectation over the action variable alone, drawn from \(\pi(\cdot \mid s)\).

Definition: State-Value Function

Given a stochastic policy \(\pi(a \mid s)\), the state-value function (or value function) under the policy \(\pi\) is defined as: \[ \begin{align*} V_{\pi}(s) &= \mathbb{E}_{\pi}\left[G_0 \mid s_0 = s\right] \\\\ &= \mathbb{E}_{\pi}\left[\sum_{t = 0}^{\infty} \gamma^t r_t \mid s_0 = s\right], \end{align*} \] where the expectation is over trajectories induced by the policy \(\pi\) starting from state \(s\).

Definition: Action-Value Function (Q-Function)

The action-value function (or Q-function) under the policy \(\pi\) is defined as \[ \begin{align*} Q_{\pi}(s, a) &= \mathbb{E}_{\pi} \left[ G_0 \mid s_0 = s, \, a_0 = a \right] \\\\ &= \mathbb{E}_{\pi} \left[ \sum_{t = 0}^{\infty} \gamma^t r_t \mid s_0 = s, \, a_0 = a \right], \end{align*} \] the expected return when starting in state \(s\), taking action \(a\), and thereafter following \(\pi\).

Definition: Advantage Function

The advantage function is the difference between the action-value and state-value functions: \[ A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s). \] This measures how much better action \(a\) is in state \(s\) compared to the action that would be taken on average under \(\pi\).

The discount factor \(\gamma\) ensures the return remains finite even for infinite-horizon problems and gives higher weight to short-term rewards, thereby encouraging the agent to achieve goals sooner.

A policy \(\pi_*\) is called an optimal policy if it yields the highest value for every state: \[ \forall s \in \mathcal{S}, \quad V_{\pi_*}(s) \geq V_{\pi}(s), \quad \forall \pi. \] Although multiple optimal policies may exist, their value functions are the same: \(V_*(s) = V_{\pi_*}(s)\) and \(Q_*(s, a) = Q_{\pi_*}(s, a)\). Moreover, when the action space \(\mathcal{A}\) is finite, the maximum in the Bellman optimality equation (stated immediately below) is attained at every state, so selecting any maximizer \[ \pi_*(s) \in \arg\max_a \left[ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ V_*(s') \right] \right] \] yields a deterministic optimal policy.

Theorem: Bellman's Optimality Equations

\[ \begin{align*} V_*(s) &= \max_a \left[ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ V_*(s') \right] \right] \\\\ Q_*(s, a) &= R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ \max_{a'} Q_*(s', a') \right] \end{align*} \] The optimal value functions \(V_*\) and \(Q_*\) are the unique fixed points of these equations; the right-hand side defines a \(\gamma\)-contraction in the supremum norm, and uniqueness follows from the Banach Fixed-Point Theorem (used explicitly in the value-iteration analysis below).

The optimal policy can then be derived by: \[ \begin{align*} \pi_*(s) &= \arg \max_a Q_*(s, a) \\\\ &= \arg \max_a \left[ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ V_*(s') \right] \right]. \end{align*} \] Solving for \(V_*\), \(Q_*\), or \(\pi_*\) is known as policy optimization, while computing \(V_{\pi}\) or \(Q_{\pi}\) for a given policy \(\pi\) is called policy evaluation.

Both Bellman optimality and Bellman expectation equations follow the same recursive structure: the value of a state equals the immediate reward plus the discounted value of successor states. They differ only in how the action is chosen — by maximization, \(\max_a[\,\cdot\,]\), in the optimality equations (giving the value of acting optimally), or by sampling from \(\pi\), \(\mathbb{E}_{\pi(a \mid s)}[\,\cdot\,]\), in the expectation equations (giving the value of following the fixed policy \(\pi\)).

Theorem: Bellman's Expectation Equations

For a fixed policy \(\pi\), the state-value function satisfies: \[ V_{\pi}(s) = \mathbb{E}_{\pi(a \mid s)} \left[ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ V_{\pi}(s') \right] \right] \] The action-value function satisfies: \[ Q_{\pi}(s, a) = R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ \mathbb{E}_{\pi(a' \mid s')} \left[ Q_{\pi}(s', a') \right] \right] \]

Dynamic Programming Algorithms for RL

Dynamic programming (DP) is a technique for solving optimization problems by breaking them down into simpler subproblems and storing the results to avoid redundant computations. In the context of reinforcement learning, DP methods solve MDPs exactly when the complete model (transition probabilities and rewards) is known.

The key insight of DP for MDPs is that optimal policies have the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. This is known as Bellman's principle of optimality, which leads to the recursive Bellman equations.

Value Iteration (VI) and Policy Iteration (PI) are the two classical dynamic programming methods for solving MDPs. These algorithms are fundamental to model-based reinforcement learning, where an agent first learns the MDP model from experience and then applies these DP methods to compute optimal policies.

Value Iteration (VI)

Let the initial estimate be \(V_0\). We update it as follows: \[ V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} p_T(s' \mid s, a) V_k (s') \right]. \] This is called a Bellman backup. Each iteration is a \(\gamma\)-contraction mapping in the \(\ell_\infty\) norm, satisfying: \[ \| V_{k+1} - V_* \|_\infty \leq \gamma \| V_k - V_* \|_\infty \] where \( \| V \|_\infty = \max_{s \in \mathcal{S}} |V(s)| \).

Proof of contraction.

Fix any \(s \in \mathcal{S}\). Let \(F_k(s, a) := R(s, a) + \gamma \sum_{s'} p_T(s'\mid s, a) V_k(s')\) and \(F_*(s, a) := R(s, a) + \gamma \sum_{s'} p_T(s'\mid s, a) V_*(s')\). Then \(V_{k+1}(s) = \max_a F_k(s, a)\) and \(V_*(s) = \max_a F_*(s, a)\) (the latter by Bellman optimality). Using the elementary inequality \(|\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)|\), \[ \begin{align*} |V_{k+1}(s) - V_*(s)| &\leq \max_a |F_k(s, a) - F_*(s, a)| \\\\ &= \max_a \gamma \left| \sum_{s'} p_T(s'\mid s, a)\, [V_k(s') - V_*(s')] \right| \\\\ &\leq \gamma \max_a \sum_{s'} p_T(s'\mid s, a)\, |V_k(s') - V_*(s')| \\\\ &\leq \gamma \|V_k - V_*\|_\infty, \end{align*} \] where the last step uses \(\sum_{s'} p_T(s'\mid s, a) = 1\) and \(|V_k(s') - V_*(s')| \leq \|V_k - V_*\|_\infty\). Taking \(\max_s\) on the left gives the claim.

Because \( 0 \leq \gamma < 1 \), the Banach Fixed-Point Theorem guarantees that \(V_k\) converges to the unique optimal value function \(V_*\) as \(k \to \infty\).

For all possible states \(s\), value iteration computes \(V_*(s)\) and \(\pi_*(s)\), averaging over all possible next states \(s'\) at each iteration. (Note: If we need the value and policy for only certain starting states, other methods can be used such as real-time dynamic programming).

VALUE_ITERATION Input: \(M = \left\langle \mathcal{S}, \mathcal{A}, p_T(s' | s, a), R(s, a), \gamma \right\rangle\) Output: \(V_*\), \(\pi_*\) begin Initialize \(V(s)\) arbitrarily for all \(s \in \mathcal{S}\) (typically \(V(s) = 0\)) repeat: \(\Delta \leftarrow 0\) for each \(s \in \mathcal{S}\): \(V^{\text{old}}(s) \leftarrow V(s)\) \(V(s) \leftarrow \max_a \left[R(s,a) + \gamma \sum_{s'} p_T(s'|s,a) V(s')\right]\) \(\Delta \leftarrow \max(\Delta, |V^{\text{old}}(s) - V(s)|)\) until \(\Delta < \theta\) (small threshold) // Extract optimal policy for each \(s \in \mathcal{S}\): \(\pi_*(s) \leftarrow \arg\max_a \left[R(s,a) + \gamma \sum_{s'} p_T(s'|s,a) V(s')\right]\) end

Policy Iteration (PI)

Let \(\mathbf{v} \in \mathbb{R}^{|\mathcal{S}|}\) be the value function vector where \(v_i = V_{\pi}(s_i)\). We define the expected reward vector \(\mathbf{r}\) and the state transition matrix \(\mathbf{T}\) as: \[ \mathbf{r}(s) = \sum_a \pi(a \mid s) R(s, a), \quad \mathbf{T}(s' \mid s) = \sum_a \pi(a \mid s) p_T(s' \mid s, a). \] The Bellman expectation equation for policy evaluation then forms a linear system: \[ \mathbf{v} = \mathbf{r} + \gamma \mathbf{T}\mathbf{v}. \] Theoretically, the exact solution is \(\mathbf{v} = (\mathbf{I} - \gamma \mathbf{T})^{-1} \mathbf{r}\). Since \(\mathbf{T}\) is a row-stochastic matrix, all of its eigenvalues lie in the closed unit disk, so the spectral radius \(\rho(\gamma \mathbf{T}) = \gamma \rho(\mathbf{T}) \leq \gamma < 1\). This guarantees that \(\mathbf{I} - \gamma \mathbf{T}\) is non-singular and that the Neumann series \(\sum_{k=0}^{\infty} (\gamma \mathbf{T})^k = (\mathbf{I} - \gamma \mathbf{T})^{-1}\) converges. In practice, we solve this iteratively using the update \(\mathbf{v}_{k+1} = \mathbf{r} + \gamma \mathbf{T} \mathbf{v}_k\). Once \(V_{\pi}\) is evaluated, we proceed to policy improvement to find a superior policy \(\pi'\).

Now we move on to the policy improvement step: \[ \pi'(s) = \arg \max_a \left\{ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} [V_{\pi}(s')] \right\} \] This guarantees \(V_{\pi'} \geq V_{\pi}\) componentwise. To see this, write \(\mathbf{r}', \mathbf{T}'\) for the reward vector and transition matrix induced by \(\pi'\) (analogous to \(\mathbf{r}, \mathbf{T}\) for \(\pi\)). By the definition of \(\pi'\) as the greedy improvement, \(\mathbf{r}' + \gamma \mathbf{T}' \mathbf{v} \geq \mathbf{r} + \gamma \mathbf{T} \mathbf{v} = \mathbf{v}\) componentwise, where the equality is the Bellman expectation equation for \(\pi\). Iterating this inequality, using that \(\mathbf{T}'\) has non-negative entries (so it preserves componentwise inequalities), gives \[ \begin{align*} \mathbf{v} &\leq \mathbf{r}' + \gamma \mathbf{T}'\mathbf{v} \\\\ &\leq \mathbf{r}' + \gamma \mathbf{T}'(\mathbf{r}' + \gamma \mathbf{T}'\mathbf{v}) \\\\ &\leq \cdots \leq \sum_{k=0}^{n-1} (\gamma \mathbf{T}')^k \mathbf{r}' + (\gamma \mathbf{T}')^n \mathbf{v}. \end{align*} \] Letting \(n \to \infty\): the partial sum converges to \((\mathbf{I} - \gamma \mathbf{T}')^{-1}\mathbf{r}' = \mathbf{v}'\) by the Neumann-series argument above (since \(\rho(\gamma \mathbf{T}') < 1\)), and the residual \((\gamma \mathbf{T}')^n \mathbf{v} \to \mathbf{0}\) for the same reason. Therefore \(\mathbf{v} \leq \mathbf{v}'\), i.e., \(V_{\pi} \leq V_{\pi'}\) componentwise.

POLICY_ITERATION Input: \(M = \left\langle \mathcal{S}, \mathcal{A}, p_T(s' | s, a), R(s, a), \gamma \right\rangle\) Output: \(\pi_*\) begin Initialize \(V_{\pi}(s)\) arbitrarily for all \(s \in \mathcal{S}\) (typically \(V_{\pi}(s) = 0\)) Initialize \(\pi(s)\) arbitrarily for all \(s \in \mathcal{S}\) repeat: // Policy Evaluation repeat: \(\Delta \leftarrow 0\) for each \(s \in \mathcal{S}\): \(V_{\pi}^{\text{old}}(s) \leftarrow V_{\pi}(s)\) \(V_{\pi}(s) \leftarrow \sum_a \pi(a|s) \left[R(s,a) + \gamma \sum_{s'} p_T(s'|s,a) V_{\pi}(s')\right]\) \(\Delta \leftarrow \max(\Delta, |V_{\pi}^{\text{old}}(s) - V_{\pi}(s)|)\) until \(\Delta < \theta\) (small threshold) // Policy Improvement \(\text{policy-stable} \leftarrow \text{true}\) for each \(s \in \mathcal{S}\): \(\text{old-action} \leftarrow \pi(s)\) \(\pi(s) \leftarrow \arg\max_a \left[R(s,a) + \gamma \sum_{s'} p_T(s'|s,a) V_{\pi}(s')\right]\) if \(\text{old-action} \neq \pi(s)\) then \(\text{policy-stable} \leftarrow \text{false}\) until \(\text{policy-stable} = \text{true}\) end

In policy iteration algorithm, starting from an initial policy, we alternate between policy evaluation step and policy improvement step. Since there are at most \(|\mathcal{A}|^{|\mathcal{S}|}\) deterministic policies (assuming finite \(\mathcal{S}\) and \(\mathcal{A}\)), and every iteration either leaves the policy unchanged — in which case Bellman optimality holds at every state and the policy is optimal — or strictly improves it, the algorithm must converge in a finite number of iterations.

Monte Carlo Control

Having established the theoretical foundations of MDPs and dynamic programming, we now turn to model-free methods that learn directly from experience. We begin with Monte Carlo (MC) control, a fundamental value-based method that estimates value functions by sampling complete episodes and using actual returns. The averaging of sampled returns is an instance of the general Monte Carlo estimator, and its convergence is guaranteed by Monte Carlo consistency (an application of the strong law of large numbers).

Monte Carlo control implements generalized policy iteration (GPI) — a unifying name for any scheme that interleaves policy evaluation and policy improvement at any granularity (also called generalized policy improvement in some texts). At iteration \(k\), with current policy \(\pi_k\), the three steps are:

Rollout: starting from state \(s\), take action \(a\), and sample the rest of the trajectory by following policy \(\pi_k\) until a terminal state is reached.
MC Q-estimate: compute the realized return \(G_0 = \sum_{t=0}^{T-1} \gamma^t r_t\) from this rollout and average it (over many such rollouts) into the running estimate \(\hat{Q}_{\pi_k}(s, a)\). Steps 1 and 2 together perform policy evaluation.
Policy improvement: update the policy greedily with respect to the current Q-estimate, \[ \pi_{k+1}(s) = \arg \max_{a} Q_k (s, a), \] where \(Q_k\) is the MC estimate of \(Q_{\pi_k}\) accumulated so far.

To achieve convergence to the optimal policy while maintaining adequate exploration, we can use an \(\varepsilon\)-greedy policy that balances exploitation of current knowledge with exploration of potentially better actions.

Monte Carlo methods provide unbiased estimates of \(Q_\pi\) since they use actual returns. However, MC methods are not efficient for value-based RL because they require unrolling complete trajectories whose returns are sums of random rewards generated by stochastic state transitions, leading to high variance in value estimates. Moreover, MC methods are most natural for episodic tasks (or for continuing tasks via finite-horizon truncation), since they need a complete trajectory to form each update; this makes them unsuitable for step-by-step online updates.

To address these limitations, we now turn to a more efficient approach that uses bootstrapping — updating estimates based on other estimates rather than waiting for final outcomes. Note that this concept of bootstrapping in reinforcement learning is distinct from statistical bootstrapping used in other areas of machine learning.

Temporal Difference Learning

Temporal Difference (TD) learning methods address the limitations of Monte Carlo methods through bootstrapping — updating value estimates using single-step transitions plus current estimates of successor states rather than waiting for complete returns. In other words, TD methods incrementally reduce the Bellman expectation residual for sampled states by learning from individual transitions rather than complete trajectories.

Theoretical Foundation

The TD target \(r_t + \gamma V(s_{t+1})\) is a single-sample approximation of the right-hand side of the Bellman expectation equation introduced above, \[ V_{\pi}(s) = \mathbb{E}_{\pi(a \mid s)} \left[ R(s, a) + \gamma \, \mathbb{E}_{p_T(s' \mid s, a)} \left[ V_{\pi}(s') \right] \right], \] obtained by replacing the two expectations with a single observed transition \((s_t, a_t, r_t, s_{t+1})\) drawn from the agent's actual interaction. The TD update then adjusts the current estimate \(V(s_t)\) toward this target. This establishes TD learning as a method for solving the Bellman equations through successive approximation, without requiring knowledge of transition probabilities.

Under appropriate conditions — bounded rewards, finite state space, and a learning-rate schedule \(\{\alpha_t\}\) satisfying the Robbins-Monro conditions \(\sum_{t=0}^{\infty} \alpha_t = \infty\) and \(\sum_{t=0}^{\infty} \alpha_t^2 < \infty\) — tabular TD(0) converges to the true value function \(V_{\pi}(s)\) for the policy being followed.

TD(0): One-Step Temporal Difference

The simplest TD method, TD(0), updates value estimates after each single step. Suppose the agent learns the value function \(V_\pi\) for a fixed policy \(\pi\). Given a state transition \((s_t, a_t, r_t, s_{t+1})\) where \(a_t \sim \pi(s_t)\), we update the estimate \(V(s_t)\) as follows:

TD(0) Update

\[ V(s_t) \leftarrow V(s_t) + \alpha\,\delta_t \] where:

\(\alpha \in (0,1]\) is the learning rate,
\(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\) is the TD error,
\(r_t + \gamma V(s_{t+1})\) is the TD target.

n-Step TD Methods

TD methods can be generalized to look ahead multiple steps. The n-step return is defined as: \[ G_{t:t+n} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}) \] The corresponding n-step TD update becomes: \[ V(s_t) \leftarrow V(s_t) + \alpha [G_{t:t+n} - V(s_t)] \] This provides a spectrum of methods: \(n=1\) recovers TD(0) (with target \(G_{t:t+1} = r_t + \gamma V(s_{t+1})\)), and as \(n\) increases toward the episode length, the method approaches Monte Carlo behavior by using longer actual return sequences.

TD(\(\lambda\)) and \(\lambda\)-returns

Rather than using a fixed n-step lookahead, TD(\(\lambda\)) methods use a weighted average of all n-step returns. The λ-return is defined as: \[ G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n} \] where \(\lambda \in [0,1]\) controls the weighting between different n-step returns.

The parameter \(\lambda\) creates a different type of interpolation than n-step methods:

When \(\lambda = 0\): Only the 1-step return is used, reducing to TD(0)
When \(\lambda = 1\): For episodic tasks, this gives Monte Carlo behavior through full credit assignment
For \(0 < \lambda < 1\): Creates a weighted blend that emphasizes shorter-term returns but includes longer-term effects

Note that TD(λ) with \(\lambda = 1\) achieves Monte Carlo-equivalent behavior through a different mechanism than n-step TD with large n. While n-step methods extend the lookahead horizon, TD(λ) uses eligibility traces to distribute credit backward through time, allowing for online updates even in continuing tasks.

The \(\lambda\)-return provides a principled way to interpolate between the low variance but biased estimates of TD(0) and the high variance but unbiased estimates of Monte Carlo methods. TD(0) is thus a special case of the more general TD(\(\lambda\)) framework. Importantly, both n-step TD and TD(\(\lambda\)) can achieve Monte Carlo behavior, but through different parameterizations: n-step methods by extending the lookahead horizon (large n), and TD(λ) methods through the weighting parameter (\(\lambda = 1\)).

Function Approximation

More generally, for function approximation where \(V_\mathbf{w}(s)\) is parameterized by weights \(\mathbf{w}\), the TD(0) update becomes: \[ \mathbf{w} \leftarrow \mathbf{w} + \alpha [ r_t + \gamma V_{\mathbf{w}}(s_{t+1}) - V_{\mathbf{w}}(s_t)]\nabla_{\mathbf{w}}V_{\mathbf{w}}(s_t) \] Note that with function approximation, TD methods may not always converge, unlike the tabular case.

Key Advantages

TD methods offer several key advantages that make them central to modern reinforcement learning:

Online learning: Updates occur after each time step, enabling continuous learning without waiting for episode completion.
Computational efficiency: No need to store or process entire episode returns.
Lower variance: TD(0) has lower variance due to bootstrapping, though this comes at the cost of introducing bias.
Applicability to continuing tasks: Can learn in non-episodic environments where episodes never terminate.

TD Control Methods: Q-Learning & SARSA

Having established the theoretical foundations of temporal difference learning, we now turn to specific TD control algorithms that learn optimal policies by estimating action-value functions. While the previous section focused on TD methods for policy evaluation (learning \(V_\pi\)), these methods extend TD learning to the control problem by learning Q-functions and deriving policies from them.

Both Q-learning and SARSA are temporal difference methods that bootstrap using single-step transitions, but they differ fundamentally in their approach to policy learning: one learns the optimal Q-function directly (off-policy), while the other learns the Q-function of the policy it follows (on-policy).

SARSA (On-Policy TD Control)

The agent follows policy \(\pi\) at every step to select actions, and learns its Q-function from the resulting transitions. Under a transition \((s, a, r, s')\), the TD update rule is:

SARSA Update

\[ Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma Q(s', a') - Q(s, a)\right] \] where \(a' \sim \pi(s')\) is the action that the agent will take in state \(s'\). This makes SARSA on-policy.

Note: SARSA is named after the transition tuple \((s, a, r, s', a')\).

Q-Learning (Off-Policy TD Control)

Instead of the sampled next action \(a' \sim \pi(s')\) in SARSA, Q-learning uses the greedy action in \(s'\):

Q-Learning Update

\[ Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_b Q(s', b) - Q(s, a)\right] \] The max operator \(\max_b Q(s', b)\) makes this off-policy: the update targets the optimal Q-function regardless of the behavioral policy.

Key Differences

Convergence: Q-learning provably converges to the optimal action-value function \(Q_*\) in the tabular case, provided every state-action pair is visited infinitely often and the learning rates satisfy the Robbins-Monro conditions. SARSA, viewed as pure policy evaluation, converges to \(Q_\pi\) for the policy \(\pi\) being followed; when combined with a GLIE (Greedy in the Limit with Infinite Exploration) policy — for example, \(\varepsilon\)-greedy with \(\varepsilon \to 0\) — SARSA also converges to \(Q_*\) and the optimal policy \(\pi_*\).
Policy dependence: SARSA accounts for the actual exploration policy in its updates, while Q-learning's target \(\max_b Q(s', b)\) ignores the behavioral policy entirely.
Behavioral differences: In some environments, SARSA may learn different policies than Q-learning because it considers the exploration policy's actual behavior, while Q-learning optimizes assuming greedy action selection.

Both algorithms require exploration to discover good actions, leading us to the fundamental challenge of balancing exploration and exploitation.

Policy Gradient Methods

Unlike value-based methods that learn value functions and derive policies implicitly, policy-based methods directly parameterize and optimize policies without necessarily learning value functions. These methods address fundamental limitations of value-based approaches, particularly in continuous action spaces and when stochastic policies are desired.

Our objective is the expected return of a policy: \[ \begin{align*} J(\pi) &= \mathbb{E}_{p_0(s_0)}[V_{\pi}(s_0)] \\\\ &= \mathbb{E}_{p_0(s_0)\pi(a_0 \mid s_0)}[Q_{\pi}(s_0, a_0)]. \end{align*} \]

Let \(\pi_{\boldsymbol{\theta}}\) be parameterized by \(\boldsymbol{\theta}\). We compute the gradient of the objective with respect to \(\boldsymbol{\theta}\). \[ \begin{align*} \nabla_{\boldsymbol{\theta}} J(\pi_{\boldsymbol{\theta}}) &= \mathbb{E}_{p_0(s_0)} \left[\nabla_{\boldsymbol{\theta}} \left(\sum_{a_0}\pi_{\boldsymbol{\theta}}(a_0 \mid s_0)Q_{\pi_{\boldsymbol{\theta}}}(s_0, a_0) \right) \right] \\\\ &= \mathbb{E}_{p_0(s_0)} \left[\sum_{a_0} \nabla_{\boldsymbol{\theta}} \pi_{\boldsymbol{\theta}}(a_0 \mid s_0)Q_{\pi_{\boldsymbol{\theta}}}(s_0, a_0) \right] + \mathbb{E}_{p_0(s_0)\pi_{\boldsymbol{\theta}}(a_0 \mid s_0)} [\nabla_{\boldsymbol{\theta}}Q_{\pi_{\boldsymbol{\theta}}}(s_0, a_0)]. \end{align*} \] Here, since \(R(s_0, a_0)\) does not depend on \(\boldsymbol{\theta}\), \[ \begin{align*} \nabla_{\boldsymbol{\theta}}Q_{\pi_{\boldsymbol{\theta}}}(s_0, a_0) &= \nabla_{\boldsymbol{\theta}}\left[R(s_0, a_0) + \gamma \mathbb{E}_{p_T(s_1 \mid s_0, a_0)} [V_{\pi_{\boldsymbol{\theta}}}(s_1)]\right] \\\\ &= \gamma \nabla_{\boldsymbol{\theta}} \mathbb{E}_{p_T(s_1 \mid s_0, a_0)} [V_{\pi_{\boldsymbol{\theta}}}(s_1)]. \end{align*} \]

Substituting this back and unrolling the recursion across all timesteps — where each level of recursion contributes a factor of \(\gamma\) and shifts the state distribution one step forward under \(\pi_{\boldsymbol{\theta}}\) — and then collapsing the resulting infinite sum using the discounted state-visitation distribution \(p_{\pi_{\boldsymbol{\theta}}}^\infty\), we obtain the policy gradient theorem:

Theorem: Policy Gradient Theorem

For a parameterized policy \(\pi_{\boldsymbol{\theta}}\), the gradient of the expected return is: \[ \begin{align*} \nabla_{\boldsymbol{\theta}}J(\pi_{\boldsymbol{\theta}}) &= \sum_{t = 0}^{\infty} \gamma^t \mathbb{E}_{p_t(s)} \left[\sum_a \nabla_{\boldsymbol{\theta}}\pi_{\boldsymbol{\theta}}(a \mid s) Q_{\pi_{\boldsymbol{\theta}}}(s, a)\right] \\\\ &= \frac{1}{1 -\gamma} \mathbb{E}_{p_{\pi_{\boldsymbol{\theta}}}^{\infty}(s)} \left[ \sum_a \nabla_{\boldsymbol{\theta}}\pi_{\boldsymbol{\theta}}(a \mid s) Q_{\pi_{\boldsymbol{\theta}}}(s, a) \right] \\\\ &= \frac{1}{1 -\gamma} \mathbb{E}_{p_{\pi_{\boldsymbol{\theta}}}^{\infty}(s) \pi_{\boldsymbol{\theta}}(a \mid s)} \left[ \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a \mid s) Q_{\pi_{\boldsymbol{\theta}}}(s, a) \right] \end{align*} \] where \(p_t(s)\) is the probability of visiting state \(s\) at time \(t\) if the agent starts with \(s_0 \sim p_0\) following \(\pi_{\boldsymbol{\theta}}\), and \[ p_{\pi_{\boldsymbol{\theta}}}^\infty(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t p_t(s) \] is the normalized discounted state visitation distribution.

The third equality uses the log-derivative trick \(\nabla_{\boldsymbol{\theta}} \pi_{\boldsymbol{\theta}}(a \mid s) = \pi_{\boldsymbol{\theta}}(a \mid s) \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a \mid s)\), converting the gradient of a probability into an expectation under that probability. The quantity \(\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a \mid s)\) is precisely the conditional score function of the policy distribution, which makes the policy gradient an instance of score-function estimation (and connects directly to the Fisher-information geometry discussed below).

To reduce the high variance of policy-gradient estimates, we can subtract a state-dependent baseline \(b(s)\) from \(Q_{\pi_{\boldsymbol{\theta}}}(s, a)\) inside the expectation. A key invariance property is that this subtraction does not change the gradient in expectation, because \(\mathbb{E}_{a \sim \pi_{\boldsymbol{\theta}}(\cdot \mid s)}[\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a \mid s)] = \mathbf{0}\) for any \(s\), so the baseline term integrates to zero. The policy gradient theorem then becomes: \[ \nabla_{\boldsymbol{\theta}}J(\pi_{\boldsymbol{\theta}}) = \frac{1}{1 -\gamma} \mathbb{E}_{p_{\pi_{\boldsymbol{\theta}}}^{\infty}(s) \pi_{\boldsymbol{\theta}}(a \mid s)} \left[ \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a \mid s) (Q_{\pi_{\boldsymbol{\theta}}}(s, a) - b(s)) \right]. \]

A common choice is \(b(s) = V_{\pi_{\boldsymbol{\theta}}}(s)\), which makes the multiplier the advantage function \(A_{\pi_{\boldsymbol{\theta}}}(s, a) = Q_{\pi_{\boldsymbol{\theta}}}(s, a) - V_{\pi_{\boldsymbol{\theta}}}(s)\). Intuitively, this re-centers each action's value relative to the policy's average performance at that state, sharply reducing variance whenever the policy already concentrates on near-optimal actions.

REINFORCE is the fundamental policy gradient algorithm that uses Monte Carlo estimation: \[ \begin{align*} \nabla_{\boldsymbol{\theta}} J(\pi_{\boldsymbol{\theta}}) &= \frac{1}{1 -\gamma} \mathbb{E}_{p_{\pi_{\boldsymbol{\theta}}}^{\infty}(s) \pi_{\boldsymbol{\theta}}(a \mid s)} \left[ \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a \mid s) Q_{\pi_{\boldsymbol{\theta}}}(s, a) \right]\\\\ &\approx \sum_{t=0}^{T-1} \gamma^t G_t \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t \mid s_t) \end{align*} \] where \(G_t\) is the return from time \(t\): \[ G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t-1} r_{T-1}. \] Here, using a baseline in the gradient estimate, we obtain the REINFORCE update rule: \[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \sum_{t = 0}^{T -1} \gamma^t (G_t - b(s_t)) \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}} (a_t \mid s_t). \]

REINFORCE begin Set initial policy parameters \(\boldsymbol{\theta}\), and baseline parameters \(\mathbf{w}\) repeat: Sample an episode \(\tau = (s_0, a_0, r_0, s_1, \cdots, s_T)\) using the policy \(\pi_{\boldsymbol{\theta}}\) Compute \(G_t\) for all \(t \in \{0, 1, \cdots, T-1\}\) for \(t = 0, 1, \cdots, T-1\) do \(\delta = G_t - V_\mathbf{w}(s_t)\) \(\mathbf{w} \leftarrow \mathbf{w} + \alpha_{\mathbf{w}} \delta \nabla_{\mathbf{w}}V_{\mathbf{w}}(s_t)\) \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha_{\boldsymbol{\theta}} \gamma^t \delta \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t \mid s_t) \) until converged end

Actor-Critic Methods

Actor-critic methods are hybrid algorithms that combine both policy-based and value-based approaches to address the high variance problem of pure policy gradient methods like REINFORCE. These methods maintain two separate function approximators that work together:

Actor \(\pi_{\boldsymbol{\theta}}(a \mid s)\): A parameterized policy that selects actions, where \(\boldsymbol{\theta}\) are the policy parameters.
Critic \(V_{\mathbf{w}}(s)\): A parameterized state-value function that evaluates states, where \(\mathbf{w}\) are the value function parameters.

The actor updates the policy parameters \(\boldsymbol{\theta}\) using gradients weighted by the advantage estimates provided by the critic, while the critic updates the value function parameters \(\mathbf{w}\) to more accurately estimate expected returns through temporal difference learning.

Consider the use of the one-step TD(0) method to estimate the return in the episodic case. Specifically, we replace the Monte Carlo return \(G_t\) (used in REINFORCE) with the 1-step TD target \[ G_{t:t+1} = r_t + \gamma V_{\mathbf{w}} (s_{t+1}), \] and use \(V_{\mathbf{w}}(s_t)\) as the baseline. The REINFORCE update from the previous section then becomes: \[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \sum_{t=0}^{T-1} \gamma^t (G_{t:t+1} - V_{\mathbf{w}}(s_t))\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t \mid s_t). \] Here, the increment \(r_t + \gamma V_{\mathbf{w}}(s_{t+1}) - V_{\mathbf{w}}(s_t)\) is a single-sample approximation to the advantage function \[ A_{\pi_{\boldsymbol{\theta}}}(s_t, a_t) = Q_{\pi_{\boldsymbol{\theta}}}(s_t, a_t) - V_{\pi_{\boldsymbol{\theta}}}(s_t), \] so this method is called the advantage actor-critic (or A2C).

A2C(Advantage Actor Critic) begin Set initial actor parameters \(\boldsymbol{\theta}\), and critic parameters \(\mathbf{w}\) repeat: Sample starting state \(s_0\) of a new episode for \(t = 0, 1, \cdots\) do Sample action \(a_t \sim \pi_{\boldsymbol{\theta}}(\cdot \mid s_t)\) Observe next state \(s_{t+1}\) and reward \(r_t\) \(\delta = r_t + \gamma V_\mathbf{w}(s_{t+1}) - V_\mathbf{w}(s_t)\) \(\mathbf{w} \leftarrow \mathbf{w} + \alpha_{\mathbf{w}} \delta \nabla_{\mathbf{w}}V_{\mathbf{w}}(s_t)\) \(\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha_{\boldsymbol{\theta}} \gamma^t \delta \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t \mid s_t) \) until converged end

Note that the \(n\)-step advantage estimate can be expressed as: \[ A_{\pi_{\boldsymbol{\theta}}}^{(n)}(s_t, a_t) = G_{t:t+n} - V_{\mathbf{w}}(s_t) \] where \[ G_{t:t+n} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V_{\mathbf{w}}(s_{t+n}). \] We can control the bias-variance tradeoff by adjusting \(n\).

Actor-critic methods offer several key advantages:

Reduced variance: Lower variance than pure policy gradient methods like REINFORCE through value function baselines
Sample efficiency: More efficient learning through bootstrapping compared to Monte Carlo methods
Broad applicability: Can handle both discrete and continuous action spaces with online learning capability

Among the many extensions of A2C developed in the deep reinforcement learning literature, Proximal Policy Optimization (PPO) deserves special mention: by clipping the policy update at each step to keep the new policy within a trust region around the old one, PPO substantially improves training stability while remaining simple to implement. As of the mid-2020s, PPO has become the de facto standard on-policy algorithm in continuous-control benchmarks and, notably, in reinforcement learning from human feedback (RLHF) used to align large language models.

Standard policy gradient methods optimize \(J(\pi_{\boldsymbol{\theta}})\) using the Euclidean gradient \(\nabla_{\boldsymbol{\theta}} J\), which assumes an isotropic parameter space. However, the parameter space of a stochastic policy is actually a statistical manifold, where Euclidean steps can be poorly scaled if the curvature of the distribution family varies across directions. In our natural gradient descent page, we introduce the Fisher information matrix as the Riemannian metric on this space, demonstrating how natural gradients enable more stable and efficient optimization by accounting for the underlying geometry of the policy distribution.