In many real-world problems, the uncertainty involves not just a single parameter but an entire system of interrelated
random variables - for instance, the words in a sentence, the states of a dynamical system, or the pixels in an image.
To model such systems efficiently, we need a way to represent the dependencies among many random variables
without specifying the full joint distribution explicitly. This is the role of probabilistic graphical models (PGMs).
A graphical model uses a graph to represent a collection of
random variables and their probabilistic dependencies: the vertices represent random variables, and the edges encode conditional
dependence relationships. When the graph is a directed acyclic graph (DAG), the model is called a Bayesian network.
The key structural assumption is that each variable \(X_i\) is conditionally independent of all its non-descendants given its parents
in the DAG. This assumption yields a compact factorization of the joint distribution.
Theorem: Factorization of Bayesian Networks
Let \(X_1, X_2, \ldots, X_n\) be random variables (vertices) arranged in topological order
(i.e., for every directed edge \(X_i \to X_j\), we have \(i < j\)). The joint distribution
of a Bayesian network factorizes as
\[
P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n P(X_i \mid \text{Parents}(X_i)),
\]
where \(\text{Parents}(X_i)\) denotes the set of variables with directed edges into \(X_i\),
and a variable with no parents is conditioned on the empty set: \(P(X_i \mid \varnothing) = P(X_i)\).
This factorization reduces the number of parameters needed to specify the joint distribution from exponential
in the number of variables to the total size of the local conditional distributions, which stays small whenever each variable has only a few parents. Among all Bayesian networks, one of
the simplest and most widely used structures is the linear chain, where each variable depends
only on its immediate predecessor. This leads to the concept of a Markov chain.
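To make the savings concrete, consider a chain-structured DAG \(X_1 \to X_2 \to \cdots \to X_5\) over binary variables (a small worked count):
\[
\underbrace{2^5 - 1}_{\text{full joint}} = 31
\qquad \text{vs.} \qquad
\underbrace{1}_{P(X_1)} + \underbrace{4 \times 2}_{P(X_i \mid X_{i-1})} = 9,
\]
since \(P(X_1)\) needs one free parameter and each of the four conditionals \(P(X_i \mid X_{i-1})\) needs one free parameter per parent value.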
Language Models
A Markov chain is a linear-chain Bayesian network in which each variable depends
only on a fixed number of preceding variables. This structure is fundamental across many fields -
from physics and biology to finance and natural language processing. While the core idea remains
the same, the applications and methods of analysis can differ depending on the domain. Here, we
explore Markov chains through the lens of language modeling, which provides a
concrete and intuitive setting.
Suppose our goal is to model a joint probability distribution over variable-length sequences \(P(y_{1:T})\),
where each \(y_t \in \{1, \ldots, K\}\) represents a word from a vocabulary of size \(K\).
A language model assigns probabilities to possible sentences (sequences) of length \(T\).
By the chain rule of probability,
\[
P(y_{1:T}) = P(y_1) P(y_2 \mid y_1) P(y_3 \mid y_2, y_1) \cdots = \prod_{t = 1}^{T} P(y_t \mid y_{1:t-1}).
\]
However, as \(T\) increases, this formulation becomes intractable: the conditioning
set grows with each time step, so the number of conditional distributions we would need to specify explodes. To address this, we make the Markov assumption: the next
state depends "only" on the current state.
Definition: Markov Property
A sequence of random variables \((y_1, y_2, \ldots, y_T)\) satisfies the first-order Markov
property if the conditional distribution of each variable depends only on its immediate
predecessor:
\[
P(y_t \mid y_1, y_2, \ldots, y_{t-1}) = P(y_t \mid y_{t-1}) \quad \text{for all } t \geq 2.
\]
Under the first-order Markov assumption, the joint distribution simplifies to
\[
P(y_{1:T}) = P(y_1) \prod_{t = 2}^{T} P(y_t \mid y_{t-1}).
\]
This memoryless property dramatically reduces the number of parameters needed:
from \(O(K^T)\) for the full joint distribution to \(O(K^2)\) for the transition probabilities.
The function \(P(y_t \mid y_{t-1})\) is called the transition function
(or transition kernel), and it satisfies the properties of a conditional
distribution: \(P(y_t \mid y_{t-1}) \geq 0\) and \(\sum_{k=1}^K P(y_t = k \mid y_{t-1} = j) = 1\)
for each state \(j\).
We can compactly represent all transition probabilities using a
row-stochastic matrix
(also called a conditional probability table (CPT)):
\[
A_{jk} = P(y_t = k \mid y_{t-1} = j),
\]
where each row of \(A\) sums to 1. When the transition probabilities do not change with
time (i.e., the same matrix \(A\) applies at every step), the model is said to be
time-homogeneous (or time-invariant).
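As a minimal sketch (with a hypothetical three-state vocabulary, not part of the running example), the row-stochastic matrix and the factorized joint can be written directly in code:
import numpy as np
# Hypothetical 3-state chain: pi is the initial distribution, A the row-stochastic
# transition matrix; each row of A is P(y_t = . | y_{t-1} = j) and sums to 1.
pi = np.array([0.6, 0.3, 0.1])
A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])
assert np.allclose(A.sum(axis=1), 1.0)
def sequence_prob(y, pi, A):
    # P(y_{1:T}) = P(y_1) * prod_t P(y_t | y_{t-1}) for a time-homogeneous chain.
    p = pi[y[0]]
    for t in range(1, len(y)):
        p *= A[y[t - 1], y[t]]
    return p
print(sequence_prob([0, 0, 1, 2], pi, A))  # 0.6 * 0.7 * 0.2 * 0.3 = 0.0252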
The Markov assumption can be extended to consider the last \(M\) states (the memory length):
\[
P(y_{1:T}) = P(y_{1 : M}) \prod_{t = M + 1}^T P(y_t \mid y_{t - M : t - 1}).
\]
This is known as an \(M\)-th order Markov model. In language modeling, this is equivalent to an \((M+1)\)-gram model.
For example, if \(M = 2\),
each word depends on the two preceding words, leading to a trigram model:
\[
P(y_t \mid y_{t-1}, y_{t-2}).
\]
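As a small illustration (a toy corpus and a hypothetical helper, not from the text), trigram probabilities can be estimated simply by counting:
from collections import Counter
# Toy corpus: estimate P(y_t | y_{t-2}, y_{t-1}) by counting trigrams and their contexts.
corpus = "the cat sat on the mat the cat sat on the hat".split()
trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
context_counts = Counter((w1, w2) for w1, w2, _ in zip(corpus, corpus[1:], corpus[2:]))
def trigram_prob(w1, w2, w3):
    # MLE: count(w1 w2 w3) / count of (w1 w2) as a trigram context.
    if context_counts[(w1, w2)] == 0:
        return 0.0  # unseen context; smoothing (discussed later) addresses this
    return trigram_counts[(w1, w2, w3)] / context_counts[(w1, w2)]
print(trigram_prob("the", "cat", "sat"))  # 1.0: "the cat" is always followed by "sat"
print(trigram_prob("on", "the", "mat"))   # 0.5: "on the" is followed by "mat" once and "hat" once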
Any higher-order Markov model can be converted into a first-order Markov model by redefining the state to include
the past \(M\) observations. For \(M = 2\), we define \(\tilde{y}_t = (y_{t-1}, y_t)\). Then
\[
\begin{align*}
P(\tilde{y}_{2:T}) &= P(\tilde{y}_2) \prod_{t = 3}^T P(\tilde{y}_t \mid \tilde{y}_{t-1}) \\\\
&= P(y_1, y_2) \prod_{t = 3}^T P(y_t \mid y_{t-1}, y_{t-2}).
\end{align*}
\]
This state-space augmentation is conceptually important: it shows that the first-order Markov chain is a
universal building block for sequential models, at the cost of an expanded state space of size \(K^M\).
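As a sketch of this conversion (assuming a hypothetical vocabulary of size \(K = 3\)), a second-order model can be re-encoded as a first-order chain over the \(K^2\) augmented states \((y_{t-1}, y_t)\):
import numpy as np
K = 3
rng = np.random.default_rng(0)
# Hypothetical second-order model: A2[i, j, k] = P(y_t = k | y_{t-2} = i, y_{t-1} = j).
A2 = rng.random((K, K, K))
A2 /= A2.sum(axis=2, keepdims=True)  # each conditional distribution sums to 1
# Equivalent first-order chain over augmented states (y_{t-1}, y_t); a transition
# (i, j) -> (j', k) is only possible when j' == j.
A1 = np.zeros((K * K, K * K))
for i in range(K):
    for j in range(K):
        for k in range(K):
            A1[i * K + j, j * K + k] = A2[i, j, k]
assert np.allclose(A1.sum(axis=1), 1.0)  # each augmented row is still a distribution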
With large vocabularies this exponential growth in state space makes higher-order n-gram models infeasible, and
modern approaches rely on neural language models (such as recurrent networks and transformers)
to capture long-range dependencies without explicit enumeration. Beyond language modeling, Markov chains are widely
used in sequential data modeling across many domains. In Bayesian statistics, Markov Chain Monte Carlo (MCMC) methods
exploit the Markov property to construct a stochastic process that converges to a target stationary distribution,
allowing for efficient sampling from complex, high-dimensional posterior distributions.
Parameter Estimation of Markov Models
We now derive the maximum likelihood estimators for the parameters of a first-order Markov chain. Let the parameters
be \(\boldsymbol{\theta} = (\pi, A)\), where \(\pi_j = P(x_1 = j)\) is the initial state distribution and
\(A_{jk} = P(x_t = k \mid x_{t-1} = j)\) is the transition matrix.
(We switch to the notation \(x\) for observed sequences to distinguish from the language model setting above.)
The probability of any particular sequence of length \(T\) is given by
\[
\begin{align*}
P(x_{1:T} \mid \boldsymbol\theta) &= \pi (x_1)A(x_1, x_2)\cdots A(x_{T-1}, x_T) \\\\
&= \prod_{j=1}^K (\pi_j)^{\mathbb{I}(x_1 =j)} \prod_{t=2}^T \prod_{j=1}^K \prod_{k=1}^K
(A_{jk})^{\mathbb{I}(x_t=k,\, x_{t-1}=j)}, \tag{1}
\end{align*}
\]
where \(\pi_j\) is the probability that the first symbol is \(j\), and \(A_{jk}\) is the probability of
transitioning from state \(j\) to state \(k\). The function \(\mathbb{I}(\cdot)\) is the
indicator function.
For example,
\[
\mathbb{I}(x_1 =j) = \begin{cases}
1 &\text{if \(x_1 = j\)} \\
0 &\text{otherwise}
\end{cases}.
\]
This lets us convert products over states into counts: in Equation 1, exactly one indicator is nonzero at each time step, so only one factor contributes per step.
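As a quick numerical check (a hypothetical three-state chain and a single short sequence), evaluating Equation 1 term by term agrees with grouping the factors into counts, which is exactly the form of the log-likelihood derived next:
import numpy as np
K = 3
pi = np.array([0.5, 0.3, 0.2])
A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])
x = [0, 1, 1, 2, 0]  # one observed sequence
# Direct evaluation: log pi(x_1) + sum_t log A(x_{t-1}, x_t).
direct = np.log(pi[x[0]]) + sum(np.log(A[x[t - 1], x[t]]) for t in range(1, len(x)))
# Count-based evaluation: sum_j N1_j log pi_j + sum_{j,k} N_jk log A_jk.
N1 = np.zeros(K)
N1[x[0]] += 1
Njk = np.zeros((K, K))
for t in range(1, len(x)):
    Njk[x[t - 1], x[t]] += 1
count_form = np.sum(N1 * np.log(pi)) + np.sum(Njk * np.log(A))
assert np.isclose(direct, count_form)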
The log-likelihood of a set of sequences \(\mathcal{D} = (x_1, \ldots, x_N)\), where
\(x_i = (x_{i,1}, \ldots, x_{i,T_i})\) is a sequence of length \(T_i\), is given by
\[
\begin{align*}
\log P(\mathcal{D} \mid \boldsymbol\theta) &= \sum_{i=1}^N \log P(x_i \mid \boldsymbol\theta) \\\\
&= \sum_j N_j^1 \log \pi_j + \sum_j \sum_k N_{jk} \log A_{jk}.
\end{align*}
\]
Define the following counts:
How often symbol \(j\) is seen at the "start" of a sequence:
\[N_j^1 = \sum_{i=1}^N \mathbb{I}(x_{i,1} = j)\]
How often the transition from \(j\) to \(k\) is seen "anywhere" in a sequence, and the total number of transitions out of \(j\):
\[N_{jk} = \sum_{i=1}^N \sum_{t=2}^{T_i} \mathbb{I}(x_{i,t-1} = j, \, x_{i,t} = k),
\qquad N_j = \sum_{k=1}^K N_{jk}.\]
We want to obtain the estimates \(\hat{\pi}_j\) and \(\hat{A}_{jk}\) that maximize the log-likelihood subject to the constraints:
\[
\begin{align*}
&\sum_j \pi_j = 1 \\\\
&\sum_k A_{jk} = 1 \text{ for each } j
\end{align*}
\]
For \(\pi_j\), we introduce a Lagrange multiplier \(\lambda\) and define
\[
\mathcal{L}_{\pi} = \sum_j N_j^1 \log \pi_j + \lambda \left(1 - \sum_j \pi_j \right).
\]
Then
\[
\frac{\partial \mathcal{L}_{\pi}}{\partial \pi_j} = \frac{ N_j^1}{\pi_j} - \lambda = 0
\Longrightarrow \pi_j = \frac{N_j^1}{\lambda}.
\]
Plugging into the constraint \(\sum_j \pi_j = 1\), we have
\[
\sum_j \frac{N_j^1}{\lambda} = 1 \Longrightarrow \lambda = \sum_j N_j^1.
\]
Thus,
\[
\hat{\pi}_j = \frac{N_j^1}{\sum_{j^{\prime}} {N_{j^{\prime}}^1}}.
\]
Similarly, for each row \(j\) of \(A\) we introduce a Lagrange multiplier \(\mu_j\) and define
\[
\mathcal{L}_{A} = \sum_k N_{jk} \log A_{jk} + \mu_j \left(1 - \sum_k A_{jk} \right).
\]
Then
\[
\frac{\partial \mathcal{L}_{A}}{\partial A_{jk}} = \frac{N_{jk}}{A_{jk}} - \mu_j = 0
\Longrightarrow A_{jk} = \frac{N_{jk}}{\mu_j}.
\]
Plugging into the constraint \(\sum_k A_{jk} = 1\) for each \(j\), we have
\[
\sum_k \frac{N_{jk}}{\mu_j} = 1 \Longrightarrow \mu_j = \sum_k N_{jk} = N_j.
\]
Thus,
\[
\hat{A}_{jk} = \frac{N_{jk}}{N_j}.
\]
A Code Sample (MLE in Markov Models)
To illustrate the MLE derivation above, we implement the estimation procedure on a synthetic weather dataset.
We simulate 14 days of weather observations across 10 cities (i.e., 10 independent sequences, each of length 14)
generated from a known transition matrix, then estimate the initial distribution \(\hat{\pi}\) and transition
matrix \(\hat{A}\) from the simulated data.
import numpy as np
import random
# --- Constants ---
# Define states
STATES = ['Sunny', 'Rainy', 'Cloudy', 'Stormy', 'Foggy']
STATE_TO_INDEX = {state: i for i, state in enumerate(STATES)}
INDEX_TO_STATE = {i: state for i, state in enumerate(STATES)}
# Observed 14 days of weather in 10 cities (i.e., 10 sequences, each of length 14).
DAYS = 14
CITIES = 10
# Define the true transition matrix (for data generation only)
TRUE_TRANSITION_MATRIX = np.array([
[0.5, 0.2, 0.2, 0.05, 0.05], # From Sunny
[0.3, 0.4, 0.2, 0.1, 0.0], # From Rainy
[0.4, 0.2, 0.3, 0.05, 0.05], # From Cloudy
[0.1, 0.4, 0.2, 0.3, 0.0], # From Stormy
[0.3, 0.1, 0.4, 0.0, 0.2], # From Foggy
])
# --- Sequence Generation ---
# Generate a sequence of weather states
def generate_sequence(length, transition_matrix, states, state_to_index, start_state=None):
if start_state is None:
start_state = random.choice(states)
seq = [start_state]
for _ in range(length - 1):
current_index = state_to_index[seq[-1]]
next_state = np.random.choice(states, p=transition_matrix[current_index])
seq.append(next_state)
return seq
# --- MLE Estimation ---
def estimate_mle(sequences, states, state_to_index):
num_states = len(states)
N1 = np.zeros(num_states) # Start state counts
N_jk = np.zeros((num_states, num_states)) # Transition counts
for seq in sequences:
first_idx = state_to_index[seq[0]]
N1[first_idx] += 1
for t in range(len(seq) - 1):
j = state_to_index[seq[t]]
k = state_to_index[seq[t + 1]]
N_jk[j, k] += 1
pi_hat = N1 / np.sum(N1) # Estimate start probabilities π̂
row_sums = np.sum(N_jk, axis=1, keepdims=True)
A_hat = np.divide(N_jk, row_sums, out=np.zeros_like(N_jk), where=row_sums != 0) # Estimate transition matrix Â
return pi_hat, A_hat
# --- Display Functions ---
def print_sequences(sequences):
print("Sequences:")
for i, seq in enumerate(sequences):
print(f"City {i+1}:\n {' → '.join(seq)}")
def print_start_probabilities(pi_hat, states):
print("\nEstimated Start Probabilities (π̂ ):")
for state, prob in zip(states, pi_hat):
print(f"{state:>6}: {prob:.3f}")
def print_transition_matrix(A_hat, states):
# Determine column width based on the longest state name + padding
max_len = max(len(state) for state in states)
col_width = max_len + 2
# Create header dynamically with calculated column width
header = "From \\ To".rjust(col_width) + " | " + " | ".join(f"{s:^{col_width}}" for s in states)
print(header)
print("-" * len(header))
# Print each row, formatting numbers with dynamic width
for j, row in enumerate(A_hat):
row_str = " | ".join(f"{p:{col_width}.3f}" for p in row)
print(f"{states[j]:>{col_width}} | {row_str}")
if __name__ == "__main__":
sequences = [generate_sequence(DAYS, TRUE_TRANSITION_MATRIX, STATES, STATE_TO_INDEX) for _ in range(CITIES)]
pi_hat, A_hat = estimate_mle(sequences, STATES, STATE_TO_INDEX)
print_sequences(sequences)
print_start_probabilities(pi_hat, STATES)
print("\nEstimated Transition Matrix (Â):")
print_transition_matrix(A_hat, STATES)
print("\nTrue Transition Matrix (A):")
print_transition_matrix(TRUE_TRANSITION_MATRIX, STATES)
Sparse Data & Dirichlet Prior
When we have a limited number of sequences (or sequence steps), many possible transitions might never be observed at all, or be
observed just once or twice. This situation leads to having fewer observations relative to the number of parameters (transitions).
The sparse data problem in Markov chain models is a critical issue, especially when working with many possible states.
Sparse data lead to unreliable MLE estimates (e.g., overfitting, zero-probability estimates, biased predictions, or inaccurate long-term behavior).
To address these issues, we often turn to smoothing or Bayesian approaches that generalize better.
A standard remedy is to place a Dirichlet prior on each row of the transition matrix; the simplest case is the symmetric (uniform) Dirichlet prior with \(\alpha_1 = \alpha_2 = \cdots = \alpha_K = \alpha\).
Definition: Dirichlet Prior for Transition Probabilities
The \(j\)-th row of the transition matrix \(A\), representing the probabilities of
transitioning from state \(j\) to each of the \(K\) possible states, is given a
symmetric Dirichlet prior:
\[
A_{j:} \sim \text{Dir}(\alpha \, \boldsymbol{1}).
\]
The posterior mean estimate of the transition probabilities is
\[
\hat{A}_{jk} = \frac{N_{jk} + \alpha}{N_j + K\alpha}.
\]
When \(\alpha = 1\), this is called add-one (Laplace) smoothing.
The parameter \(\alpha\) acts as a pseudocount: it is as if we observed
\(\alpha\) additional transitions to each state before seeing the data.
To derive this, assume that for the \(j\)-th row, we have observed counts \(N_{jk}\) for transitioning to state \(k\).
The total number of transitions from state \(j\) is
\[
N_j = \sum_{k=1}^K N_{jk}.
\]
The likelihood under a multinomial model is:
\[
P(\{N_{jk}\} \mid A_{j:}) \propto \prod_{k=1}^K (A_{jk})^{N_{jk}}.
\]
Now, with the uniform Dirichlet prior, we have:
\[
P(A_{j:}) \propto \prod_{k=1}^K (A_{jk})^{\alpha -1}.
\]
(The prior adds the same amount to each transition probability.)
By Bayes's rule, the posterior is proportional to the product of the likelihood and the prior:
\[
\begin{align*}
P(A_{j:} \mid \{N_{jk}\}) &\propto \left[ \prod_{k=1}^K (A_{jk})^{N_{jk}}\right] \times \left[ \prod_{k=1}^K (A_{jk})^{\alpha-1}\right] \\\\
&= \prod_{k=1}^K (A_{jk})^{N_{jk} + \alpha -1}.
\end{align*}
\]
Thus, the posterior distribution for \(A_{j:}\) is a Dirichlet distribution:
\[
A_{j:} \mid \{N_{jk}\} \sim \text{ Dir } (N_{j1} +\alpha, N_{j2} +\alpha, \cdots, \, N_{jK} + \alpha).
\]
There are two common choices for an estimator from a posterior distribution: the posterior mean and the mode (MAP estimate). For a Dirichlet
distribution, the posterior mean is given by:
\[
\begin{align*}
E [A_{jk} \mid \{N_{jk}\}] &= \frac{N_{jk}+\alpha}{\sum_{i=1}^K (N_{ji}+\alpha)} \\\\
&= \frac{N_{jk}+\alpha}{N_j + K \alpha}.
\end{align*}
\]
This estimator is often used in practice because it is always defined. The mode (the "actual" MAP estimate) is given by
\[
\text{mode }(A_{jk}) = \frac{N_{jk}+\alpha - 1}{N_j + K (\alpha - 1)}.
\]
However, note that when \(\alpha = 1\), the mode simplifies to \(\frac{N_{jk}}{N_j}\), which is exactly the Maximum Likelihood Estimate (MLE).
This means that using the MAP estimate with a uniform prior (\(\alpha = 1\)) does not provide any smoothing effect, leaving the sparse data problem unresolved.
Furthermore, in the large-sample regime, the posterior mean and the mode of the Dirichlet distribution are nearly identical.
That's why the estimator based on the posterior mean is commonly used and, for \(\alpha = 1\), is referred to as add-one smoothing.
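A minimal sketch of this estimator (hypothetical counts and a helper name of our choosing, not part of the weather example above):
import numpy as np
def smoothed_transition_estimate(N_jk, alpha=1.0):
    # Posterior-mean estimate (N_jk + alpha) / (N_j + K * alpha) under a symmetric
    # Dir(alpha) prior on each row; well defined even for rows with no observed transitions.
    K = N_jk.shape[0]
    N_j = N_jk.sum(axis=1, keepdims=True)
    return (N_jk + alpha) / (N_j + K * alpha)
# Sparse counts: state 2 was never left, and several transitions were never observed.
N_jk = np.array([[8., 0., 2.],
                 [1., 3., 0.],
                 [0., 0., 0.]])
print(smoothed_transition_estimate(N_jk, alpha=1.0))  # add-one (Laplace) smoothing
Note that the row with no observations is estimated as uniform (\(1/K\) per entry), whereas the MLE for that row would be undefined.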
It is important to note that the symmetric Dirichlet prior assumes a uniform prior (each outcome is "equally" likely a priori), which
may not be realistic in complex models. For more structured problems where capturing group-level variation is important,
hierarchical Bayesian methods provide a more flexible approach: each sequence (e.g., each city in our weather example) has
its own transition matrix, but these matrices are drawn from a common hyper-distribution, allowing the model to share statistical
strength across sequences while accommodating heterogeneity.
The Algorithmic Bridge
The transition matrix formalism links Markov chains directly to
stochastic matrices in linear algebra.
Here, the stationary distribution is characterized as the left eigenvector corresponding to the eigenvalue 1,
describing the long-run behavior of the system—a principle that powers the PageRank algorithm.
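As a small sketch (an illustrative two-state chain, assuming the chain is ergodic so the stationary distribution is unique), the stationary distribution can be read off from the eigendecomposition of \(A^\top\):
import numpy as np
A = np.array([[0.9, 0.1],
              [0.5, 0.5]])
# The stationary distribution is the left eigenvector of A with eigenvalue 1,
# i.e., a right eigenvector of A.T, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(A.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi_stat = np.real(eigvecs[:, idx])
pi_stat /= pi_stat.sum()
print(pi_stat)        # [0.8333..., 0.1666...]
print(pi_stat @ A)    # equals pi_stat, confirming pi A = pi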
In reinforcement learning, Markov chains
generalize to Markov decision processes (MDPs), where an agent's actions dynamically influence transition
probabilities. Furthermore, the Bayesian framework for estimating transition probabilities (using Dirichlet priors) provides the foundation for
hidden Markov models (HMMs) and conditional random fields (CRFs) used in sequence modeling.