In many real-world problems, the uncertainty involves not just a single parameter but an entire system of interrelated
random variables - for instance, the words in a sentence, the states of a dynamical system, or the pixels in an image.
To model such systems efficiently, we need a way to represent the dependencies among many random variables
without specifying the full joint distribution explicitly. This is the role of probabilistic graphical models (PGMs).
A graphical model uses a graph to represent a collection of
random variables and their probabilistic dependencies: the vertices represent random variables, and the edges encode conditional
dependence relationships. When the graph is a directed acyclic graph (DAG), the model is called a Bayesian network.
The key structural assumption is that each variable \(X_i\) is conditionally independent of all its non-descendants given its parents
in the DAG. This assumption yields a compact factorization of the joint distribution.
Theorem: Factorization of Bayesian Networks
Let \(X_1, X_2, \ldots, X_n\) be random variables (vertices) arranged in topological order
(i.e., for every directed edge \(X_i \to X_j\), we have \(i < j\)). The joint distribution
of a Bayesian network factorizes as
\[
P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n P(X_i \mid \text{Parents}(X_i)),
\]
where \(\text{Parents}(X_i)\) denotes the set of variables with directed edges into \(X_i\),
and a variable with no parents is conditioned on the empty set: \(P(X_i \mid \varnothing) = P(X_i)\).
This factorization reduces the number of parameters needed to specify the joint distribution from exponential
in the number of variables to the total size of the local conditional distributions, which stays small whenever each variable has only a few parents. Among all Bayesian networks, one of
the simplest and most widely used structures is the linear chain, where each variable depends
only on its immediate predecessor. This leads to the concept of a Markov chain.
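To make the savings concrete, consider a chain-structured DAG \(X_1 \to X_2 \to \cdots \to X_5\) over binary variables (a small worked count):
\[
\underbrace{2^5 - 1}_{\text{full joint}} = 31
\qquad \text{vs.} \qquad
\underbrace{1}_{P(X_1)} + \underbrace{4 \times 2}_{P(X_i \mid X_{i-1})} = 9,
\]
since \(P(X_1)\) needs one free parameter and each of the four conditionals \(P(X_i \mid X_{i-1})\) needs one free parameter per parent value.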
Language Models
A Markov chain is a linear-chain Bayesian network in which each variable depends
only on a fixed number of preceding variables. This structure is fundamental across many fields -
from physics and biology to finance and natural language processing. While the core idea remains
the same, the applications and methods of analysis can differ depending on the domain. Here, we
explore Markov chains through the lens of language modeling, which provides a
concrete and intuitive setting.
Suppose our goal is to model a joint probability distribution over variable-length sequences \(P(y_{1:T})\),
where each \(y_t \in \{1, \ldots, K\}\) represents a word from a vocabulary of size \(K\).
A language model assigns probabilities to possible sentences (sequences) of length \(T\).
By the chain rule of probability,
\[
P(y_{1:T}) = P(y_1) P(y_2 \mid y_1) P(y_3 \mid y_2, y_1) \cdots = \prod_{t = 1}^{T} P(y_t \mid y_{1:t-1}).
\]
However, as \(T\) increases, this formulation becomes intractable: the conditioning
set grows with each time step, so the number of conditional distributions we would need to specify explodes. To address this, we make the Markov assumption: the next
state depends "only" on the current state.
Definition: Markov Property
A sequence of random variables \((y_1, y_2, \ldots, y_T)\) satisfies the first-order Markov
property if the conditional distribution of each variable depends only on its immediate
predecessor:
\[
P(y_t \mid y_1, y_2, \ldots, y_{t-1}) = P(y_t \mid y_{t-1}) \quad \text{for all } t \geq 2.
\]
Under the first-order Markov assumption, the joint distribution simplifies to
\[
P(y_{1:T}) = P(y_1) \prod_{t = 2}^{T} P(y_t \mid y_{t-1}).
\]
This memoryless property dramatically reduces the number of parameters needed:
from \(O(K^T)\) for the full joint distribution to \(O(K^2)\) for the transition probabilities.
The function \(P(y_t \mid y_{t-1})\) is called the transition function
(or transition kernel), and it satisfies the properties of a conditional
distribution: \(P(y_t \mid y_{t-1}) \geq 0\) and \(\sum_{k=1}^K P(y_t = k \mid y_{t-1} = j) = 1\)
for each state \(j\).
We can compactly represent all transition probabilities using a
row-stochastic matrix
(also called a conditional probability table (CPT)):
\[
A_{jk} = P(y_t = k \mid y_{t-1} = j),
\]
where each row of \(A\) sums to 1. When the transition probabilities do not change with
time (i.e., the same matrix \(A\) applies at every step), the model is said to be
time-homogeneous (or time-invariant).
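As a minimal sketch (with a hypothetical three-state vocabulary, not part of the running example), the row-stochastic matrix and the factorized joint can be written directly in code:
import numpy as np
# Hypothetical 3-state chain: pi is the initial distribution, A the row-stochastic
# transition matrix; each row of A is P(y_t = . | y_{t-1} = j) and sums to 1.
pi = np.array([0.6, 0.3, 0.1])
A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])
assert np.allclose(A.sum(axis=1), 1.0)
def sequence_prob(y, pi, A):
    # P(y_{1:T}) = P(y_1) * prod_t P(y_t | y_{t-1}) for a time-homogeneous chain.
    p = pi[y[0]]
    for t in range(1, len(y)):
        p *= A[y[t - 1], y[t]]
    return p
print(sequence_prob([0, 0, 1, 2], pi, A))  # 0.6 * 0.7 * 0.2 * 0.3 = 0.0252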
The Markov assumption can be extended to consider the last \(M\) states (the memory length):
\[
P(y_{1:T}) = P(y_{1 : M}) \prod_{t = M + 1}^T P(y_t \mid y_{t - M : t - 1}).
\]
This is known as an \(M\)-th order Markov model. In language modeling, this is equivalent to an \((M+1)\)-gram model.
For example, if \(M = 2\),
each word depends on the two preceding words, leading to a trigram model:
\[
P(y_t \mid y_{t-1}, y_{t-2}).
\]
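As a small illustration (a toy corpus and a hypothetical helper, not from the text), trigram probabilities can be estimated simply by counting:
from collections import Counter
# Toy corpus: estimate P(y_t | y_{t-2}, y_{t-1}) by counting trigrams and their contexts.
corpus = "the cat sat on the mat the cat sat on the hat".split()
trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
context_counts = Counter((w1, w2) for w1, w2, _ in zip(corpus, corpus[1:], corpus[2:]))
def trigram_prob(w1, w2, w3):
    # MLE: count(w1 w2 w3) / count of (w1 w2) as a trigram context.
    if context_counts[(w1, w2)] == 0:
        return 0.0  # unseen context; smoothing (discussed later) addresses this
    return trigram_counts[(w1, w2, w3)] / context_counts[(w1, w2)]
print(trigram_prob("the", "cat", "sat"))  # 1.0: "the cat" is always followed by "sat"
print(trigram_prob("on", "the", "mat"))   # 0.5: "on the" is followed by "mat" once and "hat" once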
Any higher-order Markov model can be converted into a first-order Markov model by redefining the state to include
the past \(M\) observations. For \(M = 2\), we define \(\tilde{y}_t = (y_{t-1}, y_t)\). Then
\[
\begin{align*}
P(\tilde{y}_{2:T}) &= P(\tilde{y}_2) \prod_{t = 3}^T P(\tilde{y}_t \mid \tilde{y}_{t-1}) \\\\
&= P(y_1, y_2) \prod_{t = 3}^T P(y_t \mid y_{t-1}, y_{t-2}).
\end{align*}
\]
This state-space augmentation is conceptually important: it shows that the first-order Markov chain is a
universal building block for sequential models, at the cost of an expanded state space of size \(K^M\).
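As a sketch of this conversion (assuming a hypothetical vocabulary of size \(K = 3\)), a second-order model can be re-encoded as a first-order chain over the \(K^2\) augmented states \((y_{t-1}, y_t)\):
import numpy as np
K = 3
rng = np.random.default_rng(0)
# Hypothetical second-order model: A2[i, j, k] = P(y_t = k | y_{t-2} = i, y_{t-1} = j).
A2 = rng.random((K, K, K))
A2 /= A2.sum(axis=2, keepdims=True)  # each conditional distribution sums to 1
# Equivalent first-order chain over augmented states (y_{t-1}, y_t); a transition
# (i, j) -> (j', k) is only possible when j' == j.
A1 = np.zeros((K * K, K * K))
for i in range(K):
    for j in range(K):
        for k in range(K):
            A1[i * K + j, j * K + k] = A2[i, j, k]
assert np.allclose(A1.sum(axis=1), 1.0)  # each augmented row is still a distribution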
With large vocabularies this exponential growth in state space makes higher-order n-gram models infeasible, and
modern approaches rely on neural language models (such as recurrent networks and transformers)
to capture long-range dependencies without explicit enumeration. Beyond language modeling, Markov chains are widely
used in sequential data modeling across many domains. In Bayesian statistics, Markov Chain Monte Carlo (MCMC) methods
exploit the Markov property to construct a stochastic process that converges to a target stationary distribution,
allowing for efficient sampling from complex, high-dimensional posterior distributions.
Parameter Estimation of Markov Models
We now derive the maximum likelihood estimators for the parameters of a first-order Markov chain. Let the parameters
be \(\boldsymbol{\theta} = (\pi, A)\), where \(\pi_j = P(x_1 = j)\) is the initial state distribution and
\(A_{jk} = P(x_t = k \mid x_{t-1} = j)\) is the transition matrix.
(We switch to the notation \(x\) for observed sequences to distinguish from the language model setting above.)
The probability of any particular sequence of length \(T\) is given by
\[
\begin{align*}
P(x_{1:T} \mid \boldsymbol\theta) &= \pi (x_1)A(x_1, x_2)\cdots A(x_{T-1}, x_T) \\\\
&= \prod_{j=1}^K (\pi_j)^{\mathbb{I}(x_1 =j)} \prod_{t=2}^T \prod_{j=1}^K \prod_{k=1}^K
(A_{jk})^{\mathbb{I}(x_t=k,\, x_{t-1}=j)}, \tag{1}
\end{align*}
\]
where \(\pi_j\) is the probability that the first symbol is \(j\), and \(A_{jk}\) is the probability of
transitioning from state \(j\) to state \(k\). The function \(\mathbb{I}(\cdot)\) is the
indicator function.
For example,
\[
\mathbb{I}(x_1 =j) = \begin{cases}
1 &\text{if \(x_1 = j\)} \\
0 &\text{otherwise}
\end{cases}.
\]
This lets us convert products over states into counts: in Equation 1, exactly one indicator is nonzero at each time step, so only one factor contributes per step.
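As a quick numerical check (a hypothetical three-state chain and a single short sequence), evaluating Equation 1 term by term agrees with grouping the factors into counts, which is exactly the form of the log-likelihood derived next:
import numpy as np
K = 3
pi = np.array([0.5, 0.3, 0.2])
A = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])
x = [0, 1, 1, 2, 0]  # one observed sequence
# Direct evaluation: log pi(x_1) + sum_t log A(x_{t-1}, x_t).
direct = np.log(pi[x[0]]) + sum(np.log(A[x[t - 1], x[t]]) for t in range(1, len(x)))
# Count-based evaluation: sum_j N1_j log pi_j + sum_{j,k} N_jk log A_jk.
N1 = np.zeros(K)
N1[x[0]] += 1
Njk = np.zeros((K, K))
for t in range(1, len(x)):
    Njk[x[t - 1], x[t]] += 1
count_form = np.sum(N1 * np.log(pi)) + np.sum(Njk * np.log(A))
assert np.isclose(direct, count_form)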
The log-likelihood of a set of sequences \(\mathcal{D} = (x_1, \ldots, x_N)\), where
\(x_i = (x_{i,1}, \ldots, x_{i,T_i})\) is a sequence of length \(T_i\), is given by
\[
\begin{align*}
\log P(\mathcal{D} \mid \boldsymbol\theta) &= \sum_{i=1}^N \log P(x_i \mid \boldsymbol\theta) \\\\
&= \sum_j N_j^1 \log \pi_j + \sum_j \sum_k N_{jk} \log A_{jk}.
\end{align*}
\]
Define the following counts:
How often symbol \(j\) is seen at the "start" of a sequence:
\[N_j^1 = \sum_{i=1}^N \mathbb{I}(x_{i,1} = j)\]
How often the transition from \(j\) to \(k\) is seen "anywhere" in a sequence, and the total number of transitions out of \(j\):
\[N_{jk} = \sum_{i=1}^N \sum_{t=2}^{T_i} \mathbb{I}(x_{i,t-1} = j, \, x_{i,t} = k),
\qquad N_j = \sum_{k=1}^K N_{jk}.\]
We want to obtain the estimates \(\hat{\pi}_j\) and \(\hat{A}_{jk}\) that maximize the log-likelihood subject to the constraints:
\[
\begin{align*}
&\sum_j \pi_j = 1 \\\\
&\sum_k A_{jk} = 1 \text{ for each } j
\end{align*}
\]
For \(\pi_j\), we introduce a Lagrange multiplier \(\lambda\) and define
\[
\mathcal{L}_{\pi} = \sum_j N_j^1 \log \pi_j + \lambda \left(1 - \sum_j \pi_j \right).
\]
Then
\[
\frac{\partial \mathcal{L}_{\pi}}{\partial \pi_j} = \frac{ N_j^1}{\pi_j} - \lambda = 0
\Longrightarrow \pi_j = \frac{N_j^1}{\lambda}.
\]
Plugging into the constraint \(\sum_j \pi_j = 1\), we have
\[
\sum_j \frac{N_j^1}{\lambda} = 1 \Longrightarrow \lambda = \sum_j N_j^1.
\]
Thus,
\[
\hat{\pi}_j = \frac{N_j^1}{\sum_{j^{\prime}} {N_{j^{\prime}}^1}}.
\]
Similarly, for each row \(j\) of \(A\) we introduce a Lagrange multiplier \(\mu_j\) and define
\[
\mathcal{L}_{A} = \sum_k N_{jk} \log A_{jk} + \mu_j \left(1 - \sum_k A_{jk} \right).
\]
Then
\[
\frac{\partial \mathcal{L}_{A}}{\partial A_{jk}} = \frac{N_{jk}}{A_{jk}} - \mu_j = 0
\Longrightarrow A_{jk} = \frac{N_{jk}}{\mu_j}.
\]
Plugging into the constraint \(\sum_k A_{jk} = 1\) for each \(j\), we have
\[
\sum_k \frac{N_{jk}}{\mu_j} = 1 \Longrightarrow \mu_j = \sum_k N_{jk} = N_j.
\]
Thus,
\[
\hat{A}_{jk} = \frac{N_{jk}}{N_j}.
\]
A Code Sample (MLE in Markov Models)
To illustrate the MLE derivation above, we implement the estimation procedure on a synthetic weather dataset.
We simulate 14 days of weather observations across 10 cities (i.e., 10 independent sequences, each of length 14)
generated from a known transition matrix, then estimate the initial distribution \(\hat{\pi}\) and transition
matrix \(\hat{A}\) from the simulated data.
import numpy as np
import random
# --- Constants ---
# Define states
STATES = ['Sunny', 'Rainy', 'Cloudy', 'Stormy', 'Foggy']
STATE_TO_INDEX = {state: i for i, state in enumerate(STATES)}
INDEX_TO_STATE = {i: state for i, state in enumerate(STATES)}
# Observed 14 days of weather in 10 cities (i.e., 10 sequences, each of length 14).
DAYS = 14
CITIES = 10
# Define the true transition matrix (for data generation only)
TRUE_TRANSITION_MATRIX = np.array([
[0.5, 0.2, 0.2, 0.05, 0.05], # From Sunny
[0.3, 0.4, 0.2, 0.1, 0.0], # From Rainy
[0.4, 0.2, 0.3, 0.05, 0.05], # From Cloudy
[0.1, 0.4, 0.2, 0.3, 0.0], # From Stormy
[0.3, 0.1, 0.4, 0.0, 0.2], # From Foggy
])
# --- Sequence Generation ---
# Generate a sequence of weather states
def generate_sequence(length, transition_matrix, states, state_to_index, start_state=None):
if start_state is None:
start_state = random.choice(states)
seq = [start_state]
for _ in range(length - 1):
current_index = state_to_index[seq[-1]]
next_state = np.random.choice(states, p=transition_matrix[current_index])
seq.append(next_state)
return seq
# --- MLE Estimation ---
def estimate_mle(sequences, states, state_to_index):
num_states = len(states)
N1 = np.zeros(num_states) # Start state counts
N_jk = np.zeros((num_states, num_states)) # Transition counts
for seq in sequences:
first_idx = state_to_index[seq[0]]
N1[first_idx] += 1
for t in range(len(seq) - 1):
j = state_to_index[seq[t]]
k = state_to_index[seq[t + 1]]
N_jk[j, k] += 1
pi_hat = N1 / np.sum(N1) # Estimate start probabilities π̂
row_sums = np.sum(N_jk, axis=1, keepdims=True)
A_hat = np.divide(N_jk, row_sums, out=np.zeros_like(N_jk), where=row_sums != 0) # Estimate transition matrix Â
return pi_hat, A_hat
# --- Display Functions ---
def print_sequences(sequences):
print("Sequences:")
for i, seq in enumerate(sequences):
print(f"City {i+1}:\n {' → '.join(seq)}")
def print_start_probabilities(pi_hat, states):
print("\nEstimated Start Probabilities (π̂ ):")
for state, prob in zip(states, pi_hat):
print(f"{state:>6}: {prob:.3f}")
def print_transition_matrix(A_hat, states):
# Determine column width based on the longest state name + padding
max_len = max(len(state) for state in states)
col_width = max_len + 2
# Create header dynamically with calculated column width
header = "From \\ To".rjust(col_width) + " | " + " | ".join(f"{s:^{col_width}}" for s in states)
print(header)
print("-" * len(header))
# Print each row, formatting numbers with dynamic width
for j, row in enumerate(A_hat):
row_str = " | ".join(f"{p:{col_width}.3f}" for p in row)
print(f"{states[j]:>{col_width}} | {row_str}")
if __name__ == "__main__":
sequences = [generate_sequence(DAYS, TRUE_TRANSITION_MATRIX, STATES, STATE_TO_INDEX) for _ in range(CITIES)]
pi_hat, A_hat = estimate_mle(sequences, STATES, STATE_TO_INDEX)
print_sequences(sequences)
print_start_probabilities(pi_hat, STATES)
print("\nEstimated Transition Matrix (Â):")
print_transition_matrix(A_hat, STATES)
print("\nTrue Transition Matrix (A):")
print_transition_matrix(TRUE_TRANSITION_MATRIX, STATES)
Sparse Data & Dirichlet Prior
When we have a limited number of sequences (or sequence steps), many possible transitions might never be observed at all, or be
observed just once or twice. This situation leads to having fewer observations relative to the number of parameters (transitions).
The sparse data problem in Markov chain models is a critical issue, especially when working with many possible states.
Sparse data lead to unreliable MLE estimates (e.g., overfitting, zero-probability estimates, biased predictions, or inaccurate long-term behavior).
To address these issues, we often turn to smoothing or Bayesian approaches that generalize better.
A standard remedy is to place a Dirichlet prior on each row of the transition matrix; the simplest case is the symmetric (uniform) Dirichlet prior with \(\alpha_1 = \alpha_2 = \cdots = \alpha_K = \alpha\).
Definition: Dirichlet Prior for Transition Probabilities
The \(j\)-th row of the transition matrix \(A\), representing the probabilities of
transitioning from state \(j\) to each of the \(K\) possible states, is given a
symmetric Dirichlet prior:
\[
A_{j:} \sim \text{Dir}(\alpha \, \boldsymbol{1}).
\]
The posterior mean estimate of the transition probabilities is
\[
\hat{A}_{jk} = \frac{N_{jk} + \alpha}{N_j + K\alpha}.
\]
When \(\alpha = 1\), this is called add-one (Laplace) smoothing.
The parameter \(\alpha\) acts as a pseudocount: it is as if we observed
\(\alpha\) additional transitions to each state before seeing the data.
To derive this, assume that for the \(j\)-th row, we have observed counts \(N_{jk}\) for transitioning to state \(k\).
The total number of transitions from state \(j\) is
\[
N_j = \sum_{k=1}^K N_{jk}.
\]
The likelihood under a multinomial model is:
\[
P(\{N_{jk}\} \mid A_{j:}) \propto \prod_{k=1}^K (A_{jk})^{N_{jk}}.
\]
Now, with the uniform Dirichlet prior, we have:
\[
P(A_{j:}) \propto \prod_{k=1}^K (A_{jk})^{\alpha -1}.
\]
(The prior adds the same amount to each transition probability.)
By Bayes's rule, the posterior is proportional to the product of the likelihood and the prior:
\[
\begin{align*}
P(A_{j:} \mid \{N_{jk}\}) &\propto \left[ \prod_{k=1}^K (A_{jk})^{N_{jk}}\right] \times \left[ \prod_{k=1}^K (A_{jk})^{\alpha-1}\right] \\\\
&= \prod_{k=1}^K (A_{jk})^{N_{jk} + \alpha -1}.
\end{align*}
\]
Thus, the posterior distribution for \(A_{j:}\) is a Dirichlet distribution:
\[
A_{j:} \mid \{N_{jk}\} \sim \text{ Dir } (N_{j1} +\alpha, N_{j2} +\alpha, \cdots, \, N_{jK} + \alpha).
\]
There are two common choices for an estimator from a posterior distribution: the posterior mean and the mode (MAP estimate). For a Dirichlet
distribution, the posterior mean is given by:
\[
\begin{align*}
E [A_{jk} \mid \{N_{jk}\}] &= \frac{N_{jk}+\alpha}{\sum_{i=1}^K (N_{ji}+\alpha)} \\\\
&= \frac{N_{jk}+\alpha}{N_j + K \alpha}.
\end{align*}
\]
This estimator is often used in practice because it is always defined. The mode (the "actual" MAP estimate) is given by
\[
\text{mode }(A_{jk}) = \frac{N_{jk}+\alpha - 1}{N_j + K (\alpha - 1)}.
\]
However, note that when \(\alpha = 1\), the mode simplifies to \(\frac{N_{jk}}{N_j}\), which is exactly the Maximum Likelihood Estimate (MLE).
This means that using the MAP estimate with a uniform prior (\(\alpha = 1\)) does not provide any smoothing effect, leaving the sparse data problem unresolved.
Furthermore, in the large-sample regime, the posterior mean and the mode of the Dirichlet distribution are nearly identical.
That's why the estimator based on the posterior mean is commonly used and, for \(\alpha = 1\), is referred to as add-one smoothing.
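A minimal sketch of this estimator (hypothetical counts and a helper name of our choosing, not part of the weather example above):
import numpy as np
def smoothed_transition_estimate(N_jk, alpha=1.0):
    # Posterior-mean estimate (N_jk + alpha) / (N_j + K * alpha) under a symmetric
    # Dir(alpha) prior on each row; well defined even for rows with no observed transitions.
    K = N_jk.shape[0]
    N_j = N_jk.sum(axis=1, keepdims=True)
    return (N_jk + alpha) / (N_j + K * alpha)
# Sparse counts: state 2 was never left, and several transitions were never observed.
N_jk = np.array([[8., 0., 2.],
                 [1., 3., 0.],
                 [0., 0., 0.]])
print(smoothed_transition_estimate(N_jk, alpha=1.0))  # add-one (Laplace) smoothing
Note that the row with no observations is estimated as uniform (\(1/K\) per entry), whereas the MLE for that row would be undefined.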
It is important to note that the symmetric Dirichlet prior assumes a uniform prior (each outcome is "equally" likely a priori), which
may not be realistic in complex models. For more structured problems where capturing group-level variation is important,
hierarchical Bayesian methods provide a more flexible approach: each sequence (e.g., each city in our weather example) has
its own transition matrix, but these matrices are drawn from a common hyper-distribution, allowing the model to share statistical
strength across sequences while accommodating heterogeneity.
The Algorithmic Bridge
The transition matrix formalism links Markov chains directly to
stochastic matrices in linear algebra.
Here, the stationary distribution is characterized as the left eigenvector corresponding to the eigenvalue 1,
describing the long-run behavior of the system—a principle that powers the PageRank algorithm.
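As a small sketch (an illustrative two-state chain, assuming the chain is ergodic so the stationary distribution is unique), the stationary distribution can be read off from the eigendecomposition of \(A^\top\):
import numpy as np
A = np.array([[0.9, 0.1],
              [0.5, 0.5]])
# The stationary distribution is the left eigenvector of A with eigenvalue 1,
# i.e., a right eigenvector of A.T, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(A.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi_stat = np.real(eigvecs[:, idx])
pi_stat /= pi_stat.sum()
print(pi_stat)        # [0.8333..., 0.1666...]
print(pi_stat @ A)    # equals pi_stat, confirming pi A = pi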
In reinforcement learning, Markov chains
generalize to Markov decision processes (MDPs), where an agent's actions dynamically influence transition
probabilities. Furthermore, the Bayesian framework for estimating transition probabilities (using Dirichlet priors) provides the foundation for
hidden Markov models (HMMs) and conditional random fields (CRFs) used in sequence modeling.