Markov Chains

Probabilistic Graphical Models Language Models Parameter Estimation of Markov Models A Code Sample (MLE in Markov Models) Sparse Data & Dirichlet Prior

Probabilistic Graphical Models

In many real-world problems, the uncertainty involves not just a single parameter but an entire system of interrelated random variables - for instance, the words in a sentence, the states of a dynamical system, or the pixels in an image. To model such systems efficiently, we need a way to represent the dependencies among many random variables without specifying the full joint distribution explicitly. This is the role of probabilistic graphical models (PGMs).

A graphical model uses a graph to represent a collection of random variables and their probabilistic dependencies: the vertices represent random variables, and the edges encode conditional dependence relationships. When the graph is a directed acyclic graph (DAG), the model is called a Bayesian network. The key structural assumption is that, after numbering the vertices in topological order, each variable \(X_i\) is conditionally independent of its non-parent predecessors given its parents. This is the ordered Markov property, formalized below, and it yields a compact factorization of the joint distribution.

Definition: Bayesian Network

A Bayesian network is a pair \((G, \{p_i\}_{i=1}^n)\), where:

  • \(G\) is a directed acyclic graph with vertices \(X_1, X_2, \ldots, X_n\); and
  • \(\{p_i\}_{i=1}^n\) is a family of conditional probability distributions, one per vertex, of the form \(p_i(x_i \mid x_{\operatorname{pa}(i)})\).

Here \(\operatorname{pa}(i) \subseteq \{1, \ldots, n\}\) denotes the parents of vertex \(i\) (the indices \(j\) such that \(X_j \to X_i\) is an edge of \(G\)), and \(X_{\operatorname{pa}(i)}\) is shorthand for the tuple of parent variables \((X_j)_{j \in \operatorname{pa}(i)}\). By convention, if \(\operatorname{pa}(i) = \varnothing\), the factor \(p_i\) reduces to the marginal \(p_i(x_i)\).

The network is said to satisfy the ordered Markov property if, after numbering the vertices in topological order (every edge \(X_i \to X_j\) satisfies \(i < j\)), each variable is conditionally independent of its non-parent predecessors given its parents: \[ X_i \perp X_{\operatorname{pred}(i) \setminus \operatorname{pa}(i)} \;\big|\; X_{\operatorname{pa}(i)} \quad \text{for all } i, \] where \(\operatorname{pred}(i) := \{1, \ldots, i-1\}\) denotes the predecessors of \(X_i\) in the ordering.

Under the ordered Markov property, the joint distribution admits a compact factorization in terms of the local conditional distributions.

Theorem: Factorization of Bayesian Networks

Let \((G, \{p_i\}_{i=1}^n)\) be a Bayesian network satisfying the ordered Markov property. Then the joint distribution of \((X_1, X_2, \ldots, X_n)\) factorizes as \[ P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n P(X_i \mid X_{\operatorname{pa}(i)}). \]

Proof.

Assume the vertices are numbered in topological order. By the chain rule of probability, \[ P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n P(X_i \mid X_1, X_2, \ldots, X_{i-1}). \] The conditioning set on the right is exactly \(X_{\operatorname{pred}(i)}\), since \(\operatorname{pred}(i) = \{1, \ldots, i-1\}\) under the topological ordering. The ordered Markov property states that, conditional on the parents \(X_{\operatorname{pa}(i)}\), the variable \(X_i\) is independent of the remaining predecessors \(X_{\operatorname{pred}(i) \setminus \operatorname{pa}(i)}\); hence the conditioning set in each factor collapses to the parents: \[ P(X_i \mid X_{\operatorname{pred}(i)}) = P(X_i \mid X_{\operatorname{pa}(i)}). \] Substituting into the chain rule expansion yields the claimed factorization. \(\blacksquare\)

Beyond Directed Graphs

This page works exclusively with directed graphical models, where the parent-child relation supplies a natural ordering. An entirely separate framework — undirected graphical models, also known as Markov random fields (MRFs) — handles symmetric pairwise interactions where no direction is meaningful. Prototypical examples include the Ising model in statistical physics, pixel grids in image processing (denoising, segmentation), and energy-based neural networks such as Boltzmann machines.

In the undirected setting, the ordered Markov property is replaced by a family of Markov properties — local, pairwise, and global — describing conditional independence in terms of graph neighborhoods or separating sets, equivalent to one another under mild conditions. Under a positivity assumption on the joint density, the celebrated Hammersley-Clifford theorem guarantees that the joint factorizes as a product of potential functions over the cliques of the graph (rather than over parent sets). We will not pursue this direction here.

This factorization reduces the number of parameters needed to specify the joint distribution from exponential (in the number of variables) to a sum of local conditional distributions. Among all Bayesian networks, one of the simplest and most widely used structures is the linear chain, where each variable depends only on its immediate predecessor. This leads to the concept of a Markov chain.

Language Models

A Markov chain is a linear-chain Bayesian network in which each variable depends only on a fixed number of preceding variables. This structure is fundamental across many fields - from physics and biology to finance and natural language processing. While the core idea remains the same, the applications and methods of analysis can differ depending on the domain. Here, we explore Markov chains through the lens of language modeling, which provides a concrete and intuitive setting.

Suppose our goal is to model a joint probability distribution over variable-length sequences \(P(y_{1:T})\), where each \(y_t \in \{1, \ldots, K\}\) represents a word from a vocabulary of size \(K\). A language model assigns probabilities to possible sentences (sequences) of length \(T\).

By the chain rule of probability, \[ P(y_{1:T}) = P(y_1) P(y_2 \mid y_1) P(y_3 \mid y_2, y_1) \cdots = \prod_{t = 1}^{T} P(y_t \mid y_{1:t-1}). \] However, as \(T\) increases, this formulation becomes computationally intractable: the conditioning set grows with each time step. To address this, we make the Markov assumption: the next state depends "only" on the current state.

Definition: First-Order Markov Property

A sequence of random variables \((y_1, y_2, \ldots, y_T)\) satisfies the first-order Markov property if the conditional distribution of each variable depends only on its immediate predecessor: \[ P(y_t \mid y_1, y_2, \ldots, y_{t-1}) = P(y_t \mid y_{t-1}) \quad \text{for all } t \geq 2. \]

Under the first-order Markov assumption, the joint distribution simplifies to \[ P(y_{1:T}) = P(y_1) \prod_{t = 2}^{T} P(y_t \mid y_{t-1}). \] This is precisely the Bayesian-network factorization specialized to the linear-chain DAG, with each \(\operatorname{pa}(t) = \{t-1\}\) for \(t \geq 2\) and \(\operatorname{pa}(1) = \varnothing\). This memoryless property (formalized above as the first-order Markov property) dramatically reduces the number of parameters needed: from \(O(K^T)\) for the full joint distribution to \(O(K^2)\) for the transition probabilities. The function \(P(y_t \mid y_{t-1})\) is called the transition function (or transition kernel — the latter name extends naturally to continuous state spaces, where the matrix is replaced by a Markov kernel between measurable spaces), and it satisfies the properties of a conditional distribution: \(P(y_t \mid y_{t-1}) \geq 0\) and \(\sum_{k=1}^K P(y_t = k \mid y_{t-1} = j) = 1\) for each state \(j\).

We can compactly represent all transition probabilities using a row-stochastic matrix (also called a conditional probability table (CPT)): \[ A_{jk} = P(y_t = k \mid y_{t-1} = j), \] where each row of \(A\) sums to 1. (Linear algebra texts often adopt the column-stochastic convention, the transpose of ours; both encode the same dynamics — see the linked page for the equivalence.) When the transition probabilities do not change with time (i.e., the same matrix \(A\) applies at every step), the model is said to be time-homogeneous (or time-invariant).

The Markov assumption can be extended to consider the last \(M\) states (or memory length): \[ P(y_{1:T}) = P(y_{1 : M}) \prod_{t = M + 1}^T P(y_t \mid y_{t - M : t - 1}). \] This is known as an \(M\)-th order Markov model. In language modeling, this is equivalent to an \(M+1\)-gram model. For example, if \(M = 2\), each word depends on the two preceding words, leading to a trigram model: \[ P(y_t \mid y_{t-1}, y_{t-2}). \]

Any higher-order Markov model can be converted into a first-order Markov model by redefining the state to include the past \(M\) observations. For \(M = 2\), we define \(\tilde{y}_t = (y_{t-1}, y_t)\). Then \[ \begin{align*} P(\tilde{y}_{1:T}) &= P(\tilde{y}_2) \prod_{t = 3}^T P(\tilde{y}_t \mid \tilde{y}_{t-1}) \\\\ &= P(y_1, y_2) \prod_{t = 3}^T P(y_t \mid y_{t-1}, y_{t-2}). \end{align*} \] This state-space augmentation is conceptually important: it shows that the first-order Markov chain is a universal building block for sequential models, at the cost of an expanded state space of size \(K^M\).

With large vocabularies this exponential growth in state space makes higher-order n-gram models infeasible, and modern approaches rely on neural language models (such as recurrent networks and transformers) to capture long-range dependencies without explicit enumeration. Beyond language modeling, Markov chains are widely used in sequential data modeling across many domains. In Bayesian statistics, Markov Chain Monte Carlo (MCMC) methods exploit the Markov property to construct a stochastic process that converges to a target stationary distribution, allowing for efficient sampling from complex, high-dimensional posterior distributions.

Parameter Estimation of Markov Models

We now derive the maximum likelihood estimators for the parameters of a first-order Markov chain. Let the parameters be \(\boldsymbol{\theta} = (\pi, A)\), where \(\pi_j = P(x_1 = j)\) is the initial state distribution and \(A_{jk} = P(x_t = k \mid x_{t-1} = j)\) is the transition matrix. (We switch to the notation \(x\) for observed sequences to distinguish from the language model setting above.)

The probability of any particular sequence of length \(T\) follows from the Bayesian-network factorization specialized to the Markov-chain DAG: \[ \begin{align*} P(x_{1:T} \mid \boldsymbol{\theta}) &= \pi_{x_1} \, A_{x_1, x_2} \, A_{x_2, x_3} \cdots A_{x_{T-1}, x_T} \\\\ &= \prod_{j=1}^K (\pi_j)^{\mathbb{1}\{x_1 =j\}} \prod_{t=2}^T \prod_{j=1}^K \prod_{k=1}^K (A_{jk})^{\mathbb{1}\{x_t=k,\, x_{t-1}=j\}}. \end{align*} \] The first equality is the Bayesian-network factorization for the linear-chain DAG (with \(\operatorname{pa}(t) = \{t-1\}\)). The second rewrites the same probability as a product over \((j, k)\) index pairs by introducing indicator functions, which will let us count transitions efficiently. Here \(\pi_j\) is the probability that the first symbol is \(j\), and \(A_{jk}\) is the probability of transitioning from state \(j\) to state \(k\); \(\mathbb{1}\{\cdot\}\) denotes the indicator function.

For example, \[ \mathbb{1}\{x_1 =j\} = \begin{cases} 1 &\text{if \(x_1 = j\)} \\\\ 0 &\text{otherwise} \end{cases}. \] This lets us convert sums and products into counts. In the indicator-power form above, only one transition happens at a time, so only one term contributes in each time step.

The log-likelihood of a set of sequences \(\mathcal{D} = (x_1, \ldots, x_N)\), where \(x_i = (x_{i\,1}, \ldots, x_{i \, T_{i}})\) is a sequence of length \(T_i\), is given by \[ \begin{align*} \log P(\mathcal{D} \mid \boldsymbol{\theta}) &= \sum_{i=1}^N \log P(x_i \mid \boldsymbol{\theta}) \\\\ &= \sum_j N_j^1 \log \pi_j + \sum_j \sum_k N_{jk} \log A_{jk}. \end{align*} \] The second equality follows by substituting the indicator-power form of \(P(x_i \mid \boldsymbol{\theta})\) above, distributing the logarithm across the products, and exchanging the order of summation: the indicator sums over sequences and time steps then collapse into the counts defined below.

Define the following counts:

We want to obtain \(\hat{\pi}_j\) and \( \hat{A}_{jk}\) that are MLE under the constraints: \[ \begin{align*} &\sum_j \pi_j = 1 \\\\ &\sum_k A_{jk} = 1 \text{ for each } j \end{align*} \] (We omit the non-negativity constraints \(\pi_j \geq 0\) and \(A_{jk} \geq 0\); they are automatically satisfied by the closed-form solutions below since the counts are non-negative.)

For \(\pi_j\), we introduce a Lagrange multiplier \(\lambda\) and define \[ \mathcal{L}_{\pi} = \sum_j N_j^1 \log \pi_j + \lambda \left(1 - \sum_j \pi_j \right). \] Then \[ \frac{\partial \mathcal{L}_{\pi}}{\partial \pi_j} = \frac{ N_j^1}{\pi_j} - \lambda = 0 \Longrightarrow \pi_j = \frac{N_j^1}{\lambda}. \]

Plug into the constraint \(\sum_j \pi_j = 1\), we have \[ \sum_j \frac{N_j^1}{\lambda} = 1 \Longrightarrow \lambda = \sum_j N_j^1. \] Thus, \[ \hat{\pi}_j = \frac{N_j^1}{\sum_{j^{\prime}} {N_{j^{\prime}}^1}}. \]

For \(A_{jk}\), we introduce Lagrange multiplier \(\mu\) and define \[ \mathcal{L}_A = \sum_j \sum_k N_{jk} \log A_{jk} + \sum_j \mu_j \left(1 - \sum_k A_{jk} \right). \] Then \[ \frac{\partial \mathcal{L}_{A}}{\partial A_{jk}} = \frac{ N_{jk}}{A_{jk}} - \mu_j = 0 \Longrightarrow A_{jk} = \frac{N_{jk}}{\mu_j}. \]

Plug into the constraint \(\sum_k A_{jk} = 1 \text{ for each } j\), we have \[ \sum_k \frac{N_{jk}}{\mu_j} = 1 \Longrightarrow \mu_j = \sum_k N_{jk} = N_j. \] Thus, \[ \hat{A}_{jk} = \frac{N_{jk}}{N_j}. \]

A Code Sample (MLE in Markov Models)

To illustrate the MLE derivation above, we implement the estimation procedure on a synthetic weather dataset. We simulate 14 days of weather observations across 10 cities (i.e., 10 independent sequences, each of length 14) generated from a known transition matrix, then estimate the initial distribution \(\hat{\pi}\) and transition matrix \(\hat{A}\) from the simulated data.

The estimator implements the closed-form MLE solutions derived above: \(\hat{\pi}_j = N_j^1 / \sum_{j'} N_{j'}^1\) and \(\hat{A}_{jk} = N_{jk} / N_j\). The corner case \(N_j = 0\) (a state never visited) is handled by setting the corresponding row of \(\hat{A}\) to zero — a pragmatic choice that foreshadows the sparse-data discussion in the next section, where a Bayesian prior will replace this ad-hoc fix.

      
                                import numpy as np
                                import random
                                
                                # --- Constants ---
                                # Define states 
                                STATES = ['Sunny', 'Rainy', 'Cloudy', 'Stormy', 'Foggy']
                                STATE_TO_INDEX = {state: i for i, state in enumerate(STATES)}
                                INDEX_TO_STATE = {i: state for i, state in enumerate(STATES)}
                                # Observed 14 days of weather in 10 cities (i.e., 10 sequences, each of length 14).
                                DAYS = 14
                                CITIES = 10
                                
                                # Define the true transition matrix (for data generation only)
                                TRUE_TRANSITION_MATRIX = np.array([
                                    [0.5, 0.2, 0.2, 0.05, 0.05],   # From Sunny
                                    [0.3, 0.4, 0.2, 0.1, 0.0],     # From Rainy
                                    [0.4, 0.2, 0.3, 0.05, 0.05],   # From Cloudy
                                    [0.1, 0.4, 0.2, 0.3, 0.0],     # From Stormy
                                    [0.3, 0.1, 0.4, 0.0, 0.2],     # From Foggy
                                ])
                                
                                # --- Sequence Generation ---
                                # Generate a sequence of weather states
                                def generate_sequence(length, transition_matrix, states, state_to_index, start_state=None):
                                    if start_state is None:
                                        start_state = random.choice(states)
                                    seq = [start_state]
                                    for _ in range(length - 1):
                                        current_index = state_to_index[seq[-1]]
                                        next_state = np.random.choice(states, p=transition_matrix[current_index])
                                        seq.append(next_state)
                                    return seq
                                
                                # --- MLE Estimation ---
                                def estimate_mle(sequences, states, state_to_index):
                                    num_states = len(states)
                                    N1 = np.zeros(num_states) # Start state counts
                                    N_jk = np.zeros((num_states, num_states)) # Transition counts
                                
                                    for seq in sequences:
                                        first_idx = state_to_index[seq[0]]
                                        N1[first_idx] += 1
                                        for t in range(len(seq) - 1):
                                            j = state_to_index[seq[t]]
                                            k = state_to_index[seq[t + 1]]
                                            N_jk[j, k] += 1
                                
                                    pi_hat = N1 / np.sum(N1) # Estimate start probabilities π̂
                                    row_sums = np.sum(N_jk, axis=1, keepdims=True) # row_sums[j] = N_j (total transitions originating from state j)
                                    # Estimate transition matrix Â. If state j is never visited (row_sums[j] = 0), 
                                    # leave A_hat[j, :] as zeros to avoid division-by-zero — this sparse-zero case 
                                    # motivates the Dirichlet prior introduced in the next section.
                                    A_hat = np.divide(N_jk, row_sums, out=np.zeros_like(N_jk), where=row_sums != 0)
                                
                                    return pi_hat, A_hat
                                
                                # --- Display Functions ---
                                def print_sequences(sequences):
                                    print("Sequences:")
                                    for i, seq in enumerate(sequences):
                                        print(f"City {i+1}:\n {' → '.join(seq)}")
                                
                                def print_start_probabilities(pi_hat, states):
                                    print("\nEstimated Start Probabilities (π̂ ):")
                                    for state, prob in zip(states, pi_hat):
                                        print(f"{state:>6}: {prob:.3f}")
                                
                                def print_transition_matrix(A_hat, states):
                                    # Determine column width based on the longest state name + padding
                                    max_len = max(len(state) for state in states)
                                    col_width = max_len + 2
                                
                                    # Create header dynamically with calculated column width
                                    header = "From \\ To".rjust(col_width) + " | " + " | ".join(f"{s:^{col_width}}" for s in states)
                                    print(header)
                                    print("-" * len(header))
                                    # Print each row, formatting numbers with dynamic width
                                    for j, row in enumerate(A_hat):
                                        row_str = " | ".join(f"{p:{col_width}.3f}" for p in row)
                                        print(f"{states[j]:>{col_width}} | {row_str}")
                                
                                if __name__ == "__main__":
                                    sequences = [generate_sequence(DAYS, TRUE_TRANSITION_MATRIX, STATES, STATE_TO_INDEX) for _ in range(CITIES)]
                                    pi_hat, A_hat = estimate_mle(sequences, STATES, STATE_TO_INDEX)
                                    print_sequences(sequences)
                                    print_start_probabilities(pi_hat, STATES)
                                    print("\nEstimated Transition Matrix (Â):")
                                    print_transition_matrix(A_hat, STATES)
                                    print("\nTrue Transition Matrix (A):")
                                    print_transition_matrix(TRUE_TRANSITION_MATRIX, STATES)
                            

Sparse Data & Dirichlet Prior

When we have a limited number of sequences (or sequence steps), many possible transitions might never be observed at all, or be observed just once or twice. This situation leads to having fewer observations relative to the number of parameters (transitions). The sparse data problem in Markov chain models is a critical issue, especially when working with many possible states. The sparse data lead to unreliable MLE estimates (e.g., overfitting, zero estimates, biased predictions, or inaccurate long-term behavior). To address these issues, we often need smoothing or Bayesian approaches to generalize better.

The simplest case is the symmetric Dirichlet prior with \(\alpha_1 = \alpha_2 = \cdots = \alpha_K = \alpha\) for some single concentration parameter \(\alpha > 0\). When \(\alpha = 1\), this further reduces to the uniform distribution on the simplex (sometimes called the flat Dirichlet prior), giving every transition probability vector \(A_{j:}\) equal density a priori.

Definition: Dirichlet Prior for Transition Probabilities

The \(j\)-th row of the transition matrix \(A\), representing the probabilities of transitioning from state \(j\) to each of the \(K\) possible states, is given a symmetric Dirichlet prior: \[ A_{j:} \sim \operatorname{Dir}(\alpha \, \boldsymbol{1}). \]

The posterior mean estimate of the transition probabilities is \[ \hat{A}_{jk} = \frac{N_{jk} + \alpha}{N_j + K\alpha}. \]

When \(\alpha = 1\), this is called add-one (Laplace) smoothing. The parameter \(\alpha\) acts as a pseudocount: it is as if we observed \(\alpha\) additional transitions to each state before seeing the data.

To derive this, assume that for the \(j\)-th row, we have observed counts \(N_{jk}\) for transitioning to state \(k\). The total number of transitions from state \(j\) is \[ N_j = \sum_{k=1}^K N_{jk}. \] The likelihood under a multinomial model is: \[ P(\{N_{jk}\} \mid A_{j:}) \propto \prod_{k=1}^K (A_{jk})^{N_{jk}}. \] Now, with the symmetric Dirichlet prior, we have: \[ P(A_{j:}) \propto \prod_{k=1}^K (A_{jk})^{\alpha -1}. \] (The prior adds the same amount to each transition probability.)

By Bayes' rule, the posterior is proportional to the product of the likelihood and the prior: \[ \begin{align*} P(A_{j:} \mid \{N_{jk}\}) &\propto \left[ \prod_{k=1}^K (A_{jk})^{N_{jk}}\right] \times \left[ \prod_{k=1}^K (A_{jk})^{\alpha-1}\right] \\\\ &= \prod_{k=1}^K (A_{jk})^{N_{jk} + \alpha -1}. \end{align*} \] Thus, the posterior distribution for \(A_{j:}\) is a Dirichlet distribution: \[ A_{j:} \mid \{N_{jk}\} \sim \operatorname{Dir}(N_{j1} +\alpha, N_{j2} +\alpha, \ldots, N_{jK} + \alpha). \] The fact that a Dirichlet prior combined with a multinomial-form likelihood yields a Dirichlet posterior is the canonical example of a conjugate prior relationship — the prior and posterior live in the same parametric family, so updating reduces to a simple update of the hyperparameters.

There are two common choices for an estimator from a posterior distribution: the posterior mean and the mode (MAP estimate). For a Dirichlet distribution, the posterior mean is given by: \[ \begin{align*} E [A_{jk} \mid \{N_{jk}\}] &= \frac{N_{jk}+\alpha}{\sum_{i=1}^K (N_{ji}+\alpha)} \\\\ &= \frac{N_{jk}+\alpha}{N_j + K \alpha}. \end{align*} \] This estimator is often used in practice because it is always defined. The mode (the "actual" MAP estimate) is given by \[ \operatorname{mode}(A_{jk}) = \frac{N_{jk}+\alpha - 1}{N_j + K (\alpha - 1)}. \] (Strictly speaking, this formula assumes \(\alpha > 1\) so that the posterior Dirichlet has an interior mode; for \(\alpha \leq 1\) the density concentrates on the boundary of the simplex and the notion of mode is degenerate. The case \(\alpha = 1\) below is best read as the formal limit of this expression.)

However, note that when \(\alpha = 1\), the mode simplifies to \(\frac{N_{jk}}{N_j}\), which is exactly the Maximum Likelihood Estimate (MLE). This means that using the MAP estimate with a uniform prior (\(\alpha = 1\)) does not provide any smoothing effect, leaving the sparse data problem unresolved. Furthermore, in the large-sample regime, the posterior mean and the mode of the Dirichlet distribution are nearly identical. That's why the estimator based on the posterior mean is commonly used and, for \(\alpha = 1\), is referred to as add-one smoothing.

It is important to note that the symmetric Dirichlet prior treats all transition probabilities exchangeably a priori (each outcome receives the same prior weight), which may not be realistic in complex models. For more structured problems where capturing group-level variation is important, hierarchical Bayesian methods provide a more flexible approach: each sequence (e.g., each city in our weather example) has its own transition matrix, but these matrices are drawn from a common hyper-distribution, allowing the model to share statistical strength across sequences while accommodating heterogeneity.

The Algorithmic Bridge

The transition matrix formalism links Markov chains directly to stochastic matrices in linear algebra. Here, the stationary distribution is characterized as the left eigenvector corresponding to the eigenvalue 1, describing the long-run behavior of the system—a principle that powers the PageRank algorithm.

In reinforcement learning, Markov chains generalize to Markov decision processes (MDPs), where an agent's actions dynamically influence transition probabilities. The Bayesian framework for estimating transition counts (using Dirichlet priors) extends naturally to hidden Markov models (HMMs), which augment Markov chains with latent states and emission distributions. Conditional random fields (CRFs) offer a complementary, discriminative approach to sequence labeling — they are undirected graphical models (in the Markov-random-field family alluded to earlier in this page) that model the conditional distribution of labels given inputs, typically trained by regularized maximum likelihood rather than by Bayesian conjugate analysis.