Stochastic Matrix
Many real-world systems evolve probabilistically: a customer's next purchase depends on their current
preferences, a web user's next click depends on the current page, and weather tomorrow depends on
conditions today. To model such processes using linear algebra, we need matrices whose columns (or rows)
represent probability distributions. These are called stochastic matrices, and they
form the algebraic backbone of Markov chains.
Definition: Stochastic Matrix & Probability Vector
A probability vector \(x \in \mathbb{R}^n\) is a vector with nonnegative entries that sum
to 1:
\[
x_i \geq 0 \text{ for all } i, \quad \sum_{i=1}^n x_i = 1.
\]
A stochastic matrix (or transition matrix) \(P \in \mathbb{R}^{n \times n}\) is
a square matrix whose columns are probability vectors. That is, \(P_{ij} \geq 0\) for all \(i, j\),
and \(\sum_{i=1}^n P_{ij} = 1\) for each column \(j\).
Note that while any matrix satisfying these conditions is a stochastic matrix, further properties such as
irreducibility and aperiodicity (making the matrix regular)
are required to guarantee convergence to a unique stationary distribution.
On this page we use column-stochastic matrices, but the row-stochastic convention works equally well:
the two formulations are equivalent up to a transpose, so the choice is a matter of convention and convenience.
The row-stochastic convention is often natural for Markov chains because each row directly lists
the probabilities of moving from a given state to every other state. In a linear algebra context,
the column-stochastic convention pairs naturally with the form \(Ax = b\). (Also, many programming environments
default to column vectors.)
Insight: Row vs Column Convention in ML Frameworks
In practice, most machine learning libraries (scikit-learn, PyTorch, TensorFlow) adopt the
row-stochastic convention, where each row sums to 1 (\(\sum_j P_{ij} = 1\)). This aligns with the
standard data representation where each row corresponds to a sample. For example,
a Softmax layer applied across the final dimension of a hidden representation
produces a row-stochastic matrix.
Mathematically, if \(x\) is a column vector, the update rule for a column-stochastic matrix is \(x_{k+1} = P x_k\).
In code, using row-stochastic matrices and row vectors \(x^\top\), this becomes \(x_{k+1}^\top = x_k^\top P^\top\)
(often written simply as \(XW\) in linear layers). Always verify the matrix orientation to avoid
incorrect broadcasting or transpositions in your custom layers.
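As a quick sanity check of the row-stochastic convention, here is a minimal NumPy sketch (the `softmax` helper is written out by hand, not taken from any library): applying softmax across the last axis of a batch of logits produces a row-stochastic matrix.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# A batch of 3 samples with 4 logits each: rows are samples.
logits = np.array([[1.0, 2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0, 0.0],
                   [3.0, -2.0, 1.0, 0.7]])
probs = softmax(logits, axis=-1)

# Softmax over the last axis makes each ROW a probability vector,
# so `probs` is row-stochastic: nonnegative entries, rows summing to 1.
print(probs.sum(axis=1))
```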
Remember, a Markov chain is a sequence of probability
vectors \(x_0, x_1, x_2, \cdots \) together with a stochastic matrix \(P\)
such that
\[
x_1 = P x_0, \quad x_2 = P x_1, \quad x_3 = P x_2, \quad \cdots.
\]
So, the Markov chain is governed by the first-order difference equation:
\[
x_{k+1} = P x_k \quad \text{for } k = 0, 1, 2, \cdots.
\]
Here, \(x_k\) is called a state vector and we have:
\[
x_k = P^k x_0 \quad \text{for } k = 0, 1, 2, \cdots.
\]
Example:
Consider the following two states:
- State 1: a student is sick
- State 2: a student is not sick
We observed an initial state distribution:
\[
x_0 = \begin{bmatrix} 0.1 \\ 0.9 \end{bmatrix}
\]
which means that currently, 10% of students are sick and 90% are not.
Moreover, we assume the following conditions:
- 70% of sick students recover the next day, and 30% remain sick.
- 5% of healthy students become sick the next day, and 95% remain healthy.
So, our stochastic matrix can be written as
\[
P = \begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix}.
\]
Then,
\[
x_1 = P x_0 = \begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix} \begin{bmatrix} 0.1 \\ 0.9 \end{bmatrix} = \begin{bmatrix} 0.075 \\ 0.925 \end{bmatrix}
\]
This means that on the next day, 7.5% of students are expected to be sick and 92.5% are not.
We can continue this process:
\[
\begin{align*}
&x_2 = P x_1 = \begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix} \begin{bmatrix} 0.075 \\ 0.925 \end{bmatrix} = \begin{bmatrix} 0.06875 \\ 0.93125 \end{bmatrix} \\\\
&x_3 = P x_2 = \begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix} \begin{bmatrix} 0.06875 \\ 0.93125 \end{bmatrix} = \begin{bmatrix} 0.0671875 \\ 0.9328125 \end{bmatrix}
\end{align*}
\]
and so on.
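The iteration above can be reproduced in a few lines; this is a NumPy sketch of the sick/healthy example, using both the step-by-step update \(x_{k+1} = P x_k\) and the closed form \(x_k = P^k x_0\).

```python
import numpy as np

# Column-stochastic transition matrix from the sick/healthy example.
P = np.array([[0.30, 0.05],
              [0.70, 0.95]])
x0 = np.array([0.1, 0.9])  # 10% sick, 90% healthy

# Iterate x_{k+1} = P x_k for three days.
x = x0
for k in range(3):
    x = P @ x
    print(f"x_{k+1} = {x}")  # x_1 = [0.075, 0.925], etc.

# Equivalently, jump straight to day 3 via x_3 = P^3 x_0.
x3 = np.linalg.matrix_power(P, 3) @ x0
```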
Steady-State Vector
Definition: Steady-State Vector
A steady-state vector \(q\) for a stochastic matrix \(P\) is defined as
\[
P q = q.
\]
In statistics, it is also called the stationary distribution.
For a regular stochastic matrix (one where some power \(P^k\) has all strictly positive entries),
the Markov chain \(\{x_k : k = 0, 1, 2, \cdots\}\) converges to a unique steady-state vector \(q\) as
\(k \to \infty\), regardless of the initial distribution \(x_0\). This ensures the system eventually settles into a
predictable long-term equilibrium, losing all memory of its starting state.
Algebraically, the steady-state vector \(q\) is the principal eigenvector of \(P\) associated with the
eigenvalue \(\lambda_1 = 1\). The Perron-Frobenius theorem guarantees that for regular matrices this eigenvalue is simple and strictly dominant,
and its corresponding eigenvector can be normalized to form a valid probability distribution.
Definition: Spectral Radius
For any matrix \(A\), the spectral radius \(\rho(A)\) is defined by the maximum absolute value
of its eigenvalues. It satisfies
\[
\rho(A) \leq \| A \|
\]
for any submultiplicative matrix norm \(\| \cdot \| \).
By definition, every column of a column-stochastic matrix sums to 1, which means the maximum absolute column sum
(the 1-norm, \(\| P \|_1\)) is always 1. According to the property of submultiplicative norms,
the spectral radius satisfies:
\[
\rho(P) \leq \| P \|_1 = 1
\]
(For the row-stochastic matrix, we use the infinity norm \(\| P \|_{\infty} = 1\) to reach the
same conclusion.)
Since the vector \(\mathbf{1}^\top = [1, 1, \dots, 1]\) is a left-eigenvector of \(P\) associated with the
eigenvalue 1 (satisfying \(\mathbf{1}^\top P = \mathbf{1}^\top\)), we know that \(\lambda = 1\) is always an
eigenvalue of any stochastic matrix. Therefore, the spectral radius is exactly 1 (\(\rho(P) = 1\)).
For a regular stochastic matrix, all other eigenvalues \(\lambda_i\) satisfy \(|\lambda_i| < 1\), which ensures that
the influence of non-steady-state components vanishes as \(k \to \infty\).
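Both facts are easy to verify numerically. A NumPy sketch, using the \(P\) from the example above: the all-ones row vector is a left eigenvector for \(\lambda = 1\), and the spectral radius attains the 1-norm bound.

```python
import numpy as np

P = np.array([[0.30, 0.05],
              [0.70, 0.95]])

# 1^T P = 1^T: the all-ones row vector is a left eigenvector for lambda = 1.
ones = np.ones(2)
print(ones @ P)

# Spectral radius rho(P) = max |eigenvalue| ...
rho = max(abs(np.linalg.eigvals(P)))
print(rho)
# ... is bounded by (and here equals) the 1-norm, the max absolute column sum.
print(np.linalg.norm(P, 1))
```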
Example:
Revisiting our example,
\[
\begin{align*}
& P q = q \\\\
&\begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix} \begin{bmatrix} q_1 \\ q_2 \end{bmatrix} = \begin{bmatrix} q_1 \\ q_2 \end{bmatrix}\\\\
& q_1 + q_2 = 1 \\\\
&\Longrightarrow q = \begin{bmatrix} \frac{1}{15} \\ \frac{14}{15} \end{bmatrix}
\end{align*}
\]
which means that in the long run, about 6.67% of the students will be sick and about 93.33% will not be sick.
Find eigenvalues and corresponding eigenvectors:
\[
\begin{align*}
&\det(P - \lambda I) = 0 \\\\
&\Longrightarrow \lambda^2 - 1.25\lambda + 0.25 = 0 \\\\
&\Longrightarrow (\lambda -1)(\lambda -0.25) = 0 \\\\
&\Longrightarrow \lambda_1 = 1, \quad \lambda_2 = 0.25.
\end{align*}
\]
For \(\lambda_1 = 1\), solving \((P -I)v_1 = 0\), we obtain the corresponding eigenvector:
\[
v_1 = \begin{bmatrix} 1 \\ 14 \end{bmatrix}.
\]
(Scaling by \(\frac{1}{15}\), we can get the stationary distribution \(q = \frac{1}{15}v_1\)).
For \(\lambda_2 = 0.25\), solving \((P -0.25I)v_2 = 0\), we obtain the corresponding eigenvector:
\[
v_2 = \begin{bmatrix} - 1 \\ 1 \end{bmatrix}.
\]
So,
\[
V = \begin{bmatrix} 1 & - 1 \\ 14 & 1 \end{bmatrix}, \quad V^{-1} = \frac{1}{15} \begin{bmatrix} 1 & 1 \\ -14 & 1 \end{bmatrix},
\]
and
\[
D = \begin{bmatrix} \lambda_1 & 0\\ 0 & \lambda_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 0.25 \end{bmatrix}
\]
Thus, the transition matrix \(P\) after \(k\) steps can be written as:
\[
\begin{align*}
P^k &= V D^k V^{-1} \\\\
&= \begin{bmatrix} 1 & - 1 \\ 14 & 1 \end{bmatrix}
\begin{bmatrix} 1^k & 0 \\ 0 & (0.25)^k \end{bmatrix}
\frac{1}{15}\begin{bmatrix} 1 & 1 \\ -14 & 1 \end{bmatrix}.
\end{align*}
\]
The error in the state distribution after \(k\) steps is dominated by the term \(\lambda_2^k = (0.25)^k\). Thus, the
convergence rate of the Markov chain toward the stationary distribution is exponential, with each additional step reducing
the error roughly by a factor of \(0.25\):
\[
e_k \approx e_0 (0.25)^k
\]
where \(e_0\) is the initial error.
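This convergence rate can be checked numerically. In the following NumPy sketch the error ratio comes out as exactly \(0.25\) at every step, because \(x_0 - q\) happens to be proportional to the eigenvector \(v_2\); for a generic \(x_0\) the ratio would only approach \(0.25\).

```python
import numpy as np

P = np.array([[0.30, 0.05],
              [0.70, 0.95]])
x0 = np.array([0.1, 0.9])
q = np.array([1 / 15, 14 / 15])  # stationary distribution from the example

# Track the 1-norm error ||x_k - q||_1 over five steps.
x = x0
errors = [np.linalg.norm(x - q, 1)]
for k in range(5):
    x = P @ x
    errors.append(np.linalg.norm(x - q, 1))

# Consecutive errors shrink by a factor of |lambda_2| = 0.25.
ratios = [e1 / e0 for e0, e1 in zip(errors, errors[1:])]
print(ratios)
```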
In this example, \(P\) is diagonalizable, which allowed us
to express \(P^k\) in closed form and read off the convergence rate directly from the eigenvalues. In general,
stochastic matrices are not always diagonalizable, but the convergence result still holds under mild
conditions (irreducibility and aperiodicity) via the Perron-Frobenius theorem.
A particularly well-behaved class of stochastic matrices arises when both the columns and
the rows are probability vectors.
Definition: Doubly Stochastic Matrix
A nonnegative matrix \(A\) is said to be doubly stochastic if both the sum of each row
and the sum of each column equal 1. That is, \(A\) is both row-stochastic and column-stochastic.
Example: Doubly Stochastic Matrix (\(2 \times 2\))
For \(n = 2\), the doubly stochastic constraint forces the matrix to be symmetric.
Consider the following doubly stochastic matrix for \(0 < t \leq 1\):
\[
A = \begin{bmatrix} 1 - t & t \\ t & 1 - t \end{bmatrix}.
\]
The trace is \(\text{tr}(A) = 2 - 2t\). Since every stochastic matrix has eigenvalue
\(\lambda_1 = 1\) with corresponding eigenvector \(v_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\),
we can find the second eigenvalue from
\[
\lambda_1 + \lambda_2 = \text{tr}(A) = 2 - 2t \quad \Longrightarrow \quad \lambda_2 = 1 - 2t.
\]
Since \(A\) is symmetric and the two eigenvalues are distinct for \(t > 0\), the
spectral theorem guarantees that the eigenvectors are orthogonal.
Thus \(v_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}\), and the eigendecomposition is
\[
A = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & 1-2t \end{bmatrix}
\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.
\]
Note that the eigenvector matrix \(V = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}\) satisfies
\(V^{-1} = \frac{1}{2}V\), since the columns are orthogonal with squared norm 2.
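The whole computation can be checked numerically; a NumPy sketch with the arbitrary choice \(t = 0.3\):

```python
import numpy as np

t = 0.3
A = np.array([[1 - t, t],
              [t, 1 - t]])

# Both row sums and column sums equal 1: A is doubly stochastic.
print(A.sum(axis=0), A.sum(axis=1))

# Rebuild A from the eigendecomposition derived above,
# using V^{-1} = V / 2 for this particular eigenvector matrix.
V = np.array([[1.0, 1.0],
              [1.0, -1.0]])
D = np.diag([1.0, 1 - 2 * t])
A_rebuilt = 0.5 * V @ D @ V
print(np.allclose(A, A_rebuilt))
```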
Insight: Doubly Stochastic Matrices in Machine Learning
Doubly stochastic matrices are the cornerstone of Optimal Transport (OT).
The Sinkhorn-Knopp algorithm transforms a raw cost matrix into a doubly stochastic
assignment matrix through iterative scaling. This approach is highly valued in modern AI
(e.g., Wasserstein GANs, SuperGlue for feature matching) because it is fully differentiable
and enables entropic regularization, turning a hard combinatorial matching problem into a smooth, GPU-friendly
optimization.
In Spectral Clustering, normalizing the affinity matrix into stochastic form yields the
random walk Laplacian. Since the steady state of a doubly stochastic matrix is always the
uniform distribution \(q = \frac{1}{n}\mathbf{1}\), these matrices are ideal for modeling
systems where no prior bias exists, such as in data shuffling, sorting networks, or fair resource allocation.
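To make the Sinkhorn-Knopp idea concrete, here is a minimal sketch in plain NumPy. It uses a fixed iteration count rather than a proper convergence test, and the positive matrix `K` is an arbitrary stand-in for an entropic kernel like \(\exp(-\text{cost})\).

```python
import numpy as np

def sinkhorn_knopp(K, n_iters=100):
    """Alternately normalize rows and columns of a positive matrix
    until it is (approximately) doubly stochastic."""
    A = K.astype(float).copy()
    for _ in range(n_iters):
        A /= A.sum(axis=1, keepdims=True)  # make rows sum to 1
        A /= A.sum(axis=0, keepdims=True)  # make columns sum to 1
    return A

rng = np.random.default_rng(0)
K = np.exp(-rng.random((4, 4)))  # positive kernel, e.g. exp(-cost)
A = sinkhorn_knopp(K)

# Both marginals are (approximately) uniform after scaling.
print(A.sum(axis=0))
print(A.sum(axis=1))
```

For positive matrices this iteration is guaranteed to converge; in practice libraries stop when both marginals are within a tolerance instead of running a fixed number of sweeps.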