Stochastic Matrix
Many real-world systems evolve probabilistically: a customer's next purchase depends on their current
preferences, a web user's next click depends on the current page, and weather tomorrow depends on
conditions today. To model such processes using linear algebra, we need matrices whose columns (or rows)
represent probability distributions. These are called stochastic matrices, and they
form the algebraic backbone of Markov chains.
Definition: Stochastic Matrix & Probability Vector
A probability vector \(x \in \mathbb{R}^n\) is a vector with nonnegative entries that sum
to 1:
\[
x_i \geq 0 \text{ for all } i, \quad \sum_{i=1}^n x_i = 1.
\]
A stochastic matrix (or transition matrix) \(P \in \mathbb{R}^{n \times n}\) is
a square matrix whose columns are probability vectors. That is, \(P_{ij} \geq 0\) for all \(i, j\),
and \(\sum_{i=1}^n P_{ij} = 1\) for each column \(j\).
Note that while any matrix satisfying these conditions is a stochastic matrix, further properties such as
irreducibility and aperiodicity (making the matrix regular)
are required to guarantee convergence to a unique stationary distribution.
On this page we use column-stochastic matrices, but the row-stochastic convention works equally well:
the two formulations are equivalent up to a transpose, so the choice is a matter of convention and convenience.
The row-stochastic convention is often natural for Markov chains because each row directly lists
the probabilities of moving from a given state to every other state. In a linear algebra context,
the column-stochastic convention pairs naturally with the form \(Ax = b\). (Also, many programming environments
default to column vectors.)
Insight: Row vs Column Convention in ML Frameworks
In practice, most machine learning libraries (scikit-learn, PyTorch, TensorFlow) adopt the
row-stochastic convention, where each row sums to 1 (\(\sum_j P_{ij} = 1\)). This aligns with the
standard data representation where each row corresponds to a sample. For example,
a Softmax layer applied across the final dimension of a hidden representation
produces a row-stochastic matrix.
Mathematically, if \(x\) is a column vector, the update rule for a column-stochastic matrix is \(x_{k+1} = P x_k\).
In code, using row-stochastic matrices and row vectors \(x^\top\), this becomes \(x_{k+1}^\top = x_k^\top P^\top\)
(often written simply as \(XW\) in linear layers). Always verify the matrix orientation to avoid
incorrect broadcasting or transpositions in your custom layers.
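As a quick sanity check of the row-stochastic convention, here is a minimal NumPy sketch (the `softmax` helper is written out by hand, not taken from any library): applying softmax across the last axis of a batch of logits produces a row-stochastic matrix.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# A batch of 3 samples with 4 logits each: rows are samples.
logits = np.array([[1.0, 2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0, 0.0],
                   [3.0, -2.0, 1.0, 0.7]])
probs = softmax(logits, axis=-1)

# Softmax over the last axis makes each ROW a probability vector,
# so `probs` is row-stochastic: nonnegative entries, rows summing to 1.
print(probs.sum(axis=1))
```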
Remember, a Markov chain is a sequence of probability
vectors \(x_0, x_1, x_2, \cdots \) together with a stochastic matrix \(P\)
such that
\[
x_1 = P x_0, \quad x_2 = P x_1, \quad x_3 = P x_2, \quad \cdots.
\]
So, the Markov chain is governed by the first-order difference equation:
\[
x_{k+1} = P x_k \quad \text{for } k = 0, 1, 2, \cdots.
\]
Here, \(x_k\) is called a state vector and we have:
\[
x_k = P^k x_0 \quad \text{for } k = 0, 1, 2, \cdots.
\]
Example:
Consider the following two states:
- State 1: a student is sick
- State 2: a student is not sick
We observed an initial state distribution:
\[
x_0 = \begin{bmatrix} 0.1 \\ 0.9 \end{bmatrix}
\]
which means that currently, 10% of students are sick and 90% are not.
Moreover, we assume the following conditions:
- 70% of sick students recover the next day, and 30% remain sick.
- 5% of healthy students become sick the next day, and 95% remain healthy.
So, our stochastic matrix can be written as
\[
P = \begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix}.
\]
Then,
\[
x_1 = P x_0 = \begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix} \begin{bmatrix} 0.1 \\ 0.9 \end{bmatrix} = \begin{bmatrix} 0.075 \\ 0.925 \end{bmatrix}
\]
This means that on the next day, 7.5% of students are expected to be sick and 92.5% are not.
We can continue this process:
\[
\begin{align*}
&x_2 = P x_1 = \begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix} \begin{bmatrix} 0.075 \\ 0.925 \end{bmatrix} = \begin{bmatrix} 0.06875 \\ 0.93125 \end{bmatrix} \\\\
&x_3 = P x_2 = \begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix} \begin{bmatrix} 0.06875 \\ 0.93125 \end{bmatrix} = \begin{bmatrix} 0.0671875 \\ 0.9328125 \end{bmatrix}
\end{align*}
\]
and so on.
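The iteration above can be reproduced in a few lines; this is a NumPy sketch of the sick/healthy example, using both the step-by-step update \(x_{k+1} = P x_k\) and the closed form \(x_k = P^k x_0\).

```python
import numpy as np

# Column-stochastic transition matrix from the sick/healthy example.
P = np.array([[0.30, 0.05],
              [0.70, 0.95]])
x0 = np.array([0.1, 0.9])  # 10% sick, 90% healthy

# Iterate x_{k+1} = P x_k for three days.
x = x0
for k in range(3):
    x = P @ x
    print(f"x_{k+1} = {x}")  # x_1 = [0.075, 0.925], etc.

# Equivalently, jump straight to day 3 via x_3 = P^3 x_0.
x3 = np.linalg.matrix_power(P, 3) @ x0
```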
Steady-State Vector
Definition: Steady-State Vector
A steady-state vector \(q\) for a stochastic matrix \(P\) is defined as
\[
P q = q.
\]
In statistics, it is also called the stationary distribution.
For a regular stochastic matrix (one where some power \(P^k\) has all strictly positive entries),
the Markov chain \(\{x_k : k = 0, 1, 2, \cdots\}\) converges to a unique steady-state vector \(q\) as
\(k \to \infty\), regardless of the initial distribution \(x_0\). This ensures the system eventually settles into a
predictable long-term equilibrium, losing all memory of its starting state.
Algebraically, the steady-state vector \(q\) is the principal eigenvector of \(P\) associated with the
eigenvalue \(\lambda_1 = 1\). The Perron-Frobenius theorem guarantees that for regular matrices this eigenvalue is simple and strictly dominant,
and its corresponding eigenvector can be normalized to form a valid probability distribution.
Definition: Spectral Radius
For any matrix \(A\), the spectral radius \(\rho(A)\) is defined by the maximum absolute value
of its eigenvalues. It satisfies
\[
\rho(A) \leq \| A \|
\]
for any submultiplicative matrix norm \(\| \cdot \| \).
By definition, every column of a column-stochastic matrix sums to 1, which means the maximum absolute column sum
(the 1-norm, \(\| P \|_1\)) is always 1. According to the property of submultiplicative norms,
the spectral radius satisfies:
\[
\rho(P) \leq \| P \|_1 = 1
\]
(For the row-stochastic matrix, we use the infinity norm \(\| P \|_{\infty} = 1\) to reach the
same conclusion.)
Since the vector \(\mathbf{1}^\top = [1, 1, \dots, 1]\) is a left-eigenvector of \(P\) associated with the
eigenvalue 1 (satisfying \(\mathbf{1}^\top P = \mathbf{1}^\top\)), we know that \(\lambda = 1\) is always an
eigenvalue of any stochastic matrix. Therefore, the spectral radius is exactly 1 (\(\rho(P) = 1\)).
For a regular stochastic matrix, all other eigenvalues \(\lambda_i\) satisfy \(|\lambda_i| < 1\), which ensures that
the influence of non-steady-state components vanishes as \(k \to \infty\).
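Both facts are easy to verify numerically. A NumPy sketch, using the \(P\) from the example above: the all-ones row vector is a left eigenvector for \(\lambda = 1\), and the spectral radius attains the 1-norm bound.

```python
import numpy as np

P = np.array([[0.30, 0.05],
              [0.70, 0.95]])

# 1^T P = 1^T: the all-ones row vector is a left eigenvector for lambda = 1.
ones = np.ones(2)
print(ones @ P)

# Spectral radius rho(P) = max |eigenvalue| ...
rho = max(abs(np.linalg.eigvals(P)))
print(rho)
# ... is bounded by (and here equals) the 1-norm, the max absolute column sum.
print(np.linalg.norm(P, 1))
```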
Example:
Revisiting our example,
\[
\begin{align*}
& P q = q \\\\
&\begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix} \begin{bmatrix} q_1 \\ q_2 \end{bmatrix} = \begin{bmatrix} q_1 \\ q_2 \end{bmatrix}\\\\
& q_1 + q_2 = 1 \\\\
&\Longrightarrow q = \begin{bmatrix} \frac{1}{15} \\ \frac{14}{15} \end{bmatrix}
\end{align*}
\]
which means that in the long run, about 6.67% of the students will be sick and about 93.33% will not be sick.
Find eigenvalues and corresponding eigenvectors:
\[
\begin{align*}
&\det(P - \lambda I) = 0 \\\\
&\Longrightarrow \lambda^2 - 1.25\lambda + 0.25 = 0 \\\\
&\Longrightarrow (\lambda -1)(\lambda -0.25) = 0 \\\\
&\Longrightarrow \lambda_1 = 1, \quad \lambda_2 = 0.25.
\end{align*}
\]
For \(\lambda_1 = 1\), solving \((P -I)v_1 = 0\), we obtain the corresponding eigenvector:
\[
v_1 = \begin{bmatrix} 1 \\ 14 \end{bmatrix}.
\]
(Scaling by \(\frac{1}{15}\), we can get the stationary distribution \(q = \frac{1}{15}v_1\)).
For \(\lambda_2 = 0.25\), solving \((P -0.25I)v_2 = 0\), we obtain the corresponding eigenvector:
\[
v_2 = \begin{bmatrix} - 1 \\ 1 \end{bmatrix}.
\]
So,
\[
V = \begin{bmatrix} 1 & - 1 \\ 14 & 1 \end{bmatrix}, \quad V^{-1} = \frac{1}{15} \begin{bmatrix} 1 & 1 \\ -14 & 1 \end{bmatrix},
\]
and
\[
D = \begin{bmatrix} \lambda_1 & 0\\ 0 & \lambda_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 0.25 \end{bmatrix}
\]
Thus, the transition matrix \(P\) after \(k\) steps can be written as:
\[
\begin{align*}
P^k &= V D^k V^{-1} \\\\
&= \begin{bmatrix} 1 & - 1 \\ 14 & 1 \end{bmatrix}
\begin{bmatrix} 1^k & 0 \\ 0 & (0.25)^k \end{bmatrix}
\frac{1}{15}\begin{bmatrix} 1 & 1 \\ -14 & 1 \end{bmatrix}.
\end{align*}
\]
The error in the state distribution after \(k\) steps is dominated by the term \(\lambda_2^k = (0.25)^k\). Thus, the
convergence rate of the Markov chain toward the stationary distribution is exponential, with each additional step reducing
the error roughly by a factor of \(0.25\):
\[
e_k \approx e_0 (0.25)^k
\]
where \(e_0\) is the initial error.
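This convergence rate can be checked numerically. In the following NumPy sketch the error ratio comes out as exactly \(0.25\) at every step, because \(x_0 - q\) happens to be proportional to the eigenvector \(v_2\); for a generic \(x_0\) the ratio would only approach \(0.25\).

```python
import numpy as np

P = np.array([[0.30, 0.05],
              [0.70, 0.95]])
x0 = np.array([0.1, 0.9])
q = np.array([1 / 15, 14 / 15])  # stationary distribution from the example

# Track the 1-norm error ||x_k - q||_1 over five steps.
x = x0
errors = [np.linalg.norm(x - q, 1)]
for k in range(5):
    x = P @ x
    errors.append(np.linalg.norm(x - q, 1))

# Consecutive errors shrink by a factor of |lambda_2| = 0.25.
ratios = [e1 / e0 for e0, e1 in zip(errors, errors[1:])]
print(ratios)
```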
In this example, \(P\) is diagonalizable, which allowed us
to express \(P^k\) in closed form and read off the convergence rate directly from the eigenvalues. In general,
stochastic matrices are not always diagonalizable, but the convergence result still holds under mild
conditions (irreducibility and aperiodicity) via the Perron-Frobenius theorem.
A particularly well-behaved class of stochastic matrices arises when both the columns and
the rows are probability vectors.
Definition: Doubly Stochastic Matrix
A nonnegative matrix \(A\) is said to be doubly stochastic if both the sum of each row
and the sum of each column equal 1. That is, \(A\) is both row-stochastic and column-stochastic.
Example: Doubly Stochastic Matrix (\(2 \times 2\))
For \(n = 2\), the doubly stochastic constraint forces the matrix to be symmetric.
Consider the following doubly stochastic matrix for \(0 < t \leq 1\):
\[
A = \begin{bmatrix} 1 - t & t \\ t & 1 - t \end{bmatrix}.
\]
The trace is \(\text{tr}(A) = 2 - 2t\). Since every stochastic matrix has eigenvalue
\(\lambda_1 = 1\) with corresponding eigenvector \(v_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\),
we can find the second eigenvalue from
\[
\lambda_1 + \lambda_2 = \text{tr}(A) = 2 - 2t \quad \Longrightarrow \quad \lambda_2 = 1 - 2t.
\]
Since \(A\) is symmetric and the two eigenvalues are distinct for \(t > 0\), the
spectral theorem guarantees that the eigenvectors are orthogonal.
Thus \(v_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}\), and the eigendecomposition is
\[
A = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & 1-2t \end{bmatrix}
\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}.
\]
Note that the eigenvector matrix \(V = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}\) satisfies
\(V^{-1} = \frac{1}{2}V\), since the columns are orthogonal with squared norm 2.
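The whole computation can be checked numerically; a NumPy sketch with the arbitrary choice \(t = 0.3\):

```python
import numpy as np

t = 0.3
A = np.array([[1 - t, t],
              [t, 1 - t]])

# Both row sums and column sums equal 1: A is doubly stochastic.
print(A.sum(axis=0), A.sum(axis=1))

# Rebuild A from the eigendecomposition derived above,
# using V^{-1} = V / 2 for this particular eigenvector matrix.
V = np.array([[1.0, 1.0],
              [1.0, -1.0]])
D = np.diag([1.0, 1 - 2 * t])
A_rebuilt = 0.5 * V @ D @ V
print(np.allclose(A, A_rebuilt))
```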
Insight: Doubly Stochastic Matrices in Machine Learning
Doubly stochastic matrices are the cornerstone of Optimal Transport (OT).
The Sinkhorn-Knopp algorithm transforms a raw cost matrix into a doubly stochastic
assignment matrix through iterative scaling. This approach is highly valued in modern AI
(e.g., Wasserstein GANs, SuperGlue for feature matching) because it is fully differentiable
and enables entropic regularization, turning a hard combinatorial matching problem into a smooth, GPU-friendly
optimization.
In Spectral Clustering, normalizing the affinity matrix into stochastic form yields the
random walk Laplacian. Since the steady state of a doubly stochastic matrix is always the
uniform distribution \(q = \frac{1}{n}\mathbf{1}\), these matrices are ideal for modeling
systems where no prior bias exists, such as in data shuffling, sorting networks, or fair resource allocation.
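To make the Sinkhorn-Knopp idea concrete, here is a minimal sketch in plain NumPy. It uses a fixed iteration count rather than a proper convergence test, and the positive matrix `K` is an arbitrary stand-in for an entropic kernel like \(\exp(-\text{cost})\).

```python
import numpy as np

def sinkhorn_knopp(K, n_iters=100):
    """Alternately normalize rows and columns of a positive matrix
    until it is (approximately) doubly stochastic."""
    A = K.astype(float).copy()
    for _ in range(n_iters):
        A /= A.sum(axis=1, keepdims=True)  # make rows sum to 1
        A /= A.sum(axis=0, keepdims=True)  # make columns sum to 1
    return A

rng = np.random.default_rng(0)
K = np.exp(-rng.random((4, 4)))  # positive kernel, e.g. exp(-cost)
A = sinkhorn_knopp(K)

# Both marginals are (approximately) uniform after scaling.
print(A.sum(axis=0))
print(A.sum(axis=1))
```

For positive matrices this iteration is guaranteed to converge; in practice libraries stop when both marginals are within a tolerance instead of running a fixed number of sweeps.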