Stochastic Matrix

Stochastic Matrix Steady-State Vector Doubly Stochastic Matrices

Stochastic Matrix

The previous pages of Section I built up a toolkit — eigenvalues and eigenvectors, diagonalization, matrix norms, the spectral theorem for symmetric matrices, block manipulations. So far these tools have been exercised mostly on static algebraic questions: decomposing a matrix, inverting a low-rank update, computing a determinant. This page is where the same toolkit begins to analyze a dynamical question: given a rule that evolves a state probabilistically, what happens in the long run?

Many real-world systems evolve probabilistically: a customer's next purchase depends on their current preferences, a web user's next click depends on the current page, and weather tomorrow depends on conditions today. To model such processes using linear algebra, we need matrices whose columns (or rows) represent probability distributions. These are called stochastic matrices, and they form the algebraic backbone of Markov chains. The eigenvalue structure of a stochastic matrix, we will see, encodes exactly the long-run behavior of the process it governs — a foreshadowing of the deeper probabilistic machinery developed in Section III.

Definition: Probability Vector

A probability vector \(\mathbf{x} \in \mathbb{R}^n\) is a vector with nonnegative entries that sum to 1: \[ x_i \geq 0 \text{ for all } i, \quad \sum_{i=1}^n x_i = 1. \]

Definition: Stochastic Matrix

A stochastic matrix (or transition matrix) \(P \in \mathbb{R}^{n \times n}\) is a square matrix whose columns are probability vectors. That is, \(P_{ij} \geq 0\) for all \(i, j\), and \(\sum_{i=1}^n P_{ij} = 1\) for each column \(j\).

Not every stochastic matrix yields a convergent Markov chain; additional structural conditions on \(P\) — specifically, a combination of irreducibility (every state can reach every other) and aperiodicity (no cyclic return constraint), sometimes packaged together as regularity — are needed to guarantee convergence to a unique stationary distribution. The formal definitions belong to Markov chain theory, treated in Section III; for the linear-algebra discussion below we use "regular" informally, defining it through its spectral consequence at the point where it matters.

In this page, we will use the column-stochastic matrix, but we can also use the row-stochastic matrix. Essentially, the choice between a row-stochastic and a column-stochastic matrix is a matter of convention and convenience, and both formulations are equivalent up to taking a transpose. Using the row-stochastic matrix is often natural in Markov chains because each row directly tells you the probabilities of transitioning from a given state to all other states. In linear algebra context, due to the form of \(Ax = b\), the column-stochastic matrix can be chosen more often. (Also, many programming environments default to column vectors.)

Insight: Row vs Column Convention in ML Frameworks

In practice, most machine learning libraries (scikit-learn, PyTorch, TensorFlow) adopt the row-stochastic convention, where each row \(\sum_j P_{ij} = 1\). This aligns with the standard data representation where each row corresponds to a sample. For example, a Softmax layer applied across the final dimension of a hidden representation produces a row-stochastic matrix.

Mathematically, if \(\mathbf{x}\) is a column vector, the update rule for a column-stochastic matrix is \(\mathbf{x}_{k+1} = P \mathbf{x}_k\). In code, using row-stochastic matrices and row vectors \(\mathbf{x}^\top\), this becomes \(\mathbf{x}_{k+1}^\top = \mathbf{x}_k^\top P^\top\) (often written simply as \(XW\) in linear layers). Always verify the matrix orientation to avoid incorrect broadcasting or transpositions in your custom layers.

Remember, a Markov chain is a sequence of probability vectors \(\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \cdots \) together with a stochastic matrix \(P\) such that \[ \mathbf{x}_1 = P \mathbf{x}_0, \quad \mathbf{x}_2 = P \mathbf{x}_1, \quad \mathbf{x}_3 = P \mathbf{x}_2, \quad \cdots. \] So, the Markov chain is explained by the first-order difference equation: \[ \mathbf{x}_{k+1} = P \mathbf{x}_k \quad \text{for } k = 0, 1, 2, \cdots. \] Here, \(\mathbf{x}_k\) is called a state vector and we have: \[ \mathbf{x}_k = P^k \mathbf{x}_0 \quad \text{for } k = 0, 1, 2, \cdots. \]

Example:

Consider the following two states:

  • State 1: a student is sick
  • State 2: a student is not sick

We observed an initial state distribution: \[ \mathbf{x}_0 = \begin{bmatrix} 0.1 \\ 0.9 \end{bmatrix} \] which means that currently, 10% of students are sick and 90% are not.

Moreover, we assume the following conditions:

  • 70% of sick students recover the next day, and 30% remain sick.
  • 5% of healthy students become sick the next day, and 95% remain healthy.

So, our stochastic matrix can be written as \[ P = \begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix}. \] Then, \[ \mathbf{x}_1 = P \mathbf{x}_0 = \begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix} \begin{bmatrix} 0.1 \\ 0.9 \end{bmatrix} = \begin{bmatrix} 0.075 \\ 0.925 \end{bmatrix} \] This means that on the next day, approximately 7.5% students are expected to be sick and 92.5% are not.

We can continue this process: \[ \begin{align*} &\mathbf{x}_2 = P \mathbf{x}_1 = \begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix} \begin{bmatrix} 0.075 \\ 0.925 \end{bmatrix} = \begin{bmatrix} 0.06875 \\ 0.93125 \end{bmatrix} \\\\ &\mathbf{x}_3 = P \mathbf{x}_2 = \begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix} \begin{bmatrix} 0.06875 \\ 0.93125 \end{bmatrix} = \begin{bmatrix} 0.0671875 \\ 0.9328125 \end{bmatrix} \end{align*} \] and so on.

Steady-State Vector

Definition: Steady-State Vector

A steady-state vector \(\mathbf{q}\) for a stochastic matrix \(P\) is defined as \[ P \mathbf{q} = \mathbf{q}. \] In statistics, we also call it stationary distribution.

For a regular stochastic matrix (one where some power \(P^k\) has all strictly positive entries), the Markov chain \(\{\mathbf{x}_k : k = 1, 2, \cdots\}\) converges to a unique steady-state vector \(\mathbf{q}\) as \(k \to \infty\), regardless of the initial distribution \(\mathbf{x}_0\). This ensures the system eventually settles into a predictable long-term equilibrium, losing all memory of its starting state.

Algebraically, the steady-state vector \(\mathbf{q}\) is the principal eigenvector of \(P\) associated with the eigenvalue \(\lambda_1 = 1\). The uniqueness of this eigenvalue, and the fact that its eigenvector can always be normalized into a valid probability distribution, are consequences of the Perron-Frobenius theorem for nonnegative matrices — a result whose rigorous justification belongs to Section III, where Markov chain theory is developed in full. Here we take these facts as given and focus on the linear-algebraic structure they imply.

Definition: Spectral Radius

For any matrix \(A\), the spectral radius \(\rho(A)\) is defined by the maximum absolute value of its eigenvalues. It satisfies \[ \rho(A) \leq \| A \| \] for any submultiplicative matrix norm \(\| \cdot \| \).

Theorem: Spectral Radius of a Stochastic Matrix

For any stochastic matrix \(P\), \(\lambda = 1\) is an eigenvalue, and every eigenvalue \(\lambda_i\) satisfies \(|\lambda_i| \leq 1\). Consequently, the spectral radius is \(\rho(P) = 1\).

Proof:

We prove the two parts separately.

Every eigenvalue satisfies \(|\lambda_i| \leq 1\). Consider the induced 1-norm \(\|A\|_1\), which for a matrix \(A\) equals the maximum absolute column sum. For a column-stochastic matrix, every column sums to exactly \(1\), so \(\|P\|_1 = 1\). The induced 1-norm is submultiplicative — that is, \(\|AB\|_1 \leq \|A\|_1 \|B\|_1\), which follows immediately from its definition as an operator norm: \(\|AB\mathbf{x}\|_1 \leq \|A\|_1 \|B\mathbf{x}\|_1 \leq \|A\|_1 \|B\|_1 \|\mathbf{x}\|_1\). Now if \(\lambda\) is any eigenvalue of \(P\) with eigenvector \(\mathbf{v} \neq \mathbf{0}\), then \(P\mathbf{v} = \lambda \mathbf{v}\) gives \[ |\lambda| \, \|\mathbf{v}\|_1 = \|\lambda \mathbf{v}\|_1 = \|P \mathbf{v}\|_1 \leq \|P\|_1 \, \|\mathbf{v}\|_1 = \|\mathbf{v}\|_1. \] Dividing by \(\|\mathbf{v}\|_1 > 0\), we obtain \(|\lambda| \leq 1\).
(For a row-stochastic matrix, the same argument applies with the induced \(\infty\)-norm, whose value \(\|P\|_\infty\) is the maximum absolute row sum — again equal to \(1\).)

\(\lambda = 1\) is an eigenvalue. The all-ones vector \(\mathbf{1} = [1, 1, \ldots, 1]^\top\) satisfies \(\mathbf{1}^\top P = \mathbf{1}^\top\), because the \(j\)-th entry of \(\mathbf{1}^\top P\) is the column sum \(\sum_i P_{ij} = 1\). Thus \(\mathbf{1}\) is a left-eigenvector of \(P\) with eigenvalue \(1\). Since \(\det(P - \lambda I) = \det((P - \lambda I)^\top) = \det(P^\top - \lambda I)\) (using transpose-invariance of the determinant), \(P\) and \(P^\top\) share the same characteristic polynomial and hence the same eigenvalues; so \(\lambda = 1\) is also a right-eigenvalue.

Combining the two parts: all eigenvalues satisfy \(|\lambda_i| \leq 1\) and \(\lambda = 1\) is attained, so \(\rho(P) = 1\). \(\quad\blacksquare\)

For a regular stochastic matrix, all eigenvalues other than \(\lambda_1 = 1\) satisfy the strict inequality \(|\lambda_i| < 1\); this is what forces convergence \(P^k \mathbf{x}_0 \to \mathbf{q}\) as \(k \to \infty\). Proving this strictness is the content of the Perron-Frobenius theorem for regular nonnegative matrices — beyond the scope of this page, and treated rigorously in Section III.

Example:

Revisiting our example, \[ \begin{align*} & P \mathbf{q} = \mathbf{q} \\\\ &\begin{bmatrix} 0.3 & 0.05 \\ 0.7 & 0.95 \end{bmatrix} \begin{bmatrix} q_1 \\ q_2 \end{bmatrix} = \begin{bmatrix} q_1 \\ q_2 \end{bmatrix}\\\\ & q_1 + q_2 = 1 \\\\ &\Longrightarrow \mathbf{q} = \begin{bmatrix} \frac{1}{15} \\ \frac{14}{15} \end{bmatrix} \end{align*} \] which means that in the long run, about 6.67 % of the students will be sick and about 93.33 % will be not sick.

Find eigenvalues and corresponding eigenvectors: \[ \begin{align*} &\det(P - \lambda I) = 0 \\\\ &\Longrightarrow \lambda^2 - 1.25\lambda + 0.25 = 0 \\\\ &\Longrightarrow (\lambda -1)(\lambda -0.25) = 0 \\\\ &\Longrightarrow \lambda_1 = 1, \quad \lambda_2 = 0.25. \end{align*} \] For \(\lambda_1 = 1\), solving \((P -I)\mathbf{v}_1 = 0\), we obtain the corresponding eigenvector: \[ \mathbf{v}_1 = \begin{bmatrix} 1 \\ 14 \end{bmatrix}. \] (Scaling by \(\frac{1}{15}\), we can get the stationary distribution \(\mathbf{q} = \frac{1}{15}\mathbf{v}_1\)).

For \(\lambda_2 = 0.25\), solving \((P -0.25I)\mathbf{v}_2 = 0\), we obtain the corresponding eigenvector: \[ \mathbf{v}_2 = \begin{bmatrix} - 1 \\ 1 \end{bmatrix}. \] So, \[ V = \begin{bmatrix} 1 & - 1 \\ 14 & 1 \end{bmatrix}, \quad V^{-1} = \frac{1}{15} \begin{bmatrix} 1 & 1 \\ -14 & 1 \end{bmatrix}, \] and \[ D = \begin{bmatrix} \lambda_1 & 0\\ 0 & \lambda_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 0.25 \end{bmatrix} \]

Thus, the transition matrix \(P\) after \(k\) steps can be written as: \[ \begin{align*} P^k &= V D^k V^{-1} \\ &= \begin{bmatrix} 1 & - 1 \\ 14 & 1 \end{bmatrix} \begin{bmatrix} 1^k & 0 \\ 0 & (0.25)^k \end{bmatrix} \frac{1}{15}\begin{bmatrix} 1 & 1 \\ -14 & 1 \end{bmatrix}. \end{align*} \]

The error in the state distribution after \(k\) steps is dominated by the term \(\lambda_2^k = (0.25)^k\). Thus, the convergence rate of the Markov chain toward the stationary distribution is exponential, with each additional step reducing the error roughly by a factor of \(0.25\): \[ e_k \approx e_0 (0.25)^k \] where \(e_0\) is the initial error.

In this example, \(P\) is diagonalizable, which allowed us to express \(P^k\) in closed form and read off the convergence rate directly from the eigenvalues. In general, stochastic matrices are not always diagonalizable, but the convergence result still holds under mild conditions (irreducibility and aperiodicity) via the Perron-Frobenius theorem.

Doubly Stochastic Matrices

A particularly well-behaved class of stochastic matrices arises when both the columns and the rows are probability vectors.

Definition: Doubly Stochastic Matrix

A nonnegative matrix \(A\) is said to be doubly stochastic if both the sum of each row and the sum of each column equal 1. That is, \(A\) is both row-stochastic and column-stochastic.

Example: Doubly Stochastic Matrix (\(2 \times 2\))

For \(n = 2\), the doubly stochastic constraint forces the matrix to be symmetric. Indeed, writing \(A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}\), the four sum conditions \(a + b = 1\), \(c + d = 1\) (rows), \(a + c = 1\), \(b + d = 1\) (columns) immediately give \(b = c\) (both equal \(1 - a\)) and \(d = a\). The single free parameter is the off-diagonal value; writing it as \(t\), \[ A = \begin{bmatrix} 1 - t & t \\ t & 1 - t \end{bmatrix}, \qquad 0 \leq t \leq 1. \] The trace is \(\operatorname{tr}(A) = 2 - 2t\). Since every stochastic matrix has eigenvalue \(\lambda_1 = 1\) with corresponding eigenvector \(\mathbf{v}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\), we can find the second eigenvalue from \[ \lambda_1 + \lambda_2 = \operatorname{tr}(A) = 2 - 2t \quad \Longrightarrow \quad \lambda_2 = 1 - 2t. \]

Since \(A\) is symmetric and the two eigenvalues are distinct (for \(t \neq 0\)), the spectral theorem guarantees that the eigenvectors are orthogonal. Thus \(\mathbf{v}_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}\), and the eigendecomposition is \[ A = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1-2t \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}. \]

Note that the eigenvector matrix \(V = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}\) satisfies \(V^{-1} = \frac{1}{2}V\), since the columns are orthogonal with squared norm 2.

Insight: Doubly Stochastic Matrices in Machine Learning

Doubly stochastic matrices are the cornerstone of Optimal Transport (OT). The Sinkhorn-Knopp algorithm transforms a raw cost matrix into a doubly stochastic assignment matrix through iterative scaling. This approach is highly valued in modern ML because it is fully differentiable and enables entropic regularization, turning a hard combinatorial matching problem into a smooth, GPU-friendly optimization.

In Spectral Clustering, ensuring the affinity matrix is normalized into a stochastic form relates to the random walk Laplacian. For any doubly stochastic matrix \(A\), every row sums to 1, so \(A \mathbf{1} = \mathbf{1}\); normalizing by \(n\) shows that the uniform distribution \(\mathbf{q} = \frac{1}{n}\mathbf{1}\) is a steady state of \(A\) (uniqueness requires the additional assumption of regularity — e.g., \(A = I\) admits every probability vector as a steady state). This property makes doubly stochastic matrices ideal for modeling systems where no prior bias exists, such as in data shuffling, sorting networks, or fair resource allocation.

This page gave a linear-algebra view of probabilistic dynamics: a stochastic matrix is a linear operator on probability vectors, and its spectral structure dictates long-run behavior. The companion probabilistic view — how to make sense of "probability", how to model transitions at the level of random variables, and how convergence to a stationary distribution is formalized and proved — belongs to Section III, where Markov chain theory is developed from measure-theoretic foundations.