Shortest Paths
In Introduction to Graph Theory, we defined paths and
studied their combinatorial structure. We now equip our graphs with weights and ask a quantitative
question: among all paths connecting two vertices, which one minimizes total weight? This is the
shortest path problem, and its study reveals a fundamental dichotomy in algorithm design —
greedy selection versus dynamic programming — that pervades discrete optimization.
Weighted Graphs and Distance
Definition: Weighted Directed Graph
A weighted directed graph (or weighted digraph) is a triple
\(G = (V, E, w)\), where \((V, E)\) is a directed graph and \(w \colon E \to \mathbb{R}\) is a
weight function assigning a real number to each edge.
Definition: Walk and Trail
Let \(G = (V, E)\) be a graph (directed or undirected). A walk in \(G\) of
length \(k\) is a sequence \((v_0, v_1, \ldots, v_k)\) of vertices such that each consecutive
pair \((v_{i-1}, v_i)\) is an edge of \(G\); vertices and edges may repeat. A walk is
closed if \(v_0 = v_k\). A trail is a walk in which no
edge is repeated (vertices may still repeat).
Together with the notion of
path
(a walk in which no vertex is repeated), these form a strict hierarchy:
every path is a trail, and every trail is a walk. The boundary case \(k = 0\) is the trivial
walk consisting of the single vertex \(v_0\) with no edges; it is simultaneously a closed
walk, a trail, and a path of length zero.
For a directed path \(P = (v_0, v_1, \ldots, v_k)\), we define its weight as
\[
w(P) = \sum_{i=0}^{k-1} w(v_i, v_{i+1}).
\]
(In the boundary case \(k = 0\), the sum is empty and \(w(P) = 0\).)
The shortest-path distance from \(s\) to \(t\) is
\[
\delta(s, t) = \begin{cases} \displaystyle\min_{P \,:\, s \leadsto t} w(P) & \text{if a path from } s \text{ to } t \text{ exists,} \\ +\infty & \text{otherwise.} \end{cases}
\]
When all weights are non-negative, \(\delta\) satisfies the triangle inequality
\(\delta(s,t) \leq \delta(s,u) + \delta(u,t)\) and non-negativity \(\delta(s,t) \geq 0\), but on a
directed graph it is generally not symmetric: \(\delta(s,t) \neq \delta(t,s)\). Thus \(\delta\)
is a quasi-metric (or asymmetric metric) on the vertex set, not a metric
in the strict sense — recall that a
metric requires symmetry by definition.
When negative weights are present, \(\delta(s,t)\) may be \(-\infty\) if a negative-weight
cycle is reachable from \(s\) on a path to \(t\).
Dijkstra's Algorithm
When all edge weights are non-negative, the shortest-path problem admits a greedy
solution. Dijkstra's algorithm maintains a set \(S\) of vertices whose shortest-path distances from
a source \(s\) have been finalized, and repeatedly extracts the vertex with the smallest tentative distance.
Theorem: Correctness of Dijkstra's Algorithm
Let \(G = (V, E, w)\) be a weighted digraph with \(w(e) \geq 0\) for all \(e \in E\).
Upon termination, Dijkstra's algorithm computes \(\delta(s, v)\) for every \(v \in V\).
Proof Sketch:
We prove by induction on \(|S|\) that for every vertex \(v \in S\), the recorded distance
\(d[v] = \delta(s,v)\).
Base case. After the first extraction, \(S = \{s\}\) with \(d[s] = 0 = \delta(s,s)\).
Inductive step. Suppose the invariant holds for \(|S| = k\). The algorithm selects
\(u \notin S\) minimizing \(d[u]\). Every finite \(d\)-value is, by construction, the weight of
some path from \(s\), so \(d[u] \geq \delta(s,u)\); suppose for contradiction that this inequality
is strict, i.e. \(\delta(s,u) < d[u]\). Then there
exists a shortest path \(P\) from \(s\) to \(u\) with \(w(P) < d[u]\). Let \((x, y)\) be the first edge
on \(P\) with \(x \in S\) and \(y \notin S\). By the inductive hypothesis, \(d[x] = \delta(s,x)\), and
the relaxation step performed when \(x\) was added to \(S\) ensured \(d[y] \leq d[x] + w(x,y)\).
Since the prefix of \(P\) ending at \(y\) is itself a shortest path (subpath optimality),
\(d[x] + w(x,y) = \delta(s,y)\). Non-negativity of weights gives \(\delta(s,y) \leq w(P)\),
hence \(d[y] \leq w(P) < d[u]\). But then
\(u\) was not the minimum of \(d[\cdot]\) over \(V \setminus S\), a contradiction.
Algorithm: Dijkstra's Algorithm
Input: weighted digraph \(G = (V, E, w)\) with \(w(e) \geq 0\), source \(s\);
Output: shortest-path distances \(d[v] = \delta(s,v)\) for all \(v \in V\);
begin
\(d[v] \leftarrow +\infty\) for all \(v \in V\); \(d[s] \leftarrow 0\);
\(S \leftarrow \emptyset\);
Insert all vertices into priority queue \(Q\) keyed by \(d[v]\);
while \(Q \neq \emptyset\):
\(u \leftarrow\) Extract-Min\((Q)\);
\(S \leftarrow S \cup \{u\}\);
for each edge \((u,v) \in E\) with \(v \notin S\):
if \(d[u] + w(u,v) < d[v]\):
\(d[v] \leftarrow d[u] + w(u,v)\); // relaxation
Decrease-Key\((Q, v, d[v])\);
end for
end while
return \(d\);
end
With a binary min-heap, each extract-min costs \(O(\log |V|)\) worst-case and each
decrease-key costs \(O(\log |V|)\) worst-case. Since we perform \(|V|\) extractions and
at most \(|E|\) relaxations, the total running time is
\[
O\bigl((|V| + |E|) \log |V|\bigr).
\]
With a Fibonacci heap, decrease-key is amortized \(O(1)\), yielding \(O(|V| \log |V| + |E|)\).
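As a concrete sketch, the pseudocode above can be rendered in Python (the adjacency-dict encoding and all identifiers are our own illustrative choices). Since the standard-library `heapq` offers no Decrease-Key, the sketch pushes a fresh entry on each relaxation and skips stale entries at extraction — a standard substitution that preserves the logarithmic bound up to constants:

```python
import heapq

def dijkstra(adj, s):
    """Shortest-path distances from s, assuming non-negative weights.

    adj: dict mapping each vertex to a list of (neighbor, weight) pairs.
    Returns d with d[v] = delta(s, v) for every vertex reachable from s.
    """
    d = {s: 0}
    done = set()                              # the finalized set S
    pq = [(0, s)]                             # priority queue keyed by tentative distance
    while pq:
        du, u = heapq.heappop(pq)             # Extract-Min
        if u in done:
            continue                          # stale entry: u already finalized
        done.add(u)
        for v, w in adj.get(u, []):
            if v not in done and du + w < d.get(v, float("inf")):
                d[v] = du + w                 # relaxation
                heapq.heappush(pq, (d[v], v)) # stands in for Decrease-Key
    return d
```

Unreachable vertices are simply absent from the returned dictionary, which plays the role of \(d[v] = +\infty\).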
The greedy strategy works precisely because non-negative weights guarantee that extending a path
can never decrease its total weight — once a vertex is finalized, no future path can improve upon it.
This monotonicity breaks when negative edges are present.
Bellman-Ford Algorithm and Dynamic Programming
The Bellman-Ford algorithm handles arbitrary real-valued weights (including negative edges)
by adopting a fundamentally different strategy: rather than greedily committing to final distances, it
iteratively relaxes all edges, allowing tentative distances to improve over \(|V|-1\) rounds.
Define the subproblems: let \(d_i[v]\) be the minimum weight of any walk from \(s\) to \(v\) using
at most \(i\) edges (when no negative cycle is reachable, this minimum is attained by a path). The recurrence is
\[
d_i[v] = \min\Bigl(\, d_{i-1}[v], \;\min_{(u,v) \in E} \bigl(d_{i-1}[u] + w(u,v)\bigr)\Bigr),
\]
with base case \(d_0[s] = 0\) and \(d_0[v] = +\infty\) for \(v \neq s\). Since any shortest path in a graph
without negative cycles has at most \(|V| - 1\) edges, we have \(\delta(s,v) = d_{|V|-1}[v]\).
Definition: Dynamic Programming (Bellman's Principle of Optimality)
A problem exhibits optimal substructure if an optimal solution contains
within it optimal solutions to subproblems. A problem exhibits overlapping subproblems
if the same subproblems are encountered repeatedly during a recursive decomposition.
Dynamic programming (DP) solves optimization problems satisfying both properties
by computing and storing solutions to subproblems in a systematic order — either bottom-up (tabulation)
or top-down with memoization — ensuring each subproblem is solved exactly once.
The Bellman-Ford recurrence is a canonical instance of DP: the optimal substructure is that a shortest
path from \(s\) to \(v\) with at most \(i\) edges is either a shortest path with at most \(i-1\) edges, or
the extension of some shortest path to a predecessor \(u\) by the edge \((u,v)\). The subproblems
overlap because a single value \(d_{i-1}[u]\) is reused by \(d_i[v]\) for every out-neighbor \(v\) of \(u\), as well as by \(d_i[u]\) itself.
Algorithm: Bellman-Ford
Input: weighted digraph \(G = (V, E, w)\), source \(s\);
Output: distances \(d[v] = \delta(s,v)\), or detection of a negative cycle;
begin
\(d[v] \leftarrow +\infty\) for all \(v \in V\); \(d[s] \leftarrow 0\);
for \(i = 1\) to \(|V| - 1\): // relaxation passes
for each edge \((u,v) \in E\):
if \(d[u] + w(u,v) < d[v]\):
\(d[v] \leftarrow d[u] + w(u,v)\);
end for
end for
// negative cycle check
for each edge \((u,v) \in E\):
if \(d[u] + w(u,v) < d[v]\):
return "negative cycle detected";
end for
return \(d\);
end
Theorem: Bellman-Ford Correctness and Negative Cycle Detection
Let \(G = (V, E, w)\) be a weighted digraph with source \(s\). After \(|V| - 1\) iterations of
relaxing all edges, the Bellman-Ford algorithm computes \(\delta(s, v)\) for every vertex \(v\)
reachable from \(s\) without traversing a negative-weight cycle. Moreover, a negative-weight cycle
reachable from \(s\) exists if and only if some edge \((u,v)\) still satisfies
\(d[v] > d[u] + w(u,v)\) after the \((|V|-1)\)-th iteration.
Proof Sketch:
Correctness. We show by induction on \(i\) that after \(i\) outer
iterations, \(d[v] \leq d_i[v]\) for all \(v \in V\). The base case \(i = 0\) holds
by initialization. In iteration \(i\), every edge \((u,v) \in E\) is relaxed:
\[
d[v] \leftarrow \min\bigl(d[v],\; d[u] + w(u,v)\bigr).
\]
At the moment of relaxation, \(d[u] \leq d_{i-1}[u]\) — by the inductive hypothesis
after iteration \(i-1\), and because \(d\) only decreases. Combined with the previous
value of \(d[v] \leq d_{i-1}[v]\), the recurrence for \(d_i[v]\) gives
\(d[v] \leq d_i[v]\).
Since each finite \(d[v]\) is, by construction, the weight of some \(s\)-to-\(v\)
walk, we also have \(\delta(s,v) \leq d[v]\). When no negative-weight cycle is
reachable from \(s\), every shortest path has at most \(|V|-1\) edges (a walk with more edges
would repeat a vertex, and excising the resulting cycle — which has non-negative weight —
yields a walk of no greater weight). Thus \(\delta(s,v) = d_{|V|-1}[v]\), and after \(|V|-1\)
iterations \(\delta(s,v) \leq d[v] \leq d_{|V|-1}[v] = \delta(s,v)\).
Negative cycle detection.
(⟸) If no negative cycle is reachable from \(s\), then \(d[v] = \delta(s,v)\) after
\(|V|-1\) iterations, which satisfies the triangle inequality
\(d[v] \leq d[u] + w(u,v)\) for every edge — so no edge admits relaxation.
(⟹) Conversely, suppose every edge satisfies \(d[v] \leq d[u] + w(u,v)\) after
\(|V|-1\) iterations, yet a negative-weight cycle
\(C = (v_0, v_1, \ldots, v_k = v_0)\) is reachable from \(s\). Reachability and
edge-by-edge relaxation propagate finiteness, so each \(d[v_i]\) is finite.
Summing the triangle inequalities around the cycle,
\[
\sum_{i=1}^{k} d[v_i] \;\leq\; \sum_{i=1}^{k} d[v_{i-1}]
+ \sum_{i=1}^{k} w(v_{i-1}, v_i).
\]
The two \(d\)-sums cancel (since \(v_0 = v_k\) and all values are finite),
yielding \(0 \leq w(C)\) — contradicting \(w(C) < 0\). Hence some edge must
still admit relaxation.
The total running time is \(O(|V| \cdot |E|)\): we perform \(|V| - 1\) passes over all \(|E|\) edges.
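The pseudocode above transcribes directly into Python (the edge-list encoding and all names are illustrative):

```python
def bellman_ford(vertices, edges, s):
    """Bellman-Ford on an edge list of (u, v, w) triples.

    Returns (d, True) if an edge is still relaxable after the
    |V| - 1 passes (a reachable negative cycle exists), else (d, False).
    """
    d = {v: float("inf") for v in vertices}
    d[s] = 0
    for _ in range(len(vertices) - 1):        # |V| - 1 relaxation passes
        for u, v, w in edges:
            if d[u] + w < d[v]:
                d[v] = d[u] + w               # relaxation
    for u, v, w in edges:                     # one extra pass: negative-cycle check
        if d[u] + w < d[v]:
            return d, True
    return d, False
```

Note that `float("inf") + w` stays infinite, so edges leaving unreached vertices never trigger a relaxation, exactly as in the \(+\infty\) convention above.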
Negative Cycles and Currency Arbitrage
Bellman-Ford's ability to detect negative-weight cycles has a striking application in finance.
Consider a network of currencies where each directed edge \((u,v)\) carries the exchange rate
\(r(u,v)\). Setting \(w(u,v) = -\log r(u,v)\) transforms multiplicative gains into additive
weights: a cycle \(C\) with \(\prod_{e \in C} r(e) > 1\) (a profitable round-trip) becomes a
negative-weight cycle with \(\sum_{e \in C} w(e) < 0\). Thus, Bellman-Ford on the log-transformed graph
detects arbitrage opportunities — situations where a sequence of currency
exchanges yields a net profit. The same principle applies to decentralized exchanges (DEXs)
in DeFi, where token swap rates across liquidity pools form exactly such a weighted digraph,
and arbitrage bots run variants of Bellman-Ford in real time.
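As a sketch of this reduction (the exchange rates below are invented for illustration), we can initialize every distance to zero — equivalent to adding a virtual super-source with a zero-weight edge to each currency — so that a negative cycle anywhere in the graph leaves some edge relaxable after \(|V|-1\) passes:

```python
import math

def detect_arbitrage(rates):
    """rates[u][v]: units of currency v obtained per unit of u.

    Assumes every currency appears as a key of rates. Runs Bellman-Ford
    on the weights w(u,v) = -log rates[u][v]; a profitable cycle
    (product of rates > 1) is exactly a negative-weight cycle.
    """
    currencies = list(rates)
    edges = [(u, v, -math.log(r))
             for u, nbrs in rates.items() for v, r in nbrs.items()]
    d = {v: 0.0 for v in currencies}          # zero init: virtual super-source
    for _ in range(len(currencies) - 1):
        for u, v, w in edges:
            if d[u] + w < d[v]:
                d[v] = d[u] + w
    # a still-relaxable edge witnesses a negative cycle (small float tolerance)
    return any(d[u] + w < d[v] - 1e-12 for u, v, w in edges)
```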
Greedy vs. Dynamic Programming
Dijkstra and Bellman-Ford embody the two great paradigms of discrete optimization.
Dijkstra's greedy approach commits irrevocably to each vertex's distance, exploiting
the monotonicity of non-negative weights for superior runtime. Bellman-Ford's DP approach
defers commitment, exploring all possible relaxations — slower, but correct under weaker assumptions.
This trade-off between exploiting structural assumptions for speed and maintaining generality
recurs throughout algorithm design and optimization theory.
A natural extension of the greedy paradigm is the A* algorithm, which augments
Dijkstra's strategy with a heuristic function \(h(v)\) estimating the remaining
distance from \(v\) to the goal. Instead of extracting the vertex minimizing \(d[v]\), A* extracts
the vertex minimizing \(d[v] + h(v)\), directing the search toward the target. When \(h\) is
admissible (never overestimates true distance) and consistent (satisfies a
triangle inequality), A* retains optimality while dramatically reducing the number of vertices
explored. This makes A* the algorithm of choice for robotic path planning, autonomous navigation,
and game AI — settings where spatial heuristics provide powerful guidance that pure Dijkstra cannot exploit.
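To make the idea concrete, here is a small A* sketch on a 4-connected grid with unit move costs (the grid encoding is an illustrative choice of ours); in this setting the Manhattan distance is both admissible and consistent:

```python
import heapq

def astar(grid, start, goal):
    """Length of a shortest path on a grid of 0 (free) / 1 (wall) cells,
    or None if the goal is unreachable. Moves are 4-directional, cost 1."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])   # Manhattan heuristic
    best = {start: 0}                         # best known d[v]
    pq = [(h(start), 0, start)]               # entries keyed by d[v] + h(v)
    while pq:
        f, d, (r, c) = heapq.heappop(pq)
        if (r, c) == goal:
            return d
        if d > best.get((r, c), float("inf")):
            continue                          # stale entry
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1
                if nd < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = nd
                    heapq.heappush(pq, (nd + h((nr, nc)), nd, (nr, nc)))
    return None
```

Setting `h` to the zero function recovers Dijkstra's algorithm exactly; any admissible `h` only prunes the search.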
Shortest-Path Distances in Graph Learning
The shortest-path distance \(\delta(u,v)\) is a fundamental feature in graph-based machine learning.
In Graph Neural Networks (GNNs), Shortest Path Networks encode \(\delta(u,v)\) as
positional features to overcome the limited expressiveness of message-passing architectures — whose
discriminative power is fundamentally bounded by the 1-Weisfeiler-Lehman (1-WL)
test, meaning standard GNNs cannot distinguish non-isomorphic graphs (or nodes) that
1-WL fails to separate. Shortest-path distances inject global structural information that breaks
through this barrier. In knowledge graph embeddings, translational models
like TransE learn vector representations where \(\mathbf{h} + \mathbf{r} \approx \mathbf{t}\) for a
triple \((h, r, t)\), implicitly encoding a notion of relational distance. The
graph Laplacian's commute time
— which we will develop in the random walks section below — provides a spectral relaxation of
shortest-path distance that is continuous and differentiable, making it amenable to gradient-based
optimization.
Maximum Flow
We now pass from finding an optimal single path to optimizing the aggregate throughput
of an entire network. Where shortest paths ask "what is the cheapest route?", maximum flow asks
"how much can we push through simultaneously?" This shift — from a path problem to a network problem —
opens the door to deep connections with linear programming duality.
Flow Networks
Definition: Flow Network
A flow network is a tuple \((G, c, s, t)\), where \(G = (V, E)\) is a directed graph,
\(c \colon E \to \mathbb{R}_{\geq 0}\) is a capacity function, \(s \in V\) is the
source, and \(t \in V\) is the sink, with \(s \neq t\).
Definition: Flow
A flow in a flow network \((G, c, s, t)\) is a function \(f \colon E \to \mathbb{R}_{\geq 0}\) satisfying:
Capacity constraint: For every edge \(e \in E\),
\[
f(e) \leq c(e).
\]
Conservation (Kirchhoff's law): For every vertex \(v \in V \setminus \{s, t\}\),
\[
\sum_{(u,v) \in E} f(u,v) = \sum_{(v,w) \in E} f(v,w).
\]
The value of a flow is
\[
|f| = \sum_{(s,v) \in E} f(s,v) - \sum_{(v,s) \in E} f(v,s).
\]
The maximum flow problem asks: find a flow \(f^*\) of maximum value \(|f^*|\). This is a
linear program — the objective \(|f|\) and all constraints are linear in the flow variables — and we shall
see that its dual yields the min-cut theorem.
Residual Graphs and Augmenting Paths
Definition: Residual Graph
We assume that \(G\) contains no antiparallel edges (i.e., if \((u,v) \in E\) then
\((v,u) \notin E\)). This is without loss of generality: any antiparallel pair can be eliminated by
inserting a dummy vertex on one of the two edges.
Given a flow \(f\) in a network \((G, c, s, t)\), the residual graph
\(G_f = (V, E_f)\) has edge set
\[
E_f = \bigl\{(u,v) : (u,v) \in E,\; f(u,v) < c(u,v)\bigr\} \;\cup\; \bigl\{(v,u) : (u,v) \in E,\; f(u,v) > 0\bigr\}.
\]
The residual capacity is
\[
c_f(u,v) = \begin{cases} c(u,v) - f(u,v) & \text{if } (u,v) \in E, \\ f(v,u) & \text{if } (v,u) \in E. \end{cases}
\]
An augmenting path is a path from \(s\) to \(t\) in the residual graph \(G_f\). If such
a path \(P\) exists, we can increase the flow by \(\Delta = \min_{e \in P} c_f(e) > 0\) by pushing
\(\Delta\) units along \(P\). When the augmenting path traverses a backward edge
\((v,u)\) — corresponding to an original edge \((u,v)\) carrying positive flow — pushing flow along
it effectively cancels previously assigned flow on \((u,v)\), redirecting that capacity to
a more profitable route. This ability to "undo" earlier decisions is precisely what allows the
method to escape locally suboptimal flow assignments and converge to a global maximum.
Algorithm: Ford-Fulkerson Method
Input: flow network \((G, c, s, t)\);
Output: maximum flow \(f\);
begin
\(f(e) \leftarrow 0\) for all \(e \in E\);
while \(\exists\) augmenting path \(P\) from \(s\) to \(t\) in \(G_f\):
\(\Delta \leftarrow \min\{c_f(e) : e \in P\}\);
for each edge \((u,v) \in P\):
if \((u,v) \in E\): \(f(u,v) \leftarrow f(u,v) + \Delta\); // forward edge
else: \(f(v,u) \leftarrow f(v,u) - \Delta\); // backward edge
end for
end while
return \(f\);
end
The method terminates when no augmenting path exists in \(G_f\). The choice of augmenting path
determines both the algorithm's name and its complexity. When augmenting paths are found via
breadth-first search (choosing a shortest path in terms of number of edges),
the method is called the Edmonds-Karp algorithm, with
running time
\(O(|V| \cdot |E|^2)\).
The Ford-Fulkerson method's termination depends on the capacity values: with
integer (or rational) capacities, each augmentation increases \(|f|\) by at
least the capacity unit, so the procedure terminates in finitely many steps; with irrational
capacities, however, examples exist where the method runs forever without converging to the
maximum flow. The Edmonds-Karp variant resolves this entirely — the BFS choice forces
termination in \(O(|V| \cdot |E|)\) augmentations regardless of capacity values.
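A compact Python sketch of Edmonds-Karp follows (nested-dict residual capacities; all identifiers are our own). Note how pushing \(\Delta\) along the path decreases \(c_f(u,v)\) and increases the reverse \(c_f(v,u)\), implementing the flow-cancellation mechanism described above:

```python
from collections import defaultdict, deque

def edmonds_karp(capacity, s, t):
    """Maximum s-t flow via shortest (BFS) augmenting paths.

    capacity: dict of dicts with capacity[u][v] = c(u, v) >= 0.
    Works directly on residual capacities cf.
    """
    cf = defaultdict(lambda: defaultdict(int))
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            cf[u][v] += c
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:          # BFS in the residual graph
            u = q.popleft()
            for v in list(cf[u]):
                if cf[u][v] > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow                       # no augmenting path: flow is maximum
        delta = float("inf")                  # bottleneck residual capacity
        v = t
        while parent[v] is not None:
            delta = min(delta, cf[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:          # push delta along the path
            cf[parent[v]][v] -= delta
            cf[v][parent[v]] += delta         # reverse edge: cancellable flow
            v = parent[v]
        flow += delta
```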
The Max-Flow Min-Cut Theorem
Definition: Cut
An \(s\)-\(t\) cut in a flow network \((G, c, s, t)\) is a partition \((S, T)\) of \(V\)
with \(s \in S\) and \(t \in T\). Its capacity is
\[
c(S, T) = \sum_{\substack{(u,v) \in E \\ u \in S,\; v \in T}} c(u,v).
\]
For any flow \(f\) and any \(s\)-\(t\) cut \((S,T)\), the value \(|f|\) equals the
net flow across the cut. To see this, define the
net out-flow at each vertex,
\[
b_f(v) \;=\; \sum_{(v,w) \in E} f(v,w) \;-\; \sum_{(u,v) \in E} f(u,v).
\]
Conservation gives \(b_f(v) = 0\) for all \(v \in V \setminus \{s, t\}\), and
\(b_f(s) = |f|\) by definition. Since \(t \notin S\), summing \(b_f\) over \(v \in S\) yields
\[
\sum_{v \in S} b_f(v) \;=\; b_f(s) \;+\; \sum_{v \in S \setminus \{s\}} b_f(v) \;=\; |f|.
\]
On the other hand, expanding \(\sum_{v \in S} b_f(v)\) edge-by-edge: every edge \((a,b)\) with
both endpoints in \(S\) contributes \(+f(a,b)\) (from \(a\)'s out-sum) and \(-f(a,b)\) (from
\(b\)'s in-sum), cancelling; every edge with both endpoints in \(T\) contributes nothing;
only edges crossing the cut survive:
\[
|f| = \sum_{\substack{(u,v) \in E \\ u \in S,\; v \in T}} f(u,v)
\;-\; \sum_{\substack{(v,u) \in E \\ v \in T,\; u \in S}} f(v,u).
\]
Combining \(f(u,v) \leq c(u,v)\) on the first sum and \(f(v,u) \geq 0\) on the second,
\[
|f| \;\leq\; \sum_{\substack{(u,v) \in E \\ u \in S,\; v \in T}} c(u,v) \;=\; c(S,T).
\]
Every flow value is bounded above by every cut capacity. The remarkable fact is that equality
is achieved.
Theorem: Max-Flow Min-Cut (Ford & Fulkerson, 1956)
In any flow network, the following three conditions are equivalent:
(i) \(f\) is a maximum flow.
(ii) The residual graph \(G_f\) contains no augmenting path.
(iii) There exists an \(s\)-\(t\) cut \((S, T)\) with \(|f| = c(S, T)\).
In particular, \(\max_f |f| = \min_{(S,T)} c(S,T)\).
Proof:
(i) ⇒ (ii): If an augmenting path \(P\) existed in \(G_f\), then
\(\Delta = \min_{e \in P} c_f(e) > 0\) (every residual edge has strictly positive
capacity), and pushing \(\Delta\) units along \(P\) would increase \(|f|\),
contradicting maximality.
(ii) ⇒ (iii): Suppose \(G_f\) has no \(s\)-\(t\) path. Define
\(S = \{v \in V : v \text{ is reachable from } s \text{ in } G_f\}\) and \(T = V \setminus S\).
Then \(s \in S\) and \(t \in T\) (since no augmenting path exists), so \((S,T)\) is an \(s\)-\(t\) cut.
For every edge \((u,v) \in E\) with \(u \in S, v \in T\): since \(v\) is not reachable from \(s\) in
\(G_f\), the forward residual edge \((u,v)\) is absent, meaning \(f(u,v) = c(u,v)\). For every edge
\((v,u) \in E\) with \(v \in T, u \in S\): since \(u \in S\) is reachable but \(v \in T\) is not, the
backward residual edge \((u,v)\) must be absent, meaning \(f(v,u) = 0\). Therefore,
\[
|f| = \sum_{\substack{u \in S,\, v \in T \\ (u,v) \in E}} c(u,v) - 0 = c(S,T).
\]
(iii) ⇒ (i): Since \(|f| \leq c(S',T')\) for every cut \((S',T')\), and
\(|f| = c(S,T)\), the flow \(f\) achieves the upper bound and must be maximal.
Duality and the Combinatorial-Continuous Bridge
The max-flow min-cut theorem is an instance of strong
duality. The maximum flow problem can be formulated as a linear program: maximize
\(|f|\) subject to capacity and conservation constraints. Its dual problem turns out to be the
minimum cut problem, and strong duality guarantees that their optima coincide. This is not merely a
theoretical curiosity — it reveals that combinatorial optimization on graphs and continuous optimization
in \(\mathbb{R}^n\) are often two views of the same mathematical structure. The same duality principle
underlies the min-cost flow problem, assignment problems, and the general theory of
totally unimodular matrices, where linear programming relaxations automatically yield
integer solutions.
Graph Cuts in Computer Vision
In image segmentation, the max-flow min-cut framework provides an exact
solution to a natural energy minimization problem. We construct a flow network where each pixel
is a vertex, edges between adjacent pixels carry weights encoding similarity (encouraging
neighboring pixels to share a label), and edges to the source and sink encode per-pixel costs
for foreground/background assignment. The minimum cut then partitions pixels into foreground and
background, minimizing the total cost. This formulation — known as graph cuts — was
among the first methods to give globally optimal solutions for binary labeling problems in vision,
and its extensions to multi-label problems via \(\alpha\)-expansion remain widely used.
Random Walks on Graphs
So far, we have studied deterministic optimization on graphs: finding shortest paths and maximum
flows. We now introduce stochastic dynamics by considering a particle that traverses a graph
randomly. This probabilistic perspective connects graph theory to
Markov chains and
spectral graph theory, revealing that
combinatorial and algebraic properties of a graph are two faces of the same coin.
Throughout this section we work with undirected graphs, in contrast to the shortest-path and
maximum-flow sections above. This is not merely a presentational choice: the closed-form stationary distribution
\(\pi_v = \deg(v)/(2m)\) and the symmetry of the graph Laplacian both rely on the undirected
structure, where adjacency is reciprocal.
The Transition Matrix
Definition: Random Walk on a Graph
Let \(G = (V, E)\) be a connected, undirected graph with \(n = |V|\) vertices. A simple
random walk on \(G\) is the
Markov chain \((X_0, X_1, X_2, \ldots)\)
with state space \(V\) and transition probabilities
\[
\Pr(X_{t+1} = v \mid X_t = u) = \begin{cases} \dfrac{1}{\deg(u)} & \text{if } (u,v) \in E, \\[6pt] 0 & \text{otherwise.} \end{cases}
\]
Let \(A\) be the adjacency matrix and \(D = \operatorname{diag}(\deg(v_1), \ldots, \deg(v_n))\) the
degree matrix. The transition matrix of the random walk is
\[
P = D^{-1}A.
\]
The entry \(P_{uv} = A_{uv}/\deg(u)\) is exactly the one-step transition probability from \(u\) to \(v\).
Note that \(P\) is a row-stochastic matrix:
each row sums to 1. After \(k\) steps, the distribution of the walker's position evolves as
\[
\boldsymbol{\pi}^{(k)} = \boldsymbol{\pi}^{(0)} P^k,
\]
where \(\boldsymbol{\pi}^{(0)}\) is the initial distribution (a row vector).
Stationary Distribution
Theorem: Stationary Distribution of a Simple Random Walk
Let \(G\) be a connected undirected graph with \(m = |E|\) edges. The distribution
\[
\pi_v = \frac{\deg(v)}{2m}, \quad v \in V
\]
is the unique stationary distribution of the simple random walk on \(G\) (i.e., the unique
distribution satisfying \(\boldsymbol{\pi} P = \boldsymbol{\pi}\)). If, in addition, \(G\) is
non-bipartite, then the walk is aperiodic, and the distribution
\(\boldsymbol{\pi}^{(k)}\) converges to \(\boldsymbol{\pi}\) from any initial distribution.
Proof:
Stationarity. First, \(\boldsymbol{\pi}\) is indeed a probability distribution:
by the Handshaking Lemma
\(\sum_{v \in V} \deg(v) = 2m\), so \(\sum_{v} \pi_v = \sum_{v} \deg(v)/(2m) = 1\).
We verify \(\boldsymbol{\pi} P = \boldsymbol{\pi}\). For each \(v \in V\),
\[
\begin{align*}
(\boldsymbol{\pi} P)_v &= \sum_{u \in V} \pi_u \cdot P_{uv} \\
&= \sum_{u \,:\, (u,v) \in E} \frac{\deg(u)}{2m} \cdot \frac{1}{\deg(u)} \\
&= \sum_{u \,:\, (u,v) \in E} \frac{1}{2m} \\
&= \frac{\deg(v)}{2m} = \pi_v.
\end{align*}
\]
Uniqueness. Since \(G\) is connected, the transition matrix \(P\) is
irreducible: any vertex \(u\) reaches any other vertex \(v\) along a
finite path in \(G\), and along this path each transition has positive probability
\(1/\deg(\cdot)\), so \((P^k)_{uv} > 0\) for some \(k \geq 0\). The standard theory of
irreducible stochastic matrices — developed in our treatment of
stochastic matrices
and applied to Markov chains —
guarantees that an irreducible stochastic matrix has a unique stationary distribution
(this is the classical Perron-Frobenius statement applied to nonnegative matrices).
Since we have already exhibited \(\boldsymbol{\pi}\) as one stationary distribution,
it must be the unique one.
Convergence. If \(G\) is non-bipartite, then by the
bipartite/odd-cycle equivalence
the graph contains an odd cycle of some length \(2\ell + 1\). Combined with the existence
of length-\(2\) closed walks (round trips on any edge — which exist since \(G\) is connected
and has at least one edge), this forces
\(\gcd\{k : P^k_{vv} > 0\} = 1\) for every vertex \(v\), so the chain is
aperiodic. The convergence \(\boldsymbol{\pi}^{(k)} \to \boldsymbol{\pi}\)
from any initial distribution is then the convergence theorem for irreducible aperiodic
Markov chains, again from the standard Perron-Frobenius theory referenced above.
The result is elegant in its simplicity: in the long run, the random walker visits each vertex in
proportion to its degree. High-degree "hub" vertices are visited more frequently — a fact that
underpins the intuition behind centrality measures in network science.
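These facts are easy to check numerically. The NumPy sketch below (the graph — a triangle with a pendant vertex — is an arbitrary illustrative choice) builds \(P = D^{-1}A\), verifies \(\boldsymbol{\pi} P = \boldsymbol{\pi}\) for \(\pi_v = \deg(v)/2m\), and iterates \(\boldsymbol{\pi}^{(k)} = \boldsymbol{\pi}^{(0)} P^k\) to observe convergence:

```python
import numpy as np

# Triangle 0-1-2 with a pendant vertex 3 attached to 0:
# connected and non-bipartite (it contains an odd cycle).
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
deg = A.sum(axis=1)                      # degrees: 3, 2, 2, 1
P = A / deg[:, None]                     # transition matrix P = D^{-1} A
pi = deg / deg.sum()                     # claimed stationary distribution deg(v)/2m

assert np.allclose(P.sum(axis=1), 1.0)   # P is row-stochastic
assert np.allclose(pi @ P, pi)           # stationarity: pi P = pi

dist = np.array([1.0, 0.0, 0.0, 0.0])    # walker starts at vertex 0
for _ in range(200):
    dist = dist @ P                      # pi^(k) = pi^(0) P^k
assert np.allclose(dist, pi)             # convergence (non-bipartite => aperiodic)
```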
The Random Walk Laplacian
The transition matrix is intimately connected to the
graph Laplacian. Recall that the
(combinatorial) graph Laplacian is \(L = D - A\). The random walk Laplacian is defined as
\[
L_{\text{rw}} = D^{-1}L = I - D^{-1}A = I - P.
\]
This operator satisfies \(L_{\text{rw}} \mathbf{f} = \mathbf{f} - P\mathbf{f}\), so its action on a
function \(\mathbf{f} \colon V \to \mathbb{R}\) measures the difference between \(\mathbf{f}\) and its
local average under the random walk:
\[
(L_{\text{rw}} \mathbf{f})(v) = f(v) - \frac{1}{\deg(v)} \sum_{u \,:\, (u,v) \in E} f(u).
\]
A function is harmonic at \(v\) (i.e., \(L_{\text{rw}} \mathbf{f}(v) = 0\)) precisely when
its value equals the average over neighbors — the discrete analogue of the mean value property that
characterizes harmonic functions in continuous analysis.
The eigenvalues of \(L_{\text{rw}}\) are \(0 = \mu_1 \leq \mu_2 \leq \cdots \leq \mu_n\), where
\(\mu_i = 1 - \lambda_i\) and \(1 = \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n > -1\) are the
eigenvalues of \(P\) (the lower bound \(\lambda_n > -1\) holds because the graph is non-bipartite).
On a bipartite graph, by contrast, \(\lambda_n = -1\): the walker oscillates between the two
vertex classes with period 2, so the distribution never settles but instead alternates indefinitely —
this is the spectral signature of periodicity.
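The spectral signature is easy to observe numerically (NumPy sketch; the two example graphs are arbitrary choices). Since \(P = D^{-1}A\) is similar to the symmetric \(D^{-1/2} A D^{-1/2}\), its eigenvalues are real and can be computed with a symmetric eigensolver:

```python
import numpy as np

def transition_eigenvalues(A):
    """Eigenvalues of P = D^{-1} A in descending order, obtained from the
    similar symmetric matrix D^{-1/2} A D^{-1/2}."""
    deg = A.sum(axis=1)
    sym = A / np.sqrt(np.outer(deg, deg))
    return np.sort(np.linalg.eigvalsh(sym))[::-1]

# 4-cycle: bipartite, so lambda_n = -1 and the walk is periodic.
C4 = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
# Triangle: an odd cycle, so lambda_n > -1 and the walk is aperiodic.
K3 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)

assert np.isclose(transition_eigenvalues(C4)[-1], -1.0)
assert transition_eigenvalues(K3)[-1] > -1.0
```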
The rate of convergence to stationarity is controlled by
\[
\bar{\lambda} = \max\bigl(|\lambda_2|,\, |\lambda_n|\bigr),
\]
which captures contributions from both the second-largest and the most negative eigenvalue. The
standard mixing time bound is
\[
\|\boldsymbol{\pi}^{(k)} - \boldsymbol{\pi}\|_{\text{TV}} \leq \frac{1}{2}\sqrt{\frac{1}{\pi_{\min}}}\;\bar{\lambda}^{\,k},
\]
where \(\pi_{\min} = \min_v \pi_v\). The quantity \(1 - \bar{\lambda}\) is the
absolute spectral gap: a large gap means rapid mixing, while a small gap means
the walk remains trapped in local clusters for long periods. The spectral facts of this
paragraph belong to the standard theory of stochastic matrices and Markov chains;
detailed proofs are given in the dedicated treatments of
stochastic matrices and
Markov chains.
Hitting Time and Commute Time
Definition: Hitting Time and Commute Time
The hitting time \(H(u, v)\) is the expected number of steps for a random walk
starting at \(u\) to first reach \(v\):
\[
H(u,v) = \mathbb{E}\bigl[\min\{t \geq 1 : X_t = v\} \mid X_0 = u\bigr].
\]
The commute time is the expected number of steps for a round trip:
\[
C(u,v) = H(u,v) + H(v,u).
\]
These quantities admit a spectral characterization. Since \(L_{\text{rw}} = D^{-1}L\) is generally
not symmetric, we work with the similar symmetric matrix
\(L_{\text{sym}} = D^{-1/2} L\, D^{-1/2} = I - D^{-1/2} A\, D^{-1/2}\), which shares the same
eigenvalues \(\mu_1 = 0 < \mu_2 \leq \cdots \leq \mu_n\). Let
\(\boldsymbol{\psi}_1, \ldots, \boldsymbol{\psi}_n\) be an orthonormal eigenbasis of \(L_{\text{sym}}\)
(in the standard Euclidean inner product), and define
\(\boldsymbol{\phi}_k = D^{-1/2}\boldsymbol{\psi}_k\) — these are the right eigenvectors of
\(L_{\text{rw}}\), orthonormal with respect to the \(D\)-weighted inner product
\(\langle \boldsymbol{\phi}_j, \boldsymbol{\phi}_k \rangle_D = \boldsymbol{\phi}_j^\top D\, \boldsymbol{\phi}_k = \delta_{jk}\).
Under this normalization,
\[
C(u,v) = 2m \sum_{k=2}^{n} \frac{1}{\mu_k}\bigl(\phi_k(u) - \phi_k(v)\bigr)^2,
\]
where \(m = |E|\). This is precisely the squared Euclidean distance in a spectral embedding weighted by
\(1/\mu_k\) — a direct bridge between the combinatorial notion of "distance by random traversal" and the
algebraic notion of "distance in eigenspace." This identity is the commute-time formula
(Lovász's 1993 survey on random walks); we state it without proof, as the derivation requires
the electrical-network interpretation of random walks beyond the scope of this page.
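Although we omit the proof, the formula can be verified numerically on small graphs (NumPy sketch; the path graph and all names below are our own choices). Hitting times are computed independently by solving the linear system \(H(u,v) = 1 + \sum_w P_{uw} H(w,v)\) over starts \(u \neq v\):

```python
import numpy as np

def hitting_times_to(A, v):
    """H(u, v) for every start u, from H(u,v) = 1 + sum_w P[u,w] H(w,v).
    The entry for u = v is set to 0 (we only use u != v)."""
    n = len(A)
    P = A / A.sum(axis=1)[:, None]
    idx = [u for u in range(n) if u != v]     # unknowns: all starts except v
    M = np.eye(n - 1) - P[np.ix_(idx, idx)]
    h = np.linalg.solve(M, np.ones(n - 1))
    H = np.zeros(n)
    H[idx] = h
    return H

def commute_spectral(A, u, v):
    """C(u,v) = 2m * sum_{k>=2} (phi_k(u) - phi_k(v))^2 / mu_k."""
    n = len(A)
    deg = A.sum(axis=1)
    m = A.sum() / 2
    L_sym = np.eye(n) - A / np.sqrt(np.outer(deg, deg))
    mu, psi = np.linalg.eigh(L_sym)           # ascending; mu[0] = 0 when connected
    phi = psi / np.sqrt(deg)[:, None]         # D-orthonormal eigenvectors of L_rw
    return 2 * m * sum((phi[u, k] - phi[v, k]) ** 2 / mu[k] for k in range(1, n))

# Path graph 0 - 1 - 2: H(0,1) = 1 and H(1,0) = 3, so C(0,1) = 4.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
C_direct = hitting_times_to(A, 1)[0] + hitting_times_to(A, 0)[1]
assert np.isclose(C_direct, 4.0)
assert np.isclose(commute_spectral(A, 0, 1), C_direct)
```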
Proposition: Cover Time Bound
The cover time \(\text{Cov}(G)\) — the expected time for a random walk to visit every
vertex — satisfies
\[
\text{Cov}(G) \leq 2m(n-1)
\]
for any connected graph with \(n\) vertices and \(m\) edges.
Proof Sketch:
Fix any spanning tree \(T\) of \(G\) and a depth-first traversal of \(T\) that visits all
\(n\) vertices and returns to the start, traversing each tree edge exactly twice — a closed
walk on \(T\) of length \(2(n-1)\). The cover time of \(G\) is bounded by the expected time
the random walker takes to mimic this traversal in \(G\) itself: at each step of the closed
walk, the walker must hit the next prescribed vertex along a tree edge \((u,v)\),
contributing an expected \(H(u,v)\) steps. Summing over the \(2(n-1)\) directed tree-edge
traversals pairs each undirected tree edge with both directions, contributing
\(C(u,v) = H(u,v) + H(v,u)\) per tree edge. A separate bound
(Aleliunas–Karp–Lipton–Lovász–Rackoff, 1979) establishes \(C(u,v) \leq 2m\) for any edge
\((u,v) \in E\), so summing over the \(n - 1\) tree edges gives
\(\text{Cov}(G) \leq (n-1) \cdot 2m = 2m(n-1)\).
Random Walks, PageRank, and Graph Representation Learning
PageRank modifies the simple random walk with a teleportation
mechanism: at each step, with probability \(\alpha\) the walker follows a random edge, and with
probability \(1 - \alpha\) it jumps to a uniformly random vertex. The transition matrix becomes
\(P_{\text{PR}} = \alpha P + (1 - \alpha)\frac{1}{n}\mathbf{1}\mathbf{1}^\top\),
where \(\mathbf{1} \in \mathbb{R}^n\) is the all-ones column vector, and its stationary
distribution ranks vertices by a blend of local connectivity and global accessibility.
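A minimal power-iteration sketch follows (NumPy; the undirected example graph and damping value are illustrative — web-scale PageRank runs on directed link graphs, with extra handling for dangling pages, which we assume away here):

```python
import numpy as np

def pagerank(A, alpha=0.85, iters=200):
    """Stationary distribution of P_PR = alpha P + (1 - alpha) (1/n) 11^T,
    by power iteration on a row distribution vector.
    Assumes every vertex has at least one edge (no zero rows)."""
    n = len(A)
    P = A / A.sum(axis=1)[:, None]
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        # pi @ (11^T / n) = (1/n) 1 because pi sums to one,
        # so teleportation contributes the constant (1 - alpha)/n.
        pi = alpha * (pi @ P) + (1.0 - alpha) / n
    return pi

# Triangle with a pendant vertex: the hub (vertex 0) ranks highest.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
scores = pagerank(A)
assert np.isclose(scores.sum(), 1.0)
assert scores.argmax() == 0
```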
In graph representation learning, random walks are the engine behind embedding
algorithms. DeepWalk samples random walk sequences from a graph and feeds them to
a Word2Vec-style skip-gram model, learning vertex embeddings that capture multi-hop neighborhood
structure. Node2Vec generalizes this with biased random walks controlled by
return parameter \(p\) and in-out parameter \(q\), interpolating between breadth-first (local)
and depth-first (global) exploration. The connection to spectral methods is deep: it can be shown
that the DeepWalk objective implicitly factorizes a matrix closely related to the transition matrix
\(P\), linking these seemingly heuristic methods back to the spectral theory of graph Laplacians.