Introduction
In the previous chapter, we introduced completeness as a property ensuring that a metric space
has "no holes." We stated that complete spaces are those where every Cauchy sequence converges.
But what exactly is convergence in a metric space? And what is a Cauchy sequence?
Traditional calculus often presents these concepts through the lens of pointwise \(\epsilon\)-\(N\) inequalities
- a "micro-arithmetic" approach that can sometimes obscure the underlying geometric truth. In this chapter,
we shift our perspective from individual point-to-point distances to structural topology. We define convergence
and Cauchy properties by looking at the behavior of the tail of a sequence: whether the infinite remainder of
our process eventually "settles" into an arbitrarily small neighborhood.
This shift is fundamental to understanding why optimization algorithms work. When gradient descent produces a
sequence of iterates \(\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots\), we need precise language to describe
how this sequence "concentrates" its energy toward a solution. By treating convergence as a structural event
rather than a mere numerical coincidence, we gain a clearer blueprint for analyzing stability and consistency in
high-dimensional AI-driven landscapes.
Convergence of Sequences
Here we want to generalize the familiar \(\epsilon\)-\(N\) notion of sequence convergence from elementary calculus to arbitrary
metric spaces. Why does this matter for modern ML? Because ML operates in spaces
far richer than \(\mathbb{R}^n\): probability distributions (Wasserstein space), functions (reproducing
kernel Hilbert spaces), and even neural network weights (high-dimensional parameter spaces with
non-Euclidean geometry).
Definition: Tails
Suppose \(X\) is a non-empty set and \(\{x_n\}\) is a sequence in \(X\). For each \(m \in \mathbb{N}\), the
set
\[
\text{tail}_m \{x_n\} = \{x_n \mid n \in \mathbb{N}, n \geq m\}
\]
is called the \(m\)th tail of the sequence \(\{x_n\}\).
The tail captures what happens after we discard finitely many initial terms.
In algorithm analysis, we care about asymptotic behavior: whether training eventually stabilizes,
not whether the first few epochs are noisy.
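As a quick illustration, here is a minimal Python sketch (the names `tail` and `loss_history` are our own, purely illustrative) of taking the \(m\)th tail of a finite prefix of a sequence, which is exactly the "discard the noisy start" operation described above.
```python
def tail(sequence, m):
    """Return the m-th tail of a (finite prefix of a) sequence: all terms x_n with n >= m.

    Indices start at 0 here for simplicity, while the text indexes terms by natural numbers.
    """
    return sequence[m:]

# Hypothetical loss history: noisy early epochs, then stabilization.
loss_history = [5.2, 3.1, 2.8, 1.05, 1.01, 1.002, 1.0005, 1.0001]

# Asymptotic questions only look at a tail, ignoring the noisy first terms.
print(tail(loss_history, 3))  # [1.05, 1.01, 1.002, 1.0005, 1.0001]
```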
Definition: Convergence and Limit
Suppose \(X\) is a metric space, \(z \in X\) and \(\{x_n\}\) is a sequence in \(X\). We say that \(\{x_n\}\)
converges to \(z\) in \(X\), written \(x_n \to z\), if and only if every open subset of \(X\) that contains \(z\)
includes a tail of \(\{x_n\}\). Furthermore, \(z\) is called the limit of \(\{x_n\}\), denoted by \(\lim x_n\).
Insight: Convergence in ML Contexts
The abstract definition of convergence unifies diverse phenomena in machine learning, shifting the focus from individual
points to the behavior of systems in different spaces:
-
Parameter Convergence:
In optimization, we expect iterates \(\theta_t \to \theta^*\) under proper conditions
(e.g., Robbins-Monro conditions for SGD). This means that for any \(\epsilon\)-ball around the optimum \(\theta^*\),
the sequence eventually enters and remains within it.
Note: Without these conditions, the stochastic noise might prevent the sequence from settling.
- Distributional Convergence:
In Generative Models (like GANs or Diffusion Models), we aim for the generated distribution \(p_G\) to approach
the data distribution \(p_{\text{data}}\). This convergence is measured in probability spaces using
"statistical" distances (e.g., KL divergence) or metrics (e.g., Wasserstein distance); note that the
KL divergence is not a metric. (A toy sketch of this kind of convergence appears after this list.)
- Function Space Convergence:
The Universal Approximation Theorem states that neural networks are dense in the space of continuous functions.
This implies that as we increase the model capacity, the approximation can converge to any target function \(f\)
under norms like the \(L_{\infty}\) (uniform convergence) or \(L_2\) norm.
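To make the distributional case concrete, here is a minimal sketch, assuming NumPy and SciPy are available; the Gaussian "generator" is a stand-in for a real generative model, and all variable names are ours.
```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D Wasserstein (earth mover's) distance

rng = np.random.default_rng(0)
p_data = rng.normal(loc=0.0, scale=1.0, size=5000)  # samples standing in for p_data

# A toy "generator" whose output distribution drifts toward the data distribution.
for t, mean in enumerate([2.0, 1.0, 0.5, 0.1, 0.01]):
    p_gen = rng.normal(loc=mean, scale=1.0, size=5000)
    print(f"step {t}: W1(p_gen, p_data) ~ {wasserstein_distance(p_gen, p_data):.3f}")
```
The shrinking \(W_1\) values illustrate convergence of \(p_G\) toward \(p_{\text{data}}\) in the Wasserstein metric.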
Convergence in a metric space can be characterized through multiple equivalent lenses:
Theorem: Criteria for Convergence
Suppose \(X\) is a metric space, \(z \in X\) and \(\{x_n\}\) is a sequence in \(X\). The following statements
are logically equivalent:
-
\[
\bigcap \left\{\overline{\{x_n \mid n \in S\}} \mid S \subseteq \mathbb{N}, S \text{ infinite}\right\} = \{z\}.
\]
-
\[
z \in \bigcap \left\{\overline{\{x_n \mid n \in S\}} \mid S \subseteq \mathbb{N}, S \text{ infinite}\right\}.
\]
-
\[
\text{dist }(z, \{x_n \mid n \in S\}) = 0
\]
for every infinite subset \(S\) of \(\mathbb{N}\).
- Every open ball centered at \(z\) includes a tail of \(\{x_n\}\).
- Every open subset of \(X\) that contains \(z\) includes a tail of \(\{x_n\}\).
The fourth criterion is the familiar \(\epsilon\)-ball formulation: for every \(\epsilon > 0\), there exists \(N\)
such that \(d(x_n, z) < \epsilon\) for all \(n \geq N\). This is precisely what we check when we set
a convergence tolerance in numerical algorithms.
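As a sketch of the fourth criterion in code, here is a minimal check, with hypothetical names (`has_entered_ball`), that finds the index after which all observed iterates stay inside the \(\epsilon\)-ball around a known reference point \(z\). In real optimization \(z\) is unknown, which is exactly the gap the Cauchy criterion addresses later in this chapter.
```python
import numpy as np

def has_entered_ball(iterates, z, eps):
    """Return the first index N such that d(x_n, z) < eps for all observed n >= N, or None."""
    dists = [np.linalg.norm(x - z) for x in iterates]
    for N in range(len(dists)):
        if all(d < eps for d in dists[N:]):
            return N
    return None

# Toy iterates x_n = z + (0.5)^n * v approaching a known point z.
z = np.array([1.0, -2.0])
v = np.array([3.0, 4.0])
iterates = [z + 0.5**n * v for n in range(20)]
print(has_entered_ball(iterates, z, eps=1e-3))  # index after which the tail stays in B(z, 1e-3)
```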
The fifth criterion reveals the topological perspective. Beyond numerical distances,
\(z\) is the limit because the sequence eventually enters and never leaves any open
"neighborhood" of \(z\). This formulation generalizes to spaces where we may not have an explicit
metric but do have a notion of "open sets" (general topological spaces).
Remember, convergent sequences of real numbers converge to at most one point. This is not special to
\(\mathbb{R}\): uniqueness of limits holds in any metric space. For optimization, this means that if the
iterates converge at all, they cannot oscillate indefinitely between multiple candidate limits.
Theorem: The Uniqueness of Limits
Suppose \(X\) is a metric space, and \(\{x_n\}\) is a sequence in \(X\) that converges in \(X\).
Then \(\{x_n\}\) converges to exactly one point in \(X\).
Proof:
Suppose \(w, z \in X\) and \(\{x_n\}\) converges to both \(w\) and \(z\). Then, by the first criterion for
convergence,
\[
\bigcap \left\{\overline{\{x_n \mid n \in S\}} \mid S \subseteq \mathbb{N}, S \text{ infinite}\right\} = \{w\}
\]
and
\[
\bigcap \left\{\overline{\{x_n \mid n \in S\}} \mid S \subseteq \mathbb{N}, S \text{ infinite}\right\} = \{z\}.
\]
Thus, \(\{w\} = \{z\}\), and \(w = z\).
In the real world, it is usually impossible to observe "everything." So, we are more interested in
"portions" of the whole sequence. Fortunately, every subsequence of a convergent sequence converges
to the same limit as the parent sequence, because every tail of the parent sequence includes
a tail of each of its subsequences.
Theorem: Convergence of Subsequences
Suppose \(\{x_n\}\) is a sequence in a metric space \(X\). If \(\{x_n\}\) converges to \(z \in X\),
then every subsequence \(\{x_{m_n}\}\) of \(\{x_n\}\) also converges to \(z\).
Side Note: In this case, \(z\) is in the closure of every tail of \(\{x_n\}\). Equivalently,
every ball centered at \(z\) contains infinitely many terms of \(\{x_n\}\).
This theorem states that convergence is invariant under any infinite sub-sampling.
Whether we monitor every single iterate or only record the state every 100 epochs, the identified limit
remains the same.
Insight: Consistency Across Observations
In practical machine learning, we rarely inspect every single update of a high-frequency optimizer.
Instead, we "sample" the model's performance at specific intervals.
-
Robustness to Checkpointing:
This property guarantees that the "limit" we observe through periodic checkpoints is mathematically
identical to the true limit of the full training process.
-
Avoiding Inconsistency:
If a sequence were to have subsequences converging to different limits, our evaluation results would
depend entirely on when we measured the performance - a nightmare for reproducibility.
Convergence prevents this "mode-switching" instability.
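The following toy sketch (our own construction, not from a specific framework) records every 100th iterate of a convergent sequence, mimicking a checkpointing schedule, and shows that the sampled subsequence settles on the same limit as the full sequence.
```python
import numpy as np

# Full sequence of "iterates" x_n = 1 / (n + 1), which converges to 0 in R.
full = np.array([1.0 / (n + 1) for n in range(100_000)])

# A "checkpointed" subsequence: record only every 100th iterate.
checkpoints = full[::100]

# Both the full sequence and the sampled subsequence have tails arbitrarily close to the limit 0.
print(full[-1], checkpoints[-1])
```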
Convergence in a metric subspace is not exactly the same notion as convergence of subsequences in a metric space.
Theorem: Convergence in Subspaces
Let \(X\) be a metric space, and \(Y\) be a metric subspace of \(X\).
Suppose \(\{x_n\}\) is a sequence in \(Y\) that converges to \(w \in X\). Then
\(\{x_n\}\) converges in \(Y\) if and only if \(w \in Y\), and in that case, its
limit in \(Y\) is \(w\).
Insight: Why Subspaces Matter in ML
We often constrain parameters to a subset \(Y\) (e.g., via weight clipping).
The critical question is whether the limit of the iterates stays within our allowed region \(Y\).
-
Closed Subsets: If our constraint set \(Y\) is closed in \(\mathbb{R}^n\),
any sequence in \(Y\) that converges in the larger space \(\mathbb{R}^n\) is guaranteed
to have its limit in \(Y\). (A counterexample with a non-closed \(Y\) is sketched after this list.)
-
Numerical Instability: The set of floating-point numbers \(\mathbb{F} \subseteq \mathbb{R}\)
is a discrete subspace. A sequence might "converge" in \(\mathbb{R}\), but its limit
may not exist in \(\mathbb{F}\), leading to rounding errors or numerical instability.
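Here is a minimal numerical illustration (our own) of the subspace theorem with a non-closed \(Y\): every iterate lies in \(Y = (0, 1]\), the sequence converges to 0 in \(\mathbb{R}\), but 0 is not a point of \(Y\), so the sequence has no limit inside the subspace.
```python
# Iterates x_n = 1/n all lie in the subspace Y = (0, 1] of the real line.
xs = [1.0 / n for n in range(1, 10_001)]
print(all(0.0 < x <= 1.0 for x in xs))   # True: the whole sequence stays inside Y

limit_in_R = 0.0                          # the sequence converges to 0 in the larger space R,
print(0.0 < limit_in_R <= 1.0)            # False: but 0 does not belong to Y,
                                          # so {x_n} does not converge in the subspace Y
```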
Cauchy Sequences
How do we know a sequence "should" converge without knowing the limit? In practice, this is exactly
the situation we face: gradient descent produces iterates, and we need a stopping criterion that
doesn't require oracle knowledge of the optimal solution.
The key insight is to measure whether the sequence elements are getting closer to each other,
rather than closer to some unknown target.
Definition: Cauchy Sequence
Suppose \(X\) is a metric space and \(\{x_n\}\) is a sequence in \(X\). We say that \(\{x_n\}\) is a
Cauchy sequence in \(X\) if and only if for every \(r \in \mathbb{R}^+\), there is
a ball of \(X\) of radius \(r\) that includes a tail of \(\{x_n\}\).
Theorem:
Let \((X, d)\) be a metric space. If a sequence \(\{x_n\}\) converges in \(X\), then
\(\{x_n\}\) is a Cauchy sequence.
Note that the converse of this statement is not always true.
Proof Sketch:
By the Criteria for Convergence, if \(x_n \to z\), then every open ball centered at \(z\) includes a tail of
\(\{x_n\}\). In particular, for every \(r \in \mathbb{R}^+\), the ball of radius \(r\) centered at \(z\) includes
a tail of \(\{x_n\}\), which is exactly the definition of a Cauchy sequence.
Insight: The Practical Dilemma
In optimization, the target point \(x^*\) is unknown - detecting convergence by checking \(d(x_n, x^*) < \epsilon\) is
therefore a circular problem. The Cauchy criterion provides a rigorous theoretical escape:
it ensures convergence by examining whether all future terms in a sequence remain arbitrarily close to one another,
independent of the limit itself.
However, a gap exists between theory and implementation. Verifying the formal Cauchy condition requires monitoring an
infinite tail of the sequence, which is computationally impossible. In practice, we use stagnation-based stopping criteria
(such as \(\|\theta_{t+1} - \theta_t\| < \epsilon\) over a fixed window). While these are inspired by Cauchy's logic, they are
heuristic approximations: they detect when an algorithm has slowed down, but do not mathematically guarantee
that the sequence is Cauchy, let alone that it has reached a global optimum.
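The sketch below shows one such stagnation-based rule; the function name `stagnated`, the window size, and the toy iterates are all our own illustrative choices, not a standard library API. It stops when the last few steps are all smaller than a tolerance, which is a finite-window imitation of the Cauchy condition, not a guarantee of it.
```python
import numpy as np

def stagnated(thetas, eps=1e-6, window=5):
    """Stop when every consecutive step in the last `window` updates is smaller than eps.

    This mimics the Cauchy idea (terms close to each other) on a finite window only,
    so it can fire on a plateau even if the full sequence is not Cauchy.
    """
    if len(thetas) < window + 1:
        return False
    recent = thetas[-(window + 1):]
    steps = [np.linalg.norm(b - a) for a, b in zip(recent[:-1], recent[1:])]
    return max(steps) < eps

# Toy iterates theta_t = theta* + 0.9^t * v shrinking toward a theta* unknown to the rule.
theta_star = np.array([1.0, 2.0])
thetas = [theta_star + 0.9**t * np.array([1.0, -1.0]) for t in range(200)]
for t in range(1, len(thetas) + 1):
    if stagnated(thetas[:t], eps=1e-4, window=5):
        print("stopping at iteration", t)
        break
```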
Boundedness
In pure mathematics, boundedness is a property that a sequence possesses as a necessary
consequence of convergence. In machine learning, we invert this logic: we enforce boundedness as a
mechanism to maintain numerical stability and prevent divergence.
Definition: Bounded
A subset \(S\) of a metric space \(X\) is called a bounded subset of \(X\) if and only if
\(S = \emptyset \) or \(S\) is included in some ball of \(X\). A metric space \(X\) is said to be
bounded if and only if it is a bounded subset of itself.
Definition: Diameter
Suppose \((X, d)\) is a metric space and \(S\) is a subset of \(X\).
The diameter of \(S\) is defined as
\[
\text{diam }(S) = \sup \{d(r, s) \mid r, s \in S\}.
\]
Theorem: Criteria for Boundedness
Suppose \(X\) is a nonempty metric space, \(z \in X\), and \(S \subseteq X\). The following statements
are logically equivalent:
- \(\text{diam }(S) < \infty\).
- There is a ball of \(X\) centered at \(z\) that includes \(S\).
- There is a ball of \(X\) that includes \(S\).
Theorem:
Suppose \(X\) is a metric space and \(\{x_n\}\) is a sequence in \(X\). If \(\{x_n\}\) is Cauchy, then
\(\{x_n\}\) is bounded in \(X\). In particular, since every convergent sequence is Cauchy, if \(\{x_n\}\)
converges in \(X\), then \(\{x_n\}\) is bounded in \(X\).
Proof Sketch:
If \(\{x_n\}\) is Cauchy, then for \(\epsilon = 1\), there exists \(N\) such that all terms
\(x_n\) with \(n \geq N\) lie within distance 1 of \(x_N\). The finitely many terms
\(x_1, \ldots, x_{N-1}\) are at finite distances from \(x_N\). Taking \(R = \max\{d(x_i, x_N) : i < N\} + 1\),
all terms lie in the ball of radius \(R\) centered at \(x_N\).
Important: The converse is false. Boundedness alone does not guarantee convergence.
Consider \(x_n = (-1)^n\) in \(\mathbb{R}\). The sequence is bounded (contained in \([-1, 1]\)) but
oscillates forever without converging.
Insight:
While enforcing a bound does not mathematically guarantee convergence (as the sequence could still oscillate),
it ensures the iterates remain within a bounded region. In the finite-dimensional parameter
spaces typical of machine learning, such a region - when combined with appropriate closure - forms a
compact set. Since every Cauchy (and hence every convergent) sequence is bounded, this "structural
containment" is a necessary condition for convergence, even though it cannot force convergence on its own.
-
Gradient Clipping:
By enforcing \(\|\nabla L\| \leq M\), we bound the size of each individual update step. This is a heuristic
countermeasure against the exploding gradient problem: it prevents any single step from blowing up, although
it does not by itself confine the whole trajectory to a bounded set. (A minimal sketch appears after this list.)
-
Weight Clipping (e.g., WGAN):
Forcing weights into a compact interval \([-c, c]\) ensures Lipschitz continuity.
This mathematical constraint is required to bound the slope of the function, stabilizing the dual form
of the Wasserstein distance.
-
Regularization (\(L^2\)):
Adding a penalty \(\|\theta\|^2\) makes the objective function coercive. It ensures
that the set of parameters with low loss remains bounded, effectively "trapping" the optimization process
in a region where a minimum is guaranteed to exist.
-
Trust Regions:
By restricting each step to a ball of radius \(\Delta\), we ensure that the Taylor approximation
(which is only locally valid) remains an accurate model of the loss surface.
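As promised above, here is a minimal sketch of gradient-norm clipping on a toy quadratic loss; the names (`clip_gradient`, the loss, the learning rate) are illustrative and not tied to any particular framework.
```python
import numpy as np

def clip_gradient(grad, max_norm):
    """Rescale grad so that its Euclidean norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

# Toy quadratic loss L(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([50.0, -80.0])   # deliberately large initial point -> large gradients
lr, max_norm = 0.1, 1.0

for step in range(5):
    grad = theta                           # gradient of the toy loss
    grad = clip_gradient(grad, max_norm)   # each update step now has norm <= lr * max_norm
    theta = theta - lr * grad
    print(step, np.round(theta, 3))
```
Note that clipping bounds each step to norm at most `lr * max_norm`; it does not, by itself, confine the whole trajectory to a bounded region.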
While boundedness ensures that our sequence doesn't escape to infinity, it doesn't
guarantee that the set is structurally "solid" enough to contain its own limit points.
To bridge this gap, we consider a property that connects distance to existence
in the most fundamental way.
Definition: Nearest-Point Property
Suppose \(X\) is a metric space. We say that \(X\) has the nearest-point property
if and only if \(X = \emptyset\) or \(X\) admits a nearest point to each point in every metric superspace of
\(X\).
This property is logically equivalent to the assertion that
every bounded sequence in \(X\) has a subsequence that converges in \(X\).
It ensures that the space has no "holes" for bounded sequences to fall into.
Insight: Structural Guarantee in Optimization
In optimization, simply knowing your search space is limited (bounded) is only half the battle.
The nearest-point property ensures that even if you can't reach an "ideal" point
outside your set, you can always find a best approximation (a projection) within it.
For instance, in Constrained SGD or SVM, we rely on this structural
"solidness" to ensure that our algorithms always have a valid point to converge to. We will
explore how this combines with boundedness to form the broader concept of compactness
in a later chapter.
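The following sketch (our own toy example) shows the projection idea in action: a projected gradient step onto a closed box, where `np.clip` returns the nearest point of the box and the iterates converge to the best feasible approximation of an infeasible target.
```python
import numpy as np

def project_onto_box(theta, low, high):
    """Nearest point of the closed box [low, high]^d to theta (coordinate-wise clipping)."""
    return np.clip(theta, low, high)

# Toy objective: L(theta) = ||theta - target||^2 with the unconstrained minimizer outside the box.
target = np.array([3.0, -4.0])
low, high = -1.0, 1.0

theta = np.zeros(2)
lr = 0.1
for _ in range(100):
    grad = 2.0 * (theta - target)
    theta = project_onto_box(theta - lr * grad, low, high)

print(theta)  # settles at the nearest feasible point [1., -1.], the projection of the target
```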
As we move toward more complex spaces, especially when dealing with infinite-dimensional spaces,
we need a refined version of boundedness that accounts for the "density" of the set, leading us to
the concept of total boundedness.
Definition: Total Boundedness
A subset \(S\) of a metric space \(X\) is called a totally bounded subset of \(X\)
if and only if for each \(r \in \mathbb{R}^+\), there is a finite collection of balls of \(X\) of
radius \(r\) that covers \(S\). The metric space \(X\) is said to be totally bounded
if and only if it is a totally bounded subset of itself.
Insight: Total Boundedness in Function Spaces
In the space of neural network functions, total boundedness directly relates to
covering numbers - the minimum number of balls of radius \(\epsilon\)
needed to cover a function class.
-
Capacity Control:
In statistical learning theory, the logarithm of the covering number (metric entropy)
measures the complexity of a model.
-
Generalization:
A totally bounded function class ensures that we can approximate any function in the
class with a finite set of "representative" networks. Smaller covering numbers imply a
less complex hypothesis space, which typically leads to tighter generalization bounds.
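As a toy illustration of covering numbers in the simplest possible space, the sketch below counts how many intervals of radius \(r\) are needed to cover \([0, 1]\) and prints the corresponding metric entropy; the helper name is ours.
```python
import math

def covering_number_interval(r):
    """Number of closed balls (intervals) of radius r needed to cover [0, 1]:
    place centers at r, 3r, 5r, ... so that consecutive intervals just touch."""
    return math.ceil(1.0 / (2.0 * r))

for r in [0.5, 0.1, 0.01, 0.001]:
    N = covering_number_interval(r)
    print(f"r = {r}: N(r) = {N}, metric entropy log N(r) = {math.log(N):.2f}")
```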
Completeness Revisited
In the previous chapter, we defined completeness as follows:
Definition: Completeness
A metric space \(X\) is said to be complete if and only if
\(X\) is closed in every metric superspace of \(X\).
This formulation prioritizes the structural integrity of the metric space itself - the space has
"no holes" that could be exposed by embedding it in a larger space. Now that we have developed
the Cauchy criterion, we can state an equivalent characterization that is often more useful in practice.
Definition: Completeness
A metric space \(X\) is said to be complete if and only if
every Cauchy sequence in \(X\) converges in \(X\).
The equivalence of these definitions is a fundamental theorem we will prove in later chapters.
For now, the sequential characterization gives us an operational test: to verify completeness,
we must show that any sequence whose terms get arbitrarily close together eventually reaches
a limit within the space.
Example:
\(\mathbb{R}^n\) (with the Euclidean metric) is complete. This is the Cauchy completeness axiom
of the real numbers, extended to finite-dimensional spaces. However, \(\mathbb{Q}^n\) (rationals)
is not complete - a Cauchy sequence of rationals can have an irrational limit in \(\mathbb{R}^n\) and therefore
fail to converge within \(\mathbb{Q}^n\).
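The classic witness for the incompleteness of \(\mathbb{Q}\) can be computed directly. The sketch below runs the Babylonian (Newton) iteration \(x_{k+1} = \tfrac{1}{2}(x_k + 2/x_k)\) in exact rational arithmetic: every term is rational and consecutive terms get arbitrarily close, yet the limit, \(\sqrt{2}\), lies outside \(\mathbb{Q}\).
```python
from fractions import Fraction

# Babylonian (Newton) iteration for sqrt(2): every iterate is an exact rational number.
x = Fraction(1)
for _ in range(6):
    x = (x + Fraction(2) / x) / 2
    print(x, float(x))

# Consecutive terms get arbitrarily close (the sequence is Cauchy in Q),
# but the limit, sqrt(2), is irrational -- it exists in R, not in Q.
```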
Insight:
Completeness is the "floor" of our mathematical world; it ensures that if an algorithm suggests a limit exists
(via the Cauchy property), that limit is a valid point within our space. In machine learning, we rely on the following
complete spaces:
-
\(\mathbb{R}^d\) (Parameter Space):
As a finite-dimensional Euclidean space, \(\mathbb{R}^d\) is complete. While this does not guarantee
that SGD will converge (which depends on learning rates and loss geometry), it ensures that any
Cauchy sequence of weights \(\{\theta_t\}\) has a unique, reachable limit \(\theta^* \in \mathbb{R}^d\).
-
Hilbert Spaces (e.g., RKHS):
By definition, a Hilbert space is a complete inner product space. The convergence of kernel-based algorithms
depends on the fact that the optimization occurs in a space where "holes" do not exist.
-
\(L^p\) Spaces (Function Spaces):
The Riesz-Fischer Theorem proves that \(L^p\) spaces are complete. This is fundamental for
universal approximation; it allows us to define the "target function" as the
limit of a sequence of neural networks under an integral norm.
-
Probability Distributions (Wasserstein Space):
When training generative models (e.g., GANs or Diffusion Models), we optimize within a space of probability distributions.
Crucially, the completeness of the Wasserstein space is inherited from the underlying base space (typically the complete Euclidean space \(\mathbb{R}^n\)).
This structural guarantee ensures that a Cauchy sequence of distributions will always converge to a valid, well-defined distribution, rather than "escaping" the space of measures.