Introduction
In the previous chapter, we introduced completeness as a property ensuring that a metric space
has "no holes." We stated that complete spaces are those where every Cauchy sequence converges.
But what exactly is convergence in a metric space? And what is a Cauchy sequence?
Traditional calculus often presents these concepts through the lens of pointwise \(\epsilon\)-\(N\) inequalities
- a "micro-arithmetic" approach that can sometimes obscure the underlying geometric truth. In this chapter,
we shift our perspective from individual point-to-point distances to structural topology. We define convergence
and Cauchy properties by looking at the behavior of the tail of a sequence: whether the infinite remainder of
our process eventually "settles" into an arbitrarily small neighborhood.
This shift is fundamental to understanding why optimization algorithms work. When gradient descent produces a
sequence of iterates \(\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots\), we need precise language to describe
how this sequence "concentrates" its energy toward a solution. By treating convergence as a structural event
rather than a mere numerical coincidence, we gain a clearer blueprint for analyzing stability and consistency in
high-dimensional AI-driven landscapes.
Convergence of Sequences
Here we want to generalize the familiar \(\epsilon\)-\(N\) notion of sequence convergence from elementary calculus to arbitrary
metric spaces. Why does this matter for modern ML? Because ML operates in spaces
far richer than \(\mathbb{R}^n\): probability distributions (Wasserstein space), functions (reproducing
kernel Hilbert spaces), and even neural network weights (high-dimensional parameter spaces with
non-Euclidean geometry).
Definition: Tails
Suppose \(X\) is a non-empty set and \(\{x_n\}\) is a sequence in \(X\). For each \(m \in \mathbb{N}\), the
set
\[
\text{tail}_m \{x_n\} = \{x_n \mid n \in \mathbb{N}, n \geq m\}
\]
is called the \(m\)th tail of the sequence \(\{x_n\}\).
The tail captures what happens after we discard finitely many initial terms.
In algorithm analysis, we care about asymptotic behavior: whether training eventually stabilizes,
not whether the first few epochs are noisy.
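As a quick illustration, here is a minimal Python sketch (the names `tail` and `loss_history` are our own, purely illustrative) of taking the \(m\)th tail of a finite prefix of a sequence, which is exactly the "discard the noisy start" operation described above.
```python
def tail(sequence, m):
    """Return the m-th tail of a (finite prefix of a) sequence: all terms x_n with n >= m.

    Indices start at 0 here for simplicity, while the text indexes terms by natural numbers.
    """
    return sequence[m:]

# Hypothetical loss history: noisy early epochs, then stabilization.
loss_history = [5.2, 3.1, 2.8, 1.05, 1.01, 1.002, 1.0005, 1.0001]

# Asymptotic questions only look at a tail, ignoring the noisy first terms.
print(tail(loss_history, 3))  # [1.05, 1.01, 1.002, 1.0005, 1.0001]
```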
Definition: Convergence and Limit
Suppose \(X\) is a metric space, \(z \in X\) and \(\{x_n\}\) is a sequence in \(X\). We say that \(\{x_n\}\)
converges to \(z\) in \(X\), written \(x_n \to z\), if and only if every open subset of \(X\) that contains \(z\)
includes a tail of \(\{x_n\}\). Furthermore, \(z\) is called the limit of \(\{x_n\}\), denoted by \(\lim x_n\).
Insight: Convergence in ML Contexts
The abstract definition of convergence unifies diverse phenomena in machine learning, shifting the focus from individual
points to the behavior of systems in different spaces:
-
Parameter Convergence:
In optimization, we expect iterates \(\theta_t \to \theta^*\) under proper conditions
(e.g., Robbins-Monro conditions for SGD). This means that for any \(\epsilon\)-ball around the optimum \(\theta^*\),
the sequence eventually enters and remains within it.
Note: Without these conditions, the stochastic noise might prevent the sequence from settling.
- Distributional Convergence:
In Generative Models (like GANs or Diffusion Models), we aim for the generated distribution \(p_G\) to approach
the data distribution \(p_{\text{data}}\). This convergence is measured in probability spaces using
"statistical" distances (e.g., KL divergence) or metrics (e.g., Wasserstein distance); note that the
KL divergence is not a metric. (A toy sketch of this kind of convergence appears after this list.)
- Function Space Convergence:
The Universal Approximation Theorem states that neural networks are dense in the space of continuous functions.
This implies that as we increase the model capacity, the approximation can converge to any target function \(f\)
under norms like the \(L_{\infty}\) (uniform convergence) or \(L_2\) norm.
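To make the distributional case concrete, here is a minimal sketch, assuming NumPy and SciPy are available; the Gaussian "generator" is a stand-in for a real generative model, and all variable names are ours.
```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D Wasserstein (earth mover's) distance

rng = np.random.default_rng(0)
p_data = rng.normal(loc=0.0, scale=1.0, size=5000)  # samples standing in for p_data

# A toy "generator" whose output distribution drifts toward the data distribution.
for t, mean in enumerate([2.0, 1.0, 0.5, 0.1, 0.01]):
    p_gen = rng.normal(loc=mean, scale=1.0, size=5000)
    print(f"step {t}: W1(p_gen, p_data) ~ {wasserstein_distance(p_gen, p_data):.3f}")
```
The shrinking \(W_1\) values illustrate convergence of \(p_G\) toward \(p_{\text{data}}\) in the Wasserstein metric.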
Convergence in a metric space can be characterized through multiple equivalent lenses:
Theorem: Criteria for Convergence
Suppose \(X\) is a metric space, \(z \in X\) and \(\{x_n\}\) is a sequence in \(X\). The following statements
are logically equivalent:
-
\[
\bigcap \left\{\overline{\{x_n \mid n \in S\}} \mid S \subseteq \mathbb{N}, S \text{ infinite}\right\} = \{z\}.
\]
-
\[
z \in \bigcap \left\{\overline{\{x_n \mid n \in S\}} \mid S \subseteq \mathbb{N}, S \text{ infinite}\right\}.
\]
-
\[
\text{dist }(z, \{x_n \mid n \in S\}) = 0
\]
for every infinite subset \(S\) of \(\mathbb{N}\).
- Every open ball centered at \(z\) includes a tail of \(\{x_n\}\).
- Every open subset of \(X\) that contains \(z\) includes a tail of \(\{x_n\}\).
The fourth criterion is the familiar \(\epsilon\)-ball formulation: for every \(\epsilon > 0\), there exists \(N\)
such that \(d(x_n, z) < \epsilon\) for all \(n \geq N\). This is precisely what we check when we set
a convergence tolerance in numerical algorithms.
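As a sketch of the fourth criterion in code, here is a minimal check, with hypothetical names (`has_entered_ball`), that finds the index after which all observed iterates stay inside the \(\epsilon\)-ball around a known reference point \(z\). In real optimization \(z\) is unknown, which is exactly the gap the Cauchy criterion addresses later in this chapter.
```python
import numpy as np

def has_entered_ball(iterates, z, eps):
    """Return the first index N such that d(x_n, z) < eps for all observed n >= N, or None."""
    dists = [np.linalg.norm(x - z) for x in iterates]
    for N in range(len(dists)):
        if all(d < eps for d in dists[N:]):
            return N
    return None

# Toy iterates x_n = z + (0.5)^n * v approaching a known point z.
z = np.array([1.0, -2.0])
v = np.array([3.0, 4.0])
iterates = [z + 0.5**n * v for n in range(20)]
print(has_entered_ball(iterates, z, eps=1e-3))  # index after which the tail stays in B(z, 1e-3)
```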
The fifth criterion reveals the topological perspective. Beyond numerical distances,
\(z\) is the limit because the sequence eventually enters and never leaves any open
"neighborhood" of \(z\). This formulation generalizes to spaces where we may not have an explicit
metric but do have a notion of "open sets" (general topological spaces).
Remember, convergent sequences of real numbers converge to at most one point. This is not special to
\(\mathbb{R}\): uniqueness of limits holds in any metric space. For optimization, this means that if the
iterates converge at all, they cannot oscillate indefinitely between multiple candidate limits.
Theorem: The Uniqueness of Limits
Suppose \(X\) is a metric space, and \(\{x_n\}\) is a sequence in \(X\) that converges in \(X\).
Then \(\{x_n\}\) converges to exactly one point in \(X\).
Proof:
Suppose \(w, z \in X\) and \(\{x_n\}\) converges to both \(w\) and \(z\). Then, by the first criterion for
convergence,
\[
\bigcap \left\{\overline{\{x_n \mid n \in S\}} \mid S \subseteq \mathbb{N}, S \text{ infinite}\right\} = \{w\}
\]
and
\[
\bigcap \left\{\overline{\{x_n \mid n \in S\}} \mid S \subseteq \mathbb{N}, S \text{ infinite}\right\} = \{z\}.
\]
Thus, \(\{w\} = \{z\}\), and \(w = z\).
In the real world, it is usually impossible to observe "everything." So, we are more interested in
"portions" of the whole sequence. Fortunately, every subsequence of a convergent sequence converges
to the same limit as the parent sequence, because every tail of the parent sequence includes
a tail of each of its subsequences.
Theorem: Convergence of Subsequences
Suppose \(\{x_n\}\) is a sequence in a metric space \(X\). If \(\{x_n\}\) converges to \(z \in X\),
then every subsequence \(\{x_{m_n}\}\) of \(\{x_n\}\) also converges to \(z\).
Side Note: In this case, \(z\) is in the closure of every tail of \(\{x_n\}\). Equivalently,
every ball centered at \(z\) contains infinitely many terms of \(\{x_n\}\).
This theorem states that convergence is invariant under any infinite sub-sampling.
Whether we monitor every single iterate or only record the state every 100 epochs, the identified limit
remains the same.
Insight: Consistency Across Observations
In practical machine learning, we rarely inspect every single update of a high-frequency optimizer.
Instead, we "sample" the model's performance at specific intervals.
-
Robustness to Checkpointing:
This property guarantees that the "limit" we observe through periodic checkpoints is mathematically
identical to the true limit of the full training process.
-
Avoiding Inconsistency:
If a sequence were to have subsequences converging to different limits, our evaluation results would
depend entirely on when we measured the performance - a nightmare for reproducibility.
Convergence prevents this "mode-switching" instability.
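The following toy sketch (our own construction, not from a specific framework) records every 100th iterate of a convergent sequence, mimicking a checkpointing schedule, and shows that the sampled subsequence settles on the same limit as the full sequence.
```python
import numpy as np

# Full sequence of "iterates" x_n = 1 / (n + 1), which converges to 0 in R.
full = np.array([1.0 / (n + 1) for n in range(100_000)])

# A "checkpointed" subsequence: record only every 100th iterate.
checkpoints = full[::100]

# Both the full sequence and the sampled subsequence have tails arbitrarily close to the limit 0.
print(full[-1], checkpoints[-1])
```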
Convergence in a metric subspace is not exactly the same notion as convergence of subsequences in a metric space.
Theorem: Convergence in Subspaces
Let \(X\) be a metric space, and \(Y\) be a metric subspace of \(X\).
Suppose \(\{x_n\}\) is a sequence in \(Y\) that converges to \(w \in X\). Then
\(\{x_n\}\) converges in \(Y\) if and only if \(w \in Y\), and in that case, its
limit in \(Y\) is \(w\).
Insight: Why Subspaces Matter in ML
We often constrain parameters to a subset \(Y\) (e.g., via weight clipping).
The critical question is whether the limit of the iterates stays within our allowed region \(Y\).
-
Closed Subsets: If our constraint set \(Y\) is closed in \(\mathbb{R}^n\),
any sequence in \(Y\) that converges in the larger space \(\mathbb{R}^n\) is guaranteed
to have its limit in \(Y\). (A counterexample with a non-closed \(Y\) is sketched after this list.)
-
Numerical Instability: The set of floating-point numbers \(\mathbb{F} \subseteq \mathbb{R}\)
is a discrete subspace. A sequence might "converge" in \(\mathbb{R}\), but its limit
may not exist in \(\mathbb{F}\), leading to rounding errors or numerical instability.
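Here is a minimal numerical illustration (our own) of the subspace theorem with a non-closed \(Y\): every iterate lies in \(Y = (0, 1]\), the sequence converges to 0 in \(\mathbb{R}\), but 0 is not a point of \(Y\), so the sequence has no limit inside the subspace.
```python
# Iterates x_n = 1/n all lie in the subspace Y = (0, 1] of the real line.
xs = [1.0 / n for n in range(1, 10_001)]
print(all(0.0 < x <= 1.0 for x in xs))   # True: the whole sequence stays inside Y

limit_in_R = 0.0                          # the sequence converges to 0 in the larger space R,
print(0.0 < limit_in_R <= 1.0)            # False: but 0 does not belong to Y,
                                          # so {x_n} does not converge in the subspace Y
```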
Cauchy Sequences
How do we know a sequence "should" converge without knowing the limit? In practice, this is exactly
the situation we face: gradient descent produces iterates, and we need a stopping criterion that
doesn't require oracle knowledge of the optimal solution.
The key insight is to measure whether the sequence elements are getting closer to each other,
rather than closer to some unknown target.
Definition: Cauchy Sequence
Suppose \(X\) is a metric space and \(\{x_n\}\) is a sequence in \(X\). We say that \(\{x_n\}\) is a
Cauchy sequence in \(X\) if and only if for every \(r \in \mathbb{R}^+\), there is
a ball of \(X\) of radius \(r\) that includes a tail of \(\{x_n\}\).
Theorem:
Let \((X, d)\) be a metric space. If a sequence \(\{x_n\}\) converges in \(X\), then
\(\{x_n\}\) is a Cauchy sequence.
Note that the converse of this statement is not always true.
Proof Sketch:
By the Criteria for Convergence, if \(x_n \to z\), then every open ball centered at \(z\) includes a tail of
\(\{x_n\}\). In particular, for every \(r \in \mathbb{R}^+\), the ball of radius \(r\) centered at \(z\) includes
a tail of \(\{x_n\}\), which is exactly the definition of a Cauchy sequence.
Insight: The Practical Dilemma
In optimization, the target point \(x^*\) is unknown - detecting convergence by checking \(d(x_n, x^*) < \epsilon\) is
therefore a circular problem. The Cauchy criterion provides a rigorous theoretical escape:
it ensures convergence by examining whether all future terms in a sequence remain arbitrarily close to one another,
independent of the limit itself.
However, a gap exists between theory and implementation. Verifying the formal Cauchy condition requires monitoring an
infinite tail of the sequence, which is computationally impossible. In practice, we use stagnation-based stopping criteria
(such as \(\|\theta_{t+1} - \theta_t\| < \epsilon\) over a fixed window). While these are inspired by Cauchy's logic, they are
heuristic approximations: they detect when an algorithm has slowed down, but do not mathematically guarantee
that the sequence is Cauchy, let alone that it has reached a global optimum.
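The sketch below shows one such stagnation-based rule; the function name `stagnated`, the window size, and the toy iterates are all our own illustrative choices, not a standard library API. It stops when the last few steps are all smaller than a tolerance, which is a finite-window imitation of the Cauchy condition, not a guarantee of it.
```python
import numpy as np

def stagnated(thetas, eps=1e-6, window=5):
    """Stop when every consecutive step in the last `window` updates is smaller than eps.

    This mimics the Cauchy idea (terms close to each other) on a finite window only,
    so it can fire on a plateau even if the full sequence is not Cauchy.
    """
    if len(thetas) < window + 1:
        return False
    recent = thetas[-(window + 1):]
    steps = [np.linalg.norm(b - a) for a, b in zip(recent[:-1], recent[1:])]
    return max(steps) < eps

# Toy iterates theta_t = theta* + 0.9^t * v shrinking toward a theta* unknown to the rule.
theta_star = np.array([1.0, 2.0])
thetas = [theta_star + 0.9**t * np.array([1.0, -1.0]) for t in range(200)]
for t in range(1, len(thetas) + 1):
    if stagnated(thetas[:t], eps=1e-4, window=5):
        print("stopping at iteration", t)
        break
```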
Boundedness
In pure mathematics, boundedness is a property that a sequence possesses as a necessary
consequence of convergence. In machine learning, we invert this logic: we enforce boundedness as a
mechanism to maintain numerical stability and prevent divergence.
Definition: Bounded
A subset \(S\) of a metric space \(X\) is called a bounded subset of \(X\) if and only if
\(S = \emptyset \) or \(S\) is included in some ball of \(X\). A metric space \(X\) is said to be
bounded if and only if it is a bounded subset of itself.
Definition: Diameter
Suppose \((X, d)\) is a metric space and \(S\) is a subset of \(X\).
The diameter of \(S\) is defined as
\[
\text{diam }(S) = \sup \{d(r, s) \mid r, s \in S\}.
\]
Theorem: Criteria for Boundedness
Suppose \(X\) is a nonempty metric space, \(z \in X\), and \(S \subseteq X\). The following statements
are logically equivalent:
- \(\text{diam }(S) < \infty\).
- There is a ball of \(X\) centered at \(z\) that includes \(S\).
- There is a ball of \(X\) that includes \(S\).
Theorem:
Suppose \(X\) is a metric space and \(\{x_n\}\) is a sequence in \(X\). If \(\{x_n\}\) is Cauchy, then
\(\{x_n\}\) is bounded in \(X\). In particular, since every convergent sequence is Cauchy, if \(\{x_n\}\)
converges in \(X\), then \(\{x_n\}\) is bounded in \(X\).
Proof Sketch:
If \(\{x_n\}\) is Cauchy, then for \(\epsilon = 1\), there exists \(N\) such that all terms
\(x_n\) with \(n \geq N\) lie within distance 1 of \(x_N\). The finitely many terms
\(x_1, \ldots, x_{N-1}\) are at finite distances from \(x_N\). Taking \(R = \max\{d(x_i, x_N) : i < N\} + 1\),
all terms lie in the ball of radius \(R\) centered at \(x_N\).
Important: The converse is false. Boundedness alone does not guarantee convergence.
Consider \(x_n = (-1)^n\) in \(\mathbb{R}\). The sequence is bounded (contained in \([-1, 1]\)) but
oscillates forever without converging.
Insight:
While enforcing a bound does not mathematically guarantee convergence (as the sequence could still oscillate),
it ensures the iterates remain within a bounded region. In the finite-dimensional parameter
spaces typical of machine learning, such a region - when combined with appropriate closure - forms a
compact set. Since every Cauchy (and hence every convergent) sequence is bounded, this "structural
containment" is a necessary condition for convergence, even though it cannot force convergence on its own.
-
Gradient Clipping:
By enforcing \(\|\nabla L\| \leq M\), we bound the size of each individual update step. This is a heuristic
countermeasure against the exploding gradient problem: it prevents any single step from blowing up, although
it does not by itself confine the whole trajectory to a bounded set. (A minimal sketch appears after this list.)
-
Weight Clipping (e.g., WGAN):
Forcing weights into a compact interval \([-c, c]\) ensures Lipschitz continuity.
This mathematical constraint is required to bound the slope of the function, stabilizing the dual form
of the Wasserstein distance.
-
Regularization (\(L^2\)):
Adding a penalty \(\|\theta\|^2\) makes the objective function coercive. It ensures
that the set of parameters with low loss remains bounded, effectively "trapping" the optimization process
in a region where a minimum is guaranteed to exist.
-
Trust Regions:
By restricting each step to a ball of radius \(\Delta\), we ensure that the Taylor approximation
(which is only locally valid) remains an accurate model of the loss surface.
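As promised above, here is a minimal sketch of gradient-norm clipping on a toy quadratic loss; the names (`clip_gradient`, the loss, the learning rate) are illustrative and not tied to any particular framework.
```python
import numpy as np

def clip_gradient(grad, max_norm):
    """Rescale grad so that its Euclidean norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

# Toy quadratic loss L(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([50.0, -80.0])   # deliberately large initial point -> large gradients
lr, max_norm = 0.1, 1.0

for step in range(5):
    grad = theta                           # gradient of the toy loss
    grad = clip_gradient(grad, max_norm)   # each update step now has norm <= lr * max_norm
    theta = theta - lr * grad
    print(step, np.round(theta, 3))
```
Note that clipping bounds each step to norm at most `lr * max_norm`; it does not, by itself, confine the whole trajectory to a bounded region.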
While boundedness ensures that our sequence doesn't escape to infinity, it doesn't
guarantee that the set is structurally "solid" enough to contain its own limit points.
To bridge this gap, we consider a property that connects distance to existence
in the most fundamental way.
Definition: Nearest-Point Property
Suppose \(X\) is a metric space. We say that \(X\) has the nearest-point property
if and only if \(X = \emptyset\) or \(X\) admits a nearest point to each point in every metric superspace of
\(X\).
This property is logically equivalent to the assertion that
every bounded sequence in \(X\) has a subsequence that converges in \(X\).
It ensures that the space has no "holes" for bounded sequences to fall into.
Insight: Structural Guarantee in Optimization
In optimization, simply knowing your search space is limited (bounded) is only half the battle.
The nearest-point property ensures that even if you can't reach an "ideal" point
outside your set, you can always find a best approximation (a projection) within it.
For instance, in Constrained SGD or SVM, we rely on this structural
"solidness" to ensure that our algorithms always have a valid point to converge to. We will
explore how this combines with boundedness to form the broader concept of compactness
in a later chapter.
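The following sketch (our own toy example) shows the projection idea in action: a projected gradient step onto a closed box, where `np.clip` returns the nearest point of the box and the iterates converge to the best feasible approximation of an infeasible target.
```python
import numpy as np

def project_onto_box(theta, low, high):
    """Nearest point of the closed box [low, high]^d to theta (coordinate-wise clipping)."""
    return np.clip(theta, low, high)

# Toy objective: L(theta) = ||theta - target||^2 with the unconstrained minimizer outside the box.
target = np.array([3.0, -4.0])
low, high = -1.0, 1.0

theta = np.zeros(2)
lr = 0.1
for _ in range(100):
    grad = 2.0 * (theta - target)
    theta = project_onto_box(theta - lr * grad, low, high)

print(theta)  # settles at the nearest feasible point [1., -1.], the projection of the target
```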
As we move toward more complex spaces, especially when dealing with infinite-dimensional spaces,
we need a refined version of boundedness that accounts for the "density" of the set, leading us to
the concept of total boundedness.
Definition: Total Boundedness
A subset \(S\) of a metric space \(X\) is called a totally bounded subset of \(X\)
if and only if for each \(r \in \mathbb{R}^+\), there is a finite collection of balls of \(X\) of
radius \(r\) that covers \(S\). The metric space \(X\) is said to be totally bounded
if and only if it is a totally bounded subset of itself.
Insight: Total Boundedness in Function Spaces
In the space of neural network functions, total boundedness directly relates to
covering numbers - the minimum number of balls of radius \(\epsilon\)
needed to cover a function class.
-
Capacity Control:
In statistical learning theory, the logarithm of the covering number (metric entropy)
measures the complexity of a model.
-
Generalization:
A totally bounded function class ensures that we can approximate any function in the
class with a finite set of "representative" networks. Smaller covering numbers imply a
less complex hypothesis space, which typically leads to tighter generalization bounds.
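As a toy illustration of covering numbers in the simplest possible space, the sketch below counts how many intervals of radius \(r\) are needed to cover \([0, 1]\) and prints the corresponding metric entropy; the helper name is ours.
```python
import math

def covering_number_interval(r):
    """Number of closed balls (intervals) of radius r needed to cover [0, 1]:
    place centers at r, 3r, 5r, ... so that consecutive intervals just touch."""
    return math.ceil(1.0 / (2.0 * r))

for r in [0.5, 0.1, 0.01, 0.001]:
    N = covering_number_interval(r)
    print(f"r = {r}: N(r) = {N}, metric entropy log N(r) = {math.log(N):.2f}")
```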
Completeness Revisited
In the previous chapter, we defined completeness as follows:
Definition: Completeness
A metric space \(X\) is said to be complete if and only if
\(X\) is closed in every metric superspace of \(X\).
This formulation prioritizes the structural integrity of the metric space itself - the space has
"no holes" that could be exposed by embedding it in a larger space. Now that we have developed
the Cauchy criterion, we can state an equivalent characterization that is often more useful in practice.
Definition: Completeness
A metric space \(X\) is said to be complete if and only if
every Cauchy sequence in \(X\) converges in \(X\).
The equivalence of these definitions is a fundamental theorem we will prove in later chapters.
For now, the sequential characterization gives us an operational test: to verify completeness,
we must show that any sequence whose terms get arbitrarily close together eventually reaches
a limit within the space.
Example:
\(\mathbb{R}^n\) (with the Euclidean metric) is complete. This is the Cauchy completeness axiom
of the real numbers, extended to finite-dimensional spaces. However, \(\mathbb{Q}^n\) (rationals)
is not complete - a Cauchy sequence of rationals can have an irrational limit in \(\mathbb{R}^n\) and therefore
fail to converge within \(\mathbb{Q}^n\).
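The classic witness for the incompleteness of \(\mathbb{Q}\) can be computed directly. The sketch below runs the Babylonian (Newton) iteration \(x_{k+1} = \tfrac{1}{2}(x_k + 2/x_k)\) in exact rational arithmetic: every term is rational and consecutive terms get arbitrarily close, yet the limit, \(\sqrt{2}\), lies outside \(\mathbb{Q}\).
```python
from fractions import Fraction

# Babylonian (Newton) iteration for sqrt(2): every iterate is an exact rational number.
x = Fraction(1)
for _ in range(6):
    x = (x + Fraction(2) / x) / 2
    print(x, float(x))

# Consecutive terms get arbitrarily close (the sequence is Cauchy in Q),
# but the limit, sqrt(2), is irrational -- it exists in R, not in Q.
```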
Insight:
Completeness is the "floor" of our mathematical world; it ensures that if an algorithm suggests a limit exists
(via the Cauchy property), that limit is a valid point within our space. In machine learning, we rely on the following
complete spaces:
-
\(\mathbb{R}^d\) (Parameter Space):
As a finite-dimensional Euclidean space, \(\mathbb{R}^d\) is complete. While this does not guarantee
that SGD will converge (which depends on learning rates and loss geometry), it ensures that any
Cauchy sequence of weights \(\{\theta_t\}\) has a unique, reachable limit \(\theta^* \in \mathbb{R}^d\).
-
Hilbert Spaces (e.g., RKHS):
By definition, a Hilbert space is a complete inner product space. The convergence of kernel-based algorithms
depends on the fact that the optimization occurs in a space where "holes" do not exist.
-
\(L^p\) Spaces (Function Spaces):
The Riesz-Fischer Theorem proves that \(L^p\) spaces are complete. This is fundamental for
universal approximation; it allows us to define the "target function" as the
limit of a sequence of neural networks under an integral norm.
-
Probability Distributions (Wasserstein Space):
When training generative models (e.g., GANs or Diffusion Models), we optimize within a space of probability distributions.
Crucially, the completeness of the Wasserstein space is inherited from the underlying base space (typically the complete Euclidean space \(\mathbb{R}^n\)).
This structural guarantee ensures that a Cauchy sequence of distributions will always converge to a valid, well-defined distribution, rather than "escaping" the space of measures.