Introduction
In constrained optimization, we intuitively expect to find an optimal solution. Having explored the fundamental mathematical
structures in previous chapters, we can now articulate exactly why this intuition holds. This chapter brings us to a pivotal
milestone in our journey: Compactness.
While the Extreme Value Theorem guarantees that a continuous function on a closed and bounded subset of \(\mathbb{R}^n\) attains
its extrema, this is merely a special case of a deeper truth. In general metric spaces, compactness captures the essence of being
"topologically finite." It ensures that sequences cannot wander indefinitely without accumulating, and that local properties can
be stitched together to control the global behavior of a system with finite resources.
Open Covers and Compactness
While completeness focuses on the "internal" stability of sequences, compactness
addresses the "external" finiteness of the space. The definition via open covers is the most robust because it
relies purely on topological structure rather than specific distance values. This allows us to analyze global properties by
"stitching together" local coordinate patches - a technique fundamental to Manifold Theory.
Definition: Open Cover and Subcover
Suppose \(X\) is a metric space and \(S\) is a subset of \(X\). A collection \(\mathcal{C}\) of open
subsets of \(X\) is called an open cover for \(S\) in \(X\) if and only if
\[
S \subseteq \bigcup \mathcal{C}.
\]
If \(\mathcal{C}\) is an open cover for \(S\), then any subset \(\mathcal{A}\) of \(\mathcal{C}\)
that also covers \(S\) is called an open subcover of \(\mathcal{C}\) for \(S\) in
\(X\).
Definition: Finite Intersection Property
Suppose \(X\) is a set and \(\mathcal{S}\) is a collection of subsets of \(X\). We say that \(\mathcal{S}\)
has the finite intersection property if and only if for every non-empty finite subset
\(\mathcal{F}\) of \(\mathcal{S}\), we have
\[
\bigcap \mathcal{F} \neq \emptyset.
\]
With these tools, we can now formally define compactness. The definition takes the "open cover" formulation
as primary, following the standard topological convention.
Definition: Compact Metric Space
A metric space \(X\) is called compact if and only if every open cover of \(X\)
has a finite subcover. A subset \(K \subseteq X\) is called compact if \(K\), viewed as a metric
subspace with the restricted metric, is a compact metric space. Equivalently, \(K\) is compact
iff every collection of open subsets of \(X\) whose union contains \(K\) admits a finite
subcollection still covering \(K\) — this ambient characterization is the form most
often used in proofs.
Compactness is a multifaceted property. In metric spaces, several distinct conditions converge to describe the
same underlying structure of "finiteness within infinity." Understanding these equivalences is essential for
moving toward advanced fields like Manifold Theory and Information Geometry.
Theorem: Equivalence of Compactness
Suppose \(X\) is a non-empty metric space. The following statements are equivalent, and \(X\) is compact
if and only if it satisfies any one of the following:
-
Open Cover Criterion (Topological Definition)
Every open cover for \(X\) has a finite subcover.
Mathematical Significance: The foundational definition. It ensures that global properties can
be derived from finite local data — the mechanism behind partition of unity
constructions in Manifold Theory and Differential Geometry, where compactness lets us stitch
local charts into global objects.
-
Ball Cover Criterion
Every cover of \(X\) by open balls has a finite subcover.
Mathematical Significance: A specialization for metric spaces: it suffices to check finite
subcovers for ball covers rather than arbitrary open covers. This simplification is the starting
point for many classical proofs (the Lebesgue number lemma, total boundedness arguments).
-
Finite Intersection Criterion (Duality)
Every collection of closed subsets with the finite intersection property has a non-empty intersection.
Mathematical Significance: The De Morgan dual of the open cover definition. It provides the
logic for existence proofs: if a family of closed feasible sets has every finite
subfamily with common points, then the entire family shares a common point. In constrained
optimization, this justifies taking limits of intersecting constraint sets.
-
Cantor's Criterion (Nested Intervals)
Every nest of non-empty closed subsets of \(X\) has a non-empty intersection.
Mathematical Significance: Any downward-filtered family of non-empty closed sets has non-empty
intersection. Unlike the Cantor Intersection Theorem for complete spaces (which requires diameters
\(\to 0\)), compactness removes this size condition — a strictly stronger guarantee. This underlies
branch-and-bound and interval methods: as the search tree refines,
nested candidate regions cannot collapse to emptiness.
-
Cantor's Sequence Criterion
Every sequence \(\{F_n\}\) of nested non-empty closed subsets has a non-empty intersection.
Mathematical Significance: A countable version of (4): any nested sequence
\(F_1 \supseteq F_2 \supseteq \cdots\) of non-empty closed sets has non-empty intersection.
Useful whenever candidate regions are indexed by iteration count — e.g., proving that the
intersection of feasible sets in successive relaxation hierarchies remains non-empty.
-
Convergence Criterion (Sequential Compactness)
Every sequence in \(X\) has a subsequence that converges in \(X\).
Mathematical Significance: Every sequence has a convergent subsequence inside \(X\). This does
not by itself imply that an algorithm's iterates converge, but it guarantees the existence of
accumulation points within the feasible set. In non-convex optimization, such
accumulation points are identified with stationary/KKT points by separate arguments, making
sequential compactness the topological backbone of Bolzano-Weierstrass-type convergence
analyses.
-
Spatial Criterion (Structural Decomposition)
\(X\) is complete and totally bounded.
Mathematical Significance: Decouples compactness into "no holes" (completeness) and "finite size"
(total boundedness). Total boundedness is exactly the existence of finite \(\varepsilon\)-nets
for every \(\varepsilon > 0\), which powers covering-number bounds in statistical
learning theory (Rademacher complexity, PAC bounds) and Arzelà-Ascoli-type
precompactness arguments in Functional Analysis.
-
BBW Criterion (Nearest-point Property)
\(X\) is bounded and has the nearest-point property.
Mathematical Significance: A compact set, viewed as a subset of any ambient metric space, admits
a nearest point from every external point. This is the foundation of projection operators
onto compact sets and of Vector Quantization (k-means-type algorithms rely on a
nearest-centroid assignment always being well-defined).
-
Cantor Set Criterion
\(X\) is empty or a continuous image of the Cantor set.
Mathematical Significance: The Hausdorff-Alexandroff Theorem. Reveals that every
compact metric space, no matter how exotic, admits a sampling from a universal fractal template —
a perspective used in descriptive set theory, in constructing space-filling curves,
and in surjective encodings for symbolic dynamics.
-
Attained Bound Criterion (Operational View)
Every real continuous function defined on \(X\) is bounded and attains its bounds.
Mathematical Significance: A functional characterization: compactness is exactly what makes every
continuous real-valued function attain its supremum and infimum. This is the Extreme Value
Theorem in disguise — the single most invoked consequence of compactness in analysis and
optimization, guaranteeing that minimizers of continuous objectives on compact feasible sets always exist.
Proof Sketch (Spine):
A full proof traversing all ten criteria requires a web of implications. Rather than writing every
one, we establish the central bridge (1) ⇒ (6) in detail — it exposes the deep
link between the covering and sequential viewpoints — and record the remaining connections as
short structural remarks organized by cluster.
Core implication: (1) ⇒ (6). Suppose every open cover of \(X\) admits a finite
subcover. Let \(\{x_n\}\) be a sequence in \(X\). Assume for contradiction that \(\{x_n\}\) has
no convergent subsequence. Then for every \(p \in X\) there exists \(r_p > 0\) such that the
ball \(B(p, r_p)\) contains \(x_n\) for only finitely many indices \(n\): otherwise, the balls
of radius \(1/k\) around \(p\) would each contain infinitely many terms, and we could inductively
select indices \(n_1 < n_2 < \cdots\) with \(x_{n_k} \in B(p, 1/k)\), producing a subsequence
converging to \(p\) — contradicting our assumption. The family
\(\{B(p, r_p) : p \in X\}\) is an open cover of \(X\), so by (1) a finite subcollection
\(B(p_1, r_{p_1}), \ldots, B(p_k, r_{p_k})\) covers \(X\). Then
\[
\mathbb{N} \;=\; \bigcup_{i=1}^{k} \{\, n : x_n \in B(p_i, r_{p_i}) \,\},
\]
a finite union of finite sets — but \(\mathbb{N}\) is infinite, a contradiction. Hence a
convergent subsequence exists.
Sequential ↔ covering cluster: (6) ⇔ (7) ⇔ (1).
For (6) ⇒ (7): a Cauchy sequence with a convergent subsequence converges to the same
limit, giving completeness; and if \(X\) were not totally bounded, some \(\varepsilon > 0\)
would allow constructing a sequence \(\{y_n\}\) with pairwise distances \(\ge \varepsilon\),
which admits no convergent subsequence — any convergent subsequence is Cauchy, forcing
\(d(y_{n_k}, y_{n_l}) < \varepsilon\) for sufficiently large \(k \ne l\), contradicting the
pairwise \(\ge \varepsilon\) condition — and this contradicts (6).
For (7) ⇒ (6): given any sequence, total boundedness provides, at each scale \(1/k\),
a ball of radius \(1/k\) containing infinitely many terms; iterating this inside successive
subsequences and extracting the diagonal produces a subsequence in which any two terms beyond
index \(k\) lie in a common ball of radius \(1/k\), hence within distance \(2/k\) of each other
— forming a Cauchy subsequence, and completeness delivers its limit.
For (6) ⇒ (1): sequentially compact metric spaces admit Lebesgue numbers
(every open cover has a \(\delta > 0\) such that every set of diameter \(< \delta\) lies in
some cover element); combined with a finite \(\delta/3\)-net from total boundedness (each such
ball has diameter \(\le 2\delta/3 < \delta\)), this yields a finite subcover.
Covering-style criteria: (1) ⇔ (2) ⇔ (3).
(1) ⇒ (2) because ball covers are a special case of open covers, so the finite-subcover
property transfers directly. (2) ⇒ (1): given any open cover \(\mathcal{U}\), for each
\(x \in X\) pick \(U_x \in \mathcal{U}\) with \(x \in U_x\) and a ball \(B(x, r_x) \subseteq U_x\);
the family \(\{B(x, r_x)\}\) is a ball cover, so (2) yields a finite subcollection
\(B(x_1, r_{x_1}), \ldots, B(x_k, r_{x_k})\) covering \(X\), and the corresponding
\(\{U_{x_1}, \ldots, U_{x_k}\}\) is a finite subcover of \(\mathcal{U}\).
(1) ⇔ (3) is De Morgan duality: a family of open sets covers \(X\) iff the complementary
family of closed sets has empty intersection, and a finite subcover corresponds to a finite
subfamily with empty intersection — which is exactly the negation of the finite intersection
property.
Nested-set criteria: (3) ⇒ (4) ⇒ (5).
(3) ⇒ (4) ⇒ (5) by specialization: a nest (totally ordered family under \(\subseteq\))
of non-empty closed sets has the finite intersection property (any finite subfamily has a
minimum element among its members, which is non-empty), and any nested sequence
\(F_1 \supseteq F_2 \supseteq \cdots\) is a countable nest. The converse directions restoring
(3) from (4) or (5) are subtle — they rely either on Zorn's lemma or on an indirect route.
The cleanest closure is sequential: given a sequence \(\{x_n\}\), the nested closed sets
\(F_n = \overline{\{x_k : k \ge n\}}\) have non-empty intersection by any of (3), (4), (5), and
any point in \(\bigcap_n F_n\) is an accumulation point of \(\{x_n\}\), yielding a convergent
subsequence. Combined with (6) ⇒ (1) above, this closes the equivalence loop.
Geometric and functional characterizations: (7) ⇔ (8), (1) ⇔ (10), (9).
(1) ⇒ (8): compactness implies boundedness — the cover
\(\{B(x_0, n) : n \in \mathbb{N}\}\) admits a finite subcover, placing \(X\) inside a single
ball. For the nearest-point property: for any superspace \(Y \supseteq X\) and any \(y \in Y\),
the distance function \(d(\cdot, y): X \to \mathbb{R}\) is continuous (Lipschitz with constant 1),
and the Extreme Value Theorem — proved below as a consequence of (1) — attains its minimum on
the compact \(X\), yielding a nearest point. The converse (8) ⇒ (7) is a classical
equivalence in metric-space theory: bounded sets with the nearest-point property possess enough
structure to force completeness and the existence of finite \(\varepsilon\)-nets. This direction
is non-trivial and we do not reproduce it in this spine.
(1) ⇒ (10) is the Extreme Value Theorem, proved below. (10) ⇒ (1): from an
open cover without a finite subcover, one constructs a continuous real-valued function on \(X\)
that is unbounded (or attains no supremum), contradicting (10); the construction uses a
Urysohn-type interpolation along a nested sequence of non-empty closed sets with empty
intersection, which is non-trivial but standard.
(9): the Hausdorff-Alexandroff Theorem — every non-empty compact
metric space is a continuous image of the Cantor set — is a classical result in descriptive set
theory whose full proof uses the separability of compact metric spaces and a binary encoding of
a countable base. The reverse direction (continuous images of the Cantor set are compact)
follows from the Cantor set being compact and from continuous images of compact sets being
compact, established separately below.
While all ten criteria are logically equivalent in metric spaces, they offer different practical advantages
depending on your objective:
-
(1) & (6) Theoretical Foundations:
Open Covers (1) and Sequential Convergence (6) are the universal languages of topology and analysis. Use these
for fundamental proofs of continuity and for transitioning between local and global properties.
-
(4) & (5) Algorithmic Assurance:
Cantor's Criteria power branch-and-bound and interval methods:
as a search tree or relaxation hierarchy refines, the sequence of nested candidate regions cannot
collapse to emptiness, guaranteeing that the limit of narrowing still contains a solution.
-
(7) & (8) Approximation & Search:
Total Boundedness (7) ensures a space can be represented by a finite "\(\epsilon\)-net," while the Nearest-point property (8)
guarantees that a "best approximation" actually exists within the set—crucial for Vector Quantization
and Signal Processing.
-
(10) Optimization & Global Extrema:
The Attained Bound property is the engine behind the Extreme Value Theorem. It is the core reason why we
can guarantee the existence of global minima in constrained optimization.
-
(2), (3) & (9) Analytical Tools:
Ball Covers (2) simplify proofs in metric-specific contexts, Duality (3) allows us to switch between open and closed set logic,
and the Cantor Set image (9) reveals the underlying structural density common to all compact metric spaces.
Among these, Sequential Compactness (6) is particularly operational.
In the context of the Bolzano-Weierstrass Theorem, it ensures that even if we cannot
find the exact global minimum in finite steps, any bounded sequence of parameters generated
by our algorithm contains a subsequence that converges to a limit within the feasible set.
Intuitively, if a set is not compact, it has "holes" (missing limit points) or "ends" that stretch
to infinity, allowing us to define a cover that requires infinitely many sets to handle the
boundary or the unbounded tail. Compactness asserts that no matter how small our
open sets are, we only need a finite number of them to capture the entire space.
While "holes" often refer to a lack of completeness, compactness is a much stronger constraint.
Completeness ensures that sequences don't "fall through" internal gaps (e.g., the limit of \(1, 1.4, 1.41...\)
exists in \(\mathbb{R}\) but not in \(\mathbb{Q}\)). However, compactness prevents sequences from "escaping"
to infinity as well.
This difference is captured in Cantor's Intersection Theorem.
In a complete space, nested sets must shrink in size \(\text{diam}(A_n) \to 0\) to guarantee a limit point. In a
compact space, the nested intersection of non-empty closed sets is
always non-empty, regardless of their size. Compactness leaves no room for any sequence
to drift away or vanish, providing the ultimate structural guarantee for stability.
The Heine-Borel Theorem
While the open cover definition is powerful, checking "every" open cover is difficult. Fortunately,
in the familiar Euclidean space \(\mathbb{R}^n\), compactness has a simple geometric characterization.
Theorem: Heine-Borel
A subset \(K \subseteq \mathbb{R}^n\) is compact if and only if it is:
- Closed: It contains all its limit points.
- Bounded: It can be contained within a ball of finite radius.
Proof Sketch:
(\(\Rightarrow\)) Compactness implies closed and bounded in any metric space.
Boundedness: the cover \(\{B(x, 1) : x \in K\}\) admits a finite subcover
\(B(x_1, 1), \ldots, B(x_m, 1)\); since the finite set \(\{x_1, \ldots, x_m\}\) has bounded
diameter \(M := \max_{i,j} d(x_i, x_j)\), any \(y \in K\) lies in some \(B(x_i, 1)\) and
satisfies \(d(y, x_1) \le d(y, x_i) + d(x_i, x_1) < 1 + M\), placing \(K\) inside
\(B(x_1, 1 + M)\).
Closedness: if \(p\) is a limit point of \(K\) not in \(K\), the open sets
\(\{X \setminus \overline{B}(p, 1/n) : n \in \mathbb{N}\}\) cover \(K\) (every \(x \in K\) has
positive distance to \(p\), so \(1/n < d(x, p)\) for some \(n\)), but no finite subcover
exists — any finite subfamily would leave \(K \cap \overline{B}(p, 1/\max n_i)\) uncovered,
yet every neighborhood of \(p\) meets \(K\) by the limit-point assumption. This contradicts
compactness of \(K\), so \(p \in K\).
(\(\Leftarrow\)) This direction is the distinctively Euclidean content. By the equivalence
above, it suffices to show closed-and-bounded subsets of \(\mathbb{R}^n\) are complete and
totally bounded (criterion (7)).
Completeness: \(\mathbb{R}^n\) is complete, and a closed subset of a complete
metric space is complete
(see complete subset theorem).
Total boundedness: a bounded subset \(K \subseteq \mathbb{R}^n\) sits inside
some cube \([-R, R]^n\); given \(\varepsilon > 0\), partition this cube into finitely many
sub-cubes of side length \(\varepsilon / (2\sqrt{n})\), each having diameter
\(\varepsilon / 2 < \varepsilon\). The centres of these sub-cubes form a finite
\(\varepsilon\)-net for \(K\).
It is a common pitfall to assume that all "closed and bounded" sets are compact. While the Heine-Borel Theorem
guarantees this in Euclidean space \(\mathbb{R}^n\), it does NOT hold in infinite-dimensional spaces, such as
function spaces used in Kernel Methods or Deep Learning theory.
In an infinite-dimensional Hilbert space, the unit ball is closed and bounded, but not compact
(Riesz's lemma).
This is why optimization in infinite-dimensional settings requires additional structure — e.g.,
compactness-preserving regularization (L2 or nuclear-norm constraints that restrict
the search to a compact subset), or reduction to finite dimension via the
representer theorem in reproducing kernel Hilbert spaces (RKHS) — to recover the existence
guarantees that Heine-Borel freely provides in Euclidean space.
CS Insight: Feasible Regions
In optimization, constraints like \( \|w\|_2 \le C \) (L2 regularization)
create a closed and bounded feasible set in \(\mathbb{R}^n\). By the Heine-Borel theorem, this set is compact. This geometry
is what theoretically guarantees that regularization penalties prevent weights from exploding and stabilize the learning process.
Continuous Functions on Compact Sets
The power of compactness becomes fully evident when we examine continuous functions. It guarantees
not only that solutions exist, but also that the function behaves predictably
across the entire domain.
Theorem: Preservation of Compactness
If \(f: X \to Y\) is a continuous function between metric spaces and \(K \subseteq X\) is compact,
then the image \(f(K)\) is compact in \(Y\).
Proof:
Let \(\mathcal{C} = \{V_i\}_{i \in I}\) be any open cover of \(f(K)\) in \(Y\). By continuity of \(f\),
each preimage \(f^{-1}(V_i)\) is open in \(X\), and since \(\mathcal{C}\) covers \(f(K)\), the family
\(\{f^{-1}(V_i)\}_{i \in I}\) covers \(K\). Compactness of \(K\) yields a finite subcollection
\(f^{-1}(V_{i_1}), \ldots, f^{-1}(V_{i_n})\) covering \(K\); applying \(f\) gives
\(V_{i_1}, \ldots, V_{i_n}\) covering \(f(K)\). Thus every open cover of \(f(K)\) has a finite subcover,
so \(f(K)\) is compact. \(\square\)
Since a compact subset of \(\mathbb{R}\) must be closed and bounded (Heine-Borel), the image of a
continuous real-valued function on a compact set must have a maximum and a minimum element.
Theorem: Extreme Value Theorem (EVT)
Let \(K\) be a non-empty compact metric space and \(f: K \to \mathbb{R}\) be a continuous function.
Then \(f\) is bounded and attains its maximum and minimum values. That is, there exist
\(x_{\min}, x_{\max} \in K\) such that for all \(x \in K\):
\[
f(x_{\min}) \le f(x) \le f(x_{\max}).
\]
Proof:
By the Preservation of Compactness theorem,
\(f(K) \subseteq \mathbb{R}\) is compact. By the
Heine-Borel theorem, \(f(K)\) is closed
and bounded in \(\mathbb{R}\). Since \(K \ne \emptyset\) we have \(f(K) \ne \emptyset\), so boundedness
of \(f(K)\) means \(\sup f(K)\) and \(\inf f(K)\) exist and are finite.
Closedness of \(f(K)\) means these supremum and infimum belong to \(f(K)\) itself. Indeed, if
\(\sup f(K) \notin f(K)\), then by the defining property of the supremum every
\(\varepsilon\)-neighborhood contains some \(y \in f(K)\) with \(y > \sup f(K) - \varepsilon\),
giving \(\text{dist}(\sup f(K), f(K)) = 0\); combined with \(\sup f(K) \in f(K)^c\) (which gives
\(\text{dist}(\sup f(K), f(K)^c) = 0\) trivially), this places \(\sup f(K)\) in
\(\partial f(K) \subseteq \overline{f(K)} = f(K)\), a contradiction. Hence
\(\sup f(K) \in f(K)\), and analogously \(\inf f(K) \in f(K)\).
Therefore there exist \(x_{\max}, x_{\min} \in K\) with \(f(x_{\max}) = \sup f(K)\) and
\(f(x_{\min}) = \inf f(K)\), and for every \(x \in K\) we have \(f(x) \in f(K)\),
so \(f(x_{\min}) \le f(x) \le f(x_{\max})\). \(\square\)
Insight: Existence of Optima
The EVT is the bedrock of loss minimization. If our parameter space is compact (e.g., via regularization constraints)
and our loss function is continuous, we are mathematically guaranteed that a global minimum exists.
Without compactness (e.g., on an open interval or unbounded set), the function might only approach an
infimum asymptotically (like \(f(x) = 1/x\) on \((1, \infty)\)), leaving us chasing a solution that isn't there.
Beyond existence of optima, compactness forces a uniform modulus of continuity across the
entire domain (proved in the Heine-Cantor theorem below). On a non-compact space, a continuous
function can vary arbitrarily fast near a missing boundary point or at infinity — the function
\(f(x) = 1/x\) near \(x = 0\) illustrates how continuity alone gives no uniform control.
Compactness eliminates this pathology.
Proof Sketch:
Fix \(\varepsilon > 0\). By continuity, for each \(x \in K\) there exists \(\delta_x > 0\) such that
\(f(B(x; \delta_x)) \subseteq B(f(x); \varepsilon/2)\). The family
\(\{B(x; \delta_x / 2) : x \in K\}\) — note the half-size radii — is an open cover of \(K\), so by
compactness it admits a finite subcover \(\{B(x_1; \delta_{x_1}/2), \ldots, B(x_n; \delta_{x_n}/2)\}\).
Set \(\delta = \min_{i} \delta_{x_i} / 2 > 0\). This choice ensures
\(\delta + \delta_{x_i}/2 \le \delta_{x_i}\) for every \(i\), so that any point within distance
\(\delta\) of a half-size ball lies inside the corresponding full-size ball where \(f\) is controlled.
Concretely: for any \(x, y \in K\) with \(d(x, y) < \delta\),
\(x\) lies in some \(B(x_i; \delta_{x_i}/2)\), so \(d(x, x_i) < \delta_{x_i}/2\); then
\(d(y, x_i) \le d(y, x) + d(x, x_i) < \delta + \delta_{x_i}/2 \le \delta_{x_i}\), placing \(y\) in
\(B(x_i; \delta_{x_i})\). Both \(f(x)\) and \(f(y)\) therefore lie in \(B(f(x_i); \varepsilon/2)\), so
\(d(f(x), f(y)) < \varepsilon\). This \(\delta\) depends only on \(\varepsilon\), proving uniform continuity. \(\square\)
Insight: Optimization Stability via Uniform Continuity
While the EVT guarantees that a solution exists, the Heine-Cantor Theorem gives our
objective a uniform behaviour: on a compact domain, the modulus of continuity
\(\omega_f(\delta) = \sup\{|f(x) - f(y)| : d(x, y) < \delta\}\) tends to zero as
\(\delta \to 0\), with a rate that does not vary across the domain.
This does not by itself bound the gradient — uniform continuity is strictly weaker than
Lipschitz continuity (\(f(x) = \sqrt{x}\) on \([0, 1]\) is uniformly continuous but not
Lipschitz near \(x = 0\)). To obtain a bounded gradient one additionally applies the Extreme
Value Theorem to \(\|\nabla f\|\) under a \(C^1\) assumption. Together, these two pieces —
uniform modulus plus bounded gradient — provide the stability guarantees that let engineers
choose learning rates \(\eta\) safely across the entire compact search space.