Compactness

Introduction Open Covers and Compactness Heine-Borel Theorem Continuous Functions on Compact Sets

Introduction

In constrained optimization, we intuitively expect to find an optimal solution. Having explored the fundamental mathematical structures in previous chapters, we can now articulate exactly why this intuition holds. This chapter brings us to a pivotal milestone in our journey: Compactness.

While the Extreme Value Theorem guarantees that a continuous function on a closed and bounded subset of \(\mathbb{R}^n\) attains its extrema, this is merely a special case of a deeper truth. In general metric spaces, compactness captures the essence of being "topologically finite." It ensures that sequences cannot wander indefinitely without accumulating, and that local properties can be stitched together to control the global behavior of a system with finite resources.

Open Covers and Compactness

While completeness focuses on the "internal" stability of sequences, compactness addresses the "external" finiteness of the space. The definition via open covers is the most robust because it relies purely on topological structure rather than specific distance values. This allows us to analyze global properties by "stitching together" local coordinate patches - a technique fundamental to Manifold Theory.

Definition: Open Cover and Subcover

Suppose \(X\) is a metric space and \(S\) is a subset of \(X\). A collection \(\mathcal{C}\) of open subsets of \(X\) is called an open cover for \(S\) in \(X\) if and only if \[ S \subseteq \bigcup \mathcal{C}. \] If \(\mathcal{C}\) is an open cover for \(S\), then any subset \(\mathcal{A}\) of \(\mathcal{C}\) that also covers \(S\) is called an open subcover of \(\mathcal{C}\) for \(S\) in \(X\).

Definition: Finite Intersection Property

Suppose \(X\) is a set and \(\mathcal{S}\) is a collection of subsets of \(X\). We say that \(\mathcal{S}\) has the finite intersection property if and only if for every non-empty finite subset \(\mathcal{F}\) of \(\mathcal{S}\), we have \[ \bigcap \mathcal{F} \neq \emptyset. \]

With these tools, we can now formally define compactness. The definition takes the "open cover" formulation as primary, following the standard topological convention.

Definition: Compact Metric Space

A metric space \(X\) is called compact if and only if every open cover of \(X\) has a finite subcover. A subset \(K \subseteq X\) is called compact if \(K\), viewed as a metric subspace with the restricted metric, is a compact metric space. Equivalently, \(K\) is compact iff every collection of open subsets of \(X\) whose union contains \(K\) admits a finite subcollection still covering \(K\) — this ambient characterization is the form most often used in proofs.

Compactness is a multifaceted property. In metric spaces, several distinct conditions converge to describe the same underlying structure of "finiteness within infinity." Understanding these equivalences is essential for moving toward advanced fields like Manifold Theory and Information Geometry.

Theorem: Equivalence of Compactness

Suppose \(X\) is a non-empty metric space. The following statements are equivalent, and \(X\) is compact if and only if it satisfies any one of the following:

  1. Open Cover Criterion (Topological Definition)
    Every open cover for \(X\) has a finite subcover.
    Mathematical Significance: The foundational definition. It ensures that global properties can be derived from finite local data — the mechanism behind partition of unity constructions in Manifold Theory and Differential Geometry, where compactness lets us stitch local charts into global objects.
  2. Ball Cover Criterion
    Every cover of \(X\) by open balls has a finite subcover.
    Mathematical Significance: A specialization for metric spaces: it suffices to check finite subcovers for ball covers rather than arbitrary open covers. This simplification is the starting point for many classical proofs (the Lebesgue number lemma, total boundedness arguments).
  3. Finite Intersection Criterion (Duality)
    Every collection of closed subsets with the finite intersection property has a non-empty intersection.
    Mathematical Significance: The De Morgan dual of the open cover definition. It provides the logic for existence proofs: if a family of closed feasible sets has every finite subfamily with common points, then the entire family shares a common point. In constrained optimization, this justifies taking limits of intersecting constraint sets.
  4. Cantor's Criterion (Nested Intervals)
    Every nest of non-empty closed subsets of \(X\) has a non-empty intersection.
    Mathematical Significance: Any downward-filtered family of non-empty closed sets has non-empty intersection. Unlike the Cantor Intersection Theorem for complete spaces (which requires diameters \(\to 0\)), compactness removes this size condition — a strictly stronger guarantee. This underlies branch-and-bound and interval methods: as the search tree refines, nested candidate regions cannot collapse to emptiness.
  5. Cantor's Sequence Criterion
    Every sequence \(\{F_n\}\) of nested non-empty closed subsets has a non-empty intersection.
    Mathematical Significance: A countable version of (4): any nested sequence \(F_1 \supseteq F_2 \supseteq \cdots\) of non-empty closed sets has non-empty intersection. Useful whenever candidate regions are indexed by iteration count — e.g., proving that the intersection of feasible sets in successive relaxation hierarchies remains non-empty.
  6. Convergence Criterion (Sequential Compactness)
    Every sequence in \(X\) has a subsequence that converges in \(X\).
    Mathematical Significance: Every sequence has a convergent subsequence inside \(X\). This does not by itself imply that an algorithm's iterates converge, but it guarantees the existence of accumulation points within the feasible set. In non-convex optimization, such accumulation points are identified with stationary/KKT points by separate arguments, making sequential compactness the topological backbone of Bolzano-Weierstrass-type convergence analyses.
  7. Spatial Criterion (Structural Decomposition)
    \(X\) is complete and totally bounded.
    Mathematical Significance: Decouples compactness into "no holes" (completeness) and "finite size" (total boundedness). Total boundedness is exactly the existence of finite \(\varepsilon\)-nets for every \(\varepsilon > 0\), which powers covering-number bounds in statistical learning theory (Rademacher complexity, PAC bounds) and Arzelà-Ascoli-type precompactness arguments in Functional Analysis.
  8. BBW Criterion (Nearest-point Property)
    \(X\) is bounded and has the nearest-point property.
    Mathematical Significance: A compact set, viewed as a subset of any ambient metric space, admits a nearest point from every external point. This is the foundation of projection operators onto compact sets and of Vector Quantization (k-means-type algorithms rely on a nearest-centroid assignment always being well-defined).
  9. Cantor Set Criterion
    \(X\) is empty or a continuous image of the Cantor set.
    Mathematical Significance: The Hausdorff-Alexandroff Theorem. Reveals that every compact metric space, no matter how exotic, admits a sampling from a universal fractal template — a perspective used in descriptive set theory, in constructing space-filling curves, and in surjective encodings for symbolic dynamics.
  10. Attained Bound Criterion (Operational View)
    Every real continuous function defined on \(X\) is bounded and attains its bounds.
    Mathematical Significance: A functional characterization: compactness is exactly what makes every continuous real-valued function attain its supremum and infimum. This is the Extreme Value Theorem in disguise — the single most invoked consequence of compactness in analysis and optimization, guaranteeing that minimizers of continuous objectives on compact feasible sets always exist.
Proof Sketch (Spine):

A full proof traversing all ten criteria requires a web of implications. Rather than writing every one, we establish the central bridge (1) ⇒ (6) in detail — it exposes the deep link between the covering and sequential viewpoints — and record the remaining connections as short structural remarks organized by cluster.

Core implication: (1) ⇒ (6). Suppose every open cover of \(X\) admits a finite subcover. Let \(\{x_n\}\) be a sequence in \(X\). Assume for contradiction that \(\{x_n\}\) has no convergent subsequence. Then for every \(p \in X\) there exists \(r_p > 0\) such that the ball \(B(p, r_p)\) contains \(x_n\) for only finitely many indices \(n\): otherwise, the balls of radius \(1/k\) around \(p\) would each contain infinitely many terms, and we could inductively select indices \(n_1 < n_2 < \cdots\) with \(x_{n_k} \in B(p, 1/k)\), producing a subsequence converging to \(p\) — contradicting our assumption. The family \(\{B(p, r_p) : p \in X\}\) is an open cover of \(X\), so by (1) a finite subcollection \(B(p_1, r_{p_1}), \ldots, B(p_k, r_{p_k})\) covers \(X\). Then \[ \mathbb{N} \;=\; \bigcup_{i=1}^{k} \{\, n : x_n \in B(p_i, r_{p_i}) \,\}, \] a finite union of finite sets — but \(\mathbb{N}\) is infinite, a contradiction. Hence a convergent subsequence exists.

Sequential ↔ covering cluster: (6) ⇔ (7) ⇔ (1). For (6) ⇒ (7): a Cauchy sequence with a convergent subsequence converges to the same limit, giving completeness; and if \(X\) were not totally bounded, some \(\varepsilon > 0\) would allow constructing a sequence \(\{y_n\}\) with pairwise distances \(\ge \varepsilon\), which admits no convergent subsequence — any convergent subsequence is Cauchy, forcing \(d(y_{n_k}, y_{n_l}) < \varepsilon\) for sufficiently large \(k \ne l\), contradicting the pairwise \(\ge \varepsilon\) condition — and this contradicts (6). For (7) ⇒ (6): given any sequence, total boundedness provides, at each scale \(1/k\), a ball of radius \(1/k\) containing infinitely many terms; iterating this inside successive subsequences and extracting the diagonal produces a subsequence in which any two terms beyond index \(k\) lie in a common ball of radius \(1/k\), hence within distance \(2/k\) of each other — forming a Cauchy subsequence, and completeness delivers its limit. For (6) ⇒ (1): sequentially compact metric spaces admit Lebesgue numbers (every open cover has a \(\delta > 0\) such that every set of diameter \(< \delta\) lies in some cover element); combined with a finite \(\delta/3\)-net from total boundedness (each such ball has diameter \(\le 2\delta/3 < \delta\)), this yields a finite subcover.

Covering-style criteria: (1) ⇔ (2) ⇔ (3). (1) ⇒ (2) because ball covers are a special case of open covers, so the finite-subcover property transfers directly. (2) ⇒ (1): given any open cover \(\mathcal{U}\), for each \(x \in X\) pick \(U_x \in \mathcal{U}\) with \(x \in U_x\) and a ball \(B(x, r_x) \subseteq U_x\); the family \(\{B(x, r_x)\}\) is a ball cover, so (2) yields a finite subcollection \(B(x_1, r_{x_1}), \ldots, B(x_k, r_{x_k})\) covering \(X\), and the corresponding \(\{U_{x_1}, \ldots, U_{x_k}\}\) is a finite subcover of \(\mathcal{U}\). (1) ⇔ (3) is De Morgan duality: a family of open sets covers \(X\) iff the complementary family of closed sets has empty intersection, and a finite subcover corresponds to a finite subfamily with empty intersection — which is exactly the negation of the finite intersection property.

Nested-set criteria: (3) ⇒ (4) ⇒ (5). (3) ⇒ (4) ⇒ (5) by specialization: a nest (totally ordered family under \(\subseteq\)) of non-empty closed sets has the finite intersection property (any finite subfamily has a minimum element among its members, which is non-empty), and any nested sequence \(F_1 \supseteq F_2 \supseteq \cdots\) is a countable nest. The converse directions restoring (3) from (4) or (5) are subtle — they rely either on Zorn's lemma or on an indirect route. The cleanest closure is sequential: given a sequence \(\{x_n\}\), the nested closed sets \(F_n = \overline{\{x_k : k \ge n\}}\) have non-empty intersection by any of (3), (4), (5), and any point in \(\bigcap_n F_n\) is an accumulation point of \(\{x_n\}\), yielding a convergent subsequence. Combined with (6) ⇒ (1) above, this closes the equivalence loop.

Geometric and functional characterizations: (7) ⇔ (8), (1) ⇔ (10), (9). (1) ⇒ (8): compactness implies boundedness — the cover \(\{B(x_0, n) : n \in \mathbb{N}\}\) admits a finite subcover, placing \(X\) inside a single ball. For the nearest-point property: for any superspace \(Y \supseteq X\) and any \(y \in Y\), the distance function \(d(\cdot, y): X \to \mathbb{R}\) is continuous (Lipschitz with constant 1), and the Extreme Value Theorem — proved below as a consequence of (1) — attains its minimum on the compact \(X\), yielding a nearest point. The converse (8) ⇒ (7) is a classical equivalence in metric-space theory: bounded sets with the nearest-point property possess enough structure to force completeness and the existence of finite \(\varepsilon\)-nets. This direction is non-trivial and we do not reproduce it in this spine. (1) ⇒ (10) is the Extreme Value Theorem, proved below. (10) ⇒ (1): from an open cover without a finite subcover, one constructs a continuous real-valued function on \(X\) that is unbounded (or attains no supremum), contradicting (10); the construction uses a Urysohn-type interpolation along a nested sequence of non-empty closed sets with empty intersection, which is non-trivial but standard. (9): the Hausdorff-Alexandroff Theorem — every non-empty compact metric space is a continuous image of the Cantor set — is a classical result in descriptive set theory whose full proof uses the separability of compact metric spaces and a binary encoding of a countable base. The reverse direction (continuous images of the Cantor set are compact) follows from the Cantor set being compact and from continuous images of compact sets being compact, established separately below.

While all ten criteria are logically equivalent in metric spaces, they offer different practical advantages depending on your objective:


Among these, Sequential Compactness (6) is particularly operational. In the context of the Bolzano-Weierstrass Theorem, it ensures that even if we cannot find the exact global minimum in finite steps, any bounded sequence of parameters generated by our algorithm contains a subsequence that converges to a limit within the feasible set.

Intuitively, if a set is not compact, it has "holes" (missing limit points) or "ends" that stretch to infinity, allowing us to define a cover that requires infinitely many sets to handle the boundary or the unbounded tail. Compactness asserts that no matter how small our open sets are, we only need a finite number of them to capture the entire space.

While "holes" often refer to a lack of completeness, compactness is a much stronger constraint. Completeness ensures that sequences don't "fall through" internal gaps (e.g., the limit of \(1, 1.4, 1.41...\) exists in \(\mathbb{R}\) but not in \(\mathbb{Q}\)). However, compactness prevents sequences from "escaping" to infinity as well.

This difference is captured in Cantor's Intersection Theorem. In a complete space, nested sets must shrink in size \(\text{diam}(A_n) \to 0\) to guarantee a limit point. In a compact space, the nested intersection of non-empty closed sets is always non-empty, regardless of their size. Compactness leaves no room for any sequence to drift away or vanish, providing the ultimate structural guarantee for stability.

The Heine-Borel Theorem

While the open cover definition is powerful, checking "every" open cover is difficult. Fortunately, in the familiar Euclidean space \(\mathbb{R}^n\), compactness has a simple geometric characterization.

Theorem: Heine-Borel

A subset \(K \subseteq \mathbb{R}^n\) is compact if and only if it is:

  1. Closed: It contains all its limit points.
  2. Bounded: It can be contained within a ball of finite radius.
Proof Sketch:

(\(\Rightarrow\)) Compactness implies closed and bounded in any metric space. Boundedness: the cover \(\{B(x, 1) : x \in K\}\) admits a finite subcover \(B(x_1, 1), \ldots, B(x_m, 1)\); since the finite set \(\{x_1, \ldots, x_m\}\) has bounded diameter \(M := \max_{i,j} d(x_i, x_j)\), any \(y \in K\) lies in some \(B(x_i, 1)\) and satisfies \(d(y, x_1) \le d(y, x_i) + d(x_i, x_1) < 1 + M\), placing \(K\) inside \(B(x_1, 1 + M)\). Closedness: if \(p\) is a limit point of \(K\) not in \(K\), the open sets \(\{X \setminus \overline{B}(p, 1/n) : n \in \mathbb{N}\}\) cover \(K\) (every \(x \in K\) has positive distance to \(p\), so \(1/n < d(x, p)\) for some \(n\)), but no finite subcover exists — any finite subfamily would leave \(K \cap \overline{B}(p, 1/\max n_i)\) uncovered, yet every neighborhood of \(p\) meets \(K\) by the limit-point assumption. This contradicts compactness of \(K\), so \(p \in K\).

(\(\Leftarrow\)) This direction is the distinctively Euclidean content. By the equivalence above, it suffices to show closed-and-bounded subsets of \(\mathbb{R}^n\) are complete and totally bounded (criterion (7)). Completeness: \(\mathbb{R}^n\) is complete, and a closed subset of a complete metric space is complete (see complete subset theorem). Total boundedness: a bounded subset \(K \subseteq \mathbb{R}^n\) sits inside some cube \([-R, R]^n\); given \(\varepsilon > 0\), partition this cube into finitely many sub-cubes of side length \(\varepsilon / (2\sqrt{n})\), each having diameter \(\varepsilon / 2 < \varepsilon\). The centres of these sub-cubes form a finite \(\varepsilon\)-net for \(K\).

It is a common pitfall to assume that all "closed and bounded" sets are compact. While the Heine-Borel Theorem guarantees this in Euclidean space \(\mathbb{R}^n\), it does NOT hold in infinite-dimensional spaces, such as function spaces used in Kernel Methods or Deep Learning theory.

In an infinite-dimensional Hilbert space, the unit ball is closed and bounded, but not compact (Riesz's lemma). This is why optimization in infinite-dimensional settings requires additional structure — e.g., compactness-preserving regularization (L2 or nuclear-norm constraints that restrict the search to a compact subset), or reduction to finite dimension via the representer theorem in reproducing kernel Hilbert spaces (RKHS) — to recover the existence guarantees that Heine-Borel freely provides in Euclidean space.

CS Insight: Feasible Regions

In optimization, constraints like \( \|w\|_2 \le C \) (L2 regularization) create a closed and bounded feasible set in \(\mathbb{R}^n\). By the Heine-Borel theorem, this set is compact. This geometry is what theoretically guarantees that regularization penalties prevent weights from exploding and stabilize the learning process.

Continuous Functions on Compact Sets

The power of compactness becomes fully evident when we examine continuous functions. It guarantees not only that solutions exist, but also that the function behaves predictably across the entire domain.

Theorem: Preservation of Compactness

If \(f: X \to Y\) is a continuous function between metric spaces and \(K \subseteq X\) is compact, then the image \(f(K)\) is compact in \(Y\).

Proof:

Let \(\mathcal{C} = \{V_i\}_{i \in I}\) be any open cover of \(f(K)\) in \(Y\). By continuity of \(f\), each preimage \(f^{-1}(V_i)\) is open in \(X\), and since \(\mathcal{C}\) covers \(f(K)\), the family \(\{f^{-1}(V_i)\}_{i \in I}\) covers \(K\). Compactness of \(K\) yields a finite subcollection \(f^{-1}(V_{i_1}), \ldots, f^{-1}(V_{i_n})\) covering \(K\); applying \(f\) gives \(V_{i_1}, \ldots, V_{i_n}\) covering \(f(K)\). Thus every open cover of \(f(K)\) has a finite subcover, so \(f(K)\) is compact. \(\square\)

Since a compact subset of \(\mathbb{R}\) must be closed and bounded (Heine-Borel), the image of a continuous real-valued function on a compact set must have a maximum and a minimum element.

Theorem: Extreme Value Theorem (EVT)

Let \(K\) be a non-empty compact metric space and \(f: K \to \mathbb{R}\) be a continuous function. Then \(f\) is bounded and attains its maximum and minimum values. That is, there exist \(x_{\min}, x_{\max} \in K\) such that for all \(x \in K\): \[ f(x_{\min}) \le f(x) \le f(x_{\max}). \]

Proof:

By the Preservation of Compactness theorem, \(f(K) \subseteq \mathbb{R}\) is compact. By the Heine-Borel theorem, \(f(K)\) is closed and bounded in \(\mathbb{R}\). Since \(K \ne \emptyset\) we have \(f(K) \ne \emptyset\), so boundedness of \(f(K)\) means \(\sup f(K)\) and \(\inf f(K)\) exist and are finite.

Closedness of \(f(K)\) means these supremum and infimum belong to \(f(K)\) itself. Indeed, if \(\sup f(K) \notin f(K)\), then by the defining property of the supremum every \(\varepsilon\)-neighborhood contains some \(y \in f(K)\) with \(y > \sup f(K) - \varepsilon\), giving \(\text{dist}(\sup f(K), f(K)) = 0\); combined with \(\sup f(K) \in f(K)^c\) (which gives \(\text{dist}(\sup f(K), f(K)^c) = 0\) trivially), this places \(\sup f(K)\) in \(\partial f(K) \subseteq \overline{f(K)} = f(K)\), a contradiction. Hence \(\sup f(K) \in f(K)\), and analogously \(\inf f(K) \in f(K)\).

Therefore there exist \(x_{\max}, x_{\min} \in K\) with \(f(x_{\max}) = \sup f(K)\) and \(f(x_{\min}) = \inf f(K)\), and for every \(x \in K\) we have \(f(x) \in f(K)\), so \(f(x_{\min}) \le f(x) \le f(x_{\max})\). \(\square\)

Insight: Existence of Optima

The EVT is the bedrock of loss minimization. If our parameter space is compact (e.g., via regularization constraints) and our loss function is continuous, we are mathematically guaranteed that a global minimum exists. Without compactness (e.g., on an open interval or unbounded set), the function might only approach an infimum asymptotically (like \(f(x) = 1/x\) on \((1, \infty)\)), leaving us chasing a solution that isn't there.

Beyond existence of optima, compactness forces a uniform modulus of continuity across the entire domain (proved in the Heine-Cantor theorem below). On a non-compact space, a continuous function can vary arbitrarily fast near a missing boundary point or at infinity — the function \(f(x) = 1/x\) near \(x = 0\) illustrates how continuity alone gives no uniform control. Compactness eliminates this pathology.

Theorem: Heine-Cantor (Uniform Continuity)

If \(f: K \to Y\) is a continuous function from a non-empty compact metric space \(K\) to a metric space \(Y\), then \(f\) is uniformly continuous on \(K\).

Proof Sketch:

Fix \(\varepsilon > 0\). By continuity, for each \(x \in K\) there exists \(\delta_x > 0\) such that \(f(B(x; \delta_x)) \subseteq B(f(x); \varepsilon/2)\). The family \(\{B(x; \delta_x / 2) : x \in K\}\) — note the half-size radii — is an open cover of \(K\), so by compactness it admits a finite subcover \(\{B(x_1; \delta_{x_1}/2), \ldots, B(x_n; \delta_{x_n}/2)\}\).

Set \(\delta = \min_{i} \delta_{x_i} / 2 > 0\). This choice ensures \(\delta + \delta_{x_i}/2 \le \delta_{x_i}\) for every \(i\), so that any point within distance \(\delta\) of a half-size ball lies inside the corresponding full-size ball where \(f\) is controlled. Concretely: for any \(x, y \in K\) with \(d(x, y) < \delta\), \(x\) lies in some \(B(x_i; \delta_{x_i}/2)\), so \(d(x, x_i) < \delta_{x_i}/2\); then \(d(y, x_i) \le d(y, x) + d(x, x_i) < \delta + \delta_{x_i}/2 \le \delta_{x_i}\), placing \(y\) in \(B(x_i; \delta_{x_i})\). Both \(f(x)\) and \(f(y)\) therefore lie in \(B(f(x_i); \varepsilon/2)\), so \(d(f(x), f(y)) < \varepsilon\). This \(\delta\) depends only on \(\varepsilon\), proving uniform continuity. \(\square\)

Insight: Optimization Stability via Uniform Continuity

While the EVT guarantees that a solution exists, the Heine-Cantor Theorem gives our objective a uniform behaviour: on a compact domain, the modulus of continuity \(\omega_f(\delta) = \sup\{|f(x) - f(y)| : d(x, y) < \delta\}\) tends to zero as \(\delta \to 0\), with a rate that does not vary across the domain.

This does not by itself bound the gradient — uniform continuity is strictly weaker than Lipschitz continuity (\(f(x) = \sqrt{x}\) on \([0, 1]\) is uniformly continuous but not Lipschitz near \(x = 0\)). To obtain a bounded gradient one additionally applies the Extreme Value Theorem to \(\|\nabla f\|\) under a \(C^1\) assumption. Together, these two pieces — uniform modulus plus bounded gradient — provide the stability guarantees that let engineers choose learning rates \(\eta\) safely across the entire compact search space.