Foundations of Analysis: Metric Spaces

Introduction Metric Space Distance Boundary Open, Closed & Dense Balls

Introduction

Up until this chapter, we relied on calculus and linear algebra to solve optimization problems. For many practical machine learning applications, these tools are sufficient. However, as we move toward advanced topics like manifold learning or information geometry, intuition alone is no longer enough.

Consider a fundamental question: when we run gradient descent, we assume each step brings us closer to a solution, and that iterating long enough will reach (or approach) a minimum. But what guarantees this? The sequence of iterates \(\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots\) may get arbitrarily close together without ever settling on an actual point unless the underlying space has the right structure.

Similarly, when we work with function spaces in kernel methods or neural networks, we treat entire functions as single "points." But what does "distance" mean between two functions? What does it mean for a sequence of functions to converge? The familiar Euclidean intuition breaks down, and we need a more general framework.

This chapter introduces that framework: metric spaces and the notion of completeness. These concepts formalize exactly when iterative algorithms converge to valid solutions. Rather than learning new computational techniques, we are gaining the language to state precisely what we have been assuming all along and to understand when those assumptions hold.

Metric Space

The framework begins with a minimal abstraction: a set equipped with a notion of distance. We do not require coordinates, inner products, or linear structure — only a function that assigns a nonnegative real number to each pair of points and obeys three natural axioms.

Definition: Metric Space \((M, d)\)

A metric space is an ordered pair \((M, d)\) consisting of a nonempty set \(M\) and a metric \(d\) on \(M\). The metric \(d\) is a function \(d: M \times M \to \mathbb{R}\) that defines the distance between any two elements of \(M\). For all \(a, b, c \in M\), the following axioms must hold:

  1. Positive definiteness: \(d(a,b) \geq 0\), with equality if and only if \(a = b\). (The inequality \(d(a,b) \geq 0\) is the non-negativity condition; the "if and only if" clause is the identity of indiscernibles.)
  2. Symmetry: \(d(a,b) = d(b,a)\).
  3. Triangle inequality: \(d(a,b) \leq d(a,c) + d(c, b)\).
Definition: Metric Subspace and Superspace

Suppose \((X, d)\) and \((Y, e)\) are metric spaces. We say that \((X, d)\) is a metric subspace of \((Y, e)\), and that \((Y, e)\) is a metric superspace of \((X, d)\), if and only if \(X \subseteq Y\) and \(d\) is the restriction of \(e\) to \(X \times X\); that is, \(d(a, b) = e(a, b)\) for all \(a, b \in X\).

Insight: Why the Axioms Matter in Practice

You have already used non-Euclidean distances throughout this curriculum. In k-nearest neighbors, the choice between Euclidean distance, Manhattan distance, or cosine similarity fundamentally changes which points are considered "close." In kernel methods, the kernel function implicitly defines a distance in a high-dimensional feature space. These are all metrics on different spaces.

The axioms are not arbitrary mathematical restrictions. They encode the minimal properties required for algorithms to behave sensibly. For instance, if the triangle inequality fails, then a nearest-neighbor search might return a point \(y\) as "closest" to \(x\), even when a third point \(z\) satisfies \(d(x, z) < d(x, y)\) via an indirect path. Efficient search structures like KD-trees and ball trees rely on the triangle inequality for their pruning guarantees.

Distance

With a metric defined, we can extend the notion of distance from pairs of points to more general situations. In optimization, we rarely care about the distance between two specific points; instead, we ask questions like: "How far is this point from the feasible region?" or "How close is our current iterate to the set of optimal solutions?" These questions require measuring distance from a point to a set.

Before formalizing this, we need to make precise what it means to take the "smallest possible distance" over a potentially infinite set. Consider the interval \((0, 1) \subseteq \mathbb{R}\) and the point \(0\): the distances \(d(0, y) = y\) for \(y \in (0, 1)\) form a set with no smallest element, yet they approach \(0\) arbitrarily closely. To capture this "smallest attainable or approachable value," we use the infimum rather than the minimum.

Definition: Supremum and Infimum

Let \(S \subseteq \mathbb{R}\). An upper bound of \(S\) is any \(u \in \mathbb{R}\) satisfying \(s \leq u\) for every \(s \in S\); a lower bound is defined dually (\(\ell \leq s\) for every \(s \in S\)). We say \(S\) is bounded above if it has at least one upper bound, and bounded below if it has at least one lower bound.

When \(S\) is nonempty and bounded above, its supremum \(\sup S\) is the least upper bound — the smallest real number that is an upper bound of \(S\). Dually, when \(S\) is nonempty and bounded below, its infimum \(\inf S\) is the greatest lower bound. The existence of these quantities is a nontrivial property of \(\mathbb{R}\) — we return to this in the discussion below.

We extend these to extended-real values by convention: \(\sup S = +\infty\) when \(S\) is unbounded above, \(\inf S = -\infty\) when unbounded below, and for the empty set \(\sup \emptyset = -\infty\), \(\inf \emptyset = +\infty\). When the supremum belongs to \(S\) itself, we call it the maximum, written \(\max S\); when the infimum belongs to \(S\), the minimum, \(\min S\). The point of using \(\sup\) and \(\inf\) rather than \(\max\) and \(\min\) is precisely to handle sets without a largest or smallest element — as the interval \((0, 1)\) illustrates: \(\inf (0,1) = 0\) but \(\min (0,1)\) does not exist.

Insight: The Completeness of \(\mathbb{R}\)

The fact that every nonempty bounded-above \(S \subseteq \mathbb{R}\) has a supremum — that the least upper bound exists as a real number — is the completeness property of the real numbers, the foundational axiom distinguishing \(\mathbb{R}\) from \(\mathbb{Q}\). The set \(S = \{q \in \mathbb{Q} : q^2 < 2\}\) is bounded above in \(\mathbb{Q}\) (by \(3/2\), say) but has no rational least upper bound; only when we pass to \(\mathbb{R}\) does \(\sup S = \sqrt{2}\) exist. Without this property, the infimum in our distance definition could fail to exist even when intuitively it should be well-defined.

This order-theoretic completeness of \(\mathbb{R}\) is deeply connected to the metric-space completeness we will study later — where the analogous question becomes whether every Cauchy sequence converges. In \(\mathbb{R}\), the two notions are equivalent; in general metric spaces (which may lack an order structure), only the Cauchy formulation survives. Both express the same intuition: no holes.

In optimization and machine learning, \(\inf\) and \(\sup\) appear throughout: the optimal value of a minimization problem is \(\inf_x f(x)\), generalization bounds are stated as suprema over hypothesis classes, and the \(\limsup\) / \(\liminf\) of an error sequence capture worst-case and best-case asymptotic behavior. We will rely on these operations without further ceremony.

Definition: Distances Suppose \((X, d)\) is a metric space and \(A\) and \(B\) are subsets of \(X\).
The distance from \(x \in X\) to \(A\) to be \[ \text{dist }(x, A) = \inf \{d(x, a) \mid a \in A\} \quad (\text{Points to Sets}) \] and also the distance from \(A\) to \(B\) to be \[ \text{dist }(A, B) = \inf \{d(a, b) \mid a \in A, b \in B\}. \quad (\text{Sets to Sets}) \] Both distances depend on the choice of the metric \(d\). By convention, when the set is empty we extend these definitions via \(\inf \emptyset = +\infty\) (as established above), so that \(\text{dist}(x, \emptyset) = +\infty\) and \(\text{dist}(A, \emptyset) = +\infty\); this ensures the derived notions below (isolated points, boundary, etc.) remain well-defined in degenerate cases.
Definition: Isolated Points Suppose \(X\) is a metric space, \(S\) is a subset of \(X\), and \(z \in S\). Then \(z\) is said to be an isolated point of \(S\) if and only if \[ \text{dist }(z, S \setminus \{z\}) \neq 0. \] The collection of isolated points of \(S\) is denoted by \(\text{iso } (S)\).

The complement of isolated points leads to a more important concept for our purposes: accumulation points. While isolated points stand alone with "room around them," accumulation points are surrounded by infinitely many other points from the set no matter how small a neighborhood we consider.

Definition: Accumulation Points Suppose \(X\) is a metric space, \(S\) is a subset of \(X\), and \(z \in X\). Then \(z\) is said to be an accumulation point or a limit point of \(S\) in \(X\) if and only if \[ \text{dist }(z, S \setminus \{z\}) = 0. \] The collection of accumulation points of \(S\) in \(X\) is denoted by \(\text{acc } (S)\).
Definition: Nearest Points Suppose \((X, d)\) is a metric space, \(S\) is a subset of \(X\), and \(z \in X\). A member \(s \in S\), if one exists, is called a nearest point of \(S\) to \(z\) in \(X\) if and only if \[ d(z, s) = \text{dist }(z, S). \]

A nearest point need not exist in general: take \(X = \mathbb{R}\), \(S = (0, 1)\), and \(z = 0\). Then \(\text{dist}(z, S) = 0\), yet no member of \(S\) is at distance zero from \(z\), because \(S\) is open and excludes its boundary. Existence is guaranteed, however, when \(S\) has sufficient structure — most notably when \(S\) is compact, a notion we formalize in a later chapter. In \(\mathbb{R}^n\), this reduces to \(S\) being closed and bounded (Heine–Borel); more generally, closedness alone suffices in "nice" spaces such as Hilbert spaces, where the projection onto a closed convex set is uniquely determined.

Insight: "The Nearest Point Exists" — When Can You Trust This Assumption?

Machine learning practitioners routinely take the existence of a nearest point for granted: "find the closest training example," "project onto the feasible set," "snap the embedding to the nearest codeword." These operations are so ubiquitous that it is easy to forget they rest on a non-trivial mathematical guarantee.

Here is the cheat sheet for when existence is safe:

  • k-Nearest Neighbors / SVM support vectors — The set \(S\) is a finite collection of training points. Finite sets are automatically compact, so a minimizer always exists (in fact, it is attained by brute-force search).
  • Projection in convex optimization (proximal methods, projected gradient descent) — The feasible set is typically closed and convex in \(\mathbb{R}^n\) or a Hilbert space. The Hilbert projection theorem guarantees not only existence but also uniqueness of the nearest point.
  • Vector quantization / codebook lookup — Finite codebook, same as k-NN: existence is automatic.
  • Open constraint sets (e.g., strict inequalities \(g(x) < 0\)) — This is where existence can fail. The nearest feasible point may not exist; typical fix is to replace the constraint with its closure \(g(x) \leq 0\), or to use a barrier/penalty method rather than direct projection.
  • Non-compact feasible sets (even convex ones) — No guarantee in general. The fundamental obstruction is not non-convexity but non-compactness: a closed but unbounded set in \(\mathbb{R}^n\), or any closed bounded set in an infinite-dimensional Banach space (which is never compact), can fail to contain a nearest point. Iterative algorithms operating in such settings may oscillate or approach a limit outside \(S\).

The takeaway: whenever you write project(x, S) or call a nearest-neighbor routine, the code silently assumes existence — and that assumption is usually, but not always, justified by the structure of \(S\).

The concept of distance to a set is fundamental to margin-based classifiers (like SVMs), where the goal is to maximize the distance between a decision boundary and the nearest data points. Furthermore, the nearest point definition is the mathematical foundation of projection. In constrained optimization, projecting a gradient update back onto the feasible set \(S\) is exactly the act of finding a nearest point \(s \in S\) to the updated vector \(z\).

Boundary

Having established how to measure distance to sets, we can now classify points by their relationship to a set and its complement. This classification is fundamental to constrained optimization. When we solve problems like "minimize \(f(x)\) subject to \(g(x) \leq 0\)," the feasible region is a set \(S\), and the nature of optimal solutions depends heavily on whether they lie in the interior or on the boundary of \(S\).

Definition: Boundary Points Suppose \(X\) is a metric space, \(S\) is a subset of \(X\) and \(a \in X\). Then \(a\) is called a boundary point of \(S\) in \(X\) if and only if \[ \text{dist }(a, S) = \text{dist }(a, S^c) = 0. \] The collection of boundary points of \(S\) in \(X\) is called the boundary of \(S\) in \(X\) and is denoted by \(\partial S\).
Definition: Closure, Interior, and Exterior Suppose \(X\) is a metric space, and \(S\) is a subset of \(X\).
  • The closure of \(S\) in \(X\) to be the union: \[ \overline{S} = \text{cl}(S) = S \cup \partial S. \]
  • The interior of \(S\) in \(X\) to be the difference: \[ S^{\circ} = \text{int}(S) = S \setminus \partial S. \]
  • The exterior of \(S\) in \(X\) to be the complement of the closure of \(S\) in \(X\): \[ \text{ext}(S) = \left(\overline{S}\right)^c. \] Note that the exterior of \(S\) in \(X\) is the interior of its complement in \(X\): \[ \left(\overline{S}\right)^c = (S^c)^{\circ}. \]

Open, Closed & Dense

The concepts of interior and boundary lead naturally to a classification of sets themselves. A set is open if it consists entirely of interior points. Intuitively, every point has "breathing room" and you can move slightly in any direction without leaving the set. A set is closed if it contains all its boundary points—the set includes its own "edge."

Definition: Open and Closed Subsets Suppose \(X\) is a metric space and \(S\) is a subset of \(X\). Then \(S\) is said to be
  1. an open subset of \(X\), or open in \(X\) iff \[S \cap \partial S = \emptyset\]
  2. a closed subset of \(X\), or closed in \(X\) iff \[\partial S \subseteq S\]

Note that most subsets of metric spaces are neither open nor closed - a set can contain some but not all of its boundary points. However, the sets that are open form a structure worth naming.

Open and closed are two ways of asking which boundary points belong to a set. A different and complementary question is: how thoroughly does a subset reach into the rest of the space? Even a set that is far from being all of \(X\) can still "approach every point" of \(X\) — meaning that every point of \(X\) is the limit of some sequence drawn from the subset, or equivalently, that every open region of \(X\), no matter how small, meets the subset. Sets with this reaching property are called dense, and they behave in many contexts as good "test sets": a continuous function is determined by its values on a dense subset, an isometry is determined by its values on a dense subset, and a bounded linear operator is determined by its values on a dense subset. The formal definition uses the closure introduced above.

Definition: Dense Subset

Suppose \(X\) is a metric space and \(S \subseteq X\). Then \(S\) is said to be dense in \(X\) if and only if its closure equals the whole space: \[ \overline{S} \;=\; X. \] Equivalently, every non-empty open subset of \(X\) contains at least one point of \(S\); equivalently again, every point of \(X\) is the limit of a sequence drawn from \(S\).

The standard example is \(\mathbb{Q}\) inside \(\mathbb{R}\): every real number is a limit of rationals, so \(\overline{\mathbb{Q}} = \mathbb{R}\) and \(\mathbb{Q}\) is dense in \(\mathbb{R}\). The same holds componentwise for \(\mathbb{Q}^n\) inside \(\mathbb{R}^n\). The irrationals \(\mathbb{R} \setminus \mathbb{Q}\) are also dense in \(\mathbb{R}\) — density is symmetric in a strong sense: a set and its complement can both be dense, which is exactly what happens here. In function spaces, dense subsets play an even more decisive role: trigonometric polynomials are dense in continuous periodic functions; continuous compactly supported functions are dense in \(L^p\); these density results are the engines behind the entire theory of Fourier analysis and \(L^p\) approximation, where one proves an identity for a "nice" dense subclass and then extends it to the whole space by continuity.

Definition: Topology Suppose \((X, d)\) is a metric space. The collection of open subsets of \(X\) is called the topology determined by the metric \(d\), denoted \(\tau_d\).

While we will not develop general topology here, this connection is valuable: many properties of metric spaces (continuity, convergence, compactness) depend only on which sets are open, not on the specific numerical values of the metric.

Why do these classifications matter for optimization? If our search space is strictly open (e.g., \(0 < x < 1\)), an algorithm might iterate forever, approaching a boundary value without ever reaching it - the limit point is excluded from the feasible set. A closed set contains its boundary, ensuring that edge solutions are valid. More fundamentally, the Extreme Value Theorem guarantees that a continuous function attains its maximum and minimum only when the domain is compact - which, in \(\mathbb{R}^n\), means both closed and bounded. We will formalize compactness in later chapters.

Now we arrive at one of the most important concepts for algorithmic convergence: completeness.

Definition: Complete Metric Space

A metric space \(X\) is said to be complete if and only if \(X\) is closed in every metric superspace of \(X\).

Here "closed" is meant in the topological sense relative to the superspace \(Y\): for every metric superspace \((Y, e)\) with \(X \subseteq Y\), the boundary \(\partial X\) computed inside \(Y\) satisfies \(\partial X \subseteq X\).

Intuitively, completeness means the space has "no holes." The real numbers \(\mathbb{R}\) with the standard metric are complete - this is why calculus works. The rational numbers \(\mathbb{Q}\) are not complete: consider the sequence defined by \(x_1 = 1\) and \(x_{n+1} = \frac{1}{2}\left(x_n + \frac{2}{x_n}\right)\). Each term is rational, but the sequence converges to \(\sqrt{2}\), which is irrational. The limit exists in \(\mathbb{R}\) but not in \(\mathbb{Q}\). There is a "hole" where \(\sqrt{2}\) should be.

For optimization, completeness ensures that limit points of iterative algorithms actually belong to the space we are working in. The definition above (closedness in every superspace) is the primary one we adopt here; it is convenient because it refers only to already-established topological notions. In the chapter on completeness, we will develop the theory of Cauchy sequences and establish the standard equivalent characterization: a metric space is complete if and only if every Cauchy sequence converges to a limit within the space. This equivalence is the reason both formulations are used interchangeably in the literature.

Balls

Open and closed sets can be complicated objects, but we can build them from simpler pieces. The open ball centered at a point \(x\) with radius \(r\) is the set of all points within distance \(r\) of \(x\). These balls are the "building blocks" of the topology: every open set \(U\) can be expressed as a union of open balls. The reason is short — for each \(x \in U\), the interior-point property guarantees \(\text{dist}(x, U^c) > 0\), so the ball \(\mathcal{B}[x; r_x)\) with \(r_x = \tfrac{1}{2}\text{dist}(x, U^c)\) lies inside \(U\), giving \(U = \bigcup_{x \in U} \mathcal{B}[x; r_x)\). This is what it means to say the open balls form a basis for the metric topology, a perspective we develop in full generality in the chapter on topological spaces.

More importantly for algorithms, balls give us a coordinate-free way to talk about neighborhoods. When we say "there exists a step size \(\alpha\) such that gradient descent stays in the feasible region," we are implicitly saying the current point lies in an open ball contained within that region.

Definition: Open Balls & Closed Balls Suppose \((X, d)\) is a metric space and \(x \in X\). For each \(r \in \mathbb{R}^+\), we define:
  1. The open ball in \(X\) centered at the point \(x\) and with radius \(r\) to be the set \[ \mathcal{B}[x ; r) = \{y \in X \mid d(x, y) < r\}. \]
  2. The closed ball in \(X\) centered at the point \(x\) and with radius \(r\) to be the set \[ \mathcal{B}[x ; r] = \{y \in X \mid d(x, y) \leq r\}. \]

Intuitively, we want to say that the open ball in \(X\) is the open subset of \(X\), and similarly, the closed ball in \(X\) is the closed subset of \(X\). Indeed, these are true. We shall show that only the open ball case here.

Proof: The Open Ball is Open Suppose \((X, d)\) is a metric space, \(x \in X\), and \(r \in \mathbb{R}^+\). Consider the open ball \[ \mathcal{B}[x ; r) = \{y \in X \mid d(x, y) < r\} \subseteq X. \] Let \(z \in \mathcal{B}\). Since \(d(x, z) < r\), we may choose \[ s := \tfrac{1}{2}\bigl(r - d(x, z)\bigr) \in (0, r), \] so that \(d(x, z) < r - s\). Now, consider any point \(w\) outside the ball, i.e., \(w \in X \setminus \mathcal{B}\). By definition, \(d(w, x) \geq r\). The triangle inequality gives \(d(w, x) \leq d(w, z) + d(z, x)\), and rearranging (using symmetry) yields the reverse triangle inequality: \[ d(z, w) \;\geq\; d(w, x) - d(x, z) \;>\; r - (r - s) \;=\; s. \] Since \(w\) is an arbitrary point in the complement \(X \setminus \mathcal{B}\), this inequality holds for all points outside the ball. This implies that the "shortest distance" (infimum) from \(z\) to the outside of the ball is at least \(s\): \[ \text{dist }(z, X \setminus \mathcal{B}) = \inf \{d(z, w) \mid w \notin \mathcal{B}\} \geq s > 0. \] Recall the definition of a boundary point requires the distance to the complement to be zero. Since the distance here is strictly positive, \(z \notin \partial \mathcal{B}\). Since \(z\) was chosen arbitrarily, we have \(\mathcal{B} \cap \partial \mathcal{B} = \emptyset\), which is precisely the definition of \(\mathcal{B}\) being open. Equivalently, every point of \(\mathcal{B}\) is an interior point, so \[ \mathcal{B} = \mathcal{B}^{\circ}. \] Therefore, \(\mathcal{B}\) is open.

The open ball also provides the foundation for discussing convexity in a coordinate-free manner. In a normed linear space, all open balls of the same radius have identical shape — each is simply a translation of the others. Moreover, these balls are themselves convex sets (a short consequence of the triangle inequality and positive homogeneity of the norm). We can now state the definition of convexity that applies to general normed spaces, not just \(\mathbb{R}^n\):

Definition: Convex Suppose \(V\) is a normed linear space and \(C \subseteq V\). Then \(C\) is said to be convex if and only if for each \(a, b \in C\), the line segment \(\{ (1 - t)a + tb \mid t \in [0, 1] \}\) joining \(a\) and \(b\) is included in \(C\).