Introduction
Up until this chapter, we relied on calculus and linear algebra to solve optimization problems.
For many practical machine learning applications, these tools are sufficient. However, as we move toward
advanced topics like manifold learning or information geometry, intuition alone is no longer enough.
Consider a fundamental question: when we run gradient descent, we assume each step brings us closer to a solution,
and that iterating long enough will reach (or approach) a minimum. But what guarantees this? The sequence of
iterates \(\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots\) may get arbitrarily close together without ever
settling on an actual point unless the underlying space has the right structure.
Similarly, when we work with function spaces in kernel methods or neural networks, we treat entire functions
as single "points." But what does "distance" mean between two functions? What does it mean for a sequence of
functions to converge? The familiar Euclidean intuition breaks down, and we need a more general framework.
This chapter introduces that framework: metric spaces and the notion of completeness.
These concepts formalize exactly when iterative algorithms converge to valid solutions. Rather than learning new
computational techniques, we are gaining the language to state precisely what we have been assuming all along and
to understand when those assumptions hold.
Metric Space
The framework begins with a minimal abstraction: a set equipped with a notion of distance. We do not
require coordinates, inner products, or linear structure — only a function that assigns a nonnegative
real number to each pair of points and obeys three natural axioms.
Definition: Metric Space \((M, d)\)
A metric space is an ordered pair \((M, d)\) consisting of a nonempty set
\(M\) and a metric \(d\) on \(M\). The metric \(d\) is a function
\(d: M \times M \to \mathbb{R}\) that defines the distance between any two elements of \(M\).
For all \(a, b, c \in M\), the following axioms must hold:
- Positive definiteness: \(d(a,b) \geq 0\), with equality if and only if \(a = b\). (The inequality \(d(a,b) \geq 0\) is the non-negativity condition; the "if and only if" clause is the identity of indiscernibles.)
- Symmetry: \(d(a,b) = d(b,a)\).
- Triangle inequality: \(d(a,b) \leq d(a,c) + d(c, b)\).
Definition: Metric Subspace and Superspace
Suppose \((X, d)\) and \((Y, e)\) are metric spaces. We say that \((X, d)\) is a
metric subspace of \((Y, e)\), and that \((Y, e)\) is a
metric superspace of \((X, d)\), if and only if \(X \subseteq Y\) and
\(d\) is the restriction of \(e\) to \(X \times X\); that is, \(d(a, b) = e(a, b)\) for all
\(a, b \in X\).
Insight: Why the Axioms Matter in Practice
You have already used non-Euclidean distances throughout this curriculum. In k-nearest neighbors,
the choice between Euclidean distance, Manhattan distance, or cosine similarity fundamentally changes
which points are considered "close." In kernel methods, the kernel function implicitly defines a
distance in a high-dimensional feature space. These are all metrics on different spaces.
The axioms are not arbitrary mathematical restrictions. They encode the minimal properties required for
algorithms to behave sensibly. For instance, if the triangle inequality fails, then a nearest-neighbor
search might return a point \(y\) as "closest" to \(x\), even when a third point \(z\) satisfies
\(d(x, z) < d(x, y)\) via an indirect path. Efficient search structures like KD-trees and ball trees
rely on the triangle inequality for their pruning guarantees.
Distance
With a metric defined, we can extend the notion of distance from pairs of points to more general
situations. In optimization, we rarely care about the distance between two specific points; instead,
we ask questions like: "How far is this point from the feasible region?" or "How close is our current
iterate to the set of optimal solutions?" These questions require measuring distance from a point to
a set.
Before formalizing this, we need to make precise what it means to take the "smallest possible distance"
over a potentially infinite set. Consider the interval \((0, 1) \subseteq \mathbb{R}\) and the point \(0\):
the distances \(d(0, y) = y\) for \(y \in (0, 1)\) form a set with no smallest element, yet they approach
\(0\) arbitrarily closely. To capture this "smallest attainable or approachable value," we use the
infimum rather than the minimum.
Definition: Supremum and Infimum
Let \(S \subseteq \mathbb{R}\). An upper bound of \(S\) is any \(u \in \mathbb{R}\)
satisfying \(s \leq u\) for every \(s \in S\); a lower bound is defined dually
(\(\ell \leq s\) for every \(s \in S\)). We say \(S\) is bounded above if it has
at least one upper bound, and bounded below if it has at least one lower bound.
When \(S\) is nonempty and bounded above, its supremum \(\sup S\) is the least
upper bound — the smallest real number that is an upper bound of \(S\). Dually, when \(S\) is
nonempty and bounded below, its infimum \(\inf S\) is the greatest lower bound.
The existence of these quantities is a nontrivial property of \(\mathbb{R}\) — we return to this
in the discussion below.
We extend these to extended-real values by convention: \(\sup S = +\infty\) when \(S\) is unbounded
above, \(\inf S = -\infty\) when unbounded below, and for the empty set \(\sup \emptyset = -\infty\),
\(\inf \emptyset = +\infty\). When the supremum belongs to \(S\) itself, we call it the
maximum, written \(\max S\); when the infimum belongs to \(S\), the
minimum, \(\min S\). The point of using \(\sup\) and \(\inf\) rather than
\(\max\) and \(\min\) is precisely to handle sets without a largest or smallest element — as
the interval \((0, 1)\) illustrates: \(\inf (0,1) = 0\) but \(\min (0,1)\) does not exist.
Insight: The Completeness of \(\mathbb{R}\)
The fact that every nonempty bounded-above \(S \subseteq \mathbb{R}\) has a supremum —
that the least upper bound exists as a real number — is the completeness property
of the real numbers, the foundational axiom distinguishing \(\mathbb{R}\) from \(\mathbb{Q}\).
The set \(S = \{q \in \mathbb{Q} : q^2 < 2\}\) is bounded above in \(\mathbb{Q}\) (by \(3/2\), say)
but has no rational least upper bound; only when we pass to \(\mathbb{R}\) does \(\sup S = \sqrt{2}\)
exist. Without this property, the infimum in our distance definition could fail to exist even when
intuitively it should be well-defined.
This order-theoretic completeness of \(\mathbb{R}\) is deeply connected to the
metric-space completeness we will study later — where the analogous question becomes
whether every Cauchy sequence converges. In \(\mathbb{R}\), the two notions are equivalent; in general
metric spaces (which may lack an order structure), only the Cauchy formulation survives. Both express
the same intuition: no holes.
In optimization and machine learning, \(\inf\) and \(\sup\) appear throughout: the optimal value
of a minimization problem is \(\inf_x f(x)\), generalization bounds are stated as suprema over
hypothesis classes, and the \(\limsup\) / \(\liminf\) of an error sequence capture worst-case
and best-case asymptotic behavior. We will rely on these operations without further ceremony.
Definition: Distances
Suppose \((X, d)\) is a metric space and \(A\) and \(B\) are subsets of \(X\).
The distance from \(x \in X\) to \(A\) to be
\[
\text{dist }(x, A) = \inf \{d(x, a) \mid a \in A\} \quad (\text{Points to Sets})
\]
and also the distance from \(A\) to \(B\) to be
\[
\text{dist }(A, B) = \inf \{d(a, b) \mid a \in A, b \in B\}. \quad (\text{Sets to Sets})
\]
Both distances depend on the choice of the metric \(d\). By convention, when the set is empty we extend these definitions via \(\inf \emptyset = +\infty\) (as established above), so that \(\text{dist}(x, \emptyset) = +\infty\) and \(\text{dist}(A, \emptyset) = +\infty\); this ensures the derived notions below (isolated points, boundary, etc.) remain well-defined in degenerate cases.
Definition: Isolated Points
Suppose \(X\) is a metric space, \(S\) is a subset of \(X\), and \(z \in S\). Then \(z\) is said to be
an isolated point of \(S\) if and only if
\[
\text{dist }(z, S \setminus \{z\}) \neq 0.
\]
The collection of isolated points of \(S\) is denoted by \(\text{iso } (S)\).
The complement of isolated points leads to a more important concept for our purposes: accumulation points.
While isolated points stand alone with "room around them," accumulation points are surrounded by infinitely
many other points from the set no matter how small a neighborhood we consider.
Definition: Accumulation Points
Suppose \(X\) is a metric space, \(S\) is a subset of \(X\), and \(z \in X\). Then \(z\) is said to be
an accumulation point or a limit point of \(S\) in \(X\) if and only if
\[
\text{dist }(z, S \setminus \{z\}) = 0.
\]
The collection of accumulation points of \(S\) in \(X\) is denoted by \(\text{acc } (S)\).
Definition: Nearest Points
Suppose \((X, d)\) is a metric space, \(S\) is a subset of \(X\), and \(z \in X\). A member \(s \in S\), if one exists,
is called a nearest point of \(S\) to \(z\) in \(X\) if and only if
\[
d(z, s) = \text{dist }(z, S).
\]
A nearest point need not exist in general: take \(X = \mathbb{R}\), \(S = (0, 1)\), and \(z = 0\). Then
\(\text{dist}(z, S) = 0\), yet no member of \(S\) is at distance zero from \(z\), because \(S\) is open and
excludes its boundary. Existence is guaranteed, however, when \(S\) has sufficient structure — most notably
when \(S\) is compact, a notion we formalize in a later chapter. In \(\mathbb{R}^n\), this reduces to
\(S\) being closed and bounded (Heine–Borel); more generally, closedness alone suffices in "nice" spaces such as
Hilbert spaces, where the projection onto a closed convex set is uniquely determined.
Insight: "The Nearest Point Exists" — When Can You Trust This Assumption?
Machine learning practitioners routinely take the existence of a nearest point for granted:
"find the closest training example," "project onto the feasible set," "snap the embedding to
the nearest codeword." These operations are so ubiquitous that it is easy to forget
they rest on a non-trivial mathematical guarantee.
Here is the cheat sheet for when existence is safe:
-
k-Nearest Neighbors / SVM support vectors — The set \(S\) is a finite collection of training points.
Finite sets are automatically compact, so a minimizer always exists (in fact, it is attained by brute-force search).
-
Projection in convex optimization (proximal methods, projected gradient descent) — The feasible set is typically
closed and convex in \(\mathbb{R}^n\) or a Hilbert space. The Hilbert projection theorem guarantees not only
existence but also uniqueness of the nearest point.
-
Vector quantization / codebook lookup — Finite codebook, same as k-NN: existence is automatic.
-
Open constraint sets (e.g., strict inequalities \(g(x) < 0\)) — This is where existence can fail.
The nearest feasible point may not exist; typical fix is to replace the constraint with its closure \(g(x) \leq 0\),
or to use a barrier/penalty method rather than direct projection.
-
Non-compact feasible sets (even convex ones) — No guarantee in general. The fundamental obstruction is
not non-convexity but non-compactness: a closed but unbounded set in \(\mathbb{R}^n\), or any closed bounded set
in an infinite-dimensional Banach space
(which is never compact), can fail to contain a nearest point. Iterative algorithms operating in such settings may
oscillate or approach a limit outside \(S\).
The takeaway: whenever you write project(x, S) or call a nearest-neighbor routine, the code silently assumes
existence — and that assumption is usually, but not always, justified by the structure of \(S\).
The concept of distance to a set is fundamental to margin-based classifiers
(like SVMs),
where the goal is to maximize the distance between a decision boundary and the nearest data points.
Furthermore, the nearest point definition is the mathematical foundation of projection.
In constrained optimization, projecting a gradient update back onto the feasible set \(S\) is exactly the
act of finding a nearest point \(s \in S\) to the updated vector \(z\).
Boundary
Having established how to measure distance to sets, we can now classify points by their relationship to
a set and its complement. This classification is fundamental to constrained optimization.
When we solve problems like "minimize \(f(x)\) subject to \(g(x) \leq 0\),"
the feasible region is a set \(S\), and the nature of optimal solutions depends heavily on whether they
lie in the interior or on the boundary of \(S\).
Definition: Boundary Points
Suppose \(X\) is a metric space, \(S\) is a subset of \(X\) and \(a \in X\). Then \(a\) is called
a boundary point of \(S\) in \(X\) if and only if
\[
\text{dist }(a, S) = \text{dist }(a, S^c) = 0.
\]
The collection of boundary points of \(S\) in \(X\) is called the boundary of \(S\) in \(X\)
and is denoted by \(\partial S\).
Definition: Closure, Interior, and Exterior
Suppose \(X\) is a metric space, and \(S\) is a subset of \(X\).
- The closure of \(S\) in \(X\) to be the union:
\[
\overline{S} = \text{cl}(S) = S \cup \partial S.
\]
- The interior of \(S\) in \(X\) to be the difference:
\[
S^{\circ} = \text{int}(S) = S \setminus \partial S.
\]
- The exterior of \(S\) in \(X\) to be the complement of the closure of \(S\) in \(X\):
\[
\text{ext}(S) = \left(\overline{S}\right)^c.
\]
Note that the exterior of \(S\) in \(X\) is the interior of its complement in \(X\):
\[
\left(\overline{S}\right)^c = (S^c)^{\circ}.
\]
Open, Closed & Dense
The concepts of interior and boundary lead naturally to a classification of sets themselves. A set is
open if it consists entirely of interior points. Intuitively, every point has "breathing room"
and you can move slightly in any direction without leaving the set. A set is closed if it
contains all its boundary points—the set includes its own "edge."
Definition: Open and Closed Subsets
Suppose \(X\) is a metric space and \(S\) is a subset of \(X\). Then \(S\) is said to be
- an open subset of \(X\), or open in \(X\) iff \[S \cap \partial S = \emptyset\]
- a closed subset of \(X\), or closed in \(X\) iff \[\partial S \subseteq S\]
Note that most subsets of metric spaces are neither open nor closed - a set can contain some but not all
of its boundary points. However, the sets that are open form a structure worth naming.
Open and closed are two ways of asking which boundary points belong to a set. A different and
complementary question is: how thoroughly does a subset reach into the rest of the space?
Even a set that is far from being all of \(X\) can still "approach every point" of \(X\) — meaning
that every point of \(X\) is the limit of some sequence drawn from the subset, or equivalently, that
every open region of \(X\), no matter how small, meets the subset. Sets with this reaching property are
called dense, and they behave in many contexts as good "test sets": a continuous function is determined
by its values on a dense subset, an isometry is determined by its values on a dense subset, and a bounded
linear operator is determined by its values on a dense subset. The formal definition uses the closure
introduced above.
Definition: Dense Subset
Suppose \(X\) is a metric space and \(S \subseteq X\). Then \(S\) is said to be dense
in \(X\) if and only if its closure equals the whole space:
\[
\overline{S} \;=\; X.
\]
Equivalently, every non-empty open subset of \(X\) contains at least one point of \(S\); equivalently
again, every point of \(X\) is the limit of a sequence drawn from \(S\).
The standard example is \(\mathbb{Q}\) inside \(\mathbb{R}\): every real number is a limit of rationals,
so \(\overline{\mathbb{Q}} = \mathbb{R}\) and \(\mathbb{Q}\) is dense in \(\mathbb{R}\). The same holds
componentwise for \(\mathbb{Q}^n\) inside \(\mathbb{R}^n\). The irrationals \(\mathbb{R} \setminus \mathbb{Q}\)
are also dense in \(\mathbb{R}\) — density is symmetric in a strong sense: a set and its complement can
both be dense, which is exactly what happens here. In function spaces, dense subsets play an even more
decisive role: trigonometric polynomials are dense in continuous periodic functions; continuous compactly
supported functions are dense in \(L^p\); these density results are the engines behind the entire theory
of Fourier analysis and \(L^p\) approximation, where one proves an identity for a "nice" dense subclass
and then extends it to the whole space by continuity.
Definition: Topology
Suppose \((X, d)\) is a metric space. The collection of open subsets of \(X\) is called the
topology determined by the metric \(d\), denoted \(\tau_d\).
While we will not develop general topology here, this connection is valuable: many properties
of metric spaces (continuity, convergence, compactness) depend only on which sets are open, not on the
specific numerical values of the metric.
Why do these classifications matter for optimization? If our search space is strictly open
(e.g., \(0 < x < 1\)), an algorithm might iterate forever, approaching a boundary value without ever
reaching it - the limit point is excluded from the feasible set. A closed set contains
its boundary, ensuring that edge solutions are valid. More fundamentally, the Extreme Value
Theorem guarantees that a continuous function attains its maximum and minimum only when the
domain is compact - which, in \(\mathbb{R}^n\), means both closed and bounded. We will
formalize compactness in later chapters.
Now we arrive at one of the most important concepts for algorithmic convergence: completeness.
Definition: Complete Metric Space
A metric space \(X\) is said to be complete if and only if
\(X\) is closed in every metric superspace of \(X\).
Here "closed" is meant in the topological sense relative to the superspace \(Y\): for every metric superspace \((Y, e)\) with \(X \subseteq Y\), the boundary \(\partial X\) computed inside \(Y\) satisfies \(\partial X \subseteq X\).
Intuitively, completeness means the space has "no holes." The real numbers \(\mathbb{R}\) with the
standard metric are complete - this is why calculus works. The rational numbers
\(\mathbb{Q}\) are not complete: consider the sequence defined by \(x_1 = 1\) and
\(x_{n+1} = \frac{1}{2}\left(x_n + \frac{2}{x_n}\right)\). Each term is rational, but the sequence
converges to \(\sqrt{2}\), which is irrational. The limit exists in \(\mathbb{R}\) but not in
\(\mathbb{Q}\). There is a "hole" where \(\sqrt{2}\) should be.
For optimization, completeness ensures that limit points of iterative algorithms actually belong to
the space we are working in. The definition above (closedness in every superspace) is the primary one
we adopt here; it is convenient because it refers only to already-established topological notions.
In the chapter on completeness, we will develop the theory of Cauchy sequences and establish the
standard equivalent characterization: a metric space is complete if and only if every Cauchy
sequence converges to a limit within the space. This equivalence is the reason both
formulations are used interchangeably in the literature.
Balls
Open and closed sets can be complicated objects, but we can build them from simpler pieces. The
open ball centered at a point \(x\) with radius \(r\) is the set of all points
within distance \(r\) of \(x\). These balls are the "building blocks" of the topology: every open
set \(U\) can be expressed as a union of open balls. The reason is short — for each \(x \in U\),
the interior-point property guarantees \(\text{dist}(x, U^c) > 0\), so the ball
\(\mathcal{B}[x; r_x)\) with \(r_x = \tfrac{1}{2}\text{dist}(x, U^c)\) lies inside \(U\), giving
\(U = \bigcup_{x \in U} \mathcal{B}[x; r_x)\). This is what it means to say the open balls form a
basis for the metric topology, a perspective we develop in full generality in the chapter
on topological spaces.
More importantly for algorithms, balls give us a coordinate-free way to talk about
neighborhoods. When we say "there exists a step size \(\alpha\) such that gradient descent stays in the
feasible region," we are implicitly saying the current point lies in an open ball contained within
that region.
Definition: Open Balls & Closed Balls
Suppose \((X, d)\) is a metric space and \(x \in X\). For each \(r \in \mathbb{R}^+\), we define:
- The open ball in \(X\) centered at the point \(x\) and with radius \(r\) to be the set
\[
\mathcal{B}[x ; r) = \{y \in X \mid d(x, y) < r\}.
\]
- The closed ball in \(X\) centered at the point \(x\) and with radius \(r\) to be the set
\[
\mathcal{B}[x ; r] = \{y \in X \mid d(x, y) \leq r\}.
\]
Intuitively, we want to say that the open ball in \(X\) is the open subset of \(X\), and similarly, the closed ball in \(X\)
is the closed subset of \(X\). Indeed, these are true. We shall show that only the open ball case here.
Proof: The Open Ball is Open
Suppose \((X, d)\) is a metric space, \(x \in X\), and \(r \in \mathbb{R}^+\). Consider the open ball
\[
\mathcal{B}[x ; r) = \{y \in X \mid d(x, y) < r\} \subseteq X.
\]
Let \(z \in \mathcal{B}\). Since \(d(x, z) < r\), we may choose
\[
s := \tfrac{1}{2}\bigl(r - d(x, z)\bigr) \in (0, r),
\]
so that \(d(x, z) < r - s\).
Now, consider any point \(w\) outside the ball, i.e., \(w \in X \setminus \mathcal{B}\).
By definition, \(d(w, x) \geq r\). The triangle inequality gives
\(d(w, x) \leq d(w, z) + d(z, x)\), and rearranging (using symmetry) yields the
reverse triangle inequality:
\[
d(z, w) \;\geq\; d(w, x) - d(x, z) \;>\; r - (r - s) \;=\; s.
\]
Since \(w\) is an arbitrary point in the complement \(X \setminus \mathcal{B}\),
this inequality holds for all points outside the ball. This implies that the
"shortest distance" (infimum) from \(z\) to the outside of the ball is at least \(s\):
\[
\text{dist }(z, X \setminus \mathcal{B}) = \inf \{d(z, w) \mid w \notin \mathcal{B}\} \geq s > 0.
\]
Recall the definition of a boundary point requires the distance to the complement to be zero.
Since the distance here is strictly positive, \(z \notin \partial \mathcal{B}\).
Since \(z\) was chosen arbitrarily, we have \(\mathcal{B} \cap \partial \mathcal{B} = \emptyset\),
which is precisely the definition of \(\mathcal{B}\) being open. Equivalently, every point of
\(\mathcal{B}\) is an interior point, so
\[
\mathcal{B} = \mathcal{B}^{\circ}.
\]
Therefore, \(\mathcal{B}\) is open.
The open ball also provides the foundation for discussing convexity in a coordinate-free manner.
In a normed linear space, all open balls of the same radius have identical shape — each is simply a translation
of the others. Moreover, these balls are themselves convex sets (a short consequence of the triangle inequality and positive homogeneity of the norm). We can now state the definition of convexity
that applies to general normed spaces, not just \(\mathbb{R}^n\):
Definition: Convex
Suppose \(V\) is a
normed linear space and \(C \subseteq V\). Then \(C\) is said to be
convex if and only if
for each \(a, b \in C\), the line segment \(\{ (1 - t)a + tb \mid t \in [0, 1] \}\) joining \(a\) and \(b\) is included in
\(C\).