Introduction
Up until this chapter, we relied on calculus and linear algebra to solve optimization problems.
For many practical machine learning applications, these tools are sufficient. However, as we move toward
advanced topics like manifold learning or information geometry, intuition alone is no longer enough.
Consider a fundamental question: when we run gradient descent, we assume each step brings us closer to a solution,
and that iterating long enough will reach (or approach) a minimum. But what guarantees this? The sequence of
iterates \(\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots\) may get arbitrarily close together without ever
settling on an actual point unless the underlying space has the right structure.
Similarly, when we work with function spaces in kernel methods or neural networks, we treat entire functions
as single "points." But what does "distance" mean between two functions? What does it mean for a sequence of
functions to converge? The familiar Euclidean intuition breaks down, and we need a more general framework.
This chapter introduces that framework: metric spaces and the notion of completeness.
These concepts formalize exactly when iterative algorithms converge to valid solutions. Rather than learning new
computational techniques, we are gaining the language to state precisely what we have been assuming all along and
to understand when those assumptions hold.
Metric Space
Recall that in Section I we defined the metric space and its metric
as follows.
Definition: Metric Space \((M, d)\)
A metric space is an ordered pair \((M, d)\) consisting of a nonempty set
\(M\) and a metric \(d\) on \(M\). The metric \(d\) is a function
\(d: M \times M \to \mathbb{R}\) that defines the distance between any two elements of \(M\).
For all \(a, b, c \in M\), the following axioms must hold:
- Positivity: \(d(a,b) \geq 0\) with equality if and only if \(a = b\).
- Symmetry: \(d(a,b) = d(b,a)\).
- Triangle inequality: \(d(a,b) \leq d(a,c) + d(c, b)\).
As in set theory, for two metric spaces \((X, d)\) and \((Y, e)\), we say that \(X\) is a
metric subspace of \(Y\), and \(Y\) a metric superspace of \(X\), if and
only if \(X\) is a subset of \(Y\) and \(d\) is the restriction of \(e\) to \(X \times X\).
For example, \(\mathbb{Q}\) with the usual distance is a metric subspace of \(\mathbb{R}\).
Insight: Why the Axioms Matter in Practice
You have already used non-Euclidean distances throughout this curriculum. In k-nearest neighbors,
the choice between Euclidean distance, Manhattan distance, or cosine similarity fundamentally changes
which points are considered "close." In kernel methods, the kernel function implicitly defines a
distance in a high-dimensional feature space. These are all metrics on different spaces.
The axioms are not arbitrary mathematical restrictions. They encode the minimal properties required for
algorithms to behave sensibly. For instance, efficient search structures like KD-trees and ball
trees prune entire regions of the search space using distance bounds derived from the triangle
inequality. If the inequality fails, those bounds are invalid, and the search may discard the
region containing the true nearest neighbor and return a farther point instead.
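To make the axioms concrete, here is a minimal sketch (in Python, assuming NumPy is available) that spot-checks them for a candidate distance function on a finite sample of points; the function name is illustrative, not from any library. Sampling can only falsify the axioms, never prove them for the whole space.

```python
import itertools
import numpy as np

def check_metric_axioms(d, points, tol=1e-12):
    """Spot-check the metric axioms for a candidate distance d on a
    finite sample. Sampling can falsify the axioms, never prove them."""
    for a, b in itertools.product(points, repeat=2):
        assert d(a, b) >= -tol, "positivity failed"
        assert abs(d(a, b) - d(b, a)) <= tol, "symmetry failed"
    for a in points:
        assert d(a, a) <= tol, "d(a, a) must be 0"
    for a, b, c in itertools.product(points, repeat=3):
        assert d(a, b) <= d(a, c) + d(c, b) + tol, "triangle inequality failed"

rng = np.random.default_rng(0)
pts = list(rng.normal(size=(15, 3)))

check_metric_axioms(lambda a, b: float(np.linalg.norm(a - b)), pts)  # Euclidean: passes
check_metric_axioms(lambda a, b: float(np.sum(np.abs(a - b))), pts)  # Manhattan: passes
# Squared Euclidean distance is NOT a metric (the triangle inequality fails):
# check_metric_axioms(lambda a, b: float(np.linalg.norm(a - b) ** 2), pts)
```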
Distance
With a metric defined, we can extend the notion of distance from pairs of points to more general
situations. In optimization, we rarely care about the distance between two specific points; instead,
we ask questions like: "How far is this point from the feasible region?" or "How close is our current
iterate to the set of optimal solutions?" These questions require measuring distance from a point to
a set.
Definition: Distances
Suppose \((X, d)\) is a metric space and \(A\) and \(B\) are subsets of \(X\).
We define the distance from \(x \in X\) to \(A\) to be
\[
\text{dist }(x, A) = \inf \{d(x, a) \mid a \in A\} \quad (\text{Points to Sets})
\]
and also the distance from \(A\) to \(B\) to be
\[
\text{dist }(A, B) = \inf \{d(a, b) \mid a \in A, b \in B\}. \quad (\text{Sets to Sets})
\]
Both distances depend on the choice of the metric \(d\).
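For a finite set, the infimum in both definitions is attained as a minimum, which makes the definitions directly computable. A minimal sketch (Python with NumPy assumed; the helper names are illustrative):

```python
import numpy as np

def dist_point_to_set(x, A, d):
    """dist(x, A) = inf { d(x, a) : a in A }; a minimum when A is finite.
    For infinite sets the infimum need not be attained by any member."""
    return min(d(x, a) for a in A)

def dist_set_to_set(A, B, d):
    """dist(A, B) = inf { d(a, b) : a in A, b in B }."""
    return min(d(a, b) for a in A for b in B)

euclid = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
A = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
B = [(3.0, 0.0), (4.0, 4.0)]

print(dist_point_to_set((2.0, 0.0), A, euclid))  # 1.0, realized by (1, 0)
print(dist_set_to_set(A, B, euclid))             # 2.0, realized by (1, 0) and (3, 0)
```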
Definition: Isolated Points
Suppose \(X\) is a metric space, \(S\) is a subset of \(X\), and \(z \in S\). Then \(z\) is said to be
an isolated point of \(S\) if and only if
\[
\text{dist }(z, S \setminus \{z\}) \neq 0.
\]
The collection of isolated points of \(S\) is denoted by \(\text{iso } (S)\).
The complement of isolated points leads to a more important concept for our purposes: accumulation points.
While isolated points stand alone with "room around them," accumulation points are surrounded by infinitely
many other points from the set no matter how small a neighborhood we consider.
Definition: Accumulation Points
Suppose \(X\) is a metric space, \(S\) is a subset of \(X\), and \(z \in X\). Then \(z\) is said to be
an accumulation point or a limit point of \(S\) in \(X\) if and only if
\[
\text{dist }(z, S \setminus \{z\}) = 0.
\]
The collection of accumulation points of \(S\) in \(X\) is denoted by \(\text{acc } (S)\).
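For example, take \(S = \{1/n \mid n \in \mathbb{Z}^+\}\) in \(\mathbb{R}\) with the usual metric. Every point of \(S\) is isolated, so \(\text{iso }(S) = S\), yet \(0\) is an accumulation point of \(S\) in \(\mathbb{R}\): every neighborhood of \(0\) contains infinitely many of the points \(1/n\). Hence \(\text{acc }(S) = \{0\}\), even though \(0 \notin S\).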
Definition: Nearest Points
Suppose \((X, d)\) is a metric space, \(S\) is a subset of \(X\) and \(z \in X\). A member \(s\) of \(S\) is
called a nearest point of \(S\) to \(z\) in \(X\) if and only if
\[
d(z, s) = \text{dist }(z, S).
\]
The concept of distance to a set is fundamental to margin-based classifiers
(like SVMs),
where the goal is to maximize the distance between a decision boundary and the nearest data points.
Furthermore, the nearest point definition is the mathematical foundation of projection.
In constrained optimization, projecting a gradient update back onto the feasible set \(S\) is exactly the
act of finding a nearest point \(s \in S\) to the updated vector \(z\).
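To illustrate, here is a minimal sketch of projected gradient descent (Python with NumPy assumed; the function names are illustrative), using a feasible set for which the nearest point has a simple closed form: a closed Euclidean ball.

```python
import numpy as np

def project_onto_ball(z, center, radius):
    """Return the nearest point of the closed ball {s : ||s - center|| <= radius}
    to z: if z lies outside, scale it back toward the center."""
    v = z - center
    n = float(np.linalg.norm(v))
    return z if n <= radius else center + (radius / n) * v

# Minimize f(x) = ||x - target||^2 subject to x in the unit ball at the origin.
target = np.array([2.0, 2.0])
x = np.zeros(2)
for _ in range(100):
    grad = 2.0 * (x - target)                                # gradient step ...
    x = project_onto_ball(x - 0.1 * grad, np.zeros(2), 1.0)  # ... then project

print(x)  # approx [0.7071, 0.7071]: the nearest point of the ball to the target
```

Each iteration first takes an unconstrained gradient step and then replaces the result with the nearest feasible point, which is exactly the nearest-point operation defined above.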
Boundary
Having established how to measure distance to sets, we can now classify points by their relationship to
a set and its complement. This classification is fundamental to constrained optimization.
When we solve problems like "minimize \(f(x)\) subject to \(g(x) \leq 0\),"
the feasible region is a set \(S\), and the nature of optimal solutions depends heavily on whether they
lie in the interior or on the boundary of \(S\).
Definition: Boundary Points
Suppose \(X\) is a metric space, \(S\) is a subset of \(X\) and \(a \in X\). Then \(a\) is called
a boundary point of \(S\) in \(X\) if and only if
\[
\text{dist }(a, S) = \text{dist }(a, S^c) = 0.
\]
The collection of boundary points of \(S\) in \(X\) is called the boundary of \(S\) in \(X\)
and is denoted by \(\partial S\).
Definition: Closure, Interior, and Exterior
Suppose \(X\) is a metric space, and \(S\) is a subset of \(X\).
- The closure of \(S\) in \(X\) is the union
\[
\overline{S} = \text{cl}(S) = S \cup \partial S.
\]
- The interior of \(S\) in \(X\) is the difference
\[
S^{\circ} = \text{int}(S) = S \setminus \partial S.
\]
- The exterior of \(S\) in \(X\) is the complement of the closure of \(S\) in \(X\):
\[
\text{ext}(S) = \left(\overline{S}\right)^c.
\]
Note that the exterior of \(S\) in \(X\) is the interior of its complement in \(X\):
\[
\left(\overline{S}\right)^c = (S^c)^{\circ}.
\]
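For example, take \(S = [0, 1) \subseteq \mathbb{R}\) with the usual metric. Then \(\partial S = \{0, 1\}\), so \(\overline{S} = [0, 1]\), \(S^{\circ} = (0, 1)\), and \(\text{ext}(S) = (-\infty, 0) \cup (1, \infty)\). Note that the boundary point \(1\) belongs to the closure even though it does not belong to \(S\) itself.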
Open & Closed
The concepts of interior and boundary lead naturally to a classification of sets themselves. A set is
open if it consists entirely of interior points. Intuitively, every point has "breathing room"
and you can move slightly in any direction without leaving the set. A set is closed if it
contains all its boundary points—the set includes its own "edge."
Definition: Open and Closed Subsets
Suppose \(X\) is a metric space and \(S\) is a subset of \(X\). Then \(S\) is said to be
- an open subset of \(X\), or open in \(X\) iff \[S \cap \partial S = \emptyset\]
- a closed subset of \(X\), or closed in \(X\) iff \[\partial S \subseteq S\]
Note that most subsets of metric spaces are neither open nor closed - a set can contain some but not all
of its boundary points. For instance, \([0, 1) \subseteq \mathbb{R}\) contains the boundary point \(0\)
but not \(1\). However, the sets that are open form a structure worth naming.
Definition: Topology
Suppose \((X, d)\) is a metric space. The collection of open subsets of \(X\) is called the
topology determined by the metric \(d\).
While we will not develop general topology here, this connection is valuable: many properties
of metric spaces (continuity, convergence, compactness) depend only on which sets are open, not on the
specific numerical values of the metric.
Why do these classifications matter for optimization? If our search space is strictly open
(e.g., \(0 < x < 1\)), an algorithm might iterate forever, approaching a boundary value without ever
reaching it - the limit point is excluded from the feasible set. A closed set contains
its boundary, ensuring that edge solutions are valid. More fundamentally, the Extreme Value
Theorem guarantees that a continuous function attains its maximum and minimum whenever the
domain is compact - which, in \(\mathbb{R}^n\), means both closed and bounded. For example,
\(f(x) = x\) attains neither a maximum nor a minimum on the non-compact domain \((0, 1)\). We will
formalize compactness in later chapters.
Now we arrive at one of the most important concepts for algorithmic convergence: completeness.
Definition: Complete Metric Space
A metric space \(X\) is said to be complete if and only if
\(X\) is closed in every metric superspace of \(X\).
Intuitively, completeness means the space has "no holes." The real numbers \(\mathbb{R}\) with the
standard metric are complete - this is why calculus works. The rational numbers
\(\mathbb{Q}\) are not complete: consider the sequence defined by \(x_1 = 1\) and
\(x_{n+1} = \frac{1}{2}\left(x_n + \frac{2}{x_n}\right)\). Each term is rational, but the sequence
converges to \(\sqrt{2}\), which is irrational. The limit exists in \(\mathbb{R}\) but not in
\(\mathbb{Q}\). There is a "hole" where \(\sqrt{2}\) should be.
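Exact rational arithmetic makes this concrete. Below is a minimal sketch using Python's standard fractions module: every iterate is stored as an exact fraction, so the entire computation stays inside \(\mathbb{Q}\), yet the squares of the iterates approach \(2\).

```python
from fractions import Fraction

# Babylonian iteration x_{n+1} = (x_n + 2/x_n) / 2 with x_1 = 1.
# Every iterate is an exact rational number, yet x_n^2 -> 2, so the
# sequence heads toward sqrt(2) -- a "hole" in Q.
x = Fraction(1)
for _ in range(5):
    x = (x + 2 / x) / 2
    print(x, float(x * x - 2))
# 3/2             0.25
# 17/12           0.006944...
# 577/408         6.007e-06
# 665857/470832   4.51e-12
# ...
```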
For optimization, completeness ensures that limit points of iterative algorithms actually belong to
the space we're working in. In future chapters, we will develop the theory of convergence and see an
equivalent characterization: a metric space is complete if and only if every Cauchy sequence
converges to a limit within the space.
Balls
Open and closed sets can be complicated objects, but we can build them from simpler pieces. The
open ball centered at a point \(x\) with radius \(r\) is the set of all points
within distance \(r\) of \(x\). These balls are the "building blocks" of the topology: every open
set can be expressed as a union of open balls.
More importantly for algorithms, balls give us a coordinate-free way to talk about
neighborhoods. When we say "there exists a step size \(\alpha\) such that gradient descent stays in the
feasible region," we are implicitly saying the current point lies in an open ball contained within
that region.
Definition: Open Balls & Closed Balls
Suppose \((X, d)\) is a metric space and \(x \in X\). For each \(r \in \mathbb{R}^+\), we define:
- The open ball in \(X\) centered at the point \(x\) and with radius \(r\) to be the set
\[
\mathcal{B}[x ; r) = \{y \in X \mid d(x, y) < r\}.
\]
- The closed ball in \(X\) centered at the point \(x\) and with radius \(r\) to be the set
\[
\mathcal{B}[x ; r] = \{y \in X \mid d(x, y) \leq r\}.
\]
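For instance, in \(\mathbb{R}\) with the usual metric, \(\mathcal{B}[x ; r) = (x - r, x + r)\) and \(\mathcal{B}[x ; r] = [x - r, x + r]\). In \(\mathbb{R}^2\), the shape of a ball depends on the metric: Euclidean balls are disks, while Manhattan balls are diamonds.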
Intuitively, we want to say that the open ball in \(X\) is an open subset of \(X\) and, similarly, that the closed ball in \(X\)
is a closed subset of \(X\). Indeed, both statements are true. We shall prove only the open ball case here.
Proof: The Open Ball is Open
Suppose \((X, d)\) is a metric space, \(x \in X\), and \(r \in \mathbb{R}^+\). Consider the open ball
\[
\mathcal{B}[x ; r) = \{y \in X \mid d(x, y) < r\} \subseteq X.
\]
If \(z \in \mathcal{B}\), then \(d(x, z) < r\), and so
\[
\exists s \in (0, r) \text{ such that } d(x, z) < r - s;
\]
for instance, \(s = \frac{1}{2}\left(r - d(x, z)\right)\) works.
Now, consider any point \(w\) outside the ball, i.e., \(w \in X \setminus \mathcal{B}\).
By definition, \(d(w, x) \geq r\). Using the triangle inequality, we have:
\[
d(z, w) \geq d(w, x) - d(x, z) > r - (r - s) = s.
\]
Since \(w\) is an arbitrary point in the complement \(X \setminus \mathcal{B}\),
this inequality holds for all points outside the ball. This implies that the
"shortest distance" (infimum) from \(z\) to the outside the bal is at least \(s\):
\[
\text{dist }(z, X \setminus \mathcal{B}) = \inf \{d(z, w) \mid w \notin \mathcal{B}\} \geq s.
\]
Recall the definition of a boundary point requires the distance to the complement to be zero.
Since the distance here is strictly positive (\(\geq s\)), \(z\) is not a boundary point.
Therefore, being in \(\mathcal{B}\) but not on its boundary, \(z\) must be in the interior:
\[
z \in \mathcal{B}^{\circ}.
\]
Since \(z\) was chosen arbitrarily, every point in \(\mathcal{B}\) is an interior point:
\[
\mathcal{B} = \mathcal{B}^{\circ}.
\]
Therefore, \(\mathcal{B}\) is open.
The open ball also provides the foundation for discussing convexity in a coordinate-free manner.
In a normed linear space, all open balls of the same radius have identical shape - each is simply a translation
of the others. Moreover, these balls are themselves convex sets. We can now state the definition of convexity
that applies to general normed spaces, not just \(\mathbb{R}^n\):
Definition: Convex
Suppose \(V\) is a normed linear space and \(C \subseteq V\). Then \(C\) is said to be convex if and only if,
for each \(a, b \in C\), the line segment \(\{ (1 - t)a + tb \mid t \in [0, 1] \}\) joining \(a\) and \(b\) is included in
\(C\).
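As a closing illustration, here is a minimal sketch (Python with NumPy assumed; the names are illustrative) that spot-checks the line-segment condition for a set given as a membership oracle. As with the metric axioms earlier, sampling can falsify convexity but never prove it.

```python
import numpy as np

def spot_check_convexity(in_C, samples, n_pairs=200, n_t=50, seed=0):
    """Spot-check convexity: for sampled a, b in C, every point
    (1 - t) a + t b with t in [0, 1] must also belong to C.
    Sampling can falsify convexity, never prove it."""
    rng = np.random.default_rng(seed)
    members = [s for s in samples if in_C(s)]
    for _ in range(n_pairs):
        i, j = rng.integers(len(members), size=2)
        for t in np.linspace(0.0, 1.0, n_t):
            if not in_C((1 - t) * members[i] + t * members[j]):
                return False  # found a segment that leaves C
    return True

pts = list(np.random.default_rng(1).uniform(-2, 2, size=(500, 2)))
unit_disk = lambda p: float(np.linalg.norm(p)) <= 1.0
annulus = lambda p: 0.5 <= float(np.linalg.norm(p)) <= 1.0

print(spot_check_convexity(unit_disk, pts))  # True: the disk is convex
print(spot_check_convexity(annulus, pts))    # False: segments cross the hole
```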