What We Have Seen So Far...
We have encountered various "spaces" throughout our study. In linear algebra, we started with
vector spaces. To discuss "length," we introduced
normed vector spaces, and to formalize "orthogonality," we
considered inner product spaces. Strictly speaking, our familiar space \(\mathbb{R}^n\) is categorized
as a Hilbert space: an inner product space that is also complete.
Completeness is critical in calculus and analysis because, without it, we cannot guarantee
the convergence of limits "within" the space. Moreover, we frequently encounter the \(L^2\) space in
machine learning contexts. What exactly is it? The \(L^2\) space is also a Hilbert space, but unlike \(\mathbb{R}^n\), whose
elements are finite-dimensional vectors, it is a function space whose elements are functions.
Furthermore, we have explored \(L^1\) and \(L^\infty\). These are categorized as Banach spaces - complete
normed vector spaces that do not necessarily arise from an inner product. Thus, a Banach space is a more general concept than a Hilbert space,
yet it is essential in both analysis and machine learning.
For instance, in deep learning, a neural network can be viewed
as a point (a function) within a vast function space, and we build highly complex models by composing simpler functions. While it is commonly
stated in engineering that training works as long as the functions are differentiable (so that local gradients exist), mathematically,
differentiability only provides a direction to move in. For a sequence of optimized models to actually converge to a valid limit without "falling
through a hole," the underlying space must first be complete - an absolute prerequisite before we can even rely on properties
like convexity or compactness
to guarantee the existence of an optimum.
Fundamentally, these function spaces are defined by their properties regarding integration. Specific rules of integrability
dictate which functions qualify to be part of spaces like \(L^1, L^2,\) and \(L^\infty\).
This is precisely why we studied Lebesgue integration: it is the mathematical framework that
ensures these function spaces possess completeness, using measures as their foundation. It is crucial
not to confuse a vector space with a measure space. A measure space is not a vector space; rather, it provides the underlying
domain upon which functions are defined. In other words, a measure space itself does not possess algebraic operations like vector addition or
scalar multiplication.
Viewed through this lens, core concepts in statistics gain clear mathematical structure. We can now rigorously understand that
random variables are actually measurable functions that map outcomes from a sample space to real numbers, and a
probability space is simply a specific type of measure space (with a total measure of 1)
used to describe data distributions. Consequently, the expected value \(\mathbb{E}[X]\) is precisely a Lebesgue integral over this probability
space, and the variance \(\mathbb{V}[X]\) is the squared \(L^2\) norm of the random variable centered around its mean.
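This measure-theoretic view of expectation and variance can be made concrete on a finite probability space, where the Lebesgue integral reduces to a weighted sum. The outcomes and weights below are an illustrative toy example of our own:

```python
import numpy as np

# A toy finite probability space: three outcomes with total measure 1.
probs = np.array([0.2, 0.5, 0.3])   # measure of each outcome, sums to 1
X = np.array([1.0, 2.0, 4.0])       # a random variable: outcome -> real number

# E[X] is the Lebesgue integral of X against the probability measure;
# on a finite space this is just a probability-weighted sum.
mean = np.sum(X * probs)

# V[X] is the squared L^2 norm of the centered variable X - E[X].
var = np.sum((X - mean) ** 2 * probs)

print(mean, var)  # 2.4, 1.24
```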
Going back to our algebraic foundations, we know that \(\mathbb{R}\) is a field
in abstract algebra. Simultaneously, from the perspective of functional analysis, \(\mathbb{R}\) itself acts as a one-dimensional
Hilbert space. Thus, \(\mathbb{R}\) serves both as the foundational "rule" for arithmetic operations and as the underlying "stage"
(codomain) for our functions. A field equipped with a topology that makes its algebraic operations continuous is known as a topological field.
In particular, \(\mathbb{R}\) and \(\mathbb{C}\) are indispensable because they are complete topological fields - a highly special
property in mathematics, even though we have implicitly relied on them since elementary school (for comparison, the field of rational numbers \(\mathbb{Q}\)
is a topological field, but it is not complete).
As a final note on notation: in applied mathematics and machine learning contexts, you will frequently see \(L_1, L_2,\) and \(L_\infty\)
written with subscripts instead of superscripts (\(L^1, L^2, L^\infty\)). This variation stems from differing conventional focuses across
disciplines, but both notations refer to exactly the same mathematical concepts and spaces.
Normed Spaces & Banach Spaces
To perform calculus or optimization on a vector space, we need a notion of "distance" to define limits and convergence.
A norm provides this by measuring the "length" of a vector.
Crucially, any norm naturally induces a metric (distance function)
defined by \(d(x, y) = \|x - y\|\). Once we have a metric, we can define
Cauchy sequences and ask whether the space is
complete.
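The norm-induced metric can be sketched directly; the points below are arbitrary, and the metric axioms are checked only numerically for them:

```python
import numpy as np

# Any norm induces a metric d(x, y) = ||x - y||.
def dist(x, y, ord=2):
    return np.linalg.norm(x - y, ord=ord)

x, y, z = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])

# Metric axioms, verified for these particular points:
assert dist(x, x) == 0.0                       # identity
assert dist(x, y) == dist(y, x)                # symmetry
assert dist(x, z) <= dist(x, y) + dist(y, z)   # triangle inequality
```

With this metric in hand, a Cauchy sequence is one whose terms eventually stay within any given `dist` of each other.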
Definition: Normed Space (Normed Vector Space)
A normed space is a pair \((\mathcal{X}, \| \cdot\|)\), where \(\mathcal{X}\) is a vector space
and \(\| \cdot\|\) is a norm on \(\mathcal{X}\).
Definition: Banach Space
A Banach space is a normed space that is complete with respect to the
metric defined by the norm.
Note: Every Cauchy sequence in the space converges to a limit that is also within the space.
The most important examples of Banach spaces in analysis and machine learning are the \(L^p\) spaces.
Thanks to the rigorous foundation of Lebesgue integration, these spaces of functions are guaranteed to be complete.
Definition: \(L^p\) Spaces
For \(1 \leq p < \infty\), the space \(L^p\) consists of all measurable functions \(f\) for which the Lebesgue integral of
\(|f|^p\) is finite. The norm is defined as:
\[
\|f\|_p = \left( \int |f(x)|^p d\mu \right)^{1/p}.
\]
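For a concrete feel, the \(L^p\) norm of a well-behaved function (where the Lebesgue and Riemann integrals agree) can be approximated numerically. The helper below is our own sketch, using a midpoint Riemann sum:

```python
import numpy as np

def lp_norm(f, p, a=0.0, b=1.0, n=100_000):
    # midpoint Riemann sum approximating the integral of |f|^p over [a, b]
    x = a + (np.arange(n) + 0.5) * (b - a) / n
    return (np.mean(np.abs(f(x)) ** p) * (b - a)) ** (1.0 / p)

# For f(x) = x on [0, 1]: ||f||_p = (1/(p+1))^{1/p}, so ||f||_2 = 1/sqrt(3) ≈ 0.577.
print(lp_norm(lambda x: x, p=2))
```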
Definition: \(L^\infty\) Space
The space \(L^\infty\) consists of all measurable functions \(f\) that are essentially bounded.
The norm is defined as the essential supremum:
\[
\|f\|_\infty = \text{ess sup}_{x} |f(x)| = \inf \{ C \geq 0 : |f(x)| \leq C \text{ a.e.} \}.
\]
Note: "Almost everywhere" (a.e.) means the condition holds except on a set of
measure zero. The essential supremum ignores
the function's behavior on negligible sets, making it robust to isolated singularities.
Rigorously, to ensure \(\|f\|_\infty = 0 \implies f = 0\) (a requirement for a valid norm),
the elements of \(L^\infty\) (and all \(L^p\) spaces) are actually equivalence classes
of functions that are equal almost everywhere.
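A toy illustration of the essential supremum (the function and sample points here are our own contrived example, with the measure-zero spike ignored by hand rather than by a general algorithm):

```python
import numpy as np

# f(x) = x on [0, 1], redefined to equal 100 at the single point x = 0.5.
def f(x):
    return np.where(x == 0.5, 100.0, x)

# The pointwise supremum is 100, but {0.5} has Lebesgue measure zero, so the
# essential supremum is 1: f equals g(x) = x almost everywhere, and in L^infty
# the two functions belong to the same equivalence class.
grid = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # sample points chosen to hit 0.5
naive_sup = np.max(f(grid))                            # 100.0 — sees the spike
ess_sup = np.max(np.where(grid == 0.5, 0.0, f(grid)))  # 1.0 — spike masked out
print(naive_sup, ess_sup)
```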
Inner Product Spaces & Hilbert Spaces
While Banach spaces allow us to measure lengths and distances, they lack the geometric concepts of "angles" and "orthogonality."
To recover the rich geometry of Euclidean space in abstract settings, we need an
inner product.
Every inner product naturally induces a norm via \(\|x\| = \sqrt{\langle x, x \rangle}\), meaning every inner product space is
automatically a normed space.
Definition: Inner Product Space
An inner product space is a vector space over \(\mathbb{F} = \mathbb{R}\) or \(\mathbb{C}\) equipped with an
inner product \(\langle \cdot, \cdot \rangle\).
Just as a normed space requires completeness to become a Banach space, an inner product space requires completeness
to become the ultimate setting for functional analysis: a Hilbert space.
Definition: Hilbert Space
A Hilbert space \(\mathcal{H}\) is an inner product space that is complete
with respect to the metric \(d(x, y) = \|x - y\|\) induced by its inner product.
Hilbert spaces are the "nicest" of all infinite-dimensional spaces because they behave almost exactly like \(\mathbb{R}^n\).
The foundational inequality that connects the inner product to the norm is the Cauchy-Schwarz inequality, which is vital for
proving bounds in optimization.
Theorem: Cauchy-Schwarz Inequality
For all \(x, y\) in an inner product space, \( |\langle x, y \rangle| \leq \|x\| \|y\| \).
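A quick numerical check of the inequality in \(\mathbb{R}^5\) with the standard dot product (the random vectors are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)

lhs = abs(np.dot(x, y))                        # |<x, y>|
rhs = np.linalg.norm(x) * np.linalg.norm(y)    # ||x|| ||y||
assert lhs <= rhs  # Cauchy-Schwarz

# Equality holds exactly when x and y are linearly dependent:
assert np.isclose(abs(np.dot(x, 2 * x)),
                  np.linalg.norm(x) * np.linalg.norm(2 * x))
```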
But how do we know if a given Banach space is actually a Hilbert space? It turns out that a norm is induced by an inner product
if and only if it satisfies the Parallelogram Law. This profound connection between algebra and geometry was
proven by Pascual Jordan and John von Neumann. Among the \(L^p\) spaces, only \(L^2\) satisfies this law, making \(L^2\) the
only \(L^p\) space that is a Hilbert space - and a profoundly important one in machine learning.
Theorem: Jordan-von Neumann Characterization
A Banach space is a Hilbert space (i.e., its norm is induced by an inner product)
if and only if its norm satisfies the Parallelogram Law:
\[
\|x + y\|^2 + \|x - y\|^2 = 2\|x\|^2 + 2\|y\|^2.
\]
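The law is easy to test numerically: it holds for the Euclidean (\(L^2\)) norm but fails for the \(L^1\) norm, confirming that the \(L^1\) norm cannot come from any inner product. The test vectors below are arbitrary:

```python
import numpy as np

def parallelogram_gap(x, y, ord):
    # LHS minus RHS of the Parallelogram Law under the given norm
    n = lambda v: np.linalg.norm(v, ord=ord)
    return n(x + y) ** 2 + n(x - y) ** 2 - (2 * n(x) ** 2 + 2 * n(y) ** 2)

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])

print(parallelogram_gap(x, y, ord=2))  # ≈ 0 (up to float error): law holds
print(parallelogram_gap(x, y, ord=1))  # 4.0: the L^1 norm violates the law
```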
Finite vs Infinite Dimensions
Because \(\mathbb{R}^n\) is a Hilbert space, it is easy to assume that all complete spaces behave like \(\mathbb{R}^n\).
However, the transition from finite-dimensional vectors to infinite-dimensional function spaces introduces profound topological
differences.
Theorem: Equivalence of Norms in Finite Dimensions
In a finite-dimensional vector space, all norms are topologically equivalent. This means if a
sequence converges under the \(L^1\) norm, it is guaranteed to converge under the \(L^2\) or \(L^\infty\) norm.
In infinite-dimensional spaces, this is false. A sequence of functions might converge in \(L^1\) but
diverge in \(L^\infty\).
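A standard example: the indicator functions \(f_n = \mathbf{1}_{[0, 1/n]}\) on \([0,1]\) satisfy \(\|f_n\|_1 = 1/n \to 0\) while \(\|f_n\|_\infty = 1\) for every \(n\). A numerical sketch (grid-based approximation of the two norms):

```python
import numpy as np

# f_n = indicator of [0, 1/n] on [0, 1]: goes to 0 in L^1 but not in L^inf.
def norms(n, grid_size=1_000_000):
    x = (np.arange(grid_size) + 0.5) / grid_size   # midpoints of [0, 1]
    f = (x <= 1.0 / n).astype(float)
    l1 = np.mean(f)      # approximates the integral of |f|, which is 1/n
    linf = np.max(f)     # sup |f| = 1 for every n
    return l1, linf

for n in (10, 100, 1000):
    print(norms(n))      # L^1 norm shrinks toward 0; L^inf norm stays at 1
```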
The most drastic difference, however, lies in compactness.
As we established in the Heine-Borel Theorem,
in \(\mathbb{R}^n\), every closed and bounded set is compact. This guarantees that any continuous optimization problem
on a bounded region has a minimum. But as we hinted earlier, this comfort vanishes in infinite dimensions.
Theorem (Riesz): Compactness of the Unit Ball
The closed unit ball \(\{x : \|x\| \leq 1\}\) in a normed space is compact if and only if the space is
finite-dimensional.
Corollary: In any infinite-dimensional normed space, the closed unit ball (which is both closed and bounded)
is never compact.
Why does this matter? Because a closed unit ball is the prototypical "closed and bounded" set. The theorem tells us that in an
infinite-dimensional function space, simply bounding your parameters does not guarantee that your optimization algorithm will
converge to a solution inside that boundary. A sequence can wander endlessly inside a bounded ball without ever converging to a point.
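The canonical witness is the orthonormal basis \(e_1, e_2, \dots\) of an infinite-dimensional Hilbert space: every \(e_n\) lies in the closed unit ball, yet \(\|e_m - e_n\| = \sqrt{2}\) for \(m \neq n\), so no subsequence is Cauchy. A finite-dimensional slice of that picture (dimension 1000 as a stand-in):

```python
import numpy as np

dim = 1000
E = np.eye(dim)   # rows are e_1, ..., e_dim, each of norm 1, all in the unit ball

# Pairwise distances between distinct basis vectors are all sqrt(2),
# so the sequence (e_n) has no Cauchy subsequence.
dists = [np.linalg.norm(E[i] - E[j]) for i in range(5) for j in range(5) if i != j]
print(set(np.round(dists, 12)))  # a single value: sqrt(2) ≈ 1.414213562373
```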
Geometry of Unit Balls & ML Applications
The choice of norm in a Banach space dictates the geometry of its open and closed balls.
In machine learning, we frequently use norms as regularization terms to constrain our model parameters
(\(\|w\|_p \leq C\)). The geometric shape of these unit balls perfectly explains why different regularizers produce
different types of AI models.
Geometry of \(L^p\) Unit Balls in \(\mathbb{R}^2\)
- \(L^1\) norm (Lasso):
The unit ball is a diamond (a square rotated by 45 degrees). It has sharp corners precisely on the axes.
- \(L^2\) norm (Ridge):
The unit ball is a perfect circle. It is strictly convex and rotationally invariant.
- \(L^\infty\) norm:
The unit ball is a square aligned with the axes.
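These three shapes nest: the \(L^1\) diamond sits inside the \(L^2\) circle, which sits inside the \(L^\infty\) square. A single test point separates them:

```python
import numpy as np

p = np.array([0.7, 0.7])
print(np.linalg.norm(p, 1))        # 1.4   — outside the L^1 diamond
print(np.linalg.norm(p, 2))        # ≈ 0.99 — just inside the L^2 circle
print(np.linalg.norm(p, np.inf))   # 0.7   — well inside the L^inf square
```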
When we optimize a loss function subject to an \(L^1\) penalty (Lasso regression), the expanding contour of the loss function
is highly likely to hit the "sharp corners" of the \(L^1\) diamond first. Because these corners lie exactly on the axes,
many parameter weights become exactly zero. This geometric property of the \(L^1\) Banach space is the mathematical mechanism
behind sparsity and feature selection in ML.
Conversely, the \(L^2\) Hilbert space penalty (Ridge regression) has a perfectly round, strictly convex unit ball.
The loss contour will touch it at a tangent, smoothly shrinking all parameters but rarely setting them to exactly zero.
The strict convexity of \(L^2\) guarantees that the optimization problem has a unique global solution,
making it mathematically highly stable.
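The contrast shows up concretely in the proximal (shrinkage) operators of the two penalties. This is a minimal sketch with toy weights of our own choosing, not a full Lasso or Ridge solver:

```python
import numpy as np

def prox_l1(w, lam):
    # soft-thresholding: the L^1 prox can set coordinates exactly to zero
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def prox_l2(w, lam):
    # ridge-style shrinkage: the squared-L^2 prox scales weights, never zeroing them
    return w / (1.0 + lam)

w = np.array([3.0, 0.4, -0.2, 1.5])
print(prox_l1(w, lam=0.5))  # [2.5, 0., -0., 1.] — small weights zeroed (sparsity)
print(prox_l2(w, lam=0.5))  # all weights scaled by 1/1.5, none exactly zero
```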
Finally, when we leverage the inner product structure of Hilbert spaces, we can project data into infinite-dimensional spaces
to find linear boundaries for non-linear data. This is the foundation of the
Kernel Trick and
Reproducing Kernel Hilbert Spaces (RKHS), which power
Support Vector Machines (SVMs) and
Gaussian processes.
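A minimal sketch of the idea: the RBF kernel \(k(x, y) = \exp(-\gamma \|x - y\|^2)\) equals the inner product \(\langle \phi(x), \phi(y) \rangle\) of an infinite-dimensional feature map \(\phi\) in an RKHS, yet we compute it without ever constructing \(\phi\). The data points and `gamma` value below are arbitrary:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # pairwise squared distances, then the RBF kernel values
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X, X)   # Gram matrix of implicit infinite-dimensional inner products

# A valid kernel Gram matrix is symmetric positive semidefinite:
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)
print(K)
```

SVMs and Gaussian processes operate entirely on such Gram matrices, which is what makes the infinite-dimensional geometry computationally tractable.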