From Matrices to Operators
Remember, in Linear Algebra, we studied linear transformations
\(T: \mathbb{R}^n \to \mathbb{R}^m\). We learned that a transformation is linear if it preserves addition and scalar multiplication:
\[
T(\alpha x + \beta y) = \alpha T(x) + \beta T(y).
\]
In Functional Analysis, we keep this exact same algebraic definition, but we change the stage from finite-dimensional vectors
(\(\mathbb{R}^n\)) to infinite-dimensional function spaces (like Banach and Hilbert spaces). We call these maps
linear operators.
The Infinite-Dimensional Trap
Here is the critical difference: In finite dimensions (matrices), every linear map is "well-behaved" — it is automatically continuous.
A small change in input always results in a small change in output.
In infinite dimensions, this is no longer true. A linear operator can be "wild" and discontinuous.
Counterexample: The Derivative Operator
Let \(\mathcal{X} = C^1([0, \pi])\) be the space of continuously differentiable functions, and \(\mathcal{Y} = C^0([0, \pi])\) be the space of
continuous functions. Equip both spaces with the supremum norm: \(\|f\|_\infty = \sup_{t \in [0, \pi]} |f(t)|\)
(which equals the maximum for continuous functions on a closed interval).
Consider the differentiation operator \(T(f) = f'\).
Let \(f_k(t) = \sin(kt)\) for integers \(k \geq 1\). The "size" of the input is always \(\|f_k\|_\infty = 1\), regardless of \(k\)
(the maximum value \(1\) is attained at \(t = \pi/(2k) \in [0, \pi]\)).
However, applying the operator gives:
\[
T(f_k) = \frac{d}{dt}\sin(kt) = k\cos(kt)
\]
The size of the output is \(\|Tf_k\|_\infty = k\). As \(k \to \infty\), the output explodes to infinity even
though the input stays bounded. Thus, under the supremum norm, the differentiation operator is unbounded.
A profound lesson in Functional Analysis: Boundedness depends entirely on the chosen norm.
If we change the norm on the input space \(\mathcal{X}\) to \(\|f\|_{C^1} = \|f\|_\infty + \|f'\|_\infty\),
then \(\|Tf\|_\infty = \|f'\|_\infty \leq \|f\|_\infty + \|f'\|_\infty = \|f\|_{C^1}\).
Suddenly, the exact same mapping rule becomes bounded (with \(M=1\)).
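This norm-dependence is easy to see numerically. Below is a minimal sketch: we approximate the supremum norm on \([0, \pi]\) by a fine grid (the helper `sup_norm` is our own, purely illustrative, not a library function) and compare the operator's gain under the two norms.

```python
import math

# Approximate the supremum norm on [0, pi] by evaluating on a fine grid.
def sup_norm(g, n=100_000):
    return max(abs(g(math.pi * i / n)) for i in range(n + 1))

for k in (1, 10, 100):
    f = lambda t, k=k: math.sin(k * t)          # input f_k(t) = sin(kt)
    df = lambda t, k=k: k * math.cos(k * t)     # T(f_k) = f_k' = k cos(kt)
    gain_sup = sup_norm(df) / sup_norm(f)                   # grows like k: unbounded
    gain_c1 = sup_norm(df) / (sup_norm(f) + sup_norm(df))   # always <= 1: bounded
    print(k, gain_sup, gain_c1)
```

The gain under the sup norm equals \(k\) and explodes, while the gain under the \(C^1\) norm is \(k/(1+k) < 1\) for every \(k\), exactly as the inequality above predicts.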
Because calculus and optimization require continuity, we must restrict our focus to the class of operators that behave well:
bounded linear operators.
Boundedness & Continuity
In Intro to Functional Analysis, we established that Banach and Hilbert spaces are the proper
"stages" for doing calculus in infinite dimensions. Now, we need to identify the "actors" — the maps between these spaces —
that are well-behaved enough for analysis. The key idea is deceptively simple: we want operators that do not amplify inputs
by an arbitrarily large factor.
Definition: Bounded Linear Operator
Let \(\mathcal{X}\) and \(\mathcal{Y}\) be normed spaces. A linear operator \(T: \mathcal{X} \to \mathcal{Y}\)
is called bounded if there exists a constant \(M \geq 0\) such that
\[
\|T(x)\|_{\mathcal{Y}} \leq M \|x\|_{\mathcal{X}} \quad \text{for all } x \in \mathcal{X}.
\]
In other words, \(T\) has a finite "maximum gain." It cannot stretch any input by more than a factor of \(M\). Note that
when the codomain \(\mathcal{Y}\) is the scalar field (\(\mathbb{R}\) or \(\mathbb{C}\)), the bounded linear operator is
specifically called a bounded linear functional. In this case, the boundedness condition simply becomes
\(|f(x)| \leq M \|x\|_{\mathcal{X}}\). We will formally explore the space of all such continuous functionals in the
next chapter.
An engineering analogy makes this definition concrete: think of a bounded operator as an amplifier with a finite gain knob.
No matter what signal you feed in, the output power is controlled.
The derivative operator from the previous section, by contrast,
is an amplifier with no gain limit — feeding in higher-frequency signals produces arbitrarily large outputs.
The following theorem is arguably the most important result in this chapter. It tells us that for linear maps,
"bounded" and "continuous" are exactly the same concept. This is remarkable because boundedness
is an algebraic/analytic condition, while continuity is a topological one.
Theorem: Bounded ⟺ Continuous
Let \(T: \mathcal{X} \to \mathcal{Y}\) be a linear operator between normed spaces. The following are equivalent:
(a) \(T\) is continuous at every point of \(\mathcal{X}\).
(b) \(T\) is continuous at 0.
(c) \(T\) is bounded.
Proof Sketch:
(a) \(\Rightarrow\) (b) is trivial (continuity everywhere implies continuity at \(0\)).
(b) \(\Rightarrow\) (c): Since \(T\) is linear, \(T(0) = 0\). Continuity at \(0\) means:
for every \(\varepsilon > 0\), there exists \(\delta > 0\) such that
\(\|x\| < \delta \implies \|Tx\| < \varepsilon\). Choosing \(\varepsilon = 1\) (the specific value
does not matter), we obtain \(\delta > 0\) such that
\(\|x\| < \delta \implies \|Tx\| < 1\). For any nonzero \(x \in \mathcal{X}\), the vector
\(z = \frac{\delta}{2} \cdot \frac{x}{\|x\|}\) satisfies \(\|z\| < \delta\), so \(\|Tz\| < 1\).
By linearity:
\[
\left\| T\!\left(\frac{\delta}{2} \cdot \frac{x}{\|x\|}\right) \right\| = \frac{\delta}{2\|x\|} \|Tx\| < 1
\implies \|Tx\| < \frac{2}{\delta}\|x\|.
\]
For \(x = 0\), both sides of the boundedness inequality vanish, so it holds trivially.
Thus \(T\) is bounded with \(M = 2/\delta\) (any \(M \geq 2/\delta\) also works).
(c) \(\Rightarrow\) (a): If \(\|Tx\| \leq M\|x\|\) for all \(x\), then by linearity, for any \(x, y \in \mathcal{X}\),
\[
\|Tx - Ty\| \;=\; \|T(x - y)\| \;\leq\; M\|x - y\|.
\]
This is the Lipschitz condition with constant \(M\), so \(T\) is
Lipschitz continuous, hence continuous, everywhere. \(\square\)
Connection to Forward Pass Stability
Every layer of a neural network is a composition of a linear map (the weight matrix \(\mathbf{W}\)) followed
by a nonlinear activation. The linear part \(x \mapsto \mathbf{W}x\) is a bounded linear operator on
\(\mathbb{R}^n\), and its "gain" is precisely its operator norm. When we pass inputs forward through many layers,
the total stretch of the signal is bounded by the product of the individual gains.
Ensuring these operators remain strictly bounded is the mathematical foundation of
Lipschitz-constrained networks, which are essential for training stable
Generative Adversarial Networks (GANs) and ensuring robustness against adversarial attacks.
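The bound on the total stretch can be verified with a few lines of NumPy. This is a sketch, not any particular network: the shapes, seed, and layer count are illustrative, and only the linear part of each layer is modeled.

```python
import numpy as np

# Check that the forward stretch through stacked linear maps never exceeds
# the product of the individual operator norms ||W_l||_op = sigma_max(W_l).
rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 8)) for _ in range(5)]

gains = [np.linalg.norm(W, 2) for W in layers]   # ||W||_op under the L2 norm
bound = np.prod(gains)                           # product of the layer gains

x = rng.standard_normal(8)
y = x
for W in layers:
    y = W @ y                                    # forward pass, linear part only

stretch = np.linalg.norm(y) / np.linalg.norm(x)
assert stretch <= bound + 1e-9                   # total gain <= product of gains
```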
Finite Dimensions: Automatic Boundedness
Recall from Intro to Functional Analysis that in finite-dimensional spaces,
all norms are Lipschitz equivalent. Combined with a basis expansion, this yields a striking consequence:
Theorem: Every Linear Operator on Finite-Dimensional Spaces is Bounded
If \(\mathcal{X}\) is a finite-dimensional normed space, then every linear operator \(T: \mathcal{X} \to \mathcal{Y}\)
(for any normed space \(\mathcal{Y}\)) is bounded.
Proof:
Fix a basis \(\{e_1, \ldots, e_n\}\) of \(\mathcal{X}\) and identify \(\mathcal{X}\) with \(\mathbb{F}^n\)
via coordinates, where \(\mathbb{F} \in \{\mathbb{R}, \mathbb{C}\}\). For any \(x = \sum_i x_i e_i\),
linearity of \(T\) and the triangle inequality in \(\mathcal{Y}\) give
\[
\|T x\|_{\mathcal{Y}} = \Bigl\|\sum_i x_i T(e_i)\Bigr\|_{\mathcal{Y}}
\;\leq\; \sum_i |x_i|\, \|T(e_i)\|_{\mathcal{Y}}
\;\leq\; \Bigl(\sum_i \|T(e_i)\|_{\mathcal{Y}}^2\Bigr)^{1/2} \|x\|_2
\]
by Cauchy-Schwarz on \(\mathbb{F}^n\), where \(\|x\|_2 = (\sum_i |x_i|^2)^{1/2}\). Hence \(T\) is
bounded under the Euclidean norm on \(\mathcal{X}\). By the
equivalence of norms on finite-dimensional spaces,
\(T\) is therefore bounded under any norm on \(\mathcal{X}\). \(\square\)
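In matrix language, the bound in this proof is \(\|\mathbf{A}x\|_2 \leq \|\mathbf{A}\|_F \|x\|_2\), where \(\|\mathbf{A}\|_F^2 = \sum_i \|\mathbf{A}e_i\|^2\) is the Frobenius norm. A quick numerical spot-check (random matrix and vectors, purely illustrative):

```python
import numpy as np

# The proof's bound in matrix form: ||A x||_2 <= ||A||_F ||x||_2,
# where ||A||_F = (sum_i ||A e_i||^2)^{1/2} is the Frobenius norm.
rng = np.random.default_rng(4)
A = rng.standard_normal((7, 5))
fro = np.linalg.norm(A)               # default matrix norm = Frobenius

for _ in range(1000):
    x = rng.standard_normal(5)
    assert np.linalg.norm(A @ x) <= fro * np.linalg.norm(x) + 1e-9

# The true operator norm is tighter still: sigma_max(A) <= ||A||_F.
assert np.linalg.norm(A, 2) <= fro + 1e-12
```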
This is why the question of "is this matrix continuous?" never arises in basic Linear Algebra — it is
automatically true. Boundedness only becomes a nontrivial, essential constraint in infinite-dimensional spaces.
Riesz's Lemma and the Geometry of Infinite-Dimensional Balls
The finite-dimensional comforts we just described (automatic boundedness, equivalence of norms)
rest on a single geometric fact: the closed unit ball is compact precisely when
the space is finite-dimensional. The reverse direction of this characterization is subtle and
requires a careful geometric construction. The key is the following lemma, which shows that in any
proper closed subspace, one can always find a unit vector "almost orthogonal" to the entire subspace.
Theorem (Riesz's Lemma)
Let \(\mathcal{X}\) be a normed space and \(M \subsetneq \mathcal{X}\) a closed proper subspace.
Then for every \(\varepsilon \in (0, 1)\), there exists a unit vector \(x_\varepsilon \in \mathcal{X}\)
with \(\|x_\varepsilon\| = 1\) such that
\[
\operatorname{dist}(x_\varepsilon, M) \;=\; \inf_{m \in M} \|x_\varepsilon - m\| \;\geq\; 1 - \varepsilon.
\]
Proof:
Since \(M\) is proper, pick any \(y \in \mathcal{X} \setminus M\). Because \(M\) is closed,
\(d := \operatorname{dist}(y, M) > 0\) (otherwise \(y\) would lie in the closure, hence in \(M\)).
By the definition of infimum, for any \(\varepsilon \in (0, 1)\) we can choose \(m_0 \in M\) with
\[
d \;\leq\; \|y - m_0\| \;\leq\; \frac{d}{1 - \varepsilon}.
\]
Set
\[
x_\varepsilon \;=\; \frac{y - m_0}{\|y - m_0\|}.
\]
Then \(\|x_\varepsilon\| = 1\) by construction. For any \(m \in M\), using the fact that
\(m_0 + \|y - m_0\| \cdot m \in M\) (since \(M\) is a subspace),
\[
\|x_\varepsilon - m\| \;=\; \frac{\|y - m_0 - \|y - m_0\| \cdot m\|}{\|y - m_0\|}
\;=\; \frac{\|y - (m_0 + \|y - m_0\|\,m)\|}{\|y - m_0\|}
\;\geq\; \frac{d}{\|y - m_0\|} \;\geq\; 1 - \varepsilon.
\]
Taking the infimum over \(m \in M\) yields \(\operatorname{dist}(x_\varepsilon, M) \geq 1 - \varepsilon\). \(\square\)
Riesz's Lemma lets us iterate: in an infinite-dimensional space, we can build an infinite sequence
of unit vectors, each separated from the span of its predecessors by distance at least \(1/2\).
No subsequence can be Cauchy, let alone convergent. This is the content of the following
characterization, which we first glimpsed in
Intro to Functional Analysis.
Theorem: Compact Unit Ball ⟺ Finite-Dimensional
The closed unit ball \(\overline{B}_1 = \{x \in \mathcal{X} : \|x\| \leq 1\}\) of a normed space
\(\mathcal{X}\) is compact if and only if \(\mathcal{X}\) is finite-dimensional.
Proof:
(\(\Leftarrow\)) Finite-dimensional \(\Rightarrow\) compact unit ball.
By the equivalence of norms in finite dimensions, we may assume \(\mathcal{X} = \mathbb{F}^n\)
(\(\mathbb{F} \in \{\mathbb{R}, \mathbb{C}\}\)) with the Euclidean norm. The closed unit ball is
then closed and bounded in \(\mathbb{F}^n \cong \mathbb{R}^n\) (real case) or \(\mathbb{R}^{2n}\)
(complex case), hence compact by the
Heine-Borel theorem.
(\(\Rightarrow\)) Compact unit ball \(\Rightarrow\) finite-dimensional.
We prove the contrapositive: if \(\mathcal{X}\) is infinite-dimensional, then \(\overline{B}_1\) is not compact.
Construct a sequence \(\{x_n\} \subset \overline{B}_1\) with no convergent subsequence, as follows.
Pick any unit vector \(x_1\). Let \(M_1 = \operatorname{span}\{x_1\}\); this is finite-dimensional,
hence a closed proper subspace of \(\mathcal{X}\) (it is closed because finite-dimensional normed spaces
are complete — a consequence of the equivalence of norms together with completeness of \(\mathbb{F}^n\) —
and a complete subspace of a normed space is closed).
By Riesz's Lemma with \(\varepsilon = 1/2\), choose a unit vector \(x_2\) with
\(\operatorname{dist}(x_2, M_1) \geq 1/2\). Inductively, let \(M_n = \operatorname{span}\{x_1, \dots, x_n\}\)
and choose \(x_{n+1}\) on the unit sphere with \(\operatorname{dist}(x_{n+1}, M_n) \geq 1/2\).
Since \(\mathcal{X}\) is infinite-dimensional, \(M_n \subsetneq \mathcal{X}\) at every stage, so the
construction never terminates.
The resulting sequence satisfies \(\|x_m - x_n\| \geq 1/2\) for all \(m \neq n\)
(since for \(m < n\), \(x_m \in M_{n-1}\) while \(x_n\) is chosen at distance at least \(1/2\) from \(M_{n-1}\)). No subsequence is Cauchy, so no subsequence converges.
Therefore \(\overline{B}_1\) is not sequentially compact, and hence not compact
(in metric spaces, compactness and sequential compactness coincide). \(\square\)
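In the concrete Hilbert space \(\ell^2\), this construction can be made exact: the standard basis vectors \(e_n\) are unit vectors with \(\|e_m - e_n\| = \sqrt{2}\) for \(m \neq n\), so no subsequence is Cauchy. A small sketch using truncated coordinate lists as a stand-in for \(\ell^2\) (the helpers are ours, purely illustrative):

```python
import math

# Standard basis vectors of l^2, truncated to finitely many coordinates.
def e(n, dim=10):
    v = [0.0] * dim
    v[n] = 1.0
    return v

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

vectors = [e(n) for n in range(10)]
# Each e_n is a unit vector, and any two are at distance sqrt(2) > 1/2:
for m in range(10):
    for n in range(m + 1, 10):
        assert abs(dist(vectors[m], vectors[n]) - math.sqrt(2)) < 1e-12
```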
Why This Matters: The Failure of Brute-Force Optimization
In finite-dimensional optimization (standard machine learning with a fixed-size parameter vector),
we can always appeal to compactness: bound the parameters, and a continuous loss function attains its minimum.
In infinite-dimensional function space optimization — training a neural network of unbounded width,
finding the optimal function in a reproducing kernel Hilbert space, solving a variational problem —
this guarantee evaporates. A bounded sequence of candidate solutions may fail to have any convergent subsequence,
meaning the "optimal" function may fail to exist in the space at all. This is why infinite-dimensional
optimization requires genuinely new tools: weak convergence (next chapter), convexity,
and carefully chosen function space topologies — none of which are needed in finite dimensions.
The Operator Norm
We know that a bounded operator has a finite "maximum gain." The operator norm captures this by taking
the supremum stretch over all nonzero inputs — which, as we show below, is equivalently the smallest
constant \(M\) that witnesses boundedness.
Definition: The Operator Norm
Let \(T: \mathcal{X} \to \mathcal{Y}\) be a bounded linear operator between normed spaces. The
operator norm of \(T\) is:
\[
\|T\| = \sup_{\substack{x \in \mathcal{X} \\ x \neq 0}} \frac{\|Tx\|_\mathcal{Y}}{\|x\|_\mathcal{X}}
\qquad (\mathcal{X} \neq \{0\}),
\]
with the convention \(\|T\| := 0\) in the degenerate case \(\mathcal{X} = \{0\}\) (for which the only
linear operator is \(T = 0\)).
Intuitively, \(\|T\|\) answers the question: "What is the maximum factor by which \(T\) can amplify
any input?" This supremum is taken over all nonzero inputs — it finds the "worst-case stretch."
The following equivalent characterizations are used constantly in practice. Each captures the same idea
from a slightly different geometric perspective:
Proposition: Equivalent Characterizations of the Operator Norm
For a bounded linear operator \(T: \mathcal{X} \to \mathcal{Y}\) with \(\mathcal{X} \neq \{0\}\),
\[
\|T\| \;=\; \sup_{\|x\| = 1} \|Tx\| \;=\; \sup_{\|x\| \leq 1} \|Tx\| \;=\; \inf\{M \geq 0 : \|Tx\| \leq M \|x\| \text{ for all } x \in \mathcal{X}\}.
\]
Proof:
All three equalities rely on the linearity of \(T\) and the homogeneity of the norm: for any
\(\lambda > 0\), \(\|T(\lambda x)\| = \lambda \|Tx\|\) and \(\|\lambda x\| = \lambda \|x\|\),
so the ratio \(\|Tx\|/\|x\|\) is invariant under positive scaling.
Sphere = closed ball: For any \(x \neq 0\), write \(\hat{x} = x / \|x\|\). Then
\(\|T\hat{x}\| = \|Tx\|/\|x\|\), so the supremum of \(\|Tx\|/\|x\|\) over \(x \neq 0\) equals
\(\sup_{\|y\|=1} \|Ty\|\). For the closed ball: the sphere is a subset, so
\(\sup_{\|x\| \leq 1} \|Tx\| \geq \sup_{\|y\|=1} \|Ty\|\); conversely, for \(0 < \|x\| \leq 1\),
\(\|Tx\| = \|x\|\,\|T\hat{x}\| \leq \|T\hat{x}\| \leq \sup_{\|y\|=1} \|Ty\|\) (and \(x = 0\) contributes
\(\|Tx\| = 0\)), giving the reverse inequality. The three expressions coincide.
Smallest bounding constant: Any \(M\) with \(\|Tx\| \leq M\|x\|\) for all \(x\)
satisfies \(\|Tx\|/\|x\| \leq M\) for \(x \neq 0\), hence
\(\sup_{x \neq 0} \|Tx\|/\|x\| \leq M\). Taking infimum over admissible \(M\) gives
\(\|T\| \leq \inf\{M : \ldots\}\). Conversely, \(\|T\|\) itself is an admissible \(M\) by definition,
so \(\inf\{M : \ldots\} \leq \|T\|\). \(\square\)
Example: Operator Norm of a Matrix
For a matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) acting on \((\mathbb{R}^n, \|\cdot\|_2) \to (\mathbb{R}^m, \|\cdot\|_2)\),
the operator norm induced by the \(L^2\) norm equals the largest singular value:
\[
\|\mathbf{A}\|_{\text{op}} = \sigma_{\max}(\mathbf{A}) = \sqrt{\lambda_{\max}(\mathbf{A}^T\mathbf{A})}.
\]
This result is specific to the \(L^2\)-induced norm. If we instead equip the domain with the \(L^1\) norm, the operator norm
becomes the maximum absolute column sum; under the \(L^\infty\) norm, it becomes the maximum absolute row sum.
The choice of norm fundamentally determines the operator's "gain" — echoing the lesson from our derivative operator example.
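These three facts can be checked directly with NumPy, whose matrix norm orders 2, 1, and inf compute exactly these induced norms (the example matrix is arbitrary):

```python
import numpy as np

# Induced matrix norm depends on the chosen vector norm:
#   L2 -> largest singular value, L1 -> max column sum, Linf -> max row sum.
A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

sigma_max = np.linalg.svd(A, compute_uv=False)[0]
assert np.isclose(np.linalg.norm(A, 2), sigma_max)        # L2-induced norm

col_sum = np.abs(A).sum(axis=0).max()                     # |-2| + |4| = 6
assert np.isclose(np.linalg.norm(A, 1), col_sum)          # L1-induced norm

row_sum = np.abs(A).sum(axis=1).max()                     # |3| + |4| = 7
assert np.isclose(np.linalg.norm(A, np.inf), row_sum)     # Linf-induced norm
```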
This connects directly to
the Singular Value Decomposition (SVD).
The SVD decomposes \(\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T\).
Because orthogonal matrices are isometries under the \(L^2\) norm
(they preserve lengths, so \(\|\mathbf{U}\|_{\text{op}} = \|\mathbf{V}^T\|_{\text{op}} = 1\)),
they contribute no "stretch" to the composition. By submultiplicativity,
\(\|\mathbf{A}\|_{\text{op}} \leq \|\mathbf{U}\|_{\text{op}}\|\boldsymbol{\Sigma}\|_{\text{op}}\|\mathbf{V}^T\|_{\text{op}} = \|\boldsymbol{\Sigma}\|_{\text{op}}\);
conversely, orthogonality gives \(\boldsymbol{\Sigma} = \mathbf{U}^T \mathbf{A} \mathbf{V}\), so the same argument
yields \(\|\boldsymbol{\Sigma}\|_{\text{op}} \leq \|\mathbf{A}\|_{\text{op}}\). Hence \(\|\mathbf{A}\|_{\text{op}} = \|\boldsymbol{\Sigma}\|_{\text{op}}\),
and the latter is simply the largest diagonal entry (since \(\boldsymbol{\Sigma}\) scales each basis vector independently).
In this light, the operator norm is the infinite-dimensional generalization of the largest singular value.
Key Properties of the Operator Norm
The operator norm satisfies the standard axioms of a norm (non-negativity, homogeneity, triangle inequality),
but it also has a crucial submultiplicativity property that is essential for studying compositions.
Proposition: Operator Norm Axioms
For bounded linear operators \(S, T: \mathcal{X} \to \mathcal{Y}\) and scalar \(\alpha \in \mathbb{F}\):
- (Positive definiteness) \(\|T\| \geq 0\), with equality if and only if \(T = 0\);
- (Homogeneity) \(\|\alpha T\| = |\alpha|\,\|T\|\);
- (Triangle inequality) \(\|S + T\| \leq \|S\| + \|T\|\).
Proof:
(1) \(\|T\| \geq 0\) as a supremum of non-negative quantities. If \(\|T\| = 0\), then
\(\|Tx\|_{\mathcal{Y}} \leq 0 \cdot \|x\|_{\mathcal{X}} = 0\) for all \(x\), hence \(Tx = 0\), i.e., \(T = 0\).
The converse is immediate.
(2) \(\|\alpha T\| = \sup_{x \neq 0} \|\alpha Tx\|/\|x\| = |\alpha|\,\sup_{x \neq 0} \|Tx\|/\|x\| = |\alpha|\,\|T\|\)
by homogeneity of the norm on \(\mathcal{Y}\).
(3) For any \(x\), \(\|(S + T)x\|_{\mathcal{Y}} \leq \|Sx\|_{\mathcal{Y}} + \|Tx\|_{\mathcal{Y}}
\leq (\|S\| + \|T\|)\,\|x\|_{\mathcal{X}}\), so \(S + T\) is bounded with bounding constant \(\|S\| + \|T\|\);
the smallest-bounding-constant characterization gives \(\|S + T\| \leq \|S\| + \|T\|\). \(\square\)
Beyond these basic axioms, the operator norm satisfies an additional crucial property governing compositions:
Proposition: Submultiplicativity
If \(T: \mathcal{X} \to \mathcal{Y}\) and \(S: \mathcal{Y} \to \mathcal{Z}\) are bounded linear operators, then the
composition \(S \circ T: \mathcal{X} \to \mathcal{Z}\) is bounded and
\[
\|S \circ T\| \leq \|S\| \cdot \|T\|.
\]
Proof:
\(S \circ T\) is linear as a composition of linear maps. For any \(x \in \mathcal{X}\), applying the defining bound of the operator norm twice gives
\[
\|(S \circ T)(x)\|_\mathcal{Z} \;=\; \|S(Tx)\|_\mathcal{Z} \;\leq\; \|S\| \cdot \|Tx\|_\mathcal{Y}
\;\leq\; \|S\| \cdot \|T\| \cdot \|x\|_\mathcal{X}.
\]
Hence \(S \circ T\) is bounded with bounding constant \(\|S\| \cdot \|T\|\), and by the
smallest-bounding-constant characterization of the operator norm,
\(\|S \circ T\| \leq \|S\| \cdot \|T\|\). \(\square\)
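Both the triangle inequality and submultiplicativity are easy to spot-check numerically on random matrices under the \(L^2\)-induced norm. This is a sanity check, not a proof; shapes and seed are arbitrary.

```python
import numpy as np

# Spot-check operator-norm properties: ||S + T'|| <= ||S|| + ||T'|| and
# ||S T|| <= ||S|| ||T||, with ||.|| the L2-induced norm (sigma_max).
rng = np.random.default_rng(1)
op = lambda M: np.linalg.norm(M, 2)

for _ in range(100):
    S = rng.standard_normal((4, 6))    # S : R^6 -> R^4
    T = rng.standard_normal((6, 5))    # T : R^5 -> R^6, so S @ T is defined
    T2 = rng.standard_normal((4, 6))   # same shape as S, so S + T2 is defined
    assert op(S + T2) <= op(S) + op(T2) + 1e-9   # triangle inequality
    assert op(S @ T) <= op(S) * op(T) + 1e-9     # submultiplicativity
```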
Connection to Gradient Flow and Backpropagation
While layer gains affect the forward pass, submultiplicativity directly governs the backward pass.
In a neural network, the Jacobian \(\mathbf{J}_\ell\) of layer \(\ell\) can be decomposed via the chain rule as
\(\mathbf{J}_\ell = \mathbf{D}_\ell \mathbf{W}_\ell\), where \(\mathbf{W}_\ell\) is the weight matrix and \(\mathbf{D}_\ell\)
is the diagonal matrix of the activation function's derivatives.
Standard activation functions like ReLU and Tanh satisfy \(|\sigma'(t)| \leq 1\) everywhere (they are 1-Lipschitz),
so the diagonal entries of \(\mathbf{D}_\ell\) are all at most 1 in absolute value, giving
\(\|\mathbf{D}_\ell\|_{\text{op}} \leq 1\). By submultiplicativity, the norm of the entire gradient through \(L\) layers
is bounded by:
\[
\left\|\prod_{\ell=1}^{L} \mathbf{J}_\ell \right\|_{\text{op}} \leq \prod_{\ell=1}^{L} \|\mathbf{D}_\ell\|_{\text{op}} \|\mathbf{W}_\ell\|_{\text{op}} \leq \prod_{\ell=1}^{L} \|\mathbf{W}_\ell\|_{\text{op}}.
\]
This reveals a profound architectural beauty: the burden of preventing exploding/vanishing gradients falls entirely
on the operator norm of the weight matrices \(\mathbf{W}_\ell\). Techniques like orthogonal initialization
(\(\|\mathbf{W}_\ell\|_{\text{op}} \approx 1\)) and spectral normalization are direct applications of this theory
to guarantee stable gradient flow.
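The core of spectral normalization can be sketched in a few lines: estimate \(\sigma_{\max}(\mathbf{W})\) by power iteration on \(\mathbf{W}^T\mathbf{W}\), then divide it out so the rescaled matrix has operator norm close to 1. The function name and iteration count below are illustrative, not the API of any particular library.

```python
import numpy as np

# Sketch of spectral normalization: power iteration on W^T W estimates the
# top singular value sigma_max(W); dividing by it makes ||W/sigma||_op ~ 1.
def spectral_normalize(W, n_iters=200, seed=2):
    u = np.random.default_rng(seed).standard_normal(W.shape[1])
    for _ in range(n_iters):
        u = W.T @ (W @ u)              # one power-iteration step on W^T W
        u /= np.linalg.norm(u)
    sigma = np.linalg.norm(W @ u)      # estimated largest singular value
    return W / sigma

W = np.random.default_rng(3).standard_normal((32, 16))
W_sn = spectral_normalize(W)
print(np.linalg.norm(W_sn, 2))         # close to 1
```

Production implementations (e.g., in GAN training) typically reuse the iterate `u` across training steps, so a single power-iteration step per update suffices.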
The Space \(\mathcal{B}(\mathcal{X}, \mathcal{Y})\) & Completeness
A remarkable feature of functional analysis is that operators themselves form a vector space.
Just as we add and scale functions pointwise, we can add and scale operators:
\[
(S + T)(x) = S(x) + T(x), \qquad (\alpha T)(x) = \alpha \cdot T(x).
\]
If we collect all bounded linear operators from \(\mathcal{X}\) to \(\mathcal{Y}\) and equip this collection with the
operator norm, we obtain a normed space of its own.
Definition: The Space \(\mathcal{B}(\mathcal{X}, \mathcal{Y})\)
Let \(\mathcal{X}\) and \(\mathcal{Y}\) be normed spaces over the same scalar field. Define
\[
\mathcal{B}(\mathcal{X}, \mathcal{Y}) = \{ T: \mathcal{X} \to \mathcal{Y} \mid T \text{ is linear and bounded} \},
\]
equipped with pointwise addition \((S + T)(x) = S(x) + T(x)\), scalar multiplication \((\alpha T)(x) = \alpha\, T(x)\),
and the operator norm \(\|T\| = \sup_{\|x\| \leq 1} \|Tx\|\).
It remains to verify that this is a well-defined normed vector space. Closure under the operations is immediate:
for \(S, T \in \mathcal{B}(\mathcal{X}, \mathcal{Y})\) and \(\alpha \in \mathbb{F}\), the triangle inequality gives
\(\|(S+T)x\| \leq \|Sx\| + \|Tx\| \leq (\|S\| + \|T\|)\|x\|\), so \(S + T\) is bounded, and similarly
\(\|(\alpha T)x\| = |\alpha|\,\|Tx\| \leq |\alpha|\,\|T\|\,\|x\|\), so \(\alpha T\) is bounded. The vector space
axioms are inherited from pointwise operations on \(\mathcal{Y}\); the norm axioms are
verified above.
This construction is a hallmark of the "functional analysis philosophy": we study maps between spaces by
turning those maps into points of a new space, and then we apply the same analytical tools
(norms, convergence, completeness) to this new space of operators.
The next theorem tells us when this space of operators is complete. The answer is elegant:
the operators inherit completeness from their codomain.
Theorem: Completeness of \(\mathcal{B}(\mathcal{X}, \mathcal{Y})\)
If \(\mathcal{X}\) is a normed space and \(\mathcal{Y}\) is a Banach space, then \(\mathcal{B}(\mathcal{X}, \mathcal{Y})\)
is also a Banach space.
Note: The completeness of the input space \(\mathcal{X}\) is entirely irrelevant for this theorem.
The operator space \(\mathcal{B}(\mathcal{X}, \mathcal{Y})\) inherits its completeness strictly from the codomain \(\mathcal{Y}\).
Proof Sketch:
Let \(\{T_n\}\) be a Cauchy sequence in \(\mathcal{B}(\mathcal{X}, \mathcal{Y})\). For each fixed \(x \in \mathcal{X}\):
\[
\|T_n(x) - T_m(x)\|_\mathcal{Y} = \|(T_n - T_m)(x)\|_{\mathcal{Y}} \leq \|T_n - T_m\|_{\text{op}} \cdot \|x\|_{\mathcal{X}}.
\]
Since \(\{T_n\}\) is Cauchy in the operator norm, \(\{T_n(x)\}\) is a Cauchy sequence in \(\mathcal{Y}\).
Because \(\mathcal{Y}\) is complete (Banach), each \(T_n(x)\) converges to some limit. We define
\(T(x) = \lim_{n \to \infty} T_n(x)\). Linearity of \(T\) follows from the linearity of limits.
To see that \(T\) is bounded: since every Cauchy sequence is bounded, there exists \(M\) such that
\(\|T_n\|_{\text{op}} \leq M\) for all \(n\). Therefore \(\|T_n(x)\|_{\mathcal{Y}} \leq M\|x\|_{\mathcal{X}}\),
and taking \(n \to \infty\) yields \(\|T(x)\|_{\mathcal{Y}} \leq M\|x\|_{\mathcal{X}}\),
so \(T\) is bounded.
Finally, we show \(\|T_n - T\|_{\text{op}} \to 0\). Given \(\varepsilon > 0\), choose \(N\) such that
\(\|T_n - T_m\|_{\text{op}} < \varepsilon\) for all \(n, m \geq N\); equivalently
\(\|T_n(x) - T_m(x)\|_{\mathcal{Y}} \leq \varepsilon \|x\|_{\mathcal{X}}\) for all \(x\).
Letting \(m \to \infty\) and using continuity of the norm gives
\(\|T_n(x) - T(x)\|_{\mathcal{Y}} \leq \varepsilon \|x\|_{\mathcal{X}}\) for all \(x\),
hence \(\|T_n - T\|_{\text{op}} \leq \varepsilon\) for \(n \geq N\). \(\square\)
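A toy finite-dimensional illustration of this argument (where completeness is automatic): the matrices \(T_n = A + B/n\) form a Cauchy sequence in operator norm, their pointwise limit is \(A\), and \(\|T_n - A\|_{\text{op}} = \|B\|_{\text{op}}/n \to 0\). The matrices here are random placeholders.

```python
import numpy as np

# A Cauchy sequence of operators T_n = A + B/n: its pointwise limit is A,
# and it converges to A in operator norm, mirroring the proof above.
rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
T = lambda n: A + B / n

x = rng.standard_normal(4)
# Pointwise convergence: T_n(x) -> A(x).
assert np.allclose(T(10**6) @ x, A @ x, atol=1e-4)

# Operator-norm convergence: ||T_n - A||_op = ||B||_op / n decreases to 0.
gaps = [np.linalg.norm(T(n) - A, 2) for n in (1, 10, 100)]
assert gaps[0] > gaps[1] > gaps[2]
```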
The Special Case: \(\mathcal{X}^* = \mathcal{B}(\mathcal{X}, \mathbb{F})\)
Definition: Bounded Linear Functional
A linear functional
\(f: \mathcal{X} \to \mathbb{F}\) is called a bounded linear functional if it is bounded
as an operator; explicitly, if
there exists \(M \geq 0\) such that
\[
|f(x)| \leq M \|x\|_{\mathcal{X}} \quad \text{for all } x \in \mathcal{X}.
\]
By Bounded ⟺ Continuous,
this is equivalent to \(f\) being a continuous linear functional.
When the codomain is the scalar field \(\mathbb{F}\) (which is always complete), we get a particularly
important instance of the operator space:
\[
\mathcal{X}^* = \mathcal{B}(\mathcal{X}, \mathbb{F}).
\]
As a consequence of the Completeness Theorem above, \(\mathcal{X}^*\) is always a Banach space, regardless of
whether \(\mathcal{X}\) itself is complete. This space \(\mathcal{X}^*\) is the topological dual space
of \(\mathcal{X}\) — the space of all continuous linear functionals.
We introduced functionals at the end of Intro to Functional Analysis;
the dual space will be the central object of the next chapter (Dual Spaces & Riesz Representation).
Connection to Function Space Optimization
The completeness of \(\mathcal{B}(\mathcal{X}, \mathcal{Y})\) guarantees that if we have a sequence of "improving" operators
(e.g., a sequence of trained neural network weight configurations converging in operator norm), the limit
is a valid bounded operator that remains in the space. Without this result, there would be
no guarantee that the "best" operator obtained by an optimization process actually exists within our
working space — the same foundational role that completeness of \(L^2\) plays for function approximation.
"Linear Manifold" vs "Manifold": A Terminological Warning
Before proceeding further, we must flag a terminological pitfall that trips up readers moving between
classical functional analysis and modern differential geometry / Geometric Deep Learning.
Conway's A Course in Functional Analysis (and several other classical texts) uses the term
"linear manifold" to mean a subspace of a vector space — possibly non-closed,
in the normed setting — reserving the word "subspace" for a closed linear manifold.
In either case the object is flat: it passes through the origin and is closed under linear combinations.
This is a completely different meaning from the word "manifold" as used in Lee's Introduction to Smooth Manifolds
or in Geometric Deep Learning, where a manifold is a curved space that only locally looks Euclidean.
The adjective "linear" in "linear manifold" is precisely what marks the object as flat; the term has largely fallen out of
use outside classical functional analysis, so a reader trained on Lee will rarely encounter it elsewhere — and when it does
appear in Conway, misreading it as a curved manifold is an easy trap.
Definition: Manifold (Topological / Smooth)
An \(n\)-dimensional manifold is a second-countable Hausdorff topological space in which
every point has a neighbourhood homeomorphic to an open subset of \(\mathbb{R}^n\) (via homeomorphisms
called charts); the integer \(n\) is the dimension, fixed for each connected component.
A smooth manifold additionally requires that the transition maps between overlapping charts
are infinitely differentiable, enabling calculus on curved spaces.
Key examples include spheres \(S^n\), the rotation group \(SO(3)\), and the space of
positive definite matrices — all of which carry intrinsic curvature (under their natural metrics).
The distinction to keep in mind:
- "Linear manifold" (Conway, classical FA): a subspace of a vector space. Flat, globally linear,
passes through the origin; the same thing as what modern texts simply call a subspace.
- "Manifold" (Lee, GDL, modern): a locally Euclidean topological space, potentially curved,
with no ambient vector space required. Carries rich geometric structure: tangent spaces, geodesics, curvature.
Why This Matters for GDL
In Geometric Deep Learning, "manifold" always refers to curved spaces: the sphere \(S^2\) for
omnidirectional images, the Lie group \(SO(3)\) for 3D rotations in robotics, or the statistical
manifold of probability distributions for
natural gradient descent.
If you encounter "linear manifold" in a functional analysis text, mentally replace it with
"subspace" — the flat, algebraic concept — and reserve "manifold" for the curved,
geometric object that dominates the later portions of this curriculum.
With this clarification in place, we have now assembled the essential toolkit of bounded linear operators
and their norms. In the next chapter, we turn to the most important special case:
when the codomain is the scalar field. These are the continuous linear functionals,
and collecting them gives us the Dual Space \(\mathcal{X}^*\) — the gateway to understanding
why gradients are covectors and to the Riesz Representation Theorem.