Dual Spaces & Riesz Representation

The Dual Space \(\mathcal{X}^*\)

Recall that a functional is a map that takes a function (or vector) and returns a scalar. In Bounded Linear Operators, we also saw that when the codomain is the scalar field \(\mathbb{F}\), the space \(\mathcal{X}^* = \mathcal{B}(\mathcal{X}, \mathbb{F})\) is always a Banach space, regardless of whether \(\mathcal{X}\) itself is complete.

We are now ready to study this space in full. The topological dual space \(\mathcal{X}^*\) is not merely an abstract construction — it is the space of all possible "measurements" one can perform on \(\mathcal{X}\). Every element \(\varphi \in \mathcal{X}^*\) is a continuous linear functional: a device that inputs a vector and outputs a number, while respecting both linearity and the topology of the space.

The central question of this chapter is deceptively simple:

"Given a continuous linear functional \(\varphi \in \mathcal{H}^*\), can we always find a concrete vector \(y \in \mathcal{H}\) that 'represents' it via the inner product?"

The answer — the Riesz Representation Theorem — will fundamentally reshape how we think about gradients, derivatives, and optimization on curved spaces.

Formal Construction

Definition: Topological Dual Space

Let \(\mathcal{X}\) be a normed space over the field \(\mathbb{F}\) (\(\mathbb{R}\) or \(\mathbb{C}\)). The topological dual space (or simply the dual space) of \(\mathcal{X}\) is \[ \mathcal{X}^* \;=\; \mathcal{B}(\mathcal{X},\, \mathbb{F}) \;=\; \bigl\{\,\varphi : \mathcal{X} \to \mathbb{F} \;\big|\; \varphi \text{ is linear and continuous}\,\bigr\}. \] Elements of \(\mathcal{X}^*\) are called continuous linear functionals, or covectors (particularly in differential geometry and physics).

Since \(\mathbb{F}\) is always complete, the result from the previous chapter immediately gives us:

Theorem: \(\mathcal{X}^*\) is Always Banach

For any normed space \(\mathcal{X}\), the dual space \(\mathcal{X}^*\) equipped with the operator norm \[ \|\varphi\|_{\mathcal{X}^*} \;=\; \sup_{\|x\| \leq 1} |\varphi(x)| \] is a Banach space, regardless of whether \(\mathcal{X}\) is complete.

Connection to Machine Learning: The "Measurement Space"

Think of a neural network's parameter space \(\boldsymbol{\theta} \in \mathbb{R}^d\). The loss function \(\mathcal{L}\) assigns a scalar to each configuration — it is a functional. Its derivative \(d\mathcal{L}_{\boldsymbol{\theta}}\) at a point is a linear functional: it takes a small perturbation \(\delta\boldsymbol{\theta}\) and returns the first-order change in loss. This derivative lives not in the parameter space itself, but in its dual. This distinction is conceptually important when transitioning from flat-space optimization to optimization on manifolds, where vectors and covectors are no longer interchangeable.

Concrete Examples of Dual Spaces

To build intuition, let us examine the dual spaces of the most important spaces in our curriculum. When we say a space \(\mathcal{Y}\) is the dual of \(\mathcal{X}\) (written \(\mathcal{X}^* \cong \mathcal{Y}\)), we mean there is a natural isometric isomorphism between them, usually given by an integral or an infinite sum.

Convention (complex scalars). The duality pairings \(\sum x_i y_i\) and \(\int f g\, d\mu\) in the table below are bilinear — they carry no complex conjugation. This is the standard way to identify \((\ell^p)^*\) with \(\ell^q\) and \((L^p)^*\) with \(L^q\) as Banach-space isomorphisms. It should not be confused with the Hilbert-space inner product on \(\ell^2\) or \(L^2\), which (under our first-slot-linear convention) carries conjugation in the second slot: \(\langle x, y\rangle = \sum x_i \overline{y_i}\). The two coincide over \(\mathbb{R}\); over \(\mathbb{C}\) they differ by a complex conjugation on the representing element.

| Space \(\mathcal{X}\) | Dual \(\mathcal{X}^*\) | Pairing \(\varphi(x)\) | Reflexive? |
|---|---|---|---|
| \(\mathbb{R}^n\) | \(\mathbb{R}^n\) | \(y^T x\) | Yes |
| \(\ell^p\;(1 < p < \infty)\) | \(\ell^q\;\;(1/p + 1/q = 1)\) | \(\sum x_i y_i\) | Yes |
| \(L^p\;(1 < p < \infty)\) | \(L^q\;\;(1/p + 1/q = 1)\) | \(\int f\, g\, d\mu\) | Yes |
| \(L^1\) | \(L^\infty\) | \(\int f\, g\, d\mu\) | No |
| \(L^\infty\) | \(\supsetneq L^1\) (ba space) | — | No |
| \(\mathcal{H}\) (Hilbert) | \(\mathcal{H}\) | \(\langle x, y \rangle\) | Yes |
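The \(\ell^p\)-\(\ell^q\) pairing in the table can be checked numerically. A minimal NumPy sketch, truncating the sequences to finite vectors for illustration, verifies that the pairing is controlled by the conjugate norms (Hölder's inequality), which is exactly why \(\ell^q\) embeds isometrically into \((\ell^p)^*\):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3.0
q = p / (p - 1.0)          # Hölder conjugate: 1/p + 1/q = 1

x = rng.normal(size=1000)  # a truncated element of ell^p (illustrative)
y = rng.normal(size=1000)  # a truncated element of ell^q (illustrative)

pairing = float(np.sum(x * y))                    # phi_y(x) = sum_i x_i y_i
norm_p = float(np.sum(np.abs(x) ** p) ** (1 / p))
norm_q = float(np.sum(np.abs(y) ** q) ** (1 / q))

# Hölder's inequality: |phi_y(x)| <= ||x||_p ||y||_q.
assert abs(pairing) <= norm_p * norm_q
```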

The Riesz Representation Theorem

We now arrive at one of the central results of functional analysis. In the examples above, we saw that identifying the dual of a general Banach space can be quite intricate — the relationship between \(L^p\) and \(L^q\) depends on Hölder conjugates, and pathological cases like \((L^\infty)^*\) escape any simple description.

Hilbert spaces are the exception. Their inner product provides a canonical, isometric identification between the space and its dual. This is the content of the Riesz Representation Theorem, due to Frigyes Riesz (1907).

Let us first build the intuition. In \(\mathbb{R}^n\) with the standard dot product, every linear functional \(\varphi: \mathbb{R}^n \to \mathbb{R}\) can be written as \(\varphi(x) = y^T x\) for a unique vector \(y \in \mathbb{R}^n\). The Riesz theorem says: this works in every Hilbert space, even infinite-dimensional ones.
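In \(\mathbb{R}^n\) this recovery is completely concrete: evaluating \(\varphi\) on the standard basis reads off the components of the representing vector \(y\). A small NumPy sketch, where the functional `phi` is a hypothetical example:

```python
import numpy as np

def phi(x):
    # Some "abstract" continuous linear functional on R^4 (hypothetical).
    return 2.0 * x[0] - x[1] + 0.5 * x[3]

n = 4
# Riesz representative in (R^n, dot product): y_i = phi(e_i).
y = np.array([phi(e) for e in np.eye(n)])

x = np.array([1.0, 2.0, 3.0, 4.0])
assert np.isclose(phi(x), y @ x)   # phi(x) = <x, y> for every x
```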

Convention: Inner Product Linearity

Throughout this chapter we adopt the mathematician's convention that the inner product \(\langle \cdot, \cdot \rangle\) is linear in the first argument and conjugate-linear in the second: \[ \langle \alpha x + y,\, z \rangle = \alpha \langle x, z \rangle + \langle y, z \rangle, \qquad \langle x,\, \alpha y \rangle = \overline{\alpha}\,\langle x, y \rangle. \] Over \(\mathbb{R}\) this collapses to ordinary bilinearity. Under this convention, the Riesz representation \(\varphi(x) = \langle x, y_\varphi \rangle\) makes the Riesz map \(\Phi(y) = \langle \cdot, y \rangle\) conjugate-linear in \(y\) over \(\mathbb{C}\). Some physics texts (and Dirac's Bra-Ket notation) instead place conjugate-linearity in the first slot; conversion between the two is straightforward but must be done explicitly to avoid sign errors.

Theorem: Riesz Representation

Let \(\mathcal{H}\) be a Hilbert space over \(\mathbb{F}\). For every continuous linear functional \(\varphi \in \mathcal{H}^*\), there exists a unique element \(y_\varphi \in \mathcal{H}\) such that \[ \varphi(x) \;=\; \langle x,\, y_\varphi \rangle \quad \text{for all } x \in \mathcal{H}. \] Moreover, the correspondence \(\varphi \mapsto y_\varphi\) is isometric and bijective: \[ \|\varphi\|_{\mathcal{H}^*} \;=\; \|y_\varphi\|_{\mathcal{H}}. \] Over \(\mathbb{R}\) it is linear; over \(\mathbb{C}\) it is conjugate-linear (anti-linear).

Let us unpack this statement before proving it. The theorem asserts three things simultaneously:

  1. Existence: No matter how abstract or exotic the functional \(\varphi\) is, there is always a concrete vector \(y_\varphi\) that reproduces it via the inner product.
  2. Uniqueness: This representative vector is determined uniquely — there is no ambiguity.
  3. Isometry: The "size" of \(\varphi\) as a functional equals the "size" of \(y_\varphi\) as a vector. No information is lost in the translation.

Proof

The proof relies on one key fact about Hilbert spaces that has no analogue in general Banach spaces. In Linear Algebra, we proved that every vector in \(\mathbb{R}^n\) can be uniquely decomposed as the sum of a component in a subspace \(W\) and a component in \(W^\perp\). The following theorem extends this to infinite-dimensional Hilbert spaces, where the assumption that \(M\) is closed becomes essential.

Theorem: Projection Theorem (Orthogonal Decomposition in Hilbert Spaces)

Let \(\mathcal{H}\) be a Hilbert space and let \(M \subseteq \mathcal{H}\) be a closed subspace. Then every \(x \in \mathcal{H}\) can be uniquely written as \[ x \;=\; m \;+\; m^\perp, \qquad m \in M,\;\; m^\perp \in M^\perp. \] Equivalently, \(\mathcal{H} = M \oplus M^\perp\). The element \(m\) is called the orthogonal projection of \(x\) onto \(M\), and it is the unique closest point in \(M\) to \(x\): \[ \|x - m\| \;=\; \inf_{y \in M} \|x - y\|. \]

The proof relies on two key ingredients specific to Hilbert spaces: the completeness of \(\mathcal{H}\) (to guarantee that a minimizing sequence converges) and the Parallelogram Law (to extract Cauchyness from the minimization). This is precisely where the Hilbert space structure — as opposed to mere Banach space structure — is indispensable: in a Banach space without an inner product, closest points in closed subspaces need not exist or be unique.

Proof of the Projection Theorem

Existence of the nearest point. Set \(d := \inf_{y \in M}\|x - y\|\) and take a minimizing sequence \(\{y_n\} \subset M\) with \(\|x - y_n\| \to d\). Applying the parallelogram law to the vectors \(x - y_n\) and \(x - y_m\) yields \[ \|(x - y_n) + (x - y_m)\|^2 \;+\; \|(x - y_n) - (x - y_m)\|^2 \;=\; 2\|x - y_n\|^2 + 2\|x - y_m\|^2, \] which rearranges to \[ \|y_n - y_m\|^2 \;=\; 2\|x - y_n\|^2 + 2\|x - y_m\|^2 - 4\left\|x - \tfrac{y_n + y_m}{2}\right\|^2. \] Since \(M\) is a linear subspace, \(\tfrac{y_n + y_m}{2} \in M\), so \(\|x - \tfrac{y_n+y_m}{2}\|^2 \geq d^2\), meaning the subtracted term \(-4\|x - \tfrac{y_n+y_m}{2}\|^2\) is at most \(-4d^2\). Hence \[ 0 \;\leq\; \|y_n - y_m\|^2 \;\leq\; 2\|x - y_n\|^2 + 2\|x - y_m\|^2 - 4d^2. \] As \(n, m \to \infty\), the right-hand side tends to \(2d^2 + 2d^2 - 4d^2 = 0\), so \(\|y_n - y_m\|^2 \to 0\) as well. Thus \(\{y_n\}\) is Cauchy. By completeness of \(\mathcal{H}\), it converges to some \(m \in \mathcal{H}\); by closedness of \(M\), \(m \in M\). The reverse triangle inequality gives \[ \bigl|\|x - y_n\| - \|x - m\|\bigr| \leq \|y_n - m\| \to 0, \] so \(\|x - y_n\| \to \|x - m\|\); combined with \(\|x - y_n\| \to d\), this yields \(\|x - m\| = d\).

Orthogonality. We show \(x - m \in M^\perp\), i.e. \(\langle x - m,\, y\rangle = 0\) for every \(y \in M\). If \(y = 0\), the claim is trivial, so assume \(y \neq 0\). For any \(t \in \mathbb{F}\), the point \(m + ty\) lies in \(M\), hence \[ \|x - m - ty\|^2 \;\geq\; d^2 \;=\; \|x - m\|^2. \] Expanding using the convention that \(\langle\cdot, \cdot\rangle\) is linear in the first slot and conjugate-linear in the second: \[ \|x - m - ty\|^2 \;=\; \langle x - m - ty,\; x - m - ty\rangle \] \[ =\; \|x - m\|^2 \;-\; \overline{t}\,\langle x - m,\, y\rangle \;-\; t\,\langle y,\, x - m\rangle \;+\; |t|^2\|y\|^2. \] Since \(\langle y,\, x - m\rangle = \overline{\langle x - m,\, y\rangle}\), the middle two terms combine as \[ -\overline{t}\,\langle x-m,y\rangle - t\,\overline{\langle x-m,y\rangle} = -2\operatorname{Re}\bigl(\overline{t}\,\langle x - m,\, y\rangle\bigr). \] Thus \[ \|x - m - ty\|^2 \;=\; \|x - m\|^2 \;-\; 2\operatorname{Re}\bigl(\overline{t}\,\langle x - m,\, y \rangle\bigr) \;+\; |t|^2\|y\|^2. \] Subtracting \(\|x - m\|^2\) and applying the inequality \(\|x-m-ty\|^2 \geq \|x-m\|^2\): \[ -2\operatorname{Re}\bigl(\overline{t}\,\langle x - m,\, y \rangle\bigr) + |t|^2\|y\|^2 \;\geq\; 0 \quad \text{for all } t \in \mathbb{F}. \] Choose \(t = s\,\langle x - m,\, y\rangle\) with \(s \in \mathbb{R}\), \(s > 0\). Then \(\overline{t} = s\,\overline{\langle x-m, y\rangle}\), so \[ \overline{t}\,\langle x - m,\, y\rangle \;=\; s\,\overline{\langle x-m,y\rangle}\,\langle x-m,y\rangle \;=\; s\,|\langle x - m,\, y\rangle|^2 \;\in\; \mathbb{R}_{\geq 0}. \] The \(\operatorname{Re}\) therefore acts as the identity, and \(|t|^2 = s^2\,|\langle x-m, y\rangle|^2\). Writing \(A := |\langle x - m,\, y\rangle|^2 \geq 0\), the inequality becomes \[ -2sA + s^2 A\,\|y\|^2 \;\geq\; 0, \qquad \text{i.e.} \qquad s\,A\,\bigl(s\|y\|^2 - 2\bigr) \;\geq\; 0. 
\] For any fixed \(s\) in the range \(0 < s < 2/\|y\|^2\) (valid since \(y \neq 0\)), the factor \(s(s\|y\|^2 - 2)\) is strictly negative; the inequality \(s A (s\|y\|^2 - 2) \geq 0\) then forces \(A \leq 0\). Combined with \(A \geq 0\), this gives \(A = 0\), hence \(\langle x - m,\, y\rangle = 0\). Since \(y \in M\) was arbitrary, \(x - m \in M^\perp\). Setting \(m^\perp := x - m\) gives the decomposition \(x = m + m^\perp\) with \(m \in M\) and \(m^\perp \in M^\perp\).

Uniqueness. Suppose \(x = m_1 + m_1^\perp = m_2 + m_2^\perp\) are two such decompositions with \(m_i \in M\) and \(m_i^\perp \in M^\perp\). Then \(m_1 - m_2 = m_2^\perp - m_1^\perp\); the left side lies in \(M\) and the right side in \(M^\perp\), so the common vector \(v := m_1 - m_2\) lies in \(M \cap M^\perp\). By the definition of \(M^\perp\), \(v\) is orthogonal to every element of \(M\) — in particular, to itself. Hence \(\|v\|^2 = \langle v, v\rangle = 0\), so \(v = 0\); that is, \(m_1 = m_2\) (and consequently \(m_1^\perp = m_2^\perp\)).
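The theorem just proved can be exercised numerically: in \(\mathbb{R}^n\) every finite-dimensional span is a closed subspace, and least squares computes the orthogonal projection. A sketch, where the matrix `A` and the point `x` are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 2))      # M = span of the columns of A (closed: finite-dim)
x = rng.normal(size=5)

# Orthogonal projection of x onto M via least squares: m = A c with
# c = argmin ||x - A c||.
c, *_ = np.linalg.lstsq(A, x, rcond=None)
m = A @ c
m_perp = x - m

# Decomposition x = m + m_perp with m in M and m_perp orthogonal to M.
assert np.allclose(A.T @ m_perp, 0.0)

# m is the closest point of M to x: no other element of M does better.
for _ in range(100):
    other = A @ rng.normal(size=2)
    assert np.linalg.norm(x - m) <= np.linalg.norm(x - other) + 1e-12
```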

With this tool in hand, we can now prove the Riesz Representation Theorem.

Proof of the Riesz Representation Theorem

1. Existence: If \(\varphi = 0\), we simply choose \(y_\varphi = 0\). Assume \(\varphi \neq 0\). The null space (kernel) \(\ker(\varphi) = \{x \in \mathcal{H} : \varphi(x) = 0\}\) is a closed linear subspace of \(\mathcal{H}\): linearity of \(\varphi\) gives the subspace structure, and continuity gives closedness (it is the preimage of the closed set \(\{0\}\)). Since \(\varphi \neq 0\), \(\ker(\varphi)\) is not the whole space. By the Projection Theorem, \[ \mathcal{H} = \ker(\varphi) \oplus \ker(\varphi)^\perp; \] since \(\ker(\varphi) \subsetneq \mathcal{H}\), the decomposition forces \(\ker(\varphi)^\perp \neq \{0\}\), so we may pick a nonzero \(z \in \ker(\varphi)^\perp\). Note that \(z \notin \ker(\varphi)\): otherwise \(z \in \ker(\varphi) \cap \ker(\varphi)^\perp\) would give \(\langle z, z \rangle = 0\), contradicting \(z \neq 0\). Hence \(\varphi(z) \neq 0\).

For any arbitrary \(x \in \mathcal{H}\), consider the vector \(v = x - \frac{\varphi(x)}{\varphi(z)}z\). Applying \(\varphi\) to \(v\) yields \(0\), meaning \(v \in \ker(\varphi)\). Because \(z \in \ker(\varphi)^\perp\), we have \(\langle v, z \rangle = 0\). Substituting \(v\) gives: \[ \left\langle x - \frac{\varphi(x)}{\varphi(z)}z,\; z \right\rangle = 0 \implies \langle x, z \rangle = \frac{\varphi(x)}{\varphi(z)}\|z\|^2. \] Solving for \(\varphi(x)\), we obtain \[ \varphi(x) = \tfrac{\varphi(z)}{\|z\|^2}\,\langle x, z \rangle. \] Pulling the scalar inside the second slot using conjugate-linearity — \(\langle x,\, \overline{\alpha}\, z \rangle = \alpha\, \langle x, z \rangle\) with \(\alpha = \varphi(z)/\|z\|^2\) — this becomes \[ \varphi(x) \;=\; \left\langle x,\; \frac{\overline{\varphi(z)}}{\|z\|^2}\, z \right\rangle. \] Thus, the required vector is \(y_\varphi = \dfrac{\overline{\varphi(z)}}{\|z\|^2}\, z\). (Over \(\mathbb{R}\), the conjugation is invisible and this step is just scalar rearrangement.)

2. Uniqueness: If two vectors \(y_1, y_2\) represent \(\varphi\), then \(\langle x, y_1 \rangle = \langle x, y_2 \rangle\) for all \(x\). This means \(\langle x, y_1 - y_2 \rangle = 0\) for all \(x\). Choosing \(x = y_1 - y_2\) yields \(\|y_1 - y_2\|^2 = 0\), so \(y_1 = y_2\).

3. Isometry: If \(\varphi = 0\), then \(y_\varphi = 0\) and both norms vanish trivially. Assume \(\varphi \neq 0\), so \(y_\varphi \neq 0\). By Cauchy-Schwarz, \[ |\varphi(x)| = |\langle x, y_\varphi \rangle| \leq \|x\|\,\|y_\varphi\| \] for all \(x\), so \[ \|\varphi\|_{\mathcal{H}^*} \leq \|y_\varphi\|. \] Conversely, taking \(x = y_\varphi / \|y_\varphi\|\) (a unit vector) gives \[ \varphi(x) \;=\; \left\langle \tfrac{y_\varphi}{\|y_\varphi\|},\; y_\varphi \right\rangle \;=\; \tfrac{\langle y_\varphi, y_\varphi \rangle}{\|y_\varphi\|} \;=\; \|y_\varphi\|, \] where we used first-slot linearity to pull the scalar \(1/\|y_\varphi\|\) out (note \(\|y_\varphi\| \in \mathbb{R}_{> 0}\), so no conjugation occurs). The result \(\|y_\varphi\|\) is real and positive, hence \(\|\varphi\|_{\mathcal{H}^*} \geq |\varphi(x)| = \|y_\varphi\|\). Combining both directions, \(\|\varphi\|_{\mathcal{H}^*} = \|y_\varphi\|\).
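For a real finite-dimensional example, the recipe in the existence proof can be followed step by step, and the isometry checked by sampling unit vectors. A NumPy sketch, where the coefficient vector `c` defining the functional is a hypothetical choice:

```python
import numpy as np

rng = np.random.default_rng(2)
c = np.array([3.0, -1.0, 2.0])
phi = lambda x: float(c @ x)          # a continuous linear functional on R^3

# Proof recipe (real case): pick a nonzero z in ker(phi)^perp; here
# ker(phi)^perp = span{c}, so z = c works. Then y = (phi(z)/||z||^2) z.
z = c
y = (phi(z) / (z @ z)) * z

x = rng.normal(size=3)
assert np.isclose(phi(x), x @ y)      # phi(x) = <x, y>

# Isometry: sup over unit vectors of |phi(x)| equals ||y||.
samples = rng.normal(size=(2000, 3))
units = samples / np.linalg.norm(samples, axis=1, keepdims=True)
sup_est = np.max(np.abs(units @ y))
assert sup_est <= np.linalg.norm(y) + 1e-9                        # Cauchy-Schwarz
assert np.isclose(phi(y / np.linalg.norm(y)), np.linalg.norm(y))  # attained
```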

Why Hilbert, Not Banach?

Let us highlight exactly where each assumption was used in the proof above:

  1. Continuity of \(\varphi\): Guarantees that \(\ker(\varphi)\) is closed (the preimage of the closed set \(\{0\}\)).
  2. Inner product (Hilbert structure): Enables the Projection Theorem (\(\mathcal{H} = M \oplus M^\perp\)), which produces the nonzero \(z \in \ker(\varphi)^\perp\).
  3. Completeness: Required by the Projection Theorem itself (the minimizing sequence must converge).

In a general Banach space, we have continuity but lack the inner product. Without orthogonal complements, the entire construction collapses — there is no canonical way to "represent" a functional as an element of the space. This is why the Riesz theorem is specific to Hilbert spaces.

The Isomorphism \(\mathcal{H}^* \cong \mathcal{H}\): A Hilbert Space is Self-Dual

The Riesz theorem can be restated more compactly as the existence of a canonical map between \(\mathcal{H}\) and \(\mathcal{H}^*\).

Definition: The Riesz Map \(\Phi\)

The Riesz map (or Riesz isomorphism) is the mapping \[ \Phi : \mathcal{H} \to \mathcal{H}^*, \qquad \Phi(y) = \langle \,\cdot\,, y \rangle. \] That is, \(\Phi(y)\) is the functional that acts on any \(x \in \mathcal{H}\) by \([\Phi(y)](x) = \langle x, y \rangle\).

The Riesz Representation Theorem tells us that \(\Phi\) has three structural properties:

  1. \(\Phi\) is isometric: \(\|\Phi(y)\|_{\mathcal{H}^*} = \|y\|_{\mathcal{H}}\) for all \(y\).
  2. \(\Phi\) is surjective: every \(\varphi \in \mathcal{H}^*\) equals \(\Phi(y_\varphi)\) for some \(y_\varphi\).
  3. Over \(\mathbb{R}\), \(\Phi\) is linear. Over \(\mathbb{C}\), \(\Phi\) is conjugate-linear (anti-linear): \(\Phi(\alpha y_1 + y_2) = \overline{\alpha}\,\Phi(y_1) + \Phi(y_2)\).

Therefore, \(\Phi\) is an isometric (anti-linear) isomorphism. In the real case, it is simply an isometric isomorphism. Either way, the conclusion is:

A Hilbert space is self-dual: \(\mathcal{H}^* \cong \mathcal{H}\).

This is the cleanest possible answer to the question we posed at the beginning of this page. While the duals of general Banach spaces can be exotic (\((L^\infty)^*\) is unwieldy), Hilbert spaces have the property that their dual is, via the Riesz isomorphism, a mirror image of themselves. The inner product provides the mirror.

Warning: Conjugate-Linearity in the Complex Case

Over \(\mathbb{C}\), the Riesz map \(\Phi\) is conjugate-linear, not linear. This means \(\Phi(\alpha y) = \overline{\alpha}\,\Phi(y)\), not \(\alpha\,\Phi(y)\). The conjugation arises from the sesquilinearity of the inner product: \(\langle x, \alpha y \rangle = \overline{\alpha}\langle x, y \rangle\). This subtlety is precisely why physicists use Dirac's Bra-Ket notation (see below): the "Bra" \(\langle\phi|\) carries the conjugation automatically, preventing sign errors in quantum mechanical calculations.

Musical Isomorphisms: The \(\sharp\) and \(\flat\) Dictionary

The Riesz Representation Theorem gives us a canonical way to convert between a Hilbert space and its dual. In differential geometry, this conversion has an evocative name: the Musical Isomorphisms, denoted by the symbols \(\flat\) ("flat") and \(\sharp\) ("sharp") — borrowed from musical notation.

These names are not mere whimsy. They encode one of the deepest ideas in modern geometry: the inner product (or more generally, the metric tensor) serves as a "dictionary" that translates between two fundamentally different types of geometric objects.

Definition: The Flat Map \(\flat\) (Vector → Covector)

Let \(\mathcal{H}\) be a Hilbert space with inner product \(\langle \cdot, \cdot \rangle\). The flat map is the mapping \[ \flat : \mathcal{H} \to \mathcal{H}^*, \qquad y \;\mapsto\; y^\flat \] where \(y^\flat\) is the continuous linear functional defined by \[ y^\flat(x) \;=\; \langle x,\, y \rangle \quad \text{for all } x \in \mathcal{H}. \] Musically: \(\flat\) lowers a vector to a covector (just as \(\flat\) lowers a musical note by a half-step).

Definition: The Sharp Map \(\sharp\) (Covector → Vector)

The sharp map is the inverse of \(\flat\): \[ \sharp : \mathcal{H}^* \to \mathcal{H}, \qquad \varphi \;\mapsto\; \varphi^\sharp \] where \(\varphi^\sharp\) is the unique element \(y_\varphi \in \mathcal{H}\) guaranteed by the Riesz Representation Theorem: \[ \varphi(x) \;=\; \langle x,\, \varphi^\sharp \rangle \quad \text{for all } x \in \mathcal{H}. \] Musically: \(\sharp\) raises a covector to a vector (just as \(\sharp\) raises a musical note by a half-step).

The Riesz theorem guarantees that \(\flat\) and \(\sharp\) are inverses of each other, and both are isometries: \[ (\flat \circ \sharp) = \text{id}_{\mathcal{H}^*}, \qquad (\sharp \circ \flat) = \text{id}_{\mathcal{H}}, \qquad \|y^\flat\|_{\mathcal{H}^*} = \|y\|_{\mathcal{H}}. \]
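In coordinates, with the inner product encoded by a positive-definite matrix \(G\), both maps and the properties above can be checked directly. A NumPy sketch (the specific `G`, `v`, `w` are illustrative choices):

```python
import numpy as np

G = np.array([[2.0, 1.0],
              [1.0, 3.0]])          # metric: symmetric positive definite
G_inv = np.linalg.inv(G)

flat = lambda v: G @ v              # lower the index: v_i = g_ij v^j
sharp = lambda w: G_inv @ w         # raise the index: w^i = g^ij w_j

v = np.array([1.0, -2.0])
w = np.array([0.5, 4.0])

# flat and sharp are mutually inverse.
assert np.allclose(sharp(flat(v)), v)
assert np.allclose(flat(sharp(w)), w)

# Isometry: the G-norm of v equals the dual (G^{-1}-)norm of v-flat.
norm_v = np.sqrt(v @ G @ v)
dual_norm = np.sqrt(flat(v) @ G_inv @ flat(v))
assert np.isclose(norm_v, dual_norm)
```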

Connection to Geometric Deep Learning: The Metric as a Dictionary

In flat Euclidean space \(\mathbb{R}^n\) with the standard inner product \(\langle x, y \rangle = x^T y\), the flat map is simply \(y^\flat = y^T\) (transpose a column vector into a row vector), and the sharp map transposes it back. The "dictionary" is the identity matrix — trivial.

But on a curved manifold (the setting of Geometric Deep Learning), the inner product at each point is encoded by a metric tensor \(g_{ij}\). The flat and sharp maps become: \[ (v^\flat)_i = g_{ij}\, v^j, \qquad (\omega^\sharp)^i = g^{ij}\, \omega_j. \] Here, converting between vectors (tangent space) and covectors (cotangent space) depends on the geometry of the space. This is why the choice of metric matters for optimization on manifolds — different metrics yield different "gradient" vectors from the same differential.

Index Notation: Raising and Lowering Indices

In differential geometry and physics, the Musical Isomorphisms are expressed in index notation (Einstein summation convention). This notation makes the role of the metric tensor explicit and will be essential when we reach the manifold chapters.

| Object | Lives in | Index Position | Examples |
|---|---|---|---|
| Vector | \(T_p M\) (tangent space) | Upper index: \(v^i\) | velocity, tangent vector |
| Covector | \(T_p^* M\) (cotangent space) | Lower index: \(\omega_i\) | differential \(df\), force |
| Flat \(\flat\) ("lower the index") | \(V \to V^*\) | \(v_i = g_{ij}\, v^j\) | contract with metric |
| Sharp \(\sharp\) ("raise the index") | \(V^* \to V\) | \(\omega^i = g^{ij}\, \omega_j\) | contract with inverse metric |

The mnemonic is simple: \(\flat\) lowers the index (superscript \(\to\) subscript) using \(g_{ij}\), while \(\sharp\) raises the index (subscript \(\to\) superscript) using \(g^{ij}\). In Euclidean space where \(g_{ij} = \delta_{ij}\), raising and lowering are invisible — the components don't change. On a curved manifold, they are nontrivial operations that depend on the geometry.

Concrete Example: How the Metric Changes the Dictionary

Example: \(\flat\) and \(\sharp\) on \(\mathbb{R}^2\) with a Non-Euclidean Metric

Consider \(\mathbb{R}^2\) equipped with the metric tensor \(g = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}\), whose inverse is \(g^{-1} = \frac{1}{5}\begin{bmatrix} 3 & -1 \\ -1 & 2 \end{bmatrix}\).

Flat map \(\flat\) (vector \(\to\) covector): Let \(v = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\) be the standard basis vector \(e_1\). Then: \[ v^\flat = g \cdot v = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \end{bmatrix}. \] In index notation: \(v_i = g_{ij} v^j\), giving \(v_1 = 2,\; v_2 = 1\). Compare: with the Euclidean metric (\(g = \mathbf{I}\)), we would have \(v^\flat = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\) — unchanged.

Sharp map \(\sharp\) (covector \(\to\) vector): Let \(\omega = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\) be a covector. Then: \[ \begin{align*} \omega^\sharp = g^{-1} \cdot \omega &= \frac{1}{5}\begin{bmatrix} 3 & -1 \\ -1 & 2 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} \\\\ &= \frac{1}{5}\begin{bmatrix} 2 \\ 1 \end{bmatrix} \\\\ &= \begin{bmatrix} 0.4 \\ 0.2 \end{bmatrix}. \end{align*} \] With the Euclidean metric, we would simply have \(\omega^\sharp = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\). The same covector \(\omega\) produces a different vector depending on the metric — this is the geometric content of "the gradient depends on the metric."
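The arithmetic of this example can be verified in a few lines of NumPy:

```python
import numpy as np

g = np.array([[2.0, 1.0],
              [1.0, 3.0]])
g_inv = np.linalg.inv(g)           # equals (1/5) [[3, -1], [-1, 2]]

# Flat: e1-flat = g e1 = (2, 1).
e1 = np.array([1.0, 0.0])
assert np.allclose(g @ e1, [2.0, 1.0])

# Sharp: omega-sharp = g^{-1} omega = (0.4, 0.2).
omega = np.array([1.0, 1.0])
assert np.allclose(g_inv @ omega, [0.4, 0.2])

# With the Euclidean metric both maps reduce to the identity.
assert np.allclose(np.eye(2) @ e1, e1)
```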

The Gradient is a Covector

We have now assembled all the machinery to state what is perhaps the single most important conceptual correction for anyone transitioning from standard machine learning to Geometric Deep Learning:

The derivative \(df\) of a function \(f\) is not a vector. It is a covector — an element of the dual space (cotangent space). What we call "the gradient" \(\nabla f\) is the vector obtained by applying the sharp map \(\sharp\) to \(df\).

Let us make this precise. Consider a differentiable function \(f: \mathcal{H} \to \mathbb{R}\) on a Hilbert space. At each point \(x\), the Fréchet derivative (or differential) \(df_x\) is a continuous linear functional: \[ df_x : \mathcal{H} \to \mathbb{R}, \qquad df_x(h) = \lim_{t \to 0} \frac{f(x + th) - f(x)}{t}. \] This \(df_x\) lives in \(\mathcal{H}^*\), not in \(\mathcal{H}\). It is a covector.

The gradient \(\nabla f(x)\) is defined as the Riesz representative of this covector:

Definition: The Gradient as \((df)^\sharp\)

Let \(f: \mathcal{H} \to \mathbb{R}\) be Fréchet differentiable at \(x\). The gradient of \(f\) at \(x\) is defined as the unique vector \(\nabla f(x) \in \mathcal{H}\) satisfying \[ \nabla f(x) \;=\; (df_x)^\sharp, \] or equivalently, by the Riesz Representation Theorem: \[ df_x(h) \;=\; \langle h,\, \nabla f(x) \rangle \quad \text{for all } h \in \mathcal{H}. \]

Connection to Gradient Descent on Manifolds

In standard \(\mathbb{R}^d\) with the Euclidean inner product, the sharp map is the identity, so \(\nabla f = df\) "looks like" a vector — and we never notice the distinction. The familiar gradient descent update \(\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla \mathcal{L}\) works precisely because the Euclidean metric makes \(\sharp\) invisible.

But on a Riemannian manifold \((M, g)\), the gradient depends on the metric: \[ (\nabla f)^i \;=\; g^{ij} \frac{\partial f}{\partial x^j} \;=\; g^{ij}\,(df)_j. \] This is exactly the \(\sharp\) map in coordinates! Different metrics \(g\) yield different gradients from the same differential \(df\). This is the mathematical foundation of the Natural Gradient (Amari, 1998): by replacing the Euclidean metric \(g_{ij} = \delta_{ij}\) with the Fisher Information Matrix \(F_{ij}\), we obtain the natural gradient \(\tilde{\nabla} \mathcal{L} = F^{-1} \nabla \mathcal{L}\), which is the steepest descent direction with respect to the Fisher-Rao metric on the statistical manifold. Whether this metric is the operationally preferable choice depends on the loss landscape and the optimization objective; the Fisher-Rao metric measures distance between distributions, which is the relevant scale for likelihood-based objectives, but is not universally the "correct" notion of distance for every learning problem.
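The preconditioning step \(\tilde{\nabla} \mathcal{L} = F^{-1} \nabla \mathcal{L}\) can be illustrated with a toy diagonal Fisher matrix; the numbers below are hypothetical, chosen to exaggerate how the metric reshapes the descent direction:

```python
import numpy as np

grad = np.array([1.0, 1.0])          # components of the covector dL
F = np.array([[10.0, 0.0],
              [0.0, 0.1]])           # assumed Fisher matrix: uneven curvature

# Natural gradient = (dL)^sharp with respect to the Fisher metric.
nat_grad = np.linalg.solve(F, grad)

# The stiff direction is damped, the flat direction amplified.
assert np.allclose(nat_grad, [0.1, 10.0])
```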

The Unifying Perspective: Every "Gradient" is \((df)^\sharp\)

The deepest insight of this chapter is that three seemingly different optimization algorithms — standard gradient descent, Natural Gradient, and Newton's method — are all the same formula applied with different choices of inner product (metric):

| Metric \(g\) | Gradient \((df)^\sharp\): \((\nabla f)^i = g^{ij}(df)_j\) | Algorithm |
|---|---|---|
| \(\mathbf{I}\) (Euclidean identity) | \(\nabla f = \mathbf{I}^{-1} df = df\) | SGD, Adam |
| \(\mathbf{F}\) (Fisher Information) | \(\tilde{\nabla} f = \mathbf{F}^{-1} \nabla f\) | Natural Gradient |
| \(\mathbf{H}\) (Hessian) | \(\Delta \boldsymbol{\theta} = \mathbf{H}^{-1} \nabla f\) | Newton's Method |

Connection to Geometric Deep Learning: One Formula, Many Geometries

Every row in this table is the same abstract operation: apply the sharp map \(\sharp\) (raise the index) to the covector \(df\). The only difference is which inner product defines \(\sharp\). This is the power of the Riesz-Musical Isomorphism framework: it reveals that choosing an optimization algorithm is equivalent to choosing a geometry on parameter space.

A caveat is in order: these three "metrics" differ in geometric status. The Fisher Information Matrix \(\mathbf{F}\) is a true Riemannian metric on the statistical manifold — by Cencov's theorem, it is essentially the unique metric invariant under reparametrization of the model. The Hessian \(\mathbf{H}\), by contrast, depends on the specific loss function and is not guaranteed to be positive definite (it fails at saddle points), so it serves as a bona fide inner product only near strict local minima. We will revisit this distinction rigorously when we reach Riemannian Metrics and Information Geometry later in the curriculum.

Example: How the Metric Changes the Gradient

Consider a function \(f(x_1, x_2) = x_1^2 + 3x_1 x_2\) on \(\mathbb{R}^2\). The differential (covector) is the row vector of partial derivatives: \[ df = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2x_1 + 3x_2 & 3x_1 \end{bmatrix}. \]

Case 1: Standard Euclidean Metric (Identity Matrix). If the inner product is standard (\(g = \mathbf{I}\)), the sharp map is just the transpose: \[ \nabla f_{\text{Euclidean}} = \mathbf{I}^{-1} (df)^T = \begin{bmatrix} 2x_1 + 3x_2 \\ 3x_1 \end{bmatrix}. \]

Case 2: A Curved Manifold Metric. Suppose the geometry of our space is defined by a metric tensor \(g = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}\). The sharp map now requires the inverse metric \(g^{-1} = \frac{1}{5}\begin{bmatrix} 3 & -1 \\ -1 & 2 \end{bmatrix}\): \[ \begin{align*} \nabla f_{\text{Manifold}} &= g^{-1} (df)^T \\\\ &= \frac{1}{5}\begin{bmatrix} 3 & -1 \\ -1 & 2 \end{bmatrix} \begin{bmatrix} 2x_1 + 3x_2 \\ 3x_1 \end{bmatrix} \\\\ &= \frac{1}{5}\begin{bmatrix} 3x_1 + 9x_2 \\ 4x_1 - 3x_2 \end{bmatrix}. \end{align*} \]

Notice that the gradient vector points in a completely different direction than the Euclidean case, even though the function \(f\) and its differential \(df\) are identical. The covector \(df\) measures how fast the function changes; the gradient \(\nabla f\) points to the "steepest ascent" — but "steepest" depends entirely on how distance is measured by the metric \(g\).
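The two gradients in this example can be compared numerically at a sample point, say \((x_1, x_2) = (1, 1)\):

```python
import numpy as np

def df(x1, x2):
    # Differential of f(x1, x2) = x1^2 + 3 x1 x2, as a covector of partials.
    return np.array([2 * x1 + 3 * x2, 3 * x1])

g = np.array([[2.0, 1.0],
              [1.0, 3.0]])

cov = df(1.0, 1.0)                          # df at (1, 1): [5, 3]

grad_euclid = cov                           # sharp with g = I is the identity
grad_manifold = np.linalg.solve(g, cov)     # sharp with metric g: g^{-1} (df)^T

assert np.allclose(grad_euclid, [5.0, 3.0])
assert np.allclose(grad_manifold, [12 / 5, 1 / 5])

# Same covector, genuinely different directions under the two metrics.
cos = grad_euclid @ grad_manifold / (
    np.linalg.norm(grad_euclid) * np.linalg.norm(grad_manifold))
assert cos < 1.0
```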

Bra-Ket Notation, Double Duals & Reflexivity

The Riesz Representation Theorem tells us that Hilbert spaces enjoy a perfect symmetry between vectors and covectors. Before we close this chapter, let us explore two important extensions of this idea: Dirac's Bra-Ket notation (the physicist's language for the Riesz isomorphism) and the concept of the double dual and reflexivity (which tells us when a general Banach space shares this self-dual property).

Dirac's Bra-Ket Notation

In quantum mechanics, Paul Dirac introduced a notation that is, in hindsight, a compact encoding of the Riesz isomorphism. The notation splits the inner product \(\langle \cdot \,,\, \cdot \rangle\) into two halves:

Definition: Bra-Ket (Dirac Notation)

Let \(\mathcal{H}\) be a Hilbert space.

A Ket \(|\psi\rangle\) denotes a vector \(\psi \in \mathcal{H}\).

A Bra \(\langle\phi|\) denotes the continuous linear functional \(\phi^\flat \in \mathcal{H}^*\) defined by \[ \langle\phi| \;:\; |\psi\rangle \;\mapsto\; \langle \psi,\, \phi \rangle. \]

The bracket (inner product) is the evaluation of the Bra on the Ket: \[ \langle\phi|\psi\rangle \;=\; \phi^\flat(\psi) \;=\; \langle \psi,\, \phi \rangle. \]

In this notation, the \(\flat\) map is simply the operation of "turning a Ket into a Bra": \(|\phi\rangle \mapsto \langle\phi|\). The \(\sharp\) map reverses it: \(\langle\phi| \mapsto |\phi\rangle\).

Connection to Attention Mechanisms (Transformers)

The Query-Key-Value mechanism in Transformer architectures has a natural Bra-Ket interpretation. A Query vector \(|\mathbf{q}\rangle\) is a Ket in the embedding space, while a Key vector, transposed as \(\langle\mathbf{k}|\), acts as a Bra. The attention score \(\langle\mathbf{k}|\mathbf{q}\rangle\) is literally a bracket — the action of a covector on a vector. The full self-attention matrix \(\text{softmax}(QK^T / \sqrt{d_k})V\) computes all such pairings simultaneously.
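A minimal NumPy sketch of scaled dot-product attention, reading each score as a bracket \(\langle\mathbf{k}|\mathbf{q}\rangle\); the dimensions are illustrative, not tied to any particular model:

```python
import numpy as np

def attention(Q, K, V):
    # Each score Q[i] . K[j] is a bracket: the covector ("Bra") given by
    # the transposed key acting on the query vector ("Ket").
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # all pairings at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V

rng = np.random.default_rng(3)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)
assert out.shape == (4, 8)
```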

The Hahn-Banach Theorem

Before discussing the double dual, we need a fundamental result that guarantees \(\mathcal{X}^*\) always contains "enough" functionals to distinguish points of \(\mathcal{X}\) and to detect their norms. This is the Hahn-Banach Theorem, one of the three pillars of functional analysis (alongside the Uniform Boundedness Principle and the Open Mapping Theorem).

Theorem: Hahn-Banach Extension (Normed Space Version)

Let \(\mathcal{X}\) be a normed space over \(\mathbb{F}\), and let \(\mathcal{Y} \subseteq \mathcal{X}\) be a linear subspace equipped with the norm inherited from \(\mathcal{X}\). For every bounded linear functional \(\varphi_0 \in \mathcal{Y}^*\), there exists an extension \(\varphi \in \mathcal{X}^*\) satisfying \[ \varphi|_{\mathcal{Y}} = \varphi_0, \qquad \|\varphi\|_{\mathcal{X}^*} = \|\varphi_0\|_{\mathcal{Y}^*}. \] That is, any bounded linear functional defined on a subspace can be extended to the ambient space with the same norm.

We take this theorem as an input and derive the concrete application we need below.
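In a Hilbert space the norm-preserving extension can be written down explicitly, which makes for a hedged finite-dimensional sketch (in general Banach spaces, Hahn-Banach only asserts existence). Here \(\mathcal{Y} = \operatorname{span}\{y_0\}\) in \((\mathbb{R}^2, \|\cdot\|_2)\), and the extension pairs against the unit vector spanning \(\mathcal{Y}\):

```python
import numpy as np

# Finite-dimensional sketch of a norm-preserving extension: in
# (R^2, Euclidean norm), a functional on the line Y = span{y0}
# extends to all of R^2 via the inner product with u = y0/||y0||.
y0 = np.array([3.0, 4.0])
u = y0 / np.linalg.norm(y0)                      # unit vector spanning Y

phi0 = lambda beta: beta * np.linalg.norm(y0)    # phi0(beta * y0) = beta * ||y0||
phi = lambda x: x @ u                            # extension: phi(x) = <x, u>

# phi agrees with phi0 on Y:
for beta in (-2.0, 0.5, 3.0):
    assert np.isclose(phi(beta * y0), phi0(beta))

# and ||phi|| = 1 = ||phi0||: |phi(x)| <= ||x||_2 with equality at x = u.
assert np.isclose(abs(phi(u)), 1.0)
```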

Corollary: Norming Functionals

For every \(x \in \mathcal{X}\), \[ \|x\|_{\mathcal{X}} \;=\; \sup\bigl\{\,|\varphi(x)| \;:\; \varphi \in \mathcal{X}^*,\; \|\varphi\|_{\mathcal{X}^*} \leq 1\,\bigr\}, \] and this supremum is attained. Concretely, for every nonzero \(x\) there exists \(\varphi \in \mathcal{X}^*\) with \(\|\varphi\|_{\mathcal{X}^*} = 1\) and \(\varphi(x) = \|x\|_{\mathcal{X}}\). Any such \(\varphi\) is called a norming functional for \(x\).

Proof

Let \(\alpha := \sup\{|\psi(x)| : \psi \in \mathcal{X}^*,\; \|\psi\|\leq 1\}\). For any such \(\psi\), the operator-norm bound gives \(|\psi(x)| \leq \|\psi\|\,\|x\| \leq \|x\|\), so \(\alpha \leq \|x\|\).

For the reverse inequality (and simultaneous attainment), the case \(x = 0\) is trivial (\(\alpha = 0 = \|x\|\)). For \(x \neq 0\), let \(\mathcal{Y} := \operatorname{span}\{x\} = \{\beta x : \beta \in \mathbb{F}\}\), a one-dimensional subspace of \(\mathcal{X}\), and define \(\varphi_0 : \mathcal{Y} \to \mathbb{F}\) by \[ \varphi_0(\beta x) \;:=\; \beta\,\|x\|_{\mathcal{X}}. \] Then \(\varphi_0\) is linear, and for every \(\beta \in \mathbb{F}\), \[ |\varphi_0(\beta x)| \;=\; |\beta|\,\|x\| \;=\; \|\beta x\|, \] which shows \(\|\varphi_0\|_{\mathcal{Y}^*} = 1\) (the identity \(|\varphi_0(y)| = \|y\|\) holds for every \(y \in \mathcal{Y}\)). By the Hahn-Banach Extension Theorem, there exists \(\varphi \in \mathcal{X}^*\) with \(\varphi|_{\mathcal{Y}} = \varphi_0\) and \(\|\varphi\|_{\mathcal{X}^*} = \|\varphi_0\|_{\mathcal{Y}^*} = 1\). In particular, \(\varphi(x) = \varphi_0(x) = \|x\|\). Since \(\|\varphi\| = 1\), this \(\varphi\) is admissible in the supremum defining \(\alpha\), giving \(\alpha \geq |\varphi(x)| = \|x\|\). Combined with \(\alpha \leq \|x\|\) above, \(\alpha = \|x\|\), and the supremum is attained by \(\varphi\).

Why This Matters: \(\mathcal{X}^*\) is Always "Rich Enough"

The Hahn-Banach theorem is what prevents the dual space from being pathologically small. For any Banach space \(\mathcal{X}\), no matter how exotic, \(\mathcal{X}^*\) is guaranteed to:

  • Separate points: if \(x \neq y\), applying the Corollary to \(x - y \neq 0\) produces \(\varphi \in \mathcal{X}^*\) with \(\varphi(x - y) = \|x - y\| \neq 0\). By linearity, \(\varphi(x) \neq \varphi(y)\).
  • Detect the norm: the Corollary is itself the statement that \(\|x\|_{\mathcal{X}}\) is recoverable from \(\mathcal{X}^*\) — namely, as the supremum of \(|\varphi(x)|\) over the unit ball of \(\mathcal{X}^*\), and moreover this supremum is attained.

Both facts will be used repeatedly: the first drives uniqueness of weak limits in the next chapter, while the second is the mechanism behind the canonical embedding \(J : \mathcal{X} \to \mathcal{X}^{**}\) being isometric (below).

The Double Dual and Reflexive Spaces

Since \(\mathcal{X}^*\) is a Banach space, it has its own dual: the double dual \(\mathcal{X}^{**} = (\mathcal{X}^*)^*\). Elements of \(\mathcal{X}^{**}\) are continuous linear functionals that act on covectors.

There is a natural way to embed our original space \(\mathcal{X}\) into \(\mathcal{X}^{**}\). For any vector \(x \in \mathcal{X}\), we define an evaluation functional \(J(x) \in \mathcal{X}^{**}\) that simply "evaluates" each covector at \(x\):

Definition: Canonical Embedding \(J: \mathcal{X} \to \mathcal{X}^{**}\)

For each \(x \in \mathcal{X}\), define \(J(x) \in \mathcal{X}^{**}\) by \[ [J(x)](\varphi) \;=\; \varphi(x) \quad \text{for all } \varphi \in \mathcal{X}^*. \] The map \(J: \mathcal{X} \to \mathcal{X}^{**}\) is linear and isometric (\(\|J(x)\|_{\mathcal{X}^{**}} = \|x\|_{\mathcal{X}}\)). However, \(J\) is not always surjective: there may be elements of \(\mathcal{X}^{**}\) that are not the evaluation at any point of \(\mathcal{X}\).

Proof that \(J\) is linear and isometric

Linearity. For \(x, y \in \mathcal{X}\), \(\lambda \in \mathbb{F}\), and any \(\varphi \in \mathcal{X}^*\), \[ [J(\lambda x + y)](\varphi) \;=\; \varphi(\lambda x + y) \;=\; \lambda\,\varphi(x) + \varphi(y) \;=\; \lambda\,[J(x)](\varphi) + [J(y)](\varphi), \] so \(J(\lambda x + y) = \lambda J(x) + J(y)\).

Isometry. Fix \(x \in \mathcal{X}\). For any \(\varphi \in \mathcal{X}^*\), \[ |[J(x)](\varphi)| \;=\; |\varphi(x)| \;\leq\; \|\varphi\|_{\mathcal{X}^*}\,\|x\|_{\mathcal{X}}, \] so \(\|J(x)\|_{\mathcal{X}^{**}} = \sup_{\|\varphi\| \leq 1} |[J(x)](\varphi)| \leq \|x\|\). For the reverse inequality, the case \(x = 0\) is trivial; if \(x \neq 0\), the norming functional corollary produces \(\psi \in \mathcal{X}^*\) with \(\|\psi\|_{\mathcal{X}^*} = 1\) and \(\psi(x) = \|x\|\). Then \[ \|J(x)\|_{\mathcal{X}^{**}} \;\geq\; |[J(x)](\psi)| \;=\; |\psi(x)| \;=\; \|x\|. \] Combining both directions, \(\|J(x)\|_{\mathcal{X}^{**}} = \|x\|_{\mathcal{X}}\).
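The two halves of the isometry argument can be checked numerically in a finite-dimensional sketch. Take \((\mathbb{R}^3, \|\cdot\|_1)\) with dual norm \(\|\cdot\|_\infty\); sampled covectors from the dual unit ball never exceed \(\|x\|_1\), and the norming functional \(\operatorname{sign}(x)\) attains it:

```python
import numpy as np

# Numerical sketch of ||J(x)|| = ||x|| for (R^3, ||.||_1), whose dual
# norm is ||.||_inf. J(x) acts on a covector phi by evaluation phi @ x.
rng = np.random.default_rng(1)
x = np.array([1.0, -2.0, 0.5])

# Sample covectors from the dual unit ball {phi : ||phi||_inf <= 1}:
phis = rng.uniform(-1.0, 1.0, size=(10000, 3))
values = np.abs(phis @ x)                        # |[J(x)](phi)| per sample

assert values.max() <= np.linalg.norm(x, 1) + 1e-12   # upper bound
# The norming functional phi = sign(x) attains the norm exactly:
assert np.isclose(np.abs(np.sign(x) @ x), np.linalg.norm(x, 1))
```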

Definition: Reflexive Space

A Banach space \(\mathcal{X}\) is called reflexive if the canonical embedding \(J: \mathcal{X} \to \mathcal{X}^{**}\) is surjective. In other words, \(\mathcal{X}\) is isometrically isomorphic to its double dual (\(\mathcal{X} \cong \mathcal{X}^{**}\)).

By the Riesz Representation Theorem, every Hilbert space is self-dual (\(\mathcal{H} \cong \mathcal{H}^*\), via the Riesz map, which is conjugate-linear in the complex case), so it is reflexive: applying the identification twice yields a genuinely linear isometric isomorphism \(\mathcal{H} \cong \mathcal{H}^{**}\). The spaces \(L^p\) and \(\ell^p\) for \(1 < p < \infty\) are also reflexive (their double duals cycle back through Hölder conjugation: \(p \to q \to p\)). However, \(L^1\), \(L^\infty\), and \(c_0\) (the space of sequences converging to zero) are not reflexive, a fact that creates genuine obstacles in infinite-dimensional optimization.
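As a concrete instance of the conjugation cycle, take \(p = 3/2\). From \(\frac{1}{p} + \frac{1}{q} = 1\) we get \(q = \frac{p}{p-1} = 3\), and conjugating again returns \(p\):

```latex
\[
(\ell^{3/2})^* \;\cong\; \ell^{3}, \qquad
(\ell^{3})^* \;\cong\; \ell^{3/2}, \qquad\text{so}\qquad
(\ell^{3/2})^{**} \;\cong\; \ell^{3/2},
\]
```

which is exactly the statement that the canonical embedding \(J\) is surjective for \(\ell^{3/2}\).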

Why Reflexivity Matters for Machine Learning Optimization

Reflexivity is the mathematical property that guarantees a form of compactness in infinite dimensions. By the Eberlein-Šmulian theorem, in a reflexive Banach space every bounded sequence has a weakly convergent subsequence. This means that if your optimization iterates (say, the parameters during training) stay bounded, they are guaranteed to have a weak limit point; combined with weak lower semicontinuity of the objective, this is the standard route to proving that minimizers exist. Because \(L^1\) is not reflexive, \(L^1\)-regularized optimization (like Lasso in infinite dimensions) is notoriously difficult to analyze and often lacks exact minimizers without special treatment.
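Why "weak" is the right notion here can be sketched with the classic example: the orthonormal basis \((e_n)\) of \(\ell^2\) is bounded and converges weakly to \(0\) (since \(\langle e_n, y \rangle = y_n \to 0\) for every fixed \(y \in \ell^2\)), yet it has no norm-convergent subsequence because \(\|e_n - e_m\| = \sqrt{2}\) for \(n \neq m\). A truncated numerical sketch:

```python
import numpy as np

# Sketch: the orthonormal basis (e_n) of l^2 (truncated to R^N) is
# bounded, converges weakly to 0, but has no norm-convergent subsequence.
N = 1000
y = 1.0 / np.arange(1, N + 1)        # a fixed square-summable sequence (truncated)

pairings = [y[n] for n in range(N)]  # <e_n, y> = y_n
assert abs(pairings[-1]) < 1e-2      # the pairings tend to 0 ...

e5, e9 = np.eye(N)[5], np.eye(N)[9]
assert np.isclose(np.linalg.norm(e5 - e9), np.sqrt(2))   # ... but no Cauchy subsequence in norm
```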

Looking Ahead: From Hilbert Spaces to Manifolds

Let us summarize the conceptual journey of this chapter with a roadmap:

  1. The Dual Space \(\mathcal{X}^*\): Collects all continuous linear "measurement tools" on \(\mathcal{X}\). It is always a Banach space.
  2. The Riesz Representation Theorem: In a Hilbert space, every measurement \(\varphi \in \mathcal{H}^*\) is secretly an inner product with a unique vector \(y_\varphi\). The space is self-dual.
  3. Musical Isomorphisms \(\sharp\) and \(\flat\): The Riesz map and its inverse provide a "dictionary" — powered by the inner product — between vectors and covectors.
  4. The Gradient is a Covector: The derivative \(df\) naturally lives in the dual. The gradient \(\nabla f = (df)^\sharp\) depends on the choice of inner product (metric).

This final point is the bridge to the geometry track of our curriculum. On smooth manifolds, the inner product varies from point to point (it becomes the Riemannian metric tensor \(g\)), the vectors live in the tangent space \(T_pM\), and the covectors live in the cotangent space \(T_p^*M\). The Musical Isomorphisms, now point-dependent, become the central tool for translating between forces and velocities, between differentials and gradients.

In the next chapter, however, we remain in the functional analysis setting and ask: what happens when we weaken the notion of convergence? The Weak Topologies and the Banach-Alaoglu Theorem will give us the compactness results we need to guarantee that optimization problems on infinite-dimensional spaces actually have solutions.