Weak Topologies & Banach-Alaoglu

Contents: The Crisis of Infinite Dimensions · Weak Convergence · The Weak* Topology · The Banach-Alaoglu Theorem · Reflexivity & The Existential Guarantee

The Crisis of Infinite Dimensions

At the end of Dual Spaces & Riesz Representation, we introduced the concept of reflexive spaces and hinted that reflexivity would play a crucial role in guaranteeing the existence of optimal solutions. In this chapter, we make that promise precise. But first, let us confront the problem head-on.

In Intro to Functional Analysis, we proved a result that should have been deeply alarming (recalled here for reference):

Recall: Failure of Compactness in Infinite Dimensions

By the Compactness of the Unit Ball Theorem, the closed unit ball \(\overline{B} = \{x \in \mathcal{X} : \|x\| \leq 1\}\) in a normed space \(\mathcal{X}\) is compact if and only if \(\mathcal{X}\) is finite-dimensional.

In \(\mathbb{R}^n\), the Heine-Borel Theorem tells us that every closed and bounded set is compact. Combined with the Extreme Value Theorem, this guarantees that any continuous function on such a set attains its minimum — the theoretical bedrock of finite-dimensional optimization.

But in infinite-dimensional function spaces — exactly the spaces where modern machine learning lives (parameter spaces of overparameterized networks, Reproducing Kernel Hilbert Spaces, spaces of probability distributions) — this guarantee vanishes. The unit ball is closed and bounded, but not compact. Sequences can wander forever inside a bounded region without ever converging.

The Optimization Nightmare

Let us make this concrete. Suppose we are searching for the best function \(f^*\) in an infinite-dimensional Hilbert space \(\mathcal{H}\) (e.g., an RKHS) that minimizes a loss functional \(\mathcal{L} : \mathcal{H} \to \mathbb{R}\). Our training algorithm generates a sequence of candidate models \(\{f_n\}_{n=1}^\infty\) with decreasing losses: \[ \mathcal{L}(f_1) \geq \mathcal{L}(f_2) \geq \mathcal{L}(f_3) \geq \cdots \to \inf_{f \in \mathcal{H}} \mathcal{L}(f). \] Even if we bound the parameters via regularization (\(\|f_n\| \leq R\)), the lack of compactness means this sequence need not have a convergent subsequence in the norm topology. The infimum of the loss might not be achieved by any actual function in the space.

A finite-dimensional analogy captures the flavor: consider minimizing \(f(x) = x\) on the open interval \((0, 1)\). The infimum is \(0\), but no \(x \in (0,1)\) achieves it; the minimum "escapes" through the open boundary. The feasible set is bounded but not closed — it lacks compactness, so EVT does not apply. In infinite dimensions, the pathology is far worse: even with a closed and bounded set, the minimum can still escape because closed-and-bounded no longer implies compact.

A Concrete Escape: The Sliding Bump

Consider the standard orthonormal basis \(\{e_n\}_{n=1}^\infty\) in \(\ell^2\) (the space of square-summable sequences). Every \(e_n\) lies on the unit sphere: \(\|e_n\| = 1\) for all \(n\). Yet for \(m \neq n\), we have \(\|e_m - e_n\| = \sqrt{2}\). The sequence lives in the closed unit ball — bounded and closed — but no subsequence converges in the norm topology. Each vector is "far" from every other in the \(\sqrt{2}\) sense.
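The \(\sqrt{2}\) separation is easy to check numerically. The sketch below truncates \(\ell^2\) to \(\mathbb{R}^{10}\) (an illustrative assumption; any two basis vectors already live in a common finite-dimensional slice, so nothing is lost):

```python
import numpy as np

# Truncate ell^2 to R^10; the rows of the identity matrix are e_1, ..., e_10.
E = np.eye(10)

# Every basis vector lies on the unit sphere...
assert np.allclose(np.linalg.norm(E, axis=1), 1.0)

# ...yet any two distinct basis vectors are exactly sqrt(2) apart,
# so no subsequence of {e_n} can be Cauchy in norm.
print(np.linalg.norm(E[0] - E[5]))  # 1.4142135623730951
```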

In a continuous function space like \(L^2([0,1])\), this same pathology takes a vivid form known as a "concentrating bump" (or Dirac-like) sequence: \[ f_n(x) = \sqrt{n}\,\mathbf{1}_{[0, 1/n]}(x). \] Each \(f_n\) has unit norm (\(\|f_n\|_{L^2} = 1\)), but the bumps become infinitely tall and thin, concentrating at the origin. No subsequence converges in the \(L^2\) norm.
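Because each \(f_n\) is a scaled indicator, the relevant \(L^2\) integrals have closed forms, which a short script can evaluate (the specific indices below are arbitrary choices for illustration):

```python
# f_n = sqrt(n) * 1_[0, 1/n]:  ||f_n||^2 = integral of n over [0, 1/n] = 1.
def norm_sq(n):
    return n * (1.0 / n)

# For m >= n the supports overlap on [0, 1/m], where f_n * f_m = sqrt(n*m),
# giving <f_n, f_m> = sqrt(n*m) / m = sqrt(n/m).
def inner(n, m):
    return (n * m) ** 0.5 / max(n, m)

print(norm_sq(10), norm_sq(1000))      # 1.0 1.0: every bump sits on the unit sphere
print(inner(10, 1000))                 # 0.1: distant bumps are nearly orthogonal
print(norm_sq(10) + norm_sq(1000) - 2 * inner(10, 1000))  # 1.8 = ||f_10 - f_1000||^2

# Against the fixed observer g = 1 on [0,1]:  <f_n, g> = sqrt(n)/n = 1/sqrt(n) -> 0,
# a first hint that the sequence does converge in a weaker sense.
print([1 / n ** 0.5 for n in (1, 100, 10000)])  # [1.0, 0.1, 0.01]
```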

The strong (norm) topology in infinite dimensions is too strict. It demands that sequences converge "in every direction simultaneously," which is an impossible standard when there are infinitely many independent directions. If we want compactness back, we must weaken the topology.

Weak Convergence: Convergence by Consensus

The resolution is elegant: instead of asking "does \(\|x_n - x\| \to 0\)?", we ask a softer question — "does every continuous linear functional agree that \(x_n\) is approaching \(x\)?" This is exactly where the dual space \(\mathcal{X}^*\) enters the story.

Definition: Strong (Norm) Convergence

A sequence \(\{x_n\}\) in a normed space \(\mathcal{X}\) converges strongly (or in norm) to \(x \in \mathcal{X}\), written \(x_n \to x\), if \[ \|x_n - x\| \;\to\; 0. \]

Definition: Weak Convergence

A sequence \(\{x_n\}\) in a normed space \(\mathcal{X}\) converges weakly to \(x \in \mathcal{X}\), written \(x_n \rightharpoonup x\), if \[ \varphi(x_n) \;\to\; \varphi(x) \quad \text{for every } \varphi \in \mathcal{X}^*. \]

Let us unpack this definition carefully. Each \(\varphi \in \mathcal{X}^*\) is a continuous linear functional — a "measurement device" that extracts a single real (or complex) number from a vector. Weak convergence says: no matter which measurement device you use, the readings \(\varphi(x_n)\) eventually stabilize at \(\varphi(x)\). The sequence converges "by consensus of all observers."

Strong Implies Weak (but Not Conversely)

If \(x_n \to x\) strongly, then for every \(\varphi \in \mathcal{X}^*\): \[ |\varphi(x_n) - \varphi(x)| = |\varphi(x_n - x)| \leq \|\varphi\|_{\mathcal{X}^*} \|x_n - x\| \to 0. \] Here we used the boundedness of \(\varphi\) (equivalently, its continuity). So strong convergence implies weak convergence. In finite dimensions, the converse also holds (the weak and norm topologies coincide on finite-dimensional normed spaces). But in infinite dimensions, the converse is false — and this gap is precisely what makes weak convergence useful.

The Canonical Counterexample: Orthonormal Bases

Example: Weak Convergence to Zero in a Hilbert Space

Let \(\{e_n\}_{n=1}^\infty\) be an orthonormal basis for a Hilbert space \(\mathcal{H}\). We claim that \(e_n \rightharpoonup 0\) (weakly), even though \(\|e_n\| = 1\) for all \(n\) (so \(e_n \not\to 0\) strongly).

By the Riesz Representation Theorem, every \(\varphi \in \mathcal{H}^*\) has the form \(\varphi(x) = \langle x, y_\varphi \rangle\) for a unique \(y_\varphi \in \mathcal{H}\). Therefore: \[ \varphi(e_n) = \langle e_n, y_\varphi \rangle. \] But \(\{\langle e_n, y_\varphi \rangle\}_{n=1}^\infty\) are the coordinates of \(y_\varphi\) with respect to the orthonormal sequence \(\{e_n\}\), and by Bessel's inequality: \[ \sum_{n=1}^\infty |\langle e_n, y_\varphi \rangle|^2 \;\leq\; \|y_\varphi\|^2 \;<\; \infty. \] Convergence of the series forces its terms to zero: \(\langle e_n, y_\varphi \rangle \to 0\) for every \(y_\varphi \in \mathcal{H}\). Since this holds for every \(\varphi \in \mathcal{H}^*\), we conclude \(e_n \rightharpoonup 0\).

This is remarkable. The basis vectors \(e_n\) never get closer to the zero vector in norm (\(\|e_n - 0\| = 1\) always), yet from the perspective of every measurement device, they are fading to zero. Each \(e_n\) "points in a new direction" that no fixed functional can permanently detect.
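The fading measurements are easy to visualize in \(\ell^2\): fix one observer \(y\) and read off \(\langle e_n, y \rangle\). The choice \(y_k = 1/k\) below is an arbitrary square-summable example, truncated to finite length for illustration:

```python
import numpy as np

# Fix an observer y in ell^2 (truncated): y_k = 1/k is square-summable.
N = 2000
y = 1.0 / np.arange(1, N + 1)

# <e_n, y> is simply the n-th coordinate of y: every reading fades to zero.
print([float(y[n - 1]) for n in (1, 10, 100, 1000)])  # [1.0, 0.1, 0.01, 0.001]

# Bessel's inequality in action: the squared readings have finite total mass
# (here sum 1/k^2 < pi^2/6), which is exactly what forces the coordinates to vanish.
print(float(np.sum(y ** 2)) < np.pi ** 2 / 6)  # True
```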

Key Properties of Weak Convergence

Weak convergence inherits some — but not all — of the properties of strong convergence. Let us record the essential facts.

Theorem: Properties of Weak Convergence

Let \(\{x_n\}\) be a sequence in a normed space \(\mathcal{X}\).

(a) Uniqueness of weak limits: If \(x_n \rightharpoonup x\) and \(x_n \rightharpoonup y\), then \(x = y\).

(b) Weak limits are bounded: If \(x_n \rightharpoonup x\), then \(\sup_n \|x_n\| < \infty\) (by the Uniform Boundedness Principle).

(c) Norm lower semicontinuity: If \(x_n \rightharpoonup x\), then \[ \|x\| \;\leq\; \liminf_{n \to \infty} \|x_n\|. \]

(d) Mazur's Theorem: If \(x_n \rightharpoonup x\), then \(x\) lies in the closed convex hull of \(\{x_n\}\). That is, suitable convex combinations of the \(x_n\) converge to \(x\) strongly.

Proof:

(a) Uniqueness. If \(x_n \rightharpoonup x\) and \(x_n \rightharpoonup y\), then for every \(\varphi \in \mathcal{X}^*\), \(\varphi(x) = \lim \varphi(x_n) = \varphi(y)\), so \(\varphi(x - y) = 0\). If \(x - y \neq 0\), the norming functional corollary of Hahn-Banach produces \(\varphi \in \mathcal{X}^*\) with \(\varphi(x - y) = \|x - y\| \neq 0\), contradicting the above. Hence \(x = y\).

(b) Boundedness. Consider the family of evaluation functionals \(\{J(x_n)\}_n \subseteq \mathcal{X}^{**}\), where \(J: \mathcal{X} \to \mathcal{X}^{**}\) is the canonical embedding. For each fixed \(\varphi \in \mathcal{X}^*\), the sequence \(\{J(x_n)(\varphi)\}_n = \{\varphi(x_n)\}_n\) converges (by weak convergence \(x_n \rightharpoonup x\)) and is therefore bounded in \(\mathbb{F}\). Thus \(\{J(x_n)\}_n\) is a pointwise-bounded family of bounded linear functionals on the Banach space \(\mathcal{X}^*\). The Uniform Boundedness Principle yields a uniform bound \(\sup_n \|J(x_n)\|_{\mathcal{X}^{**}} < \infty\). Since \(J\) is an isometry — the norming functional corollary of Hahn-Banach gives \(\|J(x_n)\|_{\mathcal{X}^{**}} = \|x_n\|\) — we conclude \(\sup_n \|x_n\| < \infty\).

(c) Norm lower semicontinuity. By the norming functional corollary of Hahn-Banach, for each nonzero \(x\) there exists \(\varphi \in \mathcal{X}^*\) with \(\|\varphi\|_{\mathcal{X}^*} = 1\) and \(\varphi(x) = \|x\|\). For this \(\varphi\), \[ \|x\| = |\varphi(x)| = \lim_{n \to \infty} |\varphi(x_n)| \leq \liminf_{n \to \infty} \|\varphi\|_{\mathcal{X}^*} \|x_n\| = \liminf_{n \to \infty} \|x_n\|. \] The first equality uses \(\varphi(x) = \|x\|\) (a real positive value); the second uses \(\varphi(x_n) \to \varphi(x)\) (weak convergence) together with continuity of \(|\cdot|\); the inequality uses \(|\varphi(x_n)| \leq \|\varphi\|_{\mathcal{X}^*} \|x_n\|\) and \(\|\varphi\|_{\mathcal{X}^*} = 1\). The case \(x = 0\) is trivial.

(d) Mazur's Theorem. This is a consequence of the geometric Hahn-Banach theorem (convex-set separation version — a distinct result from the extension theorem used for parts (a)–(c) above); we omit the proof here. \(\square\)

Property (c) is especially important for optimization. It tells us that the norm can only decrease (or stay the same) under weak limits — it cannot jump up. We will return to this semicontinuity principle in the section on reflexivity, where it becomes the key ingredient for proving that minima actually exist.
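Property (c) can be strict: the norm can genuinely drop at a weak limit. A minimal numeric sketch, truncating \(\ell^2\) to \(\mathbb{R}^{1000}\) and using the standard example \(x_n = x + e_n \rightharpoonup x\) (the particular vector \(x\) is an arbitrary choice for illustration):

```python
import numpy as np

N = 1000
x = np.zeros(N)
x[0] = 3.0                       # the weak limit, ||x|| = 3

def x_n(n):
    # x_n = x + e_n ⇀ x: every fixed functional eventually cannot see e_n
    v = x.copy()
    v[n] += 1.0
    return v

# ||x_n|| = sqrt(9 + 1) for every n >= 1 (the bump is orthogonal to x), so
# liminf ||x_n|| = sqrt(10) ≈ 3.162 > 3 = ||x||: the norm drops at the limit.
print([float(np.linalg.norm(x_n(n))) for n in (10, 100, 500)])
print(float(np.linalg.norm(x)))  # 3.0
```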

The Weak Topology

Weak convergence arises from a genuine topology on \(\mathcal{X}\), called the weak topology \(\sigma(\mathcal{X}, \mathcal{X}^*)\). It is the coarsest (i.e., fewest open sets) topology on \(\mathcal{X}\) that makes every functional \(\varphi \in \mathcal{X}^*\) continuous.

Definition: Weak Topology \(\sigma(\mathcal{X}, \mathcal{X}^*)\)

The weak topology on a normed space \(\mathcal{X}\) is the coarsest topology making every functional \(\varphi \in \mathcal{X}^*\) continuous (as a map \(\mathcal{X} \to \mathbb{F}\)). Equivalently, it is the topology generated by the family of seminorms \[ p_\varphi(x) = |\varphi(x)|, \quad \varphi \in \mathcal{X}^*. \] A net \(x_\alpha\) converges to \(x\) in the weak topology if and only if \(\varphi(x_\alpha) \to \varphi(x)\) for all \(\varphi \in \mathcal{X}^*\).

Because the weak topology has fewer open sets than the norm topology, it has more compact sets — which is exactly what we need.

Connection to Machine Learning: Regularization as Weak Compactness

When we impose \(L^2\) regularization (\(\|\boldsymbol{\theta}\|_2 \leq R\)) during training, we are confining our parameter sequence to a bounded set. In finite dimensions, the Heine-Borel Theorem guarantees that this set is compact (in the strong topology), so the Extreme Value Theorem applies directly. In infinite dimensions, the same bound does not yield strong compactness — but it does yield weak compactness in reflexive spaces (as we will see in the reflexivity section). The weak topology is precisely the machinery that lets regularization "work" in function spaces.

The Weak* Topology: Convergence of Observers

So far, we have weakened convergence for elements of \(\mathcal{X}\) by using functionals from \(\mathcal{X}^*\). Now we flip the perspective: what does it mean for a sequence of functionals to converge?

The dual space \(\mathcal{X}^*\) is itself a Banach space (as established in Bounded Linear Operators), so it carries a norm topology (strong convergence of functionals) and a weak topology (using functionals on \(\mathcal{X}^*\), i.e., elements of \(\mathcal{X}^{**}\)). But there is a third, and often more natural, topology — one that uses only the original space \(\mathcal{X}\) as the set of "test objects."

Definition: Weak* (Weak-Star) Convergence

A sequence \(\{\varphi_n\}\) in \(\mathcal{X}^*\) converges in the weak* topology to \(\varphi \in \mathcal{X}^*\), written \(\varphi_n \xrightarrow{w^*} \varphi\), if \[ \varphi_n(x) \;\to\; \varphi(x) \quad \text{for every } x \in \mathcal{X}. \]

Observe the asymmetry: weak convergence tests elements of \(\mathcal{X}\) against all of \(\mathcal{X}^*\), while weak* convergence tests elements of \(\mathcal{X}^*\) against only the evaluation functionals from \(\mathcal{X}\) (via the canonical embedding \(J: \mathcal{X} \to \mathcal{X}^{**}\)), rather than all of \(\mathcal{X}^{**}\).

Definition: Weak* Topology \(\sigma(\mathcal{X}^*, \mathcal{X})\)

The weak* topology on \(\mathcal{X}^*\) is the coarsest topology making the evaluation maps \[ \operatorname{ev}_x : \mathcal{X}^* \to \mathbb{F}, \quad \operatorname{ev}_x(\varphi) = \varphi(x) \] continuous for every \(x \in \mathcal{X}\). It is generated by the family of seminorms \[ p_x(\varphi) = |\varphi(x)|, \quad x \in \mathcal{X}. \]

The Hierarchy of Topologies on \(\mathcal{X}^*\)

We now have three topologies on the dual space \(\mathcal{X}^*\), each progressively coarser (fewer open sets, hence more compact sets):

  • Strong (norm), \(\|\cdot\|_{\mathcal{X}^*}\): \(\varphi_n \to \varphi\) means \(\|\varphi_n - \varphi\|_{\mathcal{X}^*} \to 0\). Tested by the operator norm.
  • Weak, \(\sigma(\mathcal{X}^*, \mathcal{X}^{**})\): \(\varphi_n \to \varphi\) means \(\Psi(\varphi_n) \to \Psi(\varphi)\) for all \(\Psi \in \mathcal{X}^{**}\). Tested by all of \(\mathcal{X}^{**}\).
  • Weak*, \(\sigma(\mathcal{X}^*, \mathcal{X})\): \(\varphi_n \to \varphi\) means \(\varphi_n(x) \to \varphi(x)\) for all \(x \in \mathcal{X}\). Tested by only \(J(\mathcal{X}) \subseteq \mathcal{X}^{**}\).

The key ordering is: \[ \text{Strong convergence} \;\Longrightarrow\; \text{Weak convergence} \;\Longrightarrow\; \text{Weak* convergence}. \] The first implication is strict in most infinite-dimensional spaces (e.g., all infinite-dimensional Hilbert spaces and \(L^p\) for \(1 < p < \infty\), as illustrated by the orthonormal basis example above); the second is strict precisely when \(\mathcal{X}\) is not reflexive (in reflexive spaces, weak and weak* coincide on \(\mathcal{X}^*\), as we will see below). The weak* topology is the "gentlest" notion of convergence — it asks the least of a sequence of functionals — and consequently has the best compactness properties.

When Do Weak and Weak* Coincide?

The weak topology on \(\mathcal{X}^*\) uses all of \(\mathcal{X}^{**}\) as test functionals, while the weak* topology uses only the image \(J(\mathcal{X}) \subseteq \mathcal{X}^{**}\). These two topologies coincide if and only if \(J(\mathcal{X}) = \mathcal{X}^{**}\) — that is, if and only if \(\mathcal{X}\) is reflexive.

Proposition: Weak = Weak* in Reflexive Spaces

If \(\mathcal{X}\) is a reflexive Banach space, then the weak topology and the weak* topology on \(\mathcal{X}^*\) coincide: \[ \sigma(\mathcal{X}^*, \mathcal{X}^{**}) = \sigma(\mathcal{X}^*, \mathcal{X}). \]

Proof:

By the general construction of weak topologies, both the weak topology \(\sigma(\mathcal{X}^*, \mathcal{X}^{**})\) and the weak* topology \(\sigma(\mathcal{X}^*, \mathcal{X})\) are defined as the coarsest topologies on \(\mathcal{X}^*\) making a specified family of evaluation maps continuous: the former uses the family \(\{\Psi : \Psi \in \mathcal{X}^{**}\}\), while the latter uses \(\{J(x) : x \in \mathcal{X}\} \subseteq \mathcal{X}^{**}\), where \(J\) is the canonical embedding. Reflexivity means precisely that \(J(\mathcal{X}) = \mathcal{X}^{**}\), so the two families of test maps coincide, and the induced topologies coincide as well. \(\square\)

Intuition: Strong vs. Weak vs. Weak*

Imagine evaluating a sequence of classifiers \(\{h_n\}\) in a hypothesis space.

Strong convergence (\(h_n \to h\)) demands that \(h_n\) and \(h\) agree "uniformly" over the entire input domain — the worst-case discrepancy shrinks to zero.

Weak convergence (\(h_n \rightharpoonup h\)) demands only that every "test" (linear functional) applied to \(h_n\) stabilizes — like requiring convergence of every moment of a distribution, without requiring the PDFs to converge pointwise.

Weak* convergence (\(\varphi_n \xrightarrow{w^*} \varphi\)) is even gentler: the functionals need only agree when evaluated at each fixed point. It is "pointwise convergence for linear functionals."
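The gap between weak and weak* shows up concretely in the non-reflexive pair \(c_0\) and \(\ell^1 = (c_0)^*\): the functionals \(\varphi_n = e_n\) converge weak* to zero but not weakly. A sketch, truncating sequences to finite length (an illustrative assumption):

```python
import numpy as np

N = 2000

def phi(n):
    # phi_n = e_n in ell^1 = (c_0)*, acting by phi_n(x) = x_n
    v = np.zeros(N)
    v[n - 1] = 1.0
    return v

# Weak* test: against a fixed x in c_0 (coordinates -> 0), the readings vanish.
x = 1.0 / np.arange(1, N + 1)                       # x_k = 1/k lies in c_0
print([float(phi(n) @ x) for n in (1, 10, 1000)])   # [1.0, 0.1, 0.001] -> 0

# Weak test: (ell^1)* = ell^infty also contains Psi = (1, 1, 1, ...), and
# Psi(phi_n) = 1 for every n, so phi_n does NOT converge weakly to 0.
psi = np.ones(N)
print(float(psi @ phi(1000)))                       # 1.0
```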

The Banach-Alaoglu Theorem

We now arrive at the central result of this chapter — the theorem that recovers compactness in infinite dimensions by leveraging the weak* topology.

Theorem: Banach-Alaoglu

Let \(\mathcal{X}\) be a normed space. The closed unit ball of the dual space, \[ \overline{B}_{\mathcal{X}^*} = \{\varphi \in \mathcal{X}^* : \|\varphi\|_{\mathcal{X}^*} \leq 1\}, \] is compact in the weak* topology \(\sigma(\mathcal{X}^*, \mathcal{X})\).

Banach-Alaoglu is the Heine-Borel theorem of infinite dimensions. We traded topological "tightness" (the norm) for topological "gentleness" (weak*), and in return, we recovered the compactness that infinite dimensions had stolen from us.

Proof (via Tychonoff's Theorem)

The proof is a beautiful application of point-set topology — specifically, Tychonoff's Theorem. This is one of the rare moments where abstract topology directly solves a concrete problem in analysis.

Proof:

Step 1: Embed into a product space. For each \(x \in \mathcal{X}\), a functional \(\varphi \in \overline{B}_{\mathcal{X}^*}\) satisfies \(|\varphi(x)| \leq \|x\|\). Let \[ D_x = \{z \in \mathbb{F} : |z| \leq \|x\|\} \] — a compact interval \([-\|x\|, \|x\|] \subseteq \mathbb{R}\) when \(\mathbb{F} = \mathbb{R}\), or a compact closed disk in \(\mathbb{C}\) when \(\mathbb{F} = \mathbb{C}\). Identifying each \(\varphi\) with the function \(x \mapsto \varphi(x)\), we view \(\overline{B}_{\mathcal{X}^*}\) as a subset of the product space \[ \prod_{x \in \mathcal{X}} D_x \;\subseteq\; \mathbb{F}^{\mathcal{X}}. \]

Step 2: Apply Tychonoff. By Tychonoff's Theorem, the product \(\prod_{x \in \mathcal{X}} D_x\) is compact in the product topology, since each factor \(D_x\) is compact in \(\mathbb{F}\).

Step 3: Identify the weak* topology with the subspace topology. Under the identification of Step 1, the evaluation map \(\operatorname{ev}_x : \mathcal{X}^* \to \mathbb{F}\), \(\varphi \mapsto \varphi(x)\) coincides with the restriction of the projection \(\pi_x : \mathbb{F}^{\mathcal{X}} \to \mathbb{F}\), \(f \mapsto f(x)\). Both topologies in question — the weak* topology (defined as the coarsest topology making each \(\operatorname{ev}_x\) continuous) and the subspace topology inherited from the product topology on \(\mathbb{F}^{\mathcal{X}}\) (which is the coarsest topology making each \(\pi_x\) continuous) — are therefore generated by the same family of test maps and so coincide. Restricting both topologies to the subset \(\overline{B}_{\mathcal{X}^*}\) preserves this coincidence, so the weak* topology on \(\overline{B}_{\mathcal{X}^*}\) (inherited from \(\mathcal{X}^*\)) agrees with the topology inherited from \(\prod_x D_x\).

Step 4: Show \(\overline{B}_{\mathcal{X}^*}\) is closed in \(\prod_x D_x\). Under the embedding of Step 1, a point \(\varphi \in \prod_x D_x\) lies in (the image of) \(\overline{B}_{\mathcal{X}^*}\) if and only if it is linear:

  • (\(\Longrightarrow\)) Elements of \(\overline{B}_{\mathcal{X}^*}\) are linear functionals by definition.
  • (\(\Longleftarrow\)) Conversely, any linear \(\varphi \in \prod_x D_x\) already satisfies \(|\varphi(x)| \leq \|x\|\) for all \(x\) (by membership in \(\prod_x D_x\)), hence \(\|\varphi\|_{\mathcal{X}^*} \leq 1\) and \(\varphi \in \overline{B}_{\mathcal{X}^*}\).

It therefore suffices to show that the set of linear functionals in \(\prod_x D_x\) is closed. For each fixed \(x, y \in \mathcal{X}\) and \(\alpha, \beta \in \mathbb{F}\), the linearity condition \[ \varphi(\alpha x + \beta y) - \alpha \varphi(x) - \beta \varphi(y) = 0 \] defines a closed subset of \(\mathbb{F}^{\mathcal{X}}\): the map \(\varphi \mapsto \varphi(\alpha x + \beta y) - \alpha \varphi(x) - \beta \varphi(y)\) is continuous in the product topology (it depends on only three coordinates, namely \(\alpha x + \beta y\), \(x\), and \(y\)), and we are taking the preimage of \(\{0\} \subseteq \mathbb{F}\), which is closed. The image of \(\overline{B}_{\mathcal{X}^*}\) is the intersection over all such \((x, y, \alpha, \beta)\) of these closed sets, hence closed in \(\prod_x D_x\). A closed subset of a compact space is compact, so \(\overline{B}_{\mathcal{X}^*}\) is weak*-compact. \(\square\)

The elegance of this proof lies in the conceptual shift: we reinterpret a functional \(\varphi : \mathcal{X} \to \mathbb{F}\) as a single point in the product space \(\mathbb{F}^{\mathcal{X}}\), where each "coordinate" is the value \(\varphi(x)\). The constraint \(\|\varphi\| \leq 1\) confines this point to a product of compact factors (intervals when \(\mathbb{F} = \mathbb{R}\), disks when \(\mathbb{F} = \mathbb{C}\)), and Tychonoff's theorem does the rest.

Generalization: Arbitrary Radius

By a simple rescaling argument, the Banach-Alaoglu theorem extends to balls of any radius:

Corollary: Compactness of Arbitrary Dual Balls

For any \(R > 0\), the closed ball \(\{\varphi \in \mathcal{X}^* : \|\varphi\|_{\mathcal{X}^*} \leq R\}\) is weak*-compact.

A Critical Caveat: Metrizability

The weak* topology on \(\overline{B}_{\mathcal{X}^*}\) is metrizable (i.e., can be described by a metric) if and only if \(\mathcal{X}\) is separable (has a countable dense subset). When \(\mathcal{X}\) is separable, compactness and sequential compactness coincide, so we can extract weak*-convergent subsequences — which is the form most useful in applications. The spaces \(L^p(\Omega)\) and \(\ell^p\) are separable for \(1 \leq p < \infty\), so this condition is satisfied in most ML-relevant settings.

Connection to Probability & GANs: Convergence of Distributions

In the theory of Generative Adversarial Networks, the generator produces a sequence of probability distributions \(\{\mu_n\}\) aiming to match a target distribution \(\mu\). Distributions can be viewed as continuous linear functionals on spaces of continuous functions (via integration: \(\mu(f) = \int f \, d\mu\)). The Banach-Alaoglu theorem, applied to the dual of \(C_0(\mathbb{R}^d)\) (continuous functions vanishing at infinity), guarantees that any sequence of finite signed measures with bounded total variation has a weak*-convergent subsequence. This is one of the mathematical foundations of weak convergence of measures — though additional conditions (such as tightness, captured by Prokhorov's theorem) are needed to ensure the limit remains a probability distribution rather than losing mass at infinity. Banach-Alaoglu thus provides the existence of subsequential limits; the analytical refinements needed for full convergence theory in machine learning belong to measure theory and are developed in the probability section.

Reflexivity & The Existential Guarantee

The Banach-Alaoglu theorem gives us compactness in the dual space \(\mathcal{X}^*\) under the weak* topology. But in many optimization problems, we want compactness in the original space \(\mathcal{X}\). This is where reflexivity becomes decisive.

From Banach-Alaoglu to Weak Compactness of \(\mathcal{X}\)

Suppose \(\mathcal{X}\) is reflexive, so the canonical embedding \(J: \mathcal{X} \to \mathcal{X}^{**}\) is a surjective isometric isomorphism. Apply the Banach-Alaoglu theorem to the Banach space \(\mathcal{X}^*\): the closed unit ball of \((\mathcal{X}^*)^* = \mathcal{X}^{**}\) is compact in the weak* topology \(\sigma(\mathcal{X}^{**}, \mathcal{X}^*)\). Under \(J\), the closed unit ball \(\overline{B}_{\mathcal{X}}\) corresponds to \(\overline{B}_{\mathcal{X}^{**}}\) (since \(J\) is an isometry), and the weak topology on \(\mathcal{X}\) corresponds to the weak* topology on \(\mathcal{X}^{**}\) (since the test functionals — elements of \(\mathcal{X}^*\) — act on \(\mathcal{X}\) and on \(\mathcal{X}^{**}\) by the same formula \(\varphi(x) = J(x)(\varphi)\)). Transferring the conclusion through this isomorphism gives weak compactness of \(\overline{B}_{\mathcal{X}}\).

Theorem: Weak Compactness of the Unit Ball (Reflexive Case)

Let \(\mathcal{X}\) be a reflexive Banach space. Then the closed unit ball \[ \overline{B}_{\mathcal{X}} = \{x \in \mathcal{X} : \|x\| \leq 1\} \] is compact in the weak topology \(\sigma(\mathcal{X}, \mathcal{X}^*)\).

Moreover, the converse holds: if \(\overline{B}_{\mathcal{X}}\) is weakly compact, then \(\mathcal{X}\) is reflexive (this is a consequence of James' Theorem).

In other words: reflexivity is equivalent to weak compactness of the unit ball. The spaces that matter most for machine learning — all Hilbert spaces (by the Riesz Representation Theorem, \(\mathcal{H} \cong \mathcal{H}^*\)), and \(L^p\) for \(1 < p < \infty\) — are reflexive.

From Compactness to Subsequences: Eberlein-Šmulian

Weak compactness of \(\overline{B}_{\mathcal{X}}\) is, on its face, a purely topological statement about open covers. But in applications — particularly optimization — we need to extract subsequences from bounded sequences. In a general topological space, compactness (every open cover has a finite subcover) and sequential compactness (every sequence has a convergent subsequence) are not equivalent. The following theorem is one of the most remarkable results in Banach space theory: for the weak topology, they are equivalent.

Theorem: Eberlein-Šmulian

Let \(\mathcal{X}\) be a Banach space and \(K \subseteq \mathcal{X}\) a weakly closed subset. The following are equivalent:

  • \(K\) is weakly compact (compact in the weak topology \(\sigma(\mathcal{X}, \mathcal{X}^*)\));
  • \(K\) is weakly sequentially compact (every sequence in \(K\) has a weakly convergent subsequence with limit in \(K\)).

In particular, in a reflexive Banach space, every bounded sequence \(\{x_n\}\) has a weakly convergent subsequence; if \(\|x_n\| \leq R\) for all \(n\), then the weak limit \(x_*\) satisfies \(\|x_*\| \leq R\) as well.

The proof is substantial and relies on a delicate interplay between separability considerations and the structure of the bidual. What makes Eberlein-Šmulian remarkable is that it holds in arbitrary Banach spaces — with no separability or metrizability assumption. In general topological spaces, compactness and sequential compactness are distinct notions; it is the specific geometry of Banach spaces that makes them coincide for the weak topology.

Weak Lower Semicontinuity: The Missing Piece

Compactness alone guarantees convergent subsequences, but to conclude that a loss functional attains its minimum, we need one more ingredient: the loss must "behave well" under weak limits.

Definition: Weakly Lower Semicontinuous (w-lsc) Functional

A functional \(\mathcal{L}: \mathcal{X} \to \mathbb{R} \cup \{+\infty\}\) is weakly lower semicontinuous if whenever \(x_n \rightharpoonup x\), \[ \mathcal{L}(x) \;\leq\; \liminf_{n \to \infty} \mathcal{L}(x_n). \]

In words: the functional value at the weak limit is no worse than the best asymptotic value along the sequence. The loss cannot "jump up" when we pass to weak limits.

The following classical result gives us a large supply of weakly lower semicontinuous functionals:

Proposition: Convex + Strongly lsc \(\Longrightarrow\) Weakly lsc

Let \(\mathcal{X}\) be a normed space and \(\mathcal{L}: \mathcal{X} \to \mathbb{R} \cup \{+\infty\}\). If \(\mathcal{L}\) is convex and lower semicontinuous in the norm topology, then \(\mathcal{L}\) is weakly lower semicontinuous.

This is a consequence of Mazur's Theorem — or more precisely, its standard equivalent formulation: a convex norm-closed set is also weakly closed. Recall that a functional is lower semicontinuous in a given topology if and only if all of its sublevel sets are closed in that topology (a standard characterization in general topological spaces). Since \(\mathcal{L}\) is convex, each sublevel set \(\{x : \mathcal{L}(x) \leq \alpha\}\) is convex; since \(\mathcal{L}\) is norm-lsc, each sublevel set is norm-closed. By Mazur's Theorem, each is therefore weakly closed, and by the sublevel set characterization, \(\mathcal{L}\) is weakly lsc. In particular, the norm itself is always weakly lower semicontinuous (which we already recorded as property (c)).
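Convexity here is not a cosmetic hypothesis. Along \(e_n \rightharpoonup 0\), the convex functional \(\|\cdot\|\) behaves as property (c) promises, while the non-convex (but norm-continuous) functional \(-\|\cdot\|^2\) jumps up at the weak limit. A truncated numeric sketch:

```python
import numpy as np

N = 500

def e(n):
    v = np.zeros(N)
    v[n] = 1.0
    return v

zero = np.zeros(N)

# Convex + norm-continuous => weakly lsc:  L1(0) = 0 <= liminf L1(e_n) = 1.  OK.
L1 = lambda v: float(np.linalg.norm(v))
print([L1(e(n)) for n in (10, 100, 400)], L1(zero))   # [1.0, 1.0, 1.0] 0.0

# Non-convex, norm-continuous, NOT weakly lsc:
# L2(0) = 0 > liminf L2(e_n) = -1 -- the value jumps UP at the weak limit.
L2 = lambda v: float(0.0 - np.linalg.norm(v) ** 2)
print([L2(e(n)) for n in (10, 100, 400)], L2(zero))   # [-1.0, -1.0, -1.0] 0.0
```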

The Fundamental Theorem of Optimization in Reflexive Spaces

We can now assemble the full argument. This is the theorem that makes regularization rigorous.

Theorem: Existence of Minimizers

Let \(\mathcal{H}\) be a Hilbert space (or more generally, a reflexive Banach space), let \(\Omega = \{f \in \mathcal{H} : \|f\| \leq R\}\) for some \(R > 0\), and let \(\mathcal{L}: \mathcal{H} \to \mathbb{R}\) be a weakly lower semicontinuous functional. Then \(\mathcal{L}\) attains its minimum on \(\Omega\): there exists \(f^* \in \Omega\) such that \[ \mathcal{L}(f^*) = \inf_{f \in \Omega} \mathcal{L}(f). \]

Proof:

Let \(m = \inf_{f \in \Omega} \mathcal{L}(f)\) and choose a minimizing sequence \(\{f_n\} \subseteq \Omega\) with \(\mathcal{L}(f_n) \to m\).

Since \(\mathcal{H}\) is reflexive, the closed unit ball is weakly compact (by the theorem above); by a linear rescaling \(f \mapsto f/R\), the closed ball \(\Omega\) of any radius \(R > 0\) is weakly compact as well. By the Eberlein-Šmulian theorem, weak compactness of \(\Omega\) coincides with weak sequential compactness, so the minimizing sequence \(\{f_n\} \subseteq \Omega\) has a weakly convergent subsequence \(f_{n_k} \rightharpoonup f^* \in \Omega\).

By weak lower semicontinuity: \[ \mathcal{L}(f^*) \;\leq\; \liminf_{k \to \infty} \mathcal{L}(f_{n_k}) = m. \] But \(f^* \in \Omega\), so \(\mathcal{L}(f^*) \geq m\) by definition of infimum. Hence \(\mathcal{L}(f^*) = m\). \(\square\)

Let us record what each ingredient contributed:

  1. Reflexivity (from Banach-Alaoglu + \(J\) surjective): Provides weak compactness of \(\Omega\).
  2. Eberlein-Šmulian theorem: Converts weak compactness into weak sequential compactness, so the minimizing sequence has a weakly convergent subsequence whose limit remains in \(\Omega\).
  3. Weak lower semicontinuity of \(\mathcal{L}\): Ensures the limit point actually achieves the infimal loss value.

Application: Why Weight Decay Works

Connection to Deep Learning: \(L^2\) Regularization and Weight Decay

In practice, training a model with \(L^2\) regularization amounts to minimizing \[ \mathcal{L}_\lambda(f) = \mathcal{L}_{\text{data}}(f) + \lambda \|f\|^2 \] over a Hilbert space \(\mathcal{H}\) (e.g., an RKHS for kernel methods, or the function space in which overparameterized networks are analyzed in the infinite-width limit). The regularization term \(\lambda \|f\|^2\) is convex and strongly continuous, hence weakly lower semicontinuous. If \(\mathcal{L}_{\text{data}}\) is also convex and continuous (as in linear/kernel regression, SVMs, and logistic regression), then the total loss \(\mathcal{L}_\lambda\) is weakly lsc. Moreover, the penalty \(\lambda \|f\|^2\) is coercive (it grows to \(\infty\) as \(\|f\| \to \infty\)), so we can restrict attention to a bounded sublevel set \(\Omega\), which is weakly compact by Banach-Alaoglu.

The Existence of Minimizers theorem then tells us: the optimal model exists. This is not merely an engineering heuristic — it is a rigorous mathematical theorem.
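In finite dimensions the guarantee is even constructive. The sketch below (an illustrative ridge-regression instance with synthetic data; dimensions and \(\lambda\) are arbitrary choices) minimizes a strongly convex regularized loss whose unique minimizer has a closed form:

```python
import numpy as np

# Ridge regression: L(theta) = ||A theta - b||^2 + lam * ||theta||^2.
# Convex + coercive, so the existence theorem applies; here the minimizer
# is even unique and solves the normal equations (A^T A + lam I) theta = A^T b.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
b = rng.normal(size=50)
lam = 0.1

theta_star = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ b)

def loss(theta):
    return float(np.sum((A @ theta - b) ** 2) + lam * np.sum(theta ** 2))

# Random perturbations never beat theta_star: the infimum is attained.
perturbed = [loss(theta_star + 0.1 * rng.normal(size=5)) for _ in range(5)]
print(loss(theta_star) <= min(perturbed))  # True
```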

The Non-Reflexive Pitfall: \(L^1\) Regularization

As noted in Dual Spaces, the space \(L^1\) is not reflexive (nor is its sequence-space counterpart \(\ell^1\), which underlies Lasso-type penalties on parameters). This means the closed unit ball of \(L^1\) is not weakly compact, and the argument above breaks down. In practice, \(L^1\)-regularized problems in infinite-dimensional settings may lack minimizers entirely, or require additional structural assumptions (e.g., finite-dimensionality of the effective parameter space) to guarantee solutions. This is one of the deep mathematical reasons why \(L^2\) regularization is better-behaved than \(L^1\) in function-space settings.

Looking Ahead: From Existence to Structure

We have now closed the "existence gap" in infinite-dimensional optimization. The chain of reasoning is:

  1. Banach & Hilbert Spaces: Built the spaces where models and functions live.
  2. Bounded Operators: Understood the well-behaved maps between these spaces.
  3. Dual Spaces & Riesz: Introduced measurement tools (functionals) and the duality machinery.
  4. This chapter: Used duality to weaken the topology, recovering compactness and proving existence of optimal solutions.

In the next chapter, we shift from existence to structure: we ask not merely "does an optimal solution exist?" but "what does it look like?" The Spectral Theorem will decompose operators into their fundamental modes — the infinite-dimensional generalization of eigendecomposition — providing the structural insights that power Kernel Methods, PCA, and the Graph Laplacian.