Introduction
Continuity is one of the most fundamental concepts in analysis because it allows us to
transfer properties such as connectivity and compactness from the domain to the range. In calculus,
you learned that a function is continuous if "small changes in input produce small changes in output."
With our metric space framework, we can now make this precise and extend it to functions between
arbitrary metric spaces, not just subsets of \(\mathbb{R}\).
Why does this matter for machine learning and optimization?
Neural networks are essentially compositions of continuous functions (linear layers and non-linear activations).
The "loss landscape" we navigate during training relies on continuity to ensure that we can essentially "slide down"
to a minimum. Without continuity, optimization methods like Gradient Descent would be fundamentally broken.
But not all notions of continuity are equal.
Uniform continuity provides stronger global guarantees, and
Lipschitz continuity, which bounds how fast a function can change is
essential for guaranteeing the convergence speed of these algorithms.
Continuity
In calculus, continuity is often taught as "no breaks in the graph." However, in the rigorous language of
modern analysis, continuity is about preserving structure. It ensures that "nearness" in the
domain translates to "nearness" in the range, without tearing the space apart.
We define continuity using the concept of open sets (neighborhoods). This structural definition is
superior because it captures the essence of topology without relying on specific distance calculations.
Definition: Continuous Function
Suppose \(X\) and \(Y\) are metric spaces, \(z \in X\) and \(f: X \to Y\). We say that \(f\) is
continuous at \(z\) in \(X\) if and only if for each open subset \(V\) of \(Y\)
with \(f(z) \in V\), there exists an open subset \(U\) of \(X\) with \(z \in U\) such that
\[
f(U) \subseteq V.
\]
If \(f\) is continuous at every point \(z \in X\), we say \(f\) is continuous on \(X\).
Insight: Mapping Neighborhoods
This definition says: "If you define a target neighborhood \(V\) around the output \(f(z)\),
I can always find a source neighborhood \(U\) around the input \(z\) that lands entirely within your target."
This guarantees that the function does not "tear" the space or make abrupt jumps.
While the open set definition gives us the structural view, in practical optimization we often need to
compute bounds using distances. In metric spaces, the structural definition is logically equivalent to
the familiar \(\epsilon\)-\(\delta\) formulation.
Theorem: Metric Characterization (\(\varepsilon\)-\(\delta\))
Let \((X, d)\) and \((Y, e)\) be metric spaces. A function \(f: X \to Y\) is continuous at \(z\)
(in the sense of the definition above) if and only if for every \(\varepsilon > 0\), there exists
\(\delta > 0\) such that
\[
d(x, z) < \delta \implies e(f(x), f(z)) < \varepsilon.
\]
Proof:
(Open-set \(\Rightarrow\) \(\varepsilon\)-\(\delta\)):
Given \(\varepsilon > 0\), take \(V := \mathcal{B}^Y[f(z); \varepsilon)\), which is an open subset of
\(Y\) containing \(f(z)\). By the open-set definition of continuity, there exists an open subset
\(U\) of \(X\) with \(z \in U\) and \(f(U) \subseteq V\). Since open balls form a basis for the
metric topology, there exists \(\delta > 0\) with \(\mathcal{B}^X[z; \delta) \subseteq U\). Then
\(d(x, z) < \delta\) implies \(x \in U\), so \(f(x) \in V\), i.e. \(e(f(x), f(z)) < \varepsilon\).
(\(\varepsilon\)-\(\delta\) \(\Rightarrow\) Open-set):
Let \(V\) be an open subset of \(Y\) with \(f(z) \in V\). By the basis property, there exists
\(\varepsilon > 0\) with \(\mathcal{B}^Y[f(z); \varepsilon) \subseteq V\). The \(\varepsilon\)-\(\delta\)
condition then yields \(\delta > 0\) such that \(d(x, z) < \delta\) implies
\(e(f(x), f(z)) < \varepsilon\). Setting \(U := \mathcal{B}^X[z; \delta)\), we have \(U\) open with
\(z \in U\) and \(f(U) \subseteq \mathcal{B}^Y[f(z); \varepsilon) \subseteq V\).
Theorem: Sequential Characterization
Let \((X, d)\) and \((Y, e)\) be metric spaces. A function \(f: X \to Y\) is continuous at \(z\)
if and only if, for every sequence \(\{x_n\}\) in \(X\) with \(x_n \to z\), we have
\(f(x_n) \to f(z)\) in \(Y\).
Proof:
(\(\Rightarrow\)):
Suppose \(f\) is continuous at \(z\) and \(x_n \to z\). Let \(V\) be any open subset of \(Y\)
containing \(f(z)\). By continuity, there exists an open \(U \ni z\) with \(f(U) \subseteq V\).
Since \(x_n \to z\), the set \(U\) includes a tail of \(\{x_n\}\), so \(V\) includes the
corresponding tail of \(\{f(x_n)\}\). Hence \(f(x_n) \to f(z)\).
(\(\Leftarrow\)):
We prove the contrapositive. Suppose \(f\) is not continuous at \(z\): there exists an open
\(V \ni f(z)\) such that for every open \(U \ni z\), \(f(U) \not\subseteq V\). Applying this to
\(U_n := \mathcal{B}^X[z; 1/n)\) for each \(n \in \mathbb{N}\), we obtain \(x_n \in U_n\) with
\(f(x_n) \notin V\). Then \(d(x_n, z) < 1/n\), so by the Archimedean property \(x_n \to z\); yet
the entire sequence \(\{f(x_n)\}\) lies outside \(V\), so \(f(x_n) \not\to f(z)\).
The sequential characterization is the bridge from topology to analysis: convergence of sequences —
the operational concept developed in the previous chapter — now encodes continuity. This is the form
most directly useful in iterative algorithms, where we verify continuity by checking the behavior of
the function along the sequence of iterates.
The power of the open-set definition becomes clear when we look at global properties.
Continuity allows us to pull topological structures back from the range to the domain.
Theorem: Global Topological Characterization
A function \(f: X \to Y\) is continuous on \(X\) if and only if for every open subset \(V\) of \(Y\),
the preimage \(f^{-1}(V)\) is an open subset of \(X\).
Proof:
(⇒) Suppose \(f\) is continuous on \(X\), and let \(V\) be an open subset of \(Y\).
For each \(z \in f^{-1}(V)\), we have \(f(z) \in V\), so by continuity at \(z\), there
exists an open \(U_z \ni z\) in \(X\) with \(f(U_z) \subseteq V\), i.e. \(U_z \subseteq f^{-1}(V)\).
Thus \(f^{-1}(V) = \bigcup_{z \in f^{-1}(V)} U_z\), which is a union of open sets and hence open.
(⇐) Suppose every preimage of an open set is open. Let \(z \in X\) and let \(V\) be an open subset
of \(Y\) with \(f(z) \in V\). Then \(U := f^{-1}(V)\) is open in \(X\), contains \(z\), and
satisfies \(f(U) \subseteq V\). Hence \(f\) is continuous at \(z\); since \(z\) was arbitrary,
\(f\) is continuous on \(X\).
Using this topological definition, it becomes straightforward to show that continuity is preserved
under composition and subspace operations.
Theorem: Continuity under Codomain Extension
Suppose \((X, d)\) and \((Y, e)\) are metric spaces and \(f: X \to Y\) is a continuous function. Suppose
\(Z\) is any metric superspace of \((f(X), e)\). Then \(f: X \to Z\) is also continuous.
Proof:
Suppose \(U\) is an open subset of \(Z\). Then \(U \cap f(X)\) is open in \((f(X), e)\) because \(Z\) is a
metric superspace of \((f(X), e)\). Since \((f(X), e)\) is a metric subspace of \((Y, e)\), we have
\(U \cap f(X) = W \cap f(X)\) for some open subset \(W\) of \((Y, e)\). Thus,
\[
\begin{align*}
f^{-1}(U) &= f^{-1}(U \cap f(X)) \\\\
&= f^{-1}(W \cap f(X)) \\\\
&= f^{-1}(W),
\end{align*}
\]
which is open in \(X\) because \(f: X \to Y\) is a continuous function. Since \(U\) was an arbitrary open subset
of \(Z\), this implies \(f: X \to Z\) is also continuous.
Theorem: Continuity of Compositions
Suppose \(X\), \(Y\), and \(Z\) are metric spaces and \(f: X \to Y\) and \(g: Y \to Z\). If \(f\) and \(g\)
are continuous, then the composition \(g \circ f\) is continuous.
Proof:
Suppose \(f\) and \(g\) are continuous and \(W\) is an open subset of \(Z\). Then \(g^{-1}(W)\) is open
in \(Y\) and so \(f^{-1}(g^{-1}(W))\) is open in \(X\), but this set is \((g \circ f)^{-1}(W)\).
Since \(W\) is any open subset of \(Z\), \(g \circ f\) is also continuous.
Insight: Continuity and Robustness
In deep learning, continuity is a prerequisite for robustness.
It implies that if an input image \(x\) is perturbed slightly (noise), the model's prediction \(f(x)\)
should not jump abruptly. However, mere continuity allows the function to change arbitrarily fast locally.
This loophole is exploited by Adversarial Attacks, where imperceptible noise causes a
continuous but unstable model to misclassify data with high confidence.
Lipschitz Continuity
A stronger form of continuity, often encountered in linear maps between normed linear spaces, is
Lipschitz continuity. Unlike uniform continuity which only says "steepness is bounded locally,"
Lipschitz continuity gives us a concrete number, the Lipschitz constant, that bounds the rate
of change globally.
Definition: Lipschitz Continuous Function
Suppose \((X, d)\) and \((Y, e)\) are metric spaces, and \(f: X \to Y\). If there exists \(L \geq 0\)
such that
\[
e(f(a), f(b)) \leq L \cdot d(a, b), \quad \forall \, a, b \in X,
\]
then \(f\) is called a Lipschitz function on \(X\) with Lipschitz constant \(L\).
The case \(L = 0\) corresponds to constant functions.
Theorem: Connection to Uniform Continuity
Suppose \((X, d)\) and \((Y, e)\) are metric spaces, \(S \subseteq X\), and \(f: X \to Y\).
If \(f\) is a Lipschitz function on \(S\) with Lipschitz constant \(L\), then \(f\) is uniformly continuous on \(S\).
The \(\delta\) in the definition of uniform continuity can be taken to be \(\frac{\epsilon}{L}\).
Hierarchy: Lipschitz \(\implies\) Uniformly Continuous \(\implies\) Continuous.
Proof:
Let \(\epsilon > 0\) and set \(\delta := \epsilon / L\) (take any positive value if \(L = 0\); the map is then constant).
For any \(a, b \in S\) with \(d(a, b) < \delta\), the Lipschitz condition gives
\[
e(f(a), f(b)) \leq L \cdot d(a, b) < L \cdot \frac{\epsilon}{L} = \epsilon.
\]
Since \(\delta\) depends only on \(\epsilon\) (not on the base point), \(f\) is uniformly continuous on \(S\).
Lipschitz continuity provides a global bound on the rate of change, defined by the Lipschitz Constant \(L\).
When a Lipschitz map has its domain and codomain in the same space, we further classify it by the value of \(L\),
giving the contraction factor \(k\):
-
Contraction (\(k \leq 1\)):
Any Lipschitz map \(f: X \to X\) with contraction factor \(k \leq 1\) is called a contraction.
These maps are stable because they never increase the distance between points.
-
Strong Contraction (\(k < 1\)):
If \(k\) is strictly less than 1, \(f\) is a strong contraction, which
is uniformly continuous on its domain. These maps strictly pull points closer to one another
and are fundamental for guaranteeing unique solutions.
Terminology Note:
In our curriculum's structural hierarchy, we use Contraction for \(k \le 1\) and Strong Contraction for \(k < 1\).
Please note that in many other standard texts — and in the optimization / ML literature
(see the Duality chapter) — the word "Contraction" is used exclusively for the \(k < 1\) case,
while the \(k \le 1\) case is referred to as Non-expansive.
We provide this distinction to better visualize the "rigidity" of mappings before introducing fixed-point theorems.
Definition: Strong Contraction
A map \(f: X \to X\) is a strong contraction if there exists \(k \in [0, 1)\) such that:
\[
d(f(x), f(z)) \leq k \cdot d(x, z) \quad \forall x, z \in X.
\]
The constant \(k\) is called the contraction factor.
Definition: Isometry
A map \(\phi: X \to Y\) is called an isometry if and only if it preserves distances exactly:
\[
e(\phi(a), \phi(b)) = d(a, b) \quad \forall a, b \in X.
\]
Note: Every isometry is a Lipschitz function with constant \(L = 1\), and in our curriculum's convention
(see terminology note above) it is classified as a contraction (with contraction factor \(k = 1\),
but not a strong contraction). The converse fails: a Lipschitz function with \(L = 1\) need not be an isometry.
For instance, the absolute-value map \(f: \mathbb{R} \to \mathbb{R}\), \(f(x) = |x|\), satisfies
\(\big||a| - |b|\big| \leq |a - b|\) (reverse triangle inequality), so \(L = 1\) works — but \(f(-1) = f(1) = 1\),
violating injectivity and hence isometry.
Insight: Why Hierarchy of Continuity Matters
In machine learning and optimization, we don't just want functions to be continuous;
we want to understand how "well-behaved" they are. This hierarchy acts as a set of guardrails:
- Continuity: The bare minimum for Gradient Descent to "slide" on a loss landscape without jumps.
- Uniform Continuity: Ensures that a fixed step size won't lead to unpredictable behavior just because we moved to a different region of the space.
- Lipschitz Continuity: The gold standard. It provides a concrete limit on "steepness," ensuring stability during training.
Real-world applications in ML theory:
-
Convergence Rates:
In convex optimization, if the gradient is \(L\)-Lipschitz (called \(L\)-smooth),
gradient descent is guaranteed to converge at a rate of \(O(1/T)\). The constant \(L\) determines the maximum safe learning rate (\(\eta < 2/L\)).
-
GAN Stability:
In Wasserstein GANs (WGAN), the theoretical framework requires the discriminator (critic) to be 1-Lipschitz.
Techniques like Gradient Penalty or Spectral Normalization are explicitly designed to enforce this.
-
Fixed Point Iterations:
Strong contractions guarantee a unique fixed point (Banach Fixed Point Theorem).
This is why the Bellman Operator in Reinforcement Learning converges to the optimal value function.