Lagrange's Mean Value Theorem
Up to this point, we have focused on the instantaneous rate of change at a single point.
Lagrange's Mean Value Theorem (MVT) bridges the gap between this local derivative and the
average change over an interval.
Theorem 1: Lagrange's Mean Value Theorem
Suppose a function \(f\) is continuous on \([a, b]\) and differentiable on \((a, b)\). Then
there exists a number \(c \in (a, b)\) such that
\[
f'(c) = \frac{f(b) - f(a)}{b - a}. \tag{1}
\]
This means that at some point \(c\), the instantaneous rate of change equals the average rate of change
over the interval: interpreting \(f\) as position, at some moment the instantaneous velocity is exactly the
average velocity of the journey from \(a\) to \(b\). The special case where \(f(a) = f(b)\) is known as Rolle's Theorem.
In this case, the theorem asserts that there is a point where the derivative is zero (\(f'(c) = 0\)).
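As a quick numerical sketch (the function and interval are my own illustrative choices, not part of the theorem): for \(f(x) = x^3\) on \([0, 2]\), the average slope is \(4\), and since \(f'\) is increasing, bisection recovers the guaranteed point \(c = 2/\sqrt{3}\):

```python
import math

def f(x):
    return x**3

def f_prime(x):
    return 3 * x**2

a, b = 0.0, 2.0
avg_slope = (f(b) - f(a)) / (b - a)  # (8 - 0) / 2 = 4.0

# f' is increasing on [0, 2], so bisection finds the c with f'(c) = avg_slope
lo, hi = a, b
for _ in range(100):
    mid = (lo + hi) / 2
    if f_prime(mid) < avg_slope:
        lo = mid
    else:
        hi = mid
c = (lo + hi) / 2
print(c)  # 2 / sqrt(3) ≈ 1.1547
```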
Outline of the Proof of Rolle's Theorem:
By the Extreme Value Theorem, \(f\) attains a maximum and a minimum on \([a, b]\):
- case 1: \(f\) is constant, so \(f'(c) = 0\) for all \(c \in (a,b)\).
- case 2: \(f(x) > f(a)\) for some \(x\), so the maximum is attained at an interior point \(c \in (a, b)\); \(f\) has a local maximum at \(c\), hence \(f'(c) = 0\) by Fermat's interior extremum theorem.
- case 3: \(f(x) < f(a)\) for some \(x\); similarly, the minimum is attained at an interior point \(c\) where \(f\) has a local minimum, so \(f'(c) = 0\).
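Rolle's Theorem in miniature (the function is an illustrative choice of mine): \(f(x) = x(1-x)\) satisfies \(f(0) = f(1) = 0\), so some \(c \in (0, 1)\) must have \(f'(c) = 0\); here \(f'(x) = 1 - 2x\) vanishes at \(c = 0.5\):

```python
# f(x) = x*(1 - x): equal endpoint values force a critical point inside (0, 1)
def f(x):
    return x * (1 - x)

def f_prime(x):
    return 1 - 2 * x

c = 0.5
print(f(0), f(1), f_prime(c))  # prints: 0 0 0.0
```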
Now, we use Rolle's Theorem to prove Lagrange's Mean Value Theorem.
Proof: Lagrange's Mean Value Theorem
Suppose a function \(f\) is continuous on \([a, b]\) and differentiable on \((a, b)\).
Consider the line connecting \((a, f(a))\) and \((b, f(b))\). The equation of this line can be written as
\[
y - f(a) = \frac{f(b)-f(a)}{b -a} (x - a).
\]
Let \(g(x)\) be the vertical difference between \((x, f(x))\) and \((x, y)\). Then
\[
g(x) = f(x) - f(a) - \frac{f(b)-f(a)}{b -a} (x -a). \tag{2}
\]
\(g\) is continuous on \([a, b]\) and differentiable on \((a, b)\) because \(g\) is the sum of \(f\)
and a first-degree polynomial, both of which are continuous and differentiable on the same intervals.
From Equation (2),
\[
g'(x) = f'(x) - \frac{f(b)-f(a)}{b -a}.
\]
Also, clearly, \(g(a) = g(b) = 0\). Thus, by Rolle's Theorem
\[
\exists c \in (a,b) \text{ such that } g'(c) = 0.
\]
This implies
\[
g'(c) = f'(c) - \frac{f(b)-f(a)}{b -a} = 0
\]
and so
\[
f'(c) = \frac{f(b)-f(a)}{b -a}.
\]
Taylor's Theorem
By rearranging Equation (1) as \(f(b) = f(a) + f'(c)(b-a)\), we see that the MVT is essentially a
first-order Taylor expansion with an exact remainder. Taylor's Theorem generalizes
this to higher-order polynomials.
Theorem 2: Taylor's Theorem
Let \(n \geq 1\) be an integer and let \(f: \mathbb{R} \to \mathbb{R}\) be \(n\) times
differentiable at \(a \in \mathbb{R}\). Then there exists a function \(g_n: \mathbb{R} \to \mathbb{R}\)
such that
\[
f(x) = \sum_{k = 0}^{n-1} \frac{f^{(k)}(a)}{k !}(x - a)^k + g_n (x) (x - a)^n
\]
and
\[
\lim_{x \to a} g_n(x) = 0.
\]
Note: \(g_n (x) (x - a)^n \) represents the remainder. An alternative and commonly used form of the
remainder is the Lagrange form:
\[
R_n (x) = \frac{f^{(n)}(c)}{n!}(x-a)^n \text{ for some } c \text{ between } a \text{ and } x.
\]
Another common representation of the remainder, particularly in asymptotic analysis, is:
\[
R_n (x) = \frac{f^{(n)}(a)}{n!}(x-a)^n + o(| x -a |^n)
\]
where the little-o notation \(o(| x -a |^n)\) indicates that the term converges to 0 faster than
\(| x -a |^n\) as \(x \to a\).
The polynomial appearing in Taylor's Theorem is called the \(n\)-th order Taylor polynomial:
\[
\begin{align*}
T_n (x) &= f(a) + \frac{f'(a)}{1!}(x -a) + \frac{f''(a)}{2!}(x -a)^2 + \cdots + \frac{f^{(n)}(a)}{n!}(x - a)^n \\
&= \sum_{k = 0}^{n} \frac{f^{(k)}(a)}{k !}(x - a)^k.
\end{align*}
\]
Moreover, if the function \(f\) is infinitely differentiable and the remainder term \(R_n (x)\)
tends to zero as \(n \to \infty\), the expansion becomes:
\[
f(x) = \sum_{n = 0}^{\infty} \frac{f^{(n)}(a)}{n !}(x - a)^n.
\]
This is the Taylor series of the function \(f\) centered at \(a\). For many common functions, the series
converges to \(f\) within its radius of convergence, so Taylor polynomials are used to approximate \(f(x)\).
For example, the 1st-order Taylor polynomial
\[
T_1 (x) = f(a) + f'(a)(x-a)
\]
is equivalent to the linearization of \(f\) at \(a\), providing a first-order approximation.
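A small sketch of this convergence (the choice of \(\exp\) at \(a = 0\), \(x = 0.5\) is mine; \(\exp\) equals its Taylor series everywhere): each additional term of the Taylor polynomial shrinks the approximation error.

```python
import math

# Partial sums of the Taylor series of exp at a = 0, evaluated at x = 0.5
def taylor_exp(x, n):
    return sum(x**k / math.factorial(k) for k in range(n + 1))

x = 0.5
errors = [abs(math.exp(x) - taylor_exp(x, n)) for n in range(1, 6)]
print(errors)  # strictly decreasing toward 0
```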
Approximation vs. Reality
We often use the 1st-order approximation \(T_1(x) = f(a) + f'(a)(x-a)\). Taylor's theorem tells us exactly how
much "error" we incur by ignoring higher-order terms, which is critical when analyzing the convergence of optimization
algorithms.
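Taylor's theorem says this error is second order in \(x - a\): halving the step should roughly quarter \(|f(x) - T_1(x)|\). A sketch under my own illustrative choices (\(f = \sin\), \(a = 0.3\)):

```python
import math

def f(x):
    return math.sin(x)

a = 0.3

def T1(x):
    # linearization of sin at a
    return f(a) + math.cos(a) * (x - a)

h = 0.1
e_full = abs(f(a + h) - T1(a + h))
e_half = abs(f(a + h / 2) - T1(a + h / 2))
print(e_full / e_half)  # roughly 4: the error scales like h**2
```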
Cauchy's Mean Value Theorem
Theorem 3: Cauchy's Mean Value Theorem
Let \(f\) and \(g\) both be continuous on \([a,b]\) and differentiable on \((a,b)\). If
\(\forall x \in (a, b), \, g'(x) \neq 0\) then \(\exists c \in (a, b)\) such that
\[
\frac{f(b)-f(a)}{g(b)-g(a)} = \frac{f'(c)}{g'(c)}.
\]
Cauchy's Mean Value Theorem generalizes Lagrange's MVT (take \(g(x) = x\) to recover it).
While it is a vital tool for proving foundational calculus results like L'Hôpital's Rule,
it is less commonly applied directly in everyday machine learning optimization compared to the standard MVT or
Taylor's Theorem.
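A minimal numerical check, with my own illustrative choices \(f(x) = x^3\) and \(g(x) = x^2\) on \([1, 2]\): the ratio of total changes is \(7/3\), and since \(f'(x)/g'(x) = 3x/2\), the theorem's point is \(c = 14/9 \in (1, 2)\).

```python
def f(x):
    return x**3

def g(x):
    return x**2

a, b = 1.0, 2.0
ratio = (f(b) - f(a)) / (g(b) - g(a))  # (8 - 1) / (4 - 1) = 7/3

# f'(x)/g'(x) = 3x**2 / (2x) = 3x/2, so the theorem's c solves 3c/2 = ratio
c = 2 * ratio / 3
print(c, a < c < b)  # 14/9 ≈ 1.5556, True
```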
Why it matters: From Limits to Information Geometry
While Lagrange's MVT deals with the change of a single function, Cauchy's MVT considers the
ratio of changes between two functions. This is not only the rigorous foundation
for L'Hôpital's Rule but also a critical concept when transitioning to advanced topics.
In the context of modern machine learning and Information Geometry, this theorem
foreshadows the analysis of how different metrics (like loss functions vs. probability constraints)
scale relative to one another. Understanding these relative rates of change is essential for analyzing
the geometry of probability manifolds and natural gradient methods.
Higher-dimensional MVT
Before moving to optimization, we must extend the MVT to multidimensional spaces. While gradient descent relies on local
linear approximations in practice, this theorem provides the theoretical foundation for proving the convergence of such
algorithms. It guarantees that the discrete change in a scalar-valued loss function can be related to the continuous field
of its gradients.
Higher-Dimensional MVT
Let \(f: X \to \mathbb{R}\) be a differentiable function, where \(X\) is an open subset of \(\mathbb{R}^n\).
For any points \(a, b \in X\) such that \( [a, b] \subset X\), \(\, \exists c \in (a,b) \text{ such that }\)
\[
f(b) - f(a) = \nabla f(c) \cdot (b-a).
\]
Note: \([a, b] = \{(1-t)a + tb \, | \, t \in [0, 1] \}\) and \((a, b) = \{(1-t)a + tb \, | \, t \in (0, 1) \}\).
Proof:
Define \(h(t) = f((1-t)a + tb) = f(a + t(b-a))\) for \(t \in [0, 1]\). By the chain rule,
\(h\) is continuous on \([0, 1]\) and differentiable on \((0, 1)\).
By Theorem 1: Lagrange's Mean Value Theorem,
\(\exists \theta \in (0,1)\) such that
\[
h'(\theta ) = \frac{h(1) - h(0)}{1 - 0} = f(b) - f(a).
\]
\]
Also,
\[
h'(\theta ) = \nabla f(a + \theta (b -a)) \cdot (b - a).
\]
Here, let \(c = a + \theta (b-a)\), then we obtain
\[
\nabla f(c) \cdot (b - a) = f(b) - f(a).
\]
Important Constraint: Scalar-valued Outputs
It is crucial to note that this exact equality holds only for scalar-valued functions
(\(f: \mathbb{R}^n \to \mathbb{R}\)). For vector-valued functions (where the output is also a vector),
the MVT generally does not hold as an equality; instead, it is expressed as the Mean Value Inequality
involving norms.
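A standard counterexample illustrates why (the circular path is my choice of example): \(f(t) = (\cos t, \sin t)\) on \([0, 2\pi]\) returns to its starting point, so \(f(b) - f(a) = (0, 0)\), yet \(\|f'(t)\| = 1\) everywhere, so no \(c\) can make the equality hold.

```python
import math

a, b = 0.0, 2 * math.pi
# f(t) = (cos t, sin t); the total change over a full circle is (0, 0)
diff = (math.cos(b) - math.cos(a), math.sin(b) - math.sin(a))

# ...but the speed ||f'(t)|| = ||(-sin t, cos t)|| is 1 at every sample point
speeds = [math.hypot(-math.sin(t), math.cos(t))
          for t in (a + k * (b - a) / 100 for k in range(1, 100))]
print(max(abs(d) for d in diff), min(speeds))  # ~0.0 and 1.0
```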