Mean Value Theorem

  • Lagrange's Mean Value Theorem
  • Taylor's Theorem
  • Cauchy's Mean Value Theorem
  • Higher-dimensional MVT

Lagrange's Mean Value Theorem

Up to this point, we have focused on the instantaneous rate of change at a single point. Lagrange's Mean Value Theorem (MVT) bridges the gap between this local derivative and the average change over an interval.

Theorem 1: Lagrange's Mean Value Theorem

Suppose a function \(f\) is continuous on \([a, b]\) and differentiable on \((a, b)\). Then there exists a number \(c \in (a, b)\) such that \[ f'(c) = \frac{f(b) - f(a)}{b - a}. \tag{1} \]

This implies that at some point \(c\), the instantaneous velocity is exactly equal to the average velocity over the entire journey from \(a\) to \(b\). A special case where \(f(a) = f(b)\) is known as Rolle's Theorem. In this case, the theorem asserts that there is a point where the derivative is zero (\(f'(c) = 0\)).

Outline of the Proof of Rolle's Theorem:

By the Extreme Value Theorem, \(f\) attains a maximum and a minimum on \([a, b]\):

  • Case 1: \(f\) is constant, so \(f'(c) = 0\) for all \(c \in (a,b)\).
  • Case 2: \(f(x) > f(a)\) for some \(x \in (a, b)\). Then the maximum is attained at an interior point \(c\), so \(f\) has a local maximum at \(c\), and \(f'(c) = 0\) by Fermat's theorem on interior extrema.
  • Case 3: \(f(x) < f(a)\) for some \(x \in (a, b)\). Similarly, \(f\) has a local minimum at an interior point \(c\), and \(f'(c) = 0\).

Now, we use Rolle's Theorem to prove Lagrange's Mean Value Theorem.

Proof: Lagrange's Mean Value Theorem Suppose a function \(f\) is continuous on \([a, b]\) and differentiable on \((a, b)\).
Consider the line connecting \((a, f(a))\) and \((b, f(b))\). The equation of the line can be written as \[ y - f(a) = \frac{f(b)-f(a)}{b -a} (x - a). \] Let \(g(x)\) be the vertical difference between \((x, f(x))\) and \((x, y)\). Then \[ g(x) = f(x) - f(a) - \frac{f(b)-f(a)}{b -a} (x -a). \tag{2} \] \(g\) is continuous on \([a, b]\) and differentiable on \((a, b)\) because it is the difference of \(f\) and a first-degree polynomial, both of which are continuous and differentiable on those intervals.
From Equation (2), \[ g'(x) = f'(x) - \frac{f(b)-f(a)}{b -a}. \] Also, clearly, \(g(a) = g(b) = 0\). Thus, by Rolle's Theorem \[ \exists c \in (a,b) \text{ such that } g'(c) = 0. \] This implies \[ g'(c) = f'(c) - \frac{f(b)-f(a)}{b -a} = 0 \] and so \[ f'(c) = \frac{f(b)-f(a)}{b -a}. \]
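The theorem can also be checked numerically: for a concrete choice of \(f\) and interval (both illustrative, not part of the theorem), a simple bisection locates a point \(c\) whose instantaneous slope matches the average slope. A minimal sketch in Python:

```python
# Numerically locate the point c guaranteed by the MVT for a sample
# function. f, the interval, and the iteration count are illustrative
# choices, not part of the theorem itself.

def f(x):
    return x ** 3

def f_prime(x):
    return 3 * x ** 2

a, b = 0.0, 2.0
avg_slope = (f(b) - f(a)) / (b - a)  # = 4.0

# g(x) = f'(x) - avg_slope changes sign on (a, b), so bisection finds
# a root, i.e. a point where the instantaneous slope equals the average.
lo, hi = a, b
for _ in range(60):
    mid = (lo + hi) / 2
    if (f_prime(lo) - avg_slope) * (f_prime(mid) - avg_slope) <= 0:
        hi = mid
    else:
        lo = mid

c = (lo + hi) / 2
print(c)           # c = 2 / sqrt(3) ≈ 1.1547
print(f_prime(c))  # ≈ avg_slope = 4.0
```

Here \(f'(c) = 3c^2 = 4\) gives \(c = 2/\sqrt{3}\), which indeed lies in \((0, 2)\) as the theorem guarantees.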

Taylor's Theorem

By rearranging Equation (1) as \(f(b) = f(a) + f'(c)(b-a)\), we see that MVT is essentially a first-order Taylor expansion with an exact remainder. Taylor's Theorem generalizes this to higher-order polynomials.

Theorem 2: Taylor's Theorem

Let \(n \geq 1\) be an integer and let \(f: \mathbb{R} \to \mathbb{R}\) be \(n\) times differentiable at \(a \in \mathbb{R}\). Then there exists a function \(g_n: \mathbb{R} \to \mathbb{R}\) such that \[ f(x) = \sum_{k = 0}^{n-1} \frac{f^{(k)}(a)}{k !}(x - a)^k + g_n (x) (x - a)^n \] and \[ \lim_{x \to a} g_n(x) = \frac{f^{(n)}(a)}{n!}. \]

Note: \(g_n (x) (x - a)^n \) represents the remainder. An alternative and commonly used form is the Lagrange form of the remainder: \[ R_n (x) = \frac{f^{(n)}(c)}{n!}(x-a)^n \text{ for some } c \in (a, x). \] Another common representation, particularly in asymptotic analysis, is \[ R_n (x) = \frac{f^{(n)}(a)}{n!}(x-a)^n + o(| x -a |^n), \] where the little-o notation \(o(| x -a |^n)\) indicates a term that converges to 0 faster than \(| x -a |^n\) as \(x \to a\).

The \(n\)-th order Taylor polynomial of \(f\) at \(a\) is \[ \begin{align*} T_n (x) &= f(a) + \frac{f'(a)}{1!}(x -a) + \frac{f''(a)}{2!}(x -a)^2 + \cdots + \frac{f^{(n)}(a)}{n!}(x - a)^n \\\\ &= \sum_{k = 0}^{n} \frac{f^{(k)}(a)}{k !}(x - a)^k. \end{align*} \] As \(n \to \infty\), if the function \(f\) is infinitely differentiable and the remainder term \(R_n (x)\) tends to zero, the expansion becomes \[ f(x) = \sum_{n = 0}^{\infty} \frac{f^{(n)}(a)}{n !}(x - a)^n. \] This is the Taylor series of \(f\) centered at \(a\); the series converges to \(f\) within its radius of convergence. Taylor polynomials are therefore used to approximate \(f(x)\). For example, the 1st-order Taylor polynomial \[ T_1 (x) = f(a) + f'(a)(x-a) \] is the linearization of \(f\) at \(a\), providing a first-order approximation.
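As a quick illustration of how Taylor polynomials approximate a function, the sketch below evaluates \(T_n\) for increasing \(n\) and watches the error shrink. The choices \(f = \sin\) and \(a = 0\) are illustrative; at \(a = 0\) the derivatives of \(\sin\) cycle through \(\sin, \cos, -\sin, -\cos\).

```python
import math

# Evaluate the n-th order Taylor polynomial of sin centered at a = 0,
# using the 4-cycle of derivative values at 0.

def taylor_sin(x, n):
    """T_n(x) for sin centered at a = 0."""
    derivs_at_0 = [0.0, 1.0, 0.0, -1.0]  # sin, cos, -sin, -cos at 0
    return sum(derivs_at_0[k % 4] / math.factorial(k) * x ** k
               for k in range(n + 1))

x = 0.5
for n in (1, 3, 5, 7):
    err = abs(taylor_sin(x, n) - math.sin(x))
    print(n, err)  # the error shrinks rapidly as n grows
```

Since \(\sin\) is entire, the remainder tends to zero for every \(x\), consistent with the convergence of the Taylor series.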

Approximation vs. Reality

We often use the 1st-order approximation \(T_1(x) = f(a) + f'(a)(x-a)\). Taylor's theorem tells us exactly how much "error" we incur by ignoring higher-order terms, which is critical when analyzing the convergence of optimization algorithms.
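This quadratic decay of the first-order error can be observed numerically. In the sketch below (using \(f = \exp\) and \(a = 0\) as illustrative choices), halving the step \(h\) roughly quarters the error, and \(\mathrm{err}/h^2\) approaches \(f''(a)/2 = 1/2\), as the second-order remainder term predicts:

```python
import math

# Check that the first-order approximation error |f(a+h) - T_1(a+h)|
# shrinks like h^2. f = exp and a = 0 are illustrative choices;
# for exp, f''(a)/2 = 1/2.

a = 0.0
f, f_prime = math.exp, math.exp

def t1(x):
    # 1st-order Taylor polynomial of f at a
    return f(a) + f_prime(a) * (x - a)

for h in (0.1, 0.05, 0.025):
    err = abs(f(a + h) - t1(a + h))
    print(h, err, err / h ** 2)  # err / h^2 ≈ 0.5
```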

Cauchy's Mean Value Theorem

Theorem 3: Cauchy's Mean Value Theorem Let \(f\) and \(g\) both be continuous on \([a,b]\) and differentiable on \((a,b)\). If \(g'(x) \neq 0\) for all \(x \in (a, b)\), then \(\exists c \in (a, b)\) such that \[ \frac{f(b)-f(a)}{g(b)-g(a)} = \frac{f'(c)}{g'(c)}. \]

Cauchy's Mean Value Theorem generalizes Lagrange's MVT (take \(g(x) = x\)). While it is a vital tool for proving foundational calculus results like L'Hôpital's Rule, it is less commonly applied directly in everyday machine learning optimization than the standard MVT or Taylor's Theorem.
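As with Lagrange's MVT, the point \(c\) can be located numerically for a concrete pair of functions. The sketch below uses the illustrative choices \(f(x) = x^2\) and \(g(x) = x^3\) on \([1, 2]\), where the ratio of average changes is \(3/7\):

```python
# Numerically find the point c from Cauchy's MVT for a sample pair of
# functions (illustrative choices, not part of the theorem).

def f(x): return x ** 2
def g(x): return x ** 3
def f_prime(x): return 2 * x
def g_prime(x): return 3 * x ** 2

a, b = 1.0, 2.0
ratio = (f(b) - f(a)) / (g(b) - g(a))  # = 3/7

# f'(x)/g'(x) - ratio = 2/(3x) - 3/7 is strictly decreasing on [1, 2],
# so bisection finds the unique point where the ratios agree.
lo, hi = a, b
for _ in range(60):
    mid = (lo + hi) / 2
    if f_prime(mid) / g_prime(mid) - ratio > 0:
        lo = mid
    else:
        hi = mid

c = (lo + hi) / 2
print(c)  # c = 14/9 ≈ 1.5556, since 2/(3c) = 3/7
```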

Why it matters: From Limits to Information Geometry

While Lagrange's MVT deals with the change of a single function, Cauchy's MVT considers the ratio of changes between two functions. This is not only the rigorous foundation for L'Hôpital's Rule but also a critical concept when transitioning to advanced topics.

In the context of modern machine learning and Information Geometry, this theorem foreshadows the analysis of how different metrics (like loss functions vs. probability constraints) scale relative to one another. Understanding these relative rates of change is essential for analyzing the geometry of probability manifolds and natural gradient methods.

Higher-dimensional MVT

Before moving to optimization, we must extend the MVT to multidimensional spaces. While gradient descent relies on local linear approximations in practice, this theorem provides the theoretical foundation for proving the convergence of such algorithms. It guarantees that the discrete change in a scalar-valued loss function can be related to the continuous field of its gradients.

Theorem 4: Higher-Dimensional MVT

Let \(f: X \to \mathbb{R}\) be a differentiable function, where \(X\) is an open subset of \(\mathbb{R}^n\).
For any points \(a, b \in X\) such that \( [a, b] \subset X\), \(\, \exists c \in (a,b) \text{ such that }\) \[ f(b) - f(a) = \nabla f(c) \cdot (b-a). \]
Note: \([a, b] = \{(1-t)a + tb \, | \, t \in [0, 1] \}\) and \((a, b) = \{(1-t)a + tb \, | \, t \in (0, 1) \}\).

Proof: Define \(h(t) = f((1-t)a + tb) = f(a + t(b-a))\) for \(t \in [0, 1]\). Since \(f\) is differentiable and \(t \mapsto a + t(b-a)\) is affine, \(h\) is continuous on \([0,1]\) and differentiable on \((0,1)\).
By Theorem 1: Lagrange's Mean Value Theorem,
\(\exists \theta \in (0,1)\) such that \[ h'(\theta ) = \frac{h(1) - h(0)}{1 - 0} = f(b) - f(a). \] By the chain rule, \[ h'(\theta ) = \nabla f(a + \theta (b -a)) \cdot (b - a). \] Letting \(c = a + \theta (b-a)\), we obtain \[ \nabla f(c) \cdot (b - a) = f(b) - f(a). \]
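The proof suggests a direct numerical check: reduce to one dimension along the segment and search for \(\theta\) with \(h'(\theta) = f(b) - f(a)\). A sketch with an illustrative quadratic \(f\) on \(\mathbb{R}^2\):

```python
# Verify the higher-dimensional MVT for a sample scalar-valued
# function f(x, y) = x^2 + x*y (an illustrative choice) by reducing
# to the 1-D theorem along the segment h(t) = f(a + t(b - a)).

def f(p):
    x, y = p
    return x ** 2 + x * y

def grad_f(p):
    x, y = p
    return (2 * x + y, x)

a, b = (0.0, 0.0), (1.0, 1.0)
d = (b[0] - a[0], b[1] - a[1])

def h_prime(t):
    # chain rule: h'(t) = grad f(a + t d) . d
    c = (a[0] + t * d[0], a[1] + t * d[1])
    g = grad_f(c)
    return g[0] * d[0] + g[1] * d[1]

target = f(b) - f(a)  # = 2.0

# Along this segment h'(t) = (2t + t) * 1 + t * 1 = 4t is increasing,
# so bisection finds theta with h'(theta) = target.
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if h_prime(mid) < target:
        lo = mid
    else:
        hi = mid

theta = (lo + hi) / 2
c = (a[0] + theta * d[0], a[1] + theta * d[1])
print(theta, c)  # theta = 0.5, so c = (0.5, 0.5)
```

Solving \(4\theta = 2\) by hand confirms \(\theta = 1/2\), and the resulting \(c\) lies strictly between \(a\) and \(b\) as the theorem requires.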

Important Constraint: Scalar-valued Outputs

It is crucial to note that this exact equality holds only for scalar-valued functions (\(f: \mathbb{R}^n \to \mathbb{R}\)). For vector-valued functions (where the output is also a vector), the MVT generally does not hold as an equality; instead, it is expressed as the Mean Value Inequality, which involves norms.