Lagrange's Mean Value Theorem
Up to this point, we have focused on the instantaneous rate of change at a single point.
Lagrange's Mean Value Theorem (MVT) bridges the gap between this local derivative and the
average change over an interval.
Lagrange's Mean Value Theorem
Suppose \(f\) is continuous on \([a, b]\) and differentiable on \((a, b)\). Then there exists
\(c \in (a, b)\) such that
\[
f'(c) \;=\; \frac{f(b) - f(a)}{b - a}. \tag{1}
\]
This says that at some point \(c\), the instantaneous rate of change equals the average rate of
change over the entire interval. The special case \(f(a) = f(b)\) is known as
Rolle's Theorem: under the same regularity hypotheses, if \(f(a) = f(b)\) then
there exists \(c \in (a,b)\) with \(f'(c) = 0\).
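As a concrete illustration of Equation (1), the following minimal sketch locates a mean value point numerically. The example function \(f(x) = x^3\) on \([0, 2]\) and the helper name `mvt_point` are illustrative choices, not part of the theorem. The bisection assumes that \(f'(x)\) minus the secant slope changes sign on the bracket, which holds for this example but is not guaranteed in general.

```python
def mvt_point(f_prime, slope, lo, hi, tol=1e-12):
    """Bisection for f'(c) = slope; assumes f' - slope changes sign on [lo, hi]."""
    g = lambda x: f_prime(x) - slope
    if g(lo) * g(hi) > 0:
        raise ValueError("f' - slope does not change sign on the bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid          # the sign change lies in [lo, mid]
        else:
            lo = mid          # the sign change lies in [mid, hi]
    return 0.5 * (lo + hi)

# Hypothetical example: f(x) = x^3 on [0, 2].
f = lambda x: x**3
f_prime = lambda x: 3 * x**2
a, b = 0.0, 2.0
slope = (f(b) - f(a)) / (b - a)      # average rate of change over [a, b]
c = mvt_point(f_prime, slope, a, b)
print(c, f_prime(c), slope)
```

For this example the point can also be found by hand: \(3c^2 = 4\) gives \(c = 2/\sqrt{3}\), which lies in \((0, 2)\).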
Sketch of the proof of Rolle's Theorem.
Since \(f\) is continuous on the compact interval \([a,b]\), the Extreme Value Theorem
guarantees that \(f\) attains its maximum \(M\) and minimum \(m\) on \([a,b]\). Since
\(m \leq f(a) = f(b) \leq M\), at least one of the following three cases holds:
- If \(M = m\), then \(f\) is constant and \(f'(c) = 0\) for every \(c \in (a,b)\).
- If \(M > f(a) = f(b)\), the maximum is attained at some interior point \(c \in (a,b)\);
since \(f\) is differentiable there and \(c\) is a local maximum, Fermat's theorem gives
\(f'(c) = 0\).
- If \(m < f(a) = f(b)\), the minimum is attained at an interior point \(c\), and the
same argument yields \(f'(c) = 0\).
These three cases are exhaustive: if \(M = m\) fails then \(M > m\), so at least one of the
values \(M, m\) differs from \(f(a) = f(b)\), triggering case 2 or case 3.
Now we use Rolle's Theorem to prove Lagrange's MVT.
Proof of Lagrange's MVT.
Let the secant line joining \((a, f(a))\) and \((b, f(b))\) be
\[
y \;=\; f(a) + \frac{f(b)-f(a)}{b - a}\,(x - a),
\]
and define \(g(x)\) as the vertical gap between the graph of \(f\) and this secant line:
\[
g(x) \;=\; f(x) - f(a) - \frac{f(b)-f(a)}{b - a}\,(x - a). \tag{2}
\]
Since \(g\) is the sum of \(f\) and a polynomial of degree at most one, and sums of continuous
(resp. differentiable) functions are continuous (resp. differentiable), \(g\) is continuous
on \([a, b]\) and differentiable on \((a, b)\). Differentiating (2),
\[
g'(x) \;=\; f'(x) - \frac{f(b)-f(a)}{b - a}.
\]
A direct substitution shows \(g(a) = g(b) = 0\), so by Rolle's Theorem there exists
\(c \in (a,b)\) with \(g'(c) = 0\), i.e.
\[
f'(c) - \frac{f(b)-f(a)}{b - a} \;=\; 0,
\]
which rearranges to the desired identity. \(\blacksquare\)
Taylor's Theorem
By rearranging Equation (1) as \(f(b) = f(a) + f'(c)(b-a)\), we see that the MVT is a
zeroth-order Taylor expansion with an exact remainder: the constant approximation \(f(a)\)
plus a correction expressed through \(f'\) at an intermediate point. Taylor's Theorem
generalizes this to higher-order polynomials.
Taylor's theorem comes in two standard forms that differ in their hypotheses and in the
information they provide. The Peano form captures the local asymptotic
behavior of \(f\) near \(a\) under a pointwise differentiability hypothesis, while the
Lagrange form provides a pointwise remainder expression under a stronger,
interval-wide differentiability hypothesis.
We first introduce the \(n\)-th order Taylor polynomial, which is the polynomial part common
to both forms:
\[
T_n(x) \;=\; \sum_{k = 0}^{n} \frac{f^{(k)}(a)}{k!}\,(x - a)^k
\;=\; f(a) + \frac{f'(a)}{1!}(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \cdots + \frac{f^{(n)}(a)}{n!}(x-a)^n.
\]
The first-order Taylor polynomial
\[
T_1(x) \;=\; f(a) + f'(a)(x-a)
\]
is precisely the linearization
of \(f\) at \(a\); Taylor's theorem is its systematic higher-order generalization.
Taylor's Theorem, Peano Form
Let \(n \geq 1\) be an integer, let \(U \subseteq \mathbb{R}\) be an open neighborhood of
\(a\), and suppose \(f: U \to \mathbb{R}\) is \((n-1)\) times differentiable on \(U\) with
\(f^{(n-1)}\) differentiable at \(a\) (so that \(f^{(n)}(a)\) exists). Then
\[
f(x) \;=\; T_n(x) + o\bigl(|x - a|^n\bigr) \qquad \text{as } x \to a,
\]
meaning that the remainder \(R_n(x) := f(x) - T_n(x)\) satisfies
\(\lim_{x \to a} R_n(x) / |x-a|^n = 0\).
The little-\(o\) symbol, introduced in the Linear
Approximations page, means that \(R_n(x)\) vanishes strictly faster than \(|x-a|^n\) as
\(x \to a\). The Peano form is the natural statement for asymptotic analysis:
it characterizes how well \(T_n\) approximates \(f\) in the limit \(x \to a\), but it gives
no explicit formula or bound for \(R_n\) at any fixed \(x \neq a\). For that, we need a
stronger hypothesis.
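The limit in the Peano form can be observed numerically. The sketch below uses the illustrative choice \(f = \sin\), \(a = 0\), \(n = 2\), for which \(T_2(x) = x\) because \(\sin''(0) = 0\). The function name `peano_ratio` is hypothetical.

```python
import math

def peano_ratio(x, n=2):
    """R_n(x) / |x - a|^n for f = sin, a = 0, n = 2, where T_2(x) = x."""
    remainder = math.sin(x) - x
    return remainder / abs(x) ** n

for x in (1e-1, 1e-2, 1e-3):
    # The ratio shrinks roughly like -x/6, consistent with R_2(x) = o(|x|^2).
    print(x, peano_ratio(x))
```

The printed ratios shrink with \(x\), but nothing in the Peano statement itself says how fast; that rate information comes from the stronger hypotheses discussed next.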
Taylor's Theorem, Lagrange Form
Let \(n \geq 0\) be an integer and let \(I \subseteq \mathbb{R}\) be an interval
containing both \(a\) and \(x\), with \(x \neq a\). Suppose \(f: I \to \mathbb{R}\) is
\(n+1\) times differentiable on \(I\). Then there exists
\(c\) strictly between \(a\) and \(x\) such that
\[
R_n(x) \;=\; \frac{f^{(n+1)}(c)}{(n+1)!}\,(x - a)^{n+1},
\]
where \(R_n(x) = f(x) - T_n(x)\) is the remainder. For \(n = 0\) this is exactly
Lagrange's MVT, Equation (1), with \(b = x\).
The Lagrange form is useful when a concrete pointwise bound on \(R_n\) is needed:
any bound on \(|f^{(n+1)}|\) over \(I\) yields a bound on \(|R_n(x)|\) via
\(|R_n(x)| \leq \frac{\sup_{c \in I} |f^{(n+1)}(c)|}{(n+1)!}\,|x-a|^{n+1}\). This is the form invoked
in error analysis for numerical methods and in convergence proofs for optimization algorithms
where an explicit remainder estimate is required.
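A small numerical check may help make the remainder bound concrete. The sketch below takes \(f = \exp\) and \(a = 0\) (illustrative choices) and compares the actual error of the degree-\(n\) Taylor polynomial with the standard Lagrange bound \(\sup |f^{(n+1)}| \, |x|^{n+1} / (n+1)!\), where the supremum is taken over the interval between \(0\) and \(x\). The helper names are hypothetical.

```python
import math

def taylor_exp(x, n):
    """Degree-n Taylor polynomial of exp centered at a = 0."""
    return sum(x**k / math.factorial(k) for k in range(n + 1))

def lagrange_bound_exp(x, n):
    """sup|exp| between 0 and x, times |x|^(n+1) / (n+1)!."""
    sup_deriv = math.exp(max(0.0, x))   # exp is increasing, so the sup sits at the right end
    return sup_deriv * abs(x) ** (n + 1) / math.factorial(n + 1)

x = 1.0
for n in range(1, 8):
    err = abs(math.exp(x) - taylor_exp(x, n))
    print(n, err, lagrange_bound_exp(x, n))
```

In every row the actual error stays below the bound, and both shrink factorially in \(n\), which is the mechanism behind the convergence of the Taylor series of \(\exp\).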
When \(f\) is infinitely differentiable on \(I\) and the Lagrange remainder satisfies
\(R_n(x) \to 0\) as \(n \to \infty\), the partial sums \(T_n(x)\) converge to \(f(x)\), and
the limit is the Taylor series of \(f\) centered at \(a\):
\[
f(x) \;=\; \sum_{k = 0}^{\infty} \frac{f^{(k)}(a)}{k!}\,(x - a)^k.
\]
Approximation vs. Reality
We often use the first-order approximation \(T_1(x) = f(a) + f'(a)(x-a)\). Taylor's theorem
tells us exactly how much "error" we incur by ignoring higher-order terms — in the Lagrange
form, a bound on \(|f''|\) on the interval becomes a bound on the error. This is critical when
analyzing the convergence of optimization algorithms, where higher-order remainders determine
how large a step a method can safely take.
A subtle point deserves emphasis: being infinitely differentiable is not enough for the
Taylor series to recover the function. The standard counter-example is
\[
f(x) = \begin{cases} e^{-1/x^2} & x \neq 0, \\ 0 & x = 0, \end{cases}
\]
which is \(C^\infty\) on all of \(\mathbb{R}\) with \(f^{(k)}(0) = 0\) for every \(k\).
Its Taylor series at \(0\) is therefore identically zero, yet \(f(x) > 0\) for every
\(x \neq 0\) — the series converges everywhere but fails to equal \(f\) anywhere except at
the origin. Functions whose Taylor series converges to \(f\) on some neighborhood of every
point are called real-analytic, a condition strictly stronger than \(C^\infty\).
This distinction matters in information geometry, where the exponential and other parametric
families are typically real-analytic, and in the high-accuracy convergence theory of spectral
methods (Chebyshev and Fourier expansions), where analyticity yields exponential (geometric)
convergence while mere \(C^\infty\) regularity guarantees only convergence faster than any
fixed algebraic rate.
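The flatness of the counter-example at the origin can be seen numerically. In the sketch below (the function name `flat` and the sample points are illustrative), the Taylor series at \(0\) is identically zero, so \(f(x)\) itself is the full approximation error; dividing by \(x^{10}\) shows that it still vanishes faster than that power.

```python
import math

def flat(x):
    """exp(-1/x^2) for x != 0, extended by 0 at the origin."""
    return math.exp(-1.0 / (x * x)) if x != 0 else 0.0

for x in (0.5, 0.2, 0.1):
    # Every Taylor polynomial at 0 is zero, yet flat(x) > 0 for x != 0.
    print(x, flat(x), flat(x) / x**10)
```

The sample points stop at \(0.1\) on purpose: for much smaller \(x\) the value underflows to `0.0` in double precision, which would hide the fact that the true function is strictly positive.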
Cauchy's Mean Value Theorem
Cauchy's Mean Value Theorem
Let \(f\) and \(g\) be continuous on \([a,b]\) and differentiable on \((a,b)\). If
\(g'(x) \neq 0\) for every \(x \in (a, b)\), then \(g(b) \neq g(a)\) (else Rolle's theorem
would give a point where \(g'\) vanishes), and there exists \(c \in (a, b)\) such that
\[
\frac{f(b)-f(a)}{g(b)-g(a)} \;=\; \frac{f'(c)}{g'(c)}.
\]
The proof is the same auxiliary-function reduction used for Lagrange's MVT above. Apply Rolle's
theorem to
\(h(x) = f(x) - f(a) - \tfrac{f(b)-f(a)}{g(b)-g(a)}\bigl(g(x) - g(a)\bigr)\),
which satisfies \(h(a) = h(b) = 0\); the resulting point \(c\) with \(h'(c) = 0\) gives the
stated identity after rearranging \(h'(c) = f'(c) - \tfrac{f(b)-f(a)}{g(b)-g(a)}\,g'(c)\).
Cauchy's Mean Value Theorem generalizes Lagrange's MVT, which is recovered by taking \(g(x) = x\).
It is a vital tool for proving foundational calculus results such as L'Hôpital's Rule, but in
everyday machine learning optimization it is applied directly far less often than the standard MVT or
Taylor's Theorem.
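A worked instance may clarify the statement. The sketch below takes the illustrative pair \(f(x) = x^3\), \(g(x) = x^2\) on \([1, 2]\), where \(g'(x) = 2x\) does not vanish. Since \(f'(x)/g'(x) = 3x/2\) for this pair, the mean value point can be solved for in closed form, and the code only verifies it.

```python
# Hypothetical example pair; g'(x) = 2x is nonzero on (1, 2).
f = lambda x: x**3
g = lambda x: x**2
f_prime = lambda x: 3 * x**2
g_prime = lambda x: 2 * x

a, b = 1.0, 2.0
ratio = (f(b) - f(a)) / (g(b) - g(a))      # (8 - 1) / (4 - 1) = 7/3

# Here f'(x) / g'(x) = 3x/2, so the mean value point solves 3c/2 = ratio.
c = 2.0 * ratio / 3.0                      # c = 14/9

# Derivative of the auxiliary function h from the proof, evaluated at c.
h_prime = f_prime(c) - ratio * g_prime(c)
print(c, f_prime(c) / g_prime(c), ratio, h_prime)
```

The value \(c = 14/9\) lies in \((1, 2)\) and makes \(h'(c)\) vanish up to rounding, matching the auxiliary-function argument.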
Why it matters: From Limits to Information Geometry
While Lagrange's MVT deals with the change of a single function, Cauchy's MVT considers the
ratio of changes between two functions. This is not only the rigorous foundation
for L'Hôpital's Rule but also a critical concept when transitioning to advanced topics.
In the context of modern machine learning and Information Geometry, this theorem
foreshadows the analysis of how different metrics (like loss functions vs. probability constraints)
scale relative to one another. Understanding these relative rates of change is essential for analyzing
the geometry of probability manifolds and natural gradient methods.
Higher-dimensional MVT
Before moving to optimization, we must extend the MVT to multidimensional spaces. While gradient descent relies on local
linear approximations in practice, this theorem provides the theoretical foundation for proving the convergence of such
algorithms. It guarantees that the discrete change in a scalar-valued loss function can be related to the continuous field
of its gradients.
Higher-Dimensional MVT
Let \(f: X \to \mathbb{R}\) be differentiable on an open set \(X \subseteq \mathbb{R}^n\)
(in the sense that the linearization
exists at every point of \(X\) — equivalently, Fréchet differentiable, which is stronger than
mere existence of all partial derivatives). Let \(a, b \in X\) with the closed segment
\([a,b] \subset X\). Then there exists \(c \in (a,b)\) such that
\[
f(b) - f(a) \;=\; \nabla f(c) \cdot (b - a).
\]
Here \([a,b] := \{(1-t)a + tb : t \in [0,1]\}\) and
\((a,b) := \{(1-t)a + tb : t \in (0,1)\}\) denote the closed and open segments joining \(a\)
and \(b\).
Proof.
Define \(h : [0,1] \to \mathbb{R}\) by \(h(t) = f\bigl((1-t)a + tb\bigr) = f(a + t(b-a))\).
Since \([a,b] \subset X\) and \(f\) is differentiable on \(X\), \(h\) is continuous on
\([0,1]\) and differentiable on \((0,1)\). By the chain rule,
\[
h'(t) \;=\; \nabla f(a + t(b-a)) \cdot (b - a).
\]
Applying Lagrange's MVT to \(h\) on \([0,1]\), there exists
\(\theta \in (0,1)\) such that
\[
h(1) - h(0) \;=\; h'(\theta)(1 - 0) \;=\; h'(\theta).
\]
Since \(h(1) = f(b)\) and \(h(0) = f(a)\), setting \(c := a + \theta(b - a) \in (a,b)\) gives
\[
f(b) - f(a) \;=\; \nabla f(c) \cdot (b - a). \qquad \blacksquare
\]
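The reduction to one variable used in the proof can be traced numerically. The following sketch uses the illustrative function \(f(x, y) = x^3 + xy\) with \(a = (0, 0)\) and \(b = (1, 2)\), and finds \(\theta\) by bisection on \(h'(\theta) = h(1) - h(0)\). The bisection relies on a sign change of \(h'(t) - (h(1) - h(0))\) at the endpoints, which holds for this example but is an assumption in general.

```python
def f(p):
    x, y = p
    return x**3 + x * y

def grad_f(p):
    x, y = p
    return (3 * x**2 + y, x)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

a, b = (0.0, 0.0), (1.0, 2.0)
d = tuple(bi - ai for ai, bi in zip(a, b))                  # b - a
point = lambda t: tuple(ai + t * di for ai, di in zip(a, d))

target = f(b) - f(a)                                        # h(1) - h(0)
h_prime = lambda t: dot(grad_f(point(t)), d)                # chain rule

lo, hi = 0.0, 1.0
while hi - lo > 1e-12:
    mid = 0.5 * (lo + hi)
    if (h_prime(lo) - target) * (h_prime(mid) - target) <= 0:
        hi = mid
    else:
        lo = mid
theta = 0.5 * (lo + hi)
c = point(theta)
print(theta, c, dot(grad_f(c), d), target)
```

Here \(h'(t) = 3t^2 + 4t\), so \(\theta\) solves \(3\theta^2 + 4\theta = 3\), and the resulting \(c\) lies on the open segment between \(a\) and \(b\).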
For vector-valued maps the exact equality fails: each scalar component
\(F^j\) supplies its own mean value point \(c_j\), and there is in general no single \(c\)
serving all components at once. What survives is an inequality, and remarkably it survives
with no loss in the constant. The standard device is to test the vector increment against a
fixed direction, reducing the vector statement to a single application of the scalar theorem.
Mean Value Inequality
Let \(F : X \to \mathbb{R}^m\) be differentiable on an open set
\(X \subseteq \mathbb{R}^n\), and let \(a, b \in X\) with the closed segment
\([a,b] \subset X\). Then
\[
\|F(b) - F(a)\| \;\leq\; \Big(\sup_{c \in [a,b]} \|DF(c)\|\Big)\,\|b - a\|,
\]
where \(\|DF(c)\|\) denotes the
operator norm
of the total derivative
\(DF(c) : \mathbb{R}^n \to \mathbb{R}^m\). In particular, if \(F\) is of class \(C^1\)
and \(K \subseteq X\) is compact and convex, then \(F\) is
Lipschitz continuous
on \(K\) with Lipschitz constant \(\sup_{x \in K} \|DF(x)\|\).
Proof.
If \(F(b) = F(a)\) the inequality is trivial, so assume \(F(b) \neq F(a)\) and set
\(\mathbf{u} := F(b) - F(a) \in \mathbb{R}^m\). Define the scalar function
\(\varphi : [0,1] \to \mathbb{R}\) by
\[
\varphi(t) \;=\; \big\langle \mathbf{u},\, F(a + t(b-a)) \big\rangle.
\]
Since \([a,b] \subset X\) and \(F\) is differentiable on \(X\), the composition with the
segment \(t \mapsto a + t(b-a)\) is differentiable on \((0,1)\) and continuous on \([0,1]\);
by the chain rule,
\[
\varphi'(t) \;=\; \big\langle \mathbf{u},\, DF\big(a + t(b-a)\big)(b - a) \big\rangle.
\]
Applying Lagrange's MVT to \(\varphi\) on \([0,1]\) yields
\(\theta \in (0,1)\) with \(\varphi(1) - \varphi(0) = \varphi'(\theta)\). The left-hand side
is \(\langle \mathbf{u}, F(b)\rangle - \langle \mathbf{u}, F(a)\rangle
= \langle \mathbf{u}, \mathbf{u}\rangle = \|\mathbf{u}\|^2\). Writing \(c := a + \theta(b-a)\)
and applying the Cauchy–Schwarz inequality to the right-hand side,
\[
\|\mathbf{u}\|^2 \;=\; \big\langle \mathbf{u},\, DF(c)(b-a) \big\rangle
\;\leq\; \|\mathbf{u}\|\,\|DF(c)(b-a)\|
\;\leq\; \|\mathbf{u}\|\,\|DF(c)\|\,\|b-a\|,
\]
the last step by the definition of the operator norm. Dividing by
\(\|\mathbf{u}\| > 0\) gives
\(\|F(b) - F(a)\| \leq \|DF(c)\|\,\|b-a\| \leq \big(\sup_{c \in [a,b]}\|DF(c)\|\big)\|b-a\|\).
For the \(C^1\) statement, fix a compact convex \(K \subseteq X\). The operator norm
\(x \mapsto \|DF(x)\|\) is continuous, hence attains a finite maximum
\(M = \sup_{x \in K}\|DF(x)\|\) on \(K\). For any \(a, b \in K\), convexity gives
\([a,b] \subseteq K\), so the inequality above yields
\(\|F(b) - F(a)\| \leq M\,\|b-a\|\), which is precisely the Lipschitz condition with
constant \(M\). \(\blacksquare\)
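A simple curve shows both why the equality fails for vector-valued maps and how the inequality behaves. The sketch below uses the illustrative map \(F(t) = (\cos t, \sin t)\) from \(\mathbb{R}\) to \(\mathbb{R}^2\), whose derivative \(DF(t) = (-\sin t, \cos t)\) has operator norm \(1\) at every \(t\).

```python
import math

def F(t):
    return (math.cos(t), math.sin(t))

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def chord(a, b):
    """Euclidean length of F(b) - F(a)."""
    Fa, Fb = F(a), F(b)
    return norm((Fb[0] - Fa[0], Fb[1] - Fa[1]))

# Over a full turn F(b) - F(a) is (numerically) zero, but DF(c)(b - a) has
# norm |b - a| for every c, so no single mean value point can exist.
a, b = 0.0, 2.0 * math.pi
print(chord(a, b), abs(b - a))

# The inequality ||F(b) - F(a)|| <= 1 * |b - a| still holds on any segment.
for a, b in [(0.3, 1.7), (0.0, 3.0), (-1.0, 0.5)]:
    print(chord(a, b), abs(b - a))
```

For this map the chord length is \(2\,|\sin((b - a)/2)|\), which never exceeds \(|b - a|\), in agreement with the bound.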
Important Constraint: Scalar-valued Outputs
The exact equality of the Higher-Dimensional MVT holds only for
scalar-valued functions \(f : \mathbb{R}^n \to \mathbb{R}\); for
vector-valued maps only the Mean Value Inequality above survives. This distinction is the
reason first-order convergence proofs for gradient descent operate on scalar loss functions
directly rather than on vector-valued forward maps. The inequality form, by contrast, is the
workhorse for controlling how far a smooth map can spread points apart, and it is exactly
the estimate that converts a bound on a derivative into the contraction property underlying
the inverse function theorem.
Multivariate Taylor's Theorem
The Higher-Dimensional MVT controlled the first-order behavior of a scalar field
along a segment. To control approximation to arbitrary order — and, crucially, to
obtain an exact remainder rather than an existential mean value point — we
extend Taylor's Theorem to functions of several variables. The single-variable Lagrange form
above located the remainder at an unknown point \(c\); the multivariate version below instead
expresses the remainder as an integral, a form that is indispensable wherever the
remainder must be manipulated algebraically rather than merely bounded. One such place is the
foundation of differential geometry, where this exact integral remainder is the engine that
identifies tangent vectors with directional-derivative operators.
The mechanism is identical to the one used in the proof of the Higher-Dimensional MVT: we
restrict \(f\) to the segment from \(a\) to \(x\) by composing with the curve
\(t \mapsto a + t(x-a)\), reducing a multivariable problem to a single-variable one, and then
invoke the single-variable theory. What changes is only the bookkeeping for higher-order
derivatives, which we organize with multi-index notation.
Multi-index notation
Writing higher-order partial derivatives and the matching monomials explicitly becomes unwieldy
past first order. A multi-index packages the bookkeeping. We use the
ordered convention throughout (each slot independently ranges over
\(1,\dots,n\)), which is what makes the compressed sums below coincide with fully written-out
sums with no extra combinatorial factors.
Definition: Multi-index (ordered convention)
Fix \(n \geq 1\). A multi-index of length \(m\) is an ordered tuple
\(I = (i_1, \dots, i_m)\) with each \(i_j \in \{1, \dots, n\}\); we write \(|I| = m\) for
its order. For a point \(x = (x^1,\dots,x^n) \in \mathbb{R}^n\) and a base
point \(a = (a^1,\dots,a^n) \in \mathbb{R}^n\), the associated scalar monomial is
\[
(x - a)^I \;:=\; (x^{i_1} - a^{i_1})\,(x^{i_2} - a^{i_2}) \cdots (x^{i_m} - a^{i_m}) \;\in\; \mathbb{R},
\]
and for an \(m\)-times differentiable scalar function \(f\) the associated partial derivative is
\[
\partial_I f \;:=\; \frac{\partial^m f}{\partial x^{i_1} \partial x^{i_2} \cdots \partial x^{i_m}} \;\in\; \mathbb{R}.
\]
Both \((x-a)^I\) and \(\partial_I f\) are scalars, even though \(x\) and
\(a\) are vectors. Because the convention is ordered, summing a quantity over all
multi-indices of a fixed order \(|I| = m\) is the same as summing independently over each
slot:
\[
\sum_{I : |I| = m} \;=\; \sum_{i_1 = 1}^{n} \sum_{i_2 = 1}^{n} \cdots \sum_{i_m = 1}^{n}.
\]
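The ordered convention is easy to mirror in code. In the sketch below, the helper names `multi_indices` and `monomial` and the sample points are illustrative. It enumerates all ordered multi-indices of a given order and checks that there are \(n^m\) of them.

```python
from itertools import product
from math import prod

def multi_indices(n, m):
    """All ordered multi-indices (i_1, ..., i_m) with entries in 1..n."""
    return list(product(range(1, n + 1), repeat=m))

def monomial(x, a, I):
    """(x - a)^I as a product of scalar components; indices are 1-based."""
    return prod(x[i - 1] - a[i - 1] for i in I)

n, m = 3, 2
indices = multi_indices(n, m)
x, a = (1.0, 2.0, 4.0), (0.0, 0.0, 1.0)     # displacement x - a = (1, 2, 3)
total = sum(monomial(x, a, I) for I in indices)
print(len(indices), total)
```

Because each slot is summed independently, the sum of \((x - a)^I\) over \(|I| = 2\) factors as \(\bigl(\sum_i (x^i - a^i)\bigr)^2\), which is \(36\) for the displacement \((1, 2, 3)\). Note that \((1, 2)\) and \((2, 1)\) are counted as distinct multi-indices.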
Before stating the general theorem, it is worth writing out the lowest orders explicitly, since
these are the cases that arise most often. At first order (\(k = 1\)), the
expansion reads
\[
f(x) \;=\; f(a) \;+\; \sum_{i=1}^{n} \frac{\partial f}{\partial x^i}(a)\,(x^i - a^i)
\;+\; \underbrace{\sum_{i,j=1}^{n} (x^i - a^i)(x^j - a^j) \int_0^1 (1 - t)\,
\frac{\partial^2 f}{\partial x^i \partial x^j}\bigl(a + t(x-a)\bigr)\,dt}_{\text{remainder } R_1(x)} .
\]
Every term here is a scalar: \(f(a)\) and each \(\partial f/\partial x^i(a)\)
are scalars, \((x^i - a^i)\) is a scalar component of the displacement vector \(x - a\), and the
integral is a scalar (the integrand is evaluated at the vector \(a + t(x-a)\) but returns a real
number). The linear term \(\sum_i \partial_i f(a)(x^i - a^i)\) is exactly the directional
contribution \(\nabla f(a) \cdot (x - a)\) seen in the Higher-Dimensional MVT, now accompanied by
an explicit second-order remainder rather than hidden inside a mean value point.
At second order (\(k = 2\)), the new term is the quadratic
\[
\frac{1}{2} \sum_{i,j=1}^{n} \frac{\partial^2 f}{\partial x^i \partial x^j}(a)\,(x^i - a^i)(x^j - a^j),
\]
which is precisely one-half the quadratic form of the Hessian — the symmetric
matrix of second partial derivatives \(\partial^2 f / \partial x^i \partial x^j(a)\) — evaluated
on the displacement vector \(x - a\). This quadratic form is itself a scalar. It is this
\(k = 2\) expansion that underlies second-order optimization and Laplace-type approximations.
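To see the first- and second-order expansions side by side, the sketch below uses the illustrative function \(f(x, y) = e^x \sin y\) at \(a = (0, 0)\). The gradient \((0, 1)\) and Hessian entries are computed by hand for this example and hard-coded, so they are assumptions of the sketch rather than outputs of any differentiation routine.

```python
import math

def f(x, y):
    return math.exp(x) * math.sin(y)

# Hand-computed data at a = (0, 0) for this specific f.
f_a = 0.0
grad = (0.0, 1.0)
hess = ((0.0, 1.0), (1.0, 0.0))

def P1(x, y):
    d = (x, y)
    return f_a + sum(grad[i] * d[i] for i in range(2))

def P2(x, y):
    d = (x, y)
    quad = 0.5 * sum(hess[i][j] * d[i] * d[j] for i in range(2) for j in range(2))
    return P1(x, y) + quad          # adds one-half the Hessian quadratic form

x, y = 0.1, 0.2
print(f(x, y), P1(x, y), P2(x, y))
```

For this function \(P_2(x, y) = y + xy\), and at \((0.1, 0.2)\) the second-order polynomial is noticeably closer to \(f\) than the linearization.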
The general statement compresses these patterns using the multi-index notation above.
Multivariate Taylor's Theorem (integral remainder)
Let \(X \subseteq \mathbb{R}^n\) be open and let \(a \in X\) be fixed. Suppose
\(f \in C^{k+1}(X)\) for some integer \(k \geq 0\). If \(W \subseteq X\) is any convex set
containing \(a\), then for all \(x \in W\),
\[
f(x) \;=\; P_k(x) + R_k(x),
\]
where the \(k\)th-order Taylor polynomial \(P_k\) is the scalar-valued
function
\[
P_k(x) \;=\; f(a) + \sum_{m=1}^{k} \frac{1}{m!} \sum_{I : |I| = m} \partial_I f(a)\,(x - a)^I,
\]
and the \(k\)th remainder term \(R_k\) is
\[
R_k(x) \;=\; \frac{1}{k!} \sum_{I : |I| = k+1} (x - a)^I \int_0^1 (1 - t)^k\,
\partial_I f\bigl(a + t(x-a)\bigr)\,dt .
\]
Proof.
Fix \(x \in W\). Since \(W\) is convex and contains both \(a\) and \(x\), the segment
\(\{a + t(x-a) : t \in [0,1]\}\) lies in \(W \subseteq X\), so the scalar function
\[
u(t) \;:=\; f\bigl(a + t(x-a)\bigr), \qquad t \in [0,1],
\]
is well defined and of class \(C^{k+1}\). We prove the claim by induction on \(k\), tracking
derivatives of \(u\) via the
chain rule.
Base case \(k = 0\). Here \(P_0(x) = f(a)\). By the chain rule,
\(u'(t) = \sum_{i=1}^n \partial_i f(a + t(x-a))\,(x^i - a^i)
= \sum_{I : |I| = 1} (x-a)^I\, \partial_I f(a + t(x-a))\). The Fundamental Theorem of
Calculus applied to \(u\) on \([0,1]\) gives
\[
f(x) - f(a) \;=\; u(1) - u(0) \;=\; \int_0^1 u'(t)\,dt
\;=\; \sum_{I : |I| = 1} (x-a)^I \int_0^1 \partial_I f\bigl(a + t(x-a)\bigr)\,dt,
\]
which is exactly \(f(x) = P_0(x) + R_0(x)\), since \((1-t)^0 = 1\) and \(0! = 1\).
Inductive step. Suppose the formula holds for some \(k\) and that \(f \in C^{k+2}(X)\), as the
statement for \(k + 1\) requires; we promote the formula to
\(k + 1\). Each remainder integral has the form \(\int_0^1 (1-t)^k\, g(t)\,dt\) with
\(g(t) = \partial_I f(a + t(x-a))\). Integrating by parts, using
\(\frac{d}{dt}\!\left[-\frac{(1-t)^{k+1}}{k+1}\right] = (1-t)^k\) and again the chain rule
for \(g'(t) = \sum_{j=1}^n \partial_j \partial_I f(a + t(x-a))\,(x^j - a^j)\), yields
\[
\int_0^1 (1-t)^k\, g(t)\,dt
\;=\; \frac{1}{k+1}\, g(0)
\;+\; \frac{1}{k+1} \sum_{j=1}^n (x^j - a^j) \int_0^1 (1-t)^{k+1}\,
\partial_j \partial_I f\bigl(a + t(x-a)\bigr)\,dt .
\]
The boundary term contributes \(\frac{1}{k+1}\,\partial_I f(a)\), which (summed with the
prefactor \(\frac{1}{k!}\) over \(|I| = k+1\)) supplies exactly the order-\((k+1)\) term of
\(P_{k+1}\), since \(\frac{1}{k!}\cdot\frac{1}{k+1} = \frac{1}{(k+1)!}\). The surviving
integral, with one extra differentiation slot \(j\) prepended, reassembles into the sum over
multi-indices of order \(k + 2\), giving \(R_{k+1}\) with weight \((1-t)^{k+1}\) and
prefactor \(\frac{1}{(k+1)!}\). This is precisely the asserted formula with \(k\) replaced by
\(k + 1\), completing the induction. \(\qquad \blacksquare\)
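The integral remainder can be checked numerically in the case \(k = 1\). The sketch below again uses the illustrative function \(f(x, y) = e^x \sin y\) with \(a = (0, 0)\), evaluates each remainder integral with a composite Simpson rule (a choice made here for simplicity), and compares \(R_1(x)\) with \(f(x) - P_1(x)\). The second partial derivatives are entered by hand.

```python
import math

def f(p):
    x, y = p
    return math.exp(x) * math.sin(y)

def second_partials(p):
    """Hand-computed second partials of f, indexed [i][j]."""
    x, y = p
    ex = math.exp(x)
    return ((ex * math.sin(y), ex * math.cos(y)),
            (ex * math.cos(y), -ex * math.sin(y)))

def simpson(g, n=200):
    """Composite Simpson rule on [0, 1] with an even number n of subintervals."""
    h = 1.0 / n
    s = g(0.0) + g(1.0)
    s += 4 * sum(g((2 * i - 1) * h) for i in range(1, n // 2 + 1))
    s += 2 * sum(g(2 * i * h) for i in range(1, n // 2))
    return s * h / 3

def remainder_R1(a, x):
    """R_1(x): sum over ordered (i, j) of (x-a)^i (x-a)^j * int (1-t) d_ij f dt."""
    d = (x[0] - a[0], x[1] - a[1])
    total = 0.0
    for i in range(2):
        for j in range(2):
            integrand = lambda t, i=i, j=j: (1 - t) * second_partials(
                (a[0] + t * d[0], a[1] + t * d[1]))[i][j]
            total += d[i] * d[j] * simpson(integrand)
    return total                      # prefactor 1/k! equals 1 for k = 1

a, x = (0.0, 0.0), (0.5, 0.8)
P1 = f(a) + 0.0 * (x[0] - a[0]) + 1.0 * (x[1] - a[1])   # grad f(a) = (0, 1)
lhs = f(x) - P1
rhs = remainder_R1(a, x)
print(lhs, rhs)
```

The two printed numbers agree to within the quadrature error, which illustrates that the integral remainder is an exact identity rather than an estimate.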
When only a bound on the error is needed — rather than the exact integral — the integral
remainder collapses to a clean estimate.
Taylor Error Bound
Let \(X \subseteq \mathbb{R}^n\) be open, \(a \in X\), and \(f \in C^{k+1}(X)\) for some
\(k \geq 0\). If \(W \subseteq X\) is convex, contains \(a\), and every partial derivative of
\(f\) of order \(k + 1\) is bounded in absolute value by a constant \(M\) on \(W\), then for
all \(x \in W\),
\[
\bigl| f(x) - P_k(x) \bigr| \;\leq\; \frac{n^{k+1}\,M}{(k+1)!}\,\lVert x - a\rVert^{\,k+1},
\]
where \(P_k\) is the \(k\)th Taylor polynomial of \(f\) at \(a\) and \(\lVert\cdot\rVert\)
is the Euclidean norm.
Proof.
The remainder \(R_k(x) = f(x) - P_k(x)\) is a sum over the multi-indices \(I\) with
\(|I| = k+1\); by the ordered convention there are \(n^{k+1}\) such terms. For each, the
scalar factor satisfies \(\lvert (x-a)^I \rvert \leq \lVert x - a\rVert^{\,k+1}\) (each of the
\(k+1\) scalar factors \(\lvert x^{i_j} - a^{i_j}\rvert\) is at most the Euclidean norm
\(\lVert x - a\rVert\)), the integrand obeys
\(\lvert \partial_I f(a + t(x-a))\rvert \leq M\), and
\(\int_0^1 (1-t)^k\,dt = \tfrac{1}{k+1}\). Combining with the prefactor \(\tfrac{1}{k!}\)
and \(\tfrac{1}{k!}\cdot\tfrac{1}{k+1} = \tfrac{1}{(k+1)!}\), the triangle inequality over the
\(n^{k+1}\) terms gives the stated bound. \(\qquad \blacksquare\)
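The bound can be compared with an actual error on a concrete example. The sketch below uses the illustrative function \(f(x, y) = e^x \sin y\) at \(a = (0, 0)\) with \(k = 2\) and \(n = 2\). On the convex set \(W = [-1, 1]^2\) every third-order partial derivative of this \(f\) is \(\pm e^x \sin y\) or \(\pm e^x \cos y\), so \(M = e\) is a valid constant. The polynomial \(P_2(x, y) = y + xy\) is computed by hand, and the name `taylor_bound` is hypothetical.

```python
import math

def f(x, y):
    return math.exp(x) * math.sin(y)

def P2(x, y):
    # Second-order Taylor polynomial of f at (0, 0), computed by hand.
    return y + x * y

def taylor_bound(n, k, M, dist):
    """n^(k+1) * M / (k+1)! * dist^(k+1), as in the error bound above."""
    return n ** (k + 1) * M / math.factorial(k + 1) * dist ** (k + 1)

M = math.e                      # bounds all third-order partials on [-1, 1]^2
x, y = 0.1, 0.2
dist = math.hypot(x, y)         # Euclidean norm of x - a with a = (0, 0)
err = abs(f(x, y) - P2(x, y))
bound = taylor_bound(n=2, k=2, M=M, dist=dist)
print(err, bound)
```

The actual error is far below the bound here. The factor \(n^{k+1}\) counts every ordered multi-index separately and each component is bounded by the full norm, so the estimate is deliberately crude.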
Why the integral remainder matters
The Lagrange form earlier in this page is ideal for bounding error, and the
Taylor error bound above recovers exactly that role in the multivariate setting. But the integral form of the
remainder does something the Lagrange form cannot: it exhibits the remainder as an explicit
sum in which each term carries a factor \((x^i - a^i)\) that vanishes at
\(x = a\) multiplied by a smooth coefficient. This algebraic structure — a remainder built
from products that die at the base point — is exactly what is needed when one must apply a
linear operator to the remainder and have it annihilate the leftover terms. That is the
decisive step in the smooth-manifold theory of tangent vectors, where the first-order
(\(k = 1\)) case of this theorem identifies abstract tangent vectors with classical
directional derivatives.