Mean Value Theorem


Lagrange's Mean Value Theorem

Up to this point, we have focused on the instantaneous rate of change at a single point. Lagrange's Mean Value Theorem (MVT) bridges the gap between this local derivative and the average change over an interval.

Lagrange's Mean Value Theorem

Suppose \(f\) is continuous on \([a, b]\) and differentiable on \((a, b)\). Then there exists \(c \in (a, b)\) such that \[ f'(c) \;=\; \frac{f(b) - f(a)}{b - a}. \tag{1} \]

This says that at some point \(c\), the instantaneous rate of change equals the average rate of change over the entire interval. The special case \(f(a) = f(b)\) is known as Rolle's Theorem: under the same regularity hypotheses, if \(f(a) = f(b)\) then there exists \(c \in (a,b)\) with \(f'(c) = 0\).

Sketch of the proof of Rolle's Theorem.

Since \(f\) is continuous on the compact interval \([a,b]\), the Extreme Value Theorem guarantees that \(f\) attains its maximum \(M\) and minimum \(m\) on \([a,b]\). Since \(m \leq f(a) = f(b) \leq M\), at least one of the following three cases holds (cases 2 and 3 may hold simultaneously):

  • If \(M = m\), then \(f\) is constant and \(f'(c) = 0\) for every \(c \in (a,b)\).
  • If \(M > f(a) = f(b)\), the maximum is attained at some interior point \(c \in (a,b)\); since \(f\) is differentiable there and \(c\) is a local maximum, Fermat's theorem gives \(f'(c) = 0\).
  • If \(m < f(a) = f(b)\), the minimum is attained at an interior point \(c\), and the same argument yields \(f'(c) = 0\).

These three cases are exhaustive: if \(M = m\) fails then \(M > m\), so at least one of the values \(M, m\) differs from \(f(a) = f(b)\), triggering case 2 or case 3.

Now we use Rolle's Theorem to prove Lagrange's MVT.

Proof of Lagrange's MVT.

Let the secant line joining \((a, f(a))\) and \((b, f(b))\) be \[ y \;=\; f(a) + \frac{f(b)-f(a)}{b - a}\,(x - a), \] and define \(g(x)\) as the vertical gap between the graph of \(f\) and this secant line: \[ g(x) \;=\; f(x) - f(a) - \frac{f(b)-f(a)}{b - a}\,(x - a). \tag{2} \] Since \(g\) is the sum of \(f\) and a first-degree polynomial, and sums of continuous (resp. differentiable) functions are continuous (resp. differentiable), \(g\) is continuous on \([a, b]\) and differentiable on \((a, b)\). Differentiating (2), \[ g'(x) \;=\; f'(x) - \frac{f(b)-f(a)}{b - a}. \] A direct substitution shows \(g(a) = g(b) = 0\), so by Rolle's Theorem there exists \(c \in (a,b)\) with \(g'(c) = 0\), i.e. \[ f'(c) - \frac{f(b)-f(a)}{b - a} \;=\; 0, \] which rearranges to the desired identity. \(\blacksquare\)
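As a concrete check of this construction, the following Python sketch takes the illustrative choice \(f(x) = x^3\) on \([0, 2]\) (an assumption made for this example, not part of the theorem) and locates a mean value point by bisecting on \(g'(x) = f'(x) - \frac{f(b)-f(a)}{b-a}\). The helper names `g_prime` and `bisect` are hypothetical.

```python
import math

def f(x):
    return x ** 3

def f_prime(x):
    return 3 * x ** 2

a, b = 0.0, 2.0
secant_slope = (f(b) - f(a)) / (b - a)  # average rate of change over [a, b]

def g_prime(x):
    # derivative of the vertical gap between f and the secant line
    return f_prime(x) - secant_slope

def bisect(func, lo, hi, tol=1e-12):
    # plain bisection; assumes func(lo) and func(hi) have opposite signs
    flo = func(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        fmid = func(mid)
        if (flo < 0) == (fmid < 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)

c = bisect(g_prime, a, b)
print(c, f_prime(c), secant_slope)
```

For this particular \(f\) the point can also be found by hand: \(3c^2 = 4\) gives \(c = 2/\sqrt{3}\), which lies in \((0, 2)\) as the theorem promises.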

Taylor's Theorem

By rearranging Equation (1) as \(f(b) = f(a) + f'(c)(b-a)\), we see that the MVT is the simplest instance of a Taylor expansion with an exact remainder: the constant approximation \(f(a)\) plus a first-order remainder term \(f'(c)(b-a)\). Taylor's Theorem generalizes this to higher-order polynomials.

Taylor's theorem comes in two standard forms that differ in their hypotheses and in the information they provide. The Peano form captures the local asymptotic behavior of \(f\) near \(a\) under a pointwise differentiability hypothesis, while the Lagrange form provides a pointwise remainder expression under a stronger, interval-wide differentiability hypothesis.

We first introduce the \(n\)-th order Taylor polynomial, which is the polynomial part common to both forms: \[ T_n(x) \;=\; \sum_{k = 0}^{n} \frac{f^{(k)}(a)}{k!}\,(x - a)^k \;=\; f(a) + \frac{f'(a)}{1!}(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \cdots + \frac{f^{(n)}(a)}{n!}(x-a)^n. \] The first-order Taylor polynomial \[ T_1(x) \;=\; f(a) + f'(a)(x-a) \] is precisely the linearization of \(f\) at \(a\); Taylor's theorem is its systematic higher-order generalization.

Taylor's Theorem, Peano Form

Let \(n \geq 1\) be an integer, let \(U \subseteq \mathbb{R}\) be an open neighborhood of \(a\), and suppose \(f: U \to \mathbb{R}\) is \((n-1)\) times differentiable on \(U\) with \(f^{(n-1)}\) differentiable at \(a\) (so that \(f^{(n)}(a)\) exists). Then \[ f(x) \;=\; T_n(x) + o\bigl(|x - a|^n\bigr) \qquad \text{as } x \to a, \] meaning that the remainder \(R_n(x) := f(x) - T_n(x)\) satisfies \(\lim_{x \to a} R_n(x) / |x-a|^n = 0\).

The little-\(o\) symbol, introduced in the Linear Approximations page, means that \(R_n(x)\) vanishes strictly faster than \(|x-a|^n\) as \(x \to a\). The Peano form is the natural statement for asymptotic analysis: it characterizes how well \(T_n\) approximates \(f\) in the limit \(x \to a\), but it gives no explicit formula or bound for \(R_n\) at any fixed \(x \neq a\). For that, we need a stronger hypothesis.
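To make the little-\(o\) statement tangible, here is a small numerical sketch in Python. It assumes the illustrative choice \(f = \sin\), \(a = 0\), \(n = 3\) (so \(T_3(x) = x - x^3/6\)), none of which comes from the text, and tabulates the ratio \(|R_3(x)| / |x|^3\) as \(x\) shrinks.

```python
import math

def taylor_sin_3(x):
    # third-order Taylor polynomial of sin centered at a = 0
    return x - x ** 3 / 6.0

ratios = []
for k in range(1, 9):
    x = 0.5 ** k
    remainder = math.sin(x) - taylor_sin_3(x)
    ratios.append(abs(remainder) / abs(x) ** 3)

for k, r in enumerate(ratios, start=1):
    print(f"x = 2^-{k}: |R_3(x)| / |x|^3 = {r:.3e}")
```

The ratios shrink toward zero, which is what the Peano form asserts. The step sizes are kept moderate because for much smaller \(x\) floating-point cancellation in the subtraction would dominate the true remainder.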

Taylor's Theorem, Lagrange Form

Let \(n \geq 0\) be an integer and let \(I \subseteq \mathbb{R}\) be an interval containing both \(a\) and \(x\), with \(x \neq a\). Suppose \(f: I \to \mathbb{R}\) is \(n+1\) times differentiable on \(I\). Then there exists \(c\) strictly between \(a\) and \(x\) such that \[ R_n(x) \;=\; \frac{f^{(n+1)}(c)}{(n+1)!}\,(x - a)^{n+1}, \] where \(R_n(x) = f(x) - T_n(x)\) is the remainder. For \(n = 0\) this is exactly Lagrange's MVT.

The Lagrange form is useful when a concrete pointwise bound on \(R_n\) is needed: any bound on \(|f^{(n+1)}|\) over \(I\) yields a bound on \(|R_n(x)|\) via \(|R_n(x)| \leq \frac{\sup_{c \in I} |f^{(n+1)}(c)|}{(n+1)!}\,|x-a|^{n+1}\). This is the form invoked in error analysis for numerical methods and in convergence proofs for optimization algorithms where an explicit remainder estimate is required.

When \(f\) is infinitely differentiable on \(I\) and the Lagrange remainder satisfies \(R_n(x) \to 0\) as \(n \to \infty\), the partial sums \(T_n(x)\) converge to \(f(x)\), and the limit is the Taylor series of \(f\) centered at \(a\): \[ f(x) \;=\; \sum_{k = 0}^{\infty} \frac{f^{(k)}(a)}{k!}\,(x - a)^k. \]

Approximation vs. Reality

We often use the first-order approximation \(T_1(x) = f(a) + f'(a)(x-a)\). Taylor's theorem tells us exactly how much "error" we incur by ignoring higher-order terms — in the Lagrange form, a bound on \(|f''|\) on the interval becomes a bound on the error. This is critical when analyzing the convergence of optimization algorithms, where higher-order remainders determine how large a step a method can safely take.
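The following Python sketch illustrates the first-order case numerically. It assumes \(f(x) = e^x\), \(a = 0\), and the interval \([0, 0.5]\) (all illustrative choices), on which \(|f''| \leq e^{0.5}\), so the linearization error should not exceed \(\tfrac{1}{2} e^{0.5} (x-a)^2\).

```python
import math

a = 0.0
upper = 0.5
M2 = math.exp(upper)  # sup of |f''(x)| = e^x over [0, 0.5]

def T1(x):
    # first-order Taylor polynomial of exp at a: f(a) + f'(a)(x - a)
    return math.exp(a) + math.exp(a) * (x - a)

checks = []
for x in [0.05, 0.1, 0.25, 0.5]:
    error = abs(math.exp(x) - T1(x))
    bound = 0.5 * M2 * (x - a) ** 2
    checks.append((x, error, bound))
    print(f"x={x:.2f}  error={error:.6f}  bound={bound:.6f}")
```

In every row the actual error sits below the bound, and both shrink quadratically as \(x\) approaches \(a\).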

A subtle point deserves emphasis: being infinitely differentiable is not enough for the Taylor series to recover the function. The standard counterexample is \[ f(x) = \begin{cases} e^{-1/x^2} & x \neq 0, \\ 0 & x = 0, \end{cases} \] which is \(C^\infty\) on all of \(\mathbb{R}\) with \(f^{(k)}(0) = 0\) for every \(k\). Its Taylor series at \(0\) is therefore identically zero, yet \(f(x) > 0\) for every \(x \neq 0\): the series converges everywhere but fails to equal \(f\) anywhere except at the origin. Functions whose Taylor series converges to \(f\) on some neighborhood of every point are called real-analytic, a condition strictly stronger than \(C^\infty\). This distinction matters in information geometry, where the exponential and other parametric families are typically real-analytic, and in the convergence theory of spectral methods (Chebyshev and Fourier expansions), where analyticity yields exponential (geometric) convergence, whereas \(C^\infty\) regularity alone guarantees only convergence faster than every algebraic rate, and finite smoothness gives only algebraic rates.
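A quick numerical look at this function, sketched in Python, shows both halves of the phenomenon: the values are strictly positive away from the origin, yet they are dwarfed by every power of \(x\), consistent with all derivatives vanishing at \(0\). The function name `flat` and the sample points are illustrative assumptions.

```python
import math

def flat(x):
    # e^{-1/x^2} extended by 0 at the origin
    return math.exp(-1.0 / x ** 2) if x != 0 else 0.0

for x in [0.5, 0.2, 0.1]:
    scaled = [flat(x) / x ** n for n in (1, 5, 10)]
    print(f"x={x}: f(x)={flat(x):.3e}, f(x)/x^n for n=1,5,10 -> {scaled}")

taylor_series_value = 0.0  # every Taylor coefficient at 0 vanishes
```

At \(x = 0.1\) the function equals \(e^{-100}\), a positive number that the zero Taylor series misses entirely.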

Cauchy's Mean Value Theorem

Cauchy's Mean Value Theorem

Let \(f\) and \(g\) be continuous on \([a,b]\) and differentiable on \((a,b)\). If \(g'(x) \neq 0\) for every \(x \in (a, b)\), then \(g(b) \neq g(a)\) (else Rolle's theorem would give a point where \(g'\) vanishes), and there exists \(c \in (a, b)\) such that \[ \frac{f(b)-f(a)}{g(b)-g(a)} \;=\; \frac{f'(c)}{g'(c)}. \]

The proof is the same auxiliary-function reduction used for Lagrange's MVT above. Apply Rolle's theorem to \(h(x) = f(x) - f(a) - \tfrac{f(b)-f(a)}{g(b)-g(a)}\bigl(g(x) - g(a)\bigr)\), which satisfies \(h(a) = h(b) = 0\); the resulting point \(c\) with \(h'(c) = 0\) gives the stated identity after rearranging \(h'(c) = f'(c) - \tfrac{f(b)-f(a)}{g(b)-g(a)}\,g'(c)\).
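The same bisection idea locates a Cauchy mean value point. The sketch below assumes the illustrative pair \(f(x) = x^2\), \(g(x) = x^3\) on \([1, 2]\), where \(g'\) does not vanish, and finds a root of \(h'\). The helper names are hypothetical.

```python
def f(x):
    return x ** 2

def g(x):
    return x ** 3

def f_prime(x):
    return 2 * x

def g_prime(x):
    return 3 * x ** 2

a, b = 1.0, 2.0
ratio = (f(b) - f(a)) / (g(b) - g(a))  # left-hand side of Cauchy's identity

def h_prime(x):
    # derivative of the auxiliary function h used in the proof
    return f_prime(x) - ratio * g_prime(x)

def bisect(func, lo, hi, tol=1e-12):
    # plain bisection; assumes func(lo) and func(hi) have opposite signs
    flo = func(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        fmid = func(mid)
        if (flo < 0) == (fmid < 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)

c = bisect(h_prime, a, b)
print(c, f_prime(c) / g_prime(c), ratio)
```

Solving by hand, \(2c / (3c^2) = 3/7\) gives \(c = 14/9\), which agrees with the numerical root.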

Cauchy's Mean Value Theorem generalizes Lagrange's MVT, which is the special case \(g(x) = x\). While it is a vital tool for proving foundational calculus results like L'Hôpital's Rule, it is applied directly in everyday machine learning optimization less often than the standard MVT or Taylor's Theorem.

Why it matters: From Limits to Information Geometry

While Lagrange's MVT deals with the change of a single function, Cauchy's MVT considers the ratio of changes between two functions. This is not only the rigorous foundation for L'Hôpital's Rule but also a critical concept when transitioning to advanced topics.

In the context of modern machine learning and Information Geometry, this theorem foreshadows the analysis of how different metrics (like loss functions vs. probability constraints) scale relative to one another. Understanding these relative rates of change is essential for analyzing the geometry of probability manifolds and natural gradient methods.

Higher-dimensional MVT

Before moving to optimization, we must extend the MVT to multidimensional spaces. While gradient descent relies on local linear approximations in practice, this theorem provides the theoretical foundation for proving the convergence of such algorithms. It guarantees that the discrete change in a scalar-valued loss function can be related to the continuous field of its gradients.

Higher-Dimensional MVT

Let \(f: X \to \mathbb{R}\) be differentiable on an open set \(X \subseteq \mathbb{R}^n\) (in the sense that the linearization exists at every point of \(X\) — equivalently, Fréchet differentiable, which is stronger than mere existence of all partial derivatives). Let \(a, b \in X\) with the closed segment \([a,b] \subset X\). Then there exists \(c \in (a,b)\) such that \[ f(b) - f(a) \;=\; \nabla f(c) \cdot (b - a). \] Here \([a,b] := \{(1-t)a + tb : t \in [0,1]\}\) and \((a,b) := \{(1-t)a + tb : t \in (0,1)\}\) denote the closed and open segments joining \(a\) and \(b\).

Proof.

Define \(h : [0,1] \to \mathbb{R}\) by \(h(t) = f\bigl((1-t)a + tb\bigr) = f(a + t(b-a))\). Since \([a,b] \subset X\) and \(f\) is differentiable on \(X\), \(h\) is continuous on \([0,1]\) and differentiable on \((0,1)\). By the chain rule, \[ h'(t) \;=\; \nabla f(a + t(b-a)) \cdot (b - a). \] Applying Lagrange's MVT to \(h\) on \([0,1]\), there exists \(\theta \in (0,1)\) such that \[ h(1) - h(0) \;=\; h'(\theta)(1 - 0) \;=\; h'(\theta). \] Since \(h(1) = f(b)\) and \(h(0) = f(a)\), setting \(c := a + \theta(b - a) \in (a,b)\) gives \[ f(b) - f(a) \;=\; \nabla f(c) \cdot (b - a). \qquad \blacksquare \]
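The reduction to one variable can be carried out numerically. The Python sketch below assumes the illustrative scalar field \(f(x, y) = x^3 + y^2\) with \(a = (0,0)\) and \(b = (1,2)\), and bisects on \(h'(t) - (h(1) - h(0))\) to find \(\theta\). The names `point`, `defect`, and `bisect` are hypothetical.

```python
def f(p):
    x, y = p
    return x ** 3 + y ** 2

def grad_f(p):
    x, y = p
    return (3 * x ** 2, 2 * y)

a = (0.0, 0.0)
b = (1.0, 2.0)
d = (b[0] - a[0], b[1] - a[1])  # displacement b - a
target = f(b) - f(a)            # h(1) - h(0)

def point(t):
    # the point a + t(b - a) on the segment
    return (a[0] + t * d[0], a[1] + t * d[1])

def defect(t):
    # h'(t) - (h(1) - h(0)), with h'(t) = grad f(a + t(b - a)) . (b - a)
    gx, gy = grad_f(point(t))
    return gx * d[0] + gy * d[1] - target

def bisect(func, lo, hi, tol=1e-12):
    # plain bisection; assumes func(lo) and func(hi) have opposite signs
    flo = func(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        fmid = func(mid)
        if (flo < 0) == (fmid < 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return 0.5 * (lo + hi)

theta = bisect(defect, 0.0, 1.0)
c = point(theta)
print(theta, c)
```

Here \(h'(t) = 3t^2 + 8t\) and \(h(1) - h(0) = 5\), so \(\theta = (-8 + \sqrt{124})/6\), and the mean value point \(c\) lies on the open segment as claimed.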

For vector-valued maps the exact equality fails: each scalar component \(F^j\) supplies its own mean value point \(c_j\), and there is in general no single \(c\) serving all components at once. What survives is an inequality, and remarkably it survives with no loss in the constant. The standard device is to test the vector increment against a fixed direction, reducing the vector statement to a single application of the scalar theorem.

Mean Value Inequality

Let \(F : X \to \mathbb{R}^m\) be differentiable on an open set \(X \subseteq \mathbb{R}^n\), and let \(a, b \in X\) with the closed segment \([a,b] \subset X\). Then \[ \|F(b) - F(a)\| \;\leq\; \Big(\sup_{c \in [a,b]} \|DF(c)\|\Big)\,\|b - a\|, \] where \(\|DF(c)\|\) denotes the operator norm of the total derivative \(DF(c) : \mathbb{R}^n \to \mathbb{R}^m\). In particular, if \(F\) is of class \(C^1\) and \(K \subseteq X\) is compact and convex, then \(F\) is Lipschitz continuous on \(K\) with Lipschitz constant \(\sup_{x \in K} \|DF(x)\|\).

Proof.

If \(F(b) = F(a)\) the inequality is trivial, so assume \(F(b) \neq F(a)\) and set \(\mathbf{u} := F(b) - F(a) \in \mathbb{R}^m\). Define the scalar function \(\varphi : [0,1] \to \mathbb{R}\) by \[ \varphi(t) \;=\; \big\langle \mathbf{u},\, F(a + t(b-a)) \big\rangle. \] Since \([a,b] \subset X\) and \(F\) is differentiable on \(X\), the composition with the segment \(t \mapsto a + t(b-a)\) is differentiable on \((0,1)\) and continuous on \([0,1]\); by the chain rule, \[ \varphi'(t) \;=\; \big\langle \mathbf{u},\, DF\big(a + t(b-a)\big)(b - a) \big\rangle. \] Applying Lagrange's MVT to \(\varphi\) on \([0,1]\) yields \(\theta \in (0,1)\) with \(\varphi(1) - \varphi(0) = \varphi'(\theta)\). The left-hand side is \(\langle \mathbf{u}, F(b)\rangle - \langle \mathbf{u}, F(a)\rangle = \langle \mathbf{u}, \mathbf{u}\rangle = \|\mathbf{u}\|^2\). Writing \(c := a + \theta(b-a)\) and applying the Cauchy–Schwarz inequality to the right-hand side, \[ \|\mathbf{u}\|^2 \;=\; \big\langle \mathbf{u},\, DF(c)(b-a) \big\rangle \;\leq\; \|\mathbf{u}\|\,\|DF(c)(b-a)\| \;\leq\; \|\mathbf{u}\|\,\|DF(c)\|\,\|b-a\|, \] the last step by the definition of the operator norm. Dividing by \(\|\mathbf{u}\| > 0\) gives \(\|F(b) - F(a)\| \leq \|DF(c)\|\,\|b-a\| \leq \big(\sup_{c \in [a,b]}\|DF(c)\|\big)\|b-a\|\).

For the \(C^1\) statement, fix a compact convex \(K \subseteq X\). Since \(F\) is \(C^1\), the map \(x \mapsto \|DF(x)\|\) is continuous on \(K\), hence attains a finite maximum \(M = \sup_{x \in K}\|DF(x)\|\) on \(K\). For any \(a, b \in K\), convexity gives \([a,b] \subseteq K\), so the inequality above yields \(\|F(b) - F(a)\| \leq M\,\|b-a\|\), which is precisely the Lipschitz condition with constant \(M\). \(\blacksquare\)
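A standard example of why only the inequality survives is the circle map \(F(t) = (\cos t, \sin t)\) on \([0, 2\pi]\), chosen here purely for illustration. The Python sketch below checks that \(F(2\pi) - F(0)\) is numerically zero while \(\|DF(t)\| = 1\) everywhere, so no exact mean value point can exist, yet the inequality holds comfortably. The supremum is approximated by sampling, which is harmless here because the norm is constant.

```python
import math

def F(t):
    return (math.cos(t), math.sin(t))

def DF_norm(t):
    # DF(t) = (-sin t, cos t) acts on scalars, so its operator norm
    # equals its Euclidean length
    return math.hypot(-math.sin(t), math.cos(t))

a, b = 0.0, 2.0 * math.pi
Fa, Fb = F(a), F(b)
increment = math.hypot(Fb[0] - Fa[0], Fb[1] - Fa[1])

samples = [a + (b - a) * i / 1000 for i in range(1001)]
sup_norm = max(DF_norm(t) for t in samples)
min_norm = min(DF_norm(t) for t in samples)

print("||F(b) - F(a)||        =", increment)
print("min ||DF(c)|| * |b-a|  =", min_norm * (b - a))
print("sup ||DF(c)|| * |b-a|  =", sup_norm * (b - a))
```

An exact identity \(F(b) - F(a) = DF(c)(b - a)\) would force \(\|DF(c)\|\,|b - a|\) to be zero, which the second printed value rules out.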

Important Constraint: Scalar-valued Outputs

The exact equality of the Higher-Dimensional MVT holds only for scalar-valued functions \(f : \mathbb{R}^n \to \mathbb{R}\); for vector-valued maps only the Mean Value Inequality above survives. This distinction is the reason first-order convergence proofs for gradient descent operate on scalar loss functions directly rather than on vector-valued forward maps. The inequality form, by contrast, is the workhorse for controlling how far a smooth map can spread points apart, and it is exactly the estimate that converts a bound on a derivative into the contraction property underlying the inverse function theorem.

Multivariate Taylor's Theorem

The Higher-Dimensional MVT controlled the first-order behavior of a scalar field along a segment. To control approximation to arbitrary order — and, crucially, to obtain an exact remainder rather than an existential mean value point — we extend Taylor's Theorem to functions of several variables. The single-variable Lagrange form above located the remainder at an unknown point \(c\); the multivariate version below instead expresses the remainder as an integral, a form that is indispensable wherever the remainder must be manipulated algebraically rather than merely bounded. One such place is the foundation of differential geometry, where this exact integral remainder is the engine that identifies tangent vectors with directional-derivative operators.

The mechanism is identical to the one used in the proof of the Higher-Dimensional MVT: we restrict \(f\) to the segment from \(a\) to \(x\) by composing with the curve \(t \mapsto a + t(x-a)\), reducing a multivariable problem to a single-variable one, and then invoke the single-variable theory. What changes is only the bookkeeping for higher-order derivatives, which we organize with multi-index notation.

Multi-index notation

Writing higher-order partial derivatives and the matching monomials explicitly becomes unwieldy past first order. A multi-index packages the bookkeeping. We use the ordered convention throughout (each slot independently ranges over \(1,\dots,n\)), which is what makes the compressed sums below coincide with fully written-out sums with no extra combinatorial factors.

Definition: Multi-index (ordered convention)

Fix \(n \geq 1\). A multi-index of length \(m\) is an ordered tuple \(I = (i_1, \dots, i_m)\) with each \(i_j \in \{1, \dots, n\}\); we write \(|I| = m\) for its order. For a point \(x = (x^1,\dots,x^n) \in \mathbb{R}^n\) and a base point \(a = (a^1,\dots,a^n) \in \mathbb{R}^n\), the associated scalar monomial is \[ (x - a)^I \;:=\; (x^{i_1} - a^{i_1})\,(x^{i_2} - a^{i_2}) \cdots (x^{i_m} - a^{i_m}) \;\in\; \mathbb{R}, \] and for an \(m\)-times differentiable scalar function \(f\) the associated partial derivative is \[ \partial_I f \;:=\; \frac{\partial^m f}{\partial x^{i_1} \partial x^{i_2} \cdots \partial x^{i_m}} \;\in\; \mathbb{R}. \] Both \((x-a)^I\) and \(\partial_I f\) are scalars, even though \(x\) and \(a\) are vectors. Because the convention is ordered, summing a quantity over all multi-indices of a fixed order \(|I| = m\) is the same as summing independently over each slot: \[ \sum_{I : |I| = m} \;=\; \sum_{i_1 = 1}^{n} \sum_{i_2 = 1}^{n} \cdots \sum_{i_m = 1}^{n}. \]
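The ordered convention maps directly onto a Cartesian product of index ranges. The Python sketch below enumerates all multi-indices of order \(m = 2\) for \(n = 3\) with `itertools.product` and evaluates the monomials \((x-a)^I\). The sample points are arbitrary assumptions, and the code uses 0-based slots where the text uses \(1, \dots, n\).

```python
import itertools
import math

n, m = 3, 2
x = (1.0, 2.0, 4.0)
a = (0.5, 0.0, 1.0)
disp = [xi - ai for xi, ai in zip(x, a)]  # components of x - a

# every ordered tuple (i_1, ..., i_m), each slot ranging over n values
multi_indices = list(itertools.product(range(n), repeat=m))

def monomial(I):
    # (x - a)^I: product of the displacement components named by I
    return math.prod(disp[i] for i in I)

total = sum(monomial(I) for I in multi_indices)
print(len(multi_indices), total, sum(disp) ** m)
```

There are \(n^m\) ordered multi-indices, and because the sum factors slot by slot, the total of all monomials equals \(\bigl(\sum_i (x^i - a^i)\bigr)^m\).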

Before stating the general theorem, it is worth writing out the lowest orders explicitly, since these are the cases that arise most often. At first order (\(k = 1\)), the expansion reads \[ f(x) \;=\; f(a) \;+\; \sum_{i=1}^{n} \frac{\partial f}{\partial x^i}(a)\,(x^i - a^i) \;+\; \underbrace{\sum_{i,j=1}^{n} (x^i - a^i)(x^j - a^j) \int_0^1 (1 - t)\, \frac{\partial^2 f}{\partial x^i \partial x^j}\bigl(a + t(x-a)\bigr)\,dt}_{\text{remainder } R_1(x)} . \] Every term here is a scalar: \(f(a)\) and each \(\partial f/\partial x^i(a)\) are scalars, \((x^i - a^i)\) is a scalar component of the displacement vector \(x - a\), and the integral is a scalar (the integrand is evaluated at the vector \(a + t(x-a)\) but returns a real number). The linear term \(\sum_i \partial_i f(a)(x^i - a^i)\) is exactly the directional contribution \(\nabla f(a) \cdot (x - a)\) seen in the Higher-Dimensional MVT, now accompanied by an explicit second-order remainder rather than hidden inside a mean value point.
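The first-order expansion with its integral remainder can be verified numerically. The Python sketch below assumes the illustrative function \(f(x, y) = e^x \sin y\) with hand-coded first and second partial derivatives, and approximates each remainder integral with a composite Simpson rule. The function names and sample points are hypothetical.

```python
import math

def f(p):
    return math.exp(p[0]) * math.sin(p[1])

def grad(p):
    ex = math.exp(p[0])
    return (ex * math.sin(p[1]), ex * math.cos(p[1]))

def hess(p):
    # matrix of second partial derivatives of e^x sin y
    ex = math.exp(p[0])
    s, c = math.sin(p[1]), math.cos(p[1])
    return ((ex * s, ex * c), (ex * c, -ex * s))

def simpson(func, lo, hi, n=200):
    # composite Simpson rule; n must be even
    h = (hi - lo) / n
    total = func(lo) + func(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * func(lo + k * h)
    return total * h / 3

a = (0.0, 0.5)
x = (0.3, 0.9)
d = (x[0] - a[0], x[1] - a[1])

def along(t):
    return (a[0] + t * d[0], a[1] + t * d[1])

linear = f(a) + sum(grad(a)[i] * d[i] for i in range(2))
remainder = sum(
    d[i] * d[j]
    * simpson(lambda t, i=i, j=j: (1 - t) * hess(along(t))[i][j], 0.0, 1.0)
    for i in range(2)
    for j in range(2)
)
print(f(x), linear + remainder)
```

Up to quadrature error the two printed numbers agree, reflecting that the integral remainder is an exact identity rather than an estimate.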

At second order (\(k = 2\)), the new term is the quadratic \[ \frac{1}{2} \sum_{i,j=1}^{n} \frac{\partial^2 f}{\partial x^i \partial x^j}(a)\,(x^i - a^i)(x^j - a^j), \] which is precisely one-half the quadratic form of the Hessian — the symmetric matrix of second partial derivatives \(\partial^2 f / \partial x^i \partial x^j(a)\) — evaluated on the displacement vector \(x - a\). This quadratic form is itself a scalar. It is this \(k = 2\) expansion that underlies second-order optimization and Laplace-type approximations.
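To see the quadratic model at work, the Python sketch below again assumes \(f(x, y) = e^x \sin y\) (an illustrative choice) and measures the error of the second-order expansion along a fixed direction as the step is halved. Since the leading neglected term is cubic, each halving should reduce the error by a factor close to \(8\).

```python
import math

def f(p):
    return math.exp(p[0]) * math.sin(p[1])

def grad(p):
    ex = math.exp(p[0])
    return (ex * math.sin(p[1]), ex * math.cos(p[1]))

def hess(p):
    ex = math.exp(p[0])
    s, c = math.sin(p[1]), math.cos(p[1])
    return ((ex * s, ex * c), (ex * c, -ex * s))

def quadratic_model(a, d):
    # f(a) + grad f(a) . d + (1/2) d^T H(a) d
    g, H = grad(a), hess(a)
    first = sum(g[i] * d[i] for i in range(2))
    second = sum(H[i][j] * d[i] * d[j] for i in range(2) for j in range(2))
    return f(a) + first + 0.5 * second

a = (0.0, 0.5)
direction = (0.3, 0.4)

errors = []
for s in (0.1, 0.05, 0.025):
    d = (s * direction[0], s * direction[1])
    x = (a[0] + d[0], a[1] + d[1])
    errors.append(abs(f(x) - quadratic_model(a, d)))

ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
print(errors, ratios)
```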

The general statement compresses these patterns using the multi-index notation above.

Multivariate Taylor's Theorem (integral remainder)

Let \(X \subseteq \mathbb{R}^n\) be open and let \(a \in X\) be fixed. Suppose \(f \in C^{k+1}(X)\) for some integer \(k \geq 0\). If \(W \subseteq X\) is any convex set containing \(a\), then for all \(x \in W\), \[ f(x) \;=\; P_k(x) + R_k(x), \] where the \(k\)th-order Taylor polynomial \(P_k\) is the scalar-valued function \[ P_k(x) \;=\; f(a) + \sum_{m=1}^{k} \frac{1}{m!} \sum_{I : |I| = m} \partial_I f(a)\,(x - a)^I, \] and the \(k\)th remainder term \(R_k\) is \[ R_k(x) \;=\; \frac{1}{k!} \sum_{I : |I| = k+1} (x - a)^I \int_0^1 (1 - t)^k\, \partial_I f\bigl(a + t(x-a)\bigr)\,dt . \]

Proof.

Fix \(x \in W\). Since \(W\) is convex and contains both \(a\) and \(x\), the segment \(\{a + t(x-a) : t \in [0,1]\}\) lies in \(W \subseteq X\), so the scalar function \[ u(t) \;:=\; f\bigl(a + t(x-a)\bigr), \qquad t \in [0,1], \] is well defined and of class \(C^{k+1}\). We prove the claim by induction on \(k\), tracking derivatives of \(u\) via the chain rule.

Base case \(k = 0\). Here \(P_0(x) = f(a)\). By the chain rule, \(u'(t) = \sum_{i=1}^n \partial_i f(a + t(x-a))\,(x^i - a^i) = \sum_{I : |I| = 1} (x-a)^I\, \partial_I f(a + t(x-a))\). The Fundamental Theorem of Calculus applied to \(u\) on \([0,1]\) gives \[ f(x) - f(a) \;=\; u(1) - u(0) \;=\; \int_0^1 u'(t)\,dt \;=\; \sum_{I : |I| = 1} (x-a)^I \int_0^1 \partial_I f\bigl(a + t(x-a)\bigr)\,dt, \] which is exactly \(f(x) = P_0(x) + R_0(x)\), since \((1-t)^0 = 1\) and \(0! = 1\).

Inductive step. Suppose the formula holds for some \(k\), and assume now that \(f \in C^{k+2}(X)\), as the statement at order \(k + 1\) requires; we promote the formula to \(k + 1\). Each remainder integral has the form \(\int_0^1 (1-t)^k\, g(t)\,dt\) with \(g(t) = \partial_I f(a + t(x-a))\), which is of class \(C^1\) under this assumption. Integrating by parts, using \(\frac{d}{dt}\!\left[-\frac{(1-t)^{k+1}}{k+1}\right] = (1-t)^k\) and again the chain rule for \(g'(t) = \sum_{j=1}^n \partial_j \partial_I f(a + t(x-a))\,(x^j - a^j)\), yields \[ \int_0^1 (1-t)^k\, g(t)\,dt \;=\; \frac{1}{k+1}\, g(0) \;+\; \frac{1}{k+1} \sum_{j=1}^n (x^j - a^j) \int_0^1 (1-t)^{k+1}\, \partial_j \partial_I f\bigl(a + t(x-a)\bigr)\,dt . \] The boundary term contributes \(\frac{1}{k+1}\,\partial_I f(a)\), which (multiplied by \((x-a)^I\) and summed with the prefactor \(\frac{1}{k!}\) over \(|I| = k+1\)) supplies exactly the order-\((k+1)\) term of \(P_{k+1}\), since \(\frac{1}{k!}\cdot\frac{1}{k+1} = \frac{1}{(k+1)!}\). The surviving integral, with one extra differentiation slot \(j\) prepended, reassembles into the sum over multi-indices of order \(k + 2\), giving \(R_{k+1}\) with weight \((1-t)^{k+1}\) and prefactor \(\frac{1}{(k+1)!}\). This is precisely the asserted formula with \(k\) replaced by \(k + 1\), completing the induction. \(\qquad \blacksquare\)

When only a bound on the error is needed — rather than the exact integral — the integral remainder collapses to a clean estimate.

Taylor Error Bound

Let \(X \subseteq \mathbb{R}^n\) be open, \(a \in X\), and \(f \in C^{k+1}(X)\) for some \(k \geq 0\). If \(W \subseteq X\) is convex, contains \(a\), and every partial derivative of \(f\) of order \(k + 1\) is bounded in absolute value by a constant \(M\) on \(W\), then for all \(x \in W\), \[ \bigl| f(x) - P_k(x) \bigr| \;\leq\; \frac{n^{k+1}\,M}{(k+1)!}\,\lVert x - a\rVert^{\,k+1}, \] where \(P_k\) is the \(k\)th Taylor polynomial of \(f\) at \(a\) and \(\lVert\cdot\rVert\) is the Euclidean norm.

Proof.

The remainder \(R_k(x) = f(x) - P_k(x)\) is a sum over the multi-indices \(I\) with \(|I| = k+1\); by the ordered convention there are \(n^{k+1}\) such terms. For each, the scalar factor satisfies \(\lvert (x-a)^I \rvert \leq \lVert x - a\rVert^{\,k+1}\) (each of the \(k+1\) scalar factors \(\lvert x^{i_j} - a^{i_j}\rvert\) is at most the Euclidean norm \(\lVert x - a\rVert\)), the integrand obeys \(\lvert \partial_I f(a + t(x-a))\rvert \leq M\), and \(\int_0^1 (1-t)^k\,dt = \tfrac{1}{k+1}\). Combining with the prefactor \(\tfrac{1}{k!}\) and \(\tfrac{1}{k!}\cdot\tfrac{1}{k+1} = \tfrac{1}{(k+1)!}\), the triangle inequality over the \(n^{k+1}\) terms gives the stated bound. \(\qquad \blacksquare\)
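As a sanity check of the bound, the Python sketch below assumes \(f(x, y) = e^x \sin y\), \(a = (0, 0)\), \(k = 1\), and the convex set \(W = [-1, 1]^2\), all illustrative choices. Every partial derivative of this \(f\) is \(e^x\) times a sine or cosine of \(y\) up to sign, so \(M = e\) works on \(W\). The code compares the actual first-order error with the bound on a grid.

```python
import math

n, k = 2, 1
M = math.e  # bounds every second partial of e^x sin y on [-1, 1]^2

def f(x, y):
    return math.exp(x) * math.sin(y)

def P1(x, y):
    # first-order Taylor polynomial at a = (0, 0):
    # f(a) = 0, df/dx(a) = 0, df/dy(a) = 1
    return y

worst_ratio = 0.0
steps = 20
for i in range(steps + 1):
    for j in range(steps + 1):
        x = -1.0 + 2.0 * i / steps
        y = -1.0 + 2.0 * j / steps
        dist = math.hypot(x, y)
        if dist == 0.0:
            continue  # both sides vanish at the base point
        error = abs(f(x, y) - P1(x, y))
        bound = n ** (k + 1) * M / math.factorial(k + 1) * dist ** (k + 1)
        worst_ratio = max(worst_ratio, error / bound)

print("largest error / bound on the grid:", worst_ratio)
```

The ratio stays well below \(1\), which is expected: the factor \(n^{k+1}\) comes from a crude term count and makes the bound conservative.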

Why the integral remainder matters

The Lagrange form earlier in this page is ideal for bounding error, and the Taylor error bound above recovers exactly that role in the multivariate setting. But the integral form of the remainder does something the Lagrange form cannot: it exhibits the remainder as an explicit sum in which each term carries a factor \((x^i - a^i)\) that vanishes at \(x = a\) multiplied by a smooth coefficient. This algebraic structure — a remainder built from products that die at the base point — is exactly what is needed when one must apply a linear operator to the remainder and have it annihilate the leftover terms. That is the decisive step in the smooth-manifold theory of tangent vectors, where the first-order (\(k = 1\)) case of this theorem identifies abstract tangent vectors with classical directional derivatives.