Example: numerical validation of \(d(X^2) = X\,(dX) + (dX)\,X\)
In the previous page we derived the differential of matrix-valued matrix functions symbolically, producing formulas such as \[ d(X^3) \;=\; (dX)X^2 + X(dX)X + X^2(dX), \qquad d(X^{-1}) \;=\; -X^{-1}(dX)X^{-1}. \] Symbolic derivations like these are short but error-prone — miss a term, flip a factor, and the whole downstream optimization is silently wrong. A basic hygiene step of computational mathematics is therefore to validate each derivation numerically before trusting it. In this page we work through the simpler case \(f(X) = X^2\), for which the differential is \[ df \;=\; f'(X)[dX] \;=\; X(dX) + (dX)X, \] and we verify it against finite differences.
Finite differences
For a differentiable function \(f\), the differential \(df\) at \(X\) is the linear part of the increment — equivalently, the quantity that satisfies \[ f(X + dX) - f(X) \;=\; df + o(\|dX\|). \] Two standard finite-difference approximations of \(df\) are \[ \text{Forward difference:} \quad f(X + dX) - f(X), \qquad \text{Backward difference:} \quad f(X) - f(X - dX). \] Each equals \(df\) up to an \(o(\|dX\|)\) term, so in the limit \(dX \to 0\) both agree with the analytical formula.
When comparing a computed approximation against a reference value, we measure quality by relative error in a norm (we use the Frobenius norm throughout): \[ \text{relative error} \;=\; \frac{\|\,\text{true} - \text{approx}\,\|_F}{\|\,\text{true}\,\|_F}. \] A small absolute difference does not necessarily mean a small error — scale matters. Relative error normalizes by the size of the quantity being approximated.
Choosing the perturbation scale \(\|dX\|\)
Taking \(dX\) ever smaller does not make a finite-difference approximation ever more accurate. Two error sources compete:
- Truncation error from the \(o(\|dX\|)\) remainder. For \(C^2\) functions — which is the regime where finite-difference validation is used in practice, and which covers \(f(X) = X^2\) here (the remainder is exactly \((dX)^2\)) — the remainder is \(O(\|dX\|^2)\), and its relative size (against \(\|df\| = \Theta(\|dX\|)\), assuming \(df \neq 0\)) is \(O(\|dX\|)\).
- Rounding error from subtracting two nearly equal floating-point numbers in \(f(X + dX) - f(X)\); its relative size is \(O(\varepsilon_{\text{mach}} / \|dX\|)\), where \(\varepsilon_{\text{mach}} \approx 2.2\times 10^{-16}\) for IEEE 754 double precision.
Balancing the two gives optimal \(\|dX\| \approx \sqrt{\varepsilon_{\text{mach}}} \approx 1.5 \times 10^{-8}\) for one-sided differences, with corresponding best achievable relative error \(\approx \sqrt{\varepsilon_{\text{mach}}}\). This is why we set the perturbation scale to \(10^{-8}\) in the code below: any smaller and rounding dominates; any larger and the quadratic truncation term does.
Good practice for test inputs
Draw test matrices from a continuous distribution (here, np.random.randn) rather than
hand-picking "nice" integer-valued matrices. Integer matrices can accidentally satisfy identities
(commuting \(X\) and \(dX\), exact zero entries, singular matrices) that mask bugs in the
general-position code path. Random non-integer values exercise the full logic.
Implementation and run
import numpy as np
# Random test matrix X in R^{2x2}
X = np.random.randn(2, 2)
# Perturbation dX at the sqrt(eps_machine) scale
dX = np.random.randn(2, 2) * 1e-8
# f(X) = X^2
f_X = X @ X
# Forward difference: f(X + dX) - f(X)
f_approx = (X + dX) @ (X + dX) - f_X
# Backward difference: f(X) - f(X - dX)
b_approx = f_X - (X - dX) @ (X - dX)
# Analytical differential: df = X dX + dX X
exact = X @ dX + dX @ X
# Relative errors in Frobenius norm
f_rel_err = np.linalg.norm(f_approx - exact, 'fro') / np.linalg.norm(exact, 'fro')
b_rel_err = np.linalg.norm(b_approx - exact, 'fro') / np.linalg.norm(exact, 'fro')
print(f"Input matrix X:\n{X}\n")
print(f"Perturbation dX:\n{dX}\n")
print(f"Forward difference f(X + dX) - f(X):\n{f_approx}\n")
print(f"Backward difference f(X) - f(X - dX):\n{b_approx}\n")
print(f"Analytical df = X dX + dX X:\n{exact}\n")
print(f"Relative error (forward) : {f_rel_err:.2e}")
print(f"Relative error (backward): {b_rel_err:.2e}")
Typical output shows relative errors around \(10^{-8}\) — consistent with the optimal accuracy predicted by the balance of truncation and rounding errors discussed above. If one of these errors suddenly spiked to \(10^{-2}\) or larger, that would indicate either a bug in the analytical formula or an inappropriate perturbation scale, and the validation would have caught it.
Why numerical validation matters
Modern machine-learning frameworks compute gradients with automatic differentiation, not by hand. But whenever one writes a custom operator — a new layer type, an exotic loss function, a physics-informed constraint — the framework's autodiff only works for the parts built from its registered primitives. Anything custom needs a hand-written derivative, and a hand-written derivative needs finite-difference validation before it goes into training. The same logic applies to gradient checks in scientific-computing libraries and to the testing of custom CUDA kernels. The technique on this page — pick a perturbation at the \(\sqrt{\varepsilon_{\text{mach}}}\) scale, compare against a forward or backward difference, measure relative error — is the workhorse behind all of these checks.