Derivative of the Frobenius Norm
In machine learning, we frequently map high-dimensional weight matrices to a single scalar "cost" or "penalty" term.
The Frobenius norm provides a natural way to measure
the total magnitude of a matrix \(X \in \mathbb{R}^{m \times n}\), acting as the matrix-space equivalent of the Euclidean
\(L_2\) norm.
It is defined through the trace operator, which we will see is the fundamental tool for extracting
scalar information from matrix products:
\[
f(X) = \| X \|_F = \sqrt{\text{tr}(X^TX)} = \sqrt{\sum_{i,j} X_{ij}^2}
\]
Derivatives implicitly rely on norms to measure the magnitude of changes in both the input \(dx\)
and the output \(df\), ensuring a consistent comparison of their scales.
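As a quick numerical illustration of the two equivalent forms of the definition (a minimal sketch; the matrix entries are made up):

```python
import math

# A small 2x3 example matrix (hypothetical values)
X = [[1.0, -2.0, 3.0],
     [0.5,  4.0, -1.0]]

# Entry-wise form: sqrt of the sum of squared entries
entry_form = math.sqrt(sum(v * v for row in X for v in row))

# Trace form: the j-th diagonal entry of X^T X is the squared norm of column j
diag_XtX = [sum(row[j] ** 2 for row in X) for j in range(len(X[0]))]
trace_form = math.sqrt(sum(diag_XtX))

print(abs(entry_form - trace_form) < 1e-12)  # True: both definitions agree
```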
Derivative of the Frobenius Norm
Let's take the differential \(df\) of
\[
f(X) = \| X \|_F = \sqrt{\text{tr}(X^TX)} = \sqrt{\sum_{i,j} X_{ij}^2}.
\]
First, by the chain rule:
\[
df = \frac{1}{2\sqrt{\text{tr }(X^TX)}}d[\text{tr }(X^TX)]
\]
Note: for any matrix \(A\),
\[
\begin{align*}
d(\text{tr }(A)) &= \text{tr }(A + dA) - \text{tr }(A) \\\\
&= \text{tr }(A) + \text{tr }(dA) - \text{tr }(A) \\\\
&= \text{tr }(dA)
\end{align*}
\]
Thus:
\[
df = \frac{1}{2\sqrt{\text{tr }(X^TX)}}\text{tr }[d(X^TX)]
\]
By the product rule:
\[
\begin{align*}
df &= \frac{1}{2\sqrt{\text{tr }(X^TX)}}\text{tr }[dX^TX + X^TdX]\\\\
&= \frac{1}{2\sqrt{\text{tr }(X^TX)}}\left[\text{tr }(dX^TX) + \text{tr }(X^TdX)\right]
\end{align*}
\]
Since
\[
\text{tr }(dX^TX) = \text{tr }((dX^TX)^T) = \text{tr }(X^TdX),
\]
we obtain
\[
\begin{align*}
df &= \frac{1}{2\sqrt{\text{tr }(X^TX)}}2\text{tr }(X^TdX)\\\\
&= \frac{1}{\sqrt{\text{tr }(X^TX)}}\text{tr }(X^TdX)
\end{align*}
\]
Here, \(\text{tr }(X^TdX)\) represents the Frobenius inner product of \(X\) and \(dX\). Then:
\[
df = \left\langle \frac{X}{\sqrt{\text{tr }(X^TX)}}, dX \right\rangle_F \tag{1}
\]
Therefore,
\[
\nabla f = \frac{X}{ \| X \|_F}.
\]
Note: The expression in (1) is equivalent to
\[
df = \text{tr }((\nabla f)^TdX) \tag{2}
\]
The trace operator satisfies linearity and the cyclic property, making it a convenient way to express derivatives
in terms of gradients.
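The closed form \(\nabla f = X / \|X\|_F\) can be checked against PyTorch's autograd (a sketch assuming PyTorch is available; the matrix values are arbitrary):

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 3, dtype=torch.float64, requires_grad=True)

# f(X) = ||X||_F, written via the trace definition
f = torch.sqrt(torch.trace(X.T @ X))
f.backward()

# Analytical gradient: X / ||X||_F
with torch.no_grad():
    analytic = X / torch.linalg.norm(X, ord="fro")

print(torch.allclose(X.grad, analytic))  # True
```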
Applying the Cyclic Property
Consider \(f(A) = x^TAy\), where \(A\) is an \(m \times n\) matrix, \(x \in \mathbb{R}^m\), and
\(y \in \mathbb{R}^n\).
Since \(x\) and \(y\) are constant, the product rule gives
\[
df = x^TdAy
\]
Since \(df\) is a scalar, taking the trace does not change its value:
\[
df = \text{tr }(x^TdAy)
\]
By the cyclic property of the trace:
\[
df = \text{tr }(yx^TdA)
\]
Therefore, comparing this with \(df = \text{tr }((\nabla f)^TdA)\),
\[
\nabla f = (yx^T)^T = xy^T
\]
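This result is easy to verify numerically (a sketch with random vectors, assuming PyTorch):

```python
import torch

torch.manual_seed(0)
m, n = 3, 4
x = torch.randn(m, dtype=torch.float64)
y = torch.randn(n, dtype=torch.float64)
A = torch.randn(m, n, dtype=torch.float64, requires_grad=True)

# f(A) = x^T A y
f = x @ A @ y
f.backward()

print(torch.allclose(A.grad, torch.outer(x, y)))  # True: grad is x y^T
```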
Derivative of the Determinant
The determinant \(\det(X)\) represents the oriented volume scaling
factor of a linear transformation. Its derivative is crucial for Normalizing Flows and any optimization
involving Covariance Matrices.
The derivative of the determinant of a square matrix \(A \in \mathbb{R}^{n \times n}\)
can be expressed in several equivalent forms. Recall that
\[
\text{adj }(A) = \text{cofactor }(A)^T = (\det A)A^{-1}.
\]
This implies:
\[
\text{cofactor }(A) = \text{adj }(A)^T = (\det A)(A^{-1})^T.
\]
By the cofactor expansion of the determinant along the \(i\)-th row of \(A\):
\[
\det (A) = A_{i1}C_{i1} + A_{i2}C_{i2} + \cdots + A_{in}C_{in}
\]
Since the cofactors \(C_{i1}, \ldots, C_{in}\) do not depend on the entry \(A_{ij}\), we have \(\frac{\partial \det A}{\partial A_{ij}} = C_{ij}\), and then:
\[
\nabla (\det A) = \text{cofactor } (A)
\]
Equivalently, using the expression (2):
\[
\begin{align*}
d (\det A) &= \text{tr }(\text{cofactor }(A)^T dA) \\\\
&= \text{tr }(\text{adj }(A) dA) \\\\
&= \text{tr }((\det A)A^{-1} dA) \tag{3}
\end{align*}
\]
Therefore,
\[
\begin{align*}
\nabla (\det A) &= \text{cofactor } (A) \\\\
&= \text{adj }(A)^T \\\\
&= (\det A)(A^{-1})^T
\end{align*}
\]
Since \(\det A\) is a scalar, the expression (3) can be written as:
\[
d(\det A) = (\det A)\text{tr }(A^{-1} dA) \tag{4}
\]
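Both forms of the gradient can be cross-checked with autograd (a sketch on a random matrix, assuming PyTorch; a random Gaussian matrix is almost surely invertible):

```python
import torch

torch.manual_seed(0)
A = torch.randn(3, 3, dtype=torch.float64, requires_grad=True)

d = torch.det(A)
d.backward()

# Analytical gradient: (det A) (A^{-1})^T
with torch.no_grad():
    analytic = torch.det(A) * torch.inverse(A).T

print(torch.allclose(A.grad, analytic))  # True
```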
Characteristic Polynomial
Consider the scalar function \(p(x) = \det(xI - A)\), which is the characteristic polynomial of \(A\).
In practice, when approximating eigenvalues using numerical methods, we often need to compute the derivative
of \(p(x)\) at different values of \(x\).
Applying the expression (4), we have:
\[
\begin{align*}
&d(\det (xI-A)) \\\\
&= (\det (xI-A)) \text{tr } ((xI-A)^{-1} d(xI -A)).
\end{align*}
\]
Since \(A\) is constant, \(d(xI - A) = (dx)I\), and because \(dx\) is a scalar it can be pulled out of the trace. Thus,
\[
\begin{align*}
&d(\det (xI-A)) \\\\
&= (\det (xI-A)) \text{tr } ((xI-A)^{-1})dx
\end{align*}
\]
Note: While the analytical approach provides a useful formula for the differential of many functions, in practice
reverse-mode automatic differentiation offers a more stable and
efficient way to compute gradients. Auto-diff computes the derivative with respect to matrix parameters directly,
without explicitly forming quantities like \(A^{-1}\), which can introduce numerical instability in certain cases (e.g., when \(\det(xI - A) \approx 0\)).
Here's an example using auto-diff in PyTorch vs using the analytical formula to compute the derivative of \(p(x)\):
import torch

# Random square matrix
def generate_matrix(n):
    return torch.randn(n, n, dtype=torch.float64, requires_grad=False)

# Characteristic polynomial p(x) = det(xI - A)
def p(x, A):
    return torch.det(x * torch.eye(A.shape[0], dtype=A.dtype, device=A.device) - A)

# Derivative of p(x) using auto-differentiation
def dp_torch(x, A):
    x_tensor = torch.tensor([x], requires_grad=True, dtype=A.dtype, device=A.device)
    grad = torch.autograd.grad(p(x_tensor, A), x_tensor)[0]
    return grad.item()

# Analytical formula: d(p(x)) = (det(xI - A)) * tr((xI - A)^{-1}) dx
def dp(x, A):
    return (
        p(x, A).item() *
        torch.trace(
            torch.inverse(x * torch.eye(A.shape[0], dtype=A.dtype, device=A.device) - A)
        ).item()
    )
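As a sanity check, the analytical formula (4) can also be compared against a central finite difference (a self-contained sketch; the evaluation point \(x_0 = 2.0\) and the matrix size are arbitrary choices):

```python
import torch

torch.manual_seed(0)
n = 4
A = torch.randn(n, n, dtype=torch.float64)
I = torch.eye(n, dtype=torch.float64)

def char_poly(x):
    # p(x) = det(xI - A)
    return torch.det(x * I - A)

x0 = 2.0
M = x0 * I - A

# Analytical: p'(x) = det(xI - A) * tr((xI - A)^{-1})
analytic = (torch.det(M) * torch.trace(torch.inverse(M))).item()

# Central finite difference approximation of p'(x0)
h = 1e-6
numeric = ((char_poly(x0 + h) - char_poly(x0 - h)) / (2 * h)).item()

print(abs(analytic - numeric) <= 1e-5 * max(1.0, abs(analytic)))
```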