Scalar Functions of a Matrix


Derivative of the Frobenius Norm

In machine learning, we frequently map high-dimensional weight matrices to a single scalar "cost" or "penalty" term. The Frobenius norm provides the most natural way to measure the total magnitude of a matrix \(X \in \mathbb{R}^{m \times n}\), acting as the matrix-space equivalent of the Euclidean \(L_2\) norm.

It is defined through the trace operator, which we will see is the fundamental tool for extracting scalar information from matrix products: \[ f(X) = \| X \|_F = \sqrt{\text{tr}(X^TX)} = \sqrt{\sum_{i,j} X_{ij}^2} \]
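As a quick numerical check (a small sketch using PyTorch, with an arbitrary random matrix), the trace form, the elementwise form, and the built-in norm all give the same value:

```python
import torch

# An arbitrary example matrix (values are random, for illustration only)
X = torch.randn(3, 4, dtype=torch.float64)

# Three equivalent ways to compute the Frobenius norm
via_trace = torch.sqrt(torch.trace(X.T @ X))      # sqrt(tr(X^T X))
via_sum = torch.sqrt((X ** 2).sum())              # sqrt(sum of squared entries)
via_builtin = torch.linalg.norm(X, ord="fro")     # built-in Frobenius norm

assert torch.allclose(via_trace, via_sum)
assert torch.allclose(via_trace, via_builtin)
```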

Derivatives implicitly rely on norms to measure the magnitude of changes in both the input \(dX\) and the output \(df\), ensuring a consistent comparison of their scales.


Let's take the differential \(df\) of \[ f(X) = \| X \|_F = \sqrt{\text{tr}(X^TX)} = \sqrt{\sum_{i,j} X_{ij}^2}. \] First, by the chain rule: \[ df = \frac{1}{2\sqrt{\text{tr }(X^TX)}}d[\text{tr }(X^TX)] \] Note: for any matrix \(A\), \[ \begin{align*} d(\text{tr }(A)) &= \text{tr }(A + dA) - \text{tr }(A) \\\\ &= \text{tr }(A) + \text{tr }(dA) - \text{tr }(A) \\\\ &= \text{tr }(dA) \end{align*} \] Thus: \[ df = \frac{1}{2\sqrt{\text{tr }(X^TX)}}\text{tr }[d(X^TX)] \] By the product rule: \[ \begin{align*} df &= \frac{1}{2\sqrt{\text{tr }(X^TX)}}\text{tr }[dX^TX + X^TdX]\\\\ &= \frac{1}{2\sqrt{\text{tr }(X^TX)}}[\text{tr }(dX^TX) + \text{tr }(X^TdX)] \end{align*} \] Since \[ \text{tr }(dX^TX) = \text{tr }((dX^TX)^T) = \text{tr }(X^TdX), \] we obtain \[ \begin{align*} df &= \frac{1}{2\sqrt{\text{tr }(X^TX)}}2\text{tr }(X^TdX)\\\\ &= \frac{1}{\sqrt{\text{tr }(X^TX)}}\text{tr }(X^TdX) \end{align*} \] Here, \(\text{tr }(X^TdX)\) is the Frobenius inner product of \(X\) and \(dX\). Then: \[ df = \left\langle \frac{X}{\sqrt{\text{tr }(X^TX)}}, dX \right\rangle_F \tag{1} \] Therefore, \[ \nabla f = \frac{X}{ \| X \|_F}. \]

Note: The expression in (1) is equivalent to \[ df = \text{tr }((\nabla f)^TdX) \tag{2} \]
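The gradient formula \(\nabla f = X / \| X \|_F\) can be verified against PyTorch's autograd (a minimal sketch; the matrix shape is arbitrary):

```python
import torch

# Arbitrary matrix with gradient tracking enabled
X = torch.randn(3, 4, dtype=torch.float64, requires_grad=True)

# f(X) = ‖X‖_F, written via the trace as in the derivation
f = torch.sqrt(torch.trace(X.T @ X))
f.backward()

# Analytical gradient: ∇f = X / ‖X‖_F
analytic = X.detach() / torch.linalg.norm(X.detach(), ord="fro")
assert torch.allclose(X.grad, analytic)
```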

The trace operator is linear and satisfies the cyclic property \(\text{tr}(AB) = \text{tr}(BA)\), making it a convenient tool for rearranging differentials into the form (2) and reading off gradients.

Applying the Cyclic Property

Consider \(f(A) = x^TAy\), where \(A\) is an \(m \times n\) matrix, \(x \in \mathbb{R}^m\), and \(y \in \mathbb{R}^n\).

By the product rule (noting that \(x\) and \(y\) are constant), \[ df = x^TdAy \] Since \(df\) is a scalar, taking the trace does not change its value: \[ df = \text{tr }(x^TdAy) \] By the cyclic property of the trace: \[ df = \text{tr }(yx^TdA) \] Therefore, comparing this with \(df = \text{tr }((\nabla f)^TdA)\), \[ \nabla f = (yx^T)^T = xy^T \]
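A quick autograd check of this result (the shapes are arbitrary):

```python
import torch

m, n = 4, 3
x = torch.randn(m, dtype=torch.float64)
y = torch.randn(n, dtype=torch.float64)
A = torch.randn(m, n, dtype=torch.float64, requires_grad=True)

# Scalar function f(A) = x^T A y
f = x @ A @ y
f.backward()

# Analytical gradient: ∇f = x y^T
assert torch.allclose(A.grad, torch.outer(x, y))
```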

Derivative of the Determinant

The determinant \(\det(X)\) represents the oriented volume scaling factor of a linear transformation. Its derivative is crucial for normalizing flows and any optimization involving covariance matrices.

The derivative of the determinant of a square matrix \(A \in \mathbb{R}^{n \times n}\) can be expressed using several equivalent forms. Recall that \[ \text{adj }(A) = \text{cofactor }(A)^T = (\det A)A^{-1}. \] This implies: \[ \text{cofactor }(A) = \text{adj }(A)^T = (\det A)(A^{-1})^T. \] By the cofactor expansion of the determinant based on the \(i\)-th row of \(A\): \[ \det (A) = A_{i1}C_{i1} + A_{i2}C_{i2} + \cdots + A_{in}C_{in} \] Thus, \(\frac{\partial \det A}{\partial A_{ij}} = C_{ij} \) and then: \[ \nabla (\det A) = \text{cofactor } (A) \]

Equivalently, using the expression (2): \[ \begin{align*} d (\det A) &= \text{tr }(\text{cofactor }(A)^T dA) \\\\ &= \text{tr }(\text{adj }(A) dA) \\\\ &= \text{tr }((\det A)A^{-1} dA) \tag{3} \end{align*} \] Therefore, \[ \begin{align*} \nabla (\det A) &= \text{cofactor } (A) \\\\ &= \text{adj }(A)^T \\\\ &= (\det A)(A^{-1})^T \end{align*} \] Since \(\det A\) is a scalar, the expression (3) can be written as: \[ d(\det A) = (\det A)\text{tr }(A^{-1} dA) \tag{4} \]
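The identity \(\nabla (\det A) = (\det A)(A^{-1})^T\) can likewise be checked with autograd (a sketch with an arbitrary random matrix, which is invertible almost surely):

```python
import torch

A = torch.randn(4, 4, dtype=torch.float64, requires_grad=True)

# d = det(A), a scalar function of the matrix entries
d = torch.det(A)
d.backward()

# Analytical gradient: (det A) * (A^{-1})^T
analytic = d.detach() * torch.inverse(A.detach()).T
assert torch.allclose(A.grad, analytic)
```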

Characteristic polynomial

Consider the scalar function \(p(x) = \det(xI - A)\), which is the characteristic polynomial of \(A\). In practice, when approximating eigenvalues using numerical methods, we often need to compute the derivative of \(p(x)\) at different values of \(x\).

Applying the expression (4), we have: \[ \begin{align*} &d(\det (xI-A)) \\\\ &= (\det (xI-A)) \text{tr } ((xI-A)^{-1} d(xI -A)). \end{align*} \] Since \(A\) is constant, \(d(xI -A) = dxI\), and also \(dx\) is a scalar. Thus, \[ \begin{align*} &d(\det (xI-A)) \\\\ &= (\det (xI-A)) \text{tr } ((xI-A)^{-1})dx \end{align*} \]

Note: While the analytical approach provides a useful formula for the differential of many functions, in practice, reverse-mode automatic differentiation offers a more stable and efficient way to compute gradients. Auto-diff allows us to compute the derivative with respect to matrix parameters directly, without explicitly computing quantities like \(A^{-1}\), which can introduce numerical instability in certain cases (e.g., \(\det(xI - A) \approx 0\)).

Here's an example comparing auto-diff in PyTorch with the analytical formula for computing the derivative of \(p(x)\):

    import torch

    # Random square matrix
    def generate_matrix(n):
        return torch.randn(n, n, dtype=torch.float64, requires_grad=False)

    # Characteristic polynomial p(x) = det(xI - A)
    def p(x, A):
        return torch.det(x * torch.eye(A.shape[0], dtype=A.dtype, device=A.device) - A)

    # Derivative of p(x) using auto-differentiation
    def dp_torch(x, A):
        x_tensor = torch.tensor([x], requires_grad=True, dtype=A.dtype, device=A.device)
        grad = torch.autograd.grad(p(x_tensor, A), x_tensor)[0]
        return grad.item()

    # Analytical formula: d(p(x)) = det(xI - A) * tr((xI - A)^{-1}) dx
    def dp(x, A):
        return (
            p(x, A).item() *
            torch.trace(
                torch.inverse(x * torch.eye(A.shape[0], dtype=A.dtype, device=A.device) - A)
            ).item()
        )