Trace & Norms

Trace Norms Introduction to Metric Spaces

Trace

So far, we have studied vectors, matrices, eigenvalues, orthogonality, and quadratic forms — all within the concrete world of \(\mathbb{R}^n\). This page marks a quiet turning point. On the surface, the trace looks like just another scalar attached to a square matrix: the sum of its diagonal entries. But this simple operation will lead us, within a few paragraphs, to an inner product on the space of \(m \times n\) matrices — a space that is not \(\mathbb{R}^n\). Once "inner product" and "norm" are shown to live on spaces other than \(\mathbb{R}^n\), a larger question becomes unavoidable: what exactly is a norm? what is a distance? The final section of this page opens that door. The rest of the curriculum, beginning with the foundations of analysis in Section II, walks through it.

Despite its simple definition, the trace has deep connections to eigenvalues and matrix norms, and appears throughout optimization and machine learning. Its cyclic permutation property, in particular, makes it indispensable for manipulating matrix expressions.

Definition: Trace

The trace of an \(n \times n\) matrix \(A\) is the sum of the diagonal entries of \(A\) and denoted by \(\operatorname{tr}(A)\): \[ \operatorname{tr}(A) =\sum_{k=1}^n a_{kk}. \]

The trace satisfies several important properties:

Theorem: Properties of Trace

Let \(A, B\) be \(n \times n\) matrices and \(c \in \mathbb{R}\). The trace satisfies:

  1. Linearity: \(\operatorname{tr}(cA) = c \operatorname{tr}(A)\)
  2. Additivity: \(\operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B)\)
  3. Transpose invariance: \(\operatorname{tr}(A^\top) = \operatorname{tr}(A)\)
  4. Frobenius inner product: \(\operatorname{tr}(A^\top B) = \sum_{i=1}^n \sum_{j=1}^n a_{ij} b_{ij}\)
  5. Cyclic permutation: \(\operatorname{tr}(AB) = \operatorname{tr}(BA)\)
  6. Eigenvalue sum: \(\operatorname{tr}(A) = \sum_{i=1}^n \lambda_i\) where \(\lambda_1, \ldots, \lambda_n\) are the eigenvalues of \(A\) (counted with algebraic multiplicity).

Note: Property 5 extends to rectangular matrices — if \(A\) is \(m \times n\) and \(B\) is \(n \times m\), then both \(AB\) and \(BA\) are square and \(\operatorname{tr}(AB) = \operatorname{tr}(BA)\). More generally, \(\operatorname{tr}(ABC) = \operatorname{tr}(CAB) = \operatorname{tr}(BCA)\). Property 4 generalizes to \(A, B \in \mathbb{R}^{m \times n}\) with the double sum running over \(1 \leq i \leq m\) and \(1 \leq j \leq n\); we revisit this in the Frobenius inner product definition below.

Proof (sketch):

Properties 1, 2 follow immediately from the diagonal-sum definition and the linearity of summation. For Property 3, note that \((A^\top)_{ii} = a_{ii}\), so \(\operatorname{tr}(A^\top) = \sum_i (A^\top)_{ii} = \sum_i a_{ii} = \operatorname{tr}(A)\).

For Property 5, compute directly from the definitions of trace and matrix product: \[ \operatorname{tr}(AB) = \sum_{i=1}^n (AB)_{ii} = \sum_{i=1}^n \sum_{j=1}^n a_{ij} b_{ji} = \sum_{j=1}^n \sum_{i=1}^n b_{ji} a_{ij} = \sum_{j=1}^n (BA)_{jj} = \operatorname{tr}(BA), \] where the middle equality interchanges the order of a finite double sum.

Property 4 follows from the same kind of direct computation — it does not require Property 5. By the definitions of trace and matrix product: \[ \operatorname{tr}(A^\top B) = \sum_{i=1}^n (A^\top B)_{ii} = \sum_{i=1}^n \sum_{j=1}^n (A^\top)_{ij} b_{ji} = \sum_{i=1}^n \sum_{j=1}^n a_{ji} b_{ji}. \] Renaming dummy indices (swap \(i \leftrightarrow j\)) gives \(\sum_{i=1}^n \sum_{j=1}^n a_{ij} b_{ij}\), matching the stated form.

Property 6 is deeper and requires the characteristic polynomial. For a matrix \(A\) with eigenvalues \(\lambda_1, \ldots, \lambda_n\) (counted with algebraic multiplicity over \(\mathbb{C}\)), the characteristic polynomial factors as \(\det(\lambda I - A) = \prod_{i=1}^n (\lambda - \lambda_i)\). Expanding both sides and comparing the coefficient of \(\lambda^{n-1}\) yields \(\operatorname{tr}(A) = \sum_i \lambda_i\). A detailed proof is omitted here for space.

Insight: The Trace Trick

Consider a quadratic form \(\mathbf{x}^\top A \mathbf{x} \in \mathbb{R}\). While this expression is a scalar, we often rewrite it using the trace operator to gain structural advantages: \[ \mathbf{x}^\top A \mathbf{x} = \operatorname{tr}(\mathbf{x}^\top A \mathbf{x}) = \operatorname{tr}(A \mathbf{x} \mathbf{x}^\top). \] This "trace trick" is indispensable in Computer Science and Statistics for the following reasons:

  1. Decoupling System and Data: It separates the system matrix \(A\) from the data-dependent rank-1 matrix \(\mathbf{x} \mathbf{x}^\top\). In the original form, \(A\) is "sandwiched" between vectors, making it difficult to analyze the contribution of \(A\) and \(\mathbf{x}\) independently.
  2. Computational Efficiency (The Summation Trick): Suppose we need to compute the average of quadratic forms over \(N\) data points. Calculating each individually is computationally expensive, but the trace trick allows us to factor through the covariance matrix: \[ \frac{1}{N} \sum_{i=1}^N \mathbf{x}_i^\top A \mathbf{x}_i = \operatorname{tr}\!\left( A \cdot \frac{1}{N} \sum_{i=1}^N \mathbf{x}_i \mathbf{x}_i^\top \right) = \operatorname{tr}(A \Sigma). \] By pre-computing the covariance \(\Sigma\), we reduce \(N\) vector-matrix-vector products to a single matrix product and one trace operation.
  3. Structural Analysis: This form is the gateway to advanced topics like the expectation of quadratic forms. In statistics, \(\mathbb{E}[\mathbf{x}^\top A \mathbf{x}]\) simplifies to \(\operatorname{tr}(A \, \mathbb{E}[\mathbf{x} \mathbf{x}^\top])\), which is impossible to express without the trace trick.

Note: although \(\mathbf{x}^\top A \mathbf{x}\) is a \(1 \times 1\) scalar, the internal term \(A \mathbf{x} \mathbf{x}^\top\) is an \(n \times n\) matrix. The trace "compresses" this matrix back to the scalar.

Property 4 deserves a closer look. Up to this point, the inner product has lived on \(\mathbb{R}^n\): a structure on arrows in n-dimensional space. But nothing in the defining axioms of an inner product — bilinearity, symmetry, positivity — is tied to \(\mathbb{R}^n\). Property 4 reveals something striking: the trace defines an inner product on an entirely different space — the space \(\mathbb{R}^{m \times n}\) of matrices. The formula \(\langle A, B \rangle_F = \operatorname{tr}(A^\top B)\) satisfies all the same axioms, just with matrices playing the role that vectors played before. The notion of "inner product" survives the transplant intact.

This is the first hint of a general pattern: inner products, and the norms and distances they induce, are abstract structures that can live on many different vector spaces. We give this instance a name.

Definition: Frobenius Inner Product

For matrices \(A, B \in \mathbb{R}^{m \times n}\), the Frobenius inner product is defined as: \[ \left\langle A, B \right\rangle_F = \operatorname{tr}(A^\top B) = \sum_{i=1}^m \sum_{j=1}^n a_{ij} b_{ij}. \]

Definition: Frobenius Norm

The Frobenius norm of a matrix \(A \in \mathbb{R}^{m \times n}\) is defined as \[ \| A \|_F = \sqrt{\left\langle A, A \right\rangle_F} = \sqrt{\operatorname{tr}(A^\top A)} = \sqrt{\sum_{i=1}^m \sum_{j=1}^n a_{ij}^2}. \]

This is one of many possible matrix norms. Each norm emphasizes different aspects of a matrix's structure and is suited to different applications.

Norms

While the trace characterizes matrices through a single scalar, norms provide a way to measure the "size" or "magnitude" of both vectors and matrices. In machine learning, norms are fundamental for training stability. They are used to scale data to a common range, ensuring that every feature contributes equally to model training and reducing sensitivity to feature scale. This normalization process is important in distance-based models such as support vector machines (SVMs) and \(k\)-nearest neighbors (k-NN). Norms also appear in regularization, which penalizes model complexity to avoid overfitting and help the model generalize to new data.

Another popular norm of a matrix is the nuclear norm (also called trace norm).

Definition: Nuclear Norm (Trace Norm)

Given a matrix \(A \in \mathbb{R}^{m \times n}\), the nuclear norm of \(A\) is defined as the sum of its singular values: \[ \| A \|_{\ast} = \sum_{i=1}^{\min(m,n)} \sigma_i \geq 0. \]

Theorem: Nuclear Norm via SVD

Let \(A \in \mathbb{R}^{m \times n}\) have singular value decomposition \(A = U \Sigma V^\top\). Then \[ \| A \|_{\ast} = \sum_{i=1}^{\min(m,n)} \sigma_i = \operatorname{tr}\!\left(\sqrt{A^\top A}\right). \]

Proof:

Let \(A = U\Sigma V^\top\) be the SVD of \(A\), with \(U \in \mathbb{R}^{m \times m}\) and \(V \in \mathbb{R}^{n \times n}\) orthogonal and \(\Sigma \in \mathbb{R}^{m \times n}\) containing the singular values \(\sigma_1 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0\) on its diagonal. Since \(U^\top U = I_m\), \[ A^\top A = V \Sigma^\top U^\top U \Sigma V^\top = V (\Sigma^\top \Sigma) V^\top, \] where \(\Sigma^\top \Sigma \in \mathbb{R}^{n \times n}\) is diagonal with entries \(\sigma_1^2, \ldots, \sigma_{\min(m,n)}^2\) (padded with zeros if \(n > m\)). Thus \(A^\top A\) is symmetric positive semidefinite, and its unique symmetric positive semidefinite square root is obtained by taking nonnegative square roots of the diagonal entries: \[ \sqrt{A^\top A} = V \, \Sigma_+ \, V^\top, \quad \text{where } \Sigma_+ = \operatorname{diag}(\sigma_1, \ldots, \sigma_{\min(m,n)}, 0, \ldots, 0) \in \mathbb{R}^{n \times n}. \] (Uniqueness of the PSD square root follows from the spectral theorem applied to \(A^\top A\): diagonalize, take nonnegative roots, rotate back.)

Now \(\sqrt{A^\top A}\) is an \(n \times n\) symmetric matrix, so its trace is well-defined. By the cyclic permutation property of trace (Property 5) together with \(V^\top V = I_n\), \[ \operatorname{tr}\!\left(\sqrt{A^\top A}\right) = \operatorname{tr}(V \Sigma_+ V^\top) = \operatorname{tr}(\Sigma_+ V^\top V) = \operatorname{tr}(\Sigma_+) = \sum_{i=1}^{\min(m,n)} \sigma_i = \|A\|_*. \]

Definition: Induced Norm

Given a vector \(p\)-norm on \(\mathbb{R}^n\) and \(\mathbb{R}^m\), the induced norm (or operator norm) of a matrix \(A \in \mathbb{R}^{m \times n}\) is \[ \| A \|_p = \max_{\mathbf{x} \neq \mathbf{0}} \frac{\| A \mathbf{x} \|_p}{\| \mathbf{x} \|_p} = \max_{\| \mathbf{x} \|_p = 1} \| A \mathbf{x} \|_p. \] The two expressions are equal because the ratio \(\|A \mathbf{x}\|_p / \|\mathbf{x}\|_p\) is invariant under nonzero scaling of \(\mathbf{x}\).

Definition: Spectral Norm

When \(p = 2\), the induced norm is called the spectral norm of \(A\) and equals the largest singular value: \[ \| A \|_2 = \sigma_{\max}(A) = \sqrt{\lambda_{\max}(A^\top A)}, \] where \(\lambda_{\max}(A^\top A)\) is the largest eigenvalue of \(A^\top A\).

Derivation:

Let \(A = U \Sigma V^\top\) be the SVD with \(U, V\) orthogonal and singular values ordered as \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0\). For notational convenience, extend \(\sigma_i := 0\) for \(\min(m,n) < i \leq n\). For any \(\mathbf{x} \in \mathbb{R}^n\) with \(\|\mathbf{x}\|_2 = 1\), set \(\mathbf{y} = V^\top \mathbf{x}\). Since \(V\) is orthogonal, \(\|\mathbf{y}\|_2 = \|\mathbf{x}\|_2 = 1\) by the norm-preserving property. Similarly \(U\) preserves norms, so: \[ \|A \mathbf{x}\|_2^2 = \|U \Sigma V^\top \mathbf{x}\|_2^2 = \|\Sigma \mathbf{y}\|_2^2 = \sum_{i=1}^n \sigma_i^2 \, y_i^2. \] Bounding above by the largest singular value, \[ \|A \mathbf{x}\|_2^2 = \sum_{i=1}^n \sigma_i^2 y_i^2 \leq \sigma_1^2 \sum_{i=1}^n y_i^2 = \sigma_1^2. \] Equality is achieved by \(\mathbf{y} = \mathbf{e}_1\) (equivalently \(\mathbf{x} = V \mathbf{e}_1\), the first right singular vector), which gives \(\|A\mathbf{x}\|_2^2 = \sigma_1^2\). Thus \(\|A\|_2 = \sigma_1 = \sigma_{\max}\). The equality \(\sigma_{\max} = \sqrt{\lambda_{\max}(A^\top A)}\) follows directly from the definition of singular values as \(\sigma_i = \sqrt{\lambda_i(A^\top A)}\).

The spectral norm is commonly used to control the Lipschitz continuity of functions in neural networks to prevent vanishing or exploding gradients.

The induced norm depends on a choice of vector \(p\)-norm. The Euclidean norm used throughout linear algebra is the special case \(p = 2\). In general, we define the \(p\)-norm on vectors as follows.

Definition: \(p\)-Norm

For \(p \geq 1\), the \(p\)-norm of a vector \(\mathbf{x} \in \mathbb{R}^n\) is \[ \| \mathbf{x} \|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}. \] The restriction \(p \geq 1\) ensures the triangle inequality holds (via Minkowski's inequality); for \(0 < p < 1\), the expression above still defines a function but fails to be a norm.

When \(p = 1\), we call it the Manhattan norm (or \(\ell_1\) norm), which is commonly used in grid-based environments. Also, when we need sparsity in models such as Lasso regression, the Manhattan norm serves as a regularizer.

When \(p \to \infty\), we obtain the maximum norm: \[ \| \mathbf{x} \|_\infty = \lim_{p \to \infty} \| \mathbf{x} \|_p = \max_{i} |x_i|. \] In numerical analysis, the maximum norm is used for "worst-case" error analysis. Similarly, in machine learning, it is used when we want to minimize the worst error rather than the total error across sample data.

Introduction to Metric Spaces

Look back over the norms we have collected in the previous two sections: the Frobenius norm on matrices, the nuclear and spectral norms measuring singular-value magnitudes, the \(p\)-norms on vectors, the Manhattan and maximum norms as special cases. At first glance they feel ad hoc — each invented for a specific purpose on a specific space. But something invariant runs through all of them. Every norm we have seen satisfies three properties: non-negativity, symmetry, and the triangle inequality. These three properties — and nothing more — are what "distance" actually is. When we extract them from the vector setting and apply them to an arbitrary set, we arrive at one of the most fertile structures in modern mathematics: the metric space.

Definition: Metric Space

A metric space is an ordered pair \((M, d)\) consisting of a nonempty set \(M\) and a metric \(d\) on \(M\). The metric \(d\) is a function \(d: M \times M \to \mathbb{R}\) that defines the distance between any two elements of \(M\). For all \(a, b, c \in M\), the following axioms must hold:

  1. Non-negativity: \(d(a,b) \geq 0\) with equality if and only if \(a = b\).
  2. Symmetry: \(d(a,b) = d(b,a)\).
  3. Triangle inequality: \(d(a,b) \leq d(a,c) + d(c, b)\).

Two things are striking about this definition. First, \(M\) is any nonempty set — not required to be a vector space, not required to have any algebraic structure at all. Points could be matrices, functions, probability distributions, strings of DNA, or vertices of a graph. Second, the axioms are strikingly minimal. Often we just say "the metric space \(M\)" when \(d\) is understood from context.

Notational aside. For sets \(X\) and \(Y\), the Cartesian product is the set of ordered pairs \[ X \times Y = \{(x, y) \mid x \in X, \; y \in Y\}, \] which is the natural domain for \(d\).

Every vector space equipped with a norm becomes a metric space: the norm \(\|\cdot\|\) induces a metric via \(d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|\). This includes every setting we have worked with so far. Even more specifically, most of Section I has lived inside inner product spaces, where the norm itself comes from an inner product: \[ d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\| = \sqrt{\langle \mathbf{u} - \mathbf{v}, \mathbf{u} - \mathbf{v} \rangle}. \] The most familiar example is the Euclidean space \(\mathbb{R}^n\). But now the Frobenius inner product has shown us that \(\mathbb{R}^{m \times n}\) carries exactly the same structure. Both are special cases of a much broader family.

Why This Abstraction Matters for ML and CS

The convergence of iterative algorithms — gradient descent, Newton's method, EM, MCMC — is not fundamentally about real numbers or Euclidean geometry. It is about whether successive iterates get closer to a limit, in some notion of distance. The relevant notion depends on the space:

  • On \(\mathbb{R}^n\), gradient descent is analyzed using the Euclidean metric.
  • On spaces of matrices, low-rank recovery is analyzed using the Frobenius or nuclear norm.
  • On spaces of probability distributions, convergence is measured by the Wasserstein metric or KL divergence.
  • On function spaces (where neural networks live in the infinite-width limit), the natural metric is an \(L^p\) norm.

Each of these is a different metric space. To speak the language of convergence in modern ML — why an algorithm converges, at what rate, and to what limit — one must first speak the language of metric spaces (in Section II), which is exactly where this path continues.