Trace & Norms


Trace

The trace is a fundamental scalar function that associates a single number with a square matrix. Despite its simple definition as the sum of the diagonal entries, the trace has deep connections to eigenvalues and matrix norms, and it appears throughout optimization and machine learning. Its cyclic permutation property makes it invaluable for manipulating matrix expressions.

Definition: Trace

The trace of an \(n \times n\) matrix \(A\) is the sum of the diagonal entries of \(A\), denoted by \(\text{tr }(A)\): \[ \text{tr }(A) =\sum_{k=1}^n a_{kk}. \]

The trace satisfies several important properties:

Theorem: Properties of Trace

Let \(A, B\) be \(n \times n\) matrices and \(c \in \mathbb{R}\). The trace satisfies:

  1. Linearity: \(\text{tr }(cA) = c(\text{tr }A)\)
  2. Additivity: \(\text{tr }(A + B) = \text{tr }A + \text{tr }B\)
  3. Transpose invariance: \(\text{tr }(A^\top) = \text{tr }A\)
  4. Frobenius inner product: \(\text{tr }(A^\top B) = \sum_{i=1}^n\sum_{j=1}^n a_{ij}b_{ij}\)
  5. Cyclic permutation: \(\text{tr }(AB) = \text{tr }(BA)\)
  6. Eigenvalue sum: \(\text{tr }A = \sum_{i=1}^n \lambda_i\) where \(\lambda_1, \ldots, \lambda_n\) are the eigenvalues of \(A\)

Note: Properties 4 and 5 do not require square matrices: in property 4, \(A\) and \(B\) can both be \(m \times n\) (as in the Frobenius inner product defined below), and in property 5, \(A\) can be \(m \times n\) and \(B\) can be \(n \times m\). More generally, \(\text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA)\) for any compatible product. A quick numerical check of these properties follows below.
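
These properties are straightforward to verify numerically. The following is a minimal sketch using NumPy; the matrix sizes and the random seed are arbitrary illustrative choices, not part of the text above.

```python
import numpy as np

# Numerical spot-check of the trace properties (illustrative sketch only).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# 1-2. Linearity and additivity
assert np.isclose(np.trace(3 * A), 3 * np.trace(A))
assert np.isclose(np.trace(A + B), np.trace(A) + np.trace(B))

# 3-4. Transpose invariance and the Frobenius inner product
assert np.isclose(np.trace(A.T), np.trace(A))
assert np.isclose(np.trace(A.T @ B), np.sum(A * B))

# 5. Cyclic permutation, also valid for rectangular factors
C = rng.standard_normal((4, 6))
D = rng.standard_normal((6, 4))
assert np.isclose(np.trace(C @ D), np.trace(D @ C))

# 6. Eigenvalue sum (eigenvalues may be complex; their sum is real here)
assert np.isclose(np.trace(A), np.sum(np.linalg.eigvals(A)).real)
```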

Insight: The Trace Trick

Consider a quadratic form \(x^\top Ax \in \mathbb{R}\). While this expression is a scalar, we often rewrite it using the trace operator to gain structural advantages: a scalar equals its own trace, and then cyclic permutation applies, so \[ x^\top Ax = \text{tr}(x^\top Ax) = \text{tr}(Axx^\top). \] This "trace trick" is indispensable in computer science and statistics for the following reasons:

  1. Decoupling System and Data: It separates the system matrix \(A\) from the data-dependent rank-1 matrix \(xx^\top\). In the original form, \(A\) is "sandwiched" between vectors, making it difficult to analyze the contribution of \(A\) and \(x\) independently.
  2. Computational Efficiency (The Summation Trick): Suppose we need to compute the average of quadratic forms over \(N\) data points. Calculating each one individually is computationally expensive ("calculation hell"), but the trace trick lets us collect all the data into a single matrix: \[ \frac{1}{N} \sum_{i=1}^N x_i^\top A x_i = \text{tr}\left( A \cdot \left( \frac{1}{N} \sum_{i=1}^N x_i x_i^\top \right) \right) = \text{tr}(A\Sigma), \] where \(\Sigma = \frac{1}{N}\sum_{i=1}^N x_i x_i^\top\) is the second-moment matrix (the covariance matrix when the data are mean-centered). By pre-computing \(\Sigma\), we reduce \(N\) separate quadratic-form evaluations to a single matrix multiplication and trace operation; a numerical sketch follows after the note below.
  3. Structural Analysis: This form is the gateway to advanced topics like the Expectation of Quadratic Forms. In statistics, the expectation \(\mathbb{E}[x^\top Ax]\) simplifies beautifully to \(\text{tr}(A \, \mathbb{E}[xx^\top])\), which is hard to derive cleanly without the trace trick.

Note: Although \(x^\top Ax\) is a \(1 \times 1\) scalar, the internal term \(Axx^\top\) is an \(n \times n\) matrix. The trace operator "compresses" this matrix back into the original scalar value.
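
The identity above, and the summation trick from point 2, can be checked with a few lines of NumPy. This is a minimal sketch; the dimensions, the seed, and the uncentered second-moment matrix \(\Sigma\) are illustrative assumptions.

```python
import numpy as np

# Sketch of the trace trick and the summation trick (illustrative only).
rng = np.random.default_rng(1)
n, N = 5, 1000
A = rng.standard_normal((n, n))
X = rng.standard_normal((N, n))              # N data points x_i as rows
x = X[0]

# Scalar quadratic form equals the trace of the rank-1 "sandwich"
assert np.isclose(x @ A @ x, np.trace(A @ np.outer(x, x)))

# Summation trick: average of N quadratic forms via one matrix product
direct = np.mean([xi @ A @ xi for xi in X])  # N separate evaluations
Sigma = (X.T @ X) / N                        # (1/N) * sum_i x_i x_i^T
tricked = np.trace(A @ Sigma)                # single multiply + trace
assert np.isclose(direct, tricked)
```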

Property 4 gives a notion of inner product for matrices, called the Frobenius inner product. Just as the Euclidean (\(l_2\)) norm \(\|v\|_2 = \sqrt{v \cdot v} = \sqrt{v^\top v}\) of a vector is induced by the dot product, this inner product induces a norm for matrices.

Definition: Frobenius Inner Product

For matrices \(A, B \in \mathbb{R}^{m \times n}\), the Frobenius inner product is defined as: \[ \left\langle A, B \right\rangle_F = \text{tr }(A^\top B) = \sum_{i=1}^m\sum_{j=1}^n a_{ij}b_{ij}. \]

In addition, the Frobenius norm of a matrix \(A\) is defined as \[ \| A \|_F = \sqrt{\left\langle A, A \right\rangle_F} = \sqrt{\text{tr }(A^\top A)}. \]

This is one of many possible matrix norms. Each norm emphasizes different aspects of a matrix's structure and is suited to different applications.
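
As a quick sanity check, the trace formula for the Frobenius inner product and norm agrees with the entrywise definitions and with NumPy's built-in `'fro'` norm. This is a small sketch; the shape and seed are arbitrary choices.

```python
import numpy as np

# Frobenius inner product and norm, computed in equivalent ways (sketch).
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

inner = np.trace(A.T @ B)                  # tr(A^T B)
assert np.isclose(inner, np.sum(A * B))    # entrywise sum definition

fro = np.sqrt(np.trace(A.T @ A))           # sqrt(tr(A^T A))
assert np.isclose(fro, np.linalg.norm(A, 'fro'))
assert np.isclose(fro, np.sqrt(np.sum(A ** 2)))
```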

Norms

While the trace characterizes a matrix through a single scalar, norms provide a way to measure the "size" or "magnitude" of both vectors and matrices. In machine learning, norms are fundamental to the training stability of models. Norms are used to scale data to a "common" range, which ensures that every feature contributes "equally" to training and reduces sensitivity to the scale of each feature. This normalization is especially important in "distance" (i.e., norm) based models such as support vector machines (SVMs) and \(k\)-nearest neighbors (k-NN). Regularization also uses norms to "penalize" model complexity, which reduces overfitting and helps the model generalize to new data.

Another popular matrix norm is the nuclear norm (trace norm).

Definition: Nuclear Norm (Trace Norm)

Given a matrix \(A \in \mathbb{R}^{m \times n}\), the nuclear norm of \(A\) is defined as the sum of its singular values: \[ \| A \|_{\ast} = \sum_{i} \sigma_i \geq 0. \]

Theorem: Nuclear Norm via Trace

Let \(A = U\Sigma V^\top\) be a singular value decomposition of \(A\). Then \[ \| A \|_{\ast} = \sum_{i} \sigma_i = \text{tr }(\Sigma) = \text{tr }(\sqrt{A^\top A}). \]

Proof:

Suppose \(A = U\Sigma V^\top \) by the singular value decomposition. Since \(U^\top U = I\), \[ A^\top A = (V\Sigma^\top U^\top)(U\Sigma V^\top) = V(\Sigma^\top \Sigma)V^\top, \] and \(\Sigma^\top \Sigma\) is the diagonal matrix with entries \(\sigma_i^2 \geq 0\), so \[ \sqrt{A^\top A} = V\sqrt{\Sigma^\top \Sigma}\,V^\top, \] where \(\sqrt{\Sigma^\top \Sigma}\) has the singular values \(\sigma_i\) on its diagonal. Then by the cyclic permutation property of the trace and \(V^\top V = I\), \[ \begin{align*} \text{tr }(\sqrt{A^\top A}) = \text{tr }(V\sqrt{\Sigma^\top \Sigma}\,V^\top) &= \text{tr }(\sqrt{\Sigma^\top \Sigma}\,V^\top V) \\\\ &= \text{tr }(\sqrt{\Sigma^\top \Sigma}) = \sum_i \sigma_i = \text{tr }(\Sigma). \end{align*} \]
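
The theorem can be verified numerically by building \(\sqrt{A^\top A}\) from an eigendecomposition of the symmetric matrix \(A^\top A\). This is a minimal NumPy sketch; the matrix size, seed, and the eigendecomposition route to the matrix square root are illustrative choices.

```python
import numpy as np

# Numerical check: nuclear norm = sum of singular values = tr(sqrt(A^T A)).
rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))

sigma = np.linalg.svd(A, compute_uv=False)            # singular values
nuclear = np.sum(sigma)
assert np.isclose(nuclear, np.linalg.norm(A, 'nuc'))  # built-in nuclear norm

# sqrt(A^T A) via the eigendecomposition of the symmetric matrix A^T A
w, V = np.linalg.eigh(A.T @ A)                        # eigenvalues w >= 0 (up to round-off)
sqrt_AtA = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T
assert np.isclose(nuclear, np.trace(sqrt_AtA))
```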

Definition: Induced Norm & Spectral Norm

The induced norm of \(A\) is defined as \[ \| A \|_p = \max_{x \neq 0} \frac{\| Ax \|_p}{\| x \|_p} = \max_{\| x \|_p =1} \| Ax \|_p. \tag{1} \]

Typically, when \(p =2\), we call it the spectral norm of \(A\), which is just the largest singular value of \(A\): \[ \| A \|_2 = \max_{i} \sigma_i = \sqrt{\lambda_{\max} (A^\top A)} \] where \(\lambda_{\max}\) is the largest eigenvalue of \(A^\top A\).

The spectral norm is commonly used to control the Lipschitz constant of the layers in a neural network, which helps prevent vanishing or exploding gradients.
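
Numerically, the spectral norm can be obtained from the SVD, from the largest eigenvalue of \(A^\top A\), or directly from NumPy's induced 2-norm. A small sketch (shape and seed are arbitrary choices):

```python
import numpy as np

# Three equivalent computations of the spectral norm (illustrative sketch).
rng = np.random.default_rng(4)
A = rng.standard_normal((4, 6))

largest_sv = np.linalg.svd(A, compute_uv=False)[0]        # max singular value
from_eig = np.sqrt(np.max(np.linalg.eigvalsh(A.T @ A)))   # sqrt(lambda_max(A^T A))
built_in = np.linalg.norm(A, 2)                           # induced 2-norm

assert np.isclose(largest_sv, from_eig)
assert np.isclose(largest_sv, built_in)
```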

\(\| x \|_p \) in (1) is called the \(p\)-norm of the vector \(x\). The Euclidean norm is the special case of the \(p\)-norm with \(p = 2\). In general, we define the \(p\)-norm as follows.

Definition: \(p\)-Norm

The \(p\)-norm of the vector \(x\) is given by: \[ \| x \|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}, \quad p \geq 1. \]

When \(p =1\), we call it the Manhattan norm, which is commonly used in grid-based environments. The Manhattan norm is also often used as a regularizer when we need sparsity in models, as in Lasso regression.

When \(p\) approaches \(\infty\), we define the maximum norm: \[ \| x \|_\infty = \lim_{p \to \infty} \| x \|_p = \max_{i} |x_i|. \] In numerical analysis, the maximum norm is used for "worst-case" error analysis. Similarly, in machine learning, it is used when we want to minimize the worst error rather than the total error across sample data.
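
The three special cases above are easy to compute by hand and to check against NumPy's `np.linalg.norm`. A small sketch with a hypothetical vector:

```python
import numpy as np

# p-norms of a single vector for p = 1, 2, and infinity (sketch).
x = np.array([3.0, -4.0, 1.0])

manhattan = np.sum(np.abs(x))            # p = 1
euclidean = np.sqrt(np.sum(x ** 2))      # p = 2
maximum = np.max(np.abs(x))              # p -> infinity

assert np.isclose(manhattan, np.linalg.norm(x, 1))      # 8.0
assert np.isclose(euclidean, np.linalg.norm(x, 2))      # sqrt(26)
assert np.isclose(maximum, np.linalg.norm(x, np.inf))   # 4.0
```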

Introduction to Metric Spaces

We have learned about vector spaces and norms, which measure the "distance" between points (vectors) in \(\mathbb{R}^n\). Although this concept may already seem general, since it extends beyond the familiar 3D space, a vector space with a norm is actually a special case of a more general mathematical structure called a metric space.

Understanding this broader framework helps us see how the norms we use in linear algebra fit into the larger picture of mathematical analysis. As an introduction to this perspective, we define the metric space \((M, d)\) as follows.

Definition: Metric Space

A metric space is an ordered pair \((M, d)\) consisting of a nonempty set \(M\) and a metric \(d\) on \(M\). The metric \(d\) is a function \(d: M \times M \to \mathbb{R}\) that defines the distance between any two elements of \(M\). For all \(a, b, c \in M\), the following axioms must hold:

  1. Positivity: \(d(a,b) \geq 0\) with equality if and only if \(a = b\).
  2. Symmetry: \(d(a,b) = d(b,a)\).
  3. Triangle inequality: \(d(a,b) \leq d(a,c) + d(c, b)\).

Often we just say "the metric space \(M\)" if the metric \(d\) is obvious.
Note: Suppose \(X\) and \(Y\) are sets. Then the Cartesian product of \(X\) and \(Y\) is the set of ordered pairs: \[ X \times Y = \{(x, y) \mid x \in X, \, y\in Y\}. \]

Now you can see that any vector space equipped with a norm is a metric space, since a norm induces the distance \(d(u, v) = \|u - v\|\); we call such a space a normed vector space. More specifically, in linear algebra we have worked within the framework of inner product spaces, where the norm itself is defined by the inner product, so that \[ d(u, v) = \|u-v\|= \sqrt{\langle u-v, u-v \rangle}. \] The most familiar example of an inner product space is the Euclidean space \(\mathbb{R}^n\).
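
As a concrete illustration, the metric induced by a norm, \(d(u, v) = \|u - v\|\), can be spot-checked against the axioms above for several choices of norm. A minimal sketch (the points and the chosen norms are arbitrary):

```python
import numpy as np

# Spot-check the metric axioms for metrics induced by the 1-, 2-, and max-norms.
rng = np.random.default_rng(5)
u, v, w = rng.standard_normal((3, 4))    # three points in R^4

for p in (1, 2, np.inf):
    d = lambda a, b: np.linalg.norm(a - b, p)
    assert d(u, v) >= 0                      # positivity
    assert np.isclose(d(u, v), d(v, u))      # symmetry
    assert d(u, v) <= d(u, w) + d(w, v)      # triangle inequality
```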

In machine learning, understanding how inner product spaces naturally extend to normed spaces and then to metric spaces provides the mathematical foundation for defining similarity measures, regularization techniques, and convergence criteria in optimization algorithms.