Trace
The trace is a fundamental scalar function that associates a single number with a square matrix.
Despite its simple definition as the sum of the diagonal entries, the trace has deep connections to
eigenvalues and matrix norms, and it appears throughout optimization and machine learning. Its cyclic
permutation property makes it invaluable for manipulating matrix expressions.
Definition: Trace
The trace of an \(n \times n\) matrix \(A\) is the sum of the diagonal entries of \(A\) and is
denoted by \(\text{tr }(A)\):
\[
\text{tr }(A) =\sum_{k=1}^n a_{kk}.
\]
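For example, for the \(2 \times 2\) matrix \(A\) with rows \((1, 2)\) and \((3, 4)\), we have \(\text{tr }A = a_{11} + a_{22} = 1 + 4 = 5\).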
The trace satisfies several important properties:
Theorem: Properties of Trace
Let \(A, B\) be \(n \times n\) matrices and \(c \in \mathbb{R}\). The trace satisfies:
- Linearity: \(\text{tr }(cA) = c(\text{tr }A)\)
- Additivity: \(\text{tr }(A + B) = \text{tr }A + \text{tr }B\)
- Transpose invariance: \(\text{tr }(A^\top) = \text{tr }A\)
- Frobenius inner product: \(\text{tr }(A^\top B) = \sum_{i=1}^n\sum_{j=1}^n a_{ij}b_{ij}\)
- Cyclic permutation: \(\text{tr }(AB) = \text{tr }(BA)\)
- Eigenvalue sum: \(\text{tr }A = \sum_{i=1}^n \lambda_i\) where \(\lambda_1, \ldots, \lambda_n\)
are the eigenvalues of \(A\), counted with multiplicity
Note: Properties 4 and 5 also hold for non-square matrices: in property 4, \(A\) and \(B\) can both be
\(m \times n\) (with the sums running over \(i = 1, \ldots, m\) and \(j = 1, \ldots, n\)); in property 5,
\(A\) can be \(m \times n\) and \(B\) can be \(n \times m\). More generally,
\(\text{tr}(ABC) = \text{tr}(CAB) = \text{tr}(BCA)\).
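The properties above are easy to verify numerically. The following is a minimal NumPy sketch (random \(3 \times 3\) matrices, chosen purely for illustration):
```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
c = 2.5

# Linearity and additivity
assert np.isclose(np.trace(c * A), c * np.trace(A))
assert np.isclose(np.trace(A + B), np.trace(A) + np.trace(B))

# Transpose invariance and the Frobenius inner product
assert np.isclose(np.trace(A.T), np.trace(A))
assert np.isclose(np.trace(A.T @ B), np.sum(A * B))

# Cyclic permutation
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# Eigenvalue sum (eigenvalues of a real matrix may be complex, but their sum is real)
assert np.isclose(np.trace(A), np.sum(np.linalg.eigvals(A)).real)
```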
Insight: The Trace Trick
Consider a quadratic form \(x^\top Ax \in \mathbb{R}\).
While this expression is a scalar, we often rewrite it using the trace operator to gain structural advantages:
\[
\begin{align*}
x^\top Ax = \text{tr}(x^\top Ax) = \text{tr}(Axx^\top).
\end{align*}
\]
This "trace trick" is indispensable in Computer Science and Statistics for the following reasons:
-
Decoupling System and Data:
It separates the system matrix \(A\) from the data-dependent rank-1 matrix \(xx^\top\). In the original form, \(A\) is "sandwiched" between vectors, making it difficult to analyze the contribution of \(A\) and \(x\) independently.
-
Computational Efficiency (The Summation Trick):
Suppose we need to compute the average of quadratic forms over \(N\) data points. Evaluating each \(x_i^\top A x_i\) separately is wasteful, but the trace trick lets us collapse the whole sum into a single second-moment matrix (the covariance matrix when the data are centered):
\[
\frac{1}{N} \sum_{i=1}^N x_i^\top A x_i = \text{tr}\left( A \cdot \left( \frac{1}{N} \sum_{i=1}^N x_i x_i^\top \right) \right) = \text{tr}(A\Sigma)
\]
By pre-computing \(\Sigma = \frac{1}{N}\sum_{i=1}^N x_i x_i^\top\), we replace \(N\) separate quadratic-form evaluations with a single matrix product and one trace operation (see the numerical sketch below).
-
Structural Analysis:
This form is the gateway to advanced topics like the expectation of quadratic forms. In statistics, the expectation \(\mathbb{E}[x^\top Ax]\) simplifies neatly to \(\text{tr}(A\, \mathbb{E}[xx^\top])\), a result that is much harder to derive without the trace trick.
Note: Although \(x^\top Ax\) is a \(1 \times 1\) scalar, the internal term \(Axx^\top\) is an \(n \times n\) matrix. The trace operator "compresses" this matrix back into the original scalar value.
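The summation trick is easy to check numerically. Below is a small illustrative NumPy sketch (the matrix \(A\) and the data points are random, and the data are not centered, so the averaged outer product is the second-moment matrix rather than the covariance):
```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 5, 1000
A = rng.standard_normal((n, n))
X = rng.standard_normal((N, n))   # each row is one data point x_i

# Naive: evaluate each quadratic form x_i^T A x_i separately
naive = np.mean([x @ A @ x for x in X])

# Trace trick: tr(A * (1/N) sum_i x_i x_i^T)
S = (X.T @ X) / N                 # second-moment matrix of the data
via_trace = np.trace(A @ S)

assert np.isclose(naive, via_trace)
```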
Property 4 defines an inner product for matrices, called the Frobenius inner product.
Just as the Euclidean (\(l_2\)) norm \(\|v\|_2 = \sqrt{v \cdot v} = \sqrt{v^\top v}\) of a vector is induced
by the dot product, this inner product induces a norm for matrices.
Definition: Frobenius Inner Product
For matrices \(A, B \in \mathbb{R}^{m \times n}\), the Frobenius inner product is defined as:
\[
\left\langle A, B \right\rangle_F = \text{tr }(A^\top B) = \sum_{i=1}^m\sum_{j=1}^n a_{ij}b_{ij}.
\]
In addition, the Frobenius norm of a matrix \(A\) is defined as
\[
\| A \|_F = \sqrt{\left\langle A, A \right\rangle_F} = \sqrt{\text{tr }(A^\top A)}.
\]
This is one of many possible matrix norms. Each norm emphasizes different aspects of a matrix's structure
and is suited to different applications.
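For concreteness, here is a short NumPy sketch (random matrices, for illustration only) confirming that the trace formulas agree with the entrywise definitions:
```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))

# Frobenius inner product: tr(A^T B) equals the entrywise sum of products
assert np.isclose(np.trace(A.T @ B), np.sum(A * B))

# Frobenius norm: sqrt(tr(A^T A)) equals NumPy's built-in Frobenius norm
assert np.isclose(np.sqrt(np.trace(A.T @ A)), np.linalg.norm(A, 'fro'))
```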
Norms
While the trace characterizes a matrix through a single scalar, norms provide a way to measure
the "size" or "magnitude" of both vectors and matrices. In machine learning, norms are fundamental
to the training stability of models. Norms are used to scale features to a common range, so that each
feature contributes comparably during training and the model is less sensitive to the scale of any
individual feature. This normalization step is especially important for distance-based (i.e., norm-based)
models such as support vector machines (SVMs) and \(k\)-nearest neighbors (k-NN).
Norms are also used in regularization to penalize model complexity, which helps avoid overfitting
and allows the model to generalize to new data.
Another popular matrix norm is the nuclear norm (also called the trace norm).
Definition: Nuclear Norm (Trace Norm)
Given a matrix \(A \in \mathbb{R}^{m \times n}\), the nuclear norm of \(A\) is defined as the sum of its
singular values:
\[
\| A \|_{\ast} = \sum_{i} \sigma_i \geq 0.
\]
Theorem 1:
Let \(A = U\Sigma V^\top\) be a singular value decomposition of \(A\). Then
\[
\| A \|_{\ast} = \sum_{i} \sigma_i = \text{tr }(\Sigma) = \text{tr }(\sqrt{A^\top A}),
\]
where \(\sqrt{A^\top A}\) denotes the positive semidefinite square root of \(A^\top A\).
Proof:
Suppose \(A = U\Sigma V^\top \) is a singular value decomposition of \(A\), so that \(A^\top = V\Sigma^\top U^\top\).
Since
\[
A^\top A = (V\Sigma^\top U^\top)(U\Sigma V^\top) = V(\Sigma^\top \Sigma) V^\top
\]
and \(\Sigma^\top \Sigma\) is diagonal with entries \(\sigma_i^2\), the matrix square root is
\[
\sqrt{A^\top A} = V\sqrt{\Sigma^\top \Sigma}\, V^\top.
\]
Then by the cyclic permutation property of the trace and \(V^\top V = I\),
\[
\begin{align*}
\text{tr }(\sqrt{A^\top A}) = \text{tr }(V\sqrt{\Sigma^\top \Sigma}\, V^\top) &= \text{tr }(\sqrt{\Sigma^\top \Sigma}\, V^\top V) \\\\
&= \text{tr }(\sqrt{\Sigma^\top \Sigma}) = \sum_i \sigma_i = \text{tr }(\Sigma).
\end{align*}
\]
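The identity in Theorem 1 can also be checked numerically. The sketch below (illustrative NumPy code on a random matrix) forms the matrix square root of \(A^\top A\) from its eigendecomposition, mirroring the proof:
```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))

# Sum of singular values (the nuclear norm)
sigmas = np.linalg.svd(A, compute_uv=False)
nuclear = np.sum(sigmas)

# Matrix square root of A^T A via its eigendecomposition A^T A = Q diag(w) Q^T
w, Q = np.linalg.eigh(A.T @ A)
sqrt_AtA = Q @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ Q.T

assert np.isclose(nuclear, np.trace(sqrt_AtA))
assert np.isclose(nuclear, np.linalg.norm(A, 'nuc'))   # built-in nuclear norm
```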
Definition: Induced Norm & Spectral Norm
The induced norm of \(A\) is defined as
\[
\| A \|_p = \max_{x \neq 0} \frac{\| Ax \|_p}{\| x \|_p} = \max_{\| x \|_p =1} \| Ax \|_p. \tag{1}
\]
Typically, when \(p =2\), we call it the spectral norm of \(A\), which is just
the largest singular value of \(A\):
\[
\| A \|_2 = \max_{i} \sigma_i = \sqrt{\lambda_{\max} (A^\top A)}
\]
where \(\lambda_{\max}\) is the largest eigenvalue of \(A^\top A\).
The spectral norm is commonly used to control the Lipschitz constant of the layers of a
neural network in order to prevent vanishing or exploding gradients.
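As an illustration, the spectral norm can be computed either from the SVD or from the largest eigenvalue of \(A^\top A\), and it bounds how much \(A\) can stretch any vector. A short NumPy sketch (random matrix and vector, for illustration only):
```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 6))

spec = np.linalg.norm(A, 2)   # spectral norm = largest singular value
assert np.isclose(spec, np.linalg.svd(A, compute_uv=False)[0])
assert np.isclose(spec, np.sqrt(np.max(np.linalg.eigvalsh(A.T @ A))))

# ||Ax||_2 <= ||A||_2 * ||x||_2 for every x (Lipschitz bound on x -> Ax)
x = rng.standard_normal(6)
assert np.linalg.norm(A @ x) <= spec * np.linalg.norm(x) + 1e-12
```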
The quantity \(\| x \|_p \) in (1) is called the \(p\)-norm of the vector \(x\). The Euclidean norm is
the special case of the \(p\)-norm with \(p = 2\). In general, we define the \(p\)-norm as follows.
Definition: \(p\)-Norm
The \(p\)-norm of the vector \(x\) is given by:
\[
\| x \|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{\frac{1}{p}}, \quad p \geq 1.
\]
When \(p = 1\), we call it the Manhattan norm, which is commonly used in grid-based
environments. It is also often used as a regularizer when we want sparsity in models such as
Lasso regression.
When \(p\) approaches \(\infty\), we define the maximum norm:
\[
\| x \|_\infty = \lim_{p \to \infty} \| x \|_p = \max_{i} |x_i|.
\]
In numerical analysis, the maximum norm is used for "worst-case" error analysis. Similarly,
in machine learning, it is used when we want to minimize the worst error rather than the total
error across sample data.
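A short NumPy sketch (with an arbitrary example vector) comparing the three most common cases:
```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l1 = np.linalg.norm(x, 1)          # Manhattan norm: |3| + |-4| + |1| = 8
l2 = np.linalg.norm(x, 2)          # Euclidean norm: sqrt(9 + 16 + 1)
linf = np.linalg.norm(x, np.inf)   # maximum norm: max |x_i| = 4

assert np.isclose(l1, 8.0)
assert np.isclose(l2, np.sqrt(26.0))
assert np.isclose(linf, 4.0)
```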
Introduction to Metric Spaces
We have learned about vector spaces and norms, which let us measure "distance" between points (vectors)
in \(\mathbb{R}^n\). Although this concept may seem general, since it extends beyond the familiar 3D space,
a vector space equipped with a norm is actually a specific case of a more general mathematical structure
called a metric space.
Understanding this broader framework helps us see how the norms we use in linear algebra fit into the larger
picture of mathematical analysis. As an introduction to this perspective, we define the metric
space \((M, d)\) as follows.
Definition: Metric Space
A metric space is an ordered pair \((M, d)\) consisting of a nonempty set
\(M\) and a metric \(d\) on \(M\). The metric \(d\) is a function
\(d: M \times M \to \mathbb{R}\) that defines the distance between any two elements of \(M\).
For all \(a, b, c \in M\), the following axioms must hold:
- Positivity: \(d(a,b) \geq 0\) with equality if and only if \(a = b\).
- Symmetry: \(d(a,b) = d(b,a)\).
- Triangle inequality: \(d(a,b) \leq d(a,c) + d(c, b)\).
Often we just say "the metric space \(M\)" if the metric \(d\) is obvious.
Note: Suppose \(X\) and \(Y\) are sets. Then the Cartesian product of \(X\) and \(Y\)
is the set of ordered pairs:
\[
X \times Y = \{(x, y) | x \in X, \, y\in Y\}.
\]
Now you can see that any vector space equipped with a norm, called a normed vector space,
is a metric space with the metric \(d(u, v) = \|u - v\|\). More specifically, in linear algebra we have
worked within the framework of inner product spaces, where the norm, and hence the metric, is induced by the inner product:
\[
d(u, v) = \|u-v\|= \sqrt{\langle u-v, u-v \rangle}.
\]
The most familiar example of an inner product space is the Euclidean space \(\mathbb{R}^n\).
In machine learning, understanding how inner product spaces naturally extend to normed spaces and then to metric
spaces provides the mathematical foundation for defining similarity measures, regularization techniques,
and convergence criteria in optimization algorithms.
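To make the connection concrete, the sketch below (illustrative NumPy code with random points) checks the three metric axioms for the metric induced by the Euclidean norm, \(d(u, v) = \|u - v\|_2\):
```python
import numpy as np

def d(u, v):
    """Metric induced by the Euclidean norm: d(u, v) = ||u - v||_2."""
    return np.linalg.norm(u - v)

rng = np.random.default_rng(5)
a, b, c = (rng.standard_normal(3) for _ in range(3))

# Positivity: d(a, b) >= 0, with d(a, a) = 0
assert d(a, b) >= 0 and np.isclose(d(a, a), 0.0)

# Symmetry: d(a, b) = d(b, a)
assert np.isclose(d(a, b), d(b, a))

# Triangle inequality: d(a, b) <= d(a, c) + d(c, b)
assert d(a, b) <= d(a, c) + d(c, b) + 1e-12
```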