Vectorization
When working with matrices in optimization and statistics, we often need to treat a matrix as a single vector,
for instance to take derivatives with respect to all entries simultaneously or to express matrix equations as
standard linear systems. The vectorization operation provides exactly this conversion.
Definition: Vectorization
The vectorization of a matrix \(A \in \mathbb{R}^{m \times n}\), denoted \(\text{vec}(A)\),
stacks the columns of \(A\) into a single column vector in \(\mathbb{R}^{mn}\):
\[
\text{vec}(A) = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}
\]
where \(a_j \in \mathbb{R}^m\) is the \(j\)th column of \(A\).
For example, let \(x \in \mathbb{R}^n\) and consider the outer product \(xx^\top \in \mathbb{R}^{n \times n}\).
\[
\begin{align*}
\text{vec}(xx^\top) &= \begin{bmatrix} x_1x_1 \\ x_2x_1 \\ \vdots \\ x_nx_1 \\ x_1x_2 \\ x_2x_2 \\ \vdots \\ x_nx_2 \\ \vdots \\
x_1x_n \\ x_2x_n \\ \vdots \\ x_nx_n
\end{bmatrix} \\
&= x \otimes x
\end{align*}
\]
The notation \(\otimes\) denotes the Kronecker product, which we define formally in the next section.
The identity \(\text{vec}(xx^\top) = x \otimes x\) is a special case of a more general relationship between
vectorization and the Kronecker product.
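This identity is easy to verify numerically; the following minimal NumPy check uses an arbitrary test vector:
import numpy as np

# Check vec(x x^T) = x kron x for an arbitrary test vector.
x = np.array([1.0, 2.0, 3.0])
outer = np.outer(x, x)                        # x x^T
vec_outer = outer.reshape(-1, order='F')      # stack the columns (column-major order)
print(np.allclose(vec_outer, np.kron(x, x)))  # True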
Insight: Vectorization in Machine Learning
Vectorization is more than a notational convenience; it is the algebraic bridge between coordinate-free matrix
operations and hardware-level linear algebra. Deep learning frameworks like PyTorch and JAX leverage this by
flattening multi-dimensional parameter tensors into vectors, for example to perform efficient vector-Jacobian
products (VJPs) during backpropagation.
The identity
\[
\text{vec}(ABC) = (C^\top \otimes A)\,\text{vec}(B)
\]
is a cornerstone for deriving closed-form solutions to matrix-valued optimization problems. A prime example is the
continuous-time Lyapunov equation
\[
AX + XA^\top = Q,
\]
used to certify the stability of dynamical systems (including RNNs and neural controllers). By applying vectorization,
this matrix equation is transformed into the standard linear system
\[
(I \otimes A + A \otimes I)\text{vec}(X) = \text{vec}(Q).
\]
This transformation allows us to use high-performance linear solvers to find the Lyapunov candidate \(X\); when such an
\(X\) is positive definite, it certifies that the system's "energy" dissipates over time.
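As a concrete illustration, here is a minimal NumPy sketch that solves \(AX + XA^\top = Q\) by forming the Kronecker system
above; the matrices \(A\) and \(Q\) are random placeholders. For large \(n\), dedicated solvers (e.g. the Bartels-Stewart
algorithm) are preferred, since the \(n^2 \times n^2\) Kronecker matrix grows quickly.
import numpy as np

# Minimal sketch: solve A X + X A^T = Q via vectorization (A and Q are random placeholders).
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
Q = rng.standard_normal((n, n))
Q = Q + Q.T                              # symmetric Q, as in typical Lyapunov problems

I = np.eye(n)
K = np.kron(I, A) + np.kron(A, I)        # vec(AX) = (I kron A) vec(X), vec(X A^T) = (A kron I) vec(X)

vec_Q = Q.reshape(-1, order='F')         # column-major flattening = vec(Q)
vec_X = np.linalg.solve(K, vec_Q)
X = vec_X.reshape((n, n), order='F')

print(np.allclose(A @ X + X @ A.T, Q))   # True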
Conversely, we can reshape a vector back into a matrix. However, the result depends on the
memory layout convention used — an important practical distinction when moving between
mathematical formulas and code.
Example: Row-Major vs Column-Major Order
Consider the vector
\[
a = \begin{bmatrix} 1 & 2 & 3 & 4 & 5 & 6 \end{bmatrix}^\top.
\]
Reshaping \(a\) into a \(2 \times 3\) matrix yields different results depending on the convention:
Row-major order (used by C, C++, and Python/NumPy by default): elements fill each row before moving to the next.
\[
A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}
\]
Column-major order (used by Julia, MATLAB, R, and Fortran): elements fill each column before moving to the next.
\[
A = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix}
\]
Note that the standard mathematical definition of \(\text{vec}(\cdot)\) follows column-major order.
That is, \(\text{vec}(A)\) stacks columns, so the inverse operation (reshaping back) also fills by columns.
import numpy as np

vector = np.array([1, 2, 3, 4, 5, 6])
# NumPy lets you choose either convention via the order argument (the default is row-major):
matrix_r = vector.reshape((2, 3), order='C')  # 'C' = row-major, as in the C language
matrix_c = vector.reshape((2, 3), order='F')  # 'F' = column-major, as in Fortran
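Continuing the snippet above, the mathematical \(\text{vec}(\cdot)\) of a NumPy array corresponds to flattening in
column-major order:
A = np.array([[1, 3, 5],
              [2, 4, 6]])
vec_A = A.reshape(-1, order='F')   # array([1, 2, 3, 4, 5, 6]), i.e. the columns stacked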
Kronecker Product
The vectorization operation converts matrices into vectors, but we also need an algebraic operation that builds
larger structured matrices from smaller ones. The Kronecker product combines two matrices into a
block matrix whose structure encodes pairwise interactions between their entries.
This operation arises naturally whenever we need to represent the joint action of two independent linear maps.
Definition: Kronecker product
Let \(A \in \mathbb{R}^{m \times n}\) and \(B \in \mathbb{R}^{p \times q}\). The Kronecker product
\(A \otimes B\) is the \((mp) \times (nq)\) matrix:
\[
A\otimes B = \begin{bmatrix}
a_{11}B & a_{12}B & \cdots & a_{1n} B \\
a_{21}B & a_{22}B & \cdots & a_{2n} B \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1}B & a_{m2}B & \cdots & a_{mn} B \\
\end{bmatrix}.
\]
Each element \(a_{ij}\) of \(A\) is multiplied by the entire matrix \(B\), producing an \(m \times n\) grid of blocks, each of size \(p \times q\).
Example:
\[
\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \end{bmatrix}
\otimes
\begin{bmatrix} 5 & 6 \\ 7 & 8 \\ \end{bmatrix}
=
\begin{bmatrix}
5 & 6 & 10 & 12 \\
7 & 8 & 14 & 16 \\
15 & 18 & 20 & 24 \\
21 & 24 & 28 & 32 \\
\end{bmatrix}.
\]
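The same example can be reproduced with NumPy's np.kron:
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])
print(np.kron(A, B))
# [[ 5  6 10 12]
#  [ 7  8 14 16]
#  [15 18 20 24]
#  [21 24 28 32]]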
The Kronecker product satisfies several elegant algebraic properties that make it a powerful
tool for theoretical analysis. Note that the Kronecker product is not commutative in general
(\(A \otimes B \neq B \otimes A\)), but it interacts well with standard matrix operations.
Useful Properties of the Kronecker Product
- Mixed-product property
\[
(A \otimes B)(C \otimes D) = (AC) \otimes (BD)
\]
- Transpose
\[
(A \otimes B)^\top = A^\top \otimes B^\top
\]
- Inverse (when \(A\) and \(B\) are invertible)
\[
(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}
\]
- Trace
\[
\text{Tr}(A \otimes B) = \text{Tr}(A)\,\text{Tr}(B)
\]
- Determinant (for square matrices \(A \in \mathbb{R}^{m \times m}\) and \(B \in \mathbb{R}^{n \times n}\))
\[
\det (A \otimes B) = \det(A)^n \det(B)^m
\]
- Eigenvalues
If \(A \in \mathbb{R}^{m \times m}\) has eigenvalues \(\lambda_1, \ldots, \lambda_m\) and
\(B \in \mathbb{R}^{n \times n}\) has eigenvalues \(\mu_1, \ldots, \mu_n\), then the
eigenvalues of \(A \otimes B\) are all products
\[
\lambda_i \mu_j \quad (i = 1, \ldots, m, \; j = 1, \ldots, n).
\]
In particular, if \(Av = \lambda v\) and \(Bw = \mu w\), then
\((A \otimes B)(v \otimes w) = \lambda\mu (v \otimes w)\).
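These properties are easy to verify numerically. A minimal sketch, using small random matrices as placeholders
(with \(A\) and \(B\) made symmetric so that real eigenvalues keep the comparison simple):
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
A = A + A.T                     # symmetric, so its eigenvalues are real
B = rng.standard_normal((2, 2))
B = B + B.T
C = rng.standard_normal((3, 3))
D = rng.standard_normal((2, 2))

# Mixed-product property: (A kron B)(C kron D) = (AC) kron (BD)
print(np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)))  # True

# Transpose: (A kron B)^T = A^T kron B^T
print(np.allclose(np.kron(A, B).T, np.kron(A.T, B.T)))                    # True

# Eigenvalues of A kron B are the pairwise products lambda_i * mu_j
eig_kron = np.sort(np.linalg.eigvalsh(np.kron(A, B)))
eig_pairs = np.sort(np.outer(np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)).ravel())
print(np.allclose(eig_kron, eig_pairs))                                   # True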
Insight: Kronecker Products in Machine Learning
The mixed-product property \((A \otimes B)(C \otimes D) = (AC) \otimes (BD)\) is the algebraic foundation of
Kronecker-factored approximations. The K-FAC optimizer approximates the Fisher Information Matrix
of a layer as \(F \approx A \otimes G\), where \(A\) is the covariance of activations and \(G\) is the covariance of
backpropagated gradient statistics.
By exploiting this structure, K-FAC reduces the inversion cost from \(O(n^3)\) on the full parameter space to
approximately \(O(n^{3/2})\). The eigenvalue property further enables efficient
spectral preconditioning, allowing large-scale neural networks to converge faster by utilizing
the curvature of the loss landscape.
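In practice this means \((A \otimes G)^{-1}\) never has to be formed explicitly: by the inverse property and the
\(\text{vec}\) identity, \((A \otimes G)^{-1}\text{vec}(V) = \text{vec}(G^{-1} V A^{-1})\) for symmetric factors. The
sketch below illustrates only this piece of algebra, with random symmetric positive definite placeholders rather than
real K-FAC statistics:
import numpy as np

rng = np.random.default_rng(2)

def random_spd(k):
    # Random symmetric positive definite matrix (placeholder for a covariance factor).
    M = rng.standard_normal((k, k))
    return M @ M.T + k * np.eye(k)

d_in, d_out = 4, 3
A = random_spd(d_in)                       # activation-covariance factor (illustrative)
G = random_spd(d_out)                      # gradient-covariance factor (illustrative)
V = rng.standard_normal((d_out, d_in))     # a matrix playing the role of a layer gradient

# Naive approach: build the (d_in*d_out) x (d_in*d_out) Kronecker matrix and solve.
naive = np.linalg.solve(np.kron(A, G), V.reshape(-1, order='F'))

# Kronecker-factored approach: only work with the two small factors.
factored = (np.linalg.solve(G, V) @ np.linalg.inv(A)).reshape(-1, order='F')

print(np.allclose(naive, factored))        # True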
With the Kronecker product established, we now turn to tensors, the natural
generalization of vectors and matrices to higher dimensions. While the Kronecker product builds
larger matrices from smaller ones, tensors extend this idea beyond two dimensions entirely.
Tensors
So far, we have worked with scalars, vectors, and matrices — objects of order 0, 1, and 2 respectively.
Many applications require data structures that extend beyond two dimensions. A tensor
provides exactly this generalization: it is a multidimensional array that can be understood both as a
computational data structure and as an element of a tensor product of vector spaces.
Example: Familiar Tensors
The objects we have studied throughout linear algebra are all tensors of low order:
- A scalar is an order-0 tensor (a single number, \(a \in \mathbb{R}\)).
- A vector is an order-1 tensor (a one-dimensional array, \(v \in \mathbb{R}^n\)).
- A matrix is an order-2 tensor (a two-dimensional array, \(A \in \mathbb{R}^{m \times n}\)).
The order of a tensor (informally its "dimension", i.e. its number of modes) is
the number of indices required to specify an entry.
Higher-order tensors arise naturally in applications. For example, a color image is an order-3 tensor
\(I \in \mathbb{R}^{H \times W \times C}\), where \(H\) is height, \(W\) is width, and \(C\) is the
number of channels (e.g., \(C = 3\) for RGB). A batch of \(N\) such images forms an order-4 tensor in
\(\mathbb{R}^{N \times H \times W \times C}\).
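In NumPy these shapes look as follows (the sizes are arbitrary illustrations):
import numpy as np

image = np.zeros((32, 48, 3))        # order-3 tensor: one 32 x 48 RGB image
batch = np.zeros((16, 32, 48, 3))    # order-4 tensor: a batch of 16 such images
print(image.ndim, batch.ndim)        # 3 4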
Beyond this computational viewpoint, tensors also have a precise algebraic definition rooted in the
tensor product of vector spaces.
Definition: Tensor Product Space
Let \(V\) and \(W\) be vector spaces over a field \(\mathbb{F}\). The tensor product
\(V \otimes W\) is a vector space whose elements are (finite) linear combinations of elementary tensors:
\[
\sum_{i} \alpha_i \, (v_i \otimes w_i), \quad v_i \in V, \; w_i \in W, \; \alpha_i \in \mathbb{F}.
\]
More generally, an order-\(r\) tensor is an element
\[
T \in V_1 \otimes V_2 \otimes \cdots \otimes V_r.
\]
When each \(V_k = \mathbb{R}^{n_k}\), such a tensor can be represented as a multidimensional array
with entries \(T_{i_1 i_2 \cdots i_r}\) where \(1 \leq i_k \leq n_k\).
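For instance, an elementary tensor \(v \otimes w \otimes u\) is, in coordinates, just the order-3 array of all
entrywise products; here is a small NumPy sketch with arbitrary placeholder vectors:
import numpy as np

v = np.array([1.0, 2.0])             # element of R^2
w = np.array([3.0, 4.0, 5.0])        # element of R^3
u = np.array([6.0, 7.0])             # element of R^2

# Elementary tensor v ⊗ w ⊗ u as an order-3 array: T[i, j, k] = v[i] * w[j] * u[k]
T = np.einsum('i,j,k->ijk', v, w, u)
print(T.shape)                       # (2, 3, 2)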
Under this formal definition, the familiar objects of linear algebra are recovered as special cases:
a scalar is an order-0 tensor (an element of \(\mathbb{F}\)), a vector is an order-1 tensor (an element of \(V\)),
and a matrix is an order-2 tensor (an element of \(V \otimes W\)).
The Kronecker product is the coordinate representation of the tensor product: when
\(V = \mathbb{R}^m\) and \(W = \mathbb{R}^n\), the tensor product of matrices (viewed as linear maps)
corresponds precisely to the Kronecker product of their matrix representations. The tensor product itself
is the abstract, coordinate-free operation, while the Kronecker product is what we compute in practice.
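Concretely, forming the order-4 array of products \(a_{ij} b_{kl}\) and grouping the row indices \((i, k)\) and the
column indices \((j, l)\) recovers exactly the Kronecker product; a quick check reusing the matrices from the earlier
example:
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Order-4 tensor of products: T[i, k, j, l] = A[i, j] * B[k, l]
T = np.einsum('ij,kl->ikjl', A, B)
# Grouping rows by (i, k) and columns by (j, l) recovers the Kronecker product.
print(np.array_equal(T.reshape(4, 4), np.kron(A, B)))   # True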
Insight: Tensors as the Language of Deep Learning
While deep learning frameworks treat tensors primarily as multidimensional arrays, their algebraic power stems
from their identity as multilinear maps. In Transformer architectures, the Attention mechanism
is fundamentally a series of tensor contractions—specifically, bilinear operations where queries and keys interact to produce
attention weights.
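For example, the query-key interaction in scaled dot-product attention is a single batched contraction over the
feature dimension. A minimal NumPy sketch with arbitrary sizes (the softmax and value projection are omitted):
import numpy as np

rng = np.random.default_rng(3)
batch, seq, d_k = 2, 5, 8                        # illustrative sizes
Q = rng.standard_normal((batch, seq, d_k))       # queries
K = rng.standard_normal((batch, seq, d_k))       # keys

# Contract over the feature dimension d_k to obtain pairwise attention scores.
scores = np.einsum('bqd,bkd->bqk', Q, K) / np.sqrt(d_k)
print(scores.shape)                              # (2, 5, 5)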
In the emerging field of Geometric Deep Learning, the mathematical definition of a tensor
— how it transforms under coordinate changes — becomes critical. This enables the design of
Equivariant Neural Networks, where internal representations (scalars, vectors, and higher-order tensors)
transform consistently according to group representations. This ensures the model respects the underlying physical symmetries
of the data, such as rotation and reflection invariance in 3D molecular modeling or medical imaging.