Kronecker Product & Tensor


Vectorization

When working with matrices in optimization and statistics, we often need to treat a matrix as a single vector, for instance to take derivatives with respect to all entries simultaneously or to express matrix equations as standard linear systems. The vectorization operation provides exactly this conversion.

Definition: Vectorization

The vectorization of a matrix \(A \in \mathbb{R}^{m \times n}\), denoted \(\text{vec}(A)\), stacks the columns of \(A\) into a single column vector in \(\mathbb{R}^{mn}\): \[ \text{vec}(A) = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} \] where \(a_j \in \mathbb{R}^m\) is the \(j\)th column of \(A\).

For example, let \(x \in \mathbb{R}^n\), and consider the outer product \(xx^\top \in \mathbb{R}^{n \times n}\). \[ \begin{align*} \text{vec}(xx^\top) &= \begin{bmatrix}x_1x_1 \\ x_2x_1 \\ \vdots \\ x_nx_1 \\ x_1x_2 \\ x_2x_2 \\ \vdots \\ x_nx_2 \\ \vdots \\ x_1x_n \\ x_2x_n \\ \vdots \\ x_nx_n \end{bmatrix} \\ \\ &= x \otimes x \end{align*} \]

The notation \(\otimes\) denotes the Kronecker product, which we define formally in the next section. The identity \(\text{vec}(xx^\top) = x \otimes x\) is a special case of a more general relationship between vectorization and the Kronecker product.
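
This identity is easy to check numerically. The following is a minimal sketch using NumPy (the vector \(x\) below is an arbitrary example); note that \(\text{vec}(\cdot)\) corresponds to column-major flattening:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])

    # vec(xx^T): stack the columns of the outer product (column-major flattening)
    vec_outer = np.outer(x, x).flatten(order='F')

    # Kronecker product of x with itself
    kron_xx = np.kron(x, x)

    print(np.allclose(vec_outer, kron_xx))  # True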

Insight: Vectorization in Machine Learning

Vectorization is more than a notational convenience; it is the algebraic bridge between coordinate-free matrix operations and hardware-level linear algebra. Deep learning frameworks like PyTorch and JAX leverage this by flattening multi-dimensional parameter tensors into vectors to perform efficient vector-Jacobian products (VJPs) during backpropagation.

The identity \[ \text{vec}(ABC) = (C^\top \otimes A)\,\text{vec}(B) \] is a cornerstone for deriving closed-form solutions to matrix-valued optimization problems. A prime example is the continuous-time Lyapunov equation \[ AX + XA^\top = Q \], used to certify the stability of dynamical systems (including RNNs and neural controllers). By applying vectorization, this matrix equation is transformed into the standard linear system \[ (I \otimes A + A \otimes I)\text{vec}(X) = \text{vec}(Q). \] This transformation allows us to use high-performance linear solvers to find the Lyapunov candidate \(X\), thereby mathematically guaranteeing that a system's "energy" dissipates over time.
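
As a concrete illustration, here is a minimal NumPy sketch of this reduction; the matrices \(A\) and \(Q\) below are random stand-ins chosen only to exercise the identity:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    A = rng.standard_normal((n, n))
    Q = rng.standard_normal((n, n))
    Q = Q + Q.T  # symmetric right-hand side
    I = np.eye(n)

    # vec(AX + XA^T) = (I ⊗ A + A ⊗ I) vec(X), with vec taken column-major
    K = np.kron(I, A) + np.kron(A, I)
    vec_X = np.linalg.solve(K, Q.flatten(order='F'))
    X = vec_X.reshape((n, n), order='F')

    print(np.allclose(A @ X + X @ A.T, Q))  # True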

Conversely, we can reshape a vector back into a matrix. However, the result depends on the memory layout convention used — an important practical distinction when moving between mathematical formulas and code.

Example: Row-Major vs Column-Major Order

Consider the vector \[ a = \begin{bmatrix} 1 & 2 & 3 & 4 & 5 & 6 \end{bmatrix}^\top. \] Reshaping \(a\) into a \(2 \times 3\) matrix yields different results depending on the convention:

Row-major order (used by C, C++, and Python/NumPy by default): elements fill each row before moving to the next. \[ A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \]

Column-major order (used by Julia, MATLAB, R, and Fortran): elements fill each column before moving to the next. \[ A = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix} \]

Note that the standard mathematical definition of \(\text{vec}(\cdot)\) follows column-major order. That is, \(\text{vec}(A)\) stacks columns, so the inverse operation (reshaping back) also fills by columns.

    import numpy as np

    vector = np.array([1, 2, 3, 4, 5, 6])

    # In NumPy you can choose either convention (the default is row-major):
    matrix_r = vector.reshape((2, 3), order='C')  # 'C' = C-style, row-major
    matrix_c = vector.reshape((2, 3), order='F')  # 'F' = Fortran-style, column-major

    print(matrix_r)  # [[1 2 3], [4 5 6]]
    print(matrix_c)  # [[1 3 5], [2 4 6]]

Kronecker Product

The vectorization operation converts matrices into vectors, but we also need an algebraic operation that builds larger structured matrices from smaller ones. The Kronecker product combines two matrices into a block matrix whose structure encodes pairwise interactions between their entries. This operation arises naturally whenever we need to represent the joint action of two independent linear maps.

Definition: Kronecker product

Let \(A \in \mathbb{R}^{m \times n}\) and \(B \in \mathbb{R}^{p \times q}\). The Kronecker product \(A\otimes B\) is the \((mp) \times (nq)\) matrix: \[ A\otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1n} B \\ a_{21}B & a_{22}B & \cdots & a_{2n} B \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mn} B \\ \end{bmatrix}. \] Each element \(a_{ij}\) of \(A\) is multiplied by the entire matrix \(B\), resulting in blocks of size \(p \times q\).

Example:

\[ \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \end{bmatrix} \otimes \begin{bmatrix} 5 & 6 \\ 7 & 8 \\ \end{bmatrix} = \begin{bmatrix} 5 & 6 & 10 & 12 \\ 7 & 8 & 14 & 16 \\ 15 & 18 & 20 & 24 \\ 21 & 24 & 28 & 32 \\ \end{bmatrix}. \]
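
NumPy's np.kron computes exactly this block structure:

    import numpy as np

    A = np.array([[1, 2], [3, 4]])
    B = np.array([[5, 6], [7, 8]])

    print(np.kron(A, B))
    # [[ 5  6 10 12]
    #  [ 7  8 14 16]
    #  [15 18 20 24]
    #  [21 24 28 32]]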

The Kronecker product satisfies several elegant algebraic properties that make it a powerful tool for theoretical analysis. Note that the Kronecker product is not commutative in general (\(A \otimes B \neq B \otimes A\)), but it interacts well with standard matrix operations.

Useful Properties of the Kronecker Product
  1. Mixed-product property
    \[ (A \otimes B)(C \otimes D) = (AC) \otimes (BD) \]
  2. Transpose
    \[ (A \otimes B)^\top = A^\top \otimes B^\top \]
  3. Inverse (when \(A\) and \(B\) are invertible)
    \[ (A \otimes B)^{-1} = A^{-1} \otimes B^{-1} \]
  4. Trace
    \[ \text{Tr}(A \otimes B) = \text{Tr}(A)\,\text{Tr}(B) \]
  5. Determinant (for square matrices \(A \in \mathbb{R}^{m \times m}\) and \(B \in \mathbb{R}^{n \times n}\))
    \[ \det (A \otimes B) = \det(A)^n \det(B)^m \]
  6. Eigenvalues
    If \(A \in \mathbb{R}^{m \times m}\) has eigenvalues \(\lambda_1, \ldots, \lambda_m\) and \(B \in \mathbb{R}^{n \times n}\) has eigenvalues \(\mu_1, \ldots, \mu_n\), then the eigenvalues of \(A \otimes B\) are all products \[ \lambda_i \mu_j \quad (i = 1, \ldots, m, \; j = 1, \ldots, n). \] In particular, if \(Av = \lambda v\) and \(Bw = \mu w\), then \((A \otimes B)(v \otimes w) = \lambda\mu (v \otimes w)\).
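
These properties are straightforward to confirm numerically. The sketch below uses arbitrary random matrices (symmetrized for the eigenvalue check so that the spectra are real and easy to sort):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((3, 3))   # m = 3
    B = rng.standard_normal((2, 2))   # n = 2
    C = rng.standard_normal((3, 3))
    D = rng.standard_normal((2, 2))

    # 1. Mixed-product property
    print(np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)))

    # 2. Transpose and 3. Inverse
    print(np.allclose(np.kron(A, B).T, np.kron(A.T, B.T)))
    print(np.allclose(np.linalg.inv(np.kron(A, B)),
                      np.kron(np.linalg.inv(A), np.linalg.inv(B))))

    # 4. Trace and 5. Determinant: det(A ⊗ B) = det(A)^n det(B)^m with m = 3, n = 2
    print(np.isclose(np.trace(np.kron(A, B)), np.trace(A) * np.trace(B)))
    print(np.isclose(np.linalg.det(np.kron(A, B)),
                     np.linalg.det(A)**2 * np.linalg.det(B)**3))

    # 6. Eigenvalues: use symmetric matrices so the spectra are real
    As, Bs = A + A.T, B + B.T
    eig_kron = np.sort(np.linalg.eigvalsh(np.kron(As, Bs)))
    eig_prod = np.sort(np.outer(np.linalg.eigvalsh(As), np.linalg.eigvalsh(Bs)).ravel())
    print(np.allclose(eig_kron, eig_prod))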

Insight: Kronecker Products in Machine Learning

The mixed-product property \((A \otimes B)(C \otimes D) = (AC) \otimes (BD)\) is the algebraic foundation of Kronecker-factored approximations. The K-FAC optimizer approximates the Fisher Information Matrix of a layer as \(F \approx A \otimes G\), where \(A\) is the covariance of activations and \(G\) is the covariance of backpropagated gradient statistics.

By exploiting this structure, K-FAC inverts the two much smaller factors \(A\) and \(G\) separately rather than the full Fisher block, reducing the inversion cost from \(O(n^3)\) on the full parameter space to approximately \(O(n^{3/2})\). The eigenvalue property further enables efficient spectral preconditioning, allowing large-scale neural networks to converge faster by utilizing the curvature of the loss landscape.
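
To see why the Kronecker structure is so cheap to exploit, note that the inverse and vec identities let us apply \((A \otimes G)^{-1}\) to a gradient without ever forming the full matrix. The following is a toy sketch of that trick, not the actual K-FAC algorithm; the layer sizes and the random stand-ins for \(A\), \(G\), and the gradient \(V\) are assumptions made only for illustration:

    import numpy as np

    rng = np.random.default_rng(2)
    d_in, d_out = 5, 3                          # illustrative layer dimensions

    def random_spd(d):
        # symmetric positive-definite stand-in for a covariance factor
        M = rng.standard_normal((d, d))
        return M @ M.T + d * np.eye(d)

    A = random_spd(d_in)                        # activation covariance factor
    G = random_spd(d_out)                       # gradient covariance factor
    V = rng.standard_normal((d_out, d_in))      # gradient of the layer's weight matrix

    # Direct (expensive) route: form A ⊗ G and solve against vec(V)
    direct = np.linalg.solve(np.kron(A, G), V.flatten(order='F'))

    # Kronecker route: (A ⊗ G)^{-1} vec(V) = vec(G^{-1} V A^{-1}) since A, G are symmetric
    GinvV = np.linalg.solve(G, V)
    cheap = np.linalg.solve(A, GinvV.T).T.flatten(order='F')

    print(np.allclose(direct, cheap))  # True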

With the Kronecker product established, we now turn to tensors, the natural generalization of vectors and matrices to higher dimensions. While the Kronecker product builds larger matrices from smaller ones, tensors extend this idea beyond two dimensions entirely.

Tensor

So far, we have worked with scalars, vectors, and matrices — objects of dimension 0, 1, and 2 respectively. Many applications require data structures that extend beyond two dimensions. A tensor provides exactly this generalization: it is a multidimensional array that can be understood both as a computational data structure and as an element of a tensor product of vector spaces.

Example: Familiar Tensors

The objects we have studied throughout linear algebra are all tensors of low order:

  • A scalar is an order-0 tensor (a single number, \(a \in \mathbb{R}\)).
  • A vector is an order-1 tensor (a one-dimensional array, \(v \in \mathbb{R}^n\)).
  • A matrix is an order-2 tensor (a two-dimensional array, \(A \in \mathbb{R}^{m \times n}\)).

The order of a tensor (the number of its modes, or informally its "dimension") is the number of indices required to specify an entry.

Higher-order tensors arise naturally in applications. For example, a color image is an order-3 tensor \(I \in \mathbb{R}^{H \times W \times C}\), where \(H\) is height, \(W\) is width, and \(C\) is the number of channels (e.g., \(C = 3\) for RGB). A batch of \(N\) such images forms an order-4 tensor in \(\mathbb{R}^{N \times H \times W \times C}\).
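
In code, the order of a tensor is simply the number of axes of the array. A short NumPy illustration with made-up sizes:

    import numpy as np

    # A batch of N = 8 RGB images of height 32 and width 32 (sizes chosen arbitrarily)
    images = np.zeros((8, 32, 32, 3))

    print(images.ndim)               # 4  -> an order-4 tensor
    print(images[0].shape)           # (32, 32, 3)  -> one image is an order-3 tensor
    print(images[0, :, :, 0].shape)  # (32, 32)  -> one channel is an order-2 tensor (a matrix)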

Beyond this computational viewpoint, tensors also have a precise algebraic definition rooted in the tensor product of vector spaces.

Definition: Tensor Product Space

Let \(V\) and \(W\) be vector spaces over a field \(\mathbb{F}\). The tensor product \(V \otimes W\) is a vector space whose elements are (finite) linear combinations of elementary tensors: \[ \sum_{i} \alpha_i \, (v_i \otimes w_i), \quad v_i \in V, \; w_i \in W, \; \alpha_i \in \mathbb{F}. \] More generally, an order-\(r\) tensor is an element of an \(r\)-fold tensor product: \[ T \in V_1 \otimes V_2 \otimes \cdots \otimes V_r. \] When each \(V_k = \mathbb{R}^{n_k}\), such a tensor can be represented as a multidimensional array with entries \(T_{i_1 i_2 \cdots i_r}\) where \(1 \leq i_k \leq n_k\).

Under this formal definition, the familiar objects of linear algebra are recovered as special cases: a scalar is an order-0 tensor (an element of \(\mathbb{F}\)), a vector is an order-1 tensor (an element of \(V\)), and a matrix is an order-2 tensor (an element of \(V \otimes W\)).

The Kronecker product is the coordinate representation of the tensor product: when \(V = \mathbb{R}^m\) and \(W = \mathbb{R}^n\), the tensor product of matrices (viewed as linear maps) corresponds precisely to the Kronecker product of their matrix representations. The tensor product itself is the abstract, coordinate-free operation, while the Kronecker product is what we compute in practice.
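
Concretely, the matrix \(A \otimes B\) acts on an elementary tensor \(v \otimes w\) exactly as the two maps act on the individual factors, \((A \otimes B)(v \otimes w) = (Av) \otimes (Bw)\). A minimal numerical check with arbitrary matrices and vectors:

    import numpy as np

    rng = np.random.default_rng(3)
    A, B = rng.standard_normal((3, 3)), rng.standard_normal((2, 2))
    v, w = rng.standard_normal(3), rng.standard_normal(2)

    # The Kronecker matrix represents the joint action of the two linear maps
    print(np.allclose(np.kron(A, B) @ np.kron(v, w), np.kron(A @ v, B @ w)))  # True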

Insight: Tensors as the Language of Deep Learning

While deep learning frameworks treat tensors primarily as multidimensional arrays, their algebraic power stems from their identity as multilinear maps. In Transformer architectures, the Attention mechanism is fundamentally a series of tensor contractions—specifically, bilinear operations where queries and keys interact to produce attention weights.
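
For instance, the query-key interaction that produces attention scores is a single batched contraction over the feature axis. The sketch below uses plain NumPy einsum with made-up shapes, rather than any particular framework's attention implementation:

    import numpy as np

    rng = np.random.default_rng(4)
    batch, seq, d = 2, 5, 8                      # assumed sizes for illustration
    Q = rng.standard_normal((batch, seq, d))     # queries
    K = rng.standard_normal((batch, seq, d))     # keys

    # Contract the shared feature axis d: two order-3 tensors -> one order-3 tensor of scores
    scores = np.einsum('bqd,bkd->bqk', Q, K) / np.sqrt(d)
    print(scores.shape)  # (2, 5, 5)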

In the emerging field of Geometric Deep Learning, the mathematical definition of a tensor — how it transforms under coordinate changes — becomes critical. This enables the design of Equivariant Neural Networks, where internal representations (scalars, vectors, and higher-order tensors) transform consistently according to group representations. This ensures the model respects the underlying physical symmetries of the data, such as rotation and reflection invariance in 3D molecular modeling or medical imaging.