Vectorization
The previous page surfaced an unexpected fact: the space \(\mathbb{R}^{m \times n}\) of matrices carries an
inner product of its own, the Frobenius inner product.
An "inner product" is therefore not tied to \(\mathbb{R}^n\); it is a structure that can be transplanted onto entirely different vector spaces.
That observation immediately invites a deeper question: what is \(\mathbb{R}^{m \times n}\), really?
Is it merely "an array of numbers," or does it have a hidden architecture worth uncovering?
This page answers in three movements. First, we will see that \(\mathbb{R}^{m \times n}\) can be identified with the long vector space \(\mathbb{R}^{mn}\) —
its dimension is \(m \cdot n\), suspiciously a product. Second, we will define an algebraic operation, the Kronecker product,
that combines two matrices in a way whose dimensions also multiply. By the end, we will see that neither fact is coincidence:
both reflect a single underlying structure, the tensor product of vector spaces \(V \otimes W\), of which \(\mathbb{R}^{m \times n}\) is the coordinate representation when \(V = \mathbb{R}^m\) and \(W = \mathbb{R}^n\).
What we have called "matrices" are the coordinates of a deeper construction.
The first step — vectorization — is the simplest version of this identification: a literal flattening that turns a matrix
into a single column vector. We use it constantly when taking derivatives with respect to all entries at once, or
when re-expressing matrix equations as standard linear systems.
Definition: Vectorization
The vectorization of a matrix \(A \in \mathbb{R}^{m \times n}\), denoted \(\operatorname{vec}(A)\),
stacks the columns of \(A\) into a single column vector in \(\mathbb{R}^{mn}\):
\[
\operatorname{vec}(A) = \begin{bmatrix} \mathbf{a}_1 \\ \mathbf{a}_2 \\ \vdots \\ \mathbf{a}_n \end{bmatrix}
\]
where \(\mathbf{a}_j \in \mathbb{R}^m\) is the \(j\)th column of \(A\).
For example, let \(\mathbf{x} \in \mathbb{R}^n\) and consider the outer product \(\mathbf{x}\mathbf{x}^\top \in \mathbb{R}^{n \times n}\). Its \(j\)th column is \(x_j \mathbf{x}\), so stacking the columns gives
\[
\begin{align*}
\operatorname{vec}(\mathbf{x}\mathbf{x}^\top) &= \begin{bmatrix}x_1x_1 \\ x_2x_1 \\ \vdots \\ x_nx_1 \\ x_1x_2 \\ x_2x_2 \\ \vdots \\ x_nx_2 \\ \vdots \\
x_1x_n \\ x_2x_n \\ \vdots \\ x_nx_n
\end{bmatrix} \\ \\
&= \mathbf{x} \otimes \mathbf{x}
\end{align*}
\]
The notation \(\otimes\) denotes the Kronecker product, which we define formally in the next section.
The identity \(\operatorname{vec}(\mathbf{x}\mathbf{x}^\top) = \mathbf{x} \otimes \mathbf{x}\) is a special case of a more general relationship between
vectorization and the Kronecker product.
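The column-stacking definition and the outer-product identity above can be checked numerically. The following is a minimal NumPy sketch; the helper name `vec` and the sample values are illustrative assumptions, and 1-D arrays stand in for column vectors.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one 1-D array (column-major flattening)."""
    return np.asarray(M).reshape(-1, order="F")

A = np.array([[1, 2, 3],
              [4, 5, 6]])
vec_A = vec(A)  # columns (1, 4), (2, 5), (3, 6) stacked in that order

x = np.array([1.0, 2.0, 3.0])
lhs = vec(np.outer(x, x))   # vec(x x^T)
rhs = np.kron(x, x)         # x ⊗ x
identity_holds = np.allclose(lhs, rhs)
print(vec_A, identity_holds)
```

Because \(\mathbf{x}\mathbf{x}^\top\) is symmetric, this particular check would also pass with row-major flattening; the distinction only matters for non-symmetric matrices such as `A` here.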
Insight: Vectorization in Machine Learning
Vectorization is more than a notational convenience; it is the algebraic bridge between coordinate-free matrix
operations and hardware-level linear algebra. Deep learning frameworks like PyTorch and JAX rely on the same idea:
a multi-dimensional parameter tensor can be treated as a flat vector, so that the vector-Jacobian products (VJPs)
computed during backpropagation are ordinary matrix-vector operations.
The identity
\[
\operatorname{vec}(ABC) = (C^\top \otimes A)\,\operatorname{vec}(B)
\]
is a cornerstone for deriving closed-form solutions to matrix-valued optimization problems. (We prove this identity at the end of the next section, once the Kronecker product is in hand; see The vec–Kronecker Identity.) A prime example is the
continuous-time Lyapunov equation
\[
AX + XA^\top = Q,
\]
used to certify the stability of dynamical systems (including RNNs and neural controllers). By applying vectorization,
this matrix equation is transformed into the standard linear system
\[
(I \otimes A + A \otimes I)\operatorname{vec}(X) = \operatorname{vec}(Q).
\]
This transformation allows us to use standard linear solvers to find the Lyapunov candidate \(X\). When \(Q\) is negative definite
and the resulting \(X\) is positive definite, \(X\) certifies stability: the system's "energy" dissipates over time.
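As a concrete illustration of this reformulation, the sketch below solves a small Lyapunov equation through the Kronecker-structured linear system. The matrices `A` and `Q` are made-up illustrative data (a stable `A` and a negative definite `Q`), not values from the text.

```python
import numpy as np

# Illustrative data (assumed): A has eigenvalues -1 and -3, Q is negative definite.
A = np.array([[-1.0, 2.0],
              [0.0, -3.0]])
Q = np.array([[-2.0, 0.0],
              [0.0, -2.0]])
n = A.shape[0]
I = np.eye(n)

# Coefficient matrix of (I ⊗ A + A ⊗ I) vec(X) = vec(Q).
K = np.kron(I, A) + np.kron(A, I)

# vec(.) is column-major, so flatten and un-flatten with order="F".
vec_Q = Q.reshape(-1, order="F")
vec_X = np.linalg.solve(K, vec_Q)
X = vec_X.reshape((n, n), order="F")

residual = np.linalg.norm(A @ X + X @ A.T - Q)
is_positive_definite = bool(np.all(np.linalg.eigvalsh(X) > 0))
print(X, residual, is_positive_definite)
```

The coefficient matrix has size \(n^2 \times n^2\), so this dense approach is only practical for small \(n\); dedicated routines such as SciPy's `scipy.linalg.solve_continuous_lyapunov` are the usual choice for larger problems.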
Conversely, we can reshape a vector back into a matrix. However, the result depends on the
memory layout convention used — an important practical distinction when moving between
mathematical formulas and code.
Example: Row-Major vs Column-Major Order
Consider the vector
\[
a = \begin{bmatrix} 1 & 2 & 3 & 4 & 5 & 6 \end{bmatrix}^\top.
\]
Reshaping \(a\) into a \(2 \times 3\) matrix yields different results depending on the convention:
Row-major order (used by C, C++, and Python/NumPy by default): elements fill each row before moving to the next.
\[
A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}
\]
Column-major order (used by Julia, MATLAB, R, and Fortran): elements fill each column before moving to the next.
\[
A = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix}
\]
Note that the standard mathematical definition of \(\operatorname{vec}(\cdot)\) follows column-major order.
That is, \(\operatorname{vec}(A)\) stacks columns, so the inverse operation (reshaping back) also fills by columns.
import numpy as np

vector = np.array([1, 2, 3, 4, 5, 6])

# NumPy lets you choose the layout explicitly (the default is row-major, order='C'):
matrix_r = vector.reshape((2, 3), order='C')  # row-major, "C" for the C language: [[1, 2, 3], [4, 5, 6]]
matrix_c = vector.reshape((2, 3), order='F')  # column-major, "F" for Fortran:     [[1, 3, 5], [2, 4, 6]]
Kronecker Product
Vectorization unfolded a matrix into a long vector. Now we move in the opposite direction: an operation that builds a larger
matrix out of two smaller ones. But the construction is not arbitrary block-stacking; it has a precise meaning.
Suppose we have two linear maps \(A: \mathbb{R}^n \to \mathbb{R}^m\) and \(B: \mathbb{R}^q \to \mathbb{R}^p\) acting on two independent spaces.
What is the natural single linear map that captures the joint action of both — an operator on a combined input space that respects the independence of the two factors?
The answer is the Kronecker product \(A \otimes B\), an \((mp) \times (nq)\) matrix. The dimensions multiply for the same
reason the dimension of \(\mathbb{R}^{m \times n}\) was \(m \cdot n\): we are working in a combined space whose size is the product of the parts.
The Kronecker product is the concrete, coordinate-level realization of an abstract operation we will name in the next section.
Definition: Kronecker product
Let \(A \in \mathbb{R}^{m \times n}\) and \(B \in \mathbb{R}^{p \times q}\). The Kronecker product
\(A\otimes B\) is the \((mp) \times (nq)\) block matrix
\[
A\otimes B = \begin{bmatrix}
a_{11}B & a_{12}B & \cdots & a_{1n}B \\
a_{21}B & a_{22}B & \cdots & a_{2n}B \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1}B & a_{m2}B & \cdots & a_{mn}B \\
\end{bmatrix}.
\]
Each entry \(a_{ij}\) of \(A\) is multiplied by the entire matrix \(B\), resulting in an \(m \times n\) grid of blocks, each of size \(p \times q\).
Example:
\[
\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \end{bmatrix}
\otimes
\begin{bmatrix} 5 & 6 \\ 7 & 8 \\ \end{bmatrix}
=
\begin{bmatrix}
5 & 6 & 10 & 12 \\
7 & 8 & 14 & 16 \\
15 & 18 & 20 & 24 \\
21 & 24 & 28 & 32 \\
\end{bmatrix}.
\]
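The worked example can be reproduced with NumPy's `np.kron`. This short sketch also extracts one block to make the "each entry scales a copy of \(B\)" structure visible, and compares against the swapped product; the variable names are illustrative.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

K = np.kron(A, B)          # A ⊗ B, shape (2*2, 2*2)
K_swapped = np.kron(B, A)  # B ⊗ A, generally a different matrix

# The (i, j) block of A ⊗ B is a_ij * B; here we pull out block (2, 1) in 1-based indexing.
p, q = B.shape
block_21 = K[1 * p:2 * p, 0 * q:1 * q]
print(K)
print(np.array_equal(block_21, A[1, 0] * B), np.array_equal(K, K_swapped))
```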
The Kronecker product satisfies several elegant algebraic properties that make it a powerful
tool for theoretical analysis. Note that the Kronecker product is not commutative in general
(\(A \otimes B \neq B \otimes A\)), but it interacts well with standard matrix operations.
Theorem: Useful Properties of the Kronecker Product
- Mixed-product property
\[
(A \otimes B)(C \otimes D) = (AC) \otimes (BD)
\]
- Transpose
\[
(A \otimes B)^\top = A^\top \otimes B^\top
\]
- Inverse (when \(A\) and \(B\) are invertible)
\[
(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}
\]
- Trace (for square matrices \(A \in \mathbb{R}^{m \times m}\) and \(B \in \mathbb{R}^{n \times n}\))
\[
\operatorname{tr}(A \otimes B) = \operatorname{tr}(A) \operatorname{tr}(B)
\]
- Determinant (for square matrices \(A \in \mathbb{R}^{m \times m}\) and \(B \in \mathbb{R}^{n \times n}\))
\[
\det (A \otimes B) = \det(A)^n \det(B)^m
\]
- Eigenvalues
If \(A \in \mathbb{R}^{m \times m}\) has eigenvalues \(\lambda_1, \ldots, \lambda_m\) and
\(B \in \mathbb{R}^{n \times n}\) has eigenvalues \(\mu_1, \ldots, \mu_n\) (each counted with algebraic multiplicity over \(\mathbb{C}\)), then the
eigenvalues of \(A \otimes B\) are all products
\[
\lambda_i \mu_j \quad (i = 1, \ldots, m, \; j = 1, \ldots, n).
\]
In particular, if \(A \mathbf{v} = \lambda \mathbf{v}\) and \(B \mathbf{w} = \mu \mathbf{w}\), then
\[
(A \otimes B)(\mathbf{v} \otimes \mathbf{w}) = \lambda\mu \, (\mathbf{v} \otimes \mathbf{w}).
\]
Here \(\mathbf{v} \otimes \mathbf{w} \in \mathbb{R}^{mn}\) denotes the Kronecker product of column vectors, equivalently expressible as \(\mathbf{v} \otimes \mathbf{w} = \operatorname{vec}(\mathbf{w}\mathbf{v}^\top)\).
(Note the order: with our column-major \(\operatorname{vec}\) convention, the rank-one matrix appears as \(\mathbf{w}\mathbf{v}^\top\), not \(\mathbf{v}\mathbf{w}^\top\); the special case \(\mathbf{v} = \mathbf{w} = \mathbf{x}\) recovers the identity \(\mathbf{x} \otimes \mathbf{x} = \operatorname{vec}(\mathbf{x}\mathbf{x}^\top)\) from the previous section.)
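Before turning to the proofs, the properties listed above can be sanity-checked numerically. The sketch below uses small hand-picked matrices (arbitrary illustrative values, chosen to be invertible with distinct real eigenvalues); a numerical check of this kind is evidence, not a proof.

```python
import numpy as np

m, n = 3, 2
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])      # m x m, det = 18
B = np.array([[1.0, 2.0],
              [3.0, 5.0]])           # n x n, det = -1
C = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])
D = np.array([[0.0, 1.0],
              [1.0, 1.0]])
AB = np.kron(A, B)

checks = {
    "mixed_product": np.allclose(AB @ np.kron(C, D), np.kron(A @ C, B @ D)),
    "transpose": np.allclose(AB.T, np.kron(A.T, B.T)),
    "inverse": np.allclose(np.linalg.inv(AB),
                           np.kron(np.linalg.inv(A), np.linalg.inv(B))),
    "trace": np.isclose(np.trace(AB), np.trace(A) * np.trace(B)),
    "determinant": np.isclose(np.linalg.det(AB),
                              np.linalg.det(A) ** n * np.linalg.det(B) ** m),
}

# Eigenvalues: compare the spectrum of A ⊗ B with all products lambda_i * mu_j.
products = np.outer(np.linalg.eigvals(A), np.linalg.eigvals(B)).ravel()
spectrum = np.linalg.eigvals(AB)
# Order-free comparison: each product is close to some eigenvalue, and vice versa.
dist = np.abs(products[:, None] - spectrum[None, :])
checks["eigenvalues"] = bool(dist.min(axis=1).max() < 1e-8
                             and dist.min(axis=0).max() < 1e-8)
print(checks)
```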
Proof (Mixed-Product Property):
We prove Property 1, the mixed-product property, by direct block computation. The remaining properties follow from this one or by analogous direct calculation; we discuss them briefly afterward.
Let \(A \in \mathbb{R}^{m \times n}\), \(B \in \mathbb{R}^{p \times q}\), \(C \in \mathbb{R}^{n \times r}\), \(D \in \mathbb{R}^{q \times s}\), so that the products \(AC\) and \(BD\) are well-defined. Both sides of the identity are matrices in \(\mathbb{R}^{mp \times rs}\).
By the definition of the Kronecker product,
\[
A \otimes B = \bigl[\,a_{ij} B\,\bigr]_{\substack{i = 1, \ldots, m \\ j = 1, \ldots, n}}
\]
is an \(m \times n\) array of \(p \times q\) blocks, and
\[
C \otimes D = \bigl[\,c_{jk} D\,\bigr]_{\substack{j = 1, \ldots, n \\ k = 1, \ldots, r}}
\]
is an \(n \times r\) array of \(q \times s\) blocks. The block partitions are compatible: the column-block dimension of \(A \otimes B\) is \(n\) (matching the row-block dimension of \(C \otimes D\)), and within each block the inner dimension is \(q\) (cols of \(B\) = rows of \(D\)).
We may therefore apply block multiplication: the \((i, k)\)-block of \((A \otimes B)(C \otimes D)\) is
\[
\sum_{j = 1}^{n} (a_{ij} B)(c_{jk} D) = \sum_{j = 1}^{n} a_{ij} c_{jk} \, (BD) = \left(\sum_{j = 1}^{n} a_{ij} c_{jk}\right) (BD) = (AC)_{ik} \, (BD),
\]
where the second equality pulls scalars out of the matrix product and the last equality uses the definition of the matrix product \(AC\).
On the other hand, by the definition of the Kronecker product applied to the \(m \times r\) matrix \(AC\) and the \(p \times s\) matrix \(BD\),
\[
(AC) \otimes (BD) = \bigl[\,(AC)_{ik} \, (BD)\,\bigr]_{\substack{i = 1, \ldots, m \\ k = 1, \ldots, r}}
\]
— exactly the matrix whose \((i, k)\)-block we just computed. Therefore
\[
(A \otimes B)(C \otimes D) = (AC) \otimes (BD). \qquad\blacksquare
\]
Proofs of the Remaining Properties:
Each property's dimension and structural requirements are stated locally with the proof.
Property 2 (Transpose). Let \(A \in \mathbb{R}^{m \times n}\) and \(B \in \mathbb{R}^{p \times q}\). Index the entries of \(A \otimes B\) using row index \(I = (i-1)p + r\) (with \(1 \leq i \leq m\), \(1 \leq r \leq p\)) and column index \(J = (j-1)q + s\) (with \(1 \leq j \leq n\), \(1 \leq s \leq q\)). By the definition of the Kronecker product, the \((i,j)\)-block is \(a_{ij} B\), so its \((r, s)\)-entry, which is the global \((I, J)\)-entry, equals \(a_{ij} b_{rs}\). Then
\[
\bigl((A \otimes B)^\top\bigr)_{JI} = (A \otimes B)_{IJ} = a_{ij} b_{rs}.
\]
On the other hand, \(A^\top \otimes B^\top \in \mathbb{R}^{nq \times mp}\) has, by the same Kronecker indexing applied with \(A^\top\) and \(B^\top\) as outer and inner factors, the \((J, I)\)-entry \((A^\top)_{ji} (B^\top)_{sr} = a_{ij} b_{rs}\). The entries match for all \(I, J\), so \((A \otimes B)^\top = A^\top \otimes B^\top\). \(\quad\blacksquare\)
Property 3 (Inverse). Suppose \(A \in \mathbb{R}^{m \times m}\) and \(B \in \mathbb{R}^{n \times n}\) are invertible. By the mixed-product property,
\[
(A \otimes B)(A^{-1} \otimes B^{-1}) = (A A^{-1}) \otimes (B B^{-1}) = I_m \otimes I_n = I_{mn},
\]
where the last equality uses that \(I_m \otimes I_n\) is, by direct inspection, the \(mn \times mn\) identity matrix (its diagonal blocks are \(I_n\), off-diagonal blocks are zero). The same calculation with the factors reversed gives \((A^{-1} \otimes B^{-1})(A \otimes B) = I_{mn}\). Hence \(A \otimes B\) is invertible with inverse \(A^{-1} \otimes B^{-1}\). \(\quad\blacksquare\)
Property 4 (Trace). Suppose \(A \in \mathbb{R}^{m \times m}\) and \(B \in \mathbb{R}^{n \times n}\). The diagonal entries of \(A \otimes B \in \mathbb{R}^{mn \times mn}\) lie inside its diagonal blocks \(a_{ii} B\) for \(i = 1, \ldots, m\); the diagonal of the \(i\)-th such block contributes \(a_{ii} \operatorname{tr}(B)\). Summing over \(i\),
\[
\operatorname{tr}(A \otimes B) = \sum_{i=1}^m a_{ii} \operatorname{tr}(B) = \operatorname{tr}(B) \sum_{i=1}^m a_{ii} = \operatorname{tr}(A) \operatorname{tr}(B),
\]
using the definition of the trace at both ends. \(\quad\blacksquare\)
Property 6 (Eigenvalues). Suppose \(A \in \mathbb{R}^{m \times m}\) and \(B \in \mathbb{R}^{n \times n}\) are diagonalizable over \(\mathbb{C}\), with eigenpairs \(A \mathbf{v}_i = \lambda_i \mathbf{v}_i\) for \(i = 1, \ldots, m\) and \(B \mathbf{w}_j = \mu_j \mathbf{w}_j\) for \(j = 1, \ldots, n\), where \(\{\mathbf{v}_i\}\) and \(\{\mathbf{w}_j\}\) are bases. By the mixed-product property (treating \(\mathbf{v}_i\) and \(\mathbf{w}_j\) as column matrices),
\[
(A \otimes B)(\mathbf{v}_i \otimes \mathbf{w}_j) = (A\mathbf{v}_i) \otimes (B\mathbf{w}_j) = (\lambda_i \mathbf{v}_i) \otimes (\mu_j \mathbf{w}_j) = \lambda_i \mu_j \, (\mathbf{v}_i \otimes \mathbf{w}_j),
\]
where the last equality uses that the Kronecker product is bilinear in scalars. So each \(\lambda_i \mu_j\) is an eigenvalue of \(A \otimes B\) with eigenvector \(\mathbf{v}_i \otimes \mathbf{w}_j\).
It remains to show this list is complete. By the basis-based tensor-product construction (extended to \(\mathbb{C}\)),
the set \(\{\mathbf{v}_i \otimes \mathbf{w}_j : 1 \leq i \leq m, \; 1 \leq j \leq n\}\) is a basis of \(\mathbb{C}^{mn}\) — it is the image of the standard tensor-product basis under the invertible change-of-basis induced by \(\{\mathbf{e}_i\} \mapsto \{\mathbf{v}_i\}\), \(\{\mathbf{f}_j\} \mapsto \{\mathbf{w}_j\}\). We have therefore exhibited \(mn\) linearly independent eigenvectors of \(A \otimes B\), accounting for all of its eigenvalues. The list \(\{\lambda_i \mu_j\}_{i, j}\) (counted with the algebraic multiplicities of \(\lambda_i\) and \(\mu_j\)) is the complete spectrum.
Remark on the non-diagonalizable case. The identity \(\operatorname{spec}(A \otimes B) = \{\lambda_i \mu_j\}\) holds in full generality (with multiplicity), but a complete proof requires reducing to the diagonalizable case via density arguments or working with the Jordan canonical form. We omit the details here. \(\quad\blacksquare\)
Property 5 (Determinant), sketch. Once Property 6 is in hand, the determinant formula follows from the identity \(\det(M) = \prod_i \lambda_i\) (counting algebraic multiplicities): in the list of products, each \(\lambda_i\) appears \(n\) times (once for each \(\mu_j\)) and each \(\mu_j\) appears \(m\) times (once for each \(\lambda_i\)), giving \(\det(A \otimes B) = \prod_{i, j} \lambda_i \mu_j = \prod_i \lambda_i^n \cdot \prod_j \mu_j^m = \det(A)^n \det(B)^m\). A direct proof avoiding eigenvalues exists (e.g., via the factorization \(A \otimes B = (A \otimes I_n)(I_m \otimes B)\) and a permutation-similarity argument) but is more involved; we leave it aside.
Vectorization and the Kronecker product, introduced in the previous two sections, fit together through one central identity. We mentioned this identity informally in the vectorization insight box; we now prove it.
Theorem: The vec–Kronecker Identity
Let \(A \in \mathbb{R}^{m \times n}\), \(B \in \mathbb{R}^{n \times p}\), \(C \in \mathbb{R}^{p \times q}\). Then
\[
\operatorname{vec}(ABC) = (C^\top \otimes A) \, \operatorname{vec}(B).
\]
Proof:
Both sides lie in \(\mathbb{R}^{mq}\), structured as \(q\) stacked blocks of size \(m\). We show the \(j\)-th block agrees on each side, for \(j = 1, \ldots, q\).
Denote the columns of \(B\) by \(\mathbf{b}_1, \ldots, \mathbf{b}_p \in \mathbb{R}^n\) and the columns of \(C\) by \(\mathbf{c}_1, \ldots, \mathbf{c}_q \in \mathbb{R}^p\). The \(j\)-th column of \(ABC\) is \(A B \mathbf{c}_j\), and \(\operatorname{vec}(ABC)\) is obtained by stacking these columns. Expanding \(B \mathbf{c}_j\) in terms of the columns of \(B\),
\[
B \mathbf{c}_j = \sum_{k=1}^p c_{kj} \, \mathbf{b}_k,
\]
so the \(j\)-th block of \(\operatorname{vec}(ABC)\) is
\[
A B \mathbf{c}_j = \sum_{k=1}^p c_{kj} \, A \mathbf{b}_k.
\]
Now consider \((C^\top \otimes A) \, \operatorname{vec}(B)\). View \(C^\top \otimes A\) as a \(q \times p\) array of \(m \times n\) blocks: the \((j, k)\)-block is \((C^\top)_{jk} \, A = c_{kj} \, A\). View \(\operatorname{vec}(B) \in \mathbb{R}^{np}\) as \(p\) stacked blocks of size \(n\), where the \(k\)-th block is \(\mathbf{b}_k\). The block partitions are compatible (column-block dimension \(p\) on each side; inner dimension \(n\) on each side), so by block multiplication the \(j\)-th block of \((C^\top \otimes A) \, \operatorname{vec}(B)\) is
\[
\sum_{k=1}^p (c_{kj} A) \, \mathbf{b}_k = \sum_{k=1}^p c_{kj} \, A \mathbf{b}_k.
\]
This matches the \(j\)-th block of \(\operatorname{vec}(ABC)\) computed above. Since \(j\) was arbitrary, the two vectors are equal. \(\quad\blacksquare\)
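The identity just proved is easy to test numerically, and doing so also shows why the memory-layout convention matters in code. The sketch below uses random matrices with a fixed seed; the helper `vec` and all variable names are illustrative assumptions.

```python
import numpy as np

def vec(M):
    """Column-major flattening, matching the definition of vec in the text."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(0)
m, n, p, q = 2, 3, 4, 5
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))
C = rng.standard_normal((p, q))

lhs = vec(A @ B @ C)               # length m*q
rhs = np.kron(C.T, A) @ vec(B)     # (mq x np) matrix times a length-np vector
col_major_ok = np.allclose(lhs, rhs)

# With NumPy's default row-major flattening the same formula fails in general;
# the matching row-major identity swaps the factors: (A ⊗ C^T) vec_r(B).
row_major_naive = np.allclose((A @ B @ C).ravel(), np.kron(C.T, A) @ B.ravel())
row_major_ok = np.allclose((A @ B @ C).ravel(), np.kron(A, C.T) @ B.ravel())
print(col_major_ok, row_major_naive, row_major_ok)
```

The row-major variant follows from the column-major identity applied to \((ABC)^\top = C^\top B^\top A^\top\), since row-major flattening of a matrix is column-major flattening of its transpose.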
As a quick consequence relevant to the Lyapunov example mentioned earlier: applying the identity to \(AX\) (taking \(B = X\), \(C = I_n\)) and to \(XA^\top\) (taking \(A = I_n\), \(B = X\), \(C = A^\top\)), then summing,
\[
\operatorname{vec}(AX + XA^\top) = (I_n \otimes A) \operatorname{vec}(X) + (A \otimes I_n) \operatorname{vec}(X) = (I_n \otimes A + A \otimes I_n) \operatorname{vec}(X),
\]
which is the linear-system reformulation announced in the vectorization insight box.
Insight: Kronecker Products in Machine Learning
The mixed-product property \((A \otimes B)(C \otimes D) = (AC) \otimes (BD)\) is the algebraic foundation of
Kronecker-factored approximations. The K-FAC optimizer approximates the Fisher Information Matrix
of a layer as \(F \approx A \otimes G\), where \(A\) is the covariance of activations and \(G\) is the covariance of
backpropagated gradient statistics.
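By exploiting this structure, together with the inverse property \((A \otimes G)^{-1} = A^{-1} \otimes G^{-1}\), K-FAC reduces the inversion cost for a layer with \(n\) parameters from \(O(n^3)\) on the full parameter space to
approximately \(O(n^{3/2})\) when the two factors have comparable size (each roughly \(\sqrt{n} \times \sqrt{n}\)). The eigenvalue property further enables efficient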
spectral preconditioning, allowing large-scale neural networks to converge faster by utilizing
the curvature of the loss landscape. K-FAC is treated in detail in the Natural Gradient Descent page of Section V.
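The source of the saving can be illustrated without any optimizer machinery. The sketch below is not K-FAC itself; it only shows, on synthetic data, that a linear system with a Kronecker-structured matrix can be solved through the two small factors, using \((A \otimes G)\operatorname{vec}(X) = \operatorname{vec}(G X A^\top)\). The names `A_cov`, `G_cov`, `V`, and the helper `random_spd` are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3

def random_spd(d):
    """Synthetic symmetric positive-definite matrix (stand-in for a covariance factor)."""
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)

A_cov = random_spd(d_in)                  # plays the role of the activation factor A
G_cov = random_spd(d_out)                 # plays the role of the gradient factor G
V = rng.standard_normal((d_out, d_in))    # a "gradient" kept in matrix form

# Naive route: form the full (d_in*d_out) x (d_in*d_out) matrix and solve.
F = np.kron(A_cov, G_cov)
x_full = np.linalg.solve(F, V.reshape(-1, order="F"))

# Factored route: (A ⊗ G) vec(X) = vec(G X A^T), so X = G^{-1} V A^{-T}.
Y = np.linalg.solve(G_cov, V)             # G^{-1} V
X = np.linalg.solve(A_cov, Y.T).T         # Y A^{-T}
match = np.allclose(x_full, X.reshape(-1, order="F"))
print(match)
```

The factored route never builds the large matrix `F`; it only solves with a `d_out x d_out` and a `d_in x d_in` system.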
With the Kronecker product established, we now turn to tensors. Two views will appear in tension: the computational view (a tensor is a multidimensional array, generalizing matrices to more than two indices) and the algebraic view (a tensor is an element of a tensor product of vector spaces).
Reconciling them is what reveals the structure that has been quietly underlying everything in this page.
Tensor
So far, we have worked with scalars, vectors, and matrices: arrays with 0, 1, and 2 indices respectively.
Many applications require data structures that extend beyond two dimensions. A tensor
provides exactly this generalization: it is a multidimensional array that can be understood both as a
computational data structure and as an element of a tensor product of vector spaces.
Example: Familiar Tensors
The objects we have studied throughout linear algebra are all tensors of low order:
- A scalar is an order-0 tensor (a single number, \(a \in \mathbb{R}\)).
- A vector is an order-1 tensor (a one-dimensional array, \(\mathbf{v} \in \mathbb{R}^n\)).
- A matrix is an order-2 tensor (a two-dimensional array, \(A \in \mathbb{R}^{m \times n}\)).
The order of a tensor (also called its number of modes or ways, and informally its number of "dimensions") is
the number of indices required to specify an entry.
Higher-order tensors arise naturally in applications. For example, a color image is an order-3 tensor
\(I \in \mathbb{R}^{H \times W \times C}\), where \(H\) is height, \(W\) is width, and \(C\) is the
number of channels (e.g., \(C = 3\) for RGB). A batch of \(N\) such images forms an order-4 tensor in
\(\mathbb{R}^{N \times H \times W \times C}\).
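In array libraries, the order of a tensor is simply the number of axes. The following sketch uses NumPy's `ndim` attribute; the concrete sizes (32 by 32 images, a batch of 8) are arbitrary illustrative choices.

```python
import numpy as np

scalar = np.array(3.0)               # order 0
vector = np.zeros(5)                 # order 1
matrix = np.zeros((2, 3))            # order 2
image = np.zeros((32, 32, 3))        # order 3: H x W x C
batch = np.zeros((8, 32, 32, 3))     # order 4: N x H x W x C

orders = [t.ndim for t in (scalar, vector, matrix, image, batch)]

# An entry of an order-r tensor needs exactly r indices.
pixel_value = batch[0, 10, 20, 1]    # image 0, row 10, column 20, channel 1
print(orders, batch.shape)
```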
These computational examples — multidimensional arrays — are useful, but they describe how tensors are stored, not what tensors are.
For that, we need an algebraic definition rooted in the construction we have been circling around for the entire page: the tensor product of vector spaces.
The patterns we have collected so far — vec identifying \(\mathbb{R}^{m \times n}\) with \(\mathbb{R}^{mn}\), Kronecker building \((mp) \times (nq)\) matrices from \(m \times n\) and \(p \times q\) ones, dimensions multiplying everywhere — all become a single statement once the construction is in place.
Definition: Tensor Product Space (Finite-Dimensional)
Let \(V\) and \(W\) be finite-dimensional vector spaces over \(\mathbb{R}\) with bases
\(\{\mathbf{e}_1, \ldots, \mathbf{e}_m\}\) of \(V\) and \(\{\mathbf{f}_1, \ldots, \mathbf{f}_n\}\) of \(W\).
The tensor product space \(V \otimes W\) is the \(mn\)-dimensional vector space with basis
\(\{\mathbf{e}_i \otimes \mathbf{f}_j : 1 \leq i \leq m, \; 1 \leq j \leq n\}\).
A general element is a linear combination
\[
T = \sum_{i=1}^m \sum_{j=1}^n T_{ij} \, (\mathbf{e}_i \otimes \mathbf{f}_j), \quad T_{ij} \in \mathbb{R}.
\]
For \(\mathbf{v} = \sum_i v_i \mathbf{e}_i \in V\) and \(\mathbf{w} = \sum_j w_j \mathbf{f}_j \in W\), the elementary tensor \(\mathbf{v} \otimes \mathbf{w}\) is defined to be
\[
\mathbf{v} \otimes \mathbf{w} := \sum_{i, j} v_i w_j \, (\mathbf{e}_i \otimes \mathbf{f}_j).
\]
More generally, an order-\(r\) tensor is an element
\[
T \in V_1 \otimes V_2 \otimes \cdots \otimes V_r,
\]
constructed analogously with basis \(\{\mathbf{e}^{(1)}_{i_1} \otimes \cdots \otimes \mathbf{e}^{(r)}_{i_r}\}\). When each \(V_k = \mathbb{R}^{n_k}\), such a tensor is represented by a multidimensional array with entries \(T_{i_1 i_2 \cdots i_r}\), where \(1 \leq i_k \leq n_k\).
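In coordinates, an elementary tensor of order 3 is the array of all products of entries, and a general tensor is a linear combination of such arrays. A minimal NumPy sketch with `np.einsum` follows; the vectors are arbitrary illustrative values.

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([1.0, 0.0, -1.0])
w = np.array([2.0, 3.0])

# Coordinates of the elementary tensor u ⊗ v ⊗ w: T[i, j, k] = u[i] * v[j] * w[k].
T = np.einsum("i,j,k->ijk", u, v, w)

# A general tensor in the same space is a linear combination of elementary ones.
S = T + 0.5 * np.einsum("i,j,k->ijk", w, v, u)
print(T.shape, T[1, 2, 0], S.shape)
```

The number of entries, `T.size`, equals the product of the dimensions of the three factor spaces, which is the dimension of the tensor product space.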
Remark. A basis-free definition exists via a universal property: \(V \otimes W\) is the unique (up to isomorphism) vector space \(U\) equipped with a bilinear map \(V \times W \to U\) through which every bilinear map out of \(V \times W\) factors uniquely. This formulation makes the tensor product independent of any basis choice and extends naturally to infinite-dimensional and more general settings. We will not need this level of generality here; the basis-based definition above suffices for everything in Section I.
Under this formal definition, the familiar objects of linear algebra are recovered as special cases:
a scalar is an order-0 tensor (an element of \(\mathbb{R}\)), a vector is an order-1 tensor (an element of \(V\)),
and a matrix is an order-2 tensor (an element of \(V \otimes W\)).
Now the loose threads of this page tie together. The space \(\mathbb{R}^{m \times n}\) is not "an array of numbers" but the coordinate representation of \(\mathbb{R}^m \otimes \mathbb{R}^n\) — the tensor product of two simpler spaces. Vectorization is the explicit isomorphism \(\mathbb{R}^m \otimes \mathbb{R}^n \cong \mathbb{R}^{mn}\) that takes a tensor and lists its coordinates as a single column. The Kronecker product is the coordinate representation of the tensor product under a specific ordering convention: when \(V = \mathbb{R}^m\) and \(W = \mathbb{R}^n\) with standard bases, ordering the tensor-product basis as \((\mathbf{e}_i \otimes \mathbf{f}_j) \mapsto (i - 1)n + j\) identifies the abstract elementary tensor \(\mathbf{v} \otimes \mathbf{w}\) with the Kronecker product of the column vectors (this is the same convention that gave \(\mathbf{v} \otimes \mathbf{w} = \operatorname{vec}(\mathbf{w}\mathbf{v}^\top)\) earlier; a different ordering would swap the roles of \(\mathbf{v}\) and \(\mathbf{w}\)). More generally, given linear maps \(f: V \to V'\) and \(g: W \to W'\), they induce a linear map \(f \otimes g: V \otimes W \to V' \otimes W'\) whose matrix representation (in this tensor-product basis ordering) is the Kronecker product of the matrices of \(f\) and \(g\). The tensor product itself is the abstract, coordinate-free operation; what we have called "matrices" and "Kronecker products" are its shadows in coordinates.
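The ordering convention in the paragraph above can be made concrete. The sketch below checks that the Kronecker product of two vectors lists the coordinates \(v_i w_j\) at position \((i-1)n + j\), that this agrees with \(\operatorname{vec}(\mathbf{w}\mathbf{v}^\top)\), and that the Kronecker product of two matrices acts factor by factor on elementary tensors. The matrices `F` and `G` are arbitrary illustrative choices.

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])      # in R^m, m = 3
w = np.array([10.0, 20.0])         # in R^n, n = 2

kron_vw = np.kron(v, w)
# Position (i-1)n + j holds v_i w_j: this is the row-major flattening of v w^T ...
row_major = np.outer(v, w).reshape(-1, order="C")
# ... which coincides with the column-major vec of w v^T.
col_major = np.outer(w, v).reshape(-1, order="F")
same_ordering = (np.array_equal(kron_vw, row_major)
                 and np.array_equal(kron_vw, col_major))

# Induced map f ⊗ g: its matrix is the Kronecker product of the two matrices.
F = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])    # f: R^3 -> R^2
G = np.array([[0.0, 1.0],
              [1.0, 1.0],
              [2.0, 0.0]])         # g: R^2 -> R^3
acts_factorwise = np.allclose(np.kron(F, G) @ kron_vw, np.kron(F @ v, G @ w))
print(kron_vw, same_ordering, acts_factorwise)
```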
A complete treatment of how linear maps act on tensor products — and how this perspective extends to group actions, where a group \(G\) acts on \(V \otimes W\) compatibly with its actions on \(V\) and \(W\) — belongs to the language of representation theory, which we develop in a future page of Section I. That language, in turn, is what makes the next idea possible: building neural networks whose internal computations respect symmetries of the data.
Insight: Tensors as the Language of Deep Learning
While deep learning frameworks treat tensors primarily as multidimensional arrays, their algebraic power stems
from their identity as multilinear maps. In Transformer architectures, the attention mechanism
is built around tensor contractions: the query-key interaction is a bilinear operation that produces
attention scores, which a softmax then normalizes into attention weights.
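To make the contraction viewpoint concrete, here is a minimal single-head sketch of scaled dot-product attention written with `np.einsum`. It assumes random inputs, no masking, and no learned projections; the sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8                          # sequence length and head dimension (illustrative)
Q = rng.standard_normal((T, d))      # queries
K = rng.standard_normal((T, d))      # keys
V = rng.standard_normal((T, d))      # values

# Bilinear part: contract the shared feature index d between queries and keys.
raw_scores = np.einsum("id,jd->ij", Q, K)
scores = raw_scores / np.sqrt(d)

# Nonlinear part: a row-wise softmax (shifted by the row max for numerical stability).
shifted = scores - scores.max(axis=1, keepdims=True)
weights = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Second contraction: weights against values, over the key index j.
output = np.einsum("ij,jd->id", weights, V)
print(output.shape)
```

The two `einsum` calls are the tensor contractions; the softmax between them is the only step that is not multilinear.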
In the emerging field of Geometric Deep Learning, the mathematical definition of a tensor
— how it transforms under coordinate changes — becomes critical. This enables the design of
Equivariant Neural Networks, where internal representations (scalars, vectors, and higher-order tensors)
transform consistently according to group representations. This ensures the model respects the underlying physical symmetries
of the data, such as rotation and reflection symmetry in 3D molecular modeling or medical imaging. The relevant symmetries are formalized through continuous matrix groups such as \(SO(n)\), developed in the Lie groups series later in Section I; the geometric flavor of these groups is previewed in Geometry of Symmetry.