Matrix Calculus
by Heinrich Hartmann / 2025-06-23 / Stemwede
Motivation
Good notation has a vast influence on how well we are able to perform computations and on how much we can trust the results. A good "calculus" allows us to comprehend formulas and spot inconsistencies easily.
In this text we follow the general "function space" approach laid out in Natural Operators, but focus on matrix algebra suited for computation.
Function Spaces
Let \(X\) be a set. Typical examples are \(X=[N]=\{1,\dots,N\}\) if we are doing computations, but also infinite sets like \(X=\IR\). We introduce the following notation:
- \(\IR[X]= (\stext{real valued functions on X}) = Map(X, \IR)\).
- For \(a\in \IR[X], x\in X\), we denote by \(a[x]\in \IR\) the value of \(a\) at \(x\).
- For \(x \in X\) we denote by \(\hat{x} \in \IR[X]\) the "delta" function with \(\hat{x}[x]=1\), and \(\hat{x}[y] = 0\) for all \(y\neq x\).
- \(\IR[X]\) is a commutative, associative, unital algebra with pointwise operations:
- \((a + b)[x] = a[x]+b[x] \in \IR\)
- \((a \cdot b)[x] = a[x]b[x] \in \IR\)
- The scalar multiplication by \(r \in \IR\) is denoted by \(r a\). We sometimes write \(r.a\) if we want to stress the scalar nature of \(r\).
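For finite \(X\) these definitions translate directly into array code. Below is a minimal sketch (assuming NumPy and identifying \(X\) with the index set \(\{0,\dots,N-1\}\)); note that NumPy's `*` is the pointwise product of this section, while the contraction product introduced later corresponds to `@`.

```python
import numpy as np

N = 4                               # X = {0, ..., N-1}
a = np.arange(N, dtype=float)       # a in R[X]; a[x] is the value at x
b = np.ones(N)                      # b in R[X]

def delta(x, n=N):
    """The 'delta' function x-hat with x_hat[x] = 1 and 0 elsewhere."""
    e = np.zeros(n)
    e[x] = 1.0
    return e

# Pointwise algebra: (a + b)[x] = a[x] + b[x], (a * b)[x] = a[x] * b[x]
print(a + b)      # pointwise sum
print(a * b)      # pointwise product (NumPy's *, not the inner product below)
print(3.0 * a)    # scalar multiplication r.a
print(delta(2))   # the delta function at x = 2
```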
Notes on Notation
- We choose the variable \(x \in X\) to denote points. \(x\) is a non-linear discrete entity without further structure. It may also be an index, \(x \in [N]\).
In other texts we would see: \(i = 1,\dots,N\).
- For functions \(a \in \IR[X]\) we choose the variable \(a\). Here \(a\) is an algebraic entity
with a rich set of operations, inherited from the real numbers. In general, functions are
best denoted by letters that resemble elements of the target space. We parametrize the values \(a[x]\) in the target by \(x \in X\).
In other texts we would see \(v\) with \(v_i\), \(i = 1,\dots, N\).
- The notation \(a[x]\) is inspired by Python array indexing. If we choose \(X = \{ 0,\dots,N-1\}\) we exactly recover Python-style array indexing \(a[i]\).
Slicing
For \(A \in \IR[X,Y], x\in X,y\in Y\) we define:
- \(A[x,:] \in \IR[Y]\) by \(A[x,:][y] = A[x,y]\).
- \(A[:,y] \in \IR[X]\) by \(A[:,y][x] = A[x,y]\).
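Assuming the same identification of sets with index ranges, NumPy's slicing syntax matches this notation literally:

```python
import numpy as np

X, Y = 3, 4
A = np.arange(X * Y, dtype=float).reshape(X, Y)   # A in R[X,Y]

row = A[1, :]    # A[x,:] in R[Y] -- the row at x = 1
col = A[:, 2]    # A[:,y] in R[X] -- the column at y = 2
assert row[2] == A[1, 2] and col[1] == A[1, 2]
```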
Inner Products
Assume \(X\) is finite, and \(a,b\) in \(\IR[X]\).
We define the inner product as $$ a * b = \sum_{x \in X} a[x]\, b[x] \quad \in \IR. $$
- Commutativity We have \(a * b = b * a\).
- Reconstruction We have \(a = \sum_x a[x] \hat{x} = \sum_x (a * \hat{x})\hat{x}\)
Side-note The inner product is also defined if \(X\) is infinite but either \(a\) or \(b\) have finite support, i.e. \(a[x]=0\) for almost all \(x\).
Matrix Products
Let \(X,Y,Z\) be finite sets, \(x \in X,y \in Y\), \(a \in \IR[X], b \in \IR[Y], c \in \IR[Z]\).
- We define \(\IR[X,Y]=\IR[ X\times Y]\) as real valued functions on \(X\times Y\).
- For \(A \in \IR[X,Y]\) we denote the entries by \(A[x,y] \in \IR\).
- We define the vector-matrix product \(a * A \in \IR[Y]\) by \((a * A)[y] = \sum_x a[x]A[x,y].\)
- We define the matrix-vector product \(A * b \in \IR[X]\) by \((A * b)[x] = \sum_y A[x,y]b[y].\)
- We define the matrix-matrix product \(A * B\in \IR[X,Z]\), for \(B \in \IR[Y,Z]\), by \((A*B)[x,z]=\sum_y A[x,y]B[y,z]\).
Note There are no row and column vectors here. Matrix-vector products are defined in both directions: \(a * A\) and \(A * b\). This is perfectly natural from an indexing standpoint, as "*" is a contraction operator that sums over adjacent indices. This extends naturally to n-dimensional arrays.
Traditionally \(v * A\) is denoted as \(v^T \cdot A\) and \(A * v\) is \(A \cdot v\), where \(v^T\) is a \(1 \times N\) "row matrix" and \(v\) an \(N \times 1\) "column matrix".
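NumPy's `@` operator behaves exactly like the product \(*\) used here: 1d arrays are neither row nor column vectors, and both \(a * A\) and \(A * b\) are defined. A small sketch:

```python
import numpy as np

X, Y, Z = 2, 3, 4
a = np.random.rand(X)        # a in R[X]
b = np.random.rand(Y)        # b in R[Y]
A = np.random.rand(X, Y)     # A in R[X,Y]
B = np.random.rand(Y, Z)     # B in R[Y,Z]

aA = a @ A     # vector-matrix product, in R[Y]
Ab = A @ b     # matrix-vector product, in R[X]
AB = A @ B     # matrix-matrix product, in R[X,Z]

# Inner product of two vectors in R[Y]:
s = aA @ b     # equals sum_y (a * A)[y] * b[y]

# The products are contractions over adjacent indices:
assert np.allclose(aA, np.einsum('x,xy->y', a, A))
assert np.allclose(Ab, np.einsum('xy,y->x', A, b))
assert np.allclose(AB, np.einsum('xy,yz->xz', A, B))
```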
Associativity
- The matrix product is associative: \(A * (B * C) = (A * B) * C\), hence we write \(A * B * C\) without ambiguity.
- The matrix product is associative with vector-matrix product: \(a * (A * B) = (a * A) * B\), hence we write \(a * A * B\) without ambiguity.
- The vector-matrix product is associative with the scalar product: \((a * A) * b = a * (A * b)\), hence we write \(a * A * b\) without ambiguity.
Note that, since the scalar product is also commutative, we have: $$ a * A * b = (a * A) * b = b * (a * A). $$
It is important to note that in this case we cannot remove the brackets, since "\((b * a) * A\)" is something entirely different: it is the \(1 \times 1\) matrix \(b * a\) viewed as a scalar and multiplied with the matrix \(A\).
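A small numerical sanity check of these rules (a sketch with \(X = Y\), so that \(b * a\) is also defined):

```python
import numpy as np

rng = np.random.default_rng(6)
X = 4
a, b = rng.standard_normal(X), rng.standard_normal(X)
A = rng.standard_normal((X, X))

assert np.isclose((a @ A) @ b, a @ (A @ b))    # a * A * b is unambiguous
assert np.isclose(b @ (a @ A), a @ A @ b)      # the scalar product is commutative
# But (b @ a) * A is a scalar times the matrix A -- something entirely different:
print((b @ a) * A)
```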
Unit
- The identity matrix \(I_X \in \IR[X,X]\) is given by: \(I_X[x,y] = \delta_{x,y}\), where \(\delta_{x,x} = 1\) and \(\delta_{x,y} = 0\) for \(x \neq y \in X\).
- For any vector \(a \in \IR[X]\) we have \(a * I_X = I_X * a = a\).
- For any matrix \(A \in \IR[X,Y]\), \(B \in \IR[Y,X]\) we have: \(I_X * A = A\) and \(B * I_X = B\).
Transposition
- We define \(A^* \in \IR[Y,X]\) as \(A^*[y,x] = A[x,y]\).
- We have \(A^* * a = a * A\)
In this text we will not use transposition of vectors / 1d arrays.
Outer Products
For \(a \in \IR[X]\) and \(b \in \IR[Y]\), define \(a \boxtimes b \in \IR[X,Y]\) as \((a \boxtimes b)[x,y] = a[x]b[y]\).
Properties
- For \(A \in \IR[W,X]\) and \(B \in \IR[Y,Z]\) we have: \(A * (a \boxtimes b) * B = (A * a) \boxtimes (b * B)\).
- The matrix \(a \boxtimes b\) acts on vectors \(a' \in \IR[X], b' \in \IR[Y]\) as \(a' * (a \boxtimes b) * b' = (a' * a) . (b * b')\).
- (Row- and Column-vectors) We can convert vectors into matrices via \(1 \boxtimes a \in \IR[1,X]\) and \(a \boxtimes 1 \in \IR[X,1]\).
- For \(x \in X\) the matrix \(\hat{x} \boxtimes b = (\hat{x} \boxtimes 1) * (1 \boxtimes b)\) is the "row vector" \(b\) placed in row \(x\).
- For \(y \in Y\) the matrix \(a \boxtimes \hat{y} = (a \boxtimes 1) * (1 \boxtimes \hat{y})\) is the "column vector" \(a\) placed in column \(y\).
- The identity matrix can be expressed as \(I_X = \sum_{x\in X} \hat{x} \boxtimes \hat{x}\), for \(X\) finite.
- The identity \(I_X * A = A\) recovers \(A\) as the sum of its "rows" $$ A = I_X * A = \sum_{x\in X} \hat{x} \boxtimes (\hat{x} * A) = \sum_{x\in X} \hat{x} \boxtimes A[x,:] $$
- The identity \(A * I_Y = A\) recovers \(A\) as the sum of its "columns" $$ A = A * I_Y = \sum_{y\in Y} (A * \hat{y}) \boxtimes \hat{y} = \sum_{y\in Y} A[:,y] \boxtimes \hat{y}. $$
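These identities are easy to check numerically; here is a sketch using `np.outer` for \(\boxtimes\):

```python
import numpy as np

X, Y = 3, 4
a, b = np.random.rand(X), np.random.rand(Y)
A = np.random.rand(X, Y)

def delta(i, n):
    e = np.zeros(n); e[i] = 1.0
    return e

# a (x) b acts on vectors a', b' as (a' * a).(b * b')
ap, bp = np.random.rand(X), np.random.rand(Y)
assert np.isclose(ap @ np.outer(a, b) @ bp, (ap @ a) * (b @ bp))

# Identity matrix as a sum of outer products of delta functions
I_X = sum(np.outer(delta(x, X), delta(x, X)) for x in range(X))
assert np.allclose(I_X, np.eye(X))

# A as the sum of its rows / columns
rows = sum(np.outer(delta(x, X), A[x, :]) for x in range(X))
cols = sum(np.outer(A[:, y], delta(y, Y)) for y in range(Y))
assert np.allclose(rows, A) and np.allclose(cols, A)
```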
Proposition
- The matrix \(a \boxtimes b\) has rank one.
- Every rank one matrix \(R \in \IR[X,Y]\) is of the form \(a \boxtimes b\).
Remark
- In classical treatments vectors \(a \in \IR[X]\) are identified with "column vectors" in \(\IR[X,1]\), which we denote by \(a \boxtimes 1\).
- The classical scalar product \(v^t \cdot w\) is expressed as \((1 \boxtimes v) * (w \boxtimes 1) = v * w \in \IR\).
- The classical rank-1 matrix \(v \cdot w^t\) is expressed as \((v \boxtimes 1) * (1 \boxtimes w) = v \boxtimes w \in \IR[X,Y]\).
- The product \(\boxtimes\) generalizes naturally to the Kronecker product or tensor product for higher-order tensors \(\IR[X_1, \dots, X_n]\).
Orthonormal Systems
- An orthonormal system \(u_z \in \IR[X], z \in Z\) is a set of vectors for which \(u_z * u_w = \delta_{z,w}\) for \(z,w \in Z\).
- The vectors \(u_z \in \IR[X], z\in Z\) form an orthonormal system if and only if the matrix \(U = \sum_z u_z \boxtimes \hat{z} \in \IR[X,Z]\) fulfills \(U^* * U = I_Z\).
- An orthonormal basis (ONB) \(u_x \in \IR[X], x \in X\) is an orthonormal system in \(\IR[X]\) indexed by \(X\).
Proposition
- Orthonormal systems are linearly independent in \(\IR[X]\).
- Any orthonormal system \(u_z \in \IR[X]\), with \(Z \subset X\), can be extended to an orthonormal basis.
- The reverse composition \(P = U * U^* \in \IR[X,X]\) and its complement \(Q = I_X - P\) are orthogonal projection matrices:
- Projection property: \(P * P = P, \quad Q*Q = Q\).
- Orthogonality: \(P * Q = 0, \quad Q * P = 0\).
- \(Im(P) = Im(U) = Span(u_z, z\in Z) = Ker(Q)\)
- \(Im(Q) = Im(P)^\perp\)
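A numerical sketch of these statements (the orthonormal system is obtained here from a QR factorization; this is one convenient construction, not the only one):

```python
import numpy as np

X, Z = 5, 2   # an orthonormal system of 2 vectors in R[X], #X = 5
rng = np.random.default_rng(0)

# Columns of U form an orthonormal system u_z in R[X]
U = np.linalg.qr(rng.standard_normal((X, Z)))[0]   # U in R[X,Z]
assert np.allclose(U.T @ U, np.eye(Z))             # U^* * U = I_Z

P = U @ U.T          # orthogonal projector onto Span(u_z)
Q = np.eye(X) - P    # projector onto the orthogonal complement

assert np.allclose(P @ P, P) and np.allclose(Q @ Q, Q)   # projection property
assert np.allclose(P @ Q, 0) and np.allclose(Q @ P, 0)   # orthogonality
```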
Linear Analysis
We repeat some of the results from the Analysis of Linear Functions post.
- For a vector-valued differentiable function \(f: \IR[N] \lra \IR[M]\), with components \(f^m: \IR[N] \lra \IR\), and a point \(p \in \IR[N]\) we define the Jacobi matrix \(D_pf \in \IR[M,N]\) as $$ (D_p f)[m,n] = \frac{\del}{\del t}\Big|_{t=0} f^m(p + t \hat{n})$$
- For a real-valued differentiable function \(f: \IR[N] \lra \IR\) we define the gradient \(d f(p) \in \IR[N]\) at a point \(p \in \IR[N]\) as $$ d f(p)[n] = \frac{\del}{\del t}\Big|_{t=0} f(p + t\hat{n}) = D_pf[1,:] $$
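A finite-difference sketch of these definitions (the helpers `jacobian` and `grad` below are illustrative, not part of any library):

```python
import numpy as np

def jacobian(f, p, eps=1e-6):
    """Central-difference approximation of the Jacobi matrix D_p f in R[M,N]."""
    N, M = len(p), len(f(p))
    D = np.zeros((M, N))
    for n in range(N):
        step = np.zeros(N); step[n] = eps        # eps * n-hat
        D[:, n] = (f(p + step) - f(p - step)) / (2 * eps)
    return D

def grad(f, p, eps=1e-6):
    """Central-difference approximation of the gradient df(p) in R[N]."""
    return jacobian(lambda q: np.atleast_1d(f(q)), p, eps)[0, :]

# Example: f(p) = A @ p has D_p f = A, and f(p) = p @ p has df(p) = 2p
A = np.random.rand(3, 4)
p = np.random.rand(4)
assert np.allclose(jacobian(lambda q: A @ q, p), A, atol=1e-5)
assert np.allclose(grad(lambda q: q @ q, p), 2 * p, atol=1e-5)
```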
Calculus
Let \(a, b, c\) be real-valued functions on \(\IR[M]\) and \(\lambda \in \IR\). Let \(f: \IR[M] \lra \IR[N]\), \(g: \IR[L] \lra \IR[M]\), and pick \(p \in \IR[M]\), \(q\in \IR[L]\) with \(g(q) = p\).
We omit the points from the notation and write \(Df = D_p f, df = df(p)\).
- Linearity $$ d (a + \lambda b) = da + \lambda db \quad \in \IR[M]. $$
- Chain Rule $$ D_q (f \circ g) = (D_p f) * (D_q g) \quad \in \IR[N,L]. $$
- Chain Rule (1-dim) $$ d(a \circ g)(q) = d a (p) * D_q g \quad \in \IR[L]. $$
- Product Rule $$ d(a \cdot b) = (da).b + a.(db) \quad \in \IR[M] $$.
- Product Rule (N-dim) $$ D (f \cdot c) = D f . c + f \boxtimes d c \quad \in \IR[N,M] $$.
- Quotient Rule $$ d(a/b) = (da \cdot b - a \cdot db)/b^2 \quad \in \IR[M] $$.
- Quotient Rule (N-dim) $$ D(f/c) = (Df . c - f \boxtimes dc)/c^2 \quad \in \IR[N,M] $$.
- Power rule \(\quad d(a^n) = n a^{n-1} da\).
- Inverse Rule \(\quad d(1/a) = - da / a^2\).
Examples:
- Let \(f(p) = a * p\) for some fixed \(a \in \IR[N]\); then \(df (p) = a\).
- Let \(f(p) = p*p\), then \(df (p) = 2 p\).
- Let \(f(p) = p * G * p\), then \(d f (p) = G*p + p*G\).
- Let \(f(p) = p * A^* * A * p\), then \(d f (p) = 2 A^* * A * p\).
- Let \(f(p) = \| A*p - t \|^2\), then \(d f(p) = 2A^* * A * p - 2A^* * t\).
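The last example can be checked with a finite-difference gradient (an illustrative sketch, not a proof):

```python
import numpy as np

def grad_fd(f, p, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at p."""
    g = np.zeros_like(p)
    for n in range(len(p)):
        step = np.zeros_like(p); step[n] = eps
        g[n] = (f(p + step) - f(p - step)) / (2 * eps)
    return g

A = np.random.rand(5, 3)
t = np.random.rand(5)
p = np.random.rand(3)

f = lambda q: np.sum((A @ q - t) ** 2)          # f(p) = ||A*p - t||^2
analytic = 2 * A.T @ A @ p - 2 * A.T @ t        # df(p) = 2 A^* A p - 2 A^* t

assert np.allclose(grad_fd(f, p), analytic, atol=1e-4)
```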
Spectral Theory
Spectral Decomposition
Theorem — Spectral Decomposition (Vector Form)
Let \(A \in \IR[X,X]\) be a symmetric matrix (\(A^* = A\)). Then there exists an orthonormal system \(u_z \in \IR[X], z \in Z\) and real numbers \(\lambda_z \in \IR\) so that: $$ A = \sum_{z \in Z} \lambda_z . u_z \boxtimes u_z $$
- Each \(u_z\) is an eigenvector of \(A\) for eigenvalue \(\lambda_z\): \(A * u_z = \lambda_z u_z\).
- The eigenvalues \(\lambda_z\) are uniquely determined by \(A\) up to ordering.
Proof
We proceed by induction on \(\# X\).
The case \(\#X=1\) is trivial, since every \(1 \times 1\) matrix has the form \(\lambda\, \hat{x} \boxtimes \hat{x}\).
For the inductive step, assume the theorem holds for all symmetric matrices of dimension \(\# X - 1\).
The function \(f(a) = a * A * a \in \IR\) is continuous on the compact set \(S = \{a \in \IR[X] : a * a - 1 = 0\}\) (the unit sphere). By the extreme value theorem, \(|f|\) attains its maximum on \(S\) at some point \(u \in S\).
We apply the Lagrange multiplier criterion for extremizing \(f(a) = a * A * a\) subject to the constraint \(g(a) = a * a - 1 = 0\): at the extremum \(u\) we have \(df(u) = \lambda\, dg(u)\) for some \(\lambda \in \IR\). The derivatives are given by \(df(a) = 2 A*a\) (here the assumption \(A^* = A\) enters) and \(dg(a) = 2 a\). Therefore, at the point \(u\), we have \(A * u = \lambda u\), and \(\lambda = u * A * u = f(u)\).
If \(\lambda = 0\), then \(|f|\) vanishes on all of \(S\), so \(a * A * a = 0\) for every \(a \in \IR[X]\); since \(A\) is symmetric, polarization gives \(A = 0\), which already has the desired form. Hence we can assume \(\lambda \neq 0\) from here on.
Consider the orthogonal complement \(u^\perp = \{a \in \IR[X] : a * u = 0\}\). For \(a \in u^\perp\), we find \(a * A * u = \lambda\,(a * u) = 0\), hence \(A * a \in u^\perp\) (using \(A^* = A\)), and \(A\) restricts to a linear operator \(A_u: u^\perp \ra u^\perp\).
Now \(dim(u^\perp) = \# X - 1\), so by the inductive hypothesis we find a representation \(A_u = \sum_{x \in Z'} \lambda_x u_x \boxtimes u_x\), indexed by a set \(Z'\) of cardinality \(\# Z' = \# X - 1\). Extending \(Z'\) by a new symbol \(*\) to \(Z\), and setting \(u_* = u, \lambda_* = \lambda\), we get: $$ A = \lambda . u \boxtimes u + \sum_{x \in Z'} \lambda_x . u_x \boxtimes u_x = \sum_{z \in Z} \lambda_z . u_z \boxtimes u_z. $$
The following equivalent formulation is often used in practice.
Theorem — Spectral Decomposition (Matrix Form)
Let \(A \in \IR[X,X]\) be any symmetric matrix.
Then \(A\) can be factored as
$$
A = U * D * U^*
$$
where
- \(U \in \IR[X,X]\) is an orthogonal transformation: \(U^* * U = U * U^* = I_X\).
- \(D \in \IR[X,X]\) is diagonal: \(D = \sum_x \lambda_x . \hat{x} \boxtimes \hat{x}\).
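Numerically, this factorization is computed by `np.linalg.eigh` for symmetric matrices; a sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = B + B.T                          # a symmetric matrix, A^* = A

lam, U = np.linalg.eigh(A)           # eigenvalues and orthogonal U (columns u_z)
D = np.diag(lam)

assert np.allclose(U.T @ U, np.eye(4))     # U is orthogonal
assert np.allclose(U @ D @ U.T, A)         # A = U * D * U^*

# Vector form: A = sum_z lambda_z u_z (x) u_z
A_sum = sum(lam[z] * np.outer(U[:, z], U[:, z]) for z in range(4))
assert np.allclose(A_sum, A)
```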
Theorem — Abstract Spectral Decomposition
Let \(A \in \IR[X,X]\) be any symmetric matrix. Let \(A_\lambda = Ker(A - \lambda I_X)\) be the eigenspace of \(A\) for eigenvalue \(\lambda\).
Then
- Only finitely many \(A_\lambda\) are non-zero.
- \(\IR[X]\) decomposes into an orthogonal direct sum \(\Vsum A_\lambda\).
- Let \(P_\lambda \in \IR[X,X]\) be the orthogonal projector with image \(A_\lambda\), then: $$ A = \sum_{\lambda \in \IR} \lambda . P_\lambda $$
- The projectors are mutually orthogonal: \(P_\lambda * P_\mu = \delta_{\lambda, \mu} P_\lambda.\)
- The identity matrix decomposes as \(I_X = \sum_\lambda P_\lambda\)
Theorem — Sylvester Formula for Spectral Projectors
Let \(A \in \IR[X,X]\) be a symmetric matrix with distinct eigenvalues \(\lambda_1, \dots, \lambda_m \in \IR\).
Then the orthogonal projector onto the eigenspace for \(\lambda_i\) is given by:
$$
P_{\lambda_i} = \prod_{\substack{j=1 \\ j \neq i}}^m \frac{A - \lambda_j I_X}{\lambda_i - \lambda_j}.
$$
This is known as Sylvester’s formula (or Lagrange interpolation for matrices).
See e.g. Sylvester's formula — Wikipedia or Horn & Johnson (1985), Matrix Analysis, Section 2.5.
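A numerical illustration of the formula (a sketch; the matrix is constructed with prescribed distinct eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(2)
lam = np.array([1.0, 2.0, 5.0, -3.0])              # distinct eigenvalues
Q = np.linalg.qr(rng.standard_normal((4, 4)))[0]   # orthogonal matrix
A = Q @ np.diag(lam) @ Q.T                         # symmetric with eigenvectors Q[:,i]

I = np.eye(4)
i = 0                                              # projector onto the eigenspace of lam[0]
P = I.copy()
for j in range(4):
    if j != i:
        P = P @ (A - lam[j] * I) / (lam[i] - lam[j])

P_direct = np.outer(Q[:, i], Q[:, i])              # projector built from the eigenvector
assert np.allclose(P, P_direct)
assert np.allclose(P @ P, P)                       # projection property
```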
Singular Value Decomposition (SVD)
Theorem — SVD Vector Form
Let \(A \in \IR[Y,X]\) be any matrix. Then there is an index set \(Z\) and orthonormal systems \(u_z \in \IR[X], z \in Z\) and \(v_z \in \IR[Y], z\in Z\) so that: $$ A = \sum_{z \in Z} \sigma_z . v_z \boxtimes u_z $$ Here \(\sigma_z \in \IR, z\in Z\), are positive numbers \(\sigma_z > 0\), called the singular values of \(A\).
- The singular values are uniquely determined by \(A\) up to re-ordering.
- The size of the index set \(\# Z\) is equal to the rank of \(A\).
Proof
Existence
- The matrix \(K = A^* * A \in \IR[X,X]\) is symmetric and positive semi-definite.
- By the spectral theorem, we find an orthonormal basis \(u_x \in \IR[X], x \in X\), and real numbers \(\lambda_x\) so that \(K = \sum_{x \in X} \lambda_x . u_x \boxtimes u_x\).
- Since \(K\) is positive semidefinite, we have \(u_x * K * u_x = \lambda_x \geq 0\).
- Let \(Z \subset X\) be the subset of \(z \in X\) with \(\lambda_z > 0\). With this notation we have \(K = \sum_{z \in Z} \lambda_{z} . u_{z} \boxtimes u_z.\)
- For \(z \in Z\), let \(\sigma_z = \sqrt{\lambda_z}\), and set \(v_z = A * u_z / \sigma_z \in \IR[Y]\).
This is an orthonormal system in \(\IR[Y]\), since for \(z,w \in Z\) we have \(v_z * v_w = u_{z} * A^* * A * u_{w} / (\sigma_z \sigma_w) = \lambda_w (u_z * u_w) / (\sigma_z \sigma_w) = \delta_{z,w}\).
Now \(A = \sum_{z \in Z} \sigma_z v_z \boxtimes u_z\), since the values on the basis \(u_x, x \in X\) agree:
- For \(x \in Z \subset X\), we find \((\sum_{z \in Z} \sigma_z v_z \boxtimes u_z) * u_x = \sigma_x v_x = A * u_x\).
- For \(x \in X, x \notin Z\), we find \((\sum_{z \in Z} \sigma_z v_z \boxtimes u_z) * u_x = 0\), and \(\| A * u_x \|^2 = u_x * K * u_x = 0\), so \(A * u_x = 0\).
Here are two more equivalent versions of the SVD theorem:
Theorem — Reduced Matrix SVD
Let \(A \in \IR[Y,X]\) be any matrix. Then \(A\) can be factored as $$ A = V * \Sigma * U $$
where
- \(V \in \IR[Y,Z]\) is an orthonormal system in \(\IR[Y]\): \(V^* * V = I_Z\).
- \(U^* \in \IR[X,Z]\) is an orthonormal system in \(\IR[X]\): \(U * U^* = I_Z\).
- \(\Sigma \in \IR[Z,Z]\) is a diagonal matrix \(\Sigma = \sum_z \sigma_z . \hat{z} \boxtimes \hat{z}\).
- \(rk(U)=rk(V)=rk(\Sigma)=rk(A) = \# Z\).
Theorem — Full Matrix SVD
Let \(A \in \IR[Y,X]\) be any matrix. Then \(A\) can be factored as $$ A = V * \Sigma * U $$
where
- \(V \in \IR[Y,Y]\) is an orthonormal transformation: \(V^* * V = V * V^* = I_Y\).
- \(U \in \IR[X,X]\) is an orthonormal transformation: \(U^* * U = U * U^* = I_X\).
- \(\Sigma\) is a (non-square) diagonal matrix: \(\Sigma = \sum_z \sigma_z \, \widehat{y(z)} \boxtimes \widehat{x(z)},\) for a pair of embeddings \(x:Z\ra X\), \(y:Z\ra Y\).
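Both matrix forms are produced by `np.linalg.svd`; note that NumPy already returns the right-hand factor in the form \(U \in \IR[Z,X]\) used above (its rows are the \(u_z\)). A sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))                   # A in R[Y,X] with Y=5, X=3

# Full SVD: A = V * Sigma * U with orthogonal V (Y x Y) and U (X x X)
V, sigma, U = np.linalg.svd(A, full_matrices=True)
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(sigma)
assert np.allclose(V @ Sigma @ U, A)

# Thin SVD: NumPy keeps min(#X, #Y) singular values; for the reduced form
# of the theorem one would additionally drop the zero ones.
V_r, s_r, U_r = np.linalg.svd(A, full_matrices=False)
assert np.allclose(V_r @ np.diag(s_r) @ U_r, A)

# Vector form: A = sum_z sigma_z v_z (x) u_z
A_sum = sum(s_r[z] * np.outer(V_r[:, z], U_r[z, :]) for z in range(len(s_r)))
assert np.allclose(A_sum, A)
```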
Relationship between SVD and SD
Proposition
Let \(A \in \IR[Y,X]\) be an arbitrary matrix with singular value decomposition \(A = \sum_{z \in Z} \sigma_z v_z \boxtimes u_z\).
Then:
- \(A^* * A = \sum_{z \in Z} \sigma_z^2 \, u_z \boxtimes u_z\) is a spectral decomposition of \(A^* * A \in \IR[X,X]\).
- \(A * A^* = \sum_{z \in Z} \sigma_z^2 \, v_z \boxtimes v_z\) is a spectral decomposition of \(A * A^* \in \IR[Y,Y]\).
Proposition
Let \(A \in \IR[Y,X]\) be an arbitrary matrix, \(K = A^* * A \in \IR[X,X]\) and \(L = A * A^* \in \IR[Y,Y]\).
Let \(\IR[X] = \Vsum_\lambda K_\lambda\) and \(\IR[Y] = \Vsum_\lambda L_\lambda\) be the abstract spectral decompositions into eigenspaces.
Then:
- \(A(K_\lambda) \subseteq L_\lambda\), hence we get an induced map as restriction which we denote by \(A_\lambda: K_\lambda \ra L_\lambda\).
- \(A\) decomposes as \(A = \sum_\lambda A_\lambda\)
- \(A_\lambda^* * A_\lambda = \lambda I\) on \(K_\lambda\) and \(A_\lambda * A_\lambda^* = \lambda I\) on \(L_\lambda\)
- If \(\lambda > 0\) then \(A_\lambda\) is an isomorphism.
- For \(\lambda = 0\) we have \(A(K_0)=0\) and \(A^*(L_0)=0\).
Pseudoinverse
Definition — Pseudoinverse
Let \(A \in \IR[Y,X]\) be any matrix.
The pseudoinverse \(A^+ \in \IR[X,Y]\) is the unique map satisfying:
- \(A * A^+ * A = A\) - Normal Equation
- \(A^+ * A * A^+ = A^+\) - Normal Equation
- \((A * A^+)^* = A * A^+\) - Symmetry
- \((A^+ * A)^* = A^+ * A\) - Symmetry
Proposition
- If \(A\) is invertible, then \(A^{-1}\) is a pseudoinverse.
- If a pseudoinverse exists it is unique.
Proposition — Full Rank Cases
Let \(A \in \IR[Y,X]\).
- If \(A\) has full column rank (\(\operatorname{rk}(A) = \# X\)), then \(A^+ = (A^* * A)^{-1} * A^*.\)
- If \(A\) has full row rank (\(\operatorname{rk}(A) = \# Y\)), then \(A^+ = A^* * (A * A^*)^{-1}.\)
Proof
Assume \(A\) has full column rank; then \(A^* * A \in \IR[X,X]\) is invertible. For \(y \in \IR[Y]\), set \(b = A^+ * y\). The pseudoinverse satisfies \(A^* * A * A^+ = A^*\) (this follows from the defining properties: \(A^* * (A * A^+) = A^* * (A * A^+)^* = (A * A^+ * A)^* = A^*\)), so \(A^* * A * b = A^* * y\), showing that \(b = A^+ * y\) solves the "normal equation" \(A^* * A * b = A^* * y\).
Solving for \(b\) gives us \(A^+ * y = (A^* * A)^{-1} * A^* * y\).
Verifying that \(A^+ = (A^* * A)^{-1} * A^*\) indeed has the defining properties:
- \(A * A^+ * A = A\) since \((A^* * A)^{-1} * A^* * A = I_X\).
- \(A^+ * A * A^+ = A^+\) since \(A^+ * A = I_X\).
- The symmetry conditions follow since \(A^* * A\) is symmetric.
The row rank case follows similarly.
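A quick numerical check of the full column rank formula against `np.linalg.pinv` (a sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 3))                # full column rank (almost surely)

A_plus = np.linalg.inv(A.T @ A) @ A.T          # (A^* * A)^{-1} * A^*
assert np.allclose(A_plus, np.linalg.pinv(A))

# Defining properties
assert np.allclose(A @ A_plus @ A, A)
assert np.allclose(A_plus @ A @ A_plus, A_plus)
assert np.allclose((A @ A_plus).T, A @ A_plus)
assert np.allclose((A_plus @ A).T, A_plus @ A)
```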
Proposition — Pseudoinverse via SVD
Let \(A = \sum_{z \in Z} \sigma_z v_z \boxtimes u_z\) be the singular value decomposition of \(A\), with singular values \(\sigma_z > 0\).
Then $$ A^+ = \sum_{z \in Z} \frac{1}{\sigma_z} u_z \boxtimes v_z. $$
Proof
We compute:
$$
A * A^+ = \sum_{z} v_z \boxtimes v_z, \quad
A^+ * A = \sum_{z} u_z \boxtimes u_z
$$
These are orthogonal projectors onto \(\operatorname{Im}(A)\) and \(\operatorname{Im}(A^*)\).
Verifying the defining properties:
- \(A * A^+ * A = A\)
- \(A^+ * A * A^+ = A^+\)
- \(A * A^+\), \(A^+ * A\) symmetric.
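Finally, the SVD formula for the pseudoinverse can be verified numerically, including for a rank-deficient matrix (a sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((5, 2))
A = B @ rng.standard_normal((2, 4))            # A in R[Y,X] with rank 2

V, s, U = np.linalg.svd(A, full_matrices=False)
pos = s > 1e-12                                # strictly positive singular values

# A^+ = sum_z (1/sigma_z) u_z (x) v_z
A_plus = sum(np.outer(U[z, :], V[:, z]) / s[z] for z in range(len(s)) if pos[z])
assert np.allclose(A_plus, np.linalg.pinv(A))

# Defining properties
assert np.allclose(A @ A_plus @ A, A)
assert np.allclose(A_plus @ A @ A_plus, A_plus)
```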