Contents
1. Geometry of Vectors
2. Basic Vector Calculation
3. Hyperplanes
4. Basic Matrix Calculation and Properties
5. Tensors and Common Linear Algebra Operations
Summary

University of Michigan, Ann Arbor
Last Edit Date: 01/19/2023
Disclaimer and Terms of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Dive into Deep Learning and Professor Ambuj Tewari. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
1. Geometry of Vectors
i. Vectors in Python
A vector is a list of numbers, such as the following Python list.
v = [1, 4, 0, 5]
print(v)
[1, 4, 0, 5]
ii. Written formats of vectors: in written mathematics, the same vector is usually written as a column, e.g. $\textbf{v} = \begin{bmatrix}1\\4\\0\\5\end{bmatrix} = [1, 4, 0, 5]^T$.
iii. Visualize vectors
i. Point: We can visualize a vector simply as a point in space; going from the origin to that point gives the location and forms the vector.
ii. Direction: We can visualize vectors as directions in space. Think of the vector $\textbf{v} = [3, 2]^T$ as the location 3 units to the right and 2 units up from the origin or we can think of it as 3 steps to the right and 2 steps up.
Note: Any vector can be visualized as an arrow in the plane. In this case, every vector drawn is a representation of the vector $\textbf{v} = [3, 2]^T$.
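For example, here is a minimal matplotlib sketch (assuming matplotlib is installed) that draws $\textbf{v} = [3, 2]^T$ both as a point and as an arrow from the origin:
import matplotlib.pyplot as plt
# Draw v = [3, 2] as an arrow from the origin and as a point
fig, ax = plt.subplots()
ax.quiver(0, 0, 3, 2, angles='xy', scale_units='xy', scale=1)  # arrow from (0, 0) to (3, 2)
ax.scatter(3, 2)                                               # the same vector viewed as a point
ax.set_xlim(-1, 5)
ax.set_ylim(-1, 4)
ax.set_aspect('equal')
ax.grid(True)
plt.show()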
2. Basic Vector Calculation
i. Linear Combination
$$\boxed{c\textbf{v} + d\textbf{w} = c \begin{bmatrix}1\\1\end{bmatrix} + d \begin{bmatrix}2\\3\end{bmatrix} = \begin{bmatrix}c+2d\\c+3d\end{bmatrix}}$$

import numpy as np
c = 2
d = 4
v = np.array([1, 1])
print("v =", v)
w = np.array([2, 3])
print("w =", w)
combination = c * v + d * w
print("cv + dw =", combination)
v = [1 1]
w = [2 3]
cv + dw = [10 14]
ii. Addition
$$\text{If}~\textbf{v} = \begin{bmatrix}v_1\\v_2\end{bmatrix}~,~\textbf{w} = \begin{bmatrix}w_1\\w_2\end{bmatrix}$$$$\text{then}~\boxed{\textbf{v} + \textbf{w} = \begin{bmatrix}v_1 + w_1\\v_2 + w_2\end{bmatrix}}$$

Visualization example:
v = np.array([1, 1])
print("v =", v)
w = np.array([2, 3])
print("w =", w)
summation = v + w
print("v + w =", summation)
v = [1 1]
w = [2 3]
v + w = [3 4]
iii. Scalar Multiplication
$$\text{If}~\textbf{v} = \begin{bmatrix}v_1\\v_2\end{bmatrix}~,~\alpha~\text{is a scalar (a constant number)}$$$$\text{then}~\boxed{\alpha\textbf{v} = \begin{bmatrix}\alpha v_1\\\alpha v_2\end{bmatrix}}$$

v = np.array([1, 1])
print("v =", v)
scalar = 4
print("4v =", scalar * v)
v = [1 1]
4v = [4 4]
iv. Dot Product
$$\text{If}~\textbf{v} = \begin{bmatrix}v_1\\v_2\end{bmatrix}~,~\textbf{w} = \begin{bmatrix}w_1\\w_2\end{bmatrix}$$$$\text{then}~\boxed{\textbf{v} \cdot \textbf{w} = v_1 w_1 + v_2 w_2}$$

Note: When the dot product of two vectors is 0, the two vectors are perpendicular (orthogonal) to each other.
v = np.array([1, 1])
print("v =", v)
w = np.array([2, 3])
print("w =", w)
product = np.dot(v, w)
print("v • w =", product)
v = [1 1]
w = [2 3]
v • w = 5
v. Length
$$\boxed{\text{length} = \|\textbf{v}\| = \sqrt{\textbf{v} \cdot \textbf{v}} = \sqrt{\sum_{i = 1}^{n}v_i^2} = \sqrt{v_1^2 + v_2^2 + \cdots +v_n^2}}$$

w = np.array([2, 3])
print("w =", w)
length = np.sqrt(sum(w ** 2)) ## np.linalg.norm(w) works the same
print("||w|| =", length)
w = [2 3]
||w|| = 3.605551275463989
vi. Angle
$$\text{If }\textbf{v}\text{ and }\textbf{w}\text{ are nonzero vectors}$$$$\text{then }\boxed{\cos{\theta} = \frac{\textbf{v}\cdot\textbf{w}}{\|\textbf{v}\|~\|\textbf{w}\|}} \rightarrow \boxed{\theta = \arccos\left(\frac{\textbf{v}\cdot\textbf{w}}{\|\textbf{v}\|~\|\textbf{w}\|}\right)}~,$$$$\text{where }\theta\text{ is the angle between these two vectors.}$$

v = np.array([2, 2])
print("v =", v)
w = np.array([3, -1])
print("w =", w)
cos_theta = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
print("cos_theta =", cos_theta)
theta = np.arccos(cos_theta)
print("theta =", theta)
print("Angle degrees =", theta * (180 / np.pi))
v = [2 2]
w = [ 3 -1]
cos_theta = 0.4472135954999579
theta = 1.1071487177940904
Angle degrees = 63.43494882292201
The cosine has:
A maximum value of 1, when the two vectors point in the same direction;
A minimum value of -1, when two vectors point in opposite directions;
A value of 0, when the two vectors are orthogonal (perpendicular to each other).
Note 1: If the components of two high-dimensional vectors are sampled randomly with mean 0, the cosine of the angle between them will nearly always be close to 0 (see the quick check below).
Note 2: $\cos(\pi) = -1,~\cos(0) = 1$.
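As a quick numerical check of Note 1, the sketch below draws two random 10,000-dimensional vectors with zero-mean components and prints their cosine; the exact value depends on the random draw, but it should come out close to 0.
# Quick check of Note 1: two random high-dimensional vectors are nearly orthogonal
np.random.seed(0)
v = np.random.randn(10000)
w = np.random.randn(10000)
cos_theta = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
print("cos_theta =", cos_theta)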
3. Hyperplanes
A hyperplane is a generalization to higher dimensions of a line (two dimensions) or a plane (three dimensions).
In a d-dimensional vector space, a hyperplane has d - 1 dimensions and divides the space into two half-spaces.
i. 2-D Example
Suppose we have a column vector $\textbf{w} = \begin{bmatrix}2&1\end{bmatrix}^T$. We want to find the points $\textbf{v}$ with $\textbf{w}\cdot\textbf{v}=1$. This is equivalent to the following:
$$\|\textbf{v}\|\|\textbf{w}\|\cos(\theta)=1 \Leftrightarrow \|\textbf{v}\|\cos(\theta)=\frac{1}{\|\textbf{w}\|}=\frac{1}{\sqrt{5}}$$

From trigonometry, we know that $\|\textbf{v}\|\cos(\theta)$ is the length of the projection of the vector $\textbf{v}$ onto the direction of $\textbf{w}$.
We can say that the length of the projection of $\textbf{v}$ onto the direction of $\textbf{w}$ is exactly $\frac{1}{\|\textbf{w}\|}$.
The line through the origin in the direction of $\textbf{w}$ is $y = \frac{1}{2}x$. To find the line of points satisfying $\textbf{w}\cdot\textbf{v}=1$, we write out the inner product: $\begin{bmatrix} 2\\1 \end{bmatrix}\cdot\begin{bmatrix} x\\y \end{bmatrix}=1 \Rightarrow 2x + y = 1$. Equivalently, $y = 1 - 2x$.
As a result, if a point is:
On the lower-left side of the line $y = 1 - 2x$ (the side containing the origin), the corresponding vector $\textbf{v}$ satisfies $\textbf{v}\cdot\textbf{w}<1$;
On the line $y = 1 - 2x$, it satisfies $\textbf{v}\cdot\textbf{w}=1$;
On the upper-right side of the line $y = 1 - 2x$, it satisfies $\textbf{v}\cdot\textbf{w}>1$.
The sketch below checks a few concrete points.
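As a small check (the three test points below are chosen only for illustration), we can compute $\textbf{v} \cdot \textbf{w}$ for a point on each side of the line and one on it:
# Check a few points against the hyperplane 2x + y = 1
w = np.array([2, 1])
points = {"below the line (origin)": np.array([0, 0]),   # 2*0 + 0 = 0 < 1
          "on the line":             np.array([0, 1]),   # 2*0 + 1 = 1
          "above the line":          np.array([1, 1])}   # 2*1 + 1 = 3 > 1
for name, v in points.items():
    print(name, ": v . w =", np.dot(v, w))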
ii. 3-D Example
The story in higher dimensions is much the same.
Take $\textbf{w} = \begin{bmatrix}1&2&3\end{bmatrix}^T$ as an example. We still want to find the vectors $\textbf{v}$ that satisfy $\textbf{v} \cdot \textbf{w} = 1$. Writing out the inner product gives $\begin{bmatrix} 1\\2\\3 \end{bmatrix}\cdot\begin{bmatrix} x\\y\\z \end{bmatrix}=1 \Rightarrow x + 2y + 3z = 1$. This forms a plane.
iii. Higher Dimension Example
While our ability to visualize runs out at this point, nothing stops us from doing this in tens, hundreds, or billions of dimensions. This occurs often when thinking about machine learned models. In this context, such hyperplanes are often referred to as decision planes.
Here is a higher-dimensional example. We can produce a reasonable model to classify tiny images of t-shirts and trousers from the Fashion MNIST dataset by taking the vector between their means to define the decision plane and eyeballing a crude threshold. Each image has $28 \times 28$ pixels.
import warnings
warnings.filterwarnings('ignore')
# Load in the dataset
from tensorflow.keras.datasets import fashion_mnist
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
# Keep only two classes each with 6,000 examples
X_train_0 = x_train[y_train == 0] # label id 0 is for T-shirt/top
X_train_1 = x_train[y_train == 1] # label id 1 is for Trouser
# Full test set has 10,000 images, we'll just keep the 2,000 that have labels 0 or 1
X_test = x_test[(y_test == 0) | (y_test == 1)]
Y_test = y_test[(y_test == 0) | (y_test == 1)]
# Compute averages
ave_0 = np.mean(X_train_0, axis=0)
ave_1 = np.mean(X_train_1, axis=0)
It can be informative to examine these averages in detail, so let’s plot what they look like. In this case, we see that the average indeed resembles a blurry image of a t-shirt.
import matplotlib.pyplot as plt
# Plot average t-shirt
plt.imshow(ave_0.reshape(28, 28).tolist(), cmap='Greys')
plt.show()
In the second case, we again see that the average resembles a blurry image of trousers.
# Plot average trousers
plt.imshow(ave_1.reshape(28, 28).tolist(), cmap='Greys')
plt.show()
In a fully machine learned solution, we would learn the threshold from the dataset. In this case, we simply use a threshold that looked good on the training data by human trial and error.
# Print test set accuracy with eyeballed threshold
w = (ave_1 - ave_0).T
predictions = X_test.reshape(2000, 28*28).dot(w.flatten()) > -1500000
# Accuracy
print("Accuracy =", np.mean(1.0*predictions == Y_test))
Accuracy = 0.7905
4. Basic Matrix Calculation and Properties
i. Linear Transformation
Suppose we have a $2 \times 2$ matrix $\textbf{A} = \begin{bmatrix}a&b\\c&d\end{bmatrix}$.
If we want to apply this to an arbitrary vector $\mathbf{v} = [x, y]^\top$, we multiply and see that
$$ \begin{aligned} \mathbf{A}\mathbf{v} & = \begin{bmatrix}a & b \\ c & d\end{bmatrix}\begin{bmatrix}x \\ y\end{bmatrix} \\ & = \begin{bmatrix}ax+by\\ cx+dy\end{bmatrix} \\ & = x\begin{bmatrix}a \\ c\end{bmatrix} + y\begin{bmatrix}b \\d\end{bmatrix} \\ & = x\left\{\mathbf{A}\begin{bmatrix}1\\0\end{bmatrix}\right\} + y\left\{\mathbf{A}\begin{bmatrix}0\\1\end{bmatrix}\right\}. \end{aligned} $$

Note: The two vectors $[1,0]^T$ and $[0,1]^T$ are an example of a basis, where we can write any vector in our space as a weighted sum of these basis vectors.
They can help us reduce an infinite problem (what happens to any pair of real numbers) to a finite one (what happens to these specific vectors).
$$\mathbf{A}\begin{bmatrix}1\\0\end{bmatrix}=\begin{bmatrix}a&b\\c&d\end{bmatrix}\begin{bmatrix}1\\0\end{bmatrix}=\begin{bmatrix}a\cdot1+b\cdot0\\c\cdot1+d\cdot0\end{bmatrix}=\begin{bmatrix}a\\c\end{bmatrix}$$$$\mathbf{A}\begin{bmatrix}0\\1\end{bmatrix}=\begin{bmatrix}a&b\\c&d\end{bmatrix}\begin{bmatrix}0\\1\end{bmatrix}=\begin{bmatrix}a\cdot0+b\cdot1\\c\cdot0+d\cdot1\end{bmatrix}=\begin{bmatrix}b\\d\end{bmatrix}$$

Visualization:
We have a matrix $\textbf{A} = \begin{bmatrix}1&2\\-1&3\end{bmatrix}$ and a vector $\textbf{v}=\begin{bmatrix}2\\-1\end{bmatrix}$.
We can see that the vector $\textbf{v} = 2\cdot \begin{bmatrix}1\\0\end{bmatrix} + -1\cdot\begin{bmatrix}0\\1\end{bmatrix} = \begin{bmatrix}2\\0\end{bmatrix} + \begin{bmatrix}0\\-1\end{bmatrix} = \begin{bmatrix}2\\-1\end{bmatrix}$.
Now let us bring the matrix $\textbf{A}$ into the calculation, which becomes
$$2\cdot\left\{\textbf{A}\begin{bmatrix}1\\0\end{bmatrix}\right\} - 1\cdot\left\{\textbf{A}\begin{bmatrix}0\\1\end{bmatrix}\right\} = 2\begin{bmatrix}1\\-1\end{bmatrix} - \begin{bmatrix}2\\3\end{bmatrix} = \begin{bmatrix}0\\-5\end{bmatrix}.$$

As we can see from the calculation above, matrix multiplication can skew, rotate, and scale the grid, but the grid structure must remain.
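Here is a short numpy check of this decomposition, using the same $\textbf{A}$ and $\textbf{v}$ as above (both printed results should be [0 -5]):
# Verify A v = 2 * (A e1) + (-1) * (A e2) for A = [[1, 2], [-1, 3]] and v = [2, -1]
A = np.array([[1, 2], [-1, 3]])
v = np.array([2, -1])
e1 = np.array([1, 0])
e2 = np.array([0, 1])
print("A v =", A @ v)
print("2 (A e1) - (A e2) =", 2 * (A @ e1) - (A @ e2))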
In some cases, the distortion can be severe. Suppose we have a matrix $\textbf{B} = \begin{bmatrix}2&-1\\4&-2\end{bmatrix}$, which compresses the entire two-dimensional plane down to a single line. This is because its columns are linearly dependent.
ii. Linear Dependence
Suppose we have a matrix $\textbf{B} = \begin{bmatrix}2&-1\\4&-2\end{bmatrix}$.
This matrix compresses the entire plane down to the single line $y = 2x$.
Consider the two columns of $\textbf{B}$: $\textbf{b}_1 = \begin{bmatrix}2\\4\end{bmatrix}$ and $\textbf{b}_2 = \begin{bmatrix}-1\\-2\end{bmatrix}$. Since $\textbf{b}_1 = -2 \cdot \textbf{b}_2$, we can write any linear combination of those two columns entirely in terms of $\textbf{b}_2$, because
$$a_1\textbf{b}_1 + a_2\textbf{b}_2 = -2a_1\textbf{b}_2 + a_2\textbf{b}_2 = (a_2 - 2a_1)\textbf{b}_2.$$

We could equally well write it in terms of $\textbf{b}_1$.
The above tells us that one of the columns is redundant because it does not define a unique direction in space. Rewriting the linear dependence $\textbf{b}_1 = -2 \cdot \textbf{b}_2$ as $\textbf{b}_1 + 2 \cdot \textbf{b}_2 = \textbf{0}$ makes this easier to see.
In general, we say that a collection of vectors $\textbf{v}_1, \ldots, \textbf{v}_k$ are linearly dependent if there exist coefficients $a_1, \ldots, a_k$, not all equal to zero, so that
$$\boxed{\sum_{i=1}^{k} a_i \textbf{v}_i = \textbf{0}}.$$

As a result, a linear dependence in the columns of a matrix is a witness to the fact that our matrix is compressing the space down to some lower dimension.
If there is no linear dependence we say the vectors are linearly independent. If the columns of a matrix are linearly independent, no compression occurs and the operation can be undone.
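To see the compression numerically, here is a minimal sketch (the input vectors are arbitrary) that multiplies $\textbf{B}$ by a few vectors and checks that every output lies on the line $y = 2x$:
# Every output of B lies on the line y = 2x because its columns are linearly dependent
B = np.array([[2, -1], [4, -2]])
for x in [np.array([1, 0]), np.array([0, 1]), np.array([3, 5]), np.array([-2, 7])]:
    out = B @ x
    print("B @", x, "=", out, "| second component == 2 * first:", out[1] == 2 * out[0])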
iii. Rank
The rank of a matrix tells us the dimension of the space that the matrix maps into. The rank of a matrix is the largest number of linearly independent columns among all subsets of columns.
Rank = 1 example:
$$\textbf{A} = \begin{bmatrix}1&3&10\\2&6&20\\3&9&30\end{bmatrix} \rightarrow \textbf{R} = \begin{bmatrix}1&3&10\\0&0&0\\0&0&0\end{bmatrix}$$

Since there is only 1 row that is not all 0, we get $\text{rank}(\textbf{A})=1$.
A = np.array(
[[1, 3, 10],
[2, 6, 20],
[3, 9, 30]]
)
print("rank(A) =", np.linalg.matrix_rank(A))
rank(A) = 1
Rank = 2 example 1:
$$\textbf{B} = \begin{bmatrix}1&2&3\\4&5&6\\7&8&9\end{bmatrix} \rightarrow \begin{bmatrix}1&2&3\\0&-3&-6\\0&-6&-12\end{bmatrix} \rightarrow \textbf{R} =\begin{bmatrix}1&2&3\\0&-3&-6\\0&0&0\end{bmatrix}$$

Since there are 2 rows that are not all 0, we get $\text{rank}(\textbf{B})=2$.
B = np.array(
[[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]
)
print("rank(B) =", np.linalg.matrix_rank(B))
rank(B) = 2
Rank = 2 example 2:
$$ \mathbf{C} = \begin{bmatrix} 1& 3 & 0 & -1 & 0 \\ -1 & 0 & 1 & 1 & -1 \\ 0 & 3 & 1 & 0 & -1 \\ 2 & 3 & -1 & -2 & 1 \end{bmatrix} \rightarrow \begin{bmatrix} 1& 3 & 0 & -1 & 0 \\ 0 & 3 & 1 & 0 & -1 \\ 0 & 3 & 1 & 0 & -1 \\ 2 & 3 & -1 & -2 & 1 \end{bmatrix} \rightarrow \begin{bmatrix} 1& 3 & 0 & -1 & 0 \\ 0 & 3 & 1 & 0 & -1 \\ 0 & 0 & 0 & 0 & 0 \\ 2 & 3 & -1 & -2 & 1 \end{bmatrix} \rightarrow \begin{bmatrix} 1& 3 & 0 & -1 & 0 \\ 0 & 3 & 1 & 0 & -1 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & -3 & -1 & 0 & 1 \end{bmatrix} $$$$ \rightarrow \mathbf{R} = \begin{bmatrix} 1& 3 & 0 & -1 & 0 \\ 0 & 3 & 1 & 0 & -1 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} $$

Since there are 2 rows that are not all 0, we get $\text{rank}(\textbf{C})=2$.
C = np.array(
[[ 1, 3, 0, -1, 0],
[-1, 0, 1, 1, -1],
[ 0, 3, 1, 0, -1],
[ 2, 3, -1, -2, 1]]
)
print("rank(C) =", np.linalg.matrix_rank(C))
rank(C) = 2
iv. Invertibility
Multiplication by a full-rank matrix (i.e., an $n \times n$ matrix $\mathbf{A}$ with rank $n$) can be undone. That means:
Consider a matrix $\textbf{A}$ of size $n \times n$.
Note: $\textbf{I}$ is the identity matrix, which is the matrix with ones along the diagonal, and zeros elsewhere. It is the matrix which leaves our data unchanged when applied (i.e., $\textbf{A}\textbf{I}=\textbf{A}$). $\textbf{A}^{-1}$ means the inverse of $\textbf{A}$.
To find a matrix which undoes what our matrix $\mathbf{A}$ has done, we want to find a matrix $\mathbf{A}^{-1}$ such that
$$ \boxed{\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}}. $$

For a $2 \times 2$ matrix $\textbf{A} = \begin{bmatrix}a&b\\c&d\end{bmatrix}$ with $ad - bc \neq 0$, we can find its inverse matrix:
$$\boxed{\mathbf{A}^{-1}= \frac{1}{ad-bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}}. $$

v. Useful Rules of Matrices
Laws of Matrix Algebra
$\textbf{A}+\textbf{B} = \textbf{B}+\textbf{A}$ (Commutative Law of Addition)
$\textbf{A}+\textbf{B}+\textbf{C} = \textbf{A} +(\textbf{B}+\textbf{C}) = (\textbf{A}+\textbf{B})+\textbf{C}$ (Associative law of addition)
$\textbf{ABC} = \textbf{A}(\textbf{BC}) = (\textbf{AB})\textbf{C}$ (Associative law of multiplication)
$\textbf{A}(\textbf{B}+\textbf{C}) = \textbf{AB} + \textbf{AC}$ (Distributive law of matrix algebra)
$\alpha(\textbf{A}+\textbf{B}) = \alpha\textbf{A} + \alpha\textbf{B}$, where $\alpha$ is a constant number
Rules for Transposition of Matrix
$(\textbf{A}^T)^T = \textbf{A}$
$(\textbf{A} + \textbf{B})^T = \textbf{A}^T + \textbf{B}^T$
$(\textbf{A}\textbf{B})^T = \textbf{B}^T \textbf{A}^T$
$(\textbf{ABC})^T = \textbf{C}^T\textbf{B}^T\textbf{A}^T$
Rules for the Inverse of a Matrix (checked numerically in the sketch after this list)
$\textbf{AI} = \textbf{IA} = \textbf{A}$
$\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$
$(\textbf{A}^{-1})^{-1} = \textbf{A}$
$(\textbf{A}\textbf{B})^{-1} = \textbf{B}^{-1} \textbf{A}^{-1}$
$(\textbf{ABC})^{-1} = \textbf{C}^{-1}\textbf{B}^{-1}\textbf{A}^{-1}$
$(\textbf{A}^T)^{-1} = (\textbf{A}^{-1})^{T}$
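These rules are easy to spot-check numerically. Below is a minimal sketch (the random matrices are only for illustration) that verifies the $2 \times 2$ inverse formula from above together with two of the inverse rules:
# Numerically spot-check the 2 x 2 inverse formula and a few inverse rules
np.random.seed(1)
A = np.random.randn(2, 2)
B = np.random.randn(2, 2)
I = np.eye(2)
# 2 x 2 inverse formula: A^{-1} = (1 / (ad - bc)) * [[d, -b], [-c, a]]
a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
A_inv = np.array([[d, -b], [-c, a]]) / (a * d - b * c)
print("formula matches np.linalg.inv:", np.allclose(A_inv, np.linalg.inv(A)))
print("A A^-1 = I:", np.allclose(A @ A_inv, I))
print("(AB)^-1 = B^-1 A^-1:", np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A)))
print("(A^T)^-1 = (A^-1)^T:", np.allclose(np.linalg.inv(A.T), np.linalg.inv(A).T))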
5. Tensors and Common Linear Algebra Operations
In this section, we will dive more deeply into tensor contractions (the tensor equivalent of matrix multiplication), and see how they can provide a unified view of a number of matrix and vector operations.
With matrices and vectors, we know how to multiply them to transform data. We need a similar definition for tensors if they are to be useful to us. Think about matrix multiplication:
$$ \mathbf{C} = \mathbf{A}\mathbf{B}, $$

or equivalently
$$ c_{i, j} = \sum_{k} a_{i, k}b_{k, j}.$$

This pattern is one we can repeat for tensors. For tensors, there is no single choice of what to sum over that can be universally chosen, so we need to specify exactly which indices we want to sum over. For instance, we could consider
$$ y_{il} = \sum_{jk} x_{ijkl}a_{jk}. $$

Such a transformation is called a tensor contraction. It can represent a far more flexible family of transformations than matrix multiplication alone.
As an often-used notational simplification, we can notice that the sum is over exactly those indices that occur more than once in the expression; thus people often work with Einstein notation, where the summation is implicitly taken over all repeated indices. This gives the compact expression:
$$ y_{il} = x_{ijkl}a_{jk}. $$

Common Examples from Linear Algebra
Let us see how many of the linear algebraic definitions we have seen before can be expressed in this compressed tensor notation:
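For instance, a few familiar operations take the following compact forms (a representative sample, using the same index conventions as above):
Dot product: $\textbf{v} \cdot \textbf{w} = v_i w_i$
Squared length: $\|\textbf{v}\|^2 = v_i v_i$
Matrix-vector product: $(\textbf{A}\textbf{v})_i = a_{ij} v_j$
Matrix-matrix product: $(\textbf{A}\textbf{B})_{ik} = a_{ij} b_{jk}$
Trace: $\text{tr}(\textbf{A}) = a_{ii}$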
In this way, we can replace a myriad of specialized notations with short tensor expressions.
Expressing in Code
Tensors may flexibly be operated on in code as well.
# Define tensors
B = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = np.array([[1, 2], [3, 4]])
v = np.array([1, 2])
# Print out the shapes
print(A.shape, B.shape, v.shape)
(2, 2) (2, 2, 3) (2,)
Einstein summation is implemented directly in NumPy as np.einsum. The indices that occur in the Einstein summation can be passed as a string, followed by the tensors that are being acted upon.
For instance, to implement matrix multiplication, we can consider the Einstein summation seen above ($(\mathbf{A}\mathbf{v})_i = a_{ij}v_j$) and strip out the indices themselves to get the implementation:
# Reimplement matrix multiplication
print(np.einsum("ij, j -> i", A, v))
[ 5 11]
print(A.dot(v))
[ 5 11]
This is a highly flexible notation. For instance if we want to compute what would be traditionally written as
$$ c_{kl} = \sum_{ij} b_{ijk}a_{il}v_j, $$

it can be implemented via Einstein summation as:
print(np.einsum("ijk, il, j -> kl", B, A, v))
[[ 90 126]
 [102 144]
 [114 162]]
This string notation allows for a concise and efficient representation of tensor contractions in code. np.einsum also accepts an alternative calling convention, in which each operand is followed by a list of integer axis labels and the output axes are given last. As an example, we can compute $\mathbf{A}^4$ with np.linalg.matrix_power and then reproduce it with this alternative einsum notation:
print(np.linalg.matrix_power(A, 4)) # A to the fourth power
[[199 290]
 [435 634]]
print(np.einsum(A, [0, 1],
A, [1, 2],
A, [2, 3],
A, [3, 4],
[0, 4])) # A to the fourth power via Einstein summation
[[199 290]
 [435 634]]