Content
1. Basics of Regression 2. Linear Regression 3. Gradient Descent 4. Minibatch Stochastic Gradient Descent 5. The Normal Distribution and Squared Loss 6. Neural Network and Biology Summary <- Go BackUniversity of Michigan at Ann Arbor
Last Edit Date: 02/03/2023
Disclaimer and Term of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file
This personal note is adapted from Dive into Deep Learning and Professor Ambuj Tewari. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License..
i. Kinds of Machine Learning Problems:
Supervised learning: have labeled data
Unsupervised learning: have unlabeled data
Reinforcement learning: interact with an environment and receive rewards
ii. Regression Regression problems pop up whenever we want to predict a numerical value. Common examples include predicting prices (of homes, stocks, etc.), predicting the length of stay (for patients in the hospital), forecasting demand (for retail sales), among countless others.
Modeling the relationship between input variables and a real-valued output
Input variable also called:
Output variable also called:
"Regress $y$ on $x$" means "run a regression with $x$ as input, $y$ as output".
Linear regression may be both the simplest and most popular among the standard tools for tackling regression problems.
i. Model
Simple Linear Regression Model:
$$\boxed{Y_i = \beta_0 + \beta_1 X_i+ \varepsilon_i}$$General Model:
$$\boxed{\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}},$$$$\text{where }\boldsymbol{Y} = \begin{bmatrix}Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}_{~n\times 1}, \boldsymbol{X} = \begin{bmatrix}1 & X_{11} & X_{12} & \dots & X_{1p} \\ 1 & X_{21} & X_{22} & \dots & X_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \dots & X_{np}\end{bmatrix}_{~n\times (1+p)}, \boldsymbol{\beta} = \begin{bmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}_{~(1+p)\times 1}, \boldsymbol{\varepsilon} = \begin{bmatrix}\varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}_{~n\times 1}$$when we have $n$ observations and $p$ inputs in each observation.
ii. Assumption (for general model)
$\boldsymbol{\varepsilon} \sim N(0, \sigma^2 \boldsymbol{I})$
$E(\boldsymbol{\varepsilon}) = 0$
$Var(\boldsymbol{\varepsilon}) = \sigma^2 \boldsymbol{I} = \begin{bmatrix}\sigma^2 & 0 & \dots & 0 \\ 0 & \sigma^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma^2 \end{bmatrix}$
Note the variance-covariance matrix tells the $\varepsilon_i$ are independent and have constant variance $\sigma^2$.
All covariance terms are 0, only diagonal is $\sigma^2$, which means they are uncorrelated. As a result, under normality, this indicating independence.
iii. Loss Function
Loss functions can quantify the distance between the real and predicted values of the target. Usually, the loss is a non-negative number where smaller values are better and the perfect situation is that the loss equals to 0.
For regression problem, the most common loss function is squared error. When our prediction for an example $i$ is $\hat{Y}_i$ adnd the corresponding true label is $Y_{i}$, the square error is:
$$\boxed{\ell(\beta_i,\beta_0) = \frac{1}{2} \left(Y_i - \hat{Y}_i\right)^2}$$The constant $\frac{1}{2}$ makes no real difference but proves to be notationally convenient, since it cancels out when we take the derivative of the loss. The following plot shows how the loss function works.
Note: Because of the quadratic form of the loss, large differences between estimates $\hat{Y}_i$ and target $Y_i$ can lead to even larger contributions to the loss.
This can be a double-edge sword. While it encourages the model to avoid large errors, it can also lead to excessive sensitivity to anomalous data.
To measure the quality of a model on the entire dataset of $n$ examples, we simply average (or equivalently, sum) the losses on the training set:
$$L(\boldsymbol{\beta},\beta_0) = \frac{1}{n} \sum_{i = 1}^{n}\ell(\beta_i,\beta_0) = \boxed{\frac{1}{2n} \sum_{i = 1}^{n}\left(Y_i - \beta_i \boldsymbol{X} - \beta_0\right)^2}$$General form ($\beta_0$ is actually included in $\boldsymbol{\beta}$):
$$L(\boldsymbol{\beta}) = \frac{1}{2n} \left(\boldsymbol{\boldsymbol{Y} - \beta}\boldsymbol{X}\right)^T \left(\boldsymbol{Y} - \boldsymbol{\beta}\boldsymbol{X}\right) = \boxed{\frac{1}{2n}\left(\boldsymbol{Y}^T\boldsymbol{Y} - \boldsymbol{\beta}^T \boldsymbol{X}^T \boldsymbol{Y}\right)}$$iv. Minimization
When training the model, we want to find parameters ($\boldsymbol{\hat{\beta}}$, $\hat{\beta}_0$) that minimize the total loss across all training examples:
$$\boldsymbol{\hat{\beta}}, \hat{\beta}_0 = \operatorname*{argmin}_{\boldsymbol{\beta},~\boldsymbol{\varepsilon}}L(\boldsymbol{\beta},\beta_0)$$In order to minimize the loss, we can do the partial derivative of the loss function with respect to $\boldsymbol{\beta}$ and setting it equals to 0:
$$ \triangledown L(\boldsymbol{\beta}) = \frac{\partial_L}{\partial_{\beta}} = \frac{\partial}{\partial_{\beta}}\left[ \left(\boldsymbol{\boldsymbol{Y} - \boldsymbol{X}\beta}\right)^T \left(\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}\right) \right] $$$$ ~~~~~~\Rightarrow \frac{\partial}{\partial_{\beta}}\left[\left(\boldsymbol{\boldsymbol{Y}^T - \beta}^T\boldsymbol{X}^T\right) \left(\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}\right) \right] $$$$ ~~~~~~~~~~~~~~~~~~~~~~~~~~\Rightarrow \frac{\partial}{\partial_{\beta}} \left(\boldsymbol{\boldsymbol{Y}^T \boldsymbol{Y} - \beta}^T\boldsymbol{X}^T \boldsymbol{Y} - \boldsymbol{Y}^T \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\beta}^T\boldsymbol{X}^T \boldsymbol{X}\boldsymbol{\beta} \right) $$$$ ~~~~~~~~~~~~~~~~~~~\Rightarrow \frac{\partial}{\partial_{\beta}}\left(\boldsymbol{\boldsymbol{Y}^T \boldsymbol{Y} - 2\beta}^T\boldsymbol{X}^T \boldsymbol{Y} + \boldsymbol{\beta}^T\boldsymbol{X}^T \boldsymbol{X}\boldsymbol{\beta} \right) = 0 $$Now we can assign it equal to 0. $$ -2\boldsymbol{X}^T \boldsymbol{Y} + 2\boldsymbol{X}^T \boldsymbol{X}\boldsymbol{\beta}=0 $$
$$ \Rightarrow 2\boldsymbol{X}^T \boldsymbol{X}\boldsymbol{\beta}=2\boldsymbol{X}^T \boldsymbol{Y}~~~~~~~~~~~~~~~~ $$$$ \Rightarrow 2\boldsymbol{X}^T \boldsymbol{X}\boldsymbol{\beta}=2\boldsymbol{X}^T \boldsymbol{Y}~~~~~~~~~~~~~~~~ $$$$ \Rightarrow \boldsymbol{X}^T \boldsymbol{X}\boldsymbol{\beta}=\boldsymbol{X}^T \boldsymbol{Y}~~~~~~~~~~~~~~~~~~~~ $$$$ ~~~~~~~~~~~~~~\Rightarrow \left(\boldsymbol{X}^T \boldsymbol{X}\right)^{-1}\boldsymbol{X}^T \boldsymbol{X}\boldsymbol{\beta}=\left(\boldsymbol{X}^T \boldsymbol{X}\right)^{-1}\boldsymbol{X}^T \boldsymbol{Y} $$$$ \Rightarrow \boxed{\boldsymbol{\hat{\beta}}=\left(\boldsymbol{X}^T \boldsymbol{X}\right)^{-1}\boldsymbol{X}^T \boldsymbol{Y}}~~~~~~~~ $$We can also achive this using the chain rule as following.
We know that $L\left(\boldsymbol{\beta}\right) = \frac{1}{2} \|\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}\|^2$, then we can have $G\left(\boldsymbol{\beta}\right) = \boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}$. Alternatively, we can have a general function $F\left(u\right) = \frac{1}{2}\|u\|^2$, where $u = \boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}$.
Now we can calculate the derivative using chain rule:
$$\triangledown L\left(\boldsymbol{\beta}\right) = \triangledown F(u) = F'(u) \times u'$$$$~~~~~~~~= \boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta} \times (-\boldsymbol{X}^T)$$$$~~~~~~~ =-\boldsymbol{X}^T\boldsymbol{Y} + \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta}$$Note: We call $\triangledown L(\boldsymbol{\beta})$ the gradient of $L(\boldsymbol{\beta})$.
Now we can assign it equal to 0, and we will get the same answer as above.
$$-\boldsymbol{X}^T\boldsymbol{Y} + \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta} = 0$$$$\Rightarrow \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta} = \boldsymbol{X}^T\boldsymbol{Y}~~~~~~~~~~~~~$$$$~~~~~~~~~~~~~~~~~~~~\Rightarrow \left(\boldsymbol{X}^T \boldsymbol{X}\right)^{-1}\boldsymbol{X}^T \boldsymbol{X}\boldsymbol{\beta}=\left(\boldsymbol{X}^T \boldsymbol{X}\right)^{-1}\boldsymbol{X}^T \boldsymbol{Y}$$$$\Rightarrow \boxed{\boldsymbol{\hat{\beta}}=\left(\boldsymbol{X}^T \boldsymbol{X}\right)^{-1}\boldsymbol{X}^T \boldsymbol{Y}}~~$$The $\boldsymbol{\hat{\beta}}$ we get above is the optimal result we can get in our predicted model. Note that the solution will only be unique when $\boldsymbol{X^T X}$ is invertible (i.e., when the columns of the design matrix are linearly independent).
Tip: Useful matrix / vector derivative ($A$ is a scalar and $\boldsymbol{W}$ is a valid vector or matirx)
$F\left(\boldsymbol{X}\right)$ | $\rightarrow$ | $\triangledown F\left(\boldsymbol{X}\right)$ |
---|---|---|
$A\boldsymbol{X}$ | $\rightarrow$ | $A$ |
$\boldsymbol{X}^TA\boldsymbol{X}$ | $\rightarrow$ | $2A\boldsymbol{X}$ |
$\boldsymbol{W}^T\boldsymbol{X}$ | $\rightarrow$ | $\boldsymbol{X}$ |
$\boldsymbol{X}^T\boldsymbol{X}$ | $\rightarrow$ | $2\boldsymbol{X}$ |
Learn more Math relates to Linear Regression:
UC Davis STA 108 - Applied Statistical Methods: Regression Analysis
i. Introduction
Although analytic solutions allow for nice mathematical analysis, the requirement of an analytic solution is so restrictive that it would exclude almost all exciting aspects of deep learning.
Key technique for incrementally lowering the loss function is gradient descent. It iteratively reduces the loss by updating the parameters in the direction of the negative gradient.
Let's say you are on a mountain and you can't see too far. You goal is to find the quickest way to the base of the mountain. Best way to go about:
For any objective function $L\left(\boldsymbol{\beta}\right)$, gradient desent takes the form:
$$\boxed{\boldsymbol{\hat{\beta}} = \boldsymbol{\beta} - \eta \triangledown L\left(\boldsymbol{\beta}\right)},$$where $\triangledown L(\boldsymbol{\beta}) = \begin{bmatrix} \frac{\partial{L_1}}{\partial{\beta_1}} & \frac{\partial{L_2}}{\partial{\beta_1}} & \dots & \frac{\partial{L_p}}{\partial{\beta_1}} \\ \frac{\partial{L_1}}{\partial{\beta_2}} & \frac{\partial{L_2}}{\partial{\beta_2}} & \dots & \frac{\partial{L_p}}{\partial{\beta_2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial{L_1}}{\partial{\beta_n}} & \frac{\partial{L_2}}{\partial{\beta_n}} & \dots & \frac{\partial{L_p}}{\partial{\beta_n}} \end{bmatrix}_{~n\times p} $, $\eta$ is the step size (learning rate), which controls how much we move.
You can think of $\beta$ as the current position, $L(\beta)$ as a cost function (in this metaphor, it would measure the height of the position from the base), $-\eta L(\beta)$ as the direction of the steepest slope, $\eta$ as the size of the baby step or the learning rate. A good choice of $\eta$ can determine the usefulness of Gradient Descent.
ii. Algorithm for Simple linear regression
$$\boxed{ \hat{\beta}_1 = \beta_1 - \eta \triangledown L\left(\beta_1 \right) = \beta_1 -\frac{\eta}{n}\sum_{i =1}^nX_i(Y_i - \beta_1 X_i - \beta_0)} $$$$\boxed{ \hat{\beta}_0 = \beta_0 - \eta \triangledown L\left(\beta_0 \right) = \beta_0 -\frac{\eta}{n}\sum_{i =1}^nX_i(Y_i - \beta_1 X_i - \beta_0)} $$We can use Python code to achieve this as following:
import numpy as np
import math
def ols_grad_descent(x, y, epochs, initial_beta1 = 0, initial_beta0 = 1, eta = 0.1):
beta1 = initial_beta1
beta0 = initial_beta0
n = x.shape[0]
losses = []
for i in range(epochs):
y_pred = (beta1 * x) + beta0
losses.append(np.mean(np.square(y - y_pred)))
grad_beta1 = (-1 / n) * np.sum(x * (y - y_pred))
grad_beta0 = (-1 / n) * np.sum(y - y_pred)
beta1 = beta1 - eta * grad_beta1
beta0 = beta0 - eta * grad_beta0
if i % 50 == 0:
print("Step {}: Current loss is {}, slope (beta1) is {}, and intercept (beta0) is {}."
.format(i + 1, round(losses[i], 4), round(beta1, 4) , round(beta0, 4)))
return beta1, beta0, losses
Let's try this algorithm on a simulation. We can control the baby steps in the algorithm using the epoch parameter.
rng = np.random.RandomState(1)
x = 10 * rng.rand(1000)
y = 2 * x - 5 + rng.randn(1000)
beta1, beta0, losses = ols_grad_descent(x, y, epochs = 1000, initial_beta1 = 0, initial_beta0 = 1, eta= 0.01)
Step 1: Current loss is 50.8109, slope (beta1) is 0.3696, and intercept (beta0) is 1.0406. Step 51: Current loss is 8.3511, slope (beta1) is 1.1844, and intercept (beta0) is 0.4467. Step 101: Current loss is 6.768, slope (beta1) is 1.2782, and intercept (beta0) is -0.1738. Step 151: Current loss is 5.5276, slope (beta1) is 1.3612, and intercept (beta0) is -0.7231. Step 201: Current loss is 4.5556, slope (beta1) is 1.4347, and intercept (beta0) is -1.2093. Step 251: Current loss is 3.794, slope (beta1) is 1.4997, and intercept (beta0) is -1.6397. Step 301: Current loss is 3.1973, slope (beta1) is 1.5573, and intercept (beta0) is -2.0207. Step 351: Current loss is 2.7297, slope (beta1) is 1.6082, and intercept (beta0) is -2.3579. Step 401: Current loss is 2.3633, slope (beta1) is 1.6533, and intercept (beta0) is -2.6564. Step 451: Current loss is 2.0762, slope (beta1) is 1.6932, and intercept (beta0) is -2.9207. Step 501: Current loss is 1.8513, slope (beta1) is 1.7286, and intercept (beta0) is -3.1546. Step 551: Current loss is 1.675, slope (beta1) is 1.7599, and intercept (beta0) is -3.3617. Step 601: Current loss is 1.5369, slope (beta1) is 1.7876, and intercept (beta0) is -3.5449. Step 651: Current loss is 1.4287, slope (beta1) is 1.8121, and intercept (beta0) is -3.7072. Step 701: Current loss is 1.3439, slope (beta1) is 1.8338, and intercept (beta0) is -3.8508. Step 751: Current loss is 1.2775, slope (beta1) is 1.853, and intercept (beta0) is -3.9779. Step 801: Current loss is 1.2254, slope (beta1) is 1.87, and intercept (beta0) is -4.0904. Step 851: Current loss is 1.1846, slope (beta1) is 1.885, and intercept (beta0) is -4.19. Step 901: Current loss is 1.1527, slope (beta1) is 1.8983, and intercept (beta0) is -4.2782. Step 951: Current loss is 1.1276, slope (beta1) is 1.9101, and intercept (beta0) is -4.3563.
We can plot the losses out to see how well this algorithm works.
from matplotlib import pyplot as plt
plt.plot(range(1, 1000+1), losses, label = "loss")
plt.legend()
plt.xlabel("iteration")
plt.show()
The results seem to very close to the true $\beta_1$ and $\beta_0$ values. Furthermore, the loss values are monotonically decreasing as we increase the number of iterations.
Let's see if increasing the number of repetitions will give us smaller training loss and get us even closer to the true values for the slope and intercept.
beta1, beta0, losses = ols_grad_descent(x, y, epochs = 2000, initial_beta1 = 0, initial_beta0 = 1, eta= 0.01)
Step 1: Current loss is 50.8109, slope (beta1) is 0.3696, and intercept (beta0) is 1.0406. Step 51: Current loss is 8.3511, slope (beta1) is 1.1844, and intercept (beta0) is 0.4467. Step 101: Current loss is 6.768, slope (beta1) is 1.2782, and intercept (beta0) is -0.1738. Step 151: Current loss is 5.5276, slope (beta1) is 1.3612, and intercept (beta0) is -0.7231. Step 201: Current loss is 4.5556, slope (beta1) is 1.4347, and intercept (beta0) is -1.2093. Step 251: Current loss is 3.794, slope (beta1) is 1.4997, and intercept (beta0) is -1.6397. Step 301: Current loss is 3.1973, slope (beta1) is 1.5573, and intercept (beta0) is -2.0207. Step 351: Current loss is 2.7297, slope (beta1) is 1.6082, and intercept (beta0) is -2.3579. Step 401: Current loss is 2.3633, slope (beta1) is 1.6533, and intercept (beta0) is -2.6564. Step 451: Current loss is 2.0762, slope (beta1) is 1.6932, and intercept (beta0) is -2.9207. Step 501: Current loss is 1.8513, slope (beta1) is 1.7286, and intercept (beta0) is -3.1546. Step 551: Current loss is 1.675, slope (beta1) is 1.7599, and intercept (beta0) is -3.3617. Step 601: Current loss is 1.5369, slope (beta1) is 1.7876, and intercept (beta0) is -3.5449. Step 651: Current loss is 1.4287, slope (beta1) is 1.8121, and intercept (beta0) is -3.7072. Step 701: Current loss is 1.3439, slope (beta1) is 1.8338, and intercept (beta0) is -3.8508. Step 751: Current loss is 1.2775, slope (beta1) is 1.853, and intercept (beta0) is -3.9779. Step 801: Current loss is 1.2254, slope (beta1) is 1.87, and intercept (beta0) is -4.0904. Step 851: Current loss is 1.1846, slope (beta1) is 1.885, and intercept (beta0) is -4.19. Step 901: Current loss is 1.1527, slope (beta1) is 1.8983, and intercept (beta0) is -4.2782. Step 951: Current loss is 1.1276, slope (beta1) is 1.9101, and intercept (beta0) is -4.3563. Step 1001: Current loss is 1.108, slope (beta1) is 1.9206, and intercept (beta0) is -4.4253. Step 1051: Current loss is 1.0926, slope (beta1) is 1.9298, and intercept (beta0) is -4.4865. Step 1101: Current loss is 1.0806, slope (beta1) is 1.938, and intercept (beta0) is -4.5406. Step 1151: Current loss is 1.0711, slope (beta1) is 1.9452, and intercept (beta0) is -4.5886. Step 1201: Current loss is 1.0637, slope (beta1) is 1.9517, and intercept (beta0) is -4.631. Step 1251: Current loss is 1.0579, slope (beta1) is 1.9573, and intercept (beta0) is -4.6685. Step 1301: Current loss is 1.0534, slope (beta1) is 1.9623, and intercept (beta0) is -4.7018. Step 1351: Current loss is 1.0498, slope (beta1) is 1.9668, and intercept (beta0) is -4.7312. Step 1401: Current loss is 1.0471, slope (beta1) is 1.9707, and intercept (beta0) is -4.7572. Step 1451: Current loss is 1.0449, slope (beta1) is 1.9742, and intercept (beta0) is -4.7803. Step 1501: Current loss is 1.0432, slope (beta1) is 1.9773, and intercept (beta0) is -4.8007. Step 1551: Current loss is 1.0418, slope (beta1) is 1.98, and intercept (beta0) is -4.8187. Step 1601: Current loss is 1.0408, slope (beta1) is 1.9824, and intercept (beta0) is -4.8347. Step 1651: Current loss is 1.0399, slope (beta1) is 1.9846, and intercept (beta0) is -4.8489. Step 1701: Current loss is 1.0393, slope (beta1) is 1.9865, and intercept (beta0) is -4.8614. Step 1751: Current loss is 1.0388, slope (beta1) is 1.9881, and intercept (beta0) is -4.8725. Step 1801: Current loss is 1.0384, slope (beta1) is 1.9896, and intercept (beta0) is -4.8823. Step 1851: Current loss is 1.0381, slope (beta1) is 1.9909, and intercept (beta0) is -4.891. Step 1901: Current loss is 1.0378, slope (beta1) is 1.9921, and intercept (beta0) is -4.8987. Step 1951: Current loss is 1.0376, slope (beta1) is 1.9931, and intercept (beta0) is -4.9055.
plt.plot(range(1, 2000+1), losses, label = "loss")
plt.legend()
plt.xlabel("iteration")
plt.show()
We can see that the results seem to be much closer to the true $\beta_1$ and $\beta_0$ values as we increase the iteration.
Will increasing the number of repetitons increase the accuracy of the model significantly? This is a very difficult question to answer.
Its challenging to intelligently pick the number of repetitions to strike the perfect tradeoff between accuracy and computational cost.
So, a convergence stopping condition can help with this. We can have the algorithm stop whenever the changes to $\beta$ and $\epsilon$ are getting very very small. let's rewrite the algorithm with this stopping condition.
def eps_ols_grad_descent(x, y, alpha = 0.001, initial_beta1 = 0, initial_beta0 = 1, eta = 0.1):
beta1 = initial_beta1
beta0 = initial_beta0
old_beta1 = beta1
old_beta0 = beta0
n = x.shape[0]
diff = 1.0
# print(n)
i = 0
losses = []
while diff > alpha:
y_pred = (beta1 * x) + beta0
losses.append(np.mean(np.square(y - y_pred)))
grad_beta1 = (-1 / n) * np.sum(x * (y - y_pred))
grad_beta0 = (-1 / n) * np.sum(y - y_pred)
beta1 = beta1 - eta * grad_beta1
beta0 = beta0 - eta * grad_beta0
if i % 50 == 0:
print("Step {}: Current loss is {}, slope (beta1) is {} and intercept (beta0) is {}."
.format(i + 1, round(losses[i],4), round(beta1, 4), round(beta0, 4)))
i = i + 1
diff = max(abs(old_beta1 - beta1), abs(old_beta0 - beta0))
old_beta1 = beta1
old_beta0 = beta0
return beta1, beta0, losses
beta1, beta0, losses = eps_ols_grad_descent(x, y, alpha = 0.001, initial_beta1 = 0, initial_beta0 = 1, eta= 0.01)
Step 1: Current loss is 50.8109, slope (beta1) is 0.3696 and intercept (beta0) is 1.0406. Step 51: Current loss is 8.3511, slope (beta1) is 1.1844 and intercept (beta0) is 0.4467. Step 101: Current loss is 6.768, slope (beta1) is 1.2782 and intercept (beta0) is -0.1738. Step 151: Current loss is 5.5276, slope (beta1) is 1.3612 and intercept (beta0) is -0.7231. Step 201: Current loss is 4.5556, slope (beta1) is 1.4347 and intercept (beta0) is -1.2093. Step 251: Current loss is 3.794, slope (beta1) is 1.4997 and intercept (beta0) is -1.6397. Step 301: Current loss is 3.1973, slope (beta1) is 1.5573 and intercept (beta0) is -2.0207. Step 351: Current loss is 2.7297, slope (beta1) is 1.6082 and intercept (beta0) is -2.3579. Step 401: Current loss is 2.3633, slope (beta1) is 1.6533 and intercept (beta0) is -2.6564. Step 451: Current loss is 2.0762, slope (beta1) is 1.6932 and intercept (beta0) is -2.9207. Step 501: Current loss is 1.8513, slope (beta1) is 1.7286 and intercept (beta0) is -3.1546. Step 551: Current loss is 1.675, slope (beta1) is 1.7599 and intercept (beta0) is -3.3617. Step 601: Current loss is 1.5369, slope (beta1) is 1.7876 and intercept (beta0) is -3.5449. Step 651: Current loss is 1.4287, slope (beta1) is 1.8121 and intercept (beta0) is -3.7072. Step 701: Current loss is 1.3439, slope (beta1) is 1.8338 and intercept (beta0) is -3.8508. Step 751: Current loss is 1.2775, slope (beta1) is 1.853 and intercept (beta0) is -3.9779. Step 801: Current loss is 1.2254, slope (beta1) is 1.87 and intercept (beta0) is -4.0904. Step 851: Current loss is 1.1846, slope (beta1) is 1.885 and intercept (beta0) is -4.19. Step 901: Current loss is 1.1527, slope (beta1) is 1.8983 and intercept (beta0) is -4.2782. Step 951: Current loss is 1.1276, slope (beta1) is 1.9101 and intercept (beta0) is -4.3563. Step 1001: Current loss is 1.108, slope (beta1) is 1.9206 and intercept (beta0) is -4.4253. Step 1051: Current loss is 1.0926, slope (beta1) is 1.9298 and intercept (beta0) is -4.4865. Step 1101: Current loss is 1.0806, slope (beta1) is 1.938 and intercept (beta0) is -4.5406.
plt.plot(range(1, 1109+1), losses, label = "loss")
plt.legend()
plt.xlabel("iteration")
plt.show()
As we can see, the algorithm stopped after 1101 iterations. However, the results still look good in general.
Let's see what happens if we increase the learning rate to $0.1$. Since it could potentially get to the minimum faster, we could also decrease the number of repetitions.
beta1, beta0, losses = ols_grad_descent(x, y, epochs = 500, initial_beta1 = 0, initial_beta0 = 1, eta= 0.1)
Step 1: Current loss is 50.8109, slope (beta1) is 3.696, and intercept (beta0) is 1.4059. Step 51: Current loss is 7.342036083869394e+39, slope (beta1) is 3.499588537195117e+19, and intercept (beta0) is 5.287696237287645e+18. Step 101: Current loss is 1.3329942304915815e+78, slope (beta1) is 4.715446945615442e+38, and intercept (beta0) is 7.124795045603993e+37. Step 151: Current loss is 2.4201374090596947e+116, slope (beta1) is 6.353729777254186e+57, and intercept (beta0) is 9.600155183631025e+56. Step 201: Current loss is 4.393916301175761e+154, slope (beta1) is 8.561199510452273e+76, and intercept (beta0) is 1.2935527121817004e+76. Step 251: Current loss is 7.977439788941246e+192, slope (beta1) is 1.1535608158873063e+96, and intercept (beta0) is 1.7429703866096797e+95. Step 301: Current loss is 1.4483558908291895e+231, slope (beta1) is 1.5543412512767047e+115, and intercept (beta0) is 2.3485287765927346e+114. Step 351: Current loss is 2.6295839793207815e+269, slope (beta1) is 2.094364416809775e+134, and intercept (beta0) is 3.1644756886620065e+133. Step 401: Current loss is inf, slope (beta1) is 2.8220072695079514e+153, and intercept (beta0) is 4.263906188393032e+152. Step 451: Current loss is inf, slope (beta1) is 3.80245432229336e+172, and intercept (beta0) is 5.745310684027919e+171.
/usr/local/lib/python3.9/dist-packages/numpy/core/_methods.py:180: RuntimeWarning: overflow encountered in reduce ret = umr_sum(arr, axis, dtype, out, keepdims, where=where) /tmp/ipykernel_8368/3847878196.py:11: RuntimeWarning: overflow encountered in square losses.append(np.mean(np.square(y - y_pred)))
plt.plot(range(1, 500+1), losses, label = "loss")
plt.legend()
plt.xlabel("iteration")
plt.show()
As we can see, just by increasing the learning rate by a little, we see the slope and intercept being inflated and moving farther away from the true values. On the other hand, the loss values are increasing where it should have been decreasing. This shows how important it is to pick a good learning rate.
In each iteration, we first randomly sample a minibatch $\mathcal{B}_t$ consisting of a fixed number of training examples. This method can avoid using a sample with a large sample size.
Then we use the similar algorithm for simple linear regression:
$$\boxed{ \hat{\beta}_1 = \beta_1 - \frac{\eta}{|\mathcal{B}|} \triangledown L\left(\beta_1 \right) = \beta_1 -\frac{\eta}{|\mathcal{B}|}\sum_{i \in \mathcal{B}_t} X_i(Y_i - \beta_1 X_i - \beta_0)} $$$$\boxed{ \hat{\beta}_0 = \beta_0 - \frac{\eta}{|\mathcal{B}|} \triangledown L\left(\beta_0 \right) = \beta_0 -\frac{\eta}{|\mathcal{B}|}\sum_{i \in \mathcal{B}_t} X_i(Y_i - \beta_1 X_i - \beta_0)} $$Note: The set $\mathcal{B}$ is random and changes from iteration to iteration. Since we pick a minibatch $\mathcal{B}$ we need to normalize by its size $|\mathcal{B}|$.
$\eta$ and $|\mathcal{B}|$ (the batch size) are hyperparameters, meaning that they are kept fixed during training.
However, an outer loop might optimize them by tracking performance on a validation set.
Example: https://www.geeksforgeeks.org/ml-mini-batch-gradient-descent-with-python/
i. Gaussian (Normal) Distribution
So far we’ve given a fairly functional motivation of the squared loss objective: the optimal parameters return the conditional expectation $E[Y\mid X]$ whenever the underlying pattern is truly linear, and the loss assigns outsize penalties for outliers. We can also provide a more formal motivation for the squared loss objective by making probabilistic assumptions about the distribution of noise.
Normal distribution has a mean $\mu$ and a variance $\sigma^2$. We usually note it as $\mathcal{N}(\mu, \sigma^2)$.
$$\boxed{p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left[-\frac{(x - \mu)^2}{2 \sigma^2}\right]}$$ii. Likelihood Function of Linear Regression
The probability density function (pdf) for a simple linear regression looks like following:
$$\boxed{f_Y(y) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left[-\frac{(y - \mu)^2}{2 \sigma^2}\right]}$$Note: $\mu$ is the mean of $Y$, $\sigma^2$ is the varaince of $Y$, and $-\infty < Y < \infty$.
As a result, we have $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ and it is idependent and identically distributed (iid).
Maximum Likelihood Estimation Function (MLE) for a simple linear regression:
$$\boxed{L\left(\beta_0,\beta_1,\sigma^2 | \left(Y_i, X_i\right)\right) = \prod_{i = 1}^{n} \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left[-\frac{(Y_i - \beta_0 -\beta_1 X_i)^2}{2 \sigma^2}\right]}$$We can take $\log$ to obtain log-likelihood:
$$\ell(\beta_0, \beta_1, \sigma^2 | (Y_i, X_i)) = \log (L) = \log \left(\prod_{i = 1}^{n} \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left[-\frac{(Y_i - \beta_0 -\beta_1 X_i)^2}{2 \sigma^2}\right]\right)$$$$= \boxed{\log \left[\left(\frac{1}{\sqrt{2 \pi}}\right)\right]^n - \frac{n}{2} \log \left(\sigma^2 \right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(Y_i - \beta_0 - \beta_1 X_i\right)^2}$$Note: If we assume that $\sigma$ is fixed, then we can ignore the first two terms, because it does not depend on $\beta_0$ or $\beta_1$. The last term is identical to the squared error loss introduced earlier, except for the multiplicative constant $\frac{1}{2\sigma^2}$. It follows that minimizing the mean squared error is equivalent to maximum likelihood estimation of a linear model under the assumption of additive Gaussian noise.
We can take partial derivatives with respect to $\beta_0$, $\beta_1$, and $\sigma^2$:
$$\frac{\partial{\ell}}{\partial{\beta_0}} = -\frac{1}{2\sigma^2}\sum_{i=1}^n 2\left(Y_i - \beta_0 -\beta_1 X_i\right) (-1)$$$$~~~~~~~~~~~~~~~\Rightarrow -\frac{1}{2\sigma^2}\sum_{i=1}^n 2\left(Y_i - \beta_0 -\beta_1 X_i\right) (-1) = 0$$$$\Rightarrow \sum_{i=1}^n \left(Y_i - \beta_0 -\beta_1 X_i\right) = 0~~~~~~$$$$~~~\Rightarrow \sum_{i=1}^n Y_i - \sum_{i=1}^n \beta_0 - \beta_1 \sum_{i=1}^n X_i = 0$$$$\Rightarrow n \bar{Y} - n \beta_0 - n \beta_1 \bar{X} = 0~~~~~~~~~~$$$$\Rightarrow \boxed{\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X}}~~~~~~~~~~~~~~~~~~~~$$$$\frac{\partial{\ell}}{\partial{\beta_1}} = -\frac{1}{2\sigma^2}\sum_{i=1}^n -2 X_i \left(Y_i - \beta_0 -\beta_1 X_i\right)$$$$~~~~~~~~~~~~~~~~\Rightarrow -\frac{1}{2\sigma^2}\sum_{i=1}^n -2X_i\left(Y_i - \beta_0 -\beta_1 X_i\right) = 0$$$$~\Rightarrow \sum_{i=1}^n X_i\left(Y_i - \beta_0 -\beta_1 X_i\right) = 0$$$$~~~~~~~~~~~~~~~~\Rightarrow \sum_{i=1}^n X_i Y_i - \beta_0 \sum_{i=1}^n X_i - \beta_1 \sum_{i=1}^n X_i^2 = 0$$$$~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\Rightarrow \sum_{i=1}^n X_i Y_i - (\bar{Y} - \beta_1\bar{X}) \sum_{i=1}^n X_i - \beta_1 \sum_{i=1}^n X_i^2 = 0$$$$~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\Rightarrow \sum_{i=1}^n X_i Y_i - \bar{Y} \sum_{i=1}^n X_i - \beta_1\bar{X} \sum_{i=1}^n X_i - \beta_1 \sum_{i=1}^n X_i^2 = 0$$$$~~~~~~~~~~~~~~~~~~~~~~~~~\Rightarrow \sum_{i=1}^n X_i Y_i - n \bar{Y} \bar{X} + n\beta_1\bar{X}^2 - \beta_1 \sum_{i=1}^n X_i^2 = 0$$$$\Rightarrow \hat{\beta}_1 = \frac{\sum_{i=1}^n X_i Y_i - n \bar{Y} \bar{X}}{\sum_{i=1}^n X_i^2 - n\bar{X}^2}~~~$$$$\Rightarrow \hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X}) Y_i}{\sum_{i=1}^n (X_i - \bar{X})^2}~~~~~~$$$$~~~\Rightarrow \hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X}) (Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2}$$$$\Rightarrow \boxed{\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}}}~~~~~~~~~~~~~~~~~~~~~~~$$$$\frac{\partial{\ell}}{\partial{\sigma^2}} = \frac{\partial}{\partial{\sigma^2}}\left[\frac{n}{2}\log(\sigma^2)\right]-\frac{\partial}{\partial{\sigma^2}}\left[\frac{1}{2\sigma^2}\sum_{i=1}^{n} -2X_i (Y_i-\beta_0-\beta_1X_i)^2\right]$$$$\Rightarrow -\frac{n}{2}\cdot \frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^{n} (Y_i-\beta_0-\beta_1X_i)^2 = 0~~~~~~~~~~~~~$$$$\Rightarrow \boxed{\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (Y_i-\hat{\beta}_0-\hat{\beta}_1X_i)^2}~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$$The probability density function (pdf) in matrix form (general form) looks like following:
$$\boxed{f_{\boldsymbol{Y}}(y) = \left(\frac{1}{\sqrt{2\pi}}\right)^n \left(\det(\boldsymbol{\Sigma})\right)^{-\frac{1}{2}} \exp \left[-\frac{(\boldsymbol{Y} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\boldsymbol{Y} - \boldsymbol{\mu})}{2}\right]}$$Note: $\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix}_{~n\times1} = \boldsymbol{X \beta}$, $\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11}^2 & \sigma_{12}^2 & \dots & \sigma_{1n}^2 \\ \sigma_{12}^2 & \sigma_{22}^2 & \dots & \sigma_{2n}^2 \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1}^2 & \sigma_{n2}^2 & \dots & \sigma_{nn}^2 \end{bmatrix}_{~n\times n} = \sigma^2 \boldsymbol{I}$, and $Y = \mathbb{R}^{n}$.
As a result, we have $\boldsymbol{Y} = \boldsymbol{\beta X} + \boldsymbol{\varepsilon}$, where $\boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.
Maximum Likelihood Estimation Function (MLE) in matrix form (general form):
$$\boxed{L\left(\boldsymbol{\beta}, \sigma^2 | (\boldsymbol{Y}, \boldsymbol{X})\right) = \left(\frac{1}{2\pi}\right)^n (\sigma^2 \boldsymbol{I})^{-\frac{n}{2}} \exp \left[-\frac{(\boldsymbol{Y} - \boldsymbol{\mu})^T (\boldsymbol{Y} - \boldsymbol{\mu})}{2\sigma^2}\right]}$$We can take $\log$ to obtain log-likelihood:
$$\ell \left(\boldsymbol{\beta},\sigma^2 | (\boldsymbol{Y},\boldsymbol{X})\right) = \log\left[\left(\frac{1}{2\pi}\right)^n (\sigma^2 \boldsymbol{I})^{-\frac{n}{2}} \exp \left[-\frac{(\boldsymbol{Y} - \boldsymbol{\mu})^T (\boldsymbol{Y} - \boldsymbol{\mu})}{2\sigma^2}\right]\right]$$$$= \log \left[\left(\frac{1}{\sqrt{2 \pi}}\right)\right]^n - \frac{n}{2} \log \left(\sigma^2 \right) - \frac{1}{2\sigma^2} (\boldsymbol{Y} - \boldsymbol{\mu})^T (\boldsymbol{Y} - \boldsymbol{\mu})$$We can also take partial derivatives with respect to $\boldsymbol{\beta}$ and $\sigma^2$:
$$\frac{\partial{\ell}}{\partial{\boldsymbol{\beta}}} = - \frac{1}{2\sigma^2} \frac{\partial}{\partial_{\boldsymbol{\beta}}}\left(\boldsymbol{\boldsymbol{Y}^T \boldsymbol{Y} - 2\beta}^T\boldsymbol{X}^T \boldsymbol{Y} + \boldsymbol{\beta}^T\boldsymbol{X}^T \boldsymbol{X}\boldsymbol{\beta} \right)$$$$\Rightarrow - \frac{1}{2\sigma^2} \left(-2\boldsymbol{X}^T\boldsymbol{Y} + 2\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta} \right) = 0~~~~~~~~~~$$$$\Rightarrow -2\boldsymbol{X}^T\boldsymbol{Y} + 2\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta} = 0~~~~~~~~~~~~~~~~~~~~~~~~$$$$\Rightarrow \boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta} = \boldsymbol{X}^T\boldsymbol{Y}~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$$$$\Rightarrow \left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{\beta} = \left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{Y}~~~~~$$$$\Rightarrow \boxed{\boldsymbol{\hat{\beta}} = \left(\boldsymbol{X}^T\boldsymbol{X}\right)^{-1}\boldsymbol{X}^T\boldsymbol{Y}}~~~~~~~~~~~~~~~~~~~~~~~~~~~~$$$$\frac{\partial{\ell}}{\partial{\sigma^2}} = -\frac{n}{2}\cdot \frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2} \left(\boldsymbol{Y} - \boldsymbol{X\beta}\right)^T\left(\boldsymbol{Y} - \boldsymbol{X\beta}\right)~~~~~~~~~~~~~$$$$~~\Rightarrow -\frac{n}{2}\cdot \frac{1}{\sigma^2} + \frac{1}{2(\sigma^2)^2} \left(\boldsymbol{Y} - \boldsymbol{X\beta}\right)^T\left(\boldsymbol{Y} - \boldsymbol{X\beta}\right) = 0$$$$~~\Rightarrow \frac{n}{\sigma^2} = \frac{1}{(\sigma^2)^2} \left(\boldsymbol{Y} - \boldsymbol{X\beta}\right)^T\left(\boldsymbol{Y} - \boldsymbol{X\beta}\right)~~~~~~~~~~~~~~~~~$$$$~~\Rightarrow \boxed{\hat{\sigma}^2 = \frac{1}{n} \left(\boldsymbol{Y} - \boldsymbol{X\hat{\beta}}\right)^T\left(\boldsymbol{Y} - \boldsymbol{X\boldsymbol{\hat{\beta}}}\right)}~~~~~~~~~~~~~~~~~~~$$While linear models are not sufficiently rich to express the many complicated neural networks that we will introduce in this book, neural networks are rich enough to subsume linear models as neural networks in which every feature is represented by an input neuron, all of which are connected directly to the output.
The diagram highlights the connectivity pattern such as how each input is connected to the output, but not the specific values taken by the weights or biases.
The inputs are $x_1, \ldots, x_d$. We refer to $d$ as the number of inputs or feature dimensionality in the input layer. The output of the network is $o_1$. Because we are just trying to predict a single numerical value, we have only one output neuron. Note that the input values are all given. There is just a single computed neuron. In summary, we can think of linear regression as a single-layer fully connected neural network. We will encounter networks with far more layers in future chapters.
Because linear regression predates computational neuroscience, it might seem anachronistic to describe linear regression in terms of neural networks.
Nonetheless, they were a natural place to start when the cyberneticists and neurophysiologists Warren McCulloch and Walter Pitts began to develop models of artificial neurons.
Consider the cartoonish picture of a biological neuron below, consisting of:
Info $x_i$ arriving from other neurons (or sensors, e.g., retina) is received in the dendrites.
Info is weighted by synaptic weights $\beta_i$ determining the effect of the inputs
Simple linear weighting gives the result $Y = \beta_0 + \sum_{i=1}^n X_i \beta_i$.
After applying a nonlinear $\sigma$, the result $\sigma(y)$ is sent to other neurons via the axon for further processing.
Certainly, the high-level idea that many such units could be combined with the right connectivity and right learning algorithm, to produce far more interesting and complex behavior than any one neuron alone could express owes to our study of real biological neural systems.
At the same time, most research in deep learning today draws inspiration from a much wider source. We invoke Stuart Russell and Peter Norvig (Russell and Norvig, 2016) who pointed out that although airplanes might have been inspired by birds, ornithology has not been the primary driver of aeronautics innovation for some centuries.
Likewise, inspiration in deep learning these days comes in equal or greater measure from mathematics, linguistics, psychology, statistics, computer science, and many other fields.
In this section, we introduced traditional linear regression, where the parameters of a linear function are chosen to minimize squared loss on the training set. We also motivated this choice of objective both via some practical considerations and through an interpretation of linear regression as maximimum likelihood estimation under an assumption of linearity and Gaussian noise. After discussing both computational considerations and connections to statistics, we showed how such linear models could be expressed as simple neural networks where the inputs are directly wired to the output(s). While we will soon move past linear models altogether, they are sufficient to introduce most of the components that all of our models require: parametric forms, differentiable objectives, optimization via minibatch stochastic gradient descent, and ultimately, evaluation on previously unseen data.