Content
1. Regression and Classification
2. Classification Example Problem
3. Softmax Operation
4. Loss Function
5. Gradient Descent of Cross-entropy
6. Information Theory Basics
Summary

University of Michigan at Ann Arbor
Last Edit Date: 02/08/2023
Disclaimer and Terms of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Dive into Deep Learning and Professor Ambuj Tewari. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Regression answers "how much?" or "how many?" questions.
Classification answers "which one?" questions.
Suppose each input consists of a $2\times 2$ grayscale image. We can represent each pixel value with a single scalar, giving us four features $X_1, X_2, X_3, X_4$. Moreover, we assume that each image belongs to one of the categories "cat", "chicken", and "dog", which are our outputs $Y_i$.
i. Labelling
The most natural way to represent the labels is $Y\in \{1,2,3\}$, which corresponds to $\{\text{cat}, \text{chicken}, \text{dog}\}$ respectively. This labelling method can be great when the categories themselves have a natural order. However, most classification problems do not come with a natural order among the classes.
One-hot encoding is a good way to label categories in classification problems.
In this way, $Y$ should be a three-dimensional vector, with $(1, 0, 0)$ corresponding to "cat", $(0, 1, 0)$ corresponding to "chicken", and $(0, 0, 1)$ corresponding to "dog".
$$Y \in \{(1,0,0),(0,1,0),(0,0,1)\}$$
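As a concrete illustration, here is a minimal sketch (using NumPy, with made-up integer labels) of converting class indices into one-hot vectors:

```python
import numpy as np

classes = ["cat", "chicken", "dog"]   # class index 0, 1, 2
labels = np.array([0, 2, 1, 0])       # integer labels for four example images

# Row i of the result is the one-hot encoding of labels[i]
one_hot = np.eye(len(classes))[labels]
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]
```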
ii. Model
Hard labels output: A definite assignment to one of the classes.
Soft labels output: Probabilities.
In order to do classification, we need a model with multiple outputs, one per class. Moreover, if we restrict the model to be linear (affine), then we need 3 affine functions, each with 4 weights and 1 bias, as follows:
$$O_1 = \beta_{10} + \beta_{11}X_1 + \beta_{12}X_2 + \beta_{13}X_3 + \beta_{14}X_4$$
$$O_2 = \beta_{20} + \beta_{21}X_1 + \beta_{22}X_2 + \beta_{23}X_3 + \beta_{24}X_4$$
$$O_3 = \beta_{30} + \beta_{31}X_1 + \beta_{32}X_2 + \beta_{33}X_3 + \beta_{34}X_4$$
We can also write this in matrix form as follows:
$$\boldsymbol{O} = \boldsymbol{\beta}\boldsymbol{X},$$where $\boldsymbol{O} = \begin{bmatrix}O_1 \\ O_2 \\ O_3 \end{bmatrix}_{~3\times 1}$, $\boldsymbol{\beta} = \begin{bmatrix}\beta_{10} & \beta_{11} & \beta_{12} & \beta_{13} & \beta_{14} \\ \beta_{20} & \beta_{21} & \beta_{22} & \beta_{23} & \beta_{24} \\ \beta_{30} & \beta_{31} & \beta_{32} & \beta_{33} & \beta_{34}\end{bmatrix}_{~3\times5}$, $\boldsymbol{X} = \begin{bmatrix}1 \\ X_1 \\ X_2 \\ X_3 \\ X_4 \end{bmatrix}_{~5\times1}$.
We consider $\boldsymbol{X}$ as the inputs and $\boldsymbol{Y}$ as the outputs; the model can then be visualized as a single-layer fully-connected network, with every input feature connected to every output.
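To make the matrix form concrete, here is a minimal sketch (using NumPy, with made-up parameter values) that computes the three outputs $\boldsymbol{O} = \boldsymbol{\beta}\boldsymbol{X}$ for one image:

```python
import numpy as np

# Hypothetical parameters: 3 classes x (1 bias + 4 pixel features)
beta = np.array([[0.1,  0.2, -0.3,  0.5,  0.0],   # row for O_1
                 [0.0, -0.1,  0.4,  0.1,  0.2],   # row for O_2
                 [0.3,  0.0,  0.1, -0.2,  0.1]])  # row for O_3

# One 2x2 image flattened to 4 pixel values, with a leading 1 for the bias term
x = np.array([1.0, 0.5, 0.9, 0.1, 0.7])

O = beta @ x          # shape (3,), one raw output per class
print(O)
```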
iii. Parameters in a Fully Connected Layer
Fully-connected layers are ubiquitous in deep learning.
Fully-connected layers have many learnable parameters.
For a fully-connected layer with $d$ inputs and $q$ outputs, the parameterization cost is $O(dq)$.
For a fully-connected layer with $d$ inputs and $q$ outputs, the number of trainable variables is $(d+1)\times q$. In our example above, there are $(4 + 1) \times 3 = 15$ trainable variables.
i. Why Softmax Operation
Assuming a suitable loss function, we could try, directly, to minimize the difference between $\boldsymbol{O}$ and the labels $\boldsymbol{Y}$.
While it turns out that treating classification as a vector-valued regression problem works surprisingly well, it is nonetheless lacking in the following ways:
There is no guarantee that the outputs $O_i$ sum up to 1 in the way we expect probabilities to behave.
There is no guarantee that the outputs $O_i$ are even nonnegative (even if they sum up to 1), or that they do not exceed 1.
Both aspects render the estimation problem difficult to solve and the solution very brittle to outliers.
As a result, we will use the softmax operation to overcome these issues.
ii. Softmax Function
To turn the outputs into valid probabilities, we first exponentiate each output (making it nonnegative) and then divide by the sum of the exponentiated outputs so that they sum to 1. This second step is called normalization. Putting these two pieces together gives us the softmax function:
$$\boldsymbol{\hat{Y}} = \text{softmax}(\boldsymbol{O})~~\text{where}~~\hat{Y}_i = \frac{\exp(O_i)}{\sum_{j=1}^{q} \exp(O_j)}$$
iii. Properties of Softmax Function
It produces valid probabilities.
It preserves ordering, so we do not need to compute the softmax to determine which class has been assigned the highest probability.
Our model becomes $\boldsymbol{\hat{Y}} = \text{softmax}(\boldsymbol{\beta X})$, which is a non-linear function of the inputs.
The output takes the form of conditional probabilities, written as $P(\boldsymbol{Y} \mid \boldsymbol{X})$ or $P(\boldsymbol{Y} \mid \boldsymbol{\hat{Y}})$.
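A minimal sketch of the softmax computation (using NumPy; subtracting the maximum is a standard numerical-stability trick, not something required by the formula itself):

```python
import numpy as np

def softmax(o):
    """Map raw outputs O to probabilities that are nonnegative and sum to 1."""
    o = o - o.max()                 # shift for numerical stability; does not change the result
    exp_o = np.exp(o)
    return exp_o / exp_o.sum()

O = np.array([1.0, 2.0, 0.5])
Y_hat = softmax(O)
print(Y_hat, Y_hat.sum())           # valid probabilities, summing to 1.0
print(O.argmax() == Y_hat.argmax()) # ordering is preserved: True
```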
Now that we have a mapping from features $\boldsymbol{X}$ to probabilities $\boldsymbol{\hat{Y}}$, we need a way to optimize the accuracy of this mapping. Note that our one-hot encoded labels are represented as $\boldsymbol{Y}$.
We will rely on maximum likelihood estimation (MLE), the very same concept that we encountered when providing a probabilistic justification for the mean squared error loss in linear regression.
i. Likelihood Function
The softmax function gives us a vector $\boldsymbol{\hat{Y}}$, which we can interpret as conditional probabilities of each class, given any input $\boldsymbol{X}$. For example, $\hat{Y}_1 = P(Y = \text{cat} | \boldsymbol{X})$.
Suppose we have a dataset with features $\boldsymbol{X}$ and one-hot encoded labels $\boldsymbol{Y}$, and our sample size is $n$. Example $i$ consists of a feature vector $\boldsymbol{X}_i$ and a one-hot label vector $\boldsymbol{Y}_i$.
Then we can have the following likelihood function:
$$\boxed{P(\boldsymbol{Y} | \boldsymbol{X}) = \prod_{i = 1}^{n}P(Y_i | X_i)}$$
Note that $P(Y_i | \boldsymbol{X}_{i})$ can be written in terms of $\hat{Y}_{i}$: it is simply the product $\boxed{\prod_{j=1}^{q} \left(\hat{Y}_j\right)^{Y_j}}$.
This is just a formal way of saying that the probability of the observed label is the component of the predicted probability vector corresponding to that label.
Suppose $\boldsymbol{\hat{Y}} = \begin{bmatrix} 0.2 \\ 0.3 \\ 0.5 \end{bmatrix}$ and $\boldsymbol{Y} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$. Then $P(\boldsymbol{Y} | \boldsymbol{\hat{Y}}) = \prod_{j=1}^3 (\hat{Y}_j)^{Y_j} = 0.2^0 \times 0.3^1 \times 0.5^0 = 1 \times 0.3 \times 1 = 0.3$.
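A minimal sketch of this computation (using NumPy, with the same numbers as the example above):

```python
import numpy as np

Y_hat = np.array([0.2, 0.3, 0.5])   # predicted probabilities
Y = np.array([0, 1, 0])             # one-hot label ("chicken")

# prod_j Y_hat_j ** Y_j picks out the predicted probability of the observed class
likelihood = np.prod(Y_hat ** Y)
print(likelihood)                    # 0.3
```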
ii. Log-Likelihood
Maximizing the likelihood is equivalent to minimizing the negative log-likelihood, which gives us the loss as follows:
$$-\log P(\boldsymbol{Y} | \boldsymbol{X}) = \sum_{i=1}^{n} -\log P(Y_i | \boldsymbol{X}_i) = \boxed{\sum_{i=1}^{n} \ell \left(Y_i,\hat{Y}_i\right)}$$
For any pair of label $\boldsymbol{Y}$ and model prediction $\boldsymbol{\hat{Y}}$ over $q$ classes, the loss function $\ell$ is:
$$\boxed{\ell\left(\boldsymbol{Y}, \boldsymbol{\hat{Y}}\right) = -\sum_{j=1}^{q}Y_j \log \hat{Y}_j}$$
Note: $\hat{Y}_j = \frac{e^{O_j}}{\sum_{k = 1}^q e^{O_k}}$
The loss function above is commonly called the cross-entropy loss.
Note: $-\log(0) = \infty$, $-\log(1) = 0$
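A minimal sketch of the cross-entropy loss (using NumPy; the small constant added inside the log is only there to avoid $\log(0)$):

```python
import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Cross-entropy loss -sum_j Y_j * log(Y_hat_j) for one example."""
    return -np.sum(Y * np.log(Y_hat + eps))

Y = np.array([0, 1, 0])             # one-hot label
Y_hat = np.array([0.2, 0.3, 0.5])   # predicted probabilities

print(cross_entropy(Y, Y_hat))      # -log(0.3) ≈ 1.204
```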
We can compute $\nabla_{\boldsymbol{\beta}}\, \ell(\boldsymbol{Y}, \boldsymbol{\hat{Y}})$ using the chain rule, where the overall objective is $L(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i =1}^n \ell_{\boldsymbol{\beta}}\left( \boldsymbol{Y}_i, \boldsymbol{\hat{Y}}_i\right)$.
Remember that $\boldsymbol{O} = \boldsymbol{\beta}\boldsymbol{X}$, so we can compute $\frac{\partial \boldsymbol{O}}{\partial \boldsymbol{\beta}}$ from this relationship.
We know that $\ell\left(\boldsymbol{Y}, \boldsymbol{\hat{Y}}\right) = -\sum_{j=1}^{q}Y_j \log \hat{Y}_j$. We can rewrite it as $\ell\left(\boldsymbol{Y}, \boldsymbol{\hat{Y}}\right) = - \sum_{j=1}^q Y_j \log \frac{\exp{(O_j)}}{\sum_{k=1}^q \exp{(O_k)}} = \log \sum_{j=1}^q \exp{(O_j)} - \sum_{j=1}^q Y_j O_j$.
Note: $\hat{Y}_j = \frac{e^{O_j}}{\sum_{k = 1}^q e^{O_k}}$ and $\sum_{j=1}^q Y_j = 1$.
$$\frac{\partial \ell\left(\boldsymbol{Y}, \boldsymbol{\hat{Y}}\right)}{\partial O_j} = \frac{\partial}{\partial O_j}\left(\log \sum_{k=1}^q \exp{(O_k)} - \sum_{k=1}^q Y_k O_k\right) = \frac{e^{O_j}}{\sum_{k = 1}^q e^{O_k}} - Y_j = \hat{Y}_j - Y_j$$
In vector form, $\frac{\partial \ell}{\partial \boldsymbol{O}} = \boxed{\boldsymbol{\hat{Y}} - \boldsymbol{Y}}$. As a result,
$$\boxed{\nabla_{\boldsymbol{\beta}}\, \ell(\boldsymbol{Y}, \boldsymbol{\hat{Y}}) = \left(\boldsymbol{\hat{Y}} - \boldsymbol{Y}\right) \boldsymbol{X}^{\top}}$$
which is a $3\times 5$ matrix in our example, matching the shape of $\boldsymbol{\beta}$.
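A minimal sketch (using NumPy) that checks the boxed result numerically, comparing $\boldsymbol{\hat{Y}} - \boldsymbol{Y}$ against a finite-difference estimate of $\partial \ell / \partial \boldsymbol{O}$:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def loss(o, y):
    return -np.sum(y * np.log(softmax(o)))

O = np.array([1.0, 2.0, 0.5])
Y = np.array([0.0, 1.0, 0.0])

analytic = softmax(O) - Y           # the derived gradient with respect to O: Y_hat - Y

# Central finite-difference approximation of d loss / d O_j
eps = 1e-6
numeric = np.array([(loss(O + eps * np.eye(3)[j], Y) - loss(O - eps * np.eye(3)[j], Y)) / (2 * eps)
                    for j in range(3)])

print(np.allclose(analytic, numeric, atol=1e-6))   # True
```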
i. Entropy
The central idea in information theory is to quantify the amount of information contained in data. This places a limit on our ability to compress data. For a distribution $P$, its entropy is defined as:
$$H[P] = \sum_j - P(j) \log P(j)$$
One of the fundamental theorems of information theory states that in order to encode data drawn randomly from the distribution $P$, we need at least $H[P]$ "nats" to encode it (Shannon, 1948).
If you wonder what a "nat" is, it is the equivalent of a bit, but when using a code with base $e$ rather than base 2. Thus, one nat is $\frac{1}{\log(2)} \approx 1.44$ bits.
Example:
Suppose we have
$P(1), P(2), ..., P(26)$ corresponding to $A, B, ..., Z$
$$H[P] = \sum_{j = 1}^{26} P(j) \log_2 \frac{1}{P(j)}$$
Here $\log_2 \frac{1}{P(j)}$ is the number of bits used to encode symbol $j$, so $H[P]$ gives us the expected number of bits per symbol over the whole sample.
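A minimal sketch (using NumPy, with a made-up letter distribution) of computing the entropy in bits:

```python
import numpy as np

# Hypothetical distribution over 26 letters: uniform, for simplicity
P = np.full(26, 1 / 26)

H = np.sum(P * np.log2(1 / P))      # entropy in bits
print(H)                             # log2(26) ≈ 4.70 bits per letter
```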
ii. Cross-entropy
Suppose the true distribution is $P$, but we think it is $Q$. Then we will assign an encoding of length $- \log Q(j)$ to the $j$-th symbol, so our expected code length will be:
$$H(P, Q) = - \sum_j P(j) \log Q(j)$$
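A minimal sketch (using NumPy, with made-up distributions $P$ and $Q$) that computes $H(P, Q)$ and compares it with $H[P]$:

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])        # true distribution (made up)
Q = np.array([0.4, 0.4, 0.2])        # assumed distribution (made up)

H_P  = -np.sum(P * np.log(P))        # entropy of P, in nats
H_PQ = -np.sum(P * np.log(Q))        # cross-entropy of P and Q, in nats

print(H_P, H_PQ, H_PQ >= H_P)        # the cross-entropy is never smaller than the entropy
```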
iii. KL divergence or relative entropy
Note that if $Q$ is not the same as $P$, we expect some overhead. Actually, $H(P,Q) \geq H(P)$ always, with equality only when $Q = P$. KL divergence, aka relative entropy, measures this excess as shown below:
$$KL(P \| Q) = H(P,Q) - H(P) = \sum_j P(j) \log \frac{P(j)}{Q(j)}$$
Summary
In this section, we encountered the first nontrivial loss function, allowing us to optimize over discrete output spaces. Key in its design was that we took a probabilistic approach, treating discrete categories as instances of draws from a probability distribution. As a side effect, we encountered the softmax, a convenient activation function that transforms outputs of an ordinary neural network layer into valid discrete probability distributions. We saw that the derivative of the cross-entropy loss, when combined with softmax, behaves very similarly to the derivative of squared error, namely by taking the difference between the expected behavior and its prediction. And, while we were only able to scratch the very surface of it, we encountered exciting connections to statistical physics and information theory.