Content
1. Regression and Classification
2. Classification Example Problem
3. Softmax Operation
4. Loss Function
5. Gradient Descent of Cross-entropy
6. Information Theory Basics
Summary

University of Michigan at Ann Arbor
Last Edit Date: 02/08/2023
Disclaimer and Terms of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Dive into Deep Learning and Professor Ambuj Tewari. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Regression answers "how much?" or "how many?" questions.
Classification answers "which one?" questions.
Suppose each input consists of a $2\times 2$ grayscale image. We can represent each pixel value with a single scalar, giving us four features $X_1, X_2, X_3, X_4$. Moreover, we assume that each image belongs to one of the categories "cat", "chicken", and "dog", which are our outputs $Y_i$.
i. Labelling
The most natural way to represent the labels is $Y\in \{1,2,3\}$, which corresponds to $\{\text{cat}, \text{chicken}, \text{dog}\}$ respectively. This labelling method can be great when the categories themselves have a natural order. However, most classification problems do not come with a natural order among the classes.
One-hot encoding is a good way to label categories in classification problems.
In this way, $Y$ should be a three-dimensional vector, with $(1, 0, 0)$ corresponding to "cat", $(0, 1, 0)$ corresponding to "chicken", and $(0, 0, 1)$ corresponding to "dog".
$$Y \in \{(1,0,0),(0,1,0),(0,0,1)\}$$
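As a concrete illustration, here is a minimal sketch (using NumPy, with made-up integer labels) of converting class indices into one-hot vectors:

```python
import numpy as np

classes = ["cat", "chicken", "dog"]   # class index 0, 1, 2
labels = np.array([0, 2, 1, 0])       # integer labels for four example images

# Row i of the result is the one-hot encoding of labels[i]
one_hot = np.eye(len(classes))[labels]
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]
```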
ii. Model
Hard labels output: A definite assignment to one of the classes.
Soft labels output: Probabilities.
In order to do classification, we need a model with multiple outputs, one per class. Moreover, if we restrict the model to be linear (affine), then we need 3 affine functions, each with 4 weights and 1 bias, as follows:
$$O_1 = \beta_{10} + \beta_{11}X_1 + \beta_{12}X_2 + \beta_{13}X_3 + \beta_{14}X_4$$
$$O_2 = \beta_{20} + \beta_{21}X_1 + \beta_{22}X_2 + \beta_{23}X_3 + \beta_{24}X_4$$
$$O_3 = \beta_{30} + \beta_{31}X_1 + \beta_{32}X_2 + \beta_{33}X_3 + \beta_{34}X_4$$
We can also write this in matrix form as follows:
$$\boldsymbol{O} = \boldsymbol{\beta}\boldsymbol{X},$$where $\boldsymbol{O} = \begin{bmatrix}O_1 \\ O_2 \\ O_3 \end{bmatrix}_{~3\times 1}$, $\boldsymbol{\beta} = \begin{bmatrix}\beta_{10} & \beta_{11} & \beta_{12} & \beta_{13} & \beta_{14} \\ \beta_{20} & \beta_{21} & \beta_{22} & \beta_{23} & \beta_{24} \\ \beta_{30} & \beta_{31} & \beta_{32} & \beta_{33} & \beta_{34}\end{bmatrix}_{~3\times5}$, $\boldsymbol{X} = \begin{bmatrix}1 \\ X_1 \\ X_2 \\ X_3 \\ X_4 \end{bmatrix}_{~5\times1}$.
We consider $\boldsymbol{X}$ as the inputs and $\boldsymbol{Y}$ as the outputs; the model can then be visualized as a single-layer fully-connected network, with every input feature connected to every output.
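To make the matrix form concrete, here is a minimal sketch (using NumPy, with made-up parameter values) that computes the three outputs $\boldsymbol{O} = \boldsymbol{\beta}\boldsymbol{X}$ for one image:

```python
import numpy as np

# Hypothetical parameters: 3 classes x (1 bias + 4 pixel features)
beta = np.array([[0.1,  0.2, -0.3,  0.5,  0.0],   # row for O_1
                 [0.0, -0.1,  0.4,  0.1,  0.2],   # row for O_2
                 [0.3,  0.0,  0.1, -0.2,  0.1]])  # row for O_3

# One 2x2 image flattened to 4 pixel values, with a leading 1 for the bias term
x = np.array([1.0, 0.5, 0.9, 0.1, 0.7])

O = beta @ x          # shape (3,), one raw output per class
print(O)
```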
iii. Parameters in a Fully Connected Layer
Fully-connected layers are ubiquitous in deep learning.
Fully-connected layers have many learnable parameters.
For a fully-connected layer with $d$ inputs and $q$ outputs, the parameterization cost is $O(dq)$.
For a fully-connected layer with $d$ inputs and $q$ outputs, the number of trainable variables is $(d+1)\times q$. In our example above, there are $(4 + 1) \times 3 = 15$ trainable variables.
i. Why Softmax Operation
Assuming a suitable loss function, we could try, directly, to minimize the difference between $\boldsymbol{O}$ and the labels $\boldsymbol{Y}$.
While it turns out that treating classification as a vector-valued regression problem works surprisingly well, it is nonetheless lacking in the following ways:
There is no guarantee that the outputs $O_i$ sum up to 1 in the way we expect probabilities to behave.
There is no guarantee that the outputs $O_i$ are even nonnegative (even if they sum up to 1), or that they do not exceed 1.
Both aspects render the estimation problem difficult to solve and the solution very brittle to outliers.
As a result, we will use the softmax operation to overcome these issues.
ii. Softmax Function
To turn the outputs into valid probabilities, we first exponentiate each output (making it nonnegative) and then divide by the sum of the exponentiated outputs so that they sum to 1. This second step is called normalization. Putting these two pieces together gives us the softmax function:
$$\boldsymbol{\hat{Y}} = \text{softmax}(\boldsymbol{O})~~\text{where}~~\hat{Y}_i = \frac{\exp(O_i)}{\sum_{j=1}^{q} \exp(O_j)}$$
iii. Properties of Softmax Function
It produces valid probabilities.
It preserves ordering, so we do not need to compute the softmax to determine which class has been assigned the highest probability.
Our model becomes $\boldsymbol{\hat{Y}} = \text{softmax}(\boldsymbol{\beta X})$, which is a non-linear function of the inputs.
The output takes the form of conditional probabilities, written as $P(\boldsymbol{Y} \mid \boldsymbol{X})$ or $P(\boldsymbol{Y} \mid \boldsymbol{\hat{Y}})$.
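A minimal sketch of the softmax computation (using NumPy; subtracting the maximum is a standard numerical-stability trick, not something required by the formula itself):

```python
import numpy as np

def softmax(o):
    """Map raw outputs O to probabilities that are nonnegative and sum to 1."""
    o = o - o.max()                 # shift for numerical stability; does not change the result
    exp_o = np.exp(o)
    return exp_o / exp_o.sum()

O = np.array([1.0, 2.0, 0.5])
Y_hat = softmax(O)
print(Y_hat, Y_hat.sum())           # valid probabilities, summing to 1.0
print(O.argmax() == Y_hat.argmax()) # ordering is preserved: True
```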
Now that we have a mapping from features $\boldsymbol{X}$ to probabilities $\boldsymbol{\hat{Y}}$, we need a way to optimize the accuracy of this mapping. Note that our one-hot encoded labels are represented as $\boldsymbol{Y}$.
We will rely on maximum likelihood estimation (MLE), the very same concept that we encountered when providing a probabilistic justification for the mean squared error loss in linear regression.
i. Likelihood Function
The softmax function gives us a vector $\boldsymbol{\hat{Y}}$, which we can interpret as conditional probabilities of each class, given any input $\boldsymbol{X}$. For example, $\hat{Y}_1 = P(Y = \text{cat} | \boldsymbol{X})$.
Suppose we have a dataset with features $\boldsymbol{X}$ and one-hot encoded labels $\boldsymbol{Y}$, and our sample size is $n$. Example $i$ consists of a feature vector $\boldsymbol{X}_i$ and a one-hot label vector $\boldsymbol{Y}_i$.
Then we can have the following likelihood function:
$$\boxed{P(\boldsymbol{Y} | \boldsymbol{X}) = \prod_{i = 1}^{n}P(Y_i | X_i)}$$
Note that $P(Y_i | \boldsymbol{X}_{i})$ can be written in terms of $\hat{Y}_{i}$: it is simply the product $\boxed{\prod_{j=1}^{q} \left(\hat{Y}_j\right)^{Y_j}}$.
This is just a formal way of saying that the probability of the observed label is the component of the predicted probability vector corresponding to that label.
Suppose $\boldsymbol{\hat{Y}} = \begin{bmatrix} 0.2 \\ 0.3 \\ 0.5 \end{bmatrix}$ and $\boldsymbol{Y} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$. Then $P(\boldsymbol{Y} | \boldsymbol{\hat{Y}}) = \prod_{j=1}^3 (\hat{Y}_j)^{Y_j} = 0.2^0 \times 0.3^1 \times 0.5^0 = 1 \times 0.3 \times 1 = 0.3$.
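A minimal sketch of this computation (using NumPy, with the same numbers as the example above):

```python
import numpy as np

Y_hat = np.array([0.2, 0.3, 0.5])   # predicted probabilities
Y = np.array([0, 1, 0])             # one-hot label ("chicken")

# prod_j Y_hat_j ** Y_j picks out the predicted probability of the observed class
likelihood = np.prod(Y_hat ** Y)
print(likelihood)                    # 0.3
```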
ii. Log-Likelihood
Maximizing the likelihood is equivalent to minimizing the negative log-likelihood, which gives us the loss as follows:
$$-\log P(\boldsymbol{Y} | \boldsymbol{X}) = \sum_{i=1}^{n} -\log P(Y_i | \boldsymbol{X}_i) = \boxed{\sum_{i=1}^{n} \ell \left(Y_i,\hat{Y}_i\right)}$$
For any pair of label $\boldsymbol{Y}$ and model prediction $\boldsymbol{\hat{Y}}$ over $q$ classes, the loss function $\ell$ is:
$$\boxed{\ell\left(\boldsymbol{Y}, \boldsymbol{\hat{Y}}\right) = -\sum_{j=1}^{q}Y_j \log \hat{Y}_j}$$
Note: $\hat{Y}_j = \frac{e^{O_j}}{\sum_{k = 1}^q e^{O_k}}$
The loss function above is commonly called the cross-entropy loss.
Note: $-\log(0) = \infty$, $-\log(1) = 0$
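A minimal sketch of the cross-entropy loss (using NumPy; the small constant added inside the log is only there to avoid $\log(0)$):

```python
import numpy as np

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Cross-entropy loss -sum_j Y_j * log(Y_hat_j) for one example."""
    return -np.sum(Y * np.log(Y_hat + eps))

Y = np.array([0, 1, 0])             # one-hot label
Y_hat = np.array([0.2, 0.3, 0.5])   # predicted probabilities

print(cross_entropy(Y, Y_hat))      # -log(0.3) ≈ 1.204
```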
We can compute $\nabla_{\boldsymbol{\beta}}\, \ell(\boldsymbol{Y}, \boldsymbol{\hat{Y}})$ using the chain rule, where the overall objective is $L(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i =1}^n \ell_{\boldsymbol{\beta}}\left( \boldsymbol{Y}_i, \boldsymbol{\hat{Y}}_i\right)$.
Remember that $\boldsymbol{O} = \boldsymbol{\beta}\boldsymbol{X}$, so we can compute $\frac{\partial \boldsymbol{O}}{\partial \boldsymbol{\beta}}$ from this relationship.
We know that $\ell\left(\boldsymbol{Y}, \boldsymbol{\hat{Y}}\right) = -\sum_{j=1}^{q}Y_j \log \hat{Y}_j$. We can rewrite it as $\ell\left(\boldsymbol{Y}, \boldsymbol{\hat{Y}}\right) = - \sum_{j=1}^q Y_j \log \frac{\exp{(O_j)}}{\sum_{k=1}^q \exp{(O_k)}} = \log \sum_{j=1}^q \exp{(O_j)} - \sum_{j=1}^q Y_j O_j$.
Note: $\hat{Y}_j = \frac{e^{O_j}}{\sum_{k = 1}^q e^{O_k}}$ and $\sum_{j=1}^q Y_j = 1$.
$$\frac{\partial \ell\left(\boldsymbol{Y}, \boldsymbol{\hat{Y}}\right)}{\partial O_j} = \frac{\partial}{\partial O_j}\left(\log \sum_{k=1}^q \exp{(O_k)} - \sum_{k=1}^q Y_k O_k\right) = \frac{e^{O_j}}{\sum_{k = 1}^q e^{O_k}} - Y_j = \hat{Y}_j - Y_j$$
In vector form, $\frac{\partial \ell}{\partial \boldsymbol{O}} = \boxed{\boldsymbol{\hat{Y}} - \boldsymbol{Y}}$. As a result,
$$\boxed{\nabla_{\boldsymbol{\beta}}\, \ell(\boldsymbol{Y}, \boldsymbol{\hat{Y}}) = \left(\boldsymbol{\hat{Y}} - \boldsymbol{Y}\right) \boldsymbol{X}^{\top}}$$
which is a $3\times 5$ matrix in our example, matching the shape of $\boldsymbol{\beta}$.
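A minimal sketch (using NumPy) that checks the boxed result numerically, comparing $\boldsymbol{\hat{Y}} - \boldsymbol{Y}$ against a finite-difference estimate of $\partial \ell / \partial \boldsymbol{O}$:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def loss(o, y):
    return -np.sum(y * np.log(softmax(o)))

O = np.array([1.0, 2.0, 0.5])
Y = np.array([0.0, 1.0, 0.0])

analytic = softmax(O) - Y           # the derived gradient with respect to O: Y_hat - Y

# Central finite-difference approximation of d loss / d O_j
eps = 1e-6
numeric = np.array([(loss(O + eps * np.eye(3)[j], Y) - loss(O - eps * np.eye(3)[j], Y)) / (2 * eps)
                    for j in range(3)])

print(np.allclose(analytic, numeric, atol=1e-6))   # True
```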
i. Entropy
The central idea in information theory is to quantify the amount of information contained in data. This places a limit on our ability to compress data. For a distribution $P$, its entropy is defined as:
$$H[P] = \sum_j - P(j) \log P(j)$$
One of the fundamental theorems of information theory states that in order to encode data drawn randomly from the distribution $P$, we need at least $H[P]$ "nats" to encode it (Shannon, 1948).
If you wonder what a "nat" is, it is the equivalent of a bit, but when using a code with base $e$ rather than base 2. Thus, one nat is $\frac{1}{\log(2)} \approx 1.44$ bits.
Example:
Suppose we have
$P(1), P(2), ..., P(26)$ corresponding to $A, B, ..., Z$
$$H[P] = \sum_{j = 1}^{26} P(j) \log_2 \frac{1}{P(j)}$$
Here $\log_2 \frac{1}{P(j)}$ is the number of bits used to encode symbol $j$, so $H[P]$ gives us the expected number of bits per symbol over the whole sample.
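A minimal sketch (using NumPy, with a made-up letter distribution) of computing the entropy in bits:

```python
import numpy as np

# Hypothetical distribution over 26 letters: uniform, for simplicity
P = np.full(26, 1 / 26)

H = np.sum(P * np.log2(1 / P))      # entropy in bits
print(H)                             # log2(26) ≈ 4.70 bits per letter
```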
ii. Cross-entropy
Suppose the true distribution is $P$, but we think it is $Q$. Then we will assign an encoding of length $- \log Q(j)$ to the $j$-th symbol, so our expected code length will be:
$$H(P, Q) = - \sum_j P(j) \log Q(j)$$
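A minimal sketch (using NumPy, with made-up distributions $P$ and $Q$) that computes $H(P, Q)$ and compares it with $H[P]$:

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])        # true distribution (made up)
Q = np.array([0.4, 0.4, 0.2])        # assumed distribution (made up)

H_P  = -np.sum(P * np.log(P))        # entropy of P, in nats
H_PQ = -np.sum(P * np.log(Q))        # cross-entropy of P and Q, in nats

print(H_P, H_PQ, H_PQ >= H_P)        # the cross-entropy is never smaller than the entropy
```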
iii. KL divergence or relative entropy
Note that if $Q$ is not the same as $P$, we expect some overhead. Actually, $H(P,Q) \geq H(P)$ always, with equality only when $Q = P$. KL divergence, aka relative entropy, measures this excess as shown below:
$$KL(P \| Q) = H(P,Q) - H(P) = \sum_j P(j) \log \frac{P(j)}{Q(j)}$$
Summary
In this section, we encountered the first nontrivial loss function, allowing us to optimize over discrete output spaces. Key in its design was that we took a probabilistic approach, treating discrete categories as instances of draws from a probability distribution. As a side effect, we encountered the softmax, a convenient activation function that transforms outputs of an ordinary neural network layer into valid discrete probability distributions. We saw that the derivative of the cross-entropy loss, when combined with softmax, behaves very similarly to the derivative of squared error, namely by taking the difference between the expected behavior and its prediction. And, while we were only able to scratch the very surface of it, we encountered exciting connections to statistical physics and information theory.