### Softmax Activation Function with Python

Softmax is a mathematical function that converts a vector of numbers into a vector of probabilities, where the probability of each value is proportional to its relative scale within the vector.

The most common use of the softmax function in applied machine learning is as an activation function in a neural network model. Specifically, the network is configured to output N values, one for each class in the classification task, and the softmax function is used to normalize the outputs, converting them from weighted sums into probabilities that sum to one. Each value in the output of the softmax function is then interpreted as the probability of membership in the corresponding class.

In this guide, you will discover the softmax activation function used in neural network models.

After going through this guide, you will know:

- Linear and sigmoid activation functions are not suitable for multi-class classification tasks.
- Softmax can be thought of as a softened version of the argmax function that returns the index of the largest value in a list.
- How to implement the softmax function from scratch in Python and how to convert the output into a class label.

**Tutorial Overview**

This tutorial is divided into three parts, which are:

1. Predicting Probabilities with Neural Networks

2. Max, Argmax, and Softmax

3. Softmax Activation Function

**Predicting Probabilities with Neural Networks**

Neural network models can be used to model classification predictive modeling problems.

Classification problems are those that involve predicting a class label for a given input. A conventional approach to modeling classification problems is to use a model to predict the probability of class membership. That is, given an instance, what is the probability of it belonging to each of the known class labels?

- For a binary classification problem, a Binomial probability distribution is used. This is achieved using a network with a single node in the output layer that predicts the probability of an instance belonging to class 1.
- For a multi-class classification problem, a Multinomial probability distribution is used. This is achieved using a network with one node for each class in the output layer, where the sum of the predicted probabilities equals one.

A neural network model requires an activation function in the output layer of the model to make the prediction.

There are different activation functions to choose from; let's look at a few.

**Linear Activation Function**

One approach to predicting class membership probabilities is to use a linear activation.

A linear activation function is simply the weighted sum of the input to the node. As such, it is often referred to as "no activation function," since no additional transformation is performed.

Recall that a probability, or a likelihood, is a numeric value between 0 and 1.

Given that no transformation is performed on the weighted sum of the input, the linear activation function can output any numeric value. This makes the linear activation function unsuitable for predicting probabilities in either the binomial or the multinomial case.
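To make this concrete, here is a minimal sketch with made-up weights and inputs (illustrative values only, not from any trained model) showing that a weighted sum can easily land outside the [0, 1] range a probability requires:

```python
# a linear "activation" is just the weighted sum of the inputs plus a bias,
# with no squashing transformation applied afterwards
def linear_activation(weights, inputs, bias):
    # weighted sum of the inputs plus the bias term
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# made-up example weights and inputs
weights = [0.5, -1.2, 2.0]
inputs = [1.0, 2.0, 3.0]
bias = 0.1

output = linear_activation(weights, inputs, bias)
print(output)  # well above 1, so it cannot be read as a probability
```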

**Sigmoid Activation Function**

Another approach to predicting class membership probabilities is to use a sigmoid activation function.

This function is also referred to as the logistic function. Regardless of the input, the function always outputs a value between 0 and 1. The shape of the function is an S between 0 and 1, with the vertical middle of the "S" at 0.5.
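As a quick illustration, a minimal from-scratch sketch of the logistic function:

```python
from math import exp

def sigmoid(x):
    # logistic function: squashes any real-valued input into the range (0, 1)
    return 1.0 / (1.0 + exp(-x))

# large negative inputs approach 0, large positive inputs approach 1,
# and the middle of the "S" sits at sigmoid(0) = 0.5
print(sigmoid(-10), sigmoid(0), sigmoid(10))
```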

The sigmoid activation is well suited to a binary classification problem where the output is interpreted as a Binomial probability distribution.

The sigmoid activation function can also be used for multi-class classification problems where the classes are not mutually exclusive. These are often referred to as multi-label classification rather than multi-class classification.

The sigmoid activation function is not appropriate for multi-class classification problems with mutually exclusive classes, where a Multinomial probability distribution is required.

Instead, an alternative activation is required, called the softmax function.

**Max, Argmax, and Softmax**

**Max Function**

The maximum, or "max," mathematical function returns the largest numeric value in a list of numeric values.

We can implement this using the max() Python function, for example:

```python
# example of the max of a list of numbers
# define data
data = [1, 3, 2]
# calculate the max of the list
result = max(data)
print(result)
```

Running the example returns the largest value, "3", from the list of numbers.

3

**Argmax Function**

The argmax, or "arg max," mathematical function returns the index in the list that contains the largest value.

Think of it as the meta version of max: one level of indirection above max, pointing to the position in the list that holds the max value rather than the value itself.

We can implement this using the argmax() NumPy function, for example:

```python
# example of the argmax of a list of numbers
from numpy import argmax
# define data
data = [1, 3, 2]
# calculate the argmax of the list
result = argmax(data)
print(result)
```

Running the example returns the index value "1", which points to the array index [1] that contains the largest value in the list, "3".

1

**Softmax Function**

The softmax, or "soft max," mathematical function can be thought of as a probabilistic or "softer" version of the argmax function.

The term softmax is used because this activation function represents a smooth version of the winner-takes-all activation model, in which the unit with the largest input has output +1 while all other units have output 0.

From a probabilistic perspective, when the argmax() function returns 1 as in the prior section, it effectively returns 0 for the other two array indexes, giving full weight to index 1 and no weight to index 0 and index 2 for the largest value in the list [1, 3, 2]:

[0, 1, 0]

What if we were less sure and wanted to express the argmax probabilistically, with likelihoods?

This can be achieved by scaling the values in the list and converting them into probabilities such that all values in the returned list sum to 1.0.

Specifically, each probability is calculated as the exponent of the value divided by the sum of the exponents of all values in the list.

- probability = exp(value) / sum(exp(v) for v in list)

For example, we can turn the first value "1" in the list [1, 3, 2] into a probability as follows:

- Probability = exp(1) / (exp(1) + exp(3) + exp(2))
- Probability = 2.718281828459045 / 30.19287485057736
- Probability = 0.09003057317038046

We can demonstrate this for each value in the list [1, 3, 2] in Python as follows:

```python
# transform values into probabilities
from math import exp
# calculate each probability
p1 = exp(1) / (exp(1) + exp(3) + exp(2))
p2 = exp(3) / (exp(1) + exp(3) + exp(2))
p3 = exp(2) / (exp(1) + exp(3) + exp(2))
# report probabilities
print(p1, p2, p3)
# report sum of probabilities
print(p1 + p2 + p3)
```

Running the example converts each value in the list into a probability and reports the values, then confirms that all probabilities sum to 1.0.

We can see that most weight is put on index 1 (67%), with less weight on index 2 (24%) and even less on index 0 (9%).

0.09003057317038046 0.6652409557748219 0.24472847105479767

1.0

This is the softmax function.

We can implement it as a function that takes a list of numbers and returns the softmax, or multinomial probability distribution, for the list.

The example below implements the function and demonstrates it on our small list of numbers.

```python
# example of a function for calculating softmax for a list of numbers
from numpy import exp

# calculate the softmax of a vector
def softmax(vector):
    e = exp(vector)
    return e / e.sum()

# define data
data = [1, 3, 2]
# convert list of numbers to a list of probabilities
result = softmax(data)
# report the probabilities
print(result)
# report the sum of the probabilities
print(sum(result))
```

Running the example reports approximately the same numbers, with minor differences in precision.

[0.09003057 0.66524096 0.24472847]

1.0
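One practical note not covered above: exp() overflows for large inputs, so the naive implementation can fail on big scores. A common trick, sketched below, is to subtract the maximum value before exponentiating; the shift cancels in the division, so the probabilities are unchanged:

```python
import numpy as np

def softmax_stable(vector):
    # subtract the max before exponentiating to avoid overflow;
    # the shift cancels in the division, leaving the result unchanged
    v = np.asarray(vector, dtype=float)
    e = np.exp(v - v.max())
    return e / e.sum()

print(softmax_stable([1, 3, 2]))
print(softmax_stable([1001, 1003, 1002]))  # naive exp() would overflow here
```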

Finally, we can use the softmax() SciPy function to calculate the softmax for an array or list of numbers, as follows:

```python
# example of calculating the softmax for a list of numbers
from scipy.special import softmax
# define data
data = [1, 3, 2]
# calculate softmax
result = softmax(data)
# report the probabilities
print(result)
# report the sum of the probabilities
print(sum(result))
```

Running the example, again, we get very similar results, with only minor differences in precision.

[0.09003057 0.66524096 0.24472847]

0.9999999999999997

Now that we are familiar with the softmax function, let's look at how it is used in a neural network model.

**Softmax Activation Function**

The softmax function is used as the activation function in the output layer of neural network models that predict a multinomial probability distribution.

That is, softmax is used as the activation function for multi-class classification problems where class membership is required over more than two class labels.

Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function, which was used to represent a probability distribution over a binary variable.
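We can check this generalization claim numerically: softmax over the two scores [0, x] assigns the second class exactly sigmoid(x), since e^x / (e^0 + e^x) = 1 / (1 + e^-x). A small sketch:

```python
from math import exp

def sigmoid(x):
    # logistic function over a single score
    return 1.0 / (1.0 + exp(-x))

def softmax_pair(a, b):
    # softmax over two raw scores
    ea, eb = exp(a), exp(b)
    return ea / (ea + eb), eb / (ea + eb)

x = 1.5
p0, p1 = softmax_pair(0.0, x)
# p1 matches sigmoid(x), showing softmax generalizes the sigmoid
print(p1, sigmoid(x))
```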

The function can also be used as an activation function for a hidden layer in a neural network, although this is less common. It may be used when the model internally needs to choose or weight multiple different inputs at a bottleneck or concatenation layer.

Softmax units naturally represent a probability distribution over a discrete variable with k possible values, so they may be used as a kind of switch.

In the Keras deep learning library, for a three-class classification task, use of softmax in the output layer may look as follows:

…

model.add(Dense(3, activation='softmax'))

By definition, the softmax activation will output one value for each node in the output layer. The output values represent (or can be interpreted as) probabilities, and the values sum to 1.0.

When modeling a multi-class classification problem, the data must be prepared. The target variable containing the class labels is first label encoded, meaning that an integer is assigned to each class label, from 0 to N-1, where N is the number of class labels.

The label encoded (or integer encoded) target variables are then one-hot encoded. This is a probabilistic representation of the class label, much like the softmax output. A vector is created with one position for each class label: all values are marked 0 (impossible) and a 1 (certain) is used to mark the position of the class label.

For example, three class labels will be integer encoded as 0, 1, and 2, then encoded to vectors as follows:

- Class 0: [1, 0, 0]
- Class 1: [0, 1, 0]
- Class 2: [0, 0, 1]

This is referred to as a one-hot encoding.
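The encoding described above can be sketched in a few lines of plain Python (a hand-rolled helper for illustration; libraries such as scikit-learn and Keras provide equivalents):

```python
def one_hot(label, num_classes):
    # all positions marked 0 (impossible), with a 1 (certain) at the label's index
    vector = [0] * num_classes
    vector[label] = 1
    return vector

# integer encoded class labels 0, 1, and 2 for a three-class problem
for label in range(3):
    print(label, one_hot(label, 3))
```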

It represents the expected multinomial probability distribution for each class, used to correct the model under supervised learning. The softmax function will output a probability of class membership for each class label and attempt to best approximate the expected target for a given input.

For example, if integer encoded class 1 was expected for one instance, the target vector would be:

- [0, 1, 0]

The softmax output might look as follows, putting the most weight on class 1 and less weight on the other classes:

- [0.09003057 0.66524096 0.24472847]

The error between the expected and predicted multinomial probability distributions is often calculated using cross-entropy, and this error is then used to update the model. This is called the cross-entropy loss function.
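As a minimal sketch of that calculation, cross-entropy between the one-hot target and the softmax output from the example above reduces to the negative log of the probability assigned to the true class:

```python
from math import log

def cross_entropy(target, predicted):
    # sum over classes of -target * log(predicted probability)
    return -sum(t * log(p) for t, p in zip(target, predicted))

# expected target for class 1 and the softmax output from the text
target = [0, 1, 0]
predicted = [0.09003057, 0.66524096, 0.24472847]
print(cross_entropy(target, predicted))
```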

We may wish to convert the probabilities back into an integer encoded class label.

This can be achieved using the argmax() function, which returns the index of the list with the largest value. Given that the class labels are integer encoded from 0 to N-1, the argmax of the probabilities will always be the integer encoded class label.

- class integer = argmax([0.09003057, 0.66524096, 0.24472847])
- class integer = 1
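Putting that together, a minimal sketch of the conversion:

```python
from numpy import argmax

# softmax probabilities for a three-class problem
probs = [0.09003057, 0.66524096, 0.24472847]
# the index of the largest probability is the integer encoded class label
class_integer = argmax(probs)
print(class_integer)
```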

**Further Reading**

This section provides additional resources on the topic if you are looking to go deeper.

*Books*

Neural Networks for Pattern Recognition, 1995.

Neural Networks: Tricks of the Trade, 2nd Edition, 2012.

Deep Learning, 2016.

*APIs*

numpy.argmax API

scipy.special.softmax API

*Articles*

Softmax function, Wikipedia

**Conclusion**

In this guide, you discovered the softmax activation function used in neural network models.

Specifically, you learned:

- Linear and sigmoid activation functions are not suitable for multi-class classification tasks.
- Softmax can be thought of as a softened version of the argmax function that returns the index of the largest value in a list.
- How to implement the softmax function from scratch in Python and how to convert the output into a class label.