### How to select an activation function for deep learning

Activation functions are a crucial part of the design of a neural network.

The choice of activation function in the hidden layers controls how well the network model learns the training dataset. The choice of activation function in the output layer defines the type of predictions the model can make.

As such, the activation function must be chosen carefully for every deep learning neural network project.

In this guide, you will find out how to select activation functions for neural network models.

After going through this guide, you will know:

- Activation functions are a critical part of neural network design.
- The modern default activation function for hidden layers is the ReLU function.
- The activation function for output layers depends on the type of prediction problem.

__Tutorial Overview__

1. Activation functions
2. Activation for hidden layers
3. Activation for output layers

__Activation functions__

An activation function defines how the weighted sum of the inputs to a node is transformed into the node's output. It is sometimes referred to as a transfer function. If the output range of the activation function is limited, it may be called a squashing function. Many activation functions are nonlinear and may be referred to as the “nonlinearity” in the layer or the network design.

The choice of activation function has a large impact on the capability and performance of the neural network, and different activation functions may be used in different parts of the model.

Technically, the activation function is applied within or after the internal processing of each node in the network, although networks are designed to use the same activation function for all nodes in a layer.

A network may have three types of layers: input layers that take raw input from the domain, hidden layers that take input from another layer and pass output to another layer, and output layers that make a prediction.

All hidden layers typically use the same activation function. The output layer typically uses a different activation function from the hidden layers, and its choice depends on the type of prediction required by the model.

Activation functions are also typically differentiable, meaning the first-order derivative can be calculated for a given input value. This is required because neural networks are typically trained using the backpropagation of error algorithm, which requires the derivative of the prediction error in order to update the weights of the model.
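To make the differentiability requirement concrete, the sketch below pairs the sigmoid, Tanh and ReLU functions covered later in this guide with their first-order derivatives, which is the quantity backpropagation needs. The helper names are illustrative and not taken from any library, and the value chosen for the ReLU derivative at exactly zero is a convention rather than a mathematical fact.

```python
# first-order derivatives of common activation functions (illustrative sketch)
from math import exp, tanh

# sigmoid activation function
def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

# sigmoid derivative: s(x) * (1 - s(x))
def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# tanh derivative: 1 - tanh(x)^2
def tanh_derivative(x):
    return 1.0 - tanh(x) ** 2

# relu derivative: 0 for negative inputs, 1 for positive inputs
# (undefined at exactly 0; returning 0 there is an assumed convention)
def relu_derivative(x):
    return 1.0 if x > 0.0 else 0.0

# report the derivatives at a few input values
for x in [-2.0, 0.0, 2.0]:
    print(x, sigmoid_derivative(x), tanh_derivative(x), relu_derivative(x))
```

Note that the sigmoid and Tanh derivatives shrink toward zero for large positive or negative inputs, while the ReLU derivative stays at 1 for any positive input.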

There are many different types of activation functions used in neural networks, although perhaps only a small number are used in practice for hidden and output layers.

Let’s take a look at the activation functions used for each type of layer in turn.

__Activation for Hidden Layers__

A hidden layer in a neural network is a layer that receives input from another layer (such as another hidden layer or an input layer) and provides output to another layer (such as another hidden layer or an output layer).

A hidden layer does not directly touch input data or produce outputs for a model, at least in general.

A neural network may have zero or more hidden layers.

Typically, a differentiable nonlinear activation function is used in the hidden layers of a neural network. This allows the model to learn more complex functions than a network trained using a linear activation function.
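One way to see why the nonlinearity matters is that stacked layers with linear activations are equivalent to a single linear layer. The sketch below checks this on a tiny example; the weights and the helper functions `matvec`, `matmul` and `relu` are made up for illustration, and biases are left out for brevity.

```python
# two stacked linear layers collapse into one linear layer (illustrative sketch)

# multiply a matrix (list of rows) by a vector
def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

# multiply two matrices (lists of rows)
def matmul(a, b):
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]

# rectified linear function applied to each element of a vector
def relu(vector):
    return [max(0.0, v) for v in vector]

# arbitrary example weights for two layers and one input vector
w1 = [[1.0, -2.0], [0.5, 1.0]]
w2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

# with linear activations, two layers equal one layer with weights w2 * w1
two_linear_layers = matvec(w2, matvec(w1, x))
one_linear_layer = matvec(matmul(w2, w1), x)
print(two_linear_layers, one_linear_layer)

# with a relu between the layers, the result no longer matches
with_relu = matvec(w2, relu(matvec(w1, x)))
print(with_relu)
```

However many linear layers are stacked, the model can only represent what one linear layer could; inserting a nonlinear function between them removes that restriction.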

To get access to a much richer hypothesis space that benefits from deep representations, you need a nonlinearity, or activation function.

There are perhaps three activation functions you may want to consider for use in hidden layers:

- Rectified Linear Activation (ReLU)
- Logistic (Sigmoid)
- Hyperbolic Tangent (Tanh)

This is not an exhaustive list of activation functions used for hidden layers, but these are the most commonly used.

Let’s take a closer look at each in turn.

__ReLU Hidden Layer Activation Function__

The rectified linear activation function, or ReLU activation function, is perhaps the most common function used for hidden layers.

It is common because it is both simple to implement and effective at overcoming the limitations of other previously popular activation functions, such as Sigmoid and Tanh.

Specifically, it is less susceptible to the vanishing gradients that prevent deep models from being trained, although it can suffer from other problems such as saturated or “dead” units.

The ReLU function is calculated as follows:

- max(0.0, x)

This means that if the input value (x) is negative, then a value of 0.0 is returned; otherwise, the input value is returned unchanged.

We can get a feel for the shape of this function with the worked example below.

```python
# example plot for the relu activation function
from matplotlib import pyplot

# rectified linear function
def rectified(x):
    return max(0.0, x)

# define input data
inputs = [x for x in range(-10, 10)]
# calculate outputs
outputs = [rectified(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()
```

Running the example calculates the outputs for a range of values and creates a plot of inputs versus outputs.

We can see the familiar kink shape of the ReLU activation function.

When using the ReLU function for hidden layers, it is good practice to use a “He Normal” or “He Uniform” weight initialization and to scale input data to the range 0-1 (normalize) before training.
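As a rough illustration of what these two recommendations involve, the sketch below draws weights using the He Normal scheme (a Gaussian with standard deviation sqrt(2 / fan_in), where fan_in is the number of inputs to the layer) and rescales a small list of values to the range 0-1. The function names and the sample values are assumptions made for this example; in practice a deep learning library would normally handle initialization for you.

```python
# he normal weight initialization and 0-1 input scaling (illustrative sketch)
from math import sqrt
from random import gauss, seed

# draw weights from a gaussian with standard deviation sqrt(2 / fan_in)
def he_normal(fan_in, fan_out):
    std = sqrt(2.0 / fan_in)
    return [[gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

# rescale a list of values to the range 0-1 (assumes the values are not all equal)
def normalize(values):
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

# fix the random seed so the example is repeatable
seed(1)
# weights for a layer with 4 inputs and 3 nodes
weights = he_normal(fan_in=4, fan_out=3)
print(weights)
# scale some made-up input values
print(normalize([10.0, 20.0, 15.0, 30.0]))
```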

__Sigmoid Hidden Layer Activation Function__

The sigmoid activation function is also called the logistic function.

It is the same function used in the logistic regression classification algorithm.

The function takes any real value as input and outputs values in the range 0 to 1. The larger the input (more positive), the closer the output will be to 1.0, while the smaller the input (more negative), the closer the output will be to 0.0.

The sigmoid activation function is calculated as follows:

- 1.0 / (1.0 + e^-x)

Where e is a mathematical constant, which is the base of the natural logarithm.

We can get an intuition for the shape of this function with the worked example below.

```python
# example plot for the sigmoid activation function
from math import exp
from matplotlib import pyplot

# sigmoid activation function
def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

# define input data
inputs = [x for x in range(-10, 10)]
# calculate outputs
outputs = [sigmoid(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()
```

Running the example calculates the outputs for a range of values and creates a plot of inputs versus outputs.

We can see the familiar S-shape of the sigmoid activation function.

When using the Sigmoid function for hidden layers, it is good practice to use a “Xavier Normal” or “Xavier Uniform” weight initialization (also referred to as Glorot initialization, named for Xavier Glorot) and to scale input data to the range 0-1 (that is, the range of the activation function) before training.
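For reference, the Xavier Uniform scheme draws each weight uniformly from the range -limit to +limit, where limit = sqrt(6 / (fan_in + fan_out)) and fan_in and fan_out are the number of inputs and outputs of the layer. The sketch below is a minimal version of that idea; the function name and layer sizes are made up for this example.

```python
# xavier (glorot) uniform weight initialization (illustrative sketch)
from math import sqrt
from random import uniform, seed

# draw weights uniformly from [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))
def xavier_uniform(fan_in, fan_out):
    limit = sqrt(6.0 / (fan_in + fan_out))
    return [[uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]

# fix the random seed so the example is repeatable
seed(1)
# weights for a layer with 4 inputs and 2 nodes, so limit = sqrt(6 / 6) = 1.0
weights = xavier_uniform(fan_in=4, fan_out=2)
print(weights)
```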

__Tanh Hidden Layer Activation Function__

The hyperbolic tangent activation function is also referred to simply as the Tanh (also “tanh” and “TanH”) function.

It is very similar to the sigmoid activation function and even has the same S-shape.

The function takes any real value as input and outputs values in the range -1 to 1. The bigger the input (more positive), the nearer the output value will be to 1.0, while the smaller the input (more negative), the nearer the output will be to -1.0.

The Tanh activation function is calculated as follows:

(e^x - e^-x) / (e^x + e^-x)

Where e is a mathematical constant that is the base of the natural logarithm.

We can get an intuition for the shape of this function with the worked example below.

```python
# example plot for the tanh activation function
from math import exp
from matplotlib import pyplot

# tanh activation function
def tanh(x):
    return (exp(x) - exp(-x)) / (exp(x) + exp(-x))

# define input data
inputs = [x for x in range(-10, 10)]
# calculate outputs
outputs = [tanh(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()
```

Running the example calculates the outputs for a range of values and creates a plot of inputs versus outputs.

We can see the familiar S-shape of the Tanh activation function.

When using the Tanh function for hidden layers, it is good practice to use a “Xavier Normal” or “Xavier Uniform” weight initialization (also referred to as Glorot initialization, named for Xavier Glorot) and to scale input data to the range -1 to 1 (that is, the range of the activation function) before training.

__How to Select a Hidden Layer Activation Function__

A neural network will almost always have the same activation function in all hidden layers.

It is most unusual to vary the activation function through a network model.

Traditionally, the sigmoid activation function was the default activation function in the 1990s. Perhaps from the mid-to-late 1990s through the 2010s, the Tanh function was the default activation function for hidden layers.

Both the sigmoid and Tanh functions can make the model more susceptible to problems during training, via the so-called vanishing gradients problem.
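A simplified way to see the vanishing gradients problem: during backpropagation the error signal is multiplied by one activation derivative per layer, and the sigmoid derivative is never larger than 0.25. The sketch below multiplies that best-case value across several layers. It deliberately ignores the weights, which also enter the product in a real network, so treat it as an intuition aid rather than a faithful simulation.

```python
# how repeated sigmoid derivatives shrink a gradient (illustrative sketch)
from math import exp

# sigmoid activation function
def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

# sigmoid derivative: s(x) * (1 - s(x))
def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# the sigmoid derivative peaks at x = 0, so this is the best case
best_case = sigmoid_derivative(0.0)

# multiply one derivative per layer, ignoring the weights for simplicity
for layers in [1, 5, 10, 20]:
    print(layers, best_case ** layers)
```

By contrast, the ReLU derivative is 1 for any positive input, so the same product does not shrink along active units.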

The activation function used in hidden layers is typically chosen based on the type of neural network architecture.

Modern neural network models with common architectures, such as MLP and CNN, will make use of the ReLU activation function, or extensions.

Recurrent networks still commonly use Tanh or sigmoid activation functions, or even both. For example, the LSTM commonly uses the Sigmoid activation for recurrent connections and the Tanh activation for output.

- Multilayer Perceptron (MLP): ReLU activation function.
- Convolutional Neural Network (CNN): ReLU activation function.
- Recurrent Neural Network: Tanh and/or Sigmoid activation function.

If you are unsure which activation function to use for your network, try a few and compare the results.

The figure below summarizes how to select an activation function for the hidden layers of your neural network model.

__Activation for Output Layers__

The output layer is the layer in a neural network model that directly outputs a prediction.

All feed-forward neural network models have an output layer.

There are perhaps three activation functions you may want to consider for use in the output layer:

- Linear
- Logistic (Sigmoid)
- Softmax

This is not an exhaustive list of activation functions used for output layers, but these are the most commonly used.

__Linear Output Activation Function__

The linear activation function is also called “identity” (multiplied by 1.0) or “no activation.”

This is because the linear activation function does not change the weighted sum of the input in any way and instead returns the value directly.

We can get an intuition for the shape of this function with the worked example below.

```python
# example plot for the linear activation function
from matplotlib import pyplot

# linear activation function
def linear(x):
    return x

# define input data
inputs = [x for x in range(-10, 10)]
# calculate outputs
outputs = [linear(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()
```

Running the example calculates the outputs for a range of values and creates a plot of inputs versus outputs.

We can see a diagonal line shape where inputs are plotted against identical outputs.

Target values used to train a model with a linear activation function in the output layer are typically scaled prior to modeling using normalization or standardization transforms.
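As a small illustration of target scaling, the sketch below standardizes a list of made-up regression targets (subtract the mean, divide by the standard deviation) and then inverts the transform, which is what you would do to turn model predictions back into the original units. The function names and values are assumptions for this example.

```python
# standardizing regression targets before training (illustrative sketch)
from statistics import mean, pstdev

# rescale values to zero mean and unit standard deviation
# (assumes the values are not all equal)
def standardize(values):
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values], mu, sigma

# map scaled values back to the original units
def invert(scaled, mu, sigma):
    return [v * sigma + mu for v in scaled]

# made-up target values
targets = [100.0, 150.0, 200.0, 250.0]
scaled, mu, sigma = standardize(targets)
print(scaled)
# recover the original targets from the scaled values
print(invert(scaled, mu, sigma))
```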

__Sigmoid Output Activation Function__

The sigmoid or logistic activation function was described in the previous section.

Nevertheless, for symmetry, we can review the shape of this function with the worked example below.

```python
# example plot for the sigmoid activation function
from math import exp
from matplotlib import pyplot

# sigmoid activation function
def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

# define input data
inputs = [x for x in range(-10, 10)]
# calculate outputs
outputs = [sigmoid(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()
```

Running the example calculates the outputs for a range of values and creates a plot of inputs versus outputs.

We can see the familiar S-shape of the sigmoid activation function.

Target labels used to train a model with a sigmoid activation function in the output layer will have the values 0 or 1.

__Softmax Output Activation Function__

The softmax function outputs a vector of values that sum to 1.0 and can be interpreted as probabilities of class membership.

It is related to the argmax function, which outputs a 0 for all options and a 1 for the chosen option. Softmax is a “softer” version of argmax that allows a probability-like output of a winner-take-all function.

As such, the input to the function is a vector of real values and the output is a vector of the same length with values that sum to 1.0 like probabilities.

The softmax function is calculated as follows:

e^x / sum(e^x)

Where x is a vector of outputs and e is a mathematical constant that is the base of the natural logarithm.

We cannot plot the softmax function, but we can give an example of calculating it in Python.

```python
# example of calculating the softmax activation function
from numpy import exp

# softmax activation function
def softmax(x):
    return exp(x) / exp(x).sum()

# define input data
inputs = [1.0, 3.0, 2.0]
# calculate outputs
outputs = softmax(inputs)
# report the probabilities
print(outputs)
# report the sum of the probabilities
print(outputs.sum())
```

Running the example calculates the softmax output for the input vector.

We then confirm that the outputs of the softmax do indeed sum to 1.0.

[0.09003057 0.66524096 0.24472847]

1.0
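One practical caveat is that exponentiating large input values can overflow. A commonly used variant, sketched below, subtracts the largest input from every value before exponentiating; this leaves the result unchanged because the common factor cancels in the division. The function name `stable_softmax` is made up for this example, and it uses plain Python lists rather than NumPy arrays.

```python
# softmax that subtracts the maximum before exponentiating (illustrative sketch)
from math import exp

# numerically safer softmax for a list of real values
def stable_softmax(values):
    largest = max(values)
    # the largest exponent is now exp(0) = 1, so nothing can overflow
    exps = [exp(v - largest) for v in values]
    total = sum(exps)
    return [value / total for value in exps]

# same input as the worked example, so the probabilities should match it
print(stable_softmax([1.0, 3.0, 2.0]))
# large inputs that would overflow a direct exp() call
print(stable_softmax([1000.0, 1001.0, 1002.0]))
```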

Target labels used to train a model with the softmax activation function in the output layer will be vectors with 1 for the target class and 0 for all other classes.
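This target format is often called a one-hot encoding. A minimal sketch of producing it from integer class labels is shown below; the function name and the sample labels are assumptions for this example, and the labels are assumed to be integers starting at 0.

```python
# one-hot encoding integer class labels (illustrative sketch)

# build a vector with 1 at the label's position and 0 elsewhere
def one_hot(label, num_classes):
    vector = [0] * num_classes
    vector[label] = 1
    return vector

# made-up integer labels for a 3-class problem
labels = [0, 2, 1]
encoded = [one_hot(label, 3) for label in labels]
print(encoded)
```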

__How to Select an Output Activation Function__

You must choose the activation function for your output layer based on the type of prediction problem that you are solving.

Specifically, the choice depends on the type of variable that is being predicted.

For example, you may divide prediction problems into two main groups: predicting a categorical variable (classification) and predicting a numerical variable (regression).

If your problem is a regression problem, you should use a linear activation function.

- Regression: One node, linear activation.

If your problem is a classification problem, then there are three main types of classification problems, and each may use a different activation function.

Predicting a probability is not a regression problem; it is classification. In all cases of classification, your model will predict the probability of class membership (that is, the probability that an example belongs to each class), which you can convert to a crisp class label by rounding (for sigmoid) or argmax (for softmax).
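The two conversions can be sketched as follows. The function names are made up for this example, and the 0.5 cut-off used for the sigmoid case is an assumed default that stands in for rounding; other thresholds may suit some problems better.

```python
# converting predicted probabilities to crisp class labels (illustrative sketch)

# sigmoid output: one probability, mapped to class 1 at or above the threshold
def label_from_sigmoid(probability, threshold=0.5):
    return 1 if probability >= threshold else 0

# softmax output: argmax, the index of the largest probability
def label_from_softmax(probabilities):
    return max(range(len(probabilities)), key=lambda i: probabilities[i])

# a made-up sigmoid output
print(label_from_sigmoid(0.73))
# the softmax probabilities reported earlier in this guide
print(label_from_softmax([0.09003057, 0.66524096, 0.24472847]))
```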

If there are two mutually exclusive classes (binary classification), then your output layer will have one node and a sigmoid activation function should be used. If there are more than two mutually exclusive classes (multiclass classification), then your output layer will have one node per class and a softmax activation should be used. If there are two or more mutually inclusive classes (multilabel classification), then your output layer will have one node per class and a sigmoid activation function should be used.

- Binary classification: One node, sigmoid activation.
- Multiclass classification: One node per class, softmax activation.
- Multilabel classification: One node per class, sigmoid activation.
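The rules above, together with the regression case, can be collected into a small helper that returns the number of output nodes and the activation name for a given problem type. This is only a sketch of the decision logic; the function name, the problem-type strings and the activation names are chosen for this example and are not tied to any particular library.

```python
# choosing the output layer configuration from the problem type (illustrative sketch)

# return (number of output nodes, activation name) for a prediction problem
def output_layer(problem, num_classes=None):
    if problem == "regression":
        return 1, "linear"
    if problem == "binary":
        return 1, "sigmoid"
    if problem in ("multiclass", "multilabel"):
        if num_classes is None:
            raise ValueError("num_classes is required for " + problem)
        # mutually exclusive classes use softmax, otherwise sigmoid per class
        activation = "softmax" if problem == "multiclass" else "sigmoid"
        return num_classes, activation
    raise ValueError("unknown problem type: " + problem)

print(output_layer("regression"))
print(output_layer("binary"))
print(output_layer("multiclass", num_classes=5))
print(output_layer("multilabel", num_classes=5))
```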

The image below summarizes how to select an activation function for the output layer of your neural network model.

__Further Reading__

This section provides more resources on the topic if you are looking to go deeper.

*Books*

Deep Learning, 2016.

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

Neural Networks for Pattern Recognition, 1996.

Deep Learning with Python, 2017.

*Articles*

Activation function, Wikipedia

__Conclusion__

In this guide, you found out how to select activation functions for neural network models.

Specifically, you learned:

- Activation functions are a critical part of neural network design.
- The modern default activation function for hidden layers is the ReLU function.
- The activation function for output layers depends on the type of prediction problem.