How to implement Wasserstein Loss for Generative Adversarial Networks
The Wasserstein Generative Adversarial Network, or Wasserstein GAN, is an extension to the generative adversarial network (GAN) that both improves stability when training the model and provides a loss function that correlates with the quality of generated images.
It is an important extension to the GAN model and requires a conceptual shift away from a discriminator that predicts the probability of a generated image being “real” and toward the idea of a critic model that scores the “realness” of a given image.
This conceptual shift is motivated mathematically: the GAN is trained using the earth mover distance, or Wasserstein distance, which measures the distance between the distribution of the data observed in the training dataset and the distribution observed in the generated examples.
In this post, you will discover how to implement Wasserstein loss for Generative Adversarial Networks.
After reading this post, you will know:
- The conceptual shift in the WGAN from a discriminator predicting a probability to a critic predicting a score.
- The implementation details for the WGAN as minor changes to the traditional deep convolutional GAN.
- The intuition behind the Wasserstein loss function and how to implement it from scratch.
Tutorial Overview
This tutorial is divided into five parts; they are:
- GAN stability and the discriminator
- What is a Wasserstein GAN?
- Implementation details of the Wasserstein GAN
- How to implement Wasserstein Loss
- Common point of confusion with expected labels
GAN Stability and the Discriminator
Generative Adversarial Networks, or GANs, are challenging to train.
The discriminator model must classify a given input image as real (from the dataset) or fake (generated), and the generator model must generate new and plausible images.
GANs are difficult to train because the architecture involves the simultaneous training of a generator and a discriminator model in a zero-sum game. Stable training requires finding and maintaining an equilibrium between the capabilities of the two models.
The discriminator model is a neural network that learns a binary classification problem, using a sigmoid activation function in the output layer, and is fit using a binary cross-entropy loss function. As such, the model predicts a probability that a given input is real (or fake, as 1 minus the predicted probability) as a value between 0 and 1.
The loss function penalizes the model in proportion to how far the predicted probability distribution differs from the expected probability distribution for a given image. This provides the basis for the error that is backpropagated through the discriminator and the generator so that they perform better on the next batch.
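As a rough illustration of the penalty described above, the sketch below computes binary cross-entropy for a small hypothetical batch of discriminator outputs. The function name and the sample probabilities are made up for this example, and it is written in plain Python so that it runs without a deep learning library.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from 0 and 1 to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical batch: two real images (label 1) and two fake images (label 0).
labels = [1, 1, 0, 0]
confident = [0.9, 0.8, 0.1, 0.2]  # predictions close to the expected labels
uncertain = [0.6, 0.5, 0.5, 0.4]  # predictions close to guessing

loss_confident = binary_cross_entropy(labels, confident)
loss_uncertain = binary_cross_entropy(labels, uncertain)
print(loss_confident, loss_uncertain)
```

The predictions that sit further from the expected labels receive the larger loss, which is the signal that is backpropagated during training.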
The WGAN relaxes the role of the discriminator when training a GAN and proposes the alternative of a critic.
What is a Wasserstein GAN?
The Wasserstein GAN, or WGAN for short, was introduced by Martin Arjovsky, et al. in their 2017 paper titled “Wasserstein GAN.”
It is an extension of the GAN that seeks an alternate way of training the generator model to better approximate the distribution of data observed in a given training dataset.
Instead of using a discriminator to classify or predict the probability of generated images as being real or fake, the WGAN replaces the discriminator model with a critic that scores the realness or fakeness of a given image.
This change is motivated by a mathematical argument that training the generator should seek to minimize the distance between the distribution of the data observed in the training dataset and the distribution observed in generated examples. The argument contrasts different distribution distance measures, such as Kullback-Leibler (KL) divergence, Jensen-Shannon (JS) divergence, and the Earth-Mover (EM) distance, also referred to as Wasserstein distance.
According to the authors, the most fundamental difference between such distances is their impact on the convergence of sequences of probability distributions.
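To make the contrast between these distances more concrete, here is a toy sketch that is not taken from this tutorial: it compares two point masses on a small integer grid. The helper names and the grid size are assumptions for illustration. The EM distance for 1D distributions on a unit-spaced grid is computed as the sum of absolute differences between the cumulative sums.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; infinite if q is 0 where p > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def js_divergence(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def em_distance_1d(p, q):
    """Earth-Mover distance on the integer grid 0, 1, 2, ... (unit spacing)."""
    cdf_p = cdf_q = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q)
    return total

def point_mass(position, size=6):
    """All probability mass on a single grid position."""
    return [1.0 if i == position else 0.0 for i in range(size)]

real = point_mass(0)
for shift in [1, 3, 5]:
    generated = point_mass(shift)
    print(shift,
          kl_divergence(real, generated),
          js_divergence(real, generated),
          em_distance_1d(real, generated))
```

In this toy setting the KL divergence is infinite and the JS divergence stays at log 2 however far apart the two distributions are, while the EM distance grows with the shift. That is the kind of behavior that makes the EM distance a more informative training signal when the distributions do not overlap.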
The authors demonstrate that a critic neural network can be trained to approximate the Wasserstein distance and, in turn, used to effectively train a generator model.
They define a form of GAN, called the Wasserstein GAN, that minimizes a reasonable and efficient approximation of the EM distance, and they show theoretically that the corresponding optimization problem is sound.
Importantly, the Wasserstein distance is continuous and differentiable almost everywhere, and it continues to provide a useful gradient even after the critic is well trained.
Because the EM distance is continuous and differentiable almost everywhere, the authors argue that we can, and should, train the critic to optimality: the more we train the critic, the more reliable the gradient of the Wasserstein distance becomes.
This is unlike the discriminator model, which, once trained, may fail to provide useful gradient information for updating the generator model.
The authors observe that the discriminator learns very quickly to distinguish between fake and real and, as expected, provides no reliable gradient information. The critic, however, cannot saturate, and it converges to a linear function that gives remarkably clean gradients everywhere.
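One simplified way to see the saturation issue is to look only at the output activation. The sketch below compares the derivative of a sigmoid output with the derivative of a linear output at a few input values. It is a simplification of the argument in the paper, which concerns the whole discriminator function, and the function names are chosen for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid with respect to its input."""
    s = sigmoid(x)
    return s * (1.0 - s)

def linear_grad(x):
    """Derivative of a linear (identity) output with respect to its input."""
    return 1.0

# The more confident the sigmoid output, the smaller the gradient it passes back.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x), linear_grad(x))
```

The sigmoid gradient shrinks toward zero as its input grows, whereas the linear output passes the same gradient back regardless of the score.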
The benefit of the WGAN is that the training process is more stable and less sensitive to model architecture and the choice of hyperparameter configurations.
According to the paper, training WGANs does not require maintaining a careful balance between the training of the discriminator and the generator, and does not require a careful design of the network architecture either. The mode dropping phenomenon that is typical in GANs is also drastically reduced.
Perhaps most importantly, the loss of the critic appears to relate to the quality of the images created by the generator.
Specifically, the lower the loss of the critic when evaluating generated images, the higher the expected quality of the generated images. This is important because, unlike other GANs that seek stability by finding an equilibrium between two models, the WGAN seeks convergence, lowering generator loss.
To the authors' knowledge, this is the first time in the GAN literature that such a property has been shown, where the loss of the GAN shows properties of convergence. This property is extremely useful when doing research on adversarial networks, as one does not need to stare at the generated samples to figure out failure modes or to learn which models are doing better than others.
Implementation Details of the Wasserstein GAN
Although the theoretical grounding for the WGAN is dense, implementing a WGAN requires only a few minor changes to the standard deep convolutional GAN, or DCGAN.
Those changes are as follows:
- Use a linear activation function in the output layer of the critic model (instead of sigmoid).
- Use Wasserstein loss to train the critic and generator models, which promotes a larger difference between scores for real and generated images.
- Constrain critic model weights to a limited range after each mini-batch update (e.g. [-0.01, 0.01]).
The paper notes that, to keep the parameters in a compact space, something simple we can do is clamp the weights to a fixed box (say W = [-0.01, 0.01]^l) after each gradient update.
- Update the critic model more times than the generator each iteration.
- Use the RMSProp version of gradient descent with a small learning rate and no momentum (e.g. 0.00005).
The authors report that WGAN training becomes unstable at times when using a momentum-based optimizer such as Adam, so they switched to RMSProp.
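The weight clipping and the update schedule from the list above can be sketched in plain Python. This is only a schedule sketch under stated assumptions: the update functions are stand-ins that do no real training, the function names are hypothetical, and the value of 5 critic updates per generator update is assumed here (it is the default reported in the paper, not a value stated in this tutorial).

```python
N_CRITIC = 5             # critic updates per generator update (assumed value)
CLIP_VALUE = 0.01        # critic weights are clamped to [-CLIP_VALUE, CLIP_VALUE]
LEARNING_RATE = 0.00005  # small RMSProp learning rate mentioned above (unused here)

def clip_weights(weights, clip_value=CLIP_VALUE):
    """Clamp every weight to the box [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, w)) for w in weights]

def train(n_iterations, update_critic, update_generator, critic_weights):
    """Schedule only: several clipped critic updates, then one generator update."""
    for _ in range(n_iterations):
        for _ in range(N_CRITIC):
            critic_weights = update_critic(critic_weights)
            critic_weights = clip_weights(critic_weights)
        update_generator(critic_weights)
    return critic_weights

# Stand-in update functions that only record how often they are called.
calls = {"critic": 0, "generator": 0}

def fake_critic_update(weights):
    calls["critic"] += 1
    return [w + 0.02 for w in weights]  # pretend gradient step

def fake_generator_update(weights):
    calls["generator"] += 1

final_weights = train(3, fake_critic_update, fake_generator_update, [-0.5, 0.0, 0.005])
print(calls, final_weights)
```

In a real model the clipping would be applied to every weight tensor of the critic, and the two update functions would perform RMSProp steps on the Wasserstein loss.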
The image below provides a summary of the main training loop for training a WGAN, taken from the paper. Note the listing of recommended hyperparameters used in the model.
How to implement Wasserstein Loss
The Wasserstein loss function seeks to increase the gap between the scores for real and generated images.
We can summarize the function as it is described in the paper as follows:
- Critic loss = [average critic score on real images] - [average critic score on fake images]
- Generator loss = -[average critic score on fake images]
Where the average scores are calculated across a mini-batch of samples.
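The two formulas above translate directly into code. The following is a minimal sketch using plain Python lists in place of tensors. The function names and the score values are hypothetical.

```python
from statistics import mean

def critic_loss(real_scores, fake_scores):
    """Average critic score on real images minus average score on fake images."""
    return mean(real_scores) - mean(fake_scores)

def generator_loss(fake_scores):
    """Negative of the average critic score on fake images."""
    return -mean(fake_scores)

# Hypothetical critic scores for one mini-batch of each kind.
real_scores = [-1.5, -0.5, -1.0]
fake_scores = [2.0, 1.0, 3.0]

print(critic_loss(real_scores, fake_scores))
print(generator_loss(fake_scores))
```

With these values the critic loss is -3.0 and the generator loss is -2.0. Under this sign convention, the critic loss falls as the real scores become smaller relative to the fake scores.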
This is precisely how the loss is implemented for graph-based deep learning frameworks such as PyTorch and TensorFlow.
The calculations are straightforward to interpret once we recall that stochastic gradient descent seeks to minimize loss.
In the case of the generator, a larger score from the critic results in a smaller loss for the generator, encouraging the generator to produce images that the critic scores more highly. For example, an average score of 10 becomes -10, an average score of 50 becomes -50, which is smaller, and so on.
In the case of the critic, a larger score for real images results in a larger loss for the critic, penalizing the model. This encourages the critic to output smaller scores for real images. We could also reverse the convention and encourage the critic to output a large score for real images and a small score for fake images, achieving the same result. Some implementations make this change.
In the Keras deep learning library (and some others), we cannot implement the Wasserstein loss function directly as described in the paper and as implemented in PyTorch and TensorFlow. Instead, we can achieve the same effect without making the calculation of the critic's loss depend on the loss calculated for both real and fake images.
A good way to think about this is a negative score for real images and a positive score for fake images, although this negative/positive split of scores learned during training is not required; just larger and smaller is sufficient.
- Small Critic Score (e.g. < 0): Real
- Large Critic Score (e.g. > 0): Fake
We can multiply the average predicted score by -1 in the case of fake images so that larger averages become smaller averages and the gradient is in the correct direction, i.e. minimizing loss. For example, average scores on fake images of [0.5, 0.8, and 1.0] across three batches of fake images would become [-0.5, -0.8, and -1.0] when calculating weight updates.
- Loss for fake images = -1 * Average Critic Score
No change is needed for the case of real scores, as we want to encourage smaller average scores for real images.
- Loss for real images = Average Critic Score
This can be implemented consistently by assigning an expected outcome target of -1 for fake images and 1 for real images and implementing the loss function as the expected label multiplied by the average score. The -1 label is multiplied by the average score for fake images and encourages a larger predicted average, while the +1 label is multiplied by the average score for real images and has no effect, encouraging a smaller predicted average.
- Wasserstein Loss = Label * Average Critic Score
Or
- Wasserstein Loss (Real Images) = 1 * Average Predicted Score
- Wasserstein Loss (Fake Images) = -1 * Average Predicted Score
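The label-times-average rule can be checked with a few numbers. The sketch below reuses the fake-image averages from the example in the text. The real-image averages and the helper name are hypothetical.

```python
def labelled_loss(label, average_score):
    """Wasserstein loss for one batch: expected label times average critic score."""
    return label * average_score

REAL_LABEL = 1
FAKE_LABEL = -1

# Average critic scores on three batches of fake images (values from the text).
fake_averages = [0.5, 0.8, 1.0]
fake_losses = [labelled_loss(FAKE_LABEL, s) for s in fake_averages]
print(fake_losses)  # [-0.5, -0.8, -1.0]

# Hypothetical average critic scores on three batches of real images.
real_averages = [-0.2, -0.6, -0.9]
real_losses = [labelled_loss(REAL_LABEL, s) for s in real_averages]
print(real_losses)  # [-0.2, -0.6, -0.9], unchanged by the +1 label
```

In both cases a lower loss corresponds to the desired direction: larger averages for fake images and smaller averages for real images.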
We can implement this in Keras by assigning the expected labels of -1 and 1 for fake and real images respectively. The inverse labels could be used to the same effect, e.g. -1 for real and +1 for fake, to encourage small scores for fake images and large scores for real images. Some developers do implement the WGAN in this alternate way, which is just as correct.
The loss function can be implemented by multiplying the expected label for each sample by the predicted score (element-wise), then calculating the mean.
def wasserstein_loss(y_true, y_pred):
    return mean(y_true * y_pred)
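The above function is the elegant way to implement the loss function; an alternative, less elegant implementation that might be more intuitive is shown next. The two forms agree when every sample in a batch carries the same label, as is the case when the critic is updated on all-real or all-fake batches; with mixed labels in one batch they can differ.
def wasserstein_loss(y_true, y_pred):
    return mean(y_true) * mean(y_pred)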
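A quick numeric check can show how the two forms relate. This sketch uses plain Python lists rather than Keras tensors, and the function names and score values are chosen for illustration only.

```python
from statistics import mean

def loss_elementwise(y_true, y_pred):
    """Mean of label times score, computed per sample."""
    return mean([t * p for t, p in zip(y_true, y_pred)])

def loss_product_of_means(y_true, y_pred):
    """Mean label times mean score."""
    return mean(y_true) * mean(y_pred)

# A batch containing only fake images: every label is -1.
fake_labels = [-1, -1, -1]
fake_scores = [0.5, 0.8, 1.0]
same_label_a = loss_elementwise(fake_labels, fake_scores)
same_label_b = loss_product_of_means(fake_labels, fake_scores)
print(same_label_a, same_label_b)

# A hypothetical mixed batch: one real image (+1) and one fake image (-1).
mixed_labels = [1, -1]
mixed_scores = [2.0, 4.0]
mixed_a = loss_elementwise(mixed_labels, mixed_scores)
mixed_b = loss_product_of_means(mixed_labels, mixed_scores)
print(mixed_a, mixed_b)
```

For the single-label batch the two results match up to floating-point rounding. For the mixed batch the element-wise form gives -1.0 while the product of means collapses to zero, which is one reason to prefer the element-wise form.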
In Keras, the mean function can be implemented using the Keras backend API to ensure the mean is calculated across samples in the provided tensors; for example:
from keras import backend

# implementation of wasserstein loss
def wasserstein_loss(y_true, y_pred):
    return backend.mean(y_true * y_pred)
Now that we know how to implement the Wasserstein loss function in Keras, let’s clarify one common point of misunderstanding.
Common point of confusion with expected labels
Recall that we are using the expected labels of -1 for fake images and +1 for real images.
A common point of confusion is the belief that a perfect critic model will output -1 for every fake image and +1 for every real image.
This is incorrect.
Again, recall we are using stochastic gradient descent to find the set of weights in the critic (and generator) models that minimize the loss function.
We have established that we want the critic model to output larger scores on average for fake images and smaller scores on average for real images. We then designed a loss function to encourage this outcome.
This is the key point about loss functions used to train neural network models. They encourage a desired model behavior, and they do not have to achieve this by providing the expected outcomes. In this case, we defined our Wasserstein loss function to interpret the average score predicted by the critic model and used labels for the real and fake cases to help with this interpretation.
So what is a good loss for real and fake images under Wasserstein loss?
Wasserstein loss is not an absolute measure that can be compared across GAN models. Instead, it is relative and depends on your model configuration and dataset. What matters is that it is consistent for a given critic model, and that convergence of the generator (better loss) does correlate with better generated image quality.
It could be negative scores for real images and positive scores for fake images, but this is not required. All scores could be positive, or all scores could be negative.
The loss function only encourages a separation between scores for fake and real images as larger and smaller, not necessarily positive and negative.
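The point that only the separation matters can be illustrated numerically. In the sketch below, which uses hypothetical scores and a helper name chosen for this example, shifting every critic score by the same constant leaves the combined critic loss unchanged, whether the scores end up all positive or all negative.

```python
from statistics import mean

def combined_critic_loss(real_scores, fake_scores):
    """Label +1 on real scores plus label -1 on fake scores, as in this tutorial."""
    return 1 * mean(real_scores) + (-1) * mean(fake_scores)

def shift(scores, offset):
    """Add the same constant to every score."""
    return [s + offset for s in scores]

# Hypothetical scores: negative for real images, positive for fake images.
real = [-2.0, -1.0, -3.0]
fake = [1.0, 2.0, 3.0]

baseline = combined_critic_loss(real, fake)
all_positive = combined_critic_loss(shift(real, 10.0), shift(fake, 10.0))
all_negative = combined_critic_loss(shift(real, -10.0), shift(fake, -10.0))
print(baseline, all_positive, all_negative)
```

All three values are equal, because the loss depends only on the gap between the average real score and the average fake score.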
Conclusion
In this post, you discovered how to implement Wasserstein loss for Generative Adversarial Networks.
Specifically, you learned:
- The conceptual shift in the WGAN from a discriminator predicting a probability to a critic predicting a score.
- The implementation details for the WGAN as minor changes to the standard deep convolutional GAN.
- The intuition behind the Wasserstein loss function and how to implement it from scratch.