Autoencoding Generative Adversarial Networks
How the AEGAN architecture stabilizes GAN training and prevents mode collapse
Apr 18
GANs are hard to train. When they work, they work wonders, but anyone who’s tried to train one themselves knows they’re damn finicky bastards. Two of the most common problems in GAN training are mode collapse and lack of convergence. In mode collapse, the generator learns to only generate a handful of samples; in generating “handwritten” digits, a GAN undergoing mode collapse might only learn to draw sevens, albeit highly-realistic sevens. With lack of convergence, the healthy competition between the generator and the discriminator sours, usually with the discriminator becoming much better than the generator; when the discriminator is able to easily and completely discern between real and generated samples, the generator doesn’t get useful feedback and isn’t able to improve.
In a recent paper, I proposed a technique which appears to stabilize GAN training and to address both of the above issues. A side-effect of this technique is that it allows for efficient and direct interpolation between real samples. In this article, I aim to step through the key ideas of the paper and illustrate why I think the AEGAN technique has the potential to be a very useful tool in the GAN trainer’s toolbox.
Enter the AEGAN
Bijective Mapping
GANs learn a mapping from some latent space Z (the random noise) to some sample space X (the dataset, usually images). These mappings are total functions: each point z in Z corresponds to some sample x in X. However, they’re rarely injective or surjective: many latent points can map to the same sample, and many samples in X do not have a corresponding point in Z. Indeed, mode collapse occurs when many points z_i, z_j, and z_k all map to a single sample x_i, and the GAN is unable to generate the samples x_j or x_k. With this in mind, a more ideal GAN would have the following qualities:
- Each latent point z in Z should correspond to a unique sample x in X.
- Each sample x in X should correspond to a unique latent point z in Z.
- The probability of drawing z from Z, p(Z=z), should equal the probability of drawing x from X, p(X=x).
These three qualities suggest that we should aim for a one-to-one relationship (i.e. a bijective mapping) between the latent space and the sample space. To do this, we train a function G : Z ⟶ X , which is our generator, and another function E : X ⟶ Z , which we will call the encoder. The intents of these functions are:
- G(z) should produce realistic samples in the same proportions as they are distributed in X . (This is what regular GANs aim to do)
- E(x) should produce likely latent points in the same proportions as they are distributed in Z.
- The composition E(G(z)) should faithfully reproduce the original latent point z .
- The composition G(E(x)) should faithfully reproduce the original image x .
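To make these four intents concrete, here is a toy sketch in which G and E are a hypothetical pair of exactly-inverse affine maps. The real networks are deep models that only approximate this inverse relationship through training; the matrix and dimensions below are illustrative stand-ins.

```python
import numpy as np

# Hypothetical "generator" and "encoder" that are exact inverses of each
# other, so the compositions E(G(z)) and G(E(x)) reproduce their inputs.
rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0], [0.0, 3.0]])   # invertible matrix (det = 6)
b = np.array([0.5, -1.0])

def G(z):
    """Map a latent point z to a 'sample' x."""
    return A @ z + b

def E(x):
    """Map a sample x back to its latent point z."""
    return np.linalg.solve(A, x - b)

z = rng.standard_normal(2)
x = rng.standard_normal(2)

print(np.allclose(E(G(z)), z))  # intent 3: E(G(z)) reproduces z
print(np.allclose(G(E(x)), x))  # intent 4: G(E(x)) reproduces x
```

Because the map here is invertible by construction, both checks print `True`; the AEGAN losses below exist precisely to push learned networks toward this behavior.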
Architecture
AEGAN is a four-network model comprising two GANs and two autoencoders, illustrated in Figure 1. It is a generalization of the CycleGAN technique for unpaired image-to-image translation, where one of the two image domains is replaced with random noise. In short, we train two networks to translate between the sample space X and the latent space Z, and another two networks to discriminate between real and fake samples and latent vectors. Figure 1 is a complicated diagram, so let me break it down:
Networks (Squares):
- G is the generator network. It takes a latent vector z as input and returns an image x as output.
- E is the encoder network. It takes an image x as input and returns a latent vector z as output.
- Dx is the image discriminator network. It takes an image x as input and returns the probability that x was drawn from the original dataset as output.
- Dz is the latent discriminator network. It takes a latent vector z as input and returns the probability that z was drawn from the latent distribution as output.
Values (Circles):
- x : genuine samples from the original dataset. This is a bit ambiguous, because in some places I use x to mean any value in the domain X. Sorry about that.
- z : genuine samples from the latent-generating distribution (random noise).
- x_hat : samples produced by G given a real random vector, i.e. x_hat=G(z).
- z_hat : vectors produced by E given a real sample, i.e. z_hat=E(x).
- x_tilde : samples reproduced by G from encodings produced by E , i.e. x_tilde=G(z_hat)=G(E(x)).
- z_tilde : vectors reproduced by E from images generated by G, i.e. z_tilde=E(x_hat)=E(G(z)).
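The six values above can be traced end to end with stub networks. The dimensions (a 16-dim latent space, 64 flattened pixels) and the single-layer linear/tanh stand-ins below are hypothetical, not the convolutional architectures from the paper; the point is only the data flow of Figure 1.

```python
import numpy as np

# Stub networks standing in for the trained G and E, to trace the values.
rng = np.random.default_rng(0)
LATENT_DIM, IMG_DIM = 16, 64

Wg = 0.1 * rng.standard_normal((IMG_DIM, LATENT_DIM))
We = 0.1 * rng.standard_normal((LATENT_DIM, IMG_DIM))

def G(z): return np.tanh(Wg @ z)   # latent vector -> image
def E(x): return We @ x            # image -> latent vector

x = np.clip(rng.standard_normal(IMG_DIM), -1, 1)  # genuine sample
z = rng.standard_normal(LATENT_DIM)               # genuine random noise

x_hat   = G(z)        # generated sample
z_hat   = E(x)        # encoding of a real sample
x_tilde = G(z_hat)    # autoencoded sample, G(E(x))
z_tilde = E(x_hat)    # autoencoded latent vector, E(G(z))

print(x_hat.shape, z_hat.shape, x_tilde.shape, z_tilde.shape)
```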
Losses (Diamonds):
- L1 (blue): The image reconstruction loss ||G(E(x))-x||_1 , i.e. the Manhattan distance between the pixels of the original image and the autoencoded reconstruction.
- L2 (green): The latent vector reconstruction loss ||E(G(z))-z||_2 , i.e. the Euclidean distance between the original latent vector and the autoencoded reconstruction.
- GAN (red): The adversarial loss for images. Dx is trained to discriminate between real images (x) and fake images (x_hat and x_tilde, the latter not shown).
- GAN (yellow): The adversarial loss for latent vectors. Dz is trained to discriminate between real random noise (z) and encodings (z_hat and z_tilde, the latter not shown).
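The two reconstruction losses can be written out directly; the vectors below are random placeholders standing in for x, G(E(x)), z, and E(G(z)), and the binary cross-entropy shown is one common choice for the adversarial terms.

```python
import numpy as np

rng = np.random.default_rng(0)
x, x_recon = rng.standard_normal(64), rng.standard_normal(64)
z, z_recon = rng.standard_normal(16), rng.standard_normal(16)

# L1 (blue): Manhattan distance between image and its reconstruction.
l1_image = np.abs(x_recon - x).sum()              # ||G(E(x)) - x||_1

# L2 (green): Euclidean distance between latent vector and reconstruction.
l2_latent = np.sqrt(((z_recon - z) ** 2).sum())   # ||E(G(z)) - z||_2

def bce(p, label):
    """Binary cross-entropy for one predicted probability p in (0, 1)."""
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

# The discriminators want p -> 1 on real inputs and p -> 0 on fakes.
print(l1_image, l2_latent, bce(0.9, 1.0), bce(0.1, 0.0))
```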
Training
The AEGAN is trained in the same way as a GAN, alternately updating the generators (G and E) and the discriminators (Dx and Dz). The AEGAN loss function is slightly more complex than the typical GAN loss, however. It consists of four adversarial components (the red and yellow GAN losses from Figure 1, each applied to both the directly-generated values x_hat and z_hat and the autoencoded values x_tilde and z_tilde) and two reconstruction components (the blue L1 and green L2 losses, each weighted by a hyperparameter λ), which, all summed, form the AEGAN loss. E and G try to minimize this loss while Dx and Dz try to maximize it. If you don’t care for the math, the intuition is simple:
- G tries to trick Dx into believing the generated samples x_hat and the autoencoded samples x_tilde are real, while Dx tries to distinguish those from the real samples x .
- E tries to trick Dz into believing the generated samples z_hat and the autoencoded samples z_tilde are real, while Dz tries to distinguish those from the real samples z .
- G and E have to work together so that the autoencoded samples G(E(x))=x_tilde are similar to the original x , and that the autoencoded samples E(G(z))=z_tilde are similar to the original z .
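The bullet points above can be sketched as two objective functions, assuming binary cross-entropy adversarial losses. The probability values and the reconstruction term below are placeholders, and `lam` stands in for the λ reconstruction weights; a real training step would compute these from the networks and backpropagate.

```python
import numpy as np

def bce(p, label):
    """Binary cross-entropy for one predicted probability p in (0, 1)."""
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

def discriminator_loss(d_real, d_hat, d_tilde):
    # Dx (or Dz) wants real inputs scored 1 and both kinds of fakes scored 0.
    return bce(d_real, 1.0) + bce(d_hat, 0.0) + bce(d_tilde, 0.0)

def generator_loss(d_hat, d_tilde, reconstruction, lam=1.0):
    # G and E want the fakes scored 1 (fooling the discriminator), and they
    # share the reconstruction penalty, which only they can reduce.
    return bce(d_hat, 1.0) + bce(d_tilde, 1.0) + lam * reconstruction

# A confident, correct discriminator has a low loss...
print(discriminator_loss(0.95, 0.05, 0.05))
# ...which means a correspondingly high loss for the generator side.
print(generator_loss(0.05, 0.05, reconstruction=2.0))
```

Training alternates between minimizing `generator_loss` with respect to G and E, and minimizing `discriminator_loss` with respect to Dx and Dz, just as in an ordinary GAN.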
Results
To start with, a disclaimer. Due to personal reasons, I’ve only had the time and energy to test this on a single dataset. I’m publishing my work as-is so that others can test out the technique themselves and validate my results or show that this is a dead-end. That said, here’s a sample of the results after 300k training steps:
By itself, figure 2 isn’t all that exciting. If you’re reading a Medium article about GANs, then you’ve probably seen the StyleGAN trained on anime faces that produces way better results. What is exciting is comparing the above results to figure 3:
The GAN used to generate the images in figure 3 and the AEGAN used to generate the images in figure 2 have the exact same architectures for G and for Dx ; the only difference is that the AEGAN was made to learn the reverse function as well. This stabilized the training process. And before you ask, no, this wasn’t a one-off fluke; I repeated the training for both the GAN and the AEGAN five times, and in each case, the AEGAN produced good results and the GAN produced garbage.
An exciting side-effect of the AEGAN technique is that it allows for direct interpolation between real samples. GANs are known for their ability to interpolate between samples: draw two random vectors z1 and z2, interpolate between the vectors, then feed the interpolations to the generator, and boom! With AEGAN, we can interpolate between real samples:
Because the encoder E is able to map a sample x to its corresponding point z in the latent space, the AEGAN allows us to find points z1 and z2 for any samples x1 and x2 and interpolate between them as one would for a typical GAN. Figure 5 illustrates the reconstructions of 50 random samples from the dataset:
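The procedure just described is a few lines of code. The stub G and E below are hypothetical linear stand-ins for the trained models; with real networks, each row of `frames` would be a plausible image along the path from x1 to x2.

```python
import numpy as np

def interpolate(E, G, x1, x2, steps=8):
    """Encode two real samples, then decode points along the line between
    their latent codes."""
    z1, z2 = E(x1), E(x2)
    ts = np.linspace(0.0, 1.0, steps)
    return np.stack([G((1 - t) * z1 + t * z2) for t in ts])

# Hypothetical stand-ins for the trained encoder and generator.
rng = np.random.default_rng(0)
Wg = 0.1 * rng.standard_normal((64, 16))
We = 0.1 * rng.standard_normal((16, 64))
G = lambda z: Wg @ z
E = lambda x: We @ x

x1, x2 = rng.standard_normal(64), rng.standard_normal(64)
frames = interpolate(E, G, x1, x2)
print(frames.shape)  # 8 interpolated "images" of 64 pixels each
```

Note that the endpoints are G(E(x1)) and G(E(x2)), the autoencoded reconstructions, which is why the L1 reconstruction loss matters for interpolation quality.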
Discussion
First, I’d like to address the shortcomings of this experiment. As I said, this was only tested on a single dataset. The structures of the individual networks G , E , Dx , and Dz also weren’t extensively explored and no meaningful hypertuning was performed (on number or shape of layers, λs, etc.). The networks themselves were fairly simplistic; a more thorough and fair experiment would be to apply the AEGAN technique as a wrapper to a more powerful GAN on a more complex dataset such as CelebA .
That said, the AEGAN has a number of desirable theoretical properties which make it ripe for further exploration.
- Forcing the AEGAN to preserve information about the latent vector in the generated image prevents mode collapse by definition. This also allows us to avoid batch-independence-breaking techniques like batch normalization and minibatch discrimination. Incidentally, I was forced to avoid batch normalization in this experiment due to an issue with its implementation in TF.keras 2.0, but that’s a story for another day…
- Learning a bijective function allows for direct interpolation between real samples, without relying on auxiliary networks or invertible layers. It also may allow for better exploration and manipulation of the latent space, possibly by experimenting with different distributions as was done in Adversarial Autoencoders .
- Exposing the generator to real samples directly allows it to spend less time wading about in the abyss of pixel-space.
To that last point, the generators of regular GANs are never directly exposed to the training data, and only learn what the data looks like indirectly through the discriminator’s feedback (hence the nickname “ blind forger ”). By including a reconstruction loss, the generator can beeline towards the low-dimensional manifold in high-dimensional pixel space. Consider figure 6, which shows the AEGAN’s output after only 200 training steps:
Compare this to figure 7, which shows a regular GAN with the same architecture as the AEGAN at the same point in its training:
As you can see, the AEGAN is particularly effective at finding the low-dimensional manifold, although measuring its ability to fit that manifold will require further experimentation.
Further Work
- Apply AEGAN to state-of-the-art techniques like StyleGAN to see if it improves quality and/or rate of convergence.
- Explore the λ hyperparameters to find optimal values; explore curriculum methods, such as gradually decreasing the λs over time.
- Apply conditionality to the training; explore Bernoulli and Multinoulli latent components, as was done in Adversarial Autoencoders .
- Apply AEGAN to a designed image dataset with a known underlying manifold, to measure how effectively the technique can reproduce it.
- Find a way to match the dimensionality of the latent space to the dimensionality of the data-generating function’s manifold (easier said than done!)
Errata
I’d be remiss if I didn’t mention Variational Autoencoder/GANs somewhere, which is an interesting, related technique, so here it is. The data used to train these models is available on Kaggle. You can check out the original paper here. My tf.keras implementation of this network is available on GitHub.