The neural network that can (literally) turn your smile upside down
In this post, I discuss the recent paper ‘Deep Single Image Manipulation’ by Vinker, Horwitz, Zabari & Hoshen (2020), which details how we can use conditional generative adversarial networks (CGANs) to alter or manipulate the features of an image, training on a single target image.
Read this post if:
- you want a simplified understanding of the paper cited above OR
- you are more generally interested in how to manipulate the features of an image using deep learning OR
- you are interested in image manipulation in the context of using only ONE target image during training
It is highly recommended that you read up on CGANs before reading this blog post.
So what is so special about this paper? Anyone familiar with a CGAN knows that approaches like Pix2Pix and BicycleGAN have already achieved interesting image manipulation results. For example, consider the following awesome results from the Pix2Pix paper:
Well, the special thing about this latest approach is that it uses only one image. The authors mention how regular image manipulation techniques require training on a large dataset, and this can be ‘slow and tricky to train.’ Instead, this paper proposes conditioning only on a single target image.
Great, now on to the paper itself! Here are the four main contributions of the paper:
Contribution #1: A general-purpose approach for training conditional generators from a single image
The network in this paper is called a mapping network. We have an input image y, and the network learns to map from a primitive x to the input image y. To better understand the inputs and outputs, consider the following figure from the paper:
On the left, we have the training image pair. We train the mapping network to map from the primitive (above) to the true image (below). On the right, we have the inference inputs and outputs, where we input an altered primitive as the new ‘conditional label’ to obtain a new image.
This aligns with our conception of CGANs, as you can think of this primitive x as the conditional label for the input image y. We train the network on the following two objectives to enable manipulating the image later:
- Minimize the perceptual loss between the true target image and the predicted target image.
- Minimize the adversarial loss. The adversarial loss is the ability of a discriminator to differentiate between the (input, generated image) pair and the (input, true image) pair (a sketch of both losses follows below).
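To make these two objectives concrete, here is a minimal PyTorch-style sketch of the two losses, assuming a generator `G` that maps a primitive to an image and a discriminator `D` that scores (primitive, image) pairs. The function names, the BCE adversarial term, and the `adv_weight` balance are illustrative assumptions rather than the authors' exact implementation:

```python
# Illustrative sketch of the two training objectives (not the authors' exact code).
import torch
import torch.nn.functional as F

def generator_loss(G, D, perceptual, primitive, target, adv_weight=1.0):
    """Perceptual (reconstruction) loss plus an adversarial term for the generator."""
    fake = G(primitive)                                    # predicted target image
    rec = perceptual(fake, target)                         # deep-feature distance
    # The generator tries to make D label the (primitive, generated) pair as real.
    logits_fake = D(torch.cat([primitive, fake], dim=1))
    adv = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
    return rec + adv_weight * adv

def discriminator_loss(G, D, primitive, target):
    """The discriminator separates (primitive, true) pairs from (primitive, generated) pairs."""
    with torch.no_grad():
        fake = G(primitive)
    logits_real = D(torch.cat([primitive, target], dim=1))
    logits_fake = D(torch.cat([primitive, fake], dim=1))
    real = F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
    gen = F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake))
    return real + gen
```

The perceptual term is typically a deep-feature (e.g. VGG-based) distance rather than a plain pixel loss; any such distance can be plugged in as `perceptual` here.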
At this point, we have not diverged much from regular CGAN-based approaches to image manipulation. It is contributions #2 and #3 that allow us to achieve good results while training on only a single image.
Contribution #2: Proposing a TPS-based augmentation for conditional image generation, and demonstrating its importance for single image training
What we mean by augmentation is that we modify the single target image, to get a slightly larger dataset, and we train on this ‘augmented’ dataset.
If we were to simply use the vanilla approach discussed in contribution #1 above, we would overfit on the single image, because that approach only works on large datasets. To prevent overfitting, we need to augment the single target image that we are using.
In the large-dataset formulation of the image manipulation problem, simple "crop" and "shift" augmentations are enough, because the data itself is what makes the model weights generalizable: many training examples update the prior enough to yield weights that generalize.
In the single-image case, however, generalizability must come from the prior itself, so "crop" and "shift" augmentations fall short: the resulting model would not even generalize to a rotation of the target image. Instead, we need a more sophisticated augmentation, one that is highly generalizable and makes up for using only a single target image.
The paper proposes using a thin-plate-spline augmentation:
We model the image as a grid and shift each grid point by a uniformly distributed random distance. This forms the shifted grid t. We use a thin-plate-spline (TPS) to smooth the transformation into a more realistic warp f.
And the thin-plate-spline smoothing optimization is as follows:
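(Stated here in its standard form, which may differ slightly from the paper's exact notation: the first term keeps the warp f close to the shifted grid points t_i, and the second penalizes bending, with λ trading fidelity against smoothness.)

$$
f^{*} \;=\; \arg\min_{f} \; \sum_{i} \big\lVert f(x_i) - t_i \big\rVert^{2} \;+\; \lambda \iint \Big( f_{xx}^{2} + 2 f_{xy}^{2} + f_{yy}^{2} \Big)\, dx\, dy
$$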
You need not worry too much about the details of this augmentation, just remember that it is designed such that we can provide generalizable model weights for image manipulation, even though we only train on a single target image! Pretty cool.
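For intuition about what this warp looks like in practice, here is a minimal sketch of a TPS-style augmentation on a grayscale image using SciPy. The grid size, shift magnitude, and use of `scipy.interpolate.Rbf` are illustrative choices, not the authors' implementation:

```python
# A minimal TPS-style warp sketch (illustrative, not the paper's code).
import numpy as np
from scipy.interpolate import Rbf
from scipy.ndimage import map_coordinates

def tps_augment(image, grid_size=4, max_shift=0.1, seed=None):
    """Warp a 2D (grayscale) image by shifting a coarse grid and smoothing with a TPS."""
    rng = np.random.default_rng(seed)
    h, w = image.shape

    # Coarse control grid over the image.
    ys = np.linspace(0, h - 1, grid_size)
    xs = np.linspace(0, w - 1, grid_size)
    src_y, src_x = np.meshgrid(ys, xs, indexing="ij")

    # Shift each control point by a uniformly distributed random distance (the shifted grid t).
    dst_y = src_y + rng.uniform(-max_shift, max_shift, src_y.shape) * h
    dst_x = src_x + rng.uniform(-max_shift, max_shift, src_x.shape) * w

    # Fit thin-plate-spline maps from warped coordinates back to source coordinates
    # (backward mapping, so every output pixel gets a smooth source location).
    tps_y = Rbf(dst_y.ravel(), dst_x.ravel(), src_y.ravel(), function="thin_plate")
    tps_x = Rbf(dst_y.ravel(), dst_x.ravel(), src_x.ravel(), function="thin_plate")

    out_y, out_x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([tps_y(out_y, out_x), tps_x(out_y, out_x)])
    return map_coordinates(image, coords, order=1, mode="reflect")
```

Applying the same random warp to both the primitive and the target image yields a new, consistent training pair, which is how the single-image setup still sees many different (primitive, image) examples.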
Contribution #3: A novel primitive representation allowing concurrent low and high-level image editing.
The primitive is simply the representation of the input image on which we train our generator. The paper mentions two criteria for an image primitive:
- able to precisely specify the required output image
- ease of manipulation by an image editor
The two objectives are often in conflict with one another, because whilst ‘the most precise representation of the edited image is the edited image itself,’ the manipulation of the image into the edited image is very difficult to achieve manually.
What does this mean? Well, recall that we are using the primitive as the ‘conditional label.’ So during inference, the label that we provide must be editable and easily specified by a human being — the actual target image is too complicated to be edited/specified by a human being so we cannot use it as a primitive!
To achieve a tradeoff between these two goals, the authors use a ‘super primitive’ that combines segmentation maps (easy to manipulate) and edge maps (a more precise specification of the image).
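As a rough illustration of how such a combined primitive could be represented, here is a small sketch that stacks a one-hot segmentation map with an edge channel into a single conditioning tensor. The channel layout and function name are assumptions for illustration, not necessarily the exact format used in the paper:

```python
# Illustrative sketch: build a combined "segmentation + edges" primitive tensor.
import torch
import torch.nn.functional as F

def build_primitive(seg_map, edge_map, num_classes):
    """seg_map: (H, W) integer class indices; edge_map: (H, W) binary edge mask."""
    seg_onehot = F.one_hot(seg_map.long(), num_classes)   # (H, W, C)
    seg_onehot = seg_onehot.permute(2, 0, 1).float()      # (C, H, W)
    edges = edge_map.unsqueeze(0).float()                 # (1, H, W)
    return torch.cat([seg_onehot, edges], dim=0)          # (C + 1, H, W)
```

At inference time, an editor would modify the segmentation regions for coarse changes, redraw edges for fine detail, and then pass the updated primitive through the generator.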
Contribution #4: Extensive evaluations showing remarkable visual performance, and the introduction of a novel protocol enabling quantitative evaluation.
This last contribution is mainly the quality of results achieved by this new image manipulation method!
Result #1
This result pertains to the primary task of image manipulation. On the left, we have the training image pair, i.e. the true image y and its primitive x. Remember that this is what we used during training in contribution #1 earlier (including augmentations, of course)!
On the right, we have the inputs and outputs during inference. The input is in the form of the editable primitive that we mentioned earlier — this functions as the conditional label. The model successfully produces the edited outputs.
Besides providing more examples of their model’s success on this main task, the authors also compare their model to existing image manipulation models such as Pix2Pix and BicycleGAN:
Result #2
As we can see, the authors’ approach produces the best reconstruction of the shoe. Pix2Pix does not incorporate style information, and this paper’s approach even exceeds the performance of BicycleGAN, which does provide some style information.
Finally, the authors did introduce a new method for quantitatively evaluating the quality of their outputs, but I don’t want to go into too much detail on that because it is tangential to the main contributions of the paper. If you are interested, do refer to the main paper!
I really hope that this blog post has been helpful in understanding this new approach to deep single image manipulation. Do let me know if anything is confusing or requires correction!
Do check out the authors’ PyTorch implementation of their model here.
References
Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1125–1134).
Vinker, Y., Horwitz, E., Zabari, N., & Hoshen, Y. (2020). Deep Single Image Manipulation. arXiv preprint arXiv:2007.01289.