The neural network that can (literally) turn your smile upside down
Jul 16 · 6 min read
In this post, I discuss the recent paper ‘Deep Single Image Manipulation’ by Vinker, Horwitz, Zabari & Hoshen (2020), which details how we can use conditional generative adversarial networks (CGANs) to alter or manipulate the features of an image — while training on only a single target image.
Read this post if:
- you want a simplified understanding of the paper cited above OR
- you are more generally interested in how to manipulate the features of an image using deep learning OR
- you are interested in image manipulation in the context of using only ONE target image during training
It is highly recommended that you read up on CGANs before reading this blog post.
So what is so special about this paper? Anyone familiar with a CGAN knows that approaches like Pix2Pix and BicycleGAN have already achieved interesting image manipulation results. For example, consider the following awesome results from the Pix2Pix paper:
Well, the special thing about this latest approach is that it uses only one image. The authors mention how regular image manipulation techniques require training on a large dataset, and this can be ‘slow and tricky to train.’ Instead, this paper proposes conditioning only on a single target image.
Great, now on to the paper itself! Here are the four main contributions of the paper:
Contribution #1: A general-purpose approach for training conditional generators from a single image
The network in this paper is called a mapping network. We have an input image y, and we learn a mapping from a primitive x to that image y. To better understand the inputs and outputs, consider the following figure from the paper:
On the left, we have the training image pair. We train the mapping network to map from the primitive (above) to the true image (below). On the right, we have the inference inputs and outputs, where we input an altered primitive as the new ‘conditional label’ to obtain a new image.
This aligns with our conception of CGANs, as you can think of this primitive x as the conditional label for the input image y. We train the network on the following two objectives to enable manipulating the image later:
- Minimize the perceptual loss between the true target image and the predicted target image.
- Minimize the adversarial loss. The adversarial loss measures how well a discriminator can differentiate between the (input, generated image) pair and the (input, true image) pair — a rough sketch of both objectives follows this list.
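To make these two objectives concrete, here is a minimal PyTorch sketch. This is not the authors' implementation: the tiny networks, the stand-in feature extractor, and the loss weight are placeholders chosen for illustration (the paper uses a VGG-based perceptual loss and a proper conditional generator/discriminator pair).

```python
# Minimal sketch of the two training objectives: a perceptual loss between
# the generated and true image, plus a conditional adversarial loss on
# (primitive, image) pairs. All networks here are toy placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyGenerator(nn.Module):
    """Maps a primitive (e.g. a 4-channel seg+edge map) to a 3-channel image."""
    def __init__(self, in_ch=4, out_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_ch, 3, padding=1), nn.Tanh(),
        )

    def forward(self, primitive):
        return self.net(primitive)


class TinyDiscriminator(nn.Module):
    """Scores a (primitive, image) pair: real pairs high, generated pairs low."""
    def __init__(self, in_ch=4 + 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, primitive, image):
        return self.net(torch.cat([primitive, image], dim=1))


def perceptual_loss(features, fake, real):
    """L1 distance between feature activations (stand-in for VGG features)."""
    return F.l1_loss(features(fake), features(real))


# Placeholder feature extractor; the paper uses pretrained VGG features.
features = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
G, D = TinyGenerator(), TinyDiscriminator()

primitive = torch.randn(1, 4, 64, 64)   # conditional label x
real_image = torch.randn(1, 3, 64, 64)  # single target image y

fake_image = G(primitive)

# Generator objective: match perceptual features + fool D on (primitive, fake).
g_adv = F.binary_cross_entropy_with_logits(
    D(primitive, fake_image), torch.ones_like(D(primitive, fake_image)))
g_loss = perceptual_loss(features, fake_image, real_image) + 0.01 * g_adv  # weight is arbitrary

# Discriminator objective: separate (primitive, real) from (primitive, fake).
d_real = F.binary_cross_entropy_with_logits(
    D(primitive, real_image), torch.ones_like(D(primitive, real_image)))
d_fake = F.binary_cross_entropy_with_logits(
    D(primitive, fake_image.detach()),
    torch.zeros_like(D(primitive, fake_image.detach())))
d_loss = d_real + d_fake
```

The key point is that the discriminator always sees the primitive alongside the image, so ‘real vs. fake’ is judged for the pair, exactly as in other CGAN setups.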
At this point, we do not diverge much from regular CGAN-based approaches to image manipulation. It is by introducing contribution #2 and contribution #3 that we are able to achieve good results by training only on a single image.
Contribution #2: Proposing a TPS-based augmentation for conditional image generation, and demonstrating its importance for single image training
What we mean by augmentation is that we modify the single target image to get a slightly larger dataset, and we train on this ‘augmented’ dataset.
If we were to simply use the vanilla approach discussed in contribution #1 above, then we would overfit on the single image — the vanilla approach only works on large datasets. To prevent overfitting, we need to augment the single target image that we are using.
In the large dataset formulation of the image manipulation problem, we can use simple “crop” and “shift” augmentations, because the data is what enables the generalizability of the model weights — we use the data to properly update the prior and still obtain generalizable model weights.
However, in this single image case, generalizability is provided by the prior itself — which means ‘crop’ and ‘shift’ augmentations will not even generalize to a rotation of the target image — this is terrible. Instead, we need a more sophisticated augmentation that is highly generalizable and makes up for using only a single target image.
The paper proposes using a thin-plate-spline augmentation:
We model the image as a grid and shift each grid point by a uniformly distributed random distance. This forms the shifted grid t. We use a thin-plate-spline (TPS) to smooth the transformation into a more realistic warp f.
And the thin-plate-spline smoothing optimization is as follows:
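The exact equation is best read from the paper itself, but the standard thin-plate-spline smoothing objective has the following form, where the x_i are the original grid points, the t_i are their randomly shifted positions, and λ trades off fitting the shifts against the bending energy of the warp f:

```latex
\min_{f} \; \sum_{i} \left\lVert t_i - f(x_i) \right\rVert^2
\;+\; \lambda \iint \left[
\left( \frac{\partial^2 f}{\partial x^2} \right)^2
+ 2 \left( \frac{\partial^2 f}{\partial x \, \partial y} \right)^2
+ \left( \frac{\partial^2 f}{\partial y^2} \right)^2
\right] dx \, dy
```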
You need not worry too much about the details of this augmentation; just remember that it is designed so that the model can learn generalizable weights for image manipulation, even though we only train on a single target image! Pretty cool. A rough sketch of such an augmentation follows.
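For intuition, here is a rough NumPy/SciPy sketch of a TPS-style warp augmentation — my own illustration, not the authors' code. The grid size, shift magnitude, and smoothing weight are arbitrary choices, and SciPy's RBFInterpolator with a thin-plate-spline kernel stands in for the paper's warp.

```python
# Shift a coarse control grid by random offsets, smooth the shifts with a
# thin-plate-spline interpolator, then resample the image on the smoothed warp.
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.ndimage import map_coordinates


def tps_augment(image, grid_points=4, max_shift=8.0, smoothing=0.1, seed=None):
    """Return a randomly TPS-warped copy of a (H, W) or (H, W, C) image."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]

    # Coarse control grid over the image.
    ys = np.linspace(0, h - 1, grid_points)
    xs = np.linspace(0, w - 1, grid_points)
    grid = np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1).reshape(-1, 2)

    # Shift every control point by a uniformly distributed random distance.
    shifted = grid + rng.uniform(-max_shift, max_shift, size=grid.shape)

    # Thin-plate-spline smoothing: a smooth map from original to shifted points.
    tps = RBFInterpolator(grid, shifted, kernel="thin_plate_spline",
                          smoothing=smoothing)

    # Evaluate the smooth warp at every pixel to get source sampling coordinates.
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pixels = np.stack([yy.ravel(), xx.ravel()], axis=-1).astype(float)
    coords = tps(pixels).T.reshape(2, h, w)

    # Resample each channel at the warped coordinates.
    if image.ndim == 2:
        return map_coordinates(image, coords, order=1, mode="nearest")
    channels = [map_coordinates(image[..., c], coords, order=1, mode="nearest")
                for c in range(image.shape[-1])]
    return np.stack(channels, axis=-1)


# Example: warp a random array standing in for the single training image.
warped = tps_augment(np.random.rand(128, 128, 3), seed=0)
```

Applying the same warp jointly to the training image and its primitive yields a slightly different (primitive, image) pair on every iteration, which is what stops the mapping network from simply memorizing the single example.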
Contribution #3: A novel primitive representation allowing concurrent low and high-level image editing.
The primitive is simply the representation of the input image on which we train our generator. The paper mentions two criteria for an image primitive:
- it must be able to precisely specify the required output image
- it must be easy for an image editor to manipulate
The two objectives are often in conflict with one another: whilst ‘the most precise representation of the edited image is the edited image itself,’ manually manipulating the image into the edited image is very difficult.
What does this mean? Well, recall that we are using the primitive as the ‘conditional label.’ So during inference, the label that we provide must be editable and easily specified by a human being — the actual target image is too complicated to be edited/specified by a human being, so we cannot use it as a primitive!
To achieve a tradeoff between these two goals, the authors use a ‘super primitive’ which combines segmentation maps (easy to manipulate) and edge maps (a more precise specification of the image), roughly as sketched below.
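To make the idea tangible, here is a toy sketch of one plausible way to build such a combined primitive — stacking segmentation channels and an edge channel into a single conditioning input. This is my own illustration, not the paper's exact construction, and the gradient-threshold edge detector is just a stand-in.

```python
# Toy "super primitive": concatenate a one-hot segmentation map with an
# edge map so the conditioning input carries both coarse and fine structure.
import numpy as np


def make_super_primitive(image, seg_map, edge_threshold=0.2):
    """image: (H, W, 3) float in [0, 1]; seg_map: (H, W) integer class labels."""
    gray = image.mean(axis=-1)

    # Crude edge map: gradient magnitude above a threshold.
    gy, gx = np.gradient(gray)
    edges = (np.hypot(gx, gy) > edge_threshold).astype(np.float32)

    # One-hot encode the segmentation classes.
    n_classes = int(seg_map.max()) + 1
    seg_onehot = np.eye(n_classes, dtype=np.float32)[seg_map]

    # Primitive = segmentation channels + edge channel, i.e. (H, W, n_classes + 1).
    return np.concatenate([seg_onehot, edges[..., None]], axis=-1)


# Example usage with random data standing in for a real image and mask.
image = np.random.rand(64, 64, 3)
seg_map = (np.random.rand(64, 64) > 0.5).astype(int)   # two classes
primitive = make_super_primitive(image, seg_map)
print(primitive.shape)  # (64, 64, 3): 2 segmentation channels + 1 edge channel
```

At inference time, a user edits the segmentation channels for coarse, object-level changes and the edge channel for fine detail, then feeds the edited primitive to the generator as the new conditional label.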
Contribution #4: Extensive evaluations showing remarkable visual performance, and the introduction of a novel protocol enabling quantitative evaluation.
This last contribution mainly concerns the quality of results achieved by this new image manipulation method!
Result #1
This result pertains to the primary task of image manipulation. On the left we have the training image pair, i.e. the image y and the primitive x. Remember that this is what we used during training in contribution #1 earlier — including augmentations of course!
On the right, we have the inputs and outputs during inference. The input is in the form of the editable primitive that we mentioned earlier — this functions as the conditional label. The model successfully produces the edited outputs.
Besides providing more examples of their model’s success on this main task, the authors also compare their model to existing image manipulation models like Pix2Pix and BicycleGAN:
Result #2
As we can see, the authors’ approach produces the best reconstruction of the shoe. Pix2Pix does not incorporate style information, and this paper’s approach exceeds the performance of BicycleGAN, which does incorporate some style information.
Finally, the authors did introduce a new method for quantitatively evaluating the quality of their outputs, but I don’t want to go into too much detail on that because it is tangential to the main contributions of the paper. If you are interested, do refer to the main paper!
I really hope that this blog post has been helpful in understanding this new approach to deep single image manipulation. Do let me know if anything is confusing or requires correction!
Do check out the authors’ PyTorch implementation of their model here.
References
Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1125–1134).
Vinker, Y., Horwitz, E., Zabari, N., & Hoshen, Y. (2020). Deep Single Image Manipulation. arXiv preprint arXiv:2007.01289.