Using deep learning and GANs to enable professional quality background replacement from your own home
Do you wish that you could make professional quality videos without a full studio? Or that Zoom’s virtual background function worked better during your video conferences?
Our recently published paper [1] in CVPR 2020 provides a new and easy method to replace your background for a wide variety of applications. You can do this at home in everyday settings, using a fixed or handheld camera. Our method is also state-of-the-art and gives outputs comparable to professional results. In this article we walk through the motivation, technical details, and usage tips for our method.
You can also check out our project page and codebase.
What is Matting?
Matting is the process of separating an image into foreground and background so you can composite the foreground onto a new background. This is the key technique behind the green screen effect, and it is widely used in video production, graphics, and consumer apps. To model this problem, we represent every pixel in the captured image as a combination of foreground and background:
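$$C = \alpha F + (1 - \alpha)\, B$$

Here C, F, and B are RGB colors and α ∈ [0, 1] is the per-pixel transparency; this is the standard compositing equation used throughout the matting literature.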
Our problem is to solve for the foreground (F), background (B), and transparency (alpha) of every pixel given the captured image (C). This is highly underdetermined: since the image has three RGB channels, we must recover 7 unknowns from only 3 observed values at each pixel.
The Problem with Segmentation
One possible approach is to use segmentation to separate foreground for compositing. Although segmentation has made huge strides in recent years, it does not solve the full matting equation. Segmentation assigns a binary (0,1) label to each pixel in order to represent foreground and background instead of solving for a continuous alpha value. The effects of this simplification are visible in the following example:
The areas around the edge, particularly in the hair, have a true alpha value between 0 and 1. Therefore, the binary nature of segmentation creates a harsh boundary around the foreground, leaving visible artifacts. Solving for the partial transparency and foreground color allows much better compositing in the second frame.
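To make the difference concrete, here is a minimal NumPy sketch (the array names and shapes are assumptions, not code from our repository) of the two compositing strategies: a hard binary mask versus a continuous alpha matte.

```python
import numpy as np

def composite_binary(image, mask, new_background):
    """Segmentation-style compositing: every pixel is either fully
    foreground or fully background, so hair and edges look cut out."""
    hard = (mask > 0.5).astype(np.float32)[..., None]   # hard 0/1 labels
    return hard * image + (1.0 - hard) * new_background

def composite_alpha(foreground, alpha, new_background):
    """Matting-style compositing: alpha in [0, 1] blends the recovered
    foreground color with the new background, preserving soft edges."""
    a = alpha[..., None]
    return a * foreground + (1.0 - a) * new_background
```

Both functions assume H×W×3 float images in [0, 1] and an H×W mask or matte; the binary version is what segmentation gives you, the alpha version is what matting makes possible.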
Using A Casually Captured Background
Because matting is a harder problem than segmentation, additional information is often used to constrain this underdetermined problem, even when using deep learning.
Many existing methods [3][4][5] use a trimap: a hand-annotated map of known foreground, known background, and unknown regions. Although this is feasible for a single image, hand-annotating every frame of a video is extremely time consuming and is not a practical direction for this problem.
We choose instead to use a captured background as an estimate of the true background. This makes it easier to solve for the foreground and alpha value. We call it a “casually captured” background because it can contain slight movements, color differences, slight shadows, or similar colors as the foreground.
The figure above shows how we can easily provide a rough estimate of the true background. As the person leaves the scene, we capture the background behind them. The figure below shows what this looks like:
Notice how this image is challenging because it has a very similar background and foreground color (particularly around the hair). It was also recorded with a handheld phone and contains slight background movements.
Tips for Capturing
Although our method handles some background perturbation, it works best when the background is constant, ideally indoors. For example, it fails in the presence of highly noticeable shadows cast by the subject, moving backgrounds (e.g. water, cars, trees), or large exposure changes.
We also recommend capturing the background by having the person step out of the scene at the end of the video and pulling that frame from the continuous recording, rather than switching to photo mode: many phones change their zoom and exposure settings between video and photo modes. You should also enable the auto-exposure lock when filming with a phone.
A summary of the capture tips:
- Choose the most constant background you can find.
- Don’t stand too close to the background so you don’t cast a shadow.
- Enable auto-exposure and auto-focus locks on the phone.
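As a concrete example of pulling the background frame from the end of the clip, here is a short OpenCV sketch (file names are hypothetical; it simply keeps the last readable frame, assuming the subject has stepped out of view by then).

```python
import cv2

def extract_background_frame(video_path, out_path="background.png"):
    """Read through the clip and save the last frame as the
    casually captured background estimate."""
    cap = cv2.VideoCapture(video_path)
    last = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        last = frame
    cap.release()
    if last is None:
        raise ValueError(f"No frames could be read from {video_path}")
    cv2.imwrite(out_path, last)
    return last

# background = extract_background_frame("my_clip.mp4")
```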
Is This Method Like Background Subtraction?
Another natural question is whether this is just background subtraction. First, if compositing against an arbitrary background were easy, the movie industry would not have spent thousands of dollars on green screens all these years.
In addition, background subtraction does not solve for partial alpha values, giving the same hard edge as segmentation. It also does not work well when there is a similar foreground and background color or any motions in the background.
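For comparison, here is what a naive background-subtraction baseline looks like (a sketch with assumed H×W×3 float arrays, not our method): it thresholds the per-pixel difference, so it produces only a hard, noisy mask and has no notion of partial transparency or foreground color.

```python
import numpy as np

def naive_background_subtraction(image, captured_background, threshold=0.1):
    """Label a pixel foreground if it differs enough from the captured
    background. Fails on similar colors and yields only a binary mask."""
    diff = np.abs(image - captured_background).mean(axis=-1)
    return (diff > threshold).astype(np.float32)
```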
Network Details
The network consists of a supervised step followed by an unsupervised refinement. We’ll briefly summarize them here, but for full details you can always check out the paper.
Supervised Learning
In order to first train the network, we use the Adobe Composition-1k dataset, which contains 450 carefully annotated ground truth alpha mattes. We train the network in a fully supervised way, with a per pixel loss on the output.
Notice that we take several inputs, including the image, background, soft segmentation, and temporal motion information. Our novel Context Switching Block also ensures robustness to poor inputs.
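As a rough illustration of the per-pixel objective, here is a minimal PyTorch sketch; the function names and the exact weighting are assumptions for illustration, not the paper's actual training code.

```python
import torch.nn.functional as F

def supervised_matting_loss(pred_alpha, pred_fg, gt_alpha, gt_fg):
    """Simplified per-pixel loss for the supervised stage: L1 on the
    alpha matte plus L1 on the foreground color inside the matte."""
    alpha_loss = F.l1_loss(pred_alpha, gt_alpha)
    fg_loss = F.l1_loss(pred_fg * gt_alpha, gt_fg * gt_alpha)
    return alpha_loss + fg_loss
```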
Unsupervised Refinement with GANs
The problem with supervised learning is that the Adobe dataset contains only 450 ground truth mattes, which is not nearly enough to train a good network. Obtaining more data is extremely difficult because it involves hand-annotating the alpha matte of an image.
To solve this problem, we use a GAN refinement step. We take the output alpha matte from the supervised network and composite it on a new background. The discriminator then tries to tell if it is a real or fake image. In response, the generator learns to update the alpha matte so the resulting composite is as real as possible in order to fool the discriminator.
The important part here is that we don’t need any labelled training data. The discriminator was trained with thousands of real images, which are very easy to obtain.
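The sketch below shows one generator update in this spirit. It is a minimal illustration under assumed interfaces (the `generator(image, background)` signature, optimizer, and loss choice are hypothetical); the real pipeline also conditions on segmentation and motion cues and trains the discriminator on real photographs.

```python
import torch

def adversarial_refinement_step(generator, discriminator, g_opt,
                                image, background, new_background):
    """Composite the predicted matte onto a fresh background and update
    the generator so the composite better fools the discriminator."""
    alpha, fg = generator(image, background)              # hypothetical interface
    composite = alpha * fg + (1 - alpha) * new_background
    fake_score = discriminator(composite)
    # Non-saturating GAN loss: push fake composites toward the "real" label
    g_loss = torch.nn.functional.binary_cross_entropy_with_logits(
        fake_score, torch.ones_like(fake_score))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return g_loss.item()
```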
Training the GAN on Your Data
What’s also useful about the GAN is that you can train the generator on your own images to improve results at test time. Suppose you run the network and the output is not very good. You can update the weights of the generator on that exact data in order to better fool the discriminator. This will overfit to your data, but will improve the results on the images you provided.
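In practice this is a short fine-tuning loop over your own frames, roughly along these lines (reusing the hypothetical `adversarial_refinement_step` sketched above; the iteration count and data handling are assumptions):

```python
def finetune_on_clip(generator, discriminator, g_opt, frames, backgrounds,
                     new_background, num_iters=100):
    """Overfit the generator to a single clip: repeatedly update it so its
    composites on these exact frames better fool the frozen discriminator."""
    for i in range(num_iters):
        image = frames[i % len(frames)]
        background = backgrounds[i % len(backgrounds)]
        adversarial_refinement_step(generator, discriminator, g_opt,
                                    image, background, new_background)
```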
Future Work
Although the results we see are quite good, we are continuing to make this method more accurate and easy to use.
In particular, we would like to make this method more robust to circumstances like background motion, camera movement, and shadows. We are also looking at ways to make this method run in real time and with less computational power. This could enable a wide variety of use cases in areas like video streaming and mobile apps.
If you have any questions, feel free to reach out to me, Vivek Jayaram, or Soumyadip Sengupta.