Neural and Evolutionary Computing
DLGA-PDE: Discovery of PDEs with Incomplete Candidate Library via Combination of Deep Learning and Genetic Algorithm
Hao Xu, Haibin Chang, Dongxiao Zhang. Subjects: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
Data-driven methods have recently been developed to discover underlying
partial differential equations (PDEs) of physical problems. However, these
methods usually require a complete candidate library of potential terms in a
PDE. To overcome this limitation, we propose a novel framework combining deep
learning and a genetic algorithm, called DLGA-PDE, for discovering PDEs. In
the proposed framework, a deep neural network that is trained with available
data of a physical problem is utilized to generate meta-data and calculate
derivatives, and the genetic algorithm is then employed to discover the
underlying PDE. Owing to the merits of the genetic algorithm, such as mutation
and crossover, DLGA-PDE can work with an incomplete candidate library. The
proposed DLGA-PDE is tested on the discovery of the Korteweg-de Vries (KdV)
equation, the Burgers equation, the wave equation, and the Chaffee-Infante
equation as a proof of concept. Satisfactory results are obtained
without the need for a complete candidate library, even in the presence of
noisy and limited data.
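
The genetic-algorithm half of such a pipeline is easy to isolate. Below is a minimal sketch, assuming the time derivative u_t and the candidate term columns have already been evaluated (e.g., by differentiating a trained network); the library, coefficients, and GA settings are illustrative, not the paper's.

```python
# Sketch: genetic-algorithm selection of PDE terms from a candidate library.
# Assumes u_t and the term columns were precomputed, e.g. via automatic
# differentiation of a trained network. All values here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Hypothetical precomputed columns: [u, u_x, u_xx, u*u_x, u^2]
theta = rng.standard_normal((n, 5))
u_t = -1.0 * theta[:, 3] + 0.1 * theta[:, 2]     # "true" PDE: u_t = -u u_x + 0.1 u_xx

def fitness(mask):
    """Least-squares residual of u_t on the selected terms, plus a size penalty."""
    if not mask.any():
        return np.inf
    coef, *_ = np.linalg.lstsq(theta[:, mask], u_t, rcond=None)
    resid = u_t - theta[:, mask] @ coef
    return np.mean(resid**2) + 1e-3 * mask.sum()  # parsimony pressure

pop = rng.integers(0, 2, size=(40, 5)).astype(bool)
for gen in range(100):
    scores = np.array([fitness(m) for m in pop])
    pop = pop[np.argsort(scores)][:20]            # truncation selection
    children = []
    for _ in range(20):
        a, b = pop[rng.integers(20)], pop[rng.integers(20)]
        cut = rng.integers(1, 5)
        child = np.concatenate([a[:cut], b[cut:]])  # one-point crossover
        children.append(child ^ (rng.random(5) < 0.1))  # bit-flip mutation
    pop = np.vstack([pop, children])

best = pop[np.argmin([fitness(m) for m in pop])]
print("selected terms:", best)                    # expect terms 2 and 3
```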
Multi-factorial Optimization for Large-scale Virtual Machine Placement in Cloud Computing
Zhengping Liang, Jian Zhang, Liang Feng, Zexuan Zhu. Subjects: Neural and Evolutionary Computing (cs.NE)
The placement scheme of virtual machines (VMs) to physical servers (PSs) is
crucial to lowering the operational cost of cloud providers. Evolutionary
algorithms (EAs) have shown promise in solving virtual machine placement (VMP)
problems in the past. However, as the demand for cloud services grows, existing
EAs fail to scale to the large-scale virtual machine placement (LVMP) problem
due to their high time complexity and poor scalability.
Recently, the multi-factorial optimization (MFO) technology has surfaced as a
new search paradigm in evolutionary computing. It offers the ability to evolve
multiple optimization tasks simultaneously during the evolutionary process.
This paper aims to apply the MFO technology to the LVMP problem in a
heterogeneous environment. Firstly, we formulate a deployment-cost-based VMP
problem in the form of an MFO problem. Then, a multi-factorial evolutionary
algorithm (MFEA) embedded with a greedy-based allocation operator is developed
to address the established MFO problem. After that, a re-migration and merge
operator is designed to derive an integrated solution to the LVMP problem from
the solutions of the MFO problem. To assess the effectiveness of the proposed
method, simulation experiments are carried out on large-scale and
extra-large-scale VM test data sets. The results show that, compared with
various heuristic methods, our method shortens optimization time significantly
and offers a competitive placement solution for the LVMP problem in a
heterogeneous environment.
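
The greedy-based allocation operator is the component easiest to illustrate in isolation. The sketch below decodes a list of VM demands into a placement with a first-fit-decreasing heuristic; the two-resource model, capacities, and names are assumptions, not the paper's exact operator.

```python
# Sketch: a greedy first-fit-decreasing allocation for VM placement, the kind
# of operator an MFEA can embed to decode VM demands into a feasible plan.
# Resource model (CPU, memory) and capacities are assumed for illustration.
from typing import List, Tuple

def greedy_place(vms: List[Tuple[float, float]],
                 ps_capacity: Tuple[float, float]) -> List[List[int]]:
    """Place VMs (cpu, mem demands) onto identical PSs, opening servers lazily."""
    order = sorted(range(len(vms)), key=lambda i: max(vms[i]), reverse=True)
    servers: List[List[int]] = []   # indices of VMs per physical server
    free: List[List[float]] = []    # remaining (cpu, mem) per server
    for i in order:
        cpu, mem = vms[i]
        for s, (fc, fm) in enumerate(free):
            if cpu <= fc and mem <= fm:          # first server that fits
                servers[s].append(i)
                free[s] = [fc - cpu, fm - mem]
                break
        else:                                     # open a new server
            servers.append([i])
            free.append([ps_capacity[0] - cpu, ps_capacity[1] - mem])
    return servers

plan = greedy_place([(0.5, 0.3), (0.4, 0.6), (0.2, 0.2), (0.7, 0.5)], (1.0, 1.0))
print(len(plan), "servers used:", plan)
```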
Population-based metaheuristics for Association Rule Text Mining
Iztok Fister Jr., Suash Deb, Iztok Fister. Subjects: Neural and Evolutionary Computing (cs.NE)
Nowadays, the majority of data on the Internet is held in an unstructured
format, like websites and e-mails. The importance of analyzing these data has
been growing day by day. Similar to data mining on structured data, text mining
methods for handling unstructured data have also received increasing attention
from the research community. The paper deals with the problem of Association
Rule Text Mining. To solve this problem, the PSO-ARTM method is proposed,
consisting of three steps: text preprocessing, Association Rule Text Mining
using population-based metaheuristics, and text postprocessing. The method was
applied to a transaction database obtained from professional triathlon
athletes’ blogs and news posted on their websites. The obtained results reveal
that the proposed method is suitable for Association Rule Text Mining and,
therefore, offers a promising way for further development.
EdgeNets: Edge Varying Graph Neural Networks
Elvin Isufi, Fernando Gama, Alejandro Ribeiro. Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Signal Processing (eess.SP); Machine Learning (stat.ML)
Driven by the outstanding performance of neural networks in the structured
Euclidean domain, recent years have seen a surge of interest in developing
neural networks for graphs and data supported on graphs. The graph is leveraged
as a parameterization to capture detail at the node level with a reduced number
of parameters and complexity. Following this rationale, this paper puts forth a
general framework that unifies state-of-the-art graph neural networks (GNNs)
through the concept of EdgeNet. An EdgeNet is a GNN architecture that allows
different nodes to use different parameters to weigh the information of
different neighbors. By extrapolating this strategy to more iterations between
neighboring nodes, the EdgeNet learns edge- and neighbor-dependent weights to
capture local detail. This is the most general local operation that a node can
do and encompasses under one formulation all graph convolutional neural
networks (GCNNs) as well as graph attention networks (GATs). In writing
different GNN architectures with a common language, EdgeNets highlight specific
architecture advantages and limitations, while providing guidelines to improve
their capacity without compromising their local implementation. For instance,
we show that GCNNs have a parameter sharing structure that induces permutation
equivariance. This can be an advantage or a limitation, depending on the
application. When it is a limitation, we propose hybrid approaches and provide
insights to develop several other solutions that promote parameter sharing
without enforcing permutation equivariance. Another interesting conclusion is
the unification of GCNNs and GATs, approaches that have so far been perceived
as separate. In particular, we show that GATs are GCNNs on a graph that is
learned from the features. This particularization opens the doors to develop
alternative attention mechanisms for improving discriminatory power.
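
To make the edge-varying idea concrete, here is a minimal numpy sketch of the core EdgeNet operation, with random weights standing in for learned ones; sizes and initialization are illustrative.

```python
# Sketch: an edge-varying graph filter, the core EdgeNet operation. Each of the
# K hops applies its own weight per edge (support masked by A + I), so every
# node weighs every neighbor differently. Shapes and init are illustrative.
import numpy as np

rng = np.random.default_rng(1)
N, K = 6, 3
A = (rng.random((N, N)) < 0.4).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 1.0)                        # symmetric support + self-loops

x = rng.standard_normal(N)
Phi = [rng.standard_normal((N, N)) * A for _ in range(K)]  # per-hop edge weights

y, z = np.zeros(N), x.copy()
for k in range(K):
    z = Phi[k] @ z   # k-hop neighborhood, with edge- and hop-dependent weights
    y += z           # aggregate multi-hop contributions
# A GCNN is the special case Phi[k] = h_k * S for a shared graph shift S,
# which is the parameter sharing that induces its permutation equivariance.
print(y)
```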
Ensemble Genetic Programming
Comments: EuroGP 2020 submission
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Ensemble learning is a powerful paradigm that has been used in the top
state-of-the-art machine learning methods like Random Forests and XGBoost.
Inspired by the success of such methods, we have developed a new Genetic
Programming method called Ensemble GP. The evolutionary cycle of Ensemble GP
follows the same steps as other Genetic Programming systems, but with
differences in the population structure, fitness evaluation and genetic
operators. We have tested this method on eight binary classification problems,
achieving results significantly better than standard GP, with much smaller
models. Although other methods like M3GP and XGBoost were the best overall,
Ensemble GP was able to achieve exceptionally good generalization results on a
particularly hard problem where none of the other methods was able to succeed.
An Efficient Framework for Automated Screening of Clinically Significant Macular Edema
Renoh Johnson Chalakkal, Faizal Hafiz, Waleed Abdulla, Akshya Swain. Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
The present study proposes a new approach to automated screening of
Clinically Significant Macular Edema (CSME) and addresses two major challenges
associated with such screenings, i.e., exudate segmentation and imbalanced
datasets. The proposed approach replaces the conventional exudate segmentation
based feature extraction by combining a pre-trained deep neural network with
meta-heuristic feature selection. A feature-space over-sampling technique is
used to overcome the effects of skewed datasets, and the screening is
accomplished by a k-NN based classifier. The role of each data-processing step
(e.g., class balancing, feature selection) and the effects of limiting the
region-of-interest to fovea on the classification performance are critically
analyzed. Finally, the selection and implications of the operating point on the
Receiver Operating Characteristic curve are discussed. The results of this study
convincingly demonstrate that by following these fundamental practices of
machine learning, a basic k-NN based classifier could effectively accomplish
the CSME screening.
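
A hedged sketch of these data-processing steps, with random features standing in for the pre-trained CNN embedding and naive jitter-based duplication standing in for the paper's over-sampling technique; all names and values here are assumptions.

```python
# Sketch of the screening pipeline's data-processing steps: deep features from
# a pre-trained CNN (stubbed with random features), minority over-sampling in
# feature space, and a k-NN classifier. Feature selection is omitted.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 128))          # stand-in for pre-trained CNN features
y = (rng.random(300) < 0.15).astype(int)     # imbalanced labels: few CSME positives

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Feature-space over-sampling: duplicate minority samples with small jitter.
minority = np.flatnonzero(y_tr == 1)
need = (y_tr == 0).sum() - minority.size
extra = rng.choice(minority, size=need, replace=True)
X_bal = np.vstack([X_tr, X_tr[extra] + 0.01 * rng.standard_normal((need, 128))])
y_bal = np.concatenate([y_tr, np.ones(need, dtype=int)])

clf = KNeighborsClassifier(n_neighbors=5).fit(X_bal, y_bal)
print("test accuracy:", clf.score(X_te, y_te))
```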
Memory capacity of neural networks with threshold and ReLU activations
Comments: 25 pages
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Overwhelming theoretical and empirical evidence shows that mildly
overparametrized neural networks — those with more connections than the size
of the training data — are often able to memorize the training data with
100% accuracy. This was rigorously proved for networks with sigmoid
activation functions and, very recently, for ReLU activations. Addressing a
1988 open question of Baum, we prove that this phenomenon holds for general
multilayered perceptrons, i.e. neural networks with threshold activation
functions, or with any mix of threshold and ReLU activations. Our construction
is probabilistic and exploits sparsity.
Computer Vision and Pattern Recognition
SAUNet: Shape Attentive U-Net for Interpretable Medical Image Segmentation
Jesse Sun, Fatemeh Darbeha, Mark Zaidi, Bo Wang. Subjects: Computer Vision and Pattern Recognition (cs.CV)
Medical image segmentation is a difficult but important task for many
clinical operations such as cardiac bi-ventricular volume estimation. More
recently, there has been a shift to utilizing deep learning and fully
convolutional neural networks (CNNs) to perform image segmentation that has
yielded state-of-the-art results in many public benchmark datasets. Despite the
progress of deep learning in medical image segmentation, standard CNNs are
still not fully adopted in clinical settings as they lack robustness and
interpretability. Shapes are generally more meaningful features than textures
alone, yet textures are the features regular CNNs tend to learn, causing a lack
of robustness. Likewise, previous works surrounding model interpretability have
been focused on post hoc gradient-based saliency methods. However,
gradient-based saliency methods typically require additional computations post
hoc and have been shown to be unreliable for interpretability. Thus, we present
a new architecture called Shape Attentive U-Net (SAUNet) which focuses on model
interpretability and robustness. The proposed architecture attempts to address
these limitations by the use of a secondary shape stream that captures rich
shape-dependent information in parallel with the regular texture stream.
Furthermore, we suggest that multi-resolution saliency maps can be learned using our
dual-attention decoder module which allows for multi-level interpretability and
mitigates the need for additional computations post hoc. Our method also
achieves state-of-the-art results on the two large public cardiac MRI image
segmentation datasets of SUN09 and AC17.
PatchPerPix for Instance Segmentation
Peter Hirsch, Lisa Mais, Dagmar Kainmueller. Subjects: Computer Vision and Pattern Recognition (cs.CV)
In this paper we present a novel method for proposal free instance
segmentation that can handle sophisticated object shapes that span large parts
of an image and form dense object clusters with crossovers. Our method is based
on predicting dense local shape descriptors, which we assemble to form
instances. All instances are assembled simultaneously in one go. To our
knowledge, our method is the first non-iterative method that guarantees
instances to be composed of learnt shape patches. We evaluate our method on a
variety of data domains, where it defines the new state of the art on two
challenging benchmarks, namely the ISBI 2012 EM segmentation benchmark, and the
BBBC010 C. elegans dataset. We furthermore show that our method also performs
well on 3D image data and can handle even extreme cases of complex shape
clusters.
Geometric Proxies for Live RGB-D Stream Enhancement and Consolidation
Comments: extension of our ECCV 2018 paper at this http URL
Subjects: Computer Vision and Pattern Recognition (cs.CV)
We propose a geometric superstructure for unified real-time processing of
RGB-D data. Modern RGB-D sensors are widely used for indoor 3D capture, with
applications ranging from modeling to robotics, through augmented reality.
Nevertheless, their use is limited by their low resolution, with frames often
corrupted by noise, missing data and temporal inconsistencies. Our approach
consists of generating and updating through time a single set of compact local
statistics parameterized over detected geometric proxies, which are fed from
raw RGB-D data. Our proxies provide several processing primitives, which
improve the quality of the RGB-D stream on the fly or lighten further
operations. Experimental results confirm that our lightweight analysis
framework copes well with embedded execution as well as moderate memory and
computational capabilities compared to state-of-the-art methods. Processing
RGB-D data with our proxies allows noise and temporal flickering removal, hole
filling and resampling. As a substitute of the observed scene, our proxies can
additionally be applied to compression and scene reconstruction. We present
experiments performed with our framework in indoor scenes of different natures
within a recent open RGB-D dataset.
Multimodal Deep Unfolding for Guided Image Super-Resolution
Iman Marivani, Evaggelia Tsiligianni, Bruno Cornelis, Nikos Deligiannis. Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
The reconstruction of a high resolution image given a low resolution
observation is an ill-posed inverse problem in imaging. Deep learning methods
rely on training data to learn an end-to-end mapping from a low-resolution
input to a high-resolution output. Unlike existing deep multimodal models that
do not incorporate domain knowledge about the problem, we propose a multimodal
deep learning design that incorporates sparse priors and allows the effective
integration of information from another image modality into the network
architecture. Our solution relies on a novel deep unfolding operator,
performing steps similar to an iterative algorithm for convolutional sparse
coding with side information; therefore, the proposed neural network is
interpretable by design. The deep unfolding architecture is used as a core
component of a multimodal framework for guided image super-resolution. An
alternative multimodal design is investigated by employing residual learning to
improve the training efficiency. The presented multimodal approach is applied
to super-resolution of near-infrared and multi-spectral images as well as depth
upsampling using RGB images as side information. Experimental results show that
our model outperforms state-of-the-art methods.
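
A minimal sketch of the unfolding idea, assuming a dense ISTA-style iteration in place of the paper's convolutional sparse coding; the per-layer matrices would be learned, and the side-information term is the extra input being fused.

```python
# Sketch: unfolding an ISTA-style algorithm with side information into network
# layers. Each "layer" is one iteration with its own matrices (random here,
# learnable in practice). Dense matrices stand in for convolutional operators.
import numpy as np

def soft(v, lam):
    """Soft-thresholding: the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

rng = np.random.default_rng(0)
m, n = 64, 128
x = rng.standard_normal(m)        # low-resolution observation (flattened)
z = rng.standard_normal(n)        # sparse code of the guidance modality
u = np.zeros(n)

for layer in range(5):            # unfolded iterations == network depth
    W = rng.standard_normal((n, m)) * 0.1   # per-layer learnable "encoder"
    S = rng.standard_normal((n, n)) * 0.05  # per-layer learnable "mutual inhibition"
    G = rng.standard_normal((n, n)) * 0.05  # per-layer fusion of side information
    u = soft(W @ x + S @ u + G @ z, lam=0.1)

print("nonzeros in the estimated code:", np.count_nonzero(u))
```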
P^2-GAN: Efficient Style Transfer Using Single Style Image
Comments: 5 pages, 5 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Style transfer is a useful image synthesis technique that can re-render a given
image into another artistic style while preserving its content information.
Generative Adversarial Network (GAN) is a widely adopted framework toward this
task for its better representation ability on local style patterns than the
traditional Gram-matrix based methods. However, most previous methods rely on a
sufficient amount of pre-collected style images to train the model. In this
paper, a novel Patch Permutation GAN (P^2-GAN) network that can efficiently
learn the stroke style from a single style image is proposed. We use patch
permutation to generate multiple training samples from the given style image. A
patch discriminator that can simultaneously process patch-wise images and
natural images seamlessly is designed. We also propose a local texture
descriptor based criterion to quantitatively evaluate the style transfer
quality. Experimental results show that our method can produce finer-quality
re-renderings from a single style image, with improved computational efficiency
compared with many state-of-the-art methods.
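
The patch-permutation step can be sketched directly; the patch size, image, and batch size below are illustrative.

```python
# Sketch: building a batch of training crops from a single style image by patch
# permutation, as the abstract describes. Inputs here are placeholders.
import numpy as np

def patch_permute(style: np.ndarray, p: int, rng) -> np.ndarray:
    """Cut the image into p x p patches and reassemble them in random order."""
    h, w, c = style.shape
    gh, gw = h // p, w // p
    patches = [style[i*p:(i+1)*p, j*p:(j+1)*p] for i in range(gh) for j in range(gw)]
    perm = rng.permutation(len(patches))
    rows = [np.concatenate([patches[perm[r*gw + j]] for j in range(gw)], axis=1)
            for r in range(gh)]
    return np.concatenate(rows, axis=0)

rng = np.random.default_rng(0)
style = rng.random((128, 128, 3))
batch = np.stack([patch_permute(style, 32, rng) for _ in range(8)])
print(batch.shape)   # (8, 128, 128, 3): many samples from one style image
```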
Detecting Face2Face Facial Reenactment in Videos
Comments: 9 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Visual content has become the primary source of information, as evident in
the billions of images and videos shared and uploaded on the Internet every
single day. This has led to an increase in alterations in images and videos to
make them more informative and eye-catching for the viewers worldwide. Some of
these alterations are simple, like copy-move, and are easily detectable, while
other sophisticated alterations like reenactment based DeepFakes are hard to
detect. Reenactment alterations allow the source to change the target
expressions and create photo-realistic images and videos. While the technology
can potentially be used for several applications, the malicious usage of
automatic reenactment has very large social implications. It is therefore
important to develop detection techniques to distinguish real images and videos
from altered ones. This research proposes a learning-based algorithm for detecting
reenactment based alterations. The proposed algorithm uses a multi-stream
network that learns regional artifacts and provides a robust performance at
various compression levels. We also propose a loss function for the balanced
learning of the streams for the proposed network. The performance is evaluated
on the publicly available FaceForensics dataset. The results show
state-of-the-art classification accuracy of 99.96%, 99.10%, and 91.20% for no,
easy, and hard compression factors, respectively.
Learning Diverse Features with Part-Level Resolution for Person Re-Identification
Comments: 8 pages, 5 figures, submitted to IEEE TCSVT
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Learning diverse features is key to the success of person re-identification.
Various part-based methods have been extensively proposed for learning local
representations, which, however, are still inferior to the best-performing
methods for person re-identification. This paper proposes to construct a strong
lightweight network architecture, termed PLR-OSNet, based on the idea of
Part-Level feature Resolution over the Omni-Scale Network (OSNet) for achieving
feature diversity. The proposed PLR-OSNet has two branches, one branch for
global feature representation and the other branch for local feature
representation. The local branch employs a uniform partition strategy for
part-level feature resolution but produces only a single identity-prediction
loss, which is in sharp contrast to the existing part-based methods. Empirical
evidence demonstrates that the proposed PLR-OSNet achieves state-of-the-art
performance on popular person Re-ID datasets, including Market1501,
DukeMTMC-reID and CUHK03, despite its small model size.
Evaluating Weakly Supervised Object Localization Methods Right
Junsuk Choe, Seong Joon Oh, Seungho Lee, Sanghyuk Chun, Zeynep Akata, Hyunjung Shim. Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Weakly-supervised object localization (WSOL) has gained popularity over recent
years for its promise to train localization models with only image-level
labels. Since the seminal WSOL work of class activation mapping (CAM), the
field has focused on how to expand the attention regions to cover objects more
broadly and localize them better. However, these strategies rely on full
localization supervision to validate hyperparameters and for model selection,
which is in principle prohibited under the WSOL setup. In this paper, we argue
that the WSOL task is ill-posed with only image-level labels, and propose a new
evaluation protocol where full supervision is limited to only a small held-out
set not overlapping with the test set. We observe that, under our protocol, the
five most recent WSOL methods have not made a major improvement over the CAM
baseline. Moreover, we report that existing WSOL methods have not reached the
few-shot learning baseline, where the full supervision at validation time is
used for model training instead. Based on our findings, we discuss some future
directions for WSOL.
An End-to-end Deep Learning Approach for Landmark Detection and Matching in Medical Images
Comments: SPIE Medical Imaging Conference – 2020
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Anatomical landmark correspondences in medical images can provide additional
guidance information for the alignment of two images, which, in turn, is
crucial for many medical applications. However, manual landmark annotation is
labor-intensive. Therefore, we propose an end-to-end deep learning approach to
automatically detect landmark correspondences in pairs of two-dimensional (2D)
images. Our approach consists of a Siamese neural network, which is trained to
identify salient locations in images as landmarks and predict matching
probabilities for landmark pairs from two different images. We trained our
approach on 2D transverse slices from 168 lower abdominal Computed Tomography
(CT) scans. We tested the approach on 22,206 pairs of 2D slices with varying
levels of intensity, affine, and elastic transformations. The proposed approach
finds an average of 639, 466, and 370 landmark matches per image pair for
intensity, affine, and elastic transformations, respectively, with spatial
matching errors of at most 1 mm. Further, more than 99% of the landmark pairs
are within a spatial matching error of 2 mm, 4 mm, and 8 mm for image pairs
with intensity, affine, and elastic transformations, respectively. To
investigate the utility of our developed approach in a clinical setting, we
also tested our approach on pairs of transverse slices selected from follow-up
CT scans of three patients. Visual inspection of the results revealed landmark
matches in both bony anatomical regions as well as in soft tissues lacking
prominent intensity gradients.
From Planes to Corners: Multi-Purpose Primitive Detection in Unorganized 3D Point Clouds
Comments: Accepted to IEEE Robotics and Automation Letters 2020 | Video: this https URL | Code: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
We propose a new method for segmentation-free joint estimation of orthogonal
planes, their intersection lines, relationship graph and corners lying at the
intersection of three orthogonal planes. Such unified scene exploration under
orthogonality allows for multitudes of applications such as semantic plane
detection or local and global scan alignment, which in turn can aid robot
localization or grasping tasks. Our two-stage pipeline involves a rough yet
joint estimation of orthogonal planes followed by a subsequent joint refinement
of plane parameters respecting their orthogonality relations. We form a graph
of these primitives, paving the way to the extraction of further reliable
features: lines and corners. Our experiments demonstrate the validity of our
approach in numerous scenarios from wall detection to 6D tracking, both on
synthetic and real data.
VMRFANet: View-Specific Multi-Receptive Field Attention Network for Person Re-identification
Comments: Accepted by ICAART2020
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Person re-identification (re-ID) aims to retrieve the same person across
different cameras. In practice, it remains a challenging task due to
background clutter, variations in body pose and view conditions, inaccurate
bounding box detection, etc. To tackle these issues, in this paper, we propose
a novel multi-receptive field attention (MRFA) module that utilizes filters of
various sizes to help the network focus on informative pixels. Besides, we
present a view-specific mechanism that guides the attention module to handle
the variation in view conditions. Moreover, we introduce a Gaussian horizontal
random cropping/padding method which further improves the robustness of our
proposed network. Comprehensive experiments demonstrate the effectiveness of
each component. Our method achieves 95.5% / 88.1% in rank-1 / mAP on
Market-1501, 88.9% / 80.0% on DukeMTMC-reID, 81.1% / 78.8% on CUHK03 labeled
dataset and 78.9% / 75.3% on CUHK03 detected dataset, outperforming current
state-of-the-art methods.
Transfer Learning using Neural Ordinary Differential Equations
Rajath S, Sumukh Aithal K, Natarajan Subramanyam. Subjects: Computer Vision and Pattern Recognition (cs.CV)
A concept of using Neural Ordinary Differential Equations (NODE) for transfer
learning is introduced. In this paper, we use EfficientNets to explore transfer
learning on the CIFAR-10 dataset, and we use NODE for fine-tuning our model.
Using NODE for fine-tuning provides more stability during training and
validation. These continuous-depth blocks can also trade off between numerical
precision and speed. Using Neural ODEs for transfer learning has resulted in
much more stable convergence of the loss function.
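
A hedged sketch of a continuous-depth block used as a fine-tuning head; a fixed-step RK4 integrator keeps it dependency-free (an adaptive solver such as torchdiffeq would expose the precision/speed trade-off mentioned above), and the feature width and class count are assumptions.

```python
# Sketch: a continuous-depth block usable as a fine-tuning head, in the spirit
# of the abstract. Fixed-step RK4 avoids external solver dependencies.
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
    def forward(self, t, h):
        return self.net(h)          # dh/dt = f(h); t unused in this toy dynamics

class ODEBlock(nn.Module):
    def __init__(self, func, steps=4):
        super().__init__()
        self.func, self.steps = func, steps
    def forward(self, h):
        dt = 1.0 / self.steps       # integrate from t=0 to t=1 with RK4
        t = 0.0
        for _ in range(self.steps):
            k1 = self.func(t, h)
            k2 = self.func(t + dt/2, h + dt/2 * k1)
            k3 = self.func(t + dt/2, h + dt/2 * k2)
            k4 = self.func(t + dt, h + dt * k3)
            h = h + dt/6 * (k1 + 2*k2 + 2*k3 + k4)
            t += dt
        return h

# Hypothetical fine-tuning head on frozen backbone features of width 1280.
head = nn.Sequential(ODEBlock(ODEFunc(1280)), nn.Linear(1280, 10))
print(head(torch.randn(4, 1280)).shape)   # torch.Size([4, 10])
```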
Face Verification via learning the kernel matrix
Comments: 10 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The kernel function is introduced to solve nonlinear pattern recognition
problems. The advantage of a kernel method often depends critically on a proper
choice of the kernel function. A promising approach is to learn the kernel from
data automatically. Methods proposed over the past few years to learn the
kernel have limitations, such as only learning the parameters of a prespecified
kernel function. In this paper, nonlinear face verification via learning the
kernel matrix is proposed. A new criterion is used in the new algorithm to
avoid inverting the possibly singular within-class scatter matrix, which is a
computational problem. The experimental results obtained on the facial database
XM2VTS using the Lausanne protocol show that the verification performance of
the new method is superior to that of the primary method, Client Specific
Kernel Discriminant Analysis (CSKDA). CSKDA needs to choose a proper kernel
function through many experiments, while the new method can learn the kernel
from data automatically, which saves a lot of time and yields robust
performance.
Neural Style Difference Transfer and Its Application to Font Generation
Comments: Submitted to DAS2020
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Designing fonts requires a great deal of time and effort. It requires
professional skills, such as sketching, vectorizing, and image editing.
Additionally, each letter has to be designed individually. In this paper, we
will introduce a method to create fonts automatically. In our proposed method,
the difference of font styles between two different fonts is found and
transferred to another font using neural style transfer. Neural style transfer
is a method of stylizing the contents of an image with the styles of another
image. We propose a novel neural style difference and content difference loss
for neural style transfer. With these losses, new fonts can be generated by
adding or removing font styles from a font. We provide experimental results
with various combinations of input fonts and discuss limitations and future
development of the proposed method.
Recovering Geometric Information with Learned Texture Perturbations
Jane Wu, Yongxu Jin, Zhenglin Geng, Hui Zhou, Ronald Fedkiw. Subjects: Computer Vision and Pattern Recognition (cs.CV)
Regularization is used to avoid overfitting when training a neural network;
unfortunately, this reduces the attainable level of detail hindering the
ability to capture high-frequency information present in the training data.
Even though various approaches may be used to re-introduce high-frequency
detail, it typically does not match the training data and is often not time
coherent. In the case of network-inferred cloth, these shortcomings manifest
themselves via either a lack of detailed wrinkles or unnaturally appearing
and/or time-incoherent surrogate wrinkles. Thus, we propose a general strategy
whereby high-frequency information is procedurally embedded into low-frequency
data so that when the latter is smeared out by the network the former still
retains its high-frequency detail. We illustrate this approach by learning
texture coordinates which when smeared do not in turn smear out the
high-frequency detail in the texture itself but merely smoothly distort it.
Notably, we prescribe perturbed texture coordinates that are subsequently used
to correct the over-smoothed appearance of inferred cloth, and correcting the
appearance from multiple camera views naturally recovers lost geometric
information.
Tsun-Yi Yang, Duy-Kien Nguyen, Huub Heijnen, Vassileios Balntas. Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
In this paper, we explore how three related tasks, namely keypoint detection,
description, and image retrieval can be jointly tackled using a single unified
framework, which is trained without the need for training data with
point-to-point correspondences. By leveraging diverse information from sequential layers
of a standard ResNet-based architecture, we are able to extract keypoints and
descriptors that encode local information using generic techniques such as
local activation norms, channel grouping and dropping, and self-distillation.
Subsequently, global information for image retrieval is encoded in an
end-to-end pipeline, based on pooling of the aforementioned local responses. In
contrast to previous methods in local matching, our method does not depend on
pointwise/pixelwise correspondences, and requires no such supervision at all,
i.e., no depth maps from an SfM model nor manually created synthetic affine
transformations. We illustrate that this simple and direct paradigm is able to
achieve very competitive results against the state-of-the-art methods in
various challenging benchmark conditions such as viewpoint changes, scale
changes, and day-night shifting localization.
Autocamera Calibration for traffic surveillance cameras with wide angle lenses
Aman Gajendra Jain, Nicolas Saunier. Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
We propose a method for automatic calibration of a traffic surveillance
camera with wide-angle lenses. Video footage of a few minutes is sufficient for
the entire calibration process to take place. This method takes in the height
of the camera from the ground plane as the only user input to overcome the
scale ambiguity. The calibration is performed in two stages: (1) intrinsic
calibration and (2) extrinsic calibration. Intrinsic calibration is achieved by
assuming an equidistant fisheye distortion and an ideal camera model. Extrinsic
calibration is accomplished by estimating the two vanishing points, on the
ground plane, from the motion of vehicles at perpendicular intersections. The
first stage of intrinsic calibration is also valid for thermal cameras.
Experiments have been conducted to demonstrate the effectiveness of this
approach on visible as well as thermal cameras.
Index Terms: fish-eye, calibration, thermal camera, intelligent
transportation systems, vanishing points
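
The equidistant model assumed in the intrinsic stage maps a ray at angle theta from the optical axis to image radius r = f*theta. A small sketch with made-up intrinsics:

```python
# Sketch of the equidistant fisheye model: an incoming ray at angle theta from
# the optical axis lands at radius r = f*theta on the image plane. The focal
# length and principal point below are assumed values, not calibration results.
import numpy as np

f, cx, cy = 350.0, 640.0, 360.0            # assumed intrinsics (pixels)

def project(X):
    """World ray (x, y, z) in camera coords -> equidistant fisheye pixel."""
    x, y, z = X
    theta = np.arctan2(np.hypot(x, y), z)   # angle from the optical axis
    phi = np.arctan2(y, x)
    r = f * theta                            # equidistant: r grows linearly
    return cx + r * np.cos(phi), cy + r * np.sin(phi)

def unproject(u, v):
    """Pixel -> unit ray; inverting r = f*theta needs no iteration."""
    du, dv = u - cx, v - cy
    r = np.hypot(du, dv)
    theta, phi = r / f, np.arctan2(dv, du)
    s = np.sin(theta)
    return np.array([s * np.cos(phi), s * np.sin(phi), np.cos(theta)])

u, v = project((0.3, -0.2, 1.0))
print(unproject(u, v))                       # unit ray parallel to (0.3, -0.2, 1.0)
```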
Spectral Pyramid Graph Attention Network for Hyperspectral Image Classification
Comments: 7 pages, 6 figures, 4 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Convolutional neural networks (CNN) have made significant advances in
hyperspectral image (HSI) classification. However, the standard convolutional
kernel neglects the intrinsic connections between data points, resulting in
poor region delineation and small spurious predictions. Furthermore, HSIs have
a unique continuous data distribution along the high-dimensional spectral
domain; much remains to be addressed in characterizing the spectral contexts
given the prohibitively high dimensionality, and in improving reasoning
capability in light of the limited amount of labelled data. This paper presents
a novel architecture which explicitly addresses these two issues. Specifically,
we design an architecture to encode the multiple spectral contextual
information in the form of spectral pyramid of multiple embedding spaces. In
each spectral embedding space, we propose a graph attention mechanism to
explicitly perform interpretable reasoning in the spatial domain based on the
connection in spectral feature space. Experiments on three HSI datasets
demonstrate that the proposed architecture can significantly improve the
classification accuracy compared with the existing methods.
Active and Incremental Learning with Weak Supervision
Comments: Accepted for publication in KI – Künstliche Intelligenz
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Large amounts of labeled training data are one of the main contributors to
the great success that deep models have achieved in the past. Label acquisition
for tasks other than benchmarks can pose a challenge due to requirements of
both funding and expertise. By selecting unlabeled examples that are promising
in terms of model improvement and only asking for respective labels, active
learning can increase the efficiency of the labeling process in terms of time
and cost.
In this work, we describe combinations of an incremental learning scheme and
methods of active learning. These allow for continuous exploration of newly
observed unlabeled data. We describe selection criteria based on model
uncertainty as well as expected model output change (EMOC). An object detection
task is evaluated in a continuous exploration context on the PASCAL VOC
dataset. We also validate a weakly supervised system based on active and
incremental learning in a real-world biodiversity application where images from
camera traps are analyzed. Labeling only 32 images by accepting or rejecting
proposals generated by our method yields an increase in accuracy from 25.4% to
42.6%.
Zhen-Liang Ni, Gui-Bin Bian, Guan-An Wang, Xiao-Hu Zhou, Zeng-Guang Hou, Xiao-Liang Xie, Zhen Li, Yu-Han Wang. Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
Surgical instrument segmentation is extremely important for computer-assisted
surgery. Different from common object segmentation, it is more challenging due
to the large illumination and scale variation caused by the special surgical
scenes. In this paper, we propose a novel bilinear attention network with
adaptive receptive field to solve these two challenges. For the illumination
variation, the bilinear attention module can capture second-order statistics to
encode global contexts and semantic dependencies between local pixels. With
them, semantic features in challenging areas can be inferred from their
neighbors and the distinction of various semantics can be boosted. For the
scale variation, our adaptive receptive field module aggregates multi-scale
features and automatically fuses them with different weights. Specifically, it
encodes the semantic relationship between channels to emphasize feature maps
with appropriate scales, changing the receptive field of subsequent
convolutions. The proposed network achieves the best performance, 97.47% mean
IOU, on Cata7, and takes first place on EndoVis 2017, overtaking the
second-ranking method by 10.10% IOU.
Comments: submitted to International Journal of Machine Learning and Cybernetics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Representation based classification methods have become a hot research topic
during the past few years, and the two most prominent approaches are sparse
representation based classification (SRC) and collaborative representation
based classification (CRC). CRC reveals that it is the collaborative
representation rather than the sparsity that makes SRC successful.
Nevertheless, the dense representation of CRC may not be discriminative, which
will degrade its performance on classification tasks. To alleviate this
problem to some extent, we propose a new method called sparse and
collaborative-competitive representation based classification (SCCRC) for image
classification. Firstly, the coefficients of the test sample are obtained by
SRC and CCRC, respectively. Then the fused coefficient is derived by
multiplying the coefficients of SRC and CCRC. Finally, the test sample is
designated to the class that has the minimum residual. Experimental results on
several benchmark databases demonstrate the efficacy of our proposed SCCRC. The
source code of SCCRC is accessible at this https URL.
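
A toy sketch of the fusion rule on synthetic data, with ISTA standing in for the SRC solver and plain ridge regression standing in for CCRC's collaborative-competitive coding; regularization weights are illustrative.

```python
# Sketch of the coefficient-fusion idea: get one coding vector from an l1
# (SRC-style) solver and one from a ridge-regularized collaborative solver,
# multiply them element-wise, then pick the class with the smallest residual.
import numpy as np

rng = np.random.default_rng(0)
n_class, per, d = 5, 10, 30
D = rng.standard_normal((d, n_class * per))          # dictionary: columns = train samples
D /= np.linalg.norm(D, axis=0)
labels = np.repeat(np.arange(n_class), per)
y = D[:, 7] + 0.05 * rng.standard_normal(d)          # test sample from class 0

def ista(D, y, lam=0.05, iters=200):                 # SRC coding via ISTA
    L = np.linalg.norm(D, 2) ** 2                    # Lipschitz constant of gradient
    a = np.zeros(D.shape[1])
    for _ in range(iters):
        g = a + D.T @ (y - D @ a) / L
        a = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)
    return a

a_src = ista(D, y)
a_crc = np.linalg.solve(D.T @ D + 0.1 * np.eye(D.shape[1]), D.T @ y)
a = a_src * a_crc                                    # fused coefficients

residuals = [np.linalg.norm(y - D[:, labels == c] @ a[labels == c])
             for c in range(n_class)]
print("predicted class:", int(np.argmin(residuals))) # expect 0
```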
Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models
Moshiur R. Farazi, Salman H. Khan, Nick Barnes. Subjects: Computer Vision and Pattern Recognition (cs.CV); Computational Complexity (cs.CC)
Visual Question Answering (VQA) has emerged as a Visual Turing Test to
validate the reasoning ability of AI agents. The pivot of existing VQA models
is the joint embedding that is learned by combining the visual features from an
image and the semantic features from a given question. Consequently, a large
body of literature has focused on developing complex joint embedding strategies
coupled with visual attention mechanisms to effectively capture the interplay
between these two modalities. However, modelling the visual and semantic
features in a high dimensional (joint embedding) space is computationally
expensive, and more complex models often result in trivial improvements in the
VQA accuracy. In this work, we systematically study the trade-off between the
model complexity and the performance on the VQA task. VQA models have a diverse
architecture comprising of pre-processing, feature extraction, multimodal
fusion, attention and final classification stages. We specifically focus on the
effect of “multi-modal fusion” in VQA models that is typically the most
expensive step in a VQA pipeline. Our thorough experimental evaluation leads us
to two proposals, one optimized for minimal complexity and the other one
optimized for state-of-the-art VQA performance.
Plane Pair Matching for Efficient 3D View Registration
Adrien Kaiser, José Alonso Ybanez Zepeda, Tamy Boubekeur. Subjects: Computer Vision and Pattern Recognition (cs.CV)
We present a novel method to estimate the motion matrix between overlapping
pairs of 3D views in the context of indoor scenes. We use the Manhattan world
assumption to introduce lightweight geometric constraints under the form of
planes into the problem, which reduces complexity by taking into account the
structure of the scene. In particular, we define a stochastic framework to
categorize planes as vertical or horizontal and parallel or non-parallel. We
leverage this classification to match pairs of planes in overlapping views with
point-of-view agnostic structural metrics. We propose to split the motion
computation using the classification and estimate separately the rotation and
translation of the sensor, using a quadric minimizer. We validate our approach
on a toy example and present quantitative experiments on a public RGB-D
dataset, comparing against recent state-of-the-art methods. Our evaluation
shows that planar constraints only add low computational overhead while
improving results in precision when applied after a prior coarse estimate. We
conclude by giving hints towards extensions and improvements of current
results.
Comments: 2018 International Conference on Smart Multimedia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
High Dynamic Range (HDR) imaging is gaining increased attention due to its
realistic content, for not only regular displays but also smartphones. Before
sufficient HDR content is distributed, HDR visualization still relies mostly on
converting Standard Dynamic Range (SDR) content. SDR images are often
quantized, or bit depth reduced, before SDR-to-HDR conversion, e.g. for video
transmission. Quantization can easily lead to banding artefacts. In some
computing and/or memory I/O limited environments, the traditional solution using
spatial neighborhood information is not feasible. Our method includes noise
generation (offline) and noise injection (online), and operates on pixels of
the quantized image. We vary the magnitude and structure of the noise pattern
adaptively based on the luma of the quantized pixel and the slope of the
inverse-tone mapping function. Subjective user evaluations confirm the superior
performance of our technique.
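
A hedged sketch of luma-adaptive noise injection on a quantized ramp; the inverse tone-mapping curve and scaling constants are placeholders, not the paper's.

```python
# Sketch: per-pixel dither whose magnitude scales with the local slope of an
# (assumed) inverse tone-mapping curve, so flat dark regions get enough noise
# to break up banding. Curve x**2.2 and constants are illustrative only.
import numpy as np

def inject(quantized: np.ndarray, rng, k: float = 0.5) -> np.ndarray:
    """quantized: luma in [0, 1] after bit-depth reduction."""
    slope = 2.2 * np.maximum(quantized, 1e-3) ** 1.2   # d/dx of x**2.2, toy ITM
    noise = rng.uniform(-0.5, 0.5, quantized.shape) / 255.0
    return np.clip(quantized + k * noise * slope, 0.0, 1.0)

rng = np.random.default_rng(0)
banded = np.round(np.linspace(0, 1, 256).reshape(16, 16) * 31) / 31  # 5-bit ramp
print(inject(banded, rng).round(3)[:2])
```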
FD-GAN: Generative Adversarial Networks with Fusion-discriminator for Single Image Dehazing
Comments: Accepted by AAAI2020 (with supplementary files)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Recently, convolutional neural networks (CNNs) have achieved great
improvements in single image dehazing and attracted much attention in research.
Most existing learning-based dehazing methods are not fully end-to-end, which
still follow the traditional dehazing procedure: first estimate the medium
transmission and the atmospheric light, then recover the haze-free image based
on the atmospheric scattering model. However, in practice, due to lack of
priors and constraints, it is hard to precisely estimate these intermediate
parameters. Inaccurate estimation further degrades the performance of dehazing,
resulting in artifacts, color distortion and insufficient haze removal. To
address this, we propose a fully end-to-end Generative Adversarial Network
with a Fusion-discriminator (FD-GAN) for image dehazing. With the proposed
Fusion-discriminator, which takes frequency information as additional priors,
our model can generate more natural and realistic dehazed images with less
color distortion and fewer artifacts. Moreover, we synthesize a large-scale
training dataset including various indoor and outdoor hazy images to boost the
performance and we reveal that for learning-based dehazing methods, the
performance is strictly influenced by the training data. Experiments have shown
that our method reaches state-of-the-art performance on both public synthetic
datasets and real-world images with more visually pleasing dehazed results.
A hybrid algorithm for disparity calculation from sparse disparity estimates based on stereo vision
Comments: 2014 SPCOM
Subjects: Computer Vision and Pattern Recognition (cs.CV)
In this paper, we propose a novel method for stereo disparity estimation
that combines the existing methods of block-based and region-based stereo
matching. Our method can generate dense disparity maps from disparity
measurements of only 18% of the pixels of either the left or the right image of
a stereo image pair. It works by segmenting the lightness values of image pixels
using a fast implementation of K-Means clustering. It then refines those
segment boundaries by morphological filtering and connected components
analysis, thus removing a lot of redundant boundary pixels. This is followed by
determining the boundaries’ disparities by the SAD cost function. Lastly, we
reconstruct the entire disparity map of the scene from the boundaries’
disparities through disparity propagation along the scan lines and disparity
prediction of regions of uncertainty by considering disparities of the
neighboring regions. Experimental results on the Middlebury stereo vision
dataset demonstrate that the proposed method outperforms traditional disparity
determination methods like SAD and NCC by up to 30% and achieves an improvement
of 2.6% when compared to a recent approach based on absolute difference (AD)
cost function for disparity calculations [1].
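
The SAD cost used for the boundary pixels can be sketched as plain block matching; the window size and search range below are assumptions.

```python
# Sketch of the SAD cost: for a boundary pixel, slide a block along the
# epipolar line and keep the disparity with the lowest sum of absolute
# differences. Window size and search range are assumed for illustration.
import numpy as np

def sad_disparity(left, right, x, y, half=3, max_disp=32):
    """Disparity at (x, y) of the left image by block matching (SAD cost)."""
    patch = left[y-half:y+half+1, x-half:x+half+1]
    best, best_d = np.inf, 0
    for d in range(min(max_disp, x - half) + 1):      # stay inside the image
        cand = right[y-half:y+half+1, x-d-half:x-d+half+1]
        cost = np.abs(patch - cand).sum()             # sum of absolute differences
        if cost < best:
            best, best_d = cost, d
    return best_d

rng = np.random.default_rng(0)
right = rng.random((100, 160))
left = np.roll(right, 7, axis=1)                      # synthetic shift of 7 pixels
print(sad_disparity(left, right, x=80, y=50))         # expect 7
```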
G2MF-WA: Geometric Multi-Model Fitting with Weakly Annotated Data
Chao Zhang, Xuequan Lu, Katsuya Hotta, Xi Yang. Subjects: Computer Vision and Pattern Recognition (cs.CV)
In this paper we attempt to address the problem of geometric multi-model
fitting by resorting to a few weakly annotated (WA) data points, which has
been sparsely studied so far. In weak annotating, most of the manual
annotations are supposed to be correct, yet they are inevitably mixed with
incorrect ones. The WA data can be naturally obtained in an interactive way for
specific tasks; for example, in the case of homography estimation, one can easily
annotate points on the same plane/object with a single label by observing the
image. Motivated by this, we propose a novel method to make full use of the WA
data to boost the multi-model fitting performance. Specifically, a graph for
model proposal sampling is first constructed using the WA data, given the prior
that WA data annotated with the same weak label have a high probability of
being assigned to the same model. By incorporating this prior knowledge into
the calculation of edge probabilities, vertices (i.e., data points) lying
on/near the latent model are likely to connect together and further form a
subset/cluster for effective proposal generation. With the proposals
generated, α-expansion is adopted for labeling, and our method in
return updates the proposals. This works in an iterative way. Extensive
experiments validate our method and show that the proposed method produces
noticeably better results than state-of-the-art techniques in most cases.
A Novel Image Dehazing and Assessment Method
Comments: Accepted in IBA-ICICT 2019
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Images captured in hazy weather conditions often suffer from reduced color
contrast and color fidelity. This degradation is represented by the
transmission map, which captures the amount of attenuation, and the airlight,
which represents the color of the additive noise. In this paper, we propose a
method to estimate the transmission map using haze levels instead of the
airlight color, since there are ambiguities in the estimation of airlight.
Qualitative and quantitative results show the competitiveness of the proposed
method. In addition, we propose two metrics, based on statistics of natural
outdoor images, for the assessment of haze removal algorithms.
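
The underlying atmospheric scattering model makes the roles of the two quantities concrete: a hazy pixel is I = J*t + A*(1 - t), so with estimates of t and A the scene radiance J can be recovered. A toy sketch (t and A assumed known here):

```python
# Sketch of inverting the atmospheric scattering model I = J*t + A*(1 - t):
# given a transmission map t and airlight A (both assumed), recover
# J = (I - A) / max(t, t0) + A, where t0 keeps the division well-conditioned.
import numpy as np

def dehaze(I: np.ndarray, t: np.ndarray, A: np.ndarray, t0: float = 0.1):
    """Invert the scattering model for an RGB image I with per-pixel t."""
    t = np.clip(t, t0, 1.0)[..., None]      # broadcast t over color channels
    return np.clip((I - A) / t + A, 0.0, 1.0)

rng = np.random.default_rng(0)
J_true = rng.random((4, 4, 3))              # toy scene radiance
t_map = np.full((4, 4), 0.6)                # toy transmission map
A = np.array([0.9, 0.9, 0.92])              # toy airlight color
I_hazy = J_true * t_map[..., None] + A * (1 - t_map[..., None])
print(np.allclose(dehaze(I_hazy, t_map, A), J_true))   # True
```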
SQuINTing at VQA Models: Interrogating VQA Models with Sub-Questions
Ramprasaath R. Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Ribeiro, Besmira Nushi, Ece Kamar. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Existing VQA datasets contain questions with varying levels of complexity.
While the majority of questions in these datasets require perception for
recognizing existence, properties, and spatial relationships of entities, a
significant portion of questions pose challenges that correspond to reasoning
tasks — tasks that can only be answered through a synthesis of perception and
knowledge about the world, logic and/or reasoning. This distinction allows us
to notice when existing VQA models have consistency issues — they answer the
reasoning question correctly but fail on associated low-level perception
questions. For example, models answer the complex reasoning question “Is the
banana ripe enough to eat?” correctly, but fail on the associated perception
question “Are the bananas mostly green or yellow?” indicating that the model
likely answered the reasoning question correctly but for the wrong reason. We
quantify the extent to which this phenomenon occurs by creating a new Reasoning
split of the VQA dataset and collecting Sub-VQA, a new dataset consisting of
200K new perception questions which serve as sub-questions corresponding to the
set of perceptual tasks needed to effectively answer the complex reasoning
questions in the Reasoning split. Additionally, we propose an approach called
Sub-Question Importance-aware Network Tuning (SQuINT), which encourages the
model to attend to the same parts of the image when answering the reasoning
question and the perception sub-questions. We show that SQuINT improves model
consistency by 7.8%, also marginally improving its performance on the Reasoning
questions in VQA, while also displaying qualitatively better attention maps.
MTI-Net: Multi-Scale Task Interaction Networks for Multi-Task Learning
Simon Vandenhende, Stamatios Georgoulis, Luc Van Gool. Subjects: Computer Vision and Pattern Recognition (cs.CV)
In this paper, we highlight the importance of considering task interactions
at multiple scales when distilling task information in a multi-task learning
setup. In contrast to common belief, we show that tasks with high pattern
affinity at a certain scale are not guaranteed to retain this behaviour at
other scales, and vice versa. We propose a novel architecture, MTI-Net, that
builds upon this finding in three ways. First, it explicitly models task
interactions at every scale via a multi-scale multi-modal distillation unit.
Second, it propagates distilled task information from lower to higher scales
via a feature propagation module. Third, it aggregates the refined task
features from all scales via a feature aggregation unit to produce the final
per-task predictions.
Extensive experiments on two multi-task dense labeling datasets show that,
unlike prior work, our multi-task model delivers on the full potential of
multi-task learning, that is, smaller memory footprint, reduced number of
calculations, and better performance w.r.t. single-task learning.
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences
Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, Lianli Gao. Subjects: Computer Vision and Pattern Recognition (cs.CV)
In this paper, we consider a novel task, Spatio-Temporal Video Grounding for
Multi-Form Sentences (STVG). Given an untrimmed video and a
declarative/interrogative sentence depicting an object, STVG aims to localize
the spatiotemporal tube of the queried object. STVG has two challenging
settings: (1) We need to localize spatio-temporal object tubes from untrimmed
videos, where the object may only exist in a very small segment of the video;
(2) We deal with multi-form sentences, including the declarative sentences with
explicit objects and interrogative sentences with unknown objects. Existing
methods cannot tackle the STVG task due to ineffective tube pre-generation
and the lack of object relationship modeling. Thus, we propose a novel
Spatio-Temporal Graph Reasoning Network (STGRN) for this task. First, we build
a spatio-temporal region graph to capture the region relationships with
temporal object dynamics, which involves the implicit and explicit spatial
subgraphs in each frame and the temporal dynamic subgraph across frames. We
then incorporate textual clues into the graph and develop the multi-step
cross-modal graph reasoning. Next, we introduce a spatio-temporal localizer
with a dynamic selection method to directly retrieve the spatiotemporal tubes
without tube pre-generation. Moreover, we contribute a large-scale video
grounding dataset VidSTG based on video relation dataset VidOR. The extensive
experiments demonstrate the effectiveness of our method.
RGB-D Odometry and SLAM
Comments: This is the pre-submission version of the manuscript that was later edited and published as a chapter in RGB-D Image Analysis and Processing
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
The emergence of modern RGB-D sensors had a significant impact in many
application fields, including robotics, augmented reality (AR) and 3D scanning.
They are low-cost, low-power and low-size alternatives to traditional range
sensors such as LiDAR. Moreover, unlike RGB cameras, RGB-D sensors provide the
additional depth information that removes the need for frame-by-frame
triangulation for 3D scene reconstruction. These merits have made them very
popular in mobile robotics and AR, where it is of great interest to estimate
ego-motion and 3D scene structure. Such spatial understanding can enable robots
to navigate autonomously without collisions and allow users to insert virtual
entities consistent with the image stream. In this chapter, we review common
formulations of odometry and Simultaneous Localization and Mapping (known by
its acronym SLAM) using RGB-D stream input. The two topics are closely related,
as the former aims to track the incremental camera motion with respect to a
local map of the scene, and the latter to jointly estimate the camera
trajectory and the global map with consistency. In both cases, the standard
approaches minimize a cost function using nonlinear optimization techniques.
This chapter consists of three main parts: In the first part, we introduce the
basic concept of odometry and SLAM and motivate the use of RGB-D sensors. We
also give mathematical preliminaries relevant to most odometry and SLAM
algorithms. In the second part, we detail the three main components of SLAM
systems: camera pose tracking, scene mapping and loop closing. For each
component, we describe different approaches proposed in the literature. In the
final part, we provide a brief discussion on advanced research topics with the
references to the state-of-the-art.
Towards Stabilizing Batch Statistics in Backward Propagation of Batch Normalization
Comments: Published in ICLR2020; code: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Batch Normalization (BN) is one of the most widely used techniques in the deep
learning field. But its performance can degrade badly with insufficient batch
size. This weakness limits the usage of BN in many computer vision tasks, like
detection or segmentation, where the batch size is usually small due to the
constraint of memory consumption. Therefore, many modified normalization
techniques have been proposed, which either fail to restore the performance of
BN completely, or have to introduce additional nonlinear operations in the
inference procedure, incurring substantial overhead. In this paper, we reveal
that there are two extra batch statistics involved in the backward propagation
of BN, which have never been well discussed before. The extra batch statistics
associated with gradients can also severely affect the training of deep neural
networks. Based on our analysis, we propose a novel normalization method, named
Moving Average Batch Normalization (MABN). MABN can completely restore the
performance of vanilla BN in small-batch cases, without introducing any
additional nonlinear operations in the inference procedure. We prove the
benefits of MABN by both theoretical analysis and experiments. Our experiments
demonstrate the effectiveness of MABN in multiple computer vision tasks,
including ImageNet and COCO. The code has been released at this https URL.
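
A toy forward-pass sketch of the moving-average idea; MABN's actual contribution also stabilizes the extra statistics in the backward pass, which this sketch omits, and the momentum value is illustrative.

```python
# Sketch: replace per-batch statistics with moving averages so tiny batches
# stop destabilizing normalization. Forward pass only; MABN additionally
# stabilizes the two batch statistics in BN's backward pass, omitted here.
import numpy as np

class MovingAvgNorm:
    def __init__(self, dim, momentum=0.98, eps=1e-5):
        self.mu = np.zeros(dim)            # EMA of batch mean
        self.var = np.ones(dim)            # EMA of batch variance
        self.m, self.eps = momentum, eps
    def __call__(self, x):                 # x: (batch, dim), possibly tiny batch
        self.mu = self.m * self.mu + (1 - self.m) * x.mean(axis=0)
        self.var = self.m * self.var + (1 - self.m) * x.var(axis=0)
        return (x - self.mu) / np.sqrt(self.var + self.eps)

norm = MovingAvgNorm(8)
rng = np.random.default_rng(0)
for _ in range(100):                       # batch size 2: vanilla BN would be noisy
    out = norm(rng.standard_normal((2, 8)) * 3.0 + 1.0)
print(out.mean(), out.std())               # roughly 0 and 1 after warm-up
```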
Zero-Reference Deep Curve Estimation for Low-Light Image Enhancement
Chunle Guo, Chongyi Li, Jichang Guo, Chen Change Loy, Junhui Hou, Sam Kwong, Runmin Cong. Subjects: Computer Vision and Pattern Recognition (cs.CV)
The paper presents a novel method, Zero-Reference Deep Curve Estimation
(Zero-DCE), which formulates light enhancement as a task of image-specific
curve estimation with a deep network. Our method trains a lightweight deep
network, DCE-Net, to estimate pixel-wise and high-order curves for dynamic
range adjustment of a given image. The curve estimation is specially designed,
considering pixel value range, monotonicity, and differentiability. Zero-DCE is
appealing in its relaxed assumption on reference images, i.e., it does not
require any paired or unpaired data during training. This is achieved through a
set of carefully formulated non-reference loss functions, which implicitly
measure the enhancement quality and drive the learning of the network. Our
method is efficient as image enhancement can be achieved by an intuitive and
simple nonlinear curve mapping. Despite its simplicity, we show that it
generalizes well to diverse lighting conditions. Extensive experiments on
various benchmarks demonstrate the advantages of our method over
state-of-the-art methods qualitatively and quantitatively. Furthermore, the
potential benefits of our Zero-DCE to face detection in the dark are discussed.
Code and model will be available at this https URL.
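
The curve family is simple enough to sketch: each step applies LE(x) = x + α·x·(1 − x) per pixel, and stacking steps yields the higher-order curve. In the paper, DCE-Net predicts per-pixel α maps; the constant α values below are placeholders.

```python
# Sketch of the image-specific curve family Zero-DCE estimates: each iteration
# applies LE(x) = x + alpha * x * (1 - x), which is monotonic and maps [0, 1]
# to [0, 1] for alpha in [-1, 1]. Fixed alphas stand in for DCE-Net's output.
import numpy as np

def enhance(img: np.ndarray, alphas) -> np.ndarray:
    """Apply the quadratic curve iteratively (a higher-order curve overall)."""
    x = img
    for a in alphas:
        x = x + a * x * (1.0 - x)
    return x

rng = np.random.default_rng(0)
low_light = rng.random((8, 8, 3)) * 0.2          # toy under-exposed image
bright = enhance(low_light, alphas=[0.8] * 8)    # 8 curve iterations
print(low_light.mean(), "->", bright.mean())     # brightness increases
```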
SlideImages: A Dataset for Educational Image Classification
Comments: 8 pages, 2 figures, to be presented at ECIR 2020
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
In the past few years, convolutional neural networks (CNNs) have achieved
impressive results in computer vision tasks, which however mainly focus on
photos with natural scene content. Besides, non-sensor derived images such as
illustrations, data visualizations, figures, etc. are typically used to convey
complex information or to explore large datasets. However, this kind of image
has received little attention in computer vision. CNNs and similar techniques
require large volumes of training data. Currently, many document analysis systems
are trained in part on scene images due to the lack of large datasets of
educational image data. In this paper, we address this issue and present
SlideImages, a dataset for the task of classifying educational illustrations.
SlideImages contains training data collected from various sources, e.g.,
Wikimedia Commons and the AI2D dataset, and test data collected from
educational slides. We have reserved all the actual educational images as a
test dataset in order to ensure that the approaches using this dataset
generalize well to new educational images, and potentially other domains.
Furthermore, we present a baseline system using a standard deep neural
architecture and discuss dealing with the challenge of limited training data.
Deep Semantic Face Deblurring
Comments: Submitted to International Journal of Computer Vision (IJCV). arXiv admin note: text overlap with arXiv:1803.03345
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
In this paper, we propose an effective and efficient face deblurring
algorithm by exploiting semantic cues via deep convolutional neural networks.
As human faces are highly structured and share common facial components
(e.g., eyes and mouths), such semantic information provides a strong prior for
restoration. We incorporate face semantic labels as input priors and propose an
adaptive structural loss to regularize facial local structures within an
end-to-end deep convolutional neural network. Specifically, we first use a
coarse deblurring network to reduce the motion blur on the input face image. We
then adopt a parsing network to extract the semantic features from the coarse
deblurred image. Finally, the fine deblurring network utilizes the semantic
information to restore a clear face image. We train the network with perceptual
and adversarial losses to generate photo-realistic results. The proposed method
restores sharp images with more accurate facial features and details.
Quantitative and qualitative evaluations demonstrate that the proposed face
deblurring algorithm performs favorably against the state-of-the-art methods in
terms of restoration quality, face recognition and execution speed.
Gated Path Selection Network for Semantic Segmentation
Qichuan Geng , Hong Zhang , Xiaojuan Qi , Ruigang Yang , Zhong Zhou , Gao Huang Subjects : Computer Vision and Pattern Recognition (cs.CV)
Semantic segmentation is a challenging task that needs to handle large scale
variations, deformations and different viewpoints. In this paper, we develop a
novel network named Gated Path Selection Network (GPSNet), which aims to learn
adaptive receptive fields. In GPSNet, we first design a two-dimensional
multi-scale network – SuperNet, which densely incorporates features from
growing receptive fields. To dynamically select desirable semantic context, a
gate prediction module is further introduced. In contrast to previous works
that focus on optimizing sample positions on the regular grids, GPSNet can
adaptively capture free-form dense semantic contexts. The derived adaptive
receptive fields are data-dependent and flexible enough to model different
geometric transformations of objects. On two representative semantic segmentation
datasets, i.e., Cityscapes, and ADE20K, we show that the proposed approach
consistently outperforms previous methods and achieves competitive performance
without bells and whistles.
Human-Aware Motion Deblurring
Comments: ICCV2019 paper. Website: this https URL
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
This paper proposes a human-aware deblurring model that disentangles the
motion blur between foreground (FG) humans and background (BG). The proposed
model is based on a triple-branch encoder-decoder architecture. The first two
branches are learned for sharpening FG humans and BG details, respectively;
while the third one produces global, harmonious results by comprehensively
fusing multi-scale deblurring information from the two domains. The proposed
model is further endowed with a supervised, human-aware attention mechanism in
an end-to-end fashion. It learns a soft mask that encodes FG human information
and explicitly drives the FG/BG decoder-branches to focus on their specific
domains. To further benefit the research towards Human-aware Image Deblurring,
we introduce a large-scale dataset, named HIDE, which consists of 8,422 blurry
and sharp image pairs with 65,784 densely annotated FG human bounding boxes.
HIDE is specifically built to span a broad range of scenes, human object sizes,
motion patterns, and background complexities. Extensive experiments on public
benchmarks and our dataset demonstrate that our model performs favorably
against the state-of-the-art motion deblurring methods, especially in capturing
semantic details.
GTNet: Generative Transfer Network for Zero-Shot Object Detection
Comments: Accepted by AAAI 2020
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
We propose a Generative Transfer Network (GTNet) for zero-shot object
detection (ZSD). GTNet consists of an Object Detection Module and a Knowledge
Transfer Module. The Object Detection Module can learn large-scale seen domain
knowledge. The Knowledge Transfer Module leverages a feature synthesizer to
generate unseen class features, which are applied to train a new classification
layer for the Object Detection Module. In order to synthesize features for each
unseen class with both the intra-class variance and the IoU variance, we design
an IoU-Aware Generative Adversarial Network (IoUGAN) as the feature
synthesizer, which can be easily integrated into GTNet. Specifically, IoUGAN
consists of three unit models: Class Feature Generating Unit (CFU), Foreground
Feature Generating Unit (FFU), and Background Feature Generating Unit (BFU).
CFU generates unseen features with the intra-class variance conditioned on the
class semantic embeddings. FFU and BFU add the IoU variance to the results of
CFU, yielding class-specific foreground and background features, respectively.
We evaluate our method on three public datasets and the results demonstrate
that our method performs favorably against the state-of-the-art ZSD approaches.
See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks
Comments: CVPR2019. Weblink: this https URL
Journal-ref: CVPR2019
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
We introduce a novel network, called CO-attention Siamese Network (COSNet),
to address the unsupervised video object segmentation task from a holistic
view. We emphasize the importance of inherent correlation among video frames
and incorporate a global co-attention mechanism to further improve the
state-of-the-art deep learning based solutions that primarily focus on learning
discriminative foreground representations over appearance and motion in
short-term temporal segments. The co-attention layers in our network provide
efficient and competent stages for capturing global correlations and scene
context by jointly computing and appending co-attention responses into a joint
feature space. We train COSNet with pairs of video frames, which naturally
augments training data and allows increased learning capacity. During the
segmentation stage, the co-attention model encodes useful information by
processing multiple reference frames together, which is leveraged to infer the
frequently reappearing and salient foreground objects better. We propose a
unified and end-to-end trainable framework where different co-attention
variants can be derived for mining the rich context within videos. Our
extensive experiments over three large benchmarks demonstrate that COSNet
outperforms the current alternatives by a large margin.
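As a generic illustration of the co-attention computation described above, a
bilinear affinity between two frames' features followed by mutual softmax
attention can be sketched as follows; the bilinear form and the concatenation
are common choices and not necessarily COSNet's exact variant:

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Sketch of a co-attention layer between two frame feature maps."""

    def __init__(self, channels):
        super().__init__()
        self.w = nn.Linear(channels, channels, bias=False)  # bilinear weight W

    def forward(self, fa, fb):
        b, c, h, w = fa.shape
        a = fa.flatten(2).transpose(1, 2)           # (B, HW, C)
        bb = fb.flatten(2).transpose(1, 2)          # (B, HW, C)
        s = self.w(a) @ bb.transpose(1, 2)          # (B, HW, HW) affinity
        a_att = torch.softmax(s, dim=2) @ bb        # frame a attends to frame b
        b_att = torch.softmax(s, dim=1).transpose(1, 2) @ a
        za = torch.cat([a, a_att], dim=2).transpose(1, 2).view(b, 2 * c, h, w)
        zb = torch.cat([bb, b_att], dim=2).transpose(1, 2).view(b, 2 * c, h, w)
        return za, zb
```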
Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks
Comments: ICCV2019(Oral). Website: this https URL
Journal-ref: ICCV2019(Oral)
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
This work proposes a novel attentive graph neural network (AGNN) for
zero-shot video object segmentation (ZVOS). The suggested AGNN recasts this
task as a process of iterative information fusion over video graphs.
Specifically, AGNN builds a fully connected graph to efficiently represent
frames as nodes, and relations between arbitrary frame pairs as edges. The
underlying pair-wise relations are described by a differentiable attention
mechanism. Through parametric message passing, AGNN is able to efficiently
capture and mine much richer and higher-order relations between video frames,
thus enabling a more complete understanding of video content and more accurate
foreground estimation. Experimental results on three video segmentation
datasets show that AGNN sets a new state-of-the-art in each case. To further
demonstrate the generalizability of our framework, we extend AGNN to an
additional task: image object co-segmentation (IOCS). We perform experiments on
two well-known IOCS datasets and again observe the superiority of our AGNN model.
The extensive experiments verify that AGNN is able to learn the underlying
semantic/appearance relationships among video frames or related images, and
discover the common objects.
Learning Compositional Neural Information Fusion for Human Parsing
Comments: ICCV2019. Website: this https URL
Journal-ref: ICCV2019
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
This work proposes to combine neural networks with the compositional
hierarchy of human bodies for efficient and complete human parsing. We
formulate the approach as a neural information fusion framework. Our model
assembles the information from three inference processes over the hierarchy:
direct inference (directly predicting each part of a human body using image
information), bottom-up inference (assembling knowledge from constituent
parts), and top-down inference (leveraging context from parent nodes). The
bottom-up and top-down inferences explicitly model the compositional and
decompositional relations in human bodies, respectively. In addition, the
fusion of multi-source information is conditioned on the inputs, i.e., by
estimating and considering the confidence of the sources. The whole model is
end-to-end differentiable, explicitly modeling information flows and
structures. Our approach is extensively evaluated on four popular datasets,
outperforming the state of the art in all cases, with a fast processing speed
of 23 fps. Our code and results have been released to help ease future research
in this direction.
Image denoising via K-SVD with primal-dual active set algorithm
Comments: 9 pages, 6 figures. The paper was accepted by IEEE WACV 2020 and will be placed in IEEE Xplore
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
; Machine Learning (stat.ML)
The K-SVD algorithm has been successfully applied to image denoising for many
years, but its bottlenecks in speed and accuracy still need to be broken. For
the sparse coding stage in K-SVD, which involves an (ell_{0}) constraint,
prevailing methods usually seek approximate solutions greedily but are less
effective once the noise level is high. The alternative (ell_{1}) optimization
has proven more powerful than (ell_{0}); however, its time consumption prevents
practical implementation. In this paper, we propose a new K-SVD framework
called K-SVD(_P) by applying the primal-dual active set (PDAS) algorithm to it.
Different from greedy-algorithm-based K-SVD, the K-SVD(_P) algorithm develops a
selection strategy motivated by the Karush-Kuhn-Tucker (KKT) conditions and
yields an efficient update in the sparse coding stage. Since the K-SVD(_P)
algorithm iteratively seeks an equivalent solution to the dual problem with a
simple explicit expression in this denoising problem, denoising speed and
quality can be achieved simultaneously. Experiments demonstrate that our
K-SVD(_P) achieves denoising performance comparable to state-of-the-art
methods.
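A schematic PDAS-style iteration for the (ell_{0})-penalized least-squares
subproblem is sketched below, assuming unit-norm dictionary columns; the
KKT-motivated threshold sqrt(2 * lambda) follows the PDAS literature and is an
illustration, not necessarily K-SVD(_P)'s exact rule:

```python
import numpy as np

def pdas_l0_sparse_code(D, y, lam, max_iter=20):
    """Schematic primal-dual active set iteration for
    min_x 0.5 * ||y - D x||^2 + lam * ||x||_0  (unit-norm columns assumed).

    A coordinate is active when |x_i + d_i| exceeds sqrt(2 * lam), where
    d = D^T (y - D x) is the dual variable; the primal is then refit by
    least squares on the active set.
    """
    n = D.shape[1]
    x = np.zeros(n)
    active = np.zeros(n, dtype=bool)
    for _ in range(max_iter):
        d = D.T @ (y - D @ x)
        new_active = np.abs(x + d) > np.sqrt(2.0 * lam)
        if np.array_equal(new_active, active):
            break  # active set stabilized: KKT conditions hold
        active = new_active
        x = np.zeros(n)
        if active.any():  # least-squares fit restricted to the active set
            x[active], *_ = np.linalg.lstsq(D[:, active], y, rcond=None)
    return x
```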
Towards More Efficient and Effective Inference: The Joint Decision of Multi-Participants
Hui Zhu , Zhulin An , Kaiqiang Xu , Xiaolong Hu , Yongjun Xu Subjects : Computer Vision and Pattern Recognition (cs.CV)
Existing approaches to improve the performances of convolutional neural
networks by optimizing the local architectures or deepening the networks tend
to increase the size of models significantly. In order to deploy and apply
neural networks to edge devices, which are in great demand, reducing the scale
of networks is quite crucial. However, compressing networks easily degrades
image processing performance. In this paper, we propose a
method that is suitable for edge devices while improving the efficiency and
effectiveness of inference. The joint decision of multiple participants, mainly
multiple layers and multiple networks, can achieve higher classification
accuracy (0.26% on CIFAR-10 and 4.49% on CIFAR-100 at most) with similar total
number of parameters for classical convolutional neural networks.
MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition
Comments: Under Review
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
To efficiently extract spatiotemporal features of video for action
recognition, most state-of-the-art methods integrate 1D temporal convolution
into a conventional 2D CNN backbone. However, they all exploit 1D temporal
convolution with a fixed kernel size (i.e., 3) in the network building block,
and thus have suboptimal temporal modeling capability to handle both long-term and
short-term actions. To address this problem, we first investigate the impacts
of different kernel sizes for the 1D temporal convolutional filters. Then, we
propose a simple yet efficient operation called Mixed Temporal Convolution
(MixTConv), which consists of multiple depthwise 1D convolutional filters with
different kernel sizes. By plugging MixTConv into the conventional 2D CNN
backbone ResNet-50, we further propose an efficient and effective network
architecture named MSTNet for action recognition, and achieve state-of-the-art
results on multiple benchmarks.
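The operation described, multiple depthwise 1D temporal convolutions with
different kernel sizes over channel groups, can be sketched as follows; the
equal channel split and the kernel set (1, 3, 5, 7) are assumptions based on
the abstract:

```python
import torch
import torch.nn as nn

class MixTConv(nn.Module):
    """Sketch: split channels into groups, apply depthwise 1D temporal convs
    with different kernel sizes, and concatenate the results."""

    def __init__(self, channels, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        splits = [channels // len(kernel_sizes)] * len(kernel_sizes)
        splits[0] += channels - sum(splits)  # absorb any remainder
        self.splits = splits
        self.convs = nn.ModuleList(
            nn.Conv1d(c, c, k, padding=k // 2, groups=c)  # depthwise, same length
            for c, k in zip(splits, kernel_sizes)
        )

    def forward(self, x):  # x: (batch, channels, time)
        chunks = torch.split(x, self.splits, dim=1)
        return torch.cat([conv(c) for conv, c in zip(self.convs, chunks)], dim=1)
```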
NETNet: Neighbor Erasing and Transferring Network for Better Single Shot Object Detection
Comments: 10 pages, 8 figures
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Due to the advantages of real-time detection and improved performance,
single-shot detectors have gained great attention recently. To handle
complex scale variations, single-shot detectors make scale-aware predictions
based on multiple pyramid layers. However, the features in the pyramid are not
scale-aware enough, which limits the detection performance. Two common problems
in single-shot detectors caused by object scale variations can be observed: (1)
small objects are easily missed; (2) the salient part of a large object is
sometimes detected as an object. With this observation, we propose a new
Neighbor Erasing and Transferring (NET) mechanism to reconfigure the pyramid
features and explore scale-aware features. In NET, a Neighbor Erasing Module
(NEM) is designed to erase the salient features of large objects and emphasize
the features of small objects in shallow layers. A Neighbor Transferring Module
(NTM) is introduced to transfer the erased features and highlight large objects
in deep layers. With this mechanism, a single-shot network called NETNet is
constructed for scale-aware object detection. In addition, we propose to
aggregate nearest neighboring pyramid features to enhance our NET. NETNet
achieves 38.5% AP at a speed of 27 FPS and 32.0% AP at a speed of 55 FPS on the
MS COCO dataset. As a result, NETNet achieves a better trade-off between
real-time and accurate object detection.
Comments: To appear in AAAI2020
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Temporal language grounding in untrimmed videos is a newly-raised task in
video understanding. Most of the existing methods suffer from inferior
efficiency, lack interpretability, and deviate from the human perception
mechanism. Inspired by humans' coarse-to-fine decision-making paradigm, we
formulate a novel Tree-Structured Policy based Progressive Reinforcement
Learning (TSP-PRL) framework to sequentially regulate the temporal boundary via
an iterative refinement process. Semantic concepts are explicitly
represented as branches in the policy, which contributes to efficiently
decomposing complex policies into interpretable primitive actions.
Progressive reinforcement learning provides correct credit assignment via two
task-oriented rewards that encourage mutual promotion within the
tree-structured policy. We extensively evaluate TSP-PRL on the Charades-STA and
ActivityNet datasets, and experimental results show that TSP-PRL achieves
competitive performance over existing state-of-the-art methods.
ENAS U-Net: Evolutionary Neural Architecture Search for Retinal Vessel Segmentation
Zhun Fan , Jiahong Wei , Guijie Zhu , Jiajie Mo , Wenji Li Subjects : Computer Vision and Pattern Recognition (cs.CV)
Accurate retinal vessel segmentation (RVS) is of great significance for
assisting doctors in the diagnosis of ophthalmic and other systemic
diseases, yet manually designing a valid neural network architecture for
retinal vessel segmentation requires high expertise and a large workload. In
order to further improve the performance of vessel segmentation and reduce the
workload of manually designing neural networks, we propose a specific search
space based on an encoder-decoder framework and apply neural architecture
search (NAS) to retinal vessel segmentation. The search space covers the
macro-architecture, involving operations on and adjustments to the entire
network topology. For the architecture optimization, we adopt a modified
evolution strategy that can search architectures with limited computing
resources. During the evolution, we select the elite architectures for the
next generation based on their performance. After the evolution, the
searched model is evaluated on three mainstream datasets, namely DRIVE, STARE,
and CHASE_DB1. The searched model achieves top performance on all three
datasets with fewer parameters (about 2.3M). Moreover, the results of
cross-training between the above three datasets show that the searched model
has considerable generalizability, which indicates its potential for clinical
disease diagnosis.
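The search loop admits a compact sketch, assuming user-supplied `evaluate`
(e.g. validation Dice score of a trained candidate) and `mutate` operators over
the architecture encoding; the population size, elite fraction, and budget are
placeholders, not the paper's settings:

```python
import random

def evolve_architectures(init_population, evaluate, mutate,
                         generations=20, elite_frac=0.25):
    """Minimal elitist evolution-strategy sketch for architecture search."""
    population = list(init_population)
    for _ in range(generations):
        scored = sorted(population, key=evaluate, reverse=True)
        elites = scored[: max(1, int(elite_frac * len(scored)))]
        # Refill the population by mutating randomly chosen elites.
        population = elites + [mutate(random.choice(elites))
                               for _ in range(len(population) - len(elites))]
    return max(population, key=evaluate)
```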
Min Li , Zhenglong Zhou , Zhe Wu , Boxin Shi , Changyu Diao , Ping Tan Subjects : Computer Vision and Pattern Recognition (cs.CV)
We present a method to capture both 3D shape and spatially varying
reflectance with a multi-view photometric stereo (MVPS) technique that works
for general isotropic materials. Our algorithm is suitable for perspective
cameras and nearby point light sources. Our data capture setup is simple,
consisting of only a digital camera, some LED lights, and an optional automatic
turntable. From a single viewpoint, we use a set of photometric stereo images
to identify surface points with the same distance to the camera. We collect
this information from multiple viewpoints and combine it with
structure-from-motion to obtain a precise reconstruction of the complete 3D
shape. The spatially varying isotropic bidirectional reflectance distribution
function (BRDF) is captured by simultaneously inferring a set of basis BRDFs
and their mixing weights at each surface point. In experiments, we demonstrate
our algorithm with two different setups: a studio setup for highest precision
and a desktop setup for best usability. According to our experiments, under the
studio setting, the captured shapes are accurate to 0.5 millimeters and the
captured reflectance has a relative root-mean-square error (RMSE) of 9%. We
also quantitatively evaluate state-of-the-art MVPS on a newly collected
benchmark dataset, which is publicly available for inspiring future research.
Text-to-Image Generation with Attention Based Recurrent Neural Networks
Tehseen Zia , Shahan Arif , Shakeeb Murtaza , Mirza Ahsan Ullah Subjects : Computer Vision and Pattern Recognition (cs.CV)
Conditional image modeling based on textual descriptions is a relatively new
domain in unsupervised learning. Previous approaches use a latent variable
model and generative adversarial networks. While the former are approximated
by using variational auto-encoders and rely on an intractable inference that
can hamper their performance, the latter are unstable to train due to their
Nash-equilibrium-based objective function. We develop a tractable and stable
caption-based image generation model. The model uses an attention-based encoder
to learn word-to-pixel dependencies. A conditional autoregressive decoder
is used for learning pixel-to-pixel dependencies and generating images.
Experiments are performed on the Microsoft COCO and MNIST-with-captions
datasets, and performance is evaluated using the Structural Similarity Index.
Results show that the proposed model performs better than contemporary
approaches and generates better-quality images. Keywords: Generative image
modeling, autoregressive image modeling, caption-based image generation, neural
attention, recurrent neural networks.
Stacked Adversarial Network for Zero-Shot Sketch based Image Retrieval
Comments: Accepted in WACV’2020
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
; Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Conventional approaches to Sketch-Based Image Retrieval (SBIR) assume that
the data of all the classes are available during training. The assumption may
not always be practical since the data of a few classes may be unavailable, or
the classes may not appear at the time of training. Zero-Shot Sketch-Based
Image Retrieval (ZS-SBIR) relaxes this constraint and allows the algorithm to
handle previously unseen classes at test time. This paper proposes a
generative approach based on the Stacked Adversarial Network (SAN) and the
advantage of Siamese Network (SN) for ZS-SBIR. While SAN generates a
high-quality sample, SN learns a better distance metric compared to that of the
nearest neighbor search. The capability of the generative model to synthesize
image features based on the sketch reduces the SBIR problem to that of an
image-to-image retrieval problem. We evaluate the efficacy of our proposed
approach on the TU-Berlin and Sketchy datasets in both the standard ZSL and
generalized ZSL settings. The proposed method yields a significant improvement
in standard ZSL as well as in a more challenging generalized ZSL setting (GZSL)
for SBIR.
Deep Metric Structured Learning For Facial Expression Recognition
Pedro D. Marrero Fernandez , Tsang Ing Ren , Tsang Ing Jyh , Fidel A. Guerrero Peña , Alexandre Cunha Subjects : Computer Vision and Pattern Recognition (cs.CV)
We propose a deep metric learning model to create embedded sub-spaces with a
well defined structure. A new loss function that imposes Gaussian structures on
the output space is introduced to create these sub-spaces thus shaping the
distribution of the data. Having a mixture of Gaussians solution space is
advantageous given its simplified and well-established structure. It allows
fast discovery of classes within classes and the identification of mean
representatives at the centroids of individual classes. We also propose a new
semi-supervised method to create sub-classes. We illustrate our methods on the
facial expression recognition problem and validate results on the FER+,
AffectNet, Extended Cohn-Kanade (CK+), BU-3DFE, and JAFFE datasets. We
experimentally demonstrate that the learned embedding can be successfully used
for various applications including expression retrieval and emotion
recognition.
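One way to impose the Gaussian structure described above is a centroid-based
negative log-likelihood that pulls each embedding toward its class mean; this
is a hedged illustration, not the paper's exact loss, and `centroids` and
`sigma` are assumptions:

```python
import torch

def gaussian_structure_loss(embeddings, labels, centroids, sigma=1.0):
    """Pull embeddings toward class centroids under an isotropic Gaussian
    likelihood, so classes form Gaussian clusters in the output space."""
    mu = centroids[labels]                        # (B, D) class means
    sq_dist = ((embeddings - mu) ** 2).sum(dim=1)
    # Negative log-likelihood of an isotropic Gaussian, up to a constant.
    return (sq_dist / (2.0 * sigma ** 2)).mean()
```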
A Foreground-background Parallel Compression with Residual Encoding for Surveillance Video
Lirong Wu , Kejie Huang , Haibin Shen , Lianli Gao Subjects : Computer Vision and Pattern Recognition (cs.CV)
Data storage has been one of the bottlenecks in surveillance systems. The
conventional video compression algorithms such as H.264 and H.265 do not fully
utilize the low information density characteristic of the surveillance video.
In this paper, we propose a video compression method that extracts and
compresses the foreground and background of the video separately. The
compression ratio is greatly improved by sharing background information among
multiple adjacent frames through an adaptive background updating and
interpolation module. Besides, we present two different schemes to compress the
foreground and compare their performance in the ablation study to show the
importance of temporal information for video compression. At the decoding end,
a coarse-to-fine two-stage module is applied to achieve the composition of the
foreground and background and the enhancements of frame quality. Furthermore,
an adaptive sampling method for surveillance cameras is proposed, and we have
shown its effects through software simulation. The experimental results show
that our proposed method requires 69.5% less bpp (bits per pixel) than the
conventional algorithm H.265 to achieve the same PSNR (36 dB) on the HEVC
dataset.
A GAN-based Tunable Image Compression System
Lirong Wu , Kejie Huang , Haibin Shen Subjects : Computer Vision and Pattern Recognition (cs.CV)
The method of importance map has been widely adopted in DNN-based lossy image
compression to achieve bit allocation according to the importance of image
contents. However, insufficient allocation of bits in non-important regions
often leads to severe distortion at low bpp (bits per pixel), which hampers the
development of efficient content-weighted image compression systems. This paper
rethinks content-based compression by using a Generative Adversarial Network
(GAN) to reconstruct the non-important regions. Moreover, multiscale pyramid
decomposition is applied to both the encoder and the discriminator to achieve
global compression of high-resolution images. A tunable compression scheme is
also proposed in this paper to compress an image to any specific compression
ratio without retraining the model. The experimental results show that our
proposed method improves MS-SSIM by more than 10.3% compared to the recently
reported GAN-based method to achieve the same low bpp (0.05) on the Kodak
dataset.
Harmonic Convolutional Networks based on Discrete Cosine Transform
Comments: arXiv admin note: substantial text overlap with arXiv:1812.03205
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
; Machine Learning (cs.LG)
Convolutional neural networks (CNNs) learn filters in order to capture local
correlation patterns in feature space. In this paper we propose to revert to
learning combinations of preset spectral filters by switching to CNNs with
harmonic blocks. We rely on Discrete Cosine Transform (DCT)
filters which have excellent energy compaction properties and are widely used
for image compression. The proposed harmonic blocks rely on DCT-modeling and
replace conventional convolutional layers to produce partially or fully
harmonic versions of new or existing CNN architectures. We demonstrate how the
harmonic networks can be efficiently compressed in a straightforward manner by
truncating high-frequency information in harmonic blocks which is possible due
to the redundancies in the spectral domain. We report extensive experimental
validation demonstrating the benefits of the introduction of harmonic blocks
into state-of-the-art CNN models in image classification, segmentation and edge
detection applications.
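A minimal sketch of a harmonic block, fixed DCT basis filters followed by a
learned 1x1 combination across channel-frequency responses, is given below;
normalization details and the exact compression-by-truncation scheme are
omitted:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def dct_filter_bank(k=3):
    """Build the k*k 2D DCT-II basis as fixed conv filters, shape (k*k, 1, k, k)."""
    basis = torch.zeros(k * k, 1, k, k)
    for u in range(k):
        for v in range(k):
            for x in range(k):
                for y in range(k):
                    basis[u * k + v, 0, x, y] = (
                        math.cos(math.pi * u * (2 * x + 1) / (2 * k))
                        * math.cos(math.pi * v * (2 * y + 1) / (2 * k))
                    )
    return basis

class HarmonicBlock(nn.Module):
    """Sketch: filter each channel with the fixed DCT basis, then learn a
    1x1 combination; replaces a conventional convolutional layer."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.register_buffer("dct", dct_filter_bank(k))
        self.k = k
        self.mix = nn.Conv2d(in_ch * k * k, out_ch, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        spec = F.conv2d(x.reshape(b * c, 1, h, w), self.dct, padding=self.k // 2)
        return self.mix(spec.reshape(b, c * self.k * self.k, h, w))
```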
Media Forensics and DeepFakes: an overview
Luisa Verdoliva Subjects : Computer Vision and Pattern Recognition (cs.CV)
With the rapid progress of recent years, techniques that generate and
manipulate multimedia content can now guarantee a very advanced level of
realism. The boundary between real and synthetic media has become very thin. On
the one hand, this opens the door to a series of exciting applications in
different fields such as creative arts, advertising, film production, and video
games. On the other hand, it poses enormous security threats. Software packages
freely available on the web allow any individual, without special skills, to
create very realistic fake images and videos. So-called deepfakes can be used
to manipulate public opinion during elections, commit fraud, discredit or
blackmail people. Potential abuses are limited only by human imagination.
Therefore, there is an urgent need for automated tools capable of detecting
false multimedia content and avoiding the spread of dangerous false
information. This review paper aims to present an analysis of the methods for
visual media integrity verification, that is, the detection of manipulated
images and videos. Special emphasis will be placed on the emerging phenomenon
of deepfakes and, from the point of view of the forensic analyst, on modern
data-driven forensic methods. The analysis will help to highlight the limits of
current forensic tools, the most relevant issues, the upcoming challenges, and
suggest future directions for research.
Adapting Grad-CAM for Embedding Networks
Comments: WACV 2020 camera ready
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
The gradient-weighted class activation mapping (Grad-CAM) method can
faithfully highlight important regions in images for deep model prediction in
image classification, image captioning and many other tasks. It uses the
gradients in back-propagation as weights (grad-weights) to explain network
decisions. However, applying Grad-CAM to embedding networks raises significant
challenges because embedding networks are trained by millions of dynamically
paired examples (e.g. triplets). To overcome these challenges, we propose an
adaptation of the Grad-CAM method for embedding networks. First, we aggregate
grad-weights from multiple training examples to improve the stability of
Grad-CAM. Then, we develop an efficient weight-transfer method to explain
decisions for any image without back-propagation. We extensively validate the
method on the standard CUB200 dataset, where our method produces more
accurate visual attention than the original Grad-CAM method. We also apply the
method to a house price estimation application using images. The method
produces convincing qualitative results, showcasing the practicality of our
approach.
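For reference, the vanilla Grad-CAM computation that the paper adapts can be
sketched as follows; the embedding-specific grad-weight aggregation and the
weight-transfer method proposed in the paper are omitted:

```python
import torch
import torch.nn.functional as F

def grad_cam(feature_maps, score):
    """Core Grad-CAM: weight each activation map by the global average of its
    gradient w.r.t. the score, sum over channels, and apply ReLU.

    feature_maps: (1, C, H, W) activations from a chosen conv layer
                  (obtained with a forward hook, requires_grad enabled).
    score:        scalar model output (a class logit, or a similarity score
                  in the embedding-network setting).
    """
    grads, = torch.autograd.grad(score, feature_maps, retain_graph=True)
    weights = grads.mean(dim=(2, 3), keepdim=True)         # the grad-weights
    cam = F.relu((weights * feature_maps).sum(dim=1, keepdim=True))
    return cam / (cam.max() + 1e-8)                        # normalize to [0, 1]
```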
Temporal Interlacing Network
Comments: Accepted to AAAI 2020. Winning entry of ICCV Multi-Moments in Time Challenge 2019. Code is available at this https URL
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
For a long time, the vision community has tried to learn spatio-temporal
representations by combining convolutional neural networks with various
temporal models, such as the families of Markov chains, optical flow, RNNs and
temporal convolution. However, these pipelines consume enormous computing
resources due to the alternating learning process for spatial and temporal
information. One natural question is whether we can embed the temporal
information into the spatial one so the information in the two domains can be
jointly learned once-only. In this work, we answer this question by presenting
a simple yet powerful operator — temporal interlacing network (TIN). Instead
of learning the temporal features, TIN fuses the two kinds of information by
interlacing spatial representations from the past to the future, and vice
versa. A differentiable interlacing target can be learned to control the
interlacing process. In this way, a heavy temporal model is replaced by a
simple interlacing operator. We theoretically prove that with a learnable
interlacing target, TIN performs equivalently to the regularized temporal
convolution network (r-TCN), but gains 4% more accuracy with 6x less latency on
6 challenging benchmarks. These results push the state-of-the-art performances
of video understanding by a considerable margin. Not surprisingly, the ensemble
model of the proposed TIN won the (1^{st}) place in the ICCV19 – Multi Moments
in Time challenge. Code is made available to facilitate further research at
this https URL
FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence
Kihyuk Sohn , David Berthelot , Chun-Liang Li , Zizhao Zhang , Nicholas Carlini , Ekin D. Cubuk , Alex Kurakin , Han Zhang , Colin Raffel Subjects : Machine Learning (cs.LG) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Semi-supervised learning (SSL) provides an effective means of leveraging
unlabeled data to improve a model’s performance. In this paper, we demonstrate
the power of a simple combination of two common SSL methods: consistency
regularization and pseudo-labeling. Our algorithm, FixMatch, first generates
pseudo-labels using the model’s predictions on weakly-augmented unlabeled
images. For a given image, the pseudo-label is only retained if the model
produces a high-confidence prediction. The model is then trained to predict the
pseudo-label when fed a strongly-augmented version of the same image. Despite
its simplicity, we show that FixMatch achieves state-of-the-art performance
across a variety of standard semi-supervised learning benchmarks, including
94.93% accuracy on CIFAR-10 with 250 labels and 88.61% accuracy with 40 — just
4 labels per class. Since FixMatch bears many similarities to existing SSL
methods that achieve worse performance, we carry out an extensive ablation
study to tease apart the experimental factors that are most important to
FixMatch’s success. We make our code available at this https URL.
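The unlabeled-data objective can be sketched directly from the description
above; the augmentation pipelines and supervised term are omitted, and the
0.95 cutoff stands in for the method's confidence threshold:

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, weak_batch, strong_batch, threshold=0.95):
    """Sketch of FixMatch's unlabeled loss: pseudo-label weakly augmented
    images, keep only confident predictions, and train the model to predict
    those labels on strongly augmented views of the same images."""
    with torch.no_grad():
        probs = torch.softmax(model(weak_batch), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= threshold).float()     # retain high-confidence only
    logits_s = model(strong_batch)
    loss = F.cross_entropy(logits_s, pseudo, reduction="none")
    return (mask * loss).mean()
```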
Generate High-Resolution Adversarial Samples by Identifying Effective Features
Sizhe Chen , Peidong Zhang , Chengjin Sun , Jia Cai , Xiaolin Huang Subjects : Machine Learning (cs.LG) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
With the prevalence of deep learning in computer vision, adversarial samples
that fool neural networks have emerged in large numbers, revealing their
deep-rooted defects. Most adversarial attacks calculate an imperceptible
perturbation in image space to fool the DNNs. In this strategy, the
perturbation looks like noise and thus can be mitigated. Attacks in feature
space produce semantic perturbations, but they can only deal with
low-resolution samples. The reason lies in the great number of coupled features
required to express a high-resolution image. In this paper, we propose Attack
by Identifying Effective Features (AIEF), which learns different weights for
features to attack. Effective features, i.e., those with large weights,
strongly influence the victim model while distorting the image little, and thus
are more effective for the attack. By concentrating the attack on them, AIEF
produces high-resolution adversarial samples with acceptable distortions. We
demonstrate the effectiveness of AIEF by attacking different tasks with
different generative models.
batchboost: regularization for stabilizing training with resistance to underfitting & overfitting
Comments: 6 pages; 5 figures
Subjects:
Machine Learning (cs.LG)
; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Overfitting & underfitting and stable training are important challenges in
machine learning. Current approaches for these issues are mixup, SamplePairing
and BC learning. In our work, we state the hypothesis that mixing many images
together can be more effective than just two. The batchboost pipeline has three
stages: (a) pairing: a method of selecting two samples; (b) mixing: how to
create a new sample from two; (c) feeding: combining mixed samples with new
ones from the dataset into the batch (with ratio (gamma)). Note that a sample
that appears in our batch propagates through subsequent iterations with less
and less importance until the end of training. The pairing stage calculates the
error per sample, sorts the samples, and pairs them with the strategy of
hardest with easiest; the mixing stage then merges two samples using mixup,
(lambda x_1 + (1-lambda) x_2). Finally, the feeding stage combines new samples
with mixed ones at a ratio of 1:1. Batchboost has 0.5-3%
better accuracy than the current state-of-the-art mixup regularization on
CIFAR-10 & Fashion-MNIST. Our method is slightly better than the SamplePairing
technique on small datasets (up to 5%). Batchboost provides stable training
with untuned parameters (like weight decay), thus it is a good method for
testing the performance of different architectures. Source code is at:
this https URL
Cut-Based Graph Learning Networks to Discover Compositional Structure of Sequential Video Data
Comments: 8 pages, 3 figures, Association for the Advancement of Artificial Intelligence (AAAI2020). arXiv admin note: substantial text overlap with arXiv:1907.01709
Subjects:
Machine Learning (cs.LG)
; Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
Conventional sequential learning methods such as Recurrent Neural Networks
(RNNs) focus on interactions between consecutive inputs, i.e. first-order
Markovian dependency. However, most sequential data, such as videos,
have complex dependency structures that imply variable-length semantic flows
and their compositions, which are hard to capture with conventional
methods. Here, we propose Cut-Based Graph Learning Networks (CB-GLNs) for
learning video data by discovering these complex structures of the video. The
CB-GLNs represent video data as a graph, with nodes and edges corresponding to
frames of the video and their dependencies respectively. The CB-GLNs find
compositional dependencies of the data in multilevel graph forms via a
parameterized kernel with graph-cut and a message passing framework. We
evaluate the proposed method on two different tasks for video
understanding: video theme classification (YouTube-8M dataset) and video
question answering (TVQA dataset). The experimental results show that our
model efficiently learns the semantic compositional structure of video data.
Furthermore, our model achieves the highest performance in comparison to other
baseline methods.
Yadong Zhang , Xin Chen Subjects : Machine Learning (cs.LG) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Time series motifs play an important role in time series analysis.
Motif-based time series clustering is used for the discovery of higher-order
patterns or structures in time series data. Inspired by the convolutional
neural network (CNN) classifier based on image representations of time
series, the motif difference field (MDF) is proposed. Compared to other image
representations of time series, MDF is simple and easy to construct. With a
Fully Convolutional Network (FCN) as the classifier, MDF demonstrates
superior performance on the UCR time series dataset in benchmarks against other
time series classification methods. It is interesting to find that the triadic
time series motifs give the best result in the test. Due to the motif
clustering reflected in MDF, the significant motifs are detected with the help
of the Gradient-weighted Class Activation Mapping (Grad-CAM). The areas in MDF
with high weight in Grad-CAM have a high contribution from the significant
motifs with the desired ordinal patterns associated with the signature patterns
in time series. However, the signature patterns cannot be identified with the
neural network classifiers directly based on the time series.
Comments: 7 pages, 13 figures, IEEE International Conference on Robotics and Automation (ICRA) 2019
Subjects:
Robotics (cs.RO)
; Computer Vision and Pattern Recognition (cs.CV)
We present joint learning of instance and semantic segmentation for visible
and occluded region masks. Sharing the feature extractor with instance
occlusion segmentation, we introduce semantic occlusion segmentation into the
instance segmentation model. This joint learning fuses the instance- and
image-level reasoning of the mask prediction on the different segmentation
tasks, which was missing in previous work that learned instance segmentation
only (instance-only). In the experiments, we evaluated the proposed joint
learning against instance-only learning on the test dataset. We also
applied the joint learning model to two different types of robotic pick-and-place
tasks (random and target picking) and evaluated its effectiveness to achieve
real-world robotic tasks.
Breast lesion segmentation in ultrasound images with limited annotated data
Comments: Accepted to ISBI 2020
Subjects:
Image and Video Processing (eess.IV)
; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Ultrasound (US) is one of the most commonly used imaging modalities in both
diagnosis and surgical interventions due to its low cost, safety, and
non-invasive characteristics. US image segmentation is currently a unique
challenge because of the presence of speckle noise. As manual segmentation
requires considerable effort and time, the development of automatic
segmentation algorithms has attracted researchers' attention. Although recent
methodologies based on convolutional neural networks have shown promising
performances, their success relies on the availability of a large number of
training data, which is prohibitively difficult for many applications.
Therefore, in this study we propose the use of simulated US images and natural
images as auxiliary datasets in order to pre-train our segmentation network,
and then to fine-tune with limited in vivo data. We show that with as little as
19 in vivo images, fine-tuning the pre-trained network improves the dice score
by 21% compared to training from scratch. We also demonstrate that if the same
number of natural and simulation US images is available, pre-training on
simulation data is preferable.
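A minimal sketch of the pre-train/fine-tune recipe described above, assuming a
segmentation model with an `encoder` attribute (a hypothetical name) and a
small in vivo data loader, might look like:

```python
import torch

def finetune(model, in_vivo_loader, loss_fn, lr=1e-4,
             freeze_encoder=True, epochs=50):
    """Fine-tune a network pre-trained on simulated US (or natural) images
    using the few available in vivo images."""
    if freeze_encoder:
        for p in model.encoder.parameters():
            p.requires_grad = False
    opt = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    for _ in range(epochs):
        for img, mask in in_vivo_loader:
            opt.zero_grad()
            loss = loss_fn(model(img), mask)
            loss.backward()
            opt.step()
```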
Comments: 19 pages, 5 figures, 2 tables
Subjects:
Image and Video Processing (eess.IV)
; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph); Quantitative Methods (q-bio.QM)
Histological staining is a vital step used to diagnose various diseases and
has been used for more than a century to provide contrast to tissue sections,
rendering the tissue constituents visible for microscopic analysis by medical
experts. However, this process is time-consuming, labor-intensive, expensive
and destructive to the specimen. Recently, the ability to virtually-stain
unlabeled tissue sections, entirely avoiding the histochemical staining step,
has been demonstrated using tissue-stain specific deep neural networks. Here,
we present a new deep learning-based framework which generates
virtually-stained images using label-free tissue, where different stains are
merged following a micro-structure map defined by the user. This approach uses
a single deep neural network that receives two different sources of information
at its input: (1) autofluorescence images of the label-free tissue sample, and
(2) a digital staining matrix which represents the desired microscopic map of
different stains to be virtually generated at the same tissue section. This
digital staining matrix is also used to virtually blend existing stains,
digitally synthesizing new histological stains. We trained and blindly tested
this virtual-staining network using unlabeled kidney tissue sections to
generate micro-structured combinations of Hematoxylin and Eosin (H&E), Jones
silver stain, and Masson’s Trichrome stain. Using a single network, this
approach multiplexes virtual staining of label-free tissue with multiple types
of stains and paves the way for synthesizing new digital histological stains
that can be created on the same tissue cross-section, which is currently not
feasible with standard histochemical staining methods.
Recommending Themes for Ad Creative Design via Visual-Linguistic Representations
Comments: 7 pages, 8 figures, 2 tables, accepted by The Web Conference 2020
Subjects:
Computation and Language (cs.CL)
; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
There is a perennial need in the online advertising industry to refresh ad
creatives, i.e., images and text used for enticing online users towards a
brand. Such refreshes are required to reduce the likelihood of ad fatigue among
online users, and to incorporate insights from other successful campaigns in
related product categories. Given a brand, coming up with themes for a new ad
is a painstaking and time-consuming process for creative strategists.
Strategists typically draw inspiration from the images and text used for past
ad campaigns, as well as world knowledge on the brands. To automatically infer
ad themes via such multimodal sources of information in past ad campaigns, we
propose a theme (keyphrase) recommender system for ad creative strategists. The
theme recommender is based on aggregating results from a visual question
answering (VQA) task, which ingests the following: (i) ad images, (ii) text
associated with the ads as well as Wikipedia pages on the brands in the ads,
and (iii) questions around the ad. We leverage transformer based cross-modality
encoders to train visual-linguistic representations for our VQA task. We study
two formulations for the VQA task along the lines of classification and
ranking; via experiments on a public dataset, we show that cross-modal
representations lead to significantly better classification accuracy and
ranking precision-recall metrics. Cross-modal representations show better
performance compared to separate image and text representations. In addition,
the use of multimodal information shows a significant lift over using only
textual or visual information.
Learning Deformable Registration of Medical Images with Anatomical Constraints
Comments: Accepted for publication in Neural Networks (Elsevier). Source code and resulting segmentation masks for the NIH Chest-XRay14 dataset with estimated quality index available at this https URL
Subjects:
Image and Video Processing (eess.IV)
; Computer Vision and Pattern Recognition (cs.CV)
Deformable image registration is a fundamental problem in the field of
medical image analysis. During the last years, we have witnessed the advent of
deep learning-based image registration methods which achieve state-of-the-art
performance, and drastically reduce the required computational time. However,
little work has been done regarding how we can encourage our models to produce
not only accurate, but also anatomically plausible results, which is still an
open question in the field. In this work, we argue that incorporating
anatomical priors in the form of global constraints into the learning process
of these models, will further improve their performance and boost the realism
of the warped images after registration. We learn global non-linear
representations of image anatomy using segmentation masks, and employ them to
constrain the registration process. The proposed AC-RegNet architecture is
evaluated in the context of chest X-ray image registration using three
different datasets, where the high anatomical variability makes the task
extremely challenging. Our experiments show that the proposed anatomically
constrained registration model produces more realistic and accurate results
than state-of-the-art methods, demonstrating the potential of this approach.
A deep network for sinogram and CT image reconstruction
Wei Wang , Xiang-Gen Xia , Chuanjiang He , Zemin Ren , Jian Lu , Tianfu Wang , Baiying Lei Subjects : Image and Video Processing (eess.IV) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
A CT image can be well reconstructed when the sampling rate of the sinogram
satisfies the Nyquist criteria and the sampled signal is noise-free. However,
in practice, the sinogram is usually contaminated by noise, which degrades the
quality of a reconstructed CT image. In this paper, we design a deep network
for sinogram and CT image reconstruction. The network consists of two cascaded
blocks that are linked by a filtered backprojection (FBP) layer, where the
former block is responsible for denoising and completing the sinograms while
the latter removes the noise and artifacts from the CT images.
Experimental results show that the CT images reconstructed by our method have
the highest PSNR and SSIM on average compared to state-of-the-art methods.
Deep Image Clustering with Tensor Kernels and Unsupervised Companion Objectives
Comments: Submitted to IEEE Transactions on Neural Networks and Learning Systems
Subjects:
Machine Learning (stat.ML)
; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
In this paper we develop a new model for deep image clustering, using
convolutional neural networks and tensor kernels. The proposed Deep Tensor
Kernel Clustering (DTKC) consists of a convolutional neural network (CNN),
which is trained to reflect a common cluster structure at the output of its
intermediate layers. Encouraging a consistent cluster structure throughout the
network has the potential to guide it towards meaningful clusters, even though
these clusters might appear to be nonlinear in the input space. The cluster
structure is enforced through the idea of unsupervised companion objectives,
where separate loss functions are attached to layers in the network. These
unsupervised companion objectives are constructed based on a proposed
generalization of the Cauchy-Schwarz (CS) divergence, from vectors to tensors
of arbitrary rank. Generalizing the CS divergence to tensor-valued data is a
crucial step, due to the tensorial nature of the intermediate representations
in the CNN. Several experiments are conducted to thoroughly assess the
performance of the proposed DTKC model. The results indicate that the model
outperforms, or performs comparably to, a wide range of baseline algorithms. We
also empirically demonstrate that our model does not suffer from objective
function mismatch, which can be a problematic artifact in autoencoder-based
clustering models.
An Efficient Framework for Automated Screening of Clinically Significant Macular Edema
Renoh Johnson Chalakkal , Faizal Hafiz , Waleed Abdulla , Akshya Swain Subjects : Image and Video Processing (eess.IV) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
The present study proposes a new approach to automated screening of
Clinically Significant Macular Edema (CSME) and addresses two major challenges
associated with such screenings, i.e., exudate segmentation and imbalanced
datasets. The proposed approach replaces the conventional exudate segmentation
based feature extraction by combining a pre-trained deep neural network with
meta-heuristic feature selection. A feature-space over-sampling technique is
used to overcome the effects of skewed datasets, and the screening is
accomplished by a k-NN based classifier. The role of each data-processing step
(e.g., class balancing, feature selection) and the effects of limiting the
region-of-interest to fovea on the classification performance are critically
analyzed. Finally, the selection and implication of operating point on Receiver
Operating Characteristic curve are discussed. The results of this study
convincingly demonstrate that by following these fundamental practices of
machine learning, a basic k-NN based classifier could effectively accomplish
the CSME screening.
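An illustrative sketch of the screening pipeline: assume `X` holds deep
features already extracted by a pre-trained CNN and reduced by the
meta-heuristic selector; SMOTE is a common feature-space over-sampler and
stands in for the paper's exact choice, and the data here are random stand-ins:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))           # stand-in for selected deep features
y = np.array([0] * 170 + [1] * 30)       # skewed classes, as in screening data

# Balance the classes in feature space, then classify with k-NN.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_bal, y_bal)
print(clf.predict(X[:5]))
```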
Towards Augmented Reality-based Suturing in Monocular Laparoscopic Training
Comments: Accepted for SPIE Medical Imaging 2020
Subjects:
Image and Video Processing (eess.IV)
; Computer Vision and Pattern Recognition (cs.CV)
Minimally Invasive Surgery (MIS) techniques have gained rapid popularity
among surgeons since they offer significant clinical benefits including reduced
recovery time and diminished post-operative adverse effects. However,
conventional endoscopic systems output monocular video which compromises depth
perception, spatial orientation and field of view. Suturing is one of the most
complex tasks performed under these circumstances. Key components of this task
are the interplay between the needle holder and the surgical needle. Reliable 3D
localization of needle and instruments in real time could be used to augment
the scene with additional parameters that describe their quantitative geometric
relation, e.g. the relation between the estimated needle plane and its rotation
center and the instrument. This could contribute towards standardization and
training of basic skills and operative techniques, enhance overall surgical
performance, and reduce the risk of complications. The paper proposes an
Augmented Reality environment with quantitative and qualitative visual
representations to enhance laparoscopic training outcomes performed on a
silicone pad. This is enabled by a multi-task supervised deep neural network
which performs multi-class segmentation and depth map prediction. Scarcity of
labels has been overcome by creating a virtual environment that resembles the
surgical training scenario to generate dense depth maps and segmentation maps.
The proposed convolutional neural network was tested on real surgical training
scenarios and shown to be robust to occlusion of the needle. The network
achieves a dice score of 0.67 for surgical needle segmentation, 0.81 for needle
holder instrument segmentation and a mean absolute error of 6.5 mm for depth
estimation.
Gradient Surgery for Multi-Task Learning
Tianhe Yu , Saurabh Kumar , Abhishek Gupta , Sergey Levine , Karol Hausman , Chelsea Finn Subjects : Machine Learning (cs.LG) ; Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Machine Learning (stat.ML)
While deep learning and deep reinforcement learning (RL) systems have
demonstrated impressive results in domains such as image classification, game
playing, and robotic control, data efficiency remains a major challenge.
Multi-task learning has emerged as a promising approach for sharing structure
across multiple tasks to enable more efficient learning. However, the
multi-task setting presents a number of optimization challenges, making it
difficult to realize large efficiency gains compared to learning tasks
independently. The reasons why multi-task learning is so challenging compared
to single-task learning are not fully understood. In this work, we identify a
set of three conditions of the multi-task optimization landscape that cause
detrimental gradient interference, and develop a simple yet general approach
for avoiding such interference between task gradients. We propose a form of
gradient surgery that projects a task’s gradient onto the normal plane of the
gradient of any other task that has a conflicting gradient. On a series of
challenging multi-task supervised and multi-task RL problems, this approach
leads to substantial gains in efficiency and performance. Further, it is
model-agnostic and can be combined with previously-proposed multi-task
architectures for enhanced performance.
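The projection step described above admits a compact sketch; the order in
which conflicting tasks are visited is an implementation detail not fixed
here:

```python
import torch

def gradient_surgery(grads):
    """Whenever two task gradients conflict (negative dot product), project
    one onto the normal plane of the other. `grads` is a list of flattened
    per-task gradient vectors; returns the surgically altered gradients."""
    out = [g.clone() for g in grads]
    for i, gi in enumerate(out):
        for j, gj in enumerate(grads):
            if i == j:
                continue
            dot = torch.dot(gi, gj)
            if dot < 0:  # conflicting: remove the component along g_j
                gi -= dot / gj.norm() ** 2 * gj
    return out
```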
Journal-ref: IEEE Transactions on Robotics ( Volume: 35 , Issue: 4 , Aug. 2019
)
Subjects:
Robotics (cs.RO)
; Computer Vision and Pattern Recognition (cs.CV)
In this work, we introduce the problem of cross-modal visuo-tactile object
recognition with robotic active exploration. With this term, we mean that the
robot observes a set of objects with visual perception and, later on, it is
able to recognize such objects only with tactile exploration, without having
touched any object before. Using machine learning terminology, in our
application we have a visual training set and a tactile test set, or vice
versa. To tackle this problem, we propose an approach consisting of four
steps: finding a visuo-tactile common representation, defining a suitable set
of features, transferring the features across the domains, and classifying the
objects. We show the results of our approach using a set of 15 objects,
collecting 40 visual examples and five tactile examples for each object. The
proposed approach achieves an accuracy of 94.7%, which is comparable with the
accuracy of the monomodal case, i.e., when using visual data both as training
set and test set. Moreover, it performs well compared to human ability,
which we roughly estimated by carrying out an experiment with ten
participants.
OIAD: One-for-all Image Anomaly Detection with Disentanglement Learning
Shuo Wang , Tianle Chen , Shangyu Chen , Carsten Rudolph , Surya Nepal , Marthie Grobler Subjects : Machine Learning (cs.LG) ; Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB); Machine Learning (stat.ML)
Anomaly detection aims to recognize samples with anomalous and unusual
patterns with respect to a set of normal data, which is significant for
numerous domain applications, e.g. in industrial inspection, medical imaging,
and security enforcement. There are two key research challenges associated
with existing anomaly detection approaches: (1) many of them perform well on
low-dimensional problems, but their performance on high-dimensional instances,
such as images, is limited; (2) many of them still rely on traditional
supervised approaches and manual feature engineering, while the topic has not
yet been fully explored with modern deep learning approaches, even when
well-labeled samples are limited. In this paper, we propose a
One-for-all Image Anomaly Detection system (OIAD) based on disentangled
learning using only clean samples. Our key insight is that the impact of small
perturbation on the latent representation can be bounded for normal samples
while anomaly images are usually outside such bounded intervals, called
structure consistency. We implement this idea and evaluate its performance for
anomaly detection. Our experiments with three datasets show that OIAD can
detect over 90% of anomalies while maintaining a low false alarm rate. It can
also detect suspicious samples among samples labeled as clean, coinciding with
what humans would deem unusual.
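The structure-consistency test, as we read it, admits a short sketch (all
names are hypothetical; `encoder` stands for a disentangled encoder trained on
clean samples only): perturb an input slightly, measure the shift of its
latent representation, and flag samples whose shift exceeds a bound calibrated
on normal data.

    import numpy as np

    def latent_shift(encoder, x, sigma=0.01, n_trials=8):
        # Average latent displacement under small input perturbations.
        z = encoder(x)
        shifts = [np.linalg.norm(encoder(x + sigma * np.random.randn(*x.shape)) - z)
                  for _ in range(n_trials)]
        return float(np.mean(shifts))

    def is_anomalous(encoder, x, bound):
        # Flag x when its latent shift leaves the interval calibrated on
        # clean samples -- the structure-consistency idea sketched above.
        return latent_shift(encoder, x) > bound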
Accelerating the Registration of Image Sequences by Spatio-temporal Multilevel Strategies
Comments: Accepted at ISBI 2020
Subjects:
Signal Processing (eess.SP)
; Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Multilevel strategies are an integral part of many image registration
algorithms. These strategies are very well-known for avoiding undesirable local
minima, providing an outstanding initial guess, and reducing overall
computation time. State-of-the-art multilevel strategies build a hierarchy of
discretization in the spatial dimensions. In this paper, we present a
spatio-temporal strategy, where we introduce a hierarchical discretization in
the temporal dimension at each spatial level. This strategy is suitable for a
motion estimation problem where the motion is assumed smooth over time. Our
strategy exploits the temporal smoothness among image frames by following a
predictor-corrector approach. The strategy predicts the motion by a novel
interpolation method and later corrects it by registration. The prediction
step provides a good initial guess for the correction step and hence reduces
the overall computational time for registration. On three examined optical
coherence tomography datasets, we achieve an average speed-up factor of 2.5
over state-of-the-art multilevel methods.
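Schematically, the predictor-corrector idea could look as follows (a sketch
under our own assumptions; `register(fixed, moving, init)` stands for any
spatial multilevel registration routine returning a motion field):

    import numpy as np

    def spatio_temporal_register(frames, register, step=4):
        # Coarse temporal level: register a sparse set of key frames
        # from scratch. Finer temporal level: predict in-between motions
        # by temporal interpolation, then correct them by registration.
        n = len(frames)
        motions = [None] * n
        keys = sorted(set(range(0, n, step)) | {n - 1})
        for t in keys:
            motions[t] = register(frames[0], frames[t], init=None)
        for t in range(n):
            if motions[t] is None:
                lo = max(k for k in keys if k < t)
                hi = min(k for k in keys if k > t)
                w = (t - lo) / (hi - lo)
                predicted = (1 - w) * motions[lo] + w * motions[hi]
                # The warm-started correction step is much cheaper than
                # registering from scratch.
                motions[t] = register(frames[0], frames[t], init=predicted)
        return motions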
CHAOS Challenge — Combined (CT-MR) Healthy Abdominal Organ Segmentation
Comments: 10 pages, 2 figures
Subjects:
Image and Video Processing (eess.IV)
; Computer Vision and Pattern Recognition (cs.CV)
Segmentation of abdominal organs has been a comprehensive, yet unresolved,
research field for many years. In the last decade, intensive developments in
deep learning (DL) have introduced new state-of-the-art segmentation systems.
Although these systems outperform existing ones in overall accuracy, the
effects of DL model properties and parameters on performance are hard to
interpret. This makes comparative analysis a necessary tool for achieving
explainable studies and systems. Moreover, the performance of DL on emerging
learning approaches such as cross-modality and multi-modal tasks has rarely
been discussed. In
order to expand the knowledge in these topics, the CHAOS — Combined (CT-MR)
Healthy Abdominal Organ Segmentation challenge was organized at the IEEE
International Symposium on Biomedical Imaging (ISBI) 2019 in Venice, Italy.
Unlike the large number of previous abdomen-related challenges, the majority
of which focus on tumor/lesion detection and/or classification with a single
modality, CHAOS provides both abdominal CT and MR data from healthy
subjects. Five different and complementary tasks have been designed to analyze
the capabilities of the current approaches from multiple perspectives. The
results are investigated thoroughly, compared with manual annotations and
interactive methods. The outcomes are reported in detail to reflect the latest
advancements in the field. CHAOS challenge and data will be available online to
provide a continuous benchmark resource for segmentation.
Artificial Intelligence
Stochastic Finite State Control of POMDPs with LTL Specifications
Mohamadreza Ahmadi , Rangoli Sharan , Joel W. Burdick Subjects : Artificial Intelligence (cs.AI) ; Formal Languages and Automata Theory (cs.FL); Robotics (cs.RO); Systems and Control (eess.SY); Optimization and Control (math.OC)
Partially observable Markov decision processes (POMDPs) provide a modeling
framework for autonomous decision making under uncertainty and imperfect
sensing, e.g. robot manipulation and self-driving cars. However, optimal
control of POMDPs is notoriously intractable. This paper considers the
quantitative problem of synthesizing sub-optimal stochastic finite state
controllers (sFSCs) for POMDPs such that the probability of satisfying a set of
high-level specifications in terms of linear temporal logic (LTL) formulae is
maximized. We begin by casting the latter problem as an optimization problem
and using relaxations based on the Poisson equation and McCormick envelopes.
Then, we propose a stochastic bounded policy iteration algorithm, leading to a
controlled growth in sFSC size and an anytime algorithm, where the performance
of the controller improves with successive iterations, but can be stopped by
the user based on time or memory considerations. We illustrate the proposed
method by a robot navigation case study.
Adequate and fair explanations
Nicholas Asher , Soumya Paul , Chris Russell Subjects : Artificial Intelligence (cs.AI)
Explaining sophisticated machine-learning based systems is an important issue
at the foundations of AI. Recent efforts have shown various methods for
providing explanations. These approaches can be broadly divided into two
schools: those that provide a local and human-interpretable approximation of a
machine learning algorithm, and logical approaches that exactly characterise
one aspect of the decision. In this paper we focus upon the second school of
exact explanations with a rigorous logical foundation. There is an
epistemological problem with these exact methods. While they can furnish
complete explanations, such explanations may be too complex for humans to
understand or even to write down in human readable form. Interpretability
requires epistemically accessible explanations, explanations humans can grasp.
Yet what is a sufficiently complete epistemically accessible explanation still
needs clarification. We do this here in terms of counterfactuals, following
[Wachter et al., 2017]. With counterfactual explanations, many of the
assumptions needed to provide a complete explanation are left implicit. To do
so, counterfactual explanations exploit the properties of a particular data
point or sample, and as such are also local as well as partial explanations. We
explore how to move from local partial explanations to what we call complete
local explanations and then to global ones. But to preserve accessibility we
argue for the need for partiality. This partiality makes it possible to hide
explicit biases present in the algorithm that may be injurious or unfair. We
investigate how easy it is to uncover these biases when providing complete and
fair explanations by exploiting the structure of the set of counterfactuals
providing a complete local explanation.
Implementations in Machine Ethics: A Survey
Comments: 37 pages, 7 tables, 4 figures, draft version
Subjects:
Artificial Intelligence (cs.AI)
Increasingly complex and autonomous systems require machine ethics to
maximize the benefits and minimize the risks to society arising from the new
technology. It is challenging to decide which type of ethical theory to employ
and how to implement it effectively. This survey provides a threefold
contribution. Firstly, it introduces a taxonomy to analyze the field of machine
ethics from an ethical, implementational, and technical perspective. Secondly,
an exhaustive selection and description of relevant works is presented.
Thirdly, applying the new taxonomy to the selected works, dominant research
patterns and lessons for the field are identified, and future directions for
research are suggested.
Channels' Confirmation and Predictions' Confirmation: from the Medical Test to the Raven Paradox
Comments: 12 tables, 7 figures
Subjects:
Artificial Intelligence (cs.AI)
; Information Theory (cs.IT); Logic (math.LO)
After long arguments between positivism and falsificationism, the
verification of universal hypotheses was replaced with the confirmation of
uncertain major premises. Unfortunately, Hempel discovered the Raven Paradox
(RP). Then, Carnap used the logical probability increment as the confirmation
measure. Many confirmation measures have since been proposed. Among them,
measure F, proposed by Kemeny and Oppenheim, possesses the symmetries and
asymmetries proposed by Eells and Fitelson, the monotonicity proposed by Greco
et al., and the normalizing property suggested by many researchers. Based on
the semantic
information theory, a measure b* similar to F is derived from the medical test.
Like the likelihood ratio, b* and F can only indicate the quality of channels
or of the testing means, rather than the quality of probability predictions.
Moreover, it is still not easy to use b*, F, or any other measure to clarify
the RP. For this reason, a measure c*, similar to the correct rate, is
derived. The c* has the simple form (a-c)/max(a, c); it supports the Nicod
Criterion and undermines the Equivalence Condition, and hence can be used to
eliminate the RP. Some examples are provided to show why it is difficult to
use any of the popular confirmation measures to eliminate the RP. Measures F,
b*, and c* indicate that the absence of counterexamples is more essential than
the existence of more positive examples, and hence are compatible with
Popper’s falsification thought.
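Since c* is given in closed form, it can be transcribed directly; in the
sketch below, a and c count positive examples and counterexamples
respectively, which is our reading of the notation.

    def c_star(a, c):
        # Confirmation measure c* = (a - c) / max(a, c). A single
        # counterexample outweighs many positive examples, in line
        # with the falsificationist reading above.
        if max(a, c) == 0:
            return 0.0  # no evidence either way (our convention)
        return (a - c) / max(a, c)

    print(c_star(100, 0))   # 1.0
    print(c_star(100, 10))  # 0.9
    print(c_star(10, 100))  # -0.9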
AI Trust in business processes: The need for process-aware explanations
Steve T.K. Jan , Vatche Ishakian , Vinod Muthusamy Subjects : Artificial Intelligence (cs.AI)
Business processes underpin a large number of enterprise operations including
processing loan applications, managing invoices, and insurance claims. There is
a large opportunity for infusing AI to reduce cost or provide better customer
experience, and the business process management (BPM) literature is rich in
machine learning solutions, including unsupervised learning to gain insights
on clusters of process traces, classification models to predict the outcomes,
duration, or paths of partial process traces, models for extracting business
processes from documents, and models to recommend how to optimize a business
process or
navigate decision points. More recently, deep learning models including those
from the NLP domain have been applied to process predictions.
Unfortunately, very few of these innovations have been applied and adopted
by enterprise companies. We assert that a large reason for the lack of adoption
of AI models in BPM is that business users are risk-averse and do not
implicitly trust AI models. There has, unfortunately, been little attention
paid to explaining model predictions to business users with process context. We
challenge the BPM community to build on the AI interpretability literature, and
the AI Trust community to understand
Nicolas Aussel (INF, ACMES-SAMOVAR, IP Paris), Sophie Chabridon (IP Paris, INF, ACMES-SAMOVAR), Yohan Petetin (TIPIC-SAMOVAR, CITI, IP Paris) Subjects : Artificial Intelligence (cs.AI) ; Distributed, Parallel, and Cluster Computing (cs.DC)
Machine Learning has proven useful in recent years as a way to achieve
failure prediction for industrial systems. However, the high computational
resources necessary to run learning algorithms are an obstacle to its
widespread application. The sub-field of Distributed Learning offers a solution
to this problem by enabling the use of remote resources but at the expense of
introducing communication costs in the application that are not always
acceptable. In this paper, we propose a distributed learning approach able to
optimize the use of computational and communication resources to achieve
excellent learning model performance through a centralized architecture. To
achieve this, we present a new centralized distributed learning algorithm that
relies on the learning paradigms of Active Learning and Federated Learning to
provide a communication-efficient method with guarantees of model precision on
both the clients and the central server. We evaluate this method on a public
benchmark and show that its precision is very close to the state-of-the-art
performance of non-distributed learning despite the additional constraints.
A multi-agent ontologies-based clinical decision support system
Comments: in French
Journal-ref: AMINA’2012, Jan 2012, Mahdia, Tunisie
Subjects:
Artificial Intelligence (cs.AI)
Clinical decision support systems combine knowledge and data from a variety
of sources, represented either by quantitative models based on stochastic
methods or by qualitative models based rather on expert heuristics and
deductive reasoning. At the
same time, case-based reasoning (CBR) memorizes and returns the experience of
solving similar problems. The cooperation of heterogeneous clinical knowledge
bases (knowledge objects, semantic distances, evaluation functions, logical
rules, databases…) is based on medical ontologies. A multi-agent decision
support system (MADSS) enables the integration and cooperation of agents
specialized in different fields of knowledge (semiology, pharmacology, clinical
cases, etc.). Each specialist agent operates a knowledge base defining the
conduct to be maintained in conformity with the state of the art associated
with an ontological basis that expresses the semantic relationships between the
terms of the domain in question. Our approach is based on the specialization of
agents adapted to the knowledge models used during the clinical steps and
ontologies. This modular approach is suitable for the realization of MADSS in
many areas.
On Algorithmic Decision Procedures in Emergency Response Systems in Smart and Connected Communities
Comments: Accepted at AAMAS 2020 (International Conference on Autonomous Agents and Multiagent Systems)
Subjects:
Artificial Intelligence (cs.AI)
Emergency Response Management (ERM) is a critical problem faced by
communities across the globe. Despite its importance, it is common for ERM
systems to follow myopic and straightforward decision policies in the real
world. Principled approaches to aid decision-making under uncertainty have been
explored in this context but have failed to be accepted into real systems. We
identify a key issue impeding their adoption – algorithmic approaches to
emergency response focus on reactive, post-incident dispatching actions, i.e.
optimally dispatching a responder after incidents occur. However, the critical
nature of emergency response dictates that when an incident occurs, first
responders always dispatch the closest available responder to the incident. We
argue that the crucial period of planning for ERM systems is not post-incident,
but between incidents. However, this is not a trivial planning problem – a
major challenge with dynamically balancing the spatial distribution of
responders is the complexity of the problem. An orthogonal problem in ERM
systems is to plan under limited communication, which is particularly important
in disaster scenarios that affect communication networks. We address both the
problems by proposing two partially decentralized multi-agent planning
algorithms that utilize heuristics and the structure of the dispatch problem.
We evaluate our proposed approach using real-world data, and find that in
several contexts, dynamically re-balancing the spatial distribution of
emergency responders reduces both the average response time and its variance.
Sampling and Learning for Boolean Function
Chuyu Xiong Subjects : Artificial Intelligence (cs.AI) ; Logic in Computer Science (cs.LO)
In this article, we continue our study of the universal learning machine by
introducing new tools. We first discuss Boolean functions and Boolean
circuits, and we establish one set of tools, namely, the fitting extremum and
the proper sampling set. We prove the fundamental relationship between the
proper sampling set and the complexity of a Boolean circuit. Armed with this
set of tools, we then introduce much more effective learning strategies. We
show that with such learning strategies and learning dynamics, universal
learning can be achieved while requiring much less data.
AutoMATES: Automated Model Assembly from Text, Equations, and Software
Comments: 8 pages, 6 figures, accepted to Modeling the World’s Systems 2019
Subjects:
Artificial Intelligence (cs.AI)
; Multimedia (cs.MM); Software Engineering (cs.SE)
Models of complicated systems can be represented in different ways – in
scientific papers, they are represented using natural language text as well as
equations. But to be of real use, they must also be implemented as software,
thus making code a third form of representing models. We introduce the
AutoMATES project, which aims to build semantically-rich unified
representations of models from scientific code and publications to facilitate
the integration of computational models from different domains and allow for
modeling large, complicated systems that span multiple domains and levels of
abstraction.
Towards Social Identity in Socio-Cognitive Agents
Diogo Rato , Samuel Mascarenhas , Rui Prada Subjects : Artificial Intelligence (cs.AI) ; Human-Computer Interaction (cs.HC)
Current architectures for social agents are designed around some specific
units of social behaviour that address particular challenges. Although their
performance might be adequate for controlled environments, deploying these
agents in the wild is difficult. Moreover, the increasing demand for autonomous
agents capable of living alongside humans calls for the design of more robust
social agents that can cope with diverse social situations. We believe that to
design such agents, their sociality and cognition should be conceived as one.
This includes creating mechanisms for constructing social reality as an
interpretation of the physical world with social meanings and selective
deployment of cognitive resources adequate to the situation. We identify
several design principles that should be considered while designing agent
architectures for socio-cognitive systems. Taking these remarks into account,
we propose a socio-cognitive agent model based on the concept of Cognitive
Social Frames that allow the adaptation of an agent’s cognition based on its
interpretation of its surroundings, its Social Context. Our approach supports
an agent’s reasoning about other social actors and its relationship with them.
Cognitive Social Frames can be built around social groups, and form the basis
for social group dynamics mechanisms and the construction of Social Identity.
The Incentives that Shape Behaviour
Comments: 12 pages, 7 figures, accepted to SafeAI workshop at AAAI
Subjects:
Artificial Intelligence (cs.AI)
; Machine Learning (cs.LG)
Which variables does an agent have an incentive to control with its decision,
and which variables does it have an incentive to respond to? We formalise these
incentives, and demonstrate unique graphical criteria for detecting them in any
single decision causal influence diagram. To this end, we introduce structural
causal influence models, a hybrid of the influence diagram and structural
causal model frameworks. Finally, we illustrate how these incentives predict
agent incentives in both fairness and AI safety applications.
Measuring Diversity of Artificial Intelligence Conferences
Ana Freire , Lorenzo Porcaro , Emilia Gómez Subjects : Artificial Intelligence (cs.AI)
The lack of diversity of the Artificial Intelligence (AI) field is nowadays a
concern, and several initiatives such as funding schemes and mentoring programs
have been designed to fight against it. However, there is no indication of how
these initiatives actually impact AI diversity in the short and long term. This
work studies the concept of diversity in this particular context and proposes a
small set of diversity indicators (i.e. indexes) of AI scientific events. These
indicators are designed to quantify the lack of diversity of the AI field and
monitor its evolution. We consider diversity in terms of gender, geographical
location and business (understood as the presence of academia versus industry).
We compute these indicators for the different communities of a conference:
authors, keynote speakers and organizing committee. From these components we
compute a summarized diversity indicator for each AI event. We evaluate the
proposed indexes for a set of recent major AI conferences and we discuss their
values and limitations.
MOEA/D with Random Partial Update Strategy
Yuri Lavinas , Claus Aranha , Marcelo Ladeira , Felipe Campelo Subjects : Artificial Intelligence (cs.AI)
Recent studies on resource allocation suggest that some subproblems are more
important than others in the context of the MOEA/D, and that focusing on the
most relevant ones can consistently improve the performance of that algorithm.
These studies share the common characteristic of updating only a fraction of
the population at any given iteration of the algorithm. In this work we
investigate a new, simpler partial update strategy, in which a random subset of
solutions is selected at every iteration. The performance of the MOEA/D using
this new resource allocation approach is compared experimentally against that
of the standard MOEA/D-DE and the MOEA/D with relative improvement-based
resource allocation. The results indicate that using the MOEA/D with this new
partial update strategy results in improved HV and IGD values, and a much
higher proportion of non-dominated solutions, particularly as the number of
updated solutions at every iteration is reduced.
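The random partial update is simple enough to sketch; the fragment below
(names hypothetical, not the authors’ code) shows only the selection step
inside a generic MOEA/D generation loop.

    import random

    def moead_generation(population, n_subproblems, evolve_one,
                         update_ratio=0.5):
        # Random Partial Update: evolve and update only a random subset
        # of subproblems this generation, re-drawn every generation.
        # evolve_one(i, population) stands for the usual MOEA/D-DE
        # variation + neighbourhood-update step for subproblem i.
        k = max(1, int(update_ratio * n_subproblems))
        for i in random.sample(range(n_subproblems), k):
            evolve_one(i, population)
        return population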
A Survey of Reinforcement Learning Techniques: Strategies, Recent Development, and Future Directions
Amit Kumar Mondal , Nadeem Jamali Subjects : Artificial Intelligence (cs.AI)
Reinforcement learning is one of the core components in designing an
artificial intelligent system emphasizing real-time response. Reinforcement
learning influences the system to take actions within an arbitrary environment
either having previous knowledge about the environment model or not. In this
paper, we present a comprehensive study on Reinforcement Learning focusing on
various dimensions including challenges, the recent development of different
state-of-the-art techniques, and future directions. The fundamental objective
of this paper is to provide a framework for presenting the available methods
of reinforcement learning that is informative and simple to follow for new
researchers and academics in this domain, considering the latest concerns.
First, we illustrate the core techniques of reinforcement learning in an
easily understandable and comparable way. We then analyze and depict the
recent developments in reinforcement learning approaches. Our analysis points
out that most of the models focus on tuning policy values rather than tuning
other elements in a particular state of reasoning.
Correcting Knowledge Base Assertions
Comments: Accepted by The Web Conference (WWW) 2020
Subjects:
Artificial Intelligence (cs.AI)
The usefulness and usability of knowledge bases (KBs) is often limited by
quality issues. One common issue is the presence of erroneous assertions, often
caused by lexical or semantic confusion. We study the problem of correcting
such assertions, and present a general correction framework which combines
lexical matching, semantic embedding, soft constraint mining and semantic
consistency checking. The framework is evaluated using DBpedia and an
enterprise medical KB.
FRESH: Interactive Reward Shaping in High-Dimensional State Spaces using Human Feedback
Comments: Accepted as Full Paper to International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS) 2020
Subjects:
Artificial Intelligence (cs.AI)
Reinforcement learning has been successful in training autonomous agents to
accomplish goals in complex environments. Although this has been adapted to
multiple settings, including robotics and computer games, human players often
find it easier to obtain higher rewards in some environments than reinforcement
learning algorithms. This is especially true of high-dimensional state spaces
where the reward obtained by the agent is sparse or extremely delayed. In this
paper, we seek to effectively integrate feedback signals supplied by a human
operator with deep reinforcement learning algorithms in high-dimensional state
spaces. We call this FRESH (Feedback-based REward SHaping). During training, a
human operator is presented with trajectories from a replay buffer and then
provides feedback on states and actions in the trajectory. In order to
generalize feedback signals provided by the human operator to previously unseen
states and actions at test-time, we use a feedback neural network. We use an
ensemble of neural networks with a shared network architecture to represent
model uncertainty and the confidence of the neural network in its output. The
output of the feedback neural network is converted to a shaping reward that is
added to the reward provided by the environment. We evaluate our approach
on the Bowling and Skiing Atari games in the arcade learning environment.
Although human experts have been able to achieve high scores in these
environments, state-of-the-art deep learning algorithms perform poorly. We
observe that FRESH is able to achieve much higher scores than state-of-the-art
deep learning algorithms in both environments. FRESH also achieves a 21.4%
higher score than a human expert in Bowling and does as well as a human expert
in Skiing.
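A hedged sketch of how the ensemble’s output could be folded into the reward,
following the description above (the agreement measure and the scaling are our
own assumptions):

    import numpy as np

    def shaped_reward(env_reward, state, action, ensemble,
                      conf_threshold=0.5):
        # Each feedback network maps (state, action) to a predicted
        # human-feedback value in [-1, 1]; ensemble agreement serves
        # as a crude confidence estimate (assumption).
        preds = np.array([net(state, action) for net in ensemble])
        confidence = 1.0 - preds.std()
        if confidence < conf_threshold:
            return env_reward              # ignore low-confidence feedback
        return env_reward + preds.mean()   # add the shaping term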
Fair Transfer of Multiple Style Attributes in Text
Karan Dabas , Nishtha Madan , Vijay Arya , Sameep Mehta , Gautam Singh , Tanmoy Chakraborty Subjects : Artificial Intelligence (cs.AI)
To preserve anonymity and obfuscate their identity on online platforms users
may morph their text and portray themselves as a different gender or
demographic. Similarly, a chatbot may need to customize its communication style
to improve engagement with its audience. This manner of changing the style of
written text has gained significant attention in recent years. Yet these past
research works largely cater to the transfer of single style attributes. The
disadvantage of focusing on a single style alone is that this often results in
target text where other existing style attributes behave unpredictably or are
unfairly dominated by the new style. To counteract this behavior, a style
transfer mechanism is needed that can transfer or control multiple styles
simultaneously and fairly. Through such an approach, one could obtain
obfuscated or rewritten text incorporating a desired degree of multiple soft
styles such as female-quality, politeness, or formalness.
In this work, we demonstrate that the transfer of multiple styles cannot be
achieved by sequentially performing multiple single-style transfers. This is
because each single style-transfer step often reverses or dominates over the
style incorporated by a previous transfer step. We then propose a neural
network architecture for fairly transferring multiple style attributes in a
given text. We test our architecture on the Yelp data set to demonstrate its
superior performance as compared with existing single-style transfers
performed in sequence.
Qi Zhou , Jingjie Zhu , Junwen Zhang , Zhensheng Jia , Bernardo Huberman , Gee-Kung Chang Subjects : Networking and Internet Architecture (cs.NI) ; Artificial Intelligence (cs.AI)
A novel intelligent bandwidth allocation scheme in NG-EPON using
reinforcement learning is proposed and demonstrated for latency management. We
verify the capability of the proposed scheme to achieve <1 ms average latency
under both fixed and dynamic traffic load scenarios. The RL agent
demonstrates an efficient intelligent mechanism to manage the latency, which
provides a promising IBA solution for the next-generation access network.
Deceptive AI Explanations: Creation and Detection
Johannes Schneider , Joshua Handali , Michalis Vlachos , Christian Meske Subjects : Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Artificial intelligence comes with great opportunities but also great
risks. We investigate to what extent deep learning can be used to create and
detect deceptive explanations that either aim to lure a human into believing a
decision that is not truthful to the model or provide reasoning that is
non-faithful to the decision. Our theoretical insights show some limits of
deception and detection in the absence of domain knowledge. For empirical
evaluation, we focus on text classification. To create deceptive explanations,
we alter explanations originating from GradCAM, a state-of-the-art technique
for creating explanations in neural networks. We evaluate the effectiveness of
deceptive explanations on 200 participants. Our findings indicate that
deceptive explanations can indeed fool humans. Our classifier can detect even
seemingly minor attempts at deception with accuracy that exceeds 80% given
sufficient domain knowledge encoded in the form of training data.
Improving Interaction Quality Estimation with BiLSTMs and the Impact on Dialogue Policy Learning
Comments: Published at SIGDIAL 2019
Subjects:
Computation and Language (cs.CL)
; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Learning suitable and well-performing dialogue behaviour in statistical
spoken dialogue systems has been the focus of research for many years. While
most work which is based on reinforcement learning employs an objective measure
like task success for modelling the reward signal, we use a reward based on
user satisfaction estimation. We propose a novel estimator and show that it
outperforms all previous estimators while learning temporal dependencies
implicitly. Furthermore, we apply this novel user satisfaction estimation model
live in simulated experiments where the satisfaction estimation model is
trained on one domain and applied in many other domains which cover a similar
task. We show that applying this model results in higher estimated
satisfaction, similar task success rates and a higher robustness to noise.
Turing analogues of Gödel statements and computability of intelligence
Comments: This is rewrite from a completely mathematical viewpoint of arXiv:1810.06985 , which is to be withdrawn
Subjects:
Logic in Computer Science (cs.LO)
; Artificial Intelligence (cs.AI)
We show that there is a mathematical obstruction to complete Turing
computability of intelligence. This obstruction can be circumvented only if
human reasoning is fundamentally unsound. The most compelling original argument
for existence of such an obstruction was proposed by Penrose, however Gödel,
Turing and Lucas have also proposed such arguments. We first partially
reformulate the argument of Penrose. In this formulation we argue that his
argument works up to possibility of construction of a certain Gödel
statement. We then completely re-frame the argument in the language of Turing
machines, and by partially defining our subject just enough, we show that a
certain analogue of a Gödel statement, or a Gödel string as we call it in
the language of Turing machines, can be readily constructed directly, without
appeal to the Gödel incompleteness theorem, and thus removing the final
objection.
Provenance for the Description Logic ELHr
Comments: 24 pages
Subjects:
Logic in Computer Science (cs.LO)
; Artificial Intelligence (cs.AI)
We address the problem of handling provenance information in ELHr ontologies.
We consider a setting recently introduced for ontology-based data access, based
on semirings and extending classical data provenance, in which ontology axioms
are annotated with provenance tokens. A consequence inherits the provenance of
the axioms involved in deriving it, yielding a provenance polynomial as an
annotation. We analyse the semantics for the ELHr case and show that the
presence of conjunctions poses various difficulties for handling provenance,
some of which are mitigated by assuming multiplicative idempotency of the
semiring. Under this assumption, we study three problems: ontology completion
with provenance, computing the set of relevant axioms for a consequence, and
query answering.
Model-based Multi-Agent Reinforcement Learning with Cooperative Prioritized Sweeping
Eugenio Bargiacchi , Timothy Verstraeten , Diederik M. Roijers , Ann Nowé Subjects : Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
We present a new model-based reinforcement learning algorithm, Cooperative
Prioritized Sweeping, for efficient learning in multi-agent Markov decision
processes. The algorithm allows for sample-efficient learning on large problems
by exploiting a factorization to approximate the value function. Our approach
only requires knowledge about the structure of the problem in the form of a
dynamic decision network. Using this information, our method learns a model of
the environment and performs temporal difference updates which affect multiple
joint states and actions at once. Batch updates are additionally performed
which efficiently back-propagate knowledge throughout the factored Q-function.
Our method outperforms the state-of-the-art sparse cooperative Q-learning
algorithm, both on the well-known SysAdmin benchmark and on randomized
environments.
Domain-Aware Dialogue State Tracker for Multi-Domain Dialogue Systems
Vevake Balaraman , Bernardo Magnini Subjects : Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
In task-oriented dialogue systems the dialogue state tracker (DST) component
is responsible for predicting the state of the dialogue based on the dialogue
history. Current DST approaches rely on a predefined domain ontology, a fact
that limits their effective usage for large scale conversational agents, where
the DST constantly needs to be interfaced with ever-increasing services and
APIs. To overcome this drawback, we propose a domain-aware dialogue state
tracker that is completely data-driven and is modeled to predict dynamic
service schemas. The proposed model utilizes domain and
slot information to extract both domain and slot specific representations for a
given dialogue, and then uses such representations to predict the values of the
corresponding slot. Integrating this mechanism with a pretrained language model
(i.e. BERT), our approach can effectively learn semantic relations.
Node Masking: Making Graph Neural Networks Generalize and Scale Better
Pushkar Mishra , Aleksandra Piktus , Gerard Goossen , Fabrizio Silvestri Subjects : Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Graph Neural Networks (GNNs) have received a lot of interest in recent times.
From the early spectral architectures that could only operate on undirected
graphs under a transductive learning paradigm to the current state-of-the-art
spatial ones that can be applied inductively to arbitrary graphs, GNNs have
seen significant contributions from the research community. In this paper, we
discuss some theoretical tools to better visualize the operations performed by
state of the art spatial GNNs. We analyze the inner workings of these
architectures and introduce a simple concept, node masking, that allows them to
generalize and scale better. To empirically validate the theory, we perform
several experiments on two widely used benchmark datasets for node
classification in both transductive and inductive settings.
Engineering AI Systems: A Research Agenda
Comments: 13 pages, 3 figures, highlights section
Subjects:
Machine Learning (cs.LG)
; Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Deploying machine-learning, and in particular deep-learning, (ML/DL)
solutions in industry-strength, production-quality contexts proves to be
challenging. This requires a structured engineering approach to constructing
and evolving systems that contain ML/DL components. In this paper, we provide
a conceptualization of the typical evolution patterns that companies
experience when employing ML/DL, as well as a framework for integrating ML/DL
components in systems consisting of
multiple types of components. In addition, we provide an overview of the
engineering challenges surrounding AI/ML/DL solutions and, based on that, we
provide a research agenda and overview of open items that need to be addressed
by the research community at large.
Unsupervisedly Learned Representations: Should the Quest be Over?
Comments: submitted to Pattern Recognition Letters
Subjects:
Machine Learning (cs.LG)
; Artificial Intelligence (cs.AI)
There exists a Classification accuracy gap of about 20% between our best
methods of generating Unsupervisedly Learned Representations and the accuracy
rates achieved by (naturally Unsupervisedly Learning) humans. We have been
searching for this class of paradigms for at least four decades. It thus may
well be that we are looking in the wrong direction. We present in this paper a
possible solution to this puzzle. We demonstrate that Reinforcement Learning
schemes can learn representations, which may be used for Pattern Recognition
tasks such as Classification, achieving practically the same accuracy as that
of humans. Our main modest contribution lies in the observations that: a. when
applied to a real world environment (e.g. nature itself) Reinforcement Learning
does not require labels, and thus may be considered a natural candidate for the
long sought, accuracy competitive Unsupervised Learning method, and b. in
contrast, when Reinforcement Learning is applied in a simulated or symbolic
processing environment (e.g. a computer program) it does inherently require
labels and should thus be generally classified, with some exceptions, as
Supervised Learning. The corollary of these observations is that further search
for Unsupervised Learning competitive paradigms which may be trained in
simulated environments like many of those found in research and applications
may be futile.
Fast Sequence-Based Embedding with Diffusion Graphs
Comments: Source code available at: this https URL
Journal-ref: CompleNet 2018
Subjects:
Machine Learning (cs.LG)
; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
A graph embedding is a representation of graph vertices in a low-dimensional
space, which approximately preserves properties such as distances between
nodes. Vertex sequence-based embedding procedures use features extracted from
linear sequences of nodes to create embeddings using a neural network. In this
paper, we propose diffusion graphs as a method to rapidly generate vertex
sequences for network embedding. Its computational efficiency is superior to
previous methods due to simpler sequence generation, and it produces more
accurate results. In experiments, we found that the performance relative to
other methods improves with increasing edge density in the graph. In a
community detection task, clustering nodes in the embedding space produces
better results compared to other sequence-based embedding methods.
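One plausible reading of diffusion-based sequence generation (a sketch, not
the paper’s exact procedure) grows a diffusion set outward from a start vertex
and emits vertices in the order they are reached; such sequences can then be
fed to a word2vec-style skip-gram model to obtain the embeddings.

    import random

    def diffusion_sequence(adjacency, start, length):
        # adjacency: dict mapping each vertex to a list of neighbours.
        visited, sequence, frontier = {start}, [start], [start]
        while len(sequence) < length and frontier:
            v = random.choice(frontier)
            unseen = [u for u in adjacency[v] if u not in visited]
            if not unseen:
                frontier.remove(v)   # v is exhausted
                continue
            u = random.choice(unseen)
            visited.add(u)
            frontier.append(u)
            sequence.append(u)
        return sequence

    # Example on a small graph:
    graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
    print(diffusion_sequence(graph, 0, 4))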
Designing for the Long Tail of Machine Learning
Comments: Accepted for presentation in poster format for the ACM CHI’19 Workshop <Emerging Perspectives in Human-Centered Machine Learning>
Subjects:
Human-Computer Interaction (cs.HC)
; Artificial Intelligence (cs.AI)
Recent technical advances have made machine learning (ML) a promising
component to include in end-user-facing systems. However, user experience (UX)
practitioners face challenges in relating ML to existing user-centered design
processes and in navigating the possibilities and constraints of this design
space. Drawing on our own experience, we characterize designing within this
space as navigating trade-offs between data gathering, model development and
designing valuable interactions for a given model performance. We suggest
that the theoretical description of how machine learning performance scales
with training data can guide designers in these trade-offs and also has
implications for prototyping. We exemplify the learning curve’s usage by
arguing that a useful pattern is to design an initial system in a bootstrap
phase that aims to exploit the training effect of data collected at increasing
orders of magnitude.
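The learning-curve reasoning can be made concrete with the common power-law
form error(n) = a + b * n^(-c) (our choice of parameterization, not the
paper’s), which lets a designer extrapolate how much data a target performance
would require.

    import numpy as np

    def fit_learning_curve(sizes, errors):
        # Fit error(n) = a + b * n**(-c) by a grid search over c with
        # linear least squares for (a, b). A crude design-time sketch,
        # not a statistically rigorous procedure.
        best = None
        for c in np.linspace(0.1, 1.0, 10):
            X = np.column_stack([np.ones_like(sizes), sizes ** (-c)])
            (a, b), res, *_ = np.linalg.lstsq(X, errors, rcond=None)
            sse = float(res[0]) if res.size else 0.0
            if best is None or sse < best[0]:
                best = (sse, a, b, c)
        return best[1:]

    sizes = np.array([100.0, 1000.0, 10000.0, 100000.0])
    errors = np.array([0.40, 0.25, 0.18, 0.15])
    a, b, c = fit_learning_curve(sizes, errors)
    print(f"predicted error at n=1e6: {a + b * 1e6 ** (-c):.3f}")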
Learning Diverse Features with Part-Level Resolution for Person Re-Identification
Comments: 8 pages, 5 figures, submitted to IEEE TCSVT
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
; Artificial Intelligence (cs.AI)
Learning diverse features is key to the success of person re-identification.
Various part-based methods have been extensively proposed for learning local
representations, which, however, are still inferior to the best-performing
methods for person re-identification. This paper proposes to construct a strong
lightweight network architecture, termed PLR-OSNet, based on the idea of
Part-Level feature Resolution over the Omni-Scale Network (OSNet) for achieving
feature diversity. The proposed PLR-OSNet has two branches, one branch for
global feature representation and the other branch for local feature
representation. The local branch employs a uniform partition strategy for
part-level feature resolution but produces only a single identity-prediction
loss, which is in sharp contrast to the existing part-based methods. Empirical
evidence demonstrates that the proposed PLR-OSNet achieves state-of-the-art
performance on popular person Re-ID datasets, including Market1501,
DukeMTMC-reID and CUHK03, despite its small model size.
Explaining Data-Driven Decisions made by AI Systems: The Counterfactual Approach
Carlos Fernandez , Foster Provost , Xintian Han Subjects : Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Lack of understanding of the decisions made by model-based AI systems is an
important barrier for their adoption. We examine counterfactual explanations as
an alternative for explaining AI decisions. The counterfactual approach defines
an explanation as a set of the system’s data inputs that causally drives the
decision (meaning that removing them changes the decision) and is irreducible
(meaning that removing any subset of the inputs in the explanation does not
change the decision). We generalize previous work on counterfactual
explanations, resulting in a framework that (a) is model-agnostic, (b) can
address features with arbitrary data types, (c) is able to explain decisions
made by complex AI systems that incorporate multiple models, and (d) is
scalable to
large numbers of features. We also propose a heuristic procedure to find the
most useful explanations depending on the context. We contrast counterfactual
explanations with another alternative: methods that explain model predictions
by weighting features according to their importance (e.g., SHAP, LIME). This
paper presents two fundamental reasons why explaining model predictions is not
the same as explaining the decisions made using those predictions, suggesting
we should carefully consider whether importance-weight explanations are
well-suited to explain decisions made by AI systems. Specifically, we show that
(1) features that have a large importance weight for a model prediction may not
actually affect the corresponding decision, and (2) importance weights are
insufficient to communicate whether and how features influence system
decisions. We demonstrate this using several examples, including three detailed
studies using real-world data that compare the counterfactual approach with
SHAP and illustrate various conditions under which counterfactual explanations
explain data-driven decisions better than feature importance weights.
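The causal-drive and irreducibility conditions suggest a simple greedy search,
sketched here under our own assumptions (a black-box `decision` callable, and
‘removing’ a feature meaning replacing it with a baseline value):

    def counterfactual_explanation(x, decision, baseline):
        # Greedy sketch: start from removing all features (assumed to
        # flip the decision), then try putting each feature back,
        # keeping in the explanation only those whose removal is needed
        # to preserve the flip. A production version would verify
        # irreducibility explicitly for non-monotone decision functions.
        original = decision(x)
        removed = set(x)
        for f in sorted(removed):
            trial = removed - {f}
            x_trial = {g: (baseline[g] if g in trial else x[g]) for g in x}
            if decision(x_trial) != original:
                removed = trial  # still flips without removing f
        return removed           # features that causally drive the decision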
Lyceum: An efficient and scalable ecosystem for robot learning
Colin Summers , Kendall Lowrey , Aravind Rajeswaran , Siddhartha Srinivasa , Emanuel Todorov Subjects : Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
We introduce Lyceum, a high-performance computational ecosystem for robot
learning. Lyceum is built on top of the Julia programming language and the
MuJoCo physics simulator, combining the ease-of-use of a high-level programming
language with the performance of native C. In addition, Lyceum has a
straightforward API to support parallel computation across multiple cores and
machines. Overall, depending on the complexity of the environment, Lyceum is
5-30x faster compared to other popular abstractions like OpenAI’s Gym and
DeepMind’s dm-control. This substantially reduces training time for various
reinforcement learning algorithms; and is also fast enough to support real-time
model predictive control through MuJoCo. The code, tutorials, and demonstration
videos can be found at: this http URL .
Dynamic Epistemic Logic Games with Epistemic Temporal Goals
Bastien Maubert , Aniello Murano , Sophie Pinchinat , François Schwarzentruber , Silvia Stranieri Subjects : Logic in Computer Science (cs.LO) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Dynamic Epistemic Logic (DEL) is a logical framework in which one can
describe in great detail how actions are perceived by the agents, and how they
affect the world. DEL games were recently introduced as a way to define classes
of games with imperfect information where the actions available to the players
are described very precisely. This framework makes it possible to define
easily, for instance, classes of games where players can only use public
actions or public announcements. These games have been studied for reachability
objectives, where the aim is to reach a situation satisfying some epistemic
property expressed in epistemic logic; several (un)decidability results have
been established. In this work we show that the decidability results obtained
for reachability objectives extend to a much more general class of winning
conditions, namely those expressible in the epistemic temporal logic LTLK. To
do so we establish that the infinite game structures generated by DEL public
actions are regular, and we describe how to obtain finite representations on
which we rely to solve them.
An interpretable neural network model through piecewise linear approximation
Mengzhuo Guo , Qingpeng Zhang , Xiuwu Liao , Daniel Dajun Zeng Subjects : Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Most existing interpretable methods explain a black-box model in a post-hoc
manner, which uses simpler models or data analysis techniques to interpret the
predictions after the model is learned. However, they (a) may derive
contradictory explanations on the same predictions given different methods and
data samples, and (b) focus on using simpler models to provide higher
descriptive accuracy at the expense of prediction accuracy. To address these
issues, we propose a hybrid interpretable model that combines a piecewise
linear component and a nonlinear component. The first component describes the
explicit feature contributions by piecewise linear approximation to increase
the expressiveness of the model. The other component uses a multi-layer
perceptron to capture feature interactions and implicit nonlinearity, and
increase the prediction performance. Different from the post-hoc approaches,
the interpretability is obtained once the model is learned in the form of
feature shapes. We also provide a variant to explore higher-order interactions
among features to demonstrate that the proposed model is flexible for
adaptation. Experiments demonstrate that the proposed model can achieve good
interpretability by describing feature shapes while maintaining
state-of-the-art accuracy.
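A toy PyTorch rendering of the hybrid (our construction; bin counts and layer
sizes are illustrative): a per-feature piecewise linear term, whose learned
slopes give the feature shapes, plus an MLP capturing implicit interactions,
summed into the prediction.

    import torch
    import torch.nn as nn

    class HybridInterpretableNet(nn.Module):
        def __init__(self, n_features, n_bins=8, hidden=32):
            super().__init__()
            # Piecewise linear component: one slope per (feature, bin).
            self.register_buffer("bin_edges",
                                 torch.linspace(0, 1, n_bins + 1)[1:-1])
            self.register_buffer("offsets",
                                 torch.arange(n_features) * n_bins)
            self.slopes = nn.Embedding(n_features * n_bins, 1)
            nn.init.zeros_(self.slopes.weight)
            # Nonlinear component for implicit feature interactions.
            self.mlp = nn.Sequential(nn.Linear(n_features, hidden),
                                     nn.ReLU(), nn.Linear(hidden, 1))

        def forward(self, x):  # x: (batch, n_features) scaled to [0, 1]
            bins = torch.bucketize(x, self.bin_edges)
            w = self.slopes(bins + self.offsets).squeeze(-1)
            explicit = (w * x).sum(dim=1, keepdim=True)  # feature shapes
            return explicit + self.mlp(x)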
Comments: submitted; 31 pages, 10 tables and 29 figures
Subjects:
Software Engineering (cs.SE)
; Artificial Intelligence (cs.AI)
Architectural patterns provide a reusable architectural solution for commonly
recurring problems that can assist in designing software systems. In this
regard, self-awareness architectural patterns are specialized patterns that
leverage good engineering practices and experiences to help in designing
self-awareness and self-adaptation of a software system. However, domain
knowledge and engineers’ expertise that is built over time are not explicitly
linked to these patterns and the self-aware process. This linkage is important,
as it can enrich the design patterns of these systems, which consequently leads
to more effective and efficient self-aware and self-adaptive behaviours. This
paper is an introductory work that highlights the importance of synergizing
domain expertise into the self-awareness in software systems, relying on
well-defined underlying approaches. In particular, we present a holistic
framework that classifies widely known representations used to obtain and
maintain domain expertise, documenting their nature and the specific rules
that permit different levels of synergy with self-awareness. Drawing on these,
we describe mechanisms that can enrich existing patterns with engineers’
expertise and knowledge of the domain. This, together with the framework,
allows us to codify an intuitive step-by-step methodology that guides
engineers in making design decisions when synergizing domain expertise into
self-awareness and reveals its importance, in an attempt to keep
‘engineers-in-the-loop’.
Through three case studies, we demonstrate how the enriched patterns, the
proposed framework and methodology can be applied in different domains, within
which we quantitatively compare the actual benefits of incorporating engineers’
expertise into self-awareness, at alternative levels of synergies.
A point-wise linear model reveals reasons for 30-day readmission of heart failure patients
Comments: 8 pages, 3 figures
Subjects:
Machine Learning (cs.LG)
; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Heart failure in the United States costs an estimated 30.7 billion dollars
annually, and predictive analysis can decrease the costs due to readmission of
heart failure patients. Deep learning can predict readmissions but does not
give
reasons for its predictions. Ours is the first study on a deep-learning
approach to explaining decisions behind readmission predictions. Additionally,
it provides an automatic patient stratification to explain cohorts of
readmitted patients. The new deep-learning model called a point-wise linear
model is a meta-learning machine of linear models. It generates a logistic
regression model to predict early readmission for each patient. The custom-made
prediction models allow us to analyze feature importance. We evaluated the
approach using a dataset of heart-failure patients readmitted within 30 days.
This study has been submitted to PLOS ONE; in advance, we would like to share
the theoretical aspects of the point-wise linear model as a part of our study.
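The core construction, a meta-learner that emits a per-patient logistic
regression, can be sketched as follows (a schematic of our own; layer sizes
are placeholders):

    import torch
    import torch.nn as nn

    class PointWiseLinearModel(nn.Module):
        # Generates, for each input x, the weights and bias of a
        # logistic regression applied to that same x; the generated
        # weights double as per-patient feature importances.
        def __init__(self, n_features, hidden=64):
            super().__init__()
            self.generator = nn.Sequential(
                nn.Linear(n_features, hidden), nn.ReLU(),
                nn.Linear(hidden, n_features + 1))

        def forward(self, x):  # x: (batch, n_features)
            params = self.generator(x)
            w, b = params[:, :-1], params[:, -1]
            logit = (w * x).sum(dim=1) + b
            return torch.sigmoid(logit)  # readmission probability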
SQuINTing at VQA Models: Interrogating VQA Models with Sub-Questions
Ramprasaath R. Selvaraju , Purva Tendulkar , Devi Parikh , Eric Horvitz , Marco Ribeiro , Besmira Nushi , Ece Kamar Subjects : Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Existing VQA datasets contain questions with varying levels of complexity.
While the majority of questions in these datasets require perception for
recognizing existence, properties, and spatial relationships of entities, a
significant portion of questions pose challenges that correspond to reasoning
tasks — tasks that can only be answered through a synthesis of perception and
knowledge about the world, logic and/or reasoning. This distinction allows us
to notice when existing VQA models have consistency issues — they answer the
reasoning question correctly but fail on associated low-level perception
questions. For example, models answer the complex reasoning question “Is the
banana ripe enough to eat?” correctly, but fail on the associated perception
question “Are the bananas mostly green or yellow?” indicating that the model
likely answered the reasoning question correctly but for the wrong reason. We
quantify the extent to which this phenomenon occurs by creating a new Reasoning
split of the VQA dataset and collecting Sub-VQA, a new dataset consisting of
200K new perception questions which serve as sub questions corresponding to the
set of perceptual tasks needed to effectively answer the complex reasoning
questions in the Reasoning split. Additionally, we propose an approach called
Sub-Question Importance-aware Network Tuning (SQuINT), which encourages the
model to attend to the same parts of the image when answering the reasoning
question and the perception sub-questions. We show that SQuINT improves model
consistency by 7.8%, marginally improves performance on the Reasoning
questions in VQA, and displays qualitatively better attention maps.
Teaching Software Engineering for AI-Enabled Systems
Comments: to be published in ICSE-SEET 2020
Subjects:
Software Engineering (cs.SE)
; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Software engineers have significant expertise to offer when building
intelligent systems, drawing on decades of experience and methods for building
systems that are scalable, responsive and robust, even when built on unreliable
components. Systems with artificial-intelligence or machine-learning (ML)
components raise new challenges and require careful engineering. We designed a
new course to teach software-engineering skills to students with a background
in ML. We specifically go beyond traditional ML courses that teach modeling
techniques under artificial conditions and focus, in lecture and assignments,
on realism with large and changing datasets, robust and evolvable
infrastructure, and purposeful requirements engineering that considers ethics
and fairness as well. We describe the course and our infrastructure and share
experience and all material from teaching the course for the first time.
How do Data Science Workers Collaborate? Roles, Workflows, and Tools
Comments: CSCW’2020
Subjects:
Human-Computer Interaction (cs.HC)
; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE); Machine Learning (stat.ML)
Today, the prominence of data science within organizations has given rise to
teams of data science workers collaborating on extracting insights from data,
as opposed to individual data scientists working alone. However, we still lack
a deep understanding of how data science workers collaborate in practice. In
this work, we conducted an online survey with 183 participants who work in
various aspects of data science. We focused on their reported interactions with
each other (e.g., managers with engineers) and with different tools (e.g.,
Jupyter Notebook). We found that data science teams are extremely collaborative
and work with a variety of stakeholders and tools during the six common steps
of a data science workflow (e.g., clean data and train model). We also found
that the collaborative practices workers employ, such as documentation, vary
according to the kinds of tools they use. Based on these findings, we discuss
design implications for supporting data science team collaborations and future
research directions.
Learning to See Analogies: A Connectionist Exploration
Comments: 191 pages, PhD thesis
Subjects:
Machine Learning (cs.LG)
; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
This dissertation explores the integration of learning and analogy-making
through the development of a computer program, called Analogator, that learns
to make analogies by example. By “seeing” many different analogy problems,
along with possible solutions, Analogator gradually develops an ability to make
new analogies. That is, it learns to make analogies by analogy. This approach
stands in contrast to most existing research on analogy-making, in which
typically the a priori existence of analogical mechanisms within a model is
assumed. The present research extends standard connectionist methodologies by
developing a specialized associative training procedure for a recurrent network
architecture. The network is trained to divide input scenes (or situations)
into appropriate figure and ground components. Seeing one scene in terms of a
particular figure and ground provides the context for seeing another in an
analogous fashion. After training, the model is able to make new analogies
between novel situations. Analogator has much in common with lower-level
perceptual models of categorization and recognition; it thus serves as a
unifying framework encompassing both high-level analogical learning and
low-level perception. This approach is compared and contrasted with other
computational models of analogy-making. The model’s training and generalization
performance is examined, and limitations are discussed.
Graph Ordering: Towards the Optimal by Learning
Comments: 11 pages
Subjects:
Social and Information Networks (cs.SI)
; Artificial Intelligence (cs.AI)
Graph representation learning has achieved remarkable success in many
graph-based applications, such as node classification, link prediction, and
community detection. These models are usually designed to preserve vertex
information at different granularities and to reduce problems in discrete
space to machine learning tasks in continuous space. However, despite this
fruitful progress, some graph applications, such as graph compression and edge
partitioning, are hard to reduce to graph representation learning tasks.
Moreover, these problems are closely related to finding a global layout for a
specific graph, which is an important NP-hard combinatorial optimization
problem: graph ordering. In this paper, we propose to attack the graph
ordering problem behind such applications with a novel learning approach. In
contrast to greedy algorithms based on predefined heuristics, we propose a
neural network model, the Deep Order Network (DON), to capture the hidden
locality structure from partial vertex order sets. Supervised by sampled
partial orders, DON learns to infer unseen combinations. Furthermore, to
alleviate the combinatorial explosion in the training space of DON and to make
partial vertex order sampling efficient, we employ a reinforcement learning
model, the Policy Network, to automatically adjust the partial order sampling
probabilities during the training phase of DON. In this way, the Policy
Network improves training efficiency and automatically guides DON towards a
more effective model. Comprehensive experiments on both synthetic and real data
validate that DON-RL outperforms the current state-of-the-art heuristic
algorithm consistently. Two case studies on graph compression and edge
partitioning demonstrate the potential power of DON-RL in real applications.
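To make the greedy decoding idea concrete, the following is a minimal PyTorch
sketch of ordering with a learned pairwise scorer. The scorer stands in for
DON, and the window size w plays the role of the locality the model is meant
to capture; the feature function, architecture, and w are illustrative
assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class PairScorer(nn.Module):
    """Scores a pair of vertex feature vectors; a stand-in for DON."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, x_u, x_v):
        return self.net(torch.cat([x_u, x_v], dim=-1)).squeeze(-1)

def greedy_order(features, scorer, w=3):
    """Greedily append the vertex with the highest learned affinity to the
    last w placed vertices (a locality-preserving heuristic)."""
    n = features.size(0)
    order, remaining = [0], set(range(1, n))
    while remaining:
        recent = order[-w:]
        best = max(remaining, key=lambda v: sum(
            scorer(features[u], features[v]).item() for u in recent))
        order.append(best)
        remaining.remove(best)
    return order

features = torch.randn(8, 16)            # toy vertex features
print(greedy_order(features, PairScorer(16)))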
Multi-agent Motion Planning for Dense and Dynamic Environments via Deep Reinforcement Learning
Samaneh Hosseini Semnani , Hugh Liu , Michael Everett , Anton de Ruiter , Jonathan P. How Subjects : Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
This paper introduces a hybrid algorithm of deep reinforcement learning (RL)
and force-based motion planning (FMP) to solve the distributed motion planning
problem in dense and dynamic environments. Individually, RL and FMP each have
their own limitations: FMP cannot produce time-optimal paths, and existing RL
solutions cannot produce collision-free paths in dense environments. We
therefore first improve the performance of recent RL approaches by introducing
a new reward function that not only eliminates the requirement for a preceding
supervised learning (SL) step but also decreases the chance of collision in
crowded environments. This improves performance, but many failure cases
remain. We thus develop a hybrid approach that falls back on the simpler FMP
method in stuck, simple, and high-risk cases, and continues using RL for the
normal cases in which FMP cannot produce time-optimal paths. We also extend
the GA3C-CADRL algorithm to 3D environments. Simulation results show that the
proposed algorithm outperforms both deep RL and FMP alone, producing up to 50%
more successful scenarios than deep RL and requiring up to 75% less extra time
to reach the goal than FMP.
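The switching logic can be illustrated with a short, hedged sketch; the
planner stub, thresholds, and state features below are assumptions for
illustration, and the paper's FMP and GA3C-CADRL components are far richer.

import numpy as np

def fmp_action(pos, goal, neighbors, k_att=1.0, k_rep=0.5):
    """Toy force-based planner: attraction to goal, repulsion from agents."""
    force = k_att * (goal - pos)
    for n in neighbors:
        d = pos - n
        dist = np.linalg.norm(d) + 1e-6
        force += k_rep * d / dist**3
    return force / (np.linalg.norm(force) + 1e-6)

def hybrid_action(rl_policy, pos, goal, neighbors, recent_progress,
                  risk_radius=1.0, stuck_eps=0.05):
    """Choose FMP in stuck or high-risk cases, RL otherwise."""
    min_dist = min((np.linalg.norm(pos - n) for n in neighbors),
                   default=np.inf)
    stuck = recent_progress < stuck_eps     # barely moved toward goal lately
    high_risk = min_dist < risk_radius      # another agent dangerously close
    if stuck or high_risk:
        return fmp_action(pos, goal, neighbors)
    return rl_policy(pos, goal, neighbors)

# usage with a dummy RL policy that heads straight for the goal
rl_policy = lambda p, g, ns: (g - p) / (np.linalg.norm(g - p) + 1e-6)
print(hybrid_action(rl_policy, np.zeros(2), np.array([5.0, 0.0]),
                    [np.array([0.4, 0.1])], recent_progress=0.2))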
The Risk to Population Health Equity Posed by Automated Decision Systems: A Narrative Review
Comments: 22 pages (12 pages excluding references), 1 figure
Subjects:
Computers and Society (cs.CY)
; Artificial Intelligence (cs.AI)
Artificial intelligence is already ubiquitous, and is increasingly being used
to autonomously make ever more consequential decisions. However, there has been
relatively little research into the consequences for equity of the use of
narrow AI and automated decision systems in medicine and public health. A
narrative review using a hermeneutic approach was undertaken to explore current
and future uses of AI in medicine and public health, issues that have emerged,
and longer-term implications for population health. Accounts in the literature
reveal a tremendous expectation on AI to transform medical and public health
practices, especially regarding precision medicine and precision public health.
Automated decisions being made about disease detection, diagnosis, treatment,
and health funding allocation have significant consequences for individual and
population health and wellbeing. Meanwhile, it is evident that issues of bias,
incontestability, and erosion of privacy have emerged in sensitive domains
where narrow AI and automated decision systems are in common use. As the use of
automated decision systems expands, it is probable that these same issues will
manifest widely in medicine and public health applications. Bias,
incontestability, and erosion of privacy are mechanisms by which existing
social, economic and health disparities are perpetuated and amplified. The
implication is that there is a significant risk that use of automated decision
systems in health will exacerbate existing population health inequities. The
industrial scale and rapidity with which automated decision systems can be
applied to whole populations heightens the risk to population health equity.
There is a need therefore to design and implement automated decision systems
with care, monitor their impact over time, and develop capacities to respond to
issues as they emerge.
Siamese Graph Neural Networks for Data Integration
Evgeny Krivosheev , Mattia Atzeni , Katsiaryna Mirylenka , Paolo Scotton , Fabio Casati Subjects : Databases (cs.DB) ; Artificial Intelligence (cs.AI)
Data integration has been studied extensively for decades and approached from
different angles. However, this domain still remains largely rule-driven and
lacks universal automation. Recent developments in machine learning, and in
particular deep learning, have opened the way to more general and more efficient
solutions to data integration problems. In this work, we propose a general
approach to modeling and integrating entities from structured data, such as
relational databases, as well as unstructured sources, such as free text from
news articles. Our approach is designed to explicitly model and leverage
relations between entities, thereby using all available information and
preserving as much context as possible. This is achieved by combining siamese
and graph neural networks to propagate information between connected entities
and support high scalability. We evaluate our method on the task of integrating
data about business entities, and we demonstrate that it outperforms standard
rule-based systems, as well as other deep learning approaches that do not use
graph-based representations.
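As a rough illustration of the siamese half of this design, the following
sketch trains a weight-shared encoder with a contrastive loss so that records
of the same entity land close in embedding space. The paper additionally
propagates information between connected entities with graph neural networks,
which this toy (a plain MLP on random data) omits.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    def __init__(self, in_dim, emb_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, emb_dim))

    def forward(self, a, b):
        # The same weight-shared encoder embeds both entity records.
        return self.encoder(a), self.encoder(b)

def contrastive_loss(za, zb, match, margin=1.0):
    """match=1 for records describing the same entity, 0 otherwise."""
    d = F.pairwise_distance(za, zb)
    return (match * d.pow(2) +
            (1 - match) * F.relu(margin - d).pow(2)).mean()

enc = SiameseEncoder(in_dim=16)
a, b = torch.randn(4, 16), torch.randn(4, 16)
match = torch.tensor([1.0, 0.0, 1.0, 0.0])
za, zb = enc(a, b)
loss = contrastive_loss(za, zb, match)
loss.backward()
print(float(loss))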
Activism by the AI Community: Analysing Recent Achievements and Future Prospects
Comments: Forthcoming in Proceedings of the 2020 AAAI/ACM Conference on Artificial Intelligence, Ethics and Society. 7 pages
Subjects:
Computers and Society (cs.CY)
; Artificial Intelligence (cs.AI)
The artificial intelligence (AI) community has recently engaged in activism
in relation to its employers, other members of the community, and its
governments in order to shape the societal and ethical implications of AI. It
has achieved some notable successes, but prospects for further political
organising and activism are uncertain. We survey activism by the AI community
over the last six years; apply two analytical frameworks drawing upon the
literature on epistemic communities, and worker organising and bargaining; and
explore what they imply for the future prospects of the AI community. Success
thus far has hinged on a coherent shared culture, and high bargaining power due
to the high demand for a limited supply of AI talent. Both are crucial to the
future of AI activism and worthy of sustained attention.
Comments: 14 pages, review. arXiv admin note: text overlap with arXiv:1810.05587 by other authors
Subjects:
Computer Science and Game Theory (cs.GT)
; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Deep reinforcement learning (RL) has achieved outstanding results in recent
years, which has led to a dramatic increase in the number of methods and
applications. Recent works explore learning beyond single-agent scenarios and
consider multi-agent settings. However, these works face many challenges and
seek help from traditional game-theoretic algorithms, which, in turn, show
bright promise when combined with modern algorithms and growing computing
power. In this survey, we first introduce basic concepts and algorithms in
single-agent RL and multi-agent systems; then, we summarize
the related algorithms from three aspects. Solution concepts from game theory
give inspiration to algorithms which try to evaluate the agents or find better
solutions in multi-agent systems. Fictitious self-play has become popular and
has had a great impact on multi-agent reinforcement learning algorithms.
Counterfactual regret minimization is an important tool to solve games with
incomplete information, and has shown great strength when combined with deep
learning.
A Survey on the Use of Preferences for Virtual Machine Placement in Cloud Data Centers
Comments: 40 pages, 5 figures, 6 tables
Subjects:
Distributed, Parallel, and Cluster Computing (cs.DC)
; Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
With the rapid development of virtualization techniques, cloud data centers
allow for cost-effective, flexible, and customizable deployments of
applications on virtualized infrastructure. Virtual machine (VM) placement aims
to assign each virtual machine to a server in the cloud environment. VM
Placement is of paramount importance to the design of cloud data centers.
Typically, VM placement involves complex relations and multiple design factors
as well as local policies that govern the assignment decisions. It also
involves different constituents including cloud administrators and customers
that might have disparate preferences while opting for a placement solution.
Thus, it is often valuable to not only return an optimized solution to the VM
placement problem but also a solution that reflects the given preferences of
the constituents. In this paper, we provide a detailed review on the role of
preferences in the recent literature on VM placement. We further discuss key
challenges and identify possible research opportunities to better incorporate
preferences within the context of VM placement.
Information Retrieval
Hybrid Semantic Recommender System for Chemical Compounds
Comments: 8 pages, 3 figures, accepted as short-paper to the 42nd European Conference on Information Retrieval (ECIR), Lisbon 2020
Subjects:
Information Retrieval (cs.IR)
Recommending chemical compounds of interest to a particular researcher is a
poorly explored field. The few existing datasets with information about the
preferences of researchers use implicit feedback. The lack of recommender
systems in this particular field presents a challenge for the development of
new recommendation models. In this work, we propose a hybrid recommender model
for recommending chemical compounds. The model integrates
collaborative-filtering algorithms for implicit feedback (Alternating Least
Squares (ALS) and Bayesian Personalized Ranking (BPR)) and semantic similarity
between the chemical compounds in the ChEBI ontology (ONTO). We evaluated the
model on an implicit-feedback dataset of chemical compounds, CheRM. The hybrid
model was able to improve the results of state-of-the-art
collaborative-filtering algorithms, especially for Mean Reciprocal Rank, with
an increase of 6.7% when comparing the collaborative-filtering ALS and the
hybrid ALS_ONTO.
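A hedged sketch of the hybrid scoring idea follows: a collaborative-filtering
score from latent factors (as ALS would produce on implicit feedback) is
blended with an ontology-based similarity to compounds the researcher already
interacted with. The blend weight alpha, the max-similarity aggregation, and
the stub similarity table are illustrative assumptions, not the paper's exact
formulation.

import numpy as np

def als_score(user_factors, item_factors, user, item):
    """Dot product of latent factors, as in implicit-feedback ALS."""
    return user_factors[user] @ item_factors[item]

def hybrid_score(user, item, user_factors, item_factors,
                 user_profile_items, onto_sim, alpha=0.7):
    """alpha * CF score + (1 - alpha) * max ontology similarity to the
    compounds the user already interacted with."""
    cf = als_score(user_factors, item_factors, user, item)
    sem = max((onto_sim[(i, item)] for i in user_profile_items), default=0.0)
    return alpha * cf + (1 - alpha) * sem

rng = np.random.default_rng(0)
U, I = rng.normal(size=(5, 8)), rng.normal(size=(10, 8))
onto_sim = {(i, j): 0.5 for i in range(10) for j in range(10)}  # stub
print(hybrid_score(user=0, item=3, user_factors=U, item_factors=I,
                   user_profile_items=[1, 2], onto_sim=onto_sim))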
BiOnt: Deep Learning using Multiple Biomedical Ontologies for Relation Extraction
Comments: Accepted as ECIR 2020 Short Paper
Subjects:
Information Retrieval (cs.IR)
Successful biomedical relation extraction can provide evidence to researchers
and clinicians about possible unknown associations between biomedical entities,
advancing the current knowledge we have about those entities and their inherent
mechanisms. Most biomedical relation extraction systems do not resort to
external sources of knowledge, such as domain-specific ontologies. However,
using deep learning methods, along with biomedical ontologies, has been
recently shown to effectively advance the biomedical relation extraction field.
To perform relation extraction, our deep learning system, BiOnt, employs four
types of biomedical ontologies, namely, the Gene Ontology, the Human Phenotype
Ontology, the Human Disease Ontology, and the Chemical Entities of Biological
Interest, regarding gene-products, phenotypes, diseases, and chemical
compounds, respectively. We tested our system with three data sets that
represent three different types of relations of biomedical entities. BiOnt
achieved, in F-score, an improvement of 4.93 percentage points for drug-drug
interactions (DDI corpus), 4.99 percentage points for phenotype-gene relations
(PGR corpus), and 2.21 percentage points for chemical-induced disease relations
(BC5CDR corpus), relative to the state of the art. The code supporting this
system is available at this https URL.
Quantum-like Structure in Multidimensional Relevance Judgements
Comments: To be published at 42nd European Conference on Research in IR, ECIR 2020
Subjects:
Information Retrieval (cs.IR)
A large number of studies in cognitive science have revealed that
probabilistic outcomes of certain human decisions do not agree with the axioms
of classical probability theory. The field of Quantum Cognition provides an
alternative probabilistic model to explain such paradoxical findings. It posits
that cognitive systems have an underlying quantum-like structure, especially in
decision-making under uncertainty. In this paper, we hypothesise that relevance
judgement, being a multidimensional, cognitive concept, can be used to probe
the quantum-like structure for modelling users’ cognitive states in information
seeking. Building on an experimental protocol inspired by the Stern-Gerlach
experiment in Quantum Physics, we design a crowd-sourced user study to show
violation of the Kolmogorovian probability axioms as a proof of the
quantum-like structure, and provide a comparison between a quantum
probabilistic model and a Bayesian model for predictions of relevance.
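A minimal formal illustration of the kind of violation at stake (not the
paper's data): in Kolmogorovian probability, conjunction is commutative, so
the order in which two relevance dimensions A and B are judged cannot matter,

  P(A and B) = P(B and A).

A quantum-like model instead represents the two judgements as projectors
Pi_A and Pi_B acting on a cognitive state psi, and sequential judgement
probabilities may exhibit order effects whenever the projectors do not
commute:

  || Pi_B Pi_A psi ||^2  !=  || Pi_A Pi_B psi ||^2
  whenever Pi_A Pi_B != Pi_B Pi_A.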
Common Conversational Community Prototype: Scholarly Conversational Assistant
Krisztian Balog , Lucie Flekova , Matthias Hagen , Rosie Jones , Martin Potthast , Filip Radlinski , Mark Sanderson , Svitlana Vakulenko , Hamed Zamani Subjects : Information Retrieval (cs.IR)
This paper discusses the potential for creating academic resources (tools,
data, and evaluation approaches) to support research in conversational search,
by focusing on realistic information needs and conversational interactions.
Specifically, we propose to develop and operate a prototype conversational
search system for scholarly activities. This Scholarly Conversational Assistant
would serve as a useful tool, a means to create datasets, and a platform for
running evaluation challenges by groups across the community. This article
results from discussions of a working group at Dagstuhl Seminar 19461 on
Conversational Search.
On the Minimum Achievable Age of Information for General Service-Time Distributions
Jaya Prakash Champati , Ramana R. Avula , Tobias J. Oechtering , James Gross Subjects : Information Retrieval (cs.IR) ; Information Theory (cs.IT)
There is a growing interest in analysing the freshness of data in networked
systems. Age of Information (AoI) has emerged as a popular metric to quantify
this freshness at a given destination. There has been a significant research
effort in optimizing this metric in communication and networking systems under
different settings. In contrast to previous works, we are interested in a
fundamental question: what is the minimum achievable AoI in any
single-server-single-source queuing system for a given service-time
distribution? To address this question, we study a problem of optimizing AoI
under service preemptions. Our main result is on the characterization of the
minimum achievable average peak AoI (PAoI). We obtain this result by showing
that a fixed-threshold policy is optimal in the set of all randomized-threshold
causal policies. We use the characterization to provide necessary and
sufficient conditions for the service-time distributions under which
preemptions are beneficial.
Information Foraging for Enhancing Implicit Feedback in Content-based Image Recommendation
Comments: FIRE ’19: Proceedings of the 11th Forum for Information Retrieval Evaluation
Subjects:
Information Retrieval (cs.IR)
; Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
User implicit feedback plays an important role in recommender systems.
However, finding implicit features is a tedious task. This paper aims to
identify users’ preferences through implicit behavioural signals for image
recommendation based on the Information Scent Model of Information Foraging
Theory. In the first part, we hypothesise that the users’ perception is
improved with visual cues in the images as behavioural signals that provide
users’ information scent during information seeking. We designed a
content-based image recommendation system to explore which image attributes
(i.e., visual cues or bookmarks) help users find their desired image. We found
that users prefer recommendations predicated by visual cues and therefore
consider the visual cues as good information scent for their information
seeking. In the second part, we investigated whether visual cues in the images
together with the images themselves can be better perceived by users than
either on its own. We evaluated the information scent artifacts in image
recommendation on the Pinterest image collection and the WikiArt dataset. We
find that our proposed image recommendation system supports implicit signals
through the information scent model of Information Foraging Theory.
Ranking Significant Discrepancies in Clinical Reports
Comments: ECIR 2020 (short)
Subjects:
Information Retrieval (cs.IR)
; Computation and Language (cs.CL)
Medical errors are a major public health concern and a leading cause of death
worldwide. Many healthcare centers and hospitals use reporting systems where
medical practitioners write a preliminary medical report and the report is
later reviewed, revised, and finalized by a more experienced physician. The
revisions range from stylistic to corrections of critical errors or
misinterpretations of the case. Due to the large quantity of reports written
daily, it is often difficult to manually and thoroughly review all the
finalized reports to find such errors and learn from them. To address this
challenge, we propose a novel ranking approach, consisting of textual and
ontological overlaps between the preliminary and final versions of reports. The
approach learns to rank the reports based on the degree of discrepancy between
the versions. This allows medical practitioners to easily identify and learn
from the reports in which their interpretation most substantially differed from
that of the attending physician (who finalized the report). This is a crucial
step towards uncovering potential errors and helping medical practitioners to
learn from such errors, thus improving patient-care in the long run. We
evaluate our model on a dataset of radiology reports and show that our approach
outperforms both previously-proposed approaches and more recent language models
by 4.5% to 15.4%.
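The core signal can be illustrated with a hedged toy: the paper learns a
ranking model over richer textual and ontological features, whereas the
tokenizer and the 1 - Jaccard score here are illustrative assumptions.

import re

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def discrepancy(preliminary, final):
    """1 - Jaccard overlap of the two reports' token sets: lower overlap
    means a larger revision, hence a higher rank."""
    a, b = tokens(preliminary), tokens(final)
    return 1.0 - len(a & b) / max(len(a | b), 1)

reports = [
    ("no acute findings", "no acute cardiopulmonary findings"),
    ("possible small effusion",
     "large left pleural effusion with compression"),
]
ranked = sorted(reports, key=lambda pf: discrepancy(*pf), reverse=True)
for pre, fin in ranked:
    print(f"{discrepancy(pre, fin):.2f}  {pre!r} -> {fin!r}")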
Random-walk Based Generative Model for Classifying Document Networks
Comments: 7pages, 3 figures
Subjects:
Physics and Society (physics.soc-ph)
; Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Document networks are found in various collections of real-world data, such
as citation networks, hyperlinked web pages, and online social networks. A
large number of generative models have been proposed because they offer
intuitive and useful pictures for analyzing document networks. Prominent
examples are relational topic models, where documents are linked according to
their topic similarities. However, existing generative models do not make full
use of network structures because they are largely dependent on topic modeling
of documents. In particular, centrality of graph nodes is missing in generative
processes of previous models. In this paper, we propose a novel generative
model for document networks by introducing random walkers on networks to
integrate the node centrality into link generation processes. The developed
method is evaluated in semi-supervised classification tasks with real-world
citation networks. We show that the proposed model outperforms existing
probabilistic approaches especially in detecting communities in connected
networks.
Finding temporal patterns using algebraic fingerprints
Suhas Thejaswi , Aristides Gionis Subjects : Data Structures and Algorithms (cs.DS) ; Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
In this paper we study a family of pattern-detection problems in
vertex-colored temporal graphs. In particular, given a vertex-colored temporal
graph and a multi-set of colors as a query, we search for temporal paths in the
graph that contain the colors specified in the query. These types of problems
have several interesting applications, for example, recommending tours for
tourists, or searching for abnormal behavior in a network of financial
transactions.
For the family of pattern-detection problems we define, we establish
complexity results and design an algebraic-algorithmic framework based on
constrained multilinear sieving. We demonstrate that our solution can scale to
massive graphs with up to hundred million edges, despite the problems being
NP-hard. Our implementation, which is publicly available, is highly optimized
and exhibits practical edge-linear scalability. For example, in a real-world
graph dataset with more than six million edges and a multi-set query with ten
colors, we can extract an optimal solution in less than eight minutes on a
Haswell desktop with four cores.
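For intuition about the problem itself, here is a hedged brute-force sketch;
the paper's constrained multilinear sieving is what makes the problem
tractable at scale, so this exponential DFS only illustrates what is being
searched for.

from collections import Counter, defaultdict

def find_temporal_path(edges, colors, query):
    """edges: (u, v, t) triples; colors: vertex -> color; query: a color
    multiset. Returns a time-respecting simple path whose vertex colors
    match the query exactly, or None."""
    need = Counter(query)
    adj = defaultdict(list)
    for u, v, t in edges:
        adj[u].append((v, t))

    def dfs(v, t, remaining, path):
        if remaining[colors[v]] == 0:
            return None
        remaining[colors[v]] -= 1
        path.append(v)
        if sum(remaining.values()) == 0:
            return list(path)
        for w, t2 in adj[v]:
            if t2 > t and w not in path:      # strictly increasing times
                found = dfs(w, t2, remaining, path)
                if found:
                    return found
        path.pop()
        remaining[colors[v]] += 1
        return None

    for start in colors:
        found = dfs(start, float("-inf"), Counter(need), [])
        if found:
            return found
    return None

edges = [(1, 2, 5), (2, 3, 7), (1, 3, 4)]
colors = {1: "red", 2: "blue", 3: "red"}
print(find_temporal_path(edges, colors, ["red", "blue"]))  # e.g. [1, 2]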
Audio Summarization with Audio Features and Probability Distribution Divergence
Comments: 20th International Conference on Computational Linguistics and Intelligent Text Processing
Subjects:
Computation and Language (cs.CL)
; Information Retrieval (cs.IR)
The automatic summarization of multimedia sources is an important task that
facilitates a user's understanding by condensing the source while maintaining
relevant information. In this paper we focus on audio summarization based on
audio features and probability distribution divergence. Our method, based on
an extractive summarization approach, aims to select the most relevant
segments until a time threshold is reached. It takes into account each
segment's length, position, and informativeness value. The informativeness of
each segment is obtained by mapping a set of audio features, derived from its
Mel-frequency cepstral coefficients, to a corresponding Jensen-Shannon
divergence score. Results over a multi-evaluator scheme show that our approach
provides understandable and informative summaries.
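A hedged sketch of the informativeness computation: how MFCCs are turned into
probability vectors, the segment length, and the synthetic input are
illustrative assumptions, not the authors' exact pipeline.

import numpy as np
import librosa
from scipy.spatial.distance import jensenshannon

def mfcc_distribution(y, sr):
    """Mean MFCC vector, shifted and normalised into a probability vector."""
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    m = m - m.min() + 1e-9
    return m / m.sum()

def segment_scores(y, sr, seg_seconds=5.0):
    """Score segments by Jensen-Shannon divergence from the whole recording;
    higher divergence = more distinctive, hence more informative."""
    global_dist = mfcc_distribution(y, sr)
    hop = int(seg_seconds * sr)
    return [(start / sr,
             jensenshannon(mfcc_distribution(y[start:start + hop], sr),
                           global_dist))
            for start in range(0, len(y) - hop + 1, hop)]

sr = 22050
t = np.linspace(0, 20, 20 * sr, endpoint=False)
y = np.sin(2 * np.pi * 220 * t) + 0.1 * np.random.randn(t.size)  # toy audio
for start, score in sorted(segment_scores(y, sr), key=lambda x: -x[1])[:3]:
    print(f"segment at {start:.1f}s: divergence {score:.3f}")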
SlideImages: A Dataset for Educational Image Classification
Comments: 8 pages, 2 figures, to be presented at ECIR 2020
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
; Information Retrieval (cs.IR)
In the past few years, convolutional neural networks (CNNs) have achieved
impressive results in computer vision tasks, which however mainly focus on
photos with natural scene content. Besides, non-sensor derived images such as
illustrations, data visualizations, figures, etc. are typically used to convey
complex information or to explore large datasets. However, this kind of image
has received little attention in computer vision. CNNs and similar techniques
use large volumes of training data. Currently, many document analysis systems
are trained in part on scene images due to the lack of large datasets of
educational image data. In this paper, we address this issue and present
SlideImages, a dataset for the task of classifying educational illustrations.
SlideImages contains training data collected from various sources, e.g.,
Wikimedia Commons and the AI2D dataset, and test data collected from
educational slides. We have reserved all the actual educational images as a
test dataset in order to ensure that the approaches using this dataset
generalize well to new educational images, and potentially other domains.
Furthermore, we present a baseline system using a standard deep neural
architecture and discuss dealing with the challenge of limited training data.
Stacked Adversarial Network for Zero-Shot Sketch based Image Retrieval
Comments: Accepted in WACV’2020
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
; Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Conventional approaches to Sketch-Based Image Retrieval (SBIR) assume that
the data of all the classes are available during training. The assumption may
not always be practical since the data of a few classes may be unavailable, or
the classes may not appear at the time of training. Zero-Shot Sketch-Based
Image Retrieval (ZS-SBIR) relaxes this constraint and allows the algorithm to
handle previously unseen classes at test time. This paper proposes a
generative approach based on a Stacked Adversarial Network (SAN), combined
with the advantages of a Siamese Network (SN), for ZS-SBIR. While the SAN
generates high-quality samples, the SN learns a better distance metric than
that of a nearest neighbor search. The capability of the generative model to
synthesize image features based on the sketch reduces the SBIR problem to an
image-to-image retrieval problem. We evaluate the efficacy of our proposed
approach on the TU-Berlin and Sketchy databases in both the standard ZSL and
generalized ZSL setting. The proposed method yields a significant improvement
in standard ZSL as well as in a more challenging generalized ZSL setting (GZSL)
for SBIR.
Adaptive Parameterization for Neural Dialogue Generation
Comments: Published as a long paper in EMNLP 2019
Subjects:
Computation and Language (cs.CL)
; Information Retrieval (cs.IR); Machine Learning (cs.LG)
Neural conversation systems generate responses based on the
sequence-to-sequence (SEQ2SEQ) paradigm. Typically, the model is equipped with
a single set of learned parameters to generate responses for given input
contexts. When confronting diverse conversations, its adaptability is rather
limited and the model is hence prone to generate generic responses. In this
work, we propose an {f Ada}ptive {f N}eural {f D}ialogue generation
model, extsc{AdaND}, which manages various conversations with
conversation-specific parameterization. For each conversation, the model
generates parameters of the encoder-decoder by referring to the input context.
In particular, we propose two adaptive parameterization mechanisms: a
context-aware and a topic-aware parameterization mechanism. The context-aware
parameterization directly generates the parameters by capturing local semantics
of the given context. The topic-aware parameterization enables parameter
sharing among conversations with similar topics by first inferring the latent
topics of the given context and then generating the parameters with respect to
the distributional topics. Extensive experiments conducted on a large-scale
real-world conversational dataset show that our model achieves superior
performance in terms of both quantitative metrics and human evaluations.
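The context-aware parameterization idea can be sketched as a small
hypernetwork: an encoder of the conversation context emits the weights of a
layer, so each conversation effectively gets its own parameters. All sizes
below are illustrative assumptions; this is not the AdaND architecture.

import torch
import torch.nn as nn

class ContextHyperLayer(nn.Module):
    def __init__(self, ctx_dim=32, in_dim=16, out_dim=16):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # Hypernetwork: maps a context vector to a layer's weights + bias.
        self.hyper = nn.Linear(ctx_dim, in_dim * out_dim + out_dim)

    def forward(self, x, ctx):
        params = self.hyper(ctx)                   # (batch, in*out + out)
        W = params[:, : self.in_dim * self.out_dim]
        b = params[:, self.in_dim * self.out_dim :]
        W = W.view(-1, self.out_dim, self.in_dim)  # per-example weights
        return torch.bmm(W, x.unsqueeze(-1)).squeeze(-1) + b

layer = ContextHyperLayer()
x = torch.randn(4, 16)       # token/state features
ctx = torch.randn(4, 32)     # encoded conversation context
print(layer(x, ctx).shape)   # torch.Size([4, 16])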
SEPT: Improving Scientific Named Entity Recognition with Span Representation
Tan Yan , Heyan Huang , Xian-Ling Mao Subjects : Computation and Language (cs.CL) ; Information Retrieval (cs.IR)
We introduce a new scientific named entity recognizer called SEPT, which
stands for Span Extractor with Pre-trained Transformers. In recent papers, span
extractors have been demonstrated to be a powerful model compared with sequence
labeling models. However, we discover that with the development of pre-trained
language models, the performance of span extractors appears to become similar
to sequence labeling models. To keep the advantages of span representation, we
modified the model by under-sampling to balance the positive and negative
samples and reduce the search space. Furthermore, we simplify the original
network architecture to combine the span extractor with BERT. Experiments
demonstrate that even the simplified architecture achieves the same
performance, and SEPT achieves a new state-of-the-art result in scientific
named entity recognition even without relation information involved.
Computation and Language
Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference
Timo Schick , Hinrich Schütze Subjects : Computation and Language (cs.CL)
Some NLP tasks can be solved in a fully unsupervised fashion by providing a
pretrained language model with “task descriptions” in natural language (e.g.,
Radford et al., 2019). While this approach underperforms its supervised
counterpart, we show in this work that the two ideas can be combined: We
introduce Pattern-Exploiting Training (PET), a semi-supervised training
procedure that reformulates input examples as cloze-style phrases which help
the language model understand the given task. These phrases are then used to
assign soft labels to a large set of unlabeled examples. Finally, regular
supervised training is performed on the resulting training set. On several
tasks, we show that PET outperforms both supervised training and unsupervised
approaches in low-resource settings by a large margin.
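The cloze-reformulation step can be sketched with HuggingFace's fill-mask
pipeline. The pattern, the verbalizer mapping tokens to labels, and the model
choice are illustrative assumptions; PET additionally distills the resulting
soft labels into a regular supervised model.

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
VERBALIZER = {"great": "positive", "terrible": "negative"}

def classify(text):
    # Wrap the input in a cloze pattern and score only verbalizer tokens.
    cloze = f"{text} All in all, it was [MASK]."
    preds = unmasker(cloze, targets=list(VERBALIZER))
    best = max(preds, key=lambda p: p["score"])
    return VERBALIZER[best["token_str"].strip()], best["score"]

print(classify("The plot was gripping and the acting superb."))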
Improving Interaction Quality Estimation with BiLSTMs and the Impact on Dialogue Policy Learning
Comments: Published at SIGDIAL 2019
Subjects:
Computation and Language (cs.CL)
; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Learning suitable and well-performing dialogue behaviour in statistical
spoken dialogue systems has been the focus of research for many years. While
most work which is based on reinforcement learning employs an objective measure
like task success for modelling the reward signal, we use a reward based on
user satisfaction estimation. We propose a novel estimator and show that it
outperforms all previous estimators while learning temporal dependencies
implicitly. Furthermore, we apply this novel user satisfaction estimation model
live in simulated experiments where the satisfaction estimation model is
trained on one domain and applied in many other domains which cover a similar
task. We show that applying this model results in higher estimated
satisfaction, similar task success rates and a higher robustness to noise.
Generating Sense Embeddings for Syntactic and Semantic Analogy for Portuguese
Comments: 14 pages, STIL 2019 Full paper
Journal-ref: STIL XII (2019) 104-113
Subjects:
Computation and Language (cs.CL)
Word embeddings are numerical vectors which can represent words or concepts
in a low-dimensional continuous space. These vectors are able to capture useful
syntactic and semantic information. The traditional approaches like Word2Vec,
GloVe, and FastText have a serious drawback: they produce a single vector
representation per word ignoring the fact that ambiguous words can assume
different meanings. In this paper we use techniques to generate sense
embeddings and present the first experiments carried out for Portuguese. Our
experiments show that sense vectors outperform traditional word vectors in
syntactic and semantic analogy tasks, showing that the language resource
generated here can improve the performance of NLP tasks in Portuguese.
Domain-Aware Dialogue State Tracker for Multi-Domain Dialogue Systems
Vevake Balaraman , Bernardo Magnini Subjects : Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI)
In task-oriented dialogue systems the dialogue state tracker (DST) component
is responsible for predicting the state of the dialogue based on the dialogue
history. Current DST approaches rely on a predefined domain ontology, a fact
that limits their effective usage for large scale conversational agents, where
the DST constantly needs to be interfaced with ever-increasing services and
APIs. To overcome this drawback, we propose a domain-aware
dialogue state tracker that is completely data-driven and is modeled to
predict dynamic service schemas. The proposed model utilizes domain and
slot information to extract both domain and slot specific representations for a
given dialogue, and then uses such representations to predict the values of the
corresponding slot. Integrating this mechanism with a pretrained language model
(i.e. BERT), our approach can effectively learn semantic relations.
A Physical Embedding Model for Knowledge Graphs
Comments: 9th Joint International Conference, JIST 2019, Hangzhou, China
Subjects:
Computation and Language (cs.CL)
Knowledge graph embedding methods learn continuous vector representations for
entities in knowledge graphs and have been used successfully in a large number
of applications. We present a novel and scalable paradigm for the computation
of knowledge graph embeddings, which we dub PYKE. Our approach combines a
physical model based on Hooke’s law and its inverse with ideas from simulated
annealing to compute embeddings for knowledge graphs efficiently. We prove that
PYKE achieves a linear space complexity. While the time complexity for the
initialization of our approach is quadratic, the time complexity of each of its
iterations is linear in the size of the input knowledge graph. Hence, PYKE’s
overall runtime is close to linear. Consequently, our approach easily scales up
to knowledge graphs containing millions of triples. We evaluate our approach
against six state-of-the-art embedding approaches on the DrugBank and DBpedia
datasets in two series of experiments. The first series shows that the cluster
purity achieved by PYKE is up to 26% (absolute) better than that of the state
of the art. In addition, PYKE is more than 22 times faster than existing embedding
solutions in the best case. The results of our second series of experiments
show that PYKE is up to 23% (absolute) better than the state of the art on the
task of type prediction while maintaining its superior scalability. Our
implementation and results are open-source and are available at
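A hedged toy of the physical intuition (not the released implementation):
connected entities attract like Hookean springs, sampled unconnected pairs
repel, and an annealing-style decay shrinks the step size over iterations.
All constants here are assumptions.

import numpy as np

def embed(n, edges, dim=2, iters=200, k_att=0.05, k_rep=0.05, decay=0.98):
    rng = np.random.default_rng(0)
    X = rng.normal(size=(n, dim))
    step = 1.0
    edge_set = set(edges)
    for _ in range(iters):
        for u, v in edges:                   # Hooke's-law attraction
            d = X[v] - X[u]
            X[u] += step * k_att * d
            X[v] -= step * k_att * d
        for _ in range(len(edges)):          # sampled repulsion
            u, v = rng.integers(n), rng.integers(n)
            if u != v and (u, v) not in edge_set:
                d = X[v] - X[u]
                norm2 = (d @ d) + 1e-9
                X[u] -= step * k_rep * d / norm2
                X[v] += step * k_rep * d / norm2
        step *= decay                        # annealing-style cooling
    return X

X = embed(5, [(0, 1), (1, 2), (3, 4)])
# connected pair (0, 1) should end up closer than unconnected (0, 3)
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X[0] - X[3]))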
Length-controllable Abstractive Summarization by Guiding with Summary Prototype
Itsumi Saito , Kyosuke Nishida , Kosuke Nishida , Atsushi Otsuka , Hisako Asano , Junji Tomita , Hiroyuki Shindo , Yuji Matsumoto Subjects : Computation and Language (cs.CL)
We propose a new length-controllable abstractive summarization model. Recent
state-of-the-art abstractive summarization models based on encoder-decoder
models generate only one summary per source text. However, controllable
summarization, especially of the length, is an important aspect for practical
applications. Previous studies on length-controllable abstractive summarization
incorporate length embeddings in the decoder module for controlling the summary
length. Although the length embeddings can control where to stop decoding, they
do not decide which information should be included in the summary within the
length constraint. Unlike the previous models, our length-controllable
abstractive summarization model incorporates a word-level extractive module in
the encoder-decoder model instead of length embeddings. Our model generates a
summary in two steps. First, our word-level extractor extracts a sequence of
important words (we call it the “prototype text”) from the source text
according to the word-level importance scores and the length constraint.
Second, the prototype text is used as additional input to the encoder-decoder
model, which generates a summary by jointly encoding and copying words from
both the prototype text and source text. Since the prototype text is a guide to
both the content and length of the summary, our model can generate an
informative and length-controlled summary. Experiments with the CNN/Daily Mail
dataset and the NEWSROOM dataset show that our model outperformed previous
models in length-controlled settings.
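The prototype-extraction step can be sketched as follows, with a toy
frequency heuristic standing in for the learned word-level importance scores;
the extracted prototype would then condition the encoder-decoder.

from collections import Counter

def extract_prototype(source_tokens, budget):
    """Keep the highest-scoring words, in original order, until the length
    budget is reached (toy scores: raw word frequency)."""
    freq = Counter(source_tokens)
    scored = sorted(range(len(source_tokens)),
                    key=lambda i: freq[source_tokens[i]], reverse=True)
    keep = sorted(scored[:budget])           # restore original order
    return [source_tokens[i] for i in keep]

src = ("the model generates a summary in two steps the model first "
       "extracts important words then the model copies words").split()
print(extract_prototype(src, budget=6))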
A Hierarchical Location Normalization System for Text
Comments: 7 pages, submitted to conference
Subjects:
Computation and Language (cs.CL)
Nowadays, people naturally learn about local events from massive collections
of documents. Many texts contain location information, such as city or road
names, which is often incomplete or latent. It is therefore important to
extract the administrative area of a text and organize the hierarchy of areas,
a task called location normalization. Existing location detection systems
either exclude hierarchical normalization or cover only a few specific
regions. We propose a system named ROIBase that normalizes text against the
Chinese hierarchical administrative divisions. ROIBase adopts a co-occurrence
constraint as its basic framework to score candidate administrative areas,
performs inference with special embeddings, and expands recall via ROIs
(regions of interest). It has high efficiency and interpretability because it
is mainly built on definite knowledge and has less complex logic than
supervised models. We demonstrate that ROIBase achieves better performance
than feasible alternative solutions and is useful as a strong support system
for location normalization.
Multi-level Head-wise Match and Aggregation in Transformer for Textual Sequence Matching
Comments: AAAI 2020, 8 pages
Subjects:
Computation and Language (cs.CL)
Transformer has been successfully applied to many natural language processing
tasks. However, for textual sequence matching, simple matching between the
representation of a pair of sequences might bring in unnecessary noise. In this
paper, we propose a new approach to sequence pair matching with Transformer, by
learning head-wise matching representations on multiple levels. Experiments
show that our proposed approach can achieve new state-of-the-art performance on
multiple tasks that rely only on pre-computed sequence-vector-representation,
such as SNLI, MNLI-match, MNLI-mismatch, QQP, and SQuAD-binary.
Text-based inference of moral sentiment change
Comments: In Proceedings of EMNLP 2019
Subjects:
Computation and Language (cs.CL)
We present a text-based framework for investigating moral sentiment change of
the public via longitudinal corpora. Our framework is based on the premise that
language use can inform people’s moral perception toward right or wrong, and we
build our methodology by exploring moral biases learned from diachronic word
embeddings. We demonstrate how a parameter-free model supports inference of
historical shifts in moral sentiment toward concepts such as slavery and
democracy over centuries at three incremental levels: moral relevance, moral
polarity, and fine-grained moral dimensions. We apply this methodology to
visualizing moral time courses of individual concepts and analyzing the
relations between psycholinguistic variables and rates of moral sentiment
change at scale. Our work offers opportunities for applying natural language
processing toward characterizing moral sentiment change in society.
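One simple way to realize the polarity level of such a framework is to
project a concept's vector, in each decade's embedding space, onto an axis
between moral seed centroids. The seed sets and the random stand-in vectors
below are illustrative assumptions, and the paper's parameter-free models are
richer (covering relevance, polarity, and fine-grained dimensions).

import numpy as np

def moral_polarity(word, vectors, pos_seeds, neg_seeds):
    """Cosine projection of a word onto the negative-to-positive moral axis."""
    axis = (np.mean([vectors[w] for w in pos_seeds], axis=0)
            - np.mean([vectors[w] for w in neg_seeds], axis=0))
    v = vectors[word]
    return float(v @ axis /
                 (np.linalg.norm(v) * np.linalg.norm(axis) + 1e-9))

rng = np.random.default_rng(1)
vocab = ["good", "kind", "bad", "cruel", "slavery"]
vectors = {w: rng.normal(size=50) for w in vocab}   # stand-in embeddings
score = moral_polarity("slavery", vectors, ["good", "kind"],
                       ["bad", "cruel"])
print(f"polarity: {score:+.3f}")   # repeat per decade to trace change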
Recommending Themes for Ad Creative Design via Visual-Linguistic Representations
Comments: 7 pages, 8 figures, 2 tables, accepted by The Web Conference 2020
Subjects:
Computation and Language (cs.CL)
; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
There is a perennial need in the online advertising industry to refresh ad
creatives, i.e., images and text used for enticing online users towards a
brand. Such refreshes are required to reduce the likelihood of ad fatigue among
online users, and to incorporate insights from other successful campaigns in
related product categories. For a given brand, coming up with themes for a new
ad is a painstaking and time-consuming process for creative strategists.
Strategists typically draw inspiration from the images and text used for past
ad campaigns, as well as world knowledge on the brands. To automatically infer
ad themes via such multimodal sources of information in past ad campaigns, we
propose a theme (keyphrase) recommender system for ad creative strategists. The
theme recommender is based on aggregating results from a visual question
answering (VQA) task, which ingests the following: (i) ad images, (ii) text
associated with the ads as well as Wikipedia pages on the brands in the ads,
and (iii) questions around the ad. We leverage transformer-based cross-modality
encoders to train visual-linguistic representations for our VQA task. We study
two formulations for the VQA task along the lines of classification and
ranking; via experiments on a public dataset, we show that cross-modal
representations lead to significantly better classification accuracy and
ranking precision-recall metrics. Cross-modal representations show better
performance compared to separate image and text representations. In addition,
the use of multimodal information shows a significant lift over using only
textual or visual information.
Audio Summarization with Audio Features and Probability Distribution Divergence
Comments: 20th International Conference on Computational Linguistics and Intelligent Text Processing
Subjects:
Computation and Language (cs.CL)
; Information Retrieval (cs.IR)
The automatic summarization of multimedia sources is an important task that
facilitates a user's understanding by condensing the source while maintaining
relevant information. In this paper we focus on audio summarization based on
audio features and probability distribution divergence. Our method, based on
an extractive summarization approach, aims to select the most relevant
segments until a time threshold is reached. It takes into account each
segment's length, position, and informativeness value. The informativeness of
each segment is obtained by mapping a set of audio features, derived from its
Mel-frequency cepstral coefficients, to a corresponding Jensen-Shannon
divergence score. Results over a multi-evaluator scheme show that our approach
provides understandable and informative summaries.
Nested-Wasserstein Self-Imitation Learning for Sequence Generation
Comments: Accepted by AISTATS2020
Subjects:
Computation and Language (cs.CL)
; Machine Learning (cs.LG)
Reinforcement learning (RL) has been widely studied for improving
sequence-generation models. However, the conventional rewards used for RL
training typically cannot capture sufficient semantic information and
therefore induce model bias. Further, the sparse and delayed rewards make RL
exploration
inefficient. To alleviate these issues, we propose the concept of
nested-Wasserstein distance for distributional semantic matching. To further
exploit it, a novel nested-Wasserstein self-imitation learning framework is
developed, encouraging the model to exploit historical high-rewarded sequences
for enhanced exploration and better semantic matching. Our solution can be
understood as approximately executing proximal policy optimization with
Wasserstein trust-regions. Experiments on a variety of unconditional and
conditional sequence-generation tasks demonstrate the proposed approach
consistently leads to improved performance.
A multimodal deep learning approach for named entity recognition from social media
Meysam Asgari-Chenaghlu , M.Reza Feizi-Derakhshi , Leili Farzinvash , Cina Motamed Subjects : Computation and Language (cs.CL) ; Machine Learning (cs.LG); Multimedia (cs.MM); Social and Information Networks (cs.SI)
Named Entity Recognition (NER) from social media posts is a challenging task.
User-generated content, which forms the nature of social media, is noisy and
contains grammatical and linguistic errors. This noisy content makes tasks
such as named entity recognition much harder. However, some applications, like
automatic journalism or information retrieval from social media, require more
information about the entities mentioned in groups of social media posts.
Conventional methods provide acceptable results when applied to structured
and well-typed documents, but they are not satisfactory on new user-generated
media. One valuable piece of information about an entity is the image related
to the text. Combining this multimodal data reduces ambiguity and provides
wider information about the entities mentioned. To address this issue, we
propose a novel approach utilizing multimodal deep learning. Our solution is
able to provide more accurate results on the named entity
recognition task. Experimental results, namely the precision, recall and F1
score metrics show the superiority of our work compared to other
state-of-the-art NER solutions.
From Speech-to-Speech Translation to Automatic Dubbing
Comments: 5 pages, 4 figures
Subjects:
Computation and Language (cs.CL)
; Sound (cs.SD); Audio and Speech Processing (eess.AS)
We present enhancements to a speech-to-speech translation pipeline in order
to perform automatic dubbing. Our architecture features neural machine
translation generating output of preferred length, prosodic alignment of the
translation with the original speech segments, neural text-to-speech with fine
tuning of the duration of each utterance, and, finally, audio rendering that
enriches the text-to-speech output with background noise and reverberation
extracted from the original audio. We report on a subjective evaluation of
automatic dubbing of excerpts of TED Talks from English into Italian, which
measures the perceived naturalness of automatic dubbing and the relative
importance of each proposed enhancement.
Capturing Evolution in Word Usage: Just Add More Clusters?
Comments: 7 pages
Subjects:
Computation and Language (cs.CL)
The way words are used evolves through time, mirroring the cultural or
technological evolution of society. Semantic change detection is the task of
detecting and analysing word evolution in textual data, even over short
periods of time. This task has recently become popular in the NLP community. In
this paper we focus on a new set of methods relying on contextualised
embedding, a type of semantic modelling that revolutionised the field recently.
We leverage the ability of the transformer-based BERT model to generate
contextualised embeddings suitable to detect semantic change of words across
time. We compare our results to other approaches from the literature in a
common setting in order to establish strengths and weaknesses for each of them.
We also propose several ideas for improvements, managing to drastically improve
the performance of existing approaches.
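The clustering recipe can be sketched as follows, with random vectors
standing in for a word's BERT token embeddings from two periods: cluster the
pooled vectors, then compare how each period distributes its usages over the
clusters. The cluster count and the Jensen-Shannon comparison are illustrative
assumptions.

import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import jensenshannon

def usage_shift(emb_t1, emb_t2, k=5, seed=0):
    pooled = np.vstack([emb_t1, emb_t2])
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=seed).fit_predict(pooled)
    l1, l2 = labels[: len(emb_t1)], labels[len(emb_t1):]
    p = np.bincount(l1, minlength=k) / len(l1)
    q = np.bincount(l2, minlength=k) / len(l2)
    return jensenshannon(p, q)

rng = np.random.default_rng(0)
emb_1960s = rng.normal(0.0, 1.0, size=(200, 16))
emb_2010s = rng.normal(0.5, 1.0, size=(200, 16))   # shifted usage
print(f"JSD usage shift: {usage_shift(emb_1960s, emb_2010s):.3f}")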
Adaptive Parameterization for Neural Dialogue Generation
Comments: Published as a long paper in EMNLP 2019
Subjects:
Computation and Language (cs.CL)
; Information Retrieval (cs.IR); Machine Learning (cs.LG)
Neural conversation systems generate responses based on the
sequence-to-sequence (SEQ2SEQ) paradigm. Typically, the model is equipped with
a single set of learned parameters to generate responses for given input
contexts. When confronting diverse conversations, its adaptability is rather
limited and the model is hence prone to generate generic responses. In this
work, we propose an Adaptive Neural Dialogue generation
model, AdaND, which manages various conversations with
conversation-specific parameterization. For each conversation, the model
generates parameters of the encoder-decoder by referring to the input context.
In particular, we propose two adaptive parameterization mechanisms: a
context-aware and a topic-aware parameterization mechanism. The context-aware
parameterization directly generates the parameters by capturing local semantics
of the given context. The topic-aware parameterization enables parameter
sharing among conversations with similar topics by first inferring the latent
topics of the given context and then generating the parameters with respect to
the distributional topics. Extensive experiments conducted on a large-scale
real-world conversational dataset show that our model achieves superior
performance in terms of both quantitative metrics and human evaluations.
R2DE: a NLP approach to estimating IRT parameters of newly generated questions
Luca Benedetto , Andrea Cappelli , Roberto Turrin , Paolo Cremonesi Subjects : Machine Learning (cs.LG) ; Computation and Language (cs.CL); Machine Learning (stat.ML)
The main objective of exams is to assess students’
expertise on a specific subject. Such expertise, also referred to as skill or
knowledge level, can then be leveraged in different ways (e.g., to assign a
grade to the students, to understand whether a student might need some support,
etc.). Similarly, the questions appearing in the exams have to be assessed in
some way before being used to evaluate students. Standard approaches to
questions’ assessment are either subjective (e.g., assessment by human experts)
or introduce a long delay in the process of question generation (e.g.,
pretesting with real students). In this work we introduce R2DE (which is a
Regressor for Difficulty and Discrimination Estimation), a model capable of
assessing newly generated multiple-choice questions by looking at the text of
the question and the text of the possible choices. In particular, it can
estimate the difficulty and the discrimination of each question, as they are
defined in Item Response Theory. We also present the results of extensive
experiments we carried out on a real world large scale dataset coming from an
e-learning platform, showing that our model can be used to perform an initial
assessment of newly created questions and ease some of the problems that arise
in question generation.
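A hedged, minimal version of such an estimator follows (TF-IDF features plus
an off-the-shelf regressor; the toy data, labels, and model choice are
illustrative assumptions, not R2DE itself).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

# question text concatenated with its answer options, as one string each
questions = [
    "What is 2 + 2? | 3 | 4 | 5",
    "State and prove the spectral theorem. | ... | ... | ...",
    "Which planet is closest to the sun? | Mercury | Venus | Mars",
    "Derive the posterior for a Gaussian with unknown variance. | ... | ...",
]
difficulty = [-2.0, 2.5, -1.5, 2.0]   # toy IRT difficulty labels

model = make_pipeline(TfidfVectorizer(),
                      RandomForestRegressor(random_state=0))
model.fit(questions, difficulty)
print(model.predict(["What is 3 + 5? | 7 | 8 | 9"]))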
Comments: 5 pages, 2 figures
Subjects:
Audio and Speech Processing (eess.AS)
; Computation and Language (cs.CL)
It is generally believed that direct sequence-to-sequence (seq2seq) speech
recognition models are competitive with hybrid models only when a large amount
of data, at least a thousand hours, is available for training. In this paper,
we show that state-of-the-art recognition performance can be achieved on the
Switchboard-300 database using a single-headed-attention, LSTM-based model.
Using a cross-utterance language model, our single-pass speaker independent
system reaches 6.4% and 12.5% word error rate (WER) on the Switchboard and
CallHome subsets of Hub5’00, without a pronunciation lexicon. While careful
regularization and data augmentation are crucial in achieving this level of
performance, experiments on Switchboard-2000 show that nothing is more useful
than more data.
SQuINTing at VQA Models: Interrogating VQA Models with Sub-Questions
Ramprasaath R. Selvaraju , Purva Tendulkar , Devi Parikh , Eric Horvitz , Marco Ribeiro , Besmira Nushi , Ece Kamar Subjects : Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Existing VQA datasets contain questions with varying levels of complexity.
While the majority of questions in these datasets require perception for
recognizing existence, properties, and spatial relationships of entities, a
significant portion of questions pose challenges that correspond to reasoning
tasks — tasks that can only be answered through a synthesis of perception and
knowledge about the world, logic, and/or reasoning. This distinction allows us
to notice when existing VQA models have consistency issues — they answer the
reasoning question correctly but fail on associated low-level perception
questions. For example, models answer the complex reasoning question “Is the
banana ripe enough to eat?” correctly, but fail on the associated perception
question “Are the bananas mostly green or yellow?” indicating that the model
likely answered the reasoning question correctly but for the wrong reason. We
quantify the extent to which this phenomenon occurs by creating a new Reasoning
split of the VQA dataset and collecting Sub-VQA, a new dataset consisting of
200K new perception questions which serve as sub-questions corresponding to the
set of perceptual tasks needed to effectively answer the complex reasoning
questions in the Reasoning split. Additionally, we propose an approach called
Sub-Question Importance-aware Network Tuning (SQuINT), which encourages the
model to attend to the same parts of the image when answering the reasoning
question and the perception sub-questions. We show that SQuINT improves model
consistency by 7.8% and marginally improves its performance on the Reasoning
questions in VQA, while also displaying qualitatively better attention maps.
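The consistency measure discussed above can be sketched as follows (the
paper's exact definition may differ): among reasoning questions the model
answers correctly, count how often all associated perception sub-questions
are also answered correctly.

def consistency(results):
    """results: list of (reasoning_correct, [sub_question_correct, ...])."""
    answered = [(r, subs) for r, subs in results if r]
    if not answered:
        return 0.0
    consistent = sum(all(subs) for _, subs in answered)
    return consistent / len(answered)

results = [
    (True,  [True, True]),    # right answer, right perception
    (True,  [False, True]),   # right answer for the wrong reason
    (False, [True, True]),    # ignored: reasoning answer was wrong
]
print(f"consistency: {consistency(results):.2f}")   # 0.50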
Ranking Significant Discrepancies in Clinical Reports
Comments: ECIR 2020 (short)
Subjects:
Information Retrieval (cs.IR)
; Computation and Language (cs.CL)
Medical errors are a major public health concern and a leading cause of death
worldwide. Many healthcare centers and hospitals use reporting systems where
medical practitioners write a preliminary medical report and the report is
later reviewed, revised, and finalized by a more experienced physician. The
revisions range from stylistic to corrections of critical errors or
misinterpretations of the case. Due to the large quantity of reports written
daily, it is often difficult to manually and thoroughly review all the
finalized reports to find such errors and learn from them. To address this
challenge, we propose a novel ranking approach, consisting of textual and
ontological overlaps between the preliminary and final versions of reports. The
approach learns to rank the reports based on the degree of discrepancy between
the versions. This allows medical practitioners to easily identify and learn
from the reports in which their interpretation most substantially differed from
that of the attending physician (who finalized the report). This is a crucial
step towards uncovering potential errors and helping medical practitioners to
learn from such errors, thus improving patient-care in the long run. We
evaluate our model on a dataset of radiology reports and show that our approach
outperforms both previously-proposed approaches and more recent language models
by 4.5% to 15.4%.
Distributed, Parallel, and Cluster Computing
Lattice QCD on a novel vector architecture
Comments: Proceedings of Lattice 2019, 7 pages, 6 colorful figures
Subjects:
Distributed, Parallel, and Cluster Computing (cs.DC)
; High Energy Physics – Lattice (hep-lat)
The SX-Aurora TSUBASA PCIe accelerator card is the newest model of NEC’s SX
architecture family. Its multi-core vector processor features a vector length
of 16 kbits and interfaces with up to 48 GB of HBM2 memory in the current
models, available since 2018. The compute performance is up to 2.45 TFlop/s
peak in double precision, and the memory throughput is up to 1.2 TB/s peak. New
models with improved performance characteristics are announced for the near
future. In this contribution we discuss key aspects of the SX-Aurora and
describe how we enabled the architecture in the Grid Lattice QCD framework.
An IoT Platform-as-a-service for NFV Based — Hybrid Cloud / Fog Systems
Carla Mouradian , Fereshteh Ebrahimnezhad , Yassine Jebbar , Jasmeen Kaur Ahluwalia , Seyedeh Negar Afrasiabi , Roch H. Glitho , Ashok Moghe Subjects : Distributed, Parallel, and Cluster Computing (cs.DC) ; Software Engineering (cs.SE)
Cloud computing, despite its inherent advantages (e.g., resource efficiency),
still faces several challenges. The wide area network used to connect the
cloud to end-users can cause high latency, which may not be tolerable for some
applications, especially Internet of Things (IoT) applications. Fog computing
can reduce this latency by extending the traditional cloud architecture to the
edge of the network and by enabling the deployment of some application
components on fog nodes. Application providers use Platform-as-a-Service
(PaaS) to provision (i.e., develop, deploy, manage, and orchestrate)
applications in the cloud. However, existing PaaS solutions (including IoT
PaaS) usually focus on the cloud and do not enable provisioning of
applications with components spanning cloud and fog. Provisioning such
applications requires novel functions, such as application graph generation,
that are absent from existing PaaS. Furthermore, several functions offered by
existing PaaS (e.g., publication/discovery) need to be significantly extended
in order to fit a hybrid cloud/fog environment. In this paper, we propose a
novel PaaS architecture for hybrid cloud/fog systems. It is IoT
use-case-driven, and its applications' components are implemented as Virtual
Network Functions (VNFs) with execution sequences modeled as graphs with
sub-structures such as selections and loops. It automates the provisioning of
applications with components spanning cloud and fog. In addition, it enables
the discovery of existing cloud and fog nodes and generates application
graphs. A proof of concept is built on the open-source Cloudify platform.
Feasibility is demonstrated by evaluating its performance when PaaS modules
and application components are placed in clouds and fogs in different
geographical locations.
Self Organization Agent Oriented Dynamic Resource Allocation on Open Federated Clouds Environment
Journal-ref: International Conference on Cloud Computing Technologies and
Applications, CLOUDTECH 2016, May 2016, Marrakech, Morocco
Subjects:
Distributed, Parallel, and Cluster Computing (cs.DC)
To ensure uninterrupted services to cloud clients from federated cloud
providers, it is important to guarantee an efficient allocation of cloud
resources to users, to improve the rate of client satisfaction and the quality
of service provisioning, and to acquire as many computing and storage
resources as possible. In the cloud domain, several Multi-Agent Resource
Allocation methods have been proposed to address the problem of dynamic
resource allocation. However, the problem is still open and much work remains
to be done in this field. Robustness is important in cloud computing, so in
this paper we focus on an auto-adaptive method to deal with changes in an open
federated cloud computing environment. Our approach is hybrid: we first adopt
an existing organization-optimization approach for self-organization in the
broker agent organization, and combine it with an existing Multi-Agent
Resource Allocation approach on federated clouds. We consider an open
cloud-federation environment that is dynamic and in constant evolution, where
new cloud operators can join the federation or leave it. At the same time, our
approach is multi-criterion and can take into account various parameters
(e.g., the computing load balance of the mediator agent, or the geographical
distance (network delay) between customer and provider).
Serverless Straggler Mitigation using Local Error-Correcting Codes
Vipul Gupta , Dominic Carrano , Yaoqing Yang , Vaishaal Shankar , Thomas Courtade , Kannan Ramchandran Subjects : Distributed, Parallel, and Cluster Computing (cs.DC) ; Information Theory (cs.IT)
Inexpensive cloud services, such as serverless computing, are often
vulnerable to straggling nodes that increase end-to-end latency for distributed
computation. We propose and implement simple yet principled approaches for
straggler mitigation in serverless systems for matrix multiplication and
evaluate them on several common applications from machine learning and
high-performance computing. The proposed schemes are inspired by
error-correcting codes and employ parallel encoding and decoding over the data
stored in the cloud using serverless workers. This creates a fully distributed
computing framework without using a master node to conduct encoding or
decoding, which removes the computation, communication and storage bottleneck
at the master. On the theory side, we establish that our proposed scheme is
asymptotically optimal in terms of decoding time and provide a lower bound on
the number of stragglers it can tolerate with high probability. Through
extensive experiments, we show that our scheme outperforms existing schemes
such as speculative execution and other coding theoretic methods by at least
25%.
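The core coding idea can be illustrated with a single-parity code in numpy; this toy is a simplification under stated assumptions, since the paper's scheme encodes and decodes in parallel across serverless workers rather than at a master.

```python
import numpy as np

rng = np.random.default_rng(1)
A, B = rng.random((4, 3)), rng.random((3, 2))

# Encode: split A row-wise and add one parity block (simplest erasure code).
A1, A2 = A[:2], A[2:]
tasks = {"w1": A1, "w2": A2, "parity": A1 + A2}

# Each worker computes its block times B; suppose worker w2 straggles.
results = {k: v @ B for k, v in tasks.items() if k != "w2"}

# Decode without waiting for w2: recover A2 @ B from the parity result.
A2B = results["parity"] - results["w1"]
C = np.vstack([results["w1"], A2B])
assert np.allclose(C, A @ B)
```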
Transparently Capturing Request Execution Path for Anomaly Detection
Comments: 13 pages, 7 figures
Subjects:
Distributed, Parallel, and Cluster Computing (cs.DC)
With the increasing scale and complexity of cloud systems and big data
analytics platforms, it is becoming more and more challenging to understand and
diagnose the processing of a service request in such distributed platforms. One
way that helps to deal with this problem is to capture the complete end-to-end
execution path of service requests among all involved components accurately.
This paper presents REPTrace, a generic methodology for capturing such
execution paths in a transparent fashion. We analyze a comprehensive list of
execution scenarios, and propose principles and algorithms for generating the
end-to-end request execution path for all the scenarios. Moreover, this paper
presents an anomaly detection approach exploiting request execution paths to
detect anomalies of the execution during request processing. The experiments on
four popular distributed platforms with different workloads show that REPTrace
can transparently capture the accurate request execution path with reasonable
latency and negligible network overhead. Fault injection experiments show that
execution anomalies are detected with high recall (96%).
Lorenz Braun , Sotirios Nikas , Chen Song , Vincent Heuveline , Holger Fröning Subjects : Distributed, Parallel, and Cluster Computing (cs.DC) ; Machine Learning (cs.LG); Performance (cs.PF)
Characterizing compute kernel execution behavior on GPUs for efficient task
scheduling is a non-trivial task. We address this with a simple model enabling
portable and fast predictions among different GPUs using only
hardware-independent features. This model is built based on random
forests using 189 individual compute kernels from benchmarks such as Parboil,
Rodinia, Polybench-GPU and SHOC. Evaluation of the model performance using
cross-validation yields a median Mean Average Percentage Error (MAPE) of
[13.45%, 44.56%] for time and [1.81%, 2.91%] for power prediction on
five different GPUs, while latency for a single prediction varies between 0.1
and 0.2 seconds.
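A hedged sketch of such a model in scikit-learn, with synthetic stand-ins for the kernel features and timings (the real feature set and data differ):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for hardware-independent kernel features
# (e.g., instruction and memory-op counts); the paper's features differ.
rng = np.random.default_rng(2)
X = rng.random((189, 10))                       # 189 kernels, 10 features
y = X @ rng.random(10) + 0.1 * rng.random(189)  # synthetic execution times

model = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5,
                         scoring="neg_mean_absolute_percentage_error")
print(f"median MAPE: {np.median(-scores):.2%}")
```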
OpenMP Parallelization of Dynamic Programming and Greedy Algorithms
Claude Tadonki Subjects : Distributed, Parallel, and Cluster Computing (cs.DC)
Multicore has emerged as a typical architecture model since its advent and
now stands as a standard. The trend is to increase the number of cores and
improve the performance of the memory system. Providing an efficient multicore
implementation for an important algorithmic kernel is a genuine contribution.
From a methodology standpoint, this should be done at the level of the
underlying paradigm, if any. In this paper, we study the cases of dynamic
programming and greedy algorithms, which are two major algorithmic
paradigms. We exclusively consider directives-based loop parallelization
through OpenMP and investigate necessary pre-transformations to reach a regular
parallel form. We evaluate our methodology with a selection of well-known
combinatorial optimization problems on an Intel Broadwell processor. Key points
for scalability are discussed before and after experimental results. Our
immediate perspective is to extend our study to the manycore case, with a
special focus on NUMA configurations.
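One classic pre-transformation for dynamic programming is the wavefront (anti-diagonal) reordering, which turns dependent loops into a regular parallel loop over each anti-diagonal. The Python sketch below only illustrates that transformation; the paper targets OpenMP in a compiled setting, and Python threads add no real speedup here.

```python
from concurrent.futures import ThreadPoolExecutor

def edit_distance_wavefront(a, b):
    """Edit-distance DP evaluated anti-diagonal by anti-diagonal.

    Cells on one anti-diagonal have no mutual dependencies, so each
    diagonal becomes a regular parallel loop (the form OpenMP needs).
    """
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j

    def cell(i, j):
        D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1,
                      D[i - 1][j - 1] + (a[i - 1] != b[j - 1]))

    with ThreadPoolExecutor() as pool:
        for d in range(2, n + m + 1):   # d = i + j indexes an anti-diagonal
            cells = [(i, d - i) for i in range(max(1, d - m), min(n, d - 1) + 1)]
            list(pool.map(lambda ij: cell(*ij), cells))
    return D[n][m]

print(edit_distance_wavefront("kitten", "sitting"))  # 3
```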
Blockchain Consensus Algorithms: A Survey
Md Sadek Ferdous , Mohammad Jabed Morshed Chowdhury , Mohammad A. Hoque , Alan Colman Subjects : Distributed, Parallel, and Cluster Computing (cs.DC)
In recent years, blockchain technology has received unparalleled attention
from academia, industry, and governments all around the world. It is considered
a technological breakthrough anticipated to disrupt several application
domains. This has resulted in a plethora of blockchain systems for various
purposes. However, many of these blockchain systems suffer from serious
shortcomings related to their performance and security, which need to be
addressed before any wide-scale adoption can be achieved. A crucial component
of any blockchain system is its underlying consensus algorithm, which in many
ways, determines its performance and security. Therefore, to address the
limitations of different blockchain systems, several existing as well as novel
consensus algorithms have been introduced. A systematic analysis of these
algorithms will help to understand how and why any particular blockchain
performs the way it does. However, the existing studies of consensus
algorithms are not comprehensive. Those studies have incomplete discussions on
the properties of the algorithms and fail to analyse several major blockchain
consensus algorithms in terms of their scopes. This article fills this gap by
analysing a wide range of consensus algorithms using a comprehensive taxonomy
of properties and by examining the implications of different issues still
prevalent in consensus algorithms in detail. The result of the analysis is
presented in tabular formats, which provide a visual illustration of these
algorithms in a meaningful way. We have also analysed more than a hundred top
crypto-currencies belonging to different categories of consensus algorithms to
understand their properties and to identify different trends in these
crypto-currencies. Finally, we have presented a decision tree of algorithms to
be used as a tool to test the suitability of consensus algorithms under
different criteria.
2PS: High-Quality Edge Partitioning with Two-Phase Streaming
Comments: in submission
Subjects:
Distributed, Parallel, and Cluster Computing (cs.DC)
Graph partitioning is an important preprocessing step to distributed graph
processing. In edge partitioning, the edge set of a given graph is split into
(k) equally-sized partitions, such that the replication of vertices across
partitions is minimized. Streaming is a viable approach to partition graphs
that exceed the memory capacities of a single server. The graph is ingested as
a stream of edges, and one edge at a time is immediately and irrevocably
assigned to a partition based on a scoring function. However, streaming
partitioning suffers from the uninformed assignment problem: At the time of
partitioning early edges in the stream, there is no information available about
the rest of the edges. As a consequence, edge assignments are often driven by
balancing considerations, and the achieved replication factor is comparably
high. In this paper, we propose 2PS, a novel two-phase streaming algorithm for
high-quality edge partitioning. In the first phase, vertices are separated into
clusters by a lightweight streaming clustering algorithm. In the second phase,
the graph is re-streamed and edge partitioning is performed while taking into
account the clustering of the vertices from the first phase. Our evaluations
show that 2PS can achieve a replication factor that is comparable to
heavy-weight random access partitioners while inducing orders of magnitude
lower memory overhead.
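A toy Python sketch of the two phases, with heavily simplified clustering and scoring (the actual 2PS score, merge rules, and balance constraints are more elaborate):

```python
from collections import defaultdict

def stream_cluster(edges, max_size):
    """Phase 1: lightweight streaming clustering of vertices."""
    cluster, size = {}, defaultdict(int)
    for u, v in edges:
        for x in (u, v):
            if x not in cluster:
                cluster[x] = x          # start as a singleton cluster
                size[x] += 1
        cu, cv = cluster[u], cluster[v]
        if cu != cv:
            # Move the endpoint in the smaller cluster into the larger one.
            small, big = (u, cv) if size[cu] <= size[cv] else (v, cu)
            if size[big] < max_size:
                size[cluster[small]] -= 1
                cluster[small] = big
                size[big] += 1
    return cluster

def partition(edges, cluster, k):
    """Phase 2: re-stream edges; prefer the partition hosting the cluster."""
    load = [0] * k
    cap = 2 * len(edges) // k + 1
    home = {}                           # cluster -> preferred partition
    assignment = []
    for u, v in edges:
        c = cluster[u]
        p = home.setdefault(c, min(range(k), key=load.__getitem__))
        if load[p] >= cap:              # balance constraint overrides affinity
            p = min(range(k), key=load.__getitem__)
        load[p] += 1
        assignment.append(((u, v), p))
    return assignment

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
cl = stream_cluster(edges, max_size=4)
for e, p in partition(edges, cl, k=2):
    print(e, "->", p)
```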
Distributed Vehicular Computing at the Dawn of 5G: a Survey
Comments: 34 pages, 10 figures, Journal
Subjects:
Distributed, Parallel, and Cluster Computing (cs.DC)
Recent advances in information technology have revolutionized the automotive
industry, paving the way for next-generation smart and connected vehicles.
Connected vehicles can collaborate to deliver novel services and applications.
These services and applications require 1) massive volumes of data that
perceive ambient environments, 2) ultra-reliable and low-latency communication
networks, and 3) real-time data processing that provides decision support under
application-specific constraints. Addressing such constraints introduces
significant challenges with current communication and computation technologies.
Coincidentally, the fifth generation of cellular networks (5G) was developed to
respond to communication challenges by providing an infrastructure for
low-latency, high-reliability, and high bandwidth communication. At the core of
this infrastructure, edge computing allows data offloading and computation at
the edge of the network, ensuring low-latency and context-awareness, and
pushing the utilization efficiency of 5G to its limit. In this paper, we aim at
providing a comprehensive overview of the state of research on vehicular
computing in the emerging age of 5G. After reviewing the main vehicular
applications requirements and challenges, we follow a bottom-up approach,
starting with the promising technologies for vehicular communications, all the
way up to Artificial Intelligence (AI) solutions. We explore the various
architectures for vehicular computing, including centralized Cloud Computing,
Vehicular Cloud Computing, and Vehicular Edge computing, and investigate the
potential data analytics technologies and their integration on top of the
vehicular computing architectures. We finally discuss several future research
directions and applications for vehicular computation systems.
BAASH: Enabling Blockchain-as-a-Service on High-Performance Computing Systems
Abdullah Al-Mamun , Dongfang Zhao Subjects : Distributed, Parallel, and Cluster Computing (cs.DC)
The state-of-the-art approach to manage blockchains is to process blocks of
transactions in a shared-nothing environment. Although blockchains have the
potential to provide various services for high-performance computing (HPC)
systems, HPC will not be able to embrace blockchains before the following two
missing pieces become available: (i) new consensus protocols being aware of the
shared-storage architecture in HPC, and (ii) new fault-tolerant mechanisms
compensating for HPC’s programming model—the message passing interface
(MPI)—that is vulnerable to blockchain-like workloads. To this end, we
design a new set of consensus protocols crafted for the HPC platforms and a new
fault-tolerance subsystem compensating for the failures caused by faulty MPI
processes. Built on top of the new protocols and fault-tolerance mechanism, a
prototype system is implemented and evaluated with two million transactions on
a 500-core HPC cluster, showing (6\times), (12\times), and (75\times) higher
throughput than Hyperledger, Ethereum, and Parity, respectively.
Contract-connection: An efficient communication protocol for Distributed Ledger Technology
Journal-ref: 2019 IEEE 38th International Performance Computing and
Communications Conference (IPCCC)
Subjects:
Distributed, Parallel, and Cluster Computing (cs.DC)
; Cryptography and Security (cs.CR)
Distributed Ledger Technology (DLT) is promising to become the foundation of
many decentralised systems. However, the unbalanced and unregulated network
layout contributes to the inefficiency of DLT especially in the Internet of
Things (IoT) environments, where nodes connect to only a limited number of
peers. The data communication speed globally is unbalanced and does not meet
the constraints of efficient real-time distributed systems. In this paper,
we introduce a new communication protocol, which enables nodes to calculate the
tradeoff between connecting/disconnecting a peer in a completely decentralised
manner. The network layout globally is continuously re-balancing and optimising
along with nodes adjusting their peers. This communication protocol weakens
the inequality of the communication network. Experiments suggest that this
communication protocol is stable and efficient.
The Parallelism Motifs of Genomic Data Analysis
Katherine Yelick , Aydin Buluc , Muaaz Awan , Ariful Azad , Benjamin Brock , Rob Egan , Saliya Ekanayake , Marquita Ellis , Evangelos Georganas , Giulia Guidi , Steven Hofmeyr , Oguz Selvitopi , Cristina Teodoropol , Leonid Oliker Subjects : Distributed, Parallel, and Cluster Computing (cs.DC)
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large scale computational
platforms to meet both the memory and computational requirements. These
applications differ from scientific simulations that dominate the workload on
high-end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least
two key patterns, sorting and hashing, are missing.
75,000,000,000 Streaming Inserts/Second Using Hierarchical Hypersparse GraphBLAS Matrices
Comments: 3 pages, 2 figures, 28 references, accepted to Northeast Database Day (NEDB) 2020. arXiv admin note: substantial text overlap with arXiv:1907.04217
Subjects:
Distributed, Parallel, and Cluster Computing (cs.DC)
; Databases (cs.DB); Data Structures and Algorithms (cs.DS); Performance (cs.PF); Social and Information Networks (cs.SI)
The SuiteSparse GraphBLAS C-library implements high performance hypersparse
matrices with bindings to a variety of languages (Python, Julia, and
Matlab/Octave). GraphBLAS provides a lightweight in-memory database
implementation of hypersparse matrices that are ideal for analyzing many types
of network data, while providing rigorous mathematical guarantees, such as
linearity. Streaming updates of hypersparse matrices put enormous pressure on
the memory hierarchy. This work benchmarks an implementation of hierarchical
hypersparse matrices that reduces memory pressure and dramatically increases
the update rate into a hypersparse matrix. The parameters of hierarchical
hypersparse matrices rely on controlling the number of entries in each level in
the hierarchy before an update is cascaded. The parameters are easily tunable
to achieve optimal performance for a variety of applications. Hierarchical
hypersparse matrices achieve over 1,000,000 updates per second in a single
instance. Scaling to 31,000 instances of hierarchical hypersparse matrix
arrays on 1,100 server nodes on the MIT SuperCloud achieved a sustained update
rate of 75,000,000,000 updates per second. This capability allows the MIT
SuperCloud to analyze extremely large streaming network data sets.
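A minimal Python sketch of the cascading idea, using counters as stand-ins for hypersparse matrices (the real implementation is the SuiteSparse GraphBLAS C library; the cutoff values here are arbitrary):

```python
from collections import Counter

class HierarchicalHypersparse:
    """Stack of associative arrays; a level cascades into the next
    when its entry count exceeds a tunable cutoff."""

    def __init__(self, cuts=(4, 16, 64)):
        self.levels = [Counter() for _ in cuts] + [Counter()]
        self.cuts = cuts

    def update(self, key, value=1):
        self.levels[0][key] += value
        for lvl, cut in enumerate(self.cuts):
            if len(self.levels[lvl]) > cut:        # too many entries: cascade
                self.levels[lvl + 1].update(self.levels[lvl])
                self.levels[lvl].clear()

    def total(self):
        out = Counter()
        for lvl in self.levels:
            out.update(lvl)
        return out

h = HierarchicalHypersparse()
for i in range(1000):
    h.update((i % 37, i % 53))     # (row, col) coordinates of a sparse update
print(sum(h.total().values()))     # all 1000 updates preserved
```

Frequent updates land in the small top level, which fits in fast memory; only occasional cascades touch the larger, slower levels.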
CycLedger: A Scalable and Secure Parallel Protocol for Distributed Ledger via Sharding
Comments: short version in IPDPS 2020
Subjects:
Distributed, Parallel, and Cluster Computing (cs.DC)
; Cryptography and Security (cs.CR)
Traditional public distributed ledgers have not been able to scale out well
and work efficiently. Sharding is deemed a promising way to solve this
problem. By partitioning all nodes into small committees and letting them work
in parallel, we can significantly lower the amount of communication and
computation, reduce the overhead on each node’s storage, and enhance the
throughput of the distributed ledger. Existing sharding-based protocols still
suffer from several serious drawbacks. First, all honest nodes must be well
connected with each other, which demands a huge number of communication
channels in the network. Moreover, previous protocols suffer a great loss in
efficiency when the honesty of each committee’s leader is in question. At the
same time, no explicit incentive is provided for nodes to actively participate
in the protocol.
We present CycLedger, a scalable and secure parallel protocol for distributed
ledger via sharding. Our protocol selects a leader and a partial set for each
committee, who are in charge of maintaining intra-shard consensus and
communicating with other committees, to reduce the amortized complexity of
communication, computation and storage on all nodes. We introduce a novel
commitment scheme between committees and a recovery procedure to prevent the
system from crashing even when leaders of committees are malicious. To add
incentive for the network, we use the concept of reputation, which measures
each node’s computing power. As nodes with higher reputation receive more
rewards, there is an encouragement for nodes with strong computing ability to
work honestly so as to gain reputation. In this way, we strike out a new path
to establish scalability, security and incentive for the sharding-based
distributed ledger.
Nicolas Aussel (INF, ACMES-SAMOVAR, IP Paris), Sophie Chabridon (IP Paris, INF, ACMES-SAMOVAR), Yohan Petetin (TIPIC-SAMOVAR, CITI, IP Paris) Subjects : Artificial Intelligence (cs.AI) ; Distributed, Parallel, and Cluster Computing (cs.DC)
Machine Learning has proven useful in recent years as a way to achieve
failure prediction for industrial systems. However, the high computational
resources necessary to run learning algorithms are an obstacle to its
widespread application. The sub-field of Distributed Learning offers a solution
to this problem by enabling the use of remote resources but at the expense of
introducing communication costs in the application that are not always
acceptable. In this paper, we propose a distributed learning approach able to
optimize the use of computational and communication resources to achieve
excellent learning-model performance through a centralized architecture. To
achieve this, we present a new centralized distributed learning algorithm that
relies on the learning paradigms of Active Learning and Federated Learning to
offer a communication-efficient method that offers guarantees of model
precision on both the clients and the central server. We evaluate this method
on a public benchmark and show that its performance in terms of precision is
very close to the state-of-the-art level of non-distributed learning
despite additional constraints.
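A hedged numpy sketch of combining federated averaging with an active-learning style confidence filter; the threshold, local model, and update rule are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

def client_round(model_w, X, y, threshold=0.8):
    """One client step: train locally, forward only low-confidence samples.

    Sending just the weight update plus uncertain samples (active learning)
    keeps communication low while letting the server refine the global model.
    """
    # One gradient-descent step of local logistic regression.
    p = 1 / (1 + np.exp(-(X @ model_w)))
    grad = X.T @ (p - y) / len(y)
    new_w = model_w - 0.1 * grad

    confidence = np.maximum(p, 1 - p)
    uncertain = X[confidence < threshold]   # queried for server-side use
    return new_w, uncertain

rng = np.random.default_rng(3)
w = np.zeros(5)
clients = [(rng.random((50, 5)), rng.integers(0, 2, 50)) for _ in range(4)]
updates, queried = zip(*(client_round(w, X, y) for X, y in clients))
w = np.mean(updates, axis=0)                # federated averaging
print(w.shape, sum(len(q) for q in queried), "uncertain samples sent")
```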
Fast Sequence-Based Embedding with Diffusion Graphs
Comments: Source code available at: this https URL
Journal-ref: CompleNet 2018
Subjects:
Machine Learning (cs.LG)
; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
A graph embedding is a representation of graph vertices in a low-dimensional
space, which approximately preserves properties such as distances between
nodes. Vertex sequence-based embedding procedures use features extracted from
linear sequences of nodes to create embeddings using a neural network. In this
paper, we propose diffusion graphs as a method to rapidly generate vertex
sequences for network embedding. Its computational efficiency is superior to
previous methods due to simpler sequence generation, and it produces more
accurate results. In experiments, we found that the performance relative to
other methods improves with increasing edge density in the graph. In a
community detection task, clustering nodes in the embedding space produces
better results compared to other sequence-based embedding methods.
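A small Python sketch of generating vertex sequences by growing a diffusion subgraph; the exact traversal in the paper differs, so treat this as a conceptual stand-in:

```python
import random

def diffusion_sequence(adj, start, length, rng=random.Random(0)):
    """Grow a diffusion subgraph from `start`; the order in which
    vertices are absorbed is the sequence fed to the embedder."""
    seq, frontier = [start], {start}
    while len(seq) < length:
        candidates = [v for u in frontier for v in adj[u] if v not in frontier]
        if not candidates:
            break
        v = rng.choice(candidates)   # expand the diffusion by one vertex
        frontier.add(v)
        seq.append(v)
    return seq

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}
walks = [diffusion_sequence(adj, v, 5) for v in adj]
print(walks)  # sequences would be fed to a word2vec-style model
```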
Comments: 26 pages, 18 figures
Subjects:
Cryptography and Security (cs.CR)
; Distributed, Parallel, and Cluster Computing (cs.DC)
In today’s connected world, resource-constrained devices are deployed for
sensing and decision-making applications, ranging from smart cities to
environmental monitoring. These resource-constrained devices are connected to
create real-time distributed networks popularly known as the Internet of Things
(IoT), fog computing and edge computing. Blockchain is gaining a lot of
interest in these domains for securing systems without centralized
dependencies, where proof-of-work (PoW) plays a vital role in making the whole
security solution decentralized. Due to the resource limitations of the
devices, PoW is not suitable for blockchain-based security solutions. This
paper presents a novel consensus algorithm called Proof-of-Authentication
(PoAh), which introduces a cryptographic authentication mechanism to replace
PoW for resource-constrained devices, and to make the blockchain
application-specific. PoAh is thus suitable for private as well as permissioned
blockchains. Further, PoAh not only secures the systems, but also maintains
system sustainability and scalability. The proposed consensus algorithm is
evaluated theoretically in simulation scenarios, and in real-time hardware
testbeds to validate its performance. Finally, PoAh and its integration with
the blockchain in the IoT and edge computing scenarios is discussed. The
proposed PoAh, while running on limited computing resources (e.g., single-board
computing devices like the Raspberry Pi), has a latency on the order of 3 seconds.
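A minimal Python sketch of the validation flow, using an HMAC as a stand-in for the cryptographic authentication step (the concrete primitives and trust setup in PoAh differ):

```python
import hashlib, hmac, json, time

KEY = b"shared-secret-of-trusted-node"   # placeholder credential

def create_block(transactions, prev_hash):
    block = {"tx": transactions, "prev": prev_hash, "ts": time.time()}
    payload = json.dumps(block, sort_keys=True).encode()
    block["auth"] = hmac.new(KEY, payload, hashlib.sha256).hexdigest()
    return block

def validate(block):
    """Proof-of-Authentication: verify a MAC instead of solving a puzzle."""
    payload = json.dumps({k: block[k] for k in ("tx", "prev", "ts")},
                         sort_keys=True).encode()
    expected = hmac.new(KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(block["auth"], expected)

b = create_block(["alice->bob:1"], prev_hash="0" * 64)
print(validate(b))  # True: cheap to verify on a Raspberry Pi class device
```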
Bivariate Polynomial Coding for Exploiting Stragglers in Heterogeneous Coded Computing Systems
Burak Hasircioglu , Jesus Gomez-Vilardebo , Deniz Gunduz Subjects : Information Theory (cs.IT) ; Distributed, Parallel, and Cluster Computing (cs.DC)
Polynomial coding has been proposed as a solution to the straggler mitigation
problem in distributed matrix multiplication. Previous works in the literature
employ univariate polynomials to encode matrix partitions. Such schemes greatly
improve the speed of distributed computing systems by making the task
completion time depend only on the fastest workers. However, the work done
by the slowest workers, who fail to finish the tasks assigned to them, is
completely ignored. In order to exploit the partial computations of the slower
workers, we further decompose the overall matrix multiplication task into even
smaller subtasks to better fit workers’ storage and computation capacities. In
this work, we show that univariate schemes fail to make an efficient use of the
storage capacity and we propose bivariate polynomial codes. We show that
bivariate polynomial codes are a more natural choice to accommodate the
additional decomposition of subtasks, as well as heterogeneous storage and
computation resources at workers. However, in contrast to univariate polynomial
decoding, guaranteeing decodability for multivariate interpolation is much
harder. We propose two bivariate polynomial schemes. The first scheme exploits
the fact that bivariate interpolation is always possible on a rectangular grid
of points. We obtain the rectangular grid of points at the cost of allowing
some redundant computations. For the second scheme, we relax the decoding
constraint, and require decodability for almost all choices of evaluation
points. We present interpolation sets satisfying the almost decodability
conditions for certain storage configurations of workers. Our numerical results
show that bivariate polynomial coding considerably reduces the completion time
of distributed matrix multiplication.
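A worked 2x2 instance of the rectangular-grid idea in numpy; the partition counts and evaluation points are illustrative, and the general codes handle many partitions and straggling workers:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((4, 3))
B = rng.random((3, 4))
A1, A2 = A[:2], A[2:]          # row partitions of A
B1, B2 = B[:, :2], B[:, 2:]    # column partitions of B

# Encoders: A(x) = A1 + A2*x,  B(y) = B1 + B2*y
Aenc = lambda x: A1 + A2 * x
Benc = lambda y: B1 + B2 * y

# Workers evaluate A(x_i) @ B(y_j) on a 2x2 rectangular grid of points.
xs, ys = [1.0, 2.0], [1.0, 3.0]
W = {(x, y): Aenc(x) @ Benc(y) for x in xs for y in ys}

def interp2(p0, p1, t0, t1):
    """Recover coefficients (c0, c1) of c0 + c1*t from values at t0, t1."""
    c1 = (p1 - p0) / (t1 - t0)
    return p0 - c1 * t0, c1

# Decode: interpolate in y for each x, then in x, recovering blocks Ak @ Bl.
rows = [interp2(W[(x, ys[0])], W[(x, ys[1])], *ys) for x in xs]
blocks = [interp2(rows[0][l], rows[1][l], *xs) for l in range(2)]
C = np.block([[blocks[0][0], blocks[1][0]],
              [blocks[0][1], blocks[1][1]]])
assert np.allclose(C, A @ B)
```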
Finding temporal patterns using algebraic fingerprints
Suhas Thejaswi , Aristides Gionis Subjects : Data Structures and Algorithms (cs.DS) ; Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
In this paper we study a family of pattern-detection problems in
vertex-colored temporal graphs. In particular, given a vertex-colored temporal
graph and a multi-set of colors as a query, we search for temporal paths in the
graph that contain the colors specified in the query. These types of problems
have several interesting applications, for example, recommending tours for
tourists, or searching for abnormal behavior in a network of financial
transactions.
For the family of pattern-detection problems we define, we establish
complexity results and design an algebraic-algorithmic framework based on
constrained multilinear sieving. We demonstrate that our solution can scale to
massive graphs with up to a hundred million edges, despite the problems being
NP-hard. Our implementation, which is publicly available, exhibits practical
edge-linear scalability and is highly optimized. For example, in a real-world
graph dataset with more than six million edges and a multi-set query with ten
colors, we can extract an optimal solution in less than eight minutes on a
Haswell desktop with four cores.
High-Quality Hierarchical Process Mapping
Marcelo Fonseca Faraj , Alexander van der Grinten , Henning Meyerhenke , Jesper Larsson Träff , Christian Schulz Subjects : Data Structures and Algorithms (cs.DS) ; Distributed, Parallel, and Cluster Computing (cs.DC)
Partitioning graphs into blocks of roughly equal size such that few edges run
between blocks is a frequently needed operation when processing graphs on a
parallel computer. When the topology of a distributed system is known, an
important task is then to map the blocks of the partition onto the processors
such that the overall communication cost is reduced. We present novel
multilevel algorithms that integrate graph partitioning and process mapping.
Important ingredients of our algorithm include fast label propagation, more
localized local search, initial partitioning, as well as a compressed data
structure to compute processor distances without storing a distance matrix.
Experiments indicate that our algorithms speed up the overall mapping process
and, due to the integrated multilevel approach, also find much better solutions
in practice. For example, one configuration of our algorithm yields better
solutions than the previous state-of-the-art in terms of mapping quality while
being a factor of 62 faster. Compared to the currently fastest iterated multilevel
mapping algorithm Scotch, we obtain 16% better solutions while investing
slightly more running time.
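A sketch of how processor distances can be computed from ranks in a hierarchy without a stored distance matrix; the level sizes are hypothetical, and the paper's compressed structure is more general:

```python
def processor_distance(r1, r2, level_sizes=(8, 4)):
    """Distance between two ranks in a machine hierarchy.

    level_sizes = (cores per node, nodes per rack, ...); no distance
    matrix is stored, the distance falls out of the ranks themselves.
    """
    if r1 == r2:
        return 0
    dist = 1                        # different cores on the same node
    for size in level_sizes:
        r1, r2 = r1 // size, r2 // size
        if r1 != r2:
            dist += 1               # they also diverge at this level
    return dist

print(processor_distance(0, 1))    # same node: 1
print(processor_distance(0, 9))    # same rack, different node: 2
print(processor_distance(0, 63))   # different rack: 3
```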
Segment blockchain: A size reduced storage mechanism for blockchain
Journal-ref: IEEE Access,2020. https://ieeexplore.ieee.org/document/8957450/
Subjects:
Cryptography and Security (cs.CR)
; Distributed, Parallel, and Cluster Computing (cs.DC)
The exponential growth of the blockchain size has become a major contributing
factor that hinders the decentralisation of blockchain and its potential
implementations in data-heavy applications. In this paper, we propose segment
blockchain, an approach that segmentises blockchain and enables nodes to only
store a copy of one blockchain segment. We use PoW as a membership
threshold to limit the number of nodes taken by an Adversary—the Adversary
can only gain at most (n/2) of nodes in a network of (n) nodes when it has
(50\%) of the calculation power in the system (the Nakamoto blockchain security
threshold). A segment blockchain system fails when an Adversary stores all
copies of a segment, because the Adversary can then leave the system, causing a
permanent loss of the segment. We theoretically prove that segment blockchain
can sustain a ((AD/n)^m) failure probability when the Adversary holds no more
than (AD) nodes and every segment is stored by (m) nodes; for example, with
(n=1000) nodes, (AD=100) adversarial nodes, and (m=5) copies per segment, the
failure probability is ((0.1)^5 = 10^{-5}). The storage requirement is greatly
reduced compared to the traditional design, making the blockchain more
suitable for data-heavy applications.
BlockHouse: Blockchain-based Distributed Storehouse System
Comments: Published in “9TH Latin-American Symposium on Dependable Computing”, 2019
Subjects:
Cryptography and Security (cs.CR)
; Distributed, Parallel, and Cluster Computing (cs.DC)
We propose in this paper BlockHouse, a decentralized/P2P storage system fully
based on private blockchains. Each participant can rent out their unused storage in
order to host data of other members. This system uses a dual Smart Contract and
Proof of Retrievability system to automatically check at a fixed frequency if
the file is still hosted. In addition to transparency, the blockchain allows
better integration with all payments associated with this type of system
(regular payments, escrow to ensure good behavior of users, etc.). Except for
the data transferred between the client and the server, all actions go through
a smart contract in the blockchain in order to log, pay for, and secure the
entire storage process.
Learning
FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence
Kihyuk Sohn , David Berthelot , Chun-Liang Li , Zizhao Zhang , Nicholas Carlini , Ekin D. Cubuk , Alex Kurakin , Han Zhang , Colin Raffel Subjects : Machine Learning (cs.LG) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Semi-supervised learning (SSL) provides an effective means of leveraging
unlabeled data to improve a model’s performance. In this paper, we demonstrate
the power of a simple combination of two common SSL methods: consistency
regularization and pseudo-labeling. Our algorithm, FixMatch, first generates
pseudo-labels using the model’s predictions on weakly-augmented unlabeled
images. For a given image, the pseudo-label is only retained if the model
produces a high-confidence prediction. The model is then trained to predict the
pseudo-label when fed a strongly-augmented version of the same image. Despite
its simplicity, we show that FixMatch achieves state-of-the-art performance
across a variety of standard semi-supervised learning benchmarks, including
94.93% accuracy on CIFAR-10 with 250 labels and 88.61% accuracy with 40 — just
4 labels per class. Since FixMatch bears many similarities to existing SSL
methods that achieve worse performance, we carry out an extensive ablation
study to tease apart the experimental factors that are most important to
FixMatch’s success. We make our code available at
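A numpy sketch of the unlabeled-data loss at the heart of this recipe; the hyperparameters are illustrative, and the released implementation combines this term with a standard supervised loss:

```python
import numpy as np

def fixmatch_loss(logits_weak, logits_strong, tau=0.95):
    """Pseudo-label from the weak view, train the strong view.

    Only predictions whose max probability exceeds `tau` contribute.
    """
    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    p_weak = softmax(logits_weak)
    conf, pseudo = p_weak.max(axis=1), p_weak.argmax(axis=1)
    mask = conf >= tau                # keep only confident pseudo-labels
    if not mask.any():
        return 0.0
    p_strong = softmax(logits_strong)
    nll = -np.log(p_strong[mask, pseudo[mask]] + 1e-12)
    return float(nll.mean())

rng = np.random.default_rng(5)
lw = rng.normal(size=(8, 10)) * 5     # weak-augmentation logits
ls = lw + rng.normal(size=(8, 10))    # strong view: perturbed logits
print(fixmatch_loss(lw, ls))
```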
Deceptive AI Explanations: Creation and Detection
Johannes Schneider , Joshua Handali , Michalis Vlachos , Christian Meske Subjects : Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Artificial intelligence comes with great opportunities but also great
risks. We investigate to what extent deep learning can be used to create and
detect deceptive explanations that either aim to lure a human into believing a
decision that is not truthful to the model or provide reasoning that is
non-faithful to the decision. Our theoretical insights show some limits of
deception and detection in the absence of domain knowledge. For empirical
evaluation, we focus on text classification. To create deceptive explanations,
we alter explanations originating from GradCAM, a state-of-the-art technique for
creating explanations in neural networks. We evaluate the effectiveness of
deceptive explanations on 200 participants. Our findings indicate that
deceptive explanations can indeed fool humans. Our classifier can detect even
seemingly minor attempts of deception with accuracy that exceeds 80\% given
sufficient domain knowledge encoded in the form of training data.
Mobility Inference on Long-Tailed Sparse Trajectory
Lei Shi Subjects : Machine Learning (cs.LG) ; Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Analyzing the urban trajectory in cities has become an important topic in
data mining. How can we model the human mobility consisting of stay and travel
from the raw trajectory data? How can we infer such a mobility model from the
single trajectory information? How can we further generalize the mobility
inference to accommodate the real-world trajectory data that is sparsely
sampled over time?
In this paper, based on formal and rigorous definitions of the stay/travel
mobility, we propose a single trajectory inference algorithm that utilizes a
generic long-tailed sparsity pattern in the large-scale trajectory data. The
algorithm guarantees a 100\% precision in the stay/travel inference with a
provable lower-bound in the recall. Furthermore, we introduce an
encoder-decoder learning architecture that admits multiple trajectories as
inputs. The architecture is optimized for the mobility inference problem
through customized embedding and learning mechanism. Evaluations with three
trajectory data sets of 40 million urban users validate the performance
guarantees of the proposed inference algorithm and demonstrate the superiority
of our deep learning model, in comparison to well-known sequence learning
methods. On extremely sparse trajectories, the deep learning model achieves a
2(\times) overall accuracy improvement from the single trajectory inference
algorithm, through proven scalability and generalizability to large-scale
versatile training data.
Generate High-Resolution Adversarial Samples by Identifying Effective Features
Sizhe Chen , Peidong Zhang , Chengjin Sun , Jia Cai , Xiaolin Huang Subjects : Machine Learning (cs.LG) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
With the prevalence of deep learning in computer vision, adversarial samples
that weaken neural networks have emerged in large numbers, revealing their
deep-rooted defects. Most adversarial attacks calculate an imperceptible
perturbation in image space to fool the DNNs. In this strategy, the
perturbation looks like noise and thus could be mitigated. Attacks in feature
space produce semantic perturbation, but they could only deal with low
resolution samples. The reason lies in the great number of coupled features to
express a high-resolution image. In this paper, we propose Attack by
Identifying Effective Features (AIEF), which learns different weights for
features to attack. Effective features, those with great weights, influence the
victim model much but distort the image little, and thus are more effective for
attack. By attacking mostly on them, AIEF produces high resolution adversarial
samples with acceptable distortions. We demonstrate the effectiveness of AIEF
by attacking different tasks with different generative models.
batchboost: regularization for stabilizing training with resistance to underfitting & overfitting
Comments: 6 pages; 5 figures
Subjects:
Machine Learning (cs.LG)
; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Overfitting, underfitting, and unstable training are important challenges in
machine learning. Current approaches for these issues are mixup, SamplePairing
and BC learning. In our work, we state the hypothesis that mixing many images
together can be more effective than just two. Batchboost pipeline has three
stages: (a) pairing: method of selecting two samples. (b) mixing: how to create
a new one from two samples. (c) feeding: combining mixed samples with new ones
from the dataset into the batch (with ratio (gamma)). Note that a sample that
appears in our batch propagates through subsequent iterations with less and
less importance until the end of training. The pairing stage calculates the
error per sample, sorts the samples, and pairs them with the strategy of
hardest with easiest; the mixing stage then merges two samples using mixup,
(lambda x_1 + (1-lambda)x_2). Finally, the feeding stage combines new samples
with mixed ones at a ratio of 1:1. Batchboost has 0.5-3%
better accuracy than the current state-of-the-art mixup regularization on
CIFAR-10 & Fashion-MNIST. Our method is slightly better than SamplePairing
technique on small datasets (up to 5%). Batchboost provides stable training on
untuned parameters (like weight decay); thus it is a good method to test the
performance of different architectures. Source code is at:
this https URL
EdgeNets: Edge Varying Graph Neural Networks
Elvin Isufi , Fernando Gama , Alejandro Ribeiro Subjects : Machine Learning (cs.LG) ; Neural and Evolutionary Computing (cs.NE); Signal Processing (eess.SP); Machine Learning (stat.ML)
Driven by the outstanding performance of neural networks in the structured
Euclidean domain, recent years have seen a surge of interest in developing
neural networks for graphs and data supported on graphs. The graph is leveraged
as a parameterization to capture detail at the node level with a reduced number
of parameters and complexity. Following this rationale, this paper puts forth a
general framework that unifies state-of-the-art graph neural networks (GNNs)
through the concept of EdgeNet. An EdgeNet is a GNN architecture that allows
different nodes to use different parameters to weigh the information of
different neighbors. By extrapolating this strategy to more iterations between
neighboring nodes, the EdgeNet learns edge- and neighbor-dependent weights to
capture local detail. This is the most general local operation that a node can
do and encompasses under one formulation all graph convolutional neural
networks (GCNNs) as well as graph attention networks (GATs). In writing
different GNN architectures with a common language, EdgeNets highlight specific
architecture advantages and limitations, while providing guidelines to improve
their capacity without compromising their local implementation. For instance,
we show that GCNNs have a parameter sharing structure that induces permutation
equivariance. This can be an advantage or a limitation, depending on the
application. When it is a limitation, we propose hybrid approaches and provide
insights to develop several other solutions that promote parameter sharing
without enforcing permutation equivariance. Another interesting conclusion is
the unification of GCNNs and GATs, approaches that have been so far perceived
as separate. In particular, we show that GATs are GCNNs on a graph that is
learned from the features. This particularization opens the doors to develop
alternative attention mechanisms for improving discriminatory power.
Simple and Effective Graph Autoencoders with One-Hop Linear Models
Comments: Under review. A preliminary version of this work has previously been presented at a NeurIPS 2019 workshop without proceedings: arXiv:1910.00942
Subjects:
Machine Learning (cs.LG)
; Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Graph autoencoders (AE) and variational autoencoders (VAE) recently emerged
as powerful node embedding methods, with promising performances on challenging
tasks such as link prediction and node clustering. Graph AE, VAE and most of
their extensions rely on graph convolutional networks (GCN) encoders to learn
vector space representations of nodes. In this paper, we propose to replace the
GCN encoder by a significantly simpler linear model w.r.t. the direct
neighborhood (one-hop) adjacency matrix of the graph. For the two
aforementioned tasks, we show that this approach consistently reaches
competitive performances w.r.t. GCN-based models for numerous real-world
graphs, including all benchmark datasets commonly used to evaluate graph AE and
VAE. We question the relevance of repeatedly using these datasets to compare
complex graph AE and VAE. We also emphasize the effectiveness of the proposed
encoding scheme, that appears as a simpler and faster alternative to GCN
encoders for many real-world applications.
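A numpy sketch of a one-hop linear encoder with an inner-product decoder; the weights here are random rather than trained, so this only shows the model's shape:

```python
import numpy as np

def linear_gae(A, dim=2, seed=0):
    """One-hop linear encoder Z = A_norm @ W, decoder sigma(Z Z^T)."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                       # add self-loops
    d = A_hat.sum(axis=1)
    A_norm = A_hat / np.sqrt(np.outer(d, d))    # symmetric normalization
    W = np.random.default_rng(seed).normal(size=(n, dim)) * 0.1
    Z = A_norm @ W                              # the whole encoder
    A_rec = 1 / (1 + np.exp(-Z @ Z.T))          # inner-product decoder
    return Z, A_rec

# Two triangles joined by one edge.
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1
Z, A_rec = linear_gae(A)
print(Z.round(2))  # W would be trained with a reconstruction loss in practice
```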
Cut-Based Graph Learning Networks to Discover Compositional Structure of Sequential Video Data
Comments: 8 pages, 3 figures, Association for the Advancement of Artificial Intelligence (AAAI2020). arXiv admin note: substantial text overlap with arXiv:1907.01709
Subjects:
Machine Learning (cs.LG)
; Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
Conventional sequential learning methods such as Recurrent Neural Networks
(RNNs) focus on interactions between consecutive inputs, i.e. first-order
Markovian dependency. However, most sequential data, such as videos, have
complex dependency structures that imply variable-length semantic flows and
their compositions, which are hard to capture with conventional
methods. Here, we propose Cut-Based Graph Learning Networks (CB-GLNs) for
learning video data by discovering these complex structures of the video. The
CB-GLNs represent video data as a graph, with nodes and edges corresponding to
frames of the video and their dependencies respectively. The CB-GLNs find
compositional dependencies of the data in multilevel graph forms via a
parameterized kernel with graph-cut and a message passing framework. We
evaluate the proposed method on two different tasks for video
understanding: Video theme classification (Youtube-8M dataset) and Video
Question and Answering (TVQA dataset). The experimental results show that our
model efficiently learns the semantic compositional structure of video data.
Furthermore, our model achieves the highest performance in comparison to other
baseline methods.
Analytic Properties of Trackable Weak Models
Comments: 10 pages, 9 figures
Subjects:
Machine Learning (cs.LG)
; Combinatorics (math.CO); Machine Learning (stat.ML)
We present several new results on the feasibility of inferring the hidden
states in strongly-connected trackable weak models. Here, a weak model is a
directed graph in which each node is assigned a set of colors which may be
emitted when that node is visited. A hypothesis is a node sequence which is
consistent with a given color sequence. A weak model is said to be trackable if
the worst case number of such hypotheses grows as a polynomial in the sequence
length. We show that the number of hypotheses in strongly-connected trackable
models is bounded by a constant and give an expression for this constant. We
also consider the problem of reconstructing which branch was taken at a node
with same-colored out-neighbors, and show that it is always eventually possible
to identify which branch was taken if the model is strongly connected and
trackable. We illustrate these properties by assigning transition probabilities
and employing standard tools for analyzing Markov chains. In addition, we
present new results for the entropy rates of weak models according to whether
they are trackable or not. These theorems indicate that the combination of
trackability and strong connectivity dramatically simplifies the task of
reconstructing which nodes were visited. This work has implications for any
problem which can be described in terms of an agent traversing a colored graph,
such as the reconstruction of hidden states in a hidden Markov model (HMM).
Understanding the Limitations of Network Online Learning
Timothy LaRock , Timothy Sakharov , Sahely Bhadra , Tina Eliassi-Rad Subjects : Machine Learning (cs.LG) ; Machine Learning (stat.ML)
Studies of networked phenomena, such as interactions in online social media,
often rely on incomplete data, either because these phenomena are partially
observed, or because the data is too large or expensive to acquire all at once.
Analysis of incomplete data leads to skewed or misleading results. In this
paper, we investigate limitations of learning to complete partially observed
networks via node querying. Concretely, we study the following problem: given
(i) a partially observed network, (ii) the ability to query nodes for their
connections (e.g., by accessing an API), and (iii) a budget on the number of
such queries, sequentially learn which nodes to query in order to maximally
increase observability. We call this querying process Network Online Learning
and present a family of algorithms called NOL*. These algorithms learn to
choose which partially observed node to query next based on a parameterized
model that is trained online through a process of exploration and exploitation.
Extensive experiments on both synthetic and real world networks show that (i)
it is possible to sequentially learn to choose which nodes are best to query in
a network and (ii) some macroscopic properties of networks, such as the degree
distribution and modular structure, impact the potential for learning and the
optimal amount of random exploration.
Yadong Zhang , Xin Chen Subjects : Machine Learning (cs.LG) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Time series motifs play an important role in the time series analysis. The
motif-based time series clustering is used for the discovery of higher-order
patterns or structures in time series data. Inspired by the convolutional
neural network (CNN) classifier based on the image representations of time
series, motif difference field (MDF) is proposed. Compared to other image
representations of time series, MDF is simple and easy to construct. With the
Fully Convolutional Network (FCN) as the classifier, MDF demonstrates
superior performance on the UCR time series dataset in benchmarks against other
time series classification methods. It is interesting to find that the triadic
time series motifs give the best result in the test. Due to the motif
clustering reflected in MDF, the significant motifs are detected with the help
of the Gradient-weighted Class Activation Mapping (Grad-CAM). The areas in MDF
with high weight in Grad-CAM have a high contribution from the significant
motifs with the desired ordinal patterns associated with the signature patterns
in time series. However, the signature patterns cannot be identified with the
neural network classifiers directly based on the time series.
R2DE: an NLP approach to estimating IRT parameters of newly generated questions
Luca Benedetto , Andrea Cappelli , Roberto Turrin , Paolo Cremonesi Subjects : Machine Learning (cs.LG) ; Computation and Language (cs.CL); Machine Learning (stat.ML)
The main objective of exams consists in performing an assessment of students’
expertise on a specific subject. Such expertise, also referred to as skill or
knowledge level, can then be leveraged in different ways (e.g., to assign a
grade to the students, to understand whether a student might need some support,
etc.). Similarly, the questions appearing in the exams have to be assessed in
some way before being used to evaluate students. Standard approaches to
questions’ assessment are either subjective (e.g., assessment by human experts)
or introduce a long delay in the process of question generation (e.g.,
pretesting with real students). In this work we introduce R2DE (which is a
Regressor for Difficulty and Discrimination Estimation), a model capable of
assessing newly generated multiple-choice questions by looking at the text of
the question and the text of the possible choices. In particular, it can
estimate the difficulty and the discrimination of each question, as they are
defined in Item Response Theory. We also present the results of extensive
experiments we carried out on a real world large scale dataset coming from an
e-learning platform, showing that our model can be used to perform an initial
assessment of newly created questions and ease some of the problems that arise
in question generation.
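A hedged scikit-learn sketch of a text-to-difficulty regressor in this spirit; the toy data, features, and model choice are assumptions, and R2DE additionally estimates discrimination:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

# Toy question texts (question plus choices concatenated) with IRT
# difficulty targets; a real dataset would come from pretesting logs.
texts = [
    "What is 2 + 2? | 3 | 4 | 5 | 6",
    "Define eigenvalue of a matrix | scalar | vector | matrix | none",
    "Capital of France? | Paris | Rome | Berlin | Madrid",
    "Solve the heat equation on a disk | separation | fourier | greens | laplace",
]
difficulty = [-1.5, 1.2, -2.0, 2.1]   # IRT b-parameters (illustrative values)

model = make_pipeline(TfidfVectorizer(), RandomForestRegressor(random_state=0))
model.fit(texts, difficulty)
print(model.predict(["Integrate x^2 from 0 to 1 | 1/3 | 1/2 | 1 | 2"]))
```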
Classifying Wikipedia in a fine-grained hierarchy: what graphs can contribute
Comments: 7 pages
Subjects:
Machine Learning (cs.LG)
; Machine Learning (stat.ML)
Wikipedia is a huge opportunity for machine learning, being the largest
semi-structured base of knowledge available. Because of this, countless works
examine its contents, and focus on structuring it in order to make it usable in
learning tasks, for example by classifying it into an ontology. Beyond its
textual contents, Wikipedia also displays a typical graph structure, where
pages are linked together through citations. In this paper, we address the task
of integrating graph (i.e. structure) information to classify Wikipedia into a
fine-grained named entity ontology (NE), the Extended Named Entity hierarchy.
To address this task, we first start by assessing the relevance of the graph
structure for NE classification. We then explore two directions, one related to
feature vectors using graph descriptors commonly used in large-scale network
analysis, and one extending flat classification to a weighted model taking into
account semantic similarity. We conduct at-scale practical experiments, on a
manually labeled subset of 22,000 pages extracted from the Japanese Wikipedia.
Our results show that integrating graph information succeeds at reducing
sparsity of the input feature space, and yields classification results that are
comparable or better than previous works.
Ensemble Genetic Programming
Comments: eurogp 2020 submission
Subjects:
Machine Learning (cs.LG)
; Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Ensemble learning is a powerful paradigm that has been used in the top
state-of-the-art machine learning methods like Random Forests and XGBoost.
Inspired by the success of such methods, we have developed a new Genetic
Programming method called Ensemble GP. The evolutionary cycle of Ensemble GP
follows the same steps as other Genetic Programming systems, but with
differences in the population structure, fitness evaluation and genetic
operators. We have tested this method on eight binary classification problems,
achieving results significantly better than standard GP, with much smaller
models. Although other methods like M3GP and XGBoost were the best overall,
Ensemble GP was able to achieve exceptionally good generalization results on a
particularly hard problem where none of the other methods was able to succeed.
Model-based Multi-Agent Reinforcement Learning with Cooperative Prioritized Sweeping
Eugenio Bargiacchi , Timothy Verstraeten , Diederik M. Roijers , Ann Nowé Subjects : Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Machine Learning (stat.ML)
We present a new model-based reinforcement learning algorithm, Cooperative
Prioritized Sweeping, for efficient learning in multi-agent Markov decision
processes. The algorithm allows for sample-efficient learning on large problems
by exploiting a factorization to approximate the value function. Our approach
only requires knowledge about the structure of the problem in the form of a
dynamic decision network. Using this information, our method learns a model of
the environment and performs temporal difference updates which affect multiple
joint states and actions at once. Batch updates are additionally performed
which efficiently back-propagate knowledge throughout the factored Q-function.
Our method outperforms the state-of-the-art sparse cooperative Q-learning
algorithm, both on the well-known SysAdmin benchmark and on randomized
environments.
Node Masking: Making Graph Neural Networks Generalize and Scale Better
Pushkar Mishra , Aleksandra Piktus , Gerard Goossen , Fabrizio Silvestri Subjects : Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Graph Neural Networks (GNNs) have received a lot of interest in recent
times. From the early spectral architectures that could only operate on
undirected graphs per a transductive learning paradigm to the current state of
the art spatial ones that can apply inductively to arbitrary graphs, GNNs have
seen significant contributions from the research community. In this paper, we
discuss some theoretical tools to better visualize the operations performed by
state of the art spatial GNNs. We analyze the inner workings of these
architectures and introduce a simple concept, node masking, that allows them to
generalize and scale better. To empirically validate the theory, we perform
several experiments on two widely used benchmark datasets for node
classification in both transductive and inductive settings.
The gap between theory and practice in function approximation with deep neural networks
Ben Adcock , Nick Dexter Subjects : Machine Learning (cs.LG) ; Numerical Analysis (math.NA); Machine Learning (stat.ML)
Deep learning (DL) is transforming whole industries as complicated
decision-making processes are being automated by Deep Neural Networks (DNNs)
trained on real-world data. Driven in part by a rapidly-expanding literature on
DNN approximation theory showing that DNNs can approximate a rich variety of
functions, these tools are increasingly being considered for problems in
scientific computing. Yet, unlike more traditional algorithms in this field,
relatively little is known about DNNs from the principles of numerical
analysis, namely, stability, accuracy, computational efficiency and sample
complexity. In this paper we introduce a computational framework for examining
DNNs in practice, and use it to study their empirical performance with regard
to these issues. We examine the performance of DNNs of different widths and
depths on a variety of test functions in various dimensions, including smooth
and piecewise smooth functions. We also compare DL against best-in-class
methods for smooth function approximation based on compressed sensing. Our main
conclusion is that there is a crucial gap between the approximation theory of
DNNs and their practical performance, with trained DNNs performing relatively
poorly on functions for which there are strong approximation results (e.g.
smooth functions), yet performing well in comparison to best-in-class methods
for other functions. Finally, we present a novel practical existence theorem,
which asserts the existence of a DNN architecture and training procedure which
offers the same performance as current best-in-class schemes. This result
indicates the potential for practical DNN approximation, and the need for
future research into practical architecture design and training strategies.
Engineering AI Systems: A Research Agenda
Comments: 13 pages, 3 figures, highlights section
Subjects:
Machine Learning (cs.LG)
; Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Deploying machine learning and, in particular, deep learning (ML/DL) solutions
in industry-strength, production-quality contexts proves to be challenging.
This requires a structured engineering approach to constructing and evolving
systems that contain ML/DL components. In this paper, we provide a
conceptualization of the typical evolution patterns that companies experience
when employing ML/DL, as well as a framework for integrating ML/DL components
in systems consisting of
multiple types of components. In addition, we provide an overview of the
engineering challenges surrounding AI/ML/DL solutions and, based on that, we
provide a research agenda and overview of open items that need to be addressed
by the research community at large.
Unsupervisedly Learned Representations: Should the Quest be Over?
Comments: submitted to Pattern Recognition Letters
Subjects:
Machine Learning (cs.LG)
; Artificial Intelligence (cs.AI)
There exists a Classification accuracy gap of about 20% between our best
methods of generating Unsupervisedly Learned Representations and the accuracy
rates achieved by (naturally Unsupervisedly Learning) humans. We are at least
in our fourth decade of searching for this class of paradigms. It thus may well
be that we are looking in the wrong direction. We present in this paper a
possible solution to this puzzle. We demonstrate that Reinforcement Learning
schemes can learn representations, which may be used for Pattern Recognition
tasks such as Classification, achieving practically the same accuracy as that
of humans. Our main modest contribution lies in the observations that: a. when
applied to a real world environment (e.g. nature itself) Reinforcement Learning
does not require labels, and thus may be considered a natural candidate for the
long sought, accuracy competitive Unsupervised Learning method, and b. in
contrast, when Reinforcement Learning is applied in a simulated or symbolic
processing environment (e.g. a computer program) it does inherently require
labels and should thus be generally classified, with some exceptions, as
Supervised Learning. The corollary of these observations is that further search
for Unsupervised Learning competitive paradigms which may be trained in
simulated environments like many of those found in research and applications
may be futile.
Fast Sequence-Based Embedding with Diffusion Graphs
Comments: Source code available at: this https URL
Journal-ref: CompleNet 2018
Subjects:
Machine Learning (cs.LG)
; Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
A graph embedding is a representation of graph vertices in a low-dimensional
space, which approximately preserves properties such as distances between
nodes. Vertex sequence-based embedding procedures use features extracted from
linear sequences of nodes to create embeddings using a neural network. In this
paper, we propose diffusion graphs as a method to rapidly generate vertex
sequences for network embedding. Its computational efficiency is superior to
previous methods due to simpler sequence generation, and it produces more
accurate results. In experiments, we found that the performance relative to
other methods improves with increasing edge density in the graph. In a
community detection task, clustering nodes in the embedding space produces
better results compared to other sequence-based embedding methods.
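The sequence-generation step can be pictured as follows. This is a hedged sketch assuming a plain random-walk style diffusion from each source vertex; the paper's exact diffusion-graph construction may differ, and all names are illustrative.
    import random
    import networkx as nx

    def vertex_sequences(graph, walk_length=10, walks_per_node=5, seed=0):
        """Generate linear vertex sequences by randomly diffusing from each node.

        The sequences can then be fed to a word2vec-style model exactly as
        sentences of tokens to produce the embedding.
        """
        rng = random.Random(seed)
        sequences = []
        for source in graph.nodes():
            for _ in range(walks_per_node):
                walk = [source]
                while len(walk) < walk_length:
                    nbrs = list(graph.neighbors(walk[-1]))
                    if not nbrs:
                        break
                    walk.append(rng.choice(nbrs))
                sequences.append([str(v) for v in walk])
        return sequences

    g = nx.karate_club_graph()
    seqs = vertex_sequences(g)
    print(len(seqs), seqs[0])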
Learning to Control PDEs with Differentiable Physics
Comments: Published as a conference paper at ICLR 2020. Main text: 10 pages, 6 figures, 3 tables. Total: 28 pages, 18 figures
Subjects:
Machine Learning (cs.LG)
; Fluid Dynamics (physics.flu-dyn); Machine Learning (stat.ML)
Predicting outcomes and planning interactions with the physical world are
long-standing goals for machine learning. A variety of such tasks involve
continuous physical systems, which can be described by partial differential
equations (PDEs) with many degrees of freedom. Existing methods that aim to
control the dynamics of such systems are typically limited to relatively short
time frames or a small number of interaction parameters. We present a novel
hierarchical predictor-corrector scheme which enables neural networks to learn
to understand and control complex nonlinear physical systems over long time
frames. We propose to split the problem into two distinct tasks: planning and
control. To this end, we introduce a predictor network that plans optimal
trajectories and a control network that infers the corresponding control
parameters. Both stages are trained end-to-end using a differentiable PDE
solver. We demonstrate that our method successfully develops an understanding
of complex physical systems and learns to control them for tasks involving PDEs
such as the incompressible Navier-Stokes equations.
Fredrik D. Johansson , Uri Shalit , Nathan Kallus , David Sontag Subjects : Machine Learning (cs.LG) ; Machine Learning (stat.ML)
Practitioners in diverse fields such as healthcare, economics and education
are eager to apply machine learning to improve decision making. The cost and
impracticality of performing experiments and a recent monumental increase in
electronic record keeping have brought attention to the problem of evaluating
decisions based on non-experimental observational data. This is the setting of
this work. In particular, we study estimation of individual-level causal
effects, such as a single patient’s response to alternative medication, from
recorded contexts, decisions and outcomes. We give generalization bounds on the
error in estimated effects based on distance measures between groups receiving
different treatments, allowing for sample re-weighting. We provide conditions
under which our bound is tight and show how it relates to results for
unsupervised domain adaptation. Led by our theoretical results, we devise
representation learning algorithms that minimize our bound, by regularizing the
representation’s induced treatment group distance, and encourage sharing of
information between treatment groups. We extend these algorithms to
simultaneously learn a weighted representation to further reduce treatment
group distances. Finally, an experimental evaluation on real and synthetic data
shows the value of our proposed representation architecture and regularization
scheme.
Explaining Data-Driven Decisions made by AI Systems: The Counterfactual Approach
Carlos Fernandez , Foster Provost , Xintian Han Subjects : Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Lack of understanding of the decisions made by model-based AI systems is an
important barrier for their adoption. We examine counterfactual explanations as
an alternative for explaining AI decisions. The counterfactual approach defines
an explanation as a set of the system’s data inputs that causally drives the
decision (meaning that removing them changes the decision) and is irreducible
(meaning that removing any subset of the inputs in the explanation does not
change the decision). We generalize previous work on counterfactual
explanations, resulting in a framework that (a) is model-agnostic, (b) can
address features with arbitrary data types, (c) is able to explain decisions made
by complex AI systems that incorporate multiple models, and (d) is scalable to
large numbers of features. We also propose a heuristic procedure to find the
most useful explanations depending on the context. We contrast counterfactual
explanations with another alternative: methods that explain model predictions
by weighting features according to their importance (e.g., SHAP, LIME). This
paper presents two fundamental reasons why explaining model predictions is not
the same as explaining the decisions made using those predictions, suggesting
we should carefully consider whether importance-weight explanations are
well-suited to explain decisions made by AI systems. Specifically, we show that
(1) features that have a large importance weight for a model prediction may not
actually affect the corresponding decision, and (2) importance weights are
insufficient to communicate whether and how features influence system
decisions. We demonstrate this using several examples, including three detailed
studies using real-world data that compare the counterfactual approach with
SHAP and illustrate various conditions under which counterfactual explanations
explain data-driven decisions better than feature importance weights.
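To make the notion of an irreducible explanation concrete, here is a minimal sketch of one plausible search strategy: greedily remove (zero out) input features until the decision flips, then prune the removed set until it is irreducible. The paper's heuristic may differ; the decision function and the zeroing replacement value are illustrative assumptions.
    import numpy as np

    def counterfactual_explanation(decide, x, candidate_order=None):
        """Find an irreducible set of features whose removal flips decide(x).

        decide: function mapping a feature vector to a decision (e.g. 0/1).
        Features are 'removed' by zeroing them, a stand-in for any suitable
        replacement value (mean, reference class, etc.).
        """
        base = decide(x)
        order = candidate_order or range(len(x))
        removed = []
        x_work = x.copy()
        for i in order:                        # grow until the decision flips
            x_work[i] = 0.0
            removed.append(i)
            if decide(x_work) != base:
                break
        else:
            return None                        # no counterfactual found
        # Prune: put features back one at a time; keep only those needed.
        for i in list(removed):
            x_try = x_work.copy()
            x_try[i] = x[i]
            if decide(x_try) != base:          # still flipped without removing i
                x_work = x_try
                removed.remove(i)
        return removed

    # Toy decision: flag if the sum of features exceeds a threshold.
    decide = lambda v: int(v.sum() > 1.0)
    print(counterfactual_explanation(decide, np.array([0.9, 0.4, 0.1])))  # [0]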
Understanding Why Neural Networks Generalize Well Through GSNR of Parameters
Comments: 14 pages, 8 figures, ICLR2020 accepted as spotlight presentation
Subjects:
Machine Learning (cs.LG)
; Machine Learning (stat.ML)
As deep neural networks (DNNs) achieve tremendous success across many
application domains, researchers have explored from many angles why they
generalize well. In this paper, we provide a novel perspective on these issues
using the gradient signal-to-noise ratio (GSNR) of parameters during the training
process of DNNs. The GSNR of a parameter is defined as the ratio between its
gradient’s squared mean and variance, over the data distribution. Based on
several approximations, we establish a quantitative relationship between model
parameters’ GSNR and the generalization gap. This relationship indicates that
a larger GSNR during the training process leads to better generalization
performance. Moreover, we show that, unlike for shallow models (e.g. logistic
regression, support vector machines), the gradient descent optimization
dynamics of DNNs naturally produces large GSNR during training, which is
probably the key to DNNs’ remarkable generalization ability.
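The quantity itself is straightforward to compute. Given per-sample gradients over the data, the GSNR of each parameter is its gradient's squared mean divided by its variance; a minimal numpy sketch (illustrative, not the authors' code):
    import numpy as np

    def gsnr(per_sample_grads, eps=1e-12):
        """Gradient signal-to-noise ratio per parameter.

        per_sample_grads: array of shape (num_samples, num_params), where row i
        holds the gradient of the loss on sample i w.r.t. each parameter.
        GSNR_j = E[g_j]^2 / Var[g_j], estimated over the data distribution.
        """
        mean = per_sample_grads.mean(axis=0)
        var = per_sample_grads.var(axis=0)
        return mean ** 2 / (var + eps)

    # Toy check: a parameter whose gradient agrees across samples (high GSNR)
    # versus one whose gradient is mostly noise (low GSNR).
    g = np.stack([np.array([1.0, 1.0]) + np.random.randn(2) * np.array([0.1, 5.0])
                  for _ in range(1000)])
    print(gsnr(g))  # first entry large, second entry small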
On the infinite width limit of neural networks with a standard parameterization
Jascha Sohl-Dickstein , Roman Novak , Samuel S. Schoenholz , Jaehoon Lee Subjects : Machine Learning (cs.LG) ; Machine Learning (stat.ML)
There are currently two parameterizations used to derive fixed kernels
corresponding to infinite width neural networks, the NTK (Neural Tangent
Kernel) parameterization and the naive standard parameterization. However, the
extrapolation of both of these parameterizations to infinite width is
problematic. The standard parameterization leads to a divergent neural tangent
kernel while the NTK parameterization fails to capture crucial aspects of
finite width networks such as: the dependence of training dynamics on relative
layer widths, the relative training dynamics of weights and biases, and a
nonstandard learning rate scale. Here we propose an improved extrapolation of
the standard parameterization that preserves all of these properties as width
is taken to infinity and yields a well-defined neural tangent kernel. We show
experimentally that the resulting kernels typically achieve similar accuracy to
those resulting from an NTK parameterization, but with better correspondence to
the parameterization of typical finite width networks. Additionally, with
careful tuning of width parameters, the improved standard parameterization
kernels can outperform those stemming from an NTK parameterization. We release
code implementing this improved standard parameterization as part of the Neural
Tangents library at this https URL .
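The difference between the two conventions is easy to state in code. In the sketch below (an illustration under common definitions, not the paper's improved interpolation), the standard parameterization puts the 1/sqrt(fan_in) scaling into the weight initialization, while the NTK parameterization keeps O(1) weights and scales the pre-activation instead:
    import numpy as np

    rng = np.random.default_rng(0)

    def dense_standard(x, fan_in, fan_out, sigma_w=1.0):
        """Standard parameterization: variance scaling lives in the init."""
        W = rng.normal(0.0, sigma_w / np.sqrt(fan_in), size=(fan_in, fan_out))
        return x @ W

    def dense_ntk(x, fan_in, fan_out, sigma_w=1.0):
        """NTK parameterization: O(1) weights, explicit 1/sqrt(fan_in) factor
        in the forward pass, keeping per-weight gradients comparable across
        widths."""
        W = rng.normal(0.0, 1.0, size=(fan_in, fan_out))
        return (sigma_w / np.sqrt(fan_in)) * (x @ W)

    x = rng.normal(size=(8, 512))
    # Both produce pre-activations with the same O(1) scale...
    print(dense_standard(x, 512, 256).std(), dense_ntk(x, 512, 256).std())
    # ...but the weight magnitudes differ greatly, which changes gradient
    # descent dynamics as the width grows.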
Mixed integer programming formulation of unsupervised learning
Arturo Berrones-Santos Subjects : Machine Learning (cs.LG) ; Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
A novel formulation and training procedure for full Boltzmann machines in
terms of a mixed binary quadratic feasibility problem is given. As a proof of
concept, the theory is analytically and numerically tested on XOR patterns.
SGLB: Stochastic Gradient Langevin Boosting
Aleksei Ustimenko , Liudmila Prokhorenkova Subjects : Machine Learning (cs.LG) ; Machine Learning (stat.ML)
In this paper, we introduce Stochastic Gradient Langevin Boosting (SGLB) – a
powerful and efficient machine learning framework, which may deal with a wide
range of loss functions and has provable generalization guarantees. The method
is based on a special form of Langevin Diffusion equation specifically designed
for gradient boosting. This allows us to guarantee the global convergence,
while standard gradient boosting algorithms can guarantee only local optima,
which is a problem for multimodal loss functions. To illustrate the advantages
of SGLB, we apply it to a classification task with 0-1 loss function, which is
known to be multimodal, and to a standard logistic regression task that is
convex. The algorithm is implemented as a part of the CatBoost gradient
boosting library and outperforms classic gradient boosting methods.
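The Langevin ingredient can be illustrated generically: each boosting step is perturbed with Gaussian noise whose scale is tied to the learning rate and an inverse temperature. A minimal sketch with a tree weak learner, under these stated assumptions (not the CatBoost implementation; all names are illustrative):
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def langevin_boosting_step(F, grad_loss, fit_weak_learner, X, y,
                               lr=0.1, beta=1e4, rng=None):
        """One gradient-boosting step with Langevin-style noise.

        F: current ensemble predictions on X; grad_loss: d loss / d F;
        fit_weak_learner: fits a regressor to pseudo-residuals and returns its
        predictions on X. The injected Gaussian noise (scale sqrt(2*lr/beta))
        is what turns plain gradient boosting into a Langevin-type scheme.
        """
        rng = rng or np.random.default_rng()
        residuals = -grad_loss(F, y)                    # pseudo-residuals
        h = fit_weak_learner(X, residuals)              # weak learner output
        noise = rng.normal(0.0, np.sqrt(2.0 * lr / beta), size=F.shape)
        return F + lr * h + noise

    # Toy usage with squared loss, whose gradient w.r.t. F is (F - y).
    def fit_tree(X, r):
        return DecisionTreeRegressor(max_depth=3).fit(X, r).predict(X)

    X = np.random.randn(200, 4)
    y = X[:, 0] ** 2 + np.random.randn(200) * 0.1
    F = np.zeros(200)
    for _ in range(50):
        F = langevin_boosting_step(F, lambda F, y: F - y, fit_tree, X, y)
    print(np.mean((F - y) ** 2))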
Heterogeneous Transfer Learning in Ensemble Clustering
Comments: 10 pages, 5 figures
Subjects:
Machine Learning (cs.LG)
; Machine Learning (stat.ML)
This work proposes an ensemble clustering method using a transfer learning
approach. We consider a clustering problem, in which in addition to data under
consideration, “similar” labeled data are available. The datasets can be
described with different features. The method is based on constructing
meta-features which describe structural characteristics of data, and their
transfer from source to target domain. An experimental study of the method
using Monte Carlo modeling has confirmed its efficiency. In comparison with
other similar methods, the proposed one is able to work under arbitrary feature
descriptions of source and target domains, and it has lower complexity.
Model Reuse with Reduced Kernel Mean Embedding Specification
Xi-Zhu Wu , Wenkai Xu , Song Liu , Zhi-Hua Zhou Subjects : Machine Learning (cs.LG) ; Machine Learning (stat.ML)
Given a publicly available pool of machine learning models constructed for
various tasks, when a user plans to build a model for her own machine learning
application, is it possible to build upon models in the pool such that the
previous efforts on these existing models can be reused rather than starting
from scratch? Here, a grand challenge is how to find models that are helpful
for the current application, without accessing the raw training data for the
models in the pool. In this paper, we present a two-phase framework. In the
upload phase, when a model is uploaded into the pool, we construct a reduced
kernel mean embedding (RKME) as a specification for the model. Then in the
deployment phase, the relatedness of the current task and pre-trained models
will be measured based on the value of the RKME specification. Theoretical
results and extensive experiments validate the effectiveness of our approach.
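The kernel mean embedding machinery behind such a specification can be sketched briefly: a model's training data is summarized by a small weighted point set whose kernel mean approximates the data's embedding, and task-model relatedness can then be scored by the maximum mean discrepancy (MMD) between embeddings. All names below are illustrative assumptions, not the authors' code:
    import numpy as np

    def rbf(A, B, gamma=1.0):
        """RBF kernel matrix between rows of A and rows of B."""
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def mmd_to_specification(X_task, Z, beta, gamma=1.0):
        """Squared MMD between a task's empirical kernel mean embedding and a
        reduced specification (points Z with weights beta summing to 1)."""
        n = X_task.shape[0]
        t1 = rbf(X_task, X_task, gamma).sum() / n**2      # <mu_task, mu_task>
        t2 = beta @ rbf(Z, Z, gamma) @ beta               # <mu_spec, mu_spec>
        t3 = rbf(X_task, Z, gamma).mean(0) @ beta         # <mu_task, mu_spec>
        return t1 + t2 - 2 * t3

    # Pick the most related pre-trained model: smallest MMD to the current task.
    task = np.random.randn(100, 2)
    spec_a = (np.random.randn(8, 2), np.ones(8) / 8)          # centered spec
    spec_b = (np.random.randn(8, 2) + 5.0, np.ones(8) / 8)    # shifted spec
    scores = [mmd_to_specification(task, Z, b) for Z, b in (spec_a, spec_b)]
    print(scores)  # spec_a should score lower (more related)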
An interpretable neural network model through piecewise linear approximation
Mengzhuo Guo , Qingpeng Zhang , Xiuwu Liao , Daniel Dajun Zeng Subjects : Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Most existing interpretable methods explain a black-box model in a post-hoc
manner, which uses simpler models or data analysis techniques to interpret the
predictions after the model is learned. However, they (a) may derive
contradictory explanations on the same predictions given different methods and
data samples, and (b) focus on using simpler models to provide higher
descriptive accuracy at the expense of prediction accuracy. To address these
issues, we propose a hybrid interpretable model that combines a piecewise
linear component and a nonlinear component. The first component describes the
explicit feature contributions by piecewise linear approximation to increase
the expressiveness of the model. The other component uses a multi-layer
perceptron to capture feature interactions and implicit nonlinearity, and
increase the prediction performance. Different from the post-hoc approaches,
the interpretability is obtained once the model is learned in the form of
feature shapes. We also provide a variant to explore higher-order interactions
among features to demonstrate that the proposed model is flexible for
adaptation. Experiments demonstrate that the proposed model can achieve good
interpretability by describing feature shapes while maintaining
state-of-the-art accuracy.
Projection based Active Gaussian Process Regression for Pareto Front Modeling
Zhengqi Gao , Jun Tao , Yangfeng Su , Dian Zhou , Xuan Zeng Subjects : Machine Learning (cs.LG) ; Machine Learning (stat.ML)
Pareto Front (PF) modeling is essential in decision-making problems across
many domains, such as economics, medicine, and engineering. In the Operations
Research literature, this task has been addressed with multi-objective
optimization algorithms. However, without learned models of the PF, these
methods cannot examine whether a newly provided point lies on the PF or not.
In this paper, we reconsider the task from a data mining perspective. A novel
projection-based active Gaussian process regression (P-aGPR) method is
proposed for efficient PF modeling. First, P-aGPR chooses a series of
projection spaces with dimensionalities ranging from low to high. Next, in
each projection space, a Gaussian process regression (GPR) model is trained to
represent the constraint that the PF should satisfy in that space. Moreover,
to improve modeling efficacy and stability, an active learning framework is
developed by exploiting the uncertainty information obtained from the GPR
models. Different from existing methods, the proposed P-aGPR method can not
only provide a generative PF model, but also quickly examine whether a given
point lies on the PF or not. The numerical results demonstrate that, compared
to state-of-the-art passive learning methods, the proposed P-aGPR method
achieves higher modeling accuracy and stability.
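The active learning component can be pictured with a generic maximum-uncertainty acquisition loop for GPR; the sketch below (using scikit-learn) illustrates only this general idea, not the paper's projection-space machinery, and all names are illustrative.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    def active_gpr(f, X_pool, n_init=5, n_queries=15, seed=0):
        """Fit a GPR by iteratively querying the pool point with the largest
        predictive standard deviation (maximum-uncertainty acquisition)."""
        rng = np.random.default_rng(seed)
        idx = list(rng.choice(len(X_pool), n_init, replace=False))
        for _ in range(n_queries):
            X = X_pool[idx]
            y = np.array([f(x) for x in X])
            gp = GaussianProcessRegressor(kernel=RBF(1.0)).fit(X, y)
            _, std = gp.predict(X_pool, return_std=True)
            std[idx] = -np.inf                 # don't re-query known points
            idx.append(int(np.argmax(std)))    # most uncertain point next
        return gp, idx

    # Toy 1-D target on a dense pool of candidate locations.
    X_pool = np.linspace(-3, 3, 200)[:, None]
    gp, queried = active_gpr(lambda x: np.sin(2 * x[0]), X_pool)
    print(len(queried), "points queried")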
A point-wise linear model reveals reasons for 30-day readmission of heart failure patients
Comments: 8 pages, 3 figures
Subjects:
Machine Learning (cs.LG)
; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Heart failure in the United States costs an estimated 30.7 billion dollars
annually and predictive analysis can decrease costs due to readmission of heart
failure patients. Deep learning can predict readmissions but does not give
reasons for its predictions. Ours is the first study on a deep-learning
approach to explaining decisions behind readmission predictions. Additionally,
it provides an automatic patient stratification to explain cohorts of
readmitted patients. The new deep-learning model called a point-wise linear
model is a meta-learning machine of linear models. It generates a logistic
regression model to predict early readmission for each patient. The custom-made
prediction models allow us to analyze feature importance. We evaluated the
approach using a dataset of heart failure patients readmitted within 30 days.
This study has been submitted to PLOS ONE; in advance, we share here the
theoretical aspects of the point-wise linear model as a part of our study.
Finding the Sparsest Vectors in a Subspace: Theory, Algorithms, and Applications
Comments: QQ and ZZ contributed equally to the work. Invited review paper for IEEE Signal Processing Magazine Special Issue on non-convex optimization for signal processing and machine learning. This article contains 26 pages with 11 figures
Subjects:
Machine Learning (cs.LG)
; Information Theory (cs.IT); Image and Video Processing (eess.IV); Optimization and Control (math.OC); Machine Learning (stat.ML)
The problem of finding the sparsest vector (direction) in a low-dimensional
subspace can be considered as a homogeneous variant of the sparse recovery
problem, which finds applications in robust subspace recovery, dictionary
learning, sparse blind deconvolution, and many other problems in signal
processing and machine learning. However, in contrast to the classical sparse
recovery problem, the most natural formulation for finding the sparsest vector
in a subspace is usually nonconvex. In this paper, we overview recent advances
on global nonconvex optimization theory for solving this problem, ranging from
geometric analysis of its optimization landscapes, to efficient optimization
algorithms for solving the associated nonconvex optimization problem, to
applications in machine intelligence, representation learning, and imaging
sciences. Finally, we conclude this review by pointing out several interesting
open problems for future research.
Reinforcement Learning with Probabilistically Complete Exploration
Philippe Morere , Gilad Francis , Tom Blau , Fabio Ramos Subjects : Machine Learning (cs.LG) ; Robotics (cs.RO); Machine Learning (stat.ML)
Balancing exploration and exploitation remains a key challenge in
reinforcement learning (RL). State-of-the-art RL algorithms suffer from high
sample complexity, particularly in the sparse reward case, where they can do no
better than to explore in all directions until the first positive rewards are
found. To mitigate this, we propose Rapidly Randomly-exploring Reinforcement
Learning (R3L). We formulate exploration as a search problem and leverage
widely-used planning algorithms such as Rapidly-exploring Random Tree (RRT) to
find initial solutions. These solutions are used as demonstrations to
initialize a policy, then refined by a generic RL algorithm, leading to faster
and more stable convergence. We provide theoretical guarantees of R3L
exploration finding successful solutions, as well as bounds for its sampling
complexity. We experimentally demonstrate that the method outperforms classic and
intrinsic exploration techniques, requiring only a fraction of exploration
samples and achieving better asymptotic performance.
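To ground the exploration idea, here is a minimal RRT in a 2-D unit square; in R3L the tree would instead be grown through environment transitions and the resulting path used as a demonstration for policy initialization. Names and parameters are illustrative:
    import numpy as np

    def rrt(start, goal, step=0.1, goal_tol=0.15, iters=5000, seed=0):
        """Minimal RRT in the unit square: grow a tree toward random samples
        until a node lands within goal_tol of the goal."""
        rng = np.random.default_rng(seed)
        nodes = [np.asarray(start, dtype=float)]
        parent = {0: None}
        for _ in range(iters):
            sample = rng.random(2)                       # random target state
            i = int(np.argmin([np.linalg.norm(n - sample) for n in nodes]))
            direction = sample - nodes[i]
            new = nodes[i] + step * direction / (np.linalg.norm(direction) + 1e-9)
            parent[len(nodes)] = i
            nodes.append(new)
            if np.linalg.norm(new - goal) < goal_tol:    # reached the goal region
                path, j = [], len(nodes) - 1
                while j is not None:                     # backtrack to the root
                    path.append(nodes[j])
                    j = parent[j]
                return path[::-1]
        return None

    path = rrt([0.05, 0.05], np.array([0.9, 0.9]))
    print(None if path is None else len(path))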
Memory capacity of neural networks with threshold and ReLU activations
Comments: 25 pages
Subjects:
Machine Learning (cs.LG)
; Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Overwhelming theoretical and empirical evidence shows that mildly
overparametrized neural networks — those with more connections than the size
of the training data — are often able to memorize the training data with
(100\%) accuracy. This was rigorously proved for networks with sigmoid
activation functions and, very recently, for ReLU activations. Addressing a
1988 open question of Baum, we prove that this phenomenon holds for general
multilayered perceptrons, i.e. neural networks with threshold activation
functions, or with any mix of threshold and ReLU activations. Our construction
is probabilistic and exploits sparsity.
A Review on Generative Adversarial Networks: Algorithms, Theory, and Applications
Jie Gui , Zhenan Sun , Yonggang Wen , Dacheng Tao , Jieping Ye Subjects : Machine Learning (cs.LG) ; Machine Learning (stat.ML)
Generative adversarial networks (GANs) have recently been a hot research
topic. GANs have been widely studied since 2014, and a large number of
algorithms have been proposed. However, few comprehensive studies explain the
connections among different GAN variants and how they have evolved. In this
paper, we attempt to provide a review of various GAN methods from the
perspectives of algorithms, theory, and applications. Firstly, the motivations,
mathematical representations, and structures of most GAN algorithms are
introduced in detail. Furthermore, GANs have been combined with other machine
learning algorithms for specific applications, such as semi-supervised
learning, transfer learning, and reinforcement learning. This paper compares
the commonalities and differences of these GAN methods. Secondly, theoretical
issues related to GANs are investigated. Thirdly, typical applications of GANs
in image processing and computer vision, natural language processing, music,
speech and audio, medical field, and data science are illustrated. Finally, the
future open research problems for GANs are pointed out.
Infrequent adverse event prediction in low carbon energy production using machine learning
Stefano Coniglio , Anthony J. Dunn , Alain B. Zemkoho Subjects : Machine Learning (cs.LG) ; Optimization and Control (math.OC)
Machine Learning is one of the fastest growing fields in academia. Many
industries are aiming to incorporate machine learning tools into their
day-to-day operation. However, the keystone of doing so is recognising when
one has a problem which can be solved using machine learning. Adverse event
prediction is one such problem. There is a wide range of methods for the
production of sustainable energy, and in many of them adverse events can occur
which impede energy production and even damage equipment. The two examples of adverse event
prediction in sustainable energy production we examine in this paper are foam
formation in anaerobic digestion and condenser fouling in steam turbines as
used in nuclear power stations. In this paper we will propose a framework for:
formalising a classification problem based around adverse event prediction,
building predictive maintenance models capable of predicting these events
before they occur and testing the reliability of these models.
Distributionally Robust Bayesian Quadrature Optimization
Comments: AISTATS2020
Journal-ref: AISTATS2020
Subjects:
Machine Learning (cs.LG)
; Machine Learning (stat.ML)
Bayesian quadrature optimization (BQO) maximizes the expectation of an
expensive black-box integrand taken over a known probability distribution. In
this work, we study BQO under distributional uncertainty in which the
underlying probability distribution is unknown except for a limited set of its
i.i.d. samples. A standard BQO approach maximizes the Monte Carlo estimate of
the true expected objective given the fixed sample set. Though the Monte Carlo
estimate is unbiased, it has high variance given a small set of samples, and
thus can result in a spurious objective function. We adopt the distributionally
robust optimization perspective to this problem by maximizing the expected
objective under the most adversarial distribution. In particular, we propose a
novel posterior sampling based algorithm, namely distributionally robust BQO
(DRBQO) for this purpose. We demonstrate the empirical effectiveness of our
proposed framework in synthetic and real-world problems, and characterize its
theoretical convergence via Bayesian regret.
Discriminator Soft Actor Critic without Extrinsic Rewards
Daichi Nishio , Daiki Kuyoshi , Toi Tsuneda , Satoshi Yamane Subjects : Machine Learning (cs.LG) ; Machine Learning (stat.ML)
It is difficult to imitate well in unknown states from a small amount of
expert data and sampled data. Supervised learning methods such as Behavioral
Cloning do not require sampled data, but usually suffer from
distribution shift. The methods based on reinforcement learning, such as
inverse reinforcement learning and generative adversarial imitation learning
(GAIL), can learn from only a few expert data. However, they often need to
interact with the environment. Soft Q imitation learning addressed these
problems, and it was shown that it could learn efficiently by combining
Behavioral Cloning and soft Q-learning with constant rewards. In order to make
this algorithm more robust to distribution shift, we propose Discriminator Soft
Actor Critic (DSAC). It uses a reward function based on adversarial inverse
reinforcement learning instead of constant rewards. We evaluated it on PyBullet
environments with only four expert trajectories.
Learning Options from Demonstration using Skill Segmentation
Comments: To be published in SAUPEC/RobMech/PRASA 2020. Consists of 6 pages, 5 figures
Subjects:
Machine Learning (cs.LG)
; Machine Learning (stat.ML)
We present a method for learning options from segmented demonstration
trajectories. The trajectories are first segmented into skills using
nonparametric Bayesian clustering and a reward function for each segment is
then learned using inverse reinforcement learning. From this, a set of inferred
trajectories for the demonstration are generated. Option initiation sets and
termination conditions are learned from these trajectories using the one-class
support vector machine clustering algorithm. We demonstrate our method in the
four rooms domain, where an agent is able to autonomously discover usable
options from human demonstration. Our results show that these inferred options
can then be used to improve learning and planning.
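The initiation-set step can be illustrated with scikit-learn's one-class SVM: states from which a segmented skill was executed form the positive class, and the learned decision function marks where the option may be initiated. A small sketch with synthetic states (parameters are illustrative):
    import numpy as np
    from sklearn.svm import OneClassSVM

    # Synthetic 2-D states observed at the start of one segmented skill.
    rng = np.random.default_rng(0)
    initiation_states = rng.normal(loc=[2.0, -1.0], scale=0.3, size=(200, 2))

    # Fit the one-class SVM; its decision boundary is the initiation set.
    clf = OneClassSVM(kernel="rbf", gamma=2.0, nu=0.05).fit(initiation_states)

    # Query: can the option be initiated from these states?
    queries = np.array([[2.1, -0.9],     # near the demonstrated region
                        [-3.0, 3.0]])    # far away
    print(clf.predict(queries))          # +1 = inside initiation set, -1 = outside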
Gradient Surgery for Multi-Task Learning
Tianhe Yu , Saurabh Kumar , Abhishek Gupta , Sergey Levine , Karol Hausman , Chelsea Finn Subjects : Machine Learning (cs.LG) ; Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Machine Learning (stat.ML)
While deep learning and deep reinforcement learning (RL) systems have
demonstrated impressive results in domains such as image classification, game
playing, and robotic control, data efficiency remains a major challenge.
Multi-task learning has emerged as a promising approach for sharing structure
across multiple tasks to enable more efficient learning. However, the
multi-task setting presents a number of optimization challenges, making it
difficult to realize large efficiency gains compared to learning tasks
independently. The reasons why multi-task learning is so challenging compared
to single-task learning are not fully understood. In this work, we identify a
set of three conditions of the multi-task optimization landscape that cause
detrimental gradient interference, and develop a simple yet general approach
for avoiding such interference between task gradients. We propose a form of
gradient surgery that projects a task’s gradient onto the normal plane of the
gradient of any other task that has a conflicting gradient. On a series of
challenging multi-task supervised and multi-task RL problems, this approach
leads to substantial gains in efficiency and performance. Further, it is
model-agnostic and can be combined with previously-proposed multi-task
architectures for enhanced performance.
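The projection rule is compact enough to state directly: when two task gradients conflict (negative inner product), each is projected onto the normal plane of the other before the updates are combined. A minimal sketch consistent with this description (not necessarily the authors' exact implementation):
    import numpy as np

    def gradient_surgery(grads, rng=None):
        """Project each task gradient onto the normal plane of any other task
        gradient it conflicts with (negative dot product), then sum.

        grads: list of flattened per-task gradient vectors.
        """
        rng = rng or np.random.default_rng()
        projected = []
        for i, g in enumerate(grads):
            g = g.copy()
            others = [j for j in range(len(grads)) if j != i]
            rng.shuffle(others)                   # consider conflicts in random order
            for j in others:
                dot = g @ grads[j]
                if dot < 0:                       # conflicting gradient
                    g -= dot / (grads[j] @ grads[j]) * grads[j]
            projected.append(g)
        return np.sum(projected, axis=0)

    # Two conflicting task gradients: the combined update no longer fights itself.
    g1 = np.array([1.0, 0.0])
    g2 = np.array([-1.0, 1.0])
    print(gradient_surgery([g1, g2]))  # [0.5, 1.5]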
Algebraic and Analytic Approaches for Parameter Learning in Mixture Models
Comments: 22 pages, Accepted at Algorithmic Learning Theory (ALT) 2020
Subjects:
Machine Learning (cs.LG)
; Machine Learning (stat.ML)
We present two different approaches for parameter learning in several mixture
models in one dimension. Our first approach uses complex-analytic methods and
applies to Gaussian mixtures with shared variance, binomial mixtures with
shared success probability, and Poisson mixtures, among others. An example
result is that (exp(O(N^{1/3}))) samples suffice to exactly learn a mixture of
(k) components. Our second approach uses algebraic and combinatorial tools and
applies to binomial mixtures with shared trial parameter (N) and differing
success parameters, as well as to mixtures of geometric distributions. Again,
as an example, for binomial mixtures with (k) components and success
parameters discretized to resolution (epsilon),
(O(k^2(N/epsilon)^{8/sqrt{epsilon}})) samples suffice to exactly recover the
parameters. For some of these distributions, our results represent the first
guarantees for parameter estimation.
Comments: 10 pages, 5 figures
Subjects:
Machine Learning (cs.LG)
; Robotics (cs.RO); Machine Learning (stat.ML)
Actor-critic methods with sparse rewards in model-based deep reinforcement
learning typically require a deterministic binary reward function that
reflects only two possible outcomes: whether, for each step, the goal has been
achieved or not. Our hypothesis is that we can influence an agent to learn
faster by applying an external environmental pressure during training, which
adversely impacts its ability to get higher rewards. As such, we deviate from
the classical paradigm of sparse rewards and add a uniformly sampled reward
value to the baseline reward to show that (1) sample efficiency of the
training process can be correlated to the adversity experienced during
training, (2) it is possible to achieve higher performance in less time and
with fewer resources, (3) we can reduce the performance variability
experienced from seed to seed, (4) there is a maximum point after which more
pressure will not generate better results, and (5) random positive incentives
have an adverse effect when using a negative reward strategy, making an agent
under those conditions learn poorly and more slowly. These results have been
shown to be valid for Deep Deterministic Policy Gradients using Hindsight
Experience Replay in a well-known MuJoCo environment, but we argue that they
could be generalized to other methods and environments as well.
Marek Śmieja , Łukasz Struski , Mário A. T. Figueiredo Subjects : Machine Learning (cs.LG) ; Machine Learning (stat.ML)
In this paper, we introduce a neural network framework for semi-supervised
clustering (SSC) with pairwise (must-link or cannot-link) constraints. In
contrast to existing approaches, we decompose SSC into two simpler
classification tasks/stages: the first stage uses a pair of Siamese neural
networks to label the unlabeled pairs of points as must-link or cannot-link;
the second stage uses the fully pairwise-labeled dataset produced by the first
stage in a supervised neural-network-based clustering method. The proposed
approach, S3C2 (Semi-Supervised Siamese Classifiers for Clustering), is
motivated by the observation that binary classification (such as assigning
pairwise relations) is usually easier than multi-class clustering with partial
supervision. On the other hand, being classification-based, our method solves
only well-defined classification problems, rather than less well specified
clustering tasks. Extensive experiments on various datasets demonstrate the
high performance of the proposed method.
Comments: 191 pages, PhD thesis
Subjects:
Machine Learning (cs.LG)
; Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
This dissertation explores the integration of learning and analogy-making
through the development of a computer program, called Analogator, that learns
to make analogies by example. By “seeing” many different analogy problems,
along with possible solutions, Analogator gradually develops an ability to
make new analogies.
That is, it learns to make analogies by analogy. This approach stands in
contrast to most existing research on analogy-making, in which the a priori
existence of analogical mechanisms within a model is typically assumed. The
present research extends standard connectionist methodologies by developing a
specialized associative training procedure for a recurrent network
architecture. The network is trained to divide input scenes (or situations)
into appropriate figure and ground components. Seeing one scene in terms of a
particular figure and ground provides the context for seeing another in an
analogous fashion. After training, the model is able to make new analogies
between novel situations. Analogator has much in common with lower-level
perceptual models of categorization and recognition; it thus serves as a
unifying framework encompassing both high-level analogical learning and
low-level perception. This approach is compared and contrasted with other
computational models of analogy-making. The model’s training and
generalization performance is examined, and limitations are discussed.
Shuo Wang , Tianle Chen , Shangyu Chen , Carsten Rudolph , Surya Nepal , Marthie Grobler Subjects : Machine Learning (cs.LG) ; Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB); Machine Learning (stat.ML)
Anomaly detection aims to recognize samples with anomalous and unusual
patterns with respect to a set of normal data. This is significant for
numerous domain applications, e.g. industrial inspection, medical imaging, and
security enforcement. There are two key research challenges associated with
existing anomaly detection approaches: (1) many of them perform well on
low-dimensional problems, but their performance on high-dimensional instances,
such as images, is limited; (2) many of them still rely on traditional
supervised approaches and manual engineering of features, while the topic has
not yet been fully explored using modern deep learning approaches, even when
well-labeled samples are limited. In this paper, we propose a One-for-all
Image Anomaly Detection system (OIAD) based on disentangled learning using
only clean samples. Our key insight is that the impact of a small perturbation
on the latent representation can be bounded for normal samples, while
anomalous images usually fall outside such bounded intervals; we call this
structure consistency. We implement this idea and evaluate its performance for
anomaly detection. Our experiments with three datasets show that OIAD can
detect over (90\%) of anomalies while maintaining a low false alarm rate. It
can also detect suspicious samples among samples labeled as clean, coinciding
with what humans would deem unusual.
Multi-agent Motion Planning for Dense and Dynamic Environments via Deep Reinforcement Learning
Samaneh Hosseini Semnani , Hugh Liu , Michael Everett , Anton de Ruiter , Jonathan P. How Subjects : Machine Learning (cs.LG) ; Artificial Intelligence (cs.AI); Robotics (cs.RO)
This paper introduces a hybrid algorithm of deep reinforcement learning (RL)
and force-based motion planning (FMP) to solve the distributed motion planning
problem in dense and dynamic environments. Individually, RL and FMP algorithms
each have their own limitations. FMP is not able to produce time-optimal
paths, and existing RL solutions are not able to produce collision-free paths
in dense environments.
Therefore, we first tried improving the performance of recent RL approaches by
introducing a new reward function that not only eliminates the requirement of
a prior supervised learning (SL) step but also decreases the chance of
collision in crowded environments. That improved things, but there were still
many failure cases. We therefore developed a hybrid approach that falls back
on the simpler FMP method in stuck, simple, and high-risk cases, and continues
using RL in normal cases, where FMP cannot produce optimal paths. We also
extend the GA3C-CADRL algorithm to 3D environments. Simulation results show
that the proposed algorithm outperforms both the deep RL and FMP algorithms,
producing up to 50% more successful scenarios than deep RL and up to 75% less
extra time to reach the goal than FMP.
Comments: the 24th European Conference on Artificial Intelligence (ECAI 2020)
Subjects:
Machine Learning (cs.LG)
; Machine Learning (stat.ML)
In this paper, we investigate algorithms for anomaly detection. Previous
anomaly detection methods focus on modeling the distribution of non-anomalous
data provided during training. However, this does not necessarily ensure the
correct detection of anomalous data. We propose a new Regularized Cycle
Consistent Generative Adversarial Network (RCGAN) in which deep neural
networks are adversarially trained to better recognize anomalous samples. This
approach is based on leveraging a penalty distribution with a new definition
of the loss function and a novel use of discriminator networks. It is based on
a solid mathematical foundation, and proofs show that our approach has
stronger guarantees for detecting anomalous examples compared to the current
state-of-the-art. Experimental results on both real-world and synthetic data
show that our model leads to significant and consistent improvements on
previous anomaly detection benchmarks. Notably, RCGAN improves on the
state-of-the-art on the KDDCUP, Arrhythmia, Thyroid, Musk and CIFAR10
datasets.
Comments: 19 pages
Subjects:
Machine Learning (cs.LG)
; Optimization and Control (math.OC); Machine Learning (stat.ML)
One of the key challenges in designing machine learning systems is to
determine the right balance among several objectives, which are oftentimes
incommensurable and conflicting. For example, when designing deep neural
networks (DNNs), one often has to trade off between multiple objectives, such
as accuracy, energy consumption, and inference time. Typically, there is no
single configuration that performs equally well for all objectives.
Consequently, one is interested in identifying Pareto-optimal designs.
Although different multi-objective optimization algorithms have been developed
to identify Pareto-optimal configurations, state-of-the-art multi-objective
optimization methods do not consider the different evaluation costs of the
objectives under consideration. This is particularly important for optimizing
DNNs: the cost arising from assessing the accuracy of DNNs is orders of
magnitude higher than that of measuring the energy consumption of pre-trained
DNNs. We propose FlexiBO, a flexible Bayesian optimization method, to address
this issue. We formulate a new acquisition function based on the improvement
of the Pareto hyper-volume weighted by the measurement cost of each objective.
Our acquisition function selects the next sample and objective that provide
the maximum information gain per unit of cost. We evaluated FlexiBO on 7
state-of-the-art DNNs for object detection, natural language processing, and
speech recognition.
Our results indicate that, compared to other state-of-the-art methods across
the 7 architectures we tested, the Pareto front obtained using FlexiBO has, on
average, a 28.44% higher contribution to the true Pareto front and achieves
25.64% better diversity.
Comments: Appeared in ECML-PKDD 2019
Subjects:
Machine Learning (cs.LG)
; Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
In programmatic advertising, ad slots are usually sold using second-price (SP)
auctions in real-time. The highest-bidding advertiser wins but pays only the
second-highest bid (known as the winning price). In SP, for a single item, the
dominant strategy of each bidder is to bid the true value from the bidder’s
perspective. However, in a practical setting, with budget constraints, bidding
the true value is a sub-optimal strategy. Hence, to devise an optimal bidding
strategy, it is of utmost importance to learn the winning price distribution
accurately. Moreover, a demand-side platform (DSP), which bids on behalf of
advertisers, observes the winning price only if it wins the auction. For
losing auctions, a DSP can only treat its bidding price as a lower bound for
the unknown winning price. In the literature, censored regression is typically
used to model such partially observed data. A common assumption in censored
regression is that the winning price is drawn from a fixed-variance
(homoscedastic) uni-modal distribution (most often Gaussian). However, in
reality, these assumptions are often violated. We relax these assumptions and
propose a heteroscedastic fully parametric censored regression approach, as
well as a mixture density censored network. Our approach not only generalizes
censored regression but also provides the flexibility to model arbitrarily
distributed real-world data. Experimental evaluation on the publicly available
dataset for winning price estimation demonstrates the effectiveness of our
method. Furthermore, we evaluate our algorithm on one of the largest
demand-side platforms, where significant improvement has been achieved in
comparison with the baseline solutions.
Inference for Network Structure and Dynamics from Time Series Data via Graph Neural Network
Mengyuan Chen , Jiang Zhang , Zhang Zhang , Lun Du , Qiao Hu , Shuo Wang , Jiaqi Zhu Subjects : Machine Learning (cs.LG) ; Social and Information Networks (cs.SI); Machine Learning (stat.ML)
Network structures in various backgrounds play important roles in social,
technological, and biological systems. However, the observable network
structures in real cases are often incomplete or unavailable due to
measurement errors or privacy protection issues. Therefore, inferring the
complete network structure is useful for understanding complex systems.
Existing studies have not fully solved the problem of inferring network
structure with partial or no information about connections or nodes. In this
paper, we tackle the problem by utilizing time series data generated by
network dynamics. We regard the network inference problem based on dynamical
time series data as a problem of minimizing errors in predicting future
states, and propose a novel data-driven deep learning model called Gumbel
Graph Network (GGN) to solve two kinds of network inference problems: network
reconstruction and network completion. For the network reconstruction problem,
the GGN framework includes two modules: the dynamics learner and the network
generator.
For the network completion problem, GGN adds a new module called the States
Learner to infer missing parts of the network. We carried out experiments on
discrete and continuous time series data. The experiments show that our method
can reconstruct up to 100% of the network structure on the network
reconstruction task, and the model can also infer the unknown parts of the
structure with up to 90% accuracy when some nodes are missing, with accuracy
decaying as the fraction of missing nodes increases. Our framework may have
wide application in areas where the network structure is hard to obtain and
time series data is rich.
Imtiaz Ahmed , Xia Ben Hu , Mithun P. Acharya , Yu Ding Subjects : Machine Learning (cs.LG) ; Machine Learning (stat.ML)
Dimensionality reduction is considered an important step for ensuring
competitive performance in unsupervised learning such as anomaly detection.
Non-negative matrix factorization (NMF) is a popular and widely used method to
accomplish this goal. But NMF, together with its recent enhanced versions,
such as graph regularized NMF or symmetric NMF, does not have the provision to
include neighborhood structure information and, as a result, may fail to
provide satisfactory performance in the presence of nonlinear manifold
structure. To address that shortcoming, we propose to consider and incorporate
the neighborhood structural similarity information within the NMF framework by
modeling the data through a minimum spanning tree. What motivates our choice
is the understanding that, in the presence of complicated data structure, a
minimum spanning tree can approximate the intrinsic distance between two data
points better than a simple Euclidean distance does, and consequently, it
constitutes a more reasonable basis for differentiating anomalies from the
normal class data. We label the resulting method the neighborhood structure
assisted NMF. By comparing the formulation and properties of the neighborhood
structure assisted NMF with other versions of NMF, including graph regularized
NMF and symmetric NMF, it is apparent that the inclusion of the neighborhood
structure information using a minimum spanning tree makes a key difference. We
further devise both offline and online algorithmic versions of the proposed
method. Empirical comparisons using twenty benchmark datasets as well as an
industrial dataset extracted from a hydropower plant demonstrate the
superiority of the neighborhood structure assisted NMF and support our claim
of merit.
Comments: IUI 2020
Subjects:
Machine Learning (cs.LG)
; Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
We explore trust in a relatively new area of data science: Automated Machine
Learning (AutoML). In AutoML, AI methods are used to generate and optimize
machine learning models by automatically engineering features, selecting
models, and optimizing hyperparameters. In this paper, we seek to understand
what kinds of information influence data scientists’ trust in the models
produced by AutoML. We operationalize trust as a willingness to deploy a model
produced using automated methods. We report results from three studies —
qualitative interviews, a controlled experiment, and a card-sorting task — to
understand the information needs of data scientists for establishing trust in
AutoML systems.
We find that including transparency features in an AutoML tool increased user
trust and understandability of the tool, and that, of all the proposed
features, model performance metrics and visualizations are the most important
information for data scientists when establishing their trust in an AutoML
tool.
Comments: arXiv admin note: substantial text overlap with arXiv:1902.03055
Subjects:
Machine Learning (cs.LG)
; Statistics Theory (math.ST); Machine Learning (stat.ML)
There is a large body of work on convergence rates in both passive and active
learning. Here we first outline some of the main results that have been
obtained, more specifically in a nonparametric setting under assumptions about
the smoothness of the regression function (or the boundary between classes)
and the margin noise. We discuss the relative merits of these underlying
assumptions by putting active learning in perspective with recent work on
passive learning. We design an active learning algorithm with a rate of
convergence better than in passive learning, using a particular smoothness
assumption customized for k-nearest neighbors. Unlike previous active learning
algorithms, ours uses a smoothness assumption that provides a dependence on
the marginal distribution of the instance space. Additionally, our algorithm
avoids the strong density assumption that supposes the existence of the
density function of the marginal distribution of the instance space, and is
therefore more generally applicable.
Victor de la Pena , Haolin Zou Subjects : Machine Learning (stat.ML) ; Machine Learning (cs.LG)
Online learning to rank is a core problem in machine learning. In Lattimore
et al. (2018), a novel online learning algorithm was proposed based on
topological sorting. In that paper, the authors provided a set of
self-normalized inequalities (a) used in the algorithm as a criterion in
iterations and (b) used to provide an upper bound for cumulative regret, which
is a measure of algorithm performance. In this work, we utilize the method of
mixtures and asymptotic expansions of certain implicit functions to provide a
tighter, iterated-log-like bound for the inequalities, and as a consequence
improve both the algorithm itself and its performance estimation.
Iman Marivani , Evaggelia Tsiligianni , Bruno Cornelis , Nikos Deligiannis Subjects : Computer Vision and Pattern Recognition (cs.CV) ; Machine Learning (cs.LG)
The reconstruction of a high-resolution image given a low-resolution
observation is an ill-posed inverse problem in imaging. Deep learning methods
rely on training data to learn an end-to-end mapping from a low-resolution
input to a high-resolution output. Unlike existing deep multimodal models that
do not incorporate domain knowledge about the problem, we propose a multimodal
deep learning design that incorporates sparse priors and allows the effective
integration of information from another image modality into the network
architecture. Our solution relies on a novel deep unfolding operator,
performing steps similar to an iterative algorithm for convolutional sparse
coding with side information; therefore, the proposed neural network is
interpretable by design. The deep unfolding architecture is used as a core
component of a multimodal framework for guided image super-resolution. An
alternative multimodal design is investigated by employing residual learning
to improve the training efficiency.
The presented multimodal approach is applied to super-resolution of
near-infrared and multi-spectral images as well as depth upsampling using RGB
images as side information. Experimental results show that our model
outperforms state-of-the-art methods.
Wen Wang , Xiaojiang Peng , Yu Qiao , Jian Cheng Subjects : Human-Computer Interaction (cs.HC) ; Machine Learning (cs.LG)
Online action detection (OAD) is a practical yet challenging task, which has
attracted increasing attention in recent years. A typical OAD system mainly
consists of three modules: a frame-level feature extractor, usually based on
pre-trained deep Convolutional Neural Networks (CNNs); a temporal modeling
module; and an action classifier. Among them, the temporal modeling module is
crucial, as it aggregates discriminative information from historical and
current features. Though many temporal modeling methods have been developed
for OAD and other topics, their effects have not been fairly investigated for
OAD. This paper aims to provide a comprehensive study of temporal modeling for
OAD, covering four meta types of temporal modeling methods, i.e. temporal
pooling, temporal convolution, recurrent neural networks, and temporal
attention, and to uncover some good practices for producing a state-of-the-art
OAD system. Many of these methods are explored for OAD for the first time, and
extensively evaluated with various hyper-parameters. Furthermore, based on our
comprehensive study, we present several hybrid temporal modeling methods,
which outperform the recent state-of-the-art methods by sizable margins on
THUMOS-14 and TVSeries.
Artem Ryzhikov , Denis Derkach , Mikhail Hushchyn Subjects : Data Analysis, Statistics and Probability (physics.data-an) ; Machine Learning (cs.LG)
Accurate particle identification (PID) is one of the most important aspects of
the LHCb experiment. Modern machine learning techniques such as neural
networks (NNs) are efficiently applied to this problem and are integrated into
the LHCb software. In this research, we discuss novel applications of neural
network speed-up techniques to achieve faster PID in LHC upgrade conditions.
We show that the best results are obtained using variational dropout
sparsification, which provides a prediction (feedforward pass) speed increase
of up to a factor of sixteen even when compared to a model with shallow
networks.
Andreas Buchberger , Christian Häger , Henry D. Pfister , Laurent Schmalen , Alexandre Graell i Amat Subjects : Information Theory (cs.IT) ; Machine Learning (cs.LG)
We consider near maximum-likelihood (ML) decoding of short linear block codes
based on neural belief propagation (BP) decoding recently introduced by
Nachmani et al. While this method significantly outperforms conventional BP
decoding, the underlying parity-check matrix may still limit the overall
performance. In this paper, we introduce a method to tailor an overcomplete
parity-check matrix to (neural) BP decoding using machine learning. We
consider the weights in the Tanner graph as an indication of the importance of
the connected check nodes (CNs) to decoding and use them to prune unimportant
CNs. As the pruning is not tied over iterations, the final decoder uses a
different parity-check matrix in each iteration. For Reed-Muller and short
low-density parity-check codes, we achieve performance within 0.27 dB and
1.5 dB of the ML performance while reducing the complexity of the decoder.
Junsuk Choe , Seong Joon Oh , Seungho Lee , Sanghyuk Chun , Zeynep Akata , Hyunjung Shim Subjects : Computer Vision and Pattern Recognition (cs.CV) ; Machine Learning (cs.LG)
Weakly-supervised object localization (WSOL) has gained popularity in recent
years for its promise to train localization models with only image-level
labels. Since the seminal WSOL work on class activation mapping (CAM), the
field has focused on how to expand the attention regions to cover objects more
broadly and localize them better. However, these strategies rely on full
localization supervision to validate hyperparameters and for model selection,
which is in principle prohibited under the WSOL setup. In this paper, we argue
that the WSOL task is ill-posed with only image-level labels, and propose a
new evaluation protocol where full supervision is limited to only a small
held-out set not overlapping with the test set. We observe that, under our
protocol, the five most recent WSOL methods have not made a major improvement
over the CAM baseline. Moreover, we report that existing WSOL methods have not
reached the few-shot learning baseline, where the full supervision at
validation time is used for model training instead. Based on our findings, we
discuss some future directions for WSOL.
Deep Learning for Sensor-based Human Activity Recognition: Overview, Challenges and Opportunities
Kaixuan Chen , Dalin Zhang , Lina Yao , Bin Guo , Zhiwen Yu , Yunhao Liu Subjects : Human-Computer Interaction (cs.HC) ; Machine Learning (cs.LG)
The vast proliferation of sensor devices and the Internet of Things enables
applications of sensor-based activity recognition. However, there exist
substantial challenges that could influence the performance of the recognition
system in practical scenarios. Recently, as deep learning has demonstrated its
effectiveness in many areas, plenty of deep methods have been investigated to
address the challenges in activity recognition. In this study, we present a
survey of the state-of-the-art deep learning methods for sensor-based human
activity recognition. We first introduce the multi-modality of the sensory
data and provide information on public datasets that can be used for
evaluation in different challenge tasks. We then propose a new taxonomy to
structure the deep methods by challenges. Challenges and challenge-related
deep methods are summarized and analyzed to form an overview of the current
research progress. At the end of this work, we discuss the open issues and
provide some insights for future directions.
Comments: 8 pages, 2 figures
Subjects:
Machine Learning (stat.ML)
; Machine Learning (cs.LG); Statistics Theory (math.ST)
The problem of maximizing (or minimizing) the agreement between clusterings,
subject to given marginals, can be formally posed under a common framework for
several agreement measures. Until now, it was possible to find its solution
only through numerical algorithms. Here, an explicit solution is shown for the
case where the two clusterings have two clusters each.
Comments: 17 pages, 7 figures
Subjects:
Machine Learning (stat.ML)
; Machine Learning (cs.LG)
Transport demand is highly dependent on supply, especially for shared
transport services where availability is often limited. As observed demand
cannot be higher than available supply, historical transport data typically
represents a biased, or censored, version of the true underlying demand
pattern.
Without explicitly accounting for this inherent distinction, predictive models of demand would necessarily represent a biased version of the true demand, and would thus predict the needs of service users less effectively. To counter this problem, we propose a general method for censorship-aware demand modeling, for which we devise a censored likelihood function. We apply this method to the task of shared mobility demand prediction by incorporating the censored likelihood within a Gaussian Process model, which can flexibly approximate arbitrary functional forms. Experiments on artificial and real-world datasets show how taking into account the limiting effect of supply on demand is essential for obtaining an unbiased predictive model of user demand behavior.
Colin Summers , Kendall Lowrey , Aravind Rajeswaran , Siddhartha Srinivasa , Emanuel Todorov Subjects : Robotics (cs.RO) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
We introduce Lyceum, a high-performance computational ecosystem for robot learning. Lyceum is built on top of the Julia programming language and the MuJoCo physics simulator, combining the ease of use of a high-level programming language with the performance of native C. In addition, Lyceum has a straightforward API to support parallel computation across multiple cores and machines. Overall, depending on the complexity of the environment, Lyceum is 5-30x faster compared to other popular abstractions like OpenAI’s Gym and DeepMind’s dm-control. This substantially reduces training time for various reinforcement learning algorithms and is also fast enough to support real-time model predictive control through MuJoCo. The code, tutorials, and demonstration videos can be found at: this http URL .
Comments: Accepted to ISBI 2020 : Image and Video Processing (eess.IV) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Ultrasound (US) is one of the most commonly used imaging modalities in both diagnosis and surgical interventions due to its low cost, safety, and non-invasive characteristic. US image segmentation is currently a unique challenge because of the presence of speckle noise. As manual segmentation requires considerable effort and time, the development of automatic segmentation algorithms has attracted researchers' attention. Although recent methodologies based on convolutional neural networks have shown promising performance, their success relies on the availability of a large number of training data, which is prohibitively difficult to obtain for many applications. Therefore, in this study we propose the use of simulated US images and natural images as auxiliary datasets in order to pre-train our segmentation network, which is then fine-tuned with limited in vivo data. We show that with as few as 19 in vivo images, fine-tuning the pre-trained network improves the Dice score by 21% compared to training from scratch. We also demonstrate that if the same number of natural and simulated US images is available, pre-training on simulation data is preferable.
Comments: Submitted to DAS2020 : Computer Vision and Pattern Recognition (cs.CV) ; Machine Learning (cs.LG)
Designing fonts requires a great deal of time and effort. It requires professional skills, such as sketching, vectorizing, and image editing. Additionally, each letter has to be designed individually. In this paper, we introduce a method to create fonts automatically.
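The pre-train-then-fine-tune recipe from the ultrasound segmentation study above follows a standard pattern. A minimal PyTorch sketch, with a toy stand-in for the segmentation network and hypothetical data loaders (simulated_loader, in_vivo_loader):

    import torch
    import torch.nn as nn

    net = nn.Sequential(                  # toy stand-in for a segmentation CNN
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 1, 3, padding=1),
    )
    loss_fn = nn.BCEWithLogitsLoss()

    def run_epoch(loader, lr):
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        for img, mask in loader:
            opt.zero_grad()
            loss_fn(net(img), mask).backward()
            opt.step()

    # run_epoch(simulated_loader, lr=1e-3)  # pre-train on simulated US images
    # run_epoch(in_vivo_loader, lr=1e-4)    # fine-tune on ~19 in vivo images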
In our proposed method, the difference in font styles between two different fonts is found and transferred to another font using neural style transfer. Neural style transfer is a method of stylizing the contents of an image with the styles of another image. We propose novel neural style difference and content difference losses for the neural style transfer. With these losses, new fonts can be generated by adding or removing font styles from a font. We provide experimental results with various combinations of input fonts and discuss limitations and future development of the proposed method.
Comments: 19 pages, 5 figures, 2 tables : Image and Video Processing (eess.IV) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph); Quantitative Methods (q-bio.QM)
Histological staining is a vital step used to diagnose various diseases and has been used for more than a century to provide contrast to tissue sections, rendering the tissue constituents visible for microscopic analysis by medical experts. However, this process is time-consuming, labor-intensive, expensive, and destructive to the specimen. Recently, the ability to virtually stain unlabeled tissue sections, entirely avoiding the histochemical staining step, has been demonstrated using tissue-stain-specific deep neural networks. Here, we present a new deep learning-based framework that generates virtually stained images using label-free tissue, where different stains are merged following a micro-structure map defined by the user. This approach uses a single deep neural network that receives two different sources of information at its input: (1) autofluorescence images of the label-free tissue sample, and (2) a digital staining matrix which represents the desired microscopic map of the different stains to be virtually generated in the same tissue section. This digital staining matrix is also used to virtually blend existing stains, digitally synthesizing new histological stains. We trained and blindly tested this virtual-staining network using unlabeled kidney tissue sections to generate micro-structured combinations of Hematoxylin and Eosin (H&E), Jones silver stain, and Masson’s Trichrome stain.
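The two-input design of the virtual-staining network described above can be sketched as a network that consumes the autofluorescence channels concatenated with the user-defined digital staining matrix. A minimal PyTorch rendering (the generic layer choices are assumptions; the paper's architecture is more elaborate):

    import torch
    import torch.nn as nn

    class VirtualStainer(nn.Module):
        def __init__(self, af_ch=1, stain_ch=3):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(af_ch + stain_ch, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 3, 3, padding=1),  # RGB stained-image output
            )

        def forward(self, autofluor, stain_matrix):
            # stain_matrix encodes, pixel by pixel, which stain(s) to render
            return self.body(torch.cat([autofluor, stain_matrix], dim=1))

    out = VirtualStainer()(torch.rand(1, 1, 64, 64), torch.rand(1, 3, 64, 64))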
Using a single network, this approach multiplexes the virtual staining of label-free tissue with multiple types of stains and paves the way for synthesizing new digital histological stains that can be created on the same tissue cross-section, which is currently not feasible with standard histochemical staining methods.
Tsun-Yi Yang , Duy-Kien Nguyen , Huub Heijnen , Vassileios Balntas Subjects : Computer Vision and Pattern Recognition (cs.CV) ; Machine Learning (cs.LG)
In this paper, we explore how three related tasks, namely keypoint detection, description, and image retrieval, can be jointly tackled using a single unified framework that is trained without the need for training data with point-to-point correspondences. By leveraging diverse information from sequential layers of a standard ResNet-based architecture, we are able to extract keypoints and descriptors that encode local information using generic techniques such as local activation norms, channel grouping and dropping, and self-distillation. Subsequently, global information for image retrieval is encoded in an end-to-end pipeline based on pooling of the aforementioned local responses. In contrast to previous methods in local matching, our method does not depend on pointwise/pixelwise correspondences and requires no such supervision at all, i.e., no depth maps from an SfM model nor manually created synthetic affine transformations. We illustrate that this simple and direct paradigm is able to achieve very competitive results against the state-of-the-art methods in various challenging benchmark conditions such as viewpoint changes, scale changes, and day-night shifting localization.
Yuxiao Chen , Sumanth Dathathri , Tung Phan-Minh , Richard M. Murray Subjects : Systems and Control (eess.SY) ; Machine Learning (cs.LG); Robotics (cs.RO)
There is a growing interest in building autonomous systems that interact with complex environments. The difficulty associated with obtaining an accurate model for such environments poses a challenge to the task of assessing and guaranteeing the system's performance. We present a data-driven solution that allows a system to be evaluated for specification conformance without an accurate model of the environment. The approach begins by learning, from data and a specification of the system's desired behavior, a conservative reactive bound on the environment's actions that captures its possible behaviors with high probability. This bound is then used to assist verification; if verification fails under this bound, the algorithm returns counter-examples that show how the failure occurs and then uses these to refine the bound. We demonstrate the applicability of the approach through two case studies: (i) verifying controllers for a toy multi-robot system, and (ii) verifying an instance of human-robot interaction during a lane-change maneuver given real-world human driving data.
Comments: 7 pages, 8 figures, 2 tables, accepted by The Web Conference 2020 : Computation and Language (cs.CL) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
There is a perennial need in the online advertising industry to refresh ad creatives, i.e., the images and text used for enticing online users towards a brand. Such refreshes are required to reduce the likelihood of ad fatigue among online users, and to incorporate insights from other successful campaigns in related product categories.
Given a brand, coming up with themes for a new ad is a painstaking and time-consuming process for creative strategists. Strategists typically draw inspiration from the images and text used in past ad campaigns, as well as from world knowledge of the brands. To automatically infer ad themes via such multimodal sources of information from past ad campaigns, we propose a theme (keyphrase) recommender system for ad creative strategists. The theme recommender is based on aggregating results from a visual question answering (VQA) task, which ingests the following: (i) ad images, (ii) text associated with the ads as well as Wikipedia pages on the brands in the ads, and (iii) questions around the ad. We leverage transformer-based cross-modality encoders to train visual-linguistic representations for our VQA task. We study two formulations for the VQA task, along the lines of classification and ranking; via experiments on a public dataset, we show that cross-modal representations lead to significantly better classification accuracy and ranking precision-recall metrics. Cross-modal representations show better performance compared to separate image and text representations. In addition, the use of multimodal information shows a significant lift over using only textual or visual information.
Wei Wang , Xiang-Gen Xia , Chuanjiang He , Zemin Ren , Jian Lu , Tianfu Wang , Baiying Lei Subjects : Image and Video Processing (eess.IV) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
A CT image can be well reconstructed when the sampling rate of the sinogram satisfies the Nyquist criteria and the sampled signal is noise-free. However, in practice, the sinogram is usually contaminated by noise, which degrades the quality of the reconstructed CT image. In this paper, we design a deep network for sinogram and CT image reconstruction. The network consists of two cascaded blocks that are linked by a filtered backprojection (FBP) layer, where the former block is responsible for denoising and completing the sinograms while the latter removes the noise and artifacts of the CT images. Experimental results show that the CT images reconstructed by our method have the highest PSNR and SSIM on average compared to state-of-the-art methods.
Comments: 12 pages, 7 figures, 2 tables, accepted to SafeAI workshop at AAAI : Artificial Intelligence (cs.AI) ; Machine Learning (cs.LG)
Which variables does an agent have an incentive to control with its decision, and which variables does it have an incentive to respond to? We formalise these incentives and demonstrate unique graphical criteria for detecting them in any single-decision causal influence diagram. To this end, we introduce structural causal influence models, a hybrid of the influence diagram and structural causal model frameworks. Finally, we illustrate how these criteria predict agent incentives in both fairness and AI safety applications.
Lorenz Braun , Sotirios Nikas , Chen Song , Vincent Heuveline , Holger Fröning Subjects : Distributed, Parallel, and Cluster Computing (cs.DC) ; Machine Learning (cs.LG); Performance (cs.PF)
Characterizing compute kernel execution behavior on GPUs for efficient task scheduling is a non-trivial task. We address this with a simple model enabling portable and fast predictions among different GPUs using only extracted hardware-independent features. This model is built using random forests on 189 individual compute kernels from benchmarks such as Parboil, Rodinia, Polybench-GPU and SHOC.
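The GPU performance model just described reduces, at its core, to fitting random forests on per-kernel features. A minimal scikit-learn sketch with placeholder features and labels (the real study extracts hardware-independent kernel features and trains separate models for time and power):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.random((189, 10))   # per-kernel features, e.g. instruction/memory mix
    t = rng.random(189)         # measured execution times (placeholders)

    time_model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, t)
    pred = time_model.predict(X[:5])   # fast prediction for unseen kernels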
Evaluation of the model performance using cross-validation yields a median Mean Average Percentage Error (MAPE) of [13.45%, 44.56%] for time prediction and [1.81%, 2.91%] for power prediction on five different GPUs, while the latency of a single prediction varies between 0.1 and 0.2 seconds.
Zhen-Liang Ni , Gui-Bin Bian , Guan-An Wang , Xiao-Hu Zhou , Zeng-Guang Hou , Xiao-Liang Xie , Zhen Li , Yu-Han Wang Subjects : Computer Vision and Pattern Recognition (cs.CV) ; Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
Surgical instrument segmentation is extremely important for computer-assisted surgery. Unlike common object segmentation, it is more challenging due to the large illumination and scale variation caused by the special surgical scenes. In this paper, we propose a novel bilinear attention network with an adaptive receptive field to solve these two challenges. For the illumination variation, the bilinear attention module can capture second-order statistics to encode global contexts and semantic dependencies between local pixels. With them, semantic features in challenging areas can be inferred from their neighbors, and the distinction between various semantics can be boosted. For the scale variation, our adaptive receptive field module aggregates multi-scale features and automatically fuses them with different weights. Specifically, it encodes the semantic relationship between channels to emphasize feature maps with appropriate scales, changing the receptive field of subsequent convolutions. The proposed network achieves the best performance, 97.47% mean IOU, on Cata7 and takes first place on EndoVis 2017, overtaking the second-ranking method by 10.10% IOU.
Comments: Submitted to IEEE International Conference on Acoustics, Speech, and Signal Processing 2020 (ICASSP 2020). Link to GitHub Repository: this https URL : Audio and Speech Processing (eess.AS) ; Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
The state-of-the-art approach to speaker verification involves the extraction of discriminative embeddings like x-vectors, followed by a generative model back-end using probabilistic linear discriminant analysis (PLDA). In this paper, we propose a pairwise neural discriminative model for the task of speaker verification, which operates on a pair of speaker embeddings such as x-vectors/i-vectors and outputs a score that can be considered a scaled log-likelihood ratio. We construct a differentiable cost function that approximates the speaker verification loss, namely the minimum detection cost. The pre-processing steps of linear discriminant analysis (LDA), unit-length normalization, and within-class covariance normalization are all modeled as layers of a neural model, and the speaker verification cost function can be back-propagated through these layers during training. We also explore regularization techniques to prevent overfitting, which is a major concern when using discriminative back-end models for verification tasks. The experiments are performed on the NIST SRE 2018 development and evaluation datasets. We observe average relative improvements of 8% in the CMN2 condition and 30% in the VAST condition over the PLDA baseline system.
Comments: Submitted to IEEE Transactions on Neural Networks and Learning Systems : Machine Learning (stat.ML) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
In this paper we develop a new model for deep image clustering, using convolutional neural networks and tensor kernels.
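The pairwise discriminative back-end described in the speaker-verification paper above can be pictured as differentiable pre-processing layers followed by a learned pairwise score. A minimal PyTorch sketch (the LDA-like projection and bilinear score below are generic stand-ins, not the authors' exact parameterization):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PairwiseBackend(nn.Module):
        def __init__(self, dim=512, proj=150):
            super().__init__()
            self.lda = nn.Linear(dim, proj, bias=False)  # trainable LDA-like layer
            self.W = nn.Parameter(torch.eye(proj))       # bilinear scoring matrix

        def forward(self, x1, x2):
            e1 = F.normalize(self.lda(x1), dim=-1)       # unit-length normalization
            e2 = F.normalize(self.lda(x2), dim=-1)
            return torch.sum((e1 @ self.W) * e2, dim=-1) # scaled LLR-like score

    score = PairwiseBackend()(torch.randn(4, 512), torch.randn(4, 512))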
The proposed Deep Tensor Kernel Clustering (DTKC) consists of a convolutional neural network (CNN) that is trained to reflect a common cluster structure at the output of its intermediate layers. Encouraging a consistent cluster structure throughout the network has the potential to guide it towards meaningful clusters, even though these clusters might appear nonlinear in the input space. The cluster structure is enforced through the idea of unsupervised companion objectives, where separate loss functions are attached to layers in the network. These unsupervised companion objectives are constructed based on a proposed generalization of the Cauchy-Schwarz (CS) divergence from vectors to tensors of arbitrary rank. Generalizing the CS divergence to tensor-valued data is a crucial step, due to the tensorial nature of the intermediate representations in the CNN. Several experiments are conducted to thoroughly assess the performance of the proposed DTKC model. The results indicate that the model outperforms, or performs comparably to, a wide range of baseline algorithms. We also empirically demonstrate that our model does not suffer from objective function mismatch, which can be a problematic artifact in autoencoder-based clustering models.
Renoh Johnson Chalakkal , Faizal Hafiz , Waleed Abdulla , Akshya Swain Subjects : Image and Video Processing (eess.IV) ; Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
The present study proposes a new approach to the automated screening of Clinically Significant Macular Edema (CSME) and addresses two major challenges associated with such screenings, i.e., exudate segmentation and imbalanced datasets. The proposed approach replaces conventional exudate-segmentation-based feature extraction by combining a pre-trained deep neural network with meta-heuristic feature selection. A feature-space over-sampling technique is used to overcome the effects of skewed datasets, and the screening is accomplished by a k-NN based classifier. The role of each data-processing step (e.g., class balancing, feature selection) and the effect of limiting the region of interest to the fovea on the classification performance are critically analyzed. Finally, the selection and implications of the operating point on the Receiver Operating Characteristic curve are discussed. The results of this study convincingly demonstrate that by following these fundamental practices of machine learning, a basic k-NN based classifier can effectively accomplish CSME screening.
Comments: 2019 International Conference on Image Analysis and Recognition : Image and Video Processing (eess.IV) ; Machine Learning (cs.LG); Machine Learning (stat.ML)
We propose a novel direction for improving the denoising quality of filtering-based denoising algorithms in real time by predicting the best filter parameter value using a Convolutional Neural Network (CNN). We take the use case of BM3D, the state-of-the-art filtering-based denoising algorithm, to demonstrate and validate our approach. We propose and train a simple, shallow CNN to predict, in real time, the optimum filter parameter value given the input noisy image. Each training example consists of a noisy input image (training data) and the filter parameter value that produces the best output (training label).
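The parameter-predicting CNN described above amounts to a small image-to-scalar regressor. A minimal PyTorch sketch (toy layer sizes assumed; bm3d below stands for a hypothetical call to an external denoiser):

    import torch
    import torch.nn as nn

    param_net = nn.Sequential(
        nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(8, 1),                 # predicted filter parameter
    )
    noisy = torch.rand(1, 1, 64, 64)
    sigma_hat = param_net(noisy)
    # denoised = bm3d(noisy, sigma_hat.item())  # hand the prediction to the filter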
Both qualitative and quantitative results, using the widely used PSNR and SSIM metrics on the popular BSD68 dataset, show that the CNN-guided BM3D outperforms the original, unguided BM3D across different noise levels. Thus, our proposed method is a CNN-based improvement on the original BM3D, which uses a fixed, default parameter value for all images.
Comments: 2018 IEEE SENSORS : Image and Video Processing (eess.IV) ; Machine Learning (cs.LG); Machine Learning (stat.ML)
Interferometric Synthetic Aperture Radar (InSAR) imagery, based on microwaves reflected off ground targets, is becoming increasingly important in remote sensing for ground movement estimation. However, the reflections are contaminated by noise, which distorts the signal's wrapped phase. Demarcation of image regions based on their degree of contamination ("coherence") is an important component of the InSAR processing pipeline. We introduce Convolutional Neural Networks (CNNs) to this problem domain and show their effectiveness in improving coherence-based demarcation and reducing misclassifications in completely incoherent regions through intelligent preprocessing of the training data. Quantitative and qualitative comparisons prove the superiority of the proposed method over three established methods.
Comments: 2018 IEEE SENSORS : Image and Video Processing (eess.IV) ; Machine Learning (cs.LG); Machine Learning (stat.ML)
Interferometric Synthetic Aperture Radar (InSAR) imagery for estimating ground movement, based on microwaves reflected off ground targets, is gaining increasing importance in remote sensing. However, noise corrupts the microwave reflections received at the satellite and contaminates the signal's wrapped phase. We introduce Convolutional Neural Networks (CNNs) to this problem domain and show the effectiveness of autoencoder CNN architectures in learning InSAR image denoising filters in the absence of clean ground-truth images, and in reducing artefacts in estimated coherence through intelligent preprocessing of the training data. We compare our results with four established methods to illustrate the superiority of the proposed method.
Comments: Accepted by AISTATS2020 : Computation and Language (cs.CL) ; Machine Learning (cs.LG)
Reinforcement learning (RL) has been widely studied for improving sequence-generation models. However, the conventional rewards used for RL training typically cannot capture sufficient semantic information and therefore induce model bias. Further, sparse and delayed rewards make RL exploration inefficient. To alleviate these issues, we propose the concept of nested-Wasserstein distance for distributional semantic matching. To exploit it further, a novel nested-Wasserstein self-imitation learning framework is developed, encouraging the model to exploit historical high-reward sequences for enhanced exploration and better semantic matching. Our solution can be understood as approximately executing proximal policy optimization with Wasserstein trust regions. Experiments on a variety of unconditional and conditional sequence-generation tasks demonstrate that the proposed approach consistently leads to improved performance.
Shun-ichi Amari Subjects : Machine Learning (stat.ML) ; Machine Learning (cs.LG)
It is known that any target function is realized in a sufficiently small neighborhood of any randomly connected deep network, provided the width (the number of neurons in a layer) is sufficiently large. There are sophisticated theories and discussions concerning this striking fact, but rigorous theories are very complicated.
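The high-dimensional fact this note builds on is easy to verify numerically: each coordinate of a uniform point on the unit sphere in R^n is close to Gaussian with variance 1/n, so low-dimensional projections of the sphere look like small-covariance Gaussians. A quick numpy check:

    import numpy as np

    n = 1_000
    x = np.random.randn(10_000, n)
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # uniform points on S^{n-1}
    coord = x[:, 0] * np.sqrt(n)                   # rescaled 1-D projection
    print(coord.mean(), coord.var())               # ~0 and ~1: standard Gaussian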
We give an elementary geometrical proof using a simple model, for the purpose of elucidating the structure of this result. We show that high-dimensional geometry plays a magical role: when we project a high-dimensional sphere of radius 1 to a low-dimensional subspace, the uniform distribution over the sphere reduces to a Gaussian distribution of negligibly small covariances.
Nan Wu , Adrien Vincent , Dmitri Strukov , Yuan Xie Subjects : Emerging Technologies (cs.ET) ; Machine Learning (cs.LG)
Recently, significant progress has been made in solving sophisticated problems across various domains by using reinforcement learning (RL), which allows machines or agents to learn from interactions with environments rather than from explicit supervision. As the end of Moore's law seems to be imminent, emerging technologies that enable high-performance neuromorphic hardware systems are attracting increasing attention. Namely, neuromorphic architectures that leverage memristors, programmable and nonvolatile two-terminal devices, as synaptic weights in hardware neural networks are candidates of choice for realizing such highly energy-efficient and complex nervous systems. However, one of the challenges for memristive hardware with integrated learning capabilities is the prohibitively large number of write cycles that might be required during the learning process, and this situation is even exacerbated in RL settings. In this work we propose a memristive neuromorphic hardware implementation of the actor-critic algorithm in RL. By introducing a two-fold training procedure (i.e., ex-situ pre-training and in-situ re-training) and several training techniques, the number of weight updates can be significantly reduced, making the approach suitable for efficient in-situ learning implementations. As a case study, we consider the task of balancing an inverted pendulum, a classical problem in both RL and control theory. We believe that this study shows the promise of using memristor-based hardware neural networks for handling complex tasks through in-situ reinforcement learning.
Ramprasaath R. Selvaraju , Purva Tendulkar , Devi Parikh , Eric Horvitz , Marco Ribeiro , Besmira Nushi , Ece Kamar Subjects : Computer Vision and Pattern Recognition (cs.CV) ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Existing VQA datasets contain questions with varying levels of complexity. While the majority of questions in these datasets require perception for recognizing the existence, properties, and spatial relationships of entities, a significant portion of questions pose challenges that correspond to reasoning tasks, i.e., tasks that can only be answered through a synthesis of perception and knowledge about the world, logic and/or reasoning. This distinction allows us to notice when existing VQA models have consistency issues: they answer the reasoning question correctly but fail on associated low-level perception questions. For example, models answer the complex reasoning question "Is the banana ripe enough to eat?" correctly, but fail on the associated perception question "Are the bananas mostly green or yellow?", indicating that the model likely answered the reasoning question correctly but for the wrong reason.
We quantify the extent to which this phenomenon occurs by creating a new Reasoning split of the VQA dataset and collecting Sub-VQA, a new dataset consisting of 200K new perception questions that serve as sub-questions corresponding to the set of perceptual tasks needed to effectively answer the complex reasoning questions in the Reasoning split. Additionally, we propose an approach called Sub-Question Importance-aware Network Tuning (SQuINT), which encourages the model to attend to the same parts of the image when answering the reasoning question and the perception sub-questions. We show that SQuINT improves model consistency by 7.8% and marginally improves performance on the Reasoning questions in VQA, while also producing qualitatively better attention maps.
Optimal Rate of Convergence for Deep Neural Network Classifiers under the Teacher-Student Setting
Tianyang Hu , Zuofeng Shang , Guang Cheng Subjects : Machine Learning (stat.ML) ; Machine Learning (cs.LG)
Classifiers built with neural networks handle large-scale, high-dimensional data, such as facial images from computer vision, extremely well, while traditional statistical methods often fail miserably. In this paper, we attempt to understand this empirical success in high-dimensional classification by deriving the convergence rates of the excess risk. In particular, a teacher-student framework is proposed that assumes the Bayes classifier to be expressed as a ReLU neural network. In this setup, we obtain a dimension-independent and un-improvable rate of convergence, i.e., (O(n^{-2/3})), for classifiers trained with either the 0-1 loss or the hinge loss. This rate can be further improved to (O(n^{-1})) when the data are separable. Here, (n) denotes the sample size.
Meysam Asgari-Chenaghlu , M.Reza Feizi-Derakhshi , Leili Farzinvash , Cina Motamed Subjects : Computation and Language (cs.CL) ; Machine Learning (cs.LG); Multimedia (cs.MM); Social and Information Networks (cs.SI)
Named Entity Recognition (NER) from social media posts is a challenging task. User-generated content, which forms the nature of social media, is noisy and contains grammatical and linguistic errors. This noisy content makes tasks such as named entity recognition much harder. However, some applications, like automatic journalism or information retrieval from social media, require more information about the entities mentioned in groups of social media posts. Conventional methods applied to structured and well-typed documents provide acceptable results, but when applied to new user-generated media, these methods are not satisfactory. One valuable piece of information about an entity is the image related to the text. Combining this multimodal data reduces ambiguity and provides wider information about the entities mentioned. In order to address this issue, we propose a novel deep learning approach utilizing multimodal deep learning. Our solution is able to provide more accurate results on the named entity recognition task. Experimental results, namely the precision, recall and F1 score metrics, show the superiority of our work compared to other state-of-the-art NER solutions.
Comments: MPhil Thesis (Scientific Computing, Physics, Machine Learning) : Machine Learning (stat.ML) ; Machine Learning (cs.LG); Applications (stat.AP)
The aim of this work is to propose a meta-algorithm for automatic classification in the presence of discrete binary classes. Classifier learning in the presence of overlapping class distributions is a challenging problem in machine learning.
Overlapping classes are characterized by ambiguous areas in the feature space with a high density of points belonging to both classes. This often occurs in real-world datasets; one such example is numeric data denoting properties of particle decays derived from high-energy accelerators like the Large Hadron Collider (LHC). A significant body of research targeting the class overlap problem uses ensemble classifiers to boost the performance of algorithms, either by using them iteratively in multiple stages or by using multiple copies of the same model on different subsets of the input training data. The former is called boosting and the latter is called bagging. The algorithm proposed in this thesis targets a challenging classification problem in high-energy physics: improving the statistical significance of the Higgs discovery. The underlying dataset used to train the algorithm is experimental data built from the official ATLAS full-detector simulation, with Higgs events (signal) mixed with different background events (background) that closely mimic the statistical properties of the signal, generating class overlap. The proposed algorithm is a variant of the classical boosted decision tree, which is known to be one of the most successful analysis techniques in experimental physics. The algorithm utilizes a unified framework that combines two meta-learning techniques, bagging and boosting. The results show that this combination only works in the presence of a randomization trick in the base learners.
Comments: 35 pages, 5 figures, 5 tables : Machine Learning (stat.ML) ; Machine Learning (cs.LG); Methodology (stat.ME)
Global optimization of expensive functions has important applications in physical and computer experiments. It is a challenging problem to develop an efficient optimization scheme, because each function evaluation can be costly and the derivative information of the function is often not available. We propose a novel global optimization framework using an adaptive Radial Basis Function (RBF) based surrogate model via uncertainty quantification. The framework consists of two iterative steps. It first employs an RBF-based Bayesian surrogate model to approximate the true function, where the parameters of the RBFs can be adaptively estimated and updated each time a new point is explored. Then it utilizes a model-guided selection criterion to identify a new point from a candidate set for function evaluation. The selection criterion adopted here is a sample version of the expected improvement (EI) criterion; a sketch of the standard EI rule is given below. We conduct simulation studies with standard test functions, which show that the proposed method has some advantages, especially when the true surface is not very smooth. In addition, we also propose modified approaches to improve the search performance for identifying global optimal points and to deal with higher-dimensional scenarios.
Yi Wang , Yang Yang , Weiguo Zhu , Yi Wu , Xu Yan , Yongfeng Liu , Yu Wang , Liang Xie , Ziyao Gao , Wenjing Zhu , Xiang Chen , Wei Yan , Mingjie Tang , Yuan Tang Subjects : Databases (cs.DB) ; Machine Learning (cs.LG)
Industrial AI systems are mostly end-to-end machine learning (ML) workflows. A typical recommendation or business intelligence system includes many online micro-services and offline jobs. We describe SQLFlow for developing such workflows efficiently in SQL. SQL enables developers to write short programs focusing on the purpose (what) and ignoring the procedure (how).
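The EI rule referenced in the surrogate-optimization abstract above is the textbook expected-improvement formula for minimization, shown here in numpy/scipy (the paper itself uses a sample version of this criterion):

    import numpy as np
    from scipy.stats import norm

    def expected_improvement(mu, sigma, f_best):
        """EI at candidate points with surrogate mean mu and std sigma."""
        z = (f_best - mu) / np.maximum(sigma, 1e-12)
        return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    mu, sigma = np.array([0.2, 0.5]), np.array([0.1, 0.3])
    next_point = np.argmax(expected_improvement(mu, sigma, f_best=0.3))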
Previous database systems extended their SQL dialects to support ML. SQLFlow ( this https URL ) takes another strategy: it works as a bridge over various database systems, including MySQL, Apache Hive, and Alibaba MaxCompute, and ML engines such as TensorFlow, XGBoost, and scikit-learn. We extended the SQL syntax carefully to make the extension work with various SQL dialects. We implement the extension by inventing a collaborative parsing algorithm. SQLFlow is efficient and expressive for a wide variety of ML techniques: supervised and unsupervised learning; deep networks and tree models; visual model explanation in addition to training and prediction; data processing and feature extraction in addition to ML. SQLFlow compiles a SQL program into a Kubernetes-native workflow for fault-tolerant execution and on-cloud deployment. Current industrial users include Ant Financial, DiDi, and Alibaba Group.
Comments: 16 pages : Optimization and Control (math.OC) ; Machine Learning (cs.LG); Machine Learning (stat.ML)
Although theoretically appealing, Stochastic Natural Gradient Descent (SNGD) is computationally expensive; it has been shown to be highly sensitive to the learning rate, and it is not guaranteed to be convergent. Convergent Stochastic Natural Gradient Descent (CSNGD) aims at solving the last two problems. However, the computational expense of CSNGD is still unacceptable when the number of parameters is large. In this paper we introduce Dual Stochastic Natural Gradient Descent (DSNGD), in which we take advantage of dually flat manifolds to obtain a robust alternative to SNGD that is also computationally feasible.
Comments: 159 pages, 46 figures : Materials Science (cond-mat.mtrl-sci) ; Machine Learning (cs.LG)
By combining metal nodes with organic linkers we can potentially synthesize millions of possible metal-organic frameworks (MOFs). At present, we have libraries of over ten thousand synthesized materials and millions of in-silico predicted materials. The fact that we have so many materials opens many exciting avenues to tailor-make a material that is optimal for a given application. However, from an experimental and computational point of view we simply have too many materials to screen using brute-force techniques. In this review, we show that having so many materials allows us to use big-data methods as a powerful technique to study these materials and to discover complex correlations. The first part of the review gives an introduction to the principles of big-data science. We emphasize the importance of data collection, methods to augment small data sets, and how to select appropriate training sets. An important part of this review is the different approaches that are used to represent these materials in feature space. The review also includes a general overview of the different ML techniques, but as most applications in porous materials use supervised ML, our review is focused on the different approaches for supervised ML. In particular, we review the different methods to optimize the ML process and how to quantify the performance of the different methods. In the second part, we review how the different approaches of ML have been applied to porous materials. In particular, we discuss applications in the field of gas storage and separation, the stability of these materials, their electronic properties, and their synthesis. The range of topics illustrates the large variety of topics that can be studied with big-data science.
Given the increasing interest of the scientific community in ML, we expect this list to rapidly expand in the coming years.
Frank E. Curtis , Katya Scheinberg Subjects : Optimization and Control (math.OC) ; Machine Learning (cs.LG); Machine Learning (stat.ML)
Optimization lies at the heart of machine learning and signal processing. Contemporary approaches based on the stochastic gradient method are non-adaptive in the sense that their implementation employs prescribed parameter values that need to be tuned for each application. This article summarizes recent research and motivates future work on adaptive stochastic optimization methods, which have the potential to offer significant computational savings when training large-scale systems.
Comments: to be published in ICSE-SEET 2020 : Software Engineering (cs.SE) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Software engineers have significant expertise to offer when building intelligent systems, drawing on decades of experience and methods for building systems that are scalable, responsive and robust, even when built on unreliable components. Systems with artificial-intelligence or machine-learning (ML) components raise new challenges and require careful engineering. We designed a new course to teach software-engineering skills to students with a background in ML. We specifically go beyond traditional ML courses that teach modeling techniques under artificial conditions and focus, in lecture and assignments, on realism with large and changing datasets, robust and evolvable infrastructure, and purposeful requirements engineering that considers ethics and fairness as well. We describe the course and our infrastructure and share experience and all material from teaching the course for the first time.
Comments: CSCW'2020 : Human-Computer Interaction (cs.HC) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE); Machine Learning (stat.ML)
Today, the prominence of data science within organizations has given rise to teams of data science workers collaborating on extracting insights from data, as opposed to individual data scientists working alone. However, we still lack a deep understanding of how data science workers collaborate in practice. In this work, we conducted an online survey with 183 participants who work in various aspects of data science. We focused on their reported interactions with each other (e.g., managers with engineers) and with different tools (e.g., Jupyter Notebook). We found that data science teams are extremely collaborative and work with a variety of stakeholders and tools during the six common steps of a data science workflow (e.g., clean data and train model). We also found that the collaborative practices workers employ, such as documentation, vary according to the kinds of tools they use. Based on these findings, we discuss design implications for supporting data science team collaborations and future research directions.
Comments: 8 pages, 2 figures, 2 tables : Machine Learning (stat.ML) ; Machine Learning (cs.LG)
Efficient Neural Architecture Search (ENAS) achieves novel efficiency for learning architecture with high performance via parameter sharing, but suffers from an issue of slow propagation speed of the search model with deep topology.
In this paper, we propose a Broad version of ENAS (BENAS) to solve the above issue by learning a broad architecture whose propagation speed is fast, using the reinforcement learning and parameter sharing employed in ENAS, thereby achieving higher search efficiency. In particular, we elaborately design the Broad Convolutional Neural Network (BCNN), the search paradigm of BENAS, which can obtain satisfactory performance with a broad topology, i.e., fast forward and backward propagation speed. The proposed BCNN extracts multi-scale features and enhancement representations and feeds them into a global average pooling layer to yield more reasonable and comprehensive representations, so that the performance achieved by a BCNN with shallow topology can be promised. To verify the effectiveness of BENAS, several experiments are performed, and the experimental results show that 1) BENAS delivers a search cost of 0.23 days, 2x less than ENAS; 2) architectures learned by BENAS based on small-size BCNNs with 0.5 and 1.1 million parameters obtain state-of-the-art performance, with 3.63% and 3.40% test error on CIFAR-10; and 3) the learned-architecture-based BCNN achieves 25.3% top-1 error on ImageNet using just 3.9 million parameters.
Journal-ref: Knowledge-Based Systems, 2020, 105502 : Social and Information Networks (cs.SI) ; Machine Learning (cs.LG)
Recently, information cascade prediction has attracted increasing interest from researchers, but it is far from being well solved, partly due to three defects of the existing works. First, the existing works often assume an underlying information diffusion model, which is impractical in the real world due to the complexity of information diffusion. Second, the existing works often ignore the prediction of the infection order, which also plays an important role in social network analysis. Finally, the existing works often depend on the requirement of underlying diffusion networks, which are likely unobservable in practice. In this paper, we aim at the prediction of both node infection and infection order without any requirement of knowledge about the underlying diffusion mechanism or the diffusion network, where the challenges are two-fold. The first is which cascading characteristics of nodes should be captured and how to capture them, and the second is how to model the non-linear features of nodes in information cascades. To address these challenges, we propose a novel model called Deep Collaborative Embedding (DCE) for information cascade prediction, which can capture not only the node structural property but also two kinds of node cascading characteristics. We propose an auto-encoder based collaborative embedding framework to learn the node embeddings with cascade collaboration and node collaboration, in which way the non-linearity of information cascades can be effectively captured. The results of extensive experiments conducted on real-world datasets verify the effectiveness of our approach.
Comments: Accepted in WACV'2020 : Computer Vision and Pattern Recognition (cs.CV) ; Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Conventional approaches to Sketch-Based Image Retrieval (SBIR) assume that the data of all classes are available during training. This assumption may not always be practical, since the data of a few classes may be unavailable, or the classes may not appear at the time of training.
Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) relaxes this constraint and allows the algorithm to handle previously unseen classes at test time. This paper proposes a generative approach based on a Stacked Adversarial Network (SAN) combined with the advantages of a Siamese Network (SN) for ZS-SBIR. While the SAN generates high-quality samples, the SN learns a better distance metric compared to that of a nearest-neighbor search. The capability of the generative model to synthesize image features based on the sketch reduces the SBIR problem to an image-to-image retrieval problem. We evaluate the efficacy of our proposed approach on the TU-Berlin and Sketchy databases in both the standard ZSL and the generalized ZSL settings. The proposed method yields a significant improvement in the standard ZSL setting as well as in the more challenging generalized ZSL setting (GZSL) for SBIR.
Comments: Published as a long paper in EMNLP 2019 : Computation and Language (cs.CL) ; Information Retrieval (cs.IR); Machine Learning (cs.LG)
Neural conversation systems generate responses based on the sequence-to-sequence (SEQ2SEQ) paradigm. Typically, the model is equipped with a single set of learned parameters to generate responses for the given input contexts. When confronting diverse conversations, its adaptability is rather limited and the model is hence prone to generating generic responses. In this work, we propose an Adaptive Neural Dialogue generation model, AdaND, which manages various conversations with conversation-specific parameterization. For each conversation, the model generates the parameters of the encoder-decoder by referring to the input context. In particular, we propose two adaptive parameterization mechanisms: a context-aware and a topic-aware parameterization mechanism. The context-aware parameterization directly generates the parameters by capturing local semantics of the given context. The topic-aware parameterization enables parameter sharing among conversations with similar topics by first inferring the latent topics of the given context and then generating the parameters with respect to the distributional topics. Extensive experiments conducted on a large-scale real-world conversational dataset show that our model achieves superior performance in terms of both quantitative metrics and human evaluations.
Comments: 25 pages, 2 tables : Image and Video Processing (eess.IV) ; Machine Learning (cs.LG); Medical Physics (physics.med-ph); Machine Learning (stat.ML)
This paper reviews machine learning-based studies for quantitative positron emission tomography (PET). Specifically, we summarize recent developments of machine learning-based methods in PET attenuation correction and low-count PET reconstruction by listing and comparing the proposed methods, study designs and reported performances of the currently published studies, with brief discussion of representative studies. The contributions and challenges among the reviewed studies are summarized and highlighted in the discussion.
Comments: arXiv admin note: substantial text overlap with arXiv:1812.03205 : Computer Vision and Pattern Recognition (cs.CV) ; Machine Learning (cs.LG)
Convolutional neural networks (CNNs) learn filters in order to capture local correlation patterns in feature space. In this paper we propose instead to revert to learning combinations of preset spectral filters by switching to CNNs with harmonic blocks.
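A harmonic block of this kind can be sketched as a fixed bank of 2-D DCT basis filters whose responses are mixed by a learned 1x1 convolution. A minimal PyTorch rendering under that assumption (a generic reading of the idea, not the authors' code):

    import math
    import torch
    import torch.nn as nn

    def dct_filters(k=3):
        """k*k fixed 2-D DCT-II basis filters, shape (k*k, 1, k, k)."""
        f = torch.zeros(k * k, 1, k, k)
        for u in range(k):
            for v in range(k):
                for x in range(k):
                    for y in range(k):
                        f[u * k + v, 0, x, y] = (
                            math.cos(math.pi * u * (2 * x + 1) / (2 * k))
                            * math.cos(math.pi * v * (2 * y + 1) / (2 * k)))
        return f

    class HarmonicBlock(nn.Module):
        def __init__(self, in_ch, out_ch, k=3):
            super().__init__()
            self.register_buffer("dct", dct_filters(k))    # preset, not learned
            self.mix = nn.Conv2d(in_ch * k * k, out_ch, 1) # learned combination

        def forward(self, x):
            b, c, h, w = x.shape
            spec = nn.functional.conv2d(
                x.reshape(b * c, 1, h, w), self.dct, padding=1)
            return self.mix(spec.reshape(b, c * self.dct.shape[0], h, w))

    y = HarmonicBlock(3, 16)(torch.rand(1, 3, 32, 32))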
We rely on the use of Discrete Cosine Transform (DCT) filters, which have excellent energy compaction properties and are widely used for image compression. The proposed harmonic blocks rely on DCT modeling and replace conventional convolutional layers to produce partially or fully harmonic versions of new or existing CNN architectures. We demonstrate how harmonic networks can be efficiently compressed in a straightforward manner by truncating the high-frequency information in harmonic blocks, which is possible due to the redundancies in the spectral domain. We report extensive experimental validation demonstrating the benefits of introducing harmonic blocks into state-of-the-art CNN models for image classification, segmentation and edge detection applications.
Comments: Submitted for publication : Information Theory (cs.IT) ; Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
We investigate the framework of privacy amplification by iteration, recently proposed by Feldman et al., through an information-theoretic lens. We demonstrate that differential privacy guarantees of iterative mappings can be determined by a direct application of contraction coefficients derived from strong data processing inequalities for (f)-divergences. In particular, by generalizing Dobrushin's contraction coefficient for the total variation distance to an (f)-divergence known as the (E_{\gamma})-divergence, we derive tighter bounds on the differential privacy parameters of the projected noisy stochastic gradient descent algorithm with hidden intermediate updates.
Machine learning and AI-based approaches for bioactive ligand discovery and GPCR-ligand recognition
Sebastian Raschka , Benjamin Kaufman Subjects : Biomolecules (q-bio.BM) ; Machine Learning (cs.LG)
In the last decade, machine learning and artificial intelligence applications have received a significant boost in performance and attention in both academic research and industry. The success behind most of the recent state-of-the-art methods can be attributed to the latest developments in deep learning. When applied to various scientific domains that are concerned with the processing of non-tabular data, for example, images or text, deep learning has been shown to outperform not only conventional machine learning but also highly specialized tools developed by domain experts. This review aims to summarize AI-based research for GPCR bioactive ligand discovery, with a particular focus on the most recent achievements and research trends. To make this article accessible to a broad audience of computational scientists, we provide instructive explanations of the underlying methodology, including overviews of the most commonly used deep learning architectures and feature representations of molecular data. We highlight the latest AI-based research that has led to the successful discovery of GPCR bioactive ligands. However, an equal focus of this review is the discussion of machine learning-based technology that has been applied to ligand discovery in general and has the potential to pave the way for successful GPCR bioactive ligand discovery in the future. This review concludes with a brief outlook highlighting recent research trends in deep learning, such as active learning and semi-supervised learning, which have great potential for advancing bioactive ligand discovery.
Comments: 14 pages, review.
arXiv admin note: text overlap with arXiv:1810.05587 by other authors : Computer Science and Game Theory (cs.GT) ; Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Deep reinforcement learning (RL) has achieved outstanding results in recent years, which has led to a dramatic increase in the number of methods and applications. Recent works explore learning beyond single-agent scenarios and consider multi-agent scenarios. However, they face many challenges and seek help from traditional game-theoretic algorithms, which, in turn, show bright application promise when combined with modern algorithms and growing computing power. In this survey, we first introduce basic concepts and algorithms in single-agent RL and multi-agent systems; then, we summarize the related algorithms from three aspects. Solution concepts from game theory give inspiration to algorithms that try to evaluate the agents or find better solutions in multi-agent systems. Fictitious self-play has become popular and has had a great impact on algorithms for multi-agent reinforcement learning. Counterfactual regret minimization is an important tool for solving games with incomplete information, and has shown great strength when combined with deep learning.
Comments: Accepted in PSIVT2019 Conference : Computer Vision and Pattern Recognition (cs.CV) ; Machine Learning (cs.LG)
Sharing images online poses security threats to a wide range of users due to unawareness of privacy information. Deep features have been demonstrated to be a powerful representation for images. However, deep features usually suffer from the issues of a large size and the requirement of a huge amount of data for fine-tuning. In contrast to normal images (e.g., scene images), privacy images are often limited because of the sensitive information they contain. In this paper, we propose a novel approach that can work on limited data and generate deep features of smaller size. For the training images, we first extract the initial deep features from a pre-trained model and then employ the K-means clustering algorithm to learn the centroids of these initial deep features. We use the centroids learned from the training features to extract the final features for each testing image and encode our final features with the triangle encoding. To improve the discriminability of the features, we further perform fusion of the two proposed unsupervised deep features obtained from different layers. Experimental results show that the proposed features outperform state-of-the-art deep features in terms of both classification accuracy and testing time.
Comments: 14 pages, 8 figures, accepted by IEEE Transactions on Wireless Communications : Information Theory (cs.IT)
Cloud radio access network (C-RAN) has been recognized as a promising architecture for next-generation wireless systems to support the rapidly increasing demand for higher data rates. However, the performance of C-RAN is limited by the backhaul capacities, especially for wireless deployments. While C-RAN with fixed BS caching has been demonstrated to reduce backhaul consumption, it is more challenging to further optimize the cache allocation at BSs with multi-cluster multicast backhaul, where the inter-cluster interference induces additional non-convexity in the cache optimization problem.
Despite the challenges, we propose an accelerated first-order algorithm, which achieves a much higher content downloading sum-rate than a second-order algorithm running for the same amount of time. Simulation results demonstrate that, by simultaneously delivering the required contents to different multicast clusters, the proposed algorithm achieves a significantly higher downloading sum-rate than those of time-division single-cluster transmission schemes. Moreover, it is found that the proposed algorithm allocates larger cache sizes to the farther BSs within the nearer clusters, which provides insight into the superiority of the proposed cache allocation.
Luca Barletta , Stefano Rini Subjects : Information Theory (cs.IT)
In this paper, the capacity of the oversampled Wiener phase noise (OWPN) channel is investigated. The OWPN channel is a discrete-time point-to-point channel with a multi-sample receiver in which the channel output is affected by both additive and multiplicative noise. The additive noise is a white standard Gaussian process while the multiplicative noise is a Wiener phase noise process. This channel generalizes a number of channel models previously studied in the literature which investigate the effects of phase noise on the channel capacity, such as the Wiener phase noise channel and the non-coherent channel. We derive upper and inner bounds to the capacity of the OWPN channel: (i) an upper bound is derived through the I-MMSE relationship by bounding the Fisher information when estimating a phase noise sample given the past channel outputs and phase noise realizations; then (ii) two inner bounds are shown: one relying on coherent combining of the oversampled channel outputs and one relying on non-coherent combining of the samples. After capacity, we study the generalized degrees of freedom (GDoF) of the OWPN channel for the case in which the oversampling factor grows with the average transmit power (P) as (P^{\eta}) and the frequency noise variance grows as (P^{\alpha}). Using our new capacity bounds, we derive the GDoF region in three regimes: one (i) in which the GDoF region equals that of the classic additive white Gaussian noise channel (for (\eta \leq 1)), one (ii) in which the GDoF region reduces to that of the non-coherent channel (for (\eta \geq \min\{\alpha,1\})) and, finally, one (iii) in which partially-coherent combining of the over-samples is asymptotically optimal (for (2\alpha-1 \leq \eta \leq 1)). Overall, our results are the first to identify the regimes in which different oversampling strategies are asymptotically optimal.
Asmaa Abdallah , Mohammad M. Mansour Subjects : Information Theory (cs.IT)
Cell-free massive MIMO is an emerging network technology for 5G wireless communications wherein distributed multi-antenna access points (APs) serve many users simultaneously. Most prior work on cell-free massive MIMO systems assumes a time-division duplexing mode, although frequency-division duplexing (FDD) systems dominate current wireless standards. The key challenges in FDD massive MIMO systems are channel-state information (CSI) acquisition and feedback overhead. To address these challenges, we exploit the so-called angle reciprocity of multipath components in the uplink and downlink, so that the required CSI acquisition overhead scales only with the number of served users, and not with the number of AP antennas or APs. We propose a low-complexity multipath component estimation technique and present linear angle-of-arrival (AoA)-based beamforming/combining schemes for FDD-based cell-free massive MIMO systems. We analyze the performance of these schemes by deriving closed-form expressions for the mean-square error of the estimated multipath components, as well as expressions for the uplink and downlink spectral efficiency. Using semi-definite programming, we solve a max-min power allocation problem that maximizes the minimum user rate under per-user power constraints. Furthermore, we present a user-centric (UC) AP selection scheme in which each user chooses a subset of APs to improve the overall energy efficiency of the system. Simulation results demonstrate that the proposed multipath component estimation technique outperforms conventional subspace-based and gradient-descent-based techniques. We also show that the proposed beamforming and combining techniques, along with the proposed power control scheme, substantially enhance the spectral and energy efficiencies with an adequate number of antennas at the APs.
Yuhua Sun , Tongjiang Yan Subjects : Information Theory (cs.IT)
In 2008, a class of binary sequences of period \(N=4(2^k-1)(2^k+1)\) with optimal autocorrelation magnitude was presented by Yu and Gong, based on an \(m\)-sequence, the perfect sequence \((0,1,1,1)\) of period \(4\), and the interleaving technique. In this paper, we study the 2-adic complexities of these sequences. Our results show that they are larger than \(N-2\lceil \log_2 N \rceil\) (which is far larger than \(N/2\)) and can attain the maximum value \(N\) if suitable parameters are chosen; i.e., the 2-adic complexity of this class of interleaved sequences is large enough to resist the Rational Approximation Algorithm.
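For readers unfamiliar with the quantity being bounded: a common way to compute the 2-adic complexity of a period-\(N\) binary sequence goes through \(\gcd(S(2), 2^N-1)\), where \(S(2)=\sum_i s_i 2^i\). The sketch below follows this standard Klapper-Goresky-style definition; taking the floor of the (generally non-integer) logarithm is an approximation, which suffices for checking bounds like the one above.

```python
# Hedged sketch: integer part of the 2-adic complexity
# log2((2^N - 1) / gcd(S(2), 2^N - 1)) of one period s of a binary sequence.
from math import gcd

def two_adic_complexity(s):
    N = len(s)
    S2 = sum(bit << i for i, bit in enumerate(s))  # S(2) = sum_i s_i * 2^i
    d = (2**N - 1) // gcd(S2, 2**N - 1)
    return d.bit_length() - 1                      # floor(log2(d))

print(two_adic_complexity([1, 1, 1, 0, 1, 0, 0]))  # m-sequence of period 7 -> 6
```

A small gcd means a large complexity; the bound above says that for these interleaved sequences the gcd is small enough that the complexity exceeds \(N-2\lceil \log_2 N \rceil\).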
Comments: 14 pages, 11 figures, accepted by IEEE Transactions on Wireless Communications : Information Theory (cs.IT)
Coordinated multi-point (CoMP) transmission is a cooperation technique among base stations (BSs) in a cellular network, with outstanding capability for inter-cell interference (ICI) mitigation. ICI is a dominant source of error and has detrimental effects on system performance if not managed properly. Based on the theory of Poisson-Delaunay triangulation, this paper proposes a novel analytical model for CoMP operation in cellular networks. Unlike conventional CoMP operation, which is dynamic and occasionally needs on-line updating, the proposed approach enables the cooperating BS set of a user equipment (UE) to be fixed and determined off-line according to the location information of the BSs. By using the theory of stochastic geometry, the coverage probability and spectral efficiency of a typical UE are analyzed, and simulation results corroborate the effectiveness of the proposed CoMP scheme and the developed performance analysis.
Haiquan Lu , Yong Zeng , Shi Jin , Rui Zhang Subjects : Information Theory (cs.IT)
This paper proposes a new three-dimensional (3D) networking architecture enabled by an aerial intelligent reflecting surface (AIRS) to achieve panoramic signal reflection from the sky. Compared to the conventional terrestrial IRS, an AIRS not only enjoys higher deployment flexibility, but is also able to achieve \(360^{\circ}\) panoramic full-angle reflection and in general requires fewer reflections, due to its higher likelihood of having line-of-sight (LoS) links with the ground nodes. We focus on the problem of maximizing the worst-case signal-to-noise ratio (SNR) in a given coverage area by jointly optimizing the transmit beamforming, the AIRS placement, and the phase shifts. The formulated problem is non-convex and the optimization variables are coupled with each other in an intricate manner. To tackle this problem, we first consider the special case of single-location SNR maximization to gain useful insights, for which the optimal solution is obtained in closed form. Then, for the general case of area coverage, an efficient suboptimal solution is proposed by exploiting the similarity between phase shift optimization for an IRS and analog beamforming for a conventional phased array. Numerical results show that the proposed design achieves significant performance gains over heuristic AIRS deployment schemes.
Comments: 6 pages; submitted to IEEE International Conference on Communications 2020 : Information Theory (cs.IT)
Index coding is concerned with efficient broadcast of a set of messages to receivers in the presence of receiver side information. In this paper, we study the secure index coding problem with security constraints on the receivers themselves. That is, for each receiver there is a single legitimate message it needs to decode and a list of prohibited messages, none of which should be decoded by that receiver. To this end, our contributions are threefold. We first introduce a secure linear coding scheme, which is an extended version of the fractional local partial clique covering scheme that was originally devised for non-secure index coding. We then develop two information-theoretic bounds on the performance of any valid secure index code, namely a secure polymatroidal outer bound (on the capacity region) and a secure maximum acyclic induced subgraph lower bound (on the broadcast rate). The structure of these bounds leads us to further develop two necessary conditions for a given index coding problem to be securely feasible (i.e., to have nonzero rates).
Bivariate Polynomial Coding for Exploiting Stragglers in Heterogeneous Coded Computing Systems
Burak Hasircioglu , Jesus Gomez-Vilardebo , Deniz Gunduz Subjects : Information Theory (cs.IT) ; Distributed, Parallel, and Cluster Computing (cs.DC)
Polynomial coding has been proposed as a solution to the straggler mitigation problem in distributed matrix multiplication. Previous works in the literature employ univariate polynomials to encode matrix partitions. Such schemes greatly improve the speed of distributed computing systems by making the task completion time depend only on the fastest workers. However, the work done by the slowest workers, which fail to finish the tasks assigned to them, is completely ignored. In order to exploit the partial computations of the slower workers, we further decompose the overall matrix multiplication task into even smaller subtasks to better fit the workers' storage and computation capacities. In this work, we show that univariate schemes fail to make efficient use of the storage capacity, and we propose bivariate polynomial codes. We show that bivariate polynomial codes are a more natural choice to accommodate the additional decomposition of subtasks, as well as heterogeneous storage and computation resources at the workers. However, in contrast to univariate polynomial decoding, guaranteeing decodability for multivariate interpolation is much harder. We propose two bivariate polynomial schemes. The first scheme exploits the fact that bivariate interpolation is always possible on a rectangular grid of points. We obtain the rectangular grid of points at the cost of allowing some redundant computations. For the second scheme, we relax the decoding constraint and require decodability for almost all choices of evaluation points. We present interpolation sets satisfying the almost-decodability conditions for certain storage configurations of workers. Our numerical results show that bivariate polynomial coding considerably reduces the completion time of distributed matrix multiplication.
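For context, the univariate baseline that the bivariate scheme generalizes can be sketched in a few lines: split \(A\) into \(m\) row blocks and \(B\) into \(n\) column blocks, have each worker multiply the encoded blocks evaluated at its own point \(x_i\), and interpolate the degree-\(mn-1\) product polynomial from any \(mn\) finished workers. The code below is a toy, single-process illustration of this principle (in the spirit of Yu et al.'s polynomial codes), not the paper's bivariate construction.

```python
# Toy univariate polynomial coding for distributed matrix multiplication:
# any m*n of the `workers` evaluation results suffice to recover A @ B.
import numpy as np

m, n, workers = 2, 2, 6                      # m*n = 4 results needed
A = np.arange(16.0).reshape(4, 4)
B = np.arange(16.0, 32.0).reshape(4, 4)
A_blk = np.split(A, m, axis=0)               # row blocks A_0, A_1
B_blk = np.split(B, n, axis=1)               # column blocks B_0, B_1

xs = np.arange(1.0, workers + 1)             # distinct evaluation points
results = []
for x in xs:                                 # each worker's encoded product
    pA = sum(Ai * x**j for j, Ai in enumerate(A_blk))        # degree m-1
    pB = sum(Bk * x**(m * k) for k, Bk in enumerate(B_blk))  # degrees 0, m, ...
    results.append((x, pA @ pB))

pts = results[: m * n]                       # pretend only these finished
V = np.vander([x for x, _ in pts], m * n, increasing=True)
coeffs = np.linalg.solve(V, np.stack([r.ravel() for _, r in pts]))
# The coefficient of x^(j + m*k) is the block A_j @ B_k of A @ B.
assert np.allclose(coeffs[0].reshape(2, 2), A_blk[0] @ B_blk[0])
assert np.allclose(coeffs[3].reshape(2, 2), A_blk[1] @ B_blk[1])
```

The bivariate schemes above refine exactly this picture so that partially finished workers also contribute usable interpolation points.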
Paulo Almeida , Umberto Martínez-Penas , Diego Napp Subjects : Information Theory (cs.IT)
In the last decade there has been great interest in extending results for codes equipped with the Hamming metric to analogous results for codes endowed with the rank metric. This work follows this thread of research and studies the characterization of systematic generator matrices (encoders) of codes with maximum rank distance. In the context of the Hamming distance these codes are the so-called Maximum Distance Separable (MDS) codes, for which systematic encoders have been fully investigated. In this paper we investigate the algebraic properties and representation of encoders in systematic form of Maximum Rank Distance (MRD) codes and Maximum Sum Rank Distance (MSRD) codes. We address block codes and convolutional codes separately and present necessary and sufficient conditions for an encoder in systematic form to generate a code with maximum (sum) rank distance. These characterizations are given in terms of certain matrices that must be superregular in an extension field and that preserve superregularity after some transformations performed over the base field. We conclude by presenting some examples of Maximum Sum Rank convolutional codes over small fields; for the given parameters, our examples are over smaller fields than those obtained by other authors.
Zitan Chen , Min Ye , Alexander Barg Subjects : Information Theory (cs.IT)
Recently, Reed-Solomon (RS) codes were shown to possess a repair scheme that supports repair of failed nodes with optimal repair bandwidth. In this paper, we extend this result in two directions. First, we propose a new repair scheme for the RS codes constructed in [Tamo-Ye-Barg, IEEE Transactions on Information Theory, vol. 65, May 2019] and show that our new scheme is robust to erroneous information provided by the helper nodes while maintaining the optimal repair bandwidth.
Second, we construct a new family of RS codes with optimal access for the repair of any single failed node. We also show that the constructed codes can accommodate both features, supporting optimal-access repair with optimal error-correction capability. Going beyond RS codes, we also prove that any scalar MDS code with optimal repair bandwidth allows for a repair scheme with the optimal access property.
Mahdi Shakiba-Herfeh , Arsenia Chorti , H. Vince Poor Subjects : Information Theory (cs.IT)
The goal of physical layer security (PLS) is to make use of the properties of the physical layer, including the wireless communication medium and the transceiver hardware, to enable critical aspects of secure communications. In particular, PLS can be employed to provide (i) node authentication, (ii) message authentication, and (iii) message confidentiality. Unlike the corresponding classical cryptographic approaches, which are all based on computational security, PLS's added strength is that it is based on information-theoretic security, in which no limitation on the opponent's computational power is assumed; PLS is therefore inherently quantum resistant. In this survey, we review the aforementioned fundamental aspects of PLS, starting with node authentication, moving to the information-theoretic characterization of message integrity, and finally discussing message confidentiality, both from the point of view of secret key generation from shared randomness and from the wiretap channel point of view. The aim of this review is to provide a comprehensive roadmap of important relevant results by the authors and other contributors, and to discuss open issues on the applicability of PLS in sixth generation systems.
Lukas Holzbaur , Camilla Hollanti , Antonia Wachter-Zeh Subjects : Information Theory (cs.IT)
A new computational private information retrieval (PIR) scheme based on random linear codes is presented. A matrix of messages from a McEliece scheme is used to query the server with carefully chosen errors. The server responds with the sum of scalar multiples of the rows of the query matrix and the files. The user recovers the desired file by erasure decoding the response. Contrary to code-based cryptographic systems, the scheme presented here enables the use of truly random codes, not only codes disguised as such. Further, we show the relation to the so-called error subspace search problem and quotient error search problem, which we assume to be difficult, and show that the scheme is secure against attacks based on solving these problems.
Comments: 6 pages, no figures. Conference paper : Information Theory (cs.IT)
A growing number of works have, in recent years, been concerned with in-vivo DNA as a medium for data storage. This paper extends the concept of reconstruction codes for uniform-tandem-duplication noise to the model of associative memory, by finding the uncertainty associated with \(m>2\) strings (where a previous paper considered \(m=2\)). That uncertainty is found as an asymptotic expression, assuming codewords belong to a typical set of strings that constitutes a growing fraction of the space, converging to \(1\).
Wentu Song , Kui Cai , Long Shi Subjects : Information Theory (cs.IT)
Consider a centralized caching network with a single server and \(K\) users. The server has a database of \(N\) files, with each file divided into \(F\) packets (\(F\) is known as the subpacketization), and each user owns a local cache that can store a \(\frac{M}{N}\) fraction of the \(N\) files. We construct a family of centralized coded caching schemes with polynomial subpacketization. Specifically, given \(M\), \(N\), and an integer \(n\geq 0\), we construct a family of coded caching schemes for any \((K,M,N)\) caching system with \(F=O(K^{n+1})\). More generally, for any \(t\in\{1,2,\cdots,K-2\}\) and any integer \(n\) such that \(0\leq n\leq t\), we construct a coded caching scheme with \(\frac{M}{N}=\frac{t}{K}\) and \(F\leq K\binom{\left(1-\frac{M}{N}\right)K+n}{n}\).
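To make the last bound concrete, the snippet below evaluates \(F\leq K\binom{(1-M/N)K+n}{n}\) for a few illustrative parameter choices (the values are arbitrary examples, not taken from the paper) and compares it with the exponential subpacketization \(\binom{K}{t}\) of the classical Maddah-Ali-Niesen scheme at the same cache ratio \(M/N=t/K\).

```python
# Numeric check of the polynomial subpacketization bound quoted above.
from math import comb

def subpacketization_bound(K, t, n):
    # F <= K * C((1 - M/N)*K + n, n), with M/N = t/K so (1 - M/N)*K = K - t
    assert 1 <= t <= K - 2 and 0 <= n <= t
    return K * comb(K - t + n, n)

for K, t, n in [(20, 5, 0), (20, 5, 2), (50, 10, 3)]:
    print(K, t, n, subpacketization_bound(K, t, n), comb(K, t))
# e.g. K=20, t=5, n=2: bound 2720 versus C(20,5) = 15504 for the classic scheme
```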
Farhad Shirani , Siddharth Garg , Elza Erkip Subjects : Information Theory (cs.IT)
Permutations of correlated sequences of random variables appear naturally in a variety of applications such as graph matching and asynchronous communications. In this paper, the asymptotic statistical behavior of such permuted sequences is studied. It is assumed that a collection of random vectors is produced based on an arbitrary joint distribution, and the vectors undergo a permutation operation. The joint typicality of the resulting permuted vectors with respect to the original distribution is investigated. As an initial step, permutations of pairs of correlated random vectors are considered. It is shown that the probability of joint typicality of the permuted vectors depends only on the number and length of the disjoint cycles of the permutation. Consequently, it suffices to study typicality for a class of permutations called 'standard permutations', for which upper bounds on the probability of joint typicality are derived. The notion of standard permutations is extended to a class of permutation vectors called 'Bell permutation vectors'. By investigating Bell permutation vectors, upper bounds on the probability of joint typicality of permutations of arbitrary collections of random sequences are derived.
Sanjeev Gurugopinath , Sami Muhaidat , Yousof Al-Hammadi , Paschalis C. Sofotasios , Octavia A. Dobre Subjects : Information Theory (cs.IT) ; Signal Processing (eess.SP)
The proliferation of connected vehicles, along with the high demand for rich multimedia services, poses key challenges for the emerging 5G-enabled vehicular networks. These challenges include, but are not limited to, high spectral efficiency and low latency requirements. Recently, the integration of cache-enabled networks with non-orthogonal multiple access (NOMA) has been shown to reduce the content delivery time and traffic congestion in wireless networks. Accordingly, in this article, we envisage cache-aided NOMA as a technology facilitator for 5G-enabled vehicular networks. In particular, we present a cache-aided NOMA architecture, which can address some of the aforementioned challenges in these networks. We demonstrate that the spectral efficiency gain of the proposed architecture, which depends largely on the cached contents, significantly exceeds that of conventional vehicular networks. Finally, we provide deep insights into the challenges, opportunities, and future research trends that will enable the practical realization of cache-aided NOMA in 5G-enabled vehicular networks.
Comments: 13 pages, 13 figures : Information Theory (cs.IT)
We consider a fully-loaded ground wireless network supporting unmanned aerial vehicle (UAV) transmission services.
To enable the overload transmissions to a ground user (GU) and a UAV, two transmission schemes are employed, namely non-orthogonal multiple access (NOMA) and relaying, depending on whether or not the GU and the UAV are served simultaneously. Under the assumption of the system operating with infinite-blocklength (IBL) codes, the IBL throughputs of both the GU and the UAV are derived under the two schemes. More importantly, we also consider the scenario in which data packets are transmitted via finite-blocklength (FBL) codes, i.e., data transmission to both the UAV and the GU is performed under low-latency and high-reliability constraints. In this setting, the FBL throughputs are characterized, again considering the two schemes of NOMA and relaying. Following the IBL and FBL throughput characterizations, optimal resource allocation designs are subsequently proposed to maximize the UAV throughput while guaranteeing the throughput of the cellular user. Moreover, we prove that the relaying scheme is able to provide transmission service to the UAV while improving the GU's performance, and that the relaying scheme potentially offers a higher throughput to the UAV in the FBL regime than in the IBL regime. On the other hand, the NOMA scheme provides a higher UAV throughput (than relaying) by slightly sacrificing the GU's performance.
Comments: 6 pages; submitted to IEEE International Symposium on Information Theory 2020 : Information Theory (cs.IT)
This paper studies the tradeoff between privacy and utility in a single-trial multi-terminal guessing (estimation) framework, using a system model that is inspired by index coding. There are \(n\) independent discrete sources at a data curator. There are \(m\) legitimate users and one adversary, each with some side information about the sources. The data curator broadcasts a distorted function of the sources to the legitimate users, which is also overheard by the adversary. In terms of utility, each legitimate user wishes to perfectly reconstruct some of the unknown sources and attain a certain gain in estimation correctness for the remaining unknown sources. In terms of privacy, the data curator wishes to minimize the maximal leakage: the worst-case guessing gain of the adversary in estimating any target function of its unknown sources after receiving the broadcast data. Given the system settings, we derive fundamental performance lower bounds on the maximal leakage to the adversary, which are inspired by the notion of the confusion graph and by performance bounds for the index coding problem. We also detail a greedy privacy-enhancing mechanism, which is inspired by the agglomerative clustering algorithms in the information bottleneck and privacy funnel problems.
Comments: 6 pages : Information Theory (cs.IT)
The Gray and Wyner lossy source coding for a simple network, for sources that generate a tuple of jointly Gaussian random variables (RVs) \(X_1 : \Omega \rightarrow \mathbb{R}^{p_1}\) and \(X_2 : \Omega \rightarrow \mathbb{R}^{p_2}\), with respect to square-error distortion at the two decoders, is re-examined using (1) Hotelling's geometric approach to Gaussian RVs, the canonical variable form, and (2) van Putten's and van Schuppen's parametrization of joint distributions \(\mathbf{P}_{X_1, X_2, W}\) by Gaussian RVs \(W : \Omega \rightarrow \mathbb{R}^n\) which make \((X_1,X_2)\) conditionally independent, together with the weak stochastic realization of \((X_1, X_2)\). Item (2) is used to parametrize the lossy rate region of the Gray and Wyner source coding problem for joint decoding with mean-square error distortions \(\mathbf{E}\big\{\|X_i-\hat{X}_i\|_{\mathbb{R}^{p_i}}^2\big\} \leq \Delta_i \in [0,\infty]\), \(i=1,2\), by the covariance matrix of the RV \(W\). From this it follows that Wyner's common information \(C_W(X_1,X_2)\) (information definition) is achieved by a \(W\) with identity covariance matrix, while a formula for Wyner's lossy common information (operational definition) is derived, given by \(C_{WL}(X_1,X_2)=C_W(X_1,X_2)=\frac{1}{2}\sum_{j=1}^n \ln\left(\frac{1+d_j}{1-d_j}\right)\) for the distortion region \(0\leq \Delta_1 \leq \sum_{j=1}^n(1-d_j)\), \(0\leq \Delta_2 \leq \sum_{j=1}^n(1-d_j)\), where \(1 > d_1 \geq d_2 \geq \ldots \geq d_n > 0\) are the canonical correlation coefficients in \((0,1)\) computed from the canonical variable form of the tuple \((X_1, X_2)\). The methods are of fundamental importance to other problems of multi-user communication, where conditional independence is imposed as a constraint.
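Since the quoted formula reduces Wyner's (lossy) common information to the canonical correlation coefficients, it is easy to evaluate numerically: the \(d_j\) are the singular values of \(\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}\). The sketch below, using an arbitrary example covariance, illustrates that computation only; it is not the paper's derivation.

```python
# Evaluate C_W = 1/2 * sum_j ln((1 + d_j)/(1 - d_j)) from an example
# joint Gaussian covariance, via canonical correlation coefficients d_j.
import numpy as np

def inv_sqrt(S):
    # S^{-1/2} for a symmetric positive-definite matrix, via eigendecomposition
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

S11, S22 = np.eye(2), np.eye(2)               # per-source covariances (example)
S12 = np.array([[0.6, 0.1], [0.0, 0.3]])      # cross-covariance (example)

# Canonical correlation coefficients d_1 >= ... >= d_n, all in (0, 1) here
d = np.linalg.svd(inv_sqrt(S11) @ S12 @ inv_sqrt(S22), compute_uv=False)
C_W = 0.5 * np.sum(np.log((1 + d) / (1 - d)))  # formula quoted above
print(d, C_W)
```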
3D Beamforming in Reconfigurable Intelligent Surfaces-assisted Wireless Communication Networks
S. Mohammad Razavizadeh , Tommy Svensson Subjects : Information Theory (cs.IT)
Reconfigurable Intelligent Surfaces (RISs) or Intelligent Reflecting Surfaces (IRSs) are metasurfaces that can be deployed in various places in wireless environments to make these environments controllable and reconfigurable. In this paper, we investigate the problem of using 3D beamforming in RIS-empowered wireless networks and propose a new scheme that provides more degrees of freedom in designing and deploying RIS-based networks. In the proposed scheme, a base station (BS) equipped with a full-dimensional antenna array optimizes its radiation pattern in three-dimensional space to maximize the received signal-to-noise ratio at a target user. We also study the effect of the angle of incidence of the signal received by the RIS on its reflecting properties and find a relation between this angle and the tilt and elevation angles of the BS antenna array. The user receives the signal over a reflected path from the RIS as well as over a direct path from the BS, both of which depend on the BS antenna array's tilt and elevation angles. These angles, together with the phase shifts of the RIS elements, are jointly optimized numerically. Our simulation results show that RIS-assisted 3D beamforming with optimized phase shifts and radiation angles can considerably improve the performance of wireless networks.
In this paper, we present an efficiently encodable and decodable code construction that is capable of correcting a burst of deletions of length at most \(k\). The redundancy of this code is \(\log n + \frac{k(k+1)}{2}\log\log n + c_k\) for some constant \(c_k\) that depends only on \(k\), and the construction is thus scaling-optimal. The code can be split into two main components. First, we impose a constraint that allows us to locate the burst of deletions up to an interval of size roughly \(\log n\). Then, with the knowledge of the approximate location of the burst, we use several shifted Varshamov-Tenengolts codes to correct the burst of deletions, which requires only a small amount of redundancy since the location is already known up to an interval of small size. Finally, we show how to efficiently encode and decode the code.
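The shifted Varshamov-Tenengolts codes mentioned above build on the classical VT construction. As a self-contained illustration of the core mechanism (not the paper's burst-of-\(k\) decoder), here is Levenshtein's decoder for a single deletion in \(VT_a(n)=\{x : \sum_i i\,x_i \equiv a \pmod{n+1}\}\):

```python
def vt_decode_one_deletion(y, n, a):
    """Reinsert the single deleted bit into y (a list of n-1 bits)."""
    s = sum((i + 1) * bit for i, bit in enumerate(y)) % (n + 1)
    D = (a - s) % (n + 1)          # syndrome deficiency
    w = sum(y)                     # weight of the received word
    if D <= w:
        # A 0 was deleted, with exactly D ones to its right.
        ones_seen = 0
        for pos in range(len(y), -1, -1):
            if ones_seen == D:
                return y[:pos] + [0] + y[pos:]
            if pos > 0 and y[pos - 1] == 1:
                ones_seen += 1
    # Otherwise a 1 was deleted, with exactly D - w - 1 zeros to its left.
    zeros_seen = 0
    for pos in range(len(y) + 1):
        if zeros_seen == D - w - 1:
            return y[:pos] + [1] + y[pos:]
        if pos < len(y) and y[pos] == 0:
            zeros_seen += 1

x = [1, 0, 1, 1, 0, 1, 0, 1]       # sum of i*x_i = 1+3+4+6+8 = 22, so x is in VT_4(8)
assert vt_decode_one_deletion(x[:3] + x[4:], n=8, a=22 % 9) == x
```

Burst constructions of this kind essentially run such decoders on interleaved subsequences (a burst of length \(k\) deletes one bit from each of \(k\) interleaved substrings) once the burst has been localized to a small interval.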
Abbas Khalili , Shahram Shahsavari , Mohammad A. (Amir) Khojastepour , Elza Erkip Subjects : Information Theory (cs.IT)
Directional transmission patterns (a.k.a. narrow beams) are the key to wireless communications in millimeter wave (mmWave) frequency bands, which suffer from high path loss and severe shadowing. In addition, the propagation channel at mmWave frequencies incorporates only a small number of spatial clusters, requiring a procedure to align the corresponding narrow beams with the angle of departure (AoD) of the channel clusters. The objective of this procedure, called beam alignment (BA), is to increase the beamforming gain for subsequent data communication. Several prior studies consider optimizing the BA procedure to achieve various objectives, such as reducing the BA overhead, increasing throughput, and reducing power consumption. While these studies mostly provide optimized BA schemes for scenarios with a single active user, there are often multiple active users in practical networks. Consequently, it is more efficient in terms of BA overhead and delay to design multi-user BA schemes which can perform beam management for multiple users collectively. This paper considers a class of multi-user BA schemes in which the base station performs a one-shot scan of the angular domain to simultaneously localize multiple users. The objective is to minimize the average expected width of the remaining uncertainty regions (URs) on the AoDs after receiving the users' feedback. Fundamental bounds on the optimal performance are analyzed using information-theoretic tools. Furthermore, a beam design optimization problem is formulated, and a practical BA scheme is proposed which provides significant gains compared to the beam sweeping used in the 5G standard.
Comments: Submitted for publication : Information Theory (cs.IT) ; Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
We investigate the framework of privacy amplification by iteration, recently proposed by Feldman et al., through an information-theoretic lens. We demonstrate that the differential privacy guarantees of iterative mappings can be determined by a direct application of contraction coefficients derived from strong data processing inequalities for \(f\)-divergences. In particular, by generalizing Dobrushin's contraction coefficient for the total variation distance to an \(f\)-divergence known as the \(E_{\gamma}\)-divergence, we derive tighter bounds on the differential privacy parameters of the projected noisy stochastic gradient descent algorithm with hidden intermediate updates.
Mohammadamin Baniasadi , Ertem Tuncel Subjects : Information Theory (cs.IT)
The minimum power required to achieve a distortion-noise profile, i.e., a function indicating the maximum allowed distortion value for each noise level, is studied for the transmission of Gaussian sources over Gaussian channels in a regime of bandwidth approaching zero. A simple but instrumental lower bound on the minimum required power for a given profile is presented. For an upper bound, a dirty-paper-based coding scheme is proposed and its power-distortion tradeoff is analyzed. Finally, the upper and lower bounds on the minimum power are analyzed and compared for specific distortion-noise profiles, namely rational profiles of order one and two.
Channels' Confirmation and Predictions' Confirmation: from the Medical Test to the Raven Paradox
Comments: 12 tables, 7 figures : Artificial Intelligence (cs.AI) ; Information Theory (cs.IT); Logic (math.LO)
After long arguments between positivism and falsificationism, the verification of universal hypotheses was replaced with the confirmation of uncertain major premises. Unfortunately, Hempel discovered the Raven Paradox (RP). Then, Carnap used the logical probability increment as the confirmation measure. So far, many confirmation measures have been proposed. Among them, measure F, proposed by Kemeny and Oppenheim, possesses the symmetries and asymmetries proposed by Eells and Fitelson, the monotonicity proposed by Greco et al., and the normalizing property suggested by many researchers. Based on the semantic information theory, a measure b* similar to F is derived from the medical test. Like the likelihood ratio, b* and F can only indicate the quality of channels or testing means, rather than the quality of probability predictions. Moreover, it is still not easy to use b*, F, or another measure to clarify the RP. For this reason, measure c*, similar to the correct rate, is derived. The measure c* has the simple form (a-c)/max(a, c); it supports the Nicod Criterion and undermines the Equivalence Condition, and hence can be used to eliminate the RP. Some examples are provided to show why it is difficult to use any of the popular confirmation measures to eliminate the RP. Measures F, b*, and c* indicate that the existence of fewer counterexamples is more essential than the existence of more positive examples, and hence are compatible with Popper's falsification thought.
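To make the quantities concrete, here is a toy calculator for two of the measures discussed above, for a 2x2 table with counts a = #(e, h), b = #(e, not-h), c = #(not-e, h), d = #(not-e, not-h); this conventional cell naming is an assumption and may differ from the paper's. F is Kemeny and Oppenheim's measure; c* uses the closed form quoted in the abstract.

```python
def kemeny_oppenheim_F(a, b, c, d):
    p_e_h = a / (a + c)            # P(e | h)
    p_e_nh = b / (b + d)           # P(e | not-h)
    return (p_e_h - p_e_nh) / (p_e_h + p_e_nh)

def c_star(a, c):
    return (a - c) / max(a, c)     # closed form quoted in the abstract

# Medical-test style example: 90 true positives, 5 false positives,
# 10 false negatives, 95 true negatives.
print(kemeny_oppenheim_F(90, 5, 10, 95))  # ~0.89: strong confirmation
print(c_star(90, 10))                     # ~0.89: few counterexamples
```

Note how c* depends only on the positive examples a and the counterexamples c, which is exactly the "fewer counterexamples matter more" behavior the abstract highlights.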
Comments: Submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible : Signal Processing (eess.SP) ; Information Theory (cs.IT)
We describe three new high-performance receivers suitable for symbol detection in large-scale and overloaded multidimensional wireless communication systems, which are designed under the usual assumption of perfect channel state information (CSI) at the receiver. Under this common assumption, the maximum likelihood (ML) detection problem is first formulated as an \(\ell_0\)-norm-based optimization problem, and subsequently transformed, using a recently proposed fractional programming (FP) technique referred to as the quadratic transform (QT), in which the \(\ell_0\)-norm is not relaxed into an \(\ell_1\)-norm, in three distinct ways so as to offer different performance-complexity trade-offs. The first algorithm, dubbed the discreteness-aware penalized zero-forcing (DAPZF) receiver, aims to outperform the state of the art (SotA) while minimizing the computational complexity. The second solution, referred to as the discreteness-aware probabilistic soft-quantization detector (DAPSD), is designed to improve recovery performance via a soft-quantization method, and is found via numerical simulations to achieve the best performance of the three. Finally, the third scheme, named the discreteness-aware generalized eigenvalue detector (DAGED), not only offers a trade-off between performance and complexity compared to the others, but also differs from them by not requiring a penalization parameter to be optimized offline. Simulation results demonstrate that all three methods outperform state-of-the-art receivers, with the DAPZF exhibiting significantly lower complexity.
Vipul Gupta , Dominic Carrano , Yaoqing Yang , Vaishaal Shankar , Thomas Courtade , Kannan Ramchandran Subjects : Distributed, Parallel, and Cluster Computing (cs.DC) ; Information Theory (cs.IT)
Inexpensive cloud services, such as serverless computing, are often vulnerable to straggling nodes that increase the end-to-end latency of distributed computation.
We propose and implement simple yet principled approaches for straggler mitigation in serverless systems for matrix multiplication and evaluate them on several common applications from machine learning and high-performance computing. The proposed schemes are inspired by error-correcting codes and employ parallel encoding and decoding over the data stored in the cloud using serverless workers. This creates a fully distributed computing framework without using a master node to conduct encoding or decoding, which removes the computation, communication, and storage bottleneck at the master. On the theory side, we establish that our proposed scheme is asymptotically optimal in terms of decoding time and provide a lower bound on the number of stragglers it can tolerate with high probability. Through extensive experiments, we show that our scheme outperforms existing schemes, such as speculative execution and other coding-theoretic methods, by at least 25%.
Comments: 5 pages, 2 figures : Signal Processing (eess.SP) ; Information Theory (cs.IT)
In this paper, we propose a novel orthogonal frequency division multiplexing with index modulation (OFDM-IM) scheme, which we call Q-ary multi-mode OFDM-IM (Q-MM-OFDM-IM). In the proposed scheme, Q disjoint M-ary constellations are used repeatedly on each subcarrier, and a maximum-distance separable code is applied to the indices of these constellations to achieve the highest number of index symbols. A low-complexity subcarrier-wise detection is shown to be possible for the proposed scheme. The spectral efficiency (SE) and the error rate performance of the proposed scheme are further analyzed. It is shown that the proposed scheme exhibits a very flexible structure that is capable of encompassing conventional OFDM as a special case. It is also shown that the proposed scheme is capable of considerably outperforming other OFDM-IM schemes and conventional OFDM in terms of error and SE performance while preserving a low-complexity structure.
Comments: QQ and ZZ contributed equally to the work. Invited review paper for IEEE Signal Processing Magazine Special Issue on non-convex optimization for signal processing and machine learning. This article contains 26 pages with 11 figures : Machine Learning (cs.LG) ; Information Theory (cs.IT); Image and Video Processing (eess.IV); Optimization and Control (math.OC); Machine Learning (stat.ML)
The problem of finding the sparsest vector (direction) in a low-dimensional subspace can be considered a homogeneous variant of the sparse recovery problem, and it finds applications in robust subspace recovery, dictionary learning, sparse blind deconvolution, and many other problems in signal processing and machine learning. However, in contrast to the classical sparse recovery problem, the most natural formulation for finding the sparsest vector in a subspace is usually nonconvex. In this paper, we overview recent advances in global nonconvex optimization theory for solving this problem, ranging from geometric analysis of its optimization landscape, to efficient optimization algorithms for solving the associated nonconvex optimization problem, to applications in machine intelligence, representation learning, and imaging sciences. Finally, we conclude this review by pointing out several interesting open problems for future research.
Quantitative Aspects of Programming Languages and Systems over the past \(2^4\) years and beyond
Comments: In Proceedings QAPL 2019, arXiv:2001.06163 Journal-ref: EPTCS 312, 2020, pp. 1-19 : Programming Languages (cs.PL) ; Information Theory (cs.IT); Logic in Computer Science (cs.LO)
Quantitative aspects of computation are related to the use of both physical and mathematical quantities, including time, performance metrics, probability, and measures for reliability and security. They are essential in characterizing the behaviour of many critical systems and in estimating their properties. Hence, they need to be integrated both at the level of system modeling and within the verification methodologies and tools. Over the last two decades, a variety of theoretical achievements and automated techniques have contributed to making quantitative modeling and verification mainstream in the research community. In the same period, they represented the central theme of the series of workshops entitled Quantitative Aspects of Programming Languages and Systems (QAPL), born in 2001. The aim of this survey is to revisit these achievements and results from the standpoint of QAPL and its community.
Comments: 25 pages : Machine Learning (cs.LG) ; Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
Overwhelming theoretical and empirical evidence shows that mildly overparametrized neural networks (those with more connections than the size of the training data) are often able to memorize the training data with \(100\%\) accuracy. This was rigorously proved for networks with sigmoid activation functions and, very recently, for ReLU activations. Addressing a 1988 open question of Baum, we prove that this phenomenon holds for general multilayered perceptrons, i.e., neural networks with threshold activation functions, or with any mix of threshold and ReLU activations. Our construction is probabilistic and exploits sparsity.
Comments: 8 pages, 5 figures, concise version submitted to 2020 IEEE International Symposium on Information Theory (ISIT) : Cryptography and Security (cs.CR) ; Information Theory (cs.IT)
Recent works have considered the ability of transmitter Alice to communicate reliably with receiver Bob without being detected by warden Willie. These works generally assume a standard discrete-time model. But the assumption of a discrete-time model in standard communication scenarios is often predicated on its equivalence to a continuous-time model, which has not been established for the covert communications problem. Here, we consider the continuous-time channel directly and study whether efficient covert communication can still be achieved. We assume that an uninformed jammer is present to assist Alice, and we consider additive white Gaussian noise (AWGN) channels between all parties. For a channel with approximate bandwidth W, we establish constructions such that O(WT) information bits can be transmitted covertly and reliably from Alice to Bob in T seconds for two separate scenarios: 1) when the path loss between Alice and Willie is known; and 2) when the path loss between Alice and Willie is unknown.
Minimum Symbol-Error Probability Symbol-Level Precoding with Intelligent Reflecting Surface
Mingjie Shao , Qiang Li , Wing-Kin Ma Subjects : Signal Processing (eess.SP) ; Information Theory (cs.IT)
Recently, the use of intelligent reflecting surfaces (IRSs) has gained considerable attention in wireless communications. By intelligently adjusting its passive reflection angles, an IRS is able to assist the base station (BS) in extending coverage and improving spectral efficiency.
This paper considers a joint symbol-level precoding (SLP) and IRS reflection design to minimize the symbol-error probability (SEP) of the intended users in an IRS-aided multiuser MISO downlink. We formulate SEP minimization problems that pursue uniformly good performance for all users, for both QAM and PSK constellations. The resulting problem is non-convex, and we resort to alternating minimization to obtain a stationary solution. Simulation results demonstrate that, with the aid of the IRS, our proposed design indeed enhances the bit-error rate performance. In particular, the performance improvement is significant when the number of IRS elements is large.
Jaya Prakash Champati , Ramana R. Avula , Tobias J. Oechtering , James Gross Subjects : Information Retrieval (cs.IR) ; Information Theory (cs.IT)
There is growing interest in analyzing the freshness of data in networked systems. Age of Information (AoI) has emerged as a popular metric to quantify this freshness at a given destination, and there has been a significant research effort in optimizing this metric in communication and networking systems under different settings. In contrast to previous works, we are interested in a fundamental question: what is the minimum achievable AoI in any single-server-single-source queuing system for a given service-time distribution? To address this question, we study the problem of optimizing AoI under service preemptions. Our main result is a characterization of the minimum achievable average peak AoI (PAoI). We obtain this result by showing that a fixed-threshold policy is optimal within the set of all randomized-threshold causal policies. We use the characterization to provide necessary and sufficient conditions on the service-time distributions under which preemptions are beneficial.
Andrés Abeliuk , Zhishen Huang , Emilio Ferrara , Kristina Lerman Subjects : Computers and Society (cs.CY) ; Information Theory (cs.IT)
Applications from finance to epidemiology and cyber-security require accurate forecasts of dynamic phenomena, which are often only partially observed. We demonstrate that a system's predictability degrades as a function of temporal sampling, regardless of the adopted forecasting model. We quantify the loss of predictability due to sampling and show that it cannot be recovered by using external signals. We validate the generality of our theoretical findings in real-world partially observed systems representing infectious disease outbreaks, online discussions, and software development projects. On a variety of prediction tasks (forecasting new infections, the popularity of topics in online discussions, or interest in cryptocurrency projects), predictability irrecoverably decays as a function of sampling, unveiling fundamental predictability limits in partially observed systems.
A Classification-Based Approach to Semi-Supervised Clustering with Pairwise Constraints
Learning to See Analogies: A Connectionist Exploration
OIAD: One-for-all Image Anomaly Detection with Disentanglement Learning
Regularized Cycle Consistent Generative Adversarial Network for Anomaly Detection
FlexiBO: Cost-Aware Multi-Objective Optimization of Deep Neural Networks
Scalable Bid Landscape Forecasting in Real-time Bidding
K-NN active learning under local smoothness assumption
TopRank+: A Refinement of TopRank Algorithm
Multimodal Deep Unfolding for Guided Image Super-Resolution
A Comprehensive Study on Temporal Modeling for Online Action Detection
Variational Dropout Sparsification for Particle Identification speed-up
Pruning Neural Belief Propagation Decoders
Evaluating Weakly Supervised Object Localization Methods Right
Explicit agreement extremes for a \(2 \times 2\) table with given marginals
Estimating Latent Demand of Shared Mobility through Censored Gaussian Processes
Lyceum: An efficient and scalable ecosystem for robot learning
Breast lesion segmentation in ultrasound images with limited annotated data
Neural Style Difference Transfer and Its Application to Font Generation
Counter-example Guided Learning of Bounds on Environment Behavior
Recommending Themes for Ad Creative Design via Visual-Linguistic Representations
A deep network for sinogram and CT image reconstruction
The Incentives that Shape Behaviour
Pairwise Discriminative Neural PLDA for Speaker Verification
Deep Image Clustering with Tensor Kernels and Unsupervised Companion Objectives
An Efficient Framework for Automated Screening of Clinically Significant Macular Edema
CNN-Based Real-Time Parameter Tuning for Optimizing Denoising Filter Performance
CNN-based InSAR Coherence Classification
CNN-based InSAR Denoising and Coherence Metric
Nested-Wasserstein Self-Imitation Learning for Sequence Generation
Memristor Hardware-Friendly Reinforcement Learning
SQuINTing at VQA Models: Interrogating VQA Models with Sub-Questions
A multimodal deep learning approach for named entity recognition from social media
SQLFlow: A Bridge between SQL and Machine Learning
Dual Stochastic Natural Gradient Descent
Big-Data Science in Porous Materials: Materials Genomics and Machine Learning
Adaptive Stochastic Optimization
Teaching Software Engineering for AI-Enabled Systems
How do Data Science Workers Collaborate? Roles, Workflows, and Tools
Efficient Neural Architecture Search: A Broad Version
Deep Collaborative Embedding for information cascade prediction
Stacked Adversarial Network for Zero-Shot Sketch based Image Retrieval
Adaptive Parameterization for Neural Dialogue Generation
Machine Learning in Quantitative PET Imaging
Harmonic Convolutional Networks based on Discrete Cosine Transform
Privacy Amplification of Iterative Algorithms via Contraction Coefficients
Unsupervised Deep Features for Privacy Image Classification
Information Theory
On the Capacity of the Oversampled Wiener Phase Noise Channel
Pruning Neural Belief Propagation Decoders
Efficient Angle-Domain Processing for FDD-based Cell-free Massive MIMO Systems
Coordinated Multi-Point Transmission: A Poisson-Delaunay Triangulation Based Approach
Enabling Panoramic Full-Angle Reflection via Aerial Intelligent Reflecting Surface
Secure Index Coding with Security Constraints on Receivers
Systematic Maximum Sum Rank Codes
Enabling optimal access and error correction for the repair of Reed-Solomon codes
Physical Layer Security: Authentication, Integrity and Confidentiality
Computational Code-Based Single-Server Private Information Retrieval
Uncertainty of Reconstructing Multiple Messages from Uniform-Tandem-Duplication Noise
Coded Caching with Polynomial Subpacketization
On the Joint Typicality of Permutations of Sequences of Random Variables
Non-Orthogonal Multiple Access with Wireless Caching for 5G-Enabled Vehicular Networks
Privacy-Utility Tradeoff in a Guessing Framework Inspired by Index Coding
Optimal Codes Correcting a Burst of Deletions of Variable Length
On Optimal Multi-user Beam Alignment in Millimeter Wave Wireless Systems
Privacy Amplification of Iterative Algorithms via Contraction Coefficients
Robust Gaussian Joint Source-Channel Coding Under the Near-Zero Bandwidth Regime
Discreteness-aware Receivers for Overloaded MIMO Systems
Serverless Straggler Mitigation using Local Error-Correcting Codes
Q-ary Multi-Mode OFDM with Index Modulation
Finding the Sparsest Vectors in a Subspace: Theory, Algorithms, and Applications
Memory capacity of neural networks with threshold and ReLU activations
Covert Communication in Continuous-Time Systems
On the Minimum Achievable Age of Information for General Service-Time Distributions
Predictability limit of partially observed systems