arXiv Paper Daily: Thu, 15 Jun 2017

Category: Programming Tools · Published: 7 years ago

Neural and Evolutionary Computing

A Fast Foveated Fully Convolutional Network Model for Human Peripheral Vision

Lex Fridman , Benedikt Jenik , Shaiyan Keshvari , Bryan Reimer , Christoph Zetzsche , Ruth Rosenholtz

Comments: NIPS 2017 submission

Subjects : Neural and Evolutionary Computing (cs.NE) ; Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)

Visualizing the information available to a human observer in a single glance

at an image provides a powerful tool for evaluating models of full-field human

vision. The hard part is human-realistic visualization of the periphery.

Degradation of information with distance from fixation is far more complex than

a mere reduction of acuity that might be mimicked using blur with a standard

deviation that linearly increases with eccentricity. Rather,

behaviorally-validated models hypothesize that peripheral vision measures a

large number of local texture statistics in pooling regions that overlap, grow

with eccentricity, and tile the visual field. We propose a “foveated” variant

of a fully convolutional network that approximates one such model. Our approach

achieves a 21,000 fold reduction in average running time (from 4.2 hours to 0.7

seconds per image), and statistically similar results to the

behaviorally-validated model.
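
For contrast, the acuity-only baseline that the abstract calls insufficient can be sketched in a few lines. This is a minimal illustration, not the authors' model: the function names, the `slope` parameter, and measuring eccentricity in pixels rather than degrees of visual angle are all assumptions made here. The last lines recompute the speed-up from the two running times quoted above.

```python
import math


def blur_sigma_map(width, height, fixation, slope=0.1):
    """Return a height x width grid of blur standard deviations.

    Each sigma grows linearly with eccentricity, i.e. the pixel's distance
    from the fixation point. `slope` is a hypothetical scaling constant.
    """
    fx, fy = fixation
    return [
        [slope * math.hypot(x - fx, y - fy) for x in range(width)]
        for y in range(height)
    ]


def speedup(before_seconds, after_seconds):
    """Ratio of the old running time to the new one."""
    return before_seconds / after_seconds


# Tiny 5x5 example with fixation at the center pixel.
sigmas = blur_sigma_map(5, 5, fixation=(2, 2), slope=0.5)

# Running times quoted in the abstract: 4.2 hours versus 0.7 seconds.
reported_speedup = speedup(4.2 * 3600, 0.7)
```

With the quoted times the ratio comes out near 21,600, consistent with the rounded figure of 21,000 in the abstract.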

MATIC: Adaptation and In-situ Canaries for Energy-Efficient Neural Network Acceleration

Sung Kim , Patrick Howe , Thierry Moreau , Armin Alaghi , Luis Ceze , Visvesh Sathe Subjects : Neural and Evolutionary Computing (cs.NE)

We present MATIC (Memory-Adaptive Training and In-situ Canaries), a voltage

scaling methodology that addresses the SRAM efficiency bottleneck in DNN

accelerators. To overscale DNN weight SRAMs, MATIC combines specific

characteristics of destructive SRAM reads with the error resilience of neural

networks in a memory-adaptive training process. PVT-related voltage margins are

eliminated using bit-cells from synaptic weights as in-situ canaries to track

runtime environmental variation. Demonstrated on a low-power DNN accelerator

fabricated in 65nm CMOS, MATIC enables up to 3.3x total energy reduction, or

18.6x application error reduction.

Neural Models for Key Phrase Detection and Question Generation

Sandeep Subramanian , Tong Wang , Xingdi Yuan , Adam Trischler Subjects : Computation and Language (cs.CL) ; Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

We propose several neural models arranged in a two-stage framework to tackle

question generation from documents. First, we estimate the probability of

“interesting” answers in a document using a neural model trained on a

question-answering corpus. The predicted key phrases are then used as answers

to condition a sequence-to-sequence question generation model. Empirically, our

neural key phrase detection models significantly outperform an entity-tagging

baseline system. We demonstrate that the question generator formulates good

quality natural language questions from extracted key phrases. The resulting

questions and answers can be used to assess reading comprehension in

educational settings.

Transfer entropy-based feedback improves performance in artificial neural networks

Sebastian Herzog , Christian Tetzlaff , Florentin Wörgötter Subjects : Learning (cs.LG) ; Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)

The structure of the majority of modern deep neural networks is characterized

by unidirectional feed-forward connectivity across a very large number of

layers. By contrast, the architecture of the cortex of vertebrates contains

fewer hierarchical levels but many recurrent and feedback connections. Here we

show that a small, few-layer artificial neural network that employs feedback

will reach top level performance on a standard benchmark task, otherwise only

obtained by large feed-forward structures. To achieve this we use feed-forward

transfer entropy between neurons to structure feedback connectivity. Transfer

entropy can here intuitively be understood as a measure for the relevance of

certain pathways in the network, which are then amplified by feedback. Feedback

may therefore be key for high network performance in small brain-like

architectures.
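
Transfer entropy from a source to a target measures how much the source's past improves prediction of the target's next value beyond what the target's own past already provides. The sketch below uses the standard discrete definition with a history length of one and a plug-in (counting) estimator. The abstract does not say how the authors estimate it on neuron activations, so the function name, the discretized inputs, and the history length are assumptions.

```python
from collections import Counter
from math import log2


def transfer_entropy(source, target):
    """Plug-in transfer entropy source -> target in bits (history length 1).

    Both arguments are equal-length sequences of discrete symbols.
    """
    # Joint counts of (x_{t+1}, x_t, y_t).
    triples = Counter(zip(target[1:], target[:-1], source[:-1]))
    # Counts needed for the two conditional probabilities.
    next_and_prev = Counter(zip(target[1:], target[:-1]))
    prev_and_source = Counter(zip(target[:-1], source[:-1]))
    prev_only = Counter(target[:-1])

    total = sum(triples.values())
    te = 0.0
    for (x_next, x_prev, y_prev), count in triples.items():
        p_joint = count / total
        p_given_both = count / prev_and_source[(x_prev, y_prev)]
        p_given_own = next_and_prev[(x_next, x_prev)] / prev_only[x_prev]
        te += p_joint * log2(p_given_both / p_given_own)
    return te


# The target copies the source with a one-step delay, so the source's past
# fully determines the target's next value.
source = [0, 1, 1, 0, 1, 0, 0, 1]
target = [0] + source[:-1]
te_copy = transfer_entropy(source, target)

# A constant source carries no information about the target.
te_constant = transfer_entropy([0] * 8, target)
```

A pathway with high transfer entropy in this sense is one whose upstream activity is informative about downstream activity, which matches the abstract's reading of it as a relevance measure.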

Adversarially Regularized Autoencoders for Generating Discrete Structures

Junbo (Jake) Zhao , Yoon Kim , Kelly Zhang , Alexander M. Rush , Yann LeCun Subjects : Learning (cs.LG) ; Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)

Generative adversarial networks are an effective approach for learning rich

latent representations of continuous data, but have proven difficult to apply

directly to discrete structured data, such as text sequences or discretized

images. Ideally we could encode discrete structures in a continuous code space

to avoid this problem, but it is difficult to learn an appropriate

general-purpose encoder. In this work, we consider a simple approach for

handling these two challenges jointly, employing a discrete structure

autoencoder with a code space regularized by generative adversarial training.

The model learns a smooth regularized code space while still being able to

model the underlying data, and can be used as a discrete GAN with the ability

to generate coherent discrete outputs from continuous samples. We demonstrate

empirically how key properties of the data are captured in the model’s latent

space, and evaluate the model itself on the tasks of discrete image generation,

text generation, and semi-supervised learning.

Identifying Spatial Relations in Images using Convolutional Neural Networks

Mandar Haldekar , Ashwinkumar Ganesan , Tim Oates Subjects : Artificial Intelligence (cs.AI) ; Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

Traditional approaches to building a large scale knowledge graph have usually

relied on extracting information (entities, their properties, and relations

between them) from unstructured text (e.g., DBpedia). Recent advances in

Convolutional Neural Networks (CNN) allow us to shift our focus to learning

entities and relations from images, as they build robust models that require

little or no pre-processing of the images. In this paper, we present an

approach to identify and extract spatial relations (e.g., The girl is standing

behind the table) from images using CNNs. Our research addresses two specific

challenges: providing insight into how spatial relations are learned by the

network and which parts of the image are used to predict these relations. We

use the pre-trained network VGGNet to extract features from an image and train

a Multi-layer Perceptron (MLP) on a set of synthetic images and the SUN09

dataset to extract spatial relations. The MLP predicts spatial relations

without a bounding box around the objects or the space in the image depicting

the relation. To understand how the spatial relations are represented in the

network, a heatmap is overlaid on the image to show the regions that are

deemed important by the network. Also, we analyze the MLP to show the

relationship between the activation of consistent groups of nodes and the

prediction of a spatial relation. We show how the loss of these groups affects

the network’s ability to identify relations.

Computer Vision and Pattern Recognition

Learning without Prejudice: Avoiding Bias in Webly-Supervised Action Recognition

Christian Rupprecht , Ansh Kapil , Nan Liu , Lamberto Ballan , Federico Tombari

Comments: Submitted to CVIU SI: Computer Vision and the Web

Subjects : Computer Vision and Pattern Recognition (cs.CV)

Webly-supervised learning has recently emerged as an alternative paradigm to

traditional supervised learning based on large-scale datasets with manual

annotations. The key idea is that models such as CNNs can be learned from the

noisy visual data available on the web. In this work we aim to exploit web data

for video understanding tasks such as action recognition and detection. One of

the main problems in webly-supervised learning is cleaning the noisy labeled

data from the web. The state-of-the-art paradigm relies on training a first

classifier on noisy data that is then used to clean the remaining dataset. Our

key insight is that this procedure biases the second classifier towards samples

that the first one understands. Here we train two independent CNNs, an RGB

network on web images and video frames and a second network using temporal

information from optical flow. We show that training the networks independently

is vastly superior to selecting the frames for the flow classifier by using our

RGB network. Moreover, we show benefits in enriching the training set with

different data sources from heterogeneous public web databases. We demonstrate

that our framework outperforms all other webly-supervised methods on two public

benchmarks, UCF-101 and Thumos’14.

Learning local shape descriptors with view-based convolutional networks

Haibin Huang , Evangelos Kalogerakis , Siddhartha Chaudhuri , Duygu Ceylan , Vladimir G. Kim , Ersin Yumer Subjects : Computer Vision and Pattern Recognition (cs.CV) ; Graphics (cs.GR)

We present a new local descriptor for 3D shapes, directly applicable to a

wide range of shape analysis problems such as point correspondences, semantic

segmentation, affordance prediction, and shape-to-scan matching. Our key

insight is that the neighborhood of a point on a shape is effectively captured

at multiple scales by a succession of progressively zoomed out views, taken

from carefully selected camera positions. We propose a convolutional neural

network that uses local views around a point to embed it to a multidimensional

descriptor space, such that geometrically and semantically similar points are

close to one another. To train our network, we leverage two extremely large

sources of data. First, since our network processes 2D images, we repurpose

architectures pre-trained on massive image datasets. Second, we automatically

generate a synthetic dense correspondence dataset by part-aware, non-rigid

alignment of a massive collection of 3D models. As a result of these design

choices, our view-based architecture effectively encodes multi-scale local

context and fine-grained surface detail. We demonstrate through several

experiments that our learned local descriptors are more general and robust

compared to state-of-the-art alternatives, and have a variety of applications

without any additional fine-tuning.

Large-Scale YouTube-8M Video Understanding with Deep Neural Networks

Manuk Akopyan (1), Eshsou Khashba (1) ((1) Institute for System Programming)

Comments: 6 pages, 5 figures, 3 tables

Subjects : Computer Vision and Pattern Recognition (cs.CV)

The video classification problem has been studied for many years. The success of

Convolutional Neural Networks (CNN) in image recognition tasks gives a powerful

incentive for researchers to create more advanced video classification

approaches. Because video has temporal content, Long Short-Term Memory (LSTM)

networks become a handy tool for modeling long-term temporal clues. Both

approaches need a large dataset of input data. In this paper, three models are

provided to address video classification using the recently announced YouTube-8M

large-scale dataset. The first model is based on a frame pooling approach. The

two other models are based on LSTM networks. A Mixture of Experts intermediate

layer is used in the third model, allowing model capacity to increase without

dramatically increasing computations. A set of experiments for handling

imbalanced training data has been conducted.
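
To make the capacity argument concrete, here is a minimal sketch of a Mixture of Experts output for a single label. It assumes a softmax gate over logistic experts that all read the same feature vector. The abstract does not describe the layer's exact form, so the function names and the weight layout are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(weights, features):
    return sum(w * f for w, f in zip(weights, features))


def moe_predict(features, gate_weights, expert_weights):
    """Probability of one label as a gate-weighted average of expert outputs.

    gate_weights and expert_weights each hold one weight vector per expert.
    """
    gates = softmax([dot(w, features) for w in gate_weights])
    experts = [sigmoid(dot(w, features)) for w in expert_weights]
    return sum(g * p for g, p in zip(gates, experts))


features = [1.0, 2.0]

# With all-zero weights the gate is uniform and every expert outputs 0.5.
neutral = moe_predict(features, [[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])

# The gate strongly prefers the first expert, which votes for the label.
confident = moe_predict(features, [[5.0, 0.0], [-5.0, 0.0]], [[3.0, 0.0], [-3.0, 0.0]])
```

Adding an expert adds one gate vector and one expert vector per label, so the parameter count grows while the per-example cost stays a handful of dot products.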

SalProp: Salient object proposals via aggregated edge cues

Prerana Mukherjee , Brejesh Lall , Sarvaswa Tandon

Comments: 5 pages, 4 figures, accepted at ICIP 2017

Subjects : Computer Vision and Pattern Recognition (cs.CV)

In this paper, we propose a novel object proposal generation scheme by

formulating a graph-based salient edge classification framework that utilizes

the edge context. In the proposed method, we construct a Bayesian probabilistic

edge map to assign a saliency value to the edgelets by exploiting low level

edge features. A Conditional Random Field is then learned to effectively

combine these features for edge classification with object/non-object label. We

propose an objectness score for the generated windows by analyzing the salient

edge density inside the bounding box. Extensive experiments on the PASCAL VOC 2007

dataset demonstrate that the proposed method gives competitive performance

against 10 popular generic object detection techniques while using fewer

proposals.
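
The idea of scoring a window by the salient edge density inside it can be illustrated as follows. This is a sketch under stated assumptions, not the paper's scoring function: it treats the edge map as a binary grid and normalizes the count of salient edge pixels by the box area. The name `edge_density` and the `(x0, y0, x1, y1)` box convention, with exclusive upper bounds, are choices made here.

```python
def edge_density(salient_edges, box):
    """Fraction of pixels inside `box` that are marked as salient edges.

    salient_edges is a list of rows of 0/1 values; box is (x0, y0, x1, y1)
    with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    if area <= 0:
        raise ValueError("box must have positive area")
    inside = sum(
        salient_edges[y][x] for y in range(y0, y1) for x in range(x0, x1)
    )
    return inside / area


# Toy edge map with a small cluster of salient edges near the top.
edges = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

tight_score = edge_density(edges, (1, 0, 3, 2))  # box around the cluster
loose_score = edge_density(edges, (0, 0, 4, 3))  # whole image
```

A tight window around the edge cluster scores higher than a loose one, which is the behavior an objectness score needs in order to rank proposals.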

(ν)-net: Deep Learning for Generalized Biventricular Cardiac Mass and Function Parameters

Hinrich B Winther , Christian Hundt , Bertil Schmidt , Christoph Czerner , Johann Bauersachs , Frank Wacker , Jens Vogel-Claussen Subjects : Computer Vision and Pattern Recognition (cs.CV) ; Machine Learning (stat.ML)

Background: Cardiac MRI derived biventricular mass and function parameters,

such as end-systolic volume (ESV), end-diastolic volume (EDV), ejection

fraction (EF), stroke volume (SV), and ventricular mass (VM) are clinically

well established. Image segmentation can be challenging and time-consuming, due

to the complex anatomy of the human heart.

Objectives: This study introduces (ν)-net (/nju:nɛt/), a deep

learning approach allowing for fully-automated high quality segmentation of

right (RV) and left ventricular (LV) endocardium and epicardium for extraction

of cardiac function parameters.

Methods: A set consisting of 253 manually segmented cases has been used to

train a deep neural network. Subsequently, the network has been evaluated on 4

different multicenter data sets with a total of over 1000 cases.

Results: For LV EF the intraclass correlation coefficient (ICC) is 98, 95,

and 80 % (95 %), and for RV EF 96, and 87 % (80 %) on the respective data sets

(human expert ICCs reported in parentheses). The LV VM ICC is 95, and 94 % (84

%), and the RV VM ICC is 83, and 83 % (54 %). This study proposes a simple

adjustment procedure, allowing for the adaptation to distinct segmentation

philosophies. (ν)-net exhibits state-of-the-art performance in terms of Dice

coefficient.

Conclusions: Biventricular mass and function parameters can be determined

reliably in high quality by applying a deep neural network for cardiac MRI

segmentation, especially in the anatomically complex right ventricle. Adaptation

to individual segmentation styles by applying a simple adjustment procedure is

viable, allowing for the processing of novel data without time-consuming

additional training.
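
The function parameters named in the Background follow from the segmented volumes by simple arithmetic: stroke volume is EDV minus ESV, and ejection fraction is stroke volume divided by EDV. The sketch below also includes the Dice coefficient used to score segmentations. The function names and the representation of a mask as a set of voxel indices are assumptions made for illustration, and the example volumes are made-up numbers, not data from the study.

```python
def stroke_volume(edv_ml, esv_ml):
    """Stroke volume (SV) in ml: end-diastolic minus end-systolic volume."""
    return edv_ml - esv_ml


def ejection_fraction(edv_ml, esv_ml):
    """Ejection fraction (EF) in percent: SV as a share of EDV."""
    if edv_ml <= 0:
        raise ValueError("EDV must be positive")
    return 100.0 * stroke_volume(edv_ml, esv_ml) / edv_ml


def dice(mask_a, mask_b):
    """Dice coefficient between two masks given as sets of voxel indices."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # two empty masks agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


# Illustrative values only.
sv = stroke_volume(120.0, 50.0)
ef = ejection_fraction(120.0, 50.0)
overlap = dice({1, 2, 3}, {2, 3, 4})
```

Because EF is a ratio of two segmented volumes, small boundary errors in either phase propagate into it, which is one reason agreement is reported per parameter rather than only as overlap.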

Alignment Distances on Systems of Bags

Alexander Sagel , Martin Kleinsteuber Subjects : Computer Vision and Pattern Recognition (cs.CV)

Recent research in image and video recognition indicates that many visual

processes can be thought of as being generated by a time-varying generative

model. A nearby descriptive model for visual processes is thus a statistical

distribution that varies over time. Specifically, modeling visual processes as

streams of histograms generated by a kernelized linear dynamic system turns out

to be efficient. We refer to such a model as a System of Bags. In this work, we

investigate Systems of Bags with special emphasis on dynamic scenes and dynamic

textures. Parameters of linear dynamic systems suffer from ambiguities. In

order to cope with these ambiguities in the kernelized setting, we develop a

kernelized version of the alignment distance. For its computation, we use a

Jacobi-type method and prove its convergence to a set of critical points. We

employ it as a dissimilarity measure on Systems of Bags. As such, it

outperforms other known dissimilarity measures for kernelized linear dynamic

systems, in particular the Martin Distance and the Maximum Singular Value

Distance, in every tested classification setting. A considerable margin can be

observed in settings where classification is performed with respect to an

abstract mean of video sets. For this scenario, the presented approach can

outperform state-of-the-art techniques, such as Dynamic Fractal Spectrum or

Orthogonal Tensor Dictionary Learning.

Shape-Color Differential Moment Invariants under Affine Transformations

Hanlin Mo , Shirui Li , You Hao , Hua Li

Comments: 13 pages, 4 figures

Subjects : Computer Vision and Pattern Recognition (cs.CV)

We propose the general construction formula of shape-color primitives by

using partial differentials of each color channel in this paper. By using all

kinds of shape-color primitives, shape-color differential moment invariants can

be constructed very easily, which are invariant to the shape affine and color

affine transforms. 50 instances of SCDMIs are obtained finally. In experiments,

several commonly used color descriptors and SCDMIs are used in image

classification and retrieval of color images, respectively. By comparing the

experimental results, we find that SCDMIs get better results.

Zoom-in-Net: Deep Mining Lesions for Diabetic Retinopathy Detection

Zhe Wang , Yanxin Yin , Jianping Shi , Wei Fang , Hongsheng Li , Xiaogang Wang

Comments: accepted by MICCAI 2017

Subjects : Computer Vision and Pattern Recognition (cs.CV)

We propose a convolutional neural network-based algorithm for simultaneously

diagnosing diabetic retinopathy and highlighting suspicious regions. Our

contributions are twofold: 1) a network termed Zoom-in-Net which mimics the

zoom-in process of a clinician to examine the retinal images. Trained with only

image-level supervisions, Zoom-in-Net can generate attention maps which

highlight suspicious regions, and predicts the disease level accurately based

on both the whole image and its high resolution suspicious patches. 2) Only

four bounding boxes generated from the automatically learned attention maps are

enough to cover 80% of the lesions labeled by an experienced ophthalmologist,

which shows good localization ability of the attention maps. By clustering

features at high response locations on the attention maps, we discover

meaningful clusters which contain potential lesions in diabetic retinopathy.

Experiments show that our algorithm outperforms the state-of-the-art methods on

two datasets, EyePACS and Messidor.

Hierarchical Gaussian Descriptors with Application to Person Re-Identification

Tetsu Matsukawa , Takahiro Okabe , Einoshin Suzuki , Yoichi Sato

Comments: 14 pages, 12 figures, 4 tables

Subjects : Computer Vision and Pattern Recognition (cs.CV)

Describing the color and textural information of a person image is one of the

most crucial aspects of person re-identification (re-id). In this paper, we

present novel meta-descriptors based on a hierarchical distribution of pixel

features. Although hierarchical covariance descriptors have been successfully

applied to image classification, the mean information of pixel features, which

is absent from the covariance, tends to be the major discriminative information

for person re-id. To solve this problem, we describe a local region in an image

via hierarchical Gaussian distribution in which both means and covariances are

included in their parameters. More specifically, the region is modeled as a set

of multiple Gaussian distributions in which each Gaussian represents the

appearance of a local patch. The characteristics of the set of Gaussians are

again described by another Gaussian distribution. In both steps, we embed the

parameters of the Gaussian into a point of Symmetric Positive Definite (SPD)

matrix manifold. By changing the way to handle mean information in this

embedding, we develop two hierarchical Gaussian descriptors. Additionally, we

develop feature norm normalization methods with the ability to alleviate the

biased trends that exist on the descriptors. The experimental results conducted

on five public datasets indicate that the proposed descriptors achieve

remarkably high performance on person re-id.

Teaching Compositionality to CNNs

Austin Stone , Huayan Wang , Michael Stark , Yi Liu , D. Scott Phoenix , Dileep George

Comments: Preprint appearing in CVPR 2017

Subjects : Computer Vision and Pattern Recognition (cs.CV) ; Learning (cs.LG)

Convolutional neural networks (CNNs) have shown great success in computer

vision, approaching human-level performance when trained for specific tasks via

application-specific loss functions. In this paper, we propose a method for

augmenting and training CNNs so that their learned features are compositional.

It encourages networks to form representations that disentangle objects from

their surroundings and from each other, thereby promoting better

generalization. Our method is agnostic to the specific details of the

underlying CNN to which it is applied and can in principle be used with any

CNN. As we show in our experiments, the learned representations lead to feature

activations that are more localized and improve performance over

non-compositional baselines in object recognition tasks.

Photo-realistic Facial Texture Transfer

Parneet Kaur , Hang Zhang , Kristin J. Dana Subjects : Computer Vision and Pattern Recognition (cs.CV)

Style transfer methods have achieved significant success in recent years with

the use of convolutional neural networks. However, many of these methods

concentrate on artistic style transfer with few constraints on the output image

appearance. We address the challenging problem of transferring face texture

from a style face image to a content face image in a photorealistic manner

without changing the identity of the original content image. Our framework for

face texture transfer (FaceTex) augments the prior work of MRF-CNN with a novel

facial semantic regularization that incorporates a face prior regularization

smoothly suppressing the changes around facial meso-structures (e.g., eyes, nose

and mouth) and a facial structure loss function which implicitly preserves the

facial structure so that face texture can be transferred without changing the

original identity. We demonstrate results on face images and compare our

approach with recent state-of-the-art methods. Our results demonstrate superior

texture transfer because of the ability to maintain the identity of the

original face image.

Accurate Pulmonary Nodule Detection in Computed Tomography Images Using Deep Convolutional Neural Networks

Jia Ding , Aoxue Li , Zhiqiang Hu , Liwei Wang

Comments: MICCAI 2017 accepted

Subjects : Computer Vision and Pattern Recognition (cs.CV)

Early detection of pulmonary cancer is the most promising way to enhance a

patient’s chance for survival. Accurate pulmonary nodule detection in computed

tomography (CT) images is a crucial step in diagnosing pulmonary cancer. In

this paper, inspired by the successful use of deep convolutional neural

networks (DCNNs) in natural image recognition, we propose a novel pulmonary

nodule detection approach based on DCNNs. We first introduce a deconvolutional

structure to Faster Region-based Convolutional Neural Network (Faster R-CNN)

for candidate detection on axial slices. Then, a three-dimensional DCNN is

presented for the subsequent false positive reduction. Experimental results of

the LUng Nodule Analysis 2016 (LUNA16) Challenge demonstrate the superior

detection performance of the proposed approach on nodule detection (average

FROC-score of 0.893, ranking 1st among all submitted results), which

outperforms the best result on the leaderboard of the LUNA16 Challenge (average

FROC-score of 0.864).

Saliency detection by aggregating complementary background template with optimization framework

Chenxing Xia , Hanling Zhang , Xiuju Gao

Comments: 28 pages, 10 figures

Subjects : Computer Vision and Pattern Recognition (cs.CV)

This paper proposes an unsupervised bottom-up saliency detection approach by

aggregating complementary background template with refinement. Feature vectors

are extracted from each superpixel to cover regional color, contrast and

texture information. By using these features, a coarse detection for salient

region is realized based on background template achieved by different

combinations of boundary regions instead of only treating four boundaries as

background. Then, by ranking the relevance of the image nodes with foreground

cues extracted from the former saliency map, we obtain an improved result.

Finally, a smoothing operation is utilized to refine the foreground-based

saliency map to improve the contrast between salient and non-salient regions

until a close to binary saliency map is reached. Experimental results show that

the proposed algorithm generates more accurate saliency maps and performs

favorably against the state-of-the-art saliency detection methods on four

publicly available datasets.

When Image Denoising Meets High-Level Vision Tasks: A Deep Learning Approach

Ding Liu , Bihan Wen , Xianming Liu , Thomas S. Huang Subjects : Computer Vision and Pattern Recognition (cs.CV)

Conventionally, image denoising and high-level vision tasks are handled

separately in computer vision, and their connection is fragile. In this paper,

we cope with the two jointly and explore the mutual influence between them,

with the focus on two questions, namely (1) how image denoising can help

solve high-level vision problems, and (2) how the semantic information from

high-level vision tasks can be used to guide image denoising. We propose a deep

convolutional neural network solution that cascades two modules for image

denoising and various high level tasks, respectively, and propose the use of

joint loss for training to allow the semantic information to flow into the

optimization of the denoising network via back-propagation. Our experimental

results demonstrate that the proposed architecture not only yields superior

image denoising results preserving fine details, but also overcomes the

performance degradation of different high-level vision tasks, e.g., image

classification and semantic segmentation, due to image noise or artifacts

caused by conventional denoising approaches such as over-smoothing.

AFIF4: Deep Gender Classification based on AdaBoost-based Fusion of Isolated Facial Features and Foggy Faces

Mahmoud Afifi , Abdelrahman Abdelhamed

Comments: submitted to Journal of Visual Communication and Image Representation. 26 pages, 7 figures, 7 tables

Subjects : Computer Vision and Pattern Recognition (cs.CV)

Gender classification aims at recognizing a person’s gender. Despite the high

accuracy achieved by state-of-the-art methods for this task, there is still room

for improvement in generalized and unrestricted datasets. In this paper, we

advocate a new strategy inspired by the behavior of humans in gender

recognition. Instead of dealing with the face image as a sole feature, we rely

on the combination of isolated facial features and a holistic feature which we

call the foggy face. Then, we use these features to train deep convolutional

neural networks followed by an AdaBoost-based score fusion to infer the final

gender class. We evaluate our method on four challenging datasets to

demonstrate its efficacy in achieving better or on-par accuracy with

state-of-the-art methods. In addition, we present a new face dataset that

intensifies the challenges of occluded faces and illumination changes, which we

believe to be a much-needed resource for gender classification research.

Action Search: Learning to Search for Human Activities in Untrimmed Videos

Humam Alwassel , Fabian Caba Heilbron , Bernard Ghanem (King Abdullah University of Science and Technology (KAUST))

Comments: 9 pages, 9 figures

Subjects : Computer Vision and Pattern Recognition (cs.CV)

Traditional approaches for action detection use trimmed data to learn

sophisticated action detector models. Although these methods have achieved

great success at detecting human actions, we argue that a great deal of information is

discarded when ignoring the process through which this trimmed data is

obtained. In this paper, we propose Action Search, a novel approach that mimics

the way people annotate activities in video sequences. Using a Recurrent Neural

Network, Action Search can efficiently explore a video and determine the time

boundaries during which an action occurs. Experiments on the THUMOS14 dataset

reveal that our model is not only able to explore the video efficiently but

also accurately find human activities, outperforming state-of-the-art methods.

von Mises-Fisher Mixture Model-based Deep learning: Application to Face Verification

Md. Abul Hasnat , Julien Bohné , Jonathan Milgram , Stéphane Gentric , Liming Chen

Comments: Under review

Subjects : Computer Vision and Pattern Recognition (cs.CV)

A number of pattern recognition tasks, e.g., face verification, can be boiled

down to classification or clustering of unit length directional feature vectors

whose distance can be simply computed by their angle. In this paper, we propose

the von Mises-Fisher (vMF) mixture model as the theoretical foundation for an

effective deep-learning of such directional features and derive a novel vMF

Mixture Loss and its corresponding vMF deep features. The proposed vMF feature

learning achieves discriminative learning, i.e., compacting the instances of

the same class while increasing the distance of instances from different

classes, and subsumes a number of loss functions or deep learning practice,

e.g., normalization. The experiments carried out on face verification using 4

different challenging face datasets, i.e., LFW, IJB-A, YouTube Faces and CACD,

show the effectiveness of the proposed approach, which displays very

competitive and state-of-the-art results.
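
For reference, the von Mises-Fisher density and the corresponding mixture can be written as below. This is the standard textbook form of the distribution, not the paper's loss function. The symbols (dimension d, mean directions μ_j, concentrations κ_j, mixture weights π_j, number of components M) are notation chosen here, and I_v denotes the modified Bessel function of the first kind.

```latex
% vMF density on the unit sphere S^{d-1}, with unit-length mean direction \mu
% and concentration \kappa > 0
f(x \mid \mu, \kappa) = C_d(\kappa)\, \exp\!\left(\kappa\, \mu^{\top} x\right),
\qquad
C_d(\kappa) = \frac{\kappa^{d/2 - 1}}{(2\pi)^{d/2}\, I_{d/2 - 1}(\kappa)}

% Mixture of M vMF components with weights \pi_j \ge 0, \sum_j \pi_j = 1
p(x) = \sum_{j=1}^{M} \pi_j\, f(x \mid \mu_j, \kappa_j)
```

Since x and μ are unit vectors, μᵀx is the cosine of the angle between them, so a larger κ concentrates a component more tightly around its mean direction. This is the sense in which angular distance between features carries the class information.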

The "something something" video database for learning and evaluating visual common sense

Raghav Goyal , Samira Kahou , Vincent Michalski , Joanna Materzyńska , Susanne Westphal , Heuna Kim , Valentin Haenel , Ingo Fruend , Peter Yianilos , Moritz Mueller-Freitag , Florian Hoppe , Christian Thurau , Ingo Bax , Roland Memisevic Subjects : Computer Vision and Pattern Recognition (cs.CV)

Neural networks trained on datasets such as ImageNet have led to major

advances in visual object classification. One obstacle that prevents networks

from reasoning more deeply about complex scenes and situations, and from

integrating visual knowledge with natural language, like humans do, is their

lack of common sense knowledge about the physical world. Videos, unlike still

images, contain a wealth of detailed information about the physical world.

However, most labelled video datasets represent high-level concepts rather than

detailed physical aspects about actions and scenes. In this work, we describe

our ongoing collection of the “something-something” database of video

prediction tasks whose solutions require a common sense understanding of the

depicted situation. The database currently contains more than 100,000 videos

across 174 classes, which are defined as caption-templates. We also describe

the challenges in crowd-sourcing this data at scale.

Online Convolutional Dictionary Learning for Multimodal Imaging

Kevin Degraux , Ulugbek S. Kamilov , Petros T. Boufounos , Dehong Liu Subjects : Computer Vision and Pattern Recognition (cs.CV)

Computational imaging methods that can exploit multiple modalities have the

potential to enhance the capabilities of traditional sensing systems. In this

paper, we propose a new method that reconstructs multimodal images from their

linear measurements by exploiting redundancies across different modalities. Our

method combines a convolutional group-sparse representation of images with

total variation (TV) regularization for high-quality multimodal imaging. We

develop an online algorithm that enables the unsupervised learning of

convolutional dictionaries on large-scale datasets that are typical in such

applications. We illustrate the benefit of our approach in the context of joint

intensity-depth imaging.

Automatic Localization of Deep Stimulation Electrodes Using Trajectory-based Segmentation Approach

Roger Gomez Nieto , Andres Marino Alvarez Meza , Julian David Echeverry Correa , Alvaro Angel Orozco Gutierrez

Comments: 13 pages, 5 figures

Subjects : Computer Vision and Pattern Recognition (cs.CV) ; Neurons and Cognition (q-bio.NC)

Parkinson’s disease (PD) is a degenerative condition of the nervous system,

which manifests itself primarily as muscle stiffness, hypokinesia,

bradykinesia, and tremor. In patients suffering from advanced stages of PD,

Deep Brain Stimulation neurosurgery (DBS) is the best alternative to medical

treatment, especially when they become tolerant to the drugs. This surgery

produces neuronal activity as a result of electrical stimulation, whose

quantification is known as Volume of Tissue Activated (VTA). To correctly locate

the VTA in the cerebral volume space, one should know the exact

location of the tip of the DBS electrodes, as well as their spatial projection.

In this paper, we automatically locate DBS electrodes using a threshold-based

medical imaging segmentation methodology, determining the optimal value of this

threshold adaptively. The proposed methodology allows the localization of DBS

electrodes in Computed Tomography (CT) images, with high noise tolerance, using

automatic threshold detection methods.

Deep Learning Methods for Efficient Large Scale Video Labeling

Miha Skalic , Marcin Pekalski , Xingguo E. Pan

Comments: 7 pages, 5 tables, 1 figure

Subjects : Machine Learning (stat.ML) ; Computer Vision and Pattern Recognition (cs.CV); Learning (cs.LG)

We present a solution to the “Google Cloud and YouTube-8M Video Understanding

Challenge” that ranked in 5th place. The proposed model is an ensemble of three

model families, two frame level and one video level. The training was performed

on an augmented dataset, with cross validation.