Workshops
Tree-to-tree Neural Networks for Program Translation
Program translation is an important tool for migrating legacy code in one language into an ecosystem built in a different language. In this work, we are the first to consider employing deep neural networks to tackle this problem. We observe that program translation is a modular procedure, in which a sub-tree of the source tree is translated into the corresponding target sub-tree at each step. To capture this intuition, we design a tree-to-tree neural network as an encoder-decoder architecture to translate a source tree into a target one. Meanwhile, we develop an attention mechanism for the tree-to-tree model, so that when the decoder expands one non-terminal in the target tree, the attention mechanism locates the corresponding sub-tree in the source tree to guide the expansion of the decoder. We evaluate the program translation capability of our tree-to-tree model against several state-of-the-art approaches. Our approach consistently outperforms other neural translation baselines by a margin of up to 15 points, and improves upon previous state-of-the-art program translation approaches by a margin of 20 points on the translation of real-world projects.
Faster Neural Networks Straight from JPEG
Training convolutional neural networks (CNNs) directly from RGB pixels has enjoyed overwhelming empirical success. But can more performance be squeezed out of networks by using different input representations? In this paper we propose and explore a simple idea: train CNNs directly on the blockwise discrete cosine transform (DCT) coefficients computed and available in the middle of the JPEG codec. We modify libjpeg to produce DCT coefficients directly, modify a ResNet-50 network to accommodate the differently sized and strided input, and evaluate performance on ImageNet. We find networks that are both faster and more accurate, as well as networks with about the same accuracy but 1.77x faster than ResNet-50.
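To make the representation concrete, the following minimal sketch recomputes blockwise 8x8 DCT coefficients from pixels with scipy; the paper instead reads them directly out of a modified libjpeg, so this is illustrative only.

```python
# Blockwise 8x8 DCT coefficients as a CNN input representation.
# Recomputed from pixels here; the paper obtains them from the JPEG codec.
import numpy as np
from scipy.fftpack import dct

def blockwise_dct(channel, block=8):
    """Return coefficients of shape (H/block, W/block, block*block)."""
    h, w = (d - d % block for d in channel.shape)
    blocks = channel[:h, :w].reshape(h // block, block, w // block, block)
    blocks = blocks.transpose(0, 2, 1, 3)                # (H/8, W/8, 8, 8)
    coeffs = dct(dct(blocks, axis=-1, norm='ortho'), axis=-2, norm='ortho')
    return coeffs.reshape(h // block, w // block, block * block)

gray = np.random.rand(224, 224).astype(np.float32)       # stand-in channel
print(blockwise_dct(gray).shape)  # (28, 28, 64): 8x smaller spatially
```

A network consuming this input needs a stem adapted to the 8x-reduced spatial resolution and the 64 "channels" per block, which is the kind of ResNet-50 modification the abstract describes.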
Differentiable Neural Network Architecture Search
The successes of deep learning in recent years have been fueled by the development of innovative new neural network architectures. However, the design of a neural network architecture remains a difficult problem, requiring significant human expertise as well as computational resources. In this paper, we propose a method for transforming a discrete neural network architecture space into a continuous and differentiable form, which enables the use of standard gradient-based optimization techniques for this problem and allows us to learn the architecture and the parameters simultaneously. We evaluate our method on the Udacity steering angle prediction dataset, and show that it can discover architectures with similar or better predictive accuracy but significantly fewer parameters and smaller computational cost.
TransNets for Review Generation
In recommender systems, review generation is increasingly becoming an important task. Previously proposed neural models concatenate the user and item information to each timestep of an RNN to steer it towards generating their specific review. In this paper, we show how a student-teacher-like architecture can be used to rapidly build a review generator with a low perplexity score.
Towards Mixed-initiative generation of multi-channel sequential structure
We argue for the benefit of designing deep generative models through a mixed-initiative, co-creative combination of deep learning algorithms and human specifications, focusing on multi-channel music composition. Sequence models have shown convincing results in domains such as summarization and translation; however, longer-term structure remains a major challenge. Given lengthy inputs and outputs, deep generative systems still lack reliable representations of beginnings, middles, and ends, which are standard aspects of creating content in domains such as music composition. This paper aims to contribute a framework for mixed-initiative generation approaches that let humans both supply and control some of these aspects in deep generative models for music, and present a case study of Counterpoint by Convolutional Neural Network (CoCoNet).
Capturing Human Category Representations by Sampling in Deep Feature Spaces
Understanding how people represent categories is a core problem in cognitive science, with the flexibility of human learning remaining a gold standard to which modern artificial intelligence and machine learning aspire. Decades of psychological research have yielded a variety of formal theories of categories, yet validating these theories with naturalistic stimuli remains a challenge. The problem is that human category representations cannot be directly observed and running informative experiments with naturalistic stimuli such as images requires having a workable representation of these stimuli. Deep neural networks have recently been successful in a range of computer vision tasks and provide a way to represent the features of images. In this paper, we introduce a method for estimating the structure of human categories that draws on ideas from both cognitive science and machine learning, blending human-based algorithms with state-of-the-art deep representation learners. We provide qualitative and quantitative results as a proof of concept for the feasibility of the method. Samples drawn from human distributions rival the quality of current state-of-the-art generative models and outperform alternative methods for estimating the structure of human categories.
Generative Modeling for Protein Structures
We apply deep generative models to the task of generating protein structures, toward application in protein design. We encode protein structures in terms of pairwise distances between alpha-carbons on the protein backbone, which by construction eliminates the need for the generative model to learn translational and rotational symmetries. We then introduce a convex formulation of corruption-robust 3-D structure recovery to fold protein structures from generated pairwise distance matrices, and solve this optimization problem using the Alternating Direction Method of Multipliers. Finally, we demonstrate the effectiveness of our models by predicting completions of corrupted protein structures and show that in many cases the models infer biochemically viable solutions.
Feature Incay for Representation Regularization
Softmax-based loss is widely used in deep learning for multi-class classification, where each class is represented by a weight vector and each sample is represented as a feature vector. Different from traditional learning algorithms where features are pre-defined and only weight vectors are tunable through training, feature vectors are also tunable as representation learning in deep learning. Thus we investigate how to improve the classification performance by better adjusting the features. One main observation is that elongating the feature norm of both correctly-classified and mis-classified feature vectors improves learning: (1) increasing the feature norm of correctly-classified examples induces smaller training loss; (2) increasing the feature norm of mis-classified examples can upweight the contribution from hard examples. Accordingly, we propose feature incay to regularize representation learning by encouraging larger feature norm. In contrast to weight decay, which shrinks the weight norm, feature incay is proposed to stretch the feature norm. Extensive empirical results on MNIST, CIFAR10, CIFAR100 and LFW demonstrate the effectiveness of feature incay.
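As a rough illustration of the idea (not necessarily the paper's exact regularizer), a penalty that decays with the squared feature norm rewards training for stretching features; a PyTorch-style sketch:

```python
# Hedged sketch of a feature-incay-style regularizer: penalize small
# feature norms so optimization is encouraged to enlarge them. The exact
# functional form in the paper may differ from this reciprocal-norm penalty.
import torch
import torch.nn.functional as F

def feature_incay_loss(features, logits, targets, lam=1e-2, eps=1e-8):
    ce = F.cross_entropy(logits, targets)
    incay = (1.0 / (features.pow(2).sum(dim=1) + eps)).mean()
    return ce + lam * incay  # larger ||feature|| -> smaller penalty

feats = torch.randn(32, 128, requires_grad=True)   # penultimate features
logits = feats @ torch.randn(128, 10)              # toy class weight vectors
loss = feature_incay_loss(feats, logits, torch.randint(0, 10, (32,)))
loss.backward()
```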
Selecting the Best in GANs Family: a Post Selection Inference Framework
Semi-Supervised Learning With GANs: Revisiting Manifold Regularization
GANs are powerful generative models that are able to model the manifold of natural images. We leverage this property to perform manifold regularization by approximating the Laplacian norm using a Monte Carlo approximation that is easily computed with the GAN. When incorporated into the feature-matching GAN of Salimans et al. (2016), we achieve state-of-the-art results for GAN-based semi-supervised learning on the CIFAR-10 dataset, with a method that is significantly easier to implement than competing methods.
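A minimal sketch of such a Monte Carlo estimator, with `generator` and `classifier` standing in for any trained GAN generator and semi-supervised classifier; the finite-difference form below is one plausible way to approximate the Laplacian norm along the generator's manifold:

```python
# Stochastic manifold-smoothness penalty: sample latent codes, perturb
# them slightly, and penalize how much the classifier's output changes
# between the two nearby generated points.
import torch

def manifold_penalty(classifier, generator, batch=16, z_dim=100, delta=1e-2):
    z = torch.randn(batch, z_dim)
    f1 = classifier(generator(z))
    f2 = classifier(generator(z + delta * torch.randn_like(z)))
    return ((f1 - f2).pow(2).sum(dim=1) / delta**2).mean()
```

The penalty is added to the classifier's loss, encouraging it to vary smoothly along directions that stay on the image manifold.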
Towards Specification-Directed Program Repair
Several recent papers have developed neural network program synthesizers by using supervised learning over large sets of randomly generated programs and specifications.
In this paper, we investigate the feasibility of this approach for program repair: given a specification and a candidate program assumed similar to a correct program for the specification, synthesize a program which meets the specification.
Working in the Karel domain with a dataset of synthetically generated candidates, we develop models that can make effective use of the extra information in candidate programs, achieving 40% error reduction compared to a baseline program synthesis model that only receives the specification and not a candidate program.
Stochastic Gradient Langevin Dynamics that Exploit Neural Network Structure
Tractable approximate Bayesian inference for deep neural networks remains challenging. Stochastic Gradient Langevin Dynamics (SGLD) offers a tractable approximation to the gold standard of Hamiltonian Monte Carlo. We improve on existing methods for SGLD by incorporating a recently-developed tractable approximation of the Fisher information, known as K-FAC, as a preconditioner.
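For orientation, a generic preconditioned SGLD step looks as follows; the contribution described above is to supply a K-FAC approximation of the Fisher information as the preconditioner P, whereas this sketch takes P as given and omits the correction term that arises when P depends on the parameters:

```python
# One preconditioned SGLD step: preconditioned gradient ascent on the
# log-posterior plus Gaussian noise with matching covariance eps * P.
import numpy as np

def sgld_step(theta, grad_log_post, P, eps, rng):
    noise = rng.multivariate_normal(np.zeros(len(theta)), eps * P)
    return theta + 0.5 * eps * P @ grad_log_post + noise

rng = np.random.default_rng(0)
P = np.eye(3)  # placeholder; K-FAC would supply a structured approximation
theta = sgld_step(np.zeros(3), np.ones(3), P, eps=1e-3, rng=rng)
```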
LSTM Iteration Networks: An Exploration of Differentiable Path Finding
Our motivation is to scale value iteration to larger environments without a huge increase in computational demand, and fix the problems inherent to Value Iteration Networks (VIN) such as spatial invariance and unstable optimization. We show that VINs, and even extended VINs which improve some of their shortcomings, are empirically difficult to optimize, exhibiting instability during training and sensitivity to random seeds. Furthermore, we explore whether the inductive biases utilized in past differentiable path planning modules are even necessary, and demonstrate that the requirement that the architectures strictly resemble path-finding algorithms does not hold. We do this by designing a new path planning architecture called the LSTM-Iteration Network, which achieves better performance than VINs in metrics such as success rate, training stability, and sensitivity to random seeds.
Deep Convolutional Malware Classifiers Can Learn from Raw Executables and Labels Only
We propose and evaluate a simple convolutional deep neural network architecture detecting malicious \emph{Portable Executables} (Windows executable files) by learning from their raw sequences of bytes and labels only, that is, without any domain-specific feature extraction or preprocessing. On a dataset of 20 million \emph{unpacked} half-megabyte Portable Executables, such an end-to-end approach achieves performance almost on par with Avast's traditional machine learning pipeline based on handcrafted features.
Extending the Framework of Equilibrium Propagation to General Dynamics
The biological plausibility of the backpropagation algorithm has long been doubted by neuroscientists. Two major reasons are that neurons would need to send two different types of signal in the forward and backward phases, and that pairs of neurons would need to communicate through symmetric bidirectional connections. We present a simple two-phase learning procedure for fixed point recurrent networks that addresses both these issues. In our model, neurons perform leaky integration and synaptic weights are updated through a local mechanism. Our learning method extends the framework of Equilibrium Propagation to general dynamics, relaxing the requirement of an energy function. As a consequence of this generalization, the algorithm does not compute the true gradient of the objective function, but rather approximates it at a precision which is proven to be directly related to the degree of symmetry of the feedforward and feedback weights. We show experimentally that the intrinsic properties of the system lead to alignment of the feedforward and feedback weights, and that our algorithm optimizes the objective function.
Local Explanation Methods for Deep Neural Networks Lack Sensitivity to Parameter Values
Explaining the output of a complicated machine learning model like a deep neural network (DNN) is a central challenge in machine learning. Several proposed local explanation methods address this issue by identifying what dimensions of a single input are most responsible for a DNN's output. The goal of this work is to assess the sensitivity of local explanations to DNN parameter values. Somewhat surprisingly, we find that DNNs with randomly-initialized weights produce explanations that are both visually and quantitatively similar to those produced by DNNs with learned weights. Our conjecture is that this phenomenon occurs because these explanations are dominated by the lower level features of a DNN, and that a DNN's architecture provides a strong prior which significantly affects the representations learned at these lower layers.
Stable and Effective Trainable Greedy Decoding for Sequence to Sequence Learning
We introduce a fast, general method to manipulate the behavior of the decoder in a sequence to sequence neural network model. We propose a small neural network actor that observes and manipulates the hidden state of a previously-trained decoder. We evaluate our model on the task of neural machine translation. In this task, we use beam search to decode sentences from the plain decoder for each training set input, rank them by BLEU score, and train the actor to encourage the decoder to generate the highest-BLEU output in a single greedy decoding operation without beam search. Experiments on several datasets and models show that our method yields substantial improvements in both translation quality and translation speed over its base system, with no additional data.
A Language and Compiler View on Differentiable Programming
Current and emerging deep learning architectures call for an expressive high-level programming style with end-to-end differentiation and for a high-performance implementation at the same time. But the current generation of deep learning frameworks either limits expressiveness and ease of use for increased performance (e.g., TensorFlow) or vice versa (e.g., PyTorch). In this paper we demonstrate that a “best of both worlds” approach is possible, based on multi-stage programming and delimited continuations, two orthogonal ideas firmly rooted in programming languages research.
ShakeDrop regularization
This paper proposes a powerful regularization method named ShakeDrop regularization. ShakeDrop is inspired by Shake-Shake regularization, which decreases error rates by disturbing learning. While Shake-Shake can be applied only to ResNeXt, which has multiple branches, ShakeDrop can be applied not only to ResNeXt but also to ResNet and PyramidNet in a memory-efficient way. An important and interesting feature of ShakeDrop is that it strongly disturbs learning by multiplying even a negative factor to the output of a convolutional layer in the forward training pass. ShakeDrop outperformed state-of-the-art methods on CIFAR-10/100. The full version of the paper, including other experiments, is available at https://arxiv.org/abs/1802.02375.
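A simplified single-factor sketch of that disturbance follows; the actual method applies separate random coefficients in the forward and backward passes, which this version does not reproduce:

```python
# ShakeDrop-style perturbation of a residual branch: with probability
# p_drop, scale the branch output by a random factor that can be negative.
import torch
import torch.nn as nn

class ShakeDropSketch(nn.Module):
    def __init__(self, p_drop=0.5, alpha_range=(-1.0, 1.0)):
        super().__init__()
        self.p_drop, self.alpha_range = p_drop, alpha_range

    def forward(self, x):
        if not self.training:
            return (1 - self.p_drop) * x   # expected scale at test time
        gate = (torch.rand(1, device=x.device) > self.p_drop).float()
        alpha = torch.empty(1, device=x.device).uniform_(*self.alpha_range)
        return (gate + alpha - gate * alpha) * x  # gate=1: identity; gate=0: alpha
```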
Policy Optimization with Second-Order Advantage Information
Policy optimization in high-dimensional action spaces is difficult because of the high variance of policy gradient estimators. We present the action subspace dependent gradient (ASDG) estimator, which incorporates the Rao-Blackwell theorem (RB) and Control Variates (CV) into a unified framework to reduce the variance. To invoke RB, the algorithm learns the underlying factorization structure of the action space based on the second-order gradient of the advantage function with respect to the action. Empirical studies demonstrate the performance improvement on high-dimensional synthetic settings and OpenAI Gym's MuJoCo continuous control tasks.
Tempered Adversarial Networks
Generative adversarial networks (GANs) have been shown to produce realistic samples from high-dimensional distributions, but training them is considered hard. A possible explanation for training instabilities is the inherent imbalance between the networks: While the discriminator is trained directly on both real and fake samples, the generator only has control over the fake samples it produces since the real data distribution is fixed by the choice of a given dataset. We propose a simple modification that gives the generator control over the real samples, leading to a tempered learning process for both generator and discriminator. The real data distribution passes through a lens before being revealed to the discriminator, balancing the training process by gradually revealing more detailed features necessary to produce high-quality results. The proposed module automatically adjusts the learning process to the current strength of the networks, yet is generic and easy to add to any GAN variant. In a number of experiments, we show that this is a promising technique to improve quality, stability and/or convergence speed across a range of different GAN architectures (DCGAN, LSGAN, WGAN-GP).
Exploring Deep Recurrent Models with Reinforcement Learning for Molecule Design
The design of small molecules with bespoke properties is of central importance to drug discovery. However, significant challenges remain for computational methods, despite recent advances such as deep recurrent networks and reinforcement learning strategies for sequence generation, and it can be difficult to compare results across different works. This work proposes 19 benchmarks selected by subject experts, expands smaller datasets previously used to approximately 1.1 million training molecules, and explores how to apply new reinforcement learning techniques effectively for molecular design. The benchmarks here, built as OpenAI Gym environments, will be open-sourced to encourage innovation in molecular design algorithms and to enable usage by those without a background in chemistry. Finally, this work explores recent developments in reinforcement learning methods with excellent sample complexity (the A2C and PPO algorithms) and investigates their behavior in molecular generation, demonstrating significant performance gains compared to standard reinforcement learning techniques.
Predicting Embryo Morphokinetics in Videos with Late Fusion Nets & Dynamic Decoders
To optimize clinical outcomes, many fertility clinics select embryos strategically, based on how quickly they reach certain developmental milestones. This requires manually annotating time-lapse EmbryoScope videos with their corresponding morphokinetics, a time-consuming process that requires experienced embryologists. We propose late-fusion ConvNets with a dynamic programming-based decoder for automatically labeling these videos. Experiments address data extracted from EmbryoScope incubators at the Cleveland Clinic Foundation Fertility Center. We focus on 6 stages, demonstrating 87% per-frame accuracy.
Towards Provable Control for Unknown Linear Dynamical Systems
We study the control of symmetric linear dynamical systems with unknown dynamics and a hidden state. Using a recent spectral filtering technique for concisely representing such systems in a linear basis, we formulate optimal control in this setting as a convex program. This approach eliminates the need to solve the non-convex problem of explicit identification of the system and its latent state, and allows for provable optimality guarantees for the control signal. We give the first efficient algorithm for finding the optimal control signal with an arbitrary time horizon T, with sample complexity (number of training rollouts) polynomial only in log(T) and other relevant parameters.
Are Efficient Deep Representations Learnable?
Many theories of deep learning have shown that a deep network can require dramatically fewer resources to represent a given function compared to a shallow network. But a question remains: can these efficient representations be learned using current deep learning techniques? In this work, we test whether standard deep learning methods can in fact find the efficient representations posited by several theories of deep representation. Specifically, we train deep neural networks to learn two simple functions with known efficient solutions: the parity function and the fast Fourier transform. We find that using gradient-based optimization, a deep network does not learn the parity function, unless initialized very close to a hand-coded exact solution. We also find that a deep linear neural network does not learn the fast Fourier transform, even in the best-case scenario of infinite training data, unless the weights are initialized very close to the exact hand-coded solution. Our results suggest that not every element of the class of compositional functions can be learned efficiently by a deep network, and further restrictions are necessary to understand what functions are both efficiently representable and learnable.
Learning via social awareness: improving sketch representations with facial feedback
In the quest towards general artificial intelligence (AI), researchers have explored developing loss functions that act as intrinsic motivators in the absence of external rewards. This paper argues that such research has overlooked an important and useful intrinsic motivator: social interaction. We posit that making an AI agent aware of implicit social feedback from humans can allow for faster learning of more generalizable and useful representations, and could potentially impact AI safety. We collect social feedback in the form of facial expression reactions to samples from Sketch RNN, an LSTM-based variational autoencoder (VAE) designed to produce sketch drawings. We use a Latent Constraints GAN (LC-GAN) to learn from the facial feedback of a small group of viewers, and then show in an independent evaluation with 76 users that this model produced sketches that led to significantly more positive facial expressions. Thus, we establish that implicit social feedback can improve the output of a deep learning model.
Winner's Curse? On Pace, Progress, and Empirical Rigor
The field of ML is distinguished both by rapid innovation and rapid dissemination of results. While the pace of progress has been extraordinary by any measure, in this paper we explore potential issues that we believe to be arising as a result. In particular, we observe that the rate of empirical advancement may not have been matched by a consistent increase in the level of empirical rigor across the field as a whole. This short position paper highlights examples where progress has actually been slowed as a result, offers thoughts on the incentive structures currently at play, and gives suggestions as seeds for discussions on productive change.
FigureQA: An Annotated Figure Dataset for Visual Reasoning
We introduce FigureQA, a visual reasoning corpus of over one million question-answer pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts. We formulate our reasoning task by generating questions from 15 templates; questions concern various relationships between plot elements and examine characteristics like the maximum, the minimum, area-under-the-curve, smoothness, and intersection. Resolving such questions often requires reference to multiple plot elements and synthesis of information distributed spatially throughout a figure. To facilitate the training of machine learning systems, the corpus also includes side data that can be used to formulate auxiliary objectives. In particular, we provide the numerical data used to generate each figure as well as bounding-box annotations for all plot elements. We study the proposed visual reasoning task by training several models, including the recently proposed Relation Network as a strong baseline. Preliminary results indicate that the task poses a significant machine learning challenge. We envision FigureQA as a first step towards developing models that can intuitively recognize patterns from visual representations of data.
HoME: a Household Multimodal Environment
We introduce HoME: a Household Multimodal Environment for artificial agents to learn from vision, audio, semantics, physics, and interaction with objects and other agents, all within a realistic context. HoME integrates over 45,000 diverse 3D house layouts based on the SUNCG dataset, a scale which may facilitate learning, generalization, and transfer. HoME is an open-source, OpenAI Gym-compatible platform extensible to tasks in reinforcement learning, language grounding, sound-based navigation, robotics, multi-agent learning, and more. We hope HoME better enables artificial agents to learn as humans do: in an interactive, multimodal, and richly contextualized setting.
An interpretable LSTM neural network for autoregressive exogenous model
In this paper, we propose an interpretable LSTM recurrent neural network, i.e., a multi-variable LSTM, for time series with exogenous variables. The attention mechanisms currently widely used in recurrent neural networks mostly focus on the temporal aspect of data and fall short of characterizing variable importance. To this end, our multi-variable LSTM, equipped with tensorized hidden states, is developed to learn variable-specific representations, which give rise to both temporal- and variable-level attention. Preliminary experiments demonstrate comparable prediction performance of the multi-variable LSTM w.r.t. encoder-decoder based baselines. More interestingly, variable importance in real datasets, as characterized by the variable attention, is highly in line with that determined by the statistical Granger causality test, which exhibits the prospect of the multi-variable LSTM as a simple and uniform end-to-end framework for both forecasting and knowledge discovery.
Semiparametric Reinforcement Learning
We introduce a semiparametric approach to deep reinforcement learning inspired by complementary learning systems theory in cognitive neuroscience. Our approach allows a neural network to integrate nonparametric, episodic memory-based computations with parametric statistical learning in an end-to-end fashion. We give a deep Q network access to intermediate and final results of a differentiable approximation to k-nearest-neighbors performed on a dictionary of historic state-action embeddings. Our method displays the early-learning advantage associated with episodic memory-based algorithms while mitigating the asymptotic performance disadvantage suffered by such approaches. In several cases we find that our model learns even more quickly from few examples than pure kNN-based approaches. Analysis shows that our semiparametric algorithm relies heavily on the kNN output early on and less so as training progresses, which is consistent with complementary learning systems theory.
GILBO: One Metric to Measure Them All
We propose a simple, tractable lower bound on the mutual information contained in the joint generative density of any latent variable generative model: the GILBO (Generative Information Lower BOund). It offers a data-independent measure of the complexity of the learned latent variable description, giving the log of the effective description length. It is well-defined for both VAEs and GANs. We compute the GILBO for 800 GANs and VAEs trained on MNIST and discuss the results.
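Schematically, and as we read the construction, the bound maximizes over an auxiliary encoder $e$ the expected log-ratio between the encoder's posterior and the prior:

$$\mathrm{GILBO} \;=\; \max_{e}\; \mathbb{E}_{z \sim p(z),\, x \sim G(z)}\!\left[\log \frac{e(z \mid x)}{p(z)}\right] \;\le\; I(x; z).$$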
ComboGAN: Unrestricted Scalability for Image Domain Translation
This past year alone has seen unprecedented leaps in the area of learning-based image translation, namely the unsupervised model CycleGAN by Zhu et al. But experiments so far have been tailored to merely two domains at a time, and scaling them to more would require a quadratic number of models to be trained. With two-domain models taking days to train on current hardware, the number of domains quickly becomes limited by training time. In this paper, we propose a multi-component image translation model and training scheme which scales linearly, both in resource consumption and time required, with the number of domains.
Negative eigenvalues of the Hessian in deep neural networks
We study the loss function of a deep neural network through the eigendecomposition of its Hessian matrix. We focus on negative eigenvalues, how important they are, and how to best deal with them. The goal is to develop an optimization method specifically tailored for deep neural networks.
Resilient Backpropagation (Rprop) for Batch-learning in TensorFlow
The resilient backpropagation (Rprop) algorithms are fast and accurate batch learning methods for neural networks. We describe their implementation in the popular machine learning framework TensorFlow. We present the first empirical evaluation of Rprop for training recurrent neural networks with gated recurrent units. In our experiments, Rprop with default hyperparameters outperformed vanilla steepest descent as well as the optimization algorithms RMSprop and Adam even if their hyperparameters were tuned.
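For reference, one common member of the family is the iRprop- update rule, sketched here for full-batch gradients (the exact variants evaluated may differ):

```python
# iRprop- step: per-weight step sizes grow when the gradient sign is
# stable and shrink when it flips; only the sign of the gradient is used.
import numpy as np

def irprop_minus_step(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
                      step_min=1e-6, step_max=50.0):
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    grad = np.where(sign_change < 0, 0.0, grad)  # skip the update on a sign flip
    return w - np.sign(grad) * step, grad, step  # returned grad is next prev_grad
```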
Combating Adversarial Attacks Using Sparse Representations
It is by now well-known that small adversarial perturbations can induce classification errors in deep neural networks (DNNs). In this paper, we make the case that sparse representations of the input data are a crucial tool for combating such attacks. For linear classifiers, we show that a sparsifying front end is provably effective against l∞-bounded attacks, reducing output distortion due to the attack by a factor of roughly K/N where N is the data dimension and K is the sparsity level. We then extend this concept to DNNs, showing that a “locally linear” model can be used to develop a theoretical foundation for crafting attacks and defenses. Experimental results for the MNIST dataset show the efficacy of the proposed sparsifying front end.
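A minimal sketch of such a sparsifying front end, keeping the K largest-magnitude coefficients in an orthonormal basis before handing the input to the classifier; the DCT basis and the value of K here are illustrative assumptions, not the paper's configuration:

```python
# Sparsifying front end: transform, keep the K largest coefficients,
# zero the rest, and reconstruct. Attack energy outside the retained
# subspace is discarded, shrinking output distortion roughly by K/N.
import numpy as np
from scipy.fftpack import dct, idct

def sparsify(x, k):
    coeffs = dct(x, norm='ortho')
    coeffs[np.argsort(np.abs(coeffs))[:-k]] = 0.0  # zero the N-k smallest
    return idct(coeffs, norm='ortho')

x = np.random.randn(784)        # e.g. a flattened MNIST image
x_front = sparsify(x, k=50)     # only K/N ~ 6% of coefficients survive
```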
Bayesian Incremental Learning for Deep Neural Networks
In industrial machine learning pipelines, data often arrive in parts. Particularly in the case of deep neural networks, it may be too expensive to train the model from scratch each time, so one would rather use a previously learned model and the new data to improve performance. However, deep neural networks are prone to getting stuck in a suboptimal solution when trained on only new data as compared to the full dataset. Our work focuses on a continuous learning setup where the task is always the same and new parts of data arrive sequentially. We apply a Bayesian approach to update the posterior approximation with each new piece of data and find this method to outperform the traditional approach in our experiments.
Can Deep Reinforcement Learning solve Erdos-Selfridge-Spencer Games?
Deep reinforcement learning has achieved many recent successes, but our understanding of its strengths and limitations is hampered by the lack of rich environments in which we can fully characterize optimal behavior, and correspondingly diagnose individual actions against such a characterization. Here we consider a family of combinatorial games, arising from work of Erdos, Selfridge, and Spencer, and we propose their use as environments for evaluating and comparing different approaches to reinforcement learning. These games have a number of appealing features: they are challenging for current learning approaches, but they form (i) a low-dimensional, simply parametrized environment where (ii) there is a linear closed form solution for optimal behavior from any state, and (iii) the difficulty of the game can be tuned by changing environment parameters in an interpretable way. We use these Erdos-Selfridge-Spencer games not only to compare different algorithms, but also to test for generalization, make comparisons to supervised learning, analyse multiagent play, and even develop a self-play algorithm.
GitGraph - from Computational Subgraphs to Smaller Architecture Search Spaces
To simplify neural architecture creation, AutoML is gaining traction, from evolutionary algorithms to reinforcement learning or simple search in a constrained space of neural modules. A big issue is its computational cost: the size of the search space can easily go above 10^10 candidates for a 10-layer network, and the cost of evaluating a single candidate is high even if it is not fully trained. In this work, we use the collective wisdom within the neural networks published in online code repositories to create better reusable neural modules. Concretely, we (a) extract and publish GitGraph, a corpus of neural architectures and their descriptions; (b) create problem-specific neural architecture search spaces, implemented as a textual search mechanism over GitGraph; and (c) propose a method of identifying unique common computational subgraphs.
Pelee: A Real-Time Object Detection System on Mobile Devices
The increasing need to run Convolutional Neural Network (CNN) models on mobile devices with limited computing power and memory resources encourages studies on efficient model design. A number of efficient architectures have been proposed in recent years, for example, MobileNet, ShuffleNet, and NASNet-A. However, all these models are heavily dependent on depthwise separable convolution, which lacks an efficient implementation in most deep learning frameworks. In this study, we propose an efficient architecture named PeleeNet, which is built with conventional convolution instead. On the ImageNet ILSVRC 2012 dataset, our proposed PeleeNet achieves 0.6% higher accuracy (71.3% vs. 70.7%) and 11% lower computational cost than MobileNet, the state-of-the-art efficient architecture. Meanwhile, PeleeNet is only half the model size of MobileNet. We then propose a real-time object detection system by combining PeleeNet with the Single Shot MultiBox Detector (SSD) method and optimizing the architecture for fast speed. Our proposed detection system, named Pelee, achieves 70.9% mAP (mean average precision) on the PASCAL VOC2007 dataset at a speed of 17.1 FPS on iPhone 6s and 23.6 FPS on iPhone 8. Compared to TinyYOLOv2, our proposed Pelee is more accurate (70.9% vs. 57.1%), 1.88 times lower in computational cost, and 1.92 times smaller in model size. The code and models are open-sourced.
Realistic Evaluation of Semi-Supervised Learning Algorithms
Semi-supervised learning (SSL) provides a powerful framework for leveraging unlabeled data when labels are limited or expensive to obtain. Approaches based on deep neural networks have recently proven successful on standard benchmark tasks. However, we argue that these benchmarks fail to address many issues that these algorithms would face in real-world applications. After creating a unified reimplementation of various widely-used SSL techniques, we test them in a suite of experiments designed to address these issues. We find that simple baselines which do not use unlabeled data can be competitive with the state-of-the-art, that SSL methods differ in sensitivity to the amount of labeled and unlabeled data, and that performance can degrade substantially when the unlabeled dataset contains out-of-class examples.
Learning to Learn Without Labels
A major goal of unsupervised learning is for algorithms to learn representations of data, useful for subsequent tasks, without access to supervised labels or other high-level attributes. Typically, these algorithms minimize a surrogate objective, such as reconstruction error or likelihood of a generative model, with the hope that representations useful for subsequent tasks will arise as a side effect (e.g. semi-supervised classification). In this work, we propose using meta-learning to learn an unsupervised learning rule, and meta-optimize the learning rule directly to produce good representations for a desired task. Here, our desired task (meta-objective) is the performance of the representation on semi-supervised classification, and we meta-learn an algorithm -- an unsupervised weight update rule -- that produces representations that perform well under this meta-objective. We examine the performance of the learned algorithm on several datasets and show that it learns useful features, generalizes across both network architectures and a wide array of datasets, and outperforms existing unsupervised learning techniques.
Predict Responsibly: Increasing Fairness by Learning to Defer
When machine learning models are used for high-stakes decisions, they should predict accurately, fairly, and responsibly. To fulfill these three requirements, a model must be able to output a reject option (i.e. say "I Don't Know") when it is not qualified to make a prediction. In this work, we propose learning to defer, a method by which a model can defer judgment to a downstream decision-maker such as a human user. We show that learning to defer generalizes the rejection learning framework in two ways: by considering the effect of other agents in the decision-making process, and by allowing for optimization of complex objectives. We propose a learning algorithm which accounts for potential biases held by decision-makers later in a pipeline. Experiments on real-world datasets demonstrate that learning to defer can make a model not only more accurate but also less biased. Even when operated by highly biased users, we show that deferring models can still greatly improve the fairness of the entire pipeline.
MemCNN: a Framework for Developing Memory Efficient Deep Invertible Networks
Reversible operations have recently been successfully applied to classification problems to reduce memory requirements during neural network training. This feature is accomplished by removing the need to store input activations for computing the gradients in the backward pass, and instead reconstructing them on demand. However, current approaches rely on custom implementations of backpropagation, which limits applicability and extensibility. We present MemCNN, a novel PyTorch framework which simplifies the application of reversible functions by removing the need for customized backpropagation. The framework contains a set of practical generalized tools, which can wrap common operations like convolutions and batch normalization and which take care of the memory management. We validate the presented framework by reproducing state-of-the-art experiments on CIFAR-10 and CIFAR-100, achieving similar classification accuracy and faster training times.
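The kind of building block such a framework wraps is the additive coupling layer, whose input is exactly recoverable from its output, so activations need not be stored for the backward pass; a self-contained sketch (F and G are arbitrary sub-networks):

```python
# Reversible (additive coupling) block: forward computes (y1, y2) from
# (x1, x2); inverse reconstructs the inputs exactly from the outputs.
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    def __init__(self, f, g):
        super().__init__()
        self.f, self.g = f, g

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

blk = ReversibleBlock(nn.Linear(16, 16), nn.Linear(16, 16))
x1, x2 = torch.randn(4, 16), torch.randn(4, 16)
y1, y2 = blk(x1, x2)
r1, r2 = blk.inverse(y1, y2)
print(torch.allclose(r1, x1, atol=1e-6), torch.allclose(r2, x2, atol=1e-6))
```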
Learning Deep Models: Critical Points and Local Openness
In this paper we present a unifying framework to study the local/global optima equivalence of the optimization problems arising from training non-convex deep models. Using the local openness property of the underlying training models, we provide simple sufficient conditions under which any local optimum of the resulting optimization problem is globally optimal. We first completely characterize the local openness of the matrix multiplication mapping in its range. Then we use our characterization to: 1) show that every local optimum of two-layer linear networks is globally optimal; unlike many existing results, this requires no assumption on the target data matrix Y or the input data matrix X; 2) develop an almost complete characterization of the local/global optima equivalence of multi-layer linear neural networks; and 3) show global/local optima equivalence of non-linear deep models having a certain pyramidal structure; unlike some existing works, this result requires no assumption on the differentiability of the activation functions.
Conditional Networks for Few-Shot Semantic Segmentation
Few-shot learning methods aim for good performance in the low-data regime. Structured output tasks such as segmentation present difficulties for few-shot learning because of their high dimensionality and the statistical dependencies among outputs. To tackle this problem, we propose the co-FCN, a conditional network learned by end-to-end optimization to perform fast, accurate few-shot segmentation. The network conditions on an annotated support set of images via feature fusion to perform inference on an unannotated query image. Once learned, our conditioning approach requires no further optimization for new data. Additional annotated inputs are used to update the output via a single inference step, making the model suitable for interactive use. Our conditional network significantly improves few-shot accuracy over the prior state-of-the-art.
An Optimization View on Dynamic Routing Between Capsules
Despite the effectiveness of the dynamic routing procedure recently proposed in \citep{sabour2017dynamic}, we still lack a standard formalization of the heuristic and its implications. In this paper, we partially formulate the routing strategy proposed in \citep{sabour2017dynamic} as an optimization problem that minimizes a combination of a clustering-like loss and a KL regularization term between the current coupling distribution and its last state. We then introduce another simple routing approach, which enjoys a few interesting properties. In an unsupervised perceptual grouping task, we show experimentally that our routing algorithm outperforms the dynamic routing method proposed in \citep{sabour2017dynamic}.
Gradients explode - Deep Networks are shallow - ResNet explained
Whereas it is believed that techniques such as Adam, batch normalization and, more recently, SELU nonlinearities "solve" the exploding gradient problem, we show that this is not the case: in a range of popular MLP architectures, exploding gradients exist, and they limit the depth to which networks can be effectively trained, both in theory and in practice. We explain why exploding gradients occur and highlight the collapsing domain problem, which can arise in architectures that avoid exploding gradients.
ResNets have significantly lower gradients and thus can circumvent the exploding gradient problem, enabling the effective training of much deeper networks, which we show is a consequence of a surprising mathematical property. By noticing that any neural network is a residual network, we devise the residual trick, which reveals that introducing skip connections simplifies the network mathematically, and that this simplicity may be the major cause for their success.
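As we read it, the residual trick is the algebraic observation that any layer can be rewritten in residual form,

$$x_{l+1} \;=\; f_l(x_l) \;=\; x_l + \big(f_l(x_l) - x_l\big),$$

so that every network is formally a residual network with its skip connections made explicit.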
Learning Invariance with Compact Transforms
The problem of building machine learning models that admit efficient representations and also capture an appropriate inductive bias for the domain has recently attracted significant interest. Existing work for compressing deep learning pipelines has explored classes of structured matrices that exhibit forms of shift-invariance akin to convolutions. We leverage the displacement rank framework to automatically learn the structured class, allowing for adaptation to the invariances required for a given dataset while preserving asymptotically efficient multiplication and storage. In a setting with a small fixed parameter budget, our broad classes of structured matrices improve final accuracy by 5-7% on standard image classification datasets compared to conventional parameter constraining methods.
Fast Node Embeddings: Learning Ego-Centric Representations
Representation learning is one of the foundations of Deep Learning and has allowed important improvements on several Machine Learning tasks, such as Neural Machine Translation, Question Answering and Speech Recognition. Recent works have proposed new methods for learning representations for nodes and edges in graphs. Several of these methods are based on the SkipGram algorithm, and they usually process a large number of multi-hop neighbors in order to produce the context from which node representations are learned. In this paper, we propose an effective and also efficient method for generating node embeddings in graphs that employs a restricted number of permutations over the immediate neighborhood of a node as context to generate its representation, yielding ego-centric representations. We present a thorough evaluation showing that our method outperforms state-of-the-art methods on six different datasets related to the problems of link prediction and node classification, being one to three orders of magnitude faster than baselines when generating node embeddings for very large graphs.
Synthesizing Audio with GANs
While Generative Adversarial Networks (GANs) have seen wide success at the problem of synthesizing realistic images, they have seen little application to audio generation. In this paper, we introduce WaveGAN, a first attempt at applying GANs to raw audio synthesis in an unsupervised setting. Our experiments on speech demonstrate that WaveGAN can produce intelligible words from a small vocabulary of human speech, as well as synthesize audio from other domains such as bird vocalizations, drums, and piano. Qualitatively, we find that human judges prefer the generated examples from WaveGAN over those from a method which naïvely applies GANs on image-like audio feature representations.
Multi-Agent Generative Adversarial Imitation Learning
We propose a new framework for multi-agent imitation learning for general Markov games, where we build upon a generalized notion of inverse reinforcement learning. We introduce a practical multi-agent actor-critic algorithm with good empirical performance. Our method can be used to imitate complex behaviors in high-dimensional environments with multiple cooperative or competitive agents.
Learning and Memorization
In the machine learning research community, it is generally believed that there is a tension between memorization and generalization. In this work, we examine to what extent this tension exists, by exploring whether it is possible to generalize through memorization alone. Although direct memorization with a lookup table obviously does not generalize, we find that introducing depth in the form of a network of support-limited lookup tables leads to generalization that is significantly above chance and closer to that obtained by standard learning algorithms on several tasks derived from MNIST and CIFAR-10. Furthermore, we demonstrate through a series of empirical results that our approach allows for a smooth tradeoff between memorization and generalization and exhibits some of the most salient characteristics of neural networks: depth improves performance; random data can be memorized and yet there is generalization on real data; and memorizing random data is harder in a certain sense than memorizing real data. The extreme simplicity of the algorithm and potential connections with stability provide important insights into the impact of depth on learning algorithms, and point to several interesting directions for future research.
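A hedged sketch of one such layer of support-limited lookup tables, where each table sees k randomly chosen input bits and memorizes the majority label over each of its 2^k addressable patterns (simplified relative to the paper's construction):

```python
# One layer of k-input lookup tables trained purely by memorization:
# each table counts label votes per input pattern and stores the majority.
import numpy as np

class LUTLayer:
    def __init__(self, n_in, n_tables, k, rng):
        self.supports = rng.integers(0, n_in, size=(n_tables, k))
        self.tables = np.zeros((n_tables, 2 ** k))

    def _addr(self, X):
        bits = X[:, self.supports]                       # (N, n_tables, k)
        return (bits * 2 ** np.arange(bits.shape[-1])).sum(-1)

    def fit(self, X, y):
        addr = self._addr(X)
        for t in range(self.tables.shape[0]):
            votes = np.zeros_like(self.tables[t])
            np.add.at(votes, addr[:, t], 2 * y - 1)      # +1 / -1 label votes
            self.tables[t] = (votes >= 0).astype(float)

    def forward(self, X):
        return self.tables[np.arange(self.tables.shape[0]), self._addr(X)]

rng = np.random.default_rng(0)
X, y = rng.integers(0, 2, (1000, 64)), rng.integers(0, 2, 1000)
layer = LUTLayer(n_in=64, n_tables=32, k=8, rng=rng)
layer.fit(X, y)
print(layer.forward(X[:5]).shape)  # (5, 32): one output bit per table
```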
Learning Efficient Tensor Representations with Ring Structure Networks
\emph{Tensor train (TT) decomposition} is a powerful representation for high-order tensors, which has been successfully applied to various machine learning tasks in recent years. In this paper, we propose a more generalized tensor decomposition with a ring-structure network, employing circular multilinear products over a sequence of lower-order core tensors, which we term the TR representation. Several learning algorithms are presented, including blockwise ALS with adaptive tensor ranks and SGD with high scalability. Furthermore, the mathematical properties are investigated, which enables us to perform basic algebra operations in a computationally efficient way by using TR representations. Experimental results on synthetic signals and real-world datasets demonstrate the effectiveness of the TR model and the learning algorithms. In particular, we show that the structure information and high-order correlations within a 2D image can be captured efficiently by employing tensorization and the TR representation.
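Concretely, in the TR format each tensor entry is the trace of a circular product of core slices; a small numpy sketch reconstructing a third-order tensor from random TR cores:

```python
# Reconstruct a full tensor from tensor-ring cores: chain the cores by
# contracting adjacent rank indices, then close the ring with a trace.
import numpy as np

def tr_reconstruct(cores):
    """cores[k] has shape (r_k, n_k, r_{k+1}); the last rank wraps to r_0."""
    out = cores[0]
    for core in cores[1:]:
        out = np.einsum('a...b,bjc->a...jc', out, core)
    return np.trace(out, axis1=0, axis2=-1)   # circular closure

r, n = 3, 4
cores = [np.random.randn(r, n, r) for _ in range(3)]
print(tr_reconstruct(cores).shape)  # (4, 4, 4)
```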
Investigating Human Priors for Playing Video Games
Deep reinforcement learning algorithms have recently achieved impressive results on a range of video games, yet they remain much less efficient than an average human player at learning a new game. What makes humans so good at solving these video games? Here, we study one aspect critical to human gameplay -- their use of strong priors that enable efficient decision making and problem-solving. We created a sample video game and conducted various experiments to quantify the kinds of prior knowledge humans bring in while playing such games. We do this by modifying the video game environment to systematically remove different types of visual information that could be used by humans as priors. We find that human performance degrades drastically once prior information has been removed, while that of an RL agent does not change. Interestingly, we also find that general priors about objects that humans learn when they are as little as two months old are some of the most critical priors that help in human gameplay. Based on these findings, we then propose a taxonomy of object priors people employ when solving video games that can potentially serve as a benchmark for future reinforcement learning algorithms aiming to incorporate human-like representations in their systems.
A Dataset To Evaluate The Representations Learned By Video Prediction Models
We present a parameterized synthetic dataset called Moving Symbols to support the objective study of video prediction networks. Using several instantiations of the dataset in which variation is explicitly controlled, we highlight issues in an existing state-of-the-art approach and propose the use of a performance metric with greater semantic meaning to improve experimental interpretability. Our dataset provides canonical test cases that will help the community better understand, and eventually improve, the representations learned by such networks in the future. Code is available at https://github.com/rszeto/moving-symbols.
An Evaluation of Fisher Approximations Beyond Kronecker Factorization
We study two coarser approximations on top of a Kronecker factorization (K-FAC) of the Fisher information matrix, to scale up Natural Gradient to deep and wide Convolutional Neural Networks (CNNs). The first considers the activations (feature maps) as spatially uncorrelated while the second considers only correlations among groups of channels. Both variants yield a further block-diagonal approximation tailored for CNNs, which is much more efficient to compute and invert. Experiments on the VGG11 and ResNet50 architectures show the technique can substantially speed up both K-FAC and a baseline with Batch Normalization in wall-clock time, yielding faster convergence to similar or better generalization error.
Weightless: Lossy weight encoding for deep neural network compression
The large memory requirements of deep neural networks limit their deployment and adoption on many devices. Model compression methods effectively reduce the memory requirements of these models, usually through applying transformations such as weight pruning or quantization. In this paper, we present a novel scheme for lossy weight encoding which complements conventional compression techniques. The encoding is based on the Bloomier filter, a probabilistic data structure that can save space at the cost of introducing random errors. Leveraging the ability of neural networks to tolerate these imperfections and by re-training around the errors, the proposed technique, Weightless, can compress DNN weights by up to 496× with the same model accuracy. This results in up to a 1.51× improvement over the state-of-the-art.
Empirical Analysis of the Hessian of Over-Parametrized Neural Networks
We study the properties of common loss surfaces through their Hessian matrix. In particular, in the context of deep learning, we empirically show that the spectrum of the Hessian is composed of two parts: (1) the bulk centered near zero, and (2) outliers away from the bulk. We present numerical evidence and mathematical justifications for the following conjectures laid out by Sagun et al. (2016): fixing data, increasing the number of parameters merely scales the bulk of the spectrum; fixing the dimension and changing the data (for instance adding more clusters or making the data less separable) only affects the outliers. We believe that our observations have striking implications for non-convex optimization in high dimensions. First, the flatness of such landscapes (which can be measured by the singularity of the Hessian) implies that classical notions of basins of attraction may be quite misleading, and the discussion of wide/narrow basins may need a new perspective around over-parametrization and redundancy, which are able to create large connected components at the bottom of the landscape. Second, the dependence of a small number of large eigenvalues on the data distribution can be linked to the spectrum of the covariance matrix of gradients of model outputs. With this in mind, we may reevaluate the connections within the data-architecture-algorithm framework of a model, hoping that it would shed light on the geometry of high-dimensional and non-convex spaces in modern applications. In particular, we present a case that links the two observations: small- and large-batch gradient descent appear to converge to different basins of attraction, but we show that they are in fact connected through their flat region and so belong to the same basin.
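The measurement behind these observations can be reproduced exactly on a deliberately tiny network, where the full Hessian fits in memory (the experiments above concern much larger models, where approximations are needed); a sketch:

```python
# Exact Hessian eigenspectrum of a tiny regression MLP: build the Hessian
# row by row from second derivatives, then inspect bulk vs. outliers.
import torch

torch.manual_seed(0)
X, y = torch.randn(64, 5), torch.randn(64, 1)
model = torch.nn.Sequential(torch.nn.Linear(5, 8), torch.nn.Tanh(),
                            torch.nn.Linear(8, 1))
params = list(model.parameters())
loss = torch.nn.functional.mse_loss(model(X), y)
grads = torch.autograd.grad(loss, params, create_graph=True)
flat = torch.cat([g.reshape(-1) for g in grads])
rows = [torch.cat([h.reshape(-1) for h in
                   torch.autograd.grad(flat[i], params, retain_graph=True)])
        for i in range(flat.numel())]
H = torch.stack(rows)
eigs = torch.linalg.eigvalsh(0.5 * (H + H.T))
print(eigs[:3], eigs[-3:])  # bulk near zero plus a few large outliers
```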
ChatPainter: Improving Text to Image Generation using Dialogue
Synthesizing realistic images from text descriptions on a dataset like Microsoft Common Objects in Context (COCO), where each image can contain several objects, is a challenging task. Prior work has used text captions to generate images. However, captions might not be informative enough to capture the entire image and insufficient for the model to be able to understand which objects in the images correspond to which words in the captions. We show that adding a dialogue that further describes the scene leads to significant improvement in the inception score and in the quality of generated images on the COCO dataset.
Meta-Learning for Batch Mode Active Learning
Active learning involves selecting unlabeled data items to label in order to best improve an existing classifier. In most applications, batch mode active learning, where a set of items is picked all at once to be labeled and then used to re-train the classifier, is most feasible because it does not require the model to be re-trained after each individual selection and makes most efficient use of human labor for annotation. In this work, we explore using meta-learning to learn an active learning algorithm that selects the best set of unlabeled items to label given a classifier trained on a small training set. Our experiments show that our learned active learning algorithm is able to construct labeled sets that improve a classifier better than commonly used heuristics.
Spatially Parallel Convolutions
The training of convolutional neural networks with large inputs on GPUs is limited by the available GPU memory capacity. In this work, we describe spatially parallel convolutions, which sidestep the memory capacity limit of a single GPU by partitioning tensors along their spatial axes across multiple GPUs. On modern multi-GPU systems, we demonstrate that spatially parallel convolutions attain excellent scaling when applied to input tensors with large spatial dimensions.
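A single-machine numpy sketch of the idea: split the input along one spatial axis and pad each shard with a halo of neighboring rows, so that a 3x3 convolution over the shards matches the unsplit result (actual device placement and halo communication are omitted):

```python
# Spatial partitioning with halo exchange, simulated on one machine:
# each shard carries `halo` extra boundary rows so its convolution output
# agrees with the corresponding rows of the full convolution.
import numpy as np
from scipy.signal import correlate2d

def split_with_halo(x, parts, halo):
    bounds = np.linspace(0, x.shape[0], parts + 1, dtype=int)
    shards = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        top = max(lo - halo, 0)
        shards.append((x[top:min(hi + halo, x.shape[0])], lo - top, hi - lo))
    return shards  # (padded shard, offset of owned rows, number owned)

x, k = np.random.randn(32, 32), np.random.randn(3, 3)
full = correlate2d(x, k, mode='same')
pieces = [correlate2d(s, k, mode='same')[ofs:ofs + n]
          for s, ofs, n in split_with_halo(x, parts=2, halo=1)]
print(np.allclose(np.vstack(pieces), full))  # True
```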
Fast and Accurate Text Classification: Skimming, Rereading and Early Stopping
Recent advances in recurrent neural nets (RNNs) have shown much promise in many applications in natural language processing. For most of these tasks, such as sentiment analysis of customer reviews, a recurrent neural net model parses the entire review before forming a decision. We argue that reading the entire input is not always necessary in practice, since a lot of reviews are often easy to classify, i.e., a decision can be formed after reading some crucial sentences or words in the provided text. In this paper, we present an approach of fast reading for text classification. Inspired by several well-known human reading techniques, our approach implements an intelligent recurrent agent which evaluates the importance of the current snippet in order to decide whether to make a prediction, or to skip some texts, or to re-read part of the sentence. Our agent uses an RNN module to encode information from the past and the current tokens, and applies a policy module to form decisions. With an end-to-end training algorithm based on policy gradient, we train and test our agent on several text classification datasets and achieve both higher efficiency and better accuracy compared to previous approaches.
Analysis of Cosmic Microwave Background with Deep Learning
The observation of the Cosmic Microwave Background (CMB) has been one of the cornerstones in establishing the current understanding of the Universe. This valuable source of information consists of primary and secondary effects. While the primary source of information in the CMB (as a Gaussian random field) can be efficiently analyzed using established statistical methods, the CMB is also host to secondary sources of information that are more complex to analyze and understand. Here, we report encouraging preliminary results, as well as some difficulties, in using deep learning for prediction of the cosmological parameters and uncertainty estimates from the primary CMB. This opens the way to the application of deep models in the analysis of the secondary CMB and the joint analysis of the CMB with other modalities such as the large-scale structure.
Neuron as an Agent
Existing multi-agent reinforcement learning (MARL) communication methods have relied on a trusted third party (TTP) to distribute rewards to agents, leaving them inapplicable in peer-to-peer environments. This paper proposes reward distribution using {\em Neuron as an Agent} (NaaA) in MARL without a TTP, based on two key ideas: (i) inter-agent reward distribution and (ii) auction theory. Auction theory is introduced because inter-agent reward distribution alone is insufficient for optimization. Agents in NaaA maximize their profits (the difference between reward and cost) and, as a theoretical result, the auction mechanism is shown to have agents autonomously evaluate counterfactual returns as the values of other agents. NaaA enables representation trades in peer-to-peer environments, ultimately regarding units in neural networks as agents. Finally, numerical experiments (a single-agent environment from OpenAI Gym and a multi-agent environment from ViZDoom) confirm that optimization with the NaaA framework leads to better performance in reinforcement learning.
Monotonic models for real-time dynamic malware detection
In dynamic malware analysis, programs are classified as malware or benign based on their execution logs. We propose applying monotonic classification models to this analysis, to make the trained model's predictions consistent over execution time and provably stable to the injection of any noise or `benign-looking' activity into the program's behavior. The predictions of such models change monotonically through the log, in the sense that the addition of new lines to the log may only increase the probability of the file being found malicious, which makes them suitable for real-time classification on a user's machine. We evaluate monotonic neural network models based on the work by Chistyakov et al. (2017) and demonstrate that they provide stable and interpretable results.
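The abstract does not spell out the architecture, but one simple way to obtain the monotonicity property is sketched below, as an illustration of the property rather than the authors' model: if the features are cumulative event counts (non-decreasing as the log grows) and all weights are non-negative with monotone activations, the malware score can only grow as new log lines arrive.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = np.abs(rng.normal(size=(8, 4)))   # non-negative weights
w2 = np.abs(rng.normal(size=4))

def score(event_counts):
    h = np.maximum(W1.T @ event_counts, 0.0)   # ReLU is monotone
    return float(w2 @ h)                       # non-negative combination

log = rng.integers(0, 8, size=100)             # a stream of event-type ids
counts = np.zeros(8)
prev = -np.inf
for event in log:
    counts[event] += 1                         # counts never decrease
    s = score(counts)
    assert s >= prev - 1e-9                    # score is monotone in the log
    prev = s
```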
Nonlinear Acceleration of CNNs
Regularized Nonlinear Acceleration (RNA) can improve the rate of convergence of many optimization schemes such as gradient descent, SAGA or SVRG, estimating the optimum using a nonlinear average of past iterates. Until now, its analysis was limited to convex problems, but empirical observations show that RNA may be extended to a broader setting. Here, we investigate the benefits of nonlinear acceleration when applied to the training of neural networks, in particular for the task of image recognition on the CIFAR10 and ImageNet data sets. In our experiments, with minimal modifications to existing frameworks, RNA speeds up convergence and improves testing error on standard CNNs.
Hockey-Stick GAN
We propose a new objective for generative adversarial networks (GANs) that is aimed to address current issues in GANs such as mode collapse and unstable convergence. Our approach stems from the hockey-stick divergence that has properties we claim to be of great importance in generative models. We provide theoretical support for the model and preliminary results on synthetic Gaussian data.
SGD on Random Mixtures: Private Machine Learning under Data Breach Threats
We propose Stochastic Gradient Descent on Random Mixtures (SGDRM) as a simple way of protecting data under data breach threats. We show that SGDRM converges to the globally optimal point for deep neural networks with linear activations while being differentially private. We also train nonlinear neural networks with private mixtures as the training data, demonstrating the practicality of SGDRM.
Decoupling Dynamics and Reward for Transfer Learning
Reinforcement Learning (RL) provides a sound decision-theoretic framework to optimize the behavior of learning agents in an interactive setting. However, one of the limitations to applying RL to real-world tasks is the amount of data required for learning an optimal policy. Our goal is to design an RL model that can be efficiently trained on new tasks and produces solutions that generalize well beyond the training environment. We take inspiration from Successor Features (Dayan, 1993), which decouple the value function representation into dynamics and rewards and learn them separately. We take this further by explicitly decoupling learning of the state representation, reward function, forward dynamics, and inverse dynamics of the environment. We posit that this decoupling lets us learn a representation space \mathcal{Z} that makes downstream learning easier because: (1) the modules can be learned separately, enabling efficient reuse of common knowledge across tasks to quickly adapt to new tasks; (2) the modules can be optimized jointly, leading to a representation space that is adapted to the policy and value function rather than only the observation space; (3) the dynamics model enables forward search and planning, in the usual model-based RL way. Our approach is the first model-based RL method to explicitly incorporate learning of inverse dynamics, and we show that this plays an important role in stabilizing learning.
Learning and Analyzing Vector Encoding of Symbolic Representation
We present a formal language with expressions denoting general symbol structures and queries which access information in those structures. A sequence-to-sequence network processing this language learns to encode symbol structures and query them. The learned representation (approximately) shares a simple linearity property with theoretical techniques for performing this task.
Uncertainty Estimation via Stochastic Batch Normalization
In this work, we investigate the Batch Normalization technique and propose a probabilistic interpretation of it. We propose a probabilistic model and show that Batch Normalization maximizes a lower bound on its marginalized log-likelihood. Then, according to the new probabilistic model, we design an algorithm which acts consistently during training and testing. However, inference becomes computationally inefficient. To reduce the memory and computational cost, we propose Stochastic Batch Normalization -- an efficient approximation of the proper inference procedure. This method provides us with a scalable uncertainty estimation technique. We demonstrate the performance of Stochastic Batch Normalization on popular architectures (including deep convolutional architectures: VGG-like and ResNets) on the MNIST and CIFAR-10 datasets.
To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression
Model pruning seeks to induce sparsity in a deep neural network's various connection matrices, thereby reducing the number of nonzero-valued parameters in the model. Recent reports (Han et al., 2015; Narang et al., 2017) prune deep networks at the cost of only a marginal loss in accuracy and achieve a sizable reduction in model size. This hints at the possibility that the baseline models in these experiments are perhaps severely over-parameterized at the outset and a viable alternative for model compression might be to simply reduce the number of hidden units while maintaining the model's dense connection structure, exposing a similar trade-off in model size and accuracy. We investigate these two distinct paths for model compression within the context of energy-efficient inference in resource-constrained environments and propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with minimal tuning and can be seamlessly incorporated within the training process. We compare the accuracy of large, but pruned models (large-sparse) and their smaller, but dense (small-dense) counterparts with identical memory footprint. Across a broad range of neural network architectures (deep CNNs, stacked LSTM, and seq2seq LSTM models), we find large-sparse models to consistently outperform small-dense models and achieve up to 10x reduction in number of non-zero parameters with minimal loss in accuracy.
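The abstract does not give the schedule, but a small sketch of a gradual pruning schedule of the kind described follows, assuming the cubic sparsity ramp commonly associated with gradual magnitude pruning; all hyperparameter values are illustrative.

```python
# Sparsity is ramped from an initial value s_i to a final value s_f over n
# pruning steps applied every dt training steps, following a cubic schedule.
def sparsity_at(step, s_i=0.0, s_f=0.9, t0=0, n=100, dt=10):
    if step < t0:
        return s_i
    t = min(step, t0 + n * dt)
    return s_f + (s_i - s_f) * (1 - (t - t0) / (n * dt)) ** 3

# At each pruning step, the smallest-magnitude weights are masked to zero
# until the layer reaches sparsity_at(step).
for step in (0, 250, 500, 1000):
    print(step, round(sparsity_at(step), 3))
```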
On the Limitation of Local Intrinsic Dimensionality for Characterizing the Subspaces of Adversarial Examples
Understanding and characterizing the subspaces of adversarial examples aid in studying the robustness of deep neural networks (DNNs) to adversarial perturbations. Very recently, (Ma et al. ICLR 2018) proposed to use local intrinsic dimensionality (LID) in layer-wise hidden representations of DNNs to study adversarial subspaces. It was demonstrated that LID can be used to characterize the adversarial subspaces associated with different attack methods, e.g., the Carlini and Wagner's (C&W) attack and the fast gradient sign attack.
In this paper, we use MNIST and CIFAR-10 to conduct two new sets of experiments that are absent from existing LID analysis and report the limitation of LID in characterizing the corresponding adversarial subspaces, namely (i) oblivious attacks and LID analysis using adversarial examples with different confidence levels; and (ii) black-box transfer attacks. For (i), we find that the performance of LID is very sensitive to the confidence parameter deployed by an attack, and the LID learned from ensembles of adversarial examples with varying confidence levels surprisingly gives poor performance. For (ii), we find that when adversarial examples are crafted from another DNN model, LID is ineffective in characterizing their adversarial subspaces. These two findings together suggest the limited capability of LID in characterizing the subspaces of adversarial examples.
A Proximal Block Coordinate Descent Algorithm for Deep Neural Network Training
Training deep neural networks (DNNs) efficiently is a challenge due to the associated highly nonconvex optimization. The backpropagation (backprop) algorithm has long been the most widely used algorithm for computing gradients of the parameters of DNNs and is used along with gradient descent-type algorithms for this optimization task. Recent work has empirically shown the efficiency of block coordinate descent (BCD) type methods for training DNNs. In view of this, we propose a novel algorithm based on the BCD method for training DNNs and provide its global convergence results built upon the powerful framework of the Kurdyka-Lojasiewicz (KL) property. Numerical experiments on standard datasets demonstrate its competitive efficiency against standard optimizers with backprop.
Semi-Supervised Few-Shot Learning with MAML
We present preliminary results on extending Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017a) to fast adaptation to new classification tasks in the presence of unlabeled data. Using synthetic data, we show that MAML can adapt to new tasks without any labeled examples (unsupervised adaptation) when the new task has the same output space (classes) as the training tasks do. We further extend MAML to the semi-supervised few-shot learning scenario, when the output space of the new tasks can be different from the training tasks.
The loss surface and expressivity of deep convolutional neural networks
We analyze the expressiveness and loss surface of practical deep convolutional neural networks (CNNs) with shared weights. We show that such CNNs produce linearly independent (and thus linearly separable) features at every ``wide'' layer that has more neurons than the number of training samples. This condition holds, e.g., for the VGG network. Furthermore, we provide for such wide CNNs necessary and sufficient conditions for global minima with zero training error. For the case where the wide layer is followed by a fully connected layer, we show that almost every critical point of the empirical loss is a global minimum with zero training error. Our analysis suggests that both depth and width are equally important in deep learning. While depth brings more representational power and allows the network to learn high-level features, width smoothes the optimization landscape of the loss function in the sense that a sufficiently wide CNN has a well-behaved loss surface with almost no bad local minima.
Expert-based reward function training: the novel method to train sequence generators
Training methods for sequence generators that combine GANs and policy gradients have shown good performance. In this paper, we propose expert-based reward function training, a novel method to train sequence generators. Unlike previous studies of sequence generation, expert-based reward function training does not rely on the GAN framework. Still, our model outperforms SeqGAN and a strong baseline, RankGAN.
Variance-based Gradient Compression for Efficient Distributed Deep Learning
Due to the substantial computational cost, training state-of-the-art deep neural networks on large-scale datasets often requires distributed training using multiple computation workers. However, by nature, workers need to frequently communicate gradients, causing severe bottlenecks, especially on lower-bandwidth connections. A few methods have been proposed to compress gradients for efficient communication, but they either suffer from a low compression ratio or significantly harm the resulting model accuracy, particularly when applied to convolutional neural networks. To address these issues, we propose a method to reduce the communication overhead of distributed deep learning. Our key observation is that gradient updates can be delayed until an unambiguous (high amplitude, low variance) gradient has been calculated. We also present an efficient algorithm to compute the variance and prove that it can be obtained with negligible additional cost. We experimentally show that our method can achieve a very high compression ratio while maintaining the resulting model accuracy. We also analyze the efficiency using computation and communication cost models and provide evidence that this method enables distributed deep learning for many scenarios with commodity environments.
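A hedged sketch of the key observation follows, with made-up shapes and threshold, and with details that necessarily differ from the paper's algorithm: coordinates whose accumulated gradient is large relative to the variance estimated across micro-batch gradients are transmitted, while ambiguous coordinates stay in a local delay buffer.

```python
import numpy as np

rng = np.random.default_rng(0)

def compress(micro_grads, buffer, k=2.0):
    g = micro_grads.mean(axis=0) + buffer          # include delayed mass
    var = micro_grads.var(axis=0) / len(micro_grads)
    send = np.abs(g) > k * np.sqrt(var)            # "unambiguous" coordinates
    transmitted = np.where(send, g, 0.0)
    return transmitted, np.where(send, 0.0, g)     # the rest stays buffered

buffer = np.zeros(10)
micro_grads = rng.normal(0.1, 1.0, size=(32, 10))  # 32 micro-batch gradients
sent, buffer = compress(micro_grads, buffer)
print("fraction sent:", (sent != 0).mean())
```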
Kronecker Recurrent Units
Our work addresses two important issues with recurrent neural networks: (1) they are over-parameterized, and (2) the recurrent weight matrix is ill-conditioned. The former increases the sample complexity of learning and the training time. The latter causes the vanishing and exploding gradient problem. We present a flexible recurrent neural network model called Kronecker Recurrent Units (KRU). KRU achieves parameter efficiency in RNNs through a Kronecker-factored recurrent matrix. It overcomes the ill-conditioning of the recurrent matrix by enforcing soft unitary constraints on the factors. Thanks to the small dimensionality of the factors, maintaining these constraints is computationally efficient. Our experimental results on seven standard datasets reveal that KRU can reduce the number of parameters in the recurrent weight matrix by three orders of magnitude compared to existing recurrent models, without sacrificing statistical performance. These results in particular show that while there are advantages in having a high-dimensional recurrent space, the capacity of the recurrent part of the model can be dramatically reduced.
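A minimal sketch of the two ingredients, with made-up factor sizes: the recurrent matrix is a Kronecker product of small factors, so a 64x64 matrix is parameterized by three 4x4 blocks, and a soft unitary penalty on each factor keeps the product well-conditioned.

```python
import torch

torch.manual_seed(0)
factors = [torch.randn(4, 4, requires_grad=True) for _ in range(3)]

def recurrent_matrix(factors):
    W = factors[0]
    for f in factors[1:]:
        W = torch.kron(W, f)       # three 4x4 factors -> 64x64 matrix
    return W

def soft_unitary_penalty(factors):
    eye = torch.eye(4)
    # Penalize each factor's deviation from orthogonality/unitarity.
    return sum(((f.T @ f - eye) ** 2).sum() for f in factors)

W = recurrent_matrix(factors)      # 4096 entries from only 48 parameters
print(W.shape, soft_unitary_penalty(factors).item())
```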
Attacking the Madry Defense Model with $L_1$-based Adversarial Examples
Adversarial Spheres
ReinforceWalk: Learning to Walk in Graph with Monte Carlo Tree Search
We consider the problem of learning to walk over a graph towards a target node for a given input query and a source node (e.g., knowledge graph reasoning). We propose a new method called ReinforceWalk, which consists of a deep recurrent neural network (RNN) and Monte Carlo Tree Search (MCTS). The RNN encodes the history of observations and maps it into the Q-value, the policy, and the state value. The MCTS is combined with the RNN policy to generate trajectories with more positive rewards, overcoming the sparse reward problem. The RNN policy is then updated in an off-policy manner from these trajectories. ReinforceWalk repeats these steps to learn the policy. At test time, the MCTS is again combined with the RNN to predict the target node with higher accuracy. Experimental results show that we are able to learn better policies from fewer rollouts compared to other methods, which are mainly based on the policy gradient method.
Neural Program Search: Solving Programming Tasks from Description and Examples
We present Neural Program Search, an algorithm to generate programs from a natural language description and a small number of input/output examples. The algorithm combines methods from the Deep Learning and Program Synthesis fields by designing a rich domain-specific language (DSL) and defining an efficient search algorithm over it, guided by a Seq2Tree model. To evaluate the quality of the approach, we also present a semi-synthetic dataset of descriptions with test examples and corresponding programs. We show that our algorithm significantly outperforms a sequence-to-sequence model with attention baseline.
Minimally Redundant Laplacian Eigenmaps
Spectral algorithms for learning low-dimensional data manifolds have largely been supplanted by deep learning methods in recent years. One reason is that classic spectral manifold learning methods often learn collapsed embeddings that do not fill the embedding space. We show that this is a natural consequence of data where different latent dimensions have dramatically different scaling in observation space. We present a simple extension of Laplacian Eigenmaps to fix this problem based on choosing embedding vectors which are both orthogonal and \textit{minimally redundant} to other dimensions of the embedding. In experiments on NORB and similarity-transformed faces we show that Minimally Redundant Laplacian Eigenmap (MR-LEM) significantly improves the quality of embedding vectors over Laplacian Eigenmaps, accurately recovers the latent topology of the data, and discovers many disentangled factors of variation of comparable quality to state-of-the-art deep learning methods.
Clustering Meets Implicit Generative Models
Clustering is a cornerstone of unsupervised learning which can be thought as disentangling multiple generative mechanisms underlying the data. In this paper we introduce an algorithmic framework to train mixtures of implicit generative models which we particularize for variational autoencoders. Relying on an additional set of discriminators, we propose a competitive procedure in which the models only need to approximate the portion of the data distribution from which they can produce realistic samples. As a byproduct, each model is simpler to train, and a clustering interpretation arises naturally from the partitioning of the training points among the models. We empirically show that our approach splits the training distribution in a reasonable way and increases the quality of the generated samples.
Rethinking Style and Content Disentanglement in Variational Autoencoders
A common test for whether a generative model learns disentangled representations is its ability to learn style and content as independent factors of variation on digit datasets. To achieve such disentanglement with variational autoencoders, the label information is often provided in either a fully-supervised or semi-supervised fashion. We show, however, that the variational objective is insufficient in explaining the observed style and content disentanglement. Furthermore, we present an empirical framework to systematically evaluate the disentanglement behavior of our models. We show that the encoder and decoder independently favor disentangled representations and that this tendency depends on the implicit regularization by stochastic gradient descent.
Compression by the signs: distributed learning is a two-way street
Training large neural networks requires distributing learning over multiple workers. The rate limiting step is often in sending gradients from workers to parameter server and back again. We present signSGD with majority vote: the first gradient compression scheme to achieve 1-bit compression of worker-server communication in both directions with non-vacuous theoretical guarantees. To achieve this, we build an extensive theory of sign-based optimisation, which is also relevant to understanding adaptive gradient methods like Adam and RMSprop. We prove that signSGD can get the best of both worlds: compressed gradients and SGD-level convergence rate. signSGD can exploit mismatches between L1 and L2 geometry: when noise and curvature are much sparser than the gradients, signSGD is expected to converge at the same rate or faster than full-precision SGD. Measurements of the L1 versus L2 geometry of real networks support our theoretical claims, and we find that the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep Imagenet models.
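A toy sketch of the communication pattern, with assumed shapes: each worker transmits only the sign of its gradient, and the server broadcasts back the sign of the coordinate-wise vote, so both directions use one bit per coordinate.

```python
import numpy as np

rng = np.random.default_rng(0)
worker_grads = rng.normal(0.05, 1.0, size=(7, 10))   # 7 workers, 10 params

worker_signs = np.sign(worker_grads)                 # worker -> server: 1 bit
vote = np.sign(worker_signs.sum(axis=0))             # server -> worker: 1 bit

lr = 0.01
theta = np.zeros(10)
theta -= lr * vote                                   # descend along the vote
```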
Comparing Fixed and Adaptive Computation Time for Recurrent Neural Networks
Deep networks commonly perform better than shallow ones, but allocating the proper amount of computation for each particular input sample remains an open problem. This issue is particularly challenging in sequential tasks, where the required complexity may vary for different tokens in the input sequence. Adaptive Computation Time (ACT) was proposed as a method for dynamically adapting the computation at each step for Recurrent Neural Networks (RNNs). ACT introduces two main modifications to the regular RNN formulation: (1) more than one RNN step may be executed between the time an input sample is fed to the layer and the time the layer generates an output, and (2) this number of steps is dynamically predicted depending on the input token and the hidden state of the network. In our work, we aim at gaining intuition about the contribution of these two factors to the overall performance boost observed when augmenting RNNs with ACT. We design a new baseline, Repeat-RNN, which performs a constant number of RNN state updates, larger than one, before generating an output. Surprisingly, such a uniform distribution of the computational resources matches the performance of ACT in the studied tasks. We hope that this finding motivates new research efforts towards designing RNN architectures that are able to dynamically allocate computational resources.
Jointly Learning "What" and "How" from Instructions and Goal-States
Training agents to follow instructions requires some way of rewarding them for behavior which accomplishes the intent of the instruction. For non-trivial instructions, which may be either underspecified or contain some ambiguity, it can be difficult or impossible to specify a reward function or obtain relatable expert trajectories for the agent to imitate. For these scenarios, we introduce a method which requires only pairs of instructions and examples of positive goal states, from which we can jointly learn a model of the instruction-conditional reward and a policy which executes instructions. Two sets of experiments in a gridworld compare the effectiveness of our method to that of RL when a reward function can be specified, and the application of our method when no reward function is defined. We furthermore evaluate the generalization of our approach to unseen instructions, and to scenarios where environment dynamics change outside of training, requiring fine-tuning of the policy ``in the wild''.
Evaluating visual "common sense" using fine-grained classification and captioning tasks
We introduce the Something-something V2 dataset, which contains captions of finely-varying human-object interactions. We also discuss various baseline models, and show that neural networks show surprisingly strong performance on many of the very hard, detailed discrimination tasks associated with this dataset.
Deep learning mutation prediction enables early stage lung cancer detection in liquid biopsy
Somatic cancer mutation detection at ultra-low variant allele frequencies (VAFs) is an unmet challenge that is intractable with current state-of-the-art mutation calling methods. Specifically, the limit of VAF detection is closely related to the depth of coverage, due to the requirement of multiple supporting reads in extant methods, precluding the detection of mutations at VAFs that are orders of magnitude lower than the depth of coverage. Nevertheless, the ability to detect cancer-associated mutations at ultra-low VAFs is a fundamental requirement for low-tumor-burden cancer diagnostics applications such as early detection, monitoring, and therapy nomination using liquid biopsy methods (cell-free DNA). Here we define a spatial representation of sequencing information adapted for a convolutional architecture that enables variant detection at low VAFs in a manner independent of the depth of sequencing. This method enables the detection of cancer mutations at VAFs as low as 10^-4, more than two orders of magnitude below the current state of the art. We validated our method both on simulated plasma and on clinical cfDNA plasma samples from cancer patients and non-cancer controls. This method introduces a new domain within bioinformatics and personalized medicine – somatic whole-genome mutation calling for liquid biopsy.
Universal Successor Representations for Transfer Reinforcement Learning
The objective of transfer reinforcement learning is to generalize from a set of previous tasks to unseen new tasks. In this work, we focus on the transfer scenario where the dynamics among tasks are the same, but their goals differ. Although general value functions (Sutton et al., 2011) have been shown to be useful for knowledge transfer, learning a universal value function can be challenging in practice. To address this, we propose (1) to use universal successor representations (USR) to represent the transferable knowledge and (2) a USR approximator (USRA) that can be trained by interacting with the environment. Our experiments show that USR can be effectively applied to new tasks, and the agent initialized by the trained USRA can achieve the goal considerably faster than with random initialization.
Deep Neural Maps
We introduce a new unsupervised representation learning and visualization method using deep convolutional networks and self-organizing maps, called Deep Neural Maps (DNM). DNM jointly learns an embedding of the input data and a mapping from the embedding space to a two-dimensional lattice. We compare visualizations of DNM with those of t-SNE and LLE on the MNIST and COIL-20 data sets. Our experiments show that DNM can learn efficient representations of the input data, which reflect the characteristics of each class. This is shown via back-projecting the neurons of the map onto the data space.
A Flexible Approach to Automated RNN Architecture Generation
The process of designing neural architectures requires expert knowledge and extensive trial and error. While automated architecture search may simplify these requirements, the recurrent neural network (RNN) architectures generated by existing methods are limited in both flexibility and components. We propose a domain-specific language (DSL) for use in automated architecture search which can produce novel RNNs of arbitrary depth and width. The DSL is flexible enough to define standard architectures such as the Gated Recurrent Unit and Long Short Term Memory and allows the introduction of non-standard RNN components such as trigonometric curves and layer normalization. Using two different candidate generation techniques, random search with a ranking function and reinforcement learning, we explore the novel architectures produced by the RNN DSL for language modeling and machine translation domains. The resulting architectures do not follow human intuition yet perform well on their targeted tasks, suggesting the space of usable RNN architectures is far larger than previously assumed.
A differentiable BLEU loss. Analysis and first results
In natural language generation tasks, like neural machine translation and image captioning, there is usually a mismatch between the optimized loss and the de facto evaluation criterion, namely token-level maximum likelihood versus corpus-level BLEU score. This article tries to reduce this gap by defining differentiable computations of the BLEU and GLEU scores. We test this approach on simple tasks, obtaining valuable lessons about its potential applications but also about its pitfalls, mainly that these loss functions push each token in the hypothesis sequence toward the average of the tokens in the reference, resulting in a poor training signal.
Stacked Filters Stationary Flow For Hardware-Oriented Acceleration Of Deep Convolutional Neural Networks
To address memory and computation resource limitations for hardware-oriented acceleration of deep convolutional neural networks (CNNs), we present a computation flow, stacked filters stationary flow (SFS), and a corresponding data encoding format, relative indexed compressed sparse filter format (CSF), to make the best of data sparsity and simplify data handling at execution time. Compared with the state-of-the-art result (Han et al., 2016b), our methods achieve a 1.11x improvement in reducing the storage required by AlexNet and a 1.09x improvement in reducing the storage required by SqueezeNet, without loss of accuracy on the ImageNet dataset. Moreover, using these approaches, chip area for the logic handling irregular sparse data accesses can be saved. Compared with the 2D-SIMD processor structures in DVAS, ENVISION, etc., our methods achieve about a 3.65x improvement in processing element (PE) array utilization rate (from 26.4% to 96.5%), using the data from Deep Compression on AlexNet.
Stable Distribution Alignment Using the Dual of the Adversarial Distance
Methods that align distributions by minimizing an adversarial distance between them have recently achieved impressive results. However, these approaches are difficult to optimize with gradient descent and they often do not converge well without careful hyperparameter tuning and proper initialization. We investigate whether turning the adversarial min-max problem into an optimization problem by replacing the maximization part with its dual improves the quality of the resulting alignment and explore its connections to Maximum Mean Discrepancy. Our empirical results suggest that using the dual formulation for the restricted family of linear discriminators results in a more stable convergence to a desirable solution when compared with the performance of a primal min-max GAN-like objective and an MMD objective under the same restrictions. We test our hypothesis on the problem of aligning two synthetic point clouds on a plane and on a real-image domain adaptation problem on digits. In both cases, the dual formulation yields an iterative procedure that gives more stable and monotonic improvement over time.
Learning Longer-term Dependencies in RNNs with Auxiliary Losses
We present a simple method to improve learning of long-term dependencies in recurrent neural networks (RNNs) by introducing unsupervised auxiliary losses. These auxiliary losses force RNNs to either remember the distant past or predict the future, enabling truncated backpropagation through time (BPTT) to work on very long sequences. We experiment on sequences up to 16000 tokens long and report faster training, better resource efficiency and better test performance than full BPTT baselines such as Long Short Term Memory (LSTM) networks or the Transformer.
IamNN: Iterative and Adaptive Mobile Neural Network for efficient image classification
Deep residual networks (ResNets) marked a recent breakthrough in deep learning. The core idea of ResNets is to have shortcut connections between layers that allow the network to be much deeper while still being easy to optimize, avoiding vanishing gradients. These shortcut connections have interesting properties that make ResNets behave differently from other typical network architectures. In this work we use these properties to design a network based on a ResNet, but with parameter sharing and adaptive computation time. The resulting network is much smaller than the original network and can adapt its computational cost to the complexity of the input image.
Concept Learning with Energy-Based Models
We believe that many hallmarks of human intelligence, such as generalizing from limited experience, abstract reasoning and planning, analogical reasoning, creative problem solving, and capacity for language require the ability to consolidate experience into concepts, which act as basic building blocks of understanding and reasoning. We present a framework that defines a concept by an energy function over events in the environment, as well as an attention mask over entities participating in the event. Given a few demonstration events, our method uses an inference-time optimization procedure to generate events involving similar concepts or identify entities involved in the concept. We evaluate our framework on learning visual, quantitative, compositional, and relational concepts from demonstration events in an unsupervised manner. Our approach is able to successfully generate and identify concepts in a few-shot setting as well as transfer learned concepts between domains.
Cold Fusion: Training Seq2Seq Models Together with Language Models
Sequence-to-sequence (Seq2Seq) models with attention have excelled at tasks which involve generating natural language sentences, such as machine translation, image captioning and speech recognition. Performance has further been improved by leveraging unlabeled data, often in the form of a language model. In this work, we present the Cold Fusion method, which leverages a pre-trained language model during training, and show its effectiveness on the speech recognition task. We show that Seq2Seq models with Cold Fusion are able to better utilize language information, enjoying (i) faster convergence and better generalization, and (ii) almost complete transfer to a new domain while using less than 10% of the labeled training data.
Understanding the Loss Surface of Single-Layered Neural Networks for Binary Classification
It is widely conjectured that the reason training algorithms for neural networks are successful is that all local minima lead to similar performance; for example, see (LeCun et al., 2015; Choromanska et al., 2015; Dauphin et al., 2014). Performance is typically measured in terms of two metrics: training performance and generalization performance. Here we focus on the training performance of single-layered neural networks for binary classification, and provide conditions under which the training error is zero at all local minima of a smooth hinge loss function. Our conditions are roughly of the following form: the neurons have to be strictly convex, and the surrogate loss function should be a smooth version of the hinge loss. We also provide counterexamples to show that when the loss function is replaced with the quadratic loss or the logistic loss, the result may not hold.
Challenges in Disentangling Independent Factors of Variation
We study the problem of building models that disentangle independent factors of variation. Such models encode features that can efficiently be used for classification and to transfer attributes between different images in image synthesis. As data we use a weakly labeled training set, where labels indicate what single factor has changed between two data samples, although the relative value of the change is unknown. This labeling is of particular interest as it may be readily available without annotation costs. We introduce an autoencoder model and train it through constraints on image pairs and triplets. We show the role of feature dimensionality and adversarial training theoretically and experimentally. We formally prove the existence of the reference ambiguity, which is inherently present in the disentangling task when weakly labeled data is used. The numerical value of a factor has different meaning in different reference frames. When the reference depends on other factors, transferring that factor becomes ambiguous. We demonstrate experimentally that the proposed model can successfully transfer attributes on several datasets, but show also cases when the reference ambiguity occurs.
Isolating Sources of Disentanglement in Variational Autoencoders
We decompose the evidence lower bound (ELBO) to show the existence of a total correlation term between latents. This motivates our beta-TCVAE (Total Correlation Variational Autoencoder), a refinement of the state-of-the-art beta-VAE for learning disentangled representations without supervision. We further propose a principled classifier-free measure of disentanglement called the Mutual Information Gap (MIG). We show a strong relationship between total correlation and disentanglement.
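For reference, the decomposition alluded to above is commonly written as follows, in notation assumed here rather than quoted from the paper: the aggregate posterior KL term of the ELBO splits into index-code mutual information, total correlation (TC), and dimension-wise KL, and beta-TCVAE upweights only the TC term.

```latex
\mathbb{E}_{p(x)}\left[\mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)\right]
  = \underbrace{I_q(x; z)}_{\text{index-code MI}}
  + \underbrace{\mathrm{KL}\Big(q(z)\,\Big\|\,\prod\nolimits_j q(z_j)\Big)}_{\text{total correlation}}
  + \underbrace{\sum\nolimits_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big)}_{\text{dimension-wise KL}}
```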
No Spurious Local Minima in a Two Hidden Unit ReLU Network
Deep learning models can be efficiently optimized via stochastic gradient descent, but there is little theoretical evidence to support this. A key question in optimization is to understand when the optimization landscape of a neural network is amenable to gradient-based optimization. We focus on a simple two-layer ReLU network with two hidden units, and show that all local minimizers are global. Combined with recent work of Lee et al. (2017) and Lee et al. (2016), this shows that gradient descent converges to the global minimizer.
Analyzing and Exploiting NARX Recurrent Neural Networks for Long-Term Dependencies
Recurrent neural networks (RNNs) have achieved state-of-the-art performance on many diverse tasks, from machine translation to surgical activity recognition, yet training RNNs to capture long-term dependencies remains difficult. To date, the vast majority of successful RNN architectures alleviate this problem using nearly-additive connections between states, as introduced by long short-term memory (LSTM). We take an orthogonal approach and introduce MIST RNNs, a NARX RNN architecture that allows direct connections from the very distant past. We show that MIST RNNs 1) exhibit superior vanishing-gradient properties in comparison to LSTM and previously-proposed NARX RNNs; 2) are far more efficient than previously-proposed NARX RNN architectures, requiring even fewer computations than LSTM; and 3) improve performance substantially over LSTM and Clockwork RNNs on tasks requiring very long-term dependencies.
Multiple Source Domain Adaptation with Adversarial Learning
While domain adaptation has been actively researched in recent years, most theoretical results and algorithms focus on the single-source-single-target adaptation setting. Naive application of such algorithms on multiple source domain adaptation problem may lead to suboptimal solutions. We propose a new generalization bound for domain adaptation when there are multiple source domains with labeled instances and one target domain with unlabeled instances. Compared with existing bounds, the new bound does not require expert knowledge about the target distribution, nor the optimal combination rule for multisource domains. Interestingly, our theory also leads to an efficient learning strategy using adversarial neural networks: we show how to interpret it as learning feature representations that are invariant to the multiple domain shifts while still being discriminative for the learning task. To this end, we propose two models, both of which we call multisource domain adversarial networks (MDANs): the first model optimizes directly our bound, while the second model is a smoothed approximation of the first one, leading to a more data-efficient and task-adaptive model. The optimization tasks of both models are minimax saddle point problems that can be optimized by adversarial training. To demonstrate the effectiveness of MDANs, we conduct extensive experiments showing superior adaptation performance on three real-world datasets: sentiment analysis, digit classification, and vehicle counting.
Adapting to Continuously Shifting Domains
Domain adaptation typically focuses on adapting a model from a single source domain to a target domain. However, in practice, this paradigm of adapting from one source to one target is limiting, as different aspects of the real world such as illumination and weather conditions vary continuously and cannot be effectively captured by two static domains. Approaches that attempt to tackle this problem by adapting from a single source to many different target domains simultaneously are consistently unable to learn across all domain shifts. Instead, we propose an adaptation method that exploits the continuity between gradually varying domains by adapting in sequence from the source to the most similar target domain. By incrementally adapting while simultaneously efficiently regularizing against prior examples, we obtain a single strong model capable of recognition within all observed domains.
Coupled Ensembles of Neural Networks
We present coupled ensembles of neural networks, a reconfiguration of existing neural network models into parallel branches. We empirically show that this modification leads to results on CIFAR and SVHN that are competitive with the state of the art, with a greatly reduced parameter count. Additionally, for a fixed parameter or training-time budget, coupled ensembles are significantly better than single-branch models. Preliminary results on ImageNet are also promising.
The Effectiveness of a Two-Layer Neural Network for Recommendations
We present a personalized recommender system that uses a neural network to recommend products such as eBooks, audiobooks, mobile apps, video and music. It produces recommendations based on a customer's implicit feedback history, such as purchases, listens or watches. Our key contribution is to formulate the recommendation problem as a model that encodes historical behavior to predict future behavior using a soft data split, combining predictor and auto-encoder models. We introduce a convolutional layer for learning the importance (time decay) of purchases depending on their purchase date, and demonstrate that the shape of the time decay function can be well approximated by a parametric function. We present offline experimental results showing that neural networks with two hidden layers can capture seasonality changes and, at the same time, outperform other modeling techniques, including our recommender in production. Most importantly, we demonstrate that our model can be scaled to all digital categories, and we observe significant improvements in an online A/B test. We also discuss key enhancements to the neural network model and describe our production pipeline. Finally, we open-source our deep learning library, which supports multi-GPU model-parallel training. This is an important feature for building neural network based recommenders with large input and output dimensionality.
Practical Hyperparameter Optimization
Recently, the bandit-based strategy Hyperband (HB) was shown to yield good hyperparameter settings of deep neural networks faster than vanilla Bayesian optimization (BO). However, for larger budgets, HB is limited by its random search component, and BO works better. We propose to combine the benefits of both approaches to obtain a new practical state-of-the-art hyperparameter optimization method, which we show to consistently outperform both HB and BO on a range of problem types, including feed-forward neural networks, Bayesian neural networks, and deep reinforcement learning. Our method is robust and versatile, while at the same time being conceptually simple and easy to implement.
Decoding Decoders: Finding Optimal Representation Spaces for Unsupervised Similarity Tasks
Experimental evidence indicates that simple models outperform complex deep networks on many unsupervised similarity tasks. We provide a simple yet rigorous explanation for this behaviour by introducing the concept of an optimal representation space, in which semantically close symbols are mapped to representations that are close under a similarity measure induced by the model’s objective function. In addition, we present a straightforward procedure that, without any retraining or architectural modifications, allows deep recurrent models to perform equally well (and sometimes better) when compared to shallow models. To validate our analysis, we conduct a set of consistent empirical evaluations and introduce several new sentence embedding models in the process. Even though this work is presented within the context of natural language processing, the insights are readily applicable to other domains that rely on distributed representations for transfer tasks.
Automated Design Using Neural Networks and Gradient Descent
We propose a novel method that uses deep neural networks and gradient descent to perform automated design on complex real-world engineering tasks. Our approach works by training a neural network to mimic the fitness function of a design optimization task and then, exploiting the differentiable nature of the neural network, performing gradient descent to maximize the fitness. We demonstrate this method's effectiveness by designing an optimized heat sink and both 2D and 3D airfoils that maximize the lift-drag ratio under steady-state flow conditions. We highlight that our method has two distinct benefits over other automated design approaches. First, evaluating the neural network's prediction of fitness can be orders of magnitude faster than simulating the system of interest. Second, using gradient descent allows the design space to be searched much more efficiently than with gradient-free methods. These two strengths work together to overcome some of the current shortcomings of automated design.
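A toy sketch of the two-stage procedure follows, with a synthetic quadratic "fitness" standing in for the simulator; all sizes, hyperparameters, and the stand-in objective are made up for illustration.

```python
import torch

torch.manual_seed(0)
true_fitness = lambda d: -((d - 0.7) ** 2).sum(dim=-1)  # stand-in simulator

# Stage 1: train a surrogate network to mimic the fitness function.
designs = torch.rand(256, 5)
surrogate = torch.nn.Sequential(
    torch.nn.Linear(5, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    pred = surrogate(designs).squeeze(-1)
    ((pred - true_fitness(designs)) ** 2).mean().backward()
    opt.step()

# Stage 2: gradient ascent on the design variables through the frozen surrogate.
for p in surrogate.parameters():
    p.requires_grad_(False)
design = torch.rand(1, 5, requires_grad=True)
d_opt = torch.optim.Adam([design], lr=5e-2)
for _ in range(200):
    d_opt.zero_grad()
    (-surrogate(design)).sum().backward()  # ascend on predicted fitness
    d_opt.step()
print(design.detach())  # should move toward the optimum at 0.7
```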
Depth separation and weight-width trade-offs for sigmoidal neural networks
Distributional Adversarial Networks
In most current formulations of adversarial training, the discriminators can be expressed as single-input operators, that is, the mapping they define is separable over observations. In this work, we argue that this property might help explain the infamous mode collapse phenomenon in adversarially-trained generative models. Inspired by discrepancy measures and two-sample tests between probability distributions, we propose distributional adversaries that operate on samples, i.e., on sets of multiple points drawn from a distribution, rather than on single observations. We show how they can be easily implemented on top of existing models. Various experimental results show that generators trained in combination with our distributional adversaries are much more stable and are remarkably less prone to mode collapse than traditional models trained with observation-wise prediction discriminators. In addition, the application of our framework to domain adaptation results in strong improvement over baselines.
SpectralWords: Spectral Embeddings Approach to Word Similarity Task for Large Vocabularies
In this paper we show how recent advances in spectral clustering using the Bethe Hessian operator can be used to learn dense word representations. We propose an algorithm, SpectralWords, that achieves performance comparable to the state of the art on word similarity tasks for medium-size vocabularies and can be superior on datasets with larger vocabularies.
Black-box Attacks on Deep Neural Networks via Gradient Estimation
In this paper, we propose novel Gradient Estimation black-box attacks to generate adversarial examples with query access to the target model's class probabilities, which do not rely on transferability. We also propose strategies to decouple the number of queries required to generate each adversarial example from the dimensionality of the input. An iterative variant of our attack achieves close to 100% attack success rates for both targeted and untargeted attacks on DNNs. We show that the proposed Gradient Estimation attacks outperform all other black-box attacks we tested on both MNIST and CIFAR-10 datasets, achieving attack success rates similar to well known, state-of-the-art white-box attacks. We also apply the Gradient Estimation attacks successfully against a real-world content moderation classifier hosted by Clarifai.
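A hedged sketch of finite-difference gradient estimation, the core primitive behind such attacks, follows. The `query_probs` function below is a made-up stand-in for query access to a target model's class probabilities, and the naive two-queries-per-dimension scheme shown is what the paper's query-reduction strategies would improve upon.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(10, 3))

def query_probs(x):  # black box: softmax of a hidden linear model
    z = np.exp(x @ w - (x @ w).max())
    return z / z.sum()

def estimated_grad(x, label, delta=1e-3):
    loss = lambda v: -np.log(query_probs(v)[label] + 1e-12)
    g = np.zeros_like(x)
    for i in range(x.size):  # two queries per input dimension
        e = np.zeros_like(x); e[i] = delta
        g[i] = (loss(x + e) - loss(x - e)) / (2 * delta)
    return g

x = rng.normal(size=10)
x_adv = x + 0.1 * np.sign(estimated_grad(x, label=0))  # FGSM-style step
```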
Covariant Compositional Networks For Learning Graphs
Most existing neural networks for learning graphs deal with the issue of permutation invariance by conceiving of the network as a message passing scheme, where each node sums the feature vectors coming from its neighbors. We argue that this imposes a limitation on their representation power, and instead propose a new general architecture for representing objects consisting of a hierarchy of parts, which we call Covariant Compositional Networks (CCNs). Here covariance means that the activation of each neuron must transform in a specific way under permutations, similarly to steerability in CNNs. We achieve covariance by making each activation transform according to a tensor representation of the permutation group, and derive the corresponding tensor aggregation rules that each neuron must implement. Experiments show that CCNs can outperform competing methods on some standard graph learning benchmarks.
Diversity-Driven Exploration Strategy for Deep Reinforcement Learning
Efficient exploration remains a challenging research problem in reinforcement learning, especially when an environment contains large state spaces, deceptive local optima, or sparse rewards. To tackle this problem, we present a diversity-driven approach for exploration, which can be easily combined with both off- and on-policy reinforcement learning algorithms. We show that by simply adding a distance measure to the loss function, the proposed methodology significantly enhances an agent's exploratory behavior, thus preventing the policy from being trapped in local optima. We further propose an adaptive scaling method for stabilizing the learning process. Our experimental results on Atari 2600 show that our method outperforms baseline approaches in several tasks in terms of mean scores and exploration efficiency.
Learning Invariances for Policy Generalization
While recent progress has spawned very powerful machine learning systems, those agents remain extremely specialized and fail to transfer the knowledge they gain to similar yet unseen tasks. In this paper, we study a simple reinforcement learning problem and focus on learning policies that encode the proper invariances for generalization to different settings. We evaluate three potential methods for policy generalization: data augmentation, meta-learning and adversarial training. We find our data augmentation method to be effective, and study the potential of meta-learning and adversarial learning as alternative task-agnostic approaches.
Additive Margin Softmax for Face Verification
In this paper, we propose a conceptually simple and geometrically interpretable objective function, i.e. additive margin Softmax (AM-Softmax), for deep face verification. In general, the face verification task can be viewed as a metric learning problem, so learning large-margin face features whose intra-class variation is small and inter-class difference is large is of great importance in order to achieve good performance. Recently, Large-margin Softmax and Angular Softmax have been proposed to incorporate the angular margin in a multiplicative manner. In this work, we introduce a novel additive angular margin for the Softmax loss, which is intuitively appealing and more interpretable than the existing works. We also emphasize and discuss the importance of feature normalization in the paper. Most importantly, our experiments on LFW and MegaFace show that our additive margin softmax loss consistently performs better than the current state-of-the-art methods using the same network architecture and training dataset.
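A short sketch of the additive-margin softmax loss as described: features and class weights are L2-normalized so the logits are cosines, the margin m is subtracted from the target-class cosine, and a scale s sharpens the distribution. The values of s and m below are common illustrative choices, not prescribed by this summary.

```python
import torch
import torch.nn.functional as F

def am_softmax_loss(features, weights, labels, s=30.0, m=0.35):
    cos = F.normalize(features) @ F.normalize(weights).T  # cosine logits
    margin = torch.zeros_like(cos)
    margin[torch.arange(len(labels)), labels] = m         # target class only
    return F.cross_entropy(s * (cos - margin), labels)

feats = torch.randn(8, 128)                 # batch of face embeddings
W = torch.randn(100, 128)                   # class weights for 100 identities
loss = am_softmax_loss(feats, W, torch.randint(0, 100, (8,)))
```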
Wasserstein Auto-Encoders: Latent Dimensionality and Random Encoders
We study the role of latent space dimensionality in Wasserstein auto-encoders (WAEs). Through experimentation on synthetic and real datasets, we argue that random encoders should be preferred over deterministic encoders.
Weighted Geodesic Distance Following Fermat's Principle
We propose a density-based estimator for weighted geodesic distances suitable for data lying on a manifold of lower dimension than ambient space and sampled from a possibly nonuniform distribution. After discussing its properties and implementation, we evaluate its performance as a tool for clustering tasks. A discussion on the consistency of the estimator is also given.
DiCE: The Infinitely Differentiable Monte-Carlo Estimator
The score function estimator is widely used for estimating gradients of stochastic objectives in Stochastic Computation Graphs (SCGs), e.g., in reinforcement learning and meta-learning. While deriving the first-order gradient estimators by differentiating a surrogate loss (SL) objective is computationally and conceptually simple, using the same approach for higher-order gradients is more challenging. Firstly, analytically deriving and implementing such estimators is laborious and not compliant with automatic differentiation. Secondly, repeatedly applying SL to construct new objectives for each order of gradient involves increasingly cumbersome graph manipulations. Lastly, to match the first-order gradient under differentiation, SL treats part of the cost as a fixed sample, which we show leads to missing and wrong terms for higher-order gradient estimators. To address all these shortcomings in a unified way, we introduce DiCE, which provides a single objective that can be differentiated repeatedly, generating correct gradient estimators of any order in SCGs. Unlike SL, DiCE relies on automatic differentiation to perform the requisite graph manipulations. We verify the correctness of DiCE both through a proof and through numerical evaluation of the DiCE gradient estimates. We also use DiCE to propose and evaluate a novel approach for multi-agent learning. Our code is available at https://goo.gl/xkkGxN.
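The single differentiable objective rests on what the paper calls the MagicBox operator; a minimal sketch of it on a toy stochastic computation graph follows (the Bernoulli example and all values are illustrative).

```python
import torch

# magic_box(x) evaluates to 1 in the forward pass, but differentiating it
# reproduces the score-function term, so a cost multiplied by
# magic_box(sum of log-probs of the stochastic nodes it depends on)
# can be differentiated any number of times.
def magic_box(log_probs):
    return torch.exp(log_probs - log_probs.detach())

theta = torch.tensor(0.4, requires_grad=True)
sample = torch.tensor(1.0)  # suppose a Bernoulli(theta) draw came out as 1
log_prob = sample * torch.log(theta) + (1 - sample) * torch.log(1 - theta)
objective = magic_box(log_prob) * sample  # cost weighted by the DiCE factor

grad1 = torch.autograd.grad(objective, theta, create_graph=True)[0]
grad2 = torch.autograd.grad(grad1, theta)[0]  # higher orders come for free
print(grad1.item(), grad2.item())
```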
Graph Partition Neural Networks for Semi-Supervised Classification
We present graph partition neural networks (GPNN), an extension of graph neural networks (GNNs) able to handle extremely large graphs. GPNNs alternate between locally propagating information between nodes in small subgraphs and globally propagating information between the subgraphs. To efficiently partition graphs, we experiment with spectral partitioning and also propose a modified multi-seed flood fill for fast processing of large scale graphs. We extensively test our model on a variety of semi-supervised node classification tasks. Experimental results indicate that GPNNs are either superior or comparable to state-of-the-art methods on a wide variety of datasets for graph-based semi-supervised classification. We also show that GPNNs can achieve similar performance as standard GNNs with fewer propagation steps.
Learning Disentangled Representations with Wasserstein Auto-Encoders
We apply Wasserstein auto-encoders (WAEs) to the problem of disentangled representation learning. We highlight the potential of WAEs with promising results on a benchmark disentanglement task.
Learning How Not to Act in Text-based Games
Large action spaces impede an agent's ability to learn, especially when many of the actions are redundant or irrelevant. This is especially prevalent in text-based domains. We present an action-elimination architecture which combines the generalization power of Deep Reinforcement Learning with the natural language capabilities of NLP architectures to eliminate unnecessary actions and solve quests in the text-based game Zork, significantly outperforming the baseline agents.
Aspect-based Question Generation
Asking questions is an important ability for a chatbot. Although there is existing work on question generation from a piece of descriptive text, it remains a very challenging problem. In this paper, we consider a new question generation problem which also requires the input of a target aspect in addition to a piece of descriptive text. The key motivation for this new problem is the finding from practical applications that useful questions need to be targeted toward some relevant aspect; one almost never asks a random question in a conversation. Since it is often possible to ask many types of questions about a given descriptive text, generating a question without knowing what it should be about is of limited use. To solve this problem, we propose a novel neural network which is able to generate aspect-based questions. One major advantage of this model is that it can be trained directly on a question-answering corpus without requiring any additional annotations, such as annotating aspects in the questions or answers. Experimental results show that our proposed model outperforms state-of-the-art question generation methods.
Time-Dependent Representation for Neural Event Sequence Prediction
Existing sequence prediction methods are mostly concerned with time-independent sequences, in which the actual time span between events is irrelevant and the distance between events is simply the difference between their positions in the sequence. While this time-independent view is appropriate for data such as natural language, e.g., words in a sentence, it is inappropriate and inefficient for many real-world events that are observed and collected at unequally spaced points in time as they naturally arise, e.g., when a person goes to a grocery store or makes a phone call. The time span between events can carry important information about the sequential dependence of human behaviors. In this work, we propose a set of methods for using time in sequence prediction. Because neural sequence models such as RNNs are more amenable to token-like input, we propose two methods for time-dependent event representation, based on intuitions about how time is tokenized in everyday life and on previous work on embedding contextualization. We also introduce two methods for using next-event duration as regularization when training a sequence prediction model. We discuss these methods in the context of recurrent neural nets. We evaluate these methods, as well as baseline models, on five datasets that resemble a variety of sequence prediction tasks. The experiments reveal that the proposed methods offer accuracy gains over baseline models in a range of settings.
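Although the abstract does not spell out the exact representation, the "time tokenization" intuition can be sketched as bucketing inter-event gaps on a log scale and embedding the bucket index. Everything below (class name, bucket scheme, dimensions) is our illustrative choice, not the authors' scheme:

```python
import torch
import torch.nn as nn

class TimeBucketEmbedding(nn.Module):
    """Maps inter-event time gaps to learned embeddings via log-scale buckets."""
    def __init__(self, num_buckets=32, dim=64):
        super().__init__()
        self.num_buckets = num_buckets
        self.emb = nn.Embedding(num_buckets, dim)

    def forward(self, gaps_seconds):
        # Log-scale bucketing: short gaps get fine-grained tokens,
        # long gaps are grouped coarsely.
        idx = torch.log1p(gaps_seconds).long().clamp(0, self.num_buckets - 1)
        return self.emb(idx)

time_emb = TimeBucketEmbedding()
gaps = torch.tensor([3.0, 60.0, 3600.0, 86400.0])  # seconds between events
vecs = time_emb(gaps)   # (4, 64); e.g., added to the event embeddings fed to an RNN
```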
Searching for Activation Functions
The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, f(x) = x * sigmoid(beta * x), which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.
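The discovered function is simple enough to state directly in code; here is a drop-in PyTorch version of Swish exactly as defined in the abstract:

```python
import torch

def swish(x, beta=1.0):
    # f(x) = x * sigmoid(beta * x); beta = 1 recovers the commonly used form,
    # and beta may also be made a learnable parameter.
    return x * torch.sigmoid(beta * x)
```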
DNA-GAN: Learning Disentangled Representations from Multi-Attribute Images
Disentangling factors of variation has become a very challenging problem in representation learning. Existing algorithms suffer from many limitations, such as unpredictable disentangling factors, poor quality of images generated from encodings, and a lack of identity information. In this paper, we propose a supervised learning model called DNA-GAN which tries to disentangle different factors or attributes of images. The latent representations of images are DNA-like, in which each individual piece of the encoding represents an independent factor of variation. By annihilating the recessive piece and swapping a certain piece of one latent representation with that of another, we obtain two different representations which can be decoded into two images in which the presence of the corresponding attribute is changed. In order to obtain realistic images as well as disentangled representations, we further introduce a discriminator for adversarial training. Experiments on the Multi-PIE and CelebA datasets demonstrate that our proposed method is effective at disentangling factors and even overcomes certain limitations of existing methods.
Beyond Finite Layer Neural Networks: Bridging Deep Architectures and Numerical Differential Equations
Deep neural networks have become the state-of-the-art models in numerous machine learning tasks. However, general guidance for network architecture design is still missing. In our work, we bridge deep neural network design with numerical differential equations. We show that many effective networks, such as ResNet, PolyNet, FractalNet and RevNet, can be interpreted as different numerical discretizations of differential equations. This finding brings us a brand new perspective on the design of effective deep architectures: we can take advantage of the rich knowledge in numerical analysis to guide us in designing new and potentially more effective deep networks. As an example, we propose a linear multi-step architecture (LM-architecture), which is inspired by the linear multi-step method for solving ordinary differential equations. The LM-architecture is an effective structure that can be used on any ResNet-like network. In particular, we demonstrate that LM-ResNet and LM-ResNeXt (i.e., the networks obtained by applying the LM-architecture to ResNet and ResNeXt respectively) can achieve noticeably higher accuracy than ResNet and ResNeXt on both CIFAR and ImageNet with comparable numbers of trainable parameters. Moreover, on both CIFAR and ImageNet, LM-ResNet/LM-ResNeXt can significantly compress (>50%) the original networks while maintaining similar performance. This can be explained mathematically using the concept of modified equations from numerical analysis. Last but not least, we also establish a connection between stochastic control and noise injection in the training process, which helps to improve the generalization of the networks. Furthermore, by relating the stochastic training strategy to stochastic dynamical systems, we can easily apply stochastic training to networks with the LM-architecture. As an example, we introduce stochastic depth to LM-ResNet and achieve significant improvement over the original LM-ResNet on CIFAR10.
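To make the linear multi-step idea concrete: a standard residual block computes x_{n+1} = x_n + f(x_n), the forward Euler scheme; a plausible two-step variant is x_{n+1} = (1 - k) x_n + k x_{n-1} + f(x_n) with a learnable scalar k. The sketch below is our simplification of the LM-architecture; consult the paper for the exact parameterization:

```python
import torch
import torch.nn as nn

class LMBlock(nn.Module):
    """Two-step residual update inspired by linear multi-step ODE solvers."""
    def __init__(self, residual_fn):
        super().__init__()
        self.f = residual_fn                  # any ResNet-style residual branch
        self.k = nn.Parameter(torch.zeros(1)) # learnable multi-step coefficient

    def forward(self, x_curr, x_prev):
        x_next = (1 - self.k) * x_curr + self.k * x_prev + self.f(x_curr)
        return x_next, x_curr                 # shift the two-step history

# One simple initialization of the history is x_{-1} = x_0:
block = LMBlock(nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16)))
x0 = torch.randn(8, 16)
x1, x0 = block(x0, x0)
```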
NAM - Unsupervised Cross-Domain Image Mapping without Cycles or GANs
Several methods have recently been proposed for Unsupervised Domain Mapping, the task of translating images between domains without prior knowledge of correspondences. Current approaches suffer from training instability due to their reliance on GANs, which are powerful but highly sensitive to hyper-parameters and prone to mode collapse. In addition, most methods rely heavily on "cycle" relationships between the domains, which enforce a one-to-one mapping. In this work, we introduce an alternative method, NAM, which relies on a pre-trained generative model of the source domain and aligns each target image with an image sampled from the source distribution while jointly optimizing the domain mapping function. Experiments are presented validating the effectiveness of our method.
Efficient Entropy For Policy Gradient with Multi-Dimensional Action Space
This paper considers the entropy bonus, which is used to encourage exploration in policy gradient methods. For high-dimensional action spaces, calculating the entropy and its gradient requires enumerating all the actions in the action space and running forward and backpropagation for each action, which may be computationally infeasible. We develop several novel unbiased estimators for the entropy bonus and its gradient. We apply these estimators to several models for parameterized policies, including Independent Sampling, CommNet, Autoregressive with Modified MDP, and Autoregressive with LSTM. Finally, we test our algorithms on a multi-hunter multi-rabbit grid environment. The results show that our entropy estimators substantially improve performance with marginal additional computational cost.
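For intuition, a generic sampling-based estimator of the entropy bonus avoids enumerating the action space: since H(pi) = E_{a~pi}[-log pi(a)], the negative log-probability of sampled actions is an unbiased estimate. This PyTorch sketch illustrates only that generic principle, not the paper's refined estimators for specific policy classes:

```python
import torch

def entropy_bonus_estimate(dist, num_samples=4):
    # Unbiased Monte Carlo estimate of H(pi) = E[-log pi(a)] from samples.
    # Note: estimating the *gradient* of the entropy additionally requires
    # a score-function term, which the paper's estimators handle.
    actions = dist.sample((num_samples,))
    return -dist.log_prob(actions).mean()

dist = torch.distributions.Categorical(logits=torch.randn(1024))
print(entropy_bonus_estimate(dist))   # close to dist.entropy() in expectation
```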
3D-Scene-GAN: Three-dimensional Scene Reconstruction with Generative Adversarial Networks
Three-dimensional (3D) reconstruction is a vital and challenging research topic in advanced computer graphics and computer vision due to its intrinsic complexity and computational cost. Existing methods often produce holes, distortions, and obscured parts in the reconstructed 3D models, which are not adequate for real usage. The focus of this paper is to achieve high-quality 3D reconstruction of complicated scenes by adopting Generative Adversarial Networks (GANs). We propose a novel workflow, namely 3D-Scene-GAN, which can iteratively improve any raw 3D reconstructed model consisting of meshes and textures. 3D-Scene-GAN is a weakly semi-supervised model. It takes only real-time 2D observation images as supervision, and does not rely on prior knowledge of shape models or any reference observations. Finally, through qualitative and quantitative experiments, 3D-Scene-GAN shows compelling advantages over the state-of-the-art methods: balanced rank estimation (BRE) scores are improved by 30%-100% on the ICL-NUIM dataset and 36%-190% on the SUN3D dataset, and the mean distance error (MDR) also outperforms other state-of-the-art methods on these benchmarks.
Iterative GANs for Rotating Visual Objects
We are interested in learning visual representations which allow for 3D manipulations of visual objects based on a single 2D image. We cast this as an image-to-image transformation task, and propose Iterative Generative Adversarial Networks (IterGANs) to learn a visual representation that can be used for objects seen in training, but also for unseen objects. Since object manipulation requires a full understanding of the geometry and appearance of the object, our IterGANs learn an implicit 3D model and a full appearance model of the object, which are both inferred from a single (test) image. Moreover, the intermediate images generated by IterGANs can be used by additional loss functions to increase the quality of all generated images without the need for additional supervision. Experiments on rotated objects show how IterGANs help with the generation process.
Training Shallow and Thin Networks for Acceleration via Knowledge Distillation with Conditional Adversarial Networks
There is increasing interest in accelerating neural networks for real-time applications. We study the student-teacher strategy, in which a small and fast student network is trained with auxiliary information learned from a large and accurate teacher network. We propose to use conditional adversarial networks to learn the loss function for transferring knowledge from teacher to student. Experiments on three different image datasets show that the student network gains a performance boost with the proposed training strategy.
Ensemble Robustness and Generalization of Stochastic Deep Learning Algorithms
The question of why deep learning algorithms generalize so well has attracted increasing research interest. However, most of the well-established approaches, such as hypothesis capacity, stability or sparseness, have not provided complete explanations (Zhang et al., 2016; Kawaguchi et al., 2017). In this work, we focus on the robustness approach (Xu & Mannor, 2012): if the error of a hypothesis will not change much due to perturbations of its training examples, then it will also generalize well. As most deep learning algorithms are stochastic (e.g., Stochastic Gradient Descent, Dropout, and Bayes-by-backprop), we revisit the robustness arguments of Xu & Mannor and introduce a new approach – ensemble robustness – that concerns the robustness of a population of hypotheses. Through the lens of ensemble robustness, we reveal that a stochastic learning algorithm can generalize well as long as its sensitivity to adversarial perturbations is bounded on average over training examples. Moreover, an algorithm may be sensitive to some adversarial examples (Goodfellow et al., 2015) but still generalize well. To support our claims, we provide extensive simulations for different deep learning algorithms and different network architectures exhibiting a strong correlation between ensemble robustness and the ability to generalize.
In reinforcement learning, all objective functions are not equal
Scalable Estimation via LSH Samplers (LSS)
The softmax function has multiple applications in large-scale machine learning. However, calculating the partition function is a major bottleneck for large state spaces. In this paper, we propose a new sampling scheme using locality-sensitive hashing (LSH) and an unbiased estimator that approximates the partition function accurately in sub-linear time. The samples are correlated and unnormalized, but the derived estimator is unbiased. We demonstrate the significant advantages of our proposal by comparing the speed and accuracy of LSH-Based Samplers (LSS) against other state-of-the-art estimation techniques.
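The generic principle behind such estimators is importance sampling: Z = sum_i exp(logit_i) can be estimated without a full pass as an average of exp(logit_i)/q(i) over samples i ~ q. The NumPy sketch below shows only this generic construction with a uniform proposal; the paper's contribution is drawing from an LSH-induced proposal in sub-linear time:

```python
import numpy as np

def partition_estimate(logits, proposal_probs, num_samples=64, rng=np.random):
    # Unbiased importance-sampling estimate of Z = sum_i exp(logits[i]):
    # Z_hat = mean over sampled i of exp(logits[i]) / q(i).
    n = len(logits)
    idx = rng.choice(n, size=num_samples, p=proposal_probs)
    return np.mean(np.exp(logits[idx]) / proposal_probs[idx])

logits = np.random.randn(100000)
q = np.full(100000, 1e-5)        # uniform proposal, purely for illustration
print(partition_estimate(logits, q))
```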
Convolutional Sequence Modeling Revisited
Although both convolutional and recurrent architectures have a long history in sequence prediction, the current "default" mindset in much of the deep learning community is that generic sequence modeling is best handled using recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should a practitioner use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. In particular, the models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We further show that the potential "infinite memory" advantage that RNNs have over TCNs is largely absent in practice: TCNs indeed exhibit longer effective history sizes than their recurrent counterparts. As a whole, we argue that it may be time to (re)consider ConvNets as the default "go-to" architecture for sequence modeling.
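For reference, the generic TCN building block evaluated in such comparisons is a causal, dilated 1D convolution with a residual connection; the minimal PyTorch sketch below omits details like weight normalization and dropout used in full implementations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvBlock(nn.Module):
    """Causal dilated 1D convolution with a residual connection."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation   # left-pad only: no future leakage
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, time)
        out = self.conv(F.pad(x, (self.pad, 0)))
        return torch.relu(out) + x                 # residual connection

# Stacking blocks with dilations 1, 2, 4, ... grows the receptive field
# exponentially, which is the source of the long effective memory.
net = nn.Sequential(*[CausalConvBlock(16, dilation=2 ** i) for i in range(4)])
y = net(torch.randn(8, 16, 100))                   # output shape: (8, 16, 100)
```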
Easing non-convex optimization with neural networks
LSH-Sampling Breaks the Computational Chicken-and-Egg Loop in Adaptive Stochastic Gradient Estimation
Stochastic Gradient Descent (SGD) is the most popular optimization algorithm for large-scale problems. SGD estimates the gradient by uniform sampling with sample size one. Several other works suggest faster epoch-wise convergence by using weighted non-uniform sampling for better gradient estimates. Unfortunately, the per-iteration cost of maintaining this adaptive distribution for gradient estimation is more than that of calculating the full gradient. As a result, the false impression of faster convergence in iterations leads to slower convergence in time, which we call a chicken-and-egg loop. In this paper, we break this barrier by providing the first demonstration of a sampling scheme that leads to superior gradient estimation while keeping the sampling cost per iteration similar to that of uniform sampling. Such an algorithm is possible due to the sampling view of Locality Sensitive Hashing (LSH), which came to light recently. As a consequence of superior and fast estimation, we reduce the running time of all existing gradient descent algorithms. We demonstrate the benefits of our proposal on both SGD and AdaGrad.
Feature-Based Metrics for Exploring the Latent Space of Generative Models
Several recent papers have treated the latent space of deep generative models, e.g., GANs or VAEs, as Riemannian manifolds. The argument is that operations such as interpolation are better done along geodesics that minimize path length not in the latent space but in the output space of the generator. However, this implicitly assumes that some simple metric such as L2 is meaningful in the output space, even though it is well known that for, e.g., semantic comparison of images it is woefully inadequate. In this work, we consider imposing an arbitrary metric on the generator’s output space and show both theoretically and experimentally that a feature-based metric can produce much more sensible interpolations than the usual L2 metric. This observation leads to the conclusion that analysis of latent space geometry would benefit from using a suitable, explicitly defined metric.
Autoregressive Generative Adversarial Networks
Generative Adversarial Networks (GANs) learn a generative model by playing an adversarial game between a generator and an auxiliary discriminator, which classifies data samples vs. generated ones. However, this formulation does not explicitly model feature co-occurrences in samples. In this paper, we propose a novel Autoregressive Generative Adversarial Network (ARGAN) that models the latent distribution of data using an autoregressive model, rather than relying on binary classification of samples into data/generated categories. In this way, feature co-occurrences in samples can be captured more efficiently. Our model was evaluated on two widely used datasets, CIFAR-10 and STL-10, and its performance is competitive with other GAN models both quantitatively and qualitatively.
DLVM: A modern compiler infrastructure for deep learning systems
Deep learning software demands reliability and performance. However, many of the existing deep learning frameworks are software libraries that act as an unsafe DSL in Python and a computation graph interpreter. We present DLVM, a design and implementation of a compiler infrastructure with a linear algebra intermediate representation, algorithmic differentiation by adjoint code generation, domain-specific optimizations and a code generator targeting GPU via LLVM. Designed as a modern compiler infrastructure inspired by LLVM, DLVM is more modular and more generic than existing deep learning compiler frameworks, and supports tensor DSLs with high expressivity. With our prototypical staged DSL embedded in Swift, we argue that the DLVM system enables a form of modular, safe and performant frameworks for deep learning.
Reconstructing evolutionary trajectories of mutations in cancer
We present a new method, TrackSig, to estimate evolutionary trajectories in cancer. Our method represents cancer evolution in terms of mutational signatures -- multinomial distributions over mutation types. TrackSig infers an approximate order in which mutations accumulated in the cancer genome, and then fits the signatures to the mutation time series. We assess TrackSig's reconstruction accuracy using simulations. We find a 1.9% median discrepancy between estimated mixtures and ground truth. The size of the signature change is consistent in 87% of cases and the direction of change is consistent in 95% of cases. The code is available at https://github.com/YuliaRubanova/TrackSig.
DeepNCM: Deep Nearest Class Mean Classifiers
In this paper we introduce DeepNCM, a Nearest Class Mean classification method enhanced to directly learn highly non-linear deep (visual) representations of the data. To overcome the computationally expensive process of recomputing the class means after every update of the representation, we opt to approximate the class means with an online estimate. Moreover, to allow the class means to closely follow the drifting representation, we introduce per-epoch mean condensation. Using online class means with condensation, DeepNCM can train efficiently on large datasets. Our experimental results indicate that DeepNCM performs on par with SoftMax-optimised networks.
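A minimal sketch of the online class-mean idea follows; the exact decay schedule and the per-epoch mean condensation step are described in the paper, and the momentum form here is our illustrative choice:

```python
import torch

def update_class_means(means, features, labels, momentum=0.9):
    # Online estimate of the per-class means in the current representation.
    for c in labels.unique():
        batch_mean = features[labels == c].mean(dim=0)
        means[c] = momentum * means[c] + (1 - momentum) * batch_mean
    return means

def ncm_predict(means, features):
    # Nearest class mean rule: assign each sample to the closest class mean.
    return torch.cdist(features, means).argmin(dim=1)

means = torch.zeros(10, 64)                       # 10 classes, 64-d features
feats, labels = torch.randn(32, 64), torch.randint(0, 10, (32,))
means = update_class_means(means, feats, labels)
preds = ncm_predict(means, feats)
```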
Accelerating Neural Architecture Search using Performance Prediction
Methods for neural network hyperparameter optimization and meta-modeling are computationally expensive due to the need to train a large number of model configurations. In this paper, we show that standard frequentist regression models can predict the final performance of partially trained model configurations using features based on network architectures, hyperparameters, and time series validation performance data. We empirically show that our performance prediction models are much more effective than prominent Bayesian counterparts, are simpler to implement, and are faster to train. Our models can predict final performance in both visual classification and language modeling domains, are effective for predicting performance of drastically varying model architectures, and can even generalize between model classes. Using these prediction models, we also propose an early stopping method for hyperparameter optimization and meta-modeling, which obtains a speedup of up to 6x in both hyperparameter optimization and meta-modeling. Finally, we empirically show that our early stopping method can be seamlessly incorporated into both reinforcement learning-based architecture selection algorithms and bandit-based search methods. Through extensive experimentation, we empirically show that our performance prediction models and early stopping algorithm are state-of-the-art in terms of prediction accuracy and speedup achieved while still identifying the optimal model configurations.
Intriguing Properties of Adversarial Examples
It is becoming increasingly clear that many machine learning classifiers are vulnerable to adversarial examples. In attempting to explain the origin of adversarial examples, previous studies have typically focused on the facts that neural networks operate on high-dimensional data, that they overfit, or that they are too linear. Here we show that distributions of logit differences have a universal functional form. This functional form is independent of architecture, dataset, and training protocol, and it does not change during training. This leads to adversarial error having a universal scaling, as a power-law, with respect to the size of the adversarial perturbation. We show that this universality holds for a broad range of datasets (MNIST, CIFAR10, ImageNet, and random data), models (including state-of-the-art deep networks, linear models, adversarially trained networks, and networks trained on randomly shuffled labels), and attacks (FGSM, step l.l., PGD). Motivated by these results, we study the effects of reducing prediction entropy on adversarial robustness. Finally, we study the effect of network architectures on adversarial sensitivity. To do this, we use neural architecture search with reinforcement learning to find adversarially robust architectures on CIFAR10. Our resulting architecture is more robust to white \emph{and} black box attacks compared to previous attempts.
Building Generalizable Agents with a Realistic and Rich 3D Environment
Teaching an agent to navigate in an unseen 3D environment is a challenging task, even in simulated environments. To generalize to unseen environments, an agent needs to be robust to low-level variations (e.g. color, texture, object changes) as well as high-level variations (e.g. layout changes of the environment). To improve overall generalization, all types of variation in the environment have to be taken into consideration via different levels of data augmentation. To this end, we propose House3D, a rich, extensible and efficient environment that contains 45,622 human-designed 3D scenes of visually realistic houses, ranging from single-room studios to multi-storied houses, equipped with a diverse set of fully labeled 3D objects, textures and scene layouts, based on the SUNCG dataset (Song et al., 2017). The diversity in House3D opens the door towards scene-level augmentation, while the label-rich nature of House3D enables us to inject pixel- and task-level augmentations such as domain randomization (Tobin et al., 2017) and multi-task training. Using a subset of houses in House3D, we show that reinforcement learning agents trained with different levels of augmentation perform much better in unseen environments than our baselines with raw RGB input, by over 8% in terms of navigation success rate. House3D is publicly available at http://github.com/facebookresearch/House3D.
Censoring Representations with Multiple-Adversaries over Random Subspaces
Adversarial feature learning (AFL) has been successfully applied to censor the representations of neural networks; for example, AFL can help learn anonymized representations that avoid privacy issues by constraining the representations with adversarial gradients that confuse external discriminators trying to discern and extract sensitive information from the activations. In this paper, we propose an ensemble approach to the design of the discriminator, based on the intuition that the discriminator needs to be robust for AFL to succeed. Empirical validation on three user-anonymization tasks shows that our proposed method achieves state-of-the-art performance on all three datasets without significantly harming the utility of the data. We also provide initial theoretical results on the generalization error of the adversarial gradients, which suggest that the accuracy of the discriminator is not the deciding factor in its design.
Rotational Unit of Memory
The concepts of unitary evolution matrices and associative memory have boosted the field of Recurrent Neural Networks (RNNs) to state-of-the-art performance in a variety of sequential tasks. However, RNNs still have a limited capacity to manipulate long-term memory. To bypass this weakness, the most successful applications of RNNs use external techniques such as attention mechanisms. In this paper we propose a novel RNN model that unifies the state-of-the-art approaches: Rotational Unit of Memory (RUM). The core of RUM is its rotational operation, which is, naturally, a unitary matrix, providing architectures with the power to learn long-term dependencies by overcoming the vanishing and exploding gradients problem. Moreover, the rotational unit also serves as associative memory. We evaluate our model on synthetic memorization, question answering and language modeling tasks. RUM learns the Copying Memory task completely and improves the state-of-the-art result in the Recall task. RUM's performance in the bAbI Question Answering task is comparable to that of models with attention mechanisms. We also improve the state-of-the-art result to 1.189 bits-per-character (BPC) on the Character Level Penn Treebank (PTB) task, which signifies the applicability of RUM to real-world sequential data. The universality of our construction, at the core of RNNs, establishes RUM as a promising approach to language modeling, speech recognition and machine translation.
Causal Discovery Using Proxy Variables
In this paper, we develop a framework to estimate the cause-effect relation between two static entities x and y: for instance, an art masterpiece x and its fraudulent copy y. To this end, we introduce the notion of proxy variables, which allow the construction of a pair of random entities (A,B) from the pair of static entities (x,y). Then, estimating the cause-effect relation between A and B using an observational causal discovery algorithm leads to an estimation of the cause-effect relation between x and y. We evaluate our framework in vision and language.
Faster Discovery of Neural Architectures by Searching for Paths in a Large Model
We propose Efficient Neural Architecture Search (ENAS), a faster and less expensive approach to automated model design than previous methods. In ENAS, a controller learns to discover neural network architectures by searching for an optimal path within a larger model. The controller is trained with policy gradient to select a path that maximizes the expected reward on the validation set. Meanwhile, the model corresponding to the selected path is trained to minimize the cross entropy loss. On the Penn Treebank dataset, ENAS discovers a novel architecture that achieves a test perplexity of 57.8, which is state-of-the-art among automatic model design methods on Penn Treebank. On the CIFAR-10 dataset, ENAS can design novel architectures that achieve a test error of 2.89%, close to the 2.65% achieved by standard NAS (Zoph et al., 2017). Most importantly, our experiments show that ENAS is more than 10x faster and 100x less resource-demanding than NAS.
PixelSNAIL: An Improved Autoregressive Generative Model
Efficient Recurrent Neural Networks using Structured Matrices in FPGAs
The Mirage of Action-Dependent Baselines in Reinforcement Learning
Model-free reinforcement learning with flexible function approximators has shown success in goal-directed sequential decision-making problems. Policy gradient methods are a widely used class of stable model-free algorithms and typically, a state-dependent baseline or control variate is necessary to reduce the gradient estimator variance. Several recent papers extend the baseline to depend on both the state and action, and suggest that this enables significant variance reduction and improved sample efficiency without introducing bias into the gradient estimates. To better understand this development, we decompose the variance of the policy gradient estimator and numerically show that learned state-action-dependent baselines do not in fact reduce variance over a state-dependent baseline in the commonly tested benchmark domains. We confirm this unexpected result by reviewing the open-source code accompanying these prior papers, and show that subtle implementation decisions cause deviations from the methods presented in the papers and explain the sources of the previously observed empirical gains.
Systematic Weight Pruning of DNNs using Alternating Direction Method of Multipliers
We present a systematic weight pruning framework for deep neural networks (DNNs) using the alternating direction method of multipliers (ADMM). We first formulate the weight pruning problem of DNNs as a constrained nonconvex optimization problem, and then adopt the ADMM framework for systematic weight pruning. We show that ADMM is highly suitable for weight pruning due to the computational efficiency it offers. We achieve a much higher compression ratio compared with prior work while maintaining the same test accuracy, together with a faster convergence rate.
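Schematically, each ADMM iteration alternates a gradient step on the task loss with a quadratic penalty, a projection onto the sparsity constraint, and a dual update. The sketch below is our generic rendering of that pattern, not the paper's exact formulation:

```python
import torch

def admm_prune_step(W, Z, U, loss_grad, sparsity, rho=1e-3, lr=1e-2):
    # 1) W-update: gradient step on L(W) + (rho/2) ||W - Z + U||^2,
    #    where loss_grad(W) computes dL/dW for the task loss.
    W = W - lr * (loss_grad(W) + rho * (W - Z + U))
    # 2) Z-update: Euclidean projection of W + U onto the sparsity constraint,
    #    i.e. zero out the smallest-magnitude fraction of the weights.
    V = W + U
    k = max(1, int(sparsity * V.numel()))          # number of weights to zero
    thresh = V.abs().flatten().kthvalue(k).values
    Z = torch.where(V.abs() > thresh, V, torch.zeros_like(V))
    # 3) Dual variable update.
    U = U + W - Z
    return W, Z, U
```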
Spectral Capsule Networks
In search of more accurate predictive models, we customize capsule networks for the learning-to-diagnose problem. We also propose Spectral Capsule Networks, a novel variation of capsule networks that converges faster than capsule networks with EM routing. Spectral capsule networks consist of spatial coincidence filters that detect entities based on the alignment of extracted features along a one-dimensional linear subspace. Experiments on a public learning-to-diagnose benchmark not only show the success of capsule networks on this task, but also confirm the faster convergence of spectral capsule networks.
One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning
Humans and animals are capable of learning a new behavior by observing others perform the skill just once. We consider the problem of allowing a robot to do the same -- learning from raw video pixels of a human, even when there is substantial domain shift in the perspective, environment, and embodiment between the robot and the observed human. Prior approaches to this problem have hand-specified how human and robot actions correspond and often relied on explicit human pose detection systems. In this work, we present an approach for one-shot learning from a video of a human by using human and robot demonstration data from a variety of previous tasks to build up prior knowledge through meta-learning. Then, combining this prior knowledge and only a single video demonstration from a human, the robot can perform the task that the human demonstrated. We show experiments on a PR2 arm, demonstrating that after meta-learning, the robot can learn to place, push, and pick-and-place new objects using just one video of a human performing the manipulation.
Benefits of Depth for Long-Term Memory of Recurrent Networks
The key attribute that drives the unprecedented success of modern Recurrent Neural Networks (RNNs) on learning tasks involving sequential data is their ever-improving ability to model intricate long-term temporal dependencies. However, a well-established measure of RNNs' long-term memory capacity is lacking, and thus formal understanding of their ability to correlate data throughout time is limited. Though depth efficiency in convolutional networks is well established by now, it does not suffice to account for the success of deep RNNs on inputs of varying lengths, and the need to address their 'time-series expressive power' arises. In this paper, we analyze the effect of depth on the ability of recurrent networks to express correlations ranging over long time-scales. To meet the above need, we introduce a measure of the information flow across time that can be supported by the network, referred to as the Start-End separation rank. Essentially, this measure reflects the distance of the function realized by the recurrent network from a function that models no interaction whatsoever between the beginning and end of the input sequence. We prove that deep recurrent networks support Start-End separation ranks which are exponentially higher than those supported by their shallow counterparts. Moreover, we show that the ability of deep recurrent networks to correlate different parts of the input sequence increases exponentially as the input sequence extends, while that of vanilla shallow recurrent networks does not adapt to the sequence length at all. Thus, we establish that depth brings forth an overwhelming advantage in the ability of recurrent networks to model long-term dependencies, and provide an exemplar of quantifying this key attribute which may be readily extended to other RNN architectures of interest, e.g. variants of LSTM networks. We obtain our results by considering a class of recurrent networks referred to as Recurrent Arithmetic Circuits (RACs), which merge the hidden state with the input via the Multiplicative Integration operation.
Exponentially vanishing sub-optimal local minima in multilayer neural networks
Background: Statistical mechanics results (Dauphin et al. (2014); Choromanska et al. (2015)) suggest that local minima with high error are exponentially rare in high dimensions. However, to prove low error guarantees for Multilayer Neural Networks (MNNs), previous works so far required either a heavily modified MNN model or training method, strong assumptions on the labels (e.g., “near” linear separability), or an unrealistically wide hidden layer with \Omega(N) units.
Results: We examine an MNN with one hidden layer of piecewise linear units, a single output, and a quadratic loss. We prove that, with high probability in the limit of N \rightarrow \infty datapoints, the volume of differentiable regions of the empirical loss containing sub-optimal differentiable local minima is exponentially vanishing in comparison with the same volume of global minima, given standard normal input of dimension d_0 = \tilde{\Omega}(\sqrt{N}) and a more realistic number of d_1 = \tilde{\Omega}(N/d_0) hidden units. We demonstrate our results numerically: for example, 0% binary classification training error on CIFAR with only N/d_0 = 16 hidden neurons.
Towards Variational Generation of Small Graphs
In this paper we propose a generative model for graphs formulated as a variational autoencoder. We sidestep hurdles associated with the linearization of graphs by having the decoder output a probabilistic fully-connected graph of a predefined maximum size in a single step. We evaluate on the challenging task of molecule generation.
PPP-Net: Platform-aware Progressive Search for Pareto-optimal Neural Architectures
Recent breakthroughs in Neural Architecture Search (NAS) have achieved state-of-the-art performance in many applications such as image recognition. However, these techniques typically ignore platform-related constraints (e.g., inference time and power consumption) that can be critical for portable devices with limited computing resources. We propose PPP-Net: a multi-objective architectural search framework to automatically generate networks that achieve Pareto optimality. PPP-Net employs a compact search space inspired by operations used in state-of-the-art mobile CNNs. PPP-Net also adopts the progressive search strategy used in recent literature (Liu et al., 2017a). Experimental results demonstrate that PPP-Net achieves better performance in both (a) higher accuracy and (b) shorter inference time, compared to the state-of-the-art CondenseNet.
PDE-Net: Learning PDEs from Data
Partial differential equations (PDEs) play a prominent role in many disciplines such as applied mathematics, physics, chemistry, material science, and computer science. PDEs are commonly derived based on physical laws or empirical observations. However, the governing equations for many complex systems in modern applications are still not fully known. With the rapid development of sensors, computational power, and data storage in the past decade, huge quantities of data can be easily collected and efficiently stored. Such vast quantities of data offer new opportunities for data-driven discovery of hidden physical laws. Inspired by the latest developments in neural network design, we propose a new feed-forward deep network, called PDE-Net, to fulfill two objectives at the same time: to accurately predict the dynamics of complex systems and to uncover the underlying hidden PDE models. The basic idea of the proposed PDE-Net is to learn differential operators by learning convolution kernels (filters), and to apply neural networks or other machine learning methods to approximate the unknown nonlinear responses. Compared with existing approaches, which either assume that the form of the nonlinear response is known or fix certain finite difference approximations of differential operators, our approach has the most flexibility by learning both the differential operators and the nonlinear responses. A special feature of the proposed PDE-Net is that all filters are properly constrained, which enables us to easily identify the governing PDE models while still maintaining the expressive and predictive power of the network. These constraints are carefully designed by fully exploiting the relation between the orders of differential operators and the orders of sum rules of filters (an important concept originating from wavelet theory). We also discuss relations of the PDE-Net to some existing networks in computer vision such as Network-In-Network (NIN) and Residual Neural Networks (ResNet). Numerical experiments show that the PDE-Net has the potential to uncover the hidden PDE of the observed dynamics and to predict the dynamical behavior for a relatively long time, even in a noisy environment.
Gradient-based Optimization of Neural Network Architecture
Neural networks can learn relevant features from data, but their predictive accuracy and propensity to overfit are sensitive to the values of the discrete hyperparameters that specify the network architecture (number of hidden layers, number of units per layer, etc.). Previous work optimized these hyperparameters via grid search, random search, and black-box optimization techniques such as Bayesian optimization. Bolstered by recent advances in gradient-based optimization of discrete stochastic objectives, we instead propose to directly model a distribution over possible architectures and use variational optimization to jointly optimize the network architecture and weights in one training pass. We discuss an implementation of this approach that estimates gradients via the Concrete relaxation, and show that it finds compact and accurate architectures for convolutional neural networks applied to the CIFAR10 and CIFAR100 datasets.
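For reference, the Concrete (Gumbel-Softmax) relaxation mentioned above replaces a discrete architecture choice with a differentiable soft one-hot sample, so architecture logits can be trained by backpropagation. A generic sketch follows; the option set and temperature below are illustrative:

```python
import torch
import torch.nn.functional as F

def concrete_sample(logits, temperature=0.5):
    # Concrete / Gumbel-Softmax relaxation: a differentiable "soft one-hot"
    # sample over discrete options, allowing gradient-based optimization of
    # the architecture distribution alongside the network weights.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits)))
    return F.softmax((logits + gumbel) / temperature, dim=-1)

# E.g., a soft choice among candidate layer widths (64, 128, or 256 units):
width_logits = torch.zeros(3, requires_grad=True)
weights = concrete_sample(width_logits)   # differentiable w.r.t. width_logits
```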
Regret Minimization for Partially Observable Deep Reinforcement Learning
Deep reinforcement learning algorithms that estimate state and state-action value functions have been shown to be effective in a variety of challenging domains, including learning control strategies from raw image pixels. However, algorithms that estimate state and state-action value functions typically assume a fully observed state and must compensate for partial or non-Markovian observations by using finite-length frame-history observations or recurrent networks. In this work, we propose a new deep reinforcement learning algorithm based on counterfactual regret minimization that iteratively updates an approximation to a cumulative clipped advantage function and is robust to partially observed state. We demonstrate that on several partially observed reinforcement learning tasks, this new class of algorithms can substantially outperform strong baseline methods: on Pong with single-frame observations, and on the challenging Doom (ViZDoom) and Minecraft (Malmö) first-person navigation benchmarks.
Adaptive Path-Integral Approach for Representation Learning and Planning
We present a novel framework for representation learning that builds a low-dimensional latent dynamical model from high-dimensional sequential raw data, e.g., video. The framework builds upon recent advances in amortized inference that construct a fully-differentiable network, and takes advantage of the duality between control and inference to solve the intractable inference problem using the path integral control approach. We also present an efficient planning method that exploits the learned low-dimensional latent dynamics.
GeoSeq2Seq: Information Geometric Sequence-to-Sequence Networks
The Fisher information metric is an important foundation of information geometry, allowing us to approximate the local geometry of a probability distribution. Recurrent neural networks such as Sequence-to-Sequence (Seq2Seq) networks, which have lately been used to yield state-of-the-art performance on speech translation and image captioning, have so far ignored the geometry of the latent embedding that they iteratively learn. We propose the information geometric Seq2Seq (GeoSeq2Seq) network, which bridges the gap between deep recurrent neural networks and information geometry. Specifically, the latent embedding offered by a recurrent network is encoded as the Fisher kernel of a parametric Gaussian Mixture Model, a formalism common in computer vision. We use such a network to predict the shortest routes between two nodes of a graph by learning the adjacency matrix with the GeoSeq2Seq formalism; our results show that for this problem the probabilistic representation of the latent embedding outperforms the non-probabilistic embedding by 10-15%.
Meta-Learning a Dynamical Language Model
We consider the task of word-level language modeling and study the possibility of combining hidden-states-based short-term representations with medium-term representations encoded in the dynamical weights of a language model. Our work extends recent experiments on language models with dynamically evolving weights by casting the language modeling problem into an online learning-to-learn framework in which a meta-learner is trained by gradient descent to continuously update the weights of a language model.
Reinforcement Learning from Imperfect Demonstrations
Robust real-world learning should benefit from both demonstrations and interaction with the environment. Current approaches to learning from demonstration and reward perform supervised learning on expert demonstration data and use reinforcement learning to further improve performance based on reward from the environment. These tasks have divergent losses which are difficult to jointly optimize; further, such methods can be very sensitive to noisy demonstrations. We propose a unified reinforcement learning algorithm, Normalized Actor-Critic (NAC), that effectively normalizes the Q-function, reducing the Q-values of actions unseen in the demonstration data. NAC learns an initial policy network from demonstration and refines the policy in a real environment. Crucially, both learning from demonstration and interactive refinement use exactly the same objective, unlike prior approaches that combine distinct supervised and reinforcement losses. This makes NAC robust to suboptimal demonstration data, since the method is not forced to mimic all of the examples in the dataset. We show that our unified reinforcement learning algorithm can learn robustly and outperform existing baselines when evaluated on several realistic driving games.
Adaptive Memory Networks
We present Adaptive Memory Networks (AMN), which process input-question pairs to dynamically construct a network architecture optimized for lower inference times. AMN creates multiple memory banks to store entities from the input story in order to answer the questions. The model learns to reason about important entities from the input text based on the question, and concentrates these entities within a single memory bank. At inference, one or a few banks are used, creating a tradeoff between accuracy and performance. AMN is enabled by, first, a novel bank controller that makes discrete decisions with high accuracy and, second, the capabilities of dynamic frameworks (such as PyTorch) that allow for dynamic network sizing and efficient variable mini-batching. In our results, we demonstrate that our model learns to construct a varying number of memory banks based on task complexity and achieves faster inference times on standard and modified bAbI tasks. We solve all bAbI tasks with an average of 48% fewer entities on tasks containing excess, unrelated information.
Online variance-reducing optimization
We emphasize the importance of variance reduction in stochastic methods and propose a probabilistic interpretation as a way to store information about past gradients. The resulting algorithm is very similar to the momentum method, with the difference that the weight over past gradients depends on the distance moved in parameter space rather than the number of steps.
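One way to realize "weights that depend on distance moved" is to decay the accumulated gradient by a function of the step length rather than by a fixed momentum coefficient. The sketch below uses an exponential decay in the step norm purely as an illustration; the paper's exact update rule may differ:

```python
import torch

def distance_weighted_step(theta, velocity, grad, lr=0.1, length_scale=1.0):
    # Like momentum, but the weight on past gradients decays with the
    # distance moved in parameter space, not with the number of steps.
    step = lr * (velocity + grad)
    decay = torch.exp(-step.norm() / length_scale)  # illustrative decay choice
    velocity = decay * (velocity + grad)
    theta = theta - step
    return theta, velocity

theta, velocity = torch.randn(10), torch.zeros(10)
grad = torch.randn(10)                               # stand-in for a stochastic gradient
theta, velocity = distance_weighted_step(theta, velocity, grad)
```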
Learning to Infer
Inference models, which replace an optimization-based inference procedure with a learned model, have been fundamental in advancing Bayesian deep learning, the most notable example being variational auto-encoders (VAEs). In this paper, we propose iterative inference models, which learn how to optimize a variational lower bound through repeatedly encoding gradients. Our approach generalizes VAEs under certain conditions, and by viewing VAEs in the context of iterative inference, we provide further insight into several recent empirical findings. We demonstrate the inference optimization capabilities of iterative inference models, explore unique aspects of these models, and show that they outperform standard inference models on typical benchmark data sets.
Parametric Adversarial Divergences are Good Task Losses for Generative Modeling
Generative modeling of high dimensional data like images is a notoriously difficult and ill-defined problem. In particular, how to evaluate a learned generative model is unclear. In this paper, we argue that adversarial learning, pioneered with generative adversarial networks (GANs), provides an interesting framework to implicitly define more meaningful task losses for unsupervised tasks, such as generating "visually realistic" images. By relating GANs and structured prediction under the framework of statistical decision theory, we highlight links between recent advances in structured prediction theory and the choice of the divergence in GANs. We argue that the insights about "hard" and "easy" to learn losses can be analogously extended to adversarial divergences. We also discuss the attractive properties of parametric adversarial divergences for generative modeling, and perform experiments to show the importance of choosing a divergence that reflects the final task.
An Experimental Study of Neural Networks for Variable Graphs
Graph-structured data such as social networks, functional brain networks, and chemical molecules have sparked interest in generalizing deep learning techniques to graph domains. In this work, we propose an empirical study of neural networks for graphs with variable size and connectivity. We rigorously compare several graph recurrent neural networks (RNNs) and graph convolutional neural networks (ConvNets) on two fundamental and representative graph problems, subgraph matching and graph clustering. Numerical results show that graph ConvNets are 3-17% more accurate and 1.5-4x faster than graph RNNs. Interestingly, graph ConvNets are also 36% more accurate than non-learning (variational) techniques. The benefit of this study is to show that complex architectures like LSTMs are not useful in the context of graph neural networks; instead, one should favour architectures with minimal inner structure, such as locality, weight sharing, index invariance, multi-scale aggregation, gates and residuality, to design efficient novel neural network models for applications like drug design, gene analysis and particle physics.
Learning to Organize Knowledge with N-Gram Machines
Deep neural networks (DNNs) have had great success on NLP tasks such as language modeling, machine translation and certain question answering (QA) tasks. However, this success is limited at more knowledge-intensive tasks such as QA from a big corpus. Existing end-to-end deep QA models (Miller et al., 2016; Weston et al., 2014) need to read the entire text after observing the question, and therefore their complexity in responding to a question is linear in the text size. This is prohibitive for practical tasks such as QA from Wikipedia, a novel, or the Web. We propose to solve this scalability issue by using symbolic meaning representations, which can be indexed and retrieved efficiently with complexity that is independent of the text size. More specifically, we use sequence-to-sequence models to encode knowledge symbolically and generate programs to answer questions from the encoded knowledge. We apply our approach, called the N-Gram Machine (NGM), to the bAbI tasks (Weston et al., 2015) and a special version of them ("life-long bAbI") which has stories of up to 10 million sentences. Our experiments show that NGM can successfully solve both of these tasks accurately and efficiently. Unlike fully differentiable memory models, NGM's time complexity and answering quality are not affected by the story length. The whole system of NGM is trained end-to-end with REINFORCE (Williams, 1992). To avoid the high variance in gradient estimation that is typical of discrete latent variable models, we use beam search instead of sampling. To tackle the exponentially large search space, we use a stabilized auto-encoding objective and a structure tweak procedure to iteratively reduce and refine the search space.
Learning Representations and Generative Models for 3D Point Clouds
Three-dimensional geometric data offer an excellent domain for studying representation learning and generative modeling. In this paper, we look at geometric data represented as point clouds. We introduce a deep autoencoder (AE) network with excellent reconstruction quality and generalization ability. The learned representations outperform the state of the art in 3D recognition tasks and enable basic shape editing applications via simple algebraic manipulations, such as semantic part editing, shape analogies and shape interpolation. We also perform a thorough study of different generative models, including GANs operating on the raw point clouds, significantly improved GANs trained in the fixed latent space of our AEs, and Gaussian mixture models (GMMs). Interestingly, GMMs trained in the latent space of our AEs produce samples of the best fidelity and diversity. To perform our quantitative evaluation of generative models, we propose simple measures of fidelity and diversity based on optimal matching between sets of point clouds.
Shifting Mean Activation Towards Zero with Bipolar Activation Functions
We propose a simple extension to the ReLU family of activation functions that allows them to shift the mean activation across a layer towards zero. Combined with proper weight initialization, this alleviates the need for normalization layers. We explore the training of deep vanilla recurrent neural networks (RNNs) with up to 144 layers, and show that bipolar activation functions help learning in this setting. On the Penn Treebank and Text8 language modeling tasks we obtain competitive results, improving on the best reported results for non-gated networks. In experiments with convolutional neural networks without batch normalization, we find that bipolar activations produce a faster drop in training error, and result in a lower test error on the CIFAR-10 classification task.
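The bipolar construction can be sketched directly: apply the activation to half of the units and a sign-flipped version to the other half, so positive and negative outputs balance and the layer mean is pushed towards zero. The even/odd split below is our illustrative choice of which units get which version:

```python
import torch

def bipolar_relu(x):
    # Apply f(x) = relu(x) to even-indexed units and -f(-x) = -relu(-x) to
    # odd-indexed units along the feature dimension, shifting the layer's
    # mean activation towards zero.
    out = torch.empty_like(x)
    out[..., 0::2] = torch.relu(x[..., 0::2])
    out[..., 1::2] = -torch.relu(-x[..., 1::2])
    return out

y = bipolar_relu(torch.randn(8, 64))   # mean of y is close to zero per layer
```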
eCommerceGAN: A Generative Adversarial Network for e-commerce
E-commerce companies such as Amazon, Alibaba, and Flipkart process billions of orders every year. However, these orders represent only a small fraction of all plausible orders. Exploring the space of all plausible orders could help us better understand the relationships between the various entities in an e-commerce ecosystem, namely the customers and the products they purchase. In this paper, we propose a Generative Adversarial Network (GAN) for e-commerce orders. Our contributions include: (a) creating a dense and low-dimensional representation of e-commerce orders, (b) training an ecommerceGAN (ecGAN) with real orders to show the feasibility of the proposed paradigm, and (c) training an ecommerce-conditional-GAN (ec2GAN) to generate plausible orders involving a particular product. We evaluate ecGAN qualitatively to demonstrate its effectiveness. The ec2GAN is used for various characterizations of possible orders involving cold-start products.
Learning Rich Image Representation with Deep Layer Aggregation
Architectural efforts are exploring many dimensions for network backbones, designing deeper or wider architectures, but how to best aggregate layers and blocks across a network deserves further attention. We augment standard architectures with deeper aggregation to better fuse information across layers. Our deep layer aggregation structures iteratively and hierarchically merge the feature hierarchy to make networks with better accuracy and fewer parameters. Experiments across architectures and tasks show that deep layer aggregation improves recognition and resolution compared to existing branching and merging schemes.
Regularization Neural Networks via Constrained Virtual Movement Field
We smooth the objective of neural networks w.r.t. small adversarial perturbations of the inputs. Unlike previous works, we assume the adversarial perturbations are caused by a movement field; when the magnitude of the movement field approaches 0, we call it a virtual movement field. By introducing the movement field, we cast the problem of finding adversarial perturbations as the problem of finding an adversarial movement field. By adding proper geometric constraints to the movement field, this smoothness can be approximated in closed form by solving a min-max problem, and its geometric meaning is clear. We define the approximated smoothness as the regularization term. As running examples, we derive three regularization terms that measure smoothness w.r.t. shift, rotation and scale respectively by adding different constraints. We evaluate our methods on synthetic data, MNIST and CIFAR-10. Experimental results show that our proposed method can significantly improve baseline neural networks. Compared with state-of-the-art regularization methods, the proposed method achieves a tradeoff between accuracy, geometric interpretability and computational cost.
SufiSent - Universal Sentence Representations Using Suffix Encodings
Computing universal distributed representations of sentences is a fundamental task in natural language processing. We propose a method to learn such representations by encoding the suffixes of word sequences in a sentence and training on the Stanford Natural Language Inference (SNLI) dataset. We demonstrate the effectiveness of our approach by evaluating it on the SentEval benchmark, improving on existing approaches on several transfer tasks.
Finding Flatter Minima with SGD
It has been observed that over-parameterized deep neural networks (DNNs) trained using stochastic gradient descent (SGD) with smaller batch sizes generalize better than those trained with larger batch sizes. Additionally, model parameters found by small-batch SGD tend to lie in flatter regions. We extend these empirical observations and experimentally show that both a large learning rate and a small batch size contribute towards SGD finding flatter minima that generalize well. Conversely, we find that small learning rates and large batch sizes lead to sharper minima that correlate with poor generalization in DNNs.
Designing Efficient Neural Attention Systems Towards Achieving Human-level Sharp Vision
Human vision can focus on subtle visual cues at high resolution by relying on a foveal view coupled with an attention mechanism. Several recent studies have proposed deep reinforcement learning based attention models, but they do not explicitly consider the design of a foveal representation, and its effect on an attention system remains unclear. In this paper, we investigate the effect of using a hierarchy of visual streams when training an efficient attention model towards achieving human-level sharp vision. We evaluate on a simulated human-robot interaction task in which the agent attends to faces that are looking at it. The experimental results show that the performance of the system depends on factors such as the number of visual streams and their relative fields of view, and we demonstrate that maintaining a hierarchy within the visual streams is crucial for learning attention strategies.
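An illustrative foveated glimpse, assuming nothing about the paper's exact streams: crop patches of growing field-of-view around a fixation point and resize them to a common resolution, giving a sharp center and a coarse periphery. Patch sizes are placeholders.

```python
import torch
import torch.nn.functional as F

def glimpse_hierarchy(image, cy, cx, sizes=(16, 32, 64), out=16):
    streams = []
    for s in sizes:
        half = s // 2
        patch = image[..., cy - half:cy + half, cx - half:cx + half]
        streams.append(F.interpolate(patch, size=(out, out), mode='bilinear',
                                     align_corners=False))
    return torch.cat(streams, dim=1)  # stack streams along channels

img = torch.randn(1, 3, 128, 128)
g = glimpse_hierarchy(img, 64, 64)  # 3 streams -> 9 channels at 16x16
```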
Adversarial Policy Gradient for Alternating Markov Games
Policy gradient reinforcement learning has been applied to two-player alternate-turn zero-sum games; in AlphaGo, for example, self-play REINFORCE was used to improve the neural net model after supervised learning. In this paper, we emphasize that two-player zero-sum games with alternating turns, previously formulated as Alternating Markov Games (AMGs), differ from standard MDPs because of their two-agent nature. We exploit the difference in the associated Bellman equations, which leads to different policy iteration algorithms. Since policy gradient methods are a form of generalized policy iteration, we show how these differences in policy iteration are reflected in policy gradient for AMGs. We formulate an adversarial policy gradient and discuss possibilities for developing policy gradient methods beyond self-play REINFORCE. The core idea is to estimate the minimum rather than the mean for the “critic”. Experimental results on the game of Hex show that the modified Monte Carlo policy gradient methods learn better pure neural net policies than the REINFORCE variants. To apply the learned neural weights to Hex on multiple board sizes, we describe a board-size-independent neural net architecture. We show that when combined with search, a single neural net model yields a program that consistently beats MoHex 2.0, the state-of-the-art computer Hex player, on board sizes from 9×9 to 13×13.
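The core idea in a toy form: for an alternating-turn zero-sum game, the critic target assumes the opponent's best response, so it takes the minimum over opponent replies rather than the mean. The return values below are purely illustrative.

```python
import numpy as np

# Monte Carlo returns observed after each opponent reply to our move.
returns_per_opponent_move = np.array([0.8, -0.2, 0.3])

mean_critic = returns_per_opponent_move.mean()        # standard MDP-style target
adversarial_critic = returns_per_opponent_move.min()  # AMG-style target

print(mean_critic, adversarial_critic)  # 0.3 vs -0.2: plan for the worst reply
```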
Leveraging Constraint Logic Programming for Neural Guided Program Synthesis
We present a method for solving Programming by Example (PBE) problems that tightly integrates a neural network with a constraint logic programming system called miniKanren. Internally, miniKanren searches for a program that satisfies the recursive constraints imposed by the provided examples. Our Recurrent Neural Network (RNN) model uses these constraints as input to score candidate programs. We show evidence that using our method to guide miniKanren’s search is a promising approach to solving PBE problems.
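A hedged sketch of the scoring side, under the assumption that miniKanren's constraints can be serialized into token sequences: embed the tokens, encode with a GRU, and emit a scalar score used to order the search queue. The tokenization and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class ConstraintScorer(nn.Module):
    def __init__(self, vocab_size=500, emb_dim=64, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.score = nn.Linear(hid_dim, 1)

    def forward(self, constraint_tokens):
        _, h = self.gru(self.emb(constraint_tokens))
        return self.score(h[-1]).squeeze(-1)

scorer = ConstraintScorer()
candidates = torch.randint(0, 500, (3, 20))  # 3 candidate constraint encodings
priorities = scorer(candidates)              # expand highest-scoring first
```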
Simple and efficient architecture search for Convolutional Neural Networks
Neural networks have recently achieved great success on many tasks, but architectures that perform well are still typically designed manually by experts in a cumbersome trial-and-error process. We propose a new method to automatically search for well-performing CNN architectures based on a simple hill-climbing procedure whose operators apply network morphisms, followed by short optimization runs with cosine annealing. Surprisingly, this simple method yields competitive results despite requiring resources of only the same order of magnitude as training a single network. For example, on CIFAR-10 our method designs and trains networks with an error rate below 6% in only 12 hours on a single GPU; training for one day reduces this error further, to almost 5%.
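A skeleton of such a hill-climbing loop, assuming morphism operators (e.g., widen or deepen) that preserve the function the child network computes, a short cosine-annealed training routine, and a validation-based evaluator; all three are supplied by the caller and hypothetical here.

```python
import copy
import random

def hill_climb(parent, train_short, evaluate, morphisms, steps=8, children=4):
    for _ in range(steps):
        candidates = []
        for _ in range(children):
            # Each child starts as a function-preserving mutation of the parent.
            child = random.choice(morphisms)(copy.deepcopy(parent))
            train_short(child)          # a few epochs with cosine annealing
            candidates.append(child)
        # Climb to the best-performing network, keeping the parent if none win.
        parent = max(candidates + [parent], key=evaluate)
    return parent
```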
Reward Estimation for Variance Reduction in Deep Reinforcement Learning
In reinforcement learning (RL), stochastic environments can make learning a policy difficult due to high variance. Variance reduction methods such as advantage estimation and control-variate estimation have therefore been investigated in other works. Here, we propose to learn a separate reward estimator to train the value function, helping to reduce variance caused by a noisy reward signal. This yields theoretical reductions in variance in the tabular case, as well as empirical improvements in both the function approximation and tabular settings in environments where rewards are stochastic. For our experiments, we use a modified version of Advantage Actor Critic (A2C) on variations of Atari games.
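A minimal sketch of the reward-estimation idea: fit a reward predictor to the noisy observed rewards and build the value target from the prediction instead. The state/action shapes and the exact A2C integration are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical reward predictor over (state, action) pairs.
reward_net = nn.Sequential(nn.Linear(4 + 1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

def td_target(state, action, next_value, observed_reward, gamma=0.99):
    inp = torch.cat([state, action], dim=-1)
    r_hat = reward_net(inp).squeeze(-1)
    # Regress the estimator toward the noisy reward sample.
    loss = ((r_hat - observed_reward) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Use the (lower-variance) estimate in the value target.
    return r_hat.detach() + gamma * next_value
```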
A moth brain learns to read MNIST
We seek to characterize the learning tools (i.e., algorithmic components) used in biological neural networks in order to port them to the machine learning context, in particular for the regime of very few training samples. The Moth Olfactory Network is among the simplest biological neural systems that can learn. We assigned a computational model of the Moth Olfactory Network the task of classifying the MNIST digits. The moth brain successfully learned to read given very few training samples (1 to 20 per class), and in this few-samples regime it substantially outperformed standard ML methods such as nearest neighbors, SVMs, and CNNs. Our experiments elucidate biological mechanisms for fast learning that rely on cascaded networks, competitive inhibition, sparsity, and Hebbian plasticity. These biological algorithmic components represent a novel, alternative toolkit for building neural nets that may offer a valuable complement to standard neural nets.
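A toy Hebbian update with competitive inhibition, two of the components named above; this is a didactic stand-in, not the moth-network model itself, and all sizes are placeholders.

```python
import numpy as np

def hebbian_step(W, pre, post, lr=0.1, k=3):
    # Competitive inhibition: only the k most active outputs learn.
    winners = np.argsort(post)[-k:]
    mask = np.zeros_like(post)
    mask[winners] = 1.0
    W += lr * np.outer(mask * post, pre)  # "fire together, wire together"
    return W

W = np.zeros((10, 20))
pre, post = np.random.rand(20), np.random.rand(10)
W = hebbian_step(W, pre, post)
```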
Extending Robust Adversarial Reinforcement Learning Considering Adaptation and Diversity
We propose two extensions to Robust Adversarial Reinforcement Learning (Pinto et al., 2017). The first adds to the adversarial agent's objective a penalty that brings the training domain closer to the test domain; the second trains multiple adversarial agents against one protagonist. We conducted experiments on a physics-simulator benchmark task, and the results show that our method improves performance in the test domain compared to the baseline.
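A schematic of the two extensions under stated assumptions: the adversary's loss gains a penalty pulling the induced training domain toward the test domain, and each episode samples one adversary from an ensemble. All quantities and names here are placeholders.

```python
import random

def adversary_loss(protagonist_return, domain_gap, beta=0.5):
    # The adversary minimizes the protagonist's return, but is penalized
    # (via domain_gap) for pushing the training domain far from the test domain.
    return protagonist_return + beta * domain_gap

def pick_adversary(adversaries):
    # Extension 2: one protagonist faces a rotating ensemble of adversaries.
    return random.choice(adversaries)
```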
Neural network parameter regression for lattice quantum chromodynamics simulations in nuclear and particle physics
Nuclear and particle physicists seek to understand the structure of matter at the smallest scales through numerical simulations of lattice Quantum Chromodynamics (LQCD) performed on the largest supercomputers available. Multi-scale techniques have the potential to dramatically reduce the computational cost of such simulations, if a challenging parameter regression problem matching physics at different resolution scales can be solved. Simple neural networks applied to this task fail because of the dramatic inverted data hierarchy that this problem displays, with orders of magnitude fewer samples typically available than degrees of freedom per sample. Symmetry-aware networks that respect the complicated invariances of the underlying physics, however, provide an efficient and practical solution. Further efforts to incorporate invariances and constraints that are typical of physics problems into neural networks and other machine learning algorithms have potential to dramatically impact studies of systems in nuclear, particle, condensed matter, and statistical physics.
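A generic symmetry-averaging trick, not the paper's architecture: make a network invariant to a finite symmetry group by averaging its outputs over all group transformations of the input. Here the group is the four 90-degree rotations, a stand-in for the far richer invariances of lattice QCD.

```python
import torch

def symmetrized(model, x):
    # Average predictions over the rotation group to enforce invariance.
    outs = [model(torch.rot90(x, k, dims=(-2, -1))) for k in range(4)]
    return torch.stack(outs).mean(dim=0)
```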
3D-FilterMap: A Compact Architecture for Deep Convolutional Neural Networks
Inference in probabilistic graphical models by Graph Neural Networks
A useful computation when acting in a complex environment is to infer the marginal probabilities or most probable states of task-relevant variables. Probabilistic graphical models can efficiently represent the structure of such complex data, but performing these inferences is generally difficult. Message-passing algorithms, such as belief propagation, are a natural way to disseminate evidence amongst correlated variables while exploiting the graph structure, but these algorithms can struggle when the conditional dependency graphs contain loops. Here we use Graph Neural Networks (GNNs) to learn a message-passing algorithm that solves these inference tasks. We demonstrate the efficacy of this inference approach by training GNNs on an ensemble of graphical models and showing that they substantially outperform belief propagation on loopy graphs. Our message-passing algorithms generalize out of the training set to larger graphs and graphs with different structure.
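A bare-bones learned message-passing sketch for marginal inference, with illustrative sizes: messages come from an MLP over pairs of neighboring node states, nodes update with a GRU cell, and a readout maps final states to per-variable marginals. The exact GNN used in the paper may differ.

```python
import torch
import torch.nn as nn

class InferenceGNN(nn.Module):
    def __init__(self, hid=32, steps=10):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * hid, hid), nn.ReLU())
        self.update = nn.GRUCell(hid, hid)
        self.readout = nn.Linear(hid, 2)  # binary-variable marginals
        self.steps = steps

    def forward(self, h, edges):
        src, dst = edges  # (E,) index tensors over nodes
        for _ in range(self.steps):
            m = self.msg(torch.cat([h[src], h[dst]], dim=-1))
            agg = torch.zeros_like(h).index_add_(0, dst, m)  # sum incoming
            h = self.update(agg, h)
        return self.readout(h).softmax(dim=-1)

gnn = InferenceGNN()
h0 = torch.randn(5, 32)                        # one state per variable node
edges = (torch.tensor([0, 1, 2]), torch.tensor([1, 2, 0]))
marginals = gnn(h0, edges)
```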