ICLR 2019 Videos


Invited Talk
{daterange} @ Great Hall AD
Highlights of Recent Developments in Algorithmic Fairness
Cynthia Dwork

Since its inception as a field of study roughly one decade ago, research in algorithmic fairness has exploded. Much of this work focuses on so-called "group fairness" notions, which address the relative treatment of different demographic groups. More theoretical work advocated for "individual fairness" which, speaking intuitively, requires that people who are similar, with respect to a given classification task, should be treated similarly by classifiers for that task. Both approaches face significant challenges: for example, provable incompatibility of natural fairness desiderata (for the group case), and the absence of similarity information (for the individual case). The past two years have seen exciting developments on several fronts in theoretical computer science including: the investigation of scoring, classifying, ranking, and auditing fairness under definitions aiming to bridge the group and individual notions, and the construction of similarity metrics from (relatively few) queries to a human expert. A parallel vein of research in the ML community explores fairness via representations. This talk will motivate, highlight, and weave together threads from these recent contributions.


Oral
{daterange} @ Great Hall AD
Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling
Jacob Menick · Nal Kalchbrenner

The unconditional generation of high fidelity images is a longstanding benchmark for testing the performance of image decoders. Autoregressive image models have been able to generate small images unconditionally, but the extension of these methods to large images where fidelity can be more readily assessed has remained an open problem. Among the major challenges are the capacity to encode the vast previous context and the sheer difficulty of learning a distribution that preserves both global semantic coherence and exactness of detail. To address the former challenge, we propose the Subscale Pixel Network (SPN), a conditional decoder architecture that generates an image as a sequence of image slices of equal size. The SPN compactly captures image-wide spatial dependencies and requires a fraction of the memory and the computation. To address the latter challenge, we propose to use multidimensional upscaling to grow an image in both size and depth via intermediate stages corresponding to distinct SPNs. We evaluate SPNs on the unconditional generation of CelebAHQ of size 256 and of ImageNet from size 32 to 128. We achieve state-of-the-art likelihood results in multiple settings, set up new benchmark results in previously unexplored settings and are able to generate very high fidelity large scale samples on the basis of both datasets.
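
As an illustration of the subscale ordering, here is a minimal numpy sketch (the helper name and slice size are hypothetical; the actual SPN conditions each slice on all previously generated slices):

    import numpy as np

    def subscale_slices(image, s=2):
        # Split an (H, W) image into s*s equal-size slices: slice (i, j)
        # holds the pixels at rows i::s and columns j::s, so each slice is
        # a subsampled view of the whole image.
        return [image[i::s, j::s] for i in range(s) for j in range(s)]

    img = np.arange(16).reshape(4, 4)
    for sl in subscale_slices(img):
        print(sl)  # four 2x2 slices, generated one after another by the SPN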


Oral
{daterange} @ Great Hall AD
BA-Net: Dense Bundle Adjustment Networks
Chengzhou Tang · Ping Tan

This paper introduces a network architecture to solve the structure-from-motion (SfM) problem via feature-metric bundle adjustment (BA), which explicitly enforces multi-view geometry constraints in the form of feature-metric error. The whole pipeline is differentiable, so that the network can learn suitable features that make the BA problem more tractable. Furthermore, this work introduces a novel depth parameterization to recover dense per-pixel depth. The network first generates several basis depth maps according to the input image, and optimizes the final depth as a linear combination of these basis depth maps via feature-metric BA. The basis depth maps generator is also learned via end-to-end training. The whole system nicely combines domain knowledge (i.e. hard-coded multi-view geometry constraints) and deep learning (i.e. feature learning and basis depth maps learning) to address the challenging dense SfM problem. Experiments on large scale real data prove the success of the proposed method.
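
A hedged sketch of the depth parameterization described above (sizes and names are hypothetical): the network emits a few basis depth maps, and feature-metric BA only has to optimize the low-dimensional combination weights.

    import numpy as np

    rng = np.random.default_rng(0)
    basis = rng.random((8, 240, 320))        # 8 basis depth maps from the network
    w = rng.random(8)                        # combination weights refined by BA
    depth = np.tensordot(w, basis, axes=1)   # (240, 320) dense per-pixel depth
    print(depth.shape)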


Oral
{daterange} @ Great Hall AD
Large Scale GAN Training for High Fidelity Natural Image Synthesis
Andrew Brock · Jeff Donahue · Karen Simonyan

Despite recent progress in generative image modeling, successfully generating high-resolution, diverse samples from complex datasets such as ImageNet remains an elusive goal. To this end, we train Generative Adversarial Networks at the largest scale yet attempted, and study the instabilities specific to such scale. We find that applying orthogonal regularization to the generator renders it amenable to a simple "truncation trick", allowing fine control over the trade-off between sample fidelity and variety by reducing the variance of the Generator's input. Our modifications lead to models which set the new state of the art in class-conditional image synthesis. When trained on ImageNet at 128x128 resolution, our models (BigGANs) achieve an Inception Score (IS) of 166.3 and Fréchet Inception Distance (FID) of 9.6, improving over the previous best IS of 52.52 and FID of 18.65.
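
The truncation trick itself is simple; a minimal sketch (the threshold value here is arbitrary):

    import numpy as np

    def truncated_normal(shape, threshold=0.5, seed=0):
        # Resample z ~ N(0, 1) until every coordinate satisfies |z| <= threshold.
        # Smaller thresholds reduce latent variance, trading variety for fidelity.
        rng = np.random.default_rng(seed)
        z = rng.standard_normal(shape)
        mask = np.abs(z) > threshold
        while mask.any():
            z[mask] = rng.standard_normal(mask.sum())
            mask = np.abs(z) > threshold
        return z

    z = truncated_normal((4, 128))  # latent batch to feed the generator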


Invited Talk
{daterange} @ Great Hall AD
Learning Representations Using Causal Invariance
Leon Bottou

Learning algorithms often capture spurious correlations present in the training data distribution instead of addressing the task of interest. Such spurious correlations occur because the data collection process is subject to uncontrolled confounding biases. Suppose however that we have access to multiple datasets exemplifying the same concept but whose distributions exhibit different biases. Can we learn something that is common across all these distributions, while ignoring the spurious ways in which they differ? This can be achieved by projecting the data into a representation space that satisfies a causal invariance criterion. This idea differs in important ways from previous work on statistical robustness or adversarial objectives. Similar to recent work on invariant feature selection, this is about discovering the actual mechanism underlying the data instead of modeling its superficial statistics.


Oral
{daterange} @ Great Hall AD
On Random Deep Weight-Tied Autoencoders: Exact Asymptotic Analysis, Phase Transitions, and Implications to Training
Ping Li · Phan-Minh Nguyen

We study the behavior of weight-tied multilayer vanilla autoencoders under the assumption of random weights. Via an exact characterization in the limit of large dimensions, our analysis reveals interesting phase transition phenomena when the depth becomes large. This, in particular, provides quantitative answers and insights to three questions that were not yet fully understood in the literature. Firstly, we provide a precise answer on how the random deep weight-tied autoencoder model performs “approximate inference” as posed by Scellier et al. (2018), and its connection to reversibility considered by several theoretical studies. Secondly, we show that deep autoencoders display a higher degree of sensitivity to perturbations in the parameters, distinct from their shallow counterparts. Thirdly, we obtain insights into pitfalls in training initialization practice, and demonstrate experimentally that it is possible to train a deep autoencoder, even with the tanh activation and a depth as large as 200 layers, without resorting to techniques such as layer-wise pre-training or batch normalization. Our analysis is not specific to any depth or Lipschitz activation, and our analytical techniques may have broader applicability.
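
For concreteness, a weight-tied autoencoder in the sense studied here reuses the transposed encoder weights in the decoder; a minimal random-weight sketch (layer sizes hypothetical, biases omitted):

    import numpy as np

    rng = np.random.default_rng(0)
    dims = [784, 500, 250]
    Ws = [rng.standard_normal((dims[i + 1], dims[i])) / np.sqrt(dims[i])
          for i in range(len(dims) - 1)]

    def autoencode(x):
        h = x
        for W in Ws:                # encoder with random weights
            h = np.tanh(W @ h)
        for W in reversed(Ws):      # decoder: tied (transposed) weights
            h = np.tanh(W.T @ h)
        return h

    x_hat = autoencode(rng.standard_normal(784))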


Oral
{daterange} @ Great Hall AD
How Powerful are Graph Neural Networks?
Keyulu Xu · Weihua Hu · Jure Leskovec · Stefanie Jegelka

Graph Neural Networks (GNNs) are an effective framework for representation learning of graphs. GNNs follow a neighborhood aggregation scheme, where the representation vector of a node is computed by recursively aggregating and transforming representation vectors of its neighboring nodes. Many GNN variants have been proposed and have achieved state-of-the-art results on both node and graph classification tasks. However, despite GNNs revolutionizing graph representation learning, there is limited understanding of their representational properties and limitations. Here, we present a theoretical framework for analyzing the expressive power of GNNs in capturing different graph structures. Our results characterize the discriminative power of popular GNN variants, such as Graph Convolutional Networks and GraphSAGE, and show that they cannot learn to distinguish certain simple graph structures. We then develop a simple architecture that is provably the most expressive among the class of GNNs and is as powerful as the Weisfeiler-Lehman graph isomorphism test. We empirically validate our theoretical findings on a number of graph classification benchmarks, and demonstrate that our model achieves state-of-the-art performance.
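
A sketch of the sum-aggregation update behind the maximally expressive architecture (GIN); the MLP weights here are placeholders:

    import numpy as np

    def gin_layer(A, H, W1, W2, eps=0.0):
        # Sum-aggregate each node's neighbours (A @ H), mix in the node's own
        # features, then apply a 2-layer MLP. Sum aggregation preserves the
        # neighbour multiset, which is what matches the WL test; mean or max
        # aggregation can lose that information.
        agg = (1 + eps) * H + A @ H
        return np.maximum(agg @ W1, 0) @ W2

    rng = np.random.default_rng(0)
    A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], float)  # toy 3-node graph
    H = rng.standard_normal((3, 8))
    W1, W2 = rng.standard_normal((8, 16)), rng.standard_normal((16, 8))
    H = gin_layer(A, H, W1, W2)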


Oral
{daterange} @ Great Hall AD
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
Jonathan Frankle · Michael Carbin

Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance. We find that a standard technique for pruning weights naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the "lottery ticket hypothesis": dense, randomly-initialized feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - arrive at comparable test accuracy in a comparable number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective. We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Furthermore, the winning tickets we find above that size learn faster than the original network and exhibit higher test accuracy.
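
One round of the identification procedure, sketched in numpy (the training step is elided; the pruning fraction is illustrative):

    import numpy as np

    def prune_and_rewind(w_trained, w_init, mask, prune_frac=0.2):
        # Iterative magnitude pruning: drop the smallest-magnitude surviving
        # weights, then rewind the survivors to their original initialization,
        # yielding a winning-ticket candidate.
        cutoff = np.quantile(np.abs(w_trained[mask]), prune_frac)
        new_mask = mask & (np.abs(w_trained) > cutoff)
        return w_init * new_mask, new_mask

    rng = np.random.default_rng(0)
    w_init = rng.standard_normal(1000)
    w, mask = w_init.copy(), np.ones(1000, bool)
    for _ in range(5):
        # w = train(w, mask) would run here before each pruning round
        w, mask = prune_and_rewind(w, w_init, mask)
    print(mask.mean())  # ~0.8**5 = 0.33 of the weights remain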


Invited Talk
{daterange} @ Great Hall AD
Can Machine Learning Help to Conduct a Planetary Healthcheck?
Emily Shuckburgh

Our planet is under ever-increasing pressure from environmental degradation, biodiversity loss and climate change. If, as a global society, we are to respond to this, we need to quantitatively assess the current state and future trends. Today, the planet’s vital signs are monitored extensively from space and from networks of ground-based sensors. In addition, over the past fifty years we have developed sophisticated numerical models of the Earth’s systems based on our understanding of physics, chemistry and biology. However, we are still limited in our ability to accurately predict future change, especially at the local scale that is most relevant for decision-making. In this talk I will outline a set of "grand challenge" problems and discuss various ways in which Machine Learning is starting to be deployed to advance our capacity to address these, in particular by combining our fundamental understanding of the Earth system processes with new knowledge gleaned from the vast planetary datasets.


Oral
{daterange} @ Great Hall AD
Learning Protein Structure with a Differentiable Simulator
John Ingraham · Adam J Riesselman · Chris Sander · Debora Marks

The Boltzmann distribution is a natural model for many systems, from brains to materials and biomolecules, but is often of limited utility for fitting data because Monte Carlo algorithms are unable to simulate it in available time. This gap between the expressive capabilities and sampling practicalities of energy-based models is exemplified by the protein folding problem, since energy landscapes underlie contemporary knowledge of protein biophysics but computer simulations are often unable to fold all but the smallest proteins from first principles. In this work we aim to bridge the gap between the expressive capacity of energy functions and the practical capabilities of their simulators by using an unrolled Monte Carlo simulation as a model for data. We compose a neural energy function with a novel and efficient simulator based on Langevin dynamics to build an end-to-end-differentiable model of atomic protein structure given amino acid sequence information. We introduce techniques for stabilizing backpropagation under long roll-outs and demonstrate the model's capacity to make multimodal predictions and to (sometimes) generalize to unobserved protein fold types when trained on a large corpus of protein structures.
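
The simulator's core step is generic Langevin dynamics; a toy sketch on a quadratic energy (the paper's simulator adds efficiency and stabilization techniques not shown here):

    import numpy as np

    def langevin_rollout(x, grad_energy, steps=1000, lr=1e-2, seed=0):
        # Unrolled Langevin dynamics: gradient descent on the energy plus
        # Gaussian noise scaled so the chain targets the Boltzmann distribution.
        rng = np.random.default_rng(seed)
        for _ in range(steps):
            x = x - lr * grad_energy(x) + np.sqrt(2 * lr) * rng.standard_normal(x.shape)
        return x

    # Toy energy E(x) = 0.5 * ||x||^2, so grad E(x) = x; samples approach N(0, I).
    samples = langevin_rollout(np.zeros(3), lambda x: x)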


Oral
{daterange} @ Great Hall AD
Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset
Curtis Hawthorne · Andriy Stasyuk · Adam Roberts · Ian Simon · Anna Huang · Sander Dieleman · Erich K Elsen · Jesse Engel · Douglas Eck

Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms to ~100 s), a process we call Wave2Midi2Wave. This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music.


Oral
{daterange} @ Great Hall AD
A Unified Theory of Early Visual Representations from Retina to Cortex through Anatomically Constrained Deep CNNs
Jack Lindsey · Samuel Ocko · Surya Ganguli · Stephane Deny

The vertebrate visual system is hierarchically organized to process visual information in successive stages. Neural representations vary drastically across the first stages of visual processing: at the output of the retina, ganglion cell receptive fields (RFs) exhibit a clear antagonistic center-surround structure, whereas in the primary visual cortex (V1), typical RFs are sharply tuned to a precise orientation. There is currently no unified theory explaining these differences in representations across layers. Here, using a deep convolutional neural network trained on image recognition as a model of the visual system, we show that such differences in representation can emerge as a direct consequence of different neural resource constraints on the retinal and cortical networks, and for the first time we find a single model from which both geometries spontaneously emerge at the appropriate stages of visual processing. The key constraint is a reduced number of neurons at the retinal output, consistent with the anatomy of the optic nerve as a stringent bottleneck. Second, we find that, for simple downstream cortical networks, visual representations at the retinal output emerge as nonlinear and lossy feature detectors, whereas they emerge as linear and faithful encoders of the visual scene for more complex cortical networks. This result predicts that the retinas of small vertebrates (e.g. salamander, frog) should perform sophisticated nonlinear computations, extracting features directly relevant to behavior, whereas retinas of large animals such as primates should mostly encode the visual scene linearly and respond to a much broader range of stimuli. These predictions could reconcile the two seemingly incompatible views of the retina as either performing feature extraction or efficient coding of natural scenes, by suggesting that all vertebrates lie on a spectrum between these two objectives, depending on the degree of neural resources allocated to their visual system.


Invited Talk
{daterange} @ Great Hall AD
Adversarial Machine Learning
Ian Goodfellow

Until about 2013, most researchers studying machine learning for artificial intelligence worked on a common goal: get machine learning to work for AI-scale tasks. Now that supervised learning works, there is a Cambrian explosion of new research directions: making machine learning secure, making machine learning private, getting machine learning to work for new tasks, reducing the dependence on large amounts of labeled data, and so on. In this talk I survey how adversarial techniques in machine learning are involved in several of these new research frontiers.


Oral
{daterange} @ Great Hall AD
Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware
Florian Tramer · Dan Boneh

As Machine Learning (ML) gets applied to security-critical or sensitive domains, there is a growing need for integrity and privacy for outsourced ML computations. A pragmatic solution comes from Trusted Execution Environments (TEEs), which use hardware and software protections to isolate sensitive computations from the untrusted software stack. However, these isolation guarantees come at a price in performance, compared to untrusted alternatives. This paper initiates the study of high performance execution of Deep Neural Networks (DNNs) in TEEs by efficiently partitioning DNN computations between trusted and untrusted devices. Building upon an efficient outsourcing scheme for matrix multiplication, we propose Slalom, a framework that securely delegates execution of all linear layers in a DNN from a TEE (e.g., Intel SGX or Sanctum) to a faster, yet untrusted, co-located processor. We evaluate Slalom by running DNNs in an Intel SGX enclave, which selectively delegates work to an untrusted GPU. For canonical DNNs (VGG16, MobileNet and ResNet variants) we obtain 6x to 20x increases in throughput for verifiable inference, and 4x to 11x for verifiable and private inference.
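
The integrity side of such delegation can be checked probabilistically using only matrix-vector products; a sketch in the spirit of Slalom's verification (the classic Freivalds test, not Slalom's exact protocol):

    import numpy as np

    def freivalds_check(A, B, C, trials=10, seed=0):
        # Check C == A @ B with high probability: each trial costs three
        # matrix-vector products instead of a full matrix multiplication.
        rng = np.random.default_rng(seed)
        for _ in range(trials):
            r = rng.integers(0, 2, (B.shape[1], 1)).astype(float)
            if not np.allclose(A @ (B @ r), C @ r):
                return False  # the untrusted device returned a wrong product
        return True

    rng = np.random.default_rng(1)
    A, B = rng.random((64, 64)), rng.random((64, 64))
    assert freivalds_check(A, B, A @ B)
    assert not freivalds_check(A, B, A @ B + 1e-3)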


Oral
{daterange} @ Great Hall AD
Learning to Remember More with Less Memorization
Hung Le · Truyen Tran · Svetha Venkatesh

Memory-augmented neural networks consisting of a neural controller and an external memory have shown potential in long-term sequential learning. Current RAM-like memory models access memory at every timestep and thus do not effectively leverage the short-term memory held in the controller. We hypothesize that this scheme of writing is suboptimal in memory utilization and introduces redundant computation. To validate our hypothesis, we derive a theoretical bound on the amount of information stored in a RAM-like system and formulate an optimization problem that maximizes the bound. The proposed solution, dubbed Uniform Writing, is provably optimal under the assumption of equal timestep contributions. To relax this assumption, we introduce modifications to the original solution, resulting in a solution termed Cached Uniform Writing. This method aims to balance between maximizing memorization and forgetting via overwriting mechanisms. Through an extensive set of experiments, we empirically demonstrate the advantages of our solutions over other recurrent architectures, achieving state-of-the-art results on various sequential modeling tasks.
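
A rough sketch of the writing schedule (the actual method writes controller states and attends over the cache; the mean summary used here is an arbitrary stand-in):

    import numpy as np

    def interval_writing(inputs, write_every=3):
        # Instead of committing to memory at every timestep, cache recent
        # steps and write one summary per interval, cutting redundant writes.
        memory, cache = [], []
        for x in inputs:
            cache.append(x)
            if len(cache) == write_every:
                memory.append(np.mean(cache, axis=0))
                cache = []
        if cache:
            memory.append(np.mean(cache, axis=0))
        return np.stack(memory)

    mem = interval_writing(np.random.default_rng(0).standard_normal((10, 4)))
    print(mem.shape)  # (4, 4): ten timesteps compressed into four writes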


Oral
{daterange} @ Great Hall AD
Learning Robust Representations by Projecting Superficial Statistics Out
Haohan Wang · Zexue He · Zachary Lipton · Eric P Xing

Despite impressive performance as evaluated on i.i.d. holdout data, deep neural networks depend heavily on superficial statistics of the training data and are liable to break under distribution shift. For example, subtle changes to the background or texture of an image can break a seemingly powerful classifier. Building on previous work on domain generalization, we hope to produce a classifier that will generalize to previously unseen domains, even when domain identifiers are not available during training. We refer to this setting as unguided domain generalization. This setting is challenging because the model may extract many distribution-specific (superficial) signals together with distribution-agnostic (semantic) signals. To overcome this challenge, we incorporate the gray-level co-occurrence matrix (GLCM) to extract patterns that our prior knowledge suggests are superficial. Then we introduce two techniques for improving our networks' out-of-sample performance. The first method is built on the reverse gradient method for tuning the model to be invariant to the GLCM representation. The second method is built on the independence introduced by projecting the model's representation onto the subspace orthogonal to that of the GLCM representation. We test our method on a battery of standard domain generalization data sets and achieve performance comparable to or better than other domain generalization methods that explicitly require the distribution identification information.
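
For reference, a GLCM for a single pixel offset can be computed directly (a plain numpy sketch; parameters are illustrative):

    import numpy as np

    def glcm(img, offset=(0, 1), levels=8):
        # Gray-level co-occurrence matrix: M[i, j] counts how often gray
        # level i co-occurs with gray level j at the given offset. This is
        # the texture descriptor treated as the 'superficial' signal above.
        dy, dx = offset
        M = np.zeros((levels, levels))
        h, w = img.shape
        for y in range(max(0, -dy), min(h, h - dy)):
            for x in range(max(0, -dx), min(w, w - dx)):
                M[img[y, x], img[y + dy, x + dx]] += 1
        return M / M.sum()

    img = (np.random.default_rng(0).random((32, 32)) * 8).astype(int)
    print(glcm(img).shape)  # (8, 8) normalized co-occurrence matrix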


Oral
{daterange} @ Great Hall AD
ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness
Robert Geirhos · Patricia Rubisch · Claudio Michaelis · Matthias Bethge · Felix Wichmann · Wieland Brendel

Convolutional Neural Networks (CNNs) are commonly thought to recognise objects by learning increasingly complex representations of object shapes. Some recent studies suggest a more important role of image textures. We here put these conflicting hypotheses to a quantitative test by evaluating CNNs and human observers on images with a texture-shape cue conflict. We show that ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies. We then demonstrate that the same standard architecture (ResNet-50) that learns a texture-based representation on ImageNet is able to learn a shape-based representation instead when trained on 'Stylized-ImageNet', a stylized version of ImageNet. This provides a much better fit for human behavioural performance in our well-controlled psychophysical lab setting (nine experiments totalling 48,560 psychophysical trials across 97 observers) and comes with a number of unexpected emergent benefits such as improved object detection performance and previously unseen robustness towards a wide range of image distortions, highlighting advantages of a shape-based representation.


Invited Talk
{daterange} @ Great Hall AD
Developmental Autonomous Learning: AI, Cognitive Sciences and Educational Technology
Pierre-Yves Oudeyer

Current approaches to AI and machine learning are still fundamentally limited in comparison with autonomous learning capabilities of children. What is remarkable is not that some children become world champions in certain games or specialties: it is rather their autonomy, flexibility and efficiency at learning many everyday skills under strongly limited resources of time, computation and energy. And they do not need the intervention of an engineer for each new task (e.g. they do not need someone to provide a new task specific reward function). I will present a research program that, over the last decade, has focused on computational modeling of child development and learning mechanisms. I will discuss several developmental forces that guide exploration in large real world spaces, starting from the perspective of how algorithmic models can help us understand better how they work in humans, and in return how this opens new approaches to autonomous machine learning. In particular, I will discuss models of curiosity-driven autonomous learning, enabling machines to sample and explore their own goals and their own learning strategies, self-organizing a learning curriculum without any external reward or supervision. I will show how this has helped scientists understand better aspects of human development such as the emergence of developmental transitions between object manipulation, tool use and speech. I will also show how the use of real robotic platforms for evaluating these models has led to highly efficient unsupervised learning methods, enabling robots to discover and learn multiple skills in high-dimensions in a handful of hours. I will discuss how these techniques are now being integrated with modern deep learning methods. Finally, I will show how these models and techniques can be successfully applied in the domain of educational technologies, making it possible to personalize sequences of exercises for human learners, while maximizing both learning efficiency and intrinsic motivation. I will illustrate this with a large-scale experiment recently performed in primary schools, enabling children of all levels to improve their skills and motivation in learning aspects of mathematics. Web: http://www.pyoudeyer.com


Oral
{daterange} @ Great Hall AD
Learning deep representations by mutual information estimation and maximization
R Devon Hjelm · Alex Fedorov · Samuel Lavoie · Karan Grewal · Philip Bachman · Adam Trischler · Yoshua Bengio

This work investigates unsupervised learning of representations by maximizing mutual information between an input and the output of a deep neural network encoder. Importantly, we show that structure matters: incorporating knowledge about locality in the input into the objective can significantly improve a representation's suitability for downstream tasks. We further control characteristics of the representation by matching to a prior distribution adversarially. Our method, which we call Deep InfoMax (DIM), outperforms a number of popular unsupervised learning methods and compares favorably with fully-supervised learning on several classification tasks with some standard architectures. DIM opens new avenues for unsupervised learning of representations and is an important step towards flexible formulations of representation learning objectives for specific end-goals.
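
The objective family can be illustrated with a Jensen-Shannon-style bound on mutual information (a sketch of the estimator form only, not the full DIM architecture): a critic should score matched (input, feature) pairs high and mismatched pairs low.

    import numpy as np

    def softplus(x):
        return np.logaddexp(0.0, x)

    def jsd_mi_estimate(scores_pos, scores_neg):
        # JSD-based MI lower bound: maximized when paired samples get high
        # critic scores and shuffled (product-of-marginals) samples get low ones.
        return -softplus(-scores_pos).mean() - softplus(scores_neg).mean()

    rng = np.random.default_rng(0)
    print(jsd_mi_estimate(rng.normal(3, 1, 100), rng.normal(-3, 1, 100)))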


Oral
{daterange} @ Great Hall AD
KnockoffGAN: Generating Knockoffs for Feature Selection using Generative Adversarial Networks
James Jordon · Jinsung Yoon · Mihaela van der Schaar

Feature selection is a pervasive problem. The discovery of relevant features can be as important for performing a particular task (such as to avoid overfitting in prediction) as it can be for understanding the underlying processes governing the true label (such as discovering relevant genetic factors for a disease). Machine learning driven feature selection can enable discovery from large, high-dimensional, non-linear observational datasets by creating a subset of features for experts to focus on. In order to use expert time most efficiently, we need a principled methodology capable of controlling the False Discovery Rate. In this work, we build on the promising Knockoff framework by developing a flexible knockoff generation model. We adapt the Generative Adversarial Networks framework to allow us to generate knockoffs with no assumptions on the feature distribution. Our model consists of four networks: a generator, a discriminator, a stability network and a power network. We demonstrate the capability of our model to perform feature selection, showing that it performs as well as the originally proposed knockoff generation model in the Gaussian setting and that it outperforms the original model in non-Gaussian settings, including on a real-world dataset.
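
Once knockoffs are generated, FDR control follows the standard knockoff filter (Candès et al.); a sketch of that selection step, separate from this paper's GAN-based knockoff generation:

    import numpy as np

    def knockoff_threshold(W, q=0.1):
        # Knockoff+ threshold: the smallest t whose estimated false discovery
        # proportion (1 + #{W_j <= -t}) / #{W_j >= t} is at most q. Under the
        # null, the signs of W_j are symmetric, which makes the estimate valid.
        for t in np.sort(np.abs(W[W != 0])):
            if (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= q:
                return t
        return np.inf

    rng = np.random.default_rng(0)
    W = np.concatenate([rng.normal(3, 1, 20), rng.normal(0, 1, 200)])
    selected = np.where(W >= knockoff_threshold(W))[0]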


Oral
{daterange} @ Great Hall AD
Deterministic Variational Inference for Robust Bayesian Neural Networks
Anqi Wu · Sebastian Nowozin · Ted Meeds · Richard E Turner · José Miguel Hernández Lobato · Alexander Gaunt

Bayesian neural networks (BNNs) hold great promise as a flexible and principled solution to deal with uncertainty when learning from finite data. Among approaches to realize probabilistic inference in deep neural networks, variational Bayes (VB) is theoretically grounded, generally applicable, and computationally efficient. With wide recognition of potential advantages, why is it that variational Bayes has seen very limited practical use for BNNs in real applications? We argue that variational inference in neural networks is fragile: successful implementations require careful initialization and tuning of prior variances, as well as controlling the variance of Monte Carlo gradient estimates. We provide two innovations that aim to turn VB into a robust inference tool for Bayesian neural networks: first, we introduce a novel deterministic method to approximate moments in neural networks, eliminating gradient variance; second, we introduce a hierarchical prior for parameters and a novel Empirical Bayes procedure for automatically selecting prior variances. Combining these two innovations, the resulting method is highly efficient and robust. On the application of heteroscedastic regression we demonstrate good predictive performance over alternative approaches.
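
The flavour of the deterministic approximation can be seen on a single Bayesian linear layer, where input and weight moments propagate in closed form with no Monte Carlo sampling (a sketch assuming independent weights and inputs; the paper also handles nonlinearities):

    import numpy as np

    rng = np.random.default_rng(0)
    m_x, v_x = rng.standard_normal(5), np.full(5, 0.1)             # input mean/variance
    M_w, V_w = rng.standard_normal((3, 5)), np.full((3, 5), 0.05)  # weight moments

    # y = W x with independent entries: closed-form output moments.
    m_y = M_w @ m_x
    v_y = (V_w + M_w**2) @ (v_x + m_x**2) - M_w**2 @ m_x**2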


Oral
{daterange} @ Great Hall AD
FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models
Will Grathwohl · Tian Qi Chen · Jesse Bettencourt · Ilya Sutskever · David Duvenaud

A promising class of generative models maps points from a simple distribution to a complex distribution through an invertible neural network. Likelihood-based training of these models requires restricting their architectures to allow cheap computation of Jacobian determinants. Alternatively, the Jacobian trace can be used if the transformation is specified by an ordinary differential equation. In this paper, we use Hutchinson’s trace estimator to give a scalable unbiased estimate of the log-density. The result is a continuous-time invertible generative model with unbiased density estimation and one-pass sampling, while allowing unrestricted neural network architectures. We demonstrate our approach on high-dimensional density estimation, image generation, and variational inference, achieving the state-of-the-art among exact likelihood methods with efficient sampling.
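
Hutchinson's estimator is easy to state: tr(J) = E[v^T J v] for random v with E[v v^T] = I. A toy numpy check on an explicit matrix (FFJORD never forms J, instead obtaining vector-Jacobian products from autodiff):

    import numpy as np

    def hutchinson_trace(J, n_samples=1000, seed=0):
        # Rademacher probe vectors give an unbiased estimate of the trace.
        rng = np.random.default_rng(seed)
        v = rng.choice([-1.0, 1.0], size=(n_samples, J.shape[0]))
        return np.einsum('ni,ij,nj->n', v, J, v).mean()

    J = np.random.default_rng(1).standard_normal((10, 10))
    print(hutchinson_trace(J), np.trace(J))  # the two values should be close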


Oral
{daterange}
ICLR Debate
Leslie Kaelbling


Invited Talk
{daterange} @ Great Hall AD
Learning Natural Language Interfaces with Neural Models
Mirella Lapata

In Spike Jonze's futuristic film "Her", Theodore, a lonely writer, forms a strong emotional bond with Samantha, an operating system designed to meet his every need. Samantha can carry on seamless conversations with Theodore, exhibits a perfect command of language, and is able to take on complex tasks. She filters his emails for importance, allowing him to deal with information overload, she proactively arranges the publication of Theodore's letters, and is able to give advice using common sense and reasoning skills. In this talk I will present an overview of recent progress on learning natural language interfaces which might not be as clever as Samantha but nevertheless allow users to interact with various devices and services using everyday language. I will address the structured prediction problem of mapping natural language utterances onto machine-interpretable representations and outline the various challenges it poses. For example, the fact that the translation of natural language to formal language is highly non-isomorphic, data for model training is scarce, and natural language can express the same information need in many different ways. I will describe a general modeling framework based on neural networks which tackles these challenges and improves the robustness of natural language interfaces.


Oral
{daterange} @ Great Hall AD
Pay Less Attention with Lightweight and Dynamic Convolutions
Felix Wu · Angela Fan · Alexei Baevski · Yann Dauphin · Michael Auli

Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time-step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT'14 English-German test set dynamic convolutions achieve a new state of the art of 29.7 BLEU.
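
A much-simplified sketch of one dynamic-convolution position (the real model uses depthwise convolutions with shared heads; the names here are illustrative): the kernel is predicted from the current timestep alone and applied over a fixed window, so cost grows linearly with length.

    import numpy as np

    def dynamic_conv_step(context, x_t, W_k):
        # Predict a softmax-normalized kernel over the k-step window from
        # the current input, then take the weighted sum of the window.
        logits = W_k @ x_t
        kernel = np.exp(logits - logits.max())
        kernel /= kernel.sum()
        return kernel @ context

    rng = np.random.default_rng(0)
    k, d = 4, 16
    context = rng.standard_normal((k, d))   # the k most recent states
    W_k = rng.standard_normal((k, d))       # kernel-prediction weights
    y_t = dynamic_conv_step(context, context[-1], W_k)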


Oral
{daterange} @ Great Hall AD
The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
Jiayuan Mao · Chuang Gan · Pushmeet Kohli · Joshua B Tenenbaum · Jiajun Wu

We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns visual concepts, words, and semantic parsing of sentences without explicit supervision on any of them; instead, our model learns by simply looking at images and reading paired questions and answers. Our model builds an object-based scene representation and translates sentences into executable, symbolic programs. To bridge the learning of two modules, we use a neural-symbolic reasoning module that executes these programs on the latent scene representation. Analogously to human concept learning, given the parsed program, the perception module learns visual concepts based on the language description of the object being referred to. Meanwhile, the learned visual concepts facilitate learning new words and parsing new sentences. We use curriculum learning to guide searching over the large compositional space of images and language. Extensive experiments demonstrate the accuracy and efficiency of our model on learning visual concepts, word representations, and semantic parsing of sentences. Further, our method allows easy generalization to new object attributes, compositions, language concepts, scenes and questions, and even new program domains. It also empowers applications including visual question answering and bidirectional image-text retrieval.


Oral
{daterange} @ Great Hall AD
Smoothing the Geometry of Probabilistic Box Embeddings
Xiang Li · Luke Vilnis · Dongxu Zhang · Michael Boratko · Andrew McCallum

There is growing interest in geometrically-inspired embeddings for learning hierarchies, partial orders, and lattice structures, with natural applications to transitive relational data such as entailment graphs. Recent work has extended these ideas beyond deterministic hierarchies to probabilistically calibrated models, which enable learning from uncertain supervision and inferring soft-inclusions among concepts, while maintaining the geometric inductive bias of hierarchical embedding models. We build on the Box Lattice model of Vilnis et al. (2018), which showed promising results in modeling soft-inclusions through an overlapping hierarchy of sets, parameterized as high-dimensional hyperrectangles (boxes). However, the hard edges of the boxes present difficulties for standard gradient based optimization; that work employed a special surrogate function for the disjoint case, but we find this method to be fragile. In this work, we present a novel hierarchical embedding model, inspired by a relaxation of box embeddings into parameterized density functions using Gaussian convolutions over the boxes. Our approach provides an alternative surrogate to the original lattice measure that improves the robustness of optimization in the disjoint case, while also preserving the desirable properties with respect to the original lattice. We demonstrate increased or matching performance on WordNet hypernymy prediction, Flickr caption entailment, and a MovieLens-based market basket dataset. We show especially marked improvements in the case of sparse data, where many conditional probabilities should be low, and thus boxes should be nearly disjoint.
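
One common form of the relaxation can be sketched directly: replace the hard per-dimension side length max(0, z_max - z_min) with a softplus, so even disjoint boxes yield a nonzero overlap and usable gradients (an illustrative surrogate, not the paper's exact Gaussian-convolution measure):

    import numpy as np

    def soft_volume(z_min, z_max, beta=10.0):
        # Softened box volume: product of softplus side lengths.
        side = np.logaddexp(0.0, beta * (z_max - z_min)) / beta
        return np.prod(side)

    # Overlap of two boxes = volume of their intersection.
    a_min, a_max = np.array([0.0, 0.0]), np.array([2.0, 2.0])
    b_min, b_max = np.array([3.0, 3.0]), np.array([4.0, 4.0])  # disjoint boxes
    inter = soft_volume(np.maximum(a_min, b_min), np.minimum(a_max, b_max))
    print(inter)  # tiny but nonzero, so optimization can still make progress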


Oral
{daterange} @ Great Hall AD
Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks
Yikang Shen · Shawn Tan · Alessandro Sordoni · Aaron Courville

In general, natural language is governed by a tree structure: smaller units (e.g., phrases) are nested within larger units (e.g., clauses). This is a strict hierarchy: when a larger constituent ends, all of the smaller constituents that are nested within it must also be closed. While the standard LSTM allows different neurons to track information at different time scales, the architecture does not impose a strict hierarchy. This paper proposes to add such a constraint to the system by ordering the neurons; a vector of "master" input and forget gates ensures that when a given unit is updated, all of the units that follow it in the ordering are also updated. To this end, we propose a new RNN unit: ON-LSTM, which achieves good performance on four different tasks: language modeling, unsupervised parsing, targeted syntactic evaluation, and logical inference.
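
The ordering is enforced by a cumulative-softmax ("cumax") activation; a minimal sketch of the master-gate shape:

    import numpy as np

    def cumax(x):
        # Cumulative softmax: a monotonically increasing vector in (0, 1],
        # a soft relaxation of a binary gate of the form (0, ..., 0, 1, ..., 1),
        # which is what imposes the ordering on the neurons.
        e = np.exp(x - x.max())
        return np.cumsum(e / e.sum())

    g = cumax(np.array([0.1, 2.0, -1.0, 0.5]))
    print(g)  # increasing gate values: higher-ranked neurons open later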


Invited Talk
{daterange} @ Great Hall AD
Learning (from) language in context
Noah Goodman

Human language use is exquisitely sensitive to and grounded in perceptual context. I will describe how the "reference game" paradigm allows us to explore these aspects of language use, and to elicit grounded, contextual language for training neural language models. The resulting models predict human language use with high quantitative accuracy. For humans, language is more than a means of communication, it is also the way we distribute the learning problem across generations. I will describe how "concept learning games" allow us to explore this cultural knowledge transmission process and build models that learn from language (somewhat) as people do.


Oral
{daterange} @ Great Hall AD
Meta-Learning Update Rules for Unsupervised Representation Learning
Luke Metz · Niru Maheswaranathan · Brian Cheung · Jascha Sohl-Dickstein

A major goal of unsupervised learning is to discover data representations that are useful for subsequent tasks, without access to supervised labels during training. Typically, this goal is approached by minimizing a surrogate objective, such as the negative log likelihood of a generative model, with the hope that representations useful for subsequent tasks will arise incidentally. In this work, we propose instead to directly target a later desired task by meta-learning an unsupervised learning rule, which leads to representations useful for that task. Here, our desired task (meta-objective) is the performance of the representation on semi-supervised classification, and we meta-learn an algorithm -- an unsupervised weight update rule -- that produces representations that perform well under this meta-objective. Additionally, we constrain our unsupervised update rule to be a biologically-motivated, neuron-local function, which enables it to generalize to novel neural network architectures. We show that the meta-learned update rule produces useful features and sometimes outperforms existing unsupervised learning techniques. We further show that the meta-learned unsupervised update rule generalizes to train networks with different widths, depths, and nonlinearities. It also generalizes to train on data with randomly permuted input dimensions and even generalizes from image datasets to a text task.


Oral
{daterange} @ Great Hall AD
Temporal Difference Variational Auto-Encoder
Karol Gregor · George Papamakarios · Frederic Besse · Lars Buesing · Theophane Weber

To act and plan in complex environments, we posit that agents should have a mental simulator of the world with three characteristics: (a) it should build an abstract state representing the condition of the world; (b) it should form a belief which represents uncertainty on the world; (c) it should go beyond simple step-by-step simulation, and exhibit temporal abstraction. Motivated by the absence of a model satisfying all these requirements, we propose TD-VAE, a generative sequence model that learns representations containing explicit beliefs about states several steps into the future, and that can be rolled out directly without single-step transitions. TD-VAE is trained on pairs of temporally separated time points, using an analogue of temporal difference learning used in reinforcement learning.


Oral
{daterange} @ Great Hall AD
Transferring Knowledge across Learning Processes
Sebastian Flennerhag · Pablo Moreno · Neil D Lawrence · Andreas Damianou

In complex transfer learning scenarios, new tasks might not be tightly linked to previous tasks. Approaches that transfer information contained only in the final parameters of a source model will therefore struggle. Instead, transfer learning at a higher level of abstraction is needed. We propose Leap, a framework that achieves this by transferring knowledge across learning processes. We associate each task with a manifold on which the training process travels from initialization to final parameters and construct a meta learning objective that minimizes the expected length of this path. Our framework leverages only information obtained during training and can be computed on the fly at negligible cost. We demonstrate that our framework outperforms competing methods, both in meta learning and transfer learning, on a set of computer vision tasks. Finally, we demonstrate that Leap can transfer knowledge across learning processes in demanding reinforcement learning environments (Atari) that involve millions of gradient steps.
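
The meta objective can be sketched as a path length in parameter space (a simplification: Leap's actual manifold also accounts for the loss surface):

    import numpy as np

    def path_length(thetas):
        # Total length of the path a training run traces through parameter
        # space; minimizing its expectation across tasks pulls the shared
        # initialization toward regions where training paths are short.
        diffs = np.diff(thetas, axis=0)
        return np.sqrt((diffs ** 2).sum(axis=1)).sum()

    # Toy data: parameters recorded at each of 100 gradient steps.
    thetas = np.cumsum(np.random.default_rng(0).standard_normal((100, 10)), axis=0)
    print(path_length(thetas))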