

Session

Thu PM Talks

Edward Grefenstette


Thu 3 May 14:30 - 15:15 PDT

Invited Talk
A Neural Network Model That Can Reason

Christopher Manning

Deep learning has had enormous success on perceptual tasks but still struggles in providing a model for inference. To address this gap, we have been developing Memory-Attention-Composition networks (MACnets). The MACnet design provides a strong prior for explicitly iterative reasoning, enabling it to learn explainable, structured reasoning, as well as achieve good generalization from a modest amount of data. The model builds on the great success of existing recurrent cells such as LSTMs: a MACnet is a sequence of a single recurrent Memory, Attention, and Composition (MAC) cell. However, its design imposes structural constraints on the operation of each cell and the interactions between them, incorporating explicit control and soft attention mechanisms. We demonstrate the model's strength and robustness on the challenging CLEVR dataset for visual reasoning (Johnson et al. 2016), achieving a new state-of-the-art 98.9% accuracy, halving the error rate of the previous best model. More importantly, we show that the new model is more data-efficient, achieving good results from even a modest amount of training data. Joint work with Drew Hudson.
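The abstract describes a recurrent cell with three units: a control unit that attends over the question to pick the current reasoning operation, a read unit that attends over a knowledge base (e.g. image features), and a write unit that integrates the result into memory. The following is a loose numpy caricature of that control/read/write loop, not the authors' implementation — all dimensions, weight matrices, and attention formulas here are invented for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8        # hidden size (toy)
T = 5        # question length
N = 10       # knowledge-base items, e.g. image regions
steps = 4    # reasoning steps: the same MAC cell applied recurrently

question_words = rng.normal(size=(T, d))   # contextual word vectors
knowledge = rng.normal(size=(N, d))        # image feature vectors
control = np.zeros(d)
memory = np.zeros(d)

W_c = rng.normal(size=(d, d)) * 0.1        # toy projection weights
W_r = rng.normal(size=(d, d)) * 0.1
W_m = rng.normal(size=(2 * d, d)) * 0.1

for _ in range(steps):
    # Control unit: soft attention over question words selects the
    # current reasoning operation.
    scores = question_words @ (W_c @ control + question_words.mean(0))
    control = softmax(scores) @ question_words

    # Read unit: attention over the knowledge base, conditioned on
    # the control state and the current memory.
    scores = knowledge @ (W_r @ (control * memory + control))
    retrieved = softmax(scores) @ knowledge

    # Write unit: integrate the retrieved information into memory.
    memory = np.tanh(np.concatenate([memory, retrieved]) @ W_m)

print(memory.shape)  # (8,)
```

The structural constraint the abstract mentions is visible here: information can only flow question → control → read → memory, with soft attention at each step, which is what makes the intermediate reasoning inspectable.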

Thu 3 May 15:15 - 15:30 PDT

Oral
Synthetic and Natural Noise Both Break Neural Machine Translation

Yonatan Belinkov · Yonatan Bisk

Character-based neural machine translation (NMT) models alleviate out-of-vocabulary issues, learn morphology, and move us closer to completely end-to-end translation systems. Unfortunately, they are also very brittle and easily falter when presented with noisy data. In this paper, we confront NMT models with synthetic and natural sources of noise. We find that state-of-the-art models fail to translate even moderately noisy texts that humans have no trouble comprehending. We explore two approaches to increase model robustness: structure-invariant word representations and robust training on noisy texts. We find that a model based on a character convolutional neural network is able to simultaneously learn representations robust to multiple kinds of noise.
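The synthetic noise the abstract refers to includes character-level perturbations of the kind humans read through easily, such as swapping adjacent characters or scrambling word-internal letters. A minimal sketch of two such noise generators (the specific function names and the example sentence are my own, not from the paper):

```python
import random

def swap_noise(word, rng):
    """Swap one pair of adjacent word-internal characters."""
    if len(word) <= 3:
        return word
    i = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def scramble_middle(word, rng):
    """Randomly permute all characters except the first and last."""
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

rng = random.Random(42)
sentence = "according to research the order of letters matters".split()
noisy = " ".join(scramble_middle(w, rng) for w in sentence)
print(noisy)
```

Feeding text noised this way to a trained character-level NMT model is how one can reproduce the brittleness the paper reports; robust training simply includes such noised source sentences in the training data.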

Thu 3 May 15:30 - 15:45 PDT

Oral
Beyond Word Importance: Contextual Decomposition to Extract Interactions from LSTMs

William Murdoch · Peter J Liu · Bin Yu

The driving force behind the recent success of LSTMs has been their ability to learn complex and non-linear relationships. Consequently, our inability to describe these relationships has led to LSTMs being characterized as black boxes. To address this, we introduce contextual decomposition (CD), an interpretation algorithm for analysing individual predictions made by standard LSTMs, without any changes to the underlying model. By decomposing the output of an LSTM, CD captures the contributions of combinations of words or variables to the final prediction of an LSTM. On the task of sentiment analysis with the Yelp and SST data sets, we show that CD is able to reliably identify words and phrases of contrasting sentiment, and how they are combined to yield the LSTM's final prediction. Using the phrase-level labels in SST, we also demonstrate that CD is able to successfully extract positive and negative negations from an LSTM, something which has not previously been done.
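A core ingredient of CD is attributing a non-linearity applied to a sum of terms (e.g. tanh of phrase and non-phrase contributions inside an LSTM gate) back to those terms. A generic sketch of that linearization step, averaging each term's marginal contribution over all orderings — a simplified, self-contained version rather than the full CD recursion through the LSTM cell:

```python
import math
from itertools import permutations
import numpy as np

def linearize(f, parts):
    """Attribute f(sum(parts)) to each additive part by averaging its
    marginal contribution over all orderings of the parts.
    Since f(0) = 0 for tanh/related activations, the contributions
    sum exactly to f(sum(parts))."""
    n = len(parts)
    contribs = [np.zeros_like(parts[0]) for _ in range(n)]
    for order in permutations(range(n)):
        running = np.zeros_like(parts[0])
        for idx in order:
            before = f(running)
            running = running + parts[idx]
            contribs[idx] += f(running) - before
    n_orderings = math.factorial(n)
    return [c / n_orderings for c in contribs]

beta = np.array([0.5])    # contribution from the phrase of interest
gamma = np.array([-1.2])  # contribution from the rest of the sentence
cb, cg = linearize(np.tanh, [beta, gamma])
print(cb, cg)  # two pieces that sum to tanh(beta + gamma)
```

Applying this kind of split at every gate of the LSTM, and propagating the phrase/non-phrase pieces through the recurrence, is what lets CD assign a score to a phrase — and to the interaction between phrases, as in negation — without retraining the model.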

Thu 3 May 15:45 - 16:00 PDT

Oral
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

Zhilin Yang · Zihang Dai · Ruslan Salakhutdinov · William W Cohen

We formulate language modeling as a matrix factorization problem, and show that the expressiveness of Softmax-based models (including the majority of neural language models) is limited by a Softmax bottleneck. Given that natural language is highly context-dependent, this further implies that in practice Softmax with distributed word embeddings does not have enough capacity to model natural language. We propose a simple and effective method to address this issue, and improve the state-of-the-art perplexities on Penn Treebank and WikiText-2 to 47.69 and 40.68 respectively. The proposed method also excels on the large-scale 1B Word dataset, outperforming the baseline by over 5.6 points in perplexity.
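The bottleneck is a rank argument: stacking the log-probabilities of a softmax over contexts gives a matrix of rank at most d + 1 (d the hidden size), while the true log-probability matrix of natural language is plausibly of much higher rank. The paper's remedy is a mixture of softmaxes, whose log is no longer a low-rank linear map. A toy numpy demonstration of the rank gap — here the mixture components use independent random context vectors for simplicity, whereas in the actual model they are learned projections of a shared hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, N, K = 3, 20, 15, 4   # hidden size, vocab, contexts, mixture comps

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

H = rng.normal(size=(N, d))   # context vectors
W = rng.normal(size=(V, d))   # output word embeddings

# Single softmax: log P = HW^T minus a per-row normalizer, so the
# log-probability matrix has rank at most d + 1. That is the bottleneck.
single = np.log(softmax(H @ W.T))
print(np.linalg.matrix_rank(single))   # at most d + 1 = 4

# Mixture of softmaxes: P = sum_k pi_k * softmax(h_k W^T). The log of a
# sum of exponentials is non-linear, so the rank cap no longer applies.
priors = softmax(rng.normal(size=(N, K)))    # mixture weights per context
Hk = rng.normal(size=(K, N, d))              # per-component context vectors
mos = np.einsum('nk,knv->nv', priors, softmax(Hk @ W.T))
print(np.linalg.matrix_rank(np.log(mos)))    # typically > d + 1
```

With distributed word embeddings of modest dimension, the single-softmax rank cap binds regardless of how well the model is trained; the mixture removes the cap at negligible parameter cost, which is what the perplexity gains reflect.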

Thu 3 May 16:00 - 16:30 PDT

Break
Coffee Break