Foundation models (FMs) are models trained on a large and diverse pool of data that can be adapted to a wide range of tasks. Recent examples of FMs include large language models (GPT-3, BERT, PaLM), image representation encoders (SimCLR), and image-text models (CLIP, DALL-E), all of which have revolutionized the way models are built in their domains. Foundation models remain poorly understood: their core driving principle is transfer learning, but scale and modern self-supervision techniques have led to emergent capabilities we might not have anticipated. The goal of this workshop is to highlight research that aims to improve our understanding of FMs. We interpret understanding liberally, ranging from purely empirical papers that highlight interesting phenomena to those that attempt to explain or provide theoretical foundations for such phenomena in potentially simplified settings.
Thu 12:15 a.m. - 12:45 a.m.
|
Invited Talk (Yann Dauphin): Leveraging Multiple Models and Multiple Tasks
(
Invited Talk
)
SlidesLive Video » Abstract: In recent years, there has been a surge in the number of trained models and datasets shared online. In this talk, we will investigate methods that allow us to leverage this trend. First, we will show that ensembles that diverge more in training methodology display categorically different generalization behavior, producing increasingly uncorrelated errors. We show these models specialize in subdomains of the data, leading to higher ensemble performance: with just 2 models (each with 76.5% ImageNet accuracy), we can create ensembles reaching 83.4% (a nearly 7-point boost). Second, we will discuss a method to make use of auxiliary tasks using an algorithm called ATTITTUD. This approach allows fine-grained resolution of conflicts between the gradient of the auxiliary task and that of the primary task. We will show that this approach produces significant improvements on benchmark tasks such as CheXpert. Bio: Yann N. Dauphin is a machine learning researcher at Google Research working on understanding the fundamentals of deep learning algorithms and leveraging that understanding in various applications. He has published seminal work on understanding the loss surface of neural nets. Prior to joining Google in 2019, he was a researcher at Facebook AI Research from 2015 to 2018, where his work led to award-winning scientific publications and helped improve automatic translation on Facebook.com. He completed his PhD at the University of Montreal under the supervision of Prof. Yoshua Bengio. During this time, he and his team won international machine learning competitions such as the Unsupervised Transfer Learning Challenge in 2013. |
Yann Dauphin 🔗 |
Thu 12:45 a.m. - 12:50 a.m.
|
Q&A
|
🔗 |
Thu 12:50 a.m. - 1:20 a.m.
|
Invited Talk (Jared Kaplan): AI Safety, RLHF, and Self-Supervision
(
Invited Talk
)
SlidesLive Video » |
Jared Kaplan 🔗 |
Thu 1:20 a.m. - 1:25 a.m.
|
Q&A
|
🔗 |
Thu 1:25 a.m. - 1:35 a.m.
|
Coffee Break
|
🔗 |
Thu 1:35 a.m. - 2:05 a.m.
|
Invited Talk (Lenka Zdeborová): Insights from exactly solvable high-dimensional models
(
Invited Talk
)
SlidesLive Video » Statistical physics has studied exactly solvable models of neural networks for more than four decades. In this talk, we will put this line of work in the perspective of recent empirical observations stemming from deep learning. We will describe several types of phase transition that appear in the limit of large sizes as a function of the amount of data. Discontinuous phase transitions are linked to an adjacent phase of algorithmic hardness. This so-called hard phase influences the behaviour of gradient-descent-like algorithms. We show a case where the hardness is mitigated by overparametrization, suggesting that the benefits of overparametrization may be linked to the use of a certain type of algorithm. We then discuss the overconfidence of overparametrized neural networks and evaluate methods to mitigate it and calibrate the uncertainty. |
Lenka Zdeborova 🔗 |
Thu 2:05 a.m. - 2:10 a.m.
|
Q&A
|
🔗 |
Thu 2:10 a.m. - 2:40 a.m.
|
Invited Talk (Sanjeev Arora): Task-specific Skill Localization in Fine-tuned Language Models
(
Invited Talk
)
SlidesLive Video » |
Sanjeev Arora 🔗 |
Thu 2:40 a.m. - 2:45 a.m.
|
Q&A
|
🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Accelerating Neural Self-Improvement via Bootstrapping
(
Poster
)
link »
Few-shot learning with sequence-processing neural networks (NNs) has recently attracted a new wave of attention in the context of large language models. In the standard N-way K-shot learning setting, an NN is explicitly optimised to learn to classify unlabelled inputs by observing a sequence of NK labelled examples. This pressures the NN to learn a learning algorithm that achieves maximum performance, given the limited number of training examples. Here we study an auxiliary loss that encourages further acceleration of few-shot learning, by applying recently proposed bootstrapped meta-learning to NN few-shot learners: we optimise the K-shot learner to match its own performance achievable by observing more than NK examples, using only NK examples. Promising results are obtained on the standard Mini-ImageNet dataset. |
Kazuki Irie · Jürgen Schmidhuber 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Mini-Batch Optimization of Contrastive Loss
(
Poster
)
link »
In this paper, we study the effect of mini-batch selection on contrastive loss and propose new mini-batch selection methods to improve efficiency. Theoretically, we show that both the full-batch and mini-batch settings share the same solution, the simplex Equiangular Tight Frame (ETF), if all $\binom{N}{B}$ mini-batches are seen during training. However, when not all possible batches are seen, mini-batch training can lead to suboptimal solutions. To address this issue, we propose efficient mini-batch selection methods that compare favorably with existing methods. Our experimental results demonstrate the effectiveness of our proposed methods in finding a near-optimal solution with a reduced number of gradient steps and outperforming existing mini-batch selection methods.
|
Kartik Sreenivasan · Keon Lee · Jeong-Gwan Lee · Anna Lee · Jaewoong Cho · Jy-yong Sohn · Dimitris Papailiopoulos · Kangwook Lee 🔗 |
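For readers who want to see the objective being analyzed, a minimal PyTorch sketch of a per-mini-batch contrastive (NT-Xent-style) loss is given below; the batch size, temperature, and random inputs are illustrative placeholders, and the mini-batch selection strategies proposed in the paper are not shown.

```python
import torch
import torch.nn.functional as F

def minibatch_contrastive_loss(z1, z2, temperature=0.5):
    """NT-Xent-style contrastive loss over one mini-batch of B positive pairs.

    z1, z2: (B, d) embeddings of two augmented views; row i of z1 and z2
    form a positive pair, all other rows in the batch act as negatives.
    """
    B = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, d)
    sim = z @ z.t() / temperature                        # (2B, 2B)
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    # The positive for index i is index i + B (and vice versa).
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)])
    return F.cross_entropy(sim, targets)

# Toy usage with random "embeddings" standing in for an encoder's outputs.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(minibatch_contrastive_loss(z1, z2))
```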
Thu 4:00 a.m. - 5:00 a.m.
|
On the Role of Attention in Prompt-tuning
(
Poster
)
link »
Prompt-tuning is an emerging strategy to adapt large language models (LLMs) to downstream tasks by learning a (soft-)prompt parameter from data. Despite its success in LLMs, there is limited theoretical understanding of the power of prompt-tuning and the role of the attention mechanism in prompting. In this work, we explore prompt-tuning for one-layer attention architectures and study contextual mixture models where each input token belongs to a context-relevant or -irrelevant set. We isolate the role of prompt-tuning through a self-contained prompt-attention model. Our contributions are as follows: (1) We show that softmax-prompt-attention is provably more expressive than softmax-self-attention and linear-prompt-attention under our contextual data model. (2) We analyze the initial trajectory of gradient descent and show that it learns the prompt and prediction head with near-optimal sample complexity, and we demonstrate how the prompt can provably attend to sparse context-relevant tokens. We also provide experiments that verify our theoretical insights on real datasets and demonstrate how prompt-tuning enables the model to attend to context-relevant information. |
Samet Oymak · Ankit Singh Rawat · Mahdi Soltanolkotabi · Christos Thrampoulidis 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Looped Transformers as Programmable Computers
(
Poster
)
link »
We present a framework for using transformer networks as universal computers by programming them with specific weights and placing them in a loop. Our input sequence acts as a punchcard, consisting of instructions and memory for data read/writes. We demonstrate that a constant number of encoder layers can emulate basic computing blocks, including lexicographic operations, non-linear functions, function calls, program counters, and conditional branches. Using this framework, we emulate a computer using a simple instruction-set architecture, which allows us to map iterative algorithms to programs that can be executed by a constant-depth looped transformer network. We show how a single frozen transformer, instructed by its input, can emulate a basic calculator, a basic linear algebra library, and even a full backpropagation-based in-context learning algorithm. Our findings reveal the potential of transformer networks as programmable compute units and offer insight into the mechanics of attention. |
Angeliki Giannou · Shashank Rajput · Jy-yong Sohn · Kangwook Lee · Jason Lee · Dimitris Papailiopoulos 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Diffusion Models are Minimax Optimal Distribution Estimators
(
Poster
)
link »
We provide the first rigorous analysis of estimation error bounds for diffusion modeling over well-known function spaces. The highlight of this paper is that when the true density function belongs to the Besov space and the empirical score matching loss is properly minimized, the generated data distribution achieves nearly minimax optimal estimation rates in the total variation distance and in the Wasserstein distance of order one. We expect these results to advance the theoretical understanding of diffusion modeling and its ability to generate verisimilar outputs. |
Kazusato Oko · Akiyama Shunta · Taiji Suzuki 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
The Effects of Pretraining Task Diversity on In-Context Learning of Ridge Regression
(
Poster
)
link »
Pretrained transformers can do in-context learning (ICL), i.e. learn new tasks in the forward pass from a few examples provided in context. But can the model do ICL for completely new tasks or is this ability restricted to tasks similar to those seen during pretraining? How does the diversity of tasks seen during pretraining affect the model's ability to do ICL? In the setting of ICL for ridge regression, we show that, if pretrained on few tasks sampled from a latent distribution, the model behaves like the Bayesian estimator with a prior equal to the discrete distribution over the sampled tasks. But if pretrained on a sufficiently large number of tasks, the model behaves like the Bayesian estimator with prior equal to the underlying latent distribution over tasks. Our results suggest that, as the diversity of the pretraining dataset increases, the model transitions from doing ICL on tasks similar to ones seen during pretraining to learning the underlying task structure and doing ICL on new tasks. |
Allan Raventos · Mansheej Paul · Feng Chen · Surya Ganguli 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Conservative Prediction via Transductive Confidence Minimization
(
Poster
)
link »
Errors of machine learning models can be prohibitively costly, especially in safety-critical settings such as healthcare. However, machine learning may be applicable to such scenarios if the learned model can abstain and defer to a human on difficult examples instead of making errors. In safety-critical settings, we prefer conservative models that defer to humans at the cost of some overall accuracy. Unfortunately, selective classification and out-of-distribution detection are notably difficult as it is hard to anticipate all possible examples. To mitigate this challenge, we focus on the transductive setting, where unlabeled examples from the test distribution are available during training. We propose transductive confidence minimization (TCM), which minimizes prediction confidence on unlabeled test examples while simultaneously optimizing the training objective. We theoretically show that TCM learns a lower bound on the true confidence, and that this property can be leveraged to provably detect examples that are sufficiently different from training examples, regardless of what distribution they came from. In our experiments, TCM consistently shows high performance, achieving the highest OOD detection performance compared to 6 other methods on 9 out of 10 ID->OOD pairs and consistently outperforming methods for selective classification in settings where we test on data from a previously unseen distribution. |
Caroline Choi · Fahim Tajwar · Yoonho Lee · Huaxiu Yao · Ananya Kumar · Chelsea Finn 🔗 |
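A rough sketch of the transductive confidence-minimization idea described in the abstract above, combining the supervised training objective with a confidence penalty on unlabeled test inputs; the specific confidence measure (max softmax probability) and the weight `lam` are illustrative assumptions, not necessarily the authors' exact choices.

```python
import torch
import torch.nn.functional as F

def tcm_style_loss(model, x_train, y_train, x_unlabeled_test, lam=1.0):
    """Supervised loss on labeled training data plus a confidence penalty
    on unlabeled test inputs (transductive setting)."""
    # Standard training objective on the labeled source data.
    supervised = F.cross_entropy(model(x_train), y_train)

    # Penalize confidence (here: max softmax probability) on unlabeled test data.
    test_probs = F.softmax(model(x_unlabeled_test), dim=1)
    confidence = test_probs.max(dim=1).values.mean()

    return supervised + lam * confidence

# Toy usage with a linear classifier standing in for the real model.
model = torch.nn.Linear(16, 3)
x_tr, y_tr = torch.randn(32, 16), torch.randint(0, 3, (32,))
x_te = torch.randn(20, 16)
print(tcm_style_loss(model, x_tr, y_tr, x_te))
```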
Thu 4:00 a.m. - 5:00 a.m.
|
Controlled assessment of CLIP-style language-aligned vision models in prediction of brain & behavioral data
(
Poster
)
link »
One of the core algorithmic forces driving the development of modern foundation models is the use of contrastive language alignment to facilitate more robust visual representation learning. The clear benefits conferred by CLIP-style multimodal objective functions in computer vision have generated a frenzy of interest in the application of these models to a long-debated question in cognitive neuroscience: to what extent does language shape perceptual representation in the human mind? In this work, we explore this question in two distinct domains: the prediction of brain activity in the human ventral visual system (as measured by high-resolution fMRI), and the prediction of visually evoked affect in human image assessment (as measured by self-report). In both of these cases, we leverage popular open-source foundation models (e.g. OpenAI's CLIP) in conjunction with empirically controlled alternatives (e.g. Meta AI's SLIP models) to better isolate the effects of language alignment while holding architecture and dataset constant. These controlled experiments offer mixed evidence regarding the influence of language on perceptual representation: specifically, when architecture and dataset are held constant, we find no evidence that language-alignment improves the brain predictivity of vision models, but we do find strong evidence that it increases predictivity of behavioral image assessments. We offer these examples as a case study in the urgency of injecting greater empirical control into the development and evaluation of foundation models, whose emergent properties may be attributable to a variety of sources that only systematic model comparison can fully disentangle. |
Colin Conwell · Jacob Prince · Christopher Hamblin · George Alvarez 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
The Independent Compositional Subspace Hypothesis for the Structure of CLIP's Last Layer
(
Poster
)
link »
In this paper, we propose a hypothesis which posits that CLIP disentangles compositional visual attributes into orthogonal, independent subspaces which CLIP uses to build compositional representations of images. Our hypothesis suggests that CLIP learns compositional techniques that are similar to humans'. We find five core compositional attributes predicted by the hypothesis: color, size, counting, camera view, and pattern. We empirically test their properties and find that they code for their respective compositional attribute type and are essentially orthogonal to one another, as well as the subject of the image. |
Max Wolff · Wieland Brendel · Stuart Wolff 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Exploring Demonstration Ensembling for In-context Learning
(
Poster
)
link »
In-context learning (ICL) operates by showing language models (LMs) examples of input-output pairs for desired tasks, i.e., demonstrations. The standard approach for ICL is to prompt the LM with concatenated demonstrations followed by the test input. This approach suffers from some issues. First, concatenation offers almost no control over the contribution of each demonstration to the model prediction. This can be sub-optimal when some demonstrations are not very relevant to the test example. Second, due to the input length limit of transformer models, it can be infeasible to fit many examples into the context, especially when dealing with long-input tasks. In this work, we explore Demonstration Ensembling (DENSE) as an alternative to simple concatenation. DENSE predicts outputs using subsets (i.e., buckets) of the demonstrations and then combines the output probabilities resulting from each subset to produce the final prediction. We study different ensembling methods using GPT-J and experiment on 7 different language tasks. Our experiments show max ensembling to outperform concatenation by an average of 3.8 points. |
Muhammad Khalifa · Lajanugen Logeswaran · Moontae Lee · Honglak Lee · Lu Wang 🔗 |
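To make the bucketed ensembling described above concrete, here is a schematic sketch; `label_probs` is a hypothetical stand-in for querying a language model for label probabilities given a prompt, and the bucketing and prompt format are simplified assumptions.

```python
import random

def dense_predict(demos, test_input, labels, label_probs, n_buckets=3, combine="max"):
    """Demonstration ensembling: score the test input under several subsets
    (buckets) of demonstrations and combine the per-bucket label probabilities.

    demos: list of (input, output) pairs; label_probs(prompt, labels) is assumed
    to return a dict {label: probability} from some language model.
    """
    demos = list(demos)
    random.shuffle(demos)
    buckets = [demos[i::n_buckets] for i in range(n_buckets)]

    per_bucket = []
    for bucket in buckets:
        prompt = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in bucket)
        prompt += f"\nInput: {test_input}\nOutput:"
        per_bucket.append(label_probs(prompt, labels))

    if combine == "max":   # max ensembling, the best-performing variant in the study
        scores = {l: max(p[l] for p in per_bucket) for l in labels}
    else:                  # mean ensembling
        scores = {l: sum(p[l] for p in per_bucket) / n_buckets for l in labels}
    return max(scores, key=scores.get)

# Toy usage with a dummy scorer in place of a real LM.
dummy = lambda prompt, labels: {l: random.random() for l in labels}
demos = [("great movie", "positive"), ("terrible plot", "negative")] * 3
print(dense_predict(demos, "loved it", ["positive", "negative"], dummy))
```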
Thu 4:00 a.m. - 5:00 a.m.
|
A Tale of Two Circuits: Grokking as Competition of Sparse and Dense Subnetworks
(
Poster
)
link »
Grokking is a phenomenon where a model trained on an algorithmic task first overfits but, then, after a large amount of additional training, undergoes a phase transition to generalize perfectly. We empirically study the internal structure of networks undergoing grokking on the sparse parity task, and find that the grokking phase transition corresponds to the emergence of a sparse subnetwork that dominates model predictions. On an optimization level, we find that this subnetwork arises when a small subset of neurons undergoes rapid norm growth, whereas the other neurons in the network decay slowly in norm. Thus, we suggest that the grokking phase transition can be understood to emerge from competition of two largely distinct subnetworks: a dense one that dominates before the transition and generalizes poorly, and a sparse one that dominates afterwards. |
William Merrill · Nikolaos Tsilivis · Aman Shukla 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
A Comprehensive Benchmark of Human-Like Relational Reasoning for Text-to-Image Foundation Models
(
Poster
)
link »
Relations are basic building blocks of human cognition. Classic and recent work suggests that many relations are early developing, and quickly perceived. Machine models that aspire to human-level perception and reasoning should reflect the ability to recognize and reason generatively about relations. We report a systematic empirical examination of a recent text-guided image generation model (DALL-E 2), using a set of 15 basic physical and social relations studied or proposed in the literature, and judgements from human participants (N = 169). Overall, we find that only 22% of images matched basic relation prompts. Based on a quantitative examination of people's judgments, we suggest that current image generation models do not yet have a grasp of even basic relations involving simple objects and agents. We examine reasons for model successes and failures, and suggest possible improvements based on computations observed in biological intelligence. |
Colin Conwell · Tomer Ullman 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations
(
Poster
)
link »
Joint-embedding based learning (e.g., SimCLR, MoCo, DINO) and reconstruction-based learning (e.g., BEiT, SimMIM, MAE) are the two leading paradigms for self-supervised learning of vision transformers, but they differ substantially in their transfer performance. Here, we aim to explain these differences by analyzing the impact of these objectives on the structure and transferability of their representations. Our analysis reveals that reconstruction-based learning features are significantly dissimilar to joint-embedding based learning features and that models trained with similar objectives learn similar features even across architectures. These differences arise early in the network, primarily driven by attention and normalization layers. We find that joint-embedding features yield better linear probe transfer for classification because the different objectives drive different distributions of information and invariances in the representation. These differences explain opposite trends in transfer performance for downstream tasks that require spatial specificity in features. Finally, we address how fine-tuning changes reconstructive representations to enable better transfer, showing that it re-organizes the information to be more similar to pre-trained joint embedding models. |
Shashank Shekhar · Florian Bordes · Pascal Vincent · Ari Morcos 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Robustness of edited neural networks
(
Poster
)
link »
Successful deployment in uncertain, real-world environments requires that deep learning models can be efficiently and reliably modified in order to adapt to unexpected issues. However, the current trend toward ever-larger models makes standard retraining procedures an ever-more expensive burden. For this reason, there is growing interest in model editing, which enables computationally inexpensive, interpretable, post-hoc model modifications. While many model editing techniques are promising, research on the properties of edited models is largely limited to evaluation of validation accuracy. The robustness of edited models is an important and yet mostly unexplored topic. In this paper, we employ recently developed techniques from the field of deep learning robustness to investigate both how model editing affects the general robustness of a model, as well as the robustness of the specific behavior targeted by the edit. We find that edits tend to reduce general robustness, but that the degree of degradation depends on the editing algorithm chosen. In particular, robustness is best preserved by more constrained techniques that modify less of the model. Motivated by these observations, we introduce two new model editing algorithms, direct low-rank model editing and 1-layer interpolation (1-LI), which each exhibit strong generalization performance. |
Davis Brown · Charles Godfrey · Cody Nizinski · Jonathan Tu · Henry Kvinge 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Exploring the Representation Manifolds of Stable Diffusion Through the Lens of Intrinsic Dimension
(
Poster
)
link »
Prompting has become an important mechanism by which users can more effectively interact with many flavors of foundation model. Indeed, the last several years have shown that well-honed prompts can sometimes unlock emergent capabilities within such models. While there has been a substantial amount of empirical exploration of prompting within the community, relatively few works have studied prompting at a mathematical level. In this work we aim to take a first step towards understanding basic geometric properties induced by prompts in Stable Diffusion, focusing on the intrinsic dimension of internal representations within the model. We find that choice of prompt has a substantial impact on the intrinsic dimension of representations at both layers of the model which we explored, but that the nature of this impact depends on the layer being considered. For example, in certain bottleneck layers of the model, intrinsic dimension of representations is correlated with prompt perplexity (measured using a surrogate model), while this correlation is not apparent in the latent layers. Our evidence suggests that intrinsic dimension could be a useful tool for future studies of the impact of different prompts on text-to-image models. |
Henry Kvinge · Davis Brown · Charles Godfrey 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters
(
Poster
)
link »
Chain-of-Thought (CoT) prompting, which encourages language models (LMs) to generate intermediate rationales for the final answer through in-context demonstrations, dramatically improves large LMs' ability to solve reasoning tasks. Despite its success, there is little understanding of what makes CoT prompting effective and which aspects of the demonstrated reasoning steps contribute to its performance. In this paper, we show that prompting with invalid demonstrations has surprisingly little effect on CoT reasoning, achieving over 80-90% of the performance obtained using the original CoT under various metrics while still generating coherent lines of reasoning during inference. Further experiments show that other aspects of the rationales, such as being relevant to the query and correctly ordering the reasoning steps, are the actual keys to the effectiveness of CoT. Overall, these findings deepen our understanding of CoT prompting, while leading to new questions regarding large LMs’ capability to learn to reason in context and reflections on benchmarking few-shot reasoning. |
Boshi Wang · Sewon Min · Xiang Deng · Jiaming Shen · You Wu · Luke Zettlemoyer · Huan Sun 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
(
Poster
)
link »
Progress in machine learning has been driven in large part by massive increases in data. However, large web-scale datasets such as LAION are largely uncurated beyond searches for exact duplicates, potentially leaving much redundancy. Here, we introduce SemDeDup, a method which leverages embeddings from pre-trained models to identify and remove "semantic duplicates": data pairs which are semantically similar, but not exactly identical. Removing semantic duplicates preserves performance and speeds up learning. Analyzing a subset of LAION, we show that SemDeDup can remove 50% of the data with minimal performance loss, effectively halving training time. Moreover, out-of-distribution performance increases. Also, analyzing language models trained on C4, a partially curated dataset, we show that SemDeDup improves over prior approaches. SemDeDup provides an example of how simple ways of leveraging quality embeddings can be used to make models learn faster with less data. |
Amro Kamal · Kushal Tirumala · Daniel Simig · Surya Ganguli · Ari Morcos 🔗 |
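A minimal sketch of the semantic-deduplication idea described above, in plain NumPy; the similarity threshold is an illustrative choice, and the clustering step the paper uses to scale to web-sized corpora is replaced here by an all-pairs comparison.

```python
import numpy as np

def semdedup(embeddings, threshold=0.95):
    """Greedily drop one item from every pair whose cosine similarity exceeds
    `threshold`. Returns indices of the examples to keep.

    embeddings: (N, d) array from a pre-trained encoder. At web scale one would
    first cluster and deduplicate within clusters; here we compare all pairs.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T
    keep = np.ones(len(z), dtype=bool)
    for i in range(len(z)):
        if not keep[i]:
            continue
        # Remove later items that are near-duplicates of item i.
        dup = sim[i, i + 1:] > threshold
        keep[i + 1:][dup] = False
    return np.where(keep)[0]

# Toy usage: ten random embeddings plus one near-duplicate of the first.
emb = np.random.randn(10, 64)
emb = np.vstack([emb, emb[0] + 1e-3 * np.random.randn(64)])
print(semdedup(emb))
```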
Thu 4:00 a.m. - 5:00 a.m.
|
Effective Data Augmentation With Diffusion Models
(
Poster
)
link »
Data augmentation is one of the most prevalent tools in deep learning, underpinning many recent advances, including those from classification, generative models, and representation learning. The standard approach to data augmentation combines simple transformations like rotations and flips to generate new images from existing ones. However, these new images lack diversity along key semantic axes present in the data. Consider the task of recognizing different animals. Current augmentations fail to produce diversity in task-relevant high-level semantic attributes like the species of the animal. We address the lack of diversity in data augmentation with image-to-image transformations parameterized by pre-trained text-to-image diffusion models. Our method edits images to change their semantics using an off-the-shelf diffusion model, and generalizes to novel visual concepts from a few labelled examples. We evaluate our approach on image classification tasks in a few-shot setting, and on a real-world weed recognition task, and observe an improvement in accuracy in tested domains. |
Brandon Trabucco · Kyle Doherty · Max Gurinas · Ruslan Salakhutdinov 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Text-to-Image Diffusion Models are Zero-Shot Classifiers
(
Poster
)
link »
Text-to-image diffusion models have demonstrated remarkable generative capabilities, suggesting they learn informative representations of image-text data. However, their abilities are not fully understood and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is using a diffusion model's ability to denoise a noised image given a textual description of a label as a proxy for that label's likelihood. We apply our method to Imagen, using it to probe fine-grained aspects of Imagen's knowledge and comparing it with CLIP's zero-shot abilities. Imagen performs competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, it is more robust than CLIP and can successfully perform attribute binding while CLIP does not. Although generative pre-training is common in NLP, visual foundation models often use other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for visual and vision-language problems. |
Kevin Clark · Priyank Jaini 🔗 |
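The scoring rule below illustrates the general recipe described in the abstract (noise the image, then pick the label whose text conditioning best predicts the noise); `denoiser` and `encode_text` are hypothetical stand-ins for a text-conditioned diffusion model's components, not Imagen's actual interface, and a full implementation would also average over diffusion timesteps.

```python
import torch

def diffusion_zero_shot_classify(image, class_names, denoiser, encode_text,
                                 n_trials=8, sigma=0.5):
    """Score each label by how well the diffusion model denoises the image when
    conditioned on a prompt naming that label; lower error = more likely label.

    denoiser(noisy_image, text_embedding) -> predicted noise  (assumed interface)
    encode_text(prompt) -> text embedding                      (assumed interface)
    """
    errors = []
    for name in class_names:
        text = encode_text(f"a photo of a {name}")
        err = 0.0
        for _ in range(n_trials):                 # average over noise draws
            noise = sigma * torch.randn_like(image)
            pred = denoiser(image + noise, text)
            err += torch.mean((pred - noise) ** 2)
        errors.append(err / n_trials)
    return class_names[int(torch.argmin(torch.stack(errors)))]

# Toy usage with dummy stand-ins for the diffusion model components.
dummy_denoise = lambda x, t: torch.zeros_like(x)
dummy_text = lambda s: None
img = torch.randn(3, 64, 64)
print(diffusion_zero_shot_classify(img, ["cat", "dog"], dummy_denoise, dummy_text))
```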
Thu 4:00 a.m. - 5:00 a.m.
|
Simple Hardware-Efficient Long Convolutions for Sequence Modeling
(
Poster
)
link »
State space models (SSMs) have high performance on long sequence modeling but require sophisticated initialization techniques and specialized implementations for high quality and runtime performance. We study whether a simple alternative can match SSMs in performance and efficiency: directly learning long convolutions over the sequence. We find that simply squashing the long convolutional kernel weights is enough to match SSMs in performance on a range of tasks including the long range arena (LRA) and language modeling. To also improve runtime performance, we next develop FlashButterfly, an IO-aware algorithm to compute long convolutions efficiently. FlashButterfly appeals to classic Butterfly decompositions of the convolution to reduce GPU memory IO and increase FLOP utilization. FlashButterfly speeds up the LRA benchmark by 7.0× over Transformers, and allows us to train on Path256, a challenging task with sequence length 64K, where we set state-of-the-art by 29.1 points while training 7.2× faster than prior work. |
Dan Fu · Elliot Epstein · Eric Nguyen · Armin Thomas · Michael Zhang · Tri Dao · Atri Rudra · Christopher Re 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Understanding HTML with Large Language Models
(
Poster
)
link »
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks. Yet, their capabilities for HTML understanding -- i.e., parsing the raw HTML of a webpage, with applications to automation of web-based tasks, crawling, and browser-assisted retrieval -- have not been fully explored. We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks: (i) Semantic Classification of HTML elements, (ii) Description Generation for HTML inputs, and (iii) Autonomous Web Navigation of HTML pages. While previous work has developed dedicated architectures and training procedures for HTML understanding, we show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks. For instance, fine-tuned LLMs are 12% more accurate at semantic classification compared to models trained exclusively on the task dataset. Moreover, when fine-tuned on data from the MiniWoB benchmark, LLMs successfully complete 50% more tasks using 192× less data compared to the previous best supervised model. Out of the LLMs we evaluate, we show evidence that T5-based models are ideal due to their bidirectional encoder-decoder architecture. To promote further research on LLMs for HTML understanding, we create and open-source a large-scale HTML dataset distilled and auto-labeled from CommonCrawl. |
Izzeddin Gur · Ofir Nachum · Yingjie Miao · Mustafa Safdari · Austin Huang · Aakanksha Chowdhery · SHARAN NARANG · Noah Fiedel · Aleksandra Faust 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Instruction-Finetuned Foundation Models for Multimodal Web Navigation
(
Poster
)
link »
We propose an instruction-aligned multimodal agent for autonomous web navigation -- i.e., sequential decision-making tasks employing a computer interface. Our approach is based on supervised finetuning of vision and language foundation models on a large corpus of web data consisting of webpage screenshots and HTML. Specifically, we use vision transformers on sequences of webpage screenshots to extract patch-level image features. These features are concatenated with embeddings of tokens in HTML documents. Using an instruction-finetuned large language model, we jointly encode both vision and HTML modalities and decode web actions such as click and type. We show that our method outperforms previous approaches by a significant margin, even in handling out-of-distribution HTML and compositional tasks. On the MiniWoB benchmark, we improve on previous approaches that use only HTML input by more than 17.7%, even surpassing the performance of RL-finetuned models. On the recent WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing state-of-the-art PaLM-540B. We also collect 347K gold demonstrations using our trained models, 29 times more than prior work, and make them available to promote future research in this area. We believe that our work is a step towards building capable and generalist decision-making agents for computer interfaces. |
Hiroki Furuta · Ofir Nachum · Kuang-Huei Lee · Yutaka Matsuo · Shixiang Gu · Izzeddin Gur 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Out-of-context Meta-learning in Large Language Models
(
Poster
)
link »
Brown et al. (2020) famously introduced the phenomenon of in-context meta-learning in large language models (LLMs). Our work establishes the existence of a phenomenon we call out-of-context meta-learning via carefully designed synthetic experiments with large language models. We argue that out-of-context meta-learning is an important and surprising capability of LLMs, which may lead them to more readily “internalize” the semantic content of text that is, or appears to be, broadly useful (such as true statements, or text from authoritative sources) and apply it in appropriate contexts. We also raise the question of how this phenomenon emerges, and discuss two possible explanations: one relying on the way LLMs store knowledge in their parameters, and another suggesting that the implicit gradient alignment bias of gradient-descent-based methods may be responsible. |
Dmitrii Krasheninnikov · Egor Krasheninnikov · David Krueger 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
What Happens to the Source Domain in Transfer Learning?
(
Poster
)
link »
We investigate the impact of the source domain in supervised transfer learning, focusing on image classification. In particular, we aim to assess to what extent a fine-tuned model can still recognize the classes of the source domain. Furthermore, we want to understand how this ability impacts the target domain. We demonstrate how the retained knowledge about the old classes in a popular foundation model can interfere with the model’s ability to learn and recognize the new classes. This interference can have significant implications and highlights an inherent shortcoming of supervised transfer learning. |
Amal Alnouri · Bilal Alsallakh 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Modality-Aware Adaptation of Contrastive Language-Image Models
(
Poster
)
link »
Despite their high levels of robustness, Contrastive Language-Image Models (CLIP) still require some form of downstream adaptation when applied to tasks sufficiently out-of-domain with respect to their training set. Recent methods propose light-weight adapters on the model features and show strong performance, primarily focused on the few-shot domain. All such approaches, however, require per-task hyperparameter tuning, which necessitates access to a validation set, limiting their applicability in practice. As an alternative, we propose Modality Aware Tangent-space Retrieval (MATeR), a training-free, interpretable adapter which outperforms all recent methods when per-task hyperparameter tuning is prohibited. MATeR considers the manifold formed by CLIP embeddings when incorporating out-of-domain few-shot class information and its predictions are invariant to the modality gap; it represents the first approach that considers the geometric structure of the CLIP latent space to inform downstream task adaptation. Additionally, we demonstrate that a variant of MATeR can significantly increase zero-shot accuracy with only a handful of unlabelled images, far fewer than the number of classes. |
Alexander Long · Thalaiyasingam Ajanthan · Anton Hengel 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
TabRet: Pre-training Transformer-based Tabular Models for Unseen Columns
(
Poster
)
link »
We present TabRet, a pre-trainable Transformer-based model for tabular data. TabRet is designed to work on a downstream task that contains columns not seen in pre-training. Unlike other methods, TabRet has an extra learning step before fine-tuning called retokenizing, which calibrates feature embeddings based on the masked autoencoding loss. In experiments, we pre-trained TabRet with a large collection of public health surveys and fine-tuned it on classification tasks in healthcare, and TabRet achieved the best AUC performance on four datasets. In addition, an ablation study shows retokenizing and random shuffle augmentation of columns during pre-training contributed to performance gains. |
Soma Onishi · Kenta Oono · Kohei Hayashi 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Do Video-Language Foundation Models have a Sense of Time?
(
Poster
)
link »
Modelling and understanding time remains a challenge in contemporary video understanding models. Time also appears in language through temporal relations. Video-language models can benefit from having a sense of time, especially since language provides an interface for generalization. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We construct a simple synthetic dataset to measure such temporal understanding in video-language models and find that six existing models struggle to understand even such simple relations. We then ask whether it is feasible to equip these foundation models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without needing data- and compute-intensive training from scratch. |
Piyush Nitin Bagad · Makarand Tapaswi · Cees G Snoek 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
What Contrastive Learning Learns Beyond Class-wise Features?
(
Poster
)
link »
In recent years, contrastive learning has achieved performance comparable to supervised learning in representation learning. However, the transferability of different contrastive learning methods to downstream tasks often varies greatly. In this paper, we study the downstream generalization ability of two contrastive learning methods: SimCLR and Spectral Contrastive Learning (Spectral CL). We find that, beyond class-wise features, contrastive learning also learns two other types of features, which we call shared features and subclass features, and that these play an important role in model transferability. SimCLR learns more shared and subclass features than Spectral CL, resulting in better transferability. We theoretically and experimentally reveal the mechanism by which SimCLR can learn more diverse features than Spectral CL. Based on this, we propose a method called High-pass Spectral CL to improve the transferability and generalization of Spectral CL, which achieves better performance than SimCLR and Spectral CL. |
Xingyuming Liu · Yifei Wang · Yisen Wang 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Look Globally and Locally: Inter-Intra Contrastive Learning from Unlabeled Videos
(
Poster
)
link »
State-of-the-art video contrastive learning methods spatiotemporally augment two clips from the same video as positives. By only sampling positive clips from the same video, these methods neglect other semantically related videos that can also be useful. To address this limitation, we leverage nearest-neighbor videos from the global space as additional positives, thus improving diversity and introducing a more relaxed notion of similarity that extends beyond video and even class boundaries. Our Inter-Intra Video Contrastive Learning (IIVCL) improves performance and generalization on video classification, detection, and retrieval tasks. |
David Fan · Deyu Yang · Xinyu Li · Vimal Bhat · Rohith MV 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Improving Foundation Models for Few-Shot Learning via Multitask Finetuning
(
Poster
)
link »
Foundation models have become essential tools for AI. In this paper, we study the problem of adapting foundation models, pre-trained using contrastive learning, to downstream tasks with limited labels. We explore the paradigm of finetuning a foundation model before adapting to a target task, using a set of related tasks with a few labeled samples. We show both theoretically and empirically that with a diverse set of related tasks this finetuning leads to reduced error in the target task, when compared with directly adapting the same pre-trained model, e.g., at least a 6% target accuracy improvement on miniImageNet. |
Zhuoyan Xu · Zhenmei Shi · Junyi Wei · Yin Li · Yingyu Liang 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
A Kernel-Based View of Language Model Fine-Tuning
(
Poster
)
link »
It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK)---which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization---describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of NTK for computer vision tasks (Wei et al., 2022). We extend the NTK formalism to Adam and use Tensor Programs (Yang, 2020) to characterize conditions under which the NTK lens may describe fine-tuning updates to pre-trained language models. Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. Finally, we use this kernel view to propose an explanation for the success of parameter-efficient subspace-based fine-tuning methods.
|
Sadhika Malladi · Alexander Wettig · Dingli Yu · Danqi Chen · Sanjeev Arora 🔗 |
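For a concrete handle on the kernel view, recall that the empirical NTK between two inputs is the inner product of the network's parameter gradients at those inputs; a small PyTorch sketch follows (the Tensor Programs machinery and the Adam extension discussed in the paper are not reproduced here, and the toy MLP is a placeholder for a pre-trained LM head).

```python
import torch

def empirical_ntk(model, x1, x2):
    """Empirical neural tangent kernel entry
    K(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>
    for a scalar-output model, evaluated at the current parameters."""
    def grad_vector(x):
        out = model(x).sum()  # scalar output
        grads = torch.autograd.grad(out, model.parameters())
        return torch.cat([g.reshape(-1) for g in grads])
    return torch.dot(grad_vector(x1), grad_vector(x2))

# Toy usage: a tiny MLP standing in for a pre-trained model.
net = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
x_a, x_b = torch.randn(1, 8), torch.randn(1, 8)
print(empirical_ntk(net, x_a, x_b))
```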
Thu 4:00 a.m. - 5:00 a.m.
|
Variable Discretization for Self-Supervised Learning
(
Poster
)
link »
In this study, we propose Variable Discretization (VD) for self-supervised image representation learning. VD discretizes each variable in the embedding space, making their probability distributions estimable so that the learning process can be directly guided by information measures. Specifically, a loss function is defined to maximize the joint entropy between discrete variables. Our theoretical analysis guarantees that the entropy-maximized VD can learn transform-invariant, non-trivial, redundancy-minimized, and discriminative features. Extensive experiments demonstrate the superiority of VD on various downstream tasks in terms of both accuracy and training efficiency. Moreover, the VD-based information-theoretic optimization can be adapted to other learning paradigms and multimodal data representation learning. |
Chuang Niu · Wenjun Xia · Ge Wang 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Empirical Analysis of the Strengths and Weaknesses of PEFT Techniques for LLMs
(
Poster
)
link »
As foundation models continue to exponentially scale in size, efficient methods of adaptation become increasingly critical. Parameter-efficient fine-tuning (PEFT), a recent class of techniques that requires modifying only a small percentage of the model parameters, is currently the most popular method for adapting large language models (LLMs). Several PEFT techniques have recently been proposed with varying tradeoffs. We provide a comprehensive and uniform benchmark of various PEFT techniques across a representative LLM, the FLAN-T5 model, and evaluate model performance across different data scales of classification and generation datasets. Based on this, we provide a framework for choosing the optimal PEFT technique based on task type and data availability. Contrary to popular belief, we also empirically show that PEFT techniques converge more slowly and perform worse than full fine-tuning in low-data scenarios, and we characterize the amount of data required for PEFT methods to both perform well and converge efficiently. Lastly, we further optimize these PEFT techniques by selectively choosing which parts of the model to train, and find that they can be applied to significantly fewer parameters while maintaining model performance. |
George Pu · Anirudh Jain · Jihan Yin · Russell Kaplan 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
AWE: Adaptive weight-space ensembling for few-shot fine-tuning
(
Poster
)
link »
Transfer learning, which involves adapting a pre-trained model to perform a downstream task, is a widely used paradigm in machine learning. However, traditional transfer learning methods are typically designed for scenarios where fine-tuning data is abundant. Adapting such methods to the few-shot regime can be challenging because the quantity of data is limited compared to the model's capacity. In this work, we present a method called Adaptive Weight-space Ensembling (AWE) that demonstrates the effectiveness of weight-space ensembling, originally designed for large-scale data, in the few-shot setting. We achieve this by leveraging patterns in oracle weight-space ensembling to develop an adaptive ensembling method that can easily be deployed in practice. Our method achieves state-of-the-art results by more than 2% on average on standard few-shot setting benchmarks. |
Jean-Christophe Gagnon-Audet · David J Schwab · Ricardo Monti 🔗 |
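The basic weight-space ensembling ingredient that AWE builds on is linear interpolation between a pre-trained and a fine-tuned checkpoint; a generic sketch is below. The adaptive, data-driven choice of interpolation coefficients that gives AWE its name is not shown, and the toy models are placeholders.

```python
import copy
import torch

def interpolate_weights(pretrained, finetuned, alpha=0.5):
    """Return a model whose parameters are (1 - alpha) * pretrained + alpha * finetuned.
    Both models must share the same architecture."""
    merged = copy.deepcopy(pretrained)
    sd_pre, sd_ft = pretrained.state_dict(), finetuned.state_dict()
    merged.load_state_dict(
        {k: (1 - alpha) * sd_pre[k] + alpha * sd_ft[k] for k in sd_pre}
    )
    return merged

# Toy usage: interpolate between a "pre-trained" classifier and a perturbed copy
# standing in for its fine-tuned version.
base = torch.nn.Linear(10, 2)
tuned = copy.deepcopy(base)
with torch.no_grad():
    tuned.weight.add_(0.1 * torch.randn_like(tuned.weight))
mid = interpolate_weights(base, tuned, alpha=0.3)
```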
Thu 4:00 a.m. - 5:00 a.m.
|
Z-ICL: Zero-Shot In-Context Learning with Pseudo-Demonstrations
(
Poster
)
link »
Language models (LMs) perform a new task at test time either through zero-shot inference or few-shot in-context learning, i.e., conditioning on the k-shot training data (so-called demonstrations). Prior work suggests that in-context learning mainly activates the intrinsic ability of the LM. We argue that this implies the zero-shot performance of the LM is underestimated and can be as good as in-context learning if we inform the LM of the correct space of the inputs and the labels using pseudo-demonstrations. We also identify an additional factor which we call the copying effect: if the pseudo-demonstrations include an input that is very similar to the test input, the model prediction is heavily influenced by the paired label of that input. Putting it all together, we introduce Z-ICL, a new zero-shot prompting method that constructs pseudo-demonstrations without any training data that (a) informs the correct space of the inputs and the outputs and (b) reduces the copying effect so that the prediction is less affected by the pairings in the pseudo-demonstration. Z-ICL includes (a) leveraging nearest neighbors from a raw text corpus and pairing them with random but valid labels and (b) a set of techniques such as physical neighbors and synonym labeling. Z-ICL outperforms previous zero-shot methods by a significant margin, and is on par with in-context learning with gold training data on a range of text classification datasets. Together, Z-ICL provides a significantly higher estimate of the model’s ability to perform a new task zero-shot, and poses a set of new questions about the capacities of LMs. |
Xinxi Lyu · Sewon Min · Iz Beltagy · Luke Zettlemoyer · Hannaneh Hajishirzi 🔗 |
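A rough sketch of pseudo-demonstration construction as described above; `embed` is a hypothetical sentence-encoder interface, labels are assigned uniformly at random, and the paper's physical-neighbor and synonym-labeling refinements are omitted.

```python
import random
import numpy as np

def build_pseudo_demos(test_input, corpus, labels, embed, k=8):
    """Retrieve the k corpus sentences nearest to the test input and pair each
    with a random (but valid) label to form zero-shot pseudo-demonstrations.

    embed(list_of_texts) -> (N, d) array is an assumed sentence-encoder interface.
    """
    corpus_emb = embed(corpus)
    query_emb = embed([test_input])[0]
    # Cosine similarity between the query and every corpus sentence.
    sims = corpus_emb @ query_emb / (
        np.linalg.norm(corpus_emb, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    nearest = np.argsort(-sims)[:k]
    demos = [(corpus[i], random.choice(labels)) for i in nearest]
    prompt = "\n".join(f"{x}\n{y}" for x, y in demos)
    return prompt + f"\n{test_input}\n"

# Toy usage with a dummy encoder in place of a real sentence embedder.
dummy_embed = lambda texts: np.random.randn(len(texts), 32)
print(build_pseudo_demos("great acting", ["fine film", "awful sound", "nice plot"],
                         ["positive", "negative"], dummy_embed, k=2))
```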
Thu 4:00 a.m. - 5:00 a.m.
|
Variational prompt tuning improves generalization of vision-language foundation models
(
Poster
)
link »
Using prompt tuning, large vision-language foundation models can be adapted to downstream tasks by treating part of the input language prompts as learnable parameters and freezing the rest. However, existing work on prompt tuning may damage the generalization capabilities of foundation models. To avoid such limitations, we propose a probabilistic modeling of the underlying distribution of prompts, allowing prompts within the support of an associated concept to be derived through stochastic sampling. This results in a more complete and richer transfer of the information captured by the language model, providing better generalization capabilities for downstream tasks. The resulting algorithm relies on a simple yet powerful variational framework that can be directly integrated with other developments. We show our approach is seamlessly integrated into both standard and conditional prompt learning frameworks, improving the performance in both cases considerably, especially with regard to preserving the generalization capability of the original model. Our method provides the current state-of-the-art for prompt learning, surpassing CoCoOp by 1.6% average Top-1 accuracy on the standard benchmark. Remarkably, it even surpasses the original CLIP model in terms of generalization to new classes. The implementation code will be released. |
Mohammad Mahdi Derakhshani · Enrique Sanchez · Adrian Bulat · Victor Guilherme Turrisi da Costa · Cees G Snoek · Georgios Tzimiropoulos · Brais Martinez 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Aligning Foundation Models for Language with Preferences through $f$-divergence Minimization
(
Poster
)
link »
Aligning language models with preferences can be posed as approximating a target distribution representing some desired behavior. Existing approaches differ both in the functional form of the target distribution and the algorithm used to approximate it. For instance, Reinforcement Learning from Human Feedback (RLHF) corresponds to minimizing a reverse KL from an implicit target distribution arising from a KL penalty in the objective. On the other hand, Generative Distributional Control (GDC) has an explicit target distribution and minimizes a forward KL from it using the Distributional Policy Gradient (DPG) algorithm. In this paper, we propose a new approach, $f$-DPG, which allows the use of any $f$-divergence to approximate any target distribution. $f$-DPG unifies both frameworks (RLHF, GDC) and the approximation methods (DPG, RL with KL penalties). We show the practical benefits of various choices of divergence objectives and demonstrate that there is no universally optimal objective but that different divergences are good for approximating different targets.
|
Dongyoung Go · Tomek Korbak · Germàn Kruszewski · Jos Rozen · Nahyeon Ryu · Marc Dymetman 🔗 |
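For readers unfamiliar with the divergence family referenced above, recall the standard definition (included here for convenience; $\pi$ denotes the language model being trained and $p$ the target distribution): $D_f(p \,\|\, \pi) = \mathbb{E}_{x \sim \pi}\left[ f\left( p(x)/\pi(x) \right) \right]$ for convex $f$ with $f(1) = 0$. Choosing $f(t) = t \log t$ recovers the forward KL $\mathrm{KL}(p \,\|\, \pi)$ minimized by GDC/DPG, while $f(t) = -\log t$ recovers the reverse KL $\mathrm{KL}(\pi \,\|\, p)$ associated with RLHF-style objectives.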
Thu 4:00 a.m. - 5:00 a.m.
|
The SSL Interplay: Augmentations, Inductive Bias, and Generalization
(
Poster
)
link »
Self-supervised learning (SSL) has emerged as a powerful framework to learn representations from raw data without supervision. Yet in practice, engineers face issues such as instability in tuning optimizers and collapse of representations during training. Such challenges motivate the need for a theory to shed light on the complex interplay between the choice of data augmentation, network architecture, and training algorithm. We study this interplay with a precise analysis of generalization performance on both pretraining and downstream tasks in a theory-friendly setup, and highlight several insights for SSL practitioners that arise from our theory. |
Vivien Cabannes · Bobak Kiani · Randall Balestriero · Yann LeCun · Alberto Bietti 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Retrieval of Soft Prompt Enhances Zero-Shot Task Generalization
(
Poster
)
link »
During zero-shot inference with language models (LMs), hard prompts alone may not fully describe the target task. In this paper, we explore how the retrieval of soft prompts obtained through prompt tuning can assist hard prompts in zero-shot task generalization. Specifically, we train soft prompt embeddings for each prompt through prompt tuning, store samples of the training instances (hard prompt + input instances) mapped to the prompt embeddings, and retrieve the prompt embedding of the training instance closest to the query instance during inference. Results show this simple approach enhances the performance of T0 on unseen tasks, outperforming it on 10 out of 11 datasets and improving the mean accuracy of T0 on the BIG-bench benchmark by 2.39 percentage points while adding only 0.007% additional parameters. Interpolating multiple embeddings and applying variance-based ranking further improve accuracy and robustness to different evaluation prompts, widening the performance gap. |
Seonghyeon Ye · Joel Jang · Doyoung Kim · Yongrae Jo · Minjoon Seo 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Guess the Instruction! Flipped Learning Makes Language Models Stronger Zero-Shot Learners
(
Poster
)
link »
Instruction-tuning, which fine-tunes a language model (LM) on various downstream tasks with task instructions, has improved zero-shot task generalization performance. However, instruction-tuned LMs still struggle to generalize to challenging unseen tasks containing novel labels. In this paper, we propose Flipped Learning, an alternative method of instruction-tuning which trains the LM to generate the task instruction given the input instance and label. During inference, the LM trained with Flipped Learning, referred to as FLIPPED, selects the label option that is most likely to generate the task instruction. On 14 tasks of the BIG-bench benchmark, the 11B-sized FLIPPED outperforms zero-shot T0-11B and even a 16-times-larger 3-shot GPT-3 (175B) on average by 8.4 and 9.7 percentage points, respectively. Flipped Learning gives particularly large improvements on tasks with unseen labels, outperforming T0-11B by up to +20% average F1 score. This indicates that the strong task generalization of Flipped Learning comes from improved generalization to novel labels. |
Seonghyeon Ye · Doyoung Kim · Joel Jang · Joongbo Shin · Minjoon Seo 🔗 |
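A schematic of the flipped inference rule described above; `instruction_logprob` is a hypothetical stand-in for an instruction-tuned LM's log-likelihood of generating the task instruction given the input and a candidate label, and the prompt format is an assumption.

```python
def flipped_predict(task_instruction, test_input, label_options, instruction_logprob):
    """Flipped inference: pick the label under which the LM is most likely to
    generate the task instruction, rather than scoring the label directly.

    instruction_logprob(instruction, conditioning_text) -> float is an assumed
    interface to a model trained to map (input, label) -> instruction.
    """
    scores = {
        label: instruction_logprob(task_instruction, f"{test_input}\n{label}")
        for label in label_options
    }
    return max(scores, key=scores.get)

# Toy usage with a dummy scorer in place of a real instruction-tuned LM.
dummy = lambda instr, cond: -abs(len(instr) - len(cond))  # placeholder score
print(flipped_predict("Is the review positive or negative?",
                      "A wonderful, heartfelt film.",
                      ["positive", "negative"], dummy))
```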
Thu 4:00 a.m. - 5:00 a.m.
|
Project with Source, Probe with Target: Extracting Useful Features for Adaptation to Distribution Shifts
(
Poster
)
link »
Conventional approaches to robustness try to learn a model based on causal features. However, identifying maximally robust or causal features may be difficult in some scenarios, and in others, non-causal "shortcut" features may actually be more predictive. We propose a lightweight, sample-efficient approach that learns a diverse set of features and adapts to a target distribution by interpolating these features with a small target dataset. Our approach, Project and Probe (Pro^2), first learns a linear projection that maps a pre-trained embedding onto orthogonal directions while being predictive of labels in the source dataset. The goal of this step is to learn a variety of predictive features, so that at least some of them remain useful after distribution shift. Pro^2 then learns a linear classifier on top of these projected features using a small target dataset. We theoretically show that Pro^2 learns a projection matrix that is optimal for classification in an information-theoretic sense, resulting in better generalization due to a favorable bias-variance tradeoff. Our experiments on eight distribution shift settings show that Pro^2 improves performance by 5-15% when given limited target data compared to prior methods such as standard linear probing. |
Annie Chen · Yoonho Lee · Amrith Setlur · Sergey Levine · Chelsea Finn 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers
(
Poster
)
link »
Large pretrained language models have shown a surprising in-context learning (ICL) ability. With a few demonstration input-label pairs, they can predict labels for unseen inputs without parameter updates. Despite the great success in performance, its working mechanism still remains an open question. In this paper, we explain language models as meta-optimizers and understand ICL as implicit finetuning. Theoretically, we show that Transformer attention has a dual form of gradient descent. On top of this, we understand ICL as follows: GPT first produces meta-gradients according to the demonstration examples, and then these meta-gradients are applied to the original GPT to build an ICL model. We compare the behaviors of ICL and explicit finetuning on real tasks to provide empirical evidence that supports our understanding. Experimental results show that in-context learning behaves similarly to explicit finetuning from multiple perspectives. |
Damai Dai · Yutao Sun · Li Dong · Yaru Hao · Shuming Ma · Zhifang Sui · Furu Wei 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Towards Foundation Models with Mathematical Understanding
(
Poster
)
link »
We investigate the ability of transformer models to build representations of integer sequences that are of utility to tasks where deeper mathematical understanding is needed. To that end, we train BERT-like transformer encoders to assess the impact of individual pre-training tasks on the quality of the resulting model, and evaluate them for sequence classification, continuation, unmasking, complexity prediction, and next sequence-part prediction. We find that the models both outperform benchmark baselines and provide reasonable estimates of the complexity of the mathematical rules behind the sequences. |
Peter Belcak · Roger Wattenhofer 🔗 |
Thu 4:00 a.m. - 5:00 a.m.
|
Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons
(
Poster
)
link »
We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, we show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions. Additionally, we demonstrate that under the PL model, the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons both converge. Moreover, the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and max entropy Inverse Reinforcement Learning (IRL), and provide the first sample complexity bound for max entropy IRL. (The BTL and PL comparison models are written out after this listing.)
|
Banghua Zhu · Jiantao Jiao · Michael Jordan 🔗 |
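For reference, the two comparison models named above can be written as follows; the linear-reward notation is illustrative.

```latex
% Hedged sketch of the comparison models, with a linear reward
% r_\theta(x, y) = \langle \theta, \phi(x, y) \rangle (notation illustrative).
% Bradley-Terry-Luce (pairwise): response y_1 is preferred to y_0 with probability
\[
\mathbb{P}\big(y_1 \succ y_0 \mid x\big)
  = \frac{\exp\!\big(r_\theta(x, y_1)\big)}
         {\exp\!\big(r_\theta(x, y_0)\big) + \exp\!\big(r_\theta(x, y_1)\big)}
  = \sigma\!\big(r_\theta(x, y_1) - r_\theta(x, y_0)\big).
\]
% Plackett-Luce (K-wise): a full ranking \pi over K responses has probability
\[
\mathbb{P}\big(\pi \mid x\big)
  = \prod_{k=1}^{K}
    \frac{\exp\!\big(r_\theta(x, y_{\pi(k)})\big)}
         {\sum_{j=k}^{K} \exp\!\big(r_\theta(x, y_{\pi(j)})\big)},
\]
% and the MLE studied in the paper maximizes the log-likelihood of the observed
% comparisons under one of these models.
```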
Thu 4:00 a.m. - 5:00 a.m.
|
Broken Neural Scaling Laws
(
Poster
)
link »
We present a smoothly broken power law functional form that accurately models and extrapolates the scaling behaviors of deep neural networks (i.e., how the evaluation metric of interest varies as the amount of compute used for training, the number of model parameters, the training dataset size, or upstream performance varies) for various architectures and for various tasks within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. This set includes large-scale vision, language, audio, video, diffusion generative modeling, multimodal learning, contrastive learning, AI alignment, robotics, arithmetic, unsupervised/self-supervised learning, and reinforcement learning (single agent and multi-agent). When compared to other functional forms for neural scaling behavior, this functional form yields extrapolations of scaling behavior that are considerably more accurate on this set. Moreover, this functional form accurately models and extrapolates scaling behavior that other functional forms are incapable of expressing, such as the non-monotonic transitions present in the scaling behavior of phenomena such as double descent and the delayed, sharp inflection points present in the scaling behavior of tasks such as arithmetic. Lastly, we use this functional form to glean insights about the limit of the predictability of scaling behavior. (A minimal single-break instance of such a functional form is sketched after this listing.) |
Ethan Caballero · Kshitij Gupta · Irina Rish · David Krueger 🔗 |
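As a point of reference, here is a minimal single-break instance of a smoothly broken power law; the parameterization is an assumption for illustration, and the general functional form in the paper (which allows multiple breaks) may differ in details.

```python
# Hedged sketch of a smoothly broken power law with a single break, meant only
# to convey how one power-law regime can bend smoothly into another as x grows.
import numpy as np

def smoothly_broken_power_law(x, a, b, c0, c1, d, f):
    """y starts near a + b * x**-c0 and, around scale d, smoothly picks up an
    additional power-law decay with exponent c1; f controls break sharpness."""
    return a + b * x ** (-c0) * (1.0 + (x / d) ** (1.0 / f)) ** (-c1 * f)

x = np.logspace(0, 9, 200)                      # e.g. training compute or dataset size
y = smoothly_broken_power_law(x, a=0.05, b=1.0, c0=0.1, c1=0.3, d=1e5, f=0.5)
print(y[:3], y[-3:])                            # decays faster after the break near 1e5
```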
Thu 4:00 a.m. - 5:00 a.m.
|
Coordinating Multiple Vision-Language Models for Visual Reasoning
(
Poster
)
link »
Visual reasoning demands multimodal perception and commonsense cognition of the world. Multiple vision-language models (VLMs) have recently been proposed with excellent commonsense reasoning ability in various domains. However, how to harness the collective power of these complementary VLMs is rarely explored. Existing methods like ensembling still struggle to combine these models with the desired higher-order communication. In this work, we propose COLA (code available at https://anonymous.4open.science/r/visualreasoning), a novel paradigm that coordinates multiple VLMs for visual reasoning. Our key insight is that a language model (LM) can serve as an efficient coordinator to leverage the distinct and complementary capabilities of multiple VLMs. Extensive experiments demonstrate that our finetuning variant, COLA-FT, achieves state-of-the-art performance on outside-knowledge VQA, visual entailment, and visual-spatial reasoning tasks. Through systematic ablation studies and visualizations, we validate that a coordinator LM comprehends the instruction prompts and the separate functionalities of VLMs and then coordinates them to enable impressive visual reasoning capabilities. (A toy sketch of the coordination pattern follows this listing.) |
Liangyu Chen · Bo Li · Sheng Shen · Jingkang Yang · Chunyuan Li · Kurt Keutzer · Trevor Darrell · Ziwei Liu 🔗 |
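The coordination pattern described above can be caricatured with a short sketch: each VLM reports what it sees, and a language model reads those reports and commits to an answer. Everything below — the stub VLM functions, the prompt template, and the toy coordinator — is a hypothetical stand-in, not the COLA implementation.

```python
# Hedged sketch of an LM-as-coordinator pattern: expose each VLM's output in a
# prompt and let a language model decide. All components below are stubs.
from typing import Callable

def vlm_captioner(image_path: str, question: str) -> str:
    # hypothetical stub: a captioning-style VLM would describe the image
    return "a brown dog jumping over a wooden fence"

def vlm_answerer(image_path: str, question: str) -> str:
    # hypothetical stub: a VQA-style VLM would give a direct (possibly noisy) answer
    return "yes"

def coordinate(image_path: str, question: str, lm: Callable[[str], str]) -> str:
    """Build a prompt exposing each VLM's output, and let the coordinator LM decide."""
    prompt = (
        f"Question: {question}\n"
        f"Model A (caption): {vlm_captioner(image_path, question)}\n"
        f"Model B (answer): {vlm_answerer(image_path, question)}\n"
        "Considering both models, the final answer is:"
    )
    return lm(prompt)

# Toy coordinator LM for illustration only; in practice this is a pretrained
# (and possibly finetuned) language model.
toy_lm = lambda prompt: "yes" if "yes" in prompt else "unknown"
print(coordinate("dog.jpg", "Is the animal outdoors?", toy_lm))
```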
Thu 5:00 a.m. - 5:05 a.m.
|
Diffusion Models are Minimax Optimal Distribution Estimators
(
Spotlight
)
link »
SlidesLive Video » We provide the first rigorous analysis of estimation error bounds for diffusion modeling over well-known function spaces. The highlight of this paper is that when the true density function belongs to a Besov space and the empirical score matching loss is properly minimized, the generated data distribution achieves nearly minimax optimal estimation rates in the total variation distance and in the Wasserstein distance of order one. We expect these results to advance theoretical understanding of diffusion modeling and its ability to generate verisimilar outputs. |
Kazusato Oko · Akiyama Shunta · Taiji Suzuki 🔗 |
Thu 5:08 a.m. - 5:13 a.m.
|
Text-to-Image Diffusion Models are Zero-Shot Classifiers
(
Spotlight
)
link »
SlidesLive Video » Text-to-image diffusion models have demonstrated remarkable generative capabilities, suggesting they learn informative representations of image-text data. However, their abilities are not fully understood and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is using a diffusion model's ability to denoise a noised image given a textual description of a label as a proxy for that label's likelihood. We apply our method to Imagen, using it to probe fine-grained aspects of Imagen's knowledge and comparing it with CLIP's zero-shot abilities. Imagen performs competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, it is more robust than CLIP and can successfully perform attribute binding, whereas CLIP cannot. Although generative pre-training is common in NLP, visual foundation models often use other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for visual and vision-language problems. (A hedged sketch of the denoising-as-scoring idea follows this listing.) |
Kevin Clark · Priyank Jaini 🔗 |
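A hedged sketch of the scoring idea — use the denoising error under a class-conditioned prompt as a proxy for that class's likelihood — is below. The `eps_model` and `text_encoder` interfaces and the simplified forward process are assumptions; Imagen's actual API and noise schedule differ.

```python
# Hedged sketch: score each class prompt by how well the model denoises noised
# copies of the image when conditioned on that prompt; lowest error wins.
import torch

def diffusion_zero_shot_classify(image, class_prompts, eps_model, text_encoder,
                                 n_samples=8):
    """Lower average denoising error under a class prompt is treated as higher
    likelihood for that class."""
    scores = []
    for prompt in class_prompts:
        cond = text_encoder(prompt)                    # assumed: text -> conditioning
        errs = []
        for _ in range(n_samples):
            t = torch.rand(())                         # random noise level in (0, 1)
            noise = torch.randn_like(image)
            x_t = (1 - t) * image + t * noise          # simplified forward process
            pred = eps_model(x_t, t, cond)             # assumed: predicts the noise
            errs.append(torch.mean((pred - noise) ** 2).item())
        scores.append(sum(errs) / len(errs))
    return class_prompts[min(range(len(scores)), key=scores.__getitem__)]

# Toy usage with stand-ins; a real setup plugs in an actual text-conditional
# diffusion model and its noise schedule.
toy_text_encoder = lambda p: torch.zeros(4)
toy_eps_model = lambda x_t, t, cond: torch.zeros_like(x_t)
img = torch.randn(3, 8, 8)
print(diffusion_zero_shot_classify(img, ["a photo of a cat", "a photo of a dog"],
                                   toy_eps_model, toy_text_encoder))
```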
Thu 5:16 a.m. - 5:21 a.m.
|
Exploring Demonstration Ensembling for In-context Learning
(
Spotlight
)
link »
SlidesLive Video » In-context learning (ICL) operates by showing language models (LMs) examples of input-output pairs for desired tasks, i.e., demonstrations. The standard approach for ICL is to prompt the LM with concatenated demonstrations followed by the test input. This approach suffers from some issues. First, concatenation offers almost no control over the contribution of each demonstration to the model prediction. This can be sub-optimal when some demonstrations are not very relevant to the test example. Second, due to the input length limit of transformer models, it can be infeasible to fit many examples into the context, especially when dealing with long-input tasks. In this work, we explore Demonstration Ensembling (DENSE) as an alternative to simple concatenation. DENSE predicts outputs using subsets (i.e., buckets) of the demonstrations and then combines the output probabilities resulting from each subset to produce the final prediction. We study different ensembling methods using GPT-J and experiment on 7 different language tasks. Our experiments show max ensembling to outperform concatenation by an average of 3.8 points. (A short sketch of the bucketing-and-combining procedure follows this listing.) |
Muhammad Khalifa · Lajanugen Logeswaran · Moontae Lee · Honglak Lee · Lu Wang 🔗 |
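The bucketing-and-combining procedure reads naturally as a few lines of code; the sketch below shows max ensembling with a hypothetical `lm_label_probs` scorer standing in for an actual LM such as GPT-J, and the round-robin bucketing is an illustrative choice.

```python
# Hedged sketch of demonstration ensembling: split the demonstrations into
# buckets, get label probabilities from the LM for each bucket separately, and
# combine them (max ensembling shown).
from typing import Callable, Dict, List

def dense_predict(demos: List[str], test_input: str, labels: List[str],
                  lm_label_probs: Callable[[str, str, List[str]], Dict[str, float]],
                  n_buckets: int = 3, combine: str = "max") -> str:
    buckets = [demos[i::n_buckets] for i in range(n_buckets)]      # round-robin split
    per_bucket = [lm_label_probs("\n".join(b), test_input, labels) for b in buckets]
    if combine == "max":          # max ensembling: best probability any bucket assigns
        scores = {y: max(p[y] for p in per_bucket) for y in labels}
    else:                         # simple alternative: average the bucket probabilities
        scores = {y: sum(p[y] for p in per_bucket) / len(per_bucket) for y in labels}
    return max(scores, key=scores.get)

# Toy usage with a stand-in scorer (a real setup would query an actual LM).
toy_scorer = lambda demo_text, x, labels: {y: float(len(demo_text) + len(x) + i)
                                           for i, y in enumerate(labels)}
print(dense_predict([f"example {i}" for i in range(6)], "test sentence",
                    ["positive", "negative"], toy_scorer))
```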
Thu 5:24 a.m. - 5:29 a.m.
|
Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations
(
Spotlight
)
link »
SlidesLive Video » Joint-embedding-based learning (e.g., SimCLR, MoCo, DINO) and reconstruction-based learning (e.g., BEiT, SimMIM, MAE) are the two leading paradigms for self-supervised learning of vision transformers, but they differ substantially in their transfer performance. Here, we aim to explain these differences by analyzing the impact of these objectives on the structure and transferability of their representations. Our analysis reveals that reconstruction-based learning features are significantly dissimilar to joint-embedding-based learning features and that models trained with similar objectives learn similar features even across architectures. These differences arise early in the network, primarily driven by attention and normalization layers. We find that joint-embedding features yield better linear probe transfer for classification because the different objectives drive different distributions of information and invariances in the representation. These differences explain opposite trends in transfer performance for downstream tasks that require spatial specificity in features. Finally, we address how fine-tuning changes reconstructive representations to enable better transfer, showing that it re-organizes the information to be more similar to pre-trained joint embedding models. |
Shashank Shekhar · Florian Bordes · Pascal Vincent · Ari Morcos 🔗 |
Thu 5:32 a.m. - 5:37 a.m.
|
Effective Data Augmentation With Diffusion Models
(
Spotlight
)
link »
SlidesLive Video » Data augmentation is one of the most prevalent tools in deep learning, underpinning many recent advances, including those in classification, generative modeling, and representation learning. The standard approach to data augmentation combines simple transformations like rotations and flips to generate new images from existing ones. However, these new images lack diversity along key semantic axes present in the data. Consider the task of recognizing different animals. Current augmentations fail to produce diversity in task-relevant high-level semantic attributes like the species of the animal. We address the lack of diversity in data augmentation with image-to-image transformations parameterized by pre-trained text-to-image diffusion models. Our method edits images to change their semantics using an off-the-shelf diffusion model, and generalizes to novel visual concepts from a few labelled examples. We evaluate our approach on image classification tasks in a few-shot setting, and on a real-world weed recognition task, and observe an improvement in accuracy in the tested domains. (A generic image-to-image augmentation sketch follows this listing.) |
Brandon Trabucco · Kyle Doherty · Max Gurinas · Ruslan Salakhutdinov 🔗 |
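One way to approximate the described augmentation with off-the-shelf tools is an image-to-image diffusion edit that preserves the label. The sketch below uses the Hugging Face diffusers img2img pipeline as a generic stand-in; the checkpoint, prompt template, strength, and file paths are illustrative choices, and this is not the authors' released implementation.

```python
# Hedged sketch of diffusion-based semantic augmentation: edit a labelled image
# with an off-the-shelf image-to-image diffusion model while keeping its label.
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")  # or "cpu", much slower

def augment(image_path: str, class_name: str, strength: float = 0.5) -> Image.Image:
    """Return a semantically edited variant of the image that should keep its class."""
    init = Image.open(image_path).convert("RGB").resize((512, 512))
    prompt = f"a photo of a {class_name}"      # illustrative prompt template
    out = pipe(prompt=prompt, image=init, strength=strength, guidance_scale=7.5)
    return out.images[0]

# Usage (hypothetical paths): generate a few extra training images per example.
# aug = augment("weeds/example_0001.jpg", "datura weed"); aug.save("example_aug.png")
```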
Thu 5:40 a.m. - 5:45 a.m.
|
Guess the Instruction! Flipped Learning Makes Language Models Stronger Zero-Shot Learners
(
Spotlight
)
link »
SlidesLive Video » Instruction-tuning, which fine-tunes the language model (LM) on various downstream tasks with task instructions, has improved zero-shot task generalization performance. However, instruction-tuned LMs still struggle to generalize to challenging unseen tasks containing novel labels. In this paper, we propose Flipped Learning, an alternative method of instruction-tuning which trains the LM to generate the task instruction given the input instance and label. During inference, the LM trained with Flipped Learning, referred to as FLIPPED, selects the label option that is most likely to generate the task instruction. On 14 tasks of the BIG-bench benchmark, the 11B-sized FLIPPED outperforms zero-shot T0-11B and even a 16-times-larger 3-shot GPT-3 (175B) on average by 8.4 and 9.7 percentage points, respectively. Flipped Learning gives particularly large improvements on tasks with unseen labels, outperforming T0-11B by up to +20% average F1 score. This indicates that the strong task generalization of Flipped Learning comes from improved generalization to novel labels. |
Seonghyeon Ye · Doyoung Kim · Joel Jang · Joongbo Shin · Minjoon Seo 🔗 |
Thu 5:48 a.m. - 5:53 a.m.
|
A Kernel-Based View of Language Model Fine-Tuning
(
Spotlight
)
link »
SlidesLive Video »
It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of this empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK), which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization, describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of NTK for computer vision tasks (Wei et al., 2022). We extend the NTK formalism to Adam and use Tensor Programs (Yang, 2020) to characterize conditions under which the NTK lens may describe fine-tuning updates to pre-trained language models. Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. Finally, we use this kernel view to propose an explanation for the success of parameter-efficient subspace-based fine-tuning methods. (The kernel object is written out after this listing.)
|
Sadhika Malladi · Alexander Wettig · Dingli Yu · Danqi Chen · Sanjeev Arora 🔗 |
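For context, the kernel object the abstract refers to can be written compactly; the notation below is illustrative, and the paper's precise conditions (and its Adam variant of the kernel) are more involved.

```latex
% Hedged sketch of the kernel view (notation illustrative). With pre-trained
% parameters \theta_0 and network output f(x; \theta), the empirical NTK is
\[
K(x, x') \;=\; \big\langle \nabla_\theta f(x; \theta_0),\; \nabla_\theta f(x'; \theta_0) \big\rangle,
\]
% and in the kernel regime fine-tuning behaves like the linearized model
\[
f(x; \theta) \;\approx\; f(x; \theta_0) \;+\; \nabla_\theta f(x; \theta_0)^{\top} (\theta - \theta_0),
\]
% i.e. learning reduces to (kernel) regression with K on the downstream data.
% The paper additionally derives an Adam analogue of this kernel and gives
% conditions, via Tensor Programs, under which the approximation can hold for
% prompted fine-tuning of pre-trained LMs.
```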
Thu 6:00 a.m. - 6:30 a.m.
|
Invited Talk (Yasaman Bahri): Understanding Neural Scaling Laws
(
Invited Talk
)
SlidesLive Video » Bio: Yasaman Bahri is a Research Scientist at Google Brain with research interests in the foundations of deep learning and the intersection of machine learning with the physical sciences. Prior to joining Google Brain, she completed her Ph.D. in Physics at UC Berkeley. She is a past recipient of the Rising Stars Award in EECS. |
Yasaman Bahri 🔗 |
Thu 6:30 a.m. - 6:35 a.m.
|
Q&A
|
🔗 |
Thu 6:35 a.m. - 7:05 a.m.
|
Invited Talk (Danqi Chen): Analyzing Training Objectives and Trajectories in Language Pre-training
(
Invited Talk
)
SlidesLive Video » In this talk, I will present several empirical studies on understanding and analyzing pre-training of language models. I will start with BERT’s pre-training/fine-tuning paradigm, and discuss how pre-training objectives will influence downstream performance. Then, I will move on to the scaling of autoregressive large language models. Through analyzing intermediate training checkpoints, we present several interesting findings on token-level perplexity, sentence-level generation and their correlation with in-context learning on downstream tasks. I hope these findings can encourage more theoretical understanding and improved pre-training in the future. Bio: Danqi Chen is an Assistant Professor of Computer Science at Princeton University and co-leads the Princeton NLP Group. Her recent research focuses on training, adapting and understanding large language models, and developing scalable and efficient NLP systems for question answering, information extraction and conversational agents. Before joining Princeton, Danqi worked as a visiting scientist at Facebook AI Research. She received her Ph.D. from Stanford University (2018) and B.E. from Tsinghua University (2012), both in Computer Science. Her research was recognized by a Sloan Fellowship, an NSF CAREER award, a Samsung AI Researcher of the Year award, outstanding paper awards from ACL and EMNLP, and multiple industry faculty awards. |
Danqi Chen 🔗 |
Thu 7:05 a.m. - 7:10 a.m.
|
Q&A
|
🔗 |
Thu 7:10 a.m. - 7:40 a.m.
|
Invited Talk (Jonathan Frankle): Faster Neural Network Training, Algorithmically
(
Invited Talk
)
Training modern neural networks is time-consuming, expensive, and energy-intensive. As neural network training costs double every few months, it is difficult for researchers and businesses without immense budgets to keep up, especially as hardware improvements stagnate. In this talk, I will describe my favored approach for managing this challenge: changing the workload itself - the training algorithm. Unlike most workloads in computer science, machine learning is approximate, and we need not worry about changing the underlying algorithm so long as we properly account for the consequences. I will discuss how we have put this approach into practice at MosaicML, including the dozens of algorithmic changes we have studied (which are freely available open source), the science behind how these changes interact with each other (the composition problem), and how we evaluate whether these changes have been effective. I will also detail several surprises we have encountered and lessons we have learned along the way. In the time since we began this work, we have reduced the training times of standard computer vision models by 5-7x and standard language models by 2-3x, and we're just scratching the surface. I will close with a number of open research questions we have encountered that merit the attention of the research community. This is the collective work of a dozen empirical deep learning researchers at MosaicML, and I'm simply the messenger. Bio: Jonathan Frankle is Chief Scientist at MosaicML, where he leads the company's research team toward the goal of developing more efficient algorithms for training neural networks. In his PhD at MIT, he empirically studied deep learning with Prof. Michael Carbin, specifically the properties of sparse networks that allow them to train effectively (his "Lottery Ticket Hypothesis" - ICLR 2019 Best Paper). In addition to his technical work, he is actively involved in policymaking around challenges related to machine learning. He will be joining the computer science faculty at Harvard in the fall of 2023. He earned his BSE and MSE in computer science at Princeton and has previously spent time at Google Brain, Facebook AI Research, and Microsoft as an intern and Georgetown Law as an Adjunct Professor of Law. |
Jonathan Frankle 🔗 |
Thu 7:40 a.m. - 7:45 a.m.
|
Q&A
|
🔗 |