Learning “tabula rasa”, that is, from scratch without much previously learned knowledge, is the dominant paradigm in reinforcement learning (RL) research. However, learning tabula rasa is the exception rather than the norm for solving larger-scale problems. Additionally, the inefficiency of tabula rasa RL typically excludes the majority of researchers outside certain resource-rich labs from tackling computationally demanding problems. To address the inefficiencies of tabula rasa RL and help unlock the full potential of deep RL, our workshop aims to bring further attention to the emerging paradigm of reusing prior computation in RL, discuss its potential benefits and real-world applications as well as its current limitations and challenges, and arrive at concrete problem statements and evaluation protocols for the research community to work on. Furthermore, we hope to foster discussion via a panel (with audience participation), several contributed talks, and a call for short opinion papers.
Thu 12:00 a.m. - 12:10 a.m. | Introduction
Thu 12:10 a.m. - 12:40 a.m. | Invited Talk by Avishkar Bhoopchand: Human-Timescale Adaptation in an Open-Ended Task Space (Invited Talk)
Avishkar Bhoopchand
Thu 12:40 a.m. - 12:50 a.m. | Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals (Oral)
High sample complexity has long been a challenge for RL. On the other hand, humans learn to perform tasks not only from interaction or demonstrations, but also by reading unstructured text documents, e.g., instruction manuals. Instruction manuals and wiki pages are among the most abundant data that could inform agents of valuable features and policies or task-specific environmental dynamics and reward structures. Therefore, we hypothesize that the ability to utilize human-written instruction manuals to assist learning policies for specific tasks should lead to a more efficient and better-performing agent. We propose the Read and Reward framework. Read and Reward speeds up RL algorithms on Atari games by reading manuals released by the Atari game developers. Our framework consists of a QA Extraction module that extracts and summarizes relevant information from the manual and a Reasoning module that evaluates object-agent interactions based on information from the manual. An auxiliary reward is then provided to a standard A2C RL agent when an interaction is detected. When assisted by our design, A2C improves on 4 games in the Atari environment with sparse rewards, and requires 1000x fewer training frames compared to the previous SOTA Agent 57 on Skiing, the hardest game in Atari.
Yue Wu · Yewen Fan · Paul Pu Liang · Amos Azaria · Yuanzhi Li · Tom Mitchell
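A minimal sketch of adding a manual-derived auxiliary bonus on top of the environment reward, in the spirit of the Read and Reward framework above. The `interaction_is_beneficial` predicate is a hypothetical stand-in for the paper's QA Extraction and Reasoning modules, and the gym-style `env.step` interface is assumed.

```python
from typing import Any, Callable, Dict

def shaped_step(env, action,
                interaction_is_beneficial: Callable[[Dict[str, Any]], bool],
                bonus: float = 1.0):
    """One environment step with an auxiliary reward added whenever an
    object-agent interaction that the manual describes as useful is detected.
    `interaction_is_beneficial` stands in for the manual-reading modules."""
    obs, reward, done, info = env.step(action)      # gym-style step assumed
    if interaction_is_beneficial(info):             # e.g., agent passed a gate in Skiing
        reward += bonus                             # extra signal for the A2C learner
    return obs, reward, done, info
```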
Thu 12:50 a.m. - 1:00 a.m. | Reduce, Reuse, Recycle: Selective Reincarnation in Multi-Agent Reinforcement Learning (Oral)
'Reincarnation' in reinforcement learning has been proposed as a formalisation of reusing prior computation from past experiments when training an agent in an environment. In this paper, we present a brief foray into the paradigm of reincarnation in the multi-agent (MA) context. We consider the case where only some agents are reincarnated, whereas the others are trained from scratch -- selective reincarnation. In the fully-cooperative MA setting with heterogeneous agents, we demonstrate that selective reincarnation can lead to higher returns than training fully from scratch, and faster convergence than training with full reincarnation. However, the choice of which agents to reincarnate in a heterogeneous system is vitally important to the outcome of the training -- in fact, a poor choice can lead to considerably worse results than the alternatives. We argue that a rich field of work exists here, and we hope that our effort catalyses further energy in bringing the topic of reincarnation to the multi-agent realm.
Juan Formanek · Callum R. Tilbury · Jonathan P Shock · Kale-ab Tessera · Arnu Pretorius
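A minimal sketch of the selective-reincarnation setup described above: a chosen subset of agents starts from weights produced by a previous experiment while the rest start from scratch. The in-memory `prior_policies` here merely stand in for checkpoints of an earlier run; network sizes are illustrative.

```python
import copy
import torch
import torch.nn as nn

def make_policy(obs_dim: int, act_dim: int) -> nn.Module:
    # Tiny per-agent policy network, for illustration only.
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))

# Stand-ins for networks trained in a past experiment; in practice these would
# be loaded from that run's checkpoints.
prior_policies = {f"agent_{i}": make_policy(8, 4) for i in range(4)}

# Selective reincarnation: reuse prior computation for some agents, train the
# others tabula rasa. As the abstract notes, which agents are chosen matters.
reincarnate = {"agent_0", "agent_2"}
policies = {}
for name, prior in prior_policies.items():
    policies[name] = make_policy(8, 4)
    if name in reincarnate:
        policies[name].load_state_dict(copy.deepcopy(prior.state_dict()))
```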
Thu 1:00 a.m. - 1:10 a.m. | Learning to Modulate pre-trained Models in RL (Oral)
Reinforcement Learning (RL) has experienced great success in complex games and simulations. However, RL agents are often highly specialized for a particular task, and it is difficult to adapt a trained agent to a new task. In supervised learning, an established paradigm is multi-task pre-training followed by fine-tuning. A similar trend is emerging in RL, where agents are pre-trained on data collections that comprise a multitude of tasks. Despite these developments, it remains an open challenge how to adapt such pre-trained agents to novel tasks while retaining performance on the pre-training tasks. In this regard, we pre-train an agent on a set of tasks from the Meta-World benchmark suite and adapt it to tasks from Continual-World. We conduct a comprehensive comparison of fine-tuning methods originating from supervised learning in our setup. Our findings show that fine-tuning is feasible, but for existing methods, performance on previously learned tasks often deteriorates. Therefore, we propose a novel approach that avoids forgetting by modulating the information flow of the pre-trained model. Our method outperforms existing fine-tuning approaches, and achieves state-of-the-art performance on the Continual-World benchmark. To facilitate future research in this direction, we collect datasets for all Meta-World tasks and make them publicly available.
Thomas Schmied · Markus Hofmarcher · Fabian Paischer · Razvan Pascanu · Sepp Hochreiter
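The abstract does not spell out the modulation mechanism, so the sketch below shows one generic possibility only: a frozen pre-trained backbone whose features are rescaled and shifted by a small set of per-task learnable parameters (FiLM-style). The paper's actual scheme may differ; dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ModulatedBackbone(nn.Module):
    """Frozen pre-trained backbone with per-task learnable scale/shift
    modulation; only the modulation parameters receive gradients, so the
    pre-trained weights (and hence pre-training performance) are untouched."""

    def __init__(self, backbone: nn.Module, feat_dim: int, n_tasks: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():      # keep pre-trained weights intact
            p.requires_grad_(False)
        self.scale = nn.Parameter(torch.ones(n_tasks, feat_dim))
        self.shift = nn.Parameter(torch.zeros(n_tasks, feat_dim))

    def forward(self, obs: torch.Tensor, task_id: int) -> torch.Tensor:
        feats = self.backbone(obs)
        return self.scale[task_id] * feats + self.shift[task_id]

backbone = nn.Sequential(nn.Linear(39, 256), nn.ReLU(), nn.Linear(256, 256))
model = ModulatedBackbone(backbone, feat_dim=256, n_tasks=10)
features = model(torch.randn(4, 39), task_id=3)   # gradients flow only to scale/shift
```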
Thu 1:10 a.m. - 1:20 a.m. | Do Embodied Agents Dream of Pixelated Sheep?: Embodied Decision Making using Language Guided World Modelling (Oral)
Reinforcement learning (RL) agents typically learn tabula rasa, without prior knowledge of the world, which makes learning complex tasks with sparse rewards difficult. If initialized with knowledge of high-level subgoals and transitions between subgoals, RL agents could utilize this Abstract World Model (AWM) for planning and exploration. We propose using few-shot large language models (LLMs) to hypothesize an AWM that is tested and verified during exploration, to improve sample efficiency in embodied RL agents. Our DECKARD agent applies LLM-guided exploration to item crafting in Minecraft in two phases: (1) the Dream phase, where the agent uses an LLM to decompose a task into a sequence of subgoals, the hypothesized AWM; and (2) the Wake phase, where the agent learns a modular policy for each subgoal and verifies or corrects the hypothesized AWM on the basis of its experiences. Our method of hypothesizing an AWM with LLMs and then verifying the AWM based on agent experience not only increases sample efficiency over contemporary methods by an order of magnitude but is also robust to and corrects errors in the LLM, successfully blending noisy internet-scale information from LLMs with knowledge grounded in environment dynamics.
Kolby Nottingham · Prithviraj Ammanabrolu · Alane Suhr · Yejin Choi · Hannaneh Hajishirzi · Sameer Singh · Roy Fox
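A high-level sketch of the Dream/Wake control flow described above. `llm_propose_subgoal_sequence` and `learn_subgoal_policy` are hypothetical placeholders for the LLM query and the per-subgoal RL step; only the loop structure, hypothesize then verify, is meant to be illustrative.

```python
from typing import Callable, Dict, List

def deckard_loop(task: str,
                 llm_propose_subgoal_sequence: Callable[[str], List[str]],
                 learn_subgoal_policy: Callable[[str], bool],
                 max_rounds: int = 3) -> Dict[str, bool]:
    """Dream: an LLM hypothesizes an abstract world model as a subgoal sequence.
    Wake: the agent learns a modular policy per subgoal and keeps only the
    hypothesized steps it can actually verify in the environment."""
    verified: Dict[str, bool] = {}
    for _ in range(max_rounds):
        subgoals = llm_propose_subgoal_sequence(task)      # Dream phase
        for sg in subgoals:
            if sg not in verified:
                verified[sg] = learn_subgoal_policy(sg)    # Wake phase
        if all(verified.get(sg, False) for sg in subgoals):
            break   # the hypothesized AWM is consistent with experience
    return verified
```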
Thu 1:20 a.m. - 1:30 a.m. | Towards A Unified Agent with Foundation Models (Oral)
Language Models and Vision Language Models have recently demonstrated unprecedented capabilities in terms of understanding human intentions, reasoning, scene understanding, and planning-like behaviour, in text form, among many others. In this work, we investigate how to embed and leverage such abilities in Reinforcement Learning (RL) agents. We design a framework that uses language as the core reasoning tool, exploring how this enables an agent to tackle a series of fundamental RL challenges, such as efficient exploration, reusing experience data, scheduling skills, and learning from observations, which traditionally require separate, vertically designed algorithms. We test our method on a sparse-reward simulated robotic manipulation environment, where a robot needs to stack a set of objects. We demonstrate substantial performance improvements over baselines in exploration efficiency and ability to reuse data from offline datasets, and illustrate how to reuse learned skills to solve novel tasks or imitate videos of human experts.
Norman Di Palo · Arunkumar Byravan · Leonard Hasenclever · Markus Wulfmeier · Nicolas Heess · Martin Riedmiller
Thu 1:30 a.m. - 1:35 a.m. | Merging Decision Transformers: Weight Averaging for Forming Multi-Task Policies (Spotlight)
Recent work has shown the promise of creating generalist, transformer-based policies for language, vision, and sequential decision-making problems. To create such models, we generally require centralized training objectives, data, and compute. It is of interest whether we can more flexibly create generalist policies by merging together multiple task-specific, individually trained policies. In this work, we take a preliminary step in this direction through merging, or averaging, subsets of Decision Transformers in weight space trained on different MuJoCo locomotion problems, forming multi-task models without centralized training. We also propose that when merging policies, we can obtain better results if all policies start from common, pre-trained initializations, while also co-training on shared auxiliary tasks during problem-specific finetuning. In general, we believe research in this direction can help democratize and distribute the process by which generally capable agents are formed.
Daniel Lawson · Ahmed Qureshi
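A minimal sketch of the weight-averaging idea above: uniformly average the parameters of several individually trained policies that share an architecture (and ideally a common pre-trained initialization). Small MLPs stand in for Decision Transformers just to keep the example self-contained.

```python
import copy
from typing import List
import torch
import torch.nn as nn

def average_weights(models: List[nn.Module]) -> nn.Module:
    """Merge task-specific policies into one multi-task policy by averaging
    their parameters elementwise (uniform weights)."""
    merged = copy.deepcopy(models[0])
    with torch.no_grad():
        avg_state = {
            k: torch.stack([m.state_dict()[k].float() for m in models]).mean(dim=0)
            for k in merged.state_dict()
        }
    merged.load_state_dict(avg_state)
    return merged

# Toy illustration: three "policies" with a shared architecture.
policies = [nn.Sequential(nn.Linear(17, 64), nn.ReLU(), nn.Linear(64, 6)) for _ in range(3)]
multi_task_policy = average_weights(policies)
```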
Thu 1:35 a.m. - 1:40 a.m. | Deep Reinforcement Learning with Plasticity Injection (Spotlight)
A growing body of evidence suggests that neural networks employed in deep reinforcement learning (RL) gradually lose their plasticity, the ability to learn from new data; however, the analysis and mitigation of this phenomenon is hampered by the complex relationship between plasticity, exploration, generalization, and performance in RL. This paper introduces plasticity injection, a minimalistic intervention that increases the network plasticity without changing the number of trainable parameters or biasing the predictions. The applications of this intervention are two-fold: first, as a diagnostic tool --- if injection increases the performance, we may conclude that an agent's network was losing its plasticity. This tool allows us to identify a subset of Atari environments where the lack of plasticity causes performance plateaus, motivating future studies on understanding and combating plasticity loss. Second, plasticity injection can be used to improve the computational efficiency of RL training if the agent has exhausted its plasticity and has to re-learn from scratch, or by growing the agent's network dynamically without compromising performance. The results on Atari show that plasticity injection attains stronger performance while being computationally efficient compared to alternative methods.
Evgenii Nikishin · Junhyuk Oh · Georg Ostrovski · Clare Lyle · Razvan Pascanu · Will Dabney · Andre Barreto
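A sketch of the injection as we read the abstract: freeze the current head, add a freshly initialized copy, and subtract a frozen duplicate of that copy, so predictions are unchanged at injection time and the number of trainable parameters stays constant. Details may differ from the paper; the head architecture here is illustrative.

```python
import copy
import torch
import torch.nn as nn

class PlasticityInjectedHead(nn.Module):
    """After injection the prediction is f_old(x) + f_new(x) - f_new_frozen(x):
    unchanged at injection time (the last two terms cancel), with the same
    number of trainable parameters as before (only f_new is trained)."""

    def __init__(self, old_head: nn.Module):
        super().__init__()
        self.old = old_head
        self.new = copy.deepcopy(old_head)
        for m in self.new.modules():               # re-initialize the trainable copy
            if isinstance(m, nn.Linear):
                m.reset_parameters()
        self.new_frozen = copy.deepcopy(self.new)  # frozen duplicate of the fresh copy
        for p in list(self.old.parameters()) + list(self.new_frozen.parameters()):
            p.requires_grad_(False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.old(x) + self.new(x) - self.new_frozen(x)

head = PlasticityInjectedHead(nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 18)))
q_values = head(torch.randn(32, 512))   # identical to the old head's outputs at injection
```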
Thu 1:40 a.m. - 1:45 a.m. | Synthetic Experience Replay (Spotlight)
A key theme in the past decade has been that when large neural networks and large datasets combine they can produce remarkable results. In deep reinforcement learning (RL), this paradigm is commonly made possible through experience replay, whereby a dataset of past experiences is used to train a policy or value function. However, unlike in supervised or self-supervised learning, an RL agent has to collect its own data, which is often limited. Thus, it is challenging to reap the benefits of deep learning, and even small neural networks can overfit at the start of training. In this work, we leverage the tremendous recent progress in generative modeling and propose Synthetic Experience Replay (SynthER), a diffusion-based approach to arbitrarily upsample an agent's collected experience. We show that SynthER is an effective method for training RL agents across offline and online settings. In offline settings, we observe drastic improvements both when upsampling small offline datasets and when training larger networks with additional synthetic data. Furthermore, SynthER enables online agents to train with a much higher update-to-data ratio than before, leading to a large increase in sample efficiency, without any algorithmic changes. We believe that synthetic training data could open the door to realizing the full potential of deep learning for replay-based RL algorithms from limited data.
Cong Lu · Philip Ball · Jack Parker-Holder
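SynthER fits a diffusion model to the agent's collected transitions and samples from it to upsample the replay buffer. The sketch below shows only that interface, with a trivial diagonal-Gaussian model standing in for the diffusion model so the example stays self-contained; the dimensions and sample counts are arbitrary.

```python
import numpy as np

class GaussianStandIn:
    """Toy generative model over flattened (s, a, r, s') transition vectors.
    SynthER uses a diffusion model here; a diagonal Gaussian is used only to
    keep the sketch runnable."""
    def fit(self, transitions: np.ndarray) -> "GaussianStandIn":
        self.mean = transitions.mean(axis=0)
        self.std = transitions.std(axis=0) + 1e-6
        return self

    def sample(self, n: int) -> np.ndarray:
        return self.mean + self.std * np.random.randn(n, self.mean.shape[0])

real = np.random.randn(1_000, 10)                    # stand-in for collected experience
generator = GaussianStandIn().fit(real)
synthetic = generator.sample(100_000)                # arbitrary upsampling factor
replay = np.concatenate([real, synthetic], axis=0)   # the RL agent trains on this buffer
```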
Thu 1:45 a.m. - 1:50 a.m. | Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence? (Spotlight)
We present the largest and most comprehensive empirical study of visual foundation models for Embodied AI (EAI). First, we curate CORTEXBENCH, consisting of 17 different EAI tasks spanning locomotion, navigation, dexterous and mobile manipulation. Next, we systematically evaluate existing visual foundation models and find that none is universally dominant. To study the effect of pre-training data scale and diversity, we combine ImageNet with over 4,000 hours of egocentric videos from 7 different sources (over 5.6M images) and train different sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. These models required over 10,000 GPU-hours to train and will be open-sourced to the community. We find that scaling dataset size and diversity does not improve performance across all tasks but does so on average. Finally, we show that adding a second pre-training step on a small in-domain dataset improves performance, matching or outperforming the best known results in this setting.
Arjun Majumdar · Karmesh Yadav · Sergio Arnaud · Yecheng Jason Ma · Claire Chen · Sneha Silwal · Aryan Jain · Vincent-Pierre Berges · Pieter Abbeel · Dhruv Batra · Yixin Lin · Oleksandr Maksymets · Aravind Rajeswaran · Franziska Meier
Thu 1:50 a.m. - 1:55 a.m. | Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning (Spotlight)
A compelling use case of offline reinforcement learning (RL) is to obtain an effective policy initialization from existing datasets, which allows efficient fine-tuning with limited amounts of active online interaction in the environment. Many existing offline RL methods tend to exhibit poor fine-tuning performance. On the contrary, while naive online RL methods achieve compelling empirical performance, online methods suffer from a large sample complexity without a good policy initialization from the offline data. Our goal in this paper is to devise an approach for learning an effective offline initialization that also unlocks fast online fine-tuning capabilities. Our approach, calibrated Q-learning (Cal-QL), accomplishes this by learning a conservative value function initialization that underestimates the value of the learned policy from offline data, while also being calibrated, meaning that the learned value estimate still upper-bounds the ground-truth value of some other reference policy (e.g., the behavior policy). Both theoretically and empirically, we show that imposing these conditions speeds up online fine-tuning and brings in the benefits of the offline data. In practice, Cal-QL can be implemented on top of existing offline RL methods without any extra hyperparameter tuning. Empirically, Cal-QL outperforms state-of-the-art methods on a wide range of fine-tuning tasks from both state and visual observations, across several benchmarks.
Mitsuhiko Nakamoto · Yuexiang Zhai · Anikait Singh · Yi Ma · Chelsea Finn · Aviral Kumar · Sergey Levine
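A sketch of the calibration condition as we read the abstract: the conservative (value-lowering) term of a CQL-style penalty is clipped from below by a reference policy's value, so the learned Q-function stays conservative but never drops below that reference. Here `q_pi` are Q-values at policy actions, `q_data` at dataset actions, and `v_ref` is a Monte-Carlo estimate of the reference (e.g., behavior) policy's value; the exact loss in the paper may differ.

```python
import torch

def calibrated_conservative_penalty(q_pi: torch.Tensor,
                                    q_data: torch.Tensor,
                                    v_ref: torch.Tensor,
                                    alpha: float = 5.0) -> torch.Tensor:
    """CQL-style penalty that pushes Q down at policy actions and up at dataset
    actions, but never pushes Q at policy actions below the reference value
    v_ref (the calibration condition)."""
    pushed_down = torch.maximum(q_pi, v_ref)      # clip from below by the reference value
    return alpha * (pushed_down.mean() - q_data.mean())

# Toy tensors just to exercise the function.
penalty = calibrated_conservative_penalty(
    q_pi=torch.randn(256), q_data=torch.randn(256), v_ref=torch.zeros(256))
```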
Thu 1:55 a.m. - 2:00 a.m. | TGRL: Teacher Guided Reinforcement Learning Algorithm for POMDPs (Spotlight)
In many real-world problems, an agent must operate in an uncertain and partially observable environment. Due to partial information, a policy trained directly to operate from these restricted observations tends to perform poorly. In some scenarios, more information about the environment is available during training, which can be utilized to find a superior policy. Because this privileged information is unavailable at deployment, such a policy cannot be deployed. The teacher-student paradigm overcomes this challenge by using the actions of the privileged (or teacher) policy as the target for training the deployable (or student) policy, which operates from the restricted observation space, using supervised learning. However, due to information asymmetry, it is not always feasible for the student to perfectly mimic the teacher. We provide a principled solution to this problem, wherein the student policy dynamically balances between following the teacher's guidance and utilizing reinforcement learning to solve the partially observed task directly. The proposed algorithm is evaluated on diverse domains and fares favorably against strong baselines.
Idan Shenfeld · Zhang-Wei Hong · Aviv Tamar · Pulkit Agrawal
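A generic sketch of combining the task's RL objective with a teacher-imitation term whose weight is adapted online. The simple heuristic below, which only compares recent student returns to a reference, is our own illustration; the paper's balancing rule is more principled.

```python
import torch
import torch.nn.functional as F

def student_loss(policy_logits: torch.Tensor,
                 teacher_actions: torch.Tensor,
                 rl_loss: torch.Tensor,
                 imitation_weight: float) -> torch.Tensor:
    """Combine the student's RL loss with a supervised term toward the
    teacher's actions; `imitation_weight` trades off the two objectives."""
    imitation = F.cross_entropy(policy_logits, teacher_actions)
    return rl_loss + imitation_weight * imitation

def update_weight(weight: float, student_return: float, reference_return: float,
                  step: float = 0.01, max_weight: float = 10.0) -> float:
    """Crude stand-in for dynamic balancing: lean more on the teacher while the
    student underperforms the reference, less once it catches up."""
    if student_return < reference_return:
        return min(weight + step, max_weight)
    return max(weight - step, 0.0)
```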
Thu 2:00 a.m. - 2:05 a.m. | Co-Imitation Learning without Expert Demonstration (Spotlight)
Imitation learning is a primary approach to improving the efficiency of reinforcement learning by exploiting expert demonstrations. However, in many real-world scenarios, obtaining expert demonstrations can be extremely expensive or even impossible. To overcome this challenge, in this paper, we propose a novel learning framework called Co-Imitation Learning (CoIL) to exploit the past good experiences of the agents themselves without expert demonstration. Specifically, we train two different agents by letting each of them alternately explore the environment and exploit the peer agent's experience. While the experiences could be valuable or misleading, we propose to estimate the potential utility of each piece of experience with the expected gain of the value function. Thus the agents can selectively imitate each other by emphasizing the more useful experiences while filtering out noisy ones. Experimental results on various tasks show significant superiority of the proposed Co-Imitation Learning framework, validating that the agents can benefit from each other without external supervision.
Kun-Peng Ning · Hu Xu · Kun Zhu · Sheng-Jun Huang
Thu 2:05 a.m. - 4:00 a.m. | Poster Session
Thu 4:00 a.m. - 5:00 a.m. | Lunch Break
Thu 5:00 a.m. - 5:30 a.m. | Invited Talk by Joseph Lim: Skill Reuse in Deep Reinforcement Learning (Invited Talk)
Joseph Lim
Thu 5:30 a.m. - 6:00 a.m. | Invited Talk by Furong Huang: Adaptable Reinforcement Learning in An Ever-Changing World (Invited Talk)
Furong Huang
Thu 6:00 a.m. - 6:30 a.m. | Invited Talk by Anna Goldie: RL for Chip Design / LLMs (Invited Talk)
Anna Goldie
Thu 6:30 a.m. - 7:00 a.m. | Invited Talk by Sergey Levine: Leveraging Offline Datasets / Foundation Models for Real-World RL (Invited Talk)
Sergey Levine
Thu 7:00 a.m. - 7:50 a.m. | Panel Discussion: Challenges & Open Problems in Reusing Prior Computation (Discussion Panel)
Panelists: Jim (Linxi) Fan, Furong Huang, Joseph Lim, Jeff Clune, Marc Bellemare, Avishkar Bhoopchand.
Joseph Lim · Furong Huang · Marc G Bellemare · Linxi Fan · Jeff Clune · Anna Goldie
Thu 7:50 a.m. - 8:00 a.m. | Closing Remarks
- | Unsupervised Object Interaction Learning with Counterfactual Dynamics Models (Poster)
We present a novel way of learning skills of object interactions in entity-centric environments, with the goal of learning primitive behaviors that can control objects and induce their interactions without using any external reward or supervision. Existing skill discovery methods are limited to locomotion, simple navigation tasks, or single-object manipulation tasks, and mostly do not induce useful behaviors for interaction between objects. Unlike the monolithic representation usually used in prior skill learning methods, we propose to use a structured goal representation that can query and scope which objects to interact with, which can serve as a basis for solving more complex downstream tasks. We design a novel counterfactual intrinsic reward, derived from either a forward model or successor features, that can learn an interaction skill between a pair of objects given as a goal. Through experiments on continuous control environments such as Magnetic Block and 2.5-D Stacking Box, we demonstrate that an agent can learn object interaction behaviors (e.g., attaching or stacking one block to another) without any external rewards or domain-specific knowledge.
Jongwook Choi · Sungtae Lee · Xinyu Wang · Sungryull Sohn · Honglak Lee
- | Chain-of-Thought Predictive Control with Behavior Cloning (Poster)
We study how to learn generalizable policies from demonstrations for complex continuous-space tasks (e.g., low-level object manipulation). We aim to leverage the applicability and scalability of Behavior Cloning (BC) combined with the planning capabilities and generalizability of Model Predictive Control (MPC), and at the same time, overcome the challenges of BC with sub-optimal demos and enable planning-based control over a much longer horizon. Specifically, we utilize hierarchical structures in object manipulation tasks via key states that mark the boundary between sub-stages of a trajectory. We couple key-state (the chain-of-thought) and action predictions during both training and evaluation stages, providing the model with a structured peek into the long-term future to dynamically adjust its plan. Our method resembles a closed-loop control design and we call it Chain-of-Thought Predictive Control (CoTPC). We empirically find that key states are governed by learnable patterns shared across demos, and thus CoTPC eases the optimization of BC and produces policies much more generalizable than existing methods on four challenging object manipulation tasks in ManiSkill2.
Zhiwei Jia · Fangchen Liu · Vineet Thumuluri · Linghao Chen · Zhiao Huang · Hao Su
- | Learning to Modulate pre-trained Models in RL (Poster)
Reinforcement Learning (RL) has experienced great success in complex games and simulations. However, RL agents are often highly specialized for a particular task, and it is difficult to adapt a trained agent to a new task. In supervised learning, an established paradigm is multi-task pre-training followed by fine-tuning. A similar trend is emerging in RL, where agents are pre-trained on data collections that comprise a multitude of tasks. Despite these developments, it remains an open challenge how to adapt such pre-trained agents to novel tasks while retaining performance on the pre-training tasks. In this regard, we pre-train an agent on a set of tasks from the Meta-World benchmark suite and adapt it to tasks from Continual-World. We conduct a comprehensive comparison of fine-tuning methods originating from supervised learning in our setup. Our findings show that fine-tuning is feasible, but for existing methods, performance on previously learned tasks often deteriorates. Therefore, we propose a novel approach that avoids forgetting by modulating the information flow of the pre-trained model. Our method outperforms existing fine-tuning approaches, and achieves state-of-the-art performance on the Continual-World benchmark. To facilitate future research in this direction, we collect datasets for all Meta-World tasks and make them publicly available.
Thomas Schmied · Markus Hofmarcher · Fabian Paischer · Razvan Pascanu · Sepp Hochreiter
- | Think Before You Act: Unified Policy for Interleaving Language Reasoning with Actions (Poster)
The success of transformer models trained with a language modeling objective brings a promising opportunity to reinforcement learning. The Decision Transformer is a step towards this direction, showing how to train transformers with the same next-step prediction objective on offline data. Another important development in this area is the recent emergence of large-scale datasets collected from the internet. One interesting source of such data is tutorial videos with captions where people talk about what they are doing. To take advantage of this language component, we propose a novel method for unifying language reasoning with actions in a single policy. Specifically, we augment a transformer policy with word outputs, so it can generate textual captions interleaved with actions. When tested on the most challenging task in BabyAI, with captions describing next subgoals, our reasoning policy consistently outperforms the caption-free baseline.
Lina Mezghani · Sainbayar Sukhbaatar · Piotr Bojanowski · Karteek Alahari
- | Offline Visual Representation Learning for Embodied Navigation (Poster)
How should we learn visual representations for embodied agents that must see and move? The status quo is tabula rasa in vivo, i.e. learning visual representations from scratch while also learning to move, potentially augmented with auxiliary tasks (e.g. predicting the action taken between two successive observations). In this paper, we show that an alternative 2-stage strategy is far more effective: (1) offline pretraining of visual representations with self-supervised learning (SSL) using large-scale pre-rendered images of indoor environments (Omnidata), and (2) online finetuning of visuomotor representations on specific tasks with image augmentations under long learning schedules. We call this method Offline Visual Representation Learning (OVRL). We conduct large-scale experiments -- on 3 different 3D datasets (Gibson, HM3D, MP3D), 2 tasks (ImageNav, ObjectNav), and 2 policy learning algorithms (RL, IL) -- and find that the OVRL representations lead to significant across-the-board improvements in the state of the art, on ImageNav from 29.2% to 54.2% (+25% absolute, 86% relative) and on ObjectNav from 18.1% to 23.2% (+5.1% absolute, 28% relative). Importantly, both results were achieved by the same visual encoder generalizing to datasets that were not seen during pretraining. While the benefits of pretraining sometimes diminish (or entirely disappear) with long finetuning schedules, we find that OVRL's performance gains continue to increase (not decrease) as the agent is trained for 2 billion frames of experience.
Karmesh Yadav · Ram Ramrakhya · Arjun Majumdar · Vincent-Pierre Berges · Sachit Kuhar · Dhruv Batra · Alexei Baevski · Oleksandr Maksymets
- | Imitation from Arbitrary Experience: A Dual Unification of Reinforcement and Imitation Learning Methods (Poster)
It is well known that Reinforcement Learning (RL) can be formulated as a convex program with linear constraints. The dual form of this formulation is unconstrained, which we refer to as dual RL, and can leverage preexisting tools from convex optimization to improve the learning performance of RL agents. We show that several state-of-the-art deep RL algorithms (in online, offline, and imitation settings) can be viewed as dual RL approaches in a unified framework. This unification calls for the methods to be studied on common ground, so as to identify the components that actually contribute to the success of these methods. Our unification also reveals that prior off-policy imitation learning methods in the dual space are based on an unrealistic coverage assumption and are restricted to matching a particular f-divergence. We propose a new method using a simple modification to the dual framework that allows for imitation learning with arbitrary off-policy data to obtain near-expert performance.
Harshit Sikchi · Amy Zhang · Scott Niekum
- | Self-Generating Data for Goal-Conditioned Compositional Problems (Poster)
Building reinforcement learning agents that generalize to compositional problems has long been a research challenge. Recent success relies on a pre-existing dataset of rich behaviors. We present a novel paradigm to learn policies generalizable to compositional tasks with self-generated data. After learning primitive skills, the agent runs task expansion, which actively expands to more complex tasks by composing learned policies and naturally generates a dataset of demonstrations for self-distillation. In a proof-of-concept block-stacking environment, our agent discovers a large number of complex tasks after multiple rounds of data generation and distillation, and achieves an appealing zero-shot generalization success rate when building human-designed shapes.
Ying Yuan · Yunfei Li · Yi Wu
- | LIV: Language-Image Representations and Rewards for Robotic Control (Poster)
Motivated by the growing research in natural language-based task interfaces for robotic tasks, we seek good vision-language representations specialized for control. We posit that such representations should: (1) align the two modalities to permit grounding language-based task specifications in visual state-based task rewards, (2) capture sequentiality and task-directed progress in conjunction with cross-modality alignment, and (3) permit extensive pre-training from large generic datasets as well as fine-tuning on small in-domain datasets. We achieve these desiderata through Language-Image Value learning (LIV), a unified objective for vision-language representation and reward learning from action-free videos with text annotations. We use LIV to pre-train the first control-centric vision-language representation from large human video datasets such as EpicKitchen with no action information. Then, with access to target domain data, the very same objective consistently improves this pre-trained LIV model as well as other pre-existing vision-language representations for language-conditioned control. On two simulated robot domains that evaluate vision-language representations and rewards, LIV pre-trained and fine-tuned models consistently outperform the best prior approaches, establishing the advantages of joint vision-language representation and reward learning within its unified, compact framework.
Yecheng Jason Ma · Vikash Kumar · Amy Zhang · Osbert Bastani · Dinesh Jayaraman
- | Model-Based Adversarial Imitation Learning As Online Fine-Tuning (Poster)
In many real-world applications of sequential decision-making problems, such as robotics or autonomous driving, expert-level data is available (or easily obtainable) with methods such as tele-operation. However, directly learning to copy these expert behaviours can result in poor performance due to distribution shift at deployment time. Adversarial imitation learning algorithms alleviate this issue by learning to match the expert state-action distribution through additional environment interactions. Such methods are built around standard reinforcement-learning algorithms with both model-based and model-free approaches. In this work we focus on the model-based approach and argue that algorithms developed for online RL are sub-optimal for the distribution matching problem. We theoretically justify utilizing conservative algorithms developed for the offline learning paradigm in online adversarial imitation learning and empirically demonstrate improved performance and safety on a complex long-range robot manipulation task, directly from images.
Rafael Rafailov · Victor Kolev · Kyle Hatch · John Martin · mariano Phielipp · Jiajun Wu · Chelsea Finn
- | MOTO: Offline to Online Fine-tuning for Model-Based Reinforcement Learning (Poster)
We study the problem of offline-to-online reinforcement learning from high-dimensional pixel observations. While recent model-free approaches successfully use offline pre-training with online fine-tuning to either improve the performance of the data-collection policy or adapt to novel tasks, model-based approaches still remain underutilized in this setting. In this work, we argue that existing methods for high-dimensional model-based offline RL are not suitable for offline-to-online fine-tuning due to issues with representation learning shifts, off-dynamics data, and non-stationary rewards. We propose a simple on-policy model-based method with adaptive behavior regularization. In our simulation experiments, we find that our approach successfully solves long-horizon robot manipulation tasks completely from images by using a combination of offline data and online interactions.
Rafael Rafailov · Kyle Hatch · Victor Kolev · John Martin · mariano Phielipp · Chelsea Finn
- | Masked Trajectory Models for Prediction, Representation, and Control (Poster)
We introduce Masked Trajectory Models (MTM) as a generic abstraction for sequential decision making. MTM takes a trajectory, such as a state-action sequence, and aims to reconstruct the trajectory conditioned on random subsets of the same trajectory. By training with a highly randomized masking pattern, MTM learns versatile networks that can take on different roles or capabilities, by simply choosing appropriate masks at inference time. For example, the same MTM network can be used as a forward dynamics model, inverse dynamics model, or even an offline RL agent. Through extensive experiments in several continuous control tasks, we show that the same MTM network -- i.e. same weights -- can match or outperform specialized networks trained for the aforementioned capabilities. Additionally, we find that state representations learned by MTM can significantly accelerate the learning speed of traditional RL algorithms. Finally, in offline RL benchmarks, we find that MTM is competitive with specialized offline RL algorithms, despite MTM being a generic self-supervised learning method without any explicit RL components.
Philipp Wu · Arjun Majumdar · Kevin Stone · Yixin Lin · Igor Mordatch · Pieter Abbeel · Aravind Rajeswaran
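A compact sketch of the masked-reconstruction objective described above: state and action tokens are projected to a common width, a random subset is replaced with a learned mask token, and a transformer encoder is trained to reconstruct the full trajectory. The architecture, dimensions, and masking rate are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyMTM(nn.Module):
    """Masked trajectory model: embed interleaved (state, action) tokens,
    randomly mask some of them, and reconstruct the original sequence."""
    def __init__(self, state_dim=17, act_dim=6, width=64, horizon=16):
        super().__init__()
        self.embed_s = nn.Linear(state_dim, width)
        self.embed_a = nn.Linear(act_dim, width)
        self.mask_token = nn.Parameter(torch.zeros(width))
        self.pos = nn.Parameter(torch.zeros(2 * horizon, width))
        layer = nn.TransformerEncoderLayer(d_model=width, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decode = nn.Linear(width, state_dim + act_dim)

    def forward(self, states, actions, mask_ratio=0.5):
        B, T, _ = states.shape
        tokens = torch.stack([self.embed_s(states), self.embed_a(actions)], dim=2)
        tokens = tokens.reshape(B, 2 * T, -1) + self.pos          # interleave s_t, a_t
        keep = torch.rand(B, 2 * T, 1, device=tokens.device) > mask_ratio
        tokens = torch.where(keep, tokens, self.mask_token)       # mask a random subset
        return self.decode(self.encoder(tokens))                  # (B, 2T, state+act dims)

# One reconstruction step on random data, just to show the training signal.
states, actions = torch.randn(8, 16, 17), torch.randn(8, 16, 6)
model = TinyMTM()
recon = model(states, actions)
target = torch.cat([states, actions], dim=-1).repeat_interleave(2, dim=1)
loss = ((recon - target) ** 2).mean()
```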
- | Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence? (Poster)
We present the largest and most comprehensive empirical study of visual foundation models for Embodied AI (EAI). First, we curate CORTEXBENCH, consisting of 17 different EAI tasks spanning locomotion, navigation, dexterous and mobile manipulation. Next, we systematically evaluate existing visual foundation models and find that none is universally dominant. To study the effect of pre-training data scale and diversity, we combine ImageNet with over 4,000 hours of egocentric videos from 7 different sources (over 5.6M images) and train different sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. These models required over 10,000 GPU-hours to train and will be open-sourced to the community. We find that scaling dataset size and diversity does not improve performance across all tasks but does so on average. Finally, we show that adding a second pre-training step on a small in-domain dataset improves performance, matching or outperforming the best known results in this setting.
Arjun Majumdar · Karmesh Yadav · Sergio Arnaud · Yecheng Jason Ma · Claire Chen · Sneha Silwal · Aryan Jain · Vincent-Pierre Berges · Pieter Abbeel · Dhruv Batra · Yixin Lin · Oleksandr Maksymets · Aravind Rajeswaran · Franziska Meier
- | Knowledge Transfer from Teachers to Learners in Growing-Batch Reinforcement Learning (Poster)
Standard approaches to sequential decision-making exploit an agent's ability to continually interact with its environment and improve its control policy. However, due to safety, ethical, and practicality constraints, this type of trial-and-error experimentation is often infeasible in many real-world domains such as healthcare and robotics. Instead, control policies in these domains are typically trained offline from previously logged data or in a growing-batch manner. In this setting a fixed policy is deployed to the environment and used to gather an entire batch of new data before being aggregated with past batches and used to update the policy. This improvement cycle can then be repeated multiple times. While a limited number of such cycles is feasible in real-world domains, the quantity and diversity of the resulting data are much lower than in the standard continually-interacting approach. However, data collection in these domains is often performed in conjunction with human experts, who are able to label or annotate the collected data. In this paper, we first explore the trade-offs present in this growing-batch setting, and then investigate how information provided by a teacher (i.e., demonstrations, expert actions, and gradient information) can be leveraged at training time to mitigate the sample complexity and coverage requirements for actor-critic methods. We validate our contributions on tasks from the DeepMind Control Suite.
Patrick Emedom-Nnamdi · Abram Friesen · Bobak Shahriari · Matthew Hoffman · Nando de Freitas
- | Synthetic Experience Replay (Poster)
A key theme in the past decade has been that when large neural networks and large datasets combine they can produce remarkable results. In deep reinforcement learning (RL), this paradigm is commonly made possible through experience replay, whereby a dataset of past experiences is used to train a policy or value function. However, unlike in supervised or self-supervised learning, an RL agent has to collect its own data, which is often limited. Thus, it is challenging to reap the benefits of deep learning, and even small neural networks can overfit at the start of training. In this work, we leverage the tremendous recent progress in generative modeling and propose Synthetic Experience Replay (SynthER), a diffusion-based approach to arbitrarily upsample an agent's collected experience. We show that SynthER is an effective method for training RL agents across offline and online settings. In offline settings, we observe drastic improvements both when upsampling small offline datasets and when training larger networks with additional synthetic data. Furthermore, SynthER enables online agents to train with a much higher update-to-data ratio than before, leading to a large increase in sample efficiency, without any algorithmic changes. We believe that synthetic training data could open the door to realizing the full potential of deep learning for replay-based RL algorithms from limited data.
Cong Lu · Philip Ball · Jack Parker-Holder
- | Task-Agnostic Continual Reinforcement Learning: Gaining Insights and Overcoming Challenges (Poster)
We study methods for task-agnostic continual reinforcement learning (TACRL). TACRL is a setting that combines the difficulties of partially observable RL (a consequence of task agnosticism) and the difficulties of continual learning (CL), i.e., learning on a non-stationary sequence of tasks. We compare TACRL methods with their soft upper bounds prescribed by previous literature: multi-task learning (MTL) methods, which do not have to deal with non-stationary data distributions, as well as task-aware methods, which are allowed to operate under full observability. We consider a previously unexplored and straightforward baseline for TACRL, replay-based recurrent RL (3RL), in which we augment an RL algorithm with recurrent mechanisms to mitigate partial observability and with experience replay mechanisms to mitigate catastrophic forgetting in CL. By studying empirical performance in a sequence of RL tasks, we find surprising occurrences of 3RL matching and overcoming the MTL and task-aware soft upper bounds. We lay out hypotheses that could explain this inflection point of continual and task-agnostic learning research. Our hypotheses are empirically tested in continuous control tasks via a large-scale study of the popular multi-task and continual learning benchmark Meta-World. By analyzing different training statistics including gradient conflict, we find evidence that 3RL's outperformance stems from its ability to quickly infer how new tasks relate to the previous ones, enabling forward transfer.
Massimo Caccia · Jonas Mueller · Taesup Kim · Laurent Charlin · Rasool Fakoor
- | Towards A Unified Agent with Foundation Models (Poster)
Language Models and Vision Language Models have recently demonstrated unprecedented capabilities in terms of understanding human intentions, reasoning, scene understanding, and planning-like behaviour, in text form, among many others. In this work, we investigate how to embed and leverage such abilities in Reinforcement Learning (RL) agents. We design a framework that uses language as the core reasoning tool, exploring how this enables an agent to tackle a series of fundamental RL challenges, such as efficient exploration, reusing experience data, scheduling skills, and learning from observations, which traditionally require separate, vertically designed algorithms. We test our method on a sparse-reward simulated robotic manipulation environment, where a robot needs to stack a set of objects. We demonstrate substantial performance improvements over baselines in exploration efficiency and ability to reuse data from offline datasets, and illustrate how to reuse learned skills to solve novel tasks or imitate videos of human experts.
Norman Di Palo · Arunkumar Byravan · Leonard Hasenclever · Markus Wulfmeier · Nicolas Heess · Martin Riedmiller
- | EDGI: Equivariant diffusion for planning with embodied agents (Poster)
Embodied agents operate in a structured world, often solving tasks with spatial, temporal, and permutation symmetries. Most algorithms for planning and model-based reinforcement learning (MBRL) do not take this rich geometric structure into account, leading to sample inefficiency and poor generalization. We introduce the Equivariant Diffuser for Generating Interactions (EDGI), an algorithm for MBRL and planning that is equivariant with respect to the product of the spatial symmetry group SE(3), the discrete-time translation group ℤ, and the object permutation group 𝕊ₙ. EDGI follows the Diffuser framework (Janner et al., 2022) in treating both learning a world model and planning in it as a conditional generative modeling problem, training a diffusion model on an offline trajectory dataset. We introduce a new SE(3) × ℤ × 𝕊ₙ-equivariant diffusion model that supports multiple representations. We integrate this model in a planning loop, where conditioning and classifier-based guidance allow us to softly break the symmetry for specific tasks as needed. On navigation and object manipulation tasks, EDGI improves sample efficiency and generalization.
Johann Brehmer · Joey Bose · Pim De Haan · Taco Cohen
- | Reduce, Reuse, Recycle: Selective Reincarnation in Multi-Agent Reinforcement Learning (Poster)
'Reincarnation' in reinforcement learning has been proposed as a formalisation of reusing prior computation from past experiments when training an agent in an environment. In this paper, we present a brief foray into the paradigm of reincarnation in the multi-agent (MA) context. We consider the case where only some agents are reincarnated, whereas the others are trained from scratch -- selective reincarnation. In the fully-cooperative MA setting with heterogeneous agents, we demonstrate that selective reincarnation can lead to higher returns than training fully from scratch, and faster convergence than training with full reincarnation. However, the choice of which agents to reincarnate in a heterogeneous system is vitally important to the outcome of the training -- in fact, a poor choice can lead to considerably worse results than the alternatives. We argue that a rich field of work exists here, and we hope that our effort catalyses further energy in bringing the topic of reincarnation to the multi-agent realm.
Juan Formanek · Callum R. Tilbury · Jonathan P Shock · Kale-ab Tessera · Arnu Pretorius
- | Beyond Temporal Credit Assignment in Reinforcement Learning (Poster)
In reinforcement learning, traditional value-based methods rely heavily on time as the main proxy for propagating information across the state space. This often results in slow learning and does not scale to large and complex environments. Here, we propose to leverage prior information about the structure of the environment to assign credit non-temporally and improve learning efficiency. Specifically, we introduce the concept of structural neighbours, which are sets of states with similar semantic structures and which have equivalent values under the optimal policy. We augment traditional value-based RL methods (TD(0), Dyna and Dueling DQN) with a learning mechanism based on structural neighbours. Our empirical results show that by incorporating structural updates, learning efficiency can be greatly improved on a variety of environments ranging from simple tabular grid worlds to those which require function approximation, including the complex and high-dimensional game of Solitaire.
Sephora Madjiheurem · Kimberly Stachenfeld · Peter Battaglia · Jessica Hamrick
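A sketch of the non-temporal credit assignment described above, assuming a tabular setting and a hypothetical `structural_neighbours(s)` mapping: after the usual TD(0) update on the visited state, the same bootstrapped target is also propagated (with a smaller step size) to its structural neighbours. The exact update rule in the paper may differ.

```python
from typing import Callable, Dict, Hashable, Iterable

def td0_with_structural_update(V: Dict[Hashable, float],
                               s, r: float, s_next,
                               structural_neighbours: Callable[[Hashable], Iterable[Hashable]],
                               alpha: float = 0.1, beta: float = 0.05,
                               gamma: float = 0.99) -> Dict[Hashable, float]:
    """TD(0) update on the visited state, plus a smaller update that propagates
    the same target to states with equivalent structure (non-temporal credit)."""
    target = r + gamma * V.get(s_next, 0.0)
    V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
    for n in structural_neighbours(s):
        V[n] = V.get(n, 0.0) + beta * (target - V.get(n, 0.0))
    return V
```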
- | Learning How to Infer Partial MDPs for In-Context Adaptation and Exploration (Poster)
To generalize across tasks, an agent should acquire knowledge from past tasks that facilitates adaptation and exploration in future tasks. We focus on the problem of in-context adaptation and exploration, where an agent only relies on context, i.e., history of states, actions and/or rewards, rather than gradient-based updates. Posterior sampling (an extension of Thompson sampling) is a promising approach, but it requires Bayesian inference and dynamic programming, which often involve unknowns (e.g., a prior) and costly computations. To address these difficulties, we use a transformer to learn an inference process from training tasks and consider a hypothesis space of partial models, represented as small Markov decision processes that are cheap for dynamic programming. In our version of the Symbolic Alchemy benchmark, our method's adaptation speed and exploration-exploitation balance approach those of an exact posterior sampling oracle. We also show that even though partial models exclude relevant information from the environment, they can nevertheless lead to good policies.
Chentian Jiang · Nan Rosemary Ke · Hado van Hasselt
- | Bootstrapped Representations in Reinforcement Learning (Poster)
In reinforcement learning (RL), state representations are key to dealing with large or continuous state spaces. While one of the promises of deep learning algorithms is to automatically construct features well-tuned for the task they try to solve, such a representation might not emerge from end-to-end training of deep RL agents. To mitigate this issue, pretrained representations are often learnt from auxiliary tasks on offline datasets as part of an unsupervised pre-training phase to improve the sample efficiency of deep RL agents in a future online phase. Bootstrapping methods are today's method of choice to make these additional predictions, but it is unclear which features are being learned. In this paper, we address this gap and provide a theoretical characterization of the pre-trained representation learnt by temporal difference learning (Sutton, 1988). Surprisingly, we find that this representation differs from the features learned by pre-training with Monte Carlo and residual gradient algorithms for most transition structures of the environment. We describe the goodness of these pre-trained representations for linearly predicting the value function given any downstream reward function, and use our theoretical analysis to design new unsupervised pre-training rules. We complement our theoretical results with an empirical comparison of these pre-trained representations for different cumulant functions on the four-room (Sutton et al., 1999) and Mountain Car (Moore, 1990) domains and demonstrate that they speed up online learning.
Charline Le Lan · Stephen Tu · Mark Rowland · Anna Harutyunyan · Rishabh Agarwal · Marc G Bellemare · Will Dabney
- | Action Inference by Maximising Evidence: Zero-Shot Imitation from Observation with World Models (Poster)
Unlike most reinforcement learning agents, which require an unrealistic amount of environment interactions to learn a new behaviour, humans excel at learning quickly by merely observing and imitating others. This ability highly depends on the fact that humans have a model of their own embodiment that allows them to infer the most likely actions that led to the observed behaviour. In this paper, we propose Action Inference by Maximising Evidence (AIME) to replicate this behaviour using world models. AIME consists of two distinct phases. In the first phase, the agent learns a world model from its past experience to understand its own body by maximising the evidence lower bound (ELBO). In the second phase, the agent is given some observation-only demonstrations of an expert performing a novel task and tries to imitate the expert's behaviour. AIME achieves this by defining a policy as an inference model and maximising the evidence of the demonstration under the policy and world model. Our method is "zero-shot" in the sense that it does not require further interactions with the environment after being given the demonstration. We empirically validate the zero-shot imitation performance of our method on the Walker of the DeepMind Control Suite and find that it outperforms the state-of-the-art baselines. We also find that AIME with image observations still matches the performance of the baseline observing the true low-dimensional state of the environment.
Xingyuan Zhang · Philip Becker-Ehmck · Patrick van der Smagt · Maximilian Karl
- | Deep Reinforcement Learning with Plasticity Injection (Poster)
A growing body of evidence suggests that neural networks employed in deep reinforcement learning (RL) gradually lose their plasticity, the ability to learn from new data; however, the analysis and mitigation of this phenomenon is hampered by the complex relationship between plasticity, exploration, generalization, and performance in RL. This paper introduces plasticity injection, a minimalistic intervention that increases the network plasticity without changing the number of trainable parameters or biasing the predictions. The applications of this intervention are two-fold: first, as a diagnostic tool --- if injection increases the performance, we may conclude that an agent's network was losing its plasticity. This tool allows us to identify a subset of Atari environments where the lack of plasticity causes performance plateaus, motivating future studies on understanding and combating plasticity loss. Second, plasticity injection can be used to improve the computational efficiency of RL training if the agent has exhausted its plasticity and has to re-learn from scratch, or by growing the agent's network dynamically without compromising performance. The results on Atari show that plasticity injection attains stronger performance while being computationally efficient compared to alternative methods.
Evgenii Nikishin · Junhyuk Oh · Georg Ostrovski · Clare Lyle · Razvan Pascanu · Will Dabney · Andre Barreto
- | On The Role of Forgetting in Fine-Tuning Reinforcement Learning Models (Poster)
Recently, foundation models have achieved remarkable results in fields such as computer vision and language processing. Although there has been a significant push to introduce similar approaches in reinforcement learning, these have not yet succeeded on a comparable scale. In this paper, we take a step towards understanding and closing this gap by highlighting one of the problems specific to foundation RL models, namely the data shift occurring during fine-tuning. We show that fine-tuning on compositional tasks, where parts of the environment might only be available after a long training period, is inherently prone to catastrophic forgetting. In such a scenario, a pre-trained model might forget useful knowledge before even seeing parts of the state space it can solve. We provide examples of both a grid world and realistic robotic scenarios where catastrophic forgetting occurs. Finally, we show how this problem can be mitigated by using tools from continual learning. We discuss the potential impact of this finding and propose further research directions.
Maciej Wołczyk · Bartłomiej Cupiał · Michał Zając · Razvan Pascanu · Łukasz Kuciński · Piotr Miłoś
- | Merging Decision Transformers: Weight Averaging for Forming Multi-Task Policies (Poster)
Recent work has shown the promise of creating generalist, transformer-based policies for language, vision, and sequential decision-making problems. To create such models, we generally require centralized training objectives, data, and compute. It is of interest whether we can more flexibly create generalist policies by merging together multiple task-specific, individually trained policies. In this work, we take a preliminary step in this direction through merging, or averaging, subsets of Decision Transformers in weight space trained on different MuJoCo locomotion problems, forming multi-task models without centralized training. We also propose that when merging policies, we can obtain better results if all policies start from common, pre-trained initializations, while also co-training on shared auxiliary tasks during problem-specific finetuning. In general, we believe research in this direction can help democratize and distribute the process by which generally capable agents are formed.
Daniel Lawson · Ahmed Qureshi
- | TGRL: Teacher Guided Reinforcement Learning Algorithm for POMDPs (Poster)
In many real-world problems, an agent must operate in an uncertain and partially observable environment. Due to partial information, a policy trained directly to operate from these restricted observations tends to perform poorly. In some scenarios, more information about the environment is available during training, which can be utilized to find a superior policy. Because this privileged information is unavailable at deployment, such a policy cannot be deployed. The teacher-student paradigm overcomes this challenge by using the actions of the privileged (or teacher) policy as the target for training the deployable (or student) policy, which operates from the restricted observation space, using supervised learning. However, due to information asymmetry, it is not always feasible for the student to perfectly mimic the teacher. We provide a principled solution to this problem, wherein the student policy dynamically balances between following the teacher's guidance and utilizing reinforcement learning to solve the partially observed task directly. The proposed algorithm is evaluated on diverse domains and fares favorably against strong baselines.
Idan Shenfeld · Zhang-Wei Hong · Aviv Tamar · Pulkit Agrawal
-
|
PIRLNav: Pretraining with Imitation and RL Finetuning for ObjectNav
(
Poster
)
link »
We study ObjectGoal Navigation - where a virtual robot situated in a new environment is asked to navigate to an object. Prior work has shown that imitation learning (IL) on a dataset of human demonstrations achieves promising results. However, this has limitations − 1) IL policies generalize poorly to new states, since the training mimics actions not their consequences, and 2) collecting demonstrations is expensive. On the other hand, reinforcement learning (RL) is trivially scalable, but requires careful reward engineering to achieve desirable behavior. We present a two-stage learning scheme for IL pretraining on human demonstrations followed by RL-finetuning. This leads to a PIRLNav policy that advances the state-of-the-art on ObjectNav from 60.0% success rate to 65.0% (+5.0% absolute). Using this IL→RL training recipe, we present a rigorous empirical analysis of design choices. First, we investigate whether human demonstrations can be replaced with `free' (automatically generated) sources of demonstrations, e.g. shortest paths (SP) or task-agnostic frontier exploration (FE) trajectories. We find that IL→RL on human demonstrations outperforms IL→RL on SP and FE trajectories, even when controlled for the same IL-pretraining success on TRAIN, and even on a subset of VAL episodes where IL-pretraining success favors the SP or FE policies. Next, we study how RL-finetuning performance scales with the size of the IL pretraining dataset. We find that as we increase the size of the IL-pretraining dataset and get to high IL accuracies, the improvements from RL-finetuning are smaller, and that 90% of the performance of our best IL→RL policy can be achieved with less than half the number of IL demonstrations. Finally, we analyze failure modes of our ObjectNav policies, and present guidelines for further improving them. |
Ram Ramrakhya · Dhruv Batra · Erik Wijmans · Abhishek Das 🔗 |
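A minimal sketch of the first stage of the IL→RL recipe described above: behavior cloning on human demonstrations, after which the pretrained weights are handed to an on-policy RL trainer. Function and loader names are placeholders; the paper's fine-tuning details (critic initialization, learning-rate schedules, etc.) are not reproduced here.

```python
import torch
import torch.nn.functional as F

def il_pretrain(policy, demo_loader, optimizer, epochs=10):
    # Behavior cloning on (observation, expert_action) pairs from human demonstrations.
    for _ in range(epochs):
        for obs, expert_action in demo_loader:
            logits = policy(obs)
            loss = F.cross_entropy(logits, expert_action)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy

# Stage 2 (not shown): pass the IL-pretrained `policy` to an on-policy RL
# trainer (e.g., PPO/DD-PPO) and continue training with the task reward.
```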
-
|
Bayesian regularization of empirical MDPs
(
Poster
)
link »
In most applications of model-based Markov decision processes, the parameters of the unknown underlying model are estimated from empirical data. Due to noise, the policy learned from the estimated model is often far from the optimal policy of the underlying model. When applied to the environment of the underlying model, the learned policy yields suboptimal performance, calling for solutions that generalize better. In this work we take a Bayesian perspective and regularize the objective function of the Markov decision process with prior information in order to obtain more robust policies. Two approaches are proposed, one based on $L^1$ regularization and the other on relative entropic regularization. We evaluate our proposed algorithms on synthetic simulations and on real-world search logs of a large-scale online shopping store. Our results demonstrate the robustness of regularized MDP policies against the noise present in the models.
|
Samarth Gupta · Daniel Hill · Lexing Ying · Inderjit Dhillon 🔗 |
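A minimal sketch of relative-entropic regularization on an empirical MDP: a soft value iteration that keeps the policy close in KL to a prior policy. This is one illustrative instance of the regularization idea, not the paper's exact objective; the estimated model `P_hat`, `R_hat` and the `prior` policy are assumed inputs.

```python
import numpy as np

def kl_regularized_value_iteration(P_hat, R_hat, prior, gamma=0.99, tau=0.1, iters=500):
    # P_hat: (S, A, S) estimated transitions, R_hat: (S, A) estimated rewards,
    # prior: (S, A) prior policy probabilities; tau is the KL penalty strength.
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R_hat + gamma * P_hat @ V                       # (S, A)
        # Soft backup: V(s) = tau * log sum_a prior(a|s) exp(Q(s,a)/tau)
        V = tau * np.log((prior * np.exp(Q / tau)).sum(axis=1))
    policy = prior * np.exp(Q / tau)
    policy /= policy.sum(axis=1, keepdims=True)
    return V, policy
```

The closed-form policy above is the maximizer of expected reward minus a KL term toward the prior; smaller `tau` recovers ordinary value iteration on the empirical model.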
-
|
Prioritized offline Goal-swapping Experience Replay
(
Poster
)
link »
In goal-conditioned offline reinforcement learning, an agent learns from previously collected data to reach an arbitrary goal. Since the offline data contain only a finite number of trajectories, a main challenge is how to generate more data. Goal swapping generates additional data by switching trajectory goals, but in doing so produces a large number of invalid trajectories. To address this issue, we propose prioritized goal-swapping experience replay (PGSER). PGSER uses a pre-trained Q function to assign higher priority weights to goal-swapped transitions that allow reaching the goal. In experiments, PGSER significantly improves over baselines in a wide range of benchmark tasks, including challenging, previously unsuccessful dexterous in-hand manipulation tasks. |
Wenyan Yang · Joni Pajarinen · Dingding Cai · Joni-Kristian Kamarainen 🔗 |
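A minimal sketch of goal swapping with Q-based priorities as described above: a transition is relabelled with goals from other trajectories, and each swapped transition is weighted by a pre-trained Q function's estimate of how reachable the new goal is. The softmax conversion of scores to sampling probabilities is an assumption for illustration.

```python
import numpy as np

def goal_swap_priorities(q_pretrained, states, actions, swapped_goals):
    # q_pretrained(state, action, goal) -> scalar "reachability" estimate.
    scores = np.array([
        q_pretrained(s, a, g)
        for s, a, g in zip(states, actions, swapped_goals)
    ])
    # Turn scores into sampling probabilities; low-scoring (likely invalid)
    # swapped transitions are rarely replayed.
    p = np.exp(scores - scores.max())
    return p / p.sum()
```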
-
|
Revisiting Behavior Regularized Actor-Critic
(
Poster
)
link »
In recent years, significant advancements have been made in offline reinforcement learning, with a growing number of novel algorithms of varying degrees of complexity. Despite this progress, the significance of specific design choices and the application of common deep learning techniques remain underexplored. In this work, we demonstrate that it is possible to achieve state-of-the-art performance on the D4RL benchmark through a simple set of modifications to the minimalist offline RL approach and a careful hyperparameter search. Furthermore, our ablations emphasize the importance of minor design choices and hyperparameter tuning while highlighting the untapped potential of applying deep learning techniques in offline reinforcement learning. |
Denis Tarasov · Vladislav Kurenkov · Alexander Nikulin · Sergey Kolesnikov 🔗 |
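A minimal sketch of a behavior-regularized actor update in the spirit of the minimalist offline RL approach referenced above (TD3+BC-style): maximize Q while staying close to the dataset actions. The normalization constant `alpha = 2.5` follows the TD3+BC convention and is an assumption here, not this paper's tuned value.

```python
import torch
import torch.nn.functional as F

def actor_loss(policy, q_net, obs, dataset_actions, alpha=2.5):
    pi_actions = policy(obs)
    q = q_net(obs, pi_actions)
    # Scale the Q term so the behavior-cloning term stays comparable across tasks.
    lam = alpha / q.abs().mean().detach()
    return -lam * q.mean() + F.mse_loss(pi_actions, dataset_actions)
```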
-
|
Successor Feature Representations
(
Poster
)
link »
Transfer in Reinforcement Learning aims to improve learning performance on target tasks using knowledge from experienced source tasks. Successor Representations (SR) and their extension Successor Features (SF) are prominent transfer mechanisms in domains where reward functions change between tasks. They reevaluate the expected return of previously learned policies in a new target task to transfer their knowledge. The SF framework extended SR by linearly decomposing rewards into successor features and a reward weight vector, allowing application to high-dimensional tasks. But this came at the cost of assuming a linear relationship between reward functions and successor features, restricting the framework to tasks with such reward structure. We propose a novel formulation of SR based on learning the cumulative discounted probability of successor features, called Successor Feature Representations (SFR). Crucially, SFR allows reevaluating the expected return of policies for general reward functions. We introduce different SFR variations, prove their convergence, and provide a guarantee on their transfer performance. Experimental evaluations of SFR with function approximation demonstrate its advantage over SF not only for general reward functions, but also in the case of linearly decomposable reward functions. |
Chris Reinke · Xavier Alameda-Pineda 🔗 |
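A minimal sketch of the classic successor-feature machinery the paper builds on: a tabular TD update for the successor features psi and GPI-style reevaluation of previously learned policies under a new reward weight vector w. This shows only the linear special case; the SFR extension to general rewards is not reproduced here.

```python
import numpy as np

def sf_td_update(psi, s, a, s_next, a_next, phi, gamma=0.99, lr=0.1):
    # psi: (S, A, d) successor features; phi: d-dim feature vector of (s, a).
    target = phi + gamma * psi[s_next, a_next]
    psi[s, a] += lr * (target - psi[s, a])
    return psi

def reevaluate_policies(psis, w):
    # Expected return of each stored policy under new reward weights w:
    # Q_i(s, a) = psi_i(s, a) . w ; generalized policy improvement acts
    # greedily over the maximum across policies.
    return [psi @ w for psi in psis]
```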
-
|
Semi-Supervised Offline Reinforcement Learning with Action-Free Trajectories
(
Poster
)
link »
Natural agents can effectively learn from multiple data sources that differ in size, quality, and types of measurements. We study this heterogeneity in the context of offline reinforcement learning (RL) by introducing a new, practically motivated semi-supervised setting. Here, an agent has access to two sets of trajectories: labelled trajectories containing state, action, reward triplets at every timestep, along with unlabelled trajectories that contain only state and reward information. For this setting, we develop and study a simple meta-algorithmic pipeline that learns an inverse dynamics model on the labelled data to obtain proxy-labels for the unlabelled data, followed by the use of any offline RL algorithm on the true and proxy-labelled trajectories. Empirically, we find this simple pipeline to be highly successful: on several D4RL benchmarks (Fu et al., 2020), certain offline RL algorithms can match the performance of variants trained on a fully labelled dataset even when we label only 10% of the trajectories, drawn from the low-return regime. To strengthen our understanding, we perform a large-scale controlled empirical study investigating the interplay of data-centric properties of the labelled and unlabelled datasets with algorithmic design choices (e.g., the choice of inverse dynamics model or offline RL algorithm) to identify general trends and best practices for training RL agents on semi-supervised offline datasets. |
Qinqing Zheng · Mikael Henaff · Brandon Amos · Aditya Grover 🔗 |
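A minimal sketch of the meta-algorithmic pipeline described above: (1) fit an inverse dynamics model (IDM) on the action-labelled trajectories, (2) proxy-label the action-free trajectories, (3) hand the union to any offline RL algorithm. Loader and model names are placeholders; continuous actions are assumed here.

```python
import torch
import torch.nn.functional as F

def train_idm(idm, labelled_loader, optimizer, epochs=5):
    # idm(s_t, s_{t+1}) predicts the action a_t that caused the transition.
    for _ in range(epochs):
        for s, s_next, a in labelled_loader:
            loss = F.mse_loss(idm(s, s_next), a)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return idm

@torch.no_grad()
def proxy_label(idm, unlabelled_loader):
    # Yields (s, a_hat, r, s') tuples ready for any off-the-shelf offline RL method.
    for s, s_next, reward in unlabelled_loader:
        yield s, idm(s, s_next), reward, s_next
```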
-
|
Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning
(
Poster
)
link »
A compelling use case of offline reinforcement learning (RL) is to obtain an effective policy initialization from existing datasets, which allows efficient fine-tuning with limited amounts of active online interaction in the environment. However, many existing offline RL methods tend to exhibit poor fine-tuning performance. In contrast, while naive online RL methods achieve compelling empirical performance, they suffer from large sample complexity without a good policy initialization from the offline data. Our goal in this paper is to devise an approach for learning an effective offline initialization that also unlocks fast online fine-tuning capabilities. Our approach, calibrated Q-learning (Cal-QL), accomplishes this by learning a conservative value function initialization that underestimates the value of the learned policy from offline data, while also being calibrated, meaning that the learned value estimates still upper-bound the ground-truth value of some reference policy (e.g., the behavior policy). Both theoretically and empirically, we show that imposing these conditions speeds up online fine-tuning and brings in the benefits of the offline data. In practice, Cal-QL can be implemented on top of existing offline RL methods without any extra hyperparameter tuning. Empirically, Cal-QL outperforms state-of-the-art methods on a wide range of fine-tuning tasks from both state and visual observations, across several benchmarks. |
Mitsuhiko Nakamoto · Yuexiang Zhai · Anikait Singh · Yi Ma · Chelsea Finn · Aviral Kumar · Sergey Levine 🔗 |
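A minimal sketch of the calibration idea described above, expressed on top of a CQL-style conservative penalty: the Q-values being pushed down for policy actions are clipped from below by a reference value (e.g., a Monte Carlo return of the behavior policy), so the learned value function stays conservative but calibrated. The overall loss composition and the choice of reference are simplified assumptions.

```python
import torch

def calibrated_conservative_penalty(q_pi_actions, q_data_actions, reference_value):
    # q_pi_actions:    Q(s, a') for actions sampled from the current policy
    # q_data_actions:  Q(s, a)  for actions taken in the dataset
    # reference_value: tensor lower bound, e.g., behavior-policy return-to-go
    pushed_down = torch.maximum(q_pi_actions, reference_value)
    return (pushed_down - q_data_actions).mean()
```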
-
|
Accelerating Policy Gradient by Estimating Value Function from Prior Computation in Deep Reinforcement Learning
(
Poster
)
link »
This paper investigates the use of prior computation to estimate the value function and thereby improve sample efficiency in on-policy policy gradient methods in reinforcement learning. Our approach is to estimate the value function from prior computation, such as the Q-network learned by DQN or a value function trained for a different but related environment. In particular, we learn a new value function for the target task while combining it with the value estimate from the prior computation. The resulting value function is then used as a baseline in the policy gradient method. This use of a baseline has the theoretical property of reducing variance in gradient computation and thus improving sample efficiency. The experiments show the successful use of prior value estimates in various settings and improved sample efficiency on several tasks. |
Md Masudur Rahman · Yexiang Xue 🔗 |
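A minimal sketch of using a prior value estimate inside the policy gradient baseline: the advantage is computed against a combination of a newly learned value head and a frozen value estimate from prior computation. The simple convex mixture below is an illustrative assumption about how the two are combined.

```python
import torch

@torch.no_grad()
def baseline(value_net, prior_value_fn, obs, beta=0.5):
    # value_net: trained on the target task as usual (regressed to returns);
    # prior_value_fn: frozen estimate from prior computation (e.g., a DQN Q-network's max).
    return beta * value_net(obs) + (1.0 - beta) * prior_value_fn(obs)

def advantages(returns, value_net, prior_value_fn, obs):
    # A_t = G_t - b(s_t); a state-dependent baseline leaves the gradient
    # unbiased and only reduces its variance.
    return returns - baseline(value_net, prior_value_fn, obs)
```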
-
|
Co-Imitation Learning without Expert Demonstration
(
Poster
)
link »
Imitation learning is a primary approach to improving the efficiency of reinforcement learning by exploiting expert demonstrations. However, in many real scenarios, obtaining expert demonstrations can be extremely expensive or even impossible. To overcome this challenge, we propose a novel learning framework called Co-Imitation Learning (CoIL) that exploits the agents' own past good experiences without expert demonstration. Specifically, we train two different agents by letting each of them alternately explore the environment and exploit the peer agent's experience. Since these experiences can be either valuable or misleading, we propose to estimate the potential utility of each piece of experience with the expected gain of the value function. The agents can thus selectively imitate each other by emphasizing the more useful experiences while filtering out noisy ones. Experimental results on various tasks show the significant superiority of the proposed Co-Imitation Learning framework, validating that the agents can benefit from each other without external supervision. |
Kun-Peng Ning · Hu Xu · Kun Zhu · Sheng-Jun Huang 🔗 |
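A minimal sketch of the experience-selection step described above: an agent keeps a peer's transition only when its estimated value gain is positive. The one-step gain used here is an illustrative assumption standing in for the paper's utility estimate.

```python
def select_peer_experience(peer_buffer, value_fn, gamma=0.99, threshold=0.0):
    selected = []
    for (s, a, r, s_next) in peer_buffer:
        # Estimated gain: how much better the peer's outcome looks than the
        # agent's current value estimate of the state.
        gain = r + gamma * value_fn(s_next) - value_fn(s)
        if gain > threshold:
            selected.append((s, a, r, s_next))   # imitate useful experience
        # otherwise the transition is treated as noisy and filtered out
    return selected
```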
-
|
Instruction-Finetuned Foundation Models for Multimodal Web Navigation
(
Poster
)
link »
We propose an instruction-aligned multimodal agent for autonomous web navigation -- i.e., sequential decision-making tasks on a computer interface. Our approach is based on supervised finetuning of vision and language foundation models on a large corpus of web data consisting of webpage screenshots and HTML. Specifically, we use vision transformers on sequences of webpage screenshots to extract patch-level image features. These features are concatenated with embeddings of tokens in the HTML documents. Using an instruction-finetuned large language model, we jointly encode both the vision and HTML modalities and decode web actions such as click and type. We show that our method outperforms previous approaches by a significant margin, even in handling out-of-distribution HTML and compositional tasks. On the MiniWoB benchmark, we improve over previous approaches that use only HTML input by more than 17.7%, even surpassing the performance of RL-finetuned models. On the recent WebShop benchmark, our 3-billion-parameter model achieves superior performance to the existing state-of-the-art PaLM-540B. We also collect 347K gold demonstrations using our trained models, 29 times more than prior work, and make them available to promote future research in this area. We believe that our work is a step towards building capable, generalist decision-making agents for computer interfaces. |
Hiroki Furuta · Ofir Nachum · Kuang-Huei Lee · Yutaka Matsuo · Shixiang Gu · Izzeddin Gur 🔗 |
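A minimal sketch of the modality fusion described above: projected patch-level screenshot features are concatenated with HTML token embeddings and passed to a language model that decodes an action string. The module names, the `output_dim` attribute, and the T5-style `inputs_embeds`/`labels` interface are assumptions; the actual foundation models are not reproduced here.

```python
import torch
import torch.nn as nn

class WebAgent(nn.Module):
    def __init__(self, vision_encoder, lm, d_model):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g., a ViT returning (B, P, d_v)
        self.lm = lm                                      # encoder-decoder LM (T5-style API assumed)
        self.proj = nn.Linear(vision_encoder.output_dim, d_model)

    def forward(self, screenshots, html_token_embeddings, action_targets=None):
        patch_feats = self.proj(self.vision_encoder(screenshots))        # (B, P, d_model)
        fused = torch.cat([patch_feats, html_token_embeddings], dim=1)   # joint sequence
        # The LM decodes web actions such as "click(id=...)" or "type(...)".
        return self.lm(inputs_embeds=fused, labels=action_targets)
```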
-
|
Do Embodied Agents Dream of Pixelated Sheep?: Embodied Decision Making using Language Guided World Modelling
(
Poster
)
link »
Reinforcement learning (RL) agents typically learn tabula rasa, without prior knowledge of the world, which makes learning complex tasks with sparse rewards difficult. If initialized with knowledge of high-level subgoals and transitions between subgoals, RL agents could utilize this Abstract World Model (AWM) for planning and exploration. We propose using few-shot large language models (LLMs) to hypothesize an AWM, which is then tested and verified during exploration, to improve sample efficiency in embodied RL agents. Our DECKARD agent applies LLM-guided exploration to item crafting in Minecraft in two phases: (1) the Dream phase, where the agent uses an LLM to decompose a task into a sequence of subgoals, the hypothesized AWM; and (2) the Wake phase, where the agent learns a modular policy for each subgoal and verifies or corrects the hypothesized AWM on the basis of its experiences. Our method of hypothesizing an AWM with LLMs and then verifying it based on agent experience not only increases sample efficiency over contemporary methods by an order of magnitude, but is also robust to, and corrects, errors in the LLM, successfully blending noisy internet-scale information from LLMs with knowledge grounded in environment dynamics. |
Kolby Nottingham · Prithviraj Ammanabrolu · Alane Suhr · Yejin Choi · Hannaneh Hajishirzi · Sameer Singh · Roy Fox 🔗 |
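A minimal sketch of the Dream/Wake loop described above: an LLM proposes a subgoal sequence (the hypothesized AWM), and the agent keeps only the subgoals it can actually verify in the environment. `llm_propose_subgoals` and `learn_subgoal_policy` are placeholder callables, not the paper's interfaces.

```python
def dream(llm_propose_subgoals, task):
    # e.g., "craft a wooden pickaxe" -> ["log", "planks", "stick", "wooden_pickaxe"]
    return llm_propose_subgoals(task)

def wake(subgoals, learn_subgoal_policy, env):
    verified = []
    for goal in subgoals:
        success = learn_subgoal_policy(env, goal)   # train a modular policy per subgoal
        if success:
            verified.append(goal)                   # keep the hypothesized AWM edge
        # otherwise the hypothesized AWM is corrected based on experience
    return verified

# Alternate the two phases: verified subgoals ground the next round of LLM proposals.
```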
-
|
Multi-Environment Pretraining Enables Transfer to Action Limited Datasets
(
Poster
)
link »
Using massive datasets to train large-scale models has emerged as a dominant approach for broad generalization in natural language and vision applications. In reinforcement learning, however, a key challenge is that available sequential decision-making data is often not annotated with actions - for example, videos of game-play are far more plentiful than sequences of frames paired with their logged game controls. We propose to circumvent this challenge by combining large but sparsely-annotated datasets from a target environment of interest with fully-annotated datasets from various other source environments. Our method, Action Limited PreTraining (ALPT), leverages the generalization capabilities of inverse dynamics modelling (IDM) to label missing action data in the target environment. We show that utilizing even one additional labelled environment dataset during IDM pretraining gives rise to substantial improvements in generating action labels for unannotated sequences. We evaluate our method on Atari game-playing environments and show that with target environment data equivalent to only $12$ minutes of gameplay, we can significantly improve game performance and generalization capability compared to other approaches. Furthermore, we show that ALPT remains beneficial even when the target and source environments share no common actions, highlighting the importance of pretraining on broad datasets even when they might seem irrelevant to the target task at hand.
|
David Venuto · Sherry Yang · Pieter Abbeel · Doina Precup · Igor Mordatch · Ofir Nachum 🔗 |
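A minimal sketch of the action-limited pretraining idea described above: the inverse dynamics model is trained on the pooled fully-annotated source-environment data plus the small annotated slice of the target environment, and then used to label the remaining target sequences (as in the proxy-labelling sketch earlier in this section). Loader names and the discrete-action assumption are illustrative.

```python
import torch
import torch.nn.functional as F

def pretrain_idm_multi_env(idm, loaders, optimizer, epochs=5):
    # loaders: fully-annotated source-environment loaders plus the sparsely
    # annotated target-environment loader, each yielding (s, s_next, a).
    for _ in range(epochs):
        for loader in loaders:
            for s, s_next, a in loader:
                loss = F.cross_entropy(idm(s, s_next), a)   # discrete (Atari) actions
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return idm
```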
-
|
Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals
(
Poster
)
link »
High sample complexity has long been a challenge for RL. On the other hand, humans learn to perform tasks not only from interaction or demonstrations, but also by reading unstructured text documents, e.g., instruction manuals. Instruction manuals and wiki pages are among the most abundant data that could inform agents of valuable features and policies or of task-specific environmental dynamics and reward structures. We therefore hypothesize that the ability to utilize human-written instruction manuals to assist in learning policies for specific tasks should lead to a more efficient and better-performing agent. We propose the Read and Reward framework. Read and Reward speeds up RL algorithms on Atari games by reading manuals released by the Atari game developers. Our framework consists of a QA Extraction module that extracts and summarizes relevant information from the manual and a Reasoning module that evaluates object-agent interactions based on information from the manual. An auxiliary reward is then provided to a standard A2C RL agent when an interaction is detected. When assisted by our design, A2C improves on 4 games in the Atari environment with sparse rewards, and requires 1000x fewer training frames than the previous SOTA Agent 57 on Skiing, the hardest game in Atari. |
Yue Wu · Yewen Fan · Paul Pu Liang · Amos Azaria · Yuanzhi Li · Tom Mitchell 🔗 |
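A minimal sketch of the reward-shaping interface described above: a reasoning module scores detected object-agent interactions against information extracted from the manual, and a bonus is added to the environment reward seen by a standard A2C agent. The detector, the scorer, and the ternary output convention are placeholder assumptions.

```python
def shaped_reward(env_reward, interaction, manual_knowledge, reasoning_module, bonus=1.0):
    # interaction: detected object-agent interaction for this step, or None.
    if interaction is None:
        return env_reward
    # +bonus if the manual suggests the interaction is beneficial,
    # -bonus if harmful, 0 if the manual says nothing about it.
    sign = reasoning_module(interaction, manual_knowledge)   # in {-1, 0, +1}
    return env_reward + bonus * sign
```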