A Roadmap to Never-Ending RL

Feryal Behbahani, Khimya Khetarpal, Louis Kirsch, Rose Wang, Annie Xie, Adam White, Doina Precup


Humans have a remarkable ability to continually learn and adapt to new scenarios over the duration of their lifetime (Smith & Gasser, 2005). This ability, referred to as never-ending learning, is also known as continual learning or lifelong learning. Never-ending learning is the constant development of increasingly complex behaviors and the process of building complicated skills on top of those already developed (Ring, 1997), while reapplying, adapting, and generalizing these abilities to new situations. A never-ending learner has the following desiderata:

1) it learns behaviors and skills while solving its tasks
2) it invents new subtasks that may later serve as stepping stones
3) it learns hierarchically, i.e. skills learned now can be built upon later
4) it learns without ergodic or resetting assumptions on the underlying (PO)MDP
5) it learns without episode boundaries
6) it learns in a single life without leveraging multiple episodes of experience

There are several facets to building AI agents with never-ending learning abilities, and different fields offer a variety of perspectives on achieving this goal. To this end, we identify key themes for our workshop, including cognitive science, developmental robotics, agency and abstractions, open-ended learning, world modelling, and active inference.




Fri 6:00 a.m. - 7:00 a.m.

Join us on GatherTown for the poster session!

Fri 7:00 a.m. - 7:15 a.m.
Organizers Opening Remarks (Opening remarks)
Feryal Behbahani, Louis Kirsch, Khimya Khetarpal, Rose Wang, Annie Xie
Fri 7:15 a.m. - 7:16 a.m.
Speaker & Panelist Introduction #1: Danijar Hafner & Eric Eaton (Speaker introduction)
Feryal Behbahani
Fri 7:16 a.m. - 7:31 a.m.
Invited Talk #1: Danijar Hafner (Invited talk)   
Danijar Hafner
Fri 7:31 a.m. - 8:00 a.m.
Panel #1: Danijar Hafner & Eric Eaton (Panel discussion)
Fri 8:00 a.m. - 8:15 a.m.
Contributed Talk #1: Continuous Coordination As a Realistic Scenario For Lifelong Learning (Contributed talk)   
Akilesh Badrinaaraayanan, Hadi Nekoei, Aaron Courville, Sarath Chandar
Fri 8:15 a.m. - 8:16 a.m.
Speaker & Panelist Introduction #2: Anna Harutyunyan & Martha White (Speaker introduction)
Feryal Behbahani
Fri 8:16 a.m. - 8:31 a.m.
Invited Talk #2: Anna Harutyunyan (Invited talk)   
Anna Harutyunyan
Fri 8:31 a.m. - 9:00 a.m.
Panel #2: Anna Harutyunyan & Martha White (Panel discussion)
Fri 9:00 a.m. - 9:15 a.m.
Contributed Talk #2: Reward and Optimality Empowerments: Information-Theoretic Measures for Task Complexity in Deep Reinforcement Learning (Contributed talk)
Hiroki Furuta, Tatsuya Matsushima, Tadashi Kozuno, Yutaka Matsuo, Sergey Levine, Ofir Nachum, Shixiang Gu
Fri 9:15 a.m. - 9:20 a.m.
Fri 9:20 a.m. - 10:05 a.m.
Roundtable Panel (Panel discussion)
Adam White
Fri 10:05 a.m. - 10:06 a.m.
Speaker & Panelist Introduction #3: Joel Lehman & Pierre-Yves Oudeyer (Speaker introduction)
Louis Kirsch
Fri 10:06 a.m. - 10:21 a.m.
Invited Talk #3: Joel Lehman (Invited talk)   
Joel Lehman
Fri 10:21 a.m. - 10:50 a.m.
Panel #3: Joel Lehman & Pierre-Yves Oudeyer (Panel discussion)
Fri 10:50 a.m. - 11:05 a.m.
Contributed Talk #3: RECON: Rapid Exploration for Open-World Navigation with Latent Goal Models (Contributed talk)   
Dhruv Shah, Ben Eysenbach, Nicholas Rhinehart, Sergey Levine
Fri 11:05 a.m. - 11:10 a.m.
Fri 11:10 a.m. - 11:11 a.m.
Speaker & Panelist Introduction #4: Natalia Díaz-Rodríguez & Aleksandra Faust (Speaker introduction)
Annie Xie
Fri 11:11 a.m. - 11:26 a.m.
Invited Talk #4: Natalia Díaz-Rodríguez (Invited talk)   
Natalia Díaz-Rodríguez
Fri 11:26 a.m. - 11:55 a.m.
Panel #4: Natalia Díaz-Rodríguez & Aleksandra Faust (Panel discussion)
Fri 11:55 a.m. - 11:56 a.m.
Speaker & Panelist Introduction #5: Hyo Gweon & Matt Botvinick (Speaker introduction)
Rose Wang
Fri 11:56 a.m. - 12:11 p.m.
Invited Talk #5: Hyo Gweon (Invited talk)   
Hyo Gweon
Fri 12:11 p.m. - 12:40 p.m.
Panel #5: Hyo Gweon & Matt Botvinick (Panel discussion)
Fri 12:40 p.m. - 12:55 p.m.
Closing remarks
Feryal Behbahani, Louis Kirsch, Khimya Khetarpal, Annie Xie, Rose Wang
Fri 12:55 p.m. - 1:55 p.m.

Please join us on GatherTown for the poster session!

We study reinforcement learning (RL) with no-reward demonstrations, a setting in which an RL agent has access to additional data from the interaction of other agents with the same environment. However, it has no access to the rewards or goals of these agents, and their objectives and levels of expertise may vary widely. These assumptions are common in multi-agent settings, such as autonomous driving. To effectively use this data, we turn to the framework of successor features. This allows us to disentangle shared features and dynamics of the environment from agent-specific rewards and policies. We propose a multi-task inverse reinforcement learning (IRL) algorithm, called inverse temporal difference learning (ITD), that learns shared state features, alongside per-agent successor features and preference vectors, purely from demonstrations without reward labels. We further show how to seamlessly integrate ITD with learning from online environment interactions, arriving at a novel algorithm for reinforcement learning with demonstrations, called $\Psi\Phi$-learning (pronounced 'Sci-Fi'). We provide empirical evidence for the effectiveness of $\Psi\Phi$-learning as a method for improving RL, IRL, imitation, and few-shot transfer, and derive worst-case bounds for its performance in zero-shot transfer to new tasks.
Angelos Filos, Clare Lyle, Yarin Gal, Sergey Levine, Natasha Jaques, Gregory Farquhar
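The successor-feature decomposition that the abstract builds on factors the value function as $Q^\pi(s,a) = \psi^\pi(s,a) \cdot w$, where $\psi$ are successor features of shared state features $\phi$ and $w$ is an agent-specific preference vector. A minimal tabular sketch of that decomposition (the random MDP, sizes, and names are illustrative, not the paper's actual ITD procedure, which estimates $\phi$, $\psi$, and $w$ jointly from demonstrations):

```python
import numpy as np

# Sketch: successor features psi satisfy a Bellman equation over the shared
# features phi, and Q recovers by a dot product with the preference vector w.
rng = np.random.default_rng(0)
n_states, n_actions, n_feats, gamma = 5, 2, 3, 0.9

phi = rng.normal(size=(n_states, n_actions, n_feats))  # shared features
w = np.array([1.0, -0.5, 0.2])                         # agent preferences
reward = phi @ w                                       # r(s,a) = phi(s,a).w

policy = rng.integers(n_actions, size=n_states)        # fixed demo policy
next_state = rng.integers(n_states, size=(n_states, n_actions))

# Evaluate successor features by fixed-point iteration of
# psi(s,a) = phi(s,a) + gamma * psi(s', pi(s')).
psi = np.zeros_like(phi)
for _ in range(500):
    for s in range(n_states):
        for a in range(n_actions):
            s2 = next_state[s, a]
            psi[s, a] = phi[s, a] + gamma * psi[s2, policy[s2]]

# Once w is known (ITD estimates it from reward-free demonstrations),
# Q follows without ever regressing on rewards directly.
q = psi @ w
```

The resulting `q` satisfies the ordinary Bellman equation for `reward`, which is the point of the disentanglement: swapping in a different agent's `w` re-evaluates the same shared `psi` under that agent's preferences.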

Reinforcement learning (RL) is becoming increasingly more successful for robotics beyond simulated environments. However, the success of such reinforcement learning systems is predicated on the often under-emphasised reset mechanism: each trial needs to start from a fixed initial state distribution. Unfortunately, resetting the environment to its initial state after each trial often requires extensive instrumentation and engineering effort in the real world, or manual human supervision to orchestrate environmental resets, which defeats the purpose of autonomous reinforcement learning. In this work, we formalize persistent reinforcement learning: a problem setting that explicitly factors in that environment resets are not freely available. We then introduce Value-accelerated Persistent Reinforcement Learning (VaPRL), which learns efficiently on a constrained budget of resets by generating a curriculum of increasingly harder tasks converging to the evaluation setting. We observe that our proposed algorithm requires only a handful of environmental resets, reducing the requirement by several orders of magnitude while outperforming competitive baselines on several continuous control environments. Overall, we hope that the reduced reliance on environmental resets can enable agents to learn with greater autonomy in the real world.

Archit Sharma, Abhishek Gupta, Karol Hausman, Sergey Levine, Chelsea Finn

We propose a novel method that can learn a prior model of task structure from the training tasks and transfer it to the unseen tasks for fast adaptation. We formulate this as a few-shot reinforcement learning problem where a task is characterized by a subtask graph which describes a set of subtasks and their dependencies that are unknown to the agent. Instead of directly inferring an unstructured task embedding, our multi-task subtask graph inferencer (MTSGI) infers the common task structure in terms of the subtask graph from the training tasks, and uses it as a prior to improve task inference at test time. To this end, we propose to model the prior sampling and posterior update for the subtask graph inference. Our experimental results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks than various existing algorithms such as meta reinforcement learning, hierarchical reinforcement learning, and other heuristic agents.

Sungryull Sohn, Hyunjae Woo, Jongwook Choi, Izzeddin Gur, Aleksandra Faust, Honglak Lee

The benefit of multi-task learning over single-task learning relies on the ability to use relations across tasks to improve performance on any single task. While sharing representations is an important mechanism to share information across tasks, its success depends on how well the structure underlying the tasks is captured. In some real-world situations, we have access to metadata, or additional information about a task, that may not provide any new insight in the context of a single task setup alone but inform relations across multiple tasks. While this metadata can be useful for improving multi-task learning performance, effectively incorporating it can be an additional challenge. We posit that an efficient approach to knowledge transfer is through the use of multiple context-dependent, composable representations shared across a family of tasks. In this framework, metadata can help to learn interpretable representations and provide the context to inform which representations to compose and how to compose them. We use the proposed approach to obtain state-of-the-art results in Meta-World, a challenging multi-task benchmark consisting of 50 distinct robotic manipulation tasks.

Shagun Sodhani, Amy Zhang, Joelle Pineau

The lottery ticket hypothesis questions the role of overparameterization in supervised deep learning. But how does the distributional shift inherent to the reinforcement learning problem affect the performance of winning lottery tickets? In this work, we show that feed-forward networks trained via supervised policy distillation and reinforcement learning can be pruned to the same level of sparsity. Furthermore, we establish the existence of winning tickets for both on- and off-policy methods in a visual navigation and classic control task. Using a set of carefully designed baseline conditions, we find that the majority of the lottery ticket effect in reinforcement learning can be attributed to the identified mask. The resulting masked observation space eliminates redundant information and yields minimal task-relevant representations. The mask identified by iterative magnitude pruning provides an interpretable inductive bias. Its costly generation can be amortized by training dense agents with low-dimensional input and thereby at lower computational cost.

Marc Vischer, Henning Sprekeler, Robert Lange
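Iterative magnitude pruning, the mask-generation procedure the abstract refers to, repeatedly trains, prunes the smallest-magnitude weights, rewinds the survivors to their initial values, and retrains. A minimal sketch of just the mask logic, with training stubbed out (the function name, fractions, and dummy trainer are our own, not the paper's setup):

```python
import numpy as np

def imp_mask(w_init, train_fn, rounds=3, prune_frac=0.2):
    """Binary mask after `rounds` of train -> prune -> rewind."""
    mask = np.ones_like(w_init)
    w = w_init.copy()
    for _ in range(rounds):
        w = train_fn(w * mask) * mask            # train the sparse network
        surviving = np.abs(w[mask == 1])
        if surviving.size == 0:
            break
        thresh = np.quantile(surviving, prune_frac)  # cut smallest fraction
        mask = mask * (np.abs(w) > thresh)
        w = w_init.copy()                        # rewind to initialization
    return mask

rng = np.random.default_rng(1)
w0 = rng.normal(size=100)
mask = imp_mask(w0, train_fn=lambda w: w * 2.0)  # dummy "training"
sparsity = 1.0 - mask.mean()                     # roughly 1 - 0.8**rounds
```

In the RL setting the abstract studies, it is this mask (rather than the surviving weight values) that carries most of the effect, acting as a learned filter on the observation space.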

We develop a new continual meta-learning method to address challenges in sequential multi-task learning. In this setting, the goal of the agent is to quickly achieve high reward over any sequence of tasks. Prior meta-reinforcement learning algorithms have demonstrated promising results in accelerating the acquisition of new tasks, but they require access to all tasks during training. Beyond simply transferring past experience to new tasks, our goal is to devise continual reinforcement learning algorithms that learn to learn, using their experience on previous tasks to learn new tasks more quickly. We introduce a new method, continual meta-policy search (CoMPS), that removes this limitation by meta-training incrementally, over each task in a sequence, without revisiting prior tasks. CoMPS continuously repeats two subroutines: learning a new task and meta-learning to prepare for subsequent task learning. To solve each new task, CoMPS runs reinforcement learning from its current meta-learned initial parameters. For meta-training, CoMPS performs an entirely offline meta-reinforcement learning procedure over data collected from previous tasks. On several sequences of challenging continuous control tasks, we find that CoMPS outperforms prior continual learning and off-policy meta-reinforcement learning methods.

Glen Berseth, Zhiwei Zhang, Chelsea Finn, Sergey Levine
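The two-subroutine loop the abstract describes can be sketched in a few lines. Here the "tasks" are 1-D quadratic objectives and the meta-update is a simple Reptile-style average over past task solutions; this illustrates the alternation of task learning and offline meta-training, not CoMPS's actual meta-RL procedure:

```python
import numpy as np

def solve_task(theta0, target, steps=50, lr=0.2):
    """Subroutine 1: 'RL' on the current task, starting from the
    meta-learned initialization (here: gradient descent on (theta-target)^2)."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * 2 * (theta - target)
    return theta

meta_theta = 0.0
past_solutions = []
for target in [1.0, 1.2, 0.9, 1.1]:          # sequence of tasks, seen once each
    theta = solve_task(meta_theta, target)   # learn the new task
    past_solutions.append(theta)
    # Subroutine 2: offline meta-update from previously collected task data;
    # no prior task is re-run, only its stored solution is reused.
    meta_theta = np.mean(past_solutions)
```

The key structural property, mirrored here, is that each task is visited once: meta-training consumes only stored data from earlier tasks.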

In this work, we specifically study how a robot can autonomously learn to clean a room by collecting objects off the ground and putting them into a basket. This task exemplifies the coordination needed between manipulation and navigation: the robot needs to navigate to objects in order to attempt to grasp them. Our goal is to enable a robot to learn this task autonomously under realistic settings, without any environment instrumentation, human intervention, or access to privileged information, such as maps, object positions, or a global view of the environment. While reinforcement learning (RL) from images provides a general solution to learning tasks in theory, in practice most successful uses of RL rely on instrumented setups, hand-engineered state tracking, and/or human-provided resets. We propose a novel learning system, ReALMM, that avoids the need for these by separating grasping and navigation policies at the architecture level for efficient learning, but still trains them together from the same sparse grasp-success signal. ReALMM also avoids the need for externally provided resets by using an autonomous pseudo-resetting behavior. We show that with ReALMM, a robot can learn to navigate and clean up a room completely autonomously, without any external supervision.

Charles Sun, Coline Devin, Abhishek Gupta, Glen Berseth, Sergey Levine

We consider the offline reinforcement learning (RL) setting where the agent aims to optimize the policy solely from the data without further environment interactions. In offline RL, the distributional shift becomes the primary source of difficulty, which arises from the deviation of the target policy being optimized from the behavior policy used for data collection. This typically causes overestimation of action values, which poses severe problems for model-free algorithms that use bootstrapping. To mitigate the problem, prior offline RL algorithms often used sophisticated techniques that encourage underestimation of action values, which introduces an additional set of hyperparameters that need to be tuned properly. In this paper, we present an offline RL algorithm that prevents overestimation in a more principled way. Our algorithm, OptiDICE, tightly integrates the optimization of the target policy and the stationary distribution ratio estimation of the target policy and the behavior policy. Using an extensive set of benchmark datasets for offline RL, we show that OptiDICE performs competitively with the state-of-the-art methods.

Jongmin Lee, Wonseok Jeon, Byung-Jun Lee, Joelle Pineau, Kee-Eung Kim
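The stationary-distribution view the abstract describes is usually written as a constrained optimization over state-action occupancies. The following is our paraphrase of the generic DICE-style formulation, stated for context rather than as OptiDICE's exact equations:

```latex
\max_{d \ge 0}\; \mathbb{E}_{(s,a)\sim d}\!\left[R(s,a)\right] \;-\; \alpha\, D_f\!\left(d \,\|\, d^{D}\right)
\quad \text{s.t.}\quad
\sum_{a} d(s,a) \;=\; (1-\gamma)\,\mu_0(s) \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \quad \forall s,
```

where $d^{D}$ is the dataset (behavior) occupancy, $\mu_0$ the initial-state distribution, and $D_f$ an $f$-divergence that directly penalizes deviation from the data. Because the penalty acts on the occupancy $d$ itself rather than on bootstrapped action values, overestimation is prevented without the pessimism hyperparameter tuning the abstract criticizes; the target policy is then read off from the optimal occupancy via $\pi(a \mid s) \propto d^{*}(s,a)$.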

Model-based algorithms, which learn a dynamics model from logged experience and perform some sort of pessimistic planning under the learned model, have emerged as a promising paradigm for offline reinforcement learning (offline RL). However, practical variants of such model-based algorithms rely on explicit uncertainty quantification for incorporating pessimism. Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable. We overcome this limitation by developing a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-action tuples generated via rollouts under the learned model. This results in a conservative estimate of the value function for out-of-support state-action tuples, without requiring explicit uncertainty estimation. We theoretically show that our method optimizes a lower bound on the true policy value, that this bound is tighter than that of prior methods, and that our approach satisfies a policy improvement guarantee in the offline setting. Through experiments, we find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods on widely studied offline RL benchmarks, including image-based tasks.

Tianhe (Kevin) Yu, Aviral Kumar, Aravind Rajeswaran, Rafael Rafailov, Sergey Levine, Chelsea Finn
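The conservative regularizer the abstract describes can be made concrete in a toy tabular setting: alongside an ordinary Bellman-style update, push Q down on state-actions sampled from model rollouts and up on state-actions from the dataset. A one-step (bandit) problem with a perfect model keeps the fixed point easy to see; `beta`, the sizes, and the setup are illustrative, not COMBO's actual losses:

```python
import numpy as np

rng = np.random.default_rng(2)
n_actions, beta, lr = 4, 1.0, 0.1
r = rng.normal(size=n_actions)        # true one-step rewards (= model rewards)
data_actions = np.array([0, 1])       # actions covered by the offline dataset
model_actions = np.arange(n_actions)  # model rollouts cover every action

q = np.zeros(n_actions)
for _ in range(2000):
    q += lr * (r - q)                          # Bellman regression on rollouts
    q[model_actions] -= lr * beta / n_actions  # conservative: down on rollouts
    q[data_actions] += lr * beta / len(data_actions)  # up on dataset support
```

At the fixed point, in-support actions sit above their true value and out-of-support actions below it, so pessimism appears exactly where the data gives no coverage, and no explicit uncertainty estimate is ever computed.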

Many sequential decision making problems can be naturally formulated as continuing tasks in which the agent-environment interaction goes on forever without limit. Unlike the episodic case, reinforcement learning (RL) solution methods for the continuing setting are not well understood, theoretically or empirically. RL research lacks a collection of easy-to-use continuing problems that can help foster our understanding of the problem setting and its solution methods. To stimulate research in the RL methods for the continuing setting, we sketch a preliminary set of continuing problems that we refer to as C-suite. We invite the workshop attendees to further refine the sketch and contribute new problems that isolate specific research issues that arise in the continuing setting.

Abhishek Naik, Zaheer Abbas, Adam White, Rich Sutton
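In the spirit of the easy-to-use continuing problems the abstract calls for, here is a minimal two-state continuing MDP together with differential Q-learning (the average-reward method of Wan, Naik & Sutton, 2021) as one candidate solution method; the MDP and hyperparameters are our own toy choices, not part of C-suite:

```python
import numpy as np

# In state 0, action 1 moves to state 1 (reward 0); in state 1, action 0
# moves back (reward 4) and action 1 stays (reward 1). The interaction never
# terminates; optimal behavior cycles 0 -> 1 -> 0 for an average reward of 2.
next_s = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
rew = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 4.0, (1, 1): 1.0}

rng = np.random.default_rng(3)
q = np.zeros((2, 2))           # differential action values
rbar, alpha, eta = 0.0, 0.05, 0.5
s = 0
for _ in range(50_000):
    a = int(rng.integers(2))   # random behavior policy (off-policy)
    s2, r = next_s[(s, a)], rew[(s, a)]
    delta = r - rbar + q[s2].max() - q[s, a]
    q[s, a] += alpha * delta
    rbar += eta * alpha * delta  # estimate of the optimal average reward
    s = s2

greedy = q.argmax(axis=1)  # should cycle: action 1 in state 0, action 0 in state 1
```

Note there is no discount factor and no reset: the average-reward estimate `rbar` replaces discounting, which is one reason continuing problems need their own algorithms and test suites.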

The search for neural architecture is producing many of the most exciting results in artificial intelligence. It has increasingly become apparent that task-specific neural architecture plays a crucial role in effectively solving problems. This paper presents a simple method for learning neural architecture through random mutation. This method demonstrates 1) neural architecture may be learned during the agent's lifetime, 2) neural architecture may be constructed over a single lifetime without any initial connections or neurons, and 3) architectural modifications enable rapid adaptation to dynamic and novel task scenarios. Starting without any neurons or connections, this method constructs a neural architecture capable of high performance on several tasks. The lifelong learning capabilities of this method are demonstrated in an environment without episodic resets, even learning with constantly changing morphology, limb disablement, and changing task goals, all without losing locomotion capabilities.

Samuel Schmidgall

Reinforcement Learning (RL) algorithms can in principle acquire complex robotic skills by learning from large amounts of data in the real world, collected via trial and error. However, most RL algorithms use a carefully instrumented setup in order to collect data, requiring human supervision and intervention to provide episodic resets. This is particularly evident in challenging robotics problems, such as dexterous manipulation. To make data collection scalable, such applications require reset-free algorithms that are able to learn autonomously, without explicit instrumentation or human intervention. Most prior work in this area handles single-task learning. However, we might also want robots that can perform large repertoires of skills. At first, this would appear to only make the problem harder. However, the key observation we make in this work is that an appropriately chosen multi-task RL setting actually alleviates the reset-free learning challenge, with minimal additional machinery required. In effect, solving a multi-task problem can directly solve the reset-free problem since different combinations of tasks can serve to perform resets for other tasks. By learning multiple tasks together and appropriately sequencing them, we can effectively learn all of the tasks together reset-free. This type of multi-task learning can effectively scale reset-free learning schemes to much more complex problems, as we demonstrate in our experiments. We propose a simple scheme for multi-task learning that tackles the reset-free learning problem, and show its effectiveness at learning to solve dexterous manipulation tasks in both hardware and simulation without any explicit resets.

Abhishek Gupta, Justin Yu, Vikash Kumar, Tony Zhao, Kelvin Xu, Aaron Rovinsky, Thomas Devlin, Sergey Levine

Recurrent meta reinforcement learning (meta-RL) agents are agents that employ a recurrent neural network (RNN) for the purpose of "learning a learning algorithm". After being trained on a pre-specified task distribution, the learned weights of the agent's RNN are able to implement an efficient learning algorithm through their activity dynamics, which allows the agent to quickly solve new tasks sampled from the same distribution. However, due to the black-box nature of these agents, the way in which they work is not yet fully understood. In this study, we shed light on the internal working mechanisms of these agents by reformulating the meta-RL problem using the Partially Observable Markov Decision Process (POMDP) framework. We hypothesize that the learned activity dynamics act as belief states for such agents. Several illustrative experiments suggest that this hypothesis is true, and that recurrent meta-RL agents can be viewed as agents that learn to act optimally in partially observable environments consisting of multiple related tasks. This view helps in understanding their failure cases and some interesting model-based results recently reported in the literature.

Safa Alver, Doina Precup