Open-ended learning processes that co-evolve agents and their environments produced human intelligence, but building an artificial system that generates endless, meaningful novelty remains an open problem in AI research. We hope our workshop provides a forum both for bridging knowledge across a diverse set of relevant fields and for sparking new insights that can enable agent learning in open-endedness.
Fri 1:45 a.m. - 2:00 a.m.
|
Introductory Remarks ( Introduction ) | Minqi Jiang |
Fri 2:00 a.m. - 2:30 a.m.
|
Open-Ended Learning Leads to Generally Capable Agents ( Invited Talk ) | Wojciech M Czarnecki |
Fri 2:30 a.m. - 3:00 a.m.
|
Environments and Representations for Open-Ended Discovery ( Invited Talk ) | Sebastian Risi |
Fri 3:00 a.m. - 3:15 a.m.
|
Walk the Random Walk: Learning to Discover and Reach Goals Without Supervision ( Spotlight, Poster )
Learning a diverse set of skills by interacting with an environment without any external supervision is an important challenge. In particular, obtaining a goal-conditioned agent that can reach any given state is useful in many applications. We propose a novel method for training such a goal-conditioned agent without any external rewards or domain knowledge about the environment. The first component of our method is a reachability network that learns to measure the similarity between two states from random interactions only. This reachability network is then used to build the second component, a memory of past observations that are diverse and well-balanced. Finally, we train a goal-conditioned policy network, the third component, with goals sampled from the memory, rewarded by scores computed by the reachability network. All three components are kept updated throughout training as the agent explores and learns new skills. We demonstrate that our method allows training an agent for continuous control navigation, as well as robotic manipulation. |
Lina Mezghani · Piotr Bojanowski · Karteek Alahari · Sainbayar Sukhbaatar |
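A minimal sketch of the reachability-network component described in the abstract above (not the authors' code; state dimensionality, the step threshold, and network sizes are illustrative assumptions):

```python
# Minimal sketch: train a reachability network on random trajectories by labelling
# state pairs as "reachable" if they occur within K_STEPS of each other.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, K_STEPS = 8, 10

class ReachabilityNet(nn.Module):
    def __init__(self, dim=STATE_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, s_a, s_b):
        # Score in (0, 1): how likely s_b is reachable from s_a within K_STEPS.
        return torch.sigmoid(self.net(torch.cat([s_a, s_b], dim=-1)))

def make_pairs(trajectory, k=K_STEPS):
    """Positive pairs are <= k steps apart, negatives are further apart."""
    s_a, s_b, labels = [], [], []
    for i in range(len(trajectory)):
        for j in range(i + 1, len(trajectory)):
            s_a.append(trajectory[i]); s_b.append(trajectory[j])
            labels.append(1.0 if j - i <= k else 0.0)
    return torch.stack(s_a), torch.stack(s_b), torch.tensor(labels).unsqueeze(-1)

# Dummy random-walk trajectory standing in for random environment interactions.
trajectory = list(torch.cumsum(torch.randn(50, STATE_DIM) * 0.1, dim=0))
reachability = ReachabilityNet()
optim = torch.optim.Adam(reachability.parameters(), lr=1e-3)
s_a, s_b, y = make_pairs(trajectory)
for _ in range(100):
    loss = F.binary_cross_entropy(reachability(s_a, s_b), y)
    optim.zero_grad(); loss.backward(); optim.step()

# The learned score can then (i) filter a diverse goal memory and (ii) reward a
# goal-conditioned policy when reachability(current_state, goal) crosses a threshold.
```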
Fri 3:15 a.m. - 3:30 a.m.
|
Towards Evaluating Adaptivity of Model-Based Reinforcement Learning ( Spotlight, Poster )
In recent years, a growing number of deep model-based reinforcement learning (RL) methods have been introduced. The interest in deep model-based RL is not surprising, given its many potential benefits, such as higher sample efficiency and the potential for fast adaptation to changes in the environment. However, we demonstrate, using an improved version of the recently introduced Local Change Adaptation (LoCA) setup, that the well-known model-based methods PlaNet and DreamerV2 adapt poorly to local environmental changes. Combined with prior work that made a similar observation about another popular model-based method, MuZero, a trend emerges that suggests that current deep model-based methods have serious limitations. We dive deeper into the causes of this poor adaptivity by identifying elements that hurt adaptive behavior and linking them to underlying techniques frequently used in deep model-based RL. We empirically validate these insights in the case of linear function approximation by demonstrating that a modified version of linear Dyna achieves effective adaptation to local changes. |
Yi Wan · Ali Rahimi-Kalahroudi · Janarthanan Rajendran · Ida Momennejad · Sarath Chandar · Harm van Seijen |
Fri 3:30 a.m. - 3:45 a.m.
|
Don't Freeze Your Embedding: Lessons from Policy Finetuning in Environment Transfer ( Spotlight, Poster )
A common practice in reinforcement learning (RL) research is to use a pretrained vision stack that converts image observations to latent vectors. Using a visual embedding in this way leaves an open question: should the vision stack be updated alongside the policy? In this work, we evaluate the effectiveness of such decisions in RL transfer settings. We introduce policy update formulations for use after pretraining in a different environment and analyze the performance of these formulations. Through this evaluation, we also detail emergent metrics of benchmark suites and present results on Atari and AndroidEnv. |
Victoria Dean · Daniel Toyama · Doina Precup |
Fri 3:45 a.m. - 4:45 a.m.
|
Virtual Poster Session #1 ( Poster Session ) |
Fri 4:45 a.m. - 5:30 a.m.
|
Lunch Break and Discussions
Grab a bite and come join us for discussions in Gather. |
Fri 5:30 a.m. - 6:00 a.m.
|
Using Quality-Diversity to Co-Learn Generators that Illuminate the Space of Agent Abilities ( Invited Talk ) | Sam Earle |
Fri 6:00 a.m. - 6:30 a.m.
|
General Infomax Agents through World Models ( Invited Talk ) | Danijar Hafner |
Fri 6:30 a.m. - 7:00 a.m.
|
Virtual Coffee Break
A brief break before the first panel discussion of the day. Feel free to head to Gather to hang out with other attendees during this time. |
Fri 7:00 a.m. - 8:00 a.m.
|
Live Panel Discussion #1 ( Discussion Panel )
Live panel discussion with Lisa Soros, Sam Earle, Sebastian Risi, Tim Rocktäschel, and Wojciech Czarnecki. |
Fri 8:00 a.m. - 8:15 a.m.
|
Virtual Coffee Break
Join us in Gather over a quick cup of joe. |
Fri 8:15 a.m. - 8:30 a.m.
|
Accelerated Quality-Diversity for Robotics through Massive Parallelism ( Spotlight, Poster )
Quality-Diversity (QD) algorithms are a well-known approach to generating large collections of diverse and high-quality policies. However, QD algorithms are also known to be data-inefficient, requiring large amounts of computational resources, and are slow when used in practice for robotics tasks. Policy evaluations are already commonly performed in parallel to speed up QD algorithms, but parallelism on a single machine is limited as most physics simulators run on CPUs. With recent advances in simulators that run on accelerators, thousands of evaluations can be performed in parallel on a single GPU/TPU. In this paper, we present QDax, an implementation of MAP-Elites which leverages massive parallelism on accelerators to make QD algorithms more accessible. We first demonstrate the improvement in the number of evaluations per second that parallelism using accelerated simulators can offer. More importantly, we show that QD algorithms are ideal candidates for scaling with massive parallelism and can be run at interactive timescales. The increase in parallelism does not significantly affect the performance of QD algorithms, while reducing experiment runtimes by two orders of magnitude, turning days of computation into minutes. These results show that QD can now benefit from hardware acceleration, which has contributed significantly to the rise of deep learning. |
Bryan Lim · Maxime Allard · Luca Grillotti · Antoine Cully |
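A minimal NumPy sketch of the batched-evaluation idea behind QDax-style MAP-Elites, assuming a toy vectorized fitness/descriptor function in place of an accelerated physics simulator (archive size and mutation scale are illustrative):

```python
# Minimal sketch (not QDax itself): MAP-Elites where a whole batch of candidate
# solutions is evaluated in one vectorized call, mimicking accelerator parallelism.
import numpy as np

rng = np.random.default_rng(0)
GRID, DIM, BATCH = 32, 8, 256          # archive resolution, genotype size, parallel evals

def evaluate_batch(x):
    # Toy stand-in for a vectorized simulator: fitness and a 2-D behaviour descriptor.
    fitness = -np.sum(x ** 2, axis=1)
    descriptor = np.clip((np.tanh(x[:, :2]) + 1) / 2, 0, 1 - 1e-9)
    return fitness, descriptor

archive_fit = np.full((GRID, GRID), -np.inf)
archive_sol = np.zeros((GRID, GRID, DIM))

for iteration in range(200):
    # Select parents from filled cells (or random genotypes early on) and mutate.
    filled = np.argwhere(np.isfinite(archive_fit))
    if len(filled) == 0:
        parents = rng.normal(size=(BATCH, DIM))
    else:
        idx = filled[rng.integers(len(filled), size=BATCH)]
        parents = archive_sol[idx[:, 0], idx[:, 1]]
    offspring = parents + 0.1 * rng.normal(size=parents.shape)

    fitness, descriptor = evaluate_batch(offspring)       # one parallel call
    cells = (descriptor * GRID).astype(int)
    for (cx, cy), f, sol in zip(cells, fitness, offspring):
        if f > archive_fit[cx, cy]:                        # keep the best elite per cell
            archive_fit[cx, cy], archive_sol[cx, cy] = f, sol

print("filled cells:", np.isfinite(archive_fit).sum(), "best fitness:", archive_fit.max())
```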
Fri 8:30 a.m. - 8:45 a.m.
|
DSA-ME: Deep Surrogate Assisted MAP-Elites ( Spotlight, Poster )
We study the problem of efficiently generating high-quality and diverse content in games. Previous work on automated deckbuilding in Hearthstone shows that the quality diversity algorithm MAP-Elites can generate a collection of high-performing decks with diverse strategic gameplay. However, MAP-Elites requires a large number of expensive evaluations to discover a diverse collection of decks. We propose assisting MAP-Elites with a deep surrogate model trained online to predict game outcomes with respect to candidate decks. MAP-Elites discovers a diverse dataset to improve the surrogate model's accuracy, while the surrogate model helps guide MAP-Elites towards promising new content. In a Hearthstone deckbuilding case study, we show that our approach improves the sample efficiency of MAP-Elites and outperforms a model trained offline with random decks, as well as a linear surrogate model baseline, setting a new state of the art for quality diversity approaches in automated Hearthstone deckbuilding. We include the source code for all the experiments as supplemental material. |
Yulun Zhang · Matthew Fontaine · Amy Hoover · Stefanos Nikolaidis |
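A minimal sketch of the surrogate pre-screening loop described above, with a linear least-squares surrogate and a toy stand-in for the expensive game evaluation; it omits the MAP-Elites archive and is not the paper's implementation:

```python
# Minimal sketch: a cheap surrogate model pre-screens candidate solutions so that
# only promising ones reach the expensive evaluation.
import numpy as np

rng = np.random.default_rng(1)
DIM = 16

def expensive_eval(x):                 # stand-in for a full Hearthstone game simulation
    return -np.sum((x - 0.5) ** 2, axis=1)

X = rng.normal(size=(32, DIM))         # archive of already-evaluated solutions
y = expensive_eval(X)

for outer in range(20):
    # 1) Fit the surrogate (linear least squares) on all real evaluations so far.
    w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)

    # 2) Generate many candidates and rank them with the surrogate only.
    candidates = X[rng.integers(len(X), size=256)] + 0.2 * rng.normal(size=(256, DIM))
    surrogate_scores = np.c_[candidates, np.ones(256)] @ w

    # 3) Spend the expensive evaluation budget only on the surrogate's top picks,
    #    which also grows the dataset the surrogate is trained on.
    top = candidates[np.argsort(surrogate_scores)[-8:]]
    X = np.vstack([X, top])
    y = np.concatenate([y, expensive_eval(top)])

print("expensive evaluations used:", len(X), "best found:", y.max())
```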
Fri 8:45 a.m. - 9:00 a.m.
|
Differentiable Quality Diversity for Reinforcement Learning by Approximating Gradients ( Spotlight, Poster )
Consider a walking agent that must adapt to damage. To approach this task, we can train a collection of policies and have the agent select a suitable policy when damaged. Training this collection may be viewed as a quality diversity (QD) optimization problem, where we search for solutions (policies) which maximize an objective (walking forward) while spanning a set of measures (measurable characteristics). Recent work shows that differentiable quality diversity (DQD) algorithms greatly accelerate QD optimization when exact gradients are available for the objective and measures. However, such gradients are typically unavailable in RL settings due to non-differentiable environments. To apply DQD in RL settings, we propose to approximate objective and measure gradients with evolution strategies and actor-critic methods. We develop two variants of the DQD algorithm CMA-MEGA, each with different gradient approximations, and evaluate them on four simulated walking tasks. One variant achieves comparable performance (QD score) with the state-of-the-art PGA-MAP-Elites in two tasks. The other variant performs comparably in all tasks but is less efficient than PGA-MAP-Elites in two tasks. These results provide insight into the limitations of CMA-MEGA in domains that require rigorous optimization of the objective and where exact gradients are unavailable. |
Bryon Tjanaka · Matthew Fontaine · Julian Togelius · Stefanos Nikolaidis |
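A minimal sketch of one of the gradient approximations mentioned above: an antithetic evolution-strategies estimate of a black-box objective gradient (the objective and hyperparameters are illustrative, not the paper's setup):

```python
# Minimal sketch: antithetic evolution-strategies gradient estimate, the kind of
# approximation DQD methods can substitute for exact objective/measure gradients.
import numpy as np

rng = np.random.default_rng(2)

def objective(theta):                      # non-differentiable black box
    return -np.sum(np.abs(theta - 1.0))

def es_gradient(f, theta, sigma=0.1, n=64):
    eps = rng.normal(size=(n, theta.size))
    plus = np.array([f(theta + sigma * e) for e in eps])
    minus = np.array([f(theta - sigma * e) for e in eps])
    # Average of (f+ - f-) * noise gives an unbiased smoothed-gradient estimate.
    return ((plus - minus)[:, None] * eps).mean(axis=0) / (2 * sigma)

theta = np.zeros(8)
for step in range(200):
    theta += 0.05 * es_gradient(objective, theta)
print("objective after ES ascent:", objective(theta))
```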
Fri 9:00 a.m. - 9:30 a.m.
|
Beyond Agents and Environments? ( Invited Talk ) | Julian Togelius |
Fri 9:30 a.m. - 10:00 a.m.
|
Emergent Complexity and Zero-Shot Transfer with PAIRED ( Invited Talk ) | Natasha Jaques |
Fri 10:00 a.m. - 10:15 a.m.
|
Virtual Coffee Break
Join us in Gather over a quick cup of joe. |
Fri 10:15 a.m. - 10:45 a.m.
|
Open Questions in Creating Safe Open-Ended AI: Tensions Between Control and Creativity ( Invited Talk ) | Joel Lehman |
Fri 10:45 a.m. - 11:15 a.m.
|
Why Open-Endedness Matters to Machine Learning ( Invited Talk ) | Kenneth Stanley |
Fri 11:15 a.m. - 12:15 p.m.
|
Live Panel Discussion #2 ( Discussion Panel )
Live panel discussion with Danijar Hafner, Joel Lehman, Julian Togelius, Kenneth Stanley, and Natasha Jaques. |
Fri 12:15 p.m. - 12:25 p.m.
|
Closing Remarks ( Conclusion ) |
Fri 12:25 p.m. - 1:15 p.m.
|
Virtual Poster Session #2 ( Poster Session ) |
-
|
Meta-World Conditional Neural Processes ( Poster )
We propose Meta-World Conditional Neural Process (MW-CNP), a conditional world model generator that leverages the sample efficiency and scalability of the Conditional Neural Processes architecture to allow an agent to sample from the generated world model. We intend to reduce the agent's interaction with the target environment as much as possible. Thus, we design a model-based meta-RL framework where the RL agent can be conditioned on significantly fewer samples collected from the target environment to imagine the unseen environment. We emphasize that the agent does not have access to the task parameters throughout training and testing. |
Suzan Ece Ada · Emre Ugur |
-
|
A little taxonomy of open-endedness ( Poster )
This paper aims to provide a partial taxonomy of the ways that the term open-endedness is used in Artificial Life, the ways that open-endedness is used outside of Artificial Life, and the ways open-endedness is referred to via other terms. The definitions of open-endedness fall into either conceptual or measuring categories. The conceptual categories are rich, while the measuring categories---and the actual measurements---tend to be simplistic. Related concepts that we describe include meta-learning, compositionality (in language), and creativity. |
Asiiah Song |
-
|
Dojo: A Large Scale Benchmark for Multi-Task Reinforcement Learning ( Poster )
We introduce Dojo, a reinforcement learning environment intended as a benchmark for evaluating RL agents' capabilities in the areas of multi-task learning, generalization, transfer learning, and curriculum learning. In this work, we motivate our benchmark, compare it to existing methods, and empirically demonstrate its suitability for the purpose of studying cross-task generalization. We establish a multi-task baseline across the whole benchmark as a reference for future research and discuss the achieved results and encountered issues. Finally, we provide experimental protocols and evaluation procedures to ensure that results are comparable across experiments. We also supply tools allowing researchers to easily understand their agents' performance across a wide variety of metrics. |
Dominik Schmidt |
-
|
Streaming Inference for Infinite Non-Stationary Clustering ( Poster )
Learning from a continuous stream of non-stationary data in an unsupervised manner is arguably one of the most common and most challenging settings facing intelligent agents. Here, we attack learning under all three conditions (unsupervised, streaming, non-stationary) in the context of clustering, also known as mixture modeling. We introduce a novel clustering algorithm that endows mixture models with the ability to create new clusters online, as demanded by the data, in a probabilistic and principled manner. To achieve this, we first define a novel stochastic process called the Dynamical Chinese Restaurant Process (Dynamical CRP), which is a non-exchangeable distribution over partitions of a set; then, we show that the Dynamical CRP provides a non-stationary prior over cluster assignments and yields an efficient streaming variational inference algorithm. We conclude with preliminary experiments showing that the Dynamical CRP can be applied to diverse data. |
Rylan Schaeffer · Gabrielle Liu · Yilun Du · Scott Linderman · Ila Fiete |
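For intuition, a minimal sketch of streaming assignment under a standard (exchangeable) Chinese Restaurant Process prior with Gaussian likelihoods; the paper's Dynamical CRP replaces the count-based prior with a time-dependent one and uses variational rather than hard assignments:

```python
# Minimal sketch: streaming hard assignment under a *standard* CRP prior, spawning
# new clusters online when no existing cluster explains the incoming point well.
import numpy as np

rng = np.random.default_rng(3)
ALPHA, VAR = 0.1, 0.5                      # concentration, observation variance

means, counts = [], []                     # online cluster summaries
stream = np.concatenate([rng.normal(0, 0.1, 50), rng.normal(5, 0.1, 50)])

for x in stream:
    # Prior: proportional to cluster size for existing clusters, ALPHA for a new one.
    prior = np.array(counts + [ALPHA], dtype=float)
    centers = np.array(means + [x])        # a new cluster would sit at the point itself
    likelihood = np.exp(-0.5 * (x - centers) ** 2 / VAR)
    k = int(np.argmax(prior * likelihood)) # hard MAP assignment (soft in the paper)
    if k == len(means):                    # spawn a new cluster online
        means.append(float(x)); counts.append(1)
    else:                                  # update running mean of existing cluster
        counts[k] += 1
        means[k] += (x - means[k]) / counts[k]

print("clusters found:", len(means), "means:", np.round(means, 2))
```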
-
|
Ensemble Learning as a Peer Process ( Poster )
Ensemble learning, in its simplest form, entails the training of multiple models with the same training set. In a standard supervised setting, the training set can be viewed as a 'teacher' with an unbounded capacity of interactions with a single group of 'trainee' models. One can then ask the following broad question: how can we train an ensemble if the teacher has a bounded capacity of interactions with the trainees? Towards answering this question we consider how humans learn in peer groups. The problem of how to group individuals in order to maximize outcomes via cooperative learning has been debated for a long time by social scientists and policymakers. More recently, it has attracted research attention from an algorithmic standpoint, which led to the design of grouping policies that appear to result in better aggregate learning in experiments with human subjects. Inspired by human peer learning, we hypothesize that using partially trained models as teachers to other less accurate models, i.e., viewing ensemble learning as a peer process, can provide a solution to our central question. We further hypothesize that grouping policies that match trainer models with learner models play a significant role in the overall learning outcome of the ensemble. We present a formalization and, through extensive experiments with different types of classifiers, demonstrate that: (i) an ensemble can reach surprising levels of performance with little interaction with the training set, and (ii) grouping policies definitely have an impact on the ensemble performance, in agreement with previous intuition and observations in human peer learning. |
Ehsan Beikihassan · Ali Parviz · Amy Hoover · Ioannis Koutis |
-
|
Watts: Infrastructure for Open-Ended Learning ( Poster )
This paper proposes a framework called Watts for implementing, comparing, and recombining open-ended learning (OEL) algorithms. Motivated by modularity and algorithmic flexibility, Watts atomizes the components of OEL systems to promote the study of and direct comparisons between approaches. Examining implementations of three OEL algorithms, the paper introduces the modules of the framework. The hope is for Watts to enable benchmarking and the exploration of new types of OEL algorithms. An anonymized repo is available at https://anonymous.4open.science/r/watts-011B/README.md |
Aaron Dharna · Charlie Summers · Rohin Dasari · Julian Togelius · Amy Hoover |
-
|
Generalization Games for Reinforcement Learning ( Poster )
Many subfields have emerged in reinforcement learning (RL) to understand how distributions of training tasks affect an RL agent's ability to transfer learned experiences to one or more evaluation tasks. While the field is extensive and ever-growing, recent research has underlined that the variability among the different methods is not as significant as it may appear. We leverage this intuition to demonstrate how current methods for generalization in RL are specializations of a general framework. We obtain the fundamental aspects of this formulation by rebuilding a Markov Decision Process (MDP) from the ground up, resurfacing the game-theoretic framework of games against nature. The two-player game that arises from considering nature as a complete player in this formulation explains how existing approaches rely on learned and randomized dynamics and initial state distributions. We develop this result further by drawing inspiration from mechanism design theory to introduce the role of a principal as a third player that can modify the payoff functions of the decision-making agent and nature. The main contribution of our work is the complete description of the Generalization Games for Reinforcement Learning, a multiagent, multiplayer, game-theoretic formal approach to studying generalization methods in RL. The games induced by playing against the principal extend our framework to explain how learned and randomized reward functions induce generalization in RL agents. We offer a preliminary ablation experiment of the different components of the framework and demonstrate that a more simplified composition of the objectives that we introduce for each player leads to comparable, and in some cases superior, zero-shot generalization performance compared with state-of-the-art methods, while requiring almost two orders of magnitude fewer samples. |
Manfred Diaz · Charlie Gauthier · Glen Berseth · Liam Paull |
-
|
A Study of Off-Policy Learning in Environments with Procedural Content Generation ( Poster )
Environments with procedural content generation (PCG environments) are useful for assessing the generalization capacity of Reinforcement Learning (RL) agents. A growing body of work focuses on generalization in RL in PCG environments, with many methods being built on top of on-policy algorithms. On the other hand, off-policy methods have received less attention. Motivated by this discrepancy, we examine how Deep Q Networks (Mnih et al., 2013) perform on the Procgen benchmark (Cobbe et al., 2020), and look at the impact of various additions to DQN on performance. We find that some popular techniques that have improved DQN on benchmarks like the Arcade Learning Environment (ALE; Bellemare et al., 2015) do not carry over to Procgen, implying that some research has overfit to tasks that lack diversity and has failed to consider the importance of generalization. |
Andrew Ehrenberg · Robert Kirk · Minqi Jiang · Edward Grefenstette · Tim Rocktaeschel |
-
|
SkillHack: A Benchmark for Skill Transfer in Open-Ended Reinforcement Learning ( Poster )
Practising and honing skills forms a fundamental component of how humans learn, yet artificial agents are rarely specifically trained to perform them. Instead, they are usually trained end-to-end, with the hope that useful skills will be implicitly learned in order to maximise the discounted return of some extrinsic reward function. In this paper, we investigate how skills can be incorporated into the training of reinforcement learning (RL) agents in complex environments with large state-action spaces and sparse rewards. To this end, we created SkillHack, a benchmark of tasks and associated skills based on the game of NetHack. We evaluate a number of baselines on this benchmark, as well as our own novel skill-based method, Hierarchical Kickstarting (HKS), which is shown to outperform all other evaluated methods. Our experiments show that learning with prior knowledge of useful skills can significantly improve the performance of agents on complex problems. We ultimately argue that utilising predefined skills provides a useful inductive bias for RL problems, especially those with large state-action spaces and sparse rewards. |
Michael Matthews · Mikayel Samvelyan · Jack Parker-Holder · Edward Grefenstette · Tim Rocktaeschel |
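A minimal sketch of a kickstarting-style auxiliary loss, the general mechanism that Hierarchical Kickstarting builds on (this is not the paper's HKS implementation; shapes and the weighting are illustrative):

```python
# Minimal sketch: the student's RL loss is augmented with a KL term pulling its
# policy towards a pretrained skill ("teacher") policy.
import torch
import torch.nn.functional as F

def kickstarted_loss(student_logits, teacher_logits, rl_loss, kick_weight=1.0):
    # KL(teacher || student) per state, averaged over the batch.
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    kl = F.kl_div(student_logp, teacher_logp, log_target=True, reduction="batchmean")
    return rl_loss + kick_weight * kl

# Dummy batch of 32 states with a 6-action policy head.
student_logits = torch.randn(32, 6, requires_grad=True)
teacher_logits = torch.randn(32, 6)
loss = kickstarted_loss(student_logits, teacher_logits, rl_loss=torch.tensor(0.5))
loss.backward()
print("total loss:", float(loss))
```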
-
|
Meta-Gradients in Non-Stationary Environments ( Poster )
Meta-gradient methods (Xu et al., 2018; Zahavy et al., 2020) offer a promising solution to the problem of hyperparameter selection and adaptation in non-stationary reinforcement learning problems. However, the properties of meta-gradients in such environments have not been systematically studied. In this work, we bring new clarity to meta-gradients in non-stationary environments. Concretely, we ask: (i) how much information should be given to the learned optimizers, so as to enable faster adaptation and generalization over a lifetime, (ii) what meta-optimizer functions are learned in this process, and (iii) whether meta-gradient methods provide a bigger advantage in highly non-stationary environments. To study the effect of information provided to the meta-optimizer, as in recent works (Flennerhag et al., 2021; Almeida et al., 2021), we replace the tuned meta-parameters of fixed update rules with learned meta-parameter functions of selected context features. The context features carry information about agent performance and changes in the environment and hence can inform learned meta-parameter schedules. We find that adding more contextual information is generally beneficial, leading to faster adaptation of meta-parameter values and increased performance over a lifetime. We support these results with a qualitative analysis of the resulting meta-parameter schedules and learned functions of context features. Lastly, we find that without context, meta-gradients do not provide a consistent advantage over the baseline in highly non-stationary environments. By adding contextual information, we are able to obtain significant improvements even in highly non-stationary environments. |
Jelena Luketina · Sebastian Flennerhag · Yannick Schroecker · David Abel · Tom Zahavy |
-
|
Bayesian Generational Population-Based Training ( Poster )
Reinforcement learning (RL) offers the potential for training generally capable agents that can interact autonomously in the real world. However, one key limitation is the brittleness of RL algorithms to core hyperparameters and network architecture choice. Furthermore, non-stationarities such as evolving training data and increased agent complexity mean that different hyperparameters and architectures may be optimal at different points of training. This motivates AutoRL, a class of methods seeking to automate these design choices. One prominent class of AutoRL methods is Population-Based Training (PBT), which has led to impressive performance in several large scale settings. In this paper, we introduce two new innovations in PBT-style methods. First, we employ trust-region based Bayesian Optimization, enabling full coverage of the high-dimensional mixed hyperparameter search space. Second, we show that using a generational approach, we can also learn both architectures and hyperparameters jointly on-the-fly in a single training run. Leveraging the new highly parallelizable Brax physics engine, we show that these innovations lead to dramatic performance gains, significantly outperforming the tuned baseline while learning entire configurations on the fly. |
Xingchen Wan · Cong Lu · Jack Parker-Holder · Philip Ball · Vu Nguyen · Binxin Ru · Michael Osborne |
-
|
Open-Ended Reinforcement Learning with Neural Reward Functions ( Poster )
Inspired by the great success of unsupervised learning in Computer Vision and Natural Language Processing, the Reinforcement Learning community has recently started to focus more on unsupervised discovery of skills. Most current approaches, like DIAYN or DADS, optimize some form of mutual information objective. We propose a different approach that uses reward functions encoded by neural networks. These are trained iteratively to reward more complex behavior. In high-dimensional robotic environments our approach learns a wide range of interesting skills including front-flips for Half-Cheetah and one-legged running for Humanoid. In the pixel-based Montezuma’s Revenge environment our method also works with minimal changes and it learns complex skills that involve interacting with items and visiting diverse locations. |
Robert Meier · Asier Mujika |
-
|
Diversity Policy Gradient for Sample Efficient Quality-Diversity Optimization ( Poster )
A fascinating aspect of nature lies in its ability to produce a large and diverse collection of high-performing organisms in an open-ended way. By contrast, most AI algorithms seek convergence and focus on finding a single efficient solution to a given problem. Aiming for diversity through divergent search, in addition to performance, is a convenient way to deal with the exploration-exploitation trade-off that plays a central role in learning. It also allows for increased robustness when the returned collection contains several working solutions to the considered problem, making it well-suited for real applications such as robotics. Quality-Diversity (QD) methods are evolutionary algorithms designed for this purpose. This paper proposes a novel algorithm, QD-PG, which combines the strength of Policy Gradient algorithms and Quality Diversity approaches to produce a collection of diverse and high-performing neural policies in continuous control environments. The main contribution of this work is the introduction of a Diversity Policy Gradient (DPG) that drives policies towards more diversity in a sample-efficient and open-ended manner. Specifically, QD-PG selects neural controllers from a MAP-ELITES grid and uses two gradient-based mutation operators to improve both quality and diversity. Our results demonstrate that QD-PG is significantly more sample-efficient than its evolutionary competitors. |
Thomas PIERROT · Valentin Macé · Felix Chalumeau · Arthur Flajolet · Geoffrey Cideron · Karim Beguir · Antoine Cully · Olivier Sigaud · Nicolas Perrin-Gilbert |
-
|
Discovering Unsupervised Behaviours from Full State Trajectories ( Poster )
Improving open-ended learning capabilities is a promising approach to enable robots to face the unbounded complexity of the real world. Among existing methods, the ability of Quality-Diversity algorithms to generate large collections of diverse and high-performing skills is instrumental in this context. However, most of those algorithms rely on a hand-coded behavioural descriptor to characterise the diversity, hence requiring prior knowledge about the considered tasks. In this work, we propose an additional analysis of Autonomous Robots Realising their Abilities, a Quality-Diversity algorithm that autonomously finds behavioural characterisations. We evaluate our approach on a simulated robotic environment, where the robot has to autonomously discover its abilities from its full state trajectories. All algorithms were applied to three tasks: navigation, moving forward with a high velocity, and performing half-rolls. The experimental results show that the algorithm under study autonomously discovers collections of solutions that are diverse with respect to all tasks. More specifically, our approach autonomously finds policies that make the robot move to diverse positions, but also utilise its legs in diverse ways, and even perform half-rolls. |
Luca Grillotti · Antoine Cully |
-
|
Specialization and Exchange in Neural MMO ( Poster )
We present a simulated profession and exchange system for use in multi-agent intelligence research. Each of the eight implemented jobs produces items required by other professions. As a result, each profession must purchase items that they cannot produce themselves from other professions. These items are then used to produce increasingly high-quality goods for resale on a global market. Better and better goods enter the market as trade among professions creates a feedback loop. We integrate our profession and exchange system with Neural MMO, an existing multi-agent reinforcement learning platform capable of efficiently simulating populations of tens to 1000+ agents. We hope that our work will help support new research on emergent specialization --- the ability to select and commit to a specific long-term strategy that fills a niche left by other learning agents. All of our code, including scripted baseline agents for each profession, will be free, open-source, and actively maintained. |
Joseph Suarez · Phillip Isola |
-
|
Subjective Learning for Conflicting Data ( Poster )
Conventional supervised learning typically assumes that the learning task can be solved by approximating a single target function. However, this assumption is often invalid in open-ended environments where no manual task-level data partitioning is available. In this paper, we investigate a more general setting where training data is sampled from multiple domains, while the data in each domain conforms to a domain-specific target function. When different domains possess distinct target functions, the training data exhibits inherent "conflict", thus rendering single-model training problematic. To address this issue, we propose a framework termed subjective learning, whose key component is a subjective function that automatically allocates the data among multiple candidate models to resolve the conflict in multi-domain data, and we draw an intriguing connection between subjective learning and a variant of Expectation-Maximization. We present theoretical analysis of the learnability and the generalization error of our approach, and empirically show its efficacy and potential applications in a range of regression and classification tasks with synthetic data. |
Tianren Zhang · Yizhou Jiang · Xin Su · Shangqi Guo · Chongkai Gao · Feng Chen |
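A minimal EM-flavoured sketch of the data-allocation idea: each point is routed to the candidate model that currently explains it best, then each model is refit on its own points (the paper instead learns a subjective function to perform this allocation):

```python
# Minimal sketch: resolving "conflicting" multi-domain data by alternating between
# routing points to candidate models and refitting each model on its own points.
import numpy as np

rng = np.random.default_rng(4)

# Conflicting data: two domains give *different* targets for the same inputs.
x = rng.uniform(-1, 1, size=400)
y = np.where(rng.random(400) < 0.5, 2.0 * x, -3.0 * x) + 0.05 * rng.normal(size=400)

K = 2
slopes = rng.normal(size=K)                              # K candidate linear models
for _ in range(20):
    errors = np.stack([(y - w * x) ** 2 for w in slopes])  # (K, N) per-model error
    assign = np.argmin(errors, axis=0)                      # route each point
    for k in range(K):
        mask = assign == k
        if mask.any():                                       # refit model k on its points
            slopes[k] = np.sum(x[mask] * y[mask]) / np.sum(x[mask] ** 2)

print("recovered slopes:", np.round(slopes, 2))  # typically close to 2.0 and -3.0
```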
-
|
Zero-Shot Reward Specification via Grounded Natural Language ( Poster )
Reward signals in reinforcement learning are expensive to design and often require access to the true state, which is not available in the real world. Common alternatives are demonstrations or goal images, which can be labor-intensive to collect. On the other hand, text descriptions provide a general, natural, and low-effort way of communicating the desired task. However, prior work on learning text-conditioned policies still relies on rewards that are defined using either true state or labeled expert demonstrations. We use recent developments in building large-scale visuolanguage models like CLIP to devise a framework that generates the task reward signal just from a goal text description and raw pixel observations, which is then used to learn the task policy. We evaluate the proposed framework on control and robotic manipulation tasks. Finally, we distill the individual task policies into a single goal-text-conditioned policy that can generalize in a zero-shot manner to new tasks with unseen objects and unseen goal text descriptions. |
Parsa Mahmoudieh · Deepak Pathak · Trevor Darrell |
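A minimal sketch of a text-conditioned reward, assuming the open-source openai/CLIP package interface; the goal description and image size are illustrative, and this is not the authors' exact pipeline:

```python
# Minimal sketch: the reward for a frame is its embedding similarity to a goal text.
import numpy as np
import torch
import clip                          # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical goal description for illustration only.
goal_text = clip.tokenize(["a robot arm pushing the red block to the left"]).to(device)

def text_conditioned_reward(frame_rgb: np.ndarray) -> float:
    """Cosine similarity between the current frame and the goal description."""
    image = preprocess(Image.fromarray(frame_rgb)).unsqueeze(0).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(goal_text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())

# Dummy frame standing in for a raw pixel observation from the environment.
frame = (np.random.rand(224, 224, 3) * 255).astype(np.uint8)
print("reward:", text_conditioned_reward(frame))
```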
-
|
Learning Object-Centered Autotelic Behaviors with Graph Neural Networks ( Poster )
Although humans live in an open-ended world with endless challenges, they do not have to learn from scratch whenever they encounter a new task. Rather, they have access to a handful of previously learned skills, which they rapidly adapt to new situations. In artificial intelligence, autotelic agents—which are intrinsically motivated to represent and set their own goals—exhibit promising skill transfer capabilities. However, their learning capabilities are highly constrained by their policy and goal space representations. In this paper, we propose to investigate the impact of these representations. We study different implementations of autotelic agents using four types of Graph Neural Network policy representations and two types of goal spaces, either geometric or predicate-based. We show that combining object-centered architectures that are expressive enough with semantic relational goals enables an efficient transfer between skills and promotes behavioral diversity. We also release our graph-based implementations to encourage further research in this direction. |
Ahmed Akakzia · Olivier Sigaud |
-
|
Learning to Walk Autonomously via Reset-Free Quality-Diversity ( Poster )
Quality-Diversity (QD) algorithms can discover large and complex behavioural repertoires consisting of both diverse and high-performing skills. However, the generation of behavioural repertoires has mainly been limited to simulation environments instead of real-world learning. This is because existing QD algorithms need large numbers of evaluations as well as episodic resets, which require manual human supervision and interventions. This paper proposes Reset-Free Quality-Diversity optimization (RF-QD) as a step towards autonomous learning for robotics in open-ended environments. We build on Dynamics-Aware Quality-Diversity (DA-QD) and introduce a behaviour selection policy that leverages the diversity of the imagined repertoire and environmental information to intelligently select behaviours that can act as automatic resets. We demonstrate this through a task of learning to walk within defined training zones with obstacles. Our experiments show that we can learn full repertoires of legged locomotion controllers autonomously, without manual resets, with high sample efficiency in spite of harsh safety constraints. Finally, using an ablation of different target objectives, we show that it is important for RF-QD's behaviour selection policy to have diverse types of solutions available, rather than only solutions optimised for a specific objective. |
Bryan Lim · Alexander Reichenbach · Antoine Cully |
-
|
Agent, do you see it now? Systematic generalisation in deep reinforcement learning ( Poster )
Systematic generalisation, i.e., the algebraic capacity to understand and execute unseen tasks by combining already known primitives, is one of the most desirable features for a computational model. Good adaptation to novel tasks in open-ended settings relies heavily on the ability of agents to reuse their past experience and recombine meaningful learning pieces to tackle new goals. In this work, we analyse how the architecture of convolutional layers impacts the performance of autonomous agents when generalising to zero-shot, unseen tasks while executing human instructions. Our findings suggest that a convolutional architecture correctly suited to the environment the agent will interact with may be of greater importance than having a generic convolutional network trained in the given environment. |
Borja G. Leon · Murray Shanahan · Francesco Belardinelli |
-
|
Mixture-of-Variational-Experts for Continual Learning ( Poster )
One weakness of machine learning algorithms is the poor ability of models to solve new problems without forgetting previously acquired knowledge. The Continual Learning (CL) paradigm has emerged as a protocol to systematically investigate settings where the model sequentially observes samples generated by a series of tasks. In this work, we take a task-agnostic view of continual learning and develop a hierarchical information-theoretic optimality principle that facilitates a trade-off between learning and forgetting. We discuss this principle from a Bayesian perspective and show its connections to previous approaches to CL. Based on this principle, we propose a neural network layer, called the Mixture-of-Variational-Experts layer, that alleviates forgetting by creating a set of information processing paths through the network which is governed by a gating policy. Due to the general formulation based on generic utility functions, we can apply this optimality principle to a large variety of learning problems, including supervised learning, reinforcement learning, and generative modeling. We demonstrate the competitive performance of our method in continual supervised learning and in continual reinforcement learning. |
Heinke Hihn · Daniel Braun |
-
|
Backdoors Stuck At The Frontdoor: Multi-Agent Backdoor Attacks That Backfire ( Poster )
Malicious agents in collaborative learning and outsourced data collection threaten the training of clean models. Backdoor attacks, where an attacker poisons a model during training to achieve targeted misclassification, are a major concern for train-time robustness. In this paper, we investigate a multi-agent backdoor attack scenario, where multiple attackers attempt to backdoor a victim model simultaneously. A consistent backfiring phenomenon is observed across a wide range of games, where agents suffer from a low collective attack success rate. We examine different backdoor attack configurations (non-cooperation / cooperation, joint distribution shifts, and game setups) and find an equilibrium attack success rate at the lower bound. The results motivate the re-evaluation of backdoor defense research for practical environments. |
Siddhartha Datta · Nigel Shadbolt |
-
|
When to Go, and When to Explore: The Benefit of Post-Exploration in Intrinsic Motivation ( Poster )
Go-Explore achieved breakthrough performance on challenging reinforcement learning (RL) tasks with sparse rewards. The key insight of Go-Explore was that successful exploration requires an agent to first return to an interesting state ('Go'), and only then explore into unknown terrain ('Explore'). We refer to such exploration after a goal is reached as 'post-exploration'. In this paper we present a systematic study of post-exploration, answering open questions that the Go-Explore paper has not yet answered. First, we study the isolated potential of post-exploration, by turning it on and off within the same algorithm. Subsequently, we introduce new methodology to adaptively decide when to post-explore and for how long to post-explore. Experiments on a range of MiniGrid environments show that post-exploration indeed boosts performance (with a bigger impact than tuning regular exploration parameters), and this effect is further enhanced by adaptively deciding when and for how long to post-explore. In short, our work identifies adaptive post-exploration as a promising direction for RL exploration research. |
Zhao Yang · Thomas Moerland · Mike Preuss · Aske Plaat |
-
|
On Credit Assignment in Hierarchical Reinforcement Learning ( Poster )
Hierarchical Reinforcement Learning (HRL) has held longstanding promise to advance reinforcement learning. Yet, it has remained a considerable challenge to develop practical algorithms that exhibit some of these promises. To improve our fundamental understanding of HRL, we investigate hierarchical credit assignment from the perspective of conventional multistep reinforcement learning. We show how, e.g., a 1-step 'hierarchical backup' can be seen as a conventional multistep backup with $n$ skip connections over time, connecting each subsequent state to the first, independent of the actions in between. Furthermore, we find that generalizing hierarchy to multistep return estimation methods requires us to consider how to partition the environment trace in order to construct backup paths. We leverage these insights to develop a new hierarchical algorithm, Hier$Q_k(\lambda)$, for which we demonstrate that hierarchical credit assignment alone can already boost agent performance (i.e., when eliminating generalization or exploration). Altogether, our work yields fundamental insight into the nature of hierarchical backups and distinguishes this as an additional basis for reinforcement learning research. |
Joery de Vries · Thomas Moerland · Aske Plaat |
-
|
Neuroevolution of Recurrent Architectures on Control Tasks ( Poster )
Modern artificial intelligence systems typically train the parameters of fixed-size deep neural networks using gradient-based optimization techniques. Simple evolutionary algorithms have recently been shown to also be capable of optimizing deep neural network parameters, at times matching the performance of gradient-based techniques, e.g. in reinforcement learning settings. In addition to optimizing network parameters, many evolutionary computation techniques are also capable of progressively constructing network architectures. However, constructing network architectures from elementary evolution rules has not yet been shown to scale to modern reinforcement learning benchmarks. In this paper we therefore propose a new approach in which the architectures of recurrent neural networks dynamically evolve according to a small set of mutation rules. We implement a massively parallel evolutionary algorithm and run experiments on all 19 OpenAI Gym state-based reinforcement learning control tasks. We find that in most cases, dynamic agents match or exceed the performance of gradient-based agents while utilizing orders of magnitude fewer parameters. We believe our work opens avenues for real-life applications where network compactness and autonomous design are of critical importance. We provide our source code, final model checkpoints and full results at github.com/neuroevolution-recurrent-architectures/. |
Maximilien Le Clei · Pierre Bellec |
-
|
Model-Value Inconsistency as a Signal for Epistemic Uncertainty ( Poster )
Using a model of the environment and a value function, an agent can construct many estimates of a state's value, by unrolling the model for different lengths and bootstrapping with its value function. Our key insight is that one can treat this set of value estimates as a type of ensemble, which we call an implicit value ensemble (IVE). Consequently, the discrepancy between these estimates can be used as a proxy for the agent's epistemic uncertainty; we term this signal model-value inconsistency, or self-inconsistency for short. Unlike prior work which estimates uncertainty by training an ensemble of many models and/or value functions, this approach requires only the single model and value function which are already being learned in most model-based reinforcement learning algorithms. We provide empirical evidence in both tabular and function approximation settings from pixels that self-inconsistency is useful (i) as a signal for exploration, (ii) for acting safely under distribution shifts, and (iii) for robustifying value-based planning with a learned model. |
Angelos Filos · Eszter Vertes · Zita Marinho · Gregory Farquhar · Diana Borsa · Abram Friesen · Feryal Behbahani · Tom Schaul · Andre Barreto · Simon Osindero |
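A minimal tabular sketch of the self-inconsistency signal: unroll a (possibly wrong) learned model for several horizons, bootstrap with the value function, and measure the spread of the resulting estimates:

```python
# Minimal sketch on a cyclic 6-state chain: the value function is exact for the true
# dynamics, but the learned model has an error at state 3, so k-step estimates that
# pass through that state disagree (high "self-inconsistency").
import numpy as np

N_STATES, GAMMA = 6, 0.9
true_next = (np.arange(N_STATES) + 1) % N_STATES          # chain 0->1->...->5->0
reward = np.zeros(N_STATES); reward[N_STATES - 1] = 1.0   # reward only in state 5

# Exact values for the true chain: V(s) = gamma^(steps to reach state 5) / (1 - gamma^6).
steps_to_goal = (N_STATES - 1 - np.arange(N_STATES)) % N_STATES
values = GAMMA ** steps_to_goal / (1 - GAMMA ** N_STATES)

learned_next = true_next.copy()
learned_next[3] = 0                                        # model error at state 3

def k_step_estimate(state, k):
    g, s = 0.0, state
    for i in range(k):                                      # unroll the learned model
        g += (GAMMA ** i) * reward[s]
        s = learned_next[s]
    return g + (GAMMA ** k) * values[s]                     # bootstrap with V

for s in range(N_STATES):
    estimates = [k_step_estimate(s, k) for k in range(5)]
    print(f"state {s}: self-inconsistency = {np.std(estimates):.3f}")
```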
-
|
An Empirical Investigation of Mutual Information Skill Learning ( Poster )
Unsupervised skill learning methods are a form of unsupervised pre-training for reinforcement learning (RL) that has the potential to improve the sample efficiency of solving downstream tasks. Prior work has proposed several methods for unsupervised skill discovery based on mutual information (MI) objectives, with different methods varying in how this mutual information is estimated and optimized. This paper studies how different skill learning algorithms and their key design decisions affect the sample efficiency of solving downstream tasks. Our key findings are that the sample efficiency of downstream adaptation with off-policy backbones is better than with their on-policy counterparts, whereas on-policy backbones result in better state coverage. Moreover, regularizing the discriminator gives better results, and careful choice of the mutual information lower bound and discriminator architecture yields significant improvements on downstream tasks. We also show empirically that the representations learned during pre-training correspond to the controllable aspects of the environment. |
Faisal Mohamed · Benjamin Eysenbach · Ruslan Salakhutdinov |
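A minimal sketch of the discriminator-based intrinsic reward that many of the compared MI skill-discovery methods share (DIAYN-style); all data below is dummy data and the architecture is illustrative:

```python
# Minimal sketch: intrinsic reward r(s, z) = log q(z | s) - log p(z), where q is a
# learned skill discriminator and p(z) is a uniform prior over skills.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_SKILLS, STATE_DIM = 4, 8
discriminator = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_SKILLS))
optim = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
log_p_z = -torch.log(torch.tensor(float(N_SKILLS)))        # uniform skill prior

# Dummy batch of visited states and the skill that was active when they were visited.
states = torch.randn(256, STATE_DIM)
skills = torch.randint(0, N_SKILLS, (256,))

# Discriminator update: predict which skill produced each state.
loss = F.cross_entropy(discriminator(states), skills)
optim.zero_grad(); loss.backward(); optim.step()

# Intrinsic reward for the policy: visit states that reveal the active skill.
with torch.no_grad():
    log_q_z = F.log_softmax(discriminator(states), dim=-1)
    intrinsic_reward = log_q_z[torch.arange(256), skills] - log_p_z
print("mean intrinsic reward:", float(intrinsic_reward.mean()))
```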