## Generalizable Policy Learning in the Physical World

### Young Min Kim · Sergey Levine · Ming Lin · Tongzhou Mu · Ashvin Nair · Hao Su

Fri 29 Apr, 8 a.m. PDT

Abstract:

While the study of generalization has played an essential role in many application domains of machine learning (e.g., image recognition and natural language processing), it received far less attention in common frameworks of policy learning (e.g., reinforcement learning and imitation learning) in their early stages, for reasons such as the difficulty of policy optimization and the lack of ready benchmark datasets. Generalization is particularly important when learning policies to interact with the physical world. The spectrum of such policies is broad: the policies can be high-level, such as action plans that concern temporal dependencies and causalities of environment states; or low-level, such as object manipulation skills to transform objects that are rigid, articulated, soft, or even fluid.

In the physical world, an embodied agent can face a number of changing factors, such as physical parameters, action spaces, tasks, visual appearances of the scenes, and geometry and topology of the objects. Many important real-world tasks involve generalizable policy learning, e.g., visual navigation, object manipulation, and autonomous driving. Therefore, learning generalizable policies is crucial to developing intelligent embodied agents in the real world. Though important, the field remains under-explored in any systematic way.

Learning generalizable policies in the physical world requires deep synergistic efforts across the fields of vision, learning, and robotics, and poses many interesting research problems. This workshop is designed to foster progress in generalizable policy learning, with a particular focus on tasks in the physical world, such as visual navigation, object manipulation, and autonomous driving. We envision that the workshop will bring together interdisciplinary researchers from machine learning, computer vision, and robotics to discuss current and future research on this topic.

Timezone: America/Los_Angeles

### Schedule

| Time | Event | Type | Presenter(s) |
| --- | --- | --- | --- |
| Fri 8:00 a.m. - 8:10 a.m. | Introduction and Opening Remarks | Introduction | Hao Su |
| Fri 8:10 a.m. - 8:35 a.m. | Invited Talk (Danica Kragic): Learning for contact rich tasks | Invited Talk | Danica Kragic |
| Fri 8:35 a.m. - 8:40 a.m. | Q&A for Invited Talk (Danica Kragic) | Q&A | Danica Kragic |
| Fri 8:40 a.m. - 9:05 a.m. | Invited Talk (Peter Stone): Grounded Simulation Learning for Sim2Real | Invited Talk | Peter Stone |
| Fri 9:05 a.m. - 9:10 a.m. | Q&A for Invited Talk (Peter Stone) | Q&A | Peter Stone |
| Fri 9:10 a.m. - 9:20 a.m. | Break | | |
| Fri 9:20 a.m. - 10:15 a.m. | Poster Session 1 | Poster Session | |
| Fri 10:15 a.m. - 11:15 a.m. | Panel Discussion | | Young Min Kim · Peter Stone · Nadia Figueroa · Hao Su · Mrinal Kalakrishnan · Xiaolong Wang · Deepak Pathak · Ming Lin · Danfei Xu |
| Fri 11:15 a.m. - 11:23 a.m. | ManiSkill Challenge Winner Presentation (Zhutian Yang & Aidan Curtis) | Contributed Talk | Zhutian Yang |
| Fri 11:23 a.m. - 11:31 a.m. | ManiSkill Challenge Winner Presentation (Fattonny) | Contributed Talk | Kun Wu |
| Fri 11:31 a.m. - 1:00 p.m. | Lunch Break | Break | |
| Fri 1:00 p.m. - 1:10 p.m. | Contributed Talk (Sim-to-Lab-to-Real: Safe RL with Shielding and Generalization Guarantees) | Contributed Talk | Kai-Chieh Hsu |
| Fri 1:10 p.m. - 1:35 p.m. | Invited Talk (Shuran Song): Iterative Residual Policy for Generalizable Dynamic Manipulation of Deformable Objects | Invited Talk | Shuran Song |
| Fri 1:35 p.m. - 1:40 p.m. | Q&A for Invited Talk (Shuran Song) | Q&A | Shuran Song |
| Fri 1:40 p.m. - 2:05 p.m. | Invited Talk (Nadia Figueroa): Towards Safe and Efficient Learning and Control for Physical Human Robot Interaction | Invited Talk | Nadia Figueroa |
| Fri 2:05 p.m. - 2:10 p.m. | Q&A for Invited Talk (Nadia Figueroa) | Q&A | Nadia Figueroa |
| Fri 2:10 p.m. - 2:18 p.m. | ManiSkill Challenge Winner Presentation (EPIC Lab) | Contributed Talk | Weikang Wan |
| Fri 2:18 p.m. - 2:30 p.m. | Break | | |
| Fri 2:30 p.m. - 2:40 p.m. | Contributed Talk (Know Thyself: Transferable Visual Control Policies Through Robot-Awareness) | Contributed Talk | Edward Hu |
| Fri 2:40 p.m. - 3:05 p.m. | Invited Talk (Mrinal Kalakrishnan): Robot Learning & Generalization in the Real World | Invited Talk | Mrinal Kalakrishnan |
| Fri 3:05 p.m. - 3:10 p.m. | Q&A for Invited Talk (Mrinal Kalakrishnan) | Q&A | Mrinal Kalakrishnan |
| Fri 3:10 p.m. - 3:35 p.m. | Invited Talk (Xiaolong Wang): Generalizing Dexterous Manipulation by Learning from Humans | Invited Talk | Xiaolong Wang |
| Fri 3:35 p.m. - 3:40 p.m. | Q&A for Invited Talk (Xiaolong Wang) | Q&A | Xiaolong Wang |
| Fri 3:40 p.m. - 3:48 p.m. | ManiSkill Challenge Winner Presentation (Silver-Bullet-3D) | Contributed Talk | Yingwei Pan |
| Fri 3:48 p.m. - 3:50 p.m. | Break | | |
| Fri 3:50 p.m. - 4:45 p.m. | Poster Session 2 | Poster Session | |
| Fri 4:45 p.m. - 5:30 p.m. | ManiSkill Challenge Award Ceremony | Challenge Award Ceremony | Hao Su · Weikang Wan · Hao Shen · He Wang · Yingwei Pan · Zhutian Yang · Fabian Dubois · Tom Sonoda · Kun Wu · Kangqi Ma · Liu Kun · Jilei Hou · Tongzhou Mu |
| Fri 5:30 p.m. - 6:30 p.m. | Closing Remarks | | |

- PAnDR: Fast Adaptation to New Environments from Offline Experiences via Decoupling Policy and Environment Representations (Poster)

  Deep Reinforcement Learning (DRL) has been a promising solution to many complex decision-making problems. Nevertheless, the notorious weakness in generalization among environments prevents widespread application of DRL agents in real-world scenarios. Although advances have been made recently, most prior works assume sufficient online interaction on training environments, which can be costly in practical cases. To this end, we focus on an offline-training-online-adaptation setting, in which the agent first learns from offline experiences collected in environments with different dynamics and then performs online policy adaptation in environments with new dynamics.
  In this paper, we propose Policy Adaptation with Decoupled Representations (PAnDR) for fast policy adaptation. In the offline training phase, the environment representation and policy representation are learned through contrastive learning and policy recovery, respectively. The representations are further refined by mutual information optimization to make them more decoupled and complete. With the learned representations, a Policy-Dynamics Value Function (PDVF) network is trained to approximate the values for different combinations of policies and environments. In the online adaptation phase, with the environment context inferred from a few experiences collected in new environments, the policy is optimized by gradient ascent with respect to the PDVF. Our experiments show that PAnDR outperforms existing algorithms in several representative policy adaptation problems.

  Authors: Sang Tong · Hongyao Tang · Yi Ma · Jianye HAO · YAN ZHENG · Zhaopeng Meng · Boyan Li · Zhen Wang

- Imitation Learning for Generalizable Self-driving Policy with Sim-to-real Transfer (Poster)

  Imitation Learning uses the demonstrations of an expert to uncover the optimal policy and it is suitable for real-world robotics tasks as well. In this case, however, the training of the agent is carried out in a simulation environment due to safety, economic and time constraints. Later, the agent is applied in the real-life domain using sim-to-real methods. In this paper, we apply Imitation Learning methods that solve a robotics task in a simulated environment and use transfer learning to apply these solutions in the real-world environment. Our task is set in the Duckietown environment, where the robotic agent has to follow the right lane based on the input images of a single forward-facing camera. We present three Imitation Learning and two sim-to-real methods capable of achieving this task. A detailed comparison is provided on these techniques to highlight their advantages and disadvantages.
  Authors: Zoltán Lőrincz · Márton Szemenyei · Robert Moni

- FlexiBiT: Flexible Inference in Sequential Decision Problems via Bidirectional Transformers (Poster)

  Randomly masking sub-portions of sentences has been a very successful approach in training natural language processing models for a variety of tasks. In this work, we observe that the same idea also applies naturally to sequential decision making, where many traditional tasks like behavior cloning, offline RL, inverse dynamics, or planning correspond to different sequence maskings. We introduce the FlexiBiT framework, which makes it possible to flexibly specify models that can be trained on many different sequential decision making tasks. Experimentally, we show that we can train a single FlexiBiT model to perform all tasks with performance similar to or better than specialized models, and that such performance can be further improved by fine-tuning this general model on the task of interest.

  Authors: Micah Carroll · Jessy Lin · Orr Paradise · Raluca Georgescu · Mingfei Sun · David Bignell · Stephanie Milani · Katja Hofmann · Matthew Hausknecht · Anca Dragan · Sam Devlin

- Learning Category-Level Generalizable Object Manipulation Policy via Generative Adversarial Self-Imitation Learning from Demonstrations (Poster)

  Generalizable object manipulation skills are critical for intelligent and multi-functional robots to work in real-world complex scenes. Despite the recent progress in reinforcement learning, it is still very challenging to learn a generalizable manipulation policy that can handle a category of geometrically diverse articulated objects. In this work, we tackle this category-level object manipulation policy learning problem via imitation learning in a task-agnostic manner, where we assume no handcrafted dense rewards but only a terminal reward.
  Given this novel and challenging generalizable policy learning problem, we identify several key issues that can cause previous imitation learning algorithms to fail and hinder generalization to unseen instances. We then propose several general but critical techniques, including generative adversarial self-imitation learning from demonstrations, progressive growing of the discriminator, and instance-balancing for the expert buffer, that accurately pinpoint and tackle these issues and can benefit category-level manipulation policy learning regardless of the task. Our experiments on ManiSkill benchmarks demonstrate a remarkable improvement on all tasks, and our ablation studies further validate the contribution of each proposed technique.

  Authors: Hao Shen · Weikang Wan · He Wang

- A Study of Off-Policy Learning in Environments with Procedural Content Generation (Poster)

  Environments with procedural content generation (PCG environments) are useful for assessing the generalization capacity of Reinforcement Learning (RL) agents. A growing body of work focuses on generalization in RL in PCG environments, with many methods being built on top of on-policy algorithms. On the other hand, off-policy methods have received less attention. Motivated by this discrepancy, we examine how Deep Q Networks (Mnih et al., 2013) perform on the Procgen benchmark (Cobbe et al., 2020), and look at the impact of various additions to DQN on performance. We find that some popular techniques that have improved DQN on benchmarks like the Arcade Learning Environment (ALE; Bellemare et al., 2015) do not carry over to Procgen, implying that some research has overfit to tasks that lack diversity and fails to consider the importance of generalization.
  Authors: Andrew Ehrenberg · Robert Kirk · Minqi Jiang · Edward Grefenstette · Tim Rocktaeschel

- Planning to Practice: Efficient Online Fine-Tuning by Composing Goals in Latent Space (Poster)

  General-purpose robots in real-world settings require diverse repertoires of behaviors to complete challenging tasks in unstructured environments. To address this problem, goal-conditioned reinforcement learning aims to train policies that can reach configurable goals for a wide range of tasks on command. However, such goal-conditioned policies are notoriously difficult and time-consuming to train from scratch. In this paper, we propose Planning to Practice (PTP), a method that makes it practical to train goal-conditioned policies for long-horizon tasks that require multiple distinct types of interactions to solve. Our approach is based on two key ideas. First, we decompose the goal-reaching problem hierarchically, with a high-level planner that sets intermediate subgoals using conditional subgoal generators in the latent space for a low-level model-free policy. Second, we propose a hybrid offline reinforcement learning approach with online fine-tuning, which uses previously collected data to pre-train both the conditional subgoal generator and the policy, and then fine-tunes the policy via online exploration. This fine-tuning process is itself facilitated by the planned subgoals, which break down the original target task into short-horizon goal-reaching tasks that are significantly easier to learn. We conduct experiments in both simulation and the real world, in which the policy is pre-trained on demonstrations of short primitive behaviors and fine-tuned for temporally extended tasks that are unseen in the offline data. Our experimental results show that PTP can generate feasible sequences of subgoals that enable the policy to efficiently solve the target tasks.
  Authors: Kuan Fang · Patrick Yin · Ashvin Nair · Sergey Levine

- Learning Transferable Policies By Inferring Agent Morphology (Poster)

  The prototypical approach to reinforcement learning involves training policies tailored to a particular agent from scratch for every new morphology. Recent work aims to eliminate the re-training of policies by investigating whether a morphology-agnostic policy, trained on a diverse set of agents with similar task objectives, can be transferred to new agents with unseen morphologies without re-training. This is a challenging problem that required previous approaches to use hand-designed descriptions of the new agent's morphology. Instead of hand-designing this description, we propose a data-driven method that learns a representation of morphology directly from the reinforcement learning objective. Ours is the first reinforcement learning algorithm that can train a policy to generalize to new agent morphologies without requiring a description of the agent's morphology in advance. We evaluate our approach on a standard benchmark for agent-agnostic control, and improve over the state of the art in zero-shot generalization. Importantly, our method attains good performance *without* an explicit description of morphology.

  Authors: Brandon Trabucco · Mariano Phielipp · Glen Berseth

- Using Deep Learning to Bootstrap Abstractions for Robot Planning (Poster)

  This paper addresses the problem of learning abstractions that boost robot planning performance while providing strong guarantees of reliability. Although state-of-the-art hierarchical robot planning algorithms allow robots to efficiently compute long-horizon motion plans for achieving user-desired tasks, these methods typically rely upon environment-dependent state and action abstractions that need to be hand-designed by experts. We present a new approach for bootstrapping the entire hierarchical planning process.
  This allows us to compute abstract states and actions for new environments automatically using the critical regions predicted by a deep neural network with an auto-generated robot-specific architecture. We show that the learned abstractions can be used with a novel multi-source bi-directional hierarchical robot planning algorithm that is sound and probabilistically complete. An extensive empirical evaluation on twenty different settings using holonomic and non-holonomic robots shows that (a) our learned abstractions provide the information necessary for efficient multi-source hierarchical planning; and that (b) this approach of learning, abstractions, and planning outperforms state-of-the-art baselines by nearly a factor of ten in terms of planning time on test environments not seen during training.

  Authors: Naman Shah · Siddharth Srivastava

- Don't Freeze Your Embedding: Lessons from Policy Finetuning in Environment Transfer (Poster)

  A common occurrence in reinforcement learning (RL) research is making use of a pretrained vision stack that converts image observations to latent vectors. Using a visual embedding in this way leaves open questions, though: should the vision stack be updated with the policy? In this work, we evaluate the effectiveness of such decisions in RL transfer settings. We introduce policy update formulations for use after pretraining in a different environment and analyze the performance of such formulations. Through this evaluation, we also detail emergent metrics of benchmark suites and present results on Atari and AndroidEnv.
  Authors: Victoria Dean · Daniel Toyama · Doina Precup

- Safer Autonomous Driving in a Stochastic, Partially-Observable Environment by Hierarchical Contingency Planning (Poster)

  When learning to act in a stochastic, partially observable environment, an intelligent agent should be prepared to anticipate a change in its belief of the environment state, and be capable of adapting its actions on-the-fly to changing conditions. As humans, we are able to form contingency plans when learning a task with the explicit aim of being able to correct errors in the initial control; these plans prove useful if ever there is a sudden change in our perception of the environment which requires immediate corrective action. This is especially the case for autonomous vehicles (AVs) navigating real-world situations where safety is paramount, and a strong ability to react to a changing belief about the environment is truly needed. In this paper we explore an end-to-end approach, from training to execution, for learning robust contingency plans and combining them with a hierarchical planner to obtain a robust agent policy in an autonomous navigation task where other vehicles’ behaviours are unknown, and the agent’s belief about these behaviours is subject to sudden, last-second change. We show that our approach results in robust, safe behaviour in a partially observable, stochastic environment, generalizing well over environment dynamics not seen during training.
  Authors: Ugo Lecerf · Christelle Yemdji-Tchassi · Pietro Michiardi

- Separating the World and Ego Models for Self-Driving (Poster)

  Training self-driving systems to be robust to the long tail of driving scenarios is a critical problem. Model-based approaches leverage simulation to emulate a wide range of scenarios without putting users at risk in the real world. One promising path to faithful simulation is to train a forward model of the world to predict the future states of both the environment and the ego-vehicle given past states and a sequence of actions. In this paper, we argue that it is beneficial to model the state of the ego-vehicle, which often has simple, predictable and deterministic behavior, separately from the rest of the environment, which is much more complex and highly multimodal. We propose to model the ego-vehicle using a simple and differentiable kinematic model, while training a stochastic convolutional forward model on raster representations of the state to predict the behavior of the rest of the environment. We explore several configurations of such decoupled models, and evaluate their performance both with Model Predictive Control (MPC) and direct policy learning. We test our methods on the task of highway driving and demonstrate lower crash rates and better stability.

  Authors: Vlad Sobal · Alfredo Canziani · Nicolas Carion · Kyunghyun Cho · Yann LeCun

- Multi-objective evolution for Generalizable Policy Gradient Algorithms (Poster)

  Performance, generalizability, and stability are three Reinforcement Learning (RL) challenges relevant to many practical applications in which they present themselves in combination. Still, state-of-the-art RL algorithms fall short when addressing multiple RL objectives simultaneously, and current human-driven design practices might not be well-suited for multi-objective RL.
  In this paper we present MetaPG, an evolutionary method that discovers new RL algorithms represented as graphs, following a multi-objective search criterion in which different RL objectives are encoded in separate fitness scores. Our findings show that, when using a graph-based implementation of Soft Actor-Critic (SAC) to initialize the population, our method is able to find new algorithms that improve upon SAC's performance and generalizability by 3% and 17%, respectively, and reduce instability by up to 65%. In addition, we analyze the graph structure of the best algorithms in the population and offer an interpretation of specific elements that help trade performance for generalizability and vice versa. We validate our findings in three different continuous control tasks: RWRL Cartpole, RWRL Walker, and Gym Pendulum.

  Authors: Juan Jose Garau-Luis · Yingjie Miao · John Co-Reyes · Aaron Parisi · Jie Tan · Esteban Real · Aleksandra Faust

- ShiftNorm: On Data Efficiency in Reinforcement Learning with Shift Normalization (Poster)

  We propose ShiftNorm, a simple yet promising data augmentation that can be applied to standard model-free algorithms to improve sample-efficiency in high-dimensional image-based reinforcement learning (RL). Concretely, the differentiable ShiftNorm leverages original samples with reparameterized virtual samples, and hastens the image encoder to generate invariant representations. Our approach demonstrates substantial advances, outperforming the state of the art on 8 of 9 tasks on the DeepMind Control Suite at 500k steps.
  Authors: Sicong Liu · Xi Zhang · Yushuo Li · Yifan Zhang · Jian Cheng

- Improving performance on the ManiSkill Challenge via Super-convergence and Multi-Task Learning (Poster)

  We present key aspects of our approach to the ManiSkill Challenge, where we used Imitation Learning on the provided demonstration dataset to let a robot learn how to manipulate interactive objects. We present what is, to our knowledge, the first application of super-convergence via learning rate scheduling to Imitation Learning and robotics, enabling better policy performance with a training time reduced by almost an order of magnitude. We also present how we used Multi-task Learning to reach a top score on unseen objects in one task of the challenge. It shows that the strategy can unlock generalization performance on some tasks, corroborating other work in the field. We also show that simple data augmentation strategies can help push the model performance further.

  Authors: Fabian Dubois · Eric Platon · Tom Sonoda

- Multi-task Reinforcement Learning with Task Representation Method (Poster)

  Multi-task reinforcement learning (RL) algorithms can train agents to acquire generalized skills across various tasks. However, jointly learning with multiple tasks can induce negative transfer between different tasks, resulting in unstable training. In this paper, we propose a task representation method that prevents negative transfer in policy learning. The proposed method for multi-task RL adopts a task embedding network in addition to a policy network, where the policy network takes the output of the task embedding network and states as inputs. Furthermore, we propose a measure of negative transfer and design an overall update method that can minimize the suggested measure. In addition, we raise the issue of a negative effect on soft Q-function learning that results in unstable Q learning, and introduce a clipping method to reduce this issue.
  The proposed multi-task algorithm is evaluated on various robotics manipulation tasks. Numerical results show that the proposed multi-task RL algorithm effectively minimizes negative transfer and achieves better performance than previous state-of-the-art multi-task RL algorithms.

  Authors: Myungsik Cho · Whiyoung Jung · Youngchul Sung

- Deep Sequenced Linear Dynamical Systems for Manipulation Policy Learning (Poster)

  In policy learning for robotic manipulation tasks, action parameterization can have a major impact on the final performance and sample efficiency of a policy. Unlike highly-dynamic continuous-control tasks, many manipulation tasks can be efficiently performed by a sequence of simple, smooth end-effector motions. Building on this intuition, we present a new class of policies built on top of differentiable Linear Dynamical System (dLDS) units, our differentiable formulation of the classical LDS. Constructing policies using dLDS units yields several advantageous properties, including trajectory coherence across timesteps, stability, and invariance under translation and scaling. Inspired by the sequenced LDS approach proposed by \citet{lds_dixon}, we propose a deep neural-network policy parameterization based on sequenced dLDS units, and we integrate this policy class into standard on-policy reinforcement learning settings. We conduct extensive experiments on Metaworld environments and show a notable improvement in performance and sample efficiency compared to other state-of-the-art algorithms. Additional visualizations and code can be found at https://sites.google.com/view/deep-sequenced-lds.

  Authors: Mohammad Nomaan Qureshi · Ben Eisner · David Held

- Learning Robust Task Context with Hypothetical Analogy-Making (Poster)

  Learning compact state representations from high dimensional and noisy observations is the cornerstone of reinforcement learning (RL).
  However, these representations are often biased toward the current task context and overfitted to context-irrelevant features, making it hard to generalize to other tasks. Inspired by the human analogy-making process, we propose a novel representation learning framework called Hypothetical Analogy-Making (HAM) for learning robust task contexts and generalizable policy for RL. It consists of task context and background encoding, hypothetical observation generation, and analogy-making between the original and hypothetical observations. Our model introduces an auxiliary objective that maximizes the mutual information between the generated observation and existing labels of codes used to generate the observation. Experiments on various challenging RL environments showed that our model helps the RL agent’s learned policy generalize by revealing a robust task context space.

  Authors: Shinyoung Joo · Sang Wan Lee

- Silver-Bullet-3D at ManiSkill 2021: Learning-from-Demonstrations and Heuristic Rule-based Methods for Object Manipulation (Poster)

  This paper presents an overview and comparative analysis of our systems designed for the following two tracks in SAPIEN ManiSkill Challenge 2021 (https://sapien.ucsd.edu/challenges/maniskill2021/):

  No Interaction Track: The No Interaction track targets learning policies from pre-collected demonstration trajectories. For this track, we investigate both an imitation learning-based approach, i.e., imitating the observed behavior using classical supervised learning techniques, and offline reinforcement learning-based approaches. Moreover, the geometry and texture structures of objects and robotic arms are exploited via Transformer-based networks to facilitate imitation learning.

  No Restriction Track: In this track, we design a Heuristic Rule-based Method (HRM) to trigger high-quality object manipulation by decomposing the task into a series of sub-tasks.
  For each sub-task, simple rule-based controlling strategies are adopted to predict actions that can be applied to robotic arms.

  To ease the implementation of our systems, all source code and pre-trained models are available at https://github.com/caiqi/Silver-Bullet-3D/.

  Authors: Yingwei Pan · Yehao Li · Yiheng Zhang · Qi Cai · Fuchen Long · Zhaofan Qiu · Ting Yao · Tao Mei

- Zero-Shot Reward Specification via Grounded Natural Language (Poster)

  Reward signals in reinforcement learning are expensive to design and often require access to the true state, which is not available in the real world. Common alternatives are usually demonstrations or goal images, which can be labor-intensive to collect. On the other hand, text descriptions provide a general, natural, and low-effort way of communicating the desired task. However, prior works in learning text-conditioned policies still rely on rewards that are defined using either true state or labeled expert demonstrations. We use recent developments in building large-scale visuolanguage models like CLIP to devise a framework that generates the task reward signal just from a goal text description and raw pixel observations, which is then used to learn the task policy. We evaluate the proposed framework on control and robotic manipulation tasks. Finally, we distill the individual task policies into a single goal text conditioned policy that can generalize in a zero-shot manner to new tasks with unseen objects and unseen goal text descriptions.

  Authors: Parsa Mahmoudieh · Deepak Pathak · Trevor Darrell

- Reinforcement Learning for Location-Aware Warehouse Scheduling (Poster)

  Recent techniques in dynamical scheduling and resource management have found applications in warehouse environments due to their ability to organize and prioritize tasks at a higher temporal resolution.
  The rise of deep reinforcement learning, as a learning paradigm, has enabled decentralized agent populations to discover complex coordination strategies. However, training multiple agents simultaneously introduces many obstacles in training as observation and action spaces become exponentially large. In our work, we experimentally quantify how various aspects of the warehouse environment (e.g., floor plan complexity, information about agents’ live location, level of task parallelizability) affect performance and execution priority. To achieve efficiency, we propose a compact representation of the state and action space for location-aware multi-agent systems, wherein each agent has knowledge of only self and task coordinates, hence only partial observability of the underlying Markov Decision Process. Finally, we show how agents trained in certain environments maintain performance in completely unseen settings and also correlate performance degradation with floor plan geometry.

  Authors: Stelios Stavroulakis · Biswa Sengupta

- A Probabilistic Perspective on Reinforcement Learning via Supervised Learning (Poster)

  Reinforcement Learning via Supervised Learning (RvS) only uses supervised techniques to learn desirable behaviors from large datasets. RvS has attracted much attention lately due to its simplicity and ability to leverage diverse trajectories. We introduce Density to Decision (D2D), a new framework, to unify a myriad of RvS algorithms. The Density to Decision framework formulates RvS as a two-step process: i) density estimation via supervised learning and ii) decision making via exponential tilting of the density. Using our framework, we categorise popular RvS algorithms and show how they differ in the design choices of their implementation. We then introduce a novel algorithm, Implicit RvS, leveraging powerful density estimation techniques that can easily be tilted to produce desirable behaviors.
  We compare the performance of a suite of RvS algorithms on the D4RL benchmark. Finally, we highlight the limitations of current RvS algorithms in comparison with traditional RL ones.

  Authors: Alexandre Piche · Rafael Pardinas · David Vazquez · Chris J Pal

- Prompts and Pre-Trained Language Models for Offline Reinforcement Learning (Poster)

  In this preliminary study, we introduce a simple way to leverage pre-trained language models in deep offline RL settings that are not naturally suited for textual representation. We propose using a state transformation into human-readable text and minimal fine-tuning of the pre-trained language model when training with deep offline RL algorithms. This approach shows consistent performance gains on the NeoRL MuJoCo datasets. Our experiments suggest that LM fine-tuning is crucial for good performance on robotics tasks. However, we also show that it is not necessary when working with finance environments in order to retain significant improvement in the final performance.

  Authors: Denis Tarasov · Vladislav Kurenkov · Sergey Kolesnikov

- Compositional Multi-Object Reinforcement Learning with Linear Relation Networks (Poster)

  Although reinforcement learning has seen remarkable progress in recent years, solving robust dexterous object-manipulation tasks in multi-object settings remains a challenge. In this paper, we focus on models that can learn manipulation tasks in fixed multi-object settings *and* extrapolate this skill zero-shot without any drop in performance when the number of objects changes. We consider the generic task of bringing a specific cube out of a set to a goal position. We find that previous approaches, which primarily leverage attention and graph neural network-based architectures, do not generalize their skills when the number of input objects changes while scaling as $K^2$.
We propose an alternative plug-and-play module based on relational inductive biases to overcome these limitations. Besides exceeding their performance in the training environment, we show that our approach, which scales linearly in $K$, allows agents to extrapolate and generalize zero-shot to any new object number. Link » Davide Mambelli · Frederik Träuble · Stefan Bauer · Bernhard Schoelkopf · Francesco Locatello 🔗 - Density Estimation For Conservative Q-Learning (Poster)  link » Batch Reinforcement Learning algorithms aim at learning the best policy from a batch of data without interacting with the environment. Within this setting, one difficulty is to correctly assess the value of state-action pairs far from the data set. Indeed, the lack of information may provoke an overestimation of the value function, leading to undesirable behaviours. A compromise between enhancing the performance of the behaviour policy and staying close to it must be found. To alleviate this issue, most existing approaches introduce a regularization term to favor state-action pairs from the data set. In this paper, we refine this idea by estimating the density of these state-action pairs to distinguish neighbourhoods. The resulting regularization guides the policy toward meaningful unseen regions, improving the learning process. We hence introduce Density Conservative Q-Learning (D-CQL), a sound batch RL algorithm that carefully penalizes the value function based on the information collected in the state-action space. The performance of our approach is demonstrated on many classical benchmarks in batch RL. Link » Paul Daoudi · Ludovic Dos Santos · Merwan Barlier · Aladin Virmaux 🔗 - Control of Two-way Coupled Fluid Systems with Differentiable Solvers (Poster)  link » We investigate the use of deep neural networks to control complex nonlinear dynamical systems, specifically the movement of a rigid body immersed in a fluid. 
We solve the Navier-Stokes equations with two-way coupling, which gives rise to nonlinear perturbations that make the control task very challenging. Neural networks are trained to act as controllers with desired characteristics through a process of learning from a differentiable simulator. Here we introduce a set of physically interpretable loss terms to let the networks learn robust and stable interactions. We demonstrate that controllers trained in a canonical setting with quiescent initial conditions reliably generalize to varied and challenging environments such as previously unseen inflow conditions and forcing. Further, we show that controllers trained with our approach outperform a variety of classical and learned alternatives in terms of evaluation metrics and generalizing capabilities. Link » Brener Ramos · Felix Trost · Nils Thuerey 🔗 - One-Shot Imitation with Skill Chaining using a Goal-Conditioned Policy in Long-Horizon Control (Poster)  link » Recent advances in skill learning from a task-agnostic offline dataset enable the agent to acquire various skills that can be used as primitives to perform long-horizon imitation. However, most work implicitly assumes that the offline dataset covers the entire distribution of target demonstrations. If the dataset only contains subtask-local trajectories, existing methods fail to imitate the transitions between subtasks without a sufficient amount of target demonstrations, significantly limiting the scalability of these methods. In this work, we show that a simple goal-conditioned policy can imitate the missing transitions using only the target demonstrations. We combine it with a policy-switching strategy that uses the skills when they are applicable. Furthermore, we present multiple choices that can effectively evaluate the applicability of skills. Our new method successfully performs one-shot imitation with skills learned from a subtask-local offline dataset. 
We experimentally show that it outperforms other one-shot imitation methods in a challenging kitchen environment, and we also qualitatively analyze how each policy-switching strategy works during imitation. Link » Hayato Watahiki · Yoshimasa Tsuruoka 🔗 - Versatile Offline Imitation Learning via State-Occupancy Matching (Poster)  link » We propose State Matching Offline DIstribution Correction Estimation (SMODICE), a novel and versatile algorithm for offline imitation learning (IL) via state-occupancy matching. Without requiring access to expert actions, SMODICE can be effectively applied to three offline IL settings: (i) imitation from observations (IfO), (ii) IfO with a dynamics or morphologically mismatched expert, and (iii) example-based reinforcement learning, which we show can be formulated as a state-occupancy matching problem. We show that the SMODICE objective admits a simple optimization procedure through an application of Fenchel duality, reducing a nested optimization problem to a sequence of stable supervised learning problems. We extensively evaluate SMODICE on both gridworld environments and high-dimensional offline benchmarks. Our results demonstrate that SMODICE is effective for all three problem settings and significantly outperforms the prior state of the art. Link » Yecheng Jason Ma · Andrew Shen · Dinesh Jayaraman · Osbert Bastani 🔗 - Let’s Handle It: Generalizable Manipulation of Articulated Objects (Poster)  link » In this project, we present a framework for building generalizable manipulation controller policies that map from raw input point clouds and segmentation masks to joint velocities. We took a traditional robotics approach, using point cloud processing, end-effector trajectory calculation, inverse kinematics, closed-loop position controllers, and behavior trees. We demonstrate our framework on four manipulation skills on common household objects that comprise the SAPIEN ManiSkill Manipulation challenge. 
Link » Zhutian Yang · Aidan Curtis 🔗 - Revisiting Model-based Value Expansion (Poster)  link » Model-based value expansion methods promise to improve the quality of value function targets and, thereby, the effectiveness of value function learning. However, to date, these methods are being outperformed by Dyna-style algorithms with conceptually simpler 1-step value function targets. This shows that in practice, the theoretical justification of value expansion does not seem to hold. We provide a thorough empirical study to shed light on the causes of failure of value expansion methods in practice, which are commonly believed to stem from compounding model error. By leveraging GPU-based physics simulators, we are able to efficiently use the true dynamics for analysis inside the model-based reinforcement learning loop. Performing extensive comparisons between true and learned dynamics sheds light on this black box. This paper provides a better understanding of the actual problems in value expansion. We provide future directions of research by empirically testing the maximum theoretical performance of current approaches. Link » Daniel Palenicek · Michael Lutter · Jan Peters 🔗 - An Empirical Study and Analysis of Learning Generalizable Manipulation Skill in the SAPIEN Simulator (Poster)  link » This paper provides a brief overview of our submission to the no interaction track of SAPIEN ManiSkill Challenge 2021. Our approach follows an end-to-end pipeline which mainly consists of two steps: first, we extract the point cloud features of multiple objects; then we adopt these features to predict the action score of the robot simulators through a deep and wide transformer-based network. More specifically, we present an empirical study that includes a bag of tricks and abortive attempts. Finally, our method achieves a promising ranking on the leaderboard. 
All code of our solution is available at https://github.com/**. Link » Liu Kun · Huiyuan Fu · Zheng Zhang · huanpu yin 🔗 - Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning (Poster)  link » Recent progress in deep learning has relied on access to large and diverse datasets. Such data-driven progress has been less evident in offline reinforcement learning (RL), because offline RL data is usually collected to optimize specific target tasks, limiting the data’s diversity. In this work, we propose Exploratory data for Offline RL (ExORL), a data-centric approach to offline RL. ExORL first generates data with unsupervised reward-free exploration, then relabels this data with a downstream reward before training a policy with offline RL. We find that exploratory data allows vanilla off-policy RL algorithms, without any offline-specific modifications, to outperform or match state-of-the-art offline RL algorithms on downstream tasks. Our findings suggest that data generation is as important as algorithmic advances for offline RL and hence requires careful consideration from the community. Link » Denis Yarats · David Brandfonbrener · Hao Liu · Michael Laskin · Pieter Abbeel · Alessandro Lazaric · Lerrel Pinto 🔗 - Learning Generalizable Dexterous Manipulation from Human Grasp Affordance (Poster)  link » Dexterous manipulation with a multi-finger hand is one of the most challenging problems in robotics. While recent progress in imitation learning has greatly improved the sample efficiency compared to Reinforcement Learning, the learned policy can hardly generalize to manipulate novel objects, given limited expert demonstrations. In this paper, we propose to learn dexterous manipulation using large-scale demonstrations with diverse 3D objects in a category, which are generated from a human grasp affordance model. This generalizes the policy to novel object instances within the same category. 
To train the policy, we propose a novel imitation learning objective jointly with a geometric representation learning objective using our demonstrations. By experimenting with relocating diverse objects in simulation, we show that our approach outperforms baselines by a large margin when manipulating novel objects. We also ablate the importance of 3D object representation learning for manipulation. Link » Yueh-Hua Wu · Jiashun Wang · Xiaolong Wang 🔗 - Continuous Control on Time (Poster)  link » The physical world evolves continuously in time. Most prior works on reinforcement learning cast continuous-time environments into a discrete-time Markov Decision Process (MDP), by discretizing time into constant-width decision intervals. In this work, we propose Continuous-Time-Controlled MDPs (CTC-MDP), a continuous-time decision process that permits the agent to decide how long each action will last in the physical time of the environment. However, reinforcement learning in vanilla CTC-MDP may result in agents learning to take infinitesimally small time scales for each action. To prevent such degeneration and allow users to control the computation budget, we further propose CTC-MDPs with a constraint on the average time scale over a given threshold. We hypothesize that constrained CTC-MDPs will allow agents to "budget" fine-grained time scales to states where they may need to adjust actions quickly, and coarse-grained time scales to states where they can get away with a single decision. We evaluate our new CTC-MDP framework (with and without constraint) on the standard MuJoCo benchmark. Link » Tianwei Ni · Eric Jang 🔗 - A Minimalist Ensemble Method for Generalizable Offline Deep Reinforcement Learning (Poster)  link » Deep Reinforcement Learning (DRL) has achieved impressive performance in a variety of applications. However, most existing DRL methods require massive active interactions with the environments, which is not practical in real-world scenarios. 
Moreover, most current evaluation environments are exactly the same as the training environments, so the generalization ability of the agent is neglected. To fulfill the potential of DRL, an ideal policy should have 1) the ability to learn from a previously collected dataset (i.e., offline DRL) and 2) the ability to generalize to unseen scenarios and objects in the testing environments. Given the expert demonstrations collected from the training environments, the goal is to enhance the performance of the model in both the training and testing environments without any more interaction. In this paper, we propose a minimalist ensemble imitation learning-based method that trains a bundle of agents with simple modifications on network architecture and hyperparameter tuning and combines them as an ensemble model. To verify our method, we took part in the No Interaction Track of the SAPIEN Manipulation Skill (ManiSkill) Challenge and conducted extensive experiments on the ManiSkill Benchmark. The challenge rank and experimental results demonstrate the effectiveness of our method. Link » Kun Wu · Yinuo Zhao · Zhiyuan Xu · Zhen Zhao · Pei Ren · Zhengping Che · Chi Liu · Feifei Feng · Jian Tang 🔗 - Know Thyself: Transferable Visual Control Policies Through Robot-Awareness (Poster)  link » Note: This submission is published in ICLR'22. Training visual control policies from scratch on a new robot typically requires generating large amounts of robot-specific data. Could we leverage data previously collected on another robot to reduce or even completely remove this need for robot-specific data? We propose a “robot-aware control” paradigm that achieves this by exploiting readily available knowledge about the robot. We then instantiate this in a robot-aware model-based RL policy by training modular dynamics models that couple a transferable, robot-agnostic world dynamics module with a robot-specific, potentially analytical, robot dynamics module. 
This also enables us to set up visual planning costs that separately consider the robot agent and the world. Our experiments on tabletop manipulation tasks with simulated and real robots demonstrate that these plug-in improvements dramatically boost the transferability of visual model-based RL policies, even permitting zero-shot transfer of visual manipulation skills onto new robots. Project website: https://sites.google.com/view/rac-iclr22 Link » Edward Hu · Kun Huang · Oleh Rybkin · Dinesh Jayaraman 🔗 - Sim-to-Lab-to-Real: Safe RL with Shielding and Generalization Guarantees (Poster)  link » Safety is a critical component of autonomous systems and remains a challenge for learning-based policies to be utilized in the real world. In this paper, we propose Sim-to-Lab-to-Real to safely close the reality gap. To improve safety, we apply a dual policy setup where a performance policy is trained using the cumulative task reward and a backup (safety) policy is trained by solving the safety Bellman Equation based on Hamilton-Jacobi reachability analysis. In Sim-to-Lab transfer, we apply a supervisory control scheme to shield unsafe actions during exploration; in Lab-to-Real transfer, we leverage the Probably Approximately Correct (PAC)-Bayes framework to provide lower bounds on the expected performance and safety of policies in unseen environments. We empirically study the proposed framework for ego-vision navigation in two types of indoor environments including a photo-realistic one. We also demonstrate strong generalization performance through hardware experiments in real indoor spaces with a quadrupedal robot (See https://tinyurl.com/2p9hbyf7 for video of representative trials of Real deployment). Link » Kai-Chieh Hsu · Allen Z. Ren · Duy Nguyen · Anirudha Majumdar · Jaime Fernández Fisac 🔗
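The supervisory shielding idea in the Sim-to-Lab-to-Real abstract above (a backup policy overrides the performance policy when a proposed action looks unsafe) can be illustrated with a minimal sketch. Everything here is an assumption for illustration and not the authors' implementation: the names `shielded_action`, `performance_policy`, `backup_policy`, `safety_value`, the threshold, the sign convention (a value of at least the threshold means safe), and the toy 1-D dynamics.

```python
# Minimal shielding sketch on a toy 1-D system (all names are hypothetical).
WALL = 1.0  # assumed unsafe region: positions x >= WALL


def step(x, a):
    """Toy dynamics: the action is added directly to the position."""
    return x + a


def performance_policy(x):
    """Task policy: always pushes toward the wall (ignores safety)."""
    return 0.4


def backup_policy(x):
    """Safety policy: retreats from the wall."""
    return -0.2


def safety_value(x, a):
    """Assumed safety critic: margin to the wall after taking action a.

    Convention used in this sketch: values >= threshold are treated as safe.
    """
    return WALL - step(x, a) - 0.1


def shielded_action(x, threshold=0.0):
    """Return (action, overridden): keep the proposed action if it is
    judged safe, otherwise fall back to the backup policy."""
    proposed = performance_policy(x)
    if safety_value(x, proposed) >= threshold:
        return proposed, False
    return backup_policy(x), True


def rollout(x0, n_steps):
    """Run the shielded controller and record (position, overridden)."""
    x, trace = x0, []
    for _ in range(n_steps):
        a, overridden = shielded_action(x)
        x = step(x, a)
        trace.append((x, overridden))
    return trace


if __name__ == "__main__":
    for x, overridden in rollout(0.0, 10):
        print(f"x={x:.2f} overridden={overridden}")
```

In this toy setting the shield only intervenes near the wall, so the task policy acts freely elsewhere; a learned safety critic would replace the hand-written `safety_value`.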