
Mon May 06 07:45 AM -- 04:30 PM (PDT) @ Room R06
Safe Machine Learning: Specification, Robustness, and Assurance
Silvia Chiappa · Victoria Krakovna · Adrià Garriga-Alonso · Andrew Trask · Jonathan Uesato · Christina Heinze-Deml · Ray Jiang · Adrian Weller


The ultimate goal of ML research should be to have a positive impact on society and the world. As the number of applications of ML increases, it becomes more important to address a variety of safety issues, both those that already arise with today's ML systems and those that may be exacerbated in the future with more advanced systems.

Current ML algorithms tend to be brittle and opaque, reflect undesired bias in the data and often optimize for objectives that are misaligned with human preferences. We can expect many of these issues to get worse as our systems become more advanced (e.g. finding more clever ways to optimize for a misspecified objective). This workshop aims to bring together researchers in diverse areas such as reinforcement learning, formal verification, value alignment, fairness, and security to further the field of safety in machine learning.

We will focus on three broad categories of ML safety problems: specification, robustness and assurance (Ortega et al., 2018). Specification is defining the purpose of the system, robustness is designing the system to withstand perturbations, and assurance is monitoring, understanding and controlling system activity before and during its operation. Research areas within each category include:

Specification:
- Reward hacking: Systems may behave in ways unintended by the designers, because of discrepancies between the specified reward and the true intended reward. How can we design systems that don’t exploit these misspecifications, or figure out where they are? (Over 40 examples of specification gaming by AI systems have been collected.)
- Side effects: How can we give artificial agents an incentive to avoid unnecessary disruptions to their environment while pursuing the given objective? Can we do this in a way that generalizes across environments and tasks and does not introduce bad incentives for the agent in the process?
- Fairness: ML is increasingly used in core societal domains such as health care, hiring, lending, and criminal risk assessment. How can we make sure that historical prejudices, cultural stereotypes, and existing demographic inequalities contained in the data, as well as sampling bias and collection issues, are not reflected in the systems?

Robustness:
- Adaptation: How can machine learning systems detect and adapt to changes in their environment (e.g. low overlap between training and test distributions, poor initial model assumptions, or shifts in the underlying prediction function)? How should an autonomous agent act when confronting radically new contexts, or identify that the context is new in the first place?
- Verification: How can we scalably verify meaningful properties of ML systems? What role can and should verification play in ensuring robustness of ML systems?
- Worst-case robustness: How can we train systems which never perform extremely poorly, even in the worst case? Given a trained system, can we ensure it never fails catastrophically, or bound this probability?
- Safe exploration: Can we design reinforcement learning algorithms which never fail catastrophically, even at training time?

Assurance:
- Interpretability: How can we robustly determine whether a system is working as intended (i.e. is well specified and robust) before large-scale deployment, even when we do not have a formal specification of what it should do?
- Monitoring: How can we monitor large-scale systems to identify whether they are performing well? What tools can help diagnose and fix the issues that are found?
- Privacy: How can we ensure that the trained systems do not reveal sensitive information about individuals contained in the training set?
- Interruptibility: An artificial agent may learn to avoid interruptions by the human supervisor if such interruptions lead to receiving less reward. How can we ensure the system behaves safely even under the possibility of shutdown?

The goals of the workshop are to:
- Make the ICLR community more aware that the impact of their work is important, and that positive impact does not come for free, since safety issues can be difficult to formalize and address.
- Provide a forum for concerned researchers to discuss their work and its implications for the societal impact of ML.
- Bring together researchers working on near-term and long-term safety and explore overlaps between the considerations and approaches in those fields.

Schedule:
Opening remarks (Talk)
Interpretability for important problems (Invited Talk)
Posters and Coffee Break 1 (Poster Session)
Formalizing the Value Alignment Problem in A.I. (Invited Talk)
Misleading meta-objectives and hidden incentives for distributional shift (Contributed Talk)
Panel: Exploring overlaps and interactions between AI safety research areas (Panel Discussion)
Lunch break (Break)
Bridging Adversarial Robustness and Gradient Interpretability (Contributed Talk)
Uncovering Surprising Behaviors in Reinforcement Learning via Worst-Case Analysis (Contributed Talk)
Posters and Coffee Break 2 (Poster Session)
The case for dynamic defenses against adversarial examples (Invited Talk)
Panel: Research priorities in AI safety (Panel Discussion)
Closing (Talk)