Marin: Open Development of Frontier AI
As AI capabilities skyrocket, openness plummets: the scientific community and broader public know little about how frontier models (including open-weight models) are trained. I will describe Marin, a radically new way of doing model development, inspired by true open-source software. Every experiment is done in the open, and anyone can suggest ideas, review, and even run experiments through GitHub, offering a way of doing science that improves on preregistration, reproducibility, and peer review. I will discuss a selection of scientific results that have emerged from Marin, including new optimizers and scaling laws. We hope that Marin will be a platform for the community to participate in the development of frontier AI.
Nubank: From LLMs to Financial Inclusion: Efficient LLM Training and Scaling AI Agents for 131 Million Lives
Nubank serves 131 million customers across Brazil, Mexico, and Colombia, many of whom accessed financial services for the first time through the platform. This scale demands AI that is both technically excellent and deeply human, capable of understanding individual behavior and delivering personalized, trustworthy experiences across languages and financial backgrounds. We present three interconnected lines of work across Nubank’s AI stack:
1. Efficient model training: research on more efficient optimization to enable training larger, more capable models under real-world resource constraints.
2. Deep learning from user behavior: nuFormer, a transformer-based self-supervised model that learns rich user representations directly from raw transaction sequences to improve large-scale recommendation systems.
3. Production AI agents for customer support: an evaluation-driven framework with a closed loop of context engineering, prompt optimization, LLM judges, and simulation-to-production validation to build high-quality AI agents that serve millions of customer interactions.
LLM-based agents struggle to execute complex, multi-step Standard Operating Procedures (SOPs) that are fundamental to industrial automation. Existing benchmarks fail to capture the procedural complexity and tool orchestration demands of real-world workflows. We introduce SOP-Bench, a benchmark of 2,000+ tasks from human expert-authored SOPs across 12 business domains (healthcare, logistics, finance, content moderation, etc.). Using a human-AI collaborative framework, experts crafted authentic SOPs while AI generated artifacts (tools, APIs, datasets), all human-validated, yielding realistic tasks with executable interfaces and ground-truth outputs. SOP-Bench serves as a research enabler for systematically investigating agent architectures, model capabilities, and deployment considerations across diverse procedural tasks. In this talk, we discuss the challenges of real-world SOPs and a framework for evaluating them.
Humanists in AI
This affinity group is for NLP and AI researchers with a prior background or interest in the humanities. As AI research becomes increasingly interdisciplinary - drawing on theories and methods from outside computer science - it is important to engage with what the humanities have to offer for studying language, culture, narrative, subjectivity, mind, emotion, etc. Disciplines like literary and cultural studies, media studies, philosophy, and history have a great deal to offer for studying, building, and improving language systems. Topics of interest include (but are not limited to): co-intelligence and co-creative systems, narrative understanding, cultural analytics, literary NLP, AI literacy, AI ethics, culture and cognition, etc.
Reliability in NL-to-SQL Systems
Large language models are increasingly used to translate natural language into SQL. But how reliable are they in real-world settings?
In this session, we present a focused evaluation framework for measuring NL-to-SQL performance, including execution correctness, robustness, and query efficiency under varying levels of database context. We’ll discuss how structured QA and validation approaches can help move LLM systems from benchmark success to production reliability.
We invite researchers and practitioners working on NL-to-SQL systems, LLM evaluation, and database applications to participate in the discussion and share perspectives from real-world deployments.
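As a generic illustration (not the framework presented in this session), execution correctness in NL-to-SQL evaluation is commonly measured by running the predicted and gold queries against the same database and comparing result sets. The function name and the order-insensitive comparison convention below are assumptions for this sketch:

```python
import sqlite3

def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    """Return True if the predicted and gold queries yield the same rows.

    The comparison is order-insensitive, a common (but not universal)
    convention in NL-to-SQL execution-accuracy metrics.
    """
    conn = sqlite3.connect(db_path)
    try:
        pred = conn.execute(predicted_sql).fetchall()
        gold = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute scores as incorrect
    finally:
        conn.close()
    return sorted(map(repr, pred)) == sorted(map(repr, gold))
```

A production-grade check would also account for query efficiency (e.g. timeouts) and robustness under varying database context, which this sketch omits.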
Women in Machine Learning (WiML) Social @ ICLR
The Women in Machine Learning (WiML) initiative, founded in 2006, was created to connect and support the relatively small but growing community of researchers in ML who identify as women or nonbinary. Over the years, WiML events at conferences such as NeurIPS, ICML, and ICLR have highlighted cutting-edge research, fostered mentorship, and created space for meaningful technical exchange. For ICLR, we propose a WiML Social that keeps WiML’s core mission while emphasising interaction and networking. The event will feature a panel discussion, facilitated roundtables, and structured networking activities designed to spark in-depth conversations and future collaborations. Building on the success of the highly interactive WiML formats at ICLR in the past, we will include small-group discussions, allowing participants to engage directly on open research questions and career paths. The goals remain the same: to celebrate the work of researchers who identify as women or nonbinary, to create opportunities for junior and senior participants to connect, and to strengthen community ties within the broader ICLR ecosystem.
What is the Role of World Models in Decision-Making?
World models have recently gained popularity thanks to impressive results and the availability of data. However, no consensus has been reached on how they should help improve decision-making. This social aims to foster discussions around the role of world models in decision-making.
World models are used in many ways. Some approaches use world models as synthetic data generators. Others leverage them at test time to reason or evaluate policies. Video models, combined with an inverse dynamics model, can be used to infer actions. Discussing these design choices is essential to assessing the role of world models in decision-making.
When data is scarce, some methods learn a world model only to improve representation learning and rely on model-free methods for decision-making to avoid hallucinations. We believe it is fruitful to discuss the scenarios in which world models can be trusted.
Exchanges can focus on the situations in which world models have an edge over algorithms that do not directly learn the transition dynamics. It is also not yet clear whether world models are more relevant at certain levels of hierarchy.
Finally, discussing why world models can enable better generalization can provide an answer to the question asked in this social.
Trust but Verify: Discussion on AI Verification Practices
Turing: A Framework for Evaluating Agents on Stateful, Multi-Step Real-World Workflows
Agent benchmarks are evolving from static prediction tasks to multi-step tasks requiring tool use, yet most evaluations remain short-horizon, loosely coupled, and state-agnostic. These settings fail to capture the properties that determine reliability in real workflows: persistent state, dense relational structure, role-based access control, and policy-constrained execution. This talk introduces an evaluation framework built on three components: workflow-grounded task definitions, scenario construction derived from operational traces, and a stateful execution environment with persistent databases and verifiable outcomes. Agents are evaluated across tasks that require multi-step planning over interconnected subsystems, exposing failure modes that simply don't appear in short-horizon benchmarks. Scoring combines automated outcome verification with rubrics that assess domain reasoning quality, including constraint prioritization and decision-making under ambiguity. We show that performance degrades predictably with increasing horizon length and constraint density. The primary bottleneck for frontier models is strategic planning under constraint, not tool invocation accuracy. Structural failures surface reliably through outcome checks, while subtler breakdowns in judgment require reasoning-quality assessment. These results suggest that deployable autonomy requires evaluation frameworks that integrate realistic workflows, operational constraints, stateful simulation, and scoring that captures both outcome correctness and the quality of domain-grounded reasoning. This session outlines the design principles, empirical findings, and open research challenges that define this next generation of agent evaluation.
As large language models (LLMs) evolve from short-burst chatbots into long-horizon autonomous agents, progress is increasingly bottlenecked by verification asymmetry: rapid gains in domains with cheap correctness signals (e.g., math and code) contrast sharply with limited progress in tasks with weak or delayed verification, such as research planning and strategic decision-making. This talk argues that evaluation and reinforcement learning (RL) beyond easily verifiable domains are the next critical frontier for AI capability. We present results from three new evaluation frameworks. Humanity’s Last Exam (HLE) shows that frontier models are frequently wrong and overconfident at the human-expert level. The Remote Labor Index (RLI) demonstrates that current agents automate only ~2.5% of real, paid freelance work. Visual ToolBench reveals that 70–80% of multimodal agent failures stem from visual perception rather than reasoning. To close these gaps, we introduce Rubrics as Rewards (RaR) within a Group Relative Policy Optimization (GRPO) framework. We show that Dynamic Rubrics, which adaptively elicit evaluation criteria by contrasting model outputs during training, outperform static human-written rubrics and reduce reward hacking in the high-reward regime. These findings motivate a shift from static benchmarks to high-fidelity RL environments, such as Scale Gymnasium, that train agents through interaction rather than imitation.
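The Rubrics-as-Rewards details above are the speakers' own; as a generic, hypothetical illustration of the underlying idea, a rubric-based reward can be computed as a weight-normalized fraction of satisfied criteria, each criterion a cheap programmatic check. The example rubric and its criteria below are invented for this sketch:

```python
from typing import Callable, List, Tuple

Criterion = Tuple[float, Callable[[str], bool]]  # (weight, check)

def rubric_reward(output: str, rubric: List[Criterion]) -> float:
    """Score an output as the weight-normalized fraction of rubric
    criteria it satisfies, yielding a scalar reward in [0, 1] that
    can stand in for a verifier signal in RL fine-tuning."""
    total = sum(weight for weight, _ in rubric)
    if total == 0:
        return 0.0
    earned = sum(weight for weight, check in rubric if check(output))
    return earned / total

# Hypothetical rubric for a short research-plan answer.
example_rubric: List[Criterion] = [
    (2.0, lambda s: "hypothesis" in s.lower()),  # states a hypothesis
    (1.0, lambda s: "baseline" in s.lower()),    # names a baseline
    (1.0, lambda s: len(s.split()) >= 10),       # minimally substantive
]
```

Static rubrics like this one are exactly what the talk argues against in the limit: fixed checks invite reward hacking, which motivates adaptively elicited criteria.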
Images of the Hidden Universe
Some of the most iconic images in modern science were never captured by a camera in the traditional sense. Instead, they were inferred from indirect and incomplete measurements, using a combination of physics, prior knowledge, and computation. In this talk, I will explore how physics and machine learning are working together to illuminate parts of the universe that are difficult - or even fundamentally impossible - to observe directly. I’ll begin with the story of black hole imaging, where theory long predicted what we should see, and where confidence came not from a single image, but from the consistency of features across many reconstructions of the same data. Along the way, I’ll show that this kind of inference is not unique to extreme astrophysics, but also underlies how we form images in familiar technologies we rely on every day, where images are similarly reconstructed from indirect measurements using models and assumptions. I’ll show that simple assumptions can take us far, but also where they begin to limit what we can learn. Incorporating richer assumptions through the help of machine learning allows us to extract more from the same data and explore a full range of possibilities that respects varying strengths of the expected physics. Finally, I will discuss how these ideas extend beyond black holes to other scientific imaging problems, including mapping the distribution of dark matter from subtle distortions in the shapes of galaxies due to gravitational lensing. Together, these examples illustrate how modern imaging increasingly relies on integrating physics and machine learning to extract meaningful information from fundamentally limited data to uncover our hidden universe.
X-informed AI
Scientific ML is more than applying ML to science — it's about letting domain structure shape the model itself. In this social, we bring together researchers working across domains (neuroscience, PDEs, physics simulations, and beyond) to discuss: What is the urgent 'X' in your X-informed AI? How do you identify the right inductive bias? How do we build a community that supports this? Come share how your domain shapes your models.
Cooperation in Decentralized Multi-Agent Systems
As AI agents increasingly operate in shared environments — negotiating, transacting, and making joint decisions — questions of coordination and cooperation become inseparable from questions of system design. How do we incentivize cooperation among autonomous agents when no single party controls the system? What coordination protocols, commitment devices, and oversight mechanisms work in distributed settings? How do we prevent collusion and ensure robustness as agent networks scale?
This social is hosted by Cooperative AI Foundation and The Institute for Decentralized AI as a gathering for researchers working on multi-agent systems, mechanism design, AI safety, distributed systems, and related areas who are interested in how cooperative and decentralized approaches to AI intersect and inform each other. Drinks will be provided.
Evaluating LLMs Holistically in a World Where Benchmarks Leak: The Case for Private-Only Evaluation
Benchmark contamination is no longer a theoretical concern. As frontier models are trained on open-web data, public test sets are routinely absorbed into pre-training corpora — and beyond passive contamination, labs are known to actively optimize against known benchmarks and selectively report favorable results. When a model claims state-of-the-art on a public leaderboard, it is increasingly unclear whether that reflects genuine generalization or familiarity with the test. Private-only benchmarks — never released publicly, evaluated under controlled conditions, and continuously refreshed — offer a structural solution. If a model cannot train against a benchmark it has never seen, contamination becomes impossible by design. Built across capability families rather than isolated skills, such benchmarks can also surface cross-domain failure modes that narrow public evaluations miss entirely. This social will examine what private, holistic evaluation infrastructure could look like in practice, with short talks from practitioners followed by open discussion on what it would take for the community to coalesce around shared private evaluation standards.
Researchers Using AI to Research AI Research
More simply put: LLMs for Metascience. This social is focused on facilitating conversations from industry, academia, and government about how we are all using AI to understand the current landscape of AI research. What’s coming at us the fastest? What are the problems we’ve effectively “solved”? Where can existing methods be applied to support progress in overlooked areas?
This is meant to be a relaxed, discussion-driven space. We’re excited for a cross-sector exchange where people can share tools, workflows, rough ideas, and open questions, as well as the challenges of using AI systems to reason about science itself. Input from those who conduct, fund, and review research will be especially valuable in shaping a more complete picture of the different components in the AI research landscape.
Queer in AI Social
This social is an informal community gathering organised by Queer in AI to foster connection, visibility, and mutual support among LGBTQ+ researchers and allies in AI/ML. The event provides a welcoming space for queer scientists, students, and practitioners to meet, share experiences, and build professional and personal networks within the broader research community.
Breaking Silos – Open Community for AI x Science
AI for Science is rapidly emerging as a key area where machine learning can accelerate discovery in domains such as materials science, biology, physics, and mathematics. Progress in this space increasingly depends on collaboration between machine learning researchers and domain scientists, as well as open ecosystems of data, models, and tools. This social aims to create an informal space at ICLR for researchers interested in AI for Science and open collaboration to connect, exchange ideas, and build new collaborations.
The social will bring together participants from several ICLR workshops related to AI for Science, including AI4Mat, FM4Science, AI&PDE, and Sci4DL, and foster interaction across these communities. The session will focus on practical challenges and opportunities in building open scientific ecosystems, including open datasets and benchmarks, open-source tools and foundation models for science, cross-disciplinary collaboration, and community-driven initiatives.
The format will emphasize interaction and networking with brief opening remarks, structured speed networking, themed small-group discussions, and a short open panel conversation where participants can share insights and identify opportunities for collaboration. The social will also provide a welcoming environment for students and early-career researchers to engage with both academic and industry researchers working on AI-driven scientific discovery.
World Models and Beyond: Bridging Video, Simulation, and Robotic Intelligence
World models have emerged as a unifying thread across video generation, model-based reinforcement learning, and robotic planning — yet the communities working on these problems often don't overlap at conferences. This Social brings together researchers working on learned simulators, video prediction, world models for decision-making, and sim-to-real transfer for an informal, discussion-first session.