The 2026 schedule is still incomplete
Oral
Sat Apr 25 11:15 AM -- 11:25 AM (PDT)
Reliable Weak-to-Strong Monitoring of LLM Agents
Neil Kale · Chen Bo Calvin Zhang · Kevin Zhu · Ankit Aich · Paula Rodriguez · Christina Knight · Zifan Wang
Oral
Sat Apr 25 11:27 AM -- 11:37 AM (PDT)
CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
Zhun Wang · Tianneng Shi · Jingxuan He · Matthew Cai · Jialin Zhang · Dawn Song
Oral
Sat Apr 25 11:39 AM -- 11:49 AM (PDT)
OpenApps: Simulating Environment Variations to Measure UI Agent Reliability
Karen Ullrich · Jingtong Su · Claudia Shi · Arjun Subramonian · Amir Bar · Ivan Evtimov · Nikolaos Tsilivis · Randall Balestriero · Julia Kempe · Mark Ibrahim
Oral
Sat Apr 25 11:51 AM -- 12:01 PM (PDT)
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao · Jaylen Jones · Linxi Jiang · Yuting Ning · Eric Fosler-Lussier · Yu Su · ZHIQIANG LIN · Huan Sun
Oral
Sat Apr 25 12:03 PM -- 12:13 PM (PDT)
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering
Yahan Li · Jifan Yao · John Bunyi · Adam Frank · Angel Hwang · Ruishan Liu
Oral
Sat Apr 25 12:15 PM -- 12:25 PM (PDT)
WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality
Chunyang Li · Yilun Zheng · Xinting Huang · Tianqing Fang · Jiahao Xu · Lihui Chen · Yangqiu Song · Han Hu