

Oral
Sat Apr 25 11:15 AM -- 11:25 AM (PDT) @ 204 A/B
Reliable Weak-to-Strong Monitoring of LLM Agents
Neil Kale ⋅ Chen Bo Calvin Zhang ⋅ Kevin Zhu ⋅ Ankit Aich ⋅ Paula Rodriguez ⋅ Christina Knight ⋅ Zifan Wang
Oral
Sat Apr 25 11:27 AM -- 11:37 AM (PDT) @ 204 A/B
CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
Zhun Wang ⋅ Tianneng Shi ⋅ Jingxuan He ⋅ Matthew Cai ⋅ Jialin Zhang ⋅ Dawn Song
Oral
Sat Apr 25 11:39 AM -- 11:49 AM (PDT) @ 204 A/B
OpenApps: Simulating Environment Variations to Measure UI Agent Reliability
Karen Ullrich ⋅ Jingtong Su ⋅ Claudia Shi ⋅ Arjun Subramonian ⋅ Amir Bar ⋅ Ivan Evtimov ⋅ Nikolaos Tsilivis ⋅ Randall Balestriero ⋅ Julia Kempe ⋅ Mark Ibrahim
Oral
Sat Apr 25 11:51 AM -- 12:01 PM (PDT) @ 204 A/B
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao ⋅ Jaylen Jones ⋅ Linxi Jiang ⋅ Yuting Ning ⋅ Eric Fosler-Lussier ⋅ Yu Su ⋅ Zhiqiang Lin ⋅ Huan Sun
Oral
Sat Apr 25 12:03 PM -- 12:13 PM (PDT) @ 204 A/B
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering
Yahan Li ⋅ Jifan Yao ⋅ John Bunyi ⋅ Adam Frank ⋅ Angel Hwang ⋅ Ruishan Liu
Oral
Sat Apr 25 12:15 PM -- 12:25 PM (PDT) @ 204 A/B
WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality
Chunyang Li ⋅ Yilun Zheng ⋅ Xinting Huang ⋅ Tianqing Fang ⋅ Jiahao Xu ⋅ Lihui Chen ⋅ Yangqiu Song ⋅ Winston Hu