Oral
Sat Apr 25 11:15 AM -- 11:25 AM (PDT)
Reliable Weak-to-Strong Monitoring of LLM Agents
[OpenReview]
Oral
Sat Apr 25 11:27 AM -- 11:37 AM (PDT)
CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
[OpenReview]
Oral
Sat Apr 25 11:39 AM -- 11:49 AM (PDT)
OpenApps: Simulating Environment Variations to Measure UI Agent Reliability
[OpenReview]
Oral
Sat Apr 25 11:51 AM -- 12:01 PM (PDT)
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
[OpenReview]
Oral
Sat Apr 25 12:03 PM -- 12:13 PM (PDT)
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering
[OpenReview]
Oral
Sat Apr 25 12:15 PM -- 12:25 PM (PDT)
WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality
[OpenReview]