This episode of Latent Space features Zico Kolter and Matt Fredrikson from Gray Swan, a company focused on AI safety and security. They discuss the unique vulnerabilities of AI systems compared to traditional software, particularly around adversarial attacks like prompt injection in agents. The conversation highlights the need for dedicated AI security solutions, including their red-teaming platform (Shade) and guardrail model (Signal), and explores the evolving landscape of AI red teaming, the 'lethal trifecta' of risks, and the potential for AI to automate its own security and interpretability research.
Summarized by Podsumo
Gray Swan's dual approach combines automated red-teaming (Shade) to find vulnerabilities and a guardrail model (Signal) to filter malicious inputs or outputs, serving both frontier labs and enterprises.
AI models do not inherently become safer with increased scale; safety requires explicit training, and red-teaming models like Shade are becoming more effective than human red teamers at finding jailbreaks.
The 'lethal trifecta' of prompt injection risk consists of three conditions: ingesting untrusted data, having access to sensitive information, and being able to exfiltrate it. All three must be present for a real security incident.
Open-source agents like OpenClaw present significant security challenges, with Gray Swan finding numerous breaks across various use cases. Computer use agents are particularly vulnerable.
The future of AI security may involve agents writing secure code and automating interpretability research, effectively scaling security efforts that are currently too labor-intensive for humans.
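The 'lethal trifecta' takeaway above can be sketched as a simple check over an agent's capabilities. This is a minimal illustration, not anything described in the episode or taken from Gray Swan's products: the `AgentCapabilities` class, its field names, and both helper functions are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    """Hypothetical description of what an agent deployment can do."""
    ingests_untrusted_data: bool    # e.g. reads web pages, emails, tool output
    accesses_sensitive_info: bool   # e.g. private files, credentials
    can_exfiltrate: bool            # e.g. outbound requests, sending messages


def has_lethal_trifecta(caps: AgentCapabilities) -> bool:
    """Return True only when all three conditions hold at once."""
    return (
        caps.ingests_untrusted_data
        and caps.accesses_sensitive_info
        and caps.can_exfiltrate
    )


def missing_conditions(caps: AgentCapabilities) -> list[str]:
    """List the conditions that are absent for this deployment."""
    checks = {
        "untrusted data": caps.ingests_untrusted_data,
        "sensitive information": caps.accesses_sensitive_info,
        "exfiltration path": caps.can_exfiltrate,
    }
    return [name for name, present in checks.items() if not present]


# A browsing agent with access to private files and outbound network access
risky = AgentCapabilities(True, True, True)
# The same agent with outbound channels removed
contained = AgentCapabilities(True, True, False)

print(has_lethal_trifecta(risky))       # all three conditions present
print(has_lethal_trifecta(contained))   # one condition removed
print(missing_conditions(contained))    # which condition is absent
```

Under this framing, removing any one condition prevents the full exfiltration scenario, though the agent may still misbehave in other ways; the check is a coarse triage aid, not a guarantee of safety.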
"They're not just different; it's a different form of intelligence. And that difference is actually often brought out to a large degree by things like adversarial attacks and red teaming."
"Zico Kolter"
"You have a trade-off between usability and how much power [the] agent has versus security. Our goal, with Shade to assess these vulnerabilities [and] with Signal to protect [against them], is to shift that point up and to the right."
"Matt Fredrikson"
"When you have computer use, and when you have OpenClaw, man, you can break those things. This is what makes these things useful."
"Zico Kolter"