Latent Space: The AI Engineer Podcast

NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light" — Nader Khalil (Brev), Kyle Kranen (Dynamo)

1h 24min

This episode of Latent Space features Nader Khalil (Brev) and Kyle Kranen (Dynamo) from NVIDIA, discussing the evolution of AI engineering. They delve into NVIDIA's developer-centric culture, the acquisition of Brev for simplified GPU access, and the "Speed of Light" philosophy. A core focus is Dynamo, NVIDIA's data center-scale inference engine, which optimizes large language model inference through techniques like disaggregation and specialized scaling, alongside a deep dive into the security and future of AI agents.
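
As a rough illustration of why disaggregation can help, the sketch below estimates arithmetic intensity (FLOPs per byte of weights read) for a single dense layer. Prefill pushes many prompt tokens through the weights in one pass, while decode handles one new token per sequence. The `Accelerator` figures, the layer width, and the helper names are hypothetical placeholders, not Dynamo APIs or NVIDIA specifications, and the estimate ignores batching, attention, and KV cache traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical accelerator; the numbers used below are placeholders, not vendor specs."""
    peak_flops: float     # peak compute, FLOP/s
    mem_bandwidth: float  # memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        # FLOPs per byte above which a kernel is limited by compute rather than memory.
        return self.peak_flops / self.mem_bandwidth


def arithmetic_intensity(tokens_per_pass: int, d_model: int, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weights read for one d_model x d_model dense layer."""
    flops = 2 * tokens_per_pass * d_model * d_model  # one multiply-add per weight per token
    bytes_moved = bytes_per_param * d_model * d_model  # weights are read once per pass
    return flops / bytes_moved


def classify(tokens_per_pass: int, d_model: int, hw: Accelerator) -> str:
    intensity = arithmetic_intensity(tokens_per_pass, d_model)
    return "compute-bound" if intensity >= hw.ridge_point else "memory-bound"


# Assumed hardware: 1 PFLOP/s and 2 TB/s, giving a ridge point of 500 FLOPs/byte.
hw = Accelerator(peak_flops=1e15, mem_bandwidth=2e12)

prefill_regime = classify(tokens_per_pass=4096, d_model=8192, hw=hw)  # whole prompt at once
decode_regime = classify(tokens_per_pass=1, d_model=8192, hw=hw)      # one token per step
```

Under these assumptions the two phases fall on opposite sides of the ridge point, which is the motivation for running them on separately sized worker pools.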

Summarized by Podsumo


✨ Key Takeaways

1. NVIDIA's Developer Experience & Culture: The company fosters a "pickup basketball" environment that lets engineers pursue their passions, and it invests in "zero billion dollar businesses" to create markets over the long term. The acquisition of Brev, which simplifies GPU access, is one example.

2. "Speed of Light" (S.O.L.) Philosophy: NVIDIA uses S.O.L. to define theoretical limits and create urgency, pushing teams to understand root causes and prioritize for rapid delivery, even for complex hardware like DGX Spark.

3. Dynamo for Data Center Scale Inference: Dynamo is an inference engine that optimizes LLM serving by separating the prefill (compute-bound) and decode (memory-bound) phases, so each can be scaled independently, improving the trade-off between cost, quality, and latency.

4. AI Agent Security & Capabilities: Agents can access files, access the internet, and execute custom code. A critical security insight is to grant an agent at most two of these three capabilities (e.g., no internet access if it can access files and write code) to prevent vulnerabilities.

5. Hardware-Model-Context Co-Design & "Unhobblers": The discussion highlights how model architectures (like Kimi or DeepSeek) are increasingly co-designed with hardware capabilities, and how "unhobblers" (scientific discoveries) can drastically improve scaling, for example by reducing the KV cache burden for longer contexts.
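
The two-of-three capability rule from takeaway 4 can be sketched as a small policy check. This is a minimal illustration, not an API from Brev, Dynamo, or any agent framework: the `Capability` flag, `is_allowed`, and `grant` are hypothetical names chosen for the example.

```python
from enum import Flag


class Capability(Flag):
    """Hypothetical capability flags for an agent sandbox."""
    FILE_ACCESS = 1
    INTERNET_ACCESS = 2
    CODE_EXECUTION = 4


MAX_CAPABILITIES = 2  # the "two out of three" rule


def count_capabilities(requested: Capability) -> int:
    # Each capability is one bit, so the number of set bits is the number granted.
    return bin(requested.value).count("1")


def is_allowed(requested: Capability) -> bool:
    """Return True if the request keeps the agent to at most two capabilities."""
    return count_capabilities(requested) <= MAX_CAPABILITIES


def grant(requested: Capability) -> Capability:
    """Return the requested capabilities, or raise if all three are combined."""
    if not is_allowed(requested):
        raise PermissionError(
            "refusing to combine file access, internet access, and code execution"
        )
    return requested


# Files plus code execution is fine as long as internet access stays off.
offline_coder = grant(Capability.FILE_ACCESS | Capability.CODE_EXECUTION)

# Requesting all three is rejected.
everything = (
    Capability.FILE_ACCESS | Capability.INTERNET_ACCESS | Capability.CODE_EXECUTION
)
everything_allowed = is_allowed(everything)
```

Checking the combination at grant time, rather than per capability, is the point of the rule: each capability is acceptable alone, and the risk comes from holding all three together.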

💬 Notable Quotes

"If you can access your files and you can write custom code, you don't want internet access because that's one of the safety vulnerabilities, right?"

— Nader Khalil
"Jensen says we're completely happy investing in zero billion dollar markets. We don't care if this creates revenue. It's important for us to know about this market. We think it will be important in the future."

— Kyle Kranen
"This is the year system as model, right? We're like instead of having like a single model be a thing. You have a system of models and components that are working together to like emulate the black box model."

— Kyle Kranen