Ethan He discusses the evolution from video generation to video agents, emphasizing that the most significant advances in generative media are now coming from language models rather than diffusion techniques. He shares insights from building Grok Imagine at xAI and his vision for real-time, interactive, long-form video generation as the next frontier.
Summarized by Podsumo
Ethan argues that the majority of recent improvements in video and image generation stem from language model capabilities (prompt rewriting, agentic reasoning) rather than diffusion model innovations.
Grok Imagine went from zero to launch in three months, driven by a small, tightly knit team whose strong infrastructure enabled rapid iteration and bug-fixing.
Real-time, interactive, long-horizon video generation (world models) faces challenges in temporal compression, context management, and inference latency, with heuristic solutions like FramePack emerging.
Video agents, in which LLM-based reasoning harnesses diffusion models as tools, are predicted to reach production-grade quality by the end of the year, unlocking enterprise adoption.
Ethan transitioned from computer vision to video generation to language models, reflecting a conviction that language intelligence now drives progress across modalities.
"[The objective] was that you have to describe the video as detailed as possible such that a blind person, here's a blob of text, can reconstruct what the video is like."