Dwarkesh Podcast

The next big breakthrough will be AIs learning on the job

20 min

The podcast argues that the next major breakthrough in AI will come from enabling models to learn continuously on the job, moving beyond pre-training and reinforcement learning from verifiable rewards (RLVR). It critiques current approaches for their sample inefficiency and inability to learn from real-world, unstructured data, and explores techniques like on-policy self-distillation (OPSD) and 'dreaming' to enable continual learning from deployment. The episode highlights the tension between scaling current methods and the need for new architectures to achieve true general intelligence.

Summarized by Podsumo


✨ Key Takeaways

1. The key research bet is that training AIs on millions of verifiable tasks in diverse RL environments will lead to AGI, but this may not generalize to real-world, non-replayable domains like business or politics.

2. Current AIs are extremely sample inefficient during training (one millionth as sample efficient as humans), but proponents argue this is mitigated by amortizing costs across billions of sessions.

3. Continual learning (updating model weights from deployment) is argued to be necessary because in-context learning alone is not scalable and cannot capture the tacit knowledge learned from real-world interactions.

4. On-policy self-distillation (OPSD) is proposed as a method to distill insights from long sessions back into weights without requiring verifiable rewards, using dense per-token supervision.

5. A speculative future technique called 'dreaming' could allow AIs to simulate environments and practice skills offline, achieving orders of magnitude more samples in the same real time.

6. By 2027-2028, AIs might improve primarily from post-deployment experience, with each interaction making the model smarter for all users.
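
To make the "dense per-token supervision" in takeaway 4 concrete, here is a minimal Python sketch. It is not the method as described in the episode, only an illustration under stated assumptions: the "teacher" is assumed to be the same model scored with the long session in its context, the "student" is the same model without that context, and both are scored on tokens the student itself sampled. The function names (`log_softmax`, `per_token_kl`, `opsd_loss`), the toy logits, and the choice of KL(student || teacher) as the divergence are all hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def per_token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at each token position.

    Each argument is a list of per-position logit lists. Every position
    yields its own loss value, which is what makes the signal dense,
    unlike a single end-of-episode reward.
    """
    losses = []
    for s, t in zip(student_logits, teacher_logits):
        ls, lt = log_softmax(s), log_softmax(t)
        losses.append(sum(math.exp(a) * (a - b) for a, b in zip(ls, lt)))
    return losses

def opsd_loss(student_logits, teacher_logits):
    """Mean per-token KL: the scalar a training step would minimize (assumed)."""
    kls = per_token_kl(student_logits, teacher_logits)
    return sum(kls) / len(kls)

# Toy example: two positions, vocabulary of three tokens (made-up numbers).
# Position 0: teacher and student agree. Position 1: the teacher, having
# seen the session, prefers token 1 while the student is still uniform.
student = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]

kls = per_token_kl(student, teacher)
loss = opsd_loss(student, teacher)
```

In this toy case the loss is zero wherever the student already matches the teacher and positive only where the session context changes the teacher's prediction, so the update is driven by the tokens where the context actually matters and no verifiable reward is needed.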

💬 Notable Quotes

"Training is this one-time cost that is amortized across billions of sessions that a model will experience."
"OPSD preserves this property of supervised learning, where instead of slingshotting towards a teacher distribution, you only extract the knowledge that is necessary for you to achieve the same results."
"The way you get better at your job is not by recalling the transcript of every single thing that happened every day with perfect fidelity. Rather, it's by consolidating the handful of insights."