Eric Jang explains how to build AlphaGo from scratch, breaking down the core concepts of Monte Carlo Tree Search (MCTS), policy and value networks, and self-play reinforcement learning. He also discusses the profound implications of how a small neural network can amortize an intractable search problem, and shares insights into using LLMs for automated research.
Summarized by Podsumo
AlphaGo's core breakthrough was making the search problem tractable with two neural networks: a value network that estimates win probability and a policy network that suggests good moves.
MCTS uses a four-step iterative process (selection, expansion, evaluation, backup) to build a sparse tree, guided by the neural network's predictions to focus on promising moves.
The training algorithm distills the search process: the policy network is trained to imitate the improved move distribution that MCTS produces, so a stronger network yields a stronger search, which in turn yields better training targets.
Jang trained a strong Go bot with 40x less compute than earlier efforts, which he presents as an instance of the 'bitter lesson' of scaling and algorithmic progress.
He compares the sample efficiency of AlphaGo's approach with modern LLM RL, highlighting the challenges of long-horizon credit assignment in the latter.
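The four MCTS steps and the distillation target described above can be sketched in a few dozen lines. This is a minimal illustration, not Jang's implementation. It assumes a toy stone-taking game in place of Go (remove 1 or 2 stones; whoever takes the last stone wins), a hypothetical `stub_network` that returns uniform priors and a neutral value where a trained policy/value network would sit, and an arbitrary exploration constant `c_puct`. All names here are made up for the example.

```python
import math

ACTIONS = (1, 2)  # toy game: remove 1 or 2 stones; taking the last stone wins


def legal_actions(stones):
    return [a for a in ACTIONS if a <= stones]


def stub_network(stones):
    """Hypothetical stand-in for the policy/value network: uniform priors, value 0."""
    acts = legal_actions(stones)
    return {a: 1.0 / len(acts) for a in acts}, 0.0


class Node:
    def __init__(self, prior):
        self.prior = prior      # policy prior for the move leading here
        self.visits = 0
        self.value_sum = 0.0    # summed values, from the view of the player to move here
        self.children = {}      # action -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0


def select_child(node, c_puct=1.5):
    # PUCT-style score. child.q() is from the opponent's point of view, so negate it.
    def score(item):
        _, child = item
        u = c_puct * child.prior * math.sqrt(node.visits) / (1 + child.visits)
        return -child.q() + u
    return max(node.children.items(), key=score)


def run_mcts(root_stones, num_simulations=200, network=stub_network):
    root = Node(prior=1.0)
    for _ in range(num_simulations):
        node, stones, path = root, root_stones, [root]
        # 1. Selection: walk down the tree until reaching a node with no children.
        while node.children:
            action, node = select_child(node)
            stones -= action
            path.append(node)
        # 2. Expansion and 3. Evaluation.
        if stones == 0:
            value = -1.0  # player to move has lost: the opponent took the last stone
        else:
            priors, value = network(stones)
            for a, p in priors.items():
                node.children[a] = Node(prior=p)
        # 4. Backup: propagate the value up the path, flipping the sign at each ply.
        for n in reversed(path):
            n.visits += 1
            n.value_sum += value
            value = -value
    total = sum(c.visits for c in root.children.values())
    # The normalized visit counts are the "improved" policy used as a training target.
    return {a: c.visits / total for a, c in root.children.items()}


def cross_entropy(target, predicted):
    """Distillation loss: push the network's policy toward the MCTS visit distribution."""
    return -sum(t * math.log(predicted[a]) for a, t in target.items() if t > 0)


if __name__ == "__main__":
    pi = run_mcts(4)
    network_policy, _ = stub_network(4)
    print("MCTS policy:", pi)
    print("loss vs. network policy:", cross_entropy(pi, network_policy))
```

Even with an uninformed stub network, the search on four stones should shift most visits toward taking one stone, which leaves the opponent in a losing position. Training the network to reduce `cross_entropy` against that visit distribution is the "supervised learning on improved labels" loop from the summary, shrunk to a toy scale.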
"A 10 layer neural network, able to amortize and approximate a nearly intractable search problem to very high fidelity... makes me wonder if our understanding of problems like P equals NP are incomplete."
— Eric Jang
"The problem with Go and Chess is that the other player is always trying to do some shit. Things can drift off and you always want to be able to correct back to your winning condition."
— Eric Jang
"In AlphaGo, you never have to initialize at a 0% success rate and solve the exploration problem... it's just supervised learning on improved labels."
— Eric Jang