Artificial General Intelligence (AGI): Planning
In the last post, we covered how memory, planning, and reasoning are lacking in today's AI systems. Of course, "lacking" is a matter of degree, not an all-or-nothing switch. In this post, we will cover planning.
Planning
If you have recently used ChatGPT for trip-planning, you might think that it was actually pretty good at planning. Give it a set of preference, location, or duration parameters, and it can come up with a solid itinerary. I've heard anecdotally that GPT-4 was great at constraint optimization, making it perfect for planning a Disneyland trip where you get to meet all the characters at their varied showtimes. Why do we say that LLMs today cannot plan?
Planning means the ability to formulate a sequence of actions or strategies to accomplish goals. In order to formulate a plan, one needs the ability to look ahead and predict what's going to happen next.
LLMs do this in a rudimentary way: they predict the next token for a given context, and predicting the next token can approximate prediction in the real world.
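To make that concrete, here is a minimal sketch of next-token prediction using a small open model (GPT-2 via Hugging Face transformers, chosen here purely as an accessible stand-in for larger LLMs; the prompt echoes LeCun's paper example below):

```python
# A minimal sketch of next-token prediction. GPT-2 is used only as a small,
# accessible stand-in for larger LLMs.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "If I let go of the paper with one hand, the paper will"
inputs = tokenizer(prompt, return_tensors="pt")

# The model repeatedly predicts the most likely next token given the context so far.
outputs = model.generate(**inputs, max_new_tokens=15, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Whatever continuation comes out is a function of the patterns in the training data, which is exactly the limitation LeCun points to next.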
However, AI scientists like Yann LeCun claim it's not quite the same. Here's his illustrative example (emphasis added):
Holding up a sheet of paper, LeCun demonstrated ChatGPT’s limited understanding of the world... The bot wouldn’t know what would happen if he let go of the paper with one hand, LeCun promised. Upon consultation, ChatGPT said the paper would “tilt or rotate in the direction of the hand that is no longer holding it.” For a moment—given its presentation and confidence—the answer seemed plausible. But the bot was dead wrong... because people rarely describe the physics of letting go of a paper in text. (Perhaps until now). — Observer
Why does an LLM, great at an astounding number of tasks, fail here? If you ask ChatGPT the same question today, it will answer correctly—because LeCun's example and others like it have likely been added to the dataset. That is the key. An LLM's predictions today are based on pattern matching over the training data and examples we have provided. When scenarios unlike the training data come up, the model can fail spectacularly.1 The model interpolates well within the data it was given, but lacks the deeper understanding needed to extrapolate beyond those examples.
The latter is needed if we want AGI to be able to plan beyond plans it has seen before—or invent new actions and strategies for a given goal, beyond actions and strategies it has seen before. In LeCun's words (emphasis added):
It would feel like you have a PhD sitting next to you, but it's not a PhD you have next to you. It's a system with gigantic memory and retrieval ability, not a system that can invent solutions to new problems, which is really what a PhD is — Yann LeCun
Interpolation vs. Extrapolation
Hassabis expresses this distinction between interpolation and extrapolation elegantly, using the examples of LLMs and AlphaGo (emphasis added):
I have three categories of originality or creativity. The most basic, mundane form is just interpolation, which is averaging what you see. So, if I said to a system, "Come up with a new picture of a cat," and it's seen a million cats and it produces just some kind of average of all the ones it's seen, in theory that's an original cat because you won't find the average in the specific examples. But it's pretty boring. It's not really very creative, I wouldn't call that creativity...
The next level is what AlphaGo exhibited, which is extrapolation. So, here's all the games humans have ever played..., and now it comes up with a new strategy that no human has ever seen before—that's move 37, revolutionizing Go even though we've played it for thousands of years...
But there's one level above that that humans can do, which is invent Go... If I specify to an abstract level: Takes five minutes to learn the rules but many lifetimes to master, it's beautiful aesthetically, encompasses some mystical part of the universe in it... you can play a game in a human afternoon in 2 hours. That would be a high level specification of Go, and then somehow the system's got to come up with a game that's as elegant and as beautiful and as perfect as Go. Now, we can't do that now. — Demis Hassabis
Extrapolation is the next level of creativity, where AI invents something new. The key to unlocking it is search, where AI explores actions or strategies not previously considered and tests their performance. That's where invention takes place. AI reached superhuman performance in Go because of search. In Hassabis' words (emphasis added):
[The LLM] certainly won't come up with original moves. For that, you need the search component to get you beyond what the model knows about—which is mostly summarizing existing knowledge—to some new part of the tree of knowledge. You can use search to get beyond what the model currently understands, and that's where you can get new ideas like move 37... It was searching Go moves beyond what the model knew — Demis Hassabis
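To illustrate the intuition (this is a toy, not AlphaGo's actual algorithm), here is a sketch where a "policy prior" stands in for what the model already knows, and a separate evaluator stands in for testing candidates. Both are just random numbers here; the point is that search over candidates the prior ranks low can surface a higher-value move the prior alone would never play:

```python
import random

# Toy sketch: search can go beyond a model's priors.
random.seed(0)

moves = [f"move_{i}" for i in range(50)]
policy_prior = {m: random.random() for m in moves}   # what the model already favors
true_value = {m: random.random() for m in moves}     # what actually works, unknown to the prior

# Without search: play the model's single favorite move.
no_search = max(moves, key=policy_prior.get)

# With search: also evaluate candidates the prior ranks low, and keep the best one found.
candidates = sorted(moves, key=policy_prior.get, reverse=True)[:10] + random.sample(moves, 10)
with_search = max(candidates, key=true_value.get)

print(no_search, round(true_value[no_search], 2))
print(with_search, round(true_value[with_search], 2))  # usually better; the evaluator, not the prior, picks it
```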
Reinforcement Learning
The method behind that search was reinforcement learning (RL). RL improves AI by guiding its search with a reward model. A reward model is a "What is good?" predictor: it aligns AI with human preferences by telling the AI whether an action or strategy is good. For example, winning a game of Go is good, and the reward model will give a positive reward. As the AI searches and produces different actions and strategies, they are evaluated against the reward model, and the AI improves by producing actions and strategies that yield better rewards. Reward models and RL have been crucial in unlocking the extrapolation phase for LLMs.
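Here is a minimal toy sketch of that loop. The strategies and the hard-coded reward model are made up for illustration: sample an action, score it with the reward model, and nudge the policy toward actions that score above average.

```python
import math
import random

# Toy RL loop: sample, score with a reward model, reinforce above-average actions.
random.seed(0)

actions = ["strategy_a", "strategy_b", "strategy_c"]
logits = {a: 0.0 for a in actions}  # the policy's preferences

def reward_model(action):
    """'What is good?' predictor. Here just a fixed toy preference."""
    return {"strategy_a": 0.1, "strategy_b": 0.9, "strategy_c": 0.4}[action]

def sample(logits):
    weights = [math.exp(v) for v in logits.values()]
    return random.choices(list(logits), weights=weights)[0]

learning_rate = 0.5
baseline = sum(reward_model(a) for a in actions) / len(actions)
for step in range(200):
    action = sample(logits)                                 # search: try an action
    reward = reward_model(action)                           # evaluate: how good was it?
    logits[action] += learning_rate * (reward - baseline)   # reinforce above-average actions

print(max(logits, key=logits.get))  # the policy converges toward "strategy_b"
```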
RL has been used in LLM training for a while. LLMs today already use reward models for evaluating interpolations, as in Reinforcement Learning from Human Feedback (RLHF), which trains a reward model based on human feedback on LLM prompts and responses. However, RLHF has mostly been applied as an alignment finishing touch, to improve a pre-trained model.
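For the curious, RLHF reward models are commonly trained with a pairwise preference loss: given a human-preferred response and a rejected one, the loss pushes the preferred response's reward higher. A sketch in PyTorch, with placeholder scores standing in for a real reward model's outputs over (prompt, response) pairs:

```python
import torch
import torch.nn.functional as F

# Placeholder reward scores; in practice these come from a reward model
# evaluating (prompt, response) pairs.
reward_chosen = torch.tensor([1.2, 0.3])    # r(prompt, human-preferred response)
reward_rejected = torch.tensor([0.4, 0.9])  # r(prompt, rejected response)

# Pairwise (Bradley-Terry) loss: -log sigmoid(r_chosen - r_rejected)
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss.item())
```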
The release of DeepSeek-R1 reinvigorated interest in using RL to train LLMs writ large. Their innovation was applying RL to an LLM without a preliminary stage in which humans provide curated prompt-and-answer pairs to teach the model.2 Instead, the model was simply given a large dataset of math problems and rewarded for correct solutions. This minimal setup was sufficient for the model to autonomously discover chain-of-thought reasoning, self-verification, and reflection, with no explicit prompt to do so. The results rivaled other frontier models.
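The reward in this kind of setup can be rule-based rather than learned. Here is a sketch of a correctness reward, assuming for illustration that the model marks its final answer on a "####" line (the exact format DeepSeek used is not claimed here):

```python
import re

# Rule-based reward: 1.0 if the parsed final answer matches the known solution, else 0.0.
def correctness_reward(model_output: str, ground_truth: str) -> float:
    match = re.search(r"####\s*(-?\d+(?:\.\d+)?)", model_output)
    if match is None:
        return 0.0  # no parseable final answer
    return 1.0 if match.group(1) == ground_truth else 0.0

print(correctness_reward("Reasoning... 12 * 4 = 48\n#### 48", "48"))  # 1.0
print(correctness_reward("Reasoning... #### 50", "48"))               # 0.0
```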
DeepSeek sparked excitement among researchers about using "pure RL" methods to improve LLMs. For tasks like planning or math, where clean reward models can be defined, RL can help AI improve—and, pivotally, improve beyond existing data. This direction is promising on the path towards ASI, where the aim is superintelligence. Reinforcement learning helps solve the "search" problem, helping LLMs extrapolate.
More ways to search
Reinforcement learning is only one method among many for helping models search for novel solutions. Google DeepMind recently released AlphaEvolve, which can propose new solutions to complex computational problems, like matrix multiplication. Their method uses evolutionary algorithms that iteratively improve outputs based on feedback from automated evaluators. The team preferred evolutionary methods over RL for better interpretability:
AlphaEvolve was chosen over a deep reinforcement learning approach because its code solution not only leads to better performance, but also offers clear advantages in interpretability, debuggability, predictability, and ease of deployment—essential qualities for a mission-critical system. — AlphaEvolve white paper
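The evolutionary pattern itself is simple. Here is a toy sketch (not AlphaEvolve itself): keep a population of candidate solutions, mutate them, and let an automated evaluator decide which survive. The "solution" here is just a vector scored by its distance to a hidden target.

```python
import random

# Toy evolutionary loop: mutate candidates, keep the ones the evaluator scores best.
random.seed(0)
TARGET = [3.0, -1.0, 2.5]

def evaluate(candidate):
    """Automated evaluator: higher is better (negative squared error to the target)."""
    return -sum((c - t) ** 2 for c, t in zip(candidate, TARGET))

def mutate(candidate):
    return [c + random.gauss(0, 0.3) for c in candidate]

population = [[0.0, 0.0, 0.0] for _ in range(20)]
for generation in range(200):
    scored = sorted(population, key=evaluate, reverse=True)
    parents = scored[:5]  # keep the best candidates
    population = parents + [mutate(random.choice(parents)) for _ in range(15)]

best = max(population, key=evaluate)
print(best, evaluate(best))  # converges toward TARGET
```

In a system like AlphaEvolve the candidates are programs and the evaluator is an automated benchmark, but the loop has the same shape.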
The ways to search are numerous, and more breakthroughs are sure to come.
Conclusion
Planning is key to building agents and agentic workflows. I suspect that's why so much emphasis is placed on it. With agents, you want to be able to give them a goal, e.g. "Book a flight to Hawaii," and have them accomplish it. Planning designs the actions and strategies to get there. The path to good planning is the path to good agents. OpenAI talks about using end-to-end reinforcement learning to build agents.
Taking a broader view, planning is also a lens that helps us understand creativity better—creativity as a process of exploring and evaluating actions and strategies against goals.
In the next post, I will discuss reasoning.
One of my favorite examples of a model being only as good as its training data is this generated image of salmon swimming down a river.↩
That preliminary stage is called supervised fine-tuning (SFT).↩