Artificial General Intelligence (AGI): Reasoning

Reasoning

This past year is when we got reasoning models. The earliest release was OpenAI's o1 model, followed by DeepSeek's R1 ("R" for reasoning). They are nominally "reasoning" models.

Learning to reason with LLMs - title of OpenAI's o1 press release

System 1 vs System 2 Thinking

What does it mean when we say R1 and o1 can reason? Practically, reasoning for today's large language models means thinking for longer.

An analogy comes from cognitive science: System 1 vs System 2 thinking, from Daniel Kahneman's Thinking, Fast and Slow. System 1 is fast, automatic, and intuitive; System 2 is slow, deliberate, and effortful.

Models before reasoning models tended to flounder at mathematics and similar tasks. Humans excel at these tasks by engaging in System 2 thinking. Applying the same logic to models, reasoning models that "spend more time thinking" did indeed improve significantly on reasoning benchmarks like competitive programming questions (Codeforces), the AIME (a qualifier for the USA Math Olympiad), and benchmarks of physics, biology, and chemistry problems (GPQA).

Test-time Compute

Technically, thinking for longer means using more test-time compute (TTC).

TTC refers to the amount of computational power used by an AI model when it is generating a response or performing a task after it has been trained. In simple terms, it's the processing power and time required when the model is actually being used, rather than when it is being trained. - Hugging Face

OpenAI shares the observation in their o1 press release:

We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them. - OpenAI

In other words, the way we unlocked "reasoning" was to increase the amount of compute the AI uses to generate a response. Alongside the other scaling laws for reaching AGI, like scaling train-time compute (more compute to train the AI), we now have one more vector to scale: test-time compute. The path to AGI clearly looks like lots and lots of compute.
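
To make "scaling test-time compute" concrete, here is a minimal sketch of one well-known way to spend more compute at inference time: sample several independent answers and take a majority vote (often called self-consistency). To be clear, this is not how o1 or R1 work internally, and `ask_model` below is a hypothetical stand-in for whatever LLM API you use; the point is only that `num_samples` is a knob that trades extra test-time compute for (often) better answers.

```python
from collections import Counter


def ask_model(question: str) -> str:
    """Hypothetical stand-in for a call to an LLM API.

    Assumes sampling with a nonzero temperature, so repeated calls
    can return different answers to the same question.
    """
    raise NotImplementedError("plug in your preferred LLM client here")


def answer_with_more_compute(question: str, num_samples: int = 16) -> str:
    """Spend more test-time compute by sampling many answers and voting."""
    # Each sample is an independent generation, so total inference
    # compute grows roughly linearly with num_samples.
    answers = [ask_model(question) for _ in range(num_samples)]
    # Majority vote over the sampled final answers (self-consistency).
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer


# Example usage (hypothetical): more samples, more test-time compute.
# answer_with_more_compute("What is 17 * 24?", num_samples=32)
```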

Is it sufficient to just throw more compute at the problem?

The reigning opinion is yes, with some mild caveats.

I want to end this trilogy with a fairly unpopular answer, "no," espoused by Yann LeCun, whom I've quoted before in the series. LeCun believes we are still key architectural improvements and other innovations away from AGI.

For example, he thinks the Joint Embedding Predictive Architecture (JEPA), or something similar, is essential for AGI to acquire common sense about the physical world. In a world where multiple outputs are possible, generating an output identical to the test solution matters less than generating a "likely" or "valid" output. For example, if I wave at somebody, there is a world of possibilities for how they might react. For AI to have "common sense," it need not predict the future exactly, but it should know the difference between outputs like the person waving back, smiling, or looking angry, and an output where the person grows a third nose. That last option runs counter to common sense. The crux of JEPA is that instead of calculating the loss on the actual outputs, the loss is calculated at a higher-order representation level.
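
To illustrate that last point, here is a toy sketch of a JEPA-style objective, a simplification of my own rather than LeCun's actual implementation: a predictor is trained to match the embedding of the target produced by a separate encoder, so the loss lives in representation space instead of output (e.g., pixel) space. Real variants such as I-JEPA add masking, an EMA-updated target encoder, and safeguards against representation collapse, all of which are omitted here.

```python
import torch
import torch.nn as nn

EMBED_DIM = 128
INPUT_DIM = 784  # toy flattened input; real JEPA models use richer encoders

context_encoder = nn.Linear(INPUT_DIM, EMBED_DIM)  # encodes the observed context x
target_encoder = nn.Linear(INPUT_DIM, EMBED_DIM)   # encodes the target y
predictor = nn.Linear(EMBED_DIM, EMBED_DIM)        # predicts y's embedding from x's


def jepa_style_loss(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Loss between predicted and actual *representations*, not raw outputs."""
    s_x = context_encoder(x)
    with torch.no_grad():  # keep the target embedding fixed in this toy setup
        s_y = target_encoder(y)
    s_y_pred = predictor(s_x)
    return nn.functional.mse_loss(s_y_pred, s_y)


# Toy usage: x might represent "I wave at somebody" and y "what they do next".
# The model only has to predict a plausible *representation* of the reaction,
# not an exact, pixel-perfect future.
x = torch.randn(8, INPUT_DIM)
y = torch.randn(8, INPUT_DIM)
loss = jepa_style_loss(x, y)
loss.backward()
```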

Many of the voices claiming that AGI is imminent define AGI in terms of economically valuable tasks and train AI for those tasks. It's possible that in those cases, a higher-order representation like JEPA's is not crucial. Or perhaps increasing compute and data is sufficient to overcome any deficiencies.

The jury is still out as we look ahead to the next few exciting years.

Published on July 21, 2025