LLMs and AGI: Is Attention All You Need?

  • Writer: Zeyneb K
  • Apr 14
  • 5 min read

What does it take to reach AGI, and what are the key capabilities and limitations of LLMs? The following write-up was originally a report for the Stanford course CS372: Artificial Intelligence for Reasoning, Planning, and Decision Making, examining the path to AGI and what it would take to elevate LLMs from pattern matching to systematic reasoning, validation, and planning.

The approach of artificial general intelligence (AGI) has become an unavoidable topic as the popularity and capabilities of large language models (LLMs) have rapidly grown. What was once a distant idea has shifted into anticipation of the inevitable. Yet perhaps we need to reconsider the path to AGI, and whether LLMs in particular are in fact the key to reaching it.

Let us first establish some ideas important to understanding how we might create AGI. AGI represents the point where AI reaches the intellectual abilities of a human, achieving human-comparable performance across a broad range of tasks. Human learning and intelligence can be characterized by efficient adaptability and continual learning, the ability to robustly extrapolate beyond existing experience, and creativity and diversity of thought. These are, of course, general abstractions of human cognition, but they remain important capabilities to build into AGI so that such systems can be integrated into our lives in a broad and useful way.


On the path to AGI, LLMs show great potential, enabled by (a) the parallelizability of the transformer architecture, which can leverage (b) the vast amount of language data on the internet, capturing world knowledge and human interaction, applied generally through (c) the flexibility of optimizing for 'next word prediction' across many uses. LLMs are empowered by their ability to form representations of human knowledge and to apply them generally across many applications, from simply predicting the next word to question answering, code generation, and story creation. These key elements drive LLMs' progress towards human intelligence and align with human learning in the way they acquire patterns from data with broad coverage of topics and tasks.
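
To make the training objective concrete, here is a minimal sketch of next-word prediction as a loss function, written in PyTorch. The `model` interface and function name are illustrative assumptions rather than any particular library's API; real LLM training adds tokenization, batching, and many optimization details.

```python
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Cross-entropy loss for predicting each token from its prefix.

    token_ids: LongTensor of shape (batch, seq_len).
    model: maps token ids to logits of shape (batch, seq_len - 1, vocab).
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)  # one prediction per prefix position
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten positions
        targets.reshape(-1),
    )
```

The same simple objective, applied at scale, supports all of the downstream uses above just by changing the prompt.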

Currently, the dominant narrative for enabling AGI has come down to scaling. With increasing compute, LLMs have demonstrated the capacity for “emergent abilities” (Wei et al., 2022): abilities such as chain-of-thought reasoning and in-context learning that appear in drastic shifts at different scaling thresholds. Scaling compute in training and inference continues to improve LLM performance, and with even further scaling comes the potential to push beyond what is currently possible.
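
As an illustration of one such emergent ability, chain-of-thought prompting simply adds a worked reasoning example to the prompt; smaller models gain little from it, while sufficiently large models improve markedly (Wei et al., 2022). The prompts below paraphrase the canonical example from that paper:

```python
# Direct prompting: ask for the answer immediately.
direct_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A:"
)

# Chain-of-thought prompting: demonstrate intermediate reasoning first.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\n"
    "A:"
)
```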

At this scale, LLMs have shown capabilities in learning new information. In-context learning allows models to perform well on new tasks presented at inference time. Studies in mesa-optimization have demonstrated the capacity of transformers to implicitly implement learning algorithms like gradient descent (Dai et al., 2023; Mahankali et al., 2024). Furthermore, scaling compute at inference by performing active parameter updates or by explicitly learning unsupervised algorithms usable at inference time (two different takes on 'test-time training'; Akyürek et al., 2025; Sun et al., 2024) also allows LLMs to generalize to new distributions. These all offer forms of learning that extend beyond training.
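
As a hedged sketch of the first flavor of test-time training (active parameter updates), one might take a few gradient steps on a task's demonstration pairs before answering its query. Everything here, including `loss_fn` and the data format, is an illustrative assumption rather than the exact procedure of the cited works:

```python
import copy
import torch

def test_time_adapt(model, demos, loss_fn, steps=5, lr=1e-4):
    """Briefly fine-tune a copy of the model on in-context demonstrations."""
    adapted = copy.deepcopy(model)          # leave the base model untouched
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        for x, y in demos:                  # (input, target) demonstration pairs
            opt.zero_grad()
            loss_fn(adapted(x), y).backward()
            opt.step()
    return adapted                          # use for this query, then discard
```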

Furthermore, by expanding beyond a single isolated LLM through tool use or agentic collaboration, more accurate and original generations can be achieved. Self-consistency checking, multi-agent debate, and other iterative generation approaches have been shown to improve responses and expand reasoning with more diverse thought and extended thinking (Xiang et al., 2025; Wang et al., 2023; Du et al., 2023). Post-training approaches in reinforcement learning have allowed further improvement through more exploratory, goal-directed optimization, boosting generalization (Chu et al., 2025). As context windows grow, models can pull in more relevant information, generate more complex lines of reasoning, and expand their capacity to learn and draw precise, well-supported chains of thought. Already, LLMs have matched and even surpassed human-level performance on many benchmarks measuring different reasoning and problem-solving tasks.
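
To illustrate, self-consistency (Wang et al., 2023) is straightforward to sketch: sample several reasoning chains at a nonzero temperature and return the majority answer. Here `generate` and `extract_answer` are placeholders for a model call and an answer parser:

```python
from collections import Counter

def self_consistent_answer(generate, extract_answer, prompt, n=10):
    """Majority vote over n independently sampled reasoning chains."""
    answers = [
        extract_answer(generate(prompt, temperature=0.7))  # one sampled path
        for _ in range(n)
    ]
    return Counter(answers).most_common(1)[0][0]           # most frequent answer
```

The diversity of sampled chains is exactly what makes the vote informative: independent paths that converge on the same answer provide a weak form of verification.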


Yet much of our conceptualization of the capabilities of LLMs is fundamentally flawed, in a way that can throw off the entire argument for LLMs as a path to AGI. The most crucial limitation of LLMs is how they learn from data: their performance relies on the training and testing distributions, and they learn flawed heuristics more indicative of memorization than true extrapolative reasoning.

This comes down to the problems of benchmarking: benchmarks depend on the distribution of their data, fail to distinguish which capabilities enable performance, provide averaged metrics, and use downstream task performance as a proxy for broader skills. Good performance on a “math reasoning” benchmark does not necessarily indicate the ability to robustly reason about math. For instance, Li et al. (2022) demonstrate that LLMs' apparent ability to generalize to novel settings through counterfactuals can be attributed to memorizing flawed heuristics based on lexical cues rather than robustly understanding the task. Model performance is deeply dependent on proximity to the training data (Yadlowsky et al., 2023; Wu et al., 2024). Thus systematic benchmarks and standard tasks cannot reliably measure robust reasoning ability. In fact, with recent work identifying unfaithfulness in reasoning paths (Arcuschin et al., 2025; Turpin et al., 2023), there are further questions about the true ability of LLMs to 'reason' beyond familiar knowledge. A model may achieve human-level performance on a range of standardized benchmarks, but if a slight distribution shift in the real world makes it fail on even simpler tasks, it cannot be considered robust or credited with human-level intellect. This is only worsened by evidence of data contamination and over-optimization to specific datasets.
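
One way to make benchmarks more diagnostic is to measure the gap between accuracy on original items and on counterfactual or lightly perturbed variants, in the spirit of the studies above. A minimal sketch, with `predict` and the dataset format as illustrative assumptions:

```python
def robustness_gap(predict, items):
    """items: list of (question, perturbed_question, answer) triples.

    Returns original minus perturbed accuracy. A near-zero gap suggests
    robust behavior; a large positive gap suggests lexical heuristics.
    """
    orig = sum(predict(q) == a for q, _, a in items) / len(items)
    pert = sum(predict(p) == a for _, p, a in items) / len(items)
    return orig - pert
```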

Furthermore, LLMs are limited in their adaptability. Their knowledge cannot be reliably updated: post-training approaches show instability, and there are limits to how context can be leveraged. Approaches such as memory modules, neuro-symbolic methods, and other auxiliary components can help address this, enabling better continual learning and systematic generalization.
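
As a hedged sketch of the memory-module idea: store facts as embeddings, retrieve the nearest ones at query time, and prepend them to the prompt. Updating knowledge then means editing the store rather than retraining the model. `embed` and `generate` are placeholder functions, not a specific system's API:

```python
import numpy as np

class MemoryStore:
    """Tiny external memory with dot-product retrieval over embeddings."""

    def __init__(self, embed):
        self.embed = embed      # text -> 1-D numpy vector
        self.keys, self.texts = [], []

    def add(self, text):        # continual updates are just appends
        self.keys.append(self.embed(text))
        self.texts.append(text)

    def retrieve(self, query, k=3):
        sims = np.stack(self.keys) @ self.embed(query)
        return [self.texts[i] for i in np.argsort(-sims)[:k]]

def answer(generate, memory, question):
    context = "\n".join(memory.retrieve(question))
    return generate(f"Context:\n{context}\n\nQuestion: {question}")
```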

Beyond the overestimation of LLMs' current capabilities, and what that implies about how those capabilities change with scale, are the growing bottlenecks presented by scaling itself. With more scale, we have reached a point of worryingly diminishing returns even at tremendously greater cost and compute (Fernandez et al., 2025). Furthermore, the availability of data is becoming a key bottleneck, especially given the data inefficiency of these models. Fundamental architectural changes are thus needed to address these challenges for substantial progress.
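
To see why returns diminish, consider the power-law shape reported in scaling-law studies, roughly L(C) = a * C^(-b) + c for compute C with an irreducible floor c. The constants below are made up purely for illustration:

```python
# Loss under an illustrative power-law fit: each 10x in compute buys a
# smaller absolute improvement as the curve flattens toward its floor.
a, b, c = 10.0, 0.05, 1.7   # made-up constants for illustration
prev = None
for compute in [1e21, 1e22, 1e23, 1e24]:
    loss = a * compute ** (-b) + c
    delta = "" if prev is None else f"  (improvement: {prev - loss:.3f})"
    print(f"compute {compute:.0e} -> loss {loss:.3f}{delta}")
    prev = loss
```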

Finally, there are key shortcomings of language modeling for general human intelligence. Language and cognition are distinct (Fedorenko & Varley, 2016), and capabilities in language do not always transfer robustly to other intellectual tasks. The modality of LLMs is narrow; while they can form some representation of concepts, their model of the world is limited to contextual relationships in text, with significant challenges in physical and temporal reasoning (Karvonen, 2025). Multimodal data from the physical world, gathered through sensorimotor interaction, has the potential to provide the rich information needed for more holistic world representations across many tasks.


Recognizing the fundamental challenges of LLMs is the prerequisite to addressing them on the way to AGI. We need to think not only about how our world will change once we get there, but also about the best path to getting there with robust, helpful, and positive models. Efficient mechanisms for learning from data, effective and diagnostic benchmarking, and multimodality are key areas to explore for more robust and capable reasoning models.
