Why AI can’t spell ‘strawberry’

How many times does the letter “r” appear in the word “strawberry”? Ask advanced AI systems like GPT-4o and Claude, and the answer often comes back: twice. This simple question exposes a surprising shortcoming and a fundamental truth: these models, despite their sophistication, don’t think like humans.

Large language models (LLMs) such as GPT-4o are known for their ability to generate essays, solve complex equations, and analyze vast amounts of data in mere seconds. Yet they sometimes fail in spectacular ways, like miscounting the letters in a word. This is a reminder that these systems, however impressive, are not truly intelligent in the way humans understand intelligence.

The Limits of Large Language Models

At their core, most LLMs are built on a type of deep learning architecture known as transformers. Transformers process text by breaking it down into “tokens,” which can represent full words, syllables, or even single letters, depending on the model’s design.
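
To see what this looks like in practice, here is a minimal sketch using tiktoken, OpenAI’s open-source tokenizer library; the exact split depends on which vocabulary a given model uses.

    import tiktoken

    # Load a token vocabulary used by several OpenAI models.
    enc = tiktoken.get_encoding("cl100k_base")

    token_ids = enc.encode("strawberry")
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]

    print(token_ids)  # a short list of integer IDs
    print(pieces)     # the sub-word chunks the model actually receives

Whatever the split turns out to be, the model operates on those integer IDs, not on the characters they happen to spell.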

As Matthew Guzdial, an AI researcher and assistant professor at the University of Alberta, explains, “LLMs are based on this transformer architecture, which notably is not actually reading text. When you input a prompt, it’s translated into an encoding.” In other words, the model never sees text as a sequence of individual letters. It may recognize that the tokens “straw” and “berry” combine to form “strawberry,” but it does not represent the word as the letter sequence “s,” “t,” “r,” “a,” “w,” “b,” “e,” “r,” “r,” “y.”
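
Ordinary string code, by contrast, works at the character level, which is why letter counting is trivial outside a language model. A toy illustration:

    word = "strawberry"

    # Character-level view: counting letters is easy here.
    print(word.count("r"))  # 3
    print(list(word))       # ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']

    # An LLM never receives this character list; it sees only opaque
    # token IDs, which is why letter-level questions can trip it up.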

This structural limitation is not just a minor flaw; it’s embedded in the very architecture that allows LLMs to perform their tasks. So, when an AI stumbles over spelling “strawberry,” it’s not a sign of incompetence but rather a reflection of how these models are built.

The Challenge of Tokenization

The process of breaking down text into tokens isn’t straightforward, and there’s no perfect method. As Sheridan Feucht, a PhD student studying LLM interpretability, notes, “It’s kind of hard to get around the question of what exactly a ‘word’ should be for a language model, and even if we got human experts to agree on a perfect token vocabulary, models would probably still find it useful to ‘chunk’ things even further.”
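
One widely used chunking scheme, byte-pair encoding (BPE), builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols. Below is a minimal sketch of that training loop, run on a toy corpus rather than the web-scale data real tokenizers learn from.

    from collections import Counter

    def bpe_merges(corpus, num_merges):
        """Learn BPE-style merges: repeatedly fuse the most frequent
        adjacent pair of symbols across the corpus."""
        words = [list(w) for w in corpus]  # start from single characters
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for w in words:
                pairs.update(zip(w, w[1:]))
            if not pairs:
                break
            best = max(pairs, key=pairs.get)  # most frequent adjacent pair
            merges.append(best)
            fused = best[0] + best[1]

            def merge(w):
                out, i = [], 0
                while i < len(w):
                    if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                        out.append(fused)
                        i += 2
                    else:
                        out.append(w[i])
                        i += 1
                return out

            words = [merge(w) for w in words]
        return merges, words

    merges, segmented = bpe_merges(["strawberry", "straw", "berry"], 5)
    print(merges)     # which pairs were fused, in what order
    print(segmented)  # how each word ends up chunked

Note how the chunks that emerge are driven purely by frequency, not by any notion of what a “word” or a letter is.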

This issue becomes even more complex when considering multiple languages. Some languages, like Chinese or Japanese, do not use spaces to separate words, complicating tokenization further. As Yennie Jun from Google DeepMind discovered, some languages require up to ten times more tokens than English to convey the same meaning, demonstrating how challenging it is for AI models to handle diverse linguistic structures.
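
The gap is easy to observe with the same tiktoken sketch from above; the sample strings here are illustrative, and the exact counts vary by vocabulary.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    samples = {
        "English":  "How are you today?",
        "Japanese": "今日はお元気ですか？",
    }
    for language, text in samples.items():
        print(language, len(enc.encode(text)), "tokens")

    # Non-English text frequently needs noticeably more tokens to
    # express the same meaning, the kind of imbalance Jun measured.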

The Difference in Image Generation

Interestingly, the shortcomings of LLMs contrast with those of AI image generators like Midjourney and DALL-E, which use a different underlying architecture: diffusion models. Trained on large datasets of images, these models generate a picture by starting from pure noise and progressively reconstructing an image out of it.
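
In schematic terms, training corrupts an image with a little more noise at each step, and a network learns to run that process in reverse. Here is a minimal NumPy sketch of the forward (noising) half, with an illustrative schedule rather than any particular model’s:

    import numpy as np

    rng = np.random.default_rng(0)
    image = rng.random((8, 8))           # stand-in for a real image

    betas = np.linspace(1e-4, 0.02, 50)  # illustrative noise schedule
    alphas_bar = np.cumprod(1.0 - betas)

    def noised(x0, t):
        # Sample x_t from the forward process q(x_t | x_0).
        noise = rng.standard_normal(x0.shape)
        return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

    # Early steps barely perturb the image; by the final step it is
    # close to pure noise. Generation runs this in reverse, with a
    # trained network predicting the noise to strip away at each step.
    print(noised(image, 0).std(), noised(image, 49).std())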

As Asmelash Teka Hadgu, co-founder of Lesan and a fellow at the DAIR Institute, explains, “Image generators tend to perform much better on artifacts like cars and people’s faces, and less so on smaller things like fingers and handwriting.” This difference highlights that while AI models excel in some areas, they have distinct limitations based on their design and the data they are trained on.

Moving Forward with AI

Despite these limitations, AI development is rapidly progressing. OpenAI is working on a new model, code-named “Strawberry,” which is expected to enhance reasoning capabilities by generating accurate synthetic data. This could address some of the current limitations by providing more robust training material, potentially allowing LLMs to handle more complex tasks and linguistic challenges.

Meanwhile, Google DeepMind has developed AI systems like AlphaProof and AlphaGeometry 2, which have demonstrated impressive capabilities in solving formal math problems, achieving results comparable to those of silver medalists at the International Math Olympiad.
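
“Formal” here means machine-checkable: AlphaProof works in the Lean proof assistant, where even a trivial statement must be justified so a computer can verify it. An example far simpler than anything at the Olympiad:

    -- A tiny Lean 4 proof: addition of natural numbers is commutative.
    theorem add_comm_example (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b

Olympiad problems require proofs vastly longer than this, but the same mechanical checking applies.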

These advancements suggest that while current AI models might struggle with simple tasks like spelling “strawberry,” their potential to solve more complex problems continues to grow. The humorous contrast between AI’s inability to spell and its achievements in formal reasoning underscores the unpredictable and evolving nature of this technology.

As AI continues to develop, it’s crucial to recognize both its capabilities and its limitations. By understanding these factors, we can better harness the power of AI while remaining aware of the challenges that lie ahead.
