OpenAI’s new “deep-thinking” o1 model crushes coding benchmarks
I thought AI development had plateaued. I thought the bubble was about to burst and the hype train was derailing. I even thought my software engineering job might be safe from Devin. But I couldn't have been more wrong.
Yesterday, OpenAI released a terrifying new state-of-the-art model named o1, and it's not just another basic GPT. It represents a new paradigm of deep-thinking, or reasoning, models that obliterate past benchmarks in math, coding, and PhD-level science. Sam Altman had a message for all the AI skeptics: "I am always two steps ahead."
Before we get too hopeful that o1 will unburden us from our programming jobs, there are many reasons to doubt this new model. It's definitely not Artificial Superintelligence (ASI), it's not Artificial General Intelligence (AGI), and it's not even good enough to be called GPT-5. And despite OpenAI's mission of openness, the company is keeping all the interesting details locked away. In today's report, we'll try to figure out how o1 actually works and what it means for the future of humanity.
Names like GPT-5 and Q-Star have leaked out of OpenAI in recent months, but the world was shocked when the company released o1 ahead of schedule. GPT stands for Generative Pre-trained Transformer, and according to some, the "o" stands for "Oh, we're all going to die."
Let's first look at some of o1's dubious benchmarks. Compared to GPT-4, o1 achieves massive gains in accuracy, particularly in PhD-level physics and on multitask language-understanding benchmarks for math and formal logic. But the craziest improvements come in coding. At the International Olympiad in Informatics, o1 jumped from the 49th percentile when allowed 50 submissions per problem to above the gold-medal threshold when allowed 10,000 submissions. And on Codeforces, its Elo rating climbed from GPT-4's 11th percentile all the way to the 93rd percentile.
OpenAI has also been working secretly with Cognition Labs, the company that aims to replace programmers with its AI, Devin. Using GPT-4, Devin solved only 25% of problems; with o1, that success rate skyrocketed to 75%. That's crazy! Our only hope is that these internal, closed-source benchmarks from a venture-capital-funded company desperate for more money might just be exaggerated. Only time will tell, but o1 is undoubtedly a huge leap forward in the AI race.
The timing is perfect, because many people have been switching from ChatGPT to models like Claude, and OpenAI is currently in talks to raise more money at a $150 billion valuation. But how does a deep-thinking model like o1 actually work?
Technically, OpenAI released three new models: o1-mini, o1-preview, and o1 regular. Most users only have access to the mini and preview versions, while o1 regular is still locked away; OpenAI has even hinted at a $2,000 Premium Plus plan to access it. What makes these models special is their reliance on reinforcement learning to perform complex reasoning. When presented with a problem, they produce a chain of thought before presenting the answer to the user. In other words, they "think." Descartes said, "I think, therefore I am," but o1 isn't a sentient life form. Like a human, though, it works through a series of thoughts before reaching a conclusion, producing reasoning tokens along the way. These tokens let the model refine its steps and backtrack when necessary, resulting in more accurate, more complex solutions with fewer hallucinations.
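o1's internals are proprietary, but the behavior described above, drafting, self-critiquing, and backtracking, can be sketched as a simple loop. Everything below is hypothetical: the `model` and `critique` functions are stand-ins for the real reasoning process, which OpenAI does not expose.

```python
# Hypothetical sketch of a chain-of-thought loop: draft an answer,
# critique it, and revise until the critique passes or we give up.
# This only mirrors the *described* behavior; o1's actual internals
# are not public.

def model(prompt: str) -> str:
    # Stub "model": real use would call an LLM API here.
    return "draft answer to: " + prompt

def critique(answer: str) -> bool:
    # Stub self-check; a reasoning model scores its own steps.
    return "answer" in answer

def solve(prompt: str, max_steps: int = 5) -> tuple[str, int]:
    """Return (answer, approximate reasoning tokens spent)."""
    reasoning_tokens = 0
    answer = model(prompt)
    for _ in range(max_steps):
        reasoning_tokens += len(answer.split())  # crude token count
        if critique(answer):                     # good enough: stop
            return answer, reasoning_tokens
        answer = model("revise: " + answer)      # backtrack and retry
    return answer, reasoning_tokens
```

The key design point is that the intermediate drafts are never shown to the user, yet every pass through the loop burns tokens you pay for.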
However, this comes with trade-offs: it takes more computation time, more power, and more money. OpenAI provides examples, like the model writing a playable Snake game in a single shot, or solving a nonogram puzzle. But while it gets some impressive things right, it's not infallible. For instance, o1 still fails at counting the number of "R"s in the word "strawberry," a task that has baffled large language models in the past.
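The strawberry failure is usually blamed on tokenization: the model sees subword chunks rather than individual letters, so character-level counting that is trivial in code trips it up. For comparison, the deterministic version:

```python
# LLMs process "strawberry" as subword tokens, not characters,
# so they often miscount letters; plain string code has no such
# blind spot.
word = "strawberry"
r_count = word.count("r")
print(r_count)  # strawberry has three r's
```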
OpenAI hides most of the chain of thought from users, even though we still have to pay for those reasoning tokens, priced at $60 per million. Some examples of the model's reasoning process are visible in coding tasks, such as transposing a matrix in Bash, where it first considers the shape of the inputs and outputs, then assesses the constraints, before producing a solution. That might sound impressive, but the concept isn't novel. Google has been dominating math and coding competitions with models like AlphaProof and AlphaCode, which also rely on reinforcement learning to produce synthetic data. What's new is that a model like o1 is now generally available to the public.
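For reference, the task o1 reasons through here, transposing a matrix, is a few lines in most languages; the interesting part of OpenAI's demo is the visible reasoning about input shape and constraints, not the solution itself. A Python equivalent (not o1's actual Bash output):

```python
def transpose(matrix):
    """Swap rows and columns: element [i][j] moves to [j][i]."""
    # zip(*matrix) pairs up the i-th element of every row,
    # which is exactly the i-th column of the original.
    return [list(row) for row in zip(*matrix)]

print(transpose([[1, 2, 3], [4, 5, 6]]))  # [[1, 4], [2, 5], [3, 6]]
```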
So is o1 groundbreaking? Let's find out. A few years ago, when I first learned to code, I recreated the classic MS-DOS game Drug Wars. It took me around 100 hours to build as a human programmer. Let's see how GPT-4 handles it. When I asked GPT-4 to build the game with a GUI, it produced code that looked plausible but didn't actually compile. After some follow-up prompts, I got something functional, though the game logic was still limited.
Now let's see how o1 fares with the same prompt. It works through a chain of thought, producing reasoning tokens along the way. Unlike GPT-4, o1's game compiled right away and followed the requirements to the letter. At first it seemed flawless, but the app was actually buggy: I ran into an infinite loop with Officer Hardass, and the UI was poor. Additional prompts only led to more hallucinations and bugs. So while o1 is clearly more advanced, it's still not truly intelligent.
That said, there's massive potential in the chain-of-thought approach, though it also invites overstating the model's capabilities. Back in 2019, OpenAI claimed GPT-2 was too dangerous to release; now, five years later, Sam Altman is asking for government regulation.
Until proven otherwise, o1 is just another benign AI tool. It's basically GPT-4 with the ability to recursively prompt itself, and that's not fundamentally game-changing. But don't take my word for it; after all, I'm just like a horse influencer in 1910 telling the other horses that a car won't take their jobs, but another horse driving a car might.
This has been the Code Report. Thanks for reading, and I’ll see you in the next one.