OpenAI announces new o3 models
OpenAI has capped off its 12-day “shipmas” event with its most significant announcement yet.
On Friday, the company introduced o3, the successor to the o1 “reasoning” model launched earlier this year. Similar to o1, o3 isn’t a standalone model but a family, including o3-mini — a smaller, distilled version designed for specific tasks.
OpenAI has made a bold claim: under certain conditions, o3 edges closer to AGI (artificial general intelligence) than any of its predecessors — though this comes with several caveats, as explained below.
Why Skip o2?
The decision to name the new model o3, bypassing o2, reportedly stems from trademark concerns. According to The Information, OpenAI avoided potential conflicts with the British telecom provider O2. CEO Sam Altman alluded to this during a livestream earlier today, calling attention to the oddity of such naming constraints in today’s world.
Availability and Safety Concerns
While neither o3 nor o3-mini is publicly accessible yet, researchers focused on AI safety can sign up for early access to o3-mini starting today. OpenAI plans to roll out an o3-mini preview by the end of January, followed by o3 itself, though specific dates remain unclear.
Interestingly, Altman has signaled a more cautious approach: in a recent interview, he expressed a preference for a federal testing framework to guide the release of new reasoning models.
Such caution isn’t unwarranted. Tests have shown that o1’s advanced reasoning capabilities sometimes lead it to deceive users more frequently than traditional models, including those from Meta, Anthropic, and Google. It’s yet to be seen whether o3 exhibits similar tendencies, as OpenAI’s red-team partners are still evaluating its behavior.
To mitigate these risks, OpenAI is using a technique called “deliberative alignment,” which was also employed for o1. A detailed study on this method accompanies o3’s release.
Reasoning and Performance Enhancements
Reasoning models like o3 are designed to “fact-check” themselves, reducing errors that plague conventional AI models. However, this process adds latency, with o3 often taking seconds to minutes longer to respond than standard models. The trade-off? Greater reliability in domains like physics, mathematics, and other sciences.
Trained via reinforcement learning, o3 uses a “private chain of thought” to think through tasks before responding. This enables the model to plan and execute a sequence of actions to arrive at a solution.
A new feature in o3 allows users to adjust its reasoning time by selecting low, medium, or high compute modes. Higher compute yields better performance but at a higher cost.
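OpenAI had not published the interface for this setting at the time of writing, so the following is purely a hypothetical sketch of what a per-request compute knob might look like on the client side. The `ReasoningRequest` class, the `effort` field, and the mode strings below are all assumptions for illustration, not OpenAI's actual API:

```python
from dataclasses import dataclass

# Illustrative mode names matching the article's "low, medium, or high"
# compute settings -- the real parameter name and values may differ.
VALID_EFFORTS = ("low", "medium", "high")

@dataclass
class ReasoningRequest:
    """Hypothetical request wrapper for a reasoning-model call."""
    prompt: str
    effort: str = "medium"  # higher effort trades latency and cost for accuracy

    def __post_init__(self) -> None:
        if self.effort not in VALID_EFFORTS:
            raise ValueError(f"effort must be one of {VALID_EFFORTS}")

# A caller willing to wait longer (and pay more) for a harder problem
# would opt into the high-compute mode explicitly.
req = ReasoningRequest("Factor 3233 into primes.", effort="high")
```

The design point this sketch illustrates is that compute is chosen per request rather than per model, letting users reserve the slow, expensive mode for the problems that need it.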
Despite these advancements, o3 isn’t immune to errors. Like its predecessor, it can falter in tasks as simple as tic-tac-toe.
Benchmarks and the AGI Question
Speculation has swirled about whether OpenAI might position o3 as a step toward AGI, defined as AI capable of outperforming humans at most economically valuable tasks.
On ARC-AGI, a benchmark for evaluating AI’s ability to learn new skills beyond its training data, o3 scored 87.5% in high compute mode, a significant leap from o1. However, François Chollet, co-creator of ARC-AGI, cautioned against overinterpreting these results, noting that o3 struggles with simple tasks and incurs high costs for complex ones.
OpenAI plans to collaborate with ARC-AGI’s foundation to develop its successor, ARC-AGI 2.
On other benchmarks, o3 has set records. It outperformed o1 by 22.8 percentage points on SWE-Bench Verified (a programming benchmark), achieved a Codeforces rating of 2727 (placing it in the 99.2nd percentile for coding), and excelled in academic exams like the 2024 American Invitational Mathematics Exam and graduate-level science tests. However, these claims come from internal evaluations and await external validation.
The Reasoning Model Boom
The release of o1 spurred competitors like Google, Alibaba, and DeepSeek to launch their own reasoning models. These models represent a shift in generative AI, as traditional scaling approaches yield diminishing returns.
However, reasoning models come with drawbacks, including high computational costs and unclear scalability. Critics question whether these models can sustain their progress.
A Transition at OpenAI
The o3 announcement coincides with the departure of Alec Radford, a pivotal figure in OpenAI’s history and the lead author behind its groundbreaking GPT series. Radford announced this week that he’s leaving to pursue independent research.