OpenAI o3 Mini: The Illusion of Intelligence

A tense moment on a fractured archway, underscoring the challenges of closing gaps—whether in business, technology, or global AI development.

As Sam Altman mentioned, reasoning models are designed for specialists, but perhaps not for the reasons you might expect.

Jing Hu – AI Advances

If you’ve followed my previous article, you’re already ahead of many news outlets by a week. The situation remains chaotic in the AI world.

If this is your first time reading my work, I recommend starting with my piece titled 30 Questions You Should Ask About DeepSeek.

What’s Really at Stake?

You might think that with all the expert videos from OpenAI, they’ve nailed it, or that their work brings us closer to AGI (though the definition of AGI keeps shifting). It’s troubling how people overlook the dangers of something that’s “almost right,” especially when it concerns your health, finances, or future.

What to Expect:

This isn’t a technical comparison or benchmark of AI models. Rather, it’s an exercise in second-order thinking. I’ll dive into how these models reason and whether they can truly be trusted.

Key Observations:

I ran the same prompt across both reasoning models (such as DeepSeek R1 and o3-mini-high) and non-reasoning models (such as Llama 3.1 and Claude 3.5 Sonnet).

I observed how AI confidently makes errors in complex tasks like options trading, with documented data available for paying subscribers.

Larger models tend to give the illusion of expertise, but still require human oversight to validate their results.

Sometimes, being almost correct is more dangerous than not using the AI at all, especially for intricate reasoning tasks.

Experiment Overview:

I tasked various AI models with calculating the future value of an options position—a task combining math with assumptions. The models needed to work out each option's intrinsic value, its remaining time value, and the overall PnL of the position.

The goal was to see if an AI model could reason through a chain of steps to arrive at the correct answer in one go—just as automated systems should work in real-world workflows.

The Models Tested:

DeepSeek 14b, DeepSeek R1, Llama 3.1 8b, Claude 3.5 Sonnet, Perplexity + DeepSeek R1, and OpenAI o3-mini-high.

The Prompt:

I refined the prompt through multiple iterations to give smaller models a better shot at success. The task was to calculate the overall PnL for a stock trading scenario involving options.

In this scenario:

MSFT stock is trading at $447.2

A set of call options with varying expiry dates and prices was given.

The models had to calculate the value of these options on a future date, considering time decay and intrinsic value; a rough sketch of this calculation follows below.
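
To make the task concrete, here is a minimal sketch in Python of the kind of calculation the prompt demands. The strike, premium, and day counts are hypothetical placeholders rather than the exact figures from my prompt, and the square-root decay is a simple approximation, not a full options-pricing model.

```python
import math

def call_value(spot: float, strike: float, time_value_at_purchase: float,
               days_remaining: int, days_total: int) -> float:
    """Approximate a long call's value on a future date.

    Intrinsic value: max(spot - strike, 0).
    Time value: the option's original extrinsic value, decayed with the
    square root of the fraction of its life that remains (a rough rule
    of thumb, not a full pricing model such as Black-Scholes).
    """
    intrinsic = max(spot - strike, 0.0)
    decay = math.sqrt(days_remaining / days_total) if days_total else 0.0
    return intrinsic + time_value_at_purchase * decay

# Made-up example: a single MSFT call with 39 of its 65 days left to expiry.
spot = 447.2                # MSFT price from the scenario
premium_paid = 6.0          # hypothetical cost of the option, per share
value = call_value(spot, strike=450.0, time_value_at_purchase=6.0,
                   days_remaining=39, days_total=65)
pnl = value - premium_paid  # per-share PnL for this single leg
print(f"Value: {value:.2f}, PnL: {pnl:+.2f}")
```

A model has to get every one of these small steps right (the day counts, the decay factor, the intrinsic-value cutoff) and then combine all the legs correctly to arrive at the final PnL.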

Correct Answer Breakdown:

Any reasonable solution should provide:

The value of February options: $25

The value of March options, factoring in both intrinsic value and time value.

The final PnL: A total loss of $27.46.

Results and Observations:

None of the models gave the correct answer. Some were close but made small errors, such as incorrectly calculating the remaining days or misapplying the square root time decay formula.

While OpenAI’s o3-mini-high model reasoned well, it still made subtle math errors that would be easy to overlook at first glance. Other models like Llama and Claude missed key points, such as confusing intrinsic value with total value.
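
To illustrate how easy such slips are to miss, consider a made-up comparison (the day counts are mine, not from the actual transcripts): miscounting an option's remaining life by just two days barely changes the decay factor.

$$
\sqrt{\tfrac{39}{65}} \approx 0.775,
\qquad
\sqrt{\tfrac{37}{65}} \approx 0.754
$$

A difference of under three percent in the estimated time value looks entirely plausible on the page, yet it is enough to push the final PnL off target while the answer still reads as perfectly reasonable.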

The Bigger Problem:

The confidence with which these models made errors is a growing issue. As these systems reason more like humans, the need for expert validation increases. This brings us to a troubling realization: while these models may seem like they’re on the path to AGI, they still lack true understanding and can’t reliably perform tasks that require deep reasoning.

Commercial Applications and Pitfalls:

The idea of using reasoning models in commercial applications, such as CRM or insurance claims, sounds appealing but is fraught with challenges. Variability in outputs, difficulty in ensuring consistent formats, and the high need for human intervention to fix mistakes make these models impractical without heavy engineering support.
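
As one illustration of the engineering overhead involved, below is a minimal sketch of the kind of guardrail such a system would need before a model's answer could flow into a CRM or claims workflow. The field names, the JSON format, and the tolerance are hypothetical choices of mine, not features of any particular product.

```python
import json

# Hypothetical schema for a model's answer; field names and the tolerance
# are illustrative, not taken from any real product or from my experiment.
REQUIRED_FIELDS = {"position_value": (int, float), "pnl": (int, float),
                   "as_of_date": str}

def validate_model_output(raw: str, reference_pnl: float,
                          tolerance: float = 0.01) -> dict:
    """Parse and sanity-check a model's JSON answer.

    Returns the parsed dict if it contains the expected fields and its PnL
    is within `tolerance` of an independently computed reference figure;
    otherwise raises so the case can be escalated to a human reviewer.
    """
    data = json.loads(raw)  # raises ValueError on malformed JSON
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data or not isinstance(data[field], expected_type):
            raise ValueError(f"missing or mistyped field: {field}")
    if abs(data["pnl"] - reference_pnl) > tolerance:
        raise ValueError(f"PnL {data['pnl']} deviates from reference "
                         f"{reference_pnl}")
    return data
```

Even this thin layer only catches formatting and magnitude problems; it says nothing about whether the reasoning behind the number was sound, which is exactly where human review remains unavoidable.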

The reality behind AI reasoning models is far from the hype. Despite the promise, they fall short in practical use cases. Rather than being a plug-and-play solution, integrating these models into systems requires ongoing maintenance, monitoring, and a significant amount of manual oversight. It’s crucial to approach AI integration with caution, as the risks—legal and otherwise—are often underestimated.
