Orion's Big Debut: What's New with GPT-4.5?
San Francisco, Friday, February 28, 2025
On another coding test, OpenAI's SWE-Lancer benchmark, which measures an AI model's ability to develop full software features, GPT-4.5 outperforms GPT-4o and o3-mini, but falls short of OpenAI's deep research agent.
GPT-4.5 doesn't quite reach the performance of leading AI reasoning models such as o3-mini, DeepSeek's R1, and Claude 3.7 Sonnet on difficult academic benchmarks such as AIME and GPQA.
GPT-4.5 matches or bests leading non-reasoning models on those same tests, suggesting that the model performs well on math- and science-related problems.
OpenAI also claims that GPT-4.5 is qualitatively superior to other models in areas that benchmarks don't capture well, like the ability to understand human intent.
GPT-4.5 responds in a warmer and more natural tone, OpenAI says, and performs well on creative tasks such as writing and design.
In one informal test, OpenAI prompted GPT-4.5 and two other models, GPT-4o and o3-mini, to create a unicorn in SVG, a vector graphics format that describes images with code and mathematical shapes rather than pixels.
GPT-4.5 was the only AI model to create anything resembling a unicorn.
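To make the task concrete: drawing in SVG means emitting markup that positions geometric primitives by coordinates, with no visual feedback. The sketch below is a minimal illustration of that kind of output; the helper function, shapes, and coordinates are hypothetical placeholders, not OpenAI's actual test or any model's response.

```python
def make_svg(width, height, shapes):
    """Wrap a list of SVG shape elements in a valid <svg> document."""
    body = "\n  ".join(shapes)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">\n  {body}\n</svg>'
    )

# A crude stick-figure "unicorn": an ellipse body, a circle head,
# and a triangular horn, each placed by raw coordinates.
shapes = [
    '<ellipse cx="60" cy="70" rx="30" ry="18" fill="white" stroke="black"/>',
    '<circle cx="95" cy="45" r="12" fill="white" stroke="black"/>',
    '<polygon points="100,35 104,18 108,35" fill="gold"/>',  # the horn
]

svg = make_svg(140, 100, shapes)
print(svg)
```

The difficulty for a language model is exactly what this sketch hides: choosing coordinates so that the horn actually sits on the head and the head on the body, purely by reasoning about numbers in text.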
In another test, OpenAI asked GPT-4.5 and the other two models to respond to the prompt, "I'm going through a tough time after failing a test."
GPT-4o and o3-mini gave helpful information, but GPT-4.5's response was the most socially appropriate.
OpenAI is excited to see how people use GPT-4.5 in ways they might not have expected.
The industry is starting to question if pre-training "scaling laws" will continue to hold.
OpenAI co-founder and former chief scientist Ilya Sutskever said in December that "we've achieved peak data," and that "pre-training as we know it will unquestionably end."
His comments echoed concerns AI investors, founders, and researchers shared with TechCrunch for a feature in November.
The industry — including OpenAI — has embraced reasoning models, which take longer than non-reasoning models to perform tasks but tend to be more consistent.
By increasing the amount of time and computing power that AI reasoning models use to “think” through problems, AI labs are confident they can significantly improve models’ capabilities.
OpenAI plans to eventually combine its GPT series of models with its o-series reasoning models, beginning with GPT-5 later this year.
GPT-4.5, which reportedly was incredibly expensive to train, delayed several times, and failed to meet internal expectations, may not take the AI benchmark crown on its own.
But OpenAI likely sees it as a stepping stone toward something far more powerful.