The marquee feature is obviously the 1M context window, compared to the ~200k most other models support, sometimes with an extra cost for generations beyond 200k tokens. Per the pricing page, there is no additional cost for tokens beyond 200k: https://openai.com/api/pricing/
Also per pricing, GPT-5.4 ($2.50/M input, $15/M output) is much cheaper than Opus 4.6 ($5/M input, $25/M output) and Opus has a penalty for its beta >200k context window.
I am skeptical whether the 1M context window will provide material gains, as current Codex/Opus show weaknesses once their context windows are mostly full, but we'll see.
> For models with a 1.05M context window (GPT-5.4 and GPT-5.4 pro), prompts with >272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.
Which, Claude has the same deal. You can get a 1M context window, but it's gonna cost ya. If you run /model in claude code, you get:
Switch between Claude models. Applies to this session and future Claude Code sessions. For other/previous model names, specify with --model.
1. Default (recommended) Opus 4.6 · Most capable for complex work
2. Opus (1M context) Opus 4.6 with 1M context · Billed as extra usage · $10/$37.50 per Mtok
3. Sonnet Sonnet 4.6 · Best for everyday tasks
4. Sonnet (1M context) Sonnet 4.6 with 1M context · Billed as extra usage · $6/$22.50 per Mtok
5. Haiku Haiku 4.5 · Fastest for quick answers
Is that why it says rate limit all the time if you switch to a 1M model on Claude now? It kept giving me that, so I switched to an API account over the weekend for some vibe coding and ran up a huuuuge API bill by mistake, whooops.
> GPT‑5.4 in Codex includes experimental support for the 1M context window. Developers can try this by configuring model_context_window and model_auto_compact_token_limit. Requests that exceed the standard 272K context window count against usage limits at 2x the normal rate.
I can see that's what they mean now that I've read the replies, but when I first read that top comment I too parsed it as meaning 201k would cost the same as 999k (which admittedly did seem strange, hence I read the replies to confirm and sure enough that's not actually the case!)
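As a sanity check on that reading, the quoted rule is easy to turn into a rough cost estimator. A sketch in Python, using the per-token prices mentioned elsewhere in this thread ($2.50/M input, $15/M output) and the quoted 272K threshold with 2x/1.5x multipliers; the function name and structure are mine, not an official API:

```python
# Rough cost sketch for the long-context pricing rule quoted above:
# prompts with >272K input tokens are billed at 2x input / 1.5x output
# for the FULL session, not just the tokens past the threshold.

def session_cost(input_tokens, output_tokens,
                 in_price=2.50, out_price=15.00,
                 long_context_threshold=272_000):
    """Estimated USD cost; the multipliers apply to the whole session."""
    if input_tokens > long_context_threshold:
        in_price *= 2.0
        out_price *= 1.5
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

short = session_cost(100_000, 10_000)  # billed entirely at base rates
big = session_cost(300_000, 10_000)    # entire session at 2x/1.5x rates
```

So a 300K-token prompt does not cost the same as a 200K one: crossing 272K doubles the input rate for everything in the session, which is why 201k and 999k are priced differently per token but under the same multiplier regime.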
Yeah, long context vs compaction is always an interesting tradeoff. More information isn't always better for LLMs, as each token adds distraction, cost, and latency. There's no single optimum for all use cases.
For Codex, we're making 1M context experimentally available, but we're not making it the default experience for everyone, as from our testing we think that shorter context plus compaction works best for most people. If anyone here wants to try out 1M, you can do so by overriding `model_context_window` and `model_auto_compact_token_limit`.
Curious to hear if people have use cases where they find 1M works much better!
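For anyone wanting to try this: the two keys named above go in the Codex CLI config file. A minimal sketch, assuming the usual `~/.codex/config.toml` location; the values here are illustrative, not recommendations:

```toml
# ~/.codex/config.toml (assumed location)
# Opt in to the experimental 1M context window by overriding both limits.
model_context_window = 1000000           # raise the window to ~1M tokens
model_auto_compact_token_limit = 900000  # compact shortly before it fills
```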
> Curious to hear if people have use cases where they find 1M works much better!
Reverse engineering [1]. When decompiling a bunch of code and tracing functionality, it's really easy to fill up the context window with irrelevant noise and compaction generally causes it to lose the plot entirely and have to start almost from scratch.
(Side note, are there any OpenAI programs to get free tokens/Max to test this kind of stuff?)
Could you maybe give us users some way to hint at what to compact and throw away? In the Codex CLI, maybe you could create a visual tool where I can see the context and quickly check-mark things I want to discard.
Sometimes I’m exploring some topic and that exploration is not useful but only the summary.
Also, you could start from a best guess: the CLI could tell me what it wants to compact, and I could tweak its suggestion in natural language.
Context is going to be super important because it is the primary constraint. It would be nice to have serious granular support.
I too tried Codex and found it similarly hard to control over long contexts. It ended up coding an app that spit out millions of tiny files which were technically smaller than the original files it was supposed to optimize, except due to there being millions of them, actual hard drive usage was 18x larger. It seemed to work well until a certain point, and I suspect that point was context window overflow / compaction. Happy to provide you with the full session if it helps.
I’ll give Codex another shot with 1M. It just seemed like cperciva’s case and my own might be similar in that once the context window overflows (or refuses to fill) Codex seems to lose something essential, whereas Claude keeps it. What that thing is, I have no idea, but I’m hoping longer context will preserve it.
Yeah, I would definitely characterize it as an instruction following problem. After a few more round trips I got it to admit that "my earlier passes leaned heavily on build/tests + targeted reads, which can miss many “deep” bugs that only show up under specific conditions or with careful semantic review" and then asking it to "Please do a careful semantic review of files, one by one." started it on actually reviewing code.
Mind you, the bugs it reported were mostly bogus. But at least I was eventually able to convince it to try.
It occurred to me that searching 196 .c files was a context window issue, but maybe there’s something else going on. Either way, Codex could behave better.
Haha. This was the second time in like a year that I’ve posted a Twitter link, and the second time someone complained. Okay, I’ll try to remove those before posting, and I’ll edit this one out.
Feels like a losing battle, but hey, the audience is usually right.
I'm sorry, but it's my pet peeve. If you're on iOS/macOS I built a 100% free and privacy-friendly app to get rid of tracking parameters from hundreds of different websites, not just X/Twitter.
This is great! I have been meaning to implement this sort of thing in my existing Shortcuts flow but I see you already support it in Shortcuts! Thank you for this!
So what is your motivation for doing this, incidentally? Can you be explicit about it? I am genuinely curious.
Especially when it’s to the point of, you know, nagging/policing people to do it the way you’d prefer, when you could just redirect your router requests from x.com to xcancel.com
It's not particularly about x.com; hundreds of sites like X, YouTube, Facebook, LinkedIn, TikTok, etc. surreptitiously add tracking parameters to their links. The iOS Messages app even hides these tracking parameters. I don't like being surreptitiously tracked online, and judging by the success of my free app, there are millions of people like me.
So, since these companies have to comply with removing PII, is the worst thing that could happen to me that I get ads that are more likely to be interesting to me?
I'm not being facetious, honest question, especially considering ads are the only thing paying these people these days.
Who has to comply with removing PII? Your profile, yours, mapped to a special snowflake ID, is packaged and sold across a network of 2500 - 4000 buyers, including in particular those that clean, tie (a surprisingly small footprint turns into its own "natural primary key"), qualify, and sell on to agencies. No step in this is illegal.
my first and last name is already a "natural primary key" (every single google result of Peter Marreck is me), so I've already had to give that up a long time ago. So nothing new is lost I guess?
The more data they have on you, the more valuable that data is to a third party. So they sell your data to someone else, who then phones you based on your known deep interest in <whatever it was that tracked you>. Or spams you. Or messages you. Or whatever method they think will most get your attention.
If you don't give them that information, they can't sell it, and the buyers won't annoy you.
It's not that the ads you get are more interesting, it's that you get more ads because they think they know more about you.
The worst thing that could happen is that you get caught in some government dragnet based on your historical viewing data and get disappeared because (as is the nature of dragnet searches) no matter how innocent you are you still look guilty.
This is a helpful type of nagging, for me. Most here would agree tracking parameters are not a positive aspect of the modern digital experience, and calling them out gently, without hostility, is not bad. It might not quite be self-policing, but some of that, with good reason, is not bad for healthy communities IMO.
It's funny that the context window size is such a thing still. Like the whole LLM 'thing' is compression. Why can't we figure out some equally brilliant way of handling context besides just storing text somewhere and feeding it to the llm? RAG is the best attempt so far. We need something like a dynamic in flight llm/data structure being generated from the context that the agent can query as it goes.
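A toy sketch of that "queryable context" idea: instead of keeping all raw text in the prompt, store chunks in a side structure the agent can query as it goes. The class, names, and the stubbed-out summarizer are all my own invention for illustration; a real version would LLM-summarize each chunk and use embeddings rather than keyword matching:

```python
# Toy queryable context store: the agent keeps only summaries "in mind"
# and pulls full text back on demand, instead of carrying everything.

class ContextStore:
    def __init__(self):
        self.chunks = []          # list of (summary, full_text) pairs

    def add(self, text):
        summary = text[:80]       # stub: a real version would LLM-summarize
        self.chunks.append((summary, text))

    def query(self, keyword):
        """Return the full text of chunks whose summary mentions keyword."""
        return [full for summary, full in self.chunks
                if keyword.lower() in summary.lower()]

store = ContextStore()
store.add("Decompiled function sub_4010 handles checksum validation.")
store.add("UI layer notes: button styles live in theme.css.")
hits = store.query("checksum")    # retrieves only the relevant chunk
```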
That’s actually a pretty cool idea. When I think about my internal mental model of a codebase I’m working on it’s definitely a compacted lossy thing that evolves as I learn more.
Personally, what I'm more interested in is the effective context window. When using Codex 5.2 high, I preferred to start compaction at around 50% of the context window because I noticed degradation around that point. As of about a month ago, that point is now lower, which is great. Anyway, I feel I won't be using that 1 million context at all in 5.4, but if the effective window is something like 400k, that by itself is already a huge win. That means longer sessions before compaction, and the agent can keep working on complex stuff for longer. But then there is the question of 5.4's intelligence. If it's as good as 5.2 high, I am a happy camper; I found 5.3 anything... lacking, personally.
Not sure how accurate this is, but found contextarena benchmarks today when I had the same question.
It appears from these that only Gemini has actual context == effective context. Although I wasn't able to test this in either the Gemini CLI or Antigravity with my Pro subscription because, well, it appears nobody at Google actually uses these tools.
That's an interesting point regarding context vs. compaction. If that's viewed as the best strategy, I'd hope we would see more tooling around compaction than just "I'll compact what I want, brace yourselves" without warning.
Like, I'd love an optional pre-compaction step, "I need to compact, here is a high level list of my context + size, what should I junk?" Or similar.
This is exactly how it should work. I imagine it as a tree view showing both full and summarized token counts at each level, so you can immediately see what’s taking up space and what you’d gain by compacting it.
The agent could pre-select what it thinks is worth keeping, but you’d still have full control to override it. Each chunk could have three states: drop it, keep a summarized version, or keep the full history.
That way you stay in control of both the context budget and the level of detail the agent operates with.
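That three-state design is easy to sketch as a data structure: each chunk carries a full and a summarized token count plus a user-chosen state, from which the resulting context budget falls out directly. The names and numbers below are made up for illustration, not any tool's real API:

```python
# Sketch of the per-chunk compaction control described above: each chunk
# can be dropped, kept as a summary, or kept in full, and the total
# context budget is just the sum of the chosen representations.

DROP, SUMMARY, FULL = "drop", "summary", "full"

class Chunk:
    def __init__(self, name, full_tokens, summary_tokens, state=SUMMARY):
        self.name = name
        self.full_tokens = full_tokens
        self.summary_tokens = summary_tokens
        self.state = state

    def cost(self):
        return {DROP: 0, SUMMARY: self.summary_tokens,
                FULL: self.full_tokens}[self.state]

chunks = [
    Chunk("exploration of a dead-end approach", 40_000, 1_000, state=DROP),
    Chunk("file reads: src/parser.c", 25_000, 2_000, state=SUMMARY),
    Chunk("current plan + decisions", 3_000, 3_000, state=FULL),
]
budget_after_compaction = sum(c.cost() for c in chunks)
```

A tree view in the CLI would just render these per-chunk counts and let you toggle states before compaction runs.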
I do find it really interesting that more coding agents don't have this as a toggleable feature; sometimes you really need this level of control to get useful capability.
Yep; I've actually had entire jobs essentially fail due to a bad compaction. It lost key context, and it completely altered the trajectory.
I'm now more careful, using tracking files to try to keep it aligned, but more control over compaction regardless would be highly welcomed. You don't ALWAYS need that level of control, but when you do, you do.
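As an illustration of the tracking-file approach, something like the following kept in the repo root survives any compaction because the agent can simply re-read it (the file name and layout here are my own convention, not a standard):

```
# PROGRESS.md (hypothetical tracking file)

## Goal
Migrate design tokens in src/components/* to the new theme API.

## Done
- Button.tsx, Card.tsx (tokens replaced, tests green)

## Remaining
- Modal.tsx, Table.tsx (blocked on theme typings)

## Decisions
- Keep legacy aliases exported until v3.
```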
Have you tried writing that as a skill? Compaction is just a prompt with a convenient UI to keep you in the same tab. There's no reason you can't ask the model to do that yourself and start a new conversation. You can look up Claude's /compact definition, for reference.
However, in some harnesses the model is given access to the old chat log/"memories", so you'd need a way to provide that. You could compromise by running /compact and pasting the output from your own summarizer (that you ran first, obviously).
Frontend work with large component libraries. When I'm refactoring shared design system components, things like a token system that touches 80+ files, compaction tends to lose the thread on which downstream components have already been updated vs which still need changes. It ends up re-doing work or missing things silently.
The model holds "what has been updated" well at the start of a session. After compaction, it reconstructs from summaries, and that reconstruction is lossy exactly where precision matters most: tracking partially-complete cross-file operations.
1M context isn't about reading more, it's about not forgetting what you already did halfway through.
This observation makes sense, because all current models probably use some kind of sparse attention architecture.
So the closer the two related pieces of information are to each other in the input context, the larger the chance their relationship will be preserved.
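The locality point can be illustrated with a toy sliding-window attention mask, one common form of sparse attention (whether any particular frontier model uses exactly this is an assumption): a token can only attend to nearby positions, so distant pairs never interact directly within a single layer.

```python
# Toy causal sliding-window attention mask: token i may attend to token j
# only if j is no more than `window` positions behind it. Distant pairs
# get no direct attention edge, only indirect propagation across layers.

def sliding_window_mask(seq_len, window):
    return [[abs(i - j) <= window and j <= i for j in range(seq_len)]
            for i in range(seq_len)]

mask = sliding_window_mask(6, 2)
# token 5 can see tokens 3..5 but not tokens 0..2
```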
What needs to be an option is to let the model complete and then compact, and only if needed go into the 1M version. That way you get the most out of the shorter window, but in the case where it just couldn't finish and compact in time, it will (at cost) go over. I wonder how many tokens are actually left at the end of compaction on average. I know there have been many times where I likely needed just another 10-20k, and a better stopping point would have been there.
I really don't have any numbers to back this up, but it feels like the sweet spot is around ~500k context size. Anything larger than that and you usually have scoping issues, trying to do too much at the same time, or issues with the quality of what's in the context at all.
For me, I would say speed (not just time to first token, but a complete generation) is more important than going for a larger context size.
I have found a bigger context window quite useful when trying to make sense of larger codebases. Generating documentation on how different components interact is better than nothing, especially if the code has poor test coverage.
I've also had it succeed in attempts to identify some non-trivial bugs that spanned multiple modules.
Context distillation, mostly. Agents tend to report success too early if they find something close to what they need for the task. If you can shove everything into a 1M context, it's impossible for them to give up looking; it's right there in the context. But for actual implementation, it's not useful at all. They get derailed by too long a context.
It's a little hard to compare, because Claude needs significantly fewer tokens for the same task. A better metric is the cost per task, which ends up being pretty similar.
For example, on Artificial Analysis, the GPT-5.x models' costs to run the evals range from half that of Claude Opus (at medium and high) to significantly more than Opus (at extra high reasoning). So on their cost graphs, GPT has a considerable spread, and Opus sits right in the middle of that distribution.
The most striking graph to look at there is "Intelligence vs Output Tokens". When you account for that, I think the actual costs end up being quite similar.
According to the evals, at least, the GPT extra high matches Opus in intelligence, while costing more.
Of course, as always, benchmarks are mostly meaningless and you need to check Actual Real World Results For Your Specific Task!
For most of my tasks, the main thing a benchmark tells me is how overqualified the model is, i.e. how much I will be over-paying and over-waiting! (My classic example: I gave the same task to Gemini 2.5 Flash and Gemini 2.5 Pro. Both did it to the same level of quality, but Pro took 3x longer and cost 3x more!)
Looks like the same thing might apply to GPT-5.4 vs the previous GPTs:
>In the API, GPT‑5.4 is priced higher per token than GPT‑5.2 to reflect its improved capabilities, while its greater token efficiency helps reduce the total number of tokens required for many tasks.
Looks like it costs ~25% more than 5.2, with both on xhigh reasoning.
They only seem to have tested xhigh, which is a shame, since I think that reasoning level is in the point of diminishing returns for most tasks.
Also I was completely wrong earlier. Opus is significantly more expensive. I was looking at the wrong entry in the chart, the non-reasoning version of Opus. The fair comparison is Opus on max reasoning, which costs about twice the price of GPT-5.4 xhigh, to run the AA evals.
Grok has a 2M context window for most of their models.
For example their latest model `grok-4-1-fast-reasoning`:
- Context window: 2M
- Rate limits: 4M tokens per minute, 480 requests per minute
- Pricing: $0.20/M input $0.50/M output
Grok is not as good at coding as Claude, for example, but for researching stuff it is incredible. They have a model for coding now, but I haven't tried that one yet.
Based on my experience with LLMs the larger your input context the bigger the chance of something going sideways in the response. Not sure how to address this properly.
Context rot is definitely still a problem but apparently it can be mitigated by doing RL on longer tasks that utilize more context. Recent Dario interview mentions this is part of Anthropic’s roadmap.
IMO, the main feature is /fast... who uses a 1M context, and for what? The model already becomes dumber at 200K. It's better to manage the context, and since 5.3, Codex is very good at managing it.
I've been using Codex for software development personally (I have a ChatGPT account), and I use Claude at work (since it is provided by my employer).
I find both Codex and Claude Opus perform at a similar level, and in some ways I actually prefer Codex (I keep hitting quota limits in Opus and have to revert back to Sonnet).
If your question is related to morality (the thing about US politics, DoD contract and so on)... I am not from the US, and I don't care about its internal politics. I also think both OpenAI and Anthropic are evil, and the world would be better if neither existed.
> I've been using Codex for software development personally (I have a ChatGPT account), and I use Claude at work (since it is provided by my employer).
Exact same situation here. I've been using both extensively for the last month or so, but still don't really feel either of them is much better or worse. But I have not done large complex features with it yet, mostly just iterative work or small features.
I also feel I am probably being very (overly?) specific in my prompts compared to how other people around me use these agents, so maybe that 'masks' things.
I have a hypothesis that people who have patience and reasonably well-developed written language skills will scratch their heads at why everyone else is having so much difficulty.
Ahh, good question. I misunderstood you, apologies.
There's no mention of pricing, quotas and so on. Perhaps Codex will still be preferable for coding tasks as it is tailored for it? Maybe it is faster to respond?
Just speculation on my part. If it becomes redundant to 5.4, I presume it will be sunset. Or maybe they eventually release a Codex 5.4?
They perform at a somewhat equal level on writing single files. But Codex is absolute garbage at theory of self/others. That quickly becomes frustrating.
I can tell claude to spawn a new coding agent, and it will understand what that is, what it should be told, and what it can approximately do.
Codex on the other hand will spawn an agent and then tell it to continue with the work. It knows a coding agent can do work, but doesn't know how you'd use it - or that it won't magically know a plan.
You could add more scaffolding to fix this, but Claude proves you shouldn't have to.
I suspect this is a deeper model "intelligence" difference between the two, but I hope 5.4 will surprise me.
> They perform at a somewhat equal level on writing single files.
That's not the experience I have. I had it do more complex changes spawning multiple files and it performed well.
I don't like using multiple agents though. I don't vibe code, I actually review every change it makes. The bottleneck is my review bandwidth, more agents producing more code will not speed me up (in fact it will slow me down, as I'll need to context switch more often).
In our evals for answering cybersecurity incident investigation questions and even autonomously doing the full investigation, gpt-5.2-codex with low reasoning was the clear winner over non-codex or higher reasoning. 2X+ faster, higher completion rates, etc.
It was generally smarter than pre-5.2 so strategically better, and codex likewise wrote better database queries than non-codex, and as it needs to iteratively hunt down the answer, didn't run out the clock by drowning in reasoning.
In my testing, Codex actually planned worse than Claude but coded better, and faster, once the plan was set.
It is also excellent for cross-checking Claude's work, finding significant weaknesses every time.
Weird. It used to be the opposite. My own experience is that Claude's behind-the-scenes support is a differentiator for office work. It handles documents, spreadsheets and such much better than anyone else (presumably with server-side scripts). Codex feels a bit smarter, but it inserts a lot of checkpoints to keep from running too long. Claude will run a plan to the end, but the token limits have become so small in the last couple of months that the $20 plan basically only buys one significant task per day. The iOS app is what makes me keep the subscription.
Correct, this is the way. A year or two ago lots of people were saying to do the opposite, but at least now and probably also even then, this is better. Claude is a more sensible and holistic designer, planner, debater, and idea generator. Codex is better at actually correctly implementing any large codebase change in a single pass.
Why would someone use Claude Code instead? Or any other harness? Or why only use one?
My own tooling throws off requests to multiple agents at the same time, then I compare which one is best, and continue from there. Most of the time Codex ends up with the best end results though, but my hunch is that at one point that'll change, hence I continue using multiple at the same time.
Per the updated docs (https://developers.openai.com/api/docs/guides/latest-model), it supersedes GPT-5.3-Codex, which is an interesting move.