The marquee feature is obviously the 1M context window, compared to the ~200k ot...

damsta · 2026-03-05T20:12:14 1772741534

There is extra cost for >272K:

> For models with a 1.05M context window (GPT-5.4 and GPT-5.4 pro), prompts with >272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.

Taken from https://developers.openai.com/api/docs/models/gpt-5.4

fragmede · 2026-03-05T20:16:19 1772741779

Which, Claude has the same deal. You can get a 1M context window, but it's gonna cost ya. If you run /model in claude code, you get:

    Switch between Claude models. Applies to this session and future Claude Code sessions. For other/previous model names, specify with --model.
    
       1. Default (recommended)   Opus 4.6 · Most capable for complex work
       2. Opus (1M context)        Opus 4.6 with 1M context · Billed as extra usage · $10/$37.50 per Mtok
       3. Sonnet                   Sonnet 4.6 · Best for everyday tasks
       4. Sonnet (1M context)      Sonnet 4.6 with 1M context · Billed as extra usage · $6/$22.50 per Mtok
       5. Haiku                    Haiku 4.5 · Fastest for quick answers

WXLCKNO · 2026-03-06T00:24:11 1772756651

Anthropic literally don't allow you to use the 1M context anymore on Sonnet and Opus 4.6 without it being billed as extra usage immediately.

I had 4.5 1M before that so they definitely made it worse.

OpenAI at least gives you the option of using your plan for it. Even if it uses it up more quickly.

neom · 2026-03-06T00:56:15 1772758575

Is that why it says rate limit all the time if you switch to a 1M model on Claude now? It kept giving me that so I switched to API account over the weekend for some vibe coding ran up a huuuuge API bill by mistake, whooops.

minimaxir · 2026-03-05T20:29:15 1772742555

Good find, and that's too small a print for comfort.

ValentineC · 2026-03-05T21:31:41 1772746301

It's also in the linked article:

> GPT‑5.4 in Codex includes experimental support for the 1M context window. Developers can try this by configuring model_context_window and model_auto_compact_token_limit. Requests that exceed the standard 272K context window count against usage limits at 2x the normal rate.

glenstein · 2026-03-05T20:37:11 1772743031

Wow, that's diametrically the opposite point: the cost is *extra*, not free.

apetresc · 2026-03-05T21:24:08 1772745848

Diametrically opposite to tokens beyond 200K being literally free? As in, you only pay for the first 200K tokens and the remaining 800K cost $0.00?

I don't think that's a fair reading of the original post at all, obviously what they meant by "no cost" was "no increase in the cost".

swores · 2026-03-06T07:45:37 1772783137

I can see that's what they mean now that I've read the replies, but when I first read that top comment I too parsed it as meaning 201k would cost the same as 999k (which admittedly did seem strange, hence I read the replies to confirm and sure enough that's not actually the case!)

tedsanders · 2026-03-05T18:41:14 1772736074

Yeah, long context vs compaction is always an interesting tradeoff. More information isn't always better for LLMs, as each token adds distraction, cost, and latency. There's no single optimum for all use cases.

For Codex, we're making 1M context experimentally available, but we're not making it the default experience for everyone, as from our testing we think that shorter context plus compaction works best for most people. If anyone here wants to try out 1M, you can do so by overriding `model_context_window` and `model_auto_compact_token_limit`.

Curious to hear if people have use cases where they find 1M works much better!

(I work at OpenAI.)

akiselev · 2026-03-05T19:06:08 1772737568

> Curious to hear if people have use cases where they find 1M works much better!

Reverse engineering [1]. When decompiling a bunch of code and tracing functionality, it's really easy to fill up the context window with irrelevant noise and compaction generally causes it to lose the plot entirely and have to start almost from scratch.

(Side note, are there any OpenAI programs to get free tokens/Max to test this kind of stuff?)

[1] https://github.com/akiselev/ghidra-cli

fragmede · 2026-03-06T09:02:27 1772787747

OpenAi has program for trusted cybersecurity researchers https://openai.com/index/trusted-access-for-cyber/

simianwords · 2026-03-05T18:46:11 1772736371

Do you maybe want to give us users some hints on what to compact and throw away? In codex CLI maybe you can create a visual tool that I can see and quickly check mark things I want to discard.

Sometimes I’m exploring some topic and that exploration is not useful but only the summary.

Also, you could use the best guess and cli could tell me that this is what it wants to compact and I can tweak its suggestion in natural language.

Context is going to be super important because it is the primary constraint. It would be nice to have serious granular support.

sillysaurusx · 2026-03-05T20:08:16 1772741296

You may want to look over this thread from cperciva: https://x.com/cperciva/status/2029645027358495156

I too tried Codex and found it similarly hard to control over long contexts. It ended up coding an app that spit out millions of tiny files which were technically smaller than the original files it was supposed to optimize, except due to there being millions of them, actual hard drive usage was 18x larger. It seemed to work well until a certain point, and I suspect that point was context window overflow / compaction. Happy to provide you with the full session if it helps.

I’ll give Codex another shot with 1M. It just seemed like cperciva’s case and my own might be similar in that once the context window overflows (or refuses to fill) Codex seems to lose something essential, whereas Claude keeps it. What that thing is, I have no idea, but I’m hoping longer context will preserve it.

FrankBooth · 2026-03-05T20:40:48 1772743248

What’s the connection with context size in that thread? It seems more like an instruction following problem.

cperciva · 2026-03-06T01:40:53 1772761253

Yeah, I would definitely characterize it as an instruction following problem. After a few more round trips I got it to admit that "my earlier passes leaned heavily on build/tests + targeted reads, which can miss many “deep” bugs that only show up under specific conditions or with careful semantic review" and then asking it to "Please do a careful semantic review of files, one by one." started it on actually reviewing code.

Mind you, the bugs it reported were mostly bogus. But at least I was eventually able to convince it to try.

sillysaurusx · 2026-03-06T01:29:45 1772760585

It occurred to me that searching 196 .c files was a context window issue, but maybe there’s something else going on. Either way, Codex could behave better.

woadwarrior01 · 2026-03-05T20:26:42 1772742402

Please don't post links with tracking parameters (t=jQb...).

https://xcancel.com/cperciva/status/2029645027358495156

sillysaurusx · 2026-03-05T20:30:35 1772742635

Haha. This was the second time in like a year that I’ve posted a Twitter link, and the second time someone complained. Okay, I’ll try to remove those before posting, and I’ll edit this one out.

Feels like a losing battle, but hey, the audience is usually right.

woadwarrior01 · 2026-03-05T20:35:28 1772742928

I'm sorry, but it's my pet peeve. If you're on iOS/macOS I built a 100% free and privacy-friendly app to get rid of tracking parameters from hundreds of different websites, not just X/Twitter.

https://apps.apple.com/us/app/clean-links-qr-code-reader/id6...

monocularvision · 2026-03-05T21:12:10 1772745130

This is great! I have been meaning to implement this sort of thing in my existing Shortcuts flow but I see you already support it in Shortcuts! Thank you for this!

Anywhere I can toss a Tip for this free app?

woadwarrior01 · 2026-03-05T23:00:20 1772751620

I'm glad you like it. :)

sillysaurusx · 2026-03-05T20:38:28 1772743108

It works on iOS? That’s cool. I’ll give it a go.

pmarreck · 2026-03-05T21:05:21 1772744721

So what is your motivation for doing this, incidentally? Can you be explicit about it? I am genuinely curious.

Especially when it’s to the point of, you know, nagging/policing people to do it the way you’d prefer, when you could just redirect your router requests from x.com to xcancel.com

woadwarrior01 · 2026-03-05T23:03:15 1772751795

It's not particularly about x.com, hundreds of site like x, youtube, facebook, linkedin, tiktok etc surreptitious add tracking parameters to their links. The iOS Messages app even hides these tracking parameters. I don't like being surreptitiously tracked online and judging by the success of my free app, there are millions of people like me.

pmarreck · 2026-03-06T00:40:29 1772757629

so, since these companies have to comply with removing PII, is the worst thing that could happen to me, that I get ads that are more likely to be interesting to me?

i’m not being facetious, honest question, especially considering ads are the only thing paying these people these days

Terretta · 2026-03-06T11:10:50 1772795450

Who has to comply with removing PII? Your profile, yours, mapped to a special snowflake ID, is packaged and sold across a network of 2500 - 4000 buyers, including in particular those that clean, tie (a surprisingly small footprint turns into its own "natural primary key"), qualify, and sell on to agencies. No step in this is illegal.

https://www.theverge.com/2024/10/23/24277679/atlas-privacy-b...

pmarreck · 2026-03-06T13:35:15 1772804115

my first and last name is already a "natural primary key" (every single google result of Peter Marreck is me), so I've already had to give that up a long time ago. So nothing new is lost I guess?

marcus_holmes · 2026-03-06T06:09:57 1772777397

The more data they have on you, the more valuable that data is to a third party. So they sell your data to someone else, who then phones you based on your known deep interest in <whatever it was that tracked you>. Or spams you. Or messages you. Or whatever method they think will most get your attention.

If you don't give them that information, they can't sell it, and the buyers won't annoy you.

It's not that the ads you get are more interesting, it's that you get more ads because they think they know more about you.

rapind · 2026-03-06T11:45:59 1772797559

IMO the tracking, advertising, and attention market might just be societies biggest problem.

Certainly it employs a lot of people, as do cartels.

taneq · 2026-03-06T04:58:29 1772773109

The worst thing that could happen is that you get caught in some government dragnet based on your historical viewing data and get disappeared because (as is the nature of dragnet searches) no matter how innocent you are you still look guilty.

pnexk · 2026-03-06T00:28:53 1772756933

Helpful type of nagging for me. Most here would agree they are not a positive aspect of the modern digital experience, calling it out gently without hostility is not bad. It might not be quite self policing but some of that with good reason is not bad for healthy communities IMO.

lubesGordi · 2026-03-05T21:55:54 1772747754

It's funny that the context window size is such a thing still. Like the whole LLM 'thing' is compression. Why can't we figure out some equally brilliant way of handling context besides just storing text somewhere and feeding it to the llm? RAG is the best attempt so far. We need something like a dynamic in flight llm/data structure being generated from the context that the agent can query as it goes.

le-mark · 2026-03-06T02:12:05 1772763125

That’s actually a pretty cool idea. When I think about my internal mental model of a codebase I’m working on it’s definitely a compacted lossy thing that evolves as I learn more.

nowittyusername · 2026-03-05T20:53:13 1772743993

Personally what I am more interested about is effective context window. I find that when using codex 5.2 high, I preferred to start compaction at around 50% of the context window because I noticed degradation at around that point. Though as of a bout a month ago that point is now below that which is great. Anyways, I feel that I will not be using that 1 million context at all in 5.4 but if the effective window is something like 400k context, that by itself is already a huge win. That means longer sessions before compaction and the agent can keep working on complex stuff for longer. But then there is the issue of intelligence of 5.4. If its as good as 5.2 high I am a happy camper, I found 5.3 anything... lacking personally.

gck1 · 2026-03-06T01:31:42 1772760702

Not sure how accurate this is, but found contextarena benchmarks today when I had the same question.

It appears only gemini has actual context == effective context from these. Although, I wasn't able to test this neither in gemini cli, nor antigravity with my pro subscription because, well, it appears nobody actually uses these tools at Google.

https://contextarena.ai/?showLabels=false

Someone1234 · 2026-03-05T19:52:42 1772740362

That's an interesting point regarding context Vs. compaction. If that's viewed as the best strategy, I'd hope we would see more tools around compaction than just "I'll compact what I want, brace yourselves" without warning.

Like, I'd love an optional pre-compaction step, "I need to compact, here is a high level list of my context + size, what should I junk?" Or similar.

thyb23 · 2026-03-05T20:18:34 1772741914

This is exactly how it should work. I imagine it as a tree view showing both full and summarized token counts at each level, so you can immediately see what’s taking up space and what you’d gain by compacting it.

The agent could pre-select what it thinks is worth keeping, but you’d still have full control to override it. Each chunk could have three states: drop it, keep a summarized version, or keep the full history.

That way you stay in control of both the context budget and the level of detail the agent operates with.

joquarky · 2026-03-05T22:51:38 1772751098

I compact myself by having it write out to a file, I prune what's no longer relevant, and then start a new session with that file.

But I'm mostly working on personal projects so my time is cheap.

I might experiment with having the file sections post-processed through a token counter though, that's a great idea.

Folcon · 2026-03-05T20:28:00 1772742480

I do find it really interesting that more coding agents don't have this as an toggleable feature, sometimes you really need this level of control to get useful capability

Someone1234 · 2026-03-05T20:50:28 1772743828

Yep; I've actually had entire jobs essentially fail due to a bad compaction. It lost key context, and it completely altered the trajectory.

I'm now more careful, using tracking files to try to keep it aligned, but more control over compaction regardless would be highly welcomed. You don't ALWAYS need that level of control, but when you do, you do.

joshvm · 2026-03-06T00:31:38 1772757098

Have you tried writing that as a skill? Compaction is just a prompt with a convenient UI to keep you in the same tab. There's no reason you can't ask the model to do that yourself and start a new conversation. You can look up Claude's /compact definition, for reference.

However, in some harnesses the model is given access to the old chat log/"memories", so you'd need a way to provide that. You could compromise by running /compact and pasting the output from your own summarizer (that you ran first, obviously).

mindplunge · 2026-03-06T07:35:44 1772782544

Frontend work with large component libraries. When I'm refactoring shared design system components, things like a token system that touches 80+ files, compaction tends to lose the thread on which downstream components have already been updated vs which still need changes. It ends up re-doing work or missing things silently.

The model holds "what has been updated" well at the start of a session. After compaction, it reconstructs from summaries, and that reconstruction is lossy exactly where precision matters most: tracking partially-complete cross-file operations.

1M context isn't about reading more, it's about not forgetting what you already did halfway through.

dahcryn · 2026-03-06T08:05:48 1772784348

I would like to counteract your statement that each token adds a distraction.

In our experiments, we see a surprising benefit to rewriting blocks to use more tokens, especially long lists etc..

E.g. compare these two options

"The following conditions are excluded from your contract - condition A - condition B ... - condition Z"

The next one works better for us:

"The following conditions are excluded from your contract - condition A is excluded - condition B is excluded ... - condition Z is excluded"

And we now have scripts to rewrite long documents like this, explicitly adding more tokens. Would you have any opinion on this?

mnicky · 2026-03-06T08:40:54 1772786454

This observation makes sense, because all models currently probably use some kind of a sparse attention architecture.

So the closer the two related pieces of information are to each other in the input context, the larger the chance their relationship will be preserved.

jmward01 · 2026-03-06T05:19:11 1772774351

What needs to be an option is to allow complete and then compact and if needed go into the 1m version. That way you can get the most out of the shorter window but in the case where it just couldn't finish and compact in time it will (at cost) go over. I wonder how many tokens are actually left at the end of compaction on average. I know there have been many times where I likely needed just another 10-20k and a better stopping point would have been there.

asabla · 2026-03-05T21:43:40 1772747020

I really don't have any numbers to back this up. But it feels like the sweet spot is around ~500k context size. Anything larger then that, you usually have scoping issues, trying to do too much at the same time, or having having issues with the quality of what's in the context at all.

For me, I would say speed (not just time to first token, but a complete generation) is more important then going for a larger context size.

gspetr · 2026-03-05T20:02:23 1772740943

I have found a bigger context window qute useful when trying to make sense of larger codebases. Generating documentation on how different components interact is better than nothing, especially if the code has poor test coverage.

I've also had it succeed in attempts to identify some non-trivial bugs that spanned multiple modules.

oidar · 2026-03-06T14:41:09 1772808069

context distillation mostly. Agents tend to report success too early if they find something close to what they need for the task. If you are able to shove it in a 1M context, it's impossible for them to give up looking, it's in the context. But for actual implementation, it's not useful at all. They get derailed with too long of a context.

neom · 2026-03-06T00:58:43 1772758723

On Claude Code (sorry) the big context window is good for teams. On CC if you hit compact while a bunch of teams working it's a total shit show after.

andai · 2026-03-05T20:28:49 1772742529

It's a little hard to compare, because Claude needs significantly fewer tokens for the same task. A better metric is the cost per task, which ends up being pretty similar.

For example on Artificial Analysis, the GPT-5.x models' cost to run the evals range from half of that of Claude Opus (at medium and high), to significantly more than the cost of Opus (at extra high reasoning). So on their cost graphs, GPT has a considerable distribution, and Opus sits right in the middle of that distribution.

The most striking graph to look at there is "Intelligence vs Output Tokens". When you account for that, I think the actual costs end up being quite similar.

According to the evals, at least, the GPT extra high matches Opus in intelligence, while costing more.

Of course, as always, benchmarks are mostly meaningless and you need to check Actual Real World Results For Your Specific Task!

For most of my tasks, the main thing a benchmark tells me is how overqualified the model is, i.e. how much I will be over-paying and over-waiting! (My classic example is, I gave the same task to Gemini 2.5 Flash and Gemini 2.5 Pro. Both did it to the same level of quality, but Gemini took 3x longer and cost 3x more!)

andai · 2026-03-05T23:23:34 1772753014

Looks like the same thing might apply to GPT-5.4 vs the previous GPTs:

>In the API, GPT‑5.4 is priced higher per token than GPT‑5.2 to reflect its improved capabilities, while its greater token efficiency helps reduce the total number of tokens required for many tasks.

I eagerly await the benchies on AA :)

andai · 2026-03-06T19:08:51 1772824131

Benchies update:

https://artificialanalysis.ai/

Looks like it costs ~25% more than 5.2, with both on xhigh reasoning.

They only seem to have tested xhigh, which is a shame, since I think that reasoning level is in the point of diminishing returns for most tasks.

Also I was completely wrong earlier. Opus is significantly more expensive. I was looking at the wrong entry in the chart, the non-reasoning version of Opus. The fair comparison is Opus on max reasoning, which costs about twice the price of GPT-5.4 xhigh, to run the AA evals.

hagen8 · 2026-03-06T06:42:34 1772779354

But does it use the same agent harness? Because the harness determines the behavior a lot.

netinstructions · 2026-03-05T19:36:57 1772739417

People (and also frustratingly LLMs) usually refer to https://openai.com/api/pricing/ which doesn't give the complete picture.

https://developers.openai.com/api/docs/pricing is what I always reference, and it explicitly shows that pricing ($2.50/M input, $15/M output) for tokens under 272k

It is nice that we get 70-72k more tokens before the price goes up (also what does it cost beyond 272k tokens??)

Flashtoo · 2026-03-05T20:12:48 1772741568

> Prompts with more than 272K input tokens are priced at 2x input and 1.5x output for the full session for standard, batch, and flex.

netinstructions · 2026-03-05T20:49:01 1772743741

Thanks, it looks like the pricing page keeps getting updated.

Even right now one page refers to prices for "context lengths under 270K" whereas another has pricing for "<272K context length"

smusamashah · 2026-03-05T21:16:24 1772745384

Gemini already has 1M or 2M context window right?

rezonant · 2026-03-06T07:29:20 1772782160

Yes, 1M context window since Gemini 1.5 Pro first previewed in February 2024.

danenania · 2026-03-06T14:27:21 1772807241

Gemini 1.5 Pro actually has 2M!

No other model from a major lab has matched it since afaik.

Edit: err, I see in the comment below mine that Grok has 2M as well. Had no idea!

peterspath · 2026-03-06T04:48:18 1772772498

Grok has a 2M context window for most of their models.

For example their latest model `grok-4-1-fast-reasoning`:

- Context window: 2M

- Rate limits: 4M tokens per minute, 480 requests per minute

- Pricing: $0.20/M input $0.50/M output

Grok is not as good in coding as Claude for example. But for researching stuff it is incredible. While they have a model for coding now, did not try that one out yet.

https://docs.x.ai/developers/models

aurareturn · 2026-03-06T06:45:29 1772779529

What kind of research do you use it for?

dev_l1x_be · 2026-03-06T12:06:41 1772798801

Based on my experience with LLMs the larger your input context the bigger the chance of something going sideways in the response. Not sure how to address this properly.

luca-ctx · 2026-03-05T20:41:34 1772743294

Context rot is definitely still a problem but apparently it can be mitigated by doing RL on longer tasks that utilize more context. Recent Dario interview mentions this is part of Anthropic’s roadmap.

paulddraper · 2026-03-05T20:30:59 1772742659

I don’t know about 5.4 specifically, but in the past anything over 200k wasn’t that great anyway.

Like, if you really don’t want to spend any effort trimming it down, sure use 1m.

Otherwise, 1m is an anti pattern.

fvv · 2026-03-06T08:26:53 1772785613

imo , the main feature is /fast ... who use 1M context and for what? the model become dumber already at 200K.. it's better to manage the context , and since 5.3, codex is very good at managing it

thehamkercat · 2026-03-05T18:27:37 1772735257

GPT 5.3 codex had 400K context window btw

AtreidesTyrant · 2026-03-05T20:40:04 1772743204

token rot exists for any context window at above 75% capacity, thats why so many have pushed for 1 mil windows

simianwords · 2026-03-05T18:34:00 1772735640

Why would some one use codex instead?

surgical_fire · 2026-03-05T18:48:08 1772736488

I've been using Codex for software development personally (I have a ChatGPT account), and I use Claude at work (since it is provided by my employer).

I find both Codex and Claude Opus perform at a similar level, and in some ways I actually prefer Codex (I keep hitting quota limits in Opus and have to revert back to Sonnet).

If your question is related to morality (the thing about US politics, DoD contract and so on)... I am not from the US, and I don't care about its internal politics. I also think both OpenAI and Anthropic are evil, and the world would be better if neither existed.

hnsr · 2026-03-05T21:08:32 1772744912

> I've been using Codex for software development personally (I have a ChatGPT account), and I use Claude at work (since it is provided by my employer).

Exact same situation here. I've been using both extensively for the last month or so, but still don't really feel either of them is much better or worse. But I have not done large complex features with it yet, mostly just iterative work or small features.

I also feel I am probably being very (overly?) specific in my prompts compared to how other people around me use these agents, so maybe that 'masks' things

joquarky · 2026-03-05T23:05:04 1772751904

> overly specific

I have a hypothesis that people who have patience and reasonably well-developed written language skills will scratch their heads at why everyone else is having so much difficulty.

simianwords · 2026-03-05T18:49:21 1772736561

No my question was why would I use codex over gpt 5.4

surgical_fire · 2026-03-05T18:57:34 1772737054

Ahh, good question. I misunderstood you, apologies.

There's no mention of pricing, quotas and so on. Perhaps Codex will still be preferable for coding tasks as it is tailored for it? Maybe it is faster to respond?

Just speculation on my part. If it becomes redundant to 5.4, I presume it will be sunset. Or maybe they eventually release a Codex 5.4?

landtuna · 2026-03-05T19:34:31 1772739271

5.3 Codex is $1.75/$14, and 5.4 is $2.50/$15.

surgical_fire · 2026-03-05T21:24:25 1772745865

There you go. It makes perfect sense to keep it around then.

athrowaway3z · 2026-03-05T19:56:29 1772740589

They perform at a somewhat equal level on writing single files. But Codex is absolute garbage at theory of self/others. That quickly becomes frustrating.

I can tell claude to spawn a new coding agent, and it will understand what that is, what it should be told, and what it can approximately do.

Codex on the other hand will spawn an agent and then tell it to continue with the work. It knows a coding agent can do work, but doesn't know how you'd use it - or that it won't magically know a plan.

You could add more scaffolding to fix this, but Claude proves you shouldn't have to.

I suspect this is a deeper model "intelligence" difference between the two, but I hope 5.4 will surprise me.

surgical_fire · 2026-03-05T20:32:04 1772742724

> They perform at a somewhat equal level on writing single files.

That's not the experience I have. I had it do more complex changes spawning multiple files and it performed well.

I don't like using multiple agents though. I don't vibe code, I actually review every change it makes. The bottleneck is my review bandwidth, more agents producing more code will not speed me up (in fact it will slow me down, as I'll need to context switch more often).

lmeyerov · 2026-03-05T20:39:00 1772743140

In our evals for answering cybersecurity incident investigation questions and even autonomously doing the full investigation, gpt-5.2-codex with low reasoning was the clear winner over non-codex or higher reasoning. 2X+ faster, higher completion rates, etc.

It was generally smarter than pre-5.2 so strategically better, and codex likewise wrote better database queries than non-codex, and as it needs to iteratively hunt down the answer, didn't run out the clock by drowning in reasoning.

Video: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t...

We'll be updating numbers on 5.3 and claude, but basically same thing there. Early, but we were surprised to see codex outperform opus here.

jeswin · 2026-03-05T19:10:11 1772737811

When it comes to lengthy non-trivial work, codex is much better but also slower.

synergy20 · 2026-03-05T21:02:50 1772744570

in my testing codex actually planned worse than claude but coded better once the plan is set, and faster. it is also excellent to cross check claude's work, always finding great weakness each time.

pmarreck · 2026-03-05T21:08:42 1772744922

That’s why I think the sweet spot is to write up plans with Claude and then execute them with Codex

GorbachevyChase · 2026-03-05T22:17:57 1772749077

Weird. It used to be the opposite. My own experience is that Claude’s behind-the-scenes support is a differentiator for supporting office work. It handles documents, spreadsheets and such much better than anyone else (presumably with server side scripts). Codex feels a bit smarter, but it inserts a lot of checkpoints to keep from running too long. Claude will run a plan to the end, but the token limits have become so small in the last couple months that the $20 pla basically only buys one significant task per day. The iOS app is what makes me keep the subscription.

meowface · 2026-03-06T17:26:38 1772817998

Correct, this is the way. A year or two ago lots of people were saying to do the opposite, but at least now and probably also even then, this is better. Claude is a more sensible and holistic designer, planner, debater, and idea generator. Codex is better at actually correctly implementing any large codebase change in a single pass.

joquarky · 2026-03-05T22:59:47 1772751587

And it fits well with the $20 plans for each since Codex seems to provide about 7-8x more usage than Claude.

embedding-shape · 2026-03-05T18:40:01 1772736001

Why would someone use Claude Code instead? Or any other harness? Or why only use one?

My own tooling throws off requests to multiple agents at the same time, then I compare which one is best, and continue from there. Most of the time Codex ends up with the best end results though, but my hunch is that at one point that'll change, hence I continue using multiple at the same time.