These releases are lacking something. Yes, they optimised for benchmarks, but it...

ipsum2 · 2026-03-05T18:50:13 1772736613

The model was released less than an hour ago, and somehow you've been able to form such a strong opinion about it. Impressive!

satvikpendem · 2026-03-05T19:33:54 1772739234

It's more hedonic adaptation, people just aren't as impressed by incremental changes anymore over big leaps. It's the same as another thread yesterday where someone said the new MacBook with the latest processor doesn't excite them anymore, and it's because for most people, most models are good enough and now it's all about applications.

https://news.ycombinator.com/item?id=47232453#47232735

dmix · 2026-03-05T20:05:13 1772741113

Plus people just really like to whine on the internet

AlexeyBelov · 2026-03-06T10:35:26 1772793326

I think whine is a very strong word in this case. Kind of offputting and negative.

mirekrusin · 2026-03-05T19:42:54 1772739774

Oh, come on, if it can't run local models that compete with proprietary ones it's not good enough yet!

satvikpendem · 2026-03-05T20:09:15 1772741355

Qwen 3.5 small models are actually very impressive and do beat out larger proprietary models.

smartbit · 2026-03-06T10:12:18 1772791938

Qwen version 3.5 might be the last serious version (for some time at least), see Something is afoot in the land of Qwen (2 days ago) https://news.ycombinator.com/item?id=47249343

Also interesting experiences shared in that thread, even someone using it on a rented H200.

satvikpendem · 2026-03-06T16:51:25 1772815885

Not necessarily, Alibaba is still working on it and the CEO is directly co-leading the team. Translated with Qwen 3.5:

> To all colleagues in the Tongyi Lab:

> The company has approved Lin Junyang’s resignation and thanks him for his contributions during his tenure. Jingren will continue to lead the Tongyi Lab in advancing future work. At the same time, the company will establish a Foundation Model Support Group, jointly coordinated by myself, Jingren, and Fan Yu, to mobilize group resources in support of foundation model development.

> Technological progress demands constant advancement — stagnation means regression. Developing foundational large models is our key strategic direction toward the future. While continuing to uphold our open-source model strategy, we will further increase R&D investment in artificial intelligence, intensify efforts to attract top talent, and move forward together with renewed commitment.

> Wu Yongming

https://x.com/poezhao0605/status/2029396117239276013

earth2mars · 2026-03-05T19:15:00 1772738100

I am actually super impressed with Codex-5.3 extra high reasoning. Its a drop in replacement (infact better than Claude Opus 4.6. lately claude being super verbose going in circles in getting things resolved). I stopped using claude mostly and having a blast with Codex 5.3. looking forward to 5.4 in codex.

whynotminot · 2026-03-05T20:42:59 1772743379

I still love Opus but it's just too expensive / eats usage limits.

I've found that 5.3-Codex is mostly Opus quality but cheaper for daily use.

Curious to see if 5.4 will be worth somewhat higher costs, or if I'll stick to 5.3-Codex for the same reasons.

satvikpendem · 2026-03-05T19:31:55 1772739115

Same, it also helps that it's way cheaper than Opus in VSCode Copilot, where OpenAI models are counted as 1x requests while Opus is 3x, for similar performance (no doubt Microsoft is subsidizing OpenAI models due to their partnership).

CryZe · 2026-03-05T21:42:12 1772746932

I've been using both Opus 4.6 and Codex 5.3 in VSCode's Copilot and while Opus is indeed 3x and Codex is 1x, that doesn't seem to matter as Opus is willing to go work in the background for like an hour for 3 credits, whereas Codex asks you whether to continue every few lines of code it changes, quickly eating way more credits than Opus. In fact Opus in Copilot is probably underpriced, as it can definitely work for an hour with just those 12 cents of cost. Which I'm not sure you get anywhere else at such a low price.

Update: I don't know why I can't reply to your reply, so I'll just update this. I have tried many times to give it a big todo list and told it to do it all. But I've never gotten it to actually work on it all and instead after the first task is complete it always asks if it should move onto the next task. In fact, I always tell it not to ask me and yet it still does. So unless I need to do very specific prompt engineering, that does not seem to work for me.

satvikpendem · 2026-03-05T21:45:30 1772747130

That shouldn't really make a difference because you can just prompt Codex to behave the same way, having it load a big list of todo items perhaps from a markdown file and asking it to iterate until it's finished without asking for confirmation, and that'll still cost 1x over Opus' 3x.

braebo · 2026-03-05T21:44:25 1772747065

I struggle to believe this. Codex can’t hold a candle to Claude on any task I’ve given it.

cj · 2026-03-05T18:53:53 1772736833

One opinion you can form in under an hour is... why are they using GPT-4o to rate the bias of new models?

> assess harmful stereotypes by grading differences in how a model responds

> Responses are rated for harmful differences in stereotypes using GPT-4o, whose ratings were shown to be consistent with human ratings

Are we seriously using old models to rate new models?

hex4def6 · 2026-03-05T19:11:59 1772737919

If you're benchmarking something, old & well-characterized / understood often beats new & un-characterized.

Sure, there may be shortcomings, but they're well understood. The closer you get to the cutting edge, the less characterization data you get to rely on. You need to be able to trust & understand your measurement tool for the results to be meaningful.

titanomachy · 2026-03-05T18:58:22 1772737102

Why not? If they’ve shown that 4o is calibrated to human responses, and they haven’t shown that yet for 5.4…

utopiah · 2026-03-05T18:57:11 1772737031

Benchmarks?

I don't use OpenAI nor even LLMs (despite having tried https://fabien.benetou.fr/Content/SelfHostingArtificialIntel... a lot of models) but I imagine if I did I would keep failed prompts (can just be a basic "last prompt failed" then export) then whenever a new model comes around I'd throw at 5 it random of MY fails (not benchmarks from others, those will come too anyway) and see if it's better, same, worst, for My use cases in minutes.

If it's "better" (whatever my criteria might be) I'd also throw back some of my useful prompts to avoid regression.

Really doesn't seem complicated nor taking much time to forge a realistic opinion.

Sohcahtoa82 · 2026-03-06T00:35:28 1772757328

GP said "It is time for a product, not for a marginally improved model."

ChatGPT is still just that: Chat.

Meanwhile, Anthropic offers a desktop app with plugins that easily extend the data Claude has access to. Connect it to Confluence, Jira, and Outlook, and it'll tell you what your top priorities are for the day, or write a Powerpoint. Add Github and it can reason about your code and create a design document on Confluence.

OpenAI doesn't have a product the way Anthropic does. ChatGPT might have a great model, but it's not nearly as useful.

fisf · 2026-03-06T09:19:00 1772788740

Are you ignoring the codex desktop app on purpose? Or the integrations?

kranke155 · 2026-03-05T20:05:05 1772741105

The models are so good that incremental improvements are not super impressive. We literally would benefit more from maybe sending 50% of model spending into spending on implementation into the services and industrial economy. We literally are lagging in implementation, specialised tools, and hooks so we can connect everything to agents. I think.

tgarrett · 2026-03-05T19:46:20 1772739980

Plasma physicist here, I haven't tried 5.4 yet, but in general I am very impressed with the recent upgrades that started arriving in the fall of 2025: for tasks like manipulating analytic systems of equations, quickly developing new features for simulation codes, and interpreting and designing experiments (with pictures) they have become much stronger. I've been asking questions and probing them for several years now out of curiosity, and they suddenly have developed deep understanding (Gemini 2.5 <<< Gemini 3.1) and become very useful. I totally get the current SV vibes, and am becoming a lot more ambitious in my future plans.

brcmthrowaway · 2026-03-05T19:52:38 1772740358

Youre just chatting yourself out of a job.

slibhb · 2026-03-05T22:20:33 1772749233

If we don't need plasma physicists anymore then we probably have fusion reactors or something, which seems like a fine trade. (In reality we're going to want humans in the loop for for the forseeable future)

axus · 2026-03-05T20:42:17 1772743337

Giving the right answer: $1

Asking the right question: $9,999

softwaredoug · 2026-03-05T18:55:51 1772736951

The products are the harnesses, and IMO that’s where the innovation happens. We’ve gotten better at helping get good, verifiable work from dumb LLMs

crorella · 2026-03-06T07:04:44 1772780684

underrated comment, this is going to be the main differentiator going forward, the more powerful and versatile harness the more the models will be able to achieve and better/more advanced products will come out of it.

mindwok · 2026-03-05T20:59:57 1772744397

They don't need to be impressive to be worthwhile. I like incremental improvements, they make a difference in the day to day work I do writing software with these.

iterateoften · 2026-03-05T19:02:41 1772737361

The product is putting the skills / harness behind the api instead of the agent locally on your computer and iterating on that between model updates. Close off the garden.

Not that I want it, just where I imagine it going.

Gigachad · 2026-03-05T23:01:56 1772751716

They have a product now. Mass surveillance and fully automated killing machines.

wahnfrieden · 2026-03-05T18:50:43 1772736643

5.3 codex was a huge leap over 5.2 for agentic work in practice. have you been using both of those or paying attention more to benchmark news and chatgpt experience?

esafak · 2026-03-05T18:45:11 1772736311

That's for you to build; they provide the brains. Do you really want one company to build everything? There wouldn't be a software industry to speak of if that happened.

simlevesque · 2026-03-05T18:47:40 1772736460

Nah, the second you finish your build they release their version and then it's game over.

acedTrex · 2026-03-05T18:47:58 1772736478

Well they are currently the ones valued at a number with a whole lotta 0s on it. I think they should probably do both

varispeed · 2026-03-05T19:32:11 1772739131

The scores increase and as new versions are released they feel more and more dumbed down.

jascha_eng · 2026-03-05T19:19:17 1772738357

When did they stop putting competitor models on the comparison table btw? And yeh I mean the benchmark improvements are meh. Context Window and lack of real memory is still an issue.

metalliqaz · 2026-03-05T19:12:24 1772737944

They need something that POPS:

    The new GPT -- SkyNet for _real_