It's a little hard to compare, because Claude needs significantly fewer tokens for the same task. A better metric is the cost per task, which ends up being pretty similar.
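
To make the cost-per-task arithmetic concrete, here's a rough sketch; the per-token prices and token counts below are made-up placeholders, not real AA numbers:

    # Cost per task = price per output token * output tokens used for the task.
    # Prices and token counts are hypothetical placeholders for illustration.
    def cost_per_task(price_per_mtok_out, output_tokens):
        return price_per_mtok_out * output_tokens / 1_000_000

    # A model that is pricier per token can still be comparable (or cheaper)
    # per task if it needs fewer tokens to finish the same task.
    model_a = cost_per_task(price_per_mtok_out=75.0, output_tokens=20_000)  # $1.50
    model_b = cost_per_task(price_per_mtok_out=30.0, output_tokens=60_000)  # $1.80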

For example, on Artificial Analysis, the GPT-5.x models' cost to run the evals ranges from half that of Claude Opus (at medium and high reasoning) to significantly more than Opus (at extra-high reasoning). So on their cost graphs, GPT spans a wide range, and Opus sits right in the middle of it.

The most striking graph to look at there is "Intelligence vs Output Tokens". When you account for that, I think the actual costs end up being quite similar.

According to the evals, at least, GPT at extra-high reasoning matches Opus in intelligence while costing more.

Of course, as always, benchmarks are mostly meaningless and you need to check Actual Real World Results For Your Specific Task!

For most of my tasks, the main thing a benchmark tells me is how overqualified the model is, i.e. how much I will be over-paying and over-waiting! (My classic example: I gave the same task to Gemini 2.5 Flash and Gemini 2.5 Pro. Both did it to the same level of quality, but Pro took 3x longer and cost 3x more!)

Looks like the same thing might apply to GPT-5.4 vs the previous GPTs:

>In the API, GPT‑5.4 is priced higher per token than GPT‑5.2 to reflect its improved capabilities, while its greater token efficiency helps reduce the total number of tokens required for many tasks.

I eagerly await the benchies on AA :)


Benchies update:

https://artificialanalysis.ai/

Looks like it costs ~25% more than 5.2, with both on xhigh reasoning.

They only seem to have tested xhigh, which is a shame, since I think that reasoning level is past the point of diminishing returns for most tasks.

Also, I was completely wrong earlier: Opus is significantly more expensive. I was looking at the wrong entry in the chart, the non-reasoning version of Opus. The fair comparison is Opus on max reasoning, which costs about twice as much as GPT-5.4 xhigh to run the AA evals.


But does it use the same agent harness? Because the harness has a big effect on the behavior.


