We have an OCR job running with a lot of domain-specific knowledge. After testing different models we have clear results: some prompts are more effective with some models, plus some general observations (e.g., some prompts performed badly across all models).
Sample size was 1000 jobs per prompt/model pair. We rerun them once a month to detect regressions as well.
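A rough sketch of that kind of monthly regression check (names and the tolerance are my assumptions, and `run_ocr_job` is a stub standing in for a real model call):

```python
# Hypothetical sketch: score each (prompt, model) pair over a batch of
# jobs, then compare against last month's baseline to flag regressions.

def run_ocr_job(model: str, prompt: str, job: dict) -> str:
    # Stub for illustration; a real version would call the model's API.
    return job["expected"] if "good" in prompt else "garbled"

def accuracy(model: str, prompt: str, jobs: list[dict]) -> float:
    hits = sum(run_ocr_job(model, prompt, j) == j["expected"] for j in jobs)
    return hits / len(jobs)

def detect_regression(current: float, baseline: float,
                      tolerance: float = 0.02) -> bool:
    # Flag when this month's score drops more than `tolerance` below baseline.
    return current < baseline - tolerance

jobs = [{"expected": f"text-{i}"} for i in range(1000)]
score = accuracy("model-a", "good prompt", jobs)
print(detect_regression(score, baseline=0.95))  # False on this stub
```

The point is less the scoring function than the baseline: without a stored number from last month, "the new model feels worse" is unfalsifiable.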
While I believe that performance varies with the prompt, I have a seriously hard time believing that a prompt that was effective with the previous model would perform worse with the next generation of the same model from the same lab.
You shouldn't have a hard time believing it. There are thousands of different domains out there. You find it hard to believe that any of them would perform worse in your scenario?
Labs are still really optimizing for maybe 10 of those domains. At most 25 if we're being incredibly generous.
And for many domains, "worse" can hardly be benched. Think about creative writing. Think about a Burmese cooking recipe generator.
Bruh, how do you evaluate a batch of 1000 jobs against model X for creative writing or cooking recipes? It’s vibes all the way down. This reeks of some kind of blog-spam SEO nonsense.
The entire point is that you _don't_ for creative writing, vibes are the whole point, and those vibes often get worse across model updates for the same prompts.
OK, so a while back I set up a workflow to do language tagging. There were 6-8 stages in the pipeline where it would go out to an LLM and come back. Each one had its own prompt that had to be tweaked to get decent results. I was only doing it for a smallish batch (150 short conversations) and only for private use, but I definitely wouldn't switch models without doing another informal round of quality assessment and prompt tweaking. If this were something I was using in production, there would be a whole different level of testing and quality required before switching to a different model.
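To make the shape of that concrete, here's a toy sketch of such a multi-stage pipeline (stage names and `call_llm` are invented for illustration; the real thing would hit a model API at each step):

```python
# Each stage sends text to an LLM with its own tuned prompt and feeds
# the result into the next stage. `call_llm` is a stub.

def call_llm(prompt: str, text: str) -> str:
    # Stub: append the stage name so the flow is visible.
    return f"{text} [{prompt.split(':')[0]}]"

STAGE_PROMPTS = [
    "detect-language: identify the language of this conversation",
    "segment: split the conversation into turns",
    "tag: label each turn with a language tag",
]

def run_pipeline(conversation: str) -> str:
    text = conversation
    for prompt in STAGE_PROMPTS:
        # Each stage's prompt was tuned against one specific model;
        # swapping models means re-checking every stage.
        text = call_llm(prompt, text)
    return text

print(run_pipeline("hola, how are you?"))
```

With 6-8 stages, a model swap multiplies the tweaking work: an output drift at stage 2 cascades into every prompt downstream of it.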
The big providers are gonna deprecate old models after a new one comes out: they can't make money off giant models sitting on GPUs that aren't taking constant batch jobs. If you wanna avoid re-tweaking, open weights are the way. Lots of companies host open weights, and they're dirt cheap. Tune your prompts on those, and if one provider stops supporting a model, another will, or worst case you can run it yourself. Open weights are now consistently at SOTA level, only a month or two behind the big providers. And if your prompts are short and simple, even older, smaller models work fine.
Enterprises moving slowly, or preferring to stay on old technology they already know how to work with, is received wisdom in HN-adjacent computing: a truism known and reported for more than three decades (five decades if you count The Mythical Man-Month).
Sounds like someone who's responsible, on the hook, for a bunch of repeatable processes (as repeatable as LLM-driven processes can be) operating at scale.
And out in the open, tools like open-webui bolt on evals so you can compare how different models, including new ones, perform on the tasks that you in particular care about.
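This is not open-webui's actual API, just a generic sketch of the idea: score each candidate model on (input, expected) pairs drawn from your own workload, and let a new model earn its place before you switch. The two stub "models" are invented for illustration.

```python
# Score candidate models on the exact tasks you care about, not on a
# lab's benchmark suite.

def score_model(model_fn, tasks: list[tuple[str, str]]) -> float:
    # tasks are (input, expected) pairs from your own workload.
    hits = sum(model_fn(inp) == want for inp, want in tasks)
    return hits / len(tasks)

# Stub "models" standing in for real endpoints.
old_model = lambda s: s.upper()
new_model = lambda s: s.upper().rstrip(".")  # subtly different behavior

tasks = [("ok.", "OK."), ("fine.", "FINE.")]
scores = {"old": score_model(old_model, tasks),
          "new": score_model(new_model, tasks)}
print(scores)  # the "new" model regresses on this toy task set
```

A newer model can win every public benchmark and still lose on your tasks, which is exactly why per-workload evals matter.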
Indeed, LLM providers mainly don't release models that do worse on benchmarks. Running your own evals is the same kind of testing, just done outside the corporate boundary and its pre-release feedback loop, as a public evaluation.
Like, bro, do you think 5.x is a drop-in replacement for 4.1? It obviously isn't, since it has reasoning effort and verbosity settings and no more temperature setting, etc.
There’s no way you can switch model versions without testing and tweaking prompts; even the outputs usually look different. In prod you pin a very specific version, like gpt-5.2-20250308.