If 40k is the barrier to entry for impressive, that doesn't really sell the usec...

jazzyjackson · 2026-03-03T16:14:50 1772554490

It’s what a small business might have paid for an onprem web server a couple of decades ago before clouds caught on. I figure if a legal or medical practice saw value in LLMs it wouldn’t be a big deal to shove 50k into a closet

Greed · 2026-03-03T18:59:26 1772564366

You would still have to do some pretty outstanding volume before that makes sense over choosing the "Enterprise" plan from OpenAI or Anthropic if data retention is the motivation.

Assuming, of course, that your legal team signs off on their assurance not to train on or store your data with said Enterprise plans.

LunaSea · 2026-03-03T19:44:13 1772567053

At least with the server you know what you are buying.

With Anthropic you're paying for "more tokens than the free plan" which has no meaning

ttoinou · 2026-03-03T15:48:05 1772552885

With M3 Max with 64GB of unified ram you can code with a local LLM, so the bar is much lower

Greed · 2026-03-03T18:47:12 1772563632

But why? Spending several thousand dollars to run sub-par models when the break-even point could still be years away seems bizarre for any real usecase where your goal is productivity over novelty. Anyone who has used Codex or Opus can attest that the difference between those and a locally available model like Qwen or Codestral is night and day.

To be clear, I totally get the idea of running local LLMs for toy reasons. But in a business context the sell on a stack of Mac Pros seems misguided at best.

nurettin · 2026-03-03T20:16:05 1772568965

I ran the qwen 3.5 35b a3b q4 model locally on a ryzen server with 64k context window and 5-8 tokens a second.

It is the first local model I've tried which could reason properly. Similar to Gemini 2.5 or sonnet 3.5. I gave it some tools to call , asked claude to order it around, (download quotes, print charts, set up a gnome extension) even claude was sort of impressed that it could get the job done.

Point is, it is really close. It isn't opus 4.5 yet, but very promising given the size. Local is definitely getting there and even without GPUs.

But you're right, I see no reason to spend right now.

Greed · 2026-03-03T22:56:32 1772578592

Getting Opus to call something local sounds interesting, since that's more or less what it's doing with Sonnet anyway if you're using Claude Code. How are you getting it to call out to local models? Skills? Or paying the API costs and using Pi?

nurettin · 2026-03-04T02:06:19 1772589979

I just start llama.cpp serve with the gguf which creates an openai compatible endpoint.

The session so far is stored in a file like /tmp/s.json messages array. Claude reads that file, appends its response/query, sends it to the API and reads the response.

I simply wrapped this process in a python script and added tool calling as well. Tools run on the client side. If you have Claude, just paste this in :-)

robotresearcher · 2026-03-03T20:09:16 1772568556

Sometimes you can't push your working data to third party service, by law, by contract, or by preference.

0x457 · 2026-03-03T19:37:30 1772566650

I started doing it to hedge myself for inevitable disappearance of cheap inference.

blackqueeriroh · 2026-03-04T02:58:00 1772593080

Sure, but now double the team size. Double it again.

Suddenly that $40k is quite reasonable because you’ll never pay another dollar for st least 2-3 years.

prmoustache · 2026-03-04T11:01:05 1772622065

Would you?

2-3 years ago people were fantasizing on running local models on a consumer nvidia RTX GPU.

spacedcowboy · 2026-03-03T18:42:28 1772563348

It's not. I've got a single one of those 512GB machines and it's pretty damn impressive for a local model.

Greed · 2026-03-03T19:10:40 1772565040

Assuming you ran the gamut up from what you could fit on 32 or 64GB previously, how noticeable is the difference between models you can run on that vs. the 512GB you have now?

I've been working my way up from a 3090 system and I've been surprised by how underwhelming even the finetunes are for complex coding tasks, once you've worked with Opus. Does it get better? As in, noticeably and not just "hallucinates a few minutes later than usual"?