The bottleneck for running LLMs on consumer grade equipment is the amount of VRAM your GPU has. VRAM is RAM that's physically built into the unit and it has much higher memory bandwidth than regular system RAM.
Obviously, newer GPUs will run faster than older GPUs, but you need more VRAM to run larger models. A small LLM that fits into an RTX 4060's 8GB of VRAM will run faster there than it would on an older RTX 3090. But the 3090 has 24GB of VRAM, so it can run larger LLMs that the 4060 simply can't hold.
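As a rough rule of thumb, a model's weight memory is its parameter count times the bytes per weight, plus some headroom for the KV cache and activations. The 20% overhead factor below is an assumption for illustration, not a precise figure:

```python
def estimate_vram_gb(params_billion, bits_per_weight, overhead=1.2):
    """Back-of-envelope VRAM estimate: weight bytes plus ~20%
    headroom for KV cache and activations (a crude rule of thumb)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# A 7B model at 16-bit precision needs roughly 16.8 GB,
# far beyond a 4060's 8 GB; quantized to 4 bits it's ~4.2 GB.
print(estimate_vram_gb(7, 16))  # ~16.8
print(estimate_vram_gb(7, 4))   # ~4.2
```

This is why quantization matters so much on consumer cards: dropping from 16-bit to 4-bit weights cuts the memory footprint by roughly 4x.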
Llama.cpp can split your LLM across multiple GPUs, or offload part of it onto the CPU using system RAM, though that last option is much, much slower. The more of the model you can fit into VRAM, the better.
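The GPU/CPU split works layer by layer: llama.cpp's `--n-gpu-layers` (`-ngl`) option controls how many layers go to the GPU, and everything else stays in system RAM. A sketch of the capacity math, assuming layers are roughly equal in size (a simplification; real layers vary, and the embedding and output tensors are sized differently):

```python
def layers_on_gpu(n_layers, model_gb, vram_budget_gb):
    """How many of a model's layers fit in a VRAM budget,
    assuming uniformly sized layers (a simplifying assumption)."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, int(vram_budget_gb // per_layer_gb))

# Hypothetical 32-layer, 12 GB model on an 8 GB card:
# each layer is ~0.375 GB, so about 21 layers fit on the GPU
# and the remaining 11 run on the CPU.
print(layers_on_gpu(32, 12, 8))  # 21
```

In practice you'd leave some VRAM free for the KV cache and your desktop environment, then pass the result as `-ngl 21` and nudge it down if you hit out-of-memory errors.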
Apple's M-series Macs have unified memory, so the GPU has higher-bandwidth access to system RAM than would be available over a PCIe card. They're not as powerful as Nvidia GPUs, but they're a reasonable option for running larger LLMs. It's also worth considering AMD and Intel GPUs, but most development in the ML space happens on Nvidia's CUDA architecture, so bleeding-edge work tends to land on Nvidia first and other architectures later, if at all.