The only thing I can say to this is that Apple has seemed laser-focused on tuning their silicon for ML crunching, that this focus is clearly now going to be amped up further still, and that in tandem the software itself will be tuned to Apple silicon.
GPUs on the other hand are pretty general purpose. And 5 years on a focused superlinear ramp up is a long time, lots can happen. I am not saying it's 100%, or even 80% likely. It'll be super impressive if it happens, but I see it as well within the realms of reason.
Apple's new M2 Max has a neural engine which can do 15 trillion flops. Nvidia's A100 chip (released almost 3 years ago) can do 315 trillion flops. Apple is not going to close this 20x gap in a few years.
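Back-of-envelope on what closing it would actually take, using the headline numbers above (purely illustrative):

    # Gap today, and the relative yearly improvement needed to close it.
    m2_max_ne_tflops = 15
    a100_tflops = 315
    gap = a100_tflops / m2_max_ne_tflops
    print(f"current gap: ~{gap:.0f}x")  # ~21x

    # How much faster Apple would have to improve than Nvidia, per year,
    # to close the gap in N years (hypothetical, just to show the scaling).
    for years in (3, 5):
        needed_per_year = gap ** (1 / years)
        print(f"close in {years} years: ~{needed_per_year:.2f}x/year faster improvement")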
FTFY, remember it takes 8 of those just to load the thing. And when the average laptop has that much compute, GPT-4 will seem like Cleverbot in comparison to the state of the art.
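Rough memory math behind the "8 of those" part, assuming A100 80GB cards and fp16 weights (GPT-4's real size isn't public, so this only shows the scale):

    # 8 x A100 80GB, fp16/bf16 weights (2 bytes per parameter)
    a100_mem_gb = 80
    num_gpus = 8
    total_mem_gb = a100_mem_gb * num_gpus          # 640 GB across the node

    bytes_per_param = 2
    max_params = total_mem_gb * 1e9 / bytes_per_param
    print(f"{total_mem_gb} GB fits roughly {max_params / 1e9:.0f}B fp16 parameters")
    # ...and that's before activations, KV cache, or any batching headroom.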
I think the "tuning the models to the hardware" piece is important, and of course there is much more incentive to do this for Apple than for Nvidia, because of the distribution and ecosystem advantages Apple has.
But also, I don't know... let's see what the curve looks like! It's only been a couple of years of these neural engines. Let's see how many flops the M3 can hit this year, and then the M4 the year after. Again, 5 years is a long time when real improvement is happening. I am optimistic.
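Just to show what compounding does, assuming (hypothetically) the neural engine keeps improving at some fixed yearly rate from the ~15 TFLOPS figure above:

    # Hypothetical yearly gains compounded over 5 years, starting from ~15 TFLOPS.
    start_tflops = 15
    for yearly_gain in (1.5, 2.0, 2.5):
        after_5_years = start_tflops * yearly_gain ** 5
        print(f"{yearly_gain:.1f}x/year for 5 years: ~{after_5_years:.0f} TFLOPS")
    # 1.5x/yr -> ~114, 2.0x/yr -> ~480, 2.5x/yr -> ~1465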
That doesn't sound likely with the current architectures. There may be some kind of specialisation, but a NN is basically the chip designer's nightmare. We can't do chips with that many crossed lines. It's going to have to keep the storage + execution engine pattern unless we see some breakthroughs.
Well, we'll see what future manufacturing brings, but right now we're not even at thousands of layers (as far as I know... please link if there's been more), and we'd need to be in the hundreds-of-thousands range. Given that defects also add up with each layer, and that you need some way to dissipate the heat (almost all of that chip would be engaged while running, so no chance of balancing power between subsystems)... yeah, still lots of challenges there.
(I'm assuming the original comment meant literally putting the network, as is, into a purpose-designed chip)
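A rough sense of scale for that, assuming a GPT-3-sized model (175B parameters) stored as int8 weights in on-die SRAM (6 transistors per bit); the numbers are only meant to show the order of magnitude:

    # Transistors needed just to hold the weights on-die, vs. one big GPU die.
    params = 175e9                      # GPT-3 scale
    bits_per_weight = 8                 # int8, optimistic
    transistors_per_sram_bit = 6        # classic 6T SRAM cell

    transistors_for_weights = params * bits_per_weight * transistors_per_sram_bit
    h100_transistors = 80e9             # roughly a reticle-limit die today

    print(f"~{transistors_for_weights / 1e12:.1f}T transistors for weights alone, "
          f"~{transistors_for_weights / h100_transistors:.0f}x a full-size die, "
          f"before any compute logic")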
The M2 and the 4090 are both very general purpose. In fact, the 4090 allocates proportionally more silicon area to the tensor cores than Apple allocates to the neural engine.
The M series is basically the only "big" SoC with a functional, flexible NPU and big GPU right now, which is why it seems so good at ML. But you can bet actual ML focused designs are in the pipe.
I don't think so. M chips just happen to have a really good memory subsystem and good SIMD performance through Accelerate, so the CPU performance is pretty good.
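One quick way to probe that on your own machine: time a big single-precision matmul on the CPU. Which BLAS NumPy goes through (Accelerate or OpenBLAS) depends on how it was built, so treat the number as a rough indicator only:

    import time
    import numpy as np

    n = 4096
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)

    a @ b                                # warm-up
    t0 = time.perf_counter()
    a @ b
    dt = time.perf_counter() - t0

    flops = 2 * n ** 3                   # multiply-adds in an n x n GEMM
    print(f"~{flops / dt / 1e9:.0f} GFLOPS single-precision on the CPU path")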
Some stable diffusion implementations can use the NPU or GPU, or (experimentally and unsuccessfully) both.
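For the Core ML based Stable Diffusion ports, the NPU/GPU choice usually comes down to the compute-units setting when the model is loaded. A sketch with coremltools ("Unet.mlpackage" is a placeholder path):

    import coremltools as ct

    # Route the UNet to the neural engine (falls back to CPU where unsupported).
    unet_ane = ct.models.MLModel("Unet.mlpackage",
                                 compute_units=ct.ComputeUnit.CPU_AND_NE)

    # Or route it to the GPU instead.
    unet_gpu = ct.models.MLModel("Unet.mlpackage",
                                 compute_units=ct.ComputeUnit.CPU_AND_GPU)

    # ct.ComputeUnit.ALL lets Core ML split layers across CPU, GPU and ANE,
    # which is roughly what the combined mode tries to do.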
Curious, why do you think that? My knowledge is limited to marketing material and my M2 vs my 3090, and my conclusion so far is that this kind of claim shows up in every hardware maker's marketing from the past couple of years.