Homebrew: AVX instructions in Big Sur bottles crash when run in Rosetta 2

adenylyl · on Dec 27, 2020

Why is this interesting? Rosetta 2 is documented to not support AVX: https://developer.apple.com/documentation/apple_silicon/abou...

thibautg · on Dec 27, 2020

I own a MacBook Air M1. Today, I noticed that my Homebrew Python (3.9.1 on Intel) installation had reverted to the default one (3.8 on ARM). I forced a reinstall with homebrew and I got an ‘illegal hardware instruction’ error when running python3. It was corrected by building from source (with the -s switch: brew reinstall -s python3).

I found out about this issue on Github and I learned that the M1 chip does not support AVX (which is indeed documented) and that Homebrew developers assumed that all CPUs supported by Big Sur were compatible with AVX instruction set (which is true except for the M1 under Rosetta 2).

I found it interesting to see how they rapidly corrected the situation by merging a partial revert. I also found their CI infrastructure interesting to watch while it was live-testing the commit.

klodolph · on Dec 27, 2020

This is fascinating. I can understand why the Homebrew devs might make the decision, but it seems the right way to do this is just to, e.g.,

    cc -dM -E -xc /dev/null | less

Anything with a preprocessor macro is supported on all OS configurations that you are targeting. You can see, for example, that __SSE4_1__ is defined, so SSE 4.1 is ok to use without checking it at runtime with e.g. sysctlbyname("hw.optional.sse4_1", &enabled, ...)

I have always thought that it is worth a bit of paranoia to try and do the check for optional features the “correct” way, whatever method the OS advertises (like sysctl for macOS) rather than doing something like cpuid. After all, every once in a while it’s possible to run into a configuration where the CPU does support some particular feature but the OS does not, and if you rely on cpuid you could be up shit creek without a paddle.

bodyfour · on Dec 27, 2020

It was a mistaken assumption from the Homebrew developers who didn't realize AVX would cause problems with Rosetta2.

Within a few days problems became evident and Homebrew turned the compiler flag back off. Now it's just a question of the best way of remediating the already-compiled packages in the simplest way possible.

Really the fact that this is now a front-page-of-HN story is going to cause more confusion to people. It was just a short-lived bug.

rincebrain · on Dec 27, 2020

It appears the decision was made to hardwire AVX support to "on" recently as all the hardware for Big Sur supposedly supported AVX, not noticing that Rosetta 2 violated this, and then the bug report(s) started coming in about precompiled binaries being broken under Rosetta 2 since that change.

So I personally find it interesting because I would not have expected Rosetta 2 to be missing AVX instructions if all the supported Big Sur hardware possessed them, and this is an interesting failure mode to watch and see how the Brew developers handle.

als0 · on Dec 27, 2020

> I would not have expected Rosetta 2 to be missing AVX instructions

AVX can’t be emulated by Rosetta due to Intel patents.

phire · on Dec 27, 2020

Regardless of the patent issue, emulating AVX on the M1 (which only has 4-wide SIMD) would actually be significantly slower than forcing the x86 application to use it's SSE fallback path and emulating that.

Diggsey · on Dec 27, 2020

Emulating AVX via Rosetta should be just as fast as re-compiling the original without AVX support and then emulating it. Emulating larger SIMD instructions is very easy, you just use multiple smaller SIMD instructions.

On the other hand, disabling AVX for all Intel machines would make those programs significantly slower, so it's clear why there is reluctance to do that...

phire · on Dec 27, 2020

No. For many algorithms, AVX isn't a 2x speedup over SSE. Especially when lanes are conditionally masked.

Often you are happy to get a 1.25x speed up with AVX. Sometimes it actually goes slower.

If you were to emulate that code with a 1.25x speedup with AVX on the M1, you would end up with all the disadvantages of going to 8-wide, but with none of the speedup.

That 1.25x speedup is halved and the emulated AVX code actually runs at about 0.625x the speed of the emulated SSE code path.

a-dub · on Dec 27, 2020

plus doesn't the M1 have specialized hardware. what's the neuro-engine or whatever it is that they call it for speeding up ML? i imagine at it's core it's a bunch of instructions for doing vector operations.

side bar: is there documentation for the instruction set or abi for that hardware?

mhh__ · on Dec 27, 2020

> side bar: is there documentation for the instruction set or abi for that hardware?

No. Apple stans, food for thought.

I would hope there will be something soon although it's Apple so not much.

Edit: Still basically no, but https://github.com/geohot/tinygrad/tree/master/ane has got the instruction format apparently.

You can attack an instruction set blind (https://recon.cx/2012/schedule/events/236.en.html).

More edit: Bingo, patent: https://patents.google.com/patent/US20190340491A1/en?oq=2019...

DerekL · on Dec 27, 2020

The Apple Neural Engine is separate from the CPU; it's not additional registers and instructions for the CPU, like a vector unit. You go through the Core ML framework to use it, just like you go through Metal or OpenGL to use the GPU.

mhh__ · on Dec 27, 2020

The value of SIMD on a CPU these days is really a middle-ground where you value latency above throughput, so you probably would have the same trade off as getting the data to and from a GPU

a-dub · on Dec 27, 2020

i thought the M1 was a SoC with a unified memory architecture?

mhh__ · on Dec 27, 2020

That definitely changes the calculus, but as I've mentioned in a different comment there doesn't seem to be literally any microarchitectural documentation to read, so I (don't own an M1) have nothing to go off unfortunately.

I'll make a wild guess that getting data to the neural engine is still probably not quick because I assume it's some kind of statically scheduled type affair (exposed pipeline?). We literally seem to know almost nothing about it sadly.

https://jobs.apple.com/en-gb/details/200205070/neural-engine...

Even the job listing gives away next to nothing, other than poor English "Knowledge in compiler is a plus" ;)

mhh__ · on Dec 27, 2020

https://news.ycombinator.com/item?id=14523587

Discussion of Intel patents from a few years ago

trollian · on Dec 27, 2020

They're implemented by other emulation systems like bochs: https://sourceforge.net/p/bochs/news/

colejohnson66 · on Dec 27, 2020

bochs is probably just too small a fish for Intel to care. But Apple is a big money maker for Intel, so they’d have an incentive to push back legally. The patents aren’t on the implementation, but the function, so I’d wager that bochs is actually infringing. But IANAL, so take it with a grain of salt.

HexagonalKitten · on Dec 27, 2020

Can't be emulated efficiently, or at all? Has it been tested or is it based on this comment from Intel?

"Emulation is not a new technology, and Transmeta was notably the last company to claim to have produced a compatible x86 processor using emulation (“code morphing”) techniques. Intel enforced patents relating to SIMD instruction set enhancements against Transmeta’s x86 implementation even though it used emulation"

varispeed · on Dec 27, 2020

Can't they rename it to ABX and switch things here and there? Or get someone to do clean room implementation of ISA specification? You can't patent an API.

mhh__ · on Dec 27, 2020

The patents aren't on the API but what the API does.

Like it or not they are patented https://patents.google.com/patent/US7499962 (FMA, for example)

xoa · on Dec 27, 2020

>Like it or not they are patented https://patents.google.com/patent/US7499962 (FMA, for example)

Yeah, and AVX is also relatively new, intro'd in 2008 and first shipped in a chip in 2011. AVX2 wasn't until 2013. So even with R&D and patents happening years beforehand it'll still be a good long while before they expire (that FMA example being a case in point, not until end of 2026).

Granted in Apple's specific case that's actually not a bad thing. Precisely because AVX is so new, many Macs supported up until the last version or two of macOS didn't have it. So AVX isn't at all a widely expected dependency for the kind of older software that may never get an ARM port and in turn most needs Rosetta 2.

Const-me · on Dec 28, 2020

If FMA is patented, how comes ARM has equivalent instructions, called MLA? And GPUs have FMA as well.

loeg · on Dec 27, 2020

Can you elaborate on what patents, and how they prevent emulating AVX but allow emulating the rest of x86?

colejohnson66 · on Dec 27, 2020

Every new feature set (the many SSEs, AVXs, and others) is patented. But newer processor features (like AVX) don’t “renew” the patents on the older features. So when AVX-512 was patented, it didn’t change the expiration dates for AVX2 and prior.

The base requirements for x86-64 mandate SSE2 IIRC. Those patents expired this year, so Apple was now able to release an x86-64 “emulator” without negotiating patents.

tempay · on Dec 27, 2020

This leaves me wondering if Apple had waited until now for ARM macs solely because of the patent situation.

loeg · on Dec 27, 2020

Huh. Is AMD paying licensing fees for the privilege of implementing the AVX and AVX2 instruction sets? It seems weird and anticompetitive that patents are granted for what amounts to an API.

(Do you have links or identifiers for the specific patents, by any chance?)

colejohnson66 · on Dec 27, 2020

The situation with AMD is a bit complicated. Basically, there was an antitrust lawsuit led by AMD years ago, and as part of the settlement, Intel and AMD would share patents to allow collaboration. I can’t recall the actual specifics though.

As for patents, I don’t, but someone else here linked in the one for FMA[0]. It’s a bit more complicated than just an API, but it seems broad enough that anything implementing that API would be covered. But IANAL.

[0]: https://patents.google.com/patent/US7499962

jabberwcky · on Dec 27, 2020

After what feels like a solid month of front pages cooing over the new hardware, a little counterbalance really could not hurt. I'm no longer an Apple hardware user, so this post (along with your comment) was the first time I'd heard AVX was unsupported. Pretty sure I'm not alone.

[edit: this seemed like quite an uncontroversial thing to say, please understand it was not intended to hurt anyone's feelings. I was a mac user for 12 years and simply don't click those links any more. Please don't downvote comments just because you're a Mac user]

gumby · on Dec 27, 2020

They did mention it a lot (though they said a lot of things, so easy to overlook).

The actual issue is that Rosetta identifies itself as a CPU without AVX so it’s clearly a homebrew bug. And in fact if you tell homebrew not to download a prebuilt binary it works fine.

EricE · on Dec 27, 2020

Counterbalance? The Homebrew team made a lazy assumption and it bit them. That's hardly Apple's fault.

mhh__ · on Dec 27, 2020

Is there any long-form documentation (specifically an optimization manual) for M1 yet?.

Currently the only details I can find about the microarchitecture is either press waffle or things reverse engineered via timing (you can look for bumps in graphs to find the widths of various speculative features of the CPU).

It would be very Apple not to publish any (Fucking sad if true)

Edit: Intel, like them or not, have genuinely very good documentation (Much better than AMD's) and seem to have really thought about actually using their performance - AMD can really deliver the goods on their terms, but they still lag behind in both software (particular compared to Nvidia) and documentation (Intel's optimization manual is 850 pages; AMD's is 45 pages). AMD basically give you a raw list of RDPMC event assignments to play with, Intel actually tell you how to use them.

moonchild · on Dec 27, 2020

> Currently the only details I can find about the microarchitecture is either press waffle or things reverse engineered via timing (you can look for bumps in graphs to find the widths of various speculative features of the CPU).

Tbf, the agner fog's unofficial CPU optimization manuals[1] (for x86 CPUs) are quite good.

> Intel, like them or not, have genuinely very good documentation

Eh, depends. (Not talking specifically about optimization docs.) On the whole, I find intel documentation functions better as a reference, where amd documentation functions better as explanation.

1. https://www.agner.org/optimize/

skavi · on Dec 27, 2020

If Apple wants to move fast, maybe it would be disadvantageous to have developers hyper optimize for a specific μarch?

mhh__ · on Dec 27, 2020

No, document them both. And you can't really move fast, progress comes in a big dollop with sprinkles on top (i.e. Zen brought like 50% improvement over Excavator but the current AMD is Zen 3 rather than Ben 1)

giomasce · on Dec 27, 2020

Isn't the optimization manual for Rosetta basically "Recompile for ARM"?

mhh__ · on Dec 27, 2020

And when you've done that what next?

saagarjha · on Dec 27, 2020

There is not.

marcan_42 · on Dec 27, 2020

TL;DR: A change was made to set the minimum CPU level (instruction set target) for Big Sur builds to Ivy Bridge. This assumes the presence of AVX instructions on all Big Sur machines, which is true for Intel macs. Rosetta is a Big Sur "machine" that does not support AVX. Therefore, the assumption was false.

phire · on Dec 27, 2020

> which is true for Intel macs

But not true for Intel in general, because Intel still ships low-end Core processes branded as "Pentium" and "Celeron" such as the Intel® Pentium® Gold G6600 which have AVX disabled.

It only holds true for Intel macs because Apple have never used these processors in their products.

daenney · on Dec 27, 2020

It doesn’t matter that it’s not true for Intel in general, brew doesn’t target “Intel in general”.

Brian_K_White · on Dec 28, 2020

If it doesn't, that's just yet another broken assumption by homebrew. Macos is run on other hardware, both physical and virtual, than a few Apple models.

daenney · on Jan 2, 2021

It’s not a broken assumption. You can still build and compile all formulae yourself.

But for the bottles, the precompiled builds, they’ve always had a lower boundary on which generation of Mac you need to at least be running, and as a consequence which instructions are available.

surfer7837 · on Dec 27, 2020

Only bottles are affected so you can just compile from source with homebrew

chrisseaton · on Dec 27, 2020

Why aren't they checking if the CPU supports AVX before using it?!

lights0123 · on Dec 27, 2020

Because all Intel Macs that support Big Sur have AVX, so the Homebrew team enabled it globally. It allows for "things like small memset() operations [to get] inlined as AVX instructions" as the compiler knows that an AVX operation is faster than a function call.

bodyfour · on Dec 27, 2020

It was just a (short-lived) bug, not a conscious decision to break anything. Not sure why this got so much traction on HN.

recursive · on Dec 27, 2020

What's a Big Sur bottle? I don't follow Apple stuff super closely, but I'm completely lost here. Searches aren't too helpful.

grzm · on Dec 27, 2020

It's not an Apple thing; it's homebrew's binary packaging: https://docs.brew.sh/Bottles

ficklepickle · on Dec 27, 2020

They use their own special snowflake jargon because they are Mac people. Best to not waste brain cycles on it.

a-dub · on Dec 27, 2020

why doesn't rosetta support avx? arm has neon and sve... i would assume that there's equivalence for most basic vector math instructions and for those that there isn't, like the weird sha256 and aes instructions, they could just be emulated in software?

daenney · on Dec 27, 2020

Over here: https://news.ycombinator.com/item?id=25548001.

Intel patents.