One of the fun things about dav1d is that since it’s written in assembly, they c...

janwas · on Feb 23, 2025

I'm curious why there are even function calls in time-critical code, shouldn't just about everything be inlined there? And if it's not time-critical, why are we interested in the savings from a custom calling convention?

rbultje · on Feb 23, 2025

Binary size was a concern, so excessive inlining was undesirable.

And don't forget that every asm-optimized variant has a C fallback, both for generic platforms that lack a hand-optimized variant and for verifying the asm-optimized variant with checkasm. The fallback might not be linked into your binary/library (the linker eliminates it when it's never used), but the code exists nonetheless.
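
To illustrate the verification idea, here is a minimal sketch in plain C. It is not checkasm's actual API: the kernel (`add_sat_c`), the optimized stand-in (`add_sat_opt`), the deliberately broken variant (`add_wrap_buggy`) and the checker (`matches_reference`) are all hypothetical names, and the "optimized" version is ordinary C standing in for hand-written assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_LEN 64

typedef void (*add_sat_fn)(uint8_t *dst, const uint8_t *a,
                           const uint8_t *b, size_t n);

/* Reference C fallback: per-byte saturating add (hypothetical kernel). */
void add_sat_c(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);
    }
}

/* Stand-in for a hand-optimized variant: same result, branch-free clamp. */
void add_sat_opt(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        unsigned s = (unsigned)a[i] + b[i];
        /* (s >> 8) is 1 on overflow; 0u - 1 is all ones, which forces 255. */
        dst[i] = (uint8_t)(s | (0u - (s >> 8)));
    }
}

/* Deliberately wrong variant: wraps around instead of saturating. */
void add_wrap_buggy(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}

/* Returns 1 if fn matches the C reference for every length 0..MAX_LEN.
 * The whole buffer is compared, so writes past n are caught as well. */
int matches_reference(add_sat_fn fn) {
    uint8_t a[MAX_LEN], b[MAX_LEN], ref[MAX_LEN], out[MAX_LEN];
    assert(fn != NULL);
    for (size_t i = 0; i < MAX_LEN; i++) {
        a[i] = (uint8_t)(i * 7 + 3);  /* fixed pattern with overflowing pairs */
        b[i] = (uint8_t)(255 - i);
    }
    for (size_t n = 0; n <= MAX_LEN; n++) {
        memset(ref, 0, sizeof ref);
        memset(out, 0, sizeof out);
        add_sat_c(ref, a, b, n);
        fn(out, a, b, n);
        if (memcmp(ref, out, MAX_LEN) != 0)
            return 0;
    }
    return 1;
}
```

A real test harness would typically use randomized inputs and also check things such as register preservation; the fixed input pattern here only keeps the sketch deterministic.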

janwas · on Feb 23, 2025

hm, fair enough. IIRC JPEG XL was a few hundred KB of SIMD code for the four or so different targets/ISAs, including the generic fallback, but I can believe video codecs are larger.

hrydgard · on Feb 23, 2025

Function calls are very fast (unless there's really a lot of parameter copying/saving-to-stack) and if you can re-use a chunk of code from multiple places, you'll reduce pressure on the instruction cache. Inlining is not always ideal.

janwas · on Feb 23, 2025

Perhaps the use cases are different (heavily data-parallel), but FWIW I do not remember many cases where we were frontend bound, so icache hasn't been a concern.

ajb · on Feb 23, 2025

Codecs often have many redundant ways of doing the same thing, and for a specific piece of data the one that uses the fewest bits is chosen. So you can't inline them, as you don't know ahead of time which will be used.
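
A rough sketch of what this looks like on the decoding side, assuming a made-up set of three prediction modes: the mode index arrives with the data, so the call has to go through a table of function pointers. The names `pred_fn`, `pred_table` and `predict_block` are hypothetical and not taken from any real codec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical predictor signature: fill dst[0..n) from two neighbours. */
typedef void (*pred_fn)(uint8_t *dst, size_t n, uint8_t left, uint8_t top);

static void pred_dc(uint8_t *dst, size_t n, uint8_t left, uint8_t top) {
    uint8_t dc = (uint8_t)((left + top + 1) >> 1);  /* rounded average */
    for (size_t i = 0; i < n; i++) dst[i] = dc;
}

static void pred_left(uint8_t *dst, size_t n, uint8_t left, uint8_t top) {
    (void)top;
    for (size_t i = 0; i < n; i++) dst[i] = left;
}

static void pred_top(uint8_t *dst, size_t n, uint8_t left, uint8_t top) {
    (void)left;
    for (size_t i = 0; i < n; i++) dst[i] = top;
}

enum { PRED_DC, PRED_LEFT, PRED_TOP, N_PRED_MODES };

static const pred_fn pred_table[N_PRED_MODES] = {
    [PRED_DC]   = pred_dc,
    [PRED_LEFT] = pred_left,
    [PRED_TOP]  = pred_top,
};

/* The mode is only known at run time (it is read from the bitstream),
 * so the compiler cannot inline a specific predictor at this call site. */
void predict_block(unsigned mode, uint8_t *dst, size_t n,
                   uint8_t left, uint8_t top) {
    assert(mode < N_PRED_MODES);
    pred_table[mode](dst, n, left, top);
}
```

Each predictor can still be inlined internally; it is only the per-block choice between them that needs an indirect call.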

weebull · on Feb 24, 2025

Cache misses hurt.

MortyWaves · on Feb 23, 2025

Doesn’t this just make it harder to maintain ports to other architectures though?

epr · on Feb 23, 2025

For what's written in assembly, lack of portability is a given. The only exceptions would presumably be the high-level entry points called from C, etc. If you want to support multiple targets, you need completely separate assembly modules for each architecture at least. You'd even need to bifurcate further for each SIMD generation (within x64, for example).
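
One common way to organize this, sketched here under assumptions: the C side keeps a struct of function pointers and fills it once at init time from detected CPU features, overriding the C fallback with the best available variant. The flag names, `DspContext` and `dsp_init` are hypothetical, and the "SSE2" and "AVX2" functions are plain C stand-ins for what would be separate assembly modules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits; a real project would derive these from
 * CPUID or an OS query. */
enum {
    CPU_FLAG_SSE2 = 1 << 0,
    CPU_FLAG_AVX2 = 1 << 1,
};

enum impl_id { IMPL_C, IMPL_SSE2, IMPL_AVX2 };

typedef uint32_t (*sum_fn)(const uint8_t *src, size_t n);

typedef struct {
    sum_fn sum;
    enum impl_id impl;  /* records which variant was selected */
} DspContext;

/* Generic C fallback, always available. */
static uint32_t sum_c(const uint8_t *src, size_t n) {
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) acc += src[i];
    return acc;
}

/* Stand-ins for per-generation assembly modules. */
static uint32_t sum_sse2(const uint8_t *src, size_t n) { return sum_c(src, n); }
static uint32_t sum_avx2(const uint8_t *src, size_t n) { return sum_c(src, n); }

/* Start from the C fallback, then let each newer SIMD generation
 * override the pointer if the CPU supports it. */
void dsp_init(DspContext *c, unsigned cpu_flags) {
    assert(c != NULL);
    c->sum = sum_c;
    c->impl = IMPL_C;
    if (cpu_flags & CPU_FLAG_SSE2) {
        c->sum = sum_sse2;
        c->impl = IMPL_SSE2;
    }
    if (cpu_flags & CPU_FLAG_AVX2) {
        c->sum = sum_avx2;
        c->impl = IMPL_AVX2;
    }
}
```

Callers only ever see `c->sum(...)`, so adding another architecture means adding another init path and another set of modules, not touching the call sites.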

antoinealb · on Feb 23, 2025

Yes, but on projects like that, ease of maintenance is a secondary priority when compared to performance or throughput.

wolf550e · on Feb 23, 2025

There have indeed been bugs where amd64 assembly code that assumed the Unix calling convention was used in Windows builds and caused data corruption. You have to be careful.

secondcoming · on Feb 23, 2025

SIMD instructions are already architecture-dependent.