
One of the fun things about dav1d is that since it’s written in assembly, they can use their own calling convention. And it can differ from function to function, so they need far fewer stack stores and loads than a compiler would generate when following the normal platform calling conventions.


I'm curious why there are even function calls in time-critical code. Shouldn't just about everything be inlined there? And if it's not time-critical, why are we interested in the savings from a custom calling convention?


Binary size was a concern, so excessive inlining was undesirable.

And don't forget that every asm-optimized function also has a C fallback for generic platforms lacking a hand-optimized variant, and that fallback is also used to verify the asm-optimized variant via checkasm. It might not be linked into your binary/library (the linker eliminates it if it's never used), but the code exists nonetheless.
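A minimal sketch of that verification idea, and not checkasm's actual API: a plain C reference and a stand-in "optimized" variant run on the same pseudo-random inputs, and their outputs are compared. All names here (`add_sat_c`, `add_sat_opt`, `verify_variant`) are hypothetical, and the optimized variant is ordinary C standing in for hand-written asm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*add_sat_fn)(uint8_t *dst, const uint8_t *a,
                           const uint8_t *b, size_t n);

/* Reference C fallback: saturating add of two byte arrays. */
static void add_sat_c(uint8_t *dst, const uint8_t *a,
                      const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = s > 255 ? 255 : (uint8_t)s;
    }
}

/* Hypothetical stand-in for the asm variant: branchless saturation. */
static void add_sat_opt(uint8_t *dst, const uint8_t *a,
                        const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t s = (uint8_t)(a[i] + b[i]);      /* wraps on overflow */
        uint8_t mask = (uint8_t)-(s < a[i]);     /* 0xFF if it wrapped */
        dst[i] = (uint8_t)(s | mask);
    }
}

/* Small LCG so the test inputs are reproducible from a seed. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 24;
}

/* Returns 1 if both variants agree on 100 random 64-byte inputs. */
static int verify_variant(add_sat_fn ref, add_sat_fn opt, uint32_t seed)
{
    uint8_t a[64], b[64], out_ref[64], out_opt[64];
    assert(ref && opt);
    for (int round = 0; round < 100; round++) {
        for (size_t i = 0; i < sizeof a; i++) {
            a[i] = (uint8_t)lcg_next(&seed);
            b[i] = (uint8_t)lcg_next(&seed);
        }
        ref(out_ref, a, b, sizeof a);
        opt(out_opt, a, b, sizeof a);
        if (memcmp(out_ref, out_opt, sizeof out_ref) != 0)
            return 0;
    }
    return 1;
}
```

Calling `verify_variant(add_sat_c, add_sat_opt, 1u)` is the whole test; a real harness would also check that the optimized code preserves the registers it is supposed to.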


Hm, fair enough. IIRC JPEG XL was a few hundred KB of SIMD code for the four or so different targets/ISAs, including the generic fallback, but I can believe video codecs are larger.


Function calls are very fast (unless there's really a lot of parameter copying/saving-to-stack) and if you can re-use a chunk of code from multiple places, you'll reduce pressure on the instruction cache. Inlining is not always ideal.


Perhaps the use cases are different (heavily data-parallel), but FWIW I do not remember many cases where we were frontend bound, so icache hasn't been a concern.


Codecs often have many redundant ways of doing the same thing, and the encoder picks whichever one uses the fewest bits for a specific piece of data. So you can't inline them, as you don't know ahead of time which will be used.
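To illustrate why the choice can't be resolved at compile time, here is a minimal sketch of per-block mode selection through a table of function pointers. The two toy predictors, the cost measure (sum of absolute residuals as a crude stand-in for bit cost), the sample blocks, and all names are assumptions for illustration, not dav1d's code.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK 8

typedef void (*predict_fn)(uint8_t pred[BLOCK], uint8_t left, uint8_t right);

/* Mode 0: repeat the left neighbour across the block. */
static void predict_flat(uint8_t pred[BLOCK], uint8_t left, uint8_t right)
{
    (void)right;
    for (int i = 0; i < BLOCK; i++)
        pred[i] = left;
}

/* Mode 1: linear ramp between the left and right neighbours. */
static void predict_ramp(uint8_t pred[BLOCK], uint8_t left, uint8_t right)
{
    for (int i = 0; i < BLOCK; i++)
        pred[i] = (uint8_t)(left + (right - left) * (i + 1) / (BLOCK + 1));
}

static const predict_fn predictors[] = { predict_flat, predict_ramp };
#define NUM_MODES (sizeof predictors / sizeof predictors[0])

/* Hypothetical cost: sum of absolute residuals. */
static unsigned residual_cost(const uint8_t *src, const uint8_t *pred)
{
    unsigned cost = 0;
    for (int i = 0; i < BLOCK; i++)
        cost += (unsigned)abs(src[i] - pred[i]);
    return cost;
}

/* Try every mode on this block and return the index of the cheapest. */
static size_t choose_mode(const uint8_t src[BLOCK], uint8_t left, uint8_t right)
{
    size_t best = 0;
    unsigned best_cost = UINT_MAX;
    for (size_t m = 0; m < NUM_MODES; m++) {
        uint8_t pred[BLOCK];
        predictors[m](pred, left, right);   /* indirect call, data-dependent */
        unsigned cost = residual_cost(src, pred);
        if (cost < best_cost) {
            best_cost = cost;
            best = m;
        }
    }
    return best;
}

/* Sample blocks: one constant, one rising by 10 per sample. */
static const uint8_t sample_flat[BLOCK] = { 50, 50, 50, 50, 50, 50, 50, 50 };
static const uint8_t sample_ramp[BLOCK] = { 10, 20, 30, 40, 50, 60, 70, 80 };
```

Which predictor wins depends entirely on the pixel data, so the call in `choose_mode` has to stay indirect; the decoder side is similar, except that the mode index is read from the bitstream.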


Cache misses hurt.


Doesn’t this just make it harder to maintain ports to other architectures though?


For what's written in assembly, lack of portability is a given. The only exceptions would presumably be high-level entry points called from C, etc. If you want to support multiple targets, you need completely separate assembly modules for each architecture at least. You'd even need to bifurcate further for each SIMD generation (within x64, for example).
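A rough sketch of what that bifurcation looks like when done with intrinsics instead of raw asm: one function name, several architecture-specific bodies selected by the preprocessor, and a scalar fallback. The function `sad16`, the sample data, and the overall structure are hypothetical; the intrinsics and the predefined macros (`__SSE2__`, `__aarch64__`) are the usual compiler-provided ones.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Portable fallback: sum of absolute differences over 16 bytes. */
static unsigned sad16_c(const uint8_t *a, const uint8_t *b)
{
    unsigned sum = 0;
    for (int i = 0; i < 16; i++)
        sum += (unsigned)abs(a[i] - b[i]);
    return sum;
}

#if defined(__SSE2__)
#include <emmintrin.h>
/* x86 with SSE2: PSADBW yields one partial sum per 64-bit half. */
static unsigned sad16(const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i s  = _mm_sad_epu8(va, vb);
    return (unsigned)_mm_cvtsi128_si32(s) +
           (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(s, 8));
}
#elif defined(__aarch64__)
#include <arm_neon.h>
/* AArch64 NEON: absolute difference, then a widening horizontal add. */
static unsigned sad16(const uint8_t *a, const uint8_t *b)
{
    uint8x16_t d = vabdq_u8(vld1q_u8(a), vld1q_u8(b));
    return vaddlvq_u8(d);
}
#else
/* Any other target: fall back to the portable version. */
static unsigned sad16(const uint8_t *a, const uint8_t *b)
{
    return sad16_c(a, b);
}
#endif

/* Sample inputs: |i - (15 - i)| summed over i = 0..15 is 128. */
static const uint8_t sample_a[16] =
    { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };
static const uint8_t sample_b[16] =
    { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 };
```

Each newer SIMD generation (AVX2, AVX-512, SVE, and so on) would add another branch, and projects that ship one binary for many CPUs pick the variant at runtime via CPU feature detection instead of the preprocessor.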


Yes, but on projects like that, ease of maintenance is a secondary priority when compared to performance or throughput.


There have indeed been bugs where amd64 assembly code that assumed the Unix (System V) calling convention was used in Windows builds and caused data corruption. You have to be careful.


SIMD instructions are already architecture-dependent.



