Hi everyone, author here. Yes, I believe the title should be changed to `Optimizing 128-bit Division`.
That said, I was not expecting it to show up here. The article combines knowledge hidden in Hacker's Delight, Knuth, GMP, and GNU with my own experience in low-level optimization. In the end it turned out to be a good thing to write and to submit to LLVM.
You might also want to check that 128-bit multiplies are implemented as only two 64-bit multiplies and a couple of additions (assuming no overflow; three multiplies if you need the right answer mod 2^128).
This is the code that my compiler generates for a 64/64 divide when it doesn't know that the divisor fits in 32 bits. It works by using the x86 64/32 DIV instruction. The same trick would work on x86-64 using the 128/64 DIV instruction.
I'm not sure how good the machine code you get out of your C++ is in the end, but it's hard to beat this handwritten assembly. Of course, you can do better on x86-64, because there are more registers and you don't have to spill onto the stack.
That's a difficult question. Since I work at Google, I need approval for any open-source code I want to publish on my own behalf. This doesn't apply to contributing to projects on the approved list, beyond reporting those contributions.
If you want to work around this for now, I suggest looking into libdivide (https://github.com/ridiculousfish/libdivide): it is published under the Boost license, and the library contains all the pieces I described in the article (unfortunately, not combined).
Thanks for the pointers. I suspect Google would be OK with Boost, since they are a C++ house and Boost is a major C++ library. A big reason D picked the Boost license was that it is corporate-lawyer approved.
I remember seeing a letter to the editor of an early Byte magazine, wherein the author of the letter recommended that you "Be sure to take your 32-bit divide routines with you when you change jobs... there's no guarantee that your new job's routines will be correct, let alone fast!".
For the record, I have never followed this advice.
Very nice post! I remember helping to port the 128-bit integer functions in compiler-rt to Rust years ago [0], because clang only supported 128-bit integers on 64-bit platforms, while Rust's goal was to support them on every platform Rust supports.
Since then, all the algorithms in compiler-rt have been ported to Rust and live in the compiler-builtins crate. This [1] is the source file for unsigned division. The actual logic lives inside a macro and is used to implement both 128-bit division in terms of 64-bit numbers and 64-bit division in terms of 32-bit numbers.
I wonder if similar optimizations can be done to that code.
Many years ago I implemented a set of multiple-precision integer routines in Fortran. For division, I slavishly copied Knuth's routine, using all the sample divisions in his book as tests, and it worked fine.
And for hash tables with a 128-bit hash function he doesn't need division or modulo at all. Those are only needed with prime table sizes, but at sizes this large primes don't make sense; you always take power-of-2 sizes. Then you get the bucket index with a trivial right shift or AND bitmask.
So a lot of work for nothing. At least LLVM profited.
Go uses this technique internally for generating random numbers in a range. I don't remember if it's used in the map implementation or not, just that I found this code and the link to Daniel Lemire's blog post.
The article’s title is actually “128 bit division” while HN currently shows it as just “Bit Division”. I suggest both be changed to “Optimizing 128-bit Division”. Nice article, btw.
Hi, I didn't think this was the most interesting part compared to all the benchmarks. For example, to validate a quotient and remainder, all we need to check is: dividend = quotient * divisor + remainder, remainder < divisor, and that the multiplication does not overflow; that check is free of division operations.
Still, I added several tests: dividend < divisor, remainders close to zero, and a lot of random inputs, just to make sure each new approach I add is correct.
> FWIW, I dimly recall a numeric library, maybe a float to string, which tested everything for verification. Took a few days to run.
You can definitely test against all ~2^31 IEEE 754 binary32 values to make sure that float-to-decimal conversion is correct; that's what I've done with the Rust standard library (it took 2 hours per test). I believe testing all ~2^63 binary64 values is also feasible by now, but only with dedicated hardware. For that reason I believe the library had only been tested with binary32.