Optimizing 128-bit Division (danlark.org)
136 points by EvgeniyZh on July 19, 2020 | hide | past | favorite | 28 comments


Hi everyone, the author is here. Yes, I believe the title should be changed to `Optimizing 128-bit Division`.

Still, I was not expecting it to show up here. Overall, I combined some knowledge hidden in the Hacker's Delight book, Knuth, GMP, and GNU with my own knowledge of low-level optimizations. In the end it turned out to be a good thing to write and to submit to LLVM.


You might also want to check that 128-bit multiplies are implemented as only two 64-bit multiplies and a couple of additions (assuming no overflow; three multiplies if you need to get the right answer mod 2^64).
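For the full result mod 2^128, a minimal sketch of the decomposition (using GCC/Clang's `__int128` purely as a reference type; this is the three-multiply variant, not any particular library's code):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Low 128 bits of a*b from three 64-bit multiplies: one full 64x64->128
   product of the low halves, plus two truncated 64x64->64 cross terms.
   The a_hi*b_hi term only affects bits at 2^128 and above, so it is dropped. */
static u128 mul128(u128 a, u128 b) {
    uint64_t a_lo = (uint64_t)a, a_hi = (uint64_t)(a >> 64);
    uint64_t b_lo = (uint64_t)b, b_hi = (uint64_t)(b >> 64);
    u128 result = (u128)a_lo * b_lo;                    /* full low product  */
    result += (u128)(a_lo * b_hi + a_hi * b_lo) << 64;  /* cross terms mod 2^64 */
    return result;
}
```

On x86-64, `(u128)a_lo * b_lo` compiles to a single widening `mul`, so the whole routine is three multiplies plus two additions.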


There is an `if` statement in a `for` loop that looks like it could be eliminated. Abbreviating:

  if (dd.all >= dr.all) {
    dd.all -= dr.all;
    quo.s.low |= 1;
  }
could be

  bool c = dd.all >= dr.all;
  dd.all -= (-c & dr.all);
  quo.s.low |= c;

Clang might implement this with `cmov` instructions either way it's coded.


Great article, I read to the end.

I faced a similar problem implementing 64/64 division on 32-bit x86. I cribbed some code off the internet and came up with this:

https://github.com/titzer/virgil/blob/2c674dbe3512da23c4a644...

This is the code that my compiler generates for a 64/64 divide when it doesn't know that the divisor fits in 32 bits. It works by using the x86 64/32 `div` instruction. The same trick would work on x86-64 using the 128/64 division.
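The narrowing trick itself, sketched in C rather than the linked assembly: divide the high word first, so that each step's dividend is small enough that the quotient fits in 32 bits, matching the constraint of a hardware 64/32 `div` (names here are illustrative):

```c
#include <stdint.h>

/* Divide a 64-bit dividend by a 32-bit divisor in two steps, each of
   which maps onto one x86 64/32 `div`. Neither step can fault: the
   first quotient is < 2^32 because hi < 2^32, and the second is < 2^32
   because the partial remainder (hi % d) is < d. */
static uint64_t div64_by_32(uint64_t n, uint32_t d, uint32_t *rem) {
    uint32_t hi = (uint32_t)(n >> 32);
    uint32_t lo = (uint32_t)n;
    uint32_t q_hi = hi / d;                        /* high word of the quotient */
    uint64_t t = ((uint64_t)(hi % d) << 32) | lo;  /* t < d * 2^32 */
    uint32_t q_lo = (uint32_t)(t / d);             /* fits: t/d < 2^32 */
    if (rem)
        *rem = (uint32_t)(t % d);
    return ((uint64_t)q_hi << 32) | q_lo;
}
```

A compiler won't necessarily emit exactly two `div` instructions for this C, which is why the handwritten-assembly version is hard to beat.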

I am not sure how good the machine code you get out of your C++ is in the end, but it's hard to beat this handwritten assembly. Of course, you can do better on x86-64 because there are more registers and you don't have to spill onto the stack.


What's the license? Hopefully, Boost or Boost compatible?



Thank you. We've standardized on the Boost license for D, as it is the least restrictive, and well-accepted in the C++ community.

Is it possible you can make a Boost licensed version of it so we can add it to D?

https://www.boost.org/LICENSE_1_0.txt


That's a difficult question. As I work at Google, I need to get approval for each piece of open source I want to publish on my own behalf. This does not apply to contributing to projects on the approved list and reporting those contributions.

If you want to work around this for now, I suggest looking into libdivide (https://github.com/ridiculousfish/libdivide); it is published under the Boost license, and the library contains all the needed pieces I described in the article (unfortunately, not combined).


Thanks for the pointers. I suspect Google would be ok with Boost, since they are a C++ house and Boost is the major library. A big reason D picked it was because Boost is corporate lawyer approved.


I remember seeing a letter to the editor of an early Byte magazine, wherein the author of the letter recommended that you "Be sure to take your 32-bit divide routines with you when you change jobs... there's no guarantee that your new job's routines will be correct, let alone fast!".

For the record, I have never followed this advice.


Very nice post! I remember helping to port the 128 bit integer functions in compiler-rt to Rust years ago [0], because clang only supported 128 bit integers on 64 bit platforms, but the goal of Rust was to support them on all platforms that Rust supports.

Since then, all algorithms in compiler-rt have been ported to Rust and live in the compiler-builtins crate. This [1] is the source code file for unsigned division. The actual logic is inside a macro and used to implement both 128 bit division in terms of 64 bit numbers and 64 bit division in terms of 32 bit numbers.

I wonder if similar optimizations can be done to that code.

[0]: https://github.com/rust-lang/rust/pull/38482

[1]: https://github.com/rust-lang/compiler-builtins/blob/master/s...


Hi, I believe they can, and on 32-bit platforms as well, using the multiword division from Knuth (https://skanthak.homepage.t-online.de/division.html), which is what I chose as the fallback on non-x86_64 platforms.

I will try to implement the same optimizations in Rust in the upcoming weeks.

UPD: And we opened an issue :) https://github.com/rust-lang/compiler-builtins/issues/368


Many years ago I implemented a set of multiple precision integer routines in fortran. For division, I slavishly copied Knuth's routine, using all the sample divisions in his book to test, and it worked fine.


If all you need is to uniformly reduce a number into a given range, there is a division-free approach:

https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/


It took my fat fingers a while to be able to successfully copy-n-paste the link on mobile. Here’s a proper link to save others the hassle:

https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-...
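The reduction in the linked post is a single widening multiply and a shift; a minimal sketch using the compiler's 128-bit type (not Lemire's exact code):

```c
#include <stdint.h>

/* Map a uniformly random 64-bit x into [0, n) without division:
   take the high 64 bits of the 128-bit product x * n. */
static uint64_t reduce(uint64_t x, uint64_t n) {
    return (uint64_t)(((unsigned __int128)x * n) >> 64);
}
```

Like `x % n`, this is slightly biased unless you add a rejection step, but the bias is negligible when n is far below 2^64.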


And for hash tables with a 128-bit hash function he doesn't need division or modulo at all. Those are only needed with prime table sizes, and at these large sizes primes don't make sense; you always take power-of-two sizes. Then you get the bucket index with a trivial right shift or AND bitmask.

So a lot of work for nothing. At least LLVM did profit.
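The masking trick, as a sketch (the function name is illustrative, not from any particular hash table implementation):

```c
#include <stddef.h>
#include <stdint.h>

/* Bucket selection for a power-of-two table: table_size - 1 is an
   all-ones mask, so hash & (table_size - 1) equals hash % table_size
   with no division at all. table_size must be a power of two. */
static size_t bucket_index(uint64_t hash, size_t table_size) {
    return (size_t)(hash & (table_size - 1));
}
```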


Go uses this technique internally for generating random numbers in a range. I don't remember if it's used in the map implementation or not, just that I found this code and the link to Daniel Lemire's blog post.


Looks like this got bit by the HN title number ripper.

For the curious, the actual title is "128 bit division".


The article’s title is actually “128 bit division” while HN currently shows it as just “Bit Division”. I suggest both be changed to “Optimizing 128-bit Division”. Nice article, btw.


Great post. The only thing missing is a good strategy to test the resulting algorithm. Errors in arithmetic can be a nightmare.


Hi, I thought this was not the most interesting part next to all the benchmarks. For example, all we need is to check, given the quotient and the remainder, that dividend = quotient * divisor + remainder, that remainder < divisor, and that the multiplication does not overflow; that check itself is free of division operations.

Still, I added several tests, like dividend < divisor, close-to-zero remainders, and a lot of random inputs, just to make sure each new approach I add is correct.
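That identity check could be sketched like this (a hypothetical helper, not the article's actual test code; `__builtin_mul_overflow` and `__builtin_add_overflow` are GCC/Clang builtins):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Verify a division result without dividing: the identity
   dividend == quotient * divisor + remainder must hold exactly,
   with remainder < divisor and no overflow anywhere. */
static int check_div(u128 dividend, u128 divisor, u128 quotient, u128 remainder) {
    u128 prod, sum;
    if (remainder >= divisor)
        return 0;
    if (__builtin_mul_overflow(quotient, divisor, &prod))
        return 0;
    if (__builtin_add_overflow(prod, remainder, &sum))
        return 0;
    return sum == dividend;
}
```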


Nicely done.

FWIW, I dimly recall a numeric library, maybe a float-to-string one, which exhaustively tested everything for verification. Took a few days to run.

Then maybe use the spot checks to test for regressions. Weird compiler, toolchain, processor combos. That sort of thing.


> FWIW, I dimly recall a numeric library, maybe a float-to-string one, which exhaustively tested everything for verification. Took a few days to run.

You can definitely test against all ~2^31 IEEE 754 binary32 values to make sure that float-to-decimal conversion is correct; that's what I've done with the Rust standard library (it took 2 hours per test). I believe testing all ~2^63 binary64 values is also feasible by now, but only with dedicated hardware. For that reason I believe the library had only been tested with binary32.
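The round-trip flavor of such an exhaustive sweep is easy to sketch (a generic sketch, not the Rust standard library's actual test; 9 significant decimal digits are enough to round-trip any binary32 value):

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print a binary32 value with 9 significant digits and parse it back;
   the original bit pattern must survive the round trip. NaNs are
   skipped because payloads don't round-trip textually. */
static int roundtrip_ok(uint32_t bits) {
    float f, g;
    uint32_t out;
    char buf[64];
    memcpy(&f, &bits, sizeof f);
    if (isnan(f))
        return 1;
    snprintf(buf, sizeof buf, "%.9g", f);
    g = strtof(buf, NULL);
    memcpy(&out, &g, sizeof out);
    return out == bits;
}

/* The full sweep is all 2^32 bit patterns, a few core-hours:
   for (uint64_t b = 0; b <= UINT32_MAX; b++)
       if (!roundtrip_ok((uint32_t)b)) abort();              */
```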


What kind of hardware can crank through the 2^64 space in reasonable time?


It’s actually surprisingly feasible. 1,000 cores running at 3GHz for a week do ~2^64 cycles.


Don't you mean 10,000 cores?


Whoops, yes! Too late to edit.


If you’re on Windows, you can test against the WinAPI implementation: https://docs.microsoft.com/en-us/windows/win32/api/mfapi/nf-...



