Optimizing 128-bit Division (danlark.org)
136 points by EvgeniyZh on July 19, 2020 | hide | past | favorite | 28 comments


Hi everyone, the author is here. Yes, I believe the title should be changed to `Optimizing 128-bit Division`.

Still, I was not expecting it to show up here. Overall, I combined some knowledge hidden in the Hacker's Delight book, Knuth, GMP, and GNU with my own knowledge of low-level optimizations. In the end it turned out to be a good thing to write and to submit to LLVM.


You might also want to check that 128-bit multiplies are implemented as only two 64-bit multiplies and a couple of additions (assuming no overflow; three multiplies if you need to get the right answer mod 2^64).
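For the full result mod 2^128, a minimal sketch of the decomposition (using GCC/Clang's `__int128` purely as a reference type; this is the three-multiply variant, not any particular library's code):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Low 128 bits of a*b from three 64-bit multiplies: one full 64x64->128
   product of the low halves, plus two truncated 64x64->64 cross terms.
   The a_hi*b_hi term only affects bits at 2^128 and above, so it is dropped. */
static u128 mul128(u128 a, u128 b) {
    uint64_t a_lo = (uint64_t)a, a_hi = (uint64_t)(a >> 64);
    uint64_t b_lo = (uint64_t)b, b_hi = (uint64_t)(b >> 64);
    u128 result = (u128)a_lo * b_lo;                    /* full low product  */
    result += (u128)(a_lo * b_hi + a_hi * b_lo) << 64;  /* cross terms mod 2^64 */
    return result;
}
```

On x86-64, `(u128)a_lo * b_lo` compiles to a single widening `mul`, so the whole routine is three multiplies plus two additions.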


There is an `if` statement in a `for` loop that looks like it could be eliminated. Abbreviating:

  if (dd.all >= dr.all) {
    dd.all -= dr.all;
    quo.s.low |= 1;
  }
could be

  bool c = dd.all >= dr.all;
  dd.all -= (-c & dr.all);
  quo.s.low |= c;

Clang might implement this with `cmov` instructions either way it's coded.


Great article, I read to the end.

I faced a similar problem implementing 64/64 division on 32-bit x86. I cribbed some code off the internet and came up with this:

https://github.com/titzer/virgil/blob/2c674dbe3512da23c4a644...

This is the code that my compiler generates for a 64/64 divide when it doesn't know that the divisor fits in 32 bits. It works by using the x86 64/32 `div` instruction. The same trick would work on x86-64 using the 128/64 division.
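The narrowing trick itself, sketched in C rather than the linked assembly: divide the high word first, so that each step's dividend is small enough that the quotient fits in 32 bits, matching the constraint of a hardware 64/32 `div` (names here are illustrative):

```c
#include <stdint.h>

/* Divide a 64-bit dividend by a 32-bit divisor in two steps, each of
   which maps onto one x86 64/32 `div`. Neither step can fault: the
   first quotient is < 2^32 because hi < 2^32, and the second is < 2^32
   because the partial remainder (hi % d) is < d. */
static uint64_t div64_by_32(uint64_t n, uint32_t d, uint32_t *rem) {
    uint32_t hi = (uint32_t)(n >> 32);
    uint32_t lo = (uint32_t)n;
    uint32_t q_hi = hi / d;                        /* high word of the quotient */
    uint64_t t = ((uint64_t)(hi % d) << 32) | lo;  /* t < d * 2^32 */
    uint32_t q_lo = (uint32_t)(t / d);             /* fits: t/d < 2^32 */
    if (rem)
        *rem = (uint32_t)(t % d);
    return ((uint64_t)q_hi << 32) | q_lo;
}
```

A compiler won't necessarily emit exactly two `div` instructions for this C, which is why the handwritten-assembly version is hard to beat.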

I am not sure how good the machine code you get out of your C++ is in the end, but it's hard to beat this handwritten assembly. Of course, you can do better on x86-64 because there are more registers and you don't have to spill onto the stack.


What's the license? Hopefully, Boost or Boost compatible?



Thank you. We've standardized on the Boost license for D, as it is the least restrictive, and well-accepted in the C++ community.

Is it possible you can make a Boost licensed version of it so we can add it to D?

https://www.boost.org/LICENSE_1_0.txt


That's a difficult question. As I work at Google, I need to get approval for each piece of open source I want to publish on my own behalf. This does not apply to contributing to projects on the approved list and reporting those contributions.

If you want to work around this for now, I suggest looking into libdivide (https://github.com/ridiculousfish/libdivide); it is published under the Boost license, and the library contains all the needed pieces I described in the article (unfortunately, not combined).


Thanks for the pointers. I suspect Google would be ok with Boost, since they are a C++ house and Boost is the major library. A big reason D picked it was because Boost is corporate lawyer approved.


I remember seeing a letter to the editor of an early Byte magazine, wherein the author of the letter recommended that you "Be sure to take your 32-bit divide routines with you when you change jobs... there's no guarantee that your new job's routines will be correct, let alone fast!".

For the record, I have never followed this advice.


Very nice post! I remember helping to port the 128 bit integer functions in compiler-rt to Rust years ago [0], because clang only supported 128 bit integers on 64 bit platforms, but the goal of Rust was to support them on all platforms that Rust supports.

Since then, all algorithms in compiler-rt have been ported to Rust and live in the compiler-builtins crate. This [1] is the source code file for unsigned division. The actual logic is inside a macro and used to implement both 128 bit division in terms of 64 bit numbers and 64 bit division in terms of 32 bit numbers.

I wonder if similar optimizations can be done to that code.

[0]: https://github.com/rust-lang/rust/pull/38482

[1]: https://github.com/rust-lang/compiler-builtins/blob/master/s...


Hi, I believe they can, and on 32-bit platforms as well, using the multiword division from Knuth (https://skanthak.homepage.t-online.de/division.html), which is what I chose as the fallback on non-x86_64 platforms.

I will try to implement the same optimizations in Rust in the upcoming weeks.

UPD: And we opened an issue :) https://github.com/rust-lang/compiler-builtins/issues/368


Many years ago I implemented a set of multiple precision integer routines in fortran. For division, I slavishly copied Knuth's routine, using all the sample divisions in his book to test, and it worked fine.


If all you need is to uniformly reduce a number into a given range, there is a division-free approach:

https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/


It took my fat fingers a while to be able to successfully copy-n-paste the link on mobile. Here’s a proper link to save others the hassle:

https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-...
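The reduction in the linked post is a single widening multiply and a shift; a minimal sketch using the compiler's 128-bit type (not Lemire's exact code):

```c
#include <stdint.h>

/* Map a uniformly random 64-bit x into [0, n) without division:
   take the high 64 bits of the 128-bit product x * n. */
static uint64_t reduce(uint64_t x, uint64_t n) {
    return (uint64_t)(((unsigned __int128)x * n) >> 64);
}
```

Like `x % n`, this is slightly biased unless you add a rejection step, but the bias is negligible when n is far below 2^64.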


And for hash tables with a 128-bit hash function he doesn't need division or modulo at all. Those are only needed with prime table sizes, and at these large sizes primes don't make sense; you always take power-of-two sizes. Then you get the bucket index with a trivial right shift or AND bitmask.

So a lot of work for nothing. At least LLVM did profit.
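The masking trick, as a sketch (the function name is illustrative, not from any particular hash table implementation):

```c
#include <stddef.h>
#include <stdint.h>

/* Bucket selection for a power-of-two table: table_size - 1 is an
   all-ones mask, so hash & (table_size - 1) equals hash % table_size
   with no division at all. table_size must be a power of two. */
static size_t bucket_index(uint64_t hash, size_t table_size) {
    return (size_t)(hash & (table_size - 1));
}
```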


Go uses this technique internally for generating random numbers in a range. I don't remember if it's used in the map implementation or not, just that I found this code and the link to Daniel Lemire's blog post.


Looks like this got bit by the HN title number ripper.

For the curious, the actual title is "128 bit division".


The article’s title is actually “128 bit division” while HN currently shows it as just “Bit Division”. I suggest both be changed to “Optimizing 128-bit Division”. Nice article, btw.


Great post. The only thing missing is a good strategy to test the resulting algorithm. Errors in arithmetic can be a nightmare.


Hi, I thought this was not the most interesting part next to all the benchmarks. For example, all we need is to check, given the quotient and the remainder, that dividend = quotient * divisor + remainder, that remainder < divisor, and that the multiplication does not overflow; that check itself is free of division operations.

Still, I added several tests, like dividend < divisor, close-to-zero remainders, and a lot of random inputs, just to make sure each new approach I add is correct.
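That identity check could be sketched like this (a hypothetical helper, not the article's actual test code; `__builtin_mul_overflow` and `__builtin_add_overflow` are GCC/Clang builtins):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Verify a division result without dividing: the identity
   dividend == quotient * divisor + remainder must hold exactly,
   with remainder < divisor and no overflow anywhere. */
static int check_div(u128 dividend, u128 divisor, u128 quotient, u128 remainder) {
    u128 prod, sum;
    if (remainder >= divisor)
        return 0;
    if (__builtin_mul_overflow(quotient, divisor, &prod))
        return 0;
    if (__builtin_add_overflow(prod, remainder, &sum))
        return 0;
    return sum == dividend;
}
```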


Nicely done.

FWIW, I dimly recall a numeric library, maybe a float-to-string one, which exhaustively tested everything for verification. Took a few days to run.

Then maybe use the spot checks to test for regressions. Weird compiler, toolchain, processor combos. That sort of thing.


> FWIW, I dimly recall a numeric library, maybe a float-to-string one, which exhaustively tested everything for verification. Took a few days to run.

You can definitely test against all ~2^31 IEEE 754 binary32 values to make sure that float-to-decimal conversion is correct; that's what I've done with the Rust standard library (it took 2 hours per test). I believe testing all ~2^63 binary64 values is also feasible by now, but only with dedicated hardware. For that reason I believe the library had only been tested with binary32.
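The round-trip flavor of such an exhaustive sweep is easy to sketch (a generic sketch, not the Rust standard library's actual test; 9 significant decimal digits are enough to round-trip any binary32 value):

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print a binary32 value with 9 significant digits and parse it back;
   the original bit pattern must survive the round trip. NaNs are
   skipped because payloads don't round-trip textually. */
static int roundtrip_ok(uint32_t bits) {
    float f, g;
    uint32_t out;
    char buf[64];
    memcpy(&f, &bits, sizeof f);
    if (isnan(f))
        return 1;
    snprintf(buf, sizeof buf, "%.9g", f);
    g = strtof(buf, NULL);
    memcpy(&out, &g, sizeof out);
    return out == bits;
}

/* The full sweep is all 2^32 bit patterns, a few core-hours:
   for (uint64_t b = 0; b <= UINT32_MAX; b++)
       if (!roundtrip_ok((uint32_t)b)) abort();              */
```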


What kind of hardware can crank through the 2^64 space in reasonable time?


It’s actually surprisingly feasible. 1,000 cores running at 3GHz for a week do ~2^64 cycles.


Don't you mean 10,000 cores?


Whoops, yes! Too late to edit.


If you’re on Windows, you can test against the WinAPI implementation: https://docs.microsoft.com/en-us/windows/win32/api/mfapi/nf-...



