This concerns unsigned division by a 32 bit constant divisor. Compilers routinely optimize this to a high-multiply by a "magic number" but this number may be up to 33 bits (worst-case, divisor dependent, 7 is a problem child). So you may need a 32x33-bit high multiply.
What compilers do today is use some bit tricks: perform the 32x33-bit high multiply via a 32x32 multiply, then handle the last bit specially with additions and bitshifts. That's the "GM Method" in the paper; the juice to be squeezed out is the extra work to handle the 33rd bit.
What the paper observes is that 32x33 high multiply can be performed via a 64x64 high multiply and then the extra stuff goes away. Well yes, of course.
But amazingly in ALL of these worst case situations you can compute a different magic number, and perform a (saturating) increment of the dividend at runtime, and ONLY need a 32 bit high multiply.
That is, these are the two algorithms for unsigned division by constants where the magic number overflows:
- This paper: Promote to twice the bit width, followed by high multiply (32x64 => 64) and then bitshift
- SOTA: Saturating increment of the dividend, then high multiply (32x32 => 32) and then bitshift
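For d = 7 the two variants can be sketched in C roughly like this (constants follow the usual round-up and round-down magic-number derivations; assuming GCC/Clang's `unsigned __int128` for the 64x64 high multiply):

```c
#include <stdint.h>

/* Paper-style: promote to 64 bits; the magic ceil(2^35/7) = 0x124924925
   is pre-shifted so the quotient lands directly in the high 64 bits. */
static uint32_t div7_wide(uint32_t x) {
    const uint64_t c = 0x124924925ULL << 29;    /* ceil(2^35/7) << (64-35) */
    return (uint32_t)(((unsigned __int128)x * c) >> 64);
}

/* Round-down magic: saturating increment of the dividend, then a plain
   32x32 high multiply and a shift; floor(2^34/7) fits in 32 bits. */
static uint32_t div7_satinc(uint32_t x) {
    const uint32_t c = 0x92492492u;             /* floor(2^34/7) */
    uint32_t xi = x + 1;
    if (xi == 0) xi = UINT32_MAX;               /* saturate on wraparound */
    return (uint32_t)(((uint64_t)xi * c) >> 32) >> 2;
}
```

Both return floor(x/7) for every 32-bit x; which one wins depends on whether the target has a cheap 64x64 high multiply.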
Probably the paper's approach is a little better on wide multipliers, but they missed this well-known technique published 15 years ago (and even then, I had merely rediscovered it): https://ridiculousfish.com/blog/posts/labor-of-division-epis...
The paper doesn't require a bitshift after multiplication -- it directly uses the high half of the product as the quotient, so it saves at least one tick over the solution you mentioned. And on x86, a saturating add can't be done in a single tick, while 32->64 zero-extension is implicit, so the gap is even wider.
Unfortunately, that's only vector, and ≤16-bit ints at that, no 32-bit ints; and as the other reply says, multiply-high is nearly non-existent, which generally makes vectorized div-by-const its own mini-hell (but doing a 2x-width multiply with fixups is still better than the OP 4x-width method).
(...though, x86 does have (v)pmulhw for 16-bit input, so for 16-bit div-by-const the saturating option works out quite well.)
(And, for what it's worth, the lack of 8-bit multiplies on x86 means that the OP method of high-half-of-4x-width-multiply works out nicely for vectorizing dividing 8-bit ints too)
On x86, there is no vector instruction to get the upper half of a 64-bit x 64-bit integer product. ARM SVE2 and RISC-V RVV have one; x86 unfortunately does not (and probably won't for a long time, as AVX10 doesn't add it either).
There is IFMA from AVX-512, which recycles the f64 FMA hardware for bignum libraries: a 52-bit unsigned multiply that accumulates either the low or the high half of the product into a 64-bit accumulator.
It's surely no 64-bit, but it's much more than 32-bit. And since it gives you access to the high halves, you can use it to compute 32x32->64 on vector, even if only half as densely packed as it could be.
From my research, you can always fit it in a 32x32 multiply; you only need the extra bit at compile time. The extra bit tells you how to adjust the result at the end, but the adjustment is also a constant.
Compilers already optimize division of a 32 bit integer x by a constant d like:
c = 2^a / d
so: d = 2^a / c
so: x / d = (x * c) / 2^a
And /2^a is equivalent to >>a, so we have:
x / d = (x * c) >> a
Which is just a multiply and shift, cheaper than integer division.
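As a concrete sketch in C, for d = 3 where the smallest exact c happens to fit in 32 bits (c = ceil(2^33/3) = 0xAAAAAAAB, a = 33):

```c
#include <stdint.h>

/* x / 3 as a multiply and shift: c = ceil(2^33 / 3), a = 33.
   Exact for every 32-bit x. */
static uint32_t div3(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABULL) >> 33);
}
```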
Problem is, with some divisors like 7, 14, 21, ... the smallest c that works to produce an exact result is 33 bits. Awkward on 32-bit CPUs: it just barely doesn't fit, but you can account for that extra 1 bit manually with an extra sequence of sub, shift, add.
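The workaround sequence for d = 7, with the 33-bit magic 0x124924925 split as 2^32 + 0x24924925, looks roughly like this (a sketch of the classic sequence, not the exact codegen of any particular compiler):

```c
#include <stdint.h>

/* x / 7 with a 32-bit multiplier plus a fixup for the 33rd magic bit. */
static uint32_t div7_fixup(uint32_t x) {
    uint32_t t = (uint32_t)(((uint64_t)x * 0x24924925u) >> 32);
    /* ((x + t) >> 3), computed so the 33-bit intermediate can't overflow */
    return (((x - t) >> 1) + t) >> 2;
}
```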
What the paper points out is twofold:
1. On 64 bit CPUs the extra sequence isn’t required, you can just do a full 64x64 bit multiply and 128 bit shift. Back to just a multiply and shift like the original optimization, and already faster than the workaround.
2. Even better, the shift itself can be optimized away. A 64x64 bit multiply stores the 128 bit product in a pair of 64 bit registers, meaning you basically get >> 64 for free by just looking at the upper half register in the pair. That means if you pre-shift the constant to:
c’ = c << (64 - a)
Now (x * c’) >> 64 is equivalent to (x * c) >> a. And the result is just waiting for you in a 64 bit register, no shift needed.
Just one multiply to divide a 32 bit integer by 7. Nice.
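Spelled out in C for d = 7 (a = 35, c = ceil(2^35/7) = 0x124924925; assuming GCC/Clang's `unsigned __int128` to express the 64x64 high multiply):

```c
#include <stdint.h>

/* c' = c << (64 - a); the quotient is simply the high half of x * c'. */
static uint32_t div7(uint32_t x) {
    const uint64_t cprime = 0x124924925ULL << (64 - 35);
    return (uint32_t)(((unsigned __int128)x * cprime) >> 64);
}
```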
I must admit I'm surprised to see this -- Lemire offhandedly mentioned in the famous remainder blog post (https://lemire.me/blog/2019/02/08/faster-remainders-when-the...) that 64-bit constants can be used for 32-bit division, and even provided a short example to compute the remainder that way (though not the quotient). Looking a bit more, it seems like libdivide didn't integrate this optimization either.
I guess everyone just assumed that this is so well known by now that compilers have certainly integrated it, and no one actually bothered to submit a patch until now, when it was reinvented?
https://github.com/llvm/llvm-project/pull/181288/files

As the PR clearly points out, you can do this in a register but not inside vectors.
I don't think fastdiv has had an update in years, which is what I've used because compilers can't do "this is a constant for the next loop of 1024" like columnar SQL needs.
At this rate it might be worthwhile turning integer divisions by an unknown which is constant in a loop into a call to a routine to generate the inverse and then use that within the loop. Though I guess that would be pretty infrequent compared to the floating point equivalent.
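Such a routine might look roughly like this (hypothetical names; a sketch assuming the standard round-up magic with a = 32 + ceil(log2 d), and GCC/Clang's `unsigned __int128`):

```c
#include <stdint.h>

/* Compute the pre-shifted 64-bit constant for a runtime divisor d > 1. */
static uint64_t magic_shifted(uint32_t d) {
    int l = 0;                                   /* l = ceil(log2 d) */
    while ((1ULL << l) < d) l++;
    int a = 32 + l;
    unsigned __int128 c =                        /* c = ceil(2^a / d) */
        ((((unsigned __int128)1) << a) + d - 1) / d;
    return (uint64_t)(c << (64 - a));
}

/* Divide by the d that cprime was built for: one high multiply. */
static uint32_t udiv_by(uint32_t x, uint64_t cprime) {
    return (uint32_t)(((unsigned __int128)x * cprime) >> 64);
}
```

Hoisted out of the loop, each element then costs one widening multiply, which is the same shape libdivide-style libraries use.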
Perhaps I misunderstand your point, but I am rather sure that in SSE.../AVX... there do exist instructions for saturating addition:
* (V)PADDSB, (V)PADDSW, (V)PADDUSB, (V)PADDUSW
* (V)PHADDSW, (V)PHSUBSW
Best thing in the paper, bravo :)