Dividing unsigned 8-bit numbers

(0x80.pl)

118 points | by mfiguiere 11 hours ago

15 comments

orlp 10 hours ago
What's not mentioned is that in most cases you have a constant divisor which lets you replace division by multiplication with the reciprocal. The reciprocal can be rounded to a nearby dyadic rational, letting you do the division with a right-shift.
For example, 8-bit division by 3 is equivalent to widening multiplication by 171 followed by a right-shift of 9, as 171/2^9 = 0.333984375 which is close enough to 1/3 that the results match exactly.
[-]
- edflsafoiewq 10 hours ago
  A shift of 16 is enough for every 8-bit numerator, ie. x/a is (u32(x)*b)>>16 for some b depending only on a. You could precompute b for each a and store it a lookup table. The largest b is b=65536 for a=1 and the smallest is b=258 for a=255, so b fits in a u16 if stored with a 1-offset. Not sure it's worth it unless you reuse the denominator many times though.
- 15155 7 hours ago
  These methods are especially useful in hardware/FPGA implementations where it's infeasible to have a ton of fully pipelined dividers.
  [-]
  - andrepd 6 hours ago
    They are actually useful for optimising compilers too! Mul or mul+shifts is often faster than div
- kevinventullo 9 hours ago
  Also, if you know ahead of time that it’s exact division, there is a similar approach that doesn’t even need widening multiplication!
  [-]
  - orlp 8 hours ago
    Yes, if you know something is an exact multiple of n = r*2^k where r is odd, you can divide out the multiple by right-shifting k followed by modular multiplication by the modular multiplicative inverse of r.
    [-]
    - thaumasiotes 3 hours ago
      Theoretically, you could also take advantage of two's complement arithmetic:
      00011011 (27) x 01010101 (-0.333...) ------------------------ 11110111 (-9) invert and add one: 00001001 (9, like we wanted)
      I'd be interested in extending this to non-exact multiples, but if that's possible I don't know how.
  - Footkerchief 8 hours ago
    Can you provide an example with details? Thanks!
    [-]
    - kevinventullo 7 hours ago
      In 8-bit arithmetic (i.e. mod 256), the multiplicative inverse of 11 is 163. So, if you take some multiple of 11, say 154, then you can compute 154/11 instead as 154*163.
      Indeed,
      154*163 = 25102, and
      25102 = 14 (mod 256).
- thaumasiotes 9 hours ago
  > For example, 8-bit division by 3 is equivalent to widening multiplication by 171 followed by a right-shift of 9, as 171/2^9 = 0.333984375 which is close enough to 1/3 that the results match exactly.
  Is this related to the fact that 171 is the multiplicative inverse of 3 (mod 256), or is that a coincidence?
  [-]
  - zahlman 7 hours ago
    Sort of. (After all, a "reciprocal" for our purposes is just a multiplicative inverse in the reals, so it makes sense that it would be related to the multiplicative inverse in other domains.)
    3 times 171 is 513. So to divide by 3, we could multiply by 171 and then divide by 513. Dividing by 513... isn't any easier, but 513 is close to 512, so we hope that dividing by 512 (which is trivial - just a right-shift) gets us close enough.
    Suppose for a moment we try dividing 3 by 3 using this trick. First we'll multiply by 171 to get 513. Consider that value in binary:
```
    1000000001
     ^^^^^^^^^
      ~~~~~~~~
```
    Dividing by 512 will shift away the ^ bits. For floor division, we therefore want the ^ underlined value to be close to zero. (That way, when we divide 255 (say) by 3, the error won't be big enough to overflow into the result bits.)
    The multiplicative inverse fact is equivalent to telling us that the ~ underlined bits are exactly 1. Conveniently, that's close to 0 - but we didn't account for all the ^ underlined bits. For example, the multiplicative inverse of 7 (mod 256) is 183, but 7 times 183 is 1281. That's close to 1280, but that doesn't really help us - we could right-shift by 8 but then we still have to divide by 5. If we just ignore the problem and divide by 1024 (right-shift by 10), of course we get a lot of wrong results. (Even 6 / 7 would give us 1 instead of 0.)
    It also turns out that we'll need more bits for accurate results in the general case. I think it's possible without overflowing 16-bit numbers, but it definitely requires a bit more trickery for problematic divisors like (from my testing) 195. I thought I remembered the details here but proved myself wrong at the Python REPL :(
    [-]
    - orlp 7 hours ago
      Division by 195 is trivial. The answer is simply
      uint8_t div195(uint8_t x) { return x >= 195; }
      [-]
      - zahlman 6 hours ago
        Yes, but you can't incorporate that into an algorithm that allows the divisor to vary and avoids branching. 195 is problematic for the multiply-shift strategy.
  - orlp 8 hours ago
    It's not entirely a coincidence but also not a general result that one should use the modular inverse as multiplier.
    171 * 3 = 2^9 + 1, which is not surprising as we know that 171 * 3 = 1 (mod 2^8). So rearranged we have 171 / 2^9 = 1/3 + 1/(2^9*3) which shows it's a close approximation of 1/3.
- godsmokescrack 6 hours ago
  [flagged]
mlochbaum 10 hours ago
We implemented something like this in CBQN last year (mainly for modulus, as floor division isn't a primitive). Commit is https://github.com/dzaima/CBQN/commit/d333902, some proofs of when and why it works at https://mlochbaum.github.io/BQN/implementation/primitive/ari....
AlotOfReading 11 hours ago
I'm not certain it'll actually be faster, but you should be able to merge the reciprocal, multiplication and rounding adjustment into a single fma in log space via evil floating point bit hacks. Then you'd just be paying the cost of converting to and from the float representation of the integer.
[-]
- ashvardanian 11 hours ago
  I'd also love to see such comparisons. In SimSIMD, if AVX-512FP16 is available, I use FP16 for most I8/U8 operations, but can't speak about division.
  Overall, using F64 directly for I32 division is a well known trick, and it should hold for smaller representations as well.
  [-]
  - AlotOfReading 10 hours ago
    Don't have time to figure out the actual numbers myself, but here's an example doing the same for various transcendental functions: https://github.com/J-Montgomery/hacky_float_math/blob/623ee9...
    a*(1.xxx/b) = a*(-1*1.xxx*log2(b)), which means you should be able to do a*(fma(b, magic, constant)) with appropriate conversions on either side. Should work in 32 bits for u8s.
dzaima 8 hours ago
Here's a version of the vrcpps idea, doing whole vectors of 32 u8's, bypassing lane crossing and somewhat simplifying the packing/unpacking, that's ~1.5x faster on Haswell: https://godbolt.org/z/TExGahv3z
[-]
- Cold_Miserable 6 hours ago
  This is ~9.6x faster than "scalar".
  ASM_TestDiv proc ;rcx out, rdx A, r8 B mov rax,05555555555555555H kmovq k1,rax vmovdqu8 zmm0,zmmword ptr [rdx] vmovdqu8 zmm4,zmmword ptr [r8] vpbroadcastw zmm3,word ptr [FLOAT16_F8] vmovdqu8 zmm2{k1},zmm0 ;lower 8-bit vmovdqu8 zmm16{k1},zmm4 ;lower 8-bit vpsrlw zmm1,zmm0,8 ;higher 8-bit vpsrlw zmm5,zmm4,8 ;higher 8-bit vpord zmm1,zmm1,zmm3 vpord zmm2,zmm2,zmm3 vpord zmm5,zmm5,zmm3 vpord zmm16,zmm16,zmm3 vsubph zmm1,zmm1,zmm3{rd-sae} ;fast conv 16FP vsubph zmm2,zmm2,zmm3{rd-sae} vsubph zmm5,zmm5,zmm3{ru-sae} vsubph zmm16,zmm16,zmm3{ru-sae} vrcpph zmm5,zmm5 vrcpph zmm16,zmm16 vfmadd213ph zmm1,zmm5,zmm3{rd-sae} vfmadd213ph zmm2,zmm16,zmm3{rd-sae} vxorpd zmm1,zmm1,zmm3 vxorpd zmm2,zmm2,zmm3 vpsllw zmm1,zmm1,8 vpord zmm1,zmm1,zmm2 vmovdqu8 zmmword ptr [rcx],zmm1 ;16 8-bit unsigned int ret
jonhohle 4 hours ago
I do a fair amount of decompiling and the list of constants is awesome! I usually can make a reasonable guess or have intuition, but frequently end up stepping through several integers to find the right value. Having a handy lookup table will be great.
foundry27 10 hours ago
I think the approximate reciprocal approach is interesting here. The doc mentions multiplying the dividend by ~1.00025 in the math to avoid FP error so you don’t end up off-by-one after truncation, but I think this hack is still incomplete! On some inputs (like 255, or other unlucky divisors near powers of two), you might get borderline rounding behaviour that flips a bit of the final integer. It’s easy to forget that single-precision floats don’t line up neatly with every 8bit integer ratio in real code, and a single off-by-one can break pixel ops or feed subtle bugs into a bigger pipeline.
I suspect a hybrid scheme like using approximate reciprocals for most values but punting to scalar for unlucky ones could handle these corner cases without killing performance. That’d be interesting to benchmark
[-]
- LegionMammal978 10 hours ago
  There are only 65280 possible inputs, that's easily small enough to test every value for correctness.
xpe 3 hours ago
The bit-division burns my eyes and makes a mockery of my puny intellect, my liege. Mine fellow vassals possess skills most frightful in their potency.
purplesyringa 6 hours ago
It's odd that the only reason AVX-512 long division wins in the end is that it's compared to AVX2 reciprocal. Would it be possible to compare it to AVX-512 reciprocal computation?
synthos 9 hours ago
Newton Raphson could be used to calculate a reciprocal. (in a bit width larger than 8). If starting from a good reciprocal approximation, convergence to bit accurate reciprocal should not take many iterations. Then multiply the reciprocal with the numerator to perform the divide
ryao 10 hours ago
Why not use libdivide?
https://libdivide.com/
[-]
- LegionMammal978 10 hours ago
  libdivide is optimized for the case of a common divisor used many times, not for a long array of distinct divisors.
ack_complete 10 hours ago
Think the SSE2 implementation could be tightened up by using the same register for the dividend and the quotient, shifting the quotient bits into the dividend as the dividend bits are shifted out. This was common practice in software CPU division routines.
Cold_Miserable 9 hours ago
Heh? Surely fast convert 8-bit int to 16-bit FP,rcp+mul/div is a no-brainer? edit make that fast convert,rcp,fma (float 16 constant 1.0) and xor (same constant)
[-]
- bremac 5 hours ago
  Unfortunately none of the hardware used for testing supports FP16 arithmetic. Between Intel and AMD, the only platform that supports AVX512-FP16 is currently Sapphire Rapids.
- purplesyringa 5 hours ago
  I tried a similar approach with 32-bit FP before, and the problem here is that fast conversion is only fast in the sense of latency. Throughput-wise, it takes 2 uops instead of one, so in the end, a plain float<->int conversion wins.
abcd_f 11 hours ago
Since it's 8bit by 8bit, a precomputed lookup table (64K in size) will be another option.
[-]
- ack_complete 10 hours ago
  A lookup table in memory can only be accessed an element at a time, so it gets bottlenecked in the execution units on the load and address generation traffic. This algorithm uses 8-bit integers, so the vectorized version is processing between 16-64 elements per operation depending on the vector width. It's even worse if the 8-bit divide is integrated with other vector operations as then the lookup table method also has insert/extract overhead to exchange the values between the vector and integer units.
  A hybrid approach using small in-register lookup tables in the vector unit (pshufb/tbl) can be lucrative, but are very limited in table size.
  [-]
  - mithametacs 9 hours ago
    I've never thought of a micro-lookup-table!
    That's cool. What have you used those for?
    [-]
    - ack_complete 9 hours ago
      Typically just like conventional lookup tables, where you can get the table size down small enough. Indexed palette / quantization coding is a case where this can often work. It's pretty niche given the limitations, but if you can make it work it's often a major speedup since you're able to do 16/32/64 lookups in parallel.
- mmastrac 11 hours ago
  Lookup tables aren't necessarily faster these days for a lot of things when using low-level languages, but it would have been interesting to see the comparison here.
  [-]
  - adhoc32 11 hours ago
    pretty sure a memory access is faster than the methods presented in the article.
    [-]
    - PhilipRoman 10 hours ago
      Depends also heavily on the context. You pay for each cache miss twice - once for the miss itself, and next time when you access whatever was evicted during the first miss. This is why LUTs often shine in microbenchmarks, but drag down performance in real world scenarios when mixed with other cache bound code.
    - ryao 10 hours ago
      An uncached random memory access is around 100 cycles.
      [-]
      - Sesse__ 8 hours ago
        100 cycles would be very low. Many systems have more than 100 ns!
    - retrac 10 hours ago
      Access to main memory can be many many cycles; a short routine already in cache may be able to recompute a value more quickly than pulling it from main memory.
    - Retr0id 10 hours ago
      64K is enough to fill L1 on many systems
    - dist-epoch 11 hours ago
      Hitting L2 is more than 3-4 cycles
- ashvardanian 11 hours ago
  64 KB is a pretty significant budget for such a small operation. I've had a variant that uses 768 bytes with some extra logic, but will deprecate that kernel soon.
  https://github.com/ashvardanian/StringZilla/blob/0d47be212c5...
  [-]
  - sroussey 11 hours ago
    This seems like something that could be in hardware to allow native execution of the instruction. Is that not the case anywhere?
afstr 4 hours ago
M I’m looking I’m I’m O ; r
dataflow 10 hours ago
> SIMD ISAs usually do not provide the integer division; the only known exception is RISC-V Vector Extension
It's kind of funny to read "the only known exception is..." in this context. What would an unknown exception be - an ISA that accidentally implements this but that the author believes nobody is aware of yet?
More seriously, I actually don't understand the intended meaning here. I assume the author means "out of all the ISAs I know"? What is that set of ISAs?
[-]
- dzaima 10 hours ago
  Some SIMD ISAs with integer division:
  - Arm SVE, though only for 32-bit and 64-bit element types: https://developer.arm.com/architectures/instruction-sets/int...
  - loongarch64 LSX/LASX: https://jia.je/unofficial-loongarch-intrinsics-guide/viewer/...
  - MRISC32 (though that's somewhat obvious as basically everything in it is shared between scalar and vector).
- bee_rider 10 hours ago
  Practically, could the expression “only know exception” mean anything other than “known by me?” I mean, it is clearly possible for an exception to exist, on account of the existing known exception, so they can’t know that more exceptions don’t exist out there somewhere.
  I dunno. I think it is a more meaningful statement if we know more about the author; if we assume that they are very well informed, I guess we would assume that the fact that they don’t know about something is really meaningful. In the case of a blog post where most of us don’t know the author, it is hard to infer much. But at least it tells us why they decided to do the thing.
  [-]
  - zahlman 6 hours ago
    If a SIMD ISA exists, someone must know that it exists, because definitionally we only apply the term "SIMD ISA" to things that were consciously created to be such. So we could simply check every such example. Saying "only known example" is indeed silly.
    But e.g. in mathematics, if we say that "almost every member of set X has property Y; the only known exception is Z" then there absolutely could be more exceptions, even if we pool the knowledge of every mathematician. It isn't necessary that X is finite, or even enumerable. It could be possible for exceptions other than Z to exist even though every other member of the set that we know about has the property. It could be possible to prove that there are at most finitely many exceptions in an infinite set, and only know of Z but not be able to rule out the possibility of more exceptions than that.
    We don't even need to appeal to infinities. For example, there are problems in discrete math where nobody has found the exact answer (which necessarily is integer, by the construction of the problem) but we can prove upper and lower bounds. Suppose we find a problem where the known bounds are very tight (but not exact) and the bounded value is positive. Now, construct a set of integers ranging from 0 up to (proven upper bound + 1) inclusive... you can probably see where this is going.
    The latter doesn't apply to SIMD ISAs, because we know all the interesting (big hand-wave!) properties of all of them rather precisely - since they're designed to have those properties.
- camel-cdr 10 hours ago
  The Convex C1 [0] and for a newer example NEC SX Aurora [1] both also support vector integer division.
  [0] https://bitsavers.org/pdf/convex/080-000120-000_CONVEX_Archi...
  [1] https://ftp.libre-soc.org/NEC_SX_Aurora_TSUBASA_VectorEngine...
- michaelhoffman 10 hours ago
  > I assume the author means "out of all the ISAs I know"?
  Out of all the ISAs where they know whether it provides integer division or not.
  [-]
  - dataflow 10 hours ago
    Yeah but my point is that as a reader I'm trying to figure out which ISAs actually don't provide this (vs. which ISAs the author lacks knowledge of), and I still don't know what those are. The sentence looks like it's supposed to tell me, but it doesn't.
- perching_aix 10 hours ago
  You could come up with an ISA that implements it and it wouldn't be "known". Maybe that helps?