> In ancient times, floating point numbers were stored in 32 bits.
This was true only for cheap computers, typically after the mid sixties.
Most of the earliest computers with vacuum tubes used longer floating-point number formats, e.g. 48-bit, 60-bit or even weird sizes like 57-bit.
The 32-bit size has never been acceptable in scientific computing with complex computations where rounding errors accumulate. The early computers with floating-point hardware were oriented to scientific/technical computing, so bigger number sizes were preferred. The computers oriented to business applications usually preferred fixed-point numbers.
The IBM System/360 family definitively established the 32-bit single-precision and 64-bit double-precision sizes. 32-bit is adequate for input and output data, and it can be sufficient for intermediate values when the input data passes through only a few computations; otherwise double precision must be used.
A few years after 1980, especially after 1985, the computers with coprocessors like Intel 8087 or Motorola 68881 became the most numerous computers with floating-point hardware, and for them the default FP size was 80-bit.
So the 1990s were long after the time when 32-bit FP numbers were normal. FP32 was revived only by GPUs, for graphic applications where precision matters much less.
Already after 1974, the C programming language made double precision the default FP size instead of 32-bit single precision, for the same reason that Intel 8087 later introduced extended precision. Single-precision computations for traditional applications are suitable only for experts, not for ordinary computer users.
While the programming languages before C used single-precision 32-bit numbers as the default size, the recommendation was already to use only double precision wherever complicated expressions were computed.
I started using computers by punching cards for a mainframe, but that was already at a time when 32-bit FP numbers were not normally used, only 64-bit FP numbers.
The best chances of seeing 32-bit single-precision numbers in use were in the decade from 1965 to 1975, among users of cheap mainframes or of minicomputers without hardware floating-point units, where floating-point emulation was done in software and emulating double precision was significantly slower.
Before the mid sixties, there were more chances to see 36-bit floating-point numbers as the smallest FP size.
Yeah. I know. I'm not disagreeing with your diagnosis, I'm just trying to gently rib you that your correction is misaimed. It's a joke, ya know?
>Single-precision computations for traditional applications are suitable only for experts, not for ordinary computer users.
Lots of ordinary computer users did compute in single precision! The reason I picked the 1990s as 'ancient' and not 1980 (when the 8087 was taped out) or 1985 (when IEEE754 was finally approved) was because those microprocessors were now in the hands of users who weren't under the supervision of 'experts'. That, along with the lack of fast 64 bit registers + the desire for high throughput at low fidelity led to a lot of 32 bit code!
And, frankly, if you want to get real technical, the ability of non-experts to program FP in 64 bit is enabled NOT ONLY by the doubled bits but by the implicit ability (absent now in many implementations) to use the 80-bit extended precision format for intermediate calcs. It's the added scratch bits in that format that let lots of 64-bit programs just work.
When you have so few bits, does it really make sense to invent a meaning for the bit positions? Just use an index into a "palette" of pre-determined numbers.
As a bonus, any operation can be replaced with a lookup into an n×n table.
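A minimal sketch of the palette idea (the 16 values below are a hypothetical choice, loosely resembling the common E2M1 FP4 magnitudes, not any standard format):

```python
# A 4-bit code is just an index into a table of 16 hand-picked values,
# and every binary operation becomes a 16x16 lookup.
PALETTE = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
           -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def nearest_code(x):
    """Quantize a real number to the index of the closest palette entry."""
    return min(range(16), key=lambda i: abs(PALETTE[i] - x))

# Precompute addition as a 16x16 table of result codes.
ADD_TABLE = [[nearest_code(PALETTE[a] + PALETTE[b]) for b in range(16)]
             for a in range(16)]

def add(a, b):
    return ADD_TABLE[a][b]

# 1.0 + 0.5 lands on the palette entry for 1.5
print(PALETTE[add(nearest_code(1.0), nearest_code(0.5))])  # 1.5
```

The table costs 16×16 entries per operation, which is tiny; the price is that results are only ever as good as the nearest palette entry.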
In standard FP32, the infs are represented as a sign bit, all exponent bits=1, and all mantissa bits=0. The NaNs are represented as a sign bit, all exponent bits=1, and the mantissa is non-zero. If you used that interpretation with FP4, you'd get the table below, which restricts the representable range to +/- 3, and it feels less useful to me. If you're using FP4 you probably are space optimized and don't want to waste a quarter of your possible combinations on things that aren't actually numbers, and you'd likely focus your efforts on writing code that didn't need to represent inf and NaN.
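One way to see this is to enumerate all 16 codes of a hypothetical 1:2:1 layout (sign, exponent, mantissa, bias 1) under the usual IEEE rules for subnormals, infinities and NaN; the largest finite value comes out as 3, and a quarter of the codes are burned on non-numbers:

```python
def decode_fp4(code):
    """Hypothetical IEEE-style FP4: 1 sign, 2 exponent, 1 mantissa bit, bias 1."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0b11:                       # all-ones exponent: inf or NaN
        return sign * float("inf") if man == 0 else float("nan")
    if exp == 0:                          # subnormal: no implicit leading 1
        return sign * (man / 2) * 2.0 ** (1 - 1)
    return sign * (1 + man / 2) * 2.0 ** (exp - 1)

for code in range(16):
    print(f"{code:04b} -> {decode_fp4(code)}")
# Finite magnitudes: 0, 0.5, 1, 1.5, 2, 3 (each signed); exponent 11
# codes are +/-inf and NaN, so the largest finite value is 3.
```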
That sounds pretty niche. What's a use case where you have less than 8 bits and that distinction is more important than having an extra finite value? I don't think AI is one.
For neural net gradient descent, automatic differentiation etc., the widely used ReLU function has information-carrying derivatives at +0 and -0 if those are treated as infinitesimals.
Barely any information. After surviving ReLU, that signed zero is probably getting added to another value, and then oops, the information is gone. It sounds a lot worse than properly spaced values.
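Both halves of this argument are easy to check in any IEEE implementation:

```python
import math

# Signed zero really does carry one bit of information, recoverable
# with copysign:
assert math.copysign(1.0, -0.0) == -1.0        # the sign survives in the zero

# ...but it evaporates as soon as the zero is added to anything:
assert -0.0 + 0.25 == 0.25                     # nonzero operand: sign is gone
assert math.copysign(1.0, -0.0 + 0.0) == 1.0   # -0 + +0 is +0 (round-to-nearest)
```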
If you were looking at the entire number line, sign would roughly be the most important part.
But you still have all the other numbers carrying sign info. This is only the sign of denormals and that's way less valuable. Outside of particular equations it ends up added to something else and disappearing entirely. It would be way better to cut it and have either half the smallest existing positive value or double the largest existing value as a replacement. Or many other options.
You need it if you want the idea of total ordering over the extended Reals. There's +/- infinity--an affine closure, not projective (point at infinity)--so to make that math work you need to give 0 a sign.
For FP4, yes... sometimes... it depends. Newer Nvidia architectures, e.g. Blackwell with NVFP4, do not: they perform micro-block scaling in the core. On older architectures, low quants like FP4 are also often not done natively, and are instead inflated back to BF16, e.g. with BnB.
As explained in an article linked at the bottom of TFA, the weights of a LLM have a normal (Gaussian) distribution.
Because of that, the best compromise when the weights are quantized to few levels is to place the points encoded by the numeric format used for the weights using a Gaussian function, instead of placing them uniformly on a logarithmic scale, like the usual floating-point formats attempt.
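A simplified sketch of that placement: put the 16 representable levels at evenly spaced quantiles of a standard normal, so each level covers an equal share of the weight distribution. (The published NF4-style constructions differ in details, e.g. pinning an exact zero and the endpoints; this is only the core idea.)

```python
from statistics import NormalDist

# 16 levels at the midpoints of 16 equal-probability slices of N(0, 1),
# instead of the log-spaced grid of an ordinary float format.
n = NormalDist()
levels = [n.inv_cdf((i + 0.5) / 16) for i in range(16)]

def quantize(x):
    """Map a value to the nearest of the 16 Gaussian-placed levels."""
    return min(levels, key=lambda q: abs(q - x))

print([round(q, 2) for q in levels])
```

Levels cluster densely near zero, where most weights live, and spread out in the tails.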
There is a relevant Wikipedia page about minifloats [0]
> The smallest possible float size that follows all IEEE principles, including normalized numbers, subnormal numbers, signed zero, signed infinity, and multiple NaN values, is a 4-bit float with 1-bit sign, 2-bit exponent, and 1-bit mantissa.
Even the latest CPUs have a 2:1 fp64:fp32 performance ratio - plus the effects of 2x the data size in cache and bandwidth use mean you can often get greater than a 2x difference.
If you're in a numeric heavy use case that's a massive difference. It's not some outdated "Ancient Lore" that causes languages that care about performance to default to fp32 :P
> Even the latest CPUs have a 2:1 fp64:fp32 performance ratio
Not completely - for basic operations (and ignoring byte size for things like cache hit ratios and memory bandwidth), if you look at, say, Agner Fog's optimisation PDFs of instruction latency, the latency of the basic SSE/AVX add/sub/mul/div (yes, even divides these days) is almost always the same for float and double on the most recent AMD/Intel CPUs (and execution ports can normally do both now).
Where it differs is gather/scatter and some shuffle instructions (larger size to work on), and maths routines - sqrt(), transcendentals like sin(), etc. - where the backing algorithms (whether on the processor in some cases, or in libm or equivalent) obviously have to do more work (often more iterations of refinement) to calculate the value to the greater precision of f64.
> the latency between float and double is almost always the same on the most recent AMD/Intel CPUs
If you are developing for ARM, some systems have hardware support for FP32 but use software emulation for FP64, with noticeable performance difference.
> ... if you look at (say Agner Fog's optimisation PDFs of instruction latency) ...
That.... doesn't seem true? At least for most architectures I looked at?
While it's true that ADDPS and ADDPD have the same latency (using the Zen 4 example at least), the double variant only calculates 4 fp64 values compared to single precision's 8 fp32. Which was my point: if each double-precision instruction processes half as many inputs at the same rate, the fp64 element throughput is half the fp32 one.
And DIV also has a significantly lower throughput for fp64 vs fp32 on Zen 4, 5 clk/op vs 3, while also processing half the values?
Sure, if you're doing scalar fp32/fp64 instructions it's not much of a difference (though DIV still has a lower throughput) - but then you're already leaving significant peak flops on the table, so I'm not sure it's a particularly useful comparison. It's just the truism of "if you're not performance limited you don't need to think about performance" - which has always been the case.
So yes, they do at least have a 2:1 difference in throughput on zen4 - even higher for DIV.
Well, maybe not all admittedly, and I didn't look at AVX2/512, but it looks like `_mm_div_ps` and `_mm_div_pd` are identical for divide, at the 4-wide level for the basics.
Obviously, the wider you go, the more constrained you are on infrastructure and how many ports there are.
My point was more it's very often the expensive transcendentals where the performance difference is felt between f32 and f64.
This depends largely on your operations. There is lots of performance critical code that doesn't vectorize smoothly, and for those operations, 64 bit is just as fast.
Yes, if you're not FP ALU limited (which is likely the case if not vectorized), or data cache/bandwidth/thermally limited from the increased cost of fp64, then it doesn't matter - but as I said that's true for every performance aspect that "doesn't matter".
That doesn't mean that there are no situations where it does matter today - which is what I feel is implied by calling it "Ancient".
But the "float" typename is generally fp32 - if we assume the "most generically named type" is the "default". Though this is a bit of an inconsistency with C - the type name "double" surely implies it's double the expected baseline while, as you mentioned, constants and much of libm default to 'double'.
The C keywords "float" and "double" are based on the tradition established a decade earlier by IBM System/360 of calling FP32 "single-precision" and FP64 "double-precision".
This IBM convention has been inherited by the IBM programming languages FORTRAN IV and PL/I and from these 2 languages it has spread everywhere.
The C language has taken several keywords and operators from IBM PL/I, which was one of the three main inspiration sources for C (which were CPL/BCPL, PL/I and ALGOL 68).
So "float" and "double" are really inherited by C from PL/I.
A feature that is specific to C is that it has changed the default format for constants and for intermediate values to double-precision, instead of the single-precision that was the default in earlier programming languages.
This was done with the intention of protecting naive users from making mistakes, because if you compute with FP32 it is very easy to obtain erroneous results, unless you analyze very carefully the propagation of errors. Except in applications where errors matter very little, e.g. graphics and ML/AI, the use of FP32 is more suitable for experts, while bigger formats are recommended for normal users.
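The effect is easy to demonstrate by forcing every intermediate result through FP32 (here emulated by a `struct` round-trip, since Python floats are FP64):

```python
import struct

def to_f32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Sum 0.1 a million times: once rounding every intermediate result to
# FP32, once keeping FP64 throughout.
s32 = 0.0
s64 = 0.0
for _ in range(1_000_000):
    s32 = to_f32(s32 + to_f32(0.1))
    s64 += 0.1

print(s64)   # very close to 100000: the FP64 error is invisible here
print(s32)   # drifts visibly away from 100000: FP32 error accumulates
```

A naive user sees nothing wrong with the code; only the error analysis (or the comparison against FP64) reveals the drift.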
The earliest Cray models (starting with Cray-1 in 1976) had only 64-bit floating-point numbers. 128-bit numbers were a later addition and I do not think that they were implemented in hardware, but only in software. Very few computers, except some from IBM, have implemented FP128 in hardware, while software libraries for quadruple-precision or double-double-precision FP128 are widespread.
The Cray 64-bit format was a slight increase in size over the 60-bit floating-point numbers that had been used in the previous computers designed by Seymour Cray, at CDC.
Before IBM increased the size of a byte to 8 bits, which caused all numeric formats to use sizes that are multiples of 8 bits, the computers with 6-bit bytes typically used floating-point sizes of 60 bits in the high-end models, 48 bits in cheaper models, or 36 bits in the cheapest models.
I too want fewer bits of mantissa in my floating point!
But what I wish is that there had been fp64 encoding with a field for number of significant digits.
strtod() would encode this, fresh out of an instrument reading (serial). It would be passed along. It would be useful EVEN if it weren't updated by arithmetic with other such numbers.
Every day I get a query like "why does the datum have so many decimal digits? You can't possibly be saying that the instrument is that precise!"
Well, it's because of sprintf(buf, "%.16g", x) as the default to CYA.
Also sad is the complaint about "0.56000 ... 01" because someone did sprintf("%.16f").
I can't fix this in one class -- data travels between too many languages and communication buffers.
In short, I wish I had an fp64 double where the last 4 bits were ALWAYS left alone by the CPU.
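As a software convention that's doable today; a sketch (the `tag`/`sig_digits` helpers are hypothetical, and the CPU will NOT leave those bits alone across arithmetic, which is exactly the missing feature being wished for):

```python
import struct

def tag(x, digits):
    """Return x with its low 4 mantissa bits replaced by `digits`."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    bits = (bits & ~0xF) | (digits & 0xF)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def sig_digits(x):
    """Read back the 4-bit tag (garbage if x was never tagged)."""
    return struct.unpack("<Q", struct.pack("<d", x))[0] & 0xF

v = tag(13.928528, 3)               # instrument reading, 3 significant digits
print("%.*g" % (sig_digits(v), v))  # prints 13.9, not 16 digits of noise
```

Overwriting the low 4 bits perturbs the value by only a few ulps, far below any instrument's precision.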
> It would be passed along. It would be useful EVEN if it weren't updated by arithmetic with other such numbers.
It would be useful if you could then pass it to an "about equal" operator, too.
I don't need to know that the alternator is putting out 13.928528V, and sure as hell I know you're not measuring that accurately. It's precise but wrong.
I want an "about equals" thing so I can say "if Valt == 14 alt_ok=true" kind of thing but tag it to be "about 14" not "exactly 14".
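One way to get that today is to carry the tolerance next to the value rather than inside it; `Reading` below is a hypothetical wrapper, and the comparison itself is just `math.isclose` with an explicit tolerance:

```python
import math

class Reading:
    """A measured value tagged with how loosely it should compare."""
    def __init__(self, value, rel_tol=0.05):   # 5% = "about"
        self.value = value
        self.rel_tol = rel_tol

    def about(self, target):
        return math.isclose(self.value, target, rel_tol=self.rel_tol)

v_alt = Reading(13.928528)      # precise, but not measured that accurately
alt_ok = v_alt.about(14)        # "about 14", not "exactly 14"
print(alt_ok)                   # True
```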
There's an "Update:" note for a next post on the NF4 format. As far as I can tell this is neither NVFP4 nor MXFP4, which are commonly used with LLM model files. The thing with these formats is that common scaling information is factored out per batch of values, so it is not a format for a single value but for groups of values. I'd like to know more about these (but not enough to go research them myself).
> 9 years ago, I shared this as an April Fools joke here on HN.
That's fun.
> It seems that life is imitating art.
You didn't even beat Wikipedia to the punch. They've had a nice page about minifloats using 6-8 bit sizes as examples for about 20 years.
The 4 bit section is newer, but it actually follows IEEE rules. Your joke formats forgot there's an implied 1 bit in the fraction. And how exponents work.
Interesting! I have been using integers or f32 for that. What was the use case specifically? Did you write a software float for it? I remember writing a `f16` type for an IC that used that was a pain!
> In ancient times, floating point numbers were stored in 32 bits.
I thought in ancient times, floating point numbers used to be 80 bit. They lived in a funky mini stack on the coprocessor (x87). Then one day, somebody came along and standardized those 32 and 64 bit floats we still have today.
I was going to reply that just because Intel did something funny doesn't mean that it was the beginning of the story. But it turns out that the release of the 8087 predates the ratification of IEEE floats by 2 years. In addition, the primary numeric designer for the 8087 was apparently Kahan, which means that they were both part of the same design process. Of course, there were other formats predating both of these.
The Intel 8087 design team, with Kahan as their consultant (he was the author of most of the novel features, based on his experience with the design of the HP scientific calculators), realized that instead of keeping their much improved floating-point format proprietary, it would be much better to agree with the entire industry on a common floating-point standard.
So Intel initiated the discussions for the future IEEE standard with many relevant companies, even before the launch of the 8087. AMD was convinced immediately by Intel, so AMD was able to introduce an FP accelerator (Am9512) based on the 8087 FP formats, which were later adopted in IEEE 754, also in 1980 and a few months before the launch of the Intel 8087. So in 1980 there were already 2 implementations of the future IEEE 754 standard. The Am9512 was licensed to Intel, and Intel made it under the 8232 part number (it was used in 8080/8085/Z80 systems).
Unlike AMD, the traditional computer companies agreed that an FP standard was needed to solve the mess of many incompatible FP formats, but they thought that the Kahan-Intel proposal would be too expensive for them, so they came up with a couple of counter-proposals, based on the tradition of giving priority to implementation costs over usefulness for computer users.
Fortunately, the Intel negotiators eventually succeeded in convincing the others to adopt the Intel proposal, by explaining how the new features could be implemented at an acceptable cost.
The story of IEEE 754 is one of the rare stories in standardization where it was chosen to do what is best for customers, not what is best for vendors.
Like the use of encryption in communications, the use of the IEEE standard has been under continuous attack during its history, coming from each new generation of logic designers who think that they are smarter than their predecessors and are too lazy to implement some features of the standard properly. Older designs have demonstrated that those features can in fact be implemented efficiently, but the newcomers believe they should take the easy path and implement them inefficiently, because supposedly the users will not care.
The floating point "standard" was basically codifying multiple different vendor implementations of the same idea. Hence the mess that floating point is not consistent across implementations.
IEEE 754 basically had three major proposals that were considered for standardization. There was the "KCS draft" (Kahan, Coonen, Stone), which was the draft implemented for the x87 coprocessor. There was DEC's counter proposal (aka the PS draft, for Payne and Strecker), and HP's counter proposal (aka, the FW draft for Fraley and Walther). Ultimately, it was the KCS draft that won out and became what we now know as IEEE 754.
One of the striking things, though, is just how radically different KCS was. By the time IEEE 754 forms, there is a basic commonality of how floating-point numbers work. Most systems have a single-precision and double-precision form, and many have an additional extended-precision form. These formats are usually radix-2, with a sign bit, a biased exponent, and an integer mantissa, and several implementations had hit on the implicit integer bit representation. (See http://www.quadibloc.com/comp/cp0201.htm for a tour of several pre-IEEE 754 floating-point formats). What KCS did that was really new was add denormals, and this was very controversial. I also think that support for infinities was introduced with KCS, although there were more precedents for the existence of NaN-like values. I'm also pretty sure that sticky bits as opposed to trapping for exceptions was considered innovative. (See, e.g., https://ethw-images.s3.us-east-va.perf.cloud.ovh.us/ieee/f/f... for a discussion of the differences between the early drafts.)
Now, once IEEE 754 came out, pretty much every subsequent implementation of floating-point has started from the IEEE 754 standard. But it was definitely not a codification of existing behavior when it came out, given the number of innovations that it had!
By definition, a document that is written is historic, not prehistoric.
Prehistoric information could be preserved by an oral tradition, until it is recorded in some documents (like the Oral Histories at the Computer History Museum site).
80 bits is just in the processor. That's why you might get a slightly different result, depending on how the calculation was ordered and whether something was stored to RAM in between.
The Intel 8087, which introduced the 80-bit extended floating-point format in 1980, could store and load 80-bit numbers, avoiding any alterations caused by conversions to less precise formats.
To be able to use the corresponding 8087 instructions, "long double" was added to the C language. To avoid extra roundings, one had to use "long double" variables and also be careful that intermediate values used in computing an expression were not spilled into memory as "double".
However, this became broken in some newer C compilers where, due to the deprecation of the x87 ISA, "long double" was made synonymous with "double". Some better C compilers have chosen to implement "long double" as quadruple precision instead of extended precision, which ensures that no precision is lost, but which may be slow on most computers, where no hardware support for FP128 exists.
You can set x87 to round each operation result to 32-bit or 64-bit.
With this setting it operates internally exactly on those sizes.
Operating internally on 80-bits is just the default setting, because it is the best for naive users, who are otherwise prone to computing erroneous results.
This is the same reason why the C language has made "double" the default precision in constants and intermediate values.
Unless you do graphics or ML/AI, single-precision computations are really only for experts who can analyze the algorithm and guarantee that it is correct.
0.0 + x = x
NaN + x = NaN
+1.0 + -1.0 = 0.0
+1.0 + +1.0 = NaN
-1.0 + -1.0 = NaN
-0.0 = 0.0
-(+1.0) = -1.0
-(-1.0) = +1.0
-NaN = NaN
x - y = x + (-y)
NaN * x = NaN
+1.0 * x = x
-1.0 * x = -x
0.0 * 0.0 = 0.0
/0.0 = NaN
/+1.0 = +1.0
/-1.0 = -1.0
/NaN = NaN
x / y = x * (/y)
More interestingly, how would this be implemented in logic gates? Addition with a 2's-complement full adder and a NaN detector. Negation with a 2's-complement negation circuit. Reciprocal with a 0.0 detector.
Multiplication with a dedicated logic circuit (use a Karnaugh map).
FP4 1:2:0:1 (other examples: binary32 1:8:0:23, 8087 ep 1:15:1:63)
S:E:l:M
S = sign bit present (or magnitude-only absolute value)
E = exponent bits (typically biased by 2^(E-1) - 1)
l = explicit leading integer present (almost always 0 because the leading digit is always 1 for normals, 0 for denormals, and not very useful for special values)
M = mantissa (fraction) bits
The limitation of this FP4 is that it lacks infinities, [sq]NaNs, and denormals, which makes it very limited, for special purposes only. There's no denying that it might be extremely efficient for very particular problems.
If a more even distribution were needed, a simpler fixed point format like 1:2:1 (sign:integer:fraction bits) is possible.
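For comparison, the 1:2:1 fixed-point alternative decodes trivially: a sign bit, two integer bits and one fraction bit give evenly spaced values in steps of 0.5 from -3.5 to +3.5 (with a redundant negative zero):

```python
def decode_fixed(code):
    """Hypothetical 1:2:1 fixed point: sign bit plus magnitude in halves."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    return sign * ((code & 0b111) / 2)   # 0b111 = integer+fraction bits

values = sorted({decode_fixed(c) for c in range(16)})
print(values)   # 15 distinct values: -3.5, -3.0, ..., 3.0, 3.5
```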
I am...very sorry to be the one delivering this news. It was not a pleasant realization for me, either.
It seems quite wasteful to have two zeros when you only have 4 bits in total.
[0] https://en.wikipedia.org/wiki/Minifloat
Someone didn't try it on GPU...
https://gcc.godbolt.org/z/7155YKTrK
What do you mean by this? In C 1.0 is a double.
I think Cray doubles were 128 bits, and their singles were 64… which makes it seem like smaller floats are just a continuation of the eternal trend.
Shouldn't that be m mantissa bits (not y) -- i.e. typo here -- or am I misunderstanding something?
It seems that life is imitating art.
https://github.com/sdd/ieee754-rrp
That's fun.
> It seems that life is imitating art.
You didn't even beat Wikipedia to the punch. They've had a nice page about minifloats, using 6- to 8-bit sizes as examples, for about 20 years.
The 4 bit section is newer, but it actually follows IEEE rules. Your joke formats forgot there's an implied 1 bit in the fraction. And how exponents work.
Yes, purely software.
[1] https://tom7.org/nand/
I thought in ancient times, floating point numbers used to be 80 bit. They lived in a funky mini stack on the coprocessor (x87). Then one day, somebody came along and standardized those 32 and 64 bit floats we still have today.
So Intel initiated the discussions for the future IEEE standard with many relevant companies, even before the launch of the 8087. AMD was convinced by Intel immediately, so AMD was able to introduce an FP accelerator (the Am9512) based on the 8087 FP formats, which were later adopted in IEEE 754; it shipped in 1980, a few months before the launch of the Intel 8087. So in 1980 there were already two implementations of the future IEEE 754 standard. The Am9512 was licensed to Intel, which sold it under the 8232 part number (it was used in 8080/8085/Z80 systems).
Unlike AMD, the traditional computer companies agreed that an FP standard was needed to solve the mess of many incompatible FP formats, but they thought that the Kahan-Intel proposal would be too expensive for them, so they came up with a couple of counter-proposals, based on the tradition of giving priority to implementation costs over usefulness for computer users.
Fortunately the Intel negotiators eventually succeeded in convincing the others to adopt the Intel proposal, by explaining how the new features could be implemented at an acceptable cost.
The story of IEEE 754 is one of the rare stories in standardization where it was chosen to do what is best for customers, not what is best for vendors.
Like the use of encryption in communications, the IEEE standard has been under continuous attack throughout its history, coming from each new generation of logic designers who think they are smarter than their predecessors and are too lazy to implement some features of the standard properly. Older designs have demonstrated that those features can in fact be implemented efficiently, but the newcomers prefer to take the easy path and implement them inefficiently, because supposedly the users will not care.
One of the striking things, though, is just how radically different KCS was. By the time IEEE 754 forms, there is a basic commonality of how floating-point numbers work. Most systems have a single-precision and double-precision form, and many have an additional extended-precision form. These formats are usually radix-2, with a sign bit, a biased exponent, and an integer mantissa, and several implementations had hit on the implicit integer bit representation. (See http://www.quadibloc.com/comp/cp0201.htm for a tour of several pre-IEEE 754 floating-point formats). What KCS did that was really new was add denormals, and this was very controversial. I also think that support for infinities was introduced with KCS, although there were more precedents for the existence of NaN-like values. I'm also pretty sure that sticky bits as opposed to trapping for exceptions was considered innovative. (See, e.g., https://ethw-images.s3.us-east-va.perf.cloud.ovh.us/ieee/f/f... for a discussion of the differences between the early drafts.)
Now, once IEEE 754 came out, pretty much every subsequent implementation of floating-point has started from the IEEE 754 standard. But it was definitely not a codification of existing behavior when it came out, given the number of innovations that it had!
In ancient times, floats were all 60 bits and there was no single precision.
See page 3-15 of this https://caltss.computerhistory.org/archive/6400-cdc.pdf
Prehistoric information could be preserved by an oral tradition, until it is recorded in some documents (like the Oral Histories at the Computer History Museum site).
To be able to use the corresponding 8087 instructions, "long double" was added to the C language. To avoid extra roundings one had to use "long double" variables, and one also had to be careful that intermediate values used in computing an expression were not spilled into memory as "double".
However this became broken in some newer C compilers, where, due to the deprecation of the x87 ISA, "long double" was made synonymous with "double". Some better C compilers have chosen to implement "long double" as quadruple precision instead of extended precision, which ensures that no precision is lost, but which may be slow on most computers, where no hardware support for FP128 exists.
With this setting it operates internally on exactly those sizes.
Operating internally on 80-bits is just the default setting, because it is the best for naive users, who are otherwise prone to computing erroneous results.
This is the same reason why the C language has made "double" the default precision in constants and intermediate values.
Unless you do graphics or ML/AI, single-precision computations are really only for experts who can analyze the algorithm and guarantee that it is correct.
Multiplication can be done with a single small logic circuit (derive it with a Karnaugh map):
S:E:l:M
S = sign bit present (or magnitude-only absolute value)
E = exponent bits (typically biased by 2^(E-1) - 1)
l = explicit leading integer present (almost always 0 because the leading digit is always 1 for normals, 0 for denormals, and not very useful for special values)
M = mantissa (fraction) bits
The limitations of FP4 are that it lacks infinities, [sq]NaNs, and denormals, which makes it suitable only for special purposes. There's no denying that it can be extremely efficient for very particular problems.
If a more even distribution were needed, a simpler fixed point format like 1:2:1 (sign:integer:fraction bits) is possible.
Or does that matter — it's the kernel that handles the FP format?