C++ proposal: There are exactly 8 bits in a byte

(open-std.org)

166 points | by Twirrim 6 hours ago

39 comments

  • favorited 4 hours ago
    Previously, in JF's "Can we acknowledge that every real computer works this way?" series: "Signed Integers are Two’s Complement" <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p09...>
    • jsheard 3 hours ago
      Maybe specifying that floats are always IEEE floats should be next? Though that would obsolete this Linux kernel classic so maybe not.

      https://github.com/torvalds/linux/blob/master/include/math-e...

      • jfbastien 1 hour ago
        Hi! I'm JF. I half-jokingly threatened to do IEEE float in 2018 https://youtu.be/JhUxIVf1qok?si=QxZN_fIU2Th8vhxv&t=3250

        I wouldn't want to lose the Linux humor tho!

      • conradev 37 minutes ago
        I was curious about float16, and TIL that the 2008 revision of the standard includes it as an interchange format:

        https://en.wikipedia.org/wiki/IEEE_754-2008_revision

      • AnimalMuppet 3 hours ago
        That line is actually from a famous Dilbert cartoon.

        I found this snapshot of it, though it's not on the real Dilbert site: https://www.reddit.com/r/linux/comments/73in9/computer_holy_...

        • Jerrrrrrry 2 hours ago
          This is the epitome, the climax, the crux, the ultimate, the holy grail, the crème de la crème of nerd sniping.

          fuckin bravo

      • NL807 3 hours ago
        Love it
      • FooBarBizBazz 2 hours ago
        Whether double floats can silently have 80 bit accumulators is a controversial thing. Numerical analysis people like it. Computer science types seem not to because it's unpredictable. I lean towards, "we should have it, but it should be explicit", but this is not the most considered opinion. I think there's a legitimate reason why Intel included it in x87, and why DSPs include it.
        • stephencanon 1 hour ago
          Numerical analysis people do not like it. Having _explicitly controlled_ wider accumulation available is great. Having compilers deciding to do it for you or not in unpredictable ways is anathema.
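
          A minimal sketch of the "explicitly controlled wider accumulation" being described (illustrative only, not anyone's production code): sum float data in a double accumulator that is spelled out in the source, instead of hoping the compiler keeps an 80-bit x87 temporary around.

            #include <vector>

            // The wider accumulator is an explicit, visible choice in the code,
            // not something the optimizer may or may not do behind your back.
            double sum_wide(const std::vector<float>& xs) {
                double acc = 0.0;
                for (float x : xs) acc += x;
                return acc;
            }
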
          • bee_rider 53 minutes ago
            It isn’t harmful, right? Just like getting a little accuracy from a fused multiply add. It just isn’t useful if you can’t depend on it.
            • Negitivefrags 13 minutes ago
              It can be harmful. In GCC while compiling a 32 bit executable, making an std::map< float, T > can cause infinite loops or crashes in your program.

              This is because when you insert a value into the map, it has 80 bit precision, and that number of bits is used when comparing the value you are inserting during the traversal of the tree.

              After the float is stored in the tree, it's clamped to 32 bits.

              This can cause the element to be inserted in the wrong place in the tree, and this breaks the assumptions of the algorithm, leading to the crash or infinite loop.

              Compiling for 64 bits or explicitly disabling x87 float math makes this problem go away.

              I have actually had this bug in production and it was very hard to track down.
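
              A minimal sketch of the failure mode (illustrative only: the 80-bit register is simulated with a double and the stored key with a float, since the real bug needs x87 code generation):

                #include <iostream>

                int main() {
                    double in_register = 1.0 + 1e-9;                  // stands in for the 80-bit temporary used during insertion
                    float  stored = static_cast<float>(in_register);  // the key after it has been written into the tree node
                    float  probe  = 1.0f;

                    // During insertion the comparison sees the wide value; later lookups
                    // only see the rounded, stored one -- and they disagree on the order.
                    std::cout << (probe < in_register) << ' ' << (probe < stored) << '\n';  // prints "1 0"
                }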

              • blt 2 minutes ago
                dang that's a good war story.
            • lf37300 12 minutes ago
              If not done properly, double rounding (round to extended precision then rounding to working precision) can actually introduce larger approximation error than round to nearest working precision directly. So it can actually make some numerical algorithms perform worse.
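
              A concrete sketch (using double as the "extended" precision and float as the working one, so it runs anywhere): the exact sum 1 + 2^-24 + 2^-60 rounded once to float gives 1 + 2^-23, but rounded first to double the deciding low bit is lost and the second rounding lands on 1.0f, an error slightly larger than half an ulp.

                #include <cstdio>

                int main() {
                    double a = 1.0;
                    double b = 0x1.000000001p-24;   // 2^-24 + 2^-60, exactly representable as a double
                    double once  = a + b;           // first rounding: the 2^-60 bit is discarded, leaving 1 + 2^-24
                    float  twice = (float)once;     // second rounding: an exact tie, rounds to even -> 1.0f
                    std::printf("%.9g\n", twice);   // prints 1; a single correct rounding would give 1.00000012
                }
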
            • eternityforest 25 minutes ago
              I suppose it could be harmful if you write code that depends on it without realizing it, and then something changes so it stops doing that.
  • pjdesno 4 hours ago
    During an internship in 1986 I wrote C code for a machine with 10-bit bytes, the BBN C/70. It was a horrible experience, and the existence of the machine in the first place was due to a cosmic accident of the negative kind.
    • Isamu 1 hour ago
      I wrote code on a DECSYSTEM-20, the C compiler was not officially supported. It had a 36-bit word and a 7-bit byte. Yep, when you packed bytes into a word there were bits left over.

      And I was tasked with reading a tape with binary data in 8-bit format. Hilarity ensued.

      • bee_rider 52 minutes ago
        Hah. Why did they do that?
    • csours 4 hours ago
      Somehow this machine found its way onto The Heart of Gold in a highly improbable chain of events.
    • Taniwha 3 hours ago
      I've worked on a machine with 9-bit bytes (and 81-bit instructions) and others with 6-bit ones - neither has a C compiler
      • asveikau 2 hours ago
        I think the pdp-10 could have 9 bit bytes, depending on decisions you made in the compiler. I notice it's hard to Google information about this though. People say lots of confusing, conflicting things. When I google pdp-10 byte size it says a c++ compiler chose to represent char as 36 bits.
      • corysama 1 hour ago
        The Nintendo64 had 9-bit RAM. But, C viewed it as 8 bit. The 9th bit was only there for the RSP (GPU).
    • aldanor 2 hours ago
      10-bit arithmetic is actually not uncommon on FPGAs these days and is used in production in relatively modern applications.

      10-bit C, however, ..........

      • eulgro 2 hours ago
        How so? Arithmetic on FPGAs usually uses the minimum size that works, because any size over that will use more resources than needed.

        9-bit bytes are pretty common in block RAM though, with the extra bit used either for ECC or for user storage.

    • WalterBright 3 hours ago
      I programmed the Intellivision's CPU, the General Instrument CP1610, which had a 10 bit "decle". A wacky machine. It wasn't powerful enough for C.
    • kazinator 2 hours ago
      C itself was developed on machines that had 18 bit ints.
  • WalterBright 3 hours ago
    D made a great leap forward with the following:

    1. bytes are 8 bits

    2. shorts are 16 bits

    3. ints are 32 bits

    4. longs are 64 bits

    5. arithmetic is 2's complement

    6. IEEE floating point

    and a big chunk of wasted time trying to abstract these away and getting it wrong anyway was saved. Millions of people cried out in relief!

    Oh, and Unicode was the character set. Not EBCDIC, RADIX-50, etc.

    • Laremere 1 hour ago
      Zig is even better:

      1. u8 and i8 are 8 bits.

      2. u16 and i16 are 16 bits.

      3. u32 and i32 are 32 bits.

      4. u64 and i64 are 64 bits.

      5. Arithmetic is an explicit choice. '+' overflowing is illegal behavior (will crash in debug and ReleaseSafe), '+%' is 2's complement wrapping, and '+|' is saturating arithmetic. Edit: forgot to mention @addWithOverflow(), which provides a tuple of the original type and a u1; there's also std.math.add(), which returns an error on overflow.

      6. f16, f32, f64, f80, and f128 are IEEE floating point types of the respective bit lengths.

      The question of the length of a byte doesn't even matter. If someone wants to compile to a machine whose bytes are 12 bits, just use u12 and i12.

      • __turbobrew__ 1 hour ago
        This is the way.
      • Spivak 1 hour ago
        How does 5 work in practice? Surely no one is actually checking if their arithmetic overflows, especially from user-supplied or otherwise external values. Is there any use for the normal +?
        • dullcrisp 23 minutes ago
          You think no one checks if their arithmetic overflows?
    • gerdesj 3 hours ago
      "1. bytes are 8 bits"

      How big is a bit?

      • thamer 2 hours ago
        This doesn't feel like a serious question, but in case this is still a mystery to you… the name bit is a portmanteau of binary digit, and as indicated by the word "binary", there are only two possible digits that can be used as values for a bit: 0 and 1.
      • basementcat 41 minutes ago
        A bit is a measure of information theoretical entropy. Specifically, one bit has been defined as the uncertainty of the outcome of a single fair coin flip. A single less than fair coin would have less than one bit of entropy; a coin that always lands heads up has zero bits, n fair coins have n bits of entropy and so on.

        https://en.m.wikipedia.org/wiki/Information_theory

        https://en.m.wikipedia.org/wiki/Entropy_(information_theory)
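
        As a quick numeric check of that definition (an illustrative sketch, not part of the comment above), the entropy in bits of a coin that lands heads with probability p:

          #include <cmath>
          #include <cstdio>

          // Shannon entropy, in bits, of a coin with heads-probability p.
          double coin_entropy(double p) {
              if (p == 0.0 || p == 1.0) return 0.0;  // a certain outcome carries no information
              return -p * std::log2(p) - (1 - p) * std::log2(1 - p);
          }

          int main() {
              std::printf("%f %f %f\n", coin_entropy(0.5), coin_entropy(0.9), coin_entropy(1.0));
              // 1.000000 0.468996 0.000000 -- a fair coin is exactly one bit, a biased coin less
          }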

        • fourier54 19 minutes ago
          That is a bit in information theory. It has nothing to do with the computer/digital engineering term being discussed here.
      • dullcrisp 21 minutes ago
        At least 2 or 3
      • CoastalCoder 2 hours ago
        > How big is a bit?

        A quarter nybble.

      • nonameiguess 1 hour ago
        How philosophical do you want to get? Technically, voltage is a continuous signal, but we sample only at clock cycle intervals, and if the sample at some cycle is below a threshold, we call that 0. Above, we call it 1. Our ability to measure whether a signal is above or below a threshold is uncertain, though, so for values where the actual difference is less than our ability to measure, we have to conclude that a bit can actually take three values: 0, 1, and we can't tell but we have no choice but to pick one.

        The latter value is clearly less common than 0 and 1, but how much less? I don't know, but we have to conclude that the true size of a bit is probably something more like 1.00000000000000001 bits rather than 1 bit.

      • poincaredisk 2 hours ago
        A bit is either a 0 or 1. A byte is the smallest addressable piece of memory in your architecture.
        • elromulous 2 hours ago
          Technically the smallest addressable piece of memory is a word.
          • Maxatar 1 hour ago
            I don't think the term word has any consistent meaning. Certainly x86 doesn't use the term word to mean smallest addressable unit of memory. The x86 documentation defines a word as 16 bits, but x86 is byte addressable.

            ARM is similar, ARM processors define a word as 32-bits, even on 64-bit ARM processors, but they are also byte addressable.

            As best as I can tell, it seems like a word is whatever the size of the arithmetic or general purpose register is at the time that the processor was introduced, and even if later a new processor is introduced with larger registers, for backwards compatibility the size of a word remains the same.

          • asveikau 1 hour ago
            Depends on your definition of addressable.

            Lots of CISC architectures allow memory accesses in various units even if they call general-purpose-register-sized quantities "word".

            Iirc the C standard specifies that all memory can be accessed via char*.

        • Nevermark 2 hours ago
          Which … if your heap always returns N bit aligned values, for some N … is there a name for that? The smallest heap addressable segment?
    • cogman10 3 hours ago
      Yeah, this is something Java got right as well. It got "unsigned" wrong, but it got standardizing primitive bits correct

      byte = 8 bits

      short = 16

      int = 32

      long = 64

      float = 32 bit IEEE

      double = 64 bit IEEE

      • jltsiren 2 hours ago
        I like the Rust approach more: usize/isize are the native integer types, and with every other numeric type, you have to mention the size explicitly.

        On the C++ side, I sometimes use an alias that contains the word "short" for 32-bit integers. When I use them, I'm explicitly assuming that the numbers are small enough to fit in a smaller than usual integer type, and that it's critical enough to performance that the assumption is worth making.
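
        In C++ that convention can be as small as a named alias plus a comment (hypothetical name, just to illustrate):

          #include <cstdint>

          // "Short" here documents the assumption that these values stay well within 32 bits.
          using short_count = std::int32_t;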

        • jonstewart 2 hours ago
          <cstdint> has int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, and uint64_t. I still go back and forth between uint64_t, size_t, and unsigned int, but am defaulting to uint64_t more and more, even if it doesn't matter.
        • Jerrrrrrry 2 hours ago
          hindsight has its advantages
        • kazinator 1 hour ago
          > you have to mention the size explicitly

          It's unbelievably ugly. Every piece of code working with any kind of integer screams "I am hardware dependent in some way".

          E.g. in a structure representing an automobile, the number of wheels has to be some i8 or i16, which looks ridiculous.

          Why would you take a language in which you can write functional pipelines over collections of objects, and make it look like assembler?

          • pezezin 1 hour ago
            If you don't care about the size of your number, just use isize or usize.

            If you do care, then isn't it better to specify it explicitly than trying to guess it and having different compilers disagreeing on the size?

            • kazinator 1 hour ago
              A type called isize is some kind of size. It looks wrong for something that isn't a size.
          • Spivak 1 hour ago
            Is it any better calling it an int where it's assumed to be an i32 and 30 of the bits are wasted.
      • josephg 2 hours ago
        Yep. Pity about getting chars / string encoding wrong though. (Java chars are 16 bits).

        But it’s not alone in that mistake. All the languages invented in that era made the same mistake. (C#, JavaScript, etc).

        • jeberle 30 minutes ago
          Java strings are byte[]'s if their contents contain only Latin-1 values (the first 256 codepoints of Unicode). This shipped in Java 9.

          JEP 254: Compact Strings

          https://openjdk.org/jeps/254

        • paragraft 2 hours ago
          What's the right way?
          • WalterBright 2 hours ago
            UTF-8

            When D was first implemented, circa 2000, it wasn't clear whether UTF-8, UTF-16, or UTF-32 was going to be the winner. So D supported all three.

          • Remnant44 2 hours ago
            utf8, for essentially the reasons mentioned in this manifesto: https://utf8everywhere.org/
            • josephg 1 hour ago
              Yep. Notably supported by go, python3, rust and swift. And probably all new programming languages created from here on.
  • MaulingMonkey 4 hours ago
    Some people are still dealing with DSPs.

    https://thephd.dev/conformance-should-mean-something-fputc-a...

    Me? I just dabble with documenting an unimplemented "50% more bits per byte than the competition!" 12-bit fantasy console of my own invention - replete with inventions such as "UTF-12" - for shits and giggles.

    • jfbastien 1 hour ago
      Yes, I'm trying to figure out which are still relevant and whether they target a modern C++, or intend to. I've been asking for a few years and haven't gotten positive answers. The only one that has been brought up is TI; I added info in the updated draft: https://isocpp.org/files/papers/D3477R1.html
    • jeffbee 4 hours ago
      They can just target C++23 or earlier, right? I have a small collection of SHARCs but I am not going to go crying to the committee if they make C++30 (or whatever) not support CHAR_BIT=32
    • PaulDavisThe1st 4 hours ago
      no doubt you've got your brainfuck compiler hard at work on this ...
      • defrost 2 hours ago
        TI DSP Assembler is pretty high level, it's "almost C" already.

        Writing geophysical | military signal and image processing applications on custom DSP clusters is surprisingly straightforward and doesn't need C++.

        It's a RISC architecture optimised for DSP | FFT | Array processing with the basic simplification that char text is for hosts, integers and floats are at least 32 bit and 32 bits (or 64) is the smallest addressable unit.

        Fantastic architecture to work with for numerics, deep computational pipelines; once "primed" you push in raw acquisition samples in chunks every clock cycle and extract processed moving window data chunks every clock cycle.

        A single ASM instruction in a cycle can accumulate totals from vector multiplication and modulo update indexes on three vectors (two inputs and one out).

        Not your mama's brainfuck.

  • harry8 4 hours ago
    Is C++ capable of deprecating or simplifying anything?

    Honest question, haven't followed closely. rand() is broken, I'm told unfixably so, and last I heard it still wasn't deprecated.

    Is this proposal a test? "Can we even drop support for a solution to a problem literally nobody has?"

    • epcoa 4 hours ago
      Signed integers did not have to be 2's complement, there were 3 valid representations: signed mag, 1s and 2s complement. Modern C and C++ dropped this and mandate 2s complement (“as if”, but that distinction is moot here; you can do the same for CHAR_BIT). So there is certainly precedent for this sort of thing.
    • jfbastien 1 hour ago
      As mentioned by others, we've dropped trigraphs and deprecated rand (and offer an alternative). I also have:

      * p2809 Trivial infinite loops are not Undefined Behavior

      * p1152 Deprecating volatile

      * p0907 Signed Integers are Two's Complement

      * p2723 Zero-initialize objects of automatic storage duration

      * p2186 Removing Garbage Collection Support

      So it is possible to change things!

    • Nevermark 2 hours ago
      I think you are right. Absolutely.

      Don’t break perfection!! Just accumulate more perfection.

      What we need is a new C++ symbol that reliably references eight bit bytes, without breaking compatibility, or wasting annnnnny opportunity to expand the kitchen sink once again.

      I propose “unsigned byte8” and (2’s complement) “signed byte8”. And “byte8” with undefined sign behavior because we can always use some more spice.

      “unsigned decimal byte8” and “signed decimal byte8”, would limit legal values to 0 to 10 and -10 to +10.

      For the damn accountants.

      “unsigned centimal byte8” and “signed centimal byte8”, would limit legal values to 0 to 100 and -100 to +100.

      For the damn accountants who care about the cost of bytes.

      Also for a statistically almost valid, good enough for your customer’s alpha, data type for “age” fields in databases.

      And “float byte8” obviously.

      • bastawhiz 59 minutes ago
        > For the damn accountants who care about the cost of bytes.

        Finally! A language that can calculate my S3 bill

    • hyperhello 3 hours ago
      C++ long ago crossed the line where making any change is more work than any benefit it could ever create.
    • mrpippy 3 hours ago
      C++17 removed trigraphs
      • poincaredisk 2 hours ago
        Which was quite controversial. Imagine that.
    • nialv7 4 hours ago
      well they managed to get two's complement requirement into C++20. there is always hope.
      • oefrha 3 hours ago
        Well then someone somewhere with some mainframe got so angry they decided to write a manifesto to condemn kids these days and announced a fork of Qt because Qt committed the cardinal sin of adopting C++20. So don’t say “a problem literally nobody has”, someone always has a use case; although at some point it’s okay to make a decision to ignore them.

        https://lscs-software.com/LsCs-Manifesto.html

        https://news.ycombinator.com/item?id=41614949

        Edit: Fixed typo pointed out by child.

        • ripe 2 hours ago
          > because Qt committed the carnal sin of adopting C++20

          I do believe you meant to write "cardinal sin," good sir. Unless Qt has not only become sentient but also corporeal when I wasn't looking and gotten close and personal with the C++ standard...

        • __turbobrew__ 1 hour ago
          This person is unhinged.

          > It's a desktop on a Linux distro meant to create devices to better/save lives.

          If you are creating life critical medical devices you should not be using linux.

        • epcoa 1 hour ago
          Wow.

          https://theminimumyouneedtoknow.com/

          https://lscs-software.com/LsCs-Roadmap.html

          "Many of us got our first exposure to Qt on OS/2 in or around 1987."

          Uh huh.

          > someone always has a use case;

          No he doesn't. He's just unhinged. The machines this dude bitches about don't even have a modern C++ compiler nor do they support any kind of display system relevant to Qt. They're never going to be a target for Qt. Further irony is this dude proudly proclaims this fork will support nothing but Wayland and Vulkan on Linux.

          "the smaller processors like those in sensors, are 1's complement for a reason."

          The "reason" is never explained.

          "Why? Because nothing is faster when it comes to straight addition and subtraction of financial values in scaled integers. (Possibly packed decimal too, but uncertain on that.)"

          Is this a justification for using Unisys mainframes, or is the implication that they are fastest because of 1's complement? (not that this is even close to being true - as any dinosaurs are decommissioned they're fucking replaced with capable but not TOL commodity Xeon CPU based hardware running emulation, I don't think Unisys makes any non x86 hardware anymore) Anyway, may need to refresh that CS education.

          There's some rambling about the justification being data conversion, but what serialization protocols mandate 1's complement anyway, and if those exist someone has already implemented 2's complement supporting libraries for the past 50 years since that has been the overwhelming status quo. We somehow manage to deal with endianness and decimal conversions as well.

          "Passing 2's complement data to backend systems or front end sensors expecting 1's complement causes catastrophes."

          99.999% of every system MIPS, ARM, x86, Power, etc for the last 40 years uses 2's complement, so this has been the normal state of the world since forever.

          Also the enterpriseist of languages, Java somehow has survived mandating 2's complement.

          This is all very unhinged.

          I'm not holding my breath to see this ancient Qt fork fully converted to "modified" Barr spec but that will be a hoot.

  • bcoates 45 minutes ago
    I have mixed feelings about this. On the one hand, it's obviously correct--there is no meaningful use for CHAR_BIT to be anything other than 8.

    On the other hand, it seems like some sort of concession to the idea that you are entitled to some sort of just world where things make sense and can be reasoned out given your own personal, deeply oversimplified model of what's going on inside the computer. This approach can take you pretty far, but it's a garden path that goes nowhere--eventually you must admit that you know nothing and the best you can do is a formal argument that conditional on the documentation being correct you have constructed a correct program.

    This is a huge intellectual leap, and in my personal experience the further you go without being forced to acknowledge it the harder it will be to make the jump.

    That said, there seems to be an increasing popularity of physical electronics projects among the novice set these days... hopefully read the damn spec sheet will become the new read the documentation

    • joelignaatius 18 minutes ago
      As with any highly used language you end up running into what I call the COBOL problem. It will work for the vast majority of cases except where there's a system that forces an update and all of a sudden a traffic control system doesn't work or a plane falls out of the sky.

      You'd have to have some way of testing all previous code in the compilation (pardon my ignorance if this is somehow obvious) to make sure this macro isn't already used. You also risk forking the language with any kind of breaking changes like this. How difficult it would be to test whether a previous code base uses the CHAR_BIT macro, and whether it can be updated to the new compiler, sounds non-obvious. What libraries would then be considered breaking? Would interacting with other compiled code (possibly stupid question) that used CHAR_BIT also cause problems? Just off the top of my head.

      I agree that it sounds nonintuitive. I'd suggest creating a conversion tool first and demonstrating it was safe to use even in extreme cases and then make the conversion. But that's just my unenlightened opinion.

  • jfbastien 1 hour ago
    Hi! Thanks for the interest on my proposal. I have an updated draft based on feedback I've received so far: https://isocpp.org/files/papers/D3477R1.html
  • TrueDuality 5 hours ago
    This is both uncontroversial and incredibly spicy. I love it.
  • boulos 1 hour ago
    The current proposal says:

    > A byte is 8 bits, which is at least large enough to contain the ordinary literal encoding of any element of the basic character set literal character set and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is bits in a byte.

    But instead of the "and is composed" ending, it feels like you'd change the intro to say that "A byte is 8 contiguous bits, which is".

    We can also remove the "at least", since that was there to imply a requirement on the number of bits being large enough for UTF-8.

    Personally, I'd make a "A byte is 8 contiguous bits." a standalone sentence. Then explain as follow up that "A byte is large enough to contain...".

  • kreco 4 hours ago
    I'm totally fine with enforcing that int8_t == char == 8-bits, however I'm not sure about spreading the misconception that a byte is 8-bits. A byte with 8-bits is called an octet.

    At the same time, a `byte` is already an "alias" for `char` since C++17 anyway[1].

    [1] https://en.cppreference.com/w/cpp/types/byte

    • spc476 51 minutes ago
      My first experience with computers was 45 years ago, and a "byte" back then was defined as an 8-bit quantity. And in the intervening 45 years, I've never come across a different meaning for "byte". I'll ask for a citation for a definition of "byte" that isn't 8-bits.
    • bobmcnamara 4 hours ago
      I, for one, hate that int8 == signed char.

      std::cout << (int8_t)32 << std::endl; //should print 32 dang it
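
      The usual workaround (a quick sketch) is to promote to int before streaming:

        #include <cstdint>
        #include <iostream>

        int main() {
            std::int8_t v = 32;
            std::cout << static_cast<int>(v) << '\n';  // prints 32
            std::cout << +v << '\n';                   // unary plus also promotes to int
        }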

      • kreco 4 hours ago
        Now you can also enjoy the fact that you can't even compile:

          std::cout << (std::byte)32 << std::endl;
        
        because there is no default operator<< defined.
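
        One spelling that does compile goes through std::to_integer, the C++17 accessor for std::byte:

          #include <cstddef>
          #include <iostream>

          int main() {
              std::byte b{32};
              std::cout << std::to_integer<int>(b) << '\n';  // prints 32
          }
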
        • PaulDavisThe1st 4 hours ago
          Very enjoyable. It will be a constant reminder that I need to decide how I want std::byte to print - character or integer ...
  • kazinator 1 hour ago
    There are DSP chips that have C compilers, and do not have 8 bit bytes; the smallest addressable unit is 16 bits (or larger).

    Less than a decade ago I worked with something like that: the TeakLite III DSP from CEVA.

  • bobmcnamara 4 hours ago
    I just put static_assert(CHAR_BIT == 8); in one place and move on. Haven't had it fire since it was the #if equivalent.
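
    Spelled out (same idea, using the macro from <climits>):

      #include <climits>

      static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");
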
  • JamesStuff 5 hours ago
    Not sure about that, seems pretty controversial to me. Are we forgetting about the UNIVACs?
    • trebligdivad 5 hours ago
      Hopefully we are; it's been a long time, but as I remember indexing in strings on them is a disaster.
      • Animats 4 hours ago
        They still exist. You can still run OS 2200 on a Clearpath Dorado.[1] Although it's actually Intel Xeon processors doing an emulation.

        Yes, indexing strings of 6-bit FIELDATA characters was a huge headache. UNIVAC had the unfortunate problem of having to settle on a character code in the early 1960s, before ASCII was standardized. At the time, a military 6-bit character set looked like the next big thing. It was better than IBM's code, which mapped to punch card holes and the letters weren't all in one block.

        [1] https://www.unisys.com/siteassets/collateral/info-sheets/inf...

    • dathinab 4 hours ago
      idk. By now most software already assumes 8 bits == 1 byte in subtle ways all over the place, to the point that you kinda have to use a fully custom, or at least fully self-reviewed and patched, stack of C libraries anyway

      so delegating such by-now-very-rare edge cases to non-standard C seems fine, i.e. it seems to IMHO not change much at all in practice

      and C/C++ compilers are full of non-standard extensions anyway, and it's not like CHAR_BIT goes away or that you couldn't, as a non-standard extension, assume it might not be 8

      • II2II 3 hours ago
        > most software already assumes 8 bit == byte in subtle ways all over the place

        Which is the real reason why 8-bits should be adopted as the standard byte size.

        I didn't even realize that the byte was defined as anything other than 8-bits until recently. I have known, for decades, that there were non-8-bit character encodings (including ASCII) and word sizes were all over the map (including some where word size % 8 != 0). Enough thought about that last point should have helped me realize that there were machines where the byte was not 8-bits, yet the rarity of encountering such systems left me with the incorrect notion that a byte was defined as 8-bits.

        Now if someone with enough background to figure it out doesn't figure it out, how can someone without that background figure it out? Someone who has only experienced systems with 8-bit bytes. Someone who has only read books that make the explicit assumption of 8-bit bytes (which virtually every book does). Anything they write has the potential of breaking on systems with a different byte size. The idea of writing portable code because the compiler itself is "standards compliant" breaks down. You probably should modify the standard to ensure the code remains portable by either forcing the compiler for non-8-bit systems to handle the exceptions, or simply admitting that the compiler does not produce portable code for non-8-bit systems.

    • omoikane 3 hours ago
      This would be a great setup for a time travelling science fiction where there is some legacy UNIVAC software that needs to be debugged, and John Titor, instead of looking for an IBM 5100, came back to the year 2024 to find a pre-P3477R0 compiler.
    • forrestthewoods 4 hours ago
      Do UNIVACs care about modern C++ compilers? Do modern C++ compilers care about UNIVACs?

      Given that Wikipedia says UNIVAC was discontinued in 1986 I’m pretty sure the answer is no and no!

      • skissane 3 hours ago
        The UNIVAC 1108 (and descendants) mainframe architecture was not discontinued in 1986. The company that owned it (Sperry) merged with Burroughs in that year to form Unisys. The platform still exists, but now runs as a software emulator under x86-64. The OS is still maintained and had a new release just last year. Around the time of the merger the old school name “UNIVAC” was retired in a rebranding, but the platform survived.

        Its OS, OS 2200, does have a C compiler. Not sure if there ever was a C++ compiler, if there once was it is no longer around. But that C compiler is not being kept up to date with the latest standards, it only officially supports C89/C90 - this is a deeply legacy system, most application software is written in COBOL and the OS itself is mainly written in assembler and a proprietary Pascal-like language called “PLUS”. They might add some features from newer standards if particularly valuable, but formal compliance with C99/C11/C17/C23/etc is not a goal.

        The OS does contain components written in C++, most notably the HotSpot JVM. However, from what I understand, the JVM actually runs in x86-64 Linux processes on the host system, outside of the emulated mainframe environment, but the mainframe emulator is integrated with those Linux processes so they can access mainframe files/data/apps.

  • vitiral 52 minutes ago
    I wish the types were all in bytes instead of bits too. u1 is unsigned 1 byte and u8 is 8 bytes.

    That's probably not going to fly anymore though

  • kazinator 2 hours ago
    What will be the benefit?

    - CHAR_BIT cannot go away; reams of code references it.

    - You still need the constant 8. It's better if it has a name.

    - Neither the C nor C++ standard will be simplified if CHAR_BIT is declared to be 8. Only a few passages will change. Just, certain possible implementations will be rendered nonconforming.

    - There are specialized platforms with C compilers, such as DSP chips, that are not byte addressable machines. They are in current use; they are not museum pieces.

  • donatj 4 hours ago
    So please do excuse my ignorance, but is there a "logic" related reason other than hardware cost limitations ala "8 was cheaper than 10 for the same number of memory addresses" that bytes are 8 bits instead of 10? Genuinely curious, as a high-level dev of twenty years, I don't know why 8 was selected.

    To my naive eye, It seems like moving to 10 bits per byte would be both logical and make learning the trade just a little bit easier?

    • morio 3 hours ago
      One example from the software side: a common thing to do in data processing is to obtain bit offsets (compression, video decoding etc.). If a byte were 10 bits you would need mod 10 operations everywhere, which is slow and/or complex. In contrast, mod 2^N is a single logic processor instruction.
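
      For instance (an illustrative sketch), splitting a global bit index into a (byte, bit-within-byte) pair is a shift and a mask when bytes are 8 bits, but a genuine divide and modulo if they were 10:

        #include <cstdint>

        // 8-bit bytes: compilers lower these to a shift and an AND.
        void split8(std::uint64_t bit_index, std::uint64_t& byte, unsigned& bit) {
            byte = bit_index >> 3;                        // divide by 8
            bit  = static_cast<unsigned>(bit_index & 7);  // modulo 8
        }

        // Hypothetical 10-bit bytes: a real division (or a multiply-by-reciprocal trick).
        void split10(std::uint64_t bit_index, std::uint64_t& byte, unsigned& bit) {
            byte = bit_index / 10;
            bit  = static_cast<unsigned>(bit_index % 10);
        }
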
    • dplavery92 4 hours ago
      Eight is a nice power of two.
      • donatj 4 hours ago
        Can you explain how that's helpful? I'm not being obtuse, I just don't follow
        • spongebobstoes 4 hours ago
          One thought is that it's always a whole number of bits (3) to bit-address within a byte. It's about 3.3 bits (log2 of 10) to bit-address a 10-bit byte. Sorta just works out nicer in general to have powers of 2 when working in base 2.
          • cogman10 2 hours ago
            This is basically the reason.

            Another part of it is the fact that it's a lot easier to represent stuff with hex if the bytes line up.

            I can represent "255" with "0xFF" which fits nice and neat in 1 byte. However, if a byte were 10 bits that hex no longer really works. You have 1024 values to represent. The max value would be 0x3FF, which just looks funky.

            Coming up with an alphanumeric system to represent 2^10 cleanly just ends up weird and unintuitive.

            • Spivak 58 minutes ago
              We probably wouldn't have chosen hex in a theoretical world where bytes were 10 bits, right? It would probably be two groups of 5 like 02:21 == 85 (like an ip address) or five groups of two 0x01111 == 85. It just has to be one of its divisors.
        • davemp 3 hours ago
          Many circuits have ceil(log2(N_bits)) scaling with respect to propagation delay and other dimensions, so you're just leaving efficiency on the table if you aren't using a power of 2 for your bit size.
        • bonzini 4 hours ago
          It's easier to go from a bit number to (byte, bit) if you don't have to divide by 10.
        • inkyoto 3 hours ago
          Because modern computing has settled on the Boolean (binary) logic (0/1 or true/false) in the chip design, which has given us 8 bit bytes (a power of two). It is the easiest and most reliable to design and implement in the hardware.

          On the other hand, if computing settled on a three-valued logic (e.g. 0/1/«something» where «something» has been proposed as -1, «undefined»/«unknown»/«undecided» or a «shade of grey»), we would have had 9 bit bytes (a power of three).

          10 was tried numerous times at the dawn of computing and… it was found too unwieldy in the circuit design.

          • davemp 3 hours ago
            > On the other hand, if computing settled on a three-valued logic (e.g. 0/1/«something» where «something» has been proposed as -1, «undefined»/«unknown/undecided» or a «shade of grey»), we would have had 9 bit bytes (a power of three).

            Is this true? 4 ternary bits give you really convenient base 12 which has a lot of desirable properties for things like multiplication and fixed point. Though I have no idea what ternary building blocks would look like so it’s hard to visualize potential hardware.

            • inkyoto 2 hours ago
              It is hard to say whether it would have been 9 or 12, now that people have stopped experimenting with alternative hardware designs. 9-bit byte designs certainly did exist (and maybe even the 12-bit designs), too, although they were still based on the Boolean logic.

              I have certainly heard an argument that ternary logic would have been a better choice, if it won over, but it is history now, and we are left with the vestiges of the ternary logic in SQL (NULL values which are semantically «no value» / «undefined» values).

    • bryanlarsen 4 hours ago
      I'm fairly sure it's because the English character set fits nicely into a byte. 7 bits would have worked as well, but 7 is a very odd width for something in a binary computer.
    • zamadatix 4 hours ago
      If you're ignoring what's efficient to use then just use a decimal data type and let the hardware figure out how to calculate that for you best. If what's efficient matters then address management, hardware operation implementations, and data packing are all simplest when the group size is a power of the base.
    • knome 3 hours ago
      likely mostly as a concession to ASCII in the end. you used a typewriter to write into and receive terminal output from machines back in the day. terminals would use ASCII. there were machines with all sorts of smallest-addressable-sizes, but eight bit bytes align nicely with ASCII. makes strings easier. making strings easier makes programming easier. easier programming makes a machine more popular. once machines started standardizing on eight bit bytes, others followed. when they went to add more data, they kept the byte since code was written for bytes, and made their new registers two bytes. then two of those. then two of those. so we're sitting at 64 bit registers on the backs of all that came before.
    • wvenable 4 hours ago
      I'm not sure why you think being able to store values from -512 to +511 is more logical than -128 to +127?
      • donatj 4 hours ago
        Buckets of 10 seem more regular to beings with 10 fingers that can be up or down?
        • wvenable 4 hours ago
          I think 8 bits (really 7 bits) was chosen because it holds a value closest to +/- 100. What is regular just depends on how you look at it.
  • bawolff 2 hours ago
    > We can find vestigial support, for example GCC dropped dsp16xx in 2004, and 1750a in 2002.

    Honestly kind of surprised it was relevant as late as 2004. I thought the era of non-8-bit bytes was like the 1970s or earlier.

  • pabs3 3 hours ago
    Hmm, I wonder if any modern languages can work on computers that use trits instead of bits.

    https://en.wikipedia.org/wiki/Ternary_computer

    • cobbal 2 hours ago
      Possible, but likely slow. There's nothing in the "C abstract machine" that mandates specific hardware. But, the bitshift is only a fast operation when you have bits. Similarly with bitwise boolean operations.
    • cogman10 2 hours ago
      It'd just be a translation/compiler problem. Most languages don't really have a "bit", instead it's usually a byte with the upper bits ignored.
  • throwaway889900 5 hours ago
    But how many bytes are there in a word?
    • o11c 5 hours ago
      If you're on x86, the answer can be simultaneously 16, 32, and 64.
      • EasyMark 1 hour ago
        Don’t you mean 2,4, and 8?
    • wvenable 4 hours ago
      "Word" is an outdated concept we should try to get rid of.
      • anigbrowl 4 hours ago
        You're right. To be consistent with bytes we should call it a snack.
        • SCUSKU 4 hours ago
          Henceforth, it follows that a doublesnack is called a lunch. And a quadruplesnack a fourthmeal.
          • tetron 4 hours ago
            There's only one right answer:

            Nybble - 4 bits

            Byte - 8 bits

            Snyack - 16 bits

            Lyunch - 32 bits

            Dynner - 64 bits

            • kstrauser 1 hour ago
              In the spirit of redefining the kilobyte, we should define byte as having a nice, metric 10 bits. An 8 bit thing is obviously a bibyte. Then power of 2 multiples of them can include kibibibytes, mebibibytes, gibibibytes, and so on for clarity.
            • cozzyd 2 hours ago
              And what about elevensies?

              (Ok,. I guess there's a difference between bits and hob-bits)

          • iwaztomack 4 hours ago
            or an f-word
      • kevin_thibedeau 1 hour ago
        Appeasing that attitude is what prevented Microsoft from migrating to LP64. Would have been an easier task if their 32-bit LONG type never existed, they stuck with DWORD, and told the RISC platforms to live with it.
      • pclmulqdq 4 hours ago
        It's very useful on hardware that is not an x86 CPU.
        • wvenable 4 hours ago
          As an abstraction on the size of a CPU register, it really turned out to be more confusing than useful.
          • pclmulqdq 4 hours ago
            On RISC machines, it can be very useful to have the concept of "words," because that indicates things about how the computer loads and stores data, as well as the native instruction size. In DSPs and custom hardware, it can indicate the only available datatype.

            The land of x86 goes to great pains to eliminate the concept of a word at a silicon cost.

          • o11c 4 hours ago
            Fortunately we have `register_t` these days.
          • bobmcnamara 4 hours ago
            Is it 32 or 64 bits on ARM64? Why not both?
        • iwaztomack 4 hours ago
          such as...?
      • BlueTemplar 4 hours ago
        How exactly ? How else do you suggest CPUs do addressing ?

        Or are you suggesting to increase the size of a byte until it's the same size as a word, and merge both concepts ?

        • wvenable 4 hours ago
          I'm saying the term "word", abstracting the number of bytes a CPU can process in a single operation, is an outdated concept. We don't really talk about word-sized values anymore. Instead we're mostly explicit about the size of a value in bits. Even the idea of a CPU having just one relevant word size is a bit outdated.
    • elteto 3 hours ago
      There are 4 bytes in word:

        const char word[] = {'w', 'o', 'r', 'd'};
        assert(sizeof word == 4);
    • Taniwha 3 hours ago
      I've seen 6 8-bit characters/word (Burroughs large systems, they also support 8 6-bit characters/word)
  • aj7 4 hours ago
    And then we lose communication with Europa Clipper.
  • IAmLiterallyAB 2 hours ago
    I like the diversity of hardware and strange machines. So this saddens me. But I'm in the minority I think.
  • masfuerte 4 hours ago
    This is entertaining and probably a good idea but the justification is very abstract.

    Specifically, has there ever been a C++ compiler on a system where bytes weren't 8 bits? If so, when was it last updated?

    • bryanlarsen 4 hours ago
      There were/are C++ compilers for PDP-10 (9 bit byte). Those haven't been maintained AFAICT, but there are C++ compilers for various DSP's where the smallest unit of access is 16 or 32 bits that are still being sold.
    • userbinator 4 hours ago
      I know some DSPs have 24-bit "bytes", and there are C compilers available for them.
  • DowsingSpoon 4 hours ago
    As a person who designed and built a hobby CPU with a sixteen-bit byte, I’m not sure how I feel about this proposal.
  • whatsakandr 3 hours ago
    Honestly, I thought this might be an Onion headline. But then I stopped to think about it.
  • lowbloodsugar 1 hour ago

      #define SCHAR_MIN -127
      #define SCHAR_MAX 128
    
    Is this two typos or am I missing the joke?
  • gafferongames 4 hours ago
    Amazing stuff guys. Bravo.
  • starik36 4 hours ago
    There are FOUR bits.

    Jean-Luc Picard

  • Quekid5 5 hours ago
    JF Bastien is a legend for this, haha.

    I would be amazed if there's any even remotely relevant code that deals meaningfully with CHAR_BIT != 8 these days.

    (... and yes, it's about time.)

    • Animats 4 hours ago
      Here's a bit of 40 year old code I wrote which originally ran on 36-bit PDP-10 machines, but will work on non-36 bit machines.[1] It's a self-contained piece of code to check passwords for being obvious. This will detect any word in the UNIX dictionary, and most English words, using something that's vaguely like a Bloom filter.

      This is so old it predates ANSI C; it's in K&R C. It used to show up on various academic sites over the years, but it now seems to have finally scrolled off Google.

      I think we can dispense with non 8-bit bytes at this point.

      [1] https://animats.com/source/obvious/obvious.c

    • shawn_w 4 hours ago
      DSP chips are a common exception that people bring up. I think some TI made ones have 64 bit chars.

      Edit: I see TFA mentions them but questions how relevant C++ is in that sort of embedded environment.

      • Quekid5 4 hours ago
        Yes, but you're already in specialized territory if you're using that
    • nullc 4 hours ago
      The tms320c28x DSPs have 16 bit char, so e.g. the Opus audio codec codebase works with 16-bit char (or at least it did at one point -- I wouldn't be shocked if it broke from time to time, since I don't think anyone runs regression tests on such a platform).

      For some DSP-ish sort of processors I think it doesn't make sense to have addressability at char level, and the gates to support it would be better spent on better 16 and 32 bit multipliers. ::shrugs::

      I feel kind of ambivalent about the standards proposal. We already have fixed size types. If you want/need an exact type, that already exists. The non-fixed size types set minimums and allow platforms to set larger sizes for performance reasons.

      Having no fast 8-bit level access is a perfectly reasonable decision for a small DSP.

      Might it be better instead to migrate many users of char to (u)int8_t?

      The proposed alternative of CHAR_BIT congruent to 0 mod 8 also sounds pretty reasonable, in that it captures the existing non-8-bit char platforms and also the justification for non-8-bit char platforms (that if you're not doing much string processing but instead doing all math processing, the additional hardware for efficient 8 bit access is a total waste).

      • jfbastien 1 hour ago
        I added a mention of TI's hardware in my latest draft: https://isocpp.org/files/papers/D3477R1.html
      • dathinab 4 hours ago
        I think it's fine to relegate non-8-bit chars to non-standard C given that a lot of software already implicitly assumes 8-bit bytes anyway. Non-standard extensions for certain use-cases aren't anything new for C compilers. Also, it's a C++ proposal; I'm not sure if you program DSPs with C++ :think:
  • MrLeap 2 hours ago
    How many bytes is a devour?
  • hexo 4 hours ago
    Why? Pls no. We've been told (in school!) that a byte is a byte. It's only sometimes 8 bits long (ok, most of the time these days). Do not destroy the last bits of fun. Is network order little endian too?
    • thfuran 2 hours ago
      Heretic, do not defile the last remnants of true order!
    • bbkane 3 hours ago
      I think there's plenty of fun left in the standard if they remove this :)
  • scosman 4 hours ago
    Bold leadership
  • adamnemecek 4 hours ago
    Incredible things are happening in the C++ community.
  • cyberax 4 hours ago
    But think of ternary computers!
    • dathinab 4 hours ago
      Doesn't matter ternary computers just have ternary bits, 8 of them ;)
      • mathgenius 4 hours ago
        Ternary computers have 8 tits to a byte.
        • tbrownaw 3 hours ago
          Should be either 9 or 27 I'd think.
          • epcoa 3 hours ago
            Why can’t it be 8? The fact that it’s a trit doesn’t put any constraint on the byte (tryte?) size. You could actually make it 5 or 6 trits (~9.5 bits) for similar information density. The Setun used 6-trit addressable units.
      • AStonesThrow 4 hours ago
        Supposedly, "bit" is short for "binary digit", so we'd need a separate term for "ternary digit", but I don't wanna go there.
        • epcoa 4 hours ago
          The prefix is tri-, not ti- so I don’t think there was any concern of going anywhere.

          It’s tricycle and tripod, not ticycle.

        • bryanlarsen 4 hours ago
          The standard term is "trit" because they didn't want to go there.
  • bmitc 4 hours ago
    Ignoring this C++ proposal, especially because C and C++ seem like a complete nightmare when it comes to this stuff, I've almost gotten into the habit of treating a "byte" as an abstract concept. Many serial protocols will often define a "byte", and it might be 7, 8, 9, 11, 12, or whatever bits long.
  • AlienRobot 4 hours ago
    I wish I knew what a 9 bit byte means.

    One fun fact I found the other day: ASCII is 7 bits, but when it was used with punch cards there was an 8th bit to make sure you didn't punch the wrong number of holes. https://rabbit.eng.miami.edu/info/ascii.html

    • Animats 4 hours ago
      A 9-bit byte is found on 36-bit machines in quarter-word mode.

      Parity is for paper tape, not punched cards. Paper tape parity was never standardized. Nor was parity for 8-bit ASCII communications. Which is why there were devices with settings for EVEN, ODD, ZERO, and ONE for the 8th bit.

      Punched cards have their very own encodings, only of historical interest.

      • AlienRobot 3 hours ago
        >A 9-bit byte is found on 36-bit machines in quarter-word mode.

        I've only programmed in high level programming languages in 8-bit-byte machines. I can't understand what you mean by this sentence.

        So in a 36-bit CPU a word is 36 bits. And a byte isn't a word. But what is a word and how does it differ from a byte?

        If you asked me what 32-bit/64-bit means in a CPU, I'd say it's how large memory addresses can be. Is that true for 36-bit CPUs or does it mean something else? If it's something else, then that means 64-bit isn't the "word" of a 64-bit CPU, so what would the word be?

        This is all very confusing.

  • CephalopodMD 4 hours ago
    Obviously
  • 38 4 hours ago
    the fact that this isn't already done after all these years is one of the reasons why I no longer use C/C++. it takes years and years to get anything done, even the tiniest, most obvious drama free changes. contrast with Go, which has had this since version 1, in 2012:

    https://pkg.go.dev/builtin@go1#byte

    • AlexandrB 1 hour ago
      Don't worry, 20 years from now Go will also be struggling to change assumptions baked into the language in 2012.
  • Iwan-Zotow 4 hours ago
    In a char, not in a byte. Byte != char
    • AStonesThrow 4 hours ago
      A common programming error in C is reading input as char rather than int.

      https://man7.org/linux/man-pages/man3/fgetc.3.html

      fgetc(3) and its companions always return character-by-character input as an int, and the reason is that EOF is represented as -1. An unsigned char is unable to represent EOF. If you're using the wrong return value, you'll never detect this condition.

      However, if you don't receive an EOF, then it should be perfectly fine to cast the value to unsigned char without loss of precision.
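
      The classic correct shape of that loop (a minimal sketch) keeps the value in an int until EOF has been ruled out:

        #include <cstdio>

        int main() {
            int c;  // int, not char, so EOF (-1) stays distinguishable from real bytes
            while ((c = std::fgetc(stdin)) != EOF) {
                unsigned char byte = static_cast<unsigned char>(c);  // safe once we know it is not EOF
                std::putchar(byte);
            }
        }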

  • electricdreams 5 hours ago
    [dead]