Is OOXML Artificially Complex?

(hsu.cy)

87 points | by firexcy 3 days ago

29 comments

  • s20n 3 hours ago
    > Why Microsoft’s Motive Wasn’t Deliberate Sabotage

    I absolutely do not agree.

    Not only is the standard overly complex, Microsoft also indulged in all sorts of unscrupulous activities to corrupt various National Standards Organisations to get it approved through the ISO <https://en.wikipedia.org/wiki/Standardization_of_Office_Open...>, which is clear evidence of malicious intent.

    This is a quote from Richard Stallman:

    > The specifications document was so long that it would be difficult for anyone else to implement it properly. When the proposed standard was submitted through the usual track, experienced evaluators rejected it for many good reasons. Microsoft responded using a special override procedure in which its money bought the support of many of the voting countries, thus bypassing proper evaluation and demonstrating that ISO can be bought.

    • gregopet 3 hours ago
      My wife worked in one of the national standardization organizations. She was urgently called into her boss' office: "Please be on this meeting with me, I think they will try to bribe me if I'm alone". It only happened once while my wife worked there and it was right before the vote where Microsoft tried to fast track their office format.
    • monocasa 3 hours ago
      Specifically what I heard on the grapevine was that Microsoft sponsored a collection of small island nations into the ISO process, in exchange for their vote on OOXML.
    • CorrectHorseBat 2 hours ago
      Both can be true at once.

      They didn't want a standard other people could adapt easily nor do the work to make Word adhere to one and it had to happen fast. By doing it the way they did they got everything they wanted and only needed to buy ISO.

      • fsflover 37 minutes ago
        Sounds exactly like deliberate sabotage to me.
    • quotemstr 3 hours ago
      Some myths just won't die.

      OOXML is complex because it has to be. It has to losslessly round trip through an open format every single feature of Office. That's a lot of features.

      Yes, it's complex. Should Microsoft have cut features of Office just to make OOXML simpler? That's ridiculous. What about users who relied on those cut features?

      It was fair to ask Microsoft to open the file format. It wasn't fair to expect them to cut features and compatibility. The complaints about complexity from RMS and others represent outsiders seeing the sausage factory and realizing that the sausage making is complicated and needs a lot of moving parts. Maybe life wasn't as simple as the Slashdot "Micro$oft" narrative would suggest. Maybe the complexity of the product was downstream of the shit ton of complexity and sweat and thought that had gone into it.

      But admitting that would have been hard. Easier to come up with conspiracy theories.

      • MattPalmer1086 7 minutes ago
        But they did define two variants to get their standard approved in the fast track process.

        The Transitional variant which is entirely backwards compatible is not fully defined in a way that others can implement without reverse engineering how Microsoft Office does things.

        The Strict variant isn't totally compatible with all older binary formats but is fully defined.

        Guess which one is the standard file format?

      • clort 1 hour ago
        You are wrong. Microsoft was not asked to open the file format. There was an open file format already accepted as an ISO standard, so now they needed to make their product compliant with an ISO standard because companies around the world were going to prioritise that in their purchases. They did everything they could to ensure that their format was both an ISO standard, and impossible for somebody else to implement.
        • hdjrudni 52 minutes ago
          From the article,

          > First, OOXML was, in material part, a defensive posture under intensifying antitrust and “open standards” pressure. Microsoft announced OOXML in late 2005 while appealing an adverse European Commission judgment centered on interoperability disclosures. Thus, it was only a matter of time before Office file compatibility came under the regulatory microscope. (The Commission indeed opened a probe in 2008.)

          > Meanwhile, the rival ODF matured and became an ISO standard in May 2006. Governments, especially in Europe, began to mandate open standards in public procurement. If Microsoft did nothing, Office risked exclusion from government deals.

          So... maybe they weren't directly asked to open their file format, but what then? Adopt ODF which is surely incompatible with their feature set, and... just corrupt every .doc file when converting into the new format? And also have to reimplement all their apps?

      • user3939382 2 hours ago
        So you put extensions in the spec; you don’t make it impossible for anyone else to implement. They knew open source suites were competing with them, and they did it on purpose.
        • quotemstr 2 hours ago
          > So you put extensions in the spec

          ... which are either public, in which case people complain that the spec+extensions is too long instead of that the spec is too long, or

          ... which aren't public, in which case people complain that there's no interoperability.

          You can't win.

          > impossible for anyone else to implement

          Except for all the people who did implement it?

          • fsflover 35 minutes ago
            > Except for all the people who did implement it?

            It was never fully implemented. LibreOffice has been trying since then and there are always problems.

      • dullcrisp 3 hours ago
        The…sausage has a lot of moving parts?
      • troupo 1 hour ago
        > OOXML is complex because it has to be.

        What it didn't have to be is sections upon sections of "this behaviour is as seen in Word 95", "this behaviour is as seen in Word 97" without any further specification or context.

        The main struggle for independent implementors was reverse engineering all the implicit and explicit assumptions and inner workings of MS Office software.

        > But admitting that would have been hard. Easier to come up with conspiracy theories.

        I actually read through a lot of that spec at the time. A lot of it was just lip service to open standards at a time when MS was under a lot of regulatory pressure.

        • qcnguy 1 hour ago
          That stuff happens because Microsoft don't know what the behavior is. It's just a bit which forks Word down some ancient code path that nobody understands and isn't properly documented. Given the huge effort that would have gone into producing this thousand plus page specification, it's understandable why the spec writers would have given up at times.

          I expect most people posting on Hacker News would not be able to write a satisfactory specification for their own software if they are working on a large legacy code base.

  • Lammy 8 hours ago
    I love this screen that shows you exactly why they named it “Office Open” XML: https://i.imgur.com/hnj3sdv.png

    It was a pretty big deal when OpenOffice.org's 2.0 release came with OpenDocument as the default file format. Very easy for someone to misread this MSOffice screen and click on OOXML expecting it to mean OO.o.

    • zamadatix 3 hours ago
      Oh wow. I must have clicked through that page dozens of times, selecting "Keep Current" after a quick scan and thinking the 2nd option was talking about Open Office.
  • nneonneo 3 hours ago
    Microsoft seems to have known that they could ram basically anything through a standards body, so they presumably didn't bother to actually try and simplify the standard. Instead, it's basically an XML serialization of their older binary formats, complete with all of the quirks and bugs that have to be emulated for 100% compatibility.

    To be fair, we're talking about a product line with over 35 years of history here. Cruft in the format builds up but can never be removed, so long as you commit to strong backwards compatibility - which Microsoft has always done.

    Fun trivia: many of the old binary formats use a meta-format called OLE2 (Object Linking and Embedding). The file format is a FAT12 filesystem packed into a single file, with a FAT filesystem chain, file blocks aligned to a specific power-of-two size, etc. This made saving files very fast, but raised the possibility of internal fragmentation (where individual sub-files are scattered over many non-contiguous blocks); hence, users were recommended to "Save As..." periodically for large/complex files to optimize the internal storage.
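
    A minimal sketch of what that container looks like at the byte level, assuming the header layout as I recall it from the published MS-CFB spec: an 8-byte signature, a 16-byte CLSID, then version, byte-order and sector-shift fields starting at offset 24. `parse_cfb_header` is a made-up helper name and the demo header is synthetic, so treat this as an illustration rather than a reader for real files.

```python
import struct

# Signature bytes at the start of every compound file.
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

def parse_cfb_header(header: bytes) -> dict:
    """Read a few fixed header fields (hypothetical helper, layout assumed)."""
    if len(header) < 34 or header[:8] != CFB_MAGIC:
        raise ValueError("not a compound file")
    # Offsets 24..33: minor version, major version, byte order,
    # sector shift, mini sector shift (little-endian uint16 each).
    _minor, major, _order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", header, 24
    )
    return {
        "major_version": major,
        # Sizes are stored as powers of two, hence the shifts.
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
    }

# Synthetic header for demonstration: version 3 with 512-byte sectors.
demo = CFB_MAGIC + bytes(16) + struct.pack("<5H", 0x003E, 3, 0xFFFE, 9, 6)
info = parse_cfb_header(demo)
```

    The power-of-two sector size is what makes the FAT-style chaining cheap: a sector number turns into a file offset with a shift and an add.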

    • rtpg 47 minutes ago
      "You have to standardize the format"

      "OK we will standardize our serialization format"

      It's... I guess malicious compliance, though also if you don't care about interop you're not going to try to abstract away your internal application structures, are you!

      I appreciate the standard existing rather than it not existing. Trying to have the standard exist in this way has always felt like an uphill battle, and at least now there's _something_.

      Just you will have a better time if you emulate how Office does things. But you have a bit more documentation to go along with it.

    • flomo 2 hours ago
      Officially now MS-CFB (i think). OLE2 generally refers to a predecessor to COM, and not just the file format.

      https://learn.microsoft.com/en-us/openspecs/windows_protocol...

  • CobrastanJorji 1 hour ago
    The OOXML fight is near and dear to my heart because, when it happened, I was a baby developer, and I cared about the issue for some reason I can barely recall, and I found an expert on the issue on Twitter. That guy would regularly tweet about everything that was going on and the problems with the spec and the shenanigans, and I was one of the, like, 20 people who was hanging on his every word. And sometimes he'd talk about bee keeping instead. It was my first introduction to Twitter at its best. You got these unfiltered whole views of the lives and concerns of real people who were, in part, experts at what you cared about. So sometimes you had to listen to them talk about other random stuff they thought was neat. And that's great!
  • lorenzohess 8 hours ago
    > In my view, OOXML is indeed complex, convoluted, and obscure. But that’s likely less about a plot to block third-party compatibility and more about a self-interested negligence: Microsoft prioritized the convenience of its own implementation and neglected the qualities of clarity, simplicity, and universality that a general-purpose standard should have.

    The author only provides arguments for "self-interested negligence". He provides no counterarguments to the claim that OOXML complexity was "a plot to block third-party compatibility". Therefore, he cannot compare "negligence" and "a plot". Therefore, his claim that "negligence" is a better explanation for OOXML complexity than "a plot" cannot follow.

    To restate:

    > If we dig into the context of OOXML’s creation, it can be argued that harming competitors was not Microsoft’s primary aim.

    The author provides no evidence to support this claim. At most, the evidence provided in this section supports the claim that "negligence" played a role in OOXML complexity. From this evidence alone, no conclusions can be drawn about the "primariness" of "negligence" vs "harming competitors".

    • unscaled 4 hours ago
      Unless we ever get the full archive of Microsoft emails, meeting minutes and recordings from all the secret microphones they didn't have in their meeting rooms, I don't think you can ever disprove this claim. It's generally impossible to conclusively disprove conspiracy theories, because you could always claim you're only showing there are no documents proving the conspiracy, but there are no documents disproving it.

      The author is just implicitly appealing to Occam's razor here, as people often do in the face of accusations of a plot. They can show that Microsoft has backed the ANSI accreditation of ODF[1] and eventually implemented support for ODF import and export in Office, but that's not enough to prove there was no conspiracy.

      Instead, the article just provides a very plausible explanation for the complexity in OOXML. Does this explanation thoroughly disprove the accusations of a plot? Clearly not. Is it more plausible than a great plot to crush a bunch of competitors that had no market share and kill a better standard document format that Microsoft did end up implementing in Office? Yes. This is probably as far as we can get.

      [1] https://news.microsoft.com/source/2007/05/16/microsoft-votes...

      • airstrike 3 hours ago
        Both things can be true. It had a genuine purpose, but the fact that Microsoft will go out of its way to not implement anything better and less temperamental is an indication it's not really open. There's plenty of evidence of Microsoft dragging their feet at playing nice with the rest of the office ecosystem.

        I'm not saying they shouldn't do that as a company maximizing shareholder value. But we should all collectively groan every time the topic comes up, not applaud them.

    • to11mtm 8 hours ago
      I mean sometimes you gotta ship a product (and remember back then, that meant masters for CDs,) and it's perfectly possible that whatever team was in charge of handling 'conversion' stuff for old format (remember that old excel formats have OLE type cruft going on, the sorts of things that led to VBA viruses, imagine what other functionality needs to be implemented) just plain had to take shortcuts in uglifying the spec to support all the jank.
  • tannhaeuser 8 hours ago
    Worth keeping in mind that the native MSO formats were using "structured storage", a horrible binary chunked serialization and metadata format from an era where binary embedding of document streams in other application documents via "Object linking and embedding" (OLE, see also Apple's OpenDoc format) was deemed desirable, with zero consideration given to third-party apps and segment formats tied to C++ data structures. Compared to that, OOXML is still a huge progress, and while it's complex I wouldn't say it's maliciously so.

    The Shakespeare example is a good one where the sentence is split into multiple spans to apply style rules yet the bare text content could be extracted by just removing all XML tags. Whereas the ODF variant is actually less recommendable as it relies on an unnecessarily complex formatting and text addressing language on top of XML.

    The article says

    > Even at a glance [ODF's markup] is more intelligible. Strip the text: namespaces and it’s nearly valid HTML. The only thing that needs explaining is that ODF doesn’t wrap To be with a dedicated “bold” tag. Instead, it applies an auto-style named T1 to a <text:span>, an act of separating content and presentation that mirrors established web practices.

    but this definitely makes things more complex for data exchange compared to OOXML.

    • quotemstr 3 hours ago
      Can you explain what's wrong with the concept of a container format that allows embedding subdocuments of different types?

      > zero consideration given to third-party apps and segment formats

      The reality is the opposite. COM serialization was specifically built to allow for composing components (and serializations thereof) that didn't know about each other into a single document. That's why it leans so heavily on GUIDs for names: they avoid collisions without needing coordination. That's a laudable goal, not pointless bloat. And the COM people implemented it pretty efficiently too!

      > C++ data structures

      What gives you that idea? Yes, the OLE stream thing was a binary format, but so is DER for ASN.1. Every webpage you load goes over a binary tagged object format not too different from OLE/COM's.

      But due to a persistence of myths from the 90s, people still think of the Office binary format as "horrible" when it's actually quite elegant, especially considering the problems the authors had to solve and their constraints in doing so.

      In many ways, we've regressed.

      > Markup

      The author of the article nails it when he says ODF is meant to be a markup language and OOXML is the serialization of an object graph. So what? Do people write ODF by hand? There are countless JSON formats just as inscrutable as MSO's legacy streams.

      Anyway, the idea that the MSO binary format was crap because it was binary, lazy, and represented a "memory dump" is an old myth that just won't die. It wasn't a memory dump, it wasn't lazy, and it wasn't crap. Yes, there are real problems with some of the things people put inside the OLE container, but it's facile and wrong to blame the container or the OLE stream composition model for the problem.

  • Mikhail_Edoshin 2 hours ago
    I remember Spreadsheet ML, an older format compatible with Excel. It had a subset of features, I think, but it was a rather powerful subset: formatting, formulae, multiple sheets. And it was rather simple. (Had a silly design mistake: for some reason MS gave namespace to attributes, which is not necessary, only for rather specific purposes).

    Another XML standard from MS that also seems relatively simple is XPS, a PDF alternative. But it uses Open Packaging and that is somewhat hard to read.

  • themerone 6 hours ago
    It's as complex as it needs to be to losslessly convert old binary office files.

    A better format would have made us geeks a lot happier, but the average user just wants things to work the way they always have.

    • Gigachad 6 hours ago
      My possibly incomplete understanding was that the original office file format was basically just raw dumps of the internal C data structures. Not designed or specified in any way.

      The XML version likely carries a lot of baggage having to be compatible with that.

      • lmkg 5 hours ago
        They weren't "just" raw dumps of internal C structures. It takes careful design work to dump raw memory in a usable fashion. Consider: You can't just write a pointer to disk and then read it back next week.

        Binary MS Office format is a phenomenal piece of engineering to achieve a goal that's no longer relevant: fast save/load on late-80's hard drives. Other programs took minutes to save a spreadsheet, Excel took seconds. It did this by making sure its in-memory data structures for a document could be dumped straight to disk without transformation.

        But yes, this approach carries a shitton of baggage. And that achievement is no longer relevant in a world where consumer hardware can parse XML documents on the fly.

        I have heard it argued, though, that the "baggage" isn't the file format. It's actually the full historical featureset of Excel. Being backwards-compatible means being able to faithfully represent the features of old Excel, and the essential complexity of that far outweighs the incidental complexity of how those features were encoded.
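
        To make the pointer point concrete, here is a toy sketch of the general technique: links between records are stored as file offsets instead of memory addresses, so the bytes mean the same thing when read back later. The record layout, `dump`, and `load` are invented for illustration and have nothing to do with the actual Office binary formats.

```python
import struct

# One record: a signed 32-bit value plus the byte offset of the next record.
RECORD = struct.Struct("<iI")
END = 0xFFFFFFFF  # sentinel offset marking the end of the chain

def dump(values):
    """Serialize a chain of values; links are offsets, not addresses."""
    out = bytearray()
    for i, value in enumerate(values):
        is_last = i == len(values) - 1
        next_offset = END if is_last else (i + 1) * RECORD.size
        out += RECORD.pack(value, next_offset)
    return bytes(out)

def load(blob):
    """Follow the offset chain starting at byte 0."""
    values = []
    offset = 0 if blob else END
    while offset != END:
        value, offset = RECORD.unpack_from(blob, offset)
        values.append(value)
    return values

roundtrip = load(dump([3, 1, 4]))
```

        A real format in this style would also have to pin down field widths, byte order, and alignment, which is part of the design work mentioned above.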

  • charlieyu1 9 hours ago
    I once dug through the 5000 page specification. There was a lot of useless stuff that only old Microsoft Word supported like WordArt items.
    • bawolff 4 hours ago
      Does office no longer support word art?

      When I was a kid, making cool wordart headers for school projects was like 50% of what we used office for.

  • PaulHoule 8 hours ago
    People who were developing "office" programs in the early 1990s were thinking about the problem of serializing arbitrary object graphs into documents to support technologies like

    https://en.wikipedia.org/wiki/Object_Linking_and_Embedding

    where you could embed an Excel spreadsheet inside a Word document, or actually embed any of a large range of COM objects into a Word document. On one hand that is a really appealing vision, but on the other hand it means you have to have, and be able to run, all the binaries for all the objects that live in a document, which ties the whole thing to Windows.

    PDF is a different sort of document format which privileges viewing over editing but it is also really about serializing an object graph when it comes down to it and then having various sorts of filters and transformations and a range of objects defined in the spec as opposed to open ended access to an object library.

    This kind of system has a lot of overlap with the serdes problem you get with RPC frameworks that used to be under the files "Sun RPC sucks", "DCOM Sucks", "CORBA Sucks" and "WS-* Sucks". Those things are mostly forgotten these days because well... they sucked, and now the usual complaint is "protobuf sucks" but you rarely hear "JSON sucks" because it gave up on graphs for trees, if you don't have a type system people can't say the type system sucks, and the only thing that really sucks about it is that people won't just use ISO 8601 dates but you can always rise above that by just using ISO 8601 dates without asking for permission. But we all agree YAML sucks.

    That points to any flexible document format sucking, but OOXML also sucks because it has lots of poorly specified and obscure features that amount to "format this the same way Word 95 formatted it if you used a certain obscure option".

    From a glass is half empty perspective it sucks because it's close to impossible to make a Microsoft Office replacement that renders 100% of documents 100% correctly.

    From a glass is half full perspective it rules because if you want to make a Python script that writes an Excel spreadsheet with formulas it is easy. If you want to extract the images out of a Word document it is easy because a Word document is just a ZIP file. If you want to do anything with an OOXML document short of writing an Office replacement it's actually a pretty good situation.
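
    As a rough illustration of the "just a ZIP file" point, the sketch below lists embedded media using only the standard library. It assumes images live under `word/media/`, which is where Word normally puts them but is a convention rather than a guarantee; `list_media` is a hypothetical helper, and the package built here is a stand-in, not a valid document.

```python
import io
import zipfile

def list_media(docx_bytes: bytes) -> list[str]:
    """Return names of parts under word/media/ (assumed image location)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return [n for n in package.namelist() if n.startswith("word/media/")]

# Build a tiny stand-in package in memory; a real .docx has more parts.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", "<doc/>")
    z.writestr("word/media/image1.png", b"\x89PNG placeholder")

media = list_media(buf.getvalue())
```

    For a file on disk, passing the path straight to `zipfile.ZipFile` works the same way, and `package.read(name)` returns the raw image bytes.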

    • com2kid 5 hours ago
      > but you rarely hear "JSON sucks" because it gave up on graphs for trees

      Except it also spawned a thousand custom formats that include $ref support of some type, so we are right back to having graphs. :-D

  • mxmilkiib 7 hours ago
  • eirikbakke 7 hours ago
    Microsoft Office has many features. Each feature must be reflected in the file format somehow.

    (I wonder what the specification-pages-to-man-years ratio is...)

  • tracker1 8 hours ago
    I think the last part is probably the biggest thing holding them back IMO... I tend not to install MS Office products on my personal devices, I haven't run Windows on a personal device in a few years. I've mostly maintained just my resume in word or libre-office format for well over a decade. I can't tell you how many times the LO format lost formatting, or just messed up between version upgrades. Same goes for opening a word version in LO.

    That doesn't count the various times where it behaved weird, inconsistently had fields/tables that were impossible to edit, etc. I've had to completely recreate everything a couple times over the years. That's just one document, for one guy that I don't really touch that often.

    Say what you will about Firefox vs Chrome in terms of usability, compared to MS Word using LibreOffice is worse than early betas of Netscape Navigator 4.0. It's both impressive and upsetting. OnlyOffice at least looks nicer, even if it doesn't really function any better. MS's online version of Word in the browser operates more consistently than either.

    • abhinavk 5 hours ago
      Have you tried setting MSWord's default save format to Strict OOXML for inter-operation with LibreOffice?
  • piker 9 hours ago
    Dead on.

    Microsoft is just dominant and exporting its 40 year old legacy codebase as a spec. LibreOffice team is frustrated that the for-profit model is beating the OSS model and crying foul over mostly necessary complexity. If LibreOffice started from scratch they’d probably appreciate how much Microsoft serializes because a sufficiently complicated document saved to .docx basically provides a reference implementation.

    We do need for-profit alternatives to Word, and I’m working on one in legal.

    [edit: I hope to put some real thoughts on this down soon, but most of the wonkiness emanates from evolving functionality and varying trends in best practices over the decades. I’ve implemented a fair bit of the spec here: https://tritium.legal, but most of the hard part is providing for bidi language support, fonts, real-time editing and re-rendering, UI and annotations like spellchecking and grammar, not conforming to the markup spec. Spec conformance is just polish and testing. A performant modern word processor of any spec, however, is a technological achievement on the order of a web browser.]

    • Gigachad 6 hours ago
      I feel like Libreoffice became largely irrelevant the day Google Docs came out. People put up with LO wonkyness because it was free and office was expensive.

      Google completely flipped the game and then cloud collaboration became everything.

      • toast0 5 hours ago
        I mean, multiplayer features are useful, but Google Docs is wonkier than LO. At least when LO loads a document, it's fully loaded.
    • trelane 8 hours ago
      LibreOffice has versions that you pay for, with support. The most prominent is Collabora, which is one of the biggest contributors to LibreOffice, if not the biggest.
    • taftster 5 hours ago
      > We do need for-profit alternatives to Word, and I’m working on one in legal.

      Wow, big undertaking!

      What we really need, though, is a for-profit alternative to Excel, that's not Google. I think Excel is more of the Killer App than Word has ever been.

      • qcnguy 1 hour ago
        That's Apple Numbers.
    • haskellshill 3 hours ago
      > If LibreOffice started from scratch

      What do you mean though? Libreoffice wrote their application from scratch, did they not? And they managed to implement a superior serialization format, did they not? And they managed to get that format standardized without bribing and cheating, did they not?

      What you're saying is akin to "those residents of banana republics are just frustrated capitalism (and a little help from the CIA) is beating democracy"

      > We do need for-profit alternatives to Word

      Why does it have to be for profit?

      • smaudet 3 hours ago
        I think this is definitely some weird attempt to justify a terrible piece of technological junk....

        For all the hate people gave CSS, it was/is fantastic at its job. Word documents are an example of how you don't design a document, and how when a for profit org designs a thing (instead of standards and market pressures), you get a technological monstrosity...

        To be clear, I don't think LibreOffice is great. Part of their issue, they were built as a way to "not pay" for office, and it turns out that no, volunteers don't really do a better job at implementing 1000 pages of nonsense than the people who came up with that spaghetti code in the first place...

        We don't need that software anymore, though. If you use it, know we are looking at you like you are pulling out a physical paper phonebook to store your numbers in, or, less hurtfully but just as topically, a record or CD player...it is dinosaur technology that pretty much has no place in today's world...

        So, they have a point, I don't disagree with them, however it probably would be better just to "admit defeat", get MS to open source their code for compat reasons, and work on something new that's not trying to write viruses on your computer better than paragraphs...

    • unscaled 4 hours ago
      This may be nitpicking, but the complexity in OOXML is not "necessary", at least not in the sense of what Fred Brooks would call essential complexity. As OP clearly demonstrates, the complexity in OOXML is not artificial: there was never some grand conspiracy by Microsoft to create a format that competitors would find hard to implement.

      But very little of this complexity is necessary for a standard interoperable document file format. The background was that the EU started pushing for a standardized document exchange format, and several governments started implementing regulations requiring the use of this format. Microsoft now had some very big customers which urgently needed a feature: a standard document file format. Microsoft _could_ have implemented and submitted a new format that doesn't slavishly reflect their in-memory object graph and legacy issues. Or they even could have just adopted ODF (shudder). But they've chosen the easy way, because, frankly, they probably just didn't have the time. They took the accidental complexity which was the hot mess of Microsoft Office internals (like a buggy date format) and serialized it to disk. It was never an ideal solution, but this was quick to implement.

      That's just a classic case of technical debt: Microsoft needed to deliver a feature fast, and they were willing to make compromises. The crazy political shenanigans Microsoft had executed to standardize their technical debt are ironically just another form of accidental complexity.
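
      The comment does not say which date bug it means; a likely candidate is the well-known 1900 leap-year quirk, where the 1900 date system counts a 29 February 1900 that never existed. A minimal sketch of what a reader has to do to compensate, with `excel_serial_to_date` as a hypothetical helper for whole-day serials only:

```python
from datetime import date, timedelta

def excel_serial_to_date(serial: int) -> date:
    """Convert a 1900-system serial day number to a real calendar date."""
    if serial < 1:
        raise ValueError("serial must be >= 1")
    if serial == 60:
        # Serial 60 stands for 29 February 1900, which is not a real day.
        raise ValueError("serial 60 has no real calendar date")
    # Serials after the phantom day are one higher than the true day count,
    # so they need an epoch that is one day earlier.
    epoch = date(1899, 12, 31) if serial < 60 else date(1899, 12, 30)
    return epoch + timedelta(days=serial)

first = excel_serial_to_date(1)       # start of the 1900 date system
after_gap = excel_serial_to_date(61)  # first serial after the phantom day
```

      Every independent implementation has to carry this special case forever, which is a small-scale example of internals being serialized as-is.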

    • croes 7 hours ago
      Where does the article say it’s a necessary complexity?

      > Thus, the primary goal for this new format wasn’t to be elegant, universal, or easy to implement; it was to placate regulators while preserving Microsoft’s technological and commercial advantages.

      That sounds quite anti-competitive to me

    • like_any_other 4 hours ago
      > LibreOffice team is frustrated that the for-profit model is beating the OSS model

      Let's take a look at this "for-profit model" - is it just higher price outweighed by better product? lol:

      Microsoft, after getting beat up in the press for making proprietary extensions to the Kerberos protocol, has released the specifications on the web -- but in order to get it, you have to run a Windows .exe file which forces you to agree to a click-through license agreement where you agree to treat it as a trade secret, before it will give you the .pdf file. Who would have thought that you could publish a trade secret on the web? - https://slashdot.org/story/00/05/02/158204/kerberos-pacs-and...

      Back in 2001, Be, Inc. managed to get BeOS pre-installed on one computer model from Hitachi. Just one. On the entire PC market. Microsoft forced Hitachi to drop the bootloader entry to hide BeOS from customers buying it. They enforced their monopoly over the only possible niche BeOS could find on the PC market, crushing Be, Inc. in the process. - https://www.haiku-os.org/blog/mmu_man/2021-10-04_ok_lenovo_w...

      So why aren't there any dual-boot computers for sale? The answer lies in the nature of the relationship Microsoft maintains with hardware vendors. More specifically, in the "Windows License" agreed to by hardware vendors who want to include Windows on the computers they sell. This is not the license you pretend to read and click "I Accept" to when installing Windows. This license is not available online. This is a confidential license, seen only by Microsoft and computer vendors. You and I can't read the license because Microsoft classifies it as a "trade secret." The license specifies that any machine which includes a Microsoft operating system must not also offer a non-Microsoft operating system as a boot option. In other words, a computer that offers to boot into Windows upon startup cannot also offer to boot into BeOS or Linux. The hardware vendor does not get to choose which OSes to install on the machines they sell -- Microsoft does. - https://birdhouse.org/beos/byte/30-bootloader/

  • nashashmi 4 hours ago
    OOXML carries bloat from a full legacy doc file into a docx file. Readability was not the mission of the developers of the open format. Openness was the mission of the developers of the format. And they made it open enough.
  • freeopinion 7 hours ago
    This is talking about OOXML the proprietary MS format, right? Not ISO/IEC 29500?

    ISO/IEC 29500 should be open to evolution, no? Just like all the open collaboration on it before it was confirmed as a standard.

  • theanonymousone 9 hours ago
  • etothepii 9 hours ago
    I spent a lot of time last year replicating every valid Excel number format. I've really struggled to find good documentation on the excel format when you really get into the weeds.

    The use of namespaces is also incredibly annoying: as far as I can tell, none of the XML libraries I can find support them well for that "human-readable" component.

    When you crack open the file it feels like you are going to be able to find everything you need with an xpath like //w:t but none of the xml parsers I've found cope well with the namespaces.

    • rhdunn 9 hours ago
      What language?

      In Python, the `find`, `findall`, etc. methods take a namespace dictionary. E.g.

          result = doc.findall(".//w:t", namespaces={"w": "..."})
      
      In C# you can do:

          var navigator = doc.Root!.CreateNavigator();
          var nsManager = new XmlNamespaceManager(navigator.NameTable);
          nsManager.AddNamespace("w", "...");
          var results = doc.Root!.XPathSelectElements("//w:t", nsManager);
      
      In Java you need to enable a namespace-aware flag in the settings to get namespaces to work. I can't recall off-hand how to do that.
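
      For what it's worth, here is a minimal Python sketch of the namespace-dictionary approach, using only the standard library's xml.etree.ElementTree. The sample document is made up for illustration, text_runs is a hypothetical helper name, and the URI bound to w is the WordprocessingML main namespace used by Transitional documents (Strict files use a different URI, so treat it as an assumption about your input).

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (Transitional); assumed for this sketch.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Made-up stand-in for the contents of word/document.xml.
SAMPLE = (
    '<w:document xmlns:w="' + W_NS + '">'
    "<w:body><w:p>"
    "<w:r><w:t>Hello, </w:t></w:r>"
    "<w:r><w:t>world</w:t></w:r>"
    "</w:p></w:body>"
    "</w:document>"
)


def text_runs(xml_source):
    """Return the text of every w:t element, in document order."""
    root = ET.fromstring(xml_source)
    # ".//" searches all descendants; the prefix is resolved through NS,
    # not through whatever prefix the file itself happens to use.
    return [t.text or "" for t in root.iterfind(".//w:t", NS)]


if __name__ == "__main__":
    print("".join(text_runs(SAMPLE)))
```

      The prefix in the query only has to match the dictionary key, so the same call keeps working if a producer writes the file with a different prefix for the same namespace.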
  • taspeotis 5 hours ago
    Off topic sorry but with all the comments discussing Office's size and age and technical baggage ... does anyone know how they pivoted from X million lines of code for a desktop application to running it on the web with all those collaboration features?
  • Joker_vD 9 hours ago
    Sigh. Just because it was not deliberately engineered to be prohibitively expensive to support does not mean that it cannot be used to deliberately obstruct interoperability. It's really not that difficult a concept: if you want others to suffer, you can take a sad artifact of well-meant historical accidents, and say "welp, now it's a standard, you gotta support it!" There is nothing contradictory or conspiratorial.
    • taeric 9 hours ago
      Agreed. I'm... not entirely clear I get the distinction the article is trying to make?

      If you take the idea that it is "artificially complex, because they actively added complexity", then I can see how that isn't quite right. But "artificially complex" can also allow for "because they actively avoided the effort to remove complexity." In which case, we are back to the same spot? But in agreement this time?

    • piker 9 hours ago
      I think we take issue with requiring the leap to Microsoft “deliberately” obstructing interoperability. Microsoft just isn’t incentivized to make it simple to implement, but it’s probably less complicated than the various web standards.
      • Joker_vD 9 hours ago
        An engineering team in Microsoft decides to switch from binary format to XML to save effort in the long run; even though it'll take some effort now, they have the competency, and can afford it. They are absolutely correct!

        But then their manager needs to sell this project to the higher-ups, who have read BillG's memo about how "One thing we have got to change in our strategy – allowing Office documents to be rendered very well by other people's browsers is one of the most destructive things we could do to the company. We have to stop putting any effort into this and make sure that Office documents very well depend on proprietary IE capabilities. Anything else is suicide for our platform. This is a case where Office has to avoid doing something to destroy Windows." and took it to heart. So what does he do? Why, he spins a tale that since it's XML, they'll be able to standardize it, and everyone else will still be forced to interoperate with MS Office anyhow, because it will be the de-facto reference implementation (by the virtue of being there first, and widely deployed), and the spec is going to be an absolute PITA to implement decently — and that manager too will be absolutely correct!

        • piker 9 hours ago
          It’s not actually that bad.
          • to11mtm 8 hours ago
            IME there at least used to be a difference between 'fresh OO doc' and 'oo doc upsaved from legacy' as far as parsing.

            I know when I had to deal with a LOT of excel in 2008-2013, somewhere in that range I gave up on trying to parse the XML (admittedly with the then-rudimentary tools, to say nothing of nascent state of nuget at the time) and just learned how to do VSTO (Visual Studio Tools for Office) as we all had excel installed anyway, and it led to less overall code for the tasks we had to do that involved Excel...

  • 3cats-in-a-coat 9 hours ago
    Microsoft just took what they had and directly translated it to XML. It's not intentionally messy, it's just a big corporation with an old product acting like it.
    • gitonup 8 hours ago
      This is God's honest truth.

      I worked on the MS Word core team for a little over three years from 2010-2014, and de-facto owned a significant part of implementing ODF / OOXML Strict support.

      The binary format was a liability for Microsoft to begin with, because of decades of cruft lining up with actual memory alignment. During my tenure there I ran into code my GM had written as an intern and was still intact -- he had 20+ years of tenure (mostly on Word) when I joined the team.

      The translation of the file format to XML involved a significant amount of performance degradation if you weren't careful. Hundreds of millions of people use the app monthly, and MS still tries to maintain backwards compatibility. Given that open APIs were a relatively late development for the app, and given what boards of directors currently expect of the companies they oversee, I really don't think _anyone_ would take years to:

      a) define a spec that maintained that backwards compatibility

      b) reach whatever nebulous simplicity metric today's HN article wants

      c) not get whoever greenlit the project fired for taking that many engineering hours for a and b

  • pessimizer 8 hours ago
    I have no idea what this article is intending to express. It is artificially complex to dump the exact implementations of your legacy products into a giant data structure and call it a standard. Nobody can implement that. Which is why they had to bribe, stuff committees and bully people to get it done.

    I don't think anyone cares about debating the word "artificial," I don't think that was anyone's point. It's just not a standard. It was, as is made clear here, a way to head off a standard that would be possible for competitors to implement with a fake standard that Microsoft couldn't even implement.

    I also don't think that it is "a counterproductive reflex that’s common in open-source circles: scolding users for accepting proprietary tech." I don't even know wtf that's supposed to mean. People are stuck with it because of corruption, they're not being scolded for using it.

    edit: "LibreOffice itself, as ODF’s flagship, still suffers from rough edges in design, interaction, and performance. As a result, even as Office hobbles itself with bloat, most people still find it easier."

    Yeah, it'd be a lot easier if they didn't ever have to deal with OOXML and could just work on their own product.

  • croes 7 hours ago
    > Thus, the primary goal for this new format wasn’t to be elegant, universal, or easy to implement; it was to placate regulators while preserving Microsoft’s technological and commercial advantages.

    That sounds exactly like an anti-competitive format.

    Keeping one's own advantage pretty much sums up all anti-competitive behavior.

  • RcouF1uZ4gsC 9 hours ago
    > Faced with demands for openness, Microsoft could have produced a clean, modern spec and keep the mass pile of legacy inside the application.

    Very, very few people care about openness. Maybe a few hundred. Tens of millions care about docx capturing exactly what their doc files had.

    Microsoft made the correct choice.

  • fsflover 9 hours ago
  • cyberax 9 hours ago
    The answer: no.

    OOXML is an extremely detailed spec that lists minute details of the Office documents, with uncountable features. While it could have used some "standard" features, there weren't that many usable standards when OOXML was being developed.

    In comparison, OASIS OpenDocument spec is horribly ambiguous and has all the same issues (like units not being used consistently). It got better over the years, but it's still not at all great. And its size is now comparable to OOXML, when all the referenced specs are incorporated.

    • rhdunn 8 hours ago
      There are places where it says the equivalent of "Works the same as Word 95" [3], but does not specify in the specification what that means.

      It's essentially a serialization of the binary format to XML.

      ODF 1.4 is around 1,100 pages across all 4 parts whereas OOXML is over 6,000.

      [1] https://stephesblog.blogs.com/my_weblog/2007/08/microsofts-f...

      [2] https://ooxmlisdefectivebydesign.blogspot.com/2007/08/micros...

      [3] https://www.robweir.com/blog/2007/01/how-to-hire-guillaume-p...

      • xeeeeeeeeeeenu 3 hours ago
        They improved this in later revisions of the standard. The behaviour of autoSpaceLikeWord95 is now actually described and there's an example.

        You can see it for yourself here (in Part 4): https://ecma-international.org/publications-and-standards/st...
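
        If you want to check whether a given .docx actually carries these legacy compatibility flags, a rough Python sketch follows. It assumes the flags are serialized as children of w:compat in word/settings.xml, as in Transitional documents. compat_flags and make_sample are hypothetical helper names, and the tiny archive built here is a stand-in for a real file, not a valid Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (Transitional); assumed for this sketch.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def compat_flags(docx_bytes):
    """List the local names of the elements under w:compat, if any."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/settings.xml"))
    compat = root.find("w:compat", {"w": W_NS})
    if compat is None:
        return []
    # ElementTree tags look like "{namespace}local"; keep the local part.
    return [child.tag.rpartition("}")[2] for child in compat]


def make_sample():
    """Build an in-memory zip holding only a made-up settings part."""
    settings = (
        '<w:settings xmlns:w="' + W_NS + '">'
        "<w:compat><w:autoSpaceLikeWord95/><w:useFELayout/></w:compat>"
        "</w:settings>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/settings.xml", settings)
    return buf.getvalue()


if __name__ == "__main__":
    print(compat_flags(make_sample()))
```

        For a real file you would pass the bytes read from disk instead of make_sample(). A missing w:compat element just yields an empty list.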

      • cyberax 8 hours ago
        > There are places where it says the equivalent of "Works the same as Word 95" [3], but does not specify in the specification what that means.

        Yeah, sure, whatever. You'll never see these kinds of documents in real life. And the specified quirks were minor. If you don't implement them, you'll get subtle formatting issues in documents imported directly from Word97.

        MS could have just put them into a "vendor-specific" extension and not documented them at all.

        > ODF 1.4 is around 1,100 pages across all 4 parts whereas OOXML is over 6,000.

        LOL, no. SVG spec alone is 800 pages. ODF formula spec is 200 pages alone, and is still underspecified.

  • stuzenz 8 hours ago
    My theory (from anecdotal use) is that the OOXML complexity also explains why M365 office implementation is lacking in so many features and is just not very good at all when compared to the Google office suite.

    I do have strong memories of OOXML and the scandals that came with it when it became a standard through MS allegedly buying/stacking/influencing votes:

    https://chatgpt.com/share/68bf5e11-4e10-8003-ac9d-d4d10f7951...