Not only is the standard overly complex; Microsoft also indulged in all sorts of unscrupulous activities to corrupt various national standards organisations and get it approved through ISO <https://en.wikipedia.org/wiki/Standardization_of_Office_Open...>, which is clear evidence of malicious intent.
This is a quote from Richard Stallman:
> The specifications document was so long that it would be difficult for anyone else to implement it properly. When the proposed standard was submitted through the usual track, experienced evaluators rejected it for many good reasons. Microsoft responded using a special override procedure in which its money bought the support of many of the voting countries, thus bypassing proper evaluation and demonstrating that ISO can be bought.
My wife worked in one of the national standardization organizations. She was urgently called into her boss' office: "Please be on this meeting with me, I think they will try to bribe me if I'm alone". It only happened once while my wife worked there and it was right before the vote where Microsoft tried to fast track their office format.
Specifically what I heard on the grapevine was that Microsoft sponsored a collection of small island nations into the ISO process, in exchange for their vote on OOXML.
They didn't want a standard other people could adopt easily, nor did they want to do the work to make Word adhere to one, and it had to happen fast. By doing it the way they did, they got everything they wanted and only needed to buy ISO.
OOXML is complex because it has to be. It has to losslessly round trip through an open format every single feature of Office. That's a lot of features.
Yes, it's complex. Should Microsoft have cut features of Office just to make OOXML simpler? That's ridiculous. What about users who relied on those cut features?
It was fair to ask Microsoft to open the file format. It wasn't fair to expect them to cut features and compatibility. The complaints about complexity from RMS and others represent outsiders seeing the sausage factory and realizing that the sausage making is complicated and needs a lot of moving parts. Maybe life wasn't as simple as the Slashdot "Micro$oft" narrative would suggest. Maybe the complexity of the product was downstream of the shit ton of complexity and sweat and thought that had gone into it.
But admitting that would have been hard. Easier to come up with conspiracy theories.
But they did define two variants to get their standard approved in the fast track process.
The Transitional variant, which is entirely backwards compatible, is not fully defined in a way that others can implement without reverse engineering how Microsoft Office does things.
The Strict variant isn't totally compatible with all older binary formats but is fully defined.
You are wrong. Microsoft was not asked to open the file format. There was an open file format already accepted as an ISO standard, so now they needed to make their product compliant with an ISO standard because companies around the world were going to prioritise that in their purchases. They did everything they could to ensure that their format was both an ISO standard, and impossible for somebody else to implement.
> First, OOXML was, in material part, a defensive posture under intensifying antitrust and “open standards” pressure. Microsoft announced OOXML in late 2005 while appealing an adverse European Commission judgment centered on interoperability disclosures. Thus, it was only a matter of time before Office file compatibility came under the regulatory microscope. (The Commission indeed opened a probe in 2008.)
> Meanwhile, the rival ODF matured and became an ISO standard in May 2006. Governments, especially in Europe, began to mandate open standards in public procurement. If Microsoft did nothing, Office risked exclusion from government deals.
So... maybe they weren't directly asked to open their file format, but what then? Adopt ODF which is surely incompatible with their feature set, and... just corrupt every .doc file when converting into the new format? And also have to reimplement all their apps?
So you put extensions in the spec; you don't make it impossible for anyone else to implement. They knew open-source suites were competing with them; they did it on purpose.
What it didn't have to be is sections upon sections of "this behaviour is as seen in Word 95", "this behaviour is as seen in Word 97" without any further specification or context.
The main struggle for independent implementors was reverse engineering all the implicit and explicit assumptions and inner workings of MS Office software.
> But admitting that would have been hard. Easier to come up with conspiracy theories.
I actually read through a lot of that spec at the time. A lot of it was just lip service to open standards at a time when MS was under a lot of regulatory pressure.
That stuff happens because Microsoft don't know what the behavior is. It's just a bit which forks Word down some ancient code path that nobody understands and that isn't properly documented. Given the huge effort that would have gone into producing this thousand-plus-page specification, it's understandable why the spec writers would have given up at times.
I expect most people posting on Hacker News would not be able to write a satisfactory specification for their own software if they are working in a large legacy codebase.
It was a pretty big deal when OpenOffice.org's 2.0 release came with OpenDocument as the default file format. Very easy for someone to misread this MSOffice screen and click on OOXML expecting it to mean OO.o.
Oh wow. I must have clicked through that page dozens of times, selecting "Keep Current" after a quick scan and thinking the 2nd option was talking about Open Office.
Microsoft seems to have known that they could ram basically anything through a standards body, so they presumably didn't bother to actually try and simplify the standard. Instead, it's basically an XML serialization of their older binary formats, complete with all of the quirks and bugs that have to be emulated for 100% compatibility.
To be fair, we're talking about a product line with over 35 years of history here. Cruft in the format builds up but can never be removed, so long as you commit to strong backwards compatibility - which Microsoft has always done.
Fun trivia: many of the old binary formats use a meta-format called OLE2 (Object Linking and Embedding). The file format is a FAT12 filesystem packed into a single file, with a FAT filesystem chain, file blocks aligned to a specific power-of-two size, etc. This made saving files very fast, but raised the possibility of internal fragmentation (where individual sub-files are scattered over many non-contiguous blocks); hence, users were recommended to "Save As..." periodically for large/complex files to optimize the internal storage.
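For the curious, the container's header can be peeked at with nothing but the stdlib. This is a minimal sketch, assuming only the well-known OLE2 magic bytes and the sector-shift field at offset 30; the header bytes below are fabricated for illustration, not read from a real file:

```python
import struct

OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

def sector_size(header: bytes) -> int:
    """Return the sector size declared in an OLE2 (compound file) header."""
    if header[:8] != OLE2_MAGIC:
        raise ValueError("not an OLE2 compound file")
    # The sector shift is a little-endian uint16 at offset 30; the sector
    # size is 2**shift (512 for classic files, 4096 for newer ones).
    (shift,) = struct.unpack_from("<H", header, 30)
    return 1 << shift

# Fabricated minimal header: magic, zero padding, sector shift 9 at offset 30.
fake = bytearray(512)
fake[:8] = OLE2_MAGIC
struct.pack_into("<H", fake, 30, 9)
print(sector_size(bytes(fake)))  # → 512
```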
It's... I guess malicious compliance, though also if you don't care about interop you're not going to try to abstract away your internal application structures, are you!
I appreciate the standard existing rather than it not existing. Trying to have the standard exist in this way has always felt like an uphill battle, and at least now there's _something_.
It's just that you will have a better time if you emulate how Office does things. But at least you have a bit more documentation to go along with it.
The OOXML fight is near and dear to my heart because, when it happened, I was a baby developer, and I cared about the issue for some reason I can barely recall, and I found an expert on the issue on Twitter. That guy would regularly tweet about everything that was going on and the problems with the spec and the shenanigans, and I was one of the, like, 20 people who were hanging on his every word. And sometimes he'd talk about beekeeping instead. It was my first introduction to Twitter at its best. You got these unfiltered whole views of the lives and concerns of real people who were, in part, experts at what you cared about. So sometimes you had to listen to them talk about other random stuff they thought was neat. And that's great!
> In my view, OOXML is indeed complex, convoluted, and obscure. But that’s likely less about a plot to block third-party compatibility and more about a self-interested negligence: Microsoft prioritized the convenience of its own implementation and neglected the qualities of clarity, simplicity, and universality that a general-purpose standard should have.
The author only provides arguments for "self-interested negligence". He provides no counterarguments to the claim that OOXML complexity was "a plot to block third-party compatibility". Therefore, he cannot compare "negligence" and "a plot". Therefore, his claim that "negligence" is a better explanation for OOXML complexity than "a plot" cannot follow.
To restate:
> If we dig into the context of OOXML’s creation, it can be argued that harming competitors was not Microsoft’s primary aim.
The author provides no evidence to support this claim. At most, the evidence provided in this section supports the claim that "negligence" played a role in OOXML complexity. From this evidence alone, no conclusions can be drawn about the "primariness" of "negligence" vs "harming competitors".
Unless we ever get the full archive of Microsoft emails, meeting minutes and recordings from all the secret microphones they didn't have in their meeting rooms, I don't think you can ever disprove this claim. It's generally impossible to conclusively disprove conspiracy theories, because you could always claim the evidence merely shows there are no documents proving the conspiracy, not that there are documents disproving it.
The author is just implicitly appealing to Occam's razor here, as people often do in the face of accusations of a plot. They can show that Microsoft backed the ANSI accreditation of ODF[1] and eventually implemented support for ODF import and export in Office, but that's not enough to prove there was no conspiracy.
Instead, the article just provides a very plausible explanation for the complexity in OOXML. Does this explanation thoroughly disprove the accusations of a plot? Clearly not. Is it more plausible than a great plot to crush a bunch of competitors that had no market share and to kill a better standard document format that Microsoft did end up implementing in Office? Yes. This is probably as far as we can get.
Both things can be true. It had a genuine purpose, but the fact that Microsoft will go out of its way to not implement anything better and less temperamental is an indication it's not really open. There's plenty of evidence of Microsoft dragging their feet at playing nice with the rest of the office ecosystem.
I'm not saying they shouldn't do that as a company maximizing shareholder value. But we should all collectively groan every time the topic comes up, not applaud them.
I mean, sometimes you gotta ship a product (and remember, back then that meant masters for CDs), and it's perfectly possible that whatever team was in charge of handling 'conversion' stuff for the old formats (remember that old Excel formats have OLE-type cruft going on, the sorts of things that led to VBA viruses; imagine what other functionality needs to be implemented) just plain had to take shortcuts, uglifying the spec to support all the jank.
Worth keeping in mind that the native MSO formats were using "structured storage", a horrible binary chunked serialization and metadata format from an era where binary embedding of document streams in other application documents via "Object linking and embedding" (OLE, see also Apple's OpenDoc format) was deemed desirable, with zero consideration given to third-party apps and segment formats tied to C++ data structures. Compared to that, OOXML is still a huge progress, and while it's complex I wouldn't say it's maliciously so.
The Shakespeare example is a good one where the sentence is split into multiple spans to apply style rules, yet the bare text content could be extracted by just removing all XML tags. Whereas the ODF variant is actually less recommendable, as it relies on an unnecessarily complex formatting and text-addressing language on top of XML.
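The "just remove the tags" point can be sketched with stdlib ElementTree. The paragraph below is a trimmed, hypothetical run-split snippet (not copied from a real file); the `w:` prefix is bound to the standard WordprocessingML namespace URI:

```python
import xml.etree.ElementTree as ET

# Hypothetical OOXML-like paragraph: the sentence is split across runs
# purely to carry formatting (the first run is bold).
xml = """
<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:rPr><w:b/></w:rPr><w:t>To be, </w:t></w:r>
  <w:r><w:t>or not to be</w:t></w:r>
</w:p>
"""

root = ET.fromstring(xml)
# itertext() walks every text node, which amounts to stripping all tags;
# the split/join collapses the indentation whitespace between elements.
text = " ".join("".join(root.itertext()).split())
print(text)  # → To be, or not to be
```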
The article says
> Even at a glance [ODF's markup] is more intelligible. Strip the `text:` namespaces and it's nearly valid HTML.
The only thing that needs explaining is that ODF doesn't wrap "To be" with a dedicated "bold" tag. Instead, it applies an auto-style named T1 to a <text:span>, an act of separating content and presentation that mirrors established web practices.
but this definitely makes things more complex for data exchange compared to OOXML.
Can you explain what's wrong with the concept of a container format that allows embedding subdocuments of different types?
> zero consideration given to third-party apps and segment formats
The reality is the opposite. COM serialization was specifically built to allow for composing components (and serializations thereof) that didn't know about each other into a single document. That's why it leans so heavily on GUIDs for names: they avoid collisions without needing coordination. That's a laudable goal, not pointless bloat. And the COM people implemented it pretty efficiently too!
> C++ data structures
What gives you that idea? Yes, the OLE stream thing was a binary format, but so is DER for ASN.1. Every webpage you load goes over a binary tagged object format not too different from OLE/COM's.
But due to a persistence of myths from the 90s, people still think of the Office binary format as "horrible" when it's actually quite elegant, especially considering the problems the authors had to solve and their constraints in doing so.
In many ways, we've regressed.
> Markup
The author of the article nails it when he says ODF is meant to be a markup language and OOXML is the serialization of an object graph. So what? Do people write ODF by hand? There are countless JSON formats just as inscrutable as MSO's legacy streams.
Anyway, the idea that the MSO binary format was crap because it was binary, lazy, and represented a "memory dump" is an old myth that just won't die. It wasn't a memory dump, it wasn't lazy, and it wasn't crap. Yes, there are real problems with some of the things people put inside the OLE container, but it's facile and wrong to blame the container or the OLE stream composition model for the problem.
I remember SpreadsheetML, an older format compatible with Excel. It had a subset of features, I think, but it was a rather powerful subset: formatting, formulae, multiple sheets. And it was rather simple. (It had a silly design mistake: for some reason MS gave attributes a namespace, which isn't necessary except for rather specific purposes.)
Another XML standard from MS that also seems relatively simple is XPS, a PDF alternative. But it uses Open Packaging and that is somewhat hard to read.
My possibly incomplete understanding was that the original office file format was basically just raw dumps of the internal C data structures. Not designed or specified in any way.
The XML version likely carries a lot of baggage having to be compatible with that.
They weren't "just" raw dumps of internal C structures. It takes careful design work to dump raw memory in a usable fashion. Consider: You can't just write a pointer to disk and then read it back next week.
The binary MS Office format is a phenomenal piece of engineering to achieve a goal that's no longer relevant: fast save/load on late-'80s hard drives. Other programs took minutes to save a spreadsheet; Excel took seconds. It did this by making sure its in-memory data structures for a document could be dumped straight to disk without transformation.
But yes, this approach carries a shitton of baggage. And that achievement is no longer relevant in a world where consumer hardware can parse XML documents on the fly.
I have heard it argued, though, that the "baggage" isn't the file format. It's actually the full historical featureset of Excel. Being backwards-compatible means being able to faithfully represent the features of old Excel, and the essential complexity of that far outweighs the incidental complexity of how those features were encoded.
People who were developing "office" programs in the early 1990s were thinking about the problem of serializing arbitrary object graphs into documents to support technologies like OLE,
where you could embed an Excel spreadsheet inside a Word document, or actually embed any of a large range of COM objects into a Word document, which on one hand is a really appealing vision but on the other hand means you have to have, and be able to run, all the binaries for all the objects that live in a document, which ties the whole thing to Windows.
PDF is a different sort of document format which privileges viewing over editing but it is also really about serializing an object graph when it comes down to it and then having various sorts of filters and transformations and a range of objects defined in the spec as opposed to open ended access to an object library.
This kind of system has a lot of overlap with the serdes problem you get with RPC frameworks that used to be filed under "Sun RPC sucks", "DCOM sucks", "CORBA sucks" and "WS-* sucks". Those things are mostly forgotten these days because, well... they sucked, and now the usual complaint is "protobuf sucks". But you rarely hear "JSON sucks" because it gave up on graphs for trees; if you don't have a type system, people can't say the type system sucks; and the only thing that really sucks about it is that people won't just use ISO 8601 dates, but you can always rise above that by just using ISO 8601 dates without asking for permission. But we all agree YAML sucks.
That points to any flexible document format sucking, but OOXML also sucks because it has lots of poorly specified and obscure features that amount to "format this the same way Word 95 formatted it if you used a certain obscure option".
From a glass is half empty perspective it sucks because it's close to impossible to make a Microsoft Office replacement that renders 100% of documents 100% correctly.
From a glass-is-half-full perspective it rules, because if you want to make a Python script that writes an Excel spreadsheet with formulas, it is easy. If you want to extract the images out of a Word document, it is easy, because a Word document is just a ZIP file. If you want to do anything with an OOXML document short of writing an Office replacement, it's actually a pretty good situation.
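A minimal sketch of that "it's just a ZIP" point, using only the stdlib. The file contents below are a fabricated toy, not a valid document; real .docx files also carry [Content_Types].xml, relationship parts, and so on:

```python
import io
import zipfile

# Build a toy .docx-like archive in memory (hypothetical contents).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("word/document.xml", "<w:document/>")
    zf.writestr("word/media/image1.png", b"\x89PNG...")

# "Extract the images" is then just filtering the archive's member list:
# real Word documents keep embedded media under word/media/.
with zipfile.ZipFile(buf) as zf:
    images = [name for name in zf.namelist()
              if name.startswith("word/media/")]
print(images)  # → ['word/media/image1.png']
```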
I think the last part is probably the biggest thing holding them back IMO... I tend not to install MS Office products on my personal devices, I haven't run Windows on a personal device in a few years. I've mostly maintained just my resume in word or libre-office format for well over a decade. I can't tell you how many times the LO format lost formatting, or just messed up between version upgrades. Same goes for opening a word version in LO.
That doesn't count the various times where it behaved weird, inconsistently had fields/tables that were impossible to edit, etc. I've had to completely recreate everything a couple times over the years. That's just one document, for one guy that I don't really touch that often.
Say what you will about Firefox vs Chrome in terms of usability, compared to MS Word using LibreOffice is worse than early betas of Netscape Navigator 4.0. It's both impressive and upsetting. OnlyOffice at least looks nicer, even if it doesn't really function any better. MS's online version of Word in the browser operates more consistently than either.
Microsoft is just dominant and exporting its 40 year old legacy codebase as a spec. LibreOffice team is frustrated that the for-profit model is beating the OSS model and crying foul over mostly necessary complexity. If LibreOffice started from scratch they’d probably appreciate how much Microsoft serializes because a sufficiently complicated document saved to .docx basically provides a reference implementation.
We do need for-profit alternatives to Word, and I’m working on one in legal.
[edit: I hope to put some real thoughts on this down soon, but most of the wonkiness emanates from evolving functionality and varying trends in best practices over the decades. I’ve implemented a fair bit of the spec here: https://tritium.legal, but most of the hard part is providing for bidi language support, fonts, real-time editing and re-rendering, UI and annotations like spellchecking and grammar, not conforming to the markup spec. Spec conformance is just polish and testing. A performant modern word processor of any spec, however, is a technological achievement on the order of a web browser.]
I feel like Libreoffice became largely irrelevant the day Google Docs came out. People put up with LO wonkyness because it was free and office was expensive.
Google completely flipped the game and then cloud collaboration became everything.
What do you mean though? Libreoffice wrote their application from scratch, did they not? And they managed to implement a superior serialization format, did they not? And they managed to get that format standardized without bribing and cheating, did they not?
What you're saying is akin to "those residents of banana republics are just frustrated capitalism (and a little help from the CIA) is beating democracy"
I think this is definitely some weird attempt to justify a terrible piece of technological junk....
For all the hate people gave CSS, it was/is fantastic at its job. Word documents are an example of how you don't design a document, and how when a for profit org designs a thing (instead of standards and market pressures), you get a technological monstrosity...
To be clear, I don't think LibreOffice is great. Part of their issue: they were built as a way to "not pay" for Office, and it turns out that, no, volunteers don't really do a better job of implementing 1000 pages of nonsense than the people who came up with that spaghetti code in the first place...
We don't need that software anymore, though. If you use it, know we are looking at you like you are pulling out a physical paper phonebook to store your numbers in, or, less hurtfully but just as topically, a record or CD player... it is dinosaur technology that pretty much has no place in today's world...
So, they have a point, I don't disagree with them; however, it would probably be better just to "admit defeat", get MS to open source their code for compat reasons, and work on something new that isn't better at writing viruses on your computer than at writing paragraphs...
This may be nitpicking, but the complexity in OOXML is not "necessary", at least not in the sense of what Fred Brooks would call essential complexity. As OP clearly demonstrates, the complexity in OOXML is not artificial: there was never some grand conspiracy by Microsoft to create a format that competitors would find hard to implement.
But very little of this complexity is necessary for a standard interoperable document file format. The background was that the EU started pushing for a standardized document exchange format, and several governments started implementing regulations requiring the use of this format — Microsoft now had some very big customers which urgently needed a feature: a standard document file format. Microsoft _could_ have designed and submitted a new format that doesn't slavishly reflect their in-memory object graph and legacy issues. Or they could even have just adopted ODF (shudder). But they chose the easy way, because, frankly, they probably just didn't have the time. They took the accidental complexity which was the hot mess of Microsoft Office internals (like a buggy date format) and serialized it to disk. It was never an ideal solution, but it was quick to implement.
That's just a classic case of technical debt: Microsoft needed to deliver a feature fast, and they were willing to make compromises. The crazy political shenanigans Microsoft had executed to standardize their technical debt are ironically just another form of accidental complexity.
Where does the article say it’s a necessary complexity?
> Thus, the primary goal for this new format wasn’t to be elegant, universal, or easy to implement; it was to placate regulators while preserving Microsoft’s technological and commercial advantages.
> LibreOffice team is frustrated that the for-profit model is beating the OSS model
Let's take a look at this "for-profit model" - is it just higher price outweighed by better product? lol:
Microsoft, after getting beat up in the press for making proprietary extensions to the Kerberos protocol, has released the specifications on the web -- but in order to get it, you have to run a Windows .exe file which forces you to agree to a click-through license agreement where you agree to treat it as a trade secret, before it will give you the .pdf file. Who would have thought that you could publish a trade secret on the web? - https://slashdot.org/story/00/05/02/158204/kerberos-pacs-and...
Back in 2001, Be, Inc. managed to get BeOS pre-installed on one computer model from Hitachi. Just one. On the entire PC market. Microsoft forced Hitachi to drop the bootloader entry to hide BeOS from customers buying it. They enforced their monopoly over the only possible niche BeOS could find on the PC market, crushing Be, Inc. in the process. - https://www.haiku-os.org/blog/mmu_man/2021-10-04_ok_lenovo_w...
So why aren't there any dual-boot computers for sale? The answer lies in the nature of the relationship Microsoft maintains with hardware vendors. More specifically, in the "Windows License" agreed to by hardware vendors who want to include Windows on the computers they sell. This is not the license you pretend to read and click "I Accept" to when installing Windows. This license is not available online. This is a confidential license, seen only by Microsoft and computer vendors. You and I can't read the license because Microsoft classifies it as a "trade secret." The license specifies that any machine which includes a Microsoft operating system must not also offer a non-Microsoft operating system as a boot option. In other words, a computer that offers to boot into Windows upon startup cannot also offer to boot into BeOS or Linux. The hardware vendor does not get to choose which OSes to install on the machines they sell -- Microsoft does. - https://birdhouse.org/beos/byte/30-bootloader/
OOXML carries bloat from a full legacy doc file into a docx file. Readability was not the mission of the developers of the open format. Openness was the mission of the developers of the format. And they made it open enough.
I spent a lot of time last year replicating every valid Excel number format. I've really struggled to find good documentation on the excel format when you really get into the weeds.
The use of namespaces is also incredibly annoying: as far as I can tell, in every XML library I can find, they really aren't well supported for that "human readable" component.
When you crack open the file it feels like you are going to be able to find everything you need with an xpath like //w:t but none of the xml parsers I've found cope well with the namespaces.
In Python, the `find`, `findall`, etc. methods take a namespace dictionary. E.g.
result = doc.findall(".//w:t", namespaces={"w": "..."})  # ElementTree paths must start with "." (".//", not "//")
In C# you can do:
var navigator = doc.Root!.CreateNavigator();
var nsManager = new XmlNamespaceManager(navigator.NameTable);
nsManager.AddNamespace("w", "...");
var results = doc.Root?.XPathSelectElements("//w:t", nsManager);
In Java you need to enable a namespace-aware flag in the settings to get namespaces to work. I can't recall off-hand how to do that.
Off topic sorry but with all the comments discussing Office's size and age and technical baggage ... does anyone know how they pivoted from X million lines of code for a desktop application to running it on the web with all those collaboration features?
sigh Just because it was not deliberately engineered to be prohibitively expensive to support does not mean that it cannot be used to deliberately obstruct interoperability. It's really not that difficult a concept: if you want others to suffer, you can take a sad artifact of well-meant historical accidents and say "welp, now it's a standard, you gotta support it!" There is nothing contradictory or conspiratorial about that.
Agreed. I'm... not entirely clear I get the distinction the article is trying to make?
If you take the idea that it is "artificially complex, because they actively added complexity", then I can see how that isn't quite right. But "artificially complex" can also allow for "because they actively avoided the effort to remove complexity." In which case, we are back to the same spot? But in agreement this time?
I think we take issue with requiring the leap to Microsoft “deliberately” obstructing interoperability. Microsoft just isn’t incentivized to make it simple to implement, but it’s probably less complicated than the various web standards.
An engineering team in Microsoft decides to switch from binary format to XML to save effort in the long run; even though it'll take some effort now, they have the competency, and can afford it. They are absolutely correct!
But then their manager needs to sell this project to the higher-ups, who have read BillG's memo about how "One thing we have got to change in our strategy – allowing Office documents to be rendered very well by other people's browsers is one of the most destructive things we could do to the company. We have to stop putting any effort into this and make sure that Office documents very well depend on proprietary IE capabilities. Anything else is suicide for our platform. This is a case where Office has to avoid doing something to destroy Windows." and took it to heart. So what does he do? Why, he spins a tale that since it's XML, they'll be able to standardize it, and everyone else will still be forced to interoperate with MS Office anyhow, because it will be the de-facto reference implementation (by the virtue of being there first, and widely deployed), and the spec is going to be an absolute PITA to implement decently — and that manager too will be absolutely correct!
IME there at least used to be a difference between 'fresh OO doc' and 'OO doc upsaved from legacy' as far as parsing goes.
I know when I had to deal with a LOT of excel in 2008-2013, somewhere in that range I gave up on trying to parse the XML (admittedly with the then-rudimentary tools, to say nothing of nascent state of nuget at the time) and just learned how to do VSTO (Visual Studio Tools for Office) as we all had excel installed anyway, and it led to less overall code for the tasks we had to do that involved Excel...
Microsoft just took what they had and directly translated it to XML. It's not intentionally messy, it's just a big corporation with old product acting like it.
I worked on the MS Word core team for a little over three years from 2010-2014, and de-facto owned a significant part of implementing ODF / OOXML Strict support.
The binary format was a liability for Microsoft to begin with, because of decades of cruft lining up with actual memory alignment. During my tenure there I ran into code my GM had written as an intern and was still intact -- he had 20+ years of tenure (mostly on Word) when I joined the team.
The translation of the file format to XML involved a significant amount of performance degradation if you weren't careful. Hundreds of millions of people use the app monthly, and MS still tries to maintain backwards compatibility. Given that open APIs were a relatively late development for the app, I really don't think in the current reality of what's expected by boards of directors for the companies they oversee that _anyone_ would take years to:
a) define a spec that maintained that backwards compatibility
b) reach whatever nebulous simplicity metric today's HN article wants
c) not get whoever greenlit the project fired for taking that many engineering hours for a and b
I have no idea what this article is intending to express. It is artificially complex to dump the exact implementations of your legacy products into a giant data structure and call it a standard. Nobody can implement that. Which is why they had to bribe, stuff committees, and bully people to get it done.
I don't think anyone cares about debating the word "artificial," I don't think that was anyone's point. It's just not a standard. It was, as is made clear here, a way to head off a standard that would be possible for competitors to implement with a fake standard that Microsoft couldn't even implement.
I also don't think that it is "a counterproductive reflex that’s common in open-source circles: scolding users for accepting proprietary tech." I don't even know wtf that's supposed to mean. People are stuck with it because of corruption, they're not being scolded for using it.
edit: "LibreOffice itself, as ODF’s flagship, still suffers from rough edges in design, interaction, and performance. As a result, even as Office hobble itself with bloat, most people still find it easier."
Yeah, it'd be a lot easier if they didn't ever have to deal with OOXML and could just work on their own product.
> Thus, the primary goal for this new format wasn’t to be elegant, universal, or easy to implement; it was to placate regulators while preserving Microsoft’s technological and commercial advantages.
That sounds exactly like an anti-competitive format.
Preserving one's own advantage pretty much sums up all anti-competitive behavior.
OOXML is an extremely detailed spec that lists minute details of the Office documents, with uncountable features. While it could have used some "standard" features, there weren't that many usable standards when OOXML was being developed.
In comparison, OASIS OpenDocument spec is horribly ambiguous and has all the same issues (like units not being used consistently). It got better over the years, but it's still not at all great. And its size is now comparable to OOXML, when all the referenced specs are incorporated.
> There are places where it says the equivalent of "Works the same as Word 95" [3], but does not specify in the specification what that means.
Yeah, sure, whatever. You'll never see these kinds of documents in real life. And the specified quirks were minor. If you don't implement them, you'll get subtle formatting issues in documents imported directly from Word 97.
MS could have just put them into a "vendor-specific" extension and not documented them at all.
> ODF 1.4 is around 1,100 pages across all 4 parts whereas OOXML is over 6,000.
LOL, no. SVG spec alone is 800 pages. ODF formula spec is 200 pages alone, and is still underspecified.
My theory (from anecdotal use) is that the OOXML complexity also explains why M365 office implementation is lacking in so many features and is just not very good at all when compared to the Google office suite.
I do have strong memories of OOXML and the scandals that were with it when it became a standard through MS allegedly buying/stacking/influencing votes:
https://chatgpt.com/share/68bf5e11-4e10-8003-ac9d-d4d10f7951...
I absolutely do not agree.
But admitting that would have been hard. Easier to come up with conspiracy theories.
The Transitional variant which is entirely backwards compatible is not fully defined in a way that others can implement without reverse engineering how Microsoft Office does things.
The Strict variant isn't totally compatible with all older binary formats but is fully defined.
Guess which one is the standard file format?
> First, OOXML was, in material part, a defensive posture under intensifying antitrust and “open standards” pressure. Microsoft announced OOXML in late 2005 while appealing an adverse European Commission judgment centered on interoperability disclosures. Thus, it was only a matter of time before Office file compatibility came under the regulatory microscope. (The Commission indeed opened a probe in 2008.)
> Meanwhile, the rival ODF matured and became an ISO standard in May 2006. Governments, especially in Europe, began to mandate open standards in public procurement. If Microsoft did nothing, Office risked exclusion from government deals.
So... maybe they weren't directly asked to open their file format, but what then? Adopt ODF which is surely incompatible with their feature set, and... just corrupt every .doc file when converting into the new format? And also have to reimplement all their apps?
... which are either public, in which case people complain that the spec+extensions is too long instead of that the spec is too long, or
... which aren't public, in which case people complain that there's no interoperability.
You can't win.
> impossible for anyone else to implement
Except for all the people who did implement it?
It was never fully implemented. LibreOffice has been trying since then and there are always problems.
What it didn't have to be is sections upon sections of "this behaviour is as seen in Word 95", "this behaviour is as seen in Word 97" without any further specification or context.
The main struggle for independent implementors was reverse engineering all the implicit and explicit assumptions and inner workings of MS Office software.
> But admitting that would have been hard. Easier to come up with conspiracy theories.
I actually read through a lot of that spec at the time. A lot of it was just lip service to open standards at a time when MS was under a lot of regulatory pressure.
I expect most people posting on Hacker News would not be able to write a satisfactory specification for their own software if they are working on a large legacy code base.
It was a pretty big deal when OpenOffice.org's 2.0 release came with OpenDocument as the default file format. Very easy for someone to misread this MSOffice screen and click on OOXML expecting it to mean OO.o.
To be fair, we're talking about a product line with over 35 years of history here. Cruft in the format builds up but can never be removed, so long as you commit to strong backwards compatibility - which Microsoft has always done.
Fun trivia: many of the old binary formats use a meta-format called OLE2 (Object Linking and Embedding). The file format is a FAT12 filesystem packed into a single file, with a FAT filesystem chain, file blocks aligned to a specific power-of-two size, etc. This made saving files very fast, but raised the possibility of internal fragmentation (where individual sub-files are scattered over many non-contiguous blocks); hence, users were recommended to "Save As..." periodically for large/complex files to optimize the internal storage.
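That container is what Microsoft's public MS-CFB specification calls the Compound File Binary format, and its fixed header is easy to inspect. A minimal sketch (the magic bytes and field offsets come from the public spec; the sample header bytes below are synthetic, not from a real file):

```python
import struct

CFB_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # signature of every OLE2/CFB file

def parse_cfb_header(data: bytes) -> dict:
    """Parse a few fields of the fixed CFB header."""
    if data[:8] != CFB_MAGIC:
        raise ValueError("not an OLE2 compound file")
    # offset 26: major version, offset 28: byte-order mark, offset 30: sector shift
    major, byte_order, sector_shift = struct.unpack_from("<3H", data, 26)
    return {
        "major_version": major,            # 3 => 512-byte sectors, 4 => 4096
        "little_endian": byte_order == 0xFFFE,
        "sector_size": 1 << sector_shift,  # the power-of-two block size noted above
    }

# Synthetic header: magic + 16-byte CLSID + minor/major version + the fields above.
header = CFB_MAGIC + bytes(16) + struct.pack("<4H", 0x003E, 0x0003, 0xFFFE, 0x0009)
print(parse_cfb_header(header)["sector_size"])  # 512
```

The sector shift is exactly the power-of-two alignment the comment mentions: every sub-file inside the document is chained through FAT-style allocation tables in these fixed-size blocks.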
"OK we will standardize our serialization format"
It's... I guess malicious compliance, though also if you don't care about interop you're not going to try to abstract away your internal application structures, are you!
I appreciate the standard existing rather than it not existing. Trying to have the standard exist in this way has always felt like an uphill battle, and at least now there's _something_.
It's just that you will have a better time if you emulate how Office does things. But you have a bit more documentation to go along with it.
https://learn.microsoft.com/en-us/openspecs/windows_protocol...
The author only provides arguments for "self-interested negligence". He provides no counterarguments to the claim that OOXML complexity was "a plot to block third-party compatibility". Therefore, he cannot compare "negligence" and "a plot". Therefore, his claim that "negligence" is a better explanation for OOXML complexity than "a plot" cannot follow.
To restate:
> If we dig into the context of OOXML’s creation, it can be argued that harming competitors was not Microsoft’s primary aim.
The author provides no evidence to support this claim. At most, the evidence provided in this section supports the claim that "negligence" played a role in OOXML complexity. From this evidence alone, no conclusions can be drawn about the "primariness" of "negligence" vs. "harming competitors".
The author is just implicitly appealing to Occam's razor here, as people often do in the face of accusations of a plot. They can show that Microsoft has backed the ANSI accreditation of ODF[1] and eventually implemented support for ODF import and export in Office, but that's not enough to prove there was no conspiracy.
Instead, the article just provides a very plausible explanation for the complexity in OOXML. Does this explanation thoroughly disprove the accusations of a plot? Clearly not. Is it more plausible than a grand plot to crush a bunch of competitors that had no market share and kill a better standard document format that Microsoft did end up implementing in Office? Yes. This is probably as far as we can get.
[1] https://news.microsoft.com/source/2007/05/16/microsoft-votes...
I'm not saying they shouldn't do that as a company maximizing shareholder value. But we should all collectively groan every time the topic comes up, not applaud them.
The Shakespeare example is a good one where the sentence is split into multiple spans to apply style rules, yet the bare text content could be extracted by just removing all XML tags. Whereas the ODF variant is actually less recommendable, as it relies on an unnecessarily complex formatting and text-addressing language on top of XML.
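The "just remove all XML tags" claim is easy to demonstrate with the stdlib; a minimal sketch (the markup below is a made-up approximation of OOXML's styled runs, not copied from the spec):

```python
import xml.etree.ElementTree as ET

# A sentence split into multiple styled spans, roughly in the shape of w:r/w:t runs.
sample = (
    '<p xmlns:w="urn:example">'
    '<w:r><w:t>To be, </w:t></w:r>'
    '<w:r><w:t>or not </w:t></w:r>'
    '<w:r><w:t>to be</w:t></w:r>'
    '</p>'
)

# itertext() walks the tree and yields only character data, i.e. strips every tag.
text = "".join(ET.fromstring(sample).itertext())
print(text)  # To be, or not to be
```

However the runs are split for styling, the character data survives intact, which is what makes the OOXML variant tolerable for plain-text extraction.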
The article says
> Even at a glance [ODF's markup] is more intelligible. Strip the text: namespaces and it’s nearly valid HTML. The only thing that needs explaining is that ODF doesn’t wrap To be with a dedicated “bold” tag. Instead, it applies an auto-style named T1 to a <text:span>, an act of separating content and presentation that mirrors established web practices.
but this definitely makes things more complex for data exchange compared to OOXML.
> zero consideration given to third-party apps and segment formats
The reality is the opposite. COM serialization was specifically built to allow for composing components (and serializations thereof) that didn't know about each other into a single document. That's why it leans so heavily on GUIDs for names: they avoid collisions without needing coordination. That's a laudable goal, not pointless bloat. And the COM people implemented it pretty efficiently too!
> C++ data structures
What gives you that idea? Yes, the OLE stream thing was a binary format, but so is DER for ASN.1. Every webpage you load goes over a binary tagged object format not too different from OLE/COM's.
But due to a persistence of myths from the 90s, people still think of the Office binary format as "horrible" when it's actually quite elegant, especially considering the problems the authors had to solve and their constraints in doing so.
In many ways, we've regressed.
> Markup
The author of the article nails it when he says ODF is meant to be a markup language and OOXML is the serialization of an object graph. So what? Do people write ODF by hand? There are countless JSON formats just as inscrutable as MSO's legacy streams.
Anyway, the idea that the MSO binary format was crap because it was binary, lazy, and represented a "memory dump" is an old myth that just won't die. It wasn't a memory dump, it wasn't lazy, and it wasn't crap. Yes, there are real problems with some of the things people put inside the OLE container, but it's facile and wrong to blame the container or the OLE stream composition model for the problem.
Another XML standard from MS that also seems relatively simple is XPS, a PDF alternative. But it uses Open Packaging and that is somewhat hard to read.
A better format would have made us geeks a lot happier, but the average user just wants things to work the way they always have.
The XML version likely carries a lot of baggage having to be compatible with that.
Binary MS Office format is a phenomenal piece of engineering to achieve a goal that's no longer relevant: fast save/load on late-'80s hard drives. Other programs took minutes to save a spreadsheet; Excel took seconds. It did this by making sure its in-memory data structures for a document could be dumped straight to disk without transformation.
But yes, this approach carries a shitton of baggage. And that achievement is no longer relevant in a world where consumer hardware can parse XML documents on the fly.
I have heard it argued, though, that the "baggage" isn't the file format. It's actually the full historical featureset of Excel. Being backwards-compatible means being able to faithfully represent the features of old Excel, and the essential complexity of that far outweighs the incidental complexity of how those features were encoded.
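The dump-straight-to-disk style is concrete in the old Excel binary format (BIFF), which is a stream of [type, length, payload] records. A hedged sketch of walking such a stream (the 2-byte-type/2-byte-length framing matches BIFF; the record IDs and payloads below are invented for illustration):

```python
import struct

def iter_biff_records(stream: bytes):
    """Yield (record_type, payload) from a BIFF-style stream:
    each record is a 2-byte type, a 2-byte length, then `length` payload bytes."""
    pos = 0
    while pos + 4 <= len(stream):
        rec_type, length = struct.unpack_from("<HH", stream, pos)
        pos += 4
        yield rec_type, stream[pos:pos + length]
        pos += length

# Two invented records: type 0x0001 with a 4-byte payload, type 0x0002 with none.
demo = struct.pack("<HH", 0x0001, 4) + b"\x2a\x00\x00\x00" + struct.pack("<HH", 0x0002, 0)
records = list(iter_biff_records(demo))
print([hex(t) for t, _ in records])  # ['0x1', '0x2']
```

Because each payload mirrors an in-memory structure, saving was nearly a sequence of raw writes, which is exactly the speed trick described above, and exactly the baggage the XML translation had to carry.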
When I was a kid, making cool WordArt headers for school projects was like 50% of what we used Office for.
[0]: https://support.microsoft.com/en-us/office/insert-wordart-c5...
It might be a Swedish thing, but I always laugh when I see them. Not nearly as common today as ten years ago, but I see them a couple of times a year.
https://en.wikipedia.org/wiki/Object_Linking_and_Embedding
where you could embed an Excel spreadsheet inside a Word document, or actually embed any of a large range of COM objects into a Word document, which on one hand is a really appealing vision, but on the other hand means you have to have, and be able to run, all the binaries for all the objects that live in a document, which ties the whole thing to Windows.
PDF is a different sort of document format which privileges viewing over editing but it is also really about serializing an object graph when it comes down to it and then having various sorts of filters and transformations and a range of objects defined in the spec as opposed to open ended access to an object library.
This kind of system has a lot of overlap with the serdes problem you get with RPC frameworks that used to be filed under "Sun RPC sucks", "DCOM sucks", "CORBA sucks" and "WS-* sucks". Those things are mostly forgotten these days because, well... they sucked, and now the usual complaint is "protobuf sucks", but you rarely hear "JSON sucks" because it gave up on graphs for trees. If you don't have a type system, people can't say the type system sucks, and the only thing that really sucks about it is that people won't just use ISO 8601 dates, but you can always rise above that by just using ISO 8601 dates without asking for permission. But we all agree YAML sucks.
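Using ISO 8601 without asking permission really is a one-liner; a minimal sketch in Python:

```python
import json
from datetime import datetime, timezone

# Serialize timestamps as ISO 8601 strings instead of an ad-hoc format.
created = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
payload = json.dumps({"created": created.isoformat()})
print(payload)  # {"created": "2024-01-02T03:04:05+00:00"}

# Any consumer can round-trip it with the stdlib alone.
parsed = datetime.fromisoformat(json.loads(payload)["created"])
assert parsed == created
```

No shared schema or type system needed, which is the whole appeal over the RPC frameworks listed above.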
That points to any flexible document format sucking, but OOXML also sucks because it has lots of poorly specified and obscure features that amount to "format this the same way Word 95 formatted it if you used a certain obscure option".
From a glass is half empty perspective it sucks because it's close to impossible to make a Microsoft Office replacement that renders 100% of documents 100% correctly.
From a glass is half full perspective it rules because if you want to make a Python script that writes an Excel spreadsheet with formulas, it is easy. If you want to extract the images out of a Word document, it is easy, because a Word document is just a ZIP file. If you want to do anything with an OOXML document short of writing an Office replacement, it's actually a pretty good situation.
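The "a Word document is just a ZIP file" point in miniature (the in-memory archive below is a synthetic stand-in; in a real .docx the images conventionally live under `word/media/`):

```python
import io
import zipfile

# Build a toy .docx-shaped ZIP in memory: a document part plus one "image".
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", "<w:document/>")
    z.writestr("word/media/image1.png", b"\x89PNG fake bytes")

# Extracting the images is just listing the archive and filtering by path.
with zipfile.ZipFile(buf) as z:
    images = [n for n in z.namelist() if n.startswith("word/media/")]
    data = z.read(images[0])

print(images)  # ['word/media/image1.png']
```

Point the same two `with` blocks at a real `.docx` path instead of `buf` and the extraction loop is unchanged.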
Except it also spawned a thousand custom formats that include $ref support of some type, so we are right back to having graphs. :-D
(I wonder what the specification-pages-to-man-years ratio is...)
That doesn't count the various times where it behaved weirdly or inconsistently, had fields/tables that were impossible to edit, etc. I've had to completely recreate everything a couple times over the years. That's just one document, for one guy, that I don't really touch that often.
Say what you will about Firefox vs Chrome in terms of usability; compared to MS Word, using LibreOffice is worse than early betas of Netscape Navigator 4.0. It's both impressive and upsetting. OnlyOffice at least looks nicer, even if it doesn't really function any better. MS's online version of Word in the browser operates more consistently than either.
Microsoft is just dominant and exporting its 40-year-old legacy codebase as a spec. The LibreOffice team is frustrated that the for-profit model is beating the OSS model and crying foul over mostly necessary complexity. If LibreOffice started from scratch they'd probably appreciate how much Microsoft serializes, because a sufficiently complicated document saved to .docx basically provides a reference implementation.
We do need for-profit alternatives to Word, and I’m working on one in legal.
[edit: I hope to put some real thoughts on this down soon, but most of the wonkiness emanates from evolving functionality and varying trends in best practices over the decades. I’ve implemented a fair bit of the spec here: https://tritium.legal, but most of the hard part is providing for bidi language support, fonts, real-time editing and re-rendering, UI and annotations like spellchecking and grammar, not conforming to the markup spec. Spec conformance is just polish and testing. A performant modern word processor of any spec, however, is a technological achievement on the order of a web browser.]
Google completely flipped the game and then cloud collaboration became everything.
Wow, big undertaking!
What we really need, though, is a for-profit alternative to Excel, that's not Google. I think Excel is more of the Killer App than Word has ever been.
What do you mean though? Libreoffice wrote their application from scratch, did they not? And they managed to implement a superior serialization format, did they not? And they managed to get that format standardized without bribing and cheating, did they not?
What you're saying is akin to "those residents of banana republics are just frustrated that capitalism (and a little help from the CIA) is beating democracy"
> We do need for-profit alternatives to Word
Why does it have to be for profit?
For all the hate people gave CSS, it was/is fantastic at its job. Word documents are an example of how you don't design a document format, and how when a for-profit org designs a thing (instead of standards and market pressures), you get a technological monstrosity...
To be clear, I don't think LibreOffice is great. Part of their issue is that they were built as a way to "not pay" for Office, and it turns out that no, volunteers don't really do a better job at implementing 1000 pages of nonsense than the people who came up with that spaghetti code in the first place...
We don't need that software anymore, though. If you use it, know we are looking at you like you are pulling out a physical paper phonebook to store your numbers in, or, less hurtfully but just as topically, a record or CD player... it is dinosaur technology that pretty much has no place in today's world...
So, they have a point, I don't disagree with them, however it probably would be better just to "admit defeat", get MS to open source their code for compat reasons, and work on something new that's not trying to write viruses on your computer better than paragraphs...
But very little of this complexity is necessary for a standard interoperable document file format. The background was that the EU started pushing for a standardized document exchange format, and several governments started implementing regulations requiring the use of this format — Microsoft now had some very big customers which urgently needed a feature: a standard document file format. Microsoft _could_ have implemented and submitted a new format that doesn't slavishly reflect their in-memory object graph and legacy issues. Or they could even have just adopted ODF (shudder). But they chose the easy way, because, frankly, they probably just didn't have the time. They took the accidental complexity which was the hot mess of Microsoft Office internals (like a buggy date format) and serialized it to disk. It was never an ideal solution, but it was quick to implement.
That's just a classic case of technical debt: Microsoft needed to deliver a feature fast, and they were willing to make compromises. The crazy political shenanigans Microsoft had executed to standardize their technical debt are ironically just another form of accidental complexity.
> Thus, the primary goal for this new format wasn’t to be elegant, universal, or easy to implement; it was to placate regulators while preserving Microsoft’s technological and commercial advantages.
That sounds quite anti-competitive to me
Let's take a look at this "for-profit model" - is it just higher price outweighed by better product? lol:
Microsoft, after getting beat up in the press for making propietary extensions to the Kerberos protocol, has released the specifications on the web -- but in order to get it, you have to run a Windows .exe file which forces you agree to a click-through license agreement where you agree to treat it as a trade secret, before it will give you the .pdf file. Who would have thought that you could publish a trade secret on the web? - https://slashdot.org/story/00/05/02/158204/kerberos-pacs-and...
Back in 2001, Be, Inc. managed to get BeOS pre-installed on one computer model from Hitachi. Just one. On the entire PC market. Microsoft forced Hitachi to drop the bootloader entry to hide BeOS from customers buying it. They enforced their monopoly over the only possible niche BeOS could find on the PC market, crushing Be, Inc. in the process. - https://www.haiku-os.org/blog/mmu_man/2021-10-04_ok_lenovo_w...
So why aren't there any dual-boot computers for sale? The answer lies in the nature of the relationship Microsoft maintains with hardware vendors. More specifically, in the "Windows License" agreed to by hardware vendors who want to include Windows on the computers they sell. This is not the license you pretend to read and click "I Accept" to when installing Windows. This license is not available online. This is a confidential license, seen only by Microsoft and computer vendors. You and I can't read the license because Microsoft classifies it as a "trade secret." The license specifies that any machine which includes a Microsoft operating system must not also offer a non-Microsoft operating system as a boot option. In other words, a computer that offers to boot into Windows upon startup cannot also offer to boot into BeOS or Linux. The hardware vendor does not get to choose which OSes to install on the machines they sell -- Microsoft does. - https://birdhouse.org/beos/byte/30-bootloader/
ISO/IEC 29500 should be open to evolution, no? Just like all the open collaboration on it before it was confirmed as a standard.
The use of namespaces is also incredibly annoying: as far as I can tell, in every XML library I can find, they really aren't well supported for that "human readable" component.
When you crack open the file it feels like you are going to be able to find everything you need with an xpath like //w:t but none of the xml parsers I've found cope well with the namespaces.
In Python, the `find`, `findall`, etc. methods take a namespace dictionary. E.g.
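With `xml.etree.ElementTree`, passing a prefix-to-URI dict lets you keep the readable `w:t` form in the query. A minimal sketch (the `w:` URI is WordprocessingML's main namespace; the document below is a toy):

```python
import xml.etree.ElementTree as ET

W_NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

doc = ET.fromstring(
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body></w:document>"
)

# The dict maps the "w" prefix to its URI, so the XPath-like ".//w:t" resolves.
texts = [t.text for t in doc.findall(".//w:t", W_NS)]
print(texts)  # ['Hello']
```

Without the dict you'd have to spell out the fully qualified `{http://...}t` tag name on every query, which is what makes raw OOXML feel so hostile.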
In C# you can do:
In Java you need to enable a namespace-aware flag in the settings to get namespaces to work. I can't recall off-hand how to do that.
If you take the idea that it is "artificially complex, because they actively added complexity", then I can see how that isn't quite right. But "artificially complex" can also allow for "because they actively avoided the effort to remove complexity." In which case, we are back to the same spot? But in agreement this time?
Very, very few people care about openness. Maybe a few hundred. Tens of millions care about docx capturing exactly what their doc files had.
Microsoft made the correct choice.
It's essentially a serialization of the binary format to XML.
ODF 1.4 is around 1,100 pages across all 4 parts whereas OOXML is over 6,000.
[1] https://stephesblog.blogs.com/my_weblog/2007/08/microsofts-f...
[2] https://ooxmlisdefectivebydesign.blogspot.com/2007/08/micros...
[3] https://www.robweir.com/blog/2007/01/how-to-hire-guillaume-p...
You can see it for yourself here (in Part 4): https://ecma-international.org/publications-and-standards/st...