YAML is not a superset of JSON

(patrickstevens.co.uk)

59 points | by Smaug123 126 days ago

13 comments

  • DougBTX 126 days ago
    This article claims that 1e2 is interpreted as a string, while this other article on the front page[0] claims that 556474e378 is interpreted as a number. What's correct?

    The YAML "Scalars" section[1] says:

    > A few examples also use the int, float and null types from the JSON schema.

    And includes these examples:

        canonical: 1.23015e+3
        exponential: 12.3015e+02
    
    So, is the "+" required here or not? Is a YAML parser buggy if it doesn't parse all JSON numbers as numbers?

    Edit: Ah, further on, it says:

        Canonical Form
        Either 0, .inf, -.inf, .nan or scientific notation matching the regular expression
        -? [1-9] ( \. [0-9]* [1-9] )? ( e [-+] [1-9] [0-9]* )?
    
    The example 1e2 clearly matches this regex, so his YAML parser is broken.

    Edit edit:

    In YAML 1.1, there were separate definitions of float[2] and int[3] types, where only floats support "scientific" notation, and must have a ".", unlike JSON.

    So this article is talking about YAML 1.1, while the other article is talking about YAML 1.2.

    [0] https://news.ycombinator.com/item?id=41498264

    [1] https://yaml.org/spec/1.2.2/#23-scalars

    [2] https://yaml.org/type/float.html

    [3] https://yaml.org/type/int.html

    • jmillikin 126 days ago
      What's "correct" depends on whether your YAML parser defaults to YAML 1.1 or 1.2.

      Most YAML parsers default to 1.1 for compatibility reasons, because if they default to 1.2 then existing YAML documents expecting 1.1 behavior will be parsed incorrectly.

      YAML is a difficult language to parse if you care about getting the correct data.

      • soco 126 days ago
        Which only goes to prove that every replacement for XML eventually ran into exactly the same complexity issues it was meant to solve in the first place. Another example of why "just use..." is a losing approach.
    • n2d4 126 days ago
      > The example 1e2 clearly matches this regex, so his YAML parser is broken.

      1e2 does not match this regex. 1e+2 or 1e-2 would, though.
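
      This is easy to check mechanically. A minimal sketch, assuming Python's re module and the pattern exactly as quoted above (re.VERBOSE makes the spaces in the spec's notation insignificant; the helper name is made up for illustration):

```python
import re

# The canonical-form pattern quoted above. re.VERBOSE ignores the
# whitespace that the spec uses to separate the pieces.
CANONICAL_FLOAT = re.compile(
    r"-? [1-9] ( \. [0-9]* [1-9] )? ( e [-+] [1-9] [0-9]* )?",
    re.VERBOSE,
)


def is_canonical(text: str) -> bool:
    """Return True if the whole string matches the canonical pattern."""
    return CANONICAL_FLOAT.fullmatch(text) is not None


for candidate in ["1e2", "1e+2", "1e-2", "1.23015e+3", "12.3015e+02"]:
    print(candidate, is_canonical(candidate))
```

      Only the forms with an explicit exponent sign and a single leading digit pass, so 1e2 and 12.3015e+02 are both rejected by the canonical pattern.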

      • lifthrasiir 126 days ago
        The "canonical" form indeed requires an exponent sign, but the tag resolution process in YAML 1.2 (section 10.2.2) does allow its omission. That said, the equivalent specification in YAML 1.1 [1] does require an exponent sign in any case! It should be no surprise that YAML 1.1 isn't a superset of any version of JSON, but I don't know whether this is intentional or simply an oversight.

        [1] https://yaml.org/type/float.html
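
        To make the difference concrete, here is a rough sketch in Python. Both patterns are my own transcription (base-10 forms only) of the YAML 1.1 float type page and of the float rule in section 10.2.2 of the 1.2 spec, so treat them as an assumption to check against the originals rather than as authoritative; the function name is made up.

```python
import re

# Assumption: hand-transcribed from the YAML 1.1 float type page,
# base-10 form only (no .inf/.nan, no sexagesimal notation).
YAML_1_1_FLOAT = re.compile(r"[-+]?([0-9][0-9_]*)?\.[0-9.]*([eE][-+][0-9]+)?")

# Assumption: hand-transcribed from the YAML 1.2 JSON schema float rule.
YAML_1_2_JSON_FLOAT = re.compile(
    r"-? ( 0 | [1-9] [0-9]* ) ( \. [0-9]* )? ( [eE] [-+]? [0-9]+ )?",
    re.VERBOSE,
)


def matches_float(text: str) -> dict:
    """Report whether each version's float pattern matches the whole string."""
    return {
        "1.1": YAML_1_1_FLOAT.fullmatch(text) is not None,
        "1.2": YAML_1_2_JSON_FLOAT.fullmatch(text) is not None,
    }


for candidate in ["1e2", "1.0e2", "1.0e+2"]:
    print(candidate, matches_float(candidate))
```

        Under these patterns the 1.1 rule needs both a "." and an exponent sign, while the 1.2 rule needs neither, which is consistent with 1e2 coming back as a string from a 1.1 parser.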

        • n2d4 126 days ago
          YAML 1.2.2 still isn't a superset of JSON, because it requires all keys to be unique:

          > The content of a mapping node is an unordered set of key/value node pairs, with the restriction that each of the keys is unique

          • lifthrasiir 126 days ago
            While JSON doesn't prevent duplicate keys per se, it doesn't fully specify its semantics anyway and only states that duplicate keys are less "interoperable". And there is an explicit profile of JSON with this requirement (I-JSON [1]), so YAML 1.2.2 can be said to be a superset of some version of JSON.

            [1] https://datatracker.ietf.org/doc/html/rfc7493
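
            As an illustration of how loosely this is handled in practice, a small sketch with Python's standard json module: by default it accepts duplicate keys and keeps the last value, and a stricter I-JSON-like behaviour has to be opted into. The helper names below are made up for the example.

```python
import json

DOC = '{"a": 1, "a": 2}'

# Default behaviour: duplicate keys are accepted and the last value wins.
print(json.loads(DOC))


def reject_duplicates(pairs):
    """object_pairs_hook that raises when a key repeats within one object."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


def strict_loads(text):
    """Parse JSON text, refusing objects that contain duplicate keys."""
    return json.loads(text, object_pairs_hook=reject_duplicates)


try:
    strict_loads(DOC)
except ValueError as exc:
    print("rejected:", exc)
```

            Both behaviours are allowed for a JSON parser, which is exactly the ambiguity being discussed.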

            • n2d4 126 days ago
              "X is a superset of some subset of Y" is a weak statement.

              This is not about semantics, it's about grammar. While it's fair to say that JSON "usually" is valid YAML, it's still good to be strict about it, because the existence of a single counterexample can be used maliciously.

              • lifthrasiir 126 days ago
                Agreed, though it is more like "X is a superset of some common but not de jure interpretation of Y". The real culprit is the ambiguity of Y...
          • seba_dos1 125 days ago
            A JSON implementation that rejects non-unique keys is a valid JSON implementation, so a valid YAML 1.2 implementation is still a valid JSON implementation.
            • kazinator 125 days ago
              Some things in the world of JSON depend on order, too. There is a Microsoft way of indicating the type of an object using a key that must be first in the JSON syntax.
              • seba_dos1 125 days ago
                Same situation as above - a JSON parser that ignores order is a valid JSON implementation.
                • kazinator 125 days ago
                  And such a parser could work with that convention. However, if its complementary formatter printed the objects such that that type field doesn't appear first, it wouldn't be correctly implementing the specification.
    • xelxebar 126 days ago
      > So this article is talking about YAML 1.1, while the other article is talking about YAML 1.2.

      Precisely. The article is really noticing quirks and limitations of libyaml, the library doing the heavy lifting behind PyYAML, not YAML-the-spec proper.

      Granted, in practice, library limitations are probably what you want to know about. AFAIK, libfyaml[0] (not libyaml) is the most spec-compliant library around. It's a shame more downstream languages aren't using it.

      [0]:https://github.com/pantoniou/libfyaml

    • endycm 126 days ago
      No, the YAML parser is a valid YAML 1.1 parser, where this behaviour is totally correct and in spec.
  • wolframhempel 126 days ago
    I feel there's a place for YAML and JSON and that they are quite different:

    - YAML is for files written or edited by humans - e.g. configuration files for servers with lots of comments and explanations, but generally quite simple key value pairs or basic data structures

    - JSON is for files written and consumed by machines. It allows for complex, nested data structures and types, but requires technical knowledge to use.

    The problem arises once you start confusing these usecases. I'd argue that once you start writing `'{"a": 1e2}'` in YAML you're quite far outside of its ideal use. I appreciate that feature creep might lead to overly complex configuration files (I remember editing XML config that allowed you to specify for-loops in my earlier days), but really, at a certain point it might be worth taking a step back and reflecting if you're still using the right tool for the right job.

    • kajika91 126 days ago
      I agree about the distinction you made from the human perspective and you put it well.

      I have always hated YAML, and still do to this day, because I cannot write a YAML file: the indentation makes no sense to me and the list syntax is black magic (you actually have several ways to write lists, and once again the indentation implications are obscure). So while I agree with the goal of being written/edited by humans, from my perspective it fails at it.

      Also 1e2 might not be the best example as this is just 100, but as someone who had to pass a lot of neural network training hyper-parameters: passing 1e-3 and so on is definitely a use-case. I am on the negative values Xe-XX (and YAML 1.1 would parse it OK) but I guess other domains could also use the positive side to pass values (maybe as upper limits like 1e5 or so).

      I think the YAML format should have parsed those number formats from the beginning. If this is fixed now, good job, hopefully the default yaml parsers are going to be "fixed". I would still use TOML over YAML any day, waiting for a human-json (some already exist) to be popularized one day.

    • mjevans 126 days ago
      Offhand, YAML is non-printing-space sensitive, while JSON is not (it's syntax sensitive though). Looking it up, tabs are not supported, which is a HUGE detriment for my use cases. (I'm nearly religious on loving tabs and user defined width of single character indent levels.)

      A minor change of nesting is easy to accomplish in JSON, while YAML practically requires support from a text editor. JSON _can_ be rendered in a pretty print fashion for easier human editing, and it should still parse correctly irrespectively of how additional non-printing-space is added.
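
      A small sketch of that last point, assuming Python's standard json module (the config value is invented for the example): the same data is written once with one tab per indent level and once with no whitespace at all, and both parse back to the same value.

```python
import json

# Hypothetical config value, only here to have something nested.
config = {"server": {"host": "localhost", "ports": [80, 443]}}

# json.dumps accepts a string for indent, so each level can be one tab.
tabbed = json.dumps(config, indent="\t")

# The most compact spelling: no whitespace between tokens.
compact = json.dumps(config, separators=(",", ":"))

# Whitespace between tokens carries no meaning in JSON, so both
# spellings round-trip to the same data.
assert json.loads(tabbed) == json.loads(compact) == config

print(tabbed)
print(compact)
```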

    • eviks 126 days ago
      Why would you need text format when only machines are involved?
      • ozim 126 days ago
        Because humans are involved at multiple points: creating context for the values, and later debugging them or changing their meaning. The fun part is when you integrate with systems where you don't even talk to the other party - you still need the documentation, but seeing how much JSON over HTTP is used, it is rather clear why we need a text format.
        • eviks 126 days ago
          Not sure what context creation means here that requires machines to speak text to each other, and you could debug the converted human-readable format. And JSON is also used much for app user-editable config despite lacking such trivial things as comments, so prevalence may signal some need, but not necessarily the need for the thing used (the awful XML is also widely used)
    • conradfr 126 days ago
      Yes my only exposure to YAML is for Symfony configuration and it's way better than JSON for this.
  • zquestz 126 days ago
    I am confused, when was YAML ever considered a superset of JSON?

    These are completely different serialization formats.

    • pletnes 126 days ago
      By many people, e.g https://learning.sap.com/learning-journeys/build-side-by-sid...

      But the yaml docs said it’s accidental

      https://yaml.org/spec/1.2.2/

      «The YAML 1.1 specification was published in 2005. Around this time, the developers became aware of JSON. By sheer coincidence, JSON was almost a complete subset of YAML (both syntactically and semantically).»

      • zquestz 126 days ago
        Thanks. Didn't realize anyone thought this. I checked the YAML docs when I saw this post, and it didn't seem like an official feature.

        The SAP article definitely states it, that's the first time I have seen it described that way.

        • pletnes 126 days ago
          I’ve seen the claim around, and I actually believed it, to be honest. I found that article by googling. I guess it’s still true enough for many cases.
    • andreareina 126 days ago
      I don't remember if it was an official talking point, but "you can load json on a yaml parser and it'll read the same" is a claim I've seen around.
    • sorcix 126 days ago
      The YAML 1.2 spec states:

      > YAML can therefore be viewed as a natural superset of JSON, offering improved human readability and a more complete information model.

      • jorams 126 days ago
        The YAML 1.0 spec says no such thing, it doesn't even mention JSON. Neither does the YAML 1.1 spec. The YAML 1.2.0 and 1.2.1 specs do say exactly that. 1.2.2 no longer does, but it reiterates that the primary focus of YAML 1.2 was to make it a strict superset of JSON.
        • sorcix 126 days ago
          You're right. I must have clicked the wrong link, the YAML 1.0 spec doesn't mention JSON. The quote was from 1.2, thanks for pointing that out!
      • zquestz 126 days ago
        However this wasn't really true till 1.2 if I am reading the spec correctly. Plus many parsers default to 1.1. =\
    • jmillikin 126 days ago
      I wrote a similar article two years ago, with some screenshots of people claiming YAML is a superset of JSON: https://john-millikin.com/json-is-not-a-yaml-subset

      It's one of those beliefs that seems like it should be true, but isn't for obscure technical reasons.

      • jorams 126 days ago
        Your post's remark about YAML 1.2 being opt-in with a "%YAML 1.2" directive is true for the parser you are using (LibYAML), but is not compliant with the 1.2 spec. The spec specifies it should assume 1.2 and 1.1 should be opt-in.
        • jmillikin 126 days ago
          It's true for all of the parsers that I tested at the time, which covered a couple different languages (Ruby, Python, C, Haskell).

          It appears to still be true in at least Ruby and Python, which are probably the two most popular languages to write YAML-consuming programs in:

            $ irb -v
            irb 1.3.5 (2021-04-03)
            $ irb
            irb(main):001:0> require 'yaml'
            => true
            irb(main):002:0> YAML.load '{"a": 1e2}'
            => {"a"=>"1e2"}
            irb(main):003:0> 
          
          and

            $ python3
            Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
            Type "help", "copyright", "credits" or "license" for more information.
            >>> import yaml
            >>> yaml.safe_load('{"a": 1e2}')
            {'a': '1e2'}
            >>> 
          
          --

            > The spec specifies it should assume 1.2 and 1.1 should be opt-in.
          
          The spec for 1.2 says that, but that's the reason parsers can't upgrade to 1.2, because changing the default version will cause backwards-incompatible parsing changes. Without the version directive there's no way for the parser to know which version was intended, so it defaults to 1.1, so people writing YAML will write YAML 1.1 documents, because that's what the parsers expect.

          The only way YAML 1.2 is going to displace older versions is in a greenfield ecosystem that has all its tools using YAML 1.2 from the beginning, but that requires an author who both (1) cares a lot about parser correctness, and (2) wants to use YAML as a config syntax, which isn't a large population.

          • jorams 126 days ago
            It is indeed true for LibYAML, which is used by Python, Ruby, and the Haskell yaml package. The majority of YAML implementations I've seen that aren't based on LibYAML implement 1.2. I'm pointing it out because it's highly implementation dependent.
    • xelxebar 126 days ago
      Not only does the spec say that, but it was an explicit goal of Ingy döt Net, the designer of YAML.
    • trueismywork 126 days ago
      Official YAML spec says so.
  • valenterry 126 days ago
    Please just drop YAML. Stop using it. Especially if you are devops and building a new CICD solution or configuration.

    If you must use some kind of configuration language, at least use Dhall or something more sane. Thank you in advance!

  • endycm 126 days ago
    Author has no clue, but I don't blame him.

    There are - similar to the JSON insanity - multiple YAML standards.

    YAML 1.1 and 1.2+ are the important ones, as the "superset" argument is only valid since 1.2.

    HOWEVER PyYAML is a YAML 1.1 parser: https://pypi.org/project/PyYAML/#description

    This also can be responsible for many security problems, as ppl will assume things about JSON and YAML, but don't worry about which of the 8 different JSON standards / YAML implementations they use.

    • n2d4 126 days ago
      YAML 1.2.2 is also not a JSON superset, as keys are required to be unique:

      > The content of a mapping node is an unordered set of key/value node pairs, with the restriction that each of the keys is unique

      https://yaml.org/spec/1.2.2/#representation-graph

      • jorams 126 days ago
        YAML says keys MUST be unique, while JSON says they SHOULD be unique. It is therefore possible to produce technically valid JSON documents that aren't valid YAML, so it is in that sense not a strict superset. At the same time those documents aren't portable across JSON implementations either. Rejecting duplicate keys is a valid implementation of the JSON spec.
  • xelxebar 126 days ago
    Ouch. That tab error is a bug in libyaml, the YAML loader used by PyYAML. It's still stuck on YAML 1.1, but the spec explicitly allows tabs as whitespace, though they don't count as indentation semantics.[0]

    The same is true in YAML 1.2, so it's not just a legacy thing, either.

    [0]:https://yaml.org/spec/1.1/#id893482

    • Brian_K_White 125 days ago
      Good god, indentation matters, yet some whitespace counts and some whitespace doesn't?

      What the ever loving hell.

  • jayceedenton 126 days ago
    This seems pedantic, but I suppose anyone using the term 'superset' is inviting the pedantry.

    For almost all intents and purposes, if you are asked to create a YAML file then you can choose JSON as your syntax instead, because your file will be understood by the YAML parser. The benefit being that JSON has far fewer quirks and edge cases.

    It's comical that when people get confused with YAML (which is often) they convert their YAML snippet to JSON to see what's really going on. YAML is horrible for humans to write. Let's just use JSON, the sane syntax, instead. A few extra parens and quotes is really no big deal, and it's far easier to read unambiguously.

    • roenxi 126 days ago
      If we're not being pedantic, YAML has almost nothing to do with JSON. A typical YAML file and a typical JSON file have no syntactical overlap at all. Practically speaking, YAML parsers are expected to also parse JSON despite it being an otherwise unrelated format. The entire idea that it is a "superset" is misleading. This isn't a C -> C++ transition.
      • seba_dos1 125 days ago
        YAML 1.2 is actually a superset of JSON. C++ is not and never was a superset of C.
    • lifthrasiir 126 days ago
      It is not only for the pedantry, unfortunately. A subtle input that is a valid JSON but an invalid YAML may cause all sort of problems at any level, with some security implications.
  • kkfx 126 days ago
    They are both ideas to make XML easier, more manageable, as XML was for SGML, and well... They are damn failures. Modern YAML and JSON monsters clearly prove this.

    If we want human readability we should accept not being language agnostic, or simply use a programming language directly for data as well (as Lisp teaches); if we want a textual, language-agnostic data exchange format then we should ignore human readability and maybe stick with XML.

  • masfoobar 126 days ago
    I have never been fond of yaml. Guess it really depends on what you have been accustomed to. Then again, if that were true I would be defending xml over json.

    I guess there is a place for all... but I do wish s-expressions got more love. :-)

  • nvader 126 days ago
    YAML is the Papyrus of serialization formats.

    Great for cutesy human-readable applications but nothing workhorse. Just as I'd happily use Papyrus for invitations to a neighborhood barbecue but not a resume.

  • widforss 126 days ago
    This should 100 % be named YAML is almost a superset of JSON. TIL.
  • msoad 126 days ago
    `'{"a": 1e2}'` is YAML not JSON. What a weird argument.
    • jmillikin 126 days ago
      That string is a valid JSON document, and it's also a valid YAML document.

      The problem is that it parses to different values in those two syntaxes.
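
      For the JSON half of that comparison, a minimal sketch with Python's standard json module; the YAML half is the PyYAML session quoted elsewhere in this thread, which returns the string '1e2' for the same input.

```python
import json

DOC = '{"a": 1e2}'

# A JSON parser reads 1e2 as a number.
value = json.loads(DOC)["a"]
print(type(value).__name__, value)

# The thread reports that a YAML 1.1 parser returns the string "1e2" for
# the same document, so a type check like this one would fail there.
assert isinstance(value, float)
```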

      • Timwi 126 days ago
        > That string is a valid JSON document

        No it is not?... A valid JSON document cannot start with ` or with '. You could argue that the poster added ` in the hopes of getting it formatted, but not the '.

        • jmillikin 126 days ago
          The string being discussed is:

            {"a": 1e2}
          
          There are no single quotes or backticks in it. Those are an artifact of posting on Hacker News.
  • trueismywork 126 days ago
    Technically right but useless. When people talk about superset, they rarely are considering code formatting or parsing.
    • AdhemarVandamme 126 days ago
      They might be rarely considering code formatting specifically, but the claim that one language is a superset of another really does imply that all valid instances of the latter are also valid instances of the former (and are similarly parsed).

      The old spec of YAML 1.2 section 1.3 explicitly said:

      YAML can therefore be viewed as a natural superset of JSON, offering improved human readability and a more complete information model. This is also the case in practice; every JSON file is also a valid YAML file.

      The revised spec 1.2 revision 1.2.2 (2021-10-01) no longer contains that sentence; but still says, in section 1.2:

      The YAML 1.2 specification was published in 2009. Its primary focus was making YAML a strict superset of JSON.

      and in section 6.8.1:

      Note that version 1.2 is mostly a superset of version 1.1, defined for the purpose of ensuring JSON compatibility.

      Given all these claims, Patrick Stevens’ observations that YAML really isn’t a superset of JSON, because YAML can’t handle all JSON number literals, and tabs as whitespace, really is surprising. At least to me.

      When previously JavaScript/ECMAScript 2018 was found not to be a JSON superset, at least it was about unescaped occurrences of little-used characters U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR in string literals. And even that got fixed (by allowing the unescaped characters) in ECMAScript 2019.

      [YAML 1.2]: https://yaml.org/spec/1.2-old/spec.html#id2759572
      [YAML 1.2 revision 1.2.2]: https://yaml.org/spec/1.2.2/
      [ECMAScript 2019 feature Subsume JSON]: https://v8.dev/features/subsume-json
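
      For anyone who wants to see the JSON side of that ECMAScript edge case, a small sketch using Python's standard json module: the two separator characters may appear unescaped inside a JSON string, which is what pre-2019 JavaScript string literals could not accept.

```python
import json

LINE_SEPARATOR = "\u2028"
PARAGRAPH_SEPARATOR = "\u2029"

# Build JSON text containing the raw characters, not \u escapes.
raw_json = '"' + LINE_SEPARATOR + PARAGRAPH_SEPARATOR + '"'

# A JSON parser accepts them unescaped inside a string.
decoded = json.loads(raw_json)
print(len(decoded), [hex(ord(ch)) for ch in decoded])

# With ensure_ascii=False the serializer writes them back unescaped too.
assert json.dumps(decoded, ensure_ascii=False) == raw_json
```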

      • seba_dos1 125 days ago
        It stops being surprising when you realize the article refers to a YAML 1.1 parser.