This article claims that 1e2 is interpreted as a string, while this other article on the front page[0] claims that 556474e378 is interpreted as a number. What's correct?
The YAML "Scalars" section[1] says:
> A few examples also use the int, float and null types from the JSON schema.
And includes these examples:
canonical: 1.23015e+3
exponential: 12.3015e+02
So, is the "+" required here or not? Is a YAML parser buggy if it doesn't parse all JSON numbers as numbers?
Edit: Ah, further on, it says:
Canonical Form
Either 0, .inf, -.inf, .nan or scientific notation matching the regular expression
-? [1-9] ( \. [0-9]* [1-9] )? ( e [-+] [1-9] [0-9]* )?
The example 1e2 clearly matches this regex, so his YAML parser is broken.
Edit edit:
In YAML 1.1, there were separate definitions of float[2] and int[3] types, where only floats support "scientific" notation, and must have a ".", unlike JSON.
So this article is talking about YAML 1.1, while the other article is talking about YAML 1.2.
What's "correct" depends on whether your YAML parser defaults to YAML 1.1 or 1.2.
Most YAML parsers default to 1.1 for compatibility reasons, because if they default to 1.2 then existing YAML documents expecting 1.1 behavior will be parsed incorrectly.
YAML is a difficult language to parse if you care about getting the correct data.
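For comparison, JSON itself has no ambiguity here: its grammar allows an exponent with or without a sign. A quick check with Python's standard-library parser (nothing YAML-specific assumed):

```python
import json

# JSON's grammar allows an exponent with or without a sign,
# so both of these parse as the number 100.0.
doc_unsigned = json.loads('{"a": 1e2}')
doc_signed = json.loads('{"a": 1e+2}')
print(doc_unsigned, doc_signed)  # {'a': 100.0} {'a': 100.0}
```

So any YAML parser that claims JSON compatibility but hands back the string '1e2' is diverging from every conforming JSON parser.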
Which only goes to prove that every replacement for XML eventually ran into exactly the same complexity issues it set out to solve in the first place. Another example of why "just use..." is a losing approach.
The "canonical" form indeed requires an exponent sign, but the tag resolution process in YAML 1.2 (section 10.2.2) does allow its omission. That said however, the equivalent specification in YAML 1.1 [1] does require an exponent sign at any case! It should be no surprise that YAML 1.1 isn't a superset of any version of JSON, but I don't know whether this is intentional or simply an oversight.
While JSON doesn't prevent duplicate keys per se, it doesn't fully specify their semantics either, and only states that duplicate keys are less "interoperable". And there is an explicit profile of JSON with this requirement (I-JSON [1]), so YAML 1.2.2 can be said to be a superset of some version of JSON.
"X is a superset of some subset of Y" is a weak statement.
This is not about semantics, it's about grammar. While it's fair to say that JSON "usually" is valid YAML, it's still good to be strict about it, because the existence of a single counterexample can be used maliciously.
A JSON implementation that rejects non-unique keys is a valid JSON implementation, so a valid YAML 1.2 implementation is still a valid JSON implementation.
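Python's standard-library json module can illustrate both stances: by default it silently keeps the last duplicate, while the `object_pairs_hook` parameter lets you build the stricter, I-JSON-style loader that rejects duplicates (a minimal sketch):

```python
import json

def reject_duplicates(pairs):
    """object_pairs_hook that refuses duplicate keys, as a strict
    (I-JSON-style) implementation is allowed to do."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result

# Default behavior: the last duplicate silently wins.
print(json.loads('{"a": 1, "a": 2}'))  # {'a': 2}

# Strict behavior: duplicates are an error.
try:
    json.loads('{"a": 1, "a": 2}', object_pairs_hook=reject_duplicates)
except ValueError as exc:
    print(exc)  # duplicate key: 'a'
```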
Some things in the world of JSON depend on order, too. There is a Microsoft way of indicating the type of an object using a key that must be first in the JSON syntax.
And such a parser could work with that convention. However, if its complementary formatter printed the objects such that that type field doesn't appear first, it wouldn't be correctly implementing the specification.
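A sketch of how such an order-sensitive reader could look in Python; the `$type`-first convention here is modeled loosely on Json.NET's metadata-property handling, and the helper name is made up for illustration:

```python
import json

def load_typed(text):
    """Read a JSON object whose first key must be the '$type' tag.

    object_pairs_hook=list preserves document order, which a plain
    dict would hide. Hypothetical helper, for illustration only.
    """
    pairs = json.loads(text, object_pairs_hook=list)
    if not pairs or pairs[0][0] != "$type":
        raise ValueError("expected '$type' as the first key")
    type_name = pairs[0][1]
    fields = dict(pairs[1:])
    return type_name, fields

print(load_typed('{"$type": "Point", "x": 1, "y": 2}'))
# ('Point', {'x': 1, 'y': 2})
```

A formatter feeding such a reader must emit the type key first, even though JSON's data model says object members are unordered.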
> So this article is talking about YAML 1.1, while the other article is talking about YAML 1.2.
Precisely. The article is really noticing quirks and limitations of libyaml, the library doing the heavy lifting behind PyYAML, not YAML-the-spec proper.
Granted, in practice, library limitations are probably what you want to know about. AFAIK, libfyaml[0] (not libyaml) is the most spec-compliant library around. It's a shame more downstream languages aren't using it.
I feel there's a place for YAML and JSON and that they are quite different:
- YAML is for files written or edited by humans - e.g. configuration files for servers with lots of comments and explanations, but generally quite simple key value pairs or basic data structures
- JSON is for files written and consumed by machines. It allows for complex, nested data structures and types, but requires technical knowledge to use.
The problem arises once you start confusing these usecases. I'd argue that once you start writing `'{"a": 1e2}'` in YAML you're quite far outside of its ideal use. I appreciate that feature creep might lead to overly complex configuration files (I remember editing XML config that allowed you to specify for-loops in my earlier days), but really, at a certain point it might be worth taking a step back and reflecting if you're still using the right tool for the right job.
I agree about the distinction you made from the human perspective and you put it well.
I have always hated YAML, and still do to this day, because I cannot write a YAML file: the indentation makes no sense to me and the list syntax is black magic (there are actually several ways to write lists, and once again the implications of the indentation are obscure). So while I agree with the goal of being written/edited by humans, from my perspective it fails at it.
Also, 1e2 might not be the best example, as it is just 100, but as someone who has had to pass a lot of neural-network training hyper-parameters: passing 1e-3 and so on is definitely a use case. I am on the negative side, Xe-XX (and YAML 1.1 would parse it OK), but I guess other domains could also use the positive side to pass values (maybe as upper limits like 1e5 or so).
I think the YAML format should have parsed those number formats from the beginning. If this is fixed now, good job; hopefully the default YAML parsers are going to be "fixed" too. I would still use TOML over YAML any day, while waiting for a human-friendly JSON (some already exist) to be popularized one day.
Offhand, YAML is sensitive to non-printing whitespace, while JSON is not (it is syntax-sensitive, though). Looking it up, tabs are not supported, which is a HUGE detriment for my use cases. (I'm nearly religious about loving tabs and user-defined width of a single-character indent level.)
A minor change of nesting is easy to accomplish in JSON, while YAML practically requires support from a text editor. JSON _can_ be rendered in a pretty-printed fashion for easier human editing, and it should still parse correctly irrespective of how additional non-printing whitespace is added.
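The point about whitespace is easy to verify with the standard library: the same JSON document parses identically however it is indented, tabs included (a small sketch):

```python
import json

compact = '{"a":[1,2],"b":{"c":3}}'

# The identical data, re-serialized with tab indentation and newlines.
pretty = json.dumps(json.loads(compact), indent='\t')

# JSON ignores inter-token whitespace entirely, so both parse the same.
assert json.loads(compact) == json.loads(pretty)
print("equal")
```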
Because at multiple points humans are involved: creating context for the values, and later debugging or changing the meaning of the values. The fun part is when you integrate with systems where you don't even talk with the other party - you still need the documentation, but given how much JSON over HTTP is used, it is rather clear why we need a text format.
Not sure what context creation means here that requires machines to speak text to each other, and you could debug the converted human-readable format.
And JSON is also used a lot for user-editable app config despite lacking such trivial things as comments, so prevalence may signal some need, but not necessarily a need for the thing actually used (the awful XML is also widely used).
«The YAML 1.1 specification was published in 2005. Around this time, the developers became aware of JSON. By sheer coincidence, JSON was almost a complete subset of YAML (both syntactically and semantically).»
The YAML 1.0 spec says no such thing; it doesn't even mention JSON. Neither does the YAML 1.1 spec. The YAML 1.2.0 and 1.2.1 specs do say exactly that. 1.2.2 no longer does, but it reiterates that the primary focus of YAML 1.2 was to make it a strict superset of JSON.
Your post's remark about YAML 1.2 being opt-in with a "%YAML 1.2" directive is true for the parser you are using (LibYAML), but is not compliant with the 1.2 spec. The spec specifies that a parser should assume 1.2, and that 1.1 should be opt-in.
$ python3
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import yaml
>>> yaml.safe_load('{"a": 1e2}')
{'a': '1e2'}
>>>
--
> The spec specifies it should assume 1.2 and 1.1 should be opt-in.
The spec for 1.2 says that, but that's the reason parsers can't upgrade to 1.2, because changing the default version will cause backwards-incompatible parsing changes. Without the version directive there's no way for the parser to know which version was intended, so it defaults to 1.1, so people writing YAML will write YAML 1.1 documents, because that's what the parsers expect.
The only way YAML 1.2 is going to displace older versions is in a greenfield ecosystem that has all its tools using YAML 1.2 from the beginning, but that requires an author who both (1) cares a lot about parser correctness, and (2) wants to use YAML as a config syntax, which isn't a large population.
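For reference, the version can be pinned per document with a directive; a 1.2-aware parser then applies 1.2 rules to that document (whether a given library honors the directive is, as noted, implementation-dependent):

```yaml
%YAML 1.2
---
a: 1e2   # under the 1.2 core schema this resolves to the float 100.0
```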
It is indeed true for LibYAML, which is used by Python, Ruby, and the Haskell yaml package. The majority of YAML implementations I've seen that aren't based on LibYAML implement 1.2. I'm pointing it out because it's highly implementation-dependent.
This can also be responsible for many security problems, as people will assume things about JSON and YAML but won't worry about which of the 8 different JSON standards / YAML implementations they use.
YAML says keys MUST be unique, while JSON says they SHOULD be unique. It is therefore possible to produce technically valid JSON documents that aren't valid YAML, so it is in that sense not a strict superset. At the same time those documents aren't portable across JSON implementations either. Rejecting duplicate keys is a valid implementation of the JSON spec.
Ouch. That tab error is a bug in libyaml, the YAML loader used by PyYAML. It's still stuck on YAML 1.1, but even that spec explicitly allows tabs as whitespace, though they don't count for indentation purposes.[0]
The same is true in YAML 1.2, so it's not just a legacy thing, either.
This seems pedantic, but I suppose anyone using the term 'superset' is inviting the pedantry.
For almost all intents and purposes, if you are asked to create a YAML file then you can choose JSON as your syntax instead, because your file will be understood by the YAML parser. The benefit being that JSON has far fewer quirks and edge cases.
It's comical that when people get confused by YAML (which is often) they convert their YAML snippet to JSON to see what's really going on. YAML is horrible for humans to write. Let's just use JSON, the sane syntax, instead. A few extra parens and quotes are really no big deal, and it's far easier to read unambiguously.
If we're not being pedantic, YAML has almost nothing to do with JSON. A typical YAML file and a typical JSON file have no syntactical overlap at all. Practically speaking, YAML parsers are expected to also parse JSON despite it being an otherwise unrelated format. The entire idea that it is a "superset" is misleading. This isn't a C -> C++ transition.
It is not only pedantry, unfortunately. A subtle input that is valid JSON but invalid YAML may cause all sorts of problems at any level, with some security implications.
They are both attempts to make XML easier and more manageable, as XML was for SGML, and well... they are damn failures. The modern YAML and JSON monsters clearly prove this.
If we want human readability, we should accept not being language-agnostic, or simply use a programming language directly for data as well (as Lisp teaches); if we want a textual, language-agnostic data-exchange format, then we should ignore human readability and maybe stick with XML.
I have never been fond of YAML. Guess it really depends on what you have been accustomed to. Then again, if that were true I would be defending XML over JSON.
I guess there is a place for all... but I do wish s-expressions got more love. :-)
Great for cutesy human-readable applications but nothing workhorse. Just as I'd happily use Papyrus for invitations to a neighborhood barbecue but not a resume.
No it is not?... A valid JSON document cannot start with ` or with '. You could argue that the poster added ` in the hopes of getting it formatted, but not the '.
They might rarely be considering code formatting specifically, but the claim that one language is a superset of another really does imply that all valid instances of the latter are also valid instances of the former (and are parsed the same way).
The old spec of YAML 1.2 section 1.3 explicitly said:
YAML can therefore be viewed as a natural superset of JSON, offering improved human readability and a more complete information model. This is also the case in practice; every JSON file is also a valid YAML file.
The revised spec 1.2 revision 1.2.2 (2021-10-01) no longer contains that sentence; but still says, in section 1.2:
The YAML 1.2 specification was published in 2009. Its primary focus was making YAML a strict superset of JSON.
and in section 6.8.1:
Note that version 1.2 is mostly a superset of version 1.1, defined for the purpose of ensuring JSON compatibility.
Given all these claims, Patrick Stevens’ observation that YAML really isn’t a superset of JSON, because YAML can’t handle all JSON number literals, nor tabs as whitespace, really is surprising. At least to me.
When previously JavaScript/ECMAScript 2018 was found not to be a JSON superset, at least it was about unescaped occurrences of little-used characters U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR in string literals. And even that got fixed (by allowing the unescaped characters) in ECMAScript 2019.
The YAML "Scalars" section[1] says:
> A few examples also use the int, float and null types from the JSON schema.
And includes these examples:
So, is the "+" required here or not? Is a YAML parser buggy if it doesn't parse all JSON numbers as numbers?Edit: Ah, further on, it says:
The example 1e2 clearly matches this regex, so his YAML parser is broken.Edit edit:
In YAML 1.1, there were separate definitions of float[2] and int[3] types, where only floats support "scientific" notation, and must have a ".", unlike JSON.
So this article is talking about YAML 1.1, while the other article is talking about YAML 1.2.
[0] https://news.ycombinator.com/item?id=41498264
[1] https://yaml.org/spec/1.2.2/#23-scalars
[2] https://yaml.org/type/float.html
[3] https://yaml.org/type/int.html
1e2 does not match this regex. 1e+2 or 1e-2 would, though.
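Indeed; the canonical-form regular expression quoted above can be checked directly (the spaces in the spec's regex are just presentation, dropped here):

```python
import re

# The canonical float form from the YAML 1.2 spec, with the
# presentation spaces removed.
CANONICAL = re.compile(r'-?[1-9](\.[0-9]*[1-9])?(e[-+][1-9][0-9]*)?')

print(bool(CANONICAL.fullmatch('1e2')))   # False: the exponent sign is mandatory
print(bool(CANONICAL.fullmatch('1e+2')))  # True
print(bool(CANONICAL.fullmatch('1e-2')))  # True
```

Note this is only the *canonical* form; tag resolution in 1.2 is more permissive about what it accepts as a float.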
[1] https://yaml.org/type/float.html
> The content of a mapping node is an unordered set of key/value node pairs, with the restriction that each of the keys is unique
[1] https://datatracker.ietf.org/doc/html/rfc7493
[0]: https://github.com/pantoniou/libfyaml
These are completely different serialization formats.
But the yaml docs said it’s accidental
https://yaml.org/spec/1.2.2/
The SAP article definitely states it; that's the first time I have seen it described that way.
> YAML can therefore be viewed as a natural superset of JSON, offering improved human readability and a more complete information model.
It's one of those beliefs that seems like it should be true, but isn't for obscure technical reasons.
It appears to still be true in at least Ruby and Python, which are probably the two most popular languages to write YAML-consuming programs in.
If you must use some kind of configuration language, at least use Dhall or something more sane. Thank you in advance!
There are - similar to the JSON insanity - multiple YAML standards.
YAML 1.1 and 1.2+ are the important ones, as the "superset" argument is only valid since 1.2.
HOWEVER PyYAML is a YAML 1.1 parser: https://pypi.org/project/PyYAML/#description
https://yaml.org/spec/1.2.2/#representation-graph
[0]: https://yaml.org/spec/1.1/#id893482
What the ever loving hell.
The problem is that it parses to different values in those two syntaxes.
[YAML 1.2]: https://yaml.org/spec/1.2-old/spec.html#id2759572
[YAML 1.2 revision 1.2.2]: https://yaml.org/spec/1.2.2/
[ECMAScript 2019 feature Subsume JSON]: https://v8.dev/features/subsume-json