This article claims that 1e2 is interpreted as a string, while this other article on the front page[0] claims that 556474e378 is interpreted as a number. What's correct?
The YAML "Scalars" section[1] says:
> A few examples also use the int, float and null types from the JSON schema.
And includes these examples:
canonical: 1.23015e+3
exponential: 12.3015e+02
So, is the "+" required here or not? Is a YAML parser buggy if it doesn't parse all JSON numbers as numbers?
Edit: Ah, further on, it says:
Canonical Form
Either 0, .inf, -.inf, .nan or scientific notation matching the regular expression
-? [1-9] ( \. [0-9]* [1-9] )? ( e [-+] [1-9] [0-9]* )?
The example 1e2 clearly matches this regex, so his YAML parser is broken.
Edit edit:
In YAML 1.1, there were separate definitions of float[2] and int[3] types, where only floats support "scientific" notation, and must have a ".", unlike JSON.
So this article is talking about YAML 1.1, while the other article is talking about YAML 1.2.
What's "correct" depends on whether your YAML parser defaults to YAML 1.1 or 1.2.
Most YAML parsers default to 1.1 for compatibility reasons, because if they default to 1.2 then existing YAML documents expecting 1.1 behavior will be parsed incorrectly.
YAML is a difficult language to parse if you care about getting the correct data.
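For comparison, JSON itself has no ambiguity here: its grammar allows an exponent with or without a sign. A quick check with Python's standard-library parser (nothing YAML-specific assumed):

```python
import json

# JSON's grammar allows an exponent with or without a sign,
# so both of these parse as the number 100.0.
doc_unsigned = json.loads('{"a": 1e2}')
doc_signed = json.loads('{"a": 1e+2}')
print(doc_unsigned, doc_signed)  # {'a': 100.0} {'a': 100.0}
```

So any YAML parser that claims JSON compatibility but hands back the string '1e2' is diverging from every conforming JSON parser.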
Which only goes to prove that every replacement for XML eventually ran into exactly the same complexity issues it set out to solve in the first place. Another example of why "just use..." is a losing approach.
The "canonical" form indeed requires an exponent sign, but the tag resolution process in YAML 1.2 (section 10.2.2) does allow its omission. That said however, the equivalent specification in YAML 1.1 [1] does require an exponent sign at any case! It should be no surprise that YAML 1.1 isn't a superset of any version of JSON, but I don't know whether this is intentional or simply an oversight.
While JSON doesn't prevent duplicate keys per se, it doesn't fully specify their semantics either, and only states that duplicate keys are less "interoperable". And there is an explicit profile of JSON with this requirement (I-JSON [1]), so YAML 1.2.2 can be said to be a superset of some version of JSON.
"X is a superset of some subset of Y" is a weak statement.
This is not about semantics, it's about grammar. While it's fair to say that JSON "usually" is valid YAML, it's still good to be strict about it, because the existence of a single counterexample can be used maliciously.
A JSON implementation that rejects non-unique keys is a valid JSON implementation, so a valid YAML 1.2 implementation is still a valid JSON implementation.
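Python's standard-library json module can illustrate both stances: by default it silently keeps the last duplicate, while the `object_pairs_hook` parameter lets you build the stricter, I-JSON-style loader that rejects duplicates (a minimal sketch):

```python
import json

def reject_duplicates(pairs):
    """object_pairs_hook that refuses duplicate keys, as a strict
    (I-JSON-style) implementation is allowed to do."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result

# Default behavior: the last duplicate silently wins.
print(json.loads('{"a": 1, "a": 2}'))  # {'a': 2}

# Strict behavior: duplicates are an error.
try:
    json.loads('{"a": 1, "a": 2}', object_pairs_hook=reject_duplicates)
except ValueError as exc:
    print(exc)  # duplicate key: 'a'
```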
Some things in the world of JSON depend on order, too. There is a Microsoft way of indicating the type of an object using a key that must be first in the JSON syntax.
And such a parser could work with that convention. However, if its complementary formatter printed the objects such that that type field doesn't appear first, it wouldn't be correctly implementing the specification.
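A sketch of how such an order-sensitive reader could look in Python; the `$type`-first convention here is modeled loosely on Json.NET's metadata-property handling, and the helper name is made up for illustration:

```python
import json

def load_typed(text):
    """Read a JSON object whose first key must be the '$type' tag.

    object_pairs_hook=list preserves document order, which a plain
    dict would hide. Hypothetical helper, for illustration only.
    """
    pairs = json.loads(text, object_pairs_hook=list)
    if not pairs or pairs[0][0] != "$type":
        raise ValueError("expected '$type' as the first key")
    type_name = pairs[0][1]
    fields = dict(pairs[1:])
    return type_name, fields

print(load_typed('{"$type": "Point", "x": 1, "y": 2}'))
# ('Point', {'x': 1, 'y': 2})
```

A formatter feeding such a reader must emit the type key first, even though JSON's data model says object members are unordered.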
> So this article is talking about YAML 1.1, while the other article is talking about YAML 1.2.
Precisely. The article is really noticing quirks and limitations of libyaml, the library doing the heavy lifting behind PyYAML, not YAML-the-spec proper.
Granted, in practice, library limitations are probably what you want to know about. AFAIK, libfyaml[0] (not libyaml) is the most spec-compliant library around. It's a shame more downstream languages aren't using it.
I feel there's a place for YAML and JSON and that they are quite different:
- YAML is for files written or edited by humans - e.g. configuration files for servers with lots of comments and explanations, but generally quite simple key value pairs or basic data structures
- JSON is for files written and consumed by machines. It allows for complex, nested data structures and types, but requires technical knowledge to use.
The problem arises once you start confusing these usecases. I'd argue that once you start writing `'{"a": 1e2}'` in YAML you're quite far outside of its ideal use. I appreciate that feature creep might lead to overly complex configuration files (I remember editing XML config that allowed you to specify for-loops in my earlier days), but really, at a certain point it might be worth taking a step back and reflecting if you're still using the right tool for the right job.
I agree about the distinction you made from the human perspective and you put it well.
I have always hated YAML, and still do to this day, because I cannot write a YAML file: the indentation makes no sense to me and the list syntax is black magic (there are actually several ways to write lists, and once again the implications of the indentation are obscure). So while I agree with the goal of being written/edited by humans, from my perspective it fails at it.
Also, 1e2 might not be the best example, as it is just 100, but as someone who has had to pass a lot of neural-network training hyper-parameters: passing 1e-3 and so on is definitely a use case. I am on the negative side, Xe-XX (and YAML 1.1 would parse it OK), but I guess other domains could also use the positive side to pass values (maybe as upper limits like 1e5 or so).
I think the YAML format should have parsed those number formats from the beginning. If this is fixed now, good job; hopefully the default YAML parsers are going to be "fixed" too. I would still use TOML over YAML any day, while waiting for a human-friendly JSON (some already exist) to be popularized one day.
Offhand, YAML is sensitive to non-printing whitespace, while JSON is not (it is syntax-sensitive, though). Looking it up, tabs are not supported, which is a HUGE detriment for my use cases. (I'm nearly religious about loving tabs and user-defined width of a single-character indent level.)
A minor change of nesting is easy to accomplish in JSON, while YAML practically requires support from a text editor. JSON _can_ be rendered in a pretty-printed fashion for easier human editing, and it should still parse correctly irrespective of how additional non-printing whitespace is added.
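The point about whitespace is easy to verify with the standard library: the same JSON document parses identically however it is indented, tabs included (a small sketch):

```python
import json

compact = '{"a":[1,2],"b":{"c":3}}'

# The identical data, re-serialized with tab indentation and newlines.
pretty = json.dumps(json.loads(compact), indent='\t')

# JSON ignores inter-token whitespace entirely, so both parse the same.
assert json.loads(compact) == json.loads(pretty)
print("equal")
```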
Because at multiple points humans are involved: creating context for the values, and later debugging or changing the meaning of the values. The fun part is when you integrate with systems where you don't even talk with the other party - you still need the documentation, but given how much JSON over HTTP is used, it is rather clear why we need a text format.
Not sure what context creation means here that requires machines to speak text to each other, and you could debug the converted human-readable format.
And JSON is also used a lot for user-editable app config despite lacking such trivial things as comments, so prevalence may signal some need, but not necessarily a need for the thing actually used (the awful XML is also widely used).
«The YAML 1.1 specification was published in 2005. Around this time, the developers became aware of JSON. By sheer coincidence, JSON was almost a complete subset of YAML (both syntactically and semantically).»
The YAML 1.0 spec says no such thing; it doesn't even mention JSON. Neither does the YAML 1.1 spec. The YAML 1.2.0 and 1.2.1 specs do say exactly that. 1.2.2 no longer does, but it reiterates that the primary focus of YAML 1.2 was to make it a strict superset of JSON.
Your post's remark about YAML 1.2 being opt-in with a "%YAML 1.2" directive is true for the parser you are using (LibYAML), but is not compliant with the 1.2 spec. The spec specifies that a parser should assume 1.2, and that 1.1 should be opt-in.
$ python3
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import yaml
>>> yaml.safe_load('{"a": 1e2}')
{'a': '1e2'}
>>>
--
> The spec specifies it should assume 1.2 and 1.1 should be opt-in.
The spec for 1.2 says that, but that's the reason parsers can't upgrade to 1.2, because changing the default version will cause backwards-incompatible parsing changes. Without the version directive there's no way for the parser to know which version was intended, so it defaults to 1.1, so people writing YAML will write YAML 1.1 documents, because that's what the parsers expect.
The only way YAML 1.2 is going to displace older versions is in a greenfield ecosystem that has all its tools using YAML 1.2 from the beginning, but that requires an author who both (1) cares a lot about parser correctness, and (2) wants to use YAML as a config syntax, which isn't a large population.
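For reference, the version can be pinned per document with a directive; a 1.2-aware parser then applies 1.2 rules to that document (whether a given library honors the directive is, as noted, implementation-dependent):

```yaml
%YAML 1.2
---
a: 1e2   # under the 1.2 core schema this resolves to the float 100.0
```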
It is indeed true for LibYAML, which is used by Python, Ruby, and the Haskell yaml package. The majority of YAML implementations I've seen that aren't based on LibYAML implement 1.2. I'm pointing it out because it's highly implementation-dependent.
This can also be responsible for many security problems, as people will assume things about JSON and YAML but won't worry about which of the 8 different JSON standards / YAML implementations they use.
YAML says keys MUST be unique, while JSON says they SHOULD be unique. It is therefore possible to produce technically valid JSON documents that aren't valid YAML, so it is in that sense not a strict superset. At the same time those documents aren't portable across JSON implementations either. Rejecting duplicate keys is a valid implementation of the JSON spec.
Ouch. That tab error is a bug in libyaml, the YAML loader used by PyYAML. It's still stuck on YAML 1.1, but even that spec explicitly allows tabs as whitespace, though they don't count for indentation purposes.[0]
The same is true in YAML 1.2, so it's not just a legacy thing, either.
This seems pedantic, but I suppose anyone using the term 'superset' is inviting the pedantry.
For almost all intents and purposes, if you are asked to create a YAML file then you can choose JSON as your syntax instead, because your file will be understood by the YAML parser. The benefit being that JSON has far fewer quirks and edge cases.
It's comical that when people get confused by YAML (which is often) they convert their YAML snippet to JSON to see what's really going on. YAML is horrible for humans to write. Let's just use JSON, the sane syntax, instead. A few extra parens and quotes are really no big deal, and it's far easier to read unambiguously.
If we're not being pedantic, YAML has almost nothing to do with JSON. A typical YAML file and a typical JSON file have no syntactical overlap at all. Practically speaking, YAML parsers are expected to also parse JSON despite it being an otherwise unrelated format. The entire idea that it is a "superset" is misleading. This isn't a C -> C++ transition.
It is not only pedantry, unfortunately. A subtle input that is valid JSON but invalid YAML may cause all sorts of problems at any level, with some security implications.
They are both attempts to make XML easier and more manageable, as XML was for SGML, and well... they are damn failures. The modern YAML and JSON monsters clearly prove this.
If we want human readability, we should accept not being language-agnostic, or simply use a programming language directly for data as well (as Lisp teaches); if we want a textual, language-agnostic data-exchange format, then we should ignore human readability and maybe stick with XML.
I have never been fond of YAML. Guess it really depends on what you have been accustomed to. Then again, if that were true I would be defending XML over JSON.
I guess there is a place for all... but I do wish s-expressions got more love. :-)
Great for cutesy human-readable applications but nothing workhorse. Just as I'd happily use Papyrus for invitations to a neighborhood barbecue but not a resume.
No it is not?... A valid JSON document cannot start with ` or with '. You could argue that the poster added ` in the hopes of getting it formatted, but not the '.
They might rarely be considering code formatting specifically, but the claim that one language is a superset of another really does imply that all valid instances of the latter are also valid instances of the former (and are parsed the same way).
The old spec of YAML 1.2 section 1.3 explicitly said:
YAML can therefore be viewed as a natural superset of JSON, offering improved human readability and a more complete information model. This is also the case in practice; every JSON file is also a valid YAML file.
The revised spec 1.2 revision 1.2.2 (2021-10-01) no longer contains that sentence; but still says, in section 1.2:
The YAML 1.2 specification was published in 2009. Its primary focus was making YAML a strict superset of JSON.
and in section 6.8.1:
Note that version 1.2 is mostly a superset of version 1.1, defined for the purpose of ensuring JSON compatibility.
Given all these claims, Patrick Stevens’ observation that YAML really isn’t a superset of JSON, because YAML can’t handle all JSON number literals, nor tabs as whitespace, really is surprising. At least to me.
When previously JavaScript/ECMAScript 2018 was found not to be a JSON superset, at least it was about unescaped occurrences of little-used characters U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR in string literals. And even that got fixed (by allowing the unescaped characters) in ECMAScript 2019.
The YAML "Scalars" section[1] says:
> A few examples also use the int, float and null types from the JSON schema.
And includes these examples:
So, is the "+" required here or not? Is a YAML parser buggy if it doesn't parse all JSON numbers as numbers?Edit: Ah, further on, it says:
The example 1e2 clearly matches this regex, so his YAML parser is broken.Edit edit:
In YAML 1.1, there were separate definitions of float[2] and int[3] types, where only floats support "scientific" notation, and must have a ".", unlike JSON.
So this article is talking about YAML 1.1, while the other article is talking about YAML 1.2.
[0] https://news.ycombinator.com/item?id=41498264
[1] https://yaml.org/spec/1.2.2/#23-scalars
[2] https://yaml.org/type/float.html
[3] https://yaml.org/type/int.html
1e2 does not match this regex. 1e+2 or 1e-2 would, though.
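Indeed; the canonical-form regular expression quoted above can be checked directly (the spaces in the spec's regex are just presentation, dropped here):

```python
import re

# The canonical float form from the YAML 1.2 spec, with the
# presentation spaces removed.
CANONICAL = re.compile(r'-?[1-9](\.[0-9]*[1-9])?(e[-+][1-9][0-9]*)?')

print(bool(CANONICAL.fullmatch('1e2')))   # False: the exponent sign is mandatory
print(bool(CANONICAL.fullmatch('1e+2')))  # True
print(bool(CANONICAL.fullmatch('1e-2')))  # True
```

Note this is only the *canonical* form; tag resolution in 1.2 is more permissive about what it accepts as a float.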
[1] https://yaml.org/type/float.html
> The content of a mapping node is an unordered set of key/value node pairs, with the restriction that each of the keys is unique
[1] https://datatracker.ietf.org/doc/html/rfc7493
[0]: https://github.com/pantoniou/libfyaml
These are completely different serialization formats.
But the yaml docs said it’s accidental
https://yaml.org/spec/1.2.2/
The SAP article definitely states it; that's the first time I have seen it described that way.
> YAML can therefore be viewed as a natural superset of JSON, offering improved human readability and a more complete information model.
It's one of those beliefs that seems like it should be true, but isn't for obscure technical reasons.
It appears to still be true in at least Ruby and Python, which are probably the two most popular languages to write YAML-consuming programs in.
If you must use some kind of configuration language, at least use Dhall or something more sane. Thank you in advance!
There are - similar to the JSON insanity - multiple YAML standards.
YAML 1.1 and 1.2+ are the important ones, as the "superset" argument is only valid since 1.2.
HOWEVER PyYAML is a YAML 1.1 parser: https://pypi.org/project/PyYAML/#description
https://yaml.org/spec/1.2.2/#representation-graph
[0]: https://yaml.org/spec/1.1/#id893482
What the ever loving hell.
The problem is that it parses to different values in those two syntaxes.
[YAML 1.2]: https://yaml.org/spec/1.2-old/spec.html#id2759572
[YAML 1.2 revision 1.2.2]: https://yaml.org/spec/1.2.2/
[ECMAScript 2019 feature Subsume JSON]: https://v8.dev/features/subsume-json