Protobuf solved serialization with schema evolution and backward/forward compatibility.
Skir seems to have great devex for the codegen part, but that's the least interesting aspect of protobufs. I don't see how the serialization proposed here fixes that without an equivalent of the numerical tagging.
Did you look at other formats like Avro, Ion etc? Some feedback:
1. Dense json
Interesting idea. You can also just keep the compact binary if you tag each payload with a schema id (see Avro). This also allows a generic reader to decode any binary payload by reading the schema and then interpreting the binary payload, which is really useful. A secondary benefit is that you never misinterpret a payload. I have seen bugs where protobuf payloads were misinterpreted, since there is no connection handshake and interpretation is akin to a 'cast'.
2. Compatibility checks
+100, there's no reason to allow breaking changes by default
3. Adding fields to a type: should you have to update all call sites?
I'm not so sure this is the right default. If I add a field to a core type used by 10 services, this requires rebuilding and deploying all of them.
4. enum looks great. what about backcompat when adding new enum fields? or sometimes when you need to 'upgrade' an atomic to an enum?
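The Avro-style schema-id tagging mentioned in point 1 can be sketched roughly like this (a toy illustration; real systems use a schema registry service and Avro's schema fingerprints):

```python
import hashlib
import json

# toy in-memory registry: fingerprint -> schema (in practice a registry service)
SCHEMAS = {}

def fingerprint(schema: dict) -> str:
    # stable hash of the schema; Avro defines its own fingerprint algorithm
    return hashlib.sha256(json.dumps(schema, sort_keys=True).encode()).hexdigest()[:16]

def tag(schema: dict, body: bytes) -> bytes:
    # prefix every payload with the writer schema's fingerprint
    fp = fingerprint(schema)
    SCHEMAS[fp] = schema
    return fp.encode() + b":" + body

def untag(message: bytes):
    # an unknown fingerprint fails loudly instead of silently 'casting' the bytes
    fp, body = message.split(b":", 1)
    return SCHEMAS[fp.decode()], body

schema = {"type": "record", "fields": ["id", "age"]}
decoded_schema, body = untag(tag(schema, b"\x06\x08"))
assert decoded_schema == schema and body == b"\x06\x08"
```

A generic reader can then always recover the writer schema before interpreting the bytes, which is what rules out the misinterpretation bugs described above.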
0. Yes, I looked at Avro and Ion. I like Protobuf much better because I think using field numbers for field identity (meaning you can rename fields freely) is a must.
1. Yes. Skir also supports that with the binary format (you can serialize and deserialize a Skir schema to JSON, which then allows you to convert from binary format to readable JSON). It just requires building many layers of extra tooling, which can be painful. For example, if you store your data in some SQL engine X, you won't be able to quickly visualize your data with a simple SELECT statement; you need to build the tooling that will allow you to visualize the data.
Now dense JSON is obviously not ideal for this use case, because you don't see the field names, but for quick debugging I find it's "good enough".
3. I agree there are definitely cases where it can be painful, but I think the cases where it actually is helpful are more numerous.
One thing worth noting is that you can "opt-out" of this feature by using `ClassName.partial(...)` instead of `ClassName()` at construction time. See for example `User.partial(...)` here: https://skir.build/docs/python#frozen-structs
I mostly added this feature for unit tests, where you want to easily create some objects with only some fields set and not be bothered if new fields are added to the schema.
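The strict-constructor vs `partial(...)` distinction can be mimicked with a plain dataclass (an analogy only; see the linked docs for the actual generated API):

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class User:
    name: str
    email: str  # newly added field: every strict User(...) call site must now supply it

    @classmethod
    def partial(cls, **kwargs):
        # opt-out: unspecified fields fall back to defaults, handy in unit tests
        defaults = {"name": "", "email": ""}
        return cls(**{**defaults, **kwargs})

try:
    User(name="ada")  # the construction site "lights up" when a field is added
except TypeError:
    pass

u = User.partial(name="ada")  # keeps compiling/running as the schema grows
assert u.email == ""
```

The tradeoff is exactly the one debated in this thread: strict construction surfaces every call site at build time, while `partial` trades that check away for convenience.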
> meaning being able to rename fields freely, is a must.
avro supports field renames though.
3. on second thought i believe you'd only have to deploy when you choose. the next build will force you to provide values (or opt into the default). so forcing inspection of construction sites seems good.
> Skir is a universal language for representing data types, constants, and RPC interfaces. Define your schema once in a .skir file and generate idiomatic, type-safe code in TypeScript, Python, Java, C++, and more.
Maybe I'm missing some additional features but that's exactly what https://buf.build/plugins/typescript does for Protobuf already, with the advantage that you can just keep Protobuf and all the battle hardened tooling that comes with it.
The entire original post, it seems, is dedicated to explaining why Skir is better than plain Protobuf, with examples of all the well-known pain points. If these are not persuasive for you, staying with Protobuf (or just JSON) should be a fine choice.
Copying from the blog post [https://medium.com/@gepheum/i-spent-15-years-with-protobuf-t...]:
""" Should you switch from Protobuf?
Protobuf is battle-tested and excellent. If your team already runs on Protobuf and has large amounts of persisted protobuf data in databases or on disk, a full migration is often a major effort: you have to migrate both application code and stored data safely. In many cases, that cost is not worth it.
For new projects, though, the choice is open. That is where Skir can offer a meaningful long-term advantage on developer experience, schema evolution guardrails, and day-to-day ergonomics. """
Skir has exactly the same goals as Protobuf, so yes, that sentence can apply to Protobuf as well (and buf.build).
I listed some of the reasons to prefer Skir over Protobuf in my humble opinion here: https://medium.com/@gepheum/i-spent-15-years-with-protobuf-t...
Built-in compatibility checks, the fact that you can import dependencies from other projects (buf.build offers this, but makes you pay for it), some language designs (around enum/oneof, the fact that adding fields forces you to update all constructor code sites), the dense JSON format, are examples.
I've been dabbling with the newer Cap'n Web, whose nicely descriptive README's first line says:
> Cap'n Web is a spiritual sibling to Cap'n Proto (and is created by the same author), but designed to play nice in the web stack.
It's just JSON, which has upsides and downsides. But things like promise pipelining are such a huge upside versus everything else: you can refer to results (and maybe send them around?) and kick off new work based on those results, before you even get the result back.
This is far far far superior to everything else, totally different ball-game.
I've been a little rebuffed by wasm when I try; I keep getting too close to some gravitational event horizon, getting sucked in, and giving up. But for more data-throughput-oriented systems, I'm still hoping wrpc ends up being a fantastic pick: https://github.com/bytecodealliance/wrpc . Also Apache Arrow Flight, which I know less about, has mad traction in serious data-throughput systems, which makes sense given it's adjacent to the amazingly popular Apache Arrow: https://arrow.apache.org/docs/format/Flight.html
That is correct, and that is a good catch. The idea, though, is that when you remove a field, you typically do so only after having made sure that no code reads from the removed field anymore and that all binaries have been deployed.
That “compact JSON” format reminds me of the special protobuf JSON format that Google uses in their APIs, which has very little public documentation. Does anyone happen to know why Google uses that, and to OP, were you inspired by that format?
I don't know the reason TextFormat was invented, but in practice it's way easier to work with TextFormat than JSON in the context of Protos.
Consider numeric types -
JSON: number aka 64-bit IEEE 754 floating point
Proto: signed and unsigned 32- and 64-bit integers, float, double
I can only imagine the carnage saved by not accidentally chopping off the top 10 bits (or something similar) of every int64 identifier when it happens to get processed by a perfectly normal, standards-compliant JSON processor.
It's true that most int64 fields could be just fine with int54. It's also true that some fields actually use those bits in practice.
Also, the JSPB format references tag numbers rather than field names. It's not really readable. TextProto might be a log output, a config file, or a test, which all have ways of catching field-name discrepancies (or where it doesn't matter). For the primary transport layer to the browser, the field name isn't a forward-compatible/safe way to reference the schema.
So oddly the engineers complaining about the multiple text formats are also saved from a fair number of bugs by being forced to use tools more suited to their specific situation.
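The int64-over-JSON hazard discussed above can be made concrete. A standards-compliant consumer that models every JSON number as an IEEE 754 double (as JavaScript's `JSON.parse` does) silently corrupts large identifiers; the illustration below simulates that with Python's `parse_int` hook:

```python
import json

big_id = 2**62 + 12345  # an int64 identifier that actually uses its high bits

# simulate a consumer whose only number type is a 64-bit double
roundtripped = json.loads(json.dumps(big_id), parse_int=float)
assert int(roundtripped) != big_id  # precision silently lost, no error raised

# 2**53 is the edge: beyond it, adjacent integers collapse onto the same double
assert float(2**53) == float(2**53 + 1)
```

This is why formats that care about int64 fidelity either encode them as strings in JSON or avoid JSON numbers entirely.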
I think you may be referring to JSPB. It's used internally at Google but has little support in the open source. I know about it, but I wouldn't say I was inspired by it. It's particularly unreadable, because it needs to account for field numbers being possibly sparse.
Google built it for frontend-backend communication, when both the frontend and the backend use Protobuf dataclasses, as it's more efficient than sending a large JSON object and faster to deserialize on the browser side than a binary string. I think it's mostly deprecated nowadays.
1. Google uses protobufs everywhere, so having something that behaves equivalently is very valuable. For example in protobuf renaming fields is safe, so if they used field names in the JSON it would be protobuf incompatible.
2. It is usually more efficient because you don't send field names. (Unless the struct is very sparse it is probably smaller on the wire, serialized JS usage may be harder to evaluate since JS engines are probably more optimized for structs than heterogeneous arrays).
3. Presumably the ability to use the native JSON parsing is beneficial over a binary parser in many cases (smaller code size and probably faster until the code gets very hot and JITed).
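Points 1 and 2 can be illustrated side by side (field names here are invented purely for illustration):

```python
import json

# the same record encoded by field name vs by field position (the JSPB approach)
by_name = json.dumps({"user_name": 42, "score": 7})
by_position = json.dumps([42, 7])

# positional encoding is smaller on the wire, since field names aren't sent
assert len(by_position) < len(by_name)

# it is also rename-proof: a positional reader never keys on "user_name",
# so renaming that field in the schema cannot break it, matching protobuf's
# field-number semantics. A name-keyed reader would break on the rename.
```

And, per point 3, the positional form is still plain JSON, so the browser's native `JSON.parse` handles it with no custom binary parser.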
Thanks!
Main use case (similarly to Protobuf) is when you need to exchange data types between systems written in different languages. Like Protobuf, it can also be used in a mono-linguistic system, when you want to serialize data and have strong guarantees that you will be able to deserialize it in the future (when you use classic serialization libraries like Pydantic, Java Serialization, etc., it's easy to accidentally modify a schema and break the ability to deserialize old data).
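The accidental-breakage failure mode is easy to reproduce with plain dataclasses (an illustrative sketch; Pydantic models behave analogously):

```python
import dataclasses
import json

@dataclasses.dataclass
class UserV1:
    user_name: str

# data persisted a year ago
stored = json.dumps(dataclasses.asdict(UserV1(user_name="ada")))

# later, someone innocently renames the field...
@dataclasses.dataclass
class UserV2:
    name: str

# ...and old persisted data no longer deserializes
try:
    UserV2(**json.loads(stored))
except TypeError:
    print("old data is now unreadable")
```

Schema languages with field-identity rules (field numbers in Protobuf, compatibility checks in Skir) exist precisely to catch this class of change before it ships.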
I spent some time in the actual compiler source. There's real work here, genuinely good ideas.
The best thing Skir does is strict generated constructors. You add a field, every construction site lights up. Protobuf's "silently default everything" model has caused mass production incidents at real companies. This is a legitimately better default.
Dense JSON is interesting but the docs gloss over the tradeoff: your serialized data is [3, 4, "P"]. If you ever lose your schema, or a human needs to read a payload in a log, you're staring at unlabeled arrays. Protobuf binary has the same problem but nobody markets binary as "easy to inspect with standard tools."
The "serialize now, deserialize in 100 years" claim has a real asterisk. Compatibility checking requires you to opt into stable record IDs and maintain snapshots. If you skip that (and the docs' own examples often do), the CLI literally warns you: "breaking changes cannot be detected." So it's less "built-in safety" and more "safety available if you follow the discipline." Which is... also what Protobuf offers.
The Rust-style enum unification is genuinely cleaner than Protobuf's enum/oneof split. No notes there, that's just better language design.
Minor thing that bothered me disproportionately: the constant syntax in the docs (x = 600) doesn't match what the parser actually accepts (x: 600).
The weirdest thing that bugged the heck out of me was the tagline, "like protos but better"; that's doing the project no favors.
I think this would land better if it were positioned as "Protobuf, but fresh" rather than "Protobuf, but better." The interesting conversation is which opinions are right, not whether one tool is universally superior.
Quite frankly, I don't use protobuf because it seems like an unapproachable monolith, and I'm not at FAANG anymore, just a solo dev. No one's gonna complain if I don't. But I do love the idea of something simpler that's easy to wrap my mind around.
That's why "but fresh" hits nice to me, and I have a feeling it might be more appealing than you'd think - e.g. it's hard to believe a 2-month-old project is strictly better than whatever mess and history protobuf's gone through, with tons of engineers paid to use and work on it. It is easy to believe it covers 99% of what Protobuf does already, and any crazy edge cases that pop up (they always do, eventually :) will be easy to understand and fix.
Thank you so much for taking the time to dig into the compiler source code and for the thorough comment you left.
For dense JSON: the idea is that it is often a good "default" choice because it offers a good tradeoff across 3 properties: efficiency (where it's between binary and readable JSON), persistability (safe to evolve the schema without losing backward compatibility), and readability (it's low for the reasons you mentioned, but it's not as bad as a binary string). I tried to explain this tradeoff in this table: https://skir.build/docs/serialization#serialization-formats
I hear your point about the tagline "like protos but better" which I hesitated to put because it sounds presumptuous. But I am not quite sure what idea you mean to convey by "fresh"?
Not the parent but I infer “fresh” as meaning a new approach to an old problem (with the benefits of experience baked in). A synonym of “modern” without the baggage.
> Minor thing that bothered me disproportionately: the constant syntax in the docs (x = 600) doesn't match what the parser actually accepts (x: 600).
You’re a better man than me. If the docs can’t even get the syntax right, that’s a hard no from me.
Also, fwiw, you’ve got a few points wrong about protos. Inspecting the binary data is hard, but the tag numbers are present. You need the schema, but at least you can identify each element.
Also, I disagree on the constructor front. Proto forces you to grapple with the reality that a field may be missing. In a production system, when adding a new field, there will be a point where that field isn’t present on only one side of the network call. The compiler isn’t saving you.
Fresh is more honest than better, and personally, I wouldn’t change it.
> Also, I disagree on the constructor front. Proto forces you to grapple with the reality that a field may be missing. In a production system, when adding a new field, there will be a point where that field isn’t present on only one side of the network call. The compiler isn’t saving you.
I agree it's important for users to understand that newer fields won't be set when they deserialize old data -- whether that's with Protobuf or Skir. I disagree with the idea that not forcing you to update all constructor call sites when you add a field will help (significantly) with that. Are you saying that because Protobuf forces you to manually search for all call sites when you add a field, it forces you to think about what happens if the field is not set at deserialization, hence, it's a good thing? I'm not sure that outweighs the cost of bugs introduced by cases where you forget to update a constructor call site when you add a field to your schema.
Respectfully, I’ve never forgotten a call site, but also yes. In a hypothetical HelloWorld service, the HelloRequest and HelloResponse generally aren’t used anywhere except a rpc caller and rpc handler, so it’s not hard to “remember” and find the usage.
Some callers may not need to update right away, or don't need the new feature at all, and breaking the existing callers' compilation is bad. If your caller is a different team, for example, and their CI/CD breaks because you added a field, that's bad. Each place it's used, you should think about how it'll be handled, BUT ALSO, your system explicitly should gracefully handle the case where it's not uniformly present. It's an explicit goal of protos to support the use case where heterogeneous schema versions are used over the wire.
If a bug is introduced because the caller and handler use different versions, the compiler wasn’t going to save you anyways. That bug would have shown up when you deploy or update the client and server anyways - unless you atomically update both at once. You generally cannot guarantee that a client won’t use an outdated version of the schema, and if things break because of that, you didn’t guard it correctly. That’s a business logic failure not a compilation failure.
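The "guard it correctly" point boils down to handling a possibly-absent field explicitly at the application layer, whatever the compiler says. A minimal sketch (hypothetical request shape, using a dict as a stand-in for a deserialized message):

```python
def greeting_language(request: dict) -> str:
    # an old client may not send 'language' at all; fall back explicitly
    # instead of assuming every peer runs the latest schema
    return request.get("language") or "en"

# new client that knows about the field
assert greeting_language({"name": "ada", "language": "fr"}) == "fr"
# old client deployed before the field existed
assert greeting_language({"name": "ada"}) == "en"
```

No constructor check can enforce this: the old client's binary was built against a schema where the field simply did not exist.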
I would recommend exploring OpenRPC for those who have not yet seen it. It brings protocol-buffer-like definitions (components), RPC definitions and centralised error definitions.
Obligatory dense field numbers seem like a massive downside; the problems would become evident after a busy repo has been open for a few days.
It's not obligatory.
Basically Protobuf gives you a choice between (1) binary format, (2) readable JSON.
Skir gives you a choice between (1) binary format, (2) readable JSON, (3) dense JSON. It recommends dense JSON as the "default choice", but it does not force it. The reason it's recommended as the default is that it offers a good tradeoff between efficiency (only a bit less compact than binary), backward compatibility (you can rename fields safely, unlike with readable JSON), and debuggability (although it's definitely not as good as readable JSON, because you lose the field names, it's decent and much better than the binary format).
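The relationship between the dense and readable forms can be sketched with a toy codec (field names invented for illustration; the real tooling derives the order from the `.skir` schema):

```python
import json

# the schema fixes the field order, so names can be dropped from the payload
FIELD_ORDER = ["id", "age", "initial"]

def to_dense(record: dict) -> str:
    # dense JSON: values only, positions determined by the schema
    return json.dumps([record[name] for name in FIELD_ORDER])

def to_readable(dense: str) -> dict:
    # schema in hand, a dense payload can be re-labeled for debugging
    return dict(zip(FIELD_ORDER, json.loads(dense)))

payload = to_dense({"id": 3, "age": 4, "initial": "P"})
assert payload == '[3, 4, "P"]'
assert to_readable(payload) == {"id": 3, "age": 4, "initial": "P"}
```

Renaming `age` in the schema changes only `FIELD_ORDER`'s label, never the wire bytes, which is the backward-compatibility property claimed above.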
4. Good question. I guess you mean "forward compatibility": you add a new field to the enum, not all binaries are deployed at the same time, and some old binary encounters the new enum value it doesn't know about? I do what Protobuf does: I default to the UNKNOWN variant. More on this:
- https://skir.build/docs/schema-evolution#adding-variants-to-...
- https://skir.build/docs/schema-evolution#default-behavior-dr...
- https://skir.build/docs/protobuf#implicit-unknown-variant
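The default-to-UNKNOWN rule described above amounts to the following (a sketch with invented variant names; the real behavior is defined in the linked docs):

```python
# the old binary's view of the enum, generated before "BANNED" was added
KNOWN_VARIANTS = {"ACTIVE", "SUSPENDED"}

def decode_status(wire_value: str) -> str:
    # an unrecognized variant decodes to UNKNOWN instead of failing,
    # so old readers survive new writers
    return wire_value if wire_value in KNOWN_VARIANTS else "UNKNOWN"

assert decode_status("ACTIVE") == "ACTIVE"
assert decode_status("BANNED") == "UNKNOWN"  # added to the schema later
```

The cost is that old readers lump every new variant together, so code branching on UNKNOWN needs a sensible catch-all behavior.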
Unfortunately, I really like postfix types, but IDL itself doesn't support them.
In the "dense JSON" format, isn't representing removed/absent struct fields with `0` and not `null` backwards incompatible?
If you remove or are unaware of an `int32?` field, old consumers will suddenly think the value is present as a "default" value rather than absent.
Why build another language instead of extending an existing one?