
While I've never used Cap'n Proto, I want to thank kentonv for the extremely informative FAQ answer [1] on why required fields are problematic in a protocol.

I link it to people all the time, especially when they ask why protobuf 3 doesn't have required fields.

[1] https://capnproto.org/faq.html#how-do-i-make-a-field-require...



This is some very valuable perspective. Personally, I also struggled to understand why. For me, the thing that clicked was to understand protobuf and Cap'n Proto as serialization formats that need to work across API boundaries, with different versions of their schema, in a backwards- and forwards-compatible way; do not treat them as in-memory data structures that represent the world from the perspective of a single process running a single version with no compatibility concerns. Thus, the widely repeated mantra of "making illegal states unrepresentable" does not apply.


Rich Hickey (creator of the Clojure language) has a good talk "Maybe Not" that touches on these issues, with a nice way of explaining it

https://www.youtube.com/watch?v=YR5WdGrpoug

The capnproto link explains it concretely in terms of a message bus example, which is useful.

But more abstractly you can think of the shape of data (aka schema, names and types) and field presence (optional/required) as separate things

https://lobste.rs/s/zdvg9y/maybe_not_rich_hickey

First, being valid or invalid with respect to a static type system is a GLOBAL property of a program -- writing a type checker will convince you of that. And big distributed systems don't have such global properties: https://news.ycombinator.com/item?id=36590799

If they did, they'd be small :) Namely, you could just reboot the whole thing at once. You can't reboot, say, the entire Internet at once, and this also holds for smaller systems, like the ones at Google (and I'm sure Cloudflare, etc.).

So the idea is that the shape/schema is a GLOBAL property -- you never want two messages called foo.RequestX or two fields called "num_bar" with different types -- ever, anywhere.

But optional/required is a LOCAL property. It depends on what version of a schema is deployed in a particular binary. Inherently, you need to be able to handle a mix of inconsistent versions running simultaneously.

---

To be pedantic, I would say "making illegal states unrepresentable" DOES apply, but you can't do it in a STATIC type system. [1] Your Maybe<T> type is not useful for data that crosses process boundaries.

A distributed system isn't a state machine.

1. Lamport showed us one important reason why: the relative order of messages means that there is no globally coherent state. You need something like Paxos to turn a distributed system back into a state machine (and this is very expensive in general)

2. The second reason is probably a consequence of the first. You can think of deploying a binary to a node as a message to that node. So you don't have a completely consistent state -- you always have an in-between state, a mix of versions. And presumably you want your system to keep working during this time period :)

And that coarse-grained problem (code versioning and deployment) implies the fine-grained problem (whether a specific field in a message is present). This is because protobufs generate parsers with validation for you -- or they used to!

---

tl;dr Think of the shape of data (aka schema) and field presence (optional/required) as different dimensions of data modeling. Maybe<T> mixes those up, which is fine in a single process, but doesn't work across processes.
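
A rough TypeScript sketch of the two dimensions (names made up):

    // Shape is GLOBAL: a field's name and type must mean the same thing
    // forever, in every version of every binary.
    interface FooEvent {
      num_bar?: number;   // always a number, wherever it appears
      label?: string;     // added in a later schema version
    }

    // Presence is LOCAL: each consumer decides, at the point of use,
    // what it requires. Baking Maybe<T> vs. T into the shape would force
    // one global answer on every deployed version at once.
    function consume(e: FooEvent): void {
      if (e.num_bar === undefined) throw new Error("this version needs num_bar");
      console.log(e.num_bar * 2);
    }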

---

[1] A very specific example of making illegal states unrepresentable without static types - my Oils project uses a DSL for algebraic data types, borrowed from CPython. The funny thing is that in CPython, it generates C code, which doesn't have any static notion of Maybe<T>. It has tagged unions.

And in Oils we generated dynamically typed Python at first. Somewhat surprisingly, algebraic data types are STILL useful there.

Now the generated code is statically typed with MyPy (and with C++), and we do pleasant type-driven refactorings. But algebraic data types were still extremely useful before static typing. They made illegal states unrepresentable -- but you would get the error at runtime.
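
For illustration, the same idea in a TypeScript-flavored sketch (not the actual Oils DSL): a tagged union makes illegal shapes unconstructible, and without static checking the enforcement simply moves to a runtime error in the default branch:

    type Expr =
      | { tag: "Const"; value: number }
      | { tag: "Add"; left: Expr; right: Expr };

    function evaluate(e: Expr): number {
      switch (e.tag) {
        case "Const": return e.value;
        case "Add":   return evaluate(e.left) + evaluate(e.right);
        default:      // with static checking this branch is provably dead;
                      // without it, this is where the runtime error surfaces
          throw new Error(`invalid tag: ${JSON.stringify(e)}`);
      }
    }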


I wonder about how to make this play nicely with systems that have different perspectives. Yes, a message bus is written to deal with any possible message and it can do that because it doesn't care what's in the message. Incomplete messages are useful to have, too.

This is sort of like the difference between a text editor and a compiler. An editor has to deal with code that doesn't even parse, which is easiest if it just treats it as a plain text file, but then you're missing a lot of language-specific features that we take for granted these days. Meanwhile, a compiler can require all errors to be fixed before it emits a binary, but it has to be good at reporting what the errors are, because they will certainly happen.

It's unclear to me how the type of the field can be a global property in a large system. From a text editor's point of view, you can just edit the type. How can anyone guarantee that a type is always the same?

Also, SQL tables actually do make closed-world assumptions; every record meets the schema, or it can't be stored there. If you change the schema, there is a migration step where all the old rows in the production database get upgraded. This doesn't seem unrealistic?

I guess it's unrealistic that you only have one production database, and not also a staging database, and every developer having their own database? And they will be on different versions. As soon as you have lots of databases, things get complicated.


Yes, databases and persisted data are an even bigger problem -- it's not enough to "reboot the Internet"; you would also have to migrate all the data it stores to a different format!

I started using the terminology "interior vs. exterior" for these issues, and I find it pretty helpful - https://www.oilshell.org/blog/2023/06/ysh-design.html

e.g. type systems are interior, network protocols and persisted data are exterior.

SWEs tend to reason about the interior; SREs tend to reason about the exterior. Every problem an SRE deals with has passed type checks.

I see many fallacies where programmers want to think about the interior ONLY. They want the type system to ensure correctness. But that can be taken too far -- there are some things the interior view can't (or doesn't currently) handle, like mixed versions of binaries, schema migrations, etc.

The key point with databases is that your schema is LITERALLY dynamic -- it lives in a process outside your program, outside your type system (unless your program consists entirely of stored procedures, etc.)

Of course most people have some kind of synchronization or ORM (with varying degrees of success). But the point is that the exterior reality is the thing that matters; the interior is just a guide. "When models and reality collide, reality wins"

---

On the other hand, I think there can be more static tools in the future -- if they are built to understand more of the system; if they're not so parochially limited to a single process.

But I think these issues are never going away -- quite the contrary they will increase, because software tends to get more heterogeneous as it gets bigger. It's tempting to think that someday there's going to be a "unified platform" that solves it all, but I think the trend is actually the opposite.

The other issue is that while type systems can get better, they're mostly applicable when you control both sides of the wire.

Most orgs writing software do not control both sides of the wire -- they are reusing services like Google Maps or Stripe. In the last 10 years it seems like every org is now relying on 50 different third party integrations (which btw makes the web super slow ...)

As I mentioned in the previous comment, even if you can somehow go into the repo of Google Maps or Stripe, download their schema, and compile it into your binary, that STILL doesn't give you any guarantees.

They can upgrade or roll back their binaries whenever. They might switch from free-form JSON to JSON schema to Swagger, etc. You don't control that decision.

The people on the other side of the wire may have started using protobufs 10 years ago, and don't feel like switching to whatever you think is great right now. There's a lot of heterogeneity across time too, not just across different tech right now.

So fundamentally the problems are with "the world" the type system is modeling, not about or within the type system itself!


It seems like protobufs are sort of a mixed interior / exterior model? In an organization that uses centralized protobuf schemas and generates serialization code from that, you can assume every protobuf record of that type came from a serializer that obeys some version of the schema.

Why can this guarantee a field's type but not whether it's required? Because fields have a lifecycle. They get added to the schema, used for a while, and later become obsolete and are deleted.

The maintainers of a protobuf schema can guarantee that a field number is always associated with a certain type, and that when the field is deleted, that field number is never reused. This makes (field number, type) tuples timeless. But they can't say whether a field will be needed forever, and they can't control which versions of the schema are still being used.

Effectively, "required" means "possibly required forever." As long as there is some parser out there that treats the field as required, deserialization will break if you leave out the field.

Changing a field's type is easier, because you don't really do it. You add a new field number with a new type and maybe eventually stop emitting the old field.

This suggests a strategy: fields can be required as long as they're not top-level. You can define coordinates(42) to always be (x,y) where both x and y are required. If you change it to be a 3D coordinate, then that will be coordinates(43) = (x,y,z), and you never reuse field number 42 in that schema.
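
As a sketch (TypeScript-style types standing in for generated code; field numbers shown as comments):

    // Field 42 is (x, y) forever; both sub-fields required within the group.
    interface Coordinates2D { x: number; y: number }

    // A shape change means a NEW field number; 42 is never reused.
    interface Coordinates3D { x: number; y: number; z: number }

    interface SomeMessage {
      coordinates?: Coordinates2D;    // field 42: presence stays optional
      coordinates3d?: Coordinates3D;  // field 43: ditto
    }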


The strategy illustrated at the end still ignores the lesson of the FAQ. What if, let's say, one microservice is responsible for calculating the x coordinate but another microservice is responsible for calculating the y coordinate? The former could rightly fill in only the x coordinate and ask the latter to fill in the y coordinate.

Then you start to make compromises such as defining an IncompleteCoordinate message that requires x. But what if later you calculate y first and use that to calculate x?

Etc. Etc. I find it more convenient to make everything optional at the wire format level and explicitly validate what is required at the point of use, not at the point of parse.
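
A minimal sketch of that style in TypeScript (hypothetical names): everything is optional on the wire, and each consumer validates exactly what it needs, where it needs it:

    interface WireCoordinate { x?: number; y?: number }

    // This consumer needs both fields, so it checks both, at the point of use.
    function distanceFromOrigin(c: WireCoordinate): number {
      if (c.x === undefined || c.y === undefined) {
        throw new Error("need both x and y here");
      }
      return Math.hypot(c.x, c.y);
    }

    // The x-calculating microservice can emit { x: 3 }; a half-filled
    // message is perfectly legal, and only y-consumers will reject it.
    const partial: WireCoordinate = { x: 3 };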


It seems like an edge case and you could define another message altogether for the microservice traffic.

But yeah, for larger child messages, copying fields would be annoying. Often you want to be able to create child messages without creating the parent.


Typical provides "asymmetric" fields to assist with evolution of types:

https://github.com/stepchowfun/typical#asymmetric-fields-can...

>To help you safely add and remove required fields, Typical offers an intermediate state between optional and required: asymmetric. An asymmetric field in a struct is considered required for the writer, but optional for the reader. Unlike optional fields, an asymmetric field can safely be promoted to required and vice versa.


Yeah, it only works for migrations in fairly closed systems where you can upgrade or delete all the old data, though.


Can't we extend this argument to eliminating basically all static typing? And frankly that wouldn't even be wrong; it's why Alan Kay defined OOP as dynamically typed and late-bound, and we went against it anyway to keep relearning the same lessons over and over.


The argument is really more like: Always defer validation until the point where the data is actually consumed, because only the consumer actually knows what is valid.

Which is definitely a counterpoint to the oft-stated argument that you should validate all data upfront.

Either way though, you can still have types, the question is just when and where (in a distributed system, especially) they should be checked.


The argument is actually more like: don't use badly written middleman software that tries to parse messages it doesn't need to parse.

I was at Google when the "let's get rid of optional" crusade started. It didn't make sense to me then and over a decade later it still doesn't. If a program expects a field to be there then it has to be there, removing the protobuf level checking just meant that programs could now read garbage (some default value) instead of immediately crashing. But the whole reason we have types, assertions, bounds checking and so on is because, almost always, we'd like our software to NOT just blindly plough on into undefined territory when it doesn't understand something properly, so in reality it just means everyone ends up coding those very same required-ness assertions by hand.

Now, Google's business model is remarkably robust to generating and processing corrupt data, so you can argue that in the specific case of this specific company, it is actually better to silently serve garbage than to crash. This argument was made explicitly in other forms, like when they deleted all the assertions from the HTTP load balancers. But in every case where I examined an anti-required argument carefully the actual problem would turn out to be elsewhere, and removing assertions was just covering things up. The fact that so much of Google's code is written in C++ that not only starts up slowly but also just immediately aborts the entire process when something goes wrong also contributes to the brittleness that encourages this kind of thing. If Google had been built on a language with usable exceptions right from the start it'd have been easier to limit the blast radius of data structure versioning errors to only the requests where that data structure turned up, instead of causing them to nuke the entire server (and then the RPC stack will helpfully retry because it doesn't know why the server died, promptly killing all of them).

But this tolerance of undefined behavior does not hold for almost any other business (except maybe video games?). In those businesses it's better to be stopped than wrong. Otherwise you can lose money, lose data, lose customers, or in the worst cases even lose your life. I don't think people appreciate the extent to which the unique oddities of Google's business model and infrastructure choices have leaked out into the libraries their staffers/ex-staffers release.


> The argument is actually more like: don't use badly written middleman software that tries to parse messages it doesn't need to parse.

The middleman software in question often needed to process some part of the message but not others. It wasn't realistic to define a boundary between what each middleman might need and what they wouldn't need, and somehow push the "not needed" part into nested encoded blobs.

I'm not sure the rest of your comment is really addressing the issue here. The argument doesn't have anything to do with proceeding forward in the face of corrupt data or undefined behavior. The argument is that validation needs to happen at the consumer. There should still be validation.


> It wasn't realistic to define a boundary between what each middleman might need and what they wouldn't need, and somehow push the "not needed" part into nested encoded blobs.

This is an interesting argument that I would like to see more elaboration on, because that's the obvious solution. Effectively you're building a pipeline of data processors, and each stage in the pipeline reads its own information and then passes along a payload with the rest of the information to the next stage. This would preserve full static typing with required fields, but I can see how it might inhibit some forms of dynamic instrumentation, e.g. turning verbose logging on/off might dynamically reconfigure the pipeline, which would affect all upstream producers if they're wrapping messages for downstream consumers.

If this were a programming language I would immediately think of row typing to specify the parts that each stage depends on while being agnostic about the rest of the content, but I'm not sure how that might work for a serialization format. Effectively, you're pulling out a typed "view" over the underlying data that contains offsets to the underlying fields (this is the dictionary-passing transform as found in Haskell).


The particular piece of infrastructure I worked on sat in the middle of the search pipeline, between the front-end that served HTML web pages, and the back-end index. This middle piece would request search results from the back-end, tweak them in a bunch of ways, and forward them on.

These "tweaks" could be just about anything. Like: "You searched for Jaguar, but I don't know if you meant the cat or the car. The index decided that pages about the car rank higher so the first three pages of results are about the car. I'm going to pull some results about the cat from page 4 and put them near the top so just in case that's what you really wanted, you'll find it."

Google Search, at least when I worked on it, was composed of a huge number of such tweaks. People were constantly proposing them, testing whether they led to an improvement, and shipping them if they did. For a variety of reasons, our middleman server was a great place to implement certain kinds of tweaks.

But what kinds of information are needed for these "tweaks"? Could be anything! It's a general-purpose platform. Search results were annotated with all kinds of crazy information, and any piece of information might be useful in implementing some sort of middleman tweak at some point.

So you couldn't very well say upfront "OK, we're going to put all the info that is only for the frontend into the special 'frontend blob' that doesn't get parsed by the middleman", because you have no idea what fields are only needed by the frontend. In fact, that set would change over time.

> If this were a programming language I would immediately think of row typing to specify the parts that each stage depends on while being agnostic about the rest of the content

Indeed, perhaps one could develop an elaborate system where in the schemas, we could annotate certain fields as being relevant to certain servers. Anywhere else, those fields would be unavailable (but passed through without modification or validation). If you needed the fields in a new place, you change the annotations.

But that sounds... complicated to design and cumbersome to maintain the annotations. Simply banning required fields solved the problem for us, and everything else just worked.


> Indeed, perhaps one could develop an elaborate system where in the schemas, we could annotate certain fields as being relevant to certain servers. Anywhere else, those fields would be unavailable (but passed through without modification or validation). If you needed the fields in a new place, you change the annotations.

I don't think it has to be elaborate. What I was thinking was something more like, in pseudo-C#:

    // the framework's general channel type from which messages are read
    public interface IChannel
    {
        T Read<T>() where T : interface;
    }

    // clients declare the interface they operate on:
    public interface IClientFields
    {
        public int Foo { get; set; }
        public string? Name { get; set; }
    }
    ...
    // client middleware function
    Task MiddlewareFn(IChannel chan)
    {
        var client = chan.Read<IClientFields>();
        ... // do something with client before resuming at next stage
    }
The client's interface type T must simply be a structural subtype of the underlying message type. As long as the underlying format is somewhat self-descriptive with a name and type map, you can perform the necessary checking that only applies locally to the client. Nothing fancy, and the required fields that client cares about are still there and the rest are ignored because they're never referenced. This could return an interface that contains a series of offsets into the data stream, which I believe is how capnproto already works.


Are you saying that each service would need to declare, separately, the subset of fields they operate on, and make sure that those fields are always a strict subset of the overall set of fields the protocol contains?

This essentially means declaring the same protocol multiple times, which seems like a big pain.


Assuming the service is not operating on the whole format, then it's already implicitly depending on the presence of those fields and also implicitly depending on the fact that they are a strict subset of the overall set of fields in the protocol. I'm not sure why making this fact slightly more explicit by having the client add a single interface would be a big pain.

In principle, this also enables type checking the whole pipeline before deployments since the interfaces can be known upfront rather than latent in the code.


Single interface?

In the search infrastructure example I mentioned up-thread, we had hundreds, maybe thousands of schemas involved.


I said a single interface per client, where I'm using "client" as a stage in the pipeline. Each piece of middleware that plugged into this pipeline already implicitly depends on a schema, so why not describe the schema explicitly as some subset of the underlying message?


TypeScript has types like Partial, Pick and Required which let you work with subsets of fields of an existing type (https://www.typescriptlang.org/docs/handbook/utility-types.h...). Can something like that be built for Protobuf message processing?
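
Roughly like this, at least at the type level (hypothetical message type; runtime narrowing would still need a check):

    interface SearchResult {     // the full, centrally owned shape
      url?: string;
      title?: string;
      rank?: number;
      debugInfo?: string;
    }

    // A per-stage view: this stage only touches url and rank, and
    // Required<Pick<...>> makes those two mandatory for THIS consumer only.
    type RankTweakView = Required<Pick<SearchResult, "url" | "rank">>;

    function tweakRank(r: RankTweakView): number {
      return r.rank + (r.url.includes("example") ? 1 : 0);
    }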


It's easier to understand in context - some services (iirc web search but it might have been ads or something else very core) had kept adding fields to some core protobufs for years and years. It made sense, was the path of least resistance etc, but inevitably some of those fields became obsolete and they wanted to remove them but found it was hard, because every program that did anything with web search was deserializing these structures.

Truly generic middleware like RPC balancers did what you are saying, but there were also a lot of service specific "middlemen" which did need to look at parts of these mega-structures.

Now due to how protobufs work, you can do what you suggest and "cast" a byte stream to multiple different types, so they could have defined subsets of the overall structures and maybe they did, I don't remember, but the issue then is code duplication. You end up defining the same structures multiple times, just as subsets. With a more advanced type system you can eliminate the duplication, but there was a strong reluctance to add features to protobufs.


Honestly, I wonder how big the performance win from static types really is here, because this sounds so terribly well suited to dynamic types (of which optionality-by-default is in fact a limited example). Such an odd choice to calcify a spec in a place where it changes all the time. "Static" optimizations should be local, not distributed.


I think you're defining consumer as the literal line of code where the field is read, whereas a more natural definition would be something like "the moment the data structure is deserialized". After all, it's usually better to abort early than halfway through an operation.

It was quite realistic to improve protobufs to help dig web search out of their "everything+dog consumes an enormous monolithic datastructure" problem, assuming that's what you're thinking of (my memory of the details of this time is getting fuzzy).

A simple brute-force fix for their situation would have been to make validation of required fields toggle-able on a per-parse level, so they could disable validation for their own parts of the stack without taking it away for everyone else (none of the projects I worked on had problems with required fields that I can recall).

A better fix would have been for protobufs to support composition. They could then have started breaking down the mega-struct into overlapping protos, with the original being defined as a recursive merge of them. That'd have let them start narrowing down semantically meaningful views over what the programs really needed.

The worst fix was to remove validation features from the language, thus forcing everyone to manually re-add them without the help of the compiler.

Really, the protobuf type system was too simple for Google even in 2006. I recall during training wondering why it didn't have a URL type, given that this was a web-centric company.

Shortly after, I discovered a very simple and obvious bug in web search in which some local business results were 404s even though the URL existed. It had been there for months, maybe years, and I found it by reading the user support forums (nobody else did this; my manager considered me way out of my lane for doing so). The bug was that nothing anywhere in the pipeline checked that the website address entered by the business owner started with https://, so when the result was stuffed into an <a> tag it turned into <a href="www.business.com"> and the user ended up at https://www.google.com/www.business.com. Oops. These bad strings made it all the way from the business owner, through the LBC frontend, the data pipeline, the intermediate microservices and the web search frontends to the user's browser. The URL did pass crawl validation, because when it was loaded into a URL type, the missing protocol was added.

SREs were trained to do post-mortems, so after it got fixed and the database was patched up, I naively asked whether there was a systematic fix for this, like maybe adding a URL type to protobufs so data would be validated right at the start. The answer was "it sounds like you're asking how to not write bugs" and nothing was done, sigh. It's entirely possible that similar bugs reoccurred dozens of times without being detected.
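
(The check itself is tiny; e.g. in TypeScript, the standard URL constructor already rejects scheme-less strings:)

    // new URL("www.business.com") throws a TypeError (no scheme), so
    // validating at ingestion would have stopped these strings at the first hop.
    function requireWebUrl(s: string): string {
      const u = new URL(s);   // throws on scheme-less input
      if (u.protocol !== "http:" && u.protocol !== "https:") {
        throw new Error(`not a web URL: ${s}`);
      }
      return u.toString();
    }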

Those are just a couple of cases where the simplicity (or primitivity) of the protobuf type system led to avoidable problems. Sure, there are complexity limits too, but the actual languages Googlers were using all had more sophisticated type systems than protobuf and bugs at the edges weren't uncommon.


> I think you're defining consumer as the literal line of code where the field is read

I am.

> After all it's usually better to abort early than half way through an operation.

I realize this goes against common wisdom, but I actually disagree.

It's simply unrealistic to imagine that we can fully determine whether an operation will succeed by examining the inputs upfront. Even if the inputs are fully valid, all sorts of things can go wrong at runtime. Maybe a database connection is randomly dropped. Maybe you run out of memory. Maybe the power goes out.

So we already have to design our code to be tolerant to random failures in the middle. This is why we try to group our state changes into a single transaction, or design things to be idempotent.

Given we already have to do all that, I think trying to validate input upfront creates more trouble than it solves. When your validation code is far away from the code that actually processes the data, it is easier to miss things and harder to keep in sync.

To be clear, though, this does not mean I like dynamic typing. Static types are great. But the reason I like them is more because they make programming easier, letting you understand the structure of the data you're dealing with, letting the IDE implement auto-complete, jump-to-definition, and error checking, etc.

Consider TypeScript, which implements static typing on JavaScript, but explicitly does not perform any runtime checks whatsoever validating types. It's absolutely possible that a value at runtime does not match the type that TypeScript assigned to it. The result is a runtime exception when you try to access the value in a way that it doesn't support (even though its type says it should have). And yet, people love TypeScript, it clearly provides value despite this.

This stuff makes a programming language theorist's head explode, but in practice it works. Look, anything can be invalid in ways you never thought of, and no type system can fully defend you from that. You gotta get comfortable with the idea that exceptions might be thrown from anywhere, and design systems to accommodate failure.


I agree with a lot of this, but:

1. The advantage of having it in the type system is the compiler can't forget.

2. It's quite hard to unwind operations in C++. I think delaying validation to the last moment is easier when you have robust exceptions. At the top level the frameworks can reject RPCs or return a 400 or whatever it is you want to do; if it's found out 20 frames deep into some massive chunk of code, then you're very likely to lose useful context as the error gets unwound (and get worse error messages).

On forgetting, the risky situation is something like this:

    message FooRequest {
        required string query = 1;
        // Added later. A scalar keeps proto2's has-bit, which is the point
        // here; repeated fields have no notion of "set but empty".
        optional string options = 2;
    }
The intention is: in v1 of the message there's some default information returned, but in v2 the client is given more control including the ability to return less information as well as more. In proto2 you can query if options is set, and if not, select the right default value. In proto3 you can't tell the difference between an old client and a client that wants no extra information returned. That's a bug waiting to happen: the difference between "not set" and "default value" is important. Other variants are things like adding "int32 timeout" where it defaults to zero, or even just having a client that forgets to set a required field by mistake.

TypeScript indeed doesn't validate type casts up front, but that's more because it's specifically designed to be compatible with JavaScript, and the runtime doesn't do strong typing. People like it compared to raw JS.


> Consider TypeScript, which implements static typing on JavaScript, but explicitly does not perform any runtime checks whatsoever validating types. It's absolutely possible that a value at runtime does not match the type that TypeScript assigned to it. The result is a runtime exception when you try to access the value in a way that it doesn't support (even though its type says it should have). And yet, people love TypeScript, it clearly provides value despite this.

> This stuff makes a programming language theorist's head explode but it practice it works. Look, anything can be invalid in ways you never thought of, and no type system can fully defend you from that. You gotta get comfortable with the idea that exceptions might be thrown from anywhere, and design systems to accommodate failure.

It's only possible if you're doing something wrong type-wise. In particular, when ingesting an object you're supposed to validate it before/as you assign the type to it. Delaying the error until the particular field is accessed is bad TypeScript! Those kinds of exceptions aren't supposed to be thrown from anywhere.
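
For example, the usual pattern is a type guard at the boundary, so the failure happens at ingestion, in one place (a sketch; a validation library like zod does the same job with less boilerplate):

    interface User { id: number; name: string }

    // Runtime validation at the moment the type is assigned.
    function isUser(v: unknown): v is User {
      return typeof v === "object" && v !== null &&
        typeof (v as { id?: unknown }).id === "number" &&
        typeof (v as { name?: unknown }).name === "string";
    }

    const raw: unknown = JSON.parse('{"id": 1, "name": "kentonv"}');
    if (!isUser(raw)) throw new Error("bad payload"); // fails here, not later
    console.log(raw.name.toUpperCase());              // raw narrowed to User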


I think this comes from everyone wanting to use the same schema and parser. For example, a text editor and a compiler have obvious differences in how to deal with invalid programs.

Maybe there need to be levels of validation, like "it's a text file" versus "it parses" versus "it type checks."


Sure, that would also have been a fine solution. There are lots of ways to tackle it really and some of it is just very subjective. There's a lot of similarities here between the NoSQL vs SQL debates. Do you want a schemaless collection of JSON documents or do you want enforced schemas, people can debate this stuff for a long time.

You can also see it as a version control and awareness problem rather than a schema or serialization problem. The issues don't occur if you always have full awareness of what code is running and what's consuming what data, but that's hard especially when you take into account batch jobs.


> The argument is actually more like: don't use badly written middleman software that tries to parse messages it doesn't need to parse.

> I was at Google when the "let's get rid of optional" crusade started. It didn't make sense to me then and over a decade later it still doesn't. If a program expects a field to be there then it has to be there, removing the protobuf level checking just meant that programs could now read garbage (some default value) instead of immediately crashing. But the whole reason we have types, assertions, bounds checking and so on is because, almost always, we'd like our software to NOT just blindly plough on into undefined territory when it doesn't understand something properly, so in reality it just means everyone ends up coding those very same required-ness assertions by hand.

Yeah, that's what stuck out to me from the linked explanation as well; the issue wasn't that the field was required, it was that the message bus was not doing what was originally claimed. It sounds like it could have been fixed either by having the message bus _just_ process the header and not the entire message, or by having the header carry a version number indicating which fields are required (with version numbers newer than the latest the bus was aware of being treated as having no required fields). I don't claim that it's never correct to design a protocol optimizing for robustness when consumed by poorly written clients, but I struggle to see why that should be the only possible way to implement a protocol. Maybe the goal of Cap'n Proto is to be prescriptive about this sort of thing, so it wouldn't be a good choice for uses where there's more rigor in the implementation of services using the protocol, but if it's intended for more general usage, I don't understand this design decision at all.


What you say is valuable, and it's kinda odd that some people here discard practical experience in favor of their subjective flavor of theoretical correctness.


The distributed part shifts the problem from "find types that represent your solution" to "find a system of types that enable evolution of your solution over time." I think this is why bad things like json or xml do so well: they work fine with a client dev saying, "I need this extra data" and the server dev adding it, and then the client dev consuming it.

The more modern approaches, like protobuf or Cap'n Proto, are designed with the experience of mutating protocols over time.

It works pretty well too unless the new field changes the semantics of old field values, e.g. adding a field "payment_is_reversal_if_set" to a payment info type, which would change the meaning of the signs of the amounts. In that case, you have to reason more explicitly about when to roll out the protocol readers and when to roll out the protocol writers. Or version it, etc.
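
A sketch of that hazard (TypeScript-style, field names from the example): no schema check can catch it, because old readers still parse the message fine and just compute the wrong thing:

    interface Payment {
      amountCents?: number;
      // Added later: when true, amountCents denotes a reversal, so its
      // sign means the opposite of what it used to.
      paymentIsReversal?: boolean;
    }

    // Old reader, deployed before the field existed: silently sums
    // reversals as if they were ordinary payments.
    function totalOldReader(payments: Payment[]): number {
      return payments.reduce((sum, p) => sum + (p.amountCents ?? 0), 0);
    }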


> Can't we extend this argument to eliminating basically all static typing?

No, because static typing exists in all sorts of places. This argument is primarily about cases where you're exchanging data, which is a very specific use case.


To elaborate on your point:

Static type systems in programming languages are designed to break at compilation-time. The reason this works is because all users are within the same “program unit”, on the same version.

In other words, static typing allows more validation to be automated, and removes the need for multiple simultaneous versions, but assumes that the developer has access to, and the ability to change, all other users at the same "time" as their own change.

I find this whole topic fascinating. It seems like programmers are limited to an implicit understanding of these differences but it’s never formalized (or even properly conceptualized). Thus, our intuition often fails with complex systems (eg multiple simultaneous versions, etc). Case in point: even mighty Google distinguished engineers made this “billion-dollar mistake” with required fields, even though they had near-perfect up-front knowledge of their planned use-cases.


It's actually the opposite. The billion dollar mistake is to have pervasive implicit nullability, not to have the concept of optionality in your type system. Encoding optionality in the type system and making things required by default is usually given as the fix for the billion dollar mistake.


Huh? Did you read the link, from the guy who was there during the major failure at Google that led to proto3 being redesigned without that flaw?

The whole lesson is that you can’t apply the lessons from static type systems in PLs when you have multiple versions and fragmented validation across different subsystems. Counter-intuitively! Everyone thought it was a good idea, and it turned out to be a disaster.


I did read the link and I was at Google at the time people started arguing for that. With respect, I think the argument was and still is incorrect, that the wrong lessons were drawn and that proto3 is worse than proto2.


Alright, fair enough. Apologies for the dismissive tone. Could you elaborate (or point to) these wrong lessons or an alternative?


OK, what do you do when a message comes in missing a field? Crash the server?


you reject the message in the framework? and if the client is aware it’s required they fail to send?

the bigger challenge with proto3 is that people use it both for RPC and for storage, in some cases directly serializing RPC payloads. Disregarding how awful a choice that is, for storage you likely want flexible deserialization of old data at the expense of rigidity and conformance.


It remains a big asterisk to me why some random middleware was validating an end-to-end message between two systems, instead of treating it as just an opaque message.

Why are we not having this debate about "everything must be optional" for IP packets, for example? Because it's just a binary payload. If you want to ensure integrity, you checksum the binary payload.


Things like distributed tracing, auth data, metrics, error logging messages and other “meta-subsystems” are certainly typical use cases. Reverse proxies and other HTTP middleware do exactly this with HTTP headers all the time.


No one has near-perfect up-front knowledge of a software system designed to change and expand. The solution space is too large, and efficient delivery methods are a search through this space.


I may have phrased it poorly. What I should have said is that Google absolutely could have “anticipated” that many of their subsystems would deal with partial messages and multiple versions, because they most certainly already did. The designers would have maintained, developed and debugged exactly such systems for years.


Makes sense: they knew arbitrary mutability was a requirement but did not think it thru for the required keyword.


Static types are a partial application/reduction when certain mutable or unknown variables become constants (i.e. "I for sure only need integers between 0-255 here").

I'm not rejecting static types entirely, and yes I was discussing exchanging data here, as Alan Kay's OOP is inherently distributed. It's much closer to Erlang than it is to Java.


> I'm not rejecting static types entirely, and yes I was discussing exchanging data here

OK I guess I'm having a hard time reconciling that with:

> basically all static typing


I'm not the person you're responding to, but I interpreted their comment as, "doesn't the argument against having protobuf check for required fields also apply to all of protobuf's other checks?"

From the linked article: "The right answer is for applications to do validation as-needed in application-level code. If you want to detect when a client fails to set a particular field, give the field an invalid default value and then check for that value on the server. Low-level infrastructure that doesn’t care about message content should not validate it at all."

(I agree that "static typing" isn't exactly the right term here. But protobuf dynamic validation allows the programmer to then rely on static types, vs having to dynamically check those properties with hand-written code, so I can see why someone might use that term.)


Sorry, I see how I'm vague. The idea is you have no "pre-burned" static types, but dynamic types. And static types then become a disposable optimization compiled out of more dynamic code, in the same way JIT works in V8 and JVM for example (where type specialization is in fact part of the optimization strategy).


You're describing dynamic types


But with the benefit of static types, and without the drawbacks of static types.


No. "Types only known at runtime" are dynamic types. "And also you can optimize by examining the types at runtime" is just dynamic types. And it does not have the benefit of static types because it is dynamic types.


This is devolving into a "word definition war" so I'll leave aside what you call static types and dynamic types and get down to specifics. Type info is available in these flavors, relative to runtime:

1. Type info which is available before runtime, but not at runtime (compiled away).

2. Type info which is available at runtime, but not at compile time (input, statistics, etc.).

3. Type info which is available both at compile time and runtime (say like a Java class).

When you have a JIT optimizer that can turn [3] and [2] into [1], there's no longer a reason to have [1], except if you're micro-optimizing embedded code for some device with 64kb RAM or whatever. We've carried through legacy practices, and we don't even question them, and try to push them way out of their league into large-scale distributed software.

When I say we don't need [1], this doesn't mean I deny [3], which is still statically analyzable type information. It's static types, but without throwing away flexibility and data at runtime, that doesn't need to be thrown away.


Short of time travel one can not turn (3) or (2) into (1). I'm not sure where the confusion here is or what you're advocating for because this isn't making sense to me.

> there's no longer a reason to have [1]

I guess if you're assuming the value of static types is just performance? But it's not, not by a long shot - hence 'mypy', a static typechecker that in no way impacts runtime.

I think this conversation is a bit too confusing for me so I'm gonna respectfully walk away :)


The confusion is to assume "runtime" is statically defined. JIT generates code which omits type information that's determined not to be needed in the context of the compiled method/trace/class/module. That code still "runs"; it's "runtime".


Yes, the types that JIT omits are dynamic types.


It's up to you.

It's easy to imagine any statically typed language having a general-purpose JSON type. You could imagine all functions accepting and returning such objects.

Now it's your turn to implement the sum(a,b) function. Would you like to allow the caller to pass anything in as a and b?
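
In TypeScript the two extremes look something like this; the dynamic version forces exactly the point-of-use checks discussed above:

    type Json = null | boolean | number | string | Json[] | { [k: string]: Json };

    function sumTyped(a: number, b: number): number {
      return a + b;                                  // callers can't pass garbage
    }

    function sumDynamic(a: Json, b: Json): number {
      if (typeof a !== "number" || typeof b !== "number") {
        throw new Error("sum() needs two numbers");  // checked at the point of use
      }
      return a + b;
    }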


This is like when people use protobuf to send a list of key-value mappings, and call that a protocol. (I've seen that same design in many protocol description arenas, even SQL database schemas that are just (entityId INT, key CLOB, value BLOB).)


Do you need to make different versions of a program exchange information even though they do not agree on the types? No? Then this argument cannot be extended this way.


See my sibling comment, e.g. with respect to Rich Hickey's framing - https://news.ycombinator.com/item?id=36911033


From the FAQ [1]

> The right answer is for applications to do validation as-needed in application-level code.

It would've been nice to include a parameter to switch "required message validation" on and off, instead of relying on application code. Internally in an application, we can turn this off, the message bus can turn it off, but in general, developers would really benefit from this being on.

[1] https://capnproto.org/faq.html#how-do-i-make-a-field-require...
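
Such a switch could be as simple as a parse option; a hypothetical sketch (not an API protobuf or Cap'n Proto actually offer):

    interface FooMessage { query?: string }

    interface ParseOptions {
      enforceRequired: boolean;  // off for middlemen, on at the endpoints
    }

    // Hypothetical generated decoder; stands in for real parsing code.
    declare function decodeFoo(bytes: Uint8Array): FooMessage;

    function parseFoo(bytes: Uint8Array, opts: ParseOptions): FooMessage {
      const msg = decodeFoo(bytes);
      if (opts.enforceRequired && msg.query === undefined) {
        throw new Error("missing required field: query");
      }
      return msg;
    }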


Very good point.

A gotcha along the same path: deserialization of things you don't need, which is what you get with generated clients. This is an aspect of interfaces in Go I really like: type only what I use remotely, skip the rest. It's not fun to have incidents caused by changes to a part of a contract that a service doesn't even use. They're also hard to track down.


That FAQ answer has a very nice parallel with Hickey's video on a similar topic: https://m.youtube.com/watch?v=YR5WdGrpoug&feature=youtu.be


Avro solves this problem completely, and more elegantly, with its schema resolution mechanism. Exchanging schemas at the beginning of a connection handshake is hardly burdensome.


If by "solving" you mean "refuse to do anything at all unless you have the exact schema version of the message you're trying to read" then yes. In a RPC context that might even be fine, but in a message queue...

I will never use Avro again on a MQ. I also found the schema resolution mechanism anemic.

Avro was (is?) popular on Kafka, but it is such a bad fit that Confluent created a whole additional piece of infra called Schema Registry [1] to make it work. For Protobuf and JSON schema, it's 90% useless and sometimes actively harmful.

I think you can also embed the schema in an Avro message to solve this, but then you add a massive amount of overhead if you send individual messages.

[1] https://docs.confluent.io/platform/current/schema-registry/i...


> but it is such a bad fit that Confluent created a whole additional piece of infra called Schema Registry [1] to make it work.

That seems like a weird way to describe it. It is assumed that a schema registry would be present for something like Avro. It's just how it's designed - the assumption with Avro is that you can share your schemas. If you can't abide by that don't use it.


I do not think it's unfair at all. Schema Registry needs to add a wrapper and UUID to an Avro payload for it to work, so at the very least Avro as-is is unsuitable for an MQ like Kafka, since you cannot use it efficiently without some out-of-band communication channel.


Everyone knows you need an out of band channel for it, I don't know why you're putting this out there like it's a fault instead of how it's designed. Whether it's RPC where you can deploy your services or a schema registry, that is literally just how it works.

Wrapping a message with its schema version so that you can look up that version is a really sensible way to go. A uuid is way more than what's needed since they could have just used a serial integer but whatever, that's on Kafka for building it that way, not Avro.


> a serial integer

And now you can't trivially port your data between environments.


Can you elaborate? I don't see any issue at all.


Understanding the data now depends not just on having a schema to be found in a registry, but your schema registry, with the schemata registered in the same specific order you registered them in. If you want to port some data from prod back to staging, you need to rewrite the IDs. If you merge with some other company using serial IDs and want to share data, you need to rewrite the IDs. Etc.


I have no idea what you're talking about. If the standard were a 64bit integer none of what you're saying would be the case at all.

There is no difference between a random 128-bit integer and a sequential 64-bit integer, except that the 64-bit integer is smaller.


They're saying that for a sequential number, you must have a central authority managing that number. If two environments have different authorities, they have a different mapping of these sequential numbers, so now they can't share data.


Oh. Like, if you have two different schema registries that you simultaneously deploy new schemas to while also wanting to synchronize the schemas across them.

Sounds weird as hell.


Having the schema for a data format I'm decoding has never been a problem in my line of work, and I've dealt with dozens of data formats. Evolution, versioning and deprecating fields, on the other hand, are always a pain in the butt.


If an n+1-version producer sends a message to the message queue with a new optional field, how do the version-n consumers have the right schema without relying on some external store?

In Protobuf or JSON this is not a problem at all, the new field is ignored. With Avro you cannot read the message.


I mean, a schema registry solves this problem, and you just put the schema into the registry before the software is released.

A simpler option is to just publish the schema into the queue periodically, say every 30 seconds, and then receivers can cache schemas for message types they are interested in.


Disagree. Avro makes messages slightly smaller by removing tags, but it makes individual messages completely incomprehensible without the writer schema. For serializing data on disk it's fine and a reasonable tradeoff to save space, but for communication on the wire tagged formats allow for more flexibility on the receiver end.

The spec for evolving schemas is also full of ambiguity and relies on the canonical Java implementation. I've built an Avro decoder from scratch and some of the evolution behaviour is counter-intuitive.


Protobufs are also undecodable without the schema. You can't even properly log unknown tags, because the wire format is ambiguous and doesn't encode the data type (just the size).


> Exchanging schemas at the beginning of a connection handshake is hardly burdensome.

I dunno, that sounds extremely burdensome to me, especially if the actual payload is small.

And how exactly does exchanging schemas solve the problem? If my version of the schema says this field is required but yours says it is optional, and so you don't send it, what am I supposed to do?


Avro makes that case slightly better, because you can supply a default value for a missing field in one of the two schemas, and then it works.

It's not worth the boatload of problems it brings in all the other, normal use cases, though. Having the default value in the app or specified by the protocol is good enough.


Rustaceans in shambles



