At this point, it isn't clear to me that this is going to be accepted and adopted. If it is, then it isn't clear what the economic impact will be. Companies still need the job done to be competitive, and clearly no one will be happy with pay cuts. In a country in which the industry overall is struggling, it sounds like a risky move. At the personal level, though, I understand the benefits of having more personal and family time.
My friends gave me the tongue-in-cheek nickname of "Technothrasher" many years ago in college, which I now use for my handle on many websites. I often get asked the very same question that you just asked.
True, and the blog post also provides a comprehensive comparison of the write/read path performance against Apache Kafka and Apache Pulsar, which is a plus.
Thanks for your comments, and I'm sorry that you feel the post does not match your expectations. If you write to me directly, I'll be more than happy to clarify any question you may have. I'm not sure, for example, what is confusing you about the "zookeeper and consensus" section. I'm also not sure what kind of evidence you're after on published messages being lost.
It does both: it first proposes by writing a sequential znode and then reads the children (all proposals are written under some parent znode). This certainly assumes some experience with ZK, and I wonder if that's the problem. The goal was not to go into a discussion of the ZooKeeper API, but I'm happy to clarify if that is what is preventing you from getting the point.
>> Not suppose to ? How do you ensure that ?
It is not supposed to in the sense that if this is implemented right, then the proposed values for each client won't change. The recipe guarantees it because it assumes that each client writes just once.
>> Sorry, but I fail to see what recipe you speak about.
I'm referring to these three steps: creating a sequential znode under a known parent, reading the children, and picking the value in the znode with the smallest sequence number.
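To make the decision step of those three steps concrete, here is a small sketch (the znode names, values, and the `chosen_value` helper are all hypothetical). It only simulates the last step, picking the winner among the children of the parent znode, since ZooKeeper appends a monotonically increasing numeric suffix when a sequential znode is created; a real implementation would of course go through the ZooKeeper client API.

```python
def chosen_value(children: dict) -> str:
    """children maps sequential znode names (e.g. "proposal-0000000002",
    as ZooKeeper would name them) to the values written in them.
    Every client that reads the same children picks the same value:
    the one in the znode with the smallest sequence number."""
    def seq(name: str) -> int:
        # The sequence counter is the numeric suffix of the znode name.
        return int(name.rsplit("-", 1)[1])
    winner = min(children, key=seq)
    return children[winner]

# Three clients each wrote exactly one proposal; all of them agree
# on the value proposed first.
proposals = {
    "proposal-0000000002": "value-from-client-c",
    "proposal-0000000000": "value-from-client-a",
    "proposal-0000000001": "value-from-client-b",
}
print(chosen_value(proposals))  # value-from-client-a
```

Because each client writes just once and the sequence numbers are assigned by the server, every client computes the same winner no matter in which order the reads happen.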
I'm happy you enjoyed the post and thanks for the comment.
If I understand your comment correctly, it isn't the case that the ISR is fixed. The ISR has a minimum size, but it can change over time, so brokers can be removed from the ISR and they can rejoin later once they catch up.
Yes, I got that; maybe I wasn't clear enough. My main concern is that slow nodes (rather than failing nodes) may easily provoke latency spikes, and that seems to me a fairly common situation. The good thing about quorum writes is that outliers are ignored; with the ISR, since you need to wait for all of its members, outliers are not ignored (until, perhaps, they are removed from the ISR). I understand the advantages and the compromise here, but I would like to see whether it is a good compromise, as outliers may have a big impact.
Got it. Yeah, quorum systems have a higher tolerance to tail latency, there is no question about it. We do mention it briefly in the post, but we don't have numbers to show. I'm not aware of it being a major concern for Kafka deployments, but I can say that for Apache BookKeeper we ended up adding the notion of ack quorums to get around such latency spikes. I'll see if I can gather some more information about Kafka that we can share at a later time. Thanks for raising this point.
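The difference between the two acknowledgment policies can be sketched with a toy simulation (illustrative numbers only, not measurements of Kafka or any real system): waiting for every replica in the ISR is bounded by the slowest member, while a quorum write only waits for the k-th fastest acknowledgment, so occasional outliers drop out of the quorum's latency.

```python
import random

random.seed(42)

def replica_latencies(n: int):
    # Mostly-fast replicas with an occasional slow outlier
    # (made-up latency units).
    return [random.choice([1.0, 1.0, 1.0, 50.0]) for _ in range(n)]

trials = [replica_latencies(5) for _ in range(10_000)]

# ISR-style: the write completes when the slowest replica acks.
wait_for_all = sum(max(t) for t in trials) / len(trials)
# Quorum-style: the write completes when 3 of 5 replicas ack.
wait_for_quorum = sum(sorted(t)[2] for t in trials) / len(trials)

print(f"wait-for-all average:    {wait_for_all:.1f}")
print(f"wait-for-quorum average: {wait_for_quorum:.1f}")
```

With these made-up numbers the wait-for-all average is dominated by the outliers, while the quorum average stays close to the common-case latency, which is exactly the tail-latency tolerance mentioned above.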
It would be awesome to have some numbers on this topic. Thanks for your interest. I guess that with other systems like Paxos it could be solved by separating the roles of learners and acceptors, which are usually collapsed onto the same nodes. In that case, you may have more learners than acceptors (addressing the N^2 growth of communication with the number of nodes) while still taming tail latency by running a quorum among the acceptors.
I think a further interesting observation is that one can actually trade time uncertainty (i.e. latency spikes) for location uncertainty.
One can drive latency almost arbitrarily low if one is willing to give up exact knowledge of where the data is.
A simple example is N-out-of-M writes (N and M can be arbitrary with N <= M): M is the number of machines, N is the number of replicas we want to have. At any given time we write to the N machines that respond fastest. The data is then arbitrarily sprayed over the M machines, but as long as the data itself carries ordering information, the right state can be reassembled by talking to all M machines.
The "spray" can now be controlled by favouring some nodes from M up to a timeout (i.e. we put more uncertainty in time). We can reduce the reassembly work by using learners and hence increase the likelihood that one machine has all state.
I'm glad you enjoyed the post. The chain replication work indeed relies on a Paxos-based master service for reconfiguration, but I believe that's the only similarity; the replication scheme is otherwise different.