> it throws out all of the interesting parts of Borg
This is not true. It throws out the Google-specific parts of Borg (like integration with Google's service discovery, load balancing, and monitoring systems) and improves a number of things compared to Borg. For a good reference on the evolution of Borg into Kubernetes, I recommend the recent Kubernetes Podcast interview with Brian Grant: https://kubernetespodcast.com/episode/043-borg-omega-kuberne...
> Google themselves don't use it
This is not true. The reasons it hasn't replaced Borg are the integrations I mentioned above (which will take time to rebuild or replace) and the zillions of lines of Borg config that have built up over the years, rather than the concerns people outside of Google would have (production-worthiness, reliability, etc.)
(Disclaimer: I worked on Borg at Google, and now work on Kubernetes at Google.)
Unfortunately we can't discuss the parts of Google's platform that aren't in Kubernetes on this forum. If we could, I think I could defend my statement reasonably well. But perhaps you just don't think that the pieces I would mention qualify as interesting.
I don't think multi-tenancy has been "retrofitted" onto Kubernetes. Kubernetes was designed with multi-tenancy in mind from the very early releases -- namespaces, authn/authz (initially ABAC, later RBAC), ResourceQuota, PodSecurityPolicy, etc. New features are added over time, such as NetworkPolicy (which has been in Kubernetes for a year and a half, so perhaps not "new" anymore!), EventRateLimit, and others, but always in a principled way. And the integration of container isolation technologies like gVisor and Kata uses a standard Kubernetes extension point (the Container Runtime Interface), so I do not view this work as retrofitting.
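To make that concrete, here is a minimal sketch of those per-tenant building blocks, written in Go against the k8s.io/api types: a namespace, a RoleBinding that grants the built-in "edit" ClusterRole only inside that namespace, and a ResourceQuota capping the tenant's consumption. Tenant, group, and quota names/values are made up for illustration; the program just constructs the objects and prints them.

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	rbacv1 "k8s.io/api/rbac/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A namespace per tenant is the basic unit of control-plane isolation.
	ns := corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: "team-a"}}

	// RBAC: grant team-a's developers edit-like rights only inside their namespace.
	binding := rbacv1.RoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: "team-a-developers", Namespace: "team-a"},
		Subjects: []rbacv1.Subject{{
			Kind: "Group", Name: "team-a-devs", APIGroup: "rbac.authorization.k8s.io",
		}},
		RoleRef: rbacv1.RoleRef{
			APIGroup: "rbac.authorization.k8s.io", Kind: "ClusterRole", Name: "edit",
		},
	}

	// ResourceQuota: cap what the tenant can consume in that namespace.
	quota := corev1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{Name: "team-a-quota", Namespace: "team-a"},
		Spec: corev1.ResourceQuotaSpec{
			Hard: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("20"),
				corev1.ResourceMemory: resource.MustParse("64Gi"),
				corev1.ResourcePods:   resource.MustParse("100"),
			},
		},
	}

	for _, obj := range []interface{}{ns, binding, quota} {
		b, _ := json.MarshalIndent(obj, "", "  ")
		fmt.Println(string(b))
	}
}
```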
Moreover, even today there are real public PaaSes that expose the Kubernetes API served by a multi-tenant Kubernetes cluster to mutually untrusting end-users, e.g. OpenShift Online and one of the Huawei cloud products (I forget which one). Obviously Kubernetes multi-tenancy isn't going to be secure enough today for everyone, especially folks who want an additional layer of isolation on top of cgroups/namespaces/seccomp/AppArmor/etc., but there are a lot of advantages to minimizing the number of clusters. (See my other comment in this thread about the pattern we frequently see of separate clusters for dev/test vs. staging vs. prod, possibly per region, but sharing each of those among multiple users and/or applications.)
Disclosure: I work at Google on Kubernetes and GKE.
From a security (as opposed to workload isolation) perspective, I don't think k8s was designed with multi-tenancy in mind at all, in early versions.
I've definitely had conversations with some of the project originators where it was clear the security boundary was intended to be at the cluster level in early versions.
Some of the security weaknesses in earlier versions (e.g. no AuthN on the kubelet, cluster-admin-grade service tokens, etc.) make that clear.
Now it's obvious that secure hard multi-tenancy is a goal going forward (and I'll be very interested to see what the third-party audit throws up in that regard), but it is a retrofit.
> I don't think multi-tenancy has been "retrofitted" onto Kubernetes. Kubernetes was designed with multi-tenancy in mind from the very early releases -- namespaces, authn/authz (initially ABAC, later RBAC), ResourceQuota, PodSecurityPolicy, etc.
My complaint is that these require assembly and are in many cases opt-in (making RBAC opt-out was a massive leap forward).
Namespaces are the lynchpin, but are globally visible. In fact an enormous amount of stuff tends to wind up visible in some fashion. And I have to go through all the different mechanisms and set them up correctly, align them correctly, to create a firmer multi-tenancy than the baseline.
Put another way, I am having to construct multi-tenancy inside multiple resources at the root level, rather than having tenancy as the root level under which those multiple resources fall.
> there are a lot of advantages to minimizing the number of clusters.
The biggest is going to be utilisation. Combining workloads pools variance, meaning you can safely run at a higher baseline load. But I think that can be achieved more effectively with virtual kubelet.
> The biggest is going to be utilisation. Combining workloads pools variance, meaning you can safely run at a higher baseline load.
Utilization is arguably the biggest benefit (fewer nodes if you can share nodes among users/workloads, fewer masters if you can share the control plane among users/workloads), but I wouldn't under-estimate the manageability benefit of having fewer clusters to run. Also, for applications (or application instances, e.g. in the case of a SaaS) that are short-lived, the amount of time it takes to spin up a new cluster to serve that application (instance) can cause a poor user experience; spinning up a new namespace and pod(s) in an existing multi-tenant cluster is much faster.
> But I think that can be achieved more effectively with virtual kubelet.
I think it's hard to compare virtual kubelet to something like Kata Containers, gVisor, or Firecracker. You can put almost anything at the other end of a virtual kubelet, and as others have pointed out in this thread virtual kubelet doesn't provide the full Kubelet API (and thus you can't use the full Kubernetes API against it). At a minimum I think it's important to specify what is backing the virtual kubelet, and what Kubernetes features you need, in order to compare it with isolation technologies like the others I mentioned.
Disclosure: I work at Google on Kubernetes and GKE.
One trick I've used before is to create resources and leave them unused until they're allocated, at which point I create another one to top off the pool of pre-created resources. A stopped cluster takes up disk space and nothing else, so this is an easy solution to the user-experience issue.
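As a rough illustration of that pattern (not any particular provider's API -- provisionCluster below is a hypothetical stand-in for whatever actually creates and then stops the resource): hand out a pre-created item immediately, and top the pool back off in the background.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// provisionCluster is a hypothetical stand-in for whatever actually creates
// (and then stops) a cluster or other expensive resource.
func provisionCluster(id int64) string { return fmt.Sprintf("cluster-%d", id) }

func main() {
	const poolSize = 3
	pool := make(chan string, poolSize)
	var next int64

	// Pre-create the pool up front; a stopped cluster costs little while idle.
	for i := 0; i < poolSize; i++ {
		pool <- provisionCluster(atomic.AddInt64(&next, 1))
	}

	// Allocation hands out a pre-created resource immediately, then tops off
	// the pool in the background so the next request is also served instantly.
	allocate := func() string {
		c := <-pool
		go func() { pool <- provisionCluster(atomic.AddInt64(&next, 1)) }()
		return c
	}

	fmt.Println("allocated:", allocate())
}
```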
Of course, hardening multi-tenant clusters is also needed. Even if the use case requires resource partitioning, there are use cases that don't and keeping one friend from stepping on another's toes is always a good idea.
I'd like to understand more about your second paragraph, since it shapes some of the work I want to do in 2019. What should I be reading or looking up?
A pattern we're seeing a lot of recently is one cluster per "stage" per region, where a "stage" is something like dev/test, canary, and prod. (In some cases only prod is replicated across multiple regions.) I think this may end up being the "sweet spot" for Kubernetes multi-tenancy architecture. The number of clusters isn't quite at the "Kubesprawl" level (I love that phrase and am absolutely going to steal it) -- you can still treat them as pets. But you get good isolation; you can limit access to the prod clusters to only the small set of folks (and perhaps the CD system) authorized to push code there, you can canary Kubernetes upgrades on the canary cluster(s), etc.
As an aside, something that's useful when thinking about Kubernetes multi-tenancy is to understand the distinction between "control plane" multi-tenancy and "data plane" multi-tenancy. Data plane multi-tenancy is about making it safe to share a node (or network) among multiple untrusting users and/or workloads. Examples of existing features for data plane multi-tenancy are gVisor/Kata, PodSecurityPolicy, and NetworkPolicy. Control plane multi-tenancy is about making it safe to share the cluster control plane among multiple untrusting users and/or workloads. Examples of existing features for control plane multi-tenancy are RBAC, ResourceQuota (particularly quota on number of objects; quota on things like cpu and memory are arguably data plane), and the EventRateLimit admission controller.
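As one small data-plane example (the namespace name is just illustrative), here is a sketch of a default-deny NetworkPolicy that only allows traffic from within the tenant's own namespace, built with the k8s.io/api types:

```go
package main

import (
	"encoding/json"
	"fmt"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Applies to every pod in "team-a"; only peers matched by an ingress rule
	// may reach them. An empty PodSelector in a peer means "any pod in the
	// policy's own namespace", so cross-namespace traffic is dropped.
	policy := networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "deny-from-other-namespaces", Namespace: "team-a"},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{}, // all pods in the namespace
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
			Ingress: []networkingv1.NetworkPolicyIngressRule{{
				From: []networkingv1.NetworkPolicyPeer{{PodSelector: &metav1.LabelSelector{}}},
			}},
		},
	}

	b, _ := json.MarshalIndent(policy, "", "  ")
	fmt.Println(string(b))
}
```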
Also, I gave a talk at KubeCon EU earlier this year that gives a rough overview of Kubernetes multi-tenancy, that might be of interest to some folks: https://kccnceu18.sched.com/event/Dqvb?iframe=no
(links to the slides and YouTube video are near the bottom of the page)
Disclosure: I work at Google on Kubernetes and GKE.
Many teams use clusters for stages because they work on underlying cluster components and need to ensure they work together and that upgrade processes work (e.g. Terraform configs come to mind). There's no reason to separate accounts because the cluster constructs aren't there for security.
Considering it more deeply (I haven't had to think about this for a while), I think multi-tenancy would cover almost all of the use cases I've seen, except for platform dev, where people use clusters for separation when testing cluster config-as-code changes.
I basically split the clusters into livedata, nolivedata, random untrusted code (ci), shared tooling.
The idea being that you have a process around getting your code to run on the livedata cluster, and thus we add more stringent requirements for accessing each API.
This is for soft tenancy, and you want to write admission controllers to reject apps that haven't gone through the defined process.
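A minimal sketch of that kind of gate, assuming a validating admission webhook; the "example.com/approved-by-pipeline" annotation is a made-up convention for illustration, and real webhook TLS/registration setup is omitted:

```go
package main

import (
	"encoding/json"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// serveValidate rejects pods that don't carry evidence of having gone through
// the approved deployment process (here, a made-up annotation).
func serveValidate(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "bad admission review", http.StatusBadRequest)
		return
	}

	var pod corev1.Pod
	_ = json.Unmarshal(review.Request.Object.Raw, &pod)

	resp := &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: true}
	if pod.Annotations["example.com/approved-by-pipeline"] != "true" {
		resp.Allowed = false
		resp.Result = &metav1.Status{
			Message: "pods on this cluster must be deployed via the approved pipeline",
		}
	}

	// Reuse the incoming TypeMeta so apiVersion/kind are echoed back correctly.
	review.Response = resp
	_ = json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/validate", serveValidate)
	// Real webhooks must serve TLS; cert provisioning is omitted for brevity.
	_ = http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil)
}
```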
The distinction is very helpful and gets at something I was struggling to articulate.
Edit: looking more in the thread, you clearly know this much better than I do. I'd like to get the chance to talk and improve my understanding, if you ever find some spare time.
> Borg will remain orders of magnitude beyond Kubernetes until Kubernetes is completely rearchitected. It’s not scalability bugs. It’s decisions regarding how the cluster maintains state that hamstring it, and that’s so fundamental to everything it’s not a find/squish loop.
Can you say more about this? Borgmaster uses Paxos for replicating checkpoint data, and etcd uses Raft for replicating the equivalent data, but these are really just two flavors of the same algorithm. I don't doubt that there are probably more efficient ways that Kubernetes could handle state (I don't claim to be an expert in that area), but I don't think they're approaches that would look any more like Borg than Kubernetes does.
If you're at liberty to do so, could you say what orchestrators the customers you mentioned chose in lieu of Kubernetes? What scale are they running at for a single cluster?
> It can only span multiple clouds now because other clouds had to ship Kubernetes.
This isn't true. People were running open-source Kubernetes on AWS and Azure before either provider had a hosted Kubernetes service. In fact back when GKE was the only hosted Kubernetes service, more companies were running Kubernetes on non-Google platforms than on GCP (https://www.cncf.io/blog/2017/12/06/cloud-native-technologie...).
Stubby and Chubby are not related to Borg's scalability.
The reason Kubernetes scalability was originally not so great was that it simply wasn't prioritized. We were more concerned with building a feature set that would drive adoption (and making sure the system was stable). Only once Kubernetes began to have serious users did we start worrying about scalability. There have been a number of blog posts on the Kubernetes blog over the years about what we did to improve scalability, and how we measure it.
I'd encourage you to join the Kubernetes scalability SIG (https://github.com/kubernetes/community/tree/master/sig-scal...) to learn more about this topic. The SIG is always interested in understanding people's scalability requirements, and improving Kubernetes scalability beyond the current 5000 node "limit." (I put that in quotes because there's no performance cliff, it's just the maximum number of nodes you can run today if you want Kubernetes to meet the Kubernetes performance SLOs given the workload in the scalability tests.)
[Disclaimers: I worked on Borg and Omega, and currently work on Kubernetes/GKE. Everything here is my personal opinion.]
There's a lot to unpack here, but I'll do my best.
I don't see Kubernetes locking people into GKE. There's an extensive conformance program (https://github.com/cncf/k8s-conformance) administered by the CNCF. AWS and Azure both have certified hosted Kubernetes offerings. Portability is in Google's best interest.
Go, Docker, and etcd were the best open-source technologies for the job at the time Kubernetes was created (and arguably still are). Open-sourcing Borg would have been impossible, due to its use of many Google-specific libraries (though a number of those have been open-sourced since then), and its close coupling to the Google production environment. Commenting more specifically on each of the pieces you mentioned:
* Go was chosen over C++ because, like C++, it is a systems language, but is much more accessible for building an open-source community.
* Docker was (and still is) by far the most popular container runtime, and the slimmer containerd makes it even more appropriate to serve as the container runtime for a system like Kubernetes. While it's true that in Borg the container runtime and "package" (container image) management systems are separate, the tradeoffs between packaging more in the image vs. pre-installing dependencies on the host are exactly the same as with Docker images. In any event, it's very feasible to build very slim Docker images (you definitely don't need getty in your image :-).
* You can read the reasons etcd was chosen in this recent comment (https://news.ycombinator.com/item?id=17476142) from a Red Hat employee who is one of the earliest contributors to Kubernetes and one of the most prolific. Regarding consensus, I didn't understand your comment; Borg uses Paxos and etcd uses Raft, but those are basically equivalent algorithms.
Regarding scalability, we do continuous scalability testing as part of the Kubernetes CI pipeline, at a cluster size of 5000 nodes. If you're interested in learning more, I'd encourage you to join the scalability SIG (https://github.com/kubernetes/community/tree/master/sig-scal...). I'm not aware that "messaging around Kubernetes has gravitated toward smaller, targeted clusters." It's true that a lot of people do use small-ish clusters, but AFAICT that's not because of scalability limitations, but rather because (1) the hosted Kubernetes offerings make it so easy to spin up clusters on demand, and (2) until recently, Kubernetes was lacking critical multi-tenancy features that would allow, say, multiple teams within a company to safely share a cluster.
Regarding mixing batch and interactive/serving applications in a single cluster managed by a single control plane, this has been the intention of Kubernetes from the beginning. It's true that open-source batch systems like Hadoop and Spark have traditionally shipped with their own orchestrators/schedulers, but that's starting to change as Kubernetes becomes more popular, for example Spark now supports Kubernetes natively (https://kubernetes.io/blog/2018/03/apache-spark-23-with-nati...). In terms of features that enable batch and serving workloads to share a node and a cluster, Kubernetes has had the concept of QoS classes (https://kubernetes.io/docs/tasks/configure-pod-container/qua...) from the beginning, and as of the most recent Kubernetes release we now have priority/preemption (https://cloudplatform.googleblog.com/2018/02/get-the-most-ou...). QoS classes and priority/preemption are the two main concepts that allow batch and interactive/serving application to share nodes and clusters in Borg, and we now have them in Kubernetes.
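As a rough sketch of how those two concepts appear in the API (using the scheduling.k8s.io types; the class names, values, and images are made up): a PriorityClass for serving workloads, and a pod whose requests equal its limits and therefore lands in the Guaranteed QoS class.

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	schedulingv1 "k8s.io/api/scheduling/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Higher-priority pods can preempt lower-priority (e.g. batch) pods.
	pc := schedulingv1.PriorityClass{
		ObjectMeta:  metav1.ObjectMeta{Name: "serving-high"},
		Value:       100000,
		Description: "latency-sensitive serving workloads",
	}

	guaranteed := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("2"),
		corev1.ResourceMemory: resource.MustParse("4Gi"),
	}

	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "frontend"},
		Spec: corev1.PodSpec{
			PriorityClassName: "serving-high",
			Containers: []corev1.Container{{
				Name:  "frontend",
				Image: "example.com/frontend:v1",
				// Requests == limits for every resource => Guaranteed QoS class,
				// so this container is the last to be throttled or evicted.
				Resources: corev1.ResourceRequirements{
					Requests: guaranteed,
					Limits:   guaranteed,
				},
			}},
		},
	}

	for _, obj := range []interface{}{pc, pod} {
		b, _ := json.MarshalIndent(obj, "", "  ")
		fmt.Println(string(b))
	}
}
```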
On your fifth point, I agree that this is one of the strengths of the Google production environment, but Kubernetes is limited in how prescriptive it can be in dictating how people write applications, since we want Kubernetes to work with essentially any application. This is why we have, for example, extremely flexible liveness/readiness probes in Kubernetes (https://kubernetes.io/docs/tasks/configure-pod-container/con...) rather than the expectation that every application has a built-in web server that exports a predefined /statusz endpoint. That said, we have been more prescriptive in how to build Kubernetes control plane components (for example such components generally have /healthz endpoints and export Prometheus instrumentation according to the guidelines outlined at https://github.com/kubernetes/community/blob/master/contribu...). Over time as containers and the "cloud native" architecture become more popular, I think there will be more standardization in the ways you described when people see the benefits it provides in allowing them to plug in their app immediately to standard container ecosystems. To some extent Istio (https://github.com/istio/istio) is a step in that direction, and in some sense even better because it interposes transparently rather than requiring you to build your application a particular way.
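For example, a container's probes can point at whatever the application already exposes -- an HTTP endpoint, a TCP port, or an arbitrary command -- rather than a mandated /statusz. A minimal sketch (endpoint paths, ports, and commands below are made up):

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	c := corev1.Container{
		Name:  "app",
		Image: "example.com/app:v1",
	}

	// Readiness: only route traffic to the pod once the app says it's ready.
	ready := &corev1.Probe{InitialDelaySeconds: 5, PeriodSeconds: 10}
	ready.HTTPGet = &corev1.HTTPGetAction{Path: "/ready", Port: intstr.FromInt(8080)}
	c.ReadinessProbe = ready

	// Liveness: restart the container if this command starts failing.
	live := &corev1.Probe{PeriodSeconds: 15, FailureThreshold: 3}
	live.Exec = &corev1.ExecAction{Command: []string{"/bin/check-health"}}
	c.LivenessProbe = live

	b, _ := json.MarshalIndent(c, "", "  ")
	fmt.Println(string(b))
}
```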
For anyone interested in learning more about the evolution of cluster management systems at Google, I recommend this paper: https://ai.google/research/pubs/pub44843
While Kubernetes is definitely not the same codebase as Borg, I do think it's accurate to say that it is the descendant of Borg.
Dumb question: why does K8s use a centralized architecture like Borg, if the perf gains from an Omega-style shared-state scheduler decentralization (and maybe a Mesos-style two-level scheduler for batch with multiple frameworks) were already known, and Omega was already being folded back into Borg?
Is this related to (I'm assuming) the fact that K8s was originally architected "mostly" with service rather than batch in mind, and a monolithic scheduler was "good enough"?
(Disclaimer: I haven't really followed K8s stuff in the last few months. How is multi-scheduler support for K8s nowadays, anyways?)
You can actually build an Omega vertical / Mesos framework architecture on Kubernetes, as described in this doc[1]. That doc pre-dated CRDs; the way you'd do it today is to build the application lifecycle management part of the framework using a CRD + controller, and run an application-specific scheduler (for pods created by that controller) alongside the default scheduler. The Kubernetes documentation page explaining how to run multiple/custom schedulers is here[2].
Borg only worked with a single scheduler, but Kubernetes allows you to build Omega/Mesos style verticals/frameworks and associated scheduling as user extensions to the control plane (as described above).
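Concretely, a pod opts into one of those alternative schedulers just by naming it in its spec; pods that omit the field keep using the default scheduler. A minimal sketch (the scheduler name and image are made up):

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "batch-job-1"},
		Spec: corev1.PodSpec{
			// Pods that set schedulerName are ignored by the default scheduler
			// and picked up by the named custom scheduler instead.
			SchedulerName: "my-batch-scheduler",
			Containers: []corev1.Container{{
				Name:  "worker",
				Image: "example.com/batch-worker:v1",
			}},
		},
	}

	b, _ := json.MarshalIndent(pod, "", "  ")
	fmt.Println(string(b))
}
```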
The rescheduler in Borg isn't a scheduler -- it just evicts pods, and then they go into the regular scheduler's pending queue and the regular scheduler decides where to schedule them. (At least that's how it worked at the time I left the project -- I assume it hasn't changed in this regard, but I don't know for sure.)
As a Xoogler myself, I have always wondered about the logic of "we can't open source X because it uses too many libraries and is too integrated". The obvious answer is, OK, open source the libraries and refactor the integrations to make them more flexible.
Reimplementing all of Borg from scratch seems crazy to me given the huge effort that went into it. Does Google want an open source cluster infrastructure or not? If yes, in what universe is it less effort to write a totally new one from scratch vs just progressively open sourcing things?
What's the size of the transitive dependency graph of Borg? 10MLOC? 50MLOC? 100MLOC? I have no idea. But it's a lot of code no matter what. Open sourcing that much code is a huge undertaking, unless you're just planning to throw it over the wall with no expectation of external people working on it.
On the other hand starting from scratch you get to grow the community and the codebase in lockstep.
It may be a large undertaking, but yes, it's clearly still less work to release code that exists and build a community around it than to rewrite it all from scratch and build a community around that too.
I've worked at Google for many years, and built open source communities based on code I've written from scratch several times. This is not an area I'm inexperienced in.
Generally, Google's software takes an approach that is completely different from industry norms and standards, from the bottom up, and the divergence started at Google's very beginning.
Open sourcing system software from its internal state requires the same amount of work as rewriting it, plus the effort to morph interfaces and internals to fit external needs, plus changes to internal workloads (assuming a unified internal & external stack).
I worked on google3 for years, most likely some of the code I wrote is still there. I've also done a lot of open source coding too. I'm quite familiar with the structure of google3 and it's not as different as you claim - Borg is a bunch of C++ libraries and programs that depend on each other, nothing magic about that.
So I completely disagree that open sourcing code is as hard as rewriting it from scratch. I think if you tried to argue that to anyone outside the Google bubble they'd think you were crazy. Writing code is hard work! Uploading it to github and creating some project structures around it is vastly less work.
I can't help wondering if this is engineers looking for new promotion-worthy projects.
To rebut your statements, it seems I would need to reveal a lot of technical details. You did not mention what type of software you were open sourcing when you were at Google. But it seems our overlap in knowledge is rather small.
I'll leave this open.
But I want to emphasize that what I stated are reasonable reasons for open sourcing by rewriting from scratch.
So it descends from Borg, which is fine. It does not replace Borg or indicate a Google strategy to replace Borg with Kubernetes, which was my entire point (with supporting points on why), and explaining why you made the choices in Kubernetes that you did does not dispute that at all.
I note you were careful to use the word "descendant" instead of my "successor".
What I mean is simple: Borg has borgmaster. Kubernetes approached the same concept like a Web application, and now Kubernetes has an entire SIG to play on the same field as Borg. It was a poor architectural decision, along with many others in Kubernetes, but I wasn’t discussing that. I was discussing why Google won’t replace Borg with it.
> I don't want to break your hopes but stateful containers will only ever run on GCE and AWS.
Actually, Kubernetes is starting to work on support for persistent local volumes; we know the lack of this feature is a significant barrier to running some stateful applications on Kubernetes, particularly on bare metal. The concrete proposal for how we're thinking of doing it is at:
https://github.com/kubernetes/kubernetes/pull/30044
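For reference, here is a sketch of roughly the shape local PersistentVolumes eventually took: a volume backed by a local disk, pinned to its node via required node affinity. Paths, node names, and sizes are illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pv := corev1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: "local-ssd-node-1"},
		Spec: corev1.PersistentVolumeSpec{
			Capacity:                      corev1.ResourceList{corev1.ResourceStorage: resource.MustParse("100Gi")},
			AccessModes:                   []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			PersistentVolumeReclaimPolicy: corev1.PersistentVolumeReclaimRetain,
			StorageClassName:              "local-storage",
			PersistentVolumeSource: corev1.PersistentVolumeSource{
				// The volume is a directory or device on one specific node...
				Local: &corev1.LocalVolumeSource{Path: "/mnt/disks/ssd1"},
			},
			// ...so node affinity is required: pods using this PV are scheduled
			// onto that node, rather than the volume following the pod around.
			NodeAffinity: &corev1.VolumeNodeAffinity{
				Required: &corev1.NodeSelector{
					NodeSelectorTerms: []corev1.NodeSelectorTerm{{
						MatchExpressions: []corev1.NodeSelectorRequirement{{
							Key:      "kubernetes.io/hostname",
							Operator: corev1.NodeSelectorOpIn,
							Values:   []string{"node-1"},
						}},
					}},
				},
			},
		},
	}

	b, _ := json.MarshalIndent(pv, "", "  ")
	fmt.Println(string(b))
}
```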
Having containers with local volumes is counterproductive. They're just pets that can't be moved around and killed/recreated whenever you want. (Though I understand it can be useful at times for some testing.)
IMO it's a marketing and usage problem. You should re-focus people on running exclusively stateless containers. Sell the strengths of containers, what they're good at and what they're meant to do. Containers = stateless.
Stateful containers are a hyped aberration. People barely get stateless containers working, but they want to do stateful.
I don't understand why you say "Having containers with local volumes is counterproductive." I would agree it's probably not a good architecture if you're running a huge single-node Oracle database, but it's an excellent way to run data stores like Cassandra, MongoDB, ElasticSearch, Redis, etcd, Zookeeper, and so on. Many people are already doing this, and as one large-scale real-world example, all of Google's storage systems run in containers. The first containerized applications (both at Google and in the "real world") were indeed stateless, but there's nothing about containers that makes them fundamentally ill-suited for stateful applications.
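For the data stores mentioned above, the usual shape is a StatefulSet with a volume claim per replica, so each instance gets stable identity and its own storage. A minimal sketch, using the k8s.io/api types from roughly the era of this thread (image, sizes, and mount paths are illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	labels := map[string]string{"app": "cassandra"}

	ss := appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "cassandra"},
		Spec: appsv1.StatefulSetSpec{
			ServiceName: "cassandra", // headless service providing stable per-pod DNS
			Replicas:    int32Ptr(3),
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "cassandra",
						Image: "cassandra:3.11",
						VolumeMounts: []corev1.VolumeMount{{
							Name: "data", MountPath: "/var/lib/cassandra",
						}},
					}},
				},
			},
			// One PersistentVolumeClaim per replica; each pod keeps its own data
			// across restarts and reschedules.
			VolumeClaimTemplates: []corev1.PersistentVolumeClaim{{
				ObjectMeta: metav1.ObjectMeta{Name: "data"},
				Spec: corev1.PersistentVolumeClaimSpec{
					AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
					Resources: corev1.ResourceRequirements{
						Requests: corev1.ResourceList{corev1.ResourceStorage: resource.MustParse("100Gi")},
					},
				},
			}},
		},
	}

	b, _ := json.MarshalIndent(ss, "", "  ")
	fmt.Println(string(b))
}
```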
You don't understand because you are blinded and spoiled by Google.
Go see the outside world => They have none of your internal tech and services. Stateful containers do not exist there. "Containers" means "Docker", which is experimental at best.
Another recent development in this area is the CoreOS operator pattern, which leverages pluggable controllers + Third Party Resources and was previously discussed on HN: https://news.ycombinator.com/item?id=12868594
This is not really true:
https://news.ycombinator.com/item?id=25907312
https://cloud.google.com/blog/products/containers-kubernetes...
https://www.infoq.com/presentations/alibaba-kubernetes/