
Unfortunately I have to quibble a bit, although bravo for such a high effort post.

> When you read through the lines, it becomes clear that the C++ replacement angle is more about Google than it is about Go. It seems that in 2009, Google was using C++ as the primary language for writing web servers

I worked at Google from 2006-2014 and I wouldn't agree with this characterisation, nor actually with many of the things Rob Pike says in his talk.

In 2009 most Google web servers (by unique codebase I mean, not replica count) were written in Java. A few of the oldest web servers were written in C++ like web search and Maps. C++ still dominated infrastructure servers like BigTable. However, most web frontends were written in Java, for example, the Gmail and Accounts frontends were written in Java but the spam filter was written in C++.

Rob's talk is frankly somewhat weird to read as a result. He claims to have been solving Big Problems that only Google had, but AFAIK nobody in Google's senior management asked him to do Go despite a heavy investment in infrastructure. Java and C++ were working fine at the time and issues like build times were essentially solved by Blaze (a.k.a. Bazel) combined with a truly huge build cluster. Blaze is a command-line tool written in ... drumroll ... Java (with a bit of C++ iirc).

Rob also makes the very strange claim that Google wasn't using threads in its software stack, or that threads were outright banned. That doesn't match my memory at all. Google servers were all heavily multi-threaded and async at that time, and every server exposed a /threadz URL on its management port that would show you the stack traces of every thread (in both C++ and Java). I have clear memories of debugging race conditions in servers there, well before Go existed.

> The common frameworks of the day (Spring, Java EE and ASP.net) introduced copious amounts of overhead, and the GC was optimized for high throughput, but it had very bad tail latency (GC pauses) and generally required large heap sizes to be efficient. Dependency management, build and deployment was also an afterthought.

Google didn't use any of those frameworks. It also didn't use regular Java build systems or dependency management.

At the time Go was developed Java had both the throughput-optimized parallel GC, and also the latency optimized CMS collector. Two years after Go was developed Java introduced the G1 GC which made the tradeoff more configurable.

I was on-call for Java servers at Google at various points. I don't remember GC being a major issue even back then (and nowadays modern Java GC is far better than Go's). It was sometimes a minor issue requiring tuning to get the best performance out of the hardware. I do remember the JIT compiler being a problem, because some server codebases were so large that they warmed up too slowly, and this would cause request timeouts when hitting new servers that had just started up, so some products needed convoluted workarounds like pre-warming before answering healthy to the load balancer.

Overall, the story told by Rob Pike about Go's design criteria doesn't match my own recollection of what Google was actually doing. The main project Pike was known for at Google in that era was Sawzall, a Go-like language designed specifically for logs processing, which Google phased out years ago (except in one last project where it's used for scripting purposes and where I heard the team has now taken over maintenance of the Sawzall runtime; that project was written by me, lol sorry guys). So maybe his primary experience of Google was actually writing languages for batch jobs rather than web servers, and this explains his divergent views about what was common practice back then?

I agree with your assessment of Go's success outside of Google.



This. I worked at Google around the same time. Adwords and Gmail were customers of my team.

I remember appreciating how much nicer it was to run Java servers, because best practice for C++ was (and presumably still is) to immediately abort the entire process any time an invariant was broken. This meant that it wasn't uncommon to experience queries of death that would trivially shoot down entire clusters. With Java, on the other hand, you'd just abort the specific request and keep chugging.
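
As an aside, Go later ended up closer to the Java behaviour here: net/http recovers a panicking handler and keeps serving other requests, and a small recovery wrapper makes the per-request containment explicit. A minimal sketch, not tied to anything Google-specific:

    package main

    import (
        "log"
        "net/http"
    )

    // recoverMiddleware aborts only the offending request, Java-style,
    // instead of letting a broken invariant take down the whole process.
    func recoverMiddleware(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            defer func() {
                if p := recover(); p != nil {
                    log.Printf("panic serving %s: %v", r.URL.Path, p)
                    http.Error(w, "internal error", http.StatusInternalServerError)
                }
            }()
            next.ServeHTTP(w, r)
        })
    }

    func main() {
        mux := http.NewServeMux()
        mux.HandleFunc("/boom", func(w http.ResponseWriter, r *http.Request) {
            panic("broken invariant") // a query of death, contained to one request
        })
        log.Fatal(http.ListenAndServe(":8080", recoverMiddleware(mux)))
    }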

I didn't really see any appreciable attrition to golang from Java during my time at Google. Similarly, at my last job, the majority of work in golang was from people transitioning from Ruby. I later learned a common reason to choose golang over Java was confusion about the Java tooling / development workflow. For example, folks coming from Ruby would often debug with log statements and process restarts instead of just using a debugger and hot patching code.


Yeah. C++ has exceptions but using them in combination with manual memory management is nearly impossible, despite RAII making it appear like it should be a reasonable thing to do. I was immediately burned by this the first time I wrote a codebase that combined C++ and exceptions, ugh, never again. Pretty sure I never encountered a C++ codebase that didn't ban exceptions by policy and rely on error codes instead.

This very C oriented mindset can be seen in Go's design too, even though Go has GC. I worked with a company using Go once where I was getting 500s from their servers when trying to use their API, and couldn't figure out why. I asked them to check their logs to tell me what was going wrong. They told me their logs didn't have any information about it, because the error code being logged only reflected that something had gone wrong somewhere inside a giant library and there were no stack traces to pinpoint the issue. Their suggested solution: just keep trying random things until you figure it out.

That was an immediate and visceral reminder of the value of exceptions, and by implication, GC.
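
For anyone who hasn't hit this failure mode: a bare error value bubbles up through layers with no record of where it came from, so the log only shows the final message. Wrapping with %w (Go 1.13+) at least preserves a chain of context, though still no stack trace. A small sketch with made-up names:

    package main

    import (
        "errors"
        "fmt"
    )

    var errQuota = errors.New("quota exceeded")

    // Deep inside the "giant library": returning the bare error loses the
    // call path, so the only thing that reaches the log is the final message.
    func deepInside() error { return errQuota }

    // Wrapping with %w at each layer preserves a chain of context that can
    // be inspected with errors.Is/As -- the idiomatic substitute for the
    // stack trace an exception would have carried for free.
    func handleRequest() error {
        if err := deepInside(); err != nil {
            return fmt.Errorf("loading user profile: %w", err)
        }
        return nil
    }

    func main() {
        err := handleRequest()
        fmt.Println(err)                      // loading user profile: quota exceeded
        fmt.Println(errors.Is(err, errQuota)) // true
    }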


> In 2009 most Google web servers (by unique codebase I mean, not replica count) were written in Java. A few of the oldest web servers were written in C++ like web search and Maps. C++ still dominated infrastructure servers like BigTable. However, most web frontends were written in Java, for example, the Gmail and Accounts frontends were written in Java but the spam filter was written in C++.

Thank you. I don't know much about the breakdown of different services by language at Google circa 2009, so your feedback helps me put things in focus. I knew that Java was more popular than the way Rob described it (in his 2012 talk[1], not this essay), but I didn't know how much.

I would still argue that replacing C and C++ in server code was the main impetus for developing Go. This would be a rather strange impetus outside of a big tech company like Google, which was writing a lot of C++ server code to begin with. But it also seems that Go was developed quite independently of Google's own problems.

> Rob also makes the very strange claim that Google wasn't using threads in its software stack, or that threads were outright banned. That doesn't match my memory at all.

I can't say anything about Google, but I also found that statement baffling. If you wanted to develop a scalable network server in Java at that time, you pretty much had to use threads. With C++ you had a few other alternatives (you could develop a single-threaded server using an asynchronous library like Boost ASIO, for instance), but that was probably harder than dealing with deadlocks and race conditions (which are still very much a problem in Go, the same way they are in multi-threaded C++ and Java).
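
To illustrate that last point, this toy Go program has a plain data race: it usually prints something less than 1000, and go run -race flags it. Exactly the same class of bug as in multi-threaded C++ or Java.

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        counter := 0
        var wg sync.WaitGroup
        for i := 0; i < 1000; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                counter++ // unsynchronized read-modify-write: a data race
            }()
        }
        wg.Wait()
        fmt.Println(counter) // usually < 1000; "go run -race" reports the race
    }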

> Google didn't use any of those frameworks. It also didn't use regular Java build systems or dependency management.

Yes, I am aware of that part, and it makes it clearer to me that Go wasn't trying to solve any particular problem with the way Java was used within Google. I also don't think Go won over many experienced Java developers who already knew how to deal with Java. But it did offer a simpler build-deployment-and-configuration story than Java, and that's why it attracted many Python and Node.js developers where Java failed to do so.

Many commentators have mentioned better performance and fewer errors from static typing as the main attraction for dynamic-language programmers coming to Go, but that cannot be the only reason, since Java had both of these long before Go came into being.

> At the time Go was developed Java had both the throughput-optimized parallel GC, and also the latency optimized CMS collector. Two years after Go was developed Java introduced the G1 GC which made the tradeoff more configurable.

Frankly speaking, GC was a more minor problem for people coming from dynamic languages. But the main issue for this type of developer is that the GC in Java is configurable: in practice most of the developers I've worked with (even seasoned Java developers) do not know how to configure and benchmark the Java GC, which is quite an issue.

JVM warmup was and still is a major issue in Java. New features like AppCDS help a lot with this issue, but they require some knowledge, understanding and work. Go solves that out of the box by forgoing JIT compilation (of course, it loses other important optimizations that a JIT naturally enables, like monomorphic dispatch).

[1] https://go.dev/talks/2012/splash.article


The Google codebase had the delightful combination of both heavily async callback oriented APIs and also heavy use of multithreading. Not surprising for a company for whom software performance was an existential problem.

The core libraries were not only multi-threaded, but threaded in such a way that there was no way to shut them down cleanly. I was rather surprised when I first learned this fact during initial training, but the rationale made perfect sense: clean shutdown in heavily threaded code is hard and can introduce a lot of synchronization bugs, but Google software was all designed on the assumption that the whole machine might die at any moment. So why bother with clean shutdown when you had to support unclean shutdown anyway? Might as well just SIGKILL things when you're done with them.

And by core libraries I mean things like the RPC library, without which you couldn't do anything at all. So that I think shows the extent to which threading was not banned at Google.


As an aside:

This principle (always shutdown uncleanly) was a significant point of design discussion in Kubernetes, another one of the projects that adapted lessons learned inside Google on the outside (and had to change as a result).

All of the core services (kubelet, apiserver, etc) mostly expect to shutdown uncleanly, because as a project we needed to handle unclean shutdowns anyway (and could fix bugs when they happened).

But quite a bit of the software run by Kubernetes (both early and today) doesn’t always necessarily behave that way - most notably Postgres in containers in the early days of Docker behaved badly when KILLed (where Linux terminates the process without it having a chance to react).

So faced with the expectation that Kubernetes would run a wide range of software where a Google-specific principle didn’t hold and couldn’t be enforced, Kubernetes always (modulo bugs or helpful contributors regressing under-tested code paths) sends TERM, waits a few seconds, then KILLs.

And the lack of graceful Go http server shutdown (as well as it being hard to do correctly in big complex servers) for many years also made Kube apiservers harder to run in a highly available fashion for most deployers. If you don’t fully control the load balancing infrastructure in front of every server like Google does (because every large company already has a general load balancer approach built from Apache or nginx or haproxy or F5 or Cisco or …), or enforce that all clients handle all errors gracefully, you tend to prefer draining servers via code vs letting those errors escape to users. We ended up having to retrofit graceful shutdown to most of Kube’s server software after the fact, which was more effort than doing it from the beginning.
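
For reference, the retrofit mostly looks like this today: http.Server.Shutdown only arrived in Go 1.8, and before that draining cleanly meant rolling your own. A minimal sketch of TERM-then-drain that matches the TERM, wait, KILL sequence described above (the port and grace period are arbitrary):

    package main

    import (
        "context"
        "log"
        "net/http"
        "os/signal"
        "syscall"
        "time"
    )

    func main() {
        srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

        // Stop taking new work when SIGTERM arrives (what Kubernetes sends first).
        ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
        defer stop()

        go func() {
            if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
                log.Fatalf("listen: %v", err)
            }
        }()

        <-ctx.Done() // TERM received; the KILL will follow if we dawdle

        // Drain in-flight requests for a bounded grace period.
        shutdownCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()
        if err := srv.Shutdown(shutdownCtx); err != nil {
            log.Printf("forced shutdown: %v", err)
        }
    }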

In a very real sense, Google’s economy of software scale is that it can enforce and drive consistent tradeoffs and principles across multiple projects where making a tradeoff saves effort in multiple domains. That is similar to the design principles in a programming language ecosystem like Go or orchestrator like Kubernetes, but is more extensive.

But those principles are inevitably under communicated to users (because who reads the docs before picking a programming language to implement a new project in?) and under enforced by projects (“you must be this tall to operate your own Kubernetes cluster”).


> John Ousterhout had famously written that threads were bad, and many people agreed with him because they seemed to be very hard to use.

> Google software avoided them almost always, pretty much banning them outright, and the engineers doing the banning cited Ousterhout

Yeah this is simply not true. It was threads AND async in the C++ world (and C++ was of course most of the cycles)

The ONLY way to use all your cores is either threads or processes, and Google favored threads over processes (at least over fork()-based concurrency, which Burrows told people not to use).

For example, I'm 99.99% sure MapReduce workers started a bunch of threads to use the cores within a machine, not a bunch of processes. It's probably in the MapReduce paper.

So it can't be even a little true that threads were "avoided almost always"

---

What I will say is that the pattern of, say, fanning out requests to 50 or 200 servers and joining the results was async in C++. It wasn't idiomatic to use threads for that because of the cost, not because threads are "hard to use" (I learned that from hacking on Jeff Dean's tiny low-latency gsearch code in 2006). There's a rough goroutine sketch of the same fan-out at the end of this comment.

But even as early as 2009, people pushed back and used shitloads of threads, because ASYNC is hard to use -- it's a lot of manual state management.

e.g. From the paper about the incremental indexing system that launched in ~2009

https://research.google/pubs/large-scale-incremental-process...

https://storage.googleapis.com/gweb-research2023-media/pubto...

Early in the implementation of Percolator, we decided to make all API calls blocking and rely on running THOUSANDS OF THREADS PER MACHINE to provide enough parallelism to maintain good CPU utilization. We chose this thread-per-request model mainly to make application code easier to write, compared to the event-driven model. Forcing users to bundle up their state each of the (many) times they fetched a data item from the table would have made application development much more difficult. Our experience with thread-per-request was, on the whole, positive: application code is simple, we achieve good utilization on many-core machines, and crash debugging is simplified by meaningful and complete stack traces. We encountered fewer race conditions in application code than we feared. The biggest drawbacks of the approach were scalability issues in the Linux kernel and Google infrastructure related to high thread counts. Our in-house kernel development team was able to deploy fixes to address the kernel issues

To say threads were "almost always avoided" is indeed ridiculous -- IIRC this was a few dedicated clusters of >20,000 machines running 2000-5000+ threads each ... (on I'd guess ~32 cores at the time)

I remember being in a meeting where the indexing VP mentioned the kernel patches mentioned above, which is why I thought of that paper

Also as you say there were threads all over the place in other areas too, GWS, MapReduce, etc.
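
To make the cost tradeoff concrete: the fan-out-and-join that needed callback state machines in that era's C++ (burning a kernel thread per backend was too expensive) is exactly the blocking style Percolator used, and the style Go later made cheap with goroutines. A rough sketch with made-up backend URLs:

    package main

    import (
        "fmt"
        "net/http"
        "sync"
    )

    // fanOut queries every backend concurrently, one goroutine per request,
    // and joins the results -- blocking, thread-per-request style, just with
    // cheap goroutines instead of thousands of kernel threads.
    func fanOut(backends []string) []string {
        results := make([]string, len(backends))
        var wg sync.WaitGroup
        for i, url := range backends {
            wg.Add(1)
            go func(i int, url string) {
                defer wg.Done()
                resp, err := http.Get(url)
                if err != nil {
                    results[i] = "error: " + err.Error()
                    return
                }
                resp.Body.Close()
                results[i] = resp.Status
            }(i, url)
        }
        wg.Wait()
        return results
    }

    func main() {
        // Hypothetical stand-ins for the 50-200 backends mentioned above.
        fmt.Println(fanOut([]string{"http://backend-1:8080/q", "http://backend-2:8080/q"}))
    }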



