Part 3.1.1 is about the parser, which uses lex and yacc. I actually used it as a source to convince a professor to let us use lex and yacc to build our lexer and parser for a SQL engine in a DB implementation class I took. His expectations were a bit unrealistic for what a bunch of seniors could accomplish in three weeks in C (other similarly difficult classes and job searching are a real burden), and he initially said no when I asked if we could use lex and yacc. Postgres's source made a convincing enough argument. It gave me a reason to learn lex and yacc, and it produced a parser that was much easier to modify. I wrote the parser for my group, and we were the only group that finished the parser in time - or at all.
Just an anecdote with some fond memories that were made possible by Postgres and its internals. Postgres has a place in my heart for that, and for being a damn fine DB, of course!
I think it's important to write your own parser at some point. Though for the purpose of learning, you need someone to guide you.
If you're learning this stuff, you need a simple challenge to work through to the end.
With regards to lex and yacc, I absolutely detest these tools. They are horrible, but they do work. And if you just want a functioning parser, they'll do. My main criticisms of these tools are the horrible errors you can end up with and the lack of sensible extension points.
If it is your first foray into parsers, I do think a simple grammar and handwritten lexer/parser is a good first step. Iterate some on that, then use tools to help with the verbose stuff.
Though to be honest, I enjoy writing parsers by hand. So, I'm a bit biased.
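To make "handwritten lexer" concrete, here's a minimal sketch in C of the kind of loop I mean (token names and the input are made up for illustration; a real lexer would also handle keywords, strings, and error positions):

    /* Minimal hand-written lexer sketch: identifiers, integers, single-char symbols. */
    #include <ctype.h>
    #include <stddef.h>
    #include <stdio.h>

    enum tok { TOK_IDENT, TOK_NUMBER, TOK_SYMBOL, TOK_EOF };

    struct lexer { const char *p; char text[64]; };

    static enum tok next_token(struct lexer *lx)
    {
        while (isspace((unsigned char)*lx->p)) lx->p++;      /* skip whitespace */
        if (*lx->p == '\0') return TOK_EOF;

        size_t n = 0;
        if (isalpha((unsigned char)*lx->p)) {                /* identifier / keyword */
            while (isalnum((unsigned char)*lx->p) && n < sizeof lx->text - 1)
                lx->text[n++] = *lx->p++;
        } else if (isdigit((unsigned char)*lx->p)) {         /* integer literal */
            while (isdigit((unsigned char)*lx->p) && n < sizeof lx->text - 1)
                lx->text[n++] = *lx->p++;
        } else {
            lx->text[n++] = *lx->p++;                        /* single-char symbol */
        }
        lx->text[n] = '\0';
        return isdigit((unsigned char)lx->text[0]) ? TOK_NUMBER
             : isalpha((unsigned char)lx->text[0]) ? TOK_IDENT
             : TOK_SYMBOL;
    }

    int main(void)
    {
        struct lexer lx = { .p = "select col1 , 42 ;" };     /* hypothetical input */
        while (next_token(&lx) != TOK_EOF)
            printf("token: %s\n", lx.text);
        return 0;
    }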
So we did build the DDL parser by hand, and we did well on that portion of the project. But for the DML I was the only person writing the parser, it had to be done and error-free in three weeks during which I was also busy elsewhere, and it all had to be written in C. I felt yacc really helped a lot there.
The big thing for this particular project was all the pitfalls of C: having to track memory allocations, pass arrays of pointers around, and keep it all straight mentally. Yacc streamlined this a lot. If this had been a production project and not a class I took alongside four others, hand-writing would be a serious option I'd consider (mostly for error reporting purposes, like you mentioned), but I might still lean towards getting an MVP done with a parser generator and eventually converting to a handwritten parser if the need arose.
> and he initially said no to me asking if we could use lex and yacc
What?
In my compilers class, I was explicitly asked to use lex and yacc. Sure, for the very first assignment in that class we were asked to write the lexer by hand. But when it came to actually doing the interesting parts, lex and yacc it was.
This seems to have been subject to waxing and waning fashion. I had a similar experience recently -- someone sent me a PR with a hand-written parser. I commented, "do people write parsers by hand any more?". After some pushback, I searched online and found a ton of articles on why you should write your own parser and not use a parser generator. It still beats me as to why, though.
> it also gives you a better ability to write error messages
This is not true. Instead of relying on the error recovery features of a parser generator, you can just add error cases to your grammar, and that works as well as a hand-written parser. The parser generator's built-in error recovery is a pure bonus on top.
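For example, a yacc grammar can carry an explicit error production per statement. This is a sketch with made-up token and rule names, not anything from Postgres's grammar:

    statement
        : SELECT select_list FROM IDENT ';'
        | error ';'
            {
                /* 'error' is yacc's built-in recovery token: report the problem,
                   then resume at the next ';' so the following statements still parse */
                yyerror("bad statement, skipping to next ';'");
                yyerrok;
            }
        ;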
> and an easier time debugging compared to autogenerated solutions
This is also not true. Most parser generators (certainly lex/yacc) have good enough debug support that you never see generated code while you debug.
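For instance, with bison you can watch every shift and reduce as the parser runs without ever opening the generated file. A sketch, assuming the grammar was built with %debug (or compiled with -DYYDEBUG=1):

    extern int yydebug;       /* provided by the generated parser when tracing is enabled */
    extern int yyparse(void);

    int main(void)
    {
        yydebug = 1;          /* print each token, shift, and reduction to stderr */
        return yyparse();
    }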
There are a few reasons for this. Some languages aren't context-free (the classic example is "A * B;" - is this multiplication of the A and B variables, or a declaration of the B variable with its type being a pointer to A?), which makes writing the grammar much harder in the first place. One workaround is to use a GLR parser instead, but that has the downside of very high execution time in the worst case. Lastly, even if you overcome all of this with the proper structure and hints, you run into pretty bad error cases: handling invalid input is hard, and giving the user an error that makes sense is even more difficult (which of the possible branches in the grammar rules is the one that's wrong? in simpler grammars you can probably give a good error, but a C grammar is going to have a much harder time). This is why gcc moved away from using a grammar to a custom parser, IIRC.
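To make the "A * B;" case concrete, here is a hypothetical yacc fragment (token and rule names are mine, not from a real C grammar). If the lexer can't tell typedef names from ordinary identifiers, both rules below match IDENT '*' IDENT ';' and yacc reports a conflict; resolving it needs lexer feedback (the classic "lexer hack") or a GLR parser:

    statement
        : declaration
        | expression ';'
        ;

    declaration
        : IDENT '*' IDENT ';'    /* "A * B;" read as: declare B as a pointer to type A */
        ;

    expression
        : IDENT '*' IDENT        /* "A * B"  read as: multiply variables A and B */
        ;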
I've heard the "useful error message" argument, but my personal experience coding in modern languages suggests that not much progress has been made on that front. The compiler's error message is almost never "useful".
I write parsers all the time. It's quick for me to do, and it has many advantages, including much better error messages and easy-to-debug code. The code generated by most parser generators isn't really human-readable, which makes fixing complicated parsing bugs super hard.
In my compilers class at university we wrote a recursive descent parser by hand, but we wrote the LALR and LL parsers using yacc or bison, I think - that was what the three assignments said to do, and the professor said using a parser generator was the only sensible approach for the LALR and LL parsers.
I highly recommend not using lex/yacc. I wrote an SQL parser in a day for a custom DB at work. It's super easy to write recursive descent parsers once you know how to do it. So just learn it by starting with something simple - XML, Lisp, or JSON, for example.
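As a flavor of how small such a parser can be, here's a hedged sketch of recursive descent over arithmetic expressions (an even simpler grammar than JSON; everything here is illustrative):

    /* Grammar: expr   -> term (('+'|'-') term)*
                term   -> factor (('*'|'/') factor)*
                factor -> NUMBER | '(' expr ')'        */
    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    static const char *src;                       /* cursor into the input */

    static long parse_expr(void);

    static void skip_ws(void) { while (isspace((unsigned char)*src)) src++; }

    static long parse_factor(void)
    {
        skip_ws();
        if (*src == '(') {                        /* '(' expr ')' */
            src++;
            long v = parse_expr();
            skip_ws();
            if (*src == ')') src++;
            return v;
        }
        return strtol(src, (char **)&src, 10);    /* NUMBER */
    }

    static long parse_term(void)
    {
        long v = parse_factor();
        for (;;) {
            skip_ws();
            if (*src == '*')      { src++; v *= parse_factor(); }
            else if (*src == '/') { src++; v /= parse_factor(); }
            else return v;
        }
    }

    static long parse_expr(void)
    {
        long v = parse_term();
        for (;;) {
            skip_ws();
            if (*src == '+')      { src++; v += parse_term(); }
            else if (*src == '-') { src++; v -= parse_term(); }
            else return v;
        }
    }

    int main(void)
    {
        src = "1 + 2 * (3 + 4)";
        printf("%ld\n", parse_expr());            /* prints 15 */
        return 0;
    }

Each grammar rule becomes one function, which is also what makes error reporting natural: at any point you know exactly which rule you're in and what you expected next.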
Handwritten. It worked for the first portion of the parser (DDL), but it was a real pain to debug and build while keeping track of mallocing and freeing along the way.
We got an A on that first parser that was hand written as well, but once we got to DML, I felt like we could really use tools given our time frame.
It sounds a bit silly to implement a parser for a DB internals course, though. You'll end up teaching lexing and parsing (something the professor wasn't too familiar with himself, it seems?) rather than DB internals.
Implementing your own WAL + storage API, for example, is more along the lines of what I would expect.
Agreed. At my university, in my era (almost 30 years ago), the compilers sequence was two full quarters and generally considered to be the hardest classes in the CSc major.
I really loved it but it would have been a huge distraction mixed in with the DB classes.
Those compilers classes turned out to be pretty useful. I ended up having to build custom parsers several times in my career... including an SQL parser.
I actually took my university's compiler course at the same time as this DB course, and that was my favorite class out of all four years, hands down. Honestly, the parsing portion of my DB class was helped out by the knowledge I got from the compiler class earlier in the semester, so at least it worked out!
Absolutely. That was one of my complaints with the class. I have an interest in compilers and the process, but I took this course to learn about storage and efficient lookups.
> ...postgres server process listens to one network port, the default port is 5432
> Whenever receiving a connection request from a client, it starts a backend process. (And then, the started backend process handles all queries issued by the connected client.)
> To achieve this [server] starts ("forks") a new process for each connection. From that point on, the client and the new server process communicate without intervention by the original postgres process. Thus, the master server process is always running, waiting for client connections, whereas client and associated server processes come and go. [1]
So, Postgres is using a process-per-connection model. Can someone explain why this is? And why not something like thread-per-connection?
It originally used a per-process model primarily because of the age of the codebase. It predates commodity SMP hardware (1986ish onwards, starting with Sequent) and standardized threading APIs (1993):
> DBMS code must run as a separate process from the application programs that access the database in order to provide data protection. The process structure can use one DBMS process per application program (i.e., a process-per-user model [STON81]) or one DBMS process for all application programs (i.e., a server model). The server model has many performance benefits (e.g., sharing of open file descriptors and buffers and optimized task switching and message sending overhead) in a large machine environment in which high performance is critical. However, this approach requires that a fairly complete special-purpose operating system be built. In contrast, the process-per-user model is simpler to implement but will not perform as well on most conventional operating systems. We decided after much soul searching to implement POSTGRES using a process-per-user model architecture because of our limited programming resources. POSTGRES is an ambitious undertaking and we believe the additional complexity introduced by the server architecture was not worth the additional risk of not getting the system running. Our current plan then is to implement POSTGRES as a process-per-user model on Unix 4.3 BSD.
> In POSTGRES they are run as subprocesses managed by the POSTMASTER. A last aspect of our design concerns the operating system process structure. Currently, POSTGRES runs as one process for each active user. This was done as an expedient to get a system operational as quickly as possible. We plan on converting POSTGRES to use lightweight processes available in the operating systems we are using. These include PRESTO for the Sequent Symmetry and threads in Version 4 of Sun/OS.
> So, Postgres is using process-per-connection model. Can some explain why this is? And why not something like thread-per-connection?
Thread vs process under Linux is kind of a potato, potahto thing. They are essentially the same construct - based on the flags provided when you create one, you can indicate whether you want things like shared memory or not - and they have similar overhead. Plus, fork() is incredibly useful.
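To picture the process-per-connection shape, here's a heavily simplified sketch of a fork-per-connection accept loop (this is the general Unix pattern, not actual postmaster code; error handling omitted):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { 0 };
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(5432);      /* Postgres's default port */

        bind(lfd, (struct sockaddr *)&addr, sizeof addr);
        listen(lfd, 128);

        for (;;) {
            int cfd = accept(lfd, NULL, NULL);   /* wait for a client */
            if (cfd < 0) continue;

            if (fork() == 0) {                   /* one "backend" per connection */
                close(lfd);                      /* child: serve this client only */
                /* ... handle the whole session on cfd ... */
                close(cfd);
                _exit(0);
            }
            close(cfd);                          /* parent: keep listening */
            while (waitpid(-1, NULL, WNOHANG) > 0)
                ;                                /* reap finished backends */
        }
    }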
It makes sense, for such an important piece of software, to minimize shared memory.
> All backends running as threads in a single process (not wanted)
> This eliminates the process protection we get from the current setup. Thread creation is usually the same overhead as process creation on modern systems, so it seems unwise to use a pure threaded model, and MySQL and DB2 have demonstrated that threads introduce as many issues as they solve. Threading specific operations such as I/O, seq scans, and connection management has been discussed and will probably be implemented to enable specific performance features.
Modern PG is already threaded out of necessity for many kinds of query; there's really no other way to fully exploit modern hardware (or indeed, much of the hardware released in the past 30 years).
The other reply mentioned that threads and processes aren’t much different on Linux. But on some other POSIX systems, privilege separation and resource limits tend to operate on a per-process basis - both of which are important features for a mature RDBMS. If you want to limit the CPU/memory/IO usage of a connection or ensure that a connection never accesses a privileged resource, and you need it done in a portable manner, then each connection needs a dedicated process.
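As an illustration of that point (not something Postgres itself does per backend, and the numbers are made up), once each connection has its own process the child can simply cap itself right after fork():

    #include <sys/resource.h>

    /* hypothetical per-connection limits applied in the forked child */
    static void restrict_this_backend(void)
    {
        struct rlimit cpu = { .rlim_cur = 30, .rlim_max = 30 };
        setrlimit(RLIMIT_CPU, &cpu);             /* at most 30 s of CPU time */

        struct rlimit mem = { .rlim_cur = 256UL << 20, .rlim_max = 256UL << 20 };
        setrlimit(RLIMIT_AS, &mem);              /* at most 256 MiB of address space */
    }

Limits set this way apply to the one process handling that connection and nothing else, which is exactly what's hard to do portably with threads sharing a single process.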
I constantly recommend this at work. The specific content isn't super helpful for SaaS day-to-day development, but for me it built an intuition about Postgres that has been invaluable. I think once I understood the "heart and soul" of Postgres - the heap and the MVCC - many other properties of the database just "clicked" in my head.
The heap is described in 1.3 [1] and MVCC is described in Chapter 5 [2]. I'd also recommend the Postgres docs [3]; they aren't quite as accessible as this, but they are detailed and overall well-written.
It shows that OP clearly has no idea about how copyright law works. Everybody can use the work under fair use and you cannot restrict people from posting links to your site based on copyright. It seems like the author has a god complex. "I create therefore I shall command."