Part 3.1.1 is about the parser, which uses lex and yacc. I actually used it as a source to convince a professor to let us use lex and yacc to build our lexer and parser for a SQL engine in a DB implementation class I took. His expectations were a bit unrealistic for what a bunch of seniors could accomplish in three weeks in C (other similarly difficult classes and job searching are a real burden), and he initially said no when I asked if we could use lex and yacc. Postgres's source made a convincing enough argument. It gave me a reason to learn lex and yacc, and it produced a parser that was much easier to modify. I wrote the parser for my group, and we were the only group that finished the parser in time - or at all.
Just an anecdote with some fond memories that were made possible by Postgres and its internals. Postgres has a place in my heart for that, and for being a damn fine DB, of course!
I think it's important to write your own parser at some point. Though for the purpose of learning, you need someone to guide you.
If you're learning this stuff, you need a simple challenge to work through to the end.
With regards to lex and yacc, I absolutely detest these tools. They are horrible, but they do work. And if you just want a functioning parser, they'll do. My main criticisms of these tools are the horrible errors you can end up with and the lack of sensible extension points.
If it is your first foray into parsers, I do think a simple grammar and handwritten lexer/parser is a good first step. Iterate some on that, then use tools to help with the verbose stuff.
Though to be honest, I enjoy writing parsers by hand. So, I'm a bit biased.
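To make "handwritten lexer" concrete, here's a minimal sketch in C of the kind of loop I mean (token names and the input are made up for illustration; a real lexer would also handle keywords, strings, and error positions):

    /* Minimal hand-written lexer sketch: identifiers, integers, single-char symbols. */
    #include <ctype.h>
    #include <stddef.h>
    #include <stdio.h>

    enum tok { TOK_IDENT, TOK_NUMBER, TOK_SYMBOL, TOK_EOF };

    struct lexer { const char *p; char text[64]; };

    static enum tok next_token(struct lexer *lx)
    {
        while (isspace((unsigned char)*lx->p)) lx->p++;      /* skip whitespace */
        if (*lx->p == '\0') return TOK_EOF;

        size_t n = 0;
        if (isalpha((unsigned char)*lx->p)) {                /* identifier / keyword */
            while (isalnum((unsigned char)*lx->p) && n < sizeof lx->text - 1)
                lx->text[n++] = *lx->p++;
        } else if (isdigit((unsigned char)*lx->p)) {         /* integer literal */
            while (isdigit((unsigned char)*lx->p) && n < sizeof lx->text - 1)
                lx->text[n++] = *lx->p++;
        } else {
            lx->text[n++] = *lx->p++;                        /* single-char symbol */
        }
        lx->text[n] = '\0';
        return isdigit((unsigned char)lx->text[0]) ? TOK_NUMBER
             : isalpha((unsigned char)lx->text[0]) ? TOK_IDENT
             : TOK_SYMBOL;
    }

    int main(void)
    {
        struct lexer lx = { .p = "select col1 , 42 ;" };     /* hypothetical input */
        while (next_token(&lx) != TOK_EOF)
            printf("token: %s\n", lx.text);
        return 0;
    }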
So we did build the DDL parser by hand, and we did well on that portion of the project. But for the DML I was the only person writing the parser, it had to be done and error-free in three weeks during which I was also busy elsewhere, and it all had to be written in C. I felt yacc really helped a lot there.
The big thing for this particular project was all the pitfalls of C: having to track memory allocations, pass arrays of pointers around, and keep it all straight mentally. Yacc streamlined this a lot. If this had been a production project and not a class I took alongside four others, hand-writing would be a serious option I'd consider (mostly for error reporting purposes, like you mentioned), but I might still lean towards getting an MVP done with a parser generator and eventually converting to a handwritten parser if the need arose.
> and he initially said no to me asking if we could use lex and yacc
What?
In my compilers class, I was explicitly asked to use lex and yacc. Sure, for the very first assignment in that class we were asked to write the lexer by hand. But when it came to actually doing the interesting parts, lex and yacc it was.
This seems to have been subject to waxing and waning fashion. I had a similar experience recently -- someone sent me a PR with a hand-written parser. I commented, "do people write parsers by hand any more?". After some pushback, I searched online and found a ton of articles on why you should write your own parser and not use a parser generator. It still beats me as to why, though.
> it also gives you a better ability to write error messages
This is not true. Instead of relying on the error recovery features of a parser generator, you can just add error cases to your grammar, and that works as well as a hand-written parser. The parser generator's built-in error recovery is a pure bonus on top.
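For example, a yacc grammar can carry an explicit error production per statement. This is a sketch with made-up token and rule names, not anything from Postgres's grammar:

    statement
        : SELECT select_list FROM IDENT ';'
        | error ';'
            {
                /* 'error' is yacc's built-in recovery token: report the problem,
                   then resume at the next ';' so the following statements still parse */
                yyerror("bad statement, skipping to next ';'");
                yyerrok;
            }
        ;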
> and an easier time debugging compared to autogenerated solutions
This is also not true. Most parser generators (certainly lex/yacc) have good enough debug support that you never see generated code while you debug.
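For instance, with bison you can watch every shift and reduce as the parser runs without ever opening the generated file. A sketch, assuming the grammar was built with %debug (or compiled with -DYYDEBUG=1):

    extern int yydebug;       /* provided by the generated parser when tracing is enabled */
    extern int yyparse(void);

    int main(void)
    {
        yydebug = 1;          /* print each token, shift, and reduction to stderr */
        return yyparse();
    }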
There are a few reasons for this. Some languages aren't context-free (the classic example is "A * B;" - is this multiplication of the A and B variables, or a declaration of the B variable with its type being a pointer to A?), which makes writing the grammar much harder in the first place. One workaround is to use a GLR parser instead, but that has the downside of very high execution time in the worst case. Lastly, even if you overcome all of this with the proper structure and hints, you run into pretty bad error cases: handling invalid input is hard, and giving the user an error that makes sense is even more difficult (which of the possible branches in the grammar rules is the one that's wrong? in simpler grammars you can probably give a good error, but a C grammar is going to have a much harder time). This is why gcc moved away from using a grammar to a custom parser, IIRC.
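To make the "A * B;" case concrete, here is a hypothetical yacc fragment (token and rule names are mine, not from a real C grammar). If the lexer can't tell typedef names from ordinary identifiers, both rules below match IDENT '*' IDENT ';' and yacc reports a conflict; resolving it needs lexer feedback (the classic "lexer hack") or a GLR parser:

    statement
        : declaration
        | expression ';'
        ;

    declaration
        : IDENT '*' IDENT ';'    /* "A * B;" read as: declare B as a pointer to type A */
        ;

    expression
        : IDENT '*' IDENT        /* "A * B"  read as: multiply variables A and B */
        ;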
I've heard the "useful error message" argument, but my personal experience coding in modern languages suggests that not much progress has been made on that front. The compiler's error message is almost never "useful".
I write parsers all the time. It's quick for me to do, and it has many advantages, including much better error messages and easy-to-debug code. The code generated by most parser generators isn't really human-readable, which makes fixing complicated parsing bugs super hard.
In my compilers class at university we wrote a recursive descent parser by hand, but we wrote the LALR and LL parsers using yacc or bison, I think - that was what the three assignments said to do, and the professor said using a parser generator was the only sensible approach for the LALR and LL parsers.
I highly recommend not using lex/yacc. I wrote an SQL parser in a day for a custom DB at work. It's super easy to write recursive descent parsers once you know how to do it. So just learn it by starting with something simple - XML, Lisp, or JSON, for example.
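As a flavor of how small such a parser can be, here's a hedged sketch of recursive descent over arithmetic expressions (an even simpler grammar than JSON; everything here is illustrative):

    /* Grammar: expr   -> term (('+'|'-') term)*
                term   -> factor (('*'|'/') factor)*
                factor -> NUMBER | '(' expr ')'        */
    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    static const char *src;                       /* cursor into the input */

    static long parse_expr(void);

    static void skip_ws(void) { while (isspace((unsigned char)*src)) src++; }

    static long parse_factor(void)
    {
        skip_ws();
        if (*src == '(') {                        /* '(' expr ')' */
            src++;
            long v = parse_expr();
            skip_ws();
            if (*src == ')') src++;
            return v;
        }
        return strtol(src, (char **)&src, 10);    /* NUMBER */
    }

    static long parse_term(void)
    {
        long v = parse_factor();
        for (;;) {
            skip_ws();
            if (*src == '*')      { src++; v *= parse_factor(); }
            else if (*src == '/') { src++; v /= parse_factor(); }
            else return v;
        }
    }

    static long parse_expr(void)
    {
        long v = parse_term();
        for (;;) {
            skip_ws();
            if (*src == '+')      { src++; v += parse_term(); }
            else if (*src == '-') { src++; v -= parse_term(); }
            else return v;
        }
    }

    int main(void)
    {
        src = "1 + 2 * (3 + 4)";
        printf("%ld\n", parse_expr());            /* prints 15 */
        return 0;
    }

Each grammar rule becomes one function, which is also what makes error reporting natural: at any point you know exactly which rule you're in and what you expected next.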
Handwritten. It worked for the first portion of the parser (DDL), but it was a real pain to debug and build while keeping track of mallocing and freeing along the way.
We got an A on that first parser that was hand written as well, but once we got to DML, I felt like we could really use tools given our time frame.
It sounds a bit silly to implement a parser for a DB internals course, though. You'll end up teaching lexing and parsing (something the professor wasn't too familiar with himself, it seems?) rather than DB internals.
Implementing your own WAL + storage API, for example, is more along the lines of what I would expect.
Agreed. At my university, in my era (almost 30 years ago), the compilers sequence was two full quarters and generally considered to be the hardest classes in the CSc major.
I really loved it but it would have been a huge distraction mixed in with the DB classes.
Those compilers classes turned out to be pretty useful. I ended up having to build custom parsers several times in my career... including an SQL parser.
I actually took my university's compiler course at the same time as this DB course, and that was my favorite class out of all four years, hands down. Honestly, the parsing portion of my DB class was helped out by the knowledge I got from the compiler class earlier in the semester, so at least it worked out!
Absolutely. That was one of my complaints with the class. I have an interest in compilers and the process, but I took this course to learn about storage and efficient lookups.
> ...postgres server process listens to one network port, the default port is 5432
> Whenever receiving a connection request from a client, it starts a backend process. (And then, the started backend process handles all queries issued by the connected client.)
> To achieve this [server] starts ("forks") a new process for each connection. From that point on, the client and the new server process communicate without intervention by the original postgres process. Thus, the master server process is always running, waiting for client connections, whereas client and associated server processes come and go. [1]
So, Postgres is using a process-per-connection model. Can someone explain why this is? And why not something like thread-per-connection?
It originally used a per-process model primarily because of the age of the codebase. It predates commodity SMP hardware (1986ish onwards, starting with Sequent) and standardized threading APIs (1993):
> DBMS code must run as a separate process from the application programs that access the database in order to provide data protection. The process structure can use one DBMS process per application program (i.e., a process-per-user model [STON81]) or one DBMS process for all application programs (i.e., a server model). The server model has many performance benefits (e.g., sharing of open file descriptors and buffers and optimized task switching and message sending overhead) in a large machine environment in which high performance is critical. However, this approach requires that a fairly complete special-purpose operating system be built. In contrast, the process-per-user model is simpler to implement but will not perform as well on most conventional operating systems. We decided after much soul searching to implement POSTGRES using a process-per-user model architecture because of our limited programming resources. POSTGRES is an ambitious undertaking and we believe the additional complexity introduced by the server architecture was not worth the additional risk of not getting the system running. Our current plan then is to implement POSTGRES as a process-per-user model on Unix 4.3 BSD.
> In POSTGRES they are run as subprocesses managed by the POSTMASTER. A last aspect of our design concerns the operating system process structure. Currently, POSTGRES runs as one process for each active user. This was done as an expedient to get a system operational as quickly as possible. We plan on converting POSTGRES to use lightweight processes available in the operating systems we are using. These include PRESTO for the Sequent Symmetry and threads in Version 4 of Sun/OS.
> So, Postgres is using process-per-connection model. Can some explain why this is? And why not something like thread-per-connection?
Thread vs process under Linux is kind of a potato, potahto thing. They are essentially the same construct - based on the flags provided when you create one, you can indicate whether you want things like shared memory or not - and they have similar overhead. Plus, fork() is incredibly useful.
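To picture the process-per-connection shape, here's a heavily simplified sketch of a fork-per-connection accept loop (this is the general Unix pattern, not actual postmaster code; error handling omitted):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { 0 };
        addr.sin_family      = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port        = htons(5432);      /* Postgres's default port */

        bind(lfd, (struct sockaddr *)&addr, sizeof addr);
        listen(lfd, 128);

        for (;;) {
            int cfd = accept(lfd, NULL, NULL);   /* wait for a client */
            if (cfd < 0) continue;

            if (fork() == 0) {                   /* one "backend" per connection */
                close(lfd);                      /* child: serve this client only */
                /* ... handle the whole session on cfd ... */
                close(cfd);
                _exit(0);
            }
            close(cfd);                          /* parent: keep listening */
            while (waitpid(-1, NULL, WNOHANG) > 0)
                ;                                /* reap finished backends */
        }
    }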
It makes sense, for such an important piece of software, to minimize shared memory.
> All backends running as threads in a single process (not wanted)
> This eliminates the process protection we get from the current setup. Thread creation is usually the same overhead as process creation on modern systems, so it seems unwise to use a pure threaded model, and MySQL and DB2 have demonstrated that threads introduce as many issues as they solve. Threading specific operations such as I/O, seq scans, and connection management has been discussed and will probably be implemented to enable specific performance features.
Modern PG is already threaded out of necessity for many kinds of query; there's really no other way to fully exploit modern hardware (or indeed, much of the hardware released in the past 30 years).
The other reply mentioned that threads and processes aren’t much different on Linux. But on some other POSIX systems, privilege separation and resource limits tend to operate on a per-process basis - both of which are important features for a mature RDBMS. If you want to limit the CPU/memory/IO usage of a connection or ensure that a connection never accesses a privileged resource, and you need it done in a portable manner, then each connection needs a dedicated process.
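As an illustration of that point (not something Postgres itself does per backend, and the numbers are made up), once each connection has its own process the child can simply cap itself right after fork():

    #include <sys/resource.h>

    /* hypothetical per-connection limits applied in the forked child */
    static void restrict_this_backend(void)
    {
        struct rlimit cpu = { .rlim_cur = 30, .rlim_max = 30 };
        setrlimit(RLIMIT_CPU, &cpu);             /* at most 30 s of CPU time */

        struct rlimit mem = { .rlim_cur = 256UL << 20, .rlim_max = 256UL << 20 };
        setrlimit(RLIMIT_AS, &mem);              /* at most 256 MiB of address space */
    }

Limits set this way apply to the one process handling that connection and nothing else, which is exactly what's hard to do portably with threads sharing a single process.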
I constantly recommend this at work. The specific content isn't super helpful for SaaS day-to-day development, but for me it built an intuition about Postgres that has been invaluable. I think once I understood the "heart and soul" of Postgres - the heap and the MVCC - many other properties of the database just "clicked" in my head.
The heap is described in 1.3 [1] and MVCC is described in Chapter 5 [2]. I'd also recommend the Postgres docs [3]; they aren't quite as accessible as this, but they are detailed and overall well-written.
It shows that OP clearly has no idea about how copyright law works. Everybody can use the work under fair use and you cannot restrict people from posting links to your site based on copyright. It seems like the author has a god complex. "I create therefore I shall command."