> "It's software" so of course it's buggy, he said. The notion that systemd has to be perfect, unlike any other system, raises the bar too high.
systemd is a PID 1 program, which means the bar has to be higher. When trouble begins you need tools to fix it, and if PID 1 has crashed, you are out of luck. If the system cannot boot into a shell, you need to fix it from the initrd shell, or boot another system to fix this one. It sucks.
The Linux kernel holds itself to very high standards of reliability, because a kernel panic is even worse than a PID 1 crash. An init system should follow the same standards as Linux.
Have you ever had pid 1 (systemd or any other init) crash? For the last ~three years I've been paid to maintain high reliability algorithmic trading systems that ran systemd and a whole lot of other stuff, and systemd has never crashed on me. Lots of other stuff, including the kernel itself, has crashed.
The bar is higher for pid 1 - if I were designing systemd I would have made a tiny pid 1 that just did message-passing to a more complex secondary process that could be restarted, or something, just to be safe - but I think systemd has empirically cleared the bar.
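To make the "tiny pid 1 plus restartable secondary process" idea concrete, here is a minimal sketch of the supervising half. It is not how systemd is built and not a real init (no signal handling, no zombie reaping, no message-passing); the `supervise` function, its parameters and the crashing worker are all hypothetical names for illustration.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, backoff=0.1):
    """Run cmd, restarting it each time it fails; return the exit codes seen."""
    exit_codes = []
    for _ in range(max_restarts + 1):
        # The supervisor only spawns and waits; all real logic lives in the child.
        child = subprocess.Popen(cmd)
        exit_codes.append(child.wait())
        if exit_codes[-1] == 0:
            break  # clean exit: nothing to restart
        time.sleep(backoff)  # brief pause so a crash loop does not spin the CPU
    return exit_codes


if __name__ == "__main__":
    # Hypothetical "complex" worker that always crashes with status 1.
    crashing_worker = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(crashing_worker, max_restarts=2))
```

The point of the split is that the loop above is small enough to audit by eye, while whatever crashes lives in a process that can be thrown away and started again.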
3AM, deep slumber, called out to look at a stricken server. Its problems included that systemd was frozen. Reluctantly I came to the conclusion that a restart was the only route forward. Cept, that is when you discover that the commands that have served you well for 2 decades don't work, as they are all wrappers for systemd, which has keeled over.
To this day, the `shutdown` man page, which is where I was looking, makes no mention of how to resolve this, tho in fairness the pages for the other commands (poweroff, halt, init) do. I discovered this after stumbling across https://github.com/systemd/systemd/issues/3282
If you find yourself stuck in the middle of the night, reading through docs to try and figure out how to recover a machine with a crashed systemd, then `systemctl reboot -ff` or equivalent is what you are now looking for, the `-ff` being the key to "JUST £&*(ing RESTART THE MACHINE!!!".
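For anyone who would rather script that than remember it at 3AM, a rough sketch of an escalation ladder follows. The helper name `escalate_reboot`, the 30 second timeout and the idea of escalating on a hang are my assumptions, not anything from the man pages; the three `systemctl` invocations themselves are real. Doubling `--force` reboots without stopping processes or unmounting file systems, so it can lose data.

```python
import subprocess

# Least to most forceful. With --force given twice, systemctl performs the
# reboot itself instead of asking the (possibly frozen) system manager.
REBOOT_LADDER = [
    ["systemctl", "reboot"],
    ["systemctl", "reboot", "-f"],
    ["systemctl", "reboot", "-ff"],
]


def escalate_reboot(run=subprocess.run, timeout=30):
    """Try each reboot command in turn; return the first that exits 0 in time."""
    for cmd in REBOOT_LADDER:
        try:
            result = run(cmd, timeout=timeout)
        except subprocess.TimeoutExpired:
            continue  # command hung, e.g. because PID 1 is unresponsive
        if result.returncode == 0:
            return cmd
    return None  # nothing worked; out-of-band power cycling is what remains
```

The `run` parameter exists only so the ladder can be exercised with a fake runner instead of actually rebooting the box you are testing on.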
The worst thing about this is that when stuff goes down, it does so at the least convenient time. Back in 2003 I was at a customer site that had a RH server, there was no internet connection available (as it was routed through the box), and my phone was a Treo 180G which had precisely fuck all useful internet on it. The company still exists and is in the middle of nowhere on the end of a shonky ADSL line with no mobile phone reception, so the story hasn't improved.
If this happened to me today with systemd I'd be up shit creek without a paddle.
I've had shutdown (for reboot) hang a few times after a systemd update, forcing me to cut the power. It's made me a bit paranoid, so I block the systemd package from having updates automatically installed, and every 6 months or so carefully manage update and reboot of each and every server ...
and while "all software has bugs" systemd really has the most annoying bugs (by virtue of trying to do everything core to the system) and always insists that they are features and we are backwards whiny geeks for complaining.
I’ve had systemd completely stop responding on numerous occasions on CentOS 7. As in: can’t reboot, or it hangs while rebooting, or all commands hang.
Only recourse has been to reboot the instance from AWS dashboard.
I can’t get to the bottom of it because the tools don’t work when it’s down and there’s nothing there when it comes back up. I am not enjoying boiling to death in this pot of shit.
And then there’s the situation where it just won’t boot. I just fire up a new instance then because it’s easier than debugging it.
> Have you ever had pid 1 (systemd or any other init) crash?
No, I have not. But I have seen systemd gracefully fail to boot a system to login, with a good-looking colorful error message. Something that reminded me of "Keyboard is not found. Press F1 to run setup."
Oh, to be clear I'm real mad about how systemd fails boot if (say) one of your filesystems is unavailable and makes you log in with a root password to fix it.
But OP was asserting that systemd crashes under normal operation because its pid 1 is too fragile, which is very different. At scale I already expect that there's a chance a machine won't come back if I reboot it - it's annoying if I can't ssh in, but, well, I already lost a disk I care about and it won't return to service and I need to fix it anyway. (And it's an easy fix, just add "nofail" to fstab.) At scale I don't expect init to crash under normal operation.
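For reference, the "nofail" fix is just one more entry in the comma-separated options field (the fourth column) of the fstab line. Below is a small sketch of a helper that does that edit on a single line; `add_nofail` is a hypothetical name, and the device and mount point in the usage note are made up.

```python
def add_nofail(fstab_line):
    """Return fstab_line with 'nofail' added to its mount options (4th field)."""
    stripped = fstab_line.strip()
    if not stripped or stripped.startswith("#"):
        return fstab_line  # leave blank lines and comments alone
    fields = stripped.split()
    if len(fields) < 4:
        return fstab_line  # no explicit options field; do not guess
    options = fields[3].split(",")
    if "nofail" not in options:
        options.append("nofail")
    fields[3] = ",".join(options)
    # Note: column alignment is not preserved, fields are rejoined with one space.
    return " ".join(fields)
```

So `add_nofail("/dev/sdb1 /data ext4 defaults 0 2")` gives `/dev/sdb1 /data ext4 defaults,nofail 0 2`, and a line that already has the option comes back with the same options.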
Personally, I lay the blame for this issue squarely at the feet of various UEFI implementations which fail to boot when the system's EFI variables are, for whatever reason, wiped clean. The UEFI spec explicitly states that clearing all of the variables on a system must not result in an unbootable system.
Actually, that was nothing to do with systemd. That was definitely a UEFI implementation issue. And systemd didn't delete anything, the user did - they ran:
The bug was in the kernel (it should not have allowed userspace to write arbitrary UEFI variables), but AFAIR it was exposed by systemd because it eagerly mounted the UEFI variable filesystem provided by the kernel into /sys/something/efivars.
Indeed, but again that was a firmware issue. systemd didn't delete the variables. And systemd was setting EFI variables, so consequently it needed it to be mounted as read/write.
The configuration files should have set that to read only after boot.
The kernel patch where this was fixed can be found here:
Very recently I had this issue[0] as the result of a systemd upgrade, requiring the use of a recovery disk to downgrade to the previous version as the keyboard input had failed to be initialized.
If this bug hits you in staging, no big problem, just don't promote that particular update to production. If this bug hits you in production, your lack of a staging environment is the bigger concern IMO.
(I have been hit by the same issue on my private notebook, but I have procedures in place to cleanly recover from failed upgrades on all systems, so it was not a big deal.)
"Yeah the software completely broke, but that's fine because you should be able to deal with that" does not make me feel better about the software in question.
> Have you ever had pid 1 (systemd or any other init) crash? For the last ~three years I've been paid to maintain high reliability algorithmic trading systems that ran systemd and a whole lot of other stuff, and systemd has never crashed on me.
You see, that's the argument I hear a lot from Systemd advocates. The problem with anecdotal evidence is obvious. When you hear people opposing Systemd, practically all of them have some real-life issues with it, often related to functionality that would otherwise be non-essential (i.e. doesn't really need to be handled by PID 1). Of course if you don't have a particular problem, you don't feel it's important. That's precisely the attitude people resent.
> When you hear people opposing Systemd, practically all of them have some real-life issues with it
Yes, but a lot of people have real-life issues with it on their desktop of the form "It's too complicated." I'm asking specifically about real-life issues on production servers at scale. There will of course be tools that are poorly suited for a personal machine (even a personal server) but well suited for a team that wants to run a bunch of reliable servers.
For instance I would never be happy running RHEL on my desktop, but that doesn't mean RHEL is useless.
I can't quote any statistics but have the impression that a large part of non-Systemd crowd are old-time admins who maintain a large number of servers, myself included. When you break something on a desktop machine, that's easily fixable. When you need to deal with a large heterogeneous environment, you prefer to have things handled a bit more gracefully. Linus is a good example of a person who got this right.