*> "It's software" so of course it's buggy, he said. The notion that systemd has...

geofft · on Jan 29, 2019

Have you ever had pid 1 (systemd or any other init) crash? For the last ~three years I've been paid to maintain high reliability algorithmic trading systems that ran systemd and a whole lot of other stuff, and systemd has never crashed on me. Lots of other stuff, including the kernel itself, has crashed.

The bar is higher for pid 1 - if I were designing systemd I would have made a tiny pid 1 that just did message-passing to a more complex secondary process that could be restarted, or something, just to be safe - but I think systemd has empirically cleared the bar.

stewartm · on Jan 29, 2019

Yep.

3AM, deep slumber, called out to look at a stricken server. Its problems included that systemd was frozen. Reluctantly I came to the conclusion that a restart was the only route forward. Cept, that is when you discover that the commands that have served you well for 2 decades don't work, as they are all wrappers for systemd, which has keeled over.

To this day, the `shutdown` man page, which I was checking at the time, makes no mention of how to resolve this, though in fairness the man pages for the other commands (poweroff, halt, init) do. I discovered this after stumbling across https://github.com/systemd/systemd/issues/3282

If you find yourself stuck in the middle of the night, reading through docs to try and figure out how to recover a machine with a crashed systemd, then `systemctl reboot -ff` or equivalent is what you are now looking for, the `-ff` being the key to "JUST £&*(ing RESTART THE MACHINE!!!".

Experiences like that don't win you friends.
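For readers who want the escalation above written down before the 3AM call, here is a minimal sketch in Python. The `REBOOT_LADDER` table, the `try_reboot` helper, and the 30-second timeout are hypothetical names and values chosen for illustration. The one-line descriptions paraphrase how the systemctl man page describes `--force`, so check them against the version installed on your machines. It defaults to a dry run and only prints what it would try.

```python
import subprocess

# Ordered from gentlest to most forceful. Descriptions paraphrase systemctl(1);
# verify them against the man page on your own systems.
REBOOT_LADDER = [
    (["systemctl", "reboot"], "normal reboot, coordinated by the service manager"),
    (["systemctl", "reboot", "-f"], "skip the clean shutdown of units"),
    (["systemctl", "reboot", "-ff"], "reboot immediately, service manager not contacted"),
]


def reboot_commands():
    """Return just the argv lists, in escalation order."""
    return [argv for argv, _ in REBOOT_LADDER]


def try_reboot(timeout_s=30, dry_run=True):
    """Walk the ladder, moving to the next step when one fails or hangs.

    With dry_run=True (the default) nothing is executed; each step is printed.
    Returns the list of commands that were attempted (or would have been).
    """
    attempted = []
    for argv, why in REBOOT_LADDER:
        attempted.append(argv)
        if dry_run:
            print(" ".join(argv), "#", why)
            continue
        try:
            # A hung pid 1 shows up here as a timeout rather than an error code.
            subprocess.run(argv, check=True, timeout=timeout_s)
            break
        except (subprocess.SubprocessError, OSError):
            continue
    return attempted
```

Calling `try_reboot()` with no arguments prints the three steps in order. Passing `dry_run=False` as root really will restart the machine, and the last step risks data loss because file systems are not unmounted first.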

setquk · on Jan 29, 2019

The worst thing about this is when stuff goes down, it does so at the least convenient time. Back in 2003 I was on a customer site who had a RH server and there was no internet connection available (as it was routed through the box) and my phone was a Treo 180G which had precisely fuck all useful internet on it. The company still exists and is in the middle of nowhere on the end of a shonky ADSL line and no mobile phone reception so the story hasn't improved.

If this happened to me today with systemd I'd be up shit creek without a paddle.

Twirrim · on Jan 29, 2019

Did raising elephants not work (SysRq + R E I S U B)?

auscompgeek · on Jan 29, 2019

systemd disables the magic sysrq keys by default.
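Whether any of the REISUB keys work depends on the `kernel.sysrq` sysctl, which is either 0 (off), 1 (everything on), or a bitmask. The sketch below decodes that value. The bit assignments follow the kernel's sysrq documentation as I understand it, the function names are made up for illustration, and the value 16 in the usage note is only an example of a sync-only configuration, so read the real value from `/proc/sys/kernel/sysrq` on the machine in question.

```python
# Bit each REISUB key depends on, per the kernel's sysrq documentation:
#   4 = keyboard control (unraw), 16 = sync, 32 = remount read-only,
#   64 = signal processes (term/kill), 128 = reboot/poweroff.
REISUB_BITS = {"R": 4, "E": 64, "I": 64, "S": 16, "U": 32, "B": 128}


def allowed_reisub(mask):
    """Return the subset of 'REISUB' that the given kernel.sysrq value permits."""
    if mask == 1:
        # 1 is a special value meaning "enable all functions", not a bit.
        return "REISUB"
    return "".join(key for key in "REISUB" if mask & REISUB_BITS[key])


def read_mask(path="/proc/sys/kernel/sysrq"):
    """Read the current setting; only works on a Linux host."""
    with open(path) as f:
        return int(f.read().strip())
```

With a mask of 16, `allowed_reisub(16)` returns `"S"`, so only the sync step would do anything from the keyboard. As far as I know the mask restricts keyboard invocation only, and root can still write to `/proc/sysrq-trigger`, which does not help much if you cannot get a shell.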

ploxiln · on Jan 29, 2019

I've had shutdown (for reboot) hang a few times after a systemd update, forcing me to cut the power. It's made me a bit paranoid, so I block the systemd package from having updates automatically installed, and every 6 months or so carefully manage update and reboot of each and every server ...

EDITs:

there's the classic case of the linux "debug" parameter: https://bugs.freedesktop.org/show_bug.cgi?id=76935

and the even more classic case of firmware loading events: https://lkml.org/lkml/2012/10/3/484

and while "all software has bugs" systemd really has the most annoying bugs (by virtue of trying to do everything core to the system) and always insists that they are features and we are backwards whiny geeks for complaining.

setquk · on Jan 29, 2019

I’ve had systemd completely stop responding on numerous occasions on CentOS 7. As in: it can’t reboot, or it hangs while rebooting, or all commands hang.

Only recourse has been to reboot the instance from AWS dashboard.

I can’t get to the bottom of it because the tools don’t work when it’s down and there’s nothing there when it comes back up. I am not enjoying boiling to death in this pot of shit.

And then there’s the situation where it just won’t boot. I just fire up a new instance then because it’s easier than debugging it.

ordu · on Jan 29, 2019

> Have you ever had pid 1 (systemd or any other init) crash?

No, I have not. But I have seen systemd gracefully fail to boot a system to the login prompt, with a good-looking, colorful error message. It reminded me of "Keyboard is not found. Press F1 to run setup."

geofft · on Jan 29, 2019

Oh, to be clear I'm real mad about how systemd fails boot if (say) one of your filesystems is unavailable and makes you log in with a root password to fix it.

But OP was asserting that systemd crashes under normal operation because its pid 1 is too fragile, which is very different. At scale I already expect that there's a chance a machine won't come back if I reboot it - it's annoying if I can't ssh in, but, well, I already lost a disk I care about and it won't return to service and I need to fix it anyway. (And it's an easy fix, just add "nofail" to fstab.) At scale I don't expect init to crash under normal operation.
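To make the "nofail" fix concrete: it goes in the fourth (options) field of the relevant fstab entry. Here is a small sketch that adds it to one line. The helper name `add_nofail`, the device and the mount point are hypothetical, and the function rewrites whitespace between fields to single spaces, so treat it as an illustration rather than a tool to point at a real `/etc/fstab`.

```python
def add_nofail(fstab_line):
    """Return the fstab line with 'nofail' added to its mount options.

    Comments, blank lines, and lines with fewer than four fields are
    returned unchanged. Calling it twice does not add the option twice.
    """
    stripped = fstab_line.strip()
    if not stripped or stripped.startswith("#"):
        return fstab_line
    fields = stripped.split()
    if len(fields) < 4:
        return fstab_line
    options = fields[3].split(",")
    if "nofail" not in options:
        options.append("nofail")
    fields[3] = ",".join(options)
    return " ".join(fields)


# Hypothetical entry for a secondary data disk.
print(add_nofail("/dev/sdb1 /data ext4 defaults 0 2"))
```

The printed line has `defaults,nofail` in the options field. With that set, a missing disk should no longer stop the boot, at the cost of the machine coming up without `/data` mounted, so anything that depends on it needs its own check.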

rachelbythebay · on Jan 29, 2019

Yep. CentOS 6's upstart can be felled by generating a bunch of inotify events in /etc.

http://rachelbythebay.com/w/2014/11/24/touch/

CBLT · on Jan 29, 2019

I've never experienced an outright crash, but I've been bitten by [0] on some of my servers.

[0] https://github.com/systemd/systemd/issues/719

65a · on Jan 29, 2019

There was that time it was bricking computers by erasing UEFI variables, but I'll allocate equal blame between systemd and UEFI

voxadam · on Jan 29, 2019

Personally, I lay the blame for this issue squarely at the feet of various UEFI implementations which fail to boot when the system's EFI variables are, for whatever reason, wiped clean. The UEFI spec explicitly states that clearing all of the variables on a system must not result in an unbootable system.

JdeBP · on Jan 29, 2019

You shouldn't, because the maintainer of the kernel subsystem concerned told us all that systemd wasn't to blame for it.

* https://news.ycombinator.com/item?id=15973577

* https://news.ycombinator.com/item?id=11152880

chris_wot · on Jan 29, 2019

Actually, that was nothing to do with systemd. That was definitely a UEFI implementation issue. And systemd didn't delete anything, the user did - they ran:

  rm -rf --no-preserve-root /

https://lwn.net/Articles/674940/

majewsky · on Jan 29, 2019

The bug was in the kernel (it should not have allowed userspace to write arbitrary UEFI variables), but AFAIR it was exposed by systemd because it eagerly mounted the UEFI variable filesystem provided by the kernel into /sys/something/efivars.

chris_wot · on Jan 29, 2019

Indeed, but again that was a firmware issue. systemd didn't delete the variables. And systemd was setting EFI variables, so consequently it needed it to be mounted as read/write.

The configuration files should have set that to read only after boot.

The kernel patch where this was fixed can be found here:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

RX14 · on Jan 29, 2019

Using systemd automount and NFS you can easily get pid 1 unresponsive, hung in uninterruptible sleep forever.

JdeBP · on Jan 29, 2019

No. Not mine. Not systemd. Not others. And I touched upon how rare this was in practice in my experience some years ago on Hacker News.

* https://news.ycombinator.com/item?id=8384251

But it does happen to other people.

* https://unix.stackexchange.com/questions/440229/

And there was one crash that made the headlines.

* https://news.ycombinator.com/item?id=12600413

nonbirithm · on Jan 29, 2019

Very recently I had this issue[0] as the result of a systemd upgrade, requiring the use of a recovery disk to downgrade to the previous version as the keyboard input had failed to be initialized.

[0] https://github.com/systemd/systemd/issues/11314

majewsky · on Jan 29, 2019

If this bug hits you in staging, no big problem, just don't promote that particular update to production. If this bug hits you in production, your lack of a staging environment is the bigger concern IMO.

(I have been hit by the same issue on my private notebook, but I have procedures in place to cleanly recover from failed upgrades on all systems, so it was not a big deal.)

yjftsjthsd-h · on Jan 29, 2019

"Yeah the software completely broke, but that's fine because you should be able to deal with that" does not make me feel better about the software in question.

rurban · on Jan 29, 2019

The current debian testing version crashes with a NULL pointer segv in the kernel module. You need to downgrade to the previous version.

newnewpdro · on Jan 29, 2019

In what kernel module? There is no "the kernel module" in a systemd context.

rurban · on Jan 29, 2019

There is. udev loads kernel modules. See eg. http://www.linuxfromscratch.org/lfs/view/development/chapter...

JdeBP · on Jan 29, 2019

Waving in the direction of udev does not clarify what kernel module is supposedly the kernel module, which is what you were asked.

geofft · on Jan 29, 2019

'rurban is one of our resident trolls - see also https://news.ycombinator.com/item?id=13364173

dvfjsdhgfv · on Jan 29, 2019

> Have you ever had pid 1 (systemd or any other init) crash? For the last ~three years I've been paid to maintain high reliability algorithmic trading systems that ran systemd and a whole lot of other stuff, and systemd has never crashed on me.

You see, that's the argument I hear a lot from Systemd advocates. The problem with anecdotal evidence is obvious. When you hear people opposing Systemd, practically all of them have some real-life issues with it, often related to functionality that would otherwise be non-essential (i.e. doesn't really need to be handled by PID 1). Of course if you don't have a particular problem, you don't feel it's important. That's precisely the attitude people resent.

geofft · on Jan 29, 2019

> When you hear people opposing Systemd, practically all of them have some real-life issues with it

Yes, but a lot of people have real-life issues with it on their desktop of the form "It's too complicated." I'm asking specifically about real-life issues on production servers at scale. There will of course be tools that are poorly suited for a personal machine (even a personal server) but well suited for a team that wants to run a bunch of reliable servers.

For instance I would never be happy running RHEL on my desktop, but that doesn't mean RHEL is useless.

dvfjsdhgfv · on Jan 29, 2019

I can't quote any statistics, but I have the impression that a large part of the non-Systemd crowd are old-time admins who maintain a large number of servers, myself included. When you break something on a desktop machine, that's easily fixable. When you need to deal with a large heterogeneous environment, you prefer to have things handled a bit more gracefully. Linus is a good example of a person who got this right.