Add Void Linux to the list, it uses runit.

petre · on June 29, 2017

I'll probably switch to Void Linux when the next Ubuntu LTS comes out. I'm running Ubuntu 16.04 on my laptop right now.

digi_owl · on June 28, 2017

Didn't Docker switch to using Void as basis for their container images?

cyphar · on June 28, 2017

No, Alpine. But running systemd in Docker has never /really/ worked (except if you're running on RHEL because they have a bunch of patches to make it work) due to how much shit systemd does that makes it hard to run in a container. Even systemd-nspawn (their container runtime which runs systemd inside the container) doesn't work in a lot of cases.

LXC is the only runtime I'm aware of that actually runs systemd inside containers well, but they had to do a lot of unholy shit to play nicely with systemd.

runc has had countless issues with systemd thinking that it owns the system and it messing with container cgroups.

And don't get me started on the fact that cgroupv2 is specifically designed to only work if you have a global management process for cgroups (can you guess what management process that is?).

digi_owl · on June 28, 2017

And surprise surprise, the current cgroups maintainer is a Red Hat employee.

I'm not saying it is planned, but I wonder if there is an echo chamber effect going on...

cyphar · on June 29, 2017

> And surprise surprise, the current cgroups maintainer is a Red Hat employee.

Tejun used to work at Red Hat, he's at Facebook now and I believe he was working at Google as well. However, he also does contribute to systemd development (recently he got a patch merged that broke every container runtime because they switched to a "hybrid" cgroupv2 setup in v232 which caused countless issues).

setq · on June 28, 2017

He who controls the spice....

digi_owl · on June 29, 2017

In this case, the source...

JdeBP · on June 29, 2017

Yes, you should not get started on that, because it is a falsehood and not actually a fact.

* https://jdebp.eu/FGA/systemd-documentation-errata.html#Contr...

* https://news.ycombinator.com/item?id=11845867

cyphar · on June 29, 2017

There's no need to lecture me, I am very familiar with cgroups, having contributed to their implementation, and I also maintain runc which is a container runtime (that obviously uses cgroups quite heavily). I've also discussed these issues with other container runtime developers such as the LXC folks and kernel developers.

So let's talk about the API. First of all, cgroupv2 requires a single hierarchy. This means that if systemd is using cgroups for managing services, you cannot use cgroups for anything else because systemd will get confused if you create any new hierarchies. You may argue this is a bug in systemd, but I would argue it's because you can't have named cgroup hierarchies in v2 (like you could in v1, which is what systemd uses on v1).
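
To make the v1/v2 distinction concrete, here is a minimal sketch that parses the text of `/proc/<pid>/cgroup` and guesses which layout a process sees. The line format (`hierarchy-ID:controller-list:path`, with the v2 hierarchy shown as `0::/path` and named v1 hierarchies as `name=...`) is as described in cgroups(7). The function names, the sample text, and the "legacy"/"hybrid"/"unified" labels (borrowed from systemd's terminology) are illustrative assumptions, and the classification is only a heuristic.

```python
from typing import List, NamedTuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: List[str]  # empty for the v2 (unified) hierarchy
    path: str


def parse_proc_cgroup(text: str) -> List[CgroupEntry]:
    """Parse the contents of /proc/<pid>/cgroup into entries."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at most twice: the cgroup path itself may contain colons.
        hid, controllers, path = line.split(":", 2)
        names = [c for c in controllers.split(",") if c]
        entries.append(CgroupEntry(int(hid), names, path))
    return entries


def classify(entries: List[CgroupEntry]) -> str:
    """Heuristic label for the layout: legacy (v1), unified (v2) or hybrid."""
    has_v2 = any(e.hierarchy_id == 0 and not e.controllers for e in entries)
    has_v1 = any(e.controllers for e in entries)
    if has_v1 and has_v2:
        return "hybrid"
    if has_v2:
        return "unified"
    if has_v1:
        return "legacy"
    return "none"


# Hypothetical sample: a v1 controller, a named v1 hierarchy, and the v2 line.
SAMPLE = """\
4:memory:/user.slice
1:name=systemd:/user.slice/session-1.scope
0::/user.slice/session-1.scope
"""

if __name__ == "__main__":
    print(classify(parse_proc_cgroup(SAMPLE)))
```

On a real machine the input would come from reading `/proc/self/cgroup`; the named `name=systemd` entry is the kind of v1-only hierarchy that has no v2 equivalent.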

But ignoring that "slight" issue, how about we talk about the no-internal node constraints and how subtree control works. First of all, in order to use a cgroup controller you must have all of your ancestor cgroups have that controller activated. So if systemd decides to not use a controller, then you can't use it either (without messing with things that systemd thinks it owns). But ignoring that, let's say you want to create a new cgroup inside your user session (we've already established systemd won't like that, but let's assume that systemd plays along). You can't just create a new subcgroup (you won't be able to use the controllers), you have to create two and then move all of the other processes into one and then the process you wanted to control into the other. While this may sound okay, you have to realise that as a container runtime you now have to mess with processes that you have no control over or idea what they do. Not to mention that there's no way to atomically move all processes into a cgroup (so there'll be race conditions in trying to set this up).
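
The dance described above can be sketched with a toy in-memory model. This is not kernel code and touches nothing under `/sys/fs/cgroup`; the `Cgroup` class, the `isolate` helper, and the controller set are hypothetical names used only to illustrate the two rules: a controller is usable only if the parent enabled it, and a non-root cgroup that holds processes cannot enable controllers for its children.

```python
ROOT_CONTROLLERS = {"cpu", "io", "memory", "pids"}  # illustrative set


class Cgroup:
    """Toy cgroup v2 node; a model of the rules, not the kernel interface."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.procs = set()
        self.subtree_control = set()

    def available_controllers(self):
        # A non-root cgroup only sees what its parent put in subtree_control.
        if self.parent is None:
            return set(ROOT_CONTROLLERS)
        return set(self.parent.subtree_control)

    def enable(self, controller):
        """Enable a controller for this cgroup's children."""
        if controller not in self.available_controllers():
            raise ValueError(f"{controller} not enabled by an ancestor")
        if self.parent is not None and self.procs:
            # The "no internal processes" constraint.
            raise ValueError(f"{self.name} still contains processes")
        self.subtree_control.add(controller)

    def attach(self, pid, source):
        """Move one pid from source into this cgroup."""
        if self.parent is not None and self.subtree_control:
            raise ValueError(f"{self.name} distributes controllers to children")
        source.procs.discard(pid)
        self.procs.add(pid)


def isolate(session, pid, controller):
    """Create two leaves, empty the session cgroup, then enable the controller."""
    rest = Cgroup("rest", session)
    target = Cgroup("target", session)
    # Processes move one at a time; nothing here is atomic, and a process
    # that forks in between would be left behind in a real system.
    for other in sorted(session.procs - {pid}):
        rest.attach(other, session)
    target.attach(pid, session)
    session.enable(controller)
    return rest, target


if __name__ == "__main__":
    root = Cgroup("root")
    root.enable("memory")
    session = Cgroup("session", root)
    session.procs.update({100, 101, 102})
    rest, target = isolate(session, 102, "memory")
    print(sorted(rest.procs), sorted(target.procs))
```

Note that `isolate` has to relocate pids 100 and 101, which the caller never wanted to touch; that is the "noisy neighbour" effect the comment complains about.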

The "delegation" model of cgroupv2 is effectively based around the systemd delegation model, where the higher level has to semantically grant you the right to manage your own resources. What kind of resource management system requires you to request the right to manage your own resources? prlimit(2) doesn't do that. cgroupv1 somewhat had this issue as well, but there is another cgroupv2 limitation added that actually means that even if you have write access to a child cgroup you still need to have write access in your current cgroup in order to move it into the child. Write access to cgroup.procs is actually a privilege in cgroups, so giving users access to this won't always be desirable, but it also further bakes in the management process design.
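
As a rough illustration of that write-access rule: the kernel's cgroup v2 documentation says an unprivileged writer needs write access to `cgroup.procs` in the destination cgroup and in the common ancestor of the source and destination. The sketch below only computes which files those are. The function name, the default mount point, and the example paths are assumptions for illustration, and it performs no permission checks itself.

```python
import posixpath


def _norm(cgroup_path: str) -> str:
    """Normalise a cgroup path to the form '/a/b' ('/' for the root)."""
    return posixpath.normpath("/" + cgroup_path.strip("/"))


def procs_files_needing_write(src: str, dst: str,
                              mount: str = "/sys/fs/cgroup") -> list:
    """Return the cgroup.procs files a non-root writer must be able to write
    in order to move a process from cgroup `src` to cgroup `dst`."""
    s, d = _norm(src), _norm(dst)
    common = posixpath.commonpath([s, d])
    files = []
    for cgroup in (d, common):
        path = posixpath.join(mount, cgroup.lstrip("/"), "cgroup.procs")
        if path not in files:
            files.append(path)
    return files


if __name__ == "__main__":
    # Moving a process from the current cgroup into its own child:
    for f in procs_files_needing_write("/user/session",
                                       "/user/session/container"):
        print(f)
```

For a move into a child, the common ancestor is the current cgroup itself, which is why write access to the child alone is not enough.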

I've talked to Tejun on the mailing lists, and it's very clear that he prioritises the model of having a higher level process managing cgroups. In discussions about making unprivileged subtree delegation (something that is necessary for rootless containers to use cgroups) he made it clear that he isn't interested in the feature because it will cause systemd issues because it manages all cgroups on a system.

There's actually even more stuff you have to do to manage cgroups if you're not systemd by the way. I've talked to some LXC folks and we collated a list of 12 different cases and things you need to deal with in order to use cgroupv2 effectively (and all of them break rootless containers, as well as making container runtimes very "noisy neighbours" as a result). cgroupv1 (despite its downsides) had none of these issues.

The only current user of cgroupv2 is systemd, and they've had several instances where they broke every container runtime because they flipped the cgroupv2 switch early.

Yes this was a rant, but I'm really tired of people defending this. cgroupv2 did make some good decisions, but then followed up by making some truly awful ones.

JdeBP · on June 29, 2017

No-one defended cgroups. What you said about a single global management process was just plain wrong. I do find it amusing that you erroneously think that other people are lecturing you, by the way. (-:

A control group on the machine in front of me tells me that you are wrong about two more things.

    jdebp %ll -a /sys/fs/cgroup/service-manager.slice/[email protected]/[email protected]
    total 0
    drwxr-xr-x 6 jdebp root  0 Jun 29 18:17 .
    drwxr-xr-x 3 root  root  0 Jun 29 18:17 ..
    -r--r--r-- 1 root  root  0 Jun 29 18:18 cgroup.controllers
    -r--r--r-- 1 root  root  0 Jun 29 18:18 cgroup.events
    -rw-r--r-- 1 jdebp root  0 Jun 29 18:17 cgroup.procs
    -rw-r--r-- 1 root  root  0 Jun 29 18:18 cgroup.subtree_control
    drwxr-xr-x 2 jdebp jdebp 0 Jun 29 18:17 me.slice
    drwxr-xr-x 2 jdebp jdebp 0 Jun 29 18:17 per-user-manager-log.slice
    drwxr-xr-x 3 jdebp jdebp 0 Jun 29 18:17 service-manager.slice
    drwxr-xr-x 2 jdebp jdebp 0 Jun 29 18:17 system-control.slice
    jdebp %

Unprivileged subtree delegation exists, that being a control group delegated to my account which has a whole subtree of further control groups in it, managed by multiple unprivileged processes. Your problem with "rootless" containers is not because of the non-existence, because Tejun Heo "isn't interested", of something that visibly exists. That's clearly not a correct description of the situation at all. Furthermore, https://lkml.org/lkml/2017/6/25/4 and https://lkml.org/lkml/2017/6/25/6 tell me that far from "isn't interested", Tejun Heo is interested in subtree delegation to unprivileged users. After all, xe is fiddling with it right now.
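
For anyone wanting to repeat that check on their own machine, a minimal sketch follows. It lists which entries of a control group directory are owned by a given user id, which is the information the listing above conveys through its owner column. The function name `owned_entries` is hypothetical, and ownership alone is only a hint of delegation, not proof of it.

```python
import os


def owned_entries(cgroup_dir: str, uid: int) -> list:
    """Return sorted names in cgroup_dir owned by uid ('.' is the dir itself)."""
    owned = []
    if os.stat(cgroup_dir).st_uid == uid:
        owned.append(".")
    with os.scandir(cgroup_dir) as it:
        for entry in it:
            # Do not follow symlinks; report the entry's own owner.
            if entry.stat(follow_symlinks=False).st_uid == uid:
                owned.append(entry.name)
    return sorted(owned)


if __name__ == "__main__":
    # Adjust the path to a control group that exists on your system.
    print(owned_entries("/sys/fs/cgroup", os.getuid()))
```

In the listing above, such a check for the `jdebp` account would report the directory itself, `cgroup.procs`, and the four `.slice` subdirectories, but not `cgroup.subtree_control`.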

systemd is not the sole user of version 2 control groups.

cyphar · on June 29, 2017

> A control group on the machine in front of me tells me that you are wrong about two more things.

But the problem is that the slices you showed are given to you by systemd. If systemd didn't want to give them to you for whatever reason, you couldn't use cgroups.

And you've not responded to any other part of my comments that relate to how the design of cgroupv2 is clearly geared towards management processes controlling subtrees as opposed to programs controlling themselves (the key point being that the root tree has to be controlled by someone).

> Unprivileged subtree delegation exists

But it requires a privileged user to "allow" it, making it less useful in most cases because it has to be automated (allowing for possible exploits) or done manually (not useful).

> Tejun Heo is interested in subtree delegation to unprivileged users

That's very odd, and is not the impression I got after discussing these issues with him last year. In particular I proposed something like his "nsdelegate" patch in early 2016 so it's nice to see that he's come around on that topic. But if he's changed his mind, that's great! Note though that the first patch is not directly related to unprivileged subtree delegation.

> systemd is not the sole user of version 2 control groups.

Can you give an example? I'm also fairly certain they're the only user of "hybrid" cgroup versions.

JdeBP · on July 13, 2017

> But the problem is that the slices you showed are given to you by systemd

No, they are not. I did say that that control group told me that you are wrong about two things, the second being that systemd is not in fact the sole user of version 2 control groups. That should have been a major tip-off that systemd was not involved in that control group at all. (-:

> Can you give an example?

I actually did, two messages ago. Here's the hyperlink again.

* https://news.ycombinator.com/item?id=11845867