Actually you're right, nothing is stopping it from reading, but that doesn't help it escape the sandbox. If you're worried about something adversarial that tries to detect it's in a sandbox, that's not what we are trying to protect against. The idea is to follow the same model as a container, but with something more secure and with less surface area to protect or attack.
Yes, that is the goal, though C++ is not something I am targeting in the short term. The idea is to run untrusted binaries in a VM with no kernel. That saves memory and makes for faster loads, and the binary cannot escape the VM, so it can never compromise your host.
int 0x80 is a great idea, but int3 is what I landed on when I was looking, and at this point I am just trying to get something working. The nice thing about int 0x80 is that it is a 2-byte instruction, I believe, rather than the int3 + nop pair I am doing right now.
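To make the size point concrete, here is a hypothetical sketch of the rewrite being discussed (the helper name and toy blob are mine, not from the post). The 2-byte `syscall` opcode (0F 05) is replaced in place with `int3` (CC) plus a `nop` (90) so instruction offsets are preserved; `int 0x80` (CD 80) would also fit in the same 2 bytes.

```python
SYSCALL = bytes.fromhex("0f05")    # syscall opcode
INT3_NOP = bytes.fromhex("cc90")   # int3 ; nop  (same total size)
INT80 = bytes.fromhex("cd80")      # int 0x80, the legacy alternative

def patch_syscalls(text: bytes, replacement: bytes = INT3_NOP) -> bytes:
    """Replace every `syscall` opcode in a code blob with a same-size trap."""
    assert len(replacement) == len(SYSCALL)  # must not shift code layout
    return text.replace(SYSCALL, replacement)

# Toy blob: mov eax, 60 ; syscall  (an exit call)
blob = bytes.fromhex("b83c000000") + SYSCALL
patched = patch_syscalls(blob)
assert len(patched) == len(blob)      # layout unchanged
assert patched.endswith(INT3_NOP)
```

Note this naive byte search could hit 0F 05 inside the encoding of some other instruction; a real rewriter would walk the disassembly rather than do a blind replace.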
I think you misunderstood my question. int 80h is a legacy alternative way a program can issue syscalls, so without handling it your system may miss some syscalls. Which may be fine; I'm sure they are not that common. But if someone were to try to sneak a syscall past your monitoring, that might be something they would do? Edit: Or maybe, since it's running in a VM, the outcome might just be that it doesn't work at all, which may be fine, I suppose.
AMA, I am the author of that blog. I have some working code, just not something I want to share right away. Right now I am chasing density, but yes, security is something I will get to eventually; the issue is what to implement first :). This is the first of a series of blogs I am writing; you can check my Substack. The next step is a density and launch-speed demo, hopefully middle of next week.
seccomp is a very coarse filter with a very limited action set. Think what you could do if you could see the payload of a syscall, or change the output of a read syscall depending on agent identity.
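As a hypothetical illustration of the difference (all names here are mine, not from the post): seccomp only sees the syscall number and raw register arguments, while a shim that owns the dispatch path can inspect the actual payload and vary the result per caller identity.

```python
# Assumed setup: the shim knows which agent is running in the guest
# and intercepts every read() before data is returned to the process.
SECRET_PATHS = {"/etc/agent_keys"}  # illustrative policy, not real config

def handle_read(agent_id: str, path: str, data: bytes) -> bytes:
    """Payload-aware read handler: redact content for untrusted agents.

    seccomp could at best kill or deny the call by number/args; it could
    never rewrite the returned bytes like this.
    """
    if path in SECRET_PATHS and agent_id != "trusted-agent":
        return b"<redacted>"   # change the syscall's output
    return data                # pass through unchanged

assert handle_read("sandboxed-agent", "/etc/agent_keys", b"hunter2") == b"<redacted>"
assert handle_read("trusted-agent", "/etc/agent_keys", b"hunter2") == b"hunter2"
```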
gVisor tries to be a complete kernel in userland; we are not trying to. We will consciously choose never to support a multi-process environment in the sandbox. The idea is that there are enough people running single-process containers, and they can benefit from a lighter, more secure runtime. This solution will not try to replace the kernel. For example, the Python test we run, HTTPS to some website, ends up needing only 60 syscalls implemented, not 350. I expect to add another 10-20 to support TypeScript, but this will always be strictly single process. Plus the performance overhead of gVisor is substantial, 2-10us (me reading the internet); for the system I am implementing, the hot path is less than 1us. And there is always the density story: my shim is currently 4KB, and the Python runtime is shared through memfd. I am working on a demo showing I can run 1000 VMs on 512 MB of RAM, each launching in under 30ms.
Remember, this will never replace or be able to handle generic multi-process sandboxes; this is targeted only at single-process environments where we can make lots of simplifying assumptions.
The follow-on posts describe where I plan to run the binaries. The idea is to run in a guest with no kernel and everything at ring 0, which makes sysret a dangerous thing to call, since we have nothing running at ring 3. The syscall instruction also clobbers some registers; all in all, between int3 and the syscall instruction I counted around 20 extra instructions in my runtime. (This is a guess, me trying to figure out what would happen.) That is why int3 becomes faster for what I am trying to build. The toolchain approach suffers from the diversity of options you have to support, even if you ignore the stuff you encountered. It might be easier with LLVM-based things, but there are still too many things to patch, and the moment you tell people "use my build environment" it meets resistance.
I am currently aiming for Python, which is easy to do. The JIT is for when I want to do JavaScript, which I keep pushing out because once I go down that path I have to worry about threading as well. It is something I want to chase, but right now I am trying to get something working.
Good question. I didn't cover this in the post. The binary doesn't run on the host kernel directly; it runs inside a lightweight KVM-based VM with no operating system, and the shim is the only thing handling syscalls inside the guest. So strace on the host wouldn't see anything, because no syscalls reach the host kernel from the guest. From the host side, the only visible activity is the hypervisor process making syscalls on behalf of the guest.
Inside the guest there's no kernel to attach strace to; the shim IS the syscall handler. But we do have full observability: every syscall that hits the shim is logged to a trace ring buffer with the syscall number, arguments, and TSC timestamp. In some ways it's more complete than strace: you see denied calls too, with the policy verdict, and there's no observer overhead because the logging is part of the dispatch path.
So existing tools don't work, but you get something arguably better: a complete, tamper-proof record of every syscall the process attempted, including the ones that were denied before they could execute.
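A minimal sketch of that kind of trace ring, assuming a fixed capacity and the fields mentioned above (the class, field order, and capacity are my guesses, not from the post): each entry records a timestamp, the syscall number, its arguments, and the policy verdict, so denied calls are captured alongside allowed ones.

```python
import time
from collections import deque

RING_SIZE = 4096  # assumed capacity; oldest entries drop off when full

class TraceRing:
    """Toy model of a syscall trace ring buffer inside the shim."""

    def __init__(self) -> None:
        self.entries: deque = deque(maxlen=RING_SIZE)

    def log(self, nr: int, args: tuple, verdict: str) -> None:
        # The real shim would use a TSC read here; monotonic_ns stands in.
        self.entries.append((time.monotonic_ns(), nr, args, verdict))

ring = TraceRing()
ring.log(0, (3, 0x1000, 128), "allow")   # read(fd=3, buf, 128) allowed
ring.log(59, ("/bin/sh",), "deny")       # execve denied before executing
assert len(ring.entries) == 2
assert ring.entries[1][3] == "deny"      # denied call is still recorded
```

Because logging happens inside the dispatch path itself, there is no separate observer to attach or detach, which is where the "tamper-proof" property comes from.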
I'll publish a follow-on tomorrow that details how we load and execute this rewritten binary and what the VMM architecture looks like.