My "restarter process" is upstart. It's convenient, since the OOM-killer tries to not kill init (for bad things happen when you kill init), so it's a somewhat-safe place to put supervisory logic. One of the better calls Canonical has made, I think. :)
Still, in your use-case, I'd definitely recommend only letting users run their "wild code" inside a memory cgroup+process namespace (e.g. an LXC container.)
Crash-only systems only work when a faulty component crashes itself before it crashes you. Processes modellable as mutually-untrustworthy agents should always have a failure boundary drawn between them. (User A shouldn't be able to bring down the cluster-agent; but they shouldn't be able to snipe user B's job by OOMing their job on the same cluster node, either.) And on a Unix box, the only true failure boundaries are jails/zones/containers; nothing else really stops a user from using up any number of not-oft-considered resources (file descriptors, PIDs, etc.)
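To make the memory-cgroup half of that a bit more concrete, here's a minimal sketch, assuming a cgroup v2 unified hierarchy (an assumption on my part; an LXC setup would normally handle this for you). The function names and the `job-a` directory are hypothetical. `memory.max`, `pids.max`, and `cgroup.procs` are the standard cgroup v2 interface files. It only caps memory and PID count; it does nothing about the process namespace or file descriptors.

```python
from pathlib import Path


def limit_settings(memory_max_bytes, pids_max):
    """Map cgroup v2 interface file names to the values to write into them."""
    return {
        "memory.max": str(memory_max_bytes),  # hard memory cap for the group
        "pids.max": str(pids_max),            # cap on tasks, blunts fork bombs
    }


def apply_limits(cgroup_dir, memory_max_bytes, pids_max):
    """Create the cgroup directory (if needed) and write the limit files.

    cgroup_dir is hypothetical; on a real box it would live somewhere
    under /sys/fs/cgroup and the writes would need root.
    """
    cgroup_dir = Path(cgroup_dir)
    cgroup_dir.mkdir(parents=True, exist_ok=True)
    settings = limit_settings(memory_max_bytes, pids_max)
    for name, value in settings.items():
        (cgroup_dir / name).write_text(value)
    return settings


def add_process(cgroup_dir, pid):
    """Move a process into the cgroup by writing its PID to cgroup.procs."""
    (Path(cgroup_dir) / "cgroup.procs").write_text(str(pid))
```

On a real node you'd call something like `apply_limits("/sys/fs/cgroup/job-a", 512 * 1024 * 1024, 64)` as root, then `add_process(...)` with the job's PID before it runs any user code. That also assumes the `memory` and `pids` controllers are enabled in the parent's `cgroup.subtree_control`.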
Do you have any good resources on how to get started setting up failure boundaries (jails/zones/containers) like this properly?
I think it's surprisingly easy to get yourself into a situation where this is a concern[0] but you don't know how to solve it.
[0] Just run "adduser" and have SSH running, or just create an upstart job, or write a custom daemon that accepts and executes jobs from not-quite-trustworthy-undergrads, or...