Am I missing why a distributed lock is an impossibility? The problem stated is that a partitioned node can't know it has lost the lock, but this is only an issue if there is a way to lose the lock short of returning it.
Which I guess is to say: what difference is there between a lease with an infinite timeout unless manually returned, and a "lock"?
Certainly the system deadlocks under partition but I'm not sure why that makes this "impossible".
> a lease with an infinite timeout unless manually returned
I would argue that "infinite timeout" is another negative shibboleth.
every operation in a distributed system has some duration after which you can be 99.9% confident (or 99.9999%, or whatever threshold you want to pick) that it was lost to the void and will never return a result.
in a robust distributed system, you want to pick a reasonable timeout value, and then take appropriate action in response to the timeout. typically this is retrying the operation, bubbling up a failure message to a higher level, or some combination of the two (retry a few times, fail if all the retries fail).
an infinite timeout represents a deliberate design choice of "I don't want to handle the case of this message or API call being lost in-transit and never returning either success or failure".
in my experience, infinite timeouts are often the cause of "hmm, this thing is up and running but seems 'stuck' and not making any progress, let me try manually restarting this service...OK, that seems to have recovered it" bugs and production alerts.
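The "pick a timeout, retry a few times, then bubble the failure up" pattern can be sketched in a few lines. This is a toy single-process illustration; `OperationTimeout`, `call_with_retries`, and the backoff parameters are all invented for the example:

```python
import random
import time

class OperationTimeout(Exception):
    """Raised when a remote call exceeds its deadline."""

def call_with_retries(op, *, timeout_s=2.0, max_attempts=3, base_backoff_s=0.1):
    """Run op(timeout_s); retry on timeout, then bubble the failure up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op(timeout_s)
        except OperationTimeout:
            if attempt == max_attempts:
                raise  # all retries exhausted: surface the failure upward
            # exponential backoff with jitter before the next attempt
            time.sleep(base_backoff_s * (2 ** (attempt - 1)) * random.random())
```

The point is the shape, not the numbers: there is always a finite deadline, and the timeout path is handled explicitly instead of being left to hang forever.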
Zombie processes are already dead and aborted, there's nothing more a kill -9 would do to them.
The kernel retains minimal state about them because the system has made a promise to report that the process exited to its parent process, and the parent process hasn't gotten around to asking for that yet.
(Don't confuse zombies with uninterruptible I/O sleep, or buggy kernel workers.)
I might have conflated those in that last line, but I do mean it generally. Even if the mechanism isn't kill -9 (which it arguably should be, somehow, since that's the human's intent when they use it), there should be some way for the process to reach the 'failed' state and leave the accounting tables.
Stuck mounts have a half solution (lazy unmounts) but even _that_ interface really also needs a timeout value after which operations on the target should be assumed to fail rather than return correctly.
Offhand, I wonder if there is currently, or has previously been, a DoS attack based on processes stuck in uninterruptible sleep. Theoretically a system could be exhausted of PIDs, which could lead to nasty issues.
> 'failed' state and the process itself leaving the accounting tables.
Once again, that cannot be done until the parent process consumes the exit status. That's what the zombie is there for. Zombies don't take up much space.
> Stuck mounts have a half solution (lazy unmounts) but even _that_ interface really also needs a timeout value after which operations on the target should be assumed to fail rather than return correctly.
These days most NFS (and similar) mounts are "soft" mounts, that is, operations will eventually time out.
Lazy unmount doesn't really apply here, it makes the mountpoint disappear from the global namespace, but all existing open files remain untouched, and the mount lives as long as anything is still using it; it just removes the "entry point" to the mount.
On today's Linux, it's up to each filesystem to provide an abort/timeout mechanism. For timeouts, this is the right design, as demonstrated by macOS's complications with FUSE. I do wish there were a common way to make things abort.
There was a patch in circulation a long time ago, that could seamlessly switch all open FDs of any given mountpoint into a whole different filesystem named badfs. badfs would just return an error on any operation. As far as I know, that patch never got merged, probably because nobody ever got it working 100%.
That kind of a DoS would require a local attacker, and then the victim to access a mountpoint owned by the attacker. Using FUSE, you could get a lot of processes hanging like that, for sure. I guess you could trap a mail delivery agent, if you still had a system where mail was delivered to users' home directories.
The short answer is that if the lock holder fails, the other nodes have no way of knowing whether it failed or is merely slow (a consequence of the FLP impossibility result). If you set a timeout, then that's a lease, not a lock.
The long answer is to peel this onion for yourself and see where it leads. It’s a lot of fun.
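One way to start peeling: model a lease as a lock plus an expiry the holder must keep renewing. A classic lock then falls out as the degenerate case of an infinite TTL. This is a single-process toy with an injectable clock, invented for illustration (a real distributed lease has none of this shared memory):

```python
import time

class Lease:
    """A 'lock' with an expiry: the holder must renew before ttl_s elapses.
    With ttl_s = float('inf') this degenerates into a classic lock that
    only a manual release can ever free."""
    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, who, now=None):
        now = time.monotonic() if now is None else now
        if self.holder is None or now >= self.expires_at:
            self.holder, self.expires_at = who, now + self.ttl_s
            return True
        return False

    def renew(self, who, now=None):
        now = time.monotonic() if now is None else now
        if self.holder == who and now < self.expires_at:
            self.expires_at = now + self.ttl_s
            return True
        return False
```

With a finite TTL, a crashed holder's lease lapses and someone else can acquire; with an infinite TTL, a crashed holder wedges the system forever, which is the deadlock-under-partition being discussed.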
Why does it matter (in fully-general theory, which is what we're discussing here) if the lock holder fails? The lock is either released, at which point someone else can acquire it; or never released, at which point the system doesn't try to do whatever that lock is about any more. Assuming that every distributed system has to successfully make progress in all cases is just that—an assumption. A design could require that operations "fail stalled." Like a physical padlock that is "tamper proof" in the sense that it permanently seizes up when you put in the wrong key.
Big assumption that a distributed system has to serve “people” and have “responses.”
A distributed system might be, for example, the ACH system: all batch, all store-and-forward, no replies flowing down the line, only processed message response batches dropped off “eventually” in outboxes.
Or, for another example: any workload manager, whether on an HPC cluster or your local Kubernetes node. No synchronous workload execution; just async batch scheduler enqueue with later best-effort status querying.
Note that ACH expects a response within 3 days, so blocking forever does not work. Because, guess what: people expect an answer.
Telling people "a system somewhere is blocked, possibly forever, so too bad, we cannot do your thing" doesn't fly in our reality. Our systems exist for their impact on people.
Otherwise they are art... which also exists for its impact on people.
It's not necessarily that "we cannot do your thing." Just that "we cannot do your thing using your lock. To get around this, simply make a new resource, to get a new lock."
Think of how in e.g. an IaaS control plane, when you delete a VM, it may take an arbitrarily-long time before you can create another VM with the same ID. (Maybe forever!) But you can always create a VM with a different ID, that otherwise fulfills all the same purposes (e.g. has the old instance's IP, FQDN, etc.) The old ID essentially has a distributed lock on its use, with an unbounded release time — and that's perfectly fine for the use-case.
For an example of fail-stalled being not only practical but preferred, consider tag-out locking systems (exclusive-access locks used to prevent machines from being turned on while maintenance is being performed on them). If there were a digital lock of that type, you wouldn't ever want it to time out automatically. A human put that lock there to keep them alive; they'll take it off when they're done. If you really suspect someone forgot to remove their lock, you can always go and check with the lock's acquirer. But if you can't get in contact with them, you can't know that they don't still have their hands up in the gears of the machine. And in that case, failing to auto-restart the assembly line (until the "partition" is over and you can just ask the maintenance worker why they're still holding the lock) costs far less than the maintenance worker's life.
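The "stalled ID, mint a fresh one" pattern from the IaaS example can be sketched as a toy registry. Everything here (`FailStalledRegistry`, the in-memory set, the UUID escape hatch) is hypothetical, just to make the fail-stalled semantics concrete:

```python
import uuid

class FailStalledRegistry:
    """Toy sketch of 'fail stalled': a resource ID whose holder vanished
    is never reclaimed. Clients route around it by minting a fresh ID
    that serves the same purpose (IP, FQDN, etc.)."""
    def __init__(self):
        self.held = set()

    def acquire(self, resource_id):
        if resource_id in self.held:
            return None              # stalled, possibly forever: no timeout, no steal
        self.held.add(resource_id)
        return resource_id

    def acquire_fresh(self):
        # the escape hatch: a brand-new ID is always available
        return self.acquire(str(uuid.uuid4()))
```

No node ever has to decide whether the old holder is dead; the system stays safe by abandoning the old ID rather than reusing it.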
I can't come up with a good reason as to why one would want to fail stalled. In what scenario would one want to have a distributed lock that fails in that way?
At our job someone decided to use a ready-to-use Go library that used Redis for distributed locking. But I found that it was broken by design and completely unreliable, and we had random transient errors stemming from it. It worked OK 99.9% of the time, but once in a while we got inconsistent state in our application.

The description initially made sense and the usage looked simple. A node would create a value with a TTL, which made the lock auto-expire if the node crashed. If a node found that a value under the same name already existed in Redis, it would block. Since access to Redis is serialized, all such actions were basically atomic.

The problem was the auto-expire feature. The TTL can expire while your code under the lock is scheduled out due to GC or waiting for I/O, so the lock you held could be released at basically any point of execution while you were supposedly still under it. Extending the lock's TTL after every line of code isn't practical and is probably prone to race conditions anyway (and the library IIRC didn't provide a way to do it). I've read there's a technique called fencing tokens, but it requires additional changes to the way your shared resources are accessed, which isn't always possible. I still don't know how to do distributed locks right, and there seem to be many broken implementations in the wild.
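For what it's worth, the fencing-token idea fits in a few lines. This is an in-memory toy, not a real Redis setup: the hypothetical `LockService` stands in for a lock server handing out monotonically increasing tokens, and `Resource` for storage that rejects writes carrying a stale token:

```python
class LockService:
    """Toy lock service: each grant carries a monotonically increasing
    fencing token. (In-memory stand-in for Redis/ZooKeeper.)"""
    def __init__(self):
        self._token = 0

    def acquire(self):
        self._token += 1
        return self._token  # caller must present this to the resource

class Resource:
    """The shared resource itself checks tokens, so a holder whose TTL
    silently expired can no longer corrupt state."""
    def __init__(self):
        self.highest_seen = 0
        self.data = None

    def write(self, token, value):
        if token < self.highest_seen:
            raise PermissionError("stale fencing token: lock was lost")
        self.highest_seen = token
        self.data = value
```

This is exactly the "additional changes to the way your shared resources are accessed" part: the resource has to cooperate by checking tokens, which is why it isn't always possible to retrofit.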
So Redis isn't really a distributed locking system. The locks are all managed by a central, non-distributed server: Redis. But this kind of lock is useful too. One nice approach to handling crashes in a system like this is to use the idea of fate sharing [1]: you upper-bound the lifetime of a held lock by the lifetime of the TCP connection it was taken on. When the connection goes, the lock is auto-released. To support this, Redis would need some kind of EXPIRE_ON_DISCONNECT command - I don't know whether it has one.
The idea of fate sharing is very general and useful: you can, for example, introduce reconnectable sessions, and attach shared state to those, which gets you transport-independence and the ability to recover from transport failure.
[1] Clark, David D. “The Design Philosophy of the DARPA Internet Protocols.” ACM SIGCOMM Computer Communication Review 18, no. 4 (August 1988): 106–14. https://doi.org/10.1145/52325.52336.
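A minimal sketch of fate-shared lock state, assuming an in-memory `LockServer` standing in for the real server, with `on_disconnect` playing the role that the hypothetical EXPIRE_ON_DISCONNECT behavior would play server-side:

```python
class LockServer:
    """Toy fate-sharing lock table: each lock's lifetime is upper-bounded
    by the lifetime of the connection that took it."""
    def __init__(self):
        self.locks = {}  # lock name -> id of the holding connection

    def acquire(self, conn_id, name):
        if name in self.locks:
            return False
        self.locks[name] = conn_id
        return True

    def on_disconnect(self, conn_id):
        # the server notices the connection died and auto-releases
        # everything that connection held
        for name in [n for n, c in self.locks.items() if c == conn_id]:
            del self.locks[name]
```

The nice property is that there's no TTL to tune: the lock lives exactly as long as the holder's connection, no longer.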
Yes, absolutely! The caveats are relevant to distributed locking, though: sharding would help scale out a locking system horizontally, but each subset of keyspace would still be a non-distributed locking service. Primary-secondary replication doesn't (as far as I can tell!) offer the necessary invariants to act as a locking service - at least, not when employing the straightforward technique GP mentions.