Description
Describe the bug
When building with remote builders, very often only one build makes progress; the others are stuck for quite a long time. With sudo lslocks | grep upload, you can see that processes are waiting on the .upload-lock for the machine, and stracing them confirms that they're blocked on flock(5, LOCK_EX). This can take a very long time. Interestingly, stracing the process that does hold the lock seems to indicate that it's past the upload phase anyway - I see hundreds of thousands of lines with type 105:
write(2, "@nix {\"action\":\"result\",\"fields\":[221366008,0,0,0],\"id\":17878179326722332,\"type\":105}\n", 86) = 86
I suppose this could be happening in parallel with an attempt to copy files, though I somewhat doubt that.
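To check how much of the lock holder's output consists of these messages, the strace capture can be tallied by message type. The following is a minimal sketch, not part of Nix: the helper name count_message_types is hypothetical, and it assumes the write payloads have already been extracted from the strace output and unescaped into plain lines starting with "@nix ".

```python
import json
from collections import Counter

NIX_PREFIX = "@nix "

def count_message_types(lines):
    """Count structured '@nix {...}' messages by (action, type).

    Hypothetical helper: assumes each line is an already-unescaped
    payload, not the raw strace line with its write(2, "...") wrapper.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith(NIX_PREFIX):
            continue  # not a structured log message
        try:
            msg = json.loads(line[len(NIX_PREFIX):])
        except json.JSONDecodeError:
            continue  # truncated or malformed payload
        counts[(msg.get("action"), msg.get("type"))] += 1
    return counts

# The one message quoted above, plus a non-matching line.
sample = [
    '@nix {"action":"result","fields":[221366008,0,0,0],"id":17878179326722332,"type":105}',
    "some unrelated line",
]
counts = count_message_types(sample)
print(counts[("result", 105)])
```

If nearly all messages are of type 105 while no copy-related activity shows up, that would support the suspicion that the lock is held well past the upload phase.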
Steps To Reproduce
- Enable a remote builder
- Kick off a bunch of simultaneous builds with --max-jobs 0
- Keep track of the .upload-locks in lslocks
- Strace the processes to see what they're doing
Expected behavior
I expect the builds to start building more quickly, without waiting on each other's upload lock for long periods.
nix-env --version output
nix-env (Nix) 2.18.4
Additional context
Some investigation shows this is happening here. The original motivation for this logic, according to comments in the Perl precursor to this module from a decade ago, is to prevent multiple processes from trying to copy the same derivation over and over again.
It seems like the lock is potentially held too long. More importantly, it's too coarse a lock: we should probably only have one lock per store path + remote rather than one per machine. The alarm of 15 minutes also seems very long.
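To make the per-path suggestion concrete, here is a rough sketch of what a lock keyed on (remote, store path) with a shorter timeout could look like. It is written in Python for brevity and is not the Nix implementation; the names lock_path and upload_lock, the lock file naming scheme, and the 60-second default timeout are all assumptions chosen for illustration.

```python
import fcntl
import hashlib
import os
import time
from contextlib import contextmanager

def lock_path(lock_dir, remote, store_path):
    """One lock file per (remote, store path) pair (hypothetical naming)."""
    key = hashlib.sha256(f"{remote}\0{store_path}".encode()).hexdigest()
    return os.path.join(lock_dir, f"{key}.upload-lock")

@contextmanager
def upload_lock(lock_dir, remote, store_path, timeout=60.0, poll=0.1):
    """Hold an exclusive flock for one store path on one remote.

    Polls with LOCK_NB so that waiting gives up after `timeout` seconds
    instead of blocking indefinitely.
    """
    os.makedirs(lock_dir, exist_ok=True)
    fd = os.open(lock_path(lock_dir, remote, store_path),
                 os.O_RDWR | os.O_CREAT, 0o600)
    try:
        deadline = time.monotonic() + timeout
        while True:
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(
                        f"timed out waiting for upload lock on {store_path}")
                time.sleep(poll)
        try:
            yield  # caller copies the path to the remote here
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

With this shape, two builds that need different store paths on the same remote never wait on each other, while two uploads of the same path are still serialized, which is what the original comment was trying to achieve. Whether closures that share dependencies should take several such locks at once, and in what order, is left open here.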
Priorities
Add 👍 to issues you find important.