Describe the bug
In a unit test suite that repeatedly and concurrently opens the same store, I'm experiencing a deadlock in flock, called via nix::lockFile(rw) from nix::LocalStore::createTempRootsFile().
createTempRootsFile was not designed with concurrent instances in mind. See Additional context.
I have not experienced this when running the tests with a daemon. Perhaps the daemon introduces enough timing noise that the critical section responsible for the hang is unlikely to (mis)align with the other workers, or perhaps it is because the daemon worker's pid is used, which is unique.
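The hang can be illustrated outside Nix. On Linux, flock locks belong to the open file description, so two independent open() calls on the same path within one process contend with each other just as two processes would. The following is a minimal sketch, assuming Linux/POSIX; the function name secondOpenConflicts is hypothetical. It passes LOCK_NB so that the second lock attempt fails with EWOULDBLOCK instead of waiting forever, which is what a blocking lock call in the second store instance would do.

```cpp
// Sketch (Linux/POSIX): two open file descriptions of the same file, in one
// process, contend for the same flock() lock.
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <string>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

// Returns true if a second, independently opened fd on `path` can NOT take an
// exclusive lock while the first fd holds one. Without LOCK_NB, the second
// flock() call would block indefinitely instead of returning.
bool secondOpenConflicts(const std::string & path)
{
    // Like a non-O_EXCL create: both opens succeed and refer to the same inode.
    int fdA = open(path.c_str(), O_RDWR | O_CREAT | O_CLOEXEC, 0600);
    int fdB = open(path.c_str(), O_RDWR | O_CREAT | O_CLOEXEC, 0600);
    if (fdA == -1 || fdB == -1) std::abort();  // demo only: bail out on I/O errors

    // "Instance A" takes the write lock and keeps it.
    if (flock(fdA, LOCK_EX) == -1) std::abort();

    // "Instance B" tries the same; LOCK_NB turns the would-be hang into EWOULDBLOCK.
    bool conflict = flock(fdB, LOCK_EX | LOCK_NB) == -1 && errno == EWOULDBLOCK;

    close(fdB);
    close(fdA);  // closing A's only fd releases its lock
    return conflict;
}
```

On Linux, calling secondOpenConflicts on any writable path is expected to return true, even though both descriptors belong to the same process.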
Steps To Reproduce
Triggered by running this test in the Nix sandbox (so that it uses a local, alternate store):
nix build github:nixops4/nixops4/8ec34a75da90973a21335a2d844928741756b2b2#packages.x86_64-linux.nixops4-eval-release
(Despite its name, this program is also responsible for triggering realisations. Its main reason to exist is evaluation, though.)
Expected behavior
This problem could be solved in multiple ways:
a. Document that ensuring one store instance per store per process is the caller's responsibility
b. Use a process-wide Sync<std::map<Path, TempRoots>>, where temp roots creation and perhaps other operations go through the shared TempRoots object
c. Use a process-wide Sync<std::map<Path, LocalStore>> to deduplicate the store instances in openStore
- It already has thread safety measures, including for temp roots
- Does not allow multiple instances with different settings!
d. Use prefix/$pid/$random (a subdirectory per process) or prefix/$pid-$random instead of prefix/$pid. I'd need to know more about the garbage collector's file system state (which is undocumented). Clearing stale $pid entries would have to work differently. I don't know what the other implications are. For the current state, see fnTempRoots below.
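Option b could look roughly like this. It is a sketch under several assumptions: std::mutex and std::map stand in for Nix's Sync<std::map<Path, TempRoots>>, and TempRootsState, tempRootsFor and ensureTempRootsFile are hypothetical names, not existing Nix APIs. The point is that the "already created?" check moves from a per-instance lock to a per-path lock, so a second instance on the same state directory reuses the existing file instead of unlinking it or blocking on its lock.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical per-path state shared by all store instances opened on the
// same state directory. In Nix this would own the temp roots fd.
struct TempRootsState {
    std::mutex lock;        // serializes creation/appends across instances
    bool created = false;   // placeholder for "temp roots file is open and locked"
    int creations = 0;      // how often the file was actually created
};

// Process-wide registry keyed by state directory. weak_ptr lets the state
// (and thus the file lock) go away once the last store instance drops it.
std::shared_ptr<TempRootsState> tempRootsFor(const std::string & stateDir)
{
    static std::mutex registryLock;
    static std::map<std::string, std::weak_ptr<TempRootsState>> registry;

    std::lock_guard<std::mutex> guard(registryLock);
    auto & slot = registry[stateDir];
    if (auto existing = slot.lock()) return existing;
    auto fresh = std::make_shared<TempRootsState>();
    slot = fresh;
    return fresh;
}

// Hypothetical stand-in for the body of createTempRootsFile(): creation
// happens at most once per path, under the shared per-path lock.
void ensureTempRootsFile(TempRootsState & state)
{
    std::lock_guard<std::mutex> guard(state.lock);
    if (state.created) return;
    // ... open and lock the real temp roots file here ...
    state.created = true;
    ++state.creations;
}
```

Two store instances on the same path would then obtain the same TempRootsState and call ensureTempRootsFile on it; only the first call creates the file. Expired registry entries are never erased in this sketch, which is harmless for a small number of store paths.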
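For option d, one possible naming scheme is sketched below. It assumes that whoever clears stale temp roots still needs the owning pid, so the pid stays a parseable prefix. The helpers tempRootsName and ownerPid are hypothetical, and how the garbage collector actually enumerates and clears these files is exactly the undocumented part mentioned above.

```cpp
#include <cassert>
#include <optional>
#include <random>
#include <string>
#include <sys/types.h>

// Hypothetical "<pid>-<random>" name, so two store instances in one process
// get distinct temp roots files instead of sharing "<pid>".
std::string tempRootsName(pid_t pid, std::mt19937_64 & rng)
{
    static const char hex[] = "0123456789abcdef";
    std::string suffix;
    for (int i = 0; i < 16; ++i) suffix += hex[rng() % 16];
    return std::to_string(pid) + "-" + suffix;
}

// Recovers the owning pid from either the current "<pid>" form or the
// proposed "<pid>-<random>" form; returns nullopt for anything else.
std::optional<pid_t> ownerPid(const std::string & name)
{
    size_t dash = name.find('-');
    std::string digits = name.substr(0, dash);  // dash == npos: whole string
    if (digits.empty() || digits.size() > 9) return std::nullopt;  // avoid overflow
    pid_t pid = 0;
    for (char ch : digits) {
        if (ch < '0' || ch > '9') return std::nullopt;
        pid = pid * 10 + (ch - '0');
    }
    if (dash != std::string::npos && dash + 1 == name.size()) return std::nullopt;
    return pid;
}
```

With this scheme, a cleanup step could still decide per pid whether a file is stale, but it would have to treat several files per pid as normal rather than assuming exactly one.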
Metadata
Nix "master", 4f50b1d
Additional context
This section of createTempRootsFile is synchronized on the LocalStore instance, not on the file path, and openLockFile(_, true) is not an exclusive (O_EXCL) create operation, so multiple instances end up trying to lock the same file. Fixing this race condition would solve the hang, but would not solve the erroneous clearing of other instances' temporary roots.
Lines 57 to 65 in d467f7a
Initialization of fnTempRoots:
nix/src/libstore/local-store.cc, line 112 in d467f7a
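The race half of the problem can be illustrated with an exclusive create. This is a minimal sketch, assuming POSIX open() semantics; createExclusive is a hypothetical helper, not the existing openLockFile. With O_EXCL exactly one opener creates the file, and a second instance gets EEXIST instead of silently sharing the inode and then blocking on its lock.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Create `path` only if it does not exist yet. Returns the new fd, or -1
// with errno set (EEXIST if another opener created it first).
int createExclusive(const std::string & path)
{
    return open(path.c_str(), O_RDWR | O_CREAT | O_EXCL | O_CLOEXEC, 0600);
}
```

As noted above, this only removes the hang: on EEXIST the second instance still needs somewhere to record its roots (a shared handle as in option b, or a distinct name as in option d), and the "must be stale" unlink would still clear another instance's roots.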
Checklist
- checked latest Nix manual (source)
- checked open bug issues and pull requests for possible duplicates