Skip to content

Concurrent instances of the same store hang on void LocalStore::createTempRootsFile() #11979

Open
@roberth

Description

Describe the bug

In a unit test suite which repeatedly opens the same store, concurrently, I'm experiencing a deadlock in flock via nix::lockFile(rw) via nix::LocalStore::createTempRootsFile()

createTempRootsFile was not designed with concurrent instances in mind. See Additional context.

I have not experienced this when running the tests with a daemon, perhaps because it introduces timing noise so that the critical section responsible for the hang is unlikely to (mis)align with the other workers, or the worker pid is used, which is unique.

Steps To Reproduce

Triggered by running this test in the Nix sandbox (so that it uses a local, alternate store)

nix build github:nixops4/nixops4/8ec34a75da90973a21335a2d844928741756b2b2#packages.x86_64-linux.nixops4-eval-release

(This program is also responsible for triggering realisations, despite its name. It's main reason to exist is evaluation though.)

Expected behavior

This problem could be solved in multiple ways:

a. Document that ensuring one store instance per store per process is the caller's responsibility

b. Use process-wide Sync<std::map<Path, TempRoots>>, where temproots creation and perhaps other operations go through the shared TempRoots object

c. Use process-wide Sync<std::map<Path, LocalStore>>, to dedup the store instances in openStore

  • It already has thread safety measures, including for temp roots
  • Does not allow multiple instances with different settings!

d. Use subdirectories prefix/$pid/$random or prefix/$pid-$random instead of prefix/$pid. I'd need to know more about the garbage collector file system state (which is undocumented). Clearing the $pid would be different. I don't know what the other implications are. Current state: see fnTempRoots below.

Metadata

Nix "master", 4f50b1d

Additional context

This section of createTempRootsFile is synced to the LocalStore instance, not to the file path, and openLockFile(_, true) is not an exclusive (O_EXCL) create operation, so multiple instances end up trying to lock the same fd. Fixing this race condition would solve the hang, but does not solve the erroneous clearing of other instances' temporary roots.

nix/src/libstore/gc.cc

Lines 57 to 65 in d467f7a

if (pathExists(fnTempRoots))
/* It *must* be stale, since there can be no two
processes with the same pid. */
unlink(fnTempRoots.c_str());
*fdTempRoots = openLockFile(fnTempRoots, true);
debug("acquiring write lock on '%s'", fnTempRoots);
lockFile(fdTempRoots->get(), ltWrite, true);

Initialization of fnTempRoots:

, fnTempRoots(fmt("%s/%d", tempRootsDir, getpid()))

Checklist


Add 👍 to issues you find important.

Metadata

Assignees

No one assigned

    Labels

    bugstoreIssues and pull requests concerning the Nix store

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions