stdenv/setup.sh: fix parallel make #174473

Closed
wants to merge 1 commit into from

markuskowa
Member

Description of changes

The stdenv setup phase sets both the -j and the -l option for make to $NIX_BUILD_CORES (e.g. from nix-build's --cores option).
However, the -l option sets an upper bound on the system load average. If this load is exceeded, make effectively runs with -j 1.
This leads to unwanted behavior and slow builds. For example: on a system with 48 cores and a load of 24, make will
only run one job at a time if $NIX_BUILD_CORES is set to less than 24, leaving the system underutilized.
It is not clear to me why -l should be set to $NIX_BUILD_CORES. This PR removes the -l option from the setup phase.

For reference from the GNU make manual:
"When the system is heavily loaded, you will probably want to run fewer jobs than when it is lightly loaded. You can use the ‘-l’ option to tell make to limit the number of jobs to run at once, based on the load average. The ‘-l’ or ‘--max-load’ option is followed by a floating-point number. For example,

-l 2.5

will not let make start more than one job if the load average is above 2.5. The ‘-l’ option with no following number removes the load limit, if one was given with a previous ‘-l’ option."
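
As a rough illustration of the behavior described in the manual excerpt, the following sketch models the decision in shell. It is a simplification and not make's actual implementation: make re-checks the load before starting each new job, whereas this only compares a single load value against the limit. The function name `effective_jobs` and its parameters are made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical model of "make -jN -lL" under a given load average.
# Assumption: when the load exceeds the limit, make keeps only one job running.
effective_jobs() {
    local jobs=$1 max_load=$2 current_load=$3
    # Load averages are floating-point numbers, so compare them with awk.
    if awk -v cur="$current_load" -v max="$max_load" 'BEGIN { exit !(cur > max) }'; then
        echo 1
    else
        echo "$jobs"
    fi
}

# 48-core machine with a load of 24 and NIX_BUILD_CORES=16:
effective_jobs 16 16 24    # prints 1
# Same machine with no other load:
effective_jobs 16 16 0.5   # prints 16
```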

Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 22.05 Release Notes (or backporting 21.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
    • (Release notes changes) Ran nixos/doc/manual/md-to-db.sh to update generated release notes
  • Fits CONTRIBUTING.md.

@@ -1075,7 +1075,7 @@ buildPhase() {
# Old bash empty array hack
# shellcheck disable=SC2086
local flagsArray=(
${enableParallelBuilding:+-j${NIX_BUILD_CORES} -l${NIX_BUILD_CORES}}
Member

@SuperSandro2000 SuperSandro2000 May 25, 2022

Did you think about setting this to -l 2.5 or maybe -l 5.0? I am not sure if that would make sense or not.

Member Author

@markuskowa markuskowa May 25, 2022

A constant value would not make sense here IMHO. Which value makes sense depends on the system and its configuration (i.e. the number of cores and nix's max-jobs/cores settings): a value of 2.0 may be a good choice for a laptop but not for a big server with lots of CPU cores.
If max-jobs is set to the number of cores in the system, it could make sense to set -l <number cores> to avoid overly high system loads.
It would be interesting to know the motivation why -l${NIX_BUILD_CORES} was set here in the first place. Maybe @Ericson2314 knows more?

Contributor

Isn't -l${NIX_BUILD_CORES} a good protection against overloading/DoS-ing build machines? Maybe it could be configurable at runtime, but I personally think the default is good.

Member

It is good to have safeguards in place. However, when the safeguards cause my builds to run with the equivalent of -j1 on a 64-core machine, it is no longer feasible to use Nix in any way in a professional context.

Contributor

> cause my builds to run with the equivalent of -j1 on a 64-core machine

Are you sure? I would imagine it starting out with 64 jobs and if/when system load > 64, then new jobs are delayed until load falls below 64. But that'd mean it should run the overall build with >>1 job.

It would be cool to visualize it :-)

Member

I am not able to run builds with cores = 64 because the OOM killer is invoked:

[Screenshot: Screen Shot 2022-07-15 at 06 05 27]

If I run the builds with cores = 8, -j8 -l8 will be passed to make. This is not good because the system has a load average higher than 8, which causes the builds to slow to a crawl.
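
To check whether a machine is in this situation, one could compare the 1-minute load average against the limit that would be passed to make. The sketch below assumes a Linux system where /proc/loadavg is available; the function name `load_exceeds_limit` is hypothetical, and make's own load calculation may differ in detail.

```shell
#!/usr/bin/env bash
# Hypothetical check: is the 1-minute load average above the given limit?
# Returns 0 if the load exceeds the limit, 1 if not, 2 if the file is unreadable.
# The second argument (path to a loadavg-style file) exists only to ease testing.
load_exceeds_limit() {
    local limit=$1 loadavg_file=${2:-/proc/loadavg}
    local load1
    [ -r "$loadavg_file" ] || return 2
    # The first field of /proc/loadavg is the 1-minute load average.
    read -r load1 _ < "$loadavg_file"
    awk -v cur="$load1" -v max="$limit" 'BEGIN { exit !(cur > max) }'
}

if load_exceeds_limit 8; then
    echo "load is above 8: make -j8 -l8 would be throttled"
fi
```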

Member Author

> Isn't -l${NIX_BUILD_CORES} a good protection against overloading/DoS-ing build machines? Maybe it could be configurable at runtime, but I personally think the default is good.

It is certainly a protection against overloading. However, the question is whether it is an efficient protection. The default may be good for the main Hydra build farm. On servers with mixed load, this default does not work well. For example, on a 48-core machine that dedicates half of its cores to a constant, non-build load and the other half to nix-build jobs, it results in the gross underutilization described above. Running nix with -l24 does not give the desired result of using 24 cores for the build job; instead, nix (or rather make) uses only a single core.

Contributor

@centromere @markuskowa: Good point.

@centromere
Member

Is there something that can be done to move this forward? I am currently affected by this bug.

Member

@SuperSandro2000 SuperSandro2000 left a comment

@Ericson2314 what do you think?

I am really unsure how this will interact with hydra.

@markuskowa
Member Author

> @Ericson2314 what do you think?
>
> I am really unsure how this will interact with hydra.

Feedback from someone familiar with the Hydra build farm (and its load problems) is absolutely needed here. To be clear: from what I can judge here, this certainly will have a non-negligible impact on the Hydra load patterns.

@vcunat
Member

vcunat commented Jul 16, 2022

This isn't just about hydra. And I don't think it's good to go without any -l limit. By default that would parallelize up to the square of your number of cores, and that would be quite likely to exhaust RAM.

So indeed the point is to protect the machine from overloading, even though it's quite a crude method. My experience of using the current setting on a 32-core machine is OK, but I can imagine it could be considered limiting if you have lots of non-CPU load (e.g. from rotating drives).

For an easy step, I think it would be nice to allow to override make's -l by a setting separate from its -j which each machine could configure. (EDIT: sounds like #124166) For a better scheduling, different nix builds would have to "communicate together" by something better than this current-load metric. An attempt was in PR #143820
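
The "square of your number of cores" figure can be spelled out with a short calculation. This sketch assumes that max-jobs and cores are both set to the number of CPUs and that every build runs make with -j but without -l; the function name `worst_case_processes` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: upper bound on concurrently running make jobs when
# max_jobs derivations build at once and each one runs "make -j<cores>".
worst_case_processes() {
    local max_jobs=$1 cores=$2
    echo $(( max_jobs * cores ))
}

worst_case_processes 32 32   # prints 1024
```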

@markuskowa
Member Author

> For an easy step, I think it would be nice to allow to override make's -l by a setting separate from its -j which each machine could configure. (EDIT: sounds like #124166) For a better scheduling, different nix builds would have to "communicate together" by something better than this current-load metric. An attempt was in PR #143820

I am closing this PR, since it is not mergeable in its current form and would probably cause real trouble on Hydra's build farm.
The options mentioned above, such as making -l overridable or, even better, the job server solution, seem much more appropriate.

@markuskowa markuskowa closed this Jul 31, 2022
@centromere
Member

@markuskowa Any change to stdenv will invalidate (almost) every nar in the cache, yes? If so, is it feasible to make any changes whatsoever to stdenv without causing pain to the build farm?

@vcunat
Member

vcunat commented Aug 1, 2022

No, we change stdenv several times a month. Mass rebuilds are common.

@centromere
Member

Okay. What do y'all think of this plan?

  1. Add an integer nix.conf setting named core-limit.
  2. Inject NIX_CORE_LIMIT in to the build environment:
env["NIX_CORE_LIMIT"] = (format("%d") % settings.coreLimit).str();
  3. Rework stdenv as follows:
NIX_CORE_LIMIT="${NIX_CORE_LIMIT:-$NIX_BUILD_CORES}"
export NIX_CORE_LIMIT

local flagsArray=(
    ${enableParallelBuilding:+-j${NIX_BUILD_CORES}}
...
)

if ((NIX_CORE_LIMIT > 0)); then
    flagsArray+=("-l${NIX_CORE_LIMIT}")
fi
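
The stdenv fragment in this plan can be tried outside a real build as a standalone function. The following is only a sketch under assumptions: NIX_CORE_LIMIT is the variable proposed in the plan (not an existing Nix setting at the time of this comment), the function name `make_parallel_flags` is hypothetical, and the elided flags from the real `flagsArray` are left out. Like the proposal, it expects an integer limit.

```shell
#!/usr/bin/env bash
# Hypothetical standalone version of the proposed flag logic.
# NIX_CORE_LIMIT falls back to NIX_BUILD_CORES; a value of 0 disables -l.
make_parallel_flags() {
    local cores=${NIX_BUILD_CORES:-1}
    local limit=${NIX_CORE_LIMIT:-$cores}
    local flags=()

    # As in stdenv, -j is only passed when parallel building is enabled.
    if [ -n "${enableParallelBuilding:-}" ]; then
        flags+=("-j${cores}")
    fi

    # As in the proposal, -l is added whenever the limit is positive.
    if (( limit > 0 )); then
        flags+=("-l${limit}")
    fi

    echo "${flags[*]}"
}

NIX_BUILD_CORES=8 NIX_CORE_LIMIT=24 enableParallelBuilding=1 make_parallel_flags
# prints: -j8 -l24
```

With this split, a machine that reserves half of its 48 cores for other work could keep -j at 8 or 24 while raising the load limit independently, which is the case discussed earlier in the thread.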

@ck3d
Contributor

ck3d commented Aug 2, 2022

I had the same solution in mind. Only the name "core limit" should be more aligned to make and ninja wording -> load average.

@centromere
Member

@ck3d @vcunat @markuskowa I've submitted some PRs to address this:

NixOS/nix#6855

#184886
