
stdenv: reintroduce limiting by system load #192799

Draft · wants to merge 1 commit into base: staging
Conversation

@vcunat (Member) commented Sep 24, 2022

Commit c2b898d removed the mechanism; this PR replaces it with a different one with a higher limit of $(nproc) * 2.

Unfortunately there's no good way to override it from outside, as there is for `--cores` or `--max-jobs`, but at least the new limit should be high enough to avoid the previous downsides.

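For context, the classic mechanism here is GNU make's `-l`/`--load-average` flag, which defers starting new jobs while the 1-minute load average is above a threshold. A minimal sketch of the idea, not the literal patch (the `NIX_LOAD_LIMIT` override follows the discussion below and is an assumption):

```bash
# Sketch only (assumed wiring, not the exact stdenv change): default the
# load limit to twice the core count, allow an environment override, and
# pass it to make alongside the usual -j parallelism flag.
loadLimit="${NIX_LOAD_LIMIT:-$(( $(nproc) * 2 ))}"
flagsArray+=("-j${NIX_BUILD_CORES}" "-l${loadLimit}")
echo "building with ${NIX_BUILD_CORES} jobs, load limit ${loadLimit}"
```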
@github-actions bot added labels 6.topic: python, 6.topic: stdenv, 6.topic: TeX on Sep 24, 2022
@vcunat (Member, Author) commented Sep 24, 2022

I'm opening this mainly for discussion for now. I believe that at least some load limit is better than none.

I was testing this on a 24-core machine to observe the behavior. Using $(nproc) * 3/2 still left some periods of noticeable CPU under-utilization (seen in atop) due to the oscillation – though not by a large amount, and not with all builds (llvm builds showed this, linux builds did not).

So I chose $(nproc) * 2 to improve on this. Weakening the limit also seemed more in line with my perception of the community's reaction around PR #192447.

@vcunat mentioned this pull request on Sep 24, 2022
@jonringer (Contributor)

Ideally, I think we should have this support on the Nix side, similar to how you can configure jobs and cores.

That being said, I think this is better than what we had.

@vcunat (Member, Author) commented Sep 24, 2022

Yes, adding a similar config option to Nix to populate NIX_LOAD_LIMIT could be done later, if this route gets consensus (in nixpkgs first). These days it doesn't take so long for Nix changes to reach the default version in nixpkgs.
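For comparison, the existing per-user knobs live in nix.conf; a hypothetical load-limit setting (it does not exist – that's exactly what NixOS/nix#6855 asks for) would sit alongside them:

```
# nix.conf — real, existing settings the thread compares against:
max-jobs = 8    # how many derivations Nix builds concurrently
cores = 4       # exported to each builder as NIX_BUILD_CORES

# Hypothetical addition sketched in this thread (NOT an existing setting):
# load-limit = 16   # would be exported to builders as NIX_LOAD_LIMIT
```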

@ck3d (Contributor) commented Sep 24, 2022

For my setup, a load limit of twice the number of cores is too high; my machines are more likely to run out of memory or start swapping. I think we have to introduce NIX_LOAD_LIMIT first (see NixOS/nix#6855), or set the value to something more conservative, e.g. NIX_LOAD_LIMIT=$(nproc).

@jonringer (Contributor)

> For my setup, a load limit of twice the number of cores is too high; my machines are more likely to run out of memory or start swapping.

Yea, there's a big difference between a 2-core CI machine and something like my 128-vCPU server :/

@vcunat (Member, Author) commented Sep 24, 2022

A good default for everyone seems impossible with the mechanisms at hand. Generally I believe it helps to set `zramSwap.enable = true;` (stalls due to zram swapping will increase load, too).
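For reference, a minimal NixOS configuration sketch of that suggestion (`memoryPercent` is shown at its default just to make the knob visible):

```nix
{
  # Compressed in-RAM swap: memory pressure becomes CPU work (compression)
  # and therefore shows up in the load average, which a load limit can see.
  zramSwap.enable = true;
  zramSwap.memoryPercent = 50;  # size the zram device at 50% of RAM (default)
}
```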

@jonringer (Contributor)

I created NixOS/nix#7091 to make a longer-term solution viable.

@grahamc (Member) commented Sep 25, 2022

> Yea, there's a big difference between a 2-core CI machine and something like my 128-vCPU server :/

Note that most machines in Hydra have between 24 and 80 cores, but anywhere between 2 and 80 cores allocated to a build. I agree there isn't a heuristic here that will be robust across hardware configurations.

@lheckemann (Member)

People who have RAM issues still have the option of reducing cores, and I'd argue that's the more appropriate fix. Load doesn't rise immediately, so a burst of many processes at the beginning of a build is likely (observed on Hydra before the removal of the load limit), and such a burst can easily exhaust RAM too.
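For context on why load lags: Linux recomputes the 1-minute load average roughly every 5 seconds as an exponentially damped moving average of the runnable-task count $n(t)$:

$$\mathrm{load}_{1m}(t) = \mathrm{load}_{1m}(t - 5s)\, e^{-5/60} + n(t)\,\bigl(1 - e^{-5/60}\bigr)$$

so a sudden burst of processes carries only about $1 - e^{-1} \approx 63\%$ of its weight after a full minute – potentially long after it has exhausted RAM.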

@jonringer (Contributor)

Yea, I looked into what load actually measures, and I don't think it's as useful for burst workloads :(.

@vcunat (Member, Author) commented Sep 27, 2022

But by reducing `--cores` you slow down big builds, while many other builds are single-core. It's hard to find a good fixed split, unless you can be as patient as Hydra (`--cores 2` for almost all jobs).

@wegank added labels 2.status: stale and 2.status: merge conflict on Mar 19, 2024
@stale bot removed the 2.status: stale label on Mar 20, 2024
@wegank added the 2.status: stale label on Jul 4, 2024
@emilazy (Member) commented Jul 21, 2024

Linking #328677 for those subscribed to this.

@stale bot removed the 2.status: stale label on Jul 21, 2024
@wegank added the 2.status: stale label on Jan 1, 2025
Labels: 2.status: merge conflict · 2.status: stale · 6.topic: python · 6.topic: stdenv · 6.topic: TeX · 9.needs: community feedback · 10.rebuild-darwin: 501+ · 10.rebuild-darwin: 5001+ · 10.rebuild-darwin-stdenv · 10.rebuild-linux: 501+ · 10.rebuild-linux: 5001+ · 10.rebuild-linux-stdenv