
Switch from conda-build to rattler-build #374

Open
wants to merge 35 commits into base: branch-0.43

Conversation


@gforsyth gforsyth commented Feb 13, 2025

Porting the conda-build recipes over to rattler-build.

Contributes to rapidsai/build-planning#47

@gforsyth gforsyth added the improvement Improves an existing functionality label Feb 13, 2025
@gforsyth gforsyth requested a review from a team as a code owner February 13, 2025 19:39
@gforsyth gforsyth added the non-breaking Introduces a non-breaking change label Feb 13, 2025
version: ${{ version }}
build:
string: cuda${{ cuda_major }}_${{ date_string }}_${{ head_rev }}
script: install_libucxx.sh
Contributor Author

There are comments in this build-script that I'm not sure how to preserve in the script directive here, so I've opted to leave that standalone script in place.

host:
- cuda-version =${{ cuda_version }}
- if: cuda_major == "11"
then: cuda-cudart-dev =${{ cuda_version }}
Contributor Author

@gforsyth gforsyth Feb 13, 2025

This fixed CUDA 11 builds for me locally. See below for the failure without this pin, where the ucxx import tests look for libcudart.so.12.
I'm not sure whether cuda-cudart-dev is constrained (or unconstrained) in some odd way version-wise, or whether rattler-build is doing something unexpected with version solving.

(from https://github.com/rapidsai/ucxx/actions/runs/13318054370/job/37196857183)

 │ │ Installing test environment
 │ │  Successfully updated the test environment
 │ │ + python $SRC_DIR/conda_build_script.py
 │ │ Traceback (most recent call last):
 │ │   File "$SRC_DIR/conda_build_script.py", line 1, in <module>
 │ │     import ucxx
 │ │   File "$PREFIX/lib/python3.10/site-packages/ucxx/__init__.py", line 30, in <module>
 │ │     from . import exceptions, types, testing  # noqa
 │ │   File "$PREFIX/lib/python3.10/site-packages/ucxx/exceptions.py", line 4, in <module>
 │ │     from ._lib.libucxx import (  # noqa
 │ │   File "$PREFIX/lib/python3.10/site-packages/ucxx/_lib/__init__.py", line 4, in <module>
 │ │     from .libucxx import _create_exceptions
 │ │ ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
 │ │ × error Script failed with status 1
 │ │ × error Work directory: '/tmp/.tmpn812nb'
 │ │ × error To debug the build, run it manually in the work directory (execute the `./conda_build.sh` or `conda_build.bat` script)

Contributor

tl;dr CUDA 11 packages aren't great.

I'm not certain exactly what's happening, but I think you're winding up with unconstrained nvidia channel packages. This shows up in CI even on the latest (passing) CUDA 11 builds, like this one. If you look at the cuda-cudart-dev packages on conda-forge, there aren't any for CUDA < 12. Presumably we're being saved at runtime by something else bringing in cudatoolkit, most likely librmm. Note that in the old recipe the condition for cuda-cudart-dev was {% if cuda_major != "11" %}. For CUDA 11 we still rely on some CUDA libraries from the host system (not installed via conda).

Let's flip this constraint to != 11 and try again.

Separately, we should double-check if we even need to include the nvidia channel any more. I've lost track of the current state of things and would need to take 20 mins to refresh myself but I believe we may not need it any more.
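The flip suggested above could be sketched like this in the recipe (assuming the same selector variables used elsewhere in this recipe):

```yaml
host:
  - cuda-version =${{ cuda_version }}
  # conda-forge has no cuda-cudart-dev packages for CUDA < 12, so only
  # add it on CUDA 12+; CUDA 11 gets libcudart from the host system
  - if: cuda_major != "11"
    then: cuda-cudart-dev =${{ cuda_version }}
```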

Contributor

(^everything Vyas said, and...)

CUDA 12 builds should not need the nvidia channel for anything. The only CUDA 11 packages we need from the nvidia channel are math libraries, so that we can get the C++ headers that don't ship with conda-forge::cudatoolkit for CUDA 11. I think that only affects RAFT/cuML/cuGraph/cuVS.

We are currently only including the nvidia channel for consistency with CUDA 11, and will drop it from our build commands once CUDA 11 is no longer supported. Perhaps we should even consider dropping nvidia now, for the packages that don't need its math library headers.

Contributor Author

What would we prefer?

  • drop -c nvidia from the rapids-rattler-channel-string and then add it piecemeal to raft, cuml, cugraph, and cuvs?
  • wait until we drop CUDA 11?

Contributor

@bdice bdice Feb 21, 2025

I think rapids-rattler-channel-string should check the value of RAPIDS_CUDA_VERSION and only add --channel nvidia on CUDA 11. We can implement that as part of the rattler-build migration, no need to wait until we drop CUDA 11.

Or, we can just drop --channel nvidia from rapids-rattler-channel-string entirely and add it to the repos where it is needed. I am okay with either approach.
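A minimal sketch of the first option. The real rapids-rattler-channel-string lives in our shared tooling and may differ; the function name, channel list, and env-var handling here are assumptions for illustration only:

```shell
# Hypothetical sketch: only add --channel nvidia when building for CUDA 11.
# The base channel list here is an assumption, not the real tool's output.
rapids_rattler_channel_string() {
  local channels="--channel rapidsai --channel conda-forge"
  # RAPIDS_CUDA_VERSION looks like "11.8.0" or "12.8.0"
  if [[ "${RAPIDS_CUDA_VERSION%%.*}" == "11" ]]; then
    channels+=" --channel nvidia"
  fi
  echo "${channels}"
}

RAPIDS_CUDA_VERSION=11.8.0 rapids_rattler_channel_string
RAPIDS_CUDA_VERSION=12.8.0 rapids_rattler_channel_string
```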

Contributor

The version of this package should be constrained by cuda-version. We should never pin CTK package versions in our recipes (for CUDA 12 only -- CUDA 11 requires exact pinnings like these). CUDA 12 packages should always float to whatever is allowed by cuda-version.

Suggested change
then: cuda-cudart-dev =${{ cuda_version }}
then: cuda-cudart-dev

Comment on lines 128 to 131
ignore_run_exports:
by_name:
- cuda-version
Contributor Author

Without this explicit cuda-version ignore, the libucxx-examples output was getting a runtime dependency of cuda-version >=12.8,<13, inherited from the compiler('cuda'):

 │ │ Run dependencies:
 │ │ ╭──────────────────┬────────────────────────────────────────────────────────╮
 │ │ │ Name             ┆ Spec                                                   │
 │ │ ╞══════════════════╪════════════════════════════════════════════════════════╡
 │ │ │ Run dependencies ┆                                                        │
 │ │ │ libucxx          ┆ ==0.43.0a37 cuda12_250218_2a6940ca (PS)                │
 │ │ │ libgcc           ┆ >=13 (RE of [cache-build: gcc_linux-64])               │
 │ │ │ __glibc          ┆ >=2.28,<3.0.a0 (RE of [cache-build: sysroot_linux-64]) │
 │ │ │ cuda-version     ┆ >=12.8,<13 (RE of [cache-build: cuda-nvcc_linux-64])   │    <---- here
 │ │ │ libstdcxx        ┆ >=13 (RE of [cache-build: gxx_linux-64])               │
 │ │ │ libgcc           ┆ >=13 (RE of [cache-build: gxx_linux-64])               │
 │ │ │ ucx              ┆ >=1.15.0,<1.16.0a0 (RE of [cache-host: ucx])           │
 │ │ │ python_abi       ┆ 3.12.* *_cp312 (RE of [cache-host: python])            │
 │ │ │ librmm           ┆ >=25.4.0a26,<25.5.0a0 (RE of [cache-host: librmm])     │
 │ │ ╰──────────────────┴────────────────────────────────────────────────────────╯

This meant that in the test_cpp job, which runs on CUDA 12.0, the just-built examples package was invalid, leading to the failures.

This looks to me like outputs in rattler-build are inheriting runtime dependencies from the build cache, which is usually fine but in this case is not what we want.

Contributor

We had a recent conversation about ARM / Tegra CUDA support in conda-forge. We may be adding new run-exports to the CUDA compiler packages on ARM.

Currently we use ignore_run_exports_from on the CUDA compiler package. Instead, we should use ignore_run_exports: with by_name: on just cuda-version. So what you're saying here is right, and we should do this universally for all packages that support CUDA compatibility within the major version (which applies to all of RAPIDS).
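A sketch of the two forms side by side; the compiler() selector line is illustrative, not copied from a real recipe:

```yaml
# conda-build (meta.yaml) style: drops every run-export coming from
# the CUDA compiler package
build:
  ignore_run_exports_from:
    - {{ compiler("cuda") }}

# rattler-build style: ignore only cuda-version by name, so any other
# run-exports the CUDA compiler adds (e.g. for ARM/Tegra) still apply
requirements:
  ignore_run_exports:
    by_name:
      - cuda-version
```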

@gforsyth gforsyth force-pushed the rattler-build branch 2 times, most recently from 8e43602 to fe45008 Compare February 19, 2025 18:50

sccache --show-adv-stats

# remove build_cache directory
rm -rf "$RAPIDS_CONDA_BLD_OUTPUT_DIR"/build_cache
Contributor

This seems like something that should be an ignore pattern in our upload script instead of something that every build script has to include.
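One way to sketch that ignore pattern on the upload side. This is hypothetical, not the actual upload script; the directory layout and file names below are made up for the demonstration:

```shell
# Hypothetical sketch of an upload-side ignore pattern; the real upload
# script and output layout may differ.
demo_dir=$(mktemp -d)
mkdir -p "${demo_dir}/linux-64" "${demo_dir}/build_cache"
touch "${demo_dir}/linux-64/libucxx-0.43.0-cuda12.conda"
touch "${demo_dir}/build_cache/stale-artifact.conda"

# Select packages to upload, skipping rattler-build's build_cache
to_upload=$(find "${demo_dir}" -name '*.conda' -not -path '*/build_cache/*')
echo "${to_upload}"
```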

Contributor Author

Agreed. We opened an issue upstream, too, to add a CLI option to clear the build-cache post-build: prefix-dev/rattler-build#1424


name: distributed-ucxx
version: ${{ version }}
build:
string: py${{ python }}_${{ date_string }}_${{ head_rev }}
Contributor

In the general case (i.e. for the rest of RAPIDS) I'd consider rapidsai/build-planning#43 out of scope for the rattler-build work, since it requires reworking our shared workflows. For ucxx specifically, though, we should take at least a quick look at solving it here. The reason is that distributed-ucxx is (AFAIK) a pure Python package built within the same recipe as a non-pure-Python package, whereas everywhere else those packages are separate conda builds. It should not be constrained by the cpython ABI below, for example. If the conclusion winds up being that we need to split this package out to properly support building a pure Python wheel, that is fine too, but I think it's worth thinking about it at least briefly right now.

Contributor Author

Building it as noarch now, and that seems to be working.
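For reference, a sketch of what the noarch output might look like; the exact build string format is an assumption:

```yaml
- package:
    name: distributed-ucxx
    version: ${{ version }}
  build:
    # pure Python: one package for all Python versions and platforms,
    # with no python_abi / cpython ABI constraint
    noarch: python
    string: py_${{ date_string }}_${{ head_rev }}
```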

@gforsyth gforsyth removed the request for review from KyleFromNVIDIA February 20, 2025 18:40
- cuda-version
tests:
- package_contents:
files:
Contributor

Can we alphabetically sort this whole list of files?

(Also we used to have lists of files to verify for cudf but we dropped it there because it was annoying to maintain. But I am okay with the status quo here.)

Comment on lines +134 to +140
ignore_run_exports:
from_package:
- cuda-cudart-dev
- python
- ucx
by_name:
- cuda-version
Contributor

None of these packages are in the build/host sections, so these run-exports shouldn't exist in the first place. Can we delete this block?

Suggested change
ignore_run_exports:
from_package:
- cuda-cudart-dev
- python
- ucx
by_name:
- cuda-version

Comment on lines +164 to +170
ignore_run_exports:
from_package:
- cuda-cudart-dev
- python
- ucx
by_name:
- cuda-version
Contributor

@bdice bdice Feb 21, 2025

Same as above. Nothing in build/host for this package has run-exports that we need to ignore. We are manually defining the run dependency on cuda-cudart.

Suggested change
ignore_run_exports:
from_package:
- cuda-cudart-dev
- python
- ucx
by_name:
- cuda-version
