Switch from conda-build to rattler-build #374
base: branch-0.43
Conversation
version: ${{ version }}
build:
  string: cuda${{ cuda_major }}_${{ date_string }}_${{ head_rev }}
  script: install_libucxx.sh
There are comments in this build script that I'm not sure how to preserve in the `script:` directive here, so I've opted to leave that standalone script in place.
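(For reference, rattler-build's `script:` directive also accepts an object form with an inline `content:` body, where shell comments survive verbatim since the body is just a string. A minimal sketch, with a placeholder command rather than the real install_libucxx.sh contents:)

```yaml
build:
  script:
    content: |
      # comments like this one are preserved in the inline form
      # (placeholder command, not the actual install logic)
      cmake --install cpp/build
```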
host:
  - cuda-version =${{ cuda_version }}
  - if: cuda_major == "11"
    then: cuda-cudart-dev =${{ cuda_version }}
This fixed CUDA 11 builds for me locally. See below for the failure without this pin, where the `ucxx` import tests are looking for `libcudart.so.12`.
Not sure if `cuda-cudart-dev` is constrained (or unconstrained) in some weird way version-wise, or if `rattler-build` is doing something unexpected with version solving.
(from https://github.com/rapidsai/ucxx/actions/runs/13318054370/job/37196857183)
│ │ Installing test environment
│ │ Successfully updated the test environment
│ │ + python $SRC_DIR/conda_build_script.py
│ │ Traceback (most recent call last):
│ │ File "$SRC_DIR/conda_build_script.py", line 1, in <module>
│ │ import ucxx
│ │ File "$PREFIX/lib/python3.10/site-packages/ucxx/__init__.py", line 30, in <module>
│ │ from . import exceptions, types, testing # noqa
│ │ File "$PREFIX/lib/python3.10/site-packages/ucxx/exceptions.py", line 4, in <module>
│ │ from ._lib.libucxx import ( # noqa
│ │ File "$PREFIX/lib/python3.10/site-packages/ucxx/_lib/__init__.py", line 4, in <module>
│ │ from .libucxx import _create_exceptions
│ │ ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
│ │ × error Script failed with status 1
│ │ × error Work directory: '/tmp/.tmpn812nb'
│ │ × error To debug the build, run it manually in the work directory (execute the `./conda_build.sh` or `conda_build.bat` script)
tl;dr CUDA 11 packages aren't great.
I'm not sure exactly what is happening here, but I think you are winding up with unconstrained nvidia-channel packages. This shows up in CI even on the latest (passing) CUDA 11 builds like this one. If you look at the `cuda-cudart-dev` packages on conda-forge, there aren't any for CUDA<12. Presumably we're being saved at runtime by something else bringing in cudatoolkit, most likely librmm. You'll note that in the old recipe the condition for `cuda-cudart-dev` was `{% if cuda_major != "11" %}`. For CUDA 11 we are still reliant on including some CUDA libraries from the host system (not installed via conda).
Let's flip this constraint to `!= "11"` and try again.
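(Concretely, the flipped guard would mirror the old Jinja condition; something like:)

```yaml
host:
  - cuda-version =${{ cuda_version }}
  - if: cuda_major != "11"
    then: cuda-cudart-dev =${{ cuda_version }}
```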
Separately, we should double-check whether we even need to include the nvidia channel any more. I've lost track of the current state of things and would need to take 20 minutes to refresh myself, but I believe we may not need it any more.
(^ everything Vyas said, and...)
CUDA 12 builds should not need the `nvidia` channel for anything. The only CUDA 11 packages we need from the `nvidia` channel are math libraries, so that we can get the C++ headers that don't ship with `conda-forge::cudatoolkit` for CUDA 11. I think that only affects RAFT/cuML/cuGraph/cuVS.
We are currently only including the `nvidia` channel for consistency with CUDA 11, and will drop it from our build commands once CUDA 11 is no longer supported. Perhaps we should even consider dropping `nvidia` now, for the packages that don't need its math library headers.
What would we prefer?
- drop `-c nvidia` from the `rapids-rattler-channel-string` and then add it piecemeal to raft, cuml, cugraph, and cuvs?
- wait until we drop CUDA 11?
I think `rapids-rattler-channel-string` should check the value of `RAPIDS_CUDA_VERSION` and only add `--channel nvidia` on CUDA 11. We can implement that as part of the rattler-build migration; no need to wait until we drop CUDA 11.
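(A rough sketch of that check, assuming the helper is a shell function and that the base channel list shown here matches what it currently emits; both are assumptions:)

```shell
rapids-rattler-channel-string() {
  # assumed base channels; the real helper may emit something different
  local channels="--channel rapidsai-nightly --channel conda-forge"
  # RAPIDS_CUDA_VERSION looks like "11.8.0" or "12.8.0"; compare the major
  if [[ "${RAPIDS_CUDA_VERSION%%.*}" == "11" ]]; then
    channels="--channel nvidia ${channels}"
  fi
  echo "${channels}"
}
```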
Or, we can just drop `--channel nvidia` from `rapids-rattler-channel-string` entirely and add it to the repos where it is needed. I am okay with either approach.
The version of this package should be constrained by `cuda-version`. We should never pin CTK package versions in our recipes (for CUDA 12 only -- CUDA 11 requires exact pinnings like these). CUDA 12 packages should always float to whatever is allowed by `cuda-version`.
Suggested change:
-     then: cuda-cudart-dev =${{ cuda_version }}
+     then: cuda-cudart-dev
conda/recipes/ucxx/recipe.yaml (outdated)
ignore_run_exports:
  by_name:
    - cuda-version
Without this explicit `cuda-version` ignore, the `libucxx-examples` output was getting a runtime dependency of `cuda-version >=12.8,<13`, inherited from the `compiler('cuda')`:
│ │ Run dependencies:
│ │ ╭──────────────────┬────────────────────────────────────────────────────────╮
│ │ │ Name ┆ Spec │
│ │ ╞══════════════════╪════════════════════════════════════════════════════════╡
│ │ │ Run dependencies ┆ │
│ │ │ libucxx ┆ ==0.43.0a37 cuda12_250218_2a6940ca (PS) │
│ │ │ libgcc ┆ >=13 (RE of [cache-build: gcc_linux-64]) │
│ │ │ __glibc ┆ >=2.28,<3.0.a0 (RE of [cache-build: sysroot_linux-64]) │
│ │ │ cuda-version ┆ >=12.8,<13 (RE of [cache-build: cuda-nvcc_linux-64]) │ <---- here
│ │ │ libstdcxx ┆ >=13 (RE of [cache-build: gxx_linux-64]) │
│ │ │ libgcc ┆ >=13 (RE of [cache-build: gxx_linux-64]) │
│ │ │ ucx ┆ >=1.15.0,<1.16.0a0 (RE of [cache-host: ucx]) │
│ │ │ python_abi ┆ 3.12.* *_cp312 (RE of [cache-host: python]) │
│ │ │ librmm ┆ >=25.4.0a26,<25.5.0a0 (RE of [cache-host: librmm]) │
│ │ ╰──────────────────┴────────────────────────────────────────────────────────╯
This meant that in the `test_cpp` job running on CUDA 12.0, the just-built package for `examples` was invalid, leading to the failures.
This looks to me like outputs in `rattler-build` are inheriting runtime dependencies from the build cache, which is usually fine but in this case is not what we want.
We had a recent conversation about ARM / Tegra CUDA support in conda-forge. We may be adding new run-exports to the CUDA compiler packages on ARM.
Currently we use `ignore_run_exports_from` with the CUDA compiler package. Instead, we should use `ignore_run_exports:` on just `cuda-version`. So what you're saying here is right, and we should do this universally for all packages that support CUDA compatibility within the major version (that applies to all of RAPIDS).
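(A minimal sketch of the difference between the two forms, with the compiler package as the only build requirement shown; the commented-out package name is illustrative:)

```yaml
requirements:
  build:
    - ${{ compiler("cuda") }}
  ignore_run_exports:
    # narrow: drop only the cuda-version run-export, keep the rest
    by_name:
      - cuda-version
    # broad (current approach): drop every run-export from the compiler package
    # from_package:
    #   - cuda-nvcc  # name illustrative
```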
This reverts commit dc5fb01.
sccache --show-adv-stats

# remove build_cache directory
rm -rf "$RAPIDS_CONDA_BLD_OUTPUT_DIR"/build_cache
This seems like something that should be an ignore pattern in our upload script instead of something that every build script has to include.
Agreed. We also opened an issue upstream to add a CLI option to clear the build cache post-build: prefix-dev/rattler-build#1424
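(For illustration, an upload-side ignore could be as simple as pruning that directory when collecting artifacts; the variable name comes from the build script above, everything else is a sketch:)

```shell
# list built packages, skipping rattler-build's build_cache directory
find "${RAPIDS_CONDA_BLD_OUTPUT_DIR}" \
  -path '*/build_cache' -prune -o \
  -name '*.conda' -print
```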
conda/recipes/ucxx/recipe.yaml (outdated)
name: distributed-ucxx
version: ${{ version }}
build:
  string: py${{ python }}_${{ date_string }}_${{ head_rev }}
In the general case (i.e. for the rest of RAPIDS) I'd consider rapidsai/build-planning#43 out of scope for the rattler-build work, since it requires reworking our shared workflows. For ucxx specifically, though, we should take at least a quick look at solving it here. The reason I suggest that is that for ucxx, distributed-ucxx is (AFAIK) a pure Python package built within the same recipe as a non-pure-Python package, whereas everywhere else those packages are separate conda builds. It should not be constrained by the CPython ABI below, for example. If the conclusion winds up being that we need to split this package out to properly support building a pure Python wheel, that is fine too, but I think it's worth thinking about it at least briefly right now.
Building it as `noarch` now, and that seems to be working.
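(Roughly the shape that takes in the recipe; the run requirements here are placeholders, not the real dependency list:)

```yaml
- package:
    name: distributed-ucxx
    version: ${{ version }}
  build:
    noarch: python
    # no python ABI tag in the build string for a noarch output
    string: py_${{ date_string }}_${{ head_rev }}
  requirements:
    run:
      # placeholders; the actual deps live in the recipe
      - python
      - ucxx =${{ version }}
```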
  - cuda-version
tests:
  - package_contents:
      files:
Can we alphabetically sort this whole list of files?
(Also we used to have lists of files to verify for cudf but we dropped it there because it was annoying to maintain. But I am okay with the status quo here.)
ignore_run_exports:
  from_package:
    - cuda-cudart-dev
    - python
    - ucx
  by_name:
    - cuda-version
None of these packages are in the build/host sections, so these run-exports shouldn't exist in the first place. Can we delete this block?
Suggested change (delete this block):
- ignore_run_exports:
-   from_package:
-     - cuda-cudart-dev
-     - python
-     - ucx
-   by_name:
-     - cuda-version
ignore_run_exports:
  from_package:
    - cuda-cudart-dev
    - python
    - ucx
  by_name:
    - cuda-version
Same as above. Nothing in build/host for this package has run-exports that we need to ignore. We are manually defining the run dependency on `cuda-cudart`.
Suggested change (delete this block):
- ignore_run_exports:
-   from_package:
-     - cuda-cudart-dev
-     - python
-     - ucx
-   by_name:
-     - cuda-version
Porting over the conda builds to rattler-build.
Contributes to rapidsai/build-planning#47