Provide docs for cuml.accel command line feature #6322

Draft
wants to merge 12 commits into base: branch-25.04
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -24,6 +24,7 @@ Support for Windows is possible in the near future.
cuml_intro.rst
api.rst
user_guide.rst
zero-code-change.rst
cuml_blogs.rst


2 changes: 2 additions & 0 deletions docs/source/zero-code-change-benchmarks.rst
@@ -0,0 +1,2 @@
Benchmarks
----------
190 changes: 190 additions & 0 deletions docs/source/zero-code-change-limitations.rst
@@ -0,0 +1,190 @@
Known Limitations
-----------------

General Limitations
~~~~~~~~~~~~~~~~~~~

TODO(wphicks): Fill this in
TODO(wphicks): Pickle

Algorithm-Specific Limitations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TODO(wphicks): Fill these in. Document when each will fall back to CPU, how to
assess equivalence with CPU implementations, and significant differences in
algorithm, as well as any other known issues.


``sklearn.cluster.KMeans``
^^^^^^^^^^^^^^^^^^^^^^^^^^

The default initialization algorithm used by ``cuml.accel`` is similar to, but not
identical to, the one used by scikit-learn: ``cuml.accel`` uses the
``"scalable-k-means++"`` algorithm. For more details, refer to :class:`cuml.KMeans`.

This means that the ``cluster_centers_`` attribute will not be exactly the same as for
the scikit-learn implementation. The ID of each cluster (``labels_`` attribute) might
also change: samples labelled as cluster zero by scikit-learn might be labelled as
cluster one by ``cuml.accel``. The ``inertia_`` attribute might differ as well if
different cluster centers are used, and the algorithm might converge in a different
number of iterations, so the ``n_iter_`` attribute might differ too.

To check that the resulting trained estimator is equivalent to the scikit-learn
estimator, evaluate the similarity of the clustering results on samples not used
to train the estimator, for example as in the sketch below. Both ``adjusted_rand_score``
and ``adjusted_mutual_info_score`` give a single score, which should be above ``0.9``.
For low-dimensional data you can also visually inspect the resulting cluster assignments.
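
A minimal sketch of such a check follows. It is illustrative only: in practice
``labels_cpu`` would come from a run without ``cuml.accel`` and ``labels_gpu`` from the
same script run under ``python -m cuml.accel``; here two plain scikit-learn fits stand
in for them.

.. code-block:: python

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score
    from sklearn.model_selection import train_test_split

    X, _ = make_blobs(n_samples=10_000, centers=5, n_features=8, random_state=0)
    X_train, X_test = train_test_split(X, random_state=0)

    # Fit two estimators and predict on held-out samples. In practice one set of
    # labels would come from scikit-learn and the other from cuml.accel.
    labels_cpu = KMeans(n_clusters=5, random_state=0).fit(X_train).predict(X_test)
    labels_gpu = KMeans(n_clusters=5, random_state=1).fit(X_train).predict(X_test)

    # Both scores are invariant to cluster relabelling; values above 0.9 indicate
    # that the two clusterings agree.
    print(adjusted_rand_score(labels_cpu, labels_gpu))
    print(adjusted_mutual_info_score(labels_cpu, labels_gpu))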

``cuml.accel`` will not fall back to scikit-learn.


``sklearn.cluster.DBSCAN``
^^^^^^^^^^^^^^^^^^^^^^^^^^

``sklearn.decomposition.PCA``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``PCA`` implementation used by ``cuml.accel`` uses different SVD solvers
than the ones in Scikit-Learn, which may result in numeric differences in the
``components_`` and ``explained_variance_`` values. These differences should be
small for ``svd_solver`` values of ``"auto"``, ``"full"``, or ``"arpack"``, but
may be larger for randomized or less-numerically-stable solvers like
``"randomized"`` or ``"covariance_eigh"``.

Likewise, note that the implementation in ``cuml.accel`` currently may result
in some of the vectors in ``components_`` having inverted signs. This result is
not incorrect, but can make it harder to do direct numeric comparisons without
first normalizing the signs. One common way of handling this is by normalizing
the first non-zero values in each vector to be positive. You might find the
following ``numpy`` function useful for this.

.. code-block:: python

import numpy as np

def normalize(components):
"""Normalize the sign of components for easier numeric comparison"""
nonzero = components != 0
inds = np.where(nonzero.any(axis=1), nonzero.argmax(axis=1), 0)[:, None]
first_nonzero = np.take_along_axis(components, inds, 1)
return np.sign(first_nonzero) * components
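
As a usage sketch (reusing the ``normalize`` helper above; here two CPU solvers stand in
for results computed with and without ``cuml.accel``):

.. code-block:: python

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X = load_digits().data

    # Two solvers that may return components with flipped signs relative to each other.
    components_a = PCA(n_components=5, svd_solver="full").fit(X).components_
    components_b = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X).components_

    # After sign normalization the components can be compared directly.
    diff = np.abs(normalize(components_a) - normalize(components_b)).max()
    print(f"max absolute difference after sign normalization: {diff:.2e}")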

For more algorithmic details, see :class:`cuml.PCA`.

* Algorithm Limitations:

  * ``n_components="mle"`` will fall back to Scikit-Learn.
  * Parameters specific to the ``"randomized"`` solver, such as ``random_state``,
    ``n_oversamples``, and ``power_iteration_normalizer``, are ignored.

``sklearn.decomposition.TruncatedSVD``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``TruncatedSVD`` implementation used by ``cuml.accel`` uses different SVD
solvers than the ones in Scikit-Learn, which may result in numeric differences
in the ``components_`` and ``explained_variance_`` values. These differences
should be small for ``algorithm="arpack"``, but may be larger for
``algorithm="randomized"``.

Likewise, note that the implementation in ``cuml.accel`` currently may result
in some of the vectors in ``components_`` having inverted signs. This result is
not incorrect, but can make it harder to do direct numeric comparisons without
first normalizing the signs. One common way of handling this is by normalizing
the first non-zero values in each vector to be positive. You might find the
following ``numpy`` function useful for this.

.. code-block:: python

import numpy as np

def normalize(components):
"""Normalize the sign of components for easier numeric comparison"""
nonzero = components != 0
inds = np.where(nonzero.any(axis=1), nonzero.argmax(axis=1), 0)[:, None]
first_nonzero = np.take_along_axis(components, inds, 1)
return np.sign(first_nonzero) * components

For more algorithmic details, see :class:`cuml.TruncatedSVD`.

* Algorithm Limitations:

  * Parameters specific to the ``"randomized"`` solver, such as ``random_state``,
    ``n_oversamples``, and ``power_iteration_normalizer``, are ignored.

``sklearn.kernel_ridge.KernelRidge``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``sklearn.linear_model.LinearRegression``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``sklearn.linear_model.LogisticRegression``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``sklearn.linear_model.ElasticNet``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``sklearn.linear_model.Ridge``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``sklearn.linear_model.Lasso``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``sklearn.manifold.TSNE``
^^^^^^^^^^^^^^^^^^^^^^^^^

``sklearn.neighbors.NearestNeighbors``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Algorithm Limitations:

  * The "kd_tree" and "ball_tree" algorithms are not implemented in CUDA. When specified,
    the implementation will automatically fall back to the "brute" force algorithm.

* Distance Metrics:

  * Only Minkowski-family metrics (euclidean, manhattan, minkowski) and cosine similarity
    are GPU-accelerated.
  * Not all metrics are supported by every algorithm.
  * The "mahalanobis" metric is not supported on GPU and will trigger a fallback to the
    CPU implementation.
  * The "nan_euclidean" metric for handling missing values is not supported on GPU.
  * Custom metric functions (callable metrics) are not supported on GPU.

* Other Limitations:

  * Only the "uniform" weighting strategy is supported; other weighting schemes will
    cause a fallback to CPU.
  * The "radius" parameter for radius-based neighbor searches is not implemented and
    will be ignored.

``sklearn.neighbors.KNeighborsClassifier``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Algorithm Limitations:

  * The "kd_tree" and "ball_tree" algorithms are not implemented in CUDA. When specified,
    the implementation will automatically fall back to the "brute" force algorithm.

* Distance Metrics:

  * Only Minkowski-family metrics (euclidean, manhattan, minkowski) and cosine similarity
    are GPU-accelerated.
  * Not all metrics are supported by every algorithm.
  * The "mahalanobis" metric is not supported on GPU and will trigger a fallback to the
    CPU implementation.
  * The "nan_euclidean" metric for handling missing values is not supported on GPU.
  * Custom metric functions (callable metrics) are not supported on GPU.

* Other Limitations:

  * Only the "uniform" weighting strategy is supported for vote counting.
  * Distance-based weights (the "distance" option) will trigger CPU fallback.
  * Custom weight functions are not supported on GPU.

``sklearn.neighbors.KNeighborsRegressor``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Algorithm Limitations:

  * The "kd_tree" and "ball_tree" algorithms are not implemented in CUDA. When specified,
    the implementation will automatically fall back to the "brute" force algorithm.

* Distance Metrics:

  * Only Minkowski-family metrics (euclidean, manhattan, minkowski) and cosine similarity
    are GPU-accelerated.
  * Not all metrics are supported by every algorithm.
  * The "mahalanobis" metric is not supported on GPU and will trigger a fallback to the
    CPU implementation.
  * The "nan_euclidean" metric for handling missing values is not supported on GPU.
  * Custom metric functions (callable metrics) are not supported on GPU.

* Regression-Specific Limitations:

  * Only the "uniform" weighting strategy is supported for prediction averaging.
  * Distance-based prediction weights (the "distance" option) will trigger CPU fallback.
  * Custom weight functions are not supported on GPU.

``umap.UMAP``
^^^^^^^^^^^^^

``hdbscan.HDBSCAN``
^^^^^^^^^^^^^^^^^^^
170 changes: 170 additions & 0 deletions docs/source/zero-code-change.rst
@@ -0,0 +1,170 @@
cuml.accel: Zero Code Change Acceleration for Scikit-Learn
==========================================================

Starting in RAPIDS 25.02, cuML offers a new way to accelerate existing code
based on Scikit-Learn, UMAP-Learn, and HDBScan. Instead of rewriting that code
to import equivalent cuML functionality, simply invoke your existing,
unaltered Python script as follows, and cuML will accelerate as much of the
code as possible with NVIDIA GPUs, falling back to CPU where necessary:

.. code-block::

python -m cuml.accel unchanged_script.py

The same functionality is available in Jupyter notebooks using the
following magic at the beginning of the notebook (before other imports):

.. code-block::

%load_ext cuml.accel
import sklearn

**cuml.accel is currently a beta feature and will continue to improve over time.**

.. toctree::
:maxdepth: 2
:caption: Contents:

zero-code-change-limitations.rst
zero-code-change-benchmarks.rst


FAQs
----

1. Why use cuml.accel instead of using cuML directly?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Many software lifecycles involve running code on a variety of hardware. Maybe
the data scientists developing a pipeline do not have access to NVIDIA GPUs,
but you want the cost and time savings of running that pipeline on NVIDIA GPUs
in production. Rather than going through a manual migration to cuML every time
the pipeline is updated, ``cuml.accel`` allows you to immediately deploy
unaltered Scikit-Learn, UMAP-Learn, and HDBScan code on NVIDIA GPUs.
Furthermore, ``cuml.accel`` will automatically fall back to CPU execution for
anything which is implemented in Scikit-Learn but not yet accelerated by cuML.

Additionally, ``cuml.accel`` offers a quick way to evaluate the minimum
acceleration cuML can provide for your workload without touching a line of
code.

2. Why use cuML directly instead of cuml.accel?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In many cases, ``cuml.accel`` offers enough of a performance boost on its own
that there is no need to migrate code to cuML. However, cuML's API offers a
variety of additional parameters that let you fine-tune GPU execution in order
to get the maximum possible performance out of NVIDIA GPUs. So for software
that will always be run with NVIDIA GPUs available, it may be worthwhile to
write your code directly with cuML.

Additionally, running code directly with cuML offers finer control over GPU
memory usage. ``cuml.accel`` will automatically use `unified or managed memory <https://developer.nvidia.com/blog/unified-memory-cuda-beginners/>`_
for allocations in order to reduce the risk of CUDA OOM errors. In
contrast, cuML defaults to ordinary device memory, which can offer improved
performance but requires slightly more care to avoid exhausting the GPU VRAM.
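
When writing against cuML directly you can still opt into managed memory yourself via
the RMM library. A minimal sketch, assuming ``rmm`` and ``cuml`` are installed and an
NVIDIA GPU is available:

.. code-block:: python

    import rmm

    # Reconfigure the default allocator to use CUDA managed (unified) memory,
    # similar to what cuml.accel does automatically.
    rmm.reinitialize(managed_memory=True)

    # Estimators used after this point allocate through the configured resource.
    from cuml.cluster import KMeans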

3. What does ``cuml.accel`` accelerate?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``cuml.accel`` is designed to provide zero code change acceleration of any
Scikit-Learn-like estimator which has an equivalent cuML implementation,
including estimators from Scikit-Learn, UMAP-Learn, and HDBScan. Currently,
the following estimators are mostly or entirely accelerated when run under
``cuml.accel``:

* Scikit-Learn
* ``sklearn.cluster.KMeans``
* ``sklearn.cluster.DBSCAN``
* ``sklearn.decomposition.PCA``
* ``sklearn.decomposition.TruncatedSVD``
* ``sklearn.kernel_ridge.KernelRidge``
* ``sklearn.linear_model.LinearRegression``
* ``sklearn.linear_model.LogisticRegression``
* ``sklearn.linear_model.ElasticNet``
* ``sklearn.linear_model.Ridge``
* ``sklearn.linear_model.Lasso``
* ``sklearn.manifold.TSNE``
* ``sklearn.neighbors.NearestNeighbors``
* ``sklearn.neighbors.KNeighborsClassifier``
* ``sklearn.neighbors.KNeighborsRegressor``
* UMAP-Learn
* ``umap.UMAP``
* HDBScan
* ``hdbscan.HDBSCAN``

This list will continue to expand as ``cuml.accel`` development continues.
Please see `Zero Code Change Limitations <zero-code-change-limitations.rst>`_
for known limitations.
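
As an example, a script like the following uses only estimators from the list above and
needs no changes to benefit from ``cuml.accel`` (an illustrative sketch; the file name is
arbitrary):

.. code-block:: python

    # example_script.py -- run unchanged as: python -m cuml.accel example_script.py
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    X = np.random.default_rng(0).random((100_000, 50), dtype=np.float32)

    # Both PCA and KMeans are dispatched to their cuML implementations on GPU.
    X_reduced = PCA(n_components=10).fit_transform(X)
    labels = KMeans(n_clusters=8, random_state=0).fit_predict(X_reduced)
    print(np.bincount(labels))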

4. Will I get the same results as I do without ``cuml.accel``?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``cuml.accel`` is designed to provide *equivalent* results to the estimators
it accelerates, but the output may have small numerical differences. To be more
specific, measures of the quality of the results (accuracy,
trustworthiness, etc.) should be approximately as good or better than those
obtained without ``cuml.accel``, even if the exact output varies.

A baseline limitation for obtaining exact numerical equality is that in
highly parallel execution environments (e.g. GPUs), there is no guarantee that
floating point operations will happen in exactly the same order as in
non-parallel environments. This means that floating point arithmetic error
may propagate differently and lead to different outcomes. This can be
exacerbated by discretization operations in which values end up in
different categories based on floating point values.
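
As a small illustration of this effect, summing the same ``float32`` values in a
different grouping (as a parallel reduction on a GPU might) can already change the
result slightly:

.. code-block:: python

    import numpy as np

    x = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)

    # Sum all values at once vs. in 1,000 chunks whose partial sums are then added.
    # The grouping changes how rounding error accumulates, so the float32 results
    # typically differ in the last few bits.
    total = float(x.sum())
    chunked = float(x.reshape(1_000, 1_000).sum(axis=1).sum())

    print(total, chunked, total == chunked)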

Second, some algorithms are implemented in a fundamentally different
way on GPU than on CPU in order to make efficient use of the GPU's highly
parallel compute capabilities. In such cases, ``cuml.accel`` will translate
hyperparameters appropriately to maintain equivalence with the CPU
implementation. Differences of this kind are noted in the corresponding entry
of `Zero Code Change Limitations <zero-code-change-limitations.rst>`_ for that
estimator.

If you discover a use case where the quality of results obtained with
``cuml.accel`` is worse than that obtained without, please `report it as a bug
<https://github.com/rapidsai/cuml/issues/new?template=bug_report.md>`_, and the
RAPIDS team will investigate.

5. How much faster is ``cuml.accel``?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This depends on the individual algorithm being accelerated and the dataset
being processed. As with cuML itself, you will generally see the most benefit
when ``cuml.accel`` is used on large datasets. Please see
`Zero Code Change Benchmarks <zero-code-change-benchmarks.rst>`_ for some representative benchmarks.

Please note that the first time an estimator method is called in a Python
process, there may be some overhead due to JIT compilation of CuPy kernels. To
get an accurate sense of performance, run the method once on a small subset of
data before measuring runtime on a full-scale dataset, as in the sketch below.
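
A minimal timing sketch of this warm-up pattern (illustrative only; any of the
accelerated estimators could be substituted):

.. code-block:: python

    import time

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).random((1_000_000, 20), dtype=np.float32)

    # Warm up: the first call may pay a one-time JIT compilation cost.
    KMeans(n_clusters=8, random_state=0).fit(X[:1_000])

    # Now measure on the full dataset.
    start = time.perf_counter()
    KMeans(n_clusters=8, random_state=0).fit(X)
    print(f"fit took {time.perf_counter() - start:.2f}s")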

6. Will I run out of GPU memory if I use ``cuml.accel``?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``cuml.accel`` will use CUDA `managed memory <https://developer.nvidia.com/blog/unified-memory-cuda-beginners/>`_ for allocations on NVIDIA GPUs. This means that host memory can be used to augment GPU memory, and data will be migrated automatically as necessary. This does not mean that ``cuml.accel`` is entirely impervious to OOM errors, however. Very large datasets can exhaust the entirety of both host and device memory. Additionally, if device memory is heavily oversubscribed, it can lead to slow execution. ``cuml.accel`` is designed to minimize both possibilities, but if you observe OOM errors or slow execution on data that should fit in combined host plus device memory for your system, please `report it <https://github.com/rapidsai/cuml/issues/new?template=bug_report.md>`_, and the RAPIDS team will investigate.

7. What is the relationship between ``cuml.accel`` and ``cudf.pandas``?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Both projects serve a similar role. Just as ``cuml.accel`` offers zero code
change acceleration for Scikit-Learn and similar packages, ``cudf.pandas``
offers zero code change acceleration for Pandas. They can be used together by
TODO(wphicks): FILL THIS IN ONCE THIS MECHANISM HAS BEEN IMPLEMENTED.

8. What happens if something in my script is not implemented in cuML?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``cuml.accel`` should cleanly and transparently fall back to the CPU
implementation for any methods or estimators which are not implemented in cuML.
If it does not do so, please `report it as a bug <https://github.com/rapidsai/cuml/issues/new?template=bug_report.md>`_, and the RAPIDS team will investigate.

9. I've discovered a bug in ``cuml.accel``. How do I report it?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Bugs affecting ``cuml.accel`` can be reported via the `cuML issue tracker <https://github.com/rapidsai/cuml/issues/new?template=bug_report.md>`_. If you observe a significant difference in the quality of output with and without ``cuml.accel``, please report it as a bug. These issues will be taken especially seriously. Similarly, if runtime slows down for your estimator when using ``cuml.accel``, the RAPIDS team will try to triage and fix the issue as soon as possible. Note that library import time *will* be longer when using ``cuml.accel``, so please exclude that from runtime. Long import time is a known issue and will be improved with subsequent releases of cuML.

10. If I serialize a model using ``cuml.accel``, can I load it without ``cuml.accel``?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is a common use case for ``cuml.accel``, since it may be useful to train
a model using NVIDIA GPUs but deploy it for inference in an environment that
does not have access to NVIDIA GPUs. Currently, models serialized with
``cuml.accel`` need to be converted to pure Scikit-Learn (or UMAP/HDBScan/...)
models using the following invocation:

TODO(wphicks): FILL THIS OUT

This conversion step should become unnecessary in a future release of cuML.