gjstein 4 days ago

The idea that this is a drop-in replacement for numpy (e.g., `import cupy as np`) is quite nice, though I've gotten similar benefit out of using `pytorch` for this purpose. It's a very popular and well-supported library with a syntax that's similar to numpy.

However, the AMD-GPU compatibility for CuPy is quite an attractive feature.

  • ogrisel 4 days ago

    Note that NumPy, CuPy and PyTorch are all involved in the definition of a shared subset of their API:

    https://data-apis.org/array-api/

    So it's possible to write array API code that consumes arrays from any of those libraries and delegates computation to them without having to explicitly import any of them in your source code.

    The only limitation for now is that PyTorch (and to a lesser extent CuPy) array API compliance is still incomplete, and in practice one needs to go through this compatibility layer (hopefully temporarily):

    https://data-apis.org/array-api-compat/
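
    For example, a function can pull the namespace out of whatever array it receives instead of importing a specific library (a minimal sketch of the pattern; the `standardize` function is my own illustration):

        import array_api_compat

        def standardize(x):
            # Works unchanged for NumPy, CuPy, or PyTorch inputs:
            xp = array_api_compat.array_namespace(x)
            return (x - xp.mean(x)) / xp.std(x)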

    • ethbr1 4 days ago

      It's interesting to see hardware/software/API co-development in practice again.

      The last time I think this happened at market scale was early 3D accelerator APIs? Glide/OpenGL/DirectX. Which has been a minute! (To a lesser extent, CPU vectorization extensions.)

      Curious how much of Nvidia's successful strategy was driven by people who were there during that period.

      Powerful first mover flywheel: build high performing hardware that allows you to define an API -> people write useful software that targets your API, because you have the highest performance -> GOTO 10 (because now more software is standardized on your API, so you can build even more performant hardware to optimize its operations)

    • kccqzy 4 days ago

      And of course the native Python solution is memoryview. If you need to inter-operate with libraries like numpy but you cannot import numpy, use memoryview. It is specifically for fast low-level access which is why it has more C documentation than Python documentation: https://docs.python.org/3/c-api/memoryview.html
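
      For instance (a small sketch of the zero-copy behavior):

          data = bytearray(b"hello world")
          view = memoryview(data)  # zero-copy view over the buffer
          sub = view[6:]           # slicing produces another view, still no copy
          sub[0:5] = b"earth"      # writes through to the underlying bytearray
          print(data)              # bytearray(b'hello earth')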

  • KeplerBoy 4 days ago

    One could also `import jax.numpy as jnp`. All those libraries have more or less complete implementations of NumPy and SciPy functionality (I believe CuPy has the most functions, especially when it comes to SciPy).

    Also: you can mix and match all those functions and tensors thanks to the __cuda_array_interface__.
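
    For example, something like this should work without a trip through host memory (a sketch; both libraries consume the interface):

        import cupy as cp
        import torch

        x = cp.arange(10, dtype=cp.float32)    # CuPy array on the GPU
        t = torch.as_tensor(x, device="cuda")  # zero-copy view via __cuda_array_interface__
        y = cp.asarray(t * 2)                  # and back to CuPy, again without copying to host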

    • yobbo 4 days ago

      Jax variables are immutable.

      Code written for CuPy looks similar to numpy but very different from Jax.
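
      E.g. in-place updates have to become functional updates (a minimal illustration):

          import numpy as np
          import jax.numpy as jnp

          a = np.zeros(3)
          a[0] = 1.0            # fine in NumPy (and CuPy)

          b = jnp.zeros(3)
          # b[0] = 1.0          # TypeError: JAX arrays are immutable
          b = b.at[0].set(1.0)  # functional update returns a new array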

      • bbminner 4 days ago

        Ah, well, that's interesting! Does anyone know how cupy manages tensor mutability?

        • kmaehashi 4 days ago

          CuPy tensors (or `ndarray`) provide the same semantics as NumPy. In-place operations are permitted.

      • KeplerBoy 4 days ago

        Ah yes, stumbled over that recently, but the error message is very helpful and it's a quick change.

    • larodi 3 days ago

      Indeed, has anyone so far successfully drop-in replaced NumPy with CuPy in a project and achieved massive improvements? Because, you know, when dealing with GPUs it is very important to actually understand how data flows back and forth, not only the algorithmic nature of the code written.

      As a sidenote, it is funny how this gets released in 2024, and not in say 2014...

      • KeplerBoy 3 days ago

        Oh yes, I've personally used CuPy for great speedups compared to NumPy in radar signal processing, taking code that took 30 seconds with NumPy down to 1 second with CuPy. The code basically performed a bunch of math on around 100 MB of data, so the PCIe bottleneck was not a big issue.

        Also, CuPy was first released in 2015; this post is just a reminder that such things exist.

        • larodi 3 days ago

          Thank you. Your post is informative, and nicely grounds the misplaced hype in mine.

      • lacker 3 days ago

        Yeah, the data managed by CuPy generally stays on the GPU, and you can control when you get it out pretty straightforwardly. It's great if most of your work happens in a small number of standard operations, like matrix operations or Fourier transforms, the sort of thing that CuPy will provide for you. You can get custom kernels running through CuPy, but at some point it's easier to just write C/C++.

      • faangguyindia 2 days ago

        It's because it was not possible to write tons of brain-dead, repetitive code in 2014.

        In 2024, with AI you can do these kinds of projects very fast.

  • Narhem 4 days ago

    As nice as it is to have a drop-in replacement, most of the cost of GPU computing is moving memory around. Wouldn't be surprised if this catches unsuspecting programmers in a few performance traps.

    • markhahn 3 days ago

      The moving-data-around cost is conventional wisdom in GP-GPU circles.

      Is it changing though? Not only do PCIe interfaces keep doubling in performance, but CPU-GPU memory coherence is a thing.

      I guess it depends on your target: 8x H100s across a PCIe bridge is going to have quite different costs vs. an APU (which have gotten quite powerful, not even mentioning the MI300A)

    • low_tech_love 3 days ago

      Exactly my experience. You end up having to juggle a whole different set of requirements and design factors in addition to whatever it is that you’re already doing. Usually after a while the results are worth it, but I found the “drop-in” idea to be slightly misleading. Just because the API is the same does not make it a drop-in replacement.

  • BiteCode_dev 3 days ago

    Wondering why AMD isn't currently heavily investing in creating tons of adapters like this to help the transition from CUDA.

  • paperplatter 4 days ago

    Hm. Tempted to try PyTorch on my Mac for this. I have an Apple Silicon chip rather than an Nvidia GPU.

  • WCSTombs 4 days ago

    > However, the AMD-GPU compatibility for CuPy is quite an attractive feature.

    Last I checked (a couple months ago) it wasn't quite there, but I totally agree in principle. I've not gotten it to work on my Radeons yet.

  • sspiff 4 days ago

    It only supports AMD cards supported by ROCm, which is quite a limited set.

    I know you can enable ROCm for other hardware as well, but it's not supported and quite hit or miss. I've had limited success with running stuff against ROCm on unsupported cards, mainly having issues with memory management IIRC.

    • slavik81 3 days ago

      When I packaged the ROCm libraries that shipped in the Ubuntu 24.04 universe repository, I built and tested them with almost every discrete AMD GPU architecture from Vega to CDNA 2 and RDNA 3 (plus a few APUs). None of that is officially supported by AMD, but it is supported by me on a volunteer basis (for whatever that is worth).

      I think that every library required to build cupy is available in the universe repositories, though I've never tried building it myself.

      • markhahn 3 days ago

        To be clear, you're saying that ROCm works on a much larger range of GPUs than AMD's official support list? That's pretty exciting!

    • sitkack 3 days ago

      Fingers crossed that all future AMD parts ship with full ROCm support.

  • hedgehog 4 days ago

    It's kind of unfortunate that EagerPy didn't get more traction to make that kind of switching even easier.

  • amarcheschi 4 days ago

    I'm supposed to finish my undergraduate degree with an internship at the Italian national research center, where I'll have to use PyTorch to turn ML models from papers into code. I've tried looking at the tutorial, but I feel like there's a lot going on to grasp. Until now I've only used numpy (and pandas in combination with numpy). I'm quite excited, but a bit on edge because I can't know whether I'll be up to the task or not.

    • KeplerBoy 4 days ago

      Go for it! There's nothing to lose.

      You could checkout some of EuroCC's courses. That should get you up to speed. https://www.eurocc-access.eu/services/training/

      • amarcheschi 3 days ago

        Thank you. I've found that the PyTorch foundation has an examples page where they actually do something practical and explain what they're doing.

    • saagarjha 3 days ago

      You'll do fine :) PyTorch has an API that is somewhat similar to numpy, although if you've never programmed a GPU you might want to get up to speed on that first.

curvilinear_m 4 days ago

I'm surprised to see PyTorch and JAX mentioned as alternatives, but not Numba: https://github.com/numba/numba

I've recently had to implement a few kernels to lower the memory footprint and runtime of some PyTorch functions: it's been really nice, because Numba kernels have type-hint support (as opposed to raw CuPy kernels).
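
Roughly, a Numba CUDA kernel looks like this (an illustrative sketch, not my actual kernels):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(x, factor, out):
        i = cuda.grid(1)  # global thread index
        if i < out.size:
            out[i] = x[i] * factor

    x = np.arange(1_000_000, dtype=np.float32)
    out = np.empty_like(x)
    threads = 256
    blocks = (x.size + threads - 1) // threads
    scale[blocks, threads](x, np.float32(2.0), out)  # host arrays are transferred automatically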

meisel 4 days ago

When building something that I want to run on either CPU or GPU, depending, I've found it much easier to use PyTorch than some combination of NumPy and CuPy. I don't have to fiddle around with globally replacing numpy.* with cupy.*, and PyTorch has very nearly all the functions that those libraries have.
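
For example (a minimal sketch), the device string is about the only thing that changes:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.rand(1000, 1000, device=device)
    y = x @ x.T  # identical code runs on CPU or GPU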

  • setopt 4 days ago

    Interesting. Any links to examples or docs on how to use PyTorch as a general linear algebra library for this purpose? Like a “SciPy to PyTorch” transition guide if I want to do the same?

johndough 4 days ago

CuPy is probably the easiest way to interface with custom CUDA kernels: https://docs.cupy.dev/en/stable/user_guide/kernel.html#raw-k...

And I recently learned that CuPy has a JIT compiler now if you prefer Python syntax over C++. https://docs.cupy.dev/en/stable/user_guide/kernel.html#jit-k...
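
A kernel written with the JIT looks roughly like this (a sketch going by the docs; the names here are illustrative):

    import cupy as cp
    from cupyx import jit

    @jit.rawkernel()
    def saxpy(a, x, y, out, size):
        tid = jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x
        if tid < size:
            out[tid] = a * x[tid] + y[tid]

    n = 1 << 20
    x = cp.random.rand(n, dtype=cp.float32)
    y = cp.random.rand(n, dtype=cp.float32)
    out = cp.empty_like(x)
    saxpy((n // 256,), (256,), (cp.float32(2.0), x, y, out, cp.uint32(n)))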

  • einpoklum 4 days ago

    > probably the easiest way to interface with custom CUDA kernels

    In Python? Perhaps. Generally? No, it isn't. Try: https://github.com/eyalroz/cuda-api-wrappers/

    Full power of the CUDA APIs including all runtime compilation options etc.

    (Yes, I wrote that...)

    • johndough 4 days ago

      Personally, I prefer CuPy over your library. For example, your vectorAdd.cu implementation at https://github.com/eyalroz/cuda-api-wrappers/blob/master/exa... is much longer than a similar CuPy implementation:

          import cupy as cp
      
          vector_add = cp.RawKernel("""
          extern "C" __global__
          void vector_add(const float *A, const float *B, float *C, int num_elements) {
              int i = blockDim.x * blockIdx.x + threadIdx.x;
      
              if (i < num_elements) { C[i] = A[i] + B[i]; }
          }
          """, "vector_add")
      
          num_elements = 50_000
      
          block_size = 256
          # round up to next multiple of block_size
          grid_size = (num_elements + block_size - 1) // block_size
      
          a = cp.random.rand(num_elements, dtype=cp.float32)
          b = cp.random.rand(num_elements, dtype=cp.float32)
          c = cp.zeros(num_elements, dtype=cp.float32)
      
          args = (a, b, c, cp.int32(num_elements))  # pass a typed scalar to the raw kernel
      
          print(f"[Vector addition of {num_elements} elements]")
          print(f"CUDA kernel launch with {grid_size} blocks of {block_size} threads each")
      
          vector_add((grid_size,), (block_size,), args)
      
          incorrect = cp.abs(a + b - c) > 1e-5
      
          if cp.any(incorrect):
              print("Result verification failed at element", cp.argmax(incorrect))
          else:
              print("Test PASSED")
              print("SUCCESS")
      
      It could be made even shorter with a cp.ElementwiseKernel https://docs.cupy.dev/en/stable/user_guide/kernel.html#basic...

      Although I have to concede that the automatic grid size computation in cuda-api-wrappers is nice.

      A few marketing tips for your README:

      * Put a code example directly at the top. You want to present the selling points of your library to the reader as fast as possible. For reference, look at the CuPy README https://github.com/cupy/cupy?tab=readme-ov-file#cupy--numpy-... which immediately shows the reader what it is good for. Your README starts with lots of text, but nobody reads text anymore these days. A link to examples is almost at the end, and then the examples are deeply nested.

      * The first links in the README should link to your own library, for example to documentation or examples. You do not want to lead the reader away from your GitHub page.

      * Add syntax highlighting with "cpp" after triple backticks:

          ```cpp
          <code here>
          ```

      • einpoklum 2 days ago

        cuPy is a useful, and kind of large, library which does a lot of things. In your example, you use it to create buffers, fill them up with random values, and perform elementwise arithmetic on them. numpy does that, which is why cuPy does that. My library only wraps CUDA functionality, and mostly "does nothing" [1] - so you have to "do everything yourself", except that it's easy(ish) to do so. It definitely never does anything behind-the-scenes or behind-your-back.

        This difference between the libraries makes your program more terse; however, you lose control over where your buffers are, from where they're accessible, when they get copied around and how etc. You can't even tell - from looking at the program source - whether the buffers will be "managed memory" accessed and copied page-by-page, or rather a copy will be made from system memory to device-global memory.

        So, in my book, it is not as easy to access and control CUDA with cuPy. But it is easier for a user who "needs numpy for GPUs", and does not care about the nitty-gritty, to write their program and get things done. Your program demonstrates both of these points.

        I should mention that I wrote my library with the hope that others will use it to build higher-level-abstraction libraries and apps. One could use it to create a cuCpp library that would be very numpy-like but for C++, a parallel of NumCpp [2].

        Thanks for the tips regarding the README, I'll fix it up.

        ----

        [1]: cuda-api-wrappers does offer a couple of utility classes like a poor man's span for pre-C++17, and a span+unique_ptr combo - which is beyond wrapping CUDA's APIs, but still doesn't quite "do" things.

        [2]: https://github.com/dpilger26/NumCpp

        • einpoklum a day ago

          > It definitely never does anything behind-the-scenes or behind-your-back.

          Actually that's a bit of a lie, because of the economy of primary device context reference counts (it's quite annoying if you need to do it well and not leak resources) and the context stack. So, let's say it does as little as possible behind the scenes... :-(

towerwoerjrrwer 3 days ago

CuPy was first, but at this point you're better off using JAX.

It has a much larger community, a big push from Google Research, and, unlike PFN's Chainer (of which CuPy is the computational base), is not semi-abandoned.

Kind of sad to see the CuPy/Chainer ecosystem die: not only did they pioneer the PyTorch programming model, they also stuck to the NumPy API like JAX does (though the AD is layered on top in Chainer, IIRC).

  • WanderPanda 3 days ago

    JAX still has the

    "This is a research project, not an official Google product. Expect bugs and sharp edges. Please help by trying it out, reporting bugs, and letting us know what you think!"

    disclaimer in its README. This is quite scary, especially coming from Google, which is known to abandon projects out of the blue.

    • atty 3 days ago

      Jax is, as far as an outsider can tell, Google research/DeepMind’s primary computation library now, not Tensorflow. So it’s safer than most Google projects, unless you think they’re secretly developing their third tensor computation library in 10 years (I suppose if anyone was going to do it, it would be Google).

  • low_tech_love 3 days ago

    I tried Jax last year and was not impressed. Maybe it was just me, but everything I tried to do (especially with the autograd stuff) involved huge compilation times and I simply could not get the promised performance. Maybe I’ll try again and read some new documentation, since everyone is excited about it.

  • kmaehashi 3 days ago

    CuPy isn't semi-abandoned either, obviously :)

  • p1esk 3 days ago

    I agree. Though it's good to have options for GPU accelerated Numpy. Especially if Google decides to discontinue Jax at some point.

__mharrison__ 4 days ago

I taught my numpy class to a client who wanted to use GPUs. Installation (at that time) was a chore but afterwards it was really smooth using this library. Big gains with minimal to no code changes.

SubiculumCode 4 days ago

As an aside: I was trying to install CuPy the other day and was having issues.

Open projects on GitHub often (at least superficially) require specific versions of the CUDA Toolkit (and all the specialty Nvidia packages, e.g. cuDNN), TensorFlow, etc., and changing the default versions of these for each little project, or step in a processing chain, is ridiculous.

pyenv et al. have really made local, project-specific versions of Python packages much easier to manage. But I haven't seen a similar solution for the CUDA Toolkit and associated packages, and the solutions I've encountered seem terribly hacky. I'm sure this is a common issue, though, so what do people do?

  • kmaehashi 4 days ago

    As a maintainer of CuPy and also as a user of several GPU-powered Python libraries, I empathize with the frustrations and difficulties here. Indeed, one thing CuPy values is making the installation step as easy and universal as possible. We strive to keep the binary package footprint small (currently less than 100 MiB), keep dependencies to a minimum, support a wide variety of platforms including Windows and aarch64, and not require a specific CUDA Toolkit version.

    If anyone reading this message has encountered a roadblock while installing CuPy, please reach out. I'd be glad to help you.

  • welder a day ago

    Yes, you need to install the right version, or CuPy hangs forever when installing via pip:

        pip install cupy-cuda12x

  • ttyprintk 3 days ago

    You can jam the argument cudatoolkit=1.2.3 when creating conda environments.

    NB I’m using Miniforge.

  • mardifoufs 3 days ago

    One way to do it is to explicitly add the link to, say, the pytorch+CUDA wheel from the PyTorch repos in your requirements.txt, instead of using the normal PyPI package. Which also sucks, because you then have to do some other tweaks to make your requirements.txt portable across different platforms...

    (And you can't just add another index for pip to look at if you want to use python build, so it has to be explicitly linked to the right wheel, which absolutely sucks, especially since you cannot get the CUDA version from PyPI.)

  • coeneedell 4 days ago

    Ugh… docker containers. I also wish there was a simpler way but I don’t think there is.

    • SubiculumCode 4 days ago

      this is not what I wanted to hear. NOT AT ALL. Please whisper sweet lies into my ears.

      • coeneedell 4 days ago

        At the moment I’m working on a system to quickly replicate academic deep learning repos (papers) at scale. At least Amazon has a catalogue of prebuilt containers with cuda/pytorch combos. I still occasionally have an issue where the container works on my 3090 test bench but not on the T4 cloud node…

  • m_d_ 4 days ago

    conda provides cudatoolkit and associated packages. Does this solve the situation?

    • SubiculumCode 3 days ago

      Actually, yes it does... except I seem to remember that it doesn't go back very far in CUDA versions. I can't seem to find it again right now.

    • nyrikki 4 days ago

      Conda's 200-employee threshold license change is problematic for some.

  • whimsicalism 4 days ago

    In real life everyone just uses containers; might not be the answer you want to hear, though.

    • SubiculumCode 3 days ago

      I'm okay with containers generally, I think. Is this a situation where you put your code into the container and run it, or does the code make calls to the container's GPU?

markkitti 3 days ago

It's funny how much easier GPU support, especially vendor agnostic GPU support, is in Julia.

  • Kalanos 3 days ago

    It is enticing. I did not find the syntax for working with n-dimensional data to be intuitive, though.

  • einpoklum 2 days ago

    Your comment would be more useful with some elaboration and links.

setopt 4 days ago

I’ve been using CuPy a bit and found it to be excellent.

It’s very easy to replace some slow NumPy/SciPy calls with appropriate CuPy calls, with sometimes literally a 1000x performance boost from like 10min work. It’s also easy to write “hybrid code” where you can switch between NumPy and CuPy depending on what’s available.

  • glial 4 days ago

    Are you able to share what functions or situations result in speedups? In my experience, vectorized numpy is already fast, so I'm very curious.

    • setopt 4 days ago

      The largest speedup I have seen was for a quantum mechanics simulation where I needed to repeatedly calculate all eigenvalues of Hermitian matrices (but not necessarily their eigenvectors).

      This was basically the code needed:

          import scipy.linalg as la
      
          if cuda:
              import cupy as cp
              import cupy.linalg as cla
      
              ε = cp.asnumpy(cla.eigvalsh(cp.asarray(H)))
          else:
              ε = la.eigvalsh(H)
      
      I was using IntelPython, which already has fast (parallelized) methods for this using MKL, but CuPy blew it out of the water.

      • JBits 3 days ago

        What did you need the eigenvalues for? I would've guessed exact diagonalization, but in that case you would need the eigenvectors.

        • setopt 3 days ago

          In quantum mechanics, you use a “Hamiltonian matrix” H to encode, in some sense, “everything that a particle in your system is allowed to do” along with some energy associated with that. For instance, an electron in a metallic crystal is allowed to “hop” over to a neighboring atom and that is associated with some kinetic energy. Or it is in some cases allowed to stay on the same atom as another electron, and that is associated with a potential energy (Coulomb repulsion).

          The eigenvalues of this matrix are the answer to “what are the energies of each stable electron state in this system”. If you know how many electrons you have (they tend to fill the lowest energy states they can at zero temperature), and you know what temperature you have (which gives you the probability of each “excited” state being occupied), then you can say a lot about the system. For instance, you can say what physical state lowers the “free energy” of the electrons at a given temperature (which can be used to predict phase transitions and spin configurations), or what is the “density of states” (which can be used to predict electronic resistance). You can also obtain the system’s entropy from the eigenvalues alone.

          There are however many cases where you might need eigenvectors too, since they usually provide all the spatial information about “where in your system is this stuff happening”. When I need the eigenvectors, CuPy is still hundreds of times faster on my hardware, but the gap is just not as extreme as it was for pure eigenvalue calculation in my benchmarks.

    • KeplerBoy 4 days ago

      Not OP, but think about stuff like FFTs or matmuls. It's not even a competition; GPUs win when the algorithm is somewhat suitable and you're dealing with FP32 or lower precision.
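
      E.g. a big FFT is a one-liner backed by cuFFT (sizes here are arbitrary):

          import cupy as cp

          x = cp.random.rand(4096, 4096, dtype=cp.float32)
          X = cp.fft.fft2(x)                 # dispatched to cuFFT on the GPU
          cp.cuda.Stream.null.synchronize()  # kernels are async; wait before timing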

sdenton4 4 days ago

Why not Jax?

  • johndough 4 days ago

    > Why not Jax?

    - JAX Windows support is lacking

    - CuPy is much closer to CUDA than JAX, so you can get better performance

    - CuPy is generally more mature than JAX (fewer bugs)

    - CuPy is more flexible thanks to cp.RawKernel

    - (For those familiar with NumPy) CuPy is closer to NumPy than jax.numpy

    But CuPy does not support automatic gradient computation, so if you do deep learning, use JAX instead. Or PyTorch, if you do not trust Google to maintain a project for a prolonged period of time https://killedbygoogle.com/

    • gnulinux 4 days ago

      What about CPU-only workloads? If one wants to write code that'll eventually run on both CPU and GPU, but in the short-to-mid term will only be used on the CPU? Since JAX natively supports the CPU (via its XLA backend), but CuPy doesn't, this seems like a potential problem for some.

      • nextaccountic 4 days ago

        Isn't there a way to dynamically select between numpy and cupy, depending on whether you want cpu or gpu code?

        • kmaehashi 4 days ago

          NumPy has a mechanism to dispatch execution to CuPy: https://numpy.org/neps/nep-0018-array-function-protocol.html

          Just prepare the input as NumPy or CuPy arrays, and then you can feed it to NumPy APIs. NumPy functions will handle the computation themselves if the input is a NumPy ndarray, or dispatch the execution to CuPy if the input is a CuPy ndarray.
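
          For example (a small sketch of the dispatch):

              import numpy as np
              import cupy as cp

              x_gpu = cp.arange(10)
              s = np.sum(x_gpu)  # dispatched to CuPy via __array_function__; runs on the GPU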

        • johndough 4 days ago

          > Isn't there a way to dynamically select between numpy and cupy, depending on whether you want cpu or gpu code?

          CuPy is an (almost) drop-in replacement for NumPy, so the following works surprisingly often:

              if use_cpu:
                  import numpy as np
              else:
                  import cupy as np

        • gnulinux 4 days ago

          There is, but then you're using two separate libraries, which seems like a fragile point of failure compared to just using JAX. But regardless, since JAX will use different backends anyway, it's arguably not any worse (though it ends up being your responsibility to ensure correctness, as opposed to the JAX team's).

    • insane_dreamer 4 days ago

      > CuPy does not support automatic gradient computation, so if you do deep learning, use JAX instead

      DL is a major use case; is CuPy planning on adding auto gradient computation?

  • bee_rider 4 days ago

    Real answer: CuPy has a name that is very similar to SciPy. I don't know GPUs; that's why I'm using this sort of library, haha. The branding for CuPy makes it obvious. Is JAX the same thing, but implemented better somehow?

    • sdenton4 4 days ago

      Yeah, JAX provides a one-to-one reimplementation of the NumPy interface, and a decent chunk of the SciPy interface. Random number handling is a bit different, but NumPy random number handling seems to be trending in the JAX direction (explicitly passed RNG objects).

      Jax also provides back-propagation wherever possible, so you can optimize.

  • palmy 4 days ago

    CuPy came out a long time before JAX; I remember using it in a project for my BSc around 2015-2016.

    Cool to see that it's still kicking!

adancalderon 4 days ago

If it ran in the background it could be CuPyd

whimsicalism 4 days ago

I was just thinking we didn’t have enough CUDA-accelerated numpy libraries.

Jax, pytorch, vanilla TF, triton. They just don’t cut it

bee_rider 4 days ago

As good a place as any to ask, I guess. Do any of these GPU libraries have a BiCGStab (or similar) that handles multiple right-hand sides? CuPy seems to have GMRES, which would be fine, but as far as I can tell it just does one right-hand side.

  • johndough 4 days ago

    If you have many right hand sides, you could also compute an LU factorization and then solve the right hand sides via back-substitution.

    https://docs.cupy.dev/en/stable/reference/generated/cupyx.sc...

    or https://docs.cupy.dev/en/stable/reference/generated/cupyx.sc... if your linear system is sparse.

    But whether that works well depends on the problem you are trying to solve.
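
    A sketch of what that could look like in CuPy for the dense case (the shapes here are made up):

        import cupy as cp
        from cupyx.scipy.linalg import lu_factor, lu_solve

        n, k = 1000, 32
        A = cp.random.rand(n, n)
        B = cp.random.rand(n, k)  # k right-hand sides as columns

        lu, piv = lu_factor(A)      # factorize once
        X = lu_solve((lu, piv), B)  # back-substitute for all right-hand sides at once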

    • bee_rider 4 days ago

      My systems are sparse, but might not fit on the GPU when factorized. Actually, usually I do CPU stuff with lots of ram, and Pardiso, so it isn’t an issue.

      But I was hoping to try out something like ILU+bicgstab on the GPU and the python-verse seems like it has the lowest barrier-to-entry for just playing around.

      • johndough 4 days ago

        For my tasks, I had some success with algebraic multigrid solvers as preconditioner, for example from AMGCL or PyAMG. They are also reasonably easy to get started with.

        https://github.com/pyamg/pyamg

        https://github.com/ddemidov/amgcl

        But I only have to deal with positive definite systems, so YMMV.

        I am not sure whether those libraries can deal with multiple right-hand sides, but most complexity is in the preconditioners anyway.

  • trostaft 4 days ago

    IIRC jax's `scipy.sparse.linalg.bicgstab` does support multiple right hand sides.

      EDIT: Or rather, all the solvers under jax's `scipy.sparse.linalg` support multiple right hand sides.

    • bee_rider 4 days ago

      Oh dang, that’s pretty awesome, thanks.

      “array or tree of arrays” sounds very general, probably even better than an old fashioned 2D array.

      • trostaft 4 days ago

        'tree of arrays'

        Ahh, that's just Jax's concept of pytrees. It was something they invented to make it easier (this is how I view it, not the complete picture) to pass complex objects to functions while still being able to treat them as a concatenated vector for AD etc. E.g. a common pattern is to pass parameters `p` to a function and then internally break them into their physical interpretations, e.g. `mass = p[0]`, `velocity = p[1]`. Pytrees let you just use something like a dictionary `p = {'mass': 1.0, 'velocity': 1.0}`, which is a stylistically more natural structure to pass around, and Jax is structured to understand, when AD'ing or otherwise, that you're working with respect to the 'leaves' of the tree, i.e. the values of the mass and velocity.

        Hopefully someone corrects me if I'm not right about this. I'm hardly 100% on Jax's vision on PyTrees.

        As an aside, just a list of right hand sides `[b1, b2, ..., bm]` is valid.
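
        Something like this (a toy illustration of the pytree idea, with made-up parameter names):

            import jax

            def kinetic_energy(p):
                return 0.5 * p["mass"] * p["velocity"] ** 2

            p = {"mass": 1.0, "velocity": 2.0}
            grads = jax.grad(kinetic_energy)(p)  # a dict with the same structure as p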

kunalgupta022 4 days ago

Is anyone aware of a pandas-like library that is based on something like CuPy instead of NumPy? It would be great to have the ease of use of pandas with the parallelism unlocked by the GPU.

lmeyerov 4 days ago

We are fans! We mostly use cudf/cuml/cugraph (GPU dataframes etc) in the pygraphistry ecosystem, and when things get a bit tricky, cupy is one of the main escape hatches