GPU programming
How to actually make the GPU do work — the languages, libraries and frameworks for writing compute kernels. Each library below has its own page with install steps and a complete, runnable example. Start with the one that matches your hardware and language.
Core libraries
CUDA
NVIDIA's parallel computing platform and the de-facto standard for GPU compute. The richest ecosystem — cuDNN, cuBLAS, CUTLASS, Nsight tooling — and the target every major ML framework optimises for first.
ROCm / HIP
AMD's open compute stack. HIP is a thin C++ runtime that mirrors the CUDA API almost one-to-one, so CUDA code ports with hipify and the same source compiles for AMD or NVIDIA. PyTorch and TensorFlow ship ROCm builds.
Triton
Write fused GPU kernels in pure Python. Triton handles memory coalescing, shared-memory tiling and scheduling, so you reason in blocks of data rather than threads — close to hand-tuned CUDA performance with a fraction of the code. It powers much of PyTorch 2's compiler.
WebGPU
The modern GPU API for the web (and, via Dawn/wgpu, native apps). Compute shaders in WGSL run on NVIDIA, AMD, Intel and Apple GPUs from the same code, no install — the easiest way to ship GPU compute to everyone. Now shipping in Chrome, Edge, Firefox and Safari.
OpenCL
The long-standing open standard for heterogeneous compute. Runs on GPUs, CPUs and accelerators from every vendor, so a single kernel is broadly portable — the trade-off is a verbose host API and less cutting-edge tooling than CUDA.
Vulkan
An explicit, low-level cross-vendor API for graphics and compute. Compute shaders are written in GLSL (or HLSL), compiled ahead of time to SPIR-V, and run on every modern GPU from NVIDIA, AMD, Intel and — via MoltenVK — Apple. Verbose, but it gives you total control and portability, which is why most new engines target it.
OpenGL
The long-standing cross-platform graphics API. Since OpenGL 4.3 (and OpenGL ES 3.1) it has compute shaders in GLSL that read and write buffers and images — the simplest way to add GPGPU to an existing OpenGL app without bringing in a separate compute API. Note that WebGL does not expose compute shaders; use WebGPU there instead.
DirectX (Direct3D)
Microsoft's graphics and compute API for Windows and Xbox. DirectCompute runs GPU compute through HLSL compute shaders under Direct3D 11 and 12, and DirectML builds machine learning on top of it. It is the default path for Windows games and the Windows ML stack, on any vendor's GPU.
WebGL
The browser's original GPU API — OpenGL ES exposed to JavaScript, supported everywhere. WebGL has no compute shaders, so general-purpose GPU work is done either by rendering into a texture with a fragment shader, or (in WebGL2) with transform feedback as shown below. For genuine compute in the browser, reach for WebGPU instead; WebGL remains the compatibility fallback.
Also worth knowing
| Library | From | What it is |
|---|---|---|
| SYCL / oneAPI | Khronos / Intel | Single-source C++ heterogeneous compute; Intel's oneAPI (DPC++) is the flagship implementation. |
| Metal | Apple | Apple's GPU API; Metal Performance Shaders and MPSGraph back ML on Apple silicon. |
| CuPy | Community | Drop-in NumPy/SciPy on CUDA & ROCm — array code on the GPU with almost no changes. |
| Numba | Community | JIT-compile Python to CUDA kernels with @cuda.jit decorators. |
| JAX | NumPy + autodiff + XLA compilation across GPU/TPU; functional and composable. | |
| PyTorch | Meta | The dominant deep-learning framework; eager GPU tensors plus a Triton-backed compiler. |
| OpenAI Triton | OpenAI | See the dedicated page — Python kernels. |
Which should I pick?
- NVIDIA GPU, maximum performance → CUDA (or Triton for ML kernels in Python).
- AMD GPU → ROCm / HIP — and your CUDA code mostly ports across.
- Any GPU, in the browser → WebGPU for compute (no install), or WebGL as the graphics-only fallback.
- Compute inside a graphics app or game engine → Vulkan or OpenGL compute shaders cross-platform, or DirectX (DirectCompute) on Windows.
- Maximum portability across vendors → OpenCL or SYCL.
- Just want fast arrays in Python → CuPy, Numba, JAX or PyTorch sit on top of these and need no kernel writing at all.