GPU programming

How to make the GPU do work: the languages, libraries and frameworks for writing compute kernels. Each library below has its own page with install steps and a complete, runnable example. Start with the one that matches your hardware and language.

Core libraries

CUDA

NVIDIA's parallel computing platform and the de facto standard for GPU compute. It has the richest ecosystem (cuDNN, cuBLAS, CUTLASS, Nsight tooling) and is the target every major ML framework optimises for first.

NVIDIA only · most mature

ROCm / HIP

AMD's open compute stack. HIP is a thin C++ runtime that mirrors the CUDA API almost one-to-one, so CUDA code ports with hipify and the same source compiles for AMD or NVIDIA. PyTorch and TensorFlow ship ROCm builds.

AMD · open source · CUDA-portable
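
To give a feel for how closely HIP tracks the CUDA runtime API, here is a toy sketch in Python that renames a handful of CUDA runtime identifiers to their HIP counterparts. It is not hipify and it covers only the names listed; `CUDA_TO_HIP` and `toy_hipify` are hypothetical helpers written for this illustration, and the CUDA snippet is a made-up fragment rather than a complete program. Real ports should use the hipify tools that ship with ROCm.

```python
import re

# Illustrative subset of CUDA runtime names and their HIP equivalents.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaFree": "hipFree",
}

# Match whole identifiers only; names outside the table are left untouched.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in CUDA_TO_HIP) + r")\b"
)


def toy_hipify(source):
    """Rename the known CUDA identifiers in `source` to their HIP names."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)


# Hypothetical host-side fragment; the kernel launch syntax is unchanged in HIP.
cuda_src = """#include <cuda_runtime.h>
cudaMalloc(&d_x, bytes);
cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
scale<<<blocks, threads>>>(d_x, n);
cudaDeviceSynchronize();
cudaFree(d_x);
"""

hip_src = toy_hipify(cuda_src)
print(hip_src)
```

The point of the sketch is that the port is mostly a rename: the call structure, argument order and kernel launch line stay as they were.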

Triton

Write fused GPU kernels in pure Python. Triton handles memory coalescing, shared-memory tiling and scheduling, so you reason in blocks of data rather than threads — close to hand-tuned CUDA performance with a fraction of the code. It powers much of PyTorch 2's compiler.

Python · open source · kernels
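
The "blocks of data rather than threads" idea can be sketched without a GPU. The pure-Python emulation below is not Triton code and calls no Triton API; `BLOCK`, `add_block` and `launch` are hypothetical names chosen for this example. It only mimics the shape of a block-level kernel: each program instance is identified by a program id, works out the offsets of its block, masks off out-of-range elements, and processes the whole block in one go.

```python
BLOCK = 4  # elements handled by one program instance (assumed tile size)


def add_block(pid, x, y, out, n):
    """One 'program': add one block of x and y into out."""
    offsets = [pid * BLOCK + i for i in range(BLOCK)]
    mask = [off < n for off in offsets]  # guards the ragged final block
    for off, in_range in zip(offsets, mask):
        if in_range:
            out[off] = x[off] + y[off]


def launch(x, y):
    """Run one program instance per block; a GPU would run them in parallel."""
    n = len(x)
    out = [0.0] * n
    grid = (n + BLOCK - 1) // BLOCK  # ceil(n / BLOCK) program instances
    for pid in range(grid):
        add_block(pid, x, y, out, n)
    return out


x = [float(i) for i in range(10)]
y = [10.0 * i for i in range(10)]
result = launch(x, y)
print(result)
```

In an actual Triton kernel the same structure typically appears as `tl.program_id`, `tl.arange`, and masked `tl.load` / `tl.store` calls, with the compiler deciding how the block maps onto threads.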

WebGPU

The modern GPU API for the web (and, via Dawn/wgpu, native apps). Compute shaders in WGSL run on NVIDIA, AMD, Intel and Apple GPUs from the same code with no install, which makes it the easiest way to ship GPU compute to a broad audience. It now ships in current versions of Chrome, Edge, Firefox and Safari, though platform coverage varies by browser.

browser · cross-vendor · open standard

OpenCL

The long-standing open standard for heterogeneous compute. Runs on GPUs, CPUs and accelerators from every vendor, so a single kernel is broadly portable — the trade-off is a verbose host API and less cutting-edge tooling than CUDA.

cross-vendor · open standard

Vulkan

An explicit, low-level cross-vendor API for graphics and compute. Compute shaders are written in GLSL (or HLSL), compiled ahead of time to SPIR-V, and run on modern GPUs from NVIDIA, AMD and Intel, and on Apple hardware via MoltenVK. It is verbose, but it gives you fine-grained control and portability, which is why many cross-platform engines target it.

cross-vendor · open standard · low-level

OpenGL

The long-standing cross-platform graphics API. Since OpenGL 4.3 (and OpenGL ES 3.1) it has compute shaders in GLSL that read and write buffers and images — the simplest way to add GPGPU to an existing OpenGL app without bringing in a separate compute API. Note that WebGL does not expose compute shaders; use WebGPU there instead.

cross-vendor · compute shaders · widely supported

DirectX (Direct3D)

Microsoft's graphics and compute API for Windows and Xbox. DirectCompute runs GPU compute through HLSL compute shaders under Direct3D 11 and 12, and DirectML builds machine learning on top of it. It is the default path for Windows games and the Windows ML stack, on any vendor's GPU.

Windows · DirectCompute · HLSL

WebGL

The browser's original GPU API: OpenGL ES exposed to JavaScript, supported everywhere. WebGL has no compute shaders, so general-purpose GPU work is done either by rendering into a texture with a fragment shader or, in WebGL2, with transform feedback. For genuine compute in the browser, reach for WebGPU instead; WebGL remains the compatibility fallback.

browser · graphics · no compute shaders

Also worth knowing

| Library | From | What it is |
| --- | --- | --- |
| SYCL / oneAPI | Khronos / Intel | Single-source C++ heterogeneous compute; Intel's oneAPI (DPC++) is the flagship implementation. |
| Metal | Apple | Apple's GPU API; Metal Performance Shaders and MPSGraph back ML on Apple silicon. |
| CuPy | Community | Drop-in NumPy/SciPy on CUDA & ROCm: array code on the GPU with almost no changes. |
| Numba | Community | JIT-compile Python to CUDA kernels with @cuda.jit decorators. |
| JAX | Google | NumPy + autodiff + XLA compilation across GPU/TPU; functional and composable. |
| PyTorch | Meta | The dominant deep-learning framework; eager GPU tensors plus a Triton-backed compiler. |
| OpenAI Triton | OpenAI | Python kernels; see the dedicated page. |

Which should I pick?

  • NVIDIA GPU, maximum performance → CUDA (or Triton for ML kernels in Python).
  • AMD GPU → ROCm / HIP; your CUDA code mostly ports across.
  • Any GPU, in the browser → WebGPU for compute (no install), or WebGL as the graphics-only fallback.
  • Compute inside a graphics app or game engine → Vulkan or OpenGL compute shaders cross-platform, or DirectX (DirectCompute) on Windows.
  • Maximum portability across vendors → OpenCL or SYCL.
  • Just want fast arrays in Python → CuPy, Numba, JAX or PyTorch sit on top of these and need no kernel writing at all.
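
The list above amounts to a small lookup table. As a sketch only, the hypothetical `pick_gpu_stack` function below encodes it in Python; the scenario keys are invented for this example, and the answers simply restate the bullets rather than adding new advice.

```python
# Scenario keys are made up for this sketch; values restate the list above.
RECOMMENDATIONS = {
    "nvidia_max_performance": ["CUDA", "Triton"],
    "amd": ["ROCm / HIP"],
    "browser": ["WebGPU", "WebGL"],
    "graphics_app": ["Vulkan", "OpenGL", "DirectX"],
    "max_portability": ["OpenCL", "SYCL"],
    "python_arrays": ["CuPy", "Numba", "JAX", "PyTorch"],
}


def pick_gpu_stack(scenario):
    """Return the suggested libraries for a scenario, first choice first."""
    try:
        return list(RECOMMENDATIONS[scenario])
    except KeyError:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(
            f"unknown scenario {scenario!r}; expected one of: {known}"
        ) from None


for scenario in ("amd", "browser", "python_arrays"):
    print(scenario, "->", ", ".join(pick_gpu_stack(scenario)))
```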