Triton
Write fused GPU kernels in pure Python. Triton handles memory coalescing, shared-memory tiling, and scheduling, so you reason in blocks of data rather than individual threads and get close to hand-tuned CUDA performance with a fraction of the code. TorchInductor, the default backend of PyTorch 2's torch.compile, generates Triton kernels for its GPU code.
OpenAI · Python · open source · kernels
Install
pip install triton # ships with recent PyTorch on Linux
python -c "import triton; print(triton.__version__)"
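The kernel example on this page allocates its tensors on a CUDA device, so it helps to confirm that both packages are installed and that PyTorch can see a GPU before launching anything. The following is a minimal sketch of such a check; the helper name check_triton_env is hypothetical and is not part of Triton or PyTorch.

```python
import importlib.util


def check_triton_env():
    """Return (ok, message) saying whether a Triton kernel can run here.

    Hypothetical helper for illustration; not a Triton or PyTorch API.
    """
    # find_spec locates a package without importing it.
    for name in ("torch", "triton"):
        if importlib.util.find_spec(name) is None:
            return False, f"{name} is not installed"

    import torch  # imported only after confirming it is installed

    if not torch.cuda.is_available():
        return False, "PyTorch cannot see a GPU"
    return True, f"ready on {torch.cuda.get_device_name(0)}"


ok, message = check_triton_env()
print(ok, message)
```

If the check reports a missing package or no visible GPU, fix that first; the kernel launch would otherwise fail with a less direct error.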
Hello, GPU
add.py — a Triton vector-add kernel
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)                    # index of this program in the 1-D grid
    offs = pid * BLOCK + tl.arange(0, BLOCK)  # element offsets this program handles
    mask = offs < n                           # disable lanes past the end of the data
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)


def add(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    # One program per BLOCK elements; the grid reads BLOCK from the launch arguments.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out


x = torch.rand(1_000_000, device="cuda")
y = torch.rand(1_000_000, device="cuda")
print(torch.allclose(add(x, y), x + y))  # True
Run it:
python add.py
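To see what the offsets and mask in add_kernel are doing, the same index arithmetic can be replayed on the CPU in plain Python. This is only an illustrative sketch: emulate_add and cdiv are hypothetical helpers written for this page, and the loop runs program instances one after another, whereas Triton runs them in parallel on the GPU.

```python
def cdiv(a, b):
    """Ceiling division, the same arithmetic as triton.cdiv."""
    return (a + b - 1) // b


def emulate_add(x, y, block):
    """Replay add_kernel's indexing on CPU lists (hypothetical helper)."""
    n = len(x)
    out = [0.0] * n
    for pid in range(cdiv(n, block)):                    # stands in for tl.program_id(0)
        offs = [pid * block + i for i in range(block)]   # pid * BLOCK + tl.arange(0, BLOCK)
        mask = [o < n for o in offs]                     # False for lanes past the end
        for o, keep in zip(offs, mask):
            if keep:                                     # masked-off lanes neither load nor store
                out[o] = x[o] + y[o]
    return out


x = [float(i) for i in range(10)]
y = [2.0 * i for i in range(10)]
print(emulate_add(x, y, block=4))   # 3 programs; the last one has only 2 active lanes
print(cdiv(1_000_000, 1024))        # 977 programs for the example in add.py
```

For the one-million-element example, 977 programs are launched and the last one covers only 576 valid elements, which is why every load and store in the kernel carries the mask.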