Triton

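Write fused GPU kernels in Python. Triton handles memory coalescing, shared-memory tiling, and scheduling within each block, so you reason in blocks of data rather than individual threads. Well-written kernels can approach hand-tuned CUDA performance with a fraction of the code. Triton is also the language in which PyTorch 2's compiler (torch.compile with the TorchInductor backend) generates its GPU kernels.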

OpenAI · Python · open source · kernels


Install

pip install triton        # ships with recent PyTorch on Linux
python -c "import triton; print(triton.__version__)"

Hello, GPU

add.py — a Triton vector-add kernel

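import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)                    # index of this program in the 1D grid
    offs = pid * BLOCK + tl.arange(0, BLOCK)  # element offsets this program handles
    mask = offs < n                           # guard the tail when n is not a multiple of BLOCK
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    # One program per BLOCK elements; the grid reads BLOCK from the launch arguments.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out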

x = torch.rand(1_000_000, device="cuda")
y = torch.rand(1_000_000, device="cuda")
print(torch.allclose(add(x, y), x + y))   # True

Run it:

python add.py
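
To see what the grid size and the mask in add.py are doing, the launch arithmetic can be replayed on the CPU in plain Python. This is only an illustrative sketch: `cdiv` and `emulate_add` are hypothetical helpers written for this example, not Triton APIs, and the sequential loop stands in for program instances that a GPU would run in parallel.

```python
# CPU-only emulation of the launch arithmetic in add.py; needs neither Triton nor a GPU.
# `cdiv` and `emulate_add` are hypothetical helpers for illustration only.

def cdiv(a, b):
    """Ceiling division, the same arithmetic triton.cdiv performs."""
    return (a + b - 1) // b

def emulate_add(x, y, block):
    n = len(x)
    out = [None] * n
    num_programs = cdiv(n, block)        # size of the 1D launch grid
    for pid in range(num_programs):      # on a GPU these programs run in parallel
        offs = [pid * block + i for i in range(block)]
        mask = [o < n for o in offs]     # False for lanes past the end of the data
        for o, m in zip(offs, mask):
            if m:                        # masked-off lanes neither load nor store
                out[o] = x[o] + y[o]
    return out

x = [float(i) for i in range(10)]
y = [10.0 * i for i in range(10)]
result = emulate_add(x, y, block=4)

print(cdiv(10, 4))                               # 3 programs cover offsets 0-3, 4-7, 8-11
print(result == [a + b for a, b in zip(x, y)])   # True
```

With the 1,000,000-element tensors and BLOCK=1024 used in add.py, the same arithmetic gives 977 programs, and the last one has 448 of its 1024 lanes masked off.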
