Text & LLMs

Large language models for chat, coding, reasoning and agents. Frontier models run in the cloud; open-weight models run on your own GPU. With roughly 4-bit quantisation, a 7–8B model fits in 8 GB and a 70B model fits in 2×24 GB or one 48 GB card.
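The sizing figures above follow from simple arithmetic: weight memory is roughly parameter count times bits per weight. The sketch below is a back-of-envelope estimate, not a measurement. The function name `estimate_vram_gb` and the flat 20% overhead for KV cache and activations are assumptions made for illustration; real usage depends on context length, batch size and the runtime.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 0.2) -> float:
    """Rough VRAM estimate in GB (1 GB = 1e9 bytes).

    overhead is an assumed flat allowance for KV cache and activations;
    it is a placeholder, not a measured value.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9


if __name__ == "__main__":
    for name, params in [("8B", 8), ("70B", 70)]:
        for bits in (16, 4):
            gb = estimate_vram_gb(params, bits)
            print(f"{name} @ {bits}-bit: ~{gb:.1f} GB")
```

Under these assumptions an 8B model at 4 bits comes to about 4.8 GB and a 70B model to about 42 GB, which is why the first fits an 8 GB card and the second needs 48 GB in total.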

Providers

The leading model providers: hosted services you sign up for and use via app or API, and labs that publish open weights.

Provider | From | Strengths | Access
GPT-5 / o-series | OpenAI | General reasoning, tools, coding | API · app
Claude (Opus / Sonnet) | Anthropic | Coding, long-context, agents | API · app
Gemini | Google | Multimodal, huge context | API · app
Llama | Meta | Open weights, broad ecosystem | Open weights
Mistral / Magistral | Mistral AI | Efficient open & API models | Open · API
DeepSeek V4 | DeepSeek | Strong open reasoning, low cost | Open · API
Qwen | Alibaba | Open weights, many sizes | Open weights
Grok | xAI | Realtime, reasoning | API · app

Open-source tools

Run these yourself on a local or rented GPU. Open weights are free to use, private, and fine-tunable.

Run LLMs in C/C++ on CPU or GPU; GGUF quantisation; the backbone of local inference.

inference · C++

One-command local model runner with a clean API; wraps llama.cpp.

inference · easy

High-throughput serving engine with PagedAttention; the standard for production inference.

serving · fast

Desktop app to download and chat with local models, GPU-accelerated.

desktop · easy

Hugging Face's library — thousands of models behind one Python API.

library · training

2× faster, lower-memory fine-tuning of Llama/Mistral/Qwen with QLoRA.

fine-tune

Unified fine-tuning UI/CLI for 100+ LLMs and VLMs.

fine-tune

Apple-silicon array framework for running and training models on Macs.

apple

Run 70B inference on a single 4 GB GPU via layered offloading.

low-VRAM

Karpathy's minimal, hackable full-stack ChatGPT clone to learn from.

learn
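The layered offloading mentioned above for low-VRAM inference can be illustrated with a toy sketch: instead of holding every layer in memory, load one layer, apply it, free it, and move on. This is not the code of any tool listed here. `load_layer` is a hypothetical stand-in for reading one layer's weights from disk, and the "model" is just a stack of small random matrices with a ReLU.

```python
import random


def load_layer(index: int, dim: int) -> list[list[float]]:
    """Hypothetical stand-in for reading one layer's weights from disk.

    Weights are generated deterministically from the layer index.
    """
    rng = random.Random(index)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]


def apply_layer(weights: list[list[float]], x: list[float]) -> list[float]:
    """Matrix-vector product followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def forward_resident(x: list[float], num_layers: int) -> list[float]:
    """Baseline: all layers are loaded up front and stay in memory."""
    layers = [load_layer(i, len(x)) for i in range(num_layers)]
    for weights in layers:
        x = apply_layer(weights, x)
    return x


def forward_offloaded(x: list[float], num_layers: int) -> list[float]:
    """Layered offloading: only one layer's weights are held at a time."""
    for i in range(num_layers):
        weights = load_layer(i, len(x))  # bring in just this layer
        x = apply_layer(weights, x)
        del weights                      # release before loading the next
    return x


if __name__ == "__main__":
    inputs = [1.0, 0.5, -0.25, 2.0]
    print(forward_offloaded(inputs, num_layers=3) == forward_resident(inputs, num_layers=3))
```

Both paths compute the same result; the offloaded one trades speed for memory, since every forward pass reloads each layer, so peak memory is about one layer rather than the whole model.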

What you need to run it

See GPU prices to buy a card, hosting to rent one by the hour, and GPU programming to understand the libraries underneath. VRAM is the deciding factor: check each model's card for its memory needs.