Audio, music & speech
Music generation, text-to-speech, voice cloning and speech-to-text. Small open models punch above their weight here — Kokoro TTS and MusicGen run on modest GPUs, and several run in the browser via WebGPU.
Providers
The leading hosted services — sign up and use them via app or API.
| Provider | From | Strengths | Access |
|---|---|---|---|
| Suno | Suno | Full songs with vocals | App · API |
| Udio | Udio | High-fidelity music | App |
| ElevenLabs | ElevenLabs | Best-in-class TTS & voice cloning | API · App |
| OpenAI audio | OpenAI | TTS, transcription, realtime voice | API |
| Lyria / MusicFX | Google | Music generation | App |
| AssemblyAI | AssemblyAI | Accurate, realtime speech-to-text | API |
| Soniox | Soniox | Multilingual speech-to-text | API |
Open-source tools
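Run these yourself on a local or rented GPU. Open weights are free to download, keep your data on your own hardware, and can be fine-tuned, though license terms vary, so check each model card before commercial use.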
What you need to run it
See GPU prices to buy a card, hosting to rent one by the hour, and GPU programming to understand the libraries underneath. VRAM is the deciding factor — check each tool's model card for its memory needs.
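As a rough first check before reading a model card, weight memory can be estimated as parameter count times bytes per parameter. The sketch below is only a back-of-the-envelope aid: the function names and the 1.5x `overhead` multiplier (standing in for activations, caches, and runtime buffers) are assumptions made for illustration, not figures from any tool's documentation. Real usage depends on the runtime, batch size, and audio length, so the model card remains the authority.

```python
# Back-of-the-envelope VRAM check: weights = parameters x bytes per parameter.
# The names and the default overhead factor are illustrative assumptions.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "fp16") -> float:
    """Memory needed just to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


def fits(num_params: float, vram_gib: float, dtype: str = "fp16",
         overhead: float = 1.5) -> bool:
    """True if weights times an assumed overhead factor fit in the given VRAM."""
    return weight_memory_gib(num_params, dtype) * overhead <= vram_gib


if __name__ == "__main__":
    params = 1.5e9  # hypothetical 1.5-billion-parameter model
    print(f"fp16 weights: {weight_memory_gib(params):.2f} GiB")
    print("fits in 8 GiB:", fits(params, 8))
    print("fits in 4 GiB:", fits(params, 4))
```

Halving the precision (for example fp32 to fp16) halves the weight footprint, which is often the difference between a model fitting on a consumer card or not.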