Audio, music & speech

Music generation, text-to-speech, voice cloning and speech-to-text. Small open models punch above their weight here — Kokoro TTS and MusicGen run on modest GPUs, and several run in the browser via WebGPU.

Providers

The leading hosted services — sign up and use them via app or API.

Provider	From	Strengths	Access
Suno	Suno	Full songs with vocals	App · API
Udio	Udio	High-fidelity music	App
ElevenLabs	ElevenLabs	Best-in-class TTS & voice cloning	App · API
OpenAI audio	OpenAI	TTS, transcription, realtime voice	API
Lyria / MusicFX	Google	Music generation	App
AssemblyAI	AssemblyAI	Accurate, realtime speech-to-text	API
Soniox	Soniox	Multilingual speech-to-text	API

Open-source tools

Run these yourself on a local or rented GPU. Open weights are free to download, keep your audio on hardware you control, and can be finetuned. Check each model's license before commercial use.

MusicGen / AudioCraft

Meta's open music generation models, which can be finetuned on your own audio to target a particular style.

music · open

ACE-Step

Open music-generation foundation model — a step toward Suno-class open music.

music · open

YuE

Open full-song generation with vocals, similar in spirit to Suno.

music · open

Kokoro TTS

An 82M-parameter TTS model with great quality; runs in-browser via WebGPU.

TTS · tiny

Whisper

OpenAI's open-weight speech-to-text model and the de facto standard for open transcription.

STT · open
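As a rough illustration of local transcription with Whisper, the sketch below turns an audio file into SRT subtitles. It assumes the `openai-whisper` package and `ffmpeg` are installed. `whisper.load_model` and `model.transcribe` are the package's documented entry points, while `srt_timestamp`, `segments_to_srt` and `transcribe_to_srt` are hypothetical helper names introduced here for illustration only.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Convert Whisper-style segments (dicts with start, end, text) to SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_name: str = "base") -> str:
    """Transcribe an audio file locally and return the result as SRT text."""
    # Imported lazily so the helpers above work without the package installed.
    import whisper  # assumption: installed via `pip install openai-whisper`

    model = whisper.load_model(model_name)   # downloads weights on first use
    result = model.transcribe(audio_path)    # dict with "text" and "segments"
    return segments_to_srt(result["segments"])
```

Calling `transcribe_to_srt("talk.mp3")` (the file name is a placeholder) would return subtitle text ready to write to a `.srt` file. Larger model names trade speed and memory for accuracy.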

CSM (Sesame)

A conversational speech-generation model for natural dialogue.

TTS · open

Voxtral

Mistral's open speech model — realtime transcription, runs via WebGPU.

STT · open

Piper

Fast, local neural TTS designed for the Raspberry Pi and other edge devices.

TTS · edge

What you need to run it

See GPU prices to buy a card, hosting to rent one by the hour, and GPU programming to understand the libraries underneath. VRAM is the deciding factor — check each tool's model card for its memory needs.
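To get a first feel for memory needs before reading a model card, you can estimate the size of the weights alone from the parameter count. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes memory is parameters times bytes per parameter, and it ignores activations, caches and runtime overhead, which add more on top. The function name and the precision table are illustrative. Only the 82M figure for Kokoro comes from the listing above.

```python
# Bytes used to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_mb(num_params: int, dtype: str = "fp16") -> float:
    """Approximate memory for the weights alone, in megabytes (10^6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


# Kokoro TTS at 82M parameters: weights only, excluding runtime overhead.
print(weight_memory_mb(82_000_000, "fp16"))  # 164.0
print(weight_memory_mb(82_000_000, "fp32"))  # 328.0
```

Treat the result as a lower bound. The model card remains the authority on what a tool actually needs at inference time.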