Audio, music & speech

Music generation, text-to-speech, voice cloning and speech-to-text. Small open models punch above their weight here: Kokoro TTS and MusicGen run on modest GPUs, and several models run in the browser via WebGPU.

Providers

The leading hosted services — sign up and use them via app or API.

Provider | From | Strengths | Access
Suno | Suno | Full songs with vocals | App · API
Udio | Udio | High-fidelity music | App
ElevenLabs | ElevenLabs | Best-in-class TTS & voice cloning | API · App
OpenAI audio | OpenAI | TTS, transcription, realtime voice | API
Lyria / MusicFX | Google | Music generation | App
AssemblyAI | AssemblyAI | Accurate, realtime speech-to-text | API
Soniox | Soniox | Multilingual speech-to-text | API

Open-source tools

Run these yourself on a local or rented GPU. Open weights are typically free to download (check each model's license), keep your audio on your own hardware, and can be fine-tuned.

Meta's open music generation models, finetunable to any style.

music · open

Open music-generation foundation model — a step toward Suno-class open music.

music · open

Open full-song generation with vocals, similar in spirit to Suno.

music · open

An 82M-parameter TTS model with great quality; runs in-browser via WebGPU.

TTS · tiny

OpenAI's open speech-to-text — the de-facto open transcription model.

STT · open

A conversational speech-generation model for natural dialogue.

TTS · open

Mistral's open speech model — realtime transcription, runs via WebGPU.

STT · open

Fast, local neural TTS designed for the Raspberry Pi and edge devices.

TTS · edge

What you need to run it

See GPU prices to buy a card, hosting to rent one by the hour, and GPU programming to understand the libraries underneath. VRAM is the deciding factor — check each tool's model card for its memory needs.
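As a rough first check before reading a model card, the memory taken by the weights alone can be estimated from the parameter count and the numeric precision. The sketch below is a back-of-the-envelope calculation, not a measurement: the helper name `weight_memory_mb` is hypothetical, it counts weights only, and real usage will be higher once activations, audio buffers and framework overhead are included. The 82M figure is the TTS model size quoted in the list above.

```python
def weight_memory_mb(num_params: int, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in megabytes (10**6 bytes).

    Assumption: every parameter is stored at the same precision.
    Activations and runtime overhead are NOT included.
    """
    return num_params * bytes_per_param / 1e6


# Hypothetical comparison across common precisions for an 82M-parameter model.
for label, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{label}: {weight_memory_mb(82_000_000, nbytes):.0f} MB")
# prints:
# fp32: 328 MB
# fp16: 164 MB
# int8: 82 MB
```

Treat the result as a lower bound. If the estimate is already close to a card's VRAM, the model is unlikely to fit comfortably, and the figure on the model card takes precedence.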