README


Echo TTS Studio

Local text-to-speech with voice cloning, video dubbing, and voice editing.

Install with Pinokio

EchoStudio

What is this?

A one-click installable studio built on the Echo-TTS project by Jordan Darefsky, enhanced for creators who want voice cloning, video dubbing, and voice editing without the setup headaches.

Model: jordand/echo-tts-base | Blog: echo-tts blog post

🖥️ Requirements

  • VRAM: 12 GB minimum (NVIDIA GPU recommended)
  • Platform: Windows · Linux · macOS
  • Install: One click via Pinokio — handles Python, dependencies, and model downloads automatically

Getting Started

  1. Install Pinokio if you haven't already
  2. Click the install badge above, or search "EchoStudio" in the Pinokio app
  3. Hit Install → Start → done

Features

TTS

  • Voice cloning from reference audio
  • Multi-speaker support (S1/S2 tagging)
  • Long-form generation with automatic text chunking and crossfade stitching
  • Sampler presets and full control over CFG guidance, sampling style, and KV scaling

Dub

  • Upload video, extract audio, and transcribe/translate with Whisper
  • Editable transcript with segment timing
  • Re-voice with TTS using cloned or saved voices
  • Preserve background audio — AI source separation mixes ambient/background with the new TTS voice
  • Multi-speaker dubbing with S1/S2 tags
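The "preserve background audio" step boils down to summing the separated background track with the new TTS voice. A minimal sketch of that mixing stage (not the app's actual implementation; function name and gain defaults are assumptions):

```python
import numpy as np

def mix_dub(background, tts_voice, bg_gain=0.8, voice_gain=1.0):
    """Mix a separated background track with a new TTS voice track (sketch).

    Pads the shorter track with silence so lengths match, then sums
    with per-track gains and rescales only if the mix would clip.
    """
    n = max(len(background), len(tts_voice))
    bg = np.pad(background.astype(np.float32), (0, n - len(background)))
    vo = np.pad(tts_voice.astype(np.float32), (0, n - len(tts_voice)))
    mixed = bg_gain * bg + voice_gain * vo
    peak = np.max(np.abs(mixed)) if n else 0.0
    # Normalize down only when the summed signal exceeds full scale
    return mixed / peak if peak > 1.0 else mixed
```

In practice the background comes out of the source-separation step and the voice track out of TTS generation, so both are already aligned to the same sample rate.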

Voices

  • Upload audio or video files as voice sources
  • Edit saved voices directly via "Send to Edit"
  • Clip, trim silence, adjust speed, and normalize volume
  • Vocal isolation — separate clean vocals from noisy recordings (BS-Roformer, MDX-Net via audio-separator)
  • Background isolation for extracting ambience/music
  • Save edited voices as named profiles with cached speaker latents
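Two of the voice-editing operations above, silence trimming and volume normalization, can be sketched in a few lines of NumPy (an illustrative sketch, assuming mono float audio in [-1, 1]; threshold and peak values are assumptions, not the app's defaults):

```python
import numpy as np

def trim_silence(audio, threshold=0.01):
    """Trim leading/trailing samples whose magnitude is below `threshold`."""
    loud = np.flatnonzero(np.abs(audio) > threshold)
    if loud.size == 0:
        return audio[:0]  # all silence
    return audio[loud[0]:loud[-1] + 1]

def normalize(audio, peak=0.95):
    """Scale audio so its peak amplitude equals `peak`."""
    m = np.max(np.abs(audio)) if len(audio) else 0.0
    return audio if m == 0 else audio * (peak / m)
```

Cleaning reference clips this way before cloning generally gives the model a stronger, more consistent speaker signal.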

Settings

  • Theme selection, memory management, custom output directory, temp file cleanup

Tips

Generation Length

Echo generates up to 30 seconds of audio per chunk. Longer text is automatically split and stitched with configurable silence gaps and crossfade. Shorter text produces shorter outputs naturally.
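The split-and-stitch step can be sketched as a simple linear crossfade between consecutive chunks (an illustrative sketch, not Echo's actual implementation; the sample rate and fade length are assumptions):

```python
import numpy as np

def crossfade_stitch(chunks, sample_rate=44100, fade_ms=50):
    """Join generated audio chunks with a linear crossfade (sketch)."""
    fade = int(sample_rate * fade_ms / 1000)
    out = chunks[0].astype(np.float32)
    for nxt in chunks[1:]:
        nxt = nxt.astype(np.float32)
        ramp = np.linspace(0.0, 1.0, fade)
        # Fade the tail of `out` down while fading the head of `nxt` up
        overlap = out[-fade:] * (1.0 - ramp) + nxt[:fade] * ramp
        out = np.concatenate([out[:-fade], overlap, nxt[fade:]])
    return out
```

A longer fade hides chunk boundaries better but shortens the total output slightly, which is the trade-off the crossfade setting controls.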

Reference Audio

Up to 5 minutes of reference audio is supported, but shorter clips (10 seconds or less) work well too. Use the Voices tab to clip, clean, and isolate vocals from noisy recordings.

Force Speaker (KV Scaling)

If the model generates a different speaker than expected, enable "Force Speaker" (default scale 1.5). Aim for the lowest scale that produces the correct speaker.

Text Prompt Format

Use [S1] and [S2] for speaker tags. Expression markers like (laughs), (angry), (whispering) control tone. Commas function as pauses.
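For example, a two-speaker prompt using these conventions might look like this (illustrative text, not from the project docs):

```
[S1] Welcome back, everyone. (laughs) I can't believe we actually made it.
[S2] (whispering) Neither can I. Honestly, it still feels unreal.
[S1] Well, take a breath, because the next part gets harder.
```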

Responsible Use

Don't use this model to impersonate real people without consent or generate deceptive audio. You are responsible for complying with local laws regarding biometric data and voice cloning.

License

Code in this repo is MIT-licensed except where file headers specify otherwise (e.g., autoencoder.py is Apache-2.0).

Audio outputs are CC-BY-NC-SA-4.0 due to the dependency on the Fish Speech S1-DAC autoencoder. Echo-TTS weights are released under CC-BY-NC-SA-4.0.

Citation

@misc{darefsky2025echo,
    author = {Darefsky, Jordan},
    title = {Echo-TTS},
    year = {2025},
    url = {https://jordandarefsky.com/blog/2025/echo/}
}
