Pinokio

Local-first audiobook production with cloned voices, chapter-based editing, segment regeneration, narrator + character casting, and final export in one web app. XTTS runs fully local by default. Voxtral is available as an optional cloud voice path with your own Mistral API key if you want another engine for specific voices. Voice profiles can use different engines in the same project, so you can mix narrator and character workflows without leaving Audiobook Studio. Built for iterative production: preview voices, regenerate only changed sections, queue chapter or segment work, and assemble the finished audiobook when everything is ready. Note: enabling Voxtral sends synthesis text and selected reference audio to Mistral. Learn more: https://senigami.github.io/audiobook-studio/
3 check-ins
Multi-Voice Text-to-Speech for Stories and Audiobooks. Supports Kokoro and Chatterbox TTS engines with GPU acceleration.
A multi-voice AI audiobook generator built on Qwen3-TTS: annotate scripts with an LLM, assign unique voices to each character, apply per-line style instructions for delivery, clone voices from reference audio, design new voices from text descriptions, train custom voices with LoRA fine-tuning, and export to MP3 or Audacity multi-track projects.
4 check-ins
Platforms: GPU (NVIDIA, AMD, Apple)
Wan2GP (Featured)
Super-optimized Gradio UI for AI video creation on GPU-poor machines (6 GB+ VRAM). Supports Wan 2.1/2.2, Qwen, Hunyuan Video, LTX Video, and Flux. https://github.com/deepbeepmeep/Wan2GP
81 check-ins
Platforms: GPU (NVIDIA, AMD, Apple)
Qwen3-TTS (Featured)
Qwen3-TTS is an open-source series of TTS models developed by the Qwen team.
16 check-ins
Platforms: GPU (NVIDIA, AMD, Apple)
High-quality text-to-speech with a beautiful web UI and API, optimized for Apple Silicon using MLX. Features include Custom Voice (preset speakers), Voice Design (natural-language voice creation), and Voice Cloning, plus enhanced support for saving custom voices and long-form/endless TTS streaming.
50 check-ins
Platforms: GPU (NVIDIA, AMD, Apple)
Kokoro, KittenTTS, Higgs Audio, Chatterbox/Multi, Fish-Speech, F5, index-tts, indextts2, VoxCPM, and VibeVoice in one app.
19 check-ins
Platforms: GPU (NVIDIA, AMD, Apple)
Voicebox (Featured)
Local-first voice synthesis studio powered by Qwen3-TTS.
19 check-ins
Platforms: GPU (NVIDIA, AMD, Apple)
ComfyUI (Featured)
The most powerful and modular diffusion-model GUI, API, and backend, with a graph/nodes interface. https://github.com/comfyanonymous/ComfyUI
38 check-ins
Platforms: GPU (NVIDIA, AMD, Apple)
e2-f5-tts (Featured)
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching https://huggingface.co/spaces/mrfakename/E2-F5-TTS
[NVIDIA only] AllTalk-TTS is a unified UI for E5-TTS, XTTS, Vite TTS, Piper TTS, Parler TTS, and RVC, based on CoquiTTS, and includes a fine-tuning mode.
Kokoro, KittenTTS, Higgs Audio, Chatterbox/Multi, Fish-Speech, F5, index-tts, indextts2, VoxCPM, and VibeVoice in one app.
7 check-ins
Platforms: GPU (NVIDIA, AMD, Apple)
Orpheus TTS is an open-source text-to-speech system built on a Llama-3B backbone. Orpheus demonstrates the emergent capabilities of using LLMs for speech synthesis. https://github.com/canopyai/Orpheus-TTS
Qwen3-TTS is an open-source series of TTS models developed by the Qwen team.
9 check-ins
Platforms: GPU (NVIDIA, AMD, Apple)
Clone voices into different languages using just a quick 3-second audio clip (a local version of https://huggingface.co/spaces/coqui/xtts).
3 check-ins
Platforms: GPU (NVIDIA, AMD, Apple)
A Gradio-based web interface for the LuxTTS voice-cloning and text-to-speech model. Generate customized speech from text using uploaded or recorded audio references, with adjustable parameters such as speed, guidance scale, and inference steps.
Real-time streaming TTS demo using microsoft/VibeVoice-Realtime-0.5B.
3 check-ins
OpenAudio (Featured)
Multilingual Text-to-Speech with Voice Cloning (Supports: English, Japanese, Korean, Chinese, French, German, Arabic, and Spanish) https://github.com/fishaudio/fish-speech
Local multimodal app powered by Liquid AI LFM2.5-Audio-1.5B and LFM2.5-VL-1.6B models, delivering real-time voice chat, text-to-speech synthesis, long-form audio transcription, and multi-image vision reasoning.
All-in-one Gradio interface for Chatterbox: voice cloning from uploaded audio samples, automatic text processing for long content, and real-time speech generation with configurable parameters. (Minimum 4 GB VRAM; 8 GB recommended.)