Qwen3-TTS is a speech generation model that supports voice cloning, voice design, ultra-high-quality human-like speech synthesis, and voice control through natural-language instructions. It gives developers and users a broad set of speech generation features. At MVSep, we use the largest variant of the model, which has 1.7 billion parameters.
Original model page: https://github.com/QwenLM/Qwen3-TTS
Qwen3-TTS (Voice Cloning) lets you upload a reference audio file and then synthesizes the target text in the voice from that sample. To improve cloning quality, you can optionally provide a transcript of the reference audio in the "Reference text in audio" field. You can also select a language for the model or leave it set to "auto".
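As a rough illustration of how the voice cloning inputs fit together, here is a minimal Python sketch that gathers them into one dictionary. The function name `build_voice_clone_request` and the field names `target_text`, `reference_audio`, `reference_text`, and `language` are hypothetical and chosen only for this example. They are not the MVSep API or the Qwen3-TTS library interface, and the sketch does not send anything over the network.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class VoiceCloneRequest:
    """Hypothetical container for the voice cloning inputs (not a real API)."""
    target_text: str                       # text to be spoken in the cloned voice
    reference_audio: str                   # path to the uploaded reference audio file
    reference_text: Optional[str] = None   # optional transcript of the reference audio
    language: str = "auto"                 # "auto" leaves the language unspecified


def build_voice_clone_request(target_text, reference_audio,
                              reference_text=None, language="auto"):
    """Validate the inputs and return them as a plain dictionary."""
    if not target_text or not target_text.strip():
        raise ValueError("target_text must not be empty")
    if not reference_audio or not str(reference_audio).strip():
        raise ValueError("reference_audio must point to an audio file")

    # Treat a blank transcript as "not provided", since the field is optional.
    if reference_text is not None:
        reference_text = reference_text.strip() or None

    # Fall back to "auto" when no language is selected.
    language = (language or "").strip() or "auto"

    request = VoiceCloneRequest(
        target_text=target_text.strip(),
        reference_audio=str(reference_audio),
        reference_text=reference_text,
        language=language,
    )
    return asdict(request)
```

Calling `build_voice_clone_request("Hello there.", "sample.wav")` returns a dictionary with no transcript and the language set to "auto", which mirrors leaving both optional fields at their defaults.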