# Qwen3-TTS WebUI
> **⚠️ Notice:** This project is largely AI-generated and is currently in an unstable state. Stable releases will be published in the [Releases](../../releases) section.

**Unofficial** text-to-speech web application based on Qwen3-TTS, supporting custom voice, voice design, and voice cloning with an intuitive interface.

> This is an unofficial project. For the official Qwen3-TTS repository, please visit [QwenLM/Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS).

[中文文档](./README.zh.md)

## Features
- Custom Voice: Predefined speaker voices
- Voice Design: Create voices from natural-language descriptions
- Voice Cloning: Clone voices from uploaded audio
- **IndexTTS2**: High-quality voice cloning with emotion control (happy, angry, sad, fear, surprise, etc.), powered by [IndexTTS2](https://github.com/iszhanjiawei/indexTTS2)
- Audiobook Generation: Upload EPUB files and generate multi-character audiobooks with LLM-powered character extraction and voice assignment; supports IndexTTS2 per character
- Dual Backend Support: Switch between the local model and the Aliyun TTS API
- Multi-language Support: English, 简体中文, 繁體中文, 日本語, 한국어
- JWT authentication, async task processing, voice caching, and dark mode

## Interface Preview
### Desktop - Light Mode

![Light Mode](./images/desktop-lightmode-custom.png)

### Desktop - Dark Mode

![Dark Mode](./images/desktop-darkmode-clone.png)

### Mobile

<table>
<tr>
<td width="50%"><img src="./images/mobile-lightmode-custom.png" alt="Mobile Light Mode" /></td>
<td width="50%"><img src="./images/mobile-settings.png" alt="Mobile Settings" /></td>
</tr>
</table>

### Audiobook Generation

![Audiobook](./images/audiobook-overview.png)

<table>
<tr>
<td width="50%"><img src="./images/audiobook-characters.png" alt="Audiobook Characters" /></td>
<td width="50%"><img src="./images/audiobook-chapters.png" alt="Audiobook Chapters" /></td>
</tr>
</table>

## Tech Stack
**Backend**: FastAPI + SQLAlchemy + PyTorch + JWT

- Direct PyTorch inference with Qwen3-TTS models
- Async task processing with batch optimization
- Local model support + Aliyun API integration

**Frontend**: React 19 + TypeScript + Vite + Tailwind + Shadcn/ui

## Docker Deployment
Pre-built images are available on Docker Hub: [bdim404/qwen3-tts-backend](https://hub.docker.com/r/bdim404/qwen3-tts-backend), [bdim404/qwen3-tts-frontend](https://hub.docker.com/r/bdim404/qwen3-tts-frontend)

**Prerequisites**: Docker, Docker Compose, and (for GPU inference) an NVIDIA GPU with the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)

```bash
git clone https://github.com/bdim404/Qwen3-TTS-WebUI.git
cd Qwen3-TTS-WebUI

# Download models to docker/models/ (see Installation > Download Models below)
mkdir -p docker/models docker/data

# Configure
cp docker/.env.example docker/.env
# Edit docker/.env and set SECRET_KEY

cd docker

# Pull pre-built images
docker compose pull

# Start (CPU only)
docker compose up -d

# Start (with GPU)
docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d
```
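`SECRET_KEY` signs the JWT session tokens, so it should be a long random string. One way to generate one is with Python's standard library (a sketch, not a project-provided command):

```python
# Generate a random value suitable for SECRET_KEY using only the stdlib.
import secrets

key = secrets.token_urlsafe(48)  # 48 random bytes, URL-safe base64 encoded
print(key)                       # paste this into docker/.env as SECRET_KEY=...
```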
Access the application at `http://localhost`. Default credentials: `admin` / `admin123456`.

## Installation

### Prerequisites

- Python 3.9+ with CUDA support (for local model inference)
- Node.js 18+ (for the frontend)
- Git

### 1. Clone Repository

```bash
git clone https://github.com/bdim404/Qwen3-TTS-WebUI.git
cd Qwen3-TTS-WebUI
```

### 2. Download Models

**Important**: Models are **not** downloaded automatically. You need to download them manually first.

For more details, visit the official repository: [Qwen3-TTS Models](https://github.com/QwenLM/Qwen3-TTS)

Navigate to the models directory:

```bash
# Docker deployment
mkdir -p docker/models && cd docker/models

# Local deployment
cd qwen3-tts-backend && mkdir -p Qwen && cd Qwen
```

**Option 1: Download through ModelScope (Recommended for users in Mainland China)**

```bash
pip install -U modelscope

modelscope download --model Qwen/Qwen3-TTS-Tokenizer-12Hz --local_dir ./Qwen3-TTS-Tokenizer-12Hz
modelscope download --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --local_dir ./Qwen3-TTS-12Hz-1.7B-CustomVoice
modelscope download --model Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --local_dir ./Qwen3-TTS-12Hz-1.7B-VoiceDesign
modelscope download --model Qwen/Qwen3-TTS-12Hz-1.7B-Base --local_dir ./Qwen3-TTS-12Hz-1.7B-Base
```

Optional 0.6B models (smaller, faster):

```bash
modelscope download --model Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice --local_dir ./Qwen3-TTS-12Hz-0.6B-CustomVoice
modelscope download --model Qwen/Qwen3-TTS-12Hz-0.6B-Base --local_dir ./Qwen3-TTS-12Hz-0.6B-Base
```

**Option 2: Download through Hugging Face**

```bash
pip install -U "huggingface_hub[cli]"

hf download Qwen/Qwen3-TTS-Tokenizer-12Hz --local-dir ./Qwen3-TTS-Tokenizer-12Hz
hf download Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --local-dir ./Qwen3-TTS-12Hz-1.7B-CustomVoice
hf download Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --local-dir ./Qwen3-TTS-12Hz-1.7B-VoiceDesign
hf download Qwen/Qwen3-TTS-12Hz-1.7B-Base --local-dir ./Qwen3-TTS-12Hz-1.7B-Base
```

Optional 0.6B models (smaller, faster):

```bash
hf download Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice --local-dir ./Qwen3-TTS-12Hz-0.6B-CustomVoice
hf download Qwen/Qwen3-TTS-12Hz-0.6B-Base --local-dir ./Qwen3-TTS-12Hz-0.6B-Base
```

**IndexTTS2 Model (optional, for emotion-controlled voice cloning)**

IndexTTS2 is an optional feature. Only download these files if you want to use it. Navigate to the same `Qwen/` directory and run:

```bash
# Only the required files — no need to download the full repository
hf download IndexTeam/IndexTTS-2 \
  bpe.model config.yaml feat1.pt feat2.pt gpt.pth s2mel.pth wav2vec2bert_stats.pt \
  --local-dir ./IndexTTS2
```

Then install the `indextts` package:

```bash
git clone https://github.com/iszhanjiawei/indexTTS2.git
cd indexTTS2
pip install -e . --no-deps
cd ..
```

**Final directory structure:**

Docker deployment (`docker/models/`):

```
Qwen3-TTS-WebUI/
└── docker/
    └── models/
        ├── Qwen3-TTS-Tokenizer-12Hz/
        ├── Qwen3-TTS-12Hz-1.7B-CustomVoice/
        ├── Qwen3-TTS-12Hz-1.7B-VoiceDesign/
        └── Qwen3-TTS-12Hz-1.7B-Base/
```

Local deployment (`qwen3-tts-backend/Qwen/`):

```
Qwen3-TTS-WebUI/
└── qwen3-tts-backend/
    └── Qwen/
        ├── Qwen3-TTS-Tokenizer-12Hz/
        ├── Qwen3-TTS-12Hz-1.7B-CustomVoice/
        ├── Qwen3-TTS-12Hz-1.7B-VoiceDesign/
        ├── Qwen3-TTS-12Hz-1.7B-Base/
        └── IndexTTS2/          ← optional, for IndexTTS2 feature
            ├── bpe.model
            ├── config.yaml
            ├── feat1.pt
            ├── feat2.pt
            ├── gpt.pth
            ├── s2mel.pth
            └── wav2vec2bert_stats.pt
```

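Before starting the backend, it can help to verify that every expected model folder is actually in place. A minimal sketch (the folder names follow the structure above; `missing_models` is a hypothetical helper for illustration, not part of this project):

```python
# Report which expected model folders are missing under a base directory.
from pathlib import Path

EXPECTED = [
    "Qwen3-TTS-Tokenizer-12Hz",
    "Qwen3-TTS-12Hz-1.7B-CustomVoice",
    "Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    "Qwen3-TTS-12Hz-1.7B-Base",
]

def missing_models(base: str, expected=EXPECTED) -> list:
    """Return the expected model folder names not present under `base`."""
    root = Path(base)
    return [name for name in expected if not (root / name).is_dir()]

if __name__ == "__main__":
    missing = missing_models("qwen3-tts-backend/Qwen")
    if missing:
        print("Missing model folders:", ", ".join(missing))
    else:
        print("All model folders present.")
```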
### 3. Backend Setup

```bash
cd qwen3-tts-backend

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install Qwen3-TTS
pip install qwen-tts

# Create configuration file
cp .env.example .env

# Edit .env file
# For local model: set MODEL_BASE_PATH=./Qwen
# For Aliyun API only: set DEFAULT_BACKEND=aliyun
nano .env  # or use your preferred editor
```

**Important Backend Configuration** (`.env`):

```env
MODEL_DEVICE=cuda:0                  # Use GPU (or cpu for CPU-only)
MODEL_BASE_PATH=./Qwen               # Path to your downloaded models
DEFAULT_BACKEND=local                # 'local' for local models, 'aliyun' for the API
DATABASE_URL=sqlite:///./qwen_tts.db
SECRET_KEY=your-secret-key-here      # Change this!
```

Start the backend server:

```bash
# Using uvicorn directly
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# Or using conda (if you prefer)
conda run -n qwen3-tts uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

Verify the backend is running:

```bash
curl http://127.0.0.1:8000/health
```

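The same check can be done from Python with only the standard library (a sketch; it assumes the backend exposes `/health` on port 8000, as the curl command above does):

```python
# Query the backend health endpoint and report whether it answered.
import urllib.request
import urllib.error

def check_health(url: str = "http://127.0.0.1:8000/health", timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    print("backend up" if check_health() else "backend unreachable")
```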
### 4. Frontend Setup

```bash
cd qwen3-tts-frontend

# Install dependencies
npm install

# Create configuration file
cp .env.example .env

# Start development server
npm run dev
```

### 5. Access the Application

Open your browser and visit `http://localhost:5173`.

**Default Credentials**:

- Username: `admin`
- Password: `admin123456`
- **IMPORTANT**: Change the password immediately after first login!

### Production Build

For production deployment:

```bash
# Backend: FastAPI is an ASGI app, so run gunicorn with uvicorn workers
cd qwen3-tts-backend
gunicorn main:app -w 4 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:8000

# Frontend: Build static files
cd qwen3-tts-frontend
npm run build
# Serve the 'dist' folder with nginx or another web server
```

## Configuration

### Backend Configuration

Backend `.env` key settings:

```env
SECRET_KEY=your-secret-key
MODEL_DEVICE=cuda:0
MODEL_BASE_PATH=../Qwen
DATABASE_URL=sqlite:///./qwen_tts.db

DEFAULT_BACKEND=local

ALIYUN_REGION=beijing
ALIYUN_MODEL_FLASH=qwen3-tts-flash-realtime
ALIYUN_MODEL_VC=qwen3-tts-vc-realtime-2026-01-15
ALIYUN_MODEL_VD=qwen3-tts-vd-realtime-2026-01-15
```

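For tooling or scripts outside the app, settings like these can be read with a few lines of stdlib Python (a sketch; the backend itself loads `.env` through its own configuration machinery):

```python
# Minimal .env parser: KEY=VALUE lines; '#' comments and blank lines ignored.
def parse_env(text: str) -> dict:
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip inline comments
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings

example = """
SECRET_KEY=your-secret-key
MODEL_DEVICE=cuda:0   # Use GPU
DEFAULT_BACKEND=local
"""
print(parse_env(example)["MODEL_DEVICE"])  # cuda:0
```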
**Backend Options:**

- `DEFAULT_BACKEND`: Default TTS backend, options: `local` or `aliyun`
  - **Local Mode**: Uses the local Qwen3-TTS model (requires `MODEL_BASE_PATH` configuration)
  - **Aliyun Mode**: Uses the Aliyun TTS API (requires users to configure their API keys in settings)

**Aliyun Configuration:**

- Users need to add their Aliyun API keys on the settings page of the web interface
- API keys are encrypted and stored securely in the database
- The superuser can enable/disable local model access for all users
- To obtain an Aliyun API key, visit the [Aliyun Console](https://dashscope.console.aliyun.com/)

## Usage

### Switching Between Backends

1. Log in to the web interface
2. Navigate to the Settings page
3. Configure your preferred backend:
   - **Local Model**: Select "本地模型" (Local Model; requires local model access to be enabled by the superuser)
   - **Aliyun API**: Select "阿里云" (Aliyun) and add your API key
4. The selected backend will be used for all TTS operations by default
5. You can also specify a different backend per request using the `backend` parameter in the API
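As a sketch of what a per-request override might look like (the field names and values here are assumptions for illustration, not the project's documented API schema):

```python
# Build a TTS request body with an explicit per-request backend override.
import json

def build_tts_request(text: str, backend: str = "local") -> str:
    """Return a JSON request body; `backend` is 'local' or 'aliyun'."""
    if backend not in ("local", "aliyun"):
        raise ValueError(f"unknown backend: {backend}")
    return json.dumps({"text": text, "backend": backend})

body = build_tts_request("Hello, world", backend="aliyun")
print(body)  # {"text": "Hello, world", "backend": "aliyun"}
```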
### Managing Aliyun API Key

1. On the Settings page, find the "阿里云 API 密钥" (Aliyun API Key) section
2. Enter your Aliyun API key
3. Click "更新密钥" (Update Key) to save and validate
4. The system will verify the key before saving
5. You can delete the key at any time using the delete button

## Acknowledgments

This project is built upon the excellent work of the official [Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS) repository by the Qwen Team at Alibaba Cloud. Special thanks to the Qwen Team for open-sourcing such a powerful text-to-speech model.

## License

Apache-2.0 license