AJ50 committed
Commit 5008b66 · 1 Parent(s): 95def52

Initial voice cloning backend with all dependencies

This view is limited to 50 files because it contains too many changes. See the raw diff for the full changeset.
Files changed (50)
  1. .dockerignore +18 -0
  2. .gitignore +62 -0
  3. Dockerfile +25 -0
  4. README.md +570 -10
  5. backend/.env.example +15 -0
  6. backend/__init__.py +1 -0
  7. backend/app/__init__.py +34 -0
  8. backend/app/routes.py +409 -0
  9. backend/app/vocoder/audio.py +108 -0
  10. backend/app/vocoder/display.py +127 -0
  11. backend/app/vocoder/distribution.py +132 -0
  12. backend/app/vocoder/hparams.py +44 -0
  13. backend/app/vocoder/inference.py +83 -0
  14. backend/app/vocoder/models/fatchord_version.py +434 -0
  15. backend/app/voice_cloning.py +108 -0
  16. backend/download_models.py +54 -0
  17. backend/encoder/__init__.py +0 -0
  18. backend/encoder/audio.py +117 -0
  19. backend/encoder/inference.py +178 -0
  20. backend/encoder/model.py +135 -0
  21. backend/encoder/params_data.py +29 -0
  22. backend/encoder/params_model.py +11 -0
  23. backend/enrolled_voices/voice_26bfa1ef.mp3 +0 -0
  24. backend/enrolled_voices/voice_72beeda9.mp3 +0 -0
  25. backend/enrolled_voices/voices.json +100 -0
  26. backend/requirements.txt +14 -0
  27. backend/runtime.txt +1 -0
  28. backend/synthesizer/__init__.py +1 -0
  29. backend/synthesizer/audio.py +211 -0
  30. backend/synthesizer/hparams.py +92 -0
  31. backend/synthesizer/inference.py +165 -0
  32. backend/synthesizer/models/tacotron.py +542 -0
  33. backend/synthesizer/utils/__init__.py +45 -0
  34. backend/synthesizer/utils/cleaners.py +88 -0
  35. backend/synthesizer/utils/numbers.py +69 -0
  36. backend/synthesizer/utils/symbols.py +17 -0
  37. backend/synthesizer/utils/text.py +75 -0
  38. backend/wsgi.py +15 -0
  39. frontend/.env.development +4 -0
  40. frontend/.env.production +2 -0
  41. frontend/.gitignore +99 -0
  42. frontend/README.md +111 -0
  43. frontend/components.json +20 -0
  44. frontend/eslint.config.js +29 -0
  45. frontend/index.html +24 -0
  46. frontend/package-lock.json +0 -0
  47. frontend/package.json +88 -0
  48. frontend/postcss.config.js +6 -0
  49. frontend/public/placeholder.svg +1 -0
  50. frontend/public/robots.txt +14 -0
.dockerignore ADDED
@@ -0,0 +1,18 @@
+ __pycache__
+ *.pyc
+ .git
+ .env
+ .env.local
+ node_modules
+ dist
+ build
+ .DS_Store
+ *.log
+ .vscode
+ .idea
+ *.egg-info
+ .pytest_cache
+ frontend/node_modules
+ .next
+ .nuxt
+ .cache
.gitignore ADDED
@@ -0,0 +1,62 @@
+ # Model files - downloaded at build time, not stored in git
+ backend/models/default/*.pt
+ models/default/*.pt
+ *.pt
+
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ env/
+ venv/
+ ENV/
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # Environment variables
+ .env
+ .env.local
+ .env.*.local
+ backend/.env
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+ *~
+
+ # OS
+ .DS_Store
+ Thumbs.db
+
+ # Node/Frontend
+ node_modules/
+ dist/
+ .next/
+ out/
+
+ # Build artifacts
+ outputs/
+ temp_uploads/
+ enrolled_voices/*.wav
+ enrolled_voices/*.mp3
+
+ # Cache
+ .cache/
+ .pytest_cache/
Dockerfile ADDED
@@ -0,0 +1,25 @@
+ FROM python:3.10-slim
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     libsndfile1 libsndfile1-dev \
+     ffmpeg \
+     git \
+     && rm -rf /var/lib/apt/lists/*
+
+ WORKDIR /app
+
+ # Copy entire project
+ COPY . .
+
+ # Install Python dependencies
+ RUN pip install --no-cache-dir -r backend/requirements.txt
+
+ # Download models during build
+ RUN cd backend && python download_models.py
+
+ # Expose port (HF Spaces uses 7860)
+ EXPOSE 7860
+
+ # Start the application
+ CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "1", "--timeout", "300", "backend.wsgi:app"]
README.md CHANGED
@@ -1,12 +1,572 @@
  ---
- title: Voice Cloning Backend
- emoji: 💻
- colorFrom: red
- colorTo: green
- sdk: docker
- pinned: false
- license: mit
- short_description: 'AI-powered Voice Cloning '
- ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Real-Time Voice Cloning (RTVC)
+
+ A complete full-stack voice cloning application with React frontend and PyTorch backend that can synthesize speech in anyone's voice from just a few seconds of audio reference.
+
+ [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
+ [![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/)
+ [![React](https://img.shields.io/badge/React-18.0+-61dafb.svg)](https://reactjs.org/)
+ [![TypeScript](https://img.shields.io/badge/TypeScript-5.0+-blue.svg)](https://www.typescriptlang.org/)
+ [![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
+
+ ## Features
+
+ - **Full Stack Application**: Modern React UI + Flask API + PyTorch backend
+ - **Voice Enrollment**: Record or upload voice samples directly in the browser
+ - **Speech Synthesis**: Generate cloned speech with intuitive interface
+ - **Voice Cloning**: Clone any voice with just 3-10 seconds of audio
+ - **Real-Time Generation**: Generate speech at 2-3x real-time speed on CPU
+ - **High Quality**: Natural-sounding synthetic speech using state-of-the-art models
+ - **Easy to Use**: Beautiful UI with 3D visualizations and audio waveforms
+ - **Multiple Formats**: Supports WAV, MP3, M4A, FLAC input audio
+ - **Multi-Language**: Supports English and Hindi text-to-speech
+
+ ## Table of Contents
+
+ - [Demo](#demo)
+ - [Quick Start (Full Stack)](#quick-start-full-stack)
+ - [Deployment](#deployment)
+ - [How It Works](#how-it-works)
+ - [Installation](#installation)
+ - [Project Structure](#project-structure)
+ - [Usage Examples](#usage-examples)
+ - [API Documentation](#api-documentation)
+ - [Troubleshooting](#troubleshooting)
+ - [Technical Details](#technical-details)
+ - [Credits](#credits)
+
+ ## Demo
+
+ **Frontend UI**: Modern React interface with 3D visualizations
+ **Voice Enrollment**: Record/upload voice samples → Backend saves to database
+ **Speech Synthesis**: Select voice + Enter text → Backend generates cloned speech
+ **Playback**: Listen to generated audio directly in browser or download
+
+ ## Quick Start (Full Stack)
+
+ ### Option 1: Using the Startup Script (Easiest)
+
+ ```powershell
+ # Windows PowerShell
+ cd rtvc
+ .\start_app.ps1
+ ```
+
+ This will:
+ 1. Start the Backend API server (port 5000)
+ 2. Start the Frontend dev server (port 8080)
+ 3. Open your browser to http://localhost:8080
+
+ ### Option 2: Manual Start
+
+ **Terminal 1 - Backend API:**
+ ```bash
+ cd rtvc
+ python api_server.py
+ ```
+
+ **Terminal 2 - Frontend:**
+ ```bash
+ cd "rtvc/Frontend Voice Cloning"
+ npm run dev
+ ```
+
+ Then open http://localhost:8080 in your browser.
+
+ ## Deployment
+
+ ### Production Deployment Stack
+
+ **Frontend**: Netlify (Free tier)
+ **Backend**: Render (Free tier)
+ **Models**: HuggingFace Hub (Free)
+
+ See [DEPLOYMENT.md](DEPLOYMENT.md) for complete deployment guide.
+
+ #### Quick Deployment
+
+ 1. **Deploy Backend to Render**
+    - Push to GitHub
+    - Connect Render to GitHub repo
+    - Use `render.yaml` configuration
+    - Models auto-download on first deploy (~10 minutes)
+
+ 2. **Deploy Frontend to Netlify**
+    - Connect Netlify to GitHub repo
+    - Set base directory: `frontend`
+    - Environment: `VITE_API_URL=your-render-backend-url`
+
+ 3. **Test**
+    - Visit your Netlify URL
+    - API calls automatically route to Render backend
+
+ **Pricing**: Free tier for both (with optional paid upgrades)
+
+ ### Using the Application
+
+ 1. **Enroll a Voice**:
+    - Go to "Voice Enrollment" section
+    - Enter a voice name
+    - Record audio (3-10 seconds) or upload a file
+    - Click "Enroll Voice"
+
+ 2. **Generate Speech**:
+    - Go to "Speech Synthesis" section
+    - Select your enrolled voice
+    - Enter text to synthesize
+    - Click "Generate Speech"
+    - Play or download the result
+
+ For detailed integration information, see [INTEGRATION_GUIDE.md](INTEGRATION_GUIDE.md).
+
+ ## How It Works
+
+ The system uses a 3-stage pipeline based on the SV2TTS (Speaker Verification to Text-to-Speech) architecture:
+
+ ```
+ Reference Audio → [Encoder] → Speaker Embedding (256-d vector)
+
+ Text Input → [Synthesizer (Tacotron)] → Mel-Spectrogram
+
+ [Vocoder (WaveRNN)] → Audio Output
+ ```
+
+ ### Pipeline Stages:
+
+ 1. **Speaker Encoder** - Extracts a unique voice "fingerprint" from reference audio
+ 2. **Synthesizer** - Generates mel-spectrograms from text conditioned on speaker embedding
+ 3. **Vocoder** - Converts mel-spectrograms to high-quality audio waveforms
+
+ ## Installation
+
+ ### Prerequisites
+
+ - Python 3.11 or higher
+ - Windows/Linux/macOS
+ - ~2 GB disk space for models
+ - 4 GB RAM minimum (8 GB recommended)
+
+ ### Step 1: Clone the Repository
+
+ ```bash
+ git clone https://github.com/yourusername/rtvc.git
+ cd rtvc
+ ```
+
+ ### Step 2: Install Dependencies
+
+ ```bash
+ pip install torch numpy librosa scipy soundfile webrtcvad tqdm unidecode inflect matplotlib numba
+ ```
+
+ Or install PyTorch with CUDA for GPU acceleration:
+
+ ```bash
+ pip install torch --index-url https://download.pytorch.org/whl/cu118
+ pip install numpy librosa scipy soundfile webrtcvad tqdm unidecode inflect matplotlib numba
+ ```
+
+ ### Step 3: Download Pretrained Models
+
+ Download the pretrained models from [Google Drive](https://drive.google.com/drive/folders/1fU6umc5uQAVR2udZdHX-lDgXYzTyqG_j):
+
+ | Model | Size | Description |
+ |-------|------|-------------|
+ | encoder.pt | 17 MB | Speaker encoder model |
+ | synthesizer.pt | 370 MB | Tacotron synthesizer model |
+ | vocoder.pt | 53 MB | WaveRNN vocoder model |
+
+ Place all three files in the `models/default/` directory.
+
+ ### Step 4: Verify Installation
+
+ ```bash
+ python clone_my_voice.py
+ ```
+
+ If you see errors about missing models, check that all three `.pt` files are in `models/default/`.
+
+ ## Quick Start
+
+ ### Method 1: Simple Script (Recommended)
+
+ 1. Open `clone_my_voice.py`
+ 2. Edit these lines:
+
+ ```python
+ # Your voice sample file
+ VOICE_FILE = r"sample\your_voice.mp3"
+
+ # The text you want to be spoken
+ TEXT_TO_CLONE = """
+ Your text here. Can be multiple sentences or even paragraphs!
+ """
+
+ # Output location
+ OUTPUT_FILE = r"outputs\cloned_voice.wav"
+ ```
+
+ 3. Run it:
+
+ ```bash
+ python clone_my_voice.py
+ ```
+
+ ### Method 2: Command Line
+
+ ```bash
+ python run_cli.py --voice "path/to/voice.wav" --text "Text to synthesize" --out "output.wav"
+ ```
+
+ ### Method 3: Advanced Runner Script
+
+ ```bash
+ python run_voice_cloning.py
+ ```
+
+ Edit the paths and text inside the script before running.
+
+ ## Project Structure
+
+ ```
+ rtvc/
+ ├── clone_my_voice.py        # Simple script - EDIT THIS to clone your voice!
+ ├── run_cli.py               # Command-line interface
+
+ ├── encoder/                 # Speaker Encoder Module
+ │   ├── __init__.py
+ │   ├── audio.py             # Audio preprocessing for encoder
+ │   ├── inference.py         # Encoder inference functions
+ │   ├── model.py             # SpeakerEncoder neural network
+ │   ├── params_data.py       # Data hyperparameters
+ │   └── params_model.py      # Model hyperparameters
+
+ ├── synthesizer/             # Tacotron Synthesizer Module
+ │   ├── __init__.py
+ │   ├── audio.py             # Audio processing for synthesizer
+ │   ├── hparams.py           # All synthesizer hyperparameters
+ │   ├── inference.py         # Synthesizer inference class
+ │   │
+ │   ├── models/
+ │   │   └── tacotron.py      # Tacotron 2 architecture
+ │   │
+ │   └── utils/
+ │       ├── cleaners.py      # Text cleaning functions
+ │       ├── numbers.py       # Number-to-text conversion
+ │       ├── symbols.py       # Character/phoneme symbols
+ │       └── text.py          # Text-to-sequence conversion
+
+ ├── vocoder/                 # WaveRNN Vocoder Module
+ │   ├── audio.py             # Audio utilities for vocoder
+ │   ├── display.py           # Progress display utilities
+ │   ├── distribution.py      # Probability distributions
+ │   ├── hparams.py           # Vocoder hyperparameters
+ │   ├── inference.py         # Vocoder inference functions
+ │   │
+ │   └── models/
+ │       └── fatchord_version.py  # WaveRNN architecture
+
+ ├── utils/
+ │   └── default_models.py    # Model download utilities
+
+ ├── models/
+ │   └── default/             # Pretrained models go here
+ │       ├── encoder.pt       # (17 MB)
+ │       ├── synthesizer.pt   # (370 MB) - Must download!
+ │       └── vocoder.pt       # (53 MB)
+
+ ├── sample/                  # Put your voice samples here
+ │   └── your_voice.mp3
+
+ └── outputs/                 # Generated audio outputs
+     └── cloned_voice.wav
+ ```
+
+ ### Key Files Explained
+
+ | File | Purpose |
+ |------|---------|
+ | `clone_my_voice.py` | **START HERE** - Simplest way to clone your voice |
+ | `run_cli.py` | Command-line tool for voice cloning |
+ | `encoder/inference.py` | Loads encoder and extracts speaker embeddings |
+ | `synthesizer/inference.py` | Loads synthesizer and generates mel-spectrograms |
+ | `vocoder/inference.py` | Loads vocoder and generates waveforms |
+ | `**/hparams.py` | Configuration files for each module |
+
+ ## Usage Examples
+
+ ### Example 1: Basic Voice Cloning
+
+ ```bash
+ python clone_my_voice.py
+ ```
+
+ Edit `clone_my_voice.py` first:
+ ```python
+ VOICE_FILE = r"sample\my_voice.mp3"
+ TEXT_TO_CLONE = "Hello, this is my cloned voice!"
+ ```
+
+ ### Example 2: Multiple Outputs
+
+ ```bash
+ # Generate first output
+ python run_cli.py --voice "voice.wav" --text "First message" --out "output1.wav"
+
+ # Generate second output with same voice
+ python run_cli.py --voice "voice.wav" --text "Second message" --out "output2.wav"
+ ```
+
+ ### Example 3: Long Text
+
+ ```bash
+ python run_cli.py --voice "voice.wav" --text "This is a very long text that spans multiple sentences. The voice cloning system will synthesize all of it in the reference voice. You can make it as long as you need."
+ ```
+
+ ### Example 4: Different Voice Samples
+
+ ```bash
+ # Clone voice A
+ python run_cli.py --voice "person_a.wav" --text "Message from person A"
+
+ # Clone voice B
+ python run_cli.py --voice "person_b.wav" --text "Message from person B"
+ ```
+
+ ## Troubleshooting
+
+ ### Common Issues
+
+ #### "Model file not found"
+
+ **Solution**: Download the models from Google Drive and place them in `models/default/`:
+ - https://drive.google.com/drive/folders/1fU6umc5uQAVR2udZdHX-lDgXYzTyqG_j
+
+ Verify file sizes:
+ ```bash
+ # Windows
+ dir models\default\*.pt
+
+ # Linux/Mac
+ ls -lh models/default/*.pt
+ ```
+
+ Expected sizes:
+ - encoder.pt: 17,090,379 bytes (17 MB)
+ - synthesizer.pt: 370,554,559 bytes (370 MB) - Most common issue!
+ - vocoder.pt: 53,845,290 bytes (53 MB)
+
+ #### "Reference voice file not found"
+
+ **Solution**: Use absolute paths or check current directory:
+ ```python
+ # Use absolute path
+ VOICE_FILE = r"C:\Users\YourName\Desktop\voice.mp3"
+
+ # Or relative from project root
+ VOICE_FILE = r"sample\voice.mp3"
+ ```
+
+ #### Output sounds robotic or unclear
+
+ **Solutions**:
+ - Use a higher quality voice sample (16kHz+ sample rate)
+ - Ensure voice sample is 3-10 seconds long
+ - Remove background noise from voice sample
+ - Speak clearly and naturally in the reference audio
+
+ #### "AttributeError: module 'numpy' has no attribute 'cumproduct'"
+
+ **Solution**: This is already fixed in the code. If you see this:
+ ```bash
+ pip install --upgrade numpy
+ ```
+
+ #### Slow generation on CPU
+
+ **Solutions**:
+ - Normal speed: 2-3x real-time on modern CPUs
+ - For faster generation, install PyTorch with CUDA:
+ ```bash
+ pip install torch --index-url https://download.pytorch.org/whl/cu118
+ ```
+
+ Then the system will automatically use GPU if available.
+
+ ### Getting Help
+
+ If you encounter other issues:
+ 1. Check the `HOW_TO_RUN.md` file for detailed instructions
+ 2. Verify all models are downloaded correctly
+ 3. Ensure Python 3.11+ is installed
+ 4. Check that all dependencies are installed
+
+ ## Technical Details
+
+ ### Audio Specifications
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Sample Rate | 16,000 Hz |
+ | Channels | Mono |
+ | Bit Depth | 16-bit |
+ | FFT Size | 800 samples (50ms) |
+ | Hop Size | 200 samples (12.5ms) |
+ | Mel Channels | 80 (synthesizer/vocoder), 40 (encoder) |
+
+ ### Model Architectures
+
+ #### Speaker Encoder
+ - **Type**: LSTM + Linear Projection
+ - **Input**: 40-channel mel-spectrogram
+ - **Output**: 256-dimensional speaker embedding
+ - **Parameters**: ~5M
+
+ #### Synthesizer (Tacotron 2)
+ - **Encoder**: CBHG (Convolution Bank + Highway + GRU)
+ - **Decoder**: Attention-based LSTM
+ - **PostNet**: 5-layer Residual CNN
+ - **Parameters**: ~31M
+
+ #### Vocoder (WaveRNN)
+ - **Type**: Recurrent Neural Vocoder
+ - **Mode**: Raw 9-bit with mu-law
+ - **Upsample Factors**: (5, 5, 8)
+ - **Parameters**: ~4.5M
+
+ ### Text Processing
+
+ The system includes sophisticated text normalization:
+ - **Numbers**: "123" → "one hundred twenty three"
+ - **Currency**: "$5.50" → "five dollars, fifty cents"
+ - **Ordinals**: "1st" → "first"
+ - **Abbreviations**: "Dr." → "doctor"
+ - **Unicode**: Automatic transliteration to ASCII
+
+ ### Performance
+
+ | Hardware | Generation Speed |
+ |----------|------------------|
+ | CPU (Intel i7) | 2-3x real-time |
+ | GPU (GTX 1060) | 10-15x real-time |
+ | GPU (RTX 3080) | 30-50x real-time |
+
+ Example: Generating 10 seconds of audio takes ~3-5 seconds on CPU.
+
+ ## How to Use for Different Applications
+
+ ### Podcast/Narration
+ ```python
+ TEXT_TO_CLONE = """
+ Welcome to today's episode. In this podcast, we'll be discussing
+ the fascinating world of artificial intelligence and voice synthesis.
+ Let's dive right in!
+ """
+ ```
+
+ ### Audiobook
+ ```python
+ TEXT_TO_CLONE = """
+ Chapter One: The Beginning.
+ It was a dark and stormy night when everything changed.
+ The old house stood alone on the hill, its windows dark and unwelcoming.
+ """
+ ```
+
+ ### Voiceover
+ ```python
+ TEXT_TO_CLONE = """
+ Introducing the all-new product that will change your life.
+ With advanced features and intuitive design, it's the perfect solution.
+ """
+ ```
+
+ ### Multiple Languages
+ The system supports English out of the box. For other languages:
+ 1. Use English transliteration for best results
+ 2. Or modify `synthesizer/utils/cleaners.py` for your language
+
+ ## Comparison with Other Methods
+
+ | Method | Quality | Speed | Setup |
+ |--------|---------|-------|-------|
+ | Traditional TTS | Low | Fast | Easy |
+ | Commercial APIs | High | Fast | API Key Required |
+ | **This Project** | High | Medium | One-time Setup |
+ | Training from Scratch | High | Slow | Very Complex |
+
+ ## Best Practices
+
+ ### For Best Voice Quality:
+
+ 1. **Reference Audio**:
+    - 3-10 seconds long
+    - Clear speech, no background noise
+    - Natural speaking tone (not reading/singing)
+    - 16kHz+ sample rate if possible
+
+ 2. **Text Input**:
+    - Use proper punctuation for natural pauses
+    - Break very long texts into paragraphs
+    - Avoid excessive special characters
+
+ 3. **Output**:
+    - Generate shorter clips for better quality
+    - Concatenate multiple clips if needed
+    - Post-process with audio editing software for polish
+
+ ## Known Limitations
+
+ - Works best with English text
+ - Requires good quality reference audio
+ - May not perfectly capture very unique voice characteristics
+ - Background noise in reference affects output quality
+ - Very short reference audio (<3 seconds) may produce inconsistent results
+
+ ## Future Improvements
+
+ - [ ] Add GUI interface
+ - [ ] Support for multiple languages
+ - [ ] Real-time streaming mode
+ - [ ] Voice mixing/morphing capabilities
+ - [ ] Fine-tuning on custom datasets
+ - [ ] Mobile app version
+
+ ## Credits
+
+ This implementation is based on:
+ - **SV2TTS**: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
+ - **Tacotron 2**: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
+ - **WaveRNN**: Efficient Neural Audio Synthesis
+
+ Original research papers:
+ - [SV2TTS Paper](https://arxiv.org/abs/1806.04558)
+ - [Tacotron 2 Paper](https://arxiv.org/abs/1712.05884)
+ - [WaveRNN Paper](https://arxiv.org/abs/1802.08435)
+
+ ## License
+
+ This project is licensed under the MIT License - see the LICENSE file for details.
+
+ ## Contributing
+
+ Contributions are welcome! Please feel free to submit a Pull Request.
+
+ 1. Fork the repository
+ 2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
+ 3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
+ 4. Push to the branch (`git push origin feature/AmazingFeature`)
+ 5. Open a Pull Request
+
+ ## Show Your Support
+
+ If this project helped you, please give it a star!
+
+ ## Contact
+
+ For questions or support, please open an issue on GitHub.
+
  ---
 
+ **Made with love by the Voice Cloning Community**
+
+ *Last Updated: October 30, 2025*
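The three-stage pipeline described in the README maps onto the encoder, synthesizer, and vocoder packages added in this commit. A minimal sketch of how those stages are typically chained, assuming the standard RTVC inference interfaces (the actual wiring lives in `backend/app/voice_cloning.py`, which is not shown in this truncated view, and the import paths below follow the upstream top-level layout rather than this repo's `backend/` packages):

```python
from pathlib import Path

import soundfile as sf

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

models = Path("models/default")
encoder.load_model(models / "encoder.pt")
synthesizer = Synthesizer(models / "synthesizer.pt")
vocoder.load_model(models / "vocoder.pt")

# 1) Extract a 256-d speaker embedding from a few seconds of reference audio
ref_wav = encoder.preprocess_wav(Path("sample/your_voice.mp3"))
embed = encoder.embed_utterance(ref_wav)

# 2) Generate a mel-spectrogram from text, conditioned on the embedding
specs = synthesizer.synthesize_spectrograms(["Hello, this is my cloned voice!"], [embed])

# 3) Vocode the mel-spectrogram into a waveform and save it
wav = vocoder.infer_waveform(specs[0])
sf.write("outputs/cloned_voice.wav", wav, synthesizer.sample_rate)
```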
backend/.env.example ADDED
@@ -0,0 +1,15 @@
+ # Flask backend environment variables
+ FLASK_APP=backend.app
+ FLASK_ENV=production
+ DEBUG=false
+
+ # HuggingFace configuration
+ HF_HOME=.cache/huggingface
+
+ # CORS configuration for production
+ CORS_ORIGINS=https://your-netlify-site.netlify.app
+
+ # Model configuration
+ MODEL_REPO_ENCODER=AJ50/voice-clone-encoder
+ MODEL_REPO_SYNTHESIZER=AJ50/voice-clone-synthesizer
+ MODEL_REPO_VOCODER=AJ50/voice-clone-vocoder
backend/__init__.py ADDED
@@ -0,0 +1 @@
+ """Backend package root to support relative imports."""
backend/app/__init__.py ADDED
@@ -0,0 +1,34 @@
+ """Application factory for the voice cloning backend."""
+
+ import os
+ from flask import Flask
+ from flask_cors import CORS
+
+
+ def create_app():
+     """Create and configure the Flask application."""
+
+     app = Flask(__name__)
+
+     # CORS configuration - allow specific frontend URL or all origins
+     allowed_origins = os.getenv('FRONTEND_URL', '*').split(',')
+     cors_config = {
+         "origins": allowed_origins if allowed_origins != ['*'] else '*',
+         "methods": ["GET", "POST", "DELETE", "OPTIONS"],
+         "allow_headers": ["Content-Type", "Authorization"]
+     }
+     CORS(app, resources={r"/api/*": cors_config})
+
+     from .routes import bp
+
+     app.register_blueprint(bp)
+
+     # Root endpoint
+     @app.route('/')
+     def index():
+         return {'message': 'Voice Cloning API', 'status': 'running', 'api_prefix': '/api'}
+
+     return app
+
+
+ app = create_app()
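The factory reads allowed CORS origins from the `FRONTEND_URL` environment variable and is also instantiated at import time, so a WSGI server can target either `backend.app:app` or a thin wrapper such as `backend/wsgi.py` (listed in this commit but not shown here). A hypothetical wrapper might look like this:

```python
# Hypothetical entry point; the actual backend/wsgi.py in this commit is not shown above.
import os

from backend.app import create_app

# FRONTEND_URL may be a single origin or a comma-separated list (see create_app above).
os.environ.setdefault("FRONTEND_URL", "http://localhost:8080")

app = create_app()

if __name__ == "__main__":
    # Matches the port exposed by the Dockerfile (HF Spaces uses 7860).
    app.run(host="0.0.0.0", port=7860)
```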
backend/app/routes.py ADDED
@@ -0,0 +1,409 @@
+ """
+ Flask API Backend for Voice Cloning
+ Integrates the Python voice cloning backend with the React frontend
+ """
+
+ from flask import Blueprint, request, jsonify, send_file
+ from pathlib import Path
+ import uuid
+ import json
+ from datetime import datetime
+ import sys
+
+ from .voice_cloning import synthesize
+
+ bp = Blueprint('voice_cloning', __name__, url_prefix='/api')
+
+ BASE_DIR = Path(__file__).resolve().parents[1]
+
+ # Configuration
+ UPLOAD_FOLDER = BASE_DIR / 'enrolled_voices'
+ OUTPUT_FOLDER = BASE_DIR / 'outputs'
+ MODELS_DIR = BASE_DIR / 'models'
+ VOICES_DB = UPLOAD_FOLDER / 'voices.json'
+
+ # Create directories with parents
+ try:
+     UPLOAD_FOLDER.mkdir(parents=True, exist_ok=True)
+     OUTPUT_FOLDER.mkdir(parents=True, exist_ok=True)
+     VOICES_DB.parent.mkdir(parents=True, exist_ok=True)
+ except Exception as e:
+     print(f"Failed to create directories: {e}")
+     sys.exit(1)
+
+ # Allowed audio extensions
+ ALLOWED_EXTENSIONS = {'mp3', 'wav', 'm4a', 'flac', 'ogg', 'webm'}
+
+ def allowed_file(filename):
+     return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS
+
+ def load_voices_db():
+     """Load the voices database"""
+     if VOICES_DB.exists():
+         with open(VOICES_DB, 'r') as f:
+             return json.load(f)
+     return []
+
+ def save_voices_db(voices):
+     """Save the voices database"""
+     with open(VOICES_DB, 'w') as f:
+         json.dump(voices, f, indent=2)
+
+ @bp.route('/health', methods=['GET'])
+ def health_check():
+     """Health check endpoint"""
+     return jsonify({
+         'status': 'healthy',
+         'message': 'Voice Cloning API is running'
+     })
+
+ @bp.route('/enroll', methods=['POST'])
+ def enroll_voice():
+     """
+     Enroll a new voice by accepting audio file and voice name
+     Frontend sends: FormData with 'audio' (File) and 'voice_name' (string)
+     """
+     try:
+         # Check if audio file is present
+         if 'audio' not in request.files:
+             return jsonify({'error': 'No audio file provided'}), 400
+
+         audio_file = request.files['audio']
+         voice_name = request.form.get('voice_name', 'Unnamed Voice').strip()
+
+         if audio_file.filename == '':
+             return jsonify({'error': 'No file selected'}), 400
+
+         if not allowed_file(audio_file.filename):
+             return jsonify({'error': 'Invalid file type. Supported: mp3, wav, m4a, flac, ogg, webm'}), 400
+
+         # Ensure upload folder exists
+         UPLOAD_FOLDER.mkdir(parents=True, exist_ok=True)
+
+         # Generate unique ID and secure filename
+         voice_id = f"voice_{uuid.uuid4().hex[:8]}"
+         file_extension = audio_file.filename.rsplit('.', 1)[1].lower()
+         filename = f"{voice_id}.{file_extension}"
+         filepath = UPLOAD_FOLDER / filename
+
+         # Save the audio file with error handling
+         try:
+             audio_file.save(str(filepath))
+             print(f"✓ Audio file saved: {filepath}")
+         except Exception as file_err:
+             print(f"✗ Failed to save audio file: {file_err}")
+             return jsonify({'error': f'Failed to save audio: {str(file_err)}'}), 500
+
+         # Create voice entry
+         voice_entry = {
+             'id': voice_id,
+             'name': voice_name,
+             'filename': filename,
+             'createdAt': datetime.now().isoformat()
+         }
+
+         # Update voices database with error handling
+         try:
+             VOICES_DB.parent.mkdir(parents=True, exist_ok=True)
+             voices = load_voices_db()
+             voices.append(voice_entry)
+             save_voices_db(voices)
+             print(f"✓ Voice '{voice_name}' (ID: {voice_id}) enrolled successfully")
+         except Exception as db_err:
+             print(f"✗ Failed to update voices DB: {db_err}")
+             return jsonify({'error': f'Failed to save voice metadata: {str(db_err)}'}), 500
+
+         return jsonify({
+             'success': True,
+             'message': f'Voice "{voice_name}" enrolled successfully',
+             'voice_id': voice_id,
+             'voice_name': voice_name,
+             'created_at': voice_entry['createdAt']
+         }), 201
+
+     except Exception as e:
+         print(f"✗ Error enrolling voice: {e}")
+         import traceback
+         traceback.print_exc()
+         return jsonify({'error': f'Failed to enroll voice: {str(e)}'}), 500
+
+ @bp.route('/voices', methods=['GET'])
+ def get_voices():
+     """
+     Get list of all enrolled voices
+     Frontend uses this to populate the voice selection dropdown
+     """
+     try:
+         voices = load_voices_db()
+         # Return only necessary info for frontend
+         voices_list = [
+             {
+                 'id': v['id'],
+                 'name': v['name'],
+                 'createdAt': v['createdAt']
+             }
+             for v in voices
+         ]
+         return jsonify({'voices': voices_list}), 200
+     except Exception as e:
+         print(f"Error getting voices: {e}")
+         return jsonify({'error': f'Failed to get voices: {str(e)}'}), 500
+
+ @bp.route('/synthesize', methods=['POST'])
+ def synthesize_speech():
+     """
+     Synthesize speech from text using enrolled voice
+     Frontend sends: { "text": "...", "voiceId": "voice_xxx" }
+     """
+     try:
+         data = request.get_json()
+
+         if not data:
+             return jsonify({'error': 'No data provided'}), 400
+
+         text = data.get('text', '').strip()
+         voice_id = data.get('voice_id', '')  # Changed from 'voiceId' to 'voice_id'
+
+         if not text:
+             return jsonify({'error': 'No text provided'}), 400
+
+         if not voice_id:
+             return jsonify({'error': 'No voice selected'}), 400
+
+         # Find the voice in database
+         voices = load_voices_db()
+         voice = next((v for v in voices if v['id'] == voice_id), None)
+
+         if not voice:
+             return jsonify({'error': 'Voice not found'}), 404
+
+         # Reconstruct path from UPLOAD_FOLDER (server-agnostic)
+         voice_filepath = UPLOAD_FOLDER / voice['filename']
+
+         if not voice_filepath.exists():
+             return jsonify({'error': f'Voice file not found: {voice_filepath}'}), 404
+
+         # Generate unique output filename
+         output_filename = f"synthesis_{uuid.uuid4().hex[:8]}.wav"
+         output_path = OUTPUT_FOLDER / output_filename
+
+         # Call the voice cloning synthesis function
+         print(f"Synthesizing: '{text}' with voice '{voice['name']}'")
+         print(f"Voice file: {voice_filepath}")
+         print(f"Output path: {output_path}")
+         print(f"Models dir: {MODELS_DIR}")
+         print("Starting synthesis... This may take 30-60 seconds...")
+
+         try:
+             # Flush output to see logs immediately
+             sys.stdout.flush()
+
+             synthesize(
+                 voice_path=voice_filepath,
+                 text=text,
+                 models_dir=MODELS_DIR,
+                 out_path=output_path
+             )
+
+             print(f"Synthesis completed! Output saved to: {output_path}")
+             sys.stdout.flush()
+         except Exception as synth_error:
+             print(f"Synthesis error: {synth_error}")
+             import traceback
+             traceback.print_exc()
+             sys.stdout.flush()
+             return jsonify({'error': f'Synthesis failed: {str(synth_error)}'}), 500
+
+         if not output_path.exists():
+             error_msg = 'Synthesis failed - output not generated'
+             return jsonify({'error': error_msg}), 500
+
+         # Return the audio file URL
+         return jsonify({
+             'success': True,
+             'message': 'Speech synthesized successfully',
+             'audio_url': f'/api/audio/{output_filename}'
+         }), 200
+
+     except Exception as e:
+         print(f"Error synthesizing speech: {e}")
+         import traceback
+         traceback.print_exc()
+         return jsonify({'error': f'Failed to synthesize speech: {str(e)}'}), 500
+
+ @bp.route('/audio/<filename>', methods=['GET'])
+ def get_audio(filename):
+     """
+     Serve synthesized audio files
+     Frontend uses this URL to play/download the generated audio
+     """
+     try:
+         filepath = OUTPUT_FOLDER / filename
+         if not filepath.exists():
+             return jsonify({'error': 'Audio file not found'}), 404
+
+         return send_file(
+             str(filepath),
+             mimetype='audio/wav',
+             as_attachment=False,
+             download_name=filename
+         )
+     except Exception as e:
+         print(f"Error serving audio: {e}")
+         return jsonify({'error': f'Failed to serve audio: {str(e)}'}), 500
+
+ @bp.route('/voices/<voice_id>', methods=['DELETE'])
+ def delete_voice(voice_id):
+     """
+     Delete an enrolled voice
+     Optional: Frontend can call this to remove voices
+     """
+     try:
+         voices = load_voices_db()
+         voice = next((v for v in voices if v['id'] == voice_id), None)
+
+         if not voice:
+             return jsonify({'error': 'Voice not found'}), 404
+
+         # Delete the audio file
+         voice_filepath = UPLOAD_FOLDER / voice['filename']
+         if voice_filepath.exists():
+             voice_filepath.unlink()
+
+         # Remove from database
+         voices = [v for v in voices if v['id'] != voice_id]
+         save_voices_db(voices)
+
+         return jsonify({
+             'success': True,
+             'message': f'Voice "{voice["name"]}" deleted successfully'
+         }), 200
+
+     except Exception as e:
+         print(f"Error deleting voice: {e}")
+         return jsonify({'error': f'Failed to delete voice: {str(e)}'}), 500
+
+ @bp.route('/spectrogram/<audio_filename>', methods=['GET'])
+ def get_spectrogram(audio_filename):
+     """
+     Generate and return mel-spectrogram data for visualization
+     Frontend can use this to display real-time mel-spectrogram
+     """
+     try:
+         print(f"[Spectrogram] Requested file: {audio_filename}")
+         filepath = OUTPUT_FOLDER / audio_filename
+         print(f"[Spectrogram] Full path: {filepath}")
+         print(f"[Spectrogram] File exists: {filepath.exists()}")
+
+         if not filepath.exists():
+             print(f"[Spectrogram] ERROR: File not found: {filepath}")
+             return jsonify({'error': f'Audio file {audio_filename} not found'}), 404
+
+         # Import librosa for mel-spectrogram generation
+         import librosa
+         import numpy as np
+
+         print(f"[Spectrogram] Loading audio file...")
+         # Load audio file
+         y, sr = librosa.load(str(filepath), sr=None)
+         print(f"[Spectrogram] Audio loaded: shape={y.shape}, sr={sr}")
+
+         # Generate mel-spectrogram
+         # 80 mel bands (common for Tacotron2), hop_length varies with sample rate
+         mel_spec = librosa.feature.melspectrogram(
+             y=y,
+             sr=sr,
+             n_mels=80,
+             hop_length=512
+         )
+         print(f"[Spectrogram] Mel-spec generated: shape={mel_spec.shape}")
+
+         # Convert to dB scale (log scale for better visualization)
+         mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
+
+         # Normalize to 0-255 range for visualization
+         mel_spec_normalized = np.clip(
+             ((mel_spec_db + 80) / 80 * 255),
+             0,
+             255
+         ).astype(np.uint8)
+
+         # Convert to list for JSON serialization
+         # Transpose to time x frequency format for frontend
+         spectrogram_data = mel_spec_normalized.T.tolist()
+
+         print(f"[Spectrogram] Successfully generated spectrogram: {len(spectrogram_data)} time steps")
+
+         return jsonify({
+             'spectrogram': spectrogram_data,
+             'n_mels': 80,
+             'shape': {
+                 'time_steps': len(spectrogram_data),
+                 'frequency_bins': 80
+             }
+         }), 200
+
+     except Exception as e:
+         print(f"[Spectrogram] ERROR: {str(e)}")
+         import traceback
+         traceback.print_exc()
+         return jsonify({'error': f'Failed to generate spectrogram: {str(e)}'}), 500
+
+ @bp.route('/waveform/<audio_filename>', methods=['GET'])
+ def get_waveform(audio_filename):
+     """
+     Serve audio waveform as numeric array for real-time FFT visualization
+     Frontend fetches this and computes FFT using Web Audio API
+     """
+     try:
+         filepath = OUTPUT_FOLDER / audio_filename
+         if not filepath.exists():
+             return jsonify({'error': 'Audio file not found'}), 404
+
+         import soundfile as sf
+         import numpy as np
+
+         # Load audio file
+         # soundfile returns (data, sample_rate)
+         y, sr = sf.read(str(filepath))
+
+         # If stereo, convert to mono by taking first channel or averaging
+         if len(y.shape) > 1:
+             y = np.mean(y, axis=1)
+
+         # Ensure float32 for compatibility
+         y = np.asarray(y, dtype=np.float32)
+
+         # Downsample if very long to reduce JSON payload
+         # Typical waveform for 60s at 22050Hz = 1.3M samples
+         # For FFT we can use 8000 Hz safely (captures up to 4 kHz)
+         target_sr = 8000
+         if sr > target_sr:
+             # Calculate downsample factor
+             resample_ratio = target_sr / sr
+             new_length = int(len(y) * resample_ratio)
+             # Simple linear interpolation for downsampling
+             indices = np.linspace(0, len(y) - 1, new_length)
+             y = np.interp(indices, np.arange(len(y)), y)
+             sr = target_sr
+
+         # Convert to list for JSON serialization
+         waveform_data = y.tolist()
+
+         return jsonify({
+             'waveform': waveform_data,
+             'sample_rate': sr,
+             'duration': len(y) / sr,
+             'samples': len(y)
+         }), 200
+
+     except ImportError as ie:
+         err_msg = f'Soundfile library not available: {str(ie)}'
+         return jsonify({'error': err_msg}), 500
+     except Exception as e:
+         print(f"Error serving waveform: {e}")
+         import traceback
+         traceback.print_exc()
+         err_msg = f'Failed to generate waveform: {str(e)}'
+         return jsonify({'error': err_msg}), 500
+
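The routes above define the full HTTP contract: multipart enrollment, JSON synthesis requests, and file serving for the result. A small client sketch using the `requests` library (not a dependency of this commit), assuming the API is reachable on port 7860:

```python
import requests

BASE = "http://localhost:7860/api"

# Enroll a voice: multipart field 'audio' plus form field 'voice_name'
with open("sample/your_voice.mp3", "rb") as f:
    resp = requests.post(f"{BASE}/enroll", files={"audio": f}, data={"voice_name": "My Voice"})
voice_id = resp.json()["voice_id"]

# Synthesize speech with the enrolled voice (expect 30-60 seconds on CPU)
resp = requests.post(f"{BASE}/synthesize", json={"text": "Hello there!", "voice_id": voice_id})
audio_url = resp.json()["audio_url"]  # e.g. /api/audio/synthesis_xxxxxxxx.wav

# Download the generated WAV served by /api/audio/<filename>
wav_bytes = requests.get(f"http://localhost:7860{audio_url}").content
with open("cloned.wav", "wb") as out:
    out.write(wav_bytes)
```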
backend/app/vocoder/audio.py ADDED
@@ -0,0 +1,108 @@
+ import math
+ import numpy as np
+ import librosa
+ from . import hparams as hp
+ from scipy.signal import lfilter
+ import soundfile as sf
+
+
+ def label_2_float(x, bits) :
+     return 2 * x / (2**bits - 1.) - 1.
+
+
+ def float_2_label(x, bits) :
+     assert abs(x).max() <= 1.0
+     x = (x + 1.) * (2**bits - 1) / 2
+     return x.clip(0, 2**bits - 1)
+
+
+ def load_wav(path) :
+     return librosa.load(str(path), sr=hp.sample_rate)[0]
+
+
+ def save_wav(x, path) :
+     sf.write(path, x.astype(np.float32), hp.sample_rate)
+
+
+ def split_signal(x) :
+     unsigned = x + 2**15
+     coarse = unsigned // 256
+     fine = unsigned % 256
+     return coarse, fine
+
+
+ def combine_signal(coarse, fine) :
+     return coarse * 256 + fine - 2**15
+
+
+ def encode_16bits(x) :
+     return np.clip(x * 2**15, -2**15, 2**15 - 1).astype(np.int16)
+
+
+ mel_basis = None
+
+
+ def linear_to_mel(spectrogram):
+     global mel_basis
+     if mel_basis is None:
+         mel_basis = build_mel_basis()
+     return np.dot(mel_basis, spectrogram)
+
+
+ def build_mel_basis():
+     return librosa.filters.mel(hp.sample_rate, hp.n_fft, n_mels=hp.num_mels, fmin=hp.fmin)
+
+
+ def normalize(S):
+     return np.clip((S - hp.min_level_db) / -hp.min_level_db, 0, 1)
+
+
+ def denormalize(S):
+     return (np.clip(S, 0, 1) * -hp.min_level_db) + hp.min_level_db
+
+
+ def amp_to_db(x):
+     return 20 * np.log10(np.maximum(1e-5, x))
+
+
+ def db_to_amp(x):
+     return np.power(10.0, x * 0.05)
+
+
+ def spectrogram(y):
+     D = stft(y)
+     S = amp_to_db(np.abs(D)) - hp.ref_level_db
+     return normalize(S)
+
+
+ def melspectrogram(y):
+     D = stft(y)
+     S = amp_to_db(linear_to_mel(np.abs(D)))
+     return normalize(S)
+
+
+ def stft(y):
+     return librosa.stft(y=y, n_fft=hp.n_fft, hop_length=hp.hop_length, win_length=hp.win_length)
+
+
+ def pre_emphasis(x):
+     return lfilter([1, -hp.preemphasis], [1], x)
+
+
+ def de_emphasis(x):
+     return lfilter([1], [1, -hp.preemphasis], x)
+
+
+ def encode_mu_law(x, mu) :
+     mu = mu - 1
+     fx = np.sign(x) * np.log(1 + mu * np.abs(x)) / np.log(1 + mu)
+     return np.floor((fx + 1) / 2 * mu + 0.5)
+
+
+ def decode_mu_law(y, mu, from_labels=True) :
+     if from_labels:
+         y = label_2_float(y, math.log2(mu))
+     mu = mu - 1
+     x = np.sign(y) / mu * ((1 + mu) ** np.abs(y) - 1)
+     return x
+
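The `encode_mu_law`/`decode_mu_law` pair above implements the 9-bit mu-law companding used when the vocoder runs in 'RAW' mode with `mu_law` enabled (`hp.bits = 9`, so `mu = 2**9 = 512`). A quick round-trip check, assuming the module is importable as `backend.app.vocoder.audio` (its hparams pull values from the synthesizer package, which must be on the path):

```python
import numpy as np

from backend.app.vocoder import audio  # assumed import path for the module above

x = np.linspace(-1.0, 1.0, 11)                  # floats in [-1, 1]
labels = audio.encode_mu_law(x, mu=2 ** 9)      # integer labels in [0, 511]
x_hat = audio.decode_mu_law(labels, mu=2 ** 9)  # back to floats in [-1, 1]
print(np.max(np.abs(x - x_hat)))                # quantisation error on the order of 1e-2
```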
backend/app/vocoder/display.py ADDED
@@ -0,0 +1,127 @@
+ import time
+ import numpy as np
+ import sys
+
+
+ def progbar(i, n, size=16):
+     done = (i * size) // n
+     bar = ''
+     for i in range(size):
+         bar += '█' if i <= done else '░'
+     return bar
+
+
+ def stream(message) :
+     try:
+         sys.stdout.write("\r{%s}" % message)
+     except:
+         # Remove non-ASCII characters from message
+         message = ''.join(i for i in message if ord(i) < 128)
+         sys.stdout.write("\r{%s}" % message)
+
+
+ def simple_table(item_tuples) :
+
+     border_pattern = '+---------------------------------------'
+     whitespace = ' '
+
+     headings, cells, = [], []
+
+     for item in item_tuples :
+
+         heading, cell = str(item[0]), str(item[1])
+
+         pad_head = True if len(heading) < len(cell) else False
+
+         pad = abs(len(heading) - len(cell))
+         pad = whitespace[:pad]
+
+         pad_left = pad[:len(pad)//2]
+         pad_right = pad[len(pad)//2:]
+
+         if pad_head :
+             heading = pad_left + heading + pad_right
+         else :
+             cell = pad_left + cell + pad_right
+
+         headings += [heading]
+         cells += [cell]
+
+     border, head, body = '', '', ''
+
+     for i in range(len(item_tuples)) :
+
+         temp_head = f'| {headings[i]} '
+         temp_body = f'| {cells[i]} '
+
+         border += border_pattern[:len(temp_head)]
+         head += temp_head
+         body += temp_body
+
+         if i == len(item_tuples) - 1 :
+             head += '|'
+             body += '|'
+             border += '+'
+
+     print(border)
+     print(head)
+     print(border)
+     print(body)
+     print(border)
+     print(' ')
+
+
+ def time_since(started) :
+     elapsed = time.time() - started
+     m = int(elapsed // 60)
+     s = int(elapsed % 60)
+     if m >= 60 :
+         h = int(m // 60)
+         m = m % 60
+         return f'{h}h {m}m {s}s'
+     else :
+         return f'{m}m {s}s'
+
+
+ def save_attention(attn, path):
+     import matplotlib.pyplot as plt
+
+     fig = plt.figure(figsize=(12, 6))
+     plt.imshow(attn.T, interpolation='nearest', aspect='auto')
+     fig.savefig(f'{path}.png', bbox_inches='tight')
+     plt.close(fig)
+
+
+ def save_spectrogram(M, path, length=None):
+     import matplotlib.pyplot as plt
+
+     M = np.flip(M, axis=0)
+     if length : M = M[:, :length]
+     fig = plt.figure(figsize=(12, 6))
+     plt.imshow(M, interpolation='nearest', aspect='auto')
+     fig.savefig(f'{path}.png', bbox_inches='tight')
+     plt.close(fig)
+
+
+ def plot(array):
+     import matplotlib.pyplot as plt
+
+     fig = plt.figure(figsize=(30, 5))
+     ax = fig.add_subplot(111)
+     ax.xaxis.label.set_color('grey')
+     ax.yaxis.label.set_color('grey')
+     ax.xaxis.label.set_fontsize(23)
+     ax.yaxis.label.set_fontsize(23)
+     ax.tick_params(axis='x', colors='grey', labelsize=23)
+     ax.tick_params(axis='y', colors='grey', labelsize=23)
+     plt.plot(array)
+
+
+ def plot_spec(M):
+     import matplotlib.pyplot as plt
+
+     M = np.flip(M, axis=0)
+     plt.figure(figsize=(18, 4))
+     plt.imshow(M, interpolation='nearest', aspect='auto')
+     plt.show()
+
backend/app/vocoder/distribution.py ADDED
@@ -0,0 +1,132 @@
+ import numpy as np
+ import torch
+ import torch.nn.functional as F
+
+
+ def log_sum_exp(x):
+     """ numerically stable log_sum_exp implementation that prevents overflow """
+     # TF ordering
+     axis = len(x.size()) - 1
+     m, _ = torch.max(x, dim=axis)
+     m2, _ = torch.max(x, dim=axis, keepdim=True)
+     return m + torch.log(torch.sum(torch.exp(x - m2), dim=axis))
+
+
+ # It is adapted from https://github.com/r9y9/wavenet_vocoder/blob/master/wavenet_vocoder/mixture.py
+ def discretized_mix_logistic_loss(y_hat, y, num_classes=65536,
+                                   log_scale_min=None, reduce=True):
+     if log_scale_min is None:
+         log_scale_min = float(np.log(1e-14))
+     y_hat = y_hat.permute(0, 2, 1)
+     assert y_hat.dim() == 3
+     assert y_hat.size(1) % 3 == 0
+     nr_mix = y_hat.size(1) // 3
+
+     # (B x T x C)
+     y_hat = y_hat.transpose(1, 2)
+
+     # unpack parameters. (B, T, num_mixtures) x 3
+     logit_probs = y_hat[:, :, :nr_mix]
+     means = y_hat[:, :, nr_mix:2 * nr_mix]
+     log_scales = torch.clamp(y_hat[:, :, 2 * nr_mix:3 * nr_mix], min=log_scale_min)
+
+     # B x T x 1 -> B x T x num_mixtures
+     y = y.expand_as(means)
+
+     centered_y = y - means
+     inv_stdv = torch.exp(-log_scales)
+     plus_in = inv_stdv * (centered_y + 1. / (num_classes - 1))
+     cdf_plus = torch.sigmoid(plus_in)
+     min_in = inv_stdv * (centered_y - 1. / (num_classes - 1))
+     cdf_min = torch.sigmoid(min_in)
+
+     # log probability for edge case of 0 (before scaling)
+     # equivalent: torch.log(F.sigmoid(plus_in))
+     log_cdf_plus = plus_in - F.softplus(plus_in)
+
+     # log probability for edge case of 255 (before scaling)
+     # equivalent: (1 - F.sigmoid(min_in)).log()
+     log_one_minus_cdf_min = -F.softplus(min_in)
+
+     # probability for all other cases
+     cdf_delta = cdf_plus - cdf_min
+
+     mid_in = inv_stdv * centered_y
+     # log probability in the center of the bin, to be used in extreme cases
+     # (not actually used in our code)
+     log_pdf_mid = mid_in - log_scales - 2. * F.softplus(mid_in)
+
+     # tf equivalent
+     """
+     log_probs = tf.where(x < -0.999, log_cdf_plus,
+                          tf.where(x > 0.999, log_one_minus_cdf_min,
+                                   tf.where(cdf_delta > 1e-5,
+                                            tf.log(tf.maximum(cdf_delta, 1e-12)),
+                                            log_pdf_mid - np.log(127.5))))
+     """
+     # TODO: cdf_delta <= 1e-5 actually can happen. How can we choose the value
+     # for num_classes=65536 case? 1e-7? not sure..
+     inner_inner_cond = (cdf_delta > 1e-5).float()
+
+     inner_inner_out = inner_inner_cond * \
+                       torch.log(torch.clamp(cdf_delta, min=1e-12)) + \
+                       (1. - inner_inner_cond) * (log_pdf_mid - np.log((num_classes - 1) / 2))
+     inner_cond = (y > 0.999).float()
+     inner_out = inner_cond * log_one_minus_cdf_min + (1. - inner_cond) * inner_inner_out
+     cond = (y < -0.999).float()
+     log_probs = cond * log_cdf_plus + (1. - cond) * inner_out
+
+     log_probs = log_probs + F.log_softmax(logit_probs, -1)
+
+     if reduce:
+         return -torch.mean(log_sum_exp(log_probs))
+     else:
+         return -log_sum_exp(log_probs).unsqueeze(-1)
+
+
+ def sample_from_discretized_mix_logistic(y, log_scale_min=None):
+     """
+     Sample from discretized mixture of logistic distributions
+     Args:
+         y (Tensor): B x C x T
+         log_scale_min (float): Log scale minimum value
+     Returns:
+         Tensor: sample in range of [-1, 1].
+     """
+     if log_scale_min is None:
+         log_scale_min = float(np.log(1e-14))
+     assert y.size(1) % 3 == 0
+     nr_mix = y.size(1) // 3
+
+     # B x T x C
+     y = y.transpose(1, 2)
+     logit_probs = y[:, :, :nr_mix]
+
+     # sample mixture indicator from softmax
+     temp = logit_probs.data.new(logit_probs.size()).uniform_(1e-5, 1.0 - 1e-5)
+     temp = logit_probs.data - torch.log(- torch.log(temp))
+     _, argmax = temp.max(dim=-1)
+
+     # (B, T) -> (B, T, nr_mix)
+     one_hot = to_one_hot(argmax, nr_mix)
+     # select logistic parameters
+     means = torch.sum(y[:, :, nr_mix:2 * nr_mix] * one_hot, dim=-1)
+     log_scales = torch.clamp(torch.sum(
+         y[:, :, 2 * nr_mix:3 * nr_mix] * one_hot, dim=-1), min=log_scale_min)
+     # sample from logistic & clip to interval
+     # we don't actually round to the nearest 8bit value when sampling
+     u = means.data.new(means.size()).uniform_(1e-5, 1.0 - 1e-5)
+     x = means + torch.exp(log_scales) * (torch.log(u) - torch.log(1. - u))
+
+     x = torch.clamp(torch.clamp(x, min=-1.), max=1.)
+
+     return x
+
+
+ def to_one_hot(tensor, n, fill_with=1.):
+     # we perform one hot encore with respect to the last axis
+     one_hot = torch.FloatTensor(tensor.size() + (n,)).zero_()
+     if tensor.is_cuda:
+         one_hot = one_hot.cuda()
+     one_hot.scatter_(len(tensor.size()), tensor.unsqueeze(-1), fill_with)
+     return one_hot
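`sample_from_discretized_mix_logistic` expects the WaveRNN's MOL-mode output with mixture logits, means, and log-scales stacked along the channel dimension (B x 3*nr_mix x T) and returns samples in [-1, 1]. A shape sanity check with random parameters, assuming the module is importable as `backend.app.vocoder.distribution`:

```python
import torch

from backend.app.vocoder.distribution import sample_from_discretized_mix_logistic

B, nr_mix, T = 2, 10, 100
params = torch.randn(B, 3 * nr_mix, T)  # stand-in for a MOL-mode WaveRNN output
samples = sample_from_discretized_mix_logistic(params)
print(samples.shape)                    # torch.Size([2, 100]), values clipped to [-1, 1]
```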
backend/app/vocoder/hparams.py ADDED
@@ -0,0 +1,44 @@
+ from synthesizer.hparams import hparams as _syn_hp
+
+
+ # Audio settings------------------------------------------------------------------------
+ # Match the values of the synthesizer
+ sample_rate = _syn_hp.sample_rate
+ n_fft = _syn_hp.n_fft
+ num_mels = _syn_hp.num_mels
+ hop_length = _syn_hp.hop_size
+ win_length = _syn_hp.win_size
+ fmin = _syn_hp.fmin
+ min_level_db = _syn_hp.min_level_db
+ ref_level_db = _syn_hp.ref_level_db
+ mel_max_abs_value = _syn_hp.max_abs_value
+ preemphasis = _syn_hp.preemphasis
+ apply_preemphasis = _syn_hp.preemphasize
+
+ bits = 9                          # bit depth of signal
+ mu_law = True                     # Recommended to suppress noise if using raw bits in hp.voc_mode
+                                   # below
+
+
+ # WAVERNN / VOCODER --------------------------------------------------------------------------------
+ voc_mode = 'RAW'                  # either 'RAW' (softmax on raw bits) or 'MOL' (sample from
+                                   # mixture of logistics)
+ voc_upsample_factors = (5, 5, 8)  # NB - this needs to correctly factorise hop_length
+ voc_rnn_dims = 512
+ voc_fc_dims = 512
+ voc_compute_dims = 128
+ voc_res_out_dims = 128
+ voc_res_blocks = 10
+
+ # Training
+ voc_batch_size = 100
+ voc_lr = 1e-4
+ voc_gen_at_checkpoint = 5         # number of samples to generate at each checkpoint
+ voc_pad = 2                       # this will pad the input so that the resnet can 'see' wider
+                                   # than input length
+ voc_seq_len = hop_length * 5      # must be a multiple of hop_length
+
+ # Generating / Synthesizing
+ voc_gen_batched = True            # very fast (realtime+) single utterance batched generation
+ voc_target = 8000                 # target number of samples to be generated in each batch entry
+ voc_overlap = 400                 # number of samples for crossfading between batches
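As the comment notes, `voc_upsample_factors` must multiply out to the hop length inherited from the synthesizer (200 samples per the README's audio table, and 5 * 5 * 8 = 200). A quick sanity check, assuming the synthesizer package is importable so this hparams module loads:

```python
import math

from backend.app.vocoder import hparams as hp  # assumed import path

assert math.prod(hp.voc_upsample_factors) == hp.hop_length, \
    "upsample factors must factorise hop_length"
print(hp.voc_upsample_factors, "->", math.prod(hp.voc_upsample_factors), "==", hp.hop_length)
```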
backend/app/vocoder/inference.py ADDED
@@ -0,0 +1,83 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from .models.fatchord_version import WaveRNN
2
+ from . import hparams as hp
3
+ import torch
4
+
5
+
6
+ _model = None # type: WaveRNN
7
+
8
+ def load_model(weights_fpath, verbose=True):
9
+ global _model, _device
10
+
11
+ if verbose:
12
+ print("Building Wave-RNN")
13
+ _model = WaveRNN(
14
+ rnn_dims=hp.voc_rnn_dims,
15
+ fc_dims=hp.voc_fc_dims,
16
+ bits=hp.bits,
17
+ pad=hp.voc_pad,
18
+ upsample_factors=hp.voc_upsample_factors,
19
+ feat_dims=hp.num_mels,
20
+ compute_dims=hp.voc_compute_dims,
21
+ res_out_dims=hp.voc_res_out_dims,
22
+ res_blocks=hp.voc_res_blocks,
23
+ hop_length=hp.hop_length,
24
+ sample_rate=hp.sample_rate,
25
+ mode=hp.voc_mode
26
+ )
27
+
28
+ if torch.cuda.is_available():
29
+ _model = _model.cuda()
30
+ _device = torch.device('cuda')
31
+ else:
32
+ _device = torch.device('cpu')
33
+
34
+ if verbose:
35
+ print("Loading model weights at %s" % weights_fpath)
36
+ checkpoint = torch.load(weights_fpath, _device)
37
+ _model.load_state_dict(checkpoint['model_state'])
38
+ _model.eval()
39
+
40
+
41
+ def is_loaded():
42
+ return _model is not None
43
+
44
+
45
+ def infer_waveform(mel, normalize=True, batched=True, target=8000, overlap=800,
46
+ progress_callback=None):
47
+ """
48
+ Infers the waveform of a mel spectrogram output by the synthesizer (the format must match
49
+ that of the synthesizer!)
50
+
51
+ :param normalize:
52
+ :param batched:
53
+ :param target:
54
+ :param overlap:
55
+ :return:
56
+ """
57
+ import sys
58
+ if _model is None:
59
+ raise Exception("Please load Wave-RNN in memory before using it")
60
+
61
+ print(f"[Vocoder] Input mel-spectrogram shape: {mel.shape}")
62
+ print(f"[Vocoder] Normalize: {normalize}, Batched: {batched}, Target: {target}, Overlap: {overlap}")
63
+ print(f"[Vocoder] Device: {_device}, Model on: {next(_model.parameters()).device}")
64
+
65
+ try:
66
+ if normalize:
67
+ mel = mel / hp.mel_max_abs_value
68
+ mel = torch.from_numpy(mel[None, ...])
69
+ print(f"[Vocoder] Mel tensor shape after processing: {mel.shape}, dtype: {mel.dtype}")
70
+
71
+ print("[Vocoder] Starting waveform generation (this may take a while on CPU)...")
72
+ sys.stdout.flush()
73
+
74
+ wav = _model.generate(mel, batched, target, overlap, hp.mu_law, progress_callback)
75
+
76
+ print(f"[Vocoder] Waveform generated successfully, shape: {wav.shape}")
77
+ return wav
78
+ except Exception as e:
79
+ print(f"[Vocoder] ✗ Error during vocoding: {e}")
80
+ import traceback
81
+ traceback.print_exc()
82
+ sys.stdout.flush()
83
+ raise
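
A usage sketch for this module (the weights path is an assumption, and the random mel below is only a placeholder for a synthesizer output of shape (num_mels, frames)):

import numpy as np
from app.vocoder import inference as vocoder

vocoder.load_model("models/default/vocoder.pt")      # assumed location of the WaveRNN weights

# Placeholder mel in the synthesizer's symmetric range [-max_abs_value, max_abs_value]
mel = np.random.uniform(-4.0, 4.0, (80, 200)).astype(np.float32)

wav = vocoder.infer_waveform(mel, normalize=True, batched=True,
                             target=8000, overlap=400)
print(wav.shape)                                     # 1-D float waveform at 16 kHz
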
backend/app/vocoder/models/fatchord_version.py ADDED
@@ -0,0 +1,434 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+ from ..distribution import sample_from_discretized_mix_logistic
5
+ from ..display import *
6
+ from ..audio import *
7
+
8
+
9
+ class ResBlock(nn.Module):
10
+ def __init__(self, dims):
11
+ super().__init__()
12
+ self.conv1 = nn.Conv1d(dims, dims, kernel_size=1, bias=False)
13
+ self.conv2 = nn.Conv1d(dims, dims, kernel_size=1, bias=False)
14
+ self.batch_norm1 = nn.BatchNorm1d(dims)
15
+ self.batch_norm2 = nn.BatchNorm1d(dims)
16
+
17
+ def forward(self, x):
18
+ residual = x
19
+ x = self.conv1(x)
20
+ x = self.batch_norm1(x)
21
+ x = F.relu(x)
22
+ x = self.conv2(x)
23
+ x = self.batch_norm2(x)
24
+ return x + residual
25
+
26
+
27
+ class MelResNet(nn.Module):
28
+ def __init__(self, res_blocks, in_dims, compute_dims, res_out_dims, pad):
29
+ super().__init__()
30
+ k_size = pad * 2 + 1
31
+ self.conv_in = nn.Conv1d(in_dims, compute_dims, kernel_size=k_size, bias=False)
32
+ self.batch_norm = nn.BatchNorm1d(compute_dims)
33
+ self.layers = nn.ModuleList()
34
+ for i in range(res_blocks):
35
+ self.layers.append(ResBlock(compute_dims))
36
+ self.conv_out = nn.Conv1d(compute_dims, res_out_dims, kernel_size=1)
37
+
38
+ def forward(self, x):
39
+ x = self.conv_in(x)
40
+ x = self.batch_norm(x)
41
+ x = F.relu(x)
42
+ for f in self.layers: x = f(x)
43
+ x = self.conv_out(x)
44
+ return x
45
+
46
+
47
+ class Stretch2d(nn.Module):
48
+ def __init__(self, x_scale, y_scale):
49
+ super().__init__()
50
+ self.x_scale = x_scale
51
+ self.y_scale = y_scale
52
+
53
+ def forward(self, x):
54
+ b, c, h, w = x.size()
55
+ x = x.unsqueeze(-1).unsqueeze(3)
56
+ x = x.repeat(1, 1, 1, self.y_scale, 1, self.x_scale)
57
+ return x.view(b, c, h * self.y_scale, w * self.x_scale)
58
+
59
+
60
+ class UpsampleNetwork(nn.Module):
61
+ def __init__(self, feat_dims, upsample_scales, compute_dims,
62
+ res_blocks, res_out_dims, pad):
63
+ super().__init__()
64
+ total_scale = np.cumprod(upsample_scales)[-1]
65
+ self.indent = pad * total_scale
66
+ self.resnet = MelResNet(res_blocks, feat_dims, compute_dims, res_out_dims, pad)
67
+ self.resnet_stretch = Stretch2d(total_scale, 1)
68
+ self.up_layers = nn.ModuleList()
69
+ for scale in upsample_scales:
70
+ k_size = (1, scale * 2 + 1)
71
+ padding = (0, scale)
72
+ stretch = Stretch2d(scale, 1)
73
+ conv = nn.Conv2d(1, 1, kernel_size=k_size, padding=padding, bias=False)
74
+ conv.weight.data.fill_(1. / k_size[1])
75
+ self.up_layers.append(stretch)
76
+ self.up_layers.append(conv)
77
+
78
+ def forward(self, m):
79
+ aux = self.resnet(m).unsqueeze(1)
80
+ aux = self.resnet_stretch(aux)
81
+ aux = aux.squeeze(1)
82
+ m = m.unsqueeze(1)
83
+ for f in self.up_layers: m = f(m)
84
+ m = m.squeeze(1)[:, :, self.indent:-self.indent]
85
+ return m.transpose(1, 2), aux.transpose(1, 2)
86
+
87
+
88
+ class WaveRNN(nn.Module):
89
+ def __init__(self, rnn_dims, fc_dims, bits, pad, upsample_factors,
90
+ feat_dims, compute_dims, res_out_dims, res_blocks,
91
+ hop_length, sample_rate, mode='RAW'):
92
+ super().__init__()
93
+ self.mode = mode
94
+ self.pad = pad
95
+ if self.mode == 'RAW' :
96
+ self.n_classes = 2 ** bits
97
+ elif self.mode == 'MOL' :
98
+ self.n_classes = 30
99
+ else :
100
+ raise RuntimeError("Unknown model mode value - ", self.mode)
101
+
102
+ self.rnn_dims = rnn_dims
103
+ self.aux_dims = res_out_dims // 4
104
+ self.hop_length = hop_length
105
+ self.sample_rate = sample_rate
106
+
107
+ self.upsample = UpsampleNetwork(feat_dims, upsample_factors, compute_dims, res_blocks, res_out_dims, pad)
108
+ self.I = nn.Linear(feat_dims + self.aux_dims + 1, rnn_dims)
109
+ self.rnn1 = nn.GRU(rnn_dims, rnn_dims, batch_first=True)
110
+ self.rnn2 = nn.GRU(rnn_dims + self.aux_dims, rnn_dims, batch_first=True)
111
+ self.fc1 = nn.Linear(rnn_dims + self.aux_dims, fc_dims)
112
+ self.fc2 = nn.Linear(fc_dims + self.aux_dims, fc_dims)
113
+ self.fc3 = nn.Linear(fc_dims, self.n_classes)
114
+
115
+ self.step = nn.Parameter(torch.zeros(1).long(), requires_grad=False)
116
+ self.num_params()
117
+
118
+ def forward(self, x, mels):
119
+ self.step += 1
120
+ bsize = x.size(0)
121
+ if torch.cuda.is_available():
122
+ h1 = torch.zeros(1, bsize, self.rnn_dims).cuda()
123
+ h2 = torch.zeros(1, bsize, self.rnn_dims).cuda()
124
+ else:
125
+ h1 = torch.zeros(1, bsize, self.rnn_dims).cpu()
126
+ h2 = torch.zeros(1, bsize, self.rnn_dims).cpu()
127
+ mels, aux = self.upsample(mels)
128
+
129
+ aux_idx = [self.aux_dims * i for i in range(5)]
130
+ a1 = aux[:, :, aux_idx[0]:aux_idx[1]]
131
+ a2 = aux[:, :, aux_idx[1]:aux_idx[2]]
132
+ a3 = aux[:, :, aux_idx[2]:aux_idx[3]]
133
+ a4 = aux[:, :, aux_idx[3]:aux_idx[4]]
134
+
135
+ x = torch.cat([x.unsqueeze(-1), mels, a1], dim=2)
136
+ x = self.I(x)
137
+ res = x
138
+ x, _ = self.rnn1(x, h1)
139
+
140
+ x = x + res
141
+ res = x
142
+ x = torch.cat([x, a2], dim=2)
143
+ x, _ = self.rnn2(x, h2)
144
+
145
+ x = x + res
146
+ x = torch.cat([x, a3], dim=2)
147
+ x = F.relu(self.fc1(x))
148
+
149
+ x = torch.cat([x, a4], dim=2)
150
+ x = F.relu(self.fc2(x))
151
+ return self.fc3(x)
152
+
153
+ def generate(self, mels, batched, target, overlap, mu_law, progress_callback=None):
154
+ mu_law = mu_law if self.mode == 'RAW' else False
155
+ progress_callback = progress_callback or self.gen_display
156
+
157
+ self.eval()
158
+ output = []
159
+ start = time.time()
160
+ rnn1 = self.get_gru_cell(self.rnn1)
161
+ rnn2 = self.get_gru_cell(self.rnn2)
162
+
163
+ with torch.no_grad():
164
+ if torch.cuda.is_available():
165
+ mels = mels.cuda()
166
+ else:
167
+ mels = mels.cpu()
168
+ wave_len = (mels.size(-1) - 1) * self.hop_length
169
+ mels = self.pad_tensor(mels.transpose(1, 2), pad=self.pad, side='both')
170
+ mels, aux = self.upsample(mels.transpose(1, 2))
171
+
172
+ if batched:
173
+ mels = self.fold_with_overlap(mels, target, overlap)
174
+ aux = self.fold_with_overlap(aux, target, overlap)
175
+
176
+ b_size, seq_len, _ = mels.size()
177
+
178
+ if torch.cuda.is_available():
179
+ h1 = torch.zeros(b_size, self.rnn_dims).cuda()
180
+ h2 = torch.zeros(b_size, self.rnn_dims).cuda()
181
+ x = torch.zeros(b_size, 1).cuda()
182
+ else:
183
+ h1 = torch.zeros(b_size, self.rnn_dims).cpu()
184
+ h2 = torch.zeros(b_size, self.rnn_dims).cpu()
185
+ x = torch.zeros(b_size, 1).cpu()
186
+
187
+ d = self.aux_dims
188
+ aux_split = [aux[:, :, d * i:d * (i + 1)] for i in range(4)]
189
+
190
+ for i in range(seq_len):
191
+
192
+ m_t = mels[:, i, :]
193
+
194
+ a1_t, a2_t, a3_t, a4_t = (a[:, i, :] for a in aux_split)
195
+
196
+ x = torch.cat([x, m_t, a1_t], dim=1)
197
+ x = self.I(x)
198
+ h1 = rnn1(x, h1)
199
+
200
+ x = x + h1
201
+ inp = torch.cat([x, a2_t], dim=1)
202
+ h2 = rnn2(inp, h2)
203
+
204
+ x = x + h2
205
+ x = torch.cat([x, a3_t], dim=1)
206
+ x = F.relu(self.fc1(x))
207
+
208
+ x = torch.cat([x, a4_t], dim=1)
209
+ x = F.relu(self.fc2(x))
210
+
211
+ logits = self.fc3(x)
212
+
213
+ if self.mode == 'MOL':
214
+ sample = sample_from_discretized_mix_logistic(logits.unsqueeze(0).transpose(1, 2))
215
+ output.append(sample.view(-1))
216
+ if torch.cuda.is_available():
217
+ # x = torch.FloatTensor([[sample]]).cuda()
218
+ x = sample.transpose(0, 1).cuda()
219
+ else:
220
+ x = sample.transpose(0, 1)
221
+
222
+ elif self.mode == 'RAW' :
223
+ posterior = F.softmax(logits, dim=1)
224
+ distrib = torch.distributions.Categorical(posterior)
225
+
226
+ sample = 2 * distrib.sample().float() / (self.n_classes - 1.) - 1.
227
+ output.append(sample)
228
+ x = sample.unsqueeze(-1)
229
+ else:
230
+ raise RuntimeError("Unknown model mode value - ", self.mode)
231
+
232
+ if i % 100 == 0:
233
+ gen_rate = (i + 1) / (time.time() - start) * b_size / 1000
234
+ progress_callback(i, seq_len, b_size, gen_rate)
235
+
236
+ output = torch.stack(output).transpose(0, 1)
237
+ output = output.cpu().numpy()
238
+ output = output.astype(np.float64)
239
+
240
+ if batched:
241
+ output = self.xfade_and_unfold(output, target, overlap)
242
+ else:
243
+ output = output[0]
244
+
245
+ if mu_law:
246
+ output = decode_mu_law(output, self.n_classes, False)
247
+ if hp.apply_preemphasis:
248
+ output = de_emphasis(output)
249
+
250
+ # Fade-out at the end to avoid signal cutting out suddenly
251
+ fade_out = np.linspace(1, 0, 20 * self.hop_length)
252
+ output = output[:wave_len]
253
+ output[-20 * self.hop_length:] *= fade_out
254
+
255
+ self.train()
256
+
257
+ return output
258
+
259
+
260
+ def gen_display(self, i, seq_len, b_size, gen_rate):
261
+ pbar = progbar(i, seq_len)
262
+ msg = f'| {pbar} {i*b_size}/{seq_len*b_size} | Batch Size: {b_size} | Gen Rate: {gen_rate:.1f}kHz | '
263
+ stream(msg)
264
+
265
+ def get_gru_cell(self, gru):
266
+ gru_cell = nn.GRUCell(gru.input_size, gru.hidden_size)
267
+ gru_cell.weight_hh.data = gru.weight_hh_l0.data
268
+ gru_cell.weight_ih.data = gru.weight_ih_l0.data
269
+ gru_cell.bias_hh.data = gru.bias_hh_l0.data
270
+ gru_cell.bias_ih.data = gru.bias_ih_l0.data
271
+ return gru_cell
272
+
273
+ def pad_tensor(self, x, pad, side='both'):
274
+ # NB - this is just a quick method i need right now
275
+ # i.e., it won't generalise to other shapes/dims
276
+ b, t, c = x.size()
277
+ total = t + 2 * pad if side == 'both' else t + pad
278
+ if torch.cuda.is_available():
279
+ padded = torch.zeros(b, total, c).cuda()
280
+ else:
281
+ padded = torch.zeros(b, total, c).cpu()
282
+ if side == 'before' or side == 'both':
283
+ padded[:, pad:pad + t, :] = x
284
+ elif side == 'after':
285
+ padded[:, :t, :] = x
286
+ return padded
287
+
288
+ def fold_with_overlap(self, x, target, overlap):
289
+
290
+ ''' Fold the tensor with overlap for quick batched inference.
291
+ Overlap will be used for crossfading in xfade_and_unfold()
292
+
293
+ Args:
294
+ x (tensor) : Upsampled conditioning features.
295
+ shape=(1, timesteps, features)
296
+ target (int) : Target timesteps for each index of batch
297
+ overlap (int) : Timesteps for both xfade and rnn warmup
298
+
299
+ Return:
300
+ (tensor) : shape=(num_folds, target + 2 * overlap, features)
301
+
302
+ Details:
303
+ x = [[h1, h2, ... hn]]
304
+
305
+ Where each h is a vector of conditioning features
306
+
307
+ Eg: target=2, overlap=1 with x.size(1)=10
308
+
309
+ folded = [[h1, h2, h3, h4],
310
+ [h4, h5, h6, h7],
311
+ [h7, h8, h9, h10]]
312
+ '''
313
+
314
+ _, total_len, features = x.size()
315
+
316
+ # Calculate variables needed
317
+ num_folds = (total_len - overlap) // (target + overlap)
318
+ extended_len = num_folds * (overlap + target) + overlap
319
+ remaining = total_len - extended_len
320
+
321
+ # Pad if some time steps poking out
322
+ if remaining != 0:
323
+ num_folds += 1
324
+ padding = target + 2 * overlap - remaining
325
+ x = self.pad_tensor(x, padding, side='after')
326
+
327
+ if torch.cuda.is_available():
328
+ folded = torch.zeros(num_folds, target + 2 * overlap, features).cuda()
329
+ else:
330
+ folded = torch.zeros(num_folds, target + 2 * overlap, features).cpu()
331
+
332
+ # Get the values for the folded tensor
333
+ for i in range(num_folds):
334
+ start = i * (target + overlap)
335
+ end = start + target + 2 * overlap
336
+ folded[i] = x[:, start:end, :]
337
+
338
+ return folded
339
+
340
+ def xfade_and_unfold(self, y, target, overlap):
341
+
342
+ ''' Applies a crossfade and unfolds into a 1d array.
343
+
344
+ Args:
345
+ y (ndarray) : Batched sequences of audio samples
346
+ shape=(num_folds, target + 2 * overlap)
347
+ dtype=np.float64
348
+ overlap (int) : Timesteps for both xfade and rnn warmup
349
+
350
+ Return:
351
+ (ndarray) : audio samples in a 1d array
352
+ shape=(total_len)
353
+ dtype=np.float64
354
+
355
+ Details:
356
+ y = [[seq1],
357
+ [seq2],
358
+ [seq3]]
359
+
360
+ Apply a gain envelope at both ends of the sequences
361
+
362
+ y = [[seq1_in, seq1_target, seq1_out],
363
+ [seq2_in, seq2_target, seq2_out],
364
+ [seq3_in, seq3_target, seq3_out]]
365
+
366
+ Stagger and add up the groups of samples:
367
+
368
+ [seq1_in, seq1_target, (seq1_out + seq2_in), seq2_target, ...]
369
+
370
+ '''
371
+
372
+ num_folds, length = y.shape
373
+ target = length - 2 * overlap
374
+ total_len = num_folds * (target + overlap) + overlap
375
+
376
+ # Need some silence for the rnn warmup
377
+ silence_len = overlap // 2
378
+ fade_len = overlap - silence_len
379
+ silence = np.zeros((silence_len), dtype=np.float64)
380
+
381
+ # Equal power crossfade
382
+ t = np.linspace(-1, 1, fade_len, dtype=np.float64)
383
+ fade_in = np.sqrt(0.5 * (1 + t))
384
+ fade_out = np.sqrt(0.5 * (1 - t))
385
+
386
+ # Concat the silence to the fades
387
+ fade_in = np.concatenate([silence, fade_in])
388
+ fade_out = np.concatenate([fade_out, silence])
389
+
390
+ # Apply the gain to the overlap samples
391
+ y[:, :overlap] *= fade_in
392
+ y[:, -overlap:] *= fade_out
393
+
394
+ unfolded = np.zeros((total_len), dtype=np.float64)
395
+
396
+ # Loop to add up all the samples
397
+ for i in range(num_folds):
398
+ start = i * (target + overlap)
399
+ end = start + target + 2 * overlap
400
+ unfolded[start:end] += y[i]
401
+
402
+ return unfolded
403
+
404
+ def get_step(self) :
405
+ return self.step.data.item()
406
+
407
+ def checkpoint(self, model_dir, optimizer) :
408
+ k_steps = self.get_step() // 1000
409
+ self.save(model_dir.joinpath("checkpoint_%dk_steps.pt" % k_steps), optimizer)
410
+
411
+ def log(self, path, msg) :
412
+ with open(path, 'a') as f:
413
+ print(msg, file=f)
414
+
415
+ def load(self, path, optimizer) :
416
+ checkpoint = torch.load(path)
417
+ if "optimizer_state" in checkpoint:
418
+ self.load_state_dict(checkpoint["model_state"])
419
+ optimizer.load_state_dict(checkpoint["optimizer_state"])
420
+ else:
421
+ # Backwards compatibility
422
+ self.load_state_dict(checkpoint)
423
+
424
+ def save(self, path, optimizer) :
425
+ torch.save({
426
+ "model_state": self.state_dict(),
427
+ "optimizer_state": optimizer.state_dict(),
428
+ }, path)
429
+
430
+ def num_params(self, print_out=True):
431
+ parameters = filter(lambda p: p.requires_grad, self.parameters())
432
+ parameters = sum([np.prod(p.size()) for p in parameters]) / 1_000_000
433
+ if print_out :
434
+ print('Trainable Parameters: %.3fM' % parameters)
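
The batching scheme in generate() relies on fold_with_overlap() and xfade_and_unfold(); the arithmetic can be checked standalone with the numbers from the fold_with_overlap() docstring (target=2, overlap=1, 10 timesteps fold into three windows of length 4):

# Standalone illustration of the windowing used by fold_with_overlap().
target, overlap, total_len = 2, 1, 10

num_folds = (total_len - overlap) // (target + overlap)    # 3
extended_len = num_folds * (overlap + target) + overlap    # 10 -> nothing left over, no padding
assert total_len - extended_len == 0

windows = [(i * (target + overlap), i * (target + overlap) + target + 2 * overlap)
           for i in range(num_folds)]
print(windows)   # [(0, 4), (3, 7), (6, 10)]  i.e. [h1..h4], [h4..h7], [h7..h10]
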
backend/app/voice_cloning.py ADDED
@@ -0,0 +1,108 @@
1
+ """Core voice cloning logic shared by the API routes."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import shutil
6
+ import gc
7
+ import torch
8
+ from pathlib import Path
9
+ from typing import Dict, Tuple
10
+
11
+ import numpy as np
12
+ import soundfile as sf
13
+ from huggingface_hub import hf_hub_download
14
+
15
+ from encoder import inference as encoder_infer
16
+ from synthesizer import inference as synthesizer_infer
17
+ from synthesizer.hparams import hparams as syn_hp
18
+ from app.vocoder import inference as vocoder_infer
19
+
20
+
21
+ MODEL_SPECS: Dict[str, Tuple[str, str]] = {
22
+ "encoder.pt": ("AJ50/voice-clone-encoder", "encoder.pt"),
23
+ "synthesizer.pt": ("AJ50/voice-clone-synthesizer", "synthesizer.pt"),
24
+ "vocoder.pt": ("AJ50/voice-clone-vocoder", "vocoder.pt"),
25
+ }
26
+
27
+
28
+ def ensure_default_models(models_dir: Path) -> None:
29
+ """Download the required pretrained weights if they are missing."""
30
+
31
+ target_dir = models_dir / "default"
32
+ target_dir.mkdir(parents=True, exist_ok=True)
33
+
34
+ for filename, (repo_id, repo_filename) in MODEL_SPECS.items():
35
+ destination = target_dir / filename
36
+ if destination.exists():
37
+ continue
38
+
39
+ print(f"[Models] Downloading {filename} from {repo_id}...")
40
+ downloaded_path = Path(
41
+ hf_hub_download(repo_id=repo_id, filename=repo_filename)
42
+ )
43
+ shutil.copy2(downloaded_path, destination)
44
+ print(f"[Models] Saved to {destination}")
45
+
46
+
47
+ def synthesize(voice_path: Path, text: str, models_dir: Path, out_path: Path) -> Path:
48
+ """Run end-to-end voice cloning and return the generated audio path."""
49
+
50
+ ensure_default_models(models_dir)
51
+
52
+ enc_path = models_dir / "default" / "encoder.pt"
53
+ syn_path = models_dir / "default" / "synthesizer.pt"
54
+ voc_path = models_dir / "default" / "vocoder.pt"
55
+
56
+ for model_path in (enc_path, syn_path, voc_path):
57
+ if not model_path.exists():
58
+ raise RuntimeError(f"Model file missing: {model_path}")
59
+
60
+ print("[VoiceCloning] Loading encoder...")
61
+ encoder_infer.load_model(enc_path)
62
+ print("[VoiceCloning] Loading synthesizer...")
63
+ synthesizer = synthesizer_infer.Synthesizer(syn_path)
64
+ print("[VoiceCloning] Loading vocoder...")
65
+ vocoder_infer.load_model(voc_path)
66
+
67
+ if not voice_path.exists():
68
+ raise RuntimeError(f"Reference voice file not found: {voice_path}")
69
+
70
+ print("[VoiceCloning] Preprocessing reference audio...")
71
+ wav = encoder_infer.preprocess_wav(voice_path)
72
+ embed = encoder_infer.embed_utterance(wav)
73
+
74
+ print("[VoiceCloning] Generating mel-spectrogram...")
75
+ mels = synthesizer.synthesize_spectrograms([text], [embed])
76
+ mel = mels[0]
77
+
78
+ print("[VoiceCloning] Vocoding waveform...")
79
+ try:
80
+ waveform = synthesizer.griffin_lim(mel).astype(np.float32)
81
+ except Exception:
82
+ waveform = vocoder_infer.infer_waveform(
83
+ mel, normalize=True, batched=False, target=8000, overlap=800
84
+ ).astype(np.float32)
85
+
86
+ out_path.parent.mkdir(parents=True, exist_ok=True)
87
+ sf.write(out_path.as_posix(), waveform, syn_hp.sample_rate)
88
+ print(f"[VoiceCloning] Audio saved to {out_path}")
89
+
90
+ # Memory optimization for Render free tier
91
+ print("[VoiceCloning] Cleaning up models to free memory...")
92
+ try:
93
+ # Clear model caches
94
+ if hasattr(encoder_infer, '_model'):
95
+ encoder_infer._model = None
96
+ if hasattr(synthesizer_infer, '_model'):
97
+ synthesizer_infer._model = None
98
+ if hasattr(vocoder_infer, '_model'):
99
+ vocoder_infer._model = None
100
+
101
+ # Force garbage collection
102
+ gc.collect()
103
+ if torch.cuda.is_available():
104
+ torch.cuda.empty_cache()
105
+ except Exception as e:
106
+ print(f"[VoiceCloning] Warning during cleanup: {e}")
107
+
108
+ return out_path
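
End-to-end usage sketch for synthesize() (the paths are assumptions; the weights are fetched on first use by ensure_default_models, and out_path's parent directory is created automatically):

from pathlib import Path
from app.voice_cloning import synthesize

out = synthesize(
    voice_path=Path("enrolled_voices/voice_26bfa1ef.mp3"),   # reference speaker clip
    text="Hello, this is a cloned voice speaking.",
    models_dir=Path("models"),                               # weights end up in models/default/
    out_path=Path("outputs/cloned.wav"),
)
print("wrote", out)
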
backend/download_models.py ADDED
@@ -0,0 +1,54 @@
1
+ """
2
+ Download models from HuggingFace on startup
3
+ Run this once or on container startup for Render
4
+ """
5
+
6
+ from pathlib import Path
7
+ from huggingface_hub import hf_hub_download
8
+ import shutil
9
+ import sys
10
+
11
+ MODEL_SPECS = {
12
+ "encoder.pt": ("AJ50/voice-clone-encoder", "encoder.pt"),
13
+ "synthesizer.pt": ("AJ50/voice-clone-synthesizer", "synthesizer.pt"),
14
+ "vocoder.pt": ("AJ50/voice-clone-vocoder", "vocoder.pt"),
15
+ }
16
+
17
+ def download_models(models_dir: Path) -> bool:
18
+ """Download required models from HuggingFace if missing"""
19
+
20
+ target_dir = models_dir / "default"
21
+ target_dir.mkdir(parents=True, exist_ok=True)
22
+
23
+ print(f"[Models] Target directory: {target_dir}")
24
+
25
+ for filename, (repo_id, repo_filename) in MODEL_SPECS.items():
26
+ destination = target_dir / filename
27
+
28
+ # Skip if already exists
29
+ if destination.exists():
30
+ size_mb = destination.stat().st_size / (1024 * 1024)
31
+ print(f"✓ {filename} already exists ({size_mb:.1f} MB)")
32
+ continue
33
+
34
+ print(f"[Models] Downloading {filename} from {repo_id}...")
35
+ try:
36
+ downloaded_path = Path(
37
+ hf_hub_download(repo_id=repo_id, filename=repo_filename)
38
+ )
39
+ shutil.copy2(downloaded_path, destination)
40
+ size_mb = destination.stat().st_size / (1024 * 1024)
41
+ print(f"✓ Saved {filename} ({size_mb:.1f} MB) to {destination}")
42
+ except Exception as e:
43
+ print(f"✗ Failed to download {filename}: {e}")
44
+ return False
45
+
46
+ print("[Models] All models downloaded successfully!")
47
+ return True
48
+
49
+ if __name__ == "__main__":
50
+ backend_dir = Path(__file__).parent
51
+ models_dir = backend_dir / "models"
52
+
53
+ success = download_models(models_dir)
54
+ sys.exit(0 if success else 1)
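
The same downloader can also be called programmatically, for example from a deployment hook; a sketch assuming it runs from the backend/ directory:

from pathlib import Path
from download_models import download_models

ok = download_models(Path("models"))   # fills models/default/{encoder,synthesizer,vocoder}.pt
print("models ready" if ok else "download failed")
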
backend/encoder/__init__.py ADDED
File without changes
backend/encoder/audio.py ADDED
@@ -0,0 +1,117 @@
1
+ from scipy.ndimage import binary_dilation
2
+ from encoder.params_data import *
3
+ from pathlib import Path
4
+ from typing import Optional, Union
5
+ from warnings import warn
6
+ import numpy as np
7
+ import librosa
8
+ import struct
9
+
10
+ try:
11
+ import webrtcvad
12
+ except Exception:
13
+ warn("Unable to import 'webrtcvad'. This package enables noise removal and is recommended.")
14
+ webrtcvad=None
15
+
16
+ int16_max = (2 ** 15) - 1
17
+
18
+
19
+ def preprocess_wav(fpath_or_wav: Union[str, Path, np.ndarray],
20
+ source_sr: Optional[int] = None,
21
+ normalize: Optional[bool] = True,
22
+ trim_silence: Optional[bool] = True):
23
+ """
24
+ Applies the preprocessing operations used in training the Speaker Encoder to a waveform
25
+ either on disk or in memory. The waveform will be resampled to match the data hyperparameters.
26
+
27
+ :param fpath_or_wav: either a filepath to an audio file (many extensions are supported, not
28
+ just .wav), or the waveform as a numpy array of floats.
29
+ :param source_sr: if passing an audio waveform, the sampling rate of the waveform before
30
+ preprocessing. After preprocessing, the waveform's sampling rate will match the data
31
+ hyperparameters. If passing a filepath, the sampling rate will be automatically detected and
32
+ this argument will be ignored.
33
+ """
34
+ # Load the wav from disk if needed
35
+ if isinstance(fpath_or_wav, str) or isinstance(fpath_or_wav, Path):
36
+ wav, source_sr = librosa.load(str(fpath_or_wav), sr=None)
37
+ else:
38
+ wav = fpath_or_wav
39
+
40
+ # Resample the wav if needed
41
+ if source_sr is not None and source_sr != sampling_rate:
42
+ wav = librosa.resample(y=wav, orig_sr=source_sr, target_sr=sampling_rate)
43
+
44
+ # Apply the preprocessing: normalize volume and shorten long silences
45
+ if normalize:
46
+ wav = normalize_volume(wav, audio_norm_target_dBFS, increase_only=True)
47
+ if webrtcvad and trim_silence:
48
+ wav = trim_long_silences(wav)
49
+
50
+ return wav
51
+
52
+
53
+ def wav_to_mel_spectrogram(wav):
54
+ """
55
+ Derives a mel spectrogram ready to be used by the encoder from a preprocessed audio waveform.
56
+ Note: this is not a log-mel spectrogram.
57
+ """
58
+ frames = librosa.feature.melspectrogram(
59
+ y=wav,
60
+ sr=sampling_rate,
61
+ n_fft=int(sampling_rate * mel_window_length / 1000),
62
+ hop_length=int(sampling_rate * mel_window_step / 1000),
63
+ n_mels=mel_n_channels
64
+ )
65
+ return frames.astype(np.float32).T
66
+
67
+
68
+ def trim_long_silences(wav):
69
+ """
70
+ Ensures that segments without voice in the waveform remain no longer than a
71
+ threshold determined by the VAD parameters in params.py.
72
+
73
+ :param wav: the raw waveform as a numpy array of floats
74
+ :return: the same waveform with silences trimmed away (length <= original wav length)
75
+ """
76
+ # Compute the voice detection window size
77
+ samples_per_window = (vad_window_length * sampling_rate) // 1000
78
+
79
+ # Trim the end of the audio to have a multiple of the window size
80
+ wav = wav[:len(wav) - (len(wav) % samples_per_window)]
81
+
82
+ # Convert the float waveform to 16-bit mono PCM
83
+ pcm_wave = struct.pack("%dh" % len(wav), *(np.round(wav * int16_max)).astype(np.int16))
84
+
85
+ # Perform voice activation detection
86
+ voice_flags = []
87
+ vad = webrtcvad.Vad(mode=3)
88
+ for window_start in range(0, len(wav), samples_per_window):
89
+ window_end = window_start + samples_per_window
90
+ voice_flags.append(vad.is_speech(pcm_wave[window_start * 2:window_end * 2],
91
+ sample_rate=sampling_rate))
92
+ voice_flags = np.array(voice_flags)
93
+
94
+ # Smooth the voice detection with a moving average
95
+ def moving_average(array, width):
96
+ array_padded = np.concatenate((np.zeros((width - 1) // 2), array, np.zeros(width // 2)))
97
+ ret = np.cumsum(array_padded, dtype=float)
98
+ ret[width:] = ret[width:] - ret[:-width]
99
+ return ret[width - 1:] / width
100
+
101
+ audio_mask = moving_average(voice_flags, vad_moving_average_width)
102
+ audio_mask = np.round(audio_mask).astype(bool)
103
+
104
+ # Dilate the voiced regions
105
+ audio_mask = binary_dilation(audio_mask, np.ones(vad_max_silence_length + 1))
106
+ audio_mask = np.repeat(audio_mask, samples_per_window)
107
+
108
+ return wav[audio_mask == True]
109
+
110
+
111
+ def normalize_volume(wav, target_dBFS, increase_only=False, decrease_only=False):
112
+ if increase_only and decrease_only:
113
+ raise ValueError("Both increase only and decrease only are set")
114
+ dBFS_change = target_dBFS - 10 * np.log10(np.mean(wav ** 2))
115
+ if (dBFS_change < 0 and increase_only) or (dBFS_change > 0 and decrease_only):
116
+ return wav
117
+ return wav * (10 ** (dBFS_change / 20))
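
Sketch of the preprocessing chain above on one enrolled clip (the path is an assumption): load, resample to 16 kHz, normalize volume, trim long silences, then derive the 40-channel mel frames the encoder consumes:

from encoder.audio import preprocess_wav, wav_to_mel_spectrogram

wav = preprocess_wav("enrolled_voices/voice_26bfa1ef.mp3")   # float waveform at 16 kHz
frames = wav_to_mel_spectrogram(wav)                         # shape (n_frames, 40)
print(wav.shape, frames.shape)
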
backend/encoder/inference.py ADDED
@@ -0,0 +1,178 @@
1
+ from encoder.params_data import *
2
+ from encoder.model import SpeakerEncoder
3
+ from encoder.audio import preprocess_wav # We want to expose this function from here
4
+ from matplotlib import cm
5
+ from encoder import audio
6
+ from pathlib import Path
7
+ import numpy as np
8
+ import torch
9
+
10
+ _model = None # type: SpeakerEncoder
11
+ _device = None # type: torch.device
12
+
13
+
14
+ def load_model(weights_fpath: Path, device=None):
15
+ """
16
+ Loads the model in memory. If this function is not explicitly called, it will be run on the
17
+ first call to embed_frames() with the default weights file.
18
+
19
+ :param weights_fpath: the path to saved model weights.
20
+ :param device: either a torch device or the name of a torch device (e.g. "cpu", "cuda"). The
21
+ model will be loaded and will run on this device. Outputs will however always be on the cpu.
22
+ If None, will default to your GPU if it's available, otherwise your CPU.
23
+ """
24
+ # TODO: I think the slow loading of the encoder might have something to do with the device it
25
+ # was saved on. Worth investigating.
26
+ global _model, _device
27
+ if device is None:
28
+ _device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
29
+ else:
30
+ _device = torch.device(device)
31
+ _model = SpeakerEncoder(_device, torch.device("cpu"))
32
+ checkpoint = torch.load(weights_fpath, _device)
33
+ _model.load_state_dict(checkpoint["model_state"])
34
+ _model.eval()
35
+ print("Loaded encoder \"%s\" trained to step %d" % (weights_fpath.name, checkpoint["step"]))
36
+
37
+
38
+ def is_loaded():
39
+ return _model is not None
40
+
41
+
42
+ def embed_frames_batch(frames_batch):
43
+ """
44
+ Computes embeddings for a batch of mel spectrograms.
45
+
46
+ :param frames_batch: a batch of mel spectrograms as a numpy array of float32 of shape
47
+ (batch_size, n_frames, n_channels)
48
+ :return: the embeddings as a numpy array of float32 of shape (batch_size, model_embedding_size)
49
+ """
50
+ if _model is None:
51
+ raise Exception("Model was not loaded. Call load_model() before inference.")
52
+
53
+ frames = torch.from_numpy(frames_batch).to(_device)
54
+ embed = _model.forward(frames).detach().cpu().numpy()
55
+ return embed
56
+
57
+
58
+ def compute_partial_slices(n_samples, partial_utterance_n_frames=partials_n_frames,
59
+ min_pad_coverage=0.75, overlap=0.5):
60
+ """
61
+ Computes where to split an utterance waveform and its corresponding mel spectrogram to obtain
62
+ partial utterances of <partial_utterance_n_frames> each. Both the waveform and the mel
63
+ spectrogram slices are returned, so as to make each partial utterance waveform correspond to
64
+ its spectrogram. This function assumes that the mel spectrogram parameters used are those
65
+ defined in params_data.py.
66
+
67
+ The returned ranges may index past the end of the waveform. It is
68
+ recommended that you pad the waveform with zeros up to wave_slices[-1].stop.
69
+
70
+ :param n_samples: the number of samples in the waveform
71
+ :param partial_utterance_n_frames: the number of mel spectrogram frames in each partial
72
+ utterance
73
+ :param min_pad_coverage: when reaching the last partial utterance, it may or may not have
74
+ enough frames. If at least <min_pad_coverage> of <partial_utterance_n_frames> are present,
75
+ then the last partial utterance will be considered, as if we padded the audio. Otherwise,
76
+ it will be discarded, as if we trimmed the audio. If there aren't enough frames for 1 partial
77
+ utterance, this parameter is ignored so that the function always returns at least 1 slice.
78
+ :param overlap: by how much the partial utterance should overlap. If set to 0, the partial
79
+ utterances are entirely disjoint.
80
+ :return: the waveform slices and mel spectrogram slices as lists of array slices. Index
81
+ respectively the waveform and the mel spectrogram with these slices to obtain the partial
82
+ utterances.
83
+ """
84
+ assert 0 <= overlap < 1
85
+ assert 0 < min_pad_coverage <= 1
86
+
87
+ samples_per_frame = int((sampling_rate * mel_window_step / 1000))
88
+ n_frames = int(np.ceil((n_samples + 1) / samples_per_frame))
89
+ frame_step = max(int(np.round(partial_utterance_n_frames * (1 - overlap))), 1)
90
+
91
+ # Compute the slices
92
+ wav_slices, mel_slices = [], []
93
+ steps = max(1, n_frames - partial_utterance_n_frames + frame_step + 1)
94
+ for i in range(0, steps, frame_step):
95
+ mel_range = np.array([i, i + partial_utterance_n_frames])
96
+ wav_range = mel_range * samples_per_frame
97
+ mel_slices.append(slice(*mel_range))
98
+ wav_slices.append(slice(*wav_range))
99
+
100
+ # Evaluate whether extra padding is warranted or not
101
+ last_wav_range = wav_slices[-1]
102
+ coverage = (n_samples - last_wav_range.start) / (last_wav_range.stop - last_wav_range.start)
103
+ if coverage < min_pad_coverage and len(mel_slices) > 1:
104
+ mel_slices = mel_slices[:-1]
105
+ wav_slices = wav_slices[:-1]
106
+
107
+ return wav_slices, mel_slices
108
+
109
+
110
+ def embed_utterance(wav, using_partials=True, return_partials=False, **kwargs):
111
+ """
112
+ Computes an embedding for a single utterance.
113
+
114
+ # TODO: handle multiple wavs to benefit from batching on GPU
115
+ :param wav: a preprocessed (see audio.py) utterance waveform as a numpy array of float32
116
+ :param using_partials: if True, then the utterance is split in partial utterances of
117
+ <partial_utterance_n_frames> frames and the utterance embedding is computed from their
118
+ normalized average. If False, the utterance is instead computed from feeding the entire
119
+ spectrogram to the network.
120
+ :param return_partials: if True, the partial embeddings will also be returned along with the
121
+ wav slices that correspond to the partial embeddings.
122
+ :param kwargs: additional arguments to compute_partial_slices()
123
+ :return: the embedding as a numpy array of float32 of shape (model_embedding_size,). If
124
+ <return_partials> is True, the partial utterances as a numpy array of float32 of shape
125
+ (n_partials, model_embedding_size) and the wav partials as a list of slices will also be
126
+ returned. If <using_partials> is simultaneously set to False, both these values will be None
127
+ instead.
128
+ """
129
+ # Process the entire utterance if not using partials
130
+ if not using_partials:
131
+ frames = audio.wav_to_mel_spectrogram(wav)
132
+ embed = embed_frames_batch(frames[None, ...])[0]
133
+ if return_partials:
134
+ return embed, None, None
135
+ return embed
136
+
137
+ # Compute where to split the utterance into partials and pad if necessary
138
+ wave_slices, mel_slices = compute_partial_slices(len(wav), **kwargs)
139
+ max_wave_length = wave_slices[-1].stop
140
+ if max_wave_length >= len(wav):
141
+ wav = np.pad(wav, (0, max_wave_length - len(wav)), "constant")
142
+
143
+ # Split the utterance into partials
144
+ frames = audio.wav_to_mel_spectrogram(wav)
145
+ frames_batch = np.array([frames[s] for s in mel_slices])
146
+ partial_embeds = embed_frames_batch(frames_batch)
147
+
148
+ # Compute the utterance embedding from the partial embeddings
149
+ raw_embed = np.mean(partial_embeds, axis=0)
150
+ embed = raw_embed / np.linalg.norm(raw_embed, 2)
151
+
152
+ if return_partials:
153
+ return embed, partial_embeds, wave_slices
154
+ return embed
155
+
156
+
157
+ def embed_speaker(wavs, **kwargs):
158
+ raise NotImplementedError()
159
+
160
+
161
+ def plot_embedding_as_heatmap(embed, ax=None, title="", shape=None, color_range=(0, 0.30)):
162
+ import matplotlib.pyplot as plt
163
+ if ax is None:
164
+ ax = plt.gca()
165
+
166
+ if shape is None:
167
+ height = int(np.sqrt(len(embed)))
168
+ shape = (height, -1)
169
+ embed = embed.reshape(shape)
170
+
171
+ cmap = cm.get_cmap()
172
+ mappable = ax.imshow(embed, cmap=cmap)
173
+ cbar = plt.colorbar(mappable, ax=ax, fraction=0.046, pad=0.04)
174
+ sm = cm.ScalarMappable(cmap=cmap)
175
+ sm.set_clim(*color_range)
176
+
177
+ ax.set_xticks([]), ax.set_yticks([])
178
+ ax.set_title(title)
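
Typical embedding flow with this module (the weights path is an assumption; the clip is one of the enrolled voices included in this commit):

from pathlib import Path
from encoder import inference as encoder

encoder.load_model(Path("models/default/encoder.pt"))

wav = encoder.preprocess_wav("enrolled_voices/voice_26bfa1ef.mp3")
embed = encoder.embed_utterance(wav)            # L2-normalized, shape (256,)
print(embed.shape, float((embed ** 2).sum()))   # squared norm is ~1.0
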
backend/encoder/model.py ADDED
@@ -0,0 +1,135 @@
1
+ from encoder.params_model import *
2
+ from encoder.params_data import *
3
+ from scipy.interpolate import interp1d
4
+ from sklearn.metrics import roc_curve
5
+ from torch.nn.utils import clip_grad_norm_
6
+ from scipy.optimize import brentq
7
+ from torch import nn
8
+ import numpy as np
9
+ import torch
10
+
11
+
12
+ class SpeakerEncoder(nn.Module):
13
+ def __init__(self, device, loss_device):
14
+ super().__init__()
15
+ self.loss_device = loss_device
16
+
17
+ # Network definition
18
+ self.lstm = nn.LSTM(input_size=mel_n_channels,
19
+ hidden_size=model_hidden_size,
20
+ num_layers=model_num_layers,
21
+ batch_first=True).to(device)
22
+ self.linear = nn.Linear(in_features=model_hidden_size,
23
+ out_features=model_embedding_size).to(device)
24
+ self.relu = torch.nn.ReLU().to(device)
25
+
26
+ # Cosine similarity scaling (with fixed initial parameter values)
27
+ self.similarity_weight = nn.Parameter(torch.tensor([10.])).to(loss_device)
28
+ self.similarity_bias = nn.Parameter(torch.tensor([-5.])).to(loss_device)
29
+
30
+ # Loss
31
+ self.loss_fn = nn.CrossEntropyLoss().to(loss_device)
32
+
33
+ def do_gradient_ops(self):
34
+ # Gradient scale
35
+ self.similarity_weight.grad *= 0.01
36
+ self.similarity_bias.grad *= 0.01
37
+
38
+ # Gradient clipping
39
+ clip_grad_norm_(self.parameters(), 3, norm_type=2)
40
+
41
+ def forward(self, utterances, hidden_init=None):
42
+ """
43
+ Computes the embeddings of a batch of utterance spectrograms.
44
+
45
+ :param utterances: batch of mel-scale filterbanks of same duration as a tensor of shape
46
+ (batch_size, n_frames, n_channels)
47
+ :param hidden_init: initial hidden state of the LSTM as a tensor of shape (num_layers,
48
+ batch_size, hidden_size). Will default to a tensor of zeros if None.
49
+ :return: the embeddings as a tensor of shape (batch_size, embedding_size)
50
+ """
51
+ # Pass the input through the LSTM layers and retrieve all outputs, the final hidden state
52
+ # and the final cell state.
53
+ out, (hidden, cell) = self.lstm(utterances, hidden_init)
54
+
55
+ # We take only the hidden state of the last layer
56
+ embeds_raw = self.relu(self.linear(hidden[-1]))
57
+
58
+ # L2-normalize it
59
+ embeds = embeds_raw / (torch.norm(embeds_raw, dim=1, keepdim=True) + 1e-5)
60
+
61
+ return embeds
62
+
63
+ def similarity_matrix(self, embeds):
64
+ """
65
+ Computes the similarity matrix according to section 2.1 of GE2E.
66
+
67
+ :param embeds: the embeddings as a tensor of shape (speakers_per_batch,
68
+ utterances_per_speaker, embedding_size)
69
+ :return: the similarity matrix as a tensor of shape (speakers_per_batch,
70
+ utterances_per_speaker, speakers_per_batch)
71
+ """
72
+ speakers_per_batch, utterances_per_speaker = embeds.shape[:2]
73
+
74
+ # Inclusive centroids (1 per speaker). Cloning is needed for reverse differentiation
75
+ centroids_incl = torch.mean(embeds, dim=1, keepdim=True)
76
+ centroids_incl = centroids_incl.clone() / (torch.norm(centroids_incl, dim=2, keepdim=True) + 1e-5)
77
+
78
+ # Exclusive centroids (1 per utterance)
79
+ centroids_excl = (torch.sum(embeds, dim=1, keepdim=True) - embeds)
80
+ centroids_excl /= (utterances_per_speaker - 1)
81
+ centroids_excl = centroids_excl.clone() / (torch.norm(centroids_excl, dim=2, keepdim=True) + 1e-5)
82
+
83
+ # Similarity matrix. The cosine similarity of already 2-normed vectors is simply the dot
84
+ # product of these vectors (which is just an element-wise multiplication reduced by a sum).
85
+ # We vectorize the computation for efficiency.
86
+ sim_matrix = torch.zeros(speakers_per_batch, utterances_per_speaker,
87
+ speakers_per_batch).to(self.loss_device)
88
+ mask_matrix = 1 - np.eye(speakers_per_batch, dtype=int)
89
+ for j in range(speakers_per_batch):
90
+ mask = np.where(mask_matrix[j])[0]
91
+ sim_matrix[mask, :, j] = (embeds[mask] * centroids_incl[j]).sum(dim=2)
92
+ sim_matrix[j, :, j] = (embeds[j] * centroids_excl[j]).sum(dim=1)
93
+
94
+ ## Even more vectorized version (slower maybe because of transpose)
95
+ # sim_matrix2 = torch.zeros(speakers_per_batch, speakers_per_batch, utterances_per_speaker
96
+ # ).to(self.loss_device)
97
+ # eye = np.eye(speakers_per_batch, dtype=np.int)
98
+ # mask = np.where(1 - eye)
99
+ # sim_matrix2[mask] = (embeds[mask[0]] * centroids_incl[mask[1]]).sum(dim=2)
100
+ # mask = np.where(eye)
101
+ # sim_matrix2[mask] = (embeds * centroids_excl).sum(dim=2)
102
+ # sim_matrix2 = sim_matrix2.transpose(1, 2)
103
+
104
+ sim_matrix = sim_matrix * self.similarity_weight + self.similarity_bias
105
+ return sim_matrix
106
+
107
+ def loss(self, embeds):
108
+ """
109
+ Computes the softmax loss according to section 2.1 of GE2E.
110
+
111
+ :param embeds: the embeddings as a tensor of shape (speakers_per_batch,
112
+ utterances_per_speaker, embedding_size)
113
+ :return: the loss and the EER for this batch of embeddings.
114
+ """
115
+ speakers_per_batch, utterances_per_speaker = embeds.shape[:2]
116
+
117
+ # Loss
118
+ sim_matrix = self.similarity_matrix(embeds)
119
+ sim_matrix = sim_matrix.reshape((speakers_per_batch * utterances_per_speaker,
120
+ speakers_per_batch))
121
+ ground_truth = np.repeat(np.arange(speakers_per_batch), utterances_per_speaker)
122
+ target = torch.from_numpy(ground_truth).long().to(self.loss_device)
123
+ loss = self.loss_fn(sim_matrix, target)
124
+
125
+ # EER (not backpropagated)
126
+ with torch.no_grad():
127
+ inv_argmax = lambda i: np.eye(1, speakers_per_batch, i, dtype=int)[0]
128
+ labels = np.array([inv_argmax(i) for i in ground_truth])
129
+ preds = sim_matrix.detach().cpu().numpy()
130
+
131
+ # Snippet from https://yangcha.github.io/EER-ROC/
132
+ fpr, tpr, thresholds = roc_curve(labels.flatten(), preds.flatten())
133
+ eer = brentq(lambda x: 1. - x - interp1d(fpr, tpr)(x), 0., 1.)
134
+
135
+ return loss, eer
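
The similarity matrix and loss are training-time code; a quick CPU-only shape check with random, L2-normalized embeddings (batch sizes are arbitrary):

import torch
from encoder.model import SpeakerEncoder

device = torch.device("cpu")
model = SpeakerEncoder(device, loss_device=device)

speakers, utterances, dim = 4, 5, 256            # dim matches model_embedding_size
embeds = torch.randn(speakers, utterances, dim)
embeds = embeds / embeds.norm(dim=2, keepdim=True)

sim = model.similarity_matrix(embeds)            # (speakers, utterances, speakers)
loss, eer = model.loss(embeds)
print(sim.shape, float(loss), eer)
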
backend/encoder/params_data.py ADDED
@@ -0,0 +1,29 @@
1
+
2
+ ## Mel-filterbank
3
+ mel_window_length = 25 # In milliseconds
4
+ mel_window_step = 10 # In milliseconds
5
+ mel_n_channels = 40
6
+
7
+
8
+ ## Audio
9
+ sampling_rate = 16000
10
+ # Number of spectrogram frames in a partial utterance
11
+ partials_n_frames = 160 # 1600 ms
12
+ # Number of spectrogram frames at inference
13
+ inference_n_frames = 80 # 800 ms
14
+
15
+
16
+ ## Voice Activation Detection
17
+ # Window size of the VAD. Must be either 10, 20 or 30 milliseconds.
18
+ # This sets the granularity of the VAD. Should not need to be changed.
19
+ vad_window_length = 30 # In milliseconds
20
+ # Number of frames to average together when performing the moving average smoothing.
21
+ # The larger this value, the larger the VAD variations must be to not get smoothed out.
22
+ vad_moving_average_width = 8
23
+ # Maximum number of consecutive silent frames a segment can have.
24
+ vad_max_silence_length = 6
25
+
26
+
27
+ ## Audio volume normalization
28
+ audio_norm_target_dBFS = -30
29
+
backend/encoder/params_model.py ADDED
@@ -0,0 +1,11 @@
1
+
2
+ ## Model parameters
3
+ model_hidden_size = 256
4
+ model_embedding_size = 256
5
+ model_num_layers = 3
6
+
7
+
8
+ ## Training parameters
9
+ learning_rate_init = 1e-4
10
+ speakers_per_batch = 64
11
+ utterances_per_speaker = 10
backend/enrolled_voices/voice_26bfa1ef.mp3 ADDED
Binary file (20.6 kB).
 
backend/enrolled_voices/voice_72beeda9.mp3 ADDED
Binary file (20.6 kB).
 
backend/enrolled_voices/voices.json ADDED
@@ -0,0 +1,100 @@
1
+ [
2
+ {
3
+ "id": "voice_705f524b",
4
+ "name": "Pragyan",
5
+ "filename": "voice_705f524b.wav",
6
+ "filepath": "enrolled_voices\\voice_705f524b.wav",
7
+ "createdAt": "2025-11-05T11:15:58.834934"
8
+ },
9
+ {
10
+ "id": "voice_5b7e198d",
11
+ "name": "Pragyan",
12
+ "filename": "voice_5b7e198d.wav",
13
+ "filepath": "enrolled_voices\\voice_5b7e198d.wav",
14
+ "createdAt": "2025-11-05T11:23:18.943413"
15
+ },
16
+ {
17
+ "id": "voice_e0a7c06e",
18
+ "name": "Pragyan",
19
+ "filename": "voice_e0a7c06e.mp3",
20
+ "filepath": "enrolled_voices\\voice_e0a7c06e.mp3",
21
+ "createdAt": "2025-11-05T11:31:33.094765"
22
+ },
23
+ {
24
+ "id": "voice_7d278c5f",
25
+ "name": "mY",
26
+ "filename": "voice_7d278c5f.mp3",
27
+ "filepath": "enrolled_voices\\voice_7d278c5f.mp3",
28
+ "createdAt": "2025-11-05T11:49:35.933861"
29
+ },
30
+ {
31
+ "id": "voice_44c22d65",
32
+ "name": "My1",
33
+ "filename": "voice_44c22d65.mp3",
34
+ "filepath": "enrolled_voices\\voice_44c22d65.mp3",
35
+ "createdAt": "2025-11-05T11:49:52.844973"
36
+ },
37
+ {
38
+ "id": "voice_eb54f62d",
39
+ "name": "MY2",
40
+ "filename": "voice_eb54f62d.mp3",
41
+ "filepath": "enrolled_voices\\voice_eb54f62d.mp3",
42
+ "createdAt": "2025-11-05T11:50:13.886497"
43
+ },
44
+ {
45
+ "id": "voice_ecb824ec",
46
+ "name": "Monu",
47
+ "filename": "voice_ecb824ec.wav",
48
+ "filepath": "enrolled_voices\\voice_ecb824ec.wav",
49
+ "createdAt": "2025-11-06T10:28:22.279407"
50
+ },
51
+ {
52
+ "id": "voice_0adf8594",
53
+ "name": "Pragyan1",
54
+ "filename": "voice_0adf8594.wav",
55
+ "filepath": "enrolled_voices\\voice_0adf8594.wav",
56
+ "createdAt": "2025-11-06T14:22:06.737234"
57
+ },
58
+ {
59
+ "id": "voice_fd577924",
60
+ "name": "MY3",
61
+ "filename": "voice_fd577924.wav",
62
+ "filepath": "enrolled_voices\\voice_fd577924.wav",
63
+ "createdAt": "2025-11-20T15:15:40.488404"
64
+ },
65
+ {
66
+ "id": "voice_a51275b7",
67
+ "name": "Testing Voice",
68
+ "filename": "voice_a51275b7.wav",
69
+ "filepath": "enrolled_voices\\voice_a51275b7.wav",
70
+ "createdAt": "2025-11-20T15:23:43.665441"
71
+ },
72
+ {
73
+ "id": "voice_ea85f251",
74
+ "name": "test",
75
+ "filename": "voice_ea85f251.wav",
76
+ "filepath": "enrolled_voices\\voice_ea85f251.wav",
77
+ "createdAt": "2025-11-25T09:47:22.148753"
78
+ },
79
+ {
80
+ "id": "voice_a4e34f00",
81
+ "name": "Class",
82
+ "filename": "voice_a4e34f00.wav",
83
+ "filepath": "enrolled_voices\\voice_a4e34f00.wav",
84
+ "createdAt": "2025-11-25T10:32:08.525704"
85
+ },
86
+ {
87
+ "id": "voice_26bfa1ef",
88
+ "name": "Saksham voice",
89
+ "filename": "voice_26bfa1ef.mp3",
90
+ "filepath": "E:\\Sem 5\\mini proejct main\\pragyan branch\\backend\\enrolled_voices\\voice_26bfa1ef.mp3",
91
+ "createdAt": "2025-11-28T11:08:59.773738"
92
+ },
93
+ {
94
+ "id": "voice_72beeda9",
95
+ "name": "Saksham voice",
96
+ "filename": "voice_72beeda9.mp3",
97
+ "filepath": "E:\\Sem 5\\mini proejct main\\pragyan branch\\backend\\enrolled_voices\\voice_72beeda9.mp3",
98
+ "createdAt": "2025-11-28T11:16:33.409663"
99
+ }
100
+ ]
backend/requirements.txt ADDED
@@ -0,0 +1,14 @@
1
+ flask==2.3.3
2
+ flask-cors==4.0.0
3
+ gunicorn==21.2.0
4
+ torch>=2.5.0
5
+ librosa>=0.10.0
6
+ soundfile>=0.12.0
7
+ numpy>=1.21.0
8
+ huggingface_hub>=0.19.0
9
+ matplotlib>=3.5.0
10
+ webrtcvad==2.0.10
11
+ scipy>=1.6.0
12
+ scikit-learn>=1.1.0
13
+ unidecode>=1.2.0
14
+ inflect>=6.0.0
backend/runtime.txt ADDED
@@ -0,0 +1 @@
1
+ python-3.10.0
backend/synthesizer/__init__.py ADDED
@@ -0,0 +1 @@
1
+ #
backend/synthesizer/audio.py ADDED
@@ -0,0 +1,211 @@
1
+ import librosa
2
+ import librosa.filters
3
+ import numpy as np
4
+ from scipy import signal
5
+ from scipy.io import wavfile
6
+ import soundfile as sf
7
+
8
+
9
+ def load_wav(path, sr):
10
+ return librosa.core.load(path, sr=sr)[0]
11
+
12
+ def save_wav(wav, path, sr):
13
+ wav *= 32767 / max(0.01, np.max(np.abs(wav)))
14
+ #proposed by @dsmiller
15
+ wavfile.write(path, sr, wav.astype(np.int16))
16
+
17
+ def save_wavenet_wav(wav, path, sr):
18
+ sf.write(path, wav.astype(np.float32), sr)
19
+
20
+ def preemphasis(wav, k, preemphasize=True):
21
+ if preemphasize:
22
+ return signal.lfilter([1, -k], [1], wav)
23
+ return wav
24
+
25
+ def inv_preemphasis(wav, k, inv_preemphasize=True):
26
+ if inv_preemphasize:
27
+ return signal.lfilter([1], [1, -k], wav)
28
+ return wav
29
+
30
+ #From https://github.com/r9y9/wavenet_vocoder/blob/master/audio.py
31
+ def start_and_end_indices(quantized, silence_threshold=2):
32
+ for start in range(quantized.size):
33
+ if abs(quantized[start] - 127) > silence_threshold:
34
+ break
35
+ for end in range(quantized.size - 1, 1, -1):
36
+ if abs(quantized[end] - 127) > silence_threshold:
37
+ break
38
+
39
+ assert abs(quantized[start] - 127) > silence_threshold
40
+ assert abs(quantized[end] - 127) > silence_threshold
41
+
42
+ return start, end
43
+
44
+ def get_hop_size(hparams):
45
+ hop_size = hparams.hop_size
46
+ if hop_size is None:
47
+ assert hparams.frame_shift_ms is not None
48
+ hop_size = int(hparams.frame_shift_ms / 1000 * hparams.sample_rate)
49
+ return hop_size
50
+
51
+ def linearspectrogram(wav, hparams):
52
+ D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)
53
+ S = _amp_to_db(np.abs(D), hparams) - hparams.ref_level_db
54
+
55
+ if hparams.signal_normalization:
56
+ return _normalize(S, hparams)
57
+ return S
58
+
59
+ def melspectrogram(wav, hparams):
60
+ D = _stft(preemphasis(wav, hparams.preemphasis, hparams.preemphasize), hparams)
61
+ S = _amp_to_db(_linear_to_mel(np.abs(D), hparams), hparams) - hparams.ref_level_db
62
+
63
+ if hparams.signal_normalization:
64
+ return _normalize(S, hparams)
65
+ return S
66
+
67
+ def inv_linear_spectrogram(linear_spectrogram, hparams):
68
+ """Converts linear spectrogram to waveform using librosa"""
69
+ if hparams.signal_normalization:
70
+ D = _denormalize(linear_spectrogram, hparams)
71
+ else:
72
+ D = linear_spectrogram
73
+
74
+ S = _db_to_amp(D + hparams.ref_level_db) #Convert back to linear
75
+
76
+ if hparams.use_lws:
77
+ processor = _lws_processor(hparams)
78
+ D = processor.run_lws(S.astype(np.float64).T ** hparams.power)
79
+ y = processor.istft(D).astype(np.float32)
80
+ return inv_preemphasis(y, hparams.preemphasis, hparams.preemphasize)
81
+ else:
82
+ return inv_preemphasis(_griffin_lim(S ** hparams.power, hparams), hparams.preemphasis, hparams.preemphasize)
83
+
84
+ def inv_mel_spectrogram(mel_spectrogram, hparams):
85
+ """Converts mel spectrogram to waveform using librosa"""
86
+ if hparams.signal_normalization:
87
+ D = _denormalize(mel_spectrogram, hparams)
88
+ else:
89
+ D = mel_spectrogram
90
+
91
+ S = _mel_to_linear(_db_to_amp(D + hparams.ref_level_db), hparams) # Convert back to linear
92
+
93
+ if hparams.use_lws:
94
+ processor = _lws_processor(hparams)
95
+ D = processor.run_lws(S.astype(np.float64).T ** hparams.power)
96
+ y = processor.istft(D).astype(np.float32)
97
+ return inv_preemphasis(y, hparams.preemphasis, hparams.preemphasize)
98
+ else:
99
+ return inv_preemphasis(_griffin_lim(S ** hparams.power, hparams), hparams.preemphasis, hparams.preemphasize)
100
+
101
+ def _lws_processor(hparams):
102
+ import lws
103
+ return lws.lws(hparams.n_fft, get_hop_size(hparams), fftsize=hparams.win_size, mode="speech")
104
+
105
+ def _griffin_lim(S, hparams):
106
+ """librosa implementation of Griffin-Lim
107
+ Based on https://github.com/librosa/librosa/issues/434
108
+ """
109
+ angles = np.exp(2j * np.pi * np.random.rand(*S.shape))
110
+ S_complex = np.abs(S).astype(np.complex128)
111
+ y = _istft(S_complex * angles, hparams)
112
+ for i in range(hparams.griffin_lim_iters):
113
+ angles = np.exp(1j * np.angle(_stft(y, hparams)))
114
+ y = _istft(S_complex * angles, hparams)
115
+ return y
116
+
117
+ def _stft(y, hparams):
118
+ if hparams.use_lws:
119
+ return _lws_processor(hparams).stft(y).T
120
+ else:
121
+ return librosa.stft(y=y, n_fft=hparams.n_fft, hop_length=get_hop_size(hparams), win_length=hparams.win_size)
122
+
123
+ def _istft(y, hparams):
124
+ return librosa.istft(y, hop_length=get_hop_size(hparams), win_length=hparams.win_size)
125
+
126
+ ##########################################################
127
+ #Those are only correct when using lws!!! (This was messing with Wavenet quality for a long time!)
128
+ def num_frames(length, fsize, fshift):
129
+ """Compute number of time frames of spectrogram
130
+ """
131
+ pad = (fsize - fshift)
132
+ if length % fshift == 0:
133
+ M = (length + pad * 2 - fsize) // fshift + 1
134
+ else:
135
+ M = (length + pad * 2 - fsize) // fshift + 2
136
+ return M
137
+
138
+
139
+ def pad_lr(x, fsize, fshift):
140
+ """Compute left and right padding
141
+ """
142
+ M = num_frames(len(x), fsize, fshift)
143
+ pad = (fsize - fshift)
144
+ T = len(x) + 2 * pad
145
+ r = (M - 1) * fshift + fsize - T
146
+ return pad, pad + r
147
+ ##########################################################
148
+ #Librosa correct padding
149
+ def librosa_pad_lr(x, fsize, fshift):
150
+ return 0, (x.shape[0] // fshift + 1) * fshift - x.shape[0]
151
+
152
+ # Conversions
153
+ _mel_basis = None
154
+ _inv_mel_basis = None
155
+
156
+ def _linear_to_mel(spectogram, hparams):
157
+ global _mel_basis
158
+ if _mel_basis is None:
159
+ _mel_basis = _build_mel_basis(hparams)
160
+ return np.dot(_mel_basis, spectogram)
161
+
162
+ def _mel_to_linear(mel_spectrogram, hparams):
163
+ global _inv_mel_basis
164
+ if _inv_mel_basis is None:
165
+ _inv_mel_basis = np.linalg.pinv(_build_mel_basis(hparams))
166
+ return np.maximum(1e-10, np.dot(_inv_mel_basis, mel_spectrogram))
167
+
168
+ def _build_mel_basis(hparams):
169
+ assert hparams.fmax <= hparams.sample_rate // 2
170
+ return librosa.filters.mel(
171
+ sr=hparams.sample_rate,
172
+ n_fft=hparams.n_fft,
173
+ n_mels=hparams.num_mels,
174
+ fmin=hparams.fmin,
175
+ fmax=hparams.fmax
176
+ )
177
+
178
+ def _amp_to_db(x, hparams):
179
+ min_level = np.exp(hparams.min_level_db / 20 * np.log(10))
180
+ return 20 * np.log10(np.maximum(min_level, x))
181
+
182
+ def _db_to_amp(x):
183
+ return np.power(10.0, (x) * 0.05)
184
+
185
+ def _normalize(S, hparams):
186
+ if hparams.allow_clipping_in_normalization:
187
+ if hparams.symmetric_mels:
188
+ return np.clip((2 * hparams.max_abs_value) * ((S - hparams.min_level_db) / (-hparams.min_level_db)) - hparams.max_abs_value,
189
+ -hparams.max_abs_value, hparams.max_abs_value)
190
+ else:
191
+ return np.clip(hparams.max_abs_value * ((S - hparams.min_level_db) / (-hparams.min_level_db)), 0, hparams.max_abs_value)
192
+
193
+ assert S.max() <= 0 and S.min() - hparams.min_level_db >= 0
194
+ if hparams.symmetric_mels:
195
+ return (2 * hparams.max_abs_value) * ((S - hparams.min_level_db) / (-hparams.min_level_db)) - hparams.max_abs_value
196
+ else:
197
+ return hparams.max_abs_value * ((S - hparams.min_level_db) / (-hparams.min_level_db))
198
+
199
+ def _denormalize(D, hparams):
200
+ if hparams.allow_clipping_in_normalization:
201
+ if hparams.symmetric_mels:
202
+ return (((np.clip(D, -hparams.max_abs_value,
203
+ hparams.max_abs_value) + hparams.max_abs_value) * -hparams.min_level_db / (2 * hparams.max_abs_value))
204
+ + hparams.min_level_db)
205
+ else:
206
+ return ((np.clip(D, 0, hparams.max_abs_value) * -hparams.min_level_db / hparams.max_abs_value) + hparams.min_level_db)
207
+
208
+ if hparams.symmetric_mels:
209
+ return (((D + hparams.max_abs_value) * -hparams.min_level_db / (2 * hparams.max_abs_value)) + hparams.min_level_db)
210
+ else:
211
+ return ((D * -hparams.min_level_db / hparams.max_abs_value) + hparams.min_level_db)
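
Round-trip sketch with the helpers above and the project hparams (input clip and output filename are assumptions): waveform to mel and back through Griffin-Lim, since use_lws is False in this configuration:

from synthesizer import audio
from synthesizer.hparams import hparams

wav = audio.load_wav("enrolled_voices/voice_26bfa1ef.mp3", sr=hparams.sample_rate)
mel = audio.melspectrogram(wav, hparams)            # (num_mels, frames), values in [-4, 4]
approx = audio.inv_mel_spectrogram(mel, hparams)    # Griffin-Lim reconstruction
audio.save_wav(approx, "griffin_lim_demo.wav", sr=hparams.sample_rate)
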
backend/synthesizer/hparams.py ADDED
@@ -0,0 +1,92 @@
1
+ import ast
2
+ import pprint
3
+
4
+ class HParams(object):
5
+ def __init__(self, **kwargs): self.__dict__.update(kwargs)
6
+ def __setitem__(self, key, value): setattr(self, key, value)
7
+ def __getitem__(self, key): return getattr(self, key)
8
+ def __repr__(self): return pprint.pformat(self.__dict__)
9
+
10
+ def parse(self, string):
11
+ # Overrides hparams from a comma-separated string of name=value pairs
12
+ if len(string) > 0:
13
+ overrides = [s.split("=") for s in string.split(",")]
14
+ keys, values = zip(*overrides)
15
+ keys = list(map(str.strip, keys))
16
+ values = list(map(str.strip, values))
17
+ for k in keys:
18
+ self.__dict__[k] = ast.literal_eval(values[keys.index(k)])
19
+ return self
20
+
21
+ hparams = HParams(
22
+ ### Signal Processing (used in both synthesizer and vocoder)
23
+ sample_rate = 16000,
24
+ n_fft = 800,
25
+ num_mels = 80,
26
+ hop_size = 200, # Tacotron uses 12.5 ms frame shift (set to sample_rate * 0.0125)
27
+ win_size = 800, # Tacotron uses 50 ms frame length (set to sample_rate * 0.050)
28
+ fmin = 55,
29
+ min_level_db = -100,
30
+ ref_level_db = 20,
31
+ max_abs_value = 4., # Gradient explodes if too big, premature convergence if too small.
32
+ preemphasis = 0.97, # Filter coefficient to use if preemphasize is True
33
+ preemphasize = True,
34
+
35
+ ### Tacotron Text-to-Speech (TTS)
36
+ tts_embed_dims = 512, # Embedding dimension for the graphemes/phoneme inputs
37
+ tts_encoder_dims = 256,
38
+ tts_decoder_dims = 128,
39
+ tts_postnet_dims = 512,
40
+ tts_encoder_K = 5,
41
+ tts_lstm_dims = 1024,
42
+ tts_postnet_K = 5,
43
+ tts_num_highways = 4,
44
+ tts_dropout = 0.5,
45
+ tts_cleaner_names = ["english_cleaners"],
46
+ tts_stop_threshold = -3.4, # Value below which audio generation ends.
47
+ # For example, for a range of [-4, 4], this
48
+ # will terminate the sequence at the first
49
+ # frame that has all values < -3.4
50
+
51
+ ### Tacotron Training
52
+ tts_schedule = [(2, 1e-3, 20_000, 12), # Progressive training schedule
53
+ (2, 5e-4, 40_000, 12), # (r, lr, step, batch_size)
54
+ (2, 2e-4, 80_000, 12), #
55
+ (2, 1e-4, 160_000, 12), # r = reduction factor (# of mel frames
56
+ (2, 3e-5, 320_000, 12), # synthesized for each decoder iteration)
57
+ (2, 1e-5, 640_000, 12)], # lr = learning rate
58
+
59
+ tts_clip_grad_norm = 1.0, # clips the gradient norm to prevent explosion - set to None if not needed
60
+ tts_eval_interval = 500, # Number of steps between model evaluation (sample generation)
61
+ # Set to -1 to generate after completing epoch, or 0 to disable
62
+
63
+ tts_eval_num_samples = 1, # Makes this number of samples
64
+
65
+ ### Data Preprocessing
66
+ max_mel_frames = 900,
67
+ rescale = True,
68
+ rescaling_max = 0.9,
69
+ synthesis_batch_size = 16, # For vocoder preprocessing and inference.
70
+
71
+ ### Mel Visualization and Griffin-Lim
72
+ signal_normalization = True,
73
+ power = 1.5,
74
+ griffin_lim_iters = 60,
75
+
76
+ ### Audio processing options
77
+ fmax = 7600, # Should not exceed (sample_rate // 2)
78
+ allow_clipping_in_normalization = True, # Used when signal_normalization = True
79
+ clip_mels_length = True, # If true, discards samples exceeding max_mel_frames
80
+ use_lws = False, # "Fast spectrogram phase recovery using local weighted sums"
81
+ symmetric_mels = True, # Sets mel range to [-max_abs_value, max_abs_value] if True,
82
+ # and [0, max_abs_value] if False
83
+ trim_silence = True, # Use with sample_rate of 16000 for best results
84
+
85
+ ### SV2TTS
86
+ speaker_embedding_size = 256, # Dimension for the speaker embedding
87
+ silence_min_duration_split = 0.4, # Duration in seconds of a silence for an utterance to be split
88
+ utterance_min_duration = 1.6, # Duration in seconds below which utterances are discarded
89
+ )
90
+
91
+ def hparams_debug_string():
92
+ return str(hparams)
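A small usage sketch for the `HParams` container above, assuming the `backend/` directory is on the Python path so the `synthesizer` package imports resolve:

```python
from synthesizer.hparams import hparams, hparams_debug_string

# parse() takes a comma-separated list of name=value overrides,
# each value being evaluated with ast.literal_eval.
hparams.parse("sample_rate=22050,rescale=False")
assert hparams.sample_rate == 22050 and hparams.rescale is False

# Dict-style access goes through __getitem__/__setitem__.
hparams["griffin_lim_iters"] = 30
print(hparams_debug_string())
```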
backend/synthesizer/inference.py ADDED
@@ -0,0 +1,165 @@
1
+ import torch
2
+ from synthesizer import audio
3
+ from synthesizer.hparams import hparams
4
+ from synthesizer.models.tacotron import Tacotron
5
+ from synthesizer.utils.symbols import symbols
6
+ from synthesizer.utils.text import text_to_sequence
7
+ from app.vocoder.display import simple_table
8
+ from pathlib import Path
9
+ from typing import Union, List
10
+ import numpy as np
11
+ import librosa
12
+
13
+
14
+ class Synthesizer:
15
+ sample_rate = hparams.sample_rate
16
+ hparams = hparams
17
+
18
+ def __init__(self, model_fpath: Path, verbose=True):
19
+ """
20
+ The model isn't instantiated and loaded in memory until needed or until load() is called.
21
+
22
+ :param model_fpath: path to the trained model file
23
+ :param verbose: if False, prints less information when using the model
24
+ """
25
+ self.model_fpath = model_fpath
26
+ self.verbose = verbose
27
+
28
+ # Check for GPU
29
+ if torch.cuda.is_available():
30
+ self.device = torch.device("cuda")
31
+ else:
32
+ self.device = torch.device("cpu")
33
+ if self.verbose:
34
+ print("Synthesizer using device:", self.device)
35
+
36
+ # Tacotron model will be instantiated later on first use.
37
+ self._model = None
38
+
39
+ def is_loaded(self):
40
+ """
41
+ Whether the model is loaded in memory.
42
+ """
43
+ return self._model is not None
44
+
45
+ def load(self):
46
+ """
47
+ Instantiates and loads the model given the weights file that was passed in the constructor.
48
+ """
49
+ self._model = Tacotron(embed_dims=hparams.tts_embed_dims,
50
+ num_chars=len(symbols),
51
+ encoder_dims=hparams.tts_encoder_dims,
52
+ decoder_dims=hparams.tts_decoder_dims,
53
+ n_mels=hparams.num_mels,
54
+ fft_bins=hparams.num_mels,
55
+ postnet_dims=hparams.tts_postnet_dims,
56
+ encoder_K=hparams.tts_encoder_K,
57
+ lstm_dims=hparams.tts_lstm_dims,
58
+ postnet_K=hparams.tts_postnet_K,
59
+ num_highways=hparams.tts_num_highways,
60
+ dropout=hparams.tts_dropout,
61
+ stop_threshold=hparams.tts_stop_threshold,
62
+ speaker_embedding_size=hparams.speaker_embedding_size).to(self.device)
63
+
64
+ self._model.load(self.model_fpath)
65
+ self._model.eval()
66
+
67
+ if self.verbose:
68
+ print("Loaded synthesizer \"%s\" trained to step %d" % (self.model_fpath.name, self._model.state_dict()["step"]))
69
+
70
+ def synthesize_spectrograms(self, texts: List[str],
71
+ embeddings: Union[np.ndarray, List[np.ndarray]],
72
+ return_alignments=False):
73
+ """
74
+ Synthesizes mel spectrograms from texts and speaker embeddings.
75
+
76
+ :param texts: a list of N text prompts to be synthesized
77
+ :param embeddings: a numpy array or list of speaker embeddings of shape (N, 256)
78
+ :param return_alignments: if True, a matrix representing the alignments between the
79
+ characters
80
+ and each decoder output step will be returned for each spectrogram
81
+ :return: a list of N melspectrograms as numpy arrays of shape (80, Mi), where Mi is the
82
+ sequence length of spectrogram i, and possibly the alignments.
83
+ """
84
+ # Load the model on the first request.
85
+ if not self.is_loaded():
86
+ self.load()
87
+
88
+ # Preprocess text inputs
89
+ inputs = [text_to_sequence(text.strip(), hparams.tts_cleaner_names) for text in texts]
90
+ if not isinstance(embeddings, list):
91
+ embeddings = [embeddings]
92
+
93
+ # Batch inputs
94
+ batched_inputs = [inputs[i:i+hparams.synthesis_batch_size]
95
+ for i in range(0, len(inputs), hparams.synthesis_batch_size)]
96
+ batched_embeds = [embeddings[i:i+hparams.synthesis_batch_size]
97
+ for i in range(0, len(embeddings), hparams.synthesis_batch_size)]
98
+
99
+ specs = []
100
+ for i, batch in enumerate(batched_inputs, 1):
101
+ if self.verbose:
102
+ print(f"\n| Generating {i}/{len(batched_inputs)}")
103
+
104
+ # Pad texts so they are all the same length
105
+ text_lens = [len(text) for text in batch]
106
+ max_text_len = max(text_lens)
107
+ chars = [pad1d(text, max_text_len) for text in batch]
108
+ chars = np.stack(chars)
109
+
110
+ # Stack speaker embeddings into 2D array for batch processing
111
+ speaker_embeds = np.stack(batched_embeds[i-1])
112
+
113
+ # Convert to tensor
114
+ chars = torch.tensor(chars).long().to(self.device)
115
+ speaker_embeddings = torch.tensor(speaker_embeds).float().to(self.device)
116
+
117
+ # Inference
118
+ _, mels, alignments = self._model.generate(chars, speaker_embeddings)
119
+ mels = mels.detach().cpu().numpy()
120
+ for m in mels:
121
+ # Trim silence from end of each spectrogram
122
+ while np.max(m[:, -1]) < hparams.tts_stop_threshold:
123
+ m = m[:, :-1]
124
+ specs.append(m)
125
+
126
+ if self.verbose:
127
+ print("\n\nDone.\n")
128
+ return (specs, alignments) if return_alignments else specs
129
+
130
+ @staticmethod
131
+ def load_preprocess_wav(fpath):
132
+ """
133
+ Loads and preprocesses an audio file under the same conditions the audio files were used to
134
+ train the synthesizer.
135
+ """
136
+ wav = librosa.load(str(fpath), sr=hparams.sample_rate)[0]
137
+ if hparams.rescale:
138
+ wav = wav / np.abs(wav).max() * hparams.rescaling_max
139
+ return wav
140
+
141
+ @staticmethod
142
+ def make_spectrogram(fpath_or_wav: Union[str, Path, np.ndarray]):
143
+ """
144
+ Creates a mel spectrogram from an audio file in the same manner as the mel spectrograms that
145
+ were fed to the synthesizer when training.
146
+ """
147
+ if isinstance(fpath_or_wav, str) or isinstance(fpath_or_wav, Path):
148
+ wav = Synthesizer.load_preprocess_wav(fpath_or_wav)
149
+ else:
150
+ wav = fpath_or_wav
151
+
152
+ mel_spectrogram = audio.melspectrogram(wav, hparams).astype(np.float32)
153
+ return mel_spectrogram
154
+
155
+ @staticmethod
156
+ def griffin_lim(mel):
157
+ """
158
+ Inverts a mel spectrogram using Griffin-Lim. The mel spectrogram is expected to have been built
159
+ with the same parameters present in hparams.py.
160
+ """
161
+ return audio.inv_mel_spectrogram(mel, hparams)
162
+
163
+
164
+ def pad1d(x, max_len, pad_value=0):
165
+ return np.pad(x, (0, max_len - len(x)), mode="constant", constant_values=pad_value)
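A hedged end-to-end sketch of the `Synthesizer` API above. The checkpoint path is an assumption (it depends on where the model weights are downloaded to), the random vector only stands in for a real speaker embedding produced by the encoder, and the script assumes it is run from `backend/` with the pretrained synthesizer weights available:

```python
import numpy as np
from pathlib import Path
from synthesizer.inference import Synthesizer

synth = Synthesizer(Path("models/default/synthesizer.pt"))  # hypothetical path

texts = ["Hello world.", "This is a cloned voice."]
embed = np.random.rand(256).astype(np.float32)  # stand-in for an encoder embedding
embed /= np.linalg.norm(embed)

specs = synth.synthesize_spectrograms(texts, [embed, embed])
print([s.shape for s in specs])  # each spectrogram is (80, Mi)

# Rough audio preview without the neural vocoder:
wav = Synthesizer.griffin_lim(specs[0])
```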
backend/synthesizer/models/tacotron.py ADDED
@@ -0,0 +1,542 @@
1
+ import os
2
+ import numpy as np
3
+ import torch
4
+ import torch.nn as nn
5
+ import torch.nn.functional as F
6
+ from pathlib import Path
7
+ from typing import Union
8
+
9
+
10
+ class HighwayNetwork(nn.Module):
11
+ def __init__(self, size):
12
+ super().__init__()
13
+ self.W1 = nn.Linear(size, size)
14
+ self.W2 = nn.Linear(size, size)
15
+ self.W1.bias.data.fill_(0.)
16
+
17
+ def forward(self, x):
18
+ x1 = self.W1(x)
19
+ x2 = self.W2(x)
20
+ g = torch.sigmoid(x2)
21
+ y = g * F.relu(x1) + (1. - g) * x
22
+ return y
23
+
24
+
25
+ class Encoder(nn.Module):
26
+ def __init__(self, embed_dims, num_chars, encoder_dims, K, num_highways, dropout):
27
+ super().__init__()
28
+ prenet_dims = (encoder_dims, encoder_dims)
29
+ cbhg_channels = encoder_dims
30
+ self.embedding = nn.Embedding(num_chars, embed_dims)
31
+ self.pre_net = PreNet(embed_dims, fc1_dims=prenet_dims[0], fc2_dims=prenet_dims[1],
32
+ dropout=dropout)
33
+ self.cbhg = CBHG(K=K, in_channels=cbhg_channels, channels=cbhg_channels,
34
+ proj_channels=[cbhg_channels, cbhg_channels],
35
+ num_highways=num_highways)
36
+
37
+ def forward(self, x, speaker_embedding=None):
38
+ x = self.embedding(x)
39
+ x = self.pre_net(x)
40
+ x.transpose_(1, 2)
41
+ x = self.cbhg(x)
42
+ if speaker_embedding is not None:
43
+ x = self.add_speaker_embedding(x, speaker_embedding)
44
+ return x
45
+
46
+ def add_speaker_embedding(self, x, speaker_embedding):
47
+ # SV2TTS
48
+ # The input x is the encoder output and is a 3D tensor with size (batch_size, num_chars, tts_embed_dims)
49
+ # When training, speaker_embedding is also a 2D tensor with size (batch_size, speaker_embedding_size)
50
+ # (for inference, speaker_embedding is a 1D tensor with size (speaker_embedding_size))
51
+ # This concats the speaker embedding for each char in the encoder output
52
+
53
+ # Save the dimensions as human-readable names
54
+ batch_size = x.size()[0]
55
+ num_chars = x.size()[1]
56
+
57
+ if speaker_embedding.dim() == 1:
58
+ idx = 0
59
+ else:
60
+ idx = 1
61
+
62
+ # Start by making a copy of each speaker embedding to match the input text length
63
+ # The output of this has size (batch_size, num_chars * tts_embed_dims)
64
+ speaker_embedding_size = speaker_embedding.size()[idx]
65
+ e = speaker_embedding.repeat_interleave(num_chars, dim=idx)
66
+
67
+ # Reshape it and transpose
68
+ e = e.reshape(batch_size, speaker_embedding_size, num_chars)
69
+ e = e.transpose(1, 2)
70
+
71
+ # Concatenate the tiled speaker embedding with the encoder output
72
+ x = torch.cat((x, e), 2)
73
+ return x
74
+
75
+
76
+ class BatchNormConv(nn.Module):
77
+ def __init__(self, in_channels, out_channels, kernel, relu=True):
78
+ super().__init__()
79
+ self.conv = nn.Conv1d(in_channels, out_channels, kernel, stride=1, padding=kernel // 2, bias=False)
80
+ self.bnorm = nn.BatchNorm1d(out_channels)
81
+ self.relu = relu
82
+
83
+ def forward(self, x):
84
+ x = self.conv(x)
85
+ x = F.relu(x) if self.relu is True else x
86
+ return self.bnorm(x)
87
+
88
+
89
+ class CBHG(nn.Module):
90
+ def __init__(self, K, in_channels, channels, proj_channels, num_highways):
91
+ super().__init__()
92
+
93
+ # List of all rnns to call `flatten_parameters()` on
94
+ self._to_flatten = []
95
+
96
+ self.bank_kernels = [i for i in range(1, K + 1)]
97
+ self.conv1d_bank = nn.ModuleList()
98
+ for k in self.bank_kernels:
99
+ conv = BatchNormConv(in_channels, channels, k)
100
+ self.conv1d_bank.append(conv)
101
+
102
+ self.maxpool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)
103
+
104
+ self.conv_project1 = BatchNormConv(len(self.bank_kernels) * channels, proj_channels[0], 3)
105
+ self.conv_project2 = BatchNormConv(proj_channels[0], proj_channels[1], 3, relu=False)
106
+
107
+ # Fix the highway input if necessary
108
+ if proj_channels[-1] != channels:
109
+ self.highway_mismatch = True
110
+ self.pre_highway = nn.Linear(proj_channels[-1], channels, bias=False)
111
+ else:
112
+ self.highway_mismatch = False
113
+
114
+ self.highways = nn.ModuleList()
115
+ for i in range(num_highways):
116
+ hn = HighwayNetwork(channels)
117
+ self.highways.append(hn)
118
+
119
+ self.rnn = nn.GRU(channels, channels // 2, batch_first=True, bidirectional=True)
120
+ self._to_flatten.append(self.rnn)
121
+
122
+ # Avoid fragmentation of RNN parameters and associated warning
123
+ self._flatten_parameters()
124
+
125
+ def forward(self, x):
126
+ # Although we `_flatten_parameters()` on init, when using DataParallel
127
+ # the model gets replicated, making it no longer guaranteed that the
128
+ # weights are contiguous in GPU memory. Hence, we must call it again
129
+ self._flatten_parameters()
130
+
131
+ # Save these for later
132
+ residual = x
133
+ seq_len = x.size(-1)
134
+ conv_bank = []
135
+
136
+ # Convolution Bank
137
+ for conv in self.conv1d_bank:
138
+ c = conv(x) # Convolution
139
+ conv_bank.append(c[:, :, :seq_len])
140
+
141
+ # Stack along the channel axis
142
+ conv_bank = torch.cat(conv_bank, dim=1)
143
+
144
+ # dump the last padding to fit residual
145
+ x = self.maxpool(conv_bank)[:, :, :seq_len]
146
+
147
+ # Conv1d projections
148
+ x = self.conv_project1(x)
149
+ x = self.conv_project2(x)
150
+
151
+ # Residual Connect
152
+ x = x + residual
153
+
154
+ # Through the highways
155
+ x = x.transpose(1, 2)
156
+ if self.highway_mismatch is True:
157
+ x = self.pre_highway(x)
158
+ for h in self.highways: x = h(x)
159
+
160
+ # And then the RNN
161
+ x, _ = self.rnn(x)
162
+ return x
163
+
164
+ def _flatten_parameters(self):
165
+ """Calls `flatten_parameters` on all the rnns used by the WaveRNN. Used
166
+ to improve efficiency and avoid PyTorch yelling at us."""
167
+ [m.flatten_parameters() for m in self._to_flatten]
168
+
169
+ class PreNet(nn.Module):
170
+ def __init__(self, in_dims, fc1_dims=256, fc2_dims=128, dropout=0.5):
171
+ super().__init__()
172
+ self.fc1 = nn.Linear(in_dims, fc1_dims)
173
+ self.fc2 = nn.Linear(fc1_dims, fc2_dims)
174
+ self.p = dropout
175
+
176
+ def forward(self, x):
177
+ x = self.fc1(x)
178
+ x = F.relu(x)
179
+ x = F.dropout(x, self.p, training=True)
180
+ x = self.fc2(x)
181
+ x = F.relu(x)
182
+ x = F.dropout(x, self.p, training=True)
183
+ return x
184
+
185
+
186
+ class Attention(nn.Module):
187
+ def __init__(self, attn_dims):
188
+ super().__init__()
189
+ self.W = nn.Linear(attn_dims, attn_dims, bias=False)
190
+ self.v = nn.Linear(attn_dims, 1, bias=False)
191
+
192
+ def forward(self, encoder_seq_proj, query, t):
193
+
194
+ # print(encoder_seq_proj.shape)
195
+ # Transform the query vector
196
+ query_proj = self.W(query).unsqueeze(1)
197
+
198
+ # Compute the scores
199
+ u = self.v(torch.tanh(encoder_seq_proj + query_proj))
200
+ scores = F.softmax(u, dim=1)
201
+
202
+ return scores.transpose(1, 2)
203
+
204
+
205
+ class LSA(nn.Module):
206
+ def __init__(self, attn_dim, kernel_size=31, filters=32):
207
+ super().__init__()
208
+ self.conv = nn.Conv1d(1, filters, padding=(kernel_size - 1) // 2, kernel_size=kernel_size, bias=True)
209
+ self.L = nn.Linear(filters, attn_dim, bias=False)
210
+ self.W = nn.Linear(attn_dim, attn_dim, bias=True) # Include the attention bias in this term
211
+ self.v = nn.Linear(attn_dim, 1, bias=False)
212
+ self.cumulative = None
213
+ self.attention = None
214
+
215
+ def init_attention(self, encoder_seq_proj):
216
+ device = next(self.parameters()).device # use same device as parameters
217
+ b, t, c = encoder_seq_proj.size()
218
+ self.cumulative = torch.zeros(b, t, device=device)
219
+ self.attention = torch.zeros(b, t, device=device)
220
+
221
+ def forward(self, encoder_seq_proj, query, t, chars):
222
+
223
+ if t == 0: self.init_attention(encoder_seq_proj)
224
+
225
+ processed_query = self.W(query).unsqueeze(1)
226
+
227
+ location = self.cumulative.unsqueeze(1)
228
+ processed_loc = self.L(self.conv(location).transpose(1, 2))
229
+
230
+ u = self.v(torch.tanh(processed_query + encoder_seq_proj + processed_loc))
231
+ u = u.squeeze(-1)
232
+
233
+ # Mask zero padding chars
234
+ u = u * (chars != 0).float()
235
+
236
+ # Smooth Attention
237
+ # scores = torch.sigmoid(u) / torch.sigmoid(u).sum(dim=1, keepdim=True)
238
+ scores = F.softmax(u, dim=1)
239
+ self.attention = scores
240
+ self.cumulative = self.cumulative + self.attention
241
+
242
+ return scores.unsqueeze(-1).transpose(1, 2)
243
+
244
+
245
+ class Decoder(nn.Module):
246
+ # Class variable because its value doesn't change between instances,
247
+ # yet it ought to be scoped to the class because it's a property of a Decoder
248
+ max_r = 20
249
+ def __init__(self, n_mels, encoder_dims, decoder_dims, lstm_dims,
250
+ dropout, speaker_embedding_size):
251
+ super().__init__()
252
+ self.register_buffer("r", torch.tensor(1, dtype=torch.int))
253
+ self.n_mels = n_mels
254
+ prenet_dims = (decoder_dims * 2, decoder_dims * 2)
255
+ self.prenet = PreNet(n_mels, fc1_dims=prenet_dims[0], fc2_dims=prenet_dims[1],
256
+ dropout=dropout)
257
+ self.attn_net = LSA(decoder_dims)
258
+ self.attn_rnn = nn.GRUCell(encoder_dims + prenet_dims[1] + speaker_embedding_size, decoder_dims)
259
+ self.rnn_input = nn.Linear(encoder_dims + decoder_dims + speaker_embedding_size, lstm_dims)
260
+ self.res_rnn1 = nn.LSTMCell(lstm_dims, lstm_dims)
261
+ self.res_rnn2 = nn.LSTMCell(lstm_dims, lstm_dims)
262
+ self.mel_proj = nn.Linear(lstm_dims, n_mels * self.max_r, bias=False)
263
+ self.stop_proj = nn.Linear(encoder_dims + speaker_embedding_size + lstm_dims, 1)
264
+
265
+ def zoneout(self, prev, current, p=0.1):
266
+ device = next(self.parameters()).device # Use same device as parameters
267
+ mask = torch.zeros(prev.size(), device=device).bernoulli_(p)
268
+ return prev * mask + current * (1 - mask)
269
+
270
+ def forward(self, encoder_seq, encoder_seq_proj, prenet_in,
271
+ hidden_states, cell_states, context_vec, t, chars):
272
+
273
+ # Need this for reshaping mels
274
+ batch_size = encoder_seq.size(0)
275
+
276
+ # Unpack the hidden and cell states
277
+ attn_hidden, rnn1_hidden, rnn2_hidden = hidden_states
278
+ rnn1_cell, rnn2_cell = cell_states
279
+
280
+ # PreNet for the Attention RNN
281
+ prenet_out = self.prenet(prenet_in)
282
+
283
+ # Compute the Attention RNN hidden state
284
+ attn_rnn_in = torch.cat([context_vec, prenet_out], dim=-1)
285
+ attn_hidden = self.attn_rnn(attn_rnn_in.squeeze(1), attn_hidden)
286
+
287
+ # Compute the attention scores
288
+ scores = self.attn_net(encoder_seq_proj, attn_hidden, t, chars)
289
+
290
+ # Dot product to create the context vector
291
+ context_vec = scores @ encoder_seq
292
+ context_vec = context_vec.squeeze(1)
293
+
294
+ # Concat Attention RNN output w. Context Vector & project
295
+ x = torch.cat([context_vec, attn_hidden], dim=1)
296
+ x = self.rnn_input(x)
297
+
298
+ # Compute first Residual RNN
299
+ rnn1_hidden_next, rnn1_cell = self.res_rnn1(x, (rnn1_hidden, rnn1_cell))
300
+ if self.training:
301
+ rnn1_hidden = self.zoneout(rnn1_hidden, rnn1_hidden_next)
302
+ else:
303
+ rnn1_hidden = rnn1_hidden_next
304
+ x = x + rnn1_hidden
305
+
306
+ # Compute second Residual RNN
307
+ rnn2_hidden_next, rnn2_cell = self.res_rnn2(x, (rnn2_hidden, rnn2_cell))
308
+ if self.training:
309
+ rnn2_hidden = self.zoneout(rnn2_hidden, rnn2_hidden_next)
310
+ else:
311
+ rnn2_hidden = rnn2_hidden_next
312
+ x = x + rnn2_hidden
313
+
314
+ # Project Mels
315
+ mels = self.mel_proj(x)
316
+ mels = mels.view(batch_size, self.n_mels, self.max_r)[:, :, :self.r]
317
+ hidden_states = (attn_hidden, rnn1_hidden, rnn2_hidden)
318
+ cell_states = (rnn1_cell, rnn2_cell)
319
+
320
+ # Stop token prediction
321
+ s = torch.cat((x, context_vec), dim=1)
322
+ s = self.stop_proj(s)
323
+ stop_tokens = torch.sigmoid(s)
324
+
325
+ return mels, scores, hidden_states, cell_states, context_vec, stop_tokens
326
+
327
+
328
+ class Tacotron(nn.Module):
329
+ def __init__(self, embed_dims, num_chars, encoder_dims, decoder_dims, n_mels,
330
+ fft_bins, postnet_dims, encoder_K, lstm_dims, postnet_K, num_highways,
331
+ dropout, stop_threshold, speaker_embedding_size):
332
+ super().__init__()
333
+ self.n_mels = n_mels
334
+ self.lstm_dims = lstm_dims
335
+ self.encoder_dims = encoder_dims
336
+ self.decoder_dims = decoder_dims
337
+ self.speaker_embedding_size = speaker_embedding_size
338
+ self.encoder = Encoder(embed_dims, num_chars, encoder_dims,
339
+ encoder_K, num_highways, dropout)
340
+ self.encoder_proj = nn.Linear(encoder_dims + speaker_embedding_size, decoder_dims, bias=False)
341
+ self.decoder = Decoder(n_mels, encoder_dims, decoder_dims, lstm_dims,
342
+ dropout, speaker_embedding_size)
343
+ self.postnet = CBHG(postnet_K, n_mels, postnet_dims,
344
+ [postnet_dims, fft_bins], num_highways)
345
+ self.post_proj = nn.Linear(postnet_dims, fft_bins, bias=False)
346
+
347
+ self.init_model()
348
+ self.num_params()
349
+
350
+ self.register_buffer("step", torch.zeros(1, dtype=torch.long))
351
+ self.register_buffer("stop_threshold", torch.tensor(stop_threshold, dtype=torch.float32))
352
+
353
+ @property
354
+ def r(self):
355
+ return self.decoder.r.item()
356
+
357
+ @r.setter
358
+ def r(self, value):
359
+ self.decoder.r = self.decoder.r.new_tensor(value, requires_grad=False)
360
+
361
+ def forward(self, x, m, speaker_embedding):
362
+ device = next(self.parameters()).device # use same device as parameters
363
+
364
+ self.step += 1
365
+ batch_size, _, steps = m.size()
366
+
367
+ # Initialise all hidden states and pack into tuple
368
+ attn_hidden = torch.zeros(batch_size, self.decoder_dims, device=device)
369
+ rnn1_hidden = torch.zeros(batch_size, self.lstm_dims, device=device)
370
+ rnn2_hidden = torch.zeros(batch_size, self.lstm_dims, device=device)
371
+ hidden_states = (attn_hidden, rnn1_hidden, rnn2_hidden)
372
+
373
+ # Initialise all lstm cell states and pack into tuple
374
+ rnn1_cell = torch.zeros(batch_size, self.lstm_dims, device=device)
375
+ rnn2_cell = torch.zeros(batch_size, self.lstm_dims, device=device)
376
+ cell_states = (rnn1_cell, rnn2_cell)
377
+
378
+ # <GO> Frame for start of decoder loop
379
+ go_frame = torch.zeros(batch_size, self.n_mels, device=device)
380
+
381
+ # Need an initial context vector
382
+ context_vec = torch.zeros(batch_size, self.encoder_dims + self.speaker_embedding_size, device=device)
383
+
384
+ # SV2TTS: Run the encoder with the speaker embedding
385
+ # The projection avoids unnecessary matmuls in the decoder loop
386
+ encoder_seq = self.encoder(x, speaker_embedding)
387
+ encoder_seq_proj = self.encoder_proj(encoder_seq)
388
+
389
+ # Need a couple of lists for outputs
390
+ mel_outputs, attn_scores, stop_outputs = [], [], []
391
+
392
+ # Run the decoder loop
393
+ for t in range(0, steps, self.r):
394
+ prenet_in = m[:, :, t - 1] if t > 0 else go_frame
395
+ mel_frames, scores, hidden_states, cell_states, context_vec, stop_tokens = \
396
+ self.decoder(encoder_seq, encoder_seq_proj, prenet_in,
397
+ hidden_states, cell_states, context_vec, t, x)
398
+ mel_outputs.append(mel_frames)
399
+ attn_scores.append(scores)
400
+ stop_outputs.extend([stop_tokens] * self.r)
401
+
402
+ # Concat the mel outputs into sequence
403
+ mel_outputs = torch.cat(mel_outputs, dim=2)
404
+
405
+ # Post-Process for Linear Spectrograms
406
+ postnet_out = self.postnet(mel_outputs)
407
+ linear = self.post_proj(postnet_out)
408
+ linear = linear.transpose(1, 2)
409
+
410
+ # For easy visualisation
411
+ attn_scores = torch.cat(attn_scores, 1)
412
+ # attn_scores = attn_scores.cpu().data.numpy()
413
+ stop_outputs = torch.cat(stop_outputs, 1)
414
+
415
+ return mel_outputs, linear, attn_scores, stop_outputs
416
+
417
+ def generate(self, x, speaker_embedding=None, steps=2000):
418
+ import sys
419
+
420
+ self.eval()
421
+ device = next(self.parameters()).device # use same device as parameters
422
+
423
+ batch_size, _ = x.size()
424
+
425
+ # Need to initialise all hidden states and pack into tuple for tidiness
426
+ attn_hidden = torch.zeros(batch_size, self.decoder_dims, device=device)
427
+ rnn1_hidden = torch.zeros(batch_size, self.lstm_dims, device=device)
428
+ rnn2_hidden = torch.zeros(batch_size, self.lstm_dims, device=device)
429
+ hidden_states = (attn_hidden, rnn1_hidden, rnn2_hidden)
430
+
431
+ # Need to initialise all lstm cell states and pack into tuple for tidiness
432
+ rnn1_cell = torch.zeros(batch_size, self.lstm_dims, device=device)
433
+ rnn2_cell = torch.zeros(batch_size, self.lstm_dims, device=device)
434
+ cell_states = (rnn1_cell, rnn2_cell)
435
+
436
+ # Need a <GO> Frame for start of decoder loop
437
+ go_frame = torch.zeros(batch_size, self.n_mels, device=device)
438
+
439
+ # Need an initial context vector
440
+ context_vec = torch.zeros(batch_size, self.encoder_dims + self.speaker_embedding_size, device=device)
441
+
442
+ # SV2TTS: Run the encoder with the speaker embedding
443
+ # The projection avoids unnecessary matmuls in the decoder loop
444
+ print(" [Tacotron] Running encoder...", end='', flush=True)
445
+ sys.stdout.flush()
446
+ encoder_seq = self.encoder(x, speaker_embedding)
447
+ encoder_seq_proj = self.encoder_proj(encoder_seq)
448
+ print(" OK")
449
+ sys.stdout.flush()
450
+
451
+ # Need a couple of lists for outputs
452
+ mel_outputs, attn_scores, stop_outputs = [], [], []
453
+
454
+ # Run the decoder loop
455
+ print(f" [Tacotron] Decoder loop: 0/{steps} steps", end='')
456
+ sys.stdout.flush()
457
+ for t in range(0, steps, self.r):
458
+ prenet_in = mel_outputs[-1][:, :, -1] if t > 0 else go_frame
459
+ mel_frames, scores, hidden_states, cell_states, context_vec, stop_tokens = \
460
+ self.decoder(encoder_seq, encoder_seq_proj, prenet_in,
461
+ hidden_states, cell_states, context_vec, t, x)
462
+ mel_outputs.append(mel_frames)
463
+ attn_scores.append(scores)
464
+ stop_outputs.extend([stop_tokens] * self.r)
465
+
466
+ # Progress every 100 steps
467
+ if t % 100 == 0:
468
+ print(f"\r [Tacotron] Decoder loop: {t}/{steps} steps", end='')
469
+ sys.stdout.flush()
470
+
471
+ # Stop the loop when all stop tokens in batch exceed threshold
472
+ if (stop_tokens > 0.5).all() and t > 10:
473
+ print(f"\r [Tacotron] Decoder loop: {t}/{steps} steps (stopped early)")
474
+ sys.stdout.flush()
475
+ break
476
+
477
+ print(f"\r [Tacotron] Decoder loop: {len(mel_outputs) * self.r}/{steps} steps (complete)")
478
+ sys.stdout.flush()
479
+
480
+ # Concat the mel outputs into sequence
481
+ print(" [Tacotron] Concatenating and post-processing...", end='', flush=True)
482
+ sys.stdout.flush()
483
+ mel_outputs = torch.cat(mel_outputs, dim=2)
484
+
485
+ # Post-Process for Linear Spectrograms
486
+ postnet_out = self.postnet(mel_outputs)
487
+ linear = self.post_proj(postnet_out)
488
+
489
+ linear = linear.transpose(1, 2)
490
+
491
+ # For easy visualisation
492
+ attn_scores = torch.cat(attn_scores, 1)
493
+ stop_outputs = torch.cat(stop_outputs, 1)
494
+
495
+ print(" OK")
496
+ sys.stdout.flush()
497
+ self.train()
498
+
499
+ return mel_outputs, linear, attn_scores
500
+
501
+ def init_model(self):
502
+ for p in self.parameters():
503
+ if p.dim() > 1: nn.init.xavier_uniform_(p)
504
+
505
+ def get_step(self):
506
+ return self.step.data.item()
507
+
508
+ def reset_step(self):
509
+ # assignment to parameters or buffers is overloaded, updates internal dict entry
510
+ self.step = self.step.data.new_tensor(1)
511
+
512
+ def log(self, path, msg):
513
+ with open(path, "a") as f:
514
+ print(msg, file=f)
515
+
516
+ def load(self, path, optimizer=None):
517
+ # Use device of model params as location for loaded state
518
+ device = next(self.parameters()).device
519
+ checkpoint = torch.load(str(path), map_location=device)
520
+ self.load_state_dict(checkpoint["model_state"])
521
+
522
+ if "optimizer_state" in checkpoint and optimizer is not None:
523
+ optimizer.load_state_dict(checkpoint["optimizer_state"])
524
+
525
+ def save(self, path, optimizer=None):
526
+ if optimizer is not None:
527
+ torch.save({
528
+ "model_state": self.state_dict(),
529
+ "optimizer_state": optimizer.state_dict(),
530
+ }, str(path))
531
+ else:
532
+ torch.save({
533
+ "model_state": self.state_dict(),
534
+ }, str(path))
535
+
536
+
537
+ def num_params(self, print_out=True):
538
+ parameters = filter(lambda p: p.requires_grad, self.parameters())
539
+ parameters = sum([np.prod(p.size()) for p in parameters]) / 1_000_000
540
+ if print_out:
541
+ print("Trainable Parameters: %.3fM" % parameters)
542
+ return parameters
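A minimal smoke-test sketch for the model above: it builds a Tacotron from the hparams in this commit and runs `generate()` with random (untrained) weights, purely to illustrate the constructor signature and expected tensor shapes. No pretrained checkpoint is involved, and it assumes `backend/` is on the Python path:

```python
import torch
from synthesizer.hparams import hparams
from synthesizer.models.tacotron import Tacotron
from synthesizer.utils.symbols import symbols

model = Tacotron(embed_dims=hparams.tts_embed_dims, num_chars=len(symbols),
                 encoder_dims=hparams.tts_encoder_dims, decoder_dims=hparams.tts_decoder_dims,
                 n_mels=hparams.num_mels, fft_bins=hparams.num_mels,
                 postnet_dims=hparams.tts_postnet_dims, encoder_K=hparams.tts_encoder_K,
                 lstm_dims=hparams.tts_lstm_dims, postnet_K=hparams.tts_postnet_K,
                 num_highways=hparams.tts_num_highways, dropout=hparams.tts_dropout,
                 stop_threshold=hparams.tts_stop_threshold,
                 speaker_embedding_size=hparams.speaker_embedding_size)

model.r = 2  # reduction factor: mel frames emitted per decoder iteration

chars = torch.randint(1, len(symbols), (1, 20))        # fake character IDs
embed = torch.rand(1, hparams.speaker_embedding_size)  # fake speaker embedding
mels, linear, attn = model.generate(chars, embed, steps=200)
print(mels.shape)  # (1, 80, <=200); untrained weights produce noise, but the shapes hold
```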
backend/synthesizer/utils/__init__.py ADDED
@@ -0,0 +1,45 @@
1
+ import torch
2
+
3
+
4
+ _output_ref = None
5
+ _replicas_ref = None
6
+
7
+ def data_parallel_workaround(model, *input):
8
+ global _output_ref
9
+ global _replicas_ref
10
+ device_ids = list(range(torch.cuda.device_count()))
11
+ output_device = device_ids[0]
12
+ replicas = torch.nn.parallel.replicate(model, device_ids)
13
+ # input.shape = (num_args, batch, ...)
14
+ inputs = torch.nn.parallel.scatter(input, device_ids)
15
+ # inputs.shape = (num_gpus, num_args, batch/num_gpus, ...)
16
+ replicas = replicas[:len(inputs)]
17
+ outputs = torch.nn.parallel.parallel_apply(replicas, inputs)
18
+ y_hat = torch.nn.parallel.gather(outputs, output_device)
19
+ _output_ref = outputs
20
+ _replicas_ref = replicas
21
+ return y_hat
22
+
23
+
24
+ class ValueWindow():
25
+ def __init__(self, window_size=100):
26
+ self._window_size = window_size
27
+ self._values = []
28
+
29
+ def append(self, x):
30
+ self._values = self._values[-(self._window_size - 1):] + [x]
31
+
32
+ @property
33
+ def sum(self):
34
+ return sum(self._values)
35
+
36
+ @property
37
+ def count(self):
38
+ return len(self._values)
39
+
40
+ @property
41
+ def average(self):
42
+ return self.sum / max(1, self.count)
43
+
44
+ def reset(self):
45
+ self._values = []
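`ValueWindow` is a tiny running-average helper (typically used for smoothing training losses). A quick sketch of its behaviour:

```python
from synthesizer.utils import ValueWindow

loss_window = ValueWindow(window_size=3)
for loss in [4.0, 3.0, 2.0, 1.0]:
    loss_window.append(loss)

print(loss_window.count)    # 3 -- only the last window_size values are kept
print(loss_window.average)  # (3.0 + 2.0 + 1.0) / 3 == 2.0
```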
backend/synthesizer/utils/cleaners.py ADDED
@@ -0,0 +1,88 @@
1
+ """
2
+ Cleaners are transformations that run over the input text at both training and eval time.
3
+
4
+ Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners"
5
+ hyperparameter. Some cleaners are English-specific. You'll typically want to use:
6
+ 1. "english_cleaners" for English text
7
+ 2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using
8
+ the Unidecode library (https://pypi.python.org/pypi/Unidecode)
9
+ 3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update
10
+ the symbols in symbols.py to match your data).
11
+ """
12
+ import re
13
+ from unidecode import unidecode
14
+ from .numbers import normalize_numbers
15
+
16
+
17
+ # Regular expression matching whitespace:
18
+ _whitespace_re = re.compile(r"\s+")
19
+
20
+ # List of (regular expression, replacement) pairs for abbreviations:
21
+ _abbreviations = [(re.compile("\\b%s\\." % x[0], re.IGNORECASE), x[1]) for x in [
22
+ ("mrs", "misess"),
23
+ ("mr", "mister"),
24
+ ("dr", "doctor"),
25
+ ("st", "saint"),
26
+ ("co", "company"),
27
+ ("jr", "junior"),
28
+ ("maj", "major"),
29
+ ("gen", "general"),
30
+ ("drs", "doctors"),
31
+ ("rev", "reverend"),
32
+ ("lt", "lieutenant"),
33
+ ("hon", "honorable"),
34
+ ("sgt", "sergeant"),
35
+ ("capt", "captain"),
36
+ ("esq", "esquire"),
37
+ ("ltd", "limited"),
38
+ ("col", "colonel"),
39
+ ("ft", "fort"),
40
+ ]]
41
+
42
+
43
+ def expand_abbreviations(text):
44
+ for regex, replacement in _abbreviations:
45
+ text = re.sub(regex, replacement, text)
46
+ return text
47
+
48
+
49
+ def expand_numbers(text):
50
+ return normalize_numbers(text)
51
+
52
+
53
+ def lowercase(text):
54
+ """lowercase input tokens."""
55
+ return text.lower()
56
+
57
+
58
+ def collapse_whitespace(text):
59
+ return re.sub(_whitespace_re, " ", text)
60
+
61
+
62
+ def convert_to_ascii(text):
63
+ return unidecode(text)
64
+
65
+
66
+ def basic_cleaners(text):
67
+ """Basic pipeline that lowercases and collapses whitespace without transliteration."""
68
+ text = lowercase(text)
69
+ text = collapse_whitespace(text)
70
+ return text
71
+
72
+
73
+ def transliteration_cleaners(text):
74
+ """Pipeline for non-English text that transliterates to ASCII."""
75
+ text = convert_to_ascii(text)
76
+ text = lowercase(text)
77
+ text = collapse_whitespace(text)
78
+ return text
79
+
80
+
81
+ def english_cleaners(text):
82
+ """Pipeline for English text, including number and abbreviation expansion."""
83
+ text = convert_to_ascii(text)
84
+ text = lowercase(text)
85
+ text = expand_numbers(text)
86
+ text = expand_abbreviations(text)
87
+ text = collapse_whitespace(text)
88
+ return text
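The cleaner pipeline above transliterates, lowercases, and expands numbers and abbreviations before collapsing whitespace. A rough example; the exact wording of the number expansion depends on the installed `inflect` version:

```python
from synthesizer.utils.cleaners import english_cleaners

print(english_cleaners("Dr. Smith lives at 221 Baker St."))
# -> roughly: "doctor smith lives at two hundred twenty-one baker saint"
```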
backend/synthesizer/utils/numbers.py ADDED
@@ -0,0 +1,69 @@
1
+ import re
2
+ import inflect
3
+
4
+
5
+ _inflect = inflect.engine()
6
+ _comma_number_re = re.compile(r"([0-9][0-9\,]+[0-9])")
7
+ _decimal_number_re = re.compile(r"([0-9]+\.[0-9]+)")
8
+ _pounds_re = re.compile(r"£([0-9\,]*[0-9]+)")
9
+ _dollars_re = re.compile(r"\$([0-9\.\,]*[0-9]+)")
10
+ _ordinal_re = re.compile(r"[0-9]+(st|nd|rd|th)")
11
+ _number_re = re.compile(r"[0-9]+")
12
+
13
+
14
+ def _remove_commas(m):
15
+ return m.group(1).replace(",", "")
16
+
17
+
18
+ def _expand_decimal_point(m):
19
+ return m.group(1).replace(".", " point ")
20
+
21
+
22
+ def _expand_dollars(m):
23
+ match = m.group(1)
24
+ parts = match.split(".")
25
+ if len(parts) > 2:
26
+ return match + " dollars" # Unexpected format
27
+ dollars = int(parts[0]) if parts[0] else 0
28
+ cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
29
+ if dollars and cents:
30
+ dollar_unit = "dollar" if dollars == 1 else "dollars"
31
+ cent_unit = "cent" if cents == 1 else "cents"
32
+ return "%s %s, %s %s" % (dollars, dollar_unit, cents, cent_unit)
33
+ elif dollars:
34
+ dollar_unit = "dollar" if dollars == 1 else "dollars"
35
+ return "%s %s" % (dollars, dollar_unit)
36
+ elif cents:
37
+ cent_unit = "cent" if cents == 1 else "cents"
38
+ return "%s %s" % (cents, cent_unit)
39
+ else:
40
+ return "zero dollars"
41
+
42
+
43
+ def _expand_ordinal(m):
44
+ return _inflect.number_to_words(m.group(0))
45
+
46
+
47
+ def _expand_number(m):
48
+ num = int(m.group(0))
49
+ if num > 1000 and num < 3000:
50
+ if num == 2000:
51
+ return "two thousand"
52
+ elif num > 2000 and num < 2010:
53
+ return "two thousand " + _inflect.number_to_words(num % 100)
54
+ elif num % 100 == 0:
55
+ return _inflect.number_to_words(num // 100) + " hundred"
56
+ else:
57
+ return _inflect.number_to_words(num, andword="", zero="oh", group=2).replace(", ", " ")
58
+ else:
59
+ return _inflect.number_to_words(num, andword="")
60
+
61
+
62
+ def normalize_numbers(text):
63
+ text = re.sub(_comma_number_re, _remove_commas, text)
64
+ text = re.sub(_pounds_re, r"\1 pounds", text)
65
+ text = re.sub(_dollars_re, _expand_dollars, text)
66
+ text = re.sub(_decimal_number_re, _expand_decimal_point, text)
67
+ text = re.sub(_ordinal_re, _expand_ordinal, text)
68
+ text = re.sub(_number_re, _expand_number, text)
69
+ return text
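A quick illustration of `normalize_numbers`; the exact phrasing comes from `inflect`, so treat the expected output as approximate:

```python
from synthesizer.utils.numbers import normalize_numbers

print(normalize_numbers("The ticket cost $5.50 on the 3rd of May, 2019."))
# -> roughly: "The ticket cost 5 dollars, 50 cents on the third of May, twenty nineteen."
```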
backend/synthesizer/utils/symbols.py ADDED
@@ -0,0 +1,17 @@
1
+ """
2
+ Defines the set of symbols used in text input to the model.
3
+
4
+ The default is a set of ASCII characters that works well for English or text that has been run
5
+ through Unidecode. For other data, you can modify _characters. See TRAINING_DATA.md for details.
6
+ """
7
+ # from . import cmudict
8
+
9
+ _pad = "_"
10
+ _eos = "~"
11
+ _characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'\"(),-.:;? "
12
+
13
+ # Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as uppercase letters):
14
+ #_arpabet = ["@" + s for s in cmudict.valid_symbols]
15
+
16
+ # Export all symbols:
17
+ symbols = [_pad, _eos] + list(_characters) #+ _arpabet
backend/synthesizer/utils/text.py ADDED
@@ -0,0 +1,75 @@
1
+ from .symbols import symbols
2
+ from . import cleaners
3
+ import re
4
+
5
+
6
+ # Mappings from symbol to numeric ID and vice versa:
7
+ _symbol_to_id = {s: i for i, s in enumerate(symbols)}
8
+ _id_to_symbol = {i: s for i, s in enumerate(symbols)}
9
+
10
+ # Regular expression matching text enclosed in curly braces:
11
+ _curly_re = re.compile(r"(.*?)\{(.+?)\}(.*)")
12
+
13
+
14
+ def text_to_sequence(text, cleaner_names):
15
+ """Converts a string of text to a sequence of IDs corresponding to the symbols in the text.
16
+
17
+ The text can optionally have ARPAbet sequences enclosed in curly braces embedded
18
+ in it. For example, "Turn left on {HH AW1 S S T AH0 N} Street."
19
+
20
+ Args:
21
+ text: string to convert to a sequence
22
+ cleaner_names: names of the cleaner functions to run the text through
23
+
24
+ Returns:
25
+ List of integers corresponding to the symbols in the text
26
+ """
27
+ sequence = []
28
+
29
+ # Check for curly braces and treat their contents as ARPAbet:
30
+ while len(text):
31
+ m = _curly_re.match(text)
32
+ if not m:
33
+ sequence += _symbols_to_sequence(_clean_text(text, cleaner_names))
34
+ break
35
+ sequence += _symbols_to_sequence(_clean_text(m.group(1), cleaner_names))
36
+ sequence += _arpabet_to_sequence(m.group(2))
37
+ text = m.group(3)
38
+
39
+ # Append EOS token
40
+ sequence.append(_symbol_to_id["~"])
41
+ return sequence
42
+
43
+
44
+ def sequence_to_text(sequence):
45
+ """Converts a sequence of IDs back to a string"""
46
+ result = ""
47
+ for symbol_id in sequence:
48
+ if symbol_id in _id_to_symbol:
49
+ s = _id_to_symbol[symbol_id]
50
+ # Enclose ARPAbet back in curly braces:
51
+ if len(s) > 1 and s[0] == "@":
52
+ s = "{%s}" % s[1:]
53
+ result += s
54
+ return result.replace("}{", " ")
55
+
56
+
57
+ def _clean_text(text, cleaner_names):
58
+ for name in cleaner_names:
59
+ cleaner = getattr(cleaners, name)
60
+ if not cleaner:
61
+ raise Exception("Unknown cleaner: %s" % name)
62
+ text = cleaner(text)
63
+ return text
64
+
65
+
66
+ def _symbols_to_sequence(symbols):
67
+ return [_symbol_to_id[s] for s in symbols if _should_keep_symbol(s)]
68
+
69
+
70
+ def _arpabet_to_sequence(text):
71
+ return _symbols_to_sequence(["@" + s for s in text.split()])
72
+
73
+
74
+ def _should_keep_symbol(s):
75
+ return s in _symbol_to_id and s not in ("_", "~")
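A round-trip sketch of the text front end, assuming `backend/` is on the Python path:

```python
from synthesizer.utils.text import text_to_sequence, sequence_to_text

seq = text_to_sequence("Hello, world!", ["english_cleaners"])
print(seq)                    # symbol IDs, ending with the EOS id for "~"
print(sequence_to_text(seq))  # "hello, world!~" after cleaning
```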
backend/wsgi.py ADDED
@@ -0,0 +1,15 @@
1
+ """Gunicorn entry point for the voice cloning backend."""
2
+
3
+ import sys
4
+ from pathlib import Path
5
+
6
+ # Ensure backend directory is in the path for imports
7
+ backend_dir = Path(__file__).parent
8
+ if str(backend_dir) not in sys.path:
9
+ sys.path.insert(0, str(backend_dir))
10
+
11
+ from app import app
12
+
13
+
14
+ if __name__ == "__main__":
15
+ app.run()
frontend/.env.development ADDED
@@ -0,0 +1,4 @@
1
+ # Local development
2
+ VITE_API_URL=http://localhost:5000
3
+ FLASK_ENV=development
4
+ DEBUG=true
frontend/.env.production ADDED
@@ -0,0 +1,2 @@
1
+ # Production deployment
2
+ VITE_API_URL=https://voice-cloning-personalized-speech.onrender.com
frontend/.gitignore ADDED
@@ -0,0 +1,99 @@
1
+ # Logs
2
+ logs
3
+ *.log
4
+ npm-debug.log*
5
+ yarn-debug.log*
6
+ yarn-error.log*
7
+ pnpm-debug.log*
8
+ lerna-debug.log*
9
+
10
+ # Dependencies
11
+ /node_modules
12
+ /.pnp
13
+ .pnp.js
14
+
15
+ # Testing
16
+ /coverage
17
+
18
+ # Next.js
19
+ /.next/
20
+ /out/
21
+
22
+ # Production
23
+ /build
24
+ /dist
25
+ /dist-ssr
26
+
27
+ # Local env files
28
+ .env*.local
29
+ .env
30
+
31
+ # Debug logs
32
+ npm-debug.log*
33
+ yarn-debug.log*
34
+ yarn-error.log*
35
+ pnpm-debug.log*
36
+
37
+ # Editor directories and files
38
+ .idea
39
+ .vscode/*
40
+ !.vscode/extensions.json
41
+ .DS_Store
42
+ *.suo
43
+ *.ntvs*
44
+ *.njsproj
45
+ *.sln
46
+ *.sw?
47
+
48
+ # System Files
49
+ .DS_Store
50
+ Thumbs.db
51
+
52
+ # Cache
53
+ .cache/
54
+ .temp/
55
+ .tmp/
56
+
57
+ # Misc
58
+ .vercel
59
+ .next
60
+ .vercel_build_output
61
+
62
+ # Local Netlify folder
63
+ .netlify
64
+
65
+ # Optional npm cache directory
66
+ .npm
67
+
68
+ # Optional eslint cache
69
+ .eslintcache
70
+
71
+ # Optional REPL history
72
+ .node_repl_history
73
+
74
+ # Output of 'npm pack'
75
+ *.tgz
76
+
77
+ # Yarn Integrity file
78
+ .yarn-integrity
79
+
80
+ # dotenv environment variables file
81
+ .env*.local
82
+ .env
83
+
84
+ # parcel-bundler cache (https://parceljs.org/)
85
+ .parcel-cache
86
+
87
+ # Next.js build output
88
+ .next
89
+ out
90
+
91
+ # Vercel
92
+ .vercel
93
+
94
+ # TypeScript
95
+ *.tsbuildinfo
96
+ next-env.d.ts
97
+
98
+ # Optional stylelint cache
99
+ .stylelintcache
frontend/README.md ADDED
@@ -0,0 +1,111 @@
1
+ # Voice Cloning – Personalized Speech Synthesis (Frontend)
2
+
3
+ > Note: On first load, please allow 2–3 minutes. The app initializes several 3D elements that can take time to fetch and compile in the browser, such as:
4
+ > - Spline-powered scenes and backgrounds
5
+ > - Interactive Orb (Three.js) with real-time interaction
6
+ > - Particle Field and Floating Elements
7
+ > - Speaker/Microphone 3D scenes and visualizers
8
+
9
+ This repository contains the fully custom-built frontend for a Voice Cloning and Personalized Speech Synthesis application.
10
+
11
+ - Modern, responsive UI with smooth 3D visuals and an accessible design system.
12
+
13
+ ---
14
+
15
+ ## Overview
16
+
17
+ The frontend provides:
18
+
19
+ - A clean interface to enroll voice samples and synthesize speech.
20
+ - Real-time audio recording, waveform visualization, and playback controls.
21
+ - Rich 3D/animated visuals to enhance the user experience (Spline and Three.js).
22
+ - A component-driven architecture for maintainability and reusability.
23
+
24
+ ---
25
+
26
+ ## Features
27
+
28
+ - Audio
29
+ - Audio recorder and waveform visualization
30
+ - Error boundaries and robust UI states
31
+
32
+ - 3D & Visuals
33
+ - Spline background scenes
34
+ - Interactive Orb, Particle Field, Floating Elements
35
+ - Speaker/Microphone scenes and animated transitions
36
+
37
+ - UI/UX
38
+ - shadcn/ui components with Tailwind CSS
39
+ - Responsive, accessible design
40
+ - Theming and utility-first styling
41
+
42
+ ---
43
+
44
+ ## Tech Stack
45
+
46
+ - Vite (bundler & dev server)
47
+ - React (UI) + TypeScript
48
+ - Tailwind CSS + PostCSS
49
+ - shadcn/ui component library
50
+ - Three.js & Spline (3D scenes and interactions)
51
+ - ESLint (code quality) and modern TS configs
52
+
53
+ ---
54
+
55
+ ## Getting Started
56
+
57
+ Prerequisites:
58
+
59
+ - Node.js and npm installed (recommend using nvm)
60
+
61
+ Install and run:
62
+
63
+ ```bash
64
+ npm install
65
+ npm run dev
66
+ ```
67
+
68
+ Open the local URL printed in the terminal. First load may take 2–3 minutes due to 3D assets.
69
+
70
+ ---
71
+
72
+ ## Available Scripts
73
+
74
+ - `npm run dev` – Start the development server
75
+ - `npm run build` – Build for production into `dist/`
76
+ - `npm run preview` – Preview the production build locally
77
+
78
+ ---
79
+
80
+ ## Project Structure (high level)
81
+
82
+ - `src/`
83
+ - `components/`
84
+ - `audio/` – Recorder, waveform, audio UI
85
+ - `three/` – Interactive Orb, Particle Field, Speaker/Mic scenes, Spline background
86
+ - `ui/` – shadcn/ui component wrappers and utilities
87
+ - `pages/` – App pages and routing
88
+ - `lib/` – Utility functions
89
+
90
+ - `public/` – Static assets (icons, placeholders, robots.txt)
91
+ - `tailwind.config.ts`, `postcss.config.js` – Styling configuration
92
+ - `eslint.config.js` – Linting configuration
93
+
94
+ ---
95
+
96
+ ## Deployment
97
+
98
+ Build a production bundle:
99
+
100
+ ```bash
101
+ npm run build
102
+ npm run preview
103
+ ```
104
+
105
+ Deploy the contents of `dist/` to your hosting of choice (e.g., Netlify, Vercel, GitHub Pages, or a static server).
106
+
107
+ ---
108
+
109
+ ## License
110
+
111
+ Copyright © the project owner. All rights reserved.
frontend/components.json ADDED
@@ -0,0 +1,20 @@
1
+ {
2
+ "$schema": "https://ui.shadcn.com/schema.json",
3
+ "style": "default",
4
+ "rsc": false,
5
+ "tsx": true,
6
+ "tailwind": {
7
+ "config": "tailwind.config.ts",
8
+ "css": "src/index.css",
9
+ "baseColor": "slate",
10
+ "cssVariables": true,
11
+ "prefix": ""
12
+ },
13
+ "aliases": {
14
+ "components": "@/components",
15
+ "utils": "@/lib/utils",
16
+ "ui": "@/components/ui",
17
+ "lib": "@/lib",
18
+ "hooks": "@/hooks"
19
+ }
20
+ }
frontend/eslint.config.js ADDED
@@ -0,0 +1,29 @@
1
+ import js from "@eslint/js";
2
+ import globals from "globals";
3
+ import reactHooks from "eslint-plugin-react-hooks";
4
+ import reactRefresh from "eslint-plugin-react-refresh";
5
+ import tseslint from "typescript-eslint";
6
+
7
+ export default tseslint.config(
8
+ { ignores: ["dist"] },
9
+ {
10
+ extends: [js.configs.recommended, ...tseslint.configs.recommended],
11
+ files: ["**/*.{ts,tsx}"],
12
+ languageOptions: {
13
+ ecmaVersion: 2020,
14
+ globals: globals.browser,
15
+ },
16
+ plugins: {
17
+ "react-hooks": reactHooks,
18
+ "react-refresh": reactRefresh,
19
+ },
20
+ rules: {
21
+ ...reactHooks.configs.recommended.rules,
22
+ "react-refresh/only-export-components": [
23
+ "warn",
24
+ { allowConstantExport: true },
25
+ ],
26
+ "@typescript-eslint/no-unused-vars": "off",
27
+ },
28
+ }
29
+ );
frontend/index.html ADDED
@@ -0,0 +1,24 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8" />
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0" />
6
+ <title>Dhwanii Voice Cloning AI</title>
7
+ <meta name="description" content="Voice cloning and speech synthesis demo" />
8
+ <meta name="author" content="Dhwanii Voice Cloning AI" />
9
+
10
+ <meta property="og:title" content="Dhwanii Voice Cloning AI" />
11
+ <meta property="og:description" content="Voice cloning and speech synthesis demo" />
12
+ <meta property="og:type" content="website" />
13
+ <meta property="og:image" content="/favicon.ico" />
14
+
15
+ <meta name="twitter:card" content="summary" />
16
+ <meta name="twitter:site" content="@Arjitsharma00074" />
17
+ <meta name="twitter:image" content="/favicon.ico" />
18
+ </head>
19
+
20
+ <body>
21
+ <div id="root"></div>
22
+ <script type="module" src="/src/main.tsx"></script>
23
+ </body>
24
+ </html>
frontend/package-lock.json ADDED
The diff for this file is too large to render. See raw diff
 
frontend/package.json ADDED
@@ -0,0 +1,88 @@
1
+ {
2
+ "name": "vite_react_shadcn_ts",
3
+ "private": true,
4
+ "version": "0.0.0",
5
+ "type": "module",
6
+ "scripts": {
7
+ "dev": "vite",
8
+ "build": "vite build",
9
+ "build:dev": "vite build --mode development",
10
+ "lint": "eslint .",
11
+ "preview": "vite preview"
12
+ },
13
+ "dependencies": {
14
+ "@hookform/resolvers": "^3.10.0",
15
+ "@radix-ui/react-accordion": "^1.2.11",
16
+ "@radix-ui/react-alert-dialog": "^1.1.14",
17
+ "@radix-ui/react-aspect-ratio": "^1.1.7",
18
+ "@radix-ui/react-avatar": "^1.1.10",
19
+ "@radix-ui/react-checkbox": "^1.3.2",
20
+ "@radix-ui/react-collapsible": "^1.1.11",
21
+ "@radix-ui/react-context-menu": "^2.2.15",
22
+ "@radix-ui/react-dialog": "^1.1.14",
23
+ "@radix-ui/react-dropdown-menu": "^2.1.15",
24
+ "@radix-ui/react-hover-card": "^1.1.14",
25
+ "@radix-ui/react-label": "^2.1.7",
26
+ "@radix-ui/react-menubar": "^1.1.15",
27
+ "@radix-ui/react-navigation-menu": "^1.2.13",
28
+ "@radix-ui/react-popover": "^1.1.14",
29
+ "@radix-ui/react-progress": "^1.1.7",
30
+ "@radix-ui/react-radio-group": "^1.3.7",
31
+ "@radix-ui/react-scroll-area": "^1.2.9",
32
+ "@radix-ui/react-select": "^2.2.5",
33
+ "@radix-ui/react-separator": "^1.1.7",
34
+ "@radix-ui/react-slider": "^1.3.5",
35
+ "@radix-ui/react-slot": "^1.2.3",
36
+ "@radix-ui/react-switch": "^1.2.5",
37
+ "@radix-ui/react-tabs": "^1.1.12",
38
+ "@radix-ui/react-toast": "^1.2.14",
39
+ "@radix-ui/react-toggle": "^1.1.9",
40
+ "@radix-ui/react-toggle-group": "^1.1.10",
41
+ "@radix-ui/react-tooltip": "^1.2.7",
42
+ "@react-three/drei": "^9.122.0",
43
+ "@react-three/fiber": "^8.18.0",
44
+ "@react-three/postprocessing": "^2.19.1",
45
+ "@splinetool/react-spline": "^4.1.0",
46
+ "@splinetool/runtime": "^1.10.55",
47
+ "@tanstack/react-query": "^5.83.0",
48
+ "class-variance-authority": "^0.7.1",
49
+ "clsx": "^2.1.1",
50
+ "cmdk": "^1.1.1",
51
+ "date-fns": "^3.6.0",
52
+ "embla-carousel-react": "^8.6.0",
53
+ "input-otp": "^1.4.2",
54
+ "lucide-react": "^0.462.0",
55
+ "next-themes": "^0.3.0",
56
+ "react": "^18.3.1",
57
+ "react-day-picker": "^8.10.1",
58
+ "react-dom": "^18.3.1",
59
+ "react-hook-form": "^7.61.1",
60
+ "react-resizable-panels": "^2.1.9",
61
+ "react-router-dom": "^6.30.1",
62
+ "recharts": "^2.15.4",
63
+ "sonner": "^1.7.4",
64
+ "tailwind-merge": "^2.6.0",
65
+ "tailwindcss-animate": "^1.0.7",
66
+ "three": "^0.169.0",
67
+ "vaul": "^0.9.9",
68
+ "zod": "^3.25.76"
69
+ },
70
+ "devDependencies": {
71
+ "@eslint/js": "^9.32.0",
72
+ "@tailwindcss/typography": "^0.5.16",
73
+ "@types/node": "^22.16.5",
74
+ "@types/react": "^18.3.23",
75
+ "@types/react-dom": "^18.3.7",
76
+ "@vitejs/plugin-react-swc": "^3.11.0",
77
+ "autoprefixer": "^10.4.21",
78
+ "eslint": "^9.32.0",
79
+ "eslint-plugin-react-hooks": "^5.2.0",
80
+ "eslint-plugin-react-refresh": "^0.4.20",
81
+ "globals": "^15.15.0",
82
+ "postcss": "^8.5.6",
83
+ "tailwindcss": "^3.4.17",
84
+ "typescript": "^5.8.3",
85
+ "typescript-eslint": "^8.38.0",
86
+ "vite": "^7.1.4"
87
+ }
88
+ }
frontend/postcss.config.js ADDED
@@ -0,0 +1,6 @@
1
+ export default {
2
+ plugins: {
3
+ tailwindcss: {},
4
+ autoprefixer: {},
5
+ },
6
+ }
frontend/public/placeholder.svg ADDED
frontend/public/robots.txt ADDED
@@ -0,0 +1,14 @@
1
+ User-agent: Googlebot
2
+ Allow: /
3
+
4
+ User-agent: Bingbot
5
+ Allow: /
6
+
7
+ User-agent: Twitterbot
8
+ Allow: /
9
+
10
+ User-agent: facebookexternalhit
11
+ Allow: /
12
+
13
+ User-agent: *
14
+ Allow: /