Update model card for PresentAgent (#11)
opened by nielsr (HF Staff)

Files changed (1)
  1. README.md +159 -71
README.md CHANGED
@@ -4,37 +4,55 @@ language:
  - zh
  license: apache-2.0
  pipeline_tag: text-to-speech
  ---

- # Model Description
- This is a huggingface model card for MegaTTS 3 👋

- - Paper: [MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis](https://huggingface.co/papers/2502.18924)
- - Project Page (Audio Samples): <https://sditdemo.github.io/sditdemo/>
- - github: <https://github.com/bytedance/MegaTTS3>
- - [Demo Video](https://github.com/user-attachments/assets/0174c111-f392-4376-a34b-0b5b8164aacc)
- - Huggingface Space: https://huggingface.co/spaces/ByteDance/MegaTTS3

- ## Installation

- ```sh
- # Clone the repository
- git clone https://github.com/bytedance/MegaTTS3
- cd MegaTTS3
  ```

  **Model Download**

- ```sh
- huggingface-cli download ByteDance/MegaTTS3 --local-dir ./checkpoints --local-dir-use-symlinks False
- ```

  **Requirements (for Linux)**

- ```sh
- # Create a python 3.10 conda env (you could also use virtualenv)
- conda create -n megatts3-env python=3.10
- conda activate megatts3-env
  pip install -r requirements.txt

  # Set the root directory
@@ -48,21 +66,13 @@ export CUDA_VISIBLE_DEVICES=0
  ```

  **Requirements (for Windows)**
-
- ```sh
- # [The Windows version is currently under testing]
- # Comment below dependence in requirements.txt:
- # # WeTextProcessing==1.0.4.1
-
- # Create a python 3.10 conda env (you could also use virtualenv)
- conda create -n megatts3-env python=3.10
- conda activate megatts3-env
  pip install -r requirements.txt
  conda install -y -c conda-forge pynini==2.1.5
  pip install WeTextProcessing==1.0.3

  # [Optional] If you want GPU inference, you may need to install specific version of PyTorch for your GPU from https://pytorch.org/.
- pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

  # [Note] if you encounter bugs related with `ffprobe` or `ffmpeg`, you can install it through `conda install -c conda-forge ffmpeg`

@@ -74,61 +84,132 @@ conda env config vars set PYTHONPATH="C:\path\to\MegaTTS3;%PYTHONPATH%" # For co
  # [Optional] Set GPU
  set CUDA_VISIBLE_DEVICES=0 # Windows
  $env:CUDA_VISIBLE_DEVICES=0 # Powershell on Windows
  ```

- **Requirements (for Docker)**

- ```sh
- # [The Docker version is currently under testing]
- # ! You should download the pretrained checkpoint before running the following command
- docker build . -t megatts3:latest

- # For GPU inference
- docker run -it -p 7929:7929 --gpus all -e CUDA_VISIBLE_DEVICES=0 megatts3:latest
- # For CPU inference
- docker run -it -p 7929:7929 megatts3:latest

- # Visit http://0.0.0.0:7929/ for gradio.
- ```

- > \[!TIP]
- > \[IMPORTANT]
- > For security issues, we do not upload the parameters of WaveVAE encoder to the above links. You can only use the pre-extracted latents from [link1](https://drive.google.com/drive/folders/1QhcHWcy20JfqWjgqZX1YM3I6i9u4oNlr?usp=sharing) for inference. If you want to synthesize speech for speaker A, you need "A.wav" and "A.npy" in the same directory. If you have any questions or suggestions for our model, please email us.
- >
- > This project is primarily intended for academic purposes. For academic datasets requiring evaluation, you may upload them to the voice request queue in [link2](https://drive.google.com/drive/folders/1gCWL1y_2xu9nIFhUX_OW5MbcFuB7J5Cl?usp=sharing) (within 24s for each clip). After verifying that your uploaded voices are free from safety issues, we will upload their latent files to [link1](https://drive.google.com/drive/folders/1QhcHWcy20JfqWjgqZX1YM3I6i9u4oNlr?usp=sharing) as soon as possible.
- >
- > In the coming days, we will also prepare and release the latent representations for some common TTS benchmarks.

- ## Inference

- **Command-Line Usage (Standard)**

- ```bash
- # p_w (intelligibility weight), t_w (similarity weight). Typically, prompt with more noises requires higher p_w and t_w
- python tts/infer_cli.py --input_wav 'assets/Chinese_prompt.wav' --input_text "另一边的桌上,一位读书人嗤之以鼻道,'佛子三藏,神子燕小鱼是什么样的人物,李家的那个李子夜如何与他们相提并论?'" --output_dir ./gen

- # As long as audio volume and pronunciation are appropriate, increasing --t_w within reasonable ranges (2.0~5.0)
- # will increase the generated speech's expressiveness and similarity (especially for some emotional cases).
- python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text 'As his long promised tariff threat turned into reality this week, top human advisers began fielding a wave of calls from business leaders, particularly in the automotive sector, along with lawmakers who were sounding the alarm.' --output_dir ./gen --p_w 2.0 --t_w 3.0
- ```

- **Command-Line Usage (for TTS with Accents)**

- ```bash
- # When p_w (intelligibility weight) ≈ 1.0, the generated audio closely retains the speaker’s original accent. As p_w increases, it shifts toward standard pronunciation.
- # t_w (similarity weight) is typically set 0–3 points higher than p_w for optimal results.
- # Useful for accented TTS or solving the accent problems in cross-lingual TTS.
- python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这是一条有口音的音频。' --output_dir ./gen --p_w 1.0 --t_w 3.0

- python tts/infer_cli.py --input_wav 'assets/English_prompt.wav' --input_text '这条音频的发音标准一些了吗?' --output_dir ./gen --p_w 2.5 --t_w 2.5
- ```

- **Web UI Usage**

- ```bash
- # We also support cpu inference, but it may take about 30 seconds (for 10 inference steps).
- python tts/gradio_api.py
- ```

  ## Security

@@ -143,9 +224,16 @@ This project is licensed under the [Apache-2.0 License](LICENSE).

  ## BibTeX Entry and Citation Info

- This repo contains forced-align version of `Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis` and the WavVAE is mainly based on `Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling`. Compared to the model described in paper, the repository includes additional models. These models not only enhance the stability and cloning capabilities of the algorithm but can also be independently utilized to serve a wider range of scenarios.

- ```
  @article{jiang2025sparse,
  title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
  author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},
 
  - zh
  license: apache-2.0
  pipeline_tag: text-to-speech
+ library_name: transformers
+ tags:
+ - multimodal-agent
  ---

+ # PresentAgent: Multimodal Agent for Presentation Video Generation

+ This repository contains the model and code for **PresentAgent**, a multimodal agent that transforms long-form documents into narrated presentation videos, as described in the paper:

+ **[PresentAgent: Multimodal Agent for Presentation Video Generation](https://huggingface.co/papers/2507.04036)**

+ <div align="center">
+
+ [🏠 Homepage](https://github.com/AIGeeksGroup/PresentAgent) | [💻 Code](https://github.com/AIGeeksGroup/PresentAgent) | [🚀 Colab Demo](https://colab.research.google.com/drive/1_2buwbVY6RLoi9NdjXihCOTWoEdL70Fk?usp=sharing) | [📄 Paper](https://huggingface.co/papers/2507.04036)
+
+ </div>
+
+ > [!NOTE]
+ > 🙋🏻‍♀️ To learn more about PresentAgent, please see the following presentation video, which was generated entirely by PresentAgent **without** any manual curation.
+
+ https://github.com/user-attachments/assets/240d3ae9-61a1-4e5f-98d7-9c20a99f4c2b
+
+ ## Introduction
+
+ We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. While existing approaches are limited to generating static slides or text summaries, our method advances beyond these limitations by producing fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this integration, PresentAgent employs a modular pipeline that systematically segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and Text-to-Speech models, and seamlessly composes the final video with precise audio-visual alignment. Given the complexity of evaluating such multimodal outputs, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models that comprehensively scores videos across three critical dimensions: content fidelity, visual clarity, and audience comprehension through prompt-based evaluation. Our experimental validation on a curated dataset of 30 document–presentation pairs demonstrates that PresentAgent approaches human-level quality across all evaluation metrics. These results highlight the significant potential of controllable multimodal agents in transforming static textual materials into dynamic, effective, and accessible presentation formats.
+
+ ![PresentAgent Architecture](https://raw.githubusercontent.com/AIGeeksGroup/PresentAgent/main/presentagent/presentagent.png)
+
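+ As a rough, illustrative companion to the pipeline described above, the sketch below walks through the four stages (segment the document, plan slide-style frames, generate narration, compose the video). It is a hypothetical outline only; every function name, signature, and the `document.txt` path are placeholders, not the actual PresentAgent API.
+
+ ```python
+ from dataclasses import dataclass
+
+ @dataclass
+ class Slide:
+     title: str
+     bullets: list[str]
+     narration: str = ""
+
+ def segment_document(text: str, max_sections: int = 8) -> list[str]:
+     """Placeholder: split a long document into presentation-sized sections."""
+     paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
+     step = max(1, len(paragraphs) // max_sections)
+     return ["\n\n".join(paragraphs[i:i + step]) for i in range(0, len(paragraphs), step)]
+
+ def plan_slide(section: str) -> Slide:
+     """Placeholder: in PresentAgent this stage is handled by an LLM that plans slide layout and content."""
+     lines = section.splitlines()
+     return Slide(title=lines[0][:60], bullets=lines[1:4])
+
+ def narrate(slide: Slide) -> str:
+     """Placeholder: an LLM drafts the narration, then a TTS model (e.g. MegaTTS3) voices it."""
+     return f"This slide covers: {slide.title}."
+
+ def compose_video(slides: list[Slide]) -> None:
+     """Placeholder: render slides and synthesized audio into a synchronized video."""
+     for i, slide in enumerate(slides, start=1):
+         print(f"[{i:02d}] {slide.title} ({len(slide.narration)} narration chars)")
+
+ document = open("document.txt", encoding="utf-8").read()  # hypothetical input path
+ slides = [plan_slide(s) for s in segment_document(document)]
+ for s in slides:
+     s.narration = narrate(s)
+ compose_video(slides)
+ ```
+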
+ ## 🔧 Run Your PresentAgent
+
+ > [!TIP]
+ > 🎮 Before deploying PresentAgent on your local machine, please check out our [**Colab demo**](https://colab.research.google.com/drive/1_2buwbVY6RLoi9NdjXihCOTWoEdL70Fk?usp=sharing), which is available online and ready to use.
+
+ ### 1. Install & Requirements
+
+ ```bash
+ conda create -n presentagent python=3.11
+ conda activate presentagent
+ pip install -r requirements.txt
+ cd presentagent/MegaTTS3
  ```

  **Model Download**

+ The pretrained checkpoints can be found on [Google Drive](https://drive.google.com/drive/folders/1CidiSqtHgJTBDAHQ746_on_YR0boHDYB?usp=sharing) or [Hugging Face](https://huggingface.co/ByteDance/MegaTTS3). Please download them and put them in ``presentagent/MegaTTS3/checkpoints/xxx``.
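+
+ If you prefer to script the checkpoint download, the following is a minimal sketch using the `huggingface_hub` Python API; the target directory mirrors the path above, and you may still need to rearrange files to match the expected ``checkpoints/xxx`` layout.
+
+ ```python
+ # Sketch: fetch the MegaTTS3 checkpoints from the Hugging Face Hub.
+ # Assumes `pip install huggingface_hub` has been run in the presentagent env.
+ from huggingface_hub import snapshot_download
+
+ snapshot_download(
+     repo_id="ByteDance/MegaTTS3",
+     local_dir="presentagent/MegaTTS3/checkpoints",
+ )
+ ```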
 
 
  **Requirements (for Linux)**

+ ```sh
  pip install -r requirements.txt

  # Set the root directory

  ```

  **Requirements (for Windows)**
+ ```sh
  pip install -r requirements.txt
  conda install -y -c conda-forge pynini==2.1.5
  pip install WeTextProcessing==1.0.3

  # [Optional] If you want GPU inference, you may need to install specific version of PyTorch for your GPU from https://pytorch.org/.
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

  # [Note] if you encounter bugs related with `ffprobe` or `ffmpeg`, you can install it through `conda install -c conda-forge ffmpeg`

  # [Optional] Set GPU
  set CUDA_VISIBLE_DEVICES=0 # Windows
  $env:CUDA_VISIBLE_DEVICES=0 # Powershell on Windows
+
  ```

+ ### 2. Generate via WebUI
+
+ 1. **Serve Backend**
+
+ Initialize your models in `presentagent/backend.py`:
+ ```python
+ language_model = AsyncLLM(
+     model="Qwen2.5-72B-Instruct",
+     api_base="http://localhost:7812/v1"
+ )
+ vision_model = AsyncLLM(model="gpt-4o-2024-08-06")
+ text_embedder = AsyncLLM(model="text-embedding-3-small")
+ ```
+ Or configure them via environment variables:
+
+ ```bash
+ export OPENAI_API_KEY="your_key"
+ export API_BASE="http://your_service_provider/v1"
+ export LANGUAGE_MODEL="Qwen2.5-72B-Instruct-GPTQ-Int4"
+ export VISION_MODEL="gpt-4o-2024-08-06"
+ export TEXT_MODEL="text-embedding-3-small"
+ ```
+
+ Then start the backend:
+
+ ```bash
+ python backend.py
+ ```
+
+ 2. **Launch Frontend**
+
+ > Note: The backend API endpoint is configured in `presentagent/vue.config.js`.
+
+ ```bash
+ cd presentagent
+ npm install
+ npm run serve
+ ```
+
+ ### Usage
+
+ First, upload a PPT template and the document, then click **Generate Slides** to generate and download the PPT. After downloading the PPT, you can modify it as you like and then click **PPT2Presentation**.
+ ![image](https://raw.githubusercontent.com/AIGeeksGroup/PresentAgent/main/presentagent/home.png)
+ After uploading the PPT, click **Start Conversion** to create the presentation video.
+ ![image](https://raw.githubusercontent.com/AIGeeksGroup/PresentAgent/main/presentagent/ppt2presentation1.png)
+ Finally, you get a presentation video that you can watch in the page or download.
+ ![image](https://raw.githubusercontent.com/AIGeeksGroup/PresentAgent/main/presentagent/ppt2presentation2.png)

+ ## 📁 Presentation Benchmark

+ ### Doc2Present Benchmark

+ To support the evaluation of document-to-presentation-video generation, we curate the **Doc2Present Benchmark**, a diverse dataset of document–presentation video pairs spanning multiple domains. As shown in the following figure, our benchmark encompasses four representative document types (academic papers, web pages, technical blogs, and slides) paired with human-authored videos, covering diverse real-world domains like education, research, and business reports.

+ ![image](https://raw.githubusercontent.com/AIGeeksGroup/PresentAgent/main/presentagent/datasets.jpg)

+ We collect **30 high-quality video samples** from **public platforms**, **educational repositories**, and **professional presentation archives**. Each video follows a structured narration format, combining slide-based visuals with synchronized voiceover. We manually align each video with its source document and ensure the following conditions are met:
+
+ - The content structure of the video follows that of the document.
+ - The visuals convey document information in a compact, structured form.
+ - The narration and slides are well-aligned temporally.
+
+ The average document length is **3,000–8,000 words**, while the corresponding videos range from **1 to 2 minutes** and contain **5–10 slides**. This setting highlights the core challenge of the task: transforming dense, domain-specific documents into effective and digestible multimodal presentations.
+
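+ For orientation, one way to represent a benchmark entry is sketched below. The schema is purely illustrative (the field names are ours, not a released format) and simply mirrors the pairing and constraints described above.
+
+ ```python
+ # Hypothetical record for one Doc2Present document-video pair.
+ from dataclasses import dataclass, field
+
+ @dataclass
+ class Doc2PresentPair:
+     doc_type: str                # "academic paper" | "web page" | "technical blog" | "slides"
+     document_path: str           # source document, roughly 3,000-8,000 words
+     video_path: str              # human-authored presentation video, 1-2 minutes, 5-10 slides
+     quiz: list[str] = field(default_factory=list)  # five multiple-choice questions for PresentEval
+ ```
+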
+ ### PresentEval
+
+ To assess the quality of generated presentation videos, we adopt two complementary evaluation strategies: Objective Quiz Evaluation and Subjective Scoring.
+
+ ![image](https://raw.githubusercontent.com/AIGeeksGroup/PresentAgent/main/presentagent/eval.jpg)
+
+ For each video, we provide the vision-language model with the complete set of slide images and the full narration transcript as a unified input, simulating how a real viewer would experience the presentation (a minimal sketch of this unified input follows the list below).
+
+ - In Objective Quiz Evaluation, the model answers a fixed set of factual questions to determine whether the video accurately conveys the key information from the source content.
+ - In Subjective Scoring, the model evaluates the video along three dimensions: the coherence of the narration, the clarity and design of the visuals, and the overall ease of understanding.
+ - All evaluations are conducted without ground-truth references and rely entirely on the model’s interpretation of the presented content.
+
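+ The sketch below shows one way such a unified input can be packed into a single multimodal message (transcript text plus base64-encoded slide images). It is a minimal illustration under assumed file locations, not the evaluation code shipped with PresentAgent.
+
+ ```python
+ # Sketch: assemble slide images + narration transcript into one "unified input".
+ # The ./slides directory and narration.txt path are hypothetical.
+ import base64
+ from pathlib import Path
+
+ def build_unified_input(slide_dir: str, transcript_path: str) -> list[dict]:
+     content = [{"type": "text", "text": Path(transcript_path).read_text(encoding="utf-8")}]
+     for img in sorted(Path(slide_dir).glob("*.png")):
+         b64 = base64.b64encode(img.read_bytes()).decode("ascii")
+         content.append({"type": "image_url",
+                         "image_url": {"url": f"data:image/png;base64,{b64}"}})
+     return content  # used as the user-message content of a chat request
+
+ unified_input = build_unified_input("slides", "narration.txt")
+ ```
+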
+ For Objective Quiz Evaluation, we use a fixed-question comprehension protocol to test whether a generated presentation video effectively conveys the core content of its source document. Specifically, we manually design five multiple-choice questions for each document, tailored to its content, as in the examples below (a toy accuracy computation follows the table):
+
+ | Presentation of Web Pages | What is the main feature highlighted in the iPhone’s promotional webpage? |
+ | :---------------------------------: | ------------------------------------------------------------ |
+ | A. | A more powerful chip for faster performance |
+ | B. | A brighter and more vibrant display |
+ | C. | An upgraded camera system with better lenses |
+ | D. | A longer-lasting and more efficient battery |
+ | **Presentation of Academic Paper** | What primary research gap did the authors aim to address by introducing the FineGym dataset? |
+ | A. | Lack of low-resolution sports footage for compression studies |
+ | B. | Need for fine-grained action understanding that goes beyond coarse categories |
+ | C. | Absence of synthetic data to replace human annotations |
+ | D. | Shortage of benchmarks for background context recognition |
+
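+ Scoring this protocol then reduces to comparing the model’s letter choices against an answer key and reporting accuracy, as in the toy computation below (the answers shown are placeholders, not the benchmark key).
+
+ ```python
+ # Sketch: compute quiz accuracy from a VLM's multiple-choice answers.
+ def quiz_accuracy(model_answers: dict[str, str], answer_key: dict[str, str]) -> float:
+     correct = sum(model_answers.get(q, "").strip().upper().startswith(choice)
+                   for q, choice in answer_key.items())
+     return correct / len(answer_key)
+
+ # Placeholder data for one document's five questions.
+ answer_key = {"q1": "B", "q2": "B", "q3": "A", "q4": "D", "q5": "C"}
+ model_answers = {"q1": "B", "q2": "B", "q3": "C", "q4": "D", "q5": "C"}
+ print(f"Quiz accuracy: {quiz_accuracy(model_answers, answer_key):.2f}")  # 0.80
+ ```
+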
+ For Subjective Scoring, we adopt a prompt-based assessment using vision-language models to rate the quality of the generated videos. The prompts are as follows (a minimal prompting sketch follows the table):
+
+ | Video | Scoring Prompt |
+ | :-----------: | ------------------------------------------------------------ |
+ | Narr. Coh. | “How coherent is the narration across the video? Are the ideas logically connected and easy to follow?” |
+ | Visual Appeal | “How would you rate the visual design of the slides in terms of layout, aesthetics, and overall quality?” |
+ | Comp. Diff. | “How easy is it to understand the presentation as a viewer? Were there any confusing or contradictory parts?” |
+ | **Audio** | **Scoring Prompt** |
+ | Narr. Coh. | “How coherent is the narration throughout the audio? Are the ideas logically structured and easy to follow?” |
+ | Audio Appeal | “How pleasant and engaging is the narrator’s voice in terms of tone, pacing, and delivery?” |
+
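+ A minimal way to issue these prompts is sketched below with the `openai` Python client against an OpenAI-compatible endpoint; the client choice, model name, and score parsing are assumptions rather than the PresentEval implementation.
+
+ ```python
+ # Sketch: ask a VLM for 1-5 scores on the subjective dimensions.
+ # Assumes OPENAI_API_KEY is set. `unified_input` is text-only here for brevity;
+ # in practice, use the full slides + transcript list from the earlier sketch.
+ import re
+ from openai import OpenAI
+
+ client = OpenAI()
+ unified_input = [{"type": "text", "text": open("narration.txt", encoding="utf-8").read()}]
+
+ prompts = {
+     "narration_coherence": "How coherent is the narration across the video? "
+                            "Are the ideas logically connected and easy to follow?",
+     "visual_appeal": "How would you rate the visual design of the slides in terms of "
+                      "layout, aesthetics, and overall quality?",
+     "comprehension": "How easy is it to understand the presentation as a viewer? "
+                      "Were there any confusing or contradictory parts?",
+ }
+
+ scores = {}
+ for name, prompt in prompts.items():
+     reply = client.chat.completions.create(
+         model="gpt-4o-2024-08-06",
+         messages=[{"role": "user",
+                    "content": unified_input + [{"type": "text",
+                                                 "text": prompt + " Answer with a single score from 1 to 5."}]}],
+     )
+     text = reply.choices[0].message.content or ""
+     match = re.search(r"[1-5]", text)
+     scores[name] = int(match.group()) if match else None
+
+ print(scores)
+ ```
+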
+ ## 🧪 Experiment
+
+ ### ✳️ Comparative Study
+
+ | Method | Model | Quiz Accuracy | Video Score (mean) | Audio Score (mean) |
+ | :----------: | :---------------: | :-----------: | :----------------: | :----------------: |
+ | Human | Human | 0.56 | 4.47 | 4.80 |
+ | PresentAgent | Claude-3.7-sonnet | 0.64 | 4.00 | 4.53 |
+ | PresentAgent | Qwen-VL-Max | 0.52 | 4.47 | 4.60 |
+ | PresentAgent | Gemini-2.5-pro | 0.52 | 4.33 | 4.33 |
+ | PresentAgent | Gemini-2.5-flash | 0.52 | 4.33 | 4.40 |
+ | PresentAgent | GPT-4o-Mini | 0.64 | 4.67 | 4.40 |
+ | PresentAgent | GPT-4o | 0.56 | 3.93 | 4.47 |
+
+ ---
+
+ ## ⭐ Contribute
+
+ We warmly welcome you to contribute to our project by submitting pull requests; your involvement is key to keeping our work at the cutting edge! Specifically, we encourage efforts to expand its compatibility with the **latest vision-language (VL) models** and **text-to-speech (TTS) models**, ensuring the project stays aligned with the most recent advancements in these rapidly evolving fields.
+
+ Beyond model updates, we also invite you to explore adding new features that could enhance the project’s functionality, usability, or versatility. Whether it’s optimizing existing workflows, introducing novel tools, or addressing unmet needs in the community, your creative contributions will help make this project more robust and valuable for everyone.
+
+ ## Acknowledgement
+ We thank the authors of [PPTAgent](https://github.com/icip-cas/PPTAgent), [PPT Presenter](https://github.com/chaonan99/ppt_presenter), and [MegaTTS3](https://github.com/bytedance/MegaTTS3) for their open-source code.

  ## Security


  ## BibTeX Entry and Citation Info

+ This repository serves as a hub for the **PresentAgent** system, which leverages components such as **MegaTTS3** and **Wavtokenizer**. Please cite the following papers if you use content from this repository or the PresentAgent project:
+
+ ```bibtex
+ @article{shi2025presentagent,
+   title={PresentAgent: Multimodal Agent for Presentation Video Generation},
+   author={Shi, Jingwei and Zhang, Zeyu and Wu, Biao and Liang, Yanjie and Fang, Meng and Chen, Ling and Zhao, Yang},
+   journal={arXiv preprint arXiv:2507.04036},
+   year={2025}
+ }

  @article{jiang2025sparse,
  title={Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis},
  author={Jiang, Ziyue and Ren, Yi and Li, Ruiqi and Ji, Shengpeng and Ye, Zhenhui and Zhang, Chen and Jionghao, Bai and Yang, Xiaoda and Zuo, Jialong and Zhang, Yu and others},