# Paper2Video

English | 简体中文

Paper2Video: Automatic Video Generation from Scientific Papers

Zeyu Zhu*, Kevin Qinghong Lin*, Mike Zheng Shou
Show Lab, National University of Singapore

📄 Paper | 🤗 Daily Paper | 📊 Dataset | 🌐 Project Website | 💬 X (Twitter)

- **Input:** a paper ➕ an image ➕ an audio

  | Paper | Image | Audio |
  |-------|-------|-------|
  | [🔗 Paper link](https://arxiv.org/pdf/1509.01626) | Hinton's photo | [🔗 Audio sample](https://github.com/showlab/Paper2Video/blob/page/assets/hinton/ref_audio_10.wav) |

- **Output:** a presentation video

  https://github.com/user-attachments/assets/39221a9a-48cb-4e20-9d1c-080a5d8379c4

Check out more examples on the [🌐 project page](https://showlab.github.io/Paper2Video/).

## 🔥 Update

- [x] [2025.10.11] Our work receives attention on [YC Hacker News](https://news.ycombinator.com/item?id=45553701).
- [x] [2025.10.9] Thanks to AK for sharing our work on [Twitter](https://x.com/_akhaliq/status/1976099830004072849)!
- [x] [2025.10.9] Our work is covered by [Medium](https://medium.com/@dataism/how-ai-learned-to-make-scientific-videos-from-slides-to-a-talking-head-0d807e491b27).
- [x] [2025.10.8] Check out our demo video below!
- [x] [2025.10.7] We release the [arXiv paper](https://arxiv.org/abs/2510.05096).
- [x] [2025.10.6] We release the [code](https://github.com/showlab/Paper2Video) and [dataset](https://huggingface.co/datasets/ZaynZhu/Paper2Video).
- [x] [2025.9.28] Paper2Video has been accepted to the **Scaling Environments for Agents Workshop ([SEA](https://sea-workshop.github.io/)) at NeurIPS 2025**.

https://github.com/user-attachments/assets/a655e3c7-9d76-4c48-b946-1068fdb6cdd9

---

### Table of Contents

- [🌟 Overview](#-overview)
- [🚀 Quick Start: PaperTalker](#-try-papertalker-for-your-paper-)
  - [1. Requirements](#1-requirements)
  - [2. Configure LLMs](#2-configure-llms)
  - [3. Inference](#3-inference)
- [📊 Evaluation: Paper2Video](#-evaluation-paper2video)
- [😼 Fun: Paper2Video for Paper2Video](#-fun-paper2video-for-paper2video)
- [🙏 Acknowledgements](#-acknowledgements)
- [📌 Citation](#-citation)

---

## 🌟 Overview

*(Figure: Overview)*

This work tackles two core problems in academic presentation:

- **Left: How to create a presentation video from a paper?**
  *PaperTalker* — an agent that integrates **slides**, **subtitling**, **cursor grounding**, **speech synthesis**, and **talking-head video rendering**.
- **Right: How to evaluate a presentation video?**
  *Paper2Video* — a benchmark with well-designed metrics to evaluate presentation quality.

---

## 🚀 Try PaperTalker for your Paper!

*(Figure: Approach)*

### 1. Requirements

Prepare the environment:

```bash
cd src
conda create -n p2v python=3.10
conda activate p2v
pip install -r requirements.txt
conda install -c conda-forge tectonic
```

Download the dependent code and follow the instructions in **[Hallo2](https://github.com/fudan-generative-vision/hallo2)** to download the model weights.

```bash
git clone https://github.com/fudan-generative-vision/hallo2.git
```

You need to **prepare a separate environment for talking-head generation** to avoid potential package conflicts; please refer to Hallo2. After installing, use `which python` to get the Python environment path.

```bash
cd hallo2
conda create -n hallo python=3.10
conda activate hallo
pip install -r requirements.txt
```

### 2. Configure LLMs

Export your **API credentials**:

```bash
export GEMINI_API_KEY="your_gemini_key_here"
export OPENAI_API_KEY="your_openai_key_here"
```

The best practice is to use **GPT-4.1** or **Gemini 2.5 Pro** for both the LLM and the VLM. We also support locally deployed open-source models (e.g., Qwen); for details, please refer to Paper2Poster.

### 3. Inference

The script `pipeline.py` provides an automated pipeline for generating academic presentation videos. It takes **LaTeX paper sources** together with a **reference image/audio** as input and runs multiple sub-modules (Slides → Subtitles → Speech → Cursor → Talking Head) to produce a complete presentation video.

⚡ The minimum recommended GPU for running this pipeline is an **NVIDIA A6000** with 48 GB of memory.

#### Example Usage

Run the following command to launch a full generation:

```bash
python pipeline.py \
    --model_name_t gpt-4.1 \
    --model_name_v gpt-4.1 \
    --model_name_talking hallo2 \
    --result_dir /path/to/output \
    --paper_latex_root /path/to/latex_proj \
    --ref_img /path/to/ref_img.png \
    --ref_audio /path/to/ref_audio.wav \
    --talking_head_env /path/to/hallo2_env \
    --gpu_list [0,1,2,3,4,5,6,7]
```

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--model_name_t` | `str` | `gpt-4.1` | LLM |
| `--model_name_v` | `str` | `gpt-4.1` | VLM |
| `--model_name_talking` | `str` | `hallo2` | Talking-head model. Currently only **hallo2** is supported |
| `--result_dir` | `str` | `/path/to/output` | Output directory (slides, subtitles, videos, etc.) |
| `--paper_latex_root` | `str` | `/path/to/latex_proj` | Root directory of the LaTeX paper project |
| `--ref_img` | `str` | `/path/to/ref_img.png` | Reference image (must be a **square** portrait) |
| `--ref_audio` | `str` | `/path/to/ref_audio.wav` | Reference audio (recommended: ~10 s) |
| `--ref_text` | `str` | `None` | Optional reference text (style guidance for subtitles) |
| `--beamer_templete_prompt` | `str` | `None` | Optional reference text (style guidance for slides) |
| `--gpu_list` | `list[int]` | `""` | GPU list for parallel execution (used in **cursor generation** and **talking-head rendering**) |
| `--if_tree_search` | `bool` | `True` | Whether to enable tree search for slide layout refinement |
| `--stage` | `str` | `"[0]"` | Pipeline stages to run (e.g., `[0]` for the full pipeline, `[1,2,3]` for partial stages) |
| `--talking_head_env` | `str` | `/path/to/hallo2_env` | Python environment path for talking-head generation |
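As a concrete variant of the command above, the sketch below wires the separate Hallo2 environment into the pipeline. It is only a sketch: the relative paths are placeholders, and it assumes `--talking_head_env` expects the interpreter path printed by `which python` inside the Hallo2 environment, as suggested in the Requirements step.

```bash
# Grab the interpreter path of the separate Hallo2 environment
# (assumption: --talking_head_env takes the path printed by `which python`,
#  following the hint in the Requirements step above)
conda activate hallo
HALLO_PY=$(which python)
conda deactivate

# Run the full pipeline ("[0]" = all stages) from the main p2v environment
conda activate p2v
python pipeline.py \
    --model_name_t gpt-4.1 \
    --model_name_v gpt-4.1 \
    --model_name_talking hallo2 \
    --result_dir ./p2v_output \
    --paper_latex_root ./my_paper/latex \
    --ref_img ./my_paper/ref_img.png \
    --ref_audio ./my_paper/ref_audio.wav \
    --talking_head_env "$HALLO_PY" \
    --stage "[0]" \
    --gpu_list [0,1]
```

For partial reruns, pass a list of stage indices to `--stage` (e.g., `"[1,2,3]"`); see `pipeline.py` for the exact index-to-module mapping.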

---

## 📊 Evaluation: Paper2Video

*(Figure: Metrics)*

Unlike natural video generation, academic presentation videos serve a highly specialized role: they are not merely about visual fidelity but about **communicating scholarship**. This makes it difficult to directly apply conventional metrics from video synthesis (e.g., FVD, IS, or CLIP-based similarity). Instead, their value lies in how well they **disseminate research** and **amplify scholarly visibility**. From this perspective, we argue that a high-quality academic presentation video should be judged along two complementary dimensions:

#### For the Audience

- The video is expected to **faithfully convey the paper's core ideas**.
- It should remain **accessible to diverse audiences**.

#### For the Author

- The video should **foreground the authors' intellectual contribution and identity**.
- It should **enhance the work's visibility and impact**.

To capture these goals, we introduce evaluation metrics specifically designed for academic presentation videos: **Meta Similarity**, **PresentArena**, **PresentQuiz**, and **IP Memory**.

### Run Eval

- Prepare the environment:

  ```bash
  cd src/evaluation
  conda create -n p2v_e python=3.10
  conda activate p2v_e
  pip install -r requirements.txt
  ```

- For **Meta Similarity** and **PresentArena**:

  ```bash
  python MetaSim_audio.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
  python MetaSim_content.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
  ```

  ```bash
  python PresentArena.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
  ```

- For **PresentQuiz**, first generate questions from the paper, then evaluate using Gemini:

  ```bash
  cd PresentQuiz
  python create_paper_questions.py --paper_folder /path/to/data
  python PresentQuiz.py --r /path/to/result_dir --g /path/to/gt_dir --s /path/to/save_dir
  ```

- For **IP Memory**, first generate question pairs from the generated videos, then evaluate using Gemini:

  ```bash
  cd IPMemory
  python construct.py
  python ip_qa.py
  ```

See the code for more details!

👉 The Paper2Video benchmark is available at [HuggingFace](https://huggingface.co/datasets/ZaynZhu/Paper2Video).

---

## 😼 Fun: Paper2Video for Paper2Video

Check out **Paper2Video applied to the Paper2Video paper itself**:

https://github.com/user-attachments/assets/ff58f4d8-8376-4e12-b967-711118adf3c4

## 🙏 Acknowledgements

* The presentation videos are sourced from SlidesLive and YouTube.
* We thank all the authors who put great effort into creating their presentation videos!
* We thank [CAMEL](https://github.com/camel-ai/camel) for their well-organized open-source multi-agent framework.
* We thank the authors of [Hallo2](https://github.com/fudan-generative-vision/hallo2.git) and [Paper2Poster](https://github.com/Paper2Poster/Paper2Poster.git) for open-sourcing their code.
* We thank [Wei Jia](https://github.com/weeadd) for his effort in collecting the data and implementing the baselines. We also thank all the participants involved in the human studies.
* We thank all the **Show Lab @ NUS** members for their support!

---

## 📌 Citation

If you find our work useful, please cite:

```bibtex
@misc{paper2video,
      title={Paper2Video: Automatic Video Generation from Scientific Papers},
      author={Zeyu Zhu and Kevin Qinghong Lin and Mike Zheng Shou},
      year={2025},
      eprint={2510.05096},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.05096},
}
```

[![Star History](https://api.star-history.com/svg?repos=showlab/Paper2Video&type=Date)](https://star-history.com/#showlab/Paper2Video&Date)