arxiv:2601.04342

ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers

Published on Jan 7 · Submitted by Amir Habibian on Jan 9 · Qualcomm

Abstract

AI-generated summary: ReHyAt introduces a recurrent hybrid attention mechanism that combines the strengths of softmax and linear attention, enabling efficient video generation with reduced computational cost and improved scalability.

Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling a chunk-wise recurrent reformulation with constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt's hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours while remaining competitive in quality. Our lightweight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation. The project page is available at https://qualcomm-ai-research.github.io/rehyat.
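To make the mechanism concrete, here is a minimal PyTorch sketch of chunk-wise hybrid attention under stated assumptions: exact softmax attention within each chunk, a linear-attention state that summarizes all earlier chunks, an elu+1 feature map, and a simple additive combination of the two branches. The function name, feature map, and combination rule are illustrative, not the paper's formulation.

```python
# A minimal sketch of chunk-wise hybrid attention (illustrative, not the
# paper's exact formulation): exact softmax attention inside each chunk,
# plus a recurrent linear-attention state that summarizes all previous
# chunks, so memory stays constant in sequence length.
import torch
import torch.nn.functional as F

def hybrid_chunk_attention(q, k, v, chunk=64, eps=1e-6):
    """q, k, v: (batch, seq, dim) -> (batch, seq, dim)."""
    B, T, D = q.shape
    phi = lambda x: F.elu(x) + 1                 # positive feature map (assumed)
    S = torch.zeros(B, D, D, device=q.device)    # running sum of phi(k)^T v
    z = torch.zeros(B, D, device=q.device)       # running sum of phi(k)
    out = []
    for s in range(0, T, chunk):
        qc, kc, vc = q[:, s:s+chunk], k[:, s:s+chunk], v[:, s:s+chunk]
        # High-fidelity softmax attention within the current chunk.
        local = F.scaled_dot_product_attention(qc, kc, vc)
        # Constant-memory linear-attention read of all past chunks.
        qf = phi(qc)
        far = torch.einsum('btd,bde->bte', qf, S)
        norm = torch.einsum('btd,bd->bt', qf, z).clamp_min(eps).unsqueeze(-1)
        out.append(local + far / norm)           # additive mix (assumed)
        # Fold the current chunk into the fixed-size recurrent state.
        kf = phi(kc)
        S = S + torch.einsum('btd,bte->bde', kf, vc)
        z = z + kf.sum(dim=1)
    return torch.cat(out, dim=1)
```

The recurrent state S is D×D regardless of sequence length, which is where the constant-memory property comes from; ReHyAt's actual normalization and mixing between the softmax and linear branches may differ.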

Community

Paper author · Paper submitter

🚀 Introducing PyramidalWan! Our paper presents a novel pipeline to convert pretrained video diffusion models (like Wan2.1-1.3B) into efficient pyramidal ones via low-cost finetuning.

Key Innovations:

- Efficiency via Hierarchy: We restructure the diffusion process into three spatiotemporal stages, processing high-noise steps at lower resolutions to significantly reduce inference cost (see the sketch after this list).
- Theoretical Generalization: We extend resolution transitions to a broader class of upsampling/downsampling functions based on orthogonal transforms.
- Step Distillation: A systematic study of distillation techniques for pyramidal setups, including the first successful training of Pyramidal Patchification models for few-step generation.
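As a rough illustration of the staged design above, here is a minimal sketch of three-stage pyramidal sampling. The stage boundaries, noise schedule, `denoiser(x, t)` signature, and the update rule are all assumptions for illustration; trilinear interpolation stands in for the orthogonal resolution transitions the paper generalizes.

```python
# A minimal sketch of three-stage pyramidal sampling (illustrative only;
# stage boundaries, the noise schedule, the update rule, and the
# denoiser(x, t) signature are assumptions, not the paper's recipe).
import torch
import torch.nn.functional as F

@torch.no_grad()
def pyramidal_sample(denoiser, shape, steps=30, device='cpu'):
    """denoiser(x, t) -> predicted clean latent; shape: (B, C, T, H, W)."""
    B, C, T, H, W = shape
    per = steps // 3
    # High noise at 1/4 resolution, mid noise at 1/2, low noise at full.
    stages = [(per, H // 4, W // 4), (per, H // 2, W // 2), (steps - 2 * per, H, W)]
    x = torch.randn(B, C, T, H // 4, W // 4, device=device)
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)  # noise levels 1 -> 0
    i = 0
    for n, h, w in stages:
        if x.shape[-2:] != (h, w):
            # Resolution transition between stages; the paper generalizes
            # this to orthogonal up/downsampling transforms, trilinear
            # interpolation is just a stand-in here.
            x = F.interpolate(x, size=(T, h, w), mode='trilinear', align_corners=False)
        for _ in range(n):
            t, t_next = ts[i], ts[i + 1]
            x0 = denoiser(x, t)                              # predicted clean sample
            x = x0 + t_next * (x - x0) / t.clamp_min(1e-8)   # simple DDIM-like step
            i += 1
    return x
```

Because the early, high-noise steps run on latents with 1/16 of the full spatial area, most of the denoising compute is spent at reduced resolution, which is the source of the inference savings claimed above.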

Key Insights & Results:

- Near-Original Quality: Achieves video quality comparable to the original Wan model while requiring significantly less compute.
- Superior Motion: Our recommended recipe, PyramidalWan-DMD-PT*, provides consistent motion and fills the gap for high-performing few-step inference.
- Artifact Reduction: Unlike training-free acceleration methods (e.g., Jenga), our approach avoids severe scene and motion artifacts.


