Z-Image Turbo - Technical Stack Report
Version: 15.0
Last Updated: December 2025
Space URL: https://huggingface.co/spaces/lulavc/Z-Image-Turbo
Overview
Z-Image Turbo is a high-performance AI image generation and transformation application built on Hugging Face Spaces. It leverages the Z-Image-Turbo model from Alibaba's Tongyi-MAI team with multiple performance optimizations for fast, high-quality image synthesis.
Core Model
Z-Image-Turbo (Tongyi-MAI)
| Specification | Details |
| --- | --- |
| Model Name | Tongyi-MAI/Z-Image-Turbo |
| Architecture | Scalable Single-Stream Diffusion Transformer (S3-DiT) |
| Parameters | 6 Billion |
| License | Apache 2.0 |
| Precision | BFloat16 |
| Inference Steps | 8 (optimized distilled model) |
| Guidance Scale | 0.0 (classifier-free guidance disabled) |
Key Model Features
- Sub-second latency on enterprise GPUs
- Photorealistic image generation with exceptional detail
- Bilingual text rendering (English & Chinese)
- Distilled architecture for fast inference without quality loss
- Consumer GPU compatible (<16GB VRAM)
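For reference, a minimal text-to-image call with the settings from the table above might look like the sketch below. It assumes the model loads through diffusers' generic `DiffusionPipeline`; the exact pipeline class and argument names for Z-Image may differ, and the deployed app adds the optimizations described later.

```python
# Minimal sketch: text-to-image with the specs from the table above.
# Assumes loading via diffusers' DiffusionPipeline; argument names for the
# Z-Image pipeline class may differ.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,   # BF16 precision per the spec table
)
pipe.to("cuda")

image = pipe(
    prompt="a photorealistic portrait of an astronaut, studio lighting",
    num_inference_steps=8,        # distilled model: 8 steps
    guidance_scale=0.0,           # classifier-free guidance disabled
    height=1024,
    width=1024,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("output.png")
```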
Hardware Infrastructure
ZeroGPU (Hugging Face Spaces)
| Specification | Details |
| --- | --- |
| GPU | NVIDIA H200 |
| VRAM | 70GB per workload |
| Compute Capability | 9.0 |
| Allocation | Dynamic (on-demand) |
| Tensor Packing | ~28.7GB |
Benefits
- Free GPU access for demos
- Dynamic allocation reduces idle costs
- H200 enables advanced optimizations (FP8, FlashAttention-2)
- No dedicated GPU management required
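On ZeroGPU, the GPU is attached only while a decorated function runs. A sketch of how the generation entry point might request a GPU is shown below; the function body and the duration value are illustrative, not the app's actual code.

```python
# Sketch: requesting an on-demand GPU on ZeroGPU for the duration of a call.
# The duration value and function body are illustrative.
import torch
import spaces

@spaces.GPU(duration=60)  # attach an H200 slice for up to ~60 s
def generate(prompt: str, steps: int = 8, seed: int = 42):
    # `pipe` is the globally loaded Z-Image pipeline (see the sketch above)
    generator = torch.Generator("cuda").manual_seed(seed)
    return pipe(prompt=prompt, num_inference_steps=steps,
                guidance_scale=0.0, generator=generator).images[0]
```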
Performance Optimizations
1. FP8 Dynamic Quantization (torchao)
```python
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

# Quantize the transformer's linear layers: FP8 weights and activations
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
```
| Metric | Improvement |
| --- | --- |
| Inference Speed | 30-50% faster |
| Memory Usage | ~50% reduction |
| Quality Impact | Minimal (imperceptible) |
How it works: Quantizes transformer weights and activations to FP8 format dynamically during inference, reducing memory bandwidth requirements and enabling faster matrix operations on H200's FP8 tensor cores.
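As a back-of-the-envelope check against the memory numbers reported in the benchmarks section, moving 6B parameters from 2 bytes (BF16) to 1 byte (FP8) halves the weight footprint:

```python
# Rough weight-memory estimate for the 6B-parameter transformer.
params = 6e9
print(f"BF16 weights: {params * 2 / 1e9:.0f} GB")   # ~12 GB
print(f"FP8 weights : {params * 1 / 1e9:.0f} GB")   # ~6 GB
```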
2. FlashAttention-2 via SDPA
```python
import torch

# Allow PyTorch SDPA to dispatch to FlashAttention / memory-efficient kernels
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
```
| Metric | Improvement |
| --- | --- |
| Attention Speed | 2-4x faster |
| Memory Usage | O(n) instead of O(n²) |
| Quality Impact | None (mathematically equivalent) |
How it works: PyTorch's Scaled Dot-Product Attention (SDPA) backend automatically uses FlashAttention-2 on compatible hardware (H200), computing attention without materializing the full attention matrix.
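The sketch below shows the fused call that SDPA dispatches; the tensor shapes are illustrative. On H200, PyTorch routes this call to FlashAttention-2 kernels without materializing the full n x n attention matrix.

```python
# Sketch: the fused attention call behind the SDPA backend.
import torch
import torch.nn.functional as F

# (batch, heads, sequence length, head dim) -- shapes are illustrative
q = torch.randn(1, 16, 4096, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v)  # fused, memory-efficient kernel
```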
3. cuDNN Auto-Tuning
```python
# Let cuDNN benchmark convolution algorithms and cache the fastest ones
torch.backends.cudnn.benchmark = True
```
| Metric | Improvement |
| --- | --- |
| Convolution Speed | 5-15% faster |
| First Run | Slightly slower (tuning) |
| Subsequent Runs | Optimized kernels cached |
How it works: Enables cuDNN's auto-tuner to find the fastest convolution algorithms for the specific input sizes and hardware configuration.
4. VAE Tiling
```python
# Decode/encode the VAE in tiles so large images fit in VRAM
pipe.vae.enable_tiling()
```
| Metric | Improvement |
| --- | --- |
| Max Resolution | Unlimited (memory permitting) |
| Memory Usage | Significantly reduced for large images |
| Quality Impact | Minimal (possible seams at tile boundaries) |
How it works: Processes large images in tiles rather than all at once, enabling generation of high-resolution images (2K+) without running out of VRAM.
5. VAE Slicing
```python
# Process VAE batches slice-by-slice to lower peak memory
pipe.vae.enable_slicing()
```
| Metric | Improvement |
| --- | --- |
| Batch Processing | More memory efficient |
| Memory Usage | Reduced peak usage |
| Quality Impact | None |
How it works: Processes VAE encoding/decoding in slices along the batch dimension, reducing peak memory usage when processing multiple images.
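Taken together, the five optimizations above amount to a short one-time setup applied after the pipeline is loaded. A sketch (where `pipe` stands for the loaded Z-Image pipeline):

```python
# Sketch: applying all five optimizations once after loading the pipeline.
import torch
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

# 1. FP8 dynamic quantization of the transformer weights/activations
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())

# 2. FlashAttention-2 via PyTorch's SDPA backends
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)

# 3. cuDNN auto-tuner: benchmark kernels for the observed input shapes
torch.backends.cudnn.benchmark = True

# 4. VAE tiling: decode large images tile by tile
pipe.vae.enable_tiling()

# 5. VAE slicing: decode batches one slice at a time
pipe.vae.enable_slicing()
```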
Software Stack
Dependencies
| Package | Version | Purpose |
| --- | --- | --- |
| diffusers | Latest (git) | Diffusion model pipelines |
| transformers | ≥4.44.0 | Text encoders, tokenizers |
| accelerate | ≥0.33.0 | Device management, optimization |
| torchao | ≥0.5.0 | FP8 quantization |
| sentencepiece | Latest | Tokenization |
| gradio | Latest | Web UI framework |
| spaces | Latest | ZeroGPU integration |
| torch | 2.8.0+cu128 | Deep learning framework |
| PIL/Pillow | Latest | Image processing |
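Expressed as a `requirements.txt`, the table above would look roughly like the sketch below; the entries are indicative, and torch is typically preinstalled on Spaces rather than pinned here.

```text
git+https://github.com/huggingface/diffusers
transformers>=4.44.0
accelerate>=0.33.0
torchao>=0.5.0
sentencepiece
gradio
spaces
Pillow
```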
Runtime Environment
| Component | Details |
| --- | --- |
| Python | 3.10 |
| CUDA | 12.8 |
| Platform | Hugging Face Spaces |
| SDK | Gradio |
Application Features
Generate Tab (Text-to-Image)
| Feature | Details |
| --- | --- |
| Pipeline | DiffusionPipeline |
| Input | Text prompt |
| Styles | 10 presets (None, Photorealistic, Cinematic, Anime, Digital Art, Oil Painting, Watercolor, 3D Render, Fantasy, Sci-Fi) |
| Aspect Ratios | 18 options (1024px to 2048px) |
| Steps | 4-16 (default: 8) |
| Seed Control | Manual or random |
| Output Format | PNG |
| Share | HuggingFace CDN upload |
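As a sketch, the Generate tab's controls could map onto a pipeline call as shown below. The style suffixes, aspect-ratio table, helper name, and random-seed convention are illustrative, not the app's actual code.

```python
# Sketch of how the Generate tab controls might map to a pipeline call.
# Style suffixes, aspect-ratio entries, and helper names are illustrative.
import random
import torch

STYLE_SUFFIXES = {
    "None": "",
    "Photorealistic": ", photorealistic, highly detailed",
    "Cinematic": ", cinematic lighting, film still",
    # ... remaining presets (Anime, Digital Art, Oil Painting, ...)
}

ASPECT_RATIOS = {
    "1024x1024": (1024, 1024),
    "1344x768": (1344, 768),
    # ... remaining options up to 2048x2048
}

def generate_image(prompt, style, ratio, steps=8, seed=-1):
    if seed < 0:                        # -1 means "pick a random seed"
        seed = random.randint(0, 2**32 - 1)
    width, height = ASPECT_RATIOS[ratio]
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(
        prompt=prompt + STYLE_SUFFIXES[style],
        num_inference_steps=steps,
        guidance_scale=0.0,
        width=width,
        height=height,
        generator=generator,
    ).images[0]
    return image, seed                  # seed returned for reproducibility
```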
Transform Tab (Image-to-Image)
| Feature | Details |
| --- | --- |
| Pipeline | ZImageImg2ImgPipeline |
| Input | Image upload + text prompt |
| Strength | 0.1-1.0 (transformation intensity) |
| Styles | Same 10 presets |
| Auto-Resize | 512-2048px (dimensions rounded to a multiple of 16) |
| Steps | 4-16 (default: 8) |
| Output Format | PNG |
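A sketch of the auto-resize step plus the image-to-image call follows. The resize helper is illustrative; `ZImageImg2ImgPipeline` is the class named in the table above and is assumed here to follow the usual diffusers img2img call signature.

```python
# Sketch: resize an uploaded image to a multiple of 16 within 512-2048 px,
# then run the image-to-image pipeline. The helper is illustrative.
import torch
from PIL import Image

def resize_to_multiple_of_16(img: Image.Image, lo=512, hi=2048) -> Image.Image:
    w, h = img.size
    w = max(lo, min(hi, w)) // 16 * 16
    h = max(lo, min(hi, h)) // 16 * 16
    return img.resize((w, h), Image.LANCZOS)

def transform_image(img, prompt, strength=0.6, steps=8, seed=42):
    # `pipe_i2i` is the loaded ZImageImg2ImgPipeline
    img = resize_to_multiple_of_16(img)
    generator = torch.Generator("cuda").manual_seed(seed)
    return pipe_i2i(
        prompt=prompt,
        image=img,
        strength=strength,            # 0.1-1.0 transformation intensity
        num_inference_steps=steps,
        guidance_scale=0.0,
        generator=generator,
    ).images[0]
```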
Supported Resolutions
| Category | Resolutions |
| --- | --- |
| Standard | 1024x1024, 1344x768, 768x1344, 1152x896, 896x1152, 1536x640, 1216x832, 832x1216 |
| XL | 1536x1536, 1920x1088, 1088x1920, 1536x1152, 1152x1536 |
| MAX | 2048x2048, 2048x1152, 1152x2048, 2048x1536, 1536x2048 |
UI/UX Design
Theme
- Color Scheme: Blue gradient (#e8f4fc to #d4e9f7)
- Primary Color: #2563eb (buttons, active elements)
- Secondary Color: #3b82f6 (accents)
- Background: Light blue gradient
- Cards: White with subtle shadows
Components
- Centered header with lightning bolt icon
- Tabbed interface (Generate / Transform)
- Two-column layout (controls | output)
- Example prompts with one-click loading
- Share button for CDN uploads
- Copy-to-clipboard for image links
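The theme boils down to a small amount of custom CSS passed to `gr.Blocks`; the sketch below is illustrative, with only the hex colors taken from the description above.

```python
# Sketch: applying the report's color scheme via custom CSS in Gradio.
# Selectors are illustrative; the hex values come from the theme description.
import gradio as gr

CSS = """
.gradio-container {
    background: linear-gradient(180deg, #e8f4fc, #d4e9f7);  /* light blue */
}
button.primary {
    background-color: #2563eb;   /* primary: buttons, active elements */
}
a { color: #3b82f6; }            /* secondary accents */
"""

with gr.Blocks(css=CSS, title="Z-Image Turbo") as demo:
    with gr.Tabs():
        with gr.Tab("Generate"):
            ...
        with gr.Tab("Transform"):
            ...
```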
Performance Benchmarks
Generation Speed (1024x1024, 8 steps)
| Configuration | Time |
| --- | --- |
| Baseline (BF16 only) | ~5-6 seconds |
| With All Optimizations | ~3-4 seconds |
| Improvement | ~2 seconds faster (~33-40%) |
Memory Usage
| Configuration | VRAM |
| --- | --- |
| Baseline (BF16) | ~12GB |
| With FP8 Quantization | ~6GB |
| Reduction | ~50% |
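Numbers like these can be reproduced with a simple timing loop around the pipeline call, sketched below; warm-up runs matter because cuDNN auto-tuning makes the first pass slower.

```python
# Sketch: timing 1024x1024, 8-step generations and recording peak VRAM.
import time
import torch

def bench(pipe, prompt, runs=5):
    for _ in range(2):                   # warm-up (kernel tuning, caches)
        pipe(prompt=prompt, num_inference_steps=8, guidance_scale=0.0,
             height=1024, width=1024)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        pipe(prompt=prompt, num_inference_steps=8, guidance_scale=0.0,
             height=1024, width=1024)
    torch.cuda.synchronize()
    print(f"avg time : {(time.perf_counter() - start) / runs:.2f} s")
    print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```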
Architecture Diagram
```
-------------------------------------------------------------------
 Gradio Web Interface
   Generate Tab:  prompt input, style selector, aspect ratio,
                  steps / seed
   Transform Tab: image upload, transformation prompt,
                  strength slider, style / steps / seed
-------------------------------------------------------------------
                                 |
                                 v
-------------------------------------------------------------------
 ZeroGPU (@spaces.GPU) - NVIDIA H200 (70GB VRAM)
   pipe_t2i (Text-to-Img)        pipe_i2i (Img-to-Img)
              |                              |
              +--------------+---------------+
                             |
                             v
   Z-Image Transformer (6B)
     - FP8 Quantized (torchao)
     - FlashAttention-2 (SDPA backend)
                             |
                             v
   VAE Decoder
     - Tiling enabled (large images)
     - Slicing enabled (memory efficient)
-------------------------------------------------------------------
                                 |
                                 v
-------------------------------------------------------------------
 Output
   - PNG image (full quality)
   - Seed value (reproducibility)
   - Optional: HuggingFace CDN share link
-------------------------------------------------------------------
```
Known Limitations
torch.compile Incompatibility
The Z-Image transformer contains code patterns (e.g. `device = x[0].device`) that are incompatible with PyTorch's Dynamo tracer, which prevents using `torch.compile` for additional speedup.
FlashAttention-3
`FlashAttention3Processor` is not yet available in diffusers for the Z-Image architecture. The application uses FlashAttention-2 via the SDPA backend instead.
torchao Version Warning
A deprecation warning appears for `float8_dynamic_activation_float8_weight`. This is cosmetic and doesn't affect functionality.
Future Optimization Opportunities
- Ahead-of-Time Compilation (AoTI) - When Z-Image becomes compatible with torch.compile
- INT8 Quantization - Alternative to FP8 for broader hardware support
- Model Sharding - For even larger batch processing
- Speculative Decoding - Potential speedup for iterative generation
- LoRA Support - Custom style fine-tuning
Credits
- Model: Alibaba Tongyi-MAI Team (Z-Image-Turbo)
- Infrastructure: Hugging Face (Spaces, ZeroGPU, Diffusers)
- Optimizations: PyTorch Team (SDPA, torchao)
- Application: Built with Gradio
License
- Model: Apache 2.0
- Application Code: MIT
- Dependencies: Various open-source licenses
This report documents the technical implementation of Z-Image Turbo v15 as deployed on Hugging Face Spaces.