
# Z-Image Turbo - Technical Stack Report

**Version:** 15.0
**Last Updated:** December 2025
**Space URL:** https://huggingface.co/spaces/lulavc/Z-Image-Turbo


## Overview

Z-Image Turbo is a high-performance AI image generation and transformation application built on Hugging Face Spaces. It leverages the Z-Image-Turbo model from Alibaba's Tongyi-MAI team with multiple performance optimizations for fast, high-quality image synthesis.


## Core Model

### Z-Image-Turbo (Tongyi-MAI)

| Specification | Details |
| --- | --- |
| Model Name | Tongyi-MAI/Z-Image-Turbo |
| Architecture | Scalable Single-Stream Diffusion Transformer (S3-DiT) |
| Parameters | 6 billion |
| License | Apache 2.0 |
| Precision | BFloat16 |
| Inference Steps | 8 (distilled for few-step sampling) |
| Guidance Scale | 0.0 (classifier-free guidance is distilled into the model, so none is applied at inference) |

### Key Model Features

- Sub-second latency on enterprise GPUs
- Photorealistic image generation with exceptional detail
- Bilingual text rendering (English and Chinese)
- Distilled architecture for fast inference with minimal quality loss
- Runs on consumer GPUs (<16GB VRAM); see the loading sketch below
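For context, this is roughly how the model is loaded and invoked with diffusers. It is a minimal sketch based on the settings in the table above, not the Space's exact code, and assumes a recent diffusers build with Z-Image support:

```python
import torch
from diffusers import DiffusionPipeline

# Minimal sketch: load the distilled model in BF16 (settings from the table above).
pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.bfloat16
).to("cuda")

# Distilled model: 8 steps, no classifier-free guidance (guidance_scale=0.0).
image = pipe(
    "a lighthouse at dusk, photorealistic",
    num_inference_steps=8,
    guidance_scale=0.0,
).images[0]
image.save("output.png")
```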

## Hardware Infrastructure

### ZeroGPU (Hugging Face Spaces)

| Specification | Details |
| --- | --- |
| GPU | NVIDIA H200 |
| VRAM | 70GB per workload |
| Compute Capability | 9.0 |
| Allocation | Dynamic (on-demand) |
| Tensor Packing | ~28.7GB |

### Benefits

- Free GPU access for demos
- Dynamic allocation reduces idle costs (see the sketch below)
- H200 enables advanced optimizations (FP8, FlashAttention-2)
- No dedicated GPU management required
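With ZeroGPU, a GPU is attached only while a decorated function runs. A minimal sketch of the pattern, assuming the `spaces` package and the `pipe` object from the loading sketch above (the handler name and duration are illustrative):

```python
import spaces

@spaces.GPU(duration=60)  # attach a ZeroGPU slice for up to ~60s per call
def generate(prompt: str):
    # CUDA work happens here; the GPU is released when the call returns
    return pipe(prompt, num_inference_steps=8, guidance_scale=0.0).images[0]
```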

## Performance Optimizations

### 1. FP8 Dynamic Quantization (torchao)

```python
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

# Swap the transformer's linear layers to FP8 weights with dynamic FP8 activations
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
```

| Metric | Improvement |
| --- | --- |
| Inference Speed | 30-50% faster |
| Memory Usage | ~50% reduction |
| Quality Impact | Minimal (imperceptible) |

**How it works:** Quantizes transformer weights and activations to FP8 dynamically during inference, reducing memory-bandwidth requirements and enabling faster matrix operations on the H200's FP8 tensor cores.
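As a standalone illustration (not the Space's code), the same call can be applied to a toy module on FP8-capable hardware to see the weight storage change:

```python
import torch
from torch import nn
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

# Toy stand-in for the transformer: the quantizer targets nn.Linear layers.
model = nn.Sequential(nn.Linear(4096, 4096)).to("cuda", torch.bfloat16)
quantize_(model, float8_dynamic_activation_float8_weight())

# The bf16 weight is replaced by a tensor subclass holding FP8 data;
# activations are quantized on the fly at each forward pass.
print(type(model[0].weight))
```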


### 2. FlashAttention-2 via SDPA

```python
import torch

# Prefer FlashAttention and memory-efficient kernels in PyTorch's SDPA
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
```

| Metric | Improvement |
| --- | --- |
| Attention Speed | 2-4x faster |
| Memory Usage | O(n) instead of O(n²) |
| Quality Impact | None (mathematically equivalent) |

**How it works:** PyTorch's Scaled Dot-Product Attention (SDPA) backend automatically uses FlashAttention-2 on compatible hardware (H200), computing attention without materializing the full attention matrix.
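To verify that FlashAttention can actually serve a given shape and dtype, SDPA can be pinned to that backend explicitly. This is an illustrative check, not part of the application:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Typical attention layout: (batch, heads, sequence, head_dim), bf16 on CUDA
q = k = v = torch.randn(1, 16, 1024, 64, device="cuda", dtype=torch.bfloat16)

# Pinning SDPA to FlashAttention fails loudly if the backend cannot handle
# these inputs; otherwise attention runs without materializing the full
# (sequence x sequence) score matrix.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 16, 1024, 64])
```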


### 3. cuDNN Auto-Tuning

```python
import torch

# Let cuDNN benchmark candidate algorithms and cache the fastest per input shape
torch.backends.cudnn.benchmark = True
```

| Metric | Improvement |
| --- | --- |
| Convolution Speed | 5-15% faster |
| First Run | Slightly slower (tuning) |
| Subsequent Runs | Optimized kernels cached |

**How it works:** Enables cuDNN's auto-tuner, which tries candidate convolution algorithms for each input size it encounters and caches the fastest one for the specific hardware configuration.
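The first-run cost is easy to observe with a toy convolution (illustrative only; absolute timings vary by hardware):

```python
import time
import torch

torch.backends.cudnn.benchmark = True
conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(1, 3, 1024, 1024, device="cuda")

for i in range(3):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    conv(x)
    torch.cuda.synchronize()
    # run 0 includes the one-time algorithm search; later runs reuse the cached pick
    print(f"run {i}: {(time.perf_counter() - t0) * 1e3:.2f} ms")
```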


### 4. VAE Tiling

```python
# Encode/decode large images tile by tile instead of all at once
pipe.vae.enable_tiling()
```

| Metric | Improvement |
| --- | --- |
| Max Resolution | Limited only by available memory |
| Memory Usage | Significantly reduced for large images |
| Quality Impact | Minimal (possible seams at tile boundaries) |

**How it works:** Processes large images in tiles rather than all at once, enabling generation of high-resolution images (2K+) without running out of VRAM.


### 5. VAE Slicing

```python
# Run VAE encode/decode one batch slice at a time
pipe.vae.enable_slicing()
```

| Metric | Improvement |
| --- | --- |
| Batch Processing | More memory efficient |
| Memory Usage | Reduced peak usage |
| Quality Impact | None |

**How it works:** Processes VAE encoding/decoding in slices along the batch dimension, reducing peak memory usage when processing multiple images.


## Software Stack

### Dependencies

| Package | Version | Purpose |
| --- | --- | --- |
| diffusers | Latest (git) | Diffusion model pipelines |
| transformers | ≥4.44.0 | Text encoders, tokenizers |
| accelerate | ≥0.33.0 | Device management, optimization |
| torchao | ≥0.5.0 | FP8 quantization |
| sentencepiece | Latest | Tokenization |
| gradio | Latest | Web UI framework |
| spaces | Latest | ZeroGPU integration |
| torch | 2.8.0+cu128 | Deep learning framework |
| Pillow (PIL) | Latest | Image processing |

### Runtime Environment

| Component | Details |
| --- | --- |
| Python | 3.10 |
| CUDA | 12.8 |
| Platform | Hugging Face Spaces |
| SDK | Gradio |

## Application Features

### Generate Tab (Text-to-Image)

| Feature | Details |
| --- | --- |
| Pipeline | DiffusionPipeline |
| Input | Text prompt |
| Styles | 10 presets (None, Photorealistic, Cinematic, Anime, Digital Art, Oil Painting, Watercolor, 3D Render, Fantasy, Sci-Fi) |
| Aspect Ratios | 18 options (1024px to 2048px) |
| Steps | 4-16 (default: 8) |
| Seed Control | Manual or random |
| Output Format | PNG |
| Share | Hugging Face CDN upload |
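Putting the tab's controls together, the handler plausibly looks like the following sketch. The function name, defaults, and the `pipe` object (from the loading sketch earlier) are assumptions, not the Space's verbatim code:

```python
import torch

def text_to_image(prompt: str, width: int = 1024, height: int = 1024,
                  steps: int = 8, seed: int | None = None):
    # Hypothetical handler mirroring the Generate tab's controls.
    if seed is None:
        # Random seed, returned alongside the image for reproducibility.
        seed = int(torch.randint(0, 2**32 - 1, (1,)).item())
    generator = torch.Generator(device="cuda").manual_seed(seed)
    image = pipe(
        prompt,
        width=width,
        height=height,
        num_inference_steps=steps,  # 4-16 in the UI, default 8
        guidance_scale=0.0,
        generator=generator,
    ).images[0]
    return image, seed
```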

### Transform Tab (Image-to-Image)

| Feature | Details |
| --- | --- |
| Pipeline | ZImageImg2ImgPipeline |
| Input | Image upload + text prompt |
| Strength | 0.1-1.0 (transformation intensity) |
| Styles | Same 10 presets |
| Auto-Resize | Supports 512-2048px (dimensions snapped to multiples of 16; see the sketch below) |
| Steps | 4-16 (default: 8) |
| Output Format | PNG |
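The auto-resize behavior can be sketched as a small helper. This is a hypothetical implementation of the constraint above, not the Space's actual code:

```python
from PIL import Image

def auto_resize(img: Image.Image, lo: int = 512, hi: int = 2048, step: int = 16) -> Image.Image:
    # Hypothetical helper: scale the image so its sides fall within [lo, hi],
    # then snap each side down to the nearest multiple of 16.
    w, h = img.size
    scale = 1.0
    if max(w, h) > hi:
        scale = hi / max(w, h)
    elif min(w, h) < lo:
        scale = lo / min(w, h)
    w, h = (max(step, int(d * scale) // step * step) for d in (w, h))
    return img.resize((w, h), Image.LANCZOS)
```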

### Supported Resolutions

| Category | Resolutions |
| --- | --- |
| Standard | 1024x1024, 1344x768, 768x1344, 1152x896, 896x1152, 1536x640, 1216x832, 832x1216 |
| XL | 1536x1536, 1920x1088, 1088x1920, 1536x1152, 1152x1536 |
| MAX | 2048x2048, 2048x1152, 1152x2048, 2048x1536, 1536x2048 |

## UI/UX Design

### Theme

- Color Scheme: Blue gradient (#e8f4fc to #d4e9f7)
- Primary Color: #2563eb (buttons, active elements)
- Secondary Color: #3b82f6 (accents)
- Background: Light blue gradient
- Cards: White with subtle shadows

### Components

- Centered header with lightning bolt icon
- Tabbed interface (Generate / Transform)
- Two-column layout (controls | output), sketched below
- Example prompts with one-click loading
- Share button for CDN uploads
- Copy-to-clipboard for image links
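In Gradio terms, this layout corresponds roughly to the following Blocks skeleton (a minimal sketch; labels, stub handler, and wiring are assumptions):

```python
import gradio as gr

def generate_stub(prompt, steps):
    raise NotImplementedError  # the real handler calls the diffusion pipeline

with gr.Blocks(title="Z-Image Turbo") as demo:
    gr.Markdown("# ⚡ Z-Image Turbo")  # centered header with lightning bolt icon
    with gr.Tab("🎨 Generate"):
        with gr.Row():
            with gr.Column():  # left column: controls
                prompt = gr.Textbox(label="Prompt")
                steps = gr.Slider(4, 16, value=8, step=1, label="Steps")
                run = gr.Button("Generate", variant="primary")
            with gr.Column():  # right column: output
                out = gr.Image(label="Result", type="pil")
        run.click(fn=generate_stub, inputs=[prompt, steps], outputs=out)
    with gr.Tab("✨ Transform"):
        gr.Markdown("Image upload, strength slider, styles, etc.")

# demo.launch()
```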

## Performance Benchmarks

### Generation Speed (1024x1024, 8 steps)

| Configuration | Time |
| --- | --- |
| Baseline (BF16 only) | ~5-6 seconds |
| With all optimizations | ~3-4 seconds |
| Improvement | ~2 seconds faster (~40%) |

### Memory Usage

| Configuration | VRAM |
| --- | --- |
| Baseline (BF16) | ~12GB |
| With FP8 quantization | ~6GB |
| Reduction | ~50% |

## Architecture Diagram

```
┌──────────────────────────────────────────────────────────┐
│                   Gradio Web Interface                   │
│  ┌──────────────────────┐  ┌──────────────────────────┐  │
│  │   🎨 Generate Tab    │  │    ✨ Transform Tab      │  │
│  │  - Prompt input      │  │  - Image upload          │  │
│  │  - Style selector    │  │  - Transformation prompt │  │
│  │  - Aspect ratio      │  │  - Strength slider       │  │
│  │  - Steps/Seed        │  │  - Style/Steps/Seed      │  │
│  └──────────────────────┘  └──────────────────────────┘  │
└──────────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────┐
│                   ZeroGPU (@spaces.GPU)                  │
│  ┌────────────────────────────────────────────────────┐  │
│  │              NVIDIA H200 (70GB VRAM)               │  │
│  │  ┌──────────────────┐  ┌─────────────────────────┐ │  │
│  │  │  pipe_t2i        │  │  pipe_i2i               │ │  │
│  │  │  (Text-to-Img)   │  │  (Img-to-Img)           │ │  │
│  │  └────────┬─────────┘  └───────────┬─────────────┘ │  │
│  │           │                        │               │  │
│  │           ▼                        ▼               │  │
│  │  ┌──────────────────────────────────────────────┐  │  │
│  │  │          Z-Image Transformer (6B)            │  │  │
│  │  │  ┌────────────────────────────────────────┐  │  │  │
│  │  │  │  FP8 Quantized (torchao)               │  │  │  │
│  │  │  │  FlashAttention-2 (SDPA backend)       │  │  │  │
│  │  │  └────────────────────────────────────────┘  │  │  │
│  │  └──────────────────────┬───────────────────────┘  │  │
│  │                         │                          │  │
│  │                         ▼                          │  │
│  │  ┌──────────────────────────────────────────────┐  │  │
│  │  │                 VAE Decoder                  │  │  │
│  │  │  ┌────────────────────────────────────────┐  │  │  │
│  │  │  │  Tiling enabled (large images)         │  │  │  │
│  │  │  │  Slicing enabled (memory efficient)    │  │  │  │
│  │  │  └────────────────────────────────────────┘  │  │  │
│  │  └──────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────┐
│                          Output                          │
│  - PNG image (full quality)                              │
│  - Seed value (reproducibility)                          │
│  - Optional: Hugging Face CDN share link                 │
└──────────────────────────────────────────────────────────┘
```

## Known Limitations

### torch.compile Incompatibility

The Z-Image transformer contains code patterns (e.g., `device = x[0].device`) that PyTorch's Dynamo tracer cannot handle, which currently rules out `torch.compile` as an additional speedup.

### FlashAttention-3

`FlashAttention3Processor` is not yet available in diffusers for the Z-Image architecture, so the application uses FlashAttention-2 via the SDPA backend instead.

### torchao Deprecation Warning

A deprecation warning appears for `float8_dynamic_activation_float8_weight`. It is cosmetic and does not affect functionality.


## Future Optimization Opportunities

1. Ahead-of-Time Compilation (AoTI) - once Z-Image becomes compatible with torch.compile
2. INT8 Quantization - an alternative to FP8 for broader hardware support
3. Model Sharding - for even larger batch processing
4. Speculative Decoding - a potential speedup for iterative generation
5. LoRA Support - custom style fine-tuning

## Credits

- Model: Alibaba Tongyi-MAI Team (Z-Image-Turbo)
- Infrastructure: Hugging Face (Spaces, ZeroGPU, Diffusers)
- Optimizations: PyTorch Team (SDPA, torchao)
- Application: Built with Gradio

## License

- Model: Apache 2.0
- Application Code: MIT
- Dependencies: Various open-source licenses

This report documents the technical implementation of Z-Image Turbo v15 as deployed on Hugging Face Spaces.