# Z-Image Turbo - Technical Stack Report
**Version:** 15.0
**Last Updated:** December 2025
**Space URL:** https://huggingface.co/spaces/lulavc/Z-Image-Turbo
---
## Overview
Z-Image Turbo is a high-performance AI image generation and transformation application built on Hugging Face Spaces. It leverages the Z-Image-Turbo model from Alibaba's Tongyi-MAI team with multiple performance optimizations for fast, high-quality image synthesis.
---
## Core Model
### Z-Image-Turbo (Tongyi-MAI)
| Specification | Details |
|---------------|---------|
| **Model Name** | `Tongyi-MAI/Z-Image-Turbo` |
| **Architecture** | Scalable Single-Stream Diffusion Transformer (S3-DiT) |
| **Parameters** | 6 Billion |
| **License** | Apache 2.0 |
| **Precision** | BFloat16 |
| **Inference Steps** | 8 (optimized distilled model) |
| **Guidance Scale** | 0.0 (classifier-free guidance disabled; baked in by distillation) |
### Key Model Features
- **Sub-second latency** on enterprise GPUs
- **Photorealistic image generation** with exceptional detail
- **Bilingual text rendering** (English & Chinese)
- **Distilled architecture** for fast inference without quality loss
- **Consumer GPU compatible** (<16GB VRAM)
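For orientation, a minimal loading sketch using the generic `diffusers` interface. The optimizations described later in this report are layered on top of this; argument names for the Z-Image pipeline may differ slightly, so treat this as an assumption-laden sketch rather than the Space's actual code.

```python
import torch
from diffusers import DiffusionPipeline

# Load the distilled model in BF16 (generic diffusers loading API)
pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
).to("cuda")

# 8 steps and guidance_scale=0.0 match the distilled configuration above
image = pipe(
    "a photorealistic mountain lake at golden hour",
    num_inference_steps=8,
    guidance_scale=0.0,
).images[0]
image.save("output.png")
```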
---
## Hardware Infrastructure
### ZeroGPU (Hugging Face Spaces)
| Specification | Details |
|---------------|---------|
| **GPU** | NVIDIA H200 |
| **VRAM** | 70GB per workload |
| **Compute Capability** | 9.0 |
| **Allocation** | Dynamic (on-demand) |
| **Tensor Packing** | ~28.7GB |
### Benefits
- Free GPU access for demos
- Dynamic allocation reduces idle costs
- H200 enables advanced optimizations (FP8, FlashAttention-2)
- No dedicated GPU management required
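GPU time is requested per call rather than held permanently. A minimal sketch of the usual ZeroGPU pattern, assuming the pipeline `pipe` has already been loaded as above; the `@spaces.GPU` decorator is the one referenced in the architecture diagram below, while the `duration` value and function body here are illustrative.

```python
import spaces
import torch

@spaces.GPU(duration=60)  # attach a ZeroGPU H200 slice only while this call runs
def generate(prompt: str, steps: int = 8, seed: int = 0):
    generator = torch.Generator("cuda").manual_seed(seed)
    return pipe(
        prompt,
        num_inference_steps=steps,
        guidance_scale=0.0,
        generator=generator,
    ).images[0]
```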
---
## Performance Optimizations
### 1. FP8 Dynamic Quantization (torchao)
```python
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

# Quantize the transformer's linear layers in place: FP8 weights + FP8 dynamic activations
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
```
| Metric | Improvement |
|--------|-------------|
| **Inference Speed** | 30-50% faster |
| **Memory Usage** | ~50% reduction |
| **Quality Impact** | Minimal (imperceptible) |
**How it works:** Quantizes transformer weights and activations to FP8 format dynamically during inference, reducing memory bandwidth requirements and enabling faster matrix operations on H200's FP8 tensor cores.
---
### 2. FlashAttention-2 via SDPA
```python
# Let PyTorch's SDPA dispatch to FlashAttention / memory-efficient kernels when supported
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
```
| Metric | Improvement |
|--------|-------------|
| **Attention Speed** | 2-4x faster |
| **Memory Usage** | O(n) instead of O(n²) |
| **Quality Impact** | None (mathematically equivalent) |
**How it works:** PyTorch's Scaled Dot-Product Attention (SDPA) backend automatically uses FlashAttention-2 on compatible hardware (H200), computing attention without materializing the full attention matrix.
---
### 3. cuDNN Auto-Tuning
```python
torch.backends.cudnn.benchmark = True
```
| Metric | Improvement |
|--------|-------------|
| **Convolution Speed** | 5-15% faster |
| **First Run** | Slightly slower (tuning) |
| **Subsequent Runs** | Optimized kernels cached |
**How it works:** Enables cuDNN's auto-tuner to find the fastest convolution algorithms for the specific input sizes and hardware configuration.
---
### 4. VAE Tiling
```python
pipe.vae.enable_tiling()
```
| Metric | Improvement |
|--------|-------------|
| **Max Resolution** | Limited only by available VRAM (enables 2K+ output) |
| **Memory Usage** | Significantly reduced for large images |
| **Quality Impact** | Minimal (potential tile boundaries) |
**How it works:** Processes large images in tiles rather than all at once, enabling generation of high-resolution images (2K+) without running out of VRAM.
---
### 5. VAE Slicing
```python
pipe.vae.enable_slicing()
```
| Metric | Improvement |
|--------|-------------|
| **Batch Processing** | More memory efficient |
| **Memory Usage** | Reduced peak usage |
| **Quality Impact** | None |
**How it works:** Processes VAE encoding/decoding in slices along the batch dimension, reducing peak memory usage when processing multiple images.
---
## Software Stack
### Dependencies
| Package | Version | Purpose |
|---------|---------|---------|
| `diffusers` | Latest (git) | Diffusion model pipelines |
| `transformers` | β‰₯4.44.0 | Text encoders, tokenizers |
| `accelerate` | β‰₯0.33.0 | Device management, optimization |
| `torchao` | β‰₯0.5.0 | FP8 quantization |
| `sentencepiece` | Latest | Tokenization |
| `gradio` | Latest | Web UI framework |
| `spaces` | Latest | ZeroGPU integration |
| `torch` | 2.8.0+cu128 | Deep learning framework |
| `PIL/Pillow` | Latest | Image processing |
### Runtime Environment
| Component | Details |
|-----------|---------|
| **Python** | 3.10 |
| **CUDA** | 12.8 |
| **Platform** | Hugging Face Spaces |
| **SDK** | Gradio |
---
## Application Features
### Generate Tab (Text-to-Image)
| Feature | Details |
|---------|---------|
| **Pipelines** | `DiffusionPipeline` |
| **Input** | Text prompt |
| **Styles** | 10 presets (None, Photorealistic, Cinematic, Anime, Digital Art, Oil Painting, Watercolor, 3D Render, Fantasy, Sci-Fi) |
| **Aspect Ratios** | 18 presets (see Supported Resolutions below) |
| **Steps** | 4-16 (default: 8) |
| **Seed Control** | Manual or random |
| **Output Format** | PNG |
| **Share** | HuggingFace CDN upload |
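A hedged sketch of how a Generate request could map onto a pipeline call. The style dictionary (only two of the ten presets shown), helper name, and seed handling are illustrative, not the Space's actual code; `pipe_t2i` is the text-to-image pipeline named in the architecture diagram.

```python
import random
import torch

# Illustrative assumption: style presets modelled as prompt suffixes
STYLES = {"None": "", "Photorealistic": ", photorealistic, highly detailed"}

def generate_t2i(prompt, style="None", width=1024, height=1024, steps=8, seed=None):
    if seed is None:
        seed = random.randint(0, 2**32 - 1)   # random seed when none is supplied
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe_t2i(                         # pipe_t2i: the text-to-image DiffusionPipeline
        prompt + STYLES.get(style, ""),
        width=width,
        height=height,
        num_inference_steps=steps,
        guidance_scale=0.0,
        generator=generator,
    ).images[0]
    return image, seed                        # seed is returned for reproducibility
```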
### Transform Tab (Image-to-Image)
| Feature | Details |
|---------|---------|
| **Pipeline** | `ZImageImg2ImgPipeline` |
| **Input** | Image upload + text prompt |
| **Strength** | 0.1-1.0 (transformation intensity) |
| **Styles** | Same 10 presets |
| **Auto-Resize** | Inputs resized to 512-2048px, dimensions rounded to a multiple of 16 |
| **Steps** | 4-16 (default: 8) |
| **Output Format** | PNG |
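A sketch of the Transform flow, assuming `ZImageImg2ImgPipeline` follows the usual diffusers img2img signature (`prompt`, `image`, `strength`); the resize helper and function names are illustrative.

```python
import torch
from PIL import Image

def resize_for_model(img: Image.Image, lo=512, hi=2048, mult=16) -> Image.Image:
    # Clamp each side to the supported range and round down to a multiple of 16
    w = max(lo, min(hi, img.width)) // mult * mult
    h = max(lo, min(hi, img.height)) // mult * mult
    return img.resize((w, h), Image.LANCZOS)

def transform(img, prompt, strength=0.6, steps=8, seed=0):
    generator = torch.Generator("cuda").manual_seed(seed)
    return pipe_i2i(                          # pipe_i2i: the ZImageImg2ImgPipeline instance
        prompt=prompt,
        image=resize_for_model(img),
        strength=strength,
        num_inference_steps=steps,
        guidance_scale=0.0,
        generator=generator,
    ).images[0]
```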
### Supported Resolutions
| Category | Resolutions |
|----------|-------------|
| **Standard** | 1024x1024, 1344x768, 768x1344, 1152x896, 896x1152, 1536x640, 1216x832, 832x1216 |
| **XL** | 1536x1536, 1920x1088, 1088x1920, 1536x1152, 1152x1536 |
| **MAX** | 2048x2048, 2048x1152, 1152x2048, 2048x1536, 1536x2048 |
---
## UI/UX Design
### Theme
- **Color Scheme:** Blue gradient (#e8f4fc to #d4e9f7)
- **Primary Color:** #2563eb (buttons, active elements)
- **Secondary Color:** #3b82f6 (accents)
- **Background:** Light blue gradient
- **Cards:** White with subtle shadows
### Components
- Centered header with lightning bolt icon
- Tabbed interface (Generate / Transform)
- Two-column layout (controls | output)
- Example prompts with one-click loading
- Share button for CDN uploads
- Copy-to-clipboard for image links
---
## Performance Benchmarks
### Generation Speed (1024x1024, 8 steps)
| Configuration | Time |
|---------------|------|
| **Baseline (BF16 only)** | ~5-6 seconds |
| **With All Optimizations** | ~3-4 seconds |
| **Improvement** | ~2 seconds faster (~40%) |
### Memory Usage
| Configuration | VRAM |
|---------------|------|
| **Baseline (BF16)** | ~12GB |
| **With FP8 Quantization** | ~6GB |
| **Reduction** | ~50% |
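These figures can be reproduced approximately with standard PyTorch instrumentation; a minimal sketch (exact values vary with resolution, step count, and allocator state):

```python
import time
import torch

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()

image = pipe("benchmark prompt", num_inference_steps=8, guidance_scale=0.0).images[0]

torch.cuda.synchronize()
elapsed = time.perf_counter() - start
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"generation: {elapsed:.2f} s, peak VRAM: {peak_gb:.1f} GB")
```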
---
## Architecture Diagram
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     Gradio Web Interface                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ 🎨 Generate Tab          β”‚  β”‚ ✨ Transform Tab         β”‚  β”‚
β”‚  β”‚ - Prompt input           β”‚  β”‚ - Image upload           β”‚  β”‚
β”‚  β”‚ - Style selector         β”‚  β”‚ - Transformation prompt  β”‚  β”‚
β”‚  β”‚ - Aspect ratio           β”‚  β”‚ - Strength slider        β”‚  β”‚
β”‚  β”‚ - Steps/Seed             β”‚  β”‚ - Style/Steps/Seed       β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    ZeroGPU (@spaces.GPU)                     β”‚
β”‚                   NVIDIA H200 (70GB VRAM)                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ pipe_t2i                 β”‚  β”‚ pipe_i2i                 β”‚  β”‚
β”‚  β”‚ (Text-to-Img)            β”‚  β”‚ (Img-to-Img)             β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚               β”‚                             β”‚                β”‚
β”‚               β–Ό                             β–Ό                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Z-Image Transformer (6B)                               β”‚  β”‚
β”‚  β”‚  - FP8 quantized (torchao)                             β”‚  β”‚
β”‚  β”‚  - FlashAttention-2 (SDPA backend)                     β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                              β–Ό                               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ VAE Decoder                                            β”‚  β”‚
β”‚  β”‚  - Tiling enabled (large images)                       β”‚  β”‚
β”‚  β”‚  - Slicing enabled (memory efficient)                  β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Output                                                       β”‚
β”‚ - PNG image (full quality)                                   β”‚
β”‚ - Seed value (reproducibility)                               β”‚
β”‚ - Optional: HuggingFace CDN share link                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
---
## Known Limitations
### torch.compile Incompatibility
The Z-Image transformer contains code patterns (`device = x[0].device`) that are incompatible with PyTorch's dynamo tracer. This prevents using `torch.compile` for additional speedup.
### FlashAttention-3
`FlashAttention3Processor` is not yet available in diffusers for the Z-Image architecture. The application uses FlashAttention-2 via SDPA backend instead.
### torchao Version Warning
A deprecation warning appears for `float8_dynamic_activation_float8_weight`. This is cosmetic and doesn't affect functionality.
---
## Future Optimization Opportunities
1. **Ahead-of-Time Compilation (AoTI)** - When Z-Image becomes compatible with torch.compile
2. **INT8 Quantization** - Alternative to FP8 for broader hardware support
3. **Model Sharding** - For even larger batch processing
4. **Speculative Decoding** - Potential speedup for iterative generation
5. **LoRA Support** - Custom style fine-tuning
---
## Credits
- **Model:** Alibaba Tongyi-MAI Team (Z-Image-Turbo)
- **Infrastructure:** Hugging Face (Spaces, ZeroGPU, Diffusers)
- **Optimizations:** PyTorch Team (SDPA, torchao)
- **Application:** Built with Gradio
---
## License
- **Model:** Apache 2.0
- **Application Code:** MIT
- **Dependencies:** Various open-source licenses
---
*This report documents the technical implementation of Z-Image Turbo v15 as deployed on Hugging Face Spaces.*