# Z-Image Turbo - Technical Stack Report
**Version:** 15.0
**Last Updated:** December 2025
**Space URL:** https://huggingface.co/spaces/lulavc/Z-Image-Turbo
---
## Overview
Z-Image Turbo is a high-performance AI image generation and transformation application built on Hugging Face Spaces. It leverages the Z-Image-Turbo model from Alibaba's Tongyi-MAI team with multiple performance optimizations for fast, high-quality image synthesis.
---
## Core Model
### Z-Image-Turbo (Tongyi-MAI)
| Specification | Details |
|---------------|---------|
| **Model Name** | `Tongyi-MAI/Z-Image-Turbo` |
| **Architecture** | Scalable Single-Stream Diffusion Transformer (S3-DiT) |
| **Parameters** | 6 Billion |
| **License** | Apache 2.0 |
| **Precision** | BFloat16 |
| **Inference Steps** | 8 (optimized distilled model) |
| **Guidance Scale** | 0.0 (classifier-free guidance disabled; baked in by distillation) |
### Key Model Features
- **Sub-second latency** on enterprise GPUs
- **Photorealistic image generation** with exceptional detail
- **Bilingual text rendering** (English & Chinese)
- **Distilled architecture** for fast inference without quality loss
- **Consumer GPU compatible** (<16GB VRAM)
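For orientation, a minimal loading sketch using the generic `diffusers` interface. The optimizations described later in this report are layered on top of this; argument names for the Z-Image pipeline may differ slightly, so treat this as an assumption-laden sketch rather than the Space's actual code.

```python
import torch
from diffusers import DiffusionPipeline

# Load the distilled model in BF16 (generic diffusers loading API)
pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
).to("cuda")

# 8 steps and guidance_scale=0.0 match the distilled configuration above
image = pipe(
    "a photorealistic mountain lake at golden hour",
    num_inference_steps=8,
    guidance_scale=0.0,
).images[0]
image.save("output.png")
```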
---
## Hardware Infrastructure
### ZeroGPU (Hugging Face Spaces)
| Specification | Details |
|---------------|---------|
| **GPU** | NVIDIA H200 |
| **VRAM** | 70GB per workload |
| **Compute Capability** | 9.0 |
| **Allocation** | Dynamic (on-demand) |
| **Tensor Packing** | ~28.7GB |
### Benefits
- Free GPU access for demos
- Dynamic allocation reduces idle costs
- H200 enables advanced optimizations (FP8, FlashAttention-2)
- No dedicated GPU management required
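GPU time is requested per call rather than held permanently. A minimal sketch of the usual ZeroGPU pattern, assuming the pipeline `pipe` has already been loaded as above; the `@spaces.GPU` decorator is the one referenced in the architecture diagram below, while the `duration` value and function body here are illustrative.

```python
import spaces
import torch

@spaces.GPU(duration=60)  # attach a ZeroGPU H200 slice only while this call runs
def generate(prompt: str, steps: int = 8, seed: int = 0):
    generator = torch.Generator("cuda").manual_seed(seed)
    return pipe(
        prompt,
        num_inference_steps=steps,
        guidance_scale=0.0,
        generator=generator,
    ).images[0]
```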
---
## Performance Optimizations
### 1. FP8 Dynamic Quantization (torchao)
```python
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

# Quantize the transformer's linear layers in place: FP8 weights + FP8 dynamic activations
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
```
| Metric | Improvement |
|--------|-------------|
| **Inference Speed** | 30-50% faster |
| **Memory Usage** | ~50% reduction |
| **Quality Impact** | Minimal (imperceptible) |
**How it works:** Quantizes transformer weights and activations to FP8 format dynamically during inference, reducing memory bandwidth requirements and enabling faster matrix operations on H200's FP8 tensor cores.
---
### 2. FlashAttention-2 via SDPA
```python
# Let PyTorch's SDPA dispatch to FlashAttention / memory-efficient kernels when supported
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
```
| Metric | Improvement |
|--------|-------------|
| **Attention Speed** | 2-4x faster |
| **Memory Usage** | O(n) instead of O(n²) |
| **Quality Impact** | None (mathematically equivalent) |
**How it works:** PyTorch's Scaled Dot-Product Attention (SDPA) backend automatically uses FlashAttention-2 on compatible hardware (H200), computing attention without materializing the full attention matrix.
---
### 3. cuDNN Auto-Tuning
```python
torch.backends.cudnn.benchmark = True
```
| Metric | Improvement |
|--------|-------------|
| **Convolution Speed** | 5-15% faster |
| **First Run** | Slightly slower (tuning) |
| **Subsequent Runs** | Optimized kernels cached |
**How it works:** Enables cuDNN's auto-tuner to find the fastest convolution algorithms for the specific input sizes and hardware configuration.
---
### 4. VAE Tiling
```python
pipe.vae.enable_tiling()
```
| Metric | Improvement |
|--------|-------------|
| **Max Resolution** | Limited only by available VRAM (enables 2K+ output) |
| **Memory Usage** | Significantly reduced for large images |
| **Quality Impact** | Minimal (potential tile boundaries) |
**How it works:** Processes large images in tiles rather than all at once, enabling generation of high-resolution images (2K+) without running out of VRAM.
---
### 5. VAE Slicing
```python
pipe.vae.enable_slicing()
```
| Metric | Improvement |
|--------|-------------|
| **Batch Processing** | More memory efficient |
| **Memory Usage** | Reduced peak usage |
| **Quality Impact** | None |
**How it works:** Processes VAE encoding/decoding in slices along the batch dimension, reducing peak memory usage when processing multiple images.
---
## Software Stack
### Dependencies
| Package | Version | Purpose |
|---------|---------|---------|
| `diffusers` | Latest (git) | Diffusion model pipelines |
| `transformers` | β‰₯4.44.0 | Text encoders, tokenizers |
| `accelerate` | β‰₯0.33.0 | Device management, optimization |
| `torchao` | β‰₯0.5.0 | FP8 quantization |
| `sentencepiece` | Latest | Tokenization |
| `gradio` | Latest | Web UI framework |
| `spaces` | Latest | ZeroGPU integration |
| `torch` | 2.8.0+cu128 | Deep learning framework |
| `PIL/Pillow` | Latest | Image processing |
### Runtime Environment
| Component | Details |
|-----------|---------|
| **Python** | 3.10 |
| **CUDA** | 12.8 |
| **Platform** | Hugging Face Spaces |
| **SDK** | Gradio |
---
## Application Features
### Generate Tab (Text-to-Image)
| Feature | Details |
|---------|---------|
| **Pipelines** | `DiffusionPipeline` |
| **Input** | Text prompt |
| **Styles** | 10 presets (None, Photorealistic, Cinematic, Anime, Digital Art, Oil Painting, Watercolor, 3D Render, Fantasy, Sci-Fi) |
| **Aspect Ratios** | 18 presets (see Supported Resolutions below) |
| **Steps** | 4-16 (default: 8) |
| **Seed Control** | Manual or random |
| **Output Format** | PNG |
| **Share** | HuggingFace CDN upload |
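A hedged sketch of how a Generate request could map onto a pipeline call. The style dictionary (only two of the ten presets shown), helper name, and seed handling are illustrative, not the Space's actual code; `pipe_t2i` is the text-to-image pipeline named in the architecture diagram.

```python
import random
import torch

# Illustrative assumption: style presets modelled as prompt suffixes
STYLES = {"None": "", "Photorealistic": ", photorealistic, highly detailed"}

def generate_t2i(prompt, style="None", width=1024, height=1024, steps=8, seed=None):
    if seed is None:
        seed = random.randint(0, 2**32 - 1)   # random seed when none is supplied
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe_t2i(                         # pipe_t2i: the text-to-image DiffusionPipeline
        prompt + STYLES.get(style, ""),
        width=width,
        height=height,
        num_inference_steps=steps,
        guidance_scale=0.0,
        generator=generator,
    ).images[0]
    return image, seed                        # seed is returned for reproducibility
```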
### Transform Tab (Image-to-Image)
| Feature | Details |
|---------|---------|
| **Pipeline** | `ZImageImg2ImgPipeline` |
| **Input** | Image upload + text prompt |
| **Strength** | 0.1-1.0 (transformation intensity) |
| **Styles** | Same 10 presets |
| **Auto-Resize** | Inputs resized to 512-2048px, dimensions rounded to a multiple of 16 |
| **Steps** | 4-16 (default: 8) |
| **Output Format** | PNG |
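A sketch of the Transform flow, assuming `ZImageImg2ImgPipeline` follows the usual diffusers img2img signature (`prompt`, `image`, `strength`); the resize helper and function names are illustrative.

```python
import torch
from PIL import Image

def resize_for_model(img: Image.Image, lo=512, hi=2048, mult=16) -> Image.Image:
    # Clamp each side to the supported range and round down to a multiple of 16
    w = max(lo, min(hi, img.width)) // mult * mult
    h = max(lo, min(hi, img.height)) // mult * mult
    return img.resize((w, h), Image.LANCZOS)

def transform(img, prompt, strength=0.6, steps=8, seed=0):
    generator = torch.Generator("cuda").manual_seed(seed)
    return pipe_i2i(                          # pipe_i2i: the ZImageImg2ImgPipeline instance
        prompt=prompt,
        image=resize_for_model(img),
        strength=strength,
        num_inference_steps=steps,
        guidance_scale=0.0,
        generator=generator,
    ).images[0]
```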
### Supported Resolutions
| Category | Resolutions |
|----------|-------------|
| **Standard** | 1024x1024, 1344x768, 768x1344, 1152x896, 896x1152, 1536x640, 1216x832, 832x1216 |
| **XL** | 1536x1536, 1920x1088, 1088x1920, 1536x1152, 1152x1536 |
| **MAX** | 2048x2048, 2048x1152, 1152x2048, 2048x1536, 1536x2048 |
---
## UI/UX Design
### Theme
- **Color Scheme:** Blue gradient (#e8f4fc to #d4e9f7)
- **Primary Color:** #2563eb (buttons, active elements)
- **Secondary Color:** #3b82f6 (accents)
- **Background:** Light blue gradient
- **Cards:** White with subtle shadows
### Components
- Centered header with lightning bolt icon
- Tabbed interface (Generate / Transform)
- Two-column layout (controls | output)
- Example prompts with one-click loading
- Share button for CDN uploads
- Copy-to-clipboard for image links
---
## Performance Benchmarks
### Generation Speed (1024x1024, 8 steps)
| Configuration | Time |
|---------------|------|
| **Baseline (BF16 only)** | ~5-6 seconds |
| **With All Optimizations** | ~3-4 seconds |
| **Improvement** | ~2 seconds faster (~40%) |
### Memory Usage
| Configuration | VRAM |
|---------------|------|
| **Baseline (BF16)** | ~12GB |
| **With FP8 Quantization** | ~6GB |
| **Reduction** | ~50% |
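These figures can be reproduced approximately with standard PyTorch instrumentation; a minimal sketch (exact values vary with resolution, step count, and allocator state):

```python
import time
import torch

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()

image = pipe("benchmark prompt", num_inference_steps=8, guidance_scale=0.0).images[0]

torch.cuda.synchronize()
elapsed = time.perf_counter() - start
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"generation: {elapsed:.2f} s, peak VRAM: {peak_gb:.1f} GB")
```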
---
## Architecture Diagram
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     Gradio Web Interface                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ 🎨 Generate Tab          β”‚  β”‚ ✨ Transform Tab         β”‚  β”‚
β”‚  β”‚ - Prompt input           β”‚  β”‚ - Image upload           β”‚  β”‚
β”‚  β”‚ - Style selector         β”‚  β”‚ - Transformation prompt  β”‚  β”‚
β”‚  β”‚ - Aspect ratio           β”‚  β”‚ - Strength slider        β”‚  β”‚
β”‚  β”‚ - Steps/Seed             β”‚  β”‚ - Style/Steps/Seed       β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    ZeroGPU (@spaces.GPU)                     β”‚
β”‚                   NVIDIA H200 (70GB VRAM)                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ pipe_t2i                 β”‚  β”‚ pipe_i2i                 β”‚  β”‚
β”‚  β”‚ (Text-to-Img)            β”‚  β”‚ (Img-to-Img)             β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚               β”‚                             β”‚                β”‚
β”‚               β–Ό                             β–Ό                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Z-Image Transformer (6B)                               β”‚  β”‚
β”‚  β”‚  - FP8 quantized (torchao)                             β”‚  β”‚
β”‚  β”‚  - FlashAttention-2 (SDPA backend)                     β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                              β–Ό                               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ VAE Decoder                                            β”‚  β”‚
β”‚  β”‚  - Tiling enabled (large images)                       β”‚  β”‚
β”‚  β”‚  - Slicing enabled (memory efficient)                  β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Output                                                       β”‚
β”‚ - PNG image (full quality)                                   β”‚
β”‚ - Seed value (reproducibility)                               β”‚
β”‚ - Optional: HuggingFace CDN share link                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
---
## Known Limitations
### torch.compile Incompatibility
The Z-Image transformer contains code patterns (`device = x[0].device`) that are incompatible with PyTorch's dynamo tracer. This prevents using `torch.compile` for additional speedup.
### FlashAttention-3
`FlashAttention3Processor` is not yet available in diffusers for the Z-Image architecture. The application uses FlashAttention-2 via SDPA backend instead.
### torchao Version Warning
A deprecation warning appears for `float8_dynamic_activation_float8_weight`. This is cosmetic and doesn't affect functionality.
---
## Future Optimization Opportunities
1. **Ahead-of-Time Compilation (AoTI)** - When Z-Image becomes compatible with torch.compile
2. **INT8 Quantization** - Alternative to FP8 for broader hardware support
3. **Model Sharding** - For even larger batch processing
4. **Speculative Decoding** - Potential speedup for iterative generation
5. **LoRA Support** - Custom style fine-tuning
---
## Credits
- **Model:** Alibaba Tongyi-MAI Team (Z-Image-Turbo)
- **Infrastructure:** Hugging Face (Spaces, ZeroGPU, Diffusers)
- **Optimizations:** PyTorch Team (SDPA, torchao)
- **Application:** Built with Gradio
---
## License
- **Model:** Apache 2.0
- **Application Code:** MIT
- **Dependencies:** Various open-source licenses
---
*This report documents the technical implementation of Z-Image Turbo v15 as deployed on Hugging Face Spaces.*