# Z-Image Turbo - Technical Stack Report

**Version:** 15.0  
**Last Updated:** December 2025  
**Space URL:** https://huggingface.co/spaces/lulavc/Z-Image-Turbo

---

## Overview

Z-Image Turbo is a high-performance AI image generation and transformation application built on Hugging Face Spaces. It leverages the Z-Image-Turbo model from Alibaba's Tongyi-MAI team with multiple performance optimizations for fast, high-quality image synthesis.

---

## Core Model

### Z-Image-Turbo (Tongyi-MAI)

| Specification | Details |
|---------------|---------|
| **Model Name** | `Tongyi-MAI/Z-Image-Turbo` |
| **Architecture** | Scalable Single-Stream Diffusion Transformer (S3-DiT) |
| **Parameters** | 6 Billion |
| **License** | Apache 2.0 |
| **Precision** | BFloat16 |
| **Inference Steps** | 8 (optimized distilled model) |
| **Guidance Scale** | 0.0 (classifier-free guidance disabled; guidance is distilled into the model) |

### Key Model Features
- **Sub-second latency** on enterprise GPUs
- **Photorealistic image generation** with exceptional detail
- **Bilingual text rendering** (English & Chinese)
- **Distilled architecture** for fast inference without quality loss
- **Consumer GPU compatible** (<16GB VRAM)
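
A minimal load-and-generate sketch based on the settings above (assuming standard diffusers `from_pretrained` conventions; the deployed app's exact wiring may differ):

```python
import torch
from diffusers import DiffusionPipeline

# Load the 6B checkpoint in BFloat16, as specified above.
pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Distilled model: 8 steps, guidance scale 0.0.
image = pipe(
    "a red fox in fresh snow, golden hour",
    num_inference_steps=8,
    guidance_scale=0.0,
).images[0]
image.save("fox.png")
```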

---

## Hardware Infrastructure

### ZeroGPU (Hugging Face Spaces)

| Specification | Details |
|---------------|---------|
| **GPU** | NVIDIA H200 |
| **VRAM** | 70GB per workload |
| **Compute Capability** | 9.0 |
| **Allocation** | Dynamic (on-demand) |
| **Tensor Packing** | ~28.7GB |

### Benefits
- Free GPU access for demos
- Dynamic allocation reduces idle costs
- H200 enables advanced optimizations (FP8, FlashAttention-2)
- No dedicated GPU management required
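
On Spaces, the GPU is attached per call through the `spaces` decorator; a minimal sketch (the function body and duration are illustrative):

```python
import spaces

@spaces.GPU(duration=60)  # request an H200 slice for up to ~60s per call
def generate(prompt: str):
    # `pipe` is the globally loaded pipeline; GPU-bound work runs
    # inside the decorated function, where the allocated GPU is visible.
    return pipe(prompt, num_inference_steps=8, guidance_scale=0.0).images[0]
```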

---

## Performance Optimizations

### 1. FP8 Dynamic Quantization (torchao)

```python
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

# Quantize the transformer's linear layers in place: weights are stored
# in FP8 and activations are quantized dynamically at inference time.
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
```

| Metric | Improvement |
|--------|-------------|
| **Inference Speed** | 30-50% faster |
| **Memory Usage** | ~50% reduction |
| **Quality Impact** | Minimal (imperceptible) |

**How it works:** Quantizes transformer weights and activations to FP8 format dynamically during inference, reducing memory bandwidth requirements and enabling faster matrix operations on H200's FP8 tensor cores.

---

### 2. FlashAttention-2 via SDPA

```python
import torch

# Let scaled_dot_product_attention select the FlashAttention-2 kernel,
# with the memory-efficient kernel as a fallback.
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
```

| Metric | Improvement |
|--------|-------------|
| **Attention Speed** | 2-4x faster |
| **Memory Usage** | O(n) instead of O(n²) |
| **Quality Impact** | None (mathematically equivalent) |

**How it works:** PyTorch's Scaled Dot-Product Attention (SDPA) backend automatically uses FlashAttention-2 on compatible hardware (H200), computing attention without materializing the full attention matrix.
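
To verify that the flash kernel is actually eligible, the attention call can be pinned to one backend with PyTorch's `sdpa_kernel` context manager (an illustrative check, not part of the app; available since torch 2.3):

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(1, 16, 1024, 64, device="cuda", dtype=torch.bfloat16)

# Raises a RuntimeError if FlashAttention cannot run on this
# hardware/dtype, instead of silently using the math kernel.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
```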

---

### 3. cuDNN Auto-Tuning

```python
import torch

# Benchmark cuDNN algorithms per input shape and cache the fastest.
torch.backends.cudnn.benchmark = True
```

| Metric | Improvement |
|--------|-------------|
| **Convolution Speed** | 5-15% faster |
| **First Run** | Slightly slower (tuning) |
| **Subsequent Runs** | Optimized kernels cached |

**How it works:** Enables cuDNN's auto-tuner to find the fastest convolution algorithms for the specific input sizes and hardware configuration.
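
Since the tuning cost is paid on the first call per input shape, a startup warm-up keeps user-facing requests on the fast path; a sketch (the warm-up prompt and resolutions are arbitrary):

```python
import torch

torch.backends.cudnn.benchmark = True

# One throwaway pass per common resolution primes cuDNN's
# autotuner before real traffic arrives.
for w, h in [(1024, 1024), (1344, 768)]:
    pipe("warm-up", width=w, height=h, num_inference_steps=1)
torch.cuda.synchronize()
```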

---

### 4. VAE Tiling

```python
# Encode/decode the VAE in tiles to bound peak VRAM at high resolutions.
pipe.vae.enable_tiling()
```

| Metric | Improvement |
|--------|-------------|
| **Max Resolution** | No hard cap; bounded by total VRAM rather than by a single-pass decode |
| **Memory Usage** | Significantly reduced for large images |
| **Quality Impact** | Minimal (potential tile boundaries) |

**How it works:** Processes large images in tiles rather than all at once, enabling generation of high-resolution images (2K+) without running out of VRAM.

---

### 5. VAE Slicing

```python
# Decode the batch one sample at a time to cut peak memory.
pipe.vae.enable_slicing()
```

| Metric | Improvement |
|--------|-------------|
| **Batch Processing** | More memory efficient |
| **Memory Usage** | Reduced peak usage |
| **Quality Impact** | None |

**How it works:** Processes VAE encoding/decoding in slices along the batch dimension, reducing peak memory usage when processing multiple images.
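
Conceptually, slicing replaces one batched decode with a loop over single samples; a hand-rolled equivalent for illustration only (`enable_slicing()` does this internally):

```python
import torch

def decode_in_slices(vae, latents: torch.Tensor) -> torch.Tensor:
    # Decode one latent at a time so peak memory scales with a
    # single sample rather than the full batch.
    images = [vae.decode(latents[i : i + 1]).sample for i in range(latents.shape[0])]
    return torch.cat(images, dim=0)
```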

---

## Software Stack

### Dependencies

| Package | Version | Purpose |
|---------|---------|---------|
| `diffusers` | Latest (git) | Diffusion model pipelines |
| `transformers` | β‰₯4.44.0 | Text encoders, tokenizers |
| `accelerate` | β‰₯0.33.0 | Device management, optimization |
| `torchao` | β‰₯0.5.0 | FP8 quantization |
| `sentencepiece` | Latest | Tokenization |
| `gradio` | Latest | Web UI framework |
| `spaces` | Latest | ZeroGPU integration |
| `torch` | 2.8.0+cu128 | Deep learning framework |
| `PIL/Pillow` | Latest | Image processing |

### Runtime Environment

| Component | Details |
|-----------|---------|
| **Python** | 3.10 |
| **CUDA** | 12.8 |
| **Platform** | Hugging Face Spaces |
| **SDK** | Gradio |

---

## Application Features

### Generate Tab (Text-to-Image)

| Feature | Details |
|---------|---------|
| **Pipeline** | `DiffusionPipeline` |
| **Input** | Text prompt |
| **Styles** | 10 presets (None, Photorealistic, Cinematic, Anime, Digital Art, Oil Painting, Watercolor, 3D Render, Fantasy, Sci-Fi) |
| **Aspect Ratios** | 18 options (1024px to 2048px) |
| **Steps** | 4-16 (default: 8) |
| **Seed Control** | Manual or random |
| **Output Format** | PNG |
| **Share** | HuggingFace CDN upload |
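
Seed handling typically follows the pattern below (a sketch; the helper name is illustrative, not the app's actual code):

```python
import random
import torch

MAX_SEED = 2**32 - 1

def resolve_seed(seed: int | None, randomize: bool) -> tuple[torch.Generator, int]:
    # Draw a fresh seed when requested, then seed a generator so the
    # same (prompt, seed) pair reproduces the same image.
    if randomize or seed is None:
        seed = random.randint(0, MAX_SEED)
    return torch.Generator(device="cuda").manual_seed(seed), seed
```

The returned seed is surfaced next to the image so any result can be regenerated.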

### Transform Tab (Image-to-Image)

| Feature | Details |
|---------|---------|
| **Pipeline** | `ZImageImg2ImgPipeline` |
| **Input** | Image upload + text prompt |
| **Strength** | 0.1-1.0 (transformation intensity) |
| **Styles** | Same 10 presets |
| **Auto-Resize** | Clamps inputs to 512-2048px, rounded to a multiple of 16 |
| **Steps** | 4-16 (default: 8) |
| **Output Format** | PNG |
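
The auto-resize rule above reduces to a small clamp-and-snap helper (a sketch of the documented behavior, not the app's exact code):

```python
def auto_resize(width: int, height: int) -> tuple[int, int]:
    """Clamp each side to 512-2048px, then snap down to a multiple of 16."""
    def snap(x: int) -> int:
        return max(512, min(2048, x)) // 16 * 16
    return snap(width), snap(height)
```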

### Supported Resolutions

| Category | Resolutions |
|----------|-------------|
| **Standard** | 1024x1024, 1344x768, 768x1344, 1152x896, 896x1152, 1536x640, 1216x832, 832x1216 |
| **XL** | 1536x1536, 1920x1088, 1088x1920, 1536x1152, 1152x1536 |
| **MAX** | 2048x2048, 2048x1152, 1152x2048, 2048x1536, 1536x2048 |

---

## UI/UX Design

### Theme
- **Color Scheme:** Blue gradient (#e8f4fc to #d4e9f7)
- **Primary Color:** #2563eb (buttons, active elements)
- **Secondary Color:** #3b82f6 (accents)
- **Background:** Light blue gradient
- **Cards:** White with subtle shadows

### Components
- Centered header with lightning bolt icon
- Tabbed interface (Generate / Transform)
- Two-column layout (controls | output)
- Example prompts with one-click loading
- Share button for CDN uploads
- Copy-to-clipboard for image links

---

## Performance Benchmarks

### Generation Speed (1024x1024, 8 steps)

| Configuration | Time |
|---------------|------|
| **Baseline (BF16 only)** | ~5-6 seconds |
| **With All Optimizations** | ~3-4 seconds |
| **Improvement** | ~2 seconds faster (~40%) |

### Memory Usage

| Configuration | VRAM |
|---------------|------|
| **Baseline (BF16)** | ~12GB |
| **With FP8 Quantization** | ~6GB |
| **Reduction** | ~50% |

---

## Architecture Diagram

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Gradio Web Interface                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚   🎨 Generate Tab   β”‚  β”‚     ✨ Transform Tab        β”‚   β”‚
β”‚  β”‚  - Prompt input     β”‚  β”‚  - Image upload             β”‚   β”‚
β”‚  β”‚  - Style selector   β”‚  β”‚  - Transformation prompt    β”‚   β”‚
β”‚  β”‚  - Aspect ratio     β”‚  β”‚  - Strength slider          β”‚   β”‚
β”‚  β”‚  - Steps/Seed       β”‚  β”‚  - Style/Steps/Seed         β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    ZeroGPU (@spaces.GPU)                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚              NVIDIA H200 (70GB VRAM)                β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚    β”‚
β”‚  β”‚  β”‚  pipe_t2i       β”‚  β”‚  pipe_i2i               β”‚   β”‚    β”‚
β”‚  β”‚  β”‚  (Text-to-Img)  β”‚  β”‚  (Img-to-Img)           β”‚   β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚    β”‚
β”‚  β”‚           β”‚                        β”‚                β”‚    β”‚
β”‚  β”‚           β–Ό                        β–Ό                β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚    β”‚
β”‚  β”‚  β”‚         Z-Image Transformer (6B)            β”‚    β”‚    β”‚
β”‚  β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚    β”‚    β”‚
β”‚  β”‚  β”‚  β”‚  FP8 Quantized (torchao)            β”‚    β”‚    β”‚    β”‚
β”‚  β”‚  β”‚  β”‚  FlashAttention-2 (SDPA backend)    β”‚    β”‚    β”‚    β”‚
β”‚  β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚    β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚    β”‚
β”‚  β”‚                        β”‚                            β”‚    β”‚
β”‚  β”‚                        β–Ό                            β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚    β”‚
β”‚  β”‚  β”‚              VAE Decoder                    β”‚    β”‚    β”‚
β”‚  β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚    β”‚    β”‚
β”‚  β”‚  β”‚  β”‚  Tiling enabled (large images)      β”‚    β”‚    β”‚    β”‚
β”‚  β”‚  β”‚  β”‚  Slicing enabled (memory efficient) β”‚    β”‚    β”‚    β”‚
β”‚  β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚    β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      Output                                  β”‚
β”‚  - PNG image (full quality)                                 β”‚
β”‚  - Seed value (reproducibility)                             β”‚
β”‚  - Optional: HuggingFace CDN share link                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

---

## Known Limitations

### torch.compile Incompatibility
The Z-Image transformer contains code patterns (`device = x[0].device`) that are incompatible with PyTorch's dynamo tracer. This prevents using `torch.compile` for additional speedup.

### FlashAttention-3
`FlashAttention3Processor` is not yet available in diffusers for the Z-Image architecture. The application uses FlashAttention-2 via SDPA backend instead.

### torchao Version Warning
A deprecation warning appears for `float8_dynamic_activation_float8_weight`. This is cosmetic and doesn't affect functionality.

---

## Future Optimization Opportunities

1. **Ahead-of-Time Compilation (AoTI)** - When Z-Image becomes compatible with torch.compile
2. **INT8 Quantization** - Alternative to FP8 for broader hardware support
3. **Model Sharding** - For even larger batch processing
4. **Speculative Decoding** - Potential speedup for iterative generation
5. **LoRA Support** - Custom style fine-tuning

---

## Credits

- **Model:** Alibaba Tongyi-MAI Team (Z-Image-Turbo)
- **Infrastructure:** Hugging Face (Spaces, ZeroGPU, Diffusers)
- **Optimizations:** PyTorch Team (SDPA, torchao)
- **Application:** Built with Gradio

---

## License

- **Model:** Apache 2.0
- **Application Code:** MIT
- **Dependencies:** Various open-source licenses

---

*This report documents the technical implementation of Z-Image Turbo v15 as deployed on Hugging Face Spaces.*