# Z-Image Turbo - Technical Stack Report

**Version:** 15.0  
**Last Updated:** December 2025  
**Space URL:** https://huggingface.co/spaces/lulavc/Z-Image-Turbo

---

## Overview

Z-Image Turbo is a high-performance AI image generation and transformation application built on Hugging Face Spaces. It leverages the Z-Image-Turbo model from Alibaba's Tongyi-MAI team with multiple performance optimizations for fast, high-quality image synthesis.

---

## Core Model

### Z-Image-Turbo (Tongyi-MAI)

| Specification | Details |
|---------------|---------|
| **Model Name** | `Tongyi-MAI/Z-Image-Turbo` |
| **Architecture** | Scalable Single-Stream Diffusion Transformer (S3-DiT) |
| **Parameters** | 6 Billion |
| **License** | Apache 2.0 |
| **Precision** | BFloat16 |
| **Inference Steps** | 8 (optimized distilled model) |
| **Guidance Scale** | 0.0 (classifier-free guidance disabled; guidance is distilled into the model) |

### Key Model Features
- **Sub-second latency** on enterprise GPUs
- **Photorealistic image generation** with exceptional detail
- **Bilingual text rendering** (English & Chinese)
- **Distilled architecture** for fast inference without quality loss
- **Consumer GPU compatible** (<16GB VRAM)
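
A minimal load-and-generate sketch based on the settings above (assuming standard diffusers `from_pretrained` conventions; the deployed app's exact wiring may differ):

```python
import torch
from diffusers import DiffusionPipeline

# Load the 6B checkpoint in BFloat16, as specified above.
pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Distilled model: 8 steps, guidance scale 0.0.
image = pipe(
    "a red fox in fresh snow, golden hour",
    num_inference_steps=8,
    guidance_scale=0.0,
).images[0]
image.save("fox.png")
```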

---

## Hardware Infrastructure

### ZeroGPU (Hugging Face Spaces)

| Specification | Details |
|---------------|---------|
| **GPU** | NVIDIA H200 |
| **VRAM** | 70GB per workload |
| **Compute Capability** | 9.0 |
| **Allocation** | Dynamic (on-demand) |
| **Tensor Packing** | ~28.7GB |

### Benefits
- Free GPU access for demos
- Dynamic allocation reduces idle costs
- H200 enables advanced optimizations (FP8, FlashAttention-2)
- No dedicated GPU management required
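
On Spaces, the GPU is attached per call through the `spaces` decorator; a minimal sketch (the function body and duration are illustrative):

```python
import spaces

@spaces.GPU(duration=60)  # request an H200 slice for up to ~60s per call
def generate(prompt: str):
    # `pipe` is the globally loaded pipeline; GPU-bound work runs
    # inside the decorated function, where the allocated GPU is visible.
    return pipe(prompt, num_inference_steps=8, guidance_scale=0.0).images[0]
```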

---

## Performance Optimizations

### 1. FP8 Dynamic Quantization (torchao)

```python
from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight

# Quantize the transformer's linear layers in place: weights are stored
# in FP8 and activations are quantized dynamically at inference time.
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
```

| Metric | Improvement |
|--------|-------------|
| **Inference Speed** | 30-50% faster |
| **Memory Usage** | ~50% reduction |
| **Quality Impact** | Minimal (imperceptible) |

**How it works:** Quantizes transformer weights and activations to FP8 format dynamically during inference, reducing memory bandwidth requirements and enabling faster matrix operations on H200's FP8 tensor cores.

---

### 2. FlashAttention-2 via SDPA

```python
import torch

# Let scaled_dot_product_attention select the FlashAttention-2 kernel,
# with the memory-efficient kernel as a fallback.
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
```

| Metric | Improvement |
|--------|-------------|
| **Attention Speed** | 2-4x faster |
| **Memory Usage** | O(n) instead of O(n²) |
| **Quality Impact** | None (mathematically equivalent) |

**How it works:** PyTorch's Scaled Dot-Product Attention (SDPA) backend automatically uses FlashAttention-2 on compatible hardware (H200), computing attention without materializing the full attention matrix.
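
To verify that the flash kernel is actually eligible, the attention call can be pinned to one backend with PyTorch's `sdpa_kernel` context manager (an illustrative check, not part of the app; available since torch 2.3):

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(1, 16, 1024, 64, device="cuda", dtype=torch.bfloat16)

# Raises a RuntimeError if FlashAttention cannot run on this
# hardware/dtype, instead of silently using the math kernel.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
```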

---

### 3. cuDNN Auto-Tuning

```python
import torch

# Benchmark cuDNN algorithms per input shape and cache the fastest.
torch.backends.cudnn.benchmark = True
```

| Metric | Improvement |
|--------|-------------|
| **Convolution Speed** | 5-15% faster |
| **First Run** | Slightly slower (tuning) |
| **Subsequent Runs** | Optimized kernels cached |

**How it works:** Enables cuDNN's auto-tuner to find the fastest convolution algorithms for the specific input sizes and hardware configuration.
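
Since the tuning cost is paid on the first call per input shape, a startup warm-up keeps user-facing requests on the fast path; a sketch (the warm-up prompt and resolutions are arbitrary):

```python
import torch

torch.backends.cudnn.benchmark = True

# One throwaway pass per common resolution primes cuDNN's
# autotuner before real traffic arrives.
for w, h in [(1024, 1024), (1344, 768)]:
    pipe("warm-up", width=w, height=h, num_inference_steps=1)
torch.cuda.synchronize()
```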

---

### 4. VAE Tiling

```python
# Encode/decode the VAE in tiles to bound peak VRAM at high resolutions.
pipe.vae.enable_tiling()
```

| Metric | Improvement |
|--------|-------------|
| **Max Resolution** | No hard cap; bounded by total VRAM rather than by a single-pass decode |
| **Memory Usage** | Significantly reduced for large images |
| **Quality Impact** | Minimal (potential tile boundaries) |

**How it works:** Processes large images in tiles rather than all at once, enabling generation of high-resolution images (2K+) without running out of VRAM.

---

### 5. VAE Slicing

```python
# Decode the batch one sample at a time to cut peak memory.
pipe.vae.enable_slicing()
```

| Metric | Improvement |
|--------|-------------|
| **Batch Processing** | More memory efficient |
| **Memory Usage** | Reduced peak usage |
| **Quality Impact** | None |

**How it works:** Processes VAE encoding/decoding in slices along the batch dimension, reducing peak memory usage when processing multiple images.
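
Conceptually, slicing replaces one batched decode with a loop over single samples; a hand-rolled equivalent for illustration only (`enable_slicing()` does this internally):

```python
import torch

def decode_in_slices(vae, latents: torch.Tensor) -> torch.Tensor:
    # Decode one latent at a time so peak memory scales with a
    # single sample rather than the full batch.
    images = [vae.decode(latents[i : i + 1]).sample for i in range(latents.shape[0])]
    return torch.cat(images, dim=0)
```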

---

## Software Stack

### Dependencies

| Package | Version | Purpose |
|---------|---------|---------|
| `diffusers` | Latest (git) | Diffusion model pipelines |
| `transformers` | β‰₯4.44.0 | Text encoders, tokenizers |
| `accelerate` | β‰₯0.33.0 | Device management, optimization |
| `torchao` | β‰₯0.5.0 | FP8 quantization |
| `sentencepiece` | Latest | Tokenization |
| `gradio` | Latest | Web UI framework |
| `spaces` | Latest | ZeroGPU integration |
| `torch` | 2.8.0+cu128 | Deep learning framework |
| `PIL/Pillow` | Latest | Image processing |

### Runtime Environment

| Component | Details |
|-----------|---------|
| **Python** | 3.10 |
| **CUDA** | 12.8 |
| **Platform** | Hugging Face Spaces |
| **SDK** | Gradio |

---

## Application Features

### Generate Tab (Text-to-Image)

| Feature | Details |
|---------|---------|
| **Pipeline** | `DiffusionPipeline` |
| **Input** | Text prompt |
| **Styles** | 10 presets (None, Photorealistic, Cinematic, Anime, Digital Art, Oil Painting, Watercolor, 3D Render, Fantasy, Sci-Fi) |
| **Aspect Ratios** | 18 options (1024px to 2048px) |
| **Steps** | 4-16 (default: 8) |
| **Seed Control** | Manual or random |
| **Output Format** | PNG |
| **Share** | HuggingFace CDN upload |
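
Seed handling typically follows the pattern below (a sketch; the helper name is illustrative, not the app's actual code):

```python
import random
import torch

MAX_SEED = 2**32 - 1

def resolve_seed(seed: int | None, randomize: bool) -> tuple[torch.Generator, int]:
    # Draw a fresh seed when requested, then seed a generator so the
    # same (prompt, seed) pair reproduces the same image.
    if randomize or seed is None:
        seed = random.randint(0, MAX_SEED)
    return torch.Generator(device="cuda").manual_seed(seed), seed
```

The returned seed is surfaced next to the image so any result can be regenerated.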

### Transform Tab (Image-to-Image)

| Feature | Details |
|---------|---------|
| **Pipeline** | `ZImageImg2ImgPipeline` |
| **Input** | Image upload + text prompt |
| **Strength** | 0.1-1.0 (transformation intensity) |
| **Styles** | Same 10 presets |
| **Auto-Resize** | Clamps inputs to 512-2048px, rounded to a multiple of 16 |
| **Steps** | 4-16 (default: 8) |
| **Output Format** | PNG |
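
The auto-resize rule above reduces to a small clamp-and-snap helper (a sketch of the documented behavior, not the app's exact code):

```python
def auto_resize(width: int, height: int) -> tuple[int, int]:
    """Clamp each side to 512-2048px, then snap down to a multiple of 16."""
    def snap(x: int) -> int:
        return max(512, min(2048, x)) // 16 * 16
    return snap(width), snap(height)
```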

### Supported Resolutions

| Category | Resolutions |
|----------|-------------|
| **Standard** | 1024x1024, 1344x768, 768x1344, 1152x896, 896x1152, 1536x640, 1216x832, 832x1216 |
| **XL** | 1536x1536, 1920x1088, 1088x1920, 1536x1152, 1152x1536 |
| **MAX** | 2048x2048, 2048x1152, 1152x2048, 2048x1536, 1536x2048 |

---

## UI/UX Design

### Theme
- **Color Scheme:** Blue gradient (#e8f4fc to #d4e9f7)
- **Primary Color:** #2563eb (buttons, active elements)
- **Secondary Color:** #3b82f6 (accents)
- **Background:** Light blue gradient
- **Cards:** White with subtle shadows

### Components
- Centered header with lightning bolt icon
- Tabbed interface (Generate / Transform)
- Two-column layout (controls | output)
- Example prompts with one-click loading
- Share button for CDN uploads
- Copy-to-clipboard for image links

---

## Performance Benchmarks

### Generation Speed (1024x1024, 8 steps)

| Configuration | Time |
|---------------|------|
| **Baseline (BF16 only)** | ~5-6 seconds |
| **With All Optimizations** | ~3-4 seconds |
| **Improvement** | ~2 seconds faster (~40%) |

### Memory Usage

| Configuration | VRAM |
|---------------|------|
| **Baseline (BF16)** | ~12GB |
| **With FP8 Quantization** | ~6GB |
| **Reduction** | ~50% |

---

## Architecture Diagram

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Gradio Web Interface                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚   🎨 Generate Tab   β”‚  β”‚     ✨ Transform Tab        β”‚   β”‚
β”‚  β”‚  - Prompt input     β”‚  β”‚  - Image upload             β”‚   β”‚
β”‚  β”‚  - Style selector   β”‚  β”‚  - Transformation prompt    β”‚   β”‚
β”‚  β”‚  - Aspect ratio     β”‚  β”‚  - Strength slider          β”‚   β”‚
β”‚  β”‚  - Steps/Seed       β”‚  β”‚  - Style/Steps/Seed         β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    ZeroGPU (@spaces.GPU)                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚              NVIDIA H200 (70GB VRAM)                β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚    β”‚
β”‚  β”‚  β”‚  pipe_t2i       β”‚  β”‚  pipe_i2i               β”‚   β”‚    β”‚
β”‚  β”‚  β”‚  (Text-to-Img)  β”‚  β”‚  (Img-to-Img)           β”‚   β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚    β”‚
β”‚  β”‚           β”‚                        β”‚                β”‚    β”‚
β”‚  β”‚           β–Ό                        β–Ό                β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚    β”‚
β”‚  β”‚  β”‚         Z-Image Transformer (6B)            β”‚    β”‚    β”‚
β”‚  β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚    β”‚    β”‚
β”‚  β”‚  β”‚  β”‚  FP8 Quantized (torchao)            β”‚    β”‚    β”‚    β”‚
β”‚  β”‚  β”‚  β”‚  FlashAttention-2 (SDPA backend)    β”‚    β”‚    β”‚    β”‚
β”‚  β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚    β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚    β”‚
β”‚  β”‚                        β”‚                            β”‚    β”‚
β”‚  β”‚                        β–Ό                            β”‚    β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚    β”‚
β”‚  β”‚  β”‚              VAE Decoder                    β”‚    β”‚    β”‚
β”‚  β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚    β”‚    β”‚
β”‚  β”‚  β”‚  β”‚  Tiling enabled (large images)      β”‚    β”‚    β”‚    β”‚
β”‚  β”‚  β”‚  β”‚  Slicing enabled (memory efficient) β”‚    β”‚    β”‚    β”‚
β”‚  β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚    β”‚    β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      Output                                  β”‚
β”‚  - PNG image (full quality)                                 β”‚
β”‚  - Seed value (reproducibility)                             β”‚
β”‚  - Optional: HuggingFace CDN share link                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

---

## Known Limitations

### torch.compile Incompatibility
The Z-Image transformer contains code patterns (`device = x[0].device`) that are incompatible with PyTorch's dynamo tracer. This prevents using `torch.compile` for additional speedup.

### FlashAttention-3
`FlashAttention3Processor` is not yet available in diffusers for the Z-Image architecture. The application uses FlashAttention-2 via SDPA backend instead.

### torchao Version Warning
A deprecation warning appears for `float8_dynamic_activation_float8_weight`. This is cosmetic and doesn't affect functionality.

---

## Future Optimization Opportunities

1. **Ahead-of-Time Compilation (AoTI)** - When Z-Image becomes compatible with torch.compile
2. **INT8 Quantization** - Alternative to FP8 for broader hardware support
3. **Model Sharding** - For even larger batch processing
4. **Speculative Decoding** - Potential speedup for iterative generation
5. **LoRA Support** - Custom style fine-tuning

---

## Credits

- **Model:** Alibaba Tongyi-MAI Team (Z-Image-Turbo)
- **Infrastructure:** Hugging Face (Spaces, ZeroGPU, Diffusers)
- **Optimizations:** PyTorch Team (SDPA, torchao)
- **Application:** Built with Gradio

---

## License

- **Model:** Apache 2.0
- **Application Code:** MIT
- **Dependencies:** Various open-source licenses

---

*This report documents the technical implementation of Z-Image Turbo v15 as deployed on Hugging Face Spaces.*