# Compatibility Notes

## Critical Version Constraints

### Python
- Python 3.9-3.11: HF Spaces typically supports these versions
- Avoid Python 3.12+ for maximum compatibility
### PyTorch
- PyTorch 2.1.x: Latest stable with good HF ecosystem support
- CPU-only builds for ZeroGPU deployments
### Transformers
- Transformers 4.35.x: Latest features with stability
- Ensures compatibility with latest HF models
### Gradio
- Gradio 4.x: Current major version with mobile optimizations
- Required for mobile-responsive interface
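A `requirements.txt` consistent with these constraints might look like the sketch below; the exact patch versions and the CPU wheel index are illustrative and should match whatever is actually tested on the Space.

```text
# requirements.txt - illustrative pins matching the constraints above
--extra-index-url https://download.pytorch.org/whl/cpu
torch==2.1.2+cpu          # CPU-only build for ZeroGPU
transformers==4.35.2
gradio==4.8.0
sentence-transformers==2.2.2
faiss-cpu==1.7.4
numpy==1.24.4
huggingface_hub>=0.19
```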
## HF Spaces Specific Considerations

### ZeroGPU Environment
- Limited GPU memory: CPU-optimized versions are used
- All models run on CPU
- Use `faiss-cpu` instead of `faiss-gpu`
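A minimal sketch of how the CPU-only stack fits together, assuming `faiss-cpu` and `sentence-transformers` are installed; the flat inner-product index and the normalization step are illustrative choices, not requirements of the Space:

```python
import faiss
from sentence_transformers import SentenceTransformer

# 384-dimensional embedding model, forced onto the CPU.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")

docs = ["first chunk of text", "second chunk of text"]
embeddings = model.encode(docs, normalize_embeddings=True).astype("float32")

# faiss-cpu only: flat inner-product index (cosine similarity on normalized vectors).
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

query = model.encode(["a search query"], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query, k=2)
```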
### Storage Limits
- Limited persistent storage: Efficient caching is crucial
- Session data must be optimized for minimal storage usage
- Implement aggressive cleanup policies
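A sketch of the kind of cleanup this implies, gzip-compressing session JSON and deleting files past a TTL; the directory and TTL values are assumptions, not values fixed by HF Spaces:

```python
import gzip
import json
import time
from pathlib import Path

SESSION_DIR = Path("/tmp/sessions")  # assumption: ephemeral scratch dir on the Space
SESSION_TTL_S = 30 * 60              # 30 minutes, illustrative

def save_session(session_id: str, data: dict) -> None:
    """Store session state as gzip-compressed JSON to minimize storage usage."""
    SESSION_DIR.mkdir(parents=True, exist_ok=True)
    payload = gzip.compress(json.dumps(data).encode("utf-8"))
    (SESSION_DIR / f"{session_id}.json.gz").write_bytes(payload)

def cleanup_sessions() -> None:
    """Aggressively drop any session file older than the TTL."""
    if not SESSION_DIR.exists():
        return
    now = time.time()
    for path in SESSION_DIR.glob("*.json.gz"):
        if now - path.stat().st_mtime > SESSION_TTL_S:
            path.unlink(missing_ok=True)
```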
### Network Restrictions
- May have restrictions on external API calls
- All LLM calls must use Hugging Face Inference API
- Avoid external HTTP requests in production
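A sketch of routing generation through `huggingface_hub.InferenceClient`, so the only outbound call is to the HF Inference API; the model ID is a placeholder for whatever hosted model the Space actually uses:

```python
import os
from huggingface_hub import InferenceClient

# Placeholder model ID; HF_TOKEN is read from the Space's secrets.
client = InferenceClient(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    token=os.environ.get("HF_TOKEN"),
)

def generate(prompt: str, max_new_tokens: int = 800) -> str:
    # Single Inference API call; no other external HTTP requests.
    return client.text_generation(prompt, max_new_tokens=max_new_tokens)
```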
## Model Selection

### For ZeroGPU

- Embedding model: `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions, fast)
- Primary LLM: use HF Inference API endpoint calls
- Avoid loading large models locally
### Memory Optimization
- Limit concurrent requests
- Use streaming responses
- Implement response compression
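Of these, streaming is the easiest to sketch: yield the answer incrementally so the full response never has to be buffered. The client setup repeats the placeholder model ID from the sketch above, and handing the generator to a Gradio output is an assumption about how the UI is wired:

```python
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model ID
    token=os.environ.get("HF_TOKEN"),
)

def generate_stream(prompt: str, max_new_tokens: int = 800):
    """Yield the growing answer token by token instead of buffering it all."""
    partial = ""
    for token in client.text_generation(prompt, max_new_tokens=max_new_tokens, stream=True):
        partial += token
        yield partial  # Gradio event handlers accept generators for incremental updates
```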
## Performance Considerations

### Cache Strategy
- In-memory caching for active sessions
- Aggressive cache eviction (LRU)
- TTL-based expiration
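A minimal in-memory cache combining LRU eviction with TTL expiry; the capacity and TTL values are illustrative:

```python
import time
from collections import OrderedDict

class LRUTTLCache:
    """In-memory cache for active sessions: LRU eviction plus TTL expiration."""

    def __init__(self, max_items: int = 256, ttl_seconds: float = 600.0):
        self.max_items = max_items
        self.ttl = ttl_seconds
        self._data = OrderedDict()  # key -> (timestamp, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        ts, value = entry
        if time.time() - ts > self.ttl:          # TTL-based expiration
            del self._data[key]
            return None
        self._data.move_to_end(key)              # mark as recently used
        return value

    def put(self, key, value) -> None:
        self._data[key] = (time.time(), value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_items:  # aggressive LRU eviction
            self._data.popitem(last=False)       # drop the least recently used entry
```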
### Mobile Optimization
- Reduced max tokens for mobile (800 vs 2000)
- Shorter timeout (15s vs 30s)
- Lazy loading of UI components
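These limits can live in a small per-client config; the values mirror the bullets above, and client-type detection (e.g. user-agent sniffing) is out of scope here:

```python
# Generation limits per client type, mirroring the bullets above.
GENERATION_LIMITS = {
    "mobile":  {"max_new_tokens": 800,  "timeout_s": 15},
    "desktop": {"max_new_tokens": 2000, "timeout_s": 30},
}

def limits_for(client_type: str) -> dict:
    # Unknown clients fall back to the conservative mobile profile.
    return GENERATION_LIMITS.get(client_type, GENERATION_LIMITS["mobile"])
```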
## Dependencies Compatibility Matrix
| Package | Version Range | Notes |
|---|---|---|
| Python | 3.9-3.11 | HF Spaces supported versions |
| PyTorch | 2.1.x | CPU version |
| Transformers | 4.35.x | Latest stable |
| Gradio | 4.x | Mobile support |
| FAISS | `faiss-cpu` (latest) | No GPU support on ZeroGPU |
| NumPy | 1.24.x | Kept below 2.0 for compatibility with the stack above |
## Known Issues & Workarounds
- **Issue:** FAISS GPU build not available. **Solution:** use `faiss-cpu` in `requirements.txt`.
- **Issue:** local model loading exhausts memory. **Solution:** use the HF Inference API instead of loading models locally.
- **Issue:** session storage limits. **Solution:** compress session data and expire it with TTL-based cleanup.
- **Issue:** concurrent request limits. **Solution:** queue requests behind a worker pool with a `max_workers` limit.
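A minimal version of that request queue, using a bounded thread pool so at most `max_workers` generations run at once; `run_llm_call` is a stand-in for the app's actual generation function:

```python
from concurrent.futures import ThreadPoolExecutor

def run_llm_call(prompt: str) -> str:
    # Stand-in for the real Inference API call.
    return f"answer to: {prompt}"

# Single shared pool: at most two requests run concurrently, the rest wait in the queue.
EXECUTOR = ThreadPoolExecutor(max_workers=2)

def submit_generation(prompt: str, timeout_s: float = 30.0) -> str:
    future = EXECUTOR.submit(run_llm_call, prompt)
    return future.result(timeout=timeout_s)
```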
## Testing Recommendations
- Test on ZeroGPU environment before production
- Verify memory usage stays under 512MB
- Test mobile responsiveness
- Validate cache efficiency (target: >60% hit rate)
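Two of these checks are easy to automate; `psutil` is an assumed test-only dependency, and the hit/request counters are whatever the cache implementation exposes:

```python
import os
import psutil  # assumed test-only dependency

def assert_within_budget(cache_hits: int, cache_requests: int) -> None:
    # Resident memory must stay under the 512 MB budget.
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
    assert rss_mb < 512, f"resident memory {rss_mb:.0f} MB exceeds 512 MB"

    # Cache efficiency target: >60% hit rate.
    hit_rate = cache_hits / max(cache_requests, 1)
    assert hit_rate > 0.60, f"cache hit rate {hit_rate:.0%} below 60% target"
```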