Deployment Notes
Hugging Face Spaces Deployment
NVIDIA T4 Medium Configuration
This MVP is optimized for NVIDIA T4 Medium GPU deployment on Hugging Face Spaces.
Hardware Specifications
- GPU: NVIDIA T4 (persistent, always available)
- vCPU: 8 cores
- RAM: 30GB
- vRAM: 16GB
- Storage: ~20GB
- Network: Shared infrastructure
Resource Capacity
- GPU Memory: 16GB vRAM (sufficient for FP16 local model loading)
- System Memory: 30GB RAM (excellent for caching and processing)
- CPU: 8 vCPU (good for parallel operations)
Environment Variables
Required environment variables for deployment:
HF_TOKEN=your_huggingface_token_here
HF_HOME=/tmp/huggingface
MAX_WORKERS=4
CACHE_TTL=3600
DB_PATH=sessions.db
FAISS_INDEX_PATH=embeddings.faiss
SESSION_TIMEOUT=3600
MAX_SESSION_SIZE_MB=10
MOBILE_MAX_TOKENS=800
MOBILE_TIMEOUT=15000
GRADIO_PORT=7860
GRADIO_HOST=0.0.0.0
LOG_LEVEL=INFO
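A minimal sketch of how app.py might read these variables at startup (the defaults shown here are assumptions, not values the app necessarily ships with):

```python
import os

# Read deployment configuration from the environment, falling back to the
# documented defaults when a variable is not set.
HF_TOKEN = os.getenv("HF_TOKEN")                            # required for the HF API fallback
HF_HOME = os.getenv("HF_HOME", "/tmp/huggingface")          # model download/cache location
MAX_WORKERS = int(os.getenv("MAX_WORKERS", "4"))
CACHE_TTL = int(os.getenv("CACHE_TTL", "3600"))             # seconds
DB_PATH = os.getenv("DB_PATH", "sessions.db")
FAISS_INDEX_PATH = os.getenv("FAISS_INDEX_PATH", "embeddings.faiss")
SESSION_TIMEOUT = int(os.getenv("SESSION_TIMEOUT", "3600"))
MAX_SESSION_SIZE_MB = int(os.getenv("MAX_SESSION_SIZE_MB", "10"))
MOBILE_MAX_TOKENS = int(os.getenv("MOBILE_MAX_TOKENS", "800"))
MOBILE_TIMEOUT_MS = int(os.getenv("MOBILE_TIMEOUT", "15000"))
GRADIO_PORT = int(os.getenv("GRADIO_PORT", "7860"))
GRADIO_HOST = os.getenv("GRADIO_HOST", "0.0.0.0")
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
```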
Space Configuration
Create a README.md in the HF Space with:
---
title: AI Research Assistant MVP
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
---
Deployment Steps
1. Clone/Setup Repository
   git clone your-repo
   cd Research_Assistant
2. Install Dependencies
   bash install.sh  # or pip install -r requirements.txt
3. Test Installation
   python test_setup.py  # or bash quick_test.sh
4. Run Locally
   python app.py
5. Deploy to HF Spaces
- Push to GitHub
- Connect to HF Spaces
- Select NVIDIA T4 Medium GPU hardware
- Deploy
Resource Management
Memory Limits
- Base Python: ~100MB
- Gradio: ~50MB
- Models (loaded on GPU): ~14-16GB vRAM
- Primary model (Qwen/Qwen2.5-7B-Instruct): ~14GB
- Embedding model: ~500MB
- Classification models: ~500MB each
- System RAM: ~2-4GB for caching and processing
- Cache: ~500MB-1GB max
GPU Memory Budget: ~16GB vRAM (the FP16 model stack fits, with limited headroom)
System RAM Budget: 30GB (plenty of headroom)
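A quick runtime sanity check of the budget above (assumes torch and psutil are available in the image):

```python
import psutil
import torch

def report_memory_budget() -> None:
    """Print used/total GPU vRAM and system RAM so the budget can be verified at runtime."""
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        total_vram_gb = props.total_memory / 1024**3
        used_vram_gb = torch.cuda.memory_allocated(0) / 1024**3
        print(f"GPU: {props.name}, vRAM used/total: {used_vram_gb:.1f}/{total_vram_gb:.1f} GB")
    else:
        print("No GPU visible; local models will fall back to the HF Inference API")

    ram = psutil.virtual_memory()
    used_ram_gb = (ram.total - ram.available) / 1024**3
    print(f"System RAM used/total: {used_ram_gb:.1f}/{ram.total / 1024**3:.1f} GB")
```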
Strategies
- Local GPU Model Loading: Models loaded on GPU for faster inference
- Lazy Loading: Models loaded on-demand to speed up startup
- GPU Memory Management: Automatic device placement with FP16 precision
- Caching: Aggressive caching with 30GB RAM available
- Response Streaming: Reduces memory use during generation
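A sketch of the lazy-loading and FP16 device-placement strategy with transformers (assumes accelerate is installed for device_map="auto"; the cached loader is an assumption about how app.py is structured):

```python
from functools import lru_cache

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PRIMARY_MODEL = "Qwen/Qwen2.5-7B-Instruct"

@lru_cache(maxsize=1)
def get_primary_model():
    """Load the primary model once, on first use, in FP16 on the GPU."""
    tokenizer = AutoTokenizer.from_pretrained(PRIMARY_MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        PRIMARY_MODEL,
        torch_dtype=torch.float16,   # halves vRAM vs FP32; ~14GB for the 7B model
        device_map="auto",           # places weights on the T4 automatically
    )
    return tokenizer, model
```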
Performance Optimization
For NVIDIA T4 GPU
- Local Model Loading: Models run locally on GPU (faster than API)
- Primary model: Qwen/Qwen2.5-7B-Instruct (~14GB vRAM)
- Embedding model: sentence-transformers/all-MiniLM-L6-v2 (~500MB)
- GPU Acceleration: All inference runs on GPU
- Parallel Processing: 4 workers (MAX_WORKERS=4) for concurrent requests
- Fallback to API: Automatically falls back to HF Inference API if local models fail
- Request Queuing: Built-in async request handling
- Response Streaming: Implemented for efficient memory usage
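A hedged sketch of the local-first / API-fallback path using huggingface_hub's InferenceClient (the exact exception handling and prompt formatting in app.py are assumptions):

```python
import os

import torch
from huggingface_hub import InferenceClient
from transformers import pipeline

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"
_local_pipe = None

def generate(prompt: str, max_new_tokens: int = 800) -> str:
    """Try local GPU inference first; fall back to the hosted HF Inference API on failure."""
    global _local_pipe
    try:
        if _local_pipe is None:
            # device_map="auto" needs accelerate; FP16 keeps the 7B model near ~14GB vRAM.
            _local_pipe = pipeline("text-generation", model=MODEL_ID,
                                   torch_dtype=torch.float16, device_map="auto")
        out = _local_pipe(prompt, max_new_tokens=max_new_tokens, return_full_text=False)
        return out[0]["generated_text"]
    except Exception:
        # Local loading or generation failed (e.g. out of memory): use the Inference API instead.
        client = InferenceClient(model=MODEL_ID, token=os.getenv("HF_TOKEN"))
        return client.text_generation(prompt, max_new_tokens=max_new_tokens)
```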
Mobile Optimizations
- Reduce max tokens to 800
- Shorten timeout to 15s
- Implement progressive loading
- Use touch-optimized UI
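One way the token and timeout limits above could be applied, keyed off the MOBILE_MAX_TOKENS and MOBILE_TIMEOUT variables defined earlier (the mobile-detection flag and desktop defaults are assumptions):

```python
import os

MOBILE_MAX_TOKENS = int(os.getenv("MOBILE_MAX_TOKENS", "800"))
MOBILE_TIMEOUT_S = int(os.getenv("MOBILE_TIMEOUT", "15000")) / 1000  # env value is in ms

def generation_limits(is_mobile: bool) -> dict:
    """Return tighter generation limits for mobile clients, looser defaults otherwise."""
    if is_mobile:
        return {"max_new_tokens": MOBILE_MAX_TOKENS, "timeout": MOBILE_TIMEOUT_S}
    return {"max_new_tokens": 2048, "timeout": 60}  # desktop defaults are assumptions
```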
Monitoring
Health Checks
- Application health endpoint: /health
- Database connectivity check
- Cache hit rate monitoring
- Response time tracking
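Gradio does not expose a /health route on its own; one common pattern is to mount the Gradio UI on a FastAPI app and add the endpoint there. A minimal sketch, with the specific checks as assumptions:

```python
import os
import sqlite3

import gradio as gr
from fastapi import FastAPI

api = FastAPI()

@api.get("/health")
def health() -> dict:
    """Report liveness plus a basic database connectivity check."""
    db_ok = True
    try:
        conn = sqlite3.connect(os.getenv("DB_PATH", "sessions.db"), timeout=2)
        conn.execute("SELECT 1")
        conn.close()
    except sqlite3.Error:
        db_ok = False
    return {"status": "ok" if db_ok else "degraded", "database": db_ok}

demo = gr.Interface(fn=lambda x: x, inputs="text", outputs="text")  # placeholder for the real UI
app = gr.mount_gradio_app(api, demo, path="/")  # serve Gradio and /health from one server
# run with: uvicorn app:app --host 0.0.0.0 --port 7860
```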
Logging
- Use structured logging (structlog)
- Log levels: DEBUG (dev), INFO (prod)
- Monitor error rates
- Track performance metrics
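A minimal structlog configuration matching the level convention above (JSON output is an assumption about the preferred format):

```python
import logging
import os

import structlog

def configure_logging() -> None:
    """Configure structlog with the level taken from LOG_LEVEL (INFO in prod, DEBUG in dev)."""
    level = getattr(logging, os.getenv("LOG_LEVEL", "INFO").upper(), logging.INFO)
    structlog.configure(
        wrapper_class=structlog.make_filtering_bound_logger(level),
        processors=[
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.add_log_level,
            structlog.processors.JSONRenderer(),
        ],
    )

configure_logging()
log = structlog.get_logger()
# log.info("request_served", route="/query", duration_ms=120, cache_hit=True)
```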
Troubleshooting
Common Issues
Issue: Out of memory errors
- Solution: Reduce max_workers, implement request queuing
Issue: Slow responses
- Solution: Enable aggressive caching, use streaming
Issue: Model loading failures
- Solution: Use HF Inference API instead of local models
Issue: Session data loss
- Solution: Implement proper persistence with SQLite backup
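For the session-data-loss case, a small persistence sketch using SQLite's built-in backup API (the backup path and schedule are assumptions):

```python
import sqlite3

def backup_sessions(db_path: str = "sessions.db",
                    backup_path: str = "sessions.backup.db") -> None:
    """Copy the live session database to a backup file using SQLite's online backup API."""
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(backup_path)
    with dst:
        src.backup(dst)   # consistent copy, safe to run while the app is serving requests
    dst.close()
    src.close()
```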
Scaling Considerations
For Production
- Horizontal Scaling: Deploy multiple instances
- Caching Layer: Add Redis for shared session data
- Load Balancing: Use HF Spaces built-in load balancer
- CDN: Static assets via CDN
- Database: Consider PostgreSQL for production
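If a shared caching layer is added, a sketch of session storage in Redis reusing the SESSION_TIMEOUT semantics above (host, port, and key scheme are assumptions):

```python
import json
import os

import redis

r = redis.Redis(host=os.getenv("REDIS_HOST", "localhost"), port=6379, decode_responses=True)
SESSION_TIMEOUT = int(os.getenv("SESSION_TIMEOUT", "3600"))

def save_session(session_id: str, data: dict) -> None:
    """Store session state with an expiry so stale sessions clean themselves up."""
    r.setex(f"session:{session_id}", SESSION_TIMEOUT, json.dumps(data))

def load_session(session_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```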
Migration Path
- Phase 1: MVP on a single NVIDIA T4 Medium Space (current)
- Phase 2: Upgrade to larger GPU hardware for bigger local models
- Phase 3: Scale to multiple workers
- Phase 4: Enterprise deployment with managed infrastructure