# Deployment Notes

## Hugging Face Spaces Deployment

### NVIDIA T4 Medium Configuration

This MVP is optimized for **NVIDIA T4 Medium** GPU deployment on Hugging Face Spaces.

#### Hardware Specifications

- **GPU**: NVIDIA T4 (persistent, always available)
- **vCPU**: 8 cores
- **RAM**: 30GB
- **vRAM**: 16GB
- **Storage**: ~20GB
- **Network**: Shared infrastructure

#### Resource Capacity

- **GPU Memory**: 16GB vRAM (sufficient for local model loading in FP16)
- **System Memory**: 30GB RAM (excellent for caching and processing)
- **CPU**: 8 vCPU (good for parallel operations)

### Environment Variables

Required environment variables for deployment:

```bash
HF_TOKEN=your_huggingface_token_here
HF_HOME=/tmp/huggingface
MAX_WORKERS=4
CACHE_TTL=3600
DB_PATH=sessions.db
FAISS_INDEX_PATH=embeddings.faiss
SESSION_TIMEOUT=3600
MAX_SESSION_SIZE_MB=10
MOBILE_MAX_TOKENS=800
MOBILE_TIMEOUT=15000
GRADIO_PORT=7860
GRADIO_HOST=0.0.0.0
LOG_LEVEL=INFO
```

### Space Configuration

Create a `README.md` in the HF Space with:

```yaml
---
title: AI Research Assistant MVP
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
---
```

### Deployment Steps

1. **Clone/Setup Repository**
   ```bash
   git clone your-repo
   cd Research_Assistant
   ```
2. **Install Dependencies**
   ```bash
   bash install.sh
   # or
   pip install -r requirements.txt
   ```
3. **Test Installation**
   ```bash
   python test_setup.py
   # or
   bash quick_test.sh
   ```
4. **Run Locally**
   ```bash
   python app.py
   ```
5. **Deploy to HF Spaces**
   - Push to GitHub
   - Connect the repository to HF Spaces
   - Select NVIDIA T4 Medium GPU hardware
   - Deploy

### Resource Management

#### Memory Limits

- **Base Python**: ~100MB
- **Gradio**: ~50MB
- **Models (loaded on GPU)**: ~14-16GB vRAM
  - Primary model (Qwen/Qwen2.5-7B-Instruct): ~14GB
  - Embedding model: ~500MB
  - Classification models: ~500MB each
- **System RAM**: ~2-4GB for caching and processing
- **Cache**: ~500MB-1GB max

**GPU Memory Budget**: ~16GB vRAM (the models fit, but with little headroom; FP16 is required)
**System RAM Budget**: 30GB (plenty of headroom)

#### Strategies

- **Local GPU Model Loading**: Models are loaded on the GPU for faster inference
- **Lazy Loading**: Models are loaded on demand to speed up startup
- **GPU Memory Management**: Automatic device placement with FP16 precision
- **Caching**: Aggressive caching, taking advantage of the 30GB of system RAM
- **Response Streaming**: Stream responses to reduce memory use during generation

### Performance Optimization

#### For NVIDIA T4 GPU

1. **Local Model Loading**: Models run locally on the GPU, which is faster than the API (see the sketch after this list)
   - Primary model: Qwen/Qwen2.5-7B-Instruct (~14GB vRAM)
   - Embedding model: sentence-transformers/all-MiniLM-L6-v2 (~500MB)
2. **GPU Acceleration**: All inference runs on the GPU
3. **Parallel Processing**: 4 workers (MAX_WORKERS=4) for concurrent requests
4. **Fallback to API**: Automatically falls back to the HF Inference API if local models fail
5. **Request Queuing**: Built-in async request handling
6. **Response Streaming**: Implemented for efficient memory usage
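The following is a minimal sketch of the lazy GPU loading and API-fallback behaviour described above, assuming the `transformers` and `huggingface_hub` packages from `requirements.txt`; the `generate` helper and its structure are illustrative, not the actual `app.py` API.

```python
import os

import torch
from huggingface_hub import InferenceClient
from transformers import AutoModelForCausalLM, AutoTokenizer

PRIMARY_MODEL = "Qwen/Qwen2.5-7B-Instruct"

_model = None
_tokenizer = None  # both loaded lazily on first use to keep startup fast


def _load_local():
    """Load the primary model onto the GPU in FP16 (one-time, lazy)."""
    global _model, _tokenizer
    if _model is None:
        _tokenizer = AutoTokenizer.from_pretrained(PRIMARY_MODEL)
        _model = AutoModelForCausalLM.from_pretrained(
            PRIMARY_MODEL,
            torch_dtype=torch.float16,  # FP16 halves vRAM versus FP32
            device_map="auto",          # place weights on the T4 automatically
        )
    return _model, _tokenizer


def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Run local GPU inference, falling back to the HF Inference API on failure."""
    try:
        model, tokenizer = _load_local()
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return tokenizer.decode(output[0], skip_special_tokens=True)
    except (RuntimeError, OSError):
        # e.g. CUDA out-of-memory or a download failure: use the hosted API
        client = InferenceClient(model=PRIMARY_MODEL, token=os.environ.get("HF_TOKEN"))
        return client.text_generation(prompt, max_new_tokens=max_new_tokens)
```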
#### Mobile Optimizations

- Reduce max tokens to 800 (MOBILE_MAX_TOKENS)
- Shorten the timeout to 15s (MOBILE_TIMEOUT)
- Implement progressive loading
- Use a touch-optimized UI

### Monitoring

#### Health Checks

- Application health endpoint: `/health` (see the sketch at the end of these notes)
- Database connectivity check
- Cache hit rate monitoring
- Response time tracking

#### Logging

- Use structured logging (structlog)
- Log levels: DEBUG (dev), INFO (prod)
- Monitor error rates
- Track performance metrics

### Troubleshooting

#### Common Issues

**Issue**: Out of memory errors
- **Solution**: Reduce MAX_WORKERS and implement request queuing

**Issue**: Slow responses
- **Solution**: Enable aggressive caching and use response streaming

**Issue**: Model loading failures
- **Solution**: Fall back to the HF Inference API instead of local models

**Issue**: Session data loss
- **Solution**: Implement proper persistence with SQLite backups

### Scaling Considerations

#### For Production

1. **Horizontal Scaling**: Deploy multiple instances
2. **Caching Layer**: Add Redis for shared session data
3. **Load Balancing**: Use HF Spaces' built-in load balancing
4. **CDN**: Serve static assets via a CDN
5. **Database**: Consider PostgreSQL for production

#### Migration Path

- **Phase 1**: MVP on NVIDIA T4 Medium with local models (current)
- **Phase 2**: Scale to multiple workers
- **Phase 3**: Add a shared caching layer and a production database
- **Phase 4**: Enterprise deployment with managed infrastructure
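As referenced in the Monitoring section, here is a minimal sketch of a `/health` endpoint that also checks database connectivity. It assumes FastAPI and uvicorn are available and that `app.py` exposes its Gradio app as `demo` (that name, and the module layout, are assumptions for illustration).

```python
import os
import sqlite3
import time

import gradio as gr
from fastapi import FastAPI

from app import demo  # assumed: the Gradio app object defined in app.py

DB_PATH = os.environ.get("DB_PATH", "sessions.db")

api = FastAPI()


@api.get("/health")
def health() -> dict:
    """Report liveness plus SQLite session-store connectivity."""
    db_ok = True
    try:
        with sqlite3.connect(DB_PATH, timeout=2) as conn:
            conn.execute("SELECT 1")
    except sqlite3.Error:
        db_ok = False
    return {
        "status": "ok" if db_ok else "degraded",
        "database": db_ok,
        "timestamp": time.time(),
    }


# Mount the Gradio UI at the root path so the app and /health share port 7860.
api = gr.mount_gradio_app(api, demo, path="/")

# Run with: uvicorn health_app:api --host 0.0.0.0 --port 7860
```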