# Deployment Notes

## Hugging Face Spaces Deployment

### ZeroGPU Configuration

This MVP is optimized for **ZeroGPU** deployment on Hugging Face Spaces.

#### Key Settings

- **GPU**: None (CPU-only)
- **Storage**: Limited (~20GB)
- **Memory**: 32GB RAM
- **Network**: Shared infrastructure

### Environment Variables

Required environment variables for deployment:

```bash
HF_TOKEN=your_huggingface_token_here
HF_HOME=/tmp/huggingface
MAX_WORKERS=2
CACHE_TTL=3600
DB_PATH=sessions.db
FAISS_INDEX_PATH=embeddings.faiss
SESSION_TIMEOUT=3600
MAX_SESSION_SIZE_MB=10
MOBILE_MAX_TOKENS=800
MOBILE_TIMEOUT=15000
GRADIO_PORT=7860
GRADIO_HOST=0.0.0.0
LOG_LEVEL=INFO
```
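A minimal sketch of reading these variables at startup, with fallbacks mirroring the block above (the validation step is illustrative, not part of the current code):

```python
import os

# Read deployment settings; defaults mirror the documented values
HF_TOKEN = os.getenv("HF_TOKEN")  # required secret, so no default
MAX_WORKERS = int(os.getenv("MAX_WORKERS", "2"))
CACHE_TTL = int(os.getenv("CACHE_TTL", "3600"))
DB_PATH = os.getenv("DB_PATH", "sessions.db")
MOBILE_MAX_TOKENS = int(os.getenv("MOBILE_MAX_TOKENS", "800"))
MOBILE_TIMEOUT_MS = int(os.getenv("MOBILE_TIMEOUT", "15000"))
# ...the remaining variables follow the same pattern

if HF_TOKEN is None:
    raise RuntimeError("HF_TOKEN is not set; add it as a Space secret")
```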
### Space Configuration

Create a `README.md` in the HF Space with:

```yaml
---
title: AI Research Assistant MVP
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
license: apache-2.0
---
```

### Deployment Steps

1. **Clone/Setup Repository**

   ```bash
   git clone your-repo
   cd Research_Assistant
   ```

2. **Install Dependencies**

   ```bash
   bash install.sh
   # or
   pip install -r requirements.txt
   ```

3. **Test Installation**

   ```bash
   python test_setup.py
   # or
   bash quick_test.sh
   ```

4. **Run Locally**

   ```bash
   python app.py
   ```

5. **Deploy to HF Spaces**
   - Push to GitHub
   - Connect to HF Spaces
   - Select ZeroGPU hardware
   - Deploy

### Resource Management

#### Memory Limits

- **Base Python**: ~100MB
- **Gradio**: ~50MB
- **Models (loaded)**: ~200-500MB
- **Cache**: ~100MB max
- **Buffer**: ~100MB

**Total Budget**: ~512MB (within HF Spaces limits)

#### Strategies

- Lazy model loading (see the sketch below)
- Model offloading when not in use
- Aggressive cache eviction
- Stream responses to reduce memory
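Lazy loading keeps the embedding model out of memory until the first request needs it. A minimal sketch, assuming a small `sentence-transformers` encoder (the model name is illustrative, not fixed by this project):

```python
from functools import lru_cache

@lru_cache(maxsize=1)  # load once on first use, then reuse
def get_embedder():
    # Import inside the function so startup stays lightweight
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

def embed(texts: list[str]):
    # Normalized vectors make inner-product search equivalent to cosine
    return get_embedder().encode(texts, normalize_embeddings=True)
```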
### Performance Optimization

#### For ZeroGPU

1. Use the HF Inference API for LLM calls, not local models (first sketch below)
2. Use `sentence-transformers` for embeddings (lightweight)
3. Implement request queuing
4. Use FAISS-CPU, not the GPU build (second sketch below)
5. Implement response streaming
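Points 1 and 5 combine naturally: call a hosted model through `huggingface_hub` and stream tokens back. A hedged sketch; the model ID is an assumption, not something this project pins down:

```python
import os
from huggingface_hub import InferenceClient

# Hosted inference keeps model weights off the Space entirely (point 1)
client = InferenceClient(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # illustrative model ID
    token=os.getenv("HF_TOKEN"),
    timeout=15,  # seconds; aligns with the 15s mobile timeout below
)

def answer(prompt: str, max_new_tokens: int = 800):
    # stream=True yields tokens as they arrive (point 5)
    yield from client.text_generation(
        prompt, max_new_tokens=max_new_tokens, stream=True
    )
```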
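For points 2 and 4, the normalized embeddings from the lazy-loading sketch drop straight into a CPU FAISS index (dimension 384 matches the illustrative MiniLM encoder, so adjust it to whatever model the app actually loads):

```python
import faiss
import numpy as np

dim = 384  # embedding width of the illustrative MiniLM encoder
index = faiss.IndexFlatIP(dim)  # inner product == cosine for normalized vectors

def add_documents(vectors: np.ndarray):
    index.add(vectors.astype("float32"))  # FAISS expects float32

def search(query_vec: np.ndarray, k: int = 5):
    scores, ids = index.search(query_vec.astype("float32").reshape(1, -1), k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))
```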
#### Mobile Optimizations

- Reduce max tokens to 800
- Shorten the timeout to 15s
- Implement progressive loading
- Use a touch-optimized UI

### Monitoring

#### Health Checks

- Application health endpoint: `/health` (see the sketch below)
- Database connectivity check
- Cache hit rate monitoring
- Response time tracking
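One way to expose `/health` alongside the Gradio UI is to mount the app on FastAPI; a sketch under that assumption (the repo may wire this differently):

```python
import gradio as gr
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health():
    # Extend with real database and cache checks as needed
    return {"status": "ok"}

demo = gr.Interface(fn=lambda x: x, inputs="text", outputs="text")
demo.queue(max_size=16)  # request queuing also helps under load
app = gr.mount_gradio_app(app, demo, path="/")

# Run with: uvicorn app:app --host 0.0.0.0 --port 7860
```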
#### Logging

- Use structured logging (structlog); see the sketch below
- Log levels: DEBUG (dev), INFO (prod)
- Monitor error rates
- Track performance metrics
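A minimal `structlog` setup matching these points, with JSON output and the level taken from `LOG_LEVEL`:

```python
import logging
import os
import structlog

level = getattr(logging, os.getenv("LOG_LEVEL", "INFO"))

structlog.configure(
    wrapper_class=structlog.make_filtering_bound_logger(level),
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ],
)

log = structlog.get_logger()
log.info("app_started", port=int(os.getenv("GRADIO_PORT", "7860")))
```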
### Troubleshooting

#### Common Issues

**Issue**: Out-of-memory errors
- **Solution**: Reduce `MAX_WORKERS`, implement request queuing

**Issue**: Slow responses
- **Solution**: Enable aggressive caching, use streaming

**Issue**: Model loading failures
- **Solution**: Use the HF Inference API instead of local models

**Issue**: Session data loss
- **Solution**: Implement proper persistence with SQLite backup (see the sketch below)
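For the SQLite backup, the standard library's online backup API is sufficient; a sketch assuming the `DB_PATH` variable from the environment block:

```python
import os
import sqlite3

def backup_sessions(dest_path: str = "sessions.backup.db"):
    # sqlite3's backup API safely copies a live database (Python 3.7+)
    src = sqlite3.connect(os.getenv("DB_PATH", "sessions.db"))
    dst = sqlite3.connect(dest_path)
    src.backup(dst)
    src.close()
    dst.close()
```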
### Scaling Considerations

#### For Production

1. **Horizontal Scaling**: Deploy multiple instances
2. **Caching Layer**: Add Redis for shared session data (see the sketch below)
3. **Load Balancing**: Use the HF Spaces built-in load balancer
4. **CDN**: Serve static assets via CDN
5. **Database**: Consider PostgreSQL for production
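With Redis, session data can be shared across instances with TTL-based expiry. A sketch; note that `REDIS_URL` is hypothetical and not one of the documented environment variables:

```python
import json
import os
import redis

# REDIS_URL is an assumption, not part of the environment block above
r = redis.Redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379/0"))

SESSION_TTL = int(os.getenv("SESSION_TIMEOUT", "3600"))

def save_session(session_id: str, data: dict) -> None:
    # setex stores the value with an expiry matching SESSION_TIMEOUT
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps(data))

def load_session(session_id: str):
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```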
#### Migration Path

- **Phase 1**: MVP on ZeroGPU (current)
- **Phase 2**: Upgrade to GPU for local models
- **Phase 3**: Scale to multiple workers
- **Phase 4**: Enterprise deployment with managed infrastructure