# Deployment Notes

## Hugging Face Spaces Deployment

### ZeroGPU Configuration

This MVP is optimized for **ZeroGPU** deployment on Hugging Face Spaces.

#### Key Settings

- **GPU**: None (CPU-only)
- **Storage**: Limited (~20GB)
- **Memory**: 32GB RAM
- **Network**: Shared infrastructure

### Environment Variables

Required environment variables for deployment:

```bash
HF_TOKEN=your_huggingface_token_here
HF_HOME=/tmp/huggingface
MAX_WORKERS=2
CACHE_TTL=3600
DB_PATH=sessions.db
FAISS_INDEX_PATH=embeddings.faiss
SESSION_TIMEOUT=3600
MAX_SESSION_SIZE_MB=10
MOBILE_MAX_TOKENS=800
MOBILE_TIMEOUT=15000
GRADIO_PORT=7860
GRADIO_HOST=0.0.0.0
LOG_LEVEL=INFO
```
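A minimal sketch of reading these variables at startup, with fallbacks mirroring the block above (the validation step is illustrative, not part of the current code):

```python
import os

# Read deployment settings; defaults mirror the documented values
HF_TOKEN = os.getenv("HF_TOKEN")  # required secret, so no default
MAX_WORKERS = int(os.getenv("MAX_WORKERS", "2"))
CACHE_TTL = int(os.getenv("CACHE_TTL", "3600"))
DB_PATH = os.getenv("DB_PATH", "sessions.db")
MOBILE_MAX_TOKENS = int(os.getenv("MOBILE_MAX_TOKENS", "800"))
MOBILE_TIMEOUT_MS = int(os.getenv("MOBILE_TIMEOUT", "15000"))
# ...the remaining variables follow the same pattern

if HF_TOKEN is None:
    raise RuntimeError("HF_TOKEN is not set; add it as a Space secret")
```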
### Space Configuration

Create a `README.md` in the HF Space with:

```yaml
---
title: AI Research Assistant MVP
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
license: apache-2.0
---
```

### Deployment Steps

1. **Clone/Setup Repository**

   ```bash
   git clone your-repo
   cd Research_Assistant
   ```

2. **Install Dependencies**

   ```bash
   bash install.sh
   # or
   pip install -r requirements.txt
   ```

3. **Test Installation**

   ```bash
   python test_setup.py
   # or
   bash quick_test.sh
   ```

4. **Run Locally**

   ```bash
   python app.py
   ```

5. **Deploy to HF Spaces**
   - Push to GitHub
   - Connect to HF Spaces
   - Select ZeroGPU hardware
   - Deploy

### Resource Management

#### Memory Limits

- **Base Python**: ~100MB
- **Gradio**: ~50MB
- **Models (loaded)**: ~200-500MB
- **Cache**: ~100MB max
- **Buffer**: ~100MB

**Total Budget**: ~512MB (within HF Spaces limits)

#### Strategies

- Lazy model loading (see the sketch below)
- Model offloading when not in use
- Aggressive cache eviction
- Stream responses to reduce memory
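Lazy loading keeps the embedding model out of memory until the first request needs it. A minimal sketch, assuming a small `sentence-transformers` encoder (the model name is illustrative, not fixed by this project):

```python
from functools import lru_cache

@lru_cache(maxsize=1)  # load once on first use, then reuse
def get_embedder():
    # Import inside the function so startup stays lightweight
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

def embed(texts: list[str]):
    # Normalized vectors make inner-product search equivalent to cosine
    return get_embedder().encode(texts, normalize_embeddings=True)
```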
### Performance Optimization

#### For ZeroGPU

1. Use the HF Inference API for LLM calls, not local models (first sketch below)
2. Use `sentence-transformers` for embeddings (lightweight)
3. Implement request queuing
4. Use FAISS-CPU, not the GPU build (second sketch below)
5. Implement response streaming
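Points 1 and 5 combine naturally: call a hosted model through `huggingface_hub` and stream tokens back. A hedged sketch; the model ID is an assumption, not something this project pins down:

```python
import os
from huggingface_hub import InferenceClient

# Hosted inference keeps model weights off the Space entirely (point 1)
client = InferenceClient(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # illustrative model ID
    token=os.getenv("HF_TOKEN"),
    timeout=15,  # seconds; aligns with the 15s mobile timeout below
)

def answer(prompt: str, max_new_tokens: int = 800):
    # stream=True yields tokens as they arrive (point 5)
    yield from client.text_generation(
        prompt, max_new_tokens=max_new_tokens, stream=True
    )
```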
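For points 2 and 4, the normalized embeddings from the lazy-loading sketch drop straight into a CPU FAISS index (dimension 384 matches the illustrative MiniLM encoder, so adjust it to whatever model the app actually loads):

```python
import faiss
import numpy as np

dim = 384  # embedding width of the illustrative MiniLM encoder
index = faiss.IndexFlatIP(dim)  # inner product == cosine for normalized vectors

def add_documents(vectors: np.ndarray):
    index.add(vectors.astype("float32"))  # FAISS expects float32

def search(query_vec: np.ndarray, k: int = 5):
    scores, ids = index.search(query_vec.astype("float32").reshape(1, -1), k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))
```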
#### Mobile Optimizations

- Reduce max tokens to 800
- Shorten the timeout to 15s
- Implement progressive loading
- Use a touch-optimized UI

### Monitoring

#### Health Checks

- Application health endpoint: `/health` (see the sketch below)
- Database connectivity check
- Cache hit rate monitoring
- Response time tracking
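One way to expose `/health` alongside the Gradio UI is to mount the app on FastAPI; a sketch under that assumption (the repo may wire this differently):

```python
import gradio as gr
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health():
    # Extend with real database and cache checks as needed
    return {"status": "ok"}

demo = gr.Interface(fn=lambda x: x, inputs="text", outputs="text")
demo.queue(max_size=16)  # request queuing also helps under load
app = gr.mount_gradio_app(app, demo, path="/")

# Run with: uvicorn app:app --host 0.0.0.0 --port 7860
```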
#### Logging

- Use structured logging (structlog); see the sketch below
- Log levels: DEBUG (dev), INFO (prod)
- Monitor error rates
- Track performance metrics
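A minimal `structlog` setup matching these points, with JSON output and the level taken from `LOG_LEVEL`:

```python
import logging
import os
import structlog

level = getattr(logging, os.getenv("LOG_LEVEL", "INFO"))

structlog.configure(
    wrapper_class=structlog.make_filtering_bound_logger(level),
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ],
)

log = structlog.get_logger()
log.info("app_started", port=int(os.getenv("GRADIO_PORT", "7860")))
```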
### Troubleshooting

#### Common Issues

**Issue**: Out-of-memory errors
- **Solution**: Reduce `MAX_WORKERS`, implement request queuing

**Issue**: Slow responses
- **Solution**: Enable aggressive caching, use streaming

**Issue**: Model loading failures
- **Solution**: Use the HF Inference API instead of local models

**Issue**: Session data loss
- **Solution**: Implement proper persistence with SQLite backup (see the sketch below)
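For the SQLite backup, the standard library's online backup API is sufficient; a sketch assuming the `DB_PATH` variable from the environment block:

```python
import os
import sqlite3

def backup_sessions(dest_path: str = "sessions.backup.db"):
    # sqlite3's backup API safely copies a live database (Python 3.7+)
    src = sqlite3.connect(os.getenv("DB_PATH", "sessions.db"))
    dst = sqlite3.connect(dest_path)
    src.backup(dst)
    src.close()
    dst.close()
```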
### Scaling Considerations

#### For Production

1. **Horizontal Scaling**: Deploy multiple instances
2. **Caching Layer**: Add Redis for shared session data (see the sketch below)
3. **Load Balancing**: Use the HF Spaces built-in load balancer
4. **CDN**: Serve static assets via CDN
5. **Database**: Consider PostgreSQL for production
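With Redis, session data can be shared across instances with TTL-based expiry. A sketch; note that `REDIS_URL` is hypothetical and not one of the documented environment variables:

```python
import json
import os
import redis

# REDIS_URL is an assumption, not part of the environment block above
r = redis.Redis.from_url(os.getenv("REDIS_URL", "redis://localhost:6379/0"))

SESSION_TTL = int(os.getenv("SESSION_TIMEOUT", "3600"))

def save_session(session_id: str, data: dict) -> None:
    # setex stores the value with an expiry matching SESSION_TIMEOUT
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps(data))

def load_session(session_id: str):
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```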
#### Migration Path

- **Phase 1**: MVP on ZeroGPU (current)
- **Phase 2**: Upgrade to GPU for local models
- **Phase 3**: Scale to multiple workers
- **Phase 4**: Enterprise deployment with managed infrastructure