# Deployment Notes

## Hugging Face Spaces Deployment

### NVIDIA T4 Medium Configuration

This MVP is optimized for **NVIDIA T4 Medium** GPU deployment on Hugging Face Spaces.

#### Hardware Specifications

- **GPU**: NVIDIA T4 (persistent, always available)
- **vCPU**: 8 cores
- **RAM**: 30GB
- **vRAM**: 16GB
- **Storage**: ~20GB
- **Network**: Shared infrastructure

#### Resource Capacity

- **GPU Memory**: 16GB vRAM (sufficient for local model loading in FP16)
- **System Memory**: 30GB RAM (excellent for caching and processing)
- **CPU**: 8 vCPU (good for parallel operations)

### Environment Variables

Required environment variables for deployment:

```bash
HF_TOKEN=your_huggingface_token_here
HF_HOME=/tmp/huggingface
MAX_WORKERS=4
CACHE_TTL=3600
DB_PATH=sessions.db
FAISS_INDEX_PATH=embeddings.faiss
SESSION_TIMEOUT=3600
MAX_SESSION_SIZE_MB=10
MOBILE_MAX_TOKENS=800
MOBILE_TIMEOUT=15000
GRADIO_PORT=7860
GRADIO_HOST=0.0.0.0
LOG_LEVEL=INFO
```

### Space Configuration

Create a `README.md` in the HF Space with:

```yaml
---
title: AI Research Assistant MVP
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
---
```

### Deployment Steps

1. **Clone/Setup Repository**
   ```bash
   git clone your-repo
   cd Research_Assistant
   ```
2. **Install Dependencies**
   ```bash
   bash install.sh
   # or
   pip install -r requirements.txt
   ```
3. **Test Installation**
   ```bash
   python test_setup.py
   # or
   bash quick_test.sh
   ```
4. **Run Locally**
   ```bash
   python app.py
   ```
5. **Deploy to HF Spaces**
   - Push to GitHub
   - Connect the repository to HF Spaces
   - Select NVIDIA T4 Medium GPU hardware
   - Deploy

### Resource Management

#### Memory Limits

- **Base Python**: ~100MB
- **Gradio**: ~50MB
- **Models (loaded on GPU)**: ~14-16GB vRAM
  - Primary model (Qwen/Qwen2.5-7B-Instruct): ~14GB
  - Embedding model: ~500MB
  - Classification models: ~500MB each
- **System RAM**: ~2-4GB for caching and processing
- **Cache**: ~500MB-1GB max

**GPU Memory Budget**: ~16GB vRAM (the models fit, but with little headroom; FP16 is required)
**System RAM Budget**: 30GB (plenty of headroom)

#### Strategies

- **Local GPU Model Loading**: Models are loaded on the GPU for faster inference
- **Lazy Loading**: Models are loaded on demand to speed up startup
- **GPU Memory Management**: Automatic device placement with FP16 precision
- **Caching**: Aggressive caching, taking advantage of the 30GB of system RAM
- **Response Streaming**: Stream responses to reduce memory use during generation

### Performance Optimization

#### For NVIDIA T4 GPU

1. **Local Model Loading**: Models run locally on the GPU, which is faster than the API (see the sketch after this list)
   - Primary model: Qwen/Qwen2.5-7B-Instruct (~14GB vRAM)
   - Embedding model: sentence-transformers/all-MiniLM-L6-v2 (~500MB)
2. **GPU Acceleration**: All inference runs on the GPU
3. **Parallel Processing**: 4 workers (MAX_WORKERS=4) for concurrent requests
4. **Fallback to API**: Automatically falls back to the HF Inference API if local models fail
5. **Request Queuing**: Built-in async request handling
6. **Response Streaming**: Implemented for efficient memory usage
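The following is a minimal sketch of the lazy GPU loading and API-fallback behaviour described above, assuming the `transformers` and `huggingface_hub` packages from `requirements.txt`; the `generate` helper and its structure are illustrative, not the actual `app.py` API.

```python
import os

import torch
from huggingface_hub import InferenceClient
from transformers import AutoModelForCausalLM, AutoTokenizer

PRIMARY_MODEL = "Qwen/Qwen2.5-7B-Instruct"

_model = None
_tokenizer = None  # both loaded lazily on first use to keep startup fast


def _load_local():
    """Load the primary model onto the GPU in FP16 (one-time, lazy)."""
    global _model, _tokenizer
    if _model is None:
        _tokenizer = AutoTokenizer.from_pretrained(PRIMARY_MODEL)
        _model = AutoModelForCausalLM.from_pretrained(
            PRIMARY_MODEL,
            torch_dtype=torch.float16,  # FP16 halves vRAM versus FP32
            device_map="auto",          # place weights on the T4 automatically
        )
    return _model, _tokenizer


def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Run local GPU inference, falling back to the HF Inference API on failure."""
    try:
        model, tokenizer = _load_local()
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return tokenizer.decode(output[0], skip_special_tokens=True)
    except (RuntimeError, OSError):
        # e.g. CUDA out-of-memory or a download failure: use the hosted API
        client = InferenceClient(model=PRIMARY_MODEL, token=os.environ.get("HF_TOKEN"))
        return client.text_generation(prompt, max_new_tokens=max_new_tokens)
```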
#### Mobile Optimizations

- Reduce max tokens to 800 (MOBILE_MAX_TOKENS)
- Shorten the timeout to 15s (MOBILE_TIMEOUT)
- Implement progressive loading
- Use a touch-optimized UI

### Monitoring

#### Health Checks

- Application health endpoint: `/health` (see the sketch at the end of these notes)
- Database connectivity check
- Cache hit rate monitoring
- Response time tracking

#### Logging

- Use structured logging (structlog)
- Log levels: DEBUG (dev), INFO (prod)
- Monitor error rates
- Track performance metrics

### Troubleshooting

#### Common Issues

**Issue**: Out of memory errors
- **Solution**: Reduce MAX_WORKERS and implement request queuing

**Issue**: Slow responses
- **Solution**: Enable aggressive caching and use response streaming

**Issue**: Model loading failures
- **Solution**: Fall back to the HF Inference API instead of local models

**Issue**: Session data loss
- **Solution**: Implement proper persistence with SQLite backups

### Scaling Considerations

#### For Production

1. **Horizontal Scaling**: Deploy multiple instances
2. **Caching Layer**: Add Redis for shared session data
3. **Load Balancing**: Use HF Spaces' built-in load balancing
4. **CDN**: Serve static assets via a CDN
5. **Database**: Consider PostgreSQL for production

#### Migration Path

- **Phase 1**: MVP on NVIDIA T4 Medium with local models (current)
- **Phase 2**: Scale to multiple workers
- **Phase 3**: Add a shared caching layer and a production database
- **Phase 4**: Enterprise deployment with managed infrastructure
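As referenced in the Monitoring section, here is a minimal sketch of a `/health` endpoint that also checks database connectivity. It assumes FastAPI and uvicorn are available and that `app.py` exposes its Gradio app as `demo` (that name, and the module layout, are assumptions for illustration).

```python
import os
import sqlite3
import time

import gradio as gr
from fastapi import FastAPI

from app import demo  # assumed: the Gradio app object defined in app.py

DB_PATH = os.environ.get("DB_PATH", "sessions.db")

api = FastAPI()


@api.get("/health")
def health() -> dict:
    """Report liveness plus SQLite session-store connectivity."""
    db_ok = True
    try:
        with sqlite3.connect(DB_PATH, timeout=2) as conn:
            conn.execute("SELECT 1")
    except sqlite3.Error:
        db_ok = False
    return {
        "status": "ok" if db_ok else "degraded",
        "database": db_ok,
        "timestamp": time.time(),
    }


# Mount the Gradio UI at the root path so the app and /health share port 7860.
api = gr.mount_gradio_app(api, demo, path="/")

# Run with: uvicorn health_app:api --host 0.0.0.0 --port 7860
```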