# Deployment Notes
## Hugging Face Spaces Deployment
### NVIDIA T4 Medium Configuration
This MVP is optimized for **NVIDIA T4 Medium** GPU deployment on Hugging Face Spaces.
#### Hardware Specifications
- **GPU**: NVIDIA T4 (persistent, always available)
- **vCPU**: 8 cores
- **RAM**: 30GB
- **vRAM**: 16GB
- **Storage**: ~20GB
- **Network**: Shared infrastructure
#### Resource Capacity
- **GPU Memory**: 16GB vRAM (fits the FP16 primary model, with limited headroom)
- **System Memory**: 30GB RAM (excellent for caching and processing)
- **CPU**: 8 vCPU (good for parallel operations)
### Environment Variables
Required environment variables for deployment:
```bash
HF_TOKEN=your_huggingface_token_here
HF_HOME=/tmp/huggingface
MAX_WORKERS=4
CACHE_TTL=3600
DB_PATH=sessions.db
FAISS_INDEX_PATH=embeddings.faiss
SESSION_TIMEOUT=3600
MAX_SESSION_SIZE_MB=10
MOBILE_MAX_TOKENS=800
MOBILE_TIMEOUT=15000
GRADIO_PORT=7860
GRADIO_HOST=0.0.0.0
LOG_LEVEL=INFO
```
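The application is expected to read these at startup. A minimal sketch of doing so in Python, assuming the variable names above (the `load_config` helper and its defaults are illustrative, not an existing module):
```python
# Illustrative startup config loader -- reads the variables documented above.
import os

def load_config() -> dict:
    return {
        "hf_token": os.environ.get("HF_TOKEN"),                      # required
        "hf_home": os.environ.get("HF_HOME", "/tmp/huggingface"),
        "max_workers": int(os.environ.get("MAX_WORKERS", 4)),
        "cache_ttl": int(os.environ.get("CACHE_TTL", 3600)),
        "db_path": os.environ.get("DB_PATH", "sessions.db"),
        "faiss_index_path": os.environ.get("FAISS_INDEX_PATH", "embeddings.faiss"),
        "session_timeout": int(os.environ.get("SESSION_TIMEOUT", 3600)),
        "max_session_size_mb": int(os.environ.get("MAX_SESSION_SIZE_MB", 10)),
        "mobile_max_tokens": int(os.environ.get("MOBILE_MAX_TOKENS", 800)),
        "mobile_timeout_ms": int(os.environ.get("MOBILE_TIMEOUT", 15000)),
        "port": int(os.environ.get("GRADIO_PORT", 7860)),
        "host": os.environ.get("GRADIO_HOST", "0.0.0.0"),
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
    }

config = load_config()
if not config["hf_token"]:
    raise RuntimeError("HF_TOKEN must be set before deployment")
```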
### Space Configuration
Create a `README.md` in the HF Space with:
```yaml
---
title: AI Research Assistant MVP
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
---
```
### Deployment Steps
1. **Clone/Setup Repository**
```bash
git clone your-repo
cd Research_Assistant
```
2. **Install Dependencies**
```bash
bash install.sh
# or
pip install -r requirements.txt
```
3. **Test Installation**
```bash
python test_setup.py
# or
bash quick_test.sh
```
4. **Run Locally**
```bash
python app.py
```
5. **Deploy to HF Spaces**
- Push to GitHub
- Connect to HF Spaces
- Select NVIDIA T4 Medium GPU hardware
- Deploy
### Resource Management
#### Memory Limits
- **Base Python**: ~100MB
- **Gradio**: ~50MB
- **Models (loaded on GPU)**: ~14-16GB vRAM
- Primary model (Qwen/Qwen2.5-7B): ~14GB
- Embedding model: ~500MB
- Classification models: ~500MB each
- **System RAM**: ~2-4GB for caching and processing
- **Cache**: ~500MB-1GB max
**GPU Memory Budget**: ~16GB vRAM (the FP16 model footprint sits close to this limit, so monitor usage)
**System RAM Budget**: 30GB (plenty of headroom)
#### Strategies
- **Local GPU Model Loading**: Models loaded on GPU for faster inference
- **Lazy Loading**: Models loaded on-demand to speed up startup
- **GPU Memory Management**: Automatic device placement with FP16 precision
- **Caching**: Aggressive caching with 30GB RAM available
- **Response Streaming**: Stream responses to reduce memory during generation
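A sketch of the lazy-loading and FP16 placement strategy using the standard `transformers` API (the module-level cache and `get_model` helper are illustrative, not the project's actual code):
```python
# Lazy FP16 loading: the 7B model is only pulled into vRAM on first use,
# which keeps Space startup fast and avoids paying the load cost twice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"
_model, _tokenizer = None, None

def get_model():
    """Load the primary model on demand; later calls reuse the cached copy."""
    global _model, _tokenizer
    if _model is None:
        _tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        _model = AutoModelForCausalLM.from_pretrained(
            MODEL_ID,
            torch_dtype=torch.float16,   # FP16 halves the weight footprint
            device_map="auto",           # place weights on the T4 automatically
        )
    return _model, _tokenizer
```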
### Performance Optimization
#### For NVIDIA T4 GPU
1. **Local Model Loading**: Models run locally on GPU (faster than API)
- Primary model: Qwen/Qwen2.5-7B-Instruct (~14GB vRAM)
- Embedding model: sentence-transformers/all-MiniLM-L6-v2 (~500MB)
2. **GPU Acceleration**: All inference runs on GPU
3. **Parallel Processing**: 4 workers (MAX_WORKERS=4) for concurrent requests
4. **Fallback to API**: Automatically falls back to HF Inference API if local models fail
5. **Request Queuing**: Built-in async request handling
6. **Response Streaming**: Implemented for efficient memory usage
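A rough sketch of the local-first, API-fallback flow with streamed output, assuming `huggingface_hub.InferenceClient` for the fallback and reusing the lazy `get_model` loader from the sketch above (the `generate` wrapper itself is illustrative):
```python
# Stream tokens from the local GPU model; fall back to the HF Inference API
# if local generation fails (e.g. the model could not be loaded).
import os
from threading import Thread

from huggingface_hub import InferenceClient
from transformers import TextIteratorStreamer

def generate(prompt: str, max_new_tokens: int = 800):
    try:
        model, tokenizer = get_model()                       # lazy loader sketched above
        streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        Thread(target=model.generate,
               kwargs=dict(**inputs, max_new_tokens=max_new_tokens,
                           streamer=streamer)).start()
        yield from streamer                                  # tokens as they arrive
    except Exception:
        client = InferenceClient(model="Qwen/Qwen2.5-7B-Instruct",
                                 token=os.environ.get("HF_TOKEN"))
        yield client.text_generation(prompt, max_new_tokens=max_new_tokens)
```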
#### Mobile Optimizations
- Reduce max tokens to 800
- Shorten timeout to 15s
- Implement progressive loading
- Use touch-optimized UI
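A small sketch of how these limits might be applied, using a simple user-agent check (both the heuristic and the desktop defaults are illustrative):
```python
# Clamp generation settings for clients that look like mobile devices.
import os

def generation_params(user_agent: str) -> dict:
    is_mobile = any(m in user_agent.lower() for m in ("android", "iphone", "mobile"))
    return {
        "max_new_tokens": int(os.environ.get("MOBILE_MAX_TOKENS", 800)) if is_mobile else 1500,
        "timeout_s": int(os.environ.get("MOBILE_TIMEOUT", 15000)) / 1000 if is_mobile else 60,
    }
```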
### Monitoring
#### Health Checks
- Application health endpoint: `/health`
- Database connectivity check
- Cache hit rate monitoring
- Response time tracking
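One way to expose `/health` next to the Gradio UI is to mount the app on FastAPI; the sketch below assumes that pattern (`check_db` and the response shape are illustrative):
```python
# Serve /health alongside the Gradio UI by mounting it on a FastAPI app.
import sqlite3
import gradio as gr
from fastapi import FastAPI

api = FastAPI()

def check_db(path: str = "sessions.db") -> bool:
    try:
        sqlite3.connect(path).execute("SELECT 1")
        return True
    except sqlite3.Error:
        return False

@api.get("/health")
def health():
    return {"status": "ok" if check_db() else "degraded"}

demo = gr.Blocks()                      # the real UI is built in app.py
app = gr.mount_gradio_app(api, demo, path="/")
# run with: uvicorn app:app --host 0.0.0.0 --port 7860
```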
#### Logging
- Use structured logging (structlog)
- Log levels: DEBUG (dev), INFO (prod)
- Monitor error rates
- Track performance metrics
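A minimal structlog setup matching the notes above, with the level driven by `LOG_LEVEL` (the processor chain is one reasonable choice, not prescribed by the project):
```python
# Structured JSON logging; log level comes from the LOG_LEVEL variable.
import logging
import os
import structlog

level = getattr(logging, os.environ.get("LOG_LEVEL", "INFO"))
structlog.configure(
    wrapper_class=structlog.make_filtering_bound_logger(level),
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ],
)

log = structlog.get_logger()
log.info("request_completed", latency_ms=412, cache_hit=True)
```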
### Troubleshooting
#### Common Issues
**Issue**: Out of memory errors
- **Solution**: Reduce max_workers, implement request queuing
**Issue**: Slow responses
- **Solution**: Enable aggressive caching, use streaming
**Issue**: Model loading failures
- **Solution**: Use HF Inference API instead of local models
**Issue**: Session data loss
- **Solution**: Implement proper persistence with SQLite backup
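For the last issue, a sketch of SQLite-backed session persistence (the table schema and helper names are illustrative):
```python
# Persist session state in SQLite so it survives restarts of the Space.
import json
import sqlite3
import time

def init_db(path: str = "sessions.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path, check_same_thread=False)
    conn.execute("""CREATE TABLE IF NOT EXISTS sessions (
                        id TEXT PRIMARY KEY,
                        data TEXT NOT NULL,
                        updated_at REAL NOT NULL)""")
    return conn

def save_session(conn: sqlite3.Connection, session_id: str, state: dict) -> None:
    conn.execute("INSERT OR REPLACE INTO sessions VALUES (?, ?, ?)",
                 (session_id, json.dumps(state), time.time()))
    conn.commit()

def load_session(conn: sqlite3.Connection, session_id: str):
    row = conn.execute("SELECT data FROM sessions WHERE id = ?",
                       (session_id,)).fetchone()
    return json.loads(row[0]) if row else None
```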
### Scaling Considerations
#### For Production
1. **Horizontal Scaling**: Deploy multiple instances
2. **Caching Layer**: Add Redis for shared session data (see the sketch after this list)
3. **Load Balancing**: Use HF Spaces built-in load balancer
4. **CDN**: Static assets via CDN
5. **Database**: Consider PostgreSQL for production
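For point 2, a sketch of a shared Redis session cache using the standard `redis` client (the host name, key scheme, and TTL are illustrative):
```python
# Share session data across instances through Redis instead of per-process memory.
import json
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)

def cache_session(session_id: str, state: dict, ttl: int = 3600) -> None:
    r.setex(f"session:{session_id}", ttl, json.dumps(state))

def get_cached_session(session_id: str):
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```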
#### Migration Path
- **Phase 1**: MVP on a persistent T4 Medium GPU (current)
- **Phase 2**: Upgrade to larger GPU hardware as model sizes grow
- **Phase 3**: Scale to multiple workers
- **Phase 4**: Enterprise deployment with managed infrastructure