# Deployment Notes

## Hugging Face Spaces Deployment

### NVIDIA T4 Medium Configuration
This MVP is optimized for **NVIDIA T4 Medium** GPU deployment on Hugging Face Spaces.

#### Hardware Specifications
- **GPU**: NVIDIA T4 (persistent, always available)
- **vCPU**: 8 cores
- **RAM**: 30GB
- **vRAM**: 16GB
- **Storage**: ~20GB
- **Network**: Shared infrastructure

#### Resource Capacity
- **GPU Memory**: 16GB vRAM (sufficient for local model loading in FP16)
- **System Memory**: 30GB RAM (excellent for caching and processing)
- **CPU**: 8 vCPU (good for parallel operations)
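
A quick way to confirm a Space actually received this hardware is to log the detected GPU and memory at startup. A minimal sketch using `torch` and `psutil` (both assumed to already be in `requirements.txt`):

```python
import torch
import psutil

def log_hardware():
    """Print the GPU and system memory actually available to the Space."""
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB vRAM")
    else:
        print("No CUDA device detected - falling back to CPU / Inference API")
    vm = psutil.virtual_memory()
    print(f"System RAM: {vm.total / 1024**3:.1f} GB total, {vm.available / 1024**3:.1f} GB free")
    print(f"CPU cores: {psutil.cpu_count()}")

if __name__ == "__main__":
    log_hardware()
```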

### Environment Variables
Required environment variables for deployment:

```bash
HF_TOKEN=your_huggingface_token_here
HF_HOME=/tmp/huggingface
MAX_WORKERS=4
CACHE_TTL=3600
DB_PATH=sessions.db
FAISS_INDEX_PATH=embeddings.faiss
SESSION_TIMEOUT=3600
MAX_SESSION_SIZE_MB=10
MOBILE_MAX_TOKENS=800
MOBILE_TIMEOUT=15000
GRADIO_PORT=7860
GRADIO_HOST=0.0.0.0
LOG_LEVEL=INFO
```
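
On Spaces, `HF_TOKEN` should be set as a private Space secret rather than committed to the repo. The variables can then be read centrally at startup; a minimal sketch (the `Config` class name is illustrative, and the defaults mirror the values above):

```python
import os
from dataclasses import dataclass

@dataclass
class Config:
    """Central, typed view of the deployment environment variables."""
    hf_token: str = os.getenv("HF_TOKEN", "")
    max_workers: int = int(os.getenv("MAX_WORKERS", "4"))
    cache_ttl: int = int(os.getenv("CACHE_TTL", "3600"))
    db_path: str = os.getenv("DB_PATH", "sessions.db")
    faiss_index_path: str = os.getenv("FAISS_INDEX_PATH", "embeddings.faiss")
    session_timeout: int = int(os.getenv("SESSION_TIMEOUT", "3600"))
    max_session_size_mb: int = int(os.getenv("MAX_SESSION_SIZE_MB", "10"))
    mobile_max_tokens: int = int(os.getenv("MOBILE_MAX_TOKENS", "800"))
    mobile_timeout_ms: int = int(os.getenv("MOBILE_TIMEOUT", "15000"))
    gradio_port: int = int(os.getenv("GRADIO_PORT", "7860"))
    gradio_host: str = os.getenv("GRADIO_HOST", "0.0.0.0")
    log_level: str = os.getenv("LOG_LEVEL", "INFO")

config = Config()
```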

### Space Configuration
Create a `README.md` in the HF Space with:

```yaml
---
title: AI Research Assistant MVP
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
---
```

### Deployment Steps

1. **Clone/Setup Repository**
   ```bash
   git clone your-repo
   cd Research_Assistant
   ```

2. **Install Dependencies**
   ```bash
   bash install.sh
   # or
   pip install -r requirements.txt
   ```

3. **Test Installation**
   ```bash
   python test_setup.py
   # or
   bash quick_test.sh
   ```

4. **Run Locally**
   ```bash
   python app.py
   ```

5. **Deploy to HF Spaces**
   - Push to GitHub
   - Connect to HF Spaces
   - Select NVIDIA T4 Medium GPU hardware
   - Deploy
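
The push/connect steps can also be scripted with `huggingface_hub` instead of the web UI. A rough sketch (the repo id is a placeholder; the hardware string follows the `SpaceHardware` values used by `huggingface_hub`, so verify it against the current library docs):

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment
repo_id = "your-username/research-assistant-mvp"  # placeholder

# Create the Space if it does not exist yet (Docker SDK, matching the README header above)
api.create_repo(repo_id=repo_id, repo_type="space", space_sdk="docker", exist_ok=True)

# Upload the working directory, skipping local artifacts
api.upload_folder(
    folder_path=".",
    repo_id=repo_id,
    repo_type="space",
    ignore_patterns=["*.db", "*.faiss", "__pycache__/*"],
)

# Request the paid T4 Medium hardware tier
api.request_space_hardware(repo_id=repo_id, hardware="t4-medium")
```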

### Resource Management

#### Memory Limits
- **Base Python**: ~100MB
- **Gradio**: ~50MB
- **Models (loaded on GPU)**: ~14-16GB vRAM
  - Primary model (Qwen/Qwen2.5-7B-Instruct): ~14GB
  - Embedding model: ~500MB
  - Classification models: ~500MB each
- **System RAM**: ~2-4GB for caching and processing
- **Cache**: ~500MB-1GB max

**GPU Memory Budget**: ~16GB vRAM (the FP16 models fit, but with limited headroom; monitor usage)
**System RAM Budget**: 30GB (plenty of headroom)
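
To verify this budget at runtime, the allocated and free vRAM can be logged after the models are loaded; a small sketch:

```python
import torch

def log_gpu_budget():
    """Log how much of the T4's vRAM the loaded models actually consume."""
    if not torch.cuda.is_available():
        return
    free, total = torch.cuda.mem_get_info()
    allocated = torch.cuda.memory_allocated()
    print(
        f"vRAM: {allocated / 1024**3:.1f} GB allocated by PyTorch, "
        f"{free / 1024**3:.1f} GB free of {total / 1024**3:.1f} GB"
    )
```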

#### Strategies
- **Local GPU Model Loading**: Models loaded on GPU for faster inference
- **Lazy Loading**: Models loaded on-demand to speed up startup
- **GPU Memory Management**: Automatic device placement with FP16 precision
- **Caching**: Aggressive caching with 30GB RAM available
- **Response Streaming**: Stream tokens during generation to reduce peak memory use
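
The lazy-loading and FP16 placement strategies can be combined in a single loader. A minimal sketch with `transformers` (requires `accelerate` for `device_map="auto"`; the module-level cache is illustrative, not the project's actual loader):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

_model, _tokenizer = None, None  # module-level cache so the model loads only once

def get_model(model_id: str = "Qwen/Qwen2.5-7B-Instruct"):
    """Load the primary model lazily, in FP16, placed on the GPU automatically."""
    global _model, _tokenizer
    if _model is None:
        _tokenizer = AutoTokenizer.from_pretrained(model_id)
        _model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float16,   # halves the vRAM footprint vs FP32
            device_map="auto",           # let accelerate place layers on the T4
            low_cpu_mem_usage=True,
        )
    return _model, _tokenizer
```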

### Performance Optimization

#### For NVIDIA T4 GPU
1. **Local Model Loading**: Models run locally on GPU (faster than API)
   - Primary model: Qwen/Qwen2.5-7B-Instruct (~14GB vRAM)
   - Embedding model: sentence-transformers/all-MiniLM-L6-v2 (~500MB)
2. **GPU Acceleration**: All inference runs on GPU
3. **Parallel Processing**: 4 workers (MAX_WORKERS=4) for concurrent requests
4. **Fallback to API**: Automatically falls back to HF Inference API if local models fail
5. **Request Queuing**: Built-in async request handling
6. **Response Streaming**: Implemented for efficient memory usage
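
The API fallback in point 4 amounts to catching load or inference errors and routing the prompt to the hosted Inference API instead. A hedged sketch using `huggingface_hub.InferenceClient` (error handling simplified; `generate` is an illustrative name and reuses the lazy loader sketched earlier):

```python
import os
from huggingface_hub import InferenceClient

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"
_client = InferenceClient(model=MODEL_ID, token=os.getenv("HF_TOKEN"))

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Try the local GPU model first; fall back to the HF Inference API."""
    try:
        model, tokenizer = get_model()  # lazy loader sketched above
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        new_tokens = output[0][inputs["input_ids"].shape[-1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)
    except (RuntimeError, OSError) as exc:  # e.g. CUDA OOM or download failure
        print(f"Local generation failed ({exc}); falling back to Inference API")
        return _client.text_generation(prompt, max_new_tokens=max_new_tokens)
```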

#### Mobile Optimizations
- Reduce max tokens to 800
- Shorten timeout to 15s
- Implement progressive loading
- Use touch-optimized UI
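
One way to apply these limits is to inspect the user agent on each request and switch generation parameters accordingly. A rough sketch, assuming the Gradio handler receives a `gr.Request` (the heuristic and constants are illustrative):

```python
import gradio as gr

MOBILE_MAX_TOKENS = 800
MOBILE_TIMEOUT_S = 15
DESKTOP_MAX_TOKENS = 2000
DESKTOP_TIMEOUT_S = 60

def generation_limits(request: gr.Request) -> tuple[int, int]:
    """Return (max_tokens, timeout_s) based on a simple user-agent check."""
    user_agent = (request.headers.get("user-agent", "") if request else "").lower()
    if any(token in user_agent for token in ("mobile", "android", "iphone")):
        return MOBILE_MAX_TOKENS, MOBILE_TIMEOUT_S
    return DESKTOP_MAX_TOKENS, DESKTOP_TIMEOUT_S
```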

### Monitoring

#### Health Checks
- Application health endpoint: `/health`
- Database connectivity check
- Cache hit rate monitoring
- Response time tracking
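
One way to expose the `/health` endpoint is to mount the Gradio UI onto a FastAPI application that provides it. A minimal sketch, assuming the UI is a Blocks object (the database probe mirrors the `DB_PATH` default above):

```python
import sqlite3
import gradio as gr
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health():
    """Liveness check plus a cheap SQLite connectivity probe."""
    db_ok = True
    try:
        with sqlite3.connect("sessions.db", timeout=1) as conn:
            conn.execute("SELECT 1")
    except sqlite3.Error:
        db_ok = False
    return {"status": "ok" if db_ok else "degraded", "database": db_ok}

demo = gr.Blocks()  # placeholder; the real UI is defined in app.py
app = gr.mount_gradio_app(app, demo, path="/")
# Run with: uvicorn app:app --host 0.0.0.0 --port 7860
```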

#### Logging
- Use structured logging (structlog)
- Log levels: DEBUG (dev), INFO (prod)
- Monitor error rates
- Track performance metrics
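
A minimal `structlog` configuration matching these points (JSON output, level taken from `LOG_LEVEL`); treat the processor list as a starting point rather than the project's actual setup:

```python
import logging
import os
import structlog

level = getattr(logging, os.getenv("LOG_LEVEL", "INFO").upper(), logging.INFO)

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.make_filtering_bound_logger(level),
)

log = structlog.get_logger()
log.info("request_completed", route="/query", duration_ms=412, cache_hit=True)
```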

### Troubleshooting

#### Common Issues

**Issue**: Out of memory errors
- **Solution**: Reduce max_workers, implement request queuing

**Issue**: Slow responses
- **Solution**: Enable aggressive caching, use streaming

**Issue**: Model loading failures
- **Solution**: Use HF Inference API instead of local models

**Issue**: Session data loss
- **Solution**: Implement proper persistence with SQLite backup
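
For the last issue, the standard library's `sqlite3` backup API is enough for periodic snapshots. A hedged sketch (paths and interval are illustrative; on Spaces the backup file would still need to be copied off the ephemeral disk to survive restarts):

```python
import sqlite3
import threading

DB_PATH = "sessions.db"
BACKUP_PATH = "sessions.backup.db"
BACKUP_INTERVAL_S = 300  # every 5 minutes; adjust as needed

def backup_sessions():
    """Snapshot the live session database using the SQLite backup API, then reschedule."""
    src = sqlite3.connect(DB_PATH)
    dst = sqlite3.connect(BACKUP_PATH)
    try:
        src.backup(dst)
    finally:
        src.close()
        dst.close()
    threading.Timer(BACKUP_INTERVAL_S, backup_sessions).start()

backup_sessions()  # call once at startup; reschedules itself
```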

### Scaling Considerations

#### For Production
1. **Horizontal Scaling**: Deploy multiple instances
2. **Caching Layer**: Add Redis for shared session data
3. **Load Balancing**: Use HF Spaces built-in load balancer
4. **CDN**: Static assets via CDN
5. **Database**: Consider PostgreSQL for production
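
If a shared cache is added (point 2), session payloads can be written through to Redis using the existing `CACHE_TTL`. A rough sketch with `redis-py` (host and key scheme are placeholders):

```python
import json
import os
import redis

r = redis.Redis(host=os.getenv("REDIS_HOST", "localhost"), port=6379, decode_responses=True)
CACHE_TTL = int(os.getenv("CACHE_TTL", "3600"))

def save_session(session_id: str, data: dict) -> None:
    """Store session data shared across instances, expiring after CACHE_TTL seconds."""
    r.setex(f"session:{session_id}", CACHE_TTL, json.dumps(data))

def load_session(session_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```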

#### Migration Path
- **Phase 1**: MVP on a single T4 Medium Space (current)
- **Phase 2**: Upgrade to a larger GPU (e.g. A10G or A100) if the local models outgrow the T4
- **Phase 3**: Scale to multiple workers
- **Phase 4**: Enterprise deployment with managed infrastructure