# Compatibility Notes

## Critical Version Constraints

### Python
- **Python 3.9-3.11**: HF Spaces typically supports these versions
- Avoid Python 3.12+ for maximum compatibility

### PyTorch
- **PyTorch 2.1.x**: Latest stable with good HF ecosystem support
- CPU-only builds for ZeroGPU deployments

### Transformers
- **Transformers 4.35.x**: Latest features with stability
- Ensures compatibility with latest HF models

### Gradio
- **Gradio 4.x**: Current major version with mobile optimizations
- Required for mobile-responsive interface
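
Taken together, a `requirements.txt` consistent with these constraints might look like the sketch below; the exact patch versions are illustrative, not prescriptive.

```text
# Illustrative pins; adjust patch versions as needed
--extra-index-url https://download.pytorch.org/whl/cpu
torch==2.1.2+cpu
transformers==4.35.2
gradio==4.19.2
faiss-cpu==1.7.4
sentence-transformers==2.2.2
numpy==1.24.4
```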

## HF Spaces Specific Considerations

### ZeroGPU Environment
- **Limited GPU memory**: CPU-optimized package builds are used
- All models run on the CPU
- Use `faiss-cpu` instead of `faiss-gpu` (a minimal index sketch follows this list)
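
A minimal sketch of building a CPU-only FAISS index, assuming `faiss-cpu` is installed; the 384 dimension matches the MiniLM embedding model recommended below.

```python
import faiss  # provided by the faiss-cpu package
import numpy as np

dim = 384  # embedding size of all-MiniLM-L6-v2
index = faiss.IndexFlatIP(dim)  # exact inner-product search, CPU-only

vectors = np.random.rand(100, dim).astype("float32")
faiss.normalize_L2(vectors)  # with unit vectors, inner product == cosine
index.add(vectors)

scores, ids = index.search(vectors[:1], k=5)  # top-5 neighbours of the first vector
```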

### Storage Limits
- **Limited persistent storage**: Efficient caching is crucial
- Keep session data compact to minimize storage usage
- Apply aggressive cleanup policies (a compression + TTL sketch follows this list)
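
One way to keep persisted sessions small is to compress the serialized state and expire it on a TTL; a minimal sketch using only the standard library (function names are hypothetical):

```python
import json
import time
import zlib

def pack_session(state: dict) -> bytes:
    """Serialize and compress session state before persisting it."""
    payload = json.dumps(state, separators=(",", ":")).encode("utf-8")
    return zlib.compress(payload, level=6)

def unpack_session(blob: bytes) -> dict:
    """Inverse of pack_session."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

def is_expired(created_at: float, ttl_seconds: float = 3600.0) -> bool:
    """TTL check for a periodic cleanup pass over stored sessions."""
    return time.time() - created_at > ttl_seconds
```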

### Network Restrictions
- **External API calls may be restricted**
- All LLM calls must go through the Hugging Face Inference API (see the sketch after this list)
- Avoid other external HTTP requests in production
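
A minimal sketch of routing generation through the HF Inference API with `huggingface_hub.InferenceClient`; the model ID and token budget here are illustrative.

```python
import os
from huggingface_hub import InferenceClient

# Model ID is illustrative; use whichever endpoint the Space targets.
client = InferenceClient(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    token=os.environ.get("HF_TOKEN"),
)

def generate(prompt: str) -> str:
    # All traffic goes through the HF Inference API, so no other
    # external HTTP requests are needed at inference time.
    return client.text_generation(prompt, max_new_tokens=800)
```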

## Model Selection

### For ZeroGPU
- **Embedding model**: `sentence-transformers/all-MiniLM-L6-v2` (384d, fast)
- **Primary LLM**: call a Hugging Face Inference API endpoint rather than loading weights locally
- **Avoid local loading of large models** (an embedding sketch follows this list)
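
A minimal embedding sketch with the recommended model; `normalize_embeddings=True` makes dot products equal to cosine similarity, which pairs with the inner-product FAISS index sketched earlier.

```python
from sentence_transformers import SentenceTransformer

# Small 384-dimensional model; loads and runs comfortably on CPU.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

texts = ["What is ZeroGPU?", "ZeroGPU Spaces share GPU time across users."]
vectors = embedder.encode(texts, normalize_embeddings=True)
print(vectors.shape)  # (2, 384)
```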

### Memory Optimization
- Limit concurrent requests
- Use streaming responses (see the Gradio sketch after this list)
- Implement response compression
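
Gradio treats a generator function as a streaming handler, re-rendering the output on every `yield`; a minimal sketch (the token loop is a placeholder for a real streamed LLM call):

```python
import time
import gradio as gr

def stream_answer(prompt: str):
    # Placeholder loop; a real app would stream tokens from the
    # HF Inference API instead.
    answer = ""
    for token in ["This ", "is ", "a ", "streamed ", "response."]:
        answer += token
        time.sleep(0.05)
        yield answer  # Gradio updates the output on each yield

demo = gr.Interface(fn=stream_answer, inputs="text", outputs="text")

if __name__ == "__main__":
    demo.launch()
```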

## Performance Considerations

### Cache Strategy
- In-memory caching for active sessions
- Aggressive cache eviction (LRU)
- TTL-based expiration (a combined LRU + TTL sketch follows this list)
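
A combined LRU + TTL cache can be built on `collections.OrderedDict`; a minimal standard-library sketch (the class name and defaults are hypothetical):

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Bounded LRU cache whose entries also expire after a TTL."""

    def __init__(self, maxsize: int = 256, ttl: float = 900.0):
        self.maxsize = maxsize
        self.ttl = ttl
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        stored_at, value = item
        if time.monotonic() - stored_at > self.ttl:
            del self._data[key]      # expired: drop and report a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = (time.monotonic(), value)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used
```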

### Mobile Optimization
- Reduced max-token budget on mobile (800 vs 2000)
- Shorter request timeout (15 s vs 30 s); see the profile sketch after this list
- Lazy loading of UI components
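
The mobile/desktop split can live in a small profile table keyed by a user-agent check; a sketch using the numbers above (the detection logic is deliberately naive):

```python
GENERATION_PROFILES = {
    "desktop": {"max_tokens": 2000, "timeout_s": 30},
    "mobile": {"max_tokens": 800, "timeout_s": 15},
}

def profile_for(user_agent: str) -> dict:
    """Pick a generation profile; real detection would be more robust."""
    is_mobile = "Mobile" in user_agent or "Android" in user_agent
    return GENERATION_PROFILES["mobile" if is_mobile else "desktop"]
```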

## Dependencies Compatibility Matrix

| Package | Version Range | Notes |
|---------|---------------|-------|
| Python | 3.9-3.11 | HF Spaces supported versions |
| PyTorch | 2.1.x | CPU version |
| Transformers | 4.35.x | Latest stable |
| Gradio | 4.x | Mobile support |
| FAISS | CPU-only | `faiss-cpu` package; no GPU support |
| NumPy | 1.24.x | Pinned below 2.0 to avoid ABI breaks |

## Known Issues & Workarounds

### Issue: FAISS GPU Not Available
**Solution**: Use `faiss-cpu` in requirements.txt

### Issue: Local Model Loading Exceeds Memory
**Solution**: Use HF Inference API instead of local loading

### Issue: Session Storage Limits
**Solution**: Implement data compression and TTL-based cleanup

### Issue: Concurrent Request Limits
**Solution**: Queue requests behind a bounded worker pool (`max_workers`); see the sketch below
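
A bounded `ThreadPoolExecutor` gives the queue-plus-`max_workers` behaviour described above: excess submissions wait in the executor's internal queue instead of running concurrently. A minimal sketch with a stubbed inference call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_inference(prompt: str) -> str:
    time.sleep(0.1)  # stand-in for an HF Inference API call
    return f"answer to: {prompt}"

# At most 4 requests run at once; the rest wait in the pool's queue.
executor = ThreadPoolExecutor(max_workers=4)

def handle_request(prompt: str) -> str:
    return executor.submit(run_inference, prompt).result(timeout=30)
```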

## Testing Recommendations

1. Test in the ZeroGPU environment before production
2. Verify memory usage stays under 512 MB (see the snippet below)
3. Test mobile responsiveness
4. Validate cache efficiency (target: >60% hit rate)
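
A quick way to check the 512 MB budget in tests, assuming `psutil` is available in the test environment:

```python
import os
import psutil  # assumed available as a test-only dependency

def rss_mb() -> float:
    """Resident memory of the current process in megabytes."""
    return psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)

assert rss_mb() < 512, f"memory budget exceeded: {rss_mb():.0f} MB"
```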