Deployment Notes
Hugging Face Spaces Deployment
NVIDIA T4 Medium Configuration
This MVP is optimized for NVIDIA T4 Medium GPU deployment on Hugging Face Spaces.
Hardware Specifications
- GPU: NVIDIA T4 (persistent, always available)
- vCPU: 8 cores
- RAM: 30GB
- vRAM: 16GB
- Storage: ~20GB
- Network: Shared infrastructure
Resource Capacity
- GPU Memory: 16GB vRAM (sufficient for FP16 local model loading)
- System Memory: 30GB RAM (excellent for caching and processing)
- CPU: 8 vCPU (good for parallel operations)
Environment Variables
Required environment variables for deployment:
HF_TOKEN=your_huggingface_token_here
HF_HOME=/tmp/huggingface
MAX_WORKERS=4
CACHE_TTL=3600
DB_PATH=sessions.db
FAISS_INDEX_PATH=embeddings.faiss
SESSION_TIMEOUT=3600
MAX_SESSION_SIZE_MB=10
MOBILE_MAX_TOKENS=800
MOBILE_TIMEOUT=15000
GRADIO_PORT=7860
GRADIO_HOST=0.0.0.0
LOG_LEVEL=INFO
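A minimal sketch of how app.py might read these variables at startup (the defaults shown here are assumptions, not values the app necessarily ships with):

```python
import os

# Read deployment configuration from the environment, falling back to the
# documented defaults when a variable is not set.
HF_TOKEN = os.getenv("HF_TOKEN")                            # required for the HF API fallback
HF_HOME = os.getenv("HF_HOME", "/tmp/huggingface")          # model download/cache location
MAX_WORKERS = int(os.getenv("MAX_WORKERS", "4"))
CACHE_TTL = int(os.getenv("CACHE_TTL", "3600"))             # seconds
DB_PATH = os.getenv("DB_PATH", "sessions.db")
FAISS_INDEX_PATH = os.getenv("FAISS_INDEX_PATH", "embeddings.faiss")
SESSION_TIMEOUT = int(os.getenv("SESSION_TIMEOUT", "3600"))
MAX_SESSION_SIZE_MB = int(os.getenv("MAX_SESSION_SIZE_MB", "10"))
MOBILE_MAX_TOKENS = int(os.getenv("MOBILE_MAX_TOKENS", "800"))
MOBILE_TIMEOUT_MS = int(os.getenv("MOBILE_TIMEOUT", "15000"))
GRADIO_PORT = int(os.getenv("GRADIO_PORT", "7860"))
GRADIO_HOST = os.getenv("GRADIO_HOST", "0.0.0.0")
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
```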
Space Configuration
Create a README.md in the HF Space with:
---
title: AI Research Assistant MVP
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
---
Deployment Steps
1. Clone/Setup Repository
   git clone your-repo
   cd Research_Assistant
2. Install Dependencies
   bash install.sh  # or pip install -r requirements.txt
3. Test Installation
   python test_setup.py  # or bash quick_test.sh
4. Run Locally
   python app.py
5. Deploy to HF Spaces
- Push to GitHub
- Connect to HF Spaces
- Select NVIDIA T4 Medium GPU hardware
- Deploy
Resource Management
Memory Limits
- Base Python: ~100MB
- Gradio: ~50MB
- Models (loaded on GPU): ~14-16GB vRAM
- Primary model (Qwen/Qwen2.5-7B-Instruct): ~14GB
- Embedding model: ~500MB
- Classification models: ~500MB each
- System RAM: ~2-4GB for caching and processing
- Cache: ~500MB-1GB max
GPU Memory Budget: ~16GB vRAM (the FP16 model stack fits, with limited headroom)
System RAM Budget: 30GB (plenty of headroom)
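A quick runtime sanity check of the budget above (assumes torch and psutil are available in the image):

```python
import psutil
import torch

def report_memory_budget() -> None:
    """Print used/total GPU vRAM and system RAM so the budget can be verified at runtime."""
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        total_vram_gb = props.total_memory / 1024**3
        used_vram_gb = torch.cuda.memory_allocated(0) / 1024**3
        print(f"GPU: {props.name}, vRAM used/total: {used_vram_gb:.1f}/{total_vram_gb:.1f} GB")
    else:
        print("No GPU visible; local models will fall back to the HF Inference API")

    ram = psutil.virtual_memory()
    used_ram_gb = (ram.total - ram.available) / 1024**3
    print(f"System RAM used/total: {used_ram_gb:.1f}/{ram.total / 1024**3:.1f} GB")
```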
Strategies
- Local GPU Model Loading: Models loaded on GPU for faster inference
- Lazy Loading: Models loaded on-demand to speed up startup
- GPU Memory Management: Automatic device placement with FP16 precision
- Caching: Aggressive caching with 30GB RAM available
- Response Streaming: Reduces memory use during generation
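A sketch of the lazy-loading and FP16 device-placement strategy with transformers (assumes accelerate is installed for device_map="auto"; the cached loader is an assumption about how app.py is structured):

```python
from functools import lru_cache

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PRIMARY_MODEL = "Qwen/Qwen2.5-7B-Instruct"

@lru_cache(maxsize=1)
def get_primary_model():
    """Load the primary model once, on first use, in FP16 on the GPU."""
    tokenizer = AutoTokenizer.from_pretrained(PRIMARY_MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        PRIMARY_MODEL,
        torch_dtype=torch.float16,   # halves vRAM vs FP32; ~14GB for the 7B model
        device_map="auto",           # places weights on the T4 automatically
    )
    return tokenizer, model
```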
Performance Optimization
For NVIDIA T4 GPU
- Local Model Loading: Models run locally on GPU (faster than API)
- Primary model: Qwen/Qwen2.5-7B-Instruct (~14GB vRAM)
- Embedding model: sentence-transformers/all-MiniLM-L6-v2 (~500MB)
- GPU Acceleration: All inference runs on GPU
- Parallel Processing: 4 workers (MAX_WORKERS=4) for concurrent requests
- Fallback to API: Automatically falls back to HF Inference API if local models fail
- Request Queuing: Built-in async request handling
- Response Streaming: Implemented for efficient memory usage
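A hedged sketch of the local-first / API-fallback path using huggingface_hub's InferenceClient (the exact exception handling and prompt formatting in app.py are assumptions):

```python
import os

import torch
from huggingface_hub import InferenceClient
from transformers import pipeline

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"
_local_pipe = None

def generate(prompt: str, max_new_tokens: int = 800) -> str:
    """Try local GPU inference first; fall back to the hosted HF Inference API on failure."""
    global _local_pipe
    try:
        if _local_pipe is None:
            # device_map="auto" needs accelerate; FP16 keeps the 7B model near ~14GB vRAM.
            _local_pipe = pipeline("text-generation", model=MODEL_ID,
                                   torch_dtype=torch.float16, device_map="auto")
        out = _local_pipe(prompt, max_new_tokens=max_new_tokens, return_full_text=False)
        return out[0]["generated_text"]
    except Exception:
        # Local loading or generation failed (e.g. out of memory): use the Inference API instead.
        client = InferenceClient(model=MODEL_ID, token=os.getenv("HF_TOKEN"))
        return client.text_generation(prompt, max_new_tokens=max_new_tokens)
```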
Mobile Optimizations
- Reduce max tokens to 800
- Shorten timeout to 15s
- Implement progressive loading
- Use touch-optimized UI
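One way the token and timeout limits above could be applied, keyed off the MOBILE_MAX_TOKENS and MOBILE_TIMEOUT variables defined earlier (the mobile-detection flag and desktop defaults are assumptions):

```python
import os

MOBILE_MAX_TOKENS = int(os.getenv("MOBILE_MAX_TOKENS", "800"))
MOBILE_TIMEOUT_S = int(os.getenv("MOBILE_TIMEOUT", "15000")) / 1000  # env value is in ms

def generation_limits(is_mobile: bool) -> dict:
    """Return tighter generation limits for mobile clients, looser defaults otherwise."""
    if is_mobile:
        return {"max_new_tokens": MOBILE_MAX_TOKENS, "timeout": MOBILE_TIMEOUT_S}
    return {"max_new_tokens": 2048, "timeout": 60}  # desktop defaults are assumptions
```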
Monitoring
Health Checks
- Application health endpoint: /health
- Database connectivity check
- Cache hit rate monitoring
- Response time tracking
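Gradio does not expose a /health route on its own; one common pattern is to mount the Gradio UI on a FastAPI app and add the endpoint there. A minimal sketch, with the specific checks as assumptions:

```python
import os
import sqlite3

import gradio as gr
from fastapi import FastAPI

api = FastAPI()

@api.get("/health")
def health() -> dict:
    """Report liveness plus a basic database connectivity check."""
    db_ok = True
    try:
        conn = sqlite3.connect(os.getenv("DB_PATH", "sessions.db"), timeout=2)
        conn.execute("SELECT 1")
        conn.close()
    except sqlite3.Error:
        db_ok = False
    return {"status": "ok" if db_ok else "degraded", "database": db_ok}

demo = gr.Interface(fn=lambda x: x, inputs="text", outputs="text")  # placeholder for the real UI
app = gr.mount_gradio_app(api, demo, path="/")  # serve Gradio and /health from one server
# run with: uvicorn app:app --host 0.0.0.0 --port 7860
```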
Logging
- Use structured logging (structlog)
- Log levels: DEBUG (dev), INFO (prod)
- Monitor error rates
- Track performance metrics
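A minimal structlog configuration matching the level convention above (JSON output is an assumption about the preferred format):

```python
import logging
import os

import structlog

def configure_logging() -> None:
    """Configure structlog with the level taken from LOG_LEVEL (INFO in prod, DEBUG in dev)."""
    level = getattr(logging, os.getenv("LOG_LEVEL", "INFO").upper(), logging.INFO)
    structlog.configure(
        wrapper_class=structlog.make_filtering_bound_logger(level),
        processors=[
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.add_log_level,
            structlog.processors.JSONRenderer(),
        ],
    )

configure_logging()
log = structlog.get_logger()
# log.info("request_served", route="/query", duration_ms=120, cache_hit=True)
```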
Troubleshooting
Common Issues
Issue: Out of memory errors
- Solution: Reduce max_workers, implement request queuing
Issue: Slow responses
- Solution: Enable aggressive caching, use streaming
Issue: Model loading failures
- Solution: Use HF Inference API instead of local models
Issue: Session data loss
- Solution: Implement proper persistence with SQLite backup
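For the session-data-loss case, a small persistence sketch using SQLite's built-in backup API (the backup path and schedule are assumptions):

```python
import sqlite3

def backup_sessions(db_path: str = "sessions.db",
                    backup_path: str = "sessions.backup.db") -> None:
    """Copy the live session database to a backup file using SQLite's online backup API."""
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(backup_path)
    with dst:
        src.backup(dst)   # consistent copy, safe to run while the app is serving requests
    dst.close()
    src.close()
```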
Scaling Considerations
For Production
- Horizontal Scaling: Deploy multiple instances
- Caching Layer: Add Redis for shared session data
- Load Balancing: Use HF Spaces built-in load balancer
- CDN: Static assets via CDN
- Database: Consider PostgreSQL for production
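If a shared caching layer is added, a sketch of session storage in Redis reusing the SESSION_TIMEOUT semantics above (host, port, and key scheme are assumptions):

```python
import json
import os

import redis

r = redis.Redis(host=os.getenv("REDIS_HOST", "localhost"), port=6379, decode_responses=True)
SESSION_TIMEOUT = int(os.getenv("SESSION_TIMEOUT", "3600"))

def save_session(session_id: str, data: dict) -> None:
    """Store session state with an expiry so stale sessions clean themselves up."""
    r.setex(f"session:{session_id}", SESSION_TIMEOUT, json.dumps(data))

def load_session(session_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```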
Migration Path
- Phase 1: MVP on a single NVIDIA T4 Medium Space (current)
- Phase 2: Upgrade to larger GPU hardware for bigger local models
- Phase 3: Scale to multiple workers
- Phase 4: Enterprise deployment with managed infrastructure