
Deployment Notes

Hugging Face Spaces Deployment

NVIDIA T4 Medium Configuration

This MVP is optimized for NVIDIA T4 Medium GPU deployment on Hugging Face Spaces.

Hardware Specifications

  • GPU: NVIDIA T4 (persistent, always available)
  • vCPU: 8 cores
  • RAM: 30GB
  • vRAM: 16GB
  • Storage: ~20GB
  • Network: Shared infrastructure

Resource Capacity

  • GPU Memory: 16GB vRAM (sufficient for local model loading in FP16)
  • System Memory: 30GB RAM (excellent for caching and processing)
  • CPU: 8 vCPU (good for parallel operations)

Environment Variables

Required environment variables for deployment:

HF_TOKEN=your_huggingface_token_here
HF_HOME=/tmp/huggingface
MAX_WORKERS=4
CACHE_TTL=3600
DB_PATH=sessions.db
FAISS_INDEX_PATH=embeddings.faiss
SESSION_TIMEOUT=3600
MAX_SESSION_SIZE_MB=10
MOBILE_MAX_TOKENS=800
MOBILE_TIMEOUT=15000
GRADIO_PORT=7860
GRADIO_HOST=0.0.0.0
LOG_LEVEL=INFO
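
The snippet below is a minimal sketch of how these variables might be read at startup with safe defaults; the names and defaults mirror the list above, but the loader itself is illustrative rather than the app's actual code.

```python
# Illustrative config loader: names and defaults mirror the list above.
import os

HF_TOKEN = os.getenv("HF_TOKEN", "")
HF_HOME = os.getenv("HF_HOME", "/tmp/huggingface")
MAX_WORKERS = int(os.getenv("MAX_WORKERS", "4"))
CACHE_TTL = int(os.getenv("CACHE_TTL", "3600"))              # seconds
DB_PATH = os.getenv("DB_PATH", "sessions.db")
FAISS_INDEX_PATH = os.getenv("FAISS_INDEX_PATH", "embeddings.faiss")
SESSION_TIMEOUT = int(os.getenv("SESSION_TIMEOUT", "3600"))  # seconds
MAX_SESSION_SIZE_MB = int(os.getenv("MAX_SESSION_SIZE_MB", "10"))
MOBILE_MAX_TOKENS = int(os.getenv("MOBILE_MAX_TOKENS", "800"))
MOBILE_TIMEOUT_MS = int(os.getenv("MOBILE_TIMEOUT", "15000"))  # milliseconds
GRADIO_PORT = int(os.getenv("GRADIO_PORT", "7860"))
GRADIO_HOST = os.getenv("GRADIO_HOST", "0.0.0.0")
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
```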

Space Configuration

Create a README.md in the HF Space with the following front matter:

---
title: AI Research Assistant MVP
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
---

Deployment Steps

  1. Clone/Setup Repository

    git clone your-repo
    cd Research_Assistant
    
  2. Install Dependencies

    bash install.sh
    # or
    pip install -r requirements.txt
    
  3. Test Installation

    python test_setup.py
    # or
    bash quick_test.sh
    
  4. Run Locally

    python app.py
    
  5. Deploy to HF Spaces

    • Push to GitHub
    • Connect to HF Spaces
    • Select NVIDIA T4 Medium GPU hardware
    • Deploy

Resource Management

Memory Limits

  • Base Python: ~100MB
  • Gradio: ~50MB
  • Models (loaded on GPU): ~14-16GB vRAM
    • Primary model (Qwen/Qwen2.5-7B-Instruct): ~14GB
    • Embedding model: ~500MB
    • Classification models: ~500MB each
  • System RAM: ~2-4GB for caching and processing
  • Cache: ~500MB-1GB max

GPU Memory Budget: ~16GB vRAM (the FP16 models above fit, though with limited headroom)
System RAM Budget: 30GB (plenty of headroom)
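
A quick way to sanity-check the vRAM budget at runtime is to query the device through PyTorch; this is a sketch assuming torch is installed and a CUDA device is visible.

```python
# Rough runtime check of the GPU budget described above (assumes PyTorch + CUDA).
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3              # ~16 GB on a T4
    allocated_gb = torch.cuda.memory_allocated(0) / 1024**3
    reserved_gb = torch.cuda.memory_reserved(0) / 1024**3
    print(f"{props.name}: {total_gb:.1f} GB total, "
          f"{allocated_gb:.1f} GB allocated, {reserved_gb:.1f} GB reserved")
else:
    print("No CUDA device visible; local model loading will not be possible")
```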

Strategies

  • Local GPU Model Loading: Models loaded on GPU for faster inference
  • Lazy Loading: Models loaded on demand to speed up startup (see the sketch after this list)
  • GPU Memory Management: Automatic device placement with FP16 precision
  • Caching: Aggressive caching with 30GB RAM available
  • Response Streaming: Stream outputs to reduce memory use during generation
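
The sketch below illustrates the lazy-loading and FP16 strategies with transformers; the model ID matches the one named in this document, but the loader function is an assumption, not the repository's actual implementation (device_map="auto" also requires the accelerate package).

```python
# Lazy, FP16, GPU-resident loading of the primary model (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

_MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"
_model, _tokenizer = None, None

def get_model():
    """Load the primary model on first use: FP16 weights placed on the GPU."""
    global _model, _tokenizer
    if _model is None:
        _tokenizer = AutoTokenizer.from_pretrained(_MODEL_ID)
        _model = AutoModelForCausalLM.from_pretrained(
            _MODEL_ID,
            torch_dtype=torch.float16,  # FP16 halves vRAM vs. FP32
            device_map="auto",          # place weights on the T4 automatically
        )
    return _model, _tokenizer
```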

Performance Optimization

For NVIDIA T4 GPU

  1. Local Model Loading: Models run locally on GPU (faster than API)
    • Primary model: Qwen/Qwen2.5-7B-Instruct (~14GB vRAM)
    • Embedding model: sentence-transformers/all-MiniLM-L6-v2 (~500MB)
  2. GPU Acceleration: All inference runs on GPU
  3. Parallel Processing: 4 workers (MAX_WORKERS=4) for concurrent requests
  4. Fallback to API: Automatically falls back to the HF Inference API if local models fail (see the sketch after this list)
  5. Request Queuing: Built-in async request handling
  6. Response Streaming: Implemented for efficient memory usage
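
Below is a minimal sketch of the local-first / API-fallback pattern from item 4; the `local_fn` callable stands in for the locally loaded model and is hypothetical, while the fallback uses `huggingface_hub.InferenceClient`.

```python
# Local-first generation with a fallback to the hosted HF Inference API.
# `local_fn` is a placeholder for the local GPU generation function.
import os
from huggingface_hub import InferenceClient

_client = InferenceClient(model="Qwen/Qwen2.5-7B-Instruct", token=os.getenv("HF_TOKEN"))

def generate(prompt: str, local_fn=None, max_new_tokens: int = 800) -> str:
    if local_fn is not None:
        try:
            return local_fn(prompt, max_new_tokens=max_new_tokens)
        except Exception:
            pass  # local inference failed; fall through to the hosted API
    return _client.text_generation(prompt, max_new_tokens=max_new_tokens)
```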

Mobile Optimizations

  • Reduce max tokens to 800 (MOBILE_MAX_TOKENS)
  • Shorten the timeout to 15s (MOBILE_TIMEOUT)
  • Implement progressive loading
  • Use touch-optimized UI
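
As a sketch only: the token and timeout limits above can be applied per request by checking the client's User-Agent; the detection helper and desktop defaults here are assumptions, not the app's actual logic.

```python
# Illustrative application of the mobile limits (MOBILE_MAX_TOKENS / MOBILE_TIMEOUT).
import os

MOBILE_MAX_TOKENS = int(os.getenv("MOBILE_MAX_TOKENS", "800"))
MOBILE_TIMEOUT_S = int(os.getenv("MOBILE_TIMEOUT", "15000")) / 1000

def generation_limits(user_agent: str) -> dict:
    is_mobile = any(k in user_agent for k in ("Mobile", "Android", "iPhone"))
    if is_mobile:
        return {"max_new_tokens": MOBILE_MAX_TOKENS, "timeout": MOBILE_TIMEOUT_S}
    return {"max_new_tokens": 2048, "timeout": 60.0}  # assumed desktop defaults
```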

Monitoring

Health Checks

  • Application health endpoint: /health
  • Database connectivity check
  • Cache hit rate monitoring
  • Response time tracking
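
One way to expose the /health endpoint is to mount the Gradio UI on a FastAPI app; the sketch below assumes app.py builds a Gradio Blocks object named demo, and the database check simply opens the SQLite file.

```python
# Sketch of a /health route alongside the Gradio UI (structure is assumed).
import sqlite3
import gradio as gr
from fastapi import FastAPI

api = FastAPI()

@api.get("/health")
def health():
    try:
        sqlite3.connect("sessions.db").execute("SELECT 1")  # database connectivity check
        db_ok = True
    except Exception:
        db_ok = False
    return {"status": "ok" if db_ok else "degraded", "database": db_ok}

demo = gr.Blocks()  # placeholder; the real UI is defined in app.py
app = gr.mount_gradio_app(api, demo, path="/")
```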

Logging

  • Use structured logging (structlog)
  • Log levels: DEBUG (dev), INFO (prod)
  • Monitor error rates
  • Track performance metrics
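
A possible structlog setup matching these points (JSON output, level taken from LOG_LEVEL) is sketched below; the exact processor chain is a suggestion, not the project's current configuration.

```python
# Structured JSON logging with the level driven by LOG_LEVEL (suggested setup).
import logging
import os
import structlog

level = getattr(logging, os.getenv("LOG_LEVEL", "INFO").upper(), logging.INFO)

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.make_filtering_bound_logger(level),
)

log = structlog.get_logger()
log.info("startup", gpu="T4", max_workers=int(os.getenv("MAX_WORKERS", "4")))
```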

Troubleshooting

Common Issues

Issue: Out of memory errors

  • Solution: Reduce MAX_WORKERS and queue requests so only a limited number of generations run on the GPU at once (see the sketch after this list)

Issue: Slow responses

  • Solution: Enable aggressive caching, use streaming

Issue: Model loading failures

  • Solution: Use HF Inference API instead of local models

Issue: Session data loss

  • Solution: Implement proper persistence with SQLite backup
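
For the out-of-memory item, one simple queuing approach is to cap concurrent generations with a semaphore sized by MAX_WORKERS; the sketch below is illustrative, and `generate_fn` stands in for whatever generation function the app uses.

```python
# Cap concurrent GPU work at MAX_WORKERS; excess requests wait in line
# instead of piling onto the GPU and exhausting memory (illustrative).
import asyncio
import os

MAX_WORKERS = int(os.getenv("MAX_WORKERS", "4"))
_gpu_slots = asyncio.Semaphore(MAX_WORKERS)

async def generate_with_queue(prompt: str, generate_fn) -> str:
    async with _gpu_slots:  # waits here while all slots are busy
        return await asyncio.to_thread(generate_fn, prompt)
```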

Scaling Considerations

For Production

  1. Horizontal Scaling: Deploy multiple instances
  2. Caching Layer: Add Redis for shared session data (see the sketch after this list)
  3. Load Balancing: Use HF Spaces built-in load balancer
  4. CDN: Static assets via CDN
  5. Database: Consider PostgreSQL for production
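
For the caching-layer item, shared session data could be stored in Redis via redis-py; the key scheme and the REDIS_HOST variable below are assumptions for illustration.

```python
# Shared session cache across instances (sketch; key scheme is illustrative).
import json
import os
import redis

CACHE_TTL = int(os.getenv("CACHE_TTL", "3600"))
r = redis.Redis(host=os.getenv("REDIS_HOST", "localhost"), port=6379, decode_responses=True)

def save_session(session_id: str, data: dict) -> None:
    r.setex(f"session:{session_id}", CACHE_TTL, json.dumps(data))

def load_session(session_id: str):
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```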

Migration Path

  • Phase 1: MVP on ZeroGPU
  • Phase 2: Dedicated GPU (NVIDIA T4 Medium) for local models (current)
  • Phase 3: Scale to multiple workers
  • Phase 4: Enterprise deployment with managed infrastructure