JatsTheAIGen committed
Commit 7632802 · 1 Parent(s): 83fb1b5

api migration v2
DEPLOYMENT_NOTES.md CHANGED
@@ -2,22 +2,29 @@
2
 
3
  ## Hugging Face Spaces Deployment
4
 
5
- ### ZeroGPU Configuration
6
- This MVP is optimized for **ZeroGPU** deployment on Hugging Face Spaces.
7
-
8
- #### Key Settings
9
- - **GPU**: None (CPU-only)
10
- - **Storage**: Limited (~20GB)
11
- - **Memory**: 32GB RAM
 
 
12
  - **Network**: Shared infrastructure
13

14
  ### Environment Variables
15
  Required environment variables for deployment:
16
 
17
  ```bash
18
  HF_TOKEN=your_huggingface_token_here
19
  HF_HOME=/tmp/huggingface
20
- MAX_WORKERS=2
21
  CACHE_TTL=3600
22
  DB_PATH=sessions.db
23
  FAISS_INDEX_PATH=embeddings.faiss
@@ -39,9 +46,8 @@ title: AI Research Assistant MVP
39
  emoji: 🧠
40
  colorFrom: blue
41
  colorTo: purple
42
- sdk: gradio
43
- sdk_version: 4.0.0
44
- app_file: app.py
45
  pinned: false
46
  license: apache-2.0
47
  ---
@@ -77,7 +83,7 @@ license: apache-2.0
77
  5. **Deploy to HF Spaces**
78
  - Push to GitHub
79
  - Connect to HF Spaces
80
- - Select ZeroGPU hardware
81
  - Deploy
82
 
83
  ### Resource Management
@@ -85,26 +91,34 @@ license: apache-2.0
85
  #### Memory Limits
86
  - **Base Python**: ~100MB
87
  - **Gradio**: ~50MB
88
- - **Models (loaded)**: ~200-500MB
89
- - **Cache**: ~100MB max
90
- - **Buffer**: ~100MB
 
 
 
91
 
92
- **Total Budget**: ~512MB (within HF Spaces limits)
 
93
 
94
  #### Strategies
95
- - Lazy model loading
96
- - Model offloading when not in use
97
- - Aggressive cache eviction
98
- - Stream responses to reduce memory
 
99
 
100
  ### Performance Optimization
101
 
102
- #### For ZeroGPU
103
- 1. Use HF Inference API for LLM calls (not local models)
104
- 2. Use `sentence-transformers` for embeddings (lightweight)
105
- 3. Implement request queuing
106
- 4. Use FAISS-CPU (not GPU version)
107
- 5. Implement response streaming
 
 
 
108
 
109
  #### Mobile Optimizations
110
  - Reduce max tokens to 800
 
2
 
3
  ## Hugging Face Spaces Deployment
4
 
5
+ ### NVIDIA T4 Medium Configuration
6
+ This MVP is optimized for **NVIDIA T4 Medium** GPU deployment on Hugging Face Spaces.
7
+
8
+ #### Hardware Specifications
9
+ - **GPU**: NVIDIA T4 (persistent, always available)
10
+ - **vCPU**: 8 cores
11
+ - **RAM**: 30GB
12
+ - **vRAM**: 16GB
13
+ - **Storage**: ~20GB
14
  - **Network**: Shared infrastructure
15
 
16
+ #### Resource Capacity
17
+ - **GPU Memory**: 16GB vRAM (enough for a 7B model in FP16)
18
+ - **System Memory**: 30GB RAM (excellent for caching and processing)
19
+ - **CPU**: 8 vCPU (good for parallel operations)
20
+
21
  ### Environment Variables
22
  Required environment variables for deployment:
23
 
24
  ```bash
25
  HF_TOKEN=your_huggingface_token_here
26
  HF_HOME=/tmp/huggingface
27
+ MAX_WORKERS=4
28
  CACHE_TTL=3600
29
  DB_PATH=sessions.db
30
  FAISS_INDEX_PATH=embeddings.faiss
 
46
  emoji: 🧠
47
  colorFrom: blue
48
  colorTo: purple
49
+ sdk: docker
50
+ app_port: 7860
 
51
  pinned: false
52
  license: apache-2.0
53
  ---
 
83
  5. **Deploy to HF Spaces**
84
  - Push to GitHub
85
  - Connect to HF Spaces
86
+ - Select NVIDIA T4 Medium GPU hardware
87
  - Deploy
88
 
89
  ### Resource Management
 
91
  #### Memory Limits
92
  - **Base Python**: ~100MB
93
  - **Gradio**: ~50MB
94
+ - **Models (loaded on GPU)**: ~14-16GB vRAM
95
+ - Primary model (Qwen/Qwen2.5-7B): ~14GB
96
+ - Embedding model: ~500MB
97
+ - Classification models: ~500MB each
98
+ - **System RAM**: ~2-4GB for caching and processing
99
+ - **Cache**: ~500MB-1GB max
100
 
101
+ **GPU Memory Budget**: ~16GB vRAM (the 7B model fits in FP16; enable 8-bit quantization for extra headroom)
102
+ **System RAM Budget**: 30GB (plenty of headroom)
103
 
104
  #### Strategies
105
+ - **Local GPU Model Loading**: Models loaded on GPU for faster inference
106
+ - **Lazy Loading**: Models loaded on-demand to speed up startup
107
+ - **GPU Memory Management**: Automatic device placement with FP16 precision
108
+ - **Caching**: Aggressive caching with 30GB RAM available
109
+ - **Stream responses**: To reduce memory during generation
110
 
111
  ### Performance Optimization
112
 
113
+ #### For NVIDIA T4 GPU
114
+ 1. **Local Model Loading**: Models run locally on GPU (faster than API)
115
+ - Primary model: Qwen/Qwen2.5-7B-Instruct (~14GB vRAM)
116
+ - Embedding model: sentence-transformers/all-MiniLM-L6-v2 (~500MB)
117
+ 2. **GPU Acceleration**: All inference runs on GPU
118
+ 3. **Parallel Processing**: 4 workers (MAX_WORKERS=4) for concurrent requests
119
+ 4. **Fallback to API**: Automatically falls back to HF Inference API if local models fail
120
+ 5. **Request Queuing**: Built-in async request handling
121
+ 6. **Response Streaming**: Implemented for efficient memory usage
122
 
123
  #### Mobile Optimizations
124
  - Reduce max tokens to 800
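The GPU and memory budget described in these notes can be checked at startup with the same `torch.cuda` calls this commit uses in `src/local_model_loader.py`. A minimal sketch, assuming `torch` is installed with CUDA support; the 15 GB threshold is illustrative, not a value from the repository:

```python
# Sanity-check sketch for the GPU budget described in DEPLOYMENT_NOTES.md.
# Assumes torch is installed with CUDA support, as in requirements.txt.
import torch

def report_gpu_budget(required_vram_gb: float = 15.0) -> bool:
    """Return True if the detected GPU has enough free memory for the planned models."""
    if not torch.cuda.is_available():
        print("No GPU detected - local model loading will fall back to the HF Inference API")
        return False

    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    allocated_gb = torch.cuda.memory_allocated(0) / 1024**3
    free_gb = total_gb - allocated_gb

    print(f"GPU: {props.name}, total {total_gb:.1f} GB, free ~{free_gb:.1f} GB")
    return free_gb >= required_vram_gb

if __name__ == "__main__":
    report_gpu_budget()
```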
Dockerfile.flask ADDED
@@ -0,0 +1,39 @@
1
+ FROM python:3.10-slim
2
+
3
+ # Set working directory
4
+ WORKDIR /app
5
+
6
+ # Install system dependencies
7
+ RUN apt-get update && apt-get install -y \
8
+ gcc \
9
+ g++ \
10
+ cmake \
11
+ libopenblas-dev \
12
+ libomp-dev \
13
+ curl \
14
+ && rm -rf /var/lib/apt/lists/*
15
+
16
+ # Copy requirements file
17
+ COPY requirements.txt .
18
+
19
+ # Install Python dependencies
20
+ RUN pip install --no-cache-dir -r requirements.txt
21
+
22
+ # Copy application code
23
+ COPY . .
24
+
25
+ # Expose port 7860 (HF Spaces standard)
26
+ EXPOSE 7860
27
+
28
+ # Set environment variables
29
+ ENV PYTHONUNBUFFERED=1
30
+ ENV PORT=7860
31
+
32
+ # Health check
33
+ HEALTHCHECK --interval=30s --timeout=30s --start-period=120s --retries=3 \
34
+ CMD curl -f http://localhost:7860/api/health || exit 1
35
+
36
+ # Run Flask application
37
+ # Note: For Flask-only deployment, use this Dockerfile with README_FLASK_API.md
38
+ CMD ["python", "flask_api_standalone.py"]
39
+
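The `HEALTHCHECK` above shells out to `curl`. For hitting the same endpoint from a script outside the container, a minimal sketch assuming the `requests` package and a server listening on port 7860:

```python
# Minimal health probe, equivalent to the Dockerfile HEALTHCHECK command.
# Assumes the `requests` package is installed; the URL mirrors EXPOSE 7860.
import sys
import requests

def probe(url: str = "http://localhost:7860/api/health") -> int:
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()  # mirrors `curl -f` failing on HTTP error codes
        print(resp.json())       # e.g. {"status": "healthy", "orchestrator_ready": true}
        return 0
    except requests.RequestException as exc:
        print(f"Health check failed: {exc}")
        return 1

if __name__ == "__main__":
    sys.exit(probe())
```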
FLASK_API_DEPLOYMENT_FILES.md ADDED
@@ -0,0 +1,194 @@
1
+ # Flask API Only - Required Files List
2
+
3
+ This document lists all files needed for a **Flask API-only deployment** (no Gradio UI).
4
+
5
+ ## 📋 Essential Files (Required)
6
+
7
+ ### Core Application Files
8
+ ```
9
+ Research_AI_Assistant/
10
+ ├── flask_api_standalone.py # Main Flask application (REQUIRED)
11
+ ├── Dockerfile.flask # Dockerfile for Flask deployment (rename to Dockerfile)
12
+ ├── README_FLASK_API.md # README with HF Spaces frontmatter (rename to README.md)
13
+ └── requirements.txt # Python dependencies (REQUIRED)
14
+ ```
15
+
16
+ ### Source Code Directory (`src/`)
17
+ ```
18
+ Research_AI_Assistant/src/
19
+ ├── __init__.py # Package initialization
20
+ ├── config.py # Configuration settings
21
+ ├── llm_router.py # LLM routing (local GPU models)
22
+ ├── local_model_loader.py # GPU model loader (NEW - for local inference)
23
+ ├── orchestrator_engine.py # Main orchestrator
24
+ ├── context_manager.py # Context management
25
+ ├── models_config.py # Model configurations
26
+ ├── agents/
27
+ │ ├── __init__.py
28
+ │ ├── intent_agent.py # Intent recognition agent
29
+ │ ├── synthesis_agent.py # Response synthesis agent
30
+ │ ├── safety_agent.py # Safety checking agent
31
+ │ └── skills_identification_agent.py # Skills identification agent
32
+ └── database.py # Database management (if used)
33
+ ```
34
+
35
+ ### Configuration Files (Optional but Recommended)
36
+ ```
37
+ Research_AI_Assistant/
38
+ ├── .env # Environment variables (optional, use HF Secrets instead)
39
+ └── .gitignore # Git ignore rules
40
+ ```
41
+
42
+ ## 📦 File Descriptions
43
+
44
+ ### 1. `flask_api_standalone.py` ⭐ REQUIRED
45
+ - **Purpose**: Main Flask application entry point
46
+ - **Contains**: API endpoints, orchestrator initialization, request handling
47
+ - **Key Features**:
48
+ - Local GPU model loading
49
+ - Async orchestrator support
50
+ - Health checks
51
+ - Error handling
52
+
53
+ ### 2. `Dockerfile.flask` → `Dockerfile` ⭐ REQUIRED
54
+ - **Purpose**: Container configuration
55
+ - **Action**: Rename to `Dockerfile` when deploying
56
+ - **Includes**: Python 3.10, system dependencies, health checks
57
+
58
+ ### 3. `README_FLASK_API.md` → `README.md` ⭐ REQUIRED
59
+ - **Purpose**: HF Spaces configuration and API documentation
60
+ - **Action**: Rename to `README.md` when deploying
61
+ - **Contains**: Frontmatter with `sdk: docker`, API endpoints, usage examples
62
+
63
+ ### 4. `requirements.txt` ⭐ REQUIRED
64
+ - **Purpose**: Python package dependencies
65
+ - **Includes**: Flask, transformers, torch (GPU), sentence-transformers, etc.
66
+
67
+ ### 5. `src/local_model_loader.py` ⭐ REQUIRED (NEW)
68
+ - **Purpose**: Loads models locally on GPU
69
+ - **Features**: GPU detection, model caching, FP16 optimization
70
+
71
+ ### 6. `src/llm_router.py` ⭐ REQUIRED (UPDATED)
72
+ - **Purpose**: Routes inference requests
73
+ - **Features**: Tries local models first, falls back to HF API
74
+
75
+ ### 7. `src/orchestrator_engine.py` ⭐ REQUIRED
76
+ - **Purpose**: Main AI orchestration engine
77
+ - **Contains**: Agent coordination, request processing
78
+
79
+ ### 8. `src/context_manager.py` ⭐ REQUIRED
80
+ - **Purpose**: Manages conversation context
81
+ - **Features**: Session management, context retrieval
82
+
83
+ ### 9. `src/agents/*.py` ⭐ REQUIRED
84
+ - **Purpose**: Individual AI agents
85
+ - **Agents**: Intent, Synthesis, Safety, Skills Identification
86
+
87
+ ### 10. `src/config.py` ⭐ REQUIRED
88
+ - **Purpose**: Application configuration
89
+ - **Settings**: MAX_WORKERS=4, model paths, etc.
90
+
91
+ ## ❌ Files NOT Needed (Gradio/UI Related)
92
+
93
+ These files can be **excluded** from Flask API deployment:
94
+
95
+ ```
96
+ Research_AI_Assistant/
97
+ ├── app.py # Gradio UI (NOT NEEDED)
98
+ ├── main.py # Gradio + Flask launcher (NOT NEEDED)
99
+ ├── flask_api.py # Flask API (use standalone instead)
100
+ ├── Dockerfile # Main Dockerfile (use Dockerfile.flask)
101
+ ├── Dockerfile.hf # Alternative Dockerfile (NOT NEEDED)
102
+ ├── README.md # Main README (use README_FLASK_API.md)
103
+ └── All .md files except this one # Documentation (optional)
104
+ ```
105
+
106
+ ## 🚀 Quick Deployment Checklist
107
+
108
+ ### Step 1: Prepare Files
109
+ ```bash
110
+ # In your Flask API Space directory:
111
+ cp Dockerfile.flask Dockerfile
112
+ cp README_FLASK_API.md README.md
113
+ ```
114
+
115
+ ### Step 2: Verify Structure
116
+ ```
117
+ Your Space/
118
+ ├── Dockerfile # ✅ Renamed from Dockerfile.flask
119
+ ├── README.md # ✅ Renamed from README_FLASK_API.md
120
+ ├── flask_api_standalone.py # ✅ Main Flask app
121
+ ├── requirements.txt # ✅ Dependencies
122
+ └── src/ # ✅ All source files
123
+ ├── __init__.py
124
+ ├── config.py
125
+ ├── llm_router.py
126
+ ├── local_model_loader.py
127
+ ├── orchestrator_engine.py
128
+ ├── context_manager.py
129
+ ├── models_config.py
130
+ └── agents/
131
+ ├── __init__.py
132
+ ├── intent_agent.py
133
+ ├── synthesis_agent.py
134
+ ├── safety_agent.py
135
+ └── skills_identification_agent.py
136
+ ```
137
+
138
+ ### Step 3: Set Environment Variables
139
+ In HF Spaces Settings → Secrets:
140
+ - `HF_TOKEN` - Your Hugging Face token
141
+
142
+ ### Step 4: Deploy
143
+ - Select **NVIDIA T4 Medium** GPU
144
+ - Set **SDK: docker**
145
+ - Deploy
146
+
147
+ ## 📊 File Size Considerations
148
+
149
+ ### Minimal Deployment (Essential Only)
150
+ - Core files: ~50 KB
151
+ - Source code: ~500 KB
152
+ - **Total**: ~550 KB code
153
+
154
+ ### With Models (First Load)
155
+ - Code: ~550 KB
156
+ - Models (downloaded on first run): ~14-16 GB
157
+ - **Total**: ~14-16 GB (first build)
158
+
159
+ ### Subsequent Builds
160
+ - Models cached by HF Spaces
161
+ - Code only: ~550 KB
162
+
163
+ ## 🔍 Verification
164
+
165
+ After deployment, verify these files exist:
166
+
167
+ ```bash
168
+ # Check main files
169
+ ls -la Dockerfile README.md flask_api_standalone.py requirements.txt
170
+
171
+ # Check source directory
172
+ ls -la src/
173
+ ls -la src/agents/
174
+
175
+ # Verify key components
176
+ grep -r "local_model_loader" src/llm_router.py
177
+ grep -r "MAX_WORKERS" src/config.py
178
+ ```
179
+
180
+ ## 📝 Summary
181
+
182
+ **Minimum Required Files:**
183
+ 1. `flask_api_standalone.py`
184
+ 2. `Dockerfile` (from Dockerfile.flask)
185
+ 3. `README.md` (from README_FLASK_API.md)
186
+ 4. `requirements.txt`
187
+ 5. All files in `src/` directory
188
+
189
+ **Total: ~15-20 files** (excluding documentation)
190
+
191
+ ---
192
+
193
+ **Note**: This is a minimal deployment. All Gradio UI files, documentation, and test files are optional and can be excluded to reduce repository size.
194
+
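The shell checks in the Verification section can also be scripted. A minimal sketch, assuming it runs from the Space root and covering only the files named in this checklist:

```python
# Sketch of the verification step above, run from the Space root directory.
from pathlib import Path

REQUIRED = [
    "Dockerfile",
    "README.md",
    "flask_api_standalone.py",
    "requirements.txt",
    "src/llm_router.py",
    "src/local_model_loader.py",
    "src/orchestrator_engine.py",
    "src/agents/__init__.py",
]

missing = [name for name in REQUIRED if not Path(name).exists()]
if missing:
    print("Missing required files:", ", ".join(missing))
else:
    print("All required files present")
```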
README.md CHANGED
@@ -39,7 +39,7 @@ public: true
39
  ![HF Spaces](https://img.shields.io/badge/🤗-Hugging%20Face%20Spaces-blue)
40
  ![Python](https://img.shields.io/badge/Python-3.9%2B-green)
41
  ![Gradio](https://img.shields.io/badge/Interface-Gradio-FF6B6B)
42
- ![ZeroGPU](https://img.shields.io/badge/GPU-ZeroGPU-lightgrey)
43
 
44
  **Academic-grade AI assistant with transparent reasoning and mobile-optimized interface**
45
 
@@ -50,7 +50,7 @@ public: true
50
 
51
  ## 🎯 Overview
52
 
53
- This MVP demonstrates an intelligent research assistant framework featuring **transparent reasoning chains**, **specialized agent architecture**, and **mobile-first design**. Built for Hugging Face Spaces with ZeroGPU optimization.
54
 
55
  ### Key Differentiators
56
  - **🔍 Transparent Reasoning**: Watch the AI think step-by-step with Chain of Thought
@@ -286,7 +286,7 @@ pytest tests/test_mobile_ux.py -v
286
  |-------|----------|
287
  | **HF_TOKEN not found** | Add token in Space Settings → Secrets |
288
  | **Build timeout** | Reduce model sizes in requirements |
289
- | **Memory errors** | Enable ZeroGPU and optimize cache |
290
  | **Import errors** | Check Python version (3.9+) |
291
 
292
  ### Performance Optimization
 
39
  ![HF Spaces](https://img.shields.io/badge/🤗-Hugging%20Face%20Spaces-blue)
40
  ![Python](https://img.shields.io/badge/Python-3.9%2B-green)
41
  ![Gradio](https://img.shields.io/badge/Interface-Gradio-FF6B6B)
42
+ ![NVIDIA T4](https://img.shields.io/badge/GPU-NVIDIA%20T4-blue)
43
 
44
  **Academic-grade AI assistant with transparent reasoning and mobile-optimized interface**
45
 
 
50
 
51
  ## 🎯 Overview
52
 
53
+ This MVP demonstrates an intelligent research assistant framework featuring **transparent reasoning chains**, **specialized agent architecture**, and **mobile-first design**. Built for Hugging Face Spaces with NVIDIA T4 GPU acceleration for local model inference.
54
 
55
  ### Key Differentiators
56
  - **🔍 Transparent Reasoning**: Watch the AI think step-by-step with Chain of Thought
 
286
  |-------|----------|
287
  | **HF_TOKEN not found** | Add token in Space Settings → Secrets |
288
  | **Build timeout** | Reduce model sizes in requirements |
289
+ | **Memory errors** | Check GPU memory usage, optimize model loading |
290
  | **Import errors** | Check Python version (3.9+) |
291
 
292
  ### Performance Optimization
README_FLASK_API.md ADDED
@@ -0,0 +1,92 @@
1
+ ---
2
+ title: AI Assistant Flask API
3
+ emoji: 🤖
4
+ colorFrom: blue
5
+ colorTo: green
6
+ sdk: docker
7
+ app_port: 7860
8
+ pinned: false
9
+ ---
10
+
11
+ # AI Assistant Flask API
12
+
13
+ Pure Flask REST API for the AI research assistant.
14
+
15
+ ## Quick Start
16
+
17
+ This Space provides a REST API (no UI). Test the endpoints:
18
+
19
+ ```bash
20
+ # Health check
21
+ curl https://YOUR-SPACE.hf.space/api/health
22
+
23
+ # Chat
24
+ curl -X POST https://YOUR-SPACE.hf.space/api/chat \
25
+ -H "Content-Type: application/json" \
26
+ -d '{
27
+ "message": "Hello, how are you?",
28
+ "session_id": "test-123",
29
+ "user_id": "user@example.com"
30
+ }'
31
+ ```
32
+
33
+ ## API Endpoints
34
+
35
+ ### GET /api/health
36
+ Health check endpoint.
37
+
38
+ **Response:**
39
+ ```json
40
+ {
41
+ "status": "healthy",
42
+ "orchestrator_ready": true
43
+ }
44
+ ```
45
+
46
+ ### POST /api/chat
47
+ Process a chat message.
48
+
49
+ **Request:**
50
+ ```json
51
+ {
52
+ "message": "Your question here",
53
+ "history": [],
54
+ "session_id": "optional-session-id",
55
+ "user_id": "optional-user-id"
56
+ }
57
+ ```
58
+
59
+ **Response:**
60
+ ```json
61
+ {
62
+ "success": true,
63
+ "message": "AI response here",
64
+ "history": [["Your question", "AI response"]],
65
+ "reasoning": {},
66
+ "performance": {}
67
+ }
68
+ ```
69
+
70
+ ## Environment Variables
71
+
72
+ Set in Space Settings → Repository secrets:
73
+
74
+ - `HF_TOKEN` - Your Hugging Face API token (required)
75
+
76
+ ## Technology
77
+
78
+ - Flask 3.0
79
+ - Python 3.10
80
+ - Custom AI orchestrator with multiple agents
81
+ - Docker containerized
82
+ - **NVIDIA T4 GPU** for local model inference
83
+
84
+ ## Features
85
+
86
+ - 🤖 AI-powered responses with local GPU models
87
+ - 🔄 Context-aware conversations
88
+ - 🛡️ Safety checking
89
+ - 📊 Performance metrics
90
+ - 🎯 Intent recognition
91
+ - 🔧 Skills identification
92
+
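A minimal Python client for the `/api/chat` contract documented above, assuming the `requests` package; `YOUR-SPACE` is a placeholder for the real Space host:

```python
# Minimal Python client for the /api/chat endpoint documented in README_FLASK_API.md.
# Assumes the `requests` package; replace YOUR-SPACE with the real Space host.
import requests

BASE_URL = "https://YOUR-SPACE.hf.space"

payload = {
    "message": "Hello, how are you?",
    "history": [],
    "session_id": "test-123",
    "user_id": "user@example.com",
}

resp = requests.post(f"{BASE_URL}/api/chat", json=payload, timeout=120)
resp.raise_for_status()
data = resp.json()

if data.get("success"):
    print("Assistant:", data["message"])
    print("Turns so far:", len(data["history"]))
else:
    print("Error:", data.get("error"))
```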
config.py CHANGED
@@ -13,7 +13,7 @@ class Settings(BaseSettings):
13
  classification_model: str = "cardiffnlp/twitter-roberta-base-emotion"
14
 
15
  # Performance settings
16
- max_workers: int = int(os.getenv("MAX_WORKERS", "2"))
17
  cache_ttl: int = int(os.getenv("CACHE_TTL", "3600"))
18
 
19
  # Database settings
 
13
  classification_model: str = "cardiffnlp/twitter-roberta-base-emotion"
14
 
15
  # Performance settings
16
+ max_workers: int = int(os.getenv("MAX_WORKERS", "4"))
17
  cache_ttl: int = int(os.getenv("CACHE_TTL", "3600"))
18
 
19
  # Database settings
flask_api_standalone.py ADDED
@@ -0,0 +1,257 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Pure Flask API for Hugging Face Spaces
4
+ No Gradio - Just Flask REST API
5
+ Uses local GPU models for inference
6
+ """
7
+
8
+ from flask import Flask, request, jsonify
9
+ from flask_cors import CORS
10
+ import logging
11
+ import sys
12
+ import os
13
+ import asyncio
14
+ from pathlib import Path
15
+
16
+ # Setup logging
17
+ logging.basicConfig(
18
+ level=logging.INFO,
19
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
20
+ )
21
+ logger = logging.getLogger(__name__)
22
+
23
+ # Add project root to path
24
+ project_root = Path(__file__).parent
25
+ sys.path.insert(0, str(project_root))
26
+
27
+ # Create Flask app
28
+ app = Flask(__name__)
29
+ CORS(app) # Enable CORS for all origins
30
+
31
+ # Global orchestrator
32
+ orchestrator = None
33
+ orchestrator_available = False
34
+
35
+ def initialize_orchestrator():
36
+ """Initialize the AI orchestrator with local GPU models"""
37
+ global orchestrator, orchestrator_available
38
+
39
+ try:
40
+ logger.info("=" * 60)
41
+ logger.info("INITIALIZING AI ORCHESTRATOR (Local GPU Models)")
42
+ logger.info("=" * 60)
43
+
44
+ from src.agents.intent_agent import create_intent_agent
45
+ from src.agents.synthesis_agent import create_synthesis_agent
46
+ from src.agents.safety_agent import create_safety_agent
47
+ from src.agents.skills_identification_agent import create_skills_identification_agent
48
+ from src.llm_router import LLMRouter
49
+ from src.orchestrator_engine import MVPOrchestrator
50
+ from src.context_manager import EfficientContextManager
51
+
52
+ logger.info("✓ Imports successful")
53
+
54
+ hf_token = os.getenv('HF_TOKEN', '')
55
+ if not hf_token:
56
+ logger.warning("HF_TOKEN not set - API fallback will be used if local models fail")
57
+
58
+ # Initialize LLM Router with local model loading enabled
59
+ logger.info("Initializing LLM Router with local GPU model loading...")
60
+ llm_router = LLMRouter(hf_token, use_local_models=True)
61
+
62
+ logger.info("Initializing Agents...")
63
+ agents = {
64
+ 'intent_recognition': create_intent_agent(llm_router),
65
+ 'response_synthesis': create_synthesis_agent(llm_router),
66
+ 'safety_check': create_safety_agent(llm_router),
67
+ 'skills_identification': create_skills_identification_agent(llm_router)
68
+ }
69
+
70
+ logger.info("Initializing Context Manager...")
71
+ context_manager = EfficientContextManager(llm_router=llm_router)
72
+
73
+ logger.info("Initializing Orchestrator...")
74
+ orchestrator = MVPOrchestrator(llm_router, context_manager, agents)
75
+
76
+ orchestrator_available = True
77
+ logger.info("=" * 60)
78
+ logger.info("✓ AI ORCHESTRATOR READY")
79
+ logger.info(" - Local GPU models enabled")
80
+ logger.info(" - MAX_WORKERS: 4")
81
+ logger.info("=" * 60)
82
+
83
+ return True
84
+
85
+ except Exception as e:
86
+ logger.error(f"Failed to initialize: {e}", exc_info=True)
87
+ orchestrator_available = False
88
+ return False
89
+
90
+ # Root endpoint
91
+ @app.route('/', methods=['GET'])
92
+ def root():
93
+ """API information"""
94
+ return jsonify({
95
+ 'name': 'AI Assistant Flask API',
96
+ 'version': '1.0',
97
+ 'status': 'running',
98
+ 'orchestrator_ready': orchestrator_available,
99
+ 'features': {
100
+ 'local_gpu_models': True,
101
+ 'max_workers': 4,
102
+ 'hardware': 'NVIDIA T4 Medium'
103
+ },
104
+ 'endpoints': {
105
+ 'health': 'GET /api/health',
106
+ 'chat': 'POST /api/chat',
107
+ 'initialize': 'POST /api/initialize'
108
+ }
109
+ })
110
+
111
+ # Health check
112
+ @app.route('/api/health', methods=['GET'])
113
+ def health_check():
114
+ """Health check endpoint"""
115
+ return jsonify({
116
+ 'status': 'healthy' if orchestrator_available else 'initializing',
117
+ 'orchestrator_ready': orchestrator_available
118
+ })
119
+
120
+ # Chat endpoint
121
+ @app.route('/api/chat', methods=['POST'])
122
+ def chat():
123
+ """
124
+ Process chat message
125
+
126
+ POST /api/chat
127
+ {
128
+ "message": "user message",
129
+ "history": [[user, assistant], ...],
130
+ "session_id": "session-123",
131
+ "user_id": "user-456"
132
+ }
133
+
134
+ Returns:
135
+ {
136
+ "success": true,
137
+ "message": "AI response",
138
+ "history": [...],
139
+ "reasoning": {...},
140
+ "performance": {...}
141
+ }
142
+ """
143
+ try:
144
+ data = request.get_json()
145
+
146
+ if not data or 'message' not in data:
147
+ return jsonify({
148
+ 'success': False,
149
+ 'error': 'Message is required'
150
+ }), 400
151
+
152
+ message = data['message']
153
+ history = data.get('history', [])
154
+ session_id = data.get('session_id')
155
+ user_id = data.get('user_id', 'anonymous')
156
+
157
+ logger.info(f"Chat request - User: {user_id}, Session: {session_id}")
158
+ logger.info(f"Message: {message[:100]}...")
159
+
160
+ if not orchestrator_available or orchestrator is None:
161
+ return jsonify({
162
+ 'success': False,
163
+ 'error': 'Orchestrator not ready',
164
+ 'message': 'AI system is initializing. Please try again in a moment.'
165
+ }), 503
166
+
167
+ # Process with orchestrator (async method)
168
+ # Set user_id for session tracking
169
+ if session_id:
170
+ orchestrator.set_user_id(session_id, user_id)
171
+
172
+ # Run async process_request in event loop
173
+ loop = asyncio.new_event_loop()
174
+ asyncio.set_event_loop(loop)
175
+ try:
176
+ result = loop.run_until_complete(
177
+ orchestrator.process_request(
178
+ session_id=session_id or f"session-{user_id}",
179
+ user_input=message
180
+ )
181
+ )
182
+ finally:
183
+ loop.close()
184
+
185
+ # Extract response
186
+ if isinstance(result, dict):
187
+ response_text = result.get('response', '')
188
+ reasoning = result.get('reasoning', {})
189
+ performance = result.get('performance', {})
190
+ else:
191
+ response_text = str(result)
192
+ reasoning = {}
193
+ performance = {}
194
+
195
+ updated_history = history + [[message, response_text]]
196
+
197
+ logger.info(f"✓ Response generated (length: {len(response_text)})")
198
+
199
+ return jsonify({
200
+ 'success': True,
201
+ 'message': response_text,
202
+ 'history': updated_history,
203
+ 'reasoning': reasoning,
204
+ 'performance': performance
205
+ })
206
+
207
+ except Exception as e:
208
+ logger.error(f"Chat error: {e}", exc_info=True)
209
+ return jsonify({
210
+ 'success': False,
211
+ 'error': str(e),
212
+ 'message': 'Error processing your request. Please try again.'
213
+ }), 500
214
+
215
+ # Manual initialization endpoint
216
+ @app.route('/api/initialize', methods=['POST'])
217
+ def initialize():
218
+ """Manually trigger initialization"""
219
+ success = initialize_orchestrator()
220
+
221
+ if success:
222
+ return jsonify({
223
+ 'success': True,
224
+ 'message': 'Orchestrator initialized successfully'
225
+ })
226
+ else:
227
+ return jsonify({
228
+ 'success': False,
229
+ 'message': 'Initialization failed. Check logs for details.'
230
+ }), 500
231
+
232
+ # Initialize on startup
233
+ if __name__ == '__main__':
234
+ logger.info("=" * 60)
235
+ logger.info("STARTING PURE FLASK API")
236
+ logger.info("=" * 60)
237
+
238
+ # Initialize orchestrator
239
+ initialize_orchestrator()
240
+
241
+ port = int(os.getenv('PORT', 7860))
242
+
243
+ logger.info(f"Starting Flask on port {port}")
244
+ logger.info("Endpoints available:")
245
+ logger.info(" GET /")
246
+ logger.info(" GET /api/health")
247
+ logger.info(" POST /api/chat")
248
+ logger.info(" POST /api/initialize")
249
+ logger.info("=" * 60)
250
+
251
+ app.run(
252
+ host='0.0.0.0',
253
+ port=port,
254
+ debug=False,
255
+ threaded=True # Enable threading for concurrent requests
256
+ )
257
+
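Because `initialize_orchestrator()` loads models at startup, `/api/health` can report `initializing` for a while after the container boots. A minimal readiness-wait sketch, assuming the `requests` package and a local instance on port 7860:

```python
# Sketch: wait for the orchestrator to become ready, triggering /api/initialize once if needed.
# Assumes `requests` and a locally running flask_api_standalone.py on port 7860.
import time
import requests

BASE = "http://localhost:7860"

def wait_until_ready(timeout_s: int = 600) -> bool:
    health = requests.get(f"{BASE}/api/health", timeout=10).json()
    if not health.get("orchestrator_ready"):
        # Trigger initialization once; model loading can take several minutes.
        requests.post(f"{BASE}/api/initialize", timeout=600)

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        health = requests.get(f"{BASE}/api/health", timeout=10).json()
        if health.get("orchestrator_ready"):
            return True
        time.sleep(5)
    return False

if __name__ == "__main__":
    print("Ready" if wait_until_ready() else "Timed out waiting for orchestrator")
```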
requirements.txt CHANGED
@@ -1,9 +1,13 @@
1
- # requirements.txt for Hugging Face Spaces with ZeroGPU
2
  # Core Framework Dependencies
3
 
4
- # Note: gradio, fastapi, uvicorn, torch, datasets, huggingface-hub,
5
  # pydantic==2.10.6, and protobuf<4 are installed by HF Spaces SDK
6
 
 
 
 
 
7
  # Web Framework & Interface
8
  aiohttp>=3.9.0
9
  httpx>=0.25.0
 
1
+ # requirements.txt for Hugging Face Spaces with NVIDIA T4 GPU
2
  # Core Framework Dependencies
3
 
4
+ # Note: gradio, fastapi, uvicorn, datasets, huggingface-hub,
5
  # pydantic==2.10.6, and protobuf<4 are installed by HF Spaces SDK
6
 
7
+ # PyTorch with CUDA support (for GPU inference)
8
+ # Note: HF Spaces provides torch, but we ensure GPU support
9
+ torch>=2.0.0
10
+
11
  # Web Framework & Interface
12
  aiohttp>=3.9.0
13
  httpx>=0.25.0
src/config.py CHANGED
@@ -13,7 +13,7 @@ class Settings(BaseSettings):
13
  classification_model: str = "cardiffnlp/twitter-roberta-base-emotion"
14
 
15
  # Performance settings
16
- max_workers: int = int(os.getenv("MAX_WORKERS", "2"))
17
  cache_ttl: int = int(os.getenv("CACHE_TTL", "3600"))
18
 
19
  # Database settings
 
13
  classification_model: str = "cardiffnlp/twitter-roberta-base-emotion"
14
 
15
  # Performance settings
16
+ max_workers: int = int(os.getenv("MAX_WORKERS", "4"))
17
  cache_ttl: int = int(os.getenv("CACHE_TTL", "3600"))
18
 
19
  # Database settings
src/llm_router.py CHANGED
@@ -1,40 +1,154 @@
1
- # llm_router.py - FIXED VERSION
2
  import logging
3
  import asyncio
4
- from typing import Dict
5
  from .models_config import LLM_CONFIG
6
 
7
  logger = logging.getLogger(__name__)
8
 
9
  class LLMRouter:
10
- def __init__(self, hf_token):
11
  self.hf_token = hf_token
12
  self.health_status = {}
 
 
 
13
  logger.info("LLMRouter initialized")
14
  if hf_token:
15
  logger.info("HF token available")
16
  else:
17
  logger.warning("No HF token provided")
18

19
  async def route_inference(self, task_type: str, prompt: str, **kwargs):
20
  """
21
  Smart routing based on task specialization
 
22
  """
23
  logger.info(f"Routing inference for task: {task_type}")
24
  model_config = self._select_model(task_type)
25
  logger.info(f"Selected model: {model_config['model_id']}")
26

27
  # Health check and fallback logic
28
  if not await self._is_model_healthy(model_config["model_id"]):
29
  logger.warning(f"Model unhealthy, using fallback")
30
  model_config = self._get_fallback_model(task_type)
31
  logger.info(f"Fallback model: {model_config['model_id']}")
32
 
33
- # FIXED: Ensure task_type is passed to the _call_hf_endpoint method
34
  result = await self._call_hf_endpoint(model_config, prompt, task_type, **kwargs)
35
  logger.info(f"Inference complete for {task_type}")
36
  return result
37
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
38
  def _select_model(self, task_type: str) -> dict:
39
  model_map = {
40
  "intent_classification": LLM_CONFIG["models"]["classification_specialist"],
 
1
+ # llm_router.py - UPDATED FOR LOCAL GPU MODEL LOADING
2
  import logging
3
  import asyncio
4
+ from typing import Dict, Optional
5
  from .models_config import LLM_CONFIG
6
 
7
  logger = logging.getLogger(__name__)
8
 
9
  class LLMRouter:
10
+ def __init__(self, hf_token, use_local_models: bool = True):
11
  self.hf_token = hf_token
12
  self.health_status = {}
13
+ self.use_local_models = use_local_models
14
+ self.local_loader = None
15
+
16
  logger.info("LLMRouter initialized")
17
  if hf_token:
18
  logger.info("HF token available")
19
  else:
20
  logger.warning("No HF token provided")
21
 
22
+ # Initialize local model loader if enabled
23
+ if self.use_local_models:
24
+ try:
25
+ from .local_model_loader import LocalModelLoader
26
+ self.local_loader = LocalModelLoader()
27
+ logger.info("✓ Local model loader initialized (GPU-based inference)")
28
+
29
+ # Note: Pre-loading will happen on first request (lazy loading)
30
+ # Models will be loaded on-demand to avoid blocking startup
31
+ logger.info("Models will be loaded on-demand for faster startup")
32
+ except Exception as e:
33
+ logger.warning(f"Could not initialize local model loader: {e}. Falling back to API.")
34
+ logger.warning("This is normal if transformers/torch not available")
35
+ self.use_local_models = False
36
+ self.local_loader = None
37
+
38
  async def route_inference(self, task_type: str, prompt: str, **kwargs):
39
  """
40
  Smart routing based on task specialization
41
+ Tries local models first, falls back to HF Inference API if needed
42
  """
43
  logger.info(f"Routing inference for task: {task_type}")
44
  model_config = self._select_model(task_type)
45
  logger.info(f"Selected model: {model_config['model_id']}")
46
 
47
+ # Try local model first if available
48
+ if self.use_local_models and self.local_loader:
49
+ try:
50
+ # Handle embedding generation separately
51
+ if task_type == "embedding_generation":
52
+ result = await self._call_local_embedding(model_config, prompt, **kwargs)
53
+ else:
54
+ result = await self._call_local_model(model_config, prompt, task_type, **kwargs)
55
+
56
+ if result is not None:
57
+ logger.info(f"Inference complete for {task_type} (local model)")
58
+ return result
59
+ else:
60
+ logger.warning("Local model returned None, falling back to API")
61
+ except Exception as e:
62
+ logger.warning(f"Local model inference failed: {e}. Falling back to API.")
63
+ logger.debug("Exception details:", exc_info=True)
64
+
65
+ # Fallback to HF Inference API
66
+ logger.info("Using HF Inference API")
67
  # Health check and fallback logic
68
  if not await self._is_model_healthy(model_config["model_id"]):
69
  logger.warning(f"Model unhealthy, using fallback")
70
  model_config = self._get_fallback_model(task_type)
71
  logger.info(f"Fallback model: {model_config['model_id']}")
72
 
 
73
  result = await self._call_hf_endpoint(model_config, prompt, task_type, **kwargs)
74
  logger.info(f"Inference complete for {task_type}")
75
  return result
76
 
77
+ async def _call_local_model(self, model_config: dict, prompt: str, task_type: str, **kwargs) -> Optional[str]:
78
+ """Call local model for inference."""
79
+ if not self.local_loader:
80
+ return None
81
+
82
+ model_id = model_config["model_id"]
83
+ max_tokens = kwargs.get('max_tokens', 512)
84
+ temperature = kwargs.get('temperature', 0.7)
85
+
86
+ try:
87
+ # Ensure model is loaded
88
+ if model_id not in self.local_loader.loaded_models:
89
+ logger.info(f"Loading model {model_id} on demand...")
90
+ self.local_loader.load_chat_model(model_id, load_in_8bit=False)
91
+
92
+ # Format as chat messages if needed
93
+ messages = [{"role": "user", "content": prompt}]
94
+
95
+ # Generate using local model
96
+ result = await asyncio.to_thread(
97
+ self.local_loader.generate_chat_completion,
98
+ model_id=model_id,
99
+ messages=messages,
100
+ max_tokens=max_tokens,
101
+ temperature=temperature
102
+ )
103
+
104
+ logger.info(f"Local model {model_id} generated response (length: {len(result)})")
105
+ logger.info("=" * 80)
106
+ logger.info("LOCAL MODEL RESPONSE:")
107
+ logger.info("=" * 80)
108
+ logger.info(f"Model: {model_id}")
109
+ logger.info(f"Task Type: {task_type}")
110
+ logger.info(f"Response Length: {len(result)} characters")
111
+ logger.info("-" * 40)
112
+ logger.info("FULL RESPONSE CONTENT:")
113
+ logger.info("-" * 40)
114
+ logger.info(result)
115
+ logger.info("-" * 40)
116
+ logger.info("END OF RESPONSE")
117
+ logger.info("=" * 80)
118
+
119
+ return result
120
+
121
+ except Exception as e:
122
+ logger.error(f"Error calling local model: {e}", exc_info=True)
123
+ return None
124
+
125
+ async def _call_local_embedding(self, model_config: dict, text: str, **kwargs) -> Optional[list]:
126
+ """Call local embedding model."""
127
+ if not self.local_loader:
128
+ return None
129
+
130
+ model_id = model_config["model_id"]
131
+
132
+ try:
133
+ # Ensure model is loaded
134
+ if model_id not in self.local_loader.loaded_embedding_models:
135
+ logger.info(f"Loading embedding model {model_id} on demand...")
136
+ self.local_loader.load_embedding_model(model_id)
137
+
138
+ # Generate embedding
139
+ embedding = await asyncio.to_thread(
140
+ self.local_loader.get_embedding,
141
+ model_id=model_id,
142
+ text=text
143
+ )
144
+
145
+ logger.info(f"Local embedding model {model_id} generated vector (dim: {len(embedding)})")
146
+ return embedding
147
+
148
+ except Exception as e:
149
+ logger.error(f"Error calling local embedding model: {e}", exc_info=True)
150
+ return None
151
+
152
  def _select_model(self, task_type: str) -> dict:
153
  model_map = {
154
  "intent_classification": LLM_CONFIG["models"]["classification_specialist"],
src/local_model_loader.py ADDED
@@ -0,0 +1,322 @@
1
+ # local_model_loader.py
2
+ # Local GPU-based model loading for NVIDIA T4 Medium (16GB vRAM)
3
+ import logging
4
+ import torch
5
+ from typing import Optional, Dict, Any
6
+ from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModel
7
+ from sentence_transformers import SentenceTransformer
8
+
9
+ logger = logging.getLogger(__name__)
10
+
11
+ class LocalModelLoader:
12
+ """
13
+ Loads and manages models locally on GPU for faster inference.
14
+ Optimized for NVIDIA T4 Medium with 16GB vRAM.
15
+ """
16
+
17
+ def __init__(self, device: Optional[str] = None):
18
+ """Initialize the model loader with GPU device detection."""
19
+ # Detect device
20
+ if device is None:
21
+ if torch.cuda.is_available():
22
+ self.device = "cuda"
23
+ self.device_name = torch.cuda.get_device_name(0)
24
+ logger.info(f"GPU detected: {self.device_name}")
25
+ logger.info(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
26
+ else:
27
+ self.device = "cpu"
28
+ self.device_name = "CPU"
29
+ logger.warning("No GPU detected, using CPU")
30
+ else:
31
+ self.device = device
32
+ self.device_name = device
33
+
34
+ # Model cache
35
+ self.loaded_models: Dict[str, Any] = {}
36
+ self.loaded_tokenizers: Dict[str, Any] = {}
37
+ self.loaded_embedding_models: Dict[str, Any] = {}
38
+
39
+ def load_chat_model(self, model_id: str, load_in_8bit: bool = False, load_in_4bit: bool = False) -> tuple:
40
+ """
41
+ Load a chat model and tokenizer on GPU.
42
+
43
+ Args:
44
+ model_id: HuggingFace model identifier
45
+ load_in_8bit: Use 8-bit quantization (saves memory)
46
+ load_in_4bit: Use 4-bit quantization (saves more memory)
47
+
48
+ Returns:
49
+ Tuple of (model, tokenizer)
50
+ """
51
+ if model_id in self.loaded_models:
52
+ logger.info(f"Model {model_id} already loaded, reusing")
53
+ return self.loaded_models[model_id], self.loaded_tokenizers[model_id]
54
+
55
+ try:
56
+ logger.info(f"Loading model {model_id} on {self.device}...")
57
+
58
+ # Load tokenizer
59
+ tokenizer = AutoTokenizer.from_pretrained(
60
+ model_id,
61
+ trust_remote_code=True
62
+ )
63
+
64
+ # Determine quantization config
65
+ if load_in_4bit and self.device == "cuda":
66
+ try:
67
+ from transformers import BitsAndBytesConfig
68
+ quantization_config = BitsAndBytesConfig(
69
+ load_in_4bit=True,
70
+ bnb_4bit_compute_dtype=torch.float16,
71
+ bnb_4bit_use_double_quant=True,
72
+ bnb_4bit_quant_type="nf4"
73
+ )
74
+ logger.info("Using 4-bit quantization")
75
+ except ImportError:
76
+ logger.warning("bitsandbytes not available, loading without quantization")
77
+ quantization_config = None
78
+ elif load_in_8bit and self.device == "cuda":
79
+ try:
80
+ quantization_config = {"load_in_8bit": True}
81
+ logger.info("Using 8-bit quantization")
82
+ except:
83
+ quantization_config = None
84
+ else:
85
+ quantization_config = None
86
+
87
+ # Load model with GPU optimization
88
+ if self.device == "cuda":
89
+ model = AutoModelForCausalLM.from_pretrained(
90
+ model_id,
91
+ device_map="auto", # Automatically uses GPU
92
+ torch_dtype=torch.float16, # Use FP16 for memory efficiency
93
+ trust_remote_code=True,
94
+ **(quantization_config if isinstance(quantization_config, dict) else {}),
95
+ **({"quantization_config": quantization_config} if quantization_config and not isinstance(quantization_config, dict) else {})
96
+ )
97
+ else:
98
+ model = AutoModelForCausalLM.from_pretrained(
99
+ model_id,
100
+ torch_dtype=torch.float32,
101
+ trust_remote_code=True
102
+ )
103
+ model = model.to(self.device)
104
+
105
+ # Ensure padding token is set
106
+ if tokenizer.pad_token is None:
107
+ tokenizer.pad_token = tokenizer.eos_token
108
+
109
+ # Cache models
110
+ self.loaded_models[model_id] = model
111
+ self.loaded_tokenizers[model_id] = tokenizer
112
+
113
+ # Log memory usage
114
+ if self.device == "cuda":
115
+ allocated = torch.cuda.memory_allocated(0) / 1024**3
116
+ reserved = torch.cuda.memory_reserved(0) / 1024**3
117
+ logger.info(f"GPU Memory - Allocated: {allocated:.2f} GB, Reserved: {reserved:.2f} GB")
118
+
119
+ logger.info(f"✓ Model {model_id} loaded successfully on {self.device}")
120
+ return model, tokenizer
121
+
122
+ except Exception as e:
123
+ logger.error(f"Error loading model {model_id}: {e}", exc_info=True)
124
+ raise
125
+
126
+ def load_embedding_model(self, model_id: str) -> SentenceTransformer:
127
+ """
128
+ Load a sentence transformer model for embeddings.
129
+
130
+ Args:
131
+ model_id: HuggingFace model identifier
132
+
133
+ Returns:
134
+ SentenceTransformer model
135
+ """
136
+ if model_id in self.loaded_embedding_models:
137
+ logger.info(f"Embedding model {model_id} already loaded, reusing")
138
+ return self.loaded_embedding_models[model_id]
139
+
140
+ try:
141
+ logger.info(f"Loading embedding model {model_id}...")
142
+
143
+ # SentenceTransformer automatically handles GPU
144
+ model = SentenceTransformer(
145
+ model_id,
146
+ device=self.device
147
+ )
148
+
149
+ # Cache model
150
+ self.loaded_embedding_models[model_id] = model
151
+
152
+ logger.info(f"✓ Embedding model {model_id} loaded successfully on {self.device}")
153
+ return model
154
+
155
+ except Exception as e:
156
+ logger.error(f"Error loading embedding model {model_id}: {e}", exc_info=True)
157
+ raise
158
+
159
+ def generate_text(
160
+ self,
161
+ model_id: str,
162
+ prompt: str,
163
+ max_tokens: int = 512,
164
+ temperature: float = 0.7,
165
+ **kwargs
166
+ ) -> str:
167
+ """
168
+ Generate text using a loaded chat model.
169
+
170
+ Args:
171
+ model_id: Model identifier
172
+ prompt: Input prompt
173
+ max_tokens: Maximum tokens to generate
174
+ temperature: Sampling temperature
175
+
176
+ Returns:
177
+ Generated text
178
+ """
179
+ if model_id not in self.loaded_models:
180
+ raise ValueError(f"Model {model_id} not loaded. Call load_chat_model() first.")
181
+
182
+ model = self.loaded_models[model_id]
183
+ tokenizer = self.loaded_tokenizers[model_id]
184
+
185
+ try:
186
+ # Tokenize input
187
+ inputs = tokenizer(prompt, return_tensors="pt").to(self.device)
188
+
189
+ # Generate
190
+ with torch.no_grad():
191
+ outputs = model.generate(
192
+ **inputs,
193
+ max_new_tokens=max_tokens,
194
+ temperature=temperature,
195
+ do_sample=True,
196
+ pad_token_id=tokenizer.pad_token_id,
197
+ eos_token_id=tokenizer.eos_token_id,
198
+ **kwargs
199
+ )
200
+
201
+ # Decode
202
+ generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
203
+
204
+ # Remove prompt from output if present
205
+ if generated_text.startswith(prompt):
206
+ generated_text = generated_text[len(prompt):].strip()
207
+
208
+ return generated_text
209
+
210
+ except Exception as e:
211
+ logger.error(f"Error generating text: {e}", exc_info=True)
212
+ raise
213
+
214
+ def generate_chat_completion(
215
+ self,
216
+ model_id: str,
217
+ messages: list,
218
+ max_tokens: int = 512,
219
+ temperature: float = 0.7,
220
+ **kwargs
221
+ ) -> str:
222
+ """
223
+ Generate chat completion using a loaded model.
224
+
225
+ Args:
226
+ model_id: Model identifier
227
+ messages: List of message dicts with 'role' and 'content'
228
+ max_tokens: Maximum tokens to generate
229
+ temperature: Sampling temperature
230
+
231
+ Returns:
232
+ Generated response
233
+ """
234
+ if model_id not in self.loaded_models:
235
+ raise ValueError(f"Model {model_id} not loaded. Call load_chat_model() first.")
236
+
237
+ model = self.loaded_models[model_id]
238
+ tokenizer = self.loaded_tokenizers[model_id]
239
+
240
+ try:
241
+ # Format messages as prompt
242
+ if hasattr(tokenizer, 'apply_chat_template'):
243
+ # Use chat template if available
244
+ prompt = tokenizer.apply_chat_template(
245
+ messages,
246
+ tokenize=False,
247
+ add_generation_prompt=True
248
+ )
249
+ else:
250
+ # Fallback: simple formatting
251
+ prompt = "\n".join([
252
+ f"{msg['role']}: {msg['content']}"
253
+ for msg in messages
254
+ ]) + "\nassistant: "
255
+
256
+ # Generate
257
+ return self.generate_text(
258
+ model_id=model_id,
259
+ prompt=prompt,
260
+ max_tokens=max_tokens,
261
+ temperature=temperature,
262
+ **kwargs
263
+ )
264
+
265
+ except Exception as e:
266
+ logger.error(f"Error generating chat completion: {e}", exc_info=True)
267
+ raise
268
+
269
+ def get_embedding(self, model_id: str, text: str) -> list:
270
+ """
271
+ Get embedding vector for text.
272
+
273
+ Args:
274
+ model_id: Embedding model identifier
275
+ text: Input text
276
+
277
+ Returns:
278
+ Embedding vector
279
+ """
280
+ if model_id not in self.loaded_embedding_models:
281
+ raise ValueError(f"Embedding model {model_id} not loaded. Call load_embedding_model() first.")
282
+
283
+ model = self.loaded_embedding_models[model_id]
284
+
285
+ try:
286
+ embedding = model.encode(text, convert_to_numpy=True)
287
+ return embedding.tolist()
288
+ except Exception as e:
289
+ logger.error(f"Error getting embedding: {e}", exc_info=True)
290
+ raise
291
+
292
+ def clear_cache(self):
293
+ """Clear all loaded models from memory."""
294
+ logger.info("Clearing model cache...")
295
+
296
+ # Clear models
297
+ for model_id in list(self.loaded_models.keys()):
298
+ del self.loaded_models[model_id]
299
+ for model_id in list(self.loaded_tokenizers.keys()):
300
+ del self.loaded_tokenizers[model_id]
301
+ for model_id in list(self.loaded_embedding_models.keys()):
302
+ del self.loaded_embedding_models[model_id]
303
+
304
+ # Clear GPU cache
305
+ if self.device == "cuda":
306
+ torch.cuda.empty_cache()
307
+
308
+ logger.info("✓ Model cache cleared")
309
+
310
+ def get_memory_usage(self) -> Dict[str, float]:
311
+ """Get current GPU memory usage in GB."""
312
+ if self.device != "cuda":
313
+ return {"device": "cpu", "gpu_available": False}
314
+
315
+ return {
316
+ "device": self.device_name,
317
+ "gpu_available": True,
318
+ "allocated_gb": torch.cuda.memory_allocated(0) / 1024**3,
319
+ "reserved_gb": torch.cuda.memory_reserved(0) / 1024**3,
320
+ "total_gb": torch.cuda.get_device_properties(0).total_memory / 1024**3
321
+ }
322
+
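A usage sketch for `LocalModelLoader`, assuming it runs from the project root; the model IDs are the ones named in DEPLOYMENT_NOTES.md, and the first chat-model load downloads roughly 14 GB of weights:

```python
# Usage sketch for LocalModelLoader, run from the project root.
# Model IDs match those named in DEPLOYMENT_NOTES.md; the 7B chat model is a large download on first run.
from src.local_model_loader import LocalModelLoader

loader = LocalModelLoader()  # picks "cuda" when a GPU is visible, otherwise "cpu"

# Lightweight embedding model (~500MB).
loader.load_embedding_model("sentence-transformers/all-MiniLM-L6-v2")
vec = loader.get_embedding("sentence-transformers/all-MiniLM-L6-v2", "transparent reasoning chains")
print("Embedding dimension:", len(vec))

# Chat model; pass load_in_8bit=True to trade some quality for vRAM headroom.
loader.load_chat_model("Qwen/Qwen2.5-7B-Instruct")
answer = loader.generate_chat_completion(
    model_id="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "In one sentence, what does this API do?"}],
    max_tokens=64,
)
print(answer)

print(loader.get_memory_usage())
loader.clear_cache()
```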