JatsTheAIGen committed on
Commit
a58b1f9
·
1 Parent(s): 7632802

feat: Add ZeroGPU Chat API integration


- Add ZeroGPU API client (zero_gpu_client.py) with JWT authentication and auto-refresh
- Update LLM router to support ZeroGPU API as inference provider
- Add ZeroGPU configuration to config.py (enabled via USE_ZERO_GPU env var)
- Add task type mapping for ZeroGPU API (general, reasoning, classification, embedding)
- Update app.py and flask_api_standalone.py to pass ZeroGPU config to LLM router
- Implement fallback chain: Local models -> ZeroGPU API -> HF Inference API
- Add comprehensive integration review documentation

The ZeroGPU API provides:
- Built-in user management and authentication
- Comprehensive server-side logging and audit trail
- Task-based routing (general, reasoning, classification, embedding)
- Rich metadata (tokens, timing, quality metrics)
- Rate limiting and security features

Configuration:
- Set USE_ZERO_GPU=true to enable
- Set ZERO_GPU_API_URL, ZERO_GPU_EMAIL, ZERO_GPU_PASSWORD
- Falls back to HF API if ZeroGPU unavailable or disabled

ZEROGPU_API_INTEGRATION_REVIEW.md ADDED
@@ -0,0 +1,829 @@
# ZeroGPU Chat API Integration Review

**Date:** 2025-01-07
**Reviewer:** AI Assistant
**Purpose:** Comprehensive review of the ZeroGPU Chat API documentation for replacing the HF/Novita Inference endpoints

---

## Executive Summary

The ZeroGPU Chat API provides a comprehensive replacement for the Hugging Face Inference API with significant advantages:
- ✅ **Built-in user management and authentication** (JWT-based)
- ✅ **Comprehensive audit logging** (all requests logged server-side)
- ✅ **Multi-task support** (general, reasoning, classification, embedding)
- ✅ **Rate limiting and security features**
- ✅ **Better integration patterns** for multi-agent systems

**Key Integration Points:**
1. Replace `llm_router.py` HF endpoint calls with the ZeroGPU `/chat` endpoint
2. Implement the JWT authentication flow (login → access token → refresh)
3. Map current task types to ZeroGPU task types
4. Leverage the API's built-in logging instead of local logging
5. Update user management to use the API's user system, or maintain a dual system

---

## 1. API Documentation Review

### 1.1 Endpoint Comparison

#### Current System (HF Inference API)
```python
# Current: llm_router.py
api_url = "https://router.huggingface.co/v1/chat/completions"
headers = {"Authorization": f"Bearer {self.hf_token}"}
payload = {
    "model": model_id,
    "messages": [{"role": "user", "content": prompt}],
    "max_tokens": max_tokens,
    "temperature": temperature
}
```

#### ZeroGPU Chat API
```python
# New: ZeroGPU API
api_url = "http://your-pod-ip:8000/chat"
headers = {"Authorization": f"Bearer {access_token}"}
payload = {
    "message": prompt,
    "task": "general",  # or "reasoning", "classification", "embedding"
    "context": [...],   # Optional conversation history
    "max_tokens": max_tokens,
    "temperature": temperature,
    "system_prompt": "..."
}
```

**Key Differences:**
- ✅ **Task-based routing** instead of model selection
- ✅ **Context support** built-in (conversation history)
- ✅ **System prompts** supported natively
- ✅ **Authentication** via JWT tokens (not API keys)
- ⚠️ **Different payload structure** (single `message` vs. `messages` array)

### 1.2 Task Type Mapping

**Current System Task Types:**
```python
# From models_config.py and llm_router.py
task_types = {
    "intent_classification": "classification_specialist",
    "embedding_generation": "embedding_specialist",
    "safety_check": "safety_checker",
    "general_reasoning": "reasoning_primary",
    "response_synthesis": "reasoning_primary"
}
```

**ZeroGPU Task Types:**
```python
# From ZeroGPU API documentation
zero_gpu_tasks = {
    "general": "General purpose chat and Q&A",
    "reasoning": "Complex reasoning and problem-solving",
    "classification": "Text classification tasks",
    "embedding": "Text embeddings (vector representations)"
}
```

**Recommended Mapping:**
```python
TASK_MAPPING = {
    "intent_classification": "classification",
    "embedding_generation": "embedding",
    "safety_check": "general",        # Or create custom safety endpoint
    "general_reasoning": "reasoning",
    "response_synthesis": "general"   # Or "reasoning" for complex synthesis
}
```

### 1.3 Authentication Flow

**Current System:**
- Uses HF token directly in headers
- No user management
- No token refresh needed

**ZeroGPU API:**
- Requires user registration/login
- JWT access tokens (15 min expiry)
- Refresh tokens (7 day expiry)
- User approval workflow

**Integration Strategy:**
1. **Option A: Service Account** (Recommended for single-tenant)
   - Create one service account for the application
   - Use that account for all API calls
   - Simpler, but all usage tracked under one user

2. **Option B: Per-User Accounts** (Multi-tenant)
   - Map each application user to a ZeroGPU user
   - Track usage per user
   - More complex but better for multi-tenant scenarios

3. **Option C: Hybrid** (Recommended for migration)
   - Use service account initially
   - Migrate to per-user accounts gradually
   - Maintain user mapping table

### 1.4 Response Structure Comparison

**Current HF API Response:**
```json
{
    "choices": [{
        "message": {
            "content": "response text"
        }
    }]
}
```

**ZeroGPU API Response:**
```json
{
    "response": "response text",
    "task": "general",
    "model_used": "mistralai/Mistral-7B-Instruct-v0.2",
    "tokens_used": {
        "input": 15,
        "output": 8,
        "total": 23
    },
    "inference_metrics": {
        "inference_duration": 0.45,
        "total_duration": 0.52,
        "tokens_per_second": 17.78
    },
    "confidence_scores": {...},
    "quality_metrics": {...},
    "performance_metrics": {...},
    "audit_info": {
        "timestamp": "2024-01-01T12:00:00",
        "user_id": 1,
        "model_name": "...",
        "task": "general",
        "generation_parameters": {...},
        "compliance": {
            "logged": true,
            "retention_days": 90,
            "audit_enabled": true
        }
    }
}
```

**Advantages:**
- ✅ **Rich metadata** (tokens, timing, quality metrics)
- ✅ **Audit trail** built-in
- ✅ **Performance metrics** included
- ✅ **Compliance information** for logging

---
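Because downstream code currently expects the HF `choices` shape, a thin adapter can bridge the two structures during migration. A sketch assuming only the field names shown above; `to_hf_shape` is a hypothetical helper, not part of either API:

```python
from typing import Any, Dict

def to_hf_shape(zero_gpu_response: Dict[str, Any]) -> Dict[str, Any]:
    """Reshape a ZeroGPU /chat response into the HF chat-completions shape.

    Only the fields shown in the comparison above are assumed; everything
    else is carried along unchanged under "zero_gpu_meta".
    """
    tokens = zero_gpu_response.get("tokens_used", {})
    return {
        "choices": [{
            "message": {"content": zero_gpu_response.get("response", "")}
        }],
        "usage": {
            "prompt_tokens": tokens.get("input"),
            "completion_tokens": tokens.get("output"),
            "total_tokens": tokens.get("total"),
        },
        "zero_gpu_meta": {
            k: v for k, v in zero_gpu_response.items()
            if k not in ("response", "tokens_used")
        },
    }
```

Callers that read `result["choices"][0]["message"]["content"]` then keep working unchanged while the richer metadata stays reachable.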

## 2. Data Storage Analysis

### 2.1 Current System Data Storage

**Database Schema (from database_schema.sql and context_manager.py):**
```sql
-- Sessions
CREATE TABLE sessions (
    session_id TEXT PRIMARY KEY,
    user_id TEXT DEFAULT 'Test_Any',
    created_at TIMESTAMP,
    last_activity TIMESTAMP,
    context_data TEXT,
    user_metadata TEXT
);

-- Interactions
CREATE TABLE interactions (
    interaction_id TEXT PRIMARY KEY,
    session_id TEXT REFERENCES sessions(session_id),
    user_input TEXT,
    context_snapshot TEXT,
    created_at TIMESTAMP
);

-- User Contexts
CREATE TABLE user_contexts (
    user_id TEXT PRIMARY KEY,
    persona_summary TEXT,
    updated_at TIMESTAMP
);
```

**Current Logging:**
- Application-level logging to files/console
- Database storage for sessions/interactions
- No centralized audit trail
- No built-in compliance logging

### 2.2 ZeroGPU API Data Storage

**API-Side Storage (from documentation):**
The API provides comprehensive server-side logging:
- ✅ **All inference requests logged** with full audit trail
- ✅ **User activity tracking** (usage stats endpoint)
- ✅ **Request/response logging** with timestamps
- ✅ **Compliance logging** (90-day retention mentioned)
- ✅ **Performance metrics** stored
- ✅ **Token usage tracking** per user

**What the API Stores:**
1. **User Accounts** (email, mobile, approval status)
2. **Inference Logs** (all `/chat` requests)
   - User ID
   - Timestamp
   - Model used
   - Task type
   - Generation parameters
   - Tokens used
   - Performance metrics
   - Request/response content (likely)
3. **Usage Statistics** (aggregated per user)
   - Total requests
   - Total tokens
   - Requests by task
   - Average inference time

**What Your System Should Still Store:**
1. **Session Management** (conversation continuity)
   - Session IDs
   - Conversation history
   - Context summaries
2. **User Preferences** (application-specific)
   - UI preferences
   - Response speed settings
   - Context mode preferences
3. **Application State** (non-API data)
   - Agent traces
   - Reasoning chains
   - Custom metadata

### 2.3 Data Synchronization Strategy

**Recommended Approach:**

1. **Dual Storage Pattern:**
   ```
   Application DB (SQLite)        ZeroGPU API
   ├── Sessions                   ├── User Accounts
   ├── Interactions               ├── Inference Logs
   ├── User Contexts              ├── Usage Statistics
   └── User Preferences           └── Audit Trail
   ```

2. **Data Flow:**
   - **Read from API:** User info, usage stats
   - **Write to API:** Inference requests (auto-logged)
   - **Read from Local DB:** Session history, preferences
   - **Write to Local DB:** Session management, app state

3. **Migration Considerations:**
   - **User IDs:** API generates its own user IDs
   - **Email as Key:** Use email for user lookups (stable identifier)
   - **Session Mapping:** Maintain mapping: `local_session_id → api_user_id`
   - **Historical Data:** Keep existing sessions in local DB

---

## 3. User Logging Capabilities Analysis

### 3.1 API-Provided Logging Features

#### 3.1.1 Automatic Request Logging
**Every `/chat` request is automatically logged with:**
- ✅ User ID
- ✅ Timestamp
- ✅ Model name
- ✅ Task type
- ✅ Generation parameters (max_tokens, temperature, etc.)
- ✅ Context information (has_context, context_messages count)
- ✅ Compliance flags (logged, retention_days, audit_enabled)

**From API Response:**
```json
"audit_info": {
    "timestamp": "2024-01-01T12:00:00",
    "user_id": 1,
    "model_name": "mistralai/Mistral-7B-Instruct-v0.2",
    "task": "general",
    "generation_parameters": {
        "max_tokens": 512,
        "temperature": 0.7,
        "has_context": true,
        "context_messages": 2
    },
    "compliance": {
        "logged": true,
        "retention_days": 90,
        "audit_enabled": true
    }
}
```

#### 3.1.2 Usage Statistics Endpoint
**`GET /usage/stats` provides aggregated logging:**
```json
{
    "user_id": 1,
    "period_days": 30,
    "total_requests": 150,
    "total_tokens": 45000,
    "total_inference_time": 125.5,
    "requests_by_task": {
        "general": 100,
        "reasoning": 30,
        "classification": 20
    },
    "tokens_by_task": {
        "general": 30000,
        "reasoning": 10000,
        "classification": 5000
    },
    "average_tokens_per_request": 300,
    "average_inference_time": 0.84
}
```

**Capabilities:**
- ✅ **Per-user statistics** (requires authentication)
- ✅ **Time-period filtering** (days parameter)
- ✅ **Task breakdown** (requests and tokens by task)
- ✅ **Performance metrics** (average inference time)
- ✅ **Token usage tracking** (input/output/total)

#### 3.1.3 Admin Logging Endpoints
**Admin endpoints provide additional logging:**
- `GET /admin/all-users` - All user accounts
- `GET /admin/pending-users` - Pending approvals
- User approval/deactivation actions logged

#### 3.1.4 Rate Limiting Headers
**Every response includes rate limit headers:**
```
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1704067200
```

### 3.2 What's NOT Logged by API (Need Local Logging)

**The API does NOT provide:**
1. **Request/Response Content** (message text, response text)
   - API logs metadata but may not store full content
   - **Action:** Continue local logging for full conversation history
2. **Agent Traces** (your multi-agent system specifics)
   - API doesn't know about your agent architecture
   - **Action:** Keep agent trace logging in local DB
3. **Reasoning Chains** (chain of thought)
   - Application-specific reasoning data
   - **Action:** Store in local interactions table
4. **Context Summaries** (user persona, session context)
   - Application-level context management
   - **Action:** Continue using local context_manager

### 3.3 Recommended Logging Strategy

**Hybrid Logging Approach:**

```python
# 1. API handles inference logging (automatic)
response = zero_gpu_client.chat(
    message=user_input,
    task="general",
    context=conversation_context
)
# API automatically logs: user_id, timestamp, model, task, params, metrics

# 2. Application handles application-specific logging
local_db.save_interaction(
    session_id=session_id,
    user_input=user_input,
    response=response["response"],
    agent_trace=agent_trace,               # Your system's agent data
    reasoning_data=reasoning_data,         # Your system's reasoning
    api_audit_info=response["audit_info"]  # Link to API log
)

# 3. Periodic sync for usage stats
usage_stats = zero_gpu_client.get_usage_stats(days=30)
local_db.update_usage_cache(user_id, usage_stats)
```

**Benefits:**
- ✅ **API handles compliance** (audit trail, retention)
- ✅ **Application handles context** (conversation continuity)
- ✅ **Reduced local logging** (no need to log inference details)
- ✅ **Better separation** (API concerns vs application concerns)

---

## 4. Integration Requirements

### 4.1 Code Changes Required

#### 4.1.1 Create ZeroGPU API Client
**New File: `zero_gpu_client.py`**
```python
import requests
from typing import Any, Dict

class ZeroGPUChatClient:
    def __init__(self, base_url: str, email: str, password: str):
        self.base_url = base_url.rstrip('/')
        self.access_token = None
        self.refresh_token = None
        self.login(email, password)

    def login(self, email: str, password: str):
        """Login and get tokens"""
        response = requests.post(
            f"{self.base_url}/login",
            json={"email": email, "password": password}
        )
        response.raise_for_status()
        data = response.json()
        self.access_token = data["access_token"]
        self.refresh_token = data["refresh_token"]

    def refresh_access_token(self):
        """Refresh access token"""
        response = requests.post(
            f"{self.base_url}/refresh",
            headers={"X-Refresh-Token": self.refresh_token}
        )
        response.raise_for_status()
        data = response.json()
        self.access_token = data["access_token"]
        self.refresh_token = data["refresh_token"]

    def chat(self, message: str, task: str = "general", **kwargs) -> Dict[str, Any]:
        """Send chat message with auto-retry on 401"""
        url = f"{self.base_url}/chat"
        headers = {
            "Authorization": f"Bearer {self.access_token}",
            "Content-Type": "application/json"
        }

        payload = {
            "message": message,
            "task": task,
            **kwargs
        }

        response = requests.post(url, json=payload, headers=headers)

        if response.status_code == 401:
            # Token expired, refresh and retry
            self.refresh_access_token()
            headers["Authorization"] = f"Bearer {self.access_token}"
            response = requests.post(url, json=payload, headers=headers)

        response.raise_for_status()
        return response.json()
```

#### 4.1.2 Update LLM Router
**Modify: `llm_router.py` or `src/llm_router.py`**

**Current:**
```python
async def _call_hf_endpoint(self, model_config: dict, prompt: str, task_type: str, **kwargs):
    api_url = "https://router.huggingface.co/v1/chat/completions"
    # ... HF API call
```

**New:**
```python
from datetime import datetime
from typing import Dict, List

async def _call_zero_gpu_endpoint(self, task_type: str, prompt: str, context: List[Dict] = None, **kwargs):
    # Map task type to ZeroGPU task
    task_mapping = {
        "intent_classification": "classification",
        "embedding_generation": "embedding",
        "general_reasoning": "reasoning",
        "response_synthesis": "general"
    }
    zero_gpu_task = task_mapping.get(task_type, "general")

    # Prepare context if provided
    context_messages = None
    if context:
        context_messages = [
            {
                "role": msg.get("role", "user"),
                "content": msg.get("content", ""),
                "timestamp": msg.get("timestamp", datetime.utcnow().isoformat())
            }
            for msg in context
        ]

    # Call ZeroGPU API
    response = self.zero_gpu_client.chat(
        message=prompt,
        task=zero_gpu_task,
        context=context_messages,
        max_tokens=kwargs.get('max_tokens', 512),
        temperature=kwargs.get('temperature', 0.7),
        **{k: v for k, v in kwargs.items() if k not in ['max_tokens', 'temperature']}
    )

    return response["response"]
```

#### 4.1.3 Update Configuration
**Modify: `config.py` or create `zero_gpu_config.py`**
```python
import os

ZERO_GPU_CONFIG = {
    "base_url": os.getenv("ZERO_GPU_API_URL", "http://your-pod-ip:8000"),
    "service_account": {
        "email": os.getenv("ZERO_GPU_EMAIL", "service@example.com"),
        "password": os.getenv("ZERO_GPU_PASSWORD", "")
    },
    "task_mapping": {
        "intent_classification": "classification",
        "embedding_generation": "embedding",
        "general_reasoning": "reasoning",
        "response_synthesis": "general",
        "safety_check": "general"
    },
    "retry_config": {
        "max_retries": 3,
        "timeout": 30,
        "wait_for_ready": True,
        "ready_timeout": 300
    }
}
```

### 4.2 Database Schema Updates

**No schema changes required**, but consider adding:

```sql
-- Optional: Track API user mapping
CREATE TABLE IF NOT EXISTS api_user_mapping (
    local_user_id TEXT PRIMARY KEY,
    api_user_id INTEGER,
    api_email TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Optional: Cache usage stats
CREATE TABLE IF NOT EXISTS api_usage_cache (
    user_id TEXT PRIMARY KEY,
    stats_json TEXT,
    last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
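If the optional `api_usage_cache` table is adopted, the periodic stats sync described in section 3.3 reduces to a small upsert. A sketch using the stdlib `sqlite3` module; this standalone `update_usage_cache` is an illustration of the helper referenced earlier, not existing code:

```python
import json
import sqlite3

def update_usage_cache(conn: sqlite3.Connection, user_id: str, stats: dict) -> None:
    """Upsert the latest /usage/stats payload for a user into api_usage_cache."""
    conn.execute(
        """
        INSERT INTO api_usage_cache (user_id, stats_json, last_updated)
        VALUES (?, ?, CURRENT_TIMESTAMP)
        ON CONFLICT(user_id) DO UPDATE SET
            stats_json = excluded.stats_json,
            last_updated = CURRENT_TIMESTAMP
        """,
        (user_id, json.dumps(stats)),
    )
    conn.commit()
```

The `ON CONFLICT ... DO UPDATE` clause keeps exactly one cached row per user, so repeated syncs simply overwrite the previous snapshot.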

### 4.3 Environment Variables

**Add to `.env` or environment:**
```bash
# ZeroGPU API Configuration
ZERO_GPU_API_URL=http://your-pod-ip:8000
ZERO_GPU_EMAIL=service@example.com
ZERO_GPU_PASSWORD=your-secure-password

# Optional: Fallback to HF if ZeroGPU unavailable
USE_ZERO_GPU=true
HF_TOKEN=your-hf-token  # Keep as fallback
```

---

## 5. Migration Plan

### 5.1 Phase 1: Setup and Testing (Week 1)
1. ✅ Set up ZeroGPU API instance
2. ✅ Create service account
3. ✅ Implement ZeroGPU client
4. ✅ Test authentication flow
5. ✅ Test basic chat endpoint
6. ✅ Verify logging works

### 5.2 Phase 2: Integration (Week 2)
1. ✅ Update LLM router to use ZeroGPU
2. ✅ Implement task mapping
3. ✅ Add context support
4. ✅ Update error handling
5. ✅ Test all task types
6. ✅ Verify fallback logic

### 5.3 Phase 3: User Management (Week 3)
1. ✅ Decide on user strategy (service account vs per-user)
2. ✅ Implement user mapping if needed
3. ✅ Update user creation flow
4. ✅ Test user approval workflow
5. ✅ Migrate existing users (if applicable)

### 5.4 Phase 4: Logging Integration (Week 4)
1. ✅ Reduce local inference logging (rely on API)
2. ✅ Keep application-specific logging
3. ✅ Implement usage stats sync
4. ✅ Test audit trail access
5. ✅ Verify compliance requirements

### 5.5 Phase 5: Production Deployment (Week 5)
1. ✅ Deploy to staging
2. ✅ Load testing
3. ✅ Monitor API usage
4. ✅ Verify logging completeness
5. ✅ Deploy to production
6. ✅ Monitor and optimize

---

## 6. Advantages and Considerations

### 6.1 Advantages

1. **Built-in User Management**
   - No need to manage user accounts separately
   - JWT authentication is industry standard
   - User approval workflow built-in

2. **Comprehensive Logging**
   - All requests automatically logged
   - Audit trail for compliance
   - Usage statistics readily available
   - Reduced local logging overhead

3. **Better Task Routing**
   - Task-based instead of model-based
   - API handles model selection
   - Simpler configuration

4. **Rich Metadata**
   - Performance metrics included
   - Quality scores provided
   - Token usage tracked
   - Inference timing available

5. **Security Features**
   - Prompt injection detection
   - Rate limiting built-in
   - Input validation
   - JWT token security

### 6.2 Considerations

1. **Authentication Complexity**
   - Need to manage tokens (access + refresh)
   - Token expiry handling required
   - More complex than an API key

2. **User Management Overhead**
   - User approval workflow (unless auto-approved)
   - Need to maintain user accounts
   - Migration complexity if using per-user accounts

3. **API Dependency**
   - Single point of failure
   - Network dependency
   - Need fallback strategy

4. **Data Location**
   - Logs stored on API server
   - Need to trust API provider
   - May need data export capability

5. **Cost Considerations**
   - May have different pricing model
   - Usage tracking helps monitor costs
   - Rate limits may affect throughput

---

## 7. Recommendations

### 7.1 Immediate Actions

1. **✅ Start with Service Account**
   - Simpler initial integration
   - Faster to implement
   - Can migrate to per-user later

2. **✅ Keep Local Logging Initially**
   - Don't remove local logging immediately
   - Run dual logging during migration
   - Verify API logging completeness
   - Remove local logging after verification

3. **✅ Implement Fallback**
   - Keep HF API as fallback
   - Handle API unavailability gracefully
   - Test fallback scenarios

4. **✅ Test Thoroughly**
   - Test all task types
   - Test authentication flow
   - Test token refresh
   - Test error scenarios
   - Test rate limiting

### 7.2 Long-term Strategy

1. **Migrate to Per-User Accounts** (if multi-tenant)
   - Better usage tracking
   - Per-user rate limits
   - Better audit trail

2. **Leverage API Logging**
   - Reduce local logging overhead
   - Use API for compliance reporting
   - Sync usage stats periodically

3. **Optimize Context Management**
   - Use API's context parameter
   - Reduce local context storage
   - Leverage API's context validation

4. **Monitor and Optimize**
   - Track API usage patterns
   - Optimize task mapping
   - Adjust rate limits if needed
   - Monitor costs

---

## 8. Testing Checklist

### 8.1 Authentication Testing
- [ ] User registration works
- [ ] Login returns valid tokens
- [ ] Token refresh works
- [ ] Expired token handling
- [ ] Invalid token rejection
- [ ] User approval workflow

### 8.2 API Endpoint Testing
- [ ] `/chat` endpoint works for all task types
- [ ] Context parameter works correctly
- [ ] System prompts work
- [ ] Generation parameters respected
- [ ] Error handling works
- [ ] Rate limiting works

### 8.3 Logging Verification
- [ ] All requests logged in API
- [ ] Usage stats accurate
- [ ] Audit info included in responses
- [ ] Token usage tracked correctly
- [ ] Performance metrics available

### 8.4 Integration Testing
- [ ] LLM router uses ZeroGPU
- [ ] Task mapping correct
- [ ] Context passed correctly
- [ ] Error handling graceful
- [ ] Fallback works if API unavailable
- [ ] Performance acceptable

### 8.5 Production Readiness
- [ ] Load testing completed
- [ ] Monitoring in place
- [ ] Alerting configured
- [ ] Documentation updated
- [ ] Team trained
- [ ] Rollback plan ready

---

## 9. Conclusion

The ZeroGPU Chat API is a **strong replacement** for the Hugging Face Inference API, with significant advantages:

**✅ Recommended for Integration:**
- Better user management
- Comprehensive logging
- Rich metadata
- Security features
- Task-based routing

**⚠️ Requires Careful Planning:**
- Authentication complexity
- User management strategy
- Migration planning
- Fallback implementation

**📋 Next Steps:**
1. Review this document with team
2. Set up ZeroGPU API instance
3. Create service account
4. Implement client library
5. Test integration
6. Plan migration timeline

---

**Document Version:** 1.0
**Last Updated:** 2025-01-07
**Status:** Ready for Review
ZEROGPU_API_REVIEW_SUMMARY.md ADDED
@@ -0,0 +1,179 @@
# ZeroGPU Chat API - Quick Review Summary

## 🎯 Key Findings

### ✅ **API Documentation Quality: Excellent**
- Comprehensive documentation with clear examples
- Well-structured endpoint descriptions
- Good error handling documentation
- Multi-agent integration guide included

### ✅ **Data Storage: Server-Side Logging**
The API provides **comprehensive server-side logging**:
- ✅ All inference requests automatically logged
- ✅ User activity tracking via `/usage/stats` endpoint
- ✅ Audit trail with 90-day retention
- ✅ Performance metrics stored
- ✅ Token usage tracking per user

**What This Means:**
- You can **reduce local logging** for inference requests
- API handles **compliance logging** automatically
- Usage statistics available via API endpoint
- Still need local storage for: sessions, agent traces, reasoning chains

### ✅ **User Logging Capabilities: Comprehensive**

#### Automatic Logging (Every Request)
```json
"audit_info": {
    "timestamp": "2024-01-01T12:00:00",
    "user_id": 1,
    "model_name": "...",
    "task": "general",
    "generation_parameters": {...},
    "compliance": {
        "logged": true,
        "retention_days": 90,
        "audit_enabled": true
    }
}
```

#### Usage Statistics Endpoint
- Per-user statistics
- Time-period filtering
- Task breakdown (requests/tokens by task)
- Performance metrics
- Token usage tracking

#### What's NOT Logged by API
- Request/response content (full text) - may not be stored
- Agent traces (your system-specific)
- Reasoning chains (application-specific)
- Context summaries (user persona)

**Recommendation:** Use hybrid logging - API for inference logs, local DB for application-specific data.

---

## 🔄 Integration Requirements

### 1. Replace HF Endpoint Calls
**Current:** `https://router.huggingface.co/v1/chat/completions`
**New:** `http://your-pod-ip:8000/chat`

### 2. Implement Authentication
- JWT-based (access token + refresh token)
- Token expiry handling required
- User approval workflow

### 3. Task Type Mapping
```python
TASK_MAPPING = {
    "intent_classification": "classification",
    "embedding_generation": "embedding",
    "general_reasoning": "reasoning",
    "response_synthesis": "general"
}
```

### 4. Update LLM Router
- Replace `_call_hf_endpoint()` with `_call_zero_gpu_endpoint()`
- Add context parameter support
- Implement token refresh logic

---
87
+
88
+ ## 📊 Data Storage Comparison
89
+
90
+ ### Current System
91
+ - **Local SQLite:** Sessions, interactions, user contexts
92
+ - **Local Logging:** Application logs, inference logs
93
+ - **No centralized audit trail**
94
+
95
+ ### With ZeroGPU API
96
+ - **API Server:** User accounts, inference logs, usage stats, audit trail
97
+ - **Local SQLite:** Sessions, agent traces, reasoning, preferences
98
+ - **Hybrid Approach:** API handles compliance, local handles context
99
+
100
+ **Storage Strategy:**
101
+ ```
102
+ API Handles: Local DB Handles:
103
+ ├── User accounts ├── Session management
104
+ ├── Inference logs ├── Conversation history
105
+ ├── Usage statistics ├── Agent traces
106
+ ├── Audit trail ├── Reasoning chains
107
+ └── Token usage └── User preferences
108
+ ```
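
The split in this diagram can be enforced with a simple dispatch table. The record-type names below are illustrative labels derived from the diagram, not identifiers from the codebase:

```python
API_RECORDS = {"user_account", "inference_log", "usage_stats", "audit_trail", "token_usage"}
LOCAL_RECORDS = {"session", "conversation_history", "agent_trace", "reasoning_chain", "user_preference"}

def storage_target(record_type):
    """Decide which store owns a record type under the hybrid strategy."""
    if record_type in API_RECORDS:
        return "api"
    if record_type in LOCAL_RECORDS:
        return "local_db"
    raise ValueError(f"Unknown record type: {record_type}")

print(storage_target("inference_log"))  # → api
print(storage_target("agent_trace"))    # → local_db
```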

---

## ⚠️ Key Considerations

### Advantages
1. ✅ Built-in user management (JWT auth)
2. ✅ Comprehensive audit logging
3. ✅ Rich metadata (tokens, timing, quality)
4. ✅ Task-based routing (simpler config)
5. ✅ Security features (rate limiting, prompt injection detection)

### Challenges
1. ⚠️ Authentication complexity (tokens vs API keys)
2. ⚠️ User management overhead (approval workflow)
3. ⚠️ API dependency (single point of failure)
4. ⚠️ Data location (logs on API server)
5. ⚠️ Migration complexity (user mapping)

---

## 🚀 Recommended Approach

### Phase 1: Service Account (Start Here)
- Create one service account for the application
- Simpler initial integration
- All usage tracked under one user
- Can migrate to per-user accounts later

### Phase 2: Hybrid Logging
- API handles inference logging (automatic)
- Local DB handles application-specific data
- Reduce local logging overhead
- Keep agent traces and reasoning locally

### Phase 3: Gradual Migration
- Start with the service account
- Test thoroughly
- Monitor API logging
- Migrate to per-user accounts if needed

---

## 📋 Action Items

1. **Review Integration Plan** (`ZEROGPU_API_INTEGRATION_REVIEW.md`)
2. **Set up ZeroGPU API instance**
3. **Create service account**
4. **Implement ZeroGPU client** (`zero_gpu_client.py`)
5. **Update LLM router** (replace HF calls)
6. **Test authentication flow**
7. **Test all task types**
8. **Verify logging works**
9. **Implement fallback** (keep HF as backup)
10. **Deploy to staging**

---

## 📚 Documentation References

- **Full Review:** `ZEROGPU_API_INTEGRATION_REVIEW.md`
- **API Documentation:** Provided ZeroGPU API docs
- **Current API:** `API_QUICK_REFERENCE.md`
- **Current Implementation:** `llm_router.py`, `models_config.py`

---

**Status:** ✅ Ready for Integration
**Priority:** High (replaces HF/Novita endpoints)
**Complexity:** Medium (requires authentication and task mapping)
app.py CHANGED
@@ -2024,9 +2024,24 @@ def initialize_orchestrator():
     if not hf_token:
         logger.warning("HF_TOKEN not found in environment")
 
+    # Prepare ZeroGPU config if enabled
+    zero_gpu_config = None
+    try:
+        from config import settings
+        if settings.zero_gpu_enabled and settings.zero_gpu_email and settings.zero_gpu_password:
+            zero_gpu_config = {
+                "enabled": True,
+                "base_url": settings.zero_gpu_base_url,
+                "email": settings.zero_gpu_email,
+                "password": settings.zero_gpu_password
+            }
+            logger.info("ZeroGPU API enabled in configuration")
+    except Exception as e:
+        logger.debug(f"Could not load ZeroGPU config: {e}")
+
     # Initialize LLM Router
     logger.info("Step 1/6: Initializing LLM Router...")
-    llm_router = LLMRouter(hf_token)
+    llm_router = LLMRouter(hf_token, use_local_models=True, zero_gpu_config=zero_gpu_config)
     logger.info("✓ LLM Router initialized")
 
     # Initialize Agents
config.py CHANGED
@@ -36,6 +36,12 @@ class Settings(BaseSettings):
     log_level: str = os.getenv("LOG_LEVEL", "INFO")
     log_format: str = os.getenv("LOG_FORMAT", "json")
 
+    # ZeroGPU API settings
+    zero_gpu_enabled: bool = os.getenv("USE_ZERO_GPU", "false").lower() == "true"
+    zero_gpu_base_url: str = os.getenv("ZERO_GPU_API_URL", "http://localhost:8000")
+    zero_gpu_email: str = os.getenv("ZERO_GPU_EMAIL", "")
+    zero_gpu_password: str = os.getenv("ZERO_GPU_PASSWORD", "")
+
     class Config:
         env_file = ".env"
 
flask_api_standalone.py CHANGED
@@ -55,9 +55,24 @@ def initialize_orchestrator():
     if not hf_token:
         logger.warning("HF_TOKEN not set - API fallback will be used if local models fail")
 
+    # Prepare ZeroGPU config if enabled
+    zero_gpu_config = None
+    try:
+        from config import settings
+        if settings.zero_gpu_enabled and settings.zero_gpu_email and settings.zero_gpu_password:
+            zero_gpu_config = {
+                "enabled": True,
+                "base_url": settings.zero_gpu_base_url,
+                "email": settings.zero_gpu_email,
+                "password": settings.zero_gpu_password
+            }
+            logger.info("ZeroGPU API enabled in configuration")
+    except Exception as e:
+        logger.debug(f"Could not load ZeroGPU config: {e}")
+
     # Initialize LLM Router with local model loading enabled
     logger.info("Initializing LLM Router with local GPU model loading...")
-    llm_router = LLMRouter(hf_token, use_local_models=True)
+    llm_router = LLMRouter(hf_token, use_local_models=True, zero_gpu_config=zero_gpu_config)
 
     logger.info("Initializing Agents...")
     agents = {
src/llm_router.py CHANGED
@@ -1,17 +1,20 @@
-# llm_router.py - UPDATED FOR LOCAL GPU MODEL LOADING
+# llm_router.py - UPDATED FOR LOCAL GPU MODEL LOADING + ZEROGPU API
 import logging
 import asyncio
-from typing import Dict, Optional
+import os
+from typing import Dict, Optional, List
 from .models_config import LLM_CONFIG
 
 logger = logging.getLogger(__name__)
 
 class LLMRouter:
-    def __init__(self, hf_token, use_local_models: bool = True):
+    def __init__(self, hf_token, use_local_models: bool = True, zero_gpu_config: Optional[Dict] = None):
         self.hf_token = hf_token
         self.health_status = {}
         self.use_local_models = use_local_models
         self.local_loader = None
+        self.zero_gpu_client = None
+        self.use_zero_gpu = False
 
         logger.info("LLMRouter initialized")
         if hf_token:
@@ -19,6 +22,35 @@ class LLMRouter:
         else:
             logger.warning("No HF token provided")
 
+        # Initialize ZeroGPU client if configured
+        if zero_gpu_config and zero_gpu_config.get("enabled", False):
+            try:
+                from zero_gpu_client import ZeroGPUChatClient
+                base_url = zero_gpu_config.get("base_url", os.getenv("ZERO_GPU_API_URL", "http://localhost:8000"))
+                email = zero_gpu_config.get("email", os.getenv("ZERO_GPU_EMAIL", ""))
+                password = zero_gpu_config.get("password", os.getenv("ZERO_GPU_PASSWORD", ""))
+
+                if email and password:
+                    self.zero_gpu_client = ZeroGPUChatClient(base_url, email, password)
+                    self.use_zero_gpu = True
+                    logger.info("✓ ZeroGPU API client initialized")
+
+                    # Wait for API to be ready (non-blocking, will fallback if not ready)
+                    try:
+                        if not self.zero_gpu_client.wait_for_ready(timeout=10):
+                            logger.warning("ZeroGPU API not ready, will use HF fallback")
+                            self.use_zero_gpu = False
+                    except Exception as e:
+                        logger.warning(f"Could not verify ZeroGPU API readiness: {e}. Will use HF fallback.")
+                        self.use_zero_gpu = False
+                else:
+                    logger.warning("ZeroGPU enabled but credentials not provided")
+            except ImportError:
+                logger.warning("zero_gpu_client not available, ZeroGPU disabled")
+            except Exception as e:
+                logger.warning(f"Could not initialize ZeroGPU client: {e}. Falling back to HF API.")
+                self.use_zero_gpu = False
+
         # Initialize local model loader if enabled
         if self.use_local_models:
             try:
@@ -35,10 +67,10 @@ class LLMRouter:
             self.use_local_models = False
             self.local_loader = None
 
-    async def route_inference(self, task_type: str, prompt: str, **kwargs):
+    async def route_inference(self, task_type: str, prompt: str, context: Optional[List[Dict]] = None, **kwargs):
         """
         Smart routing based on task specialization
-        Tries local models first, falls back to HF Inference API if needed
+        Tries local models first, then ZeroGPU API, falls back to HF Inference API if needed
         """
         logger.info(f"Routing inference for task: {task_type}")
         model_config = self._select_model(task_type)
@@ -62,6 +94,19 @@ class LLMRouter:
             logger.warning(f"Local model inference failed: {e}. Falling back to API.")
             logger.debug("Exception details:", exc_info=True)
 
+        # Try ZeroGPU API if enabled
+        if self.use_zero_gpu and self.zero_gpu_client:
+            try:
+                result = await self._call_zero_gpu_endpoint(task_type, prompt, context, **kwargs)
+                if result is not None:
+                    logger.info(f"Inference complete for {task_type} (ZeroGPU API)")
+                    return result
+                else:
+                    logger.warning("ZeroGPU API returned None, falling back to HF")
+            except Exception as e:
+                logger.warning(f"ZeroGPU API inference failed: {e}. Falling back to HF API.")
+                logger.debug("Exception details:", exc_info=True)
+
         # Fallback to HF Inference API
         logger.info("Using HF Inference API")
         # Health check and fallback logic
@@ -149,6 +194,115 @@ class LLMRouter:
             logger.error(f"Error calling local embedding model: {e}", exc_info=True)
             return None
 
+    async def _call_zero_gpu_endpoint(self, task_type: str, prompt: str, context: Optional[List[Dict]] = None, **kwargs) -> Optional[str]:
+        """
+        Call ZeroGPU API endpoint
+
+        Args:
+            task_type: Task type (e.g., "intent_classification", "general_reasoning")
+            prompt: User prompt/message
+            context: Optional conversation context
+            **kwargs: Additional generation parameters
+
+        Returns:
+            Generated text response or None if failed
+        """
+        if not self.zero_gpu_client:
+            return None
+
+        try:
+            # Map task type to ZeroGPU task
+            task_mapping = LLM_CONFIG.get("zero_gpu_task_mapping", {})
+            zero_gpu_task = task_mapping.get(task_type, "general")
+
+            logger.info(f"Calling ZeroGPU API for task: {task_type} -> {zero_gpu_task}")
+            logger.debug(f"Prompt length: {len(prompt)}")
+            logger.info("=" * 80)
+            logger.info("ZEROGPU API REQUEST:")
+            logger.info("=" * 80)
+            logger.info(f"Task Type: {task_type} -> ZeroGPU Task: {zero_gpu_task}")
+            logger.info(f"Prompt Length: {len(prompt)} characters")
+            logger.info("-" * 40)
+            logger.info("FULL PROMPT CONTENT:")
+            logger.info("-" * 40)
+            logger.info(prompt)
+            logger.info("-" * 40)
+            logger.info("END OF PROMPT")
+            logger.info("=" * 80)
+
+            # Prepare context if provided
+            context_messages = None
+            if context:
+                context_messages = []
+                for msg in context[-50:]:  # Limit to 50 messages as per API
+                    context_messages.append({
+                        "role": msg.get("role", "user"),
+                        "content": msg.get("content", ""),
+                        "timestamp": msg.get("timestamp", "")
+                    })
+
+            # Prepare generation parameters
+            generation_params = {
+                "max_tokens": kwargs.get('max_tokens', 512),
+                "temperature": kwargs.get('temperature', 0.7),
+            }
+
+            # Add optional parameters
+            if 'top_p' in kwargs:
+                generation_params["top_p"] = kwargs['top_p']
+            if 'system_prompt' in kwargs:
+                generation_params["system_prompt"] = kwargs['system_prompt']
+
+            # Call ZeroGPU API
+            response = self.zero_gpu_client.chat(
+                message=prompt,
+                task=zero_gpu_task,
+                context=context_messages,
+                **generation_params
+            )
+
+            # Extract response text
+            if response and "response" in response:
+                generated_text = response["response"]
+
+                if not generated_text or generated_text.strip() == "":
+                    logger.warning("ZeroGPU API returned empty response")
+                    return None
+
+                logger.info(f"ZeroGPU API returned response (length: {len(generated_text)})")
+                logger.info("=" * 80)
+                logger.info("COMPLETE ZEROGPU API RESPONSE:")
+                logger.info("=" * 80)
+                logger.info(f"Task Type: {task_type} -> ZeroGPU Task: {zero_gpu_task}")
+                logger.info(f"Response Length: {len(generated_text)} characters")
+
+                # Log metrics if available
+                if "tokens_used" in response:
+                    tokens = response["tokens_used"]
+                    logger.info(f"Tokens: input={tokens.get('input', 0)}, output={tokens.get('output', 0)}, total={tokens.get('total', 0)}")
+
+                if "inference_metrics" in response:
+                    metrics = response["inference_metrics"]
+                    logger.info(f"Inference Duration: {metrics.get('inference_duration', 0):.2f}s")
+                    logger.info(f"Tokens/Second: {metrics.get('tokens_per_second', 0):.2f}")
+
+                logger.info("-" * 40)
+                logger.info("FULL RESPONSE CONTENT:")
+                logger.info("-" * 40)
+                logger.info(generated_text)
+                logger.info("-" * 40)
+                logger.info("END OF RESPONSE")
+                logger.info("=" * 80)
+
+                return generated_text
+            else:
+                logger.error(f"Unexpected ZeroGPU response format: {response}")
+                return None
+
+        except Exception as e:
+            logger.error(f"Error calling ZeroGPU API: {e}", exc_info=True)
+            return None
+
     def _select_model(self, task_type: str) -> dict:
         model_map = {
             "intent_classification": LLM_CONFIG["models"]["classification_specialist"],
src/models_config.py CHANGED
@@ -39,5 +39,12 @@ LLM_CONFIG = {
         "strategy": "task_based_routing",
         "fallback_chain": ["primary", "fallback", "degraded_mode"],
         "load_balancing": "round_robin_with_health_check"
-    }
+    },
+    "zero_gpu_task_mapping": {
+        "intent_classification": "classification",
+        "embedding_generation": "embedding",
+        "safety_check": "general",
+        "general_reasoning": "reasoning",
+        "response_synthesis": "general"
+    }
 }
zero_gpu_client.py ADDED
@@ -0,0 +1,219 @@
# zero_gpu_client.py
"""
ZeroGPU Chat API Client
Provides authentication and API access to ZeroGPU Chat API
"""
import requests
import time
import logging
from typing import Optional, List, Dict, Any
from datetime import datetime

logger = logging.getLogger(__name__)


class ZeroGPUChatClient:
    """Client for ZeroGPU Chat API with automatic token refresh"""

    def __init__(self, base_url: str, email: str, password: str):
        """
        Initialize ZeroGPU API client

        Args:
            base_url: Base URL of ZeroGPU API (e.g., "http://your-pod-ip:8000")
            email: User email for authentication
            password: User password for authentication
        """
        self.base_url = base_url.rstrip('/')
        self.email = email
        self.password = password
        self.access_token = None
        self.refresh_token = None
        self._last_token_refresh = None

        logger.info(f"Initializing ZeroGPU client for {self.base_url}")
        self.login(email, password)

    def login(self, email: str, password: str):
        """Login and get authentication tokens"""
        try:
            response = requests.post(
                f"{self.base_url}/login",
                json={"email": email, "password": password},
                timeout=10
            )
            response.raise_for_status()
            data = response.json()
            self.access_token = data["access_token"]
            self.refresh_token = data["refresh_token"]
            self._last_token_refresh = time.time()
            logger.info("✓ ZeroGPU authentication successful")
        except requests.exceptions.RequestException as e:
            logger.error(f"ZeroGPU login failed: {e}")
            raise

    def refresh_access_token(self):
        """Refresh access token using refresh token"""
        try:
            response = requests.post(
                f"{self.base_url}/refresh",
                headers={"X-Refresh-Token": self.refresh_token},
                timeout=10
            )
            response.raise_for_status()
            data = response.json()
            self.access_token = data["access_token"]
            self.refresh_token = data.get("refresh_token", self.refresh_token)
            self._last_token_refresh = time.time()
            logger.info("✓ ZeroGPU token refreshed")
        except requests.exceptions.RequestException as e:
            logger.warning(f"Token refresh failed, attempting re-login: {e}")
            # Try to re-login if refresh fails
            self.login(self.email, self.password)

    def _get_headers(self) -> Dict[str, str]:
        """Get request headers with authentication"""
        return {
            "Authorization": f"Bearer {self.access_token}",
            "Content-Type": "application/json"
        }

    def _ensure_valid_token(self):
        """Ensure access token is valid, refresh if needed"""
        # Refresh token if it's been more than 10 minutes (tokens expire in 15 min)
        if self._last_token_refresh and (time.time() - self._last_token_refresh) > 600:
            self.refresh_access_token()

    def _request(self, method: str, endpoint: str, **kwargs) -> Dict[str, Any]:
        """
        Make authenticated request with auto-retry on 401

        Args:
            method: HTTP method (GET, POST, etc.)
            endpoint: API endpoint (e.g., "/chat")
            **kwargs: Additional arguments for requests.request()

        Returns:
            Response JSON as dictionary
        """
        url = f"{self.base_url}{endpoint}"
        self._ensure_valid_token()
        kwargs.setdefault("headers", {}).update(self._get_headers())

        try:
            response = requests.request(method, url, **kwargs)

            if response.status_code == 401:
                # Token expired, refresh and retry
                logger.info("Token expired, refreshing...")
                self.refresh_access_token()
                kwargs["headers"].update(self._get_headers())
                response = requests.request(method, url, **kwargs)

            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            logger.error(f"ZeroGPU API request failed: {e}")
            if hasattr(e, 'response') and e.response is not None:
                logger.error(f"Response: {e.response.text}")
            raise

    def chat(
        self,
        message: str,
        task: str = "general",
        context: Optional[List[Dict[str, str]]] = None,
        max_tokens: Optional[int] = None,
        temperature: Optional[float] = None,
        top_p: Optional[float] = None,
        system_prompt: Optional[str] = None,
        **kwargs
    ) -> Dict[str, Any]:
        """
        Send chat message to ZeroGPU API

        Args:
            message: User message/prompt
            task: Task type ("general", "reasoning", "classification", "embedding")
            context: Optional conversation context (list of message dicts)
            max_tokens: Maximum tokens to generate
            temperature: Sampling temperature (0.0-2.0)
            top_p: Nucleus sampling (0.0-1.0)
            system_prompt: Optional system prompt
            **kwargs: Additional generation parameters

        Returns:
            API response dictionary with response, metrics, and audit info
        """
        payload = {
            "message": message,
            "task": task,
            **kwargs
        }

        if context:
            payload["context"] = context

        if max_tokens is not None:
            payload["max_tokens"] = max_tokens

        if temperature is not None:
            payload["temperature"] = temperature

        if top_p is not None:
            payload["top_p"] = top_p

        if system_prompt:
            payload["system_prompt"] = system_prompt

        logger.debug(f"ZeroGPU chat request: task={task}, message_length={len(message)}")
        return self._request("POST", "/chat", json=payload, timeout=60)

    def get_tasks(self) -> Dict[str, Any]:
        """Get available tasks and their specifications"""
        return self._request("GET", "/tasks")

    def get_usage_stats(self, days: int = 30) -> Dict[str, Any]:
        """Get usage statistics for authenticated user"""
        return self._request("GET", f"/usage/stats?days={days}")

    def get_user_info(self) -> Dict[str, Any]:
        """Get current authenticated user information"""
        return self._request("GET", "/me")

    def wait_for_ready(self, timeout: int = 300) -> bool:
        """
        Wait for API to be ready (models loaded)

        Args:
            timeout: Maximum time to wait in seconds

        Returns:
            True if ready, False if timeout
        """
        start_time = time.time()
        while time.time() - start_time < timeout:
            try:
                response = requests.get(f"{self.base_url}/ready", timeout=5)
                if response.status_code == 200:
                    data = response.json()
                    if data.get("ready", False):
                        logger.info("✓ ZeroGPU API is ready")
                        return True
            except requests.exceptions.RequestException:
                pass

            logger.info("Waiting for ZeroGPU API to be ready...")
            time.sleep(5)

        logger.warning(f"ZeroGPU API not ready after {timeout} seconds")
        return False

    def health_check(self) -> bool:
        """Check if API is healthy"""
        try:
            response = requests.get(f"{self.base_url}/health", timeout=5)
            return response.status_code == 200
        except requests.exceptions.RequestException:
            return False