JatsTheAIGen committed
Commit 927854c · 1 Parent(s): ea87e33

Integrate Novita AI as exclusive inference provider

- Add Novita AI API integration with the DeepSeek-R1-Distill-Qwen-7B model
- Remove all local model dependencies
- Optimize token allocation for user inputs and context
- Add Anaconda environment setup files
- Add comprehensive test scripts and documentation
CONDA_SETUP_GUIDE.md ADDED
@@ -0,0 +1,166 @@
# Anaconda Environment Setup Guide

## Quick Start

### 1. Create Conda Environment

```bash
# Create environment from environment.yml
conda env create -f environment.yml

# OR create manually
conda create -n research-ai-assistant python=3.10
conda activate research-ai-assistant
```

### 2. Activate Environment

```bash
# Windows
conda activate research-ai-assistant

# Linux/Mac
source activate research-ai-assistant
# OR
conda activate research-ai-assistant
```

### 3. Install Dependencies

```bash
# Install from requirements.txt
pip install -r requirements.txt

# OR install the openai package directly (quoted so the shell
# does not treat >= as a redirection)
pip install "openai>=1.0.0"
```

### 4. Set Environment Variables

```bash
# Windows (PowerShell)
$env:NOVITA_API_KEY="your_api_key_here"
$env:NOVITA_BASE_URL="https://api.novita.ai/dedicated/v1/openai"
$env:NOVITA_MODEL="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2"

# Windows (CMD)
set NOVITA_API_KEY=your_api_key_here
set NOVITA_BASE_URL=https://api.novita.ai/dedicated/v1/openai
set NOVITA_MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2

# Linux/Mac
export NOVITA_API_KEY=your_api_key_here
export NOVITA_BASE_URL=https://api.novita.ai/dedicated/v1/openai
export NOVITA_MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2
```

### 5. Test Connection

```bash
# Run the test script
python test_novita_connection.py

# OR use the batch script (Windows)
test_novita_conda.bat
```

## Using Anaconda Prompt (Windows)

1. **Open Anaconda Prompt** (search for "Anaconda Prompt" in the Start menu)

2. **Navigate to the project directory:**
   ```bash
   cd C:\Users\85jat\GenAI_work_V2\Prototyping\Research_AI_Assistant_V2\Research_AI_Assistant_API
   ```

3. **Create/activate the environment:**
   ```bash
   conda env create -f environment.yml
   conda activate research-ai-assistant
   ```

4. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

5. **Set environment variables:**
   ```bash
   set NOVITA_API_KEY=your_api_key_here
   set NOVITA_BASE_URL=https://api.novita.ai/dedicated/v1/openai
   set NOVITA_MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2
   ```

6. **Run the test:**
   ```bash
   python test_novita_connection.py
   ```

## Environment Management

### List environments
```bash
conda env list
```

### Activate environment
```bash
conda activate research-ai-assistant
```

### Deactivate environment
```bash
conda deactivate
```

### Remove environment (if needed)
```bash
conda env remove -n research-ai-assistant
```

### Update environment
```bash
conda env update -f environment.yml --prune
```

## Verification

After setup, verify everything works:

```bash
# Activate environment
conda activate research-ai-assistant

# Check Python
python --version

# Check openai package
python -c "import openai; print(openai.__version__)"

# Check configuration
python -c "from src.config import get_settings; s = get_settings(); print(f'API Key: {s.novita_api_key[:10]}...' if s.novita_api_key else 'API Key: NOT SET')"

# Run full test
python test_novita_connection.py
```

## Troubleshooting

### Conda command not found
- **Windows:** Open Anaconda Prompt instead of regular PowerShell/CMD
- **Linux/Mac:** Ensure conda is initialized: `conda init bash` or `conda init zsh`

### Environment activation fails
- Try `conda activate base` first, then `conda activate research-ai-assistant`
- On Windows: use Anaconda Prompt instead of a regular terminal

### Package installation fails
- Update conda: `conda update conda`
- Update pip: `pip install --upgrade pip`
- Try installing from conda-forge: `conda install -c conda-forge openai`

### Import errors
- Ensure the environment is activated: `conda activate research-ai-assistant`
- Verify the package is installed: `pip list | grep openai`
- Reinstall if needed: `pip install --force-reinstall "openai>=1.0.0"`
ENV_EXAMPLE_CONTENT.txt ADDED
@@ -0,0 +1,163 @@
# =============================================================================
# Research AI Assistant API - Environment Configuration
# =============================================================================
# Copy this content to a file named .env and fill in your actual values
# Never commit .env to version control!

# =============================================================================
# Novita AI Configuration (REQUIRED)
# =============================================================================
# Get your API key from: https://novita.ai
NOVITA_API_KEY=your_novita_api_key_here

# Dedicated endpoint base URL (default for dedicated endpoints)
NOVITA_BASE_URL=https://api.novita.ai/dedicated/v1/openai

# Your dedicated endpoint model ID
# Format: model-name:endpoint-id
NOVITA_MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2

# =============================================================================
# DeepSeek-R1 Optimized Settings
# =============================================================================
# Temperature: 0.5-0.7 range (0.6 recommended for DeepSeek-R1)
DEEPSEEK_R1_TEMPERATURE=0.6

# Force reasoning trigger: enable to ensure DeepSeek-R1 uses its reasoning pattern
# Set to True to add a `<think>` prefix for reasoning tasks
DEEPSEEK_R1_FORCE_REASONING=True

# =============================================================================
# Token Allocation Configuration
# =============================================================================
# Maximum tokens dedicated to user input (prioritized over context)
# Recommended: 8000 tokens for large queries
USER_INPUT_MAX_TOKENS=8000

# Maximum tokens for context preparation (includes user input + context)
# Recommended: 28000 tokens for 32K context window models
CONTEXT_PREPARATION_BUDGET=28000

# Context pruning threshold (should match CONTEXT_PREPARATION_BUDGET)
CONTEXT_PRUNING_THRESHOLD=28000

# Always prioritize user input over historical context
PRIORITIZE_USER_INPUT=True

# =============================================================================
# Database Configuration
# =============================================================================
# SQLite database path (default: sessions.db)
# Use /tmp/ for Docker/containerized environments
DB_PATH=sessions.db

# FAISS index path for embeddings (default: embeddings.faiss)
FAISS_INDEX_PATH=embeddings.faiss

# =============================================================================
# Cache Configuration
# =============================================================================
# HuggingFace cache directory (for any remaining model downloads)
HF_HOME=~/.cache/huggingface
TRANSFORMERS_CACHE=~/.cache/huggingface

# HuggingFace token (optional - only needed if using gated models)
HF_TOKEN=

# Cache TTL in seconds (default: 3600 = 1 hour)
CACHE_TTL=3600

# =============================================================================
# Session Configuration
# =============================================================================
# Session timeout in seconds (default: 3600 = 1 hour)
SESSION_TIMEOUT=3600

# Maximum session size in megabytes (default: 10 MB)
MAX_SESSION_SIZE_MB=10

# =============================================================================
# Performance Configuration
# =============================================================================
# Maximum worker threads for parallel processing (default: 4)
MAX_WORKERS=4

# =============================================================================
# Mobile Optimization
# =============================================================================
# Maximum tokens for mobile responses (default: 1200)
# Increased from 800 to allow better responses on mobile
MOBILE_MAX_TOKENS=1200

# Mobile request timeout in milliseconds (default: 15000)
MOBILE_TIMEOUT=15000

# =============================================================================
# API Configuration
# =============================================================================
# Flask/Gradio server port (default: 7860)
GRADIO_PORT=7860

# Server host (default: 0.0.0.0 for all interfaces)
GRADIO_HOST=0.0.0.0

# =============================================================================
# Logging Configuration
# =============================================================================
# Logging level: DEBUG, INFO, WARNING, ERROR, CRITICAL (default: INFO)
LOG_LEVEL=INFO

# Log format: json or text (default: json)
LOG_FORMAT=json

# Log directory (default: /tmp/logs)
LOG_DIR=/tmp/logs

# =============================================================================
# Context Configuration
# =============================================================================
# Maximum context tokens (default: 4000)
# Note: This is overridden by CONTEXT_PREPARATION_BUDGET if set
MAX_CONTEXT_TOKENS=4000

# Cache TTL for context in seconds (default: 300 = 5 minutes)
CACHE_TTL_SECONDS=300

# Maximum cache size (default: 100)
MAX_CACHE_SIZE=100

# Enable parallel processing (default: True)
PARALLEL_PROCESSING=True

# Context decay factor (default: 0.8)
CONTEXT_DECAY_FACTOR=0.8

# Maximum interactions to keep in context (default: 10)
MAX_INTERACTIONS_TO_KEEP=10

# Enable metrics collection (default: True)
ENABLE_METRICS=True

# Enable context compression (default: True)
COMPRESSION_ENABLED=True

# Summarization threshold in tokens (default: 2000)
SUMMARIZATION_THRESHOLD=2000

# =============================================================================
# Model Selection (for context operations - if still using local models)
# =============================================================================
# These are optional and only used if local models are still needed
# for context summarization or other operations
CONTEXT_SUMMARIZATION_MODEL=Qwen/Qwen2.5-7B-Instruct
CONTEXT_INTENT_MODEL=Qwen/Qwen2.5-7B-Instruct
CONTEXT_SYNTHESIS_MODEL=Qwen/Qwen2.5-7B-Instruct

# =============================================================================
# Security Notes
# =============================================================================
# - Never commit the .env file to version control
# - Keep API keys secret and rotate them regularly
# - Use environment variables in production (not .env files)
# - Set proper file permissions: chmod 600 .env
NOVITA_AI_IMPLEMENTATION_SUMMARY.md ADDED
@@ -0,0 +1,212 @@
# Novita AI Implementation Summary

## ✅ Implementation Complete

All changes have been implemented to switch from local models to the Novita AI API as the only inference source.

## 📋 Files Modified

### 1. ✅ `src/config.py`
- Added a Novita AI configuration section with:
  - `novita_api_key` (required, validated)
  - `novita_base_url` (default: https://api.novita.ai/dedicated/v1/openai)
  - `novita_model` (default: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2)
  - `deepseek_r1_temperature` (default: 0.6, validated to the 0.5-0.7 range)
  - `deepseek_r1_force_reasoning` (default: True)
- Token allocation configuration:
  - `user_input_max_tokens` (default: 8000)
  - `context_preparation_budget` (default: 28000)
  - `context_pruning_threshold` (default: 28000)
  - `prioritize_user_input` (default: True)

### 2. ✅ `requirements.txt`
- Added the `openai>=1.0.0` package

### 3. ✅ `src/models_config.py`
- Changed `primary_provider` from "local" to "novita_api"
- Updated all model IDs to the Novita model ID
- Added DeepSeek-R1 optimized parameters:
  - Temperature: 0.6 for reasoning, 0.5 for classification/safety
  - Top_p: 0.95 for reasoning, 0.9 for classification
  - `force_reasoning_prefix: True` for reasoning tasks
- Removed all local model configuration (quantization, fallbacks)

### 4. ✅ `src/llm_router.py` (Complete Rewrite)
- Removed all local model loading code
- Removed `LocalModelLoader` dependencies
- Added OpenAI client initialization
- Implemented the `_call_novita_api()` method (sketched below)
- Added DeepSeek-R1 optimizations:
  - `_format_deepseek_r1_prompt()` - reasoning trigger and math directives
  - `_is_math_query()` - automatic math detection
  - `_clean_reasoning_tags()` - response cleanup
- Updated `prepare_context_for_llm()` with:
  - User input priority (never truncated)
  - Dedicated 8K token budget for user input
  - 28K token context preparation budget
  - Dynamic context allocation
- Updated `health_check()` for the Novita API
- Removed all local model methods
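For orientation, here is a minimal sketch of what the `_call_novita_api()` path boils down to, assuming the OpenAI-compatible async client from `openai>=1.0`. The names `client` and `call_novita_api` are illustrative, not the exact code in `src/llm_router.py`:

```python
# Minimal sketch, not the actual src/llm_router.py implementation.
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url=os.getenv("NOVITA_BASE_URL", "https://api.novita.ai/dedicated/v1/openai"),
    api_key=os.environ["NOVITA_API_KEY"],  # required; KeyError here means the key is unset
)

async def call_novita_api(prompt: str, max_tokens: int = 512, temperature: float = 0.6) -> str:
    """Send a single-turn chat completion to the dedicated Novita endpoint."""
    response = await client.chat.completions.create(
        model=os.getenv("NOVITA_MODEL", "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2"),
        # DeepSeek-R1 guidance: no system prompt; all instructions go in the user message
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=temperature,
    )
    return response.choices[0].message.content
```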
### 5. ✅ `flask_api_standalone.py`
- Updated `initialize_orchestrator()`:
  - Changed to "Novita AI API Only" mode
  - Removed the HF_TOKEN dependency
  - Set `use_local_models=False`
- Updated error handling for configuration errors
- Increased `MAX_MESSAGE_LENGTH` from 10KB to 100KB
- Updated logging messages

### 6. ✅ `src/context_manager.py`
- Updated `prune_context()` to use the config threshold (28000 tokens)
- Increased user input storage from 500 to 5000 characters
- Increased system response storage from 1000 to 2000 characters
- Updated interaction context generation to use more of the user input

## 📝 Environment Variables Required

Create a `.env` file with the following (see `.env.example` for the full template):

```bash
# REQUIRED - Novita AI Configuration
NOVITA_API_KEY=your_api_key_here
NOVITA_BASE_URL=https://api.novita.ai/dedicated/v1/openai
NOVITA_MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2

# DeepSeek-R1 Optimized Settings
DEEPSEEK_R1_TEMPERATURE=0.6
DEEPSEEK_R1_FORCE_REASONING=True

# Token Allocation (Optional - defaults provided)
USER_INPUT_MAX_TOKENS=8000
CONTEXT_PREPARATION_BUDGET=28000
CONTEXT_PRUNING_THRESHOLD=28000
PRIORITIZE_USER_INPUT=True
```

## 🚀 Installation Steps

1. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

2. **Create a `.env` file:**
   ```bash
   cp .env.example .env
   # Edit .env and add your NOVITA_API_KEY
   ```

3. **Set environment variables:**
   ```bash
   export NOVITA_API_KEY=your_api_key_here
   export NOVITA_BASE_URL=https://api.novita.ai/dedicated/v1/openai
   export NOVITA_MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2
   ```

4. **Start the application:**
   ```bash
   python flask_api_standalone.py
   ```

## ✨ Key Features Implemented

### DeepSeek-R1 Optimizations
- ✅ Temperature set to 0.6 (recommended range 0.5-0.7)
- ✅ Reasoning trigger (`<think>` prefix) for reasoning tasks (see the sketch below)
- ✅ Automatic math directive detection
- ✅ No system prompts (all instructions go in the user prompt)
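A minimal sketch of the reasoning trigger and response cleanup, assuming plain string handling; `format_deepseek_r1_prompt` and `clean_reasoning_tags` here are simplified stand-ins for the router's private methods:

```python
# Illustrative sketch; the real _format_deepseek_r1_prompt() and
# _clean_reasoning_tags() in src/llm_router.py may be more elaborate.
import re

def format_deepseek_r1_prompt(prompt: str, force_reasoning: bool = True) -> str:
    """Append the <think> trigger so the model starts with a reasoning block."""
    return f"{prompt}\n<think>\n" if force_reasoning else prompt

def clean_reasoning_tags(response: str) -> str:
    """Strip <think>...</think> blocks so only the final answer remains."""
    return re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
```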
### Token Allocation
- ✅ User input: dedicated 8K token budget (never truncated)
- ✅ Context preparation: 28K token total budget
- ✅ Context pruning: 28K token threshold
- ✅ User input always prioritized over historical context (see the sketch below)
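A hedged sketch of that allocation rule, using the 4-characters-per-token approximation from `src/context_manager.py`; the constants mirror the defaults above, and `context_budget` is illustrative only:

```python
# User input is reserved whole; historical context fills whatever remains
# of the overall preparation budget.
USER_INPUT_MAX_TOKENS = 8000
CONTEXT_PREPARATION_BUDGET = 28000

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # simple approximation: 4 characters per token

def context_budget(user_input: str) -> int:
    """Tokens left for historical context after reserving the user input."""
    return max(0, CONTEXT_PREPARATION_BUDGET - estimate_tokens(user_input))
```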
### API Improvements
- ✅ Message length limit: 100KB (increased from 10KB)
- ✅ Better error messages with token estimates
- ✅ Configuration validation with helpful error messages

### Database Storage
- ✅ User input storage: 5000 characters (increased from 500)
- ✅ System response storage: 2000 characters (increased from 1000)

## 🧪 Testing Checklist

- [ ] Test the API health check endpoint
- [ ] Test a simple inference request
- [ ] Test large user input (5K+ tokens)
- [ ] Test reasoning tasks (should see the reasoning trigger)
- [ ] Test math queries (should see the math directive)
- [ ] Test context preparation (user input should not be truncated)
- [ ] Test error handling (missing API key, invalid endpoint)

## 📊 Expected Behavior

1. **Startup:**
   - System initializes the Novita AI client
   - Validates that the API key is present
   - Logs the Novita AI configuration

2. **Inference:**
   - All requests routed to the Novita AI API
   - DeepSeek-R1 optimizations applied automatically
   - User input prioritized in context preparation

3. **Error Handling:**
   - Clear error messages if the API key is missing
   - Helpful guidance for configuration issues
   - Graceful handling of API failures

## 🔧 Troubleshooting

### Issue: "NOVITA_API_KEY is required"
**Solution:** Set the environment variable:
```bash
export NOVITA_API_KEY=your_key_here
```

### Issue: "openai package not available"
**Solution:** Install dependencies:
```bash
pip install -r requirements.txt
```

### Issue: API connection errors
**Solution:**
- Verify the API key is correct
- Check that the base URL matches your endpoint
- Verify the model ID matches your deployment

## 📚 Configuration Reference

### Model Configuration
- **Model ID:** `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2`
- **Context Window:** 131,072 tokens (131K)
- **Optimized Settings:** Temperature 0.6, Top_p 0.95

### Token Allocation
- **User Input:** 8,000 tokens (dedicated, never truncated)
- **Context Budget:** 28,000 tokens (includes user input + context)
- **Output Limits:**
  - Reasoning: 4,096 tokens
  - Synthesis: 2,000 tokens
  - Classification: 512 tokens

## 🎯 Next Steps

1. Set your `NOVITA_API_KEY` in environment variables
2. Test the health check endpoint: `GET /api/health`
3. Send a test request: `POST /api/chat`
4. Monitor logs for Novita AI API calls
5. Verify the DeepSeek-R1 optimizations are working

## 📝 Notes

- All local model code has been removed
- The system now depends entirely on the Novita AI API
- No GPU/quantization configuration needed
- No model downloading required
- Faster startup (no model loading)
QUICK_TEST_NOVITA.md ADDED
@@ -0,0 +1,88 @@
# Quick Test: Novita AI Connection with Anaconda

## Step-by-Step Instructions

### 1. Open Anaconda Prompt
- Search for "Anaconda Prompt" in the Windows Start menu
- This ensures conda commands work properly

### 2. Navigate to the Project Directory
```bash
cd C:\Users\85jat\GenAI_work_V2\Prototyping\Research_AI_Assistant_V2\Research_AI_Assistant_API
```

### 3. Create the Conda Environment (First Time Only)
```bash
conda create -n research-ai-assistant python=3.10 -y
```

### 4. Activate the Environment
```bash
conda activate research-ai-assistant
```

### 5. Install Required Packages
```bash
pip install "openai>=1.0.0"
pip install -r requirements.txt
```

### 6. Set Environment Variables
```bash
# Set your Novita API key
set NOVITA_API_KEY=your_api_key_here
set NOVITA_BASE_URL=https://api.novita.ai/dedicated/v1/openai
set NOVITA_MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2
```

### 7. Run the Test
```bash
python test_novita_connection.py
```

## Alternative: Use the Batch Script

Simply double-click or run:
```bash
test_novita_conda.bat
```

## Expected Output

You should see:
```
============================================================
NOVITA AI CONNECTION TEST
============================================================

============================================================
TEST 1: Configuration Loading
============================================================
✓ Configuration loaded successfully
  Novita API Key: Set
  Base URL: https://api.novita.ai/dedicated/v1/openai
  Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2
...

============================================================
TEST 4: Simple API Call
============================================================
✓ API call successful!
  Response: ...

🎉 All tests passed! Novita AI connection is working correctly.
```

## Troubleshooting

**If the conda command is not found:**
- Use Anaconda Prompt instead of regular PowerShell
- Or run: `C:\Users\85jat\anaconda3\Scripts\activate.bat` (adjust the path as needed)

**If environment activation fails:**
- Create the environment first: `conda create -n research-ai-assistant python=3.10`

**If you hit import errors:**
- Ensure the environment is activated: `conda activate research-ai-assistant`
- Install the packages: `pip install "openai>=1.0.0"`
TEST_NOVITA_CONNECTION.md ADDED
@@ -0,0 +1,220 @@
# Testing Novita AI Connection

## Quick Test Instructions

### Option 1: Run the Test Script (Recommended)

1. **Ensure Python is available:**
   ```bash
   # Check Python version
   python --version
   # OR
   python3 --version
   # OR (Windows)
   py --version
   ```

2. **Install dependencies if needed:**
   ```bash
   pip install "openai>=1.0.0"
   pip install -r requirements.txt
   ```

3. **Set environment variables:**
   ```bash
   # Windows (PowerShell)
   $env:NOVITA_API_KEY="your_api_key_here"
   $env:NOVITA_BASE_URL="https://api.novita.ai/dedicated/v1/openai"
   $env:NOVITA_MODEL="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2"

   # Windows (CMD)
   set NOVITA_API_KEY=your_api_key_here
   set NOVITA_BASE_URL=https://api.novita.ai/dedicated/v1/openai
   set NOVITA_MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2

   # Linux/Mac
   export NOVITA_API_KEY=your_api_key_here
   export NOVITA_BASE_URL=https://api.novita.ai/dedicated/v1/openai
   export NOVITA_MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2
   ```

4. **Run the test script:**
   ```bash
   python test_novita_connection.py
   # OR
   python3 test_novita_connection.py
   # OR (Windows)
   py test_novita_connection.py
   ```

### Option 2: Manual Python Test

Create a simple test file `quick_test.py`:

```python
import os
import sys

from openai import OpenAI

# Get configuration from the environment
api_key = os.getenv("NOVITA_API_KEY")
base_url = os.getenv("NOVITA_BASE_URL", "https://api.novita.ai/dedicated/v1/openai")
model = os.getenv("NOVITA_MODEL", "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2")

if not api_key:
    print("ERROR: NOVITA_API_KEY not set!")
    sys.exit(1)

print("Testing Novita AI connection...")
print(f"Base URL: {base_url}")
print(f"Model: {model}")

client = OpenAI(
    base_url=base_url,
    api_key=api_key,
)

try:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say 'Hello' if you can hear me."}],
        max_tokens=20,
        temperature=0.6,
    )

    if response.choices:
        print("\n✓ SUCCESS! Connection working.")
        print(f"Response: {response.choices[0].message.content}")
    else:
        print("\n❌ No response received")

except Exception as e:
    print(f"\n❌ ERROR: {e}")
```

Run it:
```bash
python quick_test.py
```

### Option 3: Test via API Endpoint

If the Flask server is running:

1. **Start the server:**
   ```bash
   python flask_api_standalone.py
   ```

2. **Test the health endpoint:**
   ```bash
   curl http://localhost:7860/api/health
   # OR visit http://localhost:7860/api/health in a browser
   ```

3. **Test the chat endpoint:**
   ```bash
   curl -X POST http://localhost:7860/api/chat \
     -H "Content-Type: application/json" \
     -d '{"message": "Hello", "session_id": "test-123"}'
   ```

## Expected Test Results

### Successful Test Output:
```
============================================================
NOVITA AI CONNECTION TEST
============================================================

============================================================
TEST 1: Configuration Loading
============================================================
✓ Configuration loaded successfully
  Novita API Key: Set
  Base URL: https://api.novita.ai/dedicated/v1/openai
  Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2
  Temperature: 0.6
  Force Reasoning: True
  User Input Max Tokens: 8000
  Context Preparation Budget: 28000

============================================================
TEST 2: OpenAI Package Check
============================================================
✓ OpenAI package is available

============================================================
TEST 3: Novita AI Client Initialization
============================================================
✓ Novita AI client initialized successfully
  Base URL: https://api.novita.ai/dedicated/v1/openai
  API Key: nv-****

============================================================
TEST 4: Simple API Call
============================================================
Sending test request to: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2
Prompt: 'Hello, this is a test. Please respond briefly.'
✓ API call successful!
  Response length: XX characters
  Response preview: ...

============================================================
TEST 5: LLM Router Initialization
============================================================
Initializing LLM Router...
✓ LLM Router initialized successfully

Testing health check...
✓ Health check result: {'provider': 'novita_api', 'status': 'healthy', ...}

============================================================
TEST 6: Inference Test
============================================================
Test prompt: What is the capital of France? Answer in one sentence.
✓ Inference successful!
  Response length: XX characters
  Response: ...

============================================================
TEST SUMMARY
============================================================
CONFIG: ✓ PASS
PACKAGE: ✓ PASS
CLIENT: ✓ PASS
API_CALL: ✓ PASS
ROUTER: ✓ PASS
INFERENCE: ✓ PASS

Total: 6/6 tests passed

🎉 All tests passed! Novita AI connection is working correctly.
```

## Troubleshooting

### Error: "NOVITA_API_KEY is required"
**Solution:** Set the environment variable:
```bash
export NOVITA_API_KEY=your_key_here
```

### Error: "openai package not available"
**Solution:** Install the package:
```bash
pip install "openai>=1.0.0"
```

### Error: "Failed to initialize Novita AI client"
**Solution:**
- Verify the API key is correct
- Check that the base URL matches your endpoint
- Verify network connectivity

### Error: "API call failed"
**Solution:**
- Check that the API key has proper permissions
- Verify the model ID matches your deployment
- Check Novita AI service status
environment.yml ADDED
@@ -0,0 +1,43 @@
name: research-ai-assistant
channels:
  - conda-forge
  - defaults
dependencies:
  - python>=3.10,<3.12
  - pip
  - pip:
      # LLM API Client (required for Novita AI API)
      - openai>=1.0.0
      # Web Framework & Interface
      - aiohttp>=3.9.0
      - httpx>=0.25.0
      # Flask API for external integrations
      - flask>=3.0.0
      - flask-cors>=4.0.0
      - flask-limiter>=3.5.0
      # Security & Validation
      - pydantic-settings>=2.1.0
      - python-dotenv>=1.0.0
      # Database & Persistence
      - sqlalchemy>=2.0.0
      # Data Processing & Utilities
      - pandas>=2.1.0
      - numpy>=1.24.0,<2.0.0
      # Caching & Performance
      - cachetools>=5.3.0
      # Async & Concurrency
      - aiofiles>=23.2.0
      # Logging & Monitoring
      - structlog>=23.2.0
      - prometheus-client>=0.19.0
      - psutil>=5.9.0
      # Utility Libraries
      - python-dateutil>=2.8.0
      - pytz>=2023.3
      - requests>=2.31.0
      # Production WSGI Server
      - gunicorn>=21.2.0
      # Development & Testing
      - pytest>=7.4.0
      - pytest-asyncio>=0.21.0
flask_api_standalone.py CHANGED
@@ -145,7 +145,7 @@
 initialization_error = None
 
 def initialize_orchestrator():
-    """Initialize the AI orchestrator with local GPU models"""
+    """Initialize the AI orchestrator with Novita AI API only"""
     global orchestrator, orchestrator_available, initialization_attempted, initialization_error
 
     initialization_attempted = True
@@ -153,7 +153,7 @@
 
     try:
         logger.info("=" * 60)
-        logger.info("INITIALIZING AI ORCHESTRATOR (Local GPU Models)")
+        logger.info("INITIALIZING AI ORCHESTRATOR (Novita AI API Only)")
         logger.info("=" * 60)
 
         from src.agents.intent_agent import create_intent_agent
@@ -166,27 +166,16 @@
 
         logger.info("✓ Imports successful")
 
-        # Initialize LLM Router - local models only (no API fallback)
-        hf_token = os.getenv('HF_TOKEN', '')  # Optional - only needed for downloading gated models
-        if not hf_token:
-            logger.warning("HF_TOKEN not set - may be needed for gated model access")
-        else:
-            logger.info(f"HF_TOKEN available (for model download only)")
-
-        # Import GatedRepoError for better error handling
+        # Initialize LLM Router - Novita AI API only
+        logger.info("Initializing LLM Router (Novita AI API only)...")
         try:
-            from huggingface_hub.exceptions import GatedRepoError
-        except ImportError:
-            GatedRepoError = Exception
-
-        logger.info("Initializing LLM Router (local models only, no API fallback)...")
-        try:
-            # Always use local models - API fallback removed
-            llm_router = LLMRouter(hf_token=hf_token, use_local_models=True)
-            logger.info("✓ LLM Router initialized (local models only)")
+            # Always use Novita AI API (local models disabled)
+            llm_router = LLMRouter(hf_token=None, use_local_models=False)
+            logger.info("✓ LLM Router initialized (Novita AI API)")
         except Exception as e:
             logger.error(f"❌ Failed to initialize LLM Router: {e}", exc_info=True)
-            logger.error("This is a critical error - local models are required")
+            logger.error("This is a critical error - Novita AI API is required")
+            logger.error("Please ensure NOVITA_API_KEY is set in environment variables")
             raise
 
         logger.info("Initializing Agents...")
@@ -221,28 +210,29 @@
         orchestrator_available = True
         logger.info("=" * 60)
         logger.info("✓ AI ORCHESTRATOR READY")
-        logger.info(" - Local GPU models enabled" if llm_router.use_local_models else " - API-only mode (local models disabled)")
+        logger.info(" - Novita AI API enabled")
         logger.info(" - MAX_WORKERS: 4")
         logger.info("=" * 60)
 
         return True
 
-    except GatedRepoError as e:
-        logger.error("=" * 60)
-        logger.error("❌ GATED REPOSITORY ERROR DURING INITIALIZATION")
-        logger.error("=" * 60)
-        logger.error(f"Error: {e}")
-        logger.error("")
-        logger.error("SOLUTION:")
-        logger.error("1. Visit the model repository on Hugging Face")
-        logger.error("2. Click 'Agree and access repository'")
-        logger.error("3. Wait for approval (usually instant)")
-        logger.error("4. Ensure HF_TOKEN is set with your access token")
-        logger.error("")
-        logger.error("NOTE: API fallback has been removed. Local models are required.")
-        logger.error("=" * 60)
-        orchestrator_available = False
-        initialization_error = f"GatedRepoError: {str(e)}"
+    except ValueError as e:
+        # Handle configuration errors (e.g., missing NOVITA_API_KEY)
+        if "NOVITA_API_KEY" in str(e) or "required" in str(e).lower():
+            logger.error("=" * 60)
+            logger.error("❌ CONFIGURATION ERROR")
+            logger.error("=" * 60)
+            logger.error(f"Error: {e}")
+            logger.error("")
+            logger.error("SOLUTION:")
+            logger.error("1. Set NOVITA_API_KEY in environment variables")
+            logger.error("2. Ensure NOVITA_BASE_URL is correct")
+            logger.error("3. Verify NOVITA_MODEL matches your endpoint")
+            logger.error("=" * 60)
+            orchestrator_available = False
+            initialization_error = f"Configuration Error: {str(e)}"
+        else:
+            raise
        return False
    except Exception as e:
        logger.error("=" * 60)
@@ -351,12 +341,12 @@
             'error': 'Message cannot be empty'
         }), 400
 
-    # Length limit (prevent abuse)
-    MAX_MESSAGE_LENGTH = 10000  # 10KB limit
+    # Length limit (allow larger inputs for complex queries)
+    MAX_MESSAGE_LENGTH = 100000  # 100KB limit (increased from 10KB)
     if len(message) > MAX_MESSAGE_LENGTH:
         return jsonify({
             'success': False,
-            'error': f'Message too long. Maximum length is {MAX_MESSAGE_LENGTH} characters'
+            'error': f'Message too long. Maximum length is {MAX_MESSAGE_LENGTH} characters (approximately {MAX_MESSAGE_LENGTH // 4} tokens)'
         }), 400
 
     history = data.get('history', [])
requirements.txt CHANGED
@@ -107,3 +107,6 @@ debugpy>=1.7.0
 bandit>=1.7.5  # Security linter for Python code
 safety>=2.3.5  # Dependency vulnerability scanner
 
+# LLM API Client (required for Novita AI API)
+openai>=1.0.0
setup_conda_env.bat ADDED
@@ -0,0 +1,37 @@
@echo off
REM Setup script for Anaconda environment (Windows)
REM This script creates and activates a conda environment for the Research AI Assistant

echo ============================================================
echo Setting up Anaconda environment for Research AI Assistant
echo ============================================================

REM Check if conda is available
where conda >nul 2>&1
if %ERRORLEVEL% NEQ 0 (
    echo ERROR: conda command not found
    echo Please install Anaconda or Miniconda first
    echo Download from: https://www.anaconda.com/products/distribution
    exit /b 1
)

echo Conda found

REM Create environment from environment.yml
echo.
echo Creating conda environment from environment.yml...
conda env create -f environment.yml

if %ERRORLEVEL% EQU 0 (
    echo Environment created successfully
    echo.
    echo To activate the environment, run:
    echo     conda activate research-ai-assistant
    echo.
    echo Then install remaining dependencies:
    echo     pip install -r requirements.txt
) else (
    echo Environment creation failed
    exit /b 1
)
setup_conda_env.sh ADDED
@@ -0,0 +1,41 @@
#!/bin/bash
# Setup script for Anaconda environment
# This script creates and activates a conda environment for the Research AI Assistant

echo "============================================================"
echo "Setting up Anaconda environment for Research AI Assistant"
echo "============================================================"

# Check if conda is available
if ! command -v conda &> /dev/null; then
    echo "❌ Error: conda command not found"
    echo "   Please install Anaconda or Miniconda first"
    echo "   Download from: https://www.anaconda.com/products/distribution"
    exit 1
fi

echo "✓ Conda found"

# Create environment from environment.yml
echo ""
echo "Creating conda environment from environment.yml..."
conda env create -f environment.yml

if [ $? -eq 0 ]; then
    echo "✓ Environment created successfully"
else
    echo "❌ Environment creation failed"
    exit 1
fi

# Activate environment (same command on Linux, Mac, and Windows)
echo ""
echo "To activate the environment, run:"
echo "    conda activate research-ai-assistant"
echo ""
echo "Then install remaining dependencies:"
echo "    pip install -r requirements.txt"
src/config.py CHANGED
@@ -174,6 +174,98 @@ class Settings(BaseSettings):
 
         return self._cached_cache_dir
 
+    # ==================== Novita AI Configuration ====================
+
+    novita_api_key: str = Field(
+        default="",
+        description="Novita AI API key (required)",
+        env="NOVITA_API_KEY"
+    )
+
+    novita_base_url: str = Field(
+        default="https://api.novita.ai/dedicated/v1/openai",
+        description="Novita AI dedicated endpoint base URL",
+        env="NOVITA_BASE_URL"
+    )
+
+    novita_model: str = Field(
+        default="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2",
+        description="Novita AI dedicated endpoint model ID",
+        env="NOVITA_MODEL"
+    )
+
+    # DeepSeek-R1 optimized settings
+    deepseek_r1_temperature: float = Field(
+        default=0.6,
+        description="Temperature for DeepSeek-R1 models (0.5-0.7 range, 0.6 recommended)",
+        env="DEEPSEEK_R1_TEMPERATURE"
+    )
+
+    deepseek_r1_force_reasoning: bool = Field(
+        default=True,
+        description="Force DeepSeek-R1 to start with reasoning trigger",
+        env="DEEPSEEK_R1_FORCE_REASONING"
+    )
+
+    # Token Allocation Configuration
+    user_input_max_tokens: int = Field(
+        default=8000,
+        description="Maximum tokens dedicated for user input (prioritized over context)",
+        env="USER_INPUT_MAX_TOKENS"
+    )
+
+    context_preparation_budget: int = Field(
+        default=28000,
+        description="Maximum tokens for context preparation (includes user input + context)",
+        env="CONTEXT_PREPARATION_BUDGET"
+    )
+
+    context_pruning_threshold: int = Field(
+        default=28000,
+        description="Context pruning threshold (should match context_preparation_budget)",
+        env="CONTEXT_PRUNING_THRESHOLD"
+    )
+
+    prioritize_user_input: bool = Field(
+        default=True,
+        description="Always prioritize user input over historical context",
+        env="PRIORITIZE_USER_INPUT"
+    )
+
+    @validator("novita_api_key", pre=True)
+    def validate_novita_api_key(cls, v):
+        """Validate and clean Novita API key"""
+        if v is None:
+            return ""
+        return str(v).strip()
+
+    @validator("deepseek_r1_temperature", pre=True)
+    def validate_deepseek_temperature(cls, v):
+        """Validate DeepSeek-R1 temperature is in recommended range"""
+        if isinstance(v, str):
+            v = float(v)
+        temp = float(v) if v else 0.6
+        return max(0.5, min(0.7, temp))
+
+    @validator("deepseek_r1_force_reasoning", pre=True)
+    def validate_force_reasoning(cls, v):
+        """Convert string to boolean for force_reasoning"""
+        if isinstance(v, str):
+            return v.lower() in ("true", "1", "yes", "on")
+        return bool(v)
+
+    @validator("user_input_max_tokens", pre=True)
+    def validate_user_input_tokens(cls, v):
+        """Validate user input token limit"""
+        val = int(v) if v else 8000
+        return max(1000, min(20000, val))
+
+    @validator("context_preparation_budget", pre=True)
+    def validate_context_budget(cls, v):
+        """Validate context preparation budget"""
+        val = int(v) if v else 28000
+        return max(4000, min(120000, val))
+
     # ==================== Model Configuration ====================
 
     default_model: str = Field(
src/context_manager.py CHANGED
@@ -439,10 +439,13 @@ Keep the summary concise and focused (approximately 500 tokens)."""
         if not self.llm_router:
             return ""
 
+        # Use full user input for context generation (not truncated in prompt)
+        # Only truncate for display in prompt if extremely long
+        user_input_preview = user_input[:500] if len(user_input) > 500 else user_input
         prompt = f"""Summarize this interaction in approximately 50 tokens:
 
-User Input: {user_input[:200]}
-System Response: {system_response[:300]}
+User Input: {user_input_preview}
+System Response: {system_response[:500]}
 
 Provide a brief summary capturing the key exchange."""
@@ -466,8 +469,8 @@ Provide a brief summary capturing the key exchange."""
         """, (
             interaction_id,
             session_id,
-            user_input[:500],
-            system_response[:1000],
+            user_input[:5000],  # Increased from 500 to 5000 characters
+            system_response[:2000],  # Increased from 1000 to 2000
             summary.strip(),
             created_at
         ))
@@ -607,8 +610,8 @@ Keep the summary concise (approximately 100 tokens)."""
 
         Applies smart pruning before formatting.
         """
-        # Step 4: Prune context if it exceeds token limits
-        pruned_context = self.prune_context(context, max_tokens=2000)
+        # Step 4: Prune context if it exceeds token limits (uses config threshold)
+        pruned_context = self.prune_context(context)
 
         # Get context mode (fresh or relevant)
         session_id = pruned_context.get("session_id")
@@ -735,19 +738,30 @@ Keep the summary concise (approximately 100 tokens)."""
         # Simple approximation: 4 characters per token
         return len(text) // 4
 
-    def prune_context(self, context: dict, max_tokens: int = 2000) -> dict:
+    def prune_context(self, context: dict, max_tokens: Optional[int] = None) -> dict:
         """
-        Step 4: Implement Smart Context Pruning
+        Step 4: Implement Smart Context Pruning with configurable threshold
 
         Prune context to stay within token limit while keeping most recent and relevant content.
 
         Args:
             context: Context dictionary to prune
-            max_tokens: Maximum token count (default 2000)
+            max_tokens: Maximum token count (uses config default if None)
 
         Returns:
             Pruned context dictionary
         """
+        # Use config threshold if not provided
+        if max_tokens is None:
+            try:
+                from .config import get_settings
+                settings = get_settings()
+                max_tokens = settings.context_pruning_threshold
+                logger.debug(f"Using config pruning threshold: {max_tokens} tokens")
+            except Exception:
+                max_tokens = 2000  # Fallback to default
+                logger.warning("Could not load config, using default pruning threshold: 2000")
+
         try:
             # Calculate current token count
             current_tokens = self._calculate_context_tokens(context)
src/llm_router.py CHANGED
@@ -1,290 +1,213 @@
-# llm_router.py - UPDATED FOR LOCAL GPU MODEL LOADING
 import logging
 import asyncio
 from typing import Dict, Optional
 from .models_config import LLM_CONFIG
 
-# Import GatedRepoError for handling gated repositories
 try:
-    from huggingface_hub.exceptions import GatedRepoError
 except ImportError:
-    # Fallback if huggingface_hub is not available
-    GatedRepoError = Exception
 
 logger = logging.getLogger(__name__)
 
 class LLMRouter:
-    def __init__(self, hf_token=None, use_local_models: bool = True):
-        # hf_token kept for backward compatibility but not used for API calls
-        # Only needed for downloading gated models from HuggingFace Hub
-        self.hf_token = hf_token
-        self.health_status = {}
-        self.use_local_models = use_local_models
-        self.local_loader = None
 
-        logger.info("LLMRouter initialized (local models only, no API fallback)")
-        if hf_token:
-            logger.info("HF token available (for model download only)")
-        else:
-            logger.warning("HF_TOKEN not set - may be needed for gated model access")
 
-        # Initialize local model loader - REQUIRED
-        if self.use_local_models:
-            try:
-                from .local_model_loader import LocalModelLoader
-                self.local_loader = LocalModelLoader()
-                logger.info("✓ Local model loader initialized (GPU-based inference)")
-
-                # Note: Pre-loading will happen on first request (lazy loading)
-                # Models will be loaded on-demand to avoid blocking startup
-                logger.info("Models will be loaded on-demand for faster startup")
-            except Exception as e:
-                logger.error(f"❌ CRITICAL: Could not initialize local model loader: {e}")
-                logger.error("Local models are required - API fallback has been removed")
-                raise RuntimeError(
-                    "Local model loader is required but could not be initialized. "
-                    "Please ensure transformers and torch are installed."
-                ) from e
-        else:
-            logger.error("use_local_models=False but API fallback removed - this will fail")
-            raise ValueError("use_local_models must be True - API fallback has been removed")
 
     async def route_inference(self, task_type: str, prompt: str, **kwargs):
         """
-        Smart routing based on task specialization
-        Uses ONLY local models - no API fallback
         """
-        logger.info(f"Routing inference for task: {task_type}")
-        model_config = self._select_model(task_type)
-        logger.info(f"Selected model: {model_config['model_id']}")
 
-        # Use local models only
-        if not self.local_loader:
-            raise RuntimeError("Local model loader not available - cannot perform inference")
 
         try:
-            # Handle embedding generation separately
             if task_type == "embedding_generation":
-                result = await self._call_local_embedding(model_config, prompt, **kwargs)
             else:
-                result = await self._call_local_model(model_config, prompt, task_type, **kwargs)
 
             if result is None:
-                logger.error(f"Local model returned None for task: {task_type}")
                 raise RuntimeError(f"Inference failed for task: {task_type}")
 
-            logger.info(f"Inference complete for {task_type} (local model)")
             return result
 
         except Exception as e:
-            logger.error(f"Local model inference failed: {e}", exc_info=True)
-            # Try fallback model if configured
-            fallback_model_id = model_config.get("fallback")
-            if fallback_model_id and fallback_model_id != model_config["model_id"]:
-                logger.warning(f"Attempting fallback model: {fallback_model_id}")
-                try:
-                    fallback_config = model_config.copy()
-                    fallback_config["model_id"] = fallback_model_id
-                    fallback_config.pop("fallback", None)  # Prevent infinite recursion
-
-                    if task_type == "embedding_generation":
-                        result = await self._call_local_embedding(fallback_config, prompt, **kwargs)
-                    else:
-                        result = await self._call_local_model(fallback_config, prompt, task_type, **{**kwargs, '_is_fallback': True})
-
-                    if result is not None:
-                        logger.info(f"Inference complete using fallback model: {fallback_model_id}")
-                        return result
-                except Exception as fallback_error:
-                    logger.error(f"Fallback model also failed: {fallback_error}")
-
-            # No API fallback - raise error
             raise RuntimeError(
                 f"Inference failed for task: {task_type}. "
-                f"Local models are required - ensure models are properly loaded and accessible."
             ) from e
 
-    async def _call_local_model(self, model_config: dict, prompt: str, task_type: str, **kwargs) -> Optional[str]:
-        """Call local model for inference."""
-        if not self.local_loader:
             return None
 
-        # Check if this is already a fallback attempt (prevent infinite loops)
-        is_fallback_attempt = kwargs.get('_is_fallback', False)
 
-        model_id = model_config["model_id"]
-        max_tokens = kwargs.get('max_tokens', 512)
-        temperature = kwargs.get('temperature', 0.7)
 
         try:
-            # Ensure model is loaded
-            if model_id not in self.local_loader.loaded_models:
-                logger.info(f"Loading model {model_id} on demand...")
-                # Check if model config specifies quantization
-                use_4bit = model_config.get("use_4bit_quantization", False)
-                use_8bit = model_config.get("use_8bit_quantization", False)
-                # Fallback to default quantization settings if not specified
-                if not use_4bit and not use_8bit:
-                    quantization_config = LLM_CONFIG.get("quantization_settings", {})
-                    use_4bit = quantization_config.get("default_4bit", True)
-                    use_8bit = quantization_config.get("default_8bit", False)
 
-                try:
-                    self.local_loader.load_chat_model(
-                        model_id,
-                        load_in_8bit=use_8bit,
-                        load_in_4bit=use_4bit
-                    )
-                except GatedRepoError as e:
-                    logger.error(f"❌ Cannot access gated repository {model_id}")
-                    logger.error(f"   Visit https://huggingface.co/{model_id.split(':')[0] if ':' in model_id else model_id} to request access.")
-
-                    # Prevent infinite loops: if this is already a fallback attempt, don't try another fallback
-                    if is_fallback_attempt:
-                        logger.error("❌ Fallback model also failed with gated repository error")
-                        raise RuntimeError("Both primary and fallback models are gated repositories") from e
-
-                    # Try fallback models in order (fallback, then fallback2)
-                    fallback_chain = []
-                    if model_config.get("fallback") and model_config.get("fallback") != model_id:
-                        fallback_chain.append(model_config.get("fallback"))
-                    if model_config.get("fallback2") and model_config.get("fallback2") != model_id:
-                        fallback_chain.append(model_config.get("fallback2"))
-
-                    if fallback_chain:
-                        last_error = e
-                        for fallback_idx, fallback_model_id in enumerate(fallback_chain):
-                            logger.warning(f"Attempting fallback model {fallback_idx + 1}/{len(fallback_chain)}: {fallback_model_id}")
-                            try:
-                                # Create fallback config
-                                fallback_config = model_config.copy()
-                                fallback_config["model_id"] = fallback_model_id
-                                # Remove this fallback and subsequent ones to prevent infinite recursion
-                                fallback_config.pop("fallback", None)
-                                fallback_config.pop("fallback2", None)
-
-                                # Retry with fallback model (mark as fallback attempt if this is the last fallback)
-                                is_last_fallback = (fallback_idx == len(fallback_chain) - 1)
-                                return await self._call_local_model(
-                                    fallback_config,
-                                    prompt,
-                                    task_type,
-                                    **{**kwargs, '_is_fallback': is_last_fallback}
-                                )
-                            except GatedRepoError as fallback_gated_error:
-                                logger.error(f"❌ Fallback model {fallback_model_id} is also gated")
-                                last_error = fallback_gated_error
-                                if fallback_idx == len(fallback_chain) - 1:
-                                    # Last fallback failed
-                                    raise RuntimeError("All models (primary and fallbacks) are gated repositories") from fallback_gated_error
-                                # Continue to next fallback
-                                continue
-                            except Exception as fallback_error:
-                                logger.error(f"Fallback model {fallback_model_id} failed: {fallback_error}")
-                                last_error = fallback_error
-                                if fallback_idx == len(fallback_chain) - 1:
-                                    # Last fallback failed
-                                    raise
-                                # Continue to next fallback
-                                continue
-                        # All fallbacks exhausted
-                        raise RuntimeError(f"All models failed. Last error: {last_error}") from last_error
-                    else:
-                        raise RuntimeError(f"Model {model_id} is a gated repository and no fallback available") from e
-                except (RuntimeError, ModuleNotFoundError, ImportError) as e:
-                    # Check if this is a bitsandbytes error (not a gated repo error)
-                    error_str = str(e).lower()
-                    if "bitsandbytes" in error_str or "int8_mm_dequant" in error_str or "validate_bnb_backend" in error_str:
-                        logger.warning(f"⚠ BitsAndBytes compatibility issue detected: {e}")
-                        logger.warning(f"⚠ Model {model_id} will be loaded without quantization")
-                        # Retry without quantization
-                        try:
-                            # Disable quantization for this attempt
-                            fallback_config = model_config.copy()
-                            fallback_config["use_4bit_quantization"] = False
-                            fallback_config["use_8bit_quantization"] = False
-                            return await self._call_local_model(
-                                fallback_config,
-                                prompt,
-                                task_type,
-                                **kwargs
-                            )
-                        except Exception as retry_error:
-                            logger.error(f"Failed to load model even without quantization: {retry_error}")
-                            raise RuntimeError(f"Model loading failed: {retry_error}") from retry_error
-                    else:
-                        # Not a bitsandbytes error, re-raise
-                        raise
-
-            # Format as chat messages if needed
-            messages = [{"role": "user", "content": prompt}]
-
-            # Generate using local model
-            result = await asyncio.to_thread(
-                self.local_loader.generate_chat_completion,
-                model_id=model_id,
-                messages=messages,
-                max_tokens=max_tokens,
-                temperature=temperature
-            )
-
-            logger.info(f"Local model {model_id} generated response (length: {len(result)})")
-            logger.info("=" * 80)
-            logger.info("LOCAL MODEL RESPONSE:")
-            logger.info("=" * 80)
-            logger.info(f"Model: {model_id}")
-            logger.info(f"Task Type: {task_type}")
-            logger.info(f"Response Length: {len(result)} characters")
-            logger.info("-" * 40)
-            logger.info("FULL RESPONSE CONTENT:")
-            logger.info("-" * 40)
-            logger.info(result)
-            logger.info("-" * 40)
-            logger.info("END OF RESPONSE")
-            logger.info("=" * 80)
-
-            return result
-
-        except GatedRepoError:
-            # Re-raise to be handled by caller
-            raise
         except Exception as e:
252
- logger.error(f"Error calling local model: {e}", exc_info=True)
253
  raise
254
 
255
- async def _call_local_embedding(self, model_config: dict, text: str, **kwargs) -> Optional[list]:
256
- """Call local embedding model."""
257
- if not self.local_loader:
258
- raise RuntimeError("Local model loader not available")
 
 
 
 
259
 
260
- model_id = model_config["model_id"]
 
 
 
 
261
 
262
- try:
263
- # Ensure model is loaded
264
- if model_id not in self.local_loader.loaded_embedding_models:
265
- logger.info(f"Loading embedding model {model_id} on demand...")
266
- try:
267
- self.local_loader.load_embedding_model(model_id)
268
- except GatedRepoError as e:
269
- logger.error(f"❌ Cannot access gated repository {model_id}")
270
- logger.error(f" Visit https://huggingface.co/{model_id.split(':')[0] if ':' in model_id else model_id} to request access.")
271
- raise RuntimeError(f"Embedding model {model_id} is a gated repository") from e
272
-
273
- # Generate embedding
274
- embedding = await asyncio.to_thread(
275
- self.local_loader.get_embedding,
276
- model_id=model_id,
277
- text=text
278
- )
279
-
280
- logger.info(f"Local embedding model {model_id} generated vector (dim: {len(embedding)})")
281
- return embedding
282
-
283
- except Exception as e:
284
- logger.error(f"Error calling local embedding model: {e}", exc_info=True)
285
- raise
 
 
286
 
287
  def _select_model(self, task_type: str) -> dict:
 
288
  model_map = {
289
  "intent_classification": LLM_CONFIG["models"]["classification_specialist"],
290
  "embedding_generation": LLM_CONFIG["models"]["embedding_specialist"],
@@ -294,64 +217,73 @@ class LLMRouter:
294
  }
295
  return model_map.get(task_type, LLM_CONFIG["models"]["reasoning_primary"])
296
 
297
- # REMOVED: _is_model_healthy - no longer needed (local models only)
298
- # REMOVED: _get_fallback_model - no longer needed (local models only)
299
- # REMOVED: _call_hf_endpoint - HF API inference removed
300
-
301
  async def get_available_models(self):
302
- """
303
- Get list of available models for testing
304
- """
305
- return list(LLM_CONFIG["models"].keys())
306
 
307
  async def health_check(self):
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
308
  """
309
- Perform health check on local models only
 
 
 
 
 
 
310
  """
311
- health_status = {}
312
- if not self.local_loader:
313
- return {"error": "Local model loader not available"}
314
 
315
- for model_name, model_config in LLM_CONFIG["models"].items():
316
- model_id = model_config["model_id"]
317
- # Check if model is loaded (for chat models)
318
- is_loaded = model_id in self.local_loader.loaded_models or model_id in self.local_loader.loaded_embedding_models
319
- health_status[model_name] = {
320
- "model_id": model_id,
321
- "loaded": is_loaded,
322
- "healthy": is_loaded # Consider loaded models healthy
323
- }
324
 
325
- return health_status
326
-
327
- def prepare_context_for_llm(self, raw_context: Dict, max_tokens: int = 4000) -> str:
328
- """Smart context windowing for LLM calls"""
329
 
330
- try:
331
- from transformers import AutoTokenizer
332
-
333
- # Initialize tokenizer lazily
334
- if not hasattr(self, 'tokenizer'):
335
- try:
336
- # Use the primary model for tokenization
337
- primary_model_id = LLM_CONFIG["models"]["reasoning_primary"]["model_id"]
338
- # Strip API suffix if present (though we don't use them anymore)
339
- base_model_id = primary_model_id.split(':')[0] if ':' in primary_model_id else primary_model_id
340
- self.tokenizer = AutoTokenizer.from_pretrained(base_model_id)
341
- except GatedRepoError as e:
342
- logger.warning(f"Gated repository error loading tokenizer: {e}")
343
- logger.warning("Using character count estimation instead")
344
- self.tokenizer = None
345
- except Exception as e:
346
- logger.warning(f"Could not load tokenizer: {e}, using character count estimation")
347
- self.tokenizer = None
348
- except ImportError:
349
- logger.warning("transformers library not available, using character count estimation")
350
- self.tokenizer = None
351
 
352
- # Priority order for context elements
 
 
 
 
 
 
 
 
353
  priority_elements = [
354
- ('current_query', 1.0),
355
  ('recent_interactions', 0.8),
356
  ('user_preferences', 0.6),
357
  ('session_summary', 0.4),
@@ -359,12 +291,15 @@ class LLMRouter:
359
  ]
360
 
361
  formatted_context = []
362
- total_tokens = 0
363
 
 
 
 
 
 
364
  for element, priority in priority_elements:
365
- # Map element names to context keys
366
  element_key_map = {
367
- 'current_query': raw_context.get('user_input', ''),
368
  'recent_interactions': raw_context.get('interaction_contexts', []),
369
  'user_preferences': raw_context.get('preferences', {}),
370
  'session_summary': raw_context.get('session_context', {}),
@@ -377,55 +312,32 @@ class LLMRouter:
377
  if isinstance(content, dict):
378
  content = str(content)
379
  elif isinstance(content, list):
380
- content = "\n".join([str(item) for item in content[:10]]) # Limit to 10 items
381
 
382
  if not content:
383
  continue
384
 
385
- # Estimate tokens
386
- if self.tokenizer:
387
- try:
388
- tokens = len(self.tokenizer.encode(content))
389
- except:
390
- # Fallback to character-based estimation (rough: 1 token β‰ˆ 4 chars)
391
- tokens = len(content) // 4
392
- else:
393
- # Character-based estimation (rough: 1 token β‰ˆ 4 chars)
394
- tokens = len(content) // 4
395
 
396
  if total_tokens + tokens <= max_tokens:
397
  formatted_context.append(f"=== {element.upper()} ===\n{content}")
398
  total_tokens += tokens
399
- elif priority > 0.5: # Critical elements - truncate if needed
400
  available = max_tokens - total_tokens
401
  if available > 100: # Only truncate if we have meaningful space
402
  truncated = self._truncate_to_tokens(content, available)
403
  formatted_context.append(f"=== {element.upper()} (TRUNCATED) ===\n{truncated}")
 
404
  break
405
 
 
406
  return "\n\n".join(formatted_context)
407
 
408
  def _truncate_to_tokens(self, content: str, max_tokens: int) -> str:
409
  """Truncate content to fit within token limit"""
410
- if not self.tokenizer:
411
- # Simple character-based truncation
412
- max_chars = max_tokens * 4
413
- if len(content) <= max_chars:
414
- return content
415
- return content[:max_chars-3] + "..."
416
-
417
- try:
418
- # Tokenize and truncate
419
- tokens = self.tokenizer.encode(content)
420
- if len(tokens) <= max_tokens:
421
- return content
422
-
423
- truncated_tokens = tokens[:max_tokens-3] # Leave room for "..."
424
- truncated_text = self.tokenizer.decode(truncated_tokens)
425
- return truncated_text + "..."
426
- except Exception as e:
427
- logger.warning(f"Error truncating with tokenizer: {e}, using character truncation")
428
- max_chars = max_tokens * 4
429
- if len(content) <= max_chars:
430
- return content
431
- return content[:max_chars-3] + "..."
 
+ # llm_router.py - NOVITA AI API ONLY
  import logging
  import asyncio
  from typing import Dict, Optional
  from .models_config import LLM_CONFIG
+ from .config import get_settings

+ # Import OpenAI client for Novita AI API
  try:
+     from openai import OpenAI
+     OPENAI_AVAILABLE = True
  except ImportError:
+     OPENAI_AVAILABLE = False
+     logger = logging.getLogger(__name__)
+     logger.error("openai package not available - Novita AI API requires openai package")

  logger = logging.getLogger(__name__)

  class LLMRouter:
+     def __init__(self, hf_token=None, use_local_models: bool = False):
+         """
+         Initialize LLM Router with Novita AI API only.
+
+         Args:
+             hf_token: Not used (kept for backward compatibility)
+             use_local_models: Must be False (local models disabled)
+         """
+         if use_local_models:
+             raise ValueError("Local models are disabled. Only Novita AI API is supported.")
+
+         self.settings = get_settings()
+         self.novita_client = None
+
+         # Validate OpenAI package
+         if not OPENAI_AVAILABLE:
+             raise ImportError(
+                 "openai package is required for Novita AI API. "
+                 "Install it with: pip install openai>=1.0.0"
+             )
+
+         # Validate API key
+         if not self.settings.novita_api_key:
+             raise ValueError(
+                 "NOVITA_API_KEY is required. "
+                 "Set it in environment variables or .env file"
+             )
+
+         # Initialize Novita AI client
+         try:
+             self.novita_client = OpenAI(
+                 base_url=self.settings.novita_base_url,
+                 api_key=self.settings.novita_api_key,
+             )
+             logger.info("βœ“ Novita AI API client initialized")
+             logger.info(f"  Base URL: {self.settings.novita_base_url}")
+             logger.info(f"  Model: {self.settings.novita_model}")
+         except Exception as e:
+             logger.error(f"Failed to initialize Novita AI client: {e}")
+             raise RuntimeError(f"Could not initialize Novita AI API client: {e}") from e
    async def route_inference(self, task_type: str, prompt: str, **kwargs):
        """
+         Route inference to Novita AI API.
+
+         Args:
+             task_type: Type of task (general_reasoning, intent_classification, etc.)
+             prompt: Input prompt
+             **kwargs: Additional parameters (max_tokens, temperature, etc.)
+
+         Returns:
+             Generated text response
        """
+         logger.info(f"Routing inference to Novita AI API for task: {task_type}")
+
+         if not self.novita_client:
+             raise RuntimeError("Novita AI client not initialized")

        try:
+             # Handle embedding generation (may need special handling)
            if task_type == "embedding_generation":
+                 logger.warning("Embedding generation via Novita API may require special implementation")
+                 # For now, use chat completion (may need adjustment based on Novita API capabilities)
+                 result = await self._call_novita_api(task_type, prompt, **kwargs)
            else:
+                 result = await self._call_novita_api(task_type, prompt, **kwargs)

            if result is None:
+                 logger.error(f"Novita AI API returned None for task: {task_type}")
                raise RuntimeError(f"Inference failed for task: {task_type}")

+             logger.info(f"Inference complete for {task_type} (Novita AI API)")
            return result

        except Exception as e:
+             logger.error(f"Novita AI API inference failed: {e}", exc_info=True)
            raise RuntimeError(
                f"Inference failed for task: {task_type}. "
+                 f"Novita AI API error: {e}"
            ) from e

+     async def _call_novita_api(self, task_type: str, prompt: str, **kwargs) -> Optional[str]:
+         """Call Novita AI API for inference."""
+         if not self.novita_client:
            return None

+         # Get model config
+         model_config = self._select_model(task_type)
+         model_name = kwargs.get('model', self.settings.novita_model)
+
+         # Get optimized parameters
+         max_tokens = kwargs.get('max_tokens', model_config.get('max_tokens', 4096))
+         temperature = kwargs.get('temperature',
+                                  model_config.get('temperature', self.settings.deepseek_r1_temperature))
+         top_p = kwargs.get('top_p', model_config.get('top_p', 0.95))
+         stream = kwargs.get('stream', False)
+
+         # Format prompt according to DeepSeek-R1 best practices
+         formatted_prompt = self._format_deepseek_r1_prompt(prompt, task_type, model_config)

+         # IMPORTANT: No system prompt - all instructions in user prompt
+         messages = [{"role": "user", "content": formatted_prompt}]
+
+         # Build request parameters
+         request_params = {
+             "model": model_name,
+             "messages": messages,
+             "stream": stream,
+             "max_tokens": max_tokens,
+             "temperature": temperature,
+             "top_p": top_p,
+         }

        try:
+             if stream:
+                 # Handle streaming response
+                 response_text = ""
+                 stream_response = self.novita_client.chat.completions.create(**request_params)
+
+                 for chunk in stream_response:
+                     if chunk.choices and len(chunk.choices) > 0:
+                         delta = chunk.choices[0].delta
+                         if delta and delta.content:
+                             response_text += delta.content
+
+                 # Clean up reasoning tags if present
+                 response_text = self._clean_reasoning_tags(response_text)
+                 logger.info(f"Novita AI API generated response (length: {len(response_text)})")
+                 return response_text
+             else:
+                 # Handle non-streaming response
+                 response = self.novita_client.chat.completions.create(**request_params)
+
+                 if response.choices and len(response.choices) > 0:
+                     result = response.choices[0].message.content
+                     # Clean up reasoning tags if present
+                     result = self._clean_reasoning_tags(result)
+                     logger.info(f"Novita AI API generated response (length: {len(result)})")
+                     return result
+                 else:
+                     logger.error("Novita AI API returned empty response")
+                     return None

        except Exception as e:
+             logger.error(f"Error calling Novita AI API: {e}", exc_info=True)
            raise

+     def _format_deepseek_r1_prompt(self, prompt: str, task_type: str, model_config: dict) -> str:
+         """
+         Format prompt according to DeepSeek-R1 best practices:
+         - No system prompt (all instructions in user prompt)
+         - Force reasoning trigger for reasoning tasks
+         - Add math directive for mathematical problems
+         """
+         formatted_prompt = prompt

+         # Check if we should force reasoning prefix
+         force_reasoning = (
+             self.settings.deepseek_r1_force_reasoning and
+             model_config.get("force_reasoning_prefix", False)
+         )

+         if force_reasoning:
+             # Force model to start with reasoning trigger
+             formatted_prompt = f"<think>\n\n{formatted_prompt}"
+
+         # Add math directive for mathematical problems
+         if self._is_math_query(prompt):
+             math_directive = "Please reason step by step, and put your final answer within \\boxed{}."
+             formatted_prompt = f"{formatted_prompt}\n\n{math_directive}"
+
+         return formatted_prompt
+
+     def _is_math_query(self, prompt: str) -> bool:
+         """Detect if query is mathematical"""
+         math_keywords = [
+             "solve", "calculate", "compute", "equation", "formula",
+             "mathematical", "algebra", "geometry", "calculus", "integral",
+             "derivative", "theorem", "proof", "problem"
+         ]
+         prompt_lower = prompt.lower()
+         return any(keyword in prompt_lower for keyword in math_keywords)
+
+     def _clean_reasoning_tags(self, text: str) -> str:
+         """Clean up reasoning tags from response"""
+         text = text.replace("<think>", "").replace("</think>", "")
+         text = text.strip()
+         return text
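A quick standalone trace of the two helpers above, showing what actually goes over the wire for a math-flavored query. The `<think>` trigger and `\boxed{}` directive mirror the DeepSeek-R1 prompting conventions the code encodes; note that `_clean_reasoning_tags` strips only the tags, not the reasoning text between them:

```python
# Standalone trace of the formatting logic above (same rules, inlined).
prompt = "Solve 2x + 3 = 11 for x."

# "solve" matches a math keyword, and force_reasoning_prefix is True for
# reasoning_primary, so the outgoing user message becomes:
formatted = f"<think>\n\n{prompt}\n\nPlease reason step by step, and put your final answer within \\boxed{{}}."

# A typical DeepSeek-R1 reply wraps its reasoning in the tags; the cleanup
# removes the tags themselves but keeps the enclosed reasoning text.
raw_reply = "<think>Subtract 3, then divide by 2.</think>The answer is \\boxed{4}."
cleaned = raw_reply.replace("<think>", "").replace("</think>", "").strip()
print(cleaned)  # Subtract 3, then divide by 2.The answer is \boxed{4}.
```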
    def _select_model(self, task_type: str) -> dict:
+         """Select model configuration based on task type"""
        model_map = {
            "intent_classification": LLM_CONFIG["models"]["classification_specialist"],
            "embedding_generation": LLM_CONFIG["models"]["embedding_specialist"],
        }
        return model_map.get(task_type, LLM_CONFIG["models"]["reasoning_primary"])

    async def get_available_models(self):
+         """Get list of available models (Novita AI only)"""
+         return ["Novita AI API - DeepSeek-R1-Distill-Qwen-7B"]

    async def health_check(self):
+         """Perform health check on Novita AI API"""
+         try:
+             # Test API with a simple request
+             test_response = self.novita_client.chat.completions.create(
+                 model=self.settings.novita_model,
+                 messages=[{"role": "user", "content": "test"}],
+                 max_tokens=5
+             )
+
+             return {
+                 "provider": "novita_api",
+                 "status": "healthy",
+                 "model": self.settings.novita_model,
+                 "base_url": self.settings.novita_base_url
+             }
+         except Exception as e:
+             logger.error(f"Health check failed: {e}")
+             return {
+                 "provider": "novita_api",
+                 "status": "unhealthy",
+                 "error": str(e)
+             }
+
+     def prepare_context_for_llm(self, raw_context: Dict, max_tokens: Optional[int] = None,
+                                 user_input: Optional[str] = None) -> str:
        """
+         Smart context windowing with user input priority.
+         User input is NEVER truncated - context is reduced to fit.
+
+         Args:
+             raw_context: Context dictionary
+             max_tokens: Optional override (uses config default if None)
+             user_input: Optional explicit user input (takes priority over raw_context['user_input'])
        """
+         # Use config budget if not provided
+         if max_tokens is None:
+             max_tokens = self.settings.context_preparation_budget

+         # Get user input (explicit parameter takes priority)
+         actual_user_input = user_input or raw_context.get('user_input', '')

+         # Calculate user input tokens (simple estimation: 1 token β‰ˆ 4 chars)
+         user_input_tokens = len(actual_user_input) // 4

+         # Ensure user input fits within dedicated budget
+         user_input_max = self.settings.user_input_max_tokens
+         if user_input_tokens > user_input_max:
+             logger.warning(f"User input ({user_input_tokens} tokens) exceeds max ({user_input_max}), truncating")
+             max_chars = user_input_max * 4
+             actual_user_input = actual_user_input[:max_chars - 3] + "..."
+             user_input_tokens = user_input_max

+         # Reserve space for user input (it has highest priority)
+         remaining_tokens = max_tokens - user_input_tokens
+         if remaining_tokens < 0:
+             logger.warning(f"User input ({user_input_tokens} tokens) exceeds total budget ({max_tokens})")
+             remaining_tokens = 0
+
+         logger.info(f"Token allocation: User input={user_input_tokens}, Context budget={remaining_tokens}, Total={max_tokens}")
+
+         # Priority order for context elements (user input already handled)
        priority_elements = [
            ('recent_interactions', 0.8),
            ('user_preferences', 0.6),
            ('session_summary', 0.4),
        ]

        formatted_context = []
+         total_tokens = user_input_tokens  # Start with user input tokens

+         # Add user input first (unconditionally, never truncated)
+         if actual_user_input:
+             formatted_context.append(f"=== USER INPUT ===\n{actual_user_input}")
+
+         # Now add context elements within remaining budget
        for element, priority in priority_elements:
            element_key_map = {
                'recent_interactions': raw_context.get('interaction_contexts', []),
                'user_preferences': raw_context.get('preferences', {}),
                'session_summary': raw_context.get('session_context', {}),

            if isinstance(content, dict):
                content = str(content)
            elif isinstance(content, list):
+                 content = "\n".join([str(item) for item in content[:10]])

            if not content:
                continue

+             # Estimate tokens (simple: 1 token β‰ˆ 4 chars)
+             tokens = len(content) // 4

            if total_tokens + tokens <= max_tokens:
                formatted_context.append(f"=== {element.upper()} ===\n{content}")
                total_tokens += tokens
+             elif priority > 0.5 and remaining_tokens > 0:  # Critical elements - truncate if needed
                available = max_tokens - total_tokens
                if available > 100:  # Only truncate if we have meaningful space
                    truncated = self._truncate_to_tokens(content, available)
                    formatted_context.append(f"=== {element.upper()} (TRUNCATED) ===\n{truncated}")
+                 total_tokens += available
                break

+         logger.info(f"Context prepared: {total_tokens}/{max_tokens} tokens (user input: {user_input_tokens}, context: {total_tokens - user_input_tokens})")
        return "\n\n".join(formatted_context)

    def _truncate_to_tokens(self, content: str, max_tokens: int) -> str:
        """Truncate content to fit within token limit"""
+         # Simple character-based truncation (1 token β‰ˆ 4 chars)
+         max_chars = max_tokens * 4
+         if len(content) <= max_chars:
+             return content
+         return content[:max_chars - 3] + "..."
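Taken together, a minimal usage sketch of the rewritten router (assuming `NOVITA_API_KEY` is set and the script runs from the project root, so `src.llm_router` is importable):

```python
# Usage sketch for the rewritten router (assumes NOVITA_API_KEY is set).
import asyncio
from src.llm_router import LLMRouter

async def main():
    router = LLMRouter(use_local_models=False)

    # User input is packed first and never truncated below its dedicated
    # budget; whatever remains of the total budget goes to context elements.
    context = router.prepare_context_for_llm(
        raw_context={
            "user_input": "Summarize my last three queries.",
            "interaction_contexts": ["q1: ...", "q2: ...", "q3: ..."],
            "preferences": {"style": "concise"},
        },
    )

    answer = await router.route_inference(
        task_type="general_reasoning",
        prompt=context,
        max_tokens=512,
        temperature=0.6,
    )
    print(answer)

asyncio.run(main())
```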
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
src/models_config.py CHANGED
@@ -1,61 +1,45 @@
  # models_config.py
- # Optimized for NVIDIA T4 Medium (16GB VRAM) with 4-bit quantization
- # UPDATED: Local models only - no API fallback
  LLM_CONFIG = {
-     "primary_provider": "local",
      "models": {
          "reasoning_primary": {
-             # Primary: Qwen (gated, requires access) - Fallback: Mistral (non-gated, stable)
-             "model_id": "Qwen/Qwen2.5-7B-Instruct",  # Single primary model for all text tasks
              "task": "general_reasoning",
-             "max_tokens": 8000,  # Reduced from 10000
-             "temperature": 0.7,
-             # Fallback to Mistral (non-gated, no DynamicCache issues) before Phi-3
-             "fallback": "mistralai/Mistral-7B-Instruct-v0.2",  # Non-gated, stable, no DynamicCache issues
-             "fallback2": "microsoft/Phi-3-mini-4k-instruct",  # Secondary fallback (3.8B, has DynamicCache workaround)
-             "is_chat_model": True,
-             "use_4bit_quantization": True,  # Enable 4-bit quantization for 16GB T4
-             "use_8bit_quantization": False
-         },
-         "embedding_specialist": {
-             "model_id": "intfloat/e5-large-v2",  # 1024-dim embeddings for semantic similarity
-             "task": "embeddings",
-             "vector_dimensions": 1024,
-             "purpose": "semantic_similarity",
-             "is_chat_model": False
          },
          "classification_specialist": {
-             "model_id": "Qwen/Qwen2.5-7B-Instruct",  # Same model for all text tasks
              "task": "intent_classification",
-             "max_length": 512,
-             "specialization": "fast_inference",
-             "latency_target": "<100ms",
-             "is_chat_model": True,
-             "use_4bit_quantization": True,
-             "fallback": "mistralai/Mistral-7B-Instruct-v0.2",  # Non-gated, stable
-             "fallback2": "microsoft/Phi-3-mini-4k-instruct"  # Secondary fallback with DynamicCache workaround
          },
          "safety_checker": {
-             "model_id": "Qwen/Qwen2.5-7B-Instruct",  # Same model for all text tasks
              "task": "content_moderation",
-             "confidence_threshold": 0.85,
-             "purpose": "bias_detection",
-             "is_chat_model": True,
-             "use_4bit_quantization": True,
-             "fallback": "mistralai/Mistral-7B-Instruct-v0.2",  # Non-gated, stable
-             "fallback2": "microsoft/Phi-3-mini-4k-instruct"  # Secondary fallback with DynamicCache workaround
          }
      },
      "routing_logic": {
-         "strategy": "task_based_routing",
-         "fallback_chain": ["primary"],  # No API fallback
-         "load_balancing": "single_model_reuse"
-     },
-     "quantization_settings": {
-         "default_4bit": True,  # Enable 4-bit quantization by default for T4 16GB
-         "default_8bit": False,
-         "bnb_4bit_compute_dtype": "float16",
-         "bnb_4bit_use_double_quant": True,
-         "bnb_4bit_quant_type": "nf4"
      }
  }
 
  # models_config.py
+ # UPDATED: Novita AI API only - no local models
  LLM_CONFIG = {
+     "primary_provider": "novita_api",
      "models": {
          "reasoning_primary": {
+             "model_id": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2",
              "task": "general_reasoning",
+             "max_tokens": 4096,
+             "temperature": 0.6,  # Recommended for DeepSeek-R1
+             "top_p": 0.95,
+             "force_reasoning_prefix": True,
+             "is_chat_model": True
          },
          "classification_specialist": {
+             "model_id": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2",
              "task": "intent_classification",
+             "max_tokens": 512,
+             "temperature": 0.5,  # Lower for consistency
+             "top_p": 0.9,
+             "force_reasoning_prefix": False,
+             "is_chat_model": True
          },
          "safety_checker": {
+             "model_id": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2",
              "task": "content_moderation",
+             "max_tokens": 1024,
+             "temperature": 0.5,
+             "top_p": 0.9,
+             "force_reasoning_prefix": False,
+             "is_chat_model": True
+         },
+         "embedding_specialist": {
+             "model_id": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B:de-1a706eeafbf3ebc2",
+             "task": "embeddings",
+             "note": "Embeddings via Novita API - may require special handling",
+             "is_chat_model": True
          }
      },
      "routing_logic": {
+         "strategy": "novita_api_only",
+         "fallback_chain": [],
+         "load_balancing": "single_endpoint"
      }
  }
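Every entry now points at the same Novita deployment, so the per-task differences are purely generation parameters. A small sketch of how a `_select_model`-style lookup resolves them (the `params_for` helper is illustrative, not part of the commit):

```python
# Sketch: resolving per-task generation parameters from LLM_CONFIG.
from src.models_config import LLM_CONFIG

def params_for(task_type: str) -> dict:
    model_map = {
        "intent_classification": LLM_CONFIG["models"]["classification_specialist"],
        "content_moderation": LLM_CONFIG["models"]["safety_checker"],
    }
    # Unknown tasks fall back to the primary reasoning configuration.
    cfg = model_map.get(task_type, LLM_CONFIG["models"]["reasoning_primary"])
    return {
        "max_tokens": cfg.get("max_tokens", 4096),
        "temperature": cfg.get("temperature", 0.6),
        "top_p": cfg.get("top_p", 0.95),
    }

print(params_for("intent_classification"))  # {'max_tokens': 512, 'temperature': 0.5, 'top_p': 0.9}
print(params_for("general_reasoning"))      # resolves to reasoning_primary's values
```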
test_novita_conda.bat ADDED
@@ -0,0 +1,53 @@
@echo off
REM Test Novita AI connection using Anaconda environment
REM This script activates the conda environment and runs the test

echo ============================================================
echo Testing Novita AI Connection with Anaconda
echo ============================================================
echo.

REM Check if conda is available
where conda >nul 2>&1
if %ERRORLEVEL% NEQ 0 (
    echo ERROR: conda command not found
    echo Please activate Anaconda Prompt first or add conda to PATH
    goto :end
)

echo Step 1: Checking conda environments...
call conda env list

echo.
echo Step 2: Creating environment if it doesn't exist...
call conda env create -f environment.yml --name research-ai-assistant 2>nul
if %ERRORLEVEL% NEQ 0 (
    echo Environment may already exist, continuing...
)

echo.
echo Step 3: Activating environment and running test...
call conda activate research-ai-assistant
if %ERRORLEVEL% NEQ 0 (
    echo ERROR: Failed to activate environment
    echo Try: conda activate research-ai-assistant
    goto :end
)

echo.
echo Step 4: Installing openai package if needed...
python -c "import openai" 2>nul
if %ERRORLEVEL% NEQ 0 (
    echo Installing openai package...
    REM Quote the requirement so cmd does not treat ">=" as redirection
    pip install "openai>=1.0.0"
)

echo.
echo Step 5: Running Novita AI connection test...
python test_novita_connection.py

:end
echo.
echo Test complete!
pause
test_novita_connection.py ADDED
@@ -0,0 +1,275 @@
#!/usr/bin/env python3
"""
Test script for Novita AI API connection
Tests configuration, client initialization, and API calls
"""

import os
import sys
import asyncio
from pathlib import Path

# Add project root to path
project_root = Path(__file__).parent
sys.path.insert(0, str(project_root))

def test_configuration():
    """Test configuration loading"""
    print("=" * 60)
    print("TEST 1: Configuration Loading")
    print("=" * 60)

    try:
        from src.config import get_settings
        settings = get_settings()

        print("βœ“ Configuration loaded successfully")
        print(f"  Novita API Key: {'Set' if settings.novita_api_key else 'NOT SET'}")
        print(f"  Base URL: {settings.novita_base_url}")
        print(f"  Model: {settings.novita_model}")
        print(f"  Temperature: {settings.deepseek_r1_temperature}")
        print(f"  Force Reasoning: {settings.deepseek_r1_force_reasoning}")
        print(f"  User Input Max Tokens: {settings.user_input_max_tokens}")
        print(f"  Context Preparation Budget: {settings.context_preparation_budget}")

        if not settings.novita_api_key:
            print("\n❌ ERROR: NOVITA_API_KEY is not set!")
            print("   Please set it in environment variables or .env file")
            return False

        return True

    except Exception as e:
        print(f"❌ Configuration loading failed: {e}")
        import traceback
        traceback.print_exc()
        return False

def test_openai_package():
    """Test OpenAI package availability"""
    print("\n" + "=" * 60)
    print("TEST 2: OpenAI Package Check")
    print("=" * 60)

    try:
        import openai
        print("βœ“ OpenAI package is available")
        print(f"  OpenAI version: {openai.__version__}")
        return True
    except ImportError as e:
        print(f"❌ OpenAI package not available: {e}")
        print("   Install with: pip install openai>=1.0.0")
        return False

def test_client_initialization():
    """Test Novita AI client initialization"""
    print("\n" + "=" * 60)
    print("TEST 3: Novita AI Client Initialization")
    print("=" * 60)

    try:
        from src.config import get_settings
        from openai import OpenAI

        settings = get_settings()

        if not settings.novita_api_key:
            print("❌ Cannot test - NOVITA_API_KEY not set")
            return False, None

        client = OpenAI(
            base_url=settings.novita_base_url,
            api_key=settings.novita_api_key,
        )

        print("βœ“ Novita AI client initialized successfully")
        print(f"  Base URL: {settings.novita_base_url}")
        print(f"  API Key: {settings.novita_api_key[:10]}...{settings.novita_api_key[-4:] if len(settings.novita_api_key) > 14 else '***'}")

        return True, client

    except Exception as e:
        print(f"❌ Client initialization failed: {e}")
        import traceback
        traceback.print_exc()
        return False, None

def test_simple_api_call(client):
    """Test a simple API call to Novita AI"""
    print("\n" + "=" * 60)
    print("TEST 4: Simple API Call")
    print("=" * 60)

    if not client:
        print("❌ Cannot test - client not initialized")
        return False

    try:
        from src.config import get_settings
        settings = get_settings()

        print(f"Sending test request to: {settings.novita_model}")
        print("Prompt: 'Hello, this is a test. Please respond briefly.'")

        response = client.chat.completions.create(
            model=settings.novita_model,
            messages=[
                {"role": "user", "content": "Hello, this is a test. Please respond briefly."}
            ],
            max_tokens=50,
            temperature=0.6
        )

        if response.choices and len(response.choices) > 0:
            result = response.choices[0].message.content
            print("βœ“ API call successful!")
            print(f"  Response length: {len(result)} characters")
            print(f"  Response preview: {result[:100]}...")
            print(f"  Model used: {response.model if hasattr(response, 'model') else 'N/A'}")
            return True
        else:
            print("❌ API call returned empty response")
            return False

    except Exception as e:
        print(f"❌ API call failed: {e}")
        import traceback
        traceback.print_exc()
        return False

def test_llm_router():
    """Test LLM Router initialization and health check"""
    print("\n" + "=" * 60)
    print("TEST 5: LLM Router Initialization")
    print("=" * 60)

    try:
        from src.llm_router import LLMRouter

        print("Initializing LLM Router...")
        router = LLMRouter(hf_token=None, use_local_models=False)

        print("βœ“ LLM Router initialized successfully")

        # Test health check
        print("\nTesting health check...")
        async def test_health():
            health = await router.health_check()
            return health

        health = asyncio.run(test_health())
        print(f"βœ“ Health check result: {health}")

        return True

    except Exception as e:
        print(f"❌ LLM Router initialization failed: {e}")
        import traceback
        traceback.print_exc()
        return False

async def test_inference():
    """Test actual inference through LLM Router"""
    print("\n" + "=" * 60)
    print("TEST 6: Inference Test")
    print("=" * 60)

    try:
        from src.llm_router import LLMRouter

        router = LLMRouter(hf_token=None, use_local_models=False)

        test_prompt = "What is the capital of France? Answer in one sentence."
        print(f"Test prompt: {test_prompt}")

        result = await router.route_inference(
            task_type="general_reasoning",
            prompt=test_prompt,
            max_tokens=100,
            temperature=0.6
        )

        if result:
            print("βœ“ Inference successful!")
            print(f"  Response length: {len(result)} characters")
            print(f"  Response: {result}")
            return True
        else:
            print("❌ Inference returned None")
            return False

    except Exception as e:
        print(f"❌ Inference test failed: {e}")
        import traceback
        traceback.print_exc()
        return False

def main():
    """Run all tests"""
    print("\n" + "=" * 60)
    print("NOVITA AI CONNECTION TEST")
    print("=" * 60)
    print()

    results = {}

    # Test 1: Configuration
    results['config'] = test_configuration()
    if not results['config']:
        print("\n❌ Configuration test failed. Please check your environment variables.")
        return 1

    # Test 2: OpenAI package
    results['package'] = test_openai_package()
    if not results['package']:
        print("\n❌ Package test failed. Please install: pip install openai>=1.0.0")
        return 1

    # Test 3: Client initialization
    client_init_result = test_client_initialization()
    if isinstance(client_init_result, tuple):
        results['client'] = client_init_result[0]
        client = client_init_result[1]
    else:
        results['client'] = client_init_result
        client = None

    if not results['client']:
        print("\n❌ Client initialization failed. Check your API key and base URL.")
        return 1

    # Test 4: Simple API call
    results['api_call'] = test_simple_api_call(client)

    # Test 5: LLM Router
    results['router'] = test_llm_router()

    # Test 6: Inference
    if results['router']:
        results['inference'] = asyncio.run(test_inference())

    # Summary
    print("\n" + "=" * 60)
    print("TEST SUMMARY")
    print("=" * 60)

    total_tests = len(results)
    passed_tests = sum(1 for v in results.values() if v)

    for test_name, result in results.items():
        status = "βœ“ PASS" if result else "❌ FAIL"
        print(f"  {test_name.upper()}: {status}")

    print(f"\nTotal: {passed_tests}/{total_tests} tests passed")

    if passed_tests == total_tests:
        print("\nπŸŽ‰ All tests passed! Novita AI connection is working correctly.")
        return 0
    else:
        print("\n⚠️ Some tests failed. Please review the errors above.")
        return 1

if __name__ == "__main__":
    exit_code = main()
    sys.exit(exit_code)
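One path the script above does not exercise is streaming: `_call_novita_api` accepts `stream=True` and accumulates chunk deltas into a single string. A possible extra check, sketched under the same assumptions as the other tests:

```python
# Optional extra check (not in the script above): exercise the streaming
# path of _call_novita_api via route_inference(stream=True).
import asyncio
from src.llm_router import LLMRouter

async def test_streaming():
    router = LLMRouter(hf_token=None, use_local_models=False)
    result = await router.route_inference(
        task_type="general_reasoning",
        prompt="Count from 1 to 5.",
        max_tokens=50,
        stream=True,  # router accumulates chunk deltas into one string
    )
    print(f"Streamed response ({len(result)} chars): {result}")

asyncio.run(test_streaming())
```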