# Deployment Configuration Guide
## Critical Issues and Solutions
### 1. Cache Directory Permissions
**Problem**: `PermissionError: [Errno 13] Permission denied: '/.cache'`
**Solution**: The code now automatically detects Docker and uses `/tmp/huggingface_cache`. However, ensure the Dockerfile sets proper permissions.
**Dockerfile Fix**:
```dockerfile
# Create cache directory with proper permissions
RUN mkdir -p /tmp/huggingface_cache && chmod 777 /tmp/huggingface_cache
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
```
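For reference, the auto-detection amounts to something like the following (a minimal sketch assuming `/.dockerenv` as the container check; the function name is illustrative and the actual code in the repo may differ):
```python
import os

def resolve_cache_dir() -> str:
    """Pick a writable Hugging Face cache directory (hypothetical helper)."""
    # /.dockerenv is created by the Docker runtime, so its presence is a
    # common (if not bulletproof) way to detect a container.
    in_docker = os.path.exists("/.dockerenv")
    cache_dir = (
        "/tmp/huggingface_cache" if in_docker
        else os.path.expanduser("~/.cache/huggingface")
    )
    os.makedirs(cache_dir, exist_ok=True)
    # Point both variables at the writable directory before transformers is imported.
    os.environ.setdefault("HF_HOME", cache_dir)
    os.environ.setdefault("TRANSFORMERS_CACHE", cache_dir)
    return cache_dir
```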
### 2. User ID Issues
**Problem**: `KeyError: 'getpwuid(): uid not found: 1000'`
**Solution**: Run the container as a valid user, or create the user inside the container.
**Option A - Use root (simplest for HF Spaces)**:
```dockerfile
# Already running as root in HF Spaces - this is fine
# Just ensure cache directories are writable
```
**Option B - Create user in Dockerfile**:
```dockerfile
RUN useradd -m -u 1000 -s /bin/bash appuser && \
    mkdir -p /tmp/huggingface_cache && \
    chown -R appuser:appuser /tmp/huggingface_cache /app
USER appuser
```
**For Hugging Face Spaces**: Spaces typically run as root, so Option A is fine.
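This `KeyError` usually originates from `os.path.expanduser("~")`, which falls back to `pwd.getpwuid()` when `HOME` is unset and the container's UID has no `/etc/passwd` entry. A defensive workaround (a sketch, not the app's exact code) is to point `HOME` somewhere writable early in startup:
```python
import os

# If HOME is unset or unwritable, expanduser("~") falls back to pwd.getpwuid(),
# which raises KeyError for a UID missing from /etc/passwd.
home = os.environ.get("HOME", "")
if not home or not os.access(home, os.W_OK):
    os.environ["HOME"] = "/tmp"
```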
### 3. HuggingFace Token Configuration
**Problem**: Gated repository access errors
**Solution**: Set `HF_TOKEN` in the Hugging Face Spaces secrets.
**Steps**:
1. Go to your Space → Settings → Repository secrets
2. Add `HF_TOKEN` with your Hugging Face access token
3. The token needs read access to the gated models
**Verify Token**:
```bash
# Test token access
curl -H "Authorization: Bearer YOUR_TOKEN" https://huggingface.co/api/models/Qwen/Qwen2.5-7B-Instruct
```
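The same check can be run from Python using `huggingface_hub` (installed as a transformers dependency):
```python
import os
from huggingface_hub import whoami

try:
    # whoami() raises if the token is invalid or expired.
    info = whoami(token=os.environ["HF_TOKEN"])
    print("Token valid for user:", info["name"])
except Exception as exc:
    print("Token check failed:", exc)
```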
### 4. GPU Tensor Device Placement
**Problem**: `Tensor on device cuda:0 is not on the expected device meta!`
**Solution**: Use explicit device placement instead of `device_map="auto"` for non-quantized models.
**Code Fix**: Already implemented in `src/local_model_loader.py` - it uses `device_map="auto"` only with quantization, and explicit placement otherwise.
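The pattern looks roughly like this (an illustrative sketch of the approach described above, not the exact contents of `src/local_model_loader.py`):
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_model(model_id: str, quantize: bool):
    if quantize:
        # Quantized weights are dispatched by accelerate, so device_map="auto"
        # is the supported way to place them.
        bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
        return AutoModelForCausalLM.from_pretrained(
            model_id, quantization_config=bnb, device_map="auto"
        )
    # Non-quantized: load normally, then move the whole model explicitly.
    # Mixing device_map="auto" with manual .to() calls is what produces
    # the "meta" device error.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    return model.to("cuda" if torch.cuda.is_available() else "cpu")
```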
### 5. Model Selection for Testing
**Current Models**:
- Primary: `Qwen/Qwen2.5-7B-Instruct` (gated - requires access)
- Fallback: `microsoft/Phi-3-mini-4k-instruct` (non-gated, verified)
**For Testing Without Gated Models**:
Update `src/models_config.py` to use non-gated models:
```python
"reasoning_primary": {
    "model_id": "microsoft/Phi-3-mini-4k-instruct",  # Non-gated
    ...
}
```
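At load time, the same primary/fallback behavior can be expressed as a simple loop (a hypothetical sketch; the repo's loader may structure this differently):
```python
from transformers import AutoModelForCausalLM

MODEL_IDS = ["Qwen/Qwen2.5-7B-Instruct", "microsoft/Phi-3-mini-4k-instruct"]

model = None
for model_id in MODEL_IDS:
    try:
        model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
        print(f"Loaded {model_id}")
        break
    except Exception as exc:
        # Gated-repo and auth failures surface here; fall through to the next model.
        print(f"Could not load {model_id}: {exc}")
```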
## Recommended Dockerfile Updates
```dockerfile
FROM python:3.10-slim
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    cmake \
    libopenblas-dev \
    libomp-dev \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create cache directories with proper permissions
RUN mkdir -p /tmp/huggingface_cache && \
    chmod 777 /tmp/huggingface_cache && \
    mkdir -p /tmp/logs && \
    chmod 777 /tmp/logs

# Copy requirements file
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV PORT=7860
ENV OMP_NUM_THREADS=4
ENV MKL_NUM_THREADS=4
ENV DB_PATH=/tmp/sessions.db
ENV FAISS_INDEX_PATH=/tmp/embeddings.faiss
ENV LOG_DIR=/tmp/logs
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
ENV RATE_LIMIT_ENABLED=true

# Expose port
EXPOSE 7860

# Health check
HEALTHCHECK --interval=30s --timeout=30s --start-period=120s --retries=3 \
    CMD curl -f http://localhost:7860/api/health || exit 1

# Run with Gunicorn
CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "4", "--threads", "2", "--timeout", "120", "--access-logfile", "-", "--error-logfile", "-", "--log-level", "info", "flask_api_standalone:app"]
```
## Hugging Face Spaces Configuration
### Required Secrets:
1. `HF_TOKEN` - Your Hugging Face access token (for gated models)
### Environment Variables (Optional):
- `HF_HOME` - Auto-detects to `/tmp/huggingface_cache` in Docker
- `TRANSFORMERS_CACHE` - Auto-detects to `/tmp/huggingface_cache` in Docker
### Hardware Requirements:
- GPU: NVIDIA T4 (16GB VRAM) - ✅ Detected in logs
- Memory: At least 8GB RAM
- Disk: 20GB+ for model cache
## Verification Steps
1. **Check Cache Directory**:
```bash
ls -la /tmp/huggingface_cache
# Should show a writable directory
```
2. **Check HF Token**:
```python
import os
print("HF_TOKEN set:", bool(os.getenv("HF_TOKEN")))
```
3. **Check GPU**:
```python
import torch
print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")
```
4. **Test Model Loading**:
- Check logs for: `✓ Cache directory verified: /tmp/huggingface_cache`
- Check logs for: `✓ HF_TOKEN authenticated for gated model access` (if token set)
- Check logs for: `✓ Model loaded successfully`
## Troubleshooting
### Issue: Still getting permission errors
**Fix**: Ensure the Dockerfile creates the cache directory with 777 permissions.
### Issue: Gated repository errors persist
**Fix**:
1. Verify `HF_TOKEN` is set in Spaces secrets
2. Visit the model page and request access
3. Wait for approval (usually instant)
4. Use the fallback model (Phi-3-mini) until access is granted
### Issue: Tensor device errors
**Fix**: The code now handles this: if quantized loading fails, it reloads without quantization and uses explicit device placement.
### Issue: Model too large for GPU
**Fix**:
- The code automatically falls back to no quantization if bitsandbytes fails
- Consider using a smaller model (Phi-3-mini) for testing
- Check GPU memory with `nvidia-smi`, or from Python as sketched below
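A quick in-process memory check, using `torch.cuda.mem_get_info` (available in recent PyTorch releases):
```python
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # returns (free, total) in bytes
    print(f"GPU memory: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
else:
    print("No CUDA device available")
```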
## Quick Start Checklist
- [ ] HF_TOKEN set in Spaces secrets
- [ ] Dockerfile creates cache directory with proper permissions
- [ ] GPU detected (check logs)
- [ ] Cache directory writable (check logs)
- [ ] Model access granted (or using non-gated fallback)
- [ ] No tensor device errors (check logs)
## Next Steps
1. Update Dockerfile with cache directory creation
2. Set HF_TOKEN in Spaces secrets
3. Request access to gated models (Qwen)
4. Test with fallback model first (Phi-3-mini)
5. Monitor logs for successful model loading