HonestAI

Paused

JatsTheAIGen commited on Nov 4

Commit

67c580c

1 Parent(s): 13fa6c4

Fix: GPU tensor placement and Docker deployment configuration

CRITICAL FIXES:
- Fixed tensor device placement errors (meta device issues)
- Added explicit device placement for non-quantized models
- Updated Dockerfile with cache directory setup
- Created comprehensive deployment configuration guide

Changes:
- src/local_model_loader.py:
- Use device_map='auto' only with quantization (prevents meta device errors)
- Explicit .to(device) placement for non-quantized models
- Better logging for model loading status

- Dockerfile:
- Create cache directories with proper permissions
- Set HF_HOME and TRANSFORMERS_CACHE environment variables
- Ensure /tmp directories are writable

- DEPLOYMENT_CONFIG_GUIDE.md (NEW):
- Comprehensive guide for all deployment issues
- Cache directory permission fixes
- HF_TOKEN configuration
- GPU tensor placement solutions
- Troubleshooting steps
- Verification checklist

Fixes:
- Tensor on device meta errors → Explicit device placement
- Permission denied /cache errors → Dockerfile creates /tmp/cache
- User ID issues → Proper directory permissions in Dockerfile
- Gated repository access → HF_TOKEN configuration guide

Ready for production deployment.

Files changed (3) hide show

DEPLOYMENT_CONFIG_GUIDE.md +214 -0
Dockerfile +10 -0
src/local_model_loader.py +12 -1

DEPLOYMENT_CONFIG_GUIDE.md ADDED Viewed

	@@ -0,0 +1,214 @@

+# Deployment Configuration Guide
+## Critical Issues and Solutions
+### 1. Cache Directory Permissions
+**Problem**: `PermissionError: [Errno 13] Permission denied: '/.cache'`
+**Solution**: The code now automatically detects Docker and uses `/tmp/huggingface_cache`. However, ensure the Dockerfile sets proper permissions.
+**Dockerfile Fix**:
+```dockerfile
+# Create cache directory with proper permissions
+RUN mkdir -p /tmp/huggingface_cache && chmod 777 /tmp/huggingface_cache
+ENV HF_HOME=/tmp/huggingface_cache
+ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
+```
+### 2. User ID Issues
+**Problem**: `KeyError: 'getpwuid(): uid not found: 1000'`
+**Solution**: Run container with proper user or ensure user exists in container.
+**Option A - Use root (simplest for HF Spaces)**:
+```dockerfile
+# Already running as root in HF Spaces - this is fine
+# Just ensure cache directories are writable
+```
+**Option B - Create user in Dockerfile**:
+```dockerfile
+RUN useradd -m -u 1000 -s /bin/bash appuser && \
+    mkdir -p /tmp/huggingface_cache && \
+    chown -R appuser:appuser /tmp/huggingface_cache /app
+USER appuser
+```
+**For Hugging Face Spaces**: Spaces typically run as root, so Option A is fine.
+### 3. HuggingFace Token Configuration
+**Problem**: Gated repository access errors
+**Solution**: Set HF_TOKEN in Hugging Face Spaces secrets.
+**Steps**:
+1. Go to your Space → Settings → Repository secrets
+2. Add `HF_TOKEN` with your Hugging Face access token
+3. Token should have read access to gated models
+**Verify Token**:
+```bash
+# Test token access
+curl -H "Authorization: Bearer YOUR_TOKEN" https://huggingface.co/api/models/Qwen/Qwen2.5-7B-Instruct
+```
+### 4. GPU Tensor Device Placement
+**Problem**: `Tensor on device cuda:0 is not on the expected device meta!`
+**Solution**: Use explicit device placement instead of `device_map="auto"` for non-quantized models.
+**Code Fix**: Already implemented in `src/local_model_loader.py` - uses `device_map="auto"` only with quantization, explicit placement otherwise.
+### 5. Model Selection for Testing
+**Current Models**:
+- Primary: `Qwen/Qwen2.5-7B-Instruct` (gated - requires access)
+- Fallback: `microsoft/Phi-3-mini-4k-instruct` (non-gated, verified)
+**For Testing Without Gated Models**:
+Update `src/models_config.py` to use non-gated models:
+```python
+"reasoning_primary": {
+    "model_id": "microsoft/Phi-3-mini-4k-instruct",  # Non-gated
+    ...
+}
+```
+## Recommended Dockerfile Updates
+```dockerfile
+FROM python:3.10-slim
+WORKDIR /app
+# Install system dependencies
+RUN apt-get update && apt-get install -y \
+    gcc \
+    g++ \
+    cmake \
+    libopenblas-dev \
+    libomp-dev \
+    curl \
+    && rm -rf /var/lib/apt/lists/*
+# Create cache directories with proper permissions
+RUN mkdir -p /tmp/huggingface_cache && \
+    chmod 777 /tmp/huggingface_cache && \
+    mkdir -p /tmp/logs && \
+    chmod 777 /tmp/logs
+# Copy requirements file
+COPY requirements.txt .
+# Install Python dependencies
+RUN pip install --no-cache-dir --upgrade pip && \
+    pip install --no-cache-dir -r requirements.txt
+# Copy application code
+COPY . .
+# Set environment variables
+ENV PYTHONUNBUFFERED=1
+ENV PORT=7860
+ENV OMP_NUM_THREADS=4
+ENV MKL_NUM_THREADS=4
+ENV DB_PATH=/tmp/sessions.db
+ENV FAISS_INDEX_PATH=/tmp/embeddings.faiss
+ENV LOG_DIR=/tmp/logs
+ENV HF_HOME=/tmp/huggingface_cache
+ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
+ENV RATE_LIMIT_ENABLED=true
+# Expose port
+EXPOSE 7860
+# Health check
+HEALTHCHECK --interval=30s --timeout=30s --start-period=120s --retries=3 \
+    CMD curl -f http://localhost:7860/api/health || exit 1
+# Run with Gunicorn
+CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "4", "--threads", "2", "--timeout", "120", "--access-logfile", "-", "--error-logfile", "-", "--log-level", "info", "flask_api_standalone:app"]
+```
+## Hugging Face Spaces Configuration
+### Required Secrets:
+1. `HF_TOKEN` - Your Hugging Face access token (for gated models)
+### Environment Variables (Optional):
+- `HF_HOME` - Will auto-detect to `/tmp/huggingface_cache` in Docker
+- `TRANSFORMERS_CACHE` - Will auto-detect to `/tmp/huggingface_cache` in Docker
+### Hardware Requirements:
+- GPU: NVIDIA T4 (16GB VRAM) - ✅ Detected in logs
+- Memory: At least 8GB RAM
+- Disk: 20GB+ for model cache
+## Verification Steps
+1. **Check Cache Directory**:
+   ```bash
+   ls -la /tmp/huggingface_cache
+   # Should show writable directory
+   ```
+2. **Check HF Token**:
+   ```python
+   import os
+   print("HF_TOKEN set:", bool(os.getenv("HF_TOKEN")))
+   ```
+3. **Check GPU**:
+   ```python
+   import torch
+   print("CUDA available:", torch.cuda.is_available())
+   print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")
+   ```
+4. **Test Model Loading**:
+   - Check logs for: `✓ Cache directory verified: /tmp/huggingface_cache`
+   - Check logs for: `✓ HF_TOKEN authenticated for gated model access` (if token set)
+   - Check logs for: `✓ Model loaded successfully`
+## Troubleshooting
+### Issue: Still getting permission errors
+**Fix**: Ensure Dockerfile creates cache directory with 777 permissions
+### Issue: Gated repository errors persist
+**Fix**:
+1. Verify HF_TOKEN is set in Spaces secrets
+2. Visit model page and request access
+3. Wait for approval (usually instant)
+4. Use fallback model (Phi-3-mini) until access granted
+### Issue: Tensor device errors
+**Fix**: Code now handles this - if quantization fails, loads without quantization and uses explicit device placement
+### Issue: Model too large for GPU
+**Fix**:
+- Code automatically falls back to no quantization if bitsandbytes fails
+- Consider using smaller model (Phi-3-mini) for testing
+- Check GPU memory: `nvidia-smi`
+## Quick Start Checklist
+- [ ] HF_TOKEN set in Spaces secrets
+- [ ] Dockerfile creates cache directory with proper permissions
+- [ ] GPU detected (check logs)
+- [ ] Cache directory writable (check logs)
+- [ ] Model access granted (or using non-gated fallback)
+- [ ] No tensor device errors (check logs)
+## Next Steps
+1. Update Dockerfile with cache directory creation
+2. Set HF_TOKEN in Spaces secrets
+3. Request access to gated models (Qwen)
+4. Test with fallback model first (Phi-3-mini)
+5. Monitor logs for successful model loading

Dockerfile CHANGED Viewed

@@ -16,6 +16,13 @@ RUN apt-get update && apt-get install -y \
     curl \
     && rm -rf /var/lib/apt/lists/*
 # Copy requirements file first (for better caching)
 COPY requirements.txt .
@@ -39,6 +46,9 @@ ENV DB_PATH=/tmp/sessions.db
 ENV FAISS_INDEX_PATH=/tmp/embeddings.faiss
 ENV LOG_DIR=/tmp/logs
 ENV RATE_LIMIT_ENABLED=true
 # Health check
 HEALTHCHECK --interval=30s --timeout=30s --start-period=120s --retries=3 \

     curl \
     && rm -rf /var/lib/apt/lists/*
+# Create cache directories with proper permissions
+# Hugging Face Spaces runs as root, so we can use /tmp without permission issues
+RUN mkdir -p /tmp/huggingface_cache && \
+    chmod 777 /tmp/huggingface_cache && \
+    mkdir -p /tmp/logs && \
+    chmod 777 /tmp/logs
 # Copy requirements file first (for better caching)
 COPY requirements.txt .
 ENV FAISS_INDEX_PATH=/tmp/embeddings.faiss
 ENV LOG_DIR=/tmp/logs
 ENV RATE_LIMIT_ENABLED=true
+# Cache directories - will be used by transformers and huggingface_hub
+ENV HF_HOME=/tmp/huggingface_cache
+ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
 # Health check
 HEALTHCHECK --interval=30s --timeout=30s --start-period=120s --retries=3 \

src/local_model_loader.py CHANGED Viewed

@@ -172,10 +172,13 @@ class LocalModelLoader:
             }
             if self.device == "cuda":
                 load_kwargs.update({
-                    "device_map": "auto",  # Automatically uses GPU
                     "torch_dtype": torch.float16,  # Use FP16 for memory efficiency
                 })
             # Try loading with quantization first
             model = None
@@ -188,6 +191,9 @@ class LocalModelLoader:
                     else:
                         load_kwargs["quantization_config"] = quantization_config
                     model = AutoModelForCausalLM.from_pretrained(
                         base_model_id,
                         **load_kwargs
@@ -212,10 +218,15 @@ class LocalModelLoader:
             if model is None:
                 try:
                     if self.device == "cuda":
                         model = AutoModelForCausalLM.from_pretrained(
                             base_model_id,
                             **load_kwargs
                         )
                     else:
                         load_kwargs.update({
                             "torch_dtype": torch.float32,

             }
             if self.device == "cuda":
+                # Use explicit device placement to avoid meta device issues
+                # device_map="auto" works well with quantization, but can cause issues without it
                 load_kwargs.update({
                     "torch_dtype": torch.float16,  # Use FP16 for memory efficiency
                 })
+                # Only use device_map="auto" with quantization, otherwise use explicit placement
+                # This prevents "Tensor on device meta" errors
             # Try loading with quantization first
             model = None
                     else:
                         load_kwargs["quantization_config"] = quantization_config
+                    # With quantization, device_map="auto" works correctly
+                    load_kwargs["device_map"] = "auto"
                     model = AutoModelForCausalLM.from_pretrained(
                         base_model_id,
                         **load_kwargs
             if model is None:
                 try:
                     if self.device == "cuda":
+                        # Without quantization, use explicit device placement to avoid meta device issues
+                        # Don't use device_map="auto" here - it can cause tensor placement errors
                         model = AutoModelForCausalLM.from_pretrained(
                             base_model_id,
                             **load_kwargs
                         )
+                        # Explicitly move to GPU after loading
+                        model = model.to(self.device)
+                        logger.info(f"✓ Model loaded without quantization on {self.device}")
                     else:
                         load_kwargs.update({
                             "torch_dtype": torch.float32,