
Deployment Configuration Guide

Critical Issues and Solutions

1. Cache Directory Permissions

Problem: PermissionError: [Errno 13] Permission denied: '/.cache'

Solution: The code now detects when it is running inside Docker and falls back to /tmp/huggingface_cache. The Dockerfile must still create that directory with the correct permissions.

Dockerfile Fix:

# Create cache directory with proper permissions
RUN mkdir -p /tmp/huggingface_cache && chmod 777 /tmp/huggingface_cache
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
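
For reference, the runtime fallback described above looks roughly like the following. This is a minimal sketch assuming a helper along these lines; the function name is hypothetical and the real logic lives in the application code:

import os

def resolve_hf_cache_dir() -> str:
    # Hypothetical helper sketching the fallback described above:
    # prefer an explicitly configured cache directory, otherwise fall
    # back to /tmp/huggingface_cache, which the Dockerfile makes writable.
    candidates = (
        os.environ.get("HF_HOME"),
        os.environ.get("TRANSFORMERS_CACHE"),
        "/tmp/huggingface_cache",
    )
    for candidate in candidates:
        if not candidate:
            continue
        try:
            os.makedirs(candidate, exist_ok=True)
        except OSError:
            continue
        if os.access(candidate, os.W_OK):
            os.environ["HF_HOME"] = candidate
            os.environ["TRANSFORMERS_CACHE"] = candidate
            return candidate
    raise RuntimeError("No writable Hugging Face cache directory found")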

2. User ID Issues

Problem: KeyError: 'getpwuid(): uid not found: 1000'

Solution: Run the container as a valid user, or make sure the user ID exists inside the container.

Option A - Use root (simplest for HF Spaces):

# Already running as root in HF Spaces - this is fine
# Just ensure cache directories are writable

Option B - Create user in Dockerfile:

RUN useradd -m -u 1000 -s /bin/bash appuser && \
    mkdir -p /tmp/huggingface_cache && \
    chown -R appuser:appuser /tmp/huggingface_cache /app
USER appuser

For Hugging Face Spaces: Spaces typically run as root, so Option A is fine.
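
If neither option is possible, a runtime workaround (an assumption, not necessarily what the project does) is to point HOME and USER at safe values before any library tries to look up the current user:

import os

# Workaround sketch: libraries resolve "~" and the username via the passwd
# database when HOME/USER are unset, which raises KeyError for an unknown UID.
# Setting them to writable, known values before those lookups avoids the error.
os.environ.setdefault("HOME", "/tmp")
os.environ.setdefault("USER", "appuser")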

3. HuggingFace Token Configuration

Problem: Gated repository access errors

Solution: Set HF_TOKEN in Hugging Face Spaces secrets.

Steps:

  1. Go to your Space → Settings → Repository secrets
  2. Add HF_TOKEN with your Hugging Face access token
  3. Token should have read access to gated models

Verify Token:

# Test token access
curl -H "Authorization: Bearer YOUR_TOKEN" https://huggingface.co/api/models/Qwen/Qwen2.5-7B-Instruct
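
The same check can be done from Python with huggingface_hub (installed alongside transformers); a small sketch assuming the token is exposed as the HF_TOKEN environment variable:

import os
from huggingface_hub import whoami

# whoami() raises if the token is missing or invalid,
# and returns account information otherwise.
print(whoami(token=os.environ.get("HF_TOKEN")))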

4. GPU Tensor Device Placement

Problem: Tensor on device cuda:0 is not on the expected device meta!

Solution: Use explicit device placement instead of device_map="auto" for non-quantized models.

Code Fix: Already implemented in src/local_model_loader.py: device_map="auto" is used only with quantization; otherwise the model is placed on the target device explicitly.
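
A simplified sketch of that pattern (model ID, dtype, and the quantization flag are illustrative; the actual logic is in src/local_model_loader.py):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # illustrative; see src/models_config.py
device = "cuda" if torch.cuda.is_available() else "cpu"
use_quantization = device == "cuda"  # illustrative; the loader decides this at runtime

if use_quantization:
    # Quantized path: let accelerate place the weights via device_map="auto".
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map="auto",
    )
else:
    # Non-quantized path: load normally, then move the whole model explicitly.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    )
    model.to(device)

tokenizer = AutoTokenizer.from_pretrained(model_id)

Keeping the two paths strictly separate avoids mixing accelerate's dispatch (which may leave some weights on the meta device) with manual .to() calls.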

5. Model Selection for Testing

Current Models:

  • Primary: Qwen/Qwen2.5-7B-Instruct (gated - requires access)
  • Fallback: microsoft/Phi-3-mini-4k-instruct (non-gated, verified)

For Testing Without Gated Models: Update src/models_config.py to use non-gated models:

"reasoning_primary": {
    "model_id": "microsoft/Phi-3-mini-4k-instruct",  # Non-gated
    ...
}

Recommended Dockerfile Updates

FROM python:3.10-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    cmake \
    libopenblas-dev \
    libomp-dev \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create cache directories with proper permissions
RUN mkdir -p /tmp/huggingface_cache && \
    chmod 777 /tmp/huggingface_cache && \
    mkdir -p /tmp/logs && \
    chmod 777 /tmp/logs

# Copy requirements file
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV PORT=7860
ENV OMP_NUM_THREADS=4
ENV MKL_NUM_THREADS=4
ENV DB_PATH=/tmp/sessions.db
ENV FAISS_INDEX_PATH=/tmp/embeddings.faiss
ENV LOG_DIR=/tmp/logs
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
ENV RATE_LIMIT_ENABLED=true

# Expose port
EXPOSE 7860

# Health check
HEALTHCHECK --interval=30s --timeout=30s --start-period=120s --retries=3 \
    CMD curl -f http://localhost:7860/api/health || exit 1

# Run with Gunicorn
CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "4", "--threads", "2", "--timeout", "120", "--access-logfile", "-", "--error-logfile", "-", "--log-level", "info", "flask_api_standalone:app"]

Hugging Face Spaces Configuration

Required Secrets:

  1. HF_TOKEN - Your Hugging Face access token (for gated models)

Environment Variables (Optional):

  • HF_HOME - set automatically to /tmp/huggingface_cache when Docker is detected
  • TRANSFORMERS_CACHE - set automatically to /tmp/huggingface_cache when Docker is detected

Hardware Requirements:

  • GPU: NVIDIA T4 (16GB VRAM) - ✅ Detected in logs
  • Memory: At least 8GB RAM
  • Disk: 20GB+ for model cache

Verification Steps

  1. Check Cache Directory:

    ls -la /tmp/huggingface_cache
    # Should show writable directory
    
  2. Check HF Token:

    import os
    print("HF_TOKEN set:", bool(os.getenv("HF_TOKEN")))
    
  3. Check GPU:

    import torch
    print("CUDA available:", torch.cuda.is_available())
    print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")
    
  4. Test Model Loading:

    • Check logs for: ✓ Cache directory verified: /tmp/huggingface_cache
    • Check logs for: ✓ HF_TOKEN authenticated for gated model access (if token set)
    • Check logs for: ✓ Model loaded successfully

Troubleshooting

Issue: Still getting permission errors

Fix: Ensure the Dockerfile creates the cache directory with 777 permissions (see the Dockerfile fix in section 1)

Issue: Gated repository errors persist

Fix:

  1. Verify HF_TOKEN is set in Spaces secrets
  2. Visit model page and request access
  3. Wait for approval (usually instant)
  4. Use fallback model (Phi-3-mini) until access granted

Issue: Tensor device errors

Fix: The code now handles this: if quantized loading fails, the model is reloaded without quantization and placed on the target device explicitly (see section 4)

Issue: Model too large for GPU

Fix:

  • The code automatically falls back to loading without quantization if bitsandbytes fails
  • Consider using a smaller model (Phi-3-mini) for testing
  • Check GPU memory with nvidia-smi, or from Python as in the sketch below this list
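
GPU memory can also be checked from Python inside the running container; a small sketch using torch:

import torch

if torch.cuda.is_available():
    # Free and total memory (in bytes) for the current CUDA device.
    free, total = torch.cuda.mem_get_info()
    print(f"GPU memory: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
else:
    print("No CUDA device available")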

Quick Start Checklist

  • HF_TOKEN set in Spaces secrets
  • Dockerfile creates cache directory with proper permissions
  • GPU detected (check logs)
  • Cache directory writable (check logs)
  • Model access granted (or using non-gated fallback)
  • No tensor device errors (check logs)

Next Steps

  1. Update Dockerfile with cache directory creation
  2. Set HF_TOKEN in Spaces secrets
  3. Request access to gated models (Qwen)
  4. Test with fallback model first (Phi-3-mini)
  5. Monitor logs for successful model loading