
Deployment Configuration Guide

Critical Issues and Solutions

1. Cache Directory Permissions

Problem: PermissionError: [Errno 13] Permission denied: '/.cache'

Solution: The code now detects when it is running inside Docker and falls back to /tmp/huggingface_cache. The Dockerfile must still create that directory with the correct permissions.

Dockerfile Fix:

# Create cache directory with proper permissions
RUN mkdir -p /tmp/huggingface_cache && chmod 777 /tmp/huggingface_cache
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
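
For reference, the runtime fallback described above looks roughly like the following. This is a minimal sketch assuming a helper along these lines; the function name is hypothetical and the real logic lives in the application code:

import os

def resolve_hf_cache_dir() -> str:
    # Hypothetical helper sketching the fallback described above:
    # prefer an explicitly configured cache directory, otherwise fall
    # back to /tmp/huggingface_cache, which the Dockerfile makes writable.
    candidates = (
        os.environ.get("HF_HOME"),
        os.environ.get("TRANSFORMERS_CACHE"),
        "/tmp/huggingface_cache",
    )
    for candidate in candidates:
        if not candidate:
            continue
        try:
            os.makedirs(candidate, exist_ok=True)
        except OSError:
            continue
        if os.access(candidate, os.W_OK):
            os.environ["HF_HOME"] = candidate
            os.environ["TRANSFORMERS_CACHE"] = candidate
            return candidate
    raise RuntimeError("No writable Hugging Face cache directory found")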

2. User ID Issues

Problem: KeyError: 'getpwuid(): uid not found: 1000'

Solution: Run the container as a valid user, or make sure the user ID exists inside the container.

Option A - Use root (simplest for HF Spaces):

# Already running as root in HF Spaces - this is fine
# Just ensure cache directories are writable

Option B - Create user in Dockerfile:

RUN useradd -m -u 1000 -s /bin/bash appuser && \
    mkdir -p /tmp/huggingface_cache && \
    chown -R appuser:appuser /tmp/huggingface_cache /app
USER appuser

For Hugging Face Spaces: Spaces typically run as root, so Option A is fine.
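
If neither option is possible, a runtime workaround (an assumption, not necessarily what the project does) is to point HOME and USER at safe values before any library tries to look up the current user:

import os

# Workaround sketch: libraries resolve "~" and the username via the passwd
# database when HOME/USER are unset, which raises KeyError for an unknown UID.
# Setting them to writable, known values before those lookups avoids the error.
os.environ.setdefault("HOME", "/tmp")
os.environ.setdefault("USER", "appuser")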

3. HuggingFace Token Configuration

Problem: Gated repository access errors

Solution: Set HF_TOKEN in Hugging Face Spaces secrets.

Steps:

  1. Go to your Space → Settings → Repository secrets
  2. Add HF_TOKEN with your Hugging Face access token
  3. Token should have read access to gated models

Verify Token:

# Test token access
curl -H "Authorization: Bearer YOUR_TOKEN" https://huggingface.co/api/models/Qwen/Qwen2.5-7B-Instruct
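
The same check can be done from Python with huggingface_hub (installed alongside transformers); a small sketch assuming the token is exposed as the HF_TOKEN environment variable:

import os
from huggingface_hub import whoami

# whoami() raises if the token is missing or invalid,
# and returns account information otherwise.
print(whoami(token=os.environ.get("HF_TOKEN")))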

4. GPU Tensor Device Placement

Problem: Tensor on device cuda:0 is not on the expected device meta!

Solution: Use explicit device placement instead of device_map="auto" for non-quantized models.

Code Fix: Already implemented in src/local_model_loader.py: device_map="auto" is used only with quantization; otherwise the model is placed on the target device explicitly.
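
A simplified sketch of that pattern (model ID, dtype, and the quantization flag are illustrative; the actual logic is in src/local_model_loader.py):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # illustrative; see src/models_config.py
device = "cuda" if torch.cuda.is_available() else "cpu"
use_quantization = device == "cuda"  # illustrative; the loader decides this at runtime

if use_quantization:
    # Quantized path: let accelerate place the weights via device_map="auto".
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map="auto",
    )
else:
    # Non-quantized path: load normally, then move the whole model explicitly.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    )
    model.to(device)

tokenizer = AutoTokenizer.from_pretrained(model_id)

Keeping the two paths strictly separate avoids mixing accelerate's dispatch (which may leave some weights on the meta device) with manual .to() calls.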

5. Model Selection for Testing

Current Models:

  • Primary: Qwen/Qwen2.5-7B-Instruct (gated - requires access)
  • Fallback: microsoft/Phi-3-mini-4k-instruct (non-gated, verified)

For Testing Without Gated Models: Update src/models_config.py to use non-gated models:

"reasoning_primary": {
    "model_id": "microsoft/Phi-3-mini-4k-instruct",  # Non-gated
    ...
}

Recommended Dockerfile Updates

FROM python:3.10-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    cmake \
    libopenblas-dev \
    libomp-dev \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Create cache directories with proper permissions
RUN mkdir -p /tmp/huggingface_cache && \
    chmod 777 /tmp/huggingface_cache && \
    mkdir -p /tmp/logs && \
    chmod 777 /tmp/logs

# Copy requirements file
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV PORT=7860
ENV OMP_NUM_THREADS=4
ENV MKL_NUM_THREADS=4
ENV DB_PATH=/tmp/sessions.db
ENV FAISS_INDEX_PATH=/tmp/embeddings.faiss
ENV LOG_DIR=/tmp/logs
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
ENV RATE_LIMIT_ENABLED=true

# Expose port
EXPOSE 7860

# Health check
HEALTHCHECK --interval=30s --timeout=30s --start-period=120s --retries=3 \
    CMD curl -f http://localhost:7860/api/health || exit 1

# Run with Gunicorn
CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "4", "--threads", "2", "--timeout", "120", "--access-logfile", "-", "--error-logfile", "-", "--log-level", "info", "flask_api_standalone:app"]

Hugging Face Spaces Configuration

Required Secrets:

  1. HF_TOKEN - Your Hugging Face access token (for gated models)

Environment Variables (Optional):

  • HF_HOME - set automatically to /tmp/huggingface_cache when Docker is detected
  • TRANSFORMERS_CACHE - set automatically to /tmp/huggingface_cache when Docker is detected

Hardware Requirements:

  • GPU: NVIDIA T4 (16GB VRAM) - ✅ Detected in logs
  • Memory: At least 8GB RAM
  • Disk: 20GB+ for model cache

Verification Steps

  1. Check Cache Directory:

    ls -la /tmp/huggingface_cache
    # Should show writable directory
    
  2. Check HF Token:

    import os
    print("HF_TOKEN set:", bool(os.getenv("HF_TOKEN")))
    
  3. Check GPU:

    import torch
    print("CUDA available:", torch.cuda.is_available())
    print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")
    
  4. Test Model Loading:

    • Check logs for: ✓ Cache directory verified: /tmp/huggingface_cache
    • Check logs for: ✓ HF_TOKEN authenticated for gated model access (if token set)
    • Check logs for: ✓ Model loaded successfully

Troubleshooting

Issue: Still getting permission errors

Fix: Ensure the Dockerfile creates the cache directory with 777 permissions (see the Dockerfile fix in section 1)

Issue: Gated repository errors persist

Fix:

  1. Verify HF_TOKEN is set in Spaces secrets
  2. Visit model page and request access
  3. Wait for approval (usually instant)
  4. Use fallback model (Phi-3-mini) until access granted

Issue: Tensor device errors

Fix: The code now handles this: if quantized loading fails, the model is reloaded without quantization and placed on the target device explicitly (see section 4)

Issue: Model too large for GPU

Fix:

  • The code automatically falls back to loading without quantization if bitsandbytes fails
  • Consider using a smaller model (Phi-3-mini) for testing
  • Check GPU memory with nvidia-smi, or from Python as in the sketch below this list
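
GPU memory can also be checked from Python inside the running container; a small sketch using torch:

import torch

if torch.cuda.is_available():
    # Free and total memory (in bytes) for the current CUDA device.
    free, total = torch.cuda.mem_get_info()
    print(f"GPU memory: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
else:
    print("No CUDA device available")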

Quick Start Checklist

  • HF_TOKEN set in Spaces secrets
  • Dockerfile creates cache directory with proper permissions
  • GPU detected (check logs)
  • Cache directory writable (check logs)
  • Model access granted (or using non-gated fallback)
  • No tensor device errors (check logs)

Next Steps

  1. Update Dockerfile with cache directory creation
  2. Set HF_TOKEN in Spaces secrets
  3. Request access to gated models (Qwen)
  4. Test with fallback model first (Phi-3-mini)
  5. Monitor logs for successful model loading