Deployment Configuration Guide
Critical Issues and Solutions
1. Cache Directory Permissions
Problem: PermissionError: [Errno 13] Permission denied: '/.cache'
Solution: The code now automatically detects Docker and uses /tmp/huggingface_cache. However, ensure the Dockerfile sets proper permissions.
Dockerfile Fix:
# Create cache directory with proper permissions
RUN mkdir -p /tmp/huggingface_cache && chmod 777 /tmp/huggingface_cache
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
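The auto-detection lives in the application code; a rough Python sketch of the idea (the resolve_hf_cache helper is hypothetical, and the /.dockerenv and SPACE_ID checks are common Docker/Spaces conventions, not necessarily the project's exact test):
import os
from pathlib import Path
def resolve_hf_cache() -> str:
    # Respect an HF_HOME the operator already set.
    configured = os.getenv("HF_HOME")
    if configured:
        return configured
    # /.dockerenv is created by Docker; SPACE_ID is set on Hugging Face Spaces.
    in_docker = Path("/.dockerenv").exists() or os.getenv("SPACE_ID") is not None
    cache_dir = "/tmp/huggingface_cache" if in_docker else str(Path.home() / ".cache" / "huggingface")
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["HF_HOME"] = cache_dir
    os.environ["TRANSFORMERS_CACHE"] = cache_dir
    return cache_dir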
2. User ID Issues
Problem: KeyError: 'getpwuid(): uid not found: 1000'
Solution: Run container with proper user or ensure user exists in container.
Option A - Use root (simplest for HF Spaces):
# Already running as root in HF Spaces - this is fine
# Just ensure cache directories are writable
Option B - Create user in Dockerfile:
RUN useradd -m -u 1000 -s /bin/bash appuser && \
mkdir -p /tmp/huggingface_cache && \
chown -R appuser:appuser /tmp/huggingface_cache /app
USER appuser
For Hugging Face Spaces: Spaces typically run as root, so Option A is fine.
3. HuggingFace Token Configuration
Problem: Gated repository access errors
Solution: Set HF_TOKEN in Hugging Face Spaces secrets.
Steps:
- Go to your Space → Settings → Repository secrets
- Add HF_TOKEN with your Hugging Face access token
- The token should have read access to gated models
Verify Token:
# Test token access
curl -H "Authorization: Bearer YOUR_TOKEN" https://huggingface.co/api/models/Qwen/Qwen2.5-7B-Instruct
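If huggingface_hub is installed, the token can also be checked from Python; this is a convenience snippet, not project code:
import os
from huggingface_hub import whoami
# Raises if the token is missing or invalid; otherwise prints the account it belongs to.
info = whoami(token=os.getenv("HF_TOKEN"))
print("Token belongs to:", info["name"])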
4. GPU Tensor Device Placement
Problem: Tensor on device cuda:0 is not on the expected device meta!
Solution: Use explicit device placement instead of device_map="auto" for non-quantized models.
Code Fix: Already implemented in src/local_model_loader.py - uses device_map="auto" only with quantization, explicit placement otherwise.
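A minimal sketch of that pattern (illustrative only; the real logic is in src/local_model_loader.py, and the load_model name and arguments here are assumptions):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
def load_model(model_id: str, quantize: bool):
    if quantize:
        # Quantized weights rely on accelerate's dispatch, so device_map="auto" is appropriate here.
        quant_cfg = BitsAndBytesConfig(load_in_4bit=True)
        return AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant_cfg, device_map="auto")
    # Non-quantized path: load normally, then move to one explicit device.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    return model.to(device)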
5. Model Selection for Testing
Current Models:
- Primary: Qwen/Qwen2.5-7B-Instruct (gated, requires access)
- Fallback: microsoft/Phi-3-mini-4k-instruct (non-gated, verified)
For Testing Without Gated Models:
Update src/models_config.py to use non-gated models:
"reasoning_primary": {
"model_id": "microsoft/Phi-3-mini-4k-instruct", # Non-gated
...
}
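A hedged sketch of how a primary-then-fallback strategy can be wired up (model IDs are from this guide; the function and error handling are illustrative, not the actual src/models_config.py contents):
from transformers import AutoModelForCausalLM
PRIMARY = "Qwen/Qwen2.5-7B-Instruct"           # gated - needs HF_TOKEN and approved access
FALLBACK = "microsoft/Phi-3-mini-4k-instruct"  # non-gated
def load_with_fallback():
    for model_id in (PRIMARY, FALLBACK):
        try:
            return model_id, AutoModelForCausalLM.from_pretrained(model_id)
        except OSError as exc:  # gated-repo and auth failures typically surface as OSError
            print(f"Could not load {model_id}: {exc}")
    raise RuntimeError("No model could be loaded")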
Recommended Dockerfile Updates
FROM python:3.10-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
gcc \
g++ \
cmake \
libopenblas-dev \
libomp-dev \
curl \
&& rm -rf /var/lib/apt/lists/*
# Create cache directories with proper permissions
RUN mkdir -p /tmp/huggingface_cache && \
chmod 777 /tmp/huggingface_cache && \
mkdir -p /tmp/logs && \
chmod 777 /tmp/logs
# Copy requirements file
COPY requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV PORT=7860
ENV OMP_NUM_THREADS=4
ENV MKL_NUM_THREADS=4
ENV DB_PATH=/tmp/sessions.db
ENV FAISS_INDEX_PATH=/tmp/embeddings.faiss
ENV LOG_DIR=/tmp/logs
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
ENV RATE_LIMIT_ENABLED=true
# Expose port
EXPOSE 7860
# Health check
HEALTHCHECK --interval=30s --timeout=30s --start-period=120s --retries=3 \
CMD curl -f http://localhost:7860/api/health || exit 1
# Run with Gunicorn
CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "4", "--threads", "2", "--timeout", "120", "--access-logfile", "-", "--error-logfile", "-", "--log-level", "info", "flask_api_standalone:app"]
Hugging Face Spaces Configuration
Required Secrets:
- HF_TOKEN: your Hugging Face access token (for gated models)
Environment Variables (Optional):
- HF_HOME: auto-detects to /tmp/huggingface_cache in Docker
- TRANSFORMERS_CACHE: auto-detects to /tmp/huggingface_cache in Docker
Hardware Requirements:
- GPU: NVIDIA T4 (16GB VRAM) - ✅ Detected in logs
- Memory: At least 8GB RAM
- Disk: 20GB+ for model cache
Verification Steps
Check Cache Directory:
ls -la /tmp/huggingface_cache  # Should show a writable directory
Check HF Token:
import os
print("HF_TOKEN set:", bool(os.getenv("HF_TOKEN")))
Check GPU:
import torch
print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")
Test Model Loading:
- Check logs for: ✓ Cache directory verified: /tmp/huggingface_cache
- Check logs for: ✓ HF_TOKEN authenticated for gated model access (if token set)
- Check logs for: ✓ Model loaded successfully
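For convenience, the checks above can be combined into one short script (illustrative, not shipped with the project):
import os
import torch
cache = os.getenv("HF_HOME", "/tmp/huggingface_cache")
print("Cache dir:", cache, "writable:", os.access(cache, os.W_OK))
print("HF_TOKEN set:", bool(os.getenv("HF_TOKEN")))
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))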
Troubleshooting
Issue: Still getting permission errors
Fix: Ensure Dockerfile creates cache directory with 777 permissions
Issue: Gated repository errors persist
Fix:
- Verify HF_TOKEN is set in Spaces secrets
- Visit model page and request access
- Wait for approval (usually instant)
- Use fallback model (Phi-3-mini) until access granted
Issue: Tensor device errors
Fix: The code now handles this: if quantized loading fails, it retries without quantization and uses explicit device placement
Issue: Model too large for GPU
Fix:
- Code automatically falls back to no quantization if bitsandbytes fails
- Consider using smaller model (Phi-3-mini) for testing
- Check GPU memory:
nvidia-smi
Quick Start Checklist
- HF_TOKEN set in Spaces secrets
- Dockerfile creates cache directory with proper permissions
- GPU detected (check logs)
- Cache directory writable (check logs)
- Model access granted (or using non-gated fallback)
- No tensor device errors (check logs)
Next Steps
- Update Dockerfile with cache directory creation
- Set HF_TOKEN in Spaces secrets
- Request access to gated models (Qwen)
- Test with fallback model first (Phi-3-mini)
- Monitor logs for successful model loading