# Deployment Configuration Guide
## Critical Issues and Solutions
### 1. Cache Directory Permissions
**Problem**: `PermissionError: [Errno 13] Permission denied: '/.cache'`
**Solution**: The code now automatically detects Docker and uses `/tmp/huggingface_cache`. However, ensure the Dockerfile sets proper permissions.
**Dockerfile Fix**:
```dockerfile
# Create cache directory with proper permissions
RUN mkdir -p /tmp/huggingface_cache && chmod 777 /tmp/huggingface_cache
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
```
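If you want to mirror the detection logic outside this repo, it amounts to roughly the following sketch. This is assumed behavior, not the repo's verbatim code; the `/.dockerenv` check and the `SPACE_ID` environment variable (set on Hugging Face Spaces) are illustrative heuristics.
```python
# Sketch: detect a container environment and point the HF cache at a
# writable path. Assumed logic, not the exact code in this repo.
import os

def configure_hf_cache(path: str = "/tmp/huggingface_cache") -> str:
    # /.dockerenv is created by the Docker runtime; SPACE_ID is set on HF Spaces
    in_container = os.path.exists("/.dockerenv") or os.getenv("SPACE_ID") is not None
    if in_container:
        os.makedirs(path, exist_ok=True)
        os.environ.setdefault("HF_HOME", path)
        os.environ.setdefault("TRANSFORMERS_CACHE", path)
    return os.environ.get("HF_HOME", path)
```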
### 2. User ID Issues
**Problem**: `KeyError: 'getpwuid(): uid not found: 1000'`
**Solution**: Run the container as a user that actually exists inside it, or create that user in the Dockerfile.
**Option A - Use root (simplest for HF Spaces)**:
```dockerfile
# Already running as root in HF Spaces - this is fine
# Just ensure cache directories are writable
```
**Option B - Create user in Dockerfile**:
```dockerfile
RUN useradd -m -u 1000 -s /bin/bash appuser && \
    mkdir -p /tmp/huggingface_cache && \
    chown -R appuser:appuser /tmp/huggingface_cache /app
USER appuser
```
**For Hugging Face Spaces**: if your Space runs as root, Option A is fine. Note, however, that the `getpwuid(): uid not found: 1000` error above indicates the container was running as UID 1000 without a passwd entry, in which case Option B (creating a UID-1000 user) is the reliable fix.
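If application code still trips over the missing passwd entry, a defensive pattern (a sketch, not code from this repo) is to set `HOME` to a writable path before any library tries to resolve it:
```python
# Sketch: guard against getpwuid() failing for a UID with no passwd entry.
import os
import pwd

try:
    pwd.getpwuid(os.getuid())
except KeyError:
    # No passwd entry for this UID: point HOME at a writable directory so
    # expanduser("~") and cache resolution do not crash.
    os.environ.setdefault("HOME", "/tmp")
```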
### 3. HuggingFace Token Configuration
**Problem**: Gated repository access errors
**Solution**: Set HF_TOKEN in Hugging Face Spaces secrets.
**Steps**:
1. Go to your Space → Settings → Repository secrets
2. Add `HF_TOKEN` with your Hugging Face access token
3. Token should have read access to gated models
**Verify Token**:
```bash
# Test token access
curl -H "Authorization: Bearer YOUR_TOKEN" https://huggingface.co/api/models/Qwen/Qwen2.5-7B-Instruct
```
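The same check can be done from inside the container with `huggingface_hub` (a sketch assuming a recent `huggingface_hub`; the model ID is the gated primary from this guide):
```python
# Sketch: verify the token and gated-repo access programmatically.
import os
from huggingface_hub import HfApi

api = HfApi(token=os.getenv("HF_TOKEN"))
print("Authenticated as:", api.whoami()["name"])  # fails fast on a bad token
# Raises a gated-repo/HTTP error if access has not been granted yet.
print("Model reachable:", api.model_info("Qwen/Qwen2.5-7B-Instruct").id)
```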
### 4. GPU Tensor Device Placement
**Problem**: `Tensor on device cuda:0 is not on the expected device meta!`
**Solution**: Use explicit device placement instead of `device_map="auto"` for non-quantized models.
**Code Fix**: Already implemented in `src/local_model_loader.py` - uses `device_map="auto"` only with quantization, explicit placement otherwise.
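The pattern is roughly the following sketch (illustrative, not the verbatim contents of `src/local_model_loader.py`):
```python
# Sketch: device_map="auto" only with quantization; otherwise load fully,
# then move the model explicitly so no weights stay on the "meta" device.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_model(model_id: str, quantize: bool = True):
    if quantize:
        # bitsandbytes shards the model itself, so auto dispatch is safe
        bnb = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
        )
        return AutoModelForCausalLM.from_pretrained(
            model_id, quantization_config=bnb, device_map="auto")
    # Without quantization, avoid device_map="auto": load, then move.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16)
    return model.to("cuda" if torch.cuda.is_available() else "cpu")
```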
### 5. Model Selection for Testing
**Current Models**:
- Primary: `Qwen/Qwen2.5-7B-Instruct` (gated - requires access)
- Fallback: `microsoft/Phi-3-mini-4k-instruct` (non-gated, verified)
**For Testing Without Gated Models**:
Update `src/models_config.py` to use non-gated models:
```python
"reasoning_primary": {
"model_id": "microsoft/Phi-3-mini-4k-instruct", # Non-gated
...
}
```
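A simple fallback loader built on the two models above might look like the following sketch (the helper name is hypothetical; `from_pretrained` surfaces gated or unreachable repos as `OSError`):
```python
# Sketch: try the gated primary, fall back to the non-gated model.
from transformers import AutoModelForCausalLM

PRIMARY = "Qwen/Qwen2.5-7B-Instruct"           # gated
FALLBACK = "microsoft/Phi-3-mini-4k-instruct"  # non-gated

def load_with_fallback():
    for model_id in (PRIMARY, FALLBACK):
        try:
            return AutoModelForCausalLM.from_pretrained(model_id)
        except OSError as err:  # gated or unreachable repo
            print(f"Could not load {model_id}: {err}")
    raise RuntimeError("No model could be loaded")
```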
## Recommended Dockerfile Updates
```dockerfile
FROM python:3.10-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    cmake \
    libopenblas-dev \
    libomp-dev \
    curl \
    && rm -rf /var/lib/apt/lists/*
# Create cache directories with proper permissions
RUN mkdir -p /tmp/huggingface_cache && \
    chmod 777 /tmp/huggingface_cache && \
    mkdir -p /tmp/logs && \
    chmod 777 /tmp/logs
# Copy requirements file
COPY requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV PORT=7860
ENV OMP_NUM_THREADS=4
ENV MKL_NUM_THREADS=4
ENV DB_PATH=/tmp/sessions.db
ENV FAISS_INDEX_PATH=/tmp/embeddings.faiss
ENV LOG_DIR=/tmp/logs
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
ENV RATE_LIMIT_ENABLED=true
# Expose port
EXPOSE 7860
# Health check
HEALTHCHECK --interval=30s --timeout=30s --start-period=120s --retries=3 \
    CMD curl -f http://localhost:7860/api/health || exit 1
# Run with Gunicorn
CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "4", "--threads", "2", "--timeout", "120", "--access-logfile", "-", "--error-logfile", "-", "--log-level", "info", "flask_api_standalone:app"]
```
## Hugging Face Spaces Configuration
### Required Secrets:
1. `HF_TOKEN` - Your Hugging Face access token (for gated models)
### Environment Variables (Optional):
- `HF_HOME` - set automatically to `/tmp/huggingface_cache` when Docker is detected
- `TRANSFORMERS_CACHE` - set automatically to `/tmp/huggingface_cache` when Docker is detected
### Hardware Requirements:
- GPU: NVIDIA T4 (16GB VRAM) - ✅ Detected in logs
- Memory: At least 8GB RAM
- Disk: 20GB+ for model cache
## Verification Steps
1. **Check Cache Directory**:
```bash
ls -la /tmp/huggingface_cache
# Should show writable directory
```
2. **Check HF Token**:
```python
import os
print("HF_TOKEN set:", bool(os.getenv("HF_TOKEN")))
```
3. **Check GPU**:
```python
import torch
print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")
```
4. **Test Model Loading**:
- Check logs for: `✓ Cache directory verified: /tmp/huggingface_cache`
- Check logs for: `✓ HF_TOKEN authenticated for gated model access` (if token set)
- Check logs for: `✓ Model loaded successfully`
## Troubleshooting
### Issue: Still getting permission errors
**Fix**: Ensure the Dockerfile creates the cache directory with 777 permissions (see the Dockerfile fix above)
### Issue: Gated repository errors persist
**Fix**:
1. Verify HF_TOKEN is set in Spaces secrets
2. Visit the model page and request access
3. Wait for approval (usually instant)
4. Use the fallback model (Phi-3-mini) until access is granted
### Issue: Tensor device errors
**Fix**: The loader now handles this: if quantized loading fails, it reloads without quantization and places the model on the device explicitly (see the sketch in section 4)
### Issue: Model too large for GPU
**Fix**:
- The code automatically falls back to unquantized loading if bitsandbytes fails
- Consider using a smaller model (Phi-3-mini) for testing
- Check GPU memory with `nvidia-smi`, or from Python as in the sketch below
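From Python, free memory can be checked with `torch.cuda.mem_get_info` (a sketch; the ~14 GB figure is simply 7B parameters × 2 bytes for fp16 weights):
```python
# Sketch: report free GPU memory before attempting a large model load.
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(f"GPU memory: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
    # A 7B model needs ~14 GB in fp16 for the weights alone; prefer the
    # smaller fallback model when free memory is close to that.
else:
    print("No CUDA device available")
```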
## Quick Start Checklist
- [ ] HF_TOKEN set in Spaces secrets
- [ ] Dockerfile creates cache directory with proper permissions
- [ ] GPU detected (check logs)
- [ ] Cache directory writable (check logs)
- [ ] Model access granted (or using non-gated fallback)
- [ ] No tensor device errors (check logs)
## Next Steps
1. Update Dockerfile with cache directory creation
2. Set HF_TOKEN in Spaces secrets
3. Request access to gated models (Qwen)
4. Test with fallback model first (Phi-3-mini)
5. Monitor logs for successful model loading