# Deployment Configuration Guide
## Critical Issues and Solutions
### 1. Cache Directory Permissions
**Problem**: `PermissionError: [Errno 13] Permission denied: '/.cache'`
**Solution**: The code now automatically detects Docker and uses `/tmp/huggingface_cache`. However, ensure the Dockerfile sets proper permissions.
**Dockerfile Fix**:
```dockerfile
# Create cache directory with proper permissions
RUN mkdir -p /tmp/huggingface_cache && chmod 777 /tmp/huggingface_cache
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
```
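If you want to mirror the detection logic outside this repo, it amounts to roughly the following sketch. This is assumed behavior, not the repo's verbatim code; the `/.dockerenv` check and the `SPACE_ID` environment variable (set on Hugging Face Spaces) are illustrative heuristics.
```python
# Sketch: detect a container environment and point the HF cache at a
# writable path. Assumed logic, not the exact code in this repo.
import os

def configure_hf_cache(path: str = "/tmp/huggingface_cache") -> str:
    # /.dockerenv is created by the Docker runtime; SPACE_ID is set on HF Spaces
    in_container = os.path.exists("/.dockerenv") or os.getenv("SPACE_ID") is not None
    if in_container:
        os.makedirs(path, exist_ok=True)
        os.environ.setdefault("HF_HOME", path)
        os.environ.setdefault("TRANSFORMERS_CACHE", path)
    return os.environ.get("HF_HOME", path)
```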
### 2. User ID Issues
**Problem**: `KeyError: 'getpwuid(): uid not found: 1000'`
**Solution**: Run the container as a user that actually exists inside it, or create that user in the Dockerfile.
**Option A - Use root (simplest for HF Spaces)**:
```dockerfile
# Already running as root in HF Spaces - this is fine
# Just ensure cache directories are writable
```
**Option B - Create user in Dockerfile**:
```dockerfile
RUN useradd -m -u 1000 -s /bin/bash appuser && \
    mkdir -p /tmp/huggingface_cache && \
    chown -R appuser:appuser /tmp/huggingface_cache /app
USER appuser
```
**For Hugging Face Spaces**: if your Space runs as root, Option A is fine. Note, however, that the `getpwuid(): uid not found: 1000` error above indicates the container was running as UID 1000 without a passwd entry, in which case Option B (creating a UID-1000 user) is the reliable fix.
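If application code still trips over the missing passwd entry, a defensive pattern (a sketch, not code from this repo) is to set `HOME` to a writable path before any library tries to resolve it:
```python
# Sketch: guard against getpwuid() failing for a UID with no passwd entry.
import os
import pwd

try:
    pwd.getpwuid(os.getuid())
except KeyError:
    # No passwd entry for this UID: point HOME at a writable directory so
    # expanduser("~") and cache resolution do not crash.
    os.environ.setdefault("HOME", "/tmp")
```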
### 3. HuggingFace Token Configuration
**Problem**: Gated repository access errors
**Solution**: Set HF_TOKEN in Hugging Face Spaces secrets.
**Steps**:
1. Go to your Space → Settings → Repository secrets
2. Add `HF_TOKEN` with your Hugging Face access token
3. Token should have read access to gated models
**Verify Token**:
```bash
# Test token access
curl -H "Authorization: Bearer YOUR_TOKEN" https://huggingface.co/api/models/Qwen/Qwen2.5-7B-Instruct
```
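The same check can be done from inside the container with `huggingface_hub` (a sketch assuming a recent `huggingface_hub`; the model ID is the gated primary from this guide):
```python
# Sketch: verify the token and gated-repo access programmatically.
import os
from huggingface_hub import HfApi

api = HfApi(token=os.getenv("HF_TOKEN"))
print("Authenticated as:", api.whoami()["name"])  # fails fast on a bad token
# Raises a gated-repo/HTTP error if access has not been granted yet.
print("Model reachable:", api.model_info("Qwen/Qwen2.5-7B-Instruct").id)
```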
### 4. GPU Tensor Device Placement
**Problem**: `Tensor on device cuda:0 is not on the expected device meta!`
**Solution**: Use explicit device placement instead of `device_map="auto"` for non-quantized models.
**Code Fix**: Already implemented in `src/local_model_loader.py` - uses `device_map="auto"` only with quantization, explicit placement otherwise.
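The pattern is roughly the following sketch (illustrative, not the verbatim contents of `src/local_model_loader.py`):
```python
# Sketch: device_map="auto" only with quantization; otherwise load fully,
# then move the model explicitly so no weights stay on the "meta" device.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_model(model_id: str, quantize: bool = True):
    if quantize:
        # bitsandbytes shards the model itself, so auto dispatch is safe
        bnb = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
        )
        return AutoModelForCausalLM.from_pretrained(
            model_id, quantization_config=bnb, device_map="auto")
    # Without quantization, avoid device_map="auto": load, then move.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16)
    return model.to("cuda" if torch.cuda.is_available() else "cpu")
```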
### 5. Model Selection for Testing
**Current Models**:
- Primary: `Qwen/Qwen2.5-7B-Instruct` (gated - requires access)
- Fallback: `microsoft/Phi-3-mini-4k-instruct` (non-gated, verified)
**For Testing Without Gated Models**:
Update `src/models_config.py` to use non-gated models:
```python
"reasoning_primary": {
"model_id": "microsoft/Phi-3-mini-4k-instruct", # Non-gated
...
}
```
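A simple fallback loader built on the two models above might look like the following sketch (the helper name is hypothetical; `from_pretrained` surfaces gated or unreachable repos as `OSError`):
```python
# Sketch: try the gated primary, fall back to the non-gated model.
from transformers import AutoModelForCausalLM

PRIMARY = "Qwen/Qwen2.5-7B-Instruct"           # gated
FALLBACK = "microsoft/Phi-3-mini-4k-instruct"  # non-gated

def load_with_fallback():
    for model_id in (PRIMARY, FALLBACK):
        try:
            return AutoModelForCausalLM.from_pretrained(model_id)
        except OSError as err:  # gated or unreachable repo
            print(f"Could not load {model_id}: {err}")
    raise RuntimeError("No model could be loaded")
```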
## Recommended Dockerfile Updates
```dockerfile
FROM python:3.10-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    cmake \
    libopenblas-dev \
    libomp-dev \
    curl \
    && rm -rf /var/lib/apt/lists/*
# Create cache directories with proper permissions
RUN mkdir -p /tmp/huggingface_cache && \
    chmod 777 /tmp/huggingface_cache && \
    mkdir -p /tmp/logs && \
    chmod 777 /tmp/logs
# Copy requirements file
COPY requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV PORT=7860
ENV OMP_NUM_THREADS=4
ENV MKL_NUM_THREADS=4
ENV DB_PATH=/tmp/sessions.db
ENV FAISS_INDEX_PATH=/tmp/embeddings.faiss
ENV LOG_DIR=/tmp/logs
ENV HF_HOME=/tmp/huggingface_cache
ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
ENV RATE_LIMIT_ENABLED=true
# Expose port
EXPOSE 7860
# Health check
HEALTHCHECK --interval=30s --timeout=30s --start-period=120s --retries=3 \
    CMD curl -f http://localhost:7860/api/health || exit 1
# Run with Gunicorn
CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "4", "--threads", "2", "--timeout", "120", "--access-logfile", "-", "--error-logfile", "-", "--log-level", "info", "flask_api_standalone:app"]
```
## Hugging Face Spaces Configuration
### Required Secrets:
1. `HF_TOKEN` - Your Hugging Face access token (for gated models)
### Environment Variables (Optional):
- `HF_HOME` - set automatically to `/tmp/huggingface_cache` when Docker is detected
- `TRANSFORMERS_CACHE` - set automatically to `/tmp/huggingface_cache` when Docker is detected
### Hardware Requirements:
- GPU: NVIDIA T4 (16GB VRAM) - ✅ Detected in logs
- Memory: At least 8GB RAM
- Disk: 20GB+ for model cache
## Verification Steps
1. **Check Cache Directory**:
```bash
ls -la /tmp/huggingface_cache
# Should show writable directory
```
2. **Check HF Token**:
```python
import os
print("HF_TOKEN set:", bool(os.getenv("HF_TOKEN")))
```
3. **Check GPU**:
```python
import torch
print("CUDA available:", torch.cuda.is_available())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")
```
4. **Test Model Loading**:
- Check logs for: `✓ Cache directory verified: /tmp/huggingface_cache`
- Check logs for: `✓ HF_TOKEN authenticated for gated model access` (if token set)
- Check logs for: `✓ Model loaded successfully`
## Troubleshooting
### Issue: Still getting permission errors
**Fix**: Ensure the Dockerfile creates the cache directory with 777 permissions (see the Dockerfile fix above)
### Issue: Gated repository errors persist
**Fix**:
1. Verify HF_TOKEN is set in Spaces secrets
2. Visit the model page and request access
3. Wait for approval (usually instant)
4. Use the fallback model (Phi-3-mini) until access is granted
### Issue: Tensor device errors
**Fix**: The loader now handles this: if quantized loading fails, it reloads without quantization and places the model on the device explicitly (see the sketch in section 4)
### Issue: Model too large for GPU
**Fix**:
- The code automatically falls back to unquantized loading if bitsandbytes fails
- Consider using a smaller model (Phi-3-mini) for testing
- Check GPU memory with `nvidia-smi`, or from Python as in the sketch below
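From Python, free memory can be checked with `torch.cuda.mem_get_info` (a sketch; the ~14 GB figure is simply 7B parameters × 2 bytes for fp16 weights):
```python
# Sketch: report free GPU memory before attempting a large model load.
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(f"GPU memory: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
    # A 7B model needs ~14 GB in fp16 for the weights alone; prefer the
    # smaller fallback model when free memory is close to that.
else:
    print("No CUDA device available")
```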
## Quick Start Checklist
- [ ] HF_TOKEN set in Spaces secrets
- [ ] Dockerfile creates cache directory with proper permissions
- [ ] GPU detected (check logs)
- [ ] Cache directory writable (check logs)
- [ ] Model access granted (or using non-gated fallback)
- [ ] No tensor device errors (check logs)
## Next Steps
1. Update Dockerfile with cache directory creation
2. Set HF_TOKEN in Spaces secrets
3. Request access to gated models (Qwen)
4. Test with fallback model first (Phi-3-mini)
5. Monitor logs for successful model loading