JatsTheAIGen commited on
Commit
67c580c
·
1 Parent(s): 13fa6c4

Fix: GPU tensor placement and Docker deployment configuration

Browse files

CRITICAL FIXES:
- Fixed tensor device placement errors (meta device issues)
- Added explicit device placement for non-quantized models
- Updated Dockerfile with cache directory setup
- Created comprehensive deployment configuration guide

Changes:
- src/local_model_loader.py:
- Use device_map='auto' only with quantization (prevents meta device errors)
- Explicit .to(device) placement for non-quantized models
- Better logging for model loading status

- Dockerfile:
- Create cache directories with proper permissions
- Set HF_HOME and TRANSFORMERS_CACHE environment variables
- Ensure /tmp directories are writable

- DEPLOYMENT_CONFIG_GUIDE.md (NEW):
- Comprehensive guide for all deployment issues
- Cache directory permission fixes
- HF_TOKEN configuration
- GPU tensor placement solutions
- Troubleshooting steps
- Verification checklist

Fixes:
- Tensor on device meta errors → Explicit device placement
- Permission denied /cache errors → Dockerfile creates /tmp/cache
- User ID issues → Proper directory permissions in Dockerfile
- Gated repository access → HF_TOKEN configuration guide

Ready for production deployment.

Files changed (3) hide show
  1. DEPLOYMENT_CONFIG_GUIDE.md +214 -0
  2. Dockerfile +10 -0
  3. src/local_model_loader.py +12 -1
DEPLOYMENT_CONFIG_GUIDE.md ADDED
@@ -0,0 +1,214 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Deployment Configuration Guide
2
+
3
+ ## Critical Issues and Solutions
4
+
5
+ ### 1. Cache Directory Permissions
6
+
7
+ **Problem**: `PermissionError: [Errno 13] Permission denied: '/.cache'`
8
+
9
+ **Solution**: The code now automatically detects Docker and uses `/tmp/huggingface_cache`. However, ensure the Dockerfile sets proper permissions.
10
+
11
+ **Dockerfile Fix**:
12
+ ```dockerfile
13
+ # Create cache directory with proper permissions
14
+ RUN mkdir -p /tmp/huggingface_cache && chmod 777 /tmp/huggingface_cache
15
+ ENV HF_HOME=/tmp/huggingface_cache
16
+ ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
17
+ ```
18
+
19
+ ### 2. User ID Issues
20
+
21
+ **Problem**: `KeyError: 'getpwuid(): uid not found: 1000'`
22
+
23
+ **Solution**: Run container with proper user or ensure user exists in container.
24
+
25
+ **Option A - Use root (simplest for HF Spaces)**:
26
+ ```dockerfile
27
+ # Already running as root in HF Spaces - this is fine
28
+ # Just ensure cache directories are writable
29
+ ```
30
+
31
+ **Option B - Create user in Dockerfile**:
32
+ ```dockerfile
33
+ RUN useradd -m -u 1000 -s /bin/bash appuser && \
34
+ mkdir -p /tmp/huggingface_cache && \
35
+ chown -R appuser:appuser /tmp/huggingface_cache /app
36
+ USER appuser
37
+ ```
38
+
39
+ **For Hugging Face Spaces**: Spaces typically run as root, so Option A is fine.
40
+
41
+ ### 3. HuggingFace Token Configuration
42
+
43
+ **Problem**: Gated repository access errors
44
+
45
+ **Solution**: Set HF_TOKEN in Hugging Face Spaces secrets.
46
+
47
+ **Steps**:
48
+ 1. Go to your Space → Settings → Repository secrets
49
+ 2. Add `HF_TOKEN` with your Hugging Face access token
50
+ 3. Token should have read access to gated models
51
+
52
+ **Verify Token**:
53
+ ```bash
54
+ # Test token access
55
+ curl -H "Authorization: Bearer YOUR_TOKEN" https://huggingface.co/api/models/Qwen/Qwen2.5-7B-Instruct
56
+ ```
57
+
58
+ ### 4. GPU Tensor Device Placement
59
+
60
+ **Problem**: `Tensor on device cuda:0 is not on the expected device meta!`
61
+
62
+ **Solution**: Use explicit device placement instead of `device_map="auto"` for non-quantized models.
63
+
64
+ **Code Fix**: Already implemented in `src/local_model_loader.py` - uses `device_map="auto"` only with quantization, explicit placement otherwise.
65
+
66
+ ### 5. Model Selection for Testing
67
+
68
+ **Current Models**:
69
+ - Primary: `Qwen/Qwen2.5-7B-Instruct` (gated - requires access)
70
+ - Fallback: `microsoft/Phi-3-mini-4k-instruct` (non-gated, verified)
71
+
72
+ **For Testing Without Gated Models**:
73
+ Update `src/models_config.py` to use non-gated models:
74
+ ```python
75
+ "reasoning_primary": {
76
+ "model_id": "microsoft/Phi-3-mini-4k-instruct", # Non-gated
77
+ ...
78
+ }
79
+ ```
80
+
81
+ ## Recommended Dockerfile Updates
82
+
83
+ ```dockerfile
84
+ FROM python:3.10-slim
85
+
86
+ WORKDIR /app
87
+
88
+ # Install system dependencies
89
+ RUN apt-get update && apt-get install -y \
90
+ gcc \
91
+ g++ \
92
+ cmake \
93
+ libopenblas-dev \
94
+ libomp-dev \
95
+ curl \
96
+ && rm -rf /var/lib/apt/lists/*
97
+
98
+ # Create cache directories with proper permissions
99
+ RUN mkdir -p /tmp/huggingface_cache && \
100
+ chmod 777 /tmp/huggingface_cache && \
101
+ mkdir -p /tmp/logs && \
102
+ chmod 777 /tmp/logs
103
+
104
+ # Copy requirements file
105
+ COPY requirements.txt .
106
+
107
+ # Install Python dependencies
108
+ RUN pip install --no-cache-dir --upgrade pip && \
109
+ pip install --no-cache-dir -r requirements.txt
110
+
111
+ # Copy application code
112
+ COPY . .
113
+
114
+ # Set environment variables
115
+ ENV PYTHONUNBUFFERED=1
116
+ ENV PORT=7860
117
+ ENV OMP_NUM_THREADS=4
118
+ ENV MKL_NUM_THREADS=4
119
+ ENV DB_PATH=/tmp/sessions.db
120
+ ENV FAISS_INDEX_PATH=/tmp/embeddings.faiss
121
+ ENV LOG_DIR=/tmp/logs
122
+ ENV HF_HOME=/tmp/huggingface_cache
123
+ ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
124
+ ENV RATE_LIMIT_ENABLED=true
125
+
126
+ # Expose port
127
+ EXPOSE 7860
128
+
129
+ # Health check
130
+ HEALTHCHECK --interval=30s --timeout=30s --start-period=120s --retries=3 \
131
+ CMD curl -f http://localhost:7860/api/health || exit 1
132
+
133
+ # Run with Gunicorn
134
+ CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "4", "--threads", "2", "--timeout", "120", "--access-logfile", "-", "--error-logfile", "-", "--log-level", "info", "flask_api_standalone:app"]
135
+ ```
136
+
137
+ ## Hugging Face Spaces Configuration
138
+
139
+ ### Required Secrets:
140
+ 1. `HF_TOKEN` - Your Hugging Face access token (for gated models)
141
+
142
+ ### Environment Variables (Optional):
143
+ - `HF_HOME` - Will auto-detect to `/tmp/huggingface_cache` in Docker
144
+ - `TRANSFORMERS_CACHE` - Will auto-detect to `/tmp/huggingface_cache` in Docker
145
+
146
+ ### Hardware Requirements:
147
+ - GPU: NVIDIA T4 (16GB VRAM) - ✅ Detected in logs
148
+ - Memory: At least 8GB RAM
149
+ - Disk: 20GB+ for model cache
150
+
151
+ ## Verification Steps
152
+
153
+ 1. **Check Cache Directory**:
154
+ ```bash
155
+ ls -la /tmp/huggingface_cache
156
+ # Should show writable directory
157
+ ```
158
+
159
+ 2. **Check HF Token**:
160
+ ```python
161
+ import os
162
+ print("HF_TOKEN set:", bool(os.getenv("HF_TOKEN")))
163
+ ```
164
+
165
+ 3. **Check GPU**:
166
+ ```python
167
+ import torch
168
+ print("CUDA available:", torch.cuda.is_available())
169
+ print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")
170
+ ```
171
+
172
+ 4. **Test Model Loading**:
173
+ - Check logs for: `✓ Cache directory verified: /tmp/huggingface_cache`
174
+ - Check logs for: `✓ HF_TOKEN authenticated for gated model access` (if token set)
175
+ - Check logs for: `✓ Model loaded successfully`
176
+
177
+ ## Troubleshooting
178
+
179
+ ### Issue: Still getting permission errors
180
+ **Fix**: Ensure Dockerfile creates cache directory with 777 permissions
181
+
182
+ ### Issue: Gated repository errors persist
183
+ **Fix**:
184
+ 1. Verify HF_TOKEN is set in Spaces secrets
185
+ 2. Visit model page and request access
186
+ 3. Wait for approval (usually instant)
187
+ 4. Use fallback model (Phi-3-mini) until access granted
188
+
189
+ ### Issue: Tensor device errors
190
+ **Fix**: Code now handles this - if quantization fails, loads without quantization and uses explicit device placement
191
+
192
+ ### Issue: Model too large for GPU
193
+ **Fix**:
194
+ - Code automatically falls back to no quantization if bitsandbytes fails
195
+ - Consider using smaller model (Phi-3-mini) for testing
196
+ - Check GPU memory: `nvidia-smi`
197
+
198
+ ## Quick Start Checklist
199
+
200
+ - [ ] HF_TOKEN set in Spaces secrets
201
+ - [ ] Dockerfile creates cache directory with proper permissions
202
+ - [ ] GPU detected (check logs)
203
+ - [ ] Cache directory writable (check logs)
204
+ - [ ] Model access granted (or using non-gated fallback)
205
+ - [ ] No tensor device errors (check logs)
206
+
207
+ ## Next Steps
208
+
209
+ 1. Update Dockerfile with cache directory creation
210
+ 2. Set HF_TOKEN in Spaces secrets
211
+ 3. Request access to gated models (Qwen)
212
+ 4. Test with fallback model first (Phi-3-mini)
213
+ 5. Monitor logs for successful model loading
214
+
Dockerfile CHANGED
@@ -16,6 +16,13 @@ RUN apt-get update && apt-get install -y \
16
  curl \
17
  && rm -rf /var/lib/apt/lists/*
18
 
 
 
 
 
 
 
 
19
  # Copy requirements file first (for better caching)
20
  COPY requirements.txt .
21
 
@@ -39,6 +46,9 @@ ENV DB_PATH=/tmp/sessions.db
39
  ENV FAISS_INDEX_PATH=/tmp/embeddings.faiss
40
  ENV LOG_DIR=/tmp/logs
41
  ENV RATE_LIMIT_ENABLED=true
 
 
 
42
 
43
  # Health check
44
  HEALTHCHECK --interval=30s --timeout=30s --start-period=120s --retries=3 \
 
16
  curl \
17
  && rm -rf /var/lib/apt/lists/*
18
 
19
+ # Create cache directories with proper permissions
20
+ # Hugging Face Spaces runs as root, so we can use /tmp without permission issues
21
+ RUN mkdir -p /tmp/huggingface_cache && \
22
+ chmod 777 /tmp/huggingface_cache && \
23
+ mkdir -p /tmp/logs && \
24
+ chmod 777 /tmp/logs
25
+
26
  # Copy requirements file first (for better caching)
27
  COPY requirements.txt .
28
 
 
46
  ENV FAISS_INDEX_PATH=/tmp/embeddings.faiss
47
  ENV LOG_DIR=/tmp/logs
48
  ENV RATE_LIMIT_ENABLED=true
49
+ # Cache directories - will be used by transformers and huggingface_hub
50
+ ENV HF_HOME=/tmp/huggingface_cache
51
+ ENV TRANSFORMERS_CACHE=/tmp/huggingface_cache
52
 
53
  # Health check
54
  HEALTHCHECK --interval=30s --timeout=30s --start-period=120s --retries=3 \
src/local_model_loader.py CHANGED
@@ -172,10 +172,13 @@ class LocalModelLoader:
172
  }
173
 
174
  if self.device == "cuda":
 
 
175
  load_kwargs.update({
176
- "device_map": "auto", # Automatically uses GPU
177
  "torch_dtype": torch.float16, # Use FP16 for memory efficiency
178
  })
 
 
179
 
180
  # Try loading with quantization first
181
  model = None
@@ -188,6 +191,9 @@ class LocalModelLoader:
188
  else:
189
  load_kwargs["quantization_config"] = quantization_config
190
 
 
 
 
191
  model = AutoModelForCausalLM.from_pretrained(
192
  base_model_id,
193
  **load_kwargs
@@ -212,10 +218,15 @@ class LocalModelLoader:
212
  if model is None:
213
  try:
214
  if self.device == "cuda":
 
 
215
  model = AutoModelForCausalLM.from_pretrained(
216
  base_model_id,
217
  **load_kwargs
218
  )
 
 
 
219
  else:
220
  load_kwargs.update({
221
  "torch_dtype": torch.float32,
 
172
  }
173
 
174
  if self.device == "cuda":
175
+ # Use explicit device placement to avoid meta device issues
176
+ # device_map="auto" works well with quantization, but can cause issues without it
177
  load_kwargs.update({
 
178
  "torch_dtype": torch.float16, # Use FP16 for memory efficiency
179
  })
180
+ # Only use device_map="auto" with quantization, otherwise use explicit placement
181
+ # This prevents "Tensor on device meta" errors
182
 
183
  # Try loading with quantization first
184
  model = None
 
191
  else:
192
  load_kwargs["quantization_config"] = quantization_config
193
 
194
+ # With quantization, device_map="auto" works correctly
195
+ load_kwargs["device_map"] = "auto"
196
+
197
  model = AutoModelForCausalLM.from_pretrained(
198
  base_model_id,
199
  **load_kwargs
 
218
  if model is None:
219
  try:
220
  if self.device == "cuda":
221
+ # Without quantization, use explicit device placement to avoid meta device issues
222
+ # Don't use device_map="auto" here - it can cause tensor placement errors
223
  model = AutoModelForCausalLM.from_pretrained(
224
  base_model_id,
225
  **load_kwargs
226
  )
227
+ # Explicitly move to GPU after loading
228
+ model = model.to(self.device)
229
+ logger.info(f"✓ Model loaded without quantization on {self.device}")
230
  else:
231
  load_kwargs.update({
232
  "torch_dtype": torch.float32,