llaa33219 committed
Commit 6d15327 · verified
1 Parent(s): 07a2da6

Upload 4 files

Files changed (4)
  1. Dockerfile +28 -0
  2. README.md +94 -7
  3. app.py +379 -0
  4. requirements.txt +8 -0
Dockerfile ADDED
@@ -0,0 +1,28 @@
+ FROM python:3.10-slim
+
+ WORKDIR /app
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     git \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy requirements
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy application files
+ COPY . .
+
+ # Create persistent volume mount point
+ RUN mkdir -p /data/checkpoints /data/models
+
+ # Set environment variables
+ ENV GRADIO_SERVER_NAME="0.0.0.0"
+ ENV GRADIO_SERVER_PORT=7860
+
+ # Expose port
+ EXPOSE 7860
+
+ # Run the application
+ CMD ["python", "app.py"]
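For context on the two `GRADIO_*` variables: Gradio reads `GRADIO_SERVER_NAME` and `GRADIO_SERVER_PORT` at launch time, so the plain `demo.launch()` at the bottom of `app.py` binds to `0.0.0.0:7860` inside the container. A minimal sketch of the equivalent explicit wiring (not part of this commit):

```python
import os
import gradio as gr

# Placeholder UI standing in for the real app.py interface.
with gr.Blocks() as demo:
    gr.Markdown("placeholder UI")

# Gradio also honors GRADIO_SERVER_NAME / GRADIO_SERVER_PORT on a bare
# demo.launch(); the explicit form below just makes the wiring visible.
demo.launch(
    server_name=os.environ.get("GRADIO_SERVER_NAME", "0.0.0.0"),
    server_port=int(os.environ.get("GRADIO_SERVER_PORT", "7860")),
)
```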
README.md CHANGED
@@ -1,11 +1,98 @@
  ---
- title: Train3
- emoji: 🌖
- colorFrom: purple
- colorTo: yellow
- sdk: docker
+ title: CoDA Fine-tuning
+ emoji: 🚀
+ colorFrom: blue
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 4.44.0
+ app_file: app.py
  pinned: false
- short_description: test
+ license: apache-2.0
+ hf_oauth: true
+ hf_oauth_scopes:
+ - read-repos
+ - write-repos
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # CoDA Model Fine-tuning Space
+
+ This Space allows you to fine-tune the **Salesforce/CoDA-v0-Instruct** text generation diffusion model on the **baseten-admin/gpt-oss120b-generated-perfectblend** dataset.
+
+ ## Features
+
+ - 🎯 **Full Fine-tuning**: Complete parameter fine-tuning (not LoRA)
+ - 💬 **ChatML Format**: Processes conversation data with question-answer pairs
+ - 🔄 **Auto Upload**: Automatically uploads the trained model to your Hugging Face account
+ - 📊 **Progress Tracking**: Real-time training progress updates
+ - 🔐 **OAuth Integration**: Secure authentication via Hugging Face login
+
+ ## How to Use
+
+ 1. **Login**: Click the "Sign in with Hugging Face" button
+ 2. **Configure**: Adjust training parameters (epochs, batch size, learning rate)
+ 3. **Train**: Click "Start Training" (requires GPU - upgrade the Space to a GPU tier)
+ 4. **Resume**: If training is interrupted, check "Resume from last checkpoint" and restart
+ 5. **Upload**: After training completes, click "Upload to Hugging Face Hub"
+
+ ### Persistence
+
+ This Space supports checkpoint persistence:
+ - Training checkpoints are saved every 500 steps
+ - If interrupted, you can resume from the last checkpoint
+ - For Docker deployment: mount a `/data` volume for full persistence
+ - On Spaces: checkpoints persist within the same session and across rebuilds if using the persistent storage tier
+
+ ## Requirements
+
+ - **Hardware**: GPU (T4, A10G, or better) strongly recommended
+ - **Account**: Hugging Face account with write permissions
+ - **Time**: Training takes several hours depending on configuration
+
+ ## About the Model
+
+ **CoDA** is a 1.7B-parameter bidirectional diffusion model developed by Salesforce AI Research. Unlike traditional autoregressive models, CoDA generates text by discrete denoising. The Instruct version is already instruction-tuned, which makes it well suited to fine-tuning on conversational data.
+
+ ### Model Configuration
+
+ ```json
+ {
+   "architectures": ["CoDALanguageModel"],
+   "hidden_size": 2048,
+   "num_hidden_layers": 28,
+   "num_attention_heads": 16,
+   "vocab_size": 151936,
+   "max_position_embeddings": 40960
+ }
+ ```
+
+ ## Dataset
+
+ Training uses the **baseten-admin/gpt-oss120b-generated-perfectblend** dataset:
+ - **Format**: Conversational data in ChatML format
+ - **Column**: `conversations` (list of role-content pairs)
+ - **Split**: Uses the `train` split with a 90/10 train/eval split
+
+ ## Training Details
+
+ - **Optimizer**: AdamW
+ - **Precision**: FP16 (on GPU)
+ - **Gradient Accumulation**: 4 steps
+ - **Gradient Checkpointing**: Disabled (not supported by CoDA)
+ - **Max Sequence Length**: 2048 tokens
+
+ ## Citation
+
+ If you use this Space or the CoDA model, please cite:
+
+ ```bibtex
+ @article{coda2023,
+   title={CoDA: Bidirectional Code Diffusion},
+   author={Salesforce AI Research},
+   journal={arXiv preprint},
+   year={2023}
+ }
+ ```
+
+ ## License
+
+ Apache 2.0
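For orientation on the dataset format described above, here is an illustrative sketch of one record in the `conversations` column and the flat training string that `preprocess_conversations` in `app.py` builds from it. The question and answer text is invented for illustration:

```python
# Illustrative only: a made-up record in the shape app.py expects from the
# dataset's `conversations` column.
conversation = [
    {"role": "user", "content": "What does pip's --no-cache-dir flag do?"},
    {"role": "assistant", "content": "It tells pip not to store downloaded packages in its local cache."},
]

# preprocess_conversations() flattens each record into one training string:
text = ""
for message in conversation:
    if message["role"] == "user":
        text += f"<|user|>\n{message['content']}\n"
    elif message["role"] == "assistant":
        text += f"<|assistant|>\n{message['content']}\n"

print(text)
# <|user|>
# What does pip's --no-cache-dir flag do?
# <|assistant|>
# It tells pip not to store downloaded packages in its local cache.
```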
app.py ADDED
@@ -0,0 +1,379 @@
+ import gradio as gr
+ import torch
+ from transformers import (
+     AutoModel,
+     AutoTokenizer,
+     TrainingArguments,
+     Trainer,
+     DataCollatorForLanguageModeling
+ )
+ from datasets import load_dataset
+ from huggingface_hub import HfApi, login, whoami
+ import os
+ from datetime import datetime
+ import json
+ import pickle
+ from pathlib import Path
+
+ # Custom Trainer for CoDA model
+ class CoDATrainer(Trainer):
+     def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
+         """
+         Custom loss computation for the CoDA diffusion model.
+         CoDA returns a dict with a 'loss' key instead of a scalar.
+         """
+         outputs = model(**inputs)
+
+         # CoDA model returns a dict with 'loss' key
+         if isinstance(outputs, dict) and 'loss' in outputs:
+             loss = outputs['loss']
+         elif hasattr(outputs, 'loss'):
+             loss = outputs.loss
+         else:
+             # Fallback: compute standard LM loss
+             labels = inputs.get('labels')
+             logits = outputs.get('logits') if isinstance(outputs, dict) else outputs[0]
+             loss_fct = torch.nn.CrossEntropyLoss()
+             loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
+
+         # Ensure loss is a scalar
+         if loss.dim() > 0:
+             loss = loss.mean()
+
+         return (loss, outputs) if return_outputs else loss
+
+ def preprocess_conversations(examples, tokenizer):
+     """Convert ChatML-style conversations to text for training"""
+     texts = []
+     for conv in examples['conversations']:
+         # Format: [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]
+         if not isinstance(conv, list):
+             raise ValueError(f"Expected conversation to be a list, got {type(conv)}")
+
+         text = ""
+         for message in conv:
+             if not isinstance(message, dict):
+                 raise ValueError(f"Expected message to be a dict, got {type(message)}")
+
+             role = message.get('role', '')
+             content = message.get('content', '')
+             if role == 'user':
+                 text += f"<|user|>\n{content}\n"
+             elif role == 'assistant':
+                 text += f"<|assistant|>\n{content}\n"
+         texts.append(text)
+
+     return tokenizer(texts, truncation=True, max_length=2048, padding=False)
+
+ # Persistent storage paths
+ CHECKPOINT_DIR = Path("/data/checkpoints") if Path("/data").exists() else Path("./checkpoints")
+ MODEL_DIR = Path("/data/models") if Path("/data").exists() else Path("./models")
+ CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
+ MODEL_DIR.mkdir(parents=True, exist_ok=True)
+
+ STATE_FILE = CHECKPOINT_DIR / "training_state.pkl"
+
+ def save_training_state(state):
+     """Save training state to persistent storage"""
+     with open(STATE_FILE, 'wb') as f:
+         pickle.dump(state, f)
+
+ def load_training_state():
+     """Load training state from persistent storage"""
+     if STATE_FILE.exists():
+         with open(STATE_FILE, 'rb') as f:
+             return pickle.load(f)
+     return None
+
+ def train_model(epochs, batch_size, learning_rate, resume=False, progress=gr.Progress()):
+     try:
+         # Check for existing training state
+         if resume:
+             saved_state = load_training_state()
+             if saved_state:
+                 progress(0, desc=f"Resuming from step {saved_state.get('step', 0)}...")
+
+         progress(0, desc="Initializing training...")
+
+         # Check for GPU
+         device = "cuda" if torch.cuda.is_available() else "cpu"
+         if device == "cpu":
+             return "⚠️ Warning: Training on CPU will be very slow. Please upgrade the Space to GPU."
+
+         progress(0.1, desc="Loading model and tokenizer...")
+
+         # Load model and tokenizer
+         # Note: using the Instruct version, which is better suited to fine-tuning
+         model_name = "Salesforce/CoDA-v0-Instruct"
+         tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+         model = AutoModel.from_pretrained(
+             model_name,
+             trust_remote_code=True,
+             torch_dtype=torch.float16 if device == "cuda" else torch.float32
+         )
+
+         # Move model to device (CoDA doesn't support device_map='auto')
+         if device == "cuda":
+             model = model.to(device)
+
+         # Set pad token if not already set
+         if tokenizer.pad_token is None:
+             tokenizer.pad_token = tokenizer.eos_token
+             model.config.pad_token_id = tokenizer.eos_token_id
+
+         progress(0.2, desc="Loading dataset...")
+
+         # Load dataset
+         dataset = load_dataset("baseten-admin/gpt-oss120b-generated-perfectblend", split="train")
+
+         # Verify the dataset has a conversations column
+         if 'conversations' not in dataset.column_names:
+             return f"❌ Error: Dataset does not have a 'conversations' column. Found columns: {dataset.column_names}"
+
+         # Preprocess dataset
+         progress(0.3, desc="Preprocessing dataset...")
+         tokenized_dataset = dataset.map(
+             lambda x: preprocess_conversations(x, tokenizer),
+             batched=True,
+             remove_columns=dataset.column_names
+         )
+
+         # Split into train/eval
+         train_test_split = tokenized_dataset.train_test_split(test_size=0.1, seed=42)
+         train_dataset = train_test_split['train']
+         eval_dataset = train_test_split['test']
+
+         progress(0.4, desc="Setting up training configuration...")
+
+         # Training arguments - use persistent storage
+         output_dir = str(MODEL_DIR / "coda-finetuned")
+         training_args = TrainingArguments(
+             output_dir=output_dir,
+             num_train_epochs=epochs,
+             per_device_train_batch_size=batch_size,
+             per_device_eval_batch_size=batch_size,
+             learning_rate=learning_rate,
+             warmup_steps=100,
+             logging_steps=5,  # More frequent logging
+             logging_first_step=True,
+             eval_strategy="steps",
+             eval_steps=100,
+             save_strategy="steps",
+             save_steps=500,
+             save_total_limit=2,
+             fp16=True if device == "cuda" else False,
+             gradient_accumulation_steps=4,
+             gradient_checkpointing=False,  # CoDA doesn't support gradient checkpointing
+             optim="adamw_torch",
+             report_to="none",
+             load_best_model_at_end=True,
+             metric_for_best_model="loss",
+             greater_is_better=False,
+         )
+
+         # Data collator
+         data_collator = DataCollatorForLanguageModeling(
+             tokenizer=tokenizer,
+             mlm=False
+         )
+
+         # Initialize trainer with custom loss
+         trainer = CoDATrainer(
+             model=model,
+             args=training_args,
+             train_dataset=train_dataset,
+             eval_dataset=eval_dataset,
+             data_collator=data_collator,
+         )
+
+         progress(0.5, desc=f"Training for {epochs} epochs...")
+
+         # Train with live logging (ProgressCallback below is unused; GradioProgressCallback supersedes it)
+         class ProgressCallback:
+             def __init__(self, progress_fn):
+                 self.progress_fn = progress_fn
+                 self.step = 0
+
+             def on_log(self, args, state, control, logs=None, **kwargs):
+                 if logs:
+                     self.step += 1
+                     log_str = f"Step {state.global_step}: "
+                     if 'loss' in logs:
+                         log_str += f"loss={logs['loss']:.4f} "
+                     if 'learning_rate' in logs:
+                         log_str += f"lr={logs['learning_rate']:.2e}"
+                     self.progress_fn(0.5 + (0.4 * state.global_step / state.max_steps), desc=log_str)
+
+         from transformers import TrainerCallback
+         class GradioProgressCallback(TrainerCallback):
+             def __init__(self, progress_fn):
+                 self.progress_fn = progress_fn
+
+             def on_log(self, args, state, control, logs=None, **kwargs):
+                 if logs and state.max_steps > 0:
+                     log_str = f"Step {state.global_step}/{state.max_steps}: "
+                     if 'loss' in logs:
+                         log_str += f"loss={logs['loss']:.4f} "
+                     if 'learning_rate' in logs:
+                         log_str += f"lr={logs['learning_rate']:.2e}"
+                     progress = 0.5 + (0.4 * state.global_step / state.max_steps)
+                     self.progress_fn(progress, desc=log_str)
+
+         # Add state saving callback
+         class StateSavingCallback(TrainerCallback):
+             def on_save(self, args, state, control, **kwargs):
+                 save_training_state({
+                     'step': state.global_step,
+                     'epoch': state.epoch,
+                     'best_metric': state.best_metric
+                 })
+
+         trainer.add_callback(GradioProgressCallback(progress))
+         trainer.add_callback(StateSavingCallback())
+
+         # Resume from checkpoint if one exists
+         resume_from_checkpoint = None
+         if resume:
+             checkpoints = list(Path(output_dir).glob("checkpoint-*"))
+             if checkpoints:
+                 latest_checkpoint = max(checkpoints, key=lambda x: int(x.name.split("-")[1]))
+                 resume_from_checkpoint = str(latest_checkpoint)
+                 progress(0, desc=f"Resuming from {latest_checkpoint.name}...")
+
+         trainer.train(resume_from_checkpoint=resume_from_checkpoint)
+
+         progress(0.9, desc="Saving model...")
+
+         # Save final model
+         trainer.save_model(output_dir)
+         tokenizer.save_pretrained(output_dir)
+
+         progress(1.0, desc="Training complete!")
+
+         return f"✅ Training completed successfully!\nModel saved to: {output_dir}\n\nFinal training loss: {trainer.state.log_history[-1].get('loss', 'N/A')}"
+
+     except Exception as e:
+         return f"❌ Error during training: {str(e)}"
+
+ def upload_to_hub(repo_name, oauth_token: gr.OAuthToken | None, progress=gr.Progress()):
+     try:
+         if oauth_token is None:
+             return "❌ Please login first to upload the model!"
+
+         progress(0, desc="Authenticating...")
+
+         # Login with OAuth token
+         login(token=oauth_token.token)
+         user_info = whoami(oauth_token.token)
+         username = user_info['name']
+
+         progress(0.2, desc="Preparing model for upload...")
+
+         # Full repo ID
+         if not repo_name:
+             timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
+             repo_name = f"coda-finetuned-{timestamp}"
+
+         repo_id = f"{username}/{repo_name}"
+
+         progress(0.3, desc=f"Creating repository {repo_id}...")
+
+         # Create repo
+         api = HfApi()
+         api.create_repo(repo_id=repo_id, exist_ok=True, token=oauth_token.token, repo_type="model")
+
+         progress(0.5, desc="Uploading model files...")
+
+         # Upload folder (the same path train_model() saves to)
+         model_dir = str(MODEL_DIR / "coda-finetuned")
+         if not os.path.exists(model_dir):
+             return "❌ No trained model found! Please train a model first."
+
+         api.upload_folder(
+             folder_path=model_dir,
+             repo_id=repo_id,
+             repo_type="model",
+             token=oauth_token.token
+         )
+
+         progress(1.0, desc="Upload complete!")
+
+         return f"✅ Model successfully uploaded to: https://huggingface.co/{repo_id}"
+
+     except Exception as e:
+         return f"❌ Error during upload: {str(e)}"
+
+ # Gradio UI
+ with gr.Blocks(title="CoDA Fine-tuning Space") as demo:
+     gr.Markdown("""
+     # 🚀 CoDA Model Fine-tuning Space
+
+     This Space fine-tunes the **Salesforce/CoDA-v0-Instruct** diffusion model on the **baseten-admin/gpt-oss120b-generated-perfectblend** dataset.
+
+     ### Steps:
+     1. **Login** with your Hugging Face account (required for upload)
+     2. **Configure** training parameters
+     3. **Train** the model (requires GPU - upgrade the Space if needed)
+     4. **Upload** the trained model to your account
+
+     ⚠️ **Note**:
+     - Full fine-tuning requires significant GPU resources. Training may take several hours.
+     - **Checkpoints are saved every 500 steps** - you can resume if interrupted.
+     - For Docker: mount a `/data` volume for full persistence across container restarts.
+     - On Spaces: checkpoints persist in the same session and across rebuilds with persistent storage.
+     """)
+
+     with gr.Row():
+         login_button = gr.LoginButton()
+
+     gr.Markdown("## Training Configuration")
+
+     with gr.Row():
+         with gr.Column():
+             epochs = gr.Slider(minimum=1, maximum=10, value=3, step=1, label="Epochs")
+             batch_size = gr.Slider(minimum=1, maximum=8, value=2, step=1, label="Batch Size per Device")
+             learning_rate = gr.Slider(minimum=1e-6, maximum=1e-4, value=2e-5, step=1e-6, label="Learning Rate", info="Default: 2e-5")
+             resume_training = gr.Checkbox(label="Resume from last checkpoint", value=False, info="Check if training was interrupted")
+
+     with gr.Row():
+         train_button = gr.Button("🎯 Start Training", variant="primary", size="lg")
+
+     training_output = gr.Textbox(label="Training Status", lines=5)
+
+     gr.Markdown("## Upload Trained Model")
+
+     with gr.Row():
+         repo_name = gr.Textbox(label="Model Repository Name", placeholder="coda-finetuned-v1", info="Leave empty for an auto-generated name")
+
+     with gr.Row():
+         upload_button = gr.Button("📤 Upload to Hugging Face Hub", variant="secondary", size="lg")
+
+     upload_output = gr.Textbox(label="Upload Status", lines=3)
+
+     gr.Markdown("""
+     ---
+     ### About
+
+     **CoDA** is a 1.7B-parameter bidirectional diffusion model for text generation.
+     This Space performs full fine-tuning on conversational data in ChatML format.
+
+     **Dataset**: Training uses the `conversations` column from the dataset, which contains question-answer pairs.
+
+     **Hardware**: GPU (T4 or better) is strongly recommended. CPU training will be extremely slow.
+     """)
+
+     # Event handlers
+     train_button.click(
+         fn=train_model,
+         inputs=[epochs, batch_size, learning_rate, resume_training],
+         outputs=training_output
+     )
+
+     # oauth_token and progress are injected by Gradio from the type hints on
+     # upload_to_hub, so only the repository name is passed as a UI input.
+     upload_button.click(
+         fn=upload_to_hub,
+         inputs=[repo_name],
+         outputs=upload_output
+     )
+
+ if __name__ == "__main__":
+     demo.launch()
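Once the upload step finishes, the fine-tuned checkpoint can be loaded back with the same `trust_remote_code` calls that `app.py` uses for the base model. A minimal sketch; the repository id below is a placeholder for whatever name the upload step reports:

```python
# Sketch: load the fine-tuned model back from the Hub after upload.
import torch
from transformers import AutoModel, AutoTokenizer

repo_id = "your-username/coda-finetuned-v1"  # hypothetical repo id from the upload step

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,  # matches the fp16 dtype used during training in app.py
)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
```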
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ gradio==4.44.0
+ transformers==4.47.1
+ torch==2.5.1
+ datasets==3.1.0
+ huggingface-hub==0.26.2
+ accelerate==1.1.1
+ bitsandbytes==0.44.1
+ scipy==1.14.1