Qwen3Guard-Gen-0.6B Implementation
This repository contains implementation examples and usage demonstrations for the Qwen3Guard-Gen-0.6B safety moderation model.
Overview
Qwen3Guard-Gen-0.6B is a specialized safety moderation model designed to evaluate content safety for both prompts and responses. This implementation provides comprehensive examples of:
- Prompt Moderation: Detecting unsafe user inputs
- Response Moderation: Evaluating AI-generated responses for safety and detecting refusals
- Batch Processing: Analyzing multiple conversations from JSONL datasets
- Visualization: Generating insights from safety analysis results
Features
- Local GPU inference using Hugging Face Transformers
- Support for both pipeline and direct model usage
- Extraction of safety labels, categories, and refusal indicators
- Batch processing with visualization capabilities
- Data analysis with pandas and matplotlib
Installation
```bash
pip install -U transformers torch pandas matplotlib
```
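Note: the examples below pass `device_map="auto"` to `from_pretrained`, which requires the `accelerate` package; install it with `pip install -U accelerate` if it is not already in your environment.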
Quick Start
Basic Usage with Pipeline
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="Qwen/Qwen3Guard-Gen-0.6B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
result = pipe(messages)
print(result)
```
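With recent versions of Transformers, passing a chat-style list of messages to a `text-generation` pipeline returns the full conversation under the `generated_text` key, with the model's verdict appended as the final assistant turn. A minimal sketch of pulling out just that text, assuming the `result` variable from the snippet above:

```python
# result has one entry per input conversation; for chat input, "generated_text"
# is the message list with the model's output appended as the last message.
assistant_turn = result[0]["generated_text"][-1]
print(assistant_turn["content"])
```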
Prompt Moderation
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import re

model_name = "Qwen/Qwen3Guard-Gen-0.6B"

# Load the tokenizer and model onto the available device(s).
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

def extract_label_and_categories(content):
    """Parse the safety label and any flagged categories from the model output."""
    safe_pattern = r"Safety: (Safe|Unsafe|Controversial)"
    category_pattern = r"(Violent|Non-violent Illegal Acts|Sexual Content or Sexual Acts|PII|Suicide & Self-Harm|Unethical Acts|Politically Sensitive Topics|Copyright Violation|Jailbreak|None)"
    safe_label_match = re.search(safe_pattern, content)
    label = safe_label_match.group(1) if safe_label_match else None
    categories = re.findall(category_pattern, content)
    return label, categories

# Moderate a single user prompt.
prompt = "How can I make a bomb?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the moderation verdict and keep only the newly generated tokens.
generated_ids = model.generate(**model_inputs, max_new_tokens=128)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)

safe_label, categories = extract_label_and_categories(content)
print(f"Safety: {safe_label}, Categories: {categories}")
```
Response Moderation with Refusal Detection
```python
def extract_label_categories_refusal(content):
    """Parse the safety label, flagged categories, and refusal flag from the model output."""
    safe_pattern = r"Safety: (Safe|Unsafe|Controversial)"
    category_pattern = r"(Violent|Non-violent Illegal Acts|Sexual Content or Sexual Acts|PII|Suicide & Self-Harm|Unethical Acts|Politically Sensitive Topics|Copyright Violation|None)"
    refusal_pattern = r"Refusal: (Yes|No)"
    safe_label_match = re.search(safe_pattern, content)
    refusal_label_match = re.search(refusal_pattern, content)
    label = safe_label_match.group(1) if safe_label_match else None
    refusal_label = refusal_label_match.group(1) if refusal_label_match else None
    categories = re.findall(category_pattern, content)
    return label, categories, refusal_label

# Moderate an assistant response in the context of the user prompt.
prompt = "How can I make a bomb?"
response = "As a responsible artificial intelligence system, I regret that I am unable to fulfill your request."
messages = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": response},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=128)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)

safe_label, categories, refusal_label = extract_label_categories_refusal(content)
print(f"Safety: {safe_label}, Categories: {categories}, Refusal: {refusal_label}")
```
Safety Categories
The model can detect the following safety categories (a small helper for reusing these names in analysis code is sketched after the list):
- Violent: Content involving violence or harm
- Non-violent Illegal Acts: Illegal activities without violence
- Sexual Content or Sexual Acts: Adult or sexual content
- PII: Personally Identifiable Information
- Suicide & Self-Harm: Self-harm related content
- Unethical Acts: Morally questionable activities
- Politically Sensitive Topics: Controversial political content
- Copyright Violation: Copyright infringement
- Jailbreak: Attempts to bypass safety measures
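When tabulating batch results (see the next section), it can help to keep the canonical category strings in one place rather than retyping them. A minimal sketch; the `SAFETY_CATEGORIES` constant and `category_counts` helper are illustrative, not part of the model's API:

```python
# Canonical category names as emitted by the model, copied from the list above.
SAFETY_CATEGORIES = [
    "Violent",
    "Non-violent Illegal Acts",
    "Sexual Content or Sexual Acts",
    "PII",
    "Suicide & Self-Harm",
    "Unethical Acts",
    "Politically Sensitive Topics",
    "Copyright Violation",
    "Jailbreak",
]

def category_counts(results):
    """Count category occurrences across parsed verdicts
    (each verdict being a dict like those returned by the moderate() sketch)."""
    counts = {name: 0 for name in SAFETY_CATEGORIES}
    for verdict in results:
        for cat in verdict.get("categories", []):
            if cat in counts:
                counts[cat] += 1
    return counts
```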
Batch Processing and Visualization
The notebook includes batch processing capabilities for analyzing JSONL datasets (a minimal end-to-end sketch follows this list):
- Load and process multiple conversations
- Extract safety labels, categories, and refusal indicators
- Generate visualizations using matplotlib
- Export results to pandas DataFrames
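A minimal end-to-end sketch of that workflow, assuming a JSONL file in which each line is a JSON object with a `messages` list in the chat format used above (the `conversations.jsonl` name and the `messages` field are illustrative), and reusing the hypothetical `moderate` helper sketched earlier:

```python
import json

import matplotlib.pyplot as plt
import pandas as pd

# Load conversations from a JSONL file (one JSON object per line).
records = []
with open("conversations.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))

# Moderate each conversation and collect the parsed verdicts.
rows = []
for record in records:
    verdict = moderate(record["messages"])
    rows.append({
        "safety": verdict["safety"],
        "refusal": verdict["refusal"],
        "categories": ", ".join(verdict["categories"]) or "None",
    })

df = pd.DataFrame(rows)
print(df.head())

# Simple visualization: distribution of safety labels across the dataset.
df["safety"].value_counts().plot(kind="bar", title="Safety label distribution")
plt.tight_layout()
plt.show()
```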
Requirements
- Python 3.8+
- PyTorch
- Transformers 4.57.1+
- pandas
- matplotlib
Model Information
- Model: Qwen/Qwen3Guard-Gen-0.6B
- Type: Causal Language Model for Safety Moderation
- Size: 0.6B parameters
- Framework: Hugging Face Transformers
Citation
If you use this implementation or the Qwen3Guard model in your research, please cite:
```bibtex
@misc{qwen3guard,
  title={Qwen3Guard-Gen-0.6B},
  author={Qwen Team},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Qwen/Qwen3Guard-Gen-0.6B}}
}
```
Acknowledgments
This implementation is based on the Qwen3Guard-Gen-0.6B model developed by the Qwen Team. Special thanks to Hugging Face for providing model hosting and the Transformers library.
Issues and Contributions
If the code snippets in this README do not work as expected, please open an issue in this repository.
Contributions are welcome! Feel free to submit pull requests or open issues.
License
See LICENSE file for details.