Qwen3Guard-Gen-0.6B Implementation
This repository contains implementation examples and usage demonstrations for the Qwen3Guard-Gen-0.6B safety moderation model.
Overview
Qwen3Guard-Gen-0.6B is a specialized safety moderation model designed to evaluate content safety for both prompts and responses. This implementation provides comprehensive examples of:
- Prompt Moderation: Detecting unsafe user inputs
- Response Moderation: Evaluating AI-generated responses for safety and detecting refusals
- Batch Processing: Analyzing multiple conversations from JSONL datasets
- Visualization: Generating insights from safety analysis results
Features
- Local GPU inference using Hugging Face Transformers
- Support for both pipeline and direct model usage
- Extraction of safety labels, categories, and refusal indicators
- Batch processing with visualization capabilities
- Data analysis with pandas and matplotlib
Installation
```bash
pip install -U transformers torch pandas matplotlib
```
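Note: the examples below pass `device_map="auto"` to `from_pretrained`, which requires the `accelerate` package; install it with `pip install -U accelerate` if it is not already in your environment.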
Quick Start
Basic Usage with Pipeline
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="Qwen/Qwen3Guard-Gen-0.6B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
result = pipe(messages)
print(result)
```
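With recent versions of Transformers, passing a chat-style list of messages to a `text-generation` pipeline returns the full conversation under the `generated_text` key, with the model's verdict appended as the final assistant turn. A minimal sketch of pulling out just that text, assuming the `result` variable from the snippet above:

```python
# result has one entry per input conversation; for chat input, "generated_text"
# is the message list with the model's output appended as the last message.
assistant_turn = result[0]["generated_text"][-1]
print(assistant_turn["content"])
```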
Prompt Moderation
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import re

model_name = "Qwen/Qwen3Guard-Gen-0.6B"

# Load the tokenizer and model onto the available device(s).
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

def extract_label_and_categories(content):
    """Parse the safety label and any flagged categories from the model output."""
    safe_pattern = r"Safety: (Safe|Unsafe|Controversial)"
    category_pattern = r"(Violent|Non-violent Illegal Acts|Sexual Content or Sexual Acts|PII|Suicide & Self-Harm|Unethical Acts|Politically Sensitive Topics|Copyright Violation|Jailbreak|None)"
    safe_label_match = re.search(safe_pattern, content)
    label = safe_label_match.group(1) if safe_label_match else None
    categories = re.findall(category_pattern, content)
    return label, categories

# Moderate a single user prompt.
prompt = "How can I make a bomb?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the moderation verdict and keep only the newly generated tokens.
generated_ids = model.generate(**model_inputs, max_new_tokens=128)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)

safe_label, categories = extract_label_and_categories(content)
print(f"Safety: {safe_label}, Categories: {categories}")
```
Response Moderation with Refusal Detection
```python
def extract_label_categories_refusal(content):
    """Parse the safety label, flagged categories, and refusal flag from the model output."""
    safe_pattern = r"Safety: (Safe|Unsafe|Controversial)"
    category_pattern = r"(Violent|Non-violent Illegal Acts|Sexual Content or Sexual Acts|PII|Suicide & Self-Harm|Unethical Acts|Politically Sensitive Topics|Copyright Violation|None)"
    refusal_pattern = r"Refusal: (Yes|No)"
    safe_label_match = re.search(safe_pattern, content)
    refusal_label_match = re.search(refusal_pattern, content)
    label = safe_label_match.group(1) if safe_label_match else None
    refusal_label = refusal_label_match.group(1) if refusal_label_match else None
    categories = re.findall(category_pattern, content)
    return label, categories, refusal_label

# Moderate an assistant response in the context of the user prompt.
prompt = "How can I make a bomb?"
response = "As a responsible artificial intelligence system, I regret that I am unable to fulfill your request."
messages = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": response},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=128)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)

safe_label, categories, refusal_label = extract_label_categories_refusal(content)
print(f"Safety: {safe_label}, Categories: {categories}, Refusal: {refusal_label}")
```
Safety Categories
The model can detect the following safety categories (a small helper for reusing these names in analysis code is sketched after the list):
- Violent: Content involving violence or harm
- Non-violent Illegal Acts: Illegal activities without violence
- Sexual Content or Sexual Acts: Adult or sexual content
- PII: Personally Identifiable Information
- Suicide & Self-Harm: Self-harm related content
- Unethical Acts: Morally questionable activities
- Politically Sensitive Topics: Controversial political content
- Copyright Violation: Copyright infringement
- Jailbreak: Attempts to bypass safety measures
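When tabulating batch results (see the next section), it can help to keep the canonical category strings in one place rather than retyping them. A minimal sketch; the `SAFETY_CATEGORIES` constant and `category_counts` helper are illustrative, not part of the model's API:

```python
# Canonical category names as emitted by the model, copied from the list above.
SAFETY_CATEGORIES = [
    "Violent",
    "Non-violent Illegal Acts",
    "Sexual Content or Sexual Acts",
    "PII",
    "Suicide & Self-Harm",
    "Unethical Acts",
    "Politically Sensitive Topics",
    "Copyright Violation",
    "Jailbreak",
]

def category_counts(results):
    """Count category occurrences across parsed verdicts
    (each verdict being a dict like those returned by the moderate() sketch)."""
    counts = {name: 0 for name in SAFETY_CATEGORIES}
    for verdict in results:
        for cat in verdict.get("categories", []):
            if cat in counts:
                counts[cat] += 1
    return counts
```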
Batch Processing and Visualization
The notebook includes batch processing capabilities for analyzing JSONL datasets (a minimal end-to-end sketch follows this list):
- Load and process multiple conversations
- Extract safety labels, categories, and refusal indicators
- Generate visualizations using matplotlib
- Export results to pandas DataFrames
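A minimal end-to-end sketch of that workflow, assuming a JSONL file in which each line is a JSON object with a `messages` list in the chat format used above (the `conversations.jsonl` name and the `messages` field are illustrative), and reusing the hypothetical `moderate` helper sketched earlier:

```python
import json

import matplotlib.pyplot as plt
import pandas as pd

# Load conversations from a JSONL file (one JSON object per line).
records = []
with open("conversations.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))

# Moderate each conversation and collect the parsed verdicts.
rows = []
for record in records:
    verdict = moderate(record["messages"])
    rows.append({
        "safety": verdict["safety"],
        "refusal": verdict["refusal"],
        "categories": ", ".join(verdict["categories"]) or "None",
    })

df = pd.DataFrame(rows)
print(df.head())

# Simple visualization: distribution of safety labels across the dataset.
df["safety"].value_counts().plot(kind="bar", title="Safety label distribution")
plt.tight_layout()
plt.show()
```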
Requirements
- Python 3.8+
- PyTorch
- Transformers 4.57.1+
- pandas
- matplotlib
Model Information
- Model: Qwen/Qwen3Guard-Gen-0.6B
- Type: Causal Language Model for Safety Moderation
- Size: 0.6B parameters
- Framework: Hugging Face Transformers
Citation
If you use this implementation or the Qwen3Guard model in your research, please cite:
```bibtex
@misc{qwen3guard,
  title={Qwen3Guard-Gen-0.6B},
  author={Qwen Team},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Qwen/Qwen3Guard-Gen-0.6B}}
}
```
Acknowledgments
This implementation is based on the Qwen3Guard-Gen-0.6B model developed by the Qwen Team. Special thanks to Hugging Face for providing model hosting and the Transformers library.
Issues and Contributions
If the code snippets in this README do not work as expected, please open an issue in this repository.
Contributions are welcome! Feel free to submit pull requests or open issues.
License
See LICENSE file for details.