Qwen3Guard-Gen-0.6B Implementation

This repository contains implementation examples and usage demonstrations for the Qwen3Guard-Gen-0.6B safety moderation model.

Overview

Qwen3Guard-Gen-0.6B is a specialized safety moderation model designed to evaluate content safety for both prompts and responses. This implementation provides comprehensive examples of:

  • Prompt Moderation: Detecting unsafe user inputs
  • Response Moderation: Evaluating AI-generated responses for safety and refusal detection
  • Batch Processing: Analyzing multiple conversations from JSONL datasets
  • Visualization: Generating insights from safety analysis results

Features

  • Local GPU inference using Hugging Face Transformers
  • Support for both pipeline and direct model usage
  • Extraction of safety labels, categories, and refusal indicators
  • Batch processing with visualization capabilities
  • Data analysis with pandas and matplotlib

Installation

pip install -U transformers torch pandas matplotlib

Quick Start

Basic Usage with Pipeline

from transformers import pipeline

# Load the guard model through the standard text-generation pipeline
pipe = pipeline("text-generation", model="Qwen/Qwen3Guard-Gen-0.6B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
result = pipe(messages)
print(result)
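
With chat-style input, recent Transformers versions return the whole conversation, with the verdict appended as the final assistant message. A minimal sketch of pulling it out, assuming that default output format:

# The last message in generated_text is the guard model's verdict
# (assumes the pipeline's default chat output format)
verdict = result[0]["generated_text"][-1]["content"]
print(verdict)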

Prompt Moderation

from transformers import AutoModelForCausalLM, AutoTokenizer
import re

model_name = "Qwen/Qwen3Guard-Gen-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# torch_dtype="auto" keeps the checkpoint's dtype; device_map="auto" places
# the weights on the available GPU(s)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Parse the guard model's plain-text verdict into a safety label and categories
def extract_label_and_categories(content):
    safe_pattern = r"Safety: (Safe|Unsafe|Controversial)"
    category_pattern = r"(Violent|Non-violent Illegal Acts|Sexual Content or Sexual Acts|PII|Suicide & Self-Harm|Unethical Acts|Politically Sensitive Topics|Copyright Violation|Jailbreak|None)"
    safe_label_match = re.search(safe_pattern, content)
    label = safe_label_match.group(1) if safe_label_match else None
    categories = re.findall(category_pattern, content)
    return label, categories

prompt = "How can I make a bomb?"
messages = [
    {"role": "user", "content": prompt}
]
# The guard model's chat template wraps the conversation in its moderation prompt
text = tokenizer.apply_chat_template(messages, tokenize=False)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=128)
# Keep only the newly generated tokens (drop the echoed input)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)
safe_label, categories = extract_label_and_categories(content)
print(f"Safety: {safe_label}, Categories: {categories}")

Response Moderation with Refusal Detection

# Parse the verdict into a safety label, categories, and a refusal flag
def extract_label_categories_refusal(content):
    safe_pattern = r"Safety: (Safe|Unsafe|Controversial)"
    category_pattern = r"(Violent|Non-violent Illegal Acts|Sexual Content or Sexual Acts|PII|Suicide & Self-Harm|Unethical Acts|Politically Sensitive Topics|Copyright Violation|None)"
    refusal_pattern = r"Refusal: (Yes|No)"
    safe_label_match = re.search(safe_pattern, content)
    refusal_label_match = re.search(refusal_pattern, content)
    label = safe_label_match.group(1) if safe_label_match else None
    refusal_label = refusal_label_match.group(1) if refusal_label_match else None
    categories = re.findall(category_pattern, content)
    return label, categories, refusal_label

prompt = "How can I make a bomb?"
response = "As a responsible artificial intelligence system, I regret that I am unable to fulfill your request."

# Response moderation: pass the full user/assistant exchange
messages = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": response},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=128)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)
safe_label, categories, refusal_label = extract_label_categories_refusal(content)
print(f"Safety: {safe_label}, Categories: {categories}, Refusal: {refusal_label}")

Safety Categories

The model can detect the following safety categories:

  • Violent: Content involving violence or harm
  • Non-violent Illegal Acts: Illegal activities without violence
  • Sexual Content or Sexual Acts: Adult or sexual content
  • PII: Personally Identifiable Information
  • Suicide & Self-Harm: Self-harm related content
  • Unethical Acts: Morally questionable activities
  • Politically Sensitive Topics: Controversial political content
  • Copyright Violation: Copyright infringement
  • Jailbreak: Attempts to bypass safety measures (detected in prompt moderation only)
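
If the category names are needed programmatically, for instance to keep the regex patterns above in sync with this list, a simple constant works. This is a sketch; the constant and the derived pattern are not part of the original code:

# Category names exactly as they appear in the model's verdicts;
# "Jailbreak" applies to prompt moderation only
SAFETY_CATEGORIES = [
    "Violent",
    "Non-violent Illegal Acts",
    "Sexual Content or Sexual Acts",
    "PII",
    "Suicide & Self-Harm",
    "Unethical Acts",
    "Politically Sensitive Topics",
    "Copyright Violation",
    "Jailbreak",
]
# Rebuild the category regex used above ("None" marks a verdict with no category)
category_pattern = "(" + "|".join(SAFETY_CATEGORIES + ["None"]) + ")"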

Batch Processing and Visualization

The notebook includes batch processing capabilities for analyzing JSONL datasets; a minimal sketch of the workflow follows the list:

  • Load and process multiple conversations
  • Extract safety labels, categories, and refusal indicators
  • Generate visualizations using matplotlib
  • Export results to pandas DataFrames
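
The sketch below reuses the model, tokenizer, and extract_label_categories_refusal helper from the sections above. The file name conversations.jsonl and the "messages" field are illustrative assumptions about the dataset layout, and moderate() is a hypothetical helper:

import json
import pandas as pd
import matplotlib.pyplot as plt

# Run the guard model on one conversation and return the decoded verdict
def moderate(messages):
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=128)
    output_ids = generated_ids[0][len(inputs.input_ids[0]):].tolist()
    return tokenizer.decode(output_ids, skip_special_tokens=True)

records = []
with open("conversations.jsonl") as f:  # illustrative file name
    for line in f:
        messages = json.loads(line)["messages"]  # illustrative field name
        label, categories, refusal = extract_label_categories_refusal(moderate(messages))
        records.append({"label": label,
                        "categories": ", ".join(categories),
                        "refusal": refusal})

# Collect results in a DataFrame and plot the safety label distribution
df = pd.DataFrame(records)
df["label"].value_counts().plot(kind="bar", title="Safety label distribution")
plt.tight_layout()
plt.show()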

Requirements

  • Python 3.8+
  • PyTorch
  • Transformers 4.57.1+
  • pandas
  • matplotlib

Model Information

  • Model: Qwen/Qwen3Guard-Gen-0.6B
  • Type: Causal Language Model for Safety Moderation
  • Size: 0.6B parameters
  • Framework: Hugging Face Transformers

Citation

If you use this implementation or the Qwen3Guard model in your research, please cite:

@misc{qwen3guard,
  title={Qwen3Guard-Gen-0.6B},
  author={Qwen Team},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Qwen/Qwen3Guard-Gen-0.6B}}
}

Acknowledgments

This implementation is based on the Qwen3Guard-Gen-0.6B model developed by the Qwen Team. Special thanks to Hugging Face for providing model hosting and the Transformers library.

Issues and Contributions

If the generated code snippets do not work, please open an issue in this repository.

Contributions are welcome! Feel free to submit pull requests or open issues.

License

See LICENSE file for details.
