SAGE-32B
The SAGE LLMs are instruction-tuned generative models optimized for agentic reasoning and reliable tool usage (text in/text out). All models are released under an open license for commercial use.
- SAGE models are hybrid reasoning models. Each model can answer directly (standard LLM), or use inverse reasoning before answering (like reasoning models with self-verification).
- The LLMs are trained using Iterative Distillation and Amplification (IDA) combined with Reflective Distillation - a scalable alignment strategy using iterative self-improvement with error recovery.
- The models have been optimized for agentic workflows, tool calling, multi-step planning, STEM, and error recovery, with significantly higher reliability and lower hallucination rates than size-equivalent counterparts.
- In both standard and reasoning modes, SAGE-32B models outperform their size-equivalent counterparts on agentic benchmarks and mathematical reasoning tasks.
- Each model is trained on data in over 30 languages and supports a context length of 128K tokens via Landmark Attention.
Evaluations
We compare our models against state-of-the-art, size-equivalent models in both direct and reasoning modes. In direct mode, we compare against Qwen2.5-32B-Instruct; in reasoning mode, we compare against Llama-3.1-70B and demonstrate superior performance with fewer parameters.
Key Results:
- MATH Reasoning: 91.78% with Inverse Reasoning (vs 72.6% GPT-4-Turbo, 68.0% Llama-3.1-70B)
- Tool Calling: 91.5% success rate with only 2.4% unforced errors at 1/7th the cost of GPT-4
- Error Recovery: 76% internal recovery rate (vs 35% base model)
- Cost Efficiency: Near GPT-4 reliability at $4.50 per 1k episodes vs $32 for GPT-4-Turbo
Usage
Here is a snippet for usage with Transformers:
import transformers
import torch
model_id = "sagea-ai/sage-reasoning-32b"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
messages = [
    {"role": "system", "content": "You are SAGE, an agentic reasoning assistant."},
    {"role": "user", "content": "Explain the concept of inverse reasoning in AI systems."},
]
outputs = pipeline(
    messages,
    max_new_tokens=512,
)
print(outputs[0]["generated_text"][-1])
Implementing Inverse Reasoning Mode
- By default, the model answers in standard mode.
- To enable inverse reasoning (thinking with self-verification), you can use either of two methods:
- Add a specific system prompt, or
- Set enable_thinking=True while applying the chat template.
Method 1 - Add a specific system prompt.
To enable inverse reasoning, simply use this in the system prompt: system_instruction = 'Enable inverse reasoning mode.'
If you already have a system_instruction, then use system_instruction = 'Enable inverse reasoning mode.' + '\n\n' + system_instruction.
Here is an example:
import transformers
import torch
model_id = "sagea-ai/sage-reasoning-32b"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
INVERSE_REASONING_INSTRUCTION = "Enable inverse reasoning mode."
messages = [
    {"role": "system", "content": INVERSE_REASONING_INSTRUCTION},
    {"role": "user", "content": "Write a Python function that implements the Sieve of Eratosthenes algorithm to find all prime numbers up to n."},
]
outputs = pipeline(
    messages,
    max_new_tokens=2048,
)
print(outputs[0]["generated_text"][-1])
Similarly, if you already have a system prompt, you can prepend the INVERSE_REASONING_INSTRUCTION to it like this:
INVERSE_REASONING_INSTRUCTION = "Enable inverse reasoning mode."
system_prompt = "You are an expert Python programmer. Provide clean, efficient code."
prompt = "Solve the following problem: Given a list of intervals, merge all overlapping intervals."
messages = [
    {"role": "system", "content": INVERSE_REASONING_INSTRUCTION + '\n\n' + system_prompt},
    {"role": "user", "content": prompt}
]
Method 2 - Set enable_thinking=True in the tokenizer
If you are using Hugging Face tokenizers, you can simply pass the argument enable_thinking=True when applying the chat template (the option is handled by the chat template).
Here is an example:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "sagea-ai/sage-reasoning-32b"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = "Solve for x: 3x^2 + 7x - 6 = 0"
messages = [
    {"role": "system", "content": "You are a mathematics expert."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
Tool Calling
SAGE models support tool calling (single, parallel, multiple, and parallel_multiple) in both standard and inverse reasoning modes. SAGE-32B has been specifically optimized for low hallucination rates in tool calling, with only 2.4% unforced errors.
Here is a snippet:
# First, define a tool
def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country"
    Returns:
        The current temperature at the specified location, as a float.
    """
    return 22.0  # A real function should probably actually get the temperature!

# Next, create a chat and apply the chat template
messages = [
    {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
]
text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
print(output_text)
This will result in the output:
<tool_call>
{"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
</tool_call><|im_end|>
If the model generates a tool call, you should append it to the chat like so:
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
and then call the tool and append the result, with the tool role, like so:
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
After that, you can generate() again to let the model use the tool result in the chat:
text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
This should result in the string:
'The current temperature in Paris is 22.0 degrees.<|im_end|>'
Advanced: Hybrid Mode for Production
SAGE-32B supports automatic mode switching for optimal latency-accuracy trade-offs:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "sagea-ai/sage-reasoning-32b"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
def generate_with_hybrid_mode(prompt):
    """
    Automatically switch between fast and reasoning modes based on task complexity.
    In production, this achieves 98% of reasoning mode performance at 40% of the latency.
    """
    # Simple heuristic: use reasoning mode for math, code, or complex queries
    enable_reasoning = any(keyword in prompt.lower() for keyword in
                           ['calculate', 'solve', 'prove', 'write a function', 'algorithm', 'implement'])
    messages = [
        {"role": "system", "content": "Enable inverse reasoning mode." if enable_reasoning else "You are SAGE."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048 if enable_reasoning else 512,
        temperature=0.7,
        do_sample=True
    )
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    return response, "reasoning" if enable_reasoning else "fast"
# Example usage
result, mode = generate_with_hybrid_mode("What is the capital of France?")
print(f"Mode: {mode}\n{result}")
result, mode = generate_with_hybrid_mode("Prove that the square root of 2 is irrational.")
print(f"Mode: {mode}\n{result}")
Model Architecture Highlights
Inverse Reasoning Head: Novel dual-head architecture that validates logical plans before execution by computing an Inverse Consistency Score (ICS). This mechanism prevents "confident hallucinations" by verifying that reasoning traces can reconstruct the original problem constraints.
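The exact ICS computation is internal to the model's learned dual-head architecture, but the control flow it enables can be sketched. Below is a minimal, illustrative stand-in (not the model's actual mechanism): the helpers _content_words and verified_answer and the 0.8 threshold are assumptions, used only to show how an answer can be gated on whether the reasoning trace reconstructs the problem's constraints.

import re

def _content_words(text: str) -> set:
    """Crude stand-in for constraint extraction: lowercase content tokens."""
    stop = {"the", "a", "an", "of", "to", "and", "is", "in", "that", "for"}
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in stop}

def inverse_consistency_score(problem: str, reasoning_trace: str) -> float:
    """Fraction of the problem's constraints recoverable from the reasoning trace."""
    original = _content_words(problem)
    recovered = _content_words(reasoning_trace)
    return len(original & recovered) / len(original) if original else 1.0

def verified_answer(problem: str, reasoning_trace: str, answer: str, threshold: float = 0.8):
    """Emit the answer only if the trace is consistent with the problem; otherwise signal a re-plan."""
    if inverse_consistency_score(problem, reasoning_trace) >= threshold:
        return answer
    return None  # low ICS: trigger re-planning instead of a confident hallucination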
Landmark Attention: Reduces memory complexity from O(N²) to O(N²/k) with k=64, enabling 128K context windows while maintaining:
- Dense attention over the most recent 4096 tokens
- Global attention to landmark tokens across the entire history (a sketch of the pattern follows below)
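As a rough illustration of this attention pattern (not the released kernels), the sketch below builds a boolean mask that combines a dense causal window with sparse landmark positions. The window size and block size k=64 come from the description above; the mask-based formulation and landmark placement (every k-th token) are assumptions for illustration.

import torch

def landmark_attention_mask(seq_len: int, window: int = 4096, k: int = 64) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask: True where attention is allowed."""
    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]              # never attend to future tokens
    local = (pos[:, None] - pos[None, :]) < window     # dense attention over recent tokens
    landmark = (pos[None, :] % k) == (k - 1)           # every k-th token acts as a landmark
    return causal & (local | landmark)

mask = landmark_attention_mask(seq_len=8192)
# Allowed entries scale roughly as N*window + N*(N/k) rather than N^2 for full attention.
print(int(mask.sum()), 8192 * 8192)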
Split-Embedding Strategy: Separate embedding matrices for natural-language and code tokens with a learnable gating parameter α_t, enabling seamless switching between conversational flow and strictly formatted API calls.
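A minimal sketch of this idea follows, assuming the gate α_t is predicted per token from the two embeddings; the class name SplitEmbedding, the sigmoid gate, and the dimensions are illustrative choices, since the exact parameterization is not published.

import torch
import torch.nn as nn

class SplitEmbedding(nn.Module):
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.lang_embed = nn.Embedding(vocab_size, dim)   # natural-language token table
        self.code_embed = nn.Embedding(vocab_size, dim)   # code / API token table
        self.gate = nn.Linear(2 * dim, 1)                 # produces alpha_t for each token

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        lang = self.lang_embed(token_ids)
        code = self.code_embed(token_ids)
        alpha = torch.sigmoid(self.gate(torch.cat([lang, code], dim=-1)))  # alpha_t in (0, 1)
        return alpha * lang + (1.0 - alpha) * code        # per-token blend of the two tables

emb = SplitEmbedding(vocab_size=32000, dim=64)
print(emb(torch.tensor([[1, 5, 42]])).shape)  # torch.Size([1, 3, 64])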
Training Pipeline:
- Stage 1 - Distillation & Amplification: 5M synthetic trajectories with negative constraint sampling (type errors, hallucinated parameters, logic errors)
- Stage 2 - Reflective Distillation: Rejection sampling with feedback loop for error recovery
- Stage 3 - RL Refinement (CodePPO): Zero-tolerance for invalid JSON/function signatures, heavy penalties for argument hallucination
This pipeline reduces the unforced error rate from 14.5% to 2.4%.
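As an illustration of the Stage 3 reward shaping (the actual CodePPO objective and reward values are not published; tool_call_reward and the specific penalties below are placeholders), a validator in this spirit rejects malformed calls outright and penalizes hallucinated arguments:

import json

def tool_call_reward(raw_call: str, tool_schemas: dict) -> float:
    """Score a generated tool call against the declared tool schemas."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return -1.0                                   # zero tolerance for invalid JSON
    name = call.get("name")
    if name not in tool_schemas:
        return -1.0                                   # unknown function signature
    allowed = set(tool_schemas[name])
    supplied = set(call.get("arguments", {}))
    if supplied - allowed:
        return -0.5                                   # heavy penalty for hallucinated parameters
    return 1.0                                        # well-formed, schema-consistent call

schemas = {"get_current_temperature": ["location"]}
print(tool_call_reward('{"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}', schemas))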
License
This repository and the model weights are licensed under the Apache License 2.0.
Citation
If you use SAGE-32B in your research, please cite:
@article{sage32b2026,
  title={SAGE-32B: A Specialized Agentic Reasoning Model with Inverse Reasoning},
  author={SAGE Team},
  journal={arXiv preprint},
  year={2026},
  url={https://huggingface.co/sagea-ai/sage-reasoning-32b}
}
Contact
For questions or support, please open an issue on our GitHub repository or reach out to our team.