TRIDENT
TRIDENT is a reasoning-focused 4B-parameter language model that improves its own reasoning through algorithmic self-improvement rather than parameter scaling.
The model is built on Qwen3-4B and enhanced using the TRIDENT framework: a combination of GNN-guided Tree-of-Thoughts search, multi-agent reasoning policies, and variance-based self-training.
Overview
Traditional large language model training depends on:
- Human-written reasoning traces
- Manually curated preference datasets
- Static fine-tuning pipelines
TRIDENT removes these dependencies.
Instead, the model (see the sketch after this list):
- Explores multiple reasoning paths
- Evaluates them using a learned GNN policy
- Selects high-uncertainty problems automatically
- Generates its own training supervision
- Distills improvements back into the model using LoRA
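A minimal sketch of one such self-improvement round is shown below. The helper names (`generate_paths`, `gnn_score`, `lora_finetune`) are hypothetical placeholders standing in for Tree-of-Thoughts expansion, the learned GNN policy, and LoRA distillation; they are not the released API.

```python
# Minimal sketch of one TRIDENT-style self-improvement round.
# generate_paths, gnn_score, and lora_finetune are hypothetical placeholders.
import random
import statistics

def generate_paths(model, problem, n):
    # Placeholder: the real system expands a Tree-of-Thoughts for `problem`.
    return [f"{problem} :: candidate path {i}" for i in range(n)]

def gnn_score(path):
    # Placeholder: the real system scores each path with a learned GNN policy.
    return random.random()

def lora_finetune(model, train_pairs):
    # Placeholder: the real system distills curated traces back via LoRA.
    return model

def self_improvement_round(model, problems, n_paths=8, top_k=32):
    scored = []
    for problem in problems:
        paths = generate_paths(model, problem, n_paths)           # explore
        rewards = [gnn_score(p) for p in paths]                   # evaluate
        variance = statistics.pvariance(rewards)                  # inconsistency
        best = paths[max(range(n_paths), key=rewards.__getitem__)]
        scored.append((variance, problem, best))
    # Keep the highest-variance problems, where the learning signal is largest.
    scored.sort(key=lambda item: item[0], reverse=True)
    train_pairs = [(problem, best) for _, problem, best in scored[:top_k]]
    return lora_finetune(model, train_pairs)                      # distill

updated_model = self_improvement_round(model=None, problems=["What is 17 * 24?"])
```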
```yaml
model-index:
- name: TRIDENT
  results:
  - task:
      type: text-generation
    dataset:
      name: GSM8K
      type: gsm8k
      split: test
    metrics:
    - type: accuracy
      value: 86.58
  - task:
      type: text-generation
    dataset:
      name: MMLU
      type: mmlu
      split: test
    metrics:
    - type: accuracy
      value: 72.61
  - task:
      type: text-generation
    dataset:
      name: GPQA
      type: gpqa
      split: test
    metrics:
    - type: accuracy
      value: 42.42
  - task:
      type: text-generation
    dataset:
      name: ARC-Challenge
      type: arc-challenge
      split: test
    metrics:
    - type: accuracy
      value: 59.0
```
Core Capabilities
GNN-Guided Tree-of-Thoughts
Reasoning is represented as a directed graph of intermediate states.
A 3-layer Graph Convolutional Network predicts a promise score for each branch, guiding exploration and pruning.
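As an illustration only, here is a plain-NumPy sketch of 3-layer GCN promise scoring over a small reasoning graph. The actual node features, weights, and training objective of TRIDENT's policy are not specified in this card, so the shapes and values below are assumptions.

```python
# Illustrative 3-layer GCN promise scoring; not TRIDENT's actual policy code.
import numpy as np

def normalize_adjacency(adj):
    # Treat edges as undirected for message passing, add self-loops,
    # then apply symmetric normalization: D^{-1/2} (A + A^T + I) D^{-1/2}.
    a_hat = adj + adj.T + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_promise_scores(adj, node_features, weights):
    # `weights` holds three weight matrices, one per GCN layer.
    a_norm = normalize_adjacency(adj)
    h = node_features
    for i, w in enumerate(weights):
        h = a_norm @ h @ w
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)  # ReLU between layers
    # Final layer maps each node (reasoning state) to a scalar promise score.
    return 1.0 / (1.0 + np.exp(-h.squeeze(-1)))

# Toy reasoning graph: root thought -> two intermediate steps -> one conclusion.
adj = np.array([[0, 1, 1, 0],
                [0, 0, 0, 1],
                [0, 0, 0, 1],
                [0, 0, 0, 0]], dtype=float)
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 16))          # assumed node embeddings
weights = [rng.normal(size=(16, 32)),
           rng.normal(size=(32, 32)),
           rng.normal(size=(32, 1))]
print(gcn_promise_scores(adj, features, weights))  # one promise score per node
```

Branches whose nodes receive low promise scores can then be pruned before further expansion.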
Multi-Agent Reasoning
Four internal agents (Conservative, Exploratory, Balanced, Reflective) vote on reasoning actions to balance exploration and correctness.
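A hypothetical sketch of the four-agent vote follows. The agents' actual decision rules and any weighting between them are not described in this card; the risk-weighted utility below is purely an assumption used to illustrate how differing profiles can disagree and still reach a majority decision.

```python
# Illustrative four-agent majority vote over candidate reasoning actions.
from collections import Counter

AGENT_PROFILES = {
    "conservative": {"risk": 0.1},   # favors high-promise, safe steps
    "exploratory":  {"risk": 0.9},   # favors novel, uncertain steps
    "balanced":     {"risk": 0.5},
    "reflective":   {"risk": 0.3},
}

def agent_vote(agent_risk, candidates):
    # Each candidate carries a promise score (e.g. from the GNN) and a novelty score.
    def utility(c):
        return (1 - agent_risk) * c["promise"] + agent_risk * c["novelty"]
    return max(candidates, key=utility)["action"]

def select_action(candidates):
    votes = Counter(agent_vote(p["risk"], candidates) for p in AGENT_PROFILES.values())
    return votes.most_common(1)[0][0]   # majority action wins

candidates = [
    {"action": "expand_step_3",       "promise": 0.85, "novelty": 0.35},
    {"action": "backtrack_to_step_1", "promise": 0.35, "novelty": 0.80},
]
print(select_action(candidates))  # expand_step_3 (3 of 4 agents agree)
```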
Variance-Based Curriculum
Problems are selected for training based on reward variance, targeting examples where the model is inconsistent and learning signal is highest.
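For intuition, here is a small worked example of the selection criterion under an assumed binary (correct/incorrect) task reward: a problem the model always solves, or never solves, has zero reward variance and little to teach, while a problem it answers inconsistently is selected for training. The threshold below is an illustrative assumption, not a published hyperparameter.

```python
# Illustrative variance-based selection over binary task rewards.
from statistics import pvariance

reward_samples = {
    "always_solved": [1, 1, 1, 1, 1, 1, 1, 1],   # variance 0.0  -> skip
    "never_solved":  [0, 0, 0, 0, 0, 0, 0, 0],   # variance 0.0  -> skip
    "inconsistent":  [1, 0, 1, 1, 0, 0, 1, 0],   # variance 0.25 -> train on this
}

threshold = 0.1  # assumed cutoff
selected = [name for name, rewards in reward_samples.items()
            if pvariance(rewards) > threshold]
print(selected)  # ['inconsistent']
```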
Self-Generative Reasoning Loop
No human-annotated reasoning traces are used.
The model autonomously generates, evaluates, and curates its own reasoning data.
Stable Training
A multi-layer reward stabilization mechanism prevents:
- Reward collapse
- Loss explosions
- Infinite failure loops
The architecture is compatible with future GRPO-style reinforcement learning.
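The card does not detail the stabilization mechanism itself. The sketch below is one plausible way to layer such guards (reward clipping, an EMA baseline, and a retry cap against infinite failure loops), shown purely as an assumption rather than TRIDENT's actual implementation.

```python
# Assumed illustration of layered reward stabilization; not TRIDENT's actual code.

class RewardStabilizer:
    def __init__(self, clip=5.0, ema_decay=0.99, max_failures=3):
        self.clip = clip                  # layer 1: bound extreme rewards
        self.ema_decay = ema_decay        # layer 2: EMA baseline against collapse
        self.max_failures = max_failures  # layer 3: cap retries on failing problems
        self.baseline = 0.0
        self.failures = {}

    def stabilize(self, problem_id, raw_reward):
        # Layer 1: clip so a single outlier cannot blow up the loss.
        reward = max(-self.clip, min(self.clip, raw_reward))
        # Layer 2: subtract a slowly moving baseline to keep the signal centered.
        self.baseline = self.ema_decay * self.baseline + (1 - self.ema_decay) * reward
        advantage = reward - self.baseline
        # Layer 3: retire problems that keep failing instead of retrying forever.
        if reward <= 0:
            self.failures[problem_id] = self.failures.get(problem_id, 0) + 1
            if self.failures[problem_id] >= self.max_failures:
                return None  # signal the curriculum to drop this problem
        else:
            self.failures.pop(problem_id, None)
        return advantage
```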
Benchmark Results
Accuracy comparison against the base model:
| Benchmark | Qwen3-4B | TRIDENT |
|---|---|---|
| GSM8K (5-shot) | 74.14 | 86.58 |
| MMLU (5-shot) | 47.70 | 72.61 |
| ARC-C (25-shot) | 54.0 | 59.0 |
| GPQA (0-shot) | 28.28 | 42.42 |
| Winogrande (0-shot) | 59.6 | 67.08 |
| TruthfulQA (0-shot) | 54.9 | 54.7 |
Highlight:
+14.14 percentage point improvement on GPQA (0-shot).
Intended Use
TRIDENT is suitable for:
- Multi-step mathematical reasoning
- Scientific and logical inference
- Hard QA benchmarks
- Planning and hypothesis exploration
- Research on reasoning systems
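For these use cases, loading should follow the standard Hugging Face Transformers pattern. The repository id below is a placeholder (`<org>/TRIDENT`, hypothetical), since the exact model id is not stated in this card.

```python
# Generic Transformers usage; replace the placeholder repo id with the
# actual TRIDENT repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<org>/TRIDENT"  # placeholder, hypothetical
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user",
             "content": "A train travels 120 km in 1.5 hours. What is its average speed?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```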
Limitations
- Higher inference-time compute than single-pass models
- Not optimized for low-latency chat
- Best used where reasoning depth matters more than speed
Ethical Considerations
- No human-written reasoning traces used
- No preference data collection
- Training relies on verifiable task rewards
- Like all LLMs, TRIDENT may hallucinate outside structured reasoning workflows
Paper link
https://www.shivik.in/shivik-labs/trident
Citation
```bibtex
@article{puri2025trident,
  title={TRIDENT: Thought-based Reasoning and Improvement through Deep Exploration of Neuronal Trees},
  author={Puri, Shivansh and Khandelwal, Abhisek and Joshi, Vedant and Yadav, Akash},
  year={2025}
}
```