|
|
--- |
|
|
license: other |
|
|
license_name: raml-v1.0 |
|
|
pipeline_tag: text-generation |
|
|
tags: |
|
|
- model_hub_mixin |
|
|
- pytorch_model_hub_mixin |
|
|
- RxNN |
|
|
- RxLM |
|
|
- ReactiveTransformer |
|
|
- Event-Driven |
|
|
- MemorySystem |
|
|
- ShortTermMemory |
|
|
- Real-Time |
|
|
- ReactiveLanguageModel |
|
|
- RealTimeLanguageModel |
|
|
language: |
|
|
- en |
|
|
datasets: |
|
|
- HuggingFaceFW/fineweb-edu |
|
|
- wikimedia/wikipedia |
|
|
- HuggingFaceFW/clean-wikipedia |
|
|
- ReactiveAI/smol-smoltalk-Interaction-SFT |
|
|
- ReactiveAI/cosmopedia-100k-Interaction-SFT |
|
|
- ReactiveAI/Real-Chat-SMAT |
|
|
- ReactiveAI/Real-Chat-No-System-SMAT |
|
|
library_name: RxLM |
|
|
gated: true |
|
|
extra_gated_prompt: >- |
|
|
Accept [Reactive AI Model & Architecture License (RAML) |
|
|
v1.0](https://github.com/RxAI-dev/rxlm/blob/main/MODELS_LICENSE.md) terms to |
|
|
  access the repository and use the model. Reactive Transformer (pending patent
|
|
#P.453260) is available for free for non-commercial usage. For commercial |
|
|
usage please contact Reactive AI at licensing@rxai.dev |
|
|
extra_gated_fields: |
|
|
Company: text |
|
|
Country: country |
|
|
I want to use this model for: |
|
|
type: select |
|
|
options: |
|
|
- Research |
|
|
- Education |
|
|
- label: Other |
|
|
value: other |
|
|
I agree to use this model for non-commercial use ONLY: checkbox |
|
|
extra_gated_heading: >- |
|
|
You need to agree to use this model only for research or education purposes |
|
|
under Reactive AI Model & Architecture License (RAML) v1.0 |
|
|
extra_gated_description: The repository will be available instantly after accepting the license terms
|
|
extra_gated_button_content: Accept license terms |
|
|
--- |
|
|
|
|
|
# RxT-Beta-Micro-Supervised 290M |
|
|
World's first experimental real-time **Reactive Language Model (RxLM)**, trained on limited real-world data (after the synthetic RxT-Alpha generation). It's based on the revolutionary **Reactive Transformer** architecture - processing only single interactions/messages, with all the context moved to **Short-Term Memory**, managed by an **Attention-Based Memory System**.
|
|
|
|
|
> Docs in progress |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
The first **Reactive Language Model (RxLM)** trained on limited real-world datasets, based on the **Reactive Transformer (RxT)** architecture.
|
|
|
|
|
**RxLMs** have linear computational/inference cost scaling (`O(NT)`) compared to the quadratic growth of **LLMs** (`O(N²T)`), where `N` is the number of messages in a conversation and `T` is the number of tokens in a single interaction. Thanks to that scaling, they are roughly `N` times faster and cheaper than **LLMs**.
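
A rough back-of-the-envelope illustration of that scaling (a minimal sketch with made-up example values, not a benchmark):

```python
# Total tokens processed over a whole conversation, assuming every
# interaction (user query + model answer) is about T tokens long.
T = 1024   # tokens per interaction (example value)
N = 20     # number of interactions in the conversation (example value)

# Stateless LLM: reprocesses the full history on every turn -> O(N^2 * T)
llm_tokens = sum(n * T for n in range(1, N + 1))   # 1*T + 2*T + ... + N*T

# Reactive Transformer: processes only the current interaction -> O(N * T)
rxt_tokens = N * T

print(llm_tokens, rxt_tokens, llm_tokens / rxt_tokens)   # 215040 20480 10.5
```

The advantage grows linearly with the conversation length: the longer the chat, the larger the gap.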
|
|
|
|
|
That's not the only advantage - event-driven, real-time processing with memory is far more natural and human-like than the data-driven approach of LLMs (reprocessing the full conversation history every time). It's a crucial milestone in the development of AGI and awareness models.
|
|
|
|
|
> This is the _Supervised_ version of the model with a "weak" memory system - the result of Supervised Memory System Training (SMST). It's
> able to remember information between interactions (without passing it explicitly in the prompt/chat template), but it
> has to be refined in the next Memory Reinforcement Learning (MRL) stage for full functionality.
|
|
|
|
|
After successful experiments with simple synthetic datasets, we moved to real-world data, but this model was still pre-trained on a limited amount of English-only data - only 10B tokens from Wikipedia and FineWeb-Edu (plus 2B tokens in later stages). Its general knowledge is therefore limited, and it should be fine-tuned for a specialization - for example, we trained [RxT-Beta-Micro-Supervised-AI](https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised-AI) on AI/Data Science knowledge-based chats.
|
|
|
|
|
|
|
|
### Reactive Transformer Architecture |
|
|
An experimental research model made to test our Reactive Transformer architecture and Attention-Based Memory System.
|
|
|
|
|
The Reactive Transformer has additional Short-Term Memory layers, connected to the model with Memory Cross-Attention and updated by the Memory Encoder and Memory Attention.
The Short-Term Memory state is kept between interactions/events (single messages), not between tokens in a sequence - that's the key difference between RxNNs and RNNs.
|
|
|
|
|
The goal of the architecture is to process only single messages and keep the conversation history in Short-Term Memory - we believe that this is a key requirement for awareness and AGI. Processing the whole chat history on every interaction is not natural and is not how human awareness works. The Reactive Transformer architecture is therefore a first step in the transition from language models to awareness models.
|
|
|
|
|
To balance the number of parameters, the decoder is based on a Mixture-of-Experts architecture, while the encoder uses regular dense feed-forward layers. This model uses the gated self/interlayer variant of the memory attention network with sigmoid residual gates.
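
For illustration, a minimal sketch of an elementwise sigmoid residual gate over STM slots (a plain-PyTorch approximation; the class name, gate inputs, and wiring are assumptions, not the RxLM implementation):

```python
import torch
import torch.nn as nn

class GatedSTMResidual(nn.Module):
    """Elementwise sigmoid residual gate: blends old STM slots with the update
    produced by memory attention (illustrative, not the RxLM code)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(2 * dim, dim)  # gate computed from old state + update (assumption)

    def forward(self, stm_old: torch.Tensor, update: torch.Tensor) -> torch.Tensor:
        # stm_old, update: (batch, slots, dim)
        gate = torch.sigmoid(self.gate_proj(torch.cat([stm_old, update], dim=-1)))
        return gate * update + (1.0 - gate) * stm_old  # per-slot interpolation

# Example with this model's per-layer STM shape: 1024 slots of width 256
gate = GatedSTMResidual(dim=256)
new_stm = gate(torch.randn(1, 1024, 256), torch.randn(1, 1024, 256))
```

The gate lets each STM slot decide how much of the new interaction to absorb and how much of its previous content to keep.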
|
|
|
|
|
<img src="https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised/resolve/main/reactive-transformer-self-interlayer.png" width="800" /> |
|
|
|
|
|
#### Architecture details: |
|
|
- dim: 256 |
|
|
- layers: 14 |
|
|
- heads (for split): 16 |
|
|
- **Decoder:** |
|
|
- 2 initial stateless layers: Dense and MoE |
|
|
- self-attention: Sparse Query Attention |
|
|
- query heads: 8/16 |
|
|
- key/value heads: 4/16 |
|
|
- memory cross-attention: Sparse Query Attention |
|
|
- query heads: 8/16 |
|
|
- key/value heads: 4/16 |
|
|
- Mixture-of-Experts Feed Forward |
|
|
- experts: 42 |
|
|
- active experts: 4 |
|
|
- SwiGLU feed forward with 512 dim |
|
|
  - size: ~269M (~44M activated)
|
|
- **Encoder:** |
|
|
- self-attention: symmetric Sparse Query Attention |
|
|
- query/key/value heads: 8/16 |
|
|
- SwiGLU feed forward with 768 dim |
|
|
- size: ~18.3M |
|
|
- **Memory Attention:** |
|
|
- variant: **Gated Self/Interlayer Memory Attention** |
|
|
- attention layers: symmetric Sparse Query Attention |
|
|
- query/key/value heads: 8/16 |
|
|
- residual gate: elementwise with sigmoid activation (per STM slot) |
|
|
- size: ~3.73M |
|
|
- RoPE for self-attention, memory cross-attention (query only) and memory attention (key only) |
|
|
- RMS Norm for all normalization layers |
|
|
- vocab: 32k (english only) |
|
|
- interaction (query + answer) length: 1024 tokens |
|
|
- STM size: 14 layers * 1024 slots (* 256 dim) |
|
|
- context/messages: **Infinite** |
|
|
- size: ~290M |
|
|
- Library: RxLM |
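
The decoder, encoder, and memory attention listed above all use Sparse Query Attention (SQA), which reduces the number of query heads (and, in the non-symmetric variant, also key/value heads) relative to the full head count to cut attention compute. Below is a minimal sketch with this model's head counts (illustrative only; class and argument names are assumptions, not the RxLM implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseQueryAttention(nn.Module):
    """Minimal SQA sketch: 8 of 16 query heads and 4 of 16 key/value heads,
    so attention FLOPs scale with the reduced query-head count.
    Illustrative only - not the RxLM implementation."""
    def __init__(self, dim: int = 256, heads: int = 16, q_heads: int = 8, kv_heads: int = 4):
        super().__init__()
        self.head_dim = dim // heads                  # 16 for this model
        self.q_heads, self.kv_heads = q_heads, kv_heads
        self.q_proj = nn.Linear(dim, q_heads * self.head_dim)
        self.k_proj = nn.Linear(dim, kv_heads * self.head_dim)
        self.v_proj = nn.Linear(dim, kv_heads * self.head_dim)
        self.out_proj = nn.Linear(q_heads * self.head_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.q_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.kv_heads, self.head_dim).transpose(1, 2)
        # Repeat KV heads to match the reduced number of query heads (as in GQA)
        k = k.repeat_interleave(self.q_heads // self.kv_heads, dim=1)
        v = v.repeat_interleave(self.q_heads // self.kv_heads, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v)
        return self.out_proj(attn.transpose(1, 2).reshape(b, t, -1))

out = SparseQueryAttention()(torch.randn(2, 1024, 256))   # -> (2, 1024, 256)
```

The symmetric variant used in the encoder and memory attention sets `q_heads == kv_heads` (8/16 each).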
|
|
--- |
|
|
- **Developed by:** [Adam Filipek](https://huggingface.co/AdamF92) & [Reactive AI](https://huggingface.co/ReactiveAI) |
|
|
- **Funded by:** [Reactive AI](https://huggingface.co/ReactiveAI) |
|
|
- **Model type:** **Reactive Language Model (RxLM)** |
|
|
- **Language(s) (NLP):** English |
|
|
- **License:** [Reactive AI Model & Architecture License (RAML) v1.0](https://github.com/RxAI-dev/rxlm/blob/main/MODELS_LICENSE.md) |
|
|
|
|
|
### Model Sources |
|
|
- **Repository:** [RxLM Framework](https://github.com/RxAI-dev/rxlm) |
|
|
- **Paper:** [Reactive Transformer (RxT) - Stateful Real-Time Processing for Event-Driven Reactive Language Models](https://arxiv.org/abs/2510.03561) |
|
|
- **Demo:** In progress |
|
|
|
|
|
## Uses |
|
|
This model is still experimental and was pre-trained on a limited corpus of only 10B tokens, so its general knowledge is also limited. It's recommended
to further fine-tune the model for a specialization, like our [RxT-Beta-Micro-Supervised-AI](https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised-AI),
which is trained on AI/Data Science conversations.
|
|
|
|
|
**Supervised** RxT models are partially functional, intermediate-stage models - it's recommended to refine them with Memory Reinforcement Learning (MRL) and Reactive
Reinforcement Learning from Human Feedback (RxRLHF) to reach the final stage.
|
|
|
|
|
### Direct Use |
|
|
It's not recommended to use this model directly without additional specialization training or reinforcement learning stages. |
|
|
|
|
|
**Reactive Transformer** models are made for conversational tasks, especially chatbots or as a stateful base for agentic systems. |
|
|
|
|
|
### Downstream Use |
|
|
Because of the limited pre-training data, it's recommended to further fine-tune the model for a specialization. For example, we trained
[RxT-Beta-Micro-Supervised-AI](https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised-AI) on AI/Data Science conversations.
|
|
|
|
|
### Out-of-Scope Use |
|
|
**Reactive Transformer** models are natively conversational and made for multi-step tasks. They aren't typical Gen AI models and aren't made
for single-step generative tasks (like summarization, dataset generation, etc.) - they will work in those scenarios, but it would be a waste
of computational resources (initializing/processing memory when it's not needed). For such cases it's better to use a stateless LLM.
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
The model is still experimental, made to test the **Reactive Transformer** architecture on real-world data after successful experiments with simple synthetic data.
It was pre-trained on only 10B tokens (plus an additional 2B in later stages), so its general knowledge is limited and responses could be inaccurate.
|
|
|
|
|
Conversation context is theoretically infinite (the 1024-token limit applies only to a single interaction), but after some number of messages the model will slowly forget
outdated information - that's why it's called **Short-Term Memory**. It will be extended in upcoming generations with **Long-Term Memory** for truly infinite context.
|
|
|
|
|
### Recommendations |
|
|
As mentioned before, supervised models are at an intermediate stage, and it's recommended to continue training with the reinforcement learning stages. It's also recommended
to fine-tune this base model for a specialization.
|
|
|
|
|
## How to Get Started with the Model |
|
|
The model can be loaded and used with our RxLM framework (https://github.com/RxAI-dev/RxLM):
|
|
|
|
|
```python |
|
|
import torch |
|
|
from rxlm.rxt.models import RxTBeta |
|
|
from rxlm.training.tokenizer import load_tokenizer_from_hf_hub |
|
|
|
|
|
tokenizer = load_tokenizer_from_hf_hub('ReactiveAI/RxT-Beta-Micro') |
|
|
|
|
|
model = RxTBeta.from_pretrained('ReactiveAI/RxT-Beta-Micro-Supervised', tokenizer=tokenizer) |
|
|
model.share_components() # currently required to connect embeddings/STM |
|
|
|
|
|
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') |
|
|
model.to(device) |
|
|
|
|
|
seq_len = 1024 |
|
|
|
|
|
# Memory init - plays a role similar to an LLM "system prompt" (not recommended for this model, as it wasn't trained with system prompts)
|
|
stm_init_state = model.tokenize_full_interaction('System prompt like', 'Initial memory for the model', max_seq_len=seq_len, device=device) |
|
|
model.init_stm_state(**stm_init_state) |
|
|
|
|
|
# Helper function: streams the answer token by token; special ids -1/-2 mark the start and end of the asynchronous memory update
|
|
def interaction(query: str): |
|
|
tokenized_query = model.tokenize_query(query, max_seq_len=seq_len, device=device) |
|
|
for token_id in model.interact(**tokenized_query, max_seq_len=seq_len, temperature=1.0): |
|
|
if token_id == -1: print('\n', '[Start memory update...]') |
|
|
elif token_id == -2: print('[Memory updated]') |
|
|
else: |
|
|
txt_token = model.stringify_token(token_id) |
|
|
print(txt_token, end='') |
|
|
|
|
|
# Process first interaction |
|
|
interaction('Hello! Who are you?') |
|
|
# Process follow-up interaction |
|
|
interaction('Follow-up question?') |
|
|
|
|
|
``` |
|
|
|
|
|
## Training Details |
|
|
The stateful, real-time nature of the **Reactive Transformer** architecture, especially the asynchronous memory update, requires an advanced training pipeline with multiple
supervised and reinforcement learning stages:
|
|
- Supervised: |
|
|
- Joint Language Models Pre-Training | raw large text corpora |
|
|
- Interaction Supervised Fine-Tuning | single, not connected interactions (query + answer) |
|
|
- Self-Supervised Memory Attention Pre-Training | multi-step conversations (SMAT datasets) |
|
|
- Supervised Memory-Aware Training (SMAT) | multi-step conversations |
|
|
- Reinforcement: |
|
|
- Memory Reinforcement Learning (MRL) | multi-step conversations |
|
|
- Reactive Reinforcement Learning from Human Feedback (RxRLHF) | multi-step conversations |
|
|
|
|
|
|
|
|
### Training Data |
|
|
We used public open-source datasets for pre-training and our custom datasets (converted from public datasets) for other stages: |
|
|
- Joint Language Models Pre-Training |
|
|
- 'sample-10BT' subset from [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) |
|
|
- '20231101.en' subset from [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) |
|
|
- Interaction SFT |
|
|
- [ReactiveAI/smol-smoltalk-Interaction-SFT](https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-Interaction-SFT) |
|
|
- [ReactiveAI/cosmopedia-100k-Interaction-SFT](https://huggingface.co/datasets/ReactiveAI/cosmopedia-100k-Interaction-SFT) |
|
|
- Self-Supervised Memory Attention Pre-Training |
|
|
- 30% of [ReactiveAI/Real-Chat-SMAT](https://huggingface.co/datasets/ReactiveAI/Real-Chat-SMAT) |
|
|
- Supervised Memory-Aware Training (SMAT) |
|
|
- [ReactiveAI/Real-Chat-SMAT](https://huggingface.co/datasets/ReactiveAI/Real-Chat-SMAT) |
|
|
- [ReactiveAI/Real-Chat-No-System-SMAT](https://huggingface.co/datasets/ReactiveAI/Real-Chat-No-System-SMAT) |
|
|
|
|
|
|
|
|
### Training Procedure |
|
|
Supervised Memory System Training includes 4 steps before proceeding to the Reinforcement Learning stages.
|
|
|
|
|
#### Joint Language Models Pre-Training |
|
|
The decoder was trained together with the encoder and an additional MLM head model, using Joint LM Training (with MLM and autoregressive losses),
on the [**HuggingFaceFW/fineweb-edu**](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) and [**wikimedia/wikipedia**](https://huggingface.co/datasets/wikimedia/wikipedia) datasets.
Both encoder and decoder use a shared embedding layer.
|
|
|
|
|
#### Supervised Fine-Tuning |
|
|
The **RxT-Beta Micro** model was fine-tuned to the real-time interaction (sequence) format on our datasets, derived from HuggingFace ones:
|
|
- [**ReactiveAI/smol-smoltalk-Interaction-SFT**](https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-Interaction-SFT) |
|
|
- [**ReactiveAI/cosmopedia-100k-Interaction-SFT**](https://huggingface.co/datasets/ReactiveAI/cosmopedia-100k-Interaction-SFT)
|
|
|
|
|
The model was fine-tuned using the Joint LM Training mode (for memory cross-attention pre-training), as sketched below:
- encode the data with the encoder and calculate the MLM loss for it
- save the encoder layers' outputs as Short-Term Memory (available to the decoder via memory cross-attention)
- process the data with the decoder and calculate the autoregressive loss
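
A minimal sketch of one such joint step (hedged pseudocode in plain PyTorch; the component interfaces, `stm=` argument, and equal loss weighting are assumptions, not the RxLM API):

```python
import torch.nn.functional as F

def joint_lm_step(encoder, decoder, mlm_head, batch):
    # 1. Encode the (masked) interaction and compute the MLM loss.
    enc_hidden = encoder(batch["masked_input_ids"])          # assumed: list of per-layer states
    mlm_logits = mlm_head(enc_hidden[-1])                    # (batch, seq, vocab)
    mlm_loss = F.cross_entropy(mlm_logits.transpose(1, 2), batch["mlm_labels"], ignore_index=-100)

    # 2. Use the encoder layers' outputs as the STM read by memory cross-attention.
    stm = [h.detach() for h in enc_hidden]

    # 3. Autoregressive loss for the decoder, conditioned on that STM.
    logits = decoder(batch["input_ids"], stm=stm)            # (batch, seq, vocab)
    ar_loss = F.cross_entropy(logits[:, :-1].transpose(1, 2), batch["input_ids"][:, 1:])

    return mlm_loss + ar_loss                                # joint objective (weighting assumed)
```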
|
|
|
|
|
That training results in a decoder with ~95% accuracy, because it has access to information about all the next tokens through memory cross-attention. In the next training stages it
will instead access previous-interaction data through those layers.
|
|
|
|
|
#### Self-Supervised Memory Attention Pre-Training |
|
|
Memory Attention was pre-trained to combine accumulated Short-Term Memory states with the next interaction's data processed by the
encoder, using a weighted mean (with randomized, arbitrary weights) as labels and negative cosine similarity as the loss. The label weights
depend on the inner step:
|
|
- the first step, when the STM is in its initial random normal state, uses 90% of the new encoded data
- follow-up steps use `50% - step * 5%` of the new encoded data
- each step can have 0-15% random differences in the weights
|
|
|
|
|
Additionally, random noise is added to both inputs and labels. |
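
A minimal sketch of that self-supervised objective (illustrative; the exact jitter range, clamping, and noise handling in RxLM may differ):

```python
import random
import torch
import torch.nn.functional as F

def new_data_weight(step: int) -> float:
    # Share of the newly encoded interaction in the pseudo-label:
    # 90% on the first step, then 50% - step * 5%, with up to 15% random difference.
    base = 0.9 if step == 0 else 0.5 - step * 0.05
    return min(max(base + random.uniform(-0.15, 0.15), 0.0), 1.0)

def memory_attention_pretrain_loss(stm_prev, encoded_new, stm_updated, step: int):
    w = new_data_weight(step)
    label = w * encoded_new + (1.0 - w) * stm_prev                    # weighted-mean pseudo-label
    return -F.cosine_similarity(stm_updated, label, dim=-1).mean()    # negative cosine similarity

# Example shapes: (batch, slots, dim) STM tensors
loss = memory_attention_pretrain_loss(
    torch.randn(1, 1024, 256), torch.randn(1, 1024, 256), torch.randn(1, 1024, 256), step=2
)
```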
|
|
|
|
|
This model was trained on six arbitrarily selected steps, for a single epoch, on 30% of the [**ReactiveAI/Real-Chat-SMAT**](https://huggingface.co/datasets/ReactiveAI/Real-Chat-SMAT) dataset.
|
|
|
|
|
#### Supervised Memory-Aware Training |
|
|
Finally, with the pre-trained/fine-tuned components, in the last supervised stage the model is trained to use previous/accumulated STM
states as the memory cross-attention input, instead of the same sequence as the decoder's input:
|
|
- the previous (or first) interaction is processed by the encoder and used to update memory
- the next interaction is processed by the decoder, using related information from the STM
- the loss is calculated from the decoder's logits, and gradients propagate through the memory attention to the encoder
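
A minimal sketch of one SMAT step over a multi-turn conversation (hedged pseudocode; component names and call signatures are assumptions, not the RxLM API):

```python
import torch.nn.functional as F

def smat_conversation_loss(encoder, memory_attention, decoder, stm, interactions):
    """interactions: list of tokenized `[Q] ... [A] ...` tensors for one conversation."""
    total_loss = 0.0
    for prev_ids, next_ids in zip(interactions[:-1], interactions[1:]):
        # 1. Encode the previous interaction and merge it into the accumulated STM.
        stm = memory_attention(stm, encoder(prev_ids))
        # 2. Decode the next interaction while reading from that STM.
        logits = decoder(next_ids, stm=stm)
        # 3. Autoregressive loss; gradients flow back through memory attention into the encoder.
        total_loss = total_loss + F.cross_entropy(logits[:, :-1].transpose(1, 2), next_ids[:, 1:])
    return total_loss / (len(interactions) - 1)
```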
|
|
|
|
|
We used staged memory-aware training with different datasets:
- starting with 2 epochs on 80k raw examples (7 interactions each) - [**ReactiveAI/Real-Chat-SMAT**](https://huggingface.co/datasets/ReactiveAI/Real-Chat-SMAT)
- then 5 epochs on 27k filtered, higher-quality examples - [**ReactiveAI/Real-Chat-No-System-SMAT**](https://huggingface.co/datasets/ReactiveAI/Real-Chat-No-System-SMAT)
|
|
|
|
|
#### Preprocessing |
|
|
Pre-training is done on raw text corpora and requires only tokenization. In the next stages, the model processes sequences in a simple **Interaction format**, used
instead of complex chat templates: `[Q] User's query... [A] Model's answer`. For upcoming reasoning models, it will be extended to `[Q] User's query... [T] Reasoning... [A] Model's answer`.
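
For illustration, building a single training example in that interaction format (a trivial sketch using only the special tokens described above):

```python
def format_interaction(query: str, answer: str) -> str:
    # Single-interaction sequence used in SFT and the memory training stages.
    return f"[Q] {query} [A] {answer}"

print(format_interaction("What is a Reactive Transformer?", "A stateful, event-driven transformer..."))
```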
|
|
|
|
|
|
|
|
#### Training Hyperparameters |
|
|
- **Training regime:** bf16 mixed precision (AMP autocast) |
|
|
- **Optimizer**: AdamW |
|
|
- **Scheduler**: Cosine annealing |
|
|
|
|
|
## Evaluation |
|
|
Evaluation is in progress - more details soon! |
|
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
|
|
#### Testing Data |
|
|
|
|
|
<!-- This should link to a Dataset Card if possible. --> |
|
|
|
|
|
[More Information Needed] |
|
|
|
|
|
#### Factors |
|
|
|
|
|
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. --> |
|
|
|
|
|
[More Information Needed] |
|
|
|
|
|
#### Metrics |
|
|
|
|
|
<!-- These are the evaluation metrics being used, ideally with a description of why. --> |
|
|
|
|
|
[More Information Needed] |
|
|
|
|
|
### Results |
|
|
|
|
|
[More Information Needed] |
|
|
|
|
|
#### Summary |
|
|
|
|
|
|
|
|
## Environmental Impact |
|
|
- **Hardware Type:** 4x NVIDIA A100 40GB |
|
|
- **Hours used:** 150 |
|
|
|
|
|
## Model Card Contact |
|
|
[Adam Filipek](https://huggingface.co/AdamF92) - adamfilipek@rxai.dev |
|
|
|
|
|
Licensing - licensing@rxai.dev