|
|
--- |
|
|
license: other |
|
|
license_name: raml-v1.0 |
|
|
pipeline_tag: text-generation |
|
|
tags: |
|
|
- model_hub_mixin |
|
|
- pytorch_model_hub_mixin |
|
|
- RxNN |
|
|
- RxLM |
|
|
- ReactiveTransformer |
|
|
- Event-Driven |
|
|
- MemorySystem |
|
|
- ShortTermMemory |
|
|
- Real-Time |
|
|
- ReactiveLanguageModel |
|
|
- RealTimeLanguageModel |
|
|
language: |
|
|
- en |
|
|
datasets: |
|
|
- HuggingFaceFW/fineweb-edu |
|
|
- wikimedia/wikipedia |
|
|
- HuggingFaceFW/clean-wikipedia |
|
|
- ReactiveAI/smol-smoltalk-Interaction-SFT |
|
|
- ReactiveAI/cosmopedia-100k-Interaction-SFT |
|
|
- ReactiveAI/Real-Chat-SMAT |
|
|
- ReactiveAI/Real-Chat-No-System-SMAT |
|
|
library_name: RxLM |
|
|
gated: true |
|
|
extra_gated_prompt: >- |
|
|
Accept [Reactive AI Model & Architecture License (RAML) |
|
|
v1.0](https://github.com/RxAI-dev/rxlm/blob/main/MODELS_LICENSE.md) terms to |
|
|
  access the repository and use the model. Reactive Transformer (pending patent
|
|
#P.453260) is available for free for non-commercial usage. For commercial |
|
|
usage please contact Reactive AI at licensing@rxai.dev |
|
|
extra_gated_fields: |
|
|
Company: text |
|
|
Country: country |
|
|
I want to use this model for: |
|
|
type: select |
|
|
options: |
|
|
- Research |
|
|
- Education |
|
|
- label: Other |
|
|
value: other |
|
|
I agree to use this model for non-commercial use ONLY: checkbox |
|
|
extra_gated_heading: >- |
|
|
You need to agree to use this model only for research or education purposes |
|
|
under Reactive AI Model & Architecture License (RAML) v1.0 |
|
|
extra_gated_description: The repository will be available instantly after accepting the license terms
|
|
extra_gated_button_content: Accept license terms |
|
|
--- |
|
|
|
|
|
# RxT-Beta-Micro-Supervised 290M |
|
|
World's first experimental real-time **Reactive Language Model (RxLM)**, trained on limited real-world data (after the synthetic RxT-Alpha generation). It's based on the revolutionary **Reactive Transformer** architecture - processing only single interactions/messages, with all the context moved to **Short-Term Memory**, managed by an **Attention-Based Memory System**.
|
|
|
|
|
> Docs in progress |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
The first **Reactive Language Model (RxLM)** trained on limited real-world datasets, based on the **Reactive Transformer (RxT)** architecture.
|
|
|
|
|
**RxLMs** have linear computational/inference cost scaling (`O(NT)`) compared to the quadratic growth of **LLMs** (`O(N²T)`), where `N` is the number of messages in a conversation and `T` is the number of tokens in a single interaction. Thanks to that scaling, they are roughly `N` times faster and cheaper than **LLMs**.
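
A rough back-of-the-envelope illustration of that scaling (a minimal sketch with made-up example values, not a benchmark):

```python
# Total tokens processed over a whole conversation, assuming every
# interaction (user query + model answer) is about T tokens long.
T = 1024   # tokens per interaction (example value)
N = 20     # number of interactions in the conversation (example value)

# Stateless LLM: reprocesses the full history on every turn -> O(N^2 * T)
llm_tokens = sum(n * T for n in range(1, N + 1))   # 1*T + 2*T + ... + N*T

# Reactive Transformer: processes only the current interaction -> O(N * T)
rxt_tokens = N * T

print(llm_tokens, rxt_tokens, llm_tokens / rxt_tokens)   # 215040 20480 10.5
```

The advantage grows linearly with the conversation length: the longer the chat, the larger the gap.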
|
|
|
|
|
That's not the only advantage - event-driven, real-time processing with memory is far more natural and human-like than the data-driven approach of LLMs (reprocessing the full conversation history every time). It's a crucial milestone in the development of AGI and awareness models.
|
|
|
|
|
> This is the _Supervised_ version of the model with a "weak" memory system - the result of Supervised Memory System Training (SMST). It's
> able to remember information between interactions (without passing it explicitly in the prompt/chat template), but it
> has to be refined in the next Memory Reinforcement Learning (MRL) stage for full functionality.
|
|
|
|
|
After successful experiments with simple synthetic datasets, we moved to real-world data, but this model was still pre-trained on a limited amount of English-only data - only 10B tokens from Wikipedia and FineWeb-Edu (plus 2B tokens in later stages). Its general knowledge is therefore limited, and it should be fine-tuned for a specialization - for example, we trained [RxT-Beta-Micro-Supervised-AI](https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised-AI) on AI/Data Science knowledge-based chats.
|
|
|
|
|
|
|
|
### Reactive Transformer Architecture |
|
|
An experimental research model made to test our Reactive Transformer architecture and Attention-Based Memory System.
|
|
|
|
|
The Reactive Transformer has additional Short-Term Memory layers, connected to the model with Memory Cross-Attention and updated by the Memory Encoder and Memory Attention.
The Short-Term Memory state is kept between interactions/events (single messages), not between tokens in a sequence - that's the key difference between RxNNs and RNNs.
|
|
|
|
|
The goal of the architecture is to process only single messages and keep the conversation history in Short-Term Memory - we believe that this is a key requirement for awareness and AGI. Processing the whole chat history on every interaction is not natural and is not how human awareness works. The Reactive Transformer architecture is therefore a first step in the transition from language models to awareness models.
|
|
|
|
|
To balance the number of parameters, the decoder is based on a Mixture-of-Experts architecture, while the encoder uses regular dense feed-forward layers. This model uses the gated self/interlayer variant of the memory attention network with sigmoid residual gates.
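
For illustration, a minimal sketch of an elementwise sigmoid residual gate over STM slots (a plain-PyTorch approximation; the class name, gate inputs, and wiring are assumptions, not the RxLM implementation):

```python
import torch
import torch.nn as nn

class GatedSTMResidual(nn.Module):
    """Elementwise sigmoid residual gate: blends old STM slots with the update
    produced by memory attention (illustrative, not the RxLM code)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(2 * dim, dim)  # gate computed from old state + update (assumption)

    def forward(self, stm_old: torch.Tensor, update: torch.Tensor) -> torch.Tensor:
        # stm_old, update: (batch, slots, dim)
        gate = torch.sigmoid(self.gate_proj(torch.cat([stm_old, update], dim=-1)))
        return gate * update + (1.0 - gate) * stm_old  # per-slot interpolation

# Example with this model's per-layer STM shape: 1024 slots of width 256
gate = GatedSTMResidual(dim=256)
new_stm = gate(torch.randn(1, 1024, 256), torch.randn(1, 1024, 256))
```

The gate lets each STM slot decide how much of the new interaction to absorb and how much of its previous content to keep.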
|
|
|
|
|
<img src="https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised/resolve/main/reactive-transformer-self-interlayer.png" width="800" /> |
|
|
|
|
|
#### Architecture details: |
|
|
- dim: 256 |
|
|
- layers: 14 |
|
|
- heads (for split): 16 |
|
|
- **Decoder:** |
|
|
- 2 initial stateless layers: Dense and MoE |
|
|
- self-attention: Sparse Query Attention |
|
|
- query heads: 8/16 |
|
|
- key/value heads: 4/16 |
|
|
- memory cross-attention: Sparse Query Attention |
|
|
- query heads: 8/16 |
|
|
- key/value heads: 4/16 |
|
|
- Mixture-of-Experts Feed Forward |
|
|
- experts: 42 |
|
|
- active experts: 4 |
|
|
- SwiGLU feed forward with 512 dim |
|
|
  - size: ~269M (~44M activated)
|
|
- **Encoder:** |
|
|
- self-attention: symmetric Sparse Query Attention |
|
|
- query/key/value heads: 8/16 |
|
|
- SwiGLU feed forward with 768 dim |
|
|
- size: ~18.3M |
|
|
- **Memory Attention:** |
|
|
- variant: **Gated Self/Interlayer Memory Attention** |
|
|
- attention layers: symmetric Sparse Query Attention |
|
|
- query/key/value heads: 8/16 |
|
|
- residual gate: elementwise with sigmoid activation (per STM slot) |
|
|
- size: ~3.73M |
|
|
- RoPE for self-attention, memory cross-attention (query only) and memory attention (key only) |
|
|
- RMS Norm for all normalization layers |
|
|
- vocab: 32k (english only) |
|
|
- interaction (query + answer) length: 1024 tokens |
|
|
- STM size: 14 layers * 1024 slots (* 256 dim) |
|
|
- context/messages: **Infinite** |
|
|
- size: ~290M |
|
|
- Library: RxLM |
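
The decoder, encoder, and memory attention listed above all use Sparse Query Attention (SQA), which reduces the number of query heads (and, in the non-symmetric variant, also key/value heads) relative to the full head count to cut attention compute. Below is a minimal sketch with this model's head counts (illustrative only; class and argument names are assumptions, not the RxLM implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseQueryAttention(nn.Module):
    """Minimal SQA sketch: 8 of 16 query heads and 4 of 16 key/value heads,
    so attention FLOPs scale with the reduced query-head count.
    Illustrative only - not the RxLM implementation."""
    def __init__(self, dim: int = 256, heads: int = 16, q_heads: int = 8, kv_heads: int = 4):
        super().__init__()
        self.head_dim = dim // heads                  # 16 for this model
        self.q_heads, self.kv_heads = q_heads, kv_heads
        self.q_proj = nn.Linear(dim, q_heads * self.head_dim)
        self.k_proj = nn.Linear(dim, kv_heads * self.head_dim)
        self.v_proj = nn.Linear(dim, kv_heads * self.head_dim)
        self.out_proj = nn.Linear(q_heads * self.head_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.q_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.kv_heads, self.head_dim).transpose(1, 2)
        # Repeat KV heads to match the reduced number of query heads (as in GQA)
        k = k.repeat_interleave(self.q_heads // self.kv_heads, dim=1)
        v = v.repeat_interleave(self.q_heads // self.kv_heads, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v)
        return self.out_proj(attn.transpose(1, 2).reshape(b, t, -1))

out = SparseQueryAttention()(torch.randn(2, 1024, 256))   # -> (2, 1024, 256)
```

The symmetric variant used in the encoder and memory attention sets `q_heads == kv_heads` (8/16 each).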
|
|
--- |
|
|
- **Developed by:** [Adam Filipek](https://huggingface.co/AdamF92) & [Reactive AI](https://huggingface.co/ReactiveAI) |
|
|
- **Funded by:** [Reactive AI](https://huggingface.co/ReactiveAI) |
|
|
- **Model type:** **Reactive Language Model (RxLM)** |
|
|
- **Language(s) (NLP):** English |
|
|
- **License:** [Reactive AI Model & Architecture License (RAML) v1.0](https://github.com/RxAI-dev/rxlm/blob/main/MODELS_LICENSE.md) |
|
|
|
|
|
### Model Sources |
|
|
- **Repository:** [RxLM Framework](https://github.com/RxAI-dev/rxlm) |
|
|
- **Paper:** [Reactive Transformer (RxT) - Stateful Real-Time Processing for Event-Driven Reactive Language Models](https://arxiv.org/abs/2510.03561) |
|
|
- **Demo:** In progress |
|
|
|
|
|
## Uses |
|
|
This model is still experimental and was pre-trained on a limited corpus of only 10B tokens, so its general knowledge is also limited. It's recommended
to further fine-tune the model for a specialization, like our [RxT-Beta-Micro-Supervised-AI](https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised-AI),
which is trained on AI/Data Science conversations.
|
|
|
|
|
**Supervised** RxT models are partially functional, intermediate-stage models - it's recommended to refine them with Memory Reinforcement Learning (MRL) and Reactive
Reinforcement Learning from Human Feedback (RxRLHF) to reach the final stage.
|
|
|
|
|
### Direct Use |
|
|
It's not recommended to use this model directly without additional specialization training or reinforcement learning stages. |
|
|
|
|
|
**Reactive Transformer** models are made for conversational tasks, especially chatbots or as a stateful base for agentic systems. |
|
|
|
|
|
### Downstream Use |
|
|
Because of the limited pre-training data, it's recommended to further fine-tune the model for a specialization. For example, we trained
[RxT-Beta-Micro-Supervised-AI](https://huggingface.co/ReactiveAI/RxT-Beta-Micro-Supervised-AI) on AI/Data Science conversations.
|
|
|
|
|
### Out-of-Scope Use |
|
|
**Reactive Transformer** models are natively conversational and made for multi-step tasks. They aren't typical Gen AI models and aren't made
for single-step generative tasks (like summarization, dataset generation, etc.) - they will work in those scenarios, but it would be a waste
of computational resources (initializing/processing memory when it's not needed). For such cases it's better to use a stateless LLM.
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
The model is still experimental, made to test the **Reactive Transformer** architecture on real-world data after successful experiments with simple synthetic data.
It was pre-trained on only 10B tokens (plus an additional 2B in later stages), so its general knowledge is limited and responses could be inaccurate.
|
|
|
|
|
Conversation context is theoretically infinite (the 1024-token limit applies only to a single interaction), but after some number of messages the model will slowly forget
outdated information - that's why it's called **Short-Term Memory**. It will be extended in upcoming generations with **Long-Term Memory** for truly infinite context.
|
|
|
|
|
### Recommendations |
|
|
As mentioned before, supervised models are at an intermediate stage, and it's recommended to continue training with the reinforcement learning stages. It's also recommended
to fine-tune this base model for a specialization.
|
|
|
|
|
## How to Get Started with the Model |
|
|
The model can be loaded and used with our RxLM framework (https://github.com/RxAI-dev/RxLM):
|
|
|
|
|
```python |
|
|
import torch |
|
|
from rxlm.rxt.models import RxTBeta |
|
|
from rxlm.training.tokenizer import load_tokenizer_from_hf_hub |
|
|
|
|
|
tokenizer = load_tokenizer_from_hf_hub('ReactiveAI/RxT-Beta-Micro') |
|
|
|
|
|
model = RxTBeta.from_pretrained('ReactiveAI/RxT-Beta-Micro-Supervised', tokenizer=tokenizer) |
|
|
model.share_components() # currently required to connect embeddings/STM |
|
|
|
|
|
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') |
|
|
model.to(device) |
|
|
|
|
|
seq_len = 1024 |
|
|
|
|
|
# Memory init - plays a role similar to an LLM "system prompt" (not recommended for this model, as it wasn't trained with system prompts)
|
|
stm_init_state = model.tokenize_full_interaction('System prompt like', 'Initial memory for the model', max_seq_len=seq_len, device=device) |
|
|
model.init_stm_state(**stm_init_state) |
|
|
|
|
|
# Helper function: streams the answer token by token; special ids -1/-2 mark the start and end of the asynchronous memory update
|
|
def interaction(query: str): |
|
|
tokenized_query = model.tokenize_query(query, max_seq_len=seq_len, device=device) |
|
|
for token_id in model.interact(**tokenized_query, max_seq_len=seq_len, temperature=1.0): |
|
|
if token_id == -1: print('\n', '[Start memory update...]') |
|
|
elif token_id == -2: print('[Memory updated]') |
|
|
else: |
|
|
txt_token = model.stringify_token(token_id) |
|
|
print(txt_token, end='') |
|
|
|
|
|
# Process first interaction |
|
|
interaction('Hello! Who are you?') |
|
|
# Process follow-up interaction |
|
|
interaction('Follow-up question?') |
|
|
|
|
|
``` |
|
|
|
|
|
## Training Details |
|
|
The stateful, real-time nature of the **Reactive Transformer** architecture, especially the asynchronous memory update, requires an advanced training pipeline with multiple
supervised and reinforcement learning stages:
|
|
- Supervised: |
|
|
- Joint Language Models Pre-Training | raw large text corpora |
|
|
- Interaction Supervised Fine-Tuning | single, not connected interactions (query + answer) |
|
|
- Self-Supervised Memory Attention Pre-Training | multi-step conversations (SMAT datasets) |
|
|
- Supervised Memory-Aware Training (SMAT) | multi-step conversations |
|
|
- Reinforcement: |
|
|
- Memory Reinforcement Learning (MRL) | multi-step conversations |
|
|
- Reactive Reinforcement Learning from Human Feedback (RxRLHF) | multi-step conversations |
|
|
|
|
|
|
|
|
### Training Data |
|
|
We used public open-source datasets for pre-training and our custom datasets (converted from public datasets) for other stages: |
|
|
- Joint Language Models Pre-Training |
|
|
- 'sample-10BT' subset from [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) |
|
|
- '20231101.en' subset from [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) |
|
|
- Interaction SFT |
|
|
- [ReactiveAI/smol-smoltalk-Interaction-SFT](https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-Interaction-SFT) |
|
|
- [ReactiveAI/cosmopedia-100k-Interaction-SFT](https://huggingface.co/datasets/ReactiveAI/cosmopedia-100k-Interaction-SFT) |
|
|
- Self-Supervised Memory Attention Pre-Training |
|
|
- 30% of [ReactiveAI/Real-Chat-SMAT](https://huggingface.co/datasets/ReactiveAI/Real-Chat-SMAT) |
|
|
- Supervised Memory-Aware Training (SMAT) |
|
|
- [ReactiveAI/Real-Chat-SMAT](https://huggingface.co/datasets/ReactiveAI/Real-Chat-SMAT) |
|
|
- [ReactiveAI/Real-Chat-No-System-SMAT](https://huggingface.co/datasets/ReactiveAI/Real-Chat-No-System-SMAT) |
|
|
|
|
|
|
|
|
### Training Procedure |
|
|
Supervised Memory System Training includes 4 steps before proceeding to the Reinforcement Learning stages.
|
|
|
|
|
#### Joint Language Models Pre-Training |
|
|
The decoder was trained together with the encoder and an additional MLM head model, using Joint LM Training (with MLM and autoregressive losses),
on the [**HuggingFaceFW/fineweb-edu**](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) and [**wikimedia/wikipedia**](https://huggingface.co/datasets/wikimedia/wikipedia) datasets.
Both encoder and decoder use a shared embedding layer.
|
|
|
|
|
#### Supervised Fine-Tuning |
|
|
The **RxT-Beta Micro** model was fine-tuned to the real-time interaction (sequence) format on our datasets, derived from HuggingFace ones:
|
|
- [**ReactiveAI/smol-smoltalk-Interaction-SFT**](https://huggingface.co/datasets/ReactiveAI/smol-smoltalk-Interaction-SFT) |
|
|
- [**ReactiveAI/cosmopedia-100k-Interaction-SFT**](https://huggingface.co/datasets/ReactiveAI/cosmopedia-100k-Interaction-SFT)
|
|
|
|
|
The model was fine-tuned using the Joint LM Training mode (for memory cross-attention pre-training), as sketched below:
- encode the data with the encoder and calculate the MLM loss for it
- save the encoder layers' outputs as Short-Term Memory (available to the decoder via memory cross-attention)
- process the data with the decoder and calculate the autoregressive loss
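
A minimal sketch of one such joint step (hedged pseudocode in plain PyTorch; the component interfaces, `stm=` argument, and equal loss weighting are assumptions, not the RxLM API):

```python
import torch.nn.functional as F

def joint_lm_step(encoder, decoder, mlm_head, batch):
    # 1. Encode the (masked) interaction and compute the MLM loss.
    enc_hidden = encoder(batch["masked_input_ids"])          # assumed: list of per-layer states
    mlm_logits = mlm_head(enc_hidden[-1])                    # (batch, seq, vocab)
    mlm_loss = F.cross_entropy(mlm_logits.transpose(1, 2), batch["mlm_labels"], ignore_index=-100)

    # 2. Use the encoder layers' outputs as the STM read by memory cross-attention.
    stm = [h.detach() for h in enc_hidden]

    # 3. Autoregressive loss for the decoder, conditioned on that STM.
    logits = decoder(batch["input_ids"], stm=stm)            # (batch, seq, vocab)
    ar_loss = F.cross_entropy(logits[:, :-1].transpose(1, 2), batch["input_ids"][:, 1:])

    return mlm_loss + ar_loss                                # joint objective (weighting assumed)
```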
|
|
|
|
|
That training results in a decoder with ~95% accuracy, because it has access to information about all the next tokens through memory cross-attention. In the next training stages it
will instead access previous-interaction data through those layers.
|
|
|
|
|
#### Self-Supervised Memory Attention Pre-Training |
|
|
Memory Attention was pre-trained to combine accumulated Short-Term Memory states with the next interaction's data processed by the
encoder, using a weighted mean (with randomized, arbitrary weights) as labels and negative cosine similarity as the loss. The label weights
depend on the inner step:
|
|
- the first step, when the STM is in its initial random normal state, uses 90% of the new encoded data
- follow-up steps use `50% - step * 5%` of the new encoded data
- each step can have 0-15% random differences in the weights
|
|
|
|
|
Additionally, random noise is added to both inputs and labels. |
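
A minimal sketch of that self-supervised objective (illustrative; the exact jitter range, clamping, and noise handling in RxLM may differ):

```python
import random
import torch
import torch.nn.functional as F

def new_data_weight(step: int) -> float:
    # Share of the newly encoded interaction in the pseudo-label:
    # 90% on the first step, then 50% - step * 5%, with up to 15% random difference.
    base = 0.9 if step == 0 else 0.5 - step * 0.05
    return min(max(base + random.uniform(-0.15, 0.15), 0.0), 1.0)

def memory_attention_pretrain_loss(stm_prev, encoded_new, stm_updated, step: int):
    w = new_data_weight(step)
    label = w * encoded_new + (1.0 - w) * stm_prev                    # weighted-mean pseudo-label
    return -F.cosine_similarity(stm_updated, label, dim=-1).mean()    # negative cosine similarity

# Example shapes: (batch, slots, dim) STM tensors
loss = memory_attention_pretrain_loss(
    torch.randn(1, 1024, 256), torch.randn(1, 1024, 256), torch.randn(1, 1024, 256), step=2
)
```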
|
|
|
|
|
This model was trained on six arbitrarily selected steps, for a single epoch, on 30% of the [**ReactiveAI/Real-Chat-SMAT**](https://huggingface.co/datasets/ReactiveAI/Real-Chat-SMAT) dataset.
|
|
|
|
|
#### Supervised Memory-Aware Training |
|
|
Finally, with the pre-trained/fine-tuned components, in the last supervised stage the model is trained to use previous/accumulated STM
states as the memory cross-attention input, instead of the same sequence as the decoder's input:
|
|
- the previous (or first) interaction is processed by the encoder and used to update memory
- the next interaction is processed by the decoder, using related information from the STM
- the loss is calculated from the decoder's logits, and gradients propagate through the memory attention to the encoder
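
A minimal sketch of one SMAT step over a multi-turn conversation (hedged pseudocode; component names and call signatures are assumptions, not the RxLM API):

```python
import torch.nn.functional as F

def smat_conversation_loss(encoder, memory_attention, decoder, stm, interactions):
    """interactions: list of tokenized `[Q] ... [A] ...` tensors for one conversation."""
    total_loss = 0.0
    for prev_ids, next_ids in zip(interactions[:-1], interactions[1:]):
        # 1. Encode the previous interaction and merge it into the accumulated STM.
        stm = memory_attention(stm, encoder(prev_ids))
        # 2. Decode the next interaction while reading from that STM.
        logits = decoder(next_ids, stm=stm)
        # 3. Autoregressive loss; gradients flow back through memory attention into the encoder.
        total_loss = total_loss + F.cross_entropy(logits[:, :-1].transpose(1, 2), next_ids[:, 1:])
    return total_loss / (len(interactions) - 1)
```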
|
|
|
|
|
We used staged memory-aware training with different datasets:
- starting with 2 epochs on 80k raw examples (7 interactions each) - [**ReactiveAI/Real-Chat-SMAT**](https://huggingface.co/datasets/ReactiveAI/Real-Chat-SMAT)
- then 5 epochs on 27k filtered, higher-quality examples - [**ReactiveAI/Real-Chat-No-System-SMAT**](https://huggingface.co/datasets/ReactiveAI/Real-Chat-No-System-SMAT)
|
|
|
|
|
#### Preprocessing |
|
|
Pre-training is done on raw text corpora and requires only tokenization. In the next stages, the model processes sequences in a simple **Interaction format**, used
instead of complex chat templates: `[Q] User's query... [A] Model's answer`. For upcoming reasoning models, it will be extended to `[Q] User's query... [T] Reasoning... [A] Model's answer`.
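
For illustration, building a single training example in that interaction format (a trivial sketch using only the special tokens described above):

```python
def format_interaction(query: str, answer: str) -> str:
    # Single-interaction sequence used in SFT and the memory training stages.
    return f"[Q] {query} [A] {answer}"

print(format_interaction("What is a Reactive Transformer?", "A stateful, event-driven transformer..."))
```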
|
|
|
|
|
|
|
|
#### Training Hyperparameters |
|
|
- **Training regime:** bf16 mixed precision (AMP autocast) |
|
|
- **Optimizer**: AdamW |
|
|
- **Scheduler**: Cosine annealing |
|
|
|
|
|
## Evaluation |
|
|
Evaluation is in progress - more details soon! |
|
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
|
|
#### Testing Data |
|
|
|
|
|
<!-- This should link to a Dataset Card if possible. --> |
|
|
|
|
|
[More Information Needed] |
|
|
|
|
|
#### Factors |
|
|
|
|
|
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. --> |
|
|
|
|
|
[More Information Needed] |
|
|
|
|
|
#### Metrics |
|
|
|
|
|
<!-- These are the evaluation metrics being used, ideally with a description of why. --> |
|
|
|
|
|
[More Information Needed] |
|
|
|
|
|
### Results |
|
|
|
|
|
[More Information Needed] |
|
|
|
|
|
#### Summary |
|
|
|
|
|
|
|
|
## Environmental Impact |
|
|
- **Hardware Type:** 4x NVIDIA A100 40GB |
|
|
- **Hours used:** 150 |
|
|
|
|
|
## Model Card Contact |
|
|
[Adam Filipek](https://huggingface.co/AdamF92) - adamfilipek@rxai.dev |
|
|
|
|
|
Licensing - licensing@rxai.dev