# Llama-3.2-1B-Instruct-FlashHead
Optimized version of Llama-3.2-1B-Instruct using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:
- FlashHead
- Custom vLLM generation via the `embedl-models` package
FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, achieves H200-class throughput on RTX Ada GPUs.
## Model Details
| Field | Value |
|---|---|
| Base Model | Llama-3.2-1B-Instruct |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head |
| Developers | Embedl |
| Licenses | Upstream: Meta Llama 3.2 License. Built with Llama. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |
## Optimizations
- FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
- Custom Runtime Integration - compatible with vLLM (0.10.2) via the `embedl-models` package.
## Performance

### Token Generation Speed (RTX 3500 Ada, batch size = 1)
| Precision | Tokens/sec | Speedup vs BF16 |
|---|---|---|
| BF16 baseline | 130 | 1.0× |
| FlashHead (Embedl) | 163 | 1.25× |
| W4A16 baseline | 278 | 2.14× |
| FlashHead W4A16 (Embedl) | 485 | 3.73× |
FlashHead improves end-to-end generation speed by 1.75× over the state-of-the-art W4A16 baseline (485 vs. 278 tokens/sec) while maintaining full accuracy parity.
Measurement setup: vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.
NVIDIA H200 measurement: FP8, 512 Tokens/sec.
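A comparable tokens/sec figure can be approximated with a simple timing loop against the stated setup (batch size 1, ~32-token prompt, 128 new tokens, 10 warm-up runs, 100 timed runs). The sketch below is illustrative only; the prompt text, `max_model_len=4096`, and `ignore_eos=True` are assumptions chosen to keep the loop simple, and this is not the harness used for the published numbers.

```python
import time

from vllm import SamplingParams

from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=4096)
    # Force a fixed 128-token generation so every run does the same work.
    sampling = SamplingParams(max_tokens=128, temperature=0.0, ignore_eos=True)

    # Any short prompt works for this sketch; the published setup used a 32-token prompt.
    prompt = "The quick brown fox jumps over the lazy dog. " * 4

    for _ in range(10):  # warm-up runs, not timed
        llm.generate([prompt], sampling)

    n_runs = 100
    start = time.perf_counter()
    for _ in range(n_runs):
        out = llm.generate([prompt], sampling)
    elapsed = time.perf_counter() - start

    tokens_per_run = len(out[0].outputs[0].token_ids)
    print(f"~{tokens_per_run * n_runs / elapsed:.0f} tokens/sec")
```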
### Accuracy (Parity with Baseline)
| Method | MMLU-Pro | HellaSwag | IFEval | BoolQ | BBH | TruthfulQA | GSM8K |
|---|---|---|---|---|---|---|---|
| Baseline | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |
| FlashHead | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |
FlashHead closely matches baseline accuracy.
## Installation

```bash
pip install embedl-models
```

The `embedl-models` package is required; it provides the optimized FlashHead implementation and the quantized model runtime.
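A quick, optional way to confirm the installation is to import the vLLM integration used throughout the examples below; this check is just a convenience, not a documented entry point of the package.

```python
# Sanity check: the import used by the examples below should succeed.
from embedl.models.vllm import LLM  # noqa: F401

print("embedl-models vLLM integration available:", LLM.__module__)
```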
## Usage Examples
Note (vLLM context length): max_model_len=131072 may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower max_model_len (or increase gpu_memory_utilization).
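For example, a more conservative configuration might look like the sketch below. The specific values are placeholders, not recommendations, and passing `gpu_memory_utilization` assumes the wrapper forwards standard vLLM engine arguments.

```python
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    # A smaller context window shrinks the KV cache reservation.
    llm = LLM(
        model=model_id,
        trust_remote_code=True,
        max_model_len=8192,           # lower this if KV cache allocation fails
        gpu_memory_utilization=0.90,  # or raise this if spare VRAM is available
    )
```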
### vLLM Inference

```python
from vllm import SamplingParams

from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    # Greedy decoding, up to 128 new tokens.
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    # Load the FlashHead-optimized model through the vLLM backend.
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)

    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```
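Because this is an instruct-tuned model, chat-style requests generally work best when rendered with the Llama 3.2 chat template. A minimal sketch, assuming the repository ships the base model's standard tokenizer and chat template:

```python
from transformers import AutoTokenizer
from vllm import SamplingParams

from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=8192)
    sampling = SamplingParams(max_tokens=256, temperature=0.0)

    messages = [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain what an LM head does in two sentences."},
    ]
    # Render the conversation with the model's chat template before generating.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```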
### Interactive REPL Example
The run_repl() coroutine launches an interactive, streaming chat interface using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as /exit to quit and /reset to clear context.
```python
import asyncio

from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    # Launch the streaming chat REPL (/exit to quit, /reset to clear context).
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072,
        )
    )
```
## ⚠️ Important Warning: Hugging Face Transformers Support

FlashHead is currently not applied when using the Hugging Face `transformers` pipeline. Generation through `transformers` will fall back to the standard dense LM head, disabling FlashHead acceleration. For now, we strongly recommend using the vLLM integration (`embedl.models.vllm.LLM`) to ensure FlashHead is active and optimized for low-latency inference. Full support for the Hugging Face `transformers` pipeline with FlashHead integration will be released in the coming days.
## Limitations
- Limited to vLLM 0.10.2 (pinned dependency)
- Batch size = 1 (optimized for real-time, single-stream generation)
- Currently optimized for NVIDIA RTX GPUs
## Roadmap
Planned improvements:
- Advanced mixed precision quantization
- Hugging Face `transformers` generation support
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in Transformers and vLLM
- Compatibility with GGUF, MLC, Llama.cpp, Ollama, etc.
- Broader model coverage (larger models, VLMs, VLAs)
## License
- Upstream: Meta Llama 3.2 License
- Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)
## Contact

- Enterprise & Commercial Inquiries: sales@embedl.com
- Technical Issues & Early Access: https://github.com/embedl/embedl-models
- More Information & Model Releases: https://embedl.com
## Partner & Developer Opportunities
If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:
- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities
Contact: sales@embedl.com