# Llama-3.2-1B-Instruct-FlashHead
Optimized version of Llama-3.2-1B-Instruct using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:
- FlashHead
- Custom vLLM generation via the `embedl-models` package
FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, achieves H200-class throughput on RTX Ada GPUs.
## Model Details
| Field | Value |
|---|---|
| Base Model | Llama-3.2-1B-Instruct |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head |
| Developers | Embedl |
| Licenses | Upstream: Meta Llama 3.2 License. Built with Llama. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |
## Optimizations
- FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
- Custom Runtime Integration - compatible with vLLM (0.10.2) via the `embedl-models` package.
## Performance

### Token Generation Speed (RTX 3500 Ada, batch size = 1)
| Precision | Tokens/sec | Speedup vs BF16 |
|---|---|---|
| BF16 baseline | 130 | 1.0× |
| FlashHead (Embedl) | 163 | 1.25× |
| W4A16 baseline | 278 | 2.14× |
| FlashHead W4A16 (Embedl) | 485 | 3.73× |
FlashHead improves end-to-end generation speed by 1.75× over the state-of-the-art W4A16 baseline (485 vs. 278 tokens/sec) while maintaining full accuracy parity.
Measurement setup: vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.
NVIDIA H200 measurement: FP8, 512 Tokens/sec.
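A comparable tokens/sec figure can be approximated with a simple timing loop against the stated setup (batch size 1, ~32-token prompt, 128 new tokens, 10 warm-up runs, 100 timed runs). The sketch below is illustrative only; the prompt text, `max_model_len=4096`, and `ignore_eos=True` are assumptions chosen to keep the loop simple, and this is not the harness used for the published numbers.

```python
import time

from vllm import SamplingParams

from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=4096)
    # Force a fixed 128-token generation so every run does the same work.
    sampling = SamplingParams(max_tokens=128, temperature=0.0, ignore_eos=True)

    # Any short prompt works for this sketch; the published setup used a 32-token prompt.
    prompt = "The quick brown fox jumps over the lazy dog. " * 4

    for _ in range(10):  # warm-up runs, not timed
        llm.generate([prompt], sampling)

    n_runs = 100
    start = time.perf_counter()
    for _ in range(n_runs):
        out = llm.generate([prompt], sampling)
    elapsed = time.perf_counter() - start

    tokens_per_run = len(out[0].outputs[0].token_ids)
    print(f"~{tokens_per_run * n_runs / elapsed:.0f} tokens/sec")
```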
### Accuracy (Parity with Baseline)
| Method | MMLU-Pro | HellaSwag | IFEval | BoolQ | BBH | TruthfulQA | GSM8K |
|---|---|---|---|---|---|---|---|
| Baseline | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |
| FlashHead | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |
FlashHead closely matches baseline accuracy.
## Installation

```bash
pip install embedl-models
```

The `embedl-models` package is required; it provides the optimized FlashHead implementation and the quantized model runtime.
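A quick, optional way to confirm the installation is to import the vLLM integration used throughout the examples below; this check is just a convenience, not a documented entry point of the package.

```python
# Sanity check: the import used by the examples below should succeed.
from embedl.models.vllm import LLM  # noqa: F401

print("embedl-models vLLM integration available:", LLM.__module__)
```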
## Usage Examples
Note (vLLM context length): max_model_len=131072 may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower max_model_len (or increase gpu_memory_utilization).
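For example, a more conservative configuration might look like the sketch below. The specific values are placeholders, not recommendations, and passing `gpu_memory_utilization` assumes the wrapper forwards standard vLLM engine arguments.

```python
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    # A smaller context window shrinks the KV cache reservation.
    llm = LLM(
        model=model_id,
        trust_remote_code=True,
        max_model_len=8192,           # lower this if KV cache allocation fails
        gpu_memory_utilization=0.90,  # or raise this if spare VRAM is available
    )
```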
### vLLM Inference

```python
from vllm import SamplingParams

from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    # Greedy decoding, up to 128 new tokens.
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    # Load the FlashHead-optimized model through the vLLM backend.
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)

    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```
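Because this is an instruct-tuned model, chat-style requests generally work best when rendered with the Llama 3.2 chat template. A minimal sketch, assuming the repository ships the base model's standard tokenizer and chat template:

```python
from transformers import AutoTokenizer
from vllm import SamplingParams

from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=8192)
    sampling = SamplingParams(max_tokens=256, temperature=0.0)

    messages = [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain what an LM head does in two sentences."},
    ]
    # Render the conversation with the model's chat template before generating.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```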
### Interactive REPL Example
The run_repl() coroutine launches an interactive, streaming chat interface using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as /exit to quit and /reset to clear context.
```python
import asyncio

from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    # Launch the streaming chat REPL (/exit to quit, /reset to clear context).
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072,
        )
    )
```
## ⚠️ Important Warning: Hugging Face Transformers Support

FlashHead is currently not applied when using the Hugging Face `transformers` pipeline. Generation through `transformers` will fall back to the standard dense LM head, disabling FlashHead acceleration. For now, we strongly recommend using the vLLM integration (`embedl.models.vllm.LLM`) to ensure FlashHead is active and optimized for low-latency inference. Full support for the Hugging Face `transformers` pipeline with FlashHead integration will be released in the coming days.
## Limitations
- Limited to vLLM 0.10.2 (pinned dependency)
- Batch size = 1 (optimized for real-time, single-stream generation)
- Currently optimized for NVIDIA RTX GPUs
## Roadmap
Planned improvements:
- Advanced mixed precision quantization
- Hugging Face `transformers` generation support
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in Transformers and vLLM
- Compatibility with GGUF, MLC, Llama.cpp, Ollama, etc.
- Broader model coverage (larger models, VLMs, VLAs)
## License
- Upstream: Meta Llama 3.2 License
- Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)
## Contact

- Enterprise & Commercial Inquiries: sales@embedl.com
- Technical Issues & Early Access: https://github.com/embedl/embedl-models
- More Information & Model Releases: https://embedl.com
## Partner & Developer Opportunities
If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:
- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities
Contact: sales@embedl.com