Llama-3.2-1B-Instruct-FlashHead

Optimized version of Llama-3.2-1B-Instruct using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy. Designed for low-latency inference on NVIDIA RTX GPUs, leveraging:

  • FlashHead
  • Custom vLLM generation via embedl-models

FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, achieves H200-class throughput on RTX Ada GPUs.


Model Details

| Field | Value |
|---|---|
| Base Model | Llama-3.2-1B-Instruct |
| Input / Output | Text → Text |
| Release Date | 2025-12-08 |
| Version | 1.0 |
| Optimizations | FlashHead LM Head |
| Developers | Embedl |
| Licenses | Upstream: Meta Llama 3.2 License. Built with Llama. Optimized components: Embedl Models Community Licence v1.0 (no redistribution) |
| Intended Use | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |

Optimizations

  • FlashHead LM Head - lightweight replacement for the dense LM head, significantly improving throughput.
  • Custom Runtime Integration - compatible with vLLM (0.10.2) via the embedl-models package.

Performance

Token Generation Speed (RTX 3500 Ada, batch size = 1)

| Precision | Tokens/sec | Speedup vs BF16 |
|---|---|---|
| BF16 baseline | 130 | 1.0× |
| FlashHead (Embedl) | 163 | 1.25× |
| W4A16 baseline | 278 | 2.14× |
| FlashHead W4A16 (Embedl) | 485 | 3.73× |

In the W4A16 setting, FlashHead improves end-to-end generation speed by 1.75× over the quantized baseline (485 vs. 278 tokens/sec) while maintaining accuracy parity.

Measurement setup: vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.

NVIDIA H200 reference measurement: FP8, 512 tokens/sec.
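
The throughput figures above can be reproduced approximately with a simple timing harness. The sketch below is illustrative only: it reuses the embedl.models.vllm.LLM wrapper and vllm.SamplingParams from the usage examples further down, but the prompt text and the exact warm-up/averaging script behind the published numbers are not included here, so treat the number it prints as a ballpark figure.

```python
import time

from vllm import SamplingParams
from embedl.models.vllm import LLM  # FlashHead-enabled vLLM wrapper

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=4096)
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    # Illustrative prompt; the official setup uses a 32-token prompt.
    prompt = "Summarize the benefits of an efficient language model head."

    # Warm-up (the reported numbers use 10 warm-up runs).
    for _ in range(10):
        llm.generate([prompt], sampling)

    # Timed runs at batch size = 1 (the reported numbers average over 100 runs).
    runs = 100
    start = time.perf_counter()
    for _ in range(runs):
        out = llm.generate([prompt], sampling)
    elapsed = time.perf_counter() - start

    # Greedy decoding with a fixed prompt, so every run emits the same token count.
    tokens = runs * len(out[0].outputs[0].token_ids)
    print(f"~{tokens / elapsed:.1f} tokens/sec")
```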


Accuracy (Parity with Baseline)

| Method | MMLU-Pro | HellaSwag | IFEval | BoolQ | BBH | TruthfulQA | GSM8K |
|---|---|---|---|---|---|---|---|
| Baseline | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |
| FlashHead | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |

FlashHead closely matches baseline accuracy.


Installation

```bash
pip install embedl-models
```

The embedl-models package is required; it provides the optimized FlashHead implementation and the quantized model runtime.
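
A minimal post-install sanity check (using only imports already shown in this card) is to confirm the pinned vLLM version and that the FlashHead wrapper imports cleanly:

```python
# Sanity check: the wrapper should import without errors and vLLM should be the pinned 0.10.2.
import vllm
from embedl.models.vllm import LLM  # noqa: F401  (FlashHead-enabled wrapper)

print("vLLM version:", vllm.__version__)
```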


Usage Examples

Note (vLLM context length): max_model_len=131072 may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower max_model_len or increase gpu_memory_utilization; a memory-constrained configuration sketch follows the vLLM example below.

vLLM Inference

```python
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    # Greedy decoding, up to 128 new tokens.
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    # Load the FlashHead-optimized checkpoint with the full 131k-token context.
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)

    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```
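
If the 131072-token context does not fit on your GPU (see the context-length note above), a reduced context window and an explicit KV-cache budget are the usual fix. The values below are illustrative rather than tuned recommendations, and they assume the embedl wrapper forwards gpu_memory_utilization to vLLM as the standard LLM class does:

```python
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    # Smaller context window and a larger KV-cache budget; adjust both to your GPU.
    llm = LLM(
        model=model_id,
        trust_remote_code=True,
        max_model_len=8192,            # illustrative; lower further if needed
        gpu_memory_utilization=0.95,   # vLLM default is 0.90
    )
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    output = llm.generate(["Write a haiku about coffee."], sampling)
    print(output[0].outputs[0].text)
```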

Interactive REPL Example

The run_repl() coroutine launches an interactive, streaming chat interface using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as /exit to quit and /reset to clear context.

```python
import asyncio
from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072,
        )
    )
```


⚠️ Important Warning: Hugging Face Transformers Support

FlashHead is currently not applied when using the Hugging Face transformers pipeline.
Generation through transformers will fall back to the standard dense LM head, disabling FlashHead acceleration.

For now, we strongly recommend using the vLLM integration (embedl.models.vllm.LLM) to ensure FlashHead is active and optimized for low-latency inference.

Full support for the Hugging Face transformers pipeline with FlashHead integration will be released in the coming days.
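
Until that release ships, the checkpoint is expected to load through the standard transformers API, but only via the dense LM head fallback (no FlashHead acceleration). The snippet below is an unverified sketch of that fallback path and assumes the repository exposes a standard Llama-compatible configuration:

```python
# Fallback only: runs WITHOUT FlashHead acceleration (standard dense LM head).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("Write a haiku about coffee.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```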


Limitations

  • Limited to vLLM 0.10.2 (pinned dependency)
  • Batch size = 1 (real-time generation)
  • Currently optimized for NVIDIA RTX GPUs

Roadmap

Planned improvements:

  • Advanced mixed precision quantization
  • Hugging Face transformers generation
  • vLLM CLI benchmarking for detailed latency evaluation
  • lm-eval-harness integration for detailed accuracy evaluation
  • Upstream support in Transformers and vLLM
  • Compatibility with GGUF, MLC, Llama.cpp, Ollama, etc.
  • Broader model coverage (larger models, VLMs, VLAs)

License

  • Upstream: Meta Llama 3.2 License
  • Optimized Components: Embedl Models Community Licence v1.0 (no redistribution)

Contact

  • Enterprise & Commercial Inquiries: sales@embedl.com
  • Technical Issues & Early Access: https://github.com/embedl/embedl-models
  • More Information & Model Releases: https://embedl.com


Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

  • Embedl SDK - AI optimization tools & profiling
  • Embedl HUB - benchmarking platform
  • Engineering support for on-prem/edge deployments
  • Migration guidance (Llama / Qwen / Gemma)
  • Early access & partner co-marketing opportunities

Contact: sales@embedl.com
