---
license: other
license_name: embedl-models-community-licence-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
- meta-llama/Llama-3.2-1B-Instruct
tags:
- text-generation-inference
---
# Llama-3.2-1B-Instruct-FlashHead

**Optimized version of Llama-3.2-1B-Instruct using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy.**

Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:

- FlashHead
- Custom vLLM generation via `embedl-models`

FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, achieves H200-class throughput on RTX Ada GPUs.
---
## Model Details
| **Field** | **Value** |
|------------|------------|
| **Base Model** | Llama-3.2-1B-Instruct |
| **Input / Output** | Text → Text |
| **Release Date** | 2025-12-08 |
| **Version** | 1.0 |
| **Optimizations** | FlashHead LM Head|
| **Developers** | Embedl |
| **Licenses** | Upstream: Meta Llama 3.2 License. Built with Llama. <br>Optimized components: Embedl Models Community Licence v1.0 *(no redistribution)* |
| **Intended Use** | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |
---
## Optimizations
- **FlashHead LM Head** - lightweight replacement for the dense LM head, significantly improving throughput.
- **Custom Runtime Integration** - compatible with **vLLM (0.10.2)** via the `embedl-models` package.
---
## Performance
### Token Generation Speed (RTX 3500 Ada, batch size = 1)
| **Precision** | **Tokens/sec** | **Speedup vs BF16** |
|----------------|----------------|----------------------|
| BF16 baseline | 130 | 1.0× |
| **FlashHead (Embedl)** | **163** | **1.25×** |
| W4A16 baseline | 278 | 2.14× |
| **FlashHead W4A16 (Embedl)** | **485** | **3.73×** |
With W4A16 quantization, FlashHead improves end-to-end generation speed by **1.75×** over the W4A16 baseline (485 vs. 278 tokens/sec) while maintaining accuracy parity.
**Measurement setup:** vLLM 0.10.2, batch_size=1, prompt length=32, max_new_tokens=128, 10 warm-up runs, averaged over 100 runs.
**NVIDIA H200 measurement:** **FP8**, **512 Tokens/sec**.
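For reference, a minimal sketch of this measurement protocol (illustrative only, not the official benchmark script; the prompt construction and `max_model_len` below are assumptions):

```python
import time

from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=4096)
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    prompt = "word " * 32  # rough stand-in for a ~32-token prompt

    for _ in range(10):  # warm-up runs
        llm.generate([prompt], sampling)

    total_tokens, total_time = 0, 0.0
    for _ in range(100):  # timed runs
        start = time.perf_counter()
        outputs = llm.generate([prompt], sampling)
        total_time += time.perf_counter() - start
        total_tokens += len(outputs[0].outputs[0].token_ids)

    print(f"Throughput: {total_tokens / total_time:.1f} tokens/sec")
```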
---
## Accuracy (Parity with Baseline)
| **Method** | **MMLU-Pro** | **HellaSwag** | **IFEval** | **BoolQ** | **BBH** | **TruthfulQA** | **GSM8K** |
|-------------|---------------|----------------|--------------|-------------|-------------|----------------|--------------|
| **Baseline** | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |
| **FlashHead** | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |
FlashHead matches baseline accuracy to within rounding on all reported benchmarks.
---
## Installation
```bash
pip install embedl-models
```
The `embedl-models` package is required; it provides the optimized FlashHead implementation and the quantized model runtime.
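The runtime currently targets vLLM 0.10.2 (see Limitations). After installation you can confirm the resolved versions, for example:

```bash
# Check that the pinned vLLM version (0.10.2) was resolved
python -c "import vllm; print(vllm.__version__)"
pip show embedl-models vllm
```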
---
## Usage Examples
**Note (vLLM context length):** `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` or increase `gpu_memory_utilization`; a reduced-memory sketch follows the example below.
### vLLM Inference
```python
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    # Greedy decoding, up to 128 new tokens.
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    # FlashHead-enabled engine from embedl-models (wraps vLLM).
    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)

    prompt = "Write a haiku about coffee."
    output = llm.generate([prompt], sampling)
    print(output[0].outputs[0].text)
```
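If the full 131,072-token context does not fit on your GPU, a smaller context window and a higher memory budget keep the same code path. The specific values below are illustrative, not recommendations:

```python
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    # Illustrative settings for GPUs with limited free VRAM:
    # a shorter context shrinks the KV cache, and a higher
    # gpu_memory_utilization gives vLLM a larger memory budget.
    llm = LLM(
        model=model_id,
        trust_remote_code=True,
        max_model_len=8192,
        gpu_memory_utilization=0.90,
    )
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    output = llm.generate(["Write a haiku about coffee."], sampling)
    print(output[0].outputs[0].text)
```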
---
### Interactive REPL Example
The `run_repl()` coroutine launches an **interactive, streaming chat interface** using the vLLM backend with FlashHead enabled.
It maintains an in-memory chat history and supports simple commands such as `/exit` to quit and `/reset` to clear context.
```python
import asyncio

from embedl.models.vllm.demo import run_repl

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    # Start the streaming chat REPL (/exit to quit, /reset to clear history).
    asyncio.run(
        run_repl(
            model=model_id,
            max_model_len=131072,
        )
    )
```
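For programmatic assistant-style interaction outside the REPL, one option is to render messages with the model’s chat template and pass the resulting prompt to the same engine. This sketch assumes the standard Llama 3.2 chat template shipped with the tokenizer:

```python
from transformers import AutoTokenizer
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

if __name__ == "__main__":
    # Assumes the repository ships the standard Llama 3.2 chat template.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    messages = [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain KV caching in one sentence."},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    llm = LLM(model=model_id, trust_remote_code=True, max_model_len=8192)
    sampling = SamplingParams(max_tokens=128, temperature=0.0)
    print(llm.generate([prompt], sampling)[0].outputs[0].text)
```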
---
## ⚠️ Important Warning: Hugging Face Transformers Support
> **FlashHead is currently not applied when using the Hugging Face `transformers` pipeline.**
> Generation through `transformers` will fall back to the standard dense LM head, **disabling FlashHead acceleration**.
>
> For now, **we strongly recommend using the vLLM integration** (`embedl.models.vllm.LLM`) to ensure FlashHead is active and optimized for low-latency inference.
>
> Full support for the Hugging Face `transformers` pipeline with FlashHead integration will be released **in the coming days**.
---
## Limitations
- Limited to **vLLM 0.10.2** (pinned dependency)
- **Batch size = 1** (real-time generation)
- Currently optimized for **NVIDIA RTX GPUs**
---
## Roadmap
Planned improvements:
- Advanced mixed precision quantization
- Hugging Face Transformers generation
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in **Transformers** and **vLLM**
- Compatibility with **GGUF**, **MLC**, **Llama.cpp**, **Ollama**, etc.
- Broader model coverage (larger models, VLMs, VLAs)
---
## License
- **Upstream:** Meta Llama 3.2 License
- **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)*
---
## Contact
**Enterprise & Commercial Inquiries**
[sales@embedl.com](mailto:sales@embedl.com)
**Technical Issues & Early Access**
[https://github.com/embedl/embedl-models](https://github.com/embedl/embedl-models)
**More Information & Model Releases**
[https://embedl.com](https://embedl.com)
---
### Partner & Developer Opportunities
If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:
- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities
Contact: [sales@embedl.com](mailto:sales@embedl.com)