WilhelmT committed
Commit 78655c8 · verified · 1 Parent(s): a001915

Update README.md

Files changed (1): README.md (+145 −1)

license_name: embedl-models-community-licence-agreement-1.0
license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---

# Llama-3.2-1B-Instruct-FlashHead

**Optimized version of Llama-3.2-1B-Instruct using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy.**
Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:

- FlashHead
- Custom vLLM generation via `embedl-models`

FlashHead matches the baseline **Llama-3.2-1B** within rounding on standard evaluations (MMLU-Pro, HellaSwag, GSM8K, etc.) and, in combination with quantization, achieves **H200-level latency** on **RTX Ada** GPUs.

---

## Model Details

| **Field** | **Value** |
|------------|------------|
| **Base Model** | Llama-3.2-1B-Instruct |
| **Input / Output** | Text → Text |
| **Release Date** | 2025-12-08 |
| **Version** | 1.0 |
| **Optimizations** | FlashHead LM Head, W4A16 Mixed Precision |
| **Developers** | Embedl |
| **Licenses** | Upstream: Meta Llama 3.2 License. Built with Llama. <br>Optimized components: Embedl Models Community Licence v1.0 *(no redistribution)* |
| **Intended Use** | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |

---

## Optimizations

- **FlashHead LM Head** - lightweight replacement for the dense LM head, significantly improving throughput.
- **Mixed-Precision Quantization (W4A16)** - 4-bit weights with 16-bit activations, balancing memory footprint and accuracy (see the generic sketch after this list).
- **Custom Runtime Integration** - compatible with **vLLM (0.10.2)** via the `embedl-models` package.
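
For readers unfamiliar with the scheme, the snippet below is a minimal, generic sketch of what W4A16 weight-only quantization means: weights are stored as group-wise 4-bit integers with per-group scales and are dequantized on the fly for a 16-bit matmul. It illustrates the general idea only; it is not Embedl's actual quantization or FlashHead kernel.

```python
# Generic W4A16 illustration (4-bit weights, 16-bit activations).
# NOT Embedl's implementation; it only shows the storage/compute idea.
import torch

def quantize_w4(weight: torch.Tensor, group_size: int = 128):
    """Symmetric group-wise int4 quantization of an [out, in] weight matrix."""
    out_f, in_f = weight.shape
    w = weight.reshape(out_f, in_f // group_size, group_size)
    scales = w.abs().amax(dim=-1, keepdim=True) / 7.0               # int4 range [-8, 7]
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)  # ~4 bits per weight
    return q, scales

def w4a16_linear(x: torch.Tensor, q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Dequantize the stored int4 weights and run the matmul.

    On GPU the dequantized weights and activations would be FP16 (the "A16" part);
    here the dtype simply follows the input so the sketch runs anywhere.
    """
    w = (q.to(scales.dtype) * scales).reshape(q.shape[0], -1)
    return x @ w.t().to(x.dtype)

# Quick check: the quantized projection stays close to the full-precision one.
W = torch.randn(256, 1024)
q, s = quantize_w4(W)
x = torch.randn(4, 1024)
print((w4a16_linear(x, q, s) - x @ W.t()).abs().max())
```
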
---

## Performance

### Token Generation Speed (RTX 3500 Ada, batch size = 1)

| **Precision** | **Tokens/sec** | **Speedup vs BF16** |
|----------------|----------------|----------------------|
| BF16 baseline | 130 | 1.0× |
| **FlashHead (Embedl)** | **163** | **1.25×** |
| W4A16 baseline | 278 | 2.14× |
| **FlashHead W4A16 (Embedl)** | **485** | **3.73×** |

FlashHead improves end-to-end generation speed by **1.75×** over the state-of-the-art W4A16 baseline (485 vs. 278 tokens/sec) while maintaining full accuracy parity.

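
As a rough way to sanity-check tokens/sec on your own hardware, the sketch below times a single batch-size-1 generation with the `embedl-models` vLLM wrapper shown in the Usage Examples section. It assumes the wrapper's `generate()` returns standard vLLM `RequestOutput` objects; absolute numbers will vary with GPU, drivers, and prompt.

```python
# Rough batch-size-1 tokens/sec check (illustrative; not Embedl's benchmark harness).
import time

from vllm import SamplingParams
from embedl.models.vllm import LLM

llm = LLM(model="embedl/Llama-3.2-1B-Instruct-FlashHead", trust_remote_code=True)
sampling = SamplingParams(max_tokens=512, temperature=0.0, ignore_eos=True)

llm.generate(["warm-up"], sampling)  # exclude one-time warm-up cost

start = time.perf_counter()
out = llm.generate(["Explain speculative decoding in simple terms."], sampling)
elapsed = time.perf_counter() - start

generated = len(out[0].outputs[0].token_ids)  # tokens produced for this request
print(f"{generated / elapsed:.1f} tokens/sec")
```
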
---

## Accuracy (Parity with Baseline)

| **Method** | **MMLU-Pro** | **HellaSwag** | **IFEval** | **BoolQ** | **BBH** | **TruthfulQA** | **GSM8K** |
|-------------|---------------|----------------|--------------|-------------|-------------|----------------|--------------|
| **Baseline** | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |
| **FlashHead** | 0.18 | 0.59 | 0.45 | 0.69 | 0.38 | 0.36 | 0.46 |

FlashHead matches the baseline across all reported benchmarks to the precision shown.

---

## Installation

```bash
pip install embedl-models
```

The `embedl-models` package is required: it provides the optimized FlashHead implementation and the quantized model runtime.

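
Because the current release targets a specific vLLM version (see Limitations below), a quick environment check can save debugging time. The snippet is an optional, illustrative check rather than an official setup step.

```python
# Optional sanity check: confirm the vLLM version and that the FlashHead
# wrapper imports cleanly. Illustrative only.
import vllm
from embedl.models.vllm import LLM  # provided by the embedl-models package

print("vLLM version:", vllm.__version__)  # expected: 0.10.2 for this release
```
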
---

## Usage Examples

### vLLM Inference

```python
from vllm import SamplingParams
from embedl.models.vllm import LLM

model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"

sampling = SamplingParams(max_tokens=128, temperature=0.0)
llm = LLM(model=model_id, trust_remote_code=True)

prompt = "Write a haiku about coffee."
output = llm.generate([prompt], sampling)
print(output[0].outputs[0].text)
```

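
Because this is an instruct-tuned model, conversational prompts are usually best rendered with the Llama 3.2 chat template rather than passed as raw text. The sketch below is a hedged example: it assumes the `embedl.models.vllm.LLM` wrapper exposes the same `get_tokenizer()` and `generate()` interface as vLLM's `LLM` class.

```python
# Hedged chat-style example: render the conversation with the tokenizer's
# chat template before calling generate(). Assumes the wrapper mirrors
# vLLM's LLM interface (get_tokenizer / generate).
from vllm import SamplingParams
from embedl.models.vllm import LLM

llm = LLM(model="embedl/Llama-3.2-1B-Instruct-FlashHead", trust_remote_code=True)
sampling = SamplingParams(max_tokens=256, temperature=0.7, top_p=0.9)

messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Summarize what a language model head does."},
]

tokenizer = llm.get_tokenizer()
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

output = llm.generate([prompt], sampling)
print(output[0].outputs[0].text)
```
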
---

## Limitations

- Limited to **vLLM 0.10.2** (pinned dependency)
- **Batch size = 1** (real-time generation)
- Currently optimized for **NVIDIA RTX GPUs**

---

## Roadmap

Planned improvements:

- Hugging Face Transformers generation
- vLLM CLI benchmarking for detailed latency evaluation
- `lm-eval-harness` integration for detailed accuracy evaluation
- Upstream support in **Transformers** and **vLLM**
- Compatibility with **GGUF**, **MLC**, **llama.cpp**, **TGI**, etc.
- Broader model coverage (larger models, VLMs, VLAs)

---

## License

- **Upstream:** Meta Llama 3.2 License
- **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)*

---

## Contact

**Enterprise & Commercial Inquiries**
[sales@embedl.com](mailto:sales@embedl.com)

**Technical Issues & Early Access**
[https://github.com/embedl/embedl-models](https://github.com/embedl/embedl-models)

**More Information & Model Releases**
[https://embedl.com](https://embedl.com)

---

### Partner & Developer Opportunities

If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:

- Embedl SDK - AI optimization tools & profiling
- Embedl HUB - benchmarking platform
- Engineering support for on-prem/edge deployments
- Migration guidance (Llama / Qwen / Gemma)
- Early access & partner co-marketing opportunities

Contact: [sales@embedl.com](mailto:sales@embedl.com)