Update README.md
README.md
CHANGED
@@ -96,6 +96,24 @@ prompt = "Write a haiku about coffee."
output = llm.generate([prompt], sampling)
print(output[0].outputs[0].text)
```

---

### Interactive REPL Example
The `run_repl()` coroutine launches an **interactive, streaming chat interface** using the vLLM backend with FlashHead enabled. It maintains an in-memory chat history and supports simple commands such as `/exit` to quit and `/reset` to clear the context.

```python
import asyncio
from embedl.models.vllm.demo import run_repl

# Launch the streaming chat REPL; type /exit to quit or /reset to clear the context.
model_id = "embedl/Llama-3.2-1B-Instruct-FlashHead"
asyncio.run(
    run_repl(
        model=model_id
    )
)
```
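For readers curious what a loop like this involves, below is a minimal, self-contained sketch of the same command handling (`/exit`, `/reset`) around an in-memory chat history. It is illustrative only and does not show the internals of `embedl.models.vllm.demo.run_repl`: the `fake_reply` helper is a hypothetical stand-in for the streaming vLLM + FlashHead call made by the real demo.

```python
# Illustrative sketch only; not the embedl implementation.
# fake_reply is a hypothetical placeholder for the streaming vLLM call.
def fake_reply(history):
    return f"(reply generated from {len(history)} prior messages)"

def simple_repl():
    history = []  # in-memory chat history
    while True:
        user = input("you> ").strip()
        if user == "/exit":   # quit the REPL
            break
        if user == "/reset":  # clear the conversation context
            history.clear()
            continue
        history.append({"role": "user", "content": user})
        reply = fake_reply(history)
        history.append({"role": "assistant", "content": reply})
        print(reply)

if __name__ == "__main__":
    simple_repl()
```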
---

---