Derek Thomas
committed on
Commit
·
867c18d
1
Parent(s):
c606718
Adding more info around LanceDB
notebooks/05_vector_db.ipynb +40 -14
notebooks/05_vector_db.ipynb
CHANGED
|
@@ -6,9 +6,25 @@
|
|
| 6 |
"metadata": {},
|
| 7 |
"source": [
|
| 8 |
"# Approach\n",
|
|
|
|
| 9 |
"There are a number of aspects of choosing a vector db that might be unique to your situation. You should think through your HW, utilization, latency requirements, scale, etc before choosing. \n",
|
| 10 |
"\n",
|
| 11 |
-
"
|
| 12 |
]
|
| 13 |
},
|
| 14 |
{
|
|
@@ -97,7 +113,7 @@
|
|
| 97 |
},
|
| 98 |
"source": [
|
| 99 |
"# Setup\n",
|
| 100 |
-
"
|
| 101 |
]
|
| 102 |
},
|
| 103 |
{
|
|
@@ -115,14 +131,6 @@
|
|
| 115 |
" document['vector'] = document.pop('embedding')"
|
| 116 |
]
|
| 117 |
},
|
| 118 |
-
{
|
| 119 |
-
"cell_type": "markdown",
|
| 120 |
-
"id": "98aec715-8d97-439e-99c0-0eff63df386b",
|
| 121 |
-
"metadata": {},
|
| 122 |
-
"source": [
|
| 123 |
-
"Convert the dictionaries to `Documents`"
|
| 124 |
-
]
|
| 125 |
-
},
|
| 126 |
{
|
| 127 |
"cell_type": "code",
|
| 128 |
"execution_count": 6,
|
|
@@ -170,9 +178,7 @@
|
|
| 170 |
"id": "676f644c-fb09-4d17-89ba-30c92aad8777",
|
| 171 |
"metadata": {},
|
| 172 |
"source": [
|
| 173 |
-
"
|
| 174 |
-
"\n",
|
| 175 |
-
"Note that if you are doing this at scale, you should use a proper instance and not saving to file. You should also take a [measured ingestion](https://qdrant.tech/documentation/tutorials/bulk-upload/) approach instead of using a convenient loader. "
|
| 176 |
]
|
| 177 |
},
|
| 178 |
{
|
|
@@ -187,11 +193,23 @@
|
|
| 187 |
"from lancedb.embeddings.registry import EmbeddingFunctionRegistry\n",
|
| 188 |
"from lancedb.embeddings.sentence_transformers import SentenceTransformerEmbeddings\n",
|
| 189 |
"\n",
|
| 190 |
-
"\n",
|
| 191 |
"db = lancedb.connect(proj_dir/\".lancedb\")\n",
|
| 192 |
"tbl = db.create_table('arabic-wiki', [document])"
|
| 193 |
]
|
| 194 |
},
|
| 195 |
{
|
| 196 |
"cell_type": "code",
|
| 197 |
"execution_count": 8,
|
|
@@ -818,6 +836,14 @@
|
|
| 818 |
" "
|
| 819 |
]
|
| 820 |
},
|
| 821 |
{
|
| 822 |
"cell_type": "code",
|
| 823 |
"execution_count": 9,
|
|
|
|
| 6 |
"metadata": {},
|
| 7 |
"source": [
|
| 8 |
"# Approach\n",
|
| 9 |
+
"## VectorDB\n",
|
| 10 |
"There are a number of aspects of choosing a vector db that might be unique to your situation. You should think through your HW, utilization, latency requirements, scale, etc before choosing. \n",
|
| 11 |
"\n",
|
| 12 |
+
"I've been hearing a lot about LanceDB and wanted to check it out. It's newer and may or may not be good for **your** use-case. I'm attracted by its fast ingestion, CUDA-assisted indexing, and portability. It has some drawbacks: it doesn't support HNSW yet, and it could change significantly given how early it is.\n",
|
| 13 |
+
"\n",
|
| 14 |
+
"\n",
|
| 15 |
+
"You will be blown away by how fast ingestion + indexing is with LanceDB. \n",
|
| 16 |
+
"\n",
|
| 17 |
+
"## Ingestion Strategy\n",
|
| 18 |
+
"I uploaded the ~100k-document `.ndjson` files in sequence. After uploading, I build the index.\n",
|
| 19 |
+
"\n",
|
| 20 |
+
"## Indexing\n",
|
| 21 |
+
"The algorithm used is `IVF_PQ`. I ignore the `PQ` part because I want better recall. Recall is important because Jais only has a 2k context window, so I can't put my top 10 documents for RAG in my prompt. It will be my top 3 (512\\*3 + query + instructions ~ 2k). For many use-cases `PQ` is worth the trade-off, as you get much faster retrieval without much performance loss. \n",
|
| 22 |
+
"\n",
|
| 23 |
+
"More partitions means faster retrieval but slower indexing. I chose 384 sub_vectors to equal my embedding dimension size, so each sub-vector covers a single dimension. \n",
|
| 24 |
+
"\n",
|
| 25 |
+
"```tbl.create_index(num_partitions=1024, num_sub_vectors=384, accelerator=\"cuda\")```\n",
|
| 26 |
+
"\n",
|
| 27 |
+
"Read more about it [here](https://lancedb.github.io/lancedb/ann_indexes/)."
|
| 28 |
]
|
| 29 |
},
|
| 30 |
{
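The context-budget arithmetic in the Indexing notes (512\*3 + query + instructions ~ 2k) can be sketched as a small helper. This is only an illustration: the function name and the query and instruction token counts are hypothetical, and reading "2k" as 2048 tokens and "512" as tokens per document is an assumption, not something the notebook states.

```python
def max_chunks_in_context(context_window, chunk_tokens, query_tokens, instruction_tokens):
    """Return how many whole chunks fit once the query and instructions are reserved."""
    remaining = context_window - query_tokens - instruction_tokens
    if remaining <= 0:
        return 0
    return remaining // chunk_tokens


# Assumed values: a 2048-token window and 512-token chunks (our reading of the notes);
# the query and instruction sizes below are made up for the example.
top_k = max_chunks_in_context(
    context_window=2048,
    chunk_tokens=512,
    query_tokens=64,
    instruction_tokens=200,
)
print(top_k)  # 3
```

With these assumed sizes only three chunks fit, which is why recall at the very top of the ranking matters more here than it would with a larger window.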
|
|
|
|
| 113 |
},
|
| 114 |
"source": [
|
| 115 |
"# Setup\n",
|
| 116 |
+
"To work with LanceDB we want to create the table before ingesting the first batch. To create a table we need at least 1 doc."
|
| 117 |
]
|
| 118 |
},
|
| 119 |
{
|
|
|
|
| 131 |
" document['vector'] = document.pop('embedding')"
|
| 132 |
]
|
| 133 |
},
|
| 134 |
{
|
| 135 |
"cell_type": "code",
|
| 136 |
"execution_count": 6,
|
|
|
|
| 178 |
"id": "676f644c-fb09-4d17-89ba-30c92aad8777",
|
| 179 |
"metadata": {},
|
| 180 |
"source": [
|
| 181 |
+
"Here we create the db and the table."
|
|
|
|
|
|
|
| 182 |
]
|
| 183 |
},
|
| 184 |
{
|
|
|
|
| 193 |
"from lancedb.embeddings.registry import EmbeddingFunctionRegistry\n",
|
| 194 |
"from lancedb.embeddings.sentence_transformers import SentenceTransformerEmbeddings\n",
|
| 195 |
"\n",
|
|
|
|
| 196 |
"db = lancedb.connect(proj_dir/\".lancedb\")\n",
|
| 197 |
"tbl = db.create_table('arabic-wiki', [document])"
|
| 198 |
]
|
| 199 |
},
|
| 200 |
+
{
|
| 201 |
+
"cell_type": "markdown",
|
| 202 |
+
"id": "502f7cb9-32cf-4b32-8cb3-b021e02bd06c",
|
| 203 |
+
"metadata": {},
|
| 204 |
+
"source": [
|
| 205 |
+
"For each file we:\n",
|
| 206 |
+
"- Read the `ndjson` into a list of documents\n",
|
| 207 |
+
"- Replace 'embedding' with 'vector' to be compatible with LanceDB\n",
|
| 208 |
+
"- Write the docs to the table\n",
|
| 209 |
+
"\n",
|
| 210 |
+
"After that we index with a cuda accelerator."
|
| 211 |
+
]
|
| 212 |
+
},
|
| 213 |
{
|
| 214 |
"cell_type": "code",
|
| 215 |
"execution_count": 8,
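The per-file steps listed in the markdown cell above can be sketched roughly as follows. The notebook's own ingestion cell is not shown in this diff, so the helper names here are hypothetical, and `table` is treated as any object with an `add(list_of_dicts)` method; a LanceDB table is assumed to fit that shape. Indexing is left out of the sketch.

```python
import json
from pathlib import Path


def read_ndjson(path):
    """Read one newline-delimited JSON file into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def rename_embedding(documents):
    """Rename the 'embedding' key to 'vector' in place, mirroring the setup cell."""
    for document in documents:
        document["vector"] = document.pop("embedding")
    return documents


def ingest_files(paths, table):
    """Append each file's documents to `table`, one file at a time.

    `table` only needs an `add(list_of_dicts)` method (an assumption about
    the table object, not something verified against LanceDB here).
    """
    total = 0
    for path in sorted(paths):
        documents = rename_embedding(read_ndjson(path))
        table.add(documents)
        total += len(documents)
    return total
```

Writing one file per `add` call keeps only a single file's documents in memory at a time, which matches the "in sequence" strategy described in the Approach section.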
|
|
|
|
| 836 |
" "
|
| 837 |
]
|
| 838 |
},
|
| 839 |
+
{
|
| 840 |
+
"cell_type": "markdown",
|
| 841 |
+
"id": "179af522-84ca-4985-9ca4-ffd1bde487eb",
|
| 842 |
+
"metadata": {},
|
| 843 |
+
"source": [
|
| 844 |
+
"It's crazy how fast it was: 42 minutes to ingest and index >2M documents. Let's run a test to make sure it worked!"
|
| 845 |
+
]
|
| 846 |
+
},
|
| 847 |
{
|
| 848 |
"cell_type": "code",
|
| 849 |
"execution_count": 9,
|