README.md — 23 additions, 35 deletions
@@ -129,31 +129,26 @@ NVIDIA drivers on your machine need to be compatible with CUDA version 12.2 or h
 
 To see all options to serve your models:
 
-```shell
-text-embeddings-router --help
-```
+```console
+$ text-embeddings-router --help
+Text Embedding Webserver
 
-```
 Usage: text-embeddings-router [OPTIONS]
 
 Options:
       --model-id <MODEL_ID>
-          The name of the model to load. Can be a MODEL_ID as listed on <https://hf.co/models> like `thenlper/gte-base`.
-          Or it can be a local directory containing the necessary files as saved by `save_pretrained(...)` methods of
-          transformers
+          The name of the model to load. Can be a MODEL_ID as listed on <https://hf.co/models> like `BAAI/bge-large-en-v1.5`. Or it can be a local directory containing the necessary files as saved by `save_pretrained(...)` methods of transformers
 
           [env: MODEL_ID=]
-          [default: thenlper/gte-base]
+          [default: BAAI/bge-large-en-v1.5]
 
       --revision <REVISION>
-          The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id
-          or a branch like `refs/pr/2`
+          The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id or a branch like `refs/pr/2`
 
           [env: REVISION=]
 
       --tokenization-workers <TOKENIZATION_WORKERS>
-          Optionally control the number of tokenizer workers used for payload tokenization, validation and truncation.
-          Default to the number of CPU cores on the machine
+          Optionally control the number of tokenizer workers used for payload tokenization, validation and truncation. Default to the number of CPU cores on the machine
 
           [env: TOKENIZATION_WORKERS=]
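For reference, a minimal launch sketch built only from the flags documented above; the model id, revision and worker count are illustrative values, not requirements:

```shell
# Hedged example: serve a Hub model at a specific revision with a fixed tokenizer worker pool.
# The values are placeholders; omit --revision and --tokenization-workers to use the defaults.
text-embeddings-router \
    --model-id BAAI/bge-large-en-v1.5 \
    --revision refs/pr/2 \
    --tokenization-workers 4
```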
@@ -175,14 +170,11 @@ Options:
           Possible values:
           - cls: Select the CLS token as embedding
           - mean: Apply Mean pooling to the model embeddings
-          - splade: Apply SPLADE (Sparse Lexical and Expansion) to the model embeddings. This option is only
-            available if the loaded model is a `ForMaskedLM` Transformer model
+          - splade: Apply SPLADE (Sparse Lexical and Expansion) to the model embeddings. This option is only available if the loaded model is a `ForMaskedLM` Transformer model
 
       --max-concurrent-requests <MAX_CONCURRENT_REQUESTS>
           The maximum amount of concurrent requests for this particular deployment.
-          Having a low limit will refuse clients requests instead of having them wait for too long and is usually good
-          to handle backpressure correctly
+          The maximum amount of concurrent requests for this particular deployment. Having a low limit will refuse clients requests instead of having them wait for too long and is usually good to handle backpressure correctly
 
           [env: MAX_CONCURRENT_REQUESTS=]
           [default: 512]
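As the help text notes, each option can also be driven by the environment variable shown next to it. A hedged sketch of capping concurrency through `MAX_CONCURRENT_REQUESTS` (the limit of 128 is arbitrary):

```shell
# Hedged example: lower the concurrency limit via the env var shown above.
# Requests beyond the limit are refused rather than queued indefinitely (backpressure).
MAX_CONCURRENT_REQUESTS=128 text-embeddings-router --model-id BAAI/bge-large-en-v1.5
```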
@@ -194,8 +186,7 @@ Options:
 
           For `max_batch_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.
 
-          Overall this number should be the largest possible until the model is compute bound. Since the actual memory
-          overhead depends on the model implementation, text-embeddings-inference cannot infer this number automatically.
+          Overall this number should be the largest possible until the model is compute bound. Since the actual memory overhead depends on the model implementation, text-embeddings-inference cannot infer this number automatically.
 
           [env: MAX_BATCH_TOKENS=]
           [default: 16384]
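The `max_batch_tokens=1000` example above is simple budget arithmetic: the batcher admits requests until their combined token count reaches the limit, so a budget of 1000 fits ten 100-token queries or one 1000-token query. A hedged sketch using the `MAX_BATCH_TOKENS` environment variable shown in the help text (the value is illustrative):

```shell
# Hedged example: set a small batch budget for demonstration.
# Tune the value upward until the model becomes compute bound on your hardware.
MAX_BATCH_TOKENS=1000 text-embeddings-router --model-id BAAI/bge-large-en-v1.5
```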
@@ -223,9 +214,7 @@ Options:
 
           Must be a key in the `sentence-transformers` configuration `prompts` dictionary.
 
-          For example if ``default_prompt_name`` is "query" and the ``prompts`` is {"query": "query: ", ...}, then the
-          sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" because
-          the prompt text will be prepended before any text to encode.
+          For example if ``default_prompt_name`` is "query" and the ``prompts`` is {"query": "query: ", ...}, then the sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" because the prompt text will be prepended before any text to encode.
 
           The argument '--default-prompt-name <DEFAULT_PROMPT_NAME>' cannot be used with '--default-prompt <DEFAULT_PROMPT>`
 
@@ -234,9 +223,7 @@ Options:
       --default-prompt <DEFAULT_PROMPT>
           The prompt that should be used by default for encoding. If not set, no prompt will be applied.
 
-          For example if ``default_prompt`` is "query: " then the sentence "What is the capital of France?" will be
-          encoded as "query: What is the capital of France?" because the prompt text will be prepended before any text
-          to encode.
+          For example if ``default_prompt`` is "query: " then the sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" because the prompt text will be prepended before any text to encode.
 
           The argument '--default-prompt <DEFAULT_PROMPT>' cannot be used with '--default-prompt-name <DEFAULT_PROMPT_NAME>`
 
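To make the prompt behavior concrete, a hedged sketch using the `--default-prompt` flag described above; the prompt string mirrors the example in the help text, and `--default-prompt-name` would work the same way against the `sentence-transformers` `prompts` dictionary:

```shell
# Hedged example: every input is prefixed with "query: " before being embedded,
# so "What is the capital of France?" is encoded as "query: What is the capital of France?".
text-embeddings-router \
    --model-id BAAI/bge-large-en-v1.5 \
    --default-prompt "query: "
```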
@@ -260,15 +247,13 @@ Options:
           [default: 3000]
 
       --uds-path <UDS_PATH>
-          The name of the unix socket some text-embeddings-inference backends will use as they communicate internally
-          with gRPC
+          The name of the unix socket some text-embeddings-inference backends will use as they communicate internally with gRPC
 
           [env: UDS_PATH=]
           [default: /tmp/text-embeddings-inference-server]
 
       --huggingface-hub-cache <HUGGINGFACE_HUB_CACHE>
-          The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk
-          for instance
+          The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk for instance
 
           [env: HUGGINGFACE_HUB_CACHE=]
@@ -283,8 +268,7 @@ Options:
       --api-key <API_KEY>
           Set an api key for request authorization.
 
-          By default the server responds to every request. With an api key set, the requests must have the Authorization
-          header set with the api key as Bearer token.
+          By default the server responds to every request. With an api key set, the requests must have the Authorization header set with the api key as Bearer token.
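A hedged end-to-end sketch of the authorization behavior described above. Only the `--api-key` flag, the Bearer header requirement and the default port `3000` come from the help text; the `/embed` route, the request payload and the key value are assumptions for illustration:

```shell
# Hedged example: start the server with an API key (placeholder value).
text-embeddings-router --model-id BAAI/bge-large-en-v1.5 --api-key my-secret-key

# Clients must then present the key as a Bearer token.
# The /embed route and JSON payload are assumptions, not taken from this diff.
curl http://localhost:3000/embed \
    -H "Authorization: Bearer my-secret-key" \
    -H "Content-Type: application/json" \
    -d '{"inputs": "What is Deep Learning?"}'
```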
docs/source/en/cli_arguments.md — 27 additions, 33 deletions
@@ -18,31 +18,26 @@ rendered properly in your Markdown viewer.
 
 To see all options to serve your models, run the following:
 
-```shell
-text-embeddings-router --help
-```
+```console
+$ text-embeddings-router --help
+Text Embedding Webserver
 
-```
 Usage: text-embeddings-router [OPTIONS]
 
 Options:
       --model-id <MODEL_ID>
-          The name of the model to load. Can be a MODEL_ID as listed on <https://hf.co/models> like `thenlper/gte-base`.
-          Or it can be a local directory containing the necessary files as saved by `save_pretrained(...)` methods of
-          transformers
+          The name of the model to load. Can be a MODEL_ID as listed on <https://hf.co/models> like `BAAI/bge-large-en-v1.5`. Or it can be a local directory containing the necessary files as saved by `save_pretrained(...)` methods of transformers
 
           [env: MODEL_ID=]
-          [default: thenlper/gte-base]
+          [default: BAAI/bge-large-en-v1.5]
 
       --revision <REVISION>
-          The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id
-          or a branch like `refs/pr/2`
+          The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id or a branch like `refs/pr/2`
 
           [env: REVISION=]
 
       --tokenization-workers <TOKENIZATION_WORKERS>
-          Optionally control the number of tokenizer workers used for payload tokenization, validation and truncation.
-          Default to the number of CPU cores on the machine
+          Optionally control the number of tokenizer workers used for payload tokenization, validation and truncation. Default to the number of CPU cores on the machine
 
           [env: TOKENIZATION_WORKERS=]
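The `--model-id` description also accepts a local directory instead of a Hub id. A hedged sketch; the path is a placeholder for wherever `save_pretrained(...)` wrote the model files:

```shell
# Hedged example: load a model from a local directory produced by save_pretrained(...).
# /data/my-embedding-model is a placeholder path.
text-embeddings-router --model-id /data/my-embedding-model
```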
@@ -64,14 +59,11 @@ Options:
           Possible values:
           - cls: Select the CLS token as embedding
           - mean: Apply Mean pooling to the model embeddings
-          - splade: Apply SPLADE (Sparse Lexical and Expansion) to the model embeddings. This option is only
-            available if the loaded model is a `ForMaskedLM` Transformer model
+          - splade: Apply SPLADE (Sparse Lexical and Expansion) to the model embeddings. This option is only available if the loaded model is a `ForMaskedLM` Transformer model
 
       --max-concurrent-requests <MAX_CONCURRENT_REQUESTS>
           The maximum amount of concurrent requests for this particular deployment.
-          Having a low limit will refuse clients requests instead of having them wait for too long and is usually good
-          to handle backpressure correctly
+          The maximum amount of concurrent requests for this particular deployment. Having a low limit will refuse clients requests instead of having them wait for too long and is usually good to handle backpressure correctly
 
           [env: MAX_CONCURRENT_REQUESTS=]
           [default: 512]
@@ -83,8 +75,7 @@ Options:
 
           For `max_batch_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.
 
-          Overall this number should be the largest possible until the model is compute bound. Since the actual memory
-          overhead depends on the model implementation, text-embeddings-inference cannot infer this number automatically.
+          Overall this number should be the largest possible until the model is compute bound. Since the actual memory overhead depends on the model implementation, text-embeddings-inference cannot infer this number automatically.
 
           [env: MAX_BATCH_TOKENS=]
           [default: 16384]
@@ -112,9 +103,7 @@ Options:
 
           Must be a key in the `sentence-transformers` configuration `prompts` dictionary.
 
-          For example if ``default_prompt_name`` is "query" and the ``prompts`` is {"query": "query: ", ...}, then the
-          sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" because
-          the prompt text will be prepended before any text to encode.
+          For example if ``default_prompt_name`` is "query" and the ``prompts`` is {"query": "query: ", ...}, then the sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" because the prompt text will be prepended before any text to encode.
 
           The argument '--default-prompt-name <DEFAULT_PROMPT_NAME>' cannot be used with '--default-prompt <DEFAULT_PROMPT>`
 
@@ -123,9 +112,7 @@ Options:
       --default-prompt <DEFAULT_PROMPT>
           The prompt that should be used by default for encoding. If not set, no prompt will be applied.
 
-          For example if ``default_prompt`` is "query: " then the sentence "What is the capital of France?" will be
-          encoded as "query: What is the capital of France?" because the prompt text will be prepended before any text
-          to encode.
+          For example if ``default_prompt`` is "query: " then the sentence "What is the capital of France?" will be encoded as "query: What is the capital of France?" because the prompt text will be prepended before any text to encode.
 
           The argument '--default-prompt <DEFAULT_PROMPT>' cannot be used with '--default-prompt-name <DEFAULT_PROMPT_NAME>`
 
@@ -149,15 +136,13 @@ Options:
           [default: 3000]
 
       --uds-path <UDS_PATH>
-          The name of the unix socket some text-embeddings-inference backends will use as they communicate internally
-          with gRPC
+          The name of the unix socket some text-embeddings-inference backends will use as they communicate internally with gRPC
 
           [env: UDS_PATH=]
           [default: /tmp/text-embeddings-inference-server]
 
       --huggingface-hub-cache <HUGGINGFACE_HUB_CACHE>
-          The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk
-          for instance
+          The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk for instance
 
           [env: HUGGINGFACE_HUB_CACHE=]
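A hedged sketch combining the two settings above; both paths are placeholders for a mounted disk and a custom socket location:

```shell
# Hedged example: keep the Hugging Face hub cache on a mounted disk and relocate the
# internal gRPC unix socket. /mnt/hf-cache and /tmp/tei.sock are placeholder paths.
HUGGINGFACE_HUB_CACHE=/mnt/hf-cache \
    text-embeddings-router \
    --model-id BAAI/bge-large-en-v1.5 \
    --uds-path /tmp/tei.sock
```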
@@ -172,8 +157,7 @@ Options:
       --api-key <API_KEY>
           Set an api key for request authorization.
 
-          By default the server responds to every request. With an api key set, the requests must have the Authorization
-          header set with the api key as Bearer token.
+          By default the server responds to every request. With an api key set, the requests must have the Authorization header set with the api key as Bearer token.