
Commit 7fdf05f

Support remote endpoints (#2085)
Signed-off-by: Ubuntu <azureuser@denvr-inf.kifxisxbiwme5gt4kkwqsfdjuh.dx.internal.cloudapp.net>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 80fa841 commit 7fdf05f

13 files changed: +545 additions, -54 deletions


ChatQnA/chatqna.py

Lines changed: 17 additions & 19 deletions
@@ -175,25 +175,23 @@ def align_generator(self, gen, **kwargs):
         # b'data:{"id":"","object":"text_completion","created":1725530204,"model":"meta-llama/Meta-Llama-3-8B-Instruct","system_fingerprint":"2.0.1-native","choices":[{"index":0,"delta":{"role":"assistant","content":"?"},"logprobs":null,"finish_reason":null}]}\n\n'
         for line in gen:
             line = line.decode("utf-8")
-            start = line.find("{")
-            end = line.rfind("}") + 1
-
-            json_str = line[start:end]
-            try:
-                # sometimes yield empty chunk, do a fallback here
-                json_data = json.loads(json_str)
-                if "ops" in json_data and "op" in json_data["ops"][0]:
-                    if "value" in json_data["ops"][0] and isinstance(json_data["ops"][0]["value"], str):
-                        yield f"data: {repr(json_data['ops'][0]['value'].encode('utf-8'))}\n\n"
-                    else:
-                        pass
-                elif (
-                    json_data["choices"][0]["finish_reason"] != "eos_token"
-                    and "content" in json_data["choices"][0]["delta"]
-                ):
-                    yield f"data: {repr(json_data['choices'][0]['delta']['content'].encode('utf-8'))}\n\n"
-            except Exception as e:
-                yield f"data: {repr(json_str.encode('utf-8'))}\n\n"
+            chunks = [chunk.strip() for chunk in line.split("\n\n") if chunk.strip()]
+            for line in chunks:
+                start = line.find("{")
+                end = line.rfind("}") + 1
+                json_str = line[start:end]
+                try:
+                    # sometimes yield empty chunk, do a fallback here
+                    json_data = json.loads(json_str)
+                    if "ops" in json_data and "op" in json_data["ops"][0]:
+                        if "value" in json_data["ops"][0] and isinstance(json_data["ops"][0]["value"], str):
+                            yield f"data: {repr(json_data['ops'][0]['value'].encode('utf-8'))}\n\n"
+                        else:
+                            pass
+                    elif "content" in json_data["choices"][0]["delta"]:
+                        yield f"data: {repr(json_data['choices'][0]['delta']['content'].encode('utf-8'))}\n\n"
+                except Exception as e:
+                    yield f"data: {repr(json_str.encode('utf-8'))}\n\n"
         yield "data: [DONE]\n\n"

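A likely reason for this change: a single read from a remote endpoint's event stream can carry several `data: ...` server-sent events back to back, so parsing the whole read as one JSON object hits the exception fallback and echoes raw text. Below is a minimal, self-contained sketch of the new splitting behavior; the example payload is hypothetical.

```python
import json

# Hypothetical raw chunk: two SSE events delivered in a single read,
# separated by the usual blank line ("\n\n").
raw = (
    b'data: {"choices":[{"delta":{"content":"Hello"},"finish_reason":null}]}\n\n'
    b'data: {"choices":[{"delta":{"content":" world"},"finish_reason":null}]}\n\n'
)

line = raw.decode("utf-8")
# Split the read into individual events, then extract the JSON object from each.
chunks = [chunk.strip() for chunk in line.split("\n\n") if chunk.strip()]
for event in chunks:
    start, end = event.find("{"), event.rfind("}") + 1
    payload = json.loads(event[start:end])
    print(payload["choices"][0]["delta"]["content"])  # prints "Hello", then " world"
```
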
ChatQnA/docker_compose/intel/cpu/xeon/README.md

Lines changed: 23 additions & 0 deletions
@@ -147,6 +147,7 @@ In the context of deploying a ChatQnA pipeline on an Intel® Xeon® platform, we
 | File                                             | Description                                                                                  |
 | ------------------------------------------------ | -------------------------------------------------------------------------------------------- |
 | [compose.yaml](./compose.yaml)                   | Default compose file using vllm as serving framework and redis as vector database           |
+| [compose_remote.yaml](./compose_remote.yaml)     | Default compose file using remote inference endpoints and redis as vector database          |
 | [compose_milvus.yaml](./compose_milvus.yaml)     | Uses Milvus as the vector database. All other configurations remain the same as the default |
 | [compose_pinecone.yaml](./compose_pinecone.yaml) | Uses Pinecone as the vector database. All other configurations remain the same as the default. For more details, refer to [README_pinecone.md](./README_pinecone.md). |
 | [compose_qdrant.yaml](./compose_qdrant.yaml)     | Uses Qdrant as the vector database. All other configurations remain the same as the default. For more details, refer to [README_qdrant.md](./README_qdrant.md). |
@@ -158,6 +159,28 @@ In the context of deploying a ChatQnA pipeline on an Intel® Xeon® platform, we
 | [compose_tgi.telemetry.yaml](./compose_tgi.telemetry.yaml) | Helper file for telemetry features for tgi. Can be used along with any compose files that serves tgi |
 | [compose_mariadb.yaml](./compose_mariadb.yaml)             | Uses MariaDB Server as the vector database. All other configurations remain the same as the default |

+### Running LLM models with remote endpoints
+
+When models are deployed on a remote server, a base URL and an API key are required to access them. To set up a remote server and acquire the base URL and API key, refer to [Intel® AI for Enterprise Inference](https://www.intel.com/content/www/us/en/developer/topic-technology/artificial-intelligence/enterprise-inference.html) offerings.
+
+Set the following environment variables.
+
+- `REMOTE_ENDPOINT` is the HTTPS endpoint of the remote server hosting the model of choice (e.g. https://api.example.com). **Note:** If the API for the models does not use LiteLLM, append the second part of the model card to the URL. For example, set `REMOTE_ENDPOINT` to https://api.example.com/Llama-3.3-70B-Instruct if the model card is `meta-llama/Llama-3.3-70B-Instruct`.
+- `API_KEY` is the access token or key used to access the model(s) on the server.
+- `LLM_MODEL_ID` is the model card, which may need to be overwritten depending on what it is set to in `set_env.sh`.
+
+```bash
+export REMOTE_ENDPOINT=<https-endpoint-of-remote-server>
+export API_KEY=<your-api-key>
+export LLM_MODEL_ID=<model-card>
+```
+
+After setting these environment variables, run `docker compose` with `compose_remote.yaml`:
+
+```bash
+docker compose -f compose_remote.yaml up -d
+```
+
 ## ChatQnA with Conversational UI (Optional)

 To access the Conversational UI (react based) frontend, modify the UI service in the `compose` file used to deploy. Replace `chaqna-xeon-ui-server` service with the `chatqna-xeon-conversation-ui-server` service as per the config below:
ChatQnA/docker_compose/intel/cpu/xeon/compose_remote.yaml

Lines changed: 1 addition & 1 deletion
@@ -102,7 +102,7 @@ services:
       - RERANK_SERVER_HOST_IP=tei-reranking-service
       - RERANK_SERVER_PORT=${RERANK_SERVER_PORT:-80}
       - LLM_SERVER_HOST_IP=${REMOTE_ENDPOINT}
-      - OPENAI_API_KEY= ${OPENAI_API_KEY}
+      - OPENAI_API_KEY=${API_KEY}
       - LLM_SERVER_PORT=80
       - LLM_MODEL=${LLM_MODEL_ID}
       - LOGFLAG=${LOGFLAG}

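Worth noting why this change matters: with list-style `environment:` entries, everything after the first `=` is taken literally, so the old `OPENAI_API_KEY= ${OPENAI_API_KEY}` passed a value with a leading space; the substitution also now reads the documented `API_KEY` variable. A quick way to check the value inside a running container (the container name is assumed; adjust to your deployment):

```bash
# Show the variable with non-printing characters made visible;
# a stray leading space would appear before the key.
docker exec chatqna-xeon-backend-server printenv OPENAI_API_KEY | cat -A
```
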
CodeGen/codegen.py

Lines changed: 0 additions & 5 deletions
@@ -181,7 +181,6 @@ async def handle_request(self, request: Request):
 
         # Handle the chat messages to generate the prompt
         prompt = handle_message(chat_request.messages)
-
         # Get the agents flag from the request data, default to False if not provided
         agents_flag = data.get("agents_flag", False)
 
@@ -200,7 +199,6 @@ async def handle_request(self, request: Request):
 
         # Initialize the initial inputs with the generated prompt
         initial_inputs = {"query": prompt}
-
         # Check if the key index name is provided in the parameters
         if parameters.index_name:
             if agents_flag:
@@ -268,7 +266,6 @@ async def handle_request(self, request: Request):
         result_dict, runtime_graph = await megaservice.schedule(
             initial_inputs=initial_inputs, llm_parameters=parameters
         )
-
         for node, response in result_dict.items():
             # Check if the last microservice in the megaservice is LLM
             if (
@@ -277,7 +274,6 @@ async def handle_request(self, request: Request):
                 and megaservice.services[node].service_type == ServiceType.LLM
             ):
                 return response
-
         # Get the response from the last node in the runtime graph
         last_node = runtime_graph.all_leaves()[-1]
 
@@ -288,7 +284,6 @@ async def handle_request(self, request: Request):
             response = result_dict[last_node]["text"]
         except (KeyError, TypeError):
             response = "Response Error"
-
         choices = []
         usage = UsageInfo()
         choices.append(

CodeGen/docker_compose/intel/cpu/xeon/README.md

Lines changed: 28 additions & 0 deletions
@@ -91,11 +91,39 @@ Different Docker Compose files are available to select the LLM serving backend.
 - **Description:** Uses Hugging Face Text Generation Inference (TGI) optimized for Intel CPUs as the LLM serving engine.
 - **Services Deployed:** `codegen-tgi-server`, `codegen-llm-server`, `codegen-tei-embedding-server`, `codegen-retriever-server`, `redis-vector-db`, `codegen-dataprep-server`, `codegen-backend-server`, `codegen-gradio-ui-server`.
 - **To Run:**
+
   ```bash
   # Ensure environment variables (HOST_IP, HF_TOKEN) are set
   docker compose -f compose_tgi.yaml up -d
   ```

+#### Deployment with remote endpoints (`compose_remote.yaml`)
+
+- **Compose File:** `compose_remote.yaml`
+- **Description:** Uses remote endpoints to access the served LLMs. This is the default configuration except for the LLM serving engine.
+- **Services Deployed:** `codegen-tei-embedding-server`, `codegen-retriever-server`, `redis-vector-db`, `codegen-dataprep-server`, `codegen-backend-server`, `codegen-gradio-ui-server`.
+- **To Run:**
+
+When models are deployed on a remote server, a base URL and an API key are required to access them. To set up a remote server and acquire the base URL and API key, refer to [Intel® AI for Enterprise Inference](https://www.intel.com/content/www/us/en/developer/topic-technology/artificial-intelligence/enterprise-inference.html) offerings.
+
+Set the following environment variables.
+
+- `REMOTE_ENDPOINT` is the HTTPS endpoint of the remote server hosting the model of choice (e.g. https://api.example.com). **Note:** If the API for the models does not use LiteLLM, append the second part of the model card to the URL. For example, set `REMOTE_ENDPOINT` to https://api.example.com/Llama-3.3-70B-Instruct if the model card is `meta-llama/Llama-3.3-70B-Instruct`.
+- `API_KEY` is the access token or key used to access the model(s) on the server.
+- `LLM_MODEL_ID` is the model card, which may need to be overwritten depending on what it is set to in `set_env.sh`.
+
+```bash
+export REMOTE_ENDPOINT=<https-endpoint-of-remote-server>
+export API_KEY=<your-api-key>
+export LLM_MODEL_ID=<model-card>
+```
+
+After setting these environment variables, run `docker compose` with `compose_remote.yaml`:
+
+```bash
+docker compose -f compose_remote.yaml up -d
+```
+
 ### Configuration Parameters
 
 #### Environment Variables

CodeGen/docker_compose/intel/cpu/xeon/compose_remote.yaml

Lines changed: 10 additions & 1 deletion
@@ -6,6 +6,9 @@ services:
   codegen-xeon-backend-server:
     image: ${REGISTRY:-opea}/codegen:${TAG:-latest}
     container_name: codegen-xeon-backend-server
+    depends_on:
+      dataprep-redis-server:
+        condition: service_healthy
     ports:
       - "7778:7778"
     environment:
@@ -14,7 +17,8 @@ services:
       - http_proxy=${http_proxy}
       - MEGA_SERVICE_HOST_IP=${MEGA_SERVICE_HOST_IP}
      - LLM_SERVICE_HOST_IP=${REMOTE_ENDPOINT}
-      - OPENAI_API_KEY= ${OPENAI_API_KEY}
+      - LLM_MODEL_ID=${LLM_MODEL_ID}
+      - OPENAI_API_KEY=${API_KEY}
       - RETRIEVAL_SERVICE_HOST_IP=${RETRIEVAL_SERVICE_HOST_IP}
       - REDIS_RETRIEVER_PORT=${REDIS_RETRIEVER_PORT}
       - TEI_EMBEDDING_HOST_IP=${TEI_EMBEDDING_HOST_IP}
@@ -61,6 +65,11 @@ services:
       INDEX_NAME: ${INDEX_NAME}
       HF_TOKEN: ${HF_TOKEN}
       LOGFLAG: true
+    healthcheck:
+      test: ["CMD-SHELL", "curl -f http://localhost:5000/v1/health_check || exit 1"]
+      interval: 10s
+      timeout: 5s
+      retries: 10
     restart: unless-stopped
   tei-embedding-serving:
     image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.7

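The new `depends_on` condition ties the backend's startup to the dataprep healthcheck added further down. A quick way to confirm the ordering after `docker compose -f compose_remote.yaml up -d` (the container name is assumed to match the service name and may differ in your deployment):

```bash
# Per-service state for this compose project, including health.
docker compose -f compose_remote.yaml ps

# Health status reported by the dataprep container's healthcheck.
docker inspect --format '{{.State.Health.Status}}' dataprep-redis-server
```
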
DocSum/docker_compose/intel/cpu/xeon/README.md

Lines changed: 27 additions & 4 deletions
@@ -115,10 +115,33 @@ All the DocSum containers will be stopped and then removed on completion of the
 
 In the context of deploying a DocSum pipeline on an Intel® Xeon® platform, we can pick and choose different large language model serving frameworks. The table below outlines the various configurations that are available as part of the application.
 
-| File                                   | Description                                                                                |
-| -------------------------------------- | ------------------------------------------------------------------------------------------ |
-| [compose.yaml](./compose.yaml)         | Default compose file using vllm as serving framework                                       |
-| [compose_tgi.yaml](./compose_tgi.yaml) | The LLM serving framework is TGI. All other configurations remain the same as the default |
+| File                                         | Description                                                                                        |
+| -------------------------------------------- | -------------------------------------------------------------------------------------------------- |
+| [compose.yaml](./compose.yaml)               | Default compose file using vllm as serving framework                                               |
+| [compose_tgi.yaml](./compose_tgi.yaml)       | The LLM serving framework is TGI. All other configurations remain the same as the default          |
+| [compose_remote.yaml](./compose_remote.yaml) | Uses remote inference endpoints for LLMs. All other configurations remain the same as the default  |
+
+### Running LLM models with remote endpoints
+
+When models are deployed on a remote server, a base URL and an API key are required to access them. To set up a remote server and acquire the base URL and API key, refer to [Intel® AI for Enterprise Inference](https://www.intel.com/content/www/us/en/developer/topic-technology/artificial-intelligence/enterprise-inference.html) offerings.
+
+Set the following environment variables.
+
+- `REMOTE_ENDPOINT` is the HTTPS endpoint of the remote server hosting the model of choice (e.g. https://api.example.com). **Note:** If the API for the models does not use LiteLLM, append the second part of the model card to the URL. For example, set `REMOTE_ENDPOINT` to https://api.example.com/Llama-3.3-70B-Instruct if the model card is `meta-llama/Llama-3.3-70B-Instruct`.
+- `API_KEY` is the access token or key used to access the model(s) on the server.
+- `LLM_MODEL_ID` is the model card, which may need to be overwritten depending on what it is set to in `set_env.sh`.
+
+```bash
+export REMOTE_ENDPOINT=<https-endpoint-of-remote-server>
+export API_KEY=<your-api-key>
+export LLM_MODEL_ID=<model-card>
+```
+
+After setting these environment variables, run `docker compose` with `compose_remote.yaml`:
+
+```bash
+docker compose -f compose_remote.yaml up -d
+```
 
 ## DocSum Detailed Usage
 

DocSum/docker_compose/intel/cpu/xeon/compose_remote.yaml

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+services:
+  llm-docsum-vllm:
+    image: ${REGISTRY:-opea}/llm-docsum:${TAG:-latest}
+    container_name: docsum-xeon-llm-server
+    ports:
+      - ${LLM_PORT:-9000}:9000
+    ipc: host
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      LLM_ENDPOINT: ${REMOTE_ENDPOINT}
+      LLM_MODEL_ID: ${LLM_MODEL_ID}
+      OPENAI_API_KEY: ${API_KEY}
+      HUGGINGFACEHUB_API_TOKEN: ${HF_TOKEN}
+      HF_TOKEN: ${HF_TOKEN}
+      MAX_INPUT_TOKENS: ${MAX_INPUT_TOKENS}
+      MAX_TOTAL_TOKENS: ${MAX_TOTAL_TOKENS}
+      DocSum_COMPONENT_NAME: ${DocSum_COMPONENT_NAME}
+
+      LOGFLAG: ${LOGFLAG:-False}
+    restart: unless-stopped
+
+  whisper:
+    image: ${REGISTRY:-opea}/whisper:${TAG:-latest}
+    container_name: docsum-xeon-whisper-server
+    ports:
+      - "7066:7066"
+    ipc: host
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+    restart: unless-stopped
+
+  docsum-xeon-backend-server:
+    image: ${REGISTRY:-opea}/docsum:${TAG:-latest}
+    container_name: docsum-xeon-backend-server
+    depends_on:
+      - llm-docsum-vllm
+    ports:
+      - "${BACKEND_SERVICE_PORT:-8888}:8888"
+    environment:
+      - no_proxy=${no_proxy}
+      - https_proxy=${https_proxy}
+      - http_proxy=${http_proxy}
+      - MEGA_SERVICE_HOST_IP=${MEGA_SERVICE_HOST_IP}
+      - LLM_SERVICE_HOST_IP=${LLM_SERVICE_HOST_IP}
+      - ASR_SERVICE_HOST_IP=${ASR_SERVICE_HOST_IP}
+    ipc: host
+    restart: always
+
+  docsum-gradio-ui:
+    image: ${REGISTRY:-opea}/docsum-gradio-ui:${TAG:-latest}
+    container_name: docsum-xeon-ui-server
+    depends_on:
+      - docsum-xeon-backend-server
+    ports:
+      - "${FRONTEND_SERVICE_PORT:-5173}:5173"
+    environment:
+      - no_proxy=${no_proxy}
+      - https_proxy=${https_proxy}
+      - http_proxy=${http_proxy}
+      - BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
+      - DOC_BASE_URL=${BACKEND_SERVICE_ENDPOINT}
+    ipc: host
+    restart: always
+
+networks:
+  default:
+    driver: bridge

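Everything in this new compose file is driven by environment variables. Beyond the `REMOTE_ENDPOINT`, `API_KEY`, and `LLM_MODEL_ID` covered in the README above, it also reads the usual DocSum settings (ports, host IPs, `MAX_*_TOKENS`) from `set_env.sh`. A minimal launch sketch with placeholder values; the `DocSum_COMPONENT_NAME` value mirrors the export shown elsewhere in this commit:

```bash
source set_env.sh                              # host IPs, ports, MAX_*_TOKENS, etc.
export REMOTE_ENDPOINT=<https-endpoint-of-remote-server>
export API_KEY=<your-api-key>
export LLM_MODEL_ID=<model-card>
export DocSum_COMPONENT_NAME="OpeaDocSumvLLM"  # as exported in the ProductivitySuite README below

docker compose -f compose_remote.yaml up -d
```
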
ProductivitySuite/docker_compose/intel/cpu/xeon/README.md

Lines changed: 25 additions & 0 deletions
@@ -43,6 +43,7 @@ Some HuggingFace resources, such as some models, are only accessible if you have
 To set up environment variables for deploying Productivity Suite service, source the set_env.sh script in this directory:
 
 ```
+export host_ip=<ip-address-of-the-machine>
 source set_env.sh
 ```
 
@@ -228,3 +229,27 @@ The table provides a comprehensive overview of the Productivity Suite service ut
 | tgi_service_codegen | ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu | No | Serves code generation models for inference, optimized for Intel Xeon CPUs. |
 | tgi-service         | ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu | No | Specific to the TGI deployment, focuses on text generation inference using Xeon hardware. |
 | whisper-server      | opea/whisper:latest                                           | No | Provides speech-to-text transcription services using Whisper models. |
+
+### Running LLM models with remote endpoints
+
+When models are deployed on a remote server, a base URL and an API key are required to access them. To set up a remote server and acquire the base URL and API key, refer to [Intel® AI for Enterprise Inference](https://www.intel.com/content/www/us/en/developer/topic-technology/artificial-intelligence/enterprise-inference.html) offerings.
+
+Set the following environment variables.
+
+- `REMOTE_ENDPOINT` is the HTTPS endpoint of the remote server hosting the model of choice (e.g. https://api.example.com). **Note:** If the API for the models does not use LiteLLM, append the second part of the model card to the URL. For example, set `REMOTE_ENDPOINT` to https://api.example.com/Llama-3.3-70B-Instruct if the model card is `meta-llama/Llama-3.3-70B-Instruct`.
+- `API_KEY` is the access token or key used to access the model(s) on the server.
+- `LLM_MODEL_ID` is the model card, which may need to be overwritten depending on what it is set to in `set_env.sh`.
+
+```bash
+export DocSum_COMPONENT_NAME="OpeaDocSumvLLM"
+export REMOTE_ENDPOINT=<https-endpoint-of-remote-server>
+export API_KEY=<your-api-key>
+export LLM_MODEL_ID=<model-card>
+export LLM_MODEL_ID_CODEGEN=<model-card>
+```
+
+After setting these environment variables, run `docker compose` with `compose_remote.yaml`:
+
+```bash
+docker compose -f compose_remote.yaml up -d
+```
