Commit c57c17e

[NeuralChat] Support Triton on HPU (intel#1292)
1 parent 67cd510 commit c57c17e

File tree: 4 files changed, +138 -0 lines changed
Lines changed: 102 additions & 0 deletions

# Serving NeuralChat Text Generation with Triton Inference Server on HPU

Nvidia Triton Inference Server is a widely adopted inference serving solution. We also support serving and deploying NeuralChat models with Triton Inference Server.

## Prepare serving scripts

```bash
cd <path to intel_extension_for_transformers>/neural_chat/examples/serving/triton_inference_server
mkdir -p models/text_generation/1/
cp ../../../serving/triton/text_generation/model.py models/text_generation/1/model.py
cp ../../../serving/triton/text_generation/client.py models/text_generation/1/client.py
cp ../../../serving/triton/text_generation/config_hpu.pbtxt models/text_generation/config.pbtxt
```

Make sure `KIND_CPU` is used for `instance_group` in `config_hpu.pbtxt`. You can change the value of `count` here to configure the number of model instances on your HPU; the example config file sets it to 8, as shown below.
```
instance_group [{
  count: 8
  kind: KIND_CPU
}]
```
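
You may want to relate `count` to the number of Gaudi devices available on your machine. As an optional check, and assuming the Habana driver stack and its `hl-smi` utility are installed on the host, you can list the visible devices before settling on a value:

```bash
# List the Gaudi accelerators visible on this machine; the device count is a
# useful reference point when choosing `count` in config_hpu.pbtxt.
hl-smi
```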

Then the folder structure under the current `serving` folder should look like this:

```
serving/
├── models
│   └── text_generation
│       ├── 1
│       │   ├── model.py
│       │   └── client.py
│       └── config.pbtxt
├── README.md
```

## Create Docker Image for HPU

Following the commands below, you will create a Docker image for Habana Gaudi on your local machine.

```bash
git clone https://github.com/HabanaAI/Setup_and_Install.git
cd Setup_and_Install/dockerfiles/triton
make build DOCKER_CACHE=true
```

## Run the Backend Container

After the Docker image is created, run a backend container in which to start `tritonserver`. The serving scripts are mounted into the Docker container with `-v ./models:/models`.

Remember to replace `${image_name}` with the Docker image name you just created. You can check the image name with the command `docker images`.

```bash
docker run -it --runtime=habana --name triton_backend --shm-size "4g" -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host -v ./models:/models ${image_name}
```

## Launch the Triton Server

Now you should be inside the Docker container.

To launch your customized Triton server, first install the necessary prerequisites for ITREX (Intel Extension for Transformers). By default, NeuralChat uses `Intel/neural-chat-7b-v3-1` as the LLM. Then you can launch the Triton server to start the service.

Specify an available HTTP port to replace `${your_port}` in the `tritonserver` command.

```bash
# install ITREX using the latest github repo
git clone https://github.com/intel/intel-extension-for-transformers.git itrex
export PYTHONPATH=/opt/tritonserver/itrex
# install requirements (quote the version specifier so the shell does not treat ">" as a redirect)
pip install "transformers>=4.35.2" uvicorn yacs fastapi==0.103.2 neural-compressor accelerate datasets fschat==0.2.35 optimum optimum[habana] neural_speed
# launch triton server
tritonserver --model-repository=/models --http-port ${your_port}
```

When the Triton server is successfully launched, you will see a table like the one below:

```bash
I0103 00:04:58.435488 237 server.cc:626]
+--------+---------+--------+
| Model  | Version | Status |
+--------+---------+--------+
| llama2 | 1       | READY  |
+--------+---------+--------+
```

Check the service status by running the following command (use the `${your_port}` you passed to `tritonserver`):

```bash
curl -v localhost:${your_port}/v2/health/ready
```

You will see `HTTP/1.1 200 OK` in the response if your server is up and ready to receive requests.
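
Triton's HTTP/REST (KServe v2) API also exposes per-model endpoints. As a further optional check, and assuming the model is registered under the name `text_generation` as declared in the `config_hpu.pbtxt` included in this commit, you can confirm the model itself is loaded and inspect its declared inputs and outputs:

```bash
# Per-model readiness: returns HTTP 200 once the model is loaded and READY
curl -v localhost:${your_port}/v2/models/text_generation/ready
# Model metadata: lists the model's versions, inputs (INPUT0) and outputs (OUTPUT0)
curl localhost:${your_port}/v2/models/text_generation
```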

## Launch and Run the Client

Start the Triton client container and enter it. Remember to replace `${image_name}` with the Docker image name you just created.

```bash
docker run -it --runtime=habana --name triton_client --shm-size "4g" -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host -v ./models:/models ${image_name}
```

Inside the client Docker container, install `tritonclient` first:

```bash
pip install "tritonclient[all]"
```

Send a request using `client.py`. `${your_port}` is the Triton server's HTTP port.

```bash
python /models/text_generation/1/client.py --prompt="Tell me about Intel Xeon Scalable Processors." --url=localhost:${your_port}
```
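
If you want to exercise the service without `client.py`, you can also send a raw request to Triton's KServe v2 inference endpoint. The sketch below assumes the model is served as `text_generation` with the single string input `INPUT0` and output `OUTPUT0` declared in `config_hpu.pbtxt`; adjust the names and port if your deployment differs.

```bash
# Minimal raw inference request against the KServe v2 HTTP API.
# TYPE_STRING tensors are sent with datatype "BYTES" and the text placed in "data".
curl -X POST localhost:${your_port}/v2/models/text_generation/infer \
  -H "Content-Type: application/json" \
  -d '{
        "inputs": [
          {
            "name": "INPUT0",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Tell me about Intel Xeon Scalable Processors."]
          }
        ]
      }'
```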
Lines changed: 36 additions & 0 deletions
# Copyright (c) 2023 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name: "text_generation"
backend: "python"

input [
  {
    name: "INPUT0"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

instance_group [{
  count: 8
  kind: KIND_CPU
}]
