# Serving NeuralChat Text Generation with Triton Inference Server on HPU

NVIDIA Triton Inference Server is a widely adopted inference serving solution. We also support serving and deploying NeuralChat models with Triton Inference Server.

## Prepare Serving Scripts

```bash
cd <path to intel_extension_for_transformers>/neural_chat/examples/serving/triton_inference_server
mkdir -p models/text_generation/1/
cp ../../../serving/triton/text_generation/model.py models/text_generation/1/model.py
cp ../../../serving/triton/text_generation/client.py models/text_generation/1/client.py
cp ../../../serving/triton/text_generation/config_hpu.pbtxt models/text_generation/config.pbtxt
```

Make sure `KIND_CPU` is used for `instance_group` in `config_hpu.pbtxt`. You can change the value of `count` to configure the number of model instances on your HPU; the example config file sets it to 8, as shown below.
```
instance_group [{
  count: 8
  kind: KIND_CPU
}]
```
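
If you prefer to adjust the instance count from the command line, a `sed` one-liner works; this is a minimal sketch that assumes the file still contains the default `count: 8` line:

```bash
# Example: reduce the number of model instances from 8 to 1
sed -i 's/count: 8/count: 1/' models/text_generation/config.pbtxt
```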

Then your folder structure under the current `serving` folder should look like this:

```
serving/
├── models
│   └── text_generation
│       ├── 1
│       │   ├── model.py
│       │   └── client.py
│       └── config.pbtxt
└── README.md
```

## Create Docker Image for HPU
Follow the commands below to create a Docker image for Habana Gaudi on your local machine.

```bash
git clone https://github.com/HabanaAI/Setup_and_Install.git
cd Setup_and_Install/dockerfiles/triton
make build DOCKER_CACHE=true
```
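
Before running containers with `--runtime=habana`, you may want to confirm that the Habana container runtime is registered with Docker. This is a quick, optional check, assuming the runtime was installed as part of the Gaudi software stack:

```bash
# "habana" should appear among the registered Docker runtimes
docker info | grep -i runtimes
```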

## Run the Backend Container
After the Docker image is created, you need to run a backend container to host `tritonserver`. The serving scripts will be mounted into the Docker container with `-v ./models:/models`.

Remember to replace `${image_name}` with the name of the Docker image you just created. You can check the image name with the command `docker images`.
```bash
docker run -it --runtime=habana --name triton_backend --shm-size "4g" -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host -v ./models:/models ${image_name}
```

## Launch the Triton Server
Now you should be inside the Docker container.
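
If the Habana tools are present in the image, a quick device query confirms that the Gaudi cards are visible inside the container. This is an optional sanity check; `hl-smi` ships with the Habana software stack but may not be included in every image:

```bash
# List the visible Gaudi devices (skip this step if hl-smi is not installed)
hl-smi
```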

In order to launch your customized Triton server, you need to install the necessary prerequisites for ITREX (Intel Extension for Transformers). By default, NeuralChat uses `Intel/neural-chat-7b-v3-1` as the LLM. Then you can launch the Triton server to start the service.

You can specify an available HTTP port to replace `${your_port}` in the `tritonserver` command.
```bash
# install ITREX using the latest github repo
git clone https://github.com/intel/intel-extension-for-transformers.git itrex
export PYTHONPATH=/opt/tritonserver/itrex
# install requirements (quote version specifiers so bash does not treat ">=" as a redirection)
pip install "transformers>=4.35.2" uvicorn yacs fastapi==0.103.2 neural-compressor accelerate datasets fschat==0.2.35 optimum "optimum[habana]" neural_speed
# launch triton server
tritonserver --model-repository=/models --http-port ${your_port}
```

When the Triton server is successfully launched, you will see a table like the one below showing the model in the `READY` state:
```bash
I0103 00:04:58.435488 237 server.cc:626]
+-----------------+---------+--------+
| Model           | Version | Status |
+-----------------+---------+--------+
| text_generation | 1       | READY  |
+-----------------+---------+--------+
```

Check the service status by running the following command (replace `${your_port}` with the HTTP port you specified):
```bash
curl -v localhost:${your_port}/v2/health/ready
```

You will see `HTTP/1.1 200 OK` if your server is up and ready to receive requests.
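
You can also check the readiness of the model itself rather than the whole server, using Triton's standard model-readiness endpoint. The model name `text_generation` is assumed here because it matches the model directory created earlier:

```bash
# Returns HTTP 200 once the text_generation model is loaded and ready
curl -v localhost:${your_port}/v2/models/text_generation/ready
```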

## Launch and Run the Client

Start the Triton client container and enter it. Remember to replace `${image_name}` with the name of the Docker image you just created.

```bash
docker run -it --runtime=habana --name triton_client --shm-size "4g" -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host -v ./models:/models ${image_name}
```

Inside the client Docker container, you need to install `tritonclient` first.
```bash
pip install "tritonclient[all]"
```
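
Optionally, a quick import check confirms that the client package installed correctly:

```bash
# Should print the confirmation message if the HTTP client module is importable
python -c "import tritonclient.http as httpclient; print('tritonclient HTTP client available')"
```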

Send a request using `client.py`. Replace `${your_port}` with the Triton server HTTP port.
```bash
python /models/text_generation/1/client.py --prompt="Tell me about Intel Xeon Scalable Processors." --url=localhost:${your_port}
```
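
If you want to write your own client instead of using `client.py`, Triton's standard model-metadata endpoint reports the model's input and output tensor names and datatypes, which a raw HTTP or `tritonclient` request needs. This is a small sketch; the exact tensor names depend on `model.py` and are not shown here:

```bash
# Inspect the model's input/output tensor names and datatypes
curl localhost:${your_port}/v2/models/text_generation | python -m json.tool
```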