{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Accelerated Inference With PEFT'd StarCoder2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
| 14 | + "In the previous [notebook](https://github.com/NVIDIA/GenerativeAIExamples/blob/main/models/StarCoder2/lora.ipynb), we show how to parameter efficiently finetune StarCoder2 model with a custom code (instruction, completion) pair dataset. We choose LoRA as our PEFT algorithnm and finetune for 50 interations. In this notebook, the goal is to demonstrate how to compile fintuned .nemo model into optimized TensorRT-LLM engines. The converted model engine can perform accelerated inference locally or be deployed to Triton Inference Server." |
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
| 21 | + "## Export Model Via TensorRT-LLM\n", |
| 22 | + "\n", |
| 23 | + "NVIDIA TensorRT-LLM is an open-source library that accelerates and optimizes inference performance of the latest LLMs on supported AI platforms. NVIDIA NeMo framework offers TensorRT-LLM as an user friendly tool to compile .nemo models into optimized engines. To start with, let's create a folder where the exported model files will be saved." |
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir starcoder2_trt_llm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
| 39 | + "Next, we need to create an instance of the TensorRTLLM class and call the TensorRTLLM.export() function with the nemo_checkpoint_path pointing to the LoRA fine-tuned .nemo checkpoint.\n", |
| 40 | + "\n", |
| 41 | + "After optimized model export, a few files will be stored in the folder we just created. These files include an engine file that holds the weights, the compiled execution graph of the model, a tokenizer.model file which contains the tokenizer information, and config.json which records the metadata about the model (along with model.cache, which caches some operations and makes it faster to re-compile the model in the future.)" |
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nemo.export import TensorRTLLM\n",
    "trt_llm_exporter = TensorRTLLM(model_dir=\"starcoder2_trt_llm\")\n",
    "trt_llm_exporter.export(nemo_checkpoint_path=\"starcoder2_lora_alpaca_python_merged.nemo\", model_type=\"starcoder\", n_gpus=1)"
   ]
  },
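  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optionally, we can list the export folder to confirm that the engine, tokenizer, and config files described above were written:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# List the exported TensorRT-LLM artifacts (engine, tokenizer.model, config.json, model.cache)\n",
    "!ls -lh starcoder2_trt_llm"
   ]
  },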
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
| 59 | + "After the finetuned model is exported into TensorRT-LLM optimized engines, we can perform accelerated inference." |
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trt_llm_exporter.forward([\"Given a non-empty array of integers nums, every element appears twice except for one. Find that single one. ### Input: nums = [2,2,1] ### Output: 1\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another code generation example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trt_llm_exporter.forward([\"Implement Fibonacci sequence in Python\"])"
   ]
  },
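  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "forward() can also take generation parameters. The exact keyword names vary across NeMo versions, so the cell below is only a sketch that assumes kwargs such as max_output_token, top_k, top_p, and temperature; adjust them to match your installed version."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generation controls (assumed kwargs; names may differ across NeMo versions,\n",
    "# e.g. max_output_token vs. max_output_len; check help(trt_llm_exporter.forward))\n",
    "trt_llm_exporter.forward(\n",
    "    [\"Write a Python function that checks whether a string is a palindrome\"],\n",
    "    max_output_token=200,\n",
    "    top_k=1,\n",
    "    top_p=0.0,\n",
    "    temperature=1.0,\n",
    ")"
   ]
  },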
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
| 91 | + "## Deploy Model Using Triton Inference Server\n", |
| 92 | + "\n", |
| 93 | + "Lastly, we can easily deploy the finetuned model as a service, which is supported by Triton Inference Server:" |
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
| 102 | + "from nemo.deploy import DeployPyTriton\n", |
| 103 | + "\n", |
| 104 | + "nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name=\"starcoder\")\n", |
| 105 | + "nm.deploy()\n", |
| 106 | + "nm.serve()" |
   ]
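  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the service is up, it can be queried from a separate process or notebook. The snippet below is a minimal sketch that assumes the NemoQuery helper from nemo.deploy, its query_llm() method, and Triton's default HTTP port; the class name, URL, and parameter names may differ depending on your NeMo version, so treat it as illustrative rather than definitive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run this from a separate process/notebook while nm.serve() is blocking above.\n",
    "# NemoQuery, the URL, and the query_llm() kwargs are assumptions; adjust to your NeMo version.\n",
    "from nemo.deploy import NemoQuery\n",
    "\n",
    "nq = NemoQuery(url=\"localhost:8000\", model_name=\"starcoder\")\n",
    "output = nq.query_llm(prompts=[\"Implement Fibonacci sequence in Python\"], max_output_token=200)\n",
    "print(output)"
   ]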
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}