Commit 8be2d9b

[Doc] change the structure of llm runtime readme (intel#596)
* add warning in graph build
* add more info
1 parent 0656f42 commit 8be2d9b

intel_extension_for_transformers/llm/runtime/graph/README.md

Lines changed: 56 additions & 42 deletions
@@ -37,33 +37,19 @@ LLM Runtime supports the following models:
## How to Use
There are two methods for utilizing the LLM runtime:
- [Transformer-based API](#How-to-use-Transformer-based-API)
- [Straightforward Python script](#How-to-use-Straightforward-Python-script)


## How to use: Transformer-based API
### 1. Install
Install from binary
```shell
pip install intel-extension-for-transformers
```
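
A quick way to confirm the binary install worked (a minimal sketch; it only checks that the package imports and prints where it was installed):
```python
# Assumes `pip install intel-extension-for-transformers` completed without errors.
import intel_extension_for_transformers

# Print the location of the installed package as a basic sanity check.
print(intel_extension_for_transformers.__file__)
```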

### 2. Run LLM with Transformer-based API

You can simply use the Python API to run a Hugging Face model. Here is the sample code:
@@ -126,9 +112,57 @@ Argument description of generate function:
| n_keep | Int | Number of tokens to keep from the initial prompt (default: 0, -1 = all) |
| n_discard | Int | Number of tokens that will be discarded (default: -1, -1 = half of the tokens will be discarded) |
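
For illustration, a minimal sketch of passing these cache-management arguments to `generate` (the model setup mirrors the chat example below; the specific values are placeholders, not tuned recommendations):
```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"  # or local path to model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4"),
    trust_remote_code=True,
)

inputs = tokenizer("[INST]Hello![/INST]", return_tensors="pt").input_ids
# Keep the first 16 prompt tokens when the context window fills up;
# n_discard=-1 keeps the default behavior of discarding half of the tokens.
outputs = model.generate(inputs, max_new_tokens=64, ctx_size=512, n_keep=16, n_discard=-1)
```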

### 3. Chat with LLaMA2
```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"  # or local path to model
woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True)

while True:
    print("> ", end="")
    prompt = input().strip()
    if prompt == "quit":
        break
    b_prompt = "[INST]{}[/INST]".format(prompt)  # prompt template for llama2
    inputs = tokenizer(b_prompt, return_tensors="pt").input_ids
    outputs = model.generate(inputs, streamer=streamer, interactive=True, ignore_prompt=True,
                             num_beams=1, max_new_tokens=512, ctx_size=512, do_sample=True,
                             threads=28, repetition_penalty=1.1)
```
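
For a one-shot completion without the interactive loop, the same API can be used as below (a minimal sketch; the prompt and token counts are illustrative, and decoding assumes the returned ids follow the usual transformers convention):
```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"  # or local path to model
woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True)

# Single prompt, no streaming: generate up to 128 new tokens and decode the result.
inputs = tokenizer("[INST]What is weight-only quantization?[/INST]", return_tensors="pt").input_ids
outputs = model.generate(inputs, num_beams=1, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```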

## How to use: Straightforward Python script
Build from source:
> :warning: **If you want to use the ```from_pretrained``` API**: please follow the [Transformer-based API](#How-to-use-Transformer-based-API)

```shell
# Linux
# make sure you are in the intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph folder
git submodule update --init --recursive
mkdir build
cd build
cmake .. -G Ninja
ninja
```

```powershell
# Windows
# Install VisualStudio 2022 and open 'Developer PowerShell for VS 2022'
# make sure you are in the intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph folder
mkdir build
cd build
cmake ..
cmake --build . -j
```
Note: add the compile args ```-DNE_AVX512=OFF -DNE_AVX512_VBMI=OFF -DNE_AVX512_VNNI=OFF``` to ```cmake``` when compiling on a CPU without AVX512.


### 1. Run LLM with Python Script
You can run an LLM with a one-click Python script that covers conversion, quantization, and inference.
```
python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"
```
@@ -236,24 +270,4 @@ Argument description of inference.py:

We support a tensor parallelism strategy for distributed inference/training on multi-node and multi-socket systems. You can refer to [tensor_parallelism.md](./tensor_parallelism.md) to enable this feature.
