Commit 8be2d9b

[Doc] change the structure of llm runtime readme (intel#596)
* add warning in graph build
* add more info
1 parent 0656f42 commit 8be2d9b

intel_extension_for_transformers/llm/runtime/graph/README.md

Lines changed: 56 additions & 42 deletions
@@ -37,33 +37,19 @@ LLM Runtime supports the following models:
## How to Use
There are two methods for utilizing the LLM runtime:
- [Transformer-based API](#How-to-use-Transformer-based-API)
- [Straightforward Python script](#How-to-use-Straightforward-Python-script)


## How to use: Transformer-based API
### 1. Install
Install from binary
```shell
pip install intel-extension-for-transformers
```
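
A quick way to confirm the binary install worked (a minimal sketch; it only checks that the package imports and prints where it was installed):
```python
# Assumes `pip install intel-extension-for-transformers` completed without errors.
import intel_extension_for_transformers

# Print the location of the installed package as a basic sanity check.
print(intel_extension_for_transformers.__file__)
```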

### 2. Run LLM with Transformer-based API

You can simply use the Python API to run a Hugging Face model. Here is the sample code:
@@ -126,9 +112,57 @@ Argument description of generate function:
| n_keep | Int | Number of tokens to keep from the initial prompt (default: 0, -1 = all) |
| n_discard | Int | Number of tokens that will be discarded (default: -1, -1 = half of the tokens will be discarded) |
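
For illustration, a minimal sketch of passing these cache-management arguments to `generate` (the model setup mirrors the chat example below; the specific values are placeholders, not tuned recommendations):
```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"  # or local path to model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4"),
    trust_remote_code=True,
)

inputs = tokenizer("[INST]Hello![/INST]", return_tensors="pt").input_ids
# Keep the first 16 prompt tokens when the context window fills up;
# n_discard=-1 keeps the default behavior of discarding half of the tokens.
outputs = model.generate(inputs, max_new_tokens=64, ctx_size=512, n_keep=16, n_discard=-1)
```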

### 3. Chat with LLaMA2
```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"  # or local path to model
woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True)

while True:
    print("> ", end="")
    prompt = input().strip()
    if prompt == "quit":
        break
    b_prompt = "[INST]{}[/INST]".format(prompt)  # prompt template for llama2
    inputs = tokenizer(b_prompt, return_tensors="pt").input_ids
    outputs = model.generate(inputs, streamer=streamer, interactive=True, ignore_prompt=True,
                             num_beams=1, max_new_tokens=512, ctx_size=512, do_sample=True,
                             threads=28, repetition_penalty=1.1)
```
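
For a one-shot completion without the interactive loop, the same API can be used as below (a minimal sketch; the prompt and token counts are illustrative, and decoding assumes the returned ids follow the usual transformers convention):
```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"  # or local path to model
woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True)

# Single prompt, no streaming: generate up to 128 new tokens and decode the result.
inputs = tokenizer("[INST]What is weight-only quantization?[/INST]", return_tensors="pt").input_ids
outputs = model.generate(inputs, num_beams=1, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```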

## How to use: Straightforward Python script
Build from source:
> :warning: **If you want to use the ```from_pretrained``` API**: please follow the [Transformer-based API](#How-to-use-Transformer-based-API)

```shell
# Linux
# make sure you are in the intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph folder
git submodule update --init --recursive
mkdir build
cd build
cmake .. -G Ninja
ninja
```

```powershell
# Windows
# Install VisualStudio 2022 and open 'Developer PowerShell for VS 2022'
# make sure you are in the intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph folder
mkdir build
cd build
cmake ..
cmake --build . -j
```
Note: add the compile args ```-DNE_AVX512=OFF -DNE_AVX512_VBMI=OFF -DNE_AVX512_VNNI=OFF``` to ```cmake``` when compiling on a CPU without AVX512.


### 1. Run LLM with Python Script
You can run an LLM with a one-click Python script that covers conversion, quantization, and inference.
```
python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"
```
@@ -236,24 +270,4 @@ Argument description of inference.py:

We support a tensor parallelism strategy for distributed inference/training on multi-node and multi-socket systems. You can refer to [tensor_parallelism.md](./tensor_parallelism.md) to enable this feature.
