> :warning: **If you want to use the `from_pretrained` API**, please follow the [Transformer-based API](#How-to-use-Transformer-based-API).
```shell
# Linux
# make sure your path is in the intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph folder
git submodule update --init --recursive
mkdir build
cd build
cmake .. -G Ninja
ninja
```

```powershell
# Windows
# Install Visual Studio 2022 and open 'Developer PowerShell for VS 2022'
# make sure your path is in the intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph folder
mkdir build
cd build
cmake ..
cmake --build . -j
```
Note: add the compile args `-DNE_AVX512=OFF -DNE_AVX512_VBMI=OFF -DNE_AVX512_VNNI=OFF` to `cmake` when compiling on a CPU without AVX512, as shown below.
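For example, on a CPU without AVX512 the Linux build above would be configured as follows (same `build` directory as before; the flags are exactly those from the note):

```shell
# disable AVX512 code paths, e.g. on an AVX2-only CPU
cmake .. -G Ninja -DNE_AVX512=OFF -DNE_AVX512_VBMI=OFF -DNE_AVX512_VNNI=OFF
ninja
```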
### 1. Run LLM with Python Script
You can run an LLM with a one-click Python script that covers conversion, quantization, and inference.
```shell
python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"
```
We support a tensor parallelism strategy for distributed inference/training across multiple nodes and sockets. You can refer to [tensor_parallelism.md](./tensor_parallelism.md) to enable this feature.
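As a rough illustration only (a hypothetical launcher invocation; the actual setup and flags are documented in [tensor_parallelism.md](./tensor_parallelism.md)), a two-rank run might look like:

```shell
# hypothetical: one rank per socket via an MPI-style launcher
mpirun -np 2 python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"
```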
### 4. Chat with LLaMA2
```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"  # or local path to model
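# --- continuation sketch: the original snippet is truncated at this point. ---
# The lines below assume the WeightOnlyQuantConfig / AutoModelForCausalLM API
# imported above; treat argument values and the prompt as illustrative.
woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
prompt = "Once upon a time, a little girl"  # illustrative prompt

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)  # print decoded tokens as they are generated

# quantize on the fly and run streaming generation
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```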