README.md (2 additions & 4 deletions)

@@ -1,18 +1,16 @@
-# ONNX Runtime generate() API
+# ONNX Runtime GenAI
## *The main branch contains new API changes, and the examples here reflect them. For example scripts compatible with the current release (0.5.2), [see the release branch](https://github.com/microsoft/onnxruntime-genai/tree/rel-0.5.2).*
This API gives you an easy, flexible and performant way of running LLMs on device.
It implements the generative AI loop for ONNX models, including pre and post processing, inference with ONNX Runtime, logits processing, search and sampling, and KV cache management.
-You can call a high level `generate()` method to generate all of the output at once, or stream the output one token at a time.
-
See documentation at https://onnxruntime.ai/docs/genai.
|Support matrix|Supported now|Under development|On the roadmap|
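For a sense of what that loop looks like through the Python bindings, here is a rough sketch based on this repo's examples, not a verbatim excerpt: exact call names vary across releases (releases up to 0.5.x set `params.input_ids` instead of calling `append_tokens`), and the model path is a placeholder.

```python
import onnxruntime_genai as og

model = og.Model("path/to/model")   # folder containing genai_config.json and model.onnx
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()  # incremental detokenizer for streaming output

params = og.GeneratorParams(model)
params.set_search_options(max_length=50)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("my favorite movie is"))

# Emit one token at a time until max_length or an end-of-sequence token is hit.
while not generator.is_done():
    generator.generate_next_token()
    token = generator.get_next_tokens()[0]
    print(stream.decode(token), end="", flush=True)
print()
```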
Activation-aware Weight Quantization (AWQ) works by identifying the top 1% of weights that are most salient for maintaining accuracy, and quantizing the remaining 99%. This results in less accuracy loss than many other quantization techniques. For more on AWQ, see [the AWQ paper](https://arxiv.org/abs/2306.00978).
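As a toy illustration of the selection step just described (a NumPy sketch of the idea, not the AutoAWQ implementation; the shapes and the mean-absolute-activation salience measure are simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.standard_normal((1024, 4096))  # calibration activations: [samples, in_features]
W = rng.standard_normal((4096, 4096))     # layer weights: [in_features, out_features]

# Rank input channels by how strongly they are activated on calibration data;
# the top 1% are the "salient" channels to protect during quantization.
salience = np.abs(acts).mean(axis=0)
k = max(1, int(0.01 * salience.size))
salient_channels = np.argsort(salience)[-k:]

protect = np.zeros(W.shape[0], dtype=bool)
protect[salient_channels] = True
print(f"protecting {protect.sum()} of {protect.size} input channels; "
      f"the remaining {(~protect).sum()} get low-precision quantization")
```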
-This tutorial downloads the Phi-3 mini short context PyTorch model, applies AWQ quantization, generates the corresponding optimized & quantized ONNX model, and runs the ONNX model with ONNX Runtime's generate() API. If you would like to use another model, please change the model name in the instructions below.
+This tutorial downloads the Phi-3 mini short context PyTorch model, applies AWQ quantization, generates the corresponding optimized & quantized ONNX model, and runs the ONNX model with ONNX Runtime GenAI. If you would like to use another model, please change the model name in the instructions below.
## 1. Download your PyTorch model
@@ -47,7 +47,7 @@ $ pip install -e .
Note: You can try to install AutoAWQ directly with `pip install autoawq`. However, AutoAWQ will try to auto-detect the CUDA version installed on your machine, and if it detects the wrong version, `pip` will choose the wrong `.whl` file, which causes a runtime error when quantizing. It is therefore recommended to install AutoAWQ from source to get the right `.whl` file.
-## 3. Install the generate() API
+## 3. Install ONNX Runtime GenAI
Based on your desired hardware target, pick from one of the following options to install ONNX Runtime GenAI.
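Whichever option you pick, the updated example scripts below detect provider support by probing for the `Config` class, so a quick sanity check of an install (not part of the tutorial itself) is:

```python
import onnxruntime_genai as og

# Newer builds expose og.Config for per-run execution-provider selection;
# the example scripts fall back to og.Model(path) when it is absent.
print("Config API available:", hasattr(og, "Config"))
```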
-python3 model-generate.py -m genai_models/phi2-int4-cpu -pr "my favorite movie is" "write a function that always returns True" "I am very happy" -p 0.0 -k 1 -v
+python3 model-generate.py -m genai_models/phi2-int4-cpu -e cpu -pr "my favorite movie is" "write a function that always returns True" "I am very happy" -p 0.0 -k 1 -v
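In both the old and new invocations, `-p 0.0 -k 1` presumably map to the script's top-p and top-k sampling flags (their definitions are truncated from this view); with a top-k of 1 the sampler can only ever pick the highest-probability token, so runs are deterministic. A sketch of the same options through the Python API, reusing the model path from the command above:

```python
import onnxruntime_genai as og

model = og.Model("genai_models/phi2-int4-cpu")  # model folder from the command above
params = og.GeneratorParams(model)
# top_k=1 restricts sampling to the single most likely token (greedy decoding).
params.set_search_options(top_p=0.0, top_k=1, max_length=50)
```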
examples/python/model-generate.py (5 additions & 5 deletions)

@@ -7,10 +7,10 @@ def main(args):
    if hasattr(og, 'Config'):
        config = og.Config(args.model_path)
        config.clear_providers()
-        if args.provider != "cpu":
+        if args.execution_provider != "cpu":
            if args.verbose:
-                print(f"Setting model to {args.provider}...")
-            config.append_provider(args.provider)
+                print(f"Setting model to {args.execution_provider}...")
+            config.append_provider(args.execution_provider)
        model = og.Model(config)
    else:
        model = og.Model(args.model_path)
@@ -80,8 +80,8 @@ def main(args):
if __name__ == "__main__":
    parser = argparse.ArgumentParser(argument_default=argparse.SUPPRESS, description="End-to-end token generation loop example for gen-ai")
-    parser.add_argument('-m', '--model_path', type=str, required=True, help='Onnx model folder path (must contain config.json and model.onnx)')
-    parser.add_argument("-p", "--provider", type=str, required=True, choices=["cpu", "cuda", "dml"], help="Provider to run model")
+    parser.add_argument('-m', '--model_path', type=str, required=True, help='Onnx model folder path (must contain genai_config.json and model.onnx)')
+    parser.add_argument("-e", "--execution_provider", type=str, required=True, choices=["cpu", "cuda", "dml"], help="Provider to run model")
    parser.add_argument('-pr', '--prompts', nargs='*', required=False, help='Input prompts to generate tokens from. Provide this parameter multiple times to batch multiple prompts')
    parser.add_argument('-i', '--min_length', type=int, default=25, help='Min number of tokens to generate including the prompt')
    parser.add_argument('-l', '--max_length', type=int, default=50, help='Max number of tokens to generate including the prompt')
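The help-text correction matters in practice: a GenAI model folder is identified by `genai_config.json`, not `config.json`. A hypothetical pre-flight check along those lines (not part of this PR):

```python
import os

def looks_like_genai_model(folder: str) -> bool:
    # A usable model folder needs the runtime config plus an ONNX graph.
    if not os.path.isdir(folder):
        return False
    has_config = os.path.exists(os.path.join(folder, "genai_config.json"))
    has_onnx = any(f.endswith(".onnx") for f in os.listdir(folder))
    return has_config and has_onnx

print(looks_like_genai_model("genai_models/phi2-int4-cpu"))
```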
examples/python/model-qa.py (1 addition & 1 deletion)

@@ -98,7 +98,7 @@ def main(args):
if __name__ == "__main__":
    parser = argparse.ArgumentParser(argument_default=argparse.SUPPRESS, description="End-to-end AI Question/Answer example for gen-ai")
-    parser.add_argument('-m', '--model', type=str, required=True, help='Onnx model folder path (must contain config.json and model.onnx)')
+    parser.add_argument('-m', '--model_path', type=str, required=True, help='Onnx model folder path (must contain genai_config.json and model.onnx)')
    parser.add_argument('-e', '--execution_provider', type=str, required=True, choices=["cpu", "cuda", "dml"], help="Execution provider to run ONNX model with")
    parser.add_argument('-i', '--min_length', type=int, help='Min number of tokens to generate including the prompt')
    parser.add_argument('-l', '--max_length', type=int, help='Max number of tokens to generate including the prompt')