This repository was archived by the owner on Jun 3, 2025. It is now read-only.

Commit 7ea8298

Authored by jeanniefinks, bfineran, mgoin, and markurtz

Jfinks scheduler (#101)
* update Flask examples (#96)
* update Flask examples
* adding logging, adding post/preprocessing client options
* update docstrings
* update readme outputs
* default to multi_stream scheduler for serving
* Create scheduler.md: new content for single and multi-stream scheduling
* Add files via upload: files that go with the new scheduler doc in review
* Update index.rst: included scheduler doc into nav tree; fixed minor formatting issues w/ lists and markdown
* Update example-log.md (#99): optimized link so it would convert from .md to .html properly, as it's resulting in a 404 in its companion html file
* Update docs/source/scheduler.md
  Co-authored-by: Michael Goin <[email protected]>
* Update docs/source/scheduler.md
  Co-authored-by: Michael Goin <[email protected]>

Co-authored-by: Benjamin Fineran <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Mark Kurtz <[email protected]>
1 parent 7b9999b commit 7ea8298

File tree

8 files changed, +247 -89 lines changed

docs/debugging-optimizing/example-log.md

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ limitations under the License.
 
 # Example Log, Verbose Level = diagnose
 
-The following is an example log with `NM_LOGGING_LEVEL=diagnose` running a super_resolution network, where we only support running 70% of it. Different portions of the log are explained in [Parsing an Example Log](./diagnostics-debugging.md#parsing-an-example-log).
+The following is an example log with `NM_LOGGING_LEVEL=diagnose` running a super_resolution network, where we only support running 70% of it. Different portions of the log are explained in [Parsing an Example Log](diagnostics-debugging.md#parsing-an-example-log).
 
 ```bash
 onnx_filename : test-models/cv-resolution/super_resolution/none-bsd300-onnx-repo/model.onnx

docs/index.rst

Lines changed: 2 additions & 0 deletions
@@ -66,6 +66,7 @@ For example, pruning plus quantization can give noticeable improvements in perfo
 
 The Deep Sparse product suite builds on top of sparsification enabling you to easily apply the techniques to your datasets and models using recipe-driven approaches.
 Recipes encode the directions for how to sparsify a model into a simple, easily editable format.
+
 - Download a sparsification recipe and sparsified model from the `SparseZoo <https://github.com/neuralmagic/sparsezoo>`_.
 - Alternatively, create a recipe for your model using `Sparsify <https://github.com/neuralmagic/sparsify>`_.
 - Apply your recipe with only a few lines of code using `SparseML <https://github.com/neuralmagic/sparseml>`_.
@@ -121,6 +122,7 @@ Additionally, more information can be found via
    :caption: Performance
 
    debugging-optimizing/index
+   source/scheduler
 
 .. toctree::
    :maxdepth: 2

docs/source/multi-stream.png

30.8 KB

docs/source/scheduler.md

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
<!--
Copyright (c) 2021 - present / Neuralmagic, Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

## Serial or Concurrent Inferences

Schedulers are special system software that handles the distribution of work across cores in parallel computation. The goal of a good scheduler is to ensure that, while work is available, cores aren't sitting idle: as long as parallel tasks are available, all cores should be kept busy.

In most use cases, the default scheduler is the preferred choice when running inferences with the DeepSparse Engine. It is highly optimized for minimum per-request latency, using all of the system's resources provided to it on every request it gets. Often, particularly when working with large batch sizes, the scheduler is able to distribute the workload of a single request across as many cores as it's provided.

![Single-stream scheduling diagram](single-stream.png)

_Single-stream scheduling; requests execute serially by default_

However, there are circumstances in which more cores does not imply better performance. If the computation can't be divided up to produce enough parallelism (while maximizing use of the CPU cache), then adding more cores simply adds more compute power with little to apply it to.

An alternative, "multi-stream" scheduler is provided with the software. In cases where parallelism is low, sending multiple requests simultaneously can more adequately saturate the available cores. In other words, if speedup can't be achieved by adding more cores, then perhaps speedup can be achieved by adding more work.

If increasing core count doesn't decrease latency, that's a strong indicator that parallelism is low in your particular model/batch-size combination. It may be that total throughput can be increased by making more requests simultaneously. Using the [deepsparse.engine.Scheduler API](https://docs.neuralmagic.com/deepsparse/api/deepsparse.html), the multi-stream scheduler can be selected, and requests made by multiple Python threads will be handled concurrently.

![Multi-stream scheduling diagram](multi-stream.png)

_Multi-stream scheduling; requests execute in parallel and may utilize hardware resources better_

Whereas the default scheduler will queue up requests made simultaneously and handle them serially, the multi-stream scheduler maintains a set of dropboxes where requests may be deposited and the requesting threads can wait. These dropboxes allow workers to find work from multiple sources when work from a single source would otherwise be scarce, maximizing throughput. When a request is complete, the requesting thread is awakened and returns the results to the caller.

The most common use cases for the multi-stream scheduler are those where parallelism is low with respect to core count and where requests need to be made asynchronously, without time to batch them. Implementing a model server may fit such a scenario and be ideal for using multi-stream scheduling.

Depending on your engine execution strategy, enable one of these options by running:

```python
engine = compile_model(model_path, batch_size, num_cores, num_sockets, "single_stream")
```

or

```python
engine = compile_model(model_path, batch_size, num_cores, num_sockets, "multi_stream")
```

or pass in the enum value directly, since `"multi_stream" == Scheduler.multi_stream`.

By default, the scheduler will map to a single stream.
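
Not part of the commit itself, but as a rough sketch of the multi-stream pattern described in scheduler.md above: the snippet below issues requests from several Python threads against an engine compiled with the `"multi_stream"` scheduler. It assumes the `compile_model` call shown in the doc, that `None` is an acceptable default for `num_cores`/`num_sockets`, and that the returned engine exposes a `run(inputs)` method; treat it as an illustration rather than a definitive recipe.

```python
# Sketch: concurrent requests against a multi-stream DeepSparse engine.
from concurrent.futures import ThreadPoolExecutor

from deepsparse import compile_model
from deepsparse.utils import generate_random_inputs

model_path = "path/to/model.onnx"  # placeholder; substitute a real ONNX file
batch_size = 1

# None for num_cores / num_sockets is assumed to fall back to engine defaults
engine = compile_model(model_path, batch_size, None, None, "multi_stream")
inputs = generate_random_inputs(model_path, batch_size)


def infer(request_id):
    # each thread submits its own request; the multi-stream scheduler decides
    # how to share cores among the concurrently pending requests
    return request_id, engine.run(inputs)


with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(infer, range(8)))

print(f"completed {len(results)} concurrent requests")
```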

docs/source/single-stream.png

17.1 KB

examples/flask/README.md

Lines changed: 7 additions & 7 deletions
@@ -55,11 +55,11 @@ python client.py ~/Downloads/resnet18_pruned.onnx
 ```
 Output:
 ```bash
-[ INFO onnx.py: 92 - generate_random_inputs() ] Generating 1 random inputs
-[ INFO onnx.py: 102 - generate_random_inputs() ] -- random input #0 of shape = [1, 3, 224, 224]
-Sending 1 input tensors to http://0.0.0.0:5543/predict
-Recieved response of 2 output tensors:
-Round-trip time took 13.4261 milliseconds
-output #0: shape (1, 1000)
-output #1: shape (1, 1000)
+[ INFO onnx.py: 127 - generate_random_inputs() ] -- generating random input #0 of shape = [1, 3, 224, 224]
+[ INFO client.py: 152 - main() ] Sending 1 input tensors to http://0.0.0.0:5543/run
+[ DEBUG client.py: 102 - _post() ] Sending POST request to http://0.0.0.0:5543/run
+[ INFO client.py: 159 - main() ] Round-trip time took 13.3283 milliseconds
+[ INFO client.py: 160 - main() ] Received response of 2 output tensors:
+[ INFO client.py: 163 - main() ] output #0: shape (1, 1000)
+[ INFO client.py: 163 - main() ] output #1: shape (1, 1000)
 ```

examples/flask/client.py

Lines changed: 72 additions & 26 deletions
@@ -18,17 +18,18 @@
 
 ##########
 Command help:
-usage: client.py [-h] [-s BATCH_SIZE] [-a ADDRESS] [-p PORT] onnx_filepath
+usage: client.py [-h] [-b BATCH_SIZE] [-a ADDRESS] [-p PORT] model_path
 
-Communicate with a Flask server hosting an ONNX model with the
-DeepSparse Engine as inference backend.
+Communicate with a Flask server hosting an ONNX model with the DeepSparse
+Engine as inference backend.
 
 positional arguments:
-  onnx_filepath         The full filepath of the ONNX model file
+  model_path            The full filepath of the ONNX model file or SparseZoo
+                        stub of model
 
 optional arguments:
   -h, --help            show this help message and exit
-  -s BATCH_SIZE, --batch_size BATCH_SIZE
+  -b BATCH_SIZE, --batch-size BATCH_SIZE
                         The batch size to run the analysis for
   -a ADDRESS, --address ADDRESS
                         The IP address of the hosted model
@@ -41,11 +42,65 @@
 """
 
 import argparse
+import os
 import time
+from typing import Any, Callable, List
 
+import numpy
 import requests
 
-from deepsparse.utils import arrays_to_bytes, bytes_to_arrays, generate_random_inputs
+from deepsparse.utils import (
+    arrays_to_bytes,
+    bytes_to_arrays,
+    generate_random_inputs,
+    log_init,
+)
+
+
+_LOGGER = log_init(os.path.basename(__file__))
+
+
+class EngineFlaskClient:
+    """
+    Client object for interacting with HTTP server invoked with `engine_flask_server`.
+
+    :param address: IP address of server to query
+    :param port: port that the server is running on
+    :param preprocessing_fn: function to preprocess inputs to the run argument before
+        sending inputs to the model server. Defaults to the `arrays_to_bytes` function
+        for serializing lists of numpy arrays
+    :param postprocessing_fn: function to postprocess outputs from model server
+        inferences. Defaults to the `bytes_to_arrays` function for de-serializing
+        lists of numpy arrays
+    """
+
+    def __init__(
+        self,
+        address: str,
+        port: str,
+        preprocessing_fn: Callable[[Any], Any] = arrays_to_bytes,
+        postprocessing_fn: Callable[[Any], Any] = bytes_to_arrays,
+    ):
+        self.url = f"http://{address}:{port}"
+        self.preprocessing_fn = preprocessing_fn
+        self.postprocessing_fn = postprocessing_fn
+
+    def run(self, inp: List[numpy.ndarray]) -> List[numpy.ndarray]:
+        """
+        Client function for running a forward pass of the server model.
+
+        :param inp: the list of inputs to pass to the server for inference.
+            The expected order is the inputs order as defined in the ONNX graph
+        :return: the list of outputs from the server after executing over the inputs
+        """
+        data = self.preprocessing_fn(inp)
+        response = self._post("run", data=data)
+        return self.postprocessing_fn(response)
+
+    def _post(self, route: str, data: Any):
+        route_url = f"{self.url}/{route}"
+        _LOGGER.debug(f"Sending POST request to {route_url}")
+        return requests.post(route_url, data=data).content
 
 
 def parse_args():
@@ -57,14 +112,14 @@ def parse_args():
     )
 
     parser.add_argument(
-        "onnx_filepath",
+        "model_path",
         type=str,
-        help="The full filepath of the ONNX model file",
+        help="The full filepath of the ONNX model file or SparseZoo stub of model",
     )
 
     parser.add_argument(
-        "-s",
-        "--batch_size",
+        "-b",
+        "--batch-size",
         type=int,
         default=1,
         help="The batch size to run the analysis for",
@@ -89,32 +144,23 @@ def parse_args():
 
 def main():
     args = parse_args()
-    onnx_filepath = args.onnx_filepath
-    batch_size = args.batch_size
-    address = args.address
-    port = args.port
 
-    prediction_url = f"http://{address}:{port}/predict"
+    engine = EngineFlaskClient(args.address, args.port)
 
-    inputs = generate_random_inputs(onnx_filepath, batch_size)
+    inputs = generate_random_inputs(args.model_path, args.batch_size)
 
-    print(f"Sending {len(inputs)} input tensors to {prediction_url}")
+    _LOGGER.info(f"Sending {len(inputs)} input tensors to {engine.url}/run")
 
     start = time.time()
-    # Encode inputs
-    data = arrays_to_bytes(inputs)
-    # Send data to server for inference
-    response = requests.post(prediction_url, data=data)
-    # Decode outputs
-    outputs = bytes_to_arrays(response.content)
+    outputs = engine.run(inputs)
     end = time.time()
     elapsed_time = end - start
 
-    print(f"Received response of {len(outputs)} output tensors:")
-    print(f"Round-trip time took {elapsed_time * 1000.0:.4f} milliseconds")
+    _LOGGER.info(f"Round-trip time took {elapsed_time * 1000.0:.4f} milliseconds")
+    _LOGGER.info(f"Received response of {len(outputs)} output tensors:")
 
     for i, out in enumerate(outputs):
-        print(f" output #{i}: shape {out.shape}")
+        _LOGGER.info(f"\toutput #{i}: shape {out.shape}")
 
 
 if __name__ == "__main__":
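
As a brief, hypothetical usage sketch for the refactored client above (not part of the commit): the `preprocessing_fn`/`postprocessing_fn` hooks let a caller swap in a different wire format, provided the Flask server is adapted to match. The JSON helpers below and the JSON-accepting server are assumptions; the default `arrays_to_bytes`/`bytes_to_arrays` hooks remain the path the example actually ships with.

```python
# Sketch: overriding EngineFlaskClient's (de)serialization hooks with JSON.
# Assumes this runs from examples/flask so client.py is importable, and that
# the server has been modified to accept/return JSON payloads.
import json

import numpy

from client import EngineFlaskClient


def arrays_to_json(arrays):
    # serialize a list of numpy arrays as nested JSON lists
    return json.dumps([arr.tolist() for arr in arrays])


def json_to_arrays(payload):
    # deserialize the server response back into numpy arrays
    return [numpy.array(item) for item in json.loads(payload)]


client = EngineFlaskClient(
    "0.0.0.0",
    "5543",
    preprocessing_fn=arrays_to_json,
    postprocessing_fn=json_to_arrays,
)

outputs = client.run([numpy.random.rand(1, 3, 224, 224).astype(numpy.float32)])
print([out.shape for out in outputs])
```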
