Commit 669a4f9

Apply feedback from documentation-website to PPL command docs (#4997)
Co-authored-by: Kyle Hounslow <kylhouns@amazon.com>
1 parent 22669da commit 669a4f9

60 files changed, with 4,354 additions and 3,422 deletions.

docs/user/ppl/cmd/ad.md

Lines changed: 77 additions & 37 deletions
Original file line number | Diff line number | Diff line change
@@ -1,37 +1,76 @@
1-
# ad (deprecated by ml command)
21

3-
## Description
2+
# ad (Deprecated)
43

5-
The `ad` command applies Random Cut Forest (RCF) algorithm in the ml-commons plugin on the search result returned by a PPL command. Based on the input, the command uses two types of RCF algorithms: fixed-in-time RCF for processing time-series data, batch RCF for processing non-time-series data.
6-
## Syntax
4+
> **Warning**: The `ad` command is deprecated in favor of the [`ml` command](./ml.md).
75
8-
## Fixed In Time RCF For Time-series Data
6+
The `ad` command applies the Random Cut Forest (RCF) algorithm in the ML Commons plugin to the search results returned by a PPL command. The command provides two anomaly detection approaches:
97

10-
ad [number_of_trees] [shingle_size] [sample_size] [output_after] [time_decay] [anomaly_rate] \<time_field\> [date_format] [time_zone] [category_field]
11-
* number_of_trees: optional. Number of trees in the forest. **Default:** 30.
12-
* shingle_size: optional. A shingle is a consecutive sequence of the most recent records. **Default:** 8.
13-
* sample_size: optional. The sample size used by stream samplers in this forest. **Default:** 256.
14-
* output_after: optional. The number of points required by stream samplers before results are returned. **Default:** 32.
15-
* time_decay: optional. The decay factor used by stream samplers in this forest. **Default:** 0.0001.
16-
* anomaly_rate: optional. The anomaly rate. **Default:** 0.005.
17-
* time_field: mandatory. Specifies the time field for RCF to use as time-series data.
18-
* date_format: optional. Used for formatting time_field. **Default:** "yyyy-MM-dd HH:mm:ss".
19-
* time_zone: optional. Used for setting time zone for time_field. **Default:** "UTC".
20-
* category_field: optional. Specifies the category field used to group inputs. Each category will be independently predicted.
8+
- [Anomaly detection for time-series data](#anomaly-detection-for-time-series-data) using the fixed-in-time RCF algorithm
9+
- [Anomaly detection for non-time-series data](#anomaly-detection-for-non-time-series-data) using the batch RCF algorithm
10+
11+
> **Note**: To use the `ad` command, `plugins.calcite.enabled` must be set to `false`.
12+
13+
## Syntax
14+
15+
The `ad` command has two different syntax variants, depending on the algorithm type.
16+
17+
### Anomaly detection for time-series data
18+
19+
Use this syntax to detect anomalies in time-series data. This method uses the fixed-in-time RCF algorithm, which is optimized for sequential data patterns.
20+
21+
The fixed-in-time RCF `ad` command has the following syntax:
22+
23+
```syntax
24+
ad [number_of_trees] [shingle_size] [sample_size] [output_after] [time_decay] [anomaly_rate] <time_field> [date_format] [time_zone] [category_field]
25+
```
26+
27+
### Parameters
28+
29+
The fixed-in-time RCF algorithm supports the following parameters.
30+
31+
| Parameter | Required/Optional | Description |
32+
| --- | --- | --- |
33+
| `time_field` | Required | The time field for RCF to use as time-series data. |
34+
| `number_of_trees` | Optional | The number of trees in the forest. Default is `30`. |
35+
| `shingle_size` | Optional | The number of records in a shingle. A shingle is a consecutive sequence of the most recent records. Default is `8`. |
36+
| `sample_size` | Optional | The sample size used by the stream samplers in this forest. Default is `256`. |
37+
| `output_after` | Optional | The number of points required by the stream samplers before results are returned. Default is `32`. |
38+
| `time_decay` | Optional | The decay factor used by the stream samplers in this forest. Default is `0.0001`. |
39+
| `anomaly_rate` | Optional | The anomaly rate. Default is `0.005`. |
40+
| `date_format` | Optional | The format used for the `time_field` field. Default is `yyyy-MM-dd HH:mm:ss`. |
41+
| `time_zone` | Optional | The time zone for the `time_field` field. Default is `UTC`. |
42+
| `category_field` | Optional | The category field used to group input values. The predict operation is applied to each category independently. |
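
The `shingle_size` description above defines a shingle as a consecutive sequence of the most recent records. The following Python sketch illustrates that definition only. It is not taken from the ML Commons plugin, and the `shingles` helper and the sample values are hypothetical names and data introduced for illustration:

```python
from collections import deque
from typing import Iterable, Iterator, List


def shingles(values: Iterable[float], shingle_size: int = 8) -> Iterator[List[float]]:
    """Yield each window of the `shingle_size` most recent consecutive records."""
    # A bounded deque drops the oldest record once the window is full.
    window = deque(maxlen=shingle_size)
    for value in values:
        window.append(value)
        # Emit nothing until enough records have arrived to fill one shingle.
        if len(window) == shingle_size:
            yield list(window)


# Hypothetical ridership counts, not taken from the nyc_taxi dataset.
rides = [10, 12, 11, 50, 13]
result = list(shingles(rides, shingle_size=3))
print(result)
```

With a shingle size of 3, each emitted window overlaps the previous one by two records, so a single unusual value (here `50`) appears in several consecutive shingles.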
2143

22-
## Batch RCF For Non-time-series Data
2344

45+
### Anomaly detection for non-time-series data
46+
47+
Use this syntax to detect anomalies in data where the order doesn't matter. This method uses the batch RCF algorithm, which is optimized for independent data points.
48+
49+
The batch RCF `ad` command has the following syntax:
50+
51+
```syntax
2452
ad [number_of_trees] [sample_size] [output_after] [training_data_size] [anomaly_score_threshold] [category_field]
25-
* number_of_trees: optional. Number of trees in the forest. **Default:** 30.
26-
* sample_size: optional. Number of random samples given to each tree from the training data set. **Default:** 256.
27-
* output_after: optional. The number of points required by stream samplers before results are returned. **Default:** 32.
28-
* training_data_size: optional. **Default:** size of your training data set.
29-
* anomaly_score_threshold: optional. The threshold of anomaly score. **Default:** 1.0.
30-
* category_field: optional. Specifies the category field used to group inputs. Each category will be independently predicted.
53+
```
54+
55+
### Parameters
56+
57+
The batch RCF algorithm supports the following parameters.
58+
59+
| Parameter | Required/Optional | Description |
60+
| --- | --- | --- |
61+
| `number_of_trees` | Optional | The number of trees in the forest. Default is `30`. |
62+
| `sample_size` | Optional | The number of random samples provided to each tree from the training dataset. Default is `256`. |
63+
| `output_after` | Optional | The number of points required by the stream samplers before results are returned. Default is `32`. |
64+
| `training_data_size` | Optional | The size of the training dataset. Default is the full dataset size. |
65+
| `anomaly_score_threshold` | Optional | The anomaly score threshold. Default is `1.0`. |
66+
| `category_field` | Optional | The category field used to group input values. The predict operation is applied to each category independently. |
3167

32-
## Example 1: Detecting events in New York City from taxi ridership data with time-series data
3368

34-
This example trains an RCF model and uses the model to detect anomalies in the time-series ridership data.
69+
## Example 1: Detecting events in New York City taxi ridership time-series data
70+
71+
The following examples use the `nyc_taxi` dataset, which contains New York City taxi ridership data with fields including `value` (number of rides), `timestamp` (time of measurement), and `category` (time period classifications such as 'day' and 'night').
72+
73+
This example trains an RCF model and uses it to detect anomalies in time-series ridership data:
3574

3675
```ppl ignore
3776
source=nyc_taxi
@@ -40,7 +79,7 @@ source=nyc_taxi
4079
| where value=10844.0
4180
```
4281

43-
Expected output:
82+
The query returns the following results:
4483

4584
```text
4685
fetched rows / total rows = 1/1
@@ -51,9 +90,10 @@ fetched rows / total rows = 1/1
5190
+---------+---------------------+-------+---------------+
5291
```
5392

54-
## Example 2: Detecting events in New York City from taxi ridership data with time-series data independently with each category
5593

56-
This example trains an RCF model and uses the model to detect anomalies in the time-series ridership data with multiple category values.
94+
## Example 2: Detecting events in New York City taxi ridership time-series data by category
95+
96+
This example trains an RCF model and uses it to detect anomalies in time-series ridership data across multiple category values:
5797

5898
```ppl ignore
5999
source=nyc_taxi
@@ -62,7 +102,7 @@ source=nyc_taxi
62102
| where value=10844.0 or value=6526.0
63103
```
64104

65-
Expected output:
105+
The query returns the following results:
66106

67107
```text
68108
fetched rows / total rows = 2/2
@@ -74,9 +114,10 @@ fetched rows / total rows = 2/2
74114
+----------+---------+---------------------+-------+---------------+
75115
```
76116

77-
## Example 3: Detecting events in New York City from taxi ridership data with non-time-series data
78117

79-
This example trains an RCF model and uses the model to detect anomalies in the non-time-series ridership data.
118+
## Example 3: Detecting events in New York City taxi ridership non-time-series data
119+
120+
This example trains an RCF model and uses it to detect anomalies in non-time-series ridership data:
80121

81122
```ppl ignore
82123
source=nyc_taxi
@@ -85,7 +126,7 @@ source=nyc_taxi
85126
| where value=10844.0
86127
```
87128

88-
Expected output:
129+
The query returns the following results:
89130

90131
```text
91132
fetched rows / total rows = 1/1
@@ -96,9 +137,10 @@ fetched rows / total rows = 1/1
96137
+---------+-------+-----------+
97138
```
98139

99-
## Example 4: Detecting events in New York City from taxi ridership data with non-time-series data independently with each category
100140

101-
This example trains an RCF model and uses the model to detect anomalies in the non-time-series ridership data with multiple category values.
141+
## Example 4: Detecting events in New York City taxi ridership non-time-series data by category
142+
143+
This example trains an RCF model and uses it to detect anomalies in non-time-series ridership data across multiple category values:
102144

103145
```ppl ignore
104146
source=nyc_taxi
@@ -107,7 +149,7 @@ source=nyc_taxi
107149
| where value=10844.0 or value=6526.0
108150
```
109151

110-
Expected output:
152+
The query returns the following results:
111153

112154
```text
113155
fetched rows / total rows = 2/2
@@ -119,6 +161,4 @@ fetched rows / total rows = 2/2
119161
+----------+---------+-------+-----------+
120162
```
121163

122-
## Limitations
123164

124-
The `ad` command can only work with `plugins.calcite.enabled=false`.

docs/user/ppl/cmd/addcoltotals.md

Lines changed: 29 additions & 18 deletions
Original file line number | Diff line number | Diff line change
@@ -1,21 +1,32 @@
1-
# AddColTotals
2-
31

4-
# Description
2+
# addcoltotals
53

6-
The `addcoltotals` command computes the sum of each column and add a summary event at the end to show the total of each column. This command works the same way `addtotals` command works with row=false and col=true option. This is useful for creating summary reports with subtotals or grand totals. The `addcoltotals` command only sums numeric fields (integers, floats, doubles). Non-numeric fields in the field list are ignored even if its specified in field-list or in the case of no field-list specified.
4+
The `addcoltotals` command computes the sum of each column and adds a summary row showing the total for each column. This command is equivalent to using `addtotals` with `row=false` and `col=true`, making it useful for creating summary reports with column totals.
75

8-
# Syntax
6+
The command only processes numeric fields (integers, floats, doubles). Non-numeric fields are ignored regardless of whether they are explicitly specified in the field list.
97

10-
`addcoltotals [field-list] [label=<string>] [labelfield=<field>]`
118

12-
- `field-list`: Optional. Comma-separated list of numeric fields to sum. If not specified, all numeric fields are summed.
13-
- `labelfield=<field>`: Optional. Field name to place the label. If it specifies a non-existing field, adds the field and shows label at the summary event row at this field.
14-
- `label=<string>`: Optional. Custom text for the totals row labelfield\'s label. Default is \"Total\".
9+
## Syntax
1510

16-
# Example 1: Basic Example
11+
The `addcoltotals` command has the following syntax:
1712

18-
The example shows placing the label in an existing field.
13+
```syntax
14+
addcoltotals [field-list] [label=<string>] [labelfield=<field>]
15+
```
16+
17+
## Parameters
18+
19+
The `addcoltotals` command supports the following parameters.
20+
21+
| Parameter | Required/Optional | Description |
22+
| --- | --- | --- |
23+
| `<field-list>` | Optional | A comma-separated list of numeric fields to add. By default, all numeric fields are added. |
24+
| `labelfield` | Optional | The field in which the label is placed. If the field does not exist, it is created and the label is shown in the summary row (last row) of the new field. |
25+
| `label` | Optional | The text that appears in the summary row (last row) to identify the computed totals. When used with `labelfield`, this text is placed in the specified field in the summary row. Default is `Total`. |
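
The behavior described by the parameters above can be approximated outside of PPL. The following Python sketch is a rough emulation under stated assumptions, not the command's implementation: the `add_col_totals` function is a hypothetical name, rows are modeled as dictionaries, and columns that are not summed are assumed to be empty (`None`) in the summary row:

```python
from typing import Any, Dict, List, Optional


def add_col_totals(
    rows: List[Dict[str, Any]],
    fields: Optional[List[str]] = None,
    label: str = "Total",
    labelfield: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Return the input rows plus one summary row holding per-column sums."""

    def is_numeric(value: Any) -> bool:
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    # Preserve column order as first seen across the rows.
    columns: List[str] = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    # With no field list, every column is a candidate for summing.
    candidates = fields if fields is not None else columns
    totals: Dict[str, Any] = {}
    for name in candidates:
        values = [row[name] for row in rows if row.get(name) is not None]
        # Only columns whose non-null values are all numeric are summed;
        # other columns are ignored, even if listed explicitly.
        if values and all(is_numeric(v) for v in values):
            totals[name] = sum(values)

    # Assumption: non-summed columns are empty (None) in the summary row.
    summary: Dict[str, Any] = {name: totals.get(name) for name in columns}
    if labelfield is not None:
        # A labelfield that does not exist yet becomes a new column.
        summary[labelfield] = label
        if labelfield not in columns:
            columns.append(labelfield)

    # Pad the original rows so every row has the same columns.
    padded = [{name: row.get(name) for name in columns} for row in rows]
    return padded + [summary]


# Hypothetical rows shaped like the output of `stats count() by gender`.
rows = [
    {"gender": "F", "count()": 1},
    {"gender": "M", "count()": 3},
]
result = add_col_totals(rows, fields=["count()"], label="Sum", labelfield="Total")
for row in result:
    print(row)
```

In this sketch, `labelfield="Total"` does not match an existing column, so a new `Total` column is created and holds the label only in the last row.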
26+
27+
## Example 1: Basic example
28+
29+
The following query places the label in an existing field:
1930

2031
```ppl
2132
source=accounts
@@ -24,7 +35,7 @@ source=accounts
2435
| addcoltotals labelfield='firstname'
2536
```
2637

27-
Expected output:
38+
The query returns the following results:
2839

2940
```text
3041
fetched rows / total rows = 4/4
@@ -38,17 +49,17 @@ fetched rows / total rows = 4/4
3849
+-----------+---------+
3950
```
4051

41-
# Example 2: Adding column totals and adding a summary event with label specified.
52+
## Example 2: Adding column totals with a custom summary label
4253

43-
The example shows adding totals after a stats command where final summary event label is \'Sum\' and row=true value was used by default when not specified. It also added new field specified by labelfield as it did not match existing field.
54+
The following query adds totals after a `stats` command where the final summary event label is `Sum`. It also creates a new field specified by `labelfield` because this field does not exist in the data:
4455

4556
```ppl
4657
source=accounts
4758
| stats count() by gender
4859
| addcoltotals `count()` label='Sum' labelfield='Total'
4960
```
5061

51-
Expected output:
62+
The query returns the following results:
5263

5364
```text
5465
fetched rows / total rows = 3/3
@@ -61,9 +72,9 @@ fetched rows / total rows = 3/3
6172
+---------+--------+-------+
6273
```
6374

64-
# Example 3: With all options
75+
## Example 3: Using all options
6576

66-
The example shows using addcoltotals with all options set.
77+
The following query uses the `addcoltotals` command with all options set:
6778

6879
```ppl
6980
source=accounts
@@ -73,7 +84,7 @@ source=accounts
7384
| addcoltotals avg_balance, count label='Sum' labelfield='Column Total'
7485
```
7586

76-
Expected output:
87+
The query returns the following results:
7788

7889
```text
7990
fetched rows / total rows = 4/4
