This repository was archived by the owner on Sep 21, 2021. It is now read-only.

Commit de2478d

Added missing part 03_Aggregations
1 parent b61e63f commit de2478d

8 files changed: 149 additions & 146 deletions

02_Dealing_with_language.asciidoc

Lines changed: 0 additions & 13 deletions
Original file line number | Diff line number | Diff line change
@@ -64,16 +64,3 @@ include::240_Stopwords.asciidoc[]
6464
include::260_Synonyms.asciidoc[]
6565

6666
include::270_Fuzzy_matching.asciidoc[]
67-
68-
include::301_Aggregation_Overview.asciidoc[]
69-
70-
include::302_Example_Walkthrough.asciidoc[]
71-
72-
include::303_Making_Graphs.asciidoc[]
73-
74-
include::304_Approximate_Aggregations.asciidoc[]
75-
76-
include::305_Significant_Terms.asciidoc[]
77-
78-
include::306_Practical_Considerations.asciidoc[]
79-
Lines changed: 60 additions & 45 deletions
@@ -1,45 +1,60 @@
1-
[[aggregations]]
2-
= Aggregations
3-
4-
[partintro]
5-
--
6-
Until this point, this book has been dedicated to search.((("searching", "search versus aggregations")))((("aggregations"))) With search,
7-
we have a query and we want to find a subset of documents that
8-
match the query. We are looking for the proverbial needle(s) in the
9-
haystack.
10-
11-
With aggregations, we zoom out to get an overview of our data. Instead of
12-
looking for individual documents, we want to analyze and summarize our complete
13-
set of data:
14-
15-
// Popular manufacturers? Unusual clumps of needles in the haystack?
16-
- How many needles are in the haystack?
17-
- What is the average length of the needles?
18-
- What is the median length of the needles, broken down by manufacturer?
19-
- How many needles were added to the haystack each month?
20-
21-
Aggregations can answer more subtle questions too:
22-
23-
- What are your most popular needle manufacturers?
24-
- Are there any unusual or anomalous clumps of needles?
25-
26-
Aggregations allow us to ask sophisticated questions of our data. And yet, while
27-
the functionality is completely different from search, it leverages the
28-
same data-structures. This means aggregations execute quickly and are
29-
_near real-time_, just like search.
30-
31-
This is extremely powerful for reporting and dashboards. Instead of performing
32-
_rollups_ of your data (_that crusty Hadoop job that takes a week to run_),
33-
you can visualize your data in real time, allowing you to respond immediately.
34-
35-
// Perhaps mention "not precalculated, out of date, and irrelevant"?
36-
// Perhaps "aggs are calculated in the context of the user's search, so you're not showing them that you have 10 4 star hotels on your site, but that you have 10 4 star hotels that *match their criteria*".
37-
38-
Finally, aggregations operate alongside search requests.((("aggregations", "operating alongside search requests"))) This means you can
39-
both search/filter documents _and_ perform analytics at the same time, on the
40-
same data, in a single request. And because aggregations are calculated in the
41-
context of a user's search, you're not just displaying a count of four-star hotels--you're displaying a count of four-star hotels that _match their search criteria_.
42-
43-
Aggregations are so powerful that many companies have built large Elasticsearch
44-
clusters solely for analytics.
45-
--
1+
ifndef::es_build[= placeholder3]
2+
3+
[[aggregations]]
4+
= Aggregations
5+
6+
[partintro]
7+
--
8+
Until this point, this book has been dedicated to search.((("searching", "search versus aggregations")))((("aggregations"))) With search,
9+
we have a query and we want to find a subset of documents that
10+
match the query. We are looking for the proverbial needle(s) in the
11+
haystack.
12+
13+
With aggregations, we zoom out to get an overview of our data. Instead of
14+
looking for individual documents, we want to analyze and summarize our complete
15+
set of data:
16+
17+
// Popular manufacturers? Unusual clumps of needles in the haystack?
18+
- How many needles are in the haystack?
19+
- What is the average length of the needles?
20+
- What is the median length of the needles, broken down by manufacturer?
21+
- How many needles were added to the haystack each month?
22+
23+
Aggregations can answer more subtle questions too:
24+
25+
- What are your most popular needle manufacturers?
26+
- Are there any unusual or anomalous clumps of needles?
27+
28+
Aggregations allow us to ask sophisticated questions of our data. And yet, while
29+
the functionality is completely different from search, it leverages the
30+
same data-structures. This means aggregations execute quickly and are
31+
_near real-time_, just like search.
32+
33+
This is extremely powerful for reporting and dashboards. Instead of performing
34+
_rollups_ of your data (_that crusty Hadoop job that takes a week to run_),
35+
you can visualize your data in real time, allowing you to respond immediately.
36+
37+
// Perhaps mention "not precalculated, out of date, and irrelevant"?
38+
// Perhaps "aggs are calculated in the context of the user's search, so you're not showing them that you have 10 4 star hotels on your site, but that you have 10 4 star hotels that *match their criteria*".
39+
40+
Finally, aggregations operate alongside search requests.((("aggregations", "operating alongside search requests"))) This means you can
41+
both search/filter documents _and_ perform analytics at the same time, on the
42+
same data, in a single request. And because aggregations are calculated in the
43+
context of a user's search, you're not just displaying a count of four-star hotels--you're displaying a count of four-star hotels that _match their search criteria_.
44+
45+
Aggregations are so powerful that many companies have built large Elasticsearch
46+
clusters solely for analytics.
47+
--
48+
49+
include::301_Aggregation_Overview.asciidoc[]
50+
51+
include::302_Example_Walkthrough.asciidoc[]
52+
53+
include::303_Making_Graphs.asciidoc[]
54+
55+
include::304_Approximate_Aggregations.asciidoc[]
56+
57+
include::305_Significant_Terms.asciidoc[]
58+
59+
include::306_Practical_Considerations.asciidoc[]
60+

300_Aggregations/15_concepts_buckets.asciidoc

Lines changed: 0 additions & 86 deletions
This file was deleted.

300_Aggregations/30_histogram.asciidoc

Lines changed: 2 additions & 0 deletions
@@ -103,6 +103,7 @@ means `0-20,000`, the key `20000` means `20,000-40,000`, and so forth.
103103
Graphically, you could represent the preceding data in the histogram shown in <<barcharts-histo1>>:
104104

105105
[[barcharts-histo1]]
106+
.Histogram of top makes per price range
106107
image::images/elas_28in01.png["Histogram of top makes per price range"]
107108

108109
Of course, you can build bar charts with any aggregation that emits categories
@@ -144,6 +145,7 @@ std_err = std_deviation / count
144145
This will allow us to build a chart like <<barcharts-bar1>>:
145146

146147
[[barcharts-bar1]]
148+
.Barchart of average price per make, with error bars
147149
image::images/elas_28in02.png["Barchart of average price per make, with error bars"]
148150

149151

301_Aggregation_Overview.asciidoc

Lines changed: 84 additions & 2 deletions
@@ -1,4 +1,86 @@
1+
[[aggs-high-level]]
2+
== High-Level Concepts
3+
4+
Like the query DSL, ((("aggregations", "high-level concepts")))aggregations have a _composable_ syntax: independent units
5+
of functionality can be mixed and matched to provide the custom behavior that
6+
you need. This means that there are only a few basic concepts to learn, but
7+
nearly limitless combinations of those basic components.
8+
9+
To master aggregations, you need to understand only two main concepts:
10+
11+
_Buckets_:: Collections of documents that meet a criterion
12+
_Metrics_:: Statistics calculated on the documents in a bucket
13+
14+
That's it! Every aggregation is simply a combination of one or more buckets
15+
and zero or more metrics. To translate into rough SQL terms:
16+
17+
[source,sql]
18+
--------------------------------------------------
19+
SELECT COUNT(color) <1>
20+
FROM table
21+
GROUP BY color <2>
22+
--------------------------------------------------
23+
<1> `COUNT(color)` is equivalent to a metric.
24+
<2> `GROUP BY color` is equivalent to a bucket.
25+
26+
Buckets are conceptually similar to grouping in SQL, while metrics are similar
27+
to `COUNT()`, `SUM()`, `MAX()`, and so forth.
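The SQL analogy above can also be sketched in plain Python. This is only an illustration of the bucket/metric split, not how Elasticsearch executes aggregations; the sample documents and the `count_by` helper are hypothetical.

```python
from collections import Counter

# Hypothetical documents; the `color` field mirrors the SQL example.
docs = [
    {"color": "red"},
    {"color": "blue"},
    {"color": "red"},
    {"color": "green"},
    {"color": "red"},
]


def count_by(documents, field):
    """Bucket documents by `field` (the GROUP BY) and count each bucket (the COUNT)."""
    return dict(Counter(doc[field] for doc in documents))


color_counts = count_by(docs, "color")
print(color_counts)  # {'red': 3, 'blue': 1, 'green': 1}
```

Each distinct `color` value plays the role of a bucket, and the count attached to it plays the role of a metric.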
28+
29+
30+
Let's dig into both of these concepts((("aggregations", "high-level concepts", "buckets")))((("buckets"))) and see what they entail.
31+
32+
=== Buckets
33+
34+
A _bucket_ is simply a collection of documents that meet a certain criterion:
35+
36+
- An employee would land in either the _male_ or _female_ bucket.
37+
- The city of Albany would land in the _New York_ state bucket.
38+
- The date 2014-10-28 would land within the _October_ bucket.
39+
40+
As aggregations are executed, the values inside each document are evaluated to
41+
determine whether they match a bucket's criteria. If they match, the document is placed
42+
inside the bucket and the aggregation continues.
43+
44+
Buckets can also be nested inside other buckets, giving you a hierarchy or
45+
conditional partitioning scheme. For example, Cincinnati would be placed inside
46+
the Ohio state bucket, and the _entire_ Ohio bucket would be placed inside the
47+
USA country bucket.
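As a rough sketch of nested buckets, the following plain-Python example places cities into state buckets, which in turn sit inside country buckets. The sample data and the `nest_buckets` helper are assumptions made for illustration; Elasticsearch builds its buckets internally and differently.

```python
from collections import defaultdict

# Hypothetical documents, one per city.
cities = [
    {"city": "Cincinnati", "state": "Ohio", "country": "USA"},
    {"city": "Albany", "state": "New York", "country": "USA"},
    {"city": "Columbus", "state": "Ohio", "country": "USA"},
]


def nest_buckets(documents, outer, inner):
    """Group documents by `outer`, then split each outer bucket by `inner`."""
    buckets = defaultdict(lambda: defaultdict(list))
    for doc in documents:
        buckets[doc[outer]][doc[inner]].append(doc)
    # Convert to plain dicts so the result is easy to inspect.
    return {key: dict(value) for key, value in buckets.items()}


nested = nest_buckets(cities, "country", "state")
ohio_cities = [doc["city"] for doc in nested["USA"]["Ohio"]]
print(ohio_cities)
```

Cincinnati lands in the Ohio bucket, and the whole Ohio bucket lives inside the USA bucket, which matches the hierarchy described above.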
48+
49+
Elasticsearch has a variety of buckets, which allow you to
50+
partition documents in many ways (by hour, by most-popular terms, by
51+
age ranges, by geographical location, and more). But fundamentally they all operate
52+
on the same principle: partitioning documents based on a criterion.
53+
54+
=== Metrics
55+
56+
Buckets allow us to partition documents into useful subsets,((("aggregations", "high-level concepts", "metrics")))((("metrics"))) but ultimately what
57+
we want is some kind of metric calculated on those documents in each bucket.
58+
Bucketing is the means to an end: it groups documents so
59+
that you can calculate interesting metrics on each group.
60+
61+
Most _metrics_ are simple mathematical operations (for example, min, mean, max, and sum)
62+
that are calculated using the document values. In practical terms, metrics allow
63+
you to calculate quantities such as the average salary, or the maximum sale price,
64+
or the 95th percentile for query latency.
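To make the idea of a metric concrete, here is a minimal sketch that computes a few statistics over the values in one bucket. The salary figures and the `percentile` helper are hypothetical. The helper uses the exact nearest-rank method, which is an assumption chosen for simplicity and is not necessarily how Elasticsearch computes percentiles.

```python
import math

# Hypothetical salary values taken from the documents in a single bucket.
salaries = [30000, 45000, 50000, 62000, 80000]


def percentile(values, p):
    """Exact nearest-rank percentile: the smallest value with at least p% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


metrics = {
    "min": min(salaries),
    "max": max(salaries),
    "sum": sum(salaries),
    "avg": sum(salaries) / len(salaries),
    "p95": percentile(salaries, 95),
}
print(metrics)
```

Every entry in `metrics` is a single number derived from the bucket's documents, which is all a metric is.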
65+
66+
=== Combining the Two
67+
68+
An _aggregation_ is a combination of buckets and metrics.((("aggregations", "high-level concepts", "combining buckets and metrics")))((("buckets", "combining with metrics")))((("metrics", "combining with buckets"))) An aggregation may have
69+
a single bucket, or a single metric, or one of each. It may even have multiple
70+
buckets nested inside other buckets. For example, we can partition documents by which country they belong to (a bucket), and
71+
then calculate the average salary per country (a metric).
72+
73+
Because buckets can be nested, we can derive a much more complex aggregation:
74+
75+
1. Partition documents by country (bucket).
76+
2. Then partition each country bucket by gender (bucket).
77+
3. Then partition each gender bucket by age ranges (bucket).
78+
4. Finally, calculate the average salary for each age range (metric).
79+
80+
This will give you the average salary per `<country, gender, age>` combination. All in
81+
one request and with one pass over the data!
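The four steps above can be sketched in plain Python as a single pass over the documents. The employee records, the ten-year age ranges, and the `age_range` and `average_salary` helpers are all assumptions made for this illustration; this is not Elasticsearch's implementation.

```python
from collections import defaultdict

# Hypothetical employee documents.
employees = [
    {"country": "US", "gender": "female", "age": 34, "salary": 70000},
    {"country": "US", "gender": "female", "age": 38, "salary": 90000},
    {"country": "US", "gender": "male", "age": 52, "salary": 60000},
    {"country": "FR", "gender": "male", "age": 29, "salary": 40000},
]


def age_range(age, width=10):
    """Map an age to a label such as '30-39' (the bucket width is an assumption)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def average_salary(documents):
    """Bucket by (country, gender, age range), then average the salary per bucket."""
    totals = defaultdict(lambda: [0, 0])  # bucket key -> [salary sum, document count]
    for doc in documents:  # one pass over the data
        key = (doc["country"], doc["gender"], age_range(doc["age"]))
        totals[key][0] += doc["salary"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


averages = average_salary(employees)
for key, value in averages.items():
    print(key, value)
```

The tuple key stands in for the nested country, gender, and age-range buckets, and the running sum and count are enough to produce the average-salary metric without revisiting any document.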
82+
83+
84+
185

2-
include::300_Aggregations/05_overview.asciidoc[]
386

4-
include::300_Aggregations/15_concepts_buckets.asciidoc[]

306_Practical_Considerations.asciidoc

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
1+
[[controlling-memory]]
12
== Controlling Memory Use and Latency
23

34
include::300_Aggregations/90_fielddata.asciidoc[]

atlas.json

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@
88
"00_Getting_started.asciidoc",
99
"01_Search_in_depth.asciidoc",
1010
"02_Dealing_with_language.asciidoc",
11+
"03_Aggregations.asciidoc",
1112
"04_Geolocation.asciidoc",
1213
"06_Modeling_your_data.asciidoc",
1314
"07_Admin.asciidoc",

book.asciidoc

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ include::01_Search_in_depth.asciidoc[]
1111

1212
include::02_Dealing_with_language.asciidoc[]
1313

14+
include::03_Aggregations.asciidoc[]
1415

1516
include::04_Geolocation.asciidoc[]
1617
