Added missing part 03_Aggregations

clintongormley · clintongormley · commit de2478d1b1a5 · 2014-11-30T12:20:09.000+01:00
diff --git a/02_Dealing_with_language.asciidoc b/02_Dealing_with_language.asciidoc
@@ -64,16 +64,3 @@ include::240_Stopwords.asciidoc[]
 include::260_Synonyms.asciidoc[]
 
 include::270_Fuzzy_matching.asciidoc[]
-
-include::301_Aggregation_Overview.asciidoc[]
-
-include::302_Example_Walkthrough.asciidoc[]
-
-include::303_Making_Graphs.asciidoc[]
-
-include::304_Approximate_Aggregations.asciidoc[]
-
-include::305_Significant_Terms.asciidoc[]
-
-include::306_Practical_Considerations.asciidoc[]
-
diff --git a/03_Aggregations.asciidoc b/03_Aggregations.asciidoc
@@ -1,45 +1,60 @@
-[[aggregations]]
-= Aggregations
-
-[partintro]
---
-Until this point, this book has been dedicated to search.((("searching", "search versus aggregations")))((("aggregations")))  With search, 
-we have a query and we want to find a subset of documents that
-match the query.  We are looking for the proverbial needle(s) in the
-haystack.
-
-With aggregations, we zoom out to get an overview of our data.  Instead of 
-looking for individual documents, we want to analyze and summarize our complete 
-set of data:
-
-// Popular manufacturers? Unusual clumps of needles in the haystack?
-- How many needles are in the haystack?
-- What is the average length of the needles?
-- What is the median length of the needles, broken down by manufacturer?
-- How many needles were added to the haystack each month?
-
-Aggregations can answer more subtle questions too:
-
-- What are your most popular needle manufacturers?
-- Are there any unusual or anomalous clumps of needles?
-
-Aggregations allow us to ask sophisticated questions of our data.  And yet, while
-the functionality is completely different from search, it leverages the
-same data-structures.  This means aggregations execute quickly and are
-_near real-time_, just like search.
-
-This is extremely powerful for reporting and dashboards.  Instead of performing
-_rollups_ of your data (_that crusty Hadoop job that takes a week to run_), 
-you can visualize your data in real time, allowing you to respond immediately.
-
-// Perhaps mention "not precalculated, out of date, and irrelevant"?
-// Perhaps "aggs are calculated in the context of the user's search, so you're not showing them that you have 10 4 star hotels on your site, but that you have 10 4 star hotels that *match their criteria*".
-
-Finally, aggregations operate alongside search requests.((("aggregations", "operating alongside search requests"))) This means you can
-both search/filter documents _and_ perform analytics at the same time, on the
-same data, in a single request.  And because aggregations are calculated in the
-context of a user's search, you're not just displaying a count of four-star hotels--you're displaying a count of four-star hotels that _match their search criteria_.
-
-Aggregations are so powerful that many companies have built large Elasticsearch
-clusters solely for analytics.
---
+ifndef::es_build[= placeholder3]
+
+[[aggregations]]
+= Aggregations
+
+[partintro]
+--
+Until this point, this book has been dedicated to search.((("searching", "search versus aggregations")))((("aggregations")))  With search,
+we have a query and we want to find a subset of documents that
+match the query.  We are looking for the proverbial needle(s) in the
+haystack.
+
+With aggregations, we zoom out to get an overview of our data.  Instead of
+looking for individual documents, we want to analyze and summarize our complete
+set of data:
+
+// Popular manufacturers? Unusual clumps of needles in the haystack?
+- How many needles are in the haystack?
+- What is the average length of the needles?
+- What is the median length of the needles, broken down by manufacturer?
+- How many needles were added to the haystack each month?
+
+Aggregations can answer more subtle questions too:
+
+- What are your most popular needle manufacturers?
+- Are there any unusual or anomalous clumps of needles?
+
+Aggregations allow us to ask sophisticated questions of our data.  And yet, while
+the functionality is completely different from search, it leverages the
+same data-structures.  This means aggregations execute quickly and are
+_near real-time_, just like search.
+
+This is extremely powerful for reporting and dashboards.  Instead of performing
+_rollups_ of your data (_that crusty Hadoop job that takes a week to run_),
+you can visualize your data in real time, allowing you to respond immediately.
+
+// Perhaps mention "not precalculated, out of date, and irrelevant"?
+// Perhaps "aggs are calculated in the context of the user's search, so you're not showing them that you have 10 4 star hotels on your site, but that you have 10 4 star hotels that *match their criteria*".
+
+Finally, aggregations operate alongside search requests.((("aggregations", "operating alongside search requests"))) This means you can
+both search/filter documents _and_ perform analytics at the same time, on the
+same data, in a single request.  And because aggregations are calculated in the
+context of a user's search, you're not just displaying a count of four-star hotels--you're displaying a count of four-star hotels that _match their search criteria_.
+
+Aggregations are so powerful that many companies have built large Elasticsearch
+clusters solely for analytics.
+--
+
+include::301_Aggregation_Overview.asciidoc[]
+
+include::302_Example_Walkthrough.asciidoc[]
+
+include::303_Making_Graphs.asciidoc[]
+
+include::304_Approximate_Aggregations.asciidoc[]
+
+include::305_Significant_Terms.asciidoc[]
+
+include::306_Practical_Considerations.asciidoc[]
+
diff --git a/300_Aggregations/15_concepts_buckets.asciidoc b/300_Aggregations/15_concepts_buckets.asciidoc
diff --git a/300_Aggregations/30_histogram.asciidoc b/300_Aggregations/30_histogram.asciidoc
@@ -103,6 +103,7 @@ means `0-20,000`, the key `20000` means `20,000-40,000`, and so forth.
 Graphically, you could represent the preceding data in the histogram shown in <<barcharts-histo1>>:
 
 [[barcharts-histo1]]
+.Histogram of top makes per price range
 image::images/elas_28in01.png["Histogram of top makes per price range"]
 
 Of course, you can build bar charts with any aggregation that emits categories
@@ -144,6 +145,7 @@ std_err = std_deviation / count
 This will allow us to build a chart like <<barcharts-bar1>>:
 
 [[barcharts-bar1]]
+.Barchart of average price per make, with error bars
 image::images/elas_28in02.png["Barchart of average price per make, with error bars"]
 
 
diff --git a/301_Aggregation_Overview.asciidoc b/301_Aggregation_Overview.asciidoc
@@ -1,4 +1,86 @@
+[[aggs-high-level]]
+== High-Level Concepts
+
+Like the query DSL, ((("aggregations", "high-level concepts")))aggregations have a _composable_ syntax: independent units
+of functionality can be mixed and matched to provide the custom behavior that
+you need. This means that there are only a few basic concepts to learn, but
+nearly limitless combinations of those basic components.
+
+To master aggregations, you need to understand only two main concepts:
+
+_Buckets_:: Collections of documents that meet a criterion
+_Metrics_:: Statistics calculated on the documents in a bucket
+
+That's it!  Every aggregation is simply a combination of one or more buckets
+and zero or more metrics. To translate into rough SQL terms:
+
+[source,sql]
+--------------------------------------------------
+SELECT COUNT(color) <1>
+FROM table
+GROUP BY color <2>
+--------------------------------------------------
+<1> `COUNT(color)` is equivalent to a metric.
+<2> `GROUP BY color` is equivalent to a bucket.
+
+Buckets are conceptually similar to grouping in SQL, while metrics are similar
+to `COUNT()`, `SUM()`, `MAX()`, and so forth.
+
+
+Let's dig into both of these concepts((("aggregations", "high-level concepts", "buckets")))((("buckets"))) and see what they entail.
+
+=== Buckets
+
+A _bucket_ is simply a collection of documents that meet a certain criterion:
+
+- An employee would land in either the _male_ or _female_ bucket.
+- The city of Albany would land in the _New York_ state bucket.
+- The date 2014-10-28 would land within the _October_ bucket.
+
+As aggregations are executed, the values inside each document are evaluated to
+determine whether they match a bucket's criteria.  If they match, the document is placed
+inside the bucket and the aggregation continues.
+
+Buckets can also be nested inside other buckets, giving you a hierarchy or
+conditional partitioning scheme.  For example, Cincinnati would be placed inside
+the Ohio state bucket, and the _entire_ Ohio bucket would be placed inside the
+USA country bucket.
+
+Elasticsearch has a variety of buckets, which allow you to
+partition documents in many ways (by hour, by most-popular terms, by
+age ranges, by geographical location, and more).  But fundamentally they all operate
+on the same principle: partitioning documents based on a criterion.
+
+=== Metrics
+
+Buckets allow us to partition documents into useful subsets,((("aggregations", "high-level concepts", "metrics")))((("metrics"))) but ultimately what
+we want is some kind of metric calculated on those documents in each bucket.
+Bucketing is a means to an end: it groups documents so that
+you can calculate interesting metrics on each group.
+
+Most _metrics_ are simple mathematical operations (for example, min, mean, max, and sum)
+that are calculated using the document values.  In practical terms, metrics allow
+you to calculate quantities such as the average salary, or the maximum sale price,
+or the 95th percentile for query latency.
+
+=== Combining the Two
+
+An _aggregation_ is a combination of buckets and metrics.((("aggregations", "high-level concepts", "combining buckets and metrics")))((("buckets", "combining with metrics")))((("metrics", "combining with buckets")))  An aggregation may have
+a single bucket, or a single metric, or one of each.  It may even have multiple
+buckets nested inside other buckets. For example, we can partition documents by which country they belong to (a bucket), and
+then calculate the average salary per country (a metric).
+
+Because buckets can be nested, we can derive a much more complex aggregation:
+
+1. Partition documents by country (bucket).
+2. Then partition each country bucket by gender (bucket).
+3. Then partition each gender bucket by age ranges (bucket).
+4. Finally, calculate the average salary for each age range (metric).
+
+This will give you the average salary per `<country, gender, age>` combination.  All in
+one request and with one pass over the data!
+
+
+
 
-include::300_Aggregations/05_overview.asciidoc[]
 
-include::300_Aggregations/15_concepts_buckets.asciidoc[]
diff --git a/306_Practical_Considerations.asciidoc b/306_Practical_Considerations.asciidoc
@@ -1,3 +1,4 @@
+[[controlling-memory]]
 == Controlling Memory Use and Latency
 
 include::300_Aggregations/90_fielddata.asciidoc[]
diff --git a/atlas.json b/atlas.json
@@ -8,6 +8,7 @@
     "00_Getting_started.asciidoc",
     "01_Search_in_depth.asciidoc",
     "02_Dealing_with_language.asciidoc",
+    "03_Aggregations.asciidoc",
     "04_Geolocation.asciidoc",
     "06_Modeling_your_data.asciidoc",
     "07_Admin.asciidoc",
diff --git a/book.asciidoc b/book.asciidoc
@@ -11,6 +11,7 @@ include::01_Search_in_depth.asciidoc[]
 
 include::02_Dealing_with_language.asciidoc[]
 
+include::03_Aggregations.asciidoc[]
 
 include::04_Geolocation.asciidoc[]
 
