
Commit 438152b

kaisecheng and jsvd authored
Add support to search after (#198)
This commit introduces the `search_api` option to enable support for both `search_after` and `scroll`.

- `auto`: Default value. It uses `search_after` for Elasticsearch 8 and newer versions; otherwise it falls back to the `scroll` API.
- `search_after`: Creates a point in time for search, which is the recommended way to do paginated search since 7.10.
- `scroll`: Uses the scroll API to search, which is the previous way to search.

When `search_after` is utilized, if the query doesn't specify a sort field, a default sort of `{ "sort": { "_shard_doc": "asc" } }` will be added to the query. By default, Elasticsearch implicitly adds the sort field `_shard_doc` on top of an existing sort as a tie-breaker.

The scroll search has been refactored and is expected to function the same, with only minor changes made to the logging messages.

Co-authored-by: João Duarte <[email protected]>
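The default-sort rule described in the commit message can be sketched in a few lines of Ruby. This is an illustration only: `with_default_sort` and `DEFAULT_SORT` are hypothetical names, not identifiers from the plugin, and the sketch assumes the query arrives as a JSON string.

```ruby
require "json"

# Hypothetical constant: the default sort named in the commit message.
DEFAULT_SORT = { "_shard_doc" => "asc" }.freeze

# Hypothetical helper (not part of the plugin): parses the query and adds
# the default sort only when the query has no "sort" key of its own.
def with_default_sort(query_json)
  query = JSON.parse(query_json)
  query["sort"] = DEFAULT_SORT unless query.key?("sort")
  query
end

# A query without a sort gets the default; one with a sort is left alone.
puts JSON.generate(with_default_sort('{ "query": { "match_all": {} } }'))
puts JSON.generate(with_default_sort('{ "query": { "match_all": {} }, "sort": [{ "@timestamp": "desc" }] }'))
```

A user-supplied sort is never overwritten in this sketch, which matches the commit message's wording that the default is added only when the query doesn't specify a sort field.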
1 parent f711348 commit 438152b

8 files changed, +562 -171 lines changed

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+## 4.19.0
+- Added `search_api` option to support `search_after` and `scroll` [#198](https://github.com/logstash-plugins/logstash-input-elasticsearch/pull/198)
+- The default value `auto` uses `search_after` for Elasticsearch >= 8, otherwise, fall back to `scroll`
+
 ## 4.18.0
 - Added request header `Elastic-Api-Version` for serverless [#195](https://github.com/logstash-plugins/logstash-input-elasticsearch/pull/195)

docs/index.asciidoc

Lines changed: 17 additions & 0 deletions
@@ -118,6 +118,7 @@ This plugin supports the following configuration options plus the <<plugins-{typ
 | <<plugins-{type}s-{plugin}-request_timeout_seconds>> | <<number,number>>|No
 | <<plugins-{type}s-{plugin}-schedule>> |<<string,string>>|No
 | <<plugins-{type}s-{plugin}-scroll>> |<<string,string>>|No
+| <<plugins-{type}s-{plugin}-search_api>> |<<string,string>>, one of `["auto", "search_after", "scroll"]`|No
 | <<plugins-{type}s-{plugin}-size>> |<<number,number>>|No
 | <<plugins-{type}s-{plugin}-slices>> |<<number,number>>|No
 | <<plugins-{type}s-{plugin}-ssl_certificate>> |<<path,path>>|No
@@ -333,6 +334,9 @@ environment variables e.g. `proxy => '${LS_PROXY:}'`.
 The query to be executed. Read the {ref}/query-dsl.html[Elasticsearch query DSL
 documentation] for more information.

+When <<plugins-{type}s-{plugin}-search_api>> resolves to `search_after` and the query does not specify `sort`,
+the default sort `'{ "sort": { "_shard_doc": "asc" } }'` will be added to the query. Please refer to the {ref}/paginate-search-results.html#search-after[Elasticsearch search_after] parameter to know more.
+
 [id="plugins-{type}s-{plugin}-request_timeout_seconds"]
 ===== `request_timeout_seconds`

@@ -377,6 +381,19 @@ This parameter controls the keepalive time in seconds of the scrolling
 request and initiates the scrolling process. The timeout applies per
 round trip (i.e. between the previous scroll request, to the next).

+[id="plugins-{type}s-{plugin}-search_api"]
+===== `search_api`
+
+* Value can be any of: `auto`, `search_after`, `scroll`
+* Default value is `auto`
+
+With `auto` the plugin uses the `search_after` parameter for Elasticsearch version `8.0.0` or higher, otherwise the `scroll` API is used instead.
+
+`search_after` uses {ref}/point-in-time-api.html#point-in-time-api[point in time] and sort value to search.
+The query requires at least one `sort` field, as described in the <<plugins-{type}s-{plugin}-query>> parameter.
+
+`scroll` uses {ref}/paginate-search-results.html#scroll-search-results[scroll] API to search, which is no longer recommended.
+
 [id="plugins-{type}s-{plugin}-size"]
 ===== `size`
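The documentation above says that `search_after` pages through results using sort values. The loop below is a rough, in-memory sketch of that idea, not the plugin's implementation: `DOCS`, `fake_search`, and `each_page` are hypothetical stand-ins, no Elasticsearch client is involved, and the point-in-time handling that a real request would carry is omitted.

```ruby
# In-memory stand-in for an index (an assumption for illustration). Each hit
# carries a "sort" array, as sorted Elasticsearch hits do.
DOCS = (1..7).map { |i| { "_id" => "doc-#{i}", "sort" => [i] } }.freeze

# Hypothetical stand-in for one search request: returns up to `size` hits
# whose sort values come strictly after `search_after` (nil means first page).
def fake_search(size:, search_after: nil)
  remaining = if search_after.nil?
    DOCS
  else
    DOCS.select { |doc| (doc["sort"] <=> search_after) == 1 }
  end
  remaining.first(size)
end

# Pages through all hits, feeding the last hit's sort values into the next
# request, and stops when a page comes back empty.
def each_page(size)
  pages = []
  cursor = nil
  loop do
    hits = fake_search(size: size, search_after: cursor)
    break if hits.empty?
    pages << hits.map { |hit| hit["_id"] }
    cursor = hits.last["sort"]
  end
  pages
end

each_page(3).each_with_index do |ids, index|
  puts "page #{index + 1}: #{ids.join(', ')}"
end
```

The cursor only works if every hit has a distinct position in the sort order, which is why the query needs at least one `sort` field and why `_shard_doc` is useful as a tie-breaker.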

lib/logstash/inputs/elasticsearch.rb

Lines changed: 46 additions & 92 deletions
@@ -11,7 +11,6 @@
 require "logstash/plugin_mixins/scheduler"
 require "logstash/plugin_mixins/normalize_config_support"
 require "base64"
-require 'logstash/helpers/loggable_try'

 require "elasticsearch"
 require "elasticsearch/transport/transport/http/manticore"
@@ -74,6 +73,8 @@
 #
 class LogStash::Inputs::Elasticsearch < LogStash::Inputs::Base

+  require 'logstash/inputs/elasticsearch/paginated_search'
+
   include LogStash::PluginMixins::ECSCompatibilitySupport(:disabled, :v1, :v8 => :v1)
   include LogStash::PluginMixins::ECSCompatibilitySupport::TargetCheck

@@ -106,6 +107,10 @@ class LogStash::Inputs::Elasticsearch < LogStash::Inputs::Base
   # The number of retries to run the query. If the query fails after all retries, it logs an error message.
   config :retries, :validate => :number, :default => 0

+  # Default `auto` will use `search_after` api for Elasticsearch 8 and use `scroll` api for 7
+  # Set to scroll to fallback to previous version
+  config :search_api, :validate => %w[auto search_after scroll], :default => "auto"
+
   # This parameter controls the keepalive time in seconds of the scrolling
   # request and initiates the scrolling process. The timeout applies per
   # round trip (i.e. between the previous scroll request, to the next).
@@ -321,93 +326,21 @@ def register

     setup_serverless

+    setup_search_api
+
     @client
   end


   def run(output_queue)
     if @schedule
-      scheduler.cron(@schedule) { do_run(output_queue) }
+      scheduler.cron(@schedule) { @paginated_search.do_run(output_queue) }
       scheduler.join
     else
-      do_run(output_queue)
-    end
-  end
-
-  private
-  JOB_NAME = "run query"
-  def do_run(output_queue)
-    # if configured to run a single slice, don't bother spinning up threads
-    if @slices.nil? || @slices <= 1
-      return retryable(JOB_NAME) do
-        do_run_slice(output_queue)
-      end
-    end
-
-    logger.warn("managed slices for query is very large (#{@slices}); consider reducing") if @slices > 8
-
-
-    @slices.times.map do |slice_id|
-      Thread.new do
-        LogStash::Util::set_thread_name("[#{pipeline_id}]|input|elasticsearch|slice_#{slice_id}")
-        retryable(JOB_NAME) do
-          do_run_slice(output_queue, slice_id)
-        end
-      end
-    end.map(&:join)
-
-    logger.trace("#{@slices} slices completed")
-  end
-
-  def retryable(job_name, &block)
-    begin
-      stud_try = ::LogStash::Helpers::LoggableTry.new(logger, job_name)
-      stud_try.try((@retries + 1).times) { yield }
-    rescue => e
-      error_details = {:message => e.message, :cause => e.cause}
-      error_details[:backtrace] = e.backtrace if logger.debug?
-      logger.error("Tried #{job_name} unsuccessfully", error_details)
+      @paginated_search.do_run(output_queue)
     end
   end

-  def do_run_slice(output_queue, slice_id=nil)
-    slice_query = @base_query
-    slice_query = slice_query.merge('slice' => { 'id' => slice_id, 'max' => @slices}) unless slice_id.nil?
-
-    slice_options = @options.merge(:body => LogStash::Json.dump(slice_query) )
-
-    logger.info("Slice starting", slice_id: slice_id, slices: @slices) unless slice_id.nil?
-
-    begin
-      r = search_request(slice_options)
-
-      r['hits']['hits'].each { |hit| push_hit(hit, output_queue) }
-      logger.debug("Slice progress", slice_id: slice_id, slices: @slices) unless slice_id.nil?
-
-      has_hits = r['hits']['hits'].any?
-      scroll_id = r['_scroll_id']
-
-      while has_hits && scroll_id && !stop?
-        has_hits, scroll_id = process_next_scroll(output_queue, scroll_id)
-        logger.debug("Slice progress", slice_id: slice_id, slices: @slices) if logger.debug? && slice_id
-      end
-      logger.info("Slice complete", slice_id: slice_id, slices: @slices) unless slice_id.nil?
-    ensure
-      clear_scroll(scroll_id)
-    end
-  end
-
-  ##
-  # @param output_queue [#<<]
-  # @param scroll_id [String]: a scroll id to resume
-  # @return [Array(Boolean,String)]: a tuple representing whether the response
-  #
-  def process_next_scroll(output_queue, scroll_id)
-    r = scroll_request(scroll_id)
-    r['hits']['hits'].each { |hit| push_hit(hit, output_queue) }
-    [r['hits']['hits'].any?, r['_scroll_id']]
-  end
-
   def push_hit(hit, output_queue)
     event = targeted_event_factory.new_event hit['_source']
     set_docinfo_fields(hit, event) if @docinfo
@@ -433,20 +366,7 @@ def set_docinfo_fields(hit, event)
     event.set(@docinfo_target, docinfo_target)
   end

-  def clear_scroll(scroll_id)
-    @client.clear_scroll(:body => { :scroll_id => scroll_id }) if scroll_id
-  rescue => e
-    # ignore & log any clear_scroll errors
-    logger.warn("Ignoring clear_scroll exception", message: e.message, exception: e.class)
-  end
-
-  def scroll_request(scroll_id)
-    @client.scroll(:body => { :scroll_id => scroll_id }, :scroll => @scroll)
-  end
-
-  def search_request(options={})
-    @client.search(options)
-  end
+  private

   def hosts_default?(hosts)
     hosts.nil? || ( hosts.is_a?(Array) && hosts.empty? )
@@ -677,6 +597,18 @@ def test_connection!
     raise LogStash::ConfigurationError, "Could not connect to a compatible version of Elasticsearch"
   end

+  def es_info
+    @es_info ||= @client.info
+  end
+
+  def es_version
+    @es_version ||= es_info&.dig('version', 'number')
+  end
+
+  def es_major_version
+    @es_major_version ||= es_version.split('.').first.to_i
+  end
+
   # recreate client with default header when it is serverless
   # verify the header by sending GET /
   def setup_serverless
@@ -691,13 +623,35 @@ def setup_serverless
   end

   def build_flavor
-    @build_flavor ||= @client.info&.dig('version', 'build_flavor')
+    @build_flavor ||= es_info&.dig('version', 'build_flavor')
   end

   def serverless?
     @is_serverless ||= (build_flavor == BUILD_FLAVOR_SERVERLESS)
   end

+  def setup_search_api
+    @resolved_search_api = if @search_api == "auto"
+      api = if es_major_version >= 8
+        "search_after"
+      else
+        "scroll"
+      end
+      logger.info("`search_api => auto` resolved to `#{api}`", :elasticsearch => es_version)
+      api
+    else
+      @search_api
+    end
+
+
+    @paginated_search = if @resolved_search_api == "search_after"
+      LogStash::Inputs::Elasticsearch::SearchAfter.new(@client, self)
+    else
+      logger.warn("scroll API is no longer recommended for pagination. Consider using search_after instead.") if es_major_version >= 8
+      LogStash::Inputs::Elasticsearch::Scroll.new(@client, self)
+    end
+  end
+
   module URIOrEmptyValidator
     ##
     # @override to provide :uri_or_empty validator
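To try the `auto` resolution outside of Logstash, the logic of `es_major_version` and `setup_search_api` can be approximated with plain hashes. This is a simplified sketch under assumptions: `major_version` and `resolve_search_api` are hypothetical free functions, and the hashes stand in for the parsed response that `@client.info` returns in the diff above.

```ruby
# Hypothetical helper: extracts the major version from an info hash shaped
# like { "version" => { "number" => "8.11.1" } }.
def major_version(info)
  info.dig("version", "number").split(".").first.to_i
end

# Hypothetical helper: an explicit setting wins; "auto" picks search_after
# for major version 8 and above, and scroll otherwise.
def resolve_search_api(configured, info)
  return configured unless configured == "auto"
  major_version(info) >= 8 ? "search_after" : "scroll"
end

# Stand-in info hashes (assumed shapes, with made-up version numbers).
es7 = { "version" => { "number" => "7.17.9" } }
es8 = { "version" => { "number" => "8.11.1" } }

puts resolve_search_api("auto", es7)
puts resolve_search_api("auto", es8)
puts resolve_search_api("scroll", es8)
```

As in the diff, an explicit `scroll` setting is honored even on Elasticsearch 8; the plugin additionally logs a warning in that case, which this sketch leaves out.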
