Description
Context
In our project, we have to reprocess a large amount of Elasticsearch data in a single batch. The elasticsearch
input plugin might run for multiple hours to handle billions of events, but we are facing a problem: when a network issue occurs, the entire job is restarted, not just the last HTTP request.
I suspect this issue rarely occurs when Logstash is deployed in the same subnet as Elasticsearch, but we have a hybrid cloud configuration, which makes the plugin impossible to use in production as of now.
The plugin configuration:
input {
  elasticsearch {
    hosts => "https://<cluster_id>.francecentral.azure.elastic-cloud.com:443"
    api_key => "<api_key>"
    ssl_enabled => "true"
    index => "<index_name>"
    search_api => "search_after"
    retries => 5
    scroll => "5m"
    response_type => "hits"
    size => 1000
    query => '{"query":{"bool":{"filter":[ ... ]}},"sort":[ ... ]}'
  }
}
Log sample of the job restarting when a network error occurred:
[2024-11-21T09:29:03,645][DEBUG][logstash.inputs.elasticsearch.searchafter][main][9fee6d2baff37ecc70c364d3215ef2d3eab93bec9b08f68ed299cc50ed87e9b2] Query progress
[2024-11-21T09:29:03,806][DEBUG][logstash.inputs.elasticsearch.searchafter][main][9fee6d2baff37ecc70c364d3215ef2d3eab93bec9b08f68ed299cc50ed87e9b2] Query progress
[2024-11-21T09:29:03,815][WARN ][logstash.inputs.elasticsearch.searchafter][main][9fee6d2baff37ecc70c364d3215ef2d3eab93bec9b08f68ed299cc50ed87e9b2] Attempt to search_after paginated search but failed. Sleeping for 0.02 {:fail_count=>1, :exception=>"<cluster_id>.francecentral.azure.elastic-cloud.com:443 failed to respond"}
[2024-11-21T09:29:03,835][INFO ][logstash.inputs.elasticsearch.searchafter][main][9fee6d2baff37ecc70c364d3215ef2d3eab93bec9b08f68ed299cc50ed87e9b2] Query start
[2024-11-21T09:29:03,835][DEBUG][logstash.inputs.elasticsearch.searchafter][main][9fee6d2baff37ecc70c364d3215ef2d3eab93bec9b08f68ed299cc50ed87e9b2] Query progress
[2024-11-21T09:29:04,222][DEBUG][logstash.inputs.elasticsearch.searchafter][main][9fee6d2baff37ecc70c364d3215ef2d3eab93bec9b08f68ed299cc50ed87e9b2] Query progress
Feature proposal
First, for future documentation readers, it would be nice to improve the retries section of the documentation to explain that it applies at the "job" level, not the "HTTP request" level.
Then, adding a retry mechanism at the HTTP request level, with exponential backoff (or similar), would be a good option.
I had a quick look at the code base; I think we could add a wrapper around the next_page() function to handle network errors and implement the retries properly.
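A minimal sketch of what such a wrapper could look like (the method name, error classes, and parameters below are illustrative assumptions, not the plugin's actual API; the real next_page call would be passed in as the block):

```ruby
# Hypothetical sketch: retry a single page fetch with exponential backoff
# instead of restarting the whole search_after job on a network error.
def fetch_with_retry(max_retries: 5, base_delay: 0.1)
  attempt = 0
  begin
    yield # the per-request call, e.g. the plugin's next_page()
  rescue IOError, SocketError => e
    attempt += 1
    raise if attempt > max_retries # give up after max_retries attempts
    delay = base_delay * (2**(attempt - 1)) # 0.1s, 0.2s, 0.4s, ...
    warn "page fetch failed (#{e.message}), retry #{attempt}/#{max_retries} in #{delay}s"
    sleep(delay)
    retry
  end
end
```

With this shape, a transient "failed to respond" error would only re-issue the failed page request, while persistent failures still surface after the retry budget is exhausted.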
Contribution
If it helps, we would be happy to contribute and develop this feature.