retryable_errors detection issue for transient errors #4821

@ramizraza504

Describe the bug

I have been using a recent version of the Terragrunt CLI and found what looks like a bug in how Terragrunt detects retryable errors in the output logged during a Terraform run.

When the error is logged on a single line that matches the pattern, the retry works as expected.
However, when the error is logged across several lines, so that parts of the matching pattern land on different lines, Terragrunt does not identify it as a retryable error and skips the retry.
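
The difference described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a matcher testing the output one line at a time behaves like `matches_per_line` below; this is not confirmed to be how Terragrunt works internally. The sample log text is hypothetical. Terragrunt is written in Go, so its regex engine is not Python's `re`, but both treat the `(?s)` flag as letting `.` match newlines.

```python
import re

# Pattern taken from the retryable_errors list in this issue.
PATTERN = re.compile(r"(?s).*Failed to load state.*tcp.*timeout.*")

# Hypothetical log output: the same error on one line and spread over several.
SINGLE_LINE = "Error: Failed to load state: dial tcp 10.0.0.1:443: i/o timeout"
MULTI_LINE = (
    "Error: Failed to load state: request failed\n"
    "  caused by: dial tcp 10.0.0.1:443:\n"
    "  i/o timeout\n"
)


def matches_whole_output(output: str) -> bool:
    """Test the pattern against the complete output as one string."""
    return PATTERN.search(output) is not None


def matches_per_line(output: str) -> bool:
    """Test the pattern against each line separately (assumed failure mode)."""
    return any(PATTERN.search(line) for line in output.splitlines())


if __name__ == "__main__":
    for name, text in (("single-line", SINGLE_LINE), ("multi-line", MULTI_LINE)):
        print(name, matches_whole_output(text), matches_per_line(text))
```

In this sketch the multi-line sample is only recognized when the pattern is applied to the whole output, which mirrors the behavior reported here: the `(?s)` flag cannot help if the matcher never sees more than one line at a time.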

Because of this, some important runs end up in a failed state when they could easily have been retried and succeeded.

Steps To Reproduce

These are transient errors, so I am not sure how to reproduce the exact behavior of an error being logged across multiple lines. I can provide snippets from two different logs: one where the retry worked as expected because the error was logged on one line, and one where the run failed without a retry because the error was split across lines. Below is my retry configuration:

errors {
  retry "default" {
    retryable_errors = [
      "(?s).*Failed to load state.*tcp.*timeout.*",
      "(?s).*Failed to load backend.*TLS handshake timeout.*",
      "(?s).*Creating metric alarm failed.*request to update this alarm is in progress.*",
      "(?s).*Error installing provider.*TLS handshake timeout.*",
      "(?s).*Error configuring the backend.*TLS handshake timeout.*",
      "NoSuchBucket: The specified bucket does not exist",
      "(?s).*Error creating SSM parameter: TooManyUpdates:.*",
      "(?s).*app.terraform.io.*: 429 Too Many Requests.*",
      "(?s).*ssh_exchange_identification.*Connection closed by remote host.*",
      "(?s).*Could not download module.*The requested URL returned error: 429.*",
      "(?s).*ssh: connect to host.*port 22: Connection timed out.*",
      # Connection reset or TCP errors
      "(?s).*Error installing provider.*tcp.*timeout.*",
      "(?s).*Error installing provider.*tcp.*connection reset by peer.*",
      "(?s).*Failure responding to request.*tcp.*connection reset by peer.*",
      "(?s).*dial tcp.*connection refused.*",
      "(?s).*dial tcp.*i/o timeout.*",
      "(?s).*context deadline exceeded.*",
      # Cloud API rate limiting errors
      "(?s).*Rate exceeded.*",
      "(?s).*Quota exceeded.*",
      "(?s).*maximum number of requests.*",
      # HTTP service errors (500s, 502s, 503s)
      "(?s).*500 Internal Server Error.*",
      "(?s).*503 Service Unavailable.*",
      "(?s).*502 Bad Gateway.*",
      # Timeout errors
      "(?s).*Timeout waiting for server response.*",
      "(?s).*connection timed out.*",
      "(?s).*Client\\.Timeout exceeded while awaiting headers.*",
      # DNS resolution errors
      "(?s).*lookup.*: no such host.*",
      # Cloud-specific instance unavailability or capacity issues
      "(?s).*Insufficient capacity to fulfill request.*",
      "(?s).*Instance not found.*",
      "(?s).*Resource temporarily unavailable.*",
      # Common errors
      "(?s).*unexpected end of JSON input.*",
    ]
    max_attempts = 3
    sleep_interval_sec = 5
  }
} 
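
One detail of the patterns above may be worth checking separately: the `(?s)` flag only changes what `.` matches, so a literal phrase such as `TLS handshake timeout` still requires single spaces between its words. The sketch below uses a hypothetical log excerpt in which the line break falls inside that phrase, and compares the original pattern with a hypothetical variant that uses `\s+` between the words. It is written in Python for illustration; whether the real output ever wraps this way is an assumption, not something the logs in this issue confirm.

```python
import re

# Pattern from the retryable_errors list: literal spaces inside the phrase.
ORIGINAL = re.compile(r"(?s).*Failed to load backend.*TLS handshake timeout.*")

# Hypothetical variant: \s+ accepts spaces or newlines between the words.
TOLERANT = re.compile(r"(?s).*Failed to load backend.*TLS\s+handshake\s+timeout.*")

# Hypothetical output, once unwrapped and once wrapped inside the phrase.
UNWRAPPED = "Error: Failed to load backend: net/http: TLS handshake timeout\n"
WRAPPED = (
    "Error: Failed to load backend: net/http: TLS handshake\n"
    "timeout\n"
)


def is_retryable(pattern: re.Pattern, output: str) -> bool:
    """Return True if the pattern matches anywhere in the full output."""
    return pattern.search(output) is not None


if __name__ == "__main__":
    for name, text in (("unwrapped", UNWRAPPED), ("wrapped", WRAPPED)):
        print(name, is_retryable(ORIGINAL, text), is_retryable(TOLERANT, text))
```

If such a variant were used in the HCL list, the backslash would need to be doubled (`\\s+`), as in the existing `Client\\.Timeout` entry.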

Expected behavior

Retryable-error detection should match patterns against errors that span multiple lines, not only errors logged on a single line.

Detection should not rely on the whole "Error:" message being on a single line, since the message can be split across several lines.

Nice to haves

  • Terminal output
  • Screenshots
      • Error that did not retry because the matching pattern was on a different line: [Image]
      • Error that retried correctly because the matching pattern was on a single line: [Image]

Versions

  • Terragrunt version: v0.76.0 and v0.85.0
  • OpenTofu/Terraform version: 1.12
  • Environment details (Ubuntu 20.04, Windows 10, etc.): Ubuntu 22

Labels: bug
