Skip to content

Add alerts for low available swap space #1075

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 31, 2024
Merged

Conversation

seunghun1ee
Copy link
Member

@seunghun1ee seunghun1ee commented May 16, 2024

As exhausting memory on hypervisor causes OOM kills, alerts for low available swap space are added.
Storage nodes usually utilise all of the swap spaces, so this alert can be spammed with them.
Therefore, silencing rules for storage nodes are going to be necessary.
We can implement per group alerting by overriding prometheus.yml.j2 from Kolla-ansible to avoid silencing approach but that needs separate discussion.

@seunghun1ee seunghun1ee requested a review from priteau May 16, 2024 11:17
@seunghun1ee seunghun1ee self-assigned this May 16, 2024
@seunghun1ee seunghun1ee requested a review from a team as a code owner May 16, 2024 11:17
Copy link
Member

@dougszumski dougszumski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @seunghun1ee, good job.

Please could you add a release note, and mention that we currently have a one-size fits all policy as an interim solution. We can advise that operators may need to tune the threshold for the alerts.

@dougszumski
Copy link
Member

Renders fine:
image

@seunghun1ee seunghun1ee force-pushed the monitor-swap-usage branch from 50fc3d1 to a7562bf Compare May 21, 2024 09:53
@seunghun1ee seunghun1ee requested a review from dougszumski May 21, 2024 09:53
@seunghun1ee
Copy link
Member Author

seunghun1ee commented May 21, 2024

Thanks @seunghun1ee, good job.

Please could you add a release note, and mention that we currently have a one-size fits all policy as an interim solution. We can advise that operators may need to tune the threshold for the alerts.

Thank you @dougszumski
I added release note.

@seunghun1ee seunghun1ee force-pushed the monitor-swap-usage branch from a7562bf to 643aa78 Compare May 21, 2024 10:17
Copy link
Member

@dougszumski dougszumski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks @seunghun1ee

@seunghun1ee seunghun1ee merged commit f23d52c into stackhpc/yoga May 31, 2024
15 checks passed
@seunghun1ee seunghun1ee deleted the monitor-swap-usage branch May 31, 2024 09:00
@@ -24,6 +24,24 @@ groups:
summary: "Prometheus exporter at {{ $labels.instance }} reports low memory"
description: "Available memory is {{ $value }} GiB."

- alert: LowSwapSpace
expr: (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) < {% endraw %}{{ alertmanager_node_free_swap_warning_threshold_ratio }}{% raw %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when there is no swap, i.e. node_memory_SwapTotal_bytes is 0?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good point. It seems like either adding conditional templating or updating expr for these are needed. @markgoddard How should I do it? Do you want me to revert this merge and re-do the PR?

Copy link
Contributor

@markgoddard markgoddard Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on how long it will take for you to test it out and provide a fix. If it can be done quickly then no need to revert.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case I think we should revert this change. I'm working on customer's system this week.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The divide by zero evaluates to NaN, and the alert appears as 'OK'. Context: https://www.robustperception.io/get-thee-to-a-nannary/

Copy link
Member

@dougszumski dougszumski Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example from a no swap test env:
image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think this alert is fine as it is, but a good question to ask none-the-less.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants