Add alerts for low available swap space #1075
Thanks @seunghun1ee, good job.
Please could you add a release note, and mention that we currently have a one-size-fits-all policy as an interim solution? We can advise that operators may need to tune the threshold for the alerts.
50fc3d1 to a7562bf
Thank you @dougszumski
a7562bf to 643aa78
Many thanks @seunghun1ee
@@ -24,6 +24,24 @@ groups:
      summary: "Prometheus exporter at {{ $labels.instance }} reports low memory"
      description: "Available memory is {{ $value }} GiB."

  - alert: LowSwapSpace
    expr: (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) < {% endraw %}{{ alertmanager_node_free_swap_warning_threshold_ratio }}{% raw %}
What happens when there is no swap, i.e. node_memory_SwapTotal_bytes is 0?
That's a good point. It seems we need either conditional templating or an updated expr to handle this. @markgoddard How should I do it? Do you want me to revert this merge and redo the PR?
It depends on how long it will take for you to test it out and provide a fix. If it can be done quickly then no need to revert.
In that case I think we should revert this change. I'm working on a customer's system this week.
The divide-by-zero evaluates to NaN, so the comparison never matches and the alert appears as 'OK'. Context: https://www.robustperception.io/get-thee-to-a-nannary/
So I think this alert is fine as it is, but a good question to ask nonetheless.
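For anyone following the NaN point, here is a minimal Python sketch of why a host without swap does not trigger the rule. It only mimics IEEE 754 float semantics (0/0 is NaN, and any ordered comparison with NaN is false); it is not how Prometheus evaluates rules internally. The function names and the 0.25 threshold are hypothetical, chosen only for illustration.

```python
import math


def swap_free_ratio(swap_free_bytes: float, swap_total_bytes: float) -> float:
    """Divide like an IEEE 754 float would: 0/0 gives NaN instead of raising."""
    if swap_total_bytes == 0:
        if swap_free_bytes == 0:
            return math.nan  # no swap configured: both metrics are 0
        return math.copysign(math.inf, swap_free_bytes)
    return swap_free_bytes / swap_total_bytes


def low_swap_alert_fires(swap_free_bytes: float,
                         swap_total_bytes: float,
                         threshold_ratio: float) -> bool:
    # Any ordered comparison against NaN is false, so a host
    # without swap never satisfies "ratio < threshold".
    return swap_free_ratio(swap_free_bytes, swap_total_bytes) < threshold_ratio


THRESHOLD = 0.25  # hypothetical value, not the default from this PR

print(low_swap_alert_fires(0, 0, THRESHOLD))  # host with no swap
print(low_swap_alert_fires(1, 8, THRESHOLD))  # 12.5% of swap free
print(low_swap_alert_fires(6, 8, THRESHOLD))  # 75% of swap free
```

Only the second call reports a firing alert; the no-swap host compares as false, which matches the 'OK' state described above.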
Exhausting memory on a hypervisor causes OOM kills, so this change adds alerts for low available swap space.
Storage nodes usually use all of their swap space, so this alert may fire repeatedly for them.
Silencing rules for storage nodes will therefore be necessary.
We could implement per-group alerting by overriding prometheus.yml.j2 from Kolla-Ansible to avoid the silencing approach, but that needs a separate discussion.
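As a rough illustration of the silencing approach, the sketch below builds a silence body in the shape accepted by the Alertmanager v2 API (`POST /api/v2/silences`). The helper name, the `instance` regex for storage nodes, and the duration are assumptions for illustration only; real deployments may identify storage nodes by a different label.

```python
from datetime import datetime, timedelta, timezone


def build_storage_silence(instance_regex, hours, created_by, now=None):
    """Build a silence body for LowSwapSpace on hosts matching instance_regex.

    Hypothetical helper: the label used to pick out storage nodes is an assumption.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "matchers": [
            # Only silence the swap alert, not every alert on the host.
            {"name": "alertname", "value": "LowSwapSpace",
             "isRegex": False, "isEqual": True},
            # Assumed convention: storage nodes share an instance name pattern.
            {"name": "instance", "value": instance_regex,
             "isRegex": True, "isEqual": True},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": created_by,
        "comment": "Storage nodes are expected to use most of their swap",
    }


payload = build_storage_silence("storage-.*", hours=24, created_by="operator")
print(payload["matchers"])
```

The resulting dictionary would still need to be serialised to JSON and sent to Alertmanager. Silences expire, so this is only a stopgap compared with per-group alerting.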