[change] Optimize the logic that catches flapping metrics #667

Open
pandafy opened this issue May 19, 2025 · 0 comments
Labels
enhancement New feature or request

pandafy commented May 19, 2025

Description:
To reduce noise, OpenWISP Monitoring avoids sending alerts for metrics that flap—i.e., alternate between healthy and unhealthy states—within a defined tolerance period.

Currently, this is implemented by loading all data points for the metric from the time series database within the threshold window and iterating over them to detect flapping. However, this approach does not scale well when the threshold window spans several hours or days, especially for high-frequency metrics or large deployments.

    # retrieves latest measurements, ordered by most recent first
    points = self.metric.read(
        since=timezone.now() - timedelta(minutes=self._tolerance_search_range),
        limit=None,
        order='-time',
        retention_policy=retention_policy,
        extra_fields=extra_fields,
    )
    # store a list with the results
    results = [value_crossed]
    # loop on each measurement starting from the most recent
    for i, point in enumerate(points, 1):
        # skip the first point because it was just added before this
        # check started and its value coincides with ``current_value``
        if i <= 1:
            continue
        utc_time = utc.localize(datetime.utcfromtimestamp(point['time']))
        # did this point cross the threshold? Append to result list
        results.append(self._value_crossed(point[self.metric.alert_field]))
        # tolerance is trespassed
        if self._time_crossed(utc_time):
            # if the latest results are consistent, the metric being
            # monitored is not flapping and we can confidently return
            # whether the value crosses the threshold or not
            if len(set(results)) == 1:
                return value_crossed
            # otherwise, the results are flapping, the situation has not changed
            # we will return a value that will not trigger changes
            return not self.metric.is_healthy_tolerant
        # otherwise keep looking back
        continue
    # the search has not yielded any conclusion
    # return result based on the current value and time
    time = timezone.now()
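
To make the behaviour of the excerpt above easier to reason about (and to compare against any optimized version), here is a minimal standalone model of the loop. It is a sketch, not the project's code: `check_flapping`, its parameters, the `(time, value)` tuple layout and the `>` comparison are all hypothetical, and it assumes that `_time_crossed` means "older than the start of the tolerance window". The case where the search is inconclusive is returned as `None`, because the original excerpt continues past that point.

```python
from datetime import datetime, timedelta, timezone


def check_flapping(points, threshold, tolerance_minutes, is_healthy_tolerant, now):
    """Simplified, hypothetical model of the loop in the excerpt.

    points: list of (time, value) tuples, most recent first; points[0] is
    the measurement that triggered the check.
    Returns True/False like the two early returns in the excerpt, or None
    when no point older than the tolerance window was found.
    """

    def crossed(value):
        # assumption: the threshold uses a ">" operator
        return value > threshold

    tolerance_start = now - timedelta(minutes=tolerance_minutes)
    current_crossed = crossed(points[0][1])
    results = [current_crossed]
    # skip the first point: it is the current value
    for time, value in points[1:]:
        results.append(crossed(value))
        # assumption: "time crossed" means older than the tolerance window
        if time < tolerance_start:
            if len(set(results)) == 1:
                # consistent results: not flapping
                return current_crossed
            # flapping: return a value that keeps the current state
            return not is_healthy_tolerant
    return None


now = datetime(2025, 5, 19, 12, 0, tzinfo=timezone.utc)


def minutes_ago(m):
    return now - timedelta(minutes=m)


steady = [(minutes_ago(0), 95), (minutes_ago(5), 96), (minutes_ago(11), 97)]
flapping = [(minutes_ago(0), 95), (minutes_ago(5), 50), (minutes_ago(11), 97)]
too_short = [(minutes_ago(0), 95), (minutes_ago(5), 96)]

steady_result = check_flapping(steady, 90, 10, True, now)
flapping_result = check_flapping(flapping, 90, 10, True, now)
short_result = check_flapping(too_short, 90, 10, True, now)
```

Note that every point between the current one and the first point older than the tolerance window is visited, which is the cost this issue is about.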

Problem:

  • Poor performance for large time windows, because every data point in the window is fetched and iterated over.
  • Potentially high memory usage and long processing time when analyzing high-volume data.

Expected Behavior:
Optimize the flapping detection logic so that it works efficiently for long threshold windows without loading all data points into memory. Where possible, the detection should be pushed down into database queries.
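
One possible direction, sketched under stated assumptions: for a simple `>` threshold, the window is consistently crossed when the minimum value in it exceeds the threshold, consistently not crossed when the maximum does not, and flapping otherwise. A single aggregate query can answer that without returning individual points. The example below uses an in-memory SQLite table purely as a stand-in for the time series database; the `points` table, its columns and `classify_window` are hypothetical, and the real backend would need equivalent aggregate queries in its own query language.

```python
import sqlite3


def classify_window(conn, metric_id, since, threshold):
    """Classify a time window with one aggregate query.

    Returns 'crossed', 'not_crossed', 'flapping' or 'no_data'.
    Assumes a ">" threshold operator and integer timestamps.
    """
    count, lowest, highest = conn.execute(
        "SELECT COUNT(*), MIN(value), MAX(value) FROM points "
        "WHERE metric_id = ? AND time >= ?",
        (metric_id, since),
    ).fetchone()
    if count == 0:
        return "no_data"
    if lowest > threshold:
        # every point in the window is above the threshold
        return "crossed"
    if highest <= threshold:
        # no point in the window is above the threshold
        return "not_crossed"
    # some points are above and some are not
    return "flapping"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (metric_id INTEGER, time INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?)",
    [
        (1, 100, 95.0), (1, 200, 97.0), (1, 300, 99.0),  # steadily above 90
        (2, 100, 95.0), (2, 200, 50.0), (2, 300, 96.0),  # alternates around 90
        (3, 100, 10.0), (3, 200, 20.0),                  # steadily below 90
    ],
)

print(classify_window(conn, 1, 0, 90))
print(classify_window(conn, 2, 0, 90))
```

This sketch does not reproduce every detail of the current loop (for example, the current code also looks at the first point older than the tolerance window), so the window bounds would need to be chosen carefully to keep the same alerting behaviour.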

@pandafy pandafy added the enhancement New feature or request label May 19, 2025
@pandafy pandafy moved this from To do (general) to To do (Python & Django) in OpenWISP Contributor's Board May 19, 2025
@nemesifier nemesifier changed the title [feature] Optimize the logic that catches flapping metrics [change] Optimize the logic that catches flapping metrics May 19, 2025