[change] Optimize the logic that catches flapping metrics #667

pandafy · 2025-05-19T17:13:52Z

Description:
To reduce noise, OpenWISP Monitoring avoids sending alerts for metrics that flap—i.e., alternate between healthy and unhealthy states—within a defined tolerance period.

Currently, this is implemented by loading all data points for the metric from the time series database within the threshold window and iterating over them to detect flapping. However, this approach does not scale well when the threshold window spans several hours or days, especially for high-frequency metrics or large deployments.

openwisp-monitoring/openwisp_monitoring/monitoring/base/models.py

Lines 988 to 1021 in a9993c7

    
    # retrieves latest measurements, ordered by most recent first
    points = self.metric.read(
        since=timezone.now() - timedelta(minutes=self._tolerance_search_range),
        limit=None,
        order='-time',
        retention_policy=retention_policy,
        extra_fields=extra_fields,
    )
    # store a list with the results
    results = [value_crossed]
    # loop on each measurement starting from the most recent
    for i, point in enumerate(points, 1):
        # skip the first point because it was just added before this
        # check started and its value coincides with ``current_value``
        if i <= 1:
            continue
        utc_time = utc.localize(datetime.utcfromtimestamp(point['time']))
        # did this point cross the threshold? Append to result list
        results.append(self._value_crossed(point[self.metric.alert_field]))
        # tolerance is trespassed
        if self._time_crossed(utc_time):
            # if the latest results are consistent, the metric being
            # monitored is not flapping and we can confidently return
            # whether the value crosses the threshold or not
            if len(set(results)) == 1:
                return value_crossed
            # otherwise, the results are flapping, the situation has not changed
            # we will return a value that will not trigger changes
            return not self.metric.is_healthy_tolerant
        # otherwise keep looking back
        continue
    # the search has not yielded any conclusion
    # return result based on the current value and time
    time = timezone.now()

Problem:

- Poor performance with large time windows.
- Potentially high memory usage and long processing time when analyzing high-volume data.

Expected Behavior:
Optimize the flapping detection logic so that it works efficiently for long threshold windows without loading all data points into memory. We should try to push this work into database queries.
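One possible direction, sketched here only as an illustration: instead of fetching every point, ask the database for the minimum and maximum value in the window and compare those two numbers with the threshold. If both fall on the same side, the window is consistent; if they fall on opposite sides, the metric flapped. In this sketch SQLite stands in for the time series database so that the example runs on its own, and the `points` table, the `window_state` function and its return labels are all hypothetical names, not part of OpenWISP Monitoring. It also ignores details of the current code, such as skipping the most recent point and requiring a point older than the tolerance.

```python
import sqlite3


def window_state(conn, metric_id, since, threshold, operator=">"):
    """Classify a time window using aggregates only (hypothetical helper).

    Returns "crossed" if every point crosses the threshold, "ok" if none
    does, "flapping" if the window is mixed, and "empty" if it has no points.
    """
    lowest, highest, count = conn.execute(
        "SELECT MIN(value), MAX(value), COUNT(*) FROM points "
        "WHERE metric_id = ? AND time >= ?",
        (metric_id, since),
    ).fetchone()
    if count == 0:
        return "empty"
    if operator == ">":
        all_crossed = lowest > threshold
        none_crossed = highest <= threshold
    elif operator == "<":
        all_crossed = highest < threshold
        none_crossed = lowest >= threshold
    else:
        raise ValueError(f"unsupported operator: {operator!r}")
    if all_crossed:
        return "crossed"
    if none_crossed:
        return "ok"
    # min and max sit on opposite sides of the threshold
    return "flapping"


# Stand-in data set: the schema below is an assumption made for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (metric_id INTEGER, time INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?, ?)",
    [
        (1, 100, 95.0), (1, 160, 97.0), (1, 220, 99.0),  # always above 90
        (2, 100, 95.0), (2, 160, 40.0), (2, 220, 99.0),  # dips below 90 once
    ],
)
print(window_state(conn, 1, since=100, threshold=90))
print(window_state(conn, 2, since=100, threshold=90))
```

Only three numbers come back from the database per check, regardless of how many points the window holds. Whether the real time series backend can evaluate an equivalent aggregate query cheaply over long windows would need to be verified.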

pandafy added the enhancement New feature or request label May 19, 2025

pandafy added this to OpenWISP Contributor's Board May 19, 2025

github-project-automation bot moved this to To do (general) in OpenWISP Contributor's Board May 19, 2025

pandafy moved this from To do (general) to To do (Python & Django) in OpenWISP Contributor's Board May 19, 2025

nemesifier assigned nemesifier and pandafy May 19, 2025

nemesifier changed the title ~~[feature] Optimize the logic that catches flapping metrics~~ [change] Optimize the logic that catches flapping metrics May 19, 2025
