-
-
Notifications
You must be signed in to change notification settings - Fork 155
[feature] Add a check which inspects device configuration status periodically #123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c304e4d to
dabafcc
Compare
|
QUERIES:
|
0094d9e to
58539d6
Compare
Check out the README file. According to the current version, a status will be critical when the metric is defined in |
58539d6 to
fc42293
Compare
c3a1a23 to
d3ac3f4
Compare
| ) | ||
| # Input in days | ||
| CONFIG_MODIFIED_RETENTION_POLICY = getattr( | ||
| settings, 'OPENWISP_MONITORING_CONFIG_STATUS_RETENTION_POLICY', '48h0m0s' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please follow the standard we use for setting names: the user defines OPENWISP_MONITORING_X_Y_Z but we use X_Y_Z internally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can there be an exception here. I have named the setting as OPENWISP_MONITORING_DEVICE_CONFIGURATION_X_Y_Z but internally the settings are named as CONFIG_X_Y_Z. The reason was to explicitly say to newcomers that the setting is related to DEVICE_CONFIGURATION (there's also chart config and soon there shall be metric_config) but internally i.e. in the check settings we know that it means device config hence it's shortened. Let me know if it seems right?
7ab69dd to
c7ffc08
Compare
|
@nemesisdesign, I realized it now that there is no migration for |
1047e03 to
ba31564
Compare
|
|
||
| @shared_task | ||
| def auto_create_ping(model, app_label, object_id, created): | ||
| def auto_create_ping(model, app_label, object_id): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's safe to remove the the keyword argument created , since it's not being used at all.
63e1612 to
d6203a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good progress.
I have not tested this yet, I'm only doing code-level review, see my comments.
@nemesisdesign, I realized it now that there is no migration for
ping,config_modifiedchecks in thesample_check. I don't think there's a need since the module is not yet released (and the migrations were meant for backward compatibilty). Can you please confirm this?
No data migration needed for sample app.
PS: rebase on the current master
README.rst
Outdated
|
|
||
| Whether ping checks are created automatically for devices. | ||
|
|
||
| ``OPENWISP_MONITORING_AUTO_DEVICE_CONFIGURATION_MODIFIED`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about OPENWISP_MONITORING_AUTO_DEVICE_CONFIG_CHECK?
README.rst
Outdated
| +--------------+-----------+ | ||
|
|
||
| This setting allows you to configure the time that the check is allowed to | ||
| fail after which the device health status changes to ``problem``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this setting from its explanation, we should also find a shorter name, eg OPENWISP_MONITORING_DEVICE_CONFIG_CHECK_MAX_TIME
README.rst
Outdated
| **Note**: If the setting ``OPENWISP_MONITORING_AUTO_DEVICE_CONFIGURATION_MODIFIED`` | ||
| is disabled then this setting need not be declared. | ||
|
|
||
| ``OPENWISP_MONITORING_DEVICE_CONFIGURATION_MODIFIED_RETENTION_POLICY`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
too long.. OPENWISP_MONITORING_DEVICE_CONFIG_CHECK_RP?
openwisp_monitoring/check/apps.py
Outdated
| verbose_name = _('Network Monitoring Checks') | ||
|
|
||
| def ready(self): | ||
| manage_config_modified_retention_policy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how does this play with the abstraction? Can you make it play well with the changes introduced in #103?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
manage_config_modified_retention_policy() makes use of timeseries_db.get_list_retention_policies after rebasing which has been defined by us so it's abstract enough, I think.
154bae6 to
6e5a612
Compare
6e5a612 to
6bf784b
Compare
6bf784b to
6f14ba3
Compare
README.rst
Outdated
| ``Configuration Modified`` | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| It checks the configuration status of the device periodically and changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about this rewording
"This will periodically check the configuration present in the device and will change the device health status to PROBLEM if the configuration is not in sync with the expected by OpenWISP. The time until this check changes the health status can be configured using the setting OPENWISP_MONITORING_DEVICE_CONFIG_CHECK_MAX_TIME"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds a bit technical.
README.rst
Outdated
|
|
||
| ``config_modified`` check changes the device health status to ``PROBLEM`` if it's | ||
| configuration status has been in the modified state for more than the time defined in | ||
| ``OPENWISP_MONITORING_AUTO_DEVICE_CONFIG_CHECK``. It's 5 minutes by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Defines the amount of time that the device configuration can be different from what is expected until its health status is changed to PROBLEM"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defines the maximum amount of minutes the device configuration can stay in the *modified* state before its health status is changed to PROBLEM and an alert is sent.
6f14ba3 to
2a4ce3a
Compare
@nemesisdesign, I am slightly confused. You are saying it should be 5 min and 600s at the same time 🤔. |
c5b4ad4 to
8a868fc
Compare
openwisp_monitoring/check/migrations/0004_create_config_applied.py
Outdated
Show resolved
Hide resolved
openwisp_monitoring/check/migrations/0004_create_config_applied.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still work to do, I'm disappointed to be highlighting the same silly mistakes 👎
Also, the check doesn't seem to work to me if the tolerance is not zero seconds, I'll do more tests and let you know more details if I can. UPDATE: I confirm this, here's how I test it:
- run dev server, run worker, run celery beat
- set the config of a device to
modified - wait more than 5 minutes, no alert is ever generated
- if I set
secondsto0in the alert settings, the alert is generated
openwisp_monitoring/check/migrations/0004_create_config_applied.py
Outdated
Show resolved
Hide resolved
| self.assertIn('Ping', c.check) | ||
| self.assertEqual(Check.objects.count(), 2) | ||
| with self.subTest('Test AUTO_PING'): | ||
| c1 = Check.objects.get(name='Ping') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't query by name, use the check field, please correct the same for any other occurrence of this everywhere!
Do not query by fields that can be changed by users, do not use get() for querying objects that can return multiple items!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do not use get() for querying objects that can return multiple items!
I expect one Ping and one ConfigApplied check per device. Since, the test had only one device, I thought using get was safe enough. Though yes, querying by name (which can be changed) is bad practice on my side :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get will raise an error if multiple objects are returned, but it's better to have failures rather than errors since it's usually faster to understand what's wrong
5 minutes, sorry, I mistyped. |
| with self.subTest('Test AUTO_CONFIG_CHECK'): | ||
| c2 = Check.objects.get(name='Configuration Applied') | ||
| self.assertEqual(c2.content_object, d) | ||
| self.assertIn('ConfigApplied', c2.check) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't you use equal and you compare it to self._CONFIG_APPLIED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test was already written for Ping check. So, I have just adapted it for ConfigApplied checks. Made the change.
@nemesisdesign, no it's not a bug. You had asked to cover this use case earlier 😕, I replied back with a solution #123 (comment). Awaited your response, pinged you two days later #123 (comment) and on IM. Still no response, maybe you were slightly busy back then. Hence, went ahead with the approach and informed nevertheless #123 (comment). As you had requested, the The zero second case I think is an edge case where device goes offline. Note: There is one specific test ( |
8a868fc to
5ba3b99
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read again what was written previously here: #123 (review), it is correct, but that's not what is happening here.
Let me explain again. Please test it in the same way.
- run dev server, run worker, run celery beat
- ensure you have a device with health status OK and config status set to applied
- set the config of a device to modified
- wait more than 5 minutes, no alert is ever generated, but it should
- if I set seconds to 0 in the alert settings, the alert is generated, why this difference?
I dug deeper with ipdb, I found out that when previous measurements are retrieved, no point is found.
Same if I use the influxdb shell:
> select * from config_applied;
>
The situation changes if I remove retention_policy=SHORT_RP on the write method call:
> select * from config_applied;
name: config_applied
time config_applied content_type object_id
---- -------------- ------------ ---------
1596134001603513262 0 config.device cc5c1d61-8e0b-4f31-9ab4-70cf6bb0c55b
1596134078032234031 0 config.device cc5c1d61-8e0b-4f31-9ab4-70cf6bb0c55b
1596134379375040639 0 config.device cc5c1d61-8e0b-4f31-9ab4-70cf6bb0c55b
Please repeat the same tests and let me know:
- if you can replicate the same issue successfully
- if yes, why it is happening? Why writing with the short retention policy doesn't write anything?
- if yes, write a failing test for it
- if yes, how you fixed it
You have to do this kind of testing on your own, without the need of me requesting you to do it.
It must become the standard way of how you work.
|
Mentioning so if anyone faces same issue in future, then can avoid the trouble, hopefully.
Yes
When a retention policy is used besides
Added
Added In [7]: m.read(retention_policy='short')
Out[7]:
[{'time': 1596102060, 'config_applied': 1},
{'time': 1596102120, 'config_applied': 1},
...Personally, I don't like this approach since user will need to specify the The alert didn't came immediately after 5 minutes (took roughly 6-7 minutes, I think) but yes tested twice and added |
nemesifier
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great progress, it finally works 👍
I still think we have dangerous code in our software and I'd like to get it resolved since we're already here and have our mind on it.
[...]
3. if yes, write a failing test for itAdded
openwisp_monitoring.db.backends.influxdb.tests.TestDatabaseClient.test_read_with_rp
I prefer to ensure the actual functional test of the check fails (it would have helped you to spot this without me being involved), I explained in one comment below how to do it.
4. if yes, how you fixed itAdded
retention_policyas a parameter inmetric.read()andtimeseries_db.read().
Ok, I see you added support for this in the tsdb backend, but the short retention policy was already being used to retrieve the device data, and that code has not been updated. Can you update that bit of code as well to make it use the new way so we can probably simplify the code? I'm referring to these:
| q = device_data_query.format(SHORT_RP, self.__key, self.pk) |
| device_data_query = ( |
Personally, I don't like this approach since user will need to specify the
retention_policyeverytime in read / write. Maybe we can refactorretention_policyto be a field inMetricmodel (later on).
Doesn't worry me, there are other things a lot more important that we need to work on.
b07ac63 to
eab6145
Compare
@nemesisdesign, as you have correctly pointed it out, openwisp-monitoring/openwisp_monitoring/device/base/models.py Lines 100 to 102 in 9c53057
also abstraction was not complete as device_data_query is a dict in case of elasticsearch. I think if needed, we can simplify it there 😃.
Ok, thanks freezegun works great. Currently, implemented only for this test, will do for all others in the suite. Though we will need to run checks twice in order to validate alerts after this. The reason being To get it working, I had to alter tsdb backend timestamp = kwargs.get('timestamp') or now() # <--- now() is extra here
if isinstance(timestamp, datetime):
timestamp = timestamp.strftime('%Y-%m-%dT%H:%M:%SZ')
point['time'] = timestamp |
eab6145 to
4f87998
Compare
Check this build (https://travis-ci.org/github/openwisp/openwisp-monitoring/jobs/713662695#L1163) to know the failure. Found it, it was related to timestamp = kwargs.get('timestamp') or now() # now() is added here
if isinstance(timestamp, datetime):
timestamp = timestamp.strftime('%Y-%m-%dT%H:%M:%SZ') # This was the reason, seconds precision
point['time'] = timestamp # Passing timestamp was optional earlierI will try to add a failing test for this. |
nemesifier
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds all good @nepython, let me know when ready for review.
9a6ed40 to
ec5e68b
Compare
…odically #54 [influxdb] Add support for retention_policy in read
ec5e68b to
0041002
Compare

Closes #54