Commit Graph

17 Commits

Author SHA1 Message Date
Tejaswini Chadaga
42597806a9
[201911][multi-asic] Monit changes to enable internal link monitoring script (#16393)
Monit changes to enable script to monitor SAI_PORT_STAT_IF_IN_ERRORS & SAI_PORT_STAT_IF_OUT_ERRORS on internal (backend) ports of multi-asic device.
2023-09-12 15:57:13 -07:00
Renuka Manavalan
eda84d2209
Invoke disk check periodically (#7374)
Helps with periodic scan of disk for RO state.
If found, this script makes transient fix and raise error message.
2021-11-19 16:45:21 -08:00
yozhao101
24e1cde1e6
[201911][Monit] Restart telemetry container if memory usage is beyond the threshold (#7618)
This PR aims to monitor the memory usage of streaming telemetry container and restart streaming telemetry container if memory usage is larger than the pre-defined threshold.
2021-05-17 16:51:13 -07:00
yozhao101
a8d2d0b5cd
[201911][Monit] Monitor critical processes in PMon contianer. (#7438)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
This PR aims to monitor the critical processes in PMon container by Monit in 201911 branch.

How I did it
I created a template configuration file of Monit and it will be rendered to generate Monit configuration file of PMon container
by a service generate_monit_config.service.

How to verify it
I verified this on a Mellanox device str-msn2700-03 and an Arista device str-a7050-acs-1.

Which release branch to backport (provide reason below if selected)
 201811
[x ] 201911
 202006
 202012
2021-04-28 17:12:21 -07:00
yozhao101
528543bc6a
[201911][Monit] Monitor critical processes in radv and dhcp_relay containers. (#7340)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
This PR aims to monitor critical processes in router advertiser and dhcp_relay containers by Monit.

How I did it
Router advertiser container only ran on T0 device and the T0 device should have at least one VLAN interface
which was configured an IPv6 address. At the same time, router advertiser container will not run on devices of which
the deployment type is 8.

As such, I created a service which will dynamically generate Monit configuration file of router advertiser from a
template.

Similarly Monit configuration file of dhcp_relay was also generated from a template since the number of dhcrelay process in dhcp_relay container is depended on number of VLANs.

How to verify it
I verified this implementation on a DuT.
2021-04-16 08:40:06 -07:00
Volodymyr Samotiy
fd22b3bcee
[monit] Periodically monitor VNET route consistency (#7078)
To run VNET route consistency check periodically.

For any failure, the monit will raise alert based on return code.
The tool will log required details.
2021-03-25 07:24:59 -07:00
Qi Luo
32e3cd9454
Revert "[monit] Periodically monitor VNET route consistency (#6819)" (#6975)
This reverts commit 2c6be7e0f5.
Reverts #6819
2021-03-06 06:56:26 -08:00
Volodymyr Samotiy
2c6be7e0f5
[monit] Periodically monitor VNET route consistency (#6819)
To run VNET route consistency check periodically.

For any failure, the monit will raise alert based on return code.
The tool will log required details.
2021-03-05 13:15:19 -08:00
abdosi
3a24e7f31f [multi-asic] Enhancing monit process checker for multi-asic. (#6100)
Added Support of process checker for work on multi-asic platforms.
2020-12-04 13:17:35 -08:00
abdosi
0fad6bdc7f [monit] Adding patch to enhance syslog error message generation for monit alert action when status is failed. (#5720)
Why/How I did:

Make sure first error syslog is triggered based on FAULT TOLERANCE condition.

Added support of repeat clause with alert action. This is used as trigger
for generation of periodic syslog error messages if error is persistent

Updated the monit conf files with repeat every x cycles for the alert action
2020-11-01 10:27:10 -08:00
bingwang-ms
7a015eacc1 Fix 'NoSuchProcess' exception in process_checker (#5716)
The psutil library used in process_checker create a cache for each
process when calling process_iter. So, there is some possibility that
one process exists when calling process_iter, but not exists when
calling cmdline, which will raise a NoSuchProcess exception. This commit
fix the issue.

Signed-off-by: bingwang <bingwang@microsoft.com>
2020-10-30 08:56:10 -07:00
yozhao101
7580c846ad
[201911][Monit] Unmonitor processes in disabled containers (#5462)
We want to let Monit to unmonitor the processes in containers which are disabled in `FEATURE` table such that
Monit will not generate false alerting messages into the syslog.

- Backport of https://github.com/Azure/sonic-buildimage/pull/5153 to the 201911 branch

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
2020-09-25 00:30:41 -07:00
Renuka Manavalan
7d6e5083ce [monit] Periodically monitor route consistency (#5085)
* Add route_check to mont.

* Switched to units of cycles per comments

* Added comments per Joe's comments.

* Added more comments per Royal's comments.
2020-09-19 15:47:53 -07:00
yozhao101
358570324b [Monit] Delay start of monitoring for 5 minutes (#4281) 2020-03-22 22:58:57 -07:00
yozhao101
82c2eee1e6 [Monit] Change the monitoring period from 120 seconds to 60 seconds. (#3974)
* [Monit] Change the monitoring period of monit from 120 seconds to 60
seconds and also at the same time double the interval for existing sonic monit config file in
host.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
2020-01-21 10:44:36 -08:00
Joe LeVeque
5e07b252ff [monit] Build from source and patch to use MemAvailable value if available on system (#3875) 2020-01-06 11:41:20 -08:00
lguohan
95a72b4e39
[baseimage]: fix monit configuration (#3448)
- monit config broke by one monit upgrade
- abandon sed approach since it is suspestible to monit config changes
- use unixsocket instead of httpd due to a bug in 5.20.0
2019-09-12 22:48:40 -07:00