Commit Graph

11 Commits

Author SHA1 Message Date
Junchao-Mellanox
bb41b55f1a [system-health] Make check interval more accurate (#14085)
- Why I did it

Healthd check system status every 60 seconds. However, running checker may take several seconds. Say checker takes X seconds, healthd takes (60 + X) seconds to finish one iteration. This implementation makes sonic-mgmt test case not so stable because the value X is hard to predict and different among different platforms. This PR introduces an interval
compensation mechanism to healthd main loop.

- How I did it

Introduces an interval compensation mechanism to healthd main loop: healthd should wait (60 - X) seconds for next iteration

- How to verify it

Manual test
Unit test
2023-03-19 22:32:43 +08:00
Mai Bui
eeb3ae17a6 Revert "[system-health] Remove subprocess with shell=True (#12572)" (#13505)
This reverts commit b3a8167968.
Due to issue https://github.com/sonic-net/sonic-buildimage/issues/13432
2023-03-06 19:30:11 +08:00
Stephen Sun
f221ac3db5 [system health daemon] Support PSU power threshold checking (#11864) 2022-12-10 10:33:21 +08:00
Mai Bui
b3a8167968
[system-health] Remove subprocess with shell=True (#12572)
Signed-off-by: maipbui <maibui@microsoft.com>
#### Why I did it
`subprocess` is used with `shell=True`, which is very dangerous for shell injection.
#### How I did it
remove `shell=True`, use `shell=False`
#### How to verify it
Pass UT
Manual test
2022-11-02 10:16:48 -04:00
Junchao-Mellanox
865be7ba85
[system-health] Fix error log system_service'state' while doing confi… (#11225)
- Why I did it
While doing config reload, FEATURE table may be removed and re-add. During this process, updating FEATURE table is not atomic. It could be that the FEATURE table has entry, but each entry has no field. This PR introduces a retry mechanism to avoid this.

- How I did it
Introduces a retry mechanism to avoid this.

- How to verify it
New unit test added to verify the flow as well as running some manual test.
2022-06-28 18:48:10 +03:00
Senthil Kumar Guruswamy
f37dd770cd
System Ready (#10479)
Why I did it
At present, there is no mechanism in an event driven model to know that the system is up with all the essential sonic services and also, all the docker apps are ready along with port ready status to start the network traffic. With the asynchronous architecture of SONiC, we will not be able to verify if the config has been applied all the way down to the HW. But we can get the closest up status of each app and arrive at the system readiness.

How I did it
A new python based system monitor tool is introduced under system-health framework to monitor all the essential system host services including docker wrapper services on an event based model and declare the system is ready. This framework gives provision for docker apps to notify its closest up status. CLIs are provided to fetch the current system status and also service running status and its app ready status along with failure reason if any.

How to verify it
"show system-health sysready-status" click CLI
Syslogs for system ready
2022-05-20 13:25:11 -07:00
Junchao-Mellanox
c06cb219e2
Make system health service start early (#9792)
- Why I did it
For SYSTEM READY feature. Currently, there is a booting stage in system health service to indicate that the system is loading SONiC component. This booting stage is no longer needed because SYSTEM READY feature will treat that stage as system "NOT READY".

- How I did it
1. Remove booting stage
2. Adjust unit test cases

- How to verify it
Manual test, Unit test, sonic-mgmt Regression
2022-01-27 13:46:52 +02:00
Junchao-Mellanox
11a93d2f92
[system-health] No longer check critical process/service status via monit (#9068)
HLD updated here: https://github.com/Azure/SONiC/pull/887

#### Why I did it

Command `monit summary -B` can no longer display the status for each critical process, system-health should not depend on it and need find a way to monitor the status of critical processes. The PR is to address that. monit is still used by system-health to do file system check as well as customize check.

#### How I did it

1.	Get container names from FEATURE table
2.	For each container, collect critical process names from file critical_processes
3.	Use “docker exec -it <container_name> bash -c ‘supervisorctl status’” to get processes status inside container, parse the output and check if any critical processes exit

#### How to verify it

1. Add unit test case to cover it
2. Adjust sonic-mgmt cases to cover it
3. Manual test
2021-11-23 15:47:48 -08:00
Qi Luo
ec624e280c
Replace swsssdk.ConfigDBConnector and SonicDBConfig with swsscommon implementation in system-health (#8186)
swsssdk will be deprecated. Use swsscommon instead.
2021-07-16 19:56:24 -07:00
Joe LeVeque
566ea4f601
[system-health] Convert to Python 3 (#5886)
- Convert system-health scripts to Python 3
- Build and install system-health as a Python 3 wheel
- Also convert newlines from DOS to UNIX
2020-12-29 14:04:09 -08:00
Junchao-Mellanox
1c97a03b81
[system-health] Add support for monitoring system health (#4835)
* system health first commit

* system health daemon first commit

* Finish healthd

* Changes due to lower layer logic change

* Get ASIC temperature from TEMPERATURE_INFO table

* Add system health make rule and service files

* fix bugs found during manual test

* Change make file to install system-health library to host

* Set system LED to blink on bootup time

* Caught exceptions in system health checker to make it more robust

* fix issue that fan/psu presence will always be true

* fix issue for external checker

* move system-health service to right after rc-local service

* Set system-health service start after database service

* Get system up time via /proc/uptime

* Provide more information in stat for CLI to use

* fix typo

* Set default category to External for external checker

* If external checker reported OK, save it to stat too

* Trim string for external checker output

* fix issue: PSU voltage check always return OK

* Add unit test cases for system health library

* Fix LGTM warnings

* fix demo comments: 1. get boot up timeout from monit configuration file; 2. set system led in library instead of daemon

* Remove boot_timeout configuration because it will get from monit config file

* Fix argument miss

* fix unit test failure

* fix issue: summary status is not correct

* Fix format issues found in code review

* rename th to threshold to make it clearer

* Fix review comment: 1. add a .dep file for system health; 2. deprecated daemon_base and uses sonic-py-common instead

* Fix unit test failure

* Fix LGTM alert

* Fix LGTM alert

* Fix review comments

* Fix review comment

* 1. Add relevant comments for system health; 2. rename external_checker to user_define_checker

* Ignore check for unknown service type

* Fix unit test issue

* Rename user define checker to user defined checker

* Rename user_define_checkers to user_defined_checkers for configuration file

* Renmae file user_define_checker.py -> user_defined_checker.py

* Fix typo

* Adjust import order for config.py

Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com>

* Adjust import order for src/system-health/health_checker/hardware_checker.py

Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com>

* Adjust import order for src/system-health/scripts/healthd

Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com>

* Adjust import orders in src/system-health/tests/test_system_health.py

* Fix typo

* Add new line after import

* If system health configuration file not exist, healthd should exit

* Fix indent and enable pytest coverage

* Fix typo

* Fix typo

* Remove global logger and use log functions inherited from super class

* Change info level logger to notice level

Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com>
2020-10-12 11:12:49 +03:00