Commit Graph

11 Commits

Author SHA1 Message Date
Junchao-Mellanox
d82eafd8ae
[system-health] Fix file handle leak (#10059)
- Why I did it
swsscommon.ConfigDBConnector does not automatically close connection when the instance is recycled by python. So, it should not create this instance each time calling check_services. It will cause error like Failed to read from file /var/run/hw-management/led/led_status_capability - OSError(24, 'Too many open files')

- How I did it
Only connect DB once in init

- How to verify it
Manual test
2022-02-24 11:29:59 +02:00
Junchao-Mellanox
c06cb219e2
Make system health service start early (#9792)
- Why I did it
For SYSTEM READY feature. Currently, there is a booting stage in system health service to indicate that the system is loading SONiC component. This booting stage is no longer needed because SYSTEM READY feature will treat that stage as system "NOT READY".

- How I did it
1. Remove booting stage
2. Adjust unit test cases

- How to verify it
Manual test, Unit test, sonic-mgmt Regression
2022-01-27 13:46:52 +02:00
Junchao-Mellanox
11a93d2f92
[system-health] No longer check critical process/service status via monit (#9068)
HLD updated here: https://github.com/Azure/SONiC/pull/887

#### Why I did it

Command `monit summary -B` can no longer display the status for each critical process, system-health should not depend on it and need find a way to monitor the status of critical processes. The PR is to address that. monit is still used by system-health to do file system check as well as customize check.

#### How I did it

1.	Get container names from FEATURE table
2.	For each container, collect critical process names from file critical_processes
3.	Use “docker exec -it <container_name> bash -c ‘supervisorctl status’” to get processes status inside container, parse the output and check if any critical processes exit

#### How to verify it

1. Add unit test case to cover it
2. Adjust sonic-mgmt cases to cover it
3. Manual test
2021-11-23 15:47:48 -08:00
Qi Luo
ec624e280c
Replace swsssdk.ConfigDBConnector and SonicDBConfig with swsscommon implementation in system-health (#8186)
swsssdk will be deprecated. Use swsscommon instead.
2021-07-16 19:56:24 -07:00
Aravind Mani
6d83a424b5
[dell]: System Health: Fix ASIC key issue in Dell platform (#6556)
ASIC key used in system health daemon is not present in Dell platforms.

Fixes #6343

Got the thermal sensor list using 2.0 API and retrieved the ASIC keys.
2021-04-05 18:00:38 -07:00
Junchao-Mellanox
2a0351c65c
Check fan speed before check fan status (#6586)
**- Why I did it**
In thermalctd, when speed of fan exceeds threshold, the fan status will be saved as "bad". So in system health, it is better to check fan speed before fan status. In this case, if fan speed exceeds threshold, we get more detailed information.

**- How I did it**
Move fan speed check logic before fan status check

**- How to verify it**
Manual test
2021-01-31 09:06:36 +02:00
Joe LeVeque
2d77a36658
[system-health] Make run_command() Python 3-compliant (#6371)
Pass universal_newlines=True parameter to subprocess.Popen(); no longer use .encode('utf-8') on resulting stdout.
This was missed in #5886

Note: I would prefer to use text=True instead of universal_newlines=True, as the former is an alias only available in Python 3 and is more understandable than the latter. However, Even though the setup.py file for this package only specifies Python 3, the LGTM tool finds other Python 2 code in the repo and validates the code as Python 2 code and alerts that text=True is an invalid parameter. Will stick with universal_newlines=True for now. Once all Python code in the repo has been converted to Python 3, I will change all universal_newlines=True to text=True.
2021-01-07 05:48:13 -08:00
Joe LeVeque
566ea4f601
[system-health] Convert to Python 3 (#5886)
- Convert system-health scripts to Python 3
- Build and install system-health as a Python 3 wheel
- Also convert newlines from DOS to UNIX
2020-12-29 14:04:09 -08:00
Vaibhav Hemant Dixit
e8da2ee975
Fix post-reboot errors in platforms without sonic_platform package (#6130)
Refactor determine-reboot cause code. Fix errors seen during determine-reboot-cause when sonic_platform package is not installed.
Add error handling for healthd service when sonic_platform package is not installed.

Tested on KVM where sonic_platform is not present, and the errors are not seen anymore in syslog.
2020-12-17 12:01:42 -08:00
Joe LeVeque
73825e4d4d
[system-health] Update .gitignore file (#5688)
Touch up .gitignore file to properly ignore all files generated when building a Python wheel package
2020-10-22 11:58:27 -07:00
Junchao-Mellanox
1c97a03b81
[system-health] Add support for monitoring system health (#4835)
* system health first commit

* system health daemon first commit

* Finish healthd

* Changes due to lower layer logic change

* Get ASIC temperature from TEMPERATURE_INFO table

* Add system health make rule and service files

* fix bugs found during manual test

* Change make file to install system-health library to host

* Set system LED to blink on bootup time

* Caught exceptions in system health checker to make it more robust

* fix issue that fan/psu presence will always be true

* fix issue for external checker

* move system-health service to right after rc-local service

* Set system-health service start after database service

* Get system up time via /proc/uptime

* Provide more information in stat for CLI to use

* fix typo

* Set default category to External for external checker

* If external checker reported OK, save it to stat too

* Trim string for external checker output

* fix issue: PSU voltage check always return OK

* Add unit test cases for system health library

* Fix LGTM warnings

* fix demo comments: 1. get boot up timeout from monit configuration file; 2. set system led in library instead of daemon

* Remove boot_timeout configuration because it will get from monit config file

* Fix argument miss

* fix unit test failure

* fix issue: summary status is not correct

* Fix format issues found in code review

* rename th to threshold to make it clearer

* Fix review comment: 1. add a .dep file for system health; 2. deprecated daemon_base and uses sonic-py-common instead

* Fix unit test failure

* Fix LGTM alert

* Fix LGTM alert

* Fix review comments

* Fix review comment

* 1. Add relevant comments for system health; 2. rename external_checker to user_define_checker

* Ignore check for unknown service type

* Fix unit test issue

* Rename user define checker to user defined checker

* Rename user_define_checkers to user_defined_checkers for configuration file

* Renmae file user_define_checker.py -> user_defined_checker.py

* Fix typo

* Adjust import order for config.py

Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com>

* Adjust import order for src/system-health/health_checker/hardware_checker.py

Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com>

* Adjust import order for src/system-health/scripts/healthd

Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com>

* Adjust import orders in src/system-health/tests/test_system_health.py

* Fix typo

* Add new line after import

* If system health configuration file not exist, healthd should exit

* Fix indent and enable pytest coverage

* Fix typo

* Fix typo

* Remove global logger and use log functions inherited from super class

* Change info level logger to notice level

Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com>
2020-10-12 11:12:49 +03:00