Commit Graph

36 Commits

Author SHA1 Message Date
spilkey-cisco
3b982c073c Fix system-health hardware_checker to consume fan tolerance details (#16689)
Why I did it

Fan tolerance checking is done through new APIs, is_under_speed and is_over_speed, which populate corresponding fields into the database. speed_tolerance is no longer used and was removed, but system-health was not updated and indicates failures:

ADO: 25279165

root@sonic/# show system-health summary
System status summary

  System status LED  red_blink
  Services:
    Status: OK
  Hardware:
    Status: Not OK
    Reasons: Failed to get speed tolerance for fantray5.fan1
	     Failed to get speed tolerance for fantray5.fan0
	     Failed to get speed tolerance for fantray4.fan1
	     Failed to get speed tolerance for fantray4.fan0
	     Failed to get speed tolerance for fantray3.fan1
	     Failed to get speed tolerance for fantray3.fan0
	     Failed to get speed tolerance for fantray2.fan1
	     Failed to get speed tolerance for fantray2.fan0
	     Failed to get speed tolerance for fantray1.fan1
	     Failed to get speed tolerance for fantray1.fan0
	     Failed to get speed tolerance for fantray0.fan1
	     Failed to get speed tolerance for fantray0.fan0
	     Failed to get speed tolerance for PSU1.fan0
	     Failed to get speed tolerance for PSU0.fan0

How I did it
Updated hardware_checker.py in system-health to consume new is_under_speed and is_over_speed database entries instead of speed_tolerance and hard-coded calculations.

How to verify it
root@sonic:/# show system-health summary
System status summary

  System status LED  green
  Services:
    Status: OK
  Hardware:
    Status: OK
2024-02-02 14:33:14 +08:00
ganglv
c71fb3a30f
Share image for gnmi and telemetry (#16863)
Why I did it
Share docker image to support gnmi container and telemetry container

Work item tracking
Microsoft ADO 25423918:
How I did it
Create telemetry image from gnmi docker image.
Enable gnmi container and disable telemetry container by default.

How to verify it
Run end to end test.
2023-11-08 08:54:36 +08:00
Senthil Kumar Guruswamy
34e5d266e5
Handle service start-limit-hit failure event case in sysmonitor (#16174) 2023-08-31 12:07:42 -07:00
Senthil Kumar Guruswamy
fdd5deb453
Fix for issue#14871 (#15433)
Include valid input check for system status in test along with db update
check
2023-08-31 12:04:48 -07:00
Stephen Sun
2a55e8b359
Update the description message of PSU power threshold checking in system health (#15289)
- Why I did it
Adjust PSU power threshold logic in system health.

- How I did it
Update the description message in PSU power threshold checking
power of PSU x (xx w) exceeds threshold (xx w) => System power exceeds xx threshold (xx w)

- How to verify it
Manual test and unit test
2023-07-15 01:10:29 +03:00
Senthil Kumar Guruswamy
ed700de435
Fix for issue#14964 (#15212)
Multiprocessing Manager resources (Queue) to be freed up during task stop
2023-06-19 12:10:28 -07:00
DavidZagury
e830491001
[system-health] When disabling a feature the SYSTEM_READY|SYSTEM_STATE was not updated (#14823)
- Why I did it
If you enable feature and then disable it, System Ready status change to Not Ready

A disabled feature should not affect the system ready status.

- How I did it
During the disable flow of dhcp_relay, it entered the dnsrvs_name list, which caused the SYSTEM_STATE key to be set to DOWN. Right after that, the dhcp_relay service was removed from the full service list, however, but, when it was removed from the dnsrvs_name, there was no flow to reset the system state back to UP even though there was no more services in down state.

- How to verify it
root@qa-eth-vt01-2-3700v:/home/admin# config feature state dhcp_relay enabled 
root@qa-eth-vt01-2-3700v:/home/admin# show system-health sysready-status 

root@qa-eth-vt01-2-3700v:/home/admin# config feature state dhcp_relay disabled
root@qa-eth-vt01-2-3700v:/home/admin# show system-health sysready-status 

Should see
System is ready
2023-05-30 10:37:33 +03:00
Vivek
bc9c054da2
[healthd] Use unix_socket_path instead of loopback ip (#14843)
- Why I did it

interfaces-config service restarts networking service, which in-turn results in loopback interface address is being removed and reassigned back

If the system-health happens to start during that instance expections and logs like this are seen:

Apr 15 18:14:49.357869 r-panther-20 ERR healthd: update system status exception:Unable to connect to redis: Cannot assign requested address
Apr 15 18:14:49.429778 r-panther-20 ERR healthd: subscribe_statedb exited- Unable to connect to redis: Cannot assign requested address
Apr 15 18:14:52.218594 r-panther-20 ERR healthd: system_service_Map_base::at
Apr 15 18:14:52.219714 r-panther-20 ERR healthd: system_service_Map_base::at
Apr 15 18:14:55.218636 r-panther-20 ERR healthd: system_service_Map_base::at
Apr 15 18:14:55.218722 r-panther-20 ERR healthd: system_service_Map_base::at

- How I did it
use unix socket path

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
2023-05-26 15:49:21 +03:00
Guilt
a73d443c1d
[CI][doc][build] Trim src folder files trailing blanks (#15162)
- Run pre-commit tox profile to trim all trailing blanks
- Use several commits with a per-folder based strategy
  to ease their merge

Issue #15114

Signed-off-by: Guillaume Lambert <guillaume.lambert@orange.com>
2023-05-24 10:01:43 -07:00
Junchao-Mellanox
5e893666df
[system-health] Add fan direction check for system health (#14509)
- Why I did it
Add fan direction check to system health, all fans should be in the same direction

- How I did it
Add fan direction check to system health, all fans should be in the same direction

- How to verify it
Manual test
Unit test
Added sonic-mgmt test case to verify
2023-05-10 20:38:20 +03:00
Vivek
22b4aac432
[Sys Mon] Fix the service entry delete in state_db because of timer job (#14702)
Why I did it
systemd stop event on service with timers can sometime delete the state_db entry for the corresponding service.

Note: This won't be observed on the latest master label since the dependency on timer was removed with the recent config reload enhancement. However, it is better to have the fix since there might be some systemd services added to system health daemon in the future which may contain timers

root@qa-eth-vt01-4-3700c:/home/admin# systemctl stop snmp
root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status 
System is not ready - one or more services are not up

Service-Name            Service-Status    App-Ready-Status    Down-Reason
----------------------  ----------------  ------------------  -------------
<Truncated>
ssh                     OK                OK                  -
swss                    OK                OK                  -
syncd                   OK                OK                  -
sysstat                 OK                OK                  -
teamd                   OK                OK                  -
telemetry               OK                OK                  -
what-just-happened      OK                OK                  -
ztp                     OK                OK                  -
<Truncated>
Expected

Should see a Down entry for SNMP instead of the entry being deleted from the STATE_DB

root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status 
System is not ready - one or more services are not up

Service-Name            Service-Status    App-Ready-Status    Down-Reason
----------------------  ----------------  ------------------  -------------
<Truncated>
snmp                    Down              Down                Inactive
ssh                     OK                OK                  -
swss                    OK                OK                  -
syncd                   OK                OK                  -
sysstat                 OK                OK                  -
teamd                   OK                OK                  -
telemetry               OK                OK                  -
what-just-happened      OK                OK                  -
ztp                     OK                OK                  -
<Truncated>
How I did it
Happens because the timer is usually a PartOf service and thus a stop on service is propagated to timer. Fixed the logic to handle this

Apr 18 02:06:47.711252 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.service from source:sysbus time:2023-04-17 23:06:47
Apr 18 02:06:47.711347 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.service ] 
Apr 18 02:06:47.722363 r-lionfish-16 INFO healthd: snmp.service service state changed to [inactive/dead]

Apr 18 02:06:47.723230 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.timer from source:sysbus time:2023-04-17 23:06:47
Apr 18 02:06:47.723328 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.timer ] 

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
2023-04-27 09:02:13 -07:00
Junchao-Mellanox
03cab99a7a
[system-health] Make check interval more accurate (#14085)
- Why I did it

Healthd check system status every 60 seconds. However, running checker may take several seconds. Say checker takes X seconds, healthd takes (60 + X) seconds to finish one iteration. This implementation makes sonic-mgmt test case not so stable because the value X is hard to predict and different among different platforms. This PR introduces an interval
compensation mechanism to healthd main loop.

- How I did it

Introduces an interval compensation mechanism to healthd main loop: healthd should wait (60 - X) seconds for next iteration

- How to verify it

Manual test
Unit test
2023-03-15 07:21:00 +02:00
Liu Shilong
dcce42c402
Check SONiC dependencies before installation. (#13850)
Why I did it
SONiC related packages shouldn't be intalled from Pypi.
It is security compliance requirement.

How I did it
Check SONiC related packages when using setup.py.

How to verify it
2023-03-02 08:20:39 +08:00
spilkey-cisco
ad679a0338
Add asic presence filtering for container checking in system-health (#13497)
Why I did it
On a supervisor card in a chassis, syncd/teamd/swss/lldp etc dockers are created for each Switch Fabric card. However, not all chassis would have all the switch fabric cards present. In this case, only dockers for Switch Fabrics present would be created.

system-health indicates errors in this scenario as it is expecting dockers for all Switch Fabrics (based on NUM_ASIC defined in asic.conf file).

system-health process error messages were also altered to indicate which container had the issue; multiple containers may run processes with the same name, which can result in identical system-health error messages, causing ambiguity.

How I did it
Port container_checker logic from #11442 into service_checker for system-health.

How to verify it
Bringup Supervisor card with one or more missing fabric cards. Execute 'show system-health summary'. The command should not report failure due to missing dockers for the asics on the fabric cards which are not present.
2023-02-10 21:34:10 -08:00
Junchao-Mellanox
5e6e2c827d
Fix issue: ERR healthd: Get unit status determine-reboot-cause-'LoadState' (#13697)
- Why I did it
Fix issue: ERR healthd: Get unit status determine-reboot-cause-'LoadState'. The error log is only seen on shutdown flow such as fast-reboot/warm-reboot.

In shutdown flow, 'LoadState' might not be available in systemctl status output, using [] might cause a KeyError.

- How I did it
Use dict.get instead of []

- How to verify it
Manual test
2023-02-07 17:56:06 +02:00
Mai Bui
2f2702f705
Revert "[system-health] Remove subprocess with shell=True (#12572)" (#13505)
This reverts commit b3a8167968.
Due to issue https://github.com/sonic-net/sonic-buildimage/issues/13432
2023-01-25 13:41:08 -08:00
Tomer Shalvi
2d2d9433b3
Moving multiprocessing.Manager to the correct sub-process (#13377)
Why I did it
There is a queue in sysmonitor.py that is created based on an object of multiprocessing.Manager.
After performing fast-reboot, system health monitor is being shut down, what causes this Manager to be shut down as well, since it is a child-process of healthd.
That's why I moved the creation of this Manager from the top of the file to the function Sysmonitor.system_service() (The only place it is used), to make Manager a child-process of Sysmonitor, instead of Healthd. This way both the queue (the Manager) and the processes that uses this queue will be child-processes of the same process, and the problematic scenario of sysmonitor sending messages to a dead queue will not be possible.

How I did it
Removed the definition of manager as global and moved it to system_service() function

How to verify it
Perform a fast reboot and verify the traceback issue is fixed
2023-01-17 08:43:49 -08:00
Junchao-Mellanox
ffa974c7f4
[system-health] Led color shall be controlled by configuration when system is booting (#12487)
* [system-health] Led color shall be controlled by configuration when system is booting

* Fix unit test issue
2022-11-30 18:38:50 -08:00
Stephen Sun
7b4032e9ed
[system health daemon] Support PSU power threshold checking (#11864) 2022-11-21 07:04:58 -08:00
Zain Budhwani
8f48773fd1
Publish additional events (#12563)
Add event_publish code or regex for rsyslog plugin for additional events
2022-11-07 09:57:57 -08:00
Mai Bui
b3a8167968
[system-health] Remove subprocess with shell=True (#12572)
Signed-off-by: maipbui <maibui@microsoft.com>
#### Why I did it
`subprocess` is used with `shell=True`, which is very dangerous for shell injection.
#### How I did it
remove `shell=True`, use `shell=False`
#### How to verify it
Pass UT
Manual test
2022-11-02 10:16:48 -04:00
Marty Y. Lok
57ff7a2308
[chassis][supervisor] show system-health summary fails on the supervisor card (#10631)
Fix the command "sudo show system-health summary" shows the following error on the supervisor card. Fixes #10630
2022-09-22 16:39:31 -07:00
Junchao-Mellanox
865be7ba85
[system-health] Fix error log system_service'state' while doing confi… (#11225)
- Why I did it
While doing config reload, FEATURE table may be removed and re-add. During this process, updating FEATURE table is not atomic. It could be that the FEATURE table has entry, but each entry has no field. This PR introduces a retry mechanism to avoid this.

- How I did it
Introduces a retry mechanism to avoid this.

- How to verify it
New unit test added to verify the flow as well as running some manual test.
2022-06-28 18:48:10 +03:00
Lior Avramov
afd3c63561
Change severity of log messages for cases where docker container was stopped during service checker operation (#11188)
#### Why I did it
There might be a case where service checker periodic operation determined that specific container is running but when it tries to perform an operation on it, it was already closed by the user. This is a valid flow and we should not log an error message, informative warning is enough. 

#### How I did it
I reduce log severity.

#### How to verify it
I verified it manually.
2022-06-22 14:54:14 -07:00
Senthil Kumar Guruswamy
f37dd770cd
System Ready (#10479)
Why I did it
At present, there is no mechanism in an event driven model to know that the system is up with all the essential sonic services and also, all the docker apps are ready along with port ready status to start the network traffic. With the asynchronous architecture of SONiC, we will not be able to verify if the config has been applied all the way down to the HW. But we can get the closest up status of each app and arrive at the system readiness.

How I did it
A new python based system monitor tool is introduced under system-health framework to monitor all the essential system host services including docker wrapper services on an event based model and declare the system is ready. This framework gives provision for docker apps to notify its closest up status. CLIs are provided to fetch the current system status and also service running status and its app ready status along with failure reason if any.

How to verify it
"show system-health sysready-status" click CLI
Syslogs for system ready
2022-05-20 13:25:11 -07:00
Junchao-Mellanox
d82eafd8ae
[system-health] Fix file handle leak (#10059)
- Why I did it
swsscommon.ConfigDBConnector does not automatically close connection when the instance is recycled by python. So, it should not create this instance each time calling check_services. It will cause error like Failed to read from file /var/run/hw-management/led/led_status_capability - OSError(24, 'Too many open files')

- How I did it
Only connect DB once in init

- How to verify it
Manual test
2022-02-24 11:29:59 +02:00
Junchao-Mellanox
c06cb219e2
Make system health service start early (#9792)
- Why I did it
For SYSTEM READY feature. Currently, there is a booting stage in system health service to indicate that the system is loading SONiC component. This booting stage is no longer needed because SYSTEM READY feature will treat that stage as system "NOT READY".

- How I did it
1. Remove booting stage
2. Adjust unit test cases

- How to verify it
Manual test, Unit test, sonic-mgmt Regression
2022-01-27 13:46:52 +02:00
Junchao-Mellanox
11a93d2f92
[system-health] No longer check critical process/service status via monit (#9068)
HLD updated here: https://github.com/Azure/SONiC/pull/887

#### Why I did it

Command `monit summary -B` can no longer display the status for each critical process, system-health should not depend on it and need find a way to monitor the status of critical processes. The PR is to address that. monit is still used by system-health to do file system check as well as customize check.

#### How I did it

1.	Get container names from FEATURE table
2.	For each container, collect critical process names from file critical_processes
3.	Use “docker exec -it <container_name> bash -c ‘supervisorctl status’” to get processes status inside container, parse the output and check if any critical processes exit

#### How to verify it

1. Add unit test case to cover it
2. Adjust sonic-mgmt cases to cover it
3. Manual test
2021-11-23 15:47:48 -08:00
Qi Luo
ec624e280c
Replace swsssdk.ConfigDBConnector and SonicDBConfig with swsscommon implementation in system-health (#8186)
swsssdk will be deprecated. Use swsscommon instead.
2021-07-16 19:56:24 -07:00
Aravind Mani
6d83a424b5
[dell]: System Health: Fix ASIC key issue in Dell platform (#6556)
ASIC key used in system health daemon is not present in Dell platforms.

Fixes #6343

Got the thermal sensor list using 2.0 API and retrieved the ASIC keys.
2021-04-05 18:00:38 -07:00
Junchao-Mellanox
2a0351c65c
Check fan speed before check fan status (#6586)
**- Why I did it**
In thermalctd, when speed of fan exceeds threshold, the fan status will be saved as "bad". So in system health, it is better to check fan speed before fan status. In this case, if fan speed exceeds threshold, we get more detailed information.

**- How I did it**
Move fan speed check logic before fan status check

**- How to verify it**
Manual test
2021-01-31 09:06:36 +02:00
Joe LeVeque
2d77a36658
[system-health] Make run_command() Python 3-compliant (#6371)
Pass universal_newlines=True parameter to subprocess.Popen(); no longer use .encode('utf-8') on resulting stdout.
This was missed in #5886

Note: I would prefer to use text=True instead of universal_newlines=True, as the former is an alias only available in Python 3 and is more understandable than the latter. However, Even though the setup.py file for this package only specifies Python 3, the LGTM tool finds other Python 2 code in the repo and validates the code as Python 2 code and alerts that text=True is an invalid parameter. Will stick with universal_newlines=True for now. Once all Python code in the repo has been converted to Python 3, I will change all universal_newlines=True to text=True.
2021-01-07 05:48:13 -08:00
Joe LeVeque
566ea4f601
[system-health] Convert to Python 3 (#5886)
- Convert system-health scripts to Python 3
- Build and install system-health as a Python 3 wheel
- Also convert newlines from DOS to UNIX
2020-12-29 14:04:09 -08:00
Vaibhav Hemant Dixit
e8da2ee975
Fix post-reboot errors in platforms without sonic_platform package (#6130)
Refactor determine-reboot cause code. Fix errors seen during determine-reboot-cause when sonic_platform package is not installed.
Add error handling for healthd service when sonic_platform package is not installed.

Tested on KVM where sonic_platform is not present, and the errors are not seen anymore in syslog.
2020-12-17 12:01:42 -08:00
Joe LeVeque
73825e4d4d
[system-health] Update .gitignore file (#5688)
Touch up .gitignore file to properly ignore all files generated when building a Python wheel package
2020-10-22 11:58:27 -07:00
Junchao-Mellanox
1c97a03b81
[system-health] Add support for monitoring system health (#4835)
* system health first commit

* system health daemon first commit

* Finish healthd

* Changes due to lower layer logic change

* Get ASIC temperature from TEMPERATURE_INFO table

* Add system health make rule and service files

* fix bugs found during manual test

* Change make file to install system-health library to host

* Set system LED to blink on bootup time

* Caught exceptions in system health checker to make it more robust

* fix issue that fan/psu presence will always be true

* fix issue for external checker

* move system-health service to right after rc-local service

* Set system-health service start after database service

* Get system up time via /proc/uptime

* Provide more information in stat for CLI to use

* fix typo

* Set default category to External for external checker

* If external checker reported OK, save it to stat too

* Trim string for external checker output

* fix issue: PSU voltage check always return OK

* Add unit test cases for system health library

* Fix LGTM warnings

* fix demo comments: 1. get boot up timeout from monit configuration file; 2. set system led in library instead of daemon

* Remove boot_timeout configuration because it will get from monit config file

* Fix argument miss

* fix unit test failure

* fix issue: summary status is not correct

* Fix format issues found in code review

* rename th to threshold to make it clearer

* Fix review comment: 1. add a .dep file for system health; 2. deprecated daemon_base and uses sonic-py-common instead

* Fix unit test failure

* Fix LGTM alert

* Fix LGTM alert

* Fix review comments

* Fix review comment

* 1. Add relevant comments for system health; 2. rename external_checker to user_define_checker

* Ignore check for unknown service type

* Fix unit test issue

* Rename user define checker to user defined checker

* Rename user_define_checkers to user_defined_checkers for configuration file

* Renmae file user_define_checker.py -> user_defined_checker.py

* Fix typo

* Adjust import order for config.py

Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com>

* Adjust import order for src/system-health/health_checker/hardware_checker.py

Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com>

* Adjust import order for src/system-health/scripts/healthd

Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com>

* Adjust import orders in src/system-health/tests/test_system_health.py

* Fix typo

* Add new line after import

* If system health configuration file not exist, healthd should exit

* Fix indent and enable pytest coverage

* Fix typo

* Fix typo

* Remove global logger and use log functions inherited from super class

* Change info level logger to notice level

Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com>
2020-10-12 11:12:49 +03:00