sonic-buildimage

Author	SHA1	Message	Date
spilkey-cisco	3b982c073c	Fix system-health hardware_checker to consume fan tolerance details (#16689 ) Why I did it Fan tolerance checking is done through new APIs, is_under_speed and is_over_speed, which populate corresponding fields into the database. speed_tolerance is no longer used and was removed, but system-health was not updated and indicates failures: ADO: 25279165 root@sonic/# show system-health summary System status summary System status LED red_blink Services: Status: OK Hardware: Status: Not OK Reasons: Failed to get speed tolerance for fantray5.fan1 Failed to get speed tolerance for fantray5.fan0 Failed to get speed tolerance for fantray4.fan1 Failed to get speed tolerance for fantray4.fan0 Failed to get speed tolerance for fantray3.fan1 Failed to get speed tolerance for fantray3.fan0 Failed to get speed tolerance for fantray2.fan1 Failed to get speed tolerance for fantray2.fan0 Failed to get speed tolerance for fantray1.fan1 Failed to get speed tolerance for fantray1.fan0 Failed to get speed tolerance for fantray0.fan1 Failed to get speed tolerance for fantray0.fan0 Failed to get speed tolerance for PSU1.fan0 Failed to get speed tolerance for PSU0.fan0 How I did it Updated hardware_checker.py in system-health to consume new is_under_speed and is_over_speed database entries instead of speed_tolerance and hard-coded calculations. How to verify it root@sonic:/# show system-health summary System status summary System status LED green Services: Status: OK Hardware: Status: OK	2024-02-02 14:33:14 +08:00
ganglv	c71fb3a30f	Share image for gnmi and telemetry (#16863 ) Why I did it Share docker image to support gnmi container and telemetry container Work item tracking Microsoft ADO 25423918: How I did it Create telemetry image from gnmi docker image. Enable gnmi container and disable telemetry container by default. How to verify it Run end to end test.	2023-11-08 08:54:36 +08:00
Senthil Kumar Guruswamy	34e5d266e5	Handle service start-limit-hit failure event case in sysmonitor (#16174 )	2023-08-31 12:07:42 -07:00
Senthil Kumar Guruswamy	fdd5deb453	Fix for issue#14871 (#15433 ) Include valid input check for system status in test along with db update check	2023-08-31 12:04:48 -07:00
Stephen Sun	2a55e8b359	Update the description message of PSU power threshold checking in system health (#15289 ) - Why I did it Adjust PSU power threshold logic in system health. - How I did it Update the description message in PSU power threshold checking power of PSU x (xx w) exceeds threshold (xx w) => System power exceeds xx threshold (xx w) - How to verify it Manual test and unit test	2023-07-15 01:10:29 +03:00
Senthil Kumar Guruswamy	ed700de435	Fix for issue#14964 (#15212 ) Multiprocessing Manager resources (Queue) to be freed up during task stop	2023-06-19 12:10:28 -07:00
DavidZagury	e830491001	[system-health] When disabling a feature the SYSTEM_READY|SYSTEM_STATE was not updated (#14823 ) - Why I did it If you enable a feature and then disable it, the System Ready status changes to Not Ready. A disabled feature should not affect the system ready status. - How I did it During the disable flow of dhcp_relay, the service entered the dnsrvs_name list, which caused the SYSTEM_STATE key to be set to DOWN. Right after that, the dhcp_relay service was removed from the full service list. However, when it was removed from dnsrvs_name, there was no flow to reset the system state back to UP, even though no services remained in the down state. - How to verify it root@qa-eth-vt01-2-3700v:/home/admin# config feature state dhcp_relay enabled root@qa-eth-vt01-2-3700v:/home/admin# show system-health sysready-status root@qa-eth-vt01-2-3700v:/home/admin# config feature state dhcp_relay disabled root@qa-eth-vt01-2-3700v:/home/admin# show system-health sysready-status Should see System is ready	2023-05-30 10:37:33 +03:00
Vivek	bc9c054da2	[healthd] Use unix_socket_path instead of loopback ip (#14843 ) - Why I did it The interfaces-config service restarts the networking service, which in turn results in the loopback interface address being removed and reassigned. If system-health happens to start during that window, exceptions and logs like these are seen: Apr 15 18:14:49.357869 r-panther-20 ERR healthd: update system status exception:Unable to connect to redis: Cannot assign requested address Apr 15 18:14:49.429778 r-panther-20 ERR healthd: subscribe_statedb exited- Unable to connect to redis: Cannot assign requested address Apr 15 18:14:52.218594 r-panther-20 ERR healthd: system_service_Map_base::at Apr 15 18:14:52.219714 r-panther-20 ERR healthd: system_service_Map_base::at Apr 15 18:14:55.218636 r-panther-20 ERR healthd: system_service_Map_base::at Apr 15 18:14:55.218722 r-panther-20 ERR healthd: system_service_Map_base::at - How I did it Use the unix socket path. Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>	2023-05-26 15:49:21 +03:00
Guilt	a73d443c1d	[CI][doc][build] Trim src folder files trailing blanks (#15162 ) - Run pre-commit tox profile to trim all trailing blanks - Use several commits with a per-folder based strategy to ease their merge Issue #15114 Signed-off-by: Guillaume Lambert <guillaume.lambert@orange.com>	2023-05-24 10:01:43 -07:00
Junchao-Mellanox	5e893666df	[system-health] Add fan direction check for system health (#14509 ) - Why I did it Add fan direction check to system health, all fans should be in the same direction - How I did it Add fan direction check to system health, all fans should be in the same direction - How to verify it Manual test Unit test Added sonic-mgmt test case to verify	2023-05-10 20:38:20 +03:00
Vivek	22b4aac432	[Sys Mon] Fix the service entry delete in state_db because of timer job (#14702 ) Why I did it A systemd stop event on a service with timers can sometimes delete the state_db entry for the corresponding service. Note: This won't be observed on the latest master label since the dependency on the timer was removed with the recent config reload enhancement. However, it is better to have the fix, since systemd services that contain timers might be added to the system health daemon in the future. root@qa-eth-vt01-4-3700c:/home/admin# systemctl stop snmp root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status System is not ready - one or more services are not up Service-Name Service-Status App-Ready-Status Down-Reason ---------------------- ---------------- ------------------ ------------- <Truncated> ssh OK OK - swss OK OK - syncd OK OK - sysstat OK OK - teamd OK OK - telemetry OK OK - what-just-happened OK OK - ztp OK OK - <Truncated> Expected Should see a Down entry for SNMP instead of the entry being deleted from the STATE_DB root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status System is not ready - one or more services are not up Service-Name Service-Status App-Ready-Status Down-Reason ---------------------- ---------------- ------------------ ------------- <Truncated> snmp Down Down Inactive ssh OK OK - swss OK OK - syncd OK OK - sysstat OK OK - teamd OK OK - telemetry OK OK - what-just-happened OK OK - ztp OK OK - <Truncated> How I did it This happens because the timer is usually PartOf the service, and thus a stop on the service is propagated to the timer.
Fixed the logic to handle this Apr 18 02:06:47.711252 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.service from source:sysbus time:2023-04-17 23:06:47 Apr 18 02:06:47.711347 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.service ] Apr 18 02:06:47.722363 r-lionfish-16 INFO healthd: snmp.service service state changed to [inactive/dead] Apr 18 02:06:47.723230 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.timer from source:sysbus time:2023-04-17 23:06:47 Apr 18 02:06:47.723328 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.timer ] Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>	2023-04-27 09:02:13 -07:00
Junchao-Mellanox	03cab99a7a	[system-health] Make check interval more accurate (#14085 ) - Why I did it Healthd checks system status every 60 seconds. However, running the checkers may take several seconds. If the checkers take X seconds, healthd takes (60 + X) seconds to finish one iteration. This makes the sonic-mgmt test case unstable, because the value of X is hard to predict and differs across platforms. This PR introduces an interval compensation mechanism to the healthd main loop. - How I did it Introduce an interval compensation mechanism to the healthd main loop: healthd should wait (60 - X) seconds before the next iteration - How to verify it Manual test Unit test	2023-03-15 07:21:00 +02:00
Liu Shilong	dcce42c402	Check SONiC dependencies before installation. (#13850 ) Why I did it SONiC related packages shouldn't be installed from PyPI. It is a security compliance requirement. How I did it Check SONiC related packages when using setup.py. How to verify it	2023-03-02 08:20:39 +08:00
spilkey-cisco	ad679a0338	Add asic presence filtering for container checking in system-health (#13497 ) Why I did it On a supervisor card in a chassis, syncd/teamd/swss/lldp etc dockers are created for each Switch Fabric card. However, not all chassis would have all the switch fabric cards present. In this case, only dockers for Switch Fabrics present would be created. system-health indicates errors in this scenario as it is expecting dockers for all Switch Fabrics (based on NUM_ASIC defined in asic.conf file). system-health process error messages were also altered to indicate which container had the issue; multiple containers may run processes with the same name, which can result in identical system-health error messages, causing ambiguity. How I did it Port container_checker logic from #11442 into service_checker for system-health. How to verify it Bringup Supervisor card with one or more missing fabric cards. Execute 'show system-health summary'. The command should not report failure due to missing dockers for the asics on the fabric cards which are not present.	2023-02-10 21:34:10 -08:00
Junchao-Mellanox	5e6e2c827d	Fix issue: ERR healthd: Get unit status determine-reboot-cause-'LoadState' (#13697 ) - Why I did it Fix issue: ERR healthd: Get unit status determine-reboot-cause-'LoadState'. The error log is only seen on shutdown flow such as fast-reboot/warm-reboot. In shutdown flow, 'LoadState' might not be available in systemctl status output, using [] might cause a KeyError. - How I did it Use dict.get instead of [] - How to verify it Manual test	2023-02-07 17:56:06 +02:00
Mai Bui	2f2702f705	Revert "[system-health] Remove subprocess with shell=True (#12572 )" (#13505 ) This reverts commit `b3a8167968`. Due to issue https://github.com/sonic-net/sonic-buildimage/issues/13432	2023-01-25 13:41:08 -08:00
Tomer Shalvi	2d2d9433b3	Moving multiprocessing.Manager to the correct sub-process (#13377 ) Why I did it There is a queue in sysmonitor.py that is created from a multiprocessing.Manager object. After a fast-reboot, the system health monitor is shut down, which causes this Manager to be shut down as well, since it is a child process of healthd. That's why I moved the creation of this Manager from the top of the file to the function Sysmonitor.system_service() (the only place it is used), to make the Manager a child process of Sysmonitor instead of Healthd. This way both the queue (the Manager) and the processes that use this queue are child processes of the same process, and the problematic scenario of sysmonitor sending messages to a dead queue is no longer possible. How I did it Removed the definition of manager as a global and moved it to the system_service() function How to verify it Perform a fast reboot and verify the traceback issue is fixed	2023-01-17 08:43:49 -08:00
Junchao-Mellanox	ffa974c7f4	[system-health] Led color shall be controlled by configuration when system is booting (#12487 ) * [system-health] Led color shall be controlled by configuration when system is booting * Fix unit test issue	2022-11-30 18:38:50 -08:00
Stephen Sun	7b4032e9ed	[system health daemon] Support PSU power threshold checking (#11864 )	2022-11-21 07:04:58 -08:00
Zain Budhwani	8f48773fd1	Publish additional events (#12563 ) Add event_publish code or regex for rsyslog plugin for additional events	2022-11-07 09:57:57 -08:00
Mai Bui	b3a8167968	[system-health] Remove subprocess with shell=True (#12572 ) Signed-off-by: maipbui <maibui@microsoft.com> #### Why I did it `subprocess` is used with `shell=True`, which is very dangerous for shell injection. #### How I did it remove `shell=True`, use `shell=False` #### How to verify it Pass UT Manual test	2022-11-02 10:16:48 -04:00
Marty Y. Lok	57ff7a2308	[chassis][supervisor] show system-health summary fails on the supervisor card (#10631 ) Fix the command "sudo show system-health summary" shows the following error on the supervisor card. Fixes #10630	2022-09-22 16:39:31 -07:00
Junchao-Mellanox	865be7ba85	[system-health] Fix error log system_service'state' while doing confi… (#11225 ) - Why I did it During config reload, the FEATURE table may be removed and re-added. During this process, updating the FEATURE table is not atomic. It could be that the FEATURE table has entries, but the entries have no fields. This PR introduces a retry mechanism to avoid this. - How I did it Introduce a retry mechanism to avoid this. - How to verify it New unit test added to verify the flow, plus some manual tests.	2022-06-28 18:48:10 +03:00
Lior Avramov	afd3c63561	Change severity of log messages for cases where docker container was stopped during service checker operation (#11188 ) #### Why I did it There might be a case where service checker periodic operation determined that specific container is running but when it tries to perform an operation on it, it was already closed by the user. This is a valid flow and we should not log an error message, informative warning is enough. #### How I did it I reduce log severity. #### How to verify it I verified it manually.	2022-06-22 14:54:14 -07:00
Senthil Kumar Guruswamy	f37dd770cd	System Ready (#10479 ) Why I did it At present, there is no mechanism in an event driven model to know that the system is up with all the essential sonic services and also, all the docker apps are ready along with port ready status to start the network traffic. With the asynchronous architecture of SONiC, we will not be able to verify if the config has been applied all the way down to the HW. But we can get the closest up status of each app and arrive at the system readiness. How I did it A new python based system monitor tool is introduced under system-health framework to monitor all the essential system host services including docker wrapper services on an event based model and declare the system is ready. This framework gives provision for docker apps to notify its closest up status. CLIs are provided to fetch the current system status and also service running status and its app ready status along with failure reason if any. How to verify it "show system-health sysready-status" click CLI Syslogs for system ready	2022-05-20 13:25:11 -07:00
Junchao-Mellanox	d82eafd8ae	[system-health] Fix file handle leak (#10059 ) - Why I did it swsscommon.ConfigDBConnector does not automatically close connection when the instance is recycled by python. So, it should not create this instance each time calling check_services. It will cause error like Failed to read from file /var/run/hw-management/led/led_status_capability - OSError(24, 'Too many open files') - How I did it Only connect DB once in init - How to verify it Manual test	2022-02-24 11:29:59 +02:00
Junchao-Mellanox	c06cb219e2	Make system health service start early (#9792 ) - Why I did it For SYSTEM READY feature. Currently, there is a booting stage in system health service to indicate that the system is loading SONiC component. This booting stage is no longer needed because SYSTEM READY feature will treat that stage as system "NOT READY". - How I did it 1. Remove booting stage 2. Adjust unit test cases - How to verify it Manual test, Unit test, sonic-mgmt Regression	2022-01-27 13:46:52 +02:00
Junchao-Mellanox	11a93d2f92	[system-health] No longer check critical process/service status via monit (#9068 ) HLD updated here: https://github.com/Azure/SONiC/pull/887 #### Why I did it Command `monit summary -B` can no longer display the status of each critical process, so system-health should not depend on it and needs another way to monitor the status of critical processes. This PR addresses that. monit is still used by system-health for the file system check as well as for customized checks. #### How I did it 1. Get container names from the FEATURE table 2. For each container, collect critical process names from the file critical_processes 3. Use "docker exec -it <container_name> bash -c 'supervisorctl status'" to get the status of processes inside the container, parse the output and check if any critical processes have exited #### How to verify it 1. Add unit test case to cover it 2. Adjust sonic-mgmt cases to cover it 3. Manual test	2021-11-23 15:47:48 -08:00
Qi Luo	ec624e280c	Replace swsssdk.ConfigDBConnector and SonicDBConfig with swsscommon implementation in system-health (#8186 ) swsssdk will be deprecated. Use swsscommon instead.	2021-07-16 19:56:24 -07:00
Aravind Mani	6d83a424b5	[dell]: System Health: Fix ASIC key issue in Dell platform (#6556 ) ASIC key used in system health daemon is not present in Dell platforms. Fixes #6343 Got the thermal sensor list using 2.0 API and retrieved the ASIC keys.	2021-04-05 18:00:38 -07:00
Junchao-Mellanox	2a0351c65c	Check fan speed before check fan status (#6586 ) - Why I did it In thermalctd, when speed of fan exceeds threshold, the fan status will be saved as "bad". So in system health, it is better to check fan speed before fan status. In this case, if fan speed exceeds threshold, we get more detailed information. - How I did it Move fan speed check logic before fan status check - How to verify it Manual test	2021-01-31 09:06:36 +02:00
Joe LeVeque	2d77a36658	[system-health] Make `run_command()` Python 3-compliant (#6371 ) Pass universal_newlines=True parameter to subprocess.Popen(); no longer use .encode('utf-8') on resulting stdout. This was missed in #5886. Note: I would prefer to use text=True instead of universal_newlines=True, as the former is an alias only available in Python 3 and is more understandable than the latter. However, even though the setup.py file for this package only specifies Python 3, the LGTM tool finds other Python 2 code in the repo, validates the code as Python 2 code, and alerts that text=True is an invalid parameter. Will stick with universal_newlines=True for now. Once all Python code in the repo has been converted to Python 3, I will change all universal_newlines=True to text=True.	2021-01-07 05:48:13 -08:00
Joe LeVeque	566ea4f601	[system-health] Convert to Python 3 (#5886 ) - Convert system-health scripts to Python 3 - Build and install system-health as a Python 3 wheel - Also convert newlines from DOS to UNIX	2020-12-29 14:04:09 -08:00
Vaibhav Hemant Dixit	e8da2ee975	Fix post-reboot errors in platforms without sonic_platform package (#6130 ) Refactor determine-reboot cause code. Fix errors seen during determine-reboot-cause when sonic_platform package is not installed. Add error handling for healthd service when sonic_platform package is not installed. Tested on KVM where sonic_platform is not present, and the errors are not seen anymore in syslog.	2020-12-17 12:01:42 -08:00
Joe LeVeque	73825e4d4d	[system-health] Update .gitignore file (#5688 ) Touch up .gitignore file to properly ignore all files generated when building a Python wheel package	2020-10-22 11:58:27 -07:00
Junchao-Mellanox	1c97a03b81	[system-health] Add support for monitoring system health (#4835 ) * system health first commit * system health daemon first commit * Finish healthd * Changes due to lower layer logic change * Get ASIC temperature from TEMPERATURE_INFO table * Add system health make rule and service files * fix bugs found during manual test * Change make file to install system-health library to host * Set system LED to blink on bootup time * Caught exceptions in system health checker to make it more robust * fix issue that fan/psu presence will always be true * fix issue for external checker * move system-health service to right after rc-local service * Set system-health service start after database service * Get system up time via /proc/uptime * Provide more information in stat for CLI to use * fix typo * Set default category to External for external checker * If external checker reported OK, save it to stat too * Trim string for external checker output * fix issue: PSU voltage check always return OK * Add unit test cases for system health library * Fix LGTM warnings * fix demo comments: 1. get boot up timeout from monit configuration file; 2. set system led in library instead of daemon * Remove boot_timeout configuration because it will get from monit config file * Fix argument miss * fix unit test failure * fix issue: summary status is not correct * Fix format issues found in code review * rename th to threshold to make it clearer * Fix review comment: 1. add a .dep file for system health; 2. deprecated daemon_base and uses sonic-py-common instead * Fix unit test failure * Fix LGTM alert * Fix LGTM alert * Fix review comments * Fix review comment * 1. Add relevant comments for system health; 2. 
rename external_checker to user_define_checker * Ignore check for unknown service type * Fix unit test issue * Rename user define checker to user defined checker * Rename user_define_checkers to user_defined_checkers for configuration file * Rename file user_define_checker.py -> user_defined_checker.py * Fix typo * Adjust import order for config.py Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com> * Adjust import order for src/system-health/health_checker/hardware_checker.py Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com> * Adjust import order for src/system-health/scripts/healthd Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com> * Adjust import orders in src/system-health/tests/test_system_health.py * Fix typo * Add new line after import * If system health configuration file not exist, healthd should exit * Fix indent and enable pytest coverage * Fix typo * Fix typo * Remove global logger and use log functions inherited from super class * Change info level logger to notice level Co-authored-by: Joe LeVeque <jleveque@users.noreply.github.com>	2020-10-12 11:12:49 +03:00
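
Note on 03cab99a7a (#14085): the interval compensation described in that commit can be sketched roughly as below. This is a minimal illustration, not the healthd code. The names `compensated_wait` and `run_loop` are hypothetical, and clamping to zero when a check overruns the interval is an assumption the commit message does not spell out.

```python
import time

CHECK_INTERVAL = 60  # seconds, the interval quoted in the commit message


def compensated_wait(interval, elapsed):
    """Return the sleep time so that check + sleep adds up to `interval`.

    Clamped at zero (assumption): if the check took longer than the
    interval, the next iteration starts immediately.
    """
    return max(0.0, interval - elapsed)


def run_loop(check, iterations, interval=CHECK_INTERVAL,
             sleep=time.sleep, clock=time.monotonic):
    """Run `check` `iterations` times, sleeping (interval - X) after each run.

    `sleep` and `clock` are injectable so the loop can be exercised
    without real waiting.
    """
    for _ in range(iterations):
        start = clock()
        check()
        elapsed = clock() - start  # X in the commit message
        sleep(compensated_wait(interval, elapsed))
```

With a check that takes 4 seconds, the loop sleeps 56 seconds, so each iteration lasts about 60 seconds instead of 64.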

36 Commits
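
Note on 5e6e2c827d (#13697): that commit replaces `[]` lookups with `dict.get` because 'LoadState' may be missing during shutdown. A rough sketch of the idea follows. The helper names and the `KEY=VALUE` input format (as printed by `systemctl show`) are assumptions for illustration; the actual parsing in healthd may differ.

```python
def parse_unit_properties(output):
    """Parse KEY=VALUE lines into a dict (hypothetical helper).

    Lines without '=' are skipped.
    """
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def load_state(props, default="unknown"):
    """Look up 'LoadState' without raising KeyError when it is absent.

    props["LoadState"] would raise KeyError on a missing key;
    dict.get returns the default instead.
    """
    return props.get("LoadState", default)


# Example: output captured during shutdown, with no LoadState line.
partial = parse_unit_properties("ActiveState=inactive\nSubState=dead")
print(load_state(partial))
```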