Why I did it
There is an issue on the Arista PikeZ platform (using T3.X2: BCM56274) while running SONiC. If the 'syncd' container in SONiC is restarted, the expected behaviour is that syncd will automatically restart/recover; however, it does not, and it always fails at create_switch because the BCM SDK kmod DMA operation cancellation gets stuck.
Sep 16 22:19:44.855125 pkz208 ERR syncd#syncd: [none] SAI_API_SWITCH:platform_process_command:428 Platform command "init soc" failed, rc = -1.
Sep 16 22:19:44.855206 pkz208 INFO syncd#supervisord: syncd CMIC_CMC0_PKTDMA_CH4_DESC_COUNT_REQ:0x33#015
Sep 16 22:19:44.855264 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:platformInit:1909 initialization command "init soc" failed, rc = -1 (Internal error).
Sep 16 22:19:44.855403 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:sai_driver_init:642 Error initializing driver, rc = -1.
...
Sep 16 22:19:44.855891 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:brcm_sai_create_switch:1173 initializing SDK failed with error Operation failed (0xfffffff5).
Reloading the BCM SDK kmods allows the switch init to continue properly.
How I did it
In the syncd docker start script, if the BCM SDK kmods are already loaded, unload them and load them again.
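A minimal sketch of that check, assuming the usual Broadcom SDK module names (linux-kernel-bde, linux-user-bde, linux-bcm-knet); the actual module list and its placement in the start script may differ.

```bash
# Sketch only: reload the BCM SDK kmods if a previous instance left them loaded.
# Module names are illustrative; the real set is platform/SDK dependent.
for mod in linux_bcm_knet linux_user_bde linux_kernel_bde; do
    if lsmod | grep -q "^${mod} "; then
        rmmod ${mod}
    fi
done
modprobe linux-kernel-bde
modprobe linux-user-bde
modprobe linux-bcm-knet
```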
How to verify it
Steps to reproduce:
In SONiC, run 'docker ps' to see current running containers; 'syncd' should be present.
Run 'docker stop syncd'
Wait ~1 minute.
Run 'docker ps' to see that syncd is missing.
Check logs to see messages similar to the above.
Signed-off-by: Michael Li <michael.li@broadcom.com>
Change `sxdkernel start` to `sxdkernel restart`. If the `syncd` service crashes in `ExecStartPre`, systemd will not call `ExecStop` and thus will not call `sxdkernel stop`. Using `sxdkernel restart` is more robust in terms of guaranteeing that the system is restored after an unexpected crash.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
- Why I did it
When LLDP is disabled through the feature command, it still gets spawned after a reboot.
- How I did it
In syncd.sh, check whether the LLDP service is enabled before spawning it automatically during a cold reboot.
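A minimal sketch of that kind of check, assuming the feature state lives in the CONFIG_DB FEATURE table; the exact key, field, and start mechanism here are assumptions.

```bash
# Sketch only: do not spawn lldp if the feature is disabled in CONFIG_DB.
lldp_state=$(sonic-db-cli CONFIG_DB HGET "FEATURE|lldp" state)
if [[ "$lldp_state" == "disabled" ]]; then
    logger -t syncd.sh "LLDP feature is disabled, skipping lldp start"
else
    systemctl start lldp
fi
```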
- How to verify it
Disable the lldp feature, perform a cold reboot, and verify LLDP is not spawned.
- Why I did it
A recent change that delays the PMON service on fast/warm reboot introduced an issue when only the SWSS service is restarted after a fast/warm reboot on the Nvidia platform.
Since the timer is triggered only when the system boots, if the user restarts the SWSS service after a fast/warm reboot, the syncd.sh script stops the PMON service but the timer does not start again.
- How I did it
In the syncd.sh script, when a fast/warm indication is present, check whether pmon.timer is running.
If it is running, this is the first boot, so continue normally.
If it is not running, the service was restarted, so start the timer to keep the system behavior consistent.
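A minimal sketch of that check; the exact placement inside syncd.sh and the surrounding fast/warm-boot condition are omitted here.

```bash
# Sketch only: re-arm pmon.timer when SWSS/syncd is restarted after a fast/warm boot.
if systemctl is-active --quiet pmon.timer; then
    : # first boot after fast/warm reboot: the timer is already counting down
else
    systemctl start pmon.timer   # service restart: start the timer ourselves
fi
```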
- How to verify it
Run fast/warm reboot.
Run 'service swss restart'.
Observe PMON service starting.
Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
- Why I did it
Profiling the system state on init after fast-reboot, during create_switch function execution, it is possible to see a few Python scripts running at the same time.
This parallel execution consumes CPU time, and the duration of create_switch is longer than it should be.
Following this finding, and to ensure these services will not interfere in the future, PMON is delayed by 90 seconds until the system finishes the init flow after fastboot.
- How I did it
Add a timer for the PMON service.
On the MLNX platform, exclude the start trigger of PMON when syncd starts in case of fastboot.
Copy the timer file to the host bin image.
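A minimal sketch of what such a timer unit could look like; the 90-second delay comes from the description above, while the file name, description, and remaining directives are assumptions.

```
[Unit]
Description=Delays pmon until SONiC has finished the boot-time init flow
PartOf=pmon.service

[Timer]
OnBootSec=1min 30s
Unit=pmon.service

[Install]
WantedBy=timers.target
```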
- How to verify it
Run fast-reboot on MLNX platform and observe faster create_switch execution time.
- Why I did it
Profiling the system state on init after fast-reboot, during create_switch function execution, it is possible to see a few Python scripts running at the same time.
This parallel execution consumes CPU time, and the duration of create_switch is longer than it should be.
Following this finding, and to ensure these services will not interfere in the future, LLDP is delayed by 90 seconds until the system finishes the init flow after fastboot.
- How I did it
Add a timer for LLDP service.
Copy the timer file to the host bin image.
- How to verify it
Run fast-reboot on MLNX platform and observe faster create_switch execution time.
This PR is dependent on PR: #10567
Why I did it
Improve the startup of the Barefoot SDK.
How I did it
Restart the interface through which communication takes place between SONiC and openBMC in order to clean its tx queue.
How to verify it
Run the SONiC autorestart tests.
**- Why I did it**
To support FW upgrade on init.
**- How I did it**
Change timeout value
**- How to verify it**
I manually changed ASIC and Gearbox FW followed by hard reset in order for FW upgrade to take place on init.
Signed-off-by: liora <liora@nvidia.com>
* buildimage: Add gearbox phy device files and a new physyncd docker to support VS gearbox phy feature
* scripts and configuration needed to support a second syncd docker (physyncd)
* physyncd supports gearbox device and phy SAI APIs and runs multiple instances of syncd, one per phy in the device
* support for VS target (sonic-sairedis vslib has been extended to support a virtual BCM81724 gearbox PHY).
HLD is located at b817a12fd8/doc/gearbox/gearbox_mgr_design.md
**- Why I did it**
This work is part of the gearbox phy joint effort between Microsoft and Broadcom, and is based
on multi-switch support in sonic-sairedis.
**- How I did it**
Overall feature was implemented across several projects. The collective pull requests (some in late stages of review at this point):
https://github.com/Azure/sonic-utilities/pull/931 - CLI (merged)
https://github.com/Azure/sonic-swss-common/pull/347 - Minor changes (merged)
https://github.com/Azure/sonic-swss/pull/1321 - gearsyncd, config parsers, changes to orchagent to create gearbox phy on supported systems
https://github.com/Azure/sonic-sairedis/pull/624 - physyncd, virtual BCM81724 gearbox phy added to vslib
**- How to verify it**
In a vslib build:
root@sonic:/home/admin# show gearbox interfaces status
PHY Id Interface MAC Lanes MAC Lane Speed PHY Lanes PHY Lane Speed Line Lanes Line Lane Speed Oper Admin
-------- ----------- --------------- ---------------- --------------- ---------------- ------------ ----------------- ------ -------
1 Ethernet48 121,122,123,124 25G 200,201,202,203 25G 204,205 50G down down
1 Ethernet49 125,126,127,128 25G 206,207,208,209 25G 210,211 50G down down
1 Ethernet50 69,70,71,72 25G 212,213,214,215 25G 216 100G down down
In addition, docker ps | grep phy should show a physyncd docker running.
Signed-off-by: syd.logan@broadcom.com
**- Why I did it**
When I tested the auto-restart feature of the swss container by manually killing one of its critical processes, swss was stopped. The syncd container, as the peer container, should then also be stopped as expected. However, I found that sometimes the syncd container could be stopped and sometimes it could not. The reason the syncd container cannot be stopped is that the process executing the stop() function (/usr/local/bin/syncd.sh stop) gets stuck between lines 164–167. Systemd will wait for 90 seconds and then kill this process.
164 # wait until syncd quit gracefully
165 while docker top syncd$DEV | grep -q /usr/bin/syncd; do
166 sleep 0.1
167 done
The first thing I did was profile how long this while loop spins when the syncd container is stopped normally after the swss container is stopped. The result is 5 or 6 seconds. If the syncd container stops normally, two messages are written into syslog:
str-a7050-acs-3 NOTICE syncd#dsserve: child /usr/bin/syncd exited status: 134
str-a7050-acs-3 INFO syncd#supervisord: syncd [5] child /usr/bin/syncd exited status: 134
The second thing I did was add a timer to the condition of the while loop to ensure the loop is forced to exit after 20 seconds:
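A minimal sketch of such a bounded loop; the variable names and the use of bash's SECONDS counter here are assumptions, not the exact change.

```bash
# Sketch only: wait for syncd to quit gracefully, but give up after ~20 seconds.
start_in_secs=${SECONDS}
while docker top syncd$DEV | grep -q /usr/bin/syncd \
      && (( SECONDS - start_in_secs < 20 )); do
    sleep 0.1
done
```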
After that, the testing result is that the syncd container can be stopped normally if swss is stopped first. One more thing worth mentioning: if the syncd container is stopped within 5 or 6 seconds, the two log messages can still be seen in syslog. However, if the while loop runs longer than 20 seconds and is forced to exit, the syncd container is still stopped, but I did not see these two messages in syslog. Further, although I observed that the auto-restart feature of the swss container works correctly right now, I cannot rule out that the issue where the syncd container cannot be stopped will occur again in the future.
**- How I did it**
I added a timer around the while loop in the stop() function. The loop now exits after spinning for 20 seconds.
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
* Multi DB with namespace support: introduce the database_global.json file to support accessing DBs in other namespaces for services running on the Linux host
* Updates based on comments
* Adding the j2 templates for database_config and database_global files.
* Updating to retrieve the redis directories to be mounted from the database_global.json file.
* Additional check to see if asic.conf file exists before sourcing it.
* Updates based on PR comments discussion.
* Review comments update
* Updates to the argument "-n" for namespace used in both context of parsing minigraph and multi DB access.
* Update with the attribute "persistence_for_warm_boot" that was added to database_config.json file earlier.
* Removing the database_config.json file to avoid confusion in future.
We use the database_config.json.j2 file to generate database_config.json files dynamically.
* Update the comments for sudo usage in docker_image_ctrl.j2
* Update with the new logic in PING/PONG tests using sonic-db-cli; with this we wait until the PONG response is received once the redis server is up (see the sketch after this list).
* Similar changes in the swss and syncd scripts for the PING tests with sonic-db-cli
* Add a missing comma in the database_config.json.j2 file; do a pip install of j2cli in docker-base-buster.
* [MultiDB] (except ./src and ./dockers dirs): replace redis-cli with sonic-db-cli and use new DBConnector
* update comment for a potential bug
* update comment
* add TODO marker as a review requirement
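A minimal sketch of the PING wait described in this list, assuming `sonic-db-cli PING` answers PONG once the redis server is up; the retry interval is illustrative.

```bash
# Sketch only: block until redis answers PONG via sonic-db-cli.
until sonic-db-cli PING 2>/dev/null | grep -q PONG; do
    sleep 1
done
```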
ASIC reset events are captured by hw-mgmt, and hw-mgmt calls chipup/chipdown internally without OS interaction
Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
Put a fast-reboot flag into the DB using the EXPIRE feature. This flag is used in other parts of SONiC to start in fast-reboot mode. If we reload the config, the flag is removed from the DB.
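A minimal sketch of how such an expiring flag could be set and consumed; the key name, STATE_DB index, and 180-second TTL are assumptions here.

```bash
# Sketch only: set a fast-reboot flag that expires on its own.
redis-cli -n 6 SET "FAST_REBOOT|system" "1" EX 180
# Sketch only: other SONiC components can check the flag while it is still live.
redis-cli -n 6 GET "FAST_REBOOT|system"
```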
Issue Overview
shutdown flow
For any shutdown flow, which means all dockers are stopped in order, the pmon docker stops after the syncd docker has stopped, causing the pmon docker to fail to release sx_core resources and leaving sx_core in a bad state. The related logs look like the following:
INFO syncd.sh[23597]: modprobe: FATAL: Module sx_core is in use.
INFO syncd.sh[23597]: Unloading sx_core[FAILED]
INFO syncd.sh[23597]: rmmod: ERROR: Module sx_core is in use
config reload & service swss.restart
In the flows like "config reload" and "service swss restart", the failure cause further consequences:
sx_core initialization error with error message like "sx_core: create EMAD sdq 0 failed. err: -16"
syncd fails to execute the create switch api with error message "syncd_main: Runtime error: :- processEvent: failed to execute api: create, key: SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000, status: SAI_STATUS_FAILURE"
swss fails to call the SAI API "SAI_SWITCH_ATTR_INIT_SWITCH", which causes orchagent to restart. This introduces an extra 1 or 2 minutes before the system is available, failing related test cases.
reboot, warm-reboot & fast-reboot
In the reboot flows, including "reboot", "fast-reboot" and "warm-reboot", this failure doesn't have further negative effects since the system has already rebooted. In addition, "warm-reboot" requires the system to be shut down as soon as possible to meet the GR time restriction of both BGP and LACP. "fast-reboot" also requires meeting the GR time restriction of BGP, which is longer than LACP's. In this sense, any unnecessary steps should be avoided, so it's better to keep those flows untouched.
summary
To summarize, we have to come up with a way to ensure:
shut down the pmon docker ahead of syncd for the "config reload" and "service swss restart" flows;
do not shut down the pmon docker ahead of syncd for the "fast-reboot" and "warm-reboot" flows, in order to save time;
for the "reboot" flow, either order is acceptable.
Solution
To solve the issue, pmon should be stopped before syncd is stopped for all flows except warm-reboot.
- How I did it
Stop pmon before syncd is stopped. This is done in /usr/local/bin/syncd.sh::stop() for all shutdown sequences.
Now that pmon stops ahead of syncd, there must be a way for pmon to start after syncd has started. Another point that should be taken into consideration is that the pmon start should be deferred so that services which implement graceful restart for fast-reboot and warm-reboot have sufficient CPU cycles to meet their deadlines.
This is done by adding "syncd.service" as "After" to pmon.service and starting pmon in /usr/local/bin/syncd.sh::wait(), so that pmon starts automatically after syncd has started.
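A minimal sketch of the ordering this describes; the function bodies below are reduced to the pmon-related calls and a docker wait, and the warm-boot check, unit names, and everything else are simplified assumptions.

```bash
# Sketch only: relevant shape of /usr/local/bin/syncd.sh after the change.
stop() {
    # Stop pmon before syncd so pmon can release sx_core resources,
    # except on warm reboot where shutdown time matters most.
    if [[ x"$WARM_BOOT" != x"true" ]]; then
        systemctl stop pmon
    fi
    # ... existing logic that stops the syncd container and unloads sx_core ...
}

wait() {
    # pmon.service has After=syncd.service; starting it here defers pmon
    # until syncd is already running.
    systemctl start pmon
    docker wait syncd$DEV
}
```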
* fix fast reboot compatibility
We should handle both cases for backward compatibility with 201803 (see the sketch after this list):
- fast-reboot
- SONIC_BOOT_TYPE=fast-reboot
* handle review comments
* add a comment that getBootType code snippet is shared between two files
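A minimal sketch of boot-type detection that accepts both forms on the kernel command line; the exact patterns and variable names in the shared getBootType snippet may differ.

```bash
# Sketch only: derive the boot type from /proc/cmdline, accepting both the
# legacy "fast-reboot" token and the newer SONIC_BOOT_TYPE= form.
function getBootType()
{
    case "$(cat /proc/cmdline)" in
    *SONIC_BOOT_TYPE=warm*)
        TYPE='warm'
        ;;
    *SONIC_BOOT_TYPE=fast*|*fast-reboot*)
        TYPE='fast'
        ;;
    *)
        TYPE='cold'
        ;;
    esac
    echo "${TYPE}"
}
```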
* [services] Ensure swss and syncd services start before dependent services
* Add 'attach' functions to scripts which get installed to /usr/local/bin so that services only reference the one script each
* Add 'After=swss.service' to syncd.service
Need to flush the ASIC DB in swss.sh instead of syncd.sh.
Orchagent might already have started in swss.sh and put commands into the ASIC DB before it is flushed in syncd.sh. This causes a race condition, such as INIT_VIEW not being passed to syncd.
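A minimal sketch of the flush moved into swss.sh; the database index (1 for ASIC_DB) follows the default layout but is an assumption here.

```bash
# Sketch only: clear ASIC_DB before orchagent is started from swss.sh,
# so no commands can land in it ahead of the flush.
redis-cli -n 1 FLUSHDB
```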
Signed-off-by: Guohan Lu <gulv@microsoft.com>
* Perform stop/start of Mellanox driver tools for all types of reboot
* Don't set Mellanox FAST_BOOT option for "cold" reboot
* Don't send "syncd_request_shutdown" event for "cold" reboot on Mellanox platforms
Signed-off-by: Volodymyr Samotiy <volodymyrs@mellanox.com>
- cold shutdown is used by regular service stop and/or fast reboot
- warm shutdown is used by warm restart and/or warm reboot
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Update the hw-mgmt to latest release V.2.0.0060.
Update the related files according to the latest hw-mgmt.
Signed-off-by: Kevin Wang <kevinw@mellanox.com>
* [syncd] warm shutdown syncd process when warm boot is enabled
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* [warmboot] mount folder to hold warmboot temporary files
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* Fix a typo
* [swss.sh] refactor swss service script code
- Move checks and waits to helper functions.
- Remove early returns from code stream
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* [swss.sh] Add debug log for service state changes
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* [syncd] Separate out syncd service from swss service
Still make them start/stop/restart synchronously so existing scripts
continue working.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* Remove extra 'After' in swss service and remove syncd docker warm boot code
Syncd warm boot needs more thinking, we can put it back once the work
flow has been defined and ready for coding/testing.
* [syncd] syncd start/stop/restart shouldn't affect swss state
Semi-detach syncd service state change from swss:
- swss state changes still chase the syncd service to follow, except during warm boot
- syncd state changes only affect syncd itself.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* add missing '{'