sonic-buildimage

Author	SHA1	Message	Date
yozhao101	aa949cdc74	[docker-syncd] Add timeout to force stop syncd container (#4617 ) - Why I did it When I tested the auto-restart feature of the swss container by manually killing one of its critical processes, swss was stopped. The syncd container, as the peer container, should then also be stopped. However, I found that syncd was sometimes stopped and sometimes not. When syncd is not stopped, the process (/usr/local/bin/syncd.sh stop) that executes the stop() function is stuck in lines 164-167, and systemd waits 90 seconds before killing it. 164 # wait until syncd quit gracefully 165 while docker top syncd$DEV | grep -q /usr/bin/syncd; do 166 sleep 0.1 167 done The first thing I did was to profile how long this while loop spins when the syncd container stops normally after the swss container is stopped. The result is 5 or 6 seconds. When the syncd container stops normally, two messages are written to syslog: str-a7050-acs-3 NOTICE syncd#dsserve: child /usr/bin/syncd exited status: 134 str-a7050-acs-3 INFO syncd#supervisord: syncd [5] child /usr/bin/syncd exited status: 134 The second thing I did was to add a timer to the condition of the while loop so that the loop is forced to exit after 20 seconds. After that, testing showed that the syncd container is stopped normally if swss is stopped first. One more thing I want to mention: if the syncd container stops within 5 or 6 seconds, the two log messages still appear in syslog. However, if the while loop runs longer than 20 seconds and is forced to exit, the syncd container is stopped but I did not see these two messages in syslog. Further, although I observed that the auto-restart feature of the swss container works correctly now, I cannot be sure that the issue of the syncd container failing to stop will not occur in the future. - How I did it I added a timer around the while loop in the stop() function. This while loop will exit after spinning for 20 seconds. Signed-off-by: Yong Zhao <yozhao@microsoft.com>	2020-06-09 16:07:24 +00:00
Prince Sunny	320dcf2008	Sleep done before mismatch handler (#4165 ) * Sleep done before mismatch handler	2020-02-25 16:39:33 +00:00
Prince Sunny	53a2934fc5	Added timeout to ping command (#4123 )	2020-02-06 17:41:38 -08:00
Prince Sunny	c53f09684a	Update arp_update to refresh neighbor entries from APP_DB (#4102 ) * Update arp_update to refresh neighbor entries from APP_DB	2020-02-05 15:42:15 -08:00
Ying Xie	9583a74b47	[swss service] flush fast-reboot enabled flag upon swss stopping (#3908 ) If we need to stop swss during fast-reboot procedure on the boot up path, it means that something went wrong, like syncd/orchagent crashed already, we are stopping and restarting swss/syncd to re-initialize. In this case, we should proceed as if it is a cold reboot. Signed-off-by: Ying Xie <ying.xie@microsoft.com>	2019-12-16 16:04:10 +00:00
pavel-shirshov	b28dd1db7b	[fast-reboot]: Save fast-reboot state into the db [Nov] (#3892 ) - Port changes #3741	2019-12-13 06:07:13 -08:00
Ying Xie	ba88f9c0ae	Revert "[swss.sh] When starting, call 'systemctl restart' on dependents, not (#3807 )" (#3835 ) This reverts commit `351410ea8c`.	2019-12-02 23:56:04 +00:00
Joe LeVeque	3920ac2368	[services] Remove explicit dependencies from dhcp_relay service file, control in swss.sh (#3823 )	2019-11-27 02:21:00 +00:00
Joe LeVeque	8e86a157ff	[swss.sh] When starting, call 'systemctl restart' on dependents, not (#3807 ) 'systemctl start'	2019-11-24 03:26:03 +00:00
Danny Allen	ba77de12ac	[cron.d] Add cron job to periodically clean-up core files (#3449 ) * [cron.d] Create cron job to periodically clean-up core files * Create script to scan /var/core and clean-up older core files * Create cron job to run clean-up script Signed-off-by: Danny Allen <daall@microsoft.com> * Update interval for running cron job * Respond to feedback * Change syslog id	2019-09-13 17:52:10 +00:00
pavel-shirshov	b715ec89c4	[Fast-Reboot]: FR mode is active only first 3 minutes after start. (#3352 ) * Fast reboot mode should be enabled only 3 minutes after restart * Advance sonic-quagga submodule	2019-08-21 21:48:33 +00:00
Ying Xie	d821cb84b8	[radv service] radv service should be a cold only dependent of swss (#3348 ) radv should be left alone during warm restart of swss. Otherwise it will announce departure and cause hosts to lose default gateway. Signed-off-by: Ying Xie <ying.xie@microsoft.com>	2019-08-16 19:46:37 +00:00
Ying Xie	a41d9a5d3f	[service dependent] describe non-warm-reboot dependency outside systemd (#3311 ) * [service dependent] describe non-warm-reboot dependency outside systemctl When dependency was described with systemctl, it will kick in all the time, including under warm reboot/restart scenarios. This is not what we always want. For components that are capable of warm reboot/start, they need to describe dependency in service files. Signed-off-by: Ying Xie <ying.xie@microsoft.com> * [service] teamd service should not require swss service Adding require swss will cause teamd to be killed by systemctl when swss stops. This is not what we want in warm reboot. Signed-off-by: Ying Xie <ying.xie@microsoft.com> * refactoring code * rename functions to match other functions in the file	2019-08-08 22:46:06 +00:00
Joe LeVeque	29bbd86862	[services] Restart SwSS service upon unexpected critical process exit (#2845 ) (#2852 )	2019-07-29 18:10:26 -07:00
Stepan Blyshchak	4b5abd048b	[swss.sh]: Cleanup LAG entries in STATE DB (#3114 ) Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>	2019-07-10 23:04:33 +00:00
Qi Luo	588c687a27	[fast-reboot] fix fast reboot compatibility (#3083 ) and advance sai-redis/201811 point (#3089 ) * fix fast reboot compatibility (#3083) and advance sai-redis/201811 point * Repoint the submodule	2019-06-26 22:02:21 -07:00
Stepan Blyshchak	fae35536c3	[swss.sh] flush FDB table during cold start (#2933 ) Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>	2019-05-29 00:51:09 +00:00
Joe LeVeque	ecec579933	[services] Services which start containers now use 'docker wait' instead of 'docker attach' (#2661 )	2019-03-19 03:05:37 +00:00
Ying Xie	deab95cff6	[swss/syncd] cold start syncd service in swss in attach method (#2639 ) start() is called by service startPre method, which is blocking. Starting syncd service here is causing deadlock. attach() is called by service start method, which is non-blocking. Signed-off-by: Ying Xie <ying.xie@microsoft.com>	2019-03-07 03:30:34 +00:00
Joe LeVeque	c6ccb80803	[services] Ensure swss and syncd services start before dependent services (#2634 ) * [services] Ensure swss and syncd services start before dependent services * Add 'attach' functions to scripts which get installed to /usr/local/bin so that services only reference the one script each * Add 'After=swss.service' to syncd.service	2019-03-07 03:23:13 +00:00
lguohan	c5b0c59b78	[swss]: flush asic db in swss.sh for non warm-boot (#2582 ) We need to flush asic db in swss.sh instead of syncd.sh: orchagent might have already started in swss.sh and put commands into asic db before asic db is flushed in syncd.sh. This causes race conditions such as INIT_VIEW not being passed to syncd. Signed-off-by: Guohan Lu <gulv@microsoft.com>	2019-02-21 18:23:58 +00:00
Stepan Blyshchak	e5daf216fd	[syncd.sh] Don't stop sxdkernel during warm shutdown on Mellanox platform (#2572 ) /etc/init.d/sxdkernel stop may take up to 15 sec which has impact on control plane downtime Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>	2019-02-21 18:11:37 +00:00
Ying Xie	4faa5f2f92	[warm boot] cherry-pick PR #2538 and advance related sub-modules (#2569 ) PR#2538 cannot merge due to master branch status. It has been tested against 201811 branch. Submodule src/sonic-sairedis 21f4a49..d57222a: > Add more specific logic for ingress ACL and buffer profile (#421) > Move warm_restart enable/disable config to stateDB WARM_RESTART_ENABLE_TABLE (#418) > Add support for vlan tagged frames in virtual switch (#417) Submodule src/sonic-swss 1590030..584490c: > Move warm_restart enable/disable config to stateDB WARM_RESTART_ENABLE_TABLE (#786) > [vstest]: Potential fix for timing issue in warm_reboot's routing UT (#788) Submodule src/sonic-swss-common 594f4e8..286ef34: > Move warm_restart enable/disable config to stateDB WARM_RESTART_ENABLE_TABLE (#260) Submodule src/sonic-utilities c6666e2..b44b462: > Move warm_restart enable/disable config to stateDB WARM_RESTART_ENABL… (#458) > [aclshow] output only counters per table/rule (#442) Signed-off-by: Ying Xie <ying.xie@microsoft.com> [PR 2538] Move warm_restart enable/disable config to stateDB WARM_RESTART_ENABLE_TABLE Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>	2019-02-14 12:12:55 -08:00
Ying Xie	24bce77def	[swss/syncd] log swss/syncd service script activities (#2545 ) Signed-off-by: Ying Xie <ying.xie@microsoft.com>	2019-02-14 17:04:21 +00:00
Prince Sunny	e9125b944a	[swss]: Change VrfMgrd startup order, cleanup VRF_TABLE from state DB (#2510 )	2019-02-02 19:39:42 +00:00
stepanblyschak	ff526dd103	[mellanox|ffb] use system level warm reboot for Mellanox fastfast boot (#2374 ) * [mellanox|ffb] use system level warm reboot for Mellanox fastfast boot Signed-off-by: Stepan Blyschak <stepanb@mellanox.com> * [mellanox|ffb] add comments for mellanox start/stop drivers section Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>	2019-01-10 14:09:03 -08:00
Volodymyr Samotiy	b506241b84	[syncd]: Fix reload flow for Mellanox platforms (#2386 ) * Perform stop/start of Mellanox driver tools for all types of reboot * Don't set Mellanox FAST_BOOT option for "cold" reboot * Don't send "syncd_request_shutdown" event for "cold" reboot on Mellanox platforms Signed-off-by: Volodymyr Samotiy <volodymyrs@mellanox.com>	2018-12-15 11:36:12 -08:00
Volodymyr Samotiy	75b41233d2	[Mellanox|FFB]: Add support for Mellanox fast-fast boot (#2294 ) * [mlnx|ffb] Add support for mellanox fast-fast boot Signed-off-by: Stepan Blyschak <stepanb@mellanox.com> * [mlnx|ffb]: Add support of "config end" event for mlnx fast-fast boot Signed-off-by: Volodymyr Samotiy <volodymyrs@mellanox.com> * [Mellanox|FFB]: Fix review comments * Change naming convention from "fast-fast" to "fastfast" Signed-off-by: Volodymyr Samotiy <volodymyrs@mellanox.com>	2018-12-04 10:11:24 -08:00
Ying Xie	4abbe43463	[syncd] skip ledinit during syncd warm start (#2285 ) * [syncd] skip ledinit during syncd warm start Signed-off-by: Ying Xie <ying.xie@microsoft.com>	2018-11-21 17:56:19 -08:00
Ying Xie	5c8650aaaa	[swss service] don't clear WARM_RESTART table (#2256 ) Clearing the WARM_RESTART table could cause component-level warm restart to fail due to missing WARM_RESTART state. Signed-off-by: Ying Xie <ying.xie@microsoft.com>	2018-11-15 22:04:53 -08:00
Ying Xie	8598ccaf84	[syncd] extend syncd service script to support both warm/cold shutdown (#2238 ) - cold shutdown is used by regular service stop and/or fast reboot - warm shutdown is used by warm restart and/or warm reboot Signed-off-by: Ying Xie <ying.xie@microsoft.com>	2018-11-15 15:47:33 -08:00
stepanblyschak	447ae7b61a	[mlnx] Fix fast reboot (#2237 ) Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>	2018-11-09 21:54:20 -08:00
Shuotian Cheng	110355201b	[swss]: Update swss.sh script to clean up specific db when start (#2223 ) This script shall not flush all the entries in the state database when it starts up, since there are entries maintained and written by other processes outside this docker. The issue we noticed was that the portchannel states are cleaned up after teamsyncd writes the entries into the database, which causes the IPs to fail to be configured because intfmgrd considers the portchannels not ready yet. Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>	2018-11-03 12:32:46 -07:00
Ying Xie	f3ab8cdf9a	[warm boot] syncd warm start could be individual warm start (#2147 ) Signed-off-by: Ying Xie <ying.xie@microsoft.com>	2018-10-16 11:20:39 -07:00
Kevin(Shengkai) Wang	ea4b4bd650	[mellanox]: Update recipe for hw-mgmt according to latest changes (#2128 ) Update the hw-mgmt to latest release V.2.0.0060. Update the related files according to the latest hw-mgmt. Signed-off-by: Kevin Wang <kevinw@mellanox.com>	2018-10-08 18:33:44 -07:00
Jipan Yang	dedd5624a0	Adapt to the new WARM_RESTART_TABLE table schema: change from restart… (#2083 ) * Adapt to the new WARM_RESTART_TABLE table schema: change from restart_count to restore_count Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com> * Update variable and function name to match restore_count name change Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com> * Update swss submodule for warm restart schema change Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>	2018-10-02 06:08:26 -07:00
Ying Xie	c8e6b15504	[syncd] warm shutdown syncd process when warm boot is enabled (#2078 ) * [syncd] warm shutdown syncd process when warm boot is enabled Signed-off-by: Ying Xie <ying.xie@microsoft.com> * [warmboot] mount folder to hold warmboot temporary files Signed-off-by: Ying Xie <ying.xie@microsoft.com> * Fix a typo	2018-10-01 19:01:04 -07:00
Ying Xie	cfe01f19e4	Separate syncd service from swss service (#2051 ) * [swss.sh] refactor ssh service script code - Move checks and waits to helper functions. - Remove early returns from code stream Signed-off-by: Ying Xie <ying.xie@microsoft.com> * [swss.sh] Add debug log for service state changes Signed-off-by: Ying Xie <ying.xie@microsoft.com> * [syncd] Separate out syncd service from swss service Still make them start/stop/restart synchronously so existing scripts continue working. Signed-off-by: Ying Xie <ying.xie@microsoft.com> * Remove extra 'After' in swss service and remove syncd docker warm boot code Syncd warm boot needs more thinking, we can put it back once the work flow has been defined and ready for coding/testing. * [syncd] syncd start/stop/restart shouldn't affect swss state Semi-detach syncd service state change from swss: - swss state change still chase syncd service to follow except warm boot - syncd state change will only affect itself. Signed-off-by: Ying Xie <ying.xie@microsoft.com> * add missing '{'	2018-09-24 16:35:01 -07:00
Jipan Yang	3f37b96de6	[swss]: Add support for swss docker warm restart (#1982 ) Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>	2018-08-25 01:39:09 -07:00
lguohan	80c6453731	[swss]: simplify swss systemd service file (#1965 ) move the swss service start/stop logic into /usr/local/bin/swss.sh Signed-off-by: Guohan Lu <gulv@microsoft.com>	2018-08-22 13:02:32 -07:00
zhenggen-xu	d761630f73	Fix potential blackholing/looping traffic when link-local was used and refresh ipv6 neighbor to avoid CPU hit (#1904 ) * Fix potential blackholing/looping traffic and refresh ipv6 neighbor to avoid CPU hit In case ipv6 global addresses were configured on L3 interfaces and used for peering, and routing protocol was using link-local addresses on the same interfaces as preferred nexthops, the link-local addresses could be aged out after a while due to no activities towards the link-local addresses themselves. And when we receive new routes with the link-local nexthops, SONiC won't insert them to the HW, and thus cause looping or blackholing traffic. Global ipv6 addresses on L3 interfaces between switches are refreshed by BGP keepalive and other messages. On server facing side, traffic may hit forwarding plane only, and no refresh for the ipv6 neighbor entries regularly. This could age-out the linux kernel ipv6 neighbor entries, and HW neighbor table entries could be removed, and thus traffic going to those neighbors would hit CPU, and cause traffic drop and temporary CPU high load. Also, if link-local addresses were not learned, we may not get them at all later. It is intended to fix all above issues. Changes: Add ndisc6 package in swss docker and use it for ipv6 ndp ping to update the neighbors' state on Vlan interfaces Change the default ipv6 neighbor reachable timer to 30mins Add periodical ipv6 multicast ping to ff02::11 to get/refresh link-local neighbor info. * Fix review comments: Add PORTCHANNEL_INTERFACE interface for ipv6 multicast ping format issue * Combine regular L3 interface and portchannel interface for looping * Add ndisc6 package to vs docker	2018-08-12 03:14:55 -07:00
pavel-shirshov	c52fb762dd	Convert arp_update into a 'start-it-once' mode (#1864 ) * Run arp_update just once, don't restart it. It will run continuously with 5 min pauses	2018-07-18 13:04:57 -07:00
Joe LeVeque	a36527a6a5	Store ConfigDB init indicator boolean value as 1/0 in Redis to be language-agnostic (#1352 )	2018-01-30 15:04:52 -08:00
pavel-shirshov	8cfa223ef9	[scripts]: Fix issues with checking status of the DB. Use one approach everywhere. (#1323 )	2018-01-18 19:55:11 -08:00
lguohan	b907e4e9f5	[vs]: add vlan configuration support in virtual switch (#1200 )	2017-11-30 14:59:25 -08:00

45 Commits
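
Note on aa949cdc74: the commit message describes bounding the wait loop in stop() with a 20-second timer, but the changed code itself does not appear in the message. The following is only a sketch of how such a bounded wait could look, not the code from the commit. The helper names wait_until_gone and syncd_running are hypothetical; the docker top check and the 0.1-second sleep are taken from the lines quoted in the message.

```shell
# Hypothetical helper (not from the commit): keep polling while the given
# command succeeds, but give up once timeout_secs seconds have elapsed.
# Returns 0 if the command stopped succeeding, 1 if the timeout was hit.
wait_until_gone() {
    local timeout_secs=$1
    shift
    local deadline=$(( $(date +%s) + timeout_secs ))
    while "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep 0.1
    done
    return 0
}

# The check quoted in the commit message, wrapped in a function.
syncd_running() {
    docker top "syncd$DEV" 2>/dev/null | grep -q /usr/bin/syncd
}

# Possible use inside a stop() function (20 seconds, as in the commit):
#   wait_until_gone 20 syncd_running || echo "syncd still running after 20s"
```

Returning a distinct status on timeout lets the caller decide what to do next (log, force-stop the container, and so on) instead of leaving that to the 90-second systemd kill mentioned in the message.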