sonic-buildimage

Author	SHA1	Message	Date
abdosi	4111c25557	updated internal route policy for chassis-packet (#15349) What I did: Workaround for the issue seen in FRRouting/frr#13682. There appears to be a timing issue: when multiple recursive lookups are needed to resolve the nexthop of a route, resolution may not complete correctly, leaving the route in an inactive state. The issue is seen on chassis-packet because two levels of recursive lookup are needed for a given e-BGP learnt route: - Level 1 to resolve the e-BGP peer (connected route via bgp) over Loopback4096 (i-BGP peering) - Level 2 to resolve Loopback4096 over the backend port-channel next-hops. For VOQ chassis there is no e-BGP peer (connected route via bgp) resolution, as the route is added as a static route by orchagent over Ethernet-IB. Also as part of this change, remove the route-map policy from instance.conf.j2, as the same policy is defined in peer-group.j2. Microsoft ADO: https://msazure.visualstudio.com/One/_workitems/edit/24198507 How I verify: Functional verification done manually. Updated UT. We will be adding a sanity check in sonic-mgmt to make sure no routes are in an inactive state. Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>	2023-06-10 14:32:44 +08:00
mssonicbld	99d6003717	Changes to support TSA from supervisor (#14691) (#14878)	2023-04-28 21:11:55 +08:00
Tejaswini Chadaga	37be88bef2	Fix VOQ_CHASSIS_V6_PEER route-map config (#14055) * Fix typo in VOQ_CHASSIS_V6_PEER route-map config * Updated UT files with the changed config	2023-03-19 22:32:47 +08:00
Longxiang Lyu	918e2d11f8	[dualtor] Let T0 delay 10 seconds before sending BGP updates (#12996 ) Why I did it To ensure, that after a BGP startup, dualtor T0 receives BGP updates before sending out BGP updates. Please refer to sonic-net/SONiC#1161 for more details. How I did it add coalesce-time 10000 to the frr bgp startup config. Signed-off-by: Longxiang Lyu <lolv@microsoft.com>	2023-02-04 09:54:05 +08:00
Zain Budhwani	24be87504f	Change bgp notification leaf name and mem_usage leaf type (#13012 ) #### Why I did it Improve naming convention for bgp notification events and change type of leaf for sonic-events-host mem usage from uint64 to decimal64 #### How I did it Replace "-" with "_" Replace uint64 with decimal64 #### How to verify it Run yang model unit tests #### Description for the changelog Change YANG model leaf naming convention for bgp notification	2023-02-04 09:53:57 +08:00
Junchao-Mellanox	e631f426f4	[infra] Support syslog rate limit configuration (#12490 ) (#13535 ) Backport of https://github.com/sonic-net/sonic-buildimage/pull/12490 into 202211 - Why I did it Support syslog rate limit configuration feature - How I did it Remove unused rsyslog.conf from containers Modify docker startup script to generate rsyslog.conf from template files Add metadata/init data for syslog rate limit configuration - How to verify it Manual test New sonic-mgmt regression cases	2023-01-30 20:11:44 +02:00
Arnaud	9d3814045b	[docker-fpm-frr]: Add unified-split mode to routing config (#11938 ) - Why I did it The values for config_db "docker_routing_config_mode" are: separated: FRR config generated from ConfigDB, each FRR daemon has its own config file unified: FRR config generated from ConfigDB, single FRR config file split: FRR config not generated from ConfigDB, each FRR daemon has its own config file This commit adds: split-unified: FRR config not generated from ConfigDB, single FRR config file - How I did it In docker_init.sh, when split-unified is used, the FRR configs are not generated from ConfigDB. What's more, "service integrated-vtysh-config" is configured in vtysh.conf. - How to verify it FRR config not overwritten when FRR container starts. Signed-off-by: Arnaud le Taillanter <a.letaillanter@criteo.com>	2022-11-14 10:37:48 -08:00
Caitlin Choate	66f1cc458d	Bugfix #9739 : Support when 'bgp_asn' is set to 'None', 'Null', or missing. (#12588 ) bgpd.main.conf.j2: bugfix-9739 * Update bgpd.main.conf.j2 to gracefully handle the bgp configuration cases for when 'bgp_asn' is set to 'None', 'Null', or missing. How I did it Include a conditional statement to avoid configuring bgp in FRR when 'bgp_asn' is missing or set to 'None' or 'Null' How to verify it Configure 'bgp_asn' as 'None', 'Null' or have it missing from configurations and verify that /etc/frr/bgpd.conf does not have invalid bgp configurations like 'router bgp None' Description for the changelog Update bgpd.main.conf.j2 to gracefully handle the bgp configuration cases for when 'bgp_asn' is set to 'None', 'Null', or missing for bugfix 9739. Signed-off-by: cchoate54@gmail.com	2022-11-08 16:53:14 -08:00
tjchadaga	763d3dc29d	Allow TSA on ibgp sessions between linecards on packet chassis (#12589)	2022-11-03 08:54:33 -07:00
Zain Budhwani	09fe3f467f	Add Structured Events w/ YANG Models (#12270 ) Add events for dhcp-relay, bgp, syncd, & kernel.	2022-10-09 20:23:31 -07:00
Zain Budhwani	fd6a1b0ce2	Add events to host and create rsyslog_plugin deb pkg (#12059 ) Why I did it Create rsyslog plugin deb for other containers/host to install Add events for bgp and host events	2022-09-21 09:20:53 -07:00
Renuka Manavalan	31e750ee0b	Fix PR build failure (#11973 ) Some PR builds fails to find this file. Remove it temporarily until we root cause it	2022-09-06 15:13:05 -07:00
Zain Budhwani	6a54bc439a	Streaming structured events implementation (#11848 ) With this PR in, you flap BGP and use events_tool to see the published events. With telemetry PR #111 in and corresponding submodule update done in buildimage, one could run gnmi_cli to capture BGP flap events.	2022-09-03 07:33:25 -07:00
Hasan Naqvi	2d4ab9e979	Bullseye frr (#11777 ) Why I did it Migrate FRR to bullseye How I did it Makefile and docker config changes to refer to bullseye instead of buster. How to verify it Build bullseye frr docker. Co-authored-by: Rajendra Dendukuri <rajendra.dendukuri@broadcom.com>	2022-08-21 17:04:47 -07:00
tjchadaga	cdd2786117	Fix for TSA error logging on multi-asic (#11519)	2022-07-30 22:16:58 -07:00
tjchadaga	077a537b14	Log message fix for TSB (#11441)	2022-07-14 12:26:58 -07:00
tjchadaga	849eb4bf32	Changes to persist TSA/B state across reloads (#11257)	2022-07-12 00:22:48 -07:00
Sudharsan Dhamal Gopalarathnam	14f6f70ca3	[BGP]Adding configuration knob to allow advertise Loopback ipv6 /128 prefix (#10958 ) * [BGP]Adding configuration knob to allow advertise Loopback ipv6 /128 prefix By default when IPv6 address is configured with /128 as subnet mask in Loopback0 interface, it will be advertised as prefix with /64 subnet. To control this behavior a new field 'bgp_adv_lo_prefix_as_128' is introduced in DEVICE_METADATA table which when set to true will advertise prefix with /128 subnet as it is.	2022-06-06 08:51:04 -07:00
Kalimuthu-Velappan	bc30528341	Parallel building of sonic dockers using native dockerd(dood). (#10352 ) Currently, the build dockers are created as a user dockers(docker-base-stretch-<user>, etc) that are specific to each user. But the sonic dockers (docker-database, docker-swss, etc) are created with a fixed docker name and common to all the users. docker-database:latest docker-swss:latest When multiple builds are triggered on the same build server that creates parallel building issue because all the build jobs are trying to create the same docker with latest tag. This happens only when sonic dockers are built using native host dockerd for sonic docker image creation. This patch creates all sonic dockers as user sonic dockers and then, while saving and loading the user sonic dockers, it rename the user sonic dockers into correct sonic dockers with tag as latest. docker-database:latest <== SAVE/LOAD ==> docker-database-<user>:tag The user sonic docker names are derived from 'DOCKER_USERNAME and DOCKER_USERTAG' make env variable and using Jinja template, it replaces the FROM docker name with correct user sonic docker name for loading and saving the docker image.	2022-04-28 08:39:37 +08:00
arlakshm	fd22635de0	[chassis][bgp] create v4 and v6 peer group for VoQ internal neighbors (#9693 ) Why I did it In the recent minigraph changes we add separate BGP session configuration for V4 and V6 internal VoQ neighbors. This PR is adding different Peer groups for V4 and V6 neighbors How I did it Add VOQ_CHASSIS_V4_PEER and VOQ_CHASSIS_V6_PEER groups Add extra Unit tests How to verify it Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>	2022-02-24 11:21:26 -08:00
abdosi	e44a40cc3b	Updated Internal BGP Templates for chassis packet (#9674 ) Fixes: https://github.com/Azure/sonic-buildimage/issues/9610	2022-02-08 09:36:32 -08:00
Longxiang Lyu	49a036e90c	Add dualtor TSA/B/C support (#9726 ) Why I did it Add TSA/B/C dualtor support Signed-off-by: Longxiang Lyu lolv@microsoft.com How I did it For TSA, toggle all the mux to standby if the device type is dualtor and there are active mux ports. For TSC, add mux status output. How to verify it Run TSA/B/C on a dualtor setup	2022-01-25 10:50:29 +08:00
Saikrishna Arcot	bd479cad29	Create a docker-swss-layer that holds the swss package. This is to save about 50MB of disk space, since 6 containers individually install this package. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>	2022-01-06 09:26:55 -08:00
abdosi	6c0da4bcf0	[bgp] Enable BGP Graceful Restart based on device role (#9486) What I did: Updated Jinja Template to enable BGP Graceful Restart based on device role. By default it will be enabled only if the device role type is TorRouter. Why I did: By default FRR is configured in Graceful Helper mode. Graceful Restart is needed on T0/TorRouter only, since the device can go for warm-reboot. T1/LeafRouter needs to be in Helper mode only	2021-12-13 10:14:50 -08:00
abdosi	f501311f11	Updated BGP Template for Chassis/Multi-asic (#9291 ) Updated BGP Template for the case: 1. For Packet Chassis do not advertise Loopback4096 address into BGP as there is Static Route for same. Having this route in BGP causes two level of recursion in Zebra and cause assert in Zebra when there are many nexthop involved 2. Advertise only P2P Connected IP's into BGP (External Peers). For Packet chassis we have backend IP Interface subnet and if they get advertised into BGP then it also causes recursion	2021-12-06 09:36:24 -08:00
vganesan-nokia	78de10713c	[voq-chassis][bgpcfg] VOQ_BGP_CHASSIS_NEIGHBORS timers default (#8455 ) The BGP_VOQ_CHASSIS_NEIGHBOR keepalive and holdtime timers are configured similar to general neighbors. Changes are done to configure BGP_VOQ_CHASSIS_NEIGHBOR timers similar to BGP_INTENAL_NEIGBOR since voq chassis bgp neighbors are similar to bgp internal neighbors in multi-asic. As it is done for bgp internal neighbors, the keepalive and holdtime timers are set to 3 and 10 seconds respectively. Also similar to bgp internal neighbors, connection retry timer is also configured for voq chassis bgp neighbors. Signed-off-by: vedganes <vedavinayagam.ganesan@nokia.com>	2021-11-30 12:10:27 -08:00
arlakshm	5830852832	remove staticd.conf.j2 (#9182) Why I did it Resolves #8979 and #9055 How I did it Remove the file staticd.conf.j2, which adds the default route on eth0, from the bgp docker Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>	2021-11-24 15:32:16 -08:00
abdosi	919b3e5cdf	[chassis-packet] Fixed BGP Internal Peer template (#9106 ) What I did: Fix the typo in Internal Peer Group template for Packet-based Chassis. Address Review comments of PR: [chassis-packet] minigraph parsing and BGP template changes #8966 - Static Route Parsing for Host - Formatting of chassis port_config.ini	2021-10-29 11:02:38 -07:00
abdosi	3bb248bd67	[chassis-packet] minigraph parsing and BGP template changes (#8966 ) 1. Changes for Generation LC-Graph for packet-based chassis. 2. Added Support Ipv6 Peering on Loopback4096 for voq also 3. Updated asic topology yml files to be offset of slot 4. Made slot_num to take string slot<number> instead of number 5. Consolidated template_dpg_voq_asic.j2 into dpg_asic.j2 6. Remove Loopback4096 from asic topology and parse as dut invertory for multi-asic 7. Updated topo_facts parsing for asic topology_ 8. Internal BGP Session rename from <VoqChassisInternal> to <ChassisInternal> and take switch_type as value. Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>	2021-10-18 18:44:24 -07:00
vganesan-nokia	f9231723f9	[multiasic][voq][bgpconf] Fix for the issue of same BGP router id in all asics (#8049) For multiasic, the back end asics use the IP address of Loopback4096 for BGP router id. In a VOQ multi-asic chassis there are no back end asics. All the asics are front end and the iBGP connections are established via Ethernet-IB of the asics. Since these asics are not designated as BackEnd, the IP address of interface Loopback0 is used as BGP router id. Since the IP address of Loopback0 is the same for all the asics in the line card, the same router id is used for voq iBGP configurations and hence the iBGP connections are not established. Changes are done to fix this	2021-07-26 12:54:52 -07:00
Shi Su	8a48be9b74	Reduce route selection deferral timer for bgp graceful restart (#7533 ) Why I did it There are scenarios that End-of-RIB comes from a part of the peers arrives after reconciliation. In such scenarios, if the route selection deferral timer has the default value of 360 seconds, FRR would not set up routes and all routes would be removed after reconciliation. This PR reduces the route selection deferral timer so that at least routes to parts of the peers get restored at the point of reconciliation. Fix #7488 How I did it Reduce route selection deferral timer for bgp graceful restart to 15 seconds.	2021-07-26 10:16:19 -07:00
arlakshm	ef67ba5f6e	[multi-asic] fix network command for internal loopback (#7878 ) Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com> In the multi asic platforms all the ASIC are advertising the same IPv6 /64 network from Loopback4096. Therefore, the IPv6 loopback address of backend asic is not learnt on the frontend asic. Change the bgpd.conf.main.conf.j2 template file to advertise the Loopback4096 ipv6 address as /128	2021-06-24 12:02:01 -07:00
yozhao101	1a3cab43ac	[Monit] Deprecate the feature of monitoring the critical processes by Monit (#7676 ) Signed-off-by: Yong Zhao yozhao@microsoft.com Why I did it Currently we leveraged the Supervisor to monitor the running status of critical processes in each container and it is more reliable and flexible than doing the monitoring by Monit. So we removed the functionality of monitoring the critical processes by Monit. How I did it I removed the script process_checker and corresponding Monit configuration entries of critical processes. How to verify it I verified this on the device str-7260cx3-acs-1.	2021-06-04 10:16:53 -07:00
xumia	9387350e19	Fix the type issue in rvtysh (#7648 ) Why I did it Change the type issue in the command rvtysh change PARA/para to PARAM/param	2021-05-20 21:35:23 +08:00
abdosi	f27aa33e69	[muti-asic] Updated BGP community for Internal routes (#7617 ) Following changes are done: Internal routes are tagged with no-export instead of local-AS Option to add User Define BGP community on top of no-export	2021-05-16 19:44:06 -07:00
xumia	56bdd750ab	Support readonly vtysh for sudoers (#7383 ) Why I did it Support readonly version of the command vtysh How I did it Check if the command starting with "show", and verify only contains single command in script.	2021-04-25 16:32:02 +08:00
Ze Gan	f77d719f7c	[docker-fpm-frr]: Add split mode to routing config (#7307 ) For the split mode, the config files, like bgpd.conf, zebra.conf and so on, were provided by outside. But the docker_init.sh will overwrite the outside config files if restart bgp service. How I did it Add a split mode checking in docker_init.sh, if docker_routing_config_mode is split, don't overwrite the existing routing config files. How to verify it Set split mode in config db { "DEVICE_METADATA": { "localhost": { "hwsku": "Force10-S6000", "platform": "x86_64-kvm_x86_64-r0", "docker_routing_config_mode": "split" ... } } } Replace your bgpd.conf to /etc/sonic/frr/bgpd.conf Restart bgp service by sudo service bgp restart The /etc/sonic/frr/bgpd.conf your provided shouldn't be overwritten Signed-off-by: Ze Gan <ganze718@gmail.com>	2021-04-23 10:16:20 -07:00
jmmikkel	43342b33b8	[chassis] Add templates and code to support VoQ chassis iBGP peers (#5622 ) This commit has following changes: * Add templates and code to support VoQ chassis iBGP peers * Add support to convert a new VoQChassisInternal element in the BGPSession element of the minigraph to a new BGP_VOQ_CHASSIS_NEIGHBOR table in CONFIG_DB. * Add a new set of "voq_chassis" templates to docker-fpm-frr * Add a new BGP peer manager to bgpcfgd to add neighbors from the BGP_VOQ_CHASSIS_NEIGHBOR table using the voq_chassis templates. * Add a test case for minigraph.py, making sure the VoQChassisInternal element creates a BGP_VOQ_CHASSIS_NEIGHBOR entry, but not if its value is "false". * Add a set of test cases for the new voq_chassis templates in sonic-bgpcfgd tests. Note that the templates expect the new "bgp bestpath peer-type multipath-relax" bgpd configuration to be available. Signed-off-by: Joanne Mikkelson <jmmikkel@arista.com>	2021-04-16 11:11:32 -07:00
judyjoseph	1ad5dbeab6	Fixes for errors seen in staging devices (#7171 ) With the latest 201911 image, the following error was seen on staging devices with TSB command ( for both single asic, multi asic ). Though this err message doesn't affect the TSB functionality, it is good to fix. admin@STG01-0101-0102-01T1:~$ TSB BGP0 : % Could not find route-map entry TO_TIER0_V4 20 line 1: Failure to communicate[13] to zebra, line: no route-map TO_TIER0_V4 permit 20 % Could not find route-map entry TO_TIER0_V4 30 line 2: Failure to communicate[13] to zebra, line: no route-map TO_TIER0_V4 deny 30 In addition, in this PR I am fixing the message displayed to user when there are no BGP neighbors configured on that BGP instance. In multi-asic device there could be case where there are no BGP neighbors configured on a particular ASIC.	2021-04-08 15:16:43 -07:00
Joe LeVeque	c651a9ade4	[dockers][supervisor] Increase event buffer size for process exit listener; Set all event buffer sizes to 1024 (#7083 ) To prevent error [messages](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802) like the following from being logged: ``` Mar 17 02:33:48.523153 vlab-01 INFO swss#supervisord 2021-03-17 02:33:48,518 ERRO pool supervisor-proc-exit-listener event buffer overflowed, discarding event 46 ``` This is basically an addendum to https://github.com/Azure/sonic-buildimage/pull/5247, which increased the event buffer size for dependent-startup. While supervisor-proc-exit-listener doesn't subscribe to as many events as dependent-startup, there is still a chance some containers (like swss, as in the example above) have enough processes running to cause an overflow of the default buffer size of 10. This is especially important for preventing erroneous log_analyzer failures in the sonic-mgmt repo regression tests, which have started occasionally causing PR check builds to fail. Example [here](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802). I set all supervisor-proc-exit-listener event buffer sizes to 1024, and also updated all dependent-startup event buffer sizes to 1024, as well, to keep things simple, unified, and allow headroom so that we will not need to adjust these values frequently, if at all.	2021-03-27 21:14:24 -07:00
Shi Su	de64c4e34c	[bgp]: Reduce bgp connect retry timer to 10 seconds (#7169) The default bgp connect retry timer is 120 seconds. A reconnection will happen after 120 seconds if the initial connection fails. This PR aims to allow a more frequent retry.	2021-03-27 11:36:56 -07:00
judyjoseph	9d9503e1fe	To decrease the Connect Retry Timer from default value which is 120sec to 10 sec. (#7087) Why I did it It was observed that on a multi-asic DUT bootup, the BGP internal sessions between ASICs were taking more time to get ESTABLISHED than external BGP sessions. The internal sessions were coming up almost exactly 120 secs later. On a multi-asic platform the bgp dockers (one per ASIC) are brought up around the same time at switch start, and they try to establish bgp sessions with neighbors (in peer ASICs) which may not be completely up. This results in a BGP connect failure, and the retry happens after 120sec, which is the default Connect Retry Timer. How I did it Add the command to set the bgp neighboring session retry timer to 10sec for internal bgp neighbors.	2021-03-17 23:14:38 -07:00
abdosi	30b6668b7d	Changes in FRR temapltes for multi-asic (#6901) 1. Made the command next-hop-self force only applicable on back-end asic bgp. This is done so that BGPL iBGP session running on backend can send e-BGP learnt nexthop. Back end asic FRR is able to recursively resolve the eBGP nexthop in its routing table since it knows about all the connected routes advertised from front end asic. 2. Made all front-end asic bgp use global loopback ip (Loopback0) as router id and back end asic bgp use Loopback4096 as router-id and originator id for Route-Reflector. This is done so that routes learnt by external peers do not show Loopback4096 as router id in show ip bgp <route-prefix> output. 3. To handle the above change, Loopback4096 needs to be passed from BGP manager for jinja2 template generation. This was missing, and this change/fix is needed for this also https://github.com/Azure/sonic-buildimage/blob/master/dockers/docker-fpm-frr/frr/bgpd/templates/dynamic/instance.conf.j2#L27 4. Enhancement to add multi-asic specific bgpd template generation unit test cases.	2021-02-26 17:05:15 -08:00
abdosi	a520cecb44	[multi-asic] BBR support on internal-peers for multi-asic platfroms. (#6848 ) Enable BBR config allowas-in 1 for internal peers Why I did: To advertise BBR routes learnt via e-BGP peer in one asic/namespace to another iBGP asic/namespace via Route Reflector.	2021-02-25 23:15:02 -08:00
judyjoseph	ad88700912	[docker-fpm-frr]: TSA/B/C changes for multi-asic (#6510 ) - Introduced TS common file in docker as well and moved common functions. - TSA/B/C scripts run only in BGP instances for front end ASICs. In addition skip enforcing it on route maps used between internal BGP sessions. admin@str--acs-1:~$ sudo /usr/bin/TSA System Mode: Normal -> Maintenance and in case of Multi-ASIC admin@str--acs-1:~$ sudo /usr/bin/TSA BGP0 : System Mode: Normal -> Maintenance BGP1 : System Mode: Normal -> Maintenance BGP2 : System Mode: Normal -> Maintenance	2021-02-12 10:56:44 -08:00
Guohan Lu	f7346cca32	[docker-fmp-frr]: remove blank lines in generated critical_process Signed-off-by: Guohan Lu <lguohan@gmail.com>	2021-01-27 19:41:59 -08:00
Shi Su	aab37b7f42	[FRR] Create a separate script to wait zebra to be ready to receive connections (#6519 ) The requirement for zebra to be ready to accept connections is a generic problem that is not specific to bgpd. Making the script to wait for zebra socket a separate script and let bgpd and staticd to wait for zebra socket.	2021-01-27 12:36:02 -08:00
Zhenhong Zhao	a171e6c5e4	[frrcfgd] introduce frrcfgd to manage frr config when frr_mgmt_framework_config is true (#5142) - Support for non-template based FRR configurations (BGP, route-map, OSPF, static route, etc.) using config DB schema. - Support for save & restore - Jinja template based config-DB data read and apply to FRR during startup - How I did it - add frrcfgd service - when frr_mgmt_framework_config is set, frrcfgd starts in bgp container - when the user changes BGP or other related table entries in config DB, frrcfgd runs the corresponding VTYSH commands to program FRR. - add jinja template to generate the FRR config file used by FRR daemons when the bgp container is restarted - How to verify it 1. Add/delete data in config DB and then run the VTYSH "show running-config" command to check whether the FRR configuration changed. 2. Restart the bgp container, check that the generated FRR config file is correct, and run the VTYSH "show running-config" command to check that the FRR configuration is consistent with the attributes in config DB Co-authored-by: Zhenhong Zhao <zhenhong.zhao@dell.com>	2021-01-24 17:57:03 -08:00
yozhao101	be3c036794	[supervisord] Monitoring the critical processes with supervisord. (#6242) - Why I did it Initially, we used Monit to monitor critical processes in each container. If one of the critical processes was not running or crashed for some reason, Monit would periodically write an alerting message into syslog. If we added a new process to a container, the corresponding Monit configuration file also needed to be updated, which made maintenance harder. We now employ the event listener of Supervisord to do this monitoring. Since the processes in each container are managed by Supervisord, we can focus only on the monitoring logic. - How I did it We borrowed the event listener of Supervisord to monitor critical processes in containers. When notified that one of the critical processes exited unexpectedly, the event listener takes the following steps: It first checks whether the auto-restart mechanism is enabled for this container. If auto-restart is enabled, the event listener kills the Supervisord process, which should cause the container to exit and subsequently get restarted. If auto-restart is not enabled for this container, the event listener enters a loop that first sleeps 1 minute and then checks whether the process is running. If yes, the event listener exits. If no, an alerting message is written into syslog. - How to verify it First, check whether the auto-restart mechanism of a container is enabled by running the command show feature status. If enabled, select one critical process and kill it manually, then check whether the container is restarted. Second, disable the auto-restart mechanism if it was enabled at step 1 by running the command sudo config feature autorestart <container_name> disabled. Then select one critical process and kill it. After that, the alerting message will appear in the syslog every 1 minute. - Which release branch to backport (provide reason below if selected) 201811 201911 [x] 202006	2021-01-21 12:57:49 -08:00
Shi Su	afee1a851c	[bgpd]: Check zebra is ready to connect when starting bgpd (#6478) Fix #5026 There is a race condition between the zebra server accepting connections and bgpd trying to connect. Bgpd has a chance to try to connect before zebra is ready. In this scenario, bgpd will try again after 10 seconds and operate as normal within these 10 seconds. As a consequence, whatever bgpd tries to send to zebra during those 10 seconds will be lost. To avoid such a scenario, bgpd should start after zebra is ready to accept connections.	2021-01-19 00:23:36 -08:00


166 Commits