sonic-buildimage

Archived

Author	SHA1	Message	Date
Hua Liu	bdb24676eb	Change orchagent stuck message from ERR to WARNING (#17872 ) Change orchagent stuck message from ERR to WARNING #### Why I did it During switch initialization, orchagent is sometimes busy for more than 40 seconds, which triggers a process-stuck watchdog error. To mitigate this, change the watchdog error message to a warning message. ##### Work item tracking - Microsoft ADO: 26517622 #### How I did it Change orchagent stuck message from ERR to WARNING. #### How to verify it Pass all UT. ### Description for the changelog Change orchagent stuck message from ERR to WARNING.	2024-01-26 00:01:50 -08:00
Zain Budhwani	b557488608	Remove echo log to /tmp/{$SERVICE}-debug.log in service_mgmt.sh (#17838 ) ### Why I did it It is unnecessary to write logs to /tmp/${SERVICE}-debug.log because they are already written to syslog. Writing to a separate log file is removed out of concern for memory space and for services failing to start up in RO state. ##### Work item tracking - Microsoft ADO (number only):26458976 #### How I did it Remove the DEBUGLOG definition and the line that echoes the message to the mentioned log file. #### How to verify it Manually verified that /tmp/${SERVICE}-debug.log files do not exist and that the log for service starting still appears in syslog	2024-01-25 17:14:21 -08:00
Lawrence Lee	eb70bff4b7	add timeout to ping6 command (#17729 ) Signed-off-by: Lawrence Lee <lawlee@microsoft.com>	2024-01-10 14:40:15 -08:00
Junhua Zhai	53be9de743	Fix syncd_request_shutdown coredump in config reload on KVM sonic (#17486 ) The issue is related to #16812. Process syncd does not run in the container gbsyncd on KVM sonic with the default hwsku. Microsoft ADO : 26151608 How I did it If syncd is not running in container gbsyncd, there is no need to trigger a graceful shutdown of syncd. How to verify it No syncd_request_shutdown coredump occurs in config reload on KVM sonic	2023-12-13 17:37:44 -08:00
Aaron Payment	0ecee5df05	[gbsyncd]: Set SYSLOG_CONFIG_FEATURE for gbsyncd (#17325 ) Why I did it SONiC Mgmt test syslog/test_syslog_rate_limit.py syslog.test_syslog_rate_limit test_syslog_rate_limit was failing on SKUs with gbsyncd. This includes Arista 720DT when testing on the 202305 branch. How I did it The issue was that there was no value for gbsyncd in "show syslog rate-limit-container", because gbsyncd did not have a SYSLOG_CONFIG_FEATURE\|gbsyncd entry in config_db, which in turn was because the gbsyncd feature is not enabled through init_cfg.json.j2. How to verify it Test is now passing on 720DT in 202305 branch. Co-authored-by: Boyang Yu <byu@arista.com>	2023-12-06 22:04:21 -08:00
Junhua Zhai	048f2a7c39	[gbsyncd] Graceful shutdown of syncd process in container gbsyncd (#16812 ) Fix #16608. Need to gracefully shutdown syncd/gbsyncd individually.	2023-12-06 21:43:13 -08:00
Lawrence Lee	572af1dcdf	[arp_update]: Flush neighbors with incorrect MAC info (#17238 ) [arp_update]: Flush MAC mismatch neighbors - Check for MAC mismatch between neighbor entries in the kernel and APPL_DB - Flush any entries with a mismatch	2023-11-30 14:23:05 -08:00
Vivek	4727185648	[lldp] Clean up service start logic owing to port init start optimization (#17268 ) Signed-off-by: Vivek Reddy <vkarri@nvidia.com>	2023-11-27 09:56:54 -08:00
ganglv	c71fb3a30f	Share image for gnmi and telemetry (#16863 ) Why I did it Share docker image to support gnmi container and telemetry container Work item tracking Microsoft ADO 25423918: How I did it Create telemetry image from gnmi docker image. Enable gnmi container and disable telemetry container by default. How to verify it Run end to end test.	2023-11-08 08:54:36 +08:00
Yaqiang Zhu	d11e0a214e	Add use_unix_socket_path to supervisor-proc-exit-listener (#16548 ) Why I did it ConfigDBConnector in supervisor-proc-exit-listener uses the default parameters to connect to CONFIG_DB (via 127.0.0.1:6379), which fails in containers that do not use host network mode, because they do not share the same network and socket. How I did it Add a new parameter use_unix_socket_path to this script to indicate whether to use a unix socket to connect to CONFIG_DB. How to verify it Build the image and install it, kill critical processes in the container, and verify that the container crashes.	2023-09-15 16:23:25 -07:00
vganesan-nokia	b13b41fc22	[swss] Chassis db clean up optimization and bug fixes (#16454 ) * [swss] Chassis db clean up optimization and bug fixes This commit includes the following changes: - Fix for regression failure due to error in finding CHASSIS_APP_DB in pizzabox (#PR 16451) - After attempting to delete the system neighbor entries from chassis db, before starting to clear the system interface entries, wait for some time only if some system neighbors were deleted. If no system neighbor entries were deleted for the asic coming up, there is no need to wait. - Similar changes for system lag delete. Before deleting the system lag, wait for some time only if some system lag members were deleted. If no system lag members were deleted, there is no need to wait. - Flush the SYSTEM_NEIGH_TABLE from the local STATE_DB. While the asic is coming up, when system neigh entries are deleted from chassis app db (as part of chassis db clean up), there are no orchs/processes running to process the delete messages from chassis redis. Because of this, stale system neigh entries are present in the local STATE_DB. The stale entries result in creation of orphan (no corresponding data path/asic db entry) kernel neigh entries during STATE_DB:SYSTEM_NEIGH_TABLE entries processing by nbrmgr (after the swss service came up). This is avoided by flushing the SYSTEM_NEIGH_TABLE from the local STATE_DB when the service comes up. Signed-off-by: vedganes <veda.ganesan@nokia.com> * [swss] Chassis db clean up bug fixes review comment fix - 1 Debug logs added for deletion of other tables (SYSTEM_INTERFACE and SYSTEM_LAG_TABLE) Signed-off-by: vedganes <veda.ganesan@nokia.com> --------- Signed-off-by: vedganes <veda.ganesan@nokia.com>	2023-09-11 08:28:27 -07:00
Stephen Sun	b5e8c16134	[Mellanox] Enhance FW upgrade mechanism (#16090 ) ### Why I did it 1. Enhance the diagnosis information collecting mechanism - If the option `-v` is fed, it will pass additional diagnosis flags to mlxfwmanager - Collect all the output from mlxfwmanager and print them to syslog if it fails 2. Abort syncd in case waiting for device or upgrading firmware fails Signed-off-by: Stephen Sun <stephens@nvidia.com> ### How I did it #### How to verify it Regression and manual test	2023-09-04 11:28:53 -07:00
anamehra	f6897bb585	chassis-packet: Update arp_update script for FAILED and STALE check (#16311 ) chassis-packet: Update arp_update script for FAILED and STALE check (#16311) 1. Fixing an issue with FAILED entry resolution retry. Neighbor entries in the arp table may sometimes enter a FAILED state when the far end is down, and the state is reported as follows: 2603:10e2:400:3::1 dev PortChannel19 router FAILED While the arp_update script handles FAILED entries in the following format, the above was not handled due to the token location (extra router keyword at index 4): 2603:10e2:400:3::1 dev PortChannel19 FAILED The former format may appear if an arp resolution is tried on a link that is known but the far end goes down, e.g., pinging a STALE entry while the far end is down. 2. Refreshing STALE entries to make sure the far end is reachable. STALE entries for some backend ports may appear in chassis-packet when no traffic is received for a while on the port. When the far end goes down, BFD is expected to stop sending packets on the session for which the far end is not reachable. But as the entry is known as stale, on the Cisco chassis, BFD keeps sending packets. Refreshing the stale entry will keep active links reachable in the neighbor table, while the entries whose far end is down will enter a failed state. FAILED state entries will be retried and become reachable when the far end comes back up.	2023-09-01 11:41:46 -07:00
vganesan-nokia	5fded5c51b	[chassis] Chassis DB cleanup when asic comes up (#16213 ) * [chassis]Chassis DB cleanup when asic comes up Cleanup the entries from the following tables in chassis app db in redis_chassis server in the supervisor (1) SYSTEM_NEIGH (2) SYSTEM_INTERFACE (3) SYSTEM_LAG_MEMBER_TABLE (4) SYSTEM_LAG_TABLE As part of the clean up only those entries created by the asic that is coming up are deleted. The LAG IDs used by the asics are also de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET - Added check to run the chassis db clean up only for voq switches. Signed-off-by: vedganes <veda.ganesan@nokia.com>	2023-08-31 23:38:56 -07:00
Arvindsrinivasan Lakshmi Narasimhan	46817036fd	[chassis]: removed dependency for bgp and swss for chassis supervisor (#15734 ) Fixes #15667 and #13293 Work item tracking Microsoft ADO 24472854: How I did it On chassis supervisor bgp feature is disabled in hostcfgd. The dependency between swss and bgp causes the bgp containers to start even though the feature is disabled. How to verify it Tests on chassis supervisor and LC	2023-08-07 09:52:48 -07:00
Vadym Hlushko	9fba98ce6d	[syncd.sh] Clear semaphore before updating firmware (#15818 ) Why I did it The hw resources should be released before updating firmware. How I did it Added logic to release hw resources in syncd.sh script Signed-off-by: Vadym Hlushko <vadymh@nvidia.com>	2023-08-06 22:30:33 -07:00
Vaibhav Hemant Dixit	e127701660	Fix CONFIG_DB_INITIALIZED flag check logic and set/reset flag for warmboot (#15685 ) * Fix CONFIG_DB_INITIALIZED flag check logic and set/reset flag for warm-reboot * Fix db-cli usage * Handle same image warm-reboot and generalize handling of INIT flag * Cover boot from ONIE case: set config init flag when minigraph, config_db are missing * Handle case: first boot of SONiC * Check for config init flag * Simplify logic, and do not call db_migrator for same image reboot	2023-08-04 16:00:26 -07:00
Lawrence Lee	b4a3711a95	[arp_update]: Fix IPv6 neighbor race condition (#15583 ) * [arp_update]: Fix IPv6 neighbor race condition on dualtors Signed-off-by: Lawrence Lee <lawlee@microsoft.com>	2023-06-30 14:06:25 -07:00
siqbal1986	bf5b72a356	Vnet monitor table cleanup (#15399 ) * Added VNET_MONITOR_TABLE, BFD_SESSION_TABLE, to the list of tables to be cleaned up after swss restart. * Added VNET_ROUTE* table in cleanup. This should cover VNET_ROUTE_TUNNEL_TABLE as well.	2023-06-27 12:53:56 -07:00
Hua Liu	05f1a5a31e	Add watchdog mechanism to swss service and generate alert when swss have issue. (#15429 ) Add watchdog mechanism to swss service and generate alert when swss have issue. Work item tracking Microsoft ADO (number only): 16578912 What I did Add an orchagent watchdog to monitor and alert on orchagent stuck issues. Why I did it Currently the SONiC monit system only monitors whether the orchagent process exists. If the orchagent process is stuck and stops processing, monit can't detect and report it. How I verified it Pass all UT. Manually verified that process_monitoring/test_critical_process_monitoring.py passes. Add new UT https://github.com/sonic-net/sonic-mgmt/pull/8306 to check that the watchdog works correctly. Manual test: after pausing orchagent with 'kill -STOP <pid>', check that the following message exists in the log: Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes). Details if related Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737 UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306	2023-06-12 17:53:54 -07:00
Ye Jianquan	cec9d7b83a	Revert "Add watchdog mechanism to swss service and generate alert when swss have issue. (#14686 )" (#15390 ) This reverts commit `44427a2f6b`. Docker image not updated during PR validation and caused PR check failures. Force merge this revert. After cache is updated after this PR is merged, issue should be fixed.	2023-06-09 09:10:35 +08:00
Hua Liu	44427a2f6b	Add watchdog mechanism to swss service and generate alert when swss have issue. (#14686 ) This PR depends on https://github.com/sonic-net/sonic-swss/pull/2737 being merged first. What I did Add an orchagent watchdog to monitor and alert on orchagent stuck issues. Why I did it Currently the SONiC monit system only monitors whether the orchagent process exists. If the orchagent process is stuck and stops processing, monit can't detect and report it. How I verified it Pass all UT. Add new UT https://github.com/sonic-net/sonic-mgmt/pull/8306 to check that the watchdog works correctly. Manual test: after pausing orchagent with 'kill -STOP <pid>', check that the following message exists in the log: Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes). Details if related Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737 UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306	2023-06-05 22:21:17 -07:00
siqbal1986	381cfe4485	Added VNET_MONITOR_TABLE,BFD_SESSION_TABLE,VNET_ROUTE_TUNNEL_TABLE to the list (#14992 ) * The 3 tables in state DB need to be cleaned up after SWSS restart to have a consistent state.	2023-06-05 13:18:50 -07:00
Anish Narsian	05a85b57b8	[arp_update] Resolve neighbors from config_db (#15006 ) * To resolve NEIGH table entries present in CONFIG_DB. Without this change arp/ndp entries which we wish to resolve, and configured via CONFIG_DB are not resolved.	2023-05-17 10:42:03 -07:00
Sudharsan Dhamal Gopalarathnam	2804998766	[config reload]Config Reload Enhancement (#13969 ) #### Why I did it Implementing code changes for https://github.com/sonic-net/SONiC/pull/1203 #### How I did it Removed the timers and delayed target since the delayed services would start based on event driven approach. Cleared port table during config reload and cold reboot scenario. Modified yang model, init_cfg.json to change has_timer to delayed #### How to verify it Running regression	2023-04-12 11:20:03 -07:00
anamehra	f34360f101	chassis-packet: resolve the missing static routes (#14593 ) Why I did it Fixes #14179 chassis-packet: missing arp entries for static routes causing high orchagent cpu usage It is observed that some sonic-mgmt test cases call sonic-clear arp, which clears the static arp entries as well. Neither orchagent nor the arp_update process tries to resolve the missing arp entries after the clear. How I did it arp_update should resolve the missing arp/ndp static route entries. Added code to check for missing entries and try ping if any are found to resolve them. How to verify it After boot or config reload, check ipv4 and ipv6 neigh entries to make sure all static route entries are present manual validation: Use sonic-clear arp and sonic-clear ndp to clear all neighbor entries run arp_update Check for neigh entries. All entries should be present. Testing on T0 setup for route/test_static_route.py The test sets the STATIC_ROUTE entry in config db without ifname: sonic-db-cli CONFIG_DB hmset 'STATIC_ROUTE\|2.2.2.0/24' nexthop 192.168.0.18,192.168.0.25,192.168.0.23 "STATIC_ROUTE": { "2.2.2.0/24": { "nexthop": "192.168.0.18,192.168.0.25,192.168.0.23" } }, Validate that arp_update gets the proper ARP_UPDATE_VARS using the arp_update_vars.j2 template from config db and does not crash: { "switch_type": "", "interface": "", "pc_interface" : "PortChannel101 PortChannel102 PortChannel103 PortChannel104 ", "vlan_sub_interface": "", "vlan" : "Vlan1000", "static_route_nexthops": "192.168.0.18 192.168.0.25 192.168.0.23 ", "static_route_ifnames": "" } Validate that the route/test_static_route.py testcase passes.	2023-04-12 15:07:42 +08:00
Aryeh Feigin	41a9813018	Finalize fast-reboot in warmboot finalizer (#14238 ) - Why I did it To solve an issue with upgrade with fast-reboot including FW upgrade, which has been present since moving to fast-reboot over warm-reboot infrastructure. This also introduces fast-reboot finalizing logic to determine that fast-reboot is done. - How I did it Added logic to the finalize-warmboot script to handle fast-reboot as well; this makes sense because this script is invoked when using fast-reboot over warm-reboot. The script clears the fast-reboot entry from state-db, instead of the previous implementation that relied on a timer. In some scenarios the timer could expire before fast-reboot finished, causing fallback to cold-reboot and possible crashes. This PR also updates all services/scripts reading the fast-reboot state-db entry to look for the updated value representing that fast-reboot is active. - How to verify it Run fast-reboot and check that the fast-reboot entry exists in state-db right after startup and is cleared when warm-reboot is finalized and not due to a timer.	2023-04-09 16:59:15 +03:00
Devesh Pathak	d74055e12c	Increase wait_for_tunnel() timeout to 90s (#14279 ) Why I did it Orchagent sometimes takes additional time to execute Tunnel tasks. This causes the write_standby script to error out, and mux state machines are not initialized. As a result, show mux status is missing some ports in its output. Mar 13 20:36:52.337051 m64-tor-0-yy41 INFO systemd[1]: Starting MUX Cable Container... Mar 13 20:37:52.480322 m64-tor-0-yy41 ERR write_standby: Timed out waiting for tunnel MuxTunnel0, mux state will not be written Mar 13 20:37:58.983412 m64-tor-0-yy41 NOTICE swss#orchagent: :- doTask: Tunnel(s) added to ASIC_DB. How I did it Increase timeout from 60s to 90s How to verify it Verified that the mux state machine is initialized and show mux status has all needed ports in it.	2023-04-07 11:30:58 +08:00
Ying Xie	737d0e57ad	[write standby] force DB connections to use unix socket to connect (#14524 ) Why I did it At service start up time, there is a chance that the networking service is being restarted by the interface-config service. When that happens, write_standby could fail to make DB connections because the loopback interface is being reconfigured. How I did it Force the db connector to use unix socket to avoid the loopback reconfig timing window. How to verify it Run config reload test 20+ times and no issue encountered. Signed-off-by: Ying Xie <ying.xie@microsoft.com> * use unix socket instead Signed-off-by: Ying Xie <ying.xie@microsoft.com>	2023-04-06 13:54:56 -07:00
Ye Jianquan	6c04ed987d	Revert "chassis-packet: resolve the missing static routes (#14230 )" (#14544 ) This reverts commit `a8f8ea3b50`.	2023-04-06 10:36:10 -07:00
anamehra	a8f8ea3b50	chassis-packet: resolve the missing static routes (#14230 ) arp_update should resolve the missing arp/ndp static route entries. Added code to check for missing entries and try ping to resolve the missing entry. Why I did it Fixes #14179 chassis-packet: missing arp entries for static routes causing high orchagent cpu usage It is observed that some sonic-mgmt test cases call sonic-clear arp, which clears the static arp entries as well. Neither orchagent nor the arp_update process tries to resolve the missing arp entries after the clear. How I did it arp_update should resolve the missing arp/ndp static route entries. Added code to check for missing entries and try ping if any are found to resolve them. How to verify it After boot or config reload, check ipv4 and ipv6 neigh entries to make sure all static route entries are present manual validation: Use sonic-clear arp and sonic-clear ndp to clear all neighbor entries run arp_update Check for neigh entries. All entries should be present. Signed-off-by: anamehra <anamehra@cisco.com>	2023-03-29 09:53:32 -07:00
anamehra	4a93e4cfa4	Add support for platform syncd pre shutdown plugin (#13564 ) Why I did it Vendor platform may require running platform specific pre-shutdown routine before shutting down the syncd process which runs the SAI and vendor sdk instance. How I did it Added a platform script hook which will be executed if the plugin script is provided by the platform in device//plugins/	2023-03-03 15:53:33 -08:00
Jing Zhang	78f249be38	change default to be on (#13495 ) Changing the default config knob value to be True for killing radv, due to the reasons below: Killing RADV is to prevent sending the "cease to be advertising interface" protocol packet. RFC 4861 says this ceasing packet as "should" instead of "must", considering that it's fatal to not do this. In active-active scenario, host side might have difficulty distinguish if the "cease to be advertising interface" is for the last interface leaving. 6.2.5. Ceasing To Be an Advertising Interface shutting down the system. In such cases, the router SHOULD transmit one or more (but not more than MAX_FINAL_RTR_ADVERTISEMENTS) final multicast Router Advertisements on the interface with a Router Lifetime field of zero. In the case of a router becoming a host, the system SHOULD also depart from the all-routers IP multicast group on all interfaces on which the router supports IP multicast (whether or not they had been advertising interfaces). In addition, the host MUST ensure that subsequent Neighbor Advertisement messages sent from the interface have the Router flag set to zero. sign-off: Jing Zhang zhangjing@microsoft.com	2023-01-24 23:59:54 +00:00
Jing Zhang	260a2ec3e7	[dualtor][active-active]Killing radv instead of stopping on `active-active` dualtor if config knob is on (#13408 ) How I did it radv sends a good-bye packet when the service is stopped, which causes an IPv6 route update on the SoC side. This update leads to an interface bounce and causes traffic disruption even though the ToR device might already be isolated. This PR mitigates the traffic disruption issue during planned maintenance by killing radv instead of stopping it, so the cease packet won't be sent. How to verify it Verified on dev clusters: Traffic disruption was no longer reproducible. radv took the killing path. If the knob was off, radv would take the stopping path. sign-off: Jing Zhang zhangjing@microsoft.com	2023-01-20 15:34:34 -08:00
Oleksandr Ivantsiv	9988ff888b	[build] Add the possibility to disable compilation of teamd and radv containers. (#12920 ) - Why I did it This optimization is needed for DPU SONiC. DPU SONiC runs a limited set of containers and teamd and radv containers are not part of them. Unlike the other containers, there was no possibility to disable teamd and radv containers compilation. To reduce DPU SONiC compilation time and reduce the image size this commit adds the possibility to disable their compilation. - How I did it Two new configuration options are added to rules/config file: INCLUDE_TEAMD INCLUDE_ROUTER_ADVERTISER By default to preserve the existing behavior both options are enabled. There are two ways to override them: To change option value to "n" in rules/config file. To override their value using SONIC_OVERRIDE_BUILD_VARS env variable: SONIC_OVERRIDE_BUILD_VARS="SONIC_INCLUDE_TEAMD=y SONIC_INCLUDE_ROUTER_ADVERTISER=n" - How to verify it The default behavior is preserved. To verify it compile the image without overriding new options. Install the image and verify that both teamd and radv containers are present and running. To verify the new options override them with "n" value. Compile and install image. Verify that no docker containers are present. Verify that SWSS can start without errors.	2022-12-13 12:06:30 +02:00
Arvindsrinivasan Lakshmi Narasimhan	7db272556e	[chassis] update the asic_status.py to read from CHASSIS_FABRIC_ASIC_INFO_TABLE (#12576 ) Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan arlakshm@microsoft.com Why I did it Fixes #12575 and #12575 How I did it In the PR sonic-net/sonic-platform-daemons#311 chassisd updates to CHASSIS_FABRIC_ASIC_INFO with the fabric asic info. Updating the asic_status.py to read from the correct table. How to verify it test on chassis Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>	2022-12-07 21:53:47 -08:00
Michael Li	50b962b4a8	Limit reload BCM SDK kmods on syncd start to PikeZ platform (#12971 ) Why I did it Limiting #12804 changes to PikeZ platform only (Arista-720DT-48S). Note that this is a short term workaround for this platform until SDK investigation on SDK init failure on docker syncd restart due to DMA issues is resolved. How I did it Retrieve platform name from /host/machine.conf and only reload SDK kmods on Arista-720DT-48S platform. Signed-off-by: Michael Li <michael.li@broadcom.com>	2022-12-07 09:53:21 +08:00
Stepan Blyshchak	8ca0530920	[swss.sh] optimize macsec feature state query (#12946 ) - Why I did it There's a slowdown in bootup related to the execution of a show command during startup of swss service. show is a pretty heavy command and takes long time to execute ~2 sec. - How I did it I replaced show with sonic-db-cli which takes a ms to run. - How to verify it Boot the switch and verify swss is active. Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>	2022-12-06 11:23:46 +02:00
Michael Li	f725b83bd6	Reload BCM SDK kmods on syncd start to handle syncd restart issues (#12804 ) Why I did it There is an issue on the Arista PikeZ platform (using T3.X2: BCM56274) while running SONiC. If the 'syncd' container in SONiC is restarted, the expected behaviour is that syncd will automatically restart/recover; however it does not and always fails at create_switch due to BCM SDK kmod DMA operation cancellation getting stuck. Sep 16 22:19:44.855125 pkz208 ERR syncd#syncd: [none] SAI_API_SWITCH:platform_process_command:428 Platform command "init soc" failed, rc = -1. Sep 16 22:19:44.855206 pkz208 INFO syncd#supervisord: syncd CMIC_CMC0_PKTDMA_CH4_DESC_COUNT_REQ:0x33#015 Sep 16 22:19:44.855264 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:platformInit:1909 initialization command "init soc" failed, rc = -1 (Internal error). Sep 16 22:19:44.855403 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:sai_driver_init:642 Error initializing driver, rc = -1. ... Sep 16 22:19:44.855891 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:brcm_sai_create_switch:1173 initializing SDK failed with error Operation failed (0xfffffff5). Reloading the BCM SDK kmods allows the switch init to continue properly. How I did it If BCM SDK kmods are loaded, unload and load them again on syncd docker start script. How to verify it Steps to reproduce: In SONiC, run 'docker ps' to see current running containers; 'syncd' should be present. Run 'docker stop syncd' Wait ~1 minute. Run 'docker ps' to see that syncd is missing. Check logs to see messages similar to the above. Signed-off-by: Michael Li <michael.li@broadcom.com>	2022-11-30 16:16:30 +08:00
abdosi	668485aac5	Added Support to runtime render bgp and teamd feature state and lldp has_asic_scope flag (#11796 ) Added Support to runtime render bgp and teamd feature `state` and lldp `has_asic_scope` flag Needed for SONiC on chassis. Signed-off-by: Abhishek Dosi <abdosi@microsoft.com> Co-authored-by: mlok <marty.lok@nokia.com>	2022-11-15 16:20:14 -08:00
abdosi	bd348c5264	[chassis-packet] fix the issue of internal ip arp not getting resolved. (#12127 ) Fix the issue where arp_update will not ping some of the ip's even though they are in failed state since grep of that ip on ip neigh show command does not do exact word match and can return multiple match.	2022-11-14 10:15:17 -08:00
Lawrence Lee	ddf16c9d8c	[arp_update]: Fix hardcoded vlan (#12566 ) Typo in prior PR #11919 hardcodes Vlan name. Change command to use the $vlan variable instead Signed-off-by: Lawrence Lee <lawlee@microsoft.com>	2022-11-07 12:10:00 -08:00
Zain Budhwani	8f48773fd1	Publish additional events (#12563 ) Add event_publish code or regex for rsyslog plugin for additional events	2022-11-07 09:57:57 -08:00
Mai Bui	61a085e55e	Replace os.system and remove subprocess with shell=True (#12177 ) Signed-off-by: maipbui <maibui@microsoft.com> #### Why I did it `subprocess` is used with `shell=True`, which is very dangerous for shell injection. `os` - not secure against maliciously constructed input and dangerous if used to evaluate dynamic content #### How I did it remove `shell=True`, use `shell=False` Replace `os` by `subprocess`	2022-11-04 10:48:51 -04:00
Stepan Blyshchak	e662008f72	[services] kill container on stop in warm/fast mode (#10510 ) - Why I did it To optimize stop on warm boot. - How I did it Added kill for containers	2022-09-19 19:34:33 +03:00
Ze Gan	016f671857	[docker-macsec]: Add dependencies of MACsec (#11770 ) Why I did it If the SWSS service is restarted, the MACsec service should also be restarted. Otherwise the data in wpa_supplicant and orchagent will not be consistent. How I did it Add dependency in docker-macsec.mk. How to verify it Manually check with 'sudo service swss restart'. The MACsec container should be started after swss, and the syslog will look like Sep 8 14:36:29.562953 sonic INFO swss.sh[9661]: Starting existing swss container with HWSKU Force10-S6000 Sep 8 14:36:30.024399 sonic DEBUG container: container_start: BEGIN ... Sep 8 14:36:33.391706 sonic INFO systemd[1]: Starting macsec container... Sep 8 14:36:33.392925 sonic INFO systemd[1]: Starting Management Framework container... Signed-off-by: Ze Gan <ganze718@gmail.com>	2022-09-08 23:45:06 +08:00
Ying Xie	a6843927d9	[mux] skip mux operations during warm shutdown (#11937 ) * [mux] skip mux operations during warm shutdown - Enhance write_standby.py script to skip actions during warm shutdown. - Expand the support to BGP service. - MuX support was added by a previous PR. - don't skip action during warm recovery Signed-off-by: Ying Xie <ying.xie@microsoft.com>	2022-09-02 13:50:42 -07:00
Lawrence Lee	a762b35cbc	[arp_update]: Set failed IPv6 neighbors to incomplete (#11919 ) After pinging any failed IPv6 neighbor entries, set the remaining failed/incomplete entries to a permanent INCOMPLETE state. This manual setting to INCOMPLETE prevents these entries from automatically transitioning to FAILED state, and since they are now incomplete any subsequent NA messages for these neighbors is able to resolve the entry in the cache. Signed-off-by: Lawrence Lee <lawlee@microsoft.com>	2022-09-02 13:40:40 -07:00
Longxiang Lyu	6e878a36da	[mux] Exit to write `standby` state to `active-active` ports (#11821 ) [mux] Exit to write standby state to `active-active` ports Signed-off-by: Longxiang Lyu <lolv@microsoft.com>	2022-08-31 13:10:22 -07:00
abdosi	3bf1abb2dc	Address Review Comment to define SONIC_GLOBAL_DB_CLI in gbsyncd.sh (#11857 ) As part of PR #11754 Change was added to use variable SONIC_DB_NS_CLI for namespace but that will not work since ./files/scripts/syncd_common.sh uses SONIC_DB_CLI. So revert back to use SONIC_DB_CLI and define new variable for SONIC_GLOBAL_DB_CLI for global/host db cli access Also fixed DB_CLI not working for namespace.	2022-08-29 08:19:28 -07:00
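
Commit f6897bb585 above describes FAILED neighbor entries being missed because an extra `router` keyword shifts the state token. The sketch below illustrates one way to make that parsing position-independent, by reading the state from the last token. It is a Python illustration only, not the actual arp_update script; the function names and the STALE sample line are hypothetical.

```python
def parse_neigh_line(line):
    """Split one `ip neigh`-style line into (address, device, state).

    The state is taken from the last token, so optional keywords such as
    `router` or `lladdr <mac>` in the middle of the line do not matter.
    """
    tokens = line.split()
    if len(tokens) < 4 or tokens[1] != "dev":
        return None  # not a line in the expected "<addr> dev <ifname> ... <STATE>" shape
    return tokens[0], tokens[2], tokens[-1]


def failed_neighbors(lines):
    """Return (address, device) pairs for every entry whose state is FAILED."""
    result = []
    for line in lines:
        parsed = parse_neigh_line(line)
        if parsed is not None and parsed[2] == "FAILED":
            result.append((parsed[0], parsed[1]))
    return result


SAMPLE = [
    # The two formats quoted in the commit message.
    "2603:10e2:400:3::1 dev PortChannel19 router FAILED",
    "2603:10e2:400:3::1 dev PortChannel19 FAILED",
    # Hypothetical STALE entry, added only to show that it is skipped.
    "2603:10e2:400:3::5 dev PortChannel19 lladdr 00:11:22:33:44:55 router STALE",
]

if __name__ == "__main__":
    for address, device in failed_neighbors(SAMPLE):
        print(address, device)
```

Both FAILED formats yield the same (address, device) pair, while the STALE line is ignored.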

187 Commits
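
Commit 61a085e55e above replaces `os.system` and `subprocess` calls that use `shell=True`. As a minimal sketch of why that matters, the example below passes the command as an argument list, so no shell parses it and metacharacters in untrusted input stay literal. The helper name `run_command` and the sample input are hypothetical and are not taken from the SONiC code.

```python
import subprocess
import sys


def run_command(argv):
    """Run a command given as an argument list, without invoking a shell.

    Returns (exit_code, stdout). Because shell defaults to False here, each
    list element reaches the child process as one literal argument.
    """
    completed = subprocess.run(argv, capture_output=True, text=True, check=False)
    return completed.returncode, completed.stdout


# Hypothetical untrusted value that would inject a second command under a shell.
untrusted = "hello; echo injected"

# The child is the current Python interpreter, which simply echoes its argument.
code, out = run_command(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted]
)

if __name__ == "__main__":
    print(code, repr(out))
```

The `;` in the input is printed back as part of one argument instead of being interpreted as a command separator.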