Why I did it
SONiC currently does not identify 'EdgeZoneAggregator' neighbor. As a result, the buffer profile attached to those interfaces uses the default cable length which could cause ingress packet drops due to insufficient headroom. Hence, there is a need to update the buffer templates to identify such neighbors and assign the same cable length as used by the T1.
How I did it
Modified the buffer template to identify EdgeZoneAggregator as a neighbor device type and assign it the same cable length as a T1/leaf router.
How to verify it
Unit tests pass, and manually checked on a 7260 to see the changes take effect.
Signed-off-by: dojha <devojha@microsoft.com>
Provide platform-components.json for Clearwater2 and Wolverine
These files are needed for fwutil platform sonic-mgmt tests to pass.
Fix PikeZ platform_components.json
Co-authored-by: Patrick MacArthur <pmacarthur@arista.com>
Co-authored-by: Andy Wong <andywong@arista.com>
- Why I did it
Fixes#14236
When a redis event quickly gets outdated during port breakout, error logs like this are seen
Mar 8 01:43:26.011724 r-leopard-56 INFO ConfigMgmt: Write in DB: {'PORT': {'Ethernet64': {'admin_status': 'down'}, 'Ethernet68': {'admin_status': 'down'}}}
Mar 8 01:43:26.012565 r-leopard-56 INFO ConfigMgmt: Writing in Config DB
Mar 8 01:43:26.013468 r-leopard-56 INFO ConfigMgmt: Write in DB: {'PORT': {'Ethernet64': None, 'Ethernet68': None}, 'INTERFACE': None}
Mar 8 01:43:26.018095 r-leopard-56 NOTICE swss#portmgrd: :- doTask: Configure Ethernet64 admin status to down
Mar 8 01:43:26.018309 r-leopard-56 NOTICE swss#portmgrd: :- doTask: Delete Port: Ethernet64
Mar 8 01:43:26.018641 r-leopard-56 NOTICE lldp#lldpmgrd[32]: :- pops: Miss table key PORT_TABLE:Ethernet64, possibly outdated
Mar 8 01:43:26.018654 r-leopard-56 ERR lldp#lldpmgrd[32]: unknown operation ''
- How I did it
Only log the error when the op is not empty and not one of ("SET" & "DEL" )
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
Fixes#11873.
#### Why I did it
When loading from minigraph, for port channels, don't create the members@ array in config_db in the PORTCHANNEL table. This is no longer needed or used.
In addition, when adding a port channel member from the CLI, that member doesn't get added into the members@ array, resulting in a bit of inconsistency. This gets rid of that inconsistency.
Why I did it
Dhcpmon had incorrect RX count for server side packets. It does not raise any false alarms, but could miss catching server side packet count mismatch between snapshot and current counter.
Add debug mode which prints counter to syslog
How I did it
Due to dualtor inbound filter requirement, there are currently two filters, each for listening to rx / tx packets.
Originally, we opened up an rx/tx socket for each interface specified, which causes duplicate socket. Now we initialize the sockets only once. Both sockets are not binded to an interface, and we use vlan to interface mapping to filter packets. For inbound uplinks, we use a portchannel to interface mapping.
Previous dhcpmon counter before dual tor change:
[ Agg-Vlan1000- Current rx/tx] Discover: 1/ 4, Offer: 1/ 1, Request: 3/ 12, ACK: 1/ 1
[ eth0- Current rx/tx] Discover: 0/ 0, Offer: 0/ 0, Request: 0/ 0, ACK: 0/ 0
[ eth0- Current rx/tx] Discover: 0/ 0, Offer: 0/ 0, Request: 0/ 0, ACK: 0/ 0
[ PortChannel104- Current rx/tx] Discover: 0/ 1, Offer: 0/ 0, Request: 0/ 3, ACK: 0/ 0
[ PortChannel103- Current rx/tx] Discover: 0/ 1, Offer: 0/ 0, Request: 0/ 3, ACK: 0/ 0
[ PortChannel102- Current rx/tx] Discover: 0/ 2, Offer: 1/ 0, Request: 0/ 6, ACK: 1/ 0
[ PortChannel101- Current rx/tx] Discover: 0/ 0, Offer: 0/ 0, Request: 0/ 0, ACK: 0/ 0
[ Vlan1000- Current rx/tx] Discover: 1/ 0, Offer: 0/ 1, Request: 3/ 0, ACK: 0/ 1
[ Agg-Vlan1000- Current rx/tx] Discover: 1/ 4, Offer: 1/ 1, Request: 3/ 12, ACK: 1/ 1
Dhcpmon counter after this PR:
[ PortChannel104- Current rx/tx] Discover: 0/ 1, Offer: 0/ 0, Request: 0/ 3, ACK: 0/ 0
[ PortChannel103- Current rx/tx] Discover: 0/ 1, Offer: 0/ 0, Request: 0/ 3, ACK: 0/ 0
[ PortChannel102- Current rx/tx] Discover: 0/ 2, Offer: 1/ 0, Request: 0/ 6, ACK: 1/ 0
[ PortChannel101- Current rx/tx] Discover: 0/ 0, Offer: 0/ 0, Request: 0/ 0, ACK: 0/ 0
[ Vlan1000- Current rx/tx] Discover: 1/ 0, Offer: 0/ 1, Request: 3/ 0, ACK: 0/ 1
[ Agg-Vlan1000- Current rx/tx] Discover: 1/ 4, Offer: 1/ 1, Request: 3/ 12, ACK: 1/ 1
How to verify it
Ran dhcp relay test to send all four packets in singles and batches on both single ToR and dual ToR. Counter was as expected.
On SONiC VoQ chassis, the speed changes are done from 400G to 100G needs to be supported on 400G linecards.
To enable this, along with speed change the port lanes need to be changed. This PR has the changes to update the port lanes when such speed change happens.
This PR is intended only for VoQ chassis linecards. These platforms today have 400g port with 8 serdes lines, and 100g will operate with 4 serdes lane. When the port speed changes from 400G to 100G the first 4 lanes will be used for 100G port.
Platforms which support 2x50g PAM4 or support 100G PAM4 serdes or other combinations are not handled in the PR.
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
#### Why I did it
Remove dialout as critical process as it is no longer used in prod. As part of future work, can remove dialout completely
#### How I did it
Remove from critical process list
- Why I did it
Healthd check system status every 60 seconds. However, running checker may take several seconds. Say checker takes X seconds, healthd takes (60 + X) seconds to finish one iteration. This implementation makes sonic-mgmt test case not so stable because the value X is hard to predict and different among different platforms. This PR introduces an interval
compensation mechanism to healthd main loop.
- How I did it
Introduces an interval compensation mechanism to healthd main loop: healthd should wait (60 - X) seconds for next iteration
- How to verify it
Manual test
Unit test
Why I did it
Add interface-id in dhcpv6-relay yang model
How I did it
Add interface-id option and corresponding UT. Updated configuration.md
How to verify it
kellyyeh@kellyyeh:~/sonic-buildimage/src/sonic-yang-models$ pyang -Vf tree -p /usr/local/share/yang/modules/ietf ./yang-models/sonic-dhcpv6-relay.yang
Why I did it
Find a new bug on kubelet side. The kubernetes-cni plug-in was removed in #12997, the reason is that the plug-in will be auto installed when install kubeadm, and will report error if we don't remove the install code. But after removal, the version auto installed is different from what we installed before. This will affect the kubelet action in some scenarios we don't find before. Need to install it by another way.
How I did it
Install kubernetes-cni==0.8.7-00 before install kubeadm
How to verify it
Flannel binary will be installed under /opt/cni/bin/ folder
Why I did it
Update dynamic threshold to -1 to get optimal performance for RDMA traffic
How I did it
Modified pg_profile_lookup.ini to reflect the correct value
Signed-off-by: Neetha John <nejo@microsoft.com>
Why I did it
This PR addresses the issue mentioned above by loading the acl config as a service on a storage backend device
How I did it
The new acl service is a oneshot service which will start after swss and does some retries to ensure that the SWITCH_CAPABILITY info is present before attempting to load the acl rules. The service is also bound to sonic targets which ensures that it gets restarted during minigraph reload and config reload
How to verify it
Build an image with the following changes and did the following tests
Verified that acl is loaded successfully on a storage backend device after a switch boot up
Verified that acl is loaded successfully on a storage backend ToR after minigraph load and config reload
Verified that acl is not loaded if the device is not a storage backend ToR or the device does not have a DATAACL table
Signed-off-by: Neetha John <nejo@microsoft.com>
- Why I did it
In sfplpm API, the number of logical ports is hardcoded as 64. When a system contains more port than this, the SDK APIs would fail with a syslog as below
Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [MGMT_LIB.ERR] Slot [0] Module [0] has logport [0x00010069] in enabled state
Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [SDK_MGMT_LIB.ERR] Failed in __sdk_mgmt_phy_module_pwr_attr_set, error: Internal Error
Mar 7 03:53:58.106118 r-leopard-58 ERR pmon#-c: Error occurred when setting power mode for SFP module 0, slot 0, error code 1
- How I did it
Remove the hardcoded value of 64. Obtained the number of logical ports from SDK
- How to verify it
Manual testing
- Why I did it
To include latest fixes:
Fix traffic loss on all routed traffic when moving from 4.4.3372/XX_2008_3388 to 4.5.4118-012/XX_2010_4120-010. Issue occurred after ISSU process in Spectrum 1 only, When upgrading from older version to a new one. Neighbor entries are overwritten.
Fix When using mirror session policer on SPC2/3, the actual CIR was 1.28 times more than the configured CIR value.
Fix Creation of router interface of type bridge may occasionally fail if create is performed immediately after delete.
Fix False errors during SDK deinitialization may be seen in the syslog
- How I did it
Updated SDK submodule and relevant makefiles with the required versions.
- How to verify it
Build an image and run tests from "sonic-mgmt".
- Why I did it
Sometimes Nvidia watchdog device isn't ready when watchdog-control service is up after first installation from ONIE
need to delay watchdog control service to go up after hw-mgmt which gets devices up and ready
- How I did it
Delay Nvidia watchdog-control service before hw-mgmt has started on Mellanox platform in order to avoid missing or not ready watchdog device.
- How to verify it
verification test of ONIE installation of image in a loop
making sure watchdog service is always up (not failed) after first installation from ONIE
fa8b709 Handled the error case of negative age (#57)
990f5b0 Use github code scanning instead of LGTM (#55)
a7992c5 Install libyang for swss-common. (#50)
244fa86 Update README.md
Signed-off-by: Vivek Reddy <vkarri@nvidia.com>
Manual cherry-pick of #14045
Why I did it
Fixing issue #13983 Added Missing fields in sonic-portchannel yang model. "fallback" and "fast_rate" fields are present in configuration schema but not in yang model. This leads to traceback when yang is validated
sonic_yang(3):All Keys are not parsed in PORTCHANNEL dict_keys(['PortChannel100'])
sonic_yang(3):exceptionList:["'fast_rate'"]
sonic_yang(3):Data Loading Failed:All Keys are not parsed in PORTCHANNEL dict_keys(['PortChannel100'])
exceptionList:["'fast_rate'"]
Data Loading Failed
All Keys are not parsed in PORTCHANNEL
dict_keys(['PortChannel100'])
exceptionList:["'fast_rate'"]
ConfigMgmt Class creation failed
Failed to break out Port. Error: Failed to load the config. Error: ConfigMgmtDPB Class creation failed
How I did it
Updated yang model
How to verify it
Added tests to verify
- Why I did it
Need to add the possibility to choose between dropping packets (using ACL) on ingress or egress in Dual ToR scenario
- How I did it
Add new attribute "mux_tunnel_ingress_acl" to SYSTEM_DEFAULTS table
- How to verify it
check that new attribute exists in redis:
admin@sonic:~$ redis-cli -n 4
127.0.0.1:6379[4]> HGETALL SYSTEM_DEFAULTS|mux_tunnel_ingress_acl
1."state"
2."false"
Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
Why I did it
Update SAI xgs version to 8.4.0.2 and migrate xgs to DMZ repo.
How I did it
Update SAI xgs version in sai.mk.
How to verify it
Run the SONiC and SAI test with the8.4 SAI release pipeline.
Why I did it
Fix similar issue seen on #13739 but only for DCS-7050CX3-32S
How I did it
Add a kernel parameter to tell libata to disable NCQ
How to verify it
The message ata2.00: FORCE: horkage modified (noncq) should appear on the dmesg.
Test results using: fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4
with NCQ
READ: bw=26.1MiB/s (27.4MB/s), 26.1MiB/s-26.1MiB/s (27.4MB/s-27.4MB/s), io=3136MiB (3288MB), run=120053-120053msec
WRITE: bw=26.3MiB/s (27.6MB/s), 26.3MiB/s-26.3MiB/s (27.6MB/s-27.6MB/s), io=3161MiB (3315MB), run=120053-120053msec
without NCQ
READ: bw=22.0MiB/s (23.1MB/s), 22.0MiB/s-22.0MiB/s (23.1MB/s-23.1MB/s), io=2647MiB (2775MB), run=120069-120069msec
WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=2665MiB (2795MB), run=120069-120069msec
Fixes#13568
Upgrade from old image always requires squashfs mount to get the next image FW binary. This can be avoided if we put FW binary under platform directory which is easily accessible after installation:
admin@r-spider-05:~$ ls /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
/host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
admin@r-spider-05:~$ ls -al /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa
lrwxrwxrwx 1 root root 66 Feb 8 17:57 /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa -> /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
- Why I did it
202211 and above uses different squashfs compression type that 201911 kernel can not handle. Therefore, we avoid mounting squashfs altogether with this change.
- How I did it
Place FW binary under /host/image-/platform/mlnx/, soft links in /etc/mlnx are created to avoid breaking existing scripts/automation.
/etc/mlnx/fw-SPCX.mfa is a soft link always pointing to the FW that should be used in current image
mlnx-fw-upgrade.sh is updated to prefer /host/image-/platform/mlnx location and fallback to /etc/mlnx in squashfs in case new location does not exist. This is necessary to do image downgrade.
- How to verify it
Upgrade from 201911 to master
master to 201911 downgrade
master -> master reboot
ONIE -> master boot (First FW burn)
Which release branch to backport (provide reason below if selected)
Why I did it
8c7ddf56 - [warm/fast-reboot] Backup logs from tmpfs to disk during fast/warm shutdown ([swss]: update swss docker to stretch #2714) (3 hours ago) [Vaibhav Hemant Dixit]
f2a31b30 - [ci] Fix pipeline issue caused by sonic-slave-* change. ([201803] Modify Debian apt repos to reflect changes made by maintainers #2709) (3 hours ago) [Liu Shilong]
586ecf0e - [dhcp_relay] Fix dhcp_relay restart error while add/del vlan ([thrift] add a patch to revert THRIFT-3650 #2688) (3 hours ago) [Yaqiang Zhu]
07b0ef4c - [portstat CLI] don't print reminder if use json format ([devices] add new accton platform minipack. #2670) (3 hours ago) [wenyiz2021]
48d3d3ef - [show][muxcable] add some new commands health, reset-cause, queue_info support for muxcable (DUT takes more than 7 seconds to finish update ip v6 neighbor #2414) (3 hours ago) [vdahiya12]
How I did it
How to verify it
Why I did it
submodule advance
b085b5f - [ci] Fix pipeline error about team5 not found. (Core dump in orchagent when assigning router interface to a vlan with untagged mode #2684) (3 hours ago) [Liu Shilong]
4549b4c - Fix issue: there is no retry while creating a RIF which is in removing state ([201811 sub-module] advance sub-modules: utilities, swss, swss-common #2679) (3 hours ago) [Junchao-Mellanox]
980a45b - [FDB]Fixing FDB consolidated flush for Remote MACs (pmon to stretch #2673) (3 hours ago) [Sudharsan Dhamal Gopalarathnam]
c646607 - Do not allow to add port to .1Q bridge while router port deletion is not completed (Update SDK, FW and SAI #2669) (3 hours ago) [Lior Avramov]
4a321f0 - [orchagent]: Get bridge port ID from orchagent cache instead of SAI API ([201811 sub module] advance sairedis sub module #2657) (3 hours ago) [Lawrence Lee]
f4b88f3 - [Dual-ToR] handle 'mux_tunnel_egress_acl' attrib in order to change ACL configuration (drop on ingress/egress) on standby ToR (lm75 doesn't support written alarm to syslog. #2646) (3 hours ago) [Andriy Yurkiv]
a4f29c1 - [Workaround] EvpnRemoteVnip2pOrch warmboot check failure ([teamd]: wait for swss db flush done before starting teamd container #2626) (3 hours ago) [jcaiMR]
53ee0a8 - Support for tc-dot1p and tc-dscp qosmap ([201803] [router-advertiser] Add templated script to wait for pertinent interfaces to be ready before starting radvd #2559) (3 hours ago) [Divya Mukundan]
b953866 - [dual-tor] add missing SAI attribte in order to create IPNIP tunnel (Config reload/load_minigraph not clearing State DB #2503) (3 hours ago) [Andriy Yurkiv]
How I did it
How to verify it
#### Why I did it
Following the PR https://github.com/sonic-net/sonic-swss-common/pull/739 increasing netlink buffer size in linux kernel
As error is seen in fdbsyncd with netlink reports "out of memory on reading a netlink socket" It is seen when kernel is sending 10k remote mac to fdbsyncd.
#### How I did it
Increase the buffer size of the netlink buffer from 3MB to 16MB
#### How to verify it
Verified with 10k remote mac, and restarting the fdbsyncd process. So that kernel send the bridge fdb dump to the fdbsyncd.
Verified that the netlink buffer error is not reported in the sys log.
Why I did it
cf9a66b - Fix issue: bulk counter feature is disabled ([Broadcom]: Update Broadcom SDK/SAI package #1205) (4 hours ago) [Lior Avramov]
8b1583b - [Dual-ToR] update sai.profile with SAI_ADDITIONAL_MAC_ENABLED attribute if corresponding arg passed to syncd ([Makefile]: variable ENABLE_SYNCD_RPC is always empty string #1201) (4 hours ago) [Andriy Yurkiv]
50d8e21 - [syncd]: Enable port bulk API ([platform] Accton AS7712-32X. Update for sensors and sfputil. #1197) (4 hours ago) [Nazarii Hnydyn]
a72438a - Use new value of STATE_DB FAST_REBOOT entry ([device/accton]: Update Accton-AS5712_54X #1196) (4 hours ago) [Aryeh Feigin]
d78ce86 - validation support for SAI_ATTR_VALUE_TYPE_JSON ([installer] FIX. ONIE installer error issue: #1152) (4 hours ago) [svshah-intel]
How I did it
How to verify it
Why I did it
9ccaaa5 - Update host electrical interface for 2x100G AOC ([platform]: add dell s6100 into one image #346) (4 hours ago) [mihirpat1]
d7016a4 - [ssd_generic] Get health status from Remaining_Life_Left field for virtium SSD ([docker]: Update docker-orchagent start.sh to combine td2 qos/buffers… #344) (4 hours ago) [Junchao-Mellanox]
How I did it
How to verify it
Why I did it
6391de0 - [ycable] add changes for correcting telemetry values for 'active-active' (Add default dhcp_relay.yml file to OneImage build #341) (4 hours ago) [vdahiya12]
2cb31c4 - Update CMIS module types for 2x100G AOC support ([kernel]: update linux kernel to support z9100 #339) (4 hours ago) [mihirpat1]
2ea9cf2 - [ycabled] add more coverage to ycabled; add minor name change for vendor API CLI return key-values pairs ([Makefile]: Automatically rebuild sonic-slave #338) (4 hours ago) [vdahiya12]
How I did it
How to verify it
Why I did it
e732ed0 - Prevent sonic-db-cli generate core dump (Update submodule: sairedis #749) (4 minutes ago) [Hua Liu]
28adcb4 - Support for TC-DOT1p qos map (Update submodules: sonic-swss-common, sonic-sairedis #721) (5 minutes ago) [Divya Mukundan]
How I did it
How to verify it
Why I did it
[Build] Support to use loosen version when failed to install python packages
It is to fix the issue #14012
How I did it
Try to use the installation command without constraint
How to verify it