**What I did**
Check /etc/pam.d/sshd integrity after modify it in hostcfgd.
**Why I did it**
Found some incident that /etc/pam.d/sshd become empty file during OR upgrade.
**How I verified it**
Pass all UT.
Add new UT to cover new code.
**Details if related**
This is a manually cherry-pick PR for https://github.com/sonic-net/sonic-host-services/pull/36
Backport #14372 to 202012
Why I did it
For better accounting purposes, updating the ingress lossy traffic profile to use static threshold. This change is only intended for Th devices using RDMA-CENTRIC profiles
How I did it
Update the buffer templates for Th devices in RDMA-CENTRIC folder to use the correct threshold
Signed-off-by: Neetha John <nejo@microsoft.com>
During the build process, a dsc file is retrieved from the URL:
http://deb.debian.org/debian/pool/main/i/isc-dhcp/isc-dhcp_4.4.1-2.3.dsc
Depending on the DNS resolution, the server reached may respond with a
HTTP 404 error code, what stops the build process.
In all cases, the URL http://deb.debian.org/debian/pool/main/i/isc-dhcp/
no more lists this DSC file but one with a different format.
The suffix "+deb11u1" is now appended to identify the debian version.
- append this suffix to the make file rules of isc-dhcp
Signed-off-by: Guillaume Lambert <guillaume.lambert@orange.com>
Co-authored-by: Guilt <guillaume.lambert@orange.com>
For sonic-platform-daemons following commits are added to the submodule
dd8fbae (HEAD -> 202012, origin/202012) [ycabled] add more coverage to ycabled; add minor name change for vendor API CLI return key-values pairs (#338)
846555e [thermalctld] fix some redundant removal of state DB tables (#315)
3d92fb9 Use github code scanning instead of LGTM (#316)
For sonic-utilities the following commits are added in this PR to the submodule
git log --oneline 39cdb49c..202012
ec4c6ea5 (HEAD -> 202012, origin/202012) [show][muxcable] add some new commands health, reset-cause, queue_info support for muxcable (#2414) (#2704)
03ef272e [202012][vlan] Remove add field of vlanid to DHCP_RELAY table while adding vlan (#2681)
e00a81ac [202012][dhcp-relay] Add support for dhcp_relay config cli (#2640)
274184e1 [vlan] Refresh dhcpv6_relay config while adding/deleting a vlan (#2660) (#2668
#### Why I did it
updating the submodule of sonic-platform-daemons, sonic-utilities
#### How I did it
updated the submodule
Why I did it
This PR addresses the issue mentioned above by loading the acl config as a service on a storage backend device
How I did it
The new acl service is a oneshot service which will start after swss and does some retries to ensure that the SWITCH_CAPABILITY info is present before attempting to load the acl rules. The service is also bound to sonic targets which ensures that it gets restarted during minigraph reload and config reload
How to verify it
Build an image with the following changes and did the following tests
Verified that acl is loaded successfully on a storage backend device after a switch boot up
Verified that acl is loaded successfully on a storage backend ToR after minigraph load and config reload
Verified that acl is not loaded if the device is not a storage backend ToR or the device does not have a DATAACL table
Signed-off-by: Neetha John <nejo@microsoft.com>
Why I did it
Update dynamic threshold to -1 to get optimal performance for RDMA traffic
How I did it
Modified pg_profile_lookup.ini to reflect the correct value
Signed-off-by: Neetha John <nejo@microsoft.com>
Why I did it
Dhcpmon had incorrect RX count for server side packets. It does not raise any false alarms, but could miss catching server side packet count mismatch between snapshot and current counter.
Add debug mode which prints counter to syslog
How I did it
Due to dualtor inbound filter requirement, there are currently two filters, each for listening to rx / tx packets.
Originally, we opened up an rx/tx socket for each interface specified, which causes duplicate socket. Now we initialize the sockets only once. Both sockets are not binded to an interface, and we use vlan to interface mapping to filter packets. For inbound uplinks, we use a portchannel to interface mapping.
Previous dhcpmon counter before dual tor change:
[ Agg-Vlan1000- Current rx/tx] Discover: 1/ 4, Offer: 1/ 1, Request: 3/ 12, ACK: 1/ 1
[ eth0- Current rx/tx] Discover: 0/ 0, Offer: 0/ 0, Request: 0/ 0, ACK: 0/ 0
[ eth0- Current rx/tx] Discover: 0/ 0, Offer: 0/ 0, Request: 0/ 0, ACK: 0/ 0
[ PortChannel104- Current rx/tx] Discover: 0/ 1, Offer: 0/ 0, Request: 0/ 3, ACK: 0/ 0
[ PortChannel103- Current rx/tx] Discover: 0/ 1, Offer: 0/ 0, Request: 0/ 3, ACK: 0/ 0
[ PortChannel102- Current rx/tx] Discover: 0/ 2, Offer: 1/ 0, Request: 0/ 6, ACK: 1/ 0
[ PortChannel101- Current rx/tx] Discover: 0/ 0, Offer: 0/ 0, Request: 0/ 0, ACK: 0/ 0
[ Vlan1000- Current rx/tx] Discover: 1/ 0, Offer: 0/ 1, Request: 3/ 0, ACK: 0/ 1
[ Agg-Vlan1000- Current rx/tx] Discover: 1/ 4, Offer: 1/ 1, Request: 3/ 12, ACK: 1/ 1
Dhcpmon counter after this PR:
[ PortChannel104- Current rx/tx] Discover: 0/ 1, Offer: 0/ 0, Request: 0/ 3, ACK: 0/ 0
[ PortChannel103- Current rx/tx] Discover: 0/ 1, Offer: 0/ 0, Request: 0/ 3, ACK: 0/ 0
[ PortChannel102- Current rx/tx] Discover: 0/ 2, Offer: 1/ 0, Request: 0/ 6, ACK: 1/ 0
[ PortChannel101- Current rx/tx] Discover: 0/ 0, Offer: 0/ 0, Request: 0/ 0, ACK: 0/ 0
[ Vlan1000- Current rx/tx] Discover: 1/ 0, Offer: 0/ 1, Request: 3/ 0, ACK: 0/ 1
[ Agg-Vlan1000- Current rx/tx] Discover: 1/ 4, Offer: 1/ 1, Request: 3/ 12, ACK: 1/ 1
How to verify it
Ran dhcp relay test to send all four packets in singles and batches on both single ToR and dual ToR. Counter was as expected.
- Why I did it
To optimize Mellanox platform build
- How I did it
sdk debs are now downloaded as Spectrum-SDK-Drivers-SONiC-Bins release
sx kernel is downloaded as zip from Spectrum-SDK-Drivers
Update sonic-restapi for the following commit:
44121be - 2023-03-14: Support ipv6 prefix length greater than 64 and check for adv_prefix
47e4b53 - 2023-03-15: Set allowed IPv6 pfx len to be 60
Manual cherry-pick of https://github.com/sonic-net/sonic-buildimage/pull/14138
- Why I did it
In sfplpm API, the number of logical ports is hardcoded as 64. When a system contains more port than this, the SDK APIs would fail with a syslog as below
Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [MGMT_LIB.ERR] Slot [0] Module [0] has logport [0x00010069] in enabled state
Mar 7 03:53:58.105980 r-leopard-58 ERR syncd#SDK: [SDK_MGMT_LIB.ERR] Failed in __sdk_mgmt_phy_module_pwr_attr_set, error: Internal Error
Mar 7 03:53:58.106118 r-leopard-58 ERR pmon#-c: Error occurred when setting power mode for SFP module 0, slot 0, error code 1
- How I did it
Remove the hardcoded value of 64. Obtained the number of logical ports from SDK
- How to verify it
Manual testing
#### Why I did it
Some products might experience an occasional IO failure in the communication between CPU and SSD.
Based on some research it could be attributable to some device not handling ATA NCQ (Native Command Queue).
This issue currently affect 4 products:
- `DCS-7170-32C*`
- `DCS-7170-64C`
- `DCS-7060DX4-32`
- `DCS-7260CX3-64`
#### How I did it
This change disable NCQ on the affected drive for a small set of products.
#### How to verify it
When the fix is applied, these 2 patterns can be found in the dmesg.
`ata1.00: FORCE: horkage modified (noncq)`
`NCQ (not used)`
Test results using: `fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4`
with NCQ (`ata1.00: 61865984 sectors, multi 1: LBA48 NCQ (depth 32), AA`)
```
READ: bw=33.9MiB/s (35.6MB/s), 33.9MiB/s-33.9MiB/s (35.6MB/s-35.6MB/s), io=4073MiB (4270MB), run=120078-120078msec
WRITE: bw=34.1MiB/s (35.8MB/s), 34.1MiB/s-34.1MiB/s (35.8MB/s-35.8MB/s), io=4100MiB (4300MB), run=120078-120078msec
```
without NCQ (`ata1.00: 61865984 sectors, multi 1: LBA48 NCQ (not used)`)
```
READ: bw=31.7MiB/s (33.3MB/s), 31.7MiB/s-31.7MiB/s (33.3MB/s-33.3MB/s), io=3808MiB (3993MB), run=120083-120083msec
WRITE: bw=31.9MiB/s (33.4MB/s), 31.9MiB/s-31.9MiB/s (33.4MB/s-33.4MB/s), io=3830MiB (4016MB), run=120083-120083msec
```
#### Description for the changelog
Disable ATA NCQ for a few Arista products
#### Why I did it
This is part of a corresponding change to the pcie daemon that enables it to verify PCI peripherals on a platform against a preconfigured YAML file, and enables the pcied daemon to call the system commands needed for PCI peripheral verification
#### How I did it
Adding aforementioned libraries to the Dockerfile.j2 file
#### How to verify it
run 'which setpci' from the pmon docker - would show the path of the binary
#### Description for the changelog
Modified pmon's Dockerfile.j2 to include pciutils and libpci libraries.
**cherry-pick of SHA: 7de04504c9518d68aa00c304b7376fdff4e1d318**
#### Why I did it
When using ```sonic-install set-default``` to switch the image from 202012 to 202205. The system will be stuck at loading kernel while reboot.
#### How I did it
The issue is caused by the kernal size related setting in uboot environment is smaller in the 202012 branch while they are larger in 202205 branch. The "sonic-installer set-default" just changes the boot_next variable. To fix this issue, we sync up the 202012 branch kernel related setting with the 202205 branch. This PR is only applicable to 202012 branch.
#### How to verify it
1) Install the latest 202205 image .89 or latest and reboot
2) Install the 202012 image which contains this fix and reboot
3) using "sonic-installer set-default 202205 image and reboot
4) system should start without any issue.
#### Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
* Include the following commits:
- a21b160 [202012][orchagent]: Handle duplicate routes in a graceful manner (#2666)
- 1540161 [bfdorch] add default TOS value for BFD packet (#2692)
- 860430c [ci] run apt-get update before apt-get install (#2686)
This PR is to backport #14138 to 202012.
- Why I did it
In sfplpm API, the number of logical ports is hardcoded as 64. When a system contains more port than this, the SDK APIs would fail with a trace as below
Enabling low-power mode for port Ethernet0... Traceback (most recent call last):
File "/usr/share/sonic/platform/plugins/sfplpmset.py", line 167, in
set_lpmode(handle, cmd, sfp_module)
File "/usr/share/sonic/platform/plugins/sfplpmset.py", line 128, in set_lpmode
SX_MGMT_PHY_MOD_PWR_ATTR_PWR_MODE_E, SX_MGMT_PHY_MOD_PWR_MODE_LOW_E)
File "/usr/share/sonic/platform/plugins/sfplpmset.py", line 115, in pwr_attr_set
mgmt_phy_mod_pwr_attr_set(handle, module_id, attr_type, power_mode)
File "/usr/share/sonic/platform/plugins/sfplpmset.py", line 84, in mgmt_phy_mod_pwr_attr_set
assert SX_STATUS_SUCCESS == rc, "sx_mgmt_phy_mod_pwr_attr_set failed"
AssertionError: sx_mgmt_phy_mod_pwr_attr_set failed
Error! Unable to set LPM for 1, rc = 1, err msg: [+] opening sdk
Mar 07 03:25:28 INFO LOG: Initializing SX log with STDOUT as output file.
Mar 07 03:25:28 ERROR SX_API_PORT: sx_mgmt_phy_mod_pwr_attr_get: This API is deprecated and will be removed in the future. Please use sx_mgmt_phy_module_pwr_attr_get in its place.
Mar 07 03:25:28 ERROR SX_API_PORT: sx_mgmt_phy_mod_pwr_attr_set: This API is deprecated and will be removed in the future. Please use sx_mgmt_phy_module_pwr_attr_set in its place.
- How I did it
Remove the hardcoded value of 64. Obtained the number of logical ports from SDK
- How to verify it
Manual testing
#### Why I did it
Update sonic-snmpagent submodule to include below commit:
fba50c6 [202012]: snmp vlan support per RFC1213 and added the missing support for RFC2863 (#279)
Why I did it
[Build] Support to use loosen version when failed to install python packages
It is to fix the issue #14012
How I did it
Try to use the installation command without constraint
How to verify it
Why I did it
Platform cases test_tx_disable, test_tx_disable_channel, test_power_override failed in dx010.
How I did it
Add i2c access algorithm for CPLD i2c adapters.
How to verify it
Verify it with platform_tests/api/test_sfp.py::TestSfpApi test cases.
#### Why I did it
Remove dialout as critical process as it is no longer used in prod. As part of future work, can remove dialout completely
#### How I did it
Remove from critical process list
Why I did it
Seastone does not have the psu fans' status led, need to reflect it in platform.json.
How I did it
Set the psu fans status led available to false.
How to verify it
Verify it with platform_tests/api/test_psu_fans.py::TestPsuFans::test_set_fans_led case.
- Why I did it
Update MFT version to 4.21.0-100 to include a fix for an issue reported using mlxlink on qsfp-dd
- How I did it
Update mft.mk
- How to verify it
Run regression on Mellanox platforms
Why I did it
Fix the docker-base-stretch build issue in nephos platform.
Collecting supervisord-dependent-startup==1.4.0
Downloading d386c3d2cf/supervisord_dependent_startup-1.4.0-py2.py3-none-any.whl
Collecting toposort>=1.5 (from supervisord-dependent-startup==1.4.0)
Downloading 44e51b4216/toposort-1.9.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
IOError: [Errno 2] No such file or directory: '/tmp/pip-build-LnROQE/toposort/setup.py'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-LnROQE/toposort/
The other platforms have been upgraded to docker-base-buster, not impacted.
How I did it
Pin the toposort version to 1.7, the package supervisord-dependent-startup has dependency on it.
The toposort>=1.8 only for python3, is not applicable to python2.
Why I did it
Support to upgrade packages, do better cleanup after the build.
How I did it
Remove the no use preference version control file after the build.
How to verify it
- Why I did it
sfp_event.py gets a PMPE message when a cable event is available. In PMPE message, there is no label port available. Current sfp_event.py is using sx_api_port_device_get to get 64 logical ports attributes, and find the label port from those 64 attributes. However, if there are more than 64 ports, sfp_event.py might not be able to find the label port and drop the PMPE message.
- How I did it
Don't use hardcoded 64, get logical port number instead.
- How to verify it
Manual test
Upgrade from old image always requires squashfs mount to get the next image FW binary. This can be avoided if we put FW binary under platform directory which is easily accessible after installation:
admin@r-spider-05:~$ ls /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
/host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
admin@r-spider-05:~$ ls -al /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa
lrwxrwxrwx 1 root root 66 Feb 8 17:57 /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa -> /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
- Why I did it
202211 and above uses different squashfs compression type that 201911 kernel can not handle. Therefore, we avoid mounting squashfs altogether with this change.
- How I did it
Place FW binary under /host/image-/platform/mlnx/, soft links in /etc/mlnx are created to avoid breaking existing scripts/automation.
/etc/mlnx/fw-SPCX.mfa is a soft link always pointing to the FW that should be used in current image
mlnx-fw-upgrade.sh is updated to prefer /host/image-/platform/mlnx location and fallback to /etc/mlnx in squashfs in case new location does not exist. This is necessary to do image downgrade.
- How to verify it
Upgrade from 201911 to 202012
202012 to 201911 downgrade
202012 -> 202012 reboot
ONIE -> 202012 boot (First FW burn)
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Why I did it
Docker build has a low rate of hanging up.
It hangs on different steps. So, it looks like a bug in docker daemon.
How I did it
Start a daemon process to scan running time more than 1 hours, and kill the process.
How to verify it
- Why I did it
There are 3 tasks in xcvrd:
main task, run a loop to recover missing SFP static information to DB every 1 minute
SFP state task, a process which listens cable plug in/out event, insert SFP static information to DB while a cable is inserted
SFP DOM update task, a thread which handles cable DOM information update every 1 minute
Let assume user replaces QSFP with QSFP-DD. There are two issues:
Only SFP state task listens cable plug in/out event, main task and SFP DOM update task does not know SFP type has changed, they still “think” the SFP type is QSFP. So, main task and SFP DOM update task uses QSFP standard to parse QSFP-DD EEPROM which causes corrupted data.
There is a race condition between main task and SFP state task. They both insert SFP static information to DB. Depends on timing, it is possible that main task using wrong SFP type to override SFP static information.
The PR is to fix these two issues.
There is no such issue on 202205 and above because there is a refactor for xcvrd:
SFP state task was changed from process to thread, so that all 3 tasks share the same memory space, they always have correct SFP type.
Recover missing SFP information logical was moved from main task to SFP state task. There is no race condition anymore.
- How I did it
It is difficult to back port latest xcvrd because there are many refactor/new features in xcvrd after 202012 release. It will be huge effort to do so. Based on that, we decided to fix the issue on Nvidia platform API side. The fix is that: refreshing SFP type before any SFP API which accessing SFP EEPROM. Refreshing SFP type before any SFP API would cause a small performance down: Due to my test on 202012 branch, accessing transceiver INFO and DOM INFO for 32 ports takes 1.7 seconds before the change. The number changes to 2.4 seconds after the change. I suppose the performance down is acceptable.
- How to verify it
Manual test
Regression
#### Why I did it
Submodule update for sonic-swss-common with following change:
```
3e34309 2023-02-11 | RedisPipeline ignore flush when call dtor from another thread. (#736) [Hua Liu]
```
Why I did it
Some products might experience an occasional IO failure in the communication between CPU and SSD.
Based on some research it could be attributable to some device not handling ATA NCQ (Native Command Queue).
This issue currently affect 4 products:
DCS-7170-32C*
DCS-7170-64C
DCS-7060DX4-32
DCS-7260CX3-64
How I did it
This change disable NCQ on the affected drive for a small set of products.
How to verify it
When the fix is applied, these 2 patterns can be found in the dmesg.
ata1.00: FORCE: horkage modified (noncq)
NCQ (not used)
Test results using: fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4
with NCQ (ata1.00: 61865984 sectors, multi 1: LBA48 NCQ (depth 32), AA)
READ: bw=33.9MiB/s (35.6MB/s), 33.9MiB/s-33.9MiB/s (35.6MB/s-35.6MB/s), io=4073MiB (4270MB), run=120078-120078msec
WRITE: bw=34.1MiB/s (35.8MB/s), 34.1MiB/s-34.1MiB/s (35.8MB/s-35.8MB/s), io=4100MiB (4300MB), run=120078-120078msec
without NCQ (ata1.00: 61865984 sectors, multi 1: LBA48 NCQ (not used))
READ: bw=31.7MiB/s (33.3MB/s), 31.7MiB/s-31.7MiB/s (33.3MB/s-33.3MB/s), io=3808MiB (3993MB), run=120083-120083msec
WRITE: bw=31.9MiB/s (33.4MB/s), 31.9MiB/s-31.9MiB/s (33.4MB/s-33.4MB/s), io=3830MiB (4016MB), run=120083-120083msec
Which release branch to backport (provide reason below if selected)
Why I did it
1. fix chassis test_set_fans_led case
2. fix chassis get_name case mismatch issue
3. fix fan_drawer test_set_fans_speed
4. fix component test_components test case
How I did it
Add corresponding configuration into chassis json file
How to verify it
Run platform tests cases to verify these failure cases
#### Why I did it
Include below commits:
```
7147354 2023-02-14 | Fix: zero route may have empty nexthop (#276) [Qi Luo]
e60a64c 2022-11-30 | Use github code scanning instead of LGTM (#274) [Liu Shilong]
```
#### Why I did it
1.57.x SDK based incremental drop that addresses:
Fix for MIGSMSFT-158
Support for VxLAN and BFD Serviceability CLI
sfputil reset platform fix to handle 100G optics
Added thermal management feature for ZR optics sensors
#### How I did it
Update cisco-8000 submodule to v0.2.5
Include following commits:
- c98b9f09 [202012][orchagent]: Get bridge port ID from FDBOrch cache instead of SAI API #2657 (#2658)
- 59886b8f [MuxOrch] Enabling neighbor when adding in active state (#2601)