* [sflow + dropmon] added INCLUDE_SFLOW_DROPMON flag, added patches for hsflowd
*Added a capability of monitoring dropped packets for the sFlow daemon in order to improve network - monitoring, diagnostic, and troubleshooting. The drop monitor service allows the sFlow daemon to export another type of sample - dropped packets as Discard samples alongside Counter samples and Packet Flow samples.
Signed-off-by: Vadym Hlushko <vadymh@nvidia.com>
- Why I did it
Advance to new SAI version for bugs fixes as well as new features/enhacements:
New:
1. ARM64 support
2. FG ECMP performance optimization
3. Support setting empty list for port ingress/egress buffer profile list
4. Add service port for SN5600
5. Add CR8/SR8/LR8/KR8 interface type
6. Disable mlxtrace during debug dump
Fixes:
1. Fix SAI_ACL_ENTRY_ATTR_FIELD_TC
2. Fix Packets loop back if no member in portchannel
3. Fix optimize descriptors apply time (and fast boot time)
4. Add flush fdb entries for vxlan tunnel bridge port
5. Don't disable used tunnel underlay interfaces
- How I did it
Advanced SAI submodule
- How to verify it
make configure PLATFORM=mellanox
make target/sonic-mellanox.bin
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
* [Mellanox]Check dmi file permission before access (#11309)
Signed-off-by: Sudharsan Dhamal Gopalarathnam sudharsand@nvidia.com
Why I did it
During the system boot up when 'show platform status' or 'show version' command is executed before STATE_DB CHASSIS_INFO table is populated, the show will try to fallback to use the platform API. The DMI file in mellanox platforms require root permission for access. So if the show commands are executed as admin or any other user, the following error log will appear in the syslog
Jun 28 17:21:25.612123 sonic ERR show: Fail to decode DMI /sys/firmware/dmi/entries/2-0/raw due to PermissionError(13, 'Permission denied')
How I did it
Check the file permission before accessing it.
How to verify it
Added UT to verify. Manually verified if the error log is not thrown.
This PR is a backport of #10950 and a fix for it #11227
- Why I did it
To not build python2 pysairedis on bullseye
- How I did it
Cherry-picked above PRs from master
- How to verify it
Build
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
- Why I did it
This is for the eventual support of multiple architectures for the mellanox platform.
- How I did it
Change the location of the binaries in Switch-SDK-drivers so that the path specifies the target architecture in addition to the target distribution that the debians are built for.
This is the most straightforward way to separate binaries built against different architectures and selectively target them for installation in the mellanox SONiC image.
- How to verify it
Build SONiC for mellanox and verify it compiles successfully.
Why I did it
To further support parse out soc_ipv4 and soc_ipv6 out of Dpg:
<DeviceDataPlaneInfo>
<IPSecTunnels />
<LoopbackIPInterfaces xmlns:a="http://schemas.datacontract.org/2004/07/Microsoft.Search.Autopilot.Evolution">
<a:LoopbackIPInterface>
<ElementType>LoopbackInterface</ElementType>
<Name>HostIP</Name>
<AttachTo>Loopback0</AttachTo>
<a:Prefix xmlns:b="Microsoft.Search.Autopilot.NetMux">
<b:IPPrefix>10.10.10.2/32</b:IPPrefix>
</a:Prefix>
<a:PrefixStr>10.10.10.2/32</a:PrefixStr>
</a:LoopbackIPInterface>
<a:LoopbackIPInterface>
<ElementType>LoopbackInterface</ElementType>
<Name>HostIP1</Name>
<AttachTo>Loopback0</AttachTo>
<a:Prefix xmlns:b="Microsoft.Search.Autopilot.NetMux">
<b:IPPrefix>fe80::0002/128</b:IPPrefix>
</a:Prefix>
<a:PrefixStr>fe80::0002/128</a:PrefixStr>
</a:LoopbackIPInterface>
<a:LoopbackIPInterface>
<ElementType>LoopbackInterface</ElementType>
<Name>SoCHostIP0</Name>
<AttachTo>server2SOC</AttachTo>
<a:Prefix xmlns:b="Microsoft.Search.Autopilot.NetMux">
<b:IPPrefix>10.10.10.3/32</b:IPPrefix>
</a:Prefix>
<a:PrefixStr>10.10.10.3/32</a:PrefixStr>
</a:LoopbackIPInterface>
<a:LoopbackIPInterface>
<ElementType>LoopbackInterface</ElementType>
<Name>SoCHostIP1</Name>
<AttachTo>server2SOC</AttachTo>
<a:Prefix xmlns:b="Microsoft.Search.Autopilot.NetMux">
<b:IPPrefix>fe80::0003/128</b:IPPrefix>
</a:Prefix>
<a:PrefixStr>fe80::0003/128</a:PrefixStr>
</a:LoopbackIPInterface>
</LoopbackIPInterfaces>
</DeviceDataPlaneInfo>
Signed-off-by: Longxiang Lyu lolv@microsoft.com
How I did it
For servers loopback definitions in Dpg, if they contain LoopbackIPInterface with tags AttachTo, which has value of format like <server_name>SOC, the address will be regarded as a SoC IP, and sonic-cfggen now will treat the port connected to the server as active-active if the redundancy_type is either Libra or Mixed.
How to verify it
Pass the unittest.
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
Signed-off-by: Sudharsan Dhamal Gopalarathnam sudharsand@nvidia.com
Why I did it
During the system boot up when 'show platform status' or 'show version' command is executed before STATE_DB CHASSIS_INFO table is populated, the show will try to fallback to use the platform API. The DMI file in mellanox platforms require root permission for access. So if the show commands are executed as admin or any other user, the following error log will appear in the syslog
Jun 28 17:21:25.612123 sonic ERR show: Fail to decode DMI /sys/firmware/dmi/entries/2-0/raw due to PermissionError(13, 'Permission denied')
How I did it
Check the file permission before accessing it.
How to verify it
Added UT to verify. Manually verified if the error log is not thrown.
The asan-enabled docker image will be used in other CIs that run DVS tests.
Why I did it
To add a possibility to run DVS tests with ASAN for other CIs (e.g. swss).
How I did it
Added the 'asan_image' flag to the vs job group.
How to verify it
Run the CI and check the docker-sonic-vs-asan.gz artifact.
The following packages have unmet dependencies:
libssl-dev : Depends: libssl1.1 (= 1.1.1n-0+deb11u3) but 1.1.1n-0+deb11u2 is to be installed
E: Unable to correct problems, you have held broken packages.
- Why I did it
To include latest fixes:
1. Warmboot | When trying to reconfigure the Flex Parser header and Flex transition parameters after ISSU, the switch will returned an error even if the configuration was identical to that done before performing the ISSU.
2. Link Up | When toggling many ports of the Spectrum devices while raising 10GbE link up and link maintenance is enabled, the switch may get stuck and may need to be rebooted.
3. Shared buffer | While moving from lossless to lossy while shared headroom was used, reduction of the shared headroom can only be done prior to pool type change and when shared headroom is not utilized.
- How I did it
Updated SDK submodule along with the relevant Makefiles
- How to verify it
Build an image and run tests from "sonic-mgmt".
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
swss:
* 6bafea4 2022-06-29 | [vnetorch] [vxlanorch] fix a set of memory usage issues (#2352) (HEAD -> 202205, github/202205) [Yakiv Huryk]
utilities:
* c64454c 2022-06-28 | [GCU] Moving UniqueLanes from only validating moves, to be a supplemental YANG validator (#2234) (HEAD -> 202205, github/202205) [Mohamed Ghoneim]
* fbd79d4 2022-06-29 | Add check to not allow deleting PO if its member of vlan (#2237) [anilkpan]
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
- Why I did it
New security feature for enforcing strong passwords when login or changing passwords of existing users into the switch.
- How I did it
By using mainly Linux package named pam-cracklib that support the enforcement of user passwords, the daemon named hostcfgd, will support add/modify password policies that enforce and strengthen the user passwords.
- How to verify it
Manually Verification-
1. Enable the feature, using the new sonic-cli command passw-hardening or manually add the password hardening table like shown in HLD by using redis-cli command
2. Change password policies manually like in step 1.
Notes:
password hardening CLI can be found in sonic-utilities repo-
P.R: Add support for Password Hardening sonic-utilities#2121
code config path: config/plugins/sonic-passwh_yang.py
code show path: show/plugins/sonic-passwh_yang.py
3. Create a new user (using adduser command) or modify an existing password by using passwd command in the terminal. And it will now request a strong password instead of default linux policies.
Automatic Verification - Unitest:
This PR contained unitest that cover:
1. test default init values of the feature in PAM files
2. test all the types of classes policies supported by the feature in PAM files
3. test aging policy configuration in PAM files
Avoid write_standby in warm restart context.
sign-off: Jing Zhang zhangjing@microsoft.com
Why I did it
In warm restart context, we should avoid mux state change.
How I did it
Check warm restart flag before applying changes to app db.
How to verify it
Ran write_standby in table missing, key missing, field missing scenarios.
Did a warm restart, app db changes were skipped. Saw this in syslog:
WARNING write_standby: Taking no action due to ongoing warmrestart.
Signed-off-by: bingwang <wang.bing@microsoft.com>
Why I did it
This PR brings two changes
Add lossy PG profile for PG2 and PG6 on T1 for ports between T1 and T2.
After PR Update qos config to clear queues for bounced back traffic #10176 , the DSCP_TO_TC_MAP and TC_TO_PG_MAP is updated when remapping is enable
DSCP_TO_TC_MAP
Before After Why do this change
"2" : "1" "2" : "2" Only change for leaf router to map DSCP 2 to TC 2 as TC 2 will be used for lossless TC
"6" : "1" "6" : "6" Only change for leaf router to map DSCP 6 to TC 6 as TC 6 will be used for lossless TC
TC_TO_PRIORITY_GROUP_MAP
Before After Why do this change
"2" : "0" "2" : "2" Only change for leaf router to map TC 2 to PG 2 as PG 2 will be used for lossless PG
"6" : "0" "6" : "6" Only change for leaf router to map TC 6 to PG 6 as PG 6 will be used for lossless PG
So, we have two new lossy PGs (2 and 6) for the T2 facing ports on T1, and two new lossless PGs (2 and 6) for the T0 facing port on T1.
However, there is no lossy PG profile for the T2 facing ports on T1. The lossless PGs for ports between T1 and T0 have been handled by buffermgrd .Therefore, We need to add lossy PG profiles for T2 facing ports on T1.
We don't have this issue on T0 because PG 2 and PG 6 are lossless PGs, and there is no lossy traffic mapped to PG 2 and PG 6
Map port level TC7 to PG0
Before the PCBB change, DSCP48 -> TC 6 -> PG 0.
After the PCBB change, DSCP48 -> TC 7 -> PG 7
Actually, we can map TC7 to PG0 to save a lossy PG.
How I did it
Update the qos and buffer template.
How to verify it
Verified by UT.
- Why I did it
While doing config reload, FEATURE table may be removed and re-add. During this process, updating FEATURE table is not atomic. It could be that the FEATURE table has entry, but each entry has no field. This PR introduces a retry mechanism to avoid this.
- How I did it
Introduces a retry mechanism to avoid this.
- How to verify it
New unit test added to verify the flow as well as running some manual test.
- Why I did it
To provide an ability to suppress ASAN false positives and have a clean ASAN report for docker-sonic-vs/mlnx-syncd/orchagent docker
- How I did it
Added the "print_suppressions=0" to ASAN configs.
- How to verify it
add a suppression to some ASAN-enabled component (the suppression should catch some leak)
build with ENABLE_ASAN=y
run a test and see that the ASAN report is empty instead of having the suppression summary
Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
Why I did it
Support to use symbol links in platform folder to reduce the image size.
The current solution is to copy each lazy installation targets (xxx.deb files) to each of the folders in the platform folder. The size will keep growing when more and more packages added in the platform folder. For cisco-8000 as an example, the size will be up to 2G, while most of them are duplicate packages in the platform folder.
How I did it
Create a new folder in platform/common, all the deb packages are copied to the folder, any other folders where use the packages are the symbol links to the common folder.
Why platform.tar?
We have implemented a patch for it, see #10775, but the problem is the the onie use really old unzip version, cannot support the symbol links.
The current solution is similar to the PR 10775, but make the platform folder into a tar package, which can be supported by onie. During the installation, the package.tar will be extracted to the original folder and removed.
#### Why I did it
Support the following tables which were introduced during dynamic buffer calculation
- LOSSLESS_TRAFFIC_PATTERN
- DEFAULT_LOSSLESS_BUFFER_PARAMETER
#### How I did it
- LOSSLESS_TRAFFIC_PATTERN
|name|type|range|mandatory|description|
|---|---|---|---|---|
|mtu|uint16|64~10240|true|The maximum packet size of a lossless packet|
|small_packet_percentage|uint8|0~100|true|The percentage of small packet|
- DEFAULT_LOSSLESS_BUFFER_PARAMETER
|name|type|range|mandatory|description|
|---|---|---|---|---|
|default_dynamic_th|int8|-8~7|true|The default dynamic_th for all buffer profiles that are dynamically generated for lossless PG|
|over_subscribe_ratio|uint16|-|false|The oversubscribe ratio for shared headroom pool.|
|||||Semantically, the upper bound is the number of physical ports but it can not be represented in the yang module. So we keep the upper bound open. As the type is (signed) integer whose lower bound is 0 by nature, we do not need to specify the range.|
#### How to verify it
Run unit test
To ensure that ASAN logs are always generated. Currently, the way to get the logs is to map the "/var/log/asan" outside of a container, which doesn't work for DVS test run with "--imgname" option.
Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
Why I did it
To address internal build failures where the cable len for some of the skus is set to 300m for all tiers.
How I did it
For the buffers test, generate a new output file based off the original expected output with CABLE_LENGTH table updated to use 300m. In the comparison logic, compare against each of the expected output files and if any matches, the testcase is set to pass
Signed-off-by: Neetha John <nejo@microsoft.com>
Why I did it
As part of PCBB changes, we need to enable 2 extra lossless queues. The changes in this PR are done to adjust only the reserved sizes on Th2 for the additional 2 lossless queues
Calculations are done based on 40 downlinks for T1 and 16 uplinks for dual ToR
How to verify it
Verified that the rendering works fine on Th2 dut
Unit tests have been updated to reflect the modified buffer sizes when pcbb is enabled. There are existing testcases that will test the original buffer sizes when pcbb is disabled. With these changes, was able to build sonic-config-engine wheel successfully
Signed-off-by: Neetha John <nejo@microsoft.com>
#### Why I did it
There might be a case where service checker periodic operation determined that specific container is running but when it tries to perform an operation on it, it was already closed by the user. This is a valid flow and we should not log an error message, informative warning is enough.
#### How I did it
I reduce log severity.
#### How to verify it
I verified it manually.
Why I did it
The t0-sonic pool has been fixed, so add it back to azp checker.
How I did it
Remove continueOnError in run-test-template.yml.
Signed-off-by: Ze Gan <ganze718@gmail.com>
#### Why I did it
SSHD keepalive timeout feature not enabled on sonic.
#### How I did it
Enable SSHD keepalive timeout feature by set ClientAliveCountMax to 1.
#### How to verify it
Pass All E2E test case.
Manually test with following steps:
1. Change config and restart sshd
2. Connect a ssh with -vvv option to show debug message
3. Get running ssh by command and stop it:
```
azureuser@liuh-dev-vm-02:~$ ps -auxww | grep vvv
azureus+ 1614153 0.0 0.0 12244 6004 pts/1S+ 15:48 0:00 ssh admin@10.250.0.101 -vvv
azureus+ 1615570 0.0 0.0 8168 2424 pts/3S+ 15:49 0:00 grep --color=auto vvv
azureuser@liuh-dev-vm-02:~$ kill -Stop 1614153
```
4. Check TCP status from server side with ss command:
https://man7.org/linux/man-pages/man8/ss.8.html
```
admin@vlab-01:~$ ss | grep -i ssh
tcp ESTAB 0 010.250.0.101:ssh 10.250.0.1:58150
tcp FIN-WAIT-2 0 010.250.0.101:ssh 10.250.0.1:58164
tcp ESTAB 0 010.250.0.101:ssh 10.250.0.1:57978
```
FIN-WAIT-2 means server already terminate the connection and wait for client response:
https://kb.iu.edu/d/ajmi
. FIN-WAIT-2 <-- <SEQ=300><ACK=101><CTL=ACK> <-- CLOSE-WAIT
5. Check again later will show the session been complete closed:
```
admin@vlab-01:~$ ss | grep -i ssh
tcp ESTAB 0 010.250.0.101:ssh 10.250.0.1:58150
tcp ESTAB 0 010.250.0.101:ssh 10.250.0.1:57978
```
- Why I did it
When LLDP is disabled through feature command, it gets spawned after reboot.
- How I did it
In syncd.sh check if the service is enabled before spawning automatically during cold reboot.
- How to verify it
Disable lldp feature. Perform cold reboot and verify its not spawned.