This is to backport community PR #8768 to 202106 branch
Why I did it
Support zero buffer profiles
Add buffer profiles and pool definition for zero buffer profiles
Support applying zero profiles on INACTIVE PORTS
Enable dynamic buffer manager to load zero pools and profiles from a JSON file
Signed-off-by: Stephen Sun <stephens@nvidia.com>
How I did it
Add buffer profiles and pool definition for zero buffer profiles
If the buffer model is static:
- Apply normal buffer profiles to admin-up ports
- Apply zero buffer profiles to admin-down ports
If the buffer model is dynamic:
- Apply normal buffer profiles to all ports
- buffer manager will take care when a port is shut down
- Update buffers_config.j2 to support INACTIVE PORTS by extending the existing macros to generate the various buffer objects, including PGs, queues, ingress/egress profile lists
Originally, all the macros to generate the above buffer objects took active ports only as an argument
Now that buffer items need to be generated on inactive ports as well, an extra argument representing the inactive ports needs to be added
To be backward compatible, a new series of macros is introduced that takes both active and inactive ports as arguments
The original version (with active ports only) will be checked first. If it is not defined, then the extended version will be called
Only vendors who support zero profiles need to change their buffer templates
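The profile selection rules above (static vs. dynamic buffer model) can be sketched as follows. This is a minimal illustration only, not the template logic itself: the function name `select_profiles`, the `"normal"`/`"zero"` labels, and the port dictionary layout are assumptions made for this example.

```python
from typing import Dict


def select_profiles(port_admin_status: Dict[str, str], buffer_model: str) -> Dict[str, str]:
    """Return a port -> profile kind map ("normal" or "zero").

    Hypothetical helper: the names and return format are illustrative only.
    """
    selection = {}
    for port, status in port_admin_status.items():
        if buffer_model == "dynamic":
            # Dynamic model: all ports get normal profiles; the buffer
            # manager takes care of ports that are shut down later.
            selection[port] = "normal"
        elif status == "up":
            # Static model, admin-up port: normal profiles.
            selection[port] = "normal"
        else:
            # Static model, admin-down port: zero profiles.
            selection[port] = "zero"
    return selection


ports = {"Ethernet0": "up", "Ethernet4": "down"}
print(select_profiles(ports, "static"))
print(select_profiles(ports, "dynamic"))
```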
Enable buffer manager to load zero pools and profiles from a JSON file:
The JSON file is provided on a per-platform basis
It is copied from the platform/<vendor> folder to the /usr/share/sonic/templates folder at compile time and rendered when the swss container is being created.
To make code clean and reduce redundant code, extract common macros from buffer_defaults_t{0,1}.j2 of all SKUs to two common files:
One in Mellanox-SN2700-D48C8 for single ingress pool mode
The other in ACS-MSN2700 for double ingress pool mode
The corresponding files of all other SKUs will be symbolic links to the above files
Update sonic-cfggen test accordingly:
- Adjust example output file of JSON template for unit test
- Add unit tests for Mellanox's new buffer templates.
How to verify it
Regression test.
Unit test in sonic-cfggen
Run regression test and manually test.
- Why I did it
To include latest fixes.
SAI
* Reduce verbosity of warning message on shared memory already existing
* accuflow allocation support by key value
SDK
* Under various circumstances, Ethernet ports falsely showed that InfiniBand cables were connected.
* In SN4600C, at times, the link up time in both DAC and optics cables may, in the worst case, take up to 15 seconds.
* Using SN4600C with copper or optics loopback cables in NRZ speeds, the link may take a long time to come up.
* When ECMP has a large number of next-hops based on VLAN interfaces, in some rare cases, packets will get a wrong VLAN tag and will be dropped.
* When connecting Spectrum devices with optical transceivers that support RXLOS, remote side port down might cause the switch firmware to get stuck and cause unexpected switch behavior.
* Aggregation event is missing for WJH L2 drop reason 'Unicast egress port list is empty'.
* Tying the SCL and SDA of the optical modules to 3.3V causes errors.
* On SN4600, there was a delay of more than 10 seconds from the time a data packet is sent from CPU until it is transmitted through one of the switch ports.
* While using SN4600C system with Finisar FTLC1157RGPL 100GbE CWDM4 modules, intermittent link flaps across multiple ports may be observed.
* In Spectrum-2 and Spectrum-3 systems, link did not work in auto-negotiation when connected to Marvell PHY. KR mechanism has been enhanced to integrate with Marvell PHY.
* The tunnel counter now counts dropped packets on Spectrum-2 and Spectrum-3, consistent with Spectrum behavior, and counts ECN-dropped packets as well.
* When connecting SN3800 to Cisco-9000, the fast-linkup flow will fail and the link will come up through the normal flow.
* Race condition in WJH library: when multiple threads load the LAG shared memory concurrently, the program may crash.
* Add WJH L2 drop reason 'Unicast egress port list is empty' as a new drop reason.
* Fixed a memory leak in sx_api_port_sflow_statistics_get API.
* During the initialization flow, the command interface used by both the minimal driver and the SDK caused a collision in the firmware, since the firmware uses the same buffer for the two interfaces.
- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.
- How to verify it
Build an image and run tests from "sonic-mgmt".
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
- Why I did it
When PSU is powered off, the PSU is still on the switch and the air flow is still the same. In this case, it is not necessary to set FAN speed to 100%.
- How I did it
When PSU is powered off, don't treat it as absent.
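The distinction between a powered-off PSU and an absent PSU can be sketched as below. This is a simplified illustration under stated assumptions, not the actual thermal policy: the `PsuState` type and `needs_full_fan_speed` function are hypothetical names, and a real policy considers other conditions as well.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class PsuState:
    present: bool  # PSU is physically inserted in the switch
    powered: bool  # PSU is currently supplying power


def needs_full_fan_speed(psus: Iterable[PsuState]) -> bool:
    """Hypothetical check: only a physically absent PSU changes the air flow.

    A PSU that is present but powered off is deliberately NOT treated as absent.
    """
    return any(not psu.present for psu in psus)


# One PSU powered off but still inserted: no need to force fans to 100%.
print(needs_full_fan_speed([PsuState(True, True), PsuState(True, False)]))
```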
- How to verify it
Adjust existing unit test case
Add new case in sonic-mgmt
- Why I did it
This is to update the common sonic-buildimage infra for reclaiming buffer.
- How I did it
Render zero_profiles.j2 to zero_profiles.json for vendors that support reclaiming buffer
The zero profiles will be referenced in PR [Reclaim buffer] Reclaim unused buffers by applying zero buffer profiles #8768 on Mellanox platforms and there will be test cases to verify the behavior there.
Rendering is done here for passing azure pipeline.
Load zero_profiles.json when the dynamic buffer manager starts
Generate inactive port list to reclaim buffer
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Why I did it
If multiple Vlans are configured to have DHCPv6 relay, only one relay instance is able to capture DHCP packets received from upstream. This is a result of how the kernel is designed to operate (SO_REUSEPORT).
DHCPv6 transmits unicast packets to clients, but only multicast packets can be captured by multiple applications listening on the same UDP port.
As a result, only one Vlan interface gets packets from servers.
- How I did it
Change the design to neglect Vlan isolation and run only one relay instance serving all Vlans with all configured DHCP servers.
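As an illustration of the single-instance design, the per-Vlan settings can be folded into one relay configuration. This is a sketch only: the input layout (Vlan name mapped to a list of server addresses), the function name `merge_relay_config`, and the output keys are assumptions for this example, not the relay's real configuration format.

```python
from typing import Dict, List


def merge_relay_config(vlan_servers: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Combine per-Vlan DHCP server lists into one relay instance config.

    Hypothetical helper: one instance listens on all Vlans and knows
    every configured server, instead of one instance per Vlan.
    """
    interfaces = sorted(vlan_servers)
    # De-duplicate servers that are configured on more than one Vlan.
    servers = sorted({srv for srvs in vlan_servers.values() for srv in srvs})
    return {"interfaces": interfaces, "servers": servers}


config = merge_relay_config({
    "Vlan1000": ["fc02:2000::1", "fc02:2000::2"],
    "Vlan2000": ["fc02:2000::2"],
})
print(config)
```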
- How to verify it
Run DHCPv6 relay test with 2 Vlans configured to have a DHCP relay.
Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
This is to address an issue where it was observed that SAI operations may sometimes take a very long time to complete (over 45ms). It was determined that the ALPM distributed thread was causing this issue.
The fix is to disable this debug thread that has no functional purpose.
Preliminary tests look fine. BGP neighbors were all up with proper routes programmed, and interfaces are all up.
Manually ran the fib test cases on 7050CX3 (TD3), TD2, TH, TH2, and TH3 based platforms and they all passed.
Note: the testing was done on the 20201230 image, and this change is being ported to the master branch.
No need to port this to the 20201230 branch, as a separate PR was already done for that branch (#9190).
This PR ports the changes made by #9199, which could not be cherry-picked directly to the 202106 branch.
sonic-snmpagent
7e46eb1 [201911][RFC1213]: Initialize lag oid map in reinit_data (#234)
aa98ded CPU Spike because of redundant and flooded keyspace notifis handled (#230)
sonic-swss
bc4e334 [Mux orch] Handle setting unknown mux state (#1984)
bd3630b [tunnel decap] Change tunnel orch order (#1977)
87a673a Fix the option missing in kernel config issue (#1973)
57967a1 [orchagent] Fix group name of port-buffer-drop in flexcounterorch.cpp (#1967)
sonic-utilities
181e8b0 Fix the option missing in kernel config issue (#1888)
21c0cc0 [watermarkstat] Fix for error in processing empty array from couters db (#1810)
7f15755 [chassis][supervisor][show][interfaces]show interfaces command warning on Supervisor card (#1771)
Signed-off-by: Neetha John <nejo@microsoft.com>
Why I did it
Storage T0's have all vlan members as tagged
How I did it
Since currently minigraph does not have a unique way to identify if a vlan member is tagged/untagged and to ensure other scenarios are not broken, the logic used is to just update the vlan member type as 'tagged' when we determine that it is a storage backend device. This change will apply only to storage backend T0's since storage backend T1's will not have vlan member information
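The tagging rule described above can be sketched as follows. This is a minimal illustration, not the sonic-cfggen code: the function name `tag_vlan_members`, the string keys, and the boolean flag are assumptions for this example, while `tagging_mode` is the attribute the rule updates.

```python
import copy
from typing import Dict


def tag_vlan_members(vlan_members: Dict[str, Dict[str, str]],
                     is_storage_backend: bool) -> Dict[str, Dict[str, str]]:
    """Return vlan members with tagging_mode forced to 'tagged' on storage backend devices.

    Hypothetical helper: other devices are returned unchanged so that
    existing scenarios are not affected.
    """
    result = copy.deepcopy(vlan_members)
    if is_storage_backend:
        for attrs in result.values():
            attrs["tagging_mode"] = "tagged"
    return result


members = {"Vlan1000|Ethernet0": {"tagging_mode": "untagged"}}
print(tag_vlan_members(members, is_storage_backend=True))
print(tag_vlan_members(members, is_storage_backend=False))
```

For a storage backend T1 the input would simply be empty, so the result is an empty vlan member table.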
How to verify it
Updated the storage backend T0 testcases to check for tagged vlan members
Added testcase to check if a T1 and backend T1 device generates an empty vlan member table
Existing vlan member testcases are good enough for checking if any regression has been caused for regular T0's
Build sonic_config_engine-1.0-py3-none-any.whl successfully
* [Nokia ixs7215] Platform API fixes
This commit delivers the following fixes
- Fix bug preventing access to second PSU eeprom
- Fix bug preventing updates to front panel PSU status led
- Fix SFP reset test case failure
* Fix LGTM alert
This commit more fully declares the HW capabilities of the Nokia-7215
platform. For example, support for the threshold values associated with each
thermal sensor is described. The intent here is to inform the sonic-mgmt
platform test cases of which HW features are supported.
This commit must align with PR# 4521 within the sonic-mgmt git repo which is
currently under review. Any changes to that PR will need to be reflected in
this commit.
Fix the check used to wait for interfaces to come up. The group name in
the supervisor config files has changed from isc-dhcp-relay to
dhcp-relay.
Also, in the wait script, wait 10 additional seconds after the vlans,
port channels, and any interfaces are up. This is because dhcrelay
listens on all interfaces (in addition to port channels and vlans), and
to ensure that it stays in a clean state during runtime, wait some extra
time to make sure that those interfaces are created as well.
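The wait logic described above can be sketched as a poll loop followed by a fixed settle delay. This is an illustration only (the real wait script is not shown here): `wait_for_interfaces`, the `all_up` callback, and the 1-second poll interval are assumptions, and only the 10-second extra wait comes from the description.

```python
import time
from typing import Callable


def wait_for_interfaces(all_up: Callable[[], bool],
                        poll_interval: float = 1.0,
                        settle_seconds: float = 10.0,
                        sleep: Callable[[float], None] = time.sleep) -> None:
    """Block until all_up() returns True, then wait an extra settle period.

    Hypothetical sketch: the extra wait gives the remaining interfaces
    time to be created before dhcrelay starts listening on them.
    """
    while not all_up():
        sleep(poll_interval)
    # Vlans, port channels and interfaces are up; wait some extra time.
    sleep(settle_seconds)


# Example with a fake readiness check and a recording sleep (no real delay).
states = iter([False, False, True])
recorded = []
wait_for_interfaces(lambda: next(states), sleep=recorded.append)
print(recorded)  # [1.0, 1.0, 10.0]
```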
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
- Why I did it
Wrong SKU configuration will lead to longer init flow.
This will affect fast-reboot feature by increasing the traffic downtime.
Since MLNX still met the required downtime period with this SKU, this bug was found late.
- How I did it
Add the required split labels for ports.
- How to verify it
Run fast-reboot with this platform using SN3800-D112C8 SKU.
swss
73caba3 Allow interface type value none (#1991)
utilities
32e530f Allow interface type value none (#1902)
53f066c Fix log_ssd_health hang issue (#1904)
This PR allows the user to set the interface type to none. So there is a way to achieve the goal via CLI:
config interface type XXX none
config interface speed XXX 10000
config interface type XXX CR
* Changed Debian package dependency in order to support both python or python3 packages
* Fix Python scripts to be compatible with python2.7/python3 versions
* hw-mgmt: attributes: Fix PSU power sensor attributes capability
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
- Why I did it
In case an app.ext requires a dependency syncd^1.0.0, the RPC version of syncd will not satisfy this constraint, since 1.0.0-rpc < 1.0.0. It is not correct to put 'rpc' as a prerelease identifier. Instead, put 'rpc' as build metadata in the version: 1.0.0+rpc, which satisfies the constraint ^1.0.0.
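The version ordering above follows Semantic Versioning precedence rules and can be checked with a small sketch. This is a simplified illustration, not the package manager's actual constraint solver: `parse`, `precedence_key`, and `satisfies_caret` are hypothetical helpers, and the caret check assumes a base version with major >= 1.

```python
def parse(version):
    """Split a version into its (major, minor, patch) core and prerelease identifiers."""
    without_build = version.split("+", 1)[0]  # build metadata is ignored for precedence
    core, _, pre = without_build.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    identifiers = pre.split(".") if pre else []
    return (major, minor, patch), identifiers


def precedence_key(version):
    """Build a sortable key: a prerelease sorts before the plain release."""
    core, pre = parse(version)
    if not pre:
        return core + (1,)
    # Numeric identifiers compare numerically and sort before alphanumeric ones.
    ids = tuple((0, int(i), "") if i.isdigit() else (1, 0, i) for i in pre)
    return core + (0, ids)


def satisfies_caret(version, base):
    """Simplified ^base check (major >= 1): base <= version < next major."""
    (major, _, _), _ = parse(base)
    upper = f"{major + 1}.0.0"
    return precedence_key(base) <= precedence_key(version) < precedence_key(upper)


print(satisfies_caret("1.0.0-rpc", "1.0.0"))  # False: prerelease sorts below 1.0.0
print(satisfies_caret("1.0.0+rpc", "1.0.0"))  # True: build metadata is ignored
```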
- How I did it
Changed how versions in RPC and DBG images are constructed.
- How to verify it
Install app.ext with syncd^1.0.0 dependency on a switch with RPC syncd docker.
Signed-off-by: Stepan Blyshchak <stepanb@nvidia.com>
- Why I did it
docker-orchagent was missing libsairedis version label.
E.g. Currently only swsscommon is recorded in the labels:
admin@arc-switch1038:~$ docker inspect docker-orchagent | grep versions
"com.azure.sonic.versions.libswsscommon": "1.0.0"
With this change libsairedis is also recorded:
admin@arc-switch1038:~$ docker inspect docker-orchagent | grep versions
"com.azure.sonic.versions.libswsscommon": "1.0.0"
"com.azure.sonic.versions.libsairedis": "1.0.0"
- How I did it
By expanding the list of dependencies.
- How to verify it
Build and verify the label for libsairedis exists in docker-orchagent.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Adding a release file for 202106. Without it 'release' in sonic_version.yml appears to be none. It should be 202106. This is required for QoS scripts in sonic-mgmt to pick a schema based on release branch
Signed-off-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>
#### Why I did it
Fixed an issue where changing the SDK version led to the cache framework taking a cached syncd RPC image rather than rebuilding syncd RPC based on the new syncd with the new SDK.
Investigation showed that the cache framework calculates a component hash based on direct dependencies. The syncd RPC image hash consists of two parts: the flags of syncd RPC (platform, ENABLE_SYNCD_RPC) and the makefiles of syncd RPC's direct dependencies. None of the syncd RPC direct dependencies is modified when the SDK version changes, so the hash is unchanged.
#### How I did it
To fix this issue, include the hash of dependencies into current component hash calculation, e.g.:
In the calculation of the hash ```docker-syncd-mlnx-rpc.gz-274dfed3f52f2effa9989fc-39344350436f9b06d28b470.tgz```, the hash of syncd is included: ```docker-syncd-mlnx.gz-48ee88ac54b201e0e107b15-7bbea320025177a2121e440.tgz``` in which the hash of SDK is included.
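The idea of folding each dependency's hash into the component hash can be sketched as follows. This is a minimal Python illustration, not the build system's implementation (which lives in makefiles): the function name, the input dictionaries, and the use of SHA-256 are assumptions for this example, and the dependency graph is assumed to be acyclic.

```python
import hashlib
from typing import Dict, List, Optional


def component_hash(name: str,
                   own_inputs: Dict[str, str],
                   deps: Dict[str, List[str]],
                   memo: Optional[Dict[str, str]] = None) -> str:
    """Hash a component's own inputs together with the hashes of its dependencies.

    Hypothetical sketch: because dependency hashes are included recursively,
    a change deep in the chain propagates to every dependent component.
    """
    memo = {} if memo is None else memo
    if name not in memo:
        digest = hashlib.sha256(own_inputs[name].encode())
        for dep in sorted(deps.get(name, [])):
            digest.update(component_hash(dep, own_inputs, deps, memo).encode())
        memo[name] = digest.hexdigest()
    return memo[name]


deps = {"syncd": ["sdk"], "syncd-rpc": ["syncd"]}
old = {"sdk": "4.5.1002", "syncd": "syncd flags", "syncd-rpc": "rpc flags"}
new = dict(old, sdk="4.5.1002-005")

# Only the SDK input changed, yet the syncd RPC hash differs as well.
print(component_hash("syncd-rpc", old, deps) != component_hash("syncd-rpc", new, deps))
```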
#### How to verify it
Build with cache enabled and check that changing SDK version leads to a different hash of syncd rpc image:
SDK version 4.5.1002:
```
docker-syncd-mlnx.gz-48ee88ac54b201e0e107b15-7bbea320025177a2121e440.tgz
docker-syncd-mlnx-rpc.gz-274dfed3f52f2effa9989fc-39344350436f9b06d28b470.tgz
```
SDK version 4.5.1002-005:
```
docker-syncd-mlnx.gz-18baf952e3e0eda7cda7c3c-e5668f4784390d5dffd55af.tgz
docker-syncd-mlnx-rpc.gz-4a6e59580eda110b5709449-552f76be135deaf750aeab2.tgz
```
Depends on Mellanox hw-mgmt 7.0010.3300
Why I did it
Adjust LED logic according to hw-mgmt change.
How I did it
Add a trigger to set LED to blink.
How to verify it
Manual test
- Why I did it
To include the following changes:
* b684149 [techsupport] [202106] Removed -i option for docker commands and Improved Error Reporting (#1843)
- How I did it
Updated sonic-utilities submodule pointer.
- How to verify it
Build an image and run sonic-mgmt tests.
This is due to the SERVICE variable being declared after reading a file
#### Why I did it
To fix an issue that dhcp_relay does not restart with swss.
#### How I did it
Fixed in the swss.sh script
#### How to verify it
sudo systemctl restart swss
verify dhcp_relay restarts as well.
- Why I did it
Bug fix:
bad_param request due to missing parser rest command while running mlxlink
- How I did it
Advance the MFT tool version to 4.17.2-12.
- How to verify it
Manually tested on all mellanox platforms.
- Why I did it
Upgrade to the latest version of hardware management in order to incorporate the latest bugfixes and drivers in the kernel.
- How I did it
Updated the version number and submodule for hw-mgmt.
- How to verify it
This has been verified on all Mellanox platforms through a combination of sonic-mgmt tests and other internal verification.
88cfbc3 [Buffermgr]Graceful handling of buffer model change (#1956)
7f87a12 Orchagent validates mirror session queue parameter against maximum value from SAI (#1957)
* To add portchannel support in frrcfgd and bgpcfgd
* Update is_zero_ip() to handle portchannel name
Signed-off-by: d-dashkov <Dmytro_Dashkov@Jabil.com>