Cherry pick of #10072
- Why I did it
Removing DPB breakout modes that require adjacent ports to be disabled as that is not supported by the current DPB infrastructure.
Correspondingly had to remove the hwsku.json file from any SKUs which utilized these removed modes such that the system will fall back to ports_config.ini and DPB will not be supported for those SKUs.
- How I did it
Modified the platform.json files of Mellanox devices.
- How to verify it
Execute show int break [Ethernet] on the affected platforms and ensure there are no modes present that would require an adjacent port to be disabled to function.
Closes#7958
#### Why I did it
The previous implementation of sonic-cfggen did a simple comparison between default breakout mode in
hwsku.json and supported modes in platform.json. To set a different default speed in hwsku.json
it was required to add one more entry to supported modes in platfrom.json file:
1x10G[100G,50G] vs 1x100G[50G,10G]
The new implementation does more intelligent parsing and analysis of supported and default modes. It
allows changing default speed without adding a new entry to platform.json.
#### How I did it
Add more intelligent parsing and analysis of supported and default modes.
#### How to verify it
Run sonic-config-engine unit tests from sonic-config-engine/tests directory
Why I did it
Code review was still in progress when #9858 was merged and upon further testing I have arrived at a better solution.
How I did it
Modified supervisord configuration j2 template for pmon to require no minimum uptime for chassisd_db_init and to remove the redundant exit_codes directive
How to verify it
Boot switch and verify in syslog that there are no errors related to chassis_db_init
Why I did it
During warm-reboot and fast-reboot the below error logs appear
Feb 3 22:05:15.187408 r-lionfish-13 ERR container: docker cmd: kill for nat failed with 404 Client Error for http+docker://localhost/v1.41/containers/nat/json: Not Found ("No such container: nat")
The container command when called for local mode doesn't check if it is enabled before calling docker kill which throws the above errors.
b6ca76b482/scripts/fast-reboot (L699)
How I did it
Checking feature state if local mode and returning error exit code along with valid debug message.
How to verify it
Manually tested with warm-reboot and fast-reboot
Added UT to verify it.
- add mechanism to power off linecard and fabrics on supervisor reboot (only lc by default)
- improve lc interface config script
- fix exception handling in logging
Why I did it
Fixed the monit container_checker fails due to unexpected "database-chassis" docker running on Supervisor card in the VOQ chassis. fixes#9042
How I did it
Added database-chassis to the always running docker list if platform is supervisor card.
How to verify it
Execute the CLI command "sudo monit status container_checker"
Signed-off-by: mlok <marty.lok@nokia.com>
* Update container_checker for multi-asic devices
Update container_checker for multi-asic devices to add database containers in always_running_containers.
Previous change was made for single-asic, and that database containers were not considered as feature when writing to state_db.
* Update container_checker
Update an indent
- Why I did it
Stopping swss and syncd causes some driver module unloading. Those driver modules are depended by PMON. This could trigger ERROR logs in syslog.
- How I did it
Adjust warmboot shutdown order in make file
- How to verify it
Manual test
- Why I did it
swsscommon.ConfigDBConnector does not automatically close connection when the instance is recycled by python. So, it should not create this instance each time calling check_services. It will cause error like Failed to read from file /var/run/hw-management/led/led_status_capability - OSError(24, 'Too many open files')
- How I did it
Only connect DB once in init
- How to verify it
Manual test
- Why I did it
Error log was shown on switches during boot
pmon#supervisord 2021-12-22 04:27:16,709 INFO exited: chassis_db_init (exit status 0; not expected)
- How I did it
Add exit code zero as an expected exit code and also disable autorestart.
- How to verify it
Boot the switch and ensure the above log line does not appear.
- Why I did it
MSN4700 platform has 8 lanes per port and thus can support 2x40G with each lane running at 10G
- How I did it
Added 40G to 2x200G breakout mode in platform.json
- How to verify it
Run config int break Ethernet0 2x40G[200G,100G,50G,25G,10G,1G]
And verify the command runs successfully and the port speed was set to 40G with a 2x breakout.
Why I did it
Fix typo and missing files in SN3800 and SN4600C's buffer templates
How I did it
ingress_lossless_xoff_size => ingress_lossless_pool_xoff add missing files for SN4600C-D100C12S2
How to verify it
Deploy the fix and verify whether the device can be up.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Info: Attempting file://dev/vdb/onie-installer ...
Info: Attempting file://dev/vdb/onie-installer.bin ...
cp: write error: No space left on device
Failure: local_fs_run():/dev/vdb Unable to copy /tmp/tmp.CPY0ad/onie-installer.bin to tmpfs
vs image is failing. Increase kvm device space.
Fixes#9376
Because /etc/passwd and /etc/group have been overwritten with symlinks
to /host_etc/passwd and /host_etc/group, the debug container build
fails. This is because the debug container is built without /etc being
mounted at /host_etc in the container (which does happen at runtime).
Because of that, /etc/passwd and /etc/group don't exist, which causes
some package installation errors when openssh-client tries to create a
group.
This is a partial revert of 1347f29178.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
- Why I did it
To include latest fixes.
1. On CMIS modules, after low power configuration, the firmware waited for the module state to be ModuleReady instead of ModuleLowPower causing delays.
2. When connecting SN4600C, 100GbE port with CWDM4 module (Gen 3.0), link up time is 30 seconds.
- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.
- How to verify it
Build an image and run tests from "sonic-mgmt".
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
- Why I did it
Optimize thermal control policies to simplify the logic and add more protection code in policies to make sure it works even if kernel algorithm does not work.
- How I did it
Reduce unused thermal policies
Add timely ASIC temperature check in thermal policy to make sure ASIC temperature and fan speed is coordinated
Minimum allowed fan speed now is calculated by max of the expected fan speed among all policies
Move some logic from fan.py to thermal.py to make it more readable
- How to verify it
1. Manual test
2. Regression
Why I did it
resolves#8979 and #9055
How I did it
Remove the file static.conf.j2,which adds the default route on eth0 from bgp docker
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
- Why I did it
To fix an issue that hw-mgmt patches were not applied. One patch was already in upstream hw-mgmt package thus applying it again caused an error and no other patches were applied. Also, I did it to improve the Makefile, so that the make will fail in case patches fail to apply.
- How I did it
Removed obsolete patch, made applying patches a hard failure in the build.
- How to verify it
Run the make and verify patches are applied.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
- Why I did it
Also recalculated all parameters with the latest algorithm with per-speed peer response time taken into account
- How I did it
Detailed information of each SKU:
C64:
t0: 32 100G downlinks and 32 100G uplinks
t1: 56 100G downlinks and 8 100G uplinks with 2km-cable supported
D112C8: 112 50G downlinks and 8 100G uplinks.
D48C40: 48 50G downlinks, 32 100G downlinks, and 8 100G uplinks
D100C12S2: 4 100G downlinks, 2 10G downlinks, 100 50G downlinks, and 8 100G uplinks
2km cable is supported for C64 on t1 only
- How to verify it
Run regression test (QoS)
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Why I did it
To have an ability to use PRM sniffer.
- How I did it
Enabled the option in configure flags.
- How to verify it
Built and ran on switch. Enabled the feature in runtime and checked the sniffer recording.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
- Why I did it
The capability files were incorrect in comparison to the marketing spec of the SN4410 platform.
- How I did it
Aligned the capability files according to the marketing spec.
- How to verify it
Did basic manual sanity checks:
1. Check if critical docker containers were UP
2. Check if interfaces were created and were UP
3. Check if interfaces created in the syncd docker container by executing – sx_api_ports_dump.py script
4. Check the logs from the start of the switch – everything was OK
5.Verified the port breakout
Signed-off-by: Vadym Hlushko <vadymh@nvidia.com>
- Why I did it
Fix issue: SFP might not be re-initialized after cable plug in. There could be case like:
1. The SFP object was initialized as QSFP type
2. User use adapter to connect a SFP to a QSFP port
3. The SFP object treat the SFP as QSFP and failed to process EEPROM
- How I did it
If a new SFP is plugged, always re-initialize the SFP
- How to verify it
Added unit test case
Backport #9259 to 202106
- Why I did it
Nvidia platform API does not support set LED to orange.
- How I did it
Allow user to set LED to orange
- How to verify it
Manual test and unit testing
This is to backport community PR #8768 to 202106 branch
Why I did it
Support zero buffer profiles
Add buffer profiles and pool definition for zero buffer profiles
Support applying zero profiles on INACTIVE PORTS
Enable dynamic buffer manager to load zero pools and profiles from a JSON file
Signed-off-by: Stephen Sun stephens@nvidia.com
How I did it
Add buffer profiles and pool definition for zero buffer profiles
If the buffer model is static:
- Apply normal buffer profiles to admin-up ports
- Apply zero buffer profiles to admin-down ports
If the buffer model is dynamic:
- Apply normal buffer profiles to all ports
- buffer manager will take care when a port is shut down
- Update buffers_config.j2 to support INACTIVE PORTS by extending the existing macros to generate the various buffer objects, including PGs, queues, ingress/egress profile lists
Originally, all the macros to generate the above buffer objects took active ports only as an argument
Now that buffer items need to be generated on inactive ports as well, an extra argument representing the inactive ports need to be added
To be backward compatible, a new series of macros are introduced to take both active and inactive ports as arguments
The original version (with active ports only) will be checked first. If it is not defined, then the extended version will be called
Only vendors who support zero profiles need to change their buffer templates
Enable buffer manager to load zero pools and profiles from a JSON file:
The JSON file is provided on a per-platform basis
It is copied from platform/<vendor> folder to /usr/share/sonic/temlates folder in compiling time and rendered when the swss container is being created.
To make code clean and reduce redundant code, extract common macros from buffer_defaults_t{0,1}.j2 of all SKUs to two common files:
One in Mellanox-SN2700-D48C8 for single ingress pool mode
The other in ACS-MSN2700 for double ingress pool mode
Those files of all other SKUs will be symbol link to the above files
Update sonic-cfggen test accordingly:
- Adjust example output file of JSON template for unit test
- Add unit test in for Mellanox's new buffer templates.
How to verify it
Regression test.
Unit test in sonic-cfggen
Run regression test and manually test.
- Why I did it
To include latest fixes.
SAI
* Reduce verbosity of warning message on shared memory already existing
* accuflow allocation support by key value
SDK
* Under various circumstances, Ethernet ports falsely showed that InfiniBand cables were connected.
* In SN4600C, at times, the link up time in both DAC and optics cables may, in the worst case, take up to 15 seconds.
* Using SN4600C with copper or optics loopback cables in NRZ speeds, link may raise in long link up times
* When ECMP has high amount of next-hops based on VLAN interfaces, in some rare cases, packets will get a wrong VLAN tag and will be dropped.
* When connecting Spectrum devices with optical transceivers that support RXLOS, remote side port down might cause the switch firmware to get stuck and cause unexpected switch behavior.
* Aggregation event is missing for WJH L2 drop reason 'Unicast egress port list is empty'.
* Tying the SCL and SDA of the optical modules to 3.3V causes errors.
* On SN4600, there was a delay of more than 10 seconds from the time a data packet is sent from CPU until it is transmitted through one of the switch ports.
* While using SN4600C system with Finisar FTLC1157RGPL 100GbE CWDM4 modules, intermittent link flaps across multiple ports may be observed.
* In Spectrum-2 and Spectrum-3 systems, link did not work in auto-negotiation when connected to Marvell PHY. KR mechanism has been enhanced to integrate with Marvell PHY.
* The tunnel counter counts the drop packets now for Spectrum-2 and Spectrum-3 and consistent with Spectrum behavior and count the ECN dropped packets as well.
* When connecting SN3800 to Cisco-9000, fast-linkup flow will fail and will rise in the normal flow.
* Race condition in WJH library: when multiple threads load the LAG shared memory concurrently, the program may crash.
* Add WJH L2 drop reason 'Unicast egress port list is empty' as a new drop reason.
* Fixed a memory leak in sx_api_port_sflow_statistics_get API.
* During initialization flow, the command interface that is used by the minimal driver and SDK caused the collision in the firmware since the same buffer is used in the firmware for the two interfaces.
- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.
- How to verify it
Build an image and run tests from "sonic-mgmt".
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>