Commit Graph

5262 Commits

Author SHA1 Message Date
Samuel Angebault
01fdf5954b
[202106][Arista] Update driver submodules (#10279)
- add mechanism to power off linecard and fabrics on supervisor reboot (only lc by default)
- improve lc interface config script
- fix exception handling in logging
2022-03-20 17:32:09 -07:00
Shilong Liu
c14462c6a3
Add a config variable to override default container registry instead of dockerhub. (#10166) (#10261)
* Add variable to reset default docker registry
* fix bug in docker version control
2022-03-19 00:11:09 +08:00
Guohan Lu
fd8d80486d [ci]: build marvel armhf on sonicbld-armhf pool [Guohan Lu]
Signed-off-by: Guohan Lu <lguohan@gmail.com>
2022-03-18 14:55:15 +00:00
Guohan Lu
2340071d4a [ci]: build centec arm64 to sonicbld-arm64 pool
Signed-off-by: Guohan Lu <lguohan@gmail.com>
2022-03-15 12:35:50 +00:00
xumia
d75c569ef0 [Build][Ci]: Support to use the cisco sai packages built by azp (#10102)
Why I did it
Support to use the cisco sai packages built by azp
2022-03-08 02:18:58 +00:00
Marty Y. Lok
f94ce408c6 [chassis][supervisor]monit container-checker failed due to unexpected "database-chassis" docker running #9042 (#9043)
Why I did it
Fixed the monit container_checker fails due to unexpected "database-chassis" docker running on Supervisor card in the VOQ chassis. fixes #9042

How I did it
Added database-chassis to the always running docker list if platform is supervisor card.

How to verify it
Execute the CLI command "sudo monit status container_checker"


Signed-off-by: mlok <marty.lok@nokia.com>
2022-03-04 21:49:52 +00:00
wenyiz2021
05e566ed45 Update container_checker for multi-asic devices when state is 'always_enabled' (#10067)
* Update container_checker for multi-asic devices 

Update container_checker for multi-asic devices to add database containers in always_running_containers. 
Previous change was made for single-asic, and that database containers were not considered as feature when writing to state_db.

* Update container_checker

Update an indent
2022-03-04 21:48:53 +00:00
Samuel Angebault
5419e5de71 Add platform.json configs for all denali SKUs (#9717) 2022-03-01 20:17:56 +00:00
Junchao-Mellanox
bc9b39a2fe Stop PMON before swss during warm reboot (#10046)
- Why I did it
Stopping swss and syncd causes some driver module unloading. Those driver modules are depended by PMON. This could trigger ERROR logs in syslog.

- How I did it
Adjust warmboot shutdown order in make file

- How to verify it
Manual test
2022-03-01 20:13:09 +00:00
Junchao-Mellanox
74c49a7682 [system-health] Fix file handle leak (#10059)
- Why I did it
swsscommon.ConfigDBConnector does not automatically close connection when the instance is recycled by python. So, it should not create this instance each time calling check_services. It will cause error like Failed to read from file /var/run/hw-management/led/led_status_capability - OSError(24, 'Too many open files')

- How I did it
Only connect DB once in init

- How to verify it
Manual test
2022-03-01 20:12:03 +00:00
Alexander Allen
212cdfbe80 [pmon] Fix chassis_db_init exit not being expected (#9858)
- Why I did it
Error log was shown on switches during boot
pmon#supervisord 2021-12-22 04:27:16,709 INFO exited: chassis_db_init (exit status 0; not expected)

- How I did it
Add exit code zero as an expected exit code and also disable autorestart.

- How to verify it
Boot the switch and ensure the above log line does not appear.
2022-03-01 03:49:55 +00:00
Alexander Allen
127a93c201 [Mellanox] Add 2x40G support to MSN4700 platform (#9485)
- Why I did it
MSN4700 platform has 8 lanes per port and thus can support 2x40G with each lane running at 10G

- How I did it
Added 40G to 2x200G breakout mode in platform.json

- How to verify it
Run config int break Ethernet0 2x40G[200G,100G,50G,25G,10G,1G]
And verify the command runs successfully and the port speed was set to 40G with a 2x breakout.
2022-03-01 03:48:25 +00:00
Stephen Sun
84942c10d6 Fix typo and missing files in SN3800 and SN4600C's buffer templates (#9537)
Why I did it
Fix typo and missing files in SN3800 and SN4600C's buffer templates

How I did it
ingress_lossless_xoff_size => ingress_lossless_pool_xoff add missing files for SN4600C-D100C12S2

How to verify it
Deploy the fix and verify whether the device can be up.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-03-01 03:42:27 +00:00
noaOrMlnx
be7f31e6d2 [CoPP] Add always_enabled field (#9302)
*Add the "always_enabled" field to copp_cfg.j2 file, in order to allow traps without an entry in features table, to be installed automatically.
2022-03-01 03:40:40 +00:00
Mahesh Maddikayala
87d18eb6c3
[libsaibcm] Fix the expiry date on the libsaibcm debian packages. (#10090) 2022-02-28 08:52:11 -08:00
Shilong Liu
d05765cc06 [build] Increase vs platform kvm disk size (#10001)
Info: Attempting file://dev/vdb/onie-installer ...
Info: Attempting file://dev/vdb/onie-installer.bin ...
cp: write error: No space left on device
Failure: local_fs_run():/dev/vdb Unable to copy /tmp/tmp.CPY0ad/onie-installer.bin to tmpfs

vs image is failing. Increase kvm device space.
2022-02-25 10:26:46 -08:00
xumia
a3733384bf [Security]: Upgrade urllib3 to fix CVE-2021-33503
See https://security.archlinux.org/CVE-2021-33503
2022-02-25 09:13:48 +00:00
Mahesh Maddikayala
2153ea41a5
[libsaibcm] Update libsaibcm with PFC patches (#10066) 2022-02-23 14:24:10 -08:00
xumia
93a22d7df5 [Build]: Fix marvell sai package version parsing issue
Fix marvell sai package version parsing issue (#10009)
2022-02-19 04:23:29 +00:00
Nazarii Hnydyn
293ca0b870 [build]: Fix SAE to support debug flavor build. (#9435)
setup PACKAGE_NAME, PATH, MACHINE, VERSION info for debug docker

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2022-02-09 14:20:22 +08:00
Saikrishna Arcot
aa69f6b94c [docker-mgmt-framework]: Don't overwrite /etc/passwd and /etc/group with symlinks (#9375)
Fixes #9376

Because /etc/passwd and /etc/group have been overwritten with symlinks
to /host_etc/passwd and /host_etc/group, the debug container build
fails. This is because the debug container is built without /etc being
mounted at /host_etc in the container (which does happen at runtime).
Because of that, /etc/passwd and /etc/group don't exist, which causes
some package installation errors when openssh-client tries to create a
group.

This is a partial revert of 1347f29178.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-02-08 13:14:41 +08:00
Shilong Liu
0e6eb338eb Fix rules/functions.generage_manifest. (#9340)
Why I did it
Fix a bug in sonic debug image build. That bug is imported in the following PR: #8920
2022-02-07 14:09:58 +08:00
Volodymyr Samotiy
97bd2bf82f
[Mellanox][202106] Update SAI to 1.20.2 and SDK/FW to 4.5.1208/2010.1218 (#9618)
- Why I did it
To include latest fixes.
1. On CMIS modules, after low power configuration, the firmware waited for the module state to be ModuleReady instead of ModuleLowPower causing delays.
2. When connecting SN4600C, 100GbE port with CWDM4 module (Gen 3.0), link up time is 30 seconds.

- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.

- How to verify it
Build an image and run tests from "sonic-mgmt".

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2022-01-26 10:59:39 +02:00
Shilong Liu
f5bef56d5e [CI] Fix Azure pipeline set -e not work. (#9282)
In azure pipeline template 'set -e' not works as expected.
2022-01-24 15:26:06 +08:00
Junchao-Mellanox
c0f0694236
[Mellanox] [202106] Optimize thermal policies (#9451)
- Why I did it
Optimize thermal control policies to simplify the logic and add more protection code in policies to make sure it works even if kernel algorithm does not work.

- How I did it
Reduce unused thermal policies
Add timely ASIC temperature check in thermal policy to make sure ASIC temperature and fan speed is coordinated
Minimum allowed fan speed now is calculated by max of the expected fan speed among all policies
Move some logic from fan.py to thermal.py to make it more readable

- How to verify it
1. Manual test
2. Regression
2022-01-19 11:43:55 +02:00
arlakshm
8ef7030f4f remove staticd.conf.j2 (#9182)
Why I did it
resolves #8979 and #9055

How I did it
Remove the file static.conf.j2,which adds the default route on eth0 from bgp docker

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2022-01-19 02:09:31 +00:00
Stephen Sun
c0df856847
Update sonic-swss-common (#9787)
d00a25bb [ci] refer 202106 branch resources rather than master branch. (#573)

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-01-18 22:11:20 +02:00
Stepan Blyshchak
559ce14277
[nvidia] fail the build when hw-mgmt patches do not apply (#9565)
- Why I did it
To fix an issue that hw-mgmt patches were not applied. One patch was already in upstream hw-mgmt package thus applying it again caused an error and no other patches were applied. Also, I did it to improve the Makefile, so that the make will fail in case patches fail to apply.

- How I did it
Removed obsolete patch, made applying patches a hard failure in the build.

- How to verify it
Run the make and verify patches are applied.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-01-16 08:10:25 +02:00
Qi Luo
c4bc9933fe
Update sonic-swss submodule (#9609)
Includes below commits
```
38b616ba 2021-12-09 | [LGTM] lgtm to use 202106 branches of swss-common and sairedis (#2074) [Stephen Sun]
c68bad07 2021-12-07 | Fix random failure in PR/CI build. #2006 [Shilong Liu]
ba17675b 2021-08-23 | [ci]: fix artifacts download from swss-common and sairedis (#1882) [Guohan Lu]
```
2021-12-30 17:23:38 -08:00
Judy Joseph
1aa225cd0c Update sonic-utilities submodule
74d2a09 [portstat] check TX/RX utilization calculation correctness (#1840)
2021-12-22 09:09:09 -08:00
Stephen Sun
b479bcd941 [Mellanox] Adjust buffer parameters with 2km cable supported for 4600C non-generic SKUs (#9215)
- Why I did it
Also recalculated all parameters with the latest algorithm with per-speed peer response time taken into account

- How I did it
Detailed information of each SKU:

C64:
t0: 32 100G downlinks and 32 100G uplinks
t1: 56 100G downlinks and 8 100G uplinks with 2km-cable supported
D112C8: 112 50G downlinks and 8 100G uplinks.
D48C40: 48 50G downlinks, 32 100G downlinks, and 8 100G uplinks
D100C12S2: 4 100G downlinks, 2 10G downlinks, 100 50G downlinks, and 8 100G uplinks
2km cable is supported for C64 on t1 only

- How to verify it
Run regression test (QoS)

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-12-22 09:06:33 -08:00
Stepan Blyshchak
4ebafdaf28 [Mellanox][SDK] Build SDK with PRM sniffer support (#9500)
- Why I did it
To have an ability to use PRM sniffer.

- How I did it
Enabled the option in configure flags.

- How to verify it
Built and ran on switch. Enabled the feature in runtime and checked the sniffer recording.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2021-12-22 09:05:49 -08:00
Vadym Hlushko
79d2a9a225
[Mellanox] [SN4410] [202106] Fixed capability files - port_config.ini, hwsku.json, platform.json (#9541)
- Why I did it
The capability files were incorrect in comparison to the marketing spec of the SN4410 platform.

- How I did it
Aligned the capability files according to the marketing spec.

- How to verify it
Did basic manual sanity checks:

1. Check if critical docker containers were UP
2. Check if interfaces were created and were UP
3. Check if interfaces created in the syncd docker container by executing – sx_api_ports_dump.py script
4. Check the logs from the start of the switch – everything was OK
5.Verified the port breakout

Signed-off-by: Vadym Hlushko <vadymh@nvidia.com>
2021-12-16 17:39:17 +02:00
Junchao-Mellanox
73a17d8876
[Mellanox] [202106] Fix issue: SFP might not be re-initialized after cable plug in (#9387)
- Why I did it
Fix issue: SFP might not be re-initialized after cable plug in. There could be case like:
1. The SFP object was initialized as QSFP type
2. User use adapter to connect a SFP to a QSFP port
3. The SFP object treat the SFP as QSFP and failed to process EEPROM

- How I did it
If a new SFP is plugged, always re-initialize the SFP

- How to verify it
Added unit test case
2021-12-14 16:27:26 +02:00
Junchao-Mellanox
3ae3eeda35
[Mellanox] Allow user to set LED to orange (#9259) (#9515)
Backport #9259 to 202106

- Why I did it
Nvidia platform API does not support set LED to orange.

- How I did it
Allow user to set LED to orange

- How to verify it
Manual test and unit testing
2021-12-14 13:33:23 +02:00
Stephen Sun
5c91f233ef
[Reclaim buffer][202106] Reclaim unused buffers by applying zero buffer profiles (#9062)
This is to backport community PR #8768 to 202106 branch

Why I did it
Support zero buffer profiles
Add buffer profiles and pool definition for zero buffer profiles
Support applying zero profiles on INACTIVE PORTS
Enable dynamic buffer manager to load zero pools and profiles from a JSON file

Signed-off-by: Stephen Sun stephens@nvidia.com

How I did it
Add buffer profiles and pool definition for zero buffer profiles

If the buffer model is static:
 - Apply normal buffer profiles to admin-up ports
 - Apply zero buffer profiles to admin-down ports
If the buffer model is dynamic:
 - Apply normal buffer profiles to all ports
 - buffer manager will take care when a port is shut down
 - Update buffers_config.j2 to support INACTIVE PORTS by extending the existing macros to generate the various buffer objects, including PGs, queues, ingress/egress profile lists

Originally, all the macros to generate the above buffer objects took active ports only as an argument
Now that buffer items need to be generated on inactive ports as well, an extra argument representing the inactive ports need to be added
To be backward compatible, a new series of macros are introduced to take both active and inactive ports as arguments
The original version (with active ports only) will be checked first. If it is not defined, then the extended version will be called
Only vendors who support zero profiles need to change their buffer templates
Enable buffer manager to load zero pools and profiles from a JSON file:

The JSON file is provided on a per-platform basis
It is copied from platform/<vendor> folder to /usr/share/sonic/temlates folder in compiling time and rendered when the swss container is being created.
To make code clean and reduce redundant code, extract common macros from buffer_defaults_t{0,1}.j2 of all SKUs to two common files:

One in Mellanox-SN2700-D48C8 for single ingress pool mode
The other in ACS-MSN2700 for double ingress pool mode
Those files of all other SKUs will be symbol link to the above files

Update sonic-cfggen test accordingly:
 - Adjust example output file of JSON template for unit test
 - Add unit test in for Mellanox's new buffer templates.

How to verify it
Regression test.
Unit test in sonic-cfggen
Run regression test and manually test.
2021-12-13 10:51:50 -08:00
Samuel Angebault
08c2c07fc0
[202106][Arista] Update arista platform library (#9483) 2021-12-09 18:29:59 -08:00
Arvindsrinivasan Lakshmi Narasimhan
e2b8e2d1da submodule update swss
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2021-12-08 00:25:09 +00:00
Vivek Reddy
4856f98716 [Mellanox] [SKU] Fix the shared headroom for 4600C-C64 SKU (#8242)
Removed ingress_lossy_pool from the BUFFER_POOL list
Fx the the egress_lossless_pool_size value

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
2021-12-08 00:23:07 +00:00
Volodymyr Samotiy
b22db5a52b
[Mellanox] [202106] Update SAI to v1.20.0.1 and SDK/FW to v4.5.1156/v2010.1152 (#9431)
- Why I did it
To include latest fixes.
SAI
* Reduce verbosity of warning message on shared memory already existing
* accuflow allocation support by key value

SDK
* Under various circumstances, Ethernet ports falsely showed that InfiniBand cables were connected.
* In SN4600C, at times, the link up time in both DAC and optics cables may, in the worst case, take up to 15 seconds.
* Using SN4600C with copper or optics loopback cables in NRZ speeds, link may raise in long link up times
* When ECMP has high amount of next-hops based on VLAN interfaces, in some rare cases, packets will get a wrong VLAN tag and will be dropped.
* When connecting Spectrum devices with optical transceivers that support RXLOS, remote side port down might cause the switch firmware to get stuck and cause unexpected switch behavior.
* Aggregation event is missing for WJH L2 drop reason 'Unicast egress port list is empty'.
* Tying the SCL and SDA of the optical modules to 3.3V causes errors.
* On SN4600, there was a delay of more than 10 seconds from the time a data packet is sent from CPU until it is transmitted through one of the switch ports.
* While using SN4600C system with Finisar FTLC1157RGPL 100GbE CWDM4 modules, intermittent link flaps across multiple ports may be observed.
* In Spectrum-2 and Spectrum-3 systems, link did not work in auto-negotiation when connected to Marvell PHY. KR mechanism has been enhanced to integrate with Marvell PHY. 
* The tunnel counter counts the drop packets now for Spectrum-2 and Spectrum-3 and consistent with Spectrum behavior and count the ECN dropped packets as well.
* When connecting SN3800 to Cisco-9000, fast-linkup flow will fail and will rise in the normal flow.
* Race condition in WJH library: when multiple threads load the LAG shared memory concurrently, the program may crash.
* Add WJH L2 drop reason 'Unicast egress port list is empty' as a new drop reason. 
* Fixed a memory leak in sx_api_port_sflow_statistics_get API. 
* During initialization flow, the command interface that is used by the minimal driver and SDK caused the collision in the firmware since the same buffer is used in the firmware for the two interfaces.

- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.

- How to verify it
Build an image and run tests from "sonic-mgmt".

Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
2021-12-06 21:54:14 +02:00
Mahesh Maddikayala
3f3dceb96a
[broadcom]: update bcm dnx gpl module pointer (#9442)
saibcm_modules_dnx submodule update contains a fix for kernel crash
2021-12-03 21:21:07 -08:00
Judy Joseph
70b24ad9c5 Update sonic-utilities submodule
9514857 [config reload][202106] Update command reference (#1944)
2021-12-01 19:21:45 -08:00
Judy Joseph
41a2d3e290 Update sonic-swss submodule
[8522f4f] Don't handle buffer pool watermark during warm reboot reconciling (#1987)
2021-12-01 11:13:38 -08:00
Junchao-Mellanox
f5c847bdbc
[system-health] [202106] No longer check critical process/service status via monit (#9366) 2021-12-01 10:26:06 -08:00
Junchao-Mellanox
e4ff4d2e3a [Mellanox] Fan speed should not be 100% when PSU is powered off (#9258)
- Why I did it
When PSU is powered off, the PSU is still on the switch and the air flow is still the same. In this case, it is not necessary to set FAN speed to 100%.

- How I did it
When PSU is powered of, don't treat it as absent.

- How to verify it
Adjust existing unit test case
Add new case in sonic-mgmt
2021-12-01 09:47:26 -08:00
Stephen Sun
fa0ae42e69 [Reclaim buffer] Common infrastructure update for reclaiming buffer (#9133)
- Why I did it
This is to update the common sonic-buildimage infra for reclaiming buffer.

- How I did it
Render zero_profiles.j2 to zero_profiles.json for vendors that support reclaiming buffer
The zero profiles will be referenced in PR [Reclaim buffer] Reclaim unused buffers by applying zero buffer profiles #8768 on Mellanox platforms and there will be test cases to verify the behavior there.
Rendering is done here for passing azure pipeline.
Load zero_profiles.json when the dynamic buffer manager starts
Generate inactive port list to reclaim buffer

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-12-01 09:47:18 -08:00
shlomibitton
1ebe52847a
[DHCPv6 relay] [202106] Fix DHCPv6 design to support multiple VLANS (#9163)
- Why I did it
If multiple Vlans are configured to have DHCPv6 relay, only one relay instance is able to capture DHCP packets received from upstream, this is as a result of kernel design to operate this way (SO_REUSEPORT).
DHCPv6 transmit unicast packets to clients, only multicast packets can be captured on multiple application listening on the same UDP port.
This issue causing only one Vlan interface to get packets from servers.

- How I did it
Change the design to neglect Vlan isolation and run only one relay instance serving all Vlans with all configured DHCP servers.

- How to verify it
Run DHCPv6 relay test with 2 Vlans configured do have a DHCP relay.

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2021-11-18 19:40:48 +02:00
gechiang
e0209f745a
[202106]Disable ALPM distributed hitbit thread that is used for debug purpose only but interfered with Other functional operations (#9293)
This is to address an issue where it was observed that SAI operations sometime may take a very long to time complete (over 45ms). It was determined that the ALPM distributed thread was causing this issue.
The fix is to disable this debug thread that has no functional purpose.

Preliminary tests looks fine. BGP neighbors were all up with proper routes programmed
interfaces are all up
Manually ran the fib test cases on 7050CX3 (TD3), TD2, TH, TH2, and TH3 based platforms and
thy all passed.
Note: the testing was done over 20201230 image and are porting this change to master branch.
No need to port this to 20201230 branch as a separate PR was already done for that branch. (#9190)

this PR is created to port the changes made by (#9199) but could not be cherry picked directly to 202106 branch.
2021-11-17 20:58:25 -08:00
Arvindsrinivasan Lakshmi Narasimhan
84226bdc57 sonic-utilities submodule update
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2021-11-18 01:27:29 +00:00
Arvindsrinivasan Lakshmi Narasimhan
d4ed9e7e62 Swss submodule update
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2021-11-18 00:39:15 +00:00