* Update PG headroom settings ports based on port speed/cable length
* Updated XOFF settings to use chip level numbers than core
* Updated PG headroom based on uplink/downlink side
* fix for sonic-config-gen tests
* More fixes for unit test cases
* more test fixes
* Merged multiple functions into one
Why I did it
Current regex not able to capture logs, modify regex to capture syslog messages
Work item tracking
Microsoft ADO (number only): 13366345
How I did it
Code change
How to verify it
sonic-mgmt test case
- Why I did it
We suspect the issue #13791 is caused by redis server being temporarily unavailable during system initialization so we do not use -d in sonic-cfggen, for now, to avoid accessing redis server
- How I did it
Provide a string containing required json data when calling sonic-cfggen
- How to verify it
Manually test it
Signed-off-by: Stephen Sun <stephens@nvidia.com>
rasdaemon is a tool to log hardware errors. It takes 100% CPU during
boot for a few seconds. It impacts fast/warm boot by delaying control
plane restoration for 5 sec on some platforms.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
#### Why I did it
Implementing code changes for https://github.com/sonic-net/SONiC/pull/1203
#### How I did it
Removed the timers and delayed target since the delayed services would start based on event driven approach.
Cleared port table during config reload and cold reboot scenario.
Modified yang model, init_cfg.json to change has_timer to delayed
#### How to verify it
Running regression
Why I did it
Fixes#14179
chassis-packet: missing arp entries for static routes causing high orchagent cpu usage
It is observed that some sonic-mgmt test case calls sonic-clear arp, which clears the static arp entries as well. Orchagent or arp_update process does not try to resolve the missing arp entries after clear.
How I did it
arp_update should resolve the missing arp/ndp static route
entries. Added code to check for missing entries and try ping if any
found to resolve it.
How to verify it
After boot or config reload, check ipv4 and ipv4 neigh entries to make sure all static route entries are present
manual validation:
Use sonic-clear arp and sonic-clear ndp to clear all neighbor entries
run arp_update
Check for neigh entries. All entries should be present.
Testing on T0 setup route/for test_static_route.py
The test set the STATIC_ROUTE entry in conifg db without ifname:
sonic-db-cli CONFIG_DB hmset 'STATIC_ROUTE|2.2.2.0/24' nexthop 192.168.0.18,192.168.0.25,192.168.0.23
"STATIC_ROUTE": {
"2.2.2.0/24": {
"nexthop": "192.168.0.18,192.168.0.25,192.168.0.23"
}
},
Validate that the arp_update gets the proper ARP_UPDATE_VARDS using arp_update_vars.j2 template from config db and does not crash:
{ "switch_type": "", "interface": "", "pc_interface" : "PortChannel101 PortChannel102 PortChannel103 PortChannel104 ", "vlan_sub_interface": "", "vlan" : "Vlan1000", "static_route_nexthops": "192.168.0.18 192.168.0.25 192.168.0.23 ", "static_route_ifnames": "" }
validate route/test_static_route.py testcase pass.
Why I did it
Support to add SONiC OS Version in device info.
It will be used to display the version info in the SONiC command "show version". The version is used to do the FIPS certification. We do not do the FIPS certification on a specific release, but on the SONiC OS Version.
SONiC Software Version: SONiC.master-13812.218661-7d94c0c28
SONiC OS Version: 11
Distribution: Debian 11.6
Kernel: 5.10.0-18-2-amd64
How I did it
#### Why I did it
Enhance the error message output mechanism during swss docker creating
#### How I did it
Capture the output to stderr of `sonic-cfggen` and output it using `echo` to make sure the error message will be logged in syslog.
#### How to verify it
Manually test
Why I did it
All these 3 services started after swss service, which used to start after interface-config service. But #13084 remove the time constraints for swss.
After that, these 3 services has the chance of start earlier when the inteface-config service is restarting the networking service, which could cause db connect request to fail.
How I did it
Delay mux/sflow/snmp timer after the interface-config service.
How to verify it
PR test.
Config reload can repro the issue in 1-3 retries. With this change. config reload run 30+ iterations without hitting the issue.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
arp_update should resolve the missing arp/ndp static route
entries. Added code to check for missing entries and try ping to
resolve the missing entry.
Why I did it
Fixes#14179
chassis-packet: missing arp entries for static routes causing high orchagent cpu usage
It is observed that some sonic-mgmt test case calls sonic-clear arp, which clears the static arp entries as well. Orchagent or arp_update process does not try to resolve the missing arp entries after clear.
How I did it
arp_update should resolve the missing arp/ndp static route
entries. Added code to check for missing entries and try ping if any
found to resolve it.
How to verify it
After boot or config reload, check ipv4 and ipv4 neigh entries to make sure all static route entries are present
manual validation:
Use sonic-clear arp and sonic-clear ndp to clear all neighbor entries
run arp_update
Check for neigh entries. All entries should be present.
Signed-off-by: anamehra <anamehra@cisco.com>
Why I did it
SONiC currently does not identify 'EdgeZoneAggregator' neighbor. As a result, the buffer profile attached to those interfaces uses the default cable length which could cause ingress packet drops due to insufficient headroom. Hence, there is a need to update the buffer templates to identify such neighbors and assign the same cable length as used by the T1.
How I did it
Modified the buffer template to identify EdgeZoneAggregator as a neighbor device type and assign it the same cable length as a T1/leaf router.
How to verify it
Unit tests pass, and manually checked on a 7260 to see the changes take effect.
Signed-off-by: dojha <devojha@microsoft.com>
Why I did it
This PR addresses the issue mentioned above by loading the acl config as a service on a storage backend device
How I did it
The new acl service is a oneshot service which will start after swss and does some retries to ensure that the SWITCH_CAPABILITY info is present before attempting to load the acl rules. The service is also bound to sonic targets which ensures that it gets restarted during minigraph reload and config reload
How to verify it
Build an image with the following changes and did the following tests
Verified that acl is loaded successfully on a storage backend device after a switch boot up
Verified that acl is loaded successfully on a storage backend ToR after minigraph load and config reload
Verified that acl is not loaded if the device is not a storage backend ToR or the device does not have a DATAACL table
Signed-off-by: Neetha John <nejo@microsoft.com>
- Why I did it
Add Secure Boot support to SONiC OS.
Secure Boot (SB) is a verification mechanism for ensuring that code launched by a computer's UEFI firmware is trusted. It is designed to protect a system against malicious code being loaded and executed early in the boot process before the operating system has been loaded.
- How I did it
Added a signing process to sign the following components:
shim, grub, Linux kernel, and kernel modules when doing the build, and when feature is enabled in build time according to the HLD explanations (the feature is disabled by default).
- How to verify it
There are self-verifications of each boot component when building the image, in addition, there is an existing end-to-end test in sonic-mgmt repo that checks that the boot succeeds when loading a secure system (details below).
How to build a sonic image with secure boot feature: (more description in HLD)
Required to use the following build flags from rules/config:
SECURE_UPGRADE_MODE="dev"
SECURE_UPGRADE_DEV_SIGNING_KEY="/path/to/private/key.pem"
SECURE_UPGRADE_DEV_SIGNING_CERT="/path/to/cert/key.pem"
After setting those flags should build the sonic-buildimage.
Before installing the image, should prepared the setup (switch device) with the follow:
check that the device support UEFI
stored pub keys in UEFI DB
enabled Secure Boot flag in UEFI
How to run a test that verify the Secure Boot flow:
The existing test "test_upgrade_path" under "sonic-mgmt/tests/upgrade_path/test_upgrade_path", is enough to validate proper boot
You need to specify the following arguments:
Base_image_list your_secure_image
Taget_image_list your_second_secure_image
Upgrade_type cold
And run the test, basically the test will install the base image given in the parameter and then upgrade to target image by doing cold reboot and validates all the services are up and working correctly
Fixes#13568
Upgrade from old image always requires squashfs mount to get the next image FW binary. This can be avoided if we put FW binary under platform directory which is easily accessible after installation:
admin@r-spider-05:~$ ls /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
/host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
admin@r-spider-05:~$ ls -al /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa
lrwxrwxrwx 1 root root 66 Feb 8 17:57 /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa -> /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
- Why I did it
202211 and above uses different squashfs compression type that 201911 kernel can not handle. Therefore, we avoid mounting squashfs altogether with this change.
- How I did it
Place FW binary under /host/image-/platform/mlnx/, soft links in /etc/mlnx are created to avoid breaking existing scripts/automation.
/etc/mlnx/fw-SPCX.mfa is a soft link always pointing to the FW that should be used in current image
mlnx-fw-upgrade.sh is updated to prefer /host/image-/platform/mlnx location and fallback to /etc/mlnx in squashfs in case new location does not exist. This is necessary to do image downgrade.
- How to verify it
Upgrade from 201911 to master
master to 201911 downgrade
master -> master reboot
ONIE -> master boot (First FW burn)
Which release branch to backport (provide reason below if selected)
- Why I did it
FW for Spectrum-4 ASIC not yet available
- How I did it
Remove in Mellanox fw make files to Spectrum-4 ASIC firmware binaries.
Remove from firmware upgrade scripts to be able Spectrum-4 ASIC.
- How to verify it
Run regression test
- Why I did it
Need to add the possibility to choose between dropping packets (using ACL) on ingress or egress in Dual ToR scenario
- How I did it
Add new attribute "mux_tunnel_ingress_acl" to SYSTEM_DEFAULTS table
- How to verify it
check that new attribute exists in redis:
admin@sonic:~$ redis-cli -n 4
127.0.0.1:6379[4]> HGETALL SYSTEM_DEFAULTS|mux_tunnel_ingress_acl
1."state"
2."false"
Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
Fixe #12047. After the c++ implementation of the sonic-db-cli, sonic-db-cli PING command tries to initialize the global database for all instances database starting. If all instance database-config.json are not ready yet. it will crash and generate core file. PR sonic-net/sonic-swss-common#701 only fix the crash and the process abortion.
Signed-off-by: mlok <marty.lok@nokia.com>
Signed-off-by: Zhixin Zhu zhixzhu@cisco.com
Why I did it
backplane ports cable length need to be specified.
How I did it
separated handling for the specific port name.
- Why I did it
Support DSCP remapping in dual ToR topo on T0 switch for SKU Mellanox-SN4600c-C64, Mellanox-SN4600c-D48C40, Mellanox-SN2700, Mellanox-SN2700-D48C8.
- How I did it
Regarding buffer settings, originally, there are two lossless PGs and queues 3, 4. In dual ToR scenario, the lossless traffic from the leaf switch to the uplink of the ToR switch can be bounced back.
To avoid PFC deadlock, we need to map the bounce-back lossless traffic to different PGs and queues. Therefore, 2 additional lossless PGs and queues are allocated on uplink ports on ToR switches.
On uplink ports, map DSCP 2/6 to TC 2/6 respectively
On downlink ports, both DSCP 2/6 are still mapped to TC 1
Buffer adjusted according to the ports information:
Mellanox-SN4600c-C64:
56 downlinks 50G + 8 uplinks 100G
Mellanox-SN4600c-D48C40, Mellanox-SN2700, Mellanox-SN2700-D48C8:
24 downlinks 50G + 8 uplinks 100G
- How to verify it
Unit test.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Why I did it
In to-sonic and multi-asic KVM-test, pretest sometimes failed. Reason is rsyslogd process can not start in teamd container. Because rsyslog.conf is empty caused by sonic-cfggen execute failed
- How I did it
If sonic-cfggen -d execute failed, execute without -d because the template file has the default value.
- How to verify it
Build image and test it over 40 times, all passed pretest.
Signed-off-by: Chun'ang Li <chunangli@microsoft.com>
- Why I did it
Added platform specific script to be invoked during SAI failure dump. Added some generic changes to mount /var/log/sai_failure_dump as read write in the syncd docker
- How I did it
Added script in docker-syncd of mellanox and copied it to /usr/bin
- How to verify it
Manual UT and new sonic-mgmt tests
* Add support for platform topology configuration service
This service invokes the platform plugin for platform specific topology
configuration.
The path for platform plugin script is:
/usr/share/sonic/device/$PLATFORM/plugins/config-topology.sh
If the platform plugin is not available, this service does nothing.
Signed-off-by: anamehra <anamehra@cisco.com>
After upgrade to brcmsai 8.1, the sdk running environment (container) recommended with mininum memory size as below
TH4/TD4(ltsw) uses 512MB
TH3 used 300MB
Helix4/TD2/TD3/TH/TH 256 MB
Base on this requirement, adjust the default syncd share memory size and set the memory size for special ACISs in platform_env.conf file for different types of Broadcom ASICs.
How I did it
Add the platform_env.conf file if none of it for broadcom platform (base on platform_asic file)
Add the 'SYNCD_SHM_SIZE' and set the value
for ltsw(TD4/TH4) devices set to 512M at least (update the platform_env.conf)
for Td2/TH2/TH devices set to 256M
for TH3 set to 300M
verify
How to verify it
verify the image with code fix
Check with UT
Check on lab devices
On a problematic device which cannot start successfully
Run with the command
$ cat /proc/linux-kernel-bde
Broadcom Device Enumerator (linux-kernel-bde)
Module parameters:
maxpayload=128
usemsi=0
dmasize=32M
himem=(null)
himemaddr=(null)
DMA Memory (kernel): 33554432 bytes, 0 used, 33554432 free, local mmap
No devices found
$ docker rm -f syncd
syncd
$ sudo /usr/bin/syncd.sh start
Cannot get Broadcom Chip Id. Skip set SYNCD_SHM_SIZE.
Creating new syncd container with HWSKU Force10-S6000
a4862129a7fea04f00ed71a88715eac65a41cdae51c3158f9cdd7de3ccc3dd31
$ docker inspect syncd | grep -i shm
"ShmSize": 67108864,
"Tag": "fix_8.1_shm_issue.67873427-9f7ca60a0e",
On Normal device
$ docker inspect syncd | grep -i shm
"ShmSize": 268435456,
"Tag": "fix_8.1_shm_issue.67873427-9f7ca60a0e"
change the config syncd_shm.ini to b85=128m
$ docker rm -f syncd
syncd
$ sudo /usr/bin/syncd.sh start
Creating new syncd container with HWSKU Force10-S6000
3209ffc1e5a7224b99640eb9a286c4c7aa66a2e6a322be32fb7fe2113bb9524c
$ docker inspect syncd | grep -i shm
"ShmSize": 134217728,
"Tag": "fix_8.1_shm_issue.67873427-9f7ca60a0e",
change the config under
/usr/share/sonic/device/x86_64-dell_s6000_s1220-r0/Force10-S6000/platform_env.conf
and run command
$ cat /usr/share/sonic/device/x86_64-dell_s6000_s1220-r0/platform_env.conf
SYNCD_SHM_SIZE=300m
$ sudo /usr/bin/syncd.sh start
Creating new syncd container with HWSKU Force10-S6000
897f6fcde1f669ad2caab7da4326079abd7e811bf73f018c6dacc24cf24bfda5
$ docker inspect syncd | grep -i shm
"ShmSize": 314572800,
"Tag": "fix_8.1_shm_issue.67873427-9f7ca60a0e",
Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
Why I did it
In the voq chassis the buffer_queue configuration needs to be applied on system_port instead of the sonic port.
This PR has the change to do this.
How I did it
Modify buffer_config.j2 to generate buffer_queue configuration on system_ports if the device is Voq Chassis
How to verify it
Verify the buffer_queue configuration is generated properly using sonic-cfggen
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
- Why I did it
The followup to #12920 PR.
If the feature compilation is disabled its configuration should not be included into init_cfg.json.
- How I did it
Update init_cfg.json.j2 template to include teamd and radv features configuration only if their compilation is enabled.
- How to verify it
The default behavior is preserved. To verify the changes compile the image without overriding INCLUDE_TEAMD and INCLUDE_ROUTER_ADVERTISER options. The generated /etc/sonic/init_cfg.json should remain with no changes. Install the image and verify that both teamd and radv containers are present and running. Verify that feature state returned by show feature status command is enabled.
Change the INCLUDE_TEAMD or INCLUDE_ROUTER_ADVERTISER value to "n". Compile and install the image. Verify that feature configuration is not included in generated /etc/sonic/init_cfg.json file. Verify that show feature status output doesn't include the feature.
- Why I did it
Remove dependency on interfaces-config.service to speed up boot, because interfaces-config.service takes a lot of time on boot.
- How I did it
Changed service files for swss, syncd.
- How to verify it
Boot and check swss/syncd start time comparing to interfaces-config
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
- Why I did it
Support syslog rate limit configuration feature
- How I did it
Remove unused rsyslog.conf from containers
Modify docker startup script to generate rsyslog.conf from template files
Add metadata/init data for syslog rate limit configuration
- How to verify it
Manual test
New sonic-mgmt regression cases
Debian is shipping a systemd timer unit for logrotate, but we're also
packaging in a cron job, which means both of them will run, potentially
at the same time. Remove our cron file, and add an override to the
shipped timer file to have it be run every 10 minutes.
Fixes#12392.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
During docker build, host files can be passed to the docker build through
docker context files. But there is no straightforward way to transfer
the files from docker build to host.
This feature provides a tricky way to pass the cache contents from docker
build to host. It tar's the cached content and encodes them as base64 format
and passes it through a log file with a special tag as 'VCSTART and VCENT'.
Slave.mk in the host, it extracts the cache contents from the log and stores them
in the cache folder. Cache contents are encoded as base64 format for
easy passing.
<!--
Please make sure you've read and understood our contributing guidelines:
https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md
** Make sure all your commits include a signature generated with `git commit -s` **
If this is a bug fix, make sure your description includes "fixes #xxxx", or
"closes #xxxx" or "resolves #xxxx"
Please provide the following information:
-->
#### Why I did it
#### How I did it
#### How to verify it
- Why I did it
Add support for compiling Spectrum-4 ASIC firmware to the SONiC image
Add support for Spectrum-4 ASIC firmware upgrade
- How I did it
Update Mellanox fw make files to include Spectrum-4 ASIC firmware binaries.
Update firmware upgrade scripts to be able to detect Spectrum-4 ASIC.
- How to verify it
Run regression tests
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Why I did it
The PR is to apply separated DSCP_TO_TC_MAP and TC_TO_QUEUE_MAP to uplink ports on dualtor.
The traffic with DSCP 2 and DSCP 6 from T1 is treated as lossless traffic.
DSCP TC Queue
2 2 2
6 6 6
Traffic with DSCP 2 or DSCP 6 from downlink is still treated as lossy traffic as before.
How I did it
Define DSCP_TO_TC_MAP|AZURE_UPLINK and TC_TO_QUEUE_MAP|AZURE_UPLINK.
How to verify it
Verified by UT
Verified by coping the new template to a testbed, and rendering a config_db.json
Why I did it
The current lazy installer relies on a filename sort for both unpack and configuration steps. When systemd services are configured [started] by multiple packages the order is by filename not by the declared package dependencies. This can cause the start order of services to differ between first-boot and subsequent boots. Declared systemd service dependencies further exacerbate the issue (e.g. blocking the first-boot script).
The current installer leaves packages un-configured if the package dependency order does not match the filename order.
This also fixes a trivial bug in [Build]: Support to use symbol links for lazy installation targets to reduce the image size #10923 where externally downloaded dependencies are duplicated across lazy package device directories.
How I did it
Changed the staging and first-boot scripts to use apt-get:
dpkg -i /host/image-$SONIC_VERSION/platform/$platform/*.deb
becomes
apt-get -y install /host/image-$SONIC_VERSION/platform/$platform/*.deb
when dependencies are detected during image staging.
How to verify it
Apt-get critical rules
Add a Depends= to the control information of a package. Grep the syslog for rc.local between images and observe the configuration order of packages change.
Added Support to runtime render bgp and teamd feature `state` and lldp `has_asic_scope` flag
Needed for SONiC on chassis.
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Co-authored-by: mlok <marty.lok@nokia.com>
#### Why I did it
Currently at the Azure build system, the P4RT container is disabled by default at the build time. Here the goal is to include the P4RT container at the build time while disabling it at the runtime. The user can enable/disable the p4rt app through the config based on the preference.
#### How I did it
Changed the config in rules/config and init-cfg.json.j2
Signed-off-by: Mariusz Stachura <mariusz.stachura@intel.com>
What I did
Adding the dynamic headroom calculation support for Barefoot platforms.
Why I did it
Enabling dynamic mode for barefoot case.
How I verified it
The community tests are adjusted and pass.
* Add smartmontools to pmon docker
* Set smartmontools to install version 7.2-1 in pmon to match host; clean up smartmontools build files
* Add comments on smartmontools version for both host and pmon
Why I did it
BGP service has always been starting after interface-config. However, recently we discovered an issue where some BGP sessions are unable to establish due to BGP daemon not able to read the interface IP.
This issue was clearly observed after upgrading to FRR 8.2.2. See more details in #12380.
How I did it
Delaying starting BGP seems to be a workaround for this issue.
However, caution is that this delay might impact warm reboot timing and other timing sequences.
This workaround is reducing the probability of hitting the issue by close to 100X. However, this workaround is not bulletproof as test shows. It is still preferrable to have a proper FRR fix and revert this change in the future.
How to verify it
Continuously issuing config reload and check BGP session status afterwards.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Remove swsssdk from sonic OS image and docker image
#### Why I did it
swsssdk is deprecated, so need remove from image.
#### How I did it
Update config file to remove swsssdk from image.
#### How to verify it
Pass all test case.
#### Which release branch to backport (provide reason below if selected)
<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
#### Description for the changelog
Remove swsssdk from sonic OS image and docker image
#### Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->
#### A picture of a cute animal (not mandatory but encouraged)