Commit Graph

1116 Commits

Author SHA1 Message Date
mssonicbld
96817c4357
[ci/build]: Upgrade SONiC package versions (#14102)
Upgrade SONiC Versions
2023-03-17 10:12:30 +08:00
Neetha John
f30fb6ec58
[storage_backend] Add backend acl service (#14229)
Why I did it
This PR addresses the issue mentioned above by loading the acl config as a service on a storage backend device

How I did it
The new acl service is a oneshot service which will start after swss and does some retries to ensure that the SWITCH_CAPABILITY info is present before attempting to load the acl rules. The service is also bound to sonic targets which ensures that it gets restarted during minigraph reload and config reload

How to verify it
Build an image with the following changes and did the following tests

Verified that acl is loaded successfully on a storage backend device after a switch boot up
Verified that acl is loaded successfully on a storage backend ToR after minigraph load and config reload
Verified that acl is not loaded if the device is not a storage backend ToR or the device does not have a DATAACL table

Signed-off-by: Neetha John <nejo@microsoft.com>
2023-03-16 14:18:28 -07:00
davidpil2002
8098bc4bf5
Add Secure Boot Support (#12692)
- Why I did it
Add Secure Boot support to SONiC OS.
Secure Boot (SB) is a verification mechanism for ensuring that code launched by a computer's UEFI firmware is trusted. It is designed to protect a system against malicious code being loaded and executed early in the boot process before the operating system has been loaded.

- How I did it
Added a signing process to sign the following components:
shim, grub, Linux kernel, and kernel modules when doing the build, and when feature is enabled in build time according to the HLD explanations (the feature is disabled by default).

- How to verify it
There are self-verifications of each boot component when building the image, in addition, there is an existing end-to-end test in sonic-mgmt repo that checks that the boot succeeds when loading a secure system (details below).

How to build a sonic image with secure boot feature: (more description in HLD)

Required to use the following build flags from rules/config:
SECURE_UPGRADE_MODE="dev"
SECURE_UPGRADE_DEV_SIGNING_KEY="/path/to/private/key.pem"
SECURE_UPGRADE_DEV_SIGNING_CERT="/path/to/cert/key.pem"
After setting those flags should build the sonic-buildimage.
Before installing the image, should prepared the setup (switch device) with the follow:
check that the device support UEFI
stored pub keys in UEFI DB

enabled Secure Boot flag in UEFI
How to run a test that verify the Secure Boot flow:
The existing test "test_upgrade_path" under "sonic-mgmt/tests/upgrade_path/test_upgrade_path", is enough to validate proper boot
You need to specify the following arguments:
Base_image_list your_secure_image
Taget_image_list your_second_secure_image
Upgrade_type cold
And run the test, basically the test will install the base image given in the parameter and then upgrade to target image by doing cold reboot and validates all the services are up and working correctly
2023-03-14 14:55:22 +02:00
Stepan Blyshchak
f908dfe919
[Mellanox] Place FW binaries under platform directory instead of squashfs (#13837)
Fixes #13568

Upgrade from old image always requires squashfs mount to get the next image FW binary. This can be avoided if we put FW binary under platform directory which is easily accessible after installation:

admin@r-spider-05:~$ ls /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
/host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
admin@r-spider-05:~$ ls -al /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa
lrwxrwxrwx 1 root root 66 Feb  8 17:57 /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa -> /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa

- Why I did it
202211 and above uses different squashfs compression type that 201911 kernel can not handle. Therefore, we avoid mounting squashfs altogether with this change.

- How I did it
Place FW binary under /host/image-/platform/mlnx/, soft links in /etc/mlnx are created to avoid breaking existing scripts/automation.
/etc/mlnx/fw-SPCX.mfa is a soft link always pointing to the FW that should be used in current image
mlnx-fw-upgrade.sh is updated to prefer /host/image-/platform/mlnx location and fallback to /etc/mlnx in squashfs in case new location does not exist. This is necessary to do image downgrade.

- How to verify it
Upgrade from 201911 to master
master to 201911 downgrade
master -> master reboot
ONIE -> master boot (First FW burn)
Which release branch to backport (provide reason below if selected)
2023-03-06 13:36:43 +02:00
mssonicbld
506f372533
[ci/build]: Upgrade SONiC package versions (#14072)
Upgrade SONiC Versions
2023-03-05 11:29:38 +08:00
anamehra
4a93e4cfa4
Add support for platform syncd pre shutdown plugin (#13564)
Why I did it
Vendor platform may require running platform specific pre-shutdown routine before shutting down the syncd process which runs the SAI and vendor sdk instance.

How I did it
Added a platform script hook which will be executed if the plugin script is provided by the platform in device//plugins/
2023-03-03 15:53:33 -08:00
Sudharsan Dhamal Gopalarathnam
8883259673
[netlink] Increse netlink buffer size from 3MB to 16MB (#13965)
#### Why I did it
Following the PR https://github.com/sonic-net/sonic-swss-common/pull/739 increasing netlink buffer size in linux kernel
As error is seen in fdbsyncd with netlink reports "out of memory on reading a netlink socket" It is seen when kernel is sending 10k remote mac to fdbsyncd.


#### How I did it
Increase the buffer size of the netlink buffer from 3MB to 16MB


#### How to verify it
Verified with 10k remote mac, and restarting the fdbsyncd process. So that kernel send the bridge fdb dump to the fdbsyncd.
Verified that the netlink buffer error is not reported in the sys log.
2023-02-27 15:41:22 -08:00
mssonicbld
8d0d3e57ba
[ci/build]: Upgrade SONiC package versions (#13989)
Upgrade SONiC Versions
2023-02-27 13:45:49 +08:00
mssonicbld
58592e6c49
[ci/build]: Upgrade SONiC package versions (#13526)
The initial version files for the SONiC reproducible build
2023-02-25 08:16:38 +08:00
Samuel Angebault
b9dffcbaaf
[Arista] Disable SSD NCQ on Lodoga (#13964)
Why I did it
Fix similar issue seen on #13739 but only for DCS-7050CX3-32S

How I did it
Add a kernel parameter to tell libata to disable NCQ

How to verify it
The message ata2.00: FORCE: horkage modified (noncq) should appear on the dmesg.

Test results using: fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4

with NCQ

   READ: bw=26.1MiB/s (27.4MB/s), 26.1MiB/s-26.1MiB/s (27.4MB/s-27.4MB/s), io=3136MiB (3288MB), run=120053-120053msec
  WRITE: bw=26.3MiB/s (27.6MB/s), 26.3MiB/s-26.3MiB/s (27.6MB/s-27.6MB/s), io=3161MiB (3315MB), run=120053-120053msec
without NCQ

   READ: bw=22.0MiB/s (23.1MB/s), 22.0MiB/s-22.0MiB/s (23.1MB/s-23.1MB/s), io=2647MiB (2775MB), run=120069-120069msec
  WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=2665MiB (2795MB), run=120069-120069msec
2023-02-24 10:08:04 -08:00
DavidZagury
ee1b6b3751
Remove support to Mellanox SPC4 ASIC (#13932)
- Why I did it
FW for Spectrum-4 ASIC not yet available

- How I did it
Remove in Mellanox fw make files to Spectrum-4 ASIC firmware binaries.
Remove from firmware upgrade scripts to be able Spectrum-4 ASIC.

- How to verify it
Run regression test
2023-02-23 08:25:34 +02:00
Andriy Yurkiv
5ad78abea0
[Dual-ToR] add default value for ACL rule for mellanox platform (#13547)
- Why I did it
Need to add the possibility to choose between dropping packets (using ACL) on ingress or egress in Dual ToR scenario

- How I did it
Add new attribute "mux_tunnel_ingress_acl" to SYSTEM_DEFAULTS table

- How to verify it
check that new attribute exists in redis:
admin@sonic:~$ redis-cli -n 4
127.0.0.1:6379[4]> HGETALL SYSTEM_DEFAULTS|mux_tunnel_ingress_acl
1."state"
2."false"

Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
2023-02-22 20:25:54 +02:00
Marty Y. Lok
2c22d9affc
[Chassis][multiasic] Fix the sonic-db-cli core files issue on multiasic platform after the c++ implementation of sonic-db-cli (#13207)
Fixe #12047. After the c++ implementation of the sonic-db-cli, sonic-db-cli PING command tries to initialize the global database for all instances database starting. If all instance database-config.json are not ready yet. it will crash and generate core file. PR sonic-net/sonic-swss-common#701 only fix the crash and the process abortion. 

Signed-off-by: mlok <marty.lok@nokia.com>
2023-02-21 11:23:22 -08:00
Saikrishna Arcot
56d732a0a0
Use tmpfs for /var/log on Arista 7050CX3-32S (#13805)
This is to reduce writes to the SSD on the device.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-02-16 19:13:39 -08:00
Samuel Angebault
5ce1b8e4b7
[Arista] Disable ATA NCQ for a few products (#13739)
Why I did it
Some products might experience an occasional IO failure in the communication between CPU and SSD.
Based on some research it could be attributable to some device not handling ATA NCQ (Native Command Queue).

This issue currently affect 4 products:

DCS-7170-32C*
DCS-7170-64C
DCS-7060DX4-32
DCS-7260CX3-64

How I did it
This change disable NCQ on the affected drive for a small set of products.

How to verify it
When the fix is applied, these 2 patterns can be found in the dmesg.
ata1.00: FORCE: horkage modified (noncq)
NCQ (not used)

Test results using: fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4

with NCQ (ata1.00: 61865984 sectors, multi 1: LBA48 NCQ (depth 32), AA)

   READ: bw=33.9MiB/s (35.6MB/s), 33.9MiB/s-33.9MiB/s (35.6MB/s-35.6MB/s), io=4073MiB (4270MB), run=120078-120078msec
  WRITE: bw=34.1MiB/s (35.8MB/s), 34.1MiB/s-34.1MiB/s (35.8MB/s-35.8MB/s), io=4100MiB (4300MB), run=120078-120078msec
without NCQ (ata1.00: 61865984 sectors, multi 1: LBA48 NCQ (not used))

   READ: bw=31.7MiB/s (33.3MB/s), 31.7MiB/s-31.7MiB/s (33.3MB/s-33.3MB/s), io=3808MiB (3993MB), run=120083-120083msec
  WRITE: bw=31.9MiB/s (33.4MB/s), 31.9MiB/s-31.9MiB/s (33.4MB/s-33.4MB/s), io=3830MiB (4016MB), run=120083-120083msec
Which release branch to backport (provide reason below if selected)
2023-02-15 10:31:59 -08:00
Stepan Blyshchak
e5a294644c
[dockerd] Force usage of cgo DNS resolver (#13649)
Go's runtime (and dockerd inherits this) uses own DNS resolver implementation by default on Linux.
It has been observed that there are some DNS resolution issues when executing ```docker pull``` after first boot.

Consider the following script:

```
admin@r-boxer-sw01:~$ while :; do date; cat /etc/resolv.conf; ping -c 1 harbor.mellanox.com; docker pull harbor.mellanox.com/sonic/cpu-report:1.0.0 ; sleep 1; done
Fri 03 Feb 2023 10:06:22 AM UTC
nameserver 10.211.0.124
nameserver 10.211.0.121
nameserver 10.7.77.135
search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com
PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data.
64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=5.99 ms

--- harbor.mellanox.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 5.989/5.989/5.989/0.000 ms
Error response from daemon: Get "https://harbor.mellanox.com/v2/": dial tcp: lookup harbor.mellanox.com on [::1]:53: read udp [::1]:57245->[::1]:53: read: connection refused
Fri 03 Feb 2023 10:06:23 AM UTC
nameserver 10.211.0.124
nameserver 10.211.0.121
nameserver 10.7.77.135
search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com
PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data.
64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=5.56 ms

--- harbor.mellanox.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 5.561/5.561/5.561/0.000 ms
Error response from daemon: Get "https://harbor.mellanox.com/v2/": dial tcp: lookup harbor.mellanox.com on [::1]:53: read udp [::1]:53299->[::1]:53: read: connection refused
Fri 03 Feb 2023 10:06:24 AM UTC
nameserver 10.211.0.124
nameserver 10.211.0.121
nameserver 10.7.77.135
search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com
PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data.
64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=5.78 ms

--- harbor.mellanox.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 5.783/5.783/5.783/0.000 ms
Error response from daemon: Get "https://harbor.mellanox.com/v2/": dial tcp: lookup harbor.mellanox.com on [::1]:53: read udp [::1]:55765->[::1]:53: read: connection refused
Fri 03 Feb 2023 10:06:25 AM UTC
nameserver 10.211.0.124
nameserver 10.211.0.121
nameserver 10.7.77.135
search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com
PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data.
64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=7.17 ms

--- harbor.mellanox.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 7.171/7.171/7.171/0.000 ms
Error response from daemon: Get "https://harbor.mellanox.com/v2/": dial tcp: lookup harbor.mellanox.com on [::1]:53: read udp [::1]:44877->[::1]:53: read: connection refused
Fri 03 Feb 2023 10:06:26 AM UTC
nameserver 10.211.0.124
nameserver 10.211.0.121
nameserver 10.7.77.135
search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com
PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data.
64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=5.66 ms

--- harbor.mellanox.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 5.656/5.656/5.656/0.000 ms
Error response from daemon: Get "https://harbor.mellanox.com/v2/": dial tcp: lookup harbor.mellanox.com on [::1]:53: read udp [::1]:54604->[::1]:53: read: connection refused
Fri 03 Feb 2023 10:06:27 AM UTC
nameserver 10.211.0.124
nameserver 10.211.0.121
nameserver 10.7.77.135
search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com
PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data.
64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=8.22 ms

--- harbor.mellanox.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 8.223/8.223/8.223/0.000 ms
1.0.0: Pulling from sonic/cpu-report
004f1eed87df: Downloading [===================>                               ]   19.3MB/50.43MB
5d6f1e8117db: Download complete
48c2faf66abe: Download complete
234b70d0479d: Downloading [=========>                                         ]  9.363MB/51.84MB
6fa07a00e2f0: Downloading [==>                                                ]   9.51MB/192.4MB
04a31b4508b8: Waiting
e11ae5168189: Waiting
8861a99744cb: Waiting
d59580d95305: Waiting
12b1523494c1: Waiting
d1a4b09e9dbc: Waiting
99f41c3f014f: Waiting
```

While /etc/resolv.conf has the correct content and ping (and any other utility that uses libc's DNS resolution implementation) works correctly
docker is unable to resolve the hostname and falls back to default [::1]:53. This started to happen after PR https://github.com/sonic-net/sonic-buildimage/pull/13516 has been merged.
As you can see from the log, dockerd is able to pick up the correct /etc/resolv.conf only after 5 sec since first try. This seems to be somehow related to the logic in Go's DNS resolver
https://github.com/golang/go/blob/master/src/net/dnsclient_unix.go#L385.

There have been issues like that reported in docker like:
  - https://github.com/docker/cli/issues/2299
  - https://github.com/docker/cli/issues/2618
  - https://github.com/moby/moby/issues/22398

Since this starts to happen after inclusion of resolvconf package by
above mentioned PR and the fact I can't see any problem with that (ping,
nslookup, etc. works) the choice is made to force dockerd to use cgo
(libc) resolver.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2023-02-14 08:57:19 +02:00
zhixzhu
f0f7639fa2
set cable length to 1m for backplane ports (#13572)
Signed-off-by: Zhixin Zhu zhixzhu@cisco.com

Why I did it
backplane ports cable length need to be specified.

How I did it
separated handling for the specific port name.
2023-02-10 19:01:49 -08:00
andywongarista
1894e0aafe
Increase PikeZ varlog size (#13550)
Why I did it
To address error sometimes seen when running sonic-mgmt test_stress_routes.py::test_announce_withdraw_route on 720DT-48S

How I did it
Update boot0 logic to set platform specific varlog size for 720DT-48S

How to verify it
Verified that /var/log size increased and error is no longer observed when running test
2023-02-09 13:24:09 -08:00
Samuel Angebault
dd7948bf17
[Arista] Add emmc quirks in boot0 to improve reliability (#10013)
Why I did it
Fix some unreliability seen on emmc device with some AMD CPUs

How I did it
Added a kernel parameter to add quirks to
It depends on a sonic-linux-kernel change to work properly but will be a no-op without it.
The quirk added is SDHCI_QUIRK2_BROKEN_HS200 used to downgrade the link speed for the eMMC.
2023-02-09 10:46:09 -08:00
Stephen Sun
e3ff08833e
[Mellanox] Support DSCP remapping in dual ToR topo on T0 switch (#12605)
- Why I did it
Support DSCP remapping in dual ToR topo on T0 switch for SKU Mellanox-SN4600c-C64, Mellanox-SN4600c-D48C40, Mellanox-SN2700, Mellanox-SN2700-D48C8.

- How I did it
Regarding buffer settings, originally, there are two lossless PGs and queues 3, 4. In dual ToR scenario, the lossless traffic from the leaf switch to the uplink of the ToR switch can be bounced back.
To avoid PFC deadlock, we need to map the bounce-back lossless traffic to different PGs and queues. Therefore, 2 additional lossless PGs and queues are allocated on uplink ports on ToR switches.

On uplink ports, map DSCP 2/6 to TC 2/6 respectively
On downlink ports, both DSCP 2/6 are still mapped to TC 1
Buffer adjusted according to the ports information:
Mellanox-SN4600c-C64:
56 downlinks 50G + 8 uplinks 100G
Mellanox-SN4600c-D48C40, Mellanox-SN2700, Mellanox-SN2700-D48C8:
24 downlinks 50G + 8 uplinks 100G

- How to verify it
Unit test.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2023-02-07 16:21:59 +02:00
Chun'ang Li
eea54717b8
Fix rsyslogd start failed cause by rsyslog.conf is emtpy. (#13669)
- Why I did it
In to-sonic and multi-asic KVM-test, pretest sometimes failed. Reason is rsyslogd process can not start in teamd container. Because rsyslog.conf is empty caused by sonic-cfggen execute failed

- How I did it
If sonic-cfggen -d execute failed, execute without -d because the template file has the default value.

- How to verify it
Build image and test it over 40 times, all passed pretest.

Signed-off-by: Chun'ang Li <chunangli@microsoft.com>
2023-02-06 16:38:04 +02:00
Sudharsan Dhamal Gopalarathnam
1ff0c0b685
[Mellanox][sai_failure_dump]Added platform specific script to be invoked during SAI failure dump (#13533)
- Why I did it
Added platform specific script to be invoked during SAI failure dump. Added some generic changes to mount /var/log/sai_failure_dump as read write in the syncd docker

- How I did it
Added script in docker-syncd of mellanox and copied it to /usr/bin

- How to verify it
Manual UT and new sonic-mgmt tests
2023-02-05 16:45:49 +02:00
Saikrishna Arcot
ee1c32a802
Use tmpfs for /var/log for Arista 7260 (#13587)
This is to reduce writes to disk, which then can use the SSD to get worn
out faster.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-02-02 09:07:33 -08:00
anamehra
26af468a99
Add support for platform topology configuration service (#12066)
* Add support for platform topology configuration service

    This service invokes the platform plugin for platform specific topology
    configuration.
    The path for platform plugin script is:
    /usr/share/sonic/device/$PLATFORM/plugins/config-topology.sh
    If the platform plugin is not available, this service does nothing.

Signed-off-by: anamehra <anamehra@cisco.com>
2023-02-01 12:53:45 -08:00
Richard.Yu
a096363b48
[broadcom]: Set default SYNCD_SHM_SIZE for Broadcom XGS devices (#13297)
After upgrade to brcmsai 8.1, the sdk running environment (container) recommended with mininum memory size as below

TH4/TD4(ltsw) uses 512MB
TH3 used 300MB
Helix4/TD2/TD3/TH/TH 256 MB
Base on this requirement, adjust the default syncd share memory size and set the memory size for special ACISs in platform_env.conf file for different types of Broadcom ASICs.

How I did it
Add the platform_env.conf file if none of it for broadcom platform (base on platform_asic file)
Add the 'SYNCD_SHM_SIZE' and set the value

for ltsw(TD4/TH4) devices set to 512M at least (update the platform_env.conf)
for Td2/TH2/TH devices set to 256M
for TH3 set to 300M

verify

How to verify it
verify the image with code fix
Check with UT
Check on lab devices

On a problematic device which cannot start successfully
Run with the command
$ cat /proc/linux-kernel-bde
Broadcom Device Enumerator (linux-kernel-bde)
Module parameters:
        maxpayload=128
        usemsi=0
        dmasize=32M
        himem=(null)
        himemaddr=(null)
DMA Memory (kernel): 33554432 bytes, 0 used, 33554432 free, local mmap
No devices found
$ docker rm -f syncd
syncd
$ sudo /usr/bin/syncd.sh start
Cannot get Broadcom Chip Id. Skip set SYNCD_SHM_SIZE.
Creating new syncd container with HWSKU Force10-S6000
a4862129a7fea04f00ed71a88715eac65a41cdae51c3158f9cdd7de3ccc3dd31
$ docker inspect syncd | grep -i shm
            "ShmSize": 67108864,
                "Tag": "fix_8.1_shm_issue.67873427-9f7ca60a0e",
On Normal device
$ docker inspect syncd | grep -i shm
            "ShmSize": 268435456,
                "Tag": "fix_8.1_shm_issue.67873427-9f7ca60a0e"
change the config syncd_shm.ini to b85=128m

$ docker rm -f syncd
syncd
$ sudo /usr/bin/syncd.sh start
Creating new syncd container with HWSKU Force10-S6000
3209ffc1e5a7224b99640eb9a286c4c7aa66a2e6a322be32fb7fe2113bb9524c
$  docker inspect syncd | grep -i shm
            "ShmSize": 134217728,
                "Tag": "fix_8.1_shm_issue.67873427-9f7ca60a0e",
change the config under
/usr/share/sonic/device/x86_64-dell_s6000_s1220-r0/Force10-S6000/platform_env.conf
and run command

$ cat /usr/share/sonic/device/x86_64-dell_s6000_s1220-r0/platform_env.conf
SYNCD_SHM_SIZE=300m

$ sudo /usr/bin/syncd.sh start
Creating new syncd container with HWSKU Force10-S6000
897f6fcde1f669ad2caab7da4326079abd7e811bf73f018c6dacc24cf24bfda5
$  docker inspect syncd | grep -i shm
            "ShmSize": 314572800,
                "Tag": "fix_8.1_shm_issue.67873427-9f7ca60a0e",

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
2023-01-30 20:23:03 -08:00
Oleksandr Ivantsiv
c7ecd92c54
Clear DNS configuration received from DHCP during networking reconfiguration in Linux. (#13516)
- Why I did it
fixes #12907

When the management interface IP address configuration changes from dynamic to static the DNS configuration (retrieved from the DHCP server) in /etc/resolv.conf remains uncleared. This leads to a DNS configuration pointing to the wrong nameserver. To make the behavior clear DNS configuration received from DHCP should be cleared.

- How I did it
Use resolvconf package for managing DNS configuration. It is capable of tracking the source of DNS configuration and puts the configuration retrieved from the DHCP servers into a separate file. This allows the implementation of DNS configuration cleanup retrieved from DHCP during networking reconfiguration.

- How to verify it
Ensure that the management interface has no static configuration.
Check that /etc/resolv.conf has DNS configuration.
Configure a static IP address on the management interface.
Verify that /etc/resolv.conf has no DNS configuration.
Remove the static IP address from the management interface.
Verify that /etc/resolv.conf has DNS configuration retrieved form DHCP server.
2023-01-30 22:13:10 +02:00
Devesh Pathak
c93716a142
rsyslog to start after interfaces-config (#13503)
Fixes #12408

Why I did it
We are running into #12408 very frequently.
This results in no syslogs from any containers as rsyslog server could not start.
some of the sonic-mgmt scripts look for log statements and error out if log is not present.

Interfaces-config service configures the loopback interface along with other interfaces. rsyslog-config reads ip address of loopback interface and generates /etc/rsyslog.conf. When this race condition happens, lo interface ip is not yet programmed and rsyslog-config ends up writing UDP server as null in /etc/rsyslog.conf.

How I did it
rsyslog-config service is started after interfaces-config service.

How to verify it
Did multiple reboots and verified that $UDPServerAddress is valid.
2023-01-26 20:39:13 -08:00
Jing Zhang
dabb31c5f6
[sudoers] add /usr/local/bin/storyteller to READ_ONLY_CMDS (#13422)
Adding /usr/local/bin/storyteller to READ_ONLY_CMDS. So no write access or prompt for password is needed to run storyteller.

Tested on 202205 clusters, user who didn't request write access was able to grep log using storyteller.

sign-off: Jing Zhang zhangjing@microsoft.com
2023-01-26 20:38:29 -08:00
Jing Zhang
78f249be38
change default to be on (#13495)
Changing the default config knob value to be True for killing radv, due to the reasons below:

Killing RADV is to prevent sending the "cease to be advertising interface" protocol packet.
RFC 4861 says this ceasing packet as "should" instead of "must", considering that it's fatal to not do this.
In active-active scenario, host side might have difficulty distinguish if the "cease to be advertising interface" is for the last interface leaving.
6.2.5. Ceasing To Be an Advertising Interface

shutting down the system.
In such cases, the router SHOULD transmit one or more (but not more
than MAX_FINAL_RTR_ADVERTISEMENTS) final multicast Router
Advertisements on the interface with a Router Lifetime field of zero.
In the case of a router becoming a host, the system SHOULD also
depart from the all-routers IP multicast group on all interfaces on
which the router supports IP multicast (whether or not they had been
advertising interfaces). In addition, the host MUST ensure that
subsequent Neighbor Advertisement messages sent from the interface
have the Router flag set to zero.

sign-off: Jing Zhang zhangjing@microsoft.com
2023-01-24 23:59:54 +00:00
Zain Budhwani
c9a33cb00e
Fix segfault issue inside memory_checker (#13066)
#### Why I did it

Segfault was occuring when running memory_checker

#### How I did it

Deinit publisher immediately after publishing

#### How to verify it

Manual testing
2023-01-24 15:30:41 -08:00
Jing Zhang
260a2ec3e7
[dualtor][active-active]Killing radv instead of stopping on active-active dualtor if config knob is on (#13408)
How I did it
radv sends a good-bye packet when the service is stopped, which causes a IPv6 route update on SoC side. And this update leads to an interface bouncing and causes traffic disruption even though the ToR device might already be isolated.

This PR is to mitigate the traffic disruption issue during planned maintenance, by killing radv instead of stopping. So the cease packet won't be sent.

How to verify it
Verified on dev clusters:

Traffic disruption was no longer reproducible.
radv took the killing path
if knob was off, radv would take the stopping path

sign-off: Jing Zhang zhangjing@microsoft.com
2023-01-20 15:34:34 -08:00
Graham Hayes
e077b5362c
[Arista] Rely on automatic flash size detection for Raven (#13277)
Many of these switches have had flash upgraded beyond 2G however, in
boot0 both were assigned 2GB for legacy reasons.

Remove the hardcoding of the flash size and let boot0 autodetect the available space.

Signed-off-by: Graham Hayes <gr@ham.ie>

Signed-off-by: Graham Hayes <gr@ham.ie>
2023-01-12 23:52:40 -08:00
xumia
e6a01ca5eb
[Bug] Fix SONiC installation failure caused by pip/pip3 not found (#13284)
The main issue is the pip/pip3 command cannot be found when the package is being installed by apt-get.
When using the dpkg install, the searching path is PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
When using the apt-get install, the searching path is PATH=/usr/sbin:/usr/bin:/sbin:/bin
But the pip/pip3 default path is at /usr/local/bin, so dpkg works, but apt-get not work.

How I did it
Export the path /usr/local/bin for pip/pip3.
Make the deb packages can be installed by apt-get.
2023-01-11 08:54:24 -08:00
centecqianj
4b933bd566
[Centec arm64] Solve the abnormal console speed of centec-arm64 switch board (#13126)
The console of the centec-arm64 board is ttyAMA0.The current regular expression cannot be correctly parsed.

Signed-off-by: centecqianj <qianj@centec.com>
2023-01-07 21:10:03 -08:00
abdosi
9ecd27ddbb
During build time mask only those feature/services that are disabled excplicitly (#13283)
What I did:
Fix : #13117

How I did:
During build time mask only those feature/services that are disabled explicitly. Some of the features ((eg: teamd/bgp/dhcp-relay/mux/etc..)) state is determine run-time so for those feature by default service will be up and running and then later hostcfgd will mask them if needed.

So Default behavior will be

init_cfg.json.j2 during build time make state as disabled then mask the service
init_cfg.json.j2 during build time make state as another jinja2 template render string than do no mask the service
init_cfg.json.j2 during build time make state as enabled then do not mask the service

How I verify:
Manual Verification.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2023-01-07 02:36:37 +00:00
Arvindsrinivasan Lakshmi Narasimhan
a57fa16839
[Chassis][Voq]update to add buffer_queue config on system ports (#12156)
Why I did it
In the voq chassis the buffer_queue configuration needs to be applied on system_port instead of the sonic port.
This PR has the change to do this.

How I did it
Modify buffer_config.j2 to generate buffer_queue configuration on system_ports if the device is Voq Chassis

How to verify it
Verify the buffer_queue configuration is generated properly using sonic-cfggen

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2022-12-31 23:59:54 -08:00
Oleksandr Ivantsiv
127d60f9b8
[build] Adjust teamd and radv features configuration according to the compilation options. (#13139)
- Why I did it
The followup to #12920 PR.
If the feature compilation is disabled its configuration should not be included into init_cfg.json.

- How I did it
Update init_cfg.json.j2 template to include teamd and radv features configuration only if their compilation is enabled.

- How to verify it
The default behavior is preserved. To verify the changes compile the image without overriding INCLUDE_TEAMD and INCLUDE_ROUTER_ADVERTISER options. The generated /etc/sonic/init_cfg.json should remain with no changes. Install the image and verify that both teamd and radv containers are present and running. Verify that feature state returned by show feature status command is enabled.
Change the INCLUDE_TEAMD or INCLUDE_ROUTER_ADVERTISER value to "n". Compile and install the image. Verify that feature configuration is not included in generated /etc/sonic/init_cfg.json file. Verify that show feature status output doesn't include the feature.
2022-12-27 13:55:37 +02:00
Stepan Blyshchak
661669c805
[swss/syncd] remove dependency on interfaces-config.service (#13084)
- Why I did it
Remove dependency on interfaces-config.service to speed up boot, because interfaces-config.service takes a lot of time on boot.

- How I did it
Changed service files for swss, syncd.

- How to verify it
Boot and check swss/syncd start time comparing to interfaces-config

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-12-26 09:20:45 +02:00
Junchao-Mellanox
2126def04e
[infra] Support syslog rate limit configuration (#12490)
- Why I did it
Support syslog rate limit configuration feature

- How I did it
Remove unused rsyslog.conf from containers
Modify docker startup script to generate rsyslog.conf from template files
Add metadata/init data for syslog rate limit configuration

- How to verify it
Manual test
New sonic-mgmt regression cases
2022-12-20 10:53:58 +02:00
Konstantin Vasin
8a3fad2891
[Build] mount cgroup2 in chroot to fix build on ubuntu 22.04 (#13030)
Why I did it
Ubuntu 22.04 uses cgroup2 by default, but docker.sh doesn't mount it.
As a result we get an error when trying to run docker info in chroot env:

ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

How I did it
mount cgroup2 in chroot if all enabled kernel cgroup controllers are currently not in use by cgroup1

So we need to mount cgroup in chroot environment on /sys/fs/cgroup.
Because inside chroot we don't know which cgroup version is used by the host we have two possible solutions:

cgroup tree for chroot is mounted by the host (it was my 1st version of this fix)
cgroup tree is mounted inside chroot based on info from /proc/cgroups (it's current version of this fix)
My 2nd version based on this code from systemd: 5c6c587ce2/src/shared/cgroup-setup.c (L35-L74)

We parse info from /proc/cgroups
Skip header line started from #
Skip controller if it's disabled (4th column = 0)
Count number of controllers with non-zero of hierarchy_id (2nd column)
If this number is not zero then we assume some of controllers are used by host system and the host system uses hybrid or legacy cgroup tree. In this case we can't use unified cgroup tree inside chroot and mount old cgroup tree (v1).
If this number is zero then we assume host system uses unified cgroup tree and we need to mount cgroup2 inside chroot.

Signed-off-by: Konstantin Vasin <k.vasin@yadro.com>
2022-12-17 12:16:45 -08:00
Oleksandr Ivantsiv
9988ff888b
[build] Add the possibility to disable compilation of teamd and radv containers. (#12920)
- Why I did it
This optimization is needed for DPU SONiC. DPU SONiC runs a limited set of containers and teamd and radv containers are not part of them. Unlike the other containers, there was no possibility to disable teamd and radv containers compilation.
To reduce DPU SONiC compilation time and reduce the image size this commit adds the possibility to disable their compilation.

- How I did it
Two new configuration options are added to rules/config file:

INCLUDE_TEAMD
INCLUDE_ROUTER_ADVERTISER
By default to preserve the existing behavior both options are enabled. There are two ways to override them:

To change option value to "n" in rules/config file.
To override their value using SONIC_OVERRIDE_BUILD_VARS env variable:
SONIC_OVERRIDE_BUILD_VARS="SONIC_INCLUDE_TEAMD=y SONIC_INCLUDE_ROUTER_ADVERTISER=n"

- How to verify it
The default behavior is preserved. To verify it compile the image without overriding new options. Install the image and verify that both teamd and radv containers are present and running.
To verify the new options override them with "n" value. Compile and install image. Verify that no docker containers are present. Verify that SWSS can start without errors.
2022-12-13 12:06:30 +02:00
Saikrishna Arcot
00b11ec4e2
Replace logrotate cron file with (adapted) systemd timer file (#12921)
Debian is shipping a systemd timer unit for logrotate, but we're also
packaging in a cron job, which means both of them will run, potentially
at the same time. Remove our cron file, and add an override to the
shipped timer file to have it be run every 10 minutes.

Fixes #12392.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-12-08 14:13:11 -08:00
Junchao-Mellanox
3b3837a636
[containercfgd] Add containercfgd and syslog rate limit configuration support (#12489)
* [containercfgd] Add containercfgd and syslog rate limit configuration support

* Fix build issue

* Fix checker issue

* Fix review comment

* Fix review comment

* Update containercfgd.py
2022-12-08 08:58:35 -08:00
Arvindsrinivasan Lakshmi Narasimhan
7db272556e
[chassis] update the asic_status.py to read from CHASSIS_FABRIC_ASIC_INFO_TABLE (#12576)
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan arlakshm@microsoft.com

Why I did it
Fixes #12575 and #12575

How I did it
In the PR sonic-net/sonic-platform-daemons#311 chassisd updates to CHASSIS_FABRIC_ASIC_INFO with the fabric asic info.
Updating the asic_status.py to read from the correct table.

How to verify it
test on chassis

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2022-12-07 21:53:47 -08:00
Michael Li
50b962b4a8
Limit reload BCM SDK kmods on syncd start to PikeZ platform (#12971)
Why I did it
Limiting #12804 changes to PikeZ platform only (Arista-720DT-48S). Note that this is a short term workaround for this platform until SDK investigation on SDK init failure on docker syncd restart due to DMA issues is resolved.

How I did it
Retrieve platform name from /host/machine.conf and only reload SDK kmods on Arista-720DT-48S platform.

Signed-off-by: Michael Li <michael.li@broadcom.com>
2022-12-07 09:53:21 +08:00
Stepan Blyshchak
8ca0530920
[swss.sh] optimize macsec feature state query (#12946)
- Why I did it
There's a slowdown in bootup related to the execution of a show command during startup of swss service. show is a pretty heavy command and takes long time to execute ~2 sec.

- How I did it
I replaced show with sonic-db-cli which takes a ms to run.

- How to verify it
Boot the switch and verify swss is active.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-12-06 11:23:46 +02:00
Kalimuthu-Velappan
aaeafa8411
02.Version cache - docker cache build framework (#12001)
During docker build, host files can be passed to the docker build through
docker context files. But there is no straightforward way to transfer
the files from docker build to host.

This feature provides a tricky way to pass the cache contents from docker
build to host. It tar's the cached content and encodes them as base64 format
and passes it through a log file with a special tag as 'VCSTART and VCENT'.

Slave.mk in the host, it extracts the cache contents from the log and stores them
in the cache folder. Cache contents are encoded as base64 format for
easy passing.

<!--
     Please make sure you've read and understood our contributing guidelines:
     https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md

     ** Make sure all your commits include a signature generated with `git commit -s` **

     If this is a bug fix, make sure your description includes "fixes #xxxx", or
     "closes #xxxx" or "resolves #xxxx"

     Please provide the following information:
-->

#### Why I did it

#### How I did it

#### How to verify it
2022-12-02 08:28:45 +08:00
Michael Li
f725b83bd6
Reload BCM SDK kmods on syncd start to handle syncd restart issues (#12804)
Why I did it
There is an issue on the Arista PikeZ platform (using T3.X2: BCM56274) while running SONiC. If the 'syncd' container in SONiC is restarted, the expected behaviour is that syncd will automatically restart/recover; however it does not and always fails at create_switch due to BCM SDK kmod DMA operation cancellation getting stuck.

Sep 16 22:19:44.855125 pkz208 ERR syncd#syncd: [none] SAI_API_SWITCH:platform_process_command:428 Platform command "init soc" failed, rc = -1. Sep 16 22:19:44.855206 pkz208 INFO syncd#supervisord: syncd CMIC_CMC0_PKTDMA_CH4_DESC_COUNT_REQ:0x33#015 Sep 16 22:19:44.855264 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:platformInit:1909 initialization command "init soc" failed, rc = -1 (Internal error). Sep 16 22:19:44.855403 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:sai_driver_init:642 Error initializing driver, rc = -1. ... Sep 16 22:19:44.855891 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:brcm_sai_create_switch:1173 initializing SDK failed with error Operation failed (0xfffffff5).

Reloading the BCM SDK kmods allows the switch init to continue properly.

How I did it
If BCM SDK kmods are loaded, unload and load them again on syncd docker start script.

How to verify it
Steps to reproduce:

In SONiC, run 'docker ps' to see current running containers; 'syncd' should be present.
Run 'docker stop syncd'
Wait ~1 minute.
Run 'docker ps' to see that syncd is missing.
Check logs to see messages similar to the above.

Signed-off-by: Michael Li <michael.li@broadcom.com>
2022-11-30 16:16:30 +08:00
Kebo Liu
36a100083f
[Mellanox] Add support to Mellanox Spectrum-4 ASIC Firmware compiling and upgrade (#12844)
- Why I did it
Add support for compiling Spectrum-4 ASIC firmware to the SONiC image
Add support for Spectrum-4 ASIC firmware upgrade

- How I did it
Update Mellanox fw make files to include Spectrum-4 ASIC firmware binaries.
Update firmware upgrade scripts to be able to detect Spectrum-4 ASIC.

- How to verify it
Run regression tests

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-11-29 16:38:41 +02:00
Zain Budhwani
4b001e5115
Change value type of params in memory_checker (#12797)
Fix error when calling events API, required value is string, passing float
2022-11-23 17:37:28 -08:00