sonic-buildimage

Author	SHA1	Message	Date
xumia	7209666374	[Security] Fix some vulnerability issues related to Python packages (#14269 ) Why I did it Fix some vulnerability issues related to Python packages #14269 Pillow: [CVE-2021-27921] Wheel: [CVE-2022-40898] lxml: [CVE-2022-2309] How I did it	2023-03-20 14:15:45 +08:00
mssonicbld	1e8e993a94	[ci/build]: Upgrade SONiC package versions	2023-03-20 09:00:28 +08:00
mssonicbld	89ebd43c81	[ci/build]: Upgrade SONiC package versions (#14311 ) Upgrade SONiC Versions	2023-03-19 10:16:41 +08:00
Dev Ojha	de17f72d9a	[Buffer] Added cable length config to buffer config template for EdgeZoneAggregator (#14280 ) Why I did it SONiC currently does not identify 'EdgeZoneAggregator' neighbor. As a result, the buffer profile attached to those interfaces uses the default cable length which could cause ingress packet drops due to insufficient headroom. Hence, there is a need to update the buffer templates to identify such neighbors and assign the same cable length as used by the T1. How I did it Modified the buffer template to identify EdgeZoneAggregator as a neighbor device type and assign it the same cable length as a T1/leaf router. How to verify it Unit tests pass, and manually checked on a 7260 to see the changes take effect. Signed-off-by: dojha <devojha@microsoft.com>	2023-03-17 11:01:17 -07:00
mssonicbld	96817c4357	[ci/build]: Upgrade SONiC package versions (#14102 ) Upgrade SONiC Versions	2023-03-17 10:12:30 +08:00
Neetha John	f30fb6ec58	[storage_backend] Add backend acl service (#14229 ) Why I did it This PR addresses the issue mentioned above by loading the acl config as a service on a storage backend device How I did it The new acl service is a oneshot service which will start after swss and does some retries to ensure that the SWITCH_CAPABILITY info is present before attempting to load the acl rules. The service is also bound to sonic targets which ensures that it gets restarted during minigraph reload and config reload How to verify it Build an image with the following changes and did the following tests Verified that acl is loaded successfully on a storage backend device after a switch boot up Verified that acl is loaded successfully on a storage backend ToR after minigraph load and config reload Verified that acl is not loaded if the device is not a storage backend ToR or the device does not have a DATAACL table Signed-off-by: Neetha John <nejo@microsoft.com>	2023-03-16 14:18:28 -07:00
davidpil2002	8098bc4bf5	Add Secure Boot Support (#12692 ) - Why I did it Add Secure Boot support to SONiC OS. Secure Boot (SB) is a verification mechanism for ensuring that code launched by a computer's UEFI firmware is trusted. It is designed to protect a system against malicious code being loaded and executed early in the boot process before the operating system has been loaded. - How I did it Added a signing process that signs the following components during the build: shim, grub, Linux kernel, and kernel modules. Signing runs only when the feature is enabled at build time, as explained in the HLD (the feature is disabled by default). - How to verify it Each boot component is self-verified when building the image. In addition, there is an existing end-to-end test in the sonic-mgmt repo that checks that the boot succeeds when loading a secure system (details below). How to build a sonic image with the secure boot feature (more description in the HLD): Use the following build flags from rules/config: SECURE_UPGRADE_MODE="dev" SECURE_UPGRADE_DEV_SIGNING_KEY="/path/to/private/key.pem" SECURE_UPGRADE_DEV_SIGNING_CERT="/path/to/cert/key.pem" After setting those flags, build sonic-buildimage. Before installing the image, prepare the setup (switch device) as follows: check that the device supports UEFI; store the pub keys in the UEFI DB; enable the Secure Boot flag in UEFI. How to run a test that verifies the Secure Boot flow: The existing test "test_upgrade_path" under "sonic-mgmt/tests/upgrade_path/test_upgrade_path" is enough to validate proper boot. You need to specify the following arguments: Base_image_list your_secure_image Target_image_list your_second_secure_image Upgrade_type cold Then run the test. The test installs the base image given in the parameter, upgrades to the target image with a cold reboot, and validates that all the services are up and working correctly.	2023-03-14 14:55:22 +02:00
Stepan Blyshchak	f908dfe919	[Mellanox] Place FW binaries under platform directory instead of squashfs (#13837 ) Fixes #13568 Upgrade from old image always requires squashfs mount to get the next image FW binary. This can be avoided if we put FW binary under platform directory which is easily accessible after installation: admin@r-spider-05:~$ ls /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa admin@r-spider-05:~$ ls -al /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa lrwxrwxrwx 1 root root 66 Feb 8 17:57 /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa -> /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa - Why I did it 202211 and above uses different squashfs compression type that 201911 kernel can not handle. Therefore, we avoid mounting squashfs altogether with this change. - How I did it Place FW binary under /host/image-/platform/mlnx/, soft links in /etc/mlnx are created to avoid breaking existing scripts/automation. /etc/mlnx/fw-SPCX.mfa is a soft link always pointing to the FW that should be used in current image mlnx-fw-upgrade.sh is updated to prefer /host/image-/platform/mlnx location and fallback to /etc/mlnx in squashfs in case new location does not exist. This is necessary to do image downgrade. - How to verify it Upgrade from 201911 to master master to 201911 downgrade master -> master reboot ONIE -> master boot (First FW burn) Which release branch to backport (provide reason below if selected)	2023-03-06 13:36:43 +02:00
mssonicbld	506f372533	[ci/build]: Upgrade SONiC package versions (#14072 ) Upgrade SONiC Versions	2023-03-05 11:29:38 +08:00
anamehra	4a93e4cfa4	Add support for platform syncd pre shutdown plugin (#13564 ) Why I did it Vendor platform may require running platform specific pre-shutdown routine before shutting down the syncd process which runs the SAI and vendor sdk instance. How I did it Added a platform script hook which will be executed if the plugin script is provided by the platform in device//plugins/	2023-03-03 15:53:33 -08:00
Sudharsan Dhamal Gopalarathnam	8883259673	[netlink] Increase netlink buffer size from 3MB to 16MB (#13965 ) Why I did it This follows PR https://github.com/sonic-net/sonic-swss-common/pull/739 by increasing the netlink buffer size in the Linux kernel. fdbsyncd reports the netlink error "out of memory on reading a netlink socket" when the kernel sends 10k remote MACs to fdbsyncd. How I did it Increased the netlink buffer size from 3MB to 16MB. How to verify it Verified with 10k remote MACs and by restarting the fdbsyncd process so that the kernel sends the bridge fdb dump to fdbsyncd. Verified that the netlink buffer error is not reported in the syslog.	2023-02-27 15:41:22 -08:00
mssonicbld	8d0d3e57ba	[ci/build]: Upgrade SONiC package versions (#13989 ) Upgrade SONiC Versions	2023-02-27 13:45:49 +08:00
mssonicbld	58592e6c49	[ci/build]: Upgrade SONiC package versions (#13526 ) The initial version files for the SONiC reproducible build	2023-02-25 08:16:38 +08:00
Samuel Angebault	b9dffcbaaf	[Arista] Disable SSD NCQ on Lodoga (#13964 ) Why I did it Fix similar issue seen on #13739 but only for DCS-7050CX3-32S How I did it Add a kernel parameter to tell libata to disable NCQ How to verify it The message ata2.00: FORCE: horkage modified (noncq) should appear on the dmesg. Test results using: fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 with NCQ READ: bw=26.1MiB/s (27.4MB/s), 26.1MiB/s-26.1MiB/s (27.4MB/s-27.4MB/s), io=3136MiB (3288MB), run=120053-120053msec WRITE: bw=26.3MiB/s (27.6MB/s), 26.3MiB/s-26.3MiB/s (27.6MB/s-27.6MB/s), io=3161MiB (3315MB), run=120053-120053msec without NCQ READ: bw=22.0MiB/s (23.1MB/s), 22.0MiB/s-22.0MiB/s (23.1MB/s-23.1MB/s), io=2647MiB (2775MB), run=120069-120069msec WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=2665MiB (2795MB), run=120069-120069msec	2023-02-24 10:08:04 -08:00
DavidZagury	ee1b6b3751	Remove support for Mellanox SPC4 ASIC (#13932 ) - Why I did it FW for the Spectrum-4 ASIC is not yet available. - How I did it Removed the references to Spectrum-4 ASIC firmware binaries from the Mellanox FW makefiles. Removed Spectrum-4 ASIC handling from the firmware upgrade scripts. - How to verify it Run regression tests.	2023-02-23 08:25:34 +02:00
Andriy Yurkiv	5ad78abea0	[Dual-ToR] add default value for ACL rule for mellanox platform (#13547 ) - Why I did it Need to add the possibility to choose between dropping packets (using ACL) on ingress or egress in Dual ToR scenario - How I did it Add new attribute "mux_tunnel_ingress_acl" to SYSTEM_DEFAULTS table - How to verify it check that new attribute exists in redis: admin@sonic:~$ redis-cli -n 4 127.0.0.1:6379[4]> HGETALL SYSTEM_DEFAULTS\|mux_tunnel_ingress_acl 1."state" 2."false" Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>	2023-02-22 20:25:54 +02:00
Marty Y. Lok	2c22d9affc	[Chassis][multiasic] Fix the sonic-db-cli core files issue on multiasic platform after the c++ implementation of sonic-db-cli (#13207 ) Fixes #12047. After the C++ implementation of sonic-db-cli, the sonic-db-cli PING command tries to initialize the global database for all database instances at startup. If the database-config.json files of all instances are not ready yet, it crashes and generates a core file. PR sonic-net/sonic-swss-common#701 only fixes the crash and the process abort. Signed-off-by: mlok <marty.lok@nokia.com>	2023-02-21 11:23:22 -08:00
Saikrishna Arcot	56d732a0a0	Use tmpfs for /var/log on Arista 7050CX3-32S (#13805 ) This is to reduce writes to the SSD on the device. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>	2023-02-16 19:13:39 -08:00
Samuel Angebault	5ce1b8e4b7	[Arista] Disable ATA NCQ for a few products (#13739 ) Why I did it Some products might experience an occasional IO failure in the communication between CPU and SSD. Based on some research it could be attributable to some device not handling ATA NCQ (Native Command Queue). This issue currently affect 4 products: DCS-7170-32C* DCS-7170-64C DCS-7060DX4-32 DCS-7260CX3-64 How I did it This change disable NCQ on the affected drive for a small set of products. How to verify it When the fix is applied, these 2 patterns can be found in the dmesg. ata1.00: FORCE: horkage modified (noncq) NCQ (not used) Test results using: fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4 with NCQ (ata1.00: 61865984 sectors, multi 1: LBA48 NCQ (depth 32), AA) READ: bw=33.9MiB/s (35.6MB/s), 33.9MiB/s-33.9MiB/s (35.6MB/s-35.6MB/s), io=4073MiB (4270MB), run=120078-120078msec WRITE: bw=34.1MiB/s (35.8MB/s), 34.1MiB/s-34.1MiB/s (35.8MB/s-35.8MB/s), io=4100MiB (4300MB), run=120078-120078msec without NCQ (ata1.00: 61865984 sectors, multi 1: LBA48 NCQ (not used)) READ: bw=31.7MiB/s (33.3MB/s), 31.7MiB/s-31.7MiB/s (33.3MB/s-33.3MB/s), io=3808MiB (3993MB), run=120083-120083msec WRITE: bw=31.9MiB/s (33.4MB/s), 31.9MiB/s-31.9MiB/s (33.4MB/s-33.4MB/s), io=3830MiB (4016MB), run=120083-120083msec Which release branch to backport (provide reason below if selected)	2023-02-15 10:31:59 -08:00
Stepan Blyshchak	e5a294644c	[dockerd] Force usage of cgo DNS resolver (#13649 ) Go's runtime (and dockerd inherits this) uses own DNS resolver implementation by default on Linux. It has been observed that there are some DNS resolution issues when executing ```docker pull``` after first boot. Consider the following script: ``` admin@r-boxer-sw01:~$ while :; do date; cat /etc/resolv.conf; ping -c 1 harbor.mellanox.com; docker pull harbor.mellanox.com/sonic/cpu-report:1.0.0 ; sleep 1; done Fri 03 Feb 2023 10:06:22 AM UTC nameserver 10.211.0.124 nameserver 10.211.0.121 nameserver 10.7.77.135 search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data. 64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=5.99 ms --- harbor.mellanox.com ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 5.989/5.989/5.989/0.000 ms Error response from daemon: Get "https://harbor.mellanox.com/v2/": dial tcp: lookup harbor.mellanox.com on [::1]:53: read udp [::1]:57245->[::1]:53: read: connection refused Fri 03 Feb 2023 10:06:23 AM UTC nameserver 10.211.0.124 nameserver 10.211.0.121 nameserver 10.7.77.135 search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data. 64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=5.56 ms --- harbor.mellanox.com ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 5.561/5.561/5.561/0.000 ms Error response from daemon: Get "https://harbor.mellanox.com/v2/": dial tcp: lookup harbor.mellanox.com on [::1]:53: read udp [::1]:53299->[::1]:53: read: connection refused Fri 03 Feb 2023 10:06:24 AM UTC nameserver 10.211.0.124 nameserver 10.211.0.121 nameserver 10.7.77.135 search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data. 
64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=5.78 ms --- harbor.mellanox.com ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 5.783/5.783/5.783/0.000 ms Error response from daemon: Get "https://harbor.mellanox.com/v2/": dial tcp: lookup harbor.mellanox.com on [::1]:53: read udp [::1]:55765->[::1]:53: read: connection refused Fri 03 Feb 2023 10:06:25 AM UTC nameserver 10.211.0.124 nameserver 10.211.0.121 nameserver 10.7.77.135 search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data. 64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=7.17 ms --- harbor.mellanox.com ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 7.171/7.171/7.171/0.000 ms Error response from daemon: Get "https://harbor.mellanox.com/v2/": dial tcp: lookup harbor.mellanox.com on [::1]:53: read udp [::1]:44877->[::1]:53: read: connection refused Fri 03 Feb 2023 10:06:26 AM UTC nameserver 10.211.0.124 nameserver 10.211.0.121 nameserver 10.7.77.135 search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data. 64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=5.66 ms --- harbor.mellanox.com ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 5.656/5.656/5.656/0.000 ms Error response from daemon: Get "https://harbor.mellanox.com/v2/": dial tcp: lookup harbor.mellanox.com on [::1]:53: read udp [::1]:54604->[::1]:53: read: connection refused Fri 03 Feb 2023 10:06:27 AM UTC nameserver 10.211.0.124 nameserver 10.211.0.121 nameserver 10.7.77.135 search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data. 
64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=8.22 ms --- harbor.mellanox.com ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 8.223/8.223/8.223/0.000 ms 1.0.0: Pulling from sonic/cpu-report 004f1eed87df: Downloading [===================> ] 19.3MB/50.43MB 5d6f1e8117db: Download complete 48c2faf66abe: Download complete 234b70d0479d: Downloading [=========> ] 9.363MB/51.84MB 6fa07a00e2f0: Downloading [==> ] 9.51MB/192.4MB 04a31b4508b8: Waiting e11ae5168189: Waiting 8861a99744cb: Waiting d59580d95305: Waiting 12b1523494c1: Waiting d1a4b09e9dbc: Waiting 99f41c3f014f: Waiting ``` While /etc/resolv.conf has the correct content and ping (and any other utility that uses libc's DNS resolution implementation) works correctly docker is unable to resolve the hostname and falls back to default [::1]:53. This started to happen after PR https://github.com/sonic-net/sonic-buildimage/pull/13516 has been merged. As you can see from the log, dockerd is able to pick up the correct /etc/resolv.conf only after 5 sec since first try. This seems to be somehow related to the logic in Go's DNS resolver https://github.com/golang/go/blob/master/src/net/dnsclient_unix.go#L385. There have been issues like that reported in docker like: - https://github.com/docker/cli/issues/2299 - https://github.com/docker/cli/issues/2618 - https://github.com/moby/moby/issues/22398 Since this starts to happen after inclusion of resolvconf package by above mentioned PR and the fact I can't see any problem with that (ping, nslookup, etc. works) the choice is made to force dockerd to use cgo (libc) resolver. Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>	2023-02-14 08:57:19 +02:00
zhixzhu	f0f7639fa2	set cable length to 1m for backplane ports (#13572 ) Signed-off-by: Zhixin Zhu zhixzhu@cisco.com Why I did it The cable length of backplane ports needs to be specified. How I did it Added separate handling for the specific port name.	2023-02-10 19:01:49 -08:00
andywongarista	1894e0aafe	Increase PikeZ varlog size (#13550 ) Why I did it To address error sometimes seen when running sonic-mgmt test_stress_routes.py::test_announce_withdraw_route on 720DT-48S How I did it Update boot0 logic to set platform specific varlog size for 720DT-48S How to verify it Verified that /var/log size increased and error is no longer observed when running test	2023-02-09 13:24:09 -08:00
Samuel Angebault	dd7948bf17	[Arista] Add emmc quirks in boot0 to improve reliability (#10013 ) Why I did it Fix some unreliability seen on emmc device with some AMD CPUs How I did it Added a kernel parameter to add quirks to It depends on a sonic-linux-kernel change to work properly but will be a no-op without it. The quirk added is SDHCI_QUIRK2_BROKEN_HS200 used to downgrade the link speed for the eMMC.	2023-02-09 10:46:09 -08:00
Stephen Sun	e3ff08833e	[Mellanox] Support DSCP remapping in dual ToR topo on T0 switch (#12605 ) - Why I did it Support DSCP remapping in dual ToR topo on T0 switch for SKU Mellanox-SN4600c-C64, Mellanox-SN4600c-D48C40, Mellanox-SN2700, Mellanox-SN2700-D48C8. - How I did it Regarding buffer settings, originally, there are two lossless PGs and queues 3, 4. In dual ToR scenario, the lossless traffic from the leaf switch to the uplink of the ToR switch can be bounced back. To avoid PFC deadlock, we need to map the bounce-back lossless traffic to different PGs and queues. Therefore, 2 additional lossless PGs and queues are allocated on uplink ports on ToR switches. On uplink ports, map DSCP 2/6 to TC 2/6 respectively On downlink ports, both DSCP 2/6 are still mapped to TC 1 Buffer adjusted according to the ports information: Mellanox-SN4600c-C64: 56 downlinks 50G + 8 uplinks 100G Mellanox-SN4600c-D48C40, Mellanox-SN2700, Mellanox-SN2700-D48C8: 24 downlinks 50G + 8 uplinks 100G - How to verify it Unit test. Signed-off-by: Stephen Sun <stephens@nvidia.com>	2023-02-07 16:21:59 +02:00
Chun'ang Li	eea54717b8	Fix rsyslogd start failure caused by empty rsyslog.conf (#13669 ) - Why I did it In the to-sonic and multi-asic KVM tests, the pretest sometimes failed. The reason is that the rsyslogd process cannot start in the teamd container because rsyslog.conf is empty, which is caused by a failed sonic-cfggen execution. - How I did it If sonic-cfggen -d fails, run it again without -d, because the template file has default values. - How to verify it Built an image and tested it over 40 times; all runs passed the pretest. Signed-off-by: Chun'ang Li <chunangli@microsoft.com>	2023-02-06 16:38:04 +02:00
Sudharsan Dhamal Gopalarathnam	1ff0c0b685	[Mellanox][sai_failure_dump]Added platform specific script to be invoked during SAI failure dump (#13533 ) - Why I did it Added platform specific script to be invoked during SAI failure dump. Added some generic changes to mount /var/log/sai_failure_dump as read write in the syncd docker - How I did it Added script in docker-syncd of mellanox and copied it to /usr/bin - How to verify it Manual UT and new sonic-mgmt tests	2023-02-05 16:45:49 +02:00
Saikrishna Arcot	ee1c32a802	Use tmpfs for /var/log for Arista 7260 (#13587 ) This is to reduce writes to disk, which can cause the SSD to wear out faster. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>	2023-02-02 09:07:33 -08:00
anamehra	26af468a99	Add support for platform topology configuration service (#12066 ) * Add support for platform topology configuration service This service invokes the platform plugin for platform specific topology configuration. The path for platform plugin script is: /usr/share/sonic/device/$PLATFORM/plugins/config-topology.sh If the platform plugin is not available, this service does nothing. Signed-off-by: anamehra <anamehra@cisco.com>	2023-02-01 12:53:45 -08:00
Richard.Yu	a096363b48	[broadcom]: Set default SYNCD_SHM_SIZE for Broadcom XGS devices (#13297 ) After the upgrade to brcmsai 8.1, the recommended minimum memory sizes for the SDK running environment (container) are as follows: TH4/TD4 (ltsw) uses 512MB, TH3 uses 300MB, Helix4/TD2/TD3/TH/TH uses 256 MB. Based on this requirement, adjust the default syncd shared memory size and set the memory size for specific ASICs in the platform_env.conf file for the different types of Broadcom ASICs. How I did it Add the platform_env.conf file for Broadcom platforms that do not have one (based on the platform_asic file). Add 'SYNCD_SHM_SIZE' and set the value: for ltsw (TD4/TH4) devices, set to at least 512M (update the platform_env.conf); for Td2/TH2/TH devices, set to 256M; for TH3, set to 300M. How to verify it Verify the image with the code fix. Check with UT. Check on lab devices. On a problematic device which cannot start successfully, run the command $ cat /proc/linux-kernel-bde Broadcom Device Enumerator (linux-kernel-bde) Module parameters: maxpayload=128 usemsi=0 dmasize=32M himem=(null) himemaddr=(null) DMA Memory (kernel): 33554432 bytes, 0 used, 33554432 free, local mmap No devices found $ docker rm -f syncd syncd $ sudo /usr/bin/syncd.sh start Cannot get Broadcom Chip Id. Skip set SYNCD_SHM_SIZE. 
Creating new syncd container with HWSKU Force10-S6000 a4862129a7fea04f00ed71a88715eac65a41cdae51c3158f9cdd7de3ccc3dd31 $ docker inspect syncd \| grep -i shm "ShmSize": 67108864, "Tag": "fix_8.1_shm_issue.67873427-9f7ca60a0e", On Normal device $ docker inspect syncd \| grep -i shm "ShmSize": 268435456, "Tag": "fix_8.1_shm_issue.67873427-9f7ca60a0e" change the config syncd_shm.ini to b85=128m $ docker rm -f syncd syncd $ sudo /usr/bin/syncd.sh start Creating new syncd container with HWSKU Force10-S6000 3209ffc1e5a7224b99640eb9a286c4c7aa66a2e6a322be32fb7fe2113bb9524c $ docker inspect syncd \| grep -i shm "ShmSize": 134217728, "Tag": "fix_8.1_shm_issue.67873427-9f7ca60a0e", change the config under /usr/share/sonic/device/x86_64-dell_s6000_s1220-r0/Force10-S6000/platform_env.conf and run command $ cat /usr/share/sonic/device/x86_64-dell_s6000_s1220-r0/platform_env.conf SYNCD_SHM_SIZE=300m $ sudo /usr/bin/syncd.sh start Creating new syncd container with HWSKU Force10-S6000 897f6fcde1f669ad2caab7da4326079abd7e811bf73f018c6dacc24cf24bfda5 $ docker inspect syncd \| grep -i shm "ShmSize": 314572800, "Tag": "fix_8.1_shm_issue.67873427-9f7ca60a0e", Signed-off-by: richardyu-ms <richard.yu@microsoft.com>	2023-01-30 20:23:03 -08:00
Oleksandr Ivantsiv	c7ecd92c54	Clear DNS configuration received from DHCP during networking reconfiguration in Linux. (#13516 ) - Why I did it fixes #12907 When the management interface IP address configuration changes from dynamic to static, the DNS configuration (retrieved from the DHCP server) in /etc/resolv.conf remains uncleared. This leads to a DNS configuration pointing to the wrong nameserver. To make the behavior clear, the DNS configuration received from DHCP should be cleared. - How I did it Use the resolvconf package for managing DNS configuration. It is capable of tracking the source of DNS configuration and puts the configuration retrieved from the DHCP servers into a separate file. This allows the DNS configuration retrieved from DHCP to be cleaned up during networking reconfiguration. - How to verify it Ensure that the management interface has no static configuration. Check that /etc/resolv.conf has DNS configuration. Configure a static IP address on the management interface. Verify that /etc/resolv.conf has no DNS configuration. Remove the static IP address from the management interface. Verify that /etc/resolv.conf has DNS configuration retrieved from the DHCP server.	2023-01-30 22:13:10 +02:00
Devesh Pathak	c93716a142	rsyslog to start after interfaces-config (#13503 ) Fixes #12408 Why I did it We are running into #12408 very frequently. This results in no syslogs from any containers as rsyslog server could not start. some of the sonic-mgmt scripts look for log statements and error out if log is not present. Interfaces-config service configures the loopback interface along with other interfaces. rsyslog-config reads ip address of loopback interface and generates /etc/rsyslog.conf. When this race condition happens, lo interface ip is not yet programmed and rsyslog-config ends up writing UDP server as null in /etc/rsyslog.conf. How I did it rsyslog-config service is started after interfaces-config service. How to verify it Did multiple reboots and verified that $UDPServerAddress is valid.	2023-01-26 20:39:13 -08:00
Jing Zhang	dabb31c5f6	[sudoers] add `/usr/local/bin/storyteller` to `READ_ONLY_CMDS` (#13422 ) Adding /usr/local/bin/storyteller to READ_ONLY_CMDS. So no write access or prompt for password is needed to run storyteller. Tested on 202205 clusters, user who didn't request write access was able to grep log using storyteller. sign-off: Jing Zhang zhangjing@microsoft.com	2023-01-26 20:38:29 -08:00
Jing Zhang	78f249be38	change default to be on (#13495 ) Changing the default config knob value to be True for killing radv, due to the reasons below: Killing RADV is to prevent sending the "cease to be advertising interface" protocol packet. RFC 4861 says this ceasing packet as "should" instead of "must", considering that it's fatal to not do this. In active-active scenario, host side might have difficulty distinguish if the "cease to be advertising interface" is for the last interface leaving. 6.2.5. Ceasing To Be an Advertising Interface shutting down the system. In such cases, the router SHOULD transmit one or more (but not more than MAX_FINAL_RTR_ADVERTISEMENTS) final multicast Router Advertisements on the interface with a Router Lifetime field of zero. In the case of a router becoming a host, the system SHOULD also depart from the all-routers IP multicast group on all interfaces on which the router supports IP multicast (whether or not they had been advertising interfaces). In addition, the host MUST ensure that subsequent Neighbor Advertisement messages sent from the interface have the Router flag set to zero. sign-off: Jing Zhang zhangjing@microsoft.com	2023-01-24 23:59:54 +00:00
Zain Budhwani	c9a33cb00e	Fix segfault issue inside memory_checker (#13066 ) Why I did it A segfault was occurring when running memory_checker. How I did it Deinit the publisher immediately after publishing. How to verify it Manual testing.	2023-01-24 15:30:41 -08:00
Jing Zhang	260a2ec3e7	[dualtor][active-active]Killing radv instead of stopping on `active-active` dualtor if config knob is on (#13408 ) How I did it radv sends a good-bye packet when the service is stopped, which causes an IPv6 route update on the SoC side. This update leads to interface bouncing and causes traffic disruption even though the ToR device might already be isolated. This PR mitigates the traffic disruption during planned maintenance by killing radv instead of stopping it, so the cease packet is not sent. How to verify it Verified on dev clusters: Traffic disruption was no longer reproducible. radv took the killing path. If the knob was off, radv would take the stopping path. sign-off: Jing Zhang zhangjing@microsoft.com	2023-01-20 15:34:34 -08:00
Graham Hayes	e077b5362c	[Arista] Rely on automatic flash size detection for Raven (#13277 ) Many of these switches have had their flash upgraded beyond 2GB; however, in boot0 both were assigned 2GB for legacy reasons. Remove the hardcoding of the flash size and let boot0 autodetect the available space. Signed-off-by: Graham Hayes <gr@ham.ie>	2023-01-12 23:52:40 -08:00
xumia	e6a01ca5eb	[Bug] Fix SONiC installation failure caused by pip/pip3 not found (#13284 ) The main issue is that the pip/pip3 command cannot be found when the package is installed by apt-get. With dpkg install, the search path is PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin With apt-get install, the search path is PATH=/usr/sbin:/usr/bin:/sbin:/bin The default pip/pip3 path is /usr/local/bin, so dpkg works but apt-get does not. How I did it Export the path /usr/local/bin for pip/pip3 so that the deb packages can be installed by apt-get.	2023-01-11 08:54:24 -08:00
centecqianj	4b933bd566	[Centec arm64] Solve the abnormal console speed of centec-arm64 switch board (#13126 ) The console of the centec-arm64 board is ttyAMA0. The current regular expression cannot parse it correctly. Signed-off-by: centecqianj <qianj@centec.com>	2023-01-07 21:10:03 -08:00
abdosi	9ecd27ddbb	During build time mask only those features/services that are disabled explicitly (#13283 ) What I did: Fix: #13117 How I did: During build time, mask only those features/services that are disabled explicitly. The state of some features (e.g. teamd/bgp/dhcp-relay/mux) is determined at run time, so for those features the service is up and running by default, and hostcfgd masks it later if needed. So the default behavior will be: if init_cfg.json.j2 sets the state to disabled during build time, then mask the service; if init_cfg.json.j2 sets the state to another jinja2 template render string during build time, then do not mask the service; if init_cfg.json.j2 sets the state to enabled during build time, then do not mask the service. How I verify: Manual verification. Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>	2023-01-07 02:36:37 +00:00
Arvindsrinivasan Lakshmi Narasimhan	a57fa16839	[Chassis][Voq]update to add buffer_queue config on system ports (#12156 ) Why I did it In the voq chassis the buffer_queue configuration needs to be applied on system_port instead of the sonic port. This PR has the change to do this. How I did it Modify buffer_config.j2 to generate buffer_queue configuration on system_ports if the device is Voq Chassis How to verify it Verify the buffer_queue configuration is generated properly using sonic-cfggen Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>	2022-12-31 23:59:54 -08:00
Oleksandr Ivantsiv	127d60f9b8	[build] Adjust teamd and radv features configuration according to the compilation options. (#13139 ) - Why I did it The followup to #12920 PR. If the feature compilation is disabled its configuration should not be included into init_cfg.json. - How I did it Update init_cfg.json.j2 template to include teamd and radv features configuration only if their compilation is enabled. - How to verify it The default behavior is preserved. To verify the changes compile the image without overriding INCLUDE_TEAMD and INCLUDE_ROUTER_ADVERTISER options. The generated /etc/sonic/init_cfg.json should remain with no changes. Install the image and verify that both teamd and radv containers are present and running. Verify that feature state returned by show feature status command is enabled. Change the INCLUDE_TEAMD or INCLUDE_ROUTER_ADVERTISER value to "n". Compile and install the image. Verify that feature configuration is not included in generated /etc/sonic/init_cfg.json file. Verify that show feature status output doesn't include the feature.	2022-12-27 13:55:37 +02:00
Stepan Blyshchak	661669c805	[swss/syncd] remove dependency on interfaces-config.service (#13084 ) - Why I did it Remove dependency on interfaces-config.service to speed up boot, because interfaces-config.service takes a lot of time on boot. - How I did it Changed service files for swss, syncd. - How to verify it Boot and check swss/syncd start time comparing to interfaces-config Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>	2022-12-26 09:20:45 +02:00
Junchao-Mellanox	2126def04e	[infra] Support syslog rate limit configuration (#12490 ) - Why I did it Support syslog rate limit configuration feature - How I did it Remove unused rsyslog.conf from containers Modify docker startup script to generate rsyslog.conf from template files Add metadata/init data for syslog rate limit configuration - How to verify it Manual test New sonic-mgmt regression cases	2022-12-20 10:53:58 +02:00
Konstantin Vasin	8a3fad2891	[Build] mount cgroup2 in chroot to fix build on ubuntu 22.04 (#13030 ) Why I did it Ubuntu 22.04 uses cgroup2 by default, but docker.sh doesn't mount it. As a result we get an error when trying to run docker info in chroot env: ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? How I did it mount cgroup2 in chroot if all enabled kernel cgroup controllers are currently not in use by cgroup1 So we need to mount cgroup in chroot environment on /sys/fs/cgroup. Because inside chroot we don't know which cgroup version is used by the host we have two possible solutions: cgroup tree for chroot is mounted by the host (it was my 1st version of this fix) cgroup tree is mounted inside chroot based on info from /proc/cgroups (it's current version of this fix) My 2nd version based on this code from systemd: `5c6c587ce2/src/shared/cgroup-setup.c (L35-L74)` We parse info from /proc/cgroups Skip header line started from # Skip controller if it's disabled (4th column = 0) Count number of controllers with non-zero of hierarchy_id (2nd column) If this number is not zero then we assume some of controllers are used by host system and the host system uses hybrid or legacy cgroup tree. In this case we can't use unified cgroup tree inside chroot and mount old cgroup tree (v1). If this number is zero then we assume host system uses unified cgroup tree and we need to mount cgroup2 inside chroot. Signed-off-by: Konstantin Vasin <k.vasin@yadro.com>	2022-12-17 12:16:45 -08:00
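The decision rule described in the commit above can be sketched in a few lines. This is a minimal illustration, assuming the standard four-column layout of /proc/cgroups (name, hierarchy ID, number of cgroups, enabled flag); the function names are hypothetical and this is not the shell code from the PR.

```python
def count_v1_controllers(proc_cgroups_text):
    """Count enabled controllers that are attached to a cgroup v1 hierarchy."""
    in_use = 0
    for line in proc_cgroups_text.splitlines():
        # Skip blank lines and the header line, which starts with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        _name, hierarchy_id, _num_cgroups, enabled = fields[:4]
        # Skip controllers that are disabled (4th column is 0).
        if enabled == "0":
            continue
        # A non-zero hierarchy ID (2nd column) means cgroup v1 uses the controller.
        if hierarchy_id != "0":
            in_use += 1
    return in_use


def cgroup_fs_to_mount(proc_cgroups_text):
    """Return the filesystem type to mount on /sys/fs/cgroup inside the chroot."""
    # Any controller in use by v1 implies a legacy or hybrid host: mount v1.
    # Otherwise the host is assumed to use the unified tree: mount cgroup2.
    return "cgroup" if count_v1_controllers(proc_cgroups_text) > 0 else "cgroup2"
```

On a real host the input would be the contents of /proc/cgroups, for example `cgroup_fs_to_mount(open("/proc/cgroups").read())`.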
Oleksandr Ivantsiv	9988ff888b	[build] Add the possibility to disable compilation of teamd and radv containers. (#12920 ) - Why I did it This optimization is needed for DPU SONiC. DPU SONiC runs a limited set of containers and teamd and radv containers are not part of them. Unlike the other containers, there was no possibility to disable teamd and radv containers compilation. To reduce DPU SONiC compilation time and reduce the image size this commit adds the possibility to disable their compilation. - How I did it Two new configuration options are added to rules/config file: INCLUDE_TEAMD INCLUDE_ROUTER_ADVERTISER By default to preserve the existing behavior both options are enabled. There are two ways to override them: To change option value to "n" in rules/config file. To override their value using SONIC_OVERRIDE_BUILD_VARS env variable: SONIC_OVERRIDE_BUILD_VARS="SONIC_INCLUDE_TEAMD=y SONIC_INCLUDE_ROUTER_ADVERTISER=n" - How to verify it The default behavior is preserved. To verify it compile the image without overriding new options. Install the image and verify that both teamd and radv containers are present and running. To verify the new options override them with "n" value. Compile and install image. Verify that no docker containers are present. Verify that SWSS can start without errors.	2022-12-13 12:06:30 +02:00
Saikrishna Arcot	00b11ec4e2	Replace logrotate cron file with (adapted) systemd timer file (#12921 ) Debian is shipping a systemd timer unit for logrotate, but we're also packaging in a cron job, which means both of them will run, potentially at the same time. Remove our cron file, and add an override to the shipped timer file to have it be run every 10 minutes. Fixes #12392. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>	2022-12-08 14:13:11 -08:00
Junchao-Mellanox	3b3837a636	[containercfgd] Add containercfgd and syslog rate limit configuration support (#12489 ) * [containercfgd] Add containercfgd and syslog rate limit configuration support * Fix build issue * Fix checker issue * Fix review comment * Fix review comment * Update containercfgd.py	2022-12-08 08:58:35 -08:00
Arvindsrinivasan Lakshmi Narasimhan	7db272556e	[chassis] update the asic_status.py to read from CHASSIS_FABRIC_ASIC_INFO_TABLE (#12576 ) Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan arlakshm@microsoft.com Why I did it Fixes #12575 and #12575 How I did it In the PR sonic-net/sonic-platform-daemons#311 chassisd updates to CHASSIS_FABRIC_ASIC_INFO with the fabric asic info. Updating the asic_status.py to read from the correct table. How to verify it test on chassis Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>	2022-12-07 21:53:47 -08:00
Michael Li	50b962b4a8	Limit reload BCM SDK kmods on syncd start to PikeZ platform (#12971 ) Why I did it Limiting #12804 changes to PikeZ platform only (Arista-720DT-48S). Note that this is a short term workaround for this platform until SDK investigation on SDK init failure on docker syncd restart due to DMA issues is resolved. How I did it Retrieve platform name from /host/machine.conf and only reload SDK kmods on Arista-720DT-48S platform. Signed-off-by: Michael Li <michael.li@broadcom.com>	2022-12-07 09:53:21 +08:00
Stepan Blyshchak	8ca0530920	[swss.sh] optimize macsec feature state query (#12946 ) - Why I did it There's a slowdown in bootup related to the execution of a show command during startup of swss service. show is a pretty heavy command and takes long time to execute ~2 sec. - How I did it I replaced show with sonic-db-cli which takes a ms to run. - How to verify it Boot the switch and verify swss is active. Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>	2022-12-06 11:23:46 +02:00
Kalimuthu-Velappan	aaeafa8411	02.Version cache - docker cache build framework (#12001 ) During docker build, host files can be passed to the docker build through docker context files. But there is no straightforward way to transfer the files from docker build to host. This feature provides a tricky way to pass the cache contents from docker build to host. It tars the cached content, encodes it as base64, and passes it through a log file between the special tags 'VCSTART' and 'VCENT'. Slave.mk on the host then extracts the cache contents from the log and stores them in the cache folder. Cache contents are encoded as base64 format for easy passing.	2022-12-02 08:28:45 +08:00
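The log-based transfer described in the commit above can be illustrated with a small round trip. This is only a sketch: the tag names come from the commit message, but putting each tag on its own line, the gzip compression, and the function names `pack_to_log` and `unpack_from_log` are assumptions made for this example, not the framework's actual format.

```python
import base64
import io
import tarfile

START_TAG = "VCSTART"  # tag names from the commit message; line layout is assumed
END_TAG = "VCENT"


def pack_to_log(files):
    """Tar the given {name: bytes} mapping and return it as tagged base64 log text."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    payload = base64.b64encode(buf.getvalue()).decode("ascii")
    return "\n".join([START_TAG, payload, END_TAG])


def unpack_from_log(log_text):
    """Find the tagged section in a build log and return the files it carries."""
    lines = log_text.splitlines()
    start = lines.index(START_TAG)
    end = lines.index(END_TAG, start + 1)
    raw = base64.b64decode("".join(lines[start + 1:end]))
    files = {}
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                files[member.name] = tar.extractfile(member).read()
    return files
```

Matching the tags as whole lines, rather than as substrings, avoids a false match inside the base64 payload, whose alphabet includes uppercase letters.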
Michael Li	f725b83bd6	Reload BCM SDK kmods on syncd start to handle syncd restart issues (#12804 ) Why I did it There is an issue on the Arista PikeZ platform (using T3.X2: BCM56274) while running SONiC. If the 'syncd' container in SONiC is restarted, the expected behaviour is that syncd will automatically restart/recover; however it does not and always fails at create_switch due to BCM SDK kmod DMA operation cancellation getting stuck. Sep 16 22:19:44.855125 pkz208 ERR syncd#syncd: [none] SAI_API_SWITCH:platform_process_command:428 Platform command "init soc" failed, rc = -1. Sep 16 22:19:44.855206 pkz208 INFO syncd#supervisord: syncd CMIC_CMC0_PKTDMA_CH4_DESC_COUNT_REQ:0x33#015 Sep 16 22:19:44.855264 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:platformInit:1909 initialization command "init soc" failed, rc = -1 (Internal error). Sep 16 22:19:44.855403 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:sai_driver_init:642 Error initializing driver, rc = -1. ... Sep 16 22:19:44.855891 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:brcm_sai_create_switch:1173 initializing SDK failed with error Operation failed (0xfffffff5). Reloading the BCM SDK kmods allows the switch init to continue properly. How I did it If BCM SDK kmods are loaded, unload and load them again on syncd docker start script. How to verify it Steps to reproduce: In SONiC, run 'docker ps' to see current running containers; 'syncd' should be present. Run 'docker stop syncd' Wait ~1 minute. Run 'docker ps' to see that syncd is missing. Check logs to see messages similar to the above. Signed-off-by: Michael Li <michael.li@broadcom.com>	2022-11-30 16:16:30 +08:00
Kebo Liu	36a100083f	[Mellanox] Add support to Mellanox Spectrum-4 ASIC Firmware compiling and upgrade (#12844 ) - Why I did it Add support for compiling Spectrum-4 ASIC firmware to the SONiC image Add support for Spectrum-4 ASIC firmware upgrade - How I did it Update Mellanox fw make files to include Spectrum-4 ASIC firmware binaries. Update firmware upgrade scripts to be able to detect Spectrum-4 ASIC. - How to verify it Run regression tests Signed-off-by: Kebo Liu <kebol@nvidia.com>	2022-11-29 16:38:41 +02:00
Zain Budhwani	4b001e5115	Change value type of params in memory_checker (#12797 ) Fix error when calling events API, required value is string, passing float	2022-11-23 17:37:28 -08:00
bingwang-ms	f402e6b5c6	Apply separated DSCP_TO_TC_MAP and TC_TO_QUEUE_MAP to uplink ports on dualtor (#12730 ) Why I did it The PR is to apply separated DSCP_TO_TC_MAP and TC_TO_QUEUE_MAP to uplink ports on dualtor. The traffic with DSCP 2 and DSCP 6 from T1 is treated as lossless traffic (DSCP 2 maps to TC 2 and Queue 2; DSCP 6 maps to TC 6 and Queue 6). Traffic with DSCP 2 or DSCP 6 from downlink is still treated as lossy traffic as before. How I did it Define DSCP_TO_TC_MAP|AZURE_UPLINK and TC_TO_QUEUE_MAP|AZURE_UPLINK. How to verify it Verified by UT Verified by copying the new template to a testbed, and rendering a config_db.json	2022-11-21 11:42:28 -08:00
Konstantin Vasin	6448afd338	[Build] set apt Acquire::Retries to 3 for bullseye (#12758 ) Why I did it There were some changes in apt source code in version 2.1.9. As a result apt used in bullseye (2.2.4) is intolerant to network issues. This was fixed in `10631550f1` Already fixed version is used in bookworm (2.5.4) And not yet affected version is used in buster (1.8.2.3) How I did it Set Acquire::Retries to 3 for sonic-slave-bullseye, docker-base-bullseye and final Debian image. Ref: https://bugs.launchpad.net/ubuntu/+source/apt/+bug/1876035 Signed-off-by: Konstantin Vasin k.vasin@yadro.com	2022-11-21 08:05:16 +08:00
Lorne Long	7e525d96b3	[Build] Use apt-get to predictably support dependency ordered configuration of lazy packages (#12164 ) Why I did it The current lazy installer relies on a filename sort for both unpack and configuration steps. When systemd services are configured [started] by multiple packages the order is by filename not by the declared package dependencies. This can cause the start order of services to differ between first-boot and subsequent boots. Declared systemd service dependencies further exacerbate the issue (e.g. blocking the first-boot script). The current installer leaves packages un-configured if the package dependency order does not match the filename order. This also fixes a trivial bug in [Build]: Support to use symbol links for lazy installation targets to reduce the image size #10923 where externally downloaded dependencies are duplicated across lazy package device directories. How I did it Changed the staging and first-boot scripts to use apt-get: dpkg -i /host/image-$SONIC_VERSION/platform/$platform/.deb becomes apt-get -y install /host/image-$SONIC_VERSION/platform/$platform/.deb when dependencies are detected during image staging. How to verify it Apt-get critical rules Add a Depends= to the control information of a package. Grep the syslog for rc.local between images and observe the configuration order of packages change.	2022-11-17 11:20:42 +08:00
abdosi	668485aac5	Added Support to runtime render bgp and teamd feature state and lldp has_asic_scope flag (#11796 ) Added Support to runtime render bgp and teamd feature `state` and lldp `has_asic_scope` flag Needed for SONiC on chassis. Signed-off-by: Abhishek Dosi <abdosi@microsoft.com> Co-authored-by: mlok <marty.lok@nokia.com>	2022-11-15 16:20:14 -08:00
abdosi	bd348c5264	[chassis-packet] fix the issue of internal ip arp not getting resolved. (#12127 ) Fix the issue where arp_update does not ping some IPs even though they are in the failed state, because grepping for the IP in the output of ip neigh show does not do an exact word match and can return multiple matches.	2022-11-14 10:15:17 -08:00
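The matching problem in the commit above is easy to reproduce: a substring search for one address also hits longer addresses that start with the same characters. The sketch below shows the difference on sample text shaped like `ip neigh show` output; the sample lines and the helper name `failed_entries_for` are made up for illustration, and the real fix lives in the arp_update shell script.

```python
SAMPLE_NEIGH_OUTPUT = """\
10.0.0.1 dev eth0 FAILED
10.0.0.10 dev eth0 lladdr aa:bb:cc:dd:ee:ff REACHABLE
10.0.0.11 dev eth0 FAILED
"""


def failed_entries_for(neigh_output, ip):
    """Return lines whose first field is exactly `ip` and whose state is FAILED."""
    matches = []
    for line in neigh_output.splitlines():
        fields = line.split()
        # Compare whole fields, not substrings, so 10.0.0.1 never matches 10.0.0.10.
        if fields and fields[0] == ip and fields[-1] == "FAILED":
            matches.append(line)
    return matches


# A naive substring test matches all three sample lines for 10.0.0.1.
substring_hits = [l for l in SAMPLE_NEIGH_OUTPUT.splitlines() if "10.0.0.1" in l]
exact_hits = failed_entries_for(SAMPLE_NEIGH_OUTPUT, "10.0.0.1")
```

With several substring hits, a script that expects a single state for the address can misjudge a failed neighbor as reachable, which is the kind of miss the commit describes.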
Zain Budhwani	98ace33b0f	Add rsyslog plugin regex for select operation failure (#12659 ) Added events for select op, alpm parity error, moved dhcp events from host to container	2022-11-13 21:41:33 -08:00
Jing Kan	111752957f	[dhcp_relay] Enable DHCP Relay for BmcMgmtToRRouter in init_cfg (#12648 ) Why I did it DHCP relay feature needs to be enabled for BmcMgmtToRRouter by default How I did it Update device type list	2022-11-10 13:37:02 +08:00
Devesh Pathak	0ea4f4d00e	Clear /etc/resolv.conf before building image (#12592 ) Why I did it nameserver and domain entries from build system fsroot get into the sonic image. How I did it Clear /etc/resolv.conf before building image How to verify it Built image with it and verified with install that /etc/resolv.conf is empty	2022-11-09 16:54:56 -08:00
xumia	ac5d89c6ac	[Build] Support j2 template for debian sources (#12557 ) Why I did it Unify the Debian mirror sources. Make it easy to upgrade to the next Debian release, with no source URL code change required. Support customizing the Debian mirror sources during the build. Related issue: #12523	2022-11-09 08:09:53 +08:00
judyjoseph	c259c996b4	Use the macsec_enabled flag in platform to enable macsec feature state (#11998 ) * Use the macsec_enabled flag in platform to enable macsec feature state * Add macsec supported metadata in DEVICE_RUNTIME_METADATA	2022-11-08 11:03:38 -08:00
Sudharsan Dhamal Gopalarathnam	e6a0fba9ea	[logrotate]Fix logrotate firstaction script to reflect correct size (#12599 ) - Why I did it Fix logrotate firstaction script to reflect correct size. The size was modified to change dynamically based on disk size. However this variable was not updated #9504 - How I did it Updated the variable based on disk size - How to verify it Verify in the generated rsyslog file if the variable is correctly generated from jinja template	2022-11-08 13:38:14 +02:00
Lawrence Lee	ddf16c9d8c	[arp_update]: Fix hardcoded vlan (#12566 ) Typo in prior PR #11919 hardcodes Vlan name. Change command to use the $vlan variable instead Signed-off-by: Lawrence Lee <lawlee@microsoft.com>	2022-11-07 12:10:00 -08:00
Zain Budhwani	8f48773fd1	Publish additional events (#12563 ) Add event_publish code or regex for rsyslog plugin for additional events	2022-11-07 09:57:57 -08:00
Mai Bui	61a085e55e	Replace os.system and remove subprocess with shell=True (#12177 ) Signed-off-by: maipbui <maibui@microsoft.com> #### Why I did it `subprocess` is used with `shell=True`, which is very dangerous for shell injection. `os` - not secure against maliciously constructed input and dangerous if used to evaluate dynamic content #### How I did it remove `shell=True`, use `shell=False` Replace `os` by `subprocess`	2022-11-04 10:48:51 -04:00
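The pattern behind the commit above is to pass the command as an argument list so that no shell ever parses the input. The sketch below is a generic illustration rather than code from the PR: it runs the current Python interpreter as a stand-in child process, and the helper name `echo_argument` is hypothetical.

```python
import subprocess
import sys


def echo_argument(user_input):
    """Pass untrusted text to a child process as one argument, with no shell."""
    # Each list element becomes exactly one argv entry; shell metacharacters
    # such as ';' or '$(...)' in user_input are never interpreted.
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", user_input],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")


# The text is echoed back verbatim; the part after ';' is not executed.
echoed = echo_argument("hello; echo injected")
```

With `shell=True` and a string command, the same input would have been split at the semicolon and the second part run as a separate command.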
bingwang-ms	6169ae3ee3	Add lossy scheduler for queue 7 (#12596 ) * Add lossy scheduler for queue 7	2022-11-04 08:12:00 +08:00
ntoorchi	45d174663a	Enable P4RT at build time and disable at startup (#10499 ) #### Why I did it Currently at the Azure build system, the P4RT container is disabled by default at the build time. Here the goal is to include the P4RT container at the build time while disabling it at the runtime. The user can enable/disable the p4rt app through the config based on the preference. #### How I did it Changed the config in rules/config and init-cfg.json.j2	2022-10-31 16:18:42 -07:00
Devesh Pathak	85e3a81f47	Fix to improve hostname handling (#12064 ) * Fix to improve hostname handling If config_db.json is missing hostname entry, hostname-config.sh ends up deleting existing entry too and hostname changes to default 'localhost' * default hostname to 'sonic' if missing in config file	2022-10-25 14:51:02 -07:00
Samuel Angebault	f39c2adc04	Fix extraction of platform.tar.gz for firsttime (#11935 )	2022-10-21 18:27:32 -07:00
Samuel Angebault	9cdd78788f	Add support for UpperlakeElite (#12280 ) Signed-off-by: Samuel Angebault <staphylo@arista.com> Signed-off-by: Samuel Angebault <staphylo@arista.com>	2022-10-21 18:26:43 -07:00
Mariusz Stachura	9f88d03c2b	[QoS] Support dynamic headroom calculation for Barefoot platforms (#11708 ) Signed-off-by: Mariusz Stachura <mariusz.stachura@intel.com> What I did Adding the dynamic headroom calculation support for Barefoot platforms. Why I did it Enabling dynamic mode for barefoot case. How I verified it The community tests are adjusted and pass.	2022-10-19 09:36:56 -07:00
cytsao1	9ef8464964	[pmon] Add smartmontools to pmon docker (#11837 ) * Add smartmontools to pmon docker * Set smartmontools to install version 7.2-1 in pmon to match host; clean up smartmontools build files * Add comments on smartmontools version for both host and pmon	2022-10-17 13:26:31 -07:00
Ying Xie	bc684fef0b	[BGP] starting BGP service after swss (#12381 ) Why I did it BGP service has always been starting after interface-config. However, recently we discovered an issue where some BGP sessions are unable to establish due to BGP daemon not able to read the interface IP. This issue was clearly observed after upgrading to FRR 8.2.2. See more details in #12380. How I did it Delaying starting BGP seems to be a workaround for this issue. However, caution is that this delay might impact warm reboot timing and other timing sequences. This workaround is reducing the probability of hitting the issue by close to 100X. However, this workaround is not bulletproof as test shows. It is still preferable to have a proper FRR fix and revert this change in the future. How to verify it Continuously issuing config reload and check BGP session status afterwards. Signed-off-by: Ying Xie <ying.xie@microsoft.com>	2022-10-13 09:24:06 -07:00
Hua Liu	257cc96d7c	Remove swsssdk from sonic OS image and docker container image (#12323 ) Remove swsssdk from sonic OS image and docker image #### Why I did it swsssdk is deprecated, so it needs to be removed from the image. #### How I did it Update config file to remove swsssdk from image. #### How to verify it Pass all test cases. #### Description for the changelog Remove swsssdk from sonic OS image and docker image	2022-10-12 13:04:14 +08:00
Zain Budhwani	09fe3f467f	Add Structured Events w/ YANG Models (#12270 ) Add events for dhcp-relay, bgp, syncd, & kernel.	2022-10-09 20:23:31 -07:00
Prince George	ac1d392d4c	Disable bracketed-paste mode by default (#12285 ) * Disable bracketed-paste mode by default * address review comment	2022-10-06 07:55:09 -07:00
Saikrishna Arcot	9251d4ba8b	[docker-wait-any]: Exit worker thread if main thread is expected to exit (#12255 ) There's an odd crash that intermittently happens after the teamd container exits, and a signal is raised to the main thread to exit. This thread (watching teamd) continues execution because it's in a `while True`. The subsequent wait call on the teamd container very likely returns immediately, and it calls `is_warm_restart_enabled` and `is_fast_reboot_enabled`. In either of these cases, sometimes, there is a crash in the transition from C code to Python code (after the function gets executed). Python sees that this thread got a signal to exit, because the main thread is exiting, and tells pthread to exit the thread. However, during the stack unwinding, _something_ is telling the unwinder to call `std::terminate`. The reason is unknown. This then results in a python3 SIGABRT, and systemd then doesn't call the stop script to actually stop the container (possibly because the main process exited with a SIGABRT, so it's a hard crash). This means that the container doesn't actually get stopped or restarted, resulting in an inconsistent state afterwards. The workaround appears to be that if we know the main thread needs to exit, just return here, and don't continue execution. This at least tries to avoid it from getting into the problematic code path. However, it's still feasible to get a SIGABRT, depending on thread/process timings (i.e. teamd exits, signals the main thread to exit, and then syncd exits, and syncd calls one of the two C functions, potentially hitting the issue). Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>	2022-10-05 18:14:10 -07:00
Muhammad Danish	8c10851c2a	Update azure.github.io links to sonic-net.github.io (#12209 ) Why I did it azure.github.io/SONiC/ no longer works and returns 404 Not Found. Updated it to the correct sonic-net.github.io/SONiC/	2022-10-02 14:02:10 +08:00
Aryeh Feigin	2c10ebb4fe	Use warm-boot infrastructure for fast-boot (#11594 ) This PR should be merged together with the sonic-utilities PR (sonic-net/sonic-utilities#2286) and sonic-sairedis PR (sonic-net/sonic-sairedis#1100). Use redis contents from dump file in fast-reboot. Improve fast-reboot flow by utilizing the warm-reboot infrastructure. This follows https://github.com/sonic-net/SONiC/blob/master/doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md	2022-09-26 09:01:49 -07:00
Zain Budhwani	fd6a1b0ce2	Add events to host and create rsyslog_plugin deb pkg (#12059 ) Why I did it Create rsyslog plugin deb for other containers/host to install Add events for bgp and host events	2022-09-21 09:20:53 -07:00
Stepan Blyshchak	e662008f72	[services] kill container on stop in warm/fast mode (#10510 ) - Why I did it To optimize stop on warm boot. - How I did it Added kill for containers	2022-09-19 19:34:33 +03:00
Volodymyr Boiko	c243af0cce	[bgp][service] Start bgp service after interfaces-config service (#11827 ) - Why I did it interfaces-config service restarts networking service, during the restart loopback interface address is being removed and reassigned back, leaving loopback without an ipv4 address for a while. On SONiC startup and config reload interfaces-config and bgp services start in parallel and sometimes fpmsyncd in bgp attempts bind to loopback while it does not have an address, fails with the log Exception "Cannot assign requested address" had been thrown in daemon and exits with rc 0. root@sonic:/# supervisorctl status fpmsyncd EXITED Jul 20 05:04 AM zebra RUNNING pid 35, uptime 6:15:05 zsocket EXITED Jul 20 05:04 AM docker logs bgp INFO exited: fpmsyncd (exit status 0; expected) With fpmsyncd dead, configured routes do not appear in the database. - How I did it Added ordering dependency on interfaces-config service into bgp.config - How to verify it Itself the issue reproduces quite rarely, but one can gain the time interval between networking down and networking up in interfaces-config.sh like this: diff --git a/files/image_config/interfaces/interfaces-config.sh b/files/image_config/interfaces/interfaces-config.sh index f6aa4147a..87caceeff 100755 --- a/files/image_config/interfaces/interfaces-config.sh +++ b/files/image_config/interfaces/interfaces-config.sh @@ -63,7 +63,11 @@ done # Read sysctl conf files again sysctl -p /etc/sysctl.d/90-dhcp6-systcl.conf -systemctl restart networking +# systemctl restart networking + +systemctl start networking +sleep 10 +systemctl stop networking # Clean-up created files rm -f /tmp/ztp_input.json /tmp/ztp_port_data.json with this change the issue reproduces on every config reload. Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com>	2022-09-19 17:25:10 +03:00
lixiaoyuner	a1b50cac41	Make client identity by AME cert (#11946 ) * Make client identity by AME cert * Join k8s cluster by ipv6 * Change join test cases * Test case bug fix * Improve read node label func * Configure kubelet and change test cases * For kubernetes version 1.22.2 * Fix undefined issue Signed-off-by: Yun Li <yunli1@microsoft.com>	2022-09-16 13:13:39 +08:00
Maxime Lorrillere	0a7dd50dcb	[Chassis][Voq]Configure midplane network on supervisor (#11725 ) Multi-asic Docker instances are created behind Docker's default bridge which doesn't allow talking to other Docker instances that are in the host network (like database-chassis). On linecards, we configure midplane interfaces to let per-asic docker containers talk to CHASSIS_DB on the supervisor through internal chassis network. On the supervisor we don't need to use chassis internal network, but we still need a similar setup in order to allow fabric containers to talk to database-chassis	2022-09-15 17:23:41 -07:00
Oleksandr Ivantsiv	549bb3d483	[services] Update "WantedBy=" section for tacacs-config.timer. (#11893 ) The timer execution may fail if triggered during a config reload (when the sonic.target is stopped). This might happen in a rare situation if config reload is executed after reboot in a small time slot (for 0 to 30 seconds) before the tacacs-config timer is triggered. To ensure that timer execution will be resumed after a config reload the WantedBy section of the systemd service is updated to describe relation to sonic.target. Signed-off-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com> Signed-off-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com>	2022-09-08 15:16:11 -07:00
Ze Gan	016f671857	[docker-macsec]: Add dependencies of MACsec (#11770 ) Why I did it If the SWSS services was restarted, the MACsec service should also be restarted. Otherwise the data in wpa_supplicant and orchagent will not be consistent. How I did it Add dependency in docker-macsec.mk. How to verify it Manually check by 'sudo service swss restart'. The MACsec container should be started after swss, the syslog will look like Sep 8 14:36:29.562953 sonic INFO swss.sh[9661]: Starting existing swss container with HWSKU Force10-S6000 Sep 8 14:36:30.024399 sonic DEBUG container: container_start: BEGIN ... Sep 8 14:36:33.391706 sonic INFO systemd[1]: Starting macsec container... Sep 8 14:36:33.392925 sonic INFO systemd[1]: Starting Management Framework container... Signed-off-by: Ze Gan <ganze718@gmail.com>	2022-09-08 23:45:06 +08:00
Renuka Manavalan	31e750ee0b	Fix PR build failure (#11973 ) Some PR builds fails to find this file. Remove it temporarily until we root cause it	2022-09-06 15:13:05 -07:00
Stepan Blyshchak	a8b2a538a5	[docker-wait-any] immediately start to wait (#11595 ) It could happen that a container has already crashed but docker-wait-any will wait forever till it starts. It should, however, immediately exit to make the service restart. #### Why I did it It is observed in some circumstances that the auto-restart mechanism does not work. Specifically for ```swss.service```, ```orchagent``` had crashed before ```docker-wait-any``` started in ```swss.sh```. This led ```docker-wait-any``` wait forever for ```swss``` to be in ```"Running"``` state and it results in: ``` CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 1abef1ecebff bcbca2b74df6 "/usr/local/bin/supe…" 22 hours ago Up 22 hours what-just-happened 3c924d405cd5 docker-lldp:latest "/usr/bin/docker-lld…" 22 hours ago Up 22 hours lldp eb2b12a98c13 docker-router-advertiser:latest "/usr/bin/docker-ini…" 22 hours ago Up 22 hours radv d6aac4a46974 docker-sonic-mgmt-framework:latest "/usr/local/bin/supe…" 22 hours ago Up 22 hours mgmt-framework d880fd07aab9 docker-platform-monitor:latest "/usr/bin/docker_ini…" 22 hours ago Up 22 hours pmon 75f9e22d4fdd docker-snmp:latest "/usr/local/bin/supe…" 22 hours ago Up 22 hours snmp 76d570a4bd1c docker-sonic-telemetry:latest "/usr/local/bin/supe…" 22 hours ago Up 22 hours telemetry ee49f50344b3 docker-syncd-mlnx:latest "/usr/local/bin/supe…" 22 hours ago Up 22 hours syncd 1f0b0bab3687 docker-teamd:latest "/usr/local/bin/supe…" 22 hours ago Up 22 hours teamd 917aeeaf9722 docker-orchagent:latest "/usr/bin/docker-ini…" 22 hours ago Exited (0) 22 hours ago swss 81a4d3e820e8 docker-fpm-frr:latest "/usr/bin/docker_ini…" 22 hours ago Up 22 hours bgp f6eee8be282c docker-database:latest "/usr/local/bin/dock…" 22 hours ago Up 22 hours database ``` The check for ```"Running"``` state is not needed because for cold boot case we do ```start_peer_and_dependent_services``` and for warm boot case the loop will retry to wait for container if this container is doing warm boot: `d01a91a569/files/image_config/misc/docker-wait-any (L56)` #### How I did it Removed the check for ```"Running"```. #### How to verify it Kill swss before ```docker-wait-any``` is reached and verify auto restart will restart swss service.	2022-09-06 09:26:54 -07:00
Zain Budhwani	6a54bc439a	Streaming structured events implementation (#11848 ) With this PR in, you flap BGP and use events_tool to see the published events. With telemetry PR #111 in and corresponding submodule update done in buildimage, one could run gnmi_cli to capture BGP flap events.	2022-09-03 07:33:25 -07:00
Ying Xie	a6843927d9	[mux] skip mux operations during warm shutdown (#11937 ) * [mux] skip mux operations during warm shutdown - Enhance write_standby.py script to skip actions during warm shutdown. - Expand the support to BGP service. - MuX support was added by a previous PR. - don't skip action during warm recovery Signed-off-by: Ying Xie <ying.xie@microsoft.com>	2022-09-02 13:50:42 -07:00
Lawrence Lee	a762b35cbc	[arp_update]: Set failed IPv6 neighbors to incomplete (#11919 ) After pinging any failed IPv6 neighbor entries, set the remaining failed/incomplete entries to a permanent INCOMPLETE state. This manual setting to INCOMPLETE prevents these entries from automatically transitioning to FAILED state, and since they are now incomplete any subsequent NA messages for these neighbors are able to resolve the entry in the cache. Signed-off-by: Lawrence Lee <lawlee@microsoft.com>	2022-09-02 13:40:40 -07:00
Longxiang Lyu	6e878a36da	[mux] Exit to write `standby` state to `active-active` ports (#11821 ) [mux] Exit to write standby state to `active-active` ports Signed-off-by: Longxiang Lyu <lolv@microsoft.com>	2022-08-31 13:10:22 -07:00
abdosi	3bf1abb2dc	Address Review Comment to define SONIC_GLOBAL_DB_CLI in gbsyncd.sh (#11857 ) As part of PR #11754 Change was added to use variable SONIC_DB_NS_CLI for namespace but that will not work since ./files/scripts/syncd_common.sh uses SONIC_DB_CLI. So revert back to use SONIC_DB_CLI and define new variable for SONIC_GLOBAL_DB_CLI for global/host db cli access Also fixed DB_CLI not working for namespace.	2022-08-29 08:19:28 -07:00
Hua Liu	214e394ac0	Remove swsssdk from rules and image. (#11469 ) #### Why I did it To deprecate swsssdk, remove all dependency to it. #### How I did it Remove swsssdk from rules and build image scripts. #### How to verify it Pass all UT and E2E test case #### Description for the changelog Remove swsssdk from rules and build image scripts.	2022-08-25 08:35:51 +08:00
anamehra	f404ce60e0	container_checker on supervisor should check containers based on asic presence (#11442 ) Why I did it On a supervisor card in a chassis, syncd/teamd/swss/lldp etc dockers are created for each Switch Fabric card. However, not all chassis would have all the switch fabric cards present. In this case, only dockers for Switch Fabrics present would be created. The monit 'container_checker' fails in this scenario as it is expecting dockers for all Switch Fabrics (based on NUM_ASIC defined in asic.conf file).	2022-08-22 10:08:29 -07:00
abdosi	535612f808	Added support to add gbsyncd in Feature Table of Host Config DB (#11754 ) Why I did: In case of multi-asic platforms gbsyncd is not getting added to Feature Table of Host Config DB. Without this, container_checker complains that unneeded gbsyncd containers are running. How I did: Update both Host and Namespace config db when gbsyncd docker is starting. How I verify: Verified on Multi-asic platforms.	2022-08-17 14:02:21 -07:00
Stepan Blyshchak	a66941a6ce	[syncd.sh] 'sxdkernel start' => 'sxdkernel restart' (#11718 ) Change `sxdkernel start` to `sxdkernel restart`. If `syncd` service crashes in `ExecStartPre` systemd will not call `ExecStop` and thus will not call `sxdkernel stop`. Use of `sxdkernel restart` is more robust in terms of guarantees to restore the system after unexpected crashes. Signed-off-by: Stepan Blyschak <stepanb@nvidia.com> Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>	2022-08-15 13:35:34 -07:00

1170 Commits