Commit Graph

1247 Commits

Author SHA1 Message Date
abdosi
6d767e549d
[chassis] Support advertisement of Loopback0 of all LC's across all e-BGP peers in TSA mode (#16714) (#17837)
What I did:
In Chassis TSA mode Loopback0 Ip's of each LC's should be advertise through e-BGP peers of each remote LC's

How I did:

- Route-map policy to Advertise own/self Loopback IP to other internal iBGP peers with a community internal_community as define in constants.yml
- Route-map policy to match on above internal_community when route is received from internal iBGP peers and set a internal tag as define in constants.yml and also delete the internal_community so we don't send to any of e-BGP peers
- In TSA new route-map match on above internal tag and permit the route (Loopback0 IP's of remote LC's) and set the community to traffic_shift_community.
- In TSB delete the above new route-map.

How I verify:

Manual Verification

UT updated.
sonic-mgmt PR: sonic-net/sonic-mgmt#10239

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2024-01-19 12:53:26 +08:00
abdosi
aa01630841 [build]: Added flag in sonic_version.yml to see if image is secured or non-secured (#16191)
What I did:

Added flag in sonic_version.yml to see if compiled image is secured or non-secured. This is done using build/compile time environmental variable SECURE_UPGRADE_MODE as define in HLD: https://github.com/sonic-net/SONiC/blob/master/doc/secure_boot/hld_secure_boot.md

This flag does not provide the runtime status of whether the image has booted securely or not. It's possible that compile time signed image (secured image) can boot on non secure platform.

Why I did:
Flag can be used for manual check or by the test case.

ADO: 24319390

How I verify:
Manual Verification

---
build_version: 'master-16191.346262-cdc5e72a3'
debian_version: '11.7'
kernel_version: '5.10.0-18-2-amd64'
asic_type: broadcom
asic_subtype: 'broadcom'
commit_id: 'cdc5e72a3'
branch: 'master-16191'
release: 'none'
build_date: Fri Aug 25 03:15:45 UTC 2023
build_number: 346262
built_by: AzDevOps@vmss-soni001UR5
libswsscommon: 1.0.0
sonic_utilities: 1.2
sonic_os_version: 11
secure_boot_image: 'no'

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2024-01-18 14:36:15 +08:00
Liping Xu
b53a8908b0 disable restapi for leafRouter in slim image (#17713)
Why I did it
For some devices with small memory, after upgrading to the latest image, the available memory is not enough.

Work item tracking
Microsoft ADO (number only):
26324242
How I did it
Disable restapi feature for LeafRouter which with slim image.

How to verify it
verified on 7050qx T1 (slim image), restapi disabled
verified on 7050qx T0 (slim image), restapi enabled
verified on 7260 T1 (normal image), restapi enabled
2024-01-12 20:47:37 +08:00
Lawrence Lee
9f66a7dbf6 add timeout to ping6 command (#17729)
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2024-01-11 12:34:22 +08:00
Nazarii Hnydyn
13aa19acf4
[swss/syncd]: Remove dependency on interfaces-config.service (#17649)
Signed-off-by: Nazarii Hnydyn nazariig@nvidia.com

This improvement does not bring any warm-reboot degradation, since the database availability (tcp/ip access over the loopback interface) was fixed by these PRs:

Re-add 127.0.0.1/8 when bringing down the interfaces #15080
Fix potentially not having any loopback address on lo interface #16490
Why I did it
Removed dependency on interfaces-config.service to speed up the boot, because interfaces-config.service takes a lot of time on init
Work item tracking
N/A
How I did it
Changed service files for swss/syncd
How to verify it
Boot and check swss/syncd start time comparing to interfaces-config
2024-01-10 21:15:55 +08:00
mssonicbld
388457aaf8
[arp_update]: Flush neighbors with incorrect MAC info (#17238) (#17677) 2024-01-05 00:35:07 +08:00
mssonicbld
38c5e6825d
Fix can't access IPV6 address via management interface because 'default' route table does not add to route lookup issue. (#17281) (#17676) 2024-01-05 00:32:38 +08:00
mssonicbld
76d8ff06b3
Update backend_acl.py to specify ACL table name (#17553) (#17673) 2024-01-04 23:10:28 +08:00
mssonicbld
b72a74047e
[docker_image_ctl.j2]: swss docker initialization improvements (#17628) (#17672) 2024-01-04 22:37:26 +08:00
mssonicbld
609b3a7646
[image_config]: Update DHCP rate-limit for mgmt TOR devices (#17630) (#17671) 2024-01-04 22:29:00 +08:00
Junchao-Mellanox
a8ad19d4ac
[202305] Optimize syslog rate limit feature for fast and warm boot (#17478)
Backport PR #17458 due to conflict.

Why I did it
Optimize syslog rate limit feature for fast and warm boot

Work item tracking
Microsoft ADO (number only):
How I did it
Optimize redis start time
Don't render rsyslog.conf in container startup script
Disable containercfgd by default. There is a new CLI to enable it (in another PR)
How to verify it
Manual test
Regression test
2023-12-28 20:53:49 +08:00
mssonicbld
690c4c5d36
Fix the fsck script that does filesystem repair (#17424) (#17621) 2023-12-26 19:58:48 +08:00
mssonicbld
7f03584d9d
Fix syncd_request_shutdown coredump in config reload on KVM sonic (#17486) (#17564) 2023-12-19 19:02:31 +08:00
mssonicbld
d7ba0608ba
[gbsyncd] Graceful shutdown of syncd process in container gbsyncd (#16812) (#17523) 2023-12-17 22:39:08 +08:00
mssonicbld
b09e7d1b6b
[gbsyncd]: Set SYSLOG_CONFIG_FEATURE for gbsyncd (#17325) (#17513) 2023-12-15 14:54:26 +08:00
Stepan Blyshchak
2cea4bcbdf [config-chassisdb] use cached variables (#17342)
- Why I did it
Improve boot performance mostly needed for fast and warmboot

- How I did it
Use cached variable.

- How to verify it
Boot the system. Simply do "systemd-analyze blame" and look at service start time.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2023-12-09 14:32:43 +08:00
Stepan Blyshchak
bc4bc03239 [config-topology] use cached variables (#17343)
- Why I did it
Improve  boot performance mostly needed for fast and warmboot

- How I did it
Use cached variable.

- How to verify it
Boot the system. Simply do "systemd-analyze blame" and look at service start time.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2023-12-09 14:32:39 +08:00
mssonicbld
688245a724
Revert "[swss/syncd] remove dependency on interfaces-config.service (#13084) (#14341)" (#15094) (#17367) (#17461) 2023-12-08 19:45:49 +08:00
StormLiangMS
fa7be88599
Revert "[pmon] update gRPC version to 1.57.0 (#16257) (#17219)" (#17391)
This reverts commit 066065f1cd.
2023-12-05 10:38:51 +08:00
StormLiangMS
2c28502ddd
Revert "Share docker image and use telemetry container for 202305 (#17255)" (#17356)
This reverts commit 2c7d53e5fb.
2023-11-30 20:41:38 +08:00
ganglv
2c7d53e5fb
Share docker image and use telemetry container for 202305 (#17255)
Why I did it
Need to share docker image for telemetry and gnmi, and only use telemetry container for 202305 branch

Work item tracking
Microsoft ADO (number only):
How I did it
Add a new docker image, base-gnmi, build sonic-gnmi and sonic-telemetry on this docker image.
Enable telemetry container.

How to verify it
Run end to end test for telemetry and gnmi.
2023-11-24 11:22:48 +08:00
vdahiya12
066065f1cd
[pmon] update gRPC version to 1.57.0 (#16257) (#17219)
* [pmon] update gRPC version to 1.57.0 (#16257)

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>

* fix conflict

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>

---------

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
2023-11-23 21:03:07 +08:00
prabhataravind
aa8a5403b8 [image_config]: Update DHCP rate-limit (#17132)
Change DHCP rate limit in SONiC copp configuration to 100 PPS as this is
necessary to ensure that DHCP flood does not cause LACP/BGP flaps in all
scenarios

This is an extension to the change in image_config: copp: Enable rate limiting 
for bgp, lacp, dhcp, lldp, macsec and udld #14859 and sonic-mgmt change in 
[tests/copp]: Update copp mgmt tests to support new rate-limits sonic-mgmt#8199

Why I did it
300 PPS is not sufficient to prevent LACP/BGP flaps in all cases. 100 PPS seems to
provide better resiliency against DHCP traffic flood to CPU.

Microsoft ADO 25776614:

Send DHCP broadcast packets to DUT and verify that they are trapped to CPU at 100 PPS.

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
2023-11-23 12:33:56 +08:00
ganglv
733a902a70
Revert "[202305] Share image for gnmi and telemetry (#17137)" (#17261)
This reverts commit f2a495f7e5.
2023-11-22 23:51:34 +08:00
mssonicbld
1337d295a3
[chassisd]: Add alternate to the bridge interface created on chassis supervisor. (#16505) (#17223) 2023-11-19 14:42:00 +08:00
ganglv
f2a495f7e5
[202305] Share image for gnmi and telemetry (#17137)
Why I did it
Share docker image to support gnmi container and telemetry container
backport #16863

Work item tracking
Microsoft ADO 25423918:
How I did it
Create telemetry image from gnmi docker image.
Enable gnmi container and disable telemetry container by default.

How to verify it
Run end to end test.
2023-11-15 11:28:21 +08:00
mssonicbld
78cc6cfa22
[copp]: Enable rate limiting for bgp, lacp, dhcp, lldp, macsec and udld (#14859) (#17111) 2023-11-07 20:52:08 +08:00
Vadym Hlushko
28ecd068d4
[202305][buffers] Add 'create_only_config_db_buffers.json' file for the Mellanox devices (not MSFT SKU) (#17006)
Why I did it
Add the create_only_config_db_buffers attribute to the DEVICE_METADATA|localhost. If the "create_only_config_db_buffers" exists and is equal to "true" - the buffers will be created according to the config_db configuration (for example BUFFER_QUEUE|* table), otherwise the maximum available buffers (which are read from SAI) will be created, regardless of the CONFIG_DB buffers configuration.

Work item tracking
Microsoft ADO (number only):
How I did it
Add the create_only_config_db_buffers.json files for Mellanox devices (not MSFT SKU's), and inject the content to the CONFIG_DB during the swss docker container start.

How to verify it
Manual verification:

Install the image with this PR included on the not MSFT SKU switch
Check the show queue counters output and verify that only configured in CONFIG_DB buffers are created
root@sonic:/home/admin# show queue counters
     Port    TxQ    Counter/pkts    Counter/bytes    Drop/pkts    Drop/bytes
---------  -----  --------------  ---------------  -----------  ------------
Ethernet0    UC0               0                0            0           N/A
Ethernet0    UC1               0                0            0           N/A
Ethernet0    UC2               0                0            0           N/A
Ethernet0    UC3               0                0            0           N/A
Ethernet0    UC4               0                0            0           N/A
Ethernet0    UC5               0                0            0           N/A
Ethernet0    UC6               0                0            0           N/A
Open the /usr/share/sonic/device/$DEVICE/$SKU/create_only_config_db_buffers.json and change it to:
"create_only_config_db_buffers": "false"
Do config reload
Check the show queue counters output and verify that all available buffers are created
root@sonic:/home/admin# show queue counters
     Port    TxQ    Counter/pkts    Counter/bytes    Drop/pkts    Drop/bytes
---------  -----  --------------  ---------------  -----------  ------------
Ethernet0    UC0               0                0            0           N/A
Ethernet0    UC1               0                0            0           N/A
Ethernet0    UC2               0                0            0           N/A
Ethernet0    UC3               0                0            0           N/A
Ethernet0    UC4               0                0            0           N/A
Ethernet0    UC5               0                0            0           N/A
Ethernet0    UC6               0                0            0           N/A
Ethernet0    UC7              60            15346            0           N/A
Ethernet0    MC8             N/A              N/A          N/A           N/A
Ethernet0    MC9             N/A              N/A          N/A           N/A
Ethernet0   MC10             N/A              N/A          N/A           N/A
Ethernet0   MC11             N/A              N/A          N/A           N/A
Ethernet0   MC12             N/A              N/A          N/A           N/A
Ethernet0   MC13             N/A              N/A          N/A           N/A
Ethernet0   MC14             N/A              N/A          N/A           N/A
Ethernet0   MC15             N/A              N/A          N/A           N/A
2023-11-03 14:27:17 +08:00
mssonicbld
fbf30ec6a8
[tacacs]: Fix tcpdump report error when tacacs enabled (#16372) (#17077) 2023-11-03 04:31:18 +08:00
mssonicbld
feaa855346
Add special rsyslog filter for MSN2700 platform (#16684) (#17078) 2023-11-03 03:05:44 +08:00
Samuel Angebault
274e929f11
Reduce SONiC image filesystem size (#16948)
Why I did it
Running SONiC releases past 202012 has become really challenging on system with small storage devices (4GB).
Some of these devices can also be limited by only having 4GB of RAM which complicates mitigations.
The main contributor to these issues is the SONiC image growth.
Being able to reduce it by some decent amount should allow these systems to run SONiC longer.
It would also reduce some impacts related to space savings mitigations.

Work item tracking
Microsoft ADO (number only):
How I did it
Add a build option to reduce the image size.
The image reduction process is affecting the builds in 2 ways:

change some packages that are installed in the rootfs
apply a rootfs reduction script
The script itself will perform a few steps:

remove file duplication by leveraging hardlinks
under /usr/share/sonic since the symlinks under the device folder are lost during the build.
under /var/lib/docker since the files there will only be mounted ro
remove some extra files (man, docs, licenses, ...)
some image specific space reduction (only for aboot images currently)
The script can later be improved but for now it's reducing the rootfs size by ~30%.

How to verify it
Compare the size of an image with this option enabled and this option enabled.
Expect the fully extracted content to be ~30% less.

Which release branch to backport (provide reason below if selected)
This is a backport of #16729

Description for the changelog
Add build option to reduce final image size
2023-10-24 21:08:38 +08:00
mssonicbld
bf605cf771
[ci/build]: Upgrade SONiC package versions (#16964) 2023-10-21 23:04:00 +08:00
Longxiang Lyu
dd20597e4d [snmp] Check intfmgrd running before start (#16588)
Add pre start check to ensure intfmgrd is running.
The check will run for 20 seconds at most.

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
2023-10-21 12:32:42 +08:00
Aman Singhal
f265c79541 [cisco]: Enable Kdump config by default for cisco-8000 (#16224)
Why I did it
Enabling kdump by default for cisco-8000 by setting crashkernel cmdline arg in device installer.conf.
After bootup, sonic-kdump-config wipes crashkernel arg from /host/grub/grub.cfg, and resets USE_KDUMP in /etc/default/kdump-tools, so kdump will not be enabled on subsequent reboot.

How I did it
Setting kdump enable config as part of init_cfg.json for cisco-8000 platforms.

How to verify it
Install SONiC image with kdump enabled by default (device/hwsku/installer.conf), then reboot.
Kdump config should persist on subsequent reboots and kdump loaded during bootup

Signed-off-by: Aman Singhal <amans@cisco.com>
2023-10-18 00:37:30 +08:00
Samuel Angebault
dbea038e96 Disable CPU C-States other than C1 (#16703)
Why I did it
Networking devices need to be responsive. Such responsiveness is harmed when the CPU change state.
There is a latency penalty when a CPU is idle (e.g C2) and need to exit this state to come back to C1 state.
To prevent this from happening the CPU should be forced to remain in C1 state.

How I did it
Generalize the cstate forcing to C1 to all Arista products.
This is done by adding processor.max_cstate=1 to the kernel cmdline for all CPUs.
Additionally Intel CPUs also need intel_idle.max_cstate=0 to fallback to the acpi_idle driver.

How to verify it
Check that processor.max_cstate=1 is present on the cmdline for AMD CPUs
Check that both processor.max_cstate=1 and intel_idle.max_cstate=0 are present on the cmdline for Intel CPUs
2023-10-17 20:49:07 +08:00
mssonicbld
e80b956502
[ci/build]: Upgrade SONiC package versions (#15617) 2023-10-17 20:48:25 +08:00
Saikrishna Arcot
39cdee57e1 [baseimage]: Update openssh to 1:8.4p1-5+deb11u2 (#16826)
Openssh in Debian Bullseye has been updated to 1:8.4p1-5+deb11u2 to fix CVE-2023-38408. 
Since we're building openssh with some patches, we need to update our version as well.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-10-17 16:34:18 +08:00
mssonicbld
185a63bc7f
[fast-reboot] Fix regression: set FAST_REBOOT state_db flag to support fast-reboot from older images (#16733) (#16753) 2023-09-29 05:29:20 +08:00
vganesan-nokia
52d5980c0c
[swss] Chassis db clean up optimization and bug fixes (#16454) (#16644)
* [swss] Chassis db clean up optimization and bug fixes

This commit includes the following changes:
    - Fix for regression failure due to error in finding CHASSIS_APP_DB in
    pizzabox (#PR 16451)
    - After attempting to delete the system neighbor entries from
    chassis db, before starting clearing the system interface entries,
    wait for sometime only if some system neighbors were deleted.
    If there are no system neighbors entries deleted for the asic coming up,
    no need to wait.
    - Similar changes for system lag delete. Before deleting the
    system lag, wait for some time only if some system lag memebers were
    deleted. If there are no system lag members deleted no need to wait.
    - Flush the SYSTEM_NEIGH_TABLE from the local STATE_DB. While asic
    is coming up, when system neigh entries are deleted from chassis ap
    db (as part of chassis db clean up), there is no orchs/process running to
    process the delete messages from chassis redis. Because of this, stale system
    neigh are entries present in the local STATE_DB. The stale entries result in
    creation of orphan (no corresponding data path/asic db entry) kernel neigh
    entries during STATE_DB:SYSTEM_NEIGH_TABLE entries processing by nbrmgr (after
    the swss serive came up). This is avoided by flushing the SYSTEM_NEIGH_TABLE from
    the local STATE_DB when sevice comes up.

Signed-off-by: vedganes <veda.ganesan@nokia.com>

* [swss] Chassis db clean up bug fixes review comment fix - 1

Debug logs added for deletion of other tables (SYSTEM_INTERFACE and SYSTEM_LAG_TABLE)

Signed-off-by: vedganes <veda.ganesan@nokia.com>

---------

Signed-off-by: vedganes <veda.ganesan@nokia.com>
(cherry picked from commit b13b41fc22)
2023-09-22 10:58:27 +08:00
mssonicbld
e7f49c9bce
Fix potentially not having any loopback address on lo interface (#16490) (#16628)
In #15080, there was a command added to re-add 127.0.0.1/8 to the lo
interface when the networking configuration is being brought down.
However, the trigger for that command is `down`, which, looking at
ifupdown2 configuration files, runs immediately after 127.0.0.1/16 is
removed. This means there may be a period of time where there are no
loopback addresses assigned to the lo interface, and redis commands will
fail.

Fix this by changing this to pre-down, which should run well before
127.0.0.1/16 is removed, and should always leave lo with a loopback
address.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Co-authored-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-09-21 20:40:21 +08:00
Alpesh Patel
6b48346ff5 qos template change for backend compute-ai deployment (#16150)
#### Why I did it

To enable qos config for a certain backend deployment mode, for resource-type "Compute-AI".
This deployment has the following requirement:

- Config below enabled if DEVICE_TYPE as one of backend_device_types
- Config below enabled if ResourceType is 'Compute-AI'
- 2 lossless TCs' (2, 3)
- 2 lossy TCs' (0,1)
- DSCP to TC map uses 4 DSCP code points and maps to the TCs' as follows:
   "DSCP_TO_TC_MAP": {
        "AZURE": {
             "48" : "0",
            "46" : "1",
            "3"  : "3",
            "4"  : "4"
        }
    }

- WRED profile has green {min/max/mark%} as {2M/10M/5%}

This required template change <as in the PR> in addition to the vendor qos.json.j2 file (not included here).

### How I did it

#### How to verify it
- with the above change and the vendor config change, generated the qos.json file and verified that the objective stated in "Why I did it" was met

- verified no error

### Description for the changelog
Update qos_config.j2 for Comptue-AI deployment on one of backend device type roles
2023-09-21 18:34:11 +08:00
Prince George
d5a96f69f1 [platform]: Disable interrupt for intel i2c-i801 driver (#16309)
On S6100 we are seeing almost 100K interrupts per second on intels i801 SMBUS controller which affects systems performance.

We now disable the i801 driver interrupt and instead enable polling

Microsoft ADO (number only): 24910530

How I did it
Disable the interrupt by passing the interrupt disable feature argument to i2c-i801 driver

How to verify it
This fix is NOT applicable for ARM based platforms. Applicable only for intel based platforms:-

- On SN2700 its already disabled in Mellanox hw-mgmt
- Celestica DX010 and E1031
- Dell S6100 verified the interrupts are no longer incrementing.
- Arista 7260CX3

Signed-off-by: Prince George <prgeor@microsoft.com>
2023-09-21 16:33:37 +08:00
StormLiangMS
2b381b1fd4
Revert "revert [syslog] Add remote syslog configuration (cherry-pick to 202305) (#15897) (#16179)" (#16549)
This reverts commit 164fa102c0.
2023-09-14 20:52:14 +08:00
Kebo Liu
fe7eeed051
[202305][Mellanox] Update SDK/FW/SAI to 4.6.1020/2012.1020/SAIBuild2305.25.0.3(#16096) (#16298)
* [Mellanox] Update SDK/FW/SAI to 4.6.1020/2012.1020/SAIBuild2305.25.0.3 (#16096)

SONiC changes:
1. Support Spectrum4 ASIC FW binary building.
2. Support new SDK sx-obj-desc lib building since new SAI need it.
3. Remove SX_SCEW debian package from Mellanox SDK build since we are no longer using it (we use libxml2 instead).
4. Update SAI, SDK, FW to version 4.6.1020/2012.1020/SAIBuild2305.25.0.3

SDK/FW bug fixes
1. In SPC-1 platforms: Fastboot mode is not operational for Split port with Force mode in 50G speed
SFP modules are kept in disabled state after set LPM (low power mode) on/off for at least 3 minutes.
2. When preforming fast boot from an old SDK version (currently installed) to a newer one (target version), and the system was initially loaded with a new SDK version (past version), and the system has not been wiped, under specific conditions, the fast boot would use the past version's data and may fail.

SDK/FW Features
1. On SN2700 all ports can support y cable by credo

SAI bug Fixes
1. When creating an ACL rule with SAI_ACL_ENTRY_ATTR_FIELD_SRC_IP/SAI_ACL_ENTRY_ATTR_FIELD_DST_IP enabled, and then disabling the field by setting enable=false, a match on L3_type=IPv4 will remain programmed for the rule Issue resolved after the fix
2. Allow the max scale of virtual routers to be configure for SPC-1, SPC-2, SPC-3 when fastboot enable
3. Remove default hash key of SRC_MAC, DST_MAC and ETH_TYPE

SAI features
1. Port init profile

- How I did it
Update SDK/FW/SAI make files

- How to verify it
Run full sonic-mgmt regression on Mellanox platform

Signed-off-by: Kebo Liu <kebol@nvidia.com>
Conflicts:
	platform/mellanox/mlnx-sai.mk

* Fix issue: unprintable character is rendered when handling comments in j2

Use "{#-" and "-#}" to mark comments in jinja template

Signed-off-by: Stephen Sun <stephens@nvidia.com>

---------

Signed-off-by: Stephen Sun <stephens@nvidia.com>
Co-authored-by: Stephen Sun <stephens@nvidia.com>
2023-09-10 22:28:46 +08:00
mssonicbld
ebe24a134c
[chassis] Chassis DB cleanup when asic comes up (#16213) (#16417) 2023-09-03 23:52:39 +08:00
mssonicbld
40a5cea84c
Assign the higher metric value for Ipv6 default route learnt via RA message (#16367) (#16429) 2023-09-03 22:16:46 +08:00
mssonicbld
d62ae374a9
chassis-packet: Update arp_update script for FAILED and STALE check (#16311) (#16423) 2023-09-03 21:24:17 +08:00
Junchao-Mellanox
cead17cb55 Fix issue: systemctl daemon-reload would sporadically cause udev handler fail (#15253)
#### Why I did it

A workaround to back port the fix for a systemd issue.

The systemd issue: https://github.com/systemd/systemd/issues/24668
The systemd PR to fix the issue: https://github.com/systemd/systemd/pull/24673/files

The formal solution should upgrade systemd to a version that contains the fix. But, systemd is a very basic service, upgrading systemd requires heavy test. 

#### How I did it
Copy the correct systemd-udevd.service file in build time 

#### Tested branch (Please provide the tested image version)

- [x] 202211
- [ ] <!-- image version 2 -->

```
SONiC Software Version: SONiC.fix-udev.3-b65c7bdec_Internal
SONiC OS Version: 11
Distribution: Debian 11.7
Kernel: 5.10.0-18-2-amd64
Build commit: b65c7bdec
Build date: Mon Jun 19 10:54:50 UTC 2023
Built by: sw-r2d2-bot@r-build-sonic-ci02-241

Platform: x86_64-mlnx_msn4700-r0
HwSKU: ACS-MSN4700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2022X08597
Model Number: MSN4700-WS2FO
Hardware Revision: A1
Uptime: 08:10:11 up 1 min,  1 user,  load average: 1.81, 0.67, 0.24
Date: Sun 25 Jun 2023 08:10:11

Docker images:
REPOSITORY                    TAG                             IMAGE ID       SIZE
docker-fpm-frr                fix-udev.3-b65c7bdec_Internal   a7b911e7cb6f   346MB
docker-fpm-frr                latest                          a7b911e7cb6f   346MB
docker-platform-monitor       fix-udev.3-b65c7bdec_Internal   94c5178cf80b   731MB
docker-platform-monitor       latest                          94c5178cf80b   731MB
docker-orchagent              fix-udev.3-b65c7bdec_Internal   46b393e0ace8   328MB
docker-orchagent              latest                          46b393e0ace8   328MB
docker-syncd-mlnx             fix-udev.3-b65c7bdec_Internal   1f5c6c23e33a   734MB
docker-syncd-mlnx             latest                          1f5c6c23e33a   734MB
docker-sflow                  fix-udev.3-b65c7bdec_Internal   7e45992c8c59   317MB
docker-sflow                  latest                          7e45992c8c59   317MB
docker-teamd                  fix-udev.3-b65c7bdec_Internal   e4d905592cda   316MB
docker-teamd                  latest                          e4d905592cda   316MB
docker-nat                    fix-udev.3-b65c7bdec_Internal   7fe799367580   319MB
docker-nat                    latest                          7fe799367580   319MB
docker-macsec                 latest                          d702a5554171   318MB
docker-snmp                   fix-udev.3-b65c7bdec_Internal   3bce8fcf71cd   338MB
docker-snmp                   latest                          3bce8fcf71cd   338MB
docker-sonic-telemetry        fix-udev.3-b65c7bdec_Internal   f13949cbc817   597MB
docker-sonic-telemetry        latest                          f13949cbc817   597MB
docker-dhcp-relay             latest                          153d9072805d   306MB
docker-router-advertiser      fix-udev.3-b65c7bdec_Internal   aed642b9a6bc   299MB
docker-router-advertiser      latest                          aed642b9a6bc   299MB
docker-sonic-p4rt             fix-udev.3-b65c7bdec_Internal   a3cae5ca65a7   870MB
docker-sonic-p4rt             latest                          a3cae5ca65a7   870MB
docker-mux                    fix-udev.3-b65c7bdec_Internal   b81f0401b9a8   347MB
docker-mux                    latest                          b81f0401b9a8   347MB
docker-eventd                 fix-udev.3-b65c7bdec_Internal   c5917d0e801f   298MB
docker-eventd                 latest                          c5917d0e801f   298MB
docker-lldp                   fix-udev.3-b65c7bdec_Internal   fd5dc14a7976   341MB
docker-lldp                   latest                          fd5dc14a7976   341MB
docker-database               fix-udev.3-b65c7bdec_Internal   438c2715a1dd   299MB
docker-database               latest                          438c2715a1dd   299MB
docker-sonic-mgmt-framework   fix-udev.3-b65c7bdec_Internal   5c50b115fbcd   414MB
docker-sonic-mgmt-framework   latest  
```
2023-09-03 18:32:54 +08:00
Vadym Hlushko
b7dfc5b280 [memory_checker] Add a specific log message in a case when the docker service is not running. (#16018)
#### Why I did it
To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created #11129](https://github.com/sonic-net/sonic-buildimage/pull/11129).
There could be a scenario before the reboot, where
1. The `docker service` has stopped
2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry`

In such scenario, the `memory_checker` script will throw an error to the syslog:
```
ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))'
```
But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog.

#### How I did it
Change the log severity to the warning and changed the return value.

#### How to verify it
It is really hard to catch the exact moment described in the `Why I did it` section.
In order to check the logic:
1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](47742dfc2c/files/image_config/monit/memory_checker (L139)) file on the switch.
2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry`
3. Check the syslog for such messages:
```
WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte
d.', FileNotFoundError(2, 'No such file or directory'))'

INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running!
```
2023-09-03 18:32:43 +08:00
xumia
288ebd5dd3 Support FIPS DB configuration (#15632)
Why I did it
Support FIPS DB configuration
Design Doc: sonic-net/SONiC#1372

Work item tracking
Microsoft ADO (number only): 24411148
How I did it
Add the FIPS Yang model to make FIPS configurable in ConfigDB.

How to verify it
See TestPlan: sonic-net/sonic-mgmt#9092
Build the image and run the tests: sonic-net/sonic-mgmt#9091
2023-09-03 16:33:25 +08:00