### Why I did it
An in-band syslog server will not receive any syslog if it is configured without a VRF specified, which is because `eth0` is always specified as the `device` of a syslog server and the syslog packets will be sent to `eth0` regardless of its destination IP address.
### How I did it
Pass the option "device" in rsyslog.conf only if when syslog server's source address is configured with a non-default VRF
#### How to verify it
Manually test:
1. Configuring a syslog server without VRF specified or with `default` as the VRF: no `device` passed in `rsyslog.conf`
2. Configuring a syslog server with non-default VRF: the configured VRF passed as `device` in `rsyslog.conf`
### Why I did it
Remove UpdateGraphService feature from sonic image. The goal is to simplify the bootup process.
### How I did it
Remove updategraph service and updategraph script.
Update all related services, replace updategraph.service with config-setup.service.
#### How to verify it
Build and install new image, load minigraph and check all the services.
Why I did it
On certain routers with baud rate 9600, crash kernel is taking a long time , close to ~5mins, to complete kernel dump and reload the box. On contrast to routers with baud rate 115200, crash kernel dump process is observed to be completed under 35s-60s (depending on the platform). Currently, all debug and informational messages are printed on the console which also factors in for the delay seen. Unless the router is monitored on console in real time, these messages are not very useful. Setting the loglevel to warning will help reduce the verbosity of logs on console, in turn allow crash kernel dump process to be completed in a reasonable time which will also help in overall router recovery time.
How I did it
Setting loglevel attribute in crashkernel cmdline
How to verify it
Install SONiC image with crashkernel cmdline with loglevel set to warning and initiate an induced a crash (sysrq-trigger)
crashkernel boot and dump process will be completed in 20s-30s depending on the platform
### Why I did it
- Currently inside k8s master image we are going to use AAD to do authentication related stuff with python language, we need to pre-install several azure key-vault related python packages.
- Need to upgrade cri-dockerd to 0.3.10 to support bookworm
- Need to change netcat package name to netcat-openbsd for bookworm
- Remove the unnecessary apt-get update
##### Work item tracking
- Microsoft ADO **(number only)**: 26435886
#### How I did it
- pip3 install azure-keyvault-secrets
- apt-get -y install netcat-openbsd
- upgrade the cri-dockerd version for bookworm
#### How to verify it
- pip3 list to check if azure-keyvault-secrets is installed inside image
- dpkg -l to check if netcat-openbsd is installed inside image
- systemctl status cri-dockerd.service to check if it's running well
Adding rule to ebtables to drop multicast packets in kernel. This was
done to address a bug where NS packets were flooding ports with
duplicate packets.
Signed-off-by: Nikola Dancejic <ndancejic@microsoft.com>
### Why I did it
ipmitool utility is used to access various HW sensors. Some platforms use "ipmitool raw " to read specific addresses.
ipmitool_1.8.19-4_amd64.deb, that is part of bookworm has a defect. The package is missing file enterprise.txt that is expected by the "raw read" code path.
It is so because the file the .deb tries to download at the build time does not have the necessary extension as it is available on remote server: https://www.iana.org/assignments/enterprise-numbers.txt
### How I did it
The defect had been fixed using coding changes in next unstable version of Linux. It is expected to be available in future stable version of the OS. Hence to keep the changes to minimal, the .dsc file is downloaded and only the Makefile is modified to download the correct file. To make is work as patch necessary changes are made.
#### How to verify it
Build log is attached and installation of the file is noted line #2274
When using vanilla bookworm on platforms like 5212 or 5224:
-------------------------------------------------------------------
root@sonic:~# ipmitool raw 0x04 0x2d 0x31
IANA PEN registry open failed: No such file or directory
00 c0 01 80
When fixed we should not see the above error:
--------------------------------------------------
root@sonic:/home/admin# ipmitool raw 0x04 0x2d 0x31
00 c0 00 80
### Description for the changelog
This change is to address ipmitool raw read issue. This patch must be removed once it is available in next stable Linux release that contains the fix.
1edb0e27e4
Why I did it
Update smartmontool verson to 7.4. This is done to prevent smartmontools service to exit with non-zero exit status on platform that does not have a SSD/disk to be monitored.
Until Debian Bullseye (which had smartmontools 7.2), Debian had a patch applied that changed the default quit mode to never exit. A bug report was filed on Debian, saying that the source code patch isn't needed and could just be done via command line options, and also that smartmontools 7.3 has a new built-in option to exit with 0 if there are no monitorable devices found (which prevents systemd from treating it as a service failure). Because of that, Debian Bookworm (which also upgraded to 7.3) removed the patch and restored the default behavior of exiting with exit code 17 if there are no devices found.
Smartmontools v7.3 has this issue, because of which smartd exits with non-zero exit status even with "-q" option.
How I did it
Update the smartmontools to version 7.4 which has the fix for exiting gracefully if no monitoring device is found
Added smartd option "-q nodev0" to allow smartd to exit with status 0 if no monitoring device found
### Why I did it
Disable eventd at buildtime for slim images
##### Work item tracking
- Microsoft ADO **(number only)**:26386286
#### How I did it
Add flags for disabling eventd and only copy rsyslog conf files when eventd is included and not slim image
#### How to verify it
Manual testing
Why I did it
Align the keywords to make qos configuration take effect
Work item tracking
Microsoft ADO (number only):
How I did it
Change the keyword to ComputeAI
How to verify it
reload minigraph and check the qos configuration
Change orchagent stuck message from ERR to WARNING
#### Why I did it
During switch initialization, sometime Orchagent will busy for more than 40seconds and will trigger process stuck workdog error.
To improve this issue, change watchdog error message to warning message.
##### Work item tracking
- Microsoft ADO: 26517622
#### How I did it
Change orchagent stuck message from ERR to WARNING.
#### How to verify it
Pass all UT.
### Description for the changelog
Change orchagent stuck message from ERR to WARNING.
### Why I did it
Unnecessary for logs to be written out to /tmp/${SERVICE}-debug.log as they are already being written to syslog. Therefore, removing writing to a new log in concern for memory space and not being able to startup some services in RO state.
##### Work item tracking
- Microsoft ADO **(number only)**:26458976
#### How I did it
Remove DEBUGLOG definition and line that echo's message to mentioned log file.
#### How to verify it
Manually verified, /tmp/${SERVICE}-debug.log files do not exist and log for service starting still appears in syslog
### Why I did it
Fix the issue detected by[ TestStaticMgmtPortIP::test_dynamic_dns_not_working_when_static_ip_configured ](https://github.com/sonic-net/sonic-mgmt/blob/master/tests/dns/static_dns/test_static_dns.py#L105C9-L105C63) test.
### How I did it
Query MGMT interface configuration. Do not apply dynamic DNS configuration when MGMT interface has static IP address.
#### How to verify it
Run `tests/dns/static_dns/test_static_dns.py` sonic-mgmt tests.
ix IPV6 forced-mgmt-route not work issue
Why I did it
IPV6 forced-mgmt-route not work
When add a IPV6 route, should use 'ip -6 rule add pref 32764 address' command, but currently in the template the '-6' parameter are missing, so the IPV6 route been add to IPV4 route table.
Also this PR depends on #17281 , which will fix the IPV6 'default' route table missing in IPV6 route lookup issue.
Microsoft ADO (number only):24719238
Why I did it
For some devices with small memory, after upgrading to the latest image, the available memory is not enough.
Work item tracking
Microsoft ADO (number only):
26324242
How I did it
Disable restapi feature for LeafRouter which with slim image.
How to verify it
verified on 7050qx T1 (slim image), restapi disabled
verified on 7050qx T0 (slim image), restapi enabled
verified on 7260 T1 (normal image), restapi enabled
* [docker_image_ctl.j2]: swss docker initialization improvements
This commit attempts to address the following:
* Make sure swss container is indeed up and running before running any commands
on it. In case where swss container is not fully up when swss.sh attempts to
create swss:/ready file using "docker exec swss$DEV touch", the command can
fail silently and can cause swssconfig to wait forever leading to missing IP
decap configuration among other things. Add a wait so that docker commands
are run only after swss container status is "Running"
* Add a log when swss:/ready file is created or if the file creation fails so
that it becomes easier to debug such scenarios in the future
* [docker_image_ctl.j2]: Use swss$DEV to accommodate multi ASIC platforms as well
Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
* [image_config]: Update DHCP rate-limit for mgmt TOR devices
Change DHCP rate limit(queue4,group3) in SONiC copp configuration to 300 PPS
for mgmt TORs while keeping the rate limit at 100 PPS for other topologies.
Why I did it:
Some mgmt TORs based on Marvell ASIC do not support 100 PPS CIR, so that led
to these devices silently dropping DHCP packets.
Microsoft ADO: **25820076**
How to verify it:
Send DHCP broadcast packets to an M0 DUT and verify that they are trapped to
CPU at 300 PPS. On non-mgmt devices, the packets should be trapped at CIR of
100 PPS. Also ran sonic-mgmt dhcp_relay test and confirmed that it passes.
Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
- Why I did it
Optimize syslog rate limit feature for fast and warm boot
- How I did it
Optimize redis start time
Don't render rsyslog.conf in container startup script
Disable containercfgd by default. There is a new CLI to enable it (in another PR)
- How to verify it
Manual test
Regression test
Fix the fsck check which is not working. Potentially fixes#16938
Modified fsck script to run on the ext4.fsck on the appropriate disk where SONiC resides
Microsoft ADO: 26098631
The issue is related to #16812. Process syncd does not run in the container gbsyncd on kvm sonic with default hwsku.
Microsoft ADO : 26151608
How I did it
If syncd has not run in container gbsyncd, it is not needed to trigger graceful shudown of syncd.
How to verify it
None of syncd_request_shutdown coredump in config reload on KVM sonic
Why I did it
1. Upgrade centec-arm64 platform to Bookworm.
2. Solve the problem of compiling the docker-syncd-centec-rpc.gz error on the centec platform.
How I did it
1. Modified platform driver to comply with bookworm kernel.
2. Upgrade SONiC package versions of the centec platform.
How to verify it
1. Compile the centec-arm64 platform to generate sonic-centec-arm64.bin.
2. Compile the centec platform to generate docker-syncd-centec-rpc.gz.
Signed-off-by: centecqianj <qianj@centec.com>
- Why I did it
Improve boot performance mostly needed for fast and warmboot
- How I did it
Use cached variable.
- How to verify it
Boot the system. Simply do "systemd-analyze blame" and look at service start time.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
- Why I did it
Improve boot performance mostly needed for fast and warmboot
- How I did it
Use cached variable.
- How to verify it
Boot the system. Simply do "systemd-analyze blame" and look at service start time.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Why I did it
SONiC Mgmt test syslog/test_syslog_rate_limit.py syslog.test_syslog_rate_limit test_syslog_rate_limit was failing on SKUs with gbsyncd. This includes Arista 720DT when testing on the 202305 branch.
How I did it
The issue was no value for gbsyncd in "show syslog rate-limit-container",
because gbsyncd is not having a SYSLOG_CONFIG_FEAGTURE|gbsyncd entry in
config_db, which is further because gbsyncd feature is for not enabled
through init_cfg.json.j2.
How to verify it
Test is now passing on 720DT in 202305 branch.
Co-authored-by: Boyang Yu <byu@arista.com>
Fix can't access IPV6 address via management interface because 'default' route table does not add to route lookup issue.
#### Why I did it
When device set with IPV6 TACACS server address, and shutdown all BGP, device can't connect to TACACS server via management interface.
After investigation, I found the IPV6 'default' route table does not add to route lookup:
admin@vlab-01:~$ ip -6 rule list
1001: from all lookup local
32765: from fec0::ffff:afa:1 lookup default
32766: from all lookup main
admin@vlab-01:~$
As compare:
admin@vlab-01:~$ ip -4 rule list
1001: from all lookup local
32764: from all to 172.17.0.1/24 lookup default
32765: from 10.250.0.101 lookup default
32766: from all lookup main
32767: from all lookup default <== 'default' route table exist in IPV4 route lookup
Issue fix by add 'default' route table to route lookup with following command:
admin@vlab-01:~$ sudo ip -6 rule add pref 32767 lookup default
admin@vlab-01:~$ ip -6 rule list
1001: from all lookup local
32765: from fec0::ffff:afa:1 lookup default
32766: from all lookup main
32767: from all lookup default <== 'default' route table been added to IPV6 route lookup
admin@vlab-01:~$
##### Work item tracking
- Microsoft ADO: 25798732
#### How I did it
When management interface using 'default' route table, add 'default' route table to IPV6 route lookup.
#### How to verify it
Pass all UT.
Add new UT to cover this change.
Manually verify issue fixed:
### Tested branch (Please provide the tested image version)
- [x] master-17281.417570-2133d58fa
#### Description for the changelog
Fix can't access IPV6 address via management interface because 'default' route table does not add to route lookup issue.
This commit adds support for pensando asic called ELBA. ELBA is used in pci based cards and in smartswitches.
#### Why I did it
This commit introduces pensando platform which is based on ELBA ASIC.
##### Work item tracking
- Microsoft ADO **(number only)**:
#### How I did it
Created platform/pensando folder and created makefiles specific to pensando.
This mainly creates pensando docker (which OEM's need to download before building an image) which has all the userspace to initialize and use the DPU (ELBA ASIC).
Output of the build process creates two images which can be used from ONIE and goldfw.
Recommendation is use to use ONIE.
#### How to verify it
Load the SONiC image via ONIE or goldfw and make sure the interfaces are UP.
##### Description for the changelog
Add pensando platform support.
- Why I did it
Mellanox MSN2410 platforms have a non-functional error log: "ERR pmon#sensord: Error getting sensor data: dps460/#10: Can't read". This error is because of a firmware issue with some PSU, we are not able to upgrade the FW online. Since there is no functional impact, this error log can be ignored safely
- How I did it
Add a new rsyslog rule to the rsyslog-container.conf.j2, if the docker name is pmon and the platform name matches, the new rule will be inserted into the docker rsyslogd.conf
- How to verify it
run regression on the MSN2410 platform to make the error log will not be printed to the syslog.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
How I did it
Modified platform driver to comply with bookworm kernel.
Modified python build commands for building whl packages.
How to verify it
Verify whether all the platform bookworm debs are built.
make target/debs/bookworm/platform-modules-v682-48y8c-d_1.0_amd64.deb
Load the platform debian into the device and install it in bookworm image.
Verify the platform related CLI and the functionality
Signed-off-by: centecqianj <qianj@centec.com>
[arp_update]: Flush MAC mismatch neighbors
- Check for MAC mismatch between neighbor entries in the kernel and APPL_DB
- Flush any entries with a mismatch
Change DHCP rate limit in SONiC copp configuration to 100 PPS as this is
necessary to ensure that DHCP flood does not cause LACP/BGP flaps in all
scenarios
This is an extension to the change in image_config: copp: Enable rate limiting
for bgp, lacp, dhcp, lldp, macsec and udld #14859 and sonic-mgmt change in
[tests/copp]: Update copp mgmt tests to support new rate-limits sonic-mgmt#8199
Why I did it
300 PPS is not sufficient to prevent LACP/BGP flaps in all cases. 100 PPS seems to
provide better resiliency against DHCP traffic flood to CPU.
Microsoft ADO 25776614:
Send DHCP broadcast packets to DUT and verify that they are trapped to CPU at 100 PPS.
Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
Debian changed the defaults of the sudo package to never lecture the
user when using an unauthorized sudo command, which breaks our use case
of lecturing once. Add a line to lecture once, which is the old
defaults.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
systemd changed the log message syntax for a container going down.
Update the regex for the new format.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
pam-auth-update doesn't store local configuration, and it's meant to be
used by packages only. Because libpam-systemd was getting uninstalled
afterwards, this caused tacplus to get re-enabled.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Starting with Bookworm, Debian moved the non-free Linux firmware blobs
into a new non-free-firmware component, since they are frequently needed
by users and since they need to be updated frequently. Since the only
thing we currently install from the non-free component (that I can think
of) is the Linux firmware, have Bookworm use non-free-firmware instead
of non-free.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Notable changes:
* Use j2cli from Debian repos instead of pip
* Use setuptools from Debian repos instead of pip
* Use wheel from Debian repos instead of pip
* Update grpcio and grpcio-tools python packages to match version in
Bookworm
* Use m2crypto from Debian repos instead of pip
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>