1299 Commits

Author SHA1 Message Date
rajib-dutta1
4753953ed0
Ipmitool bookworm: Fix and patch enterprise-numbers URL (#17878)
### Why I did it

The ipmitool utility is used to access various HW sensors. Some platforms use "ipmitool raw" to read specific addresses.

ipmitool_1.8.19-4_amd64.deb, which is part of Bookworm, has a defect: the package is missing the enterprise.txt file that the "raw read" code path expects.
This happens because, at build time, the .deb tries to download the file without the extension under which it is available on the remote server: https://www.iana.org/assignments/enterprise-numbers.txt
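The IANA registry that ipmitool fails to open is a plain-text file that maps Private Enterprise Numbers (PENs), such as manufacturer IDs, to organization names. The parser below is a hypothetical Python sketch. `parse_pen_registry` is not part of ipmitool, which is written in C. It assumes the file's current layout: a decimal number at column 0, followed by the organization name indented by two spaces.

```python
from __future__ import annotations


def parse_pen_registry(text: str) -> dict[int, str]:
    """Map each Private Enterprise Number to its organization name."""
    pens: dict[int, str] = {}
    current: int | None = None
    for line in text.splitlines():
        if line.isdigit():
            # A bare decimal number at column 0 starts a new record.
            current = int(line)
        elif current is not None and line.startswith("  ") and not line.startswith("   "):
            # The first line indented by exactly two spaces is the organization.
            pens[current] = line.strip()
            current = None
    return pens
```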

### How I did it

The defect has been fixed by code changes in the next unstable version of Debian, and the fix is expected to be available in a future stable release. Hence, to keep the changes minimal, the .dsc file is downloaded and only the Makefile is modified to download the correct file. The necessary changes were made to apply this as a patch.

#### How to verify it
The build log is attached; installation of the file is noted at line #2274.
With vanilla Bookworm on platforms such as the 5212 or 5224:
-------------------------------------------------------------------
root@sonic:~# ipmitool raw 0x04 0x2d 0x31
IANA PEN registry open failed: No such file or directory
00 c0 01 80

With the fix, the above error no longer appears:
--------------------------------------------------
root@sonic:/home/admin# ipmitool raw 0x04 0x2d 0x31
 00 c0 00 80

### Description for the changelog

This change addresses the ipmitool raw read issue. This patch must be removed once the fix is available in the next stable Debian release.

1edb0e27e4
2024-02-26 17:49:06 -08:00
Prince George
0564ce48c9
[baseimage]: Update smartmontool version >= v7.4 (#17635)
Why I did it
Update the smartmontools version to 7.4. This prevents the smartmontools service from exiting with a non-zero exit status on platforms that do not have an SSD/disk to be monitored.

Until Debian Bullseye (which had smartmontools 7.2), Debian had a patch applied that changed the default quit mode to never exit. A bug report was filed on Debian, saying that the source code patch isn't needed and could just be done via command line options, and also that smartmontools 7.3 has a new built-in option to exit with 0 if there are no monitorable devices found (which prevents systemd from treating it as a service failure). Because of that, Debian Bookworm (which also upgraded to 7.3) removed the patch and restored the default behavior of exiting with exit code 17 if there are no devices found.

Smartmontools v7.3 has an issue that causes smartd to exit with a non-zero exit status even with the "-q" option.

How I did it
Updated smartmontools to version 7.4, which includes the fix for exiting gracefully when no device to monitor is found.
Added the smartd option "-q nodev0" so that smartd exits with status 0 if no device to monitor is found.
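The quit-mode behavior described above fits in a small lookup. This is a simplified Python model of documented smartd behavior, not smartd code. It covers only three of smartd's `-q` modes. As the description notes, 7.3 did not honor this reliably, which is why 7.4 is required.

```python
from __future__ import annotations

# Exit status that smartd(8) documents for "no devices found to monitor".
EXIT_NODEV = 17


def smartd_exit_on_no_devices(quit_mode: str = "nodev") -> int | None:
    """Simplified model of smartd when it finds nothing to monitor.

    Returns the expected exit status, or None if smartd keeps running.
    """
    if quit_mode == "nodev":    # upstream default quit mode
        return EXIT_NODEV       # non-zero, so systemd marks the unit failed
    if quit_mode == "nodev0":   # "-q nodev0": same condition, but exit status 0
        return 0
    if quit_mode == "never":    # the default the old Debian patch imposed
        return None
    raise ValueError(f"quit mode not covered by this sketch: {quit_mode!r}")
```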
2024-02-12 09:37:12 -08:00
Stepan Blyshchak
cac73d80ca
[bootchart] enable command line recording (#17778)
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2024-02-12 08:36:44 -08:00
Zain Budhwani
c8439cdd4b
Disable eventd and rsyslog plugin in slim images (#17905)
### Why I did it

Disable eventd at build time for slim images

##### Work item tracking
- Microsoft ADO: 26386286

#### How I did it

Add flags for disabling eventd, and copy the rsyslog conf files only when eventd is included and the image is not slim.

#### How to verify it

Manual testing
2024-01-30 22:14:23 -08:00
Kevin Wang
5516381d7e
[qos] change the template keyword from Compute-AI to ComputeAI (#17902)
Why I did it
Align the keywords so that the QoS configuration takes effect.

How I did it
Change the keyword to ComputeAI

How to verify it
Reload minigraph and check the QoS configuration.
2024-01-29 10:10:54 +08:00
ganglv
c798ea8e08
Change tcp port range to support telemetry and gnmi (#17907)
* Reserve tcp port for telemetry and gnmi

* Use ip_local_port_range instead

* Fix sysctl config
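Moving `net.ipv4.ip_local_port_range` works because the kernel assigns automatic (ephemeral) source ports only from that range. A service port outside the range therefore cannot be taken by an outgoing connection. The Python check below is a minimal sketch that assumes the `low<TAB>high` format of the proc file. The commit does not state the new range or the telemetry/gNMI port numbers, so port 50051 and both ranges in the example are hypothetical.

```python
from __future__ import annotations


def parse_port_range(text: str) -> tuple[int, int]:
    """Parse /proc/sys/net/ipv4/ip_local_port_range content, e.g. "32768\t60999"."""
    low, high = (int(field) for field in text.split())
    if low > high:
        raise ValueError(f"invalid range: {low} > {high}")
    return low, high


def conflicts_with_ephemeral(port: int, range_text: str) -> bool:
    """True if the kernel may hand out `port` as an ephemeral source port."""
    low, high = parse_port_range(range_text)
    return low <= port <= high
```

For example, `conflicts_with_ephemeral(50051, "32768\t60999")` is true, while a range that ends below the service port avoids the conflict.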
2024-01-26 09:31:09 -08:00
Hua Liu
bdb24676eb
Change orchagent stuck message from ERR to WARNING (#17872)
#### Why I did it
During switch initialization, orchagent is sometimes busy for more than 40 seconds, which triggers the process-stuck watchdog error.
To address this, the watchdog error message is changed to a warning message.

##### Work item tracking
- Microsoft ADO: 26517622

#### How I did it
Change orchagent stuck message from ERR to WARNING.

#### How to verify it
Pass all UT.

### Description for the changelog
Change orchagent stuck message from ERR to WARNING.
2024-01-26 00:01:50 -08:00
Zain Budhwani
b557488608
Remove echo log to /tmp/{$SERVICE}-debug.log in service_mgmt.sh (#17838)
### Why I did it

It is unnecessary to write logs to /tmp/${SERVICE}-debug.log because they are already written to syslog. Writing to this extra log is therefore removed, out of concern for memory space and because some services could not start up in RO state.

##### Work item tracking
- Microsoft ADO: 26458976

#### How I did it

Remove the DEBUGLOG definition and the line that echoes the message to the mentioned log file.

#### How to verify it

Manually verified that the /tmp/${SERVICE}-debug.log files do not exist and that the service-start log still appears in syslog.
2024-01-25 17:14:21 -08:00
mssonicbld
1fb9732f41 [ci/build]: Upgrade SONiC package versions 2024-01-25 14:35:40 +08:00
Oleksandr Ivantsiv
c693e75f0f
[dns] Do not apply dynamic DNS configuration when MGMT interface has static IP address. (#17769)
### Why I did it
Fix the issue detected by the [TestStaticMgmtPortIP::test_dynamic_dns_not_working_when_static_ip_configured](https://github.com/sonic-net/sonic-mgmt/blob/master/tests/dns/static_dns/test_static_dns.py#L105C9-L105C63) test.

### How I did it
Query MGMT interface configuration. Do not apply dynamic DNS configuration when MGMT interface has static IP address.
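Here is a minimal Python sketch of that decision. It is not the code from this PR. It assumes that static management addresses appear as CONFIG_DB `MGMT_INTERFACE` keys of the form `<ifname>|<ip>/<prefix>`, and both function names are hypothetical.

```python
import ipaddress
from typing import Iterable


def has_static_mgmt_ip(mgmt_interface_keys: Iterable[str]) -> bool:
    """True if any MGMT_INTERFACE key carries an address, e.g. "eth0|10.3.147.17/24"."""
    for key in mgmt_interface_keys:
        _ifname, sep, addr = key.partition("|")
        if not sep:
            continue
        try:
            ipaddress.ip_interface(addr)  # accepts IPv4 or IPv6 "addr/prefix"
        except ValueError:
            continue
        return True
    return False


def should_apply_dynamic_dns(mgmt_interface_keys: Iterable[str]) -> bool:
    # Dynamic (DHCP-provided) DNS is applied only when no static address exists.
    return not has_static_mgmt_ip(mgmt_interface_keys)
```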

#### How to verify it
Run `tests/dns/static_dns/test_static_dns.py` sonic-mgmt tests.
2024-01-23 16:29:55 -08:00
Hua Liu
c274be2e59
Fix IPV6 forced-mgmt-route not work issue (#17299)
Why I did it
IPV6 forced-mgmt-route does not work.

When adding an IPV6 route, the 'ip -6 rule add pref 32764 address' command should be used. However, the '-6' parameter is currently missing from the template, so the IPV6 route is added to the IPV4 route table.
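The fix amounts to choosing the `ip` address-family flag from the route itself. The Python sketch below illustrates that choice. The real template is Jinja2, and `forced_mgmt_rule_cmd` is a hypothetical name. The `to <prefix> table default` rule form is modeled on the IPv4 rule `32764: from all to 172.17.0.1/24 lookup default` from the output in #17281; it is not copied from the template.

```python
import ipaddress


def forced_mgmt_rule_cmd(route: str, pref: int = 32764, table: str = "default") -> str:
    """Build an `ip rule` command whose address family matches the route."""
    # strict=False tolerates host bits, e.g. "172.17.0.1/24".
    family = "-6" if ipaddress.ip_network(route, strict=False).version == 6 else "-4"
    return f"ip {family} rule add pref {pref} to {route} table {table}"
```

Deriving the flag from the parsed route means an IPv6 forced-mgmt-route can never fall through to the IPv4 rule table.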

This PR also depends on #17281, which fixes the IPV6 'default' route table missing from the IPV6 route lookup.

Microsoft ADO: 24719238
2024-01-22 09:59:12 -08:00
Nazarii Hnydyn
e173987a56
[swss/syncd]: Remove dependency on interfaces-config.service (#17739)
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
Co-authored-by: Stepan Blyshchak <38952541+stepanblyschak@users.noreply.github.com>
2024-01-18 08:04:00 -08:00
Liping Xu
d6e0bf66a6
disable restapi for leafRouter in slim image (#17713)
Why I did it
On some devices with small memory, the available memory is insufficient after upgrading to the latest image.

Work item tracking
Microsoft ADO: 26324242
How I did it
Disable the restapi feature for LeafRouter devices running the slim image.

How to verify it
verified on 7050qx T1 (slim image), restapi disabled
verified on 7050qx T0 (slim image), restapi enabled
verified on 7260 T1 (normal image), restapi enabled
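The three verification cases above form a small truth table. The hypothetical Python predicate below reproduces it. It assumes that T1 and T0 correspond to the DEVICE_METADATA types `LeafRouter` and `ToRRouter`.

```python
def restapi_enabled(device_type: str, slim_image: bool) -> bool:
    """restapi is disabled only for LeafRouter (T1) devices running the slim image."""
    return not (slim_image and device_type == "LeafRouter")
```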
2024-01-12 15:26:06 +08:00
Lawrence Lee
eb70bff4b7
add timeout to ping6 command (#17729)
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2024-01-10 14:40:15 -08:00
prabhataravind
c20abb9e28
[docker_image_ctl.j2]: swss docker initialization improvements (#17628)
* [docker_image_ctl.j2]: swss docker initialization improvements

This commit attempts to address the following:
 * Make sure the swss container is indeed up and running before running any commands
   on it. If the swss container is not fully up when swss.sh attempts to
   create the swss:/ready file using "docker exec swss$DEV touch", the command can
   fail silently and cause swssconfig to wait forever, leading to missing IP
   decap configuration among other things. Add a wait so that docker commands
   are run only after the swss container status is "Running"
 * Add a log when the swss:/ready file is created, or if the file creation fails, so
   that it becomes easier to debug such scenarios in the future
* [docker_image_ctl.j2]: Use swss$DEV to accommodate multi ASIC platforms as well
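The wait described above is a bounded polling loop. The Python sketch below shows the idea; it is not the shell code in docker_image_ctl.j2. `get_status` is a hypothetical callable. In practice it could wrap `docker inspect --format '{{.State.Status}}' swss$DEV`. The `sleep` and `clock` parameters exist only so the loop can be tested without waiting.

```python
import time
from typing import Callable


def wait_until_running(
    get_status: Callable[[], str],
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_status() until it reports "running"; return False on timeout."""
    deadline = clock() + timeout
    while True:
        if get_status().strip().lower() == "running":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Only after this returns True would the caller run the `docker exec ... touch` step and log whether it succeeded.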

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
2024-01-03 17:44:22 -08:00
bingwang-ms
977e73d370
Update backend_acl.py to specify ACL table name (#17553) 2024-01-03 14:55:38 -08:00
prabhataravind
038ca267c8
[image_config]: Update DHCP rate-limit for mgmt TOR devices (#17630)
* [image_config]: Update DHCP rate-limit for mgmt TOR devices

    Change the DHCP rate limit (queue4_group3) in SONiC copp configuration to 300 PPS
    for mgmt TORs while keeping the rate limit at 100 PPS for other topologies.

    Why I did it:
    Some mgmt TORs based on Marvell ASICs do not support a 100 PPS CIR, which led
    to these devices silently dropping DHCP packets.

    Microsoft ADO: **25820076**

    How to verify it:
    Send DHCP broadcast packets to an M0 DUT and verify that they are trapped to
    CPU at 300 PPS. On non-mgmt devices, the packets should be trapped at CIR of
    100 PPS. Also ran sonic-mgmt dhcp_relay test and confirmed that it passes.

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
2024-01-02 21:29:34 -08:00
Junchao-Mellanox
f3f2972512
Optimize syslog rate limit feature for fast and warm boot (#17458)
- Why I did it
Optimize syslog rate limit feature for fast and warm boot

- How I did it
Optimize redis start time
Don't render rsyslog.conf in container startup script
Disable containercfgd by default. There is a new CLI to enable it (in another PR)

- How to verify it
Manual test
Regression test
2023-12-20 09:12:03 +02:00
Prince George
30ff77350f
Fix the fsck script that does filesystem repair (#17424)
Fix the fsck check, which was not working. Potentially fixes #16938.
Modified the fsck script to run fsck.ext4 on the disk where SONiC resides.

Microsoft ADO: 26098631
2023-12-19 17:51:49 -08:00
Junhua Zhai
53be9de743
Fix syncd_request_shutdown coredump in config reload on KVM sonic (#17486)
The issue is related to #16812: the syncd process does not run in the gbsyncd container on KVM SONiC with the default HwSKU.

Microsoft ADO: 26151608

How I did it
If syncd is not running in the gbsyncd container, there is no need to trigger a graceful shutdown of syncd.

How to verify it
No syncd_request_shutdown coredump occurs during config reload on KVM SONiC.
2023-12-13 17:37:44 -08:00
Yevhen Fastiuk
5efb123ede
[NTP] Add NTP extended configuration (#15058)
hld [#1296](https://github.com/sonic-net/SONiC/pull/1296)
closes [#1254](https://github.com/sonic-net/SONiC/issues/1254)
depends-on [#60](https://github.com/sonic-net/sonic-host-services/pull/60), [#781](https://github.com/sonic-net/sonic-swss-common/pull/781), [#2835](https://github.com/sonic-net/sonic-utilities/pull/2835), [#10749](https://github.com/sonic-net/sonic-mgmt/pull/10749)

#### Why I did it
To cover the following action items:
* Configure NTP global parameters
* Add/remove new NTP servers
* Change the configuration for NTP servers
* Show NTP status
* Show NTP configuration

### How I did it
* Add YANG model for a new configuration
* Extend configuration templates to support new knobs

### Description for the changelog
* Add the ability to configure NTP global parameters such as authentication, DHCP, and admin state
* Change the configuration for NTP servers
* Add the ability to show the NTP configuration

#### Link to config_db schema for YANG module changes
[NTP configuration](https://github.com/sonic-net/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md#ntp-and-syslog-servers)
2023-12-11 13:31:35 -08:00
Stepan Blyshchak
b61528bee9
Revert "[swss/syncd] remove dependency on interfaces-config.service (#13084) (#14341)" (#15094) (#17367)
This reverts commit 499f57a7f7.

Co-authored-by: Nazarii Hnydyn <nazariig@nvidia.com>
2023-12-07 15:20:39 -08:00
Ying Xie
2e072beb41
Revert "[pmon] update gRPC version to 1.57.0 (#16257)" (#17401)
This reverts commit 45a852233b.
2023-12-07 11:01:47 -08:00
centecqianj
8ec4b53451
[Bookworm] Upgrade centec-arm64 platform to Bookworm. (#17411)
Why I did it
1. Upgrade the centec-arm64 platform to Bookworm.
2. Fix the error when compiling docker-syncd-centec-rpc.gz on the centec platform.

How I did it
1. Modified platform driver to comply with bookworm kernel.
2. Upgrade SONiC package versions of the centec platform.

How to verify it
1. Compile the centec-arm64 platform to generate sonic-centec-arm64.bin.
2. Compile the centec platform to generate docker-syncd-centec-rpc.gz.

Signed-off-by: centecqianj <qianj@centec.com>
2023-12-07 08:42:13 -08:00
Stepan Blyshchak
9555883e6f
[config-chassisdb] use cached variables (#17342)
- Why I did it
Improve boot performance, mostly needed for fast and warm boot.

- How I did it
Use cached variables.

- How to verify it
Boot the system, run "systemd-analyze blame", and look at the service start time.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2023-12-07 15:24:21 +02:00
Stepan Blyshchak
6435df1056
[config-topology] use cached variables (#17343)
- Why I did it
Improve boot performance, mostly needed for fast and warm boot.

- How I did it
Use cached variables.

- How to verify it
Boot the system, run "systemd-analyze blame", and look at the service start time.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2023-12-07 15:22:44 +02:00
Aaron Payment
0ecee5df05
[gbsyncd]: Set SYSLOG_CONFIG_FEATURE for gbsyncd (#17325)
Why I did it
SONiC Mgmt test syslog/test_syslog_rate_limit.py syslog.test_syslog_rate_limit test_syslog_rate_limit was failing on SKUs with gbsyncd. This includes Arista 720DT when testing on the 202305 branch.

How I did it
The issue was that "show syslog rate-limit-container" showed no value for gbsyncd,
because gbsyncd had no SYSLOG_CONFIG_FEATURE|gbsyncd entry in
config_db, which in turn was because the gbsyncd feature was not enabled
through init_cfg.json.j2.

How to verify it
Test is now passing on 720DT in 202305 branch.

Co-authored-by: Boyang Yu <byu@arista.com>
2023-12-06 22:04:21 -08:00
Junhua Zhai
048f2a7c39
[gbsyncd] Graceful shutdown of syncd process in container gbsyncd (#16812)
Fixes #16608. syncd and gbsyncd need to be gracefully shut down individually.
2023-12-06 21:43:13 -08:00
Hua Liu
164916681a
Fix can't access IPV6 address via management interface because 'default' route table does not add to route lookup issue. (#17281)
#### Why I did it
When the device is configured with an IPV6 TACACS server address and all BGP sessions are shut down, the device cannot connect to the TACACS server via the management interface.

After investigation, I found that the IPV6 'default' route table is not added to the route lookup:

admin@vlab-01:~$ ip -6 rule list
1001:   from all lookup local
32765:  from fec0::ffff:afa:1 lookup default
32766:  from all lookup main
admin@vlab-01:~$

For comparison:
admin@vlab-01:~$ ip -4 rule list
1001:   from all lookup local
32764:  from all to 172.17.0.1/24 lookup default
32765:  from 10.250.0.101 lookup default
32766:  from all lookup main
32767:  from all lookup default <== 'default' route table exists in IPV4 route lookup

The issue is fixed by adding the 'default' route table to the route lookup with the following command:
admin@vlab-01:~$ sudo ip -6 rule add pref 32767 lookup default
admin@vlab-01:~$ ip -6 rule list
1001:   from all lookup local
32765:  from fec0::ffff:afa:1 lookup default
32766:  from all lookup main
32767:  from all lookup default <== 'default' route table has been added to IPV6 route lookup
admin@vlab-01:~$
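The manual check above can be sketched in Python. The helper scans `ip -6 rule list` output for a catch-all `from all lookup default` rule and, if the rule is missing, returns the command used above. It assumes the `<pref>:<whitespace><selector>` line format shown in the transcript. `default_lookup_fix` is a hypothetical helper, not part of the PR.

```python
from __future__ import annotations

import re

# Matches a catch-all rule such as "32767:\tfrom all lookup default".
CATCH_ALL_DEFAULT = re.compile(r"^\d+:\s+from all lookup default$")


def default_lookup_fix(rule_list_output: str, pref: int = 32767) -> str | None:
    """Return the command that adds the missing rule, or None if it already exists."""
    for line in rule_list_output.splitlines():
        if CATCH_ALL_DEFAULT.match(line.strip()):
            return None
    return f"sudo ip -6 rule add pref {pref} lookup default"
```

Note that a source-specific rule such as `from fec0::ffff:afa:1 lookup default` does not count, which matches the situation in the first listing.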

##### Work item tracking
- Microsoft ADO: 25798732

#### How I did it
When the management interface uses the 'default' route table, add the 'default' route table to the IPV6 route lookup.

#### How to verify it
Pass all UT.
Add new UT to cover this change.
Manually verify issue fixed:

### Tested branch (Please provide the tested image version)

- [x]  master-17281.417570-2133d58fa

#### Description for the changelog
Fix the inability to access IPV6 addresses via the management interface, caused by the 'default' route table not being added to the route lookup.
2023-12-05 11:51:56 -08:00
Ashwin Hiranniah
ada7c6a72e
Add pensando platform (#15978)
This commit adds support for the Pensando ASIC called ELBA, which is used in PCI-based cards and in smart switches.

#### Why I did it
This commit introduces pensando platform which is based on ELBA ASIC.

#### How I did it
Created platform/pensando folder and created makefiles specific to pensando.
This mainly creates the pensando docker (which OEMs need to download before building an image), which contains all the userspace needed to initialize and use the DPU (ELBA ASIC).
The build process outputs two images, which can be used from ONIE and goldfw.
The recommendation is to use ONIE.
#### How to verify it
Load the SONiC image via ONIE or goldfw and make sure the interfaces are UP.

##### Description for the changelog
Add pensando platform support.
2023-12-04 14:41:52 -08:00
Kebo Liu
4c699050e8
[Mellanox] Add special rsyslog filter for MSN2410 platform (#17365)
- Why I did it
Mellanox MSN2410 platforms have a non-functional error log: "ERR pmon#sensord: Error getting sensor data: dps460/#10: Can't read". This error is caused by a firmware issue with some PSUs, and the FW cannot be upgraded online. Since there is no functional impact, this error log can safely be ignored.

- How I did it
Add a new rsyslog rule to rsyslog-container.conf.j2: if the docker name is pmon and the platform name matches, the new rule is inserted into the docker's rsyslogd.conf.

- How to verify it
Run regression on the MSN2410 platform to make sure the error log is not printed to the syslog.
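The filter's condition (container, platform, and message must all match) can be modeled in Python. The sketch only illustrates the matching logic of the rsyslog rule; it is not rsyslog configuration. The platform identifiers are assumptions. Widening `#10` to any sensor index (`#\d+`) is a choice made for this sketch.

```python
import re

# Hypothetical platform strings for the MSN2410 and MSN2700 switches.
FILTERED_PLATFORMS = {"x86_64-mlnx_msn2410-r0", "x86_64-mlnx_msn2700-r0"}
PSU_READ_ERROR = re.compile(r"Error getting sensor data: dps460/#\d+: Can't read")


def should_drop(container: str, platform: str, message: str) -> bool:
    """Drop the known-harmless PSU read error only from pmon on the listed platforms."""
    return (
        container == "pmon"
        and platform in FILTERED_PLATFORMS
        and PSU_READ_ERROR.search(message) is not None
    )
```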

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2023-12-03 15:32:56 +02:00
centecqianj
8db3a99d11
[Bookworm] Upgrade centec platforms to Bookworm (#17364)
How I did it
Modified platform driver to comply with bookworm kernel.
Modified python build commands for building whl packages.

How to verify it
Verify whether all the platform bookworm debs are built.
make target/debs/bookworm/platform-modules-v682-48y8c-d_1.0_amd64.deb
Load the platform debian into the device and install it in bookworm image.
Verify the platform related CLI and the functionality

Signed-off-by: centecqianj <qianj@centec.com>
2023-12-01 16:07:52 -08:00
Lawrence Lee
572af1dcdf
[arp_update]: Flush neighbors with incorrect MAC info (#17238)
[arp_update]: Flush MAC mismatch neighbors

- Check for MAC mismatch between neighbor entries in the kernel and APPL_DB
- Flush any entries with a mismatch
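The mismatch check in the two steps above can be sketched in Python as a comparison of two tables keyed by (interface, IP). This is not the arp_update script. Both tables are assumed to be already loaded into dictionaries. Neighbors missing from APPL_DB are left alone here because the commit does not say how they are handled.

```python
from typing import Dict, List, Tuple

NeighborKey = Tuple[str, str]  # (interface, IP address)


def neighbors_to_flush(
    kernel: Dict[NeighborKey, str], appl_db: Dict[NeighborKey, str]
) -> List[NeighborKey]:
    """Return neighbors whose kernel MAC differs from the MAC recorded in APPL_DB."""
    mismatched = [
        key
        for key, kernel_mac in kernel.items()
        # Compare case-insensitively; only keys present in both tables are checked.
        if key in appl_db and appl_db[key].lower() != kernel_mac.lower()
    ]
    return sorted(mismatched)
```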
2023-11-30 14:23:05 -08:00
Xincun Li
f13081bfbd
Ensure that 'logrotate-config.service' is set as a dependency to start before 'logrotate.service'. (#17312)
2023-11-29 17:22:47 -08:00
Vivek
4727185648
[lldp] Clean up service start logic owing to port init start optimization (#17268)
Signed-off-by: Vivek Reddy <vkarri@nvidia.com>
2023-11-27 09:56:54 -08:00
prabhataravind
aea3c42f29
[image_config]: Update DHCP rate-limit (#17132)
Change DHCP rate limit in SONiC copp configuration to 100 PPS as this is
necessary to ensure that DHCP flood does not cause LACP/BGP flaps in all
scenarios

This is an extension to the change in image_config: copp: Enable rate limiting 
for bgp, lacp, dhcp, lldp, macsec and udld #14859 and sonic-mgmt change in 
[tests/copp]: Update copp mgmt tests to support new rate-limits sonic-mgmt#8199

Why I did it
300 PPS is not sufficient to prevent LACP/BGP flaps in all cases. 100 PPS seems to
provide better resiliency against DHCP traffic flood to CPU.

Microsoft ADO: 25776614

Send DHCP broadcast packets to DUT and verify that they are trapped to CPU at 100 PPS.

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
2023-11-22 15:02:17 -08:00
mssonicbld
52e304afcf [ci/build]: Upgrade SONiC package versions (#17035) 2023-11-21 18:53:15 -08:00
Saikrishna Arcot
318f3945be Modify the sudoers file to lecture RO users once
Debian changed the defaults of the sudo package to never lecture the
user when using an unauthorized sudo command, which breaks our use case
of lecturing once. Add a line to lecture once, which was the old
default.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-11-21 18:53:15 -08:00
Saikrishna Arcot
862bd794ee Fix container down event not sending out a notification
systemd changed the log message syntax for a container going down.
Update the regex for the new format.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-11-21 18:53:15 -08:00
Saikrishna Arcot
cae42998dd Fix PAM module configuration issue
pam-auth-update doesn't store local configuration, and it's meant to be
used by packages only. Because libpam-systemd was getting uninstalled
afterwards, this caused tacplus to get re-enabled.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-11-21 18:53:15 -08:00
Saikrishna Arcot
73605a98ef Modify rasdaemon service on amd64 only
Rasdaemon is not installed on armhf or arm64

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-11-21 18:53:15 -08:00
Saikrishna Arcot
0664c791ef For Bookworm, use non-free-firmware instead of non-free
Starting with Bookworm, Debian moved the non-free Linux firmware blobs
into a new non-free-firmware component, since they are frequently needed
by users and since they need to be updated frequently. Since the only
thing we currently install from the non-free component (that I can think
of) is the Linux firmware, have Bookworm use non-free-firmware instead
of non-free.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-11-21 18:53:15 -08:00
Saikrishna Arcot
ed5176107b Update Debian build script for Bookworm
Notable changes:
* Use j2cli from Debian repos instead of pip
* Use setuptools from Debian repos instead of pip
* Use wheel from Debian repos instead of pip
* Update grpcio and grpcio-tools python packages to match version in
  Bookworm
* Use m2crypto from Debian repos instead of pip

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-11-21 18:53:15 -08:00
Saikrishna Arcot
34a1ac1a0f Migrate from ntp to ntpsec
Debian Bookworm no longer uses NTP, and instead uses NTPsec. Modify our
files to update/replace the NTPsec files instead.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-11-21 18:53:15 -08:00
abdosi
4a7aa2634f
[chassis] Support advertisement of Loopback0 of all LC's across all e-BGP peers in TSA mode (#16714)
What I did:
In chassis TSA mode, the Loopback0 IPs of each LC should be advertised through the e-BGP peers of each remote LC.

How I did:

- Route-map policy to advertise its own Loopback IP to other internal iBGP peers with the community internal_community, as defined in constants.yml.
- Route-map policy to match on the above internal_community when a route is received from internal iBGP peers, set an internal tag as defined in constants.yml, and delete the internal_community so that the route is not sent to any e-BGP peers.
- In TSA, a new route-map matches on the above internal tag, permits the route (the Loopback0 IPs of remote LCs), and sets the community to traffic_shift_community.
- In TSB, delete the above new route-map.

How I verify:

Manual Verification

UT updated.
sonic-mgmt PR: sonic-net/sonic-mgmt#10239


Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2023-11-20 09:42:02 -08:00
Ze Gan
9f08f88a0d
[dpu]: Add DPU database service (#17161)
Sub PRs:

sonic-net/sonic-host-services#84
#17191

Why I did it
According to the design, the DPU database instances will be kept on the NPU host.

Microsoft ADO (number only): 25072889

How I did it
To follow the multi-ASIC design, I assume a new platform environment variable NUM_DPU will be defined in /usr/share/sonic/device/$PLATFORM/platform_env.conf. Based on this number, the NPU host will launch the corresponding number of DPU database instances.
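Reading NUM_DPU can be sketched in Python. The sketch assumes that platform_env.conf holds `KEY=VALUE` lines with optional `#` comments. `read_num_dpu` is a hypothetical helper, and the other keys in the example are placeholders.

```python
def read_num_dpu(conf_text: str) -> int:
    """Return NUM_DPU from platform_env.conf-style text, or 0 if it is not set."""
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "NUM_DPU":
            return int(value.strip())
    return 0
```

`range(read_num_dpu(text))` would then give one index per DPU database instance to launch. Returning 0 when the key is absent means non-DPU platforms launch none.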

Signed-off-by: Ze Gan <ganze718@gmail.com>
2023-11-17 09:10:03 -08:00
ganglv
c71fb3a30f
Share image for gnmi and telemetry (#16863)
Why I did it
Share docker image to support gnmi container and telemetry container

Work item tracking
Microsoft ADO: 25423918
How I did it
Create telemetry image from gnmi docker image.
Enable gnmi container and disable telemetry container by default.

How to verify it
Run end-to-end tests.
2023-11-08 08:54:36 +08:00
prabhataravind
7e49530459
[copp]: Enable rate limiting for bgp, lacp, dhcp, lldp, macsec and udld (#14859)
Why I did it
It was observed that a flood of DHCP packets without rate-limiting can cause BGP flaps or lacp keepalive losses.
This change attempts to prevent or reduce such BGP flaps by enabling appropriate rate-limiting in SONiC for all traffic types.

Work item tracking
Microsoft ADO: 17964421

How I did it
Set a reasonable CIR/CBS value of 300 for queue4_group3 (dhcp, lldp, macsec) and 6000 for queue4_group1.
The value 300 was arrived at after testing with dhcp flooding using ptf (using multiple threads). Throttling at this rate was necessary to ensure that dhcp flooding does not cause BGP flaps.

How to verify it
Verified, with this script running from PTF, that BGP flaps don't happen when CBS/CIR is set at 300 for queue4_group3.

 import threading
 from scapy.all import BOOTP, DHCP, IP, UDP, Ether, RandMAC, sendp
 
 def send_dhcp_discover(intf):
     # Broadcast DHCP Discover packets with a random source MAC (RandMAC).
     # BOOTP must sit between UDP and DHCP for a well-formed DHCP packet.
     dhcp_discover = Ether(dst='ff:ff:ff:ff:ff:ff', src=RandMAC()) \
                         / IP(src='1.1.1.1', dst='255.255.255.255') \
                         / UDP(sport=68, dport=67) \
                         / BOOTP() \
                         / DHCP(options=[('message-type', 'discover'), 'end'])
     sendp(dhcp_discover, count=100000, iface=intf)
 
 
 if __name__ == "__main__":
     # Flood two PTF ports in parallel.
     threads = [threading.Thread(target=send_dhcp_discover, args=(intf,))
                for intf in ("eth1", "eth2")]
     for t in threads:
         t.start()
     for t in threads:
         t.join()

Verified on Arista-7260CX3-D108C8 running 202012 that the copp rules for queue4_group1 and queue4_group3 do NOT affect BGP packets. To verify this using PTF, the copp rules were modified to set the "CBS" and "CIR" for queue4_group1 and queue4_group3 at 600 pps, and 50k packets each of "BGP open" and "DHCP Discover" were sent simultaneously from the same PTF port to the DUT. "show c cpu" confirmed that packets hit the CPU queue at 1200 pps (double the configured CIR/CBS for these packet types). This showed that the throttling rate is per trap (or packet type) and not per queue.
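The per-trap conclusion can be illustrated with a simplified single-rate token bucket model in Python. This is not the SAI or ASIC policer implementation. Treating CBS as a burst size in packets and using a 5000 pps offered load are assumptions chosen for the sketch. With CIR = CBS = 600, per-trap buckets admit roughly twice as many packets in total as one shared bucket, which matches the 1200 pps observed in the experiment.

```python
from typing import Dict, Iterable


class TokenBucket:
    """Single-rate token bucket: refills at `cir` tokens/s, holds at most `cbs` tokens."""

    def __init__(self, cir: float, cbs: float) -> None:
        self.cir = cir
        self.cbs = cbs
        self.tokens = cbs
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.cbs, self.tokens + (now - self.last) * self.cir)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def simulate(traps: Iterable[str], cir: float, cbs: float, offered_pps: int,
             seconds: int, shared: bool = False) -> Dict[str, int]:
    """Count packets admitted per trap when every trap offers `offered_pps`.

    shared=True polices all traps with one bucket (per-queue model);
    shared=False gives each trap its own bucket (per-trap model).
    """
    traps = list(traps)
    common = TokenBucket(cir, cbs)
    buckets = {t: common if shared else TokenBucket(cir, cbs) for t in traps}
    admitted = {t: 0 for t in traps}
    for i in range(offered_pps * seconds):
        now = i / offered_pps
        for t in traps:
            if buckets[t].allow(now):
                admitted[t] += 1
    return admitted
```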

Verified with updated sonic-mgmt tests ([tests/copp]: Update copp mgmt tests to support new rate-limits sonic-mgmt#8199) on broadcom and mellanox platforms that these traffic types are rate-limited.

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
2023-10-25 10:49:24 -07:00
Kebo Liu
31451295d5
Add special rsyslog filter for MSN2700 platform (#16684)
- Why I did it
Mellanox MSN2700 platforms have a non-functional error log: "ERR pmon#sensord: Error getting sensor data: dps460/#10: Can't read". This error is caused by a firmware issue with some PSUs, and the FW cannot be upgraded online. Since there is no functional impact, this error log can safely be ignored.

- How I did it
Add a new rsyslog rule to rsyslog-container.conf.j2: if the docker name is pmon and the platform name matches, the new rule is inserted into the docker's rsyslogd.conf.

- How to verify it
Run regression on the MSN2700 platform to make sure the error log is not printed to the syslog.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2023-10-24 17:54:44 +03:00
Samuel Angebault
e4a497183a
Add build option to reduce final image size (#16729)
* Reduce SONiC image filesystem size

Add a build option to reduce the image size.
The image reduction process affects the build in 2 ways:
 - change some packages that are installed in the rootfs
 - apply a rootfs reduction script

The script itself will perform a few steps:
 - remove file duplication by leveraging hardlinks
   - under /usr/share/sonic since the symlinks under the device folder are lost during the build.
   - under /var/lib/docker since the files there will only be mounted ro
 - remove some extra files (man, docs, licenses, ...)
 - some image specific space reduction (only for aboot images currently)

The script can be improved later, but for now it reduces the rootfs
size by ~30%.
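The hardlink deduplication step can be sketched in Python. This is an illustration, not the script from the PR. It groups regular files by size, mode, owner and SHA-256 digest, then relinks later copies to the first one found. Hardlinks share ownership, permissions and timestamps, so including mode and owner in the key is a precaution this sketch adds. The sketch also assumes the tree lives on a single filesystem, since hardlinks cannot cross filesystems.

```python
import hashlib
import os
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 16) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def hardlink_duplicates(root) -> int:
    """Replace duplicate regular files under root with hardlinks; return how many were relinked."""
    first_seen = {}
    relinked = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = Path(dirpath) / name
            if path.is_symlink() or not path.is_file():
                continue
            st = path.stat()
            key = (st.st_size, st.st_mode, st.st_uid, st.st_gid, file_digest(path))
            original = first_seen.setdefault(key, path)
            if original == path or os.path.samefile(original, path):
                continue  # first copy seen, or already the same inode
            tmp = path.with_name(path.name + ".dedup-tmp")
            os.link(original, tmp)  # create the new link beside the duplicate...
            os.replace(tmp, path)   # ...then atomically swap it into place
            relinked += 1
    return relinked
```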

* restore fully featured vim package
2023-10-24 10:01:58 +08:00