Commit Graph

967 Commits

Author SHA1 Message Date
Arun Saravanan Balachandran
f4b22f67a4
[initramfs]: SSD firmware upgrade in initramfs (#10748)
Why I did it
To upgrade SSD firmware in initramfs while rebooting from SONiC to SONiC and during NOS to SONiC migration.

How I did it
New option 'ssd-upgrader-part’ is introduced in grub command line, to indicate the partition and its filesystem type in which the SSD firmware updater is present. ‘ssd-upgrader-part’ syntax is ssd-upgrader-part=<partition>,<filesystem type>. Example: ssd-upgrader-part=/dev/sda8,ext4

A new initramfs script ‘ssd-upgrade’ is included in init-premount and it invokes the SSD firmware updater (ssd-fw-upgrade) present in the partition indicated by the boot option 'ssd-upgrader-part'

How to verify it
In SONiC, the SSD firmware updater is copied to “/host/” directory.
Fast-reboot is to be initiated with the ‘-u’ option ([scripts/fast-reboot] Add option to include ssd-upgrader-part boot option with SONiC partition sonic-utilities#2150)
After reboot, while booting into SONiC the SSD firmware updater will be executed in initramfs.
2022-05-12 08:11:02 -07:00
Marty Y. Lok
23f9126f59
[VoQ][config] Multiasic Supervisor card fails to load config_db#.json in chassis when system is reboot (#10106)
Supervisor card fails to load config_db#.json in chassis when system reboot. 
This is an intermittent issue, fixes #10105
2022-05-09 11:06:11 -07:00
xumia
15cf9b0d70
Reduce image size for lazy installation packages (#10775)
Why I did it
The image size is too large, when there are multiple lazy packages and multiple platforms. It is not necessary to keep the lazy installation packages in multiple copies.
For cisco image, the image size will reduce from 3.5G to 1.7G.

How I did it
Use symbol links to only keep one package for each of the lazy package.
Make a new folder fsroot/platform/common
Copy the lazy packages into the folder.
When using a package in each of the platform, such as x86_64-grub, x86_64-8800_rp-r0, x86_64-8201_on-r0, etc, only make a symbol link to the package in the common folder.
2022-05-09 08:26:09 -07:00
xumia
8ec8900d31
Support SONiC OpenSSL FIPS 140-3 based on SymCrypt engine (#9573)
Why I did it
Support OpenSSL FIPS 140-3, see design doc: https://github.com/Azure/SONiC/blob/master/doc/fips/SONiC-OpenSSL-FIPS-140-3.md.

How I did it
Install the fips packages.
To build the fips packages, see https://github.com/Azure/sonic-fips
Azure pipelines: https://dev.azure.com/mssonic/build/_build?definitionId=412

How to verify it
Validate the SymCrypt engine:

admin@sonic:~$ dpkg-query -W | grep openssl
openssl 1.1.1k-1+deb11u1+fips
symcrypt-openssl        0.1

admin@sonic:~$ openssl engine -v | grep -i symcrypt
(symcrypt) SCOSSL (SymCrypt engine for OpenSSL)
admin@sonic:~$
2022-05-06 07:21:30 +08:00
Junchao-Mellanox
681c24878b
Fix race condition between networking service and interface-config service (#10573)
Why I did it
The PR is aimed to fix a bug that mgmt port eth0 may loss IP even if user configured static IP of eth0. This is not a always reproduceable issue, the reproducing flow is like:

Systemd starts networking service, which runs a dhcp based configuration and assigned an ip from dhcp.
Systemd starts interface-config service who depends on networking service
Interface-config service runs command “ifdown –force eth0”, check line. but networking service is still running so that this line failed with error: “error: Another instance of this program is already running.”. This error is printed by ifupdown2 lib who is the main process of networking service. So, ifdown actually does not work here, the ip of eth0 is not down.
Interface-config service updates /etc/networking/interface to static configuration.
Interface-config service runs command “systemctl restart networking”. This command kills the previous networking related processes (log: networking.service: Main process exited, code=killed, status=15/TERM), and try to reconfigure the ip address with static configuration. But it detects that the configured IP and the existing IP are the same, and it does not really configure the ip to kernel. Hence, the ip is still getting from dhcp. (this could be a bug of ifupdown2: previous ip is from dhcp, new ip is a static ip, it treats them as same instead of re-configuring the IP)
When the lease of the ip expires, the ip of eth0 is removed by kernel and the issue reproduces.
The issue is not always reproduceable because networking service usually runs fast so that it won't hit step#3.

How I did it
Check networking service state before running "ifdown –force eth0", wait for it done if it is activating.

How to verify it
Manual test.
2022-05-05 15:21:44 -07:00
shlomibitton
4ec3af86af
[Fastboot] Delay PMON service for better fastboot performance (#10567)
- Why I did it
Profiling the system state on init after fast-reboot during create_switch function execution, it is possible to see few python scripts running at the same time.
This parallel execution consume CPU time and the duration of create_switch is longer than it should be.
Following this finding, and the motivation to ensure these services will not interfere in the future, PMON is delayed in 90 seconds until the system finish the init flow after fastboot.

- How I did it
Add a timer for PMON service.
Exclude for MLNX platform the start trigger of PMON when SYNCD starts in case of fastboot.
Copy the timer file to the host bin image.

- How to verify it
Run fast-reboot on MLNX platform and observe faster create_switch execution time.
2022-05-02 10:44:17 +03:00
shlomibitton
1d84e0d7df
[Fastboot] Delay LLDP service for better fastboot performance (#10568)
- Why I did it
Profiling the system state on init after fast-reboot during create_switch function execution, it is possible to see few python scripts running at the same time.
This parallel execution consume CPU time and the duration of create_switch is longer than it should be.
Following this finding, and the motivation to ensure these services will not interfere in the future, LLDP is delayed in 90 seconds until the system finish the init flow after fastboot.

- How I did it
Add a timer for LLDP service.
Copy the timer file to the host bin image.

- How to verify it
Run fast-reboot on MLNX platform and observe faster create_switch execution time.
This PR is dependent on PR: #10567
2022-04-28 10:35:14 +03:00
ganglv
9d7387a18e
[sonic-host-services]: Fix import and invalid path (#10660)
Why I did it
Can not start sonic-hostservice

How I did it
Install python3-dbus and systemd-python, and replace invalid path

How to verify it
Start the service with below commands:
sudo systemctl start sonic-hostservice
sudo systemctl status sonic-hostservice

Signed-off-by: Gang Lv ganglv@microsoft.com
2022-04-27 07:14:51 +08:00
Saikrishna Arcot
64187a1b15
Remove SSH host keys after installing the custom version of sshd (#10633)
* Remove SSH host keys after installing the custom version of sshd

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Use an override for for sshd instead of overwriting the service file

Don't overwrite upstream's .service file, and instead use an override
file for making sure the host key(s) are generated.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-04-25 10:38:52 -07:00
bingwang-ms
3fc3259a35
Define qos map AZURE_TUNNEL for QoS remapping of tunnel traffic (#10565)
* Add AZURE_TUNNEL map

Signed-off-by: bingwang <wang.bing@microsoft.com>
2022-04-25 15:06:10 +08:00
yozhao101
e24fe9bc60
[Monit] Fix the issue which shows Monit can not reset its counter. (#10288)
Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Why I did it
This PR aims to fix the Monit issue which shows Monit can't reset its counter when monitoring memory usage of telemetry container.

Specifically the Monit configuration file related to monitoring memory usage of telemetry container is as following:

  check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
      if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
If memory usage of telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), then it will be restarted.
Recently we observed, after telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted anymore during this 1 hour sliding window.

The reason is Monit can't reset its counter to count again and Monit can reset its counter if and only if the status of monitored service was changed from Status failed to Status ok. However, during this 1 hour sliding window, the status of monitored service was not changed from Status failed to Status ok.

Currently for each service monitored by Monit, there will be an entry showing the monitoring status, monitoring mode etc. For example, the following output from command sudo monit status shows the status of monitored service to monitor memory usage of telemetry:

    Program 'container_memory_telemetry'
         status                             Status ok
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Sat, 19 Mar 2022 19:56:26
Every 1 minute, Monit will run the script to check the memory usage of telemetry and update the counter if memory usage is larger than 400MB. If Monit checked the counter and found memory usage of telemetry is larger than 400MB for 10 times
within 20 minutes, then telemetry container was restarted. Following is an example status of monitored service:

    Program 'container_memory_telemetry'
         status                             Status failed
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Tue, 01 Feb 2022 22:52:55
After telemetry container was restarted. we found memory usage of telemetry increased rapidly from around 100MB to more than 400MB during 1 minute and status of monitored service did not have a chance to be changed from Status failed to Status ok.

How I did it
In order to provide a workaround for this issue, Monit recently introduced another syntax format repeat every <n> cycles related to exec. This new syntax format will enable Monit repeat executing the background script if the error persists for a given number of cycles.

How to verify it
I verified this change on lab device str-s6000-acs-12. Another pytest PR (Azure/sonic-mgmt#5492) is submitted in sonic-mgmt repo for review.
2022-04-20 18:08:06 -07:00
Samuel Angebault
fb147764b5
[Arista] Fix arista-net initramfs hook (#10624)
The interface renaming logic fails if one interface is missing.
Because of the `set -e` the whole initramfs hook would abort early on
error.
This change fixes the current behavior to make sure missing interfaces
are properly skipped and ensure existing interface are renamed.
2022-04-20 10:03:05 -07:00
Junhua Zhai
128d762af3
[gearbox] Add peer gbsyncd for swss if gearbox exists (#10504)
Fix the issues #10501 and #9733

If having gearbox, we need:
    * add gbsyncd as a peer since swss also has dependency on gbsyncd
    * add service gbsyncd to FEATURE table if it is missing
2022-04-20 19:02:49 +08:00
kellyyeh
2a516a7763
[dhcp_relay] Enable dhcp_relay on EPMS, MgmtTsTor, MgmtToRRouter and BackEndToRRouter (#10474) 2022-04-15 18:01:24 -07:00
Yakiv Huryk
d9117d9411
[Mellanox][asan] add address sanitizer support for syncd (#10266)
Why I did it
To support address sanitizer for Mellanox syncd

How I did it
/var/log/asan is mapped for syncd container (the same as for swss)
container stop() has a timeout (60s) for syncd (the same as for swss)
This is so libasan has enough time to generate a report.
added ASAN's log path to Mellanox syncd supervisord.conf
added "asan: yes" to sonic_version.yml
How to verify it
Added artificial memory leaks
Compiled with ENABLE_ASAN=y
Installed the image on DUT
Rebooted the DUT
Verified that /var/log/asan/syncd-asan.log contains the leaks

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
2022-04-14 15:00:32 -07:00
Saikrishna Arcot
12ebe3ffa0
Run tune2fs during initramfs instead of image install (#10536)
If it is run during image install, it's not guaranteed that the
installation environment will have tune2fs available. Therefore, run it
during initramfs instead.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-04-12 16:24:13 -07:00
byu343
f7a6553933
[docker-syncd]: Add optional shm-size to syncd container (#10516)
Why I did it
In the bringup of tomahawk4/trident4, we realized that such chips need a larger size of /dev/shm in syncd container, so we added the option --shm-size to the docker create for syncd. The default value for shm-size is 64m; after this change, people can add SYNCD_SHM_SIZE=128m to platform_env.conf to change it to 128m.

How to verify it
We verified that after this change, 1) on existing platforms without platform_env.conf, the size of /dev/shm in syncd container (df -h | grep shm) is still the default 64M; 2) after we add SYNCD_SHM_SIZE=128m to platform_env.conf, /dev/shm in syncd becomes 128M.
2022-04-09 10:47:18 -07:00
Vivek R
ed14eb5263
[interfaces-config] "main exception: cannot find interfaces: eth0" error log avoided (#10463)
- Why I did it
Fixes #9628
During bootup, this error log is seen

Dec 22 04:26:29 sonic interfaces-config.sh[2546]: error: main exception: cannot find interfaces: eth0 (interface was probably never up ?)
This is of non-functional nature and doesn't affect the flow.

- How I did it
Dont take the ifdown if not needed

- How to verify it
Verified during reboot. Log did not appear and IP was acquired on eth0 as expected

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
2022-04-06 16:59:47 +03:00
bingwang-ms
b9dd1df372
Update qos config to clear queues for bounced back traffic (#10176)
* Update qos config to clear queues for bounced back traffic

Signed-off-by: bingwang <bingwang@microsoft.com>
2022-04-05 22:32:25 +08:00
judyjoseph
8e642848c2
Introduce the asic_subtype field for adding the sub platform variants. (#10235)
* Introduce the asic_subtype field for adding the sub platform variants. 
   It uses the value of TARGET_MACHINE variable in slave.mk.
2022-03-28 11:22:32 -07:00
Santhosh Kumar T
e2502edefd
Refactoring DELL platform init to reduce rc.local processing time porting changes in master (#10318)
Why I did it
To reduce the processing time of rc.local, refactoring s6100 platform initialization.
Porting changes from 202012 branch [202012] Refactoring DELL platform init to reduce rc.local processing time #10171
2022-03-24 11:14:37 -07:00
xumia
e9ac13678d
[Build]: Fix armhf mirrors not existing issue (#10312)
Why I did it
[Build]: Fix armhf mirrors not existing issue
The mirror endpoint debian-archive.trafficmanager.net does not support armhf, change to use deb.debian.org and security.debian.org.
2022-03-22 15:24:15 +08:00
Kostiantyn Yarovyi
bf4ab4a338
[Barefoot][Syncd] restart of the interface for cleaning txquee through which communication takes place between Sonic and openBMC (#9941)
Why I did it
improvement of starting barefoot SDK

How I did it
restart of the interface for cleaning txquee through which communication takes place between Sonic and openBMC

How to verify it
run sonic autorestart tests
2022-03-21 10:07:20 -07:00
Samuel Angebault
e4b507fa03
[Arista] rename management interface in initrd (#9856)
On some products the pci enumeration adds randomness into which nic gets
initialized first.
Because SONiC doesn't use deterministic interface naming but instead old
style interface naming, this leads to eth0 not always being the
management port.
To make sure eth0 is always the management port (SONiC expectation)
rename the interfaces in the initramfs for Arista products.
2022-03-21 17:55:23 +05:30
xumia
1017ee6002
[Build]: Use one debian mirror config (#10274)
Why I did it
Use one debian mirror config.
The empty config in https://github.com/Azure/sonic-buildimage/blob/master/files/image_config/apt/sources.list overrides the file https://github.com/Azure/sonic-buildimage/blob/master/files/apt/sources.list.amd64 (armhf/arm64), it does not make sense.
All the content in files/image_config/apt is no use, any one wants to add mirror config, please add in files/apt.

How I did it
Remove files/image_config/apt and the reference.
2022-03-21 16:47:20 +08:00
Saikrishna Arcot
5617b1ae3e
Image disk space reduction (#10172)
# Why I did it

Reduce the disk space taken up during bootup and runtime.

# How I did it

1. Remove python package cache from the base image and from the containers.
2. During bootup, if logs are to be stored in memory, then don't create the `var-log.ext4` file just to delete it later during bootup.
3. For the partition containing `/host`, don't reserve any blocks for just the root user. This just makes sure all disk space is available for all users, if needed during upgrades (for example).


* Remove pip2 and pip3 caches from some containers

Only containers which appeared to have a significant pip cache size are
included here.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Don't create var-log.ext4 if we're storing logs in memory

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Run tune2fs on the device containing /host to not reserve any blocks for just the root user

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-03-15 18:12:49 -07:00
Stepan Blyshchak
18d00dfbe7
[teamd.sh] kill teamd docker on warm shutdown for faster shutdown (#10219)
This can save 6 sec for teamd LAG restoration - the time between:

```
Mar  9 13:51:10.467757 r-panther-13 WARNING teamd#teamd_PortChannel1[28]: Got SIGUSR1.
Mar  9 13:52:33.310707 r-panther-13 INFO teamd#teamd_PortChannel1[27]: carrier changed to UP
```

- Why I did it
Optimize warm boot. Specifically reduce the time needed for LAG restoration.

- How I did it
Kill teamd docker after graceful shutdown of teamd processes.

- How to verify it
Run warm reboot.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-03-15 09:20:36 +02:00
xumia
0243ed9538
[build]: Fix marvell-armhf build hung issue (#10156) (#10229)
Why I did it
The marvel-armhf build is hung, it does not exit after waiting for a long time.
It is caused by the process /etc/entropy.py which is started by the postinst script in target/debs/buster/sonic-platform-nokia-7215_1.0_armhf.deb
2022-03-15 10:03:54 +08:00
Saikrishna Arcot
d7c3ce0045
Specify the filesystem type when mounting to /host (#10169)
When mounting the partition that contains `/host` during initramfs, the
mount binary available there (coming from busybox) tries each filesystem
in `/proc/filesystems` and sees which one succeeds. During this time,
there may be some error messages logged into dmesg because some of the
incorrect filesystems failed to mount the partition.

Specify the filesystem type explicitly so that initramfs knows it's that
type, and we know what filesystem will always get used there.

Fixes #9998

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-03-14 11:34:02 -07:00
Stepan Blyshchak
2919b4820f
[hostcfgd] record feature state in STATE DB (#9842)
- Why I did it
To implement blocking feature state change.

- How I did it
Record the actual feature state in STATE DB from hostcfg.

- How to verify it
UT + verification by running on the switch and checking STATE DB.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-03-14 13:45:27 +02:00
xumia
eea3cc7ad1
[Build]: only install grpc in amd64 (#10212)
[Build]: only install grpc in amd64
Unblock marvell-armhf build.
2022-03-14 13:41:37 +08:00
Samuel Angebault
8d419ca2c5
[Arista] Remove arista.log from rsyslog default logrotate (#9731)
Why I did it
In parallel of this change Arista added a custom logrotate configuration as part of its driver library.
Having 2 logrotate configuration for the same log file triggers an issue.

Fixes aristanetworks/sonic#38

How I did it
Arista merged a few changes in sonic-buildimage which added a logrotate configuration aristanetworks/sonic@e43c797
It is therefore the right path to remove the arista.log line from the logrotate.d/rsyslog configuration.

How to verify it
Logrotate works without any error message, arista log rotation happens and arista daemons still append logs once file was truncated.
2022-03-11 08:09:07 -08:00
xumia
9cdf81230b
[Build]: Fix /proc not mounted issue (#10164)
[Build]: Fix /proc not mounted issue
2022-03-11 09:23:37 +08:00
Song Yuan
01798447ab
[Chassis][QoS template] Skip configuring buffer and QoS config on recirc ports (#7869)
* Added test case to verify the template changes.
2022-03-09 16:04:36 -08:00
Kebo Liu
fe0a7693f4
[smartmontools] Install smartmontools with apt-get and upgrade it to 7.2-1 (#10087)
Why I did it
Smartmontools 6.6 has an issue with reading SMART info of nvme SSD
Smartmontools can be installed with apt-get, no need to build and install

How I did it
Use apt-get to install smartmontools 7.2-1
Remove previous make files for smartmontools 6.6

How to verify it
verify with "smartctl" can read out correct SMART info on NVME ssd.
verify "show platform ssdhealth" can still work

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-03-07 09:39:33 -08:00
Marty Y. Lok
c40f04f0e2
[chassis][supervisor]monit container-checker failed due to unexpected "database-chassis" docker running #9042 (#9043)
Why I did it
Fixed the monit container_checker fails due to unexpected "database-chassis" docker running on Supervisor card in the VOQ chassis. fixes #9042

How I did it
Added database-chassis to the always running docker list if platform is supervisor card.

How to verify it
Execute the CLI command "sudo monit status container_checker"


Signed-off-by: mlok <marty.lok@nokia.com>
2022-03-03 17:56:08 -08:00
Aravind Mani
1740beb1f2
[sonic-cfggen]: Fix sonic-cfggen build failures for armhf (#10132)
Why I did it
amrhf build fails while building sonic-config-engine whl package
https://dev.azure.com/mssonic/be1b070f-be15-4154-aade-b1d3bfb17054/_apis/build/builds/77089/logs/9

The reason for the failure is due to the fact that there is a new line generated at the top of the file in buffer config test cases while building for broadcom based platform and this issue is not seen in Marvell based platforms.

How I did it
Removed the new line for all the buffer test cases as there is no need to add it and accordingly changed the buffer_config.j2 where the new line is generated.
2022-03-02 13:06:20 -08:00
Lawrence Lee
a50d1f1fc8
[write_standby]: Increase timeout to 60s (#10065)
- Avoid scenarios where script times out before orchagent can establish IPinIP tunnel

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2022-02-24 14:55:45 -08:00
wenyiz2021
2d0b063191
Update container_checker for multi-asic devices when state is 'always_enabled' (#10067)
* Update container_checker for multi-asic devices 

Update container_checker for multi-asic devices to add database containers in always_running_containers. 
Previous change was made for single-asic, and that database containers were not considered as feature when writing to state_db.

* Update container_checker

Update an indent
2022-02-23 18:06:30 -08:00
vmittal-msft
bc1dfea619
Updated traffic scheduler settings for HWSKUs : DellEMC-Z9332f-O32 and DellEMC-Z9332f-M-O16C64 (#9828) 2022-02-23 17:22:41 -08:00
Stepan Blyshchak
fb752a4ae5
[rsyslog.j2] fix typo in VAR_LOG_SIZE_KB (#9954)
This issue causes negative threshold value and thus deleting log files even when there is enough space.

This issue causes negative threshold value and thus deleting log files even when there is enough space.

- Why I did it
To fix an issue when log files get deleted even if there is enough space.

- How I did it
Fixed an typo.

- How to verify it
Run the portion of the script that calculates threshold, see that the threshold is calculated correctly.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-02-17 10:16:44 +02:00
byu343
155220be9b
Support multi-asic on macsec container (#9921)
This change enables the support of running multiple macsec containers, each for one ASIC.
2022-02-13 22:45:24 -08:00
Oleksandr Ivantsiv
25a0ce5eb1
[asan] Add address sanitizer support. (#9857)
Implement infrastructure that allows enabling address sanitizer
for docker containers. Enable address sanitizer for SWSS container.

- Why I did it
To add a possibility to compile SONiC applications with address sanitizer (ASAN).
ASAN is a memory error detector for C/C++. It finds:
1. Use after free (dangling pointer dereference)
2. Heap buffer overflow
3. Stack buffer overflow
4. Global buffer overflow
5. Use after return
6. Use after the scope
7. Initialization order bugs
8. Memory leaks

- How I did it
By adding new ENABLE_ASAN configuration option.

- How to verify it
By default ASAN is disabled and the SONiC image is not affected.
When ASAN is enabled it inspects all allocation, deallocation, and memory usage that the application does in run time. To verify whether the application has memory errors tests that trigger memory usage of the application should be run. Ideally, the whole regression tests should be run. Memory leaks reports will be placed in /var/log/asan/ directory of SONiC host OS.

Signed-off-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com>
2022-02-09 13:29:18 +02:00
Prince George
ff14aebef9
Close console session due to user inactivity (#9890)
Signed-off-by: Prince George <prgeor@microsoft.com>
2022-02-02 09:41:21 +05:30
tbgowda
4e32f85a31
Enable SAI_SWITCH_ATTR_UNINIT_DATA_PLANE_ON_REMOVAL attribute (#9419)
Why I did it
Fixes #8980 partly.

The corresponding changes in sonic-sairedis is here :
Azure/sonic-sairedis#975

How I did it
Include changes from both repos and build an image for verification.

How to verify it
Trigger fast-reboot with the changes, see the attribute SAI_SWITCH_ATTR_UNINIT_DATA_PLANE_ON_REMOVAL being set at the SAI level.

Signed-off-by: Thushar Gowda <24815472+tbgowda@users.noreply.github.com>
2022-02-01 08:44:17 -08:00
Alexander Allen
8a07af95e5
[Mellanox] Modified Platform API to support all firmware updates in single boot (#9608)
Why I did it
Requirements from Microsoft for fwutil update all state that all firmwares which support this upgrade flow must support upgrade within a single boot cycle. This conflicted with a number of Mellanox upgrade flows which have been revised to safely meet this requirement.

How I did it
Added --no-power-cycle flags to SSD and ONIE firmware scripts
Modified Platform API to call firmware upgrade flows with this new flag during fwutil update all
Added a script to our reboot plugin to handle installing firmwares in the correct order with prior to reboot
How to verify it
Populate platform_components.json with firmware for CPLD / BIOS / ONIE / SSD
Execute fwutil update all fw --boot cold
CPLD will burn / ONIE and BIOS images will stage / SSD will schedule for reboot
Reboot the switch
SSD will install / CPLD will refresh / switch will power cycle into ONIE
ONIE installer will upgrade ONIE and BIOS / switch will reboot back into SONiC
In SONiC run fwutil show status to check that all firmware upgrades were successful
2022-01-24 00:56:38 -08:00
dflynn-Nokia
b6939b9927
[firsttime boot] suppress error message on platforms not supporting kdump (#9521)
Why I did it
Eliminate benign firsttime boot error reported when running on platforms that do not support kdump.

How I did it
Change rc.local to check for presence of the file /etc/default/kdump-tools before referencing it.

How to verify it
Install a new image on an armhf or arm64 platform and check for a failed reference to /etc/default/kdump-tools on firsttime boot.
2022-01-20 18:27:10 -08:00
Shyam
20f32dc072
Added gbsyncd infra for multi-ASIC, multi-PHY mode (#9722)
- External PHY is managed via gearbox (gbsybcd docker container) in SONiC
  - Enhanced 'External PHY management' from SONiC's single-ASIC environment to multi-ASIC
  - Enhanced gbsyncd docker container from single Namespace to multi-Namspace mode
  - Added gbsyncd.service.j2 on per_namespace basis.
  - Each namepace/ASIC now to have its unique gbsyncd<ASIC#> docker container with its
    own Gearbox table, redis-DB

Signed-off-by: Shyam Kumar <shyakuma@cisco.com>
2022-01-21 10:08:16 +08:00
Alexander Allen
5f596aef63
[pmon] Move smartctl from pmon to host (#9607)
Why I did it
Need to be able to run smartctl when pmon docker is not running.

How I did it
Removed the pmon dependency for pmon as well as the command wrapper and added it to the debian-extension.

How to verify it
Stop pmon
Run smartctl from the host and verify it runs without error
2022-01-19 10:53:10 -08:00
liuh-80
f166b991a7
[image]: Prevent radius passkey and snmp community string into syslog. (#9727)
[image]: Prevent radius passkey and snmp community string into syslog.  (#9727)

#### Why I did it
    Prevent radius passkey and snmp community string into syslog.

#### How I did it
    Add radius and snmp config command to PASSWD_CMDS

#### How to verify it
    Run and pass all UTs.

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106

#### Description for the changelog
    Add radius and snmp config command to PASSWD_CMDS to prevent radius passkey and snmp community string into syslog.

#### A picture of a cute animal (not mandatory but encouraged)
2022-01-17 16:26:22 +08:00