Commit Graph

975 Commits

Author SHA1 Message Date
shlomibitton
2a9aa0836c
[202012] [Mellanox] [pmon] Fix for PMON service not starting when restarting SWSS service after fast/warm reboot (#10902)
- Why I did it
Recent change to delay PMON service in case of fast/warm reboot introduce an issue when restarting only SWSS service after fast/warm reboot for Nvidia platform.
Since the timer is triggered only when the system boot, in a scenario when the system is after a fast/warm reboot and the user restart SWSS service, as part of syncd.sh script, PMON service will stop but the timer will not start again.

- How I did it
On syncd.sh script, in case of fast/warm indication, check if pmon.timer is running.
If it is running it means we are at the first boot and continue normally.
If it is not running, meaning the service was restarted, start the timer to keep the system behavior consistent.

- How to verify it
Run fast/warm reboot.
service swss restart.
Observe PMON service starting.
2022-06-08 09:46:54 +03:00
mssonicbld
855ae0491f
[ci/build]: Upgrade SONiC package versions (#11051)
Upgrade SONiC Versions #11051
2022-06-08 08:42:04 +08:00
Richard.Yu
8f3edde302
[202012][BRCM SAI 4.3.5.3-5] Update saibcm for pcbb feature (#10998)
Support Tunnel PFC/pcbb feature on Broadcom platform.

How to verify it
Tested build target, successful

make target/docker-syncd-brcm.gz
manual run those tests after installing sai binary within image 20201231.67 on 7050CX3 (TD3) T0 DUT, all passed

     fib/test_fib.py
     vxlan/test_vxlan_decap.py
     fdb/test_fdb.py
     decap/test_decap.py
     pfcwd/test_pfcwd_all_port_storm.py
     acl/null_route/test_null_route_helper.py
     acl/test_acl.py
     vlan/test_vlan.py
     platform_tests/test_reboot.py

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
2022-06-06 09:54:00 -07:00
bingwang-ms
e159998657
[202012][cherry-pick] Add two extra lossless queues for bounced back traffic (#10715)
* Add extra lossless queues

Signed-off-by: bingwang <bingwang@microsoft.com>
2022-06-04 19:25:02 +08:00
bingwang-ms
7ec6a60230
[cherry-pick] [202012] Update qos config to clear queues for bounced back traffic (#10608)
* Update qos config to clear queues for bounced back traffic

Signed-off-by: bingwang <wang.bing@microsoft.com>
2022-06-02 16:29:25 +08:00
Jing Kan
14fdcc815a
[202012][openssh] openssh: Upgrade from 7.9 to 8.4, to match version in buster-backports (#10910)
* Use buster-backports version
* Use dget dsc file instead source repo
* Update make files
* Upgrade openssh-client to 8.4 in base image
* Remove useless installation
* Install openssh-server from buster-backports in build_debian
* Update dev buster package version list

Signed-off-by: Jing Kan jika@microsoft.com
2022-06-02 16:06:22 +08:00
xumia
06addae853 Revert "Reduce image size for lazy installation packages (#10775)" (#10916)
This reverts commit 15cf9b0d70.
Why I did it
Revert the PR #10775, for it has impact on onie installation.
It is caused by the symbol links not supported in some of the onie unzip.
We will enable after fixing the issue, see #10914
2022-05-27 17:00:50 +00:00
mssonicbld
8ce3cab508
[ci/build]: Upgrade SONiC package versions (#10732)
Co-authored-by: mssonicbld <vsts@fv-az31-361.b1uo4dmaffwenkazr3a2h2ovdb.jx.internal.cloudapp.net>
2022-05-27 01:10:32 -07:00
shlomibitton
c71c91e2b0
[202012] [Fastboot] Delay PMON service for better fastboot performance (#10745)
#### Why I did it

Profiling the system state on init after fast-reboot during create_switch function execution, it is possible to see few python scripts running at the same time.
This parallel execution consume CPU time and the duration of create_switch is longer than it should be.
Following this finding, and the motivation to ensure these services will not interfere in the future, PMON is delayed in 90 seconds until the system finish the init flow after fastboot.

#### How I did it

Add a timer for PMON service.
Exclude for MLNX platform the start trigger of PMON when SYNCD starts in case of fastboot.
Copy the timer file to the host bin image.

#### How to verify it

Run fast-reboot on MLNX platform and observe faster create_switch execution time.
2022-05-15 23:31:32 -07:00
shlomibitton
bca8a244c6
[202012] [Fastboot] Delay LLDP service for better fastboot performance (#10568) (#10744)
This PR is to backport a fix #10568
This PR is dependent on PR: #10745

- Why I did it
Profiling the system state on init after fast-reboot during create_switch function execution, it is possible to see few python scripts running at the same time.
This parallel execution consume CPU time and the duration of create_switch is longer than it should be.
Following this finding, and the motivation to ensure these services will not interfere in the future, LLDP is delayed in 90 seconds until the system finish the init flow after fastboot.

- How I did it
Add a timer for LLDP service.
Copy the timer file to the host bin image.

- How to verify it
Run fast-reboot on MLNX platform and observe faster create_switch execution time.
2022-05-15 15:05:29 +03:00
Junchao-Mellanox
4f326e8779
Fix race condition between networking service and interface-config service (#10573) (#10766)
Backport https://github.com/Azure/sonic-buildimage/pull/10573 to 202012.

#### Why I did it

The PR is aimed to fix a bug that mgmt port eth0 may loss IP even if user configured static IP of eth0. This is not a always reproduceable issue, the reproducing flow is like:

1.	Systemd starts networking service, which runs a dhcp based configuration and assigned an ip from dhcp.
2.	Systemd starts interface-config service who depends on networking service
3.	Interface-config service runs command  “ifdown –force eth0”, check [line](16717d2dc5/files/image_config/interfaces/interfaces-config.sh (L4)). but networking service is still running so that this [line](ac32bec0e2/ifupdown2/ifupdown/main.py (L74)) failed with error: “error: Another instance of this program is already running.”. This error is printed by ifupdown2 lib who is the main process of networking service. So, ifdown actually does not work here, the ip of eth0 is not down.
4.	Interface-config service updates /etc/networking/interface to static configuration.
5.	Interface-config service runs command “systemctl restart networking”. This command kills the previous networking related processes (log: networking.service: Main process exited, code=killed, status=15/TERM), and try to reconfigure the ip address with static configuration. But it detects that the configured IP and the existing IP are the same, and it does not really configure the ip to kernel. Hence, the ip is still getting from dhcp. (this could be a bug of ifupdown2: previous ip is from dhcp, new ip is a static ip, it treats them as same instead of re-configuring the IP)
6.	When the lease of the ip expires, the ip of eth0 is removed by kernel and the issue reproduces.

The issue is not always reproduceable because networking service usually runs fast so that it won't hit step#3.

#### How I did it

Check networking service state before running "ifdown –force eth0", wait for it done if it is activating.

#### How to verify it

Manual test.
2022-05-14 14:58:24 -07:00
xumia
951d93e362 Reduce image size for lazy installation packages (#10775)
Why I did it
The image size is too large, when there are multiple lazy packages and multiple platforms. It is not necessary to keep the lazy installation packages in multiple copies.
For cisco image, the image size will reduce from 3.5G to 1.7G.

How I did it
Use symbol links to only keep one package for each of the lazy package.
Make a new folder fsroot/platform/common
Copy the lazy packages into the folder.
When using a package in each of the platform, such as x86_64-grub, x86_64-8800_rp-r0, x86_64-8201_on-r0, etc, only make a symbol link to the package in the common folder.
2022-05-10 06:44:40 +00:00
Samuel Angebault
705d3c0804 [Arista] Remove arista.log from rsyslog default logrotate (#9731)
Why I did it
In parallel of this change Arista added a custom logrotate configuration as part of its driver library.
Having 2 logrotate configuration for the same log file triggers an issue.

Fixes aristanetworks/sonic#38

How I did it
Arista merged a few changes in sonic-buildimage which added a logrotate configuration aristanetworks/sonic@e43c797
It is therefore the right path to remove the arista.log line from the logrotate.d/rsyslog configuration.

How to verify it
Logrotate works without any error message, arista log rotation happens and arista daemons still append logs once file was truncated.
2022-04-28 23:58:41 +00:00
mssonicbld
1c9cdc4c7a
[ci/build]: Upgrade SONiC package versions (#10594) 2022-04-27 15:25:14 +00:00
yozhao101
e6c18fa6dd [Monit] Fix the issue which shows Monit can not reset its counter. (#10288)
Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Why I did it
This PR aims to fix the Monit issue which shows Monit can't reset its counter when monitoring memory usage of telemetry container.

Specifically the Monit configuration file related to monitoring memory usage of telemetry container is as following:

  check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
      if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
If memory usage of telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), then it will be restarted.
Recently we observed, after telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted anymore during this 1 hour sliding window.

The reason is Monit can't reset its counter to count again and Monit can reset its counter if and only if the status of monitored service was changed from Status failed to Status ok. However, during this 1 hour sliding window, the status of monitored service was not changed from Status failed to Status ok.

Currently for each service monitored by Monit, there will be an entry showing the monitoring status, monitoring mode etc. For example, the following output from command sudo monit status shows the status of monitored service to monitor memory usage of telemetry:

    Program 'container_memory_telemetry'
         status                             Status ok
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Sat, 19 Mar 2022 19:56:26
Every 1 minute, Monit will run the script to check the memory usage of telemetry and update the counter if memory usage is larger than 400MB. If Monit checked the counter and found memory usage of telemetry is larger than 400MB for 10 times
within 20 minutes, then telemetry container was restarted. Following is an example status of monitored service:

    Program 'container_memory_telemetry'
         status                             Status failed
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Tue, 01 Feb 2022 22:52:55
After telemetry container was restarted. we found memory usage of telemetry increased rapidly from around 100MB to more than 400MB during 1 minute and status of monitored service did not have a chance to be changed from Status failed to Status ok.

How I did it
In order to provide a workaround for this issue, Monit recently introduced another syntax format repeat every <n> cycles related to exec. This new syntax format will enable Monit repeat executing the background script if the error persists for a given number of cycles.

How to verify it
I verified this change on lab device str-s6000-acs-12. Another pytest PR (Azure/sonic-mgmt#5492) is submitted in sonic-mgmt repo for review.
2022-04-21 22:00:42 +00:00
Samuel Angebault
9de6b2ca12
[Arista] Fix arista-net initramfs hook (#10626)
The interface renaming logic fails if one interface is missing.
Because of the `set -e` the whole initramfs hook would abort early on
error.
This change fixes the current behavior to make sure missing interfaces
are properly skipped and ensure existing interface are renamed.
2022-04-20 10:03:37 -07:00
Jing Kan
4ee75f490e
[202012][copp_cfg] Enable dhcp trap for BmcMgmtToRRouter (#10596)
Signed-off-by: Jing Kan jika@microsoft.com
2022-04-19 15:59:20 +08:00
Stepan Blyshchak
fa1e364f54
[services] kill container on stop in warm/fast mode (#10511)
To optimize stop on warm boot, added kill for containers

Use service "kill" in the shutdown path for fast and warm reboot. For all other reload methods, service "stop" is used.
This is done to save time in shutdown path, and to overall improve the time spent in warm and fast reload.

How - Use service_mgmt.sh to trigger common logic to initiate kill (fast/warm) or stop (cold) for database.sh, radv.sh, snmp.sh, telemetry.sh, mgmt-framework.sh

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>, Vaibhav H D <vaibhav.dixit@microsoft.com>
2022-04-18 14:27:48 -07:00
Ying Xie
6af3de4372
[202012][copp cfg] enable dhcp trap for a couple more devices (#10582)
* [copp cfg] enable copp trap for a couple more devices

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2022-04-15 11:47:02 -07:00
Saikrishna Arcot
29b6f62902
[202012] Run tune2fs during initramfs instead of image install (#10558)
If it is run during image install, it's not guaranteed that the
installation environment will have tune2fs available. Therefore, run it
during initramfs instead.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-04-12 19:59:24 -07:00
mssonicbld
e0fa07307a
[ci/build]: Upgrade SONiC package versions (#10395)
[ci/build]: Upgrade SONiC package versions (#10395)
2022-04-10 17:00:00 +08:00
kellyyeh
b68f4dd74c
Enable dhcp copp trap for EPMS and MgmtTsToR (#10439) 2022-04-06 09:46:08 -07:00
Saikrishna Arcot
e9db38594d
Image disk space reduction (#10172) (#10371)
Reduce the disk space taken up during bootup and runtime.

1. Remove python package cache from the base image and from the containers.
2. During bootup, if logs are to be stored in memory, then don't create the `var-log.ext4` file just to delete it later during bootup.
3. For the partition containing `/host`, don't reserve any blocks for just the root user. This just makes sure all disk space is available for all users, if needed during upgrades (for example).

* Remove pip2 and pip3 caches from some containers

Only containers which appeared to have a significant pip cache size are
included here.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Don't create var-log.ext4 if we're storing logs in memory

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Run tune2fs on the device containing /host to not reserve any blocks for just the root user

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
(cherry picked from commit 5617b1ae3e)
2022-03-29 10:11:28 -07:00
mssonicbld
873689ef6e
[ci/build]: Upgrade SONiC package versions (#10373) 2022-03-28 23:08:38 +00:00
mssonicbld
e71c14502d
[ci/build]: Upgrade SONiC package versions (#10331)
Upgrade SONiC Versions
2022-03-25 15:09:37 +08:00
Saikrishna Arcot
aafb3d00e2
Start haveged before systemd-random-seed (#10328)
The haveged service file in Debian Buster specifies that haveged should
start after systemd-random-seed starts (this was removed in Bullseye
after systemd changes caused a bootloop). This is a bit
counterproductive, since haveged is meant to be used in environments
with minimal sources of entropy, but one of the checks that
systemd-random-seed does is to verify that entropy is present.

Therefore, override the default .service file for haveged that moves
systemd-random-seed to the Before list, allowing it to start before
systemd-random-seed checks the system entropy level. (systemd doesn't
allow removing items from dependency/ordering entries such as After= and
Before=, so the entire .service file has to be overwritten.)

Note that despite this, haveged takes up to two seconds to actually
start working, so systemd-random-seed may still block for about two
seconds. However, this still allows other work (such as running
rc.local) to proceed a bit sooner.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-03-24 14:28:42 -07:00
noaOrMlnx
4f021c44c2
Update docker-sonic-vs infrastructure in order to run CoPP UT (#10230)
*Changes to run CoPP UT in docker-sonic-vs
2022-03-21 21:55:24 -07:00
xumia
67312ff635
[Build]: Use one debian mirror config (#10281)
Why I did it
Use one debian mirror config.
The empty config in https://github.com/Azure/sonic-buildimage/blob/master/files/image_config/apt/sources.list overrides the file https://github.com/Azure/sonic-buildimage/blob/master/files/apt/sources.list.amd64 (armhf/arm64), it does not make sense.
All the content in files/image_config/apt is no use, any one wants to add mirror config, please add in files/apt.

How I did it
Remove files/image_config/apt and the reference.
2022-03-21 17:04:19 +08:00
gechiang
a984757b9d
[202012 BRCM SAI 4.3.5.3-3] Picked up fixes that makes up BRCM SAI version 4.3.5.3-3 (#10255) 2022-03-19 17:18:50 -07:00
xumia
413ee3e219
[Build]: Fix /proc not mounted issue (#10164) (#10256)
[Build]: Fix /proc not mounted issue
2022-03-19 22:19:06 +08:00
mssonicbld
03d058efe4
[ci/build]: Upgrade SONiC package versions (#10283)
[ci/build]: Upgrade SONiC package versions
2022-03-19 11:09:51 +08:00
Stepan Blyshchak
8ce5e4e77b [teamd.sh] kill teamd docker on warm shutdown for faster shutdown (#10219)
This can save 6 sec for teamd LAG restoration - the time between:

```
Mar  9 13:51:10.467757 r-panther-13 WARNING teamd#teamd_PortChannel1[28]: Got SIGUSR1.
Mar  9 13:52:33.310707 r-panther-13 INFO teamd#teamd_PortChannel1[27]: carrier changed to UP
```

- Why I did it
Optimize warm boot. Specifically reduce the time needed for LAG restoration.

- How I did it
Kill teamd docker after graceful shutdown of teamd processes.

- How to verify it
Run warm reboot.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-03-16 22:22:26 +00:00
wenyiz2021
5878cfdb06 Update container_checker for multi-asic devices when state is 'always_enabled' (#10067)
* Update container_checker for multi-asic devices 

Update container_checker for multi-asic devices to add database containers in always_running_containers. 
Previous change was made for single-asic, and that database containers were not considered as feature when writing to state_db.

* Update container_checker

Update an indent
2022-03-14 23:01:43 +00:00
mssonicbld
1c4364222d
[ci/build]: Upgrade SONiC package versions (#10214) 2022-03-11 15:13:09 +00:00
Santhosh Kumar T
e83955599d
[202012] Refactoring DELL platform init to reduce rc.local processing time (#10171)
Why I did it
To reduce the processing time of rc.local, refactoring s6100 platform initialization.
Fixing [warm-upgrade][202012] Slow DELL platform init in rc.local causes lacp-teardown #10150
How I did it
On branch 202012-s6100-rclocalChanges to be committed:  (use "git restore --staged <file>..." to unstage)
        modified:   ../../../../files/image_config/platform/rc.local        
	modified:   ../debian/platform-modules-s6100.install        
	modified:   scripts/fast-reboot_plugin
        modified:   scripts/s6100_platform.sh
        renamed:    scripts/s6100_i2c_enumeration.sh -> scripts/s6100_platform_startup.sh
        renamed:    systemd/s6100-i2c-enumerate.service -> systemd/s6100-platform-startup.service
2022-03-10 18:51:07 -08:00
mssonicbld
7fe1489061
[ci/build]: Upgrade SONiC package versions (#10194) 2022-03-09 22:41:51 +00:00
mssonicbld
063882cf87
[ci/build]: Upgrade SONiC package versions (#10069)
[ci/build]: Upgrade SONiC package versions (#10069)
2022-03-08 21:32:36 +08:00
xumia
a8d844c83d
[build]: Fix marvell-armhf build hung issue (#10156)
The marvel-armhf build is hung, it does not exist after waiting for a long time.
It is caused by the process /etc/entropy.py which is started by the postinst script in target/debs/buster/sonic-platform-nokia-7215_1.0_armhf.deb

$ cat postinst 
sh /usr/sbin/nokia-7215_plt_setup.sh
...

$ cat usr/sbin/nokia-7215_plt_setup.sh | tail

    python /etc/entropy.py &


$ cat etc/entropy.py 
if path.exists("/proc/sys/kernel/random/entropy_avail"):
    while 1:
        while avail() < 2048:
            with open('/dev/urandom', 'rb') as urnd, open("/dev/random", mode='wb') as rnd:
                d = urnd.read(512)
                t = struct.pack('ii', 4 * len(d), len(d)) + d
                fcntl.ioctl(rnd, RNDADDENTROPY, t)
        time.sleep(30)

It is a workaround to fix the build issue, need to fix debian package, and revert the change.
2022-03-07 08:00:56 -08:00
roman_savchuk
4d6f9f2de7
[ BFN ] update SDE package for BFN platform (#10049)
Updated SDE package for Barefoot platform with fixes for:

- NAT
- VRF
2022-03-04 20:43:08 -08:00
Qi Luo
04925df451
[build] Fix the urllib3 version in sonic-mgmt-framework (#10149)
Fix the urllib3 version in sonic-mgmt-framework constrain file because it is already updated in Dockerfile
2022-03-04 20:34:23 -08:00
gechiang
7fb546dce4
[202012]BRCM SAI 4.3.5.3-2 Fixes CS00012228504, SONIC-55963:SID, CS00012209080, CS00012220761, and CS00012222414 (#10155) 2022-03-04 16:24:59 -08:00
Lawrence Lee
4d1abbc09b [write_standby]: Increase timeout to 60s (#10065)
- Avoid scenarios where script times out before orchagent can establish IPinIP tunnel

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2022-03-01 22:49:17 +00:00
noaOrMlnx
7a35504ff7
[202012] [CoPP] Add always_enabled field (#9999)
Add the "always_enabled" field to copp_cfg.j2 file, in order to allow traps without an entry in features table, to be installed automatically.

This is a cherry-pick of https://github.com/Azure/sonic-buildimage/pull/9302

- Why I did it
In order to allow traps without an entry in features table, to be installed automatically.

- How I did it
Add always_enabled field to traps without a feature
2022-02-20 12:42:39 +02:00
mssonicbld
a23aac25d3
[ci/build]: Upgrade SONiC package versions (#10023)
[ci/build]: Upgrade SONiC package versions
2022-02-19 08:10:17 +08:00
Samuel Angebault
b32d7eedaf
Add emmc quirks to boot0 (#9989)
Why I did it
Fix some unreliability seen on emmc device with some AMD CPUs

How I did it
Added a kernel parameter to add quirks to
It depends on a sonic-linux-kernel change to work properly but will be a no-op without it.

Description for the changelog
Add emmc quirks for Upperlake
2022-02-17 08:55:01 -08:00
vmittal-msft
304ec5b0cd
Updated traffic scheduler settings for HWSKUs : DellEMC-Z9332f-O32 & DellEMC-Z9332f-M-O16C64 (#9927) 2022-02-15 16:15:20 -08:00
mssonicbld
f746d27c7d
[ci/build]: Upgrade SONiC package versions (#9933) 2022-02-09 00:59:47 +00:00
Prince George
c1a0871fe9 Close console session due to user inactivity (#9890)
Signed-off-by: Prince George <prgeor@microsoft.com>
2022-02-08 19:07:29 +00:00
tbgowda
78dc2d8a7b Enable SAI_SWITCH_ATTR_UNINIT_DATA_PLANE_ON_REMOVAL attribute (#9419)
Why I did it
Fixes #8980 partly.

The corresponding changes in sonic-sairedis is here :
Azure/sonic-sairedis#975

How I did it
Include changes from both repos and build an image for verification.

How to verify it
Trigger fast-reboot with the changes, see the attribute SAI_SWITCH_ATTR_UNINIT_DATA_PLANE_ON_REMOVAL being set at the SAI level.

Signed-off-by: Thushar Gowda <24815472+tbgowda@users.noreply.github.com>
2022-02-08 19:07:08 +00:00
vmittal-msft
7435613216
[202012] BRCM SAI 4.3.5.3-1 Fix for CS00012218555 (#9923) 2022-02-07 08:02:57 -08:00