5319 Commits

Author SHA1 Message Date
shlomibitton
bca8a244c6
[202012] [Fastboot] Delay LLDP service for better fastboot performance (#10568) (#10744)
This PR is to backport a fix #10568
This PR is dependent on PR: #10745

- Why I did it
Profiling the system state on init after fast-reboot, during create_switch function execution, shows several Python scripts running at the same time.
This parallel execution consumes CPU time, and create_switch takes longer than it should.
Following this finding, and to ensure these services will not interfere in the future, LLDP is delayed by 90 seconds so that the system can finish the init flow after fastboot.

- How I did it
Add a timer for LLDP service.
Copy the timer file to the host bin image.

- How to verify it
Run fast-reboot on MLNX platform and observe faster create_switch execution time.
2022-05-15 15:05:29 +03:00
Junchao-Mellanox
4f326e8779
Fix race condition between networking service and interface-config service (#10573) (#10766)
Backport https://github.com/Azure/sonic-buildimage/pull/10573 to 202012.

#### Why I did it

This PR aims to fix a bug where the mgmt port eth0 may lose its IP even if the user configured a static IP for eth0. The issue is not always reproducible; the reproducing flow is:

1. Systemd starts the networking service, which runs a DHCP-based configuration and is assigned an IP from DHCP.
2. Systemd starts the interface-config service, which depends on the networking service.
3. The interface-config service runs the command "ifdown --force eth0", check [line](16717d2dc5/files/image_config/interfaces/interfaces-config.sh (L4)). But the networking service is still running, so this [line](ac32bec0e2/ifupdown2/ifupdown/main.py (L74)) fails with the error "error: Another instance of this program is already running.". This error is printed by the ifupdown2 lib, which is the main process of the networking service. So ifdown does not actually work here, and the IP of eth0 is not brought down.
4. The interface-config service updates /etc/networking/interface to the static configuration.
5. The interface-config service runs the command "systemctl restart networking". This command kills the previous networking-related processes (log: networking.service: Main process exited, code=killed, status=15/TERM) and tries to reconfigure the IP address with the static configuration. But it detects that the configured IP and the existing IP are the same, so it does not actually configure the IP in the kernel. Hence, the IP is still the one obtained from DHCP. (This could be a bug in ifupdown2: the previous IP is from DHCP and the new IP is static, but it treats them as the same instead of reconfiguring the IP.)
6. When the lease of the IP expires, the IP of eth0 is removed by the kernel and the issue reproduces.

The issue is not always reproducible because the networking service usually runs fast enough that it does not hit step 3.

#### How I did it

Check the networking service state before running "ifdown --force eth0" and, if it is still activating, wait for it to finish.
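
The wait described above can be sketched in Python. This is only an illustration: the actual change lives in a shell script, and the function names, timeout, and polling interval below are assumptions, not values taken from the PR. The sketch relies on `systemctl is-active`, which prints the unit's active state (for example `activating` or `active`).

```python
import subprocess
import time
from typing import Callable


def networking_state() -> str:
    """Return the state printed by `systemctl is-active networking`."""
    result = subprocess.run(
        ["systemctl", "is-active", "networking"],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def wait_while_activating(get_state: Callable[[], str],
                          timeout_s: float = 30.0,      # assumed value
                          interval_s: float = 0.5,      # assumed value
                          sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the unit is no longer "activating" or the timeout expires.

    Returns the last state observed, so the caller can decide what to do.
    """
    deadline = time.monotonic() + timeout_s
    state = get_state()
    while state == "activating" and time.monotonic() < deadline:
        sleep(interval_s)
        state = get_state()
    return state
```

A caller would run `wait_while_activating(networking_state)` before invoking ifdown; the state getter is injected so the loop can be exercised without systemd.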

#### How to verify it

Manual test.
2022-05-14 14:58:24 -07:00
Sudharsan Dhamal Gopalarathnam
f16d11237a
[202012][submodule] Advance sonic-swss submodule pointer (#10803)
Update sonic-swss submodule to include below commits

b9163d3 [Vnet] Set BFD multihop to true for Vnet routes
cfed8c7 [202012][cherry-pick]Update orchagent to support new field pfcwd_sw_enable
172cd13 [ACL]Avoid incrementing crm count when ACL rule create fails
7377901 [pfcwd] Add vs test infrastructure
0b58595 Removing Vnet with scope default
2022-05-14 10:29:34 +03:00
Saikrishna Arcot
8970425a75 Fix calculation of $(1)_DEP_PKGS_SHA in Makefile.cache (#10764)
In Makefile.cache, for $(1)_DEP_PKGS_SHA, the intention is to include
the DEP_MOD_SHA and MOD_HASH of each of the current package's
dependencies. However, there's a level of dereferencing missing; instead
of grabbing the value of $(dfile)_DEP_MOD_SHA, it is literally using the
variable name $(dfile)_DEP_MOD_SHA. This means that the value of this
variable will not change when some dependency changes.

The impact of this is in transitive dependencies. For a specific
example, if there is some change in sairedis, then sairedis will be
rebuilt (because there's a change within that component), and swss will
be rebuilt (because it's a direct dependency), but
docker-swss-layer-buster will not get rebuilt, because only the direct
dependencies are effectively being checked, and those aren't changing.
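
The missing level of dereferencing can be illustrated outside of Make. The Python sketch below is a simplified model, not the Makefile.cache logic: it only covers DEP_MOD_SHA, and the function names and sample hash values are hypothetical. It shows that hashing variable names gives a result that never changes, while hashing their values does.

```python
import hashlib


def short_sha(text: str) -> str:
    """Return a short SHA-1 digest of the given text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]


def dep_pkgs_sha_buggy(deps, variables):
    # Bug: hashes the literal variable *names*, so the result never
    # depends on the values stored in `variables`.
    return short_sha(" ".join(f"{dep}_DEP_MOD_SHA" for dep in deps))


def dep_pkgs_sha_fixed(deps, variables):
    # Fix: dereference each name and hash the *values*.
    return short_sha(" ".join(variables[f"{dep}_DEP_MOD_SHA"] for dep in deps))


# Hypothetical values: the swss hash changes because sairedis changed.
before = {"swss_DEP_MOD_SHA": "aaa111"}
after = {"swss_DEP_MOD_SHA": "bbb222"}

buggy_changed = dep_pkgs_sha_buggy(["swss"], before) != dep_pkgs_sha_buggy(["swss"], after)
fixed_changed = dep_pkgs_sha_fixed(["swss"], before) != dep_pkgs_sha_fixed(["swss"], after)
```

In this model `buggy_changed` is false and `fixed_changed` is true, which mirrors why docker-swss-layer-buster was not rebuilt before the fix.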

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-05-10 06:44:45 +00:00
xumia
951d93e362 Reduce image size for lazy installation packages (#10775)
Why I did it
The image is too large when there are multiple lazy packages and multiple platforms. It is not necessary to keep multiple copies of the lazy installation packages.
For the Cisco image, the image size drops from 3.5G to 1.7G.

How I did it
Use symbolic links to keep only one copy of each lazy package.
Make a new folder fsroot/platform/common.
Copy the lazy packages into the folder.
For each platform that uses a package, such as x86_64-grub, x86_64-8800_rp-r0, x86_64-8201_on-r0, etc., only make a symbolic link to the package in the common folder.
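
A minimal Python sketch of this layout follows. It is not the build script itself: the helper names and the package file name `lazy-package.deb` are hypothetical, and the use of a relative link target is an assumption of the sketch.

```python
import os
import tempfile


def link_lazy_package(fsroot: str, platform: str, package_name: str) -> str:
    """Create platform/<platform>/<package> as a symlink into platform/common."""
    common_dir = os.path.join(fsroot, "platform", "common")
    platform_dir = os.path.join(fsroot, "platform", platform)
    os.makedirs(platform_dir, exist_ok=True)
    link_path = os.path.join(platform_dir, package_name)
    # Relative target, so the link does not depend on where fsroot is mounted.
    target = os.path.relpath(os.path.join(common_dir, package_name), platform_dir)
    os.symlink(target, link_path)
    return link_path


def demo():
    """Store one package copy and link it from two platforms."""
    with tempfile.TemporaryDirectory() as fsroot:
        common_dir = os.path.join(fsroot, "platform", "common")
        os.makedirs(common_dir)
        with open(os.path.join(common_dir, "lazy-package.deb"), "w") as handle:
            handle.write("package payload")
        results = []
        for platform in ("x86_64-8800_rp-r0", "x86_64-8201_on-r0"):
            link = link_lazy_package(fsroot, platform, "lazy-package.deb")
            with open(link) as handle:
                results.append((os.path.islink(link), handle.read()))
        return results
```

Each platform folder then holds only a link, so the package payload is stored once regardless of how many platforms use it.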
2022-05-10 06:44:40 +00:00
Shilong Liu
a296267097 [ci] Support multi tags when pushing docker image (#10771) 2022-05-10 06:44:35 +00:00
Qi Luo
be5eb80b14
[202012] Fix tagged VlanInterface if attached to multiple vlan as untagged member (#10589)
Backport https://github.com/Azure/sonic-buildimage/pull/8927 to 202012 branch
2022-05-09 14:07:02 -07:00
Sudharsan Dhamal Gopalarathnam
502ddbb249
[202012][caclmgrd]Added logic to allow BFD port numbers (#10740)
* [caclmgrd]Added logic to allow BFD port numbers
2022-05-06 10:38:05 -07:00
Sudharsan Dhamal Gopalarathnam
2a232730b0
[202012][Mellanox] Update SDK/FW to 4.5.1500/2010.1500 and SAI version to 1.21.1.2 (#10464)
* [Mellanox] Update SDK/FW to 4.5.1500/2010.1500 and SAI version to 1.21.0.1

Signed-off-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>

* Updating Switch-SDK-drivers submodule pointer

* Updating SAI version
2022-05-04 06:07:10 +03:00
Qi Luo
9b55564289
[sonic-snmpagent] Update submodule (#10730)
Include below commits:
```
c75440b 2022-05-02 | Fix: not to use blocking get_all() after keys() (#255) [Qi Luo]
```
2022-05-02 23:46:33 -07:00
kellyyeh
96c0d8a7f8 [dhcp6relay] Add retry mechanism for binding socket to interface ipv6 addresses (#10712) 2022-05-03 00:42:49 +00:00
vmittal-msft
7b7737ef0f Adjustment to ingress pool size to accomodate brcm sai (#10694) 2022-05-03 00:42:27 +00:00
xumia
c70a35dda3 Fix the build target error when building sonic-rest-api (#10693)
Why I did it
Fix the issue where the target target/debs/bullseye/sonic-rest-api_1.0.1_arm64.deb does not exist; the correct target is target/debs/bullseye/sonic-rest-api_1.0.1_armhf.deb.
Fix issue: #9896

[ FAIL LOG START ] [ target/debs/stretch/sonic-rest-api_1.0.1_amd64.deb ]
[ REASON ] :      target/debs/stretch/sonic-rest-api_1.0.1_amd64.deb does not exist   NON-EXISTENT PREREQUISITES: 
[ FLAGS  FILE    ] : []
2022-05-03 00:42:21 +00:00
Jing Zhang
3da032766e
[sonic-linkmgrd][202012] submodule update (#10703)
[sonic-linkmgrd][202012] submodule update

3523738 Jing Zhang      Sun Apr 3 20:54:40 2022 -0700   Reset link prober state when default route is back #56
8282e78 Jing Zhang      Fri Apr 15 15:59:34 2022 -0700  Keep incrementing sequence number when link prober is suspended and shutdown #55 (#65)
8246eb8 Jing Zhang      Thu Apr 14 18:49:36 2022 -0700  Shutdown ICMP heartbeats when default route state is missing and ToR is in auto mode #44 (#59)

sign-off: Jing Zhang zhangjing@microsoft.com
2022-05-02 09:37:14 -07:00
Vaibhav Hemant Dixit
26055cf46e
[submodule]: update sonic-utilities submodule (#10713)
[202012][dualtor] Fix config_db.json path for config-reload
2022-04-30 10:40:27 -07:00
Nikola Dancejic
602c8e99dc
[device config] Adding configuration for default route fallback (#10692)
Set sai_tunnel_underlay_route_mode attribute to fallback to default
route if more specific route is unavailable.
Signed-off-by: Nikola Dancejic <ndancejic@microsoft.com>
2022-04-29 16:20:18 -07:00
bingwang-ms
a07930bead
Update submodule sonic-sws-common (#10707)
Signed-off-by: bingwang <bingwang@microsoft.com>
2022-04-29 15:37:26 +08:00
Samuel Angebault
705d3c0804 [Arista] Remove arista.log from rsyslog default logrotate (#9731)
Why I did it
In parallel with this change, Arista added a custom logrotate configuration as part of its driver library.
Having two logrotate configurations for the same log file triggers an issue.

Fixes aristanetworks/sonic#38

How I did it
Arista merged a few changes in sonic-buildimage which added a logrotate configuration aristanetworks/sonic@e43c797
It is therefore the right path to remove the arista.log line from the logrotate.d/rsyslog configuration.

How to verify it
Logrotate works without any error message, Arista log rotation happens, and Arista daemons still append logs after the file is truncated.
2022-04-28 23:58:41 +00:00
mssonicbld
1c9cdc4c7a
[ci/build]: Upgrade SONiC package versions (#10594) 2022-04-27 15:25:14 +00:00
Taylor Cai
1aeb658964 Fix issue test_crm and test_fib (#10585)
Why I did it
Fix issue (https://github.com/Azure/sonic-buildimage/issues/9171) and (https://github.com/Azure/sonic-buildimage/issues/9236)

How I did it
Add a flag in the config file to get the correct count of IPv6 entries.
Add an init config file to set the IPv4 ECMP hash on L4.

How to verify it
Compile the sonic_platform wheel for e1031, then upload to device and install the wheel, verify using testbed.
2022-04-26 17:40:47 +00:00
Vaibhav Hemant Dixit
4dcf6c3dc9
Advanced sonic-sairedis submodule (#10684) 2022-04-26 10:07:45 -07:00
xumia
6ad9daded3
[Submodule]: update submodule for sonic-restapi (#10679)
Why I did it
Update submodule sonic-restapi
e83e0e8 Fix Ctype_char larger than address space issue in 32-bit armhf (#107)
2022-04-26 17:54:51 +08:00
Shilong Liu
cc591039b3
[submodule] Update submodule for sonic-mgmt-common (#10666) 2022-04-25 17:04:37 +08:00
dflynn-Nokia
44ec8372a4
[Nokia ixs7215] Platform API temperature threshold value fixes (#10533)
Incorrect high-threshold and critical-high-threshold values are displayed for
some of the temperature sensors. This commit fixes that.

Co-authored-by: Qi Luo <qiluo-msft@users.noreply.github.com>
Co-authored-by: Jing Kan <jika@microsoft.com>
2022-04-25 09:28:13 +08:00
Shilong Liu
48f5c0ebff
[CG] Fix CG alert about underscore version. (#10606)
Fix CG CVE-2021-23358
2022-04-24 19:18:55 +08:00
Shilong Liu
5779a92d99
[ci] Fix PR checker archieve artifacts step (#9357) (#10652)
Why I did it
When a failed job is retried, publishing the artifact fails because of a duplicated name.
2022-04-23 13:57:50 +08:00
xumia
55a6faf925 [Ci]: Support to sign image for cisco-8000 uefi secure boot (#10616)
Why I did it
[Ci]: Support to sign image for cisco-8000 uefi secure boot
2022-04-21 22:00:47 +00:00
yozhao101
e6c18fa6dd [Monit] Fix the issue which shows Monit can not reset its counter. (#10288)
Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Why I did it
This PR aims to fix the Monit issue which shows Monit can't reset its counter when monitoring memory usage of telemetry container.

Specifically the Monit configuration file related to monitoring memory usage of telemetry container is as following:

  check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
      if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
If memory usage of telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), then it will be restarted.
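
The threshold and the "10 times within 20 cycles" condition can be modeled with a short Python sketch. This is a simplified model under stated assumptions, not Monit's implementation: it assumes exit status 3 means "over the limit", as the rule above implies, and the function name is hypothetical.

```python
from collections import deque

# 419430400 bytes is exactly 400 MiB.
LIMIT_BYTES = 400 * 1024 * 1024


def should_restart(exit_statuses, failures_needed=10, window=20, failing_status=3):
    """Return True once `failures_needed` failing checks fall within the
    last `window` cycles (one exit status per cycle)."""
    recent = deque(maxlen=window)  # keeps only the most recent `window` cycles
    for status in exit_statuses:
        recent.append(status == failing_status)
        if sum(recent) >= failures_needed:
            return True
    return False
```

For example, ten consecutive failing checks trigger the restart, while one failing check in every three cycles never reaches ten failures inside any 20-cycle window.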
Recently we observed, after telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted anymore during this 1 hour sliding window.

The reason is that Monit could not reset its counter and start counting again. Monit resets its counter if and only if the status of the monitored service changes from Status failed to Status ok. However, during this 1 hour sliding window, the status of the monitored service did not change from Status failed to Status ok.

Currently for each service monitored by Monit, there will be an entry showing the monitoring status, monitoring mode etc. For example, the following output from command sudo monit status shows the status of monitored service to monitor memory usage of telemetry:

    Program 'container_memory_telemetry'
         status                             Status ok
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Sat, 19 Mar 2022 19:56:26
Every 1 minute, Monit runs the script to check the memory usage of telemetry and updates the counter if memory usage is larger than 400MB. If Monit checks the counter and finds that memory usage of telemetry was larger than 400MB for 10 times within 20 minutes, the telemetry container is restarted. Following is an example status of the monitored service:

    Program 'container_memory_telemetry'
         status                             Status failed
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Tue, 01 Feb 2022 22:52:55
After the telemetry container was restarted, we found that memory usage of telemetry increased rapidly from around 100MB to more than 400MB within 1 minute, so the status of the monitored service did not have a chance to change from Status failed to Status ok.

How I did it
To provide a workaround for this issue, Monit recently introduced another syntax format, repeat every <n> cycles, related to exec. This new syntax lets Monit repeat executing the background script if the error persists for a given number of cycles.

How to verify it
I verified this change on lab device str-s6000-acs-12. Another pytest PR (Azure/sonic-mgmt#5492) is submitted in sonic-mgmt repo for review.
2022-04-21 22:00:42 +00:00
vmittal-msft
fcf5dcf5eb Changes to support topology and port speed agnostic switch init for TD3 based platforms (#10587) 2022-04-21 22:00:38 +00:00
xumia
de46150430 [Build]: Fix pip version constraint conflict issue (#10525)
Why I did it
[Build]: Fix pip version constraint conflict issue
When a version is specified in the constraint file and the build script upgrades that package to a different version, the two conflict.

How I did it
If a version is specified on the pip command line, the version constraint for that package is skipped.
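
One way to picture this rule is a filter over the constraint lines. The Python sketch below is an assumption about the approach, not the actual build script: the function names are hypothetical, only `name==version` pins are recognized on the command line, and names are compared after a PEP 503 style normalization.

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name: lowercase, runs of -, _ and . become -."""
    return re.sub(r"[-_.]+", "-", name).lower()


def pinned_names(requirements):
    """Names of packages given as `name==version` on the command line."""
    names = set()
    for requirement in requirements:
        match = re.match(r"^([A-Za-z0-9._-]+)==", requirement)
        if match:
            names.add(normalize(match.group(1)))
    return names


def filter_constraints(constraint_lines, requirements):
    """Drop constraint lines for packages already pinned on the command line."""
    skip = pinned_names(requirements)
    kept = []
    for line in constraint_lines:
        name = re.split(r"[=<>!~ ]", line.strip(), maxsplit=1)[0]
        if normalize(name) not in skip:
            kept.append(line)
    return kept
```

With constraints `["pip==20.0", "wheel==0.35.1"]` and a command line that requests `pip==21.1`, only the `wheel` constraint is kept, so the explicit upgrade no longer conflicts.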
2022-04-21 22:00:33 +00:00
Shilong Liu
c49206a884 Fix docker-sonic-mgmt reproducible related issue. (#9647)
Reproducible build script breaks docker-sonic-mgmt build.
2022-04-21 22:00:21 +00:00
Samuel Angebault
9de6b2ca12
[Arista] Fix arista-net initramfs hook (#10626)
The interface renaming logic fails if one interface is missing.
Because of the `set -e` the whole initramfs hook would abort early on
error.
This change fixes the current behavior to make sure missing interfaces
are properly skipped and existing interfaces are renamed.
2022-04-20 10:03:37 -07:00
Jing Kan
9e5c017ab5
[202012][submodule] Advance sonic-utilities pointer (#10612)
Signed-off-by: Jing Kan jika@microsoft.com
2022-04-20 08:09:47 +08:00
Jing Kan
4ee75f490e
[202012][copp_cfg] Enable dhcp trap for BmcMgmtToRRouter (#10596)
Signed-off-by: Jing Kan jika@microsoft.com
2022-04-19 15:59:20 +08:00
Stepan Blyshchak
fa1e364f54
[services] kill container on stop in warm/fast mode (#10511)
To optimize stop on warm boot, added kill for containers

Use service "kill" in the shutdown path for fast and warm reboot. For all other reload methods, service "stop" is used.
This is done to save time in shutdown path, and to overall improve the time spent in warm and fast reload.

How - Use service_mgmt.sh to trigger common logic to initiate kill (fast/warm) or stop (cold) for database.sh, radv.sh, snmp.sh, telemetry.sh, mgmt-framework.sh
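
The selection between kill and stop can be sketched as follows. This Python model is hypothetical and only mirrors the rule stated above; the real logic is in service_mgmt.sh, and the function names and reboot type strings here are assumptions.

```python
# Reboot types that take the faster "kill" path (per the description above).
FAST_SHUTDOWN_TYPES = {"warm", "fast"}

# Service names derived from the scripts listed in the commit message.
SERVICES = ["database", "radv", "snmp", "telemetry", "mgmt-framework"]


def shutdown_action(reboot_type: str) -> str:
    """Return "kill" for warm/fast reboot, "stop" for every other reload."""
    return "kill" if reboot_type in FAST_SHUTDOWN_TYPES else "stop"


def plan_shutdown(reboot_type: str, services=SERVICES):
    """Pair each service with the action chosen for this reboot type."""
    action = shutdown_action(reboot_type)
    return [(service, action) for service in services]
```

Keeping the decision in one place means each service script only asks for the action instead of repeating the reboot type check.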

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>, Vaibhav H D <vaibhav.dixit@microsoft.com>
2022-04-18 14:27:48 -07:00
Vivek R
85447401c7
[202012] [submodule] Advance sonic-snmpagent pointer (#10584)
414692f LLDPLocalSystemDataUpdater Exception Log Handled (#249)

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
2022-04-18 10:42:05 +03:00
Ying Xie
6af3de4372
[202012][copp cfg] enable dhcp trap for a couple more devices (#10582)
* [copp cfg] enable copp trap for a couple more devices

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2022-04-15 11:47:02 -07:00
Jing Zhang
9fd75ffd9d
[202012][sonic-linkmgrd] Submodule Update (#10345)
[202012][sonic-linkmgrd]Submodule update

8507629 Jing Zhang      Mon Apr 4 10:25:22 2022 -0700   Lower unsolicited MUX state change notification log level to WARNING #57
17d217d Longxiang Lyu   Mon Mar 21 12:15:19 2022 +0800  Enhance clang format (#46)
c72fa2a Jing Zhang      Fri Apr 1 12:23:29 2022 -0700   Disable the feature that decreases link probe interval for measuring switch overhead #49 (#54)
256b01b Jing Zhang      Thu Mar 31 16:20:00 2022 -0700  Update link prober metrics posting logics #50 #53
dfd48d0 Jing Zhang      Wed Mar 23 16:27:45 2022 -0700  Decrease link probing interval after switchover to better determine the overhead of a toggle #43 (#48)

sign-off: Jing Zhang zhangjing@microsoft.com
2022-04-14 11:42:22 -07:00
Richard.Yu
6ccc458d2b
[CG-Fix-CVE-2021-44906] Patching on thrift.0.13.0 for package minimist (#10554)
* [CG-Fix-CVE-2021-44906] Patching on thrift.0.13.0 for package minimist

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>

* add more information in patch

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
2022-04-14 06:46:19 -07:00
Saikrishna Arcot
29b6f62902
[202012] Run tune2fs during initramfs instead of image install (#10558)
If it is run during image install, it's not guaranteed that the
installation environment will have tune2fs available. Therefore, run it
during initramfs instead.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-04-12 19:59:24 -07:00
kellyyeh
6e17ef311a [dhcp_relay] Remove dhcp6mon (#10467) 2022-04-12 18:39:19 +00:00
Sudharsan Dhamal Gopalarathnam
234d0ab241 [containerd]Fixing container commands when mode is local and state is disabled (#9986)
Why I did it
During warm-reboot and fast-reboot the below error logs appear
Feb 3 22:05:15.187408 r-lionfish-13 ERR container: docker cmd: kill for nat failed with 404 Client Error for http+docker://localhost/v1.41/containers/nat/json: Not Found ("No such container: nat")

The container command when called for local mode doesn't check if it is enabled before calling docker kill which throws the above errors.
b6ca76b482/scripts/fast-reboot (L699)

How I did it
Checking feature state if local mode and returning error exit code along with valid debug message.

How to verify it
Manually tested with warm-reboot and fast-reboot
Added UT to verify it.
2022-04-12 18:39:13 +00:00
Sudharsan Dhamal Gopalarathnam
d27df5d145
[202012] [submodule] Advance sonic-swss pointer (#10540)
Includes the below commits
f3b2873 [BFD]Retry create BFD with different source UDP port on failure

Signed-off-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>
2022-04-12 10:23:09 +03:00
Qi Luo
3fa538e58c
Revert "[ci] Set default ACR in UpgrateVersion/PR/official pipeline. (#10341)" (#10535)
This reverts commit f4bbcd1cf1. The original one was missing one file, ".azure-pipelines/azure-pipelines-repd-build-variables.yml", and broke the Azure pipeline.
2022-04-11 23:53:42 -07:00
xumia
fc727f0538 [Ci]: check if there is a sonic dirty version issue (#10445)
Why I did it
[Ci]: check if there is a sonic dirty version issue
If there is a dirty version issue in a PR build, the build will fail.
2022-04-11 23:10:06 +00:00
Rajkumar-Marvell
589234a48c
[Marvell] Marvell armhf SAI debian. (#10526)
Fixed IPv6 route issue resulting in orchagent crash.
Signed-off-by: Rajkumar Pennadam Ramamoorthy <rpennadamram@marvell.com>
2022-04-11 14:00:46 +08:00
Kevin Wang
a65916449b
Update cisco-8000 ref to release: 202012-v0.97 (#10522)
Signed-off-by: Kevin(Shengkai) Wang <shengkaiwang@microsoft.com>
2022-04-11 08:59:56 +08:00
mssonicbld
e0fa07307a
[ci/build]: Upgrade SONiC package versions (#10395)
[ci/build]: Upgrade SONiC package versions (#10395)
2022-04-10 17:00:00 +08:00
Kebo Liu
1b42dbfdd2
[submodule] [202012] Advance sonic-platform-common pointer (#10502)
Update sonic-platform-common submodule to pick up new commits:

cd623fa [202012] Backport Enhance ssd_generic with more error handling to avoid python crash (#273)
e9a4a81 [y_cable][Broadcom] update the BRCM y_cable driver to release 2.0 (#263)
2022-04-08 12:56:59 +03:00
Shilong Liu
f4bbcd1cf1 [ci] Set default ACR in UpgrateVersion/PR/official pipeline. (#10341)
Why I did it
Docker Hub limits the pull rate.
Use ACR instead to pull Debian-related docker images.

How I did it
Set DEFAULT_CONTAINER_REGISTRY in pipeline.
2022-04-08 11:19:10 +08:00