Commit Graph

1306 Commits

Author SHA1 Message Date
judyjoseph
efeae03ea3
Add override_config to load_minigraph in config-setup service (#14834)
This PR is to handle the override minigraph config by golden_config_db.json file if it is present in the backup location.
2023-05-10 11:54:33 -07:00
Zain Budhwani
a738c39328
Add fix to monit_regex.json for catching mem_usage and cpu_usage (#14954)
Why I did it
Current regex not able to capture logs, modify regex to capture syslog messages

Work item tracking
Microsoft ADO (number only): 13366345
How I did it
Code change

How to verify it
sonic-mgmt test case
2023-05-08 11:48:17 -07:00
Ying Xie
72c52bc677
Revert "Clear DNS configuration received from DHCP during networking reconfiguration in Linux. (#13516)" (#14902)
This reverts commit c7ecd92c54.
2023-05-01 17:12:38 -07:00
mssonicbld
80c5ab4a4a
[ci/build]: Upgrade SONiC package versions (#14896) 2023-05-01 18:10:48 +08:00
mssonicbld
0d709a3655
[ci/build]: Upgrade SONiC package versions (#14888) 2023-04-29 17:42:19 +08:00
Tejaswini Chadaga
ca224863cb
Changes to support TSA from supervisor (#14691)
Why I did it
Support for SONIC chassis isolation using TSA and un-isolation using TSB from supervisor module

Work item tracking
Microsoft ADO (number only): 17826134
How I did it
When TSA is run on the supervisor, it triggers TSA on each of the linecards using the secure rexec infrastructure introduced in sonic-net/sonic-utilities#2701. User password is requested to allow secure login to linecards through ssh, before execution of TSA/TSB on the linecards

TSA of the chassis withdraws routes from all the external BGP neighbors on each linecard, in order to isolate the entire chassis. No route withdrawal is done from the internal BGP sessions between the linecards to prevent transient drops during internal route deletion. With these changes, complete isolation of a single linecard using TSA will not be possible (a separate CLI/script option will be introduced at a later time to achieve this)

Changes also include no-stats option with TSC for quick retrieval of the current system isolation state

This PR also reverts changes in #11403

How to verify it
These changes have a dependency on sonic-net/sonic-utilities#2701 for testing

Run TSA from supervisor module and ensure transition to Maintenance mode on each linecard
Verify that all routes are withdrawn from eBGP neighbors on all linecards
Run TSB from supervisor module and ensure transition to Normal mode on each linecard
Verify that all routes are re-advertised from eBGP neighbors on all linecards
Run TSC no-stats from supervisor and verify that just the system maintenance state is returned from all linecards
2023-04-28 16:28:06 +08:00
Stephen Sun
9e56fea091
Temporary WA for the issue that asic_table.json can not be rendered (#13888)
- Why I did it
We suspect the issue #13791 is caused by redis server being temporarily unavailable during system initialization so we do not use -d in sonic-cfggen, for now, to avoid accessing redis server

- How I did it
Provide a string containing required json data when calling sonic-cfggen

- How to verify it
Manually test it

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2023-04-24 17:02:35 +03:00
mssonicbld
5ad844f185 [ci/build]: Upgrade SONiC package versions 2023-04-24 18:33:06 +08:00
mssonicbld
81a557885b
[ci/build]: Upgrade SONiC package versions (#14799) 2023-04-22 17:47:40 +08:00
mssonicbld
d006219e2d
[ci/build]: Upgrade SONiC package versions (#14718) 2023-04-19 18:59:16 +08:00
Aryeh Feigin
039a9c998a
[Fast-boot] Clear teamd-timer when finalizing fast-reboot (#14583)
Part of sonic-net/sonic-utilities#2760
Similar to #14295

- Why I did it
To clear teamd timer when fast-reboot is finalized to prevent any further affect.

- How I did it
Deleted teamd timer from config-db in fast-reboot finalizer.
config save call is moved to after clearing teamd-timer so it won't have any further affect as well.

- How to verify it
Verified manually that entry was deleted after fast-reboot was finailized.
2023-04-18 09:15:42 +03:00
Stepan Blyshchak
d73c810e86
[image_config] add rasdaemon.timer (#14300)
rasdaemon is a tool to log hardware errors. It takes 100% CPU during
boot for a few seconds. It impacts fast/warm boot by delaying control
plane restoration for 5 sec on some platforms.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2023-04-17 08:58:45 -07:00
mssonicbld
7f262d71da
[ci/build]: Upgrade SONiC package versions (#14685) 2023-04-17 19:58:43 +08:00
mssonicbld
49dbaeb649
[ci/build]: Upgrade SONiC package versions (#14672) 2023-04-15 18:21:50 +08:00
Sudharsan Dhamal Gopalarathnam
2804998766
[config reload]Config Reload Enhancement (#13969)
#### Why I did it
Implementing code changes for https://github.com/sonic-net/SONiC/pull/1203

#### How I did it
Removed the timers and delayed target since the delayed services would start based on event driven approach.
Cleared port table during config reload and cold reboot scenario.
Modified yang model, init_cfg.json to change has_timer to delayed

#### How to verify it
Running regression
2023-04-12 11:20:03 -07:00
mssonicbld
f9eb849d75
[ci/build]: Upgrade SONiC package versions (#14620) 2023-04-12 20:05:30 +08:00
anamehra
f34360f101
chassis-packet: resolve the missing static routes (#14593)
Why I did it
Fixes #14179
chassis-packet: missing arp entries for static routes causing high orchagent cpu usage

It is observed that some sonic-mgmt test case calls sonic-clear arp, which clears the static arp entries as well. Orchagent or arp_update process does not try to resolve the missing arp entries after clear.

How I did it
arp_update should resolve the missing arp/ndp static route
entries. Added code to check for missing entries and try ping if any
found to resolve it.

How to verify it
After boot or config reload, check ipv4 and ipv4 neigh entries to make sure all static route entries are present
manual validation:
Use sonic-clear arp and sonic-clear ndp to clear all neighbor entries
run arp_update
Check for neigh entries. All entries should be present.
Testing on T0 setup route/for test_static_route.py

The test set the STATIC_ROUTE entry in conifg db without ifname:
sonic-db-cli CONFIG_DB hmset 'STATIC_ROUTE|2.2.2.0/24' nexthop 192.168.0.18,192.168.0.25,192.168.0.23

"STATIC_ROUTE": {
    "2.2.2.0/24": {
        "nexthop": "192.168.0.18,192.168.0.25,192.168.0.23"
    }
},
Validate that the arp_update gets the proper ARP_UPDATE_VARDS using arp_update_vars.j2 template from config db and does not crash:

{ "switch_type": "", "interface": "", "pc_interface" : "PortChannel101 PortChannel102 PortChannel103 PortChannel104 ", "vlan_sub_interface": "", "vlan" : "Vlan1000", "static_route_nexthops": "192.168.0.18 192.168.0.25 192.168.0.23 ", "static_route_ifnames": "" }

validate route/test_static_route.py testcase pass.
2023-04-12 15:07:42 +08:00
xumia
f1fd42558a
Support to add SONiC OS Version in device info (#14601)
Why I did it
Support to add SONiC OS Version in device info.
It will be used to display the version info in the SONiC command "show version". The version is used to do the FIPS certification. We do not do the FIPS certification on a specific release, but on the SONiC OS Version.

SONiC Software Version: SONiC.master-13812.218661-7d94c0c28
SONiC OS Version: 11
Distribution: Debian 11.6
Kernel: 5.10.0-18-2-amd64
How I did it
2023-04-12 09:20:08 +08:00
mssonicbld
4e5c8988b1
[ci/build]: Upgrade SONiC package versions (#14586) 2023-04-10 18:10:37 +08:00
Aryeh Feigin
41a9813018
Finalize fast-reboot in warmboot finalizer (#14238)
- Why I did it
To solve an issue with upgrade with fast-reboot including FW upgrade which has been introduced since moving to fast-reboot over warm-reboot infrastructure.
As well, this introduces fast-reboot finalizing logic to determine fast-reboot is done.

- How I did it
Added logic to finalize-warmboot script to handle fast-reboot as well, this makes sense as using fast-reboot over warm-reboot this script will be invoked. The script will clear fast-reboot entry from state-db instead of previous implementation that relied on timer. The timer could expire in some scenarios between fast-reboot finished causing fallback to cold-reboot and possible crashes.

As well this PR updates all services/scripts reading fast-reboot state-db entry to look for the updated value representing fast-reboot is active.

- How to verify it
Run fast-reboot and check that fast-reboot entry exists in state-db right after startup and being cleared as warm-reboot is finalized and not due to a timer.
2023-04-09 16:59:15 +03:00
mssonicbld
e32624d362
[ci/build]: Upgrade SONiC package versions (#14571) 2023-04-08 18:00:30 +08:00
Stephen Sun
152148fb81
Enhance the error message output mechanism (#14384)
#### Why I did it

Enhance the error message output mechanism during swss docker creating

#### How I did it

Capture the output to stderr of `sonic-cfggen` and output it using `echo` to make sure the error message will be logged in syslog.

#### How to verify it

Manually test
2023-04-07 14:23:35 -07:00
Devesh Pathak
d74055e12c
Increase wait_for_tunnel() timeout to 90s (#14279)
Why I did it
Orchagent sometimes take additional time to execute Tunnel tasks. This cause write_standby script to error out and mux state machines are not initialized. It results in show mux status missing some ports in output.

Mar 13 20:36:52.337051 m64-tor-0-yy41 INFO systemd[1]: Starting MUX Cable Container...
Mar 13 20:37:52.480322 m64-tor-0-yy41 ERR write_standby: Timed out waiting for tunnel MuxTunnel0, mux state will not be written
Mar 13 20:37:58.983412 m64-tor-0-yy41 NOTICE swss#orchagent: :- doTask: Tunnel(s) added to ASIC_DB.
How I did it
Increase timeout from 60s to 90s

How to verify it
Verified that mux state machine is initialized and show mux status has all needed ports in it.
2023-04-07 11:30:58 +08:00
Ying Xie
737d0e57ad
[write standby] force DB connections to use unix socket to connect (#14524)
Why I did it
At service start up time, there are chances that the networking service is being restarted by interface-config service. When that happens, write_standby could fail to make DB connections due to loopback interface is being reconfigured.

How I did it
Force the db connector to use unix socket to avoid loopback reconfig timing window.

How to verify it
Run config reload test 20+ times and no issue encountered.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* use unix socket instead

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2023-04-06 13:54:56 -07:00
Ye Jianquan
6c04ed987d
Revert "chassis-packet: resolve the missing static routes (#14230)" (#14544)
This reverts commit a8f8ea3b50.
2023-04-06 10:36:10 -07:00
mssonicbld
41c46aedf6
[ci/build]: Upgrade SONiC package versions (#14528) 2023-04-05 18:36:57 +08:00
Ying Xie
d3f3ac6411
Delay mux/sflow/snmp timer after interface-config service (#14506)
Why I did it
All these 3 services started after swss service, which used to start after interface-config service. But #13084 remove the time constraints for swss.

After that, these 3 services has the chance of start earlier when the inteface-config service is restarting the networking service, which could cause db connect request to fail.

How I did it
Delay mux/sflow/snmp timer after the interface-config service.

How to verify it
PR test.
Config reload can repro the issue in 1-3 retries. With this change. config reload run 30+ iterations without hitting the issue.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2023-04-04 16:23:00 -07:00
mssonicbld
884dfa5427
[ci/build]: Upgrade SONiC package versions (#14498) 2023-04-03 18:34:35 +08:00
mssonicbld
66d3586fd4
[ci/build]: Upgrade SONiC package versions (#14487) 2023-04-01 18:45:34 +08:00
anamehra
a8f8ea3b50
chassis-packet: resolve the missing static routes (#14230)
arp_update should resolve the missing arp/ndp static route
entries. Added code to check for missing entries and try ping to
resolve the missing entry.

Why I did it
Fixes #14179

chassis-packet: missing arp entries for static routes causing high orchagent cpu usage

It is observed that some sonic-mgmt test case calls sonic-clear arp, which clears the static arp entries as well. Orchagent or arp_update process does not try to resolve the missing arp entries after clear.

How I did it
arp_update should resolve the missing arp/ndp static route
entries. Added code to check for missing entries and try ping if any
found to resolve it.

How to verify it
After boot or config reload, check ipv4 and ipv4 neigh entries to make sure all static route entries are present
manual validation:
Use sonic-clear arp and sonic-clear ndp to clear all neighbor entries
run arp_update
Check for neigh entries. All entries should be present.

Signed-off-by: anamehra <anamehra@cisco.com>
2023-03-29 09:53:32 -07:00
mssonicbld
6e11833a6c
[ci/build]: Upgrade SONiC package versions (#14430) 2023-03-29 18:39:10 +08:00
Hua Liu
4c059d8eb5
Improve sudo cat command for RO user. (#14428)
Improve sudo cat command for RO user.

#### Why I did it
RO user can use sudo command show none syslog files.

#### How I did it
Improve sudo cat command for RO user.

#### How to verify it
Pass all UT.
Manually check fixed code work correctly.

#### Description for the changelog
Improve sudo cat command for RO user.
2023-03-27 17:08:14 -07:00
oleksandrx-kolomeiets
4da51b07ad
Set owner after restoring counters folder during warmboot (#13507)
Why I did it
After warm reboot, show environment prints the following error:
failed to import plugin show.plugins.macsec: [Errno 13] Permission denied: '/tmp/cache/macsec'

How I did it
Set owner back to admin after restoring counters folder.

How to verify it
sudo warm-reboot, then ensure show environement does not print errors.

Signed-off-by: Oleksandr Kolomeiets <oleksandrx.kolomeiets@intel.com>
2023-03-27 10:32:07 -07:00
mssonicbld
fb6e37819b
[ci/build]: Upgrade SONiC package versions (#14414) 2023-03-25 19:21:56 +08:00
mssonicbld
20f1ab8203
[ci/build]: Upgrade SONiC package versions (#14383) 2023-03-22 19:34:21 +08:00
mssonicbld
4429bdd091
[ci/build]: Upgrade SONiC package versions (#14354) 2023-03-21 01:10:17 +08:00
xumia
7209666374
[Security] Fix some of vulnerability issue relative python packages (#14269)
Why I did it
Fix some of vulnerability issue relative python packages #14269
Pillow: [CVE-2021-27921]
Wheel: [CVE-2022-40898]
lxml: [CVE-2022-2309]

How I did it
2023-03-20 14:15:45 +08:00
mssonicbld
1e8e993a94 [ci/build]: Upgrade SONiC package versions 2023-03-20 09:00:28 +08:00
mssonicbld
89ebd43c81
[ci/build]: Upgrade SONiC package versions (#14311)
Upgrade SONiC Versions
2023-03-19 10:16:41 +08:00
Dev Ojha
de17f72d9a
[Buffer] Added cable length config to buffer config template for EdgeZoneAggregator (#14280)
Why I did it
SONiC currently does not identify 'EdgeZoneAggregator' neighbor. As a result, the buffer profile attached to those interfaces uses the default cable length which could cause ingress packet drops due to insufficient headroom. Hence, there is a need to update the buffer templates to identify such neighbors and assign the same cable length as used by the T1.

How I did it
Modified the buffer template to identify EdgeZoneAggregator as a neighbor device type and assign it the same cable length as a T1/leaf router.

How to verify it
Unit tests pass, and manually checked on a 7260 to see the changes take effect.

Signed-off-by: dojha <devojha@microsoft.com>
2023-03-17 11:01:17 -07:00
mssonicbld
96817c4357
[ci/build]: Upgrade SONiC package versions (#14102)
Upgrade SONiC Versions
2023-03-17 10:12:30 +08:00
Neetha John
f30fb6ec58
[storage_backend] Add backend acl service (#14229)
Why I did it
This PR addresses the issue mentioned above by loading the acl config as a service on a storage backend device

How I did it
The new acl service is a oneshot service which will start after swss and does some retries to ensure that the SWITCH_CAPABILITY info is present before attempting to load the acl rules. The service is also bound to sonic targets which ensures that it gets restarted during minigraph reload and config reload

How to verify it
Build an image with the following changes and did the following tests

Verified that acl is loaded successfully on a storage backend device after a switch boot up
Verified that acl is loaded successfully on a storage backend ToR after minigraph load and config reload
Verified that acl is not loaded if the device is not a storage backend ToR or the device does not have a DATAACL table

Signed-off-by: Neetha John <nejo@microsoft.com>
2023-03-16 14:18:28 -07:00
davidpil2002
8098bc4bf5
Add Secure Boot Support (#12692)
- Why I did it
Add Secure Boot support to SONiC OS.
Secure Boot (SB) is a verification mechanism for ensuring that code launched by a computer's UEFI firmware is trusted. It is designed to protect a system against malicious code being loaded and executed early in the boot process before the operating system has been loaded.

- How I did it
Added a signing process to sign the following components:
shim, grub, Linux kernel, and kernel modules when doing the build, and when feature is enabled in build time according to the HLD explanations (the feature is disabled by default).

- How to verify it
There are self-verifications of each boot component when building the image, in addition, there is an existing end-to-end test in sonic-mgmt repo that checks that the boot succeeds when loading a secure system (details below).

How to build a sonic image with secure boot feature: (more description in HLD)

Required to use the following build flags from rules/config:
SECURE_UPGRADE_MODE="dev"
SECURE_UPGRADE_DEV_SIGNING_KEY="/path/to/private/key.pem"
SECURE_UPGRADE_DEV_SIGNING_CERT="/path/to/cert/key.pem"
After setting those flags should build the sonic-buildimage.
Before installing the image, should prepared the setup (switch device) with the follow:
check that the device support UEFI
stored pub keys in UEFI DB

enabled Secure Boot flag in UEFI
How to run a test that verify the Secure Boot flow:
The existing test "test_upgrade_path" under "sonic-mgmt/tests/upgrade_path/test_upgrade_path", is enough to validate proper boot
You need to specify the following arguments:
Base_image_list your_secure_image
Taget_image_list your_second_secure_image
Upgrade_type cold
And run the test, basically the test will install the base image given in the parameter and then upgrade to target image by doing cold reboot and validates all the services are up and working correctly
2023-03-14 14:55:22 +02:00
Stepan Blyshchak
f908dfe919
[Mellanox] Place FW binaries under platform directory instead of squashfs (#13837)
Fixes #13568

Upgrade from old image always requires squashfs mount to get the next image FW binary. This can be avoided if we put FW binary under platform directory which is easily accessible after installation:

admin@r-spider-05:~$ ls /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
/host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa
admin@r-spider-05:~$ ls -al /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa
lrwxrwxrwx 1 root root 66 Feb  8 17:57 /tmp/image-fw-new-loc.0-dirty-20230208.193534-fs/etc/mlnx/fw-SPC.mfa -> /host/image-fw-new-loc.0-dirty-20230208.193534/platform/fw-SPC.mfa

- Why I did it
202211 and above uses different squashfs compression type that 201911 kernel can not handle. Therefore, we avoid mounting squashfs altogether with this change.

- How I did it
Place FW binary under /host/image-/platform/mlnx/, soft links in /etc/mlnx are created to avoid breaking existing scripts/automation.
/etc/mlnx/fw-SPCX.mfa is a soft link always pointing to the FW that should be used in current image
mlnx-fw-upgrade.sh is updated to prefer /host/image-/platform/mlnx location and fallback to /etc/mlnx in squashfs in case new location does not exist. This is necessary to do image downgrade.

- How to verify it
Upgrade from 201911 to master
master to 201911 downgrade
master -> master reboot
ONIE -> master boot (First FW burn)
Which release branch to backport (provide reason below if selected)
2023-03-06 13:36:43 +02:00
mssonicbld
506f372533
[ci/build]: Upgrade SONiC package versions (#14072)
Upgrade SONiC Versions
2023-03-05 11:29:38 +08:00
anamehra
4a93e4cfa4
Add support for platform syncd pre shutdown plugin (#13564)
Why I did it
Vendor platform may require running platform specific pre-shutdown routine before shutting down the syncd process which runs the SAI and vendor sdk instance.

How I did it
Added a platform script hook which will be executed if the plugin script is provided by the platform in device//plugins/
2023-03-03 15:53:33 -08:00
Sudharsan Dhamal Gopalarathnam
8883259673
[netlink] Increse netlink buffer size from 3MB to 16MB (#13965)
#### Why I did it
Following the PR https://github.com/sonic-net/sonic-swss-common/pull/739 increasing netlink buffer size in linux kernel
As error is seen in fdbsyncd with netlink reports "out of memory on reading a netlink socket" It is seen when kernel is sending 10k remote mac to fdbsyncd.


#### How I did it
Increase the buffer size of the netlink buffer from 3MB to 16MB


#### How to verify it
Verified with 10k remote mac, and restarting the fdbsyncd process. So that kernel send the bridge fdb dump to the fdbsyncd.
Verified that the netlink buffer error is not reported in the sys log.
2023-02-27 15:41:22 -08:00
mssonicbld
8d0d3e57ba
[ci/build]: Upgrade SONiC package versions (#13989)
Upgrade SONiC Versions
2023-02-27 13:45:49 +08:00
mssonicbld
58592e6c49
[ci/build]: Upgrade SONiC package versions (#13526)
The initial version files for the SONiC reproducible build
2023-02-25 08:16:38 +08:00
Samuel Angebault
b9dffcbaaf
[Arista] Disable SSD NCQ on Lodoga (#13964)
Why I did it
Fix similar issue seen on #13739 but only for DCS-7050CX3-32S

How I did it
Add a kernel parameter to tell libata to disable NCQ

How to verify it
The message ata2.00: FORCE: horkage modified (noncq) should appear on the dmesg.

Test results using: fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4

with NCQ

   READ: bw=26.1MiB/s (27.4MB/s), 26.1MiB/s-26.1MiB/s (27.4MB/s-27.4MB/s), io=3136MiB (3288MB), run=120053-120053msec
  WRITE: bw=26.3MiB/s (27.6MB/s), 26.3MiB/s-26.3MiB/s (27.6MB/s-27.6MB/s), io=3161MiB (3315MB), run=120053-120053msec
without NCQ

   READ: bw=22.0MiB/s (23.1MB/s), 22.0MiB/s-22.0MiB/s (23.1MB/s-23.1MB/s), io=2647MiB (2775MB), run=120069-120069msec
  WRITE: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=2665MiB (2795MB), run=120069-120069msec
2023-02-24 10:08:04 -08:00
DavidZagury
ee1b6b3751
Remove support to Mellanox SPC4 ASIC (#13932)
- Why I did it
FW for Spectrum-4 ASIC not yet available

- How I did it
Remove in Mellanox fw make files to Spectrum-4 ASIC firmware binaries.
Remove from firmware upgrade scripts to be able Spectrum-4 ASIC.

- How to verify it
Run regression test
2023-02-23 08:25:34 +02:00
Andriy Yurkiv
5ad78abea0
[Dual-ToR] add default value for ACL rule for mellanox platform (#13547)
- Why I did it
Need to add the possibility to choose between dropping packets (using ACL) on ingress or egress in Dual ToR scenario

- How I did it
Add new attribute "mux_tunnel_ingress_acl" to SYSTEM_DEFAULTS table

- How to verify it
check that new attribute exists in redis:
admin@sonic:~$ redis-cli -n 4
127.0.0.1:6379[4]> HGETALL SYSTEM_DEFAULTS|mux_tunnel_ingress_acl
1."state"
2."false"

Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
2023-02-22 20:25:54 +02:00
Marty Y. Lok
2c22d9affc
[Chassis][multiasic] Fix the sonic-db-cli core files issue on multiasic platform after the c++ implementation of sonic-db-cli (#13207)
Fixe #12047. After the c++ implementation of the sonic-db-cli, sonic-db-cli PING command tries to initialize the global database for all instances database starting. If all instance database-config.json are not ready yet. it will crash and generate core file. PR sonic-net/sonic-swss-common#701 only fix the crash and the process abortion. 

Signed-off-by: mlok <marty.lok@nokia.com>
2023-02-21 11:23:22 -08:00
Saikrishna Arcot
56d732a0a0
Use tmpfs for /var/log on Arista 7050CX3-32S (#13805)
This is to reduce writes to the SSD on the device.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-02-16 19:13:39 -08:00
Samuel Angebault
5ce1b8e4b7
[Arista] Disable ATA NCQ for a few products (#13739)
Why I did it
Some products might experience an occasional IO failure in the communication between CPU and SSD.
Based on some research it could be attributable to some device not handling ATA NCQ (Native Command Queue).

This issue currently affect 4 products:

DCS-7170-32C*
DCS-7170-64C
DCS-7060DX4-32
DCS-7260CX3-64

How I did it
This change disable NCQ on the affected drive for a small set of products.

How to verify it
When the fix is applied, these 2 patterns can be found in the dmesg.
ata1.00: FORCE: horkage modified (noncq)
NCQ (not used)

Test results using: fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4

with NCQ (ata1.00: 61865984 sectors, multi 1: LBA48 NCQ (depth 32), AA)

   READ: bw=33.9MiB/s (35.6MB/s), 33.9MiB/s-33.9MiB/s (35.6MB/s-35.6MB/s), io=4073MiB (4270MB), run=120078-120078msec
  WRITE: bw=34.1MiB/s (35.8MB/s), 34.1MiB/s-34.1MiB/s (35.8MB/s-35.8MB/s), io=4100MiB (4300MB), run=120078-120078msec
without NCQ (ata1.00: 61865984 sectors, multi 1: LBA48 NCQ (not used))

   READ: bw=31.7MiB/s (33.3MB/s), 31.7MiB/s-31.7MiB/s (33.3MB/s-33.3MB/s), io=3808MiB (3993MB), run=120083-120083msec
  WRITE: bw=31.9MiB/s (33.4MB/s), 31.9MiB/s-31.9MiB/s (33.4MB/s-33.4MB/s), io=3830MiB (4016MB), run=120083-120083msec
Which release branch to backport (provide reason below if selected)
2023-02-15 10:31:59 -08:00
Stepan Blyshchak
e5a294644c
[dockerd] Force usage of cgo DNS resolver (#13649)
Go's runtime (and dockerd inherits this) uses own DNS resolver implementation by default on Linux.
It has been observed that there are some DNS resolution issues when executing ```docker pull``` after first boot.

Consider the following script:

```
admin@r-boxer-sw01:~$ while :; do date; cat /etc/resolv.conf; ping -c 1 harbor.mellanox.com; docker pull harbor.mellanox.com/sonic/cpu-report:1.0.0 ; sleep 1; done
Fri 03 Feb 2023 10:06:22 AM UTC
nameserver 10.211.0.124
nameserver 10.211.0.121
nameserver 10.7.77.135
search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com
PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data.
64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=5.99 ms

--- harbor.mellanox.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 5.989/5.989/5.989/0.000 ms
Error response from daemon: Get "https://harbor.mellanox.com/v2/": dial tcp: lookup harbor.mellanox.com on [::1]:53: read udp [::1]:57245->[::1]:53: read: connection refused
Fri 03 Feb 2023 10:06:23 AM UTC
nameserver 10.211.0.124
nameserver 10.211.0.121
nameserver 10.7.77.135
search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com
PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data.
64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=5.56 ms

--- harbor.mellanox.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 5.561/5.561/5.561/0.000 ms
Error response from daemon: Get "https://harbor.mellanox.com/v2/": dial tcp: lookup harbor.mellanox.com on [::1]:53: read udp [::1]:53299->[::1]:53: read: connection refused
Fri 03 Feb 2023 10:06:24 AM UTC
nameserver 10.211.0.124
nameserver 10.211.0.121
nameserver 10.7.77.135
search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com
PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data.
64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=5.78 ms

--- harbor.mellanox.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 5.783/5.783/5.783/0.000 ms
Error response from daemon: Get "https://harbor.mellanox.com/v2/": dial tcp: lookup harbor.mellanox.com on [::1]:53: read udp [::1]:55765->[::1]:53: read: connection refused
Fri 03 Feb 2023 10:06:25 AM UTC
nameserver 10.211.0.124
nameserver 10.211.0.121
nameserver 10.7.77.135
search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com
PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data.
64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=7.17 ms

--- harbor.mellanox.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 7.171/7.171/7.171/0.000 ms
Error response from daemon: Get "https://harbor.mellanox.com/v2/": dial tcp: lookup harbor.mellanox.com on [::1]:53: read udp [::1]:44877->[::1]:53: read: connection refused
Fri 03 Feb 2023 10:06:26 AM UTC
nameserver 10.211.0.124
nameserver 10.211.0.121
nameserver 10.7.77.135
search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com
PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data.
64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=5.66 ms

--- harbor.mellanox.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 5.656/5.656/5.656/0.000 ms
Error response from daemon: Get "https://harbor.mellanox.com/v2/": dial tcp: lookup harbor.mellanox.com on [::1]:53: read udp [::1]:54604->[::1]:53: read: connection refused
Fri 03 Feb 2023 10:06:27 AM UTC
nameserver 10.211.0.124
nameserver 10.211.0.121
nameserver 10.7.77.135
search mtr.labs.mlnx labs.mlnx mlnx lab.mtl.com mtl.com
PING harbor.mellanox.com (10.7.1.117) 56(84) bytes of data.
64 bytes from harbor.mtl.labs.mlnx (10.7.1.117): icmp_seq=1 ttl=53 time=8.22 ms

--- harbor.mellanox.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 8.223/8.223/8.223/0.000 ms
1.0.0: Pulling from sonic/cpu-report
004f1eed87df: Downloading [===================>                               ]   19.3MB/50.43MB
5d6f1e8117db: Download complete
48c2faf66abe: Download complete
234b70d0479d: Downloading [=========>                                         ]  9.363MB/51.84MB
6fa07a00e2f0: Downloading [==>                                                ]   9.51MB/192.4MB
04a31b4508b8: Waiting
e11ae5168189: Waiting
8861a99744cb: Waiting
d59580d95305: Waiting
12b1523494c1: Waiting
d1a4b09e9dbc: Waiting
99f41c3f014f: Waiting
```

While /etc/resolv.conf has the correct content and ping (and any other utility that uses libc's DNS resolution implementation) works correctly
docker is unable to resolve the hostname and falls back to default [::1]:53. This started to happen after PR https://github.com/sonic-net/sonic-buildimage/pull/13516 has been merged.
As you can see from the log, dockerd is able to pick up the correct /etc/resolv.conf only after 5 sec since first try. This seems to be somehow related to the logic in Go's DNS resolver
https://github.com/golang/go/blob/master/src/net/dnsclient_unix.go#L385.

There have been issues like that reported in docker like:
  - https://github.com/docker/cli/issues/2299
  - https://github.com/docker/cli/issues/2618
  - https://github.com/moby/moby/issues/22398

Since this starts to happen after inclusion of resolvconf package by
above mentioned PR and the fact I can't see any problem with that (ping,
nslookup, etc. works) the choice is made to force dockerd to use cgo
(libc) resolver.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2023-02-14 08:57:19 +02:00
zhixzhu
f0f7639fa2
set cable length to 1m for backplane ports (#13572)
Signed-off-by: Zhixin Zhu zhixzhu@cisco.com

Why I did it
backplane ports cable length need to be specified.

How I did it
separated handling for the specific port name.
2023-02-10 19:01:49 -08:00
andywongarista
1894e0aafe
Increase PikeZ varlog size (#13550)
Why I did it
To address error sometimes seen when running sonic-mgmt test_stress_routes.py::test_announce_withdraw_route on 720DT-48S

How I did it
Update boot0 logic to set platform specific varlog size for 720DT-48S

How to verify it
Verified that /var/log size increased and error is no longer observed when running test
2023-02-09 13:24:09 -08:00
Samuel Angebault
dd7948bf17
[Arista] Add emmc quirks in boot0 to improve reliability (#10013)
Why I did it
Fix some unreliability seen on emmc device with some AMD CPUs

How I did it
Added a kernel parameter to add quirks to
It depends on a sonic-linux-kernel change to work properly but will be a no-op without it.
The quirk added is SDHCI_QUIRK2_BROKEN_HS200 used to downgrade the link speed for the eMMC.
2023-02-09 10:46:09 -08:00
Stephen Sun
e3ff08833e
[Mellanox] Support DSCP remapping in dual ToR topo on T0 switch (#12605)
- Why I did it
Support DSCP remapping in dual ToR topo on T0 switch for SKU Mellanox-SN4600c-C64, Mellanox-SN4600c-D48C40, Mellanox-SN2700, Mellanox-SN2700-D48C8.

- How I did it
Regarding buffer settings, originally, there are two lossless PGs and queues 3, 4. In dual ToR scenario, the lossless traffic from the leaf switch to the uplink of the ToR switch can be bounced back.
To avoid PFC deadlock, we need to map the bounce-back lossless traffic to different PGs and queues. Therefore, 2 additional lossless PGs and queues are allocated on uplink ports on ToR switches.

On uplink ports, map DSCP 2/6 to TC 2/6 respectively
On downlink ports, both DSCP 2/6 are still mapped to TC 1
Buffer adjusted according to the ports information:
Mellanox-SN4600c-C64:
56 downlinks 50G + 8 uplinks 100G
Mellanox-SN4600c-D48C40, Mellanox-SN2700, Mellanox-SN2700-D48C8:
24 downlinks 50G + 8 uplinks 100G

- How to verify it
Unit test.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2023-02-07 16:21:59 +02:00
Chun'ang Li
eea54717b8
Fix rsyslogd start failed cause by rsyslog.conf is emtpy. (#13669)
- Why I did it
In to-sonic and multi-asic KVM-test, pretest sometimes failed. Reason is rsyslogd process can not start in teamd container. Because rsyslog.conf is empty caused by sonic-cfggen execute failed

- How I did it
If sonic-cfggen -d execute failed, execute without -d because the template file has the default value.

- How to verify it
Build image and test it over 40 times, all passed pretest.

Signed-off-by: Chun'ang Li <chunangli@microsoft.com>
2023-02-06 16:38:04 +02:00
Sudharsan Dhamal Gopalarathnam
1ff0c0b685
[Mellanox][sai_failure_dump]Added platform specific script to be invoked during SAI failure dump (#13533)
- Why I did it
Added platform specific script to be invoked during SAI failure dump. Added some generic changes to mount /var/log/sai_failure_dump as read write in the syncd docker

- How I did it
Added script in docker-syncd of mellanox and copied it to /usr/bin

- How to verify it
Manual UT and new sonic-mgmt tests
2023-02-05 16:45:49 +02:00
Saikrishna Arcot
ee1c32a802
Use tmpfs for /var/log for Arista 7260 (#13587)
This is to reduce writes to disk, which then can use the SSD to get worn
out faster.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-02-02 09:07:33 -08:00
anamehra
26af468a99
Add support for platform topology configuration service (#12066)
* Add support for platform topology configuration service

    This service invokes the platform plugin for platform specific topology
    configuration.
    The path for platform plugin script is:
    /usr/share/sonic/device/$PLATFORM/plugins/config-topology.sh
    If the platform plugin is not available, this service does nothing.

Signed-off-by: anamehra <anamehra@cisco.com>
2023-02-01 12:53:45 -08:00
Richard.Yu
a096363b48
[broadcom]: Set default SYNCD_SHM_SIZE for Broadcom XGS devices (#13297)
After upgrade to brcmsai 8.1, the sdk running environment (container) recommended with mininum memory size as below

TH4/TD4(ltsw) uses 512MB
TH3 used 300MB
Helix4/TD2/TD3/TH/TH 256 MB
Base on this requirement, adjust the default syncd share memory size and set the memory size for special ACISs in platform_env.conf file for different types of Broadcom ASICs.

How I did it
Add the platform_env.conf file if none of it for broadcom platform (base on platform_asic file)
Add the 'SYNCD_SHM_SIZE' and set the value

for ltsw(TD4/TH4) devices set to 512M at least (update the platform_env.conf)
for Td2/TH2/TH devices set to 256M
for TH3 set to 300M

verify

How to verify it
verify the image with code fix
Check with UT
Check on lab devices

On a problematic device which cannot start successfully
Run with the command
$ cat /proc/linux-kernel-bde
Broadcom Device Enumerator (linux-kernel-bde)
Module parameters:
        maxpayload=128
        usemsi=0
        dmasize=32M
        himem=(null)
        himemaddr=(null)
DMA Memory (kernel): 33554432 bytes, 0 used, 33554432 free, local mmap
No devices found
$ docker rm -f syncd
syncd
$ sudo /usr/bin/syncd.sh start
Cannot get Broadcom Chip Id. Skip set SYNCD_SHM_SIZE.
Creating new syncd container with HWSKU Force10-S6000
a4862129a7fea04f00ed71a88715eac65a41cdae51c3158f9cdd7de3ccc3dd31
$ docker inspect syncd | grep -i shm
            "ShmSize": 67108864,
                "Tag": "fix_8.1_shm_issue.67873427-9f7ca60a0e",
On Normal device
$ docker inspect syncd | grep -i shm
            "ShmSize": 268435456,
                "Tag": "fix_8.1_shm_issue.67873427-9f7ca60a0e"
change the config syncd_shm.ini to b85=128m

$ docker rm -f syncd
syncd
$ sudo /usr/bin/syncd.sh start
Creating new syncd container with HWSKU Force10-S6000
3209ffc1e5a7224b99640eb9a286c4c7aa66a2e6a322be32fb7fe2113bb9524c
$  docker inspect syncd | grep -i shm
            "ShmSize": 134217728,
                "Tag": "fix_8.1_shm_issue.67873427-9f7ca60a0e",
change the config under
/usr/share/sonic/device/x86_64-dell_s6000_s1220-r0/Force10-S6000/platform_env.conf
and run command

$ cat /usr/share/sonic/device/x86_64-dell_s6000_s1220-r0/platform_env.conf
SYNCD_SHM_SIZE=300m

$ sudo /usr/bin/syncd.sh start
Creating new syncd container with HWSKU Force10-S6000
897f6fcde1f669ad2caab7da4326079abd7e811bf73f018c6dacc24cf24bfda5
$  docker inspect syncd | grep -i shm
            "ShmSize": 314572800,
                "Tag": "fix_8.1_shm_issue.67873427-9f7ca60a0e",

Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
2023-01-30 20:23:03 -08:00
Oleksandr Ivantsiv
c7ecd92c54
Clear DNS configuration received from DHCP during networking reconfiguration in Linux. (#13516)
- Why I did it
fixes #12907

When the management interface IP address configuration changes from dynamic to static the DNS configuration (retrieved from the DHCP server) in /etc/resolv.conf remains uncleared. This leads to a DNS configuration pointing to the wrong nameserver. To make the behavior clear DNS configuration received from DHCP should be cleared.

- How I did it
Use resolvconf package for managing DNS configuration. It is capable of tracking the source of DNS configuration and puts the configuration retrieved from the DHCP servers into a separate file. This allows the implementation of DNS configuration cleanup retrieved from DHCP during networking reconfiguration.

- How to verify it
Ensure that the management interface has no static configuration.
Check that /etc/resolv.conf has DNS configuration.
Configure a static IP address on the management interface.
Verify that /etc/resolv.conf has no DNS configuration.
Remove the static IP address from the management interface.
Verify that /etc/resolv.conf has DNS configuration retrieved form DHCP server.
2023-01-30 22:13:10 +02:00
Devesh Pathak
c93716a142
rsyslog to start after interfaces-config (#13503)
Fixes #12408

Why I did it
We are running into #12408 very frequently.
This results in no syslogs from any containers as rsyslog server could not start.
some of the sonic-mgmt scripts look for log statements and error out if log is not present.

Interfaces-config service configures the loopback interface along with other interfaces. rsyslog-config reads ip address of loopback interface and generates /etc/rsyslog.conf. When this race condition happens, lo interface ip is not yet programmed and rsyslog-config ends up writing UDP server as null in /etc/rsyslog.conf.

How I did it
rsyslog-config service is started after interfaces-config service.

How to verify it
Did multiple reboots and verified that $UDPServerAddress is valid.
2023-01-26 20:39:13 -08:00
Jing Zhang
dabb31c5f6
[sudoers] add /usr/local/bin/storyteller to READ_ONLY_CMDS (#13422)
Adding /usr/local/bin/storyteller to READ_ONLY_CMDS. So no write access or prompt for password is needed to run storyteller.

Tested on 202205 clusters, user who didn't request write access was able to grep log using storyteller.

sign-off: Jing Zhang zhangjing@microsoft.com
2023-01-26 20:38:29 -08:00
Jing Zhang
78f249be38
change default to be on (#13495)
Changing the default config knob value to be True for killing radv, due to the reasons below:

Killing RADV is to prevent sending the "cease to be advertising interface" protocol packet.
RFC 4861 says this ceasing packet as "should" instead of "must", considering that it's fatal to not do this.
In active-active scenario, host side might have difficulty distinguish if the "cease to be advertising interface" is for the last interface leaving.
6.2.5. Ceasing To Be an Advertising Interface

shutting down the system.
In such cases, the router SHOULD transmit one or more (but not more
than MAX_FINAL_RTR_ADVERTISEMENTS) final multicast Router
Advertisements on the interface with a Router Lifetime field of zero.
In the case of a router becoming a host, the system SHOULD also
depart from the all-routers IP multicast group on all interfaces on
which the router supports IP multicast (whether or not they had been
advertising interfaces). In addition, the host MUST ensure that
subsequent Neighbor Advertisement messages sent from the interface
have the Router flag set to zero.

sign-off: Jing Zhang zhangjing@microsoft.com
2023-01-24 23:59:54 +00:00
Zain Budhwani
c9a33cb00e
Fix segfault issue inside memory_checker (#13066)
#### Why I did it

Segfault was occuring when running memory_checker

#### How I did it

Deinit publisher immediately after publishing

#### How to verify it

Manual testing
2023-01-24 15:30:41 -08:00
Jing Zhang
260a2ec3e7
[dualtor][active-active]Killing radv instead of stopping on active-active dualtor if config knob is on (#13408)
How I did it
radv sends a good-bye packet when the service is stopped, which causes a IPv6 route update on SoC side. And this update leads to an interface bouncing and causes traffic disruption even though the ToR device might already be isolated.

This PR is to mitigate the traffic disruption issue during planned maintenance, by killing radv instead of stopping. So the cease packet won't be sent.

How to verify it
Verified on dev clusters:

Traffic disruption was no longer reproducible.
radv took the killing path
if knob was off, radv would take the stopping path

sign-off: Jing Zhang zhangjing@microsoft.com
2023-01-20 15:34:34 -08:00
Graham Hayes
e077b5362c
[Arista] Rely on automatic flash size detection for Raven (#13277)
Many of these switches have had flash upgraded beyond 2G however, in
boot0 both were assigned 2GB for legacy reasons.

Remove the hardcoding of the flash size and let boot0 autodetect the available space.

Signed-off-by: Graham Hayes <gr@ham.ie>

Signed-off-by: Graham Hayes <gr@ham.ie>
2023-01-12 23:52:40 -08:00
xumia
e6a01ca5eb
[Bug] Fix SONiC installation failure caused by pip/pip3 not found (#13284)
The main issue is the pip/pip3 command cannot be found when the package is being installed by apt-get.
When using the dpkg install, the searching path is PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
When using the apt-get install, the searching path is PATH=/usr/sbin:/usr/bin:/sbin:/bin
But the pip/pip3 default path is at /usr/local/bin, so dpkg works, but apt-get not work.

How I did it
Export the path /usr/local/bin for pip/pip3.
Make the deb packages can be installed by apt-get.
2023-01-11 08:54:24 -08:00
centecqianj
4b933bd566
[Centec arm64] Solve the abnormal console speed of centec-arm64 switch board (#13126)
The console of the centec-arm64 board is ttyAMA0.The current regular expression cannot be correctly parsed.

Signed-off-by: centecqianj <qianj@centec.com>
2023-01-07 21:10:03 -08:00
abdosi
9ecd27ddbb
During build time mask only those feature/services that are disabled excplicitly (#13283)
What I did:
Fix : #13117

How I did:
During build time mask only those feature/services that are disabled explicitly. Some of the features ((eg: teamd/bgp/dhcp-relay/mux/etc..)) state is determine run-time so for those feature by default service will be up and running and then later hostcfgd will mask them if needed.

So Default behavior will be

init_cfg.json.j2 during build time make state as disabled then mask the service
init_cfg.json.j2 during build time make state as another jinja2 template render string than do no mask the service
init_cfg.json.j2 during build time make state as enabled then do not mask the service

How I verify:
Manual Verification.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2023-01-07 02:36:37 +00:00
Arvindsrinivasan Lakshmi Narasimhan
a57fa16839
[Chassis][Voq]update to add buffer_queue config on system ports (#12156)
Why I did it
In the voq chassis the buffer_queue configuration needs to be applied on system_port instead of the sonic port.
This PR has the change to do this.

How I did it
Modify buffer_config.j2 to generate buffer_queue configuration on system_ports if the device is Voq Chassis

How to verify it
Verify the buffer_queue configuration is generated properly using sonic-cfggen

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2022-12-31 23:59:54 -08:00
Oleksandr Ivantsiv
127d60f9b8
[build] Adjust teamd and radv features configuration according to the compilation options. (#13139)
- Why I did it
The followup to #12920 PR.
If the feature compilation is disabled its configuration should not be included into init_cfg.json.

- How I did it
Update init_cfg.json.j2 template to include teamd and radv features configuration only if their compilation is enabled.

- How to verify it
The default behavior is preserved. To verify the changes compile the image without overriding INCLUDE_TEAMD and INCLUDE_ROUTER_ADVERTISER options. The generated /etc/sonic/init_cfg.json should remain with no changes. Install the image and verify that both teamd and radv containers are present and running. Verify that feature state returned by show feature status command is enabled.
Change the INCLUDE_TEAMD or INCLUDE_ROUTER_ADVERTISER value to "n". Compile and install the image. Verify that feature configuration is not included in generated /etc/sonic/init_cfg.json file. Verify that show feature status output doesn't include the feature.
2022-12-27 13:55:37 +02:00
Stepan Blyshchak
661669c805
[swss/syncd] remove dependency on interfaces-config.service (#13084)
- Why I did it
Remove dependency on interfaces-config.service to speed up boot, because interfaces-config.service takes a lot of time on boot.

- How I did it
Changed service files for swss, syncd.

- How to verify it
Boot and check swss/syncd start time comparing to interfaces-config

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-12-26 09:20:45 +02:00
Junchao-Mellanox
2126def04e
[infra] Support syslog rate limit configuration (#12490)
- Why I did it
Support syslog rate limit configuration feature

- How I did it
Remove unused rsyslog.conf from containers
Modify docker startup script to generate rsyslog.conf from template files
Add metadata/init data for syslog rate limit configuration

- How to verify it
Manual test
New sonic-mgmt regression cases
2022-12-20 10:53:58 +02:00
Konstantin Vasin
8a3fad2891
[Build] mount cgroup2 in chroot to fix build on ubuntu 22.04 (#13030)
Why I did it
Ubuntu 22.04 uses cgroup2 by default, but docker.sh doesn't mount it.
As a result we get an error when trying to run docker info in chroot env:

ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

How I did it
mount cgroup2 in chroot if all enabled kernel cgroup controllers are currently not in use by cgroup1

So we need to mount cgroup in chroot environment on /sys/fs/cgroup.
Because inside chroot we don't know which cgroup version is used by the host we have two possible solutions:

cgroup tree for chroot is mounted by the host (it was my 1st version of this fix)
cgroup tree is mounted inside chroot based on info from /proc/cgroups (it's current version of this fix)
My 2nd version based on this code from systemd: 5c6c587ce2/src/shared/cgroup-setup.c (L35-L74)

We parse info from /proc/cgroups
Skip header line started from #
Skip controller if it's disabled (4th column = 0)
Count number of controllers with non-zero of hierarchy_id (2nd column)
If this number is not zero then we assume some of controllers are used by host system and the host system uses hybrid or legacy cgroup tree. In this case we can't use unified cgroup tree inside chroot and mount old cgroup tree (v1).
If this number is zero then we assume host system uses unified cgroup tree and we need to mount cgroup2 inside chroot.

Signed-off-by: Konstantin Vasin <k.vasin@yadro.com>
2022-12-17 12:16:45 -08:00
Oleksandr Ivantsiv
9988ff888b
[build] Add the possibility to disable compilation of teamd and radv containers. (#12920)
- Why I did it
This optimization is needed for DPU SONiC. DPU SONiC runs a limited set of containers and teamd and radv containers are not part of them. Unlike the other containers, there was no possibility to disable teamd and radv containers compilation.
To reduce DPU SONiC compilation time and reduce the image size this commit adds the possibility to disable their compilation.

- How I did it
Two new configuration options are added to rules/config file:

INCLUDE_TEAMD
INCLUDE_ROUTER_ADVERTISER
By default to preserve the existing behavior both options are enabled. There are two ways to override them:

To change option value to "n" in rules/config file.
To override their value using SONIC_OVERRIDE_BUILD_VARS env variable:
SONIC_OVERRIDE_BUILD_VARS="SONIC_INCLUDE_TEAMD=y SONIC_INCLUDE_ROUTER_ADVERTISER=n"

- How to verify it
The default behavior is preserved. To verify it compile the image without overriding new options. Install the image and verify that both teamd and radv containers are present and running.
To verify the new options override them with "n" value. Compile and install image. Verify that no docker containers are present. Verify that SWSS can start without errors.
2022-12-13 12:06:30 +02:00
Saikrishna Arcot
00b11ec4e2
Replace logrotate cron file with (adapted) systemd timer file (#12921)
Debian is shipping a systemd timer unit for logrotate, but we're also
packaging in a cron job, which means both of them will run, potentially
at the same time. Remove our cron file, and add an override to the
shipped timer file to have it be run every 10 minutes.

Fixes #12392.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-12-08 14:13:11 -08:00
Junchao-Mellanox
3b3837a636
[containercfgd] Add containercfgd and syslog rate limit configuration support (#12489)
* [containercfgd] Add containercfgd and syslog rate limit configuration support

* Fix build issue

* Fix checker issue

* Fix review comment

* Fix review comment

* Update containercfgd.py
2022-12-08 08:58:35 -08:00
Arvindsrinivasan Lakshmi Narasimhan
7db272556e
[chassis] update the asic_status.py to read from CHASSIS_FABRIC_ASIC_INFO_TABLE (#12576)
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan arlakshm@microsoft.com

Why I did it
Fixes #12575 and #12575

How I did it
In the PR sonic-net/sonic-platform-daemons#311 chassisd updates to CHASSIS_FABRIC_ASIC_INFO with the fabric asic info.
Updating the asic_status.py to read from the correct table.

How to verify it
test on chassis

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2022-12-07 21:53:47 -08:00
Michael Li
50b962b4a8
Limit reload BCM SDK kmods on syncd start to PikeZ platform (#12971)
Why I did it
Limiting #12804 changes to PikeZ platform only (Arista-720DT-48S). Note that this is a short term workaround for this platform until SDK investigation on SDK init failure on docker syncd restart due to DMA issues is resolved.

How I did it
Retrieve platform name from /host/machine.conf and only reload SDK kmods on Arista-720DT-48S platform.

Signed-off-by: Michael Li <michael.li@broadcom.com>
2022-12-07 09:53:21 +08:00
Stepan Blyshchak
8ca0530920
[swss.sh] optimize macsec feature state query (#12946)
- Why I did it
There's a slowdown in bootup related to the execution of a show command during startup of swss service. show is a pretty heavy command and takes long time to execute ~2 sec.

- How I did it
I replaced show with sonic-db-cli which takes a ms to run.

- How to verify it
Boot the switch and verify swss is active.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-12-06 11:23:46 +02:00
Kalimuthu-Velappan
aaeafa8411
02.Version cache - docker cache build framework (#12001)
During docker build, host files can be passed to the docker build through
docker context files. But there is no straightforward way to transfer
the files from docker build to host.

This feature provides a tricky way to pass the cache contents from docker
build to host. It tar's the cached content and encodes them as base64 format
and passes it through a log file with a special tag as 'VCSTART and VCENT'.

Slave.mk in the host, it extracts the cache contents from the log and stores them
in the cache folder. Cache contents are encoded as base64 format for
easy passing.

<!--
     Please make sure you've read and understood our contributing guidelines:
     https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md

     ** Make sure all your commits include a signature generated with `git commit -s` **

     If this is a bug fix, make sure your description includes "fixes #xxxx", or
     "closes #xxxx" or "resolves #xxxx"

     Please provide the following information:
-->

#### Why I did it

#### How I did it

#### How to verify it
2022-12-02 08:28:45 +08:00
Michael Li
f725b83bd6
Reload BCM SDK kmods on syncd start to handle syncd restart issues (#12804)
Why I did it
There is an issue on the Arista PikeZ platform (using T3.X2: BCM56274) while running SONiC. If the 'syncd' container in SONiC is restarted, the expected behaviour is that syncd will automatically restart/recover; however it does not and always fails at create_switch due to BCM SDK kmod DMA operation cancellation getting stuck.

Sep 16 22:19:44.855125 pkz208 ERR syncd#syncd: [none] SAI_API_SWITCH:platform_process_command:428 Platform command "init soc" failed, rc = -1. Sep 16 22:19:44.855206 pkz208 INFO syncd#supervisord: syncd CMIC_CMC0_PKTDMA_CH4_DESC_COUNT_REQ:0x33#015 Sep 16 22:19:44.855264 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:platformInit:1909 initialization command "init soc" failed, rc = -1 (Internal error). Sep 16 22:19:44.855403 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:sai_driver_init:642 Error initializing driver, rc = -1. ... Sep 16 22:19:44.855891 pkz208 CRIT syncd#syncd: [none] SAI_API_SWITCH:brcm_sai_create_switch:1173 initializing SDK failed with error Operation failed (0xfffffff5).

Reloading the BCM SDK kmods allows the switch init to continue properly.

How I did it
If BCM SDK kmods are loaded, unload and load them again on syncd docker start script.

How to verify it
Steps to reproduce:

In SONiC, run 'docker ps' to see current running containers; 'syncd' should be present.
Run 'docker stop syncd'
Wait ~1 minute.
Run 'docker ps' to see that syncd is missing.
Check logs to see messages similar to the above.

Signed-off-by: Michael Li <michael.li@broadcom.com>
2022-11-30 16:16:30 +08:00
Kebo Liu
36a100083f
[Mellanox] Add support to Mellanox Spectrum-4 ASIC Firmware compiling and upgrade (#12844)
- Why I did it
Add support for compiling Spectrum-4 ASIC firmware to the SONiC image
Add support for Spectrum-4 ASIC firmware upgrade

- How I did it
Update Mellanox fw make files to include Spectrum-4 ASIC firmware binaries.
Update firmware upgrade scripts to be able to detect Spectrum-4 ASIC.

- How to verify it
Run regression tests

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2022-11-29 16:38:41 +02:00
Zain Budhwani
4b001e5115
Change value type of params in memory_checker (#12797)
Fix error when calling events API, required value is string, passing float
2022-11-23 17:37:28 -08:00
bingwang-ms
f402e6b5c6
Apply separated DSCP_TO_TC_MAP and TC_TO_QUEUE_MAP to uplink ports on dualtor (#12730)
Why I did it
The PR is to apply separated DSCP_TO_TC_MAP and TC_TO_QUEUE_MAP to uplink ports on dualtor.
The traffic with DSCP 2 and DSCP 6 from T1 is treated as lossless traffic.

DSCP    TC    Queue
2      2     2
6      6     6
Traffic with DSCP 2 or DSCP 6 from downlink is still treated as lossy traffic as before.

How I did it
Define DSCP_TO_TC_MAP|AZURE_UPLINK and TC_TO_QUEUE_MAP|AZURE_UPLINK.

How to verify it
Verified by UT
Verified by coping the new template to a testbed, and rendering a config_db.json
2022-11-21 11:42:28 -08:00
Konstantin Vasin
6448afd338
[Build] set apt Acquire::Retries to 3 for bullseye (#12758)
Why I did it
There were some changes in apt source code in version 2.1.9.
As a result apt used in bullseye (2.2.4) is intolerant to network issues.
This was fixed in 10631550f1 Already fixed version is used in bookworm (2.5.4)
And not yet affected version is used in buster (1.8.2.3)

How I did it
Set Acquire::Retries to 3 for sonic-slave-bullseye, docker-base-bullseye and final Debian image.

Ref: https://bugs.launchpad.net/ubuntu/+source/apt/+bug/1876035

Signed-off-by: Konstantin Vasin k.vasin@yadro.com
2022-11-21 08:05:16 +08:00
Lorne Long
7e525d96b3
[Build] Use apt-get to predictably support dependency ordered configuration of lazy packages (#12164)
Why I did it
The current lazy installer relies on a filename sort for both unpack and configuration steps. When systemd services are configured [started] by multiple packages the order is by filename not by the declared package dependencies. This can cause the start order of services to differ between first-boot and subsequent boots. Declared systemd service dependencies further exacerbate the issue (e.g. blocking the first-boot script).

The current installer leaves packages un-configured if the package dependency order does not match the filename order.

This also fixes a trivial bug in [Build]: Support to use symbol links for lazy installation targets to reduce the image size #10923 where externally downloaded dependencies are duplicated across lazy package device directories.

How I did it
Changed the staging and first-boot scripts to use apt-get:

dpkg -i /host/image-$SONIC_VERSION/platform/$platform/*.deb

becomes

apt-get -y install /host/image-$SONIC_VERSION/platform/$platform/*.deb

when dependencies are detected during image staging.

How to verify it
Apt-get critical rules

Add a Depends= to the control information of a package. Grep the syslog for rc.local between images and observe the configuration order of packages change.
2022-11-17 11:20:42 +08:00
abdosi
668485aac5
Added Support to runtime render bgp and teamd feature state and lldp has_asic_scope flag (#11796)
Added Support to runtime render bgp and teamd feature `state` and lldp `has_asic_scope`  flag
Needed for SONiC on chassis.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Co-authored-by: mlok <marty.lok@nokia.com>
2022-11-15 16:20:14 -08:00
abdosi
bd348c5264
[chassis-packet] fix the issue of internal ip arp not getting resolved. (#12127)
Fix the issue where arp_update will not ping some of the ip's even
though they are in failed state since grep of that ip on ip neigh show
command does not do exact word match and can return multiple match.
2022-11-14 10:15:17 -08:00
Zain Budhwani
98ace33b0f
Add rsyslog plugin regex for select operation failure (#12659)
Added events for select op, alpm parity error, moved dhcp events from host to container
2022-11-13 21:41:33 -08:00
Jing Kan
111752957f
[dhcp_relay] Enable DHCP Relay for BmcMgmtToRRouter in init_cfg (#12648)
Why I did it
DHCP relay feature needs to be enabled for BmcMgmtToRRouter by default

How I did it
Update device type list
2022-11-10 13:37:02 +08:00
Devesh Pathak
0ea4f4d00e
Clear /etc/resolv.conf before building image (#12592)
Why I did it
nameserver and domain entries from build system fsroot gets into sonic image.

How I did it
Clear /etc/resolv.conf before building image

How to verify it
Built image with it and verified with install that /etc/resolv.conf is empty
2022-11-09 16:54:56 -08:00
xumia
ac5d89c6ac
[Build] Support j2 template for debian sources (#12557)
Why I did it
Unify the Debian mirror sources
Make easy to upgrade to the next Debian release, not source url code change required.
Support to customize the Debian mirror sources during the build
Relative issue: #12523
2022-11-09 08:09:53 +08:00
judyjoseph
c259c996b4
Use the macsec_enabled flag in platform to enable macsec feature state (#11998)
* Use the macsec_enabled flag in platform to enable macesc feature state
* Add macsec supported metadata in DEVICE_RUNTIME_METADATA
2022-11-08 11:03:38 -08:00
Sudharsan Dhamal Gopalarathnam
e6a0fba9ea
[logrotate]Fix logrotate firstaction script to reflect correct size (#12599)
- Why I did it
Fix logrotate firstaction script to reflect correct size. The size was modified to change dynamically based on disk size. However this variable was not updated
#9504

- How I did it
Updated the variable based on disk size

- How to verify it
Verify in the generated rsyslog file if the variable is correctly generated from jinja template
2022-11-08 13:38:14 +02:00
Lawrence Lee
ddf16c9d8c
[arp_update]: Fix hardcoded vlan (#12566)
Typo in prior PR #11919 hardcodes Vlan name. Change command to use the $vlan variable instead

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2022-11-07 12:10:00 -08:00
Zain Budhwani
8f48773fd1
Publish additional events (#12563)
Add event_publish code or regex for rsyslog plugin for additional events
2022-11-07 09:57:57 -08:00
Mai Bui
61a085e55e
Replace os.system and remove subprocess with shell=True (#12177)
Signed-off-by: maipbui <maibui@microsoft.com>
#### Why I did it
`subprocess` is used with `shell=True`, which is very dangerous for shell injection.
`os` - not secure against maliciously constructed input and dangerous if used to evaluate dynamic content
#### How I did it
remove `shell=True`, use `shell=False`
Replace `os` by `subprocess`
2022-11-04 10:48:51 -04:00
bingwang-ms
6169ae3ee3
Add lossy scheduler for queue 7 (#12596)
* Add lossy scheduler for queue 7
2022-11-04 08:12:00 +08:00
ntoorchi
45d174663a
Enable P4RT at build time and disable at startup (#10499)
#### Why I did it
Currently at the Azure build system, the P4RT container is disabled by default at the build time. Here the goal is to include the P4RT container at the build time while disabling it at the runtime. The user can enable/disable the p4rt app through the config based on the preference. 

#### How I did it
Changed the config in rules/config and init-cfg.json.j2
2022-10-31 16:18:42 -07:00
Devesh Pathak
85e3a81f47
Fix to improve hostname handling (#12064)
* Fix to improve hostname handling
If config_db.json is missing hostname entry, hostname-config.sh ends
up deleting existing entry too and hostname changes to default 'localhost'

* default hostname to 'sonic` if missing in config file
2022-10-25 14:51:02 -07:00
Samuel Angebault
f39c2adc04
Fix extraction of platform.tar.gz for firsttime (#11935) 2022-10-21 18:27:32 -07:00
Samuel Angebault
9cdd78788f
Add support for UpperlakeElite (#12280)
Signed-off-by: Samuel Angebault <staphylo@arista.com>

Signed-off-by: Samuel Angebault <staphylo@arista.com>
2022-10-21 18:26:43 -07:00
Mariusz Stachura
9f88d03c2b
[QoS] Support dynamic headroom calculation for Barefoot platforms (#11708)
Signed-off-by: Mariusz Stachura <mariusz.stachura@intel.com>

What I did
Adding the dynamic headroom calculation support for Barefoot platforms.

Why I did it
Enabling dynamic mode for barefoot case.

How I verified it
The community tests are adjusted and pass.
2022-10-19 09:36:56 -07:00
cytsao1
9ef8464964
[pmon] Add smartmontools to pmon docker (#11837)
* Add smartmontools to pmon docker

* Set smartmontools to install version 7.2-1 in pmon to match host; clean up smartmontools build files

* Add comments on smartmontools version for both host and pmon
2022-10-17 13:26:31 -07:00
Ying Xie
bc684fef0b
[BGP] starting BGP service after swss (#12381)
Why I did it
BGP service has always been starting after interface-config. However, recently we discovered an issue where some BGP sessions are unable to establish due to BGP daemon not able to read the interface IP.

This issue was clearly observed after upgrading to FRR 8.2.2. See more details in #12380.

How I did it
Delaying starting BGP seems to be a workaround for this issue.

However, caution is that this delay might impact warm reboot timing and other timing sequences.

This workaround is reducing the probability of hitting the issue by close to 100X. However, this workaround is not bulletproof as test shows. It is still preferrable to have a proper FRR fix and revert this change in the future.

How to verify it
Continuously issuing config reload and check BGP session status afterwards.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2022-10-13 09:24:06 -07:00
Hua Liu
257cc96d7c
Remove swsssdk from sonic OS image and docker container image (#12323)
Remove swsssdk from sonic OS image and docker image

#### Why I did it
swsssdk is deprecated, so need remove from image.

#### How I did it
Update config file to remove swsssdk from image.

#### How to verify it
Pass all test case.

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205

#### Description for the changelog
Remove swsssdk from sonic OS image and docker image

#### Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

#### A picture of a cute animal (not mandatory but encouraged)
2022-10-12 13:04:14 +08:00
Zain Budhwani
09fe3f467f
Add Structured Events w/ YANG Models (#12270)
Add events for dhcp-relay, bgp, syncd, & kernel.
2022-10-09 20:23:31 -07:00
Prince George
ac1d392d4c
Disable brackted-paste mode off by default (#12285)
* Disable brackted-paste mode off by default

* address review comment
2022-10-06 07:55:09 -07:00
Saikrishna Arcot
9251d4ba8b
[docker-wait-any]: Exit worker thread if main thread is expected to exit (#12255)
There's an odd crash that intermittently happens after the teamd container
exits, and a signal is raised to the main thread to exit. This thread (watching
teamd) continues execution because it's in a `while True`. The subsequent wait
call on the teamd container very likely returns immediately, and it calls
`is_warm_restart_enabled` and `is_fast_reboot_enabled`. In either of these
cases, sometimes, there is a crash in the transition from C code to Python code
(after the function gets executed).  Python sees that this thread got a signal
to exit, because the main thread is exiting, and tells pthread to exit the
thread.  However, during the stack unwinding, _something_ is telling the
unwinder to call `std::terminate`.  The reason is unknown.

This then results in a python3 SIGABRT, and systemd then doesn't call the stop
script to actually stop the container (possibly because the main process exited
with a SIGABRT, so it's a hard crash). This means that the container doesn't
actually get stopped or restarted, resulting in an inconsistent state
afterwards.

The workaround appears to be that if we know the main thread needs to exit,
just return here, and don't continue execution. This at least tries to avoid it
from getting into the problematic code path. However, it's still feasible to
get a SIGABRT, depending on thread/process timings (i.e. teamd exits, signals
the main thread to exit, and then syncd exits, and syncd calls one of the two C
functions, potentially hitting the issue).

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-10-05 18:14:10 -07:00
Muhammad Danish
8c10851c2a
Update azure.github.io links to sonic-net.github.io (#12209)
Why I did it
azure.github.io/SONiC/ no longer works and returns 404 Not Found. Updated it to the correct sonic-net.github.io/SONiC/
2022-10-02 14:02:10 +08:00
Aryeh Feigin
2c10ebb4fe
Use warm-boot infrastructure for fast-boot (#11594)
This PR should be merged together with the sonic-utilities PR (sonic-net/sonic-utilities#2286) and sonic-sairedis PR (sonic-net/sonic-sairedis#1100).

Use redis contents from dump file in fast-reboot.

Improve fast-reboot flow by utilizing the warm-reboot infrastructure.
This followes https://github.com/sonic-net/SONiC/blob/master/doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md
2022-09-26 09:01:49 -07:00
Zain Budhwani
fd6a1b0ce2
Add events to host and create rsyslog_plugin deb pkg (#12059)
Why I did it

Create rsyslog plugin deb for other containers/host to install
Add events for bgp and host events
2022-09-21 09:20:53 -07:00
Stepan Blyshchak
e662008f72
[services] kill container on stop in warm/fast mode (#10510)
- Why I did it
To optimize stop on warm boot.

- How I did it
Added kill for containers
2022-09-19 19:34:33 +03:00
Volodymyr Boiko
c243af0cce
[bgp][service] Start bgp service after interfaces-config service (#11827)
- Why I did it
interfaces-config service restarts networking service, during the restart loopback interface address is being removed and reassigned back, leaving loopback without an ipv4 address for a while.
On SONiC startup and config reload interfaces-config and bgp services start in parallel and sometimes
fpmsyncd in bgp attempts bind to loopback while it does not have an address, fails with the log
Exception "Cannot assign requested address" had been thrown in daemon
and exits with rc 0.

root@sonic:/# supervisorctl status
fpmsyncd                         EXITED    Jul 20 05:04 AM
zebra                            RUNNING   pid 35, uptime 6:15:05
zsocket                          EXITED    Jul 20 05:04 AM
docker logs bgp
INFO exited: fpmsyncd (exit status 0; expected)
With fpmsyncd dead, configured routes do not appear in the database.

- How I did it
Added ordering dependency on interfaces-config service into bgp.config

- How to verify it
Itself the issue reproduces quite rarely, but one can gain the time interval between networking down and networking up in interfaces-config.sh like this:

diff --git a/files/image_config/interfaces/interfaces-config.sh b/files/image_config/interfaces/interfaces-config.sh
index f6aa4147a..87caceeff 100755
--- a/files/image_config/interfaces/interfaces-config.sh
+++ b/files/image_config/interfaces/interfaces-config.sh
@@ -63,7 +63,11 @@ done
 # Read sysctl conf files again
 sysctl -p /etc/sysctl.d/90-dhcp6-systcl.conf

-systemctl restart networking
+# systemctl restart networking
+
+systemctl start networking
+sleep 10
+systemctl stop networking

 # Clean-up created files
 rm -f /tmp/ztp_input.json /tmp/ztp_port_data.json
with this change the issue reproduces on every config reload.

Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com>
2022-09-19 17:25:10 +03:00
lixiaoyuner
a1b50cac41
Make client indentity by AME cert (#11946)
* Make client indentity by AME cert

* Join k8s cluster by ipv6

* Change join test cases

* Test case bug fix

* Improve read node label func

* Configure kubelet and change test cases

* For kubernetes version 1.22.2

* Fix undefine issue

Signed-off-by: Yun Li <yunli1@microsoft.com>
2022-09-16 13:13:39 +08:00
Maxime Lorrillere
0a7dd50dcb
[Chassis][Voq]Configure midplane network on supervisor (#11725)
Multi-asic Docker instances are created behind Docker's default bridge
which doesn't allow talking to other Docker instances that are in the
host network (like database-chassis).

On linecards, we configure midplane interfaces to let per-asic docker
containers talk to CHASSIS_DB on the supervisor through internal chassis
network.

On the supervisor we don't need to use chassis internal network, but we
still need a similar setup in order to allow fabric containers to talk
to database-chassis
2022-09-15 17:23:41 -07:00
Oleksandr Ivantsiv
549bb3d483
[services] Update "WantedBy=" section for tacacs-config.timer. (#11893)
The timer execution may fail if triggered during a config reload
(when the sonic.target is stopped). This might happen in a rare
situation if config reload is executed after reboot in a small
time slot (for 0 to 30 seconds) before the tacacs-config timer
is triggered. To ensure that timer execution will be resumed after
a config reload the WantedBy section of the systemd service is updated
to describe relation to sonic.target.

Signed-off-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com>

Signed-off-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com>
2022-09-08 15:16:11 -07:00
Ze Gan
016f671857
[docker-macsec]: Add dependencies of MACsec (#11770)
Why I did it
If the SWSS services was restarted, the MACsec service should also be restarted. Otherwise the data in wpa_supplicant and orchagent will not be consistent.

How I did it
Add dependency in docker-macsec.mk.

How to verify it
Manually check by 'sudo service swss restart'.

The MACsec container should be started after swss, the syslog will look like


Sep  8 14:36:29.562953 sonic INFO swss.sh[9661]: Starting existing swss container with HWSKU Force10-S6000
Sep  8 14:36:30.024399 sonic DEBUG container: container_start: BEGIN
...
Sep  8 14:36:33.391706 sonic INFO systemd[1]: Starting macsec container...
Sep  8 14:36:33.392925 sonic INFO systemd[1]: Starting Management Framework container...


Signed-off-by: Ze Gan <ganze718@gmail.com>
2022-09-08 23:45:06 +08:00
Renuka Manavalan
31e750ee0b
Fix PR build failure (#11973)
Some PR builds fails to find this file. Remove it temporarily until we root cause it
2022-09-06 15:13:05 -07:00
Stepan Blyshchak
a8b2a538a5
[docker-wait-any] immediately start to wait (#11595)
It could happen that a container has already crashed but docker-wait-any
will wait forever till it starts. It should, however, immediately exit
to make the serivce restart.

#### Why I did it

It is observed in some circumstances that the auto-restart mechanism does not work. Specifically for ```swss.service```, ```orchagent``` had crashed before ```docker-wait-any``` started in ```swss.sh```. This led ```docker-wait-any``` wait forever for ```swss``` to be in ```"Running"``` state and it results in:

```
CONTAINER ID   IMAGE                                COMMAND                  CREATED        STATUS                    PORTS     NAMES
1abef1ecebff   bcbca2b74df6                         "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         what-just-happened
3c924d405cd5   docker-lldp:latest                   "/usr/bin/docker-lld…"   22 hours ago   Up 22 hours                         lldp
eb2b12a98c13   docker-router-advertiser:latest      "/usr/bin/docker-ini…"   22 hours ago   Up 22 hours                         radv
d6aac4a46974   docker-sonic-mgmt-framework:latest   "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         mgmt-framework
d880fd07aab9   docker-platform-monitor:latest       "/usr/bin/docker_ini…"   22 hours ago   Up 22 hours                         pmon
75f9e22d4fdd   docker-snmp:latest                   "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         snmp
76d570a4bd1c   docker-sonic-telemetry:latest        "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         telemetry
ee49f50344b3   docker-syncd-mlnx:latest             "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         syncd
1f0b0bab3687   docker-teamd:latest                  "/usr/local/bin/supe…"   22 hours ago   Up 22 hours                         teamd
917aeeaf9722   docker-orchagent:latest              "/usr/bin/docker-ini…"   22 hours ago   Exited (0) 22 hours ago             swss
81a4d3e820e8   docker-fpm-frr:latest                "/usr/bin/docker_ini…"   22 hours ago   Up 22 hours                         bgp
f6eee8be282c   docker-database:latest               "/usr/local/bin/dock…"   22 hours ago   Up 22 hours                         database
```

The check for ```"Running"``` state is not needed because for cold boot case we do ```start_peer_and_dependent_services``` and for warm boot case the loop will retry to wait for container if this container is doing warm boot:
d01a91a569/files/image_config/misc/docker-wait-any (L56)

#### How I did it

Removed the check for ```"Running"```.

#### How to verify it

Kill swss before ```docker-wait-any``` is reached and verify auto restart will restart swss serivce.
2022-09-06 09:26:54 -07:00
Zain Budhwani
6a54bc439a
Streaming structured events implementation (#11848)
With this PR in, you flap BGP and use events_tool to see the published events.
With telemetry PR #111 in and corresponding submodule update done in buildimage, one could run gnmi_cli to capture BGP flap events.
2022-09-03 07:33:25 -07:00
Ying Xie
a6843927d9
[mux] skip mux operations during warm shutdown (#11937)
* [mux] skip mux operations during warm shutdown

- Enhance write_standby.py script to skip actions during warm shutdown.
- Expand the support to BGP service.
- MuX support was added by a previous PR.
- don't skip action during warm recovery

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2022-09-02 13:50:42 -07:00
Lawrence Lee
a762b35cbc
[arp_update]: Set failed IPv6 neighbors to incomplete (#11919)
After pinging any failed IPv6 neighbor entries, set the remaining failed/incomplete entries to a permanent INCOMPLETE state. This manual setting to INCOMPLETE prevents these entries from automatically transitioning to FAILED state, and since they are now incomplete any subsequent NA messages for these neighbors is able to resolve the entry in the cache.

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2022-09-02 13:40:40 -07:00
Longxiang Lyu
6e878a36da
[mux] Exit to write standby state to active-active ports (#11821)
[mux] Exit to write standby state to `active-active` ports

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
2022-08-31 13:10:22 -07:00
abdosi
3bf1abb2dc
Address Review Comment to define SONIC_GLOBAL_DB_CLI in gbsyncd.sh (#11857)
As part of PR #11754

    Change was added to use variable SONIC_DB_NS_CLI for
    namespace but that will not work since ./files/scripts/syncd_common.sh
    uses SONIC_DB_CLI. So revert back to use SONIC_DB_CLI and define new
    variable for SONIC_GLOBAL_DB_CLI for global/host db cli access

   Also fixed DB_CLI not working for namespace.
2022-08-29 08:19:28 -07:00
Hua Liu
214e394ac0
Remove swsssdk from rules and image. (#11469)
#### Why I did it
To deprecate swsssdk, remove all dependency to it. 

#### How I did it
Remove swsssdk from rules and build image scripts.

#### How to verify it
Pass all UT and E2E test case

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205

#### Description for the changelog
Remove swsssdk from rules and build image scripts.

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

#### A picture of a cute animal (not mandatory but encouraged)
2022-08-25 08:35:51 +08:00
anamehra
f404ce60e0
container_checker on supervisor should check containers based on asic presence (#11442)
Why I did it
On a supervisor card in a chassis, syncd/teamd/swss/lldp etc dockers are created for each Switch Fabric card. However, not all chassis would have all the switch fabric cards present. In this case, only dockers for Switch Fabrics present would be created.

The monit 'container_checker' fails in this scenario as it is expecting dockers for all Switch Fabrics (based on NUM_ASIC defined in asic.conf file).
2022-08-22 10:08:29 -07:00
abdosi
535612f808
Added support to add gbsyncd in Feature Table of Host Config DB (#11754)
Why I did:

In case of multi-asic platforms gbsyncd is not getting added to Feature Table of Host Config DB. Without this container_checker complains of not needed gbsyncd container's are running.

How I did:
Update Both Host and Namespace config db when gbsyncd docker is starting.

How I verify:
Verified on Multi-asic platforms.
2022-08-17 14:02:21 -07:00
Stepan Blyshchak
a66941a6ce
[syncd.sh] 'sxdkernel start' => 'sxdkernel restart' (#11718)
Change `sxdkernel start` to `sxdkernel restart`. If `syncd` service crashes in `ExecStartPre` systemd will not call `ExecStop` and thus will not call `sxdkernel stop`. Use of `sxdkernel restart` is more robust in terms of guarantees to restore the system after unexpected crashes.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-08-15 13:35:34 -07:00
lixiaoyuner
8d6431e754
Add k8s master feature (#11637)
* Add k8s master feature

Signed-off-by: Yun Li <yunli1@microsoft.com>

* Update kubernetes version mistake and make variable passing clear

Signed-off-by: Yun Li <yunli1@microsoft.com>

* Add CRI-dockerd package

Signed-off-by: Yun Li <yunli1@microsoft.com>

* Update version variable passing logic

Signed-off-by: Yun Li <yunli1@microsoft.com>

* Upgrade the worker kubernetes version

Signed-off-by: Yun Li <yunli1@microsoft.com>

* Install xml file parse tool

Signed-off-by: Yun Li <yunli1@microsoft.com>

Signed-off-by: Yun Li <yunli1@microsoft.com>
2022-08-13 23:01:35 +08:00
Nikola Dancejic
23dcfdf9b6
[swss] Adding conditional for bgp when on multi ASIC platform (#11691)
bgp should be a per-asic service, and runs for each namespace on
multi-asic platforms. However, putting bgp in MULTI_INST_DEPENDENT
causes swss to be restarted as well as bgp. this is causing issues after #11000

Issue: #11653

This fix:

removes bgp from dependents list
adds a conditional that either adds bgp, or bgp@$DEV to separate
between single and multi-asic platforms
2022-08-12 11:34:10 -07:00
Stepan Blyshchak
2d4299308d
[swss.sh/syncd.sh] Trap only on EXIT (#11590)
When using trap on SIGTERM the script will not react to the SIGTERM signal sent while a child is executing.
I.e, the following script does not react on SIGTERM sent to it if it is
waiting for sleep to finish:

```

trap "echo Handled SIGTERM" 0 2 3 15

echo "Before sleep"
sleep inf
echo "After sleep"
```

Instead, trap only on EXIT which covers also a scenario with exit on
SIGINT, SIGTERM.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-08-10 20:57:07 -07:00
Lawrence Lee
889741c9bc
[arp_update]: Resolve failed neighbors on dualtor (#11615)
In arp_update, check for FAILED or INCOMPLETE kernel neighbor entries and manually ping them to try and resolve the neighbor

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2022-08-09 16:19:42 -07:00
Ying Xie
a3e3530d1d
[write_standby] update write_standby.py script (#11650)
Why I did it
The initial value has to be present for the state machines to work. In active-standby dual-tor scenario, or any hardware mux scenario, the value will be updtaed eventually with a delay.

However, in active-active dual-tor scenario, there is no other mechanism to initialize the value and get state machines started.
So this script will have to write something at start up time.

For active-active dualtor, 'active' is a more preferred initial value, the state machine will switch the state to standby soon if
link prober found link not in good state.

How I did it
Update the script to always provide initial values.

How to verify it
Tested on active-active dual-tor testbed.

Signed-off-by: Ying Xie ying.xie@microsoft.com
2022-08-09 14:21:29 -07:00
Sudharsan Dhamal Gopalarathnam
5dc4eb9693
[vs]Preventing ebtables cfg to be applied on vs (#11585)
*Preventing ebtables rules to be applied on KVM image. The ebtables rules in SONiC are added to prevent ARP as well as L2 forwarding to be blocked in linux kernel since the hardware will take care of the actual L2 forward. However this is not the case with KVM where linux needs to forward even L2 packets
2022-08-04 09:18:00 -07:00
bingwang-ms
dc799356aa
Support different DSCP_TO_TC_MAP for T1 in dualtor deployment (#11569)
* Support different DSCP_TO_TC_MAP for T1 in dualtor deployment
2022-08-01 09:35:34 +08:00
Nikola Dancejic
8f6b568acf
[swss] Adding bgp container as dependent of swss (#11000)
What I did:
Added bgp as a dependent of swss

Why I did it:
bgp container was not restarting on swss crash. When swss crashes, linkmgrd
doesn't initate a switchover because it cannot access the default route from
orchagent. Bringing down bgp with swss will isolate the ToR, causing linkmgrd
to initiate a switchover to the peer ToR avoiding significant packet loss.

How I did it:
Added bgp to DEPENDENT

Signed-off-by: Nikola Dancejic <ndancejic@microsoft.com>
2022-07-29 16:22:20 -07:00
Jing Zhang
626919e250
Update WARM START FINALIZER to wait for linkmgrd to reconcile (#11477)
Spanning from sonic-net/sonic-linkmgrd#76, this PR is to update warm restart finalizer to wait for linkmgrd to be reconciled.

sign-off: Jing Zhang zhangjing@microsoft.com

Why I did it
To make sure finalizer save config after linkmgrd's reconciliation.

How I did it
Add linkmgrd to the reconciliation wait list of warmboot finalizer.

How to verify it
Verified on lab device, linkmgrd reconciled as expected.
2022-07-28 09:08:53 -07:00
Stepan Blyshchak
925a393e3d
[swss.sh] clear counters cache folder on swss cold/fast reload (#11244)
A change in sonic-utilities makes all cache files be saved into a
/tmp/cache. On swss restart this cache has to be removed in case swss
starts in cold or fast mode. A related cache restoration in the warmboot
finalizer script is also updated to use new location.

- Why I did it
To fix #9817. Clear the cache directory on swss.sh except for warm start.
Also, adopted finalize-warmboot script to take the cache directory.

- How I did it
A change in sonic-utilities makes all cache files be saved into a /tmp/cache. On swss restart this cache has to be removed in case swss starts in cold or fast mode. A related cache restoration in the warmboot finalizer script is also updated to use new location.

- How to verify it
Run togather with Azure/sonic-utilities#2232. Verify counters cache is removed on config reload, cold/fast reboots, swss restart.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-07-28 12:03:22 +03:00
Lior Avramov
069b3a4669
[memory_checker] Do not check memory usage of containers if docker daemon is not running (#11476)
Fix in Monit memory_checker plugin. Skip fetching running containers if docker engine is down (can happen in deinit).
This PR fixes issue #11472.

Signed-off-by: liora liora@nvidia.com

Why I did it
In the case where Monit runs during deinit flow, memory_checker plugin is fetching the running containers without checking if Docker service is still running. I added this check.

How I did it
Use systemctl is-active to check if Docker engine is still running.

How to verify it
Use systemctl to stop docker engine and reload Monit, no errors in log and relevant print appears in log.

Which release branch to backport (provide reason below if selected)
The fix is required in 202205 and 202012 since the PR that introduced the issue was cherry picked to those branches (#11129).
2022-07-27 16:18:36 -07:00
abdosi
a380105461
Enable ARP Update Script for Packet based chassis. (#11465)
What I did:

    Following changes done for packet based chassis:-
    1> Run arp_update on LC's to resolve static route nexthops over backend
    port-channel interfaces.
    2> On Supervisor make sure arp_update exit gracefully
2022-07-26 16:50:16 -07:00
Iris Hsu
f323f56c54
flush VRF_OBJECT_TABLE table on state db when swss start (#11509)
*flush VRF_OBJECT_TABLE table on state db when swss start
2022-07-21 18:01:39 -07:00
gregshpit
5df09490dc
Ported Marvell armhf build on amd64 host for debian buster to use cross-comp… (#8035)
* Ported Marvell armhf build on x86 for debian buster to use cross-compilation instead of qemu emulation

Current armhf Sonic build on amd64 host uses qemu emulation. Due to the
nature of the emulation it takes a very long time, about 22-24 hours to
complete the build. The change I did to reduce the building time by
porting Sonic armhf build on amd64 host for Marvell platform for debian
buster to use cross-compilation on arm64 host for armhf target. The
overall Sonic armhf building time using cross-compilation reduced to
about 6 hours.

Signed-off-by: marvell <marvell@cpss-build3.marvell.com>

* Fixed final Sonic image build with dockers inside

* Update Dockerfile.j2

Fixed qemu-user-static:x86_64-aarch64-5.0.0-2 .

* Update cross-build-arm-python-reqirements.sh

Added support for both armhf and arm64 cross-build platform using $PY_PLAT environment variable.

* Update Makefile

Added TARGET=<cross-target> for armhf/arm64 cross-compilation.

* Reviewer's @qiluo-msft requests done

Signed-off-by: marvell <marvell@cpss-build3.marvell.com>

* Added new radius/pam patch for arm64 support

* Update slave.mk

Added missing back tick.

* Added libgtest-dev: libgmock-dev: to the buster Dockerfile.j2. Fixed arm perl version to be generic

* Added missing armhf/arm64 entries in /etc/apt/sources.list

* fix libc-bin core dump issue from xumia:fix-libc-bin-install-issue commit

* Removed unnecessary 'apt-get update' from sonic-slave-buster/Dockerfile.j2

* Fixed saiarcot895 reviewer's requests

* Fixed README and replaced 'sed/awk' with patches

* Fixed ntp build to use openssl

* Unuse sonic-slave-buster/cross-build-arm-python-reqirements.sh script (put all prebuilt python packages cross-compilation/install inside Dockerfile.j2). Fixed src/snmpd/Makefile to use -j1 in all cases

* Clean armhf cross-compilation build fixes

* Ported cross-compilation armhf build to bullseye

* Additional change for bullseye

* Set CROSS_BUILD_ENVIRON default value n

* Removed python2 references

* Fixes after merge with the upstream

* Deleted unused sonic-slave-buster/cross-build-arm-python-reqirements.sh file

* Fixed 2 @saiarcot895 requests

* Fixed @saiarcot895 reviewer's requests

* Removed use of prebuilt python wheels

* Incorporated saiarcot895 CC/CXX and other simplification/generalization changes

Signed-off-by: marvell <marvell@cpss-build3.marvell.com>

* Fixed saiarcot895 reviewer's  additional requests

* src/libyang/patch/debian-packaging-files.patch

* Removed --no-deps option when installing wheels. Removed unnecessary lazy_object_proxy arm python3 package instalation

Co-authored-by: marvell <marvell@cpss-build3.marvell.com>
Co-authored-by: marvell <marvell@cpss-build2.marvell.com>
2022-07-21 14:15:16 -07:00
Nazarii Hnydyn
e4e3adcbc2
[ssip]: Update config generator (#10991)
- Why I did it

To implement Syslog Source IP feature
In order to include the following commit: 8e5d478 [ssip]: Add CLI (#2191)

- How I did it
Updated syslog config template
Advanced submodule sonic-utilities

ea11b22 [sonic-bootchart] add sonic-bootchart (#2195)
8e5d478 [ssip]: Add CLI (#2191)
1dacb7f Replace pyswsssdk with swsscommon (#2251)

- How to verify it
make configure PLATFORM=mellanox
make target/sonic-mellanox.bin

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2022-07-20 10:05:13 +03:00
Stephen Sun
8f4a1b7b85
[Mellanox] Support Mellanox-SN4600C-C64 as T1 switch in dual-ToR scenario (#11261)
- Why I did it
Support Mellanox-SN4600C-C64 as T1 switch in dual-ToR scenario
This is to port #11032 and #11299 from 202012 to master.

Support additional queue and PG in buffer templates, including both traditional and dynamic model
Support mapping DSCP 2/6 to lossless traffic in the QoS template.
Add macros to generate additional lossless PG in the dynamic model
Adjust the order in which the generic/dedicated (with additional lossless queues) macros are checked and called to generate buffer tables in common template buffers_config.j2
Buffer tables are rendered via using macros.
Both generic and dedicated macros are defined on our platform. Currently, the generic one is called as long as it is defined, which causes the generic one always being called on our platform. To avoid it, the dedicated macrio is checked and called first and then the generic ones.
Support MAP_PFC_PRIORITY_TO_PRIORITY_GROUP on ports with additional lossless queues.
On Mellanox-SN4600C-C64, buffer configuration for t1 is calculated as:

40 * 100G downlink ports with 4 lossless PGs/queues, 1 lossy PG, and 3 lossy queues
16 * 100G uplink ports with 2 lossless PGs/queues, 1 lossy PG, and 5 lossy queues

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2022-07-20 09:48:15 +03:00
andywongarista
88d0ce5ce8
Add gbsyncd container for broncos (#11154)
* Add docker-gbsyncd-broncos support
* Address review comments
* Add socket to gbsyncd
* Upgrade gbsyncd-broncos to bullseye
2022-07-18 10:57:27 +08:00
bingwang-ms
52c36cd31f
Add flag to control the generation of PORT_QOS_MAP|global entry (#11448)
Why I did it
This PR is to add a flag to control whether to generate PORT_QOS_MAP|global entry or not.
It's because for some HWSKU, such as BackEndToRRouter and BackEndLeafRouter, there is no DSCP_TO_TC_MAP defined.
Hence, if the PORT_QOS_MAP|global entry is generated, OA will report some error because the DSCP_TO_TC_MAP map AZURE can not be found.

Jul 14 00:24:40.286767 str2-7050qx-32s-acs-03 ERR swss#orchagent: :- saiObjectTypeQuery: invalid object id oid:0x7fddb43605d0
Jul 14 00:24:40.286767 str2-7050qx-32s-acs-03 ERR swss#orchagent: :- meta_generic_validation_objlist: SAI_SWITCH_ATTR_QOS_DSCP_TO_TC_MAP:SAI_ATTR_VALUE_TYPE_OBJECT_ID object on list [0] oid 0x7fddb43605d0 is not valid, returned null object id
Jul 14 00:24:40.286767 str2-7050qx-32s-acs-03 ERR swss#orchagent: :- applyDscpToTcMapToSwitch: Failed to apply DSCP_TO_TC QoS map to switch rv:-5
Jul 14 00:24:40.286767 str2-7050qx-32s-acs-03 ERR swss#orchagent: :- doTask: Failed to process QOS task, drop it
This PR is to address the issue.

How I did it
Add a flag require_global_dscp_to_tc_map to control whether to generate the PORT_QOS_MAP|global entry. The default value for require_global_dscp_to_tc_map is true. If the device type is storage backend, the value is changed to false. Then the PORT_QOS_MAP|global entry is not generated.

How to verify it
Update the current test_qos_dscp_remapping_render_template to cover storage backend.
2022-07-14 10:19:00 -07:00
Alexander Allen
429254cb2d
[arm] Refactor installer and build to allow arm builds targeted at grub platforms (#11341)
Refactors the SONiC Installer to support greater flexibility in building for a given architecture and bootloader. 

#### Why I did it

Currently the SONiC installer assumes that if a platform is ARM based that it uses the `uboot` bootloader and uses the `grub` bootloader otherwise. This is not a correct assumption to make as ARM is not strictly tied to uboot and x86 is not strictly tied to grub. 

#### How I did it

To implement this I introduce the following changes:
* Remove the different arch folders from the `installer/` directory
* Merge the generic components of the ARM and x86 installer into `installer/installer.sh`
* Refactor x86 + grub specific functions into `installer/default_platform.conf` 
* Modify installer to call `default_platform.conf` file and also call `platform/[platform]/patform.conf` file as well to override as needed
* Update references to the installer in the `build_image.sh` script
* Add `TARGET_BOOTLOADER` variable that is by default `uboot` for ARM devices and `grub` for x86 unless overridden in `platform/[platform]/rules.mk`
* Update bootloader logic in `build_debian.sh` to be based on `TARGET_BOOTLOADER` instead of `TARGET_ARCH` and to reference the grub package in a generic manner

#### How to verify it

This has been tested on a ARM test platform as well as on Mellanox amd64 switches as well to ensure there was no impact. 

#### Description for the changelog
[arm] Refactor installer and build to allow arm builds targeted at grub platforms

#### Link to config_db schema for YANG module changes

N/A
2022-07-12 15:00:57 -07:00
geogchen
5171589f4d
Add support to generate /e/n/i when there are multiple MGMT_INTERFACE (#11368)
Why I did it
Currently interfaces.j2 hardcodes to eth0 even when there are multiple interfaces in MGMT_INTERFACE. This change adds support to generate /e/n/i when there are multiple interfaces in MGMT_INTERFACE.

How I did it
By removing hardcoded eth0 when looping through MGMT_INTERFACE.

How to verify it
Verified through unit test.

Which release branch to backport (provide reason below if selected)
 201811
 201911
 202006
 202012
 202106
 202111
 202205
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)
2022-07-12 14:55:15 -07:00
tjchadaga
4f95974669
Add load_minigraph option to include traffic-shift-away during config migration (#11403) 2022-07-12 10:08:58 -07:00
tjchadaga
849eb4bf32
Changes to persist TSA/B state across reloads (#11257) 2022-07-12 00:22:48 -07:00
Neetha John
7cd10019b8
Add backend acl template (#11220)
Why I did it
Storage backend has all vlan members tagged. If untagged packets are received on those links, they are accounted as RX_DROPS which can lead to false alarms in monitoring tools. Using this acl to hide these drops.

How I did it
Created a acl template which will be loaded during minigraph load for backend. This template will allow tagged vlan packets and dropped untagged

How to verify it
Unit tests

Signed-off-by: Neetha John <nejo@microsoft.com>
2022-07-06 10:24:16 -07:00
Stepan Blyshchak
ef8675d7ab
[sonic_debian_extension] install systemd-bootchart (#11047)
- Why I did it
Implemented sonic-net/SONiC#1001

- How I did it
Install systemd-bootchart tool and provide default config for it.

- How to verify it
Run build and verify systemd-bootchart is installed.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-07-06 14:03:31 +03:00
judyjoseph
baaf8b085f
Revert "Update include_macsec flag if type is SpineRouter (#11141)" (#11306)
This reverts commit c9f36957db.
2022-07-01 11:00:30 -07:00
Jing Zhang
5d03b5d0df
Avoid write_standby in warm restart context (#11283)
Avoid write_standby in warm restart context.

sign-off: Jing Zhang zhangjing@microsoft.com

Why I did it
In warm restart context, we should avoid mux state change.

How I did it
Check warm restart flag before applying changes to app db.

How to verify it
Ran write_standby in table missing, key missing, field missing scenarios.
Did a warm restart, app db changes were skipped. Saw this in syslog:
WARNING write_standby: Taking no action due to ongoing warmrestart.
2022-06-29 21:34:02 -07:00
andywongarista
6e0559d5fa
[Arista] Add initial support for 720DT-48S (#10656)
Added initial set of config files to allow for booting and partial traffic testing in SONiC on the 720DT-48S.

How to verify it
- Switch boots
- show interfaces status shows links up on interfaces Ethernet24-51
- Traffic flows with no errors on interfaces Ethernet24-51
2022-06-29 09:56:24 -07:00
davidpil2002
8b7d069880
Add support for Password Hardening (#10323)
- Why I did it
New security feature for enforcing strong passwords when login or changing passwords of existing users into the switch.

- How I did it
By using mainly Linux package named pam-cracklib that support the enforcement of user passwords, the daemon named hostcfgd, will support add/modify password policies that enforce and strengthen the user passwords.

- How to verify it
Manually Verification-
1. Enable the feature, using the new sonic-cli command passw-hardening or manually add the password hardening table like shown in HLD by using redis-cli command

2. Change password policies manually like in step 1.
Notes:
password hardening CLI can be found in sonic-utilities repo-
P.R: Add support for Password Hardening sonic-utilities#2121
code config path: config/plugins/sonic-passwh_yang.py
code show path: show/plugins/sonic-passwh_yang.py

3. Create a new user (using adduser command) or modify an existing password by using passwd command in the terminal. And it will now request a strong password instead of default linux policies.

Automatic Verification - Unitest:
This PR contained unitest that cover:
1. test default init values of the feature in PAM files
2. test all the types of classes policies supported by the feature in PAM files
3. test aging policy configuration in PAM files
2022-06-29 15:34:56 +03:00
bingwang-ms
ac86f71287
Add extra lossy PG profile for ports between T1 and T2 (#11157)
Signed-off-by: bingwang <wang.bing@microsoft.com>

Why I did it
This PR brings two changes

Add lossy PG profile for PG2 and PG6 on T1 for ports between T1 and T2.
After PR Update qos config to clear queues for bounced back traffic #10176 , the DSCP_TO_TC_MAP and TC_TO_PG_MAP is updated when remapping is enable

DSCP_TO_TC_MAP
Before	After	Why do this change
"2" : "1"	"2" : "2"	Only change for leaf router to map DSCP 2 to TC 2 as TC 2 will be used for lossless TC
"6" : "1"	"6" : "6"	Only change for leaf router to map DSCP 6 to TC 6 as TC 6 will be used for lossless TC

TC_TO_PRIORITY_GROUP_MAP
Before	After	Why do this change
"2" : "0"	"2" : "2"	Only change for leaf router to map TC 2 to PG 2 as PG 2 will be used for lossless PG
"6" : "0"	"6" : "6"	Only change for leaf router to map TC 6 to PG 6 as PG 6 will be used for lossless PG

So, we have two new lossy PGs (2 and 6) for the T2 facing ports on T1, and two new lossless PGs (2 and 6) for the T0 facing port on T1.
However, there is no lossy PG profile for the T2 facing ports on T1. The lossless PGs for ports between T1 and T0 have been handled by buffermgrd .Therefore, We need to add lossy PG profiles for T2 facing ports on T1.

We don't have this issue on T0 because PG 2 and PG 6 are lossless PGs, and there is no lossy traffic mapped to PG 2 and PG 6

Map port level TC7 to PG0
Before the PCBB change, DSCP48 -> TC 6 -> PG 0.
After the PCBB change, DSCP48 -> TC 7 -> PG 7
Actually, we can map TC7 to PG0 to save a lossy PG.

How I did it
Update the qos and buffer template.

How to verify it
Verified by UT.
2022-06-28 12:50:33 -07:00
judyjoseph
c9f36957db
Update include_macsec flag if type is SpineRouter (#11141)
Add the support to enable macsec when type is SpineRouter
2022-06-24 10:32:02 -07:00
geogchen
6a9c058a92
Revert "Add support for generating interface configuration in /etc/network/interfaces for multiple management interfaces (#11204)" (#11241)
This reverts commit 90a849ea85.

#### Why I did it
The interfaces unit test did not cover some of the conditions in interfaces.j2 that was changed in #11204.  Therefore reverting the change and add the tests before making the change to interfaces.j2.

#### How I did it
Git revert.

#### How to verify it

#### Which release branch to backport (provide reason below if selected)


- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205

#### Description for the changelog

#### Link to config_db schema for YANG module changes

#### A picture of a cute animal (not mandatory but encouraged)
2022-06-24 06:30:33 -07:00
Sudharsan Dhamal Gopalarathnam
9452095e25
[lldp]Fix lldp spawned after reboot when disabled (#11080)
- Why I did it
When LLDP is disabled through feature command, it gets spawned after reboot.

- How I did it
In syncd.sh check if the service is enabled before spawning automatically during cold reboot.

- How to verify it
Disable lldp feature. Perform cold reboot and verify its not spawned.
2022-06-22 03:11:41 +03:00
geogchen
90a849ea85
Add support for generating interface configuration in /etc/network/interfaces for multiple management interfaces (#11204)
* [Interfaces] Modify template to support multiple management interfaces

* Modify minigraph to process interfaces in sorted order

Signed-off-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net>

* Add UT minigraph

Signed-off-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net>

* make case insensitve comparison

Signed-off-by: George Chen <gechen@microsoft.com>

* Use natural sort

Signed-off-by: George Chen <gechen@microsoft.com>

Co-authored-by: Ubuntu <gechen@gechen-sonic-dev.d0r25nej54guppclip4gpy5b5a.jx.internal.cloudapp.net>
2022-06-21 10:16:10 -07:00
xumia
fdef1f0342
[Build]: Support to use symbol links for lazy installation targets to reduce the image size (#10923)
Why I did it
Support to use symbol links in platform folder to reduce the image size.
The current solution is to copy each lazy installation targets (xxx.deb files) to each of the folders in the platform folder. The size will keep growing when more and more packages added in the platform folder. For cisco-8000 as an example, the size will be up to 2G, while most of them are duplicate packages in the platform folder.

How I did it
Create a new folder in platform/common, all the deb packages are copied to the folder, any other folders where use the packages are the symbol links to the common folder.

Why platform.tar?
We have implemented a patch for it, see #10775, but the problem is the the onie use really old unzip version, cannot support the symbol links.
The current solution is similar to the PR 10775, but make the platform folder into a tar package, which can be supported by onie. During the installation, the package.tar will be extracted to the original folder and removed.
2022-06-21 13:03:55 +08:00
jingwenxie
fdc65d7600
Remove minigraph loading in updategraph script (#11146)
Why I did it
Minigraph will be deprecated in the future. So minigraph related reload should be deleted.

How I did it
Remove unused load_minigraph
2022-06-21 08:57:57 +08:00
Stepan Blyshchak
42576d2664
[auto-ts] add memory check (#10433)
#### Why I did it

To support automatic techsupport invokation in case memory usage is too high.

#### How I did it

Implemented according to https://github.com/Azure/SONiC/pull/939

#### How to verify it

UT, manual test on the switch.

*DEPENDS* on https://github.com/Azure/sonic-utilities/pull/2116
2022-06-20 09:39:05 -07:00
yozhao101
241f4454b4
[memory_checker] Do not check memory usage of containers which are not created (#11129)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
This PR aims to fix an issue (#10088) by enhancing the script memory_checker.

Specifically, if container is not created successfully during device is booted/rebooted, then memory_checker do not need check its memory usage.

How I did it
In the script memory_checker, a function is added to get names of running containers. If the specified container name is not in current running container list, then this script will exit without checking its memory usage.

How to verify it
I tested on a lab device by following the steps:

Stops telemetry container with command sudo systemctl stop telemetry.service

Removes telemetry container with command docker rm telemetry

Checks whether the script memory_checker ran by Monit will generate the syslog message saying it will exit without checking memory usage of telemetry.
2022-06-17 12:13:18 -07:00
bingwang-ms
83f23e26ff
Generate switch level dscp_to_tc_map entry from qos_config template (#11087)
* Generate switch level dscp_to_tc_map

Signed-off-by: bingwang <wang.bing@microsoft.com>
2022-06-17 08:44:30 +08:00
byu343
89020f53e4
[Arista] Add support support for 7060dx5_64s and 7060px5_64s (#10888)
Why I did it
This change adds the support for Arista 7060dx5_64s and 7060px5_64s

How I did it
How to verify it
We verified the platform driver is working and the ports are up on 7060dx5_64s and 7060px5_64s.
2022-06-16 09:51:42 -07:00
Samuel Angebault
30bfed92fd
[Arista] Add configuration files for 7050X4-32S platform (#10799)
Add most configuration files for the DCS-7050PX4-32S and DCS-7050DX4-32S.
This review only contains platform configuration files, dataplane ones will follow in future change.

Co-authored-by: Zhi Yuan (Carl) Zhao <zyzhao@arista.com>
2022-06-16 09:42:10 -07:00
shlomibitton
1474ad76d8
[Mellanox] [pmon] Fix for PMON service not starting when restarting SWSS service after fast/warm reboot (#10901)
- Why I did it
Recent change to delay PMON service in case of fast/warm reboot introduce an issue when restarting only SWSS service after fast/warm reboot for Nvidia platform.
Since the timer is triggered only when the system boot, in a scenario when the system is after a fast/warm reboot and the user restart SWSS service, as part of syncd.sh script, PMON service will stop but the timer will not start again.

- How I did it
On syncd.sh script, in case of fast/warm indication, check if pmon.timer is running.
If it is running it means we are at the first boot and continue normally.
If it is not running, meaning the service was restarted, start the timer to keep the system behavior consistent.

- How to verify it
Run fast/warm reboot.
service swss restart.
Observe PMON service starting.

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2022-06-16 12:15:09 +03:00
jingwenxie
cca3b5be5b
Reduce logic in updategraph (#11010)
Why I did it
The dhcp_graph_url used by internal service is always set as "N/A". So we can make the updategraph logic short.

How I did it
Shorten 'if statement' logic for /tmp/dhcp_graph_url
2022-06-14 22:18:47 +08:00
judyjoseph
0b1ae9c43c
Cleanup macsec stateDB tables on restart (#11066)
Clean macsec tables in STATE_DB on start
2022-06-09 15:32:24 -07:00
bingwang-ms
1cc602c6af
Add two extra lossless queues for bounced back traffic (#10496)
Signed-off-by: bingwang <bingwang@microsoft.com>

Why I did it
This PR is to add two extra lossless queues for bounced back traffic.
HLD sonic-net/SONiC#950

SKUs include
Arista-7050CX3-32S-C32
Arista-7050CX3-32S-D48C8
Arista-7260CX3-D108C8
Arista-7260CX3-C64
Arista-7260CX3-Q64

How I did it
Update the buffers.json.j2 template and buffers_config.j2 template to generate new BUFFER_QUEUE table.

For T1 devices, queue 2 and queue 6 are set as lossless queues on T0 facing ports.
For T0 devices, queue 2 and queue 6 are set as lossless queues on T1 facing ports.
Queue 7 is added as a new lossy queue as DSCP 48 is mapped to TC 7, and then mapped into Queue 7

How to verify it
Verified by UT
Verified by coping the new template and generate buffer config with sonic-cfggen
2022-06-02 13:03:27 -07:00
bingwang-ms
0c9bbee735
Update qos template to support SYSTEM_DEFAULT table (#10936)
* Update qos template to support SYSTEM_DEFAULT table

Signed-off-by: bingwang <wang.bing@microsoft.com>
2022-06-02 21:48:57 +08:00
xumia
0552d6b172
Support symcrypt fips config for aboot/uboot (#10729)
Why I did it
Support symcrypt fips config for aboot/uboot
2022-06-02 15:35:17 +08:00
Hua Liu
96954f0134
[swsscommon] Add c++ version sonic-db-cli from sonic-swss-common (#10825)
#### Why I did it
    Fix sonic-db-cli high CPU usage on SONiC startup issue: https://github.com/Azure/sonic-buildimage/issues/10218
    ETA of this issue will be 2022/05/31

#### How I did it
    Re-write sonic-cli with c++ in sonic-swss-common: https://github.com/Azure/sonic-swss-common/pull/607
    Modify swss-common rules and slave.mk to install c++ version sonic-db-cli.
    

#### How to verify it
    Pass all E2E test scenario.

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111

#### Description for the changelog
    Build and install c++ version sonic-db-cli from swss-common.

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/SONiC/wiki/Configuration.
-->

#### A picture of a cute animal (not mandatory but encouraged)
2022-06-01 08:05:53 +08:00
Lukas Stockner
c9b27cde71
[swss] Clear VXLAN tunnel table from State DB on startup (#10822)
* When reloading config after crashes, VTEP interfaces are sometimes not created since the tunnel still exists in the STATE_DB.
* Adding VXLAN_TUNNEL_TABLE to the list of tables to be cleaned in swss.sh fixes the problem.
2022-05-31 08:54:31 -07:00
davidpil2002
ab0930313b
[YANG] Add support for Password Hardening (#10322)
- Why I did it
Yang Model about password hardening feature, the sonic CLI of this feature was autogenerated from this Yang model

- How I did it
Create new Yang model in src/sonic-yang-models/yang-models/sonic-passwh.yang.

- How to verify it
There are unitests(yang test) in this P.R covering all the passwords policies with good and bad values cases.
Or is possible manually using the config/show password commands that were autogenerated from this Yang model. (this CLI code added in sonic-utilities)
2022-05-29 13:54:51 +03:00
xumia
f0dfd398a6
Revert "Reduce image size for lazy installation packages (#10775)" (#10916)
This reverts commit 15cf9b0d70.
Why I did it
Revert the PR #10775, for it has impact on onie installation.
It is caused by the symbol links not supported in some of the onie unzip.
We will enable after fixing the issue, see #10914
2022-05-26 09:39:48 +08:00
abdosi
0285bfe42e
[chassis] Fix issues regarding database service failure handling and mid-plane connectivity for namespace. (#10500)
What/Why I did:

Issue1: By setting up of ipvlan interface in interface-config.sh we are not tolerant to failures. Reason being interface-config.service is one-shot and do not have restart capability. 

Scenario: For example if let's say database service goes in fail state  then interface-services also gets failed because of dependency check but later database service gets restart but interface service will remain in stuck state and the ipvlan interface nevers get created.

Solution: Moved all the logic in database service from interface-config service which looks more align logically also since the namespace is created here and all the network setting (sysctl) are happening here.With this if database starts we recreate the interface.

Issue 2: Use of IPVLAN vs MACVLAN

Currently we are using ipvlan mode.  However above failure scenario is not handle correctly by ipvlan mode. Once the ipvlan interface is created and ip address assign to it and if we restart interface-config or database (new PR) service Linux Kernel gives error "Error: Address already assigned to an ipvlan device."  based on this:https://github.com/torvalds/linux/blob/master/drivers/net/ipvlan/ipvlan_main.c#L978Reason being if we do not do cleanup of ip address assignment (need to be unique for IPVLAN)  it remains in Kernel Database and never goes to free pool even though namespace is deleted. 

Solution: Considering this hard dependency of unique ip macvlan mode is better for us and since everything is managed by Linux Kernel and no dependency for on user configured IP address.

Issue3: Namespace database Service do not check reachability to Supervisor Redis Chassis   Server.

Currently there is no explicit check as we never do Redis PING from namespace to Supervisor Redis Chassis  Server. With this check it's possible we will start database and all other docker even though there is no connectivity and will hit the error/failure late in cycle

Solution: Added explicit PING from namespace that will check this reachability.

Issue 4:flushdb give exception when trying to accces Chassis Server DB over Unix Sokcet.

Solution: Handle gracefully via try..except and log the message.
2022-05-24 16:54:12 -07:00
Maxime Lorrillere
392899682f
[Arista] Add support for Wolverine linecards (#8887)
Add support for WolverineQCpu, WolverineQCpuMs, WolverineQCpuBk, WolverineQCpuBkMs

Co-authored-by: Maxime Lorrillere <mlorrillere@arista.com>
2022-05-20 14:11:06 -07:00
Senthil Kumar Guruswamy
f37dd770cd
System Ready (#10479)
Why I did it
At present, there is no mechanism in an event driven model to know that the system is up with all the essential sonic services and also, all the docker apps are ready along with port ready status to start the network traffic. With the asynchronous architecture of SONiC, we will not be able to verify if the config has been applied all the way down to the HW. But we can get the closest up status of each app and arrive at the system readiness.

How I did it
A new python based system monitor tool is introduced under system-health framework to monitor all the essential system host services including docker wrapper services on an event based model and declare the system is ready. This framework gives provision for docker apps to notify its closest up status. CLIs are provided to fetch the current system status and also service running status and its app ready status along with failure reason if any.

How to verify it
"show system-health sysready-status" click CLI
Syslogs for system ready
2022-05-20 13:25:11 -07:00
Arun Saravanan Balachandran
f4b22f67a4
[initramfs]: SSD firmware upgrade in initramfs (#10748)
Why I did it
To upgrade SSD firmware in initramfs while rebooting from SONiC to SONiC and during NOS to SONiC migration.

How I did it
New option 'ssd-upgrader-part’ is introduced in grub command line, to indicate the partition and its filesystem type in which the SSD firmware updater is present. ‘ssd-upgrader-part’ syntax is ssd-upgrader-part=<partition>,<filesystem type>. Example: ssd-upgrader-part=/dev/sda8,ext4

A new initramfs script ‘ssd-upgrade’ is included in init-premount and it invokes the SSD firmware updater (ssd-fw-upgrade) present in the partition indicated by the boot option 'ssd-upgrader-part'

How to verify it
In SONiC, the SSD firmware updater is copied to “/host/” directory.
Fast-reboot is to be initiated with the ‘-u’ option ([scripts/fast-reboot] Add option to include ssd-upgrader-part boot option with SONiC partition sonic-utilities#2150)
After reboot, while booting into SONiC the SSD firmware updater will be executed in initramfs.
2022-05-12 08:11:02 -07:00
Marty Y. Lok
23f9126f59
[VoQ][config] Multiasic Supervisor card fails to load config_db#.json in chassis when system is reboot (#10106)
Supervisor card fails to load config_db#.json in chassis when system reboot. 
This is an intermittent issue, fixes #10105
2022-05-09 11:06:11 -07:00
xumia
15cf9b0d70
Reduce image size for lazy installation packages (#10775)
Why I did it
The image size is too large, when there are multiple lazy packages and multiple platforms. It is not necessary to keep the lazy installation packages in multiple copies.
For cisco image, the image size will reduce from 3.5G to 1.7G.

How I did it
Use symbol links to only keep one package for each of the lazy package.
Make a new folder fsroot/platform/common
Copy the lazy packages into the folder.
When using a package in each of the platform, such as x86_64-grub, x86_64-8800_rp-r0, x86_64-8201_on-r0, etc, only make a symbol link to the package in the common folder.
2022-05-09 08:26:09 -07:00
xumia
8ec8900d31
Support SONiC OpenSSL FIPS 140-3 based on SymCrypt engine (#9573)
Why I did it
Support OpenSSL FIPS 140-3, see design doc: https://github.com/Azure/SONiC/blob/master/doc/fips/SONiC-OpenSSL-FIPS-140-3.md.

How I did it
Install the fips packages.
To build the fips packages, see https://github.com/Azure/sonic-fips
Azure pipelines: https://dev.azure.com/mssonic/build/_build?definitionId=412

How to verify it
Validate the SymCrypt engine:

admin@sonic:~$ dpkg-query -W | grep openssl
openssl 1.1.1k-1+deb11u1+fips
symcrypt-openssl        0.1

admin@sonic:~$ openssl engine -v | grep -i symcrypt
(symcrypt) SCOSSL (SymCrypt engine for OpenSSL)
admin@sonic:~$
2022-05-06 07:21:30 +08:00
Junchao-Mellanox
681c24878b
Fix race condition between networking service and interface-config service (#10573)
Why I did it
The PR is aimed to fix a bug that mgmt port eth0 may loss IP even if user configured static IP of eth0. This is not a always reproduceable issue, the reproducing flow is like:

Systemd starts networking service, which runs a dhcp based configuration and assigned an ip from dhcp.
Systemd starts interface-config service who depends on networking service
Interface-config service runs command “ifdown –force eth0”, check line. but networking service is still running so that this line failed with error: “error: Another instance of this program is already running.”. This error is printed by ifupdown2 lib who is the main process of networking service. So, ifdown actually does not work here, the ip of eth0 is not down.
Interface-config service updates /etc/networking/interface to static configuration.
Interface-config service runs command “systemctl restart networking”. This command kills the previous networking related processes (log: networking.service: Main process exited, code=killed, status=15/TERM), and try to reconfigure the ip address with static configuration. But it detects that the configured IP and the existing IP are the same, and it does not really configure the ip to kernel. Hence, the ip is still getting from dhcp. (this could be a bug of ifupdown2: previous ip is from dhcp, new ip is a static ip, it treats them as same instead of re-configuring the IP)
When the lease of the ip expires, the ip of eth0 is removed by kernel and the issue reproduces.
The issue is not always reproduceable because networking service usually runs fast so that it won't hit step#3.

How I did it
Check networking service state before running "ifdown –force eth0", wait for it done if it is activating.

How to verify it
Manual test.
2022-05-05 15:21:44 -07:00
shlomibitton
4ec3af86af
[Fastboot] Delay PMON service for better fastboot performance (#10567)
- Why I did it
Profiling the system state on init after fast-reboot during create_switch function execution, it is possible to see few python scripts running at the same time.
This parallel execution consume CPU time and the duration of create_switch is longer than it should be.
Following this finding, and the motivation to ensure these services will not interfere in the future, PMON is delayed in 90 seconds until the system finish the init flow after fastboot.

- How I did it
Add a timer for PMON service.
Exclude for MLNX platform the start trigger of PMON when SYNCD starts in case of fastboot.
Copy the timer file to the host bin image.

- How to verify it
Run fast-reboot on MLNX platform and observe faster create_switch execution time.
2022-05-02 10:44:17 +03:00
shlomibitton
1d84e0d7df
[Fastboot] Delay LLDP service for better fastboot performance (#10568)
- Why I did it
Profiling the system state on init after fast-reboot during create_switch function execution, it is possible to see few python scripts running at the same time.
This parallel execution consume CPU time and the duration of create_switch is longer than it should be.
Following this finding, and the motivation to ensure these services will not interfere in the future, LLDP is delayed in 90 seconds until the system finish the init flow after fastboot.

- How I did it
Add a timer for LLDP service.
Copy the timer file to the host bin image.

- How to verify it
Run fast-reboot on MLNX platform and observe faster create_switch execution time.
This PR is dependent on PR: #10567
2022-04-28 10:35:14 +03:00
ganglv
9d7387a18e
[sonic-host-services]: Fix import and invalid path (#10660)
Why I did it
Can not start sonic-hostservice

How I did it
Install python3-dbus and systemd-python, and replace invalid path

How to verify it
Start the service with below commands:
sudo systemctl start sonic-hostservice
sudo systemctl status sonic-hostservice

Signed-off-by: Gang Lv ganglv@microsoft.com
2022-04-27 07:14:51 +08:00
Saikrishna Arcot
64187a1b15
Remove SSH host keys after installing the custom version of sshd (#10633)
* Remove SSH host keys after installing the custom version of sshd

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>

* Use an override for for sshd instead of overwriting the service file

Don't overwrite upstream's .service file, and instead use an override
file for making sure the host key(s) are generated.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2022-04-25 10:38:52 -07:00
bingwang-ms
3fc3259a35
Define qos map AZURE_TUNNEL for QoS remapping of tunnel traffic (#10565)
* Add AZURE_TUNNEL map

Signed-off-by: bingwang <wang.bing@microsoft.com>
2022-04-25 15:06:10 +08:00
yozhao101
e24fe9bc60
[Monit] Fix the issue which shows Monit can not reset its counter. (#10288)
Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Why I did it
This PR aims to fix the Monit issue which shows Monit can't reset its counter when monitoring memory usage of telemetry container.

Specifically the Monit configuration file related to monitoring memory usage of telemetry container is as following:

  check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
      if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
If memory usage of telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), then it will be restarted.
Recently we observed, after telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted anymore during this 1 hour sliding window.

The reason is Monit can't reset its counter to count again and Monit can reset its counter if and only if the status of monitored service was changed from Status failed to Status ok. However, during this 1 hour sliding window, the status of monitored service was not changed from Status failed to Status ok.

Currently for each service monitored by Monit, there will be an entry showing the monitoring status, monitoring mode etc. For example, the following output from command sudo monit status shows the status of monitored service to monitor memory usage of telemetry:

    Program 'container_memory_telemetry'
         status                             Status ok
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Sat, 19 Mar 2022 19:56:26
Every 1 minute, Monit will run the script to check the memory usage of telemetry and update the counter if memory usage is larger than 400MB. If Monit checked the counter and found memory usage of telemetry is larger than 400MB for 10 times
within 20 minutes, then telemetry container was restarted. Following is an example status of monitored service:

    Program 'container_memory_telemetry'
         status                             Status failed
         monitoring status          Monitored
         monitoring mode          active
         on reboot                      start
         last exit value                0
         last output                    -
         data collected               Tue, 01 Feb 2022 22:52:55
After telemetry container was restarted. we found memory usage of telemetry increased rapidly from around 100MB to more than 400MB during 1 minute and status of monitored service did not have a chance to be changed from Status failed to Status ok.

How I did it
In order to provide a workaround for this issue, Monit recently introduced another syntax format repeat every <n> cycles related to exec. This new syntax format will enable Monit repeat executing the background script if the error persists for a given number of cycles.

How to verify it
I verified this change on lab device str-s6000-acs-12. Another pytest PR (Azure/sonic-mgmt#5492) is submitted in sonic-mgmt repo for review.
2022-04-20 18:08:06 -07:00