Commit Graph

1207 Commits

Author SHA1 Message Date
Alpesh Patel
6b48346ff5 qos template change for backend compute-ai deployment (#16150)
#### Why I did it

To enable QoS config for a backend deployment mode with resource type "Compute-AI".
This deployment has the following requirements:

- Config below is enabled if DEVICE_TYPE is one of backend_device_types
- Config below is enabled if ResourceType is 'Compute-AI'
- 2 lossless TCs (2, 3)
- 2 lossy TCs (0, 1)
- DSCP to TC map uses 4 DSCP code points and maps to the TCs as follows:
    "DSCP_TO_TC_MAP": {
        "AZURE": {
            "48" : "0",
            "46" : "1",
            "3"  : "3",
            "4"  : "4"
        }
    }

- WRED profile has green {min/max/mark%} as {2M/10M/5%}
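
A minimal sketch of the gating condition and the DSCP lookup listed above, written in Python rather than Jinja. The function names and the sample device-type strings are assumptions made for illustration; they are not taken from the template in the PR.

```python
# DSCP-to-TC map copied from the commit message above.
DSCP_TO_TC_MAP = {"AZURE": {"48": "0", "46": "1", "3": "3", "4": "4"}}


def compute_ai_qos_enabled(device_type, resource_type, backend_device_types):
    """Return True only when both conditions from the list above hold."""
    return device_type in backend_device_types and resource_type == "Compute-AI"


def tc_for_dscp(dscp, profile="AZURE"):
    """Look up the traffic class for a DSCP code point; None if unmapped."""
    return DSCP_TO_TC_MAP[profile].get(str(dscp))


# Hypothetical device-type names, used only to exercise the check.
backend_types = ["BackEndToRRouter", "BackEndLeafRouter"]
print(compute_ai_qos_enabled("BackEndToRRouter", "Compute-AI", backend_types))
print(tc_for_dscp(46))
```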

This required a template change (as in the PR) in addition to the vendor qos.json.j2 file (not included here).

#### How I did it

#### How to verify it
- With the above change and the vendor config change, generated the qos.json file and verified that the objective stated in "Why I did it" was met

- Verified that no errors were reported

#### Description for the changelog
Update qos_config.j2 for Compute-AI deployment on one of backend device type roles
2023-09-21 18:34:11 +08:00
Prince George
d5a96f69f1 [platform]: Disable interrupt for intel i2c-i801 driver (#16309)
On S6100 we are seeing almost 100K interrupts per second on Intel's i801 SMBus controller, which affects system performance.

We now disable the i801 driver interrupt and instead enable polling.

Microsoft ADO (number only): 24910530

How I did it
Disable the interrupt by passing the interrupt disable feature argument to i2c-i801 driver

How to verify it
This fix is NOT applicable to ARM-based platforms. It applies only to Intel-based platforms:

- On SN2700 it's already disabled in Mellanox hw-mgmt
- Celestica DX010 and E1031
- Dell S6100 verified the interrupts are no longer incrementing.
- Arista 7260CX3

Signed-off-by: Prince George <prgeor@microsoft.com>
2023-09-21 16:33:37 +08:00
StormLiangMS
2b381b1fd4
Revert "revert [syslog] Add remote syslog configuration (cherry-pick to 202305) (#15897) (#16179)" (#16549)
This reverts commit 164fa102c0.
2023-09-14 20:52:14 +08:00
Kebo Liu
fe7eeed051
[202305][Mellanox] Update SDK/FW/SAI to 4.6.1020/2012.1020/SAIBuild2305.25.0.3(#16096) (#16298)
* [Mellanox] Update SDK/FW/SAI to 4.6.1020/2012.1020/SAIBuild2305.25.0.3 (#16096)

SONiC changes:
1. Support Spectrum4 ASIC FW binary building.
2. Support new SDK sx-obj-desc lib building since the new SAI needs it.
3. Remove SX_SCEW debian package from Mellanox SDK build since we are no longer using it (we use libxml2 instead).
4. Update SAI, SDK, FW to version 4.6.1020/2012.1020/SAIBuild2305.25.0.3

SDK/FW bug fixes
1. In SPC-1 platforms: Fastboot mode is not operational for split ports with Force mode at 50G speed
SFP modules are kept in a disabled state after setting LPM (low power mode) on/off for at least 3 minutes.
2. When performing fast boot from an old SDK version (currently installed) to a newer one (target version), and the system was initially loaded with a new SDK version (past version), and the system has not been wiped, under specific conditions, the fast boot would use the past version's data and may fail.

SDK/FW Features
1. On SN2700 all ports can support Y-cable by Credo

SAI bug Fixes
1. When creating an ACL rule with SAI_ACL_ENTRY_ATTR_FIELD_SRC_IP/SAI_ACL_ENTRY_ATTR_FIELD_DST_IP enabled, and then disabling the field by setting enable=false, a match on L3_type=IPv4 would remain programmed for the rule. The issue is resolved by this fix.
2. Allow the max scale of virtual routers to be configured for SPC-1, SPC-2, SPC-3 when fastboot is enabled
3. Remove default hash key of SRC_MAC, DST_MAC and ETH_TYPE

SAI features
1. Port init profile

- How I did it
Update SDK/FW/SAI make files

- How to verify it
Run full sonic-mgmt regression on Mellanox platform

Signed-off-by: Kebo Liu <kebol@nvidia.com>
Conflicts:
	platform/mellanox/mlnx-sai.mk

* Fix issue: unprintable character is rendered when handling comments in j2

Use "{#-" and "-#}" to mark comments in jinja template

Signed-off-by: Stephen Sun <stephens@nvidia.com>

---------

Signed-off-by: Stephen Sun <stephens@nvidia.com>
Co-authored-by: Stephen Sun <stephens@nvidia.com>
2023-09-10 22:28:46 +08:00
mssonicbld
ebe24a134c
[chassis] Chassis DB cleanup when asic comes up (#16213) (#16417) 2023-09-03 23:52:39 +08:00
mssonicbld
40a5cea84c
Assign the higher metric value for Ipv6 default route learnt via RA message (#16367) (#16429) 2023-09-03 22:16:46 +08:00
mssonicbld
d62ae374a9
chassis-packet: Update arp_update script for FAILED and STALE check (#16311) (#16423) 2023-09-03 21:24:17 +08:00
Junchao-Mellanox
cead17cb55 Fix issue: systemctl daemon-reload would sporadically cause udev handler fail (#15253)
#### Why I did it

A workaround to backport the fix for a systemd issue.

The systemd issue: https://github.com/systemd/systemd/issues/24668
The systemd PR to fix the issue: https://github.com/systemd/systemd/pull/24673/files

The formal solution would be to upgrade systemd to a version that contains the fix. However, systemd is a very basic service, and upgrading it requires heavy testing.

#### How I did it
Copy the correct systemd-udevd.service file at build time

#### Tested branch (Please provide the tested image version)

- [x] 202211

```
SONiC Software Version: SONiC.fix-udev.3-b65c7bdec_Internal
SONiC OS Version: 11
Distribution: Debian 11.7
Kernel: 5.10.0-18-2-amd64
Build commit: b65c7bdec
Build date: Mon Jun 19 10:54:50 UTC 2023
Built by: sw-r2d2-bot@r-build-sonic-ci02-241

Platform: x86_64-mlnx_msn4700-r0
HwSKU: ACS-MSN4700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2022X08597
Model Number: MSN4700-WS2FO
Hardware Revision: A1
Uptime: 08:10:11 up 1 min,  1 user,  load average: 1.81, 0.67, 0.24
Date: Sun 25 Jun 2023 08:10:11

Docker images:
REPOSITORY                    TAG                             IMAGE ID       SIZE
docker-fpm-frr                fix-udev.3-b65c7bdec_Internal   a7b911e7cb6f   346MB
docker-fpm-frr                latest                          a7b911e7cb6f   346MB
docker-platform-monitor       fix-udev.3-b65c7bdec_Internal   94c5178cf80b   731MB
docker-platform-monitor       latest                          94c5178cf80b   731MB
docker-orchagent              fix-udev.3-b65c7bdec_Internal   46b393e0ace8   328MB
docker-orchagent              latest                          46b393e0ace8   328MB
docker-syncd-mlnx             fix-udev.3-b65c7bdec_Internal   1f5c6c23e33a   734MB
docker-syncd-mlnx             latest                          1f5c6c23e33a   734MB
docker-sflow                  fix-udev.3-b65c7bdec_Internal   7e45992c8c59   317MB
docker-sflow                  latest                          7e45992c8c59   317MB
docker-teamd                  fix-udev.3-b65c7bdec_Internal   e4d905592cda   316MB
docker-teamd                  latest                          e4d905592cda   316MB
docker-nat                    fix-udev.3-b65c7bdec_Internal   7fe799367580   319MB
docker-nat                    latest                          7fe799367580   319MB
docker-macsec                 latest                          d702a5554171   318MB
docker-snmp                   fix-udev.3-b65c7bdec_Internal   3bce8fcf71cd   338MB
docker-snmp                   latest                          3bce8fcf71cd   338MB
docker-sonic-telemetry        fix-udev.3-b65c7bdec_Internal   f13949cbc817   597MB
docker-sonic-telemetry        latest                          f13949cbc817   597MB
docker-dhcp-relay             latest                          153d9072805d   306MB
docker-router-advertiser      fix-udev.3-b65c7bdec_Internal   aed642b9a6bc   299MB
docker-router-advertiser      latest                          aed642b9a6bc   299MB
docker-sonic-p4rt             fix-udev.3-b65c7bdec_Internal   a3cae5ca65a7   870MB
docker-sonic-p4rt             latest                          a3cae5ca65a7   870MB
docker-mux                    fix-udev.3-b65c7bdec_Internal   b81f0401b9a8   347MB
docker-mux                    latest                          b81f0401b9a8   347MB
docker-eventd                 fix-udev.3-b65c7bdec_Internal   c5917d0e801f   298MB
docker-eventd                 latest                          c5917d0e801f   298MB
docker-lldp                   fix-udev.3-b65c7bdec_Internal   fd5dc14a7976   341MB
docker-lldp                   latest                          fd5dc14a7976   341MB
docker-database               fix-udev.3-b65c7bdec_Internal   438c2715a1dd   299MB
docker-database               latest                          438c2715a1dd   299MB
docker-sonic-mgmt-framework   fix-udev.3-b65c7bdec_Internal   5c50b115fbcd   414MB
docker-sonic-mgmt-framework   latest  
```
2023-09-03 18:32:54 +08:00
Vadym Hlushko
b7dfc5b280 [memory_checker] Add a specific log message in a case when the docker service is not running. (#16018)
#### Why I did it
To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created #11129](https://github.com/sonic-net/sonic-buildimage/pull/11129).
There could be a scenario before the reboot, where
1. The `docker service` has stopped
2. In a very short period of time, the monit service runs `monit status container_memory_telemetry`

In such a scenario, the `memory_checker` script logs an error to syslog:
```
ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))'
```
However, this is correct behavior: when the docker service is stopped, the Unix socket is destroyed, which is why the `FileNotFoundError(2, 'No such file or directory')` exception appears in the syslog.

#### How I did it
Changed the log severity to warning and changed the return value.
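
A minimal sketch of the behavior described above, not the actual `memory_checker` code. The `list_containers` callable is a hypothetical stand-in for the docker client call, injected so the example runs without docker, and returning `None` is an assumption about the changed return value.

```python
import logging

logger = logging.getLogger("memory_checker")


def get_running_containers(list_containers):
    """Return the running container names, or None if the daemon is unreachable.

    `list_containers` is a hypothetical callable standing in for the real
    docker client call.
    """
    try:
        return list_containers()
    except Exception as err:  # e.g. the Unix socket is gone after docker stops
        # Warning rather than error: a stopped docker service is expected here.
        logger.warning(
            "Failed to retrieve the running container list from docker daemon! "
            "Error message is: '%s'", err)
        return None


def docker_is_down():
    """Simulate the missing Unix socket seen when docker is stopped."""
    raise FileNotFoundError(2, "No such file or directory")


print(get_running_containers(docker_is_down))
print(get_running_containers(lambda: ["telemetry"]))
```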

#### How to verify it
It is really hard to catch the exact moment described in the `Why I did it` section.
In order to check the logic:
1. Change the Unix socket path to a non-existent one in the [/usr/bin/memory_checker](47742dfc2c/files/image_config/monit/memory_checker (L139)) file on the switch.
2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry`
3. Check the syslog for such messages:
```
WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))'

INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running!
```
2023-09-03 18:32:43 +08:00
xumia
288ebd5dd3 Support FIPS DB configuration (#15632)
Why I did it
Support FIPS DB configuration
Design Doc: sonic-net/SONiC#1372

Work item tracking
Microsoft ADO (number only): 24411148
How I did it
Add the FIPS Yang model to make FIPS configurable in ConfigDB.

How to verify it
See TestPlan: sonic-net/sonic-mgmt#9092
Build the image and run the tests: sonic-net/sonic-mgmt#9091
2023-09-03 16:33:25 +08:00
StormLiangMS
7b8906600c
add sonic release for 202305 (#16364) 2023-09-03 09:23:39 +08:00
andywongarista
f0823e6dd0
[Arista] Add support for DCS-7060DX5-32 (#14793) (#16176)
* Add asic support for blackhawkth4dd

* Add bfd feature to BlackhawkTh4Dd

* Add platform data for blackhawkth4

* Add Qos settings for Blackhawk-TH4

* Add pg and queue settings for Blackhawk-TH4

* Add buffers_defaults_t0.j2

* Add blackhawkth4 to boot0

* Update 7060dx5 config.bcm

* Fix build error

---------

Co-authored-by: Boyang Yu <byu@arista.com>
Co-authored-by: David Meggy <davidm@arista.com>
2023-09-03 09:21:33 +08:00
mssonicbld
adfc486456
Run db_migrator for non first-time reboots (#16116) (#16306) 2023-08-29 05:36:36 +08:00
Vaibhav Hemant Dixit
0b83639068
Fix CONFIG_DB_INITIALIZED flag check logic and set/reset flag for warmboot (#15685) (#16217)
Cherry-pick of #15685

MSFT ADO: 24274591

Why I did it
Two changes:

1. Fix a day-1 issue where the check that waits until CONFIG_DB_INITIALIZED is set is incorrect.
The same incorrect logic is used in multiple places.

The current logic (until [[ $($SONIC_DB_CLI CONFIG_DB GET "CONFIG_DB_INITIALIZED") ]];) always passes, irrespective of the result of the GET operation, because [[ string ]] is true for any non-empty string, including "0".

root@str2-7060cx-32s-29:~# sonic-db-cli CONFIG_DB GET "CONFIG_DB_INITIALIZED"
1
root@str2-7060cx-32s-29:~# until [[ $(sonic-db-cli CONFIG_DB GET "CONFIG_DB_INITIALIZED") ]]; do echo "entered here"; done
root@str2-7060cx-32s-29:~# 

root@str2-7060cx-32s-29:~# 
root@str2-7060cx-32s-29:~# sonic-db-cli CONFIG_DB GET "CONFIG_DB_INITIALIZED"                                             
0
root@str2-7060cx-32s-29:~# until [[ $(sonic-db-cli CONFIG_DB GET "CONFIG_DB_INITIALIZED") ]]; do echo "entered here"; done
root@str2-7060cx-32s-29:~# 
Fix this logic by checking for value of flag to be "1".

root@str2-7060cx-32s-29:~# until [[ $(sonic-db-cli CONFIG_DB GET "CONFIG_DB_INITIALIZED") -eq 1 ]]; do echo "entered here"; done
entered here
entered here
entered here
This gap in logic was highlighted when another fix was merged: #14933
The issue being fixed here caused warmboot-finalizer to not wait until config-db is initialized.
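
The same pitfall can be sketched in Python, where any non-empty string is also truthy. This is an illustration only: `wait_until_initialized` and its `get_flag` callable are hypothetical, with `get_flag` standing in for `sonic-db-cli CONFIG_DB GET "CONFIG_DB_INITIALIZED"`.

```python
import time


def wait_until_initialized(get_flag, poll_interval=0.0, max_polls=100):
    """Poll until the flag reads exactly "1"; return the number of polls used.

    `get_flag` is a hypothetical stand-in for
    `sonic-db-cli CONFIG_DB GET "CONFIG_DB_INITIALIZED"`.
    """
    for polls in range(1, max_polls + 1):
        if get_flag().strip() == "1":  # compare the value, not mere non-emptiness
            return polls
        time.sleep(poll_interval)
    raise TimeoutError("CONFIG_DB_INITIALIZED never became 1")


# The original check behaved like bool(value): any non-empty string passes.
assert bool("0") is True and bool("1") is True

# Simulated DB that reports "0" twice before it reports "1".
answers = iter(["0", "0", "1"])
print(wait_until_initialized(lambda: next(answers)))  # 3
```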

2. Set and unset CONFIG_DB_INITIALIZED for the warm-reboot case
Currently, during warm shutdown, the value of CONFIG_DB_INITIALIZED is stored in the redis db backup. It is restored when the dump is loaded during warm-recovery.
So the value of CONFIG_DB_INITIALIZED does not reflect config db's state; it remains what it was before the reboot.

Fix this by setting CONFIG_DB_INITIALIZED to 0 when the DB is loaded, and setting it to 1 after db_migrator is done.

2023-08-24 16:58:24 +08:00
StormLiangMS
164fa102c0
revert [syslog] Add remote syslog configuration (cherry-pick to 202305) (#15897) (#16179) 2023-08-19 16:01:29 +08:00
Vaibhav Hemant Dixit
2969d84e58 Revert "Revert "Fix for fast/cold-boot: call db_migrator only after old config is loaded (#14933)" (#15464)" (#15684)
This reverts commit 9649a44470.
2023-08-15 04:32:38 +08:00
Yevhen Fastiuk
4602d30a73
[syslog] Add remote syslog configuration (cherry-pick to 202305) (#15897)
cherry-pick: #14513
depends: https://github.com/sonic-net/sonic-utilities/pull/2939

* Add an ability to configure remote syslog servers
* Add an initial configuration for remote syslog
* Extend YANG module and add unit tests

#### Why I did it
Adding the following functionality to rsyslog feature:

* Configure remote syslog servers: protocol, filter, severity level
* Update global syslog configuration: severity level, message format

#### How I did it
Added parameters to the syslog server and global configuration.

#### How to verify it
Create a syslog server using the CLI or by adding it to Redis-DB.
Verify that the server is added to /etc/rsyslog.conf and is functional.

#### Description for the changelog
Extend rsyslog capabilities: added server and global configuration parameters.

#### Link to config_db schema for YANG module changes
[sonic-syslog.yang](https://github.com/sonic-net/sonic-buildimage/blob/master/src/sonic-yang-models/yang-models/sonic-syslog.yang)
2023-08-14 13:12:33 -07:00
mssonicbld
ec73d0f3ff
[chassis]: removed dependency for bgp and swss for chassis supervisor (#15734) (#16135)
Fixes #15667 and #13293

Work item tracking
Microsoft ADO 24472854:

How I did it
On chassis supervisor bgp feature is disabled in hostcfgd. The dependency between swss and bgp causes the bgp containers to start even though the feature is disabled.

How to verify it
Tests on chassis supervisor and LC

Co-authored-by: Arvindsrinivasan Lakshmi Narasimhan <55814491+arlakshm@users.noreply.github.com>
2023-08-14 22:39:24 +08:00
Longxiang Lyu
6e49fa5fd2 [monit][dualtor] Periodically check mux neighbors consistency (#15769)
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
2023-08-08 18:33:29 +08:00
mssonicbld
4ca01a7715
[syncd.sh] Clear semaphore before updating firmware (#15818) (#16067) 2023-08-07 18:20:15 +08:00
vmittal-msft
5ee18ece65 Update WRED profile on system ports (#15612)
* Update WRED profile on system ports
2023-08-07 14:33:42 +08:00
mssonicbld
33a10b479a
[nvidia] make sure shared storage with syncd is cleared on restarts (#14547) (#16046)
Why I did it
Sharing the storage of syncd with other proprietary application extensions allows them to communicate with syncd in different ways.
If one container wants to pass some information to syncd then shared storage can be used. However, today the shared storage isn't cleaned on restarts making it possible for syncd to read out-of-date information generated in the past.

NOTE: There are no plans to use it for standard SONiC dockers, and we are working on removing the SDK dependency from the PMON docker.

How I did it
Implemented new service to clean the shared storage.

How to verify it
Do reboot/fast-reboot/warm-reboot/config-reload/systemctl restart swss and verify /tmp/ is cleaned after each restart in syncd container.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Co-authored-by: Stepan Blyshchak <38952541+stepanblyschak@users.noreply.github.com>
2023-08-07 09:27:43 +08:00
Junchao-Mellanox
bf37c3162c Fix issue: set delayed attribute to true for platform monitor service (#15816)
There is a redundant line in init_cfg.json.j2 that causes the pmon service to always have "delayed=False". However, PMON has a timer now, so this change fixes it.
2023-08-07 00:34:12 +08:00
mssonicbld
6004054711
[arp_update]: Fix IPv6 neighbor race condition (#15583) (#15877) 2023-07-19 20:06:12 +08:00
lixiaoyuner
c59f55f6a3
Move k8s script to docker-config-engine (#14788) (#15768)
Why I did it
To reduce the container's dependency on the host system

Work item tracking
Microsoft ADO (number only):
17713469
How I did it
Move the k8s container startup script into the config engine container, rather than mounting it from the host.

How to verify it
Check file path(/usr/share/sonic/scripts/container_startup.py) inside config engine container.

Signed-off-by: Yun Li <yunli1@microsoft.com>
Co-authored-by: Qi Luo <qiluo-msft@users.noreply.github.com>
2023-07-17 23:21:01 +08:00
mssonicbld
0b1f834e22
update rsyslog log size conf (#15821) (#15837) 2023-07-14 20:34:22 +08:00
mssonicbld
bb3eff6ab4
Revert "Fix for fast/cold-boot: call db_migrator only after old config is loaded (#14933)" (#15464) (#15618) 2023-06-29 22:35:47 +08:00
Stepan Blyshchak
e2e5b77f16
[mlnx-ffb.sh] Update issu-version location (#14925)
#### Why I did it

ISSU version check fails due to inability to mount squashfs from 202211 on 201911

#### How I did it

Put ISSU version file under platform directory

#### How to verify it

Warm-upgrade matrix:
- 201911 (with https://github.com/sonic-net/sonic-buildimage/pull/14928) to master
- 201911 (with https://github.com/sonic-net/sonic-buildimage/pull/14928) to 202211
- 202012 (with https://github.com/sonic-net/sonic-buildimage/pull/14927) to master
- 202205 (with this change cherry-picked) to master
2023-06-15 15:14:52 -07:00
Saikrishna Arcot
f84dfd2345
Re-add 127.0.0.1/8 when bringing down the interfaces (#15080)
* Re-add 127.0.0.1/8 when bringing down the interfaces

With #5353, 127.0.0.1/16 was added to the lo interface, and then
127.0.0.1/8 was removed. However, when bringing down the lo interface,
like during a config reload, 127.0.0.1/16 gets removed, but 127.0.0.1/8
isn't added back to the interface. This means that there's a period of
time where 127.0.0.1 is not available at all, and services that need to
connect to 127.0.0.1 (such as for redis DB) will fail.

To fix this, when going down, add 127.0.0.1/8. Add this address before
the existing configuration gets removed, so that 127.0.0.1 is available
at all times.

Note that running `ifdown lo` doesn't actually bring down the loopback
interface; the interface always stays "physically" up.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-06-13 18:45:39 -07:00
Hua Liu
05f1a5a31e
Add watchdog mechanism to swss service and generate alert when swss have issue. (#15429)
Add watchdog mechanism to swss service and generate alert when swss have issue. 

**Work item tracking**
Microsoft ADO (number only): 16578912

**What I did**
Add an orchagent watchdog to monitor and alert on orchagent stuck issues.

**Why I did it**
Currently the SONiC monit system only monitors whether the orchagent process exists. If the orchagent process is stuck and stops processing, monit can't detect and report it.

**How I verified it**
Pass all UT.

Manually tested that process_monitoring/test_critical_process_monitoring.py passes.

Add new UT https://github.com/sonic-net/sonic-mgmt/pull/8306 to check that the watchdog works correctly.

Manually tested: after pausing orchagent with 'kill -STOP <pid>', check that the following message appears in the log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
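
A minimal sketch of a heartbeat-based watchdog of the kind described above. It is not the implementation from the PR: the `HeartbeatWatchdog` class, the 60-second timeout and the injected clock are assumptions made so the example is self-contained.

```python
import time


class HeartbeatWatchdog:
    """Flag a process as stuck when no heartbeat arrives within `timeout_s`."""

    def __init__(self, timeout_s=60.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = {}

    def heartbeat(self, process):
        """Record that `process` reported in just now."""
        self.last_seen[process] = self.clock()

    def stuck_processes(self):
        """Return {process: minutes since last heartbeat} for overdue processes."""
        now = self.clock()
        return {name: (now - seen) / 60.0
                for name, seen in self.last_seen.items()
                if now - seen >= self.timeout_s}


# Fake clock so the example runs instantly.
fake_now = [0.0]
watchdog = HeartbeatWatchdog(timeout_s=60.0, clock=lambda: fake_now[0])
watchdog.heartbeat("orchagent")
fake_now[0] = 60.0  # one minute passes with no further heartbeat
for name, minutes in watchdog.stuck_processes().items():
    print(f"Process '{name}' is stuck in namespace 'host' ({minutes:.1f} minutes).")
```

Injecting the clock keeps the timeout logic testable without sleeping; a real listener would use the default monotonic clock.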

**Details if related**
Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737
UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306
2023-06-12 17:53:54 -07:00
Alpesh Patel
633fff8c10
enable ethernet backplane port support in port config for packet mode T2 devices (#14533)
For T2 systems using packet mode, the backplane interfaces (Ethernet-BP#) and the fabric card ethernet interfaces are not visible as neighbor interfaces.
In packet mode, these interfaces need QoS and buffer config as well.
This fix addresses that issue and adds the backplane interfaces to the PORTS_ACTIVE list
2023-06-12 14:02:22 -07:00
mssonicbld
cb9d9e57a6
[ci/build]: Upgrade SONiC package versions (#15431)
Upgrade SONiC Versions
2023-06-12 22:27:29 +08:00
mssonicbld
a45595158b
[ci/build]: Upgrade SONiC package versions (#15345) 2023-06-10 20:38:13 +08:00
Liping Xu
78c41a1e58
allow docker_inram to kernel cmd list (#15374)
Why I did it
After docker_inram is enabled, the docker folder's default max size is 1.5G.
It's not big enough for some tests which need to install additional docker images or install extra packages.

Work item tracking
Microsoft ADO 24199761:
How I did it
add docker_inram into cmdline_allowlist

How to verify it
sudo sh -c 'echo "docker_inram_size=3000M" >> kernel-cmdline-append'
sudo reboot and check the docker folder size
2023-06-10 14:19:44 +08:00
Sudharsan Dhamal Gopalarathnam
162856ad9a
[sflow]Delay starting sflow service until ports are created (#15333)
* [sflow]Delay starting sflow service until ports are created
* Removing sflow from sonic.target dependency since it will be managed by hostcfgd
2023-06-09 16:28:15 -07:00
Ye Jianquan
cec9d7b83a
Revert "Add watchdog mechanism to swss service and generate alert when swss have issue. (#14686)" (#15390)
This reverts commit 44427a2f6b.
The Docker image was not updated during PR validation, which caused PR check failures.
Force-merging this revert. Once the cache is updated after this PR is merged, the issue should be fixed.
2023-06-09 09:10:35 +08:00
Yevhen Fastiuk
8a6d45227e
[Clock] Add timezone config YANG model (#14651)
* Add the ability to configure timezone

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>

* Add YANG model for timezone

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>

* Add timezone reference

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>

---------

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>
2023-06-07 10:39:24 -07:00
Hua Liu
44427a2f6b
Add watchdog mechanism to swss service and generate alert when swss have issue. (#14686)
This PR depends on https://github.com/sonic-net/sonic-swss/pull/2737 merge first.

**What I did**
Add an orchagent watchdog to monitor and alert on orchagent stuck issues.

**Why I did it**
Currently the SONiC monit system only monitors whether the orchagent process exists. If the orchagent process is stuck and stops processing, monit can't detect and report it.

**How I verified it**
Pass all UT.
Add new UT https://github.com/sonic-net/sonic-mgmt/pull/8306 to check that the watchdog works correctly.
Manually tested: after pausing orchagent with 'kill -STOP <pid>', check that the following message appears in the log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

**Details if related**
Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737
UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306
2023-06-05 22:21:17 -07:00
siqbal1986
381cfe4485
Added VNET_MONITOR_TABLE,BFD_SESSION_TABLE,VNET_ROUTE_TUNNEL_TABLE to the list (#14992)
* The 3 tables in state DB need to be cleaned up after SWSS restart to have a consistent state.
2023-06-05 13:18:50 -07:00
mssonicbld
4335690de7 [ci/build]: Upgrade SONiC package versions 2023-06-05 20:51:47 +08:00
Arvindsrinivasan Lakshmi Narasimhan
3f4b959d3f
[chassis] add libffi-dev for sonic-utilities (#15218)
In PR sonic-net/sonic-utilities#2850, the paramiko package is installed in sonic-utilities to support remote access of linecards. libffi-dev needs to be installed to be able to compile for the armhf image.

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2023-06-03 14:36:50 -07:00
mssonicbld
f80e182c22
[ci/build]: Upgrade SONiC package versions (#15325) 2023-06-03 19:45:07 +08:00
mssonicbld
c044e6e34e
[ci/build]: Upgrade SONiC package versions (#15307) 2023-06-02 21:40:29 +08:00
Vaibhav Hemant Dixit
02b17839c3
Fix for fast/cold-boot: call db_migrator only after old config is loaded (#14933)
Why I did it
Fix the issue where db_migrator is called before the DB is loaded with config. This leads to two problems:

1. db_migrator finds nothing in the DB and goes on to incorrectly migrate every missing config. This is not expected: migration should happen after the old config is loaded, and only new schema changes need migration.
2. Since the DB is empty when the migrator is called, db_migrator fails when some APIs return None.

The reason for the incorrect call is that:

- The database service starts db_migrator as part of its startup sequence.
- The config-setup service loads data from old-config/minigraph. However, it has Requires=database.service.
- Hence, config-setup starts only after the database service has started, and the database service starts only once db_migrator has completed.

Fixed by:

- Check whether this is a first-time boot by checking the pending_config_migration flag.
- If pending_config_migration is enabled, do not call db_migrator as part of database service startup.
- Let the database service start, which triggers the config-setup service to start.
- Call db_migrator after the config-setup service loads old-config/minigraph.
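
The ordering change can be sketched as follows. This is an illustration of the sequence described in this commit message, not the actual service scripts; `startup_sequence` and the step strings are hypothetical.

```python
def startup_sequence(pending_config_migration):
    """Return the ordered startup steps for the given flag value."""
    if pending_config_migration:
        # First-time boot: defer migration until the old config is in the DB.
        return [
            "database service starts (db_migrator skipped)",
            "config-setup loads old-config/minigraph",
            "db_migrator runs",
        ]
    # Otherwise db_migrator still runs as part of database service startup.
    return [
        "database service starts and runs db_migrator",
        "config-setup starts",
    ]


for step in startup_sequence(pending_config_migration=True):
    print(step)
```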
2023-05-30 10:16:21 -07:00
vmittal-msft
ecb4db58a9
Update PG headroom settings ports based on port speed/cable length (#14908)
* Update PG headroom settings ports based on port speed/cable length

* Updated XOFF settings to use chip level numbers than core

* Updated PG headroom based on uplink/downlink side

* fix for sonic-config-gen tests

* More fixes for unit test cases

* more test fixes

* Merged multiple functions into one
2023-05-19 08:19:27 -07:00
Pavan-Nokia
c5d0507224
[arm64][Nokia-7215-A1]Add support for Nokia-7215-A1 platform (#13795)
Add new Nokia build target and establish an arm64 build:

    Platform: arm64-nokia_ixs7215_52xb-r0
    HwSKU: Nokia-7215-A1
    ASIC: marvell
    Port Config: 48x1G + 4x10G

How I did it

- Change make files for saiserver and syncd to use Bullseye kernel
- Change Marvell SAI version to 1.11.0-1
- Add Prestera make files to build kernel, Flattened Device Tree blob and ramdisk for arm64 platforms
- Provide device and platform related files for new platform support (arm64-nokia_ixs7215_52xb-r0).
2023-05-18 14:24:05 -07:00
Samuel Angebault
fa95ebcaae Add optional zram compression for docker_inram
Some devices running SONiC have a small storage device (2G and 4G mainly)
The SONiC image growth over time has made it impossible to install
2 images on a single device.
Some mitigations have been implemented in the past for some devices but
there is a need to do more.

One such mitigation is `docker_inram` which creates a `tmpfs` and
extracts `dockerfs.tar.gz` in it.
This all happens in the SONiC initramfs and by ensuring the installation
process does not extract `dockerfs.tar.gz` on the flash but keep the file as is.

This mitigation does a tradeoff by using more RAM to reduce the disk footprint.
It however creates new issues for devices with 4G of system memory since
the extracted `dockerfs.tar.gz` nears 1.6G.
Considering debian upgrades (with dual base images) and the continuous
stream of features this is only going to get bigger.

This change introduces an alternative to the `tmpfs` by allowing a system
to extract the `dockerfs.tar.gz` inside a `zram` device thus bringing
compression into play at the expense of performance.

Introduce 2 new optional kernel parameters to be consumed by SONiC initramfs.
 - `docker_inram_size`, which represents the max physical size of the
   `zram` or `tmpfs` volume (defaults to DOCKER_RAMFS_SIZE)
 - `docker_inram_algo`, which is the method used to extract the
   `dockerfs.tar.gz` (defaults to `tmpfs`);
   other values are considered to be compression algorithms for `zram`
   (e.g. `zstd`, `lzo-rle`, `lz4`)
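
A rough sketch of how the two parameters and their defaults could be interpreted. The real logic lives in the initramfs `union-mount` shell script; `parse_docker_inram` is a hypothetical Python helper, and the `"1500M"` default and the sample command lines below are placeholders rather than values taken from the build.

```python
def parse_docker_inram(cmdline, default_size):
    """Extract docker_inram settings from a kernel command line string."""
    params = dict(tok.split("=", 1) for tok in cmdline.split() if "=" in tok)
    size = params.get("docker_inram_size", default_size)
    algo = params.get("docker_inram_algo", "tmpfs")
    # Any value other than "tmpfs" is treated as a zram compression algorithm.
    backend = "tmpfs" if algo == "tmpfs" else "zram"
    return {"size": size, "backend": backend, "algo": algo}


# "1500M" is a placeholder for DOCKER_RAMFS_SIZE, not its real value.
print(parse_docker_inram("quiet docker_inram_algo=zstd", "1500M"))
print(parse_docker_inram("quiet docker_inram_size=3000M", "1500M"))
```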

Refactored the logic to mount the docker fs in the SONiC initramfs under
the `union-mount` script.
Moved the code into a function to make it cleaner and separated the
inram volume creation and docker extraction.

On Arista platforms with a flash smaller than or equal to 4GB, set
`docker_inram_algo` to `zstd`, which produces the best compression ratio
at the expense of slower write performance, with read performance
similar to other `zram` compression algorithms.
2023-05-18 14:21:52 -07:00
Samuel Angebault
467994c024 [Arista] Fix boot0 code for docker_inram
Enable docker_inram for all systems with 4GB or less of flash.
This is mandatory to allow these systems to store 2 SONiC images.

This change also fixes the missing docker_inram attribute when
installing a new image from SONiC.
Because the SWI image can ship with additional kernel parameters within,
such as `sonic_fips=`, this led to a conflict.
To prevent the conflict, the extra kernel parameters from the SWI are
now stored in the file `kernel-cmdline-append` which isn't used anywhere.
2023-05-18 14:21:52 -07:00
Anish Narsian
05a85b57b8
[arp_update] Resolve neighbors from config_db (#15006)
* To resolve NEIGH table entries present in CONFIG_DB. Without this change, ARP/NDP entries that are configured via CONFIG_DB and that we wish to resolve are not resolved.
2023-05-17 10:42:03 -07:00
mssonicbld
3d1ae46f90 [ci/build]: Upgrade SONiC package versions 2023-05-15 18:32:43 +08:00