Commit Graph

1192 Commits

Author SHA1 Message Date
Vaibhav Hemant Dixit
2969d84e58 Revert "Revert "Fix for fast/cold-boot: call db_migrator only after old config is loaded (#14933)" (#15464)" (#15684)
This reverts commit 9649a44470.
2023-08-15 04:32:38 +08:00
Yevhen Fastiuk
4602d30a73
[syslog] Add remote syslog configuration (cherry-pick to 202305) (#15897)
cherry-pick: #14513
depends: https://github.com/sonic-net/sonic-utilities/pull/2939

* Add an ability to configure remote syslog servers
* Add an initial configuration for remote syslog
* Extend YANG module and add unit tests

#### Why I did it
Adding the following functionality to rsyslog feature:

* Configure remote syslog servers: protocol, filter, severity level
* Update global syslog configuration: severity level, message format

#### How I did it
added parameters to syslog server and global configuration.

#### How to verify it
create syslog server using CLI/adding to Redis-DB
verify server is added to file /etc/rsyslog.conf and server is functional.

#### Description for the changelog
extend rsyslog capabilities, added server and global configuration parameters.

#### Link to config_db schema for YANG module changes
[sonic-syslog.yang](https://github.com/sonic-net/sonic-buildimage/blob/master/src/sonic-yang-models/yang-models/sonic-syslog.yang)
2023-08-14 13:12:33 -07:00
mssonicbld
ec73d0f3ff
[chassis]: removed dependency for bgp and swss for chassis supervisor (#15734) (#16135)
Fixes #15667 and #13293

Work item tracking
Microsoft ADO 24472854:

How I did it
On chassis supervisor bgp feature is disabled in hostcfgd. The dependency between swss and bgp causes the bgp containers to start even though the feature is disabled.

How to verify it
Tests on chassis supervisor and LC

Co-authored-by: Arvindsrinivasan Lakshmi Narasimhan <55814491+arlakshm@users.noreply.github.com>
2023-08-14 22:39:24 +08:00
Longxiang Lyu
6e49fa5fd2 [monit][dualtor] Periodically check mux neighbors consistency (#15769)
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
2023-08-08 18:33:29 +08:00
mssonicbld
4ca01a7715
[syncd.sh] Clear semaphore before updating firmware (#15818) (#16067) 2023-08-07 18:20:15 +08:00
vmittal-msft
5ee18ece65 Update WRED profile on system ports (#15612)
* Update WRED profile on system ports
2023-08-07 14:33:42 +08:00
mssonicbld
33a10b479a
[nvidia] make sure shared storage with syncd is cleared on restarts (#14547) (#16046)
Why I did it
Sharing the storage of syncd with other proprietary application extensions allows them to communicate with syncd in different ways.
If one container wants to pass some information to syncd then shared storage can be used. However, today the shared storage isn't cleaned on restarts making it possible for syncd to read out-of-date information generated in the past.

NOTE: No plans to use it for standard SONIC dockers and we are working on removing the SDK dependency from PMON docker

How I did it
Implemented new service to clean the shared storage.

How to verify it
Do reboot/fast-reboot/warm-reboot/config-reload/systemctl restart swss and verify /tmp/ is cleaned after each restart in syncd container.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Co-authored-by: Stepan Blyshchak <38952541+stepanblyschak@users.noreply.github.com>
2023-08-07 09:27:43 +08:00
Junchao-Mellanox
bf37c3162c Fix issue: set delayed attribute to true for platform monitor service (#15816)
There is a redundant line in init_cfg.json.j2. It causes the pmon service to always have "delayed=False". However, PMON now has a timer, so this change removes the redundant line.
2023-08-07 00:34:12 +08:00
mssonicbld
6004054711
[arp_update]: Fix IPv6 neighbor race condition (#15583) (#15877) 2023-07-19 20:06:12 +08:00
lixiaoyuner
c59f55f6a3
Move k8s script to docker-config-engine (#14788) (#15768)
Why I did it
To reduce the container's dependency on the host system

Work item tracking
Microsoft ADO (number only):
17713469
How I did it
Move the k8s container startup script into the config engine container, rather than mounting it from the host.

How to verify it
Check file path(/usr/share/sonic/scripts/container_startup.py) inside config engine container.

Signed-off-by: Yun Li <yunli1@microsoft.com>
Co-authored-by: Qi Luo <qiluo-msft@users.noreply.github.com>
2023-07-17 23:21:01 +08:00
mssonicbld
0b1f834e22
update rsyslog log size conf (#15821) (#15837) 2023-07-14 20:34:22 +08:00
mssonicbld
bb3eff6ab4
Revert "Fix for fast/cold-boot: call db_migrator only after old config is loaded (#14933)" (#15464) (#15618) 2023-06-29 22:35:47 +08:00
Stepan Blyshchak
e2e5b77f16
[mlnx-ffb.sh] Update issu-version location (#14925)
#### Why I did it

ISSU version check fails due to inability to mount squashfs from 202211 on 201911

#### How I did it

Put ISSU version file under platform directory

#### How to verify it

Warm-upgrade matrix:
- 201911 (with https://github.com/sonic-net/sonic-buildimage/pull/14928) to master
- 201911 (with https://github.com/sonic-net/sonic-buildimage/pull/14928) to 202211
- 202012 (with https://github.com/sonic-net/sonic-buildimage/pull/14927) to master
- 202205 (with this change cherry-picked) to master
2023-06-15 15:14:52 -07:00
Saikrishna Arcot
f84dfd2345
Re-add 127.0.0.1/8 when bringing down the interfaces (#15080)
* Re-add 127.0.0.1/8 when bringing down the interfaces

With #5353, 127.0.0.1/16 was added to the lo interface, and then
127.0.0.1/8 was removed. However, when bringing down the lo interface,
like during a config reload, 127.0.0.1/16 gets removed, but 127.0.0.1/8
isn't added back to the interface. This means that there's a period of
time where 127.0.0.1 is not available at all, and services that need to
connect to 127.0.0.1 (such as for redis DB) will fail.

To fix this, when going down, add 127.0.0.1/8. Add this address before
the existing configuration gets removed, so that 127.0.0.1 is available
at all times.

Note that running `ifdown lo` doesn't actually bring down the loopback
interface; the interface always stays "physically" up.
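The ordering argument above can be checked with a small simulation. This is only a sketch, not the actual interface script: the `reachable` and `bring_down` helpers are hypothetical names, and the address set stands in for what is assigned to `lo`.

```python
import ipaddress

LOOPBACK = ipaddress.ip_address("127.0.0.1")


def reachable(assigned):
    """True if any prefix assigned to lo carries the address 127.0.0.1."""
    return any(ipaddress.ip_interface(a).ip == LOOPBACK for a in assigned)


def bring_down(assigned, readd_first):
    """Simulate the teardown and record reachability after each step."""
    assigned = set(assigned)
    history = []
    if readd_first:
        # New behavior: add the /8 before the existing config is removed.
        assigned.add("127.0.0.1/8")
        history.append(reachable(assigned))
    # Existing configuration (the /16) is removed when going down.
    assigned.discard("127.0.0.1/16")
    history.append(reachable(assigned))
    return history
```

With `{"127.0.0.1/16"}` as the starting set, the old ordering leaves a step where 127.0.0.1 is unreachable, while the new ordering keeps it reachable at every step.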

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-06-13 18:45:39 -07:00
Hua Liu
05f1a5a31e
Add watchdog mechanism to swss service and generate alert when swss have issue. (#15429)
Add watchdog mechanism to swss service and generate alert when swss have issue. 

**Work item tracking**
Microsoft ADO (number only): 16578912

**What I did**
Add orchagent watchdog to monitor and alert orchagent stuck issue.

**Why I did it**
Currently the SONiC monit system only checks whether the orchagent process exists. If the orchagent process gets stuck and stops processing, monit cannot detect and report it.

**How I verified it**
Pass all UT.

Manually verified that process_monitoring/test_critical_process_monitoring.py passes.

Add new UT https://github.com/sonic-net/sonic-mgmt/pull/8306 to check watchdog works correctly.

Manual test: after pausing orchagent with 'kill -STOP <pid>', check that a warning message like the following appears in the log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
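The stuck-process check described above can be modeled as a heartbeat timeout. The following is a minimal sketch, not the code from this PR: the `HeartbeatWatchdog` class and its methods are hypothetical, and the 60-second default is an assumption inferred from the "(1.0 minutes)" in the log line.

```python
import time


class HeartbeatWatchdog:
    """Track the last heartbeat per process and report silent ones."""

    def __init__(self, timeout_s=60.0, clock=time.monotonic):
        self.timeout_s = timeout_s  # assumed threshold, not taken from the PR
        self.clock = clock          # injectable so the logic can be tested
        self.last_beat = {}

    def beat(self, process):
        """Record that a heartbeat was received from `process`."""
        self.last_beat[process] = self.clock()

    def stuck(self):
        """Return (process, seconds_silent) for every process past the timeout."""
        now = self.clock()
        return [
            (name, now - seen)
            for name, seen in sorted(self.last_beat.items())
            if now - seen >= self.timeout_s
        ]
```

A process that exists but is paused (for example with `kill -STOP`) stops sending heartbeats, so it shows up in `stuck()` even though a plain process-existence check would still pass.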

**Details if related**
Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737
UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306
2023-06-12 17:53:54 -07:00
Alpesh Patel
633fff8c10
enable ethernet backplane port support in port config for packet mode T2 devices (#14533)
For T2 systems using packet mode, the backplane interfaces (Ethernet-BP#) and the fabric card ethernet interfaces are not visible as neighbor interfaces.
In packet mode, these interfaces need QoS and buffer config as well.
This fix addresses that issue and adds the backplane interfaces to the PORTS_ACTIVE list
2023-06-12 14:02:22 -07:00
mssonicbld
cb9d9e57a6
[ci/build]: Upgrade SONiC package versions (#15431)
Upgrade SONiC Versions
2023-06-12 22:27:29 +08:00
mssonicbld
a45595158b
[ci/build]: Upgrade SONiC package versions (#15345) 2023-06-10 20:38:13 +08:00
Liping Xu
78c41a1e58
allow docker_inram to kernel cmd list (#15374)
Why I did it
After docker_inram is enabled, the docker folder's default max size is 1.5G.
It's not big enough for some tests which need to install additional docker images or install extra packages.

Work item tracking
Microsoft ADO 24199761:
How I did it
add docker_inram into cmdline_allowlist

How to verify it
sudo sh -c 'echo "docker_inram_size=3000M" >> kernel-cmdline-append'
sudo reboot and check the docker folder size
2023-06-10 14:19:44 +08:00
Sudharsan Dhamal Gopalarathnam
162856ad9a
[sflow]Delay starting sflow service until ports are created (#15333)
* [sflow]Delay starting sflow service until ports are created
* Removing sflow from sonic.target dependency since it will be managed by hostcfgd
2023-06-09 16:28:15 -07:00
Ye Jianquan
cec9d7b83a
Revert "Add watchdog mechanism to swss service and generate alert when swss have issue. (#14686)" (#15390)
This reverts commit 44427a2f6b.
Docker image not updated during PR validation and caused PR check failures.
Force merge this revert. After cache is updated after this PR is merged, issue should be fixed.
2023-06-09 09:10:35 +08:00
Yevhen Fastiuk
8a6d45227e
[Clock] Add timezone config YANG model (#14651)
* Add the ability to configure timezone

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>

* Add YANG model for timezone

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>

* Add timezone reference

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>

---------

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>
2023-06-07 10:39:24 -07:00
Hua Liu
44427a2f6b
Add watchdog mechanism to swss service and generate alert when swss have issue. (#14686)
This PR depends on https://github.com/sonic-net/sonic-swss/pull/2737 merge first.

**What I did**
Add orchagent watchdog to monitor and alert orchagent stuck issue.

**Why I did it**
Currently the SONiC monit system only checks whether the orchagent process exists. If the orchagent process gets stuck and stops processing, monit cannot detect and report it.

**How I verified it**
Pass all UT.
Add new UT https://github.com/sonic-net/sonic-mgmt/pull/8306 to check watchdog works correctly.
Manual test: after pausing orchagent with 'kill -STOP <pid>', check that a warning message like the following appears in the log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

**Details if related**
Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737
UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306
2023-06-05 22:21:17 -07:00
siqbal1986
381cfe4485
Added VNET_MONITOR_TABLE,BFD_SESSION_TABLE,VNET_ROUTE_TUNNEL_TABLE to the list (#14992)
* The 3 tables in state DB need to be cleaned up after SWSS restart to keep a consistent state.
2023-06-05 13:18:50 -07:00
mssonicbld
4335690de7 [ci/build]: Upgrade SONiC package versions 2023-06-05 20:51:47 +08:00
Arvindsrinivasan Lakshmi Narasimhan
3f4b959d3f
[chassis] add libffi-dev for sonic-utilities (#15218)
In PR sonic-net/sonic-utilities#2850, the paramiko package is installed in sonic-utilities to support remote access to linecards. libffi-dev needs to be installed to be able to compile for the armhf image.

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2023-06-03 14:36:50 -07:00
mssonicbld
f80e182c22
[ci/build]: Upgrade SONiC package versions (#15325) 2023-06-03 19:45:07 +08:00
mssonicbld
c044e6e34e
[ci/build]: Upgrade SONiC package versions (#15307) 2023-06-02 21:40:29 +08:00
Vaibhav Hemant Dixit
02b17839c3
Fix for fast/cold-boot: call db_migrator only after old config is loaded (#14933)
Why I did it
Fix the issue where db_migrator is called before the DB is loaded with config. This leads to db_migrator:

* Not finding anything, and proceeding to incorrectly migrate every missing config. This is not expected: migration should happen after the old config is loaded, and only new schema changes need migration.
* Failing when some APIs return None, since the DB does not have anything when the migrator is called.

The reason for the incorrect call is that:

* The database service starts db_migrator as part of its startup sequence.
* The config-setup service loads data from old-config/minigraph. However, it has Requires=database.service.
* Hence, config-setup starts only when the database service is started, and the database service is started only when db_migrator has completed.

Fixed by:

* Check if this is first time boot by checking the pending_config_migration flag.
* If pending_config_migration is enabled, then do not call db_migrator as part of database service startup.
* Let the database service start, which triggers the config-setup service to start.
* Now call db_migrator after the config-setup service loads old-config/minigraph.
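
The resulting ordering can be written down as a small model. This is an illustrative sketch only: the `Step` enum and `boot_sequence` function are hypothetical names, and the real logic lives in the service startup scripts rather than in Python like this.

```python
from enum import Enum
from typing import List


class Step(Enum):
    START_DATABASE = "start database service"
    LOAD_OLD_CONFIG = "config-setup loads old-config/minigraph"
    RUN_DB_MIGRATOR = "run db_migrator"


def boot_sequence(pending_config_migration: bool) -> List[Step]:
    """Return the order of steps for one boot (simplified model)."""
    if pending_config_migration:
        # First boot: the DB is still empty, so migration is deferred
        # until config-setup has loaded the old configuration.
        return [
            Step.START_DATABASE,
            Step.LOAD_OLD_CONFIG,
            Step.RUN_DB_MIGRATOR,
        ]
    # Otherwise db_migrator still runs as part of database service startup.
    return [Step.START_DATABASE, Step.RUN_DB_MIGRATOR]
```

The point of the model is the invariant: whenever old config has to be loaded, it is loaded before db_migrator runs.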
2023-05-30 10:16:21 -07:00
vmittal-msft
ecb4db58a9
Update PG headroom settings ports based on port speed/cable length (#14908)
* Update PG headroom settings ports based on port speed/cable length

* Updated XOFF settings to use chip level numbers than core

* Updated PG headroom based on uplink/downlink side

* fix for sonic-config-gen tests

* More fixes for unit test cases

* more test fixes

* Merged multiple functions into one
2023-05-19 08:19:27 -07:00
Pavan-Nokia
c5d0507224
[arm64][Nokia-7215-A1]Add support for Nokia-7215-A1 platform (#13795)
Add new Nokia build target and establish an arm64 build:

    Platform: arm64-nokia_ixs7215_52xb-r0
    HwSKU: Nokia-7215-A1
    ASIC: marvell
    Port Config: 48x1G + 4x10G

How I did it

- Change make files for saiserver and syncd to use Bullseye kernel
- Change Marvell SAI version to 1.11.0-1
- Add Prestera make files to build kernel, Flattened Device Tree blob and ramdisk for arm64 platforms
- Provide device and platform related files for new platform support (arm64-nokia_ixs7215_52xb-r0).
2023-05-18 14:24:05 -07:00
Samuel Angebault
fa95ebcaae Add optional zram compression for docker_inram
Some devices running SONiC have a small storage device (2G and 4G mainly)
The SONiC image growth over time has made it impossible to install
2 images on a single device.
Some mitigations have been implemented in the past for some devices but
there is a need to do more.

One such mitigation is `docker_inram` which creates a `tmpfs` and
extracts `dockerfs.tar.gz` in it.
This all happens in the SONiC initramfs and by ensuring the installation
process does not extract `dockerfs.tar.gz` on the flash but keeps the file as is.

This mitigation does a tradeoff by using more RAM to reduce the disk footprint.
It however creates new issues for devices with 4G of system memory since
the extracted `dockerfs.tar.gz` nears 1.6G.
Considering debian upgrades (with dual base images) and the continuous
stream of features this is only going to get bigger.

This change introduces an alternative to the `tmpfs` by allowing a system
to extract the `dockerfs.tar.gz` inside a `zram` device thus bringing
compression in play at the detriment of performance.

Introduce 2 new optional kernel parameters to be consumed by SONiC initramfs.
 - `docker_inram_size` which represents the max physical size of the
   `zram` or `tmpfs` volume (defaults to DOCKER_RAMFS_SIZE)
 - `docker_inram_algo` which is the method to use to extract the
   `dockerfs.tar.gz` (defaults to `tmpfs`)
   other values are considered to be compression algorithms for `zram`
   (e.g. `zstd`, `lzo-rle`, `lz4`)
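
How the two parameters and their defaults combine can be sketched as follows. This is a Python model for illustration; the actual handling happens in the SONiC initramfs scripts. The `parse_docker_inram` function is a hypothetical name, and `default_size` stands in for DOCKER_RAMFS_SIZE, whose value is not given here.

```python
def parse_docker_inram(cmdline, default_size):
    """Extract docker_inram_size / docker_inram_algo from a kernel cmdline."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep:  # keep only key=value tokens
            params[key] = value
    algo = params.get("docker_inram_algo", "tmpfs")
    return {
        "size": params.get("docker_inram_size", default_size),
        "algo": algo,
        # Any value other than tmpfs is treated as a zram compression algorithm.
        "backend": "tmpfs" if algo == "tmpfs" else "zram",
    }
```

For example, a command line containing `docker_inram_size=3000M docker_inram_algo=zstd` selects a `zram` volume of at most 3000M compressed with `zstd`, while a command line with neither parameter falls back to `tmpfs` at the default size.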

Refactored the logic to mount the docker fs in the SONiC initramfs under
the `union-mount` script.
Moved the code into a function to make it cleaner and separated the
inram volume creation and docker extraction.

On Arista platforms with a flash smaller than or equal to 4GB, set
`docker_inram_algo` to `zstd` which produces the best compression ratio
at the detriment of a slower write performance and a similar read
performance to other `zram` compression algorithms.
2023-05-18 14:21:52 -07:00
Samuel Angebault
467994c024 [Arista] Fix boot0 code for docker_inram
Enable docker_inram for all systems with 4GB or less of flash.
This is mandatory to allow these systems to store 2 SONiC images.

This change also fixes the missing docker_inram attribute when
installing a new image from SONiC.
Because the SWI image can ship with additional kernel parameters within
such as `sonic_fips=` this led to a conflict.
To prevent the conflict, the extra kernel parameters from the SWI are
now stored in the file `kernel-cmdline-append` which isn't used anywhere.
2023-05-18 14:21:52 -07:00
Anish Narsian
05a85b57b8
[arp_update] Resolve neighbors from config_db (#15006)
* Resolve NEIGH table entries present in CONFIG_DB. Without this change, ARP/NDP entries that are configured via CONFIG_DB and that we wish to resolve are not resolved.
2023-05-17 10:42:03 -07:00
mssonicbld
3d1ae46f90 [ci/build]: Upgrade SONiC package versions 2023-05-15 18:32:43 +08:00
mssonicbld
31223fb9fe
[ci/build]: Upgrade SONiC package versions (#15057) 2023-05-13 18:30:20 +08:00
judyjoseph
efeae03ea3
Add override_config to load_minigraph in config-setup service (#14834)
This PR handles overriding the minigraph config with the golden_config_db.json file if it is present in the backup location.
2023-05-10 11:54:33 -07:00
Zain Budhwani
a738c39328
Add fix to monit_regex.json for catching mem_usage and cpu_usage (#14954)
Why I did it
The current regex is not able to capture the logs; modify the regex to capture syslog messages.

Work item tracking
Microsoft ADO (number only): 13366345
How I did it
Code change

How to verify it
sonic-mgmt test case
2023-05-08 11:48:17 -07:00
Ying Xie
72c52bc677
Revert "Clear DNS configuration received from DHCP during networking reconfiguration in Linux. (#13516)" (#14902)
This reverts commit c7ecd92c54.
2023-05-01 17:12:38 -07:00
mssonicbld
80c5ab4a4a
[ci/build]: Upgrade SONiC package versions (#14896) 2023-05-01 18:10:48 +08:00
mssonicbld
0d709a3655
[ci/build]: Upgrade SONiC package versions (#14888) 2023-04-29 17:42:19 +08:00
Tejaswini Chadaga
ca224863cb
Changes to support TSA from supervisor (#14691)
Why I did it
Support for SONIC chassis isolation using TSA and un-isolation using TSB from supervisor module

Work item tracking
Microsoft ADO (number only): 17826134
How I did it
When TSA is run on the supervisor, it triggers TSA on each of the linecards using the secure rexec infrastructure introduced in sonic-net/sonic-utilities#2701. User password is requested to allow secure login to linecards through ssh, before execution of TSA/TSB on the linecards

TSA of the chassis withdraws routes from all the external BGP neighbors on each linecard, in order to isolate the entire chassis. No route withdrawal is done from the internal BGP sessions between the linecards to prevent transient drops during internal route deletion. With these changes, complete isolation of a single linecard using TSA will not be possible (a separate CLI/script option will be introduced at a later time to achieve this)

Changes also include no-stats option with TSC for quick retrieval of the current system isolation state

This PR also reverts changes in #11403

How to verify it
These changes have a dependency on sonic-net/sonic-utilities#2701 for testing

Run TSA from supervisor module and ensure transition to Maintenance mode on each linecard
Verify that all routes are withdrawn from eBGP neighbors on all linecards
Run TSB from supervisor module and ensure transition to Normal mode on each linecard
Verify that all routes are re-advertised from eBGP neighbors on all linecards
Run TSC no-stats from supervisor and verify that just the system maintenance state is returned from all linecards
2023-04-28 16:28:06 +08:00
Stephen Sun
9e56fea091
Temporary WA for the issue that asic_table.json can not be rendered (#13888)
- Why I did it
We suspect the issue #13791 is caused by redis server being temporarily unavailable during system initialization so we do not use -d in sonic-cfggen, for now, to avoid accessing redis server

- How I did it
Provide a string containing required json data when calling sonic-cfggen

- How to verify it
Manually test it

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2023-04-24 17:02:35 +03:00
mssonicbld
5ad844f185 [ci/build]: Upgrade SONiC package versions 2023-04-24 18:33:06 +08:00
mssonicbld
81a557885b
[ci/build]: Upgrade SONiC package versions (#14799) 2023-04-22 17:47:40 +08:00
mssonicbld
d006219e2d
[ci/build]: Upgrade SONiC package versions (#14718) 2023-04-19 18:59:16 +08:00
Aryeh Feigin
039a9c998a
[Fast-boot] Clear teamd-timer when finalizing fast-reboot (#14583)
Part of sonic-net/sonic-utilities#2760
Similar to #14295

- Why I did it
To clear the teamd timer when fast-reboot is finalized, to prevent any further effect.

- How I did it
Deleted teamd timer from config-db in fast-reboot finalizer.
The config save call is moved to after clearing teamd-timer so it won't have any further effect as well.

- How to verify it
Verified manually that the entry was deleted after fast-reboot was finalized.
2023-04-18 09:15:42 +03:00
Stepan Blyshchak
d73c810e86
[image_config] add rasdaemon.timer (#14300)
rasdaemon is a tool to log hardware errors. It takes 100% CPU during
boot for a few seconds. It impacts fast/warm boot by delaying control
plane restoration for 5 sec on some platforms.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2023-04-17 08:58:45 -07:00
mssonicbld
7f262d71da
[ci/build]: Upgrade SONiC package versions (#14685) 2023-04-17 19:58:43 +08:00
mssonicbld
49dbaeb649
[ci/build]: Upgrade SONiC package versions (#14672) 2023-04-15 18:21:50 +08:00