Commit Graph

1306 Commits

Author SHA1 Message Date
Saikrishna Arcot
34a1ac1a0f Migrate from ntp to ntpsec
Debian Bookworm no longer uses NTP, and instead uses NTPsec. Modify our
files to update/replace the NTPsec files instead.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-11-21 18:53:15 -08:00
abdosi
4a7aa2634f
[chassis] Support advertisement of Loopback0 of all LC's across all e-BGP peers in TSA mode (#16714)
What I did:
In Chassis TSA mode Loopback0 Ip's of each LC's should be advertise through e-BGP peers of each remote LC's

How I did:

- Route-map policy to Advertise own/self Loopback IP to other internal iBGP peers with a community internal_community as define in constants.yml
- Route-map policy to match on above internal_community when route is received from internal iBGP peers and set a internal tag as define in constants.yml and also delete the internal_community so we don't send to any of e-BGP peers
- In TSA new route-map match on above internal tag and permit the route (Loopback0 IP's of remote LC's) and set the community to traffic_shift_community.
- In TSB delete the above new route-map.

How I verify:

Manual Verification

UT updated.
sonic-mgmt PR: sonic-net/sonic-mgmt#10239


Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2023-11-20 09:42:02 -08:00
Ze Gan
9f08f88a0d
[dpu]: Add DPU database service (#17161)
Sub PRs:

sonic-net/sonic-host-services#84
#17191

Why I did it
According to the design, the database instances of DPU will be kept in the NPU host.

Microsoft ADO (number only): 25072889

How I did it
To follow the multiple ASIC design, I assume a new platform environment variable NUM_DPU will be defined in the /usr/share/sonic/device/$PLATFORM/platform_env.conf. Based on this number, NPU host will launch a corresponding number of instances for the DPU database.

Signed-off-by: Ze Gan <ganze718@gmail.com>
2023-11-17 09:10:03 -08:00
ganglv
c71fb3a30f
Share image for gnmi and telemetry (#16863)
Why I did it
Share docker image to support gnmi container and telemetry container

Work item tracking
Microsoft ADO 25423918:
How I did it
Create telemetry image from gnmi docker image.
Enable gnmi container and disable telemetry container by default.

How to verify it
Run end to end test.
2023-11-08 08:54:36 +08:00
prabhataravind
7e49530459
[copp]: Enable rate limiting for bgp, lacp, dhcp, lldp, macsec and udld (#14859)
Why I did it
It was observed that a flood of DHCP packets without rate-limiting can cause BGP flaps or lacp keepalive losses.
This change attempts to prevent or reduce such BGP flaps by enabling appropriate rate-limiting in SONiC for all traffic types.

Work item tracking
Microsoft ADO 17964421:

How I did it
Set a reasonable CIR/CBS value of 300 for queue4_group3 (dhcp, lldp, macsec) and 6000 for queue4_group1.
The value 300 was arrived at after testing with dhcp flooding using ptf (using multiple threads). Throttling at this rate was necessary to ensure that dhcp flooding does not cause BGP flaps.

How to verify it
Verified with this script running from ptf, that BGP flaps don't happen when CBS/CIR is set at 300 for queue4_group3.

 import threading
 from scapy.all import *
 
 def send_dhcp_discover(intf):
     dhcp_discover = Ether(dst='ff:ff:ff:ff:ff:ff',src=RandMAC()) \
                         /IP(src='1.1.1.1',dst='255.255.255.255') \
                         /UDP(sport=68,dport=67) \
                         /DHCP(options=[('message-type','discover'),('end')])
     sendp(dhcp_discover,count=100000,iface=intf)
 
 
 if __name__ == "__main__":
     t1 = threading.Thread(target=send_dhcp_discover, args=("eth1",))
     t2 = threading.Thread(target=send_dhcp_discover, args=("eth2",))
     t1.start()
     t2.start()
     t1.join()
     t2.join()

Verified on Arista-7260CX3-D108C8 running 202012 that the copp rule for queue4_group1 and queue4_group3 do NOT affect BGP packets. To verify this using PTF, the copp rules were modified to set the "CBS" and "CIR" for queue4_group1 and queue4_group3 at 600pps and 50k packets each of "BGP open" and "DHCP Discover" were simultaneously sent from the same PTF port to the DUT. It was verified using "show c cpu" that packets are hitting the cpu queue at 1200 pps (double the configured CIR/CBS for these packet types). This helped conclude that throttling rate is per trap (or packet type) and not per queue.

Verified with updated sonic-mgmt tests ([tests/copp]: Update copp mgmt tests to support new rate-limits sonic-mgmt#8199) on broadcom and mellanox platforms that these traffic types are rate-limited.

Signed-off-by: Prabhat Aravind <paravind@microsoft.com>
2023-10-25 10:49:24 -07:00
Kebo Liu
31451295d5
Add special rsyslog filter for MSN2700 platform (#16684)
- Why I did it
Mellanox MSN2700 platforms have a non-functional error log: "ERR pmon#sensord: Error getting sensor data: dps460/#10: Can't read". This error is because of a firmware issue with some PSU, we are not able to upgrade the FW online. Since there is no functional impact, this error log can be ignored safely.

- How I did it
Add a new rsyslog rule to the rsyslog-container.conf.j2, if the docker name is pmon and the platform name matches, the new rule will be inserted into the docker rsyslogd.conf

- How to verify it
run regression on the MSN2700 platform to make the error log will not be printed to the syslog.

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2023-10-24 17:54:44 +03:00
Samuel Angebault
e4a497183a
Add build option to reduce final image size (#16729)
* Reduce SONiC image filesystem size

Add a build option to reduce the image size.
The image reduction process is affecting the builds in 2 ways:
 - change some packages that are installed in the rootfs
 - apply a rootfs reduction script

The script itself will perform a few steps:
 - remove file duplication by leveraging hardlinks
   - under /usr/share/sonic since the symlinks under the device folder are lost during the build.
   - under /var/lib/docker since the files there will only be mounted ro
 - remove some extra files (man, docs, licenses, ...)
 - some image specific space reduction (only for aboot images currently)

The script can later be improved but for now it's reducing the rootfs
size by ~30%.

* restore fully featured vim package
2023-10-24 10:01:58 +08:00
Samuel Angebault
d760fb928c
Disable CPU C-States other than C1 (#16703)
Why I did it
Networking devices need to be responsive. Such responsiveness is harmed when the CPU change state.
There is a latency penalty when a CPU is idle (e.g C2) and need to exit this state to come back to C1 state.
To prevent this from happening the CPU should be forced to remain in C1 state.

How I did it
Generalize the cstate forcing to C1 to all Arista products.
This is done by adding processor.max_cstate=1 to the kernel cmdline for all CPUs.
Additionally Intel CPUs also need intel_idle.max_cstate=0 to fallback to the acpi_idle driver.

How to verify it
Check that processor.max_cstate=1 is present on the cmdline for AMD CPUs
Check that both processor.max_cstate=1 and intel_idle.max_cstate=0 are present on the cmdline for Intel CPUs
2023-10-13 20:24:39 -07:00
Longxiang Lyu
072eaed2e3
[snmp] Check intfmgrd running before start (#16588)
Add pre start check to ensure intfmgrd is running.
The check will run for 20 seconds at most.

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
2023-10-13 16:00:51 -07:00
Saikrishna Arcot
469aed2cf7
[baseimage]: Update openssh to 1:8.4p1-5+deb11u2 (#16826)
Openssh in Debian Bullseye has been updated to 1:8.4p1-5+deb11u2 to fix CVE-2023-38408. 
Since we're building openssh with some patches, we need to update our version as well.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-10-11 10:42:20 -07:00
Vadym Hlushko
3bd396043e
[buffers] Add 'create_only_config_db_buffers.json' file for the Mellanox devices (not MSFT SKU) (#16233)
* [buffers] Add create_only_config_db_buffers.json for MLNX devices (not MSFT SKU), inject it at the start of the swss docker

Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>

* [buffers] Align the sonic-device_metadata.yang

Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>

---------

Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
2023-10-03 08:35:57 -07:00
lixiaoyuner
bca2ce25ef
[k8master]: Install nc cmd for k8s master network issue debug (#16745) 2023-09-30 01:16:51 -07:00
Vaibhav Hemant Dixit
07f8507911
[fast-reboot] Fix regression: set FAST_REBOOT state_db flag to support fast-reboot from older images (#16733)
Why I did it
Fix: #16699

Fast reboot is failing from old OS versions (eg., 201911 image) to latest (eg., master branch) after PR #15685

The system wide flag for FAST_REBOOT is still required when the base OS version does not support the new fast-reboot reconciliation logic (no db dump)
2023-09-28 09:37:21 -07:00
mssonicbld
4768b76610 [ci/build]: Upgrade SONiC package versions 2023-09-26 12:32:31 +08:00
Yevhen Fastiuk
52f6dd65a3
Improve remote fetch (#12795)
### Why I did it
To fix those errors:
One:
```
Connecting to urm.nvidia.com (urm.nvidia.com)|*.*.*.*|:443... connected.
GnuTLS: Error in the pull function.
Unable to establish SSL connection.
Error 4
make[1]: Leaving directory '/sonic/src/smartmontools'
[ target/debs/bullseye/smartmontools_6.6-1_amd64.deb ]
```
Second:
```
Get:90 https://debian-mirror-url buster/main amd64 librrd-dev amd64 1.7.1-2 [284 kB]
Get:91 https://debian-mirror-url buster/main amd64 psmisc amd64 23.2-1+deb10u1 [126 kB]
Get:92 https://debian-mirror-url buster/main amd64 python-smbus amd64 4.1-1 [12.2 kB]
Get:93 https://debian-mirror-url buster/main amd64 python3.7-dev amd64 3.7.3-2+deb10u3 [510 kB]
Get:94 https://debian-mirror-url buster/main amd64 python3-dev amd64 3.7.3-1 [1264 B]
Get:95 https://debian-mirror-url buster/main amd64 python3-smbus amd64 4.1-1 [12.5 kB]
Get:96 https://debian-mirror-url buster/main amd64 rrdtool amd64 1.7.1-2 [485 kB]
Fetched 122 MB in 12s (9976 kB/s)
E: Failed to fetch https://debian-mirror-url/pool/main/p/python-defaults/python2-minimal_2.7.16-1_amd64.deb  500  Internal Server Error [IP: *.*.*.* 443]
E: Failed to fetch https://debian-mirror-url/pool/main/f/fontconfig/fontconfig-config_2.13.1-2_all.deb  500  Internal Server Error [IP: *.*.*.* 443]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
The command '/bin/sh -c apt-get update &&       apt-get install -y          build-essential         python3-dev             ipmitool                librrd8                 librrd-dev              rrdtool                 python-smbus            python3-smbus           dmidecode               i2c-tools               psmisc                  libpci3' returned a non-zero code: 100
[ target/docker-platform-monitor.gz ]
Error 1
```

#### How I did it
Add retry mechanism to apt, wget, and curl hooks
2023-09-23 18:07:04 -07:00
abdosi
2ce04ab91d
[chassisd]: Add alternate to the bridge interface created on chassis supervisor. (#16505)
Add alternate name eth1-midplane to Linux bridge br1 created on supervisor on some chassis platforms.
See description here: #16504

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2023-09-23 00:27:21 -07:00
Hua Liu
11f5a75425
[tacacs]: Fix tcpdump report error when tacacs enabled (#16372)
Fix tcpdump report error when tacacs enabled.

Why I did it
Fix tcpdump report error when tacacs enabled:
Sep 1 09:25:18.189395 vlab-01 ERR tcpdump: nss_tacplus: /etc/tacplus_nss.conf fopen failed
Sep 1 09:25:18.189606 vlab-01 ERR tcpdump: nss_tacplus: bad config or server line for nss_tacplus

This is because debian add a patch create AppArmor profile for resource access control. The profile need update to allow tcpdump access /etc/tacplus_nss.conf.

Work item tracking
Microsoft ADO: 17667308

How I did it
Modify tcpdump AppArmor profile, add new line to allow tcpdump access TACACS config file:

/etc/tacplus_nss.conf r,
2023-09-23 00:07:53 -07:00
mssonicbld
a8af76f108 [ci/build]: Upgrade SONiC package versions 2023-09-22 12:32:57 +08:00
Zain Budhwani
382d68fe42
Move syncd events to syncd.conf (#15950)
### Why I did it

syncd events should have tag sonic-events-syncd, not sonic-events-host. Created a new conf file which will have syncd events

##### Work item tracking
- Microsoft ADO **(number only)**:17747466

#### How I did it

Code change

#### How to verify it

Pipeline
2023-09-19 17:29:49 -07:00
vdahiya12
45a852233b
[pmon] update gRPC version to 1.57.0 (#16257)
Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
2023-09-15 16:41:51 -07:00
Yaqiang Zhu
d11e0a214e
Add use_unix_socket_path to supervisor-proc-exit-listener (#16548)
Why I did it
ConfigDBConnector in supervisor-proc-exit-listener uses default parameter to connect CONFIG_DB (connect by 127.0.0.1:6379) which would fail at non-host network mode container, because they are not sharing the same network and socket.

How I did it
Add a new parameter use_unix_socket_path to this script to indicate whether to use socket to connect CONFIG_DB.

How to verify it
Build image and install it, kill critical processes in container and container crushed.
2023-09-15 16:23:25 -07:00
Saikrishna Arcot
f207a9b0e0
Fix potentially not having any loopback address on lo interface (#16490)
In #15080, there was a command added to re-add 127.0.0.1/8 to the lo
interface when the networking configuration is being brought down.
However, the trigger for that command is `down`, which, looking at
ifupdown2 configuration files, runs immediately after 127.0.0.1/16 is
removed. This means there may be a period of time where there are no
loopback addresses assigned to the lo interface, and redis commands will
fail.

Fix this by changing this to pre-down, which should run well before
127.0.0.1/16 is removed, and should always leave lo with a loopback
address.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-09-14 12:55:50 -07:00
ganglv
ce7145475d
Fix grpc package for ptf container (#16536)
Why I did it
PTF container needs to use new grpcio package.

Work item tracking
Microsoft ADO (number only):
How I did it
Update versions-py2

How to verify it
Check pipeline artifact
2023-09-14 08:27:32 +08:00
Zain Budhwani
337a9dbcf4
Add rsyslog plugin support for frr log (#16192)
### Why I did it

Currently there is only rsyslog plugin support for /var/log/syslog, meaning we do not detect events that occur in frr logs such as BGP Hold Timer Expiry that appears in frr/bgpd.log. 

##### Work item tracking
- Microsoft ADO **(number only)**: 13366345

#### How I did it

Add omprog action to frr/bgpd.log and frr/zebra.log. Add appropriate regex for both events.

#### How to verify it

sonic-mgmt test case
2023-09-12 16:53:45 -07:00
Yaqiang Zhu
76b7cb8b64
[dhcp_server] Add dhcp_server container (#14031)
Why I did it
Add dhcp_server ipv4 feature to SONiC.
HLD: sonic-net/SONiC#1282

How I did it
To be clarify: This container is disabled by INCLUDE_DHCP_SERVER = n for now, which would cause container not build.

Add INCLUDE_DHCP_SERVER to indicate whether to build dhcp_server container
Add docker file for dhcp_server, build and install kea-dhcp4 inside container
Add template file for dhcp_server container services.
Add entry for dhcp_server to FEATURE table in config_db.
How to verify it
Build image with INCLUDE_DHCP_SERVER = y to verify:

Image can be install successfully without crush.
By config feature state dhcp_server enabled to enable dhcp_server.
2023-09-11 09:15:56 -07:00
vganesan-nokia
b13b41fc22
[swss] Chassis db clean up optimization and bug fixes (#16454)
* [swss] Chassis db clean up optimization and bug fixes

This commit includes the following changes:
    - Fix for regression failure due to error in finding CHASSIS_APP_DB in
    pizzabox (#PR 16451)
    - After attempting to delete the system neighbor entries from
    chassis db, before starting clearing the system interface entries,
    wait for sometime only if some system neighbors were deleted.
    If there are no system neighbors entries deleted for the asic coming up,
    no need to wait.
    - Similar changes for system lag delete. Before deleting the
    system lag, wait for some time only if some system lag memebers were
    deleted. If there are no system lag members deleted no need to wait.
    - Flush the SYSTEM_NEIGH_TABLE from the local STATE_DB. While asic
    is coming up, when system neigh entries are deleted from chassis ap
    db (as part of chassis db clean up), there is no orchs/process running to
    process the delete messages from chassis redis. Because of this, stale system
    neigh are entries present in the local STATE_DB. The stale entries result in
    creation of orphan (no corresponding data path/asic db entry) kernel neigh
    entries during STATE_DB:SYSTEM_NEIGH_TABLE entries processing by nbrmgr (after
    the swss serive came up). This is avoided by flushing the SYSTEM_NEIGH_TABLE from
    the local STATE_DB when sevice comes up.

Signed-off-by: vedganes <veda.ganesan@nokia.com>

* [swss] Chassis db clean up bug fixes review comment fix - 1

Debug logs added for deletion of other tables (SYSTEM_INTERFACE and SYSTEM_LAG_TABLE)

Signed-off-by: vedganes <veda.ganesan@nokia.com>

---------

Signed-off-by: vedganes <veda.ganesan@nokia.com>
2023-09-11 08:28:27 -07:00
lixiaoyuner
4f53819efa
Install parted package for k8s master (#16484)
### Why I did it
Need a tool to extend disk size
##### Work item tracking
- Microsoft ADO **(number only)**: 25094467
#### How I did it
Install parted package
#### How to verify it
Use apt list parted command to check if it's installed
2023-09-07 23:22:47 -07:00
Aman Singhal
e22136dd9f
[cisco]: Enable Kdump config by default for cisco-8000 (#16224)
Why I did it
Enabling kdump by default for cisco-8000 by setting crashkernel cmdline arg in device installer.conf.
After bootup, sonic-kdump-config wipes crashkernel arg from /host/grub/grub.cfg, and resets USE_KDUMP in /etc/default/kdump-tools, so kdump will not be enabled on subsequent reboot.

How I did it
Setting kdump enable config as part of init_cfg.json for cisco-8000 platforms.

How to verify it
Install SONiC image with kdump enabled by default (device/hwsku/installer.conf), then reboot.
Kdump config should persist on subsequent reboots and kdump loaded during bootup

Signed-off-by: Aman Singhal <amans@cisco.com>
2023-09-07 01:30:24 -07:00
mssonicbld
204579a0cc [ci/build]: Upgrade SONiC package versions 2023-09-06 12:32:47 +08:00
Prince George
a4e37a5cd6
[platform]: Disable interrupt for intel i2c-i801 driver (#16309)
On S6100 we are seeing almost 100K interrupts per second on intels i801 SMBUS controller which affects systems performance.

We now disable the i801 driver interrupt and instead enable polling

Microsoft ADO (number only): 24910530

How I did it
Disable the interrupt by passing the interrupt disable feature argument to i2c-i801 driver

How to verify it
This fix is NOT applicable for ARM based platforms. Applicable only for intel based platforms:-

- On SN2700 its already disabled in Mellanox hw-mgmt
- Celestica DX010 and E1031
- Dell S6100 verified the interrupts are no longer incrementing.
- Arista 7260CX3

Signed-off-by: Prince George <prgeor@microsoft.com>
2023-09-05 10:23:57 -07:00
Stephen Sun
b5e8c16134
[Mellanox] Enhance FW upgrade mechanism (#16090)
### Why I did it

1. Enhance the diagnosis information collecting mechanism
   - If the option `-v` is fed, it will pass additional diagnosis flags to mlxfwmanager
   - Collect all the output from mlxfwmanager and print them to syslog if it fails
2. Abort syncd in case waiting for device or upgrading firmware fails

Signed-off-by: Stephen Sun <stephens@nvidia.com>

### How I did it

#### How to verify it

Regression and manual test
2023-09-04 11:28:53 -07:00
anamehra
f6897bb585
chassis-packet: Update arp_update script for FAILED and STALE check (#16311)
chassis-packet: Update arp_update script for FAILED and STALE check (#16311)

1. Fixing an issue with FAILED entry resolution retry.
Neighbor entries in arp table may sometimes enter a FAILED state when the far end is down and reports the state as follows:
2603:10e2:400:3::1 dev PortChannel19 router FAILED
While the arp_update script handles the entries for FAILED in the following format, the above was not handled due to the token location (extra router keyword at index 4):
2603:10e2:400:3::1 dev PortChannel19 FAILED

The former format may appear if an arp resolution is tried on a link that is known but the far end goes down, e.g., pinging a STALE entry while the far end is down.

2. Refreshing STALE entries to make sure the far end is reachable.
STALE entries for some backend ports may appear in chassis-packet when no traffic is received for a while on the port. When the far end goes down, it is expected for BFD to stop sending packets on the session for which the far end is not reachable. But as the entry is known as stale, on the Cisco chassis, BFD keeps sending packets. Refreshing the stale entry will keep active links as reachable in the neighbor table while the entries for the far end down will enter a failed state. FAILED state entries will be retired and entered reachable when far end comes back up.
2023-09-01 11:41:46 -07:00
abdosi
566b5dfa1f
Assign the higher metric value for Ipv6 default route learnt via RA message (#16367)
* Fix the Loopback0 IPv6 address of LC's in chassis not reachable from peer device's
* Assign the metric vaule for Ipv6 default route learnt via RA message to higher value so that BGP learnt default route is higher priority.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2023-09-01 11:38:14 -07:00
mssonicbld
f78d25b11e [ci/build]: Upgrade SONiC package versions 2023-09-01 16:32:44 +08:00
vganesan-nokia
5fded5c51b
[chassis] Chassis DB cleanup when asic comes up (#16213)
* [chassis]Chassis DB cleanup when asic comes up

Cleanup the entries from the following tables in chassis app db in
redis_chassis server in the supervisor
(1) SYSTEM_NEIGH
(2) SYSTEM_INTERFACE
(3) SYSTEM_LAG_MEMBER_TABLE
(4) SYSTEM_LAG_TABLE
As part of the clean up only those entries created by the asic that
is coming up are deleted. The LAG IDs used by the asics are also
de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET

- Added check to run the chassis db clean up only for voq switches.

Signed-off-by: vedganes <veda.ganesan@nokia.com>
2023-08-31 23:38:56 -07:00
lixiaoyuner
410e6ff406
Install pyOpenSSL package for k8s master (#16361)
### Why I did it
Need a tool to check certificate's detail of information.
##### Work item tracking
- Microsoft ADO **(number only)**: 25020260
#### How I did it
Install pyOpenSSL package for k8s master
#### How to verify it
Pip3 list to check whether it's installed when include_kubernetes_master=y
2023-08-31 22:26:24 -07:00
Alpesh Patel
cabdac17a5
qos template change for backend compute-ai deployment (#16150)
#### Why I did it

To enable qos config for a certain backend deployment mode, for resource-type "Compute-AI".
This deployment has the following requirement:

- Config below enabled if DEVICE_TYPE as one of backend_device_types
- Config below enabled if ResourceType is 'Compute-AI'
- 2 lossless TCs' (2, 3)
- 2 lossy TCs' (0,1)
- DSCP to TC map uses 4 DSCP code points and maps to the TCs' as follows:
   "DSCP_TO_TC_MAP": {
        "AZURE": {
             "48" : "0",
            "46" : "1",
            "3"  : "3",
            "4"  : "4"
        }
    }

- WRED profile has green {min/max/mark%} as {2M/10M/5%}

This required template change <as in the PR> in addition to the vendor qos.json.j2 file (not included here).

### How I did it

#### How to verify it
- with the above change and the vendor config change, generated the qos.json file and verified that the objective stated in "Why I did it" was met

- verified no error

### Description for the changelog
Update qos_config.j2 for Comptue-AI deployment on one of backend device type roles
2023-08-31 11:30:20 -07:00
Vadym Hlushko
43340cd58d
[memory_checker] Add a specific log message in a case when the docker service is not running. (#16018)
#### Why I did it
To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created #11129](https://github.com/sonic-net/sonic-buildimage/pull/11129).
There could be a scenario before the reboot, where
1. The `docker service` has stopped
2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry`

In such scenario, the `memory_checker` script will throw an error to the syslog:
```
ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))'
```
But, actually, this scenario is a correct behavior, because when the docker service is stopped, the Unix socket is destroyed and that is why we could see the `FileNotFoundError(2, 'No such file or directory'` exception in the syslog.

#### How I did it
Change the log severity to the warning and changed the return value.

#### How to verify it
It is really hard to catch the exact moment described in the `Why I did it` section.
In order to check the logic:
1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](47742dfc2c/files/image_config/monit/memory_checker (L139)) file on the switch.
2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry`
3. Check the syslog for such messages:
```
WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborte
d.', FileNotFoundError(2, 'No such file or directory'))'

INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running!
```
2023-08-31 11:28:20 -07:00
abdosi
b6edc374ba
[build]: Added flag in sonic_version.yml to see if image is secured or non-secured (#16191)
What I did:

Added flag in sonic_version.yml to see if compiled image is secured or non-secured. This is done using build/compile time environmental variable SECURE_UPGRADE_MODE as define in HLD: https://github.com/sonic-net/SONiC/blob/master/doc/secure_boot/hld_secure_boot.md

This flag does not provide the runtime status of whether the image has booted securely or not. It's possible that compile time signed image (secured image) can boot on non secure platform.

Why I did:
Flag can be used for manual check or by the test case.

ADO: 24319390

How I verify:
Manual Verification

---
build_version: 'master-16191.346262-cdc5e72a3'
debian_version: '11.7'
kernel_version: '5.10.0-18-2-amd64'
asic_type: broadcom
asic_subtype: 'broadcom'
commit_id: 'cdc5e72a3'
branch: 'master-16191'
release: 'none'
build_date: Fri Aug 25 03:15:45 UTC 2023
build_number: 346262
built_by: AzDevOps@vmss-soni001UR5
libswsscommon: 1.0.0
sonic_utilities: 1.2
sonic_os_version: 11
secure_boot_image: 'no'

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2023-08-29 13:41:01 -07:00
mssonicbld
55849d0c6b
[ci/build]: Upgrade SONiC package versions (#16300) 2023-08-28 18:31:51 +08:00
mssonicbld
c8465c0d9a
[ci/build]: Upgrade SONiC package versions (#16294) 2023-08-26 18:45:45 +08:00
Vivek
0652991eb8
Run db_migrator for non first-time reboots (#16116)
- Why I did it
The recent change #15685 (comment) removed the db migration for non first reboots.
This is problematic for many deployments which doesn't rely on ZTP and push a custom config_db.json
Port to older branches after #15685 is ported back

- How I did it
Re-introduce the logic to run the db_migrator on non-first boots

- How to verify it
Verified reboot and warm-reboot cases

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
2023-08-22 18:36:38 +03:00
mssonicbld
871b122495
[ci/build]: Upgrade SONiC package versions (#16219) 2023-08-21 18:32:24 +08:00
mssonicbld
ec91ff30c9 [ci/build]: Upgrade SONiC package versions 2023-08-20 14:32:25 +08:00
Kebo Liu
1626e198a8
[Mellanox] Update SDK/FW/SAI to 4.6.1020/2012.1020/SAIBuild2305.25.0.3 (#16096)
SONiC changes:
1. Support Spectrum4 ASIC FW binary building.
2. Support new SDK sx-obj-desc lib building since new SAI need it.
3. Remove SX_SCEW debian package from Mellanox SDK build since we are no longer using it (we use libxml2 instead).
4. Update SAI, SDK, FW to version 4.6.1020/2012.1020/SAIBuild2305.25.0.3

SDK/FW bug fixes
1. In SPC-1 platforms: Fastboot mode is not operational for Split port with Force mode in 50G speed
SFP modules are kept in disabled state after set LPM (low power mode) on/off for at least 3 minutes.
2. When preforming fast boot from an old SDK version (currently installed) to a newer one (target version), and the system was initially loaded with a new SDK version (past version), and the system has not been wiped, under specific conditions, the fast boot would use the past version's data and may fail.

SDK/FW Features
1. On SN2700 all ports can support y cable by credo

SAI bug Fixes
1. When creating an ACL rule with SAI_ACL_ENTRY_ATTR_FIELD_SRC_IP/SAI_ACL_ENTRY_ATTR_FIELD_DST_IP enabled, and then disabling the field by setting enable=false, a match on L3_type=IPv4 will remain programmed for the rule Issue resolved after the fix
2. Allow the max scale of virtual routers to be configure for SPC-1, SPC-2, SPC-3 when fastboot enable 
3. Remove default hash key of SRC_MAC, DST_MAC and ETH_TYPE

SAI features
1. Port init profile

- How I did it
Update SDK/FW/SAI make files

- How to verify it
Run full sonic-mgmt regression on Mellanox platform

Signed-off-by: Kebo Liu <kebol@nvidia.com>
2023-08-15 15:32:52 +03:00
Arvindsrinivasan Lakshmi Narasimhan
46817036fd
[chassis]: removed dependency for bgp and swss for chassis supervisor (#15734)
Fixes #15667 and #13293

Work item tracking
Microsoft ADO 24472854:

How I did it
On chassis supervisor bgp feature is disabled in hostcfgd. The dependency between swss and bgp causes the bgp containers to start even though the feature is disabled.

How to verify it
Tests on chassis supervisor and LC
2023-08-07 09:52:48 -07:00
Vadym Hlushko
9fba98ce6d
[syncd.sh] Clear semaphore before updating firmware (#15818)
Why I did it
The hw resources should be released before updating firmware.

How I did it
Added logic to release hw resources in syncd.sh script

Signed-off-by: Vadym Hlushko <vadymh@nvidia.com>
2023-08-06 22:30:33 -07:00
andywongarista
96fa513690
[Arista] Add support for DCS-7060DX5-32 (#14793)
* Add asic support for blackhawkth4dd

* Add bfd feature to BlackhawkTh4Dd

* Add platform data for blackhawkth4

* Add Qos settings for Blackhawk-TH4

* Add pg and queue settings for Blackhawk-TH4

* Add buffers_defaults_t0.j2

* Add blackhawkth4 to boot0

* Update 7060dx5 config.bcm

* Fix build error

---------

Co-authored-by: Boyang Yu <byu@arista.com>
Co-authored-by: David Meggy <davidm@arista.com>
2023-08-05 22:11:45 +08:00
Vaibhav Hemant Dixit
e127701660
Fix CONFIG_DB_INITIALIZED flag check logic and set/reset flag for warmboot (#15685)
* Fix CONFIG_DB_INITIALIZED flag check logic and set/reset flag for warm-reboot
* Fix db-cli usage
* Handle same image warm-reboot and generalize handling of INIT flag
* Cover boot from ONIE case: set config init flag when minigraph, config_db are missing
* Handle case: first boot of SONiC
* Check for config init flag
* Simplify logic, and do not call db_migrator for same image reboot
2023-08-04 16:00:26 -07:00
ganglv
5c4ab7a7f4
Use DNS j2 for default DNS configuration (#15901)
Why I did it
Support default DNS configuration

How I did it
Use j2 template to generate default DNS configuration.

How to verify it
Run sonic-config-engine unit test.
2023-07-31 15:43:00 -07:00
mssonicbld
9b08fe4eb2 [ci/build]: Upgrade SONiC package versions 2023-07-29 16:32:36 +08:00
xumia
a0b3ec2df6
Support FIPS DB configuration (#15632)
Why I did it
Support FIPS DB configuration
Design Doc: sonic-net/SONiC#1372

Work item tracking
Microsoft ADO (number only): 24411148
How I did it
Add the FIPS Yang model to make FIPS configurable in ConfigDB.

How to verify it
See TestPlan: sonic-net/sonic-mgmt#9092
Build the image and run the tests: sonic-net/sonic-mgmt#9091
2023-07-28 16:54:02 +08:00
Longxiang Lyu
dc139cfc32
[monit][dualtor] Periodically check mux neighbors consistency (#15769)
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
2023-07-24 21:16:49 -07:00
lixiaoyuner
10b65d9826
Add k8s master code new (#15716)
Why I did it
Currently, k8s master image is generated from a separate branch which we created by ourselves, not release ones. We need to commit these k8s master related code to master branch for a better way to do k8s master image build out.

Work item tracking
Microsoft ADO (number only):
19998138
How I did it
Install k8s dashboard docker images
Install geneva mds and mdsd and fluentd docker images and tag them as latest, tagging latest will help create container always with the latest version
Install azure-storage-blob and azure-identity, this will help do etcd backup and restore.
Install kubernetes python client packages, this will help read worker and container state, we can send these metric to Geneva.
Remove mdm debian package, will replace it with the mdm docker image
Add k8s master entrance script, this script will be called by rc-local service when system startup. we have some master systemd services in compute-move repo, when VMM service create master VM, VMM will copy all master service files inside VM, the entrance script will setup all services according to the service files.
When the entrance script content changed, the PR build will set include_kubernetes_master=y to help do validation for k8s master related code change. The default value of include_kubernetes_master should be always n for public master branch. We will generate master image from internal master branch
How to verify it
Build with INCLUDE_KUBERNETES_MASTER = y
2023-07-25 07:44:59 +08:00
Junchao-Mellanox
05f9c5c297
Fix issue: set delayed attribute to true for platform monitor service (#15816)
There is a redundant line in init_cfg.json.j2. It would cause pmon service always has "delayed=False". However, we know that PMON has a timer now. So, I try to fix it here.
2023-07-24 08:30:35 -07:00
guangyao6
9567c06570
Add BGP configuration for BGPSentinel peer (#15714)
Why I did it
For route registry service, in order to block hijacked routes, IBGP session needs to be set up from BGP sentinel service to SONiC, and BGP sentinel service advertise the same route with higher local-preference and no export community. So that SONiC takes the route from BGP sentinel as the best path and does not advertise the route to EBGP peers.
In order to do that, new route-maps are needed. So this change adds a new set of templates, keeping BGPSentinel peers out of the other templates.

Work item tracking
Microsoft ADO (number only): 24451346
How I did it
Add sentinel_community in constants.yml, route from BGPSentinel do not match this community will be denied.
Add support to convert BGPSentinel related configuration in the BGPPeerPassive element of the minigraph to a new BGP_SENTINELS table in CONFIG_DB
Add a new set of "sentinels" templates to docker-fpm-frr
Add a new BGP peer manager to bgpcfgd, to add neighbors from the BGP_SENTINELS table using the "sentinels" templates
Add a test case for minigraph.py, making sure the BGPSentinel and BGPSentinelV6 elements create BGP_SENTINELS DB entry.
Add a set of test cases for the new sentinels templates in sonic-bgpcfgd tests.
Add sonic-bgp-sentinel.yang and a set of testcases for the yang file.

How to verify it
Testcases and UT newly added would pass.
Setup IPv4 and IPv6 BGPSentinel services in minigraph, and load minigraph, show CONFIG_DB and "show runningconfig bgp", configuration would be loaded successfully.
Using t1-lag topo and setup IBGP session from BGPSentinel to SONiC loopback address, IBGP session would up.
Advertise route from BGPSentinel to T1 with sentinel_community, higher local-preference and no-export communiyt. In T1, show bgp route, the result is "Not advertise to any EBGP peer".
Withdraw the route in BGPSentinel, in T1, route would advertise to EBGP peers.
Advertise route from T1 that does not match sentinel_community, in T1, would not see the route in show bgp route.
2023-07-21 09:32:29 +08:00
vmittal-msft
fea10546f2
Update WRED profile on system ports (#15612)
* Update WRED profile on system ports
2023-07-19 15:00:39 -07:00
mssonicbld
ecc0f4c243 [ci/build]: Upgrade SONiC package versions 2023-07-20 04:32:51 +08:00
mssonicbld
39f3e1f97a
[ci/build]: Upgrade SONiC package versions (#15862) 2023-07-17 19:08:24 +08:00
mssonicbld
273cb46af9
[ci/build]: Upgrade SONiC package versions (#15854) 2023-07-15 20:23:42 +08:00
Liping Xu
95d11976bd
update rsyslog log size conf (#15821)
Why I did it
For some devices whose log folder size is larger than 200M, for example, 256M, the LOG_FILE_ROTATE_SIZE_KB should be 16M. and
THRESHOLD_KB=$((USABLE_SPACE_KB - (NUM_LOGS_TO_ROTATE * LOG_FILE_ROTATE_SIZE_KB * 2)))
= $(( (VAR_LOG_SIZE_KB * 90 / 100) - RESERVED_SPACE_KB)) - (NUM_LOGS_TO_ROTATE * LOG_FILE_ROTATE_SIZE_KB * 2)))
= $(( (256M * 90 / 100) - 4096)) - (8 * 16M * 2)))
the result would be a negative value

Work item tracking
Microsoft ADO (number only):
24524827
How I did it
Add a case for 400M, if the log folder size is between 200M and 400M, set the log file size to 2M

How to verify it
Do cmd "sudo logrotate -f /etc/logrotate.conf" on DUT which val/log folder size is 256M, and check the syslog.
2023-07-14 15:44:17 +08:00
Mohammedz93
28b9299445
Support Reset factory (#14105)
#### Why I did it
Support reset factory in Sonic OS
[Reset Factory HLD](https://github.com/sonic-net/SONiC/pull/1231)
[Sonic-mgmt tests](https://github.com/sonic-net/sonic-mgmt/pull/7652)

#### How I did it
- Added new script "/usr/bin/reset-factory"
   * It generates a new config_db.json files with factory configurations
   * It clears system files and logs
   * It removes all docker containers on system except database
   * It clears non-default users and restores default users password
- Dump the default users info to a new file during build "/etc/sonic/default_users.json"
- Supported new type "Keep-basic" in "config-setup factory"
- Add new conf file for config-setup "/etc/config-setup/config-setup.conf

#### How to verify it
- Run reset-factory script with all types: < none | keep-all-config | only-config | keep-basic >
- Run config-setup factory with parameters < none | keep-basic >

#### Description for the changelog
Support reset factory in Sonic OS

#### Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
2023-07-11 16:14:17 -07:00
iavraham
72021fdb0f
Add remote syslog configuration (#14513)
* Add an ability to configure remote syslog servers
* Add an initial configuration for remote syslog
* Extend YANG module and add unit tests

#### Why I did it
Adding the following functionality to rsyslog feature:

- Configure remote syslog servers: protocol, filter, severity level
- Update global syslog configuration: severity level, message format

#### How I did it
added parameters to syslog server and global configuration.

#### How to verify it
create syslog server using CLI/adding to Redis-DB
verify server is added to file /etc/rsyslog.conf and server is functional.

#### Description for the changelog
extend rsyslog capabilities, added server and global configuration parameters.

#### Link to config_db schema for YANG module changes
https://github.com/iavraham/sonic-buildimage/blob/master/src/sonic-yang-models/yang-models/sonic-syslog.yang
2023-07-10 11:40:08 -07:00
mssonicbld
e57692c30d
[ci/build]: Upgrade SONiC package versions (#15757) 2023-07-08 19:34:00 +08:00
Vaibhav Hemant Dixit
ddb3086620
Revert "Revert "Fix for fast/cold-boot: call db_migrator only after old config is loaded (#14933)" (#15464)" (#15684)
This reverts commit 9649a44470.
2023-07-06 17:34:35 -07:00
lixiaoyuner
ca29197184
Move k8s script to docker-config-engine (#14788)
Why I did it
To reduce the container's dependency from host system

Work item tracking
Microsoft ADO (number only):
17713469
How I did it
Move the k8s container startup script to config engine container, other than mount it from host.

How to verify it
Check file path(/usr/share/sonic/scripts/container_startup.py) inside config engine container.

Signed-off-by: Yun Li <yunli1@microsoft.com>
Co-authored-by: Qi Luo <qiluo-msft@users.noreply.github.com>
2023-07-05 14:44:48 -07:00
mssonicbld
de65640633
[ci/build]: Upgrade SONiC package versions (#15715) 2023-07-05 18:37:13 +08:00
mssonicbld
7ef59d556b
[ci/build]: Upgrade SONiC package versions (#15706) 2023-07-03 19:18:54 +08:00
mssonicbld
aa5164ef09
[ci/build]: Upgrade SONiC package versions (#15647) 2023-07-01 18:39:31 +08:00
Lawrence Lee
b4a3711a95
[arp_update]: Fix IPv6 neighbor race condition (#15583)
* [arp_update]: Fix IPv6 neighbor race condition on dualtors
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2023-06-30 14:06:25 -07:00
Stepan Blyshchak
1ebdcda9e3
[nvidia] make sure shared storage with syncd is cleared on restarts (#14547)
Why I did it
Sharing the storage of syncd with other proprietary application extensions allows them to communicate with syncd in differnt ways.
If one container wants to pass some information to syncd then shared storage can be used. However, today the shared storage isn't cleaned on restarts making it possible for syncd to read out-of-date information generated in the past.

NOTE: No plans to use it for standard SONIC dockers and we are working on removing the SDK dependency from PMON docker

How I did it
Implemented new service to clean the shared storage.

How to verify it
Do reboot/fast-reboot/warm-reboot/config-reload/systemctl restart swss and verify /tmp/ is cleaned after each restart in syncd container.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2023-06-28 15:26:49 -07:00
siqbal1986
bf5b72a356
Vnet monitor table cleanup (#15399)
* Added  VNET_MONITOR_TABLE, BFD_SESSION_TABLE, to the listof tables to be cleaned up after swss restart.
* Added  VNET_ROUTE* table in cleanup. This should cover VNET_ROUTE_TUNNEL_TABLE as well.
2023-06-27 12:53:56 -07:00
mssonicbld
aa11acdddd [ci/build]: Upgrade SONiC package versions 2023-06-26 20:55:55 +08:00
Junchao-Mellanox
b07957bdad
Fix issue: systemctl daemon-reload would sporadically cause udev handler fail (#15253)
#### Why I did it

A workaround to back port the fix for a systemd issue.

The systemd issue: https://github.com/systemd/systemd/issues/24668
The systemd PR to fix the issue: https://github.com/systemd/systemd/pull/24673/files

The formal solution should upgrade systemd to a version that contains the fix. But, systemd is a very basic service, upgrading systemd requires heavy test. 

#### How I did it
Copy the correct systemd-udevd.service file in build time 

#### Tested branch (Please provide the tested image version)

- [x] 202211
- [ ] <!-- image version 2 -->

```
SONiC Software Version: SONiC.fix-udev.3-b65c7bdec_Internal
SONiC OS Version: 11
Distribution: Debian 11.7
Kernel: 5.10.0-18-2-amd64
Build commit: b65c7bdec
Build date: Mon Jun 19 10:54:50 UTC 2023
Built by: sw-r2d2-bot@r-build-sonic-ci02-241

Platform: x86_64-mlnx_msn4700-r0
HwSKU: ACS-MSN4700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2022X08597
Model Number: MSN4700-WS2FO
Hardware Revision: A1
Uptime: 08:10:11 up 1 min,  1 user,  load average: 1.81, 0.67, 0.24
Date: Sun 25 Jun 2023 08:10:11

Docker images:
REPOSITORY                    TAG                             IMAGE ID       SIZE
docker-fpm-frr                fix-udev.3-b65c7bdec_Internal   a7b911e7cb6f   346MB
docker-fpm-frr                latest                          a7b911e7cb6f   346MB
docker-platform-monitor       fix-udev.3-b65c7bdec_Internal   94c5178cf80b   731MB
docker-platform-monitor       latest                          94c5178cf80b   731MB
docker-orchagent              fix-udev.3-b65c7bdec_Internal   46b393e0ace8   328MB
docker-orchagent              latest                          46b393e0ace8   328MB
docker-syncd-mlnx             fix-udev.3-b65c7bdec_Internal   1f5c6c23e33a   734MB
docker-syncd-mlnx             latest                          1f5c6c23e33a   734MB
docker-sflow                  fix-udev.3-b65c7bdec_Internal   7e45992c8c59   317MB
docker-sflow                  latest                          7e45992c8c59   317MB
docker-teamd                  fix-udev.3-b65c7bdec_Internal   e4d905592cda   316MB
docker-teamd                  latest                          e4d905592cda   316MB
docker-nat                    fix-udev.3-b65c7bdec_Internal   7fe799367580   319MB
docker-nat                    latest                          7fe799367580   319MB
docker-macsec                 latest                          d702a5554171   318MB
docker-snmp                   fix-udev.3-b65c7bdec_Internal   3bce8fcf71cd   338MB
docker-snmp                   latest                          3bce8fcf71cd   338MB
docker-sonic-telemetry        fix-udev.3-b65c7bdec_Internal   f13949cbc817   597MB
docker-sonic-telemetry        latest                          f13949cbc817   597MB
docker-dhcp-relay             latest                          153d9072805d   306MB
docker-router-advertiser      fix-udev.3-b65c7bdec_Internal   aed642b9a6bc   299MB
docker-router-advertiser      latest                          aed642b9a6bc   299MB
docker-sonic-p4rt             fix-udev.3-b65c7bdec_Internal   a3cae5ca65a7   870MB
docker-sonic-p4rt             latest                          a3cae5ca65a7   870MB
docker-mux                    fix-udev.3-b65c7bdec_Internal   b81f0401b9a8   347MB
docker-mux                    latest                          b81f0401b9a8   347MB
docker-eventd                 fix-udev.3-b65c7bdec_Internal   c5917d0e801f   298MB
docker-eventd                 latest                          c5917d0e801f   298MB
docker-lldp                   fix-udev.3-b65c7bdec_Internal   fd5dc14a7976   341MB
docker-lldp                   latest                          fd5dc14a7976   341MB
docker-database               fix-udev.3-b65c7bdec_Internal   438c2715a1dd   299MB
docker-database               latest                          438c2715a1dd   299MB
docker-sonic-mgmt-framework   fix-udev.3-b65c7bdec_Internal   5c50b115fbcd   414MB
docker-sonic-mgmt-framework   latest  
```
2023-06-25 16:58:14 -07:00
Oleksandr Ivantsiv
475fe27c0b
[dns] Add support for static DNS configuration. (#14549)
- Why I did it
Add support for static DNS configuration. According to sonic-net/SONiC#1262 HLD.

- How I did it
Add a new resolv-config.service that is responsible for transferring configuration from Config DB into /etc/resolv.conf file that is consumed by various subsystems in Linux to resolve domain names into IP addresses.

- How to verify it
Run the image compilation. Each component related to the static DNS feature is covered with the unit tests.
Run sonic-mgmt tests. Static DNS feature will be covered with the system tests.
Install the image and run manual tests.
2023-06-22 19:12:30 +03:00
Vaibhav Hemant Dixit
9649a44470
Revert "Fix for fast/cold-boot: call db_migrator only after old config is loaded (#14933)" (#15464)
This reverts commit 02b17839c3.

Reverts #14933

The earlier commit caused a race condition that particularly broke cross branch warm upgrade.

Issue happens when db_migrator is still migrating the DB and finalizer is checking DB for list of components to reconcile.

If migration is not complete, finalizer get an empty list to wait for. Due to this, finalizer concludes warmboot (deletes system wide warmboot flag) and cause all the services to do cold restart.

ADO: 24274591
2023-06-16 13:58:38 -07:00
Stepan Blyshchak
e2e5b77f16
[mlnx-ffb.sh] Update issu-version location (#14925)
#### Why I did it

ISSU version check fails due to inability to mount squashfs from 202211 on 201911

#### How I did it

Put ISSU version file under platform directory

#### How to verify it

Warm-upgrade matrix:
- 201911 (with https://github.com/sonic-net/sonic-buildimage/pull/14928) to master
- 201911 (with https://github.com/sonic-net/sonic-buildimage/pull/14928) to 202211
- 202012 (with https://github.com/sonic-net/sonic-buildimage/pull/14927) to master
- 202205 (with this change cherry-picked) to master
2023-06-15 15:14:52 -07:00
Saikrishna Arcot
f84dfd2345
Re-add 127.0.0.1/8 when bringing down the interfaces (#15080)
* Re-add 127.0.0.1/8 when bringing down the interfaces

With #5353, 127.0.0.1/16 was added to the lo interface, and then
127.0.0.1/8 was removed. However, when bringing down the lo interface,
like during a config reload, 127.0.0.1/16 gets removed, but 127.0.0.1/8
isn't added back to the interface. This means that there's a period of
time where 127.0.0.1 is not available at all, and services that need to
connect to 127.0.01 (such as for redis DB) will fail.

To fix this, when going down, add 127.0.0.1/8. Add this address before
the existing configuration gets removed, so that 127.0.0.1 is available
at all times.

Note that running `ifdown lo` doesn't actually bring down the loopback
interface; the interface always stays "physically" up.

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
2023-06-13 18:45:39 -07:00
Hua Liu
05f1a5a31e
Add watchdog mechanism to swss service and generate alert when swss have issue. (#15429)
Add watchdog mechanism to swss service and generate alert when swss have issue. 

**Work item tracking**
Microsoft ADO (number only): 16578912

**What I did**
Add orchagent watchdog to monitor and alert orchagent stuck issue.

**Why I did it**
Currently SONiC monit system only monit orchagent process exist or not. If orchagent process stuck and stop processing, current monit can't find and report it.

**How I verified it**
Pass all UT.

Manually test process_monitoring/test_critical_process_monitoring.py can pass.

Add new UT https://github.com/sonic-net/sonic-mgmt/pull/8306 to check watchdog works correctly.

Manually test, after pause orchagent with 'kill -STOP <pid>', check there are warning message exist in log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

**Details if related**
Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737
UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306
2023-06-12 17:53:54 -07:00
Alpesh Patel
633fff8c10
enable ethernet backplane port support in port config for packet mode T2 devices (#14533)
For T2 systems using packet mode, the backplane interfaces (Ethernet-BP#) and the fabric card ethernet interfaces are not visible as neighbor interfaces.
In packet mode, these interfaces needs qos and buffer config as well.
This fix addresses that issue and adds the backplane interfaces to the PORTS_ACTIVE list
2023-06-12 14:02:22 -07:00
mssonicbld
cb9d9e57a6
[ci/build]: Upgrade SONiC package versions (#15431)
Upgrade SONiC Versions
2023-06-12 22:27:29 +08:00
mssonicbld
a45595158b
[ci/build]: Upgrade SONiC package versions (#15345) 2023-06-10 20:38:13 +08:00
Liping Xu
78c41a1e58
allow docker_inram to kernel cmd list (#15374)
Why I did it
After docker_inram is enabled, the docker folder's default max size is 1.5G.
It's not big enough for some tests which need to install additional docker images or install extra packages.

Work item tracking
Microsoft ADO 24199761:
How I did it
add docker_inram into cmdline_allowlist

How to verify it
sudo sh -c 'echo "docker_inram_size=3000M" >> kernel-cmdline-append'
sudo reboot and check the docker folder size
2023-06-10 14:19:44 +08:00
Sudharsan Dhamal Gopalarathnam
162856ad9a
[sflow]Delay starting sflow service until ports are created (#15333)
* [sflow]Delay starting sflow service until ports are created
* Removing sflow from sonic.target dependency since it will be managed by hostcfgd
2023-06-09 16:28:15 -07:00
Ye Jianquan
cec9d7b83a
Revert "Add watchdog mechanism to swss service and generate alert when swss have issue. (#14686)" (#15390)
This reverts commit 44427a2f6b.
Docker image not updated during PR validation and caused PR check failures.
Force merge this revert. After cache is updated after this PR is merged, issue should be fixed.
2023-06-09 09:10:35 +08:00
Yevhen Fastiuk
8a6d45227e
[Clock] Add timezone config YANG model (#14651)
* Add the ability to configure timezone

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>

* Add YANG model for timezone

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>

* Add timezone reference

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>

---------

Signed-off-by: Yevhen Fastiuk <yfastiuk@nvidia.com>
2023-06-07 10:39:24 -07:00
Hua Liu
44427a2f6b
Add watchdog mechanism to swss service and generate alert when swss have issue. (#14686)
This PR depends on https://github.com/sonic-net/sonic-swss/pull/2737 merge first.

**What I did**
Add orchagent watchdog to monitor and alert orchagent stuck issue.

**Why I did it**
Currently SONiC monit system only monit orchagent process exist or not. If orchagent process stuck and stop processing, current monit can't find and report it.

**How I verified it**
Pass all UT.
Add new UT https://github.com/sonic-net/sonic-mgmt/pull/8306 to check watchdog works correctly.
Manually test, after pause orchagent with 'kill -STOP <pid>', check there are warning message exist in log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

**Details if related**
Heartbeat message PR: https://github.com/sonic-net/sonic-swss/pull/2737
UT PR: https://github.com/sonic-net/sonic-mgmt/pull/8306
2023-06-05 22:21:17 -07:00
siqbal1986
381cfe4485
Added VNET_MONITOR_TABLE,BFD_SESSION_TABLE,VNET_ROUTE_TUNNEL_TABLE to the list (#14992)
* The 3 tables in state DB need to be cleaned up after SWSS restart for have consistant state.
2023-06-05 13:18:50 -07:00
mssonicbld
4335690de7 [ci/build]: Upgrade SONiC package versions 2023-06-05 20:51:47 +08:00
Arvindsrinivasan Lakshmi Narasimhan
3f4b959d3f
[chassis] add libffi-dev for sonic-utilities (#15218)
In the PR sonic-net/sonic-utilities#2850 , for support remote access of linecards paramiko package is installed in sonic-utilities. libffi-dev needs to installed to be able to compile for armhf image

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2023-06-03 14:36:50 -07:00
mssonicbld
f80e182c22
[ci/build]: Upgrade SONiC package versions (#15325) 2023-06-03 19:45:07 +08:00
mssonicbld
c044e6e34e
[ci/build]: Upgrade SONiC package versions (#15307) 2023-06-02 21:40:29 +08:00
Vaibhav Hemant Dixit
02b17839c3
Fix for fast/cold-boot: call db_migrator only after old config is loaded (#14933)
Why I did it
Fix the issue where db_migrator is called before DB is loaded w/ config. This leads to db_migrator:

Not finding anything, and resumes to incorrectly migrate every missing config
This is not expected. migration should happen after the old config is loaded and only new schema changes need migration.
Since DB does not have anything when migrator is called, db_migrator fails when some APIs return None.
The reason for incorrect call is that:

database service starts db_migrator as part of startup sequence.
config-setup service loads data from old-config/minigraph. However, since it has Requires=database.service.
Hence, config-setup starts only when database service is started. And database service is started when db_migrator is completed.
Fixed by:

Check if this is first time boot by checking pending_config_migration flag.
If pending_config_migration is enabled, then do not call db_migrator as part of database service startup.
Let database service start which triggers config-setup service to start.
Now call db_migrator after when config-setup service loads old-config/minigraph
2023-05-30 10:16:21 -07:00
vmittal-msft
ecb4db58a9
Update PG headroom settings ports based on port speed/cable length (#14908)
* Update PG headroom settings ports based on port speed/cable length

* Updated XOFF settings to use chip level numbers than core

* Updated PG headroom based on uplink/downlink side

* fix for sonic-config-gen tests

* More fixes for unit test cases

* more test fixes

* Merged multiple functions into one
2023-05-19 08:19:27 -07:00
Pavan-Nokia
c5d0507224
[arm64][Nokia-7215-A1]Add support for Nokia-7215-A1 platform (#13795)
Add new Nokia build target and establish an arm64 build:

    Platform: arm64-nokia_ixs7215_52xb-r0
    HwSKU: Nokia-7215-A1
    ASIC: marvell
    Port Config: 48x1G + 4x10G

How I did it

- Change make files for saiserver and syncd to use Bulleseye kernel
- Change Marvell SAI version to 1.11.0-1
- Add Prestera make files to build kernel, Flattened Device Tree blob and ramdisk for arm64 platforms
- Provide device and platform related files for new platform support (arm64-nokia_ixs7215_52xb-r0).
2023-05-18 14:24:05 -07:00
Samuel Angebault
fa95ebcaae Add optional zram compression for docker_inram
Some devices running SONiC have a small storage device (2G and 4G mainly)
The SONiC image growth over time has made it impossible to install
2 images on a single device.
Some mitigations have been implemented in the past for some devices but
there is a need to do more.

One such mitigation is `docker_inram` which creates a `tmpfs` and
extracts `dockerfs.tar.gz` in it.
This all happens in the SONiC initramfs and by ensuring the installation
process does not extract `dockerfs.tar.gz` on the flash but keep the file as is.

This mitigation does a tradeoff by using more RAM to reduce the disk footprint.
It however creates new issues for devices with 4G of system memory since
the extracted `dockerfs.tar.gz` nears the 1.6G.
Considering debian upgrades (with dual base images) and the continuous
stream of features this is only going to get bigger.

This change introduces an alternative to the `tmpfs` by allowing a system
to extract the `dockerfs.tar.gz` inside a `zram` device thus bringing
compression in play at the detriment of performance.

Introduce 2 new optional kernel parameters to be consumed by SONiC initramfs.
 - `docker_inram_size` which represent the max physical size of the
   `zram` or `tmpfs` volume (defaults to DOCKER_RAMFS_SIZE)
 - `docker_inram_algo` which is the method to use to extract the
   `dockerfs.tar.gz` (defaults to `tmpfs`)
   other values are considered to be compression algorithm for `zram`
   (e.g `zstd`, `zlo-rle`, `lz4`)

Refactored the logic to mount the docker fs in the SONiC initramfs under
the `union-mount` script.
Moved the code into a function to make it cleaner and separated the
inram volume creation and docker extraction.

On Arista platform with a flash smaller or equal to 4GB set
`docker_inram_algo` to `zstd` which produces the best compression ratio
at the detriment of a slower write performance and a similar read
performance to other `zram` compression algorithms.
2023-05-18 14:21:52 -07:00
Samuel Angebault
467994c024 [Arista] Fix boot0 code for docker_inram
Enable docker_inram for all systems with 4GB or less of flash.
This is mandatory to allow these systems to store 2 SONiC images.

This change also fixes the missing docker_inram attribute when
installing a new image from SONiC.
Because the SWI image can ship with additional kernel parameters within
such as `sonic_fips=` this lead to a conflict.
To prevent the conflict, the extra kernel parameters from the SWI are
now stored in the file `kernel-cmdline-append` which isn't used anywhere.
2023-05-18 14:21:52 -07:00
Anish Narsian
05a85b57b8
[arp_update] Resolve neighbors from config_db (#15006)
* To resolve NEIGH table entries present in CONFIG_DB. Without this change arp/ndp entries which we wish to resolve, and configured via CONFIG_DB are not resolved.
2023-05-17 10:42:03 -07:00
mssonicbld
3d1ae46f90 [ci/build]: Upgrade SONiC package versions 2023-05-15 18:32:43 +08:00
mssonicbld
31223fb9fe
[ci/build]: Upgrade SONiC package versions (#15057) 2023-05-13 18:30:20 +08:00