Commit Graph

7891 Commits

Author SHA1 Message Date
Stephen Sun
b5e8c16134
[Mellanox] Enhance FW upgrade mechanism (#16090)
### Why I did it

1. Enhance the diagnosis information collecting mechanism
   - If the option `-v` is fed, it will pass additional diagnosis flags to mlxfwmanager
   - Collect all the output from mlxfwmanager and print them to syslog if it fails
2. Abort syncd in case waiting for device or upgrading firmware fails

Signed-off-by: Stephen Sun <stephens@nvidia.com>

### How I did it

#### How to verify it

Regression and manual test
2023-09-04 11:28:53 -07:00
Vadym Hlushko
78587cedc3
[Mellanox] Remove mlxtrace support for SPC4 (#16373)
- Why I did it
Because the Spectrum4 devices don't support mlxtrace utility.

- How I did it
Edit sai.profile and remove mlxtrace_spectrum4_itrace_*.cfg.ext files

Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
2023-09-04 10:53:20 +03:00
mssonicbld
c787d51f29
[submodule] Update submodule sonic-linux-kernel to the latest HEAD automatically (#16391)
#### Why I did it
src/sonic-linux-kernel
```
* 7ee50c9 - (HEAD -> master, origin/master, origin/HEAD) [Mellanox] Upstream kernel patches with HW-MGMT 7.0030.1011 (#327) (29 hours ago) [Kebo Liu]
```
#### How I did it
#### How to verify it
#### Description for the changelog
2023-09-03 18:33:09 +08:00
mssonicbld
ccfef69ac4
[submodule] Update submodule sonic-platform-daemons to the latest HEAD automatically (#16392)
#### Why I did it
src/sonic-platform-daemons
```
* c1c43f6 - (HEAD -> master, origin/master, origin/HEAD) [pmon][chassis][voq] Chassis DB cleanup when module is down (#394) (2 days ago) [vganesan-nokia]
```
#### How I did it
#### How to verify it
#### Description for the changelog
2023-09-03 18:33:05 +08:00
Yoush
559151b41e
[centec]: update sonic master centec-sai reference to v1.12.0-1 (#16238)
Signed-off-by: yoush <yoush@centec.com>
2023-09-01 23:22:00 -07:00
Vadym Hlushko
9e3fdded69
[Mellanox][SFP] Remove unused function parameter (#16318)
Why I did it
To avoid errors when the sfputil show error-status -hw is called from the host OS (not from the pmon docker).

How I did it
Remove the self.sdk_handle parameter from the _get_module_info() function.

How to verify it
Execute the sfputil show error-status -hw

Signed-off-by: vadymhlushko-mlnx <vadymh@nvidia.com>
2023-09-01 23:06:04 -07:00
Mai Bui
ff5f46955c
[database] make Redis process runs as non-root user (#16326)
Why I did it
Running the Redis server as the "root" user is not recommended. It is suggested that the server should be operated by a non-privileged user.

Work item tracking
Microsoft ADO (number only): 15895240

How I did it
Ensure the Redis process is operating under the 'redis' user in supervisord and make redis user own REDIS_DIR inside db container.

How to verify it
Built a new image and verified that the redis process is running as the 'redis' user and that all containers are up.

Signed-off-by: Mai Bui <maibui@microsoft.com>
2023-09-01 23:03:15 -07:00
Zain Budhwani
84cfc3bc69
[eventd]: Remove unnecessary log (#16166)
Work item tracking
Microsoft ADO (number only): 16789053
2023-09-01 23:01:46 -07:00
Riff
7c1d720a65
[sonic-mgmt]: Adding sshconf 0.2.5 into sonic-mgmt container. (#16344)
Why I did it
This change helps us run SSH config generation for our testbed in the mgmt container.

Original PR in sonic-mgmt repo can be found here: sonic-net/sonic-mgmt#9773.

Work item tracking
Microsoft ADO (number only): 25007799

How I did it
Updating sonic-mgmt docker file to add sshconf 0.2.5 into pip install under venv.
2023-09-01 22:58:27 -07:00
Andrew Sapronov
0405b369af
[Netberg][Barefoot] Added support for Aurora 750 (#16342)
Why I did it
Support the Intel Tofino based Netberg Aurora 750 platform
ASIC: Intel Tofino BFN-T10-064Q
Ports: 64x 100G

How I did it
Added specification to device/netberg directory
Added platform/barefoot/sonic-platform-modules-netberg, which contains kernel modules, scripts and sonic_platform packages.
Modified the platform/barefoot/platform-modules-netberg.mk to include Aurora 750 related ID.

Signed-off-by: Andrew Sapronov <andrew.sapronov@gmail.com>
2023-09-01 22:52:39 -07:00
Guohan Lu
3bdfdd95ea
Revert "[Ragile]: Add new centec platform ra-b6010 (#14819)"
This reverts commit 75062436e8.
2023-09-01 22:43:18 -07:00
anamehra
f6897bb585
chassis-packet: Update arp_update script for FAILED and STALE check (#16311)
1. Fixing an issue with FAILED entry resolution retry.
Neighbor entries in the arp table may sometimes enter a FAILED state when the far end is down and report the state as follows:
2603:10e2:400:3::1 dev PortChannel19 router FAILED
While the arp_update script handles the entries for FAILED in the following format, the above was not handled due to the token location (extra router keyword at index 4):
2603:10e2:400:3::1 dev PortChannel19 FAILED

The former format may appear if an arp resolution is tried on a link that is known but the far end goes down, e.g., pinging a STALE entry while the far end is down.

2. Refreshing STALE entries to make sure the far end is reachable.
STALE entries for some backend ports may appear in chassis-packet when no traffic is received on the port for a while. When the far end goes down, BFD is expected to stop sending packets on the session whose far end is unreachable. But because the entry is still known as STALE, BFD keeps sending packets on the Cisco chassis. Refreshing the STALE entries keeps active links REACHABLE in the neighbor table, while entries whose far end is down enter the FAILED state. FAILED entries are retried and become REACHABLE when the far end comes back up.
2023-09-01 11:41:46 -07:00
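The token-position problem described in #16311 can be illustrated with a short Python sketch. The real arp_update is a shell script, so this is not its implementation: the helper name `failed_neighbors` and the approach of reading the state from the last token (instead of a fixed index) are assumptions made for illustration. The third sample line, including its MAC address, is made-up data.

```python
def failed_neighbors(neigh_output):
    """Return (address, device) pairs for neighbor entries in FAILED state.

    Hypothetical helper: the state is read from the last token, not from a
    fixed index, so the optional 'router' keyword is tolerated.
    """
    failed = []
    for line in neigh_output.splitlines():
        tokens = line.split()
        # Expect at least: <address> dev <device> ... <STATE>
        if len(tokens) < 4 or tokens[1] != "dev":
            continue
        if tokens[-1] == "FAILED":
            failed.append((tokens[0], tokens[2]))
    return failed


sample = """\
2603:10e2:400:3::1 dev PortChannel19 router FAILED
2603:10e2:400:3::1 dev PortChannel19 FAILED
2603:10e2:400:3::2 dev PortChannel20 lladdr 00:11:22:33:44:55 STALE
"""
print(failed_neighbors(sample))
```

Both FAILED formats from the commit message are picked up, while the STALE line is ignored.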
abdosi
566b5dfa1f
Assign the higher metric value for Ipv6 default route learnt via RA message (#16367)
* Fix the Loopback0 IPv6 address of LCs in a chassis not being reachable from peer devices
* Assign a higher metric value to the IPv6 default route learnt via RA message so that the BGP-learnt default route has higher priority.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2023-09-01 11:38:14 -07:00
Sudharsan Dhamal Gopalarathnam
238a50ff13
[P4RT]Disabling p4rt by default to overcome build issues (#16343)
To fix #16015

P4RT is causing instability in build due to regular failures. Disabling P4RT by default
2023-09-01 11:07:50 -07:00
Marty Y. Lok
de7fb325ae
[Nokia-IXR7250E] Modify the platform_ndk.json for Nokia-IXR7250E platform (#16355)
Signed-off-by: mlok <marty.lok@nokia.com>
2023-09-01 08:54:40 -07:00
mssonicbld
f78d25b11e
[ci/build]: Upgrade SONiC package versions 2023-09-01 16:32:44 +08:00
mssonicbld
162edc5c73
[submodule] Update submodule sonic-snmpagent to the latest HEAD automatically (#16368) 2023-09-01 15:03:02 +08:00
vganesan-nokia
5fded5c51b
[chassis] Chassis DB cleanup when asic comes up (#16213)
* [chassis]Chassis DB cleanup when asic comes up

Cleanup the entries from the following tables in chassis app db in
redis_chassis server in the supervisor
(1) SYSTEM_NEIGH
(2) SYSTEM_INTERFACE
(3) SYSTEM_LAG_MEMBER_TABLE
(4) SYSTEM_LAG_TABLE
As part of the clean up only those entries created by the asic that
is coming up are deleted. The LAG IDs used by the asics are also
de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET

- Added check to run the chassis db clean up only for voq switches.

Signed-off-by: vedganes <veda.ganesan@nokia.com>
2023-08-31 23:38:56 -07:00
lixiaoyuner
410e6ff406
Install pyOpenSSL package for k8s master (#16361)
### Why I did it
Need a tool to check the details of a certificate.
##### Work item tracking
- Microsoft ADO **(number only)**: 25020260
#### How I did it
Install pyOpenSSL package for k8s master
#### How to verify it
Run `pip3 list` to check that it is installed when include_kubernetes_master=y
2023-08-31 22:26:24 -07:00
Senthil Kumar Guruswamy
34e5d266e5
Handle service start-limit-hit failure event case in sysmonitor (#16174) 2023-08-31 12:07:42 -07:00
Senthil Kumar Guruswamy
fdd5deb453
Fix for issue#14871 (#15433)
Include valid input check for system status in test along with db update
check
2023-08-31 12:04:48 -07:00
Alpesh Patel
cabdac17a5
qos template change for backend compute-ai deployment (#16150)
#### Why I did it

To enable qos config for a certain backend deployment mode, for resource-type "Compute-AI".
This deployment has the following requirement:

- Config below enabled if DEVICE_TYPE as one of backend_device_types
- Config below enabled if ResourceType is 'Compute-AI'
- 2 lossless TCs (2, 3)
- 2 lossy TCs (0, 1)
- DSCP to TC map uses 4 DSCP code points and maps to the TCs as follows:
    "DSCP_TO_TC_MAP": {
        "AZURE": {
            "48" : "0",
            "46" : "1",
            "3"  : "3",
            "4"  : "4"
        }
    }

- WRED profile has green {min/max/mark%} as {2M/10M/5%}

This required a template change (as in the PR) in addition to the vendor qos.json.j2 file (not included here).

### How I did it

#### How to verify it
- with the above change and the vendor config change, generated the qos.json file and verified that the objective stated in "Why I did it" was met

- verified no error

### Description for the changelog
Update qos_config.j2 for Compute-AI deployment on one of the backend device type roles
2023-08-31 11:30:20 -07:00
Vadym Hlushko
43340cd58d
[memory_checker] Add a specific log message in a case when the docker service is not running. (#16018)
#### Why I did it
To fix the logic introduced by [[memory_checker] Do not check memory usage of containers which are not created #11129](https://github.com/sonic-net/sonic-buildimage/pull/11129).
There could be a scenario before the reboot, where
1. The `docker service` has stopped
2. In a very short period of time, the monit service performs the `root@sonic:/home/admin# monit status container_memory_telemetry`

In such a scenario, the `memory_checker` script will throw an error to the syslog:
```
ERR memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))'
```
However, this is correct behavior: when the docker service is stopped, the Unix socket is destroyed, which is why the `FileNotFoundError(2, 'No such file or directory')` exception appears in the syslog.

#### How I did it
Changed the log severity to warning and changed the return value.

#### How to verify it
It is really hard to catch the exact moment described in the `Why I did it` section.
In order to check the logic:
1. Change the Unix socket path to non-existing in [/usr/bin/memory_checker](47742dfc2c/files/image_config/monit/memory_checker (L139)) file on the switch.
2. Execute the `root@sonic:/home/admin# monit restart container_memory_telemetry`
3. Check the syslog for such messages:
```
WARNING memory_checker: Failed to retrieve the running container list from docker daemon! Error message is: 'Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))'

INFO memory_checker: [memory_checker] Exits without checking memory usage since container 'telemetry' is not running!
```
2023-08-31 11:28:20 -07:00
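The idea behind #16018 (a missing Docker socket is an expected condition shortly before reboot, not an error) can be sketched in Python as follows. This is not the actual `memory_checker` code: the helper name `docker_daemon_reachable`, the logger name, the log text and the boolean return value are all assumptions. `/var/run/docker.sock` is Docker's default Unix socket path.

```python
import logging
import os

logger = logging.getLogger("memory_checker_sketch")


def docker_daemon_reachable(socket_path="/var/run/docker.sock"):
    """Return True if the Docker Unix socket exists, otherwise log a warning.

    Hypothetical helper: a missing socket is treated as 'docker service is
    stopped', which is expected shortly before a reboot, so it is logged at
    WARNING rather than ERROR severity.
    """
    if os.path.exists(socket_path):
        return True
    logger.warning(
        "Docker socket %s not found; docker service is probably not running",
        socket_path,
    )
    return False


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # A deliberately non-existent path mimics verification step 1 above.
    if not docker_daemon_reachable("/nonexistent/docker.sock"):
        print("skip memory check")
```

Pointing the helper at a non-existent path mirrors the verification procedure in the commit message, where the socket path is changed to a non-existing one.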
Arvindsrinivasan Lakshmi Narasimhan
3237b2cfc8
[chassis][voq] Fix to ignore duplicate nexthop in zebra (#16275)
Why I did it
Fixes #15803

In a SONiC chassis, routes have recursive nexthop resolution when they are learnt from a remote linecard.
In some cases, after recursive nexthop resolution, the number of nexthops for a route could reach 256.
Zebra ran out of space when filling up 256 nexthops, which caused zebra to crash.

Work item tracking
Microsoft ADO (24997365):

How I did it
Create a patch to port FRRouting/frr#14096, which ignores duplicate nexthops when filling up the fpm message

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
2023-08-31 11:06:33 -07:00
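The actual fix for #16275 is a C patch ported from FRRouting/frr#14096. The Python sketch below only illustrates the general idea of skipping duplicate nexthops before a message is filled. The function name `unique_nexthops` and the representation of a nexthop as a `(gateway, interface)` tuple are assumptions, and the addresses are made-up sample data.

```python
def unique_nexthops(nexthops):
    """Drop duplicate nexthops while preserving first-seen order."""
    seen = set()
    result = []
    for nh in nexthops:
        if nh in seen:
            # Already emitted once: skip instead of using another slot.
            continue
        seen.add(nh)
        result.append(nh)
    return result


# Made-up (gateway, interface) pairs standing in for resolved nexthops.
resolved = [
    ("10.0.0.1", "Ethernet0"),
    ("10.0.0.5", "Ethernet4"),
    ("10.0.0.1", "Ethernet0"),
]
print(unique_nexthops(resolved))
```

After recursive resolution, several paths can resolve to the same nexthop, so removing duplicates reduces the number of entries that must fit in the message.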
Junchao-Mellanox
0be57803e2
[Mellanox] Revise label name and fix typo in sensor.conf of 4600C (#16271)
- Why I did it
Revise label name and fix typo in sensor.conf of 4600C

- How I did it
Revise label name and fix typo in sensor.conf of 4600C

- How to verify it
Manual test
sonic-mgmt test_sensors.py
2023-08-31 19:41:12 +03:00
Yaqiang Zhu
110dc1e247
[yang][dhcp_server] Add dhcp_server_ipv4 yang model (#16327)
Why I did it
#15955 imported sonic-vlan in the yang model, which would cause a YANG backlink issue, so #15955 was reverted by #16322.
This PR is a re-submission of #15955 without importing sonic-vlan.
Add yang model for IPv4 DHCP Server.

How I did it
Add yang model for IPv4 DHCP Server.
Add four new tables: DHCP_SERVER_IPV4, DHCP_SERVER_IPV4_CUSTOMIZED_OPTIONS, DHCP_SERVER_IPV4_RANGE, DHCP_SERVER_IPV4_PORT.
Add related unit test.

HLD: https://github.com/yaqiangz/SONiC/blob/master_dhcp_server_hld/doc/dhcp_server/port_based_dhcp_server_high_level_design.md#rev-01

How to verify it
Build sonic_yang_models packages.
2023-08-31 08:52:36 -07:00
Xichen96
a5e180552f
add processor.max_cstate=0 to intel cpu cmdline (#16339)
Why I did it
This is a fix for PR #6051 in sonic-net/sonic-buildimage ([kernel] Change grub cmdline to set c-states to 0 for "Intel" CPUs, by shlomibitton).

The original PR disables the intel idle driver, but it cannot limit the max c-state to 1 because the system falls back to the acpi idle driver.

Currently intel_idle.max_cstate=0 is already present, which disables the intel idle driver. With the added option, the common idle driver is disabled as well, so there is no idle management. This prevents a bug that can be triggered by the idle instruction on Intel platforms.

How I did it
Add the option to installer file beside intel_idle.max_cstate=0
2023-08-31 08:47:46 -07:00
pettershao-ragilenetworks
75062436e8
[Ragile]: Add new centec platform ra-b6010 (#14819)
What I did
Add new platform arm64-ragile_ra-b6010-48gt4x-r0 (Centec)
ASIC Vendor: Centec
Switch ASIC: Centec
Port Config: 48x1G+4x10G

Why I did it
Add new platform RA-B6010-48GT4X

How I did it
Add new platform RA-B6010-48GT4X

Signed-off-by: pettershao-ragilenetworks <pettershao@ragilenetworks.com>
2023-08-31 08:38:24 -07:00
mssonicbld
2a48406f57
[submodule] Update submodule sonic-linux-kernel to the latest HEAD automatically (#16352)
#### Why I did it
src/sonic-linux-kernel
```
* 1800d11 - (HEAD -> master, origin/master, origin/HEAD) AMD-Pensando ELBA SOC support (#322) (23 hours ago) [Ashwin Hiranniah]
```
#### How I did it
#### How to verify it
#### Description for the changelog
2023-08-31 18:33:11 +08:00
Liping Xu
6d4ccbd310
update DOCKER_RAMFS_SIZE (#16305)
Why I did it
The docker folder size on the 202305 image is more than 1.5G, which is larger than the max docker ramfs size.

Work item tracking
Microsoft ADO (number only):
24969589
How I did it
Update the docker ramfs size from 1500M to 2500M

How to verify it
Boot 202305 image.
2023-08-31 16:49:03 +08:00
mssonicbld
5a51200350
[submodule] Update submodule sonic-snmpagent to the latest HEAD automatically (#16353)
#### Why I did it
src/sonic-snmpagent
```
* af2d5a4 - (HEAD -> master, origin/master, origin/HEAD) Fix FdbUpdater crash when SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID attribute missing. (#286) (19 hours ago) [Hua Liu]
```
#### How I did it
#### How to verify it
#### Description for the changelog
2023-08-31 16:33:02 +08:00
Zhijian Li
5e586a5a37
Fix openconfig_acl.py (#16303)
How I did it
Fix the regex for L4 port range in openconfig_acl.py.

How to verify it
Built the image and installed it on an Arista-720DT DUT, then tried the repro steps in #16189 and confirmed that the ACL rule was set up correctly.
2023-08-30 10:46:21 -07:00
vmittal-msft
9a15221e46
Update CPU transmitted packets to queue 7 for chassis (#16254)
* Update CPU transmitted packets to TC = 7 for SONIC chassis

* Added new SOC property to permitted list
2023-08-29 18:33:16 -07:00
abdosi
b6edc374ba
[build]: Added flag in sonic_version.yml to see if image is secured or non-secured (#16191)
What I did:

Added a flag in sonic_version.yml to show whether the compiled image is secured or non-secured. This is done using the build/compile time environment variable SECURE_UPGRADE_MODE, as defined in the HLD: https://github.com/sonic-net/SONiC/blob/master/doc/secure_boot/hld_secure_boot.md

This flag does not provide the runtime status of whether the image has booted securely. A compile-time signed image (secured image) can still boot on a non-secure platform.

Why I did:
Flag can be used for manual check or by the test case.

ADO: 24319390

How I verify:
Manual Verification

---
build_version: 'master-16191.346262-cdc5e72a3'
debian_version: '11.7'
kernel_version: '5.10.0-18-2-amd64'
asic_type: broadcom
asic_subtype: 'broadcom'
commit_id: 'cdc5e72a3'
branch: 'master-16191'
release: 'none'
build_date: Fri Aug 25 03:15:45 UTC 2023
build_number: 346262
built_by: AzDevOps@vmss-soni001UR5
libswsscommon: 1.0.0
sonic_utilities: 1.2
sonic_os_version: 11
secure_boot_image: 'no'

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
2023-08-29 13:41:01 -07:00
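A manual check or test case for the `secure_boot_image` flag from #16191 could look roughly like the Python sketch below. The helper name `read_secure_boot_flag` is hypothetical, and the plain string parsing is an assumption chosen to keep the example dependency-free. Real code would more likely use a YAML parser. The sample text is an excerpt of the output shown in the commit message.

```python
def read_secure_boot_flag(version_text):
    """Return the secure_boot_image value from sonic_version.yml text, or None."""
    for line in version_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "secure_boot_image":
            # Drop surrounding whitespace and quotes around the value.
            return value.strip().strip("'\"")
    return None


sample = """\
---
build_version: 'master-16191.346262-cdc5e72a3'
asic_type: broadcom
secure_boot_image: 'no'
"""
print(read_secure_boot_flag(sample))
```

As the commit message notes, the value reflects only the build-time setting, not whether the running system actually booted securely.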
jingwenxie
f39adda55e
Revert "[yang][dhcp_server] Add dhcp_server_ipv4 yang model (#15955)" (#16322)
This reverts commit 44d52dbb8b.
2023-08-29 13:26:59 -07:00
Nonodark Huang
42cf153624
[ufispace][pddf] Add the pddf package dependency to the ufispace platform modules mk file. (#16302)
Signed-off-by: nonodark <ef67891@yahoo.com.tw>
2023-08-29 09:45:28 -07:00
Liu Shilong
459ba257a4
[ci] Add job to cleanup nfs in armhf and arm64 agents. (#16270)
Why I did it
Clean up old cached files on the nfs disk for armhf/arm64

Work item tracking
Microsoft ADO (number only): 24930879
2023-08-29 19:14:45 +08:00
guangyao6
80ce957d20
Add no-export to sentinel community-list (#16285)
Why I did it
Add no-export to the bgp sentinel community-list, so that bgp updates from the bgp sentinel service must match both the sentinel community and no-export. Otherwise, the bgp update will be dropped.

Work item tracking
Microsoft ADO (24946274):
How I did it
Add no-export to bgp sentinel community-list.

How to verify it
Run UT and check that the cases pass. Build the image and start the device. Add bgp sentinel and check that the no-export community exists in the bgp sentinel community list.
2023-08-29 09:12:19 +08:00
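The matching rule described in #16285 (an update must carry both the sentinel community and no-export) can be expressed as a small Python sketch. In SONiC the matching is done by the generated BGP configuration, not by Python code. The function name `accept_sentinel_update` and the community value `12345:12346` are made up for illustration.

```python
NO_EXPORT = "no-export"


def accept_sentinel_update(communities, sentinel_community):
    """Accept an update only if it carries both the sentinel community and no-export."""
    carried = set(communities)
    return sentinel_community in carried and NO_EXPORT in carried


# "12345:12346" is a made-up community value used only for this example.
print(accept_sentinel_update(["12345:12346", "no-export"], "12345:12346"))
print(accept_sentinel_update(["12345:12346"], "12345:12346"))
```

The first call models an update that would be accepted, and the second models one that would be dropped because no-export is missing.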
Yakiv Huryk
d0a40afcad
[build] add SKIP_BUILD_HOOK support for curl (#15923)
#### Why I did it
To support SKIP_BUILD_HOOK for curl command so the targets downloaded by curl (SONIC_ONLINE_DEBS, SONIC_ONLINE_FILES) can utilize it.

##### Work item tracking
- Microsoft ADO **(number only)**:

#### How I did it
Add logic to invoke the real command instead of `download_packages()` (the same way it's done for wget)

#### How to verify it
Add an online target (with URL attribute).
Add the "SKIP_VERSION=y" to this target.
Check that download_packages is not invoked.
2023-08-28 13:25:06 -07:00
Yaqiang Zhu
44d52dbb8b
[yang][dhcp_server] Add dhcp_server_ipv4 yang model (#15955)
Add yang model for IPv4 DHCP Server.
Add four new tables: DHCP_SERVER_IPV4, DHCP_SERVER_IPV4_CUSTOMIZED_OPTIONS, DHCP_SERVER_IPV4_RANGE, DHCP_SERVER_IPV4_PORT
2023-08-28 08:43:28 -07:00
Yaqiang Zhu
4da72b9eca
[yang] Add Bmc to Device Neighbor Metadata element type list (#16188)
Bmc is a valid neighbor type in minigraph, but it was missing from the YANG model definition. Usually, a Bmc type device can be a neighbor of BmcMgmtToRRouter. This PR introduces this type.
2023-08-28 08:42:27 -07:00
Zhijian Li
1d1489b2c7
[minigraph-parser] Update the definition of acl table type BMCDATA and BMCDATAV6 (#16249)
Why I did it
According to ACL-Table-Type-HLD, the value type of MATCHES, ACTIONS and BIND_POINTS should be list instead of string. Opening this PR to update the definition of BMCDATA and BMCDATAV6.

How I did it
Update the definition of BMCDATA and BMCDATAV6 in minigraph-parser.

How to verify it
Verified by UT and build SONiC image.
2023-08-28 08:40:55 -07:00
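The type change described in #16249 (MATCHES, ACTIONS and BIND_POINTS as lists instead of strings) can be illustrated with the Python sketch below. It is not the minigraph-parser code: the helper name `normalize_table_type`, the assumption that the old string form was comma-separated, and the sample field values are all illustrative.

```python
LIST_FIELDS = ("MATCHES", "ACTIONS", "BIND_POINTS")


def normalize_table_type(entry):
    """Return a copy of entry in which the list-valued fields are lists."""
    fixed = dict(entry)
    for field in LIST_FIELDS:
        value = fixed.get(field)
        if isinstance(value, str):
            # Assumption: the string form is comma-separated.
            fixed[field] = [item.strip() for item in value.split(",") if item.strip()]
    return fixed


# Sample values chosen for illustration only.
sample = {"MATCHES": "SRC_IP,DST_IP", "ACTIONS": "PACKET_ACTION", "BIND_POINTS": ["PORT"]}
print(normalize_table_type(sample))
```

Fields that are already lists are left untouched, so the helper is safe to apply to both old and new style definitions.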
Zhijian Li
83dca59efc
[YANG SONIC-ACL] Fix Yang definition of ACL_TABLE_TYPE (#16247)
How I did it
Update Yang definition of ACL_TABLE_TYPE.
Update existing testcase.
Add new testcase to cover lowercase key scenario.

How to verify it
Verified by building sonic_yang_models-1.0-py3-none-any.whl. While building the target package, unit tests were run and passed.
2023-08-28 08:40:01 -07:00
Rajkumar-Marvell
2c9c96c0d8
[SFLOW] Fixed SFLOW DROPMON patch to align with 2.0.45 version (#15948)
- Why I did it
Fixed build failure when the flag ENABLE_SFLOW_DROPMON=y is set

- How I did it
Fixed sflow dropmon patch to align with hsflowd version 2.0.45

Signed-off-by: rajkumar38 <rpennadamram@marvell.com>
2023-08-28 18:36:46 +03:00
Stephen Sun
0446d7654f
Add yang model for scheduler in PORT_QOS_MAP (#16244)
Signed-off-by: Stephen Sun <stephens@nvidia.com>
2023-08-28 15:05:11 +03:00
Stephen Sun
be8843b166
Fix issue: unprintable character is rendered when handling comments in j2 (#16287)
Use "{#-" and "-#}" to mark comments in jinja template

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2023-08-28 15:03:39 +03:00
Nazarii Hnydyn
65b0011866
[Mellanox] [PPI] Enable global port late create for SN5600 (#15866)
- Why I did it
Enabled port late create on the SN5600 Spectrum-4 switch, so that it boots up with no ports

Work item tracking
N/A

- How I did it
Updated SAI xml config file

- How to verify it
Run sonic-mgmt tests of fastboot

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2023-08-28 14:50:53 +03:00
mssonicbld
55849d0c6b
[ci/build]: Upgrade SONiC package versions (#16300) 2023-08-28 18:31:51 +08:00
mssonicbld
c8465c0d9a
[ci/build]: Upgrade SONiC package versions (#16294) 2023-08-26 18:45:45 +08:00
mssonicbld
36b21157d6
[submodule] Update submodule sonic-gnmi to the latest HEAD automatically (#16282)
#### Why I did it
src/sonic-gnmi
```
* 7a1b7cd - (HEAD -> master, origin/master, origin/HEAD) Improve full path logic (#146) (37 minutes ago) [ganglv]
```
#### How I did it
#### How to verify it
#### Description for the changelog
2023-08-25 16:32:36 +08:00