chassis-packet: Update arp_update script for FAILED and STALE check (#16311)
1. Fixing an issue with FAILED entry resolution retry.
Neighbor entries in arp table may sometimes enter a FAILED state when the far end is down and reports the state as follows:
2603:10e2:400:3::1 dev PortChannel19 router FAILED
While the arp_update script handles the entries for FAILED in the following format, the above was not handled due to the token location (extra router keyword at index 4):
2603:10e2:400:3::1 dev PortChannel19 FAILED
The former format may appear if an arp resolution is tried on a link that is known but the far end goes down, e.g., pinging a STALE entry while the far end is down.
2. Refreshing STALE entries to make sure the far end is reachable.
STALE entries for some backend ports may appear in chassis-packet when no traffic is received for a while on the port. When the far end goes down, it is expected for BFD to stop sending packets on the session for which the far end is not reachable. But as the entry is known as stale, on the Cisco chassis, BFD keeps sending packets. Refreshing the stale entry will keep active links as reachable in the neighbor table while the entries for the far end down will enter a failed state. FAILED state entries will be retired and entered reachable when far end comes back up.
* [minigraph] remove number of lanes check for changing speed from 400G to 100G and set speed setting before lane reconfiguration (#15721)
8111 800G interface, split to 2x400G (each has 4 lanes) fails to change interface speed from 400G to 100G during deploy mg. In minigraph.xml, the interface speed configuration is good, but fails to generate the right value to config_db.json.
In order to support this SKU the speed transitioning should support both 4 lanes and 8 lanes in the port_config.ini.
Why I did it
before this change for a 400G to 100G transition, in all cases except when lanes are 8, we would continue and the line
ports.setdefault(port_name, {})['speed'] = port_speed_png[port_name]
would not be executed, hence the default speed will never be set for a case and config_db will not be updated,
where speed is transitioning from 400G to 100G or 40G, but lanes are not equal to 8.
In order for those cases to pass where lanes are not specifically 8, we need the change
Work item tracking
24242657
Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
* fix UT
Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
---------
Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
Bmc is a valid neighbor type in minigraph, however it was missing from the YANG model definition. Usually, the Bmc type device can be neighbor of BmcMgmtToRRouter. This PR is to introduce this type.
Co-authored-by: Yaqiang Zhu <zyq1512099831@gmail.com>
How I did it
Update Yang definition of ACL_TABLE_TYPE.
Update existing testcase.
Add new testcase to cover lowercase key scenario.
How to verify it
Verified by building sonic_yang_models-1.0-py3-none-any.whl. While building the target package, unit tests were run and passed.
Co-authored-by: Zhijian Li <zhijianli@microsoft.com>
Why I did it
According to ACL-Table-Type-HLD, the value type of MATCHES, ACTIONS and BIND_POINTS should be list instead of string. Opening this PR to update the definition of BMCDATA and BMCDATAV6.
How I did it
Update the definition of BMCDATA and BMCDATAV6 in minigraph-parser.
How to verify it
Verified by UT and build SONiC image.
Co-authored-by: Zhijian Li <zhijianli@microsoft.com>
- Why I did it
Revise lable name and fix typo in sensor.conf of 4600C
- How I did it
Revise lable name and fix typo in sensor.conf of 4600C
- How to verify it
Manual test
sonic-mgmt test_sensors.py
Co-authored-by: Junchao-Mellanox <57339448+Junchao-Mellanox@users.noreply.github.com>
On S6100 we are seeing almost 100K interrupts per second on intels i801 SMBUS controller which affects systems performance.
We now disable the i801 driver interrupt and instead enable polling
Microsoft ADO (number only): 24910530
How I did it
Disable the interrupt by passing the interrupt disable feature argument to i2c-i801 driver
How to verify it
This fix is NOT applicable for ARM based platforms. Applicable only for intel based platforms:-
- On SN2700 its already disabled in Mellanox hw-mgmt
- Celestica DX010 and E1031
- Dell S6100 verified the interrupts are no longer incrementing.
- Arista 7260CX3
Signed-off-by: Prince George <prgeor@microsoft.com>
Co-authored-by: Prince George <45705344+prgeor@users.noreply.github.com>
- Why I did it
Commit sonic-net/sonic-platform-daemons@153ea47 changed SfpStateUpdateTask from Process to Thread. In this commit, it raises an exception in SfpStateUpdateTask to make shutdown flow fast. But it does not work on Nvidia platform as Nvidia platform is passing timeout parameter of get_change_event to select. Linux select function can not be interrupted by a Python exception. There is no such issue on Nvidia platform before that commit. However, in order to comply with the commit and make shutdown flow fast, we decided to change Nvidia platform API implementation.
To fix issue #13591.
- How I did it
The select call in get_change_event should use no more than 1 second as timeout parameter.
Outside the select call, add a while loop to make sure timeout parameter of get_change_event work as expected
- How to verify it
Manual test
Co-authored-by: Junchao-Mellanox <57339448+Junchao-Mellanox@users.noreply.github.com>
* Fix the Loopback0 IPv6 address of LC's in chassis not reachable from peer device's
* Assign the metric vaule for Ipv6 default route learnt via RA message to higher value so that BGP learnt default route is higher priority.
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
Co-authored-by: abdosi <58047199+abdosi@users.noreply.github.com>
How I did it
Fix the regex for L4 port range in openconfig_acl.py.
How to verify it
Build image and install on Arista-720DT DUT, then try the repro steps in #16189 and confirmed the ACL rule be setup correctly:
Co-authored-by: Zhijian Li <zhijianli@microsoft.com>
Why I did it
Common Release Notes for 8102-64H, T0/DualTor, and 8101-32FH
Fix for an issue where drop counters were incrementing twice for packets with invalid tag
Fix for the ECC errors reported in SR 695600099
Fix for fwutil show updates failure
How I did it
Update platform version to 202205.2.2.11
There is a redundant line in init_cfg.json.j2. It would cause pmon service always has "has_timer=False". However, we know that PMON has a timer now. So, I try to fix it here.
* [chassis]Chassis DB cleanup when asic comes up
Cleanup the entries from the following tables in chassis app db in
redis_chassis server in the supervisor
(1) SYSTEM_NEIGH
(2) SYSTEM_INTERFACE
(3) SYSTEM_LAG_MEMBER_TABLE
(4) SYSTEM_LAG_TABLE
As part of the clean up only those entries created by the asic that
is coming up are deleted. The LAG IDs used by the asics are also
de-allocated from SYSTEM_LAG_ID_TABLE and SYSTEM_LAG_ID_SET
- Added check to run the chassis db clean up only for voq switches.
Signed-off-by: vedganes <veda.ganesan@nokia.com>
Co-authored-by: vganesan-nokia <67648637+vganesan-nokia@users.noreply.github.com>
Why I did it
Update the platform_reboot of Nokia Platform IXR-7250E-36x400G to displays the correct reboot-cause history when reboot from supervisor card.
Work item tracking
Microsoft ADO (number only):
How I did it
Modify the platform_reboot script to copy the correct reboo-cause.txt file from NDK to the /host/reboot-cause directory at the down cycle when the reboot is issued from Supervisor (for both reboot right after install a new image and normal reboot)
Signed-off-by: mlok <marty.lok@nokia.com>
Co-authored-by: Marty Y. Lok <76118573+mlok-nokia@users.noreply.github.com>
This is a fix for PR #6051
The original PR will disable intel idle driver but it cannot limit the max c-state to 1 due to system will fall back to acpi idle driver.
Currently intel_idle.max_cstate=0 is already present, which will disable intel idle driver. With the added option, common idle driver will be disabled as well, so there will not be idle management. This is to prevent a bug that can be triggered by idle instruction on intel platform.
Work item tracking
Microsoft ADO (number only): 24867921
How I did it
Add the option to installer file beside intel_idle.max_cstate=0
Signed-off-by: Xichen Lin <lukelin0907@gmail.com>
src/sonic-linux-kernel
* db00eb9 - (HEAD -> 202205, origin/202205) PATCH] net: allow user to set metric on default route learned via Router Advertisement (#326) (2 days ago) [abdosi]
To include the following fixes:
DNX:
CS00012287482 - Support for 1024 LAGs on DNX (Added back fix reverted in [202205] Update Broadcom DNX SAI version to 7.1.54.4 #15850)
CS00012302400 - New SAI 7.1.50.4 caused regression in sonic-mgmt ACL test &
ACL entry creation failing with SAI_STATUS_INVALID_PORT_NUMBER in SAI 7.1.50.4
(CS00012302347)
CS00012302163 - SAI_API_BRIDGE:_brcm_sai_bridge_port_learn_flag:1620 sai bridge lag port list get. failed with error -7.
CS00012296571 - LACP packets are queued to Queue 0 instead of Queue 7
CS00012301919 - The traffic is queued to VOQ 8 sometimes instead of destination port's VOQ
CS00012297160 - [SONIC] [J2C+] Traffic to unknown destination route getting enqueued on VOQ 10
CS00012298730 - [7.x][J2/J2C+] : Treat Q=0 as lowest priority and Q=7 as highest priority in Strict Priority Scheduling
Also includes -
XGS:
Port SONIC-62323 to SAI 7.1, Use single NH instead of ecmp
[SAI_BRANCH rel_ocp_sai_7_1] ECMP group expansion fail due to no resources
Fix capability for Hostif queue on SAI version 7.1
CS00012302193 - SAI_SWITCH_ATTR_SWITCH_HARDWARE_INFO attribute value changed
src/sonic-utilities
* 1ed5b5a9 - (HEAD -> 202205, origin/202205) Add transceiver status CLI to show output from TRANSCEIVER_STATUS table (cherry-pick to 202205) (#2950) (4 days ago) [longhuan-cisco]
* ba327726 - Fix in config override when all asic namespaces not present in golden_config_db (#2946) (4 days ago) [judyjoseph]
Why I did it
Dell S6100 Platform components needs to be updated.
How I did it
Modified platform.json to fix the issue.
How to verify it
Run sonic-mgmt component test and check whether it passes.
Co-authored-by: Aravind Mani <53524901+aravindmani-1@users.noreply.github.com>
How I did it
Update Yang definition of IN_PORTS and OUT_PORTS to string.
Since we cannot split the string with comma (,) and validate each substring is a valid SONiC port name. The only restriction for them is must be a string.
How to verify it
Verified by building sonic_yang_models-1.0-py3-none-any.whl. While building the target package, unit tests were run and passed.
Build a SONiC image based on 202205 branch and installed on physical DUT. Re try the steps in [Yang] Incorrect definition of IN_PORTS and OUT_PORTS in sonic-acl.yang #16190 and can see below success response:
Co-authored-by: Zhijian Li <zhijianli@microsoft.com>
- Why I did it
watchdogutil uses platform API watchdog instance to control/query watchdog status. In Nvidia watchdog status, it caches "armed" status in a object member "WatchdogImplBase.armed". This is not working for CLI infrastructure because each CLI will create a new watchdog instance, the status cached in previous instance will totally lose. Consider following commands:
admin@sonic:~$ sudo watchdogutil arm -s 100 =====> watchdog instance1, armed=True
Watchdog armed for 100 seconds
admin@sonic:~$ sudo watchdogutil status ======> watchdog instance2, armed=False
Status: Unarmed
admin@sonic:~$ sudo watchdogutil disarm =======> watchdog instance3, armed=False
Failed to disarm Watchdog
- How I did it
Use sysfs to query watchdog status
- How to verify it
Manual test
Unit test
Conflicts:
platform/mellanox/mlnx-platform-api/sonic_platform/watchdog.py
platform/mellanox/mlnx-platform-api/tests/test_watchdog.py
A workaround to back port the fix for a systemd issue.
The systemd issue: systemd/systemd#24668
The systemd PR to fix the issue: https://github.com/systemd/systemd/pull/24673/files
The formal solution should upgrade systemd to a version that contains the fix. But, systemd is a very basic service, upgrading systemd requires heavy test.
Release Notes for Cisco T0 and 8102-64H.
• Fix for PSUD crash when PSUs are inserted in an operational system
• Fix for VxLAN counters not incrementing in show vxlan counter' and 'show platform npu vxlan counters'
• Fix for continuous error messages reported by thermalctld
• Fix for dshell client enable/disable causing syncd crash
• Support for 9100 TPID for Cisco fanout.
• Caveat: Drop counters for packets with invalid VLAN tag are counted twice.
Release Notes for Cisco 8101-32FH:
• Aikido FPD 1.89 Upgrade
Update SAI xgs version to 7.1.54.4-3 to include the following XGS changes:
7.1.54.3-1: Port SONIC-62323 to SAI 7.1, Use single NH instead of ecmp
7.1.54.3-2: [SAI_BRANCH rel_ocp_sai_7_1] ECMP group expansion fail due to no resources
7.1.54.3-3: Fix capability for Hostif queue on SAI version 7.1
Signed-off-by: zitingguo-ms <zitingguo@microsoft.com>