Why I did it
When the Supervisor card is rebooted through the PMON API, it takes about 90 seconds to trigger the shutdown in the down path. By that time the linecards are already up, which delays linecard database initialization because it is trying to PING/PONG the database-chassis. To address this issue, we modified the NDK to issue the system call "sudo reboot" when the request comes from the PMON API on the Supervisor. This applies to NDK version 22.9.20 and greater. The new NDK requires this modification of platform_reboot to work.
Work item tracking
Microsoft ADO (number only): 26365734
How I did it
Modify platform_reboot on the Supervisor so that it does not reboot all IMMs, since that is already done in the function reboot() in module.py. Also handle reboot-cause.txt on the Supervisor when the reboot is requested from the PMON API.
Modify the Nokia platform-specific platform_reboot on the linecard to disable all SFPs.
This PR works with NDK version 22.9.20 and above
Signed-off-by: mlok <marty.lok@nokia.com>
- Why I did it
Fix an xcvrd crash caused by the error "cannot import name 'initialize_sfp_thermal'":
Nov 27 09:47:16.388639 sonic ERR pmon#xcvrd: Exception occured at CmisManagerTask thread due to ImportError("cannot import name 'initialize_sfp_thermal' from partially initialized module 'sonic_platform.thermal' (most likely due to a circular import) (/usr/local/lib/python3.9/dist-packages/sonic_platform/thermal.py)")
- How I did it
Add lock for creating SFP object
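The idea can be pictured with the minimal sketch below. It assumes a hypothetical `Chassis` class and a hypothetical `make_sfp` factory, neither of which is taken from the actual platform code. A lock serializes lazy SFP object creation so that two threads cannot run the creation path (and the imports it triggers) at the same time.

```python
import threading


class Chassis:
    """Hypothetical chassis that lazily creates SFP objects (illustrative only)."""

    def __init__(self, port_count, make_sfp):
        self._sfps = [None] * port_count
        self._make_sfp = make_sfp          # hypothetical factory: index -> SFP object
        self._sfp_lock = threading.Lock()  # guards creation of SFP objects

    def get_sfp(self, index):
        # Hold the lock so only one thread at a time can create the object.
        with self._sfp_lock:
            if self._sfps[index] is None:
                self._sfps[index] = self._make_sfp(index)
            return self._sfps[index]
```

With this shape, concurrent callers of `get_sfp()` for the same port all receive the same object, and the factory runs once per port.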
- How to verify it
Unit test
Manual Test
- Why I did it
When a module is fully under software control, the driver cannot get the module temperature or temperature thresholds from firmware. In this case, SONiC needs to read them from the EEPROM. In this PR, a thermal updater thread is created to update the module temperature and temperature thresholds while software control is enabled.
- How I did it
Query ASIC temperature from SDK sysfs and update hw-management-tc periodically
Query Module temperature from EEPROM and update hw-management-tc periodically
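A periodic updater of this kind could look roughly like the sketch below. The class name `PeriodicUpdater` and the task callables are assumptions made for illustration; the real thermal updater, its intervals, and the sysfs/EEPROM access are not shown here.

```python
import threading


class PeriodicUpdater(threading.Thread):
    """Hypothetical helper: runs every task each `interval` seconds until stopped."""

    def __init__(self, interval, tasks):
        super().__init__(daemon=True)
        self._interval = interval
        self._tasks = list(tasks)            # e.g. update ASIC temp, update module temp
        self._stop_event = threading.Event()

    def run(self):
        while not self._stop_event.is_set():
            for task in self._tasks:
                task()
            # wait() returns early as soon as stop() is called
            self._stop_event.wait(self._interval)

    def stop(self):
        self._stop_event.set()
```

Waiting on an `Event` instead of calling `time.sleep()` lets `stop()` end the thread promptly without waiting out a full interval.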
- How to verify it
Manual test
New Unit tests
- Why I did it
Enable CMIS host management for Mellanox devices which are expected to support the feature
- How I did it
Add a new thread in a new file and change the platform logic in chassis.py, which calls this thread from get_change_event().
The thread in the new file handles the state machine per port.
First, static detection takes place once the thread is up (during the switch bootup sequence) and runs until the final decision on whether the module is under FW control or SW control.
After it ends, dynamic detection takes place, listening to changes in the sysfs fds per port,
so that it can detect cable plug-in and plug-out events.
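The two phases can be sketched as a small per-port state machine. The state names and the `sw_capable` input below are hypothetical and only illustrate the flow described above; they are not the identifiers or the decision criteria used in the platform code.

```python
from enum import Enum


class PortState(Enum):
    # Hypothetical state names, not taken from the actual platform code
    NOT_PRESENT = "not_present"
    FW_CONTROL = "fw_control"
    SW_CONTROL = "sw_control"


def static_detect(present, sw_capable):
    """Initial decision made once per port during bootup."""
    if not present:
        return PortState.NOT_PRESENT
    return PortState.SW_CONTROL if sw_capable else PortState.FW_CONTROL


def on_presence_change(state, present, sw_capable):
    """Dynamic detection: react to a plug-in or plug-out event on one port."""
    if not present:
        return PortState.NOT_PRESENT
    if state is PortState.NOT_PRESENT:
        # A module was plugged in: rerun the control-mode decision.
        return static_detect(present, sw_capable)
    return state
```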
- How to verify it
Enhanced unit tests
run sonic mgmt on Nvidia SN4700 with CMIS host management enabled
Co-authored-by: dbarashinvd <105214075+dbarashinvd@users.noreply.github.com>
- Why I did it
Provide a dummy implementation for SFP error description when CMIS host management is enabled. A future feature shall be raised to implement SFP error description for such mode.
- How I did it
if SFP is under software control, provide "Not supported" as error description
if SFP is under initialization, provide "Initializing" as error description
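A minimal sketch of this placeholder logic follows. The function name and the tri-state `sw_control` argument are assumptions: `None` stands for "control mode not yet determined", which is treated here as the initialization case.

```python
from typing import Optional


def placeholder_error_description(sw_control: Optional[bool]) -> Optional[str]:
    """Return a placeholder description, or None to fall back to the normal path.

    sw_control is a hypothetical tri-state:
      None  -> control mode not known yet (module still initializing)
      True  -> module is under software control
      False -> module is under firmware control
    """
    if sw_control is None:
        return "Initializing"
    if sw_control:
        return "Not supported"
    return None
```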
- How to verify it
unit test
Why I did it
Each release branch needs to have release number set.
Work item tracking
Microsoft ADO (number only):
How I did it
How to verify it
This PR test.
* [submodule]: Update submodule sonic-swss/sonic-dash-api/protobuf (#17413)
1. Protobuf 3.21 has been released in the Debian bookworm
2. Update submodule sonic-swss and sonic-dash-api because they include related updates.
- Microsoft ADO **(number only)**:
1. In protobuf.mk, skip compiling the protobuf package if the target is not bullseye
2. Move sonic-swss commits:
```
fd852084 (HEAD, origin/master, origin/HEAD) [dashrouteorch]: Rename dash route namespace (#2966)
```
3. Move sonic-dash-api and move build chain to its submodule
```
d4448c7 (HEAD, origin/master, origin/HEAD, master) [azp]: Add multi-platform artifacts (#11)
8a5e5cc [debian]: Add debian package (#10)
d96163a [misc]: Add dash utils and its tests (#9)
```
Signed-off-by: Ze Gan <ganze718@gmail.com>
To modify EEPROM API serial_number_str to return service tag instead of serial number in Dell S6100.
Ref PR: #1239
How I did it
Update EEPROM API serial_number_str to return service tag instead of serial number.
How to verify it
Verify decode-syseeprom -s returns service tag in Dell S6100.
src/sonic-swss
* d839eec3 - (HEAD -> 202311, origin/202311) Add support for fabric monitor daemon (swss part). (#2920) (11 days ago) [jfeng-arista]
* 8dc0a856 - Add support for new Port SI parameters in PortsOA (#2929) (11 days ago) [Tomer Shalvi]
* 9458b855 - [hash]: Add ECMP/LAG hash algorithm to OA (#2953) (12 days ago) [Nazarii Hnydyn]
* dac3972d - [coppmgrd] Fix Copp processing logic by using Producer del instead of del from Table (13 days ago) [Vivek]
* f6a35e98 - [gcov]: Fix directory prefix issue for (#2969) (13 days ago) [Lawrence Lee]
* 14408ca3 - [Chassis][master][orchagent] : Added test case to verify WRED profile on system ports (#2954) (2 weeks ago) [vmittal-msft]
* 2ca3deb0 - [dash] fix DASH ACL Rule protocol use-after-free (#2958) (3 weeks ago) [Yakiv Huryk]
* b8841ecb - [orchagent]: Extend the SRv6Orch to support the programming of the L3Adj (#2902) (3 weeks ago) [Carmine Scarpitta]
* 194566a7 - Fix the Orchagent Qos error messages reported in Issue #16787 (#2947) (3 weeks ago) [saksarav-nokia]
src/sonic-platform-common
* 5d69644 - (HEAD -> 202311, origin/202311) Adding supported vendor PNs for remote CDB FW upgrade (#418) (#419) (5 days ago) [mihirpat1]
* 036b2fc - [Credo][Ycable] Correct the lane mapping in the debugdumpregister function for the 50G cable (#417) (11 days ago) [Xinyu Lin]
* 2efe97e - Fix VDM freeze and unfreeze needed for PM stats collection (#402) (2 weeks ago) [jaganbal-a]
* cb80f17 - Fix issue: QSFP module with id 0x0d can be parsed using 8636 (#412) (3 weeks ago) [Stephen Sun]
FCS/CRC Errors will only be reported as RX_ERR.
Fix to avoid the mac port related errors.
Fix for sharedResSize testcase failure in QoS-SAI
Fix the issue related to voltage in 'show platform psustatus'.
Support WRED drop for lossy queues.
Fixed an issue where lossy traffic was getting dropped.
Enhancement of SAI logging for errors and interrupts
- Why I did it
Mellanox MSN2410 platforms have a non-functional error log: "ERR pmon#sensord: Error getting sensor data: dps460/#10: Can't read". This error is caused by a firmware issue with some PSUs, and we are not able to upgrade the FW online. Since there is no functional impact, this error log can be safely ignored.
- How I did it
Add a new rsyslog rule to rsyslog-container.conf.j2. If the docker name is pmon and the platform name matches, the new rule is inserted into the docker's rsyslogd.conf.
- How to verify it
Run regression on the MSN2410 platform to make sure the error log is no longer printed to the syslog.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Why I did it
To fix the EVPN type5 failure seen in FRR when there are multipaths for the nexthop. The type5 routes were queued, as shown in the output below:
show ip route vrf Vrf1
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
VRF Vrf1:
B>q 5.5.5.0/24 [200/0] via 30.0.0.2, Vlan100 onlink, weight 1, 00:00:40
q via 40.0.0.3, Vlan100 onlink, weight 1, 00:00:40
C>* 10.0.0.0/24 is directly connected, Vlan10, 00:00:43
B>q 100.0.0.0/24 [200/0] via 30.0.0.2, Vlan100 onlink, weight 1, 00:00:40
q via 40.0.0.3, Vlan100 onlink, weight 1, 00:00:40
Work item tracking
Microsoft ADO (number only):
How I did it
Porting the FRR fix FRRouting/frr#14835
How to verify it
Validated EVPN multipath with the scenario and confirmed it's working.
The format of the media_settings.json file was updated to support the Port SI Per Speed Enhancements. Since media_checker is the validator for the media_settings.json file, it needs to be updated to align with the new format.
How I did it
I added six new SI parameter names introduced as part of the Port SI Per Speed Enhancements. Additionally, I implemented handling for the new hierarchy level (lane_speed_key) in the updated media_settings.json format while maintaining backward compatibility with vendors whose JSON does not support port SI per speed.
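The backward-compatible handling can be sketched as follows. This is not the media_checker code. It assumes that a lane_speed_key is recognizable by a "speed:" prefix and uses `main` and `pre1` purely as example parameter names.

```python
def iter_si_params(media_entry):
    """Yield (lane_speed_key, param_name, lane_values) for both JSON layouts.

    Assumption: keys of the extra hierarchy level start with "speed:".
    """
    for key, value in media_entry.items():
        if key.startswith("speed:"):
            # New format: a lane_speed_key level wraps the SI parameters.
            for name, lanes in value.items():
                yield key, name, lanes
        else:
            # Old format: SI parameters sit directly under the media key.
            yield None, key, value
```

A validator can then check every yielded `param_name` against its list of known SI parameters, regardless of which layout the vendor file uses.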
How to verify it
I locally built the Debian package using 'make target/debs/bullseye/sonic-device-data_1.0-1_all.deb', and it completed successfully. Jenkins also built the entire image, which includes the media_checker as part of its process.
[arp_update]: Flush MAC mismatch neighbors
- Check for MAC mismatch between neighbor entries in the kernel and APPL_DB
- Flush any entries with a mismatch
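The mismatch check can be illustrated with the sketch below. It works on plain dictionaries keyed by a hypothetical (interface, IP) tuple and does not reproduce how arp_update actually reads the kernel table or APPL_DB.

```python
def find_mac_mismatches(kernel_neigh, appl_db_neigh):
    """Return neighbor keys whose MAC differs between the kernel and APPL_DB.

    Both arguments map a hypothetical (interface, ip) key to a MAC string.
    Entries missing from APPL_DB are not reported as mismatches.
    """
    mismatched = []
    for key, kernel_mac in kernel_neigh.items():
        db_mac = appl_db_neigh.get(key)
        if db_mac is not None and db_mac.lower() != kernel_mac.lower():
            mismatched.append(key)
    return mismatched
```

The returned keys are the entries that would then be flushed.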
Currently this repo does not build the dhcp_server container image by default, so build issues for dhcp_server introduced by other modules cannot be noticed in time.
This PR enables building the dhcp_server container in the vs image.
How I did it
Remove Python3 venv in Python3-only sonic-mgmt-docker
How to verify it
There is no impact to sonic-mgmt-docker:latest tag.
Build sonic-mgmt-docker with LEGACY_SONIC_MGMT_DOCKER=y, see python3 venv is there.
Build sonic-mgmt-docker with LEGACY_SONIC_MGMT_DOCKER=n, see python3 venv is NOT included.
This change was submitted directly to 202205 but it's also needed in master and 202305 with SAI9.x
#13346
There have been a couple of CSPs for this as well:
CS00012273013 - [7.1][J2, J2c+] Disable SA Equals DA trap on DNX
CS00012320965 - SAI9.2: iBGP doesn't work due to SA_EQUALS_DA trap
If SA_EQUALS_DA trap is enabled iBGP won't work as the Ethernet-IB0 ports are expected to get packets with SA==DA.
In the VOQ chassis design, outgoing control plane packets go through the recycle port for routing, so the dmac of the packet should be the asic router mac. The source mac is assigned by the kernel, so it is also the asic router mac.
Why I did it
sonic_dhcp_server.whl contains not only dhcp_server functionality but also part of the dhcp_relay functionality, so the existing name is not appropriate.
Why I did it
HLD implementation: Container Hardening (sonic-net/SONiC#1364)
Work item tracking
Microsoft ADO (number only): 14807420
How I did it
Reduce Linux capabilities by dropping the privileged flag
How to verify it
Check container's settings: Privileged is false and container only has default Linux caps, does not have extended caps.
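This check can be scripted against the JSON printed by `docker inspect <container>`. The sketch below takes one already-parsed entry of that output and relies on the `HostConfig.Privileged` and `HostConfig.CapAdd` fields. The function name is an assumption for illustration.

```python
def is_hardened(inspect_entry):
    """Return True if the container is unprivileged and adds no extra capabilities.

    inspect_entry is one element of the list printed by `docker inspect`.
    """
    host_config = inspect_entry.get("HostConfig", {})
    privileged = host_config.get("Privileged", False)
    cap_add = host_config.get("CapAdd") or []  # CapAdd may be null in the JSON
    return (not privileged) and not cap_add
```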
1. Modify j2 template files in docker-dhcp-relay. Add dhcprelayd to group dhcp-relay instead of isc-dhcp-relay-VlanXXX, which makes dhcprelayd a critical process.
2. In dhcprelayd, subscribe to the FEATURE table to check whether the dhcp_server feature is enabled.
2.1 If the dhcp_server feature is disabled, the original dhcp_relay functionality is needed and dhcprelayd does nothing. Because the dhcrelay/dhcpmon configuration is generated in the supervisord configuration, those processes run automatically.
2.2 If the dhcp_server feature is enabled, dhcprelayd stops the dhcpmon/dhcrelay processes started by supervisord and subscribes to the dhcp_server related tables in config_db to start dhcpmon/dhcrelay processes.
2.3 While dhcprelayd is running, it regularly checks the feature status (every 5s by default) and can encounter the following 4 state changes of the dhcp_server feature:
A) disabled -> enabled
In this scenario, dhcprelayd will subscribe dhcp_server related tables and stop dhcpmon/dhcrelay processes started by supervisord and start new pair of dhcpmon/dhcrelay processes. After this, dhcpmon/dhcrelay processes are totally managed by dhcprelayd.
B) enabled -> enabled
In this scenario, dhcprelayd will monitor db changes in dhcp_server related tables to determine whether to restart dhcpmon/dhcrelay processes.
C) enabled -> disabled
In this scenario, dhcprelayd would unsubscribe dhcp_server related tables and kill dhcpmon/dhcrelay processes started by itself. And then dhcprelayd will start dhcpmon/dhcrelay processes via supervisorctl.
D) disabled -> disabled
dhcprelayd will check whether the running status of the dhcrelay processes is consistent with the supervisord configuration file. If it is not, dhcprelayd will kill itself, and the dhcp_relay container will then stop because dhcprelayd is a critical process.
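The four transitions above can be summarized as a small dispatch function. The action names in the returned lists are hypothetical labels for the steps described in A) through D), not functions in dhcprelayd.

```python
def actions_for_transition(was_enabled, is_enabled):
    """Map a dhcp_server feature transition to the steps dhcprelayd takes."""
    if not was_enabled and is_enabled:
        # A) disabled -> enabled
        return ["subscribe_dhcp_server_tables",
                "stop_supervisord_processes",
                "start_own_processes"]
    if was_enabled and is_enabled:
        # B) enabled -> enabled
        return ["monitor_dhcp_server_tables"]
    if was_enabled and not is_enabled:
        # C) enabled -> disabled
        return ["unsubscribe_dhcp_server_tables",
                "kill_own_processes",
                "start_processes_via_supervisorctl"]
    # D) disabled -> disabled
    return ["check_processes_match_supervisord_config"]
```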
Why I did it
Fixing CVEs CVE-2023-46752 CVE-2023-46753 CVE-2023-47234 CVE-2023-47235
Work item tracking
Microsoft ADO (number only):
How I did it
Porting the fixes in the below PRs
FRRouting/frr#14645 FRRouting/frr#14716
How to verify it
Running regression