add partial reboot cause support for linecards
add watchdog support for linecards
add power draw information for chassis
properly implement Chassis.get_port_or_cage_type
fix pcieutil on chassis with powered off cards
fix watchdog-control.service crash
misc fixes and cleanups
Why I did it
Move armhf syncd docker compilation to bullseye.
How I did it
compile syncd docker for armhf platform using below commands,
NOJESSIE=1 NOSTRETCH=1 NOBUSTER=1 BLDENV=bullseye make configure PLATFORM=marvell-armhf PLATFORM_ARCH=armhf
NOJESSIE=1 NOSTRETCH=1 NOBUSTER=1 BLDENV=bullseye make target/docker-syncd-mrvl.gz
How to verify it
upgrade the syncd docker and verify ports are up.
Signed-off-by: rajkumar38 <rpennadamram@marvell.com>
* [SAI PTF] SAI PTF docker support sai-ptf v2
Publish the sai-ptf docker.
Take part of the change from previous PR #11610 (already reverted as some cache issue)
Cause in #11610, added two new target in it, one is sai-ptf another one is syncd-rpc with sai-ptf v2, to make the upgrade with more clear target, use this one take the sai-ptf one.
Test one:
NOSTRETCH=y NOJESSIE=y make configure PLATFORM=vs
NOSTRETCH=y NOJESSIE=y NOBULLSEYE=y SAITHRIFT_V2=y make target/docker-ptf-sai.gz
Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
* remove useless change
Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
* remove useless parameters
Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
* remove useless change
Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
* Update azure-pipelines-build.yml
remove a useless option
Signed-off-by: richardyu-ms <richard.yu@microsoft.com>
This PR is part of the following HLD:
Persistent loglevel HLD: sonic-net/SONiC#1041
- Why I did it
After the Logger tables moved from the LOGLEVEL_DB to the CONFIG_DB and the jinja2_cache was deleted the LOGLEVEL_DB is not in use.
- How I did it
Removed the LOGLEVEL_DB from the SONiC code
- How to verify it
All tests were passed
Why I did it
syseepromd in pmon crashes because of missing import in python script and doesn't get in running state
How I did it
Fix missing import issue to avoid python script failing
How to verify it
Boot system and wait till syseepromd gets into running state
Which release branch to backport (provide reason below if selected)
201811
201911
202006
202012
202106
202111
202205
* Build docker-gbsyncd-broncos image
* Correct typo in LIBSAI_BRONCOS_URL_PREFIX
* Update docker-gbsyncd-broncos/Dockerfile.j2
* Enable debug shell support on docker-gbsyncd-broncos
* Include bcmsh in docker-gbsyncd-broncos
Why I did it
In docker-gbsyncd-broncos image, enable debug shell support for BRCM broncos PHY.
How I did it
How to verify it
Note: need enable attr SAI_SWITCH_ATTR_SWITCH_SHELL_ENABLE support in BCM PAI library
# bcmsh
Press Enter to show prompt.
Press Ctrl+C to exit.
NOTICE: Only one bcmsh or bcmcmd can connect to the shell at same time.
BRCM:> help
help
List of available commands
- h or help => Print command menu
- l => Print list of active ports on the PHY
- ps <port_id> <options> => Print port status
<options> => 1 -> Link status
=> 2 -> Link training failure status
=> 3 -> Link training RX status
=> 4 -> PRBS lock status
=> 5 -> PRBS lock loss status
- rd <port_id> <addr> <no of registers to read> => Read register contents
- wr <port_id> <addr> <data> => Write register data
- rrd <lanemap> <if_side> <addr> <no of registers to read> => Raw read register contents using lanemap and if_side (line = 0, system = 1)
- rwr <lanemap> <if_side> <addr> <data> => Raw write register data using lanemap and if_side (line = 0, system = 1)
- fw or firmware => Print firmware version of the PHY
- pd or port_dump <port_id> <flags> => Dump port status
- eyescan <port_id> => Display eye scan
- fec_status <port_id> => Get fec status of the port
- polarity <lanemap> <if_side> <TX polarity> <RX Polarity> => Set TX and RX polarity
<lanemap> => 0xF, 0xFF, or 0xFFFF based on number of lanes
<if_side > => Line = 0, System = 1
<TX/RX Polarity> =>_TX/RX Polarity bitmap of all lanes
Each bit represents a lane number.
E.g. Lane 0's polarity value (0 or 1) is populated in Bit 0.
- polarity <lanemap> <if_side> => Print TX and RX polarity
- lb <port_id> <lb_value> => Enable loopback on the port
lb_value = 0 -> Disable, 1 -> PHY, 2 -> MAC
- lb <port_id> => Print loopback configuration of the port
- prbs <port_id> <options> <val> => Set/Get PRBS configuration
<options> => 1 -> Get PRBS state and polynomial
2 -> Set PRBS Polynomial, <val> - PRBS Polynomial
Please refer to phy/chip documentation for valid values
3 -> Enable PRBS
<val> => 0 Disable PRBS
1 Enable both PRBS Transmitter and Receiver
2 Enable PRBS Receiver
3 Enable PRBS Transmitter
exit or q => Exit the diagnostic shell
- Why I did it
Update SN2201 dynamic minimum fan speed table according to data provided by the thermal team.
- How I did it
Update the thermal table in device_data.py
- How to verify it
Run platform related regression
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Signed-off-by: maipbui <maibui@microsoft.com>
#### Why I did it
`subprocess.Popen()` and `subprocess.run()` is used with `shell=True`, which is very dangerous for shell injection.
`os` - not secure against maliciously constructed input and dangerous if used to evaluate dynamic content
#### How I did it
Replace `os` by `subprocess`
Remove unused functions
Why I did it
In case the device contains more then one FAN drawer, the FANs name was incorrect.
How I did it
Passed max fan value to FAN object.
Fixed get_name() FAN API
How to verify it
show platform fan
Why I did it
SONiC will report the kernel dump while system reboot in Belgite platform as the following shows:
How I did it
Cause:
Invalid cdev container pointer from the inode is being accessing in misc
device open, which causes a memory corruption in the slub.
Because of the slub corruption, random crash is seen during reboot.
Fix: - Instead of cdev pointer from the inode, mdev container pointer is
used from the file->privdate_data member.
Action: update the pddf_custom_wdt driver,
How to verify it
Do the reboot stress test to check whether there is kernel dump during reboot progress
- Why I did it
Update SDK/FW version - 4.5.3186/2010_3186 in order to have the following changes:
New functionality:
1. Added support for 6.5W (Class 8) in ports 49-50, 53-54, 57-58, and 61-62 on SN4600 system
Fix the following issues:
1. On very rare occasion (~1/100K), during I2C transaction with MMS1V50-WM and MMS1V90-WR modules on SN4700 system, the module may send unexpected stop which violate the I2C specification, possibly affecting the link up flow
2. When running 1GbE speeds on SN4600 system, the port remained active while peer side was closed
3. While toggling the cable with ‘sfputil lpmode on/off’, error msg like “ERR pmon#xcvrd: Receive PMPE error event on module 1: status {X} error type {y}” could be received
4. When toggling many ports of the Spectrum devices while raising 10GbE link up and link maintenance is enabled, the switch may get stuck and may need to be rebooted
5. When trying to reconfigure the Flex Parser header and Flex transition parameters after ISSU, the switch will returned an error even if the configuration was identical to that done before performing the ISSU
6. While moving from lossless to lossy mode while shared headroom was used, reduction of the shared headroom can only be done prior to pool type change and when shared headroom is not utilized
7. SLL configuration is missing in SDK dump
8. If TTL_CMD_COPY is used in Encap direction for a packet with no TTL, then the value passed in the ttl data structure will be used if non-zero (default 255 if zero)
9. PCI calibration changes from a static to a dynamic mechanism
10. Layer 4 port information is not initialized for BFD packet event. To address the issue, remote peer UDP port information was added in BFD packet event
11. SDK returned error when FEC mode is set on twisted pair, when FEC was set to None
- How I did it
Update pointer for the SDK/FW
- How to verify it
Run regression tests
Signed-off-by: dprital <drorp@nvidia.com>
Why I did it
syseepromd in pmon crashes because of missing import in python script and doesn't get in running state
How I did it
Fix missing import issue to avoid python script failing
How to verify it
Boot system and wait till syseepromd gets into running state
Signed-off-by: maipbui <maibui@microsoft.com>
#### Why I did it
`os` - not secure against maliciously constructed input and dangerous if used to evaluate dynamic content
#### How I did it
Replace `os` by `subprocess`
This fixes the following error
```
admin@sonic:~$ sudo fwutil show status
mount: /mnt/onie-fs: special device /dev/sda2
does not exist.
Error: Command '['mount', '-n', '-r', '-t', 'ext4', '/dev/sda2\n', '/mnt/onie-fs']' returned non-zero exit status 32.. Aborting...
Aborted!
admin@sonic:~$ sudo vi /usr/local/lib/python3.9/dist-packages/sonic_platform/
```
Seems like #11877 the rstrip('\n') was removed. Probably by mistake.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
fix linecard provisioning issue (500 error)
fix some value types for get_system_eeprom_info API
refactor code to leverage pci topology (enabling dynamic Pcie plugin)
refactor asic declaration logic to new style
misc fixes
Signed-off-by: Mariusz Stachura <mariusz.stachura@intel.com>
What I did
Adding the dynamic headroom calculation support for Barefoot platforms.
Why I did it
Enabling dynamic mode for barefoot case.
How I verified it
The community tests are adjusted and pass.
Remove swsssdk from sonic OS image and docker image
#### Why I did it
swsssdk is deprecated, so need remove from image.
#### How I did it
Update config file to remove swsssdk from image.
#### How to verify it
Pass all test case.
#### Which release branch to backport (provide reason below if selected)
<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
#### Description for the changelog
Remove swsssdk from sonic OS image and docker image
#### Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->
#### A picture of a cute animal (not mandatory but encouraged)
Signed-off-by: maipbui <maibui@microsoft.com>
#### Why I did it
`os` - not secure against maliciously constructed input and dangerous if used to evaluate dynamic content.
#### How I did it
`os` - use with `subprocess`
#### How to verify it
Signed-off-by: maipbui <maibui@microsoft.com>
Dependency: [PR (#12065)](https://github.com/sonic-net/sonic-buildimage/pull/12065) needs to merge first.
#### Why I did it
`subprocess.Popen()` and `subprocess.check_output()` is used with `shell=True`, which is very dangerous for shell injection.
#### How I did it
Disable `shell=True`, enable `shell=False`
#### How to verify it
Tested on DUT, compare and verify the output between the original behavior and the new changes' behavior.
[testresults.zip](https://github.com/sonic-net/sonic-buildimage/files/9550867/testresults.zip)
- Why I did it
To update MFT package to the latest version.
- How I did it
Updated MFT_VERSION & MFT_REVISION in platform/mellanox/mft.mk.
- How to verify it
Build an image and deploy to the switch
Check MFT version by dpkg -l | grep mft
Verify that all the SONiC services up and running
Run regression testing using tests from sonic-mgmt
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
- Why I did it
To include latest fixes and new functionality
SAI fixes and new features
fix#3205239, incorrect object type returned for SG child list
Fix VRF-VNI map entries remove issue
ECC health event and logging
[Port Buffers] restore default queue and pg configuration when all user pools are deleted
Fix EVPN type3 error on removal of uc/bc flood group
Fix EVPN type2 MAC move from local to remote results in SAI failure
Fix Disable learning on VXLAN tunnel
Fix error on VXLAN v6 tunnel removal
Fix port cannot apply schedule group when it is a lag member
Fix BFD add more detailed message on BFD packet not related to any existing session
gcc10 compilation fixes
Disable learning on VXLAN tunnel
Support BFD remote-disc exchange in negotiation stage
Tunnel Loopback packet action attribute implementation (for Dual TOR)
Add KVD resources MIN/MAX functionality (pending CRM issue with MIN only)
Support for CRC2 hash algorithm
Bulk counter support for PGs, queues
Support mirror sample rate attribute (SPC2+)
[Functional] [QoS] | Unable to remove SCHEDULE profile table even if there is no object referencing it
Next hop group optimized bulk API
Reduce verbosity of shared database already exists print
Span mirror policer (SPC2+), optimize pipeline for acl mirror action with policer on SPC2+
use same size descriptor pool for rx/tx
fix bfd - notify Sonic for admin-down event
2201 - empty list for supported fec for RJ45 ports
Fix don't disable used tunnel underlay interfaces
SDK fixes
100GbE FCI DAC (10137628-4050LF/HPE PN: 845408-B21) was recognized by mistake as supporting "cable burning' which caused the switch firmware to read page 0x9f (which unsupported in the cable) and to report this cable as having "bad eeprom".
Added remote peer UDP port information in BFD packet event.
After editing an ECMP, the resilient ECMP next-hop counter may not count correctly.
Fixed potential memory leaks in some APIs related to LPM
If TTL_CMD_COPY is used in Encap direction for a packet with no TTL, then the value passed in the ttl data structure will be used if non-zero (default 255 if zero).
In SN2201: When configuring Force mode, user should configure Speed and FEC on both sides
In Flex Tunnel encapsulation flow, if the encapsulation is with an IPv6 header, the flow label field may not be updated as expected.
In some cases, when changing speed to 400GbE over 8 lanes, the first few packets would be dropped.
In some traffic patterns involving small packets, the PortRcvErrors counter may mistakenly count events of local physical errors due to an internal flow in the hardware that involves link packets.
On Spectrum systems, sometimes during link failure, not all previous firmware indications cleared properly, potentially affecting the next link up attempt.
On the NVIDIA Spectrum-2 switch, when receiving a packet with Symbol Errors on ports that are configured to cut-thought mode, a pipeline might get stuck.
PCI calibration changes from a static to a dynamic mechanism.
SDK debug dump shows "Unknown" Counter in RFC3635 Counter Group.
SDK debug dump shows "Unknown" Counter in the PPCNT Traffic Class Counter Group.
SDK Dump missing column headers in some GC tables may result in difficulty understanding the dump.
SLL configuration is missing in SDK dump.
Spectrum-2 systems, do no support 1GbE on supported 40GbE modules.
When binding a UDP port which is already in use for BFD TX session, the error message appears incorrectly.
When Flex Tunnel was used, Flex Modifier sometimes experienced a brief mis-configuration during ISSU.
When many ports are active (e.g. 70 ports up), and the configuration of shared buffer is applied on the fly, occasionally, the firmware might get stuck.
When running 1GbE speeds on SN4600 system, the port remained active while peer side was closed.
When toggling many ports of the Spectrum devices while raising 10GbE link up and link maintenance is enabled, the switch may get stuck and may need to be rebooted.
When trying to reconfigure the Flex Parser header and Flex transition parameters after ISSU, the switch will returned an error even if the configuration was identical to that done before performing the ISSU.
While toggling the cable, and the low power mode is set to ON, an unexpected PMPE event error is received.
- How I did it
Updated SDK/SAI submodule and relevant makefiles with the required versions.
- How to verify it
Build an image and run tests from "sonic-mgmt".
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
- Why I did it
get_rx_los and get_tx_fault is not supported via the exisitng interface used, need provide dummy implementation for them.
NOTE: in later releases we will get them back via different interface.
- How I did it
Return False * lane_num for get_rx_los and get_tx_fault
- How to verify it
Added unit test
* Move qsfp eeprom reading to new cached api
* provide reading multiple pages in recursive manner
* workaround with flat memory on cmis
* remove workaround with memory model
* Remove unused imports
- Why I did it
Fix a typo in chassis platform API which causes the following error
>>> import sonic_platform as P
>>> c = P.platform.Platform().get_chassis()
>>> sl = c.get_all_sfps()
>>> sl[0].get_lpmode()
Sep 28 07:48:33 INFO LOG: Initializing SX log with STDOUT as output file.
False
>>> del c
Exception ignored in: <function Chassis.__del__ at 0x7f1d166ef8b0>
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/sonic_platform/chassis.py", line 126, in __del__
self.sfp_module.deinitialize_sdk_handle(sfp_module.SFP.shared_sdk_handle)
NameError: name 'sfp_module' is not defined
- How I did it
Use self while using the SDK handle
- How to verify it
Manual test
Signed-off-by: Stephen Sun <stephens@nvidia.com>
This reverts commit 9750cb4.
There is a PR to handle 202205 branch revert: #12184
- Why I did it
The PR to be reverted introduced many notice logs every 1 minute if SFP is not plugged:
Cannot get module EEPROM information: Input/output error
Before the "bad" PR, the message format is like this:
INFO pmon#supervisord: xcvrd Cannot get module EEPROM information: Input/output error
It was truncated by rsyslog because every message is the same. However, the "bad" PR introduces SFP index to the message:
NOTICE pmon#xcvrd: Failed to get EEPROM data for sfp 39: Cannot get module EEPROM information: Input/output error
Rsyslog no longer truncate such log and many such messages are flooded to syslog.
- How I did it
Revert the PR
- How to verify it
Manual test
- Why I did it
Update SDK/FW version - 4.5.2320/2010_2320 in order to have the following fixes:
• Spectrum-3 | PCI calibration changes from a static to a dynamic mechanism.
• [VxLAN] TTL was set to 0 for non IP traffic (such as ARP)
- How I did it
Update pointer for the SDK/FW
- How to verify it
Run regression tests