Why I did it
The Mellanox platform is required to support the fwutil auto-update feature defined here
This is to allow switches, when performing SONiC upgrades to choose whether to perform firmware upgrades that may interrupt the data plane through a cold boot.
How I did it
Two methods were added to the component implementations for mellanox.
In the base Component class we add a default function that chooses to skip the installation of any firmware unless the cold boot option is provided. This is because the Mellanox platform, by default, does not support installing firmware on ONIE, the CPLD, or the BIOS "on-the-fly".
In the ComponentSSD class we add a function that behaves similarly but uses the Mellanox specific SSD firmware upgrade tool to check if the current SSD supports being upgraded on the fly in order to decide whether to skip or perform the installation.
How to verify it
Unit tests are included with this PR. These test will run on build of target sonic-mellanox.bin
You may also perform fwutil auto-update ... commands after Azure/sonic-utilities#1242 is merged in.
- Why I did it
Enhance the Python3 support for platform API. Originally, some platform APIs call SDK API which didn't support Python 3. Now the Python 3 APIs have been supported in SDK 4.4.3XXX, Python3 is completely supported by platform API
- How I did it
Start all platform daemons from python3
1. Remove #/usr/bin/env python at the beginning of each platform API file as the platform API won't be started as daemons but be imported from other daemons.
2. Adjust SDK API calls accordingly
- How to verify it
Manually test and run regression platform test
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Why I did it
Adjust the Makefile for SDK/python-SDK-API to support both python2 and python3
- How to verify it
Build the image and check whether python2 and python3 are both supported by SDK API.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Added support BRCM SAI 5.0.0.1.
Major changes here:
CS00012019568 Link Training (all 100G ASICs - TH families and TD3)
CS00012184310 [attribute_capability| for port SAI_PORT_ATTR_TPID returns CREATE_IMP=false|SET_IMP=true|GET_IMP=true
CS00012182145 [IPinIP][Tunnel Delete] If IPinIP tunnel delete is performed observed following SYNCd error: ERR syncd#syncd: [none] SAI_API_TUNNEL:brcm_sai_tnl_mp_remove_tunnel_term_table_entry:4026 _brcm_sai_mptnl_sip_tnl_lookup failed with error -7.
CS00012182148 Rate Limit Parity error message to syncd/sonic.
CS00012178692 ACL drops counted as interface drops
CS00012183901 [WARMBOOT] WARMReboot with active traffic causes port flap reported during warm reboot
CS00012023263 TD3/TH2 : Support 4 lossless queues(2 SW PFCWD and 2 HW PFCWD)
CS00012019578 Pre FEC bit-error rate (BER) - DNX and XGS (TD and TH 50/100G)
To fix determine-reboot-cause service which was failing due to non-implemented thrown from get_reboot_case, if the reboot was done with `sudo reboot` (cold reboot)
Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com>
Why I did it
* For SAI - Upgrade to Version 1.19.0
- Add support for VxLAN encap TTL uniform model on SPC2/3
- Add ACL entry actions set VRF, set do no learn, add VLAN ID, add VLAN priority
- Add ACL field has VLAN tag
- Bulk counters (improve port statistics performance)
- Create async dump extra as part of debug generate dump
- Create irisc dump on severe health event
- Support 0 port systems (modify get switch mac to work accordingly)
- Set interface vlan up state for ping tool in SONiC
- Support attributes SAI_PORT_ATTR_QOS_SCHEDULER_PROFILE_ID, SAI_PORT_ATTR_QOS_INGRESS_BUFFER_PROFILE_LIST,
SAI_PORT_ATTR_QOS_EGRESS_BUFFER_PROFILE_LIST, SAI_PORT_ATTR_POLICER_ID as part of port create Git stats
* For SDK\FW - Upgrade to Version SDK 4.4.3106, FW 2008_3110
Added Features:
- Increased ACL table
- Enhanced PSAMPLE support
- Added support for Finisar SR4 module in SN3700 systems
- Added support for Python 3.0 in examples.
Fix bugs:
- On LR4 transceivers 00YD278, the firmware incorrectly identified the transceiver
- Reduce memory consumption for virtual LAG
- Fixed PSAMPLE listeners cleanup on SDK drivers unloading.
- On Spectrum-2 and Spectrum-3 systems, slow reaction time to Rx pause packets may lead to buffer overflow on servers.
- BER may be experienced when using 5m DAC cables between SN4700 and SN2700 in 100GbE speed.
- On very rare occasion, when connecting DR4 PAM4 transceiver to 100GbE DR1 NRZ, low BER may be experienced.
- Unexpected packet drops on the port ingress buffer may be experienced when working in 400GbE mode.
Note: When performing ISSU from an older version, this fix won't be applied. For fix to apply, a non-ISSU reset is required.
- Fix SN3800 specific warm boot scenario: Disable interface, Warm Boot, Enable Interface --> link will remain down.
Signed-off-by: Dror Prital <drorp@nvidia.com>
This is due to the fact that we use SONIC_OVERRIDE_BUILD_VARS internally
in our build jobs and this is not accounted in caching framework.
So we add MLNX_SDK_DEB_VERSION to force rebuild if we changed it via
SONIC_OVERRIDE_BUILD_VARS.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Why I did it
Quagga is no longer being used. Remove quagga-related code (e.g., docker-fpm-quagga, sonic-quagga, etc.).
How I did it
Remove quagga-related code.
#### Why I did it
- After [sonic-linux-kernel#177](https://github.com/Azure/sonic-linux-kernel/pull/177) changes, the I2C mux channels of Baseboard and Switchboard CPLDs are moved from i2c-4 and i2c-5 to i2c-36 and i2c-37 respectively.
- This caused QSFP driver initialization of i2c-36 to i2c-41 to fail causing the ports from Ethernet208 to Ethernet248 fail.
#### How I did it
- The fix to this problem is to change the order of QSFP driver initialization to I2C mux channels.
- Instead of the order i2c-10 to i2c-41, the order i2c-4 to i2c-35 is being utilized.
- Also, need to change the i2c-mux-channel number for Baseboard CPLD and switchboard CPLD in scripts to access them.
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
Currently we leveraged the Supervisor to monitor the running status of critical processes in each container and it is more reliable and flexible than doing the monitoring by Monit. So we removed the functionality of monitoring the critical processes by Monit.
How I did it
I removed the script process_checker and corresponding Monit configuration entries of critical processes.
How to verify it
I verified this on the device str-7260cx3-acs-1.
#### Why I did it
On our platforms syncd must be up while using the sonic_platform.
The issue is warm-reboot script first disables syncd then instantiate Chassis, which tries to connect syncd in __init__.
#### How I did it
Refactor Chassis to lazy initialize components.
Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com>
- Why I did it
To give SONiC Application Extension developers an environment to run and develop their apps.
- How I did it
Created sonic-sdk and sonic-sdk-buildenv dockers and their dbg versions.
- How to verify it
Build:
$ make -f slave target/sonic-sdk.gz target/sonic-sdk-buildenv.gz
- Why I did it
Pick up fix from new hw-management package:
Fix gearbox thermal zone name, which was lack suffix thermal zone number
- How I did it
Update the hw-management version number in the make file
Update hw-management submodule pointer
- How to verify it
Run platform related test cases on Mellanox platform
- Fix `thermal.get_position_in_parent` issue which causes test_snmp_phy_entity failure
- Add support for xcvr thermal info so that thermalctld can incorporate it into the cooling algorithm (QSFP and OSFP/QSFP-DD modules only)
- Add improvements to `arista` CLI
#### Why I did it
To fix the following:
```
# psuutil status
Traceback (most recent call last):
File "/usr/local/bin/psuutil", line 8, in <module>
sys.exit(cli())
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/psuutil/main.py", line 93, in status
psu_name = psu.get_name()
File "/usr/local/lib/python3.7/dist-packages/sonic_platform_base/device_base.py", line 28, in get_name
raise NotImplementedError
NotImplementedError
```
#### How I did it
Implemented get_name
Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com>
#### Why I did it
According to thermalctld hld, each fan must belong to a fan drawer, if the fan drawer does not physically exist, put fan into a virtual fan drawer. This PR is to clear fan from chassis._fan_list
#### How I did it
1. Don't put fan to chassis._fan_list
2. Always query fan from fan_drawer
#### Why I did it
This pull request allows calls to be made through the platform 2.0 API that retrieve the PSU and Chassis hardware revision on Mellanox platforms. Access to these values will aid customers in determining their hardware revisions for debugging and technical support. These values are intended to be eventually exposed through the CLI.
#### How I did it
For the PSU hardware revision I used the existing VPD function calls implemented in https://github.com/Azure/sonic-buildimage/pull/7382
For the Chassis hardware revision I parsed the SMBIOS / DMI type 2 information to retrieve the information.
#### Why I did it
xcvrd crashes when the application advertisement capability flag is not seen for few transceivers.
#### How I did it
Initialize the additional application capability in dunder init
Add support for Accton as9726-32d platform
This pull request is based on as9716-32d, so I reference as9716-32d to create new model: as9726-32d.
This module do not need led driver to control led, FPGA can handle it.
I also implement API2.0(sonic_platform) for this model, CPLD driver, PSU driver, Fan driver to control these HW behavior.
Why I did it
Fix issues below.
#7133#6602
So, remove the dps200 driver from the platform-specific driver.
Then, add the dps200 module driver to the Linux kernel tree.
How I did it
Remove the dps200 driver from the platform-specific driver and add the dps200 module driver to the Linux kernel.
How to verify it
Build an image with Azure/sonic-linux-kernel#207
Then, install to the Haliburton.
LED_PROC_INIT_SOC variable was incorrectly referenced as LED_SOC_INIT_SOC. Introduced in #5483
Rather than fixing the typo, I decided to simplify the script, removing the need for the conditional altogether by moving the bcmcmd call inside the conditional which checks for the presence of LED_SOC_INIT_SOC.
Fix following crash in `show version`:
```
Traceback (most recent call last):
File "/usr/local/bin/decode-syseeprom", line 32, in instantiate_eeprom_object
eeprom = sonic_platform.platform.Platform().get_chassis().get_eeprom()
AttributeError: module 'sonic_platform' has no attribute 'platform'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/decode-syseeprom", line 262, in <module>
sys.exit(main())
File "/usr/local/bin/decode-syseeprom", line 244, in main
print_serial(use_db)
File "/usr/local/bin/decode-syseeprom", line 169, in print_serial
eeprom = instantiate_eeprom_object()
File "/usr/local/bin/decode-syseeprom", line 34, in instantiate_eeprom_object
log.log_error('Failed to obtain EEPROM object due to {}'.format(repr(e)))
NameError: name 'log' is not defined
```
Signed-off-by: pettershao-ragilenetworks <pettershao@ragilenetworks.com>
Platform library changes
- Fix the use of /proc/modules during testing, fixes#7463
- Add `libsfp-eeprom.so` build to read/write xcvr eeproms in C
- Add some more reboot-cause information
- Write down temperature hw thresholds to the sensors
- Report software thresholds through platform api
- Writ `port_name sysfs` file of optoe`
- Tests enhancements
- Fix dependency issues for chassis provisioning
Platform configuration changes
- Add `pcie.yaml` configuration for a few platforms
- Mount `libsfp-eeprom.so` inside `pmon`
- Fix `Arista-7050SX3-48C8` and `Arista-7050SX3-48YC8' platform and hwsku
- Miscellaneous fixes
Co-authored-by: Boyang Yu <byu@arista.com>
Co-authored-by: Zhi Yuan Carl Zhao <zyzhao@arista.com>
Originally, SFP modules were always accessed from platform daemons, and arbitrary SFP modules can be accessed in the daemon. So all SFP modules were initialized in one shot once one of the following chassis APIs called
- get_all_sfps
- get_sfp_numbers
- get_sfp
Recently, we noticed that SFP modules can also be accessed from CLI, eg. the latest refactor of `sfputil`.
In this case, only one SFP module is accessed in the chassis object's life cycle.
To initialize all SFP modules in one shot is waste of time and causes the CLI to take much more time to finish.
So we would like to optimize the initialization flow by introducing a two-phase initialization approach:
- Partial initialization, which means the `chassis._sfp_list` has been initialized with proper length and all elements being `None`
- Full initialization, which means all elements in `chassis._sfp_list` are created
If the relevant function is called,
- `get_sfp`, only partial initialization will be done, and then the specific SFP module is initialized.
- `get_all_sfps` or `get_num_sfps`, full initialization will be done, which means all SFP modules are initialized.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
#### Why I did it
400G media EEPROM and DOM information are not populated properly in DellEMC Z9332f platform.
#### How I did it
Handled QSFP_DD, QSFP28/QSFP+, SFP+ accordingly based on media type detected.
- Why I did it
Updated FW to xx.2008.2526 version.
Fixed issues:
1. Spectrum-2, Spectrum-3 | sFlow | High CPU load and high on fully loaded switch.
2. Spectrum-2, Spectrum-3 | Fine grain LAG | in rare cases doesn’t update the right entry
- How I did it
Updated submodule pointer and version in a Makefile.
- How to verify it
Full regression and bugs validation
Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
#### Why I did it
We want to add the ability for the command `show platform psustatus` to show the serial number and part number of the PSU devices on Mellanox platforms. This will be useful for data-center management of field replaceable units (FRUs) on switches.
#### How I did it
I implemented the platform 2.0 functions `get_model()` and `get_serial()` for the PSU in the mellanox platform API by referencing the sysfs nodes provided by the [hw-management](https://github.com/Azure/sonic-buildimage/tree/master/platform/mellanox/hw-management) module.
#### Why I did it
Adopt a single way to get fan direction for all ASIC types.
It depends on hw-mgmt V.7.0010.2000.2303. Depends on https://github.com/Azure/sonic-buildimage/pull/7419
#### How I did it
Originally, the get_direction was implemented by fetching and parsing `/var/run/hw-management/system/fan_dir` on the Spectrum-2 and the Spectrum-3 systems. It isn't supported on the Spectrum system.
Now, it is implemented by fetching `/var/run/hw-management/thermal/fanX_dir` for all the platforms.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
- Removed the old function for detecting a faulty fan.
- Removed the old function for detecting excess temperature.
- Implement thermal_manager APIs based on ThermalManagerBase
- Implement thermal_conditions APIs based on ThermalPolicyConditionBase
- Implement thermal_actions APIs based on ThermalPolicyActionBase
- Implement thermal_info APIs based on ThermalPolicyInfoBase
- Add thermal_policy.json
The following error message is observed during chassis object being destroyed
"Exception ignored in: <function Chassis.__del__ at 0x7fd22165cd08>
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/sonic_platform/chassis.py", line 83, in __del__
ImportError: sys.meta_path is None, Python is likely shutting down
The chassis tries to import deinitialize_sdk_handle during being destroyed for the purpose of releasing the sdk_handle.
However, importing another module during shutting down can cause the error because some of the fundamental infrastructures are no longer available."
This error occurs when a chassis object is created and then destroyed in the Python shell.
- How I did it
To fix it, record the deinitialize_sdk_handle in the chassis object when sdk_handle is being initialized and call the deinitialize handler when the chassis object is being destroyed
- How to verify it
Manually test.
- Why I did it
Upgrade hw-mgmt to 7.0100.2303
Bug fixes
1. Fan direction feature fix for fixed FAN system (using shell instead of binutils/strings)
2. Remove cpld 4th link on systems with only 3 CPLD's
3. hw-mgmt: thermal: Add hardcoded critical trip point. Follow-up after patch "Removing critical thermal zones to prevent unexpected software system shutdown".
4. Fix sensor attribute mapping to be label based instead of index based to allow common handling of voltage regulator names independently of hardware changes.
5. Update 'lm-sensors' custom configuration file. Relevant only for users utilizing sensors.conf files coming along with hw-management package.
6. For full feature list please follow https://github.com/Mellanox/hw-mgmt/blob/V.7.0010.2300_BR/debian/Release.txt
- How I did it
Update hw-mgmt pointer
Remove unused patches
Fix existing patch to make sure it apply successfully
- How to verify it
Full platform regression on all mellanox platforms
Changes in the new release:
Fix 10G and 50G speeds in SAI XML to support all interface types
Enable SMAC=DMAC and SMAC MC in tunnel debug counter
Add tunnel statistics
Add isolation group API implementation
Fix ACL ANY debug counter to correctly track ACL drops
Add VXLAN source port hard coded range, controlled by K/V
FW dump me now feature
Add mlxtrace to saidump
Speed lane setting and AN control
Implement query stats API
VNI miss part of tunnel decal drop reason
Align with SAI API v1.8.1
Signed-off-by: Dror Prital <drorp@nvidia.com>
Signed-off-by: Stepan Blyschak stepanb@nvidia.com
This PR is part of SONiC Application Extension
Depends on #5938
- Why I did it
To provide an infrastructure change in order to support SONiC Application Extension feature.
- How I did it
Label every installable SONiC Docker with a minimal required manifest and auto-generate packages.json file based on
installed SONiC images.
- How to verify it
Build an image, execute the following command:
admin@sonic:~$ docker inspect docker-snmp:1.0.0 | jq '.[0].Config.Labels["com.azure.sonic.manifest"]' -r | jq
Cat /var/lib/sonic-package-manager/packages.json file to verify all dockers are listed there.
What/Why I did:
Updated sonic-sairedis submodule to use SAI1.8.1
[Submodule update] sonic-sairedis
d821bc0b137264daa01c347700c7c14677cf3370 (HEAD -> master, origin/master, origin/HEAD) [Mellanox] Add SAI template
config support (#803)
bb341e9ea069e974a41930d434d437f522476f29 [syncd] Bring back TimerWatchdog (#821)
badf6cea2650015269420932a9186113d1ad5ec6 Update .gitignore (#822)
1494bc69046ffe7135377844548a11e4168b407c [meta] Mark local function static (#818)
34e961cf39e9af93f492f66640739e1c7a1694c8 [pyext] Fix pyext/py2 library (#820)
0d3749d3a93fd7e59ebb83b49fa1d7e2a56d6cf4 Moved SAI Header to git tag v1.8.1 (#816)
70fff780d529f78b53af4bd104f4932d0c4d8dd6 Added --purge of base docker image packages before installing new ones. (#819)
Updated Broadcom SAI Debian package to 4.3.3.4-2 to use SAI 1.8.1 Header
New features and fixes in the new SDK/FW:
SN4600C | AN/LT support
SN2700 | AN/LT bugs fixes
WJH | FID_MISS support
Signed-off-by: Kebo Liu <kebol@nvidia.com>
#### Why I did it
To build flashrom properly with dependency tracking.
#### How I did it
Moved flashrom code from platform/broadcom/sonic-platform-modules-dell/tools directory to src/flashrom directory.
At the end, flashrom_0.9.7_amd64.deb package is build which will be installed in the devices.
- Support compile sonic arm image on arm server. If arm image compiling is executed on arm server instead of using qemu mode on x86 server, compile time can be saved significantly.
- Add kernel argument systemd.unified_cgroup_hierarchy=0 for upgrade systemd to version 247, according to #7228
- rename multiarch docker to sonic-slave-${distro}-march-${arch}
Co-authored-by: Xianghong Gu <xgu@centecnetworks.com>
Co-authored-by: Shi Lei <shil@centecnetworks.com>
- Fix build issue when `/proc/cmdline` is not available. Fixes#7145
- Properly detect linecard slot from linecard CPU
- Some fixes on the uart link between supervisor and linecard CPU
- Other small fixes
#### Why I did it
- xcvrd crash was seen in latest 201811 images.
- For Dell S6100,API 2.0 uses poll mode while 1.0 was still using interrupt mode.
#### How I did it
- Modified get_transceiver_change_event in 1.0 to poll mode.
- Why I did it
Add missed files for dynamic buffer calculation for ACS-MSN3420 and ACS-MSN4410
- How I did it
asic_table.j2: Add mapping from platform to ASIC
Add buffer_dynamic.json.j2 for ACS-MSN4410.
- How to verify it
Check whether the dynamic buffer calculation daemon starts successfully.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Platform API
- Fix Watchdog get_remaining_time logic
- Improve Sfp platform API implementation
- Improve EepromDecoder API implementation
- Fix mismatch between Fan name and platform.json
- Add PSU get_maximum_supplied_power
Internal
- Refactor of Xcvr declaration and initialization
- Cleanup of Resets and Gpios
- Add platform library versioning to enhance support capabilities
- Allow supervisor to manage cards from slot 2
- Miscelanous cleanups and refactors
Signed-off-by: Samuel Angebault <staphylo@arista.com>
* 7260cx3 DualToR config.bcm support based on DualToR setting in device metadata at boot time.
For HWSKU Arista-7260CX3-C64 the MMU setting SOC for T0/T1 is also combined into the config.bcm.j2 logic so use just one config file and adding delta based on Switch Roles.
Integrate hw-management package V.7.0010.2002
Bug fixes:
Removing critical thermal zones to prevent unexpected software system shutdown:
*Kernel 4.9 -0071-mlxsw-core-Remove-critical-trip-point-from-thermal-z.patch
*Kernel 4.19 -076-mlxsw-core-Remove-critical-trip-point-from-thermal-z.patch
Removing redundant link for cpld3 for fixed systems (SN2100, SN2010).
Fix an issue with missed attribute for cpld3 (port CPLD) for SN2700, SN2410.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Build Marvell kernel driver for prestera sai sdk
Builds interrupt and dma kernel driver
Removed the older method pre-compiled kernel module debian package and its makefile
To add latest SAI drop REL_4.3.3.3 to SONIC which addresses the following CSP cases:
CS00012058054: [4.3][IPinIP][TTL-PIPE] IPinIP TTL Pipe Mode is NOT working it is behaving UNIFORM mode even programed as PIPE mode
CS00011227466: [4.3] Warmboot support with tunnel encap
Fix the following issues:
Spectrum-2, Spectrum-3 | Port | Fix link issue when using 25 GbE rate between two ports while one is on Spectrum-2-based system and the other is on Spectrum-3-based system
All | warmboot | fail to upgrade from earlier SONiC versions with official SDK/FW 4.4.2306 (was on SONiC 201911)
All | What-Just-Happened | When enabling or disabling WJH under high traffic load to the host CPU, in very specific and low probability conditions, an error could occur, that may result in loss of data, channel failure or in extreme cases SW failure
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
To improve management of docker-gbsyncd-vs. gbsyncd_startup.py simply spawned syncd processes and then exited. In that case, supervisord would no longer manage any processes in the container, and thus there was no way to know if a critical process had exited.
I recently created gbsyncdmgrd to be a more complete, robust replacement for gbsyncd_startup.py.
NOTE: This PR is dependent on the inclusion of gbsyncdmgrd in the sonic-sairedis repo. A submodule update is pending at
#7089
The psample module was not loaded on barefoot platform. The loading of this module is a prerequisite for testing SFlow.
* add `.gitignore` to the `barefoot` subdirectory to overwrite ignore "platform/**/debian/*" in the root directory
Initialize fans and thermals lists on demand; make them properties in order to reduce Chassis object initialization time
Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com>
- Why I did it
The existing Fan led and Psu led object initialize itself to green color in init method. However, there are multiple daemons calls sonic platform API and there could be a case that:
A PSU is removed from system
Reboot switch
psud detects that 1 PSU is missing and set PSU led to red
Other daemon just start up and call sonic platform API, the API set PSU led to green by call PsuLed.init
This PR is a partial fix for the issue. As we also need guarantee that the led is initialized with a correct value. I checked existing psud and thermalctld code. psud always initialize the PSU led color on boot up, thermalcltd need some changes to initialize led color on the first run
- How I did it
Remove the led color initialization code from FanLed.init and PsuLed.init
- How to verify it
Manual test
Eliminate the need for `gbsyncd_start.sh`, which simply calls `exec "/usr/bin/gbsyncd_startup.py"`. The shell script is unnecessary.
Once this PR merges, we can remove `gbsyncd_start.sh` from the sonic-sairedis repo.
There was a change to replace platform utils with sonic platform API in psuutil. However, psu API is not initialized on host side. The PR is to fix it.
Use udevadm to trigger the udev rules on the first boot
How to verify:
- Connect C0 with E1031;
- Install or upgrade the sonic os to 202012 branch;
- When access to sonic check if /dev/C0-1 to /dev/C0-48 are existed.
- Why I did it
To pick up new features and fix from SDK/FW and SAI
SDK/FW new Feature:
All | Added support for multiple modules and cable types. For full list contact Nvidia networking support
Spectrum-3 | SN46000C | Added support for up to 5W on ports 49 to 64 .
SDK/FW bugs' fix:
All | fast reboot | fast boot failure from latest 201811 to 201911 and above
Spectrum | 10GbE/1GbE Transceiver (FTLX8574D3BCV) stopped working after firmware upgrade
Spectrum-2 | When device is rebooted with locked Optical Transceivers in split mode, the firmware may get stuck
Spectrum-2 | SN3700 | When connecting at 200GbE to Ixia K400, Ixia receives CRC errors
Spectrum-2 | SN3800 | On rare occasions packets loss may be experienced due to signal integrity issues
Spectrum-2 | When the port is a member of a LAG, after a warmboot and port toggle on the peer-side, the port remains down
Spectrum-3 | SN4700 | While using Optic cable in Split 4x1 mode in PAM4, when two first ports are toggled, the other 2 ports go down
Spectrum-3 | SN4700 | When working in 400GbE, deleting the headroom configuration (changing buffer size to zero) on the fly may cause continual packet drops
SAI
All | sFlow | Use hardcoded value 1 as netlink group number ax expected by hsflowd
- How I did it
Update the related version number in the make files and update the submodule pointer accordingly.
- How to verify it
Run regression test and everything works good.
#### Why I did it
- Python3 compatibility changes for PDDF eeprom class
- Adding API for temperature in PDDF psu class
- PEP8 standard changes and adding missing method in PDDF sfp class
#### How I did it
- Using python3 to invoke the sonic_platform module in PDDF based platform
- Running autopep8 tool to comply to PEP8 standards
#### Why I did it
Recently, CLI sfputil replace the old sonic platform utils with sonic platform API. However, sonic platform API does not support SFP low power mode and reset related operation. The PR is to fix it.
The change to replace platform utils with sonic platform API was reverted on 202012, once this PR is merged, we can cherry-pick these two PRs to 202012 together.
#### How I did it
In low power mode and reset related operation, use "docker exec" if the command is running on host side.
- Why I did it
Update MFT tool version to 4.16.0
Bugs fixes:
mlxlink: Fixed an issue that caused the margin scan to fail with the following message: Eye scan not completed.
mlxcable: Cable firmware burning capability is not supported.
New features:
mlxlink: Enabled margin scan on Network links.
mlxlink: Added PRBS TX/RX polarity inversion using the following flags: --invert_tx_polarity / --invert_rx_polarity
- How I did it
Update MFT make file with new version number.
- How to verify it
Build image and test related functions on Mellanox platform
Incorporate the below changes in DellEMC Z9332F platform:
- Implemented watchdog platform API support
- Implement ‘get_position_in_parent’, ‘is_replaceable’ methods for all device types
- Change return type of SFP methods to match specification in sonic_platform_common/sfp_base.py
- Added platform.json file in device directory.
Co-authored-by: V P Subramaniam <Subramaniam_Vellalap@dell.com>
- Why I did it
To collect platform based logs along with "show techsupport" on S6000 and S6100 plaforms.
- How I did it
On branch dell_techsupport_dump
Changes to be committed:
(use "git reset HEAD ..." to unstage)
new file: platform/broadcom/sonic-platform-modules-dell/common/actions.sh
modified: platform/broadcom/sonic-platform-modules-dell/debian/platform-modules-s6000.install
modified: platform/broadcom/sonic-platform-modules-dell/debian/platform-modules-s6100.install
new file: platform/broadcom/sonic-platform-modules-dell/s6000/scripts/hw-management-generate-dump.sh
new file: platform/broadcom/sonic-platform-modules-dell/s6100/scripts/hw-management-generate-dump.sh
- How to verify it
hw-mgmt-dump.tar.gz will be found in sonic_dump__< YYYYMMDD_HHMMSS>.tar.gz.
#### Why I did it
- The xcvrd service requires an event detection function, unplug or plug in the transceiver.
#### How I did it
- Add sysfs interrupt to notify userspace app of external interrupt
- Implement get_change_event() in chassis api.
- Also begin installing Python 3 sonic-platform package for Celestica platforms
- Provide `hw-management-generate-dump.sh` for `show techsupport`
- Load `optoe3` for OSFP and QSFP-DD transceivers
- Enhance reboot-cause caching robustness
Signed-off-by: Samuel Angebault <staphylo@arista.com>
- Made python2 to python3 changes
- Removed ord() func as python3 return int instead of str
- Had to change chr(..) to bytes([..]) function while using ctypes class methods
- Why I did it
Bug fixes
- In rare cases when thermal algorithm is reactivated after FAN/PSU insertion, FAN remains at high rpm
- When stop hw-management code received error in the log instead of exit code '0'.
- In SPC1 i2c sometimes collide with chip reset coming from SDK
- Remove raw eeprom data link, when working with PSU which don't have eeprom for "msn274x", "msn24xx" and "msn27xx" systems
- Fix memory leak on mlxsw_core_bus_device module removal
- How I did it
Update the hw-mgmt version number in the make file
Update the hw-mgmt repo pointer
- How to verify it
run platform related test cases on all Mellanox platform
Signed-off-by: Kebo Liu <kebol@nvidia.com>
To fix [DPB| wrong aliases for interfaces](https://github.com/Azure/sonic-buildimage/issues/6024) issue, implimented flexible alias support [design doc](https://github.com/Azure/SONiC/pull/749)
> [[dpb|config] Fix the validation logic of breakout mode](https://github.com/Azure/sonic-utilities/pull/1440) depends on this
#### How I did it
1. Removed `"alias_at_lanes"` from port-configuration file(i.e. platfrom.json)
2. Added dictionary to "breakout_modes" values. This defines the breakout modes available on the platform for this parent port, and it maps to the alias list. The alias list presents the alias names for individual ports in order under this breakout mode.
```
{
"interfaces": {
"Ethernet0": {
"index": "1,1,1,1",
"lanes": "0,1,2,3",
"breakout_modes": {
"1x100G[40G]": ["Eth1"],
"2x50G": ["Eth1/1", "Eth1/2"],
"4x25G[10G]": ["Eth1/1", "Eth1/2", "Eth1/3", "Eth1/4"],
"2x25G(2)+1x50G(2)": ["Eth1/1", "Eth1/2", "Eth1/3"],
"1x50G(2)+2x25G(2)": ["Eth1/1", "Eth1/2", "Eth1/3"]
}
}
}
```
#### How to verify it
`config interface breakout`
Signed-off-by: Sangita Maity <samaity@linkedin.com>
In preparation for the merging of Azure/sonic-platform-common#173, which properly defines class and instance members in the Platform API base classes.
It is proper object-oriented methodology to call the base class initializer, even if it is only the default initializer. This also future-proofs the potential addition of custom initializers in the base classes down the road.
In preparation for the merging of Azure/sonic-platform-common#173, which properly defines class and instance members in the Platform API base classes.
It is proper object-oriented methodology to call the base class initializer, even if it is only the default initializer. This also future-proofs the potential addition of custom initializers in the base classes down the road.
Modified the MRVL SAI debian version format to include debian revision number. This helps in identifying the SAI deb causing any build/runtime issue.
Signed-off-by: Rajkumar Pennadam Ramamoorthy <rpennadamram@marvell.com>
- Why I did it
System is stuck on 'starting' state on SimX platform because of infinite loop on 'hw-management-ready.sh' script .
The loop is polling to check if the hw-mgmt sysfs created before proceeding with the flow, for SimX platform the sysfs will never create so the system is not starting properly.
- How I did it
Add a condition to poll on hw-mgmt sysfs only if the switch is real HW and not SimX platform.
- How to verify it
Check "systemctl status hw-management.service" output on a SimX switch with this patch, the state will be "active".
Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
- Why I did it
Add support for new 64x200G SN4600 systems
- How I did it
Add all relevant files (w/o platform.json and hwsku.json as they will come later) with default SKU.
- How to verify it
Install image on switch, verify all ports are up and configured properly, run full platform SONiC tests.
#### Why I did it
To incorporate the below changes in DellEMC S5232, Z9264, Z9332 platforms.
- Update thermal high threshold values
- Make watchdog API Python2 and Python3 compatible
- Fix LGTM alerts
- Z9264: Fix get_change_event timer value
#### How I did it
- Use 'universal_newlines=True' in subprocess.Popen call.
- Change the timeout in 'get_change_event' to milliseconds to match specification in sonic_platform_common/chassis_base.py
- Fix parent `__init__` call for platform API, based on Azure/sonic-platform-common#173
- Implementation for more platform API methods
- Daily storage information reporting in `arista daemon`
- Enhancements to reboot cause reporting, now multi-sourcing reboot cause information
- Miscellaneous fixes
The S6000 devices, the cold reboot is abrupt and it is likely to cause issues which will cause the device to land into EFI shell. Hence the platform reboot will happen after graceful unmount of all the filesystems as in S6100.
Moved the platform_reboot to platform_reboot_override and hooked it to the systemd shutdown services as in S6100
- Why I did it
Mellanox SDK APIs support python 2 at the moment.
- How I did it
Mellanox SDK APIs support python 2 at the moment.
- How to verify it
Add python 2 to Mellanox syncd only.
- Which release branch to backport (provide reason below if selected)
docker exec -t syncd /bin/bash -c "sx_api_dbg_generate_dump.py /home/sx_api_dbg_dump"
You can see that it will work and generate /home/sx_api_dbg_dump
Signed-off-by: allas <allas@nvidia.com>
- Improve sonic-mgmt platform test suite pass rate
- Improve coverage of platform unit tests
- Provide platform specific reboot logic as per platform porting guide
- Fix bug due to pcie.yaml file being located in the wrong directory
#### Why I did it
Fix runtime issues caused by SONiC update
#### How I did it
- new attribute SAI_ACL_ENTRY_ATTR_FIELD_ACL_IP_TYPE supported
- new attribute SAI_SWITCH_ATTR_AVAILABLE_IPMC_ENTRY supported
Signed-off-by: Roman Savchuk <romanx.savchuk@intel.com>
- Why I did it
The pcie configuration file location is under plugin directory not under platform directory.
#6437
- How I did it
Move all pcie.yaml configuration file from plugin to platform directory.
Remove unnecessary timer to start pcie-check.service
Move pcie-check.service to sonic-host-services
- How to verify it
Verify on the device
- Add support for `DCS-7050SX3-48YC8` and `DCS-7050SX3-48C8` platform
- Add support for more variants of `DCS-7280CR3-32[PD]4`
- Add Supervisor to Linecard consutil support
- Complete Watchdog platform API support
- Fix some PSU behavior on `DCS-7050QX-32` and `DCS-7060CX-32S`
- Fix SEU management on `DCS-7060CX-32S`
- Allow kernel modules to build up to linux 5.10
- Rename led color `orange` to `amber`
- Miscellaneous fixes
Open ACL Outer VLAN ID for egress for ports part of VLAN RIF
- Why I did it
Open ACL Outer VLAN ID for egress for ports part of VLAN RIF
- How I did it
Updated SAI submodule pointer
- How to verify it
Build an image, deploy and check all is up and running.
Verify ACL sonic-mgmt test is passing
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
To have the following fixes:
* All | Port status remains down after warm boot and flapping the port on peer side
* All | LAG HASH | IPv6 SRC_IP is not accounted in LAG hashing [
* All | ASIC driver | Kernel crash observed when driver reload is initiated before it fully loaded
* Spectrum-3 | Buffer | In lossless configuration, headroom is been evicted only when the shared buffers is free
* All | prevent FW access during ISSU
Signed-off-by: Volodymyr Samotiy <volodymyrs@nvidia.com>
Azure/sonic-utilities#1431 changes the path to the udevprefix.conf file. The file previously inappropriately resided in the <platform>/plugins/ directory. That directory is reserved for now-deprecated Python platform plugins, and will be removed in the near future.
This PR is needed to fix the show interface counters output issue where the counters are not correct due to an issue introduced in BRCM SAI 4.3.
- How to verify it
Without the fix if one injects packets, the expected counters for the interfaces involved do not show correct count values. The RX count looks to be TX count while TX count looks to be RX count but even that the values could not be trusted.
After the fix the counters started to look correct. Here is one sample output taken after the fix is applied where I manually injected 10,000 packets into Ethernet92 to be routed out of the port channel member port Ethernet40. Also injected 10,000 invalid packets into the same Ethernet92 and all 10,000 packets were shown RX_DRP correctly.
- Why I did it
Enable platform API tests to run successfully by providing required test infrastructure files along with supporting changes.
- How I did it
Added platform.json along with supporting changes.
- Addition of pcie.yaml supporting pcied
- Addition of Real fan drawer support vs Virtual
- Removal of python2 wheel with support in place for python3
- supporting changes platform api tests
platform modules' deb package rules for x86_64-accton_wedge100bf_65x-r0 was missing the code that installs the sonic_platform wheel package file under device directory on host (which is checked by pmon deamon on start)
Merged bcmsai 4.3.0.13 code to top of master bcmsai 4.3.0.10-5.
- How to verify it
Ran nightly regression on T0 and T1 topology using bcmsai 4.3.0.13. Test results are better than previous runs.
For T0 -
New test passing - CRM, Decap, FDB, platform test, VxLAN
New test failing – PFC unknown MAC
For T1 –
New test passing – Port Channel
New test failing – platform test
Include SAI bug fixes:
- Apply device MAC on port host interface when port is removed from LAG.
- [Shared Headroom]: fixed watermark handling for SHP flow
- Decrease verbosity of policer unbind message when no policer is attached
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
This PR is to fix issue: #6603
The CP210x driver not attached properly after first login (cold-plug), lsusb and dmesg shown that the usb devices has been recognized but driver is missing.
Submodule commits included:
* src/sonic-platform-common 6ad0004...bd4dc03 (1):
> [sonic_sfp/qsfp_dd.py] Update DOM capability method name to align with other drivers (#163)
Also align all calling function names to match.
When Building syncd-rpc, libthrift has dependency on libboost-atomic1.71.0,
however the debian packager install version 1.67 instead. This PR
preinstalls libboost-atomic v 1.71 to avoid falling back to v 1.67.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
- Why I did it
To move ‘sonic-host-service’ which is currently built as a separate package to ‘sonic-host-services' package.
- How I did it
- Moved 'sonic-host-server' to 'src/sonic-host-services' and included it as part of the python3 wheel.
- Other files were moved to 'src/sonic-host-services-data' and included as part of the deb package.
- Changed build option ‘INCLUDE_HOST_SERVICE’ to ‘ENABLE_HOST_SERVICE_ON_START’ for enabling sonic-hostservice at boot-up by default.
ACL entry set attribute updates all the entries in the table. The correct behavior is to set the attribute on single entry.
- How I did it
Current SDK code, while setting the new attribute, is going through all the entries and updating it. Added a logic to check for requested entry and only allow for that ACL entry.
A case has filed with BRCM. Once an official fix is provided by BRCM, we will then remove this in house fix and apply the official fix.
Fix#6711
the requirement was introduced in commit 75104bb35d
to support sflow in stretch build. in buster build, the requirement
is met, no need to pin down the version.
Signed-off-by: Guohan Lu <lguohan@gmail.com>
**- Why I did it**
After migrating to python3, the operator '/' always get a float result, but it gets integer result in python2. Need fix this in thermal_conditions.
**- How I did it**
1. cast float value to int
2. change the unit test case to cover this situation
**- How to verify it**
Manually test and regression test
Accton util applies lsmod to check if drivers are installed.
But lsmod may return error on startup and skip module installation.
Signed-off-by: roy_lee <roy_lee@edge-core.com>
prerm is needed for platform modules package to be properly removed.
Added prerm to remove installed in postinst wheel packages.
Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com>
**- Why I did it**
PR https://github.com/Azure/sonic-platform-common/pull/102 modified the name of the SFF-8436 (QSFP) method to align the method name between all drivers, renaming it from `parse_qsfp_dom_capability` to `parse_dom_capability`. Once the submodule was updated, the callers using the old nomenclature broke. This PR updates all callers to use the new naming convention.
**- How I did it**
Update the name of the function globally for all calls into the SFF-8436 driver.
Note that the QSFP-DD driver still uses the old nomenclature and should be modified similarly. I will open a PR to handle this separately.
**- Why I did it**
To incorporate the below changes in DellEMC S6100, S6000 platforms.
- S6100, S6000:
- Enable 'thermalctld'
- Implement DeviceBase methods (presence, status, model, serial) for Fantray and Component
- Implement ‘get_position_in_parent’, ‘is_replaceable’ methods for all device types
- Implement ‘get_status’ method for Fantray
- Implement ‘get_temperature’, ‘get_temperature_high_threshold’, ‘get_voltage_high_threshold’, ‘get_voltage_low_threshold’ methods for PSU
- Implement ‘get_status_led’, ‘set_status_led’ methods for Chassis
- SFP:
- Make EEPROM read both Python2 and Python3 compatible
- Fix ‘get_tx_disable_channel’ method’s return type
- Implement ‘tx_disable’, ‘tx_disable_channel’ and ‘set_power_override’ methods
- S6000:
- Move PSU thermal sensors from Chassis to respective PSU
- Make available the data of both Fans present in each Fantray
**- How I did it**
- Remove 'skip_thermalctld:true' in pmon_daemon_control.json
- Implement the platform API methods in the respective device files
- Use `bytearray` for data read from transceiver EEPROM
- Change return type of 'get_tx_disable_channel' to match specification in sonic_platform_common/sfp_base.py
**- Why I did it**
Reduce the time it takes for the ASIC FW burn as part of the automatic FW upgrade procedure.
**- How I did it**
Add -d option to mlxfwmanager tool to use the faster MST device and not the default one which is not the fastest one.
**- How to verify it**
I manually changed ASIC FW followed by reboot command in order for FW upgrade to take place on deinit.
I manually changed ASIC FW followed by hard reset in order for FW upgrade to take place on init.
Signed-off-by: liora <liora@nvidia.com>
**Why I did it**
Disable SDK extended dump due to issue found
**How I did it**
Update SAI submodule
**How to verify it**
Verify the SDK extended dump is not called.
Signed-off-by: Eran Dahan <erand@nvidia.com>
- Why I did it
SONiC design requires sonic_platform package to be installed in SONiC host environment, not only in docker containers.
- How I did it
For now, sonic_platform python wheel package, that is used by pmon, is provided via device-specific platform modules deb packages that unpacks the wheel package file into specific device's directory on lazy-install.
The PR makes deb packages' postinst script also install these unpacked wheel packages to host.
Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com>
1. BRCM SAI Debian build need not have any Kernel version dependency - Starting with 4.3 BRCM made changes in SAI so that this dependency has been cleaned up. We can now remove the Kernel Version dependency from Azure Pipeline build script.
2. Bypass PEER_MODE p2mp setting causing SYNCd crash on non-TD3 SKUs - Temporarily patch BRCM SAI code to not cause SYNCd crash when Orchagent program SAI_TUNNEL_ATTR_PEER_MODE: SAI_TUNNEL_PEER_MODE_P2MP on Non-TD3 SKUs. Will remove this when BRCM provide proper fix to address this issue.
- Why I did it
Fix issue: ptf_nn_agent isn't able to start in syncd-rpc docker on buster.
- How I did it
The issue is fixed by installing python-dev, cffi and nnpy for python 2 explicitly.
- How to verify it
Run copp test on RPC image.
Starting with BRCM SAI 4.3.1.5 we see the following :ethtool not fount" error in syslog during boot up:
```
Jan 27 07:36:14.712472 str-s6100-acs-1 INFO syncd#/supervisord: syncd sh: 1:
Jan 27 07:36:14.712844 str-s6100-acs-1 INFO syncd#/supervisord: syncd ethtool: not found
Jan 27 07:36:14.713228 str-s6100-acs-1 INFO syncd#/supervisord: syncd #015
Jan 27 07:36:14.713840 str-s6100-acs-1 INFO syncd#syncd: [0] SAI_API_HOSTIF:_brcm_sai_hostif_speed_set:11894 cmd ethtool -s Ethernet39 speed 40000 rc:32512
Jan 27 07:36:14.717204 str-s6100-acs-1 NOTICE swss#orchagent: :- setHostIntfsOperStatus: Set operation status DOWN to host interface Ethernet39
Jan 27 07:36:14.717204 str-s6100-acs-1 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet39
Jan 27 07:36:14.717204 str-s6100-acs-1 NOTICE swss#orchagent: :- initializePort: Initializing port alias:Ethernet36 pid:1000000000040
Jan 27 07:36:14.726793 str-s6100-acs-1 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet36 admin:0 oper:0 addr:4c:76:25:f5:48:80 ifindex:75 master:0
Jan 27 07:36:14.727967 str-s6100-acs-1 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet36(ok) to state db
Jan 27 07:36:14.729331 str-s6100-acs-1 NOTICE swss#orchagent: :- addHostIntfs: Create host interface for port Ethernet36
Jan 27 07:36:14.752398 str-s6100-acs-1 INFO syncd#/supervisord: syncd sh: 1: ethtool: not found#015
Jan 27 07:36:14.752689 str-s6100-acs-1 INFO syncd#syncd: [0] SAI_API_HOSTIF:_brcm_sai_hostif_speed_set:11894 cmd ethtool -s Ethernet36 speed 40000 rc:32512
Jan 27 07:36:14.756050 str-s6100-acs-1 NOTICE swss#orchagent: :- setHostIntfsOperStatus: Set operation status DOWN to host interface Ethernet36
Jan 27 07:36:14.757585 str-s6100-acs-1 NOTICE swss#orchagent: :- initPort: Initialized port Ethernet36
```
It seems that starting with BRCM SAI 4.2.1.5 syncd is using ethtool to set the host interface speed and since this ethtool was not part of the syncd Docker, we observe these "ethtool not found" issue.
**- Why I did it**
sonic-utilities will become dependent upon sonic-platform-common as of https://github.com/Azure/sonic-utilities/pull/1386.
**- How I did it**
- Add sonic-platform-common as a dependency in docker-sonic-vs.mk
- Additionally, no longer install Python 2 packages of swsssdk and sonic-py-common, as they should no longer be needed.
Changes in the new release:
1. Policy based hashing optimization
2. New attribute support for Max port headroom
3. Tunnel ECN map fixes
4. Tunnel EVPN skeleton extensions (peer attrib, maps)
5. Bridge port admin not affecting port admin (optimize port down time)
6. CRM new API for neighbors and tunnel termination entries
7. Improve FDB event for flush by bridge port (before, null bridge was reported to SONiC, now the bridge will be extracted from bridge port)
8. DHCP L2 v4+v6 traps (for ZTP use case)
9. Generic counter implementation
Signed-off-by: Kebo Liu <kebol@nvidia.com>
- combine docker-ptf-saithrift into docker-ptf docker
- build docker-ptf under platform vs
- remove docker-ptf for other platforms
Signed-off-by: Guohan Lu <lguohan@gmail.com>
During ISSU, "mlxsw_minimal" driver still trying to access firmware, in some cases FW could return some wrong critical threshold value which will cause switch shutdown.
**- How I did it**
In order to prevent "mlxsw_minimal" driver from accessing ASIC during ISSU, SDK will raise "OFFLINE" 'udev' event
at the early beginning of such flow. When this event is received, hw-management will remove "mlxsw_minimal" driver.
There is no need to implement the opposite "ONLINE" event since this flow is ended up with "kexec".
**- How to verify it**
repeatedly perform warm reboot, make sure there is no switch shutdown occurred.
Bugs fixes:
All | Kernel | During system reload when CPU is loaded with heavy traffic, a Kernel Panic may occur.
All | Modules, Port split | FW stuck when device rebooted with locked Optical Transceivers in split mode
Spectrum-3 | PFC | On Spectrum-3 systems, slow reaction time to Rx pause packets on 40GbE ports may lead to buffer overflow on servers.
Spectrum-3 | SN4700, Port Split | On rare occasion SN4700, conducting 100G split (4x25G) in NRZ when splitter port 1 or 2 are down, ports 3 and 4 will also go down.
Enahncments:
All | Kernel | new notification on ISSU start, so other kernel drivers can disable any interface to ASIC
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Check #6483
add test to make sure default route change in eth0 does not
affect the default route in the default vrf
Signed-off-by: Guohan Lu <lguohan@gmail.com>
Fixes#6445
Because the ipmihelper.py script in the 9332 folder is slightly different than the common one (due to LGTM fixes), when the common one gets copied during build time it causes the workspace/build to become dirty.
Signed-off-by: Danny Allen <daall@microsoft.com>
- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.
Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.
- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:
The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.
If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.
- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.
Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.
- Which release branch to backport (provide reason below if selected)
201811
201911
[x ] 202006
**- Why I did it**
On the Mellanox platform, reboot cause is fetched from some certain sysfs which is created by the hw-management service. So determine-reboot-cause service shall start after hw-management, otherwise it could fail due to the related sysfs is not available yet.
**- How I did it**
Add a patch to the hw-management service to make sure determine-reboot-cause service should start after it.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
**- Why I did it**
- The thermalctld daemon on the Pmon docker requires support from the thermal manager API.
**- How I did it**
- Removed the old function for detecting a faulty fan.
- Removed the old function for detecting excess temperature.
- Implement thermal_manager APIs based on ThermalManagerBase
- Implement thermal_conditions APIs based on ThermalPolicyConditionBase
- Implement thermal_actions APIs based on ThermalPolicyActionBase
- Implement thermal_info APIs based on ThermalPolicyInfoBase
- Add thermal_policy.json
It's been reported that accton fan monitor process keeps consuming memory after few days.
The amount of memory occupied increases in linear and never leased.
Signed-off-by: roy_lee <roy_lee@edge-core.com>
1. Fixes the missing DPKG file for gbsyncd-vs package
2. Fixes the softlink issue on the Platform-common and ztp package
3. Fixes the PYTHNON_DEBS list is missing for DBG dockers.
- Improve random number generation during early Sonic initialization by providing SW updates to Linux entropy value.
- Improve handling of platform In-Band management port
This commit provides the following updates to the Nokia ixs7215 platform
1. The Marvell Armada-38x SOC requires SW assistance to improve the system
entropy value available early on in the Sonic boot sequence.
2. The Nokia ixs7215 platform does not have a dedicated Out-Of-Band (OOB) mgmt
port and thus requires additional logic to optionally support configuring
front panel port 48 as an In-Band mgmt port. This commit provides additional
logic to manage and maintain the operation of this In-Band mgmt port.
fix platform driver breakage due to python3 upgrade and fix load minigraph errors with config load_minigraph -y
**- How I did it**
added python3-smbus to the pmon docker template since the previous was python2 specific
fixed additional "ord" python2 specific code
fixed the jinja templates used by qos reload - the template logic required data to be parsed
**- How to verify it**
run "show platform XXX" commands and verify output
run "sudo config load_minigraph -y" and verify configuration
run "show interfaces XXX" and verify output
Co-authored-by: Carl Keene <keene@nokia.com>
Centec syncd have beend upgraded to buster, docker-syncd-centec-rpc do not need generate stretch based docker.
Co-authored-by: Xianghong Gu <xgu@centecnetworks.com>
In rare case can see that xcvrd failed due to "UnboundLocalError: local variable 'label_port' referenced before assignment"
Init "label_port" as None at the beginning of the function, to avoid the case that "label_port" not assigned.
Fix#119
when parallel build is enable, multiple dpkg-buildpackage
instances are running at the same time. /var/lib/dpkg is shared
by all instances and the /var/lib/dpkg/updates could be corrupted
and cause the build failure.
the fix is to use overlay fs to mount separate /var/lib/dpkg
for each dpkg-buildpackage instance so that they are not affecting
each other.
Signed-off-by: Guohan Lu <lguohan@gmail.com>
* Enable telemetry for ARM64 by default
* [Centec]Upgrade Centec syncd docker to buster; libjemalloc2 have been installed in docker-base-buster, remove libjemalloc1 from docker-syncd-centec's Dockerfile.j2
Co-authored-by: Gu Xianghong <xgu@centecnetworks.com>
- Make PDDF code compliant with both Python 2 and Python 3
- Align code with PEP8 standards using autopep8
- Build and install both Python 2 and Python 3 PDDF packages
Starting from build (master) 176 the warm reboot on BRCM Platform started to experience syncd crash. Upon further debug by Ying it was determined that the crash was related to the following new change:
[Dynamic buffer calc] Support dynamic buffer calculation (#1338)
Ying also debugged further and found The crash was caused by buffer pool profile setting operation SAI_BUFFER_PROFILE_ATTR_SHARED_DYNAMIC_TH
A case has filed with BRCM while a potential fix was tried by Ying that seems to have addressed this issue and we are making this change available in master branch so that it will allow further feature validation/testing especially in the warm reboot area.
Once an official fix is provided by BRCM, we will then remove this in house fix and apply the official fix.
- How to verify it
Just perform warm reboot with any master code 175 or above you should see this issue or issue the following cmd will also cause the crash: "mmuconfig -p egress_lossy_profile -a 0"
- Dynamically change EEPROM driver based on media type.
- Otherwise, EEPROM INFO and DOM INFO might not be fetched properly and will result in erroneous output.
combine multiple same operation into one operation to reduce
the build steps. this is to avoid max depth exceeded issue
in the build.
Signed-off-by: Guohan Lu <lguohan@gmail.com>
Add a systemd dependency to make platform-modules service as a prerequisite for determine-reboot-cause service to ensure platform initialization is complete before determine-reboot-cause.service executes.
the control file version is 4.2.1.5-7
sonic$ sudo dpkg -i target/debs/buster/libsaibcm-dev_4.2.1.5-8_amd64.deb
(Reading database ... 175880 files and directories currently installed.)
Preparing to unpack .../libsaibcm-dev_4.2.1.5-8_amd64.deb ...
Unpacking libsaibcm-dev (4.2.1.5-7) over (4.2.1.5-7) ...
Setting up libsaibcm-dev (4.2.1.5-7) ...
lgh@491d842369cf:/sonic$ sudo dpkg -i target/debs/buster/libsaibcm_4.2.1.5-8_amd64.deb
(Reading database ... 175880 files and directories currently installed.)
Preparing to unpack .../libsaibcm_4.2.1.5-8_amd64.deb ...
Unpacking libsaibcm (4.2.1.5-7) over (4.2.1.5-7) ...
Setting up libsaibcm (4.2.1.5-7) ...
Processing triggers for libc-bin (2.28-10) ...
Signed-off-by: Guohan Lu <lguohan@gmail.com>
Accton util applies lsmod to check if drivers are installed.
But lsmod may return error on startup and skip module installation.
Signed-off-by: Brandon Chuang <brandon_chuang@edge-core.com>
Features:
Spectrum-3 | Systems | Added GA-level support for SN4700 A0 system
All | Shared headroom | Added GA-level support for Shared headroom between PGs
Bugs fixes:
All | Counters | Sent traffic in certain size is wrongly increase to a smaller size counter, because port extended counter has a counter for sent traffic per packet-size range
All | Shared buffer | Configuring shared buffer on the fly may, on occasion, cause the chip to get stuck
Spectrum-2 | Modules | On occasion, link down is experienced with INPHI COLORZ PAM4 100G optic cables on SN3700 systems
- Why I did it
Make EEPROM platform APIs Python3 compliant in Nokia platform.
- How I did it
Handle bytearray type returned by read_eeprom and read_eeprom_bytes methods.
- How to verify it
Boot Nokia ixs7215 and verify PMON docker running and show platform syseeprom
Co-authored-by: Carl Keene <keene@nokia.com>
Introduce tunnel manager daemon. Start the process as part of swss container
Submodule update for swss:
9ed3026 - 2020-12-24 : [NAT] ACL Rule with DO_NOT_NAT action is getting failed. (#1502) [Akhilesh Samineni]
c39a4b1 - 2020-12-23 : Mux/IPTunnel orchagent changes (#1497) [Prince Sunny]
bc8df0e - 2020-12-23 : Add support for headroom pool watermark (#1567) [Neetha John]
**- Why I did it**
As part of migrating SONiC codebase from Python 2 to Python 3
**- How I did it**
- No longer install Python 2 in docker-base-buster or docker-config-engine-buster.
- Install Python 2 and pip2 in the following containers until we can completely eliminate it there:
- docker-platform-monitor
- docker-sonic-mgmt-framework
- docker-sonic-vs
- Pin pip2 version <21 where it is still temporarily needed, as pip version 21 will drop support for Python 2
- Also preform some other cleanup, ensuring that pip3, setuptools and wheel packages are installed in docker-base-buster, and then removing any attempts to re-install them in derived containers
* [Mellanox] Update SAI to 1.18.0
* [Mellanox] Update SDK to 4.4.2112
* Updated Mellanox SAI to 1.18.0.2
* Updated bcmsai debians to use SAI 1.7.1
* Updated Mellanox to use SAI 1.7.1
* Updated submodule sonic-sairedis using SAI 1.7.1
Co-authored-by: Vineet Mittal <vmittalmittal@microsoft.com>
Co-authored-by: Nazarii Hnydyn <nazariig@nvidia.com>
bug fix: #5914
Validated for tx_disable function of SFP+ on AS7312-54X, AS5812-54X, AS5712-54x, and AS5812-54x.
Signed-off-by: roy_lee <roy_lee@edge-core.com>
- Why I did it
To upgrade brcm syncd to buster
- How I did it
Updated BCM SAI using kernel version 4.19.0-12 and debian 10 to support buster.
Updated syncd docker from stretch to buster in sonic-buildimage
- How to verify it
Ensured docker is running synd buster.
After upgrade, ensured all BGP peers and ip interfaces are up.
Ping to BGP neighbors is working fine.
Install the necessary python3 dependent packages to convert restore_neighbor.py
to support python3 as python2 is EOL. See: Azure/sonic-swss#1542
Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
In marvell_et644m platform scripts, I have added a check to confirm the file availability before accessing it.
Signed-off-by: Sabareesh Kumar Anandan <sanandan@marvell.com>
**- Why I did it**
To support dynamic buffer calculation.
This PR also depends on the following PRs for sub modules
- [sonic-swss: [buffermgr/bufferorch] Support dynamic buffer calculation #1338](https://github.com/Azure/sonic-swss/pull/1338)
- [sonic-swss-common: Dynamic buffer calculation #361](https://github.com/Azure/sonic-swss-common/pull/361)
- [sonic-utilities: Support dynamic buffer calculation #973](https://github.com/Azure/sonic-utilities/pull/973)
**- How I did it**
1. Introduce field `buffer_model` in `DEVICE_METADATA|localhost` to represent which buffer model is running in the system currently:
- `dynamic` for the dynamic buffer calculation model
- `traditional` for the traditional model in which the `pg_profile_lookup.ini` is used
2. Add the tables required for the feature:
- ASIC_TABLE in platform/\<vendor\>/asic_table.j2
- PERIPHERAL_TABLE in platform/\<vendor\>/peripheral_table.j2
- PORT_PERIPHERAL_TABLE on a per-platform basis in device/\<vendor\>/\<platform\>/port_peripheral_config.j2 for each platform with gearbox installed.
- DEFAULT_LOSSLESS_BUFFER_PARAMETER and LOSSLESS_TRAFFIC_PATTERN in files/build_templates/buffers_config.j2
- Add lossless PGs (3-4) for each port in files/build_templates/buffers_config.j2
3. Copy the newly introduced j2 files into the image and rendering them when the system starts
4. Update the CLI options for buffermgrd so that it can start with dynamic mode
5. Fetches the ASIC vendor name in orchagent:
- fetch the vendor name when creates the docker and pass it as a docker environment variable
- `buffermgrd` can use this passed-in variable
6. Clear buffer related tables from STATE_DB when swss docker starts
7. Update the src/sonic-config-engine/tests/sample_output/buffers-dell6100.json according to the buffer_config.j2
8. Remove buffer pool sizes for ingress pools and egress_lossy_pool
Update the buffer settings for dynamic buffer calculation
python2 is end of life and SONiC is going to support python3. This PR is going to support:
1. Mellanox SONiC platform API python3 support
2. Install both python2 and python3 verson of Mellanox SONiC platform API or pmon and host side
- Enhance eeprom parsing robustness on corrupted fields
- Add chassis provisioning service
- Disable CPU sleep state on some systems
- Complete refactor for FanSlots
- Fix module unload while still in use
Barefoot platform vendors' sonic_platform packages import the Python 'thrift' library. Previously, our custom-built package was being installed in the PMon container and host OS. However, we are only building a Python 2 version of that package, which was only intended for use with saithrift.
Fixes#6077
Changes for supporting vstest for VOQ system ports. The changes include:
(1)Use of chassis_db.json is avoided since the SYSTEM_PORT is made
available in virtual chassis linecard's default_config.json which will
be loaded during bootup
(2)Core port index map file is introduced and is copied from virtual chassis
directory to hwsku direcory by start.sh
(3)vs sai profile is modified to include core port index map file name
Signed-off-by: vedganes <vedavinayagam.ganesan@nokia.com>
- Why I did it
For determining reboot-cause while running newer BIOS, SMF firmware.
- How I did it
Made changes in reboot-cause determination script to add support for behavior of newer firmware.
- How to verify it
Performed different type of resets and verified "show reboot-cause" provides the correct reason.
Logs: UT_logs.txt
- Description for the changelog
DellEMC S6100: Update reboot-cause determination to support new firmware
EEPROM cache file is not refreshed after install a new ONIE version even if the eeprom data is updated. The current Eeprom class always try to read from the cache file when the file exists. The PR is aimed to fix it.
- Change return type of SFP methods to match specification in sonic_platform_common/sfp_base.py
- Use init methods of base classes to initialize common instance variables
- Handle negative timeout values in S6100's watchdog ‘arm’ method
- Return appropriate values for 'get_target_speed', 'set_status_led' to avoid false warnings
- Implement platform module API
- Implement platform drawer API
- Complete and fix a few other platform API
- Add psu and fan support for chassis
- Fix dependency issue with determine-reboot-cause
* Update 50-ttyUSB-C0.rules
Remove some debug codes
* Update popmsg.sh
remove the debug info
* Update udev_prefix.sh
remove debug info
* Update platform-modules-haliburton.postinst
To enable the execute permission of udev scripts
* Update 50-ttyUSB-C0.rules
Fix port 13 - 48 plug out but not pop out message issue.
Signed-off-by: Jing Kan jika@microsoft.com
- Why I did it
Fix the i2c device to address conflicts behind the PCA9548 switch.
- How I did it
Load the i2c-mux-pca954x with parameter force-deselect-on-exit=1.
Submodule updates include the following commits:
* src/sonic-utilities 9dc58ea...f9eb739 (18):
> Remove unnecessary calls to str.encode() now that the package is Python 3; Fix deprecation warning (#1260)
> [generate_dump] Ignoring file/directory not found Errors (#1201)
> Fixed porstat rate and util issues (#1140)
> fix error: interface counters is mismatch after warm-reboot (#1099)
> Remove unnecessary calls to str.decode() now that the package is Python 3 (#1255)
> [acl-loader] Make list sorting compliant with Python 3 (#1257)
> Replace hard-coded fast-reboot with variable. And some typo corrections (#1254)
> [configlet][portconfig] Remove calls to dict.has_key() which is not available in Python 3 (#1247)
> Remove unnecessary conversions to list() and calls to dict.keys() (#1243)
> Clean up LGTM alerts (#1239)
> Add 'requests' as install dependency in setup.py (#1240)
> Convert to Python 3 (#1128)
> Fix mock SonicV2Connector in python3: use decode_responses mode so caller code will be the same as python2 (#1238)
> [tests] Do not trim from PATH if we did not append to it; Clean up/fix shebangs in scripts (#1233)
> Updates to bgp config and show commands with BGP_INTERNAL_NEIGHBOR table (#1224)
> [cli]: NAT show commands newline issue after migrated to Python3 (#1204)
> [doc]: Update Command-Reference.md (#1231)
> Added 'import sys' in feature.py file (#1232)
* src/sonic-py-swsssdk 9d9f0c6...1664be9 (2):
> Fix: no need to decode() after redis client scan, so it will work for both python2 and python3 (#96)
> FieldValueMap `contains`(`in`) will also work when migrated to libswsscommon(C++ with SWIG wrapper) (#94)
- Also fix Python 3-related issues:
- Use integer (floor) division in config_samples.py (sonic-config-engine)
- Replace print statement with print function in eeprom.py plugin for x86_64-kvm_x86_64-r0 platform
- Update all platform plugins to be compatible with both Python 2 and Python 3
- Remove shebangs from plugins files which are not intended to be executable
- Replace tabs with spaces in Python plugin files and fix alignment, because Python 3 is more strict
- Remove trailing whitespace from plugins files
Some syncd containers are still based on Debian Stretch, and thus do not have Python 3 available. For these containers, we must still rely on Python 2 to run supervisord_dependent_startup and supervisor-proc-exit-listener.