Commit Graph

31 Commits

Author SHA1 Message Date
Junchao-Mellanox
f0ddd102d5
[Mellanox] Add CPU thermal control for Nvidia platforms (#10202)
Why I did it
Add CPU thermal control for Nvidia platforms with heavy CPU load. It is currently enabled only on the 4800 and will be enabled on future platforms.

How I did it
Check CPU pack temperature and update cooling level accordingly
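
The idea can be illustrated with a minimal sketch. It is not the code from this commit: the thresholds, the step size of one level, and the cooling-level range are made-up values chosen only for illustration, and `next_cooling_level` is a hypothetical helper name.

```python
# Hypothetical values; the real thresholds and levels live in the platform code.
CPU_TEMP_LOW = 80        # degrees C: below this, cooling may step down
CPU_TEMP_HIGH = 90       # degrees C: at or above this, cooling steps up
MIN_COOLING_LEVEL = 2
MAX_COOLING_LEVEL = 10


def next_cooling_level(cpu_pack_temp, current_level):
    """Return the cooling level to apply for the given CPU pack temperature."""
    if cpu_pack_temp >= CPU_TEMP_HIGH:
        # Too hot: raise cooling by one step, capped at the maximum level.
        return min(current_level + 1, MAX_COOLING_LEVEL)
    if cpu_pack_temp < CPU_TEMP_LOW:
        # Cool enough: lower cooling by one step, but never below the minimum.
        return max(current_level - 1, MIN_COOLING_LEVEL)
    # Between the two thresholds: keep the current level to avoid oscillation.
    return current_level
```

Keeping a gap between the two thresholds is a design choice of this sketch; it stops the level from flapping when the temperature hovers near a single limit.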

How to verify it
Manual test
Added sonic-mgmt test case, PR link will update later
2022-03-21 09:54:52 -07:00
Junchao-Mellanox
fe59e0f2c0
[Mellanox] Fix issue: thermal zone threshold value 0 causes fan speed stuck at 100% (#10057)
- Why I did it
The SONiC thermal control algorithm compares each thermal zone's temperature with its threshold. Previously, a thermal zone with no thermal sensor could still report a threshold. However, a recent driver patch changed this behavior: a thermal zone with no thermal sensor now returns 0 for its threshold. Such thermal zones need to be ignored.

- How I did it
Ignore thermal zones whose temperature is 0.
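
A rough sketch of the check, under stated assumptions: the tuple layout, the zone names, the `>=` comparison, and the function name `zones_over_threshold` are all hypothetical and are not taken from the actual thermal.py.

```python
def zones_over_threshold(zones):
    """Return the names of thermal zones at or above their threshold.

    zones: iterable of (name, temperature, threshold) tuples (hypothetical layout).
    """
    hot = []
    for name, temperature, threshold in zones:
        if temperature == 0:
            # No sensor behind this zone, so its readings are meaningless.
            # Without this skip, 0 >= 0 would flag the zone as overheated.
            continue
        if temperature >= threshold:
            hot.append(name)
    return hot
```

For example, a zone reporting temperature 0 and threshold 0 is skipped instead of being treated as over its threshold, which is the condition that would otherwise keep the fans at 100%.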

- How to verify it
Added unit test case and Manual test
2022-02-24 12:05:56 +02:00
Junchao-Mellanox
4ae504a813
[Mellanox] Optimize thermal control policies (#9452)
- Why I did it
Simplify the thermal control policy logic and add protection code to the policies so that thermal control keeps working even if the kernel algorithm does not.

- How I did it
Remove unused thermal policies
Add a periodic ASIC temperature check in the thermal policy to make sure ASIC temperature and fan speed are coordinated
Minimum allowed fan speed is now the maximum of the expected fan speeds across all policies
Move some logic from fan.py to thermal.py to make it more readable
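
The "maximum across all policies" rule can be sketched as follows. This is an illustration only: the policy names, the dictionary layout, and the `floor` parameter are assumptions, not the structures used in the commit.

```python
def minimum_allowed_fan_speed(expected_speeds, floor=0):
    """Return the minimum allowed fan speed in percent.

    expected_speeds: mapping of policy name -> fan speed that policy expects
    (hypothetical layout). The most demanding policy wins; `floor` is used
    when no policy asks for anything higher.
    """
    return max([floor, *expected_speeds.values()])
```

Taking the maximum means no policy can lower the fan speed below what another policy considers safe.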

- How to verify it
1. Manual test
2. Regression
2022-01-19 11:44:37 +02:00
Lior Avramov
a9c9f56eeb
[Mellanox] Include CPU board and switch board sensors only on SN2201 system (#9644)
Why I did it
Additional sensors that are needed only on a specific system were recently added to all systems and caused errors.

How I did it
* Include CPU board and switch board sensors only on SN2201 system
* Fix issue in test_chassis_thermal: it now skips nonexistent thermals.

How to verify it
Run show platform temperature

Signed-off-by: liora <liora@nvidia.com>
2022-01-05 10:25:47 -08:00
Lior Avramov
1fce3ebda3
[Mellanox] Add support for SN2201 platform (#9333)
- Why I did it
Add support for SN2201 platform

- How I did it
Add required content for SN2201 platform
Note: kernel driver support for this system is still missing. It will be updated once everything is upstream.

- How to verify it
Install and basic sanity tests including traffic.

Signed-off-by: liora <liora@nvidia.com>
2021-12-06 14:47:50 +02:00
Junchao-Mellanox
e8b4c2a1f4
[Mellanox] Refactor Mellanox platform API to support dynamic port configuration (#8422)
- Why I did it
* To support systems with dynamic port configuration
* Apply lazy initialization to speed up loading of the platform API

- How I did it
* Add module.py to implement dynamic port configuration (aka line card model)
* Adjust chassis.py, platform.py, thermal.py, sfp.py to support dynamic port configuration
* Optimize existing code
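
Lazy initialization in this sense can be sketched with a toy class. `LazyChassis`, `_initialize_thermals`, and the `init_count` counter are hypothetical and exist only for illustration; the real chassis.py is considerably more involved.

```python
class LazyChassis:
    """Toy chassis that builds its thermal list on first access only."""

    def __init__(self):
        self._thermal_list = None   # nothing is built at load time
        self.init_count = 0         # illustration only: counts real initializations

    def _initialize_thermals(self):
        # Stand-in for the expensive work (for example, probing sensor files).
        self.init_count += 1
        self._thermal_list = ["thermal-%d" % i for i in range(3)]

    def get_all_thermals(self):
        # Build the list the first time it is requested, then reuse it.
        if self._thermal_list is None:
            self._initialize_thermals()
        return self._thermal_list
```

Creating the object stays cheap, and callers that never ask for thermals never pay for building them.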

- How to verify it
Platform regression on MSN4700, MSN3800 and MSN2700, 100% pass
Unit test covers all new changes.
2021-10-25 07:59:06 +03:00
Dror Prital
5356244e53
[Mellanox] Add NVIDIA Copyright header to "mellanox" files (#8799)
- Why I did it
Add NVIDIA Copyright header to "mellanox" files

- How I did it
Add NVIDIA Copyright header as a comment for Mellanox files

- How to verify it
Sanity tests and PR checkers.
2021-10-17 19:03:02 +03:00
Junchao-Mellanox
552963ab0e
[Mellanox] Change thermal recover threshold from temp_trip_norm to temp_trip_high (#8792)
- Why I did it
Change thermal recover threshold from temp_trip_norm to temp_trip_high, so that thermal algorithm would set fan speed to minimum allowed earlier and save power.

- How I did it
Change thermal recover threshold from temp_trip_norm to temp_trip_high

- How to verify it
Manual test
2021-10-04 20:20:33 +03:00
Stephen Sun
80d01f2f9a
[Mellanox] Enhance Python3 support for platform API (#7410)
- Why I did it
Enhance Python 3 support for the platform API. Originally, some platform APIs called SDK APIs that did not support Python 3. Now that SDK 4.4.3XXX supports Python 3 APIs, the platform API fully supports Python 3.

- How I did it
Start all platform daemons from python3
1. Remove #!/usr/bin/env python from the beginning of each platform API file, as the platform API is not started as a daemon but imported by other daemons.
2. Adjust SDK API calls accordingly

- How to verify it
Manual test and platform regression test

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-06-15 17:57:48 +03:00
Joe LeVeque
516ff8bfff
[Mellanox] Ensure concrete platform API classes call base class initializer (#6854)
In preparation for the merging of Azure/sonic-platform-common#173, which properly defines class and instance members in the Platform API base classes.

It is proper object-oriented methodology to call the base class initializer, even if it is only the default initializer. This also future-proofs the potential addition of custom initializers in the base classes down the road.
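
A minimal sketch of the pattern. The two classes below are stand-ins written for this example; they are not the real sonic-platform-common base class or the Mellanox implementation, and the `history` member is hypothetical.

```python
class ExampleThermalBase:
    """Stand-in for a platform API base class."""

    def __init__(self):
        # Instance member defined by the base initializer.
        self.history = []


class ExampleThermal(ExampleThermalBase):
    """Stand-in for a concrete platform class."""

    def __init__(self, name):
        # Run the base initializer first so that base members exist.
        super().__init__()
        self.name = name
```

If the `super().__init__()` call were omitted, `ExampleThermal("asic").history` would raise `AttributeError` as soon as the base class started defining members in its initializer.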
2021-02-25 11:06:22 -08:00
DavidZagury
5aee92e56d
[Mellanox] Add support for SN4600 system (#6879)
- Why I did it
Add support for new 64x200G SN4600 systems

- How I did it
Add all relevant files (w/o platform.json and hwsku.json as they will come later) with default SKU.

- How to verify it
Install image on switch, verify all ports are up and configured properly, run full platform SONiC tests.
2021-02-25 09:30:43 +02:00
Junchao-Mellanox
6348248138
[Mellanox] Add high threshold and high critical threshold support for gearbox (#6206)
- Why I did it

Add high threshold and high critical threshold support for gearbox

- How I did it

Read gearbox thermal related threshold from sysfs
2020-12-15 16:51:43 -08:00
Vadym Hlushko
503873056e
[Mellanox] SN4410 support (#5778)
Add support for Mellanox Spectrum-3 based 100GbE/400GbE 1U. 24 QSFP-DD28 and 8 QSFP-DD ports

Signed-off-by: Vadym Hlushko <vadymh@nvidia.com>
2020-11-24 10:43:48 -08:00
Junchao-Mellanox
b595a6eadf
[Mellanox] Implement new platform API for SONiC physical entity mib extension (#5645)
To support the SONiC physical entity MIB extension, a few new platform APIs were added to sonic-platform-common. This PR provides a Mellanox platform implementation of those new APIs.
2020-11-16 18:56:03 -08:00
Joe LeVeque
3b89e5d467
[Python] Migrate applications/scripts to import sonic-py-common package (#5043)
As part of consolidating all common Python-based functionality into the new sonic-py-common package, this pull request:
1. Redirects all Python applications/scripts in sonic-buildimage repo which previously imported sonic_device_util or sonic_daemon_base to instead import sonic-py-common, which was added in https://github.com/Azure/sonic-buildimage/pull/5003
2. Replaces all calls to `sonic_device_util.get_platform_info()` with calls to `sonic_py_common.get_platform()` and removes any calls to `sonic_device_util.get_machine_info()` which are no longer necessary (i.e., those which were only used to pass the results to `sonic_device_util.get_platform_info()`).
3. Removes unused imports of the now-deprecated sonic-daemon-base package and sonic_device_util.py module

This is the next step toward resolving https://github.com/Azure/sonic-buildimage/issues/4999

Also reverted my previous change in which device_info.get_platform() would first try obtaining the platform ID string from Config DB and fall back to gathering it from machine.conf upon failure. This function is called by sonic-cfggen before the data is in the DB, in which case the db_connect() call hangs indefinitely, which was not the behavior I expected. As of now, the function will always reference machine.conf.
2020-08-03 11:43:12 -07:00
Junchao-Mellanox
ce391645f2
[Mellanox] add ASIC temperature support to platform API (#4828)
- Why I did it

The system health feature needs to read the ASIC temperature and threshold from the platform API.

- How I did it

Implement Chassis.get_asic_temperature and Chassis.get_asic_temperature_threshold by reading the values from sysfs.
2020-06-28 17:54:28 -07:00
madhanmellanox
2c830f4074
Modified SKU based utils to Platform based utils (#4786)
Co-authored-by: Madhan Babu <madhan@arc-build-server.mtr.labs.mlnx>
2020-06-21 12:15:23 -07:00
Junchao-Mellanox
f277d13cd6
[Mellanox] Adjust log level to avoid too many thermal logs (#4631)
* Trigger thermal action log only if thermal condition changes
* Test file existence before reading file content
* Fix error when setting PSU fan speed
* Remove logs because they print too frequently
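
Logging only when a condition changes can be sketched as below. `ConditionLogger` and its injectable `log` callable are hypothetical names for this example and do not appear in the commit.

```python
class ConditionLogger:
    """Emit a log line only when the observed condition changes."""

    def __init__(self, log=print):
        self._last = None   # no state seen yet
        self._log = log     # any callable that takes one string

    def update(self, condition_met):
        # Stay quiet while the state repeats; log once per transition.
        if condition_met != self._last:
            self._log("thermal condition changed: %s" % condition_met)
            self._last = condition_met
```

Feeding it the sequence True, True, False, False, True produces three log lines instead of five, one per transition.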
2020-05-26 10:45:25 -07:00
Junchao-Mellanox
5e6c20481d
[Mellanox] Enhancement for fan led management (#4437)
2020-05-13 10:01:32 -07:00
shlomibitton
b6291372d9
[Mellanox] Add a new Mellanox platform x86_64-mlnx_msn4600c and new SKU ACS-MSN4600C (#4483)
* New SKU support for MSN4600C

Signed-off-by: Shlomi Bitton <shlomibi@mellanox.com>
2020-04-30 00:30:11 -07:00
Junchao-Mellanox
b26814f643
[Mellanox] Adjust dynamic minimum fan speed algorithm (#4476)
* remove air flow direction from dynamic minimum algorithm
* adjust minimum table according to thermal data
2020-04-27 20:52:57 -07:00
shlomibitton
ac6cfb115f
[Mellanox] Add a new Mellanox platform x86_64-mlnx_msn3420 and new SKU ACS-MSN3420 (#4436)
* New SKU support for MSN3420

Signed-off-by: Shlomi Bitton <shlomibi@mellanox.com>

Conflicts:
	device/mellanox/x86_64-mlnx_msn2700-r0/plugins/sfputil.py

* Add CPLD's

* Symlink fixes and semantics

* Adding new platform at end of lines
2020-04-26 14:39:55 +03:00
Junchao-Mellanox
c730f3e207
[Mellanox] thermal control enhancement for dynamic minimum fan speed and PSU fan speed policy (#4403)
2020-04-21 08:09:53 -07:00
Junchao-Mellanox
80bf061b37
[Mellanox] Fix thermal control bugs (#4298)
* [thermal control] Fix pmon docker stop issue on 3800
* [thermal fix] Fix QA test issue
* [thermal fix] change psu._get_power_available_status to psu.get_power_available_status
* [thermal fix] adjust log for PSU absence and power absence
* [thermal fix] add unit test for loading thermal policy file with duplicate conditions in different policies
* [thermal] fix fan.get_presence for non-removable SKU
* [thermal fix] fix issue: fan direction is based on drawer
* Fix issue: when fan is not present, should not read fan direction from sysfs but directly return N/A
* [thermal fix] add unit test for get_direction for absent FAN
* A non-removable PSU has no fan, so no fan object needs to be added for it
* Update submodules

Co-authored-by: Stephen Sun <5379172+stephenxs@users.noreply.github.com>
2020-03-25 10:54:07 -07:00
Kebo Liu
f4ed88297d
[Mellanox] Add a new Mellanox platform x86_64-mlnx_msn4700 and new SKU ACS-MSN4700 (#3901)
* add MSN4700 device files

* update ACS-MSN4700 sai profile

* update buffer pool size, headroom, sensor conf, port config and reboot scripts

* fix indent

* update sensor conf and buffer pool

* [sn4700] add sku 4700 to chassis.py

* [Mellanox-4700] Add 4700 info to psu and thermal platform API

* update buffer config file template to the latest.
update SAI profile to use 100G X 4lanes for now
update port_config.ini according to the SAI profile

* [Mellanox]Update the buffer configurations for 4700

* fix alignment in pg_profile_lookup.ini

* add platform components file for new sku

* Update device/mellanox/x86_64-mlnx_msn4700-r0/ACS-MSN4700/pg_profile_lookup.ini

Co-Authored-By: Nazarii Hnydyn <nazariig@mellanox.com>

* remove redundant line

* [Mellanox]Correct type, buffer size

Co-authored-by: Nazarii Hnydyn <nazariig@mellanox.com>
Co-authored-by: junchao <junchao@mellanox.com>
Co-authored-by: Stephen Sun <stephens@mellanox.com>
2020-03-24 14:32:52 +02:00
Junchao-Mellanox
be549db395
Add thermal control support for SONiC (#3949)
2020-03-09 10:41:10 -07:00
Nazarii Hnydyn
fc101b6ceb
[mellanox]: Add new Mellanox-SN3800-D112C8 sku. (#4085)
Signed-off-by: Nazarii Hnydyn <nazariig@mellanox.com>
2020-01-30 18:54:09 -08:00
Stephen Sun
1886bdf7ad
[Mellanox] fix gearbox ambient thermal name (#4005)
2020-01-17 14:05:35 -08:00
Stephen Sun
aea09ba1da
[sonic_platform] Correct the wrong log identifiers (#3596)
2019-10-15 11:29:45 -07:00
Stephen Sun
97b43f96bb
[mlnx_platform_api.thermal] align thermal sensor names with hw-management v2.0.0191 (#3371)
temp_xxxx_module{} => module{}_temp_xxxx
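
The renaming pattern above can be expressed as a small helper. This is a sketch only: `to_new_name` and the regular expression are hypothetical, and the sample names are illustrative rather than an authoritative list of sensor files.

```python
import re

# Matches the old convention: temp_<kind>_module<index>
_OLD_NAME = re.compile(r"^temp_(?P<kind>\w+?)_module(?P<index>\d+)$")


def to_new_name(old):
    """Convert temp_xxxx_moduleN to moduleN_temp_xxxx; return other names unchanged."""
    match = _OLD_NAME.match(old)
    if match is None:
        return old
    return "module{}_temp_{}".format(match.group("index"), match.group("kind"))
```

Under this sketch, a name such as `temp_input_module1` becomes `module1_temp_input`, while names that do not follow the old convention pass through untouched.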
2019-08-23 11:58:03 -07:00
Stephen Sun
1d15022df7
[Mellanox] support new platform api, thermal and psu part (#3175)
* support new platform api, thermal and psu part
for psu, all APIs are supported.
for thermal, we support
  get_temperature,
  get_high_threshold
for the thermal sensors of cpu core, cpu pack, psu and sfp module
and get_temperature for the ambient thermal sensors around the asic, port, fan, comex and board.

* 1. address review comments
2. improve the handling of PSU insertion/removal
3. tolerate diverse PSU thermal sensor file name conventions

* 1. adjust thermal code according to the latest version of hw-management
2. check power_good_status, rather than whether the file exists, before reading the voltage, current and power of the PSU
2019-07-22 07:59:48 -07:00